# The system

A typical Machine Learning/Deep Learning (ML/DL) project has two main components: (1) data and (2) a model.

Now, take a look at the `Explorer` tab on the left side of `VS Code`.

The directories look like this:

```
./
  |- .image/
  |- .venv/      <--- Python Libraries
  |- dataset/    <--- Where we keep dataset
  |- models/     <--- Where we store model
  |- workshop/
  |- .Dockerfile
  |- docker-compose.yml
  |- Pipfile     <--- Where the list of needed Python libraries is
  |- Pipfile.lock  <- Version control of Python libraries
  |- README.md
```

In this workshop, we will go through the basic concepts of ML/DL.

*Note: We assume that you use `GitHub Codespaces`.*

## 1. Upload some data

Download this [workshop.zip](https://drive.google.com/file/d/10CZ6VRNnX006BxWRYlyNk8BRYmvWaXpK/view?usp=share_link) from `Google Drive`.
Then, upload the zip file to `Codespace`.

Extract the zip file into the `dataset` folder.
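
If you prefer to extract the archive from Python instead of the file explorer, a minimal sketch could look like the one below. The helper name `extract_dataset` and the location of the uploaded zip file are assumptions for illustration; adjust the paths to wherever you uploaded the file.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path: str, target_dir: str = "dataset") -> list[str]:
    """Extract every entry of `zip_path` into `target_dir` and return the entry names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Example call (assumes workshop.zip sits next to the `dataset` folder):
# extract_dataset("workshop.zip", "dataset")
```

The returned list of entry names is a quick way to check that the archive really contains the folders you expect.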

This is what you should have in the `dataset` folder.

```txt
./dataset
  |- workshop/
    |- OM/
        |- iphone 11/
            |- Indoor/
                |- 1/
                |- 2/
                ...
            |- Outdoor/
                |- 1/
                |- 2/
                ...
        |- samsung S21/
            |- Indoor/
                |- 1/
                |- 2/
                ...
            |- Outdoor/
                |- 1/
                |- 2/
                ...
        |- meta.csv
```

## 2. Understand the data a bit

Here is a sample image, `OM/iphone 11/Indoor/1/IMG_7455.JPG`:

<img src="../.image/sample.JPG" width="200"/>

And these are the first 5 rows of the `meta.csv` file.

```
id,value
1,1.96
2,1.72
3,2.78
4,2.62
5,2.49
```

The dataset is structured as `dataset/<dataset_name>/<element>/<device>/<environment>/<id>/<image_name>`.

The sample image above sits in the folder with `id` 1, and `meta.csv` maps `id` 1 to `1.96`. Thus, the `OM` level of this image is `1.96`.
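
The lookup described above can be sketched in plain Python. The rows of `meta.csv` are inlined from the listing above so the snippet runs on its own; the helper names `load_targets` and `sample_id` are made up for this illustration and are not part of the workshop code.

```python
import csv
import io
from pathlib import PurePosixPath

# First rows of meta.csv as shown above (inlined so the sketch is self-contained).
META_CSV = """id,value
1,1.96
2,1.72
3,2.78
4,2.62
5,2.49
"""


def load_targets(csv_text: str) -> dict[int, float]:
    """Map each sample id to its measured value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["id"]): float(row["value"]) for row in reader}


def sample_id(image_path: str) -> int:
    """The id is the name of the folder that directly contains the image."""
    return int(PurePosixPath(image_path).parent.name)


targets = load_targets(META_CSV)
path = "dataset/workshop/OM/iphone 11/Indoor/1/IMG_7455.JPG"
print(sample_id(path), targets[sample_id(path)])  # prints: 1 1.96
```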

## 3. Create Python Dataset Class

The first piece of code we need to write is the `Dataset` class.
It represents the data in the world of `Python`.

In [1]:
import torch
from torch.utils.data import Dataset
from glob import glob
import os
import pandas as pd
from torchvision import io
from enum import Enum

import torchvision.transforms as transforms
class Devices(Enum):
    iphone_11 = 'iphone 11'
    samsung_s21 = 'samsung S21'
    all = '*'

class Environments(Enum):
    indoor = 'Indoor'
    outdoor = 'Outdoor'
    all = '*'

class Elements(Enum):
    om = 'OM'
    p = 'P'
    k = 'K'

class Preprocessing(Enum):
    training = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize(350),
        transforms.CenterCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    inferencing  = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize(350),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])


class SoilDataset_bigset(Dataset):
    def __init__(self, dataset_name:str, element:Elements, device:Devices, 
                 environment:Environments, 
                 preprocessing:Preprocessing, 
                 clip_target:bool=False, 
                 normalize_target:bool=False):
        self.element:Elements = element
        self.clip_target:bool = clip_target
        self.normalize_target:bool = normalize_target
        
        # Set max value for clipping and normalizing
        self.max_value:float = 0
        if(self.element == Elements.om):
            self.max_value = 8.0
        elif(self.element == Elements.p):
            self.max_value = 1000.0
        elif(self.element == Elements.k):
            self.max_value = 1500.0

        # BasePath of the dataset
        base_path:str = "../dataset"
        dataset_path:str = os.path.join(base_path, dataset_name)
        if not os.path.exists(dataset_path):
            raise ValueError(f"Path={dataset_path} does not exist. Execution path={os.getcwd()}")

        # Inside this path there must be a list of folders arranged by mobile phone. Use the device enum.
        # Inside each phone folder are 2 folders that indicate the environment the image was taken in. Use the environment enum.
        image_folder = os.path.join(dataset_path, element.value, device.value, environment.value)
        self.imgs = glob(os.path.join(image_folder,'*/*'))
        print(f"Found {len(self.imgs)} images in {image_folder}.")

        # Load the csv file used to look up the target value
        target_path:str = os.path.join(dataset_path,element.value,'meta.csv')
        self.target_df = pd.read_csv(target_path, index_col='id')

        self.signature = os.path.join(element.value,device.value,environment.value)
        self.preprocessing = preprocessing.value

    def get_target(self, img_path:str) -> float:
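        # A path like ../dataset/<dataset_name>/<element>/<device>/<environment>/<id>/<image_name>
        # splits into 8 components; the <id> folder is at index 6.
        assert len(img_path.split('/')) == 8, f"Expect img_path to have 8 components but got {img_path=}"
        target_id = int(img_path.split('/')[6])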
        target_value = float(self.target_df.loc[target_id].iloc[0]) # type:ignore
        if(self.clip_target and (target_value > self.max_value)):
            target_value = self.max_value
        if(self.normalize_target):
            target_value = target_value / (self.max_value)
        return float(target_value)
        
    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        img_path = self.imgs[idx]
        y = self.get_target(img_path=img_path)
        y = torch.tensor(y)
        X = io.read_image(img_path)
        if self.preprocessing:
            X = self.preprocessing(X)
        return X.float(), y.float(), img_path

In [2]:
dataset = SoilDataset_bigset(dataset_name="workshop", 
                             element=Elements.om, 
                             device=Devices.all, 
                             environment=Environments.outdoor, 
                             preprocessing=Preprocessing.training, 
                             clip_target=True, normalize_target=True)

Found 200 images in ../dataset/workshop/OM/*/Outdoor.


In [3]:
# Try loading the dataset with only `samsung S21` images

dataset = SoilDataset_bigset(dataset_name="workshop", 
                             element=Elements.om, 
                             device=Devices.samsung_s21, 
                             environment=Environments.all, 
                             preprocessing=Preprocessing.training, 
                             clip_target=True, normalize_target=True)

Found 200 images in ../dataset/workshop/OM/samsung S21/*.
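
The `*` in the printed path is not a real folder name. `Devices.all` and `Environments.all` both have the value `'*'`, and `glob` treats it as a wildcard that matches every folder at that level. The sketch below shows the effect on a tiny throwaway folder tree; the tree, the placeholder file name `img.JPG`, and the helper names are assumptions made for this illustration.

```python
import os
import tempfile
from glob import glob


def make_tree(root, devices=("iphone 11", "samsung S21"),
              envs=("Indoor", "Outdoor"), ids=("1", "2")):
    """Create <root>/OM/<device>/<environment>/<id>/img.JPG with empty files."""
    for device in devices:
        for env in envs:
            for sample in ids:
                folder = os.path.join(root, "OM", device, env, sample)
                os.makedirs(folder)
                open(os.path.join(folder, "img.JPG"), "w").close()


def find_images(root, device, environment):
    """Same pattern as the Dataset class: <image_folder>/*/* means <id>/<image>."""
    image_folder = os.path.join(root, "OM", device, environment)
    return sorted(glob(os.path.join(image_folder, "*/*")))


with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    all_outdoor = find_images(root, "*", "Outdoor")      # every device, Outdoor only
    samsung_all = find_images(root, "samsung S21", "*")  # one device, every environment
    print(len(all_outdoor), len(samsung_all))  # prints: 4 4
```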


In [4]:
# Try loading the dataset with only `Outdoor` images from `all` devices
dataset = SoilDataset_bigset(dataset_name="workshop",
                            element=Elements.om,
                            device=Devices.all,
                            environment=Environments.outdoor,
                            preprocessing=Preprocessing.training,
                            clip_target=True, normalize_target=True)

Found 200 images in ../dataset/workshop/OM/*/Outdoor.
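
All the cells above pass `clip_target=True, normalize_target=True`. As a rough sketch of what `get_target` does with those flags, the function below repeats the same arithmetic in plain Python. The name `prepare_target` is made up for this illustration, `8.0` is the `max_value` the class sets for `OM`, and `9.5` is a made-up value chosen only to show clipping.

```python
def prepare_target(value: float, max_value: float,
                   clip: bool = True, normalize: bool = True) -> float:
    """Clip the target at max_value, then scale it by max_value."""
    if clip and value > max_value:
        value = max_value
    if normalize:
        value = value / max_value
    return value


print(prepare_target(1.96, 8.0))  # prints: 0.245
print(prepare_target(9.5, 8.0))   # prints: 1.0 (clipped to 8.0 first)
```

With both flags on, every target ends up in the range 0 to 1, which keeps the loss values small during training.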


### Terminology

![Alt text](https://useruploads.socratic.org/vcKPankTBCVld2hWf2dw_7182277_orig.jpg)

$y = ax + b$. Here $y$ is the **dependent variable**, $x$ is the **independent variable**, $a$ is the **effect size**, and $b$ is the **bias**.

These terms also go by different names, depending on the researcher's field.

X:
- Inputs
- Features
- Feature Vector
- Independent/Explanatory/Exogenous variables
- Predictor variables

Y:
- Outputs
- Labels (known outcomes)
- Dependent/Explained/Predicted variable
- Outcome
- Target

A:
- Slope
- Effect

B:
- Bias
- Intercept

Together, $(a, b)$ are called the *weights*.

In mathematics, we can also abstract the equation into a `function of` $x$: $y = f(x)$ or $f: X \to y$.

In machine learning, we use the term `hypothesis` instead of function, so you may see $y = h(x)$ or $h: X \to y$.

Then, to differentiate the actual target from the predicted target, we use $\hat{y}$ to indicate the prediction.

Finally, by convention, lowercase $x$ denotes a vector and uppercase $X$ denotes a matrix.
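
To make the notation concrete, here is a tiny sketch of a hypothesis $h(x) = ax + b$ in Python. The weights and the data points are made-up numbers chosen only for illustration.

```python
def h(x: float, a: float, b: float) -> float:
    """Hypothesis: predict y_hat from x using the weights (a, b)."""
    return a * x + b


a, b = 2.0, 1.0        # weights (made-up values)
xs = [0.0, 1.0, 2.0]   # inputs / features
ys = [1.0, 3.5, 4.5]   # actual targets (made-up values)

y_hats = [h(x, a, b) for x in xs]  # predicted targets
print(y_hats)  # prints: [1.0, 3.0, 5.0]
```

Note that `y_hats` and `ys` are not identical. The gap between them is what training tries to shrink.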


## 4. Model

> A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. When referring specifically to probabilities, the corresponding term is probabilistic model.
>
> [ref](https://en.wikipedia.org/wiki/Statistical_model)

In short, a model is an equation. Modelling is the activity of finding the model whose $\hat{y}$ is very close to $y$.

Here in the Soil project, we want to read an image and predict the level of an element (such as `OM`). Our images are $X$ and the level is $y$.

The common family of models that we use for image samples is the *Convolutional Neural Network (CNN)*.

![Alt text](../.image/CNN.webp)

For now, we see this model as an $h()$ function that takes $X$ and gives you $\hat{y}$.

Let's load the model named `AlexNet`.

In [5]:
from torchvision import models
from torchvision.models import AlexNet_Weights

model = models.alexnet(weights=AlexNet_Weights.DEFAULT, progress=True)
model


AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

### Problems

There are two main kinds of problems in ML/DL: (1) regression problems and (2) classification problems.

They are easy to tell apart. When $y$ is a continuous value, it is regression. When $y$ is a discrete value, it is classification.

Our $y$ in the Soil project is continuous; therefore, our problem is a regression problem.

For regression, the answer from $h()$ can be any real number in $(-\infty, \infty)$.

We need one small modification to `AlexNet` so that it answers with a single continuous value: replace its last classifier layer with one that has a single output.
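
To see why the number of outputs in the last layer matters, here is a plain-Python sketch of a linear layer. It is not how PyTorch implements `torch.nn.Linear` internally, and all weights and features are made-up numbers. With several outputs we pick the index of the highest score (a discrete answer); with one output we read the value directly (a continuous answer).

```python
def linear_layer(features, weights, biases):
    """One output per row of `weights`: a weighted sum of the features plus a bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]


features = [0.5, -1.0, 2.0]

# Classification head: 3 outputs, the answer is the index of the highest score.
scores = linear_layer(
    features,
    weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    biases=[0.0, 0.0, 0.0],
)
predicted_class = scores.index(max(scores))

# Regression head: 1 output, the answer is the value itself.
y_hat = linear_layer(features, weights=[[1.0, 2.0, 3.0]], biases=[0.0])[0]

print(predicted_class, y_hat)  # prints: 2 4.5
```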


In [6]:
model.classifier[6] = torch.nn.Linear(in_features=4096, out_features=1, bias=True)
model.eval()


AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, ou

## 5. Using the model

Let's just see the model in action.

In [9]:
x, y, image_path = dataset[99]

# y_hat = h(x)
y_hat = model(x.reshape((1,3,224,224)))

print(f"{y=}, {y_hat=}")

y=tensor(0.3713), y_hat=tensor([[-0.0566]], grad_fn=<AddmmBackward0>)


## 6. Training

As you can see, using the model is pretty simple and a bit boring, to be frank.

The prediction is also nowhere close to the answer.

So, we now proceed to training.
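
Before training, it helps to put a number on "nowhere close". The training code that follows uses the mean squared error (`torch.nn.MSELoss`). The sketch below computes the same quantity in plain Python for the single prediction printed in section 5, using the rounded values from that output; the helper names are made up for this illustration.

```python
def squared_error(y: float, y_hat: float) -> float:
    """Squared distance between the actual and the predicted target."""
    return (y_hat - y) ** 2


def mse(ys: list[float], y_hats: list[float]) -> float:
    """Mean of the squared errors over a batch of samples."""
    return sum(squared_error(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)


y, y_hat = 0.3713, -0.0566  # rounded values from the output in section 5
print(round(squared_error(y, y_hat), 4))  # prints: 0.1831
```

Training tries to push this number toward zero, averaged over all samples.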


In [10]:
from torch.utils.data import DataLoader, random_split

train_dataset, test_dataset = random_split(dataset=dataset, lengths=[0.8,0.2], generator=torch.Generator().manual_seed(42))
train_dataset.dataset.preprocessing = Preprocessing.training.value # type:ignore

train_loader = DataLoader(dataset=train_dataset, batch_size=4, shuffle=True,  num_workers=2)
test_loader  = DataLoader(dataset=test_dataset,  batch_size=4, shuffle=False, num_workers=2)

In [11]:
import math
def train_test(model:torch.nn.Module, train_loader:DataLoader, test_loader:DataLoader, epochs:int, lr:float, DEVICE:torch.device):
    J_fn = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(DEVICE)

    import time
    # Lowest training loss seen so far; used to decide when to save the model
    best_loss = math.inf

    for e in range(epochs):
        start_time = time.time()
        model.train()
        train_mse = 0
        for b, (image, label, _) in enumerate(train_loader):
            # print(f'start:{b}')
            #image: (B, C, H, W)
            #label: (B)
            image = image.to(DEVICE)
            label = label.to(DEVICE)
            
            yhat = model(image) #1. model
            train_loss = J_fn(yhat.reshape(-1), label.reshape(-1)) #2. loss
            #2.1 collect the loss and acc

            optimizer.zero_grad() #3. zero_grad
            train_loss.backward() #4. backward
            optimizer.step() #5. step
            
            train_mse += train_loss.detach().cpu()

        total_time = time.time() - start_time
        print(f"TRAIN|{e=} {total_time=} {train_mse}")
        
        if( train_mse <= best_loss ):
            print('save model!!')
            torch.save(model.state_dict(), "../models/alex.pth")
            best_loss = train_mse
        # Testing (no gradients are needed, so disable autograd tracking)
        start_time = time.time()
        model.eval()
        test_mse = 0
        with torch.no_grad():
            for b, (image, label, _) in enumerate(test_loader):
                image = image.to(DEVICE)
                label = label.to(DEVICE)
                yhat = model(image) #1. model
                test_loss = J_fn(yhat.reshape(-1), label.reshape(-1)) #2. loss
                test_mse += test_loss.detach().cpu()
        total_time = time.time() - start_time
        print(f"TEST |{e=} {total_time=} {test_mse}")
        
                
    return model
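
The numbered steps in the training loop (model, loss, zero_grad, backward, step) can be hard to picture when PyTorch hides the math. The sketch below runs the same cycle by hand on the toy model $\hat{y} = ax + b$. It uses plain gradient descent rather than the `Adam` optimizer from the cell above, and the data, learning rate, and epoch count are made-up values for illustration.

```python
def train(xs, ys, epochs=1000, lr=0.05):
    """Fit y_hat = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    loss = float("inf")
    for _ in range(epochs):
        # 1. model: compute the predictions
        y_hats = [a * x + b for x in xs]
        # 2. loss: mean squared error
        loss = sum((yh - y) ** 2 for yh, y in zip(y_hats, ys)) / n
        # 3. + 4. gradients of the loss w.r.t. a and b; they are recomputed
        # from scratch each time, so there is nothing to zero out here
        grad_a = sum(2 * (yh - y) * x for yh, y, x in zip(y_hats, ys, xs)) / n
        grad_b = sum(2 * (yh - y) for yh, y in zip(y_hats, ys)) / n
        # 5. step: move the weights against the gradient
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, loss


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
a, b, loss = train(xs, ys)
print(round(a, 3), round(b, 3))
```

The learned weights should end up very close to the `a = 2` and `b = 1` that generated the data.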

In [12]:
epochs = 10
# learning_rate
lr = 0.001
DEVICE = torch.device('cpu')
train_test(model=model, train_loader=train_loader, test_loader=test_loader, epochs=epochs, lr=lr, DEVICE=DEVICE)

TRAIN|e=0 total_time=38.60555672645569 142.1394500732422
save model!!
TEST |e=0 total_time=4.058891534805298 0.2912577986717224
TRAIN|e=1 total_time=41.52930736541748 0.8152303099632263
save model!!
TEST |e=1 total_time=4.186108112335205 0.2021433711051941
TRAIN|e=2 total_time=41.038854360580444 0.7103885412216187
save model!!
TEST |e=2 total_time=4.221719026565552 0.3206513226032257
TRAIN|e=3 total_time=40.8500235080719 0.8834637403488159
TEST |e=3 total_time=3.835456371307373 0.10868373513221741
TRAIN|e=4 total_time=39.56671380996704 0.6896519660949707
save model!!
TEST |e=4 total_time=4.262863874435425 0.26721158623695374
TRAIN|e=5 total_time=41.29210567474365 0.6864010691642761
save model!!
TEST |e=5 total_time=4.066243886947632 0.1045185849070549
TRAIN|e=6 total_time=39.46724820137024 0.6413669586181641
save model!!
TEST |e=6 total_time=4.070074796676636 0.14533808827400208
TRAIN|e=7 total_time=39.22891902923584 0.6574790477752686
TEST |e=7 total_time=3.857133388519287 0.132057234

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

## 7. Test the model

In [13]:
x, y, image_path = dataset[0]

y_hat = model(x.reshape((1,3,224,224)))

print(f"{y=}, {y_hat=}")

y=tensor(0.2450), y_hat=tensor([[0.2450]], grad_fn=<AddmmBackward0>)
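
Because the dataset was created with `normalize_target=True`, both `y` and `y_hat` are on a 0 to 1 scale. To read a value in the original unit, multiply it by the same `max_value` used for normalizing (`8.0` for `OM`). A minimal sketch, with a helper name made up for this illustration:

```python
def denormalize(y_normalized: float, max_value: float) -> float:
    """Undo the division by max_value that was applied to the target."""
    return y_normalized * max_value


print(denormalize(0.2450, 8.0))  # prints: 1.96
```

Note that clipping cannot be undone this way: a target above `max_value` that was clipped stays at `max_value`.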
