<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# EDA 

In this notebook, we will download a sample of the [BDD100K](https://www.bdd100k.com/) semantic segmentation dataset and use W&B Artifacts and Tables to version and analyze our data. 

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!mkdir -p /content/gdrive/MyDrive/modules

In [None]:
import sys
sys.path.append('/content/gdrive/MyDrive')

In [None]:
!pip install wandb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.13.10-py3-none-any.whl (2.0 MB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.15.0-py2.py3-none-any.whl (181 kB)
Collecting setproctitle
  Downloading setproctitle-1.3.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31 kB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.30-py3-none-any.whl (184 kB)

In [None]:
DEBUG = False # set this flag to True to use a small subset of data for testing

In [None]:
from fastai.vision.all import *
from modules import params
import wandb

We have defined some global configuration parameters in the `params.py` file. `ENTITY` should be your W&B team name if you work in a team; set it to `None` if you work individually.
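For orientation, a `params.py` along these lines would satisfy the names used in this notebook. This is a hedged sketch, not the actual file: the artifact names match the W&B log lines shown later in this document and the class names follow the per-class metrics used during training, while the project name and the order of the class ids are assumptions.

```python
# params.py (illustrative sketch; values marked "assumed" do not come from this notebook)
WANDB_PROJECT = "mlops-course-001"         # assumed placeholder project name
ENTITY = None                              # your W&B team name, or None when working individually
RAW_DATA_AT = "bdd_simple_1k"              # name seen in the artifact download path
PROCESSED_DATA_AT = "bdd_simple_1k_split"  # name seen in the training log output

# Class id -> class name. The names follow the per-class IoU metrics;
# the id order is an assumption.
BDD_CLASSES = {i: c for i, c in enumerate(
    ["background", "road", "traffic light", "traffic sign",
     "person", "vehicle", "bicycle"]
)}

print(BDD_CLASSES)
```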

In the section below, we will use the `untar_data` function from `fastai` to download and unzip our dataset.

In [None]:
URL = 'https://storage.googleapis.com/wandb_course/bdd_simple_1k.zip'

In [None]:
path = Path(untar_data(URL, force_download=True))

In [None]:
path.ls()

(#3) [Path('/root/.fastai/data/bdd_simple_1k/images'),Path('/root/.fastai/data/bdd_simple_1k/LICENSE.txt'),Path('/root/.fastai/data/bdd_simple_1k/labels')]

Here we define several functions to help us process the data and upload it as a `Table` to W&B. 

In [None]:
def label_func(fname):
    "Return the mask path for an image: <root>/labels/<stem>_mask.png"
    return (fname.parent.parent/"labels")/f"{fname.stem}_mask.png"

def get_classes_per_image(mask_data, class_labels):
    "Map each class name to 1 if its id appears in the mask, else 0"
    unique = list(np.unique(mask_data))
    result_dict = {}
    for _class in class_labels.keys():
        result_dict[class_labels[_class]] = int(_class in unique)
    return result_dict

def _create_table(image_files, class_labels):
    "Create a table with the dataset"
    # one column per class, named after the class
    labels = [str(class_labels[_lab]) for _lab in list(class_labels)]
    table = wandb.Table(columns=["File_Name", "Images", "Split"] + labels)
    
    for i, image_file in progress_bar(enumerate(image_files), total=len(image_files)):
        image = Image.open(image_file)
        mask_data = np.array(Image.open(label_func(image_file)))
        class_in_image = get_classes_per_image(mask_data, class_labels)
        table.add_data(
            str(image_file.name),
            wandb.Image(
                    image,
                    masks={
                        "predictions": {
                            "mask_data": mask_data,
                            "class_labels": class_labels,
                        }
                    }
            ),
            "None", # we don't have a dataset split yet
            *[class_in_image[_lab] for _lab in labels]
        )
    
    return table
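To make the per-class columns concrete, the sketch below runs the same presence check on a tiny hand-made mask. The three-class mapping and the mask values are hypothetical and are not taken from `params.BDD_CLASSES`.

```python
import numpy as np

def get_classes_per_image(mask_data, class_labels):
    "Same logic as above: 1 if a class id occurs in the mask, else 0"
    unique = set(np.unique(mask_data).tolist())
    return {name: int(class_id in unique) for class_id, name in class_labels.items()}

# Hypothetical class mapping and a 2x3 mask that contains ids 0 and 1 only
toy_classes = {0: "background", 1: "road", 2: "bicycle"}
toy_mask = np.array([[0, 0, 1],
                     [0, 1, 1]], dtype=np.uint8)

presence = get_classes_per_image(toy_mask, toy_classes)
print(presence)
```

Each value becomes one cell in the corresponding class column of the table, which is what later allows filtering or stratifying by class.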

We will start a new W&B `run` and put everything into a raw Artifact.

In [None]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="upload")
raw_data_at = wandb.Artifact(params.RAW_DATA_AT, type="raw_data")

In [None]:
raw_data_at.add_file(path/'LICENSE.txt', name='LICENSE.txt')

ArtifactManifestEntry(path='LICENSE.txt', digest='X+6ZFkDOlnKesJCNt20yRg==', ref=None, birth_artifact_id=None, size=1594, extra={}, local_path='/root/.local/share/wandb/artifacts/staging/tmpdkoynbre')

Let's add the images and label masks.

In [None]:
raw_data_at.add_dir(path/'images', name='images')
raw_data_at.add_dir(path/'labels', name='labels')

[34m[1mwandb[0m: Adding directory to artifact (/root/.fastai/data/bdd_simple_1k/images)... Done. 0.9s
[34m[1mwandb[0m: Adding directory to artifact (/root/.fastai/data/bdd_simple_1k/labels)... Done. 0.5s


Let's get the file names of images in our dataset and use the function we defined above to create a W&B Table. 

In [None]:
image_files = get_image_files(path/"images", recurse=False)

# sample a subset if DEBUG
if DEBUG: image_files = image_files[:1]

In [None]:
table = _create_table(image_files, params.BDD_CLASSES)

In [None]:

(path/"images").ls()

(#1000) [Path('/root/.fastai/data/bdd_simple_1k/images/5ffe9db5-fd4e0001.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/4d31466d-21002ea9.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/6ee57024-3c9e350d.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/115e4aff-00000000.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/941f3bb8-1184941a.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/a5dd241e-e61c0e76.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/4ddaf49d-46e97f82.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/377f88d4-00000000.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/5d2d790d-3d1b8b5d.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/1bc82b26-7f99de31.jpg')...]

In [None]:
(path/"labels").ls()

(#1001) [Path('/root/.fastai/data/bdd_simple_1k/labels/8c976e04-47482559_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/6a6e903e-71e70666_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/5b4d4333-d74a0d2a_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/7f1d11ea-fdf50001_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/58a6d4a6-a959f89a_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/23ae9ed0-8b5e7a2b_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/197c8ab8-da380000_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/90c4d040-61fa675b_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/1f7d6452-22ae9b98_mask.png'),Path('/root/.fastai/data/bdd_simple_1k/labels/3ece6409-7245f7ab_mask.png')...]
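The listings above report 1000 entries under `images` and 1001 under `labels`, so the two folders do not line up one-to-one. A small check like the following could show which label entries have no matching image and which images lack a mask. It is a sketch that runs on a temporary stand-in directory with hypothetical file names; `compare_images_and_masks` is an illustrative helper, not part of the course code.

```python
from pathlib import Path
import tempfile

def mask_path(image_path: Path) -> Path:
    # Same naming rule as label_func: <root>/labels/<stem>_mask.png
    return image_path.parent.parent / "labels" / f"{image_path.stem}_mask.png"

def compare_images_and_masks(root: Path):
    "Return (masks that are missing, label entries with no matching image)"
    images = sorted((root / "images").glob("*.jpg"))
    expected = {mask_path(p).name for p in images}
    present = {p.name for p in (root / "labels").iterdir()}
    return sorted(expected - present), sorted(present - expected)

# Tiny stand-in dataset so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "labels").mkdir()
    for stem in ["aaaa-0001", "bbbb-0002"]:
        (root / "images" / f"{stem}.jpg").touch()
        (root / "labels" / f"{stem}_mask.png").touch()
    (root / "labels" / "stray_file.txt").touch()  # entry with no matching image
    missing, extra = compare_images_and_masks(root)

print("missing masks:", missing)
print("extra label entries:", extra)
```

On the real dataset, `root` would be the `path` returned by `untar_data`.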

Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`. 

In [None]:
raw_data_at.add(table, "eda_table")

ValueError: ignored

In [None]:
run.log_artifact(raw_data_at)
run.finish()

### **Part 2**

# Data preparation

In this notebook we will prepare the data to later train our deep learning model. To do so, we will:

- start a new W&B run and use our raw data artifact
- split the data and save the splits into a new W&B Artifact
- join information about the split with our EDA Table

In [None]:
!pip install wandb

In [None]:
import os, warnings
import wandb

import pandas as pd
from fastai.vision.all import *
from sklearn.model_selection import StratifiedGroupKFold

from wandbmodules import params
warnings.filterwarnings('ignore')

In [None]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="data_split")

Let's use the artifact we previously saved to W&B (we're storing artifact names and other global parameters in `params`).

In [None]:
raw_data_at = run.use_artifact(f'{params.RAW_DATA_AT}:latest')
path = Path(raw_data_at.download())

In [None]:
path.ls()

To split the data between training, validation and test sets, we need file names, groups (derived from the file name) and a target (here we use our rare class, bicycle, for stratification). We previously saved these columns to the EDA table, so let's retrieve the target column from it now.

In [None]:
fnames = os.listdir(path/'images')
groups = [s.split('-')[0] for s in fnames]

In [None]:
orig_eda_table = raw_data_at.get("eda_table")

In [None]:
y = orig_eda_table.get_column('bicycle')
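The part of each file name before the dash is used as the group. Because `os.listdir` returns entries in arbitrary order, it may be worth confirming that `fnames` and the `bicycle` column refer to the same rows in the same order. The sketch below aligns the target to the file names by key. All lists here are hypothetical stand-ins for `os.listdir(path/'images')`, `orig_eda_table.get_column('File_Name')` and `orig_eda_table.get_column('bicycle')`.

```python
# Hypothetical directory listing; two files share the same prefix
fnames = ["5ffe9db5-fd4e0001.jpg", "115e4aff-00000000.jpg", "5ffe9db5-0a0b0c0d.jpg"]

# Group = everything before the first dash
groups = [s.split("-")[0] for s in fnames]

# Hypothetical table columns, deliberately in a different order than fnames
table_file_names = ["115e4aff-00000000.jpg", "5ffe9db5-0a0b0c0d.jpg", "5ffe9db5-fd4e0001.jpg"]
table_bicycle = [1, 0, 0]

# Look up each target by file name instead of relying on row order
target_by_name = dict(zip(table_file_names, table_bicycle))
y = [target_by_name[f] for f in fnames]

print(groups)
print(y)
```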

Now we will split the data into train (80%), validation (10%) and test (10%) sets. As we do that, we need to be careful to:

- avoid leakage: for that reason we are grouping data according to video identifier (we want to make sure our model can generalize to new cars or video frames)
- handle the label imbalance: for that reason we stratify data with our target column

We will use sklearn's `StratifiedGroupKFold` to split the data into 10 folds and assign 1 fold for test, 1 for validation and the rest for training.

In [None]:
df = pd.DataFrame()
df['File_Name'] = fnames
df['fold'] = -1
     

In [None]:
cv = StratifiedGroupKFold(n_splits=10)
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    df.loc[test_idxs, ['fold']] = i

In [None]:
df['Stage'] = 'train'
df.loc[df.fold == 0, ['Stage']] = 'test'
df.loc[df.fold == 1, ['Stage']] = 'valid'
del df['fold']
df.Stage.value_counts()

train    800
valid    100
test     100
Name: Stage, dtype: int64

In [None]:
df.to_csv('data_split.csv', index=False)

We will now create a new artifact and add our data there.

In [None]:
processed_data_at = wandb.Artifact(params.PROCESSED_DATA_AT, type="split_data")

In [None]:
processed_data_at.add_file('data_split.csv')
processed_data_at.add_dir(path)

[34m[1mwandb[0m: Adding directory to artifact (./artifacts/bdd_simple_1k:v0)... Done. 4.7s


Finally, the split information may be relevant for our analyses. Rather than uploading the images again, we will save the split information to a new table and join it with the EDA table we created previously.

In [None]:
data_split_table = wandb.Table(dataframe=df[['File_Name', 'Stage']])

In [None]:
join_table = wandb.JoinedTable(orig_eda_table, data_split_table, "File_Name")
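Conceptually, the join pairs rows from the two tables that share the same `File_Name`. The plain-Python sketch below illustrates that idea on hypothetical rows; it does not describe how W&B stores or renders a `JoinedTable`.

```python
# Hypothetical rows standing in for the EDA table and the split table
eda_rows = [
    {"File_Name": "a-1.jpg", "bicycle": 0},
    {"File_Name": "b-1.jpg", "bicycle": 1},
]
split_rows = [
    {"File_Name": "b-1.jpg", "Stage": "valid"},
    {"File_Name": "a-1.jpg", "Stage": "train"},
]

# Index one side by the join key, then attach its columns to the other side
stage_by_name = {r["File_Name"]: r["Stage"] for r in split_rows}
joined = [
    {**r, "Stage": stage_by_name[r["File_Name"]]}
    for r in eda_rows
    if r["File_Name"] in stage_by_name
]
print(joined)
```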

Let's add it to our artifact, log it and finish our run.

In [None]:
processed_data_at.add(join_table, "eda_table_data_split")

ArtifactManifestEntry(path='eda_table_data_split.joined-table.json', digest='3DQw8QZSt7+FkxDuaHSg7A==', ref=None, birth_artifact_id=None, size=127, extra={}, local_path='/root/.local/share/wandb/artifacts/staging/tmpdr5z4npn')

In [None]:
run.log_artifact(processed_data_at)
run.finish()

### **Part 3**


# Baseline solution

In this notebook we will create a baseline solution to our semantic segmentation problem. A notebook is handy for iterating fast. We will then refactor this code into a script so that we can run hyperparameter sweeps.

In [None]:
import sys
sys.path.append('/content/gdrive/MyDrive')

In [None]:
import wandb
import pandas as pd
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

from wandbmodules import utils, params
from wandbmodules.utils import get_predictions, create_iou_table, MIOU, BackgroundIOU, RoadIOU, TrafficLightIOU, TrafficSignIOU, PersonIOU, VehicleIOU, BicycleIOU
     


Again, we're importing some global configuration parameters from the `params.py` file. We have also defined some helper functions in `utils.py`, for example the metrics we will track during our experiments.

Let's now create a `train_config` that we'll pass to the W&B run to control training hyperparameters.

In [None]:

train_config = SimpleNamespace(
    framework="fastai",
    img_size=(180, 320),
    batch_size=8,
    augment=True, # use data augmentation
    epochs=10, 
    lr=2e-3,
    pretrained=True,  # whether to use pretrained encoder
    seed=42,
)

We set the seed for reproducibility.

In [None]:

set_seed(train_config.seed, reproducible=True)

In [None]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="training", config=train_config)

[34m[1mwandb[0m: Currently logged in as: [33mlzeladam[0m. Use [1m`wandb login --relogin`[0m to force relogin


As usual, we will use W&B Artifacts to track the lineage of our models.



In [None]:
processed_data_at = run.use_artifact(f'{params.PROCESSED_DATA_AT}:latest')
processed_dataset_dir = Path(processed_data_at.download())
df = pd.read_csv(processed_dataset_dir / 'data_split.csv')

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k_split:latest, 813.25MB. 4010 files... 
[34m[1mwandb[0m:   4010 of 4010 files downloaded.  
Done. 0:0:3.6


We will not use the hold-out test set at this point. The `is_valid` column tells our trainer how to split the data between training and validation.

In [None]:
df = df[df.Stage != 'test'].reset_index(drop=True)
df['is_valid'] = df.Stage == 'valid'

In [None]:

def label_func(fname):
    return (fname.parent.parent/"labels")/f"{fname.stem}_mask.png"

We will use fastai's DataBlock API to feed data into model training and validation.

In [None]:
# assign paths
df["image_fname"] = [processed_dataset_dir/f'images/{f}' for f in df.File_Name.values]
df["label_fname"] = [label_func(f) for f in df.image_fname.values]

In [None]:
def get_data(df, bs=4, img_size=(180, 320), augment=True):
    block = DataBlock(blocks=(ImageBlock, MaskBlock(codes=params.BDD_CLASSES)),
                  get_x=ColReader("image_fname"),
                  get_y=ColReader("label_fname"),
                  splitter=ColSplitter(),
                  item_tfms=Resize(img_size),
                  batch_tfms=aug_transforms() if augment else None,
                 )
    return block.dataloaders(df, bs=bs)

We are using wandb.config to track our training hyperparameters.

In [None]:
config = wandb.config

In [None]:
config

{'framework': 'fastai', 'img_size': [180, 320], 'batch_size': 8, 'augment': True, 'epochs': 10, 'lr': 0.002, 'pretrained': True, 'seed': 42}
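Note that `img_size` was defined as a tuple in `train_config` but appears as a list in the logged config. The sketch below reproduces the same effect with a plain JSON round trip. This is only an illustration, based on the assumption that the config passes through a JSON-like representation; it is not a description of W&B internals.

```python
import json
from types import SimpleNamespace

# Reduced copy of the training configuration
train_config = SimpleNamespace(img_size=(180, 320), batch_size=8, lr=2e-3)

as_dict = vars(train_config)                      # namespace -> plain dict
round_tripped = json.loads(json.dumps(as_dict))   # JSON has no tuple type

print(type(as_dict["img_size"]).__name__, "->", type(round_tripped["img_size"]).__name__)
print(round_tripped["img_size"])
```

Code that depends on the value being a tuple can convert it back with `tuple(config.img_size)`.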

In [None]:
dls = get_data(df, bs=config.batch_size, img_size=config.img_size, augment=config.augment)

We will use intersection over union (IoU) metrics: the mean across all classes (`MIOU`) and the IoU for each class separately. Our model will be a U-Net based on a pretrained ResNet-18 backbone.
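As a reminder of what these metrics measure, here is a minimal NumPy sketch of per-class IoU and its mean on a toy prediction. It is not the implementation in `utils.py`, which is not shown in this notebook; the function name `per_class_iou` and the toy arrays are illustrative.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    "IoU per class: |pred AND target| / |pred OR target|; NaN if the class is absent from both"
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

# Toy 2x3 masks with three classes
target = np.array([[0, 0, 1],
                   [1, 1, 2]])
pred = np.array([[0, 1, 1],
                 [1, 1, 0]])

ious = per_class_iou(pred, target, num_classes=3)
miou = float(np.nanmean(ious))  # mean over classes, ignoring absent ones
print(ious, miou)
```

A class that is never predicted, like class 2 here, contributes an IoU of 0 and pulls the mean down, which is the pattern visible for the rare classes in the training log.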

In [None]:
metrics = [MIOU(), BackgroundIOU(), RoadIOU(), TrafficLightIOU(), \
           TrafficSignIOU(), PersonIOU(), VehicleIOU(), BicycleIOU()]

learn = unet_learner(dls, arch=resnet18, pretrained=config.pretrained, metrics=metrics)
  

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth


  0%|          | 0.00/44.7M [00:00<?, ?B/s]

fastai already has a callback that integrates tightly with W&B: we only need to pass `WandbCallback` to the learner and we are ready to go. The callback will log all the useful variables for us. For example, any metric we pass to the learner will be tracked by the callback.

In [None]:

callbacks = [
    SaveModelCallback(monitor='miou'),
    WandbCallback(log_preds=False, log_model=True)
]

Let's train our model!

In [None]:
learn.fit_one_cycle(config.epochs, config.lr, cbs=callbacks)

| epoch | train_loss | valid_loss | miou | background_iou | road_iou | traffic_light_iou | traffic_sign_iou | person_iou | vehicle_iou | bicycle_iou | time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.519869 | 0.488336 | 0.150119 | 0.719564 | 0.268039 | 0.0 | 0.0 | 0.0 | 0.06323 | 0.0 | 00:52 |
| 1 | 0.371543 | 0.287116 | 0.33488 | 0.891312 | 0.789182 | 0.0 | 0.0 | 0.0 | 0.663666 | 0.0 | 00:50 |
| 2 | 0.371511 | 0.306927 | 0.323028 | 0.87056 | 0.797487 | 0.0 | 0.0 | 0.0 | 0.593146 | 0.0 | 00:49 |
| 3 | 0.311843 | 0.263902 | 0.344346 | 0.903568 | 0.817864 | 0.0 | 0.003589 | 0.0 | 0.6854 | 0.0 | 00:49 |
| 4 | 0.268859 | 0.257055 | 0.348983 | 0.905938 | 0.820556 | 0.011431 | 0.000469 | 0.0 | 0.704484 | 0.0 | 00:49 |
| 5 | 0.244692 | 0.236134 | 0.352803 | 0.911288 | 0.831298 | 0.009097 | 0.0 | 0.0 | 0.717935 | 0.0 | 00:49 |
| 6 | 0.225204 | 0.248345 | 0.364093 | 0.907665 | 0.833872 | 0.090784 | 0.0 | 0.0 | 0.716331 | 0.0 | 00:49 |
| 7 | 0.198232 | 0.238001 | 0.370885 | 0.916689 | 0.836729 | 0.090763 | 0.0 | 0.0 | 0.752012 | 0.0 | 00:48 |
| 8 | 0.189807 | 0.232065 | 0.371814 | 0.918429 | 0.842106 | 0.089024 | 0.0 | 0.0 | 0.753141 | 0.0 | 00:49 |
| 9 | 0.181106 | 0.228729 | 0.379446 | 0.919163 | 0.842285 | 0.136638 | 0.000268 | 0.0 | 0.757767 | 0.0 | 00:48 |


Better model found at epoch 0 with miou value: 0.15011882278899838.
Better model found at epoch 1 with miou value: 0.3348798851119916.
Better model found at epoch 3 with miou value: 0.3443458899992291.
Better model found at epoch 4 with miou value: 0.3489825142739512.
Better model found at epoch 5 with miou value: 0.3528027614255999.
Better model found at epoch 6 with miou value: 0.36409313850604036.
Better model found at epoch 7 with miou value: 0.3708846211687341.
Better model found at epoch 8 with miou value: 0.37181432914173546.
Better model found at epoch 9 with miou value: 0.37944599517718247.
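The log above has no entry for epoch 2 because its `miou` (0.323028) was lower than the best value seen so far. A minimal sketch of that "keep the best" rule, using the `miou` values from this training run, looks like this. It only illustrates the comparison and is not fastai's `SaveModelCallback` implementation.

```python
# miou per epoch, copied from the training output above
miou_per_epoch = [0.150119, 0.33488, 0.323028, 0.344346, 0.348983,
                  0.352803, 0.364093, 0.370885, 0.371814, 0.379446]

best = float("-inf")
improved_epochs = []
for epoch, value in enumerate(miou_per_epoch):
    if value > best:          # higher is better for IoU
        best = value
        improved_epochs.append(epoch)  # a checkpoint would be written here

print(improved_epochs)
print(best)
```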


In [None]:
samples, outputs, predictions = get_predictions(learn)
table = create_iou_table(samples, outputs, predictions, params.BDD_CLASSES)
wandb.log({"pred_table":table})

The model is reloaded from the best checkpoint at the end of training and saved. To make sure we track the final metrics correctly, we will validate the model again and save the final loss and metrics to `wandb.summary`.

In [None]:
scores = learn.validate()
metric_names = ['final_loss'] + [f'final_{x.name}' for x in metrics]
final_results = {metric_names[i] : scores[i] for i in range(len(scores))}
for k,v in final_results.items(): 
    wandb.summary[k] = v

In [None]:
wandb.finish()

Run history:
background_iou,▁▇▆▇██████
bicycle_iou,▁▁▁▁▁▁▁▁▁▁
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
eps_0,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eps_1,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eps_2,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
lr_0,▁▂▂▃▄▅▆▇███████▇▇▇▇▆▆▆▅▅▅▄▄▄▃▃▃▂▂▂▁▁▁▁▁▁
lr_1,▁▂▂▃▄▅▆▇███████▇▇▇▇▆▆▆▅▅▅▄▄▄▃▃▃▂▂▂▁▁▁▁▁▁
lr_2,▁▂▂▃▄▅▆▇███████▇▇▇▇▆▆▆▅▅▅▄▄▄▃▃▃▂▂▂▁▁▁▁▁▁
miou,▁▇▆▇▇▇████

Run summary:
background_iou,0.91916
bicycle_iou,0.0
epoch,10.0
eps_0,1e-05
eps_1,1e-05
eps_2,1e-05
final_background_iou,0.91916
final_bicycle_iou,0.0
final_loss,0.22873
final_miou,0.37945
