<a href="https://colab.research.google.com/github/Guilherme-De-Marchi/wandb-courses/blob/main/mlops-001/lesson1/02_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
from google.colab import drive
drive.mount('/content/drive')

!pip install wandb

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [16]:
import sys
sys.path.append('/content/drive/MyDrive/Colab Notebooks/wandb-courses/mlops-001/lesson01')

import os, warnings
import wandb

import pandas as pd
from fastai.vision.all import *
from sklearn.model_selection import StratifiedGroupKFold

import params
# warnings.filterwarnings('ignore')

In [17]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="data_split")

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Let's use the artifact we previously saved to W&B (artifact names and other global parameters are stored in params).

In [18]:
raw_data_at = run.use_artifact(f'{params.RAW_DATA_AT}:latest')
path = Path(raw_data_at.download())

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 813.75MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:55.6


In [20]:
# path.ls()
(path/'images').ls()

(#1000) [Path('artifacts/bdd_simple_1k:v0/images/0bc1e57d-0331a843.jpg'),Path('artifacts/bdd_simple_1k:v0/images/bc92a87d-aaafd269.jpg'),Path('artifacts/bdd_simple_1k:v0/images/38b474cc-3efec58a.jpg'),Path('artifacts/bdd_simple_1k:v0/images/835df354-00000000.jpg'),Path('artifacts/bdd_simple_1k:v0/images/67c3676f-0aa098df.jpg'),Path('artifacts/bdd_simple_1k:v0/images/00e9be89-00001605.jpg'),Path('artifacts/bdd_simple_1k:v0/images/a06a4b0e-c5241837.jpg'),Path('artifacts/bdd_simple_1k:v0/images/c6d7400d-8a08045e.jpg'),Path('artifacts/bdd_simple_1k:v0/images/84cad8e7-6a28398d.jpg'),Path('artifacts/bdd_simple_1k:v0/images/af59aee0-759730e7.jpg')...]

To split the data into training, validation and test sets, we need the file names, the groups (derived from the file names) and the target (here we use our rare class, bicycle, for stratification). We previously saved these columns to the EDA table, so let's retrieve them from the table now.

In [21]:
fnames = os.listdir(path/'images')
groups = [s.split('-')[0] for s in fnames]

In [22]:
orig_eda_table = raw_data_at.get("eda_table")

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 813.75MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:2.0


In [23]:
y = orig_eda_table.get_column('bicycle')
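One thing worth double-checking at this point is that fnames, groups and y describe the same images in the same order, since os.listdir returns entries in arbitrary order. The following is only a minimal sketch on made-up data, assuming the EDA table has a File_Name column (as the later join on "File_Name" suggests); table_df, listed_fnames and the other names are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Hypothetical stand-ins: table rows in one order, directory listing in another.
table_df = pd.DataFrame({
    'File_Name': ['aaa-001.jpg', 'bbb-001.jpg', 'ccc-001.jpg'],
    'bicycle': [0, 1, 0],
})
listed_fnames = ['ccc-001.jpg', 'aaa-001.jpg', 'bbb-001.jpg']

# Look labels up by file name so that y_aligned[i] belongs to listed_fnames[i].
label_by_name = table_df.set_index('File_Name')['bicycle']
y_aligned = label_by_name.loc[listed_fnames].tolist()

# Groups are derived from the same list, so they share its order.
groups_aligned = [name.split('-')[0] for name in listed_fnames]

print(y_aligned)       # [0, 0, 1]
print(groups_aligned)  # ['ccc', 'aaa', 'bbb']
```

Looking labels up by key rather than by position keeps the three sequences consistent regardless of how the directory happens to be listed.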



Now we will split the data into train (80%), validation (10%) and test (10%) sets. As we do that, we need to be careful to:

    - avoid leakage: we group the data by video identifier so that frames from the same video stay in the same set (we want to make sure our model can generalize to new cars or video frames)
    - handle the label imbalance: we stratify the data on our target column so that each set contains a similar share of the rare class

We will use sklearn's StratifiedGroupKFold to split the data into 10 folds and assign 1 fold for test, 1 for validation and the rest for training.


In [24]:
df = pd.DataFrame()
df['File_Name'] = fnames
df['fold'] = -1

In [25]:
cv = StratifiedGroupKFold(n_splits=10)
# Each sample is held out in exactly one fold; store that fold's index
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    df.loc[test_idxs, ['fold']] = i

In [26]:
# Fold 0 becomes the test set, fold 1 the validation set, the rest is training data
df['Stage'] = 'train'
df.loc[df.fold == 0, ['Stage']] = 'test'
df.loc[df.fold == 1, ['Stage']] = 'valid'
del df['fold']
df.Stage.value_counts()

train    800
test     100
valid    100
Name: Stage, dtype: int64
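The counts confirm the 80/10/10 proportions, but they do not show whether any video ended up in more than one stage. A possible sanity check is sketched below on a made-up table; groups_shared_between_stages is a hypothetical helper, and it assumes the video identifier is the part of the file name before the first hyphen, as in the groups list above.

```python
import pandas as pd

def groups_shared_between_stages(split_df):
    """Return video ids that occur in more than one stage (ideally none)."""
    video_id = split_df['File_Name'].str.split('-').str[0].rename('video_id')
    stages_per_video = split_df.groupby(video_id)['Stage'].nunique()
    return sorted(stages_per_video[stages_per_video > 1].index)

# Made-up split in which every video stays within a single stage.
toy_split = pd.DataFrame({
    'File_Name': ['aaa-001.jpg', 'aaa-002.jpg', 'bbb-001.jpg', 'ccc-001.jpg'],
    'Stage': ['train', 'train', 'valid', 'test'],
})
# Same files, but video 'aaa' now appears in both train and test.
leaky_split = toy_split.assign(Stage=['train', 'test', 'valid', 'test'])

print(groups_shared_between_stages(toy_split))    # []
print(groups_shared_between_stages(leaky_split))  # ['aaa']
```

Calling the helper on a frame with File_Name and Stage columns should return an empty list when the grouping worked as intended.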

In [27]:
df.to_csv('data_split.csv', index=False)

We will now create a new artifact and add our data there.

In [28]:
processed_data_at = wandb.Artifact(params.PROCESSED_DATA_AT, type="split_data")

In [29]:
processed_data_at.add_file('data_split.csv')
processed_data_at.add_dir(path)

[34m[1mwandb[0m: Adding directory to artifact (./artifacts/bdd_simple_1k:v0)... Done. 6.6s


Finally, the split information may be relevant for our analyses. Rather than uploading the images again, we will save the split information to a new table and join it with the EDA table we created previously.

In [30]:
data_split_table = wandb.Table(dataframe=df[['File_Name', 'Stage']])

In [31]:
join_table = wandb.JoinedTable(orig_eda_table, data_split_table, "File_Name")
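As a rough mental model only, joining two tables on File_Name resembles a key-based merge in pandas. The sketch below is a pandas analogy on made-up rows, not the W&B API, and it does not claim to reproduce how W&B resolves or displays the join; eda_like and split_like are hypothetical stand-ins for the two tables.

```python
import pandas as pd

# Hypothetical stand-in for the EDA table (one row per image).
eda_like = pd.DataFrame({
    'File_Name': ['aaa-001.jpg', 'bbb-001.jpg', 'ccc-001.jpg'],
    'bicycle': [0, 1, 0],
})
# Hypothetical stand-in for the split table, deliberately in a different order.
split_like = pd.DataFrame({
    'File_Name': ['ccc-001.jpg', 'aaa-001.jpg', 'bbb-001.jpg'],
    'Stage': ['test', 'train', 'valid'],
})

# Rows are matched by key, not by position; validate raises on duplicate keys.
joined = eda_like.merge(split_like, on='File_Name', how='inner',
                        validate='one_to_one')
print(joined)
```

The point of the analogy is that the split table only needs the key column and the new Stage column, so nothing from the EDA table has to be uploaded again.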

Let's add the joined table to our artifact, log the artifact and finish our run.

In [32]:
processed_data_at.add(join_table, "eda_table_data_split")

ArtifactManifestEntry(path='eda_table_data_split.joined-table.json', digest='FupBXkCDCQFdJP6HwIeN6w==', ref=None, birth_artifact_id=None, size=127, extra={}, local_path='/root/.local/share/wandb/artifacts/staging/tmpeaaa8cq4')

In [33]:
run.log_artifact(processed_data_at)
run.finish()