#### This notebook is part of the tutorial series on Effective MLOps with W&B. All credit goes to the author of the tutorial. I'm just the small guy trying to learn and help you along the way with better comments and explanations.

##### Exploratory Data Analysis (EDA) is an approach to analyzing and summarizing datasets to figure out their underlying distribution, structure and relationships.
##### 1. The main goal, as stated above, is to uncover the underlying distribution of the data, the presence of outliers and any potential relationships between the variables.
##### 2. Univariate analysis -> Analyzing each variable independently and summarizing its distribution using visual plots like histograms, boxplots and frequency distributions.
##### 3. Bivariate analysis -> Analyzing the relationship between two variables using scatter plots, cross tabulations and correlation coefficients.
##### 4. Multivariate analysis -> Analyzing the relationship between multiple variables using dimensionality reduction techniques such as PCA and t-SNE, and clustering algorithms.
##### 5. Data visualization -> Involves creating plots to visually explore the data and identify patterns, trends and outliers.
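
To make the univariate and bivariate steps concrete, here is a minimal sketch using only the Python standard library. The two variables (`brightness`, `vehicles`) and their values are made up for illustration and are not part of the BDD data.

```python
import math
from collections import Counter

# Hypothetical toy data: image brightness and number of vehicles per image.
brightness = [0.2, 0.4, 0.5, 0.7, 0.9]
vehicles = [1, 2, 2, 4, 5]

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Univariate: frequency distribution of a single variable.
vehicle_counts = Counter(vehicles)

# Bivariate: strength of the linear relationship between two variables.
r = pearson(brightness, vehicles)
print(vehicle_counts, round(r, 3))
```

In practice a library such as pandas or matplotlib would usually do this work; the hand-written version is only meant to show what the summaries compute.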

In [1]:
### Install PyTorch, the Jupyter notebook extensions and fastai for use in this notebook
!pip3 install torch torchvision -f https://download.pytorch.org/whl/torch_stable.html
!pip3 install jupyter_contrib_nbextensions
!pip3 install fastai

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [2]:
## Let's install tensorboard and wandb
!pip install tensorboard 
!pip3 install wandb -qqq



In [3]:
### Import fastai and wandb for logging metrics
import wandb
import params
from fastai.vision.all import *

In [4]:
# Log in with your own wandb account
wandb.login()


wandb: Currently logged in as: ktest123. Use `wandb login --relogin` to force relogin


True

###### Dataset details
###### BDD100K, the Berkeley DeepDrive dataset, is a large-scale dataset consisting of 100K driving videos collected from more than 50K rides. Each video is 40 seconds long at 30 fps, for more than 100 million frames in total. The dataset covers diverse scene types including city streets, residential areas and highways, and diverse weather conditions at different times of the day. It can be used for multitask learning, including lane detection, semantic segmentation, instance segmentation, panoptic segmentation, multi-object tracking, segmentation tracking and more.

###### More details about the dataset are available at https://bdd100k.com/.
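
A quick back-of-the-envelope check of the dataset size, assuming every video is exactly 40 seconds long at 30 fps (real clips may differ slightly):

```python
# Figures quoted in the dataset description above.
num_videos = 100_000
seconds_per_video = 40
frames_per_second = 30

frames_per_video = seconds_per_video * frames_per_second
total_frames = num_videos * frames_per_video
print(frames_per_video, total_frames)
```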

In [5]:
## Download the dataset
URL = 'https://storage.googleapis.com/wandb_course/bdd_simple_1k.zip'
path = Path(untar_data(URL,force_download=True))

In [6]:
### Let's look at the data
(path/'images').ls()

(#1000) [Path('/home/puget/.fastai/data/bdd_simple_1k/images/00e9be89-00001605.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/3b76e313-a4f861d4.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/4ac80c15-d1dcc514.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/20b675b9-aeb033a5.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/7b3ee12a-26590001.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/70d7553c-7da414da.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/9ece8bd6-3cede2ad.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/a91b7555-00000920.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/c060ea1f-12965687.jpg'),Path('/home/puget/.fastai/data/bdd_simple_1k/images/972ab49a-6a6eeaf5.jpg')...]

In [20]:
### Let's create a class to preprocess the data and upload it to a wandb table
class PreprocessData():
    @staticmethod
    def annotate_funct(fname):
        """ Return the path of the segmentation mask that belongs to an image file """
        return (fname.parent.parent/"labels")/f"{fname.stem}_mask.png"

    @staticmethod
    def image_per_class(data,labels):
        """ Flag, for every class, whether it is present in the mask (1) or not (0) """
        mask_lst = list(np.unique(data))
        output = {}
        for i in labels.keys():
            output[labels[i]] = int(i in mask_lst)
        return output

    @staticmethod
    def table_creation(files,labels):
        """ Function to create the wandb table with the dataset """

        # One column per class name, next to the file name, image and split
        lb = [str(labels[i]) for i in list(labels)]
        table = wandb.Table(columns=["File Name","Image","Split"]+lb)

        for i,f in progress_bar(enumerate(files),total=len(files)):
            img = Image.open(f)
            mask_data = np.array(Image.open(PreprocessData.annotate_funct(f)))
            image_class = PreprocessData.image_per_class(mask_data,labels)
            table.add_data(
                str(f.name),
                wandb.Image(
                    img,
                    masks={
                        "predictions" : {
                            "mask_data" : mask_data,
                            "class_labels" : labels,
                        }
                    }
                ),
                "None",  # no dataset split has been assigned yet
                *[image_class[i] for i in lb])

        return table
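
The two helper functions above can be tried on a tiny example without downloading anything. This is only a sketch: the file name, the 2x3 mask and the reduced label dictionary are made up for illustration.

```python
from pathlib import Path
import numpy as np

# Hypothetical file path, used only to illustrate the naming convention.
image_path = Path("bdd_simple_1k/images/example.jpg")
mask_path = image_path.parent.parent / "labels" / f"{image_path.stem}_mask.png"

# Hypothetical 2x3 mask: each pixel stores a class id.
mask = np.array([[0, 1, 1],
                 [0, 5, 1]])
labels = {0: "background", 1: "road", 4: "person", 5: "vehicle"}

# A class is flagged 1 if its id occurs anywhere in the mask, else 0.
present_ids = set(np.unique(mask).tolist())
flags = {name: int(class_id in present_ids) for class_id, name in labels.items()}
print(mask_path, flags)
```

Each image therefore contributes one row of 0/1 flags, which is what ends up in the per-class columns of the table.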


##### Before we dive right into the code, let's understand what wandb is and how it can help us with our MLOps journey.
##### W&B is a tool for tracking, visualizing and comparing machine learning experiments. It lets us track and visualize metrics such as accuracy, loss and other custom metrics, as well as model gradients and weights during the training process. It also supports comparing multiple runs of the same or different models and sharing results with the rest of the team. Some common terms used in W&B are listed below.

##### 1. Runs -> A run denotes a single execution of a machine learning model, including the input data, configuration and output metrics.
##### 2. Metrics -> Metrics are quantifiable measurements of the performance of a model, including loss, accuracy, precision, recall, F1 score and more.
##### 3. Artifacts -> Artifacts consist of additional data that can be logged along with a run, including datasets (as this notebook does), model weights, model architecture, model predictions and more.
##### 4. Projects -> Come on, anyone can explain that, but in short it's a way of organizing runs and experiments, with each project holding multiple runs and experiments.
##### 5. Experiments -> A collection of runs sharing the same codebase, data and hyperparameters.
##### 6. Compare -> Compare is a feature that allows us to compare multiple runs and experiments in terms of their metrics and artifacts.
##### 7. Dashboard -> A web-based interface that allows us to view and explore runs, experiments and projects.
##### 8. Integrations -> The flexibility of W&B allows us to combine it with different frameworks and libraries, including PyTorch, TensorFlow, Keras, Scikit-learn and XGBoost, for tracking training progress and logging experimental results.
##### 9. Sweeps -> Sweeps let us perform hyperparameter tuning and optimization of the model by running multiple experiments with different hyperparameters.
##### 10. Notebooks -> W&B can track and monitor the progress of your Jupyter notebooks and share the results with the rest of the team.
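
As one concrete illustration of the terms above, a sweep is described by a plain configuration dictionary. The sketch below only builds that dictionary; the metric name `val_loss`, the parameter ranges and the `train` function mentioned in the comments are placeholders, not values from this tutorial.

```python
# Sketch of a sweep configuration; metric and parameter names are placeholders.
sweep_config = {
    "method": "random",  # other documented strategies include "grid" and "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [8, 16, 32]},
    },
}

# With a logged-in client, a sweep is typically registered and executed with:
#   sweep_id = wandb.sweep(sweep_config, project="mlops")
#   wandb.agent(sweep_id, function=train, count=5)  # `train` is hypothetical
print(sorted(sweep_config["parameters"]))
```

Each agent call then starts runs whose hyperparameters are drawn from the `parameters` section, and the runs show up side by side in the project dashboard.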

In [9]:
## Assuming you went through above explanation, let's start a new wandb run and put everything into a raw artifact
run = wandb.init(project="mlops",entity="ktest123",job_type="upload_data")
raw_data_artifact = wandb.Artifact('bdd_simple_1k', type="raw_data")


In [12]:
## Start adding files to artifact
raw_data_artifact.add_file(path/'LICENSE.txt',name='LICENSE.txt')

ArtifactManifestEntry(path='LICENSE.txt', digest='X+6ZFkDOlnKesJCNt20yRg==', ref=None, birth_artifact_id=None, size=1594, extra={}, local_path='/home/puget/.local/share/wandb/artifacts/staging/tmp7midv_o5')

In [13]:
### Let's add the images and labels to the artifact
raw_data_artifact.add_dir(path/'images',name='images')
raw_data_artifact.add_dir(path/'labels',name='labels')

wandb: Adding directory to artifact (/home/puget/.fastai/data/bdd_simple_1k/images)... Done. 0.2s
wandb: Adding directory to artifact (/home/puget/.fastai/data/bdd_simple_1k/labels)... Done. 0.2s


In [15]:
### Let's get the image files
image_files = get_image_files(path/'images',recurse=True)

In [21]:
### Map each integer class id used in the masks to its class name
BDD_CLASSES = {i:c for i,c in enumerate(['background','road','traffic light','traffic sign','person','vehicle','bicycle'])}

In [22]:
### Let's create a wandb table with the dataset
wandb_table = PreprocessData.table_creation(image_files, BDD_CLASSES)

TypeError: Class labels must be a dictionary of numbers to strings
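
The error message says the label mapping must go from numbers to strings. To narrow down which entries trigger it, a small helper can list the offending pairs before the table is built. This is a hypothetical check written from the wording of the message, not wandb's own validation code, and the `good` and `bad` dictionaries are made-up examples.

```python
import numbers

def check_class_labels(class_labels):
    """Return the (key, value) pairs that are not number -> string."""
    return [
        (k, v)
        for k, v in class_labels.items()
        if not isinstance(k, numbers.Number) or not isinstance(v, str)
    ]

good = {0: "background", 1: "road"}
bad = {"0": "background", 1: None}  # string key and non-string value
print(check_class_labels(good), check_class_labels(bad))
```

Running `check_class_labels(BDD_CLASSES)` right before `table_creation` shows whether the dictionary actually passed in matches what the message asks for.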