<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png"
     width="200px"
     height="auto"/>
</p>



# <h1 align="center" id="heading">Phase V - Data and Model Version Control using DVC and MLflow</h1>



## ☑️ Objectives
At the end of this session, you will have a brief understanding of how to:
- [ ] Track different versions of datasets
- [ ] Track and compare ML experiments
- [ ] Track and survey ML models

## 🛠️ Pre-Assignment

1. Create a virtual environment with 🐍 conda:  

```console
conda env create -f environment.yml
```

2. Activate your conda virtual environment:

```console
conda activate dvc_mlflow_env
```

3. Set up MLflow by running this command in your terminal:
`mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlflow.db --default-artifact-root $PWD/mlruns`
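
The flags in the command above can be read one at a time. The sketch below rebuilds the same argument list in Python, assuming you would rather assemble the command programmatically than type it; the helper name `mlflow_server_command` is hypothetical, and the current working directory stands in for `$PWD`.

```python
import os
import shlex


def mlflow_server_command(host="0.0.0.0", port=5000, workdir=None):
    """Return the argument list for the `mlflow server` command shown above."""
    workdir = workdir or os.getcwd()  # stands in for $PWD
    return [
        "mlflow", "server",
        "--host", host,                                  # interface the server listens on
        "--port", str(port),                             # port for the UI and tracking API
        "--backend-store-uri", "sqlite:///mlflow.db",    # where run metadata is stored
        "--default-artifact-root", os.path.join(workdir, "mlruns"),  # default artifact location
    ]


command = mlflow_server_command()
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be handed to `subprocess.Popen` if you prefer to start the server from Python.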


## Tasks
There are three tasks for this phase:
1. Open and complete `dvc_mlflow.ipynb`
2. Submit a link to your Hugging Face profile
3. Submit screenshots of your MLflow experiments with the name `lastname_firstname_screenshot-number`


## Background
Please review the weekly narrative [here](https://www.notion.so/Week-2-Analyzing-Market-Sentiment-Phase-IV-and-V-Quality-and-Version-Control-90188b366dd94c7b81b3d9a2c6e978d1#2bd3411ff3ba4ad48f11124ee59a144f)

## References
[DVC Docs](https://dvc.org/doc)\
[MLflow Docs](https://www.mlflow.org/docs/latest/index.html)\
[Hugging Face Transformers](https://huggingface.co/docs/transformers/main_classes/callback)\
[MLflow - Experiments](https://www.mlflow.org/docs/latest/tracking.html)

# Setup 

## Git setup

Create the `demo` directory, then change the working directory to `demo` (use the `%` prefix so that the change persists across cells).

In [None]:
#INSERT_CODE_HERE

Initialize the Git repository (remember to use ! before your command)

In [None]:
#INSERT_CODE_HERE

Configure your Git credentials

In [None]:
#!git config --global user.email ""
#!git config --global user.name ""
#!git config --global credential.helper store

## DVC setup

Initialize DVC

In [None]:
#INSERT_CODE_HERE

Just as GitHub hosts Git repositories, DVC remotes provide a location to store and share data and models. You can pull data assets created by colleagues from a DVC remote without spending time and resources building or processing them locally. Remote storage can also save space in your local environment: DVC can fetch into the cache directory only the data you need for a specific branch or commit.

Add `remote_dvc` in the current directory.

In [None]:
#INSERT_CODE_HERE


## MLflow setup

Set the `MLFLOW_TRACKING_URI` environment variable so that it points at the MLflow server running on host `0.0.0.0`, port `5000`

In [None]:
import os

#INSERT_CODE_HERE
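
A minimal sketch of one way to do this, assuming the server from the Pre-Assignment is listening over plain HTTP. MLflow clients read `MLFLOW_TRACKING_URI` from the environment; the `tracking_uri` helper is hypothetical.

```python
import os


def tracking_uri(host: str = "0.0.0.0", port: int = 5000) -> str:
    """Build an HTTP tracking URI from a host and a port."""
    return f"http://{host}:{port}"


# Export the URI so MLflow clients started from this process can find the server.
os.environ["MLFLOW_TRACKING_URI"] = tracking_uri()
print(os.environ["MLFLOW_TRACKING_URI"])
```

If `0.0.0.0` is not reachable as a client address on your machine, `tracking_uri("127.0.0.1")` should point at the same local server.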

# I. First iteration

First of all, let's create a new branch called "first-iteration" using git

In [None]:
#INSERT_CODE_HERE

Next, let's pull a dataset of Reddit comments into our workspace so we can add it to the Git and DVC repository we prepared inside the `demo` folder

In [None]:
!git clone -b v1 https://huggingface.co/datasets/fourthbrain-demo/reddit-comments-demo  /tmp/data

Copy the downloaded data into the `data` folder of our project

In [None]:
!cp -r /tmp/data $PWD/data

Let's take a look at the datasets 

In [None]:
!ls -hsl data

### Working with DVC

Let's track all the `csv` files in the `data` folder with DVC.

This will:

- Add your `train.csv` and `test.csv` files to `.gitignore`
- Create two files with the `.dvc` extension, `train.csv.dvc` and `test.csv.dvc`
- Store the contents of `train.csv` and `test.csv` in the DVC cache

In [None]:
#INSERT_CODE_HERE
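
To make the three effects listed above concrete, here is a simplified sketch of content-addressed tracking that uses only the standard library. It illustrates the idea and is not DVC's implementation: the `track` helper, the cache layout, and the pointer-file format are all assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def track(path: Path, cache_dir: Path) -> Path:
    """Cache a file by its MD5 hash and write a small pointer file next to it."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()

    # 1. Store a copy in the cache under a name derived from the content hash.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, cached)

    # 2. Write a pointer file (hypothetical format) that Git can track instead of the data.
    pointer = path.with_name(path.name + ".dvc")
    pointer.write_text(f"md5: {digest}\npath: {path.name}\n")

    # 3. Tell Git to ignore the data file itself.
    with (path.parent / ".gitignore").open("a") as handle:
        handle.write(f"/{path.name}\n")
    return pointer


workspace = Path(tempfile.mkdtemp())
data_dir = workspace / "data"
data_dir.mkdir()
train = data_dir / "train.csv"
train.write_text("comment,label\nplaceholder text,1\n")  # placeholder content

pointer = track(train, workspace / "cache")
print(pointer.name)                                    # train.csv.dvc
print((data_dir / ".gitignore").read_text().strip())   # /train.csv
```

Because the pointer file holds only a hash and a file name, it stays small enough to commit to Git no matter how large the dataset grows.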

Use Git to stage `data/train.csv.dvc`, `data/test.csv.dvc`, and `data/.gitignore`.

To have DVC stage these files for you automatically, run: `dvc config core.autostage true`

In [None]:
#INSERT_CODE_HERE

Next, let's commit the recent changes

In [None]:
#INSERT_CODE_HERE

Use DVC to upload data from the cache to remote storage


In [None]:
#INSERT_CODE_HERE

Before removing anything, let's check the demo/data folder: the csv files are still there, now alongside their `.dvc` tracking files.

In [None]:
!ls -lhs data

Now, since our datasets are pushed to the dvc remote, we can safely remove them from the project.

In [None]:
#INSERT_CODE_HERE

At this point, if we check the demo/data folder, we can see that the csv files are gone, and we only have the dvc tracking files. Let's use DVC to get the data again.

In [None]:
#INSERT_CODE_HERE

We check our data folder one more time.

In [None]:
!ls -lhs data

Nice! We have pulled the `train.csv` and `test.csv` files from our remote storage!
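
The push, remove, and pull round trip can be pictured with plain directories. The sketch below is a simplified model under stated assumptions, not DVC's implementation: `push` and `pull` are hypothetical helpers, the "remote" is just another folder, and the pointer is an in-memory hash rather than a `.dvc` file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def push(path: Path, remote: Path) -> str:
    """Copy a file to the remote under its content hash and return that hash."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    remote.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, remote / digest)
    return digest


def pull(digest: str, remote: Path, target: Path) -> None:
    """Restore a file from the remote and verify it matches the recorded hash."""
    shutil.copyfile(remote / digest, target)
    if hashlib.md5(target.read_bytes()).hexdigest() != digest:
        raise ValueError(f"hash mismatch for {target}")


root = Path(tempfile.mkdtemp())
remote = root / "remote_dvc"
train = root / "train.csv"
train.write_text("comment,label\nplaceholder text,0\n")  # placeholder content

digest = push(train, remote)   # upload by content hash
train.unlink()                 # remove the local copy; only the hash is needed now
pull(digest, remote, train)    # restore the exact same bytes
print(train.exists())          # True
```

The hash check in `pull` is what makes removing the local files safe: the restored data either matches the recorded version or the mismatch is reported.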

### Training model

Please create a new user token in Hugging Face [here](https://huggingface.co/settings/tokens)

Let's log in to your Hugging Face account so you can manage your model repositories. `notebook_login` will launch a widget in your notebook where you'll need to add your Hugging Face token.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

Let's use the Datasets library to load and preprocess the train and test datasets so we can use them to train the model with a 1:10 split.

In [None]:
from datasets import load_dataset

train_dataset = #INSERT_CODE_HERE
test_dataset = #INSERT_CODE_HERE

To preprocess our data, you will use the tokenizer from a pretrained model. Use `AutoTokenizer` to load the tokenizer for the `distilbert-base-uncased` model.

In [None]:
from transformers import AutoTokenizer
tokenizer = #INSERT_CODE_HERE

Here, we prepare the text inputs for the model for both splits of our dataset (training and test) by using the map method

In [None]:
def preprocess_function(data):
   return tokenizer(data["comment"], truncation=True, padding=True)
 

In [None]:
tokenized_train = train_dataset.map(#INSERT_CODE_HERE)
tokenized_test = test_dataset.map(#INSERT_CODE_HERE)
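
To see what `truncation=True` and `padding=True` ask of the tokenizer, the sketch below applies both steps to lists of token ids in plain Python. It is a simplification and not the Transformers implementation: `pad_and_truncate`, the `max_length` of 5, the pad id 0, and the token ids are all made up for illustration.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 7615, 102]]  # illustrative ids
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2146]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding only means something when several comments are processed together, which is why the second iteration of this notebook passes `batched=True` to `map`.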

Then, let's define the metrics you will use to evaluate how good your fine-tuned model is (`accuracy` and `f1 score`)

In [None]:
import numpy as np
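# Note: load_metric is deprecated and absent from recent Datasets releases;
# on those versions, evaluate.load from the evaluate package is the replacement.
from datasets import load_metric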
 
def compute_metrics(eval_pred):
   load_accuracy = #INSERT_CODE_HERE
   load_f1 = #INSERT_CODE_HERE
   logits, labels = eval_pred
   predictions = np.argmax(logits, axis=-1)
   accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
   f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
   return {"accuracy": accuracy, "f1": f1}
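
For reference, both metrics can also be worked out by hand. This sketch mirrors the arithmetic in plain Python, assuming binary labels with 1 as the positive class; `argmax_rows`, `accuracy`, and `f1_binary` are hypothetical helpers and the logits are made-up values.

```python
from typing import List, Sequence


def argmax_rows(logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def f1_binary(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


logits = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]  # illustrative values
labels = [1, 0, 0, 1]
predictions = argmax_rows(logits)
print(predictions)                     # [1, 0, 1, 0]
print(accuracy(predictions, labels))   # 0.5
print(f1_binary(predictions, labels))  # 0.5
```

Accuracy counts every correct prediction equally, while F1 focuses on the positive class, so the two can diverge when the labels are imbalanced.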

To speed up training, let's use a data_collator to convert your training samples to PyTorch tensors and concatenate them with the correct amount of padding

In [None]:
from transformers import DataCollatorWithPadding
data_collator = #INSERT_CODE_HERE
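
The speed-up comes from padding each batch only to its own longest sequence. The sketch below counts token slots under both strategies; `padded_slots` is a hypothetical helper and the sequence lengths are made up.

```python
from typing import List


def padded_slots(lengths: List[int], batch_size: int, dynamic: bool) -> int:
    """Count token slots after padding, per batch (dynamic) or to the global maximum."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = max(batch) if dynamic else global_max
        total += width * len(batch)
    return total


lengths = [12, 9, 128, 15, 7, 10, 64, 8]  # illustrative token counts per comment
print(padded_slots(lengths, batch_size=4, dynamic=False))  # 1024
print(padded_slots(lengths, batch_size=4, dynamic=True))   # 768
```

In this toy case the second batch contains no very long comment, so padding it to 64 instead of 128 removes a quarter of the slots the model would otherwise process.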

Now, let's define our base model. For this, we'll use the pretrained `distilbert-base-uncased` model with our 2 labels.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(#INSERT_CODE_HERE)



Before training our model, we need to define the training arguments and define a Trainer with all the objects you constructed up to this point. Find the proper callback to send the experiment to MLflow.

In [None]:
from transformers import TrainingArguments, Trainer
from transformers.integrations import MLflowCallback
repo_name = "bert_model"
 
training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=True,
)
 
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
   callbacks=#INSERT_CODE_HERE
)
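
A callback is an object that the training loop notifies at fixed points, which is how an experiment tracker can receive metrics without the loop knowing anything about it. The sketch below shows the pattern in plain Python; `MetricRecorder`, `run_training`, and the `on_log` hook as written here are illustrative and are not the Transformers or MLflow API. The loss values are placeholders.

```python
from typing import Dict, List


class MetricRecorder:
    """Collects every metrics dictionary the loop reports."""

    def __init__(self) -> None:
        self.history: List[Dict[str, float]] = []

    def on_log(self, step: int, metrics: Dict[str, float]) -> None:
        self.history.append({"step": step, **metrics})


def run_training(num_steps: int, callbacks: list) -> None:
    """Toy loop: computes a placeholder loss and notifies each callback."""
    for step in range(1, num_steps + 1):
        loss = 1.0 / step  # placeholder, not a real training loss
        for callback in callbacks:
            callback.on_log(step, {"loss": loss})


recorder = MetricRecorder()
run_training(num_steps=3, callbacks=[recorder])
print([entry["step"] for entry in recorder.history])  # [1, 2, 3]
```

As in the `Trainer` call above, `callbacks` is a list, so several listeners can observe the same run.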

It's time to fine-tune the model on the reddit comments dataset. Let's train it!

In [None]:
#INSERT_CODE_HERE
print("Nice! Now we have trained our model, let's head to MLflow and see if our experiment was tracked")

Head to your browser and open http://127.0.0.1:5000 and you should see your experiment. Please remember to start the MLflow server as instructed at the beginning of this assignment.

# II. Second iteration

Let's try to make our model better at prediction! In this iteration, we are going to use a new dataset to fine-tune a different NLP model.

In [None]:
# creating a new branch
!git checkout -b "second-iteration"

Let's remove the old datasets and pull the newer ones

In [None]:
!rm -rf data/*.csv

In [None]:
!git clone -b v2 https://huggingface.co/datasets/fourthbrain-demo/reddit-comments-demo  /tmp/data-v2

In [None]:
!cp -r /tmp/data-v2/* data

In [None]:
!ls -lsh data

Same as we did earlier in the first iteration, let's track our new datasets with DVC.

In [None]:
#INSERT_CODE_HERE

And now let's track these changes with git

In [None]:
#INSERT_CODE_HERE

Use git to commit our updates

In [None]:
#INSERT_CODE_HERE

Use DVC to upload data from the cache to remote storage

In [None]:
#INSERT_CODE_HERE

Let's use the Datasets library to load and preprocess the train and test datasets so we can use them to train the model with a 1:10 split.

In [None]:
from datasets import load_dataset

train_dataset = #INSERT_CODE_HERE
test_dataset = #INSERT_CODE_HERE

Instead of `distilbert-base-uncased`, let's see if fine-tuning `roberta-base` gives better results!

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(#INSERT_CODE_HERE)

Now, let's define our base model. For this, we'll use the pretrained `roberta-base` model with our 2 labels.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(#INSERT_CODE_HERE)

Here, we prepare the text inputs for the model for both splits of our dataset (training and test) by using the map method

In [None]:
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

We create a repo named `alberta_base` with the following training parameters. This experiment will be pushed to MLflow.

In [None]:
repo_name = "alberta_base"
 
training_args = TrainingArguments(
   output_dir=#INSERT_CODE_HERE,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=#INSERT_CODE_HERE,
)
 
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=#INSERT_CODE_HERE,
   eval_dataset=#INSERT_CODE_HERE,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
   callbacks=#INSERT_CODE_HERE
)

It's time to fine-tune the model on the reddit comments dataset. Let's train it!

In [None]:
#INSERT_CODE_HERE

Congratulations! You just practiced tracking your datasets and experiments through DVC and MLflow. If you want to learn more, check out the following resources!

[Track Machine Learning Training Runs](https://docs.databricks.com/applications/mlflow/tracking.html)\
[How We Track Machine Learning Experiments with MLFlow](https://www.datarevenue.com/en-blog/how-we-track-machine-learning-experiments-with-mlflow)\
[Configure a DVC remote without a DevOps degree](https://dagshub.com/blog/configure-a-dvc-remote-without-a-devops-degree/)\
[How to Compare ML Experiment Tracking Tools to Fit Your Data Science Workflow](https://dagshub.com/blog/how-to-compare-ml-experiment-tracking-tools-to-fit-your-data-science-workflow/)