# Data Version Control and Experiment Tracking with DVC and DagsHub!

In this tutorial, we will take a fraud detection model we have built with transaction data and learn how to version our dataset and track our experiments. We will use two tools to do this, DVC and DagsHub.

### Why should you version your dataset and keep track of experiment results?

Why do we need to version our data? Simply put, we need to make sure that we can reproduce our experiments accurately and reliably. Without a version control system, it is cumbersome to guarantee reproducible experiments and to know which experiments have been successful. The problem is amplified in teams, where multiple people may be working on the same model and producing different results. Specifically, data versioning and experiment tracking aim to solve the following problems:
- Maintain our datasets and models in the same manner as our codebase.
- Ensure that our experiments are reproducible and reliable.
- Enable collaboration with other team members so that we can maintain a single source of truth for our data and experiments.

### When to version your data and use experiment tracking?

Ideally, since this system is relatively easy to set up and is useful even to a solo data scientist, it should be implemented in most cases. As a solo data scientist, you can simply version your data and experiment results as you go, without maintaining external spreadsheets of results or manually versioning the data you use. In team settings, systems like these become indispensable tools for seamless collaboration, letting you share results and maintain your repository with more traditional software engineering practices such as code review, DevOps, and CI/CD.

### Tutorial Prerequisites

Prerequisites:
- Install Docker
- Install Python 3.8+
- Install JupyterLab
- Install Git
- Create a GitHub account

By the end of this tutorial you will be able to:
- Set up DVC for version controlling datasets and models
- Link your GitHub repository to DagsHub
- Use DagsHub to track your experiments
  
You should download the data required for this tutorial from [here](https://drive.google.com/file/d/1MidRYkLdAV-i0qytvsflIcKitK4atiAd/view?usp=sharing). This is originally from a [Kaggle dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data) for Fraud Detection. Place this dataset in a `data` directory in the root of your project. You can run this notebook either in VS Code or Jupyter Notebooks.

We will need a number of libraries for this tutorial so boot up a terminal and install them before you proceed.

```bash
pip install -r requirements.txt
```

## Data Versioning with DVC

When we version our software projects, we typically use a version control system such as [git](https://git-scm.com). Using a system like this allows us to keep track of any changes we make to the codebase, easily collaborate with others by merging versions, and revert to earlier versions if something breaks. Git does have a limitation: it is not built to handle large files or binary files like those used in model building. DVC is a version control system built for datasets and models. You can think of it as git for large data and model files: it lets you version them with the same workflow you use for code.

Since you have forked this repo on GitHub, you already have git version control active. You can now use DVC to version your data and models. Initialise DVC and commit the changes DVC made to git.
    
```bash
dvc init
git commit -m "Initialise DVC"
```

Our git repo is now a DVC repo too! Let's add the `data` directory we created to DVC.

```bash
dvc add data
git add data.dvc .gitignore
git commit -m "Add data directory to DVC"
```
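To build some intuition for what `dvc add` just did: the large data stays out of git, and a small pointer file (here `data.dvc`) is committed in its place. The sketch below illustrates the general idea of a content-addressed pointer using a hypothetical `make_pointer` helper. It is a simplified illustration, not DVC's actual implementation or file format, and it uses SHA-256 purely for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Hypothetical helper: a tiny, git-friendly record describing a big file."""
    return {
        "path": path.name,
        "hash": file_hash(path),
        "size": path.stat().st_size,
    }


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "transactions.csv"
    data_file.write_bytes(b"id,amount\n1,9.99\n")
    pointer_v1 = make_pointer(data_file)

    # Changing the data changes the hash, so git sees a new pointer version.
    data_file.write_bytes(b"id,amount\n1,9.99\n2,120.00\n")
    pointer_v2 = make_pointer(data_file)

print(json.dumps(pointer_v1, indent=2))
print("pointer changed:", pointer_v1["hash"] != pointer_v2["hash"])
```

Because only the small pointer changes in git, the repository stays light while every version of the data remains identifiable by its hash.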

Now push your changes to GitHub!
```bash
git push -u origin master
```

That's how you can version your data and models with DVC! We now want a central remote repository that deals with DVC repos. Enter DagsHub.

## DagsHub Setup

While it is useful to be able to use DVC to version our datasets and models, we need a central remote store that takes advantage of the DVC versioning system in the same manner that GitHub takes advantage of git. This is what DagsHub allows us to do, acting not only as a source of truth for our codebase, but also as a central repository for our data, data pipeline, models and experimentation.

Navigate to [DagsHub](https://dagshub.com) and sign up for a free account. You can log in with your GitHub account.

You should be greeted with the following screen. Click the **Connect** button and select **GitHub**.

![DagsHub Entry](media/dags_entry.png)

You will be prompted to connect a GitHub repository. Search for your fork of the repo and connect it! You'll be greeted with a familiar-looking screen. This is your repo page. You should refer back to this page throughout the tutorial.

![DagsHub Repo](media/dags_repo.png)

If we want our data to be viewable in DagsHub, we need to add our dataset to DVC and set the DVC remote to our DagsHub repo.

```bash
dvc remote add origin --local https://dagshub.com/<username>/<repo_name>.dvc
```

Now we need to tell DVC how to authenticate.

```bash
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <username>
dvc remote modify origin --local ask_password true
```

This may fail. The alternative is to use an S3-style remote:

```bash
dvc remote add origin-s3 s3://dvc
dvc remote modify origin-s3 endpointurl https://dagshub.com/<user-name>/<repo-name>.s3
dvc remote modify origin-s3 --local access_key_id <Token>
dvc remote modify origin-s3 --local secret_access_key <Token>
```

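Before you push your DVC repo, navigate to your DagsHub settings and create a *password* if you do not have one set. Then push! (It may take a while.) The command below pushes to the `origin-s3` remote; if you configured the basic-auth `origin` remote instead, use `-r origin` here. Later commands in this tutorial use `-r origin`, so substitute whichever remote name you set up.
```bash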
dvc push -r origin-s3
```

You can now view your data in your DagsHub repo. Our file is a little too big to view in the dashboard, but you can view the raw file to verify it is working.

To show you how this updates, let's add some more files to the project. We will use DVC to run the `preprocess.py` script, which generates the train-test splits of the data used in training and testing.

We could run the file as is with vanilla Python, but we can also use DVC to specify a *pipeline stage*. This tells DVC what the inputs and outputs are for this particular command, which DVC uses to determine whether the stage needs to be run again in a future build. With the following command we create a stage called *preprocessing* and specify all of its inputs and outputs, including the processing script itself. We specify the name of the stage with `-n`, each input with `-d`, and each output with `-o`, and give the actual command to run as the last argument. Note that `dvc run` was removed in DVC 3.0; on newer versions, `dvc stage add` takes the same flags to define the stage, and `dvc repro` then runs it.
```bash
dvc run -n preprocessing -d data/train_transaction.csv \
-d preprocess.py \
-o feature_sets/X_train.csv \
-o feature_sets/X_test.csv \
-o feature_sets/y_train.csv \
-o feature_sets/y_test.csv \
python preprocess.py
```

This should generate 6 new files: the four feature sets (`X_train.csv`, `X_test.csv`, `y_train.csv`, and `y_test.csv`) in `feature_sets/`, as well as two new DVC files, `dvc.lock` and `dvc.yaml`. The DVC files describe the pipeline stages and record the versions of their inputs and outputs, which lets DVC determine which stages need to be rerun upon rebuild. Add them to git, commit, and push both the git and DVC repos.

```bash
git add dvc.lock dvc.yaml feature_sets/.gitignore
git commit -m "Add generated preprocessed dataframes"
git push -u origin master
dvc push -r origin
```
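The rerun decision described above can be sketched in a few lines. This is a conceptual illustration only, assuming a hypothetical `stage_is_stale` helper and an in-memory dictionary standing in for a lock file. DVC's real logic is driven by `dvc.yaml` and `dvc.lock` and covers more than this sketch does.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def content_hash(path: Path) -> str:
    """Return a hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(deps: List[Path]) -> Dict[str, str]:
    """Record the current hash of every dependency (a stand-in for a lock file)."""
    return {str(dep): content_hash(dep) for dep in deps}


def stage_is_stale(deps: List[Path], lock: Dict[str, str]) -> bool:
    """Hypothetical check: rerun when any dependency differs from the lock."""
    return snapshot(deps) != lock


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "train_transaction.csv"
    script = Path(tmp) / "preprocess.py"
    raw.write_bytes(b"id,label\n1,0\n")
    script.write_bytes(b"print('preprocess')\n")
    deps = [raw, script]

    lock = snapshot(deps)                    # state recorded after the first run
    unchanged = stage_is_stale(deps, lock)   # nothing has been edited yet

    script.write_bytes(b"print('preprocess v2')\n")
    after_edit = stage_is_stale(deps, lock)  # the script changed since the lock

print("stale before edit:", unchanged)
print("stale after edit:", after_edit)
```

The same comparison explains why editing only `train.py` later on does not force the preprocessing stage to run again: its recorded dependencies are unchanged.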

You can now have a look at your Data Pipeline in DagsHub! You can keep adding new stages to the pipeline, and DVC will automatically determine which stages need to be run while DagsHub will visualise it. We will create a new stage in the next section to see how this works.

![DagsHub Data Pipeline](media/dags_data_pipeline.png)


## Experiment Tracking

DagsHub doesn't only supply you with a remote DVC store, but also allows you to track your experiments along with your versioned code and data. We are going to modify the supplied file, `train.py`, to show how this is done. At the top of the file, add the following:

```python
...
from xgboost import XGBClassifier
import dagshub
...
```

Now, we are going to wrap our training code in a `dagshub_logger` context manager. The logger has 2 methods, `log_hyperparams` and `log_metrics`, for logging hyperparameters and metrics respectively (duh). Indent the `XGBClassifier` and `.fit()` lines by one level and wrap them in a `with dagshub.dagshub_logger()` block, as shown below. If you'd like, you can be more specific about which hyperparameters you want to log by passing them to the logger manually (we just use `.get_params()` here).
```python
...
with dagshub.dagshub_logger() as logger:
    xgb = XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="binary:logistic",
        nthread=4,
        scale_pos_weight=1,
        seed=27,
    )

    # Log your hyperparameters with DagsHub
    logger.log_hyperparams(model_class=type(xgb).__name__)
    logger.log_hyperparams({'model': xgb.get_params()})

    model = xgb.fit(X_train, y_train)
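    y_pred = xgb.predict(X_test)
    # ROC AUC should be computed from the predicted probability of the
    # positive class rather than from hard class labels.
    y_proba = xgb.predict_proba(X_test)[:, 1]

    accuracy = round(accuracy_score(y_test, y_pred), 3)
    roc_auc = round(roc_auc_score(y_test, y_proba), 3)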
    
    # Log your metrics with DagsHub
    logger.log_metrics(
        {'accuracy': accuracy}
    )
    logger.log_metrics(
        {'roc_auc': roc_auc}
    )
...
```
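If you prefer to log only a handful of hyperparameters rather than everything returned by `.get_params()`, one option is to filter the dictionary first. The sketch below uses a hypothetical `select_hyperparams` helper and a plain dictionary standing in for `xgb.get_params()`, so it runs without XGBoost or DagsHub installed.

```python
from typing import Any, Dict, Iterable


def select_hyperparams(params: Dict[str, Any], keep: Iterable[str]) -> Dict[str, Any]:
    """Hypothetical helper: keep only the named hyperparameters.

    Raises KeyError on an unknown name so typos are caught before logging.
    """
    names = list(keep)
    missing = [name for name in names if name not in params]
    if missing:
        raise KeyError(f"Unknown hyperparameters: {missing}")
    return {name: params[name] for name in names}


# Stand-in for xgb.get_params(); the values mirror the training code above.
all_params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 27,
}

tracked = select_hyperparams(all_params, ["n_estimators", "max_depth", "learning_rate"])
print(tracked)

# Inside the `with dagshub.dagshub_logger() as logger:` block you could then call:
# logger.log_hyperparams(tracked)
```

Logging a short, stable set of names keeps the experiment table readable when you compare runs later.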

Save the changes and commit them to git. Time to run an experiment! Run your modified training file in a new DVC pipeline stage. You may notice a new flag, `-M`. This tells DVC that the file is a *metrics* output that should be tracked by git rather than stored in the DVC cache. There is another option, `-p`, which allows us to specify a hyperparameter file as a pipeline dependency. In this case, we track `params.yml` as a metrics output instead, because we specify the hyperparameters in code and the file is written by the DagsHub logger.
```bash
dvc run -n training -d feature_sets/X_train.csv \
-d feature_sets/X_test.csv \
-d feature_sets/y_train.csv \
-d feature_sets/y_test.csv \
-d train.py \
-o models/xgb-fraud-classifier.joblib \
-M metrics.csv \
-M params.yml \
python train.py
```

This creates 3 new files: our 2 DagsHub files, `metrics.csv` and `params.yml`, and our model file, `models/xgb-fraud-classifier.joblib`. Because the model is a stage output, DVC already tracks it and records it in `dvc.yaml` and `dvc.lock`, so it needs no separate `dvc add`. Add the DagsHub files and the updated DVC files to the git repo, and commit. It is also a good idea to tag your commit with some sort of model version.
```bash
git add metrics.csv params.yml dvc.yaml dvc.lock models/.gitignore
git commit -m "Train XGBoost model with OHE features, v0.1"
git tag -a "v0.1" -m "xgb model v0.1"
# --follow-tags also pushes the annotated tag created above
git push -u origin master --follow-tags
dvc push -r origin
```

You will see that now you can view your experiments in DagsHub under the **Experiments** tab.

![DagsHub Experiments](media/dags_experiments.png)

You will also see your updated data pipeline.

![Updated DagsHub Data Pipeline](media/dags_data_pipeline_new.png)

## Running a new experiment

Say we are testing a new hypothesis and want to run a new experiment with the same model, but with a different set of hyperparameters. After some analysis, we change the *n_estimators* and *max_depth* hyperparameters in `train.py`, finalising them to the following. Save the file modification.

```python
xgb = XGBClassifier(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    nthread=4,
    scale_pos_weight=1,
    seed=27,
)
```

We can now run the experiment again. Since we have already specified how each pipeline stage should be run, we simply tell DVC to *reproduce* the entire pipeline. Only the stages whose dependencies have changed are rerun.
```bash
dvc repro
```

Great! Push your changes and view your new experiment in DagsHub.

![DagsHub New Experiment](media/dags_experiments_new.png)
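If you also want a quick local comparison between two runs, a small script can diff their metrics files. The sketch below assumes that `metrics.csv` has `Name`, `Value`, `Timestamp`, and `Step` columns, so check your own file before relying on it. The helper names are hypothetical, and the numbers are made-up placeholders rather than results from this tutorial.

```python
import csv
import io
from typing import Dict

# Assumed layout of metrics.csv; the values are placeholders for illustration.
RUN_V01 = """Name,Value,Timestamp,Step
accuracy,0.90,1700000000000,1
roc_auc,0.80,1700000000000,1
"""

RUN_NEW = """Name,Value,Timestamp,Step
accuracy,0.91,1700000600000,1
roc_auc,0.78,1700000600000,1
"""


def read_metrics(text: str) -> Dict[str, float]:
    """Parse metric rows into {name: value}, keeping the last value per name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Name"]: float(row["Value"]) for row in reader}


def metric_deltas(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Difference (new - old) for every metric present in both runs."""
    return {name: round(new[name] - old[name], 6) for name in old if name in new}


deltas = metric_deltas(read_metrics(RUN_V01), read_metrics(RUN_NEW))
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

In practice the two inputs would come from the `metrics.csv` at two different commits, for example the tagged `v0.1` commit and your latest one.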

You are now armed with the tools to successfully run experiments and track your results in DagsHub. You can visit each commit of an experiment separately to see the exact code and data that was used to train the model in that particular run. Reproducing that result is simply a matter of checking out the commit and running the pipeline!