# Compare model training runs on dataset versions

## Introduction

You can version datasets, models, and other file objects as Artifacts in Neptune.

This guide shows how to:
* Keep track of the dataset version with Neptune artifacts
* See if models were trained on the same dataset version
* Compare datasets in the Neptune UI to see what changed

By the end of this guide, you will train a few models on different dataset versions and compare those versions in the Neptune UI.

[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/experiments?compare=IwdgNMQ&split=tbl&dash=artifacts&viewId=2b313653-1aa2-40e8-8bf2-cd13f0f96862&base=DAT-18&to=DAT-17)

![image](https://neptune.ai/wp-content/uploads/artifacts-compare-runs-on-dataset.png)

## Setup

Install dependencies

In [None]:
! pip install "neptune-client>=0.10.10" scikit-learn==0.24.1 pandas

## Step 1: Prepare a model training script

As an example, this guide uses a script that trains a scikit-learn model on the iris dataset.

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = '../datasets/tables/train.csv'
TEST_DATASET_PATH = '../datasets/tables/test.csv'

PARAMS = {
    'n_estimators': 5,
    'max_depth': 1,
    'max_features': 2,
}

FEATURE_COLUMNS = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
# A single column name (not a list), so that y is a 1-D Series as fit() expects.
TARGET_COLUMN = 'variety'


def train_model(params, train_path, test_path):
    """Train a random forest on the training CSV and return accuracy on the test CSV."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    # For classifiers, score() returns the mean accuracy.
    score = rf.score(X_test, y_test)
    return score
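
The script expects `train.csv` and `test.csv` files that contain the four feature columns and a `variety` column. If you do not have these files, the following sketch builds them from the iris data bundled with scikit-learn. The helper name `write_iris_csvs`, the 80/20 stratified split, and the output directory are assumptions made for illustration; they are not part of the original example.

```python
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Column names copied from the training script above.
FEATURE_COLUMNS = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
TARGET_COLUMN = 'variety'


def write_iris_csvs(output_dir, test_size=0.2, seed=0):
    """Hypothetical helper: write train.csv and test.csv built from the iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURE_COLUMNS)
    # Map the integer class labels to their species names.
    df[TARGET_COLUMN] = iris.target_names[iris.target]

    # Stratify so that both files keep the same class balance.
    train, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[TARGET_COLUMN]
    )

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    train_path = output_dir / 'train.csv'
    test_path = output_dir / 'test.csv'
    train.to_csv(train_path, index=False)
    test.to_csv(test_path, index=False)
    return train_path, test_path
```

Calling `write_iris_csvs('../datasets/tables')` would place the files where `TRAIN_DATASET_PATH` and `TEST_DATASET_PATH` point.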

## Initialize Neptune and create a new run

Connect your script to the Neptune application and create a new run.

In [None]:
import neptune.new as neptune

run = neptune.init(project='common/data-versioning',
                   api_token='ANONYMOUS')

Click on the link above to open this run in Neptune.

For now it is empty, but keep the tab with the run open to see what happens next.

**A few explanations**

In the code above, you tell Neptune:

* **who you are**: your Neptune API token, `api_token`
* **where you want to send your data**: your Neptune `project`

At this point you have a new run in Neptune. From now on, you will use `run` to log metadata to it.

---

**Note**


Instead of logging data to the public project 'common/data-versioning' as the anonymous user 'neptuner', you can log it to your own project.

To do that:

1. Get your [Neptune API token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)
2. Pass the token to the ``api_token`` argument of ``neptune.init()``: ``api_token=YOUR_API_TOKEN``
3. Get your [Neptune project name](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)
4. Pass your project to the ``project`` argument of ``neptune.init()``.

For example:

```python
neptune.init(project='my_workspace/my_project', 
             api_token='MY_API_TOKEN')
```
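
To avoid hard-coding the token in a notebook, you can keep it in an environment variable. Neptune itself reads `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` when the arguments are omitted; the sketch below only makes that lookup explicit and falls back to the anonymous public project used in this guide. The helper name `neptune_credentials` is hypothetical.

```python
import os


def neptune_credentials(env=None):
    """Hypothetical helper: read project and token from the environment, else use the public defaults."""
    env = os.environ if env is None else env
    return {
        'project': env.get('NEPTUNE_PROJECT', 'common/data-versioning'),
        'api_token': env.get('NEPTUNE_API_TOKEN', 'ANONYMOUS'),
    }


# Usage (requires neptune-client):
# run = neptune.init(**neptune_credentials())
```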

## Step 2: Add tracking of the dataset version

Save the dataset versions as Neptune artifacts.

In [None]:
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

**Note:**

You can also version the entire folder where your datasets are stored by running:

```python
run["datasets"].track_files(DATASET_FOLDER)
```
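
Tracking a file as an artifact records metadata that identifies its contents, so two runs can be compared by dataset version and not only by file name. As a rough local illustration of that idea (this is not Neptune's actual implementation), the sketch below fingerprints a file with SHA-256: two files get the same fingerprint only if their bytes are identical. The function name `file_fingerprint` is hypothetical.

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `file_fingerprint(TRAIN_DATASET_PATH)` before and after editing the file is a quick local check of whether the dataset version changed.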

## Step 3: Run model training and log parameters and metrics to Neptune

Now train a model and log the parameters and the test score to Neptune.

In [None]:
run["parameters"] = PARAMS

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

run["metrics/test_score"] = score

Stop logging to the current Neptune Run. 

In [None]:
run.stop()

## Step 4: Change training dataset

Now change the training dataset to a new version of the file.

In [None]:
TRAIN_DATASET_PATH = '../datasets/tables/train_v2.csv'
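
Before training, it can help to check locally how the new file differs from the previous one. Here is a minimal sketch with pandas, assuming both versions are CSV files with the same columns. The helper name `summarize_dataset_change` is hypothetical, and it reports only row counts and the number of distinct rows found in one version but not the other.

```python
import pandas as pd


def summarize_dataset_change(old_path, new_path):
    """Hypothetical helper: count distinct rows that appear in only one of two CSV versions."""
    old = pd.read_csv(old_path)
    new = pd.read_csv(new_path)
    # An outer merge on all shared columns marks which file each distinct row comes from.
    merged = old.drop_duplicates().merge(
        new.drop_duplicates(), how='outer', indicator=True
    )
    counts = merged['_merge'].value_counts()
    return {
        'old_rows': len(old),
        'new_rows': len(new),
        'only_in_old': int(counts.get('left_only', 0)),
        'only_in_new': int(counts.get('right_only', 0)),
    }
```

For example, `summarize_dataset_change('../datasets/tables/train.csv', '../datasets/tables/train_v2.csv')` would summarize the change between the two training files used in this guide.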

## Step 5: Run model training on a new training dataset

Let's run model training again.
* Initialize the Neptune Run

In [None]:
new_run = neptune.init(project='common/data-versioning',
                       api_token='ANONYMOUS')

* Log dataset versions

In [None]:
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

* Execute model training

In [None]:
new_run["parameters"] = PARAMS

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

new_run["metrics/test_score"] = score

Stop logging to the currently active Neptune run.

In [None]:
new_run.stop()
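
Both runs above repeat the same sequence: track the datasets, log the parameters, train, log the score, and stop the run. That sequence can be wrapped in one function. The sketch below assumes `run` is an object returned by `neptune.init()` and `train_fn` has the same signature as the `train_model` function defined earlier; the name `log_training_run` is hypothetical.

```python
def log_training_run(run, train_fn, params, train_path, test_path):
    """Hypothetical helper: track datasets, log parameters and score, then stop the run."""
    try:
        # Record which dataset versions this run used.
        run["datasets/train"].track_files(train_path)
        run["datasets/test"].track_files(test_path)

        run["parameters"] = params
        score = train_fn(params, train_path, test_path)
        run["metrics/test_score"] = score
        return score
    finally:
        # Stop the run even if training raises an exception.
        run.stop()


# Usage (requires neptune-client and the train_model function defined earlier):
# for path in ['../datasets/tables/train.csv', '../datasets/tables/train_v2.csv']:
#     run = neptune.init(project='common/data-versioning', api_token='ANONYMOUS')
#     log_training_run(run, train_model, PARAMS, path, TEST_DATASET_PATH)
```

Passing the run in as an argument keeps the function independent of how the run was created, so the same code works with the anonymous project or your own.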

## Step 6: Compare model training runs in the Neptune UI

To see that the score changed because of a different dataset version:
* Go to the `Runs table` in the Neptune UI
* Click **+Add column**, type in 'datasets/train', and click the field to add it to the `Runs table`
* Click the **Eye** icon and go to [Compare runs > Artifacts](https://docs.neptune.ai/you-should-know/comparing-runs#artifact) to see how the datasets changed

[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/experiments?compare=IwdgNMQ&split=tbl&dash=artifacts&viewId=2b313653-1aa2-40e8-8bf2-cd13f0f96862&base=DAT-18&to=DAT-17)

![image](https://neptune.ai/wp-content/uploads/artifacts-compare-runs-on-dataset.png)
