# Version datasets in model training runs

## Introduction

You can version datasets, models, and other file objects as Artifacts in Neptune.

This guide shows how to:
* Keep track of a dataset version in your model training runs with artifacts  
* Query the dataset version from previous runs to make sure you are training on the same dataset version
* Group your Neptune Runs by the dataset version they were trained on

By the end of this guide, you will have trained a few models while verifying that they all used the same dataset version, and you will see the Runs for that dataset version in the Neptune UI.


[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/experiments?compare=IwdgNMQ&split=tbl&dash=artifacts&viewId=6777136b-938e-4639-943d-3f6bc52f8497)

![image](https://neptune.ai/wp-content/uploads/artifacts-grouped-by-dataset-version.png)

## Setup

Install dependencies

In [None]:
! pip install "neptune-client>=0.10.10" scikit-learn==0.24.1

## Step 1: Prepare a model training script

Create a training script where you:
* Specify dataset paths for training and testing
* Define model parameters
* Calculate the score on the test set

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier 

TRAIN_DATASET_PATH = '../datasets/tables/train.csv'
TEST_DATASET_PATH = '../datasets/tables/test.csv'

def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
    TARGET_COLUMN = 'variety'  # a single column name gives the 1-D target scikit-learn expects
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score

## Step 2: Initialize Neptune and create a new run

Connect your script to the Neptune application and create a new run.

In [None]:
import neptune.new as neptune

run = neptune.init(project='common/data-versioning',
                   api_token='ANONYMOUS')

Click the link printed in the cell output above to open this run in Neptune.

For now it is empty, but keep the tab with the Run open to see what happens next.

**A few explanations**

In the code above, you tell Neptune:

* **who you are**: your Neptune API token, `api_token`
* **where you want to send your data**: your Neptune `project`

At this point you have a new Run in Neptune. From now on, you will use `run` to log metadata to it.

---

**Note**


Instead of logging data to the public project 'common/data-versioning' as the anonymous user 'neptuner', you can log it to your own project.

To do that:

1. Get your [Neptune API token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)
2. Pass the token to the ``api_token`` argument of ``neptune.init()``: ``api_token=YOUR_API_TOKEN``
3. Get your [Neptune project name](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)
4. Pass your project name to the ``project`` argument of ``neptune.init()``.

For example:

```python
neptune.init(project='my_workspace/my_project', 
             api_token='MY_API_TOKEN')
```
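
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below assumes the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` variables that the Neptune client recognizes, and falls back to the anonymous public project used in this guide. The helper name `neptune_credentials` is hypothetical.

```python
import os


def neptune_credentials(default_project="common/data-versioning",
                        default_token="ANONYMOUS"):
    """Build keyword arguments for neptune.init() from environment variables.

    Hypothetical helper: falls back to the public project and anonymous
    token used in this guide when the variables are not set.
    """
    return {
        "project": os.environ.get("NEPTUNE_PROJECT", default_project),
        "api_token": os.environ.get("NEPTUNE_API_TOKEN", default_token),
    }
```

You could then create the run with `run = neptune.init(**neptune_credentials())`.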

## Step 2 (continued): Add tracking of the dataset version

Save the dataset versions as Neptune artifacts:

In [None]:
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

**Note:**

You can also version the entire folder where your datasets are stored by running:

```python
run["datasets"].track_files(DATASET_FOLDER)
```
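
To build intuition for what a "dataset version" is, the sketch below fingerprints a file, or a whole folder, with SHA-256. This is only an illustration of content hashing. It is not Neptune's algorithm, which may also take file metadata into account, and the function names `file_sha256` and `folder_sha256` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_sha256(folder):
    """Combine the relative path and content hash of every file under a folder."""
    root = Path(folder)
    digest = hashlib.sha256()
    # Sort the paths so the result does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(file_sha256(path).encode("ascii"))
    return digest.hexdigest()
```

Two files with identical bytes get the same digest, and changing a single byte produces a different one. That is the property that makes a hash usable as a version identifier.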

## Step 3: Run model training and log parameters and metrics to Neptune

Log parameters to Neptune

In [None]:
PARAMS = {'n_estimators': 5,
          'max_depth':2,
          'max_features':1,
         }
run["parameters"] = PARAMS

Log test score to Neptune

In [None]:
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

Get the Run ID of your model training run from Neptune.

You will need it later to check that a new run uses the same dataset versions as this baseline run.

In [None]:
baseline_run_id = run['sys/id'].fetch()
print(baseline_run_id)
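
If the baseline and the follow-up training live in separate scripts or sessions, the `baseline_run_id` variable will not survive between them. A minimal sketch, assuming you are happy to keep the ID in a small local JSON file (the file name `baseline_run.json` and both helper names are hypothetical):

```python
import json
from pathlib import Path


def save_baseline_run_id(run_id, path="baseline_run.json"):
    """Write the baseline Run ID to a small JSON file."""
    payload = {"baseline_run_id": run_id}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_baseline_run_id(path="baseline_run.json"):
    """Read the baseline Run ID back from the JSON file."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return payload["baseline_run_id"]
```

Calling `save_baseline_run_id(baseline_run_id)` here and `load_baseline_run_id()` in the later session gives you the same ID to re-open the baseline run.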

Stop logging to the current Neptune Run. 

In [None]:
run.stop()

## Step 4: Add a version check for the training and testing datasets

You can fetch the dataset version hash from the baseline run and compare it with the current version of the dataset.

Create a new Neptune Run and track the dataset version:

In [None]:
new_run = neptune.init(project='common/data-versioning',
                       api_token='ANONYMOUS')

new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

Get the Neptune Run object for the baseline model:

In [None]:
baseline_run = neptune.init(project='common/data-versioning',
                            api_token="ANONYMOUS", 
                            run=baseline_run_id,
                            mode="read-only")

Fetch the dataset version with the `.fetch_hash()` method

In [None]:
baseline_run["datasets/train"].fetch_hash()

Compare the current dataset version with the baseline dataset version

In [None]:
new_run.wait() # force asynchronous logging operations to finish

assert baseline_run["datasets/train"].fetch_hash() == new_run["datasets/train"].fetch_hash()
assert baseline_run["datasets/test"].fetch_hash() == new_run["datasets/test"].fetch_hash()
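
Bare `assert` statements are skipped when Python runs with the `-O` flag, and they do not say which dataset changed. As an alternative, the sketch below wraps the comparison in a hypothetical helper, `check_dataset_versions`, that raises a descriptive error. It relies only on the `run[field].fetch_hash()` call used above.

```python
def check_dataset_versions(baseline_run, new_run,
                           fields=("datasets/train", "datasets/test")):
    """Raise ValueError listing every field whose artifact hash differs.

    Hypothetical helper: both arguments only need to support
    run[field].fetch_hash(), as Neptune Run objects do in this guide.
    """
    mismatches = []
    for field in fields:
        baseline_hash = baseline_run[field].fetch_hash()
        new_hash = new_run[field].fetch_hash()
        if baseline_hash != new_hash:
            mismatches.append(
                f"{field}: baseline={baseline_hash} new={new_hash}"
            )
    if mismatches:
        raise ValueError("Dataset version mismatch:\n" + "\n".join(mismatches))
```

After `new_run.wait()`, calling `check_dataset_versions(baseline_run, new_run)` either returns silently or stops the script with a message naming the fields that changed.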

## Step 5: Run model training with new parameters

Change the parameters and run model training

In [None]:
PARAMS = {'n_estimators': 8,
          'max_depth':3,
          'max_features':2,
         }
new_run["parameters"] = PARAMS

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

new_run["metrics/test_score"] = score

Stop logging to the currently active Neptune Runs:

In [None]:
new_run.stop()
baseline_run.stop()

## Step 6: See all model training runs for this dataset version

To see all training runs for a particular dataset version:
* Go to the `Runs table` in the Neptune UI
* Click on **+Add column**, type in 'datasets/train' and click on it to add it to the `Runs table`
* Add parameters and test score in the same way
* Compare the test scores: because the dataset version didn't change, any difference between the runs comes from the new parameters, not from the data.

You can also [use **+Group by**](https://docs.neptune.ai/how-to-guides/neptune-ui/groupby) to group by train dataset versions and find the training runs you care about quickly.
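
The same grouping idea can be sketched in plain Python. The example assumes you have already collected one record per run yourself, for instance from `run['sys/id'].fetch()` and `run["datasets/train"].fetch_hash()`. The record layout, its keys, and the function name `group_runs_by_hash` are hypothetical.

```python
from collections import defaultdict


def group_runs_by_hash(runs, key="train_hash"):
    """Map each dataset hash to the list of run IDs that used it.

    `runs` is an iterable of dicts with an "id" entry and a hash entry
    stored under `key` (a hypothetical record layout).
    """
    groups = defaultdict(list)
    for record in runs:
        groups[record[key]].append(record["id"])
    return dict(groups)
```

Runs that share a hash end up in the same list, which mirrors what grouping by the dataset version column does in the Runs table.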

[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/experiments?compare=IwdgNMQ&split=tbl&dash=artifacts&viewId=6777136b-938e-4639-943d-3f6bc52f8497)

![image](https://neptune.ai/wp-content/uploads/artifacts-grouped-by-dataset-version.png)