# Compare model training runs on dataset versions

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/examples/blob/main/how-to-guides/data-versioning/notebooks/Compare_model_training_runs_on_dataset_versions.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a><a target="_blank" href="https://github.com/neptune-ai/examples/blob/main/how-to-guides/data-versioning/notebooks/Compare_model_training_runs_on_dataset_versions.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a><a target="_blank" href="https://app.neptune.ai/o/common/org/data-versioning/runs/compare?viewId=2b313653-1aa2-40e8-8bf2-cd13f0f96862&dash=artifacts&compare=IwdgNMQ&base=DAT-18&to=DAT-17"> 
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a><a target="_blank" href="https://docs.neptune.ai/tutorials/comparing_artifacts/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>

## Introduction

You can version datasets, models, and other file objects as Artifacts in Neptune.

This guide shows how to:
* Keep track of the dataset version with Neptune artifacts
* See if models were trained on the same dataset version
* Compare datasets in the Neptune app to see what changed

By the end of this guide, you will have trained a model on two dataset versions and compared those versions in the Neptune app.

![image](https://neptune.ai/wp-content/uploads/artifacts-compare-runs-on-dataset.png)

## Before you start

This notebook example lets you try out Neptune as an anonymous user, with zero setup.

If you want to see the example logged to your own workspace instead:

  1. Create a Neptune account. [Register &rarr;](https://neptune.ai/register)
  1. Create a Neptune project that you will use for tracking metadata. For instructions, see [Creating a project](https://docs.neptune.ai/setup/creating_project) in the Neptune docs.

## Install Neptune and dependencies

In [None]:
! pip install -U neptune scikit-learn

## Download data

In [None]:
! curl https://raw.githubusercontent.com/neptune-ai/examples/main/how-to-guides/data-versioning/datasets/tables/train.csv --create-dirs -o ../datasets/tables/train.csv
! curl https://raw.githubusercontent.com/neptune-ai/examples/main/how-to-guides/data-versioning/datasets/tables/test.csv --create-dirs -o ../datasets/tables/test.csv
! curl https://raw.githubusercontent.com/neptune-ai/examples/main/how-to-guides/data-versioning/datasets/tables/train_v2.csv --create-dirs -o ../datasets/tables/train_v2.csv
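
If `curl` is not available in your environment, the same three files can be fetched with the Python standard library. This is a minimal sketch: the names `BASE_URL`, `FILES`, `download_plan`, and `download_datasets` are introduced here for illustration and are not part of the original example.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same source folder and file names as the curl commands above.
BASE_URL = (
    "https://raw.githubusercontent.com/neptune-ai/examples/main/"
    "how-to-guides/data-versioning/datasets/tables"
)
FILES = ["train.csv", "test.csv", "train_v2.csv"]


def download_plan(dest_dir="../datasets/tables"):
    """Return (url, destination path) pairs without touching the network."""
    dest = Path(dest_dir)
    return [(f"{BASE_URL}/{name}", dest / name) for name in FILES]


def download_datasets(dest_dir="../datasets/tables"):
    """Download every file in the plan, creating the target folder if needed."""
    for url, path in download_plan(dest_dir):
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, path)  # requires network access
```

Calling `download_datasets()` should leave the files in the same location that the rest of this guide expects.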

## Prepare a model training script

As an example, we'll use a script that trains a scikit-learn model on the Iris dataset.

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "../datasets/tables/train.csv"
TEST_DATASET_PATH = "../datasets/tables/test.csv"

params = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
    TARGET_COLUMN = "variety"  # a single column name yields a 1-D target, which scikit-learn expects
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    return rf.score(X_test, y_test)

## Initialize Neptune and create new run

To create a new run for tracking the metadata, you tell Neptune who you are (`api_token`) and where to send the data (`project`).

You can use the default code cell below to create an anonymous run in a public project. **Note**: Public projects are cleaned regularly, so anonymous runs are only stored temporarily.

### Log to your own project instead

Replace the code below with the following:

```python
import neptune
from getpass import getpass

run = neptune.init_run(
    project="workspace-name/project-name",  # replace with your own (see instructions below)
    api_token=getpass("Enter your Neptune API token: "),
)
```

To find your API token and full project name:

1. [Log in to Neptune](https://app.neptune.ai/).
1. In the bottom-left corner, expand your user menu and select **Get your API token**.
1. The workspace name is displayed in the top-left corner of the app. 

    To copy the project path, in the top-right corner, open the settings menu and select **Properties**.

For more help, see [Setting Neptune credentials](https://docs.neptune.ai/setup/setting_credentials) in the Neptune docs.
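
As an alternative to typing the token each time, Neptune can read credentials from the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` environment variables when `init_run()` is called without those arguments. The sketch below only checks whether the variables are set. The helper name `missing_neptune_settings` is hypothetical and introduced here for illustration.

```python
import os

# Environment variables that Neptune reads when no arguments are passed.
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_settings(env=None):
    """Return the names of the Neptune variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_neptune_settings()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("Credentials found; neptune.init_run() can be called without arguments.")
```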

In [None]:
import neptune

run = neptune.init_run(project="common/data-versioning", api_token=neptune.ANONYMOUS_API_TOKEN)

**To open the run in the Neptune web app, click the link that appeared in the cell output.**

We'll use the `run` object we just created to log metadata. You'll see the metadata appear in the app.

## Add tracking of the dataset version and parameters

Save dataset versions as Neptune artifacts.

In [None]:
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

**Note:**

You can also version the entire folder where your datasets are stored by running:

```python
run["datasets"].track_files(DATASET_FOLDER)
```

Also, people often keep track of datasets at the project level with [Project metadata](https://docs.neptune.ai/api-reference/project).

For more information, see [Organize and share dataset versions](https://docs.neptune.ai/how-to-guides/data-versioning/organize-and-share-dataset-versions).
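
To build intuition for what a "dataset version" means, one common approach is to identify a file by a fingerprint of its contents, so that any change to the data produces a different identifier. The sketch below uses SHA-256 from the standard library. It is an illustration only and is not necessarily how Neptune computes artifact hashes; the function name `file_fingerprint` is introduced here.

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical contents get the same fingerprint regardless of their names, while a single changed row yields a different one.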

## Run model training and log parameters and metrics to Neptune

Now train a model and log the parameters and the test score to Neptune.

In [None]:
run["parameters"] = params

score = train_model(params, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

run["metrics/test_score"] = score

## Stop logging to the current run 
<font color=red>**Warning:**</font><br>
When you are done logging, stop the run with the `stop()` method.
This is only necessary in a notebook environment. When you log from a script, Neptune stops tracking automatically once the script finishes executing.

In [None]:
run.stop()
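
If a cell raises an error before `stop()` is reached, the run keeps tracking. One way to guard against that is a small context manager that always calls `stop()` on exit. This is a sketch: `stopping` is a hypothetical helper introduced here, and it only assumes that the object passed in has a `stop()` method, as the `run` object above does.

```python
from contextlib import contextmanager


@contextmanager
def stopping(run):
    """Yield `run` and call run.stop() on exit, even if the body raises."""
    try:
        yield run
    finally:
        run.stop()


# Usage sketch (not executed here):
# with stopping(neptune.init_run(...)) as run:
#     run["metrics/test_score"] = score
```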

## Change training dataset

Let's now switch to a different version of the training dataset.

In [None]:
TRAIN_DATASET_PATH = "../datasets/tables/train_v2.csv"
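
Before retraining, it can help to check locally how the two training files differ. The sketch below counts the rows that were added or removed between two CSV files, ignoring row order. The function name `csv_row_diff` and the keys of the returned dictionary are introduced here for illustration.

```python
import csv
from collections import Counter


def csv_row_diff(old_path, new_path):
    """Count rows removed from and added to a CSV file, ignoring row order."""

    def load(path):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)  # first line is treated as the header
            return header, Counter(tuple(row) for row in reader)

    old_header, old_rows = load(old_path)
    new_header, new_rows = load(new_path)
    return {
        "same_header": old_header == new_header,
        # Counter subtraction keeps only positive counts.
        "removed": sum((old_rows - new_rows).values()),
        "added": sum((new_rows - old_rows).values()),
    }
```

For example, `csv_row_diff("../datasets/tables/train.csv", "../datasets/tables/train_v2.csv")` summarizes the change between the two versions used in this guide.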

## Run model training on a new training dataset

Let's run model training again.
* Initialize the Neptune run

In [None]:
new_run = neptune.init_run(project="common/data-versioning", api_token=neptune.ANONYMOUS_API_TOKEN)

* Log dataset versions

In [None]:
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

* Execute model training

In [None]:
new_run["parameters"] = params

score = train_model(params, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

new_run["metrics/test_score"] = score

* Stop logging to the currently active Neptune run

In [None]:
new_run.stop()

## Compare model training runs in the Neptune app

To check whether the score changed because of the different dataset version:
* Go to the `Runs table` in the Neptune app
* Click **+Add column**, type in `datasets/train`, and click the field to add it to the `Runs table`
* Click the **Eye** icon and go to [Compare runs > Artifacts](https://docs.neptune.ai/you-should-know/comparing-runs#artifact) to see how the datasets changed
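
The same check can also be sketched in code. The helper below assumes that each run has an artifact field whose `fetch_hash()` method returns the hash of the tracked files, and that both runs were reopened beforehand, for example in read-only mode. The function name `same_dataset_version` is hypothetical, and the run IDs in the usage comment are taken from the comparison link in this guide.

```python
def same_dataset_version(run_a, run_b, field="datasets/train"):
    """Return True if both runs logged an artifact with the same hash at `field`."""
    return run_a[field].fetch_hash() == run_b[field].fetch_hash()


# Usage sketch (not executed here; requires access to the project):
# base = neptune.init_run(project="common/data-versioning", with_id="DAT-17",
#                         api_token=neptune.ANONYMOUS_API_TOKEN, mode="read-only")
# other = neptune.init_run(project="common/data-versioning", with_id="DAT-18",
#                          api_token=neptune.ANONYMOUS_API_TOKEN, mode="read-only")
# print(same_dataset_version(base, other))
```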

<a target="_blank" href="https://app.neptune.ai/o/common/org/data-versioning/runs/compare?viewId=2b313653-1aa2-40e8-8bf2-cd13f0f96862&dash=artifacts&compare=IwdgNMQ&base=DAT-18&to=DAT-17"> 
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

![image](https://neptune.ai/wp-content/uploads/artifacts-compare-runs-on-dataset.png)
