# Organize and share dataset versions

## Introduction

You can log and query metadata at a project level, including dataset and model versions, text notes, images, notebook files, and anything else you can log to a single Run.

This guide shows how to:
* Log versions of all the datasets used in a project
* Organize dataset version metadata in the Neptune UI
* Share all the currently used dataset versions with your team
* Assert that you are training on the latest dataset version available

By the end of this guide, you will have logged several dataset versions, organized them in the Neptune UI, and seen how to share them with a persistent link.

[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/metadata?path=datasets%2Ftrain_sampled&attribute=latest)

![image](https://user-images.githubusercontent.com/41324509/156204557-0878b92e-4ef3-4978-869d-e841d47d9f18.png)

## Before you start

Make sure that you have:
* [Python 3.7+ installed](https://www.python.org/downloads/),
* a [Neptune account](https://neptune.ai/register),
* [created a project](https://app.gitbook.com/@neptune-ai/s/docs-re-positioning/administration/workspace-project-and-user-management/projects#create-project) from the Neptune UI that you will use for tracking metadata.

**Tip**

Registering with Neptune and creating a project are optional if you only want to try out the application as the "ANONYMOUS" user.

In [None]:
! pip install neptune-client scikit-learn==0.24.1

## Initialize the Neptune project 

In [None]:
import neptune.new as neptune

project = neptune.init_project(name="common/data-versioning", api_token="ANONYMOUS")

**Note**


Instead of logging data to the public project 'common/data-versioning' as the anonymous user 'neptuner', you can log it to your own project.

To do that:

1. Get your [Neptune API token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)
2. Pass the token to the ``api_token`` argument of the ``neptune.init_project()`` method: ``api_token=YOUR_API_TOKEN``
3. Get your [Neptune project name](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)
4. Pass your project name to the ``name`` argument of the ``neptune.init_project()`` method.

For example:

```python
neptune.init_project(name="YOUR_WORKSPACE/YOUR_PROJECT", api_token="YOUR_API_TOKEN")
```
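
If you prefer not to hard-code credentials in a notebook, one option is to read them from environment variables. The sketch below assumes the `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` variables are set in your shell and falls back to the public demo values from this guide. The helper name `resolve_neptune_credentials` is hypothetical and not part of the Neptune client.

```python
import os

# Public demo values taken from this guide.
DEFAULT_PROJECT = "common/data-versioning"
DEFAULT_API_TOKEN = "ANONYMOUS"


def resolve_neptune_credentials(env=None):
    """Return (project_name, api_token), preferring environment variables.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    name = env.get("NEPTUNE_PROJECT") or DEFAULT_PROJECT
    api_token = env.get("NEPTUNE_API_TOKEN") or DEFAULT_API_TOKEN
    return name, api_token


name, api_token = resolve_neptune_credentials()
# The values can then be passed on, for example:
# project = neptune.init_project(name=name, api_token=api_token)
```

Keeping the token out of the notebook also makes it safer to share the notebook with your team.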

## Log various dataset versions to Neptune

Create a few different training data samples and log them as different dataset versions to a Neptune project. 

In [None]:
import pandas as pd

train = pd.read_csv("../datasets/tables/train.csv")

for i in range(5):
    train_sample = train.sample(frac=0.5 + 0.1 * i)
    train_sample.to_csv("../datasets/tables/train_sampled.csv", index=False)
    project[f"datasets/train_sampled/v{i}"].track_files(
        "../datasets/tables/train_sampled.csv", wait=True
    )

---
**Note**

In this case, you need to use ``wait=True`` to ensure all the logging operations are finished. 
By default, Neptune logs almost everything asynchronously.

---

You can confirm that the versions were logged by looking at the project metadata structure in the **datasets** namespace.

In [None]:
project.get_structure()["datasets"]

**You should see something like this:**

{'train_sampled': {'latest': <neptune.new.attributes.atoms.artifact.Artifact at 0x117af5b40>,
  'v0': <neptune.new.attributes.atoms.artifact.Artifact at 0x117af6ef0>,
  'v1': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b491e0>,
  'v2': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b49240>,
  'v3': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b492a0>,
  'v4': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b49300>}}

Find the latest version number of the dataset, then create a separate dataset version called 'latest' that holds the same artifact.

In [None]:
def get_latest_version():
    # Keys look like "v0", "v1", ... and may include the "latest" alias.
    version_keys = project.get_structure()["datasets"]["train_sampled"].keys()
    versions = [int(key.replace("v", "")) for key in version_keys if key != "latest"]
    return max(versions)


latest_version = get_latest_version()
print("latest version", latest_version)
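
The version lookup above assumes that every key other than 'latest' has the form `vN`. As a minimal sketch of a more defensive variant, the function below works on a plain list of keys, ignores anything that does not match `vN`, and compares versions numerically. The name `latest_version_number` is hypothetical; it contains no Neptune calls, so you can try it without a connection.

```python
import re

# Matches keys such as "v0" or "v12" and captures the number.
VERSION_KEY = re.compile(r"v(\d+)")


def latest_version_number(keys):
    """Return the highest N among keys shaped like 'vN', ignoring other keys."""
    numbers = []
    for key in keys:
        match = VERSION_KEY.fullmatch(key)
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        raise ValueError("no keys of the form 'vN' found")
    return max(numbers)


print(latest_version_number(["latest", "v0", "v1", "v10", "v2"]))  # prints 10
```

Comparing integers rather than strings matters once you pass version 9, because "v10" sorts before "v2" as text.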

In [None]:
project["datasets/train_sampled/latest"].assign(
    project[f"datasets/train_sampled/v{latest_version}"].fetch(), wait=True
)

## See dataset versions in the Neptune UI and share them with the team

You can get a list of all datasets used in a project by calling the ``project.get_structure()`` method.

In [None]:
project.get_structure()["datasets"]

**You should see something like this:**

{'train_sampled': {'latest': <neptune.new.attributes.atoms.artifact.Artifact at 0x117af5b40>,
  'v0': <neptune.new.attributes.atoms.artifact.Artifact at 0x117af6ef0>,
  'v1': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b491e0>,
  'v2': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b49240>,
  'v3': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b492a0>,
  'v4': <neptune.new.attributes.atoms.artifact.Artifact at 0x117b49300>}}
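
The nested dictionary can be hard to scan when a project holds many datasets. The sketch below flattens such a structure into a list of field paths. It assumes, based on the output shown above, that the structure is made of nested dicts whose leaves are not dicts. The function name `list_leaf_paths` is hypothetical, and the example uses placeholder objects instead of real Neptune attributes.

```python
def list_leaf_paths(structure, prefix=""):
    """Return the slash-separated path of every leaf in a nested dict."""
    paths = []
    for key, value in structure.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into namespaces such as "train_sampled".
            paths.extend(list_leaf_paths(value, path))
        else:
            paths.append(path)
    return paths


# Placeholder leaves stand in for the Artifact attributes shown above.
example = {"train_sampled": {"latest": object(), "v0": object(), "v1": object()}}
print(list_leaf_paths(example, prefix="datasets"))
# prints ['datasets/train_sampled/latest', 'datasets/train_sampled/v0', 'datasets/train_sampled/v1']
```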

You and your team can also view and access all of these versions in the Neptune UI.
Go to your project and then to **Project Metadata > datasets**.

![image](https://neptune.ai/wp-content/uploads/Screenshot-from-2021-12-23-14-00-51.png)

As with all links in the Neptune UI, the URL of the Project metadata view for your project is persistent, so you can share it with your team.

For example:

https://app.neptune.ai/o/common/org/data-versioning/metadata?path=datasets%2Ftrain_sampled&attribute=latest
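
If you want to generate such links in a script, the sketch below rebuilds the example URL from its parts. The URL pattern is inferred only from the example link in this guide and is an assumption, not a documented Neptune API. The function name `project_metadata_url` is hypothetical.

```python
from urllib.parse import urlencode


def project_metadata_url(workspace, project, path, attribute):
    """Build a link to a field in the Project metadata view.

    The pattern is copied from the example link in this guide and may
    differ for other Neptune deployments or versions.
    """
    # urlencode percent-encodes the "/" in the path as %2F.
    query = urlencode({"path": path, "attribute": attribute})
    return f"https://app.neptune.ai/o/{workspace}/org/{project}/metadata?{query}"


print(project_metadata_url("common", "data-versioning", "datasets/train_sampled", "latest"))
```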

## Create a new Neptune Run

Connect your script to the Neptune application and create a new Run.

In [None]:
run = neptune.init(project="common/data-versioning", api_token="ANONYMOUS")

## Assert that you are training on the latest dataset version

Log the version of the dataset you want to train your models on as a Neptune artifact.

In [None]:
TRAIN_DATASET_PATH = "../datasets/tables/train_sampled.csv"
run["datasets/train"].track_files(TRAIN_DATASET_PATH, wait=True)

Assert that it is the same dataset as the latest dataset version in your project. 

In [None]:
assert run["datasets/train"].fetch_hash() == project["datasets/train_sampled/latest"].fetch_hash()
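
To build intuition for what a hash comparison tells you, here is a purely local sketch that compares two files by their SHA-256 digest. It is an illustration only: Neptune computes its own artifact hash, and this code does not reproduce that algorithm. The function name `file_sha256` and the file names are hypothetical.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with two throwaway files that hold identical bytes.
with tempfile.TemporaryDirectory() as tmp_dir:
    first = os.path.join(tmp_dir, "a.csv")
    second = os.path.join(tmp_dir, "b.csv")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(b"sepal.length,variety\n5.1,Setosa\n")
    same_content = file_sha256(first) == file_sha256(second)

print(same_content)  # prints True
```

Two files with the same bytes produce the same digest, while changing a single byte produces a different one, which is the property the assertion above relies on.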

**Note:**

You can also download the latest version of the dataset by running:

```python
project['datasets/train_sampled/latest'].download()
```

## Run model training and log parameters and metrics to Neptune

Now train a model and log its parameters and test score to Neptune.

In [None]:
from sklearn.ensemble import RandomForestClassifier

TEST_DATASET_PATH = "../datasets/tables/test.csv"

PARAMS = {
    "n_estimators": 8,
    "max_depth": 3,
    "max_features": 2,
}
run["parameters"] = PARAMS

train = pd.read_csv(TRAIN_DATASET_PATH)
test = pd.read_csv(TEST_DATASET_PATH)

FEATURE_COLUMNS = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
TARGET_COLUMN = "variety"  # a single column name, so y is a 1-D Series
X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

rf = RandomForestClassifier(**PARAMS)
rf.fit(X_train, y_train)

score = rf.score(X_test, y_test)
run["metrics/test_score"] = score

## Stop logging  
<font color=red>**Warning:**</font><br>
Once you are done logging, stop tracking the run and the project using the `stop()` method.
This is needed only when logging from a notebook environment. When logging from a script, Neptune automatically stops tracking once the script has finished executing.

In [None]:
run.stop()
project.stop()
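
In a notebook it is easy to forget the `stop()` calls when a cell fails midway. As a minimal sketch, the context manager below calls `stop()` on every object it receives, even if the body raises an error. The name `stop_when_done` is hypothetical, and `FakeTracker` is only a stand-in for a Neptune run or project so that the example runs without a connection.

```python
from contextlib import contextmanager


@contextmanager
def stop_when_done(*trackers):
    """Yield the trackers, then call stop() on each, even after an error."""
    try:
        yield trackers
    finally:
        # Stop in reverse order, so the object passed last is stopped first.
        for tracker in reversed(trackers):
            tracker.stop()


class FakeTracker:
    """Stand-in for a Neptune run or project, used only for this demo."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake_run, fake_project = FakeTracker(), FakeTracker()
with stop_when_done(fake_run, fake_project):
    pass  # training and logging would happen here

print(fake_run.stopped, fake_project.stopped)  # prints True True
```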