# Organize and share dataset versions

## Introduction

You can log and query metadata at a project level, including dataset and model versions, text notes, images, notebook files, and anything else you can log to a single Run.

This guide shows how to:
* Log versions of all the datasets used in a project
* Organize dataset version metadata in the Neptune UI
* Share all the currently used dataset versions with your team
* Assert that you are training on the latest dataset version available

By the end of this guide, you will have logged several dataset versions, organized them in the Neptune UI, and seen how to share them with a persistent link.

[See this example in Neptune](https://app.neptune.ai/o/common/org/data-versioning/metadata?path=datasets%2Ftrain_sampled&attribute=latest)

![image](https://neptune.ai/wp-content/uploads/Screenshot-from-2021-12-23-11-46-23.png)

## Setup

Install dependencies

In [1]:
! pip install "neptune-client>=0.14" scikit-learn==0.24.1 pandas

## Step 1: Initialize the Neptune project 

In [2]:
import neptune.new as neptune

project = neptune.init_project(name="common/data-versioning", api_token="ANONYMOUS")

Remember to stop your project once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


**A few explanations**

In the above code, you tell Neptune:

* **who you are**: your Neptune API token `api_token` 
* **where you want to send your data**: your Neptune `project`.

At this point, you can log metadata to the Neptune `project`. 

---

**Note**


Instead of logging data to the public project 'common/data-versioning' as the anonymous user 'neptuner', you can log it to your own project.

To do that:

1. Get your [Neptune API token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)
2. Pass the token to the ``api_token`` argument of the ``neptune.init_project()`` method: ``api_token=YOUR_API_TOKEN``
3. Get your [Neptune project name](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)
4. Pass your project name to the ``name`` argument of the ``neptune.init_project()`` method.

For example:

```python
neptune.init_project(name='my_workspace/my_project', 
                     api_token='MY_API_TOKEN')
```
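
To avoid hard-coding the token in a notebook you share, one option is to read it from environment variables. The sketch below assumes the variables `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` are set in your shell. `resolve_credentials` is a hypothetical helper, not part of the Neptune client, and it falls back to the anonymous public project used in this guide.

```python
import os


def resolve_credentials(env=None):
    """Return (project_name, api_token), preferring environment variables.

    Hypothetical helper for illustration; not part of the Neptune client.
    """
    env = os.environ if env is None else env
    # Fall back to the public project and anonymous token used in this guide.
    project_name = env.get("NEPTUNE_PROJECT", "common/data-versioning")
    api_token = env.get("NEPTUNE_API_TOKEN", "ANONYMOUS")
    return project_name, api_token


project_name, api_token = resolve_credentials()
# These values can then be passed on, for example:
# neptune.init_project(name=project_name, api_token=api_token)
```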

## Step 2: Log various dataset versions to Neptune

Create a few different training data samples and log them as different dataset versions to a Neptune project. 

In [3]:
import pandas as pd

train = pd.read_csv('../datasets/tables/train.csv')

for i in range(5):
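    # Sample an increasing fraction of the training data (50% up to 90%).
    train_sample = train.sample(frac=0.5 + 0.1 * i)
    train_sample.to_csv('../datasets/tables/train_sampled.csv', index=False)
    # wait=True blocks until logging finishes; the same file is overwritten on the next iteration.
    project[f'datasets/train_sampled/v{i}'].track_files('../datasets/tables/train_sampled.csv', wait=True)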

---
**Note**

In this case, you need to use ``wait=True`` to ensure all the logging operations are finished. 
By default, Neptune logs almost everything asynchronously.

---
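
The versions logged above differ because each sample has different contents. The following sketch illustrates that idea with the standard library only: it hashes the CSV text of several samples and shows that each one yields a distinct fingerprint. The toy rows, the `csv_fingerprint` helper, and the choice of SHA-256 are assumptions for illustration; Neptune computes its own artifact hash internally.

```python
import csv
import hashlib
import io
import random


def csv_fingerprint(rows):
    """Serialize rows to CSV text and return the SHA-256 hex digest."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


random.seed(0)  # fixed seed so the sketch is reproducible
rows = [(i, i * 2) for i in range(100)]  # toy stand-in for train.csv

fingerprints = {}
for i in range(5):
    # Mirror the guide: sample roughly 50%, 60%, ... 90% of the rows.
    sample_size = round(len(rows) * (0.5 + 0.1 * i))
    sample = random.sample(rows, sample_size)
    fingerprints[f"v{i}"] = csv_fingerprint(sample)

for name, digest in fingerprints.items():
    print(name, digest[:12])
```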

You can confirm that it was logged by looking at the project metadata structure in the **datasets** namespace. 

In [4]:
project.get_structure()

{'datasets': {'train_sampled': {'latest': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2650>,
   'v0': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa27d0>,
   'v1': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2910>,
   'v2': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2a50>,
   'v3': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2850>,
   'v4': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2c90>}},
 'sys': {'creation_time': <neptune.new.attributes.atoms.datetime.Datetime at 0x7fa0ddfa2dd0>,
  'id': <neptune.new.attributes.atoms.string.String at 0x7fa0ddfa2f10>,
  'modification_time': <neptune.new.attributes.atoms.datetime.Datetime at 0x7fa0ddfa2f50>,
  'monitoring_time': <neptune.new.attributes.atoms.integer.Integer at 0x7fa0ddfa2110>,
  'owner': <neptune.new.attributes.atoms.string.String at 0x7fa0ddfa2290>,
  'ping_time': <neptune.new.attributes.atoms.datetime.Datetime at 0x7fa0ddfa2090>,
 

Get the latest version of a dataset and create a separate dataset version called 'latest'.

In [5]:
def get_latest_version():
    artifact_name = project.get_structure()['datasets']['train_sampled'].keys()
    versions = [int(version.replace('v','')) for version in artifact_name if version != 'latest']
    latest_version = max(versions)
    return latest_version

latest_version = get_latest_version()
print('latest version', latest_version)

latest version 4
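
The helper above assumes every key other than `latest` has the form `v<number>`. A slightly more defensive variant is sketched below. It works on any iterable of key names, so it can be tried without a Neptune connection. The function name `latest_version_number` and the decision to skip non-matching keys are assumptions for illustration.

```python
import re

VERSION_PATTERN = re.compile(r"v(\d+)")


def latest_version_number(keys):
    """Return the highest N among keys named 'vN', ignoring everything else."""
    numbers = []
    for key in keys:
        match = VERSION_PATTERN.fullmatch(key)
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        raise ValueError("no keys of the form 'v<number>' found")
    return max(numbers)


keys = ["latest", "v0", "v1", "v2", "v3", "v4"]
print("latest version", latest_version_number(keys))
```

With the project from this guide, the keys would come from `project.get_structure()['datasets']['train_sampled']`. Comparing integers rather than strings means `v10` correctly ranks above `v9`.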


In [6]:
project['datasets/train_sampled/latest'].assign(project[f'datasets/train_sampled/v{latest_version}'].fetch(), wait=True)

## Step 3: See dataset versions in the Neptune UI and share them with the team

You can get a list of all datasets used in a project by calling the ``project.get_structure()`` method.

In [7]:
project.get_structure()['datasets']

{'train_sampled': {'latest': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2650>,
  'v0': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa27d0>,
  'v1': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2910>,
  'v2': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2a50>,
  'v3': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2850>,
  'v4': <neptune.new.attributes.atoms.artifact.Artifact at 0x7fa0ddfa2c90>}}

You and your team can also see and access all that in the Neptune UI.
Go to your project and then to **Project Metadata > datasets**.

![image](https://neptune.ai/wp-content/uploads/Screenshot-from-2021-12-23-14-00-51.png)

Like all links in the Neptune UI, the URL of the Project metadata view for your project is persistent, so you can share it with your team.

For example:

https://app.neptune.ai/o/common/org/data-versioning/metadata?path=datasets%2Ftrain_sampled&attribute=latest
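
If you want to generate such links programmatically, the pattern can be sketched with the standard library. The URL layout below is inferred from the example link above rather than from a documented API, and `metadata_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def metadata_url(workspace, project_name, path, attribute):
    """Build a Project metadata link following the example URL's layout."""
    # urlencode percent-encodes "/" in the path as %2F.
    query = urlencode({"path": path, "attribute": attribute})
    return f"https://app.neptune.ai/o/{workspace}/org/{project_name}/metadata?{query}"


print(metadata_url("common", "data-versioning", "datasets/train_sampled", "latest"))
```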

## Step 4: Create a new Neptune Run

Connect your script to the Neptune application and create a new Run.

In [8]:
run = neptune.init(project='common/data-versioning', api_token='ANONYMOUS')

https://app.neptune.ai/common/data-versioning/e/DAT-67


Info (NVML): NVML Shared Library Not Found. GPU usage metrics may not be reported. For more information, see https://docs-legacy.neptune.ai/logging-and-managing-experiment-results/logging-experiment-data.html#hardware-consumption 


Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


**A few explanations**

In the above code, you tell Neptune:

* **who you are**: your Neptune API token `api_token` 
* **where you want to send your data**: your Neptune `project`.

At this point, you have a new Run in Neptune. From now on, you will use `run` to log metadata to it.

---

**Note**


Instead of logging data to the public project 'common/data-versioning' as the anonymous user 'neptuner', you can log it to your own project.

To do that:

1. Get your [Neptune API token](https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token)
2. Pass the token to ``api_token`` argument of ``neptune.init()`` method: ``api_token=YOUR_API_TOKEN``
3. Get your [Neptune project name](https://docs.neptune.ai/getting-started/installation#setting-the-project-name)
4. Pass your project to the ``project`` argument of the ``neptune.init()`` method.

For example:

```python
neptune.init(project='my_workspace/my_project', 
             api_token='MY_API_TOKEN')
```

## Step 5: Assert that you are training on the latest dataset version

Log the version of the dataset you want to train your models on as a Neptune artifact.

In [9]:
TRAIN_DATASET_PATH = '../datasets/tables/train_sampled.csv'
run["datasets/train"].track_files(TRAIN_DATASET_PATH, wait=True)

Assert that it is the same dataset as the latest dataset version in your project. 

In [10]:
assert run["datasets/train"].fetch_hash() == project['datasets/train_sampled/latest'].fetch_hash()
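
A bare `assert` fails without saying which versions differed. A minimal sketch of a more descriptive check is below. `ensure_same_version` is a hypothetical helper that takes the two hash strings returned by `fetch_hash()`, and the placeholder hashes are made up for the demonstration.

```python
def ensure_same_version(run_hash, latest_hash):
    """Raise a descriptive error if the two artifact hashes differ."""
    if run_hash != latest_hash:
        raise ValueError(
            "Training data is not the latest dataset version: "
            f"run hash {run_hash!r} != latest hash {latest_hash!r}"
        )


# Placeholder values; in the notebook you would pass
# run["datasets/train"].fetch_hash() and
# project['datasets/train_sampled/latest'].fetch_hash().
ensure_same_version("abc123", "abc123")  # passes silently

try:
    ensure_same_version("abc123", "def456")
except ValueError as error:
    print(error)
```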

**Note:**

You can also download the latest version of the dataset by running 

```python
project['datasets/train_sampled/latest'].download()
```

## Step 6: Run model training and log parameters and metrics to Neptune

Now train a model and log the test score to Neptune.

In [11]:
from sklearn.ensemble import RandomForestClassifier 

TEST_DATASET_PATH = '../datasets/tables/test.csv'

PARAMS = {'n_estimators': 8,
          'max_depth':3,
          'max_features':2,
         }
run["parameters"] = PARAMS

train = pd.read_csv(TRAIN_DATASET_PATH)
test = pd.read_csv(TEST_DATASET_PATH)

FEATURE_COLUMNS = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
TARGET_COLUMN = 'variety'  # a single column name, so y is a 1-D Series as scikit-learn expects
X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

rf = RandomForestClassifier(**PARAMS)
rf.fit(X_train, y_train)

score = rf.score(X_test, y_test)
run["metrics/test_score"] = score



Stop logging to Neptune

In [12]:
run.stop()
project.stop()

Shutting down background jobs, please wait a moment...
Done!


Waiting for the remaining 6 operations to synchronize with Neptune. Do not kill this process.


All 6 operations synced, thanks for waiting!
Shutting down background jobs, please wait a moment...
Done!
All 0 operations synced, thanks for waiting!
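
To make sure `stop()` is called even if training raises an exception, you can wrap the logging code in a context manager. The sketch below assumes only that the object exposes a `stop()` method, as `run` and `project` do in this guide. `stopping` and `FakeRun` are hypothetical names used so the example runs without a Neptune connection.

```python
from contextlib import contextmanager


@contextmanager
def stopping(neptune_object):
    """Yield the object and always call its stop() method afterwards."""
    try:
        yield neptune_object
    finally:
        neptune_object.stop()


class FakeRun:
    """Stand-in for a Neptune run, used only for this demonstration."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake_run = FakeRun()
try:
    with stopping(fake_run):
        raise RuntimeError("training failed")
except RuntimeError:
    pass

print(fake_run.stopped)
```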
