# How to add a dataset to Layer

[![Open in Layer](https://development.layer.co/assets/badge.svg)](https://app.layer.ai/layer/iris/) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/layerai/examples/blob/main/tutorials/add-datasets-to-layer/how_to_add_dataset_to_layer.ipynb) [![Layer Examples Github](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com/layerai/examples/tree/main/tutorials/add-datasets-to-layer)


Layer helps you build, train, and track your machine learning project metadata, including ML models and datasets, with semantic versioning, extensive artifact logging, and dynamic reporting across local↔cloud training.

In this quick walkthrough, we'll take a look at how to register and track datasets with Layer.

## Install Layer

Ensure that you have the latest version of [Layer](https://www.layer.ai) installed.

In [25]:
!pip install wget layer --upgrade -qqq

  Building wheel for wget (setup.py) ... done
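
If you want to confirm which versions ended up installed, the standard library can report them. The snippet below is a minimal sketch: `installed_version` is a hypothetical helper written for this walkthrough, not part of Layer, and it relies only on `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for this walkthrough; not part of the Layer SDK.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed by the pip command above
for package in ("layer", "wget"):
    version = installed_version(package)
    print(f"{package}: {version or 'not installed'}")
```

A `not installed` result suggests that the `pip install` cell ran in a different environment from the one the notebook kernel uses.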


## Authenticate your Layer account 

Once Layer is installed, log in to your Layer account. Everything you create is stored under this account, so this step is required.

In [None]:
import layer
layer.login()

## Create a project
The next step is to create a project. The dataset will be saved under this project.

Layer Projects are smart containers that organize your machine learning metadata, such as models, datasets, metrics, parameters, and reports. Each project associates some combination of datasets and models, and it serves as the front page of your ML work by gathering all of that metadata in one place.

To create a project, call `layer.init` with the name of the project.

In [3]:
layer.init("iris")

Your Layer project is here: https://app.layer.ai/layer/iris

⬆️ Click this link to visit your Layer Project page.


## Create your dataset function
Next, define a dataset function that loads the data and applies any pre-processing you need.



In [26]:
import wget

# Download the Iris CSV into the current working directory
wget.download("https://raw.githubusercontent.com/layerai/examples/main/tutorials/add-datasets-to-layer/iris.csv")

def save_iris():
  import pandas as pd
  data_file = 'iris.csv'
  df = pd.read_csv(data_file)
  classes = df['Species'].nunique()
  # Log data about your data
  print(f"Number of classes {classes}")
  return df

In [27]:
df = save_iris()

Number of classes 3


In [28]:
df.head()

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
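
A dataset function is also a natural place for light sanity checks before anything is registered. The sketch below is one possible approach, not something Layer requires: `summarize` and `EXPECTED_COLUMNS` are hypothetical names, and the inline sample is a handful of Iris rows in the same column layout as `iris.csv`, so the code runs with the standard library alone and without downloading the file.

```python
import csv
import io
from collections import Counter

# Small inline sample in the same column layout as iris.csv (illustrative only)
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
"""

# Columns the rest of this walkthrough relies on (assumption for this sketch)
EXPECTED_COLUMNS = [
    "Id", "SepalLengthCm", "SepalWidthCm",
    "PetalLengthCm", "PetalWidthCm", "Species",
]


def summarize(csv_text):
    """Check that the expected columns exist and count rows per species."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    rows = list(reader)
    per_class = Counter(row["Species"] for row in rows)
    return {"rows": len(rows), "classes": len(per_class), "per_class": dict(per_class)}


summary = summarize(SAMPLE_CSV)
print(summary)
```

With pandas, the same checks reduce to comparing `df.columns` against the expected list and calling `df['Species'].value_counts()`.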


## Saving the data to Layer
We interact with Layer through decorators, and Layer has built-in decorators for different purposes. In this case, we are interested in the [@dataset](http://docs.app.layer.ai/docs/sdk-library/dataset-decorator) decorator, which is used to create new datasets.
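
If decorators with arguments are new to you, the toy sketch below shows the general pattern: a decorator factory takes a name, wraps a function, and stores the function's return value under that name. This is only an illustration of the Python mechanism. `toy_dataset` and `REGISTRY` are hypothetical names, and nothing here reflects how Layer implements its decorators internally.

```python
import functools

# Toy in-memory registry standing in for a remote store (illustrative only)
REGISTRY = {}


def toy_dataset(name):
    """Return a decorator that stores the wrapped function's result under `name`."""
    def decorator(func):
        @functools.wraps(func)  # keep the original function name and docstring
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            REGISTRY[name] = result  # "register" whatever the function returned
            return result
        return wrapper
    return decorator


@toy_dataset("toy_iris")
def build_rows():
    return [{"Species": "Iris-setosa"}, {"Species": "Iris-virginica"}]


rows = build_rows()
print(sorted(REGISTRY))
```

Calling `build_rows()` still returns the rows to the caller, and the registry gains an entry as a side effect of the call.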

Let's demonstrate how to use the [@dataset](http://docs.app.layer.ai/docs/sdk-library/dataset-decorator) decorator by saving the Iris dataset.


If your dataset depends on a file, such as a CSV file, you can bundle it with your decorated function using the [resources](https://docs.app.layer.ai/docs/sdk-library/resources-decorator) decorator. The decorator expects the path to the data file, and Layer uploads that local file automatically.


Let's also replace `print()` with `layer.log()` to enable experiment tracking.

In [36]:
import layer
from layer.decorators import dataset, pip_requirements, resources

data_file = 'iris.csv'

@resources(data_file)
@pip_requirements(packages=["matplotlib", "seaborn"])
@dataset('iris_data')
def save_iris():
  import pandas as pd
  import matplotlib.pyplot as plt 
  import seaborn as sns 
  df = pd.read_csv(data_file)
  classes = df['Species'].nunique()
  # Log data about your data
  layer.log({"Number of classes": classes})
  # Log some data statistics
  plt.figure(figsize=(12,8))
  plt.title('Species Countplot')
  plt.xticks(rotation=90,fontsize=12)
  sns.countplot(x='Species',data=df) 
  layer.log({"Species Countplot":plt.gcf() })

  plt.figure(figsize=(12,8))
  plt.xticks(rotation=90,fontsize=12)
  sns.violinplot(x='Species',y='PetalWidthCm',data=df)
  layer.log({"Species violinplot":plt.gcf() })

  plt.figure(figsize=(12,8))
  plt.xticks(rotation=90,fontsize=12)
  sns.boxplot(x="Species", y="PetalLengthCm", data=df)
  layer.log({"Boxplot":plt.gcf() })

  plt.figure(figsize=(12,8))
  sns.scatterplot(x='SepalLengthCm',y='PetalLengthCm',hue='Species',data=df)
  layer.log({"Scatterplot":plt.gcf() })

  return df

When you execute this function, the data will be stored in Layer under the project you just initialized.

You can execute this function in two ways.

### Run the function locally

Running the function locally uses your local infrastructure. However, the resulting DataFrame will still be saved to Layer. Layer will also print a link that you can use to view the data immediately. 

In [None]:
save_iris()

⬆️ Click the above link to see the registered data in your Layer Project. 

### Run the function on Layer infrastructure 

You can also execute the function on Layer's infrastructure. This is especially useful when working with large datasets that require a lot of computing power.


To run functions on Layer infrastructure, pass them to `layer.run`, which expects a list of functions.

In [39]:
# Execute the function on Layer infra
layer.run([save_iris])

Run(project_name='iris', files_hash='27535a7ca337daeddbd3cc246fd3a9ef217de2bc286db35b877d7bc33471976e', account=Account(id=UUID('add1b570-c8e7-4187-b747-1d01104893a9'), name='layer'))

⬆️ Click the above link to see the registered data in your Layer Project. You will see that Layer automatically registered and versioned your data.


![Data on Layer](https://files.slack.com/files-pri/T011VP38L1F-F03H34999NE/image.png?pub_secret=6fe1ca9154)

## How to load and use your data from Layer

Once you have registered your data with Layer, you can load it by calling `layer.get_dataset(DATASET_NAME)`.

In [40]:
df = layer.get_dataset("layer/iris/datasets/iris_data").to_pandas()
df.head()

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
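
The path passed to `layer.get_dataset` in this walkthrough, `layer/iris/datasets/iris_data`, appears to have the shape `account/project/datasets/name`. That shape is inferred from this single example, so treat it as an assumption and check the Layer documentation before relying on it. Under that assumption, a small helper can build the path for your own account. `dataset_path` is a hypothetical function written for illustration and is not part of the Layer SDK.

```python
def dataset_path(account: str, project: str, name: str) -> str:
    """Build a dataset path of the assumed form account/project/datasets/name.

    Hypothetical helper; the path layout is inferred from the walkthrough's
    single example and is not confirmed against the Layer documentation.
    """
    parts = {"account": account, "project": project, "name": name}
    for label, value in parts.items():
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/': {value!r}")
    return f"{account}/{project}/datasets/{name}"


# Reproduces the path used in this walkthrough
path = dataset_path("layer", "iris", "iris_data")
print(path)
```

To load the copy you registered yourself, you would presumably pass your own account name in place of `layer`.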


## Where to go from here?

Now that you have registered your first dataset to Layer, you can:

- Join our [Slack Community](https://bit.ly/layercommunityslack) to connect with other Layer users
- Visit [Layer Examples Repo](https://github.com/layerai/examples) for more examples
- Browse [Community Projects](https://layer.ai/community) to see more use cases
- Check out [Layer Documentation](https://docs.layer.ai)
- [Contact us](https://layer.ai/contact-us?interest=notebook) with your questions