# Creating a FastEstimator dataset

## Creating simple datasets

The FastEstimator Dataset class is inspired by the PyTorch Dataset class, which provides a clean and efficient interface. For a detailed tutorial on the PyTorch implementation, see
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

A Dataset class must provide two key capabilities: retrieving an individual data entry and reporting how many entries the Dataset contains. Concretely:
* `len(dataset)` should return the size of the dataset.
* `dataset[i]` should return the ith sample in the dataset.
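
As a minimal sketch of this contract in plain Python, an object only needs `__len__` and `__getitem__` to support both operations. The class name `SquaresDataset` and its keys are hypothetical, chosen only for illustration, and are not part of FastEstimator.

```python
class SquaresDataset:
    """Hypothetical toy dataset: sample i is {'x': i, 'y': i * i}."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Called by len(dataset)
        return self.size

    def __getitem__(self, index):
        # Called by dataset[index]; reject out-of-range indices
        if not 0 <= index < self.size:
            raise IndexError(index)
        return {"x": index, "y": index * index}


dataset = SquaresDataset(4)
print(len(dataset))
print(dataset[3])
```

Returning each sample as a dictionary mirrors the key/value samples shown in the rest of this tutorial.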

## FastEstimator Dataset

This section shows how to create a Dataset using FastEstimator. There are two ways to do so: from data on disk or from data in memory.

### 1. Creating Dataset from disk

To create a Dataset from disk, you can either use a labeled directory structure or read a CSV file.

#### 1.1 Creating Dataset from Labeled directory structure

To demonstrate this, we first create a dummy directory structure with one sub-directory per class, then create a few files in each sub-directory. We can then create a Dataset by passing the root of the dummy directory to the LabeledDirDataset class constructor:

In [47]:
import os
import tempfile

import fastestimator as fe

with tempfile.TemporaryDirectory() as tmpdirname:
    # Each sub-directory of the root represents one class
    a_tmpdirname = tempfile.TemporaryDirectory(dir=tmpdirname)
    b_tmpdirname = tempfile.TemporaryDirectory(dir=tmpdirname)

    # Create two empty files per class and close each handle right away
    for class_dir, file_names in ((a_tmpdirname.name, ("a1.txt", "a2.txt")),
                                  (b_tmpdirname.name, ("b1.txt", "b2.txt"))):
        for file_name in file_names:
            open(os.path.join(class_dir, file_name), "x").close()

    dataset = fe.dataset.LabeledDirDataset(root_dir=tmpdirname)

    print(dataset[0])  # first sample: file path and integer class label
    print(len(dataset))  # total number of files found

    a_tmpdirname.cleanup()
    b_tmpdirname.cleanup()

{'x': '/tmp/tmpfhlul0vt/tmpcouwp4q9/b1.txt', 'y': 0}
4
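
Conceptually, a labeled directory layout maps each class sub-directory to an integer label and each file to one sample. The plain-Python sketch below illustrates that idea under stated assumptions: the helper `scan_labeled_dir` is hypothetical, and assigning labels by sorted sub-directory name is an assumption made for illustration, not a description of how `LabeledDirDataset` is implemented.

```python
import os
import tempfile


def scan_labeled_dir(root_dir):
    """Hypothetical helper: return one {'x': path, 'y': label} dict per file."""
    # Assumption: labels are assigned by sorted sub-directory name
    class_names = sorted(entry for entry in os.listdir(root_dir)
                         if os.path.isdir(os.path.join(root_dir, entry)))
    samples = []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root_dir, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            samples.append({"x": os.path.join(class_dir, file_name), "y": label})
    return samples


with tempfile.TemporaryDirectory() as root:
    layout = {"a": ["a1.txt", "a2.txt"], "b": ["b1.txt", "b2.txt"]}
    for class_name, file_names in layout.items():
        os.makedirs(os.path.join(root, class_name))
        for file_name in file_names:
            # Create an empty file and close it immediately
            open(os.path.join(root, class_name, file_name), "x").close()
    samples = scan_labeled_dir(root)

print(len(samples))
print([sample["y"] for sample in samples])
```

In the tutorial code above the sub-directory names are randomly generated, so which class ends up with label 0 may vary from run to run; fixed names such as `a` and `b` keep a sketch like this reproducible.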


#### 1.2 Creating Dataset from CSV

To demonstrate creating a Dataset from a CSV file, we create a dummy CSV file with one row per sample, covering the two classes. We can then create a Dataset by passing the path of the CSV file to the CSVDataset class constructor:

In [48]:
import os
import tempfile
import pandas as pd

import fastestimator as fe

with tempfile.TemporaryDirectory() as tmpdirname:
    # Write a small CSV with a file-name column 'x' and a label column 'y'
    csv_path = os.path.join(tmpdirname, 'data.csv')
    data = {'x': ['a1.txt', 'a2.txt', 'b1.txt', 'b2.txt'], 'y': [0, 0, 1, 1]}
    df = pd.DataFrame(data=data)
    df.to_csv(csv_path, index=False)

    dataset = fe.dataset.CSVDataset(file_path=csv_path)

    print(dataset[0])  # first row as a dictionary keyed by column name
    print(len(dataset))  # number of rows

{'x': 'a1.txt', 'y': 0}
4
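
The mapping from CSV rows to samples can be sketched with the standard library alone. This is an illustration, not FastEstimator's implementation: `read_csv_samples` is a hypothetical helper, and unlike the output above it leaves every value as a string, because `csv.DictReader` performs no type inference.

```python
import csv
import os
import tempfile


def read_csv_samples(file_path):
    """Hypothetical helper: return one dict per CSV row, keyed by column name."""
    with open(file_path, newline="") as csv_file:
        return [dict(row) for row in csv.DictReader(csv_file)]


with tempfile.TemporaryDirectory() as tmpdirname:
    csv_path = os.path.join(tmpdirname, "data.csv")
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["x", "y"])  # header row becomes the dictionary keys
        writer.writerows([["a1.txt", 0], ["a2.txt", 0], ["b1.txt", 1], ["b2.txt", 1]])
    samples = read_csv_samples(csv_path)

print(samples[0])
print(len(samples))
```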


### 2. Creating Dataset from memory

To create a Dataset from memory, pass a dictionary of data to the NumpyDataset class constructor. The following code snippet builds training and evaluation Datasets from the MNIST arrays:

In [49]:
import tensorflow as tf

import fastestimator as fe

# Load MNIST as NumPy arrays, then wrap each split in a NumpyDataset
(x_train, y_train), (x_eval, y_eval) = tf.keras.datasets.mnist.load_data()
train_data = fe.dataset.NumpyDataset({"x": x_train, "y": y_train})
eval_data = fe.dataset.NumpyDataset({"x": x_eval, "y": y_eval})

print(train_data[0]['y'])  # label of the first training sample
print(len(train_data))  # number of training samples

5
60000
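
To experiment with the same indexing pattern without downloading MNIST, the idea can be sketched with small synthetic arrays. `ArrayDictDataset` below is a hypothetical stand-in written for illustration, not the FastEstimator class, and the requirement that all arrays share the same length is an assumption of this sketch.

```python
import numpy as np


class ArrayDictDataset:
    """Hypothetical stand-in: sample i takes element i from every array."""

    def __init__(self, data):
        sizes = {len(values) for values in data.values()}
        # Assumption of this sketch: all arrays hold the same number of samples
        if len(sizes) != 1:
            raise ValueError("all arrays must have the same length")
        self.data = data
        self.size = sizes.pop()

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        # Build one sample by indexing every array at the same position
        return {key: values[index] for key, values in self.data.items()}


x = np.zeros((10, 28, 28), dtype=np.uint8)  # ten blank 28x28 "images"
y = np.arange(10)  # labels 0..9
dataset = ArrayDictDataset({"x": x, "y": y})

print(len(dataset))
print(dataset[3]["x"].shape, dataset[3]["y"])
```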
