<a href="https://colab.research.google.com/github/gupta24789/hugging-face/blob/main/01_convert_data_to_hf_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objective


This notebook covers:
- How to convert data in common formats (pandas, CSV, text, JSON, pickle, dict) to a Hugging Face dataset
- How to load data using the Hugging Face `datasets` loading functions


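There are two kinds of datasets:
1. `Dataset`: a map-style dataset that supports random access by index
2. `IterableDataset`: useful for big data; examples are fetched lazily while iterating instead of being loaded up front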



#### Commonly used arguments of the load_dataset function
```
path: str, a dataset name on the Hub or a builder name such as "csv", "json", "text" or "pandas"
name: Union[str, NoneType] = None, name of the dataset configuration
data_dir: path of the data directory
data_files: path of one or more files. If data_dir is None, provide the complete file path; otherwise the file name is enough. A dict mapping split names to files is also accepted
split: which split to load, e.g. train, validation or test
features: Features object that holds the data type information
streaming: if True, return an iterable dataset instead of loading everything up front
num_proc: number of processes for parallel processing
```

In [None]:
import pandas as pd
from datasets import Dataset, load_dataset
from datasets import ClassLabel, Features, Value, DatasetDict

## Convert pandas dataframe to hf dataset

In [None]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/val.csv")
train_df.shape, test_df.shape

((8004, 2), (2000, 2))

In [None]:
train_df.head(3)

| | raw_tweet | label |
|---|---|---|
| 0 | Want to say a huge thanks to @WarriorAssaultS ... | 1.0 |
| 1 | @jaynehh_ you just need a job and get a letter... | 1.0 |
| 2 | @knhillrocks HA yes, make it quick tho :D | 1.0 |


In [None]:
test_df.head(3)

| | raw_tweet | label |
|---|---|---|
| 0 | buuuuuuuut oh well :-) | 1 |
| 1 | The four o'clock coffee habit. :-) | 1 |
| 2 | @lewisssrg92 bet you do! Well I won't be getti... | 1 |


In [None]:
## hf dataset
train_ds = Dataset.from_pandas(train_df, split='train')
test_ds = Dataset.from_pandas(test_df, split='test')

print("Train : ", train_ds)
print("Test : ", test_ds)
print("\n")
print("Train Features: ", train_ds.features)
print("Test Features: ", test_ds.features)

Train :  Dataset({
    features: ['raw_tweet', 'label'],
    num_rows: 8004
})
Test :  Dataset({
    features: ['raw_tweet', 'label'],
    num_rows: 2000
})


Train Features:  {'raw_tweet': Value(dtype='string', id=None), 'label': Value(dtype='float64', id=None)}
Test Features:  {'raw_tweet': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}


In [None]:
## As shown above, the label dtype differs between train (float64) and test (int64)
## Manually define the features so that both splits share the same schema
features = Features({"raw_tweet": Value(dtype= "string"), "label": ClassLabel(num_classes=2, names=[0,1])})
train_ds = Dataset.from_pandas(train_df, split='train', features= features)
test_ds = Dataset.from_pandas(test_df, split='test', features= features)

print("Train : ", train_ds)
print("Test : ", test_ds)
print("\n")
print("Train Features: ", train_ds.features)
print("Test Features: ", test_ds.features)

Train :  Dataset({
    features: ['raw_tweet', 'label'],
    num_rows: 8004
})
Test :  Dataset({
    features: ['raw_tweet', 'label'],
    num_rows: 2000
})


Train Features:  {'raw_tweet': Value(dtype='string', id=None), 'label': ClassLabel(names=[0, 1], id=None)}
Test Features:  {'raw_tweet': Value(dtype='string', id=None), 'label': ClassLabel(names=[0, 1], id=None)}


In [None]:
## combine train and test
data = DatasetDict({
    "train":  train_ds,
    "test": test_ds
})

print(data)

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})


## Load csv data using hf dataset

In [None]:
## Pass multiple files in data_files as a list; they are combined into a single train split
data = load_dataset("csv", data_files= ["data/train.csv","data/val.csv"])
data

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 10004
    })
})

In [None]:
## Load train and test data
data = load_dataset("csv", data_files = {"train" :  "data/train.csv", "test": "data/val.csv"})
data

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})

In [None]:
## data_files as a dict whose values can be a list of files or a single file
data = load_dataset("csv", data_files = {"train" :  ["data/train.csv"], "test": "data/val.csv"})
data

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})

In [None]:
## Another way to read the above data
data = load_dataset("csv", data_dir = "data", data_files = {"train" :  ["train.csv"], "test": "val.csv"}, delimiter=",", column_names = None)
data

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})

## Load text data using hf dataset

In [None]:
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
data = load_dataset("text", data_dir = "data", data_files = {"train" :  "train.txt", "test": "val.txt"})
data

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 16000
    })
    test: Dataset({
        features: ['text'],
        num_rows: 2000
    })
})

In [None]:
data['train'][0]

{'text': 'i didnt feel humiliated;sadness'}

## Load json data using hf dataset

In [None]:
data = load_dataset(path = "json", data_dir= "data", data_files={"train": "train.json", "test":"test.json"})
data

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})

In [None]:
data['train'].features

{'raw_tweet': Value(dtype='string', id=None),
 'label': Value(dtype='float64', id=None)}

In [None]:
data['test'].features

{'raw_tweet': Value(dtype='string', id=None),
 'label': Value(dtype='float64', id=None)}

## Load pickle data using hf dataset

In [None]:
data = load_dataset(path = "pandas", data_dir= "data", data_files={"train": "train.pkl", "test":"test.pkl"})
data

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})

## Convert dict to hf dataset

In [None]:
my_dict = {'id': [0, 1, 2],
           'name': ['mary', 'bob', 'eve'],
           'age': [24, 53, 19]}

In [None]:
dataset = Dataset.from_dict(my_dict)
dataset

Dataset({
    features: ['id', 'name', 'age'],
    num_rows: 3
})

## Dataset vs IterableDataset


In [None]:
## Dataset
dataset = load_dataset("csv", data_files ={"train": "data/train.csv", "test":"data/val.csv"})
dataset

DatasetDict({
    train: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 8004
    })
    test: Dataset({
        features: ['raw_tweet', 'label'],
        num_rows: 2000
    })
})

In [None]:
dataset['train'][0]

{'raw_tweet': 'Want to say a huge thanks to @WarriorAssaultS @uktac @BolleSafety @Mechanix_Wear @Airtech_Studios @Hexmags #FF Thanks for the support :)',
 'label': 1.0}

In [None]:
## Iterable dataset
iter_dataset = load_dataset("csv", data_files ={"train": "data/train.csv", "test":"data/val.csv"}, streaming= True)
iter_dataset

IterableDatasetDict({
    train: IterableDataset({
        features: ['raw_tweet', 'label'],
        n_shards: 1
    })
    test: IterableDataset({
        features: ['raw_tweet', 'label'],
        n_shards: 1
    })
})

In [None]:
next(iter(iter_dataset['train']))

{'raw_tweet': 'Want to say a huge thanks to @WarriorAssaultS @uktac @BolleSafety @Mechanix_Wear @Airtech_Studios @Hexmags #FF Thanks for the support :)',
 'label': 1.0}

In [None]:
next(iter(iter_dataset['test']))

{'raw_tweet': 'buuuuuuuut oh well :-)', 'label': 1.0}

## Load data from hugging face hub

In [None]:
dataset = load_dataset('squad', split='train')
dataset

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

In [None]:
dataset[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}