# CH02b_Working_with_Datasets_Library

## Installing and loading a dataset

In [None]:
!pip install datasets

In [None]:
# load_dataset downloads and loads a dataset from the Hugging Face Hub
from datasets import load_dataset

Datasets might have several configurations. For instance, the GLUE dataset, an aggregated benchmark, has 10 subsets (as of writing this notebook): CoLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI, and the diagnostic subset AX.

To access each GLUE dataset, we pass two arguments: the first is **'glue'** and the second is the **subset** (configuration) to load. Likewise, the Wikipedia dataset has several configurations, one for each of several languages.

Let's load the 'cola' subset of GLUE as follows:

In [None]:
cola = load_dataset("glue", "cola")
cola["train"][18:22]

Some datasets come as a DatasetDict object, while others are of type Dataset, depending on how they are split. The CoLA dataset comes as a DatasetDict with 3 splits: train, validation, and test. The train and validation splits include the labels (1: acceptable, 0: unacceptable), but the label values of the test split are -1, which means 'no label'.

In [None]:
cola

In [None]:
cola["train"][12]

In [None]:
cola["validation"][68]

In [None]:
cola["test"][20]

## Metadata of Datasets
* split
* description
* citation
* homepage
* license
* info


In [None]:
print(cola["train"].split)
print(cola["train"].description)
print(cola["train"].citation)
print(cola["train"].homepage)
print(cola["train"].license)

### Loading other datasets

In [None]:
sst2 = load_dataset("glue", "sst2")

In [None]:
mrpc = load_dataset("glue", "mrpc")

To download and check all the subsets, run the following piece of code:


```
glue = ["cola", "sst2", "mrpc", "qqp", "stsb", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "ax"]
for g in glue:
    _ = load_dataset("glue", g)
```




## Listing all datasets and metrics in the hub

In [None]:
from pprint import pprint
from datasets import list_datasets, list_metrics

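# Note: newer releases of `datasets` deprecate these two helpers;
# this cell assumes a version that still provides them.
all_datasets = list_datasets()  # renamed from `all` to avoid shadowing the built-in
metrics = list_metrics()

print(f"{len(all_datasets)} datasets and {len(metrics)} metrics exist in the hub\n")
pprint(all_datasets[:20], compact=True)
pprint(metrics, compact=True)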

## XTREME: Working with a cross-lingual dataset

MLQA is a subset of the XTREME benchmark, designed for assessing the performance of cross-lingual question answering models. It includes about 5K extractive question-answer instances in SQuAD format in seven languages: English, German, Arabic, Hindi, Vietnamese, Spanish, and Simplified Chinese.

For example, MLQA.en.de is an English-German QA dataset and can be loaded as follows:

In [None]:
en_de = load_dataset("xtreme", "MLQA.en.de")

Here is the dataset structure:

In [None]:
en_de

### Viewing the dataset as a pandas data frame

In [None]:
# View dataset as a pandas data frame
import pandas as pd

pd.DataFrame(en_de["test"][0:4])

## Selecting, sorting, filtering

### Split
The `split` argument specifies which split of the data to load. If it is `None` (the default), `load_dataset` returns a `dict` with all splits (train, test, validation, or any other). If `split` is specified, it returns a single Dataset rather than a dictionary.

In [None]:
cola = load_dataset("glue", "cola", split="train[:300]+validation[-30%:]")
# That is, the first 300 examples of train plus the last 30% of validation.

#### Other Split Examples
The first 100 examples from train and validation

`split='train[:100]+validation[:100]'` 

The first 50% of train and the first 30% of validation

`split='train[:50%]+validation[:30%]'`


The first 20% of train and examples in the slice 30:50 from validation

`split='train[:20%]+validation[30:50]'`
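
How such a split expression resolves to row ranges can be sketched in plain Python. The helper `resolve_split` and the split sizes below are hypothetical and only illustrate the slicing semantics. The library's own parser handles more cases, and its rounding of percent boundaries may differ from the nearest-integer rounding assumed here.

```python
import re

# Hypothetical split sizes, chosen only for illustration.
SPLIT_SIZES = {"train": 1000, "validation": 200}

# Matches "name" or "name[start:stop]", where each bound is an int or a percent.
_PART = re.compile(r"^(\w+)(?:\[(-?\d+%?)?:(-?\d+%?)?\])?$")


def _to_index(bound, size, default):
    """Convert a bound such as '300', '-30%' or None into an absolute row index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        value = round(int(bound[:-1]) * size / 100)  # assumed nearest rounding
    else:
        value = int(bound)
    if value < 0:
        value += size  # negative bounds count from the end, as in Python slices
    return max(0, min(size, value))


def resolve_split(expression, sizes=SPLIT_SIZES):
    """Return (split_name, start, stop) tuples for an expression like 'a[:10]+b[-5%:]'."""
    ranges = []
    for part in expression.split("+"):
        match = _PART.match(part.strip())
        if match is None:
            raise ValueError(f"cannot parse split part: {part!r}")
        name, start, stop = match.groups()
        size = sizes[name]
        ranges.append((name, _to_index(start, size, 0), _to_index(stop, size, size)))
    return ranges


ranges = resolve_split("train[:300]+validation[-30%:]")
total_rows = sum(stop - start for _, start, stop in ranges)
print(ranges, total_rows)
```

With these made-up sizes, the expression selects rows 0 to 299 of train and rows 140 to 199 of validation, 360 rows in total.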

### Sorting

In [None]:
cola.sort("label")["label"][:15]

In [None]:
cola.sort("label")["label"][-15:]

### Indexing
You can also access several rows using slice notation or a list of indices.

In [None]:
# Pass the indices as a list to select several rows at once
cola[[6, 19, 44]]

In [None]:
cola[42:46]
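
Indexing with a slice or a list of indices returns a dictionary of columns (one list per column) rather than a list of rows. The plain-Python sketch below mimics that column-oriented shape. `take_rows` and the sample rows are hypothetical and are not part of the `datasets` API.

```python
def take_rows(rows, key):
    """Mimic column-oriented indexing: an int gives one row, a slice or list gives a dict of lists."""
    if isinstance(key, int):
        return dict(rows[key])
    selected = rows[key] if isinstance(key, slice) else [rows[i] for i in key]
    return {name: [row[name] for row in selected] for name in rows[0]}


# Made-up rows standing in for a small dataset.
rows = [{"sentence": f"sentence {i}", "label": i % 2, "idx": i} for i in range(50)]

one = take_rows(rows, 12)                  # a single row: dict of scalar values
by_slice = take_rows(rows, slice(42, 46))  # several rows: dict of lists
by_list = take_rows(rows, [6, 19, 44])     # several rows: dict of lists
print(one)
print(by_slice["idx"], by_list["idx"])
```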

### Shuffling 

In [None]:
cola.shuffle(seed=42)[:3]

## Caching and reusability
Cache files let us load large datasets through memory mapping (provided the dataset fits on the drive), which gives a fast backend. They also enable smart caching: the results of operations are saved on the drive and reused.
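
Memory mapping itself can be illustrated with the standard library. The sketch below is not how the `datasets` library is implemented (its cache files are Apache Arrow tables). It only shows the general idea that a memory-mapped file can be sliced without first reading the whole file into memory. The file name and contents are placeholders.

```python
import mmap
import os
import tempfile

# Write a placeholder file with fixed-width records (16 bytes each).
RECORD_SIZE = 16
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"record-{i:04d}".ljust(RECORD_SIZE).encode("ascii"))

# Map the file and read one record by offset; only the touched pages
# need to be loaded by the operating system.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        start = 742 * RECORD_SIZE
        record = mapped[start:start + RECORD_SIZE].decode("ascii").strip()

print(record)
```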

In [None]:
pprint(dir(cola))  # dir() already returns a list

In [None]:
cola.cache_files

In [None]:
cola.info

## Dataset Filter and Map Function



### Filter function

In [None]:
# Get 3 sentences that include the term "kick" using filter
cola = load_dataset("glue", "cola", split="train[:100%]+validation[-30%:]")
pprint(cola.filter(lambda s: "kick" in s["sentence"])["sentence"][:3])

In [None]:
# To get 3 acceptable sentences
pprint(cola.filter(lambda s: s["label"] == 1)["sentence"][:3])

In [None]:
# To get 3 acceptable sentences - alternative version
cola.filter(lambda s: s["label"] == cola.features["label"].str2int("acceptable"))[
    "sentence"
][:3]

### Processing data with the map function
The `datasets.Dataset.map()` function iterates over the dataset, applies a processing function to each example, and updates the content of the examples with the values the function returns.

In [None]:
# E.g. adding new features
cola_new = cola.map(lambda e: {"len": len(e["sentence"])})

In [None]:
cola_new

In [None]:
pprint(cola_new[0:3])

In [None]:
# E.g. modifying an existing feature: keep only the first 20 characters of each sentence
cola_cut = cola_new.map(lambda e: {"sentence": e["sentence"][:20]})

In [None]:
pprint(cola_cut[:3])
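
The way `map` merges the returned dictionary into each example can be pictured with plain Python dictionaries. This is only a sketch of the behaviour seen above, not the library's implementation. `map_examples` and the two sample rows are made up for illustration.

```python
def map_examples(rows, function):
    """Apply `function` to every row and merge its returned dict into a copy of the row."""
    return [{**row, **function(row)} for row in rows]


# Two made-up rows with a sentence/label layout.
rows = [
    {"sentence": "The boy kicked the ball across the field.", "label": 1},
    {"sentence": "Ball the kicked boy.", "label": 0},
]

# A new key adds a column ...
with_len = map_examples(rows, lambda e: {"len": len(e["sentence"])})
# ... while an existing key overwrites that column's values.
cut = map_examples(with_len, lambda e: {"sentence": e["sentence"][:20]})

print(with_len[0])
print(cut[0])
```

Note that `len` in `cut` still holds the length of the original sentence, because only the `sentence` key was returned by the second function.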

## Working with Local Files

In [None]:
import os
from google.colab import drive

drive.mount("/content/drive")

In [None]:
os.getcwd()

In [None]:
os.listdir("/content/drive/My Drive/akademi/Packt NLP with Transformers/CH02")

In [None]:
# Change into the chapter folder; an absolute path keeps the cell safe to re-run
ch02_dir = "/content/drive/MyDrive/akademi/Packt NLP with Transformers/CH02"
if os.getcwd() != ch02_dir:
    os.chdir(ch02_dir)

In [None]:
os.getcwd()

In [None]:
os.listdir()

In [None]:
# Generic loading scripts are provided for loading a dataset from local CSV, TXT, or JSON files

In [None]:
# The data folder contains the files a.csv, b.csv, and c.csv, which are random parts of the SST-2 dataset (tab-separated)
from datasets import load_dataset

data1 = load_dataset("csv", data_files="./data/a.csv", delimiter="\t")
data2 = load_dataset(
    "csv", data_files=["./data/a.csv", "./data/b.csv", "./data/c.csv"], delimiter="\t"
)
data3 = load_dataset(
    "csv",
    data_files={"train": ["./data/a.csv", "./data/b.csv"], "test": ["./data/c.csv"]},
    delimiter="\t",
)
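
The files `a.csv`, `b.csv`, and `c.csv` are not included with this notebook. To try the cell above without them, tab-separated stand-in files can be created with the standard library. The rows below are invented examples, not actual SST-2 data, and the `sentence` and `label` column names are an assumption based on the columns used later in this notebook.

```python
import csv
import os
import tempfile

# Invented rows in a sentence/label layout; not real SST-2 examples.
rows = [
    ("a charming and often affecting journey", 1),
    ("the plot is thin and the jokes fall flat", 0),
    ("a solid cast makes the film worth watching", 1),
    ("tedious from start to finish", 0),
]

# Write into a fresh temporary folder so that no existing data is overwritten.
data_dir = tempfile.mkdtemp(prefix="sst2_standin_")
paths = []
for name in ("a.csv", "b.csv", "c.csv"):
    path = os.path.join(data_dir, name)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")  # tab-separated, matching delimiter="\t"
        writer.writerow(["sentence", "label"])
        writer.writerows(rows)
    paths.append(path)

print(paths)
```

To use the stand-in files, point `data_files` at the printed paths instead of `./data/...`.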

In [None]:
import pandas as pd

pd.DataFrame(data1["train"][:3])

In [None]:
pd.DataFrame(data3["test"][:3])

In [None]:
# Load files in other formats
# data_json = load_dataset('json', data_files='a.json')
# data_text = load_dataset('text', data_files='a.txt')

In [None]:
# you can also access several rows using slice notation or with a list of indices

In [None]:
# shuffling

In [None]:
data3_shuf = data3["train"].shuffle(seed=42)
data3_shuf["label"][:15]

## Preparing the data for model training
Let us take an example with a tokenizer.
To do so, we need to install the `transformers` library.

In [None]:
!pip install transformers

In [None]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

If `batched` is `True`, `map` passes a batch of examples to the function at once instead of one example at a time.
`batch_size` (default 1000) is the number of examples per batch passed to the function. If `batch_size` is `None` or not positive, the whole dataset is passed to the function as a single batch.
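
What a batched function receives can be sketched without the library. With `batched=True`, the function is called with a dictionary that maps each column name to a list of values, one entry per example in the batch. The helper `iter_batches` below is a hypothetical stand-in used only to show the shape of the batches and the effect of `batch_size`.

```python
def iter_batches(columns, batch_size=1000):
    """Yield dict-of-lists batches from a dict of equally long column lists."""
    num_rows = len(next(iter(columns.values())))
    if num_rows == 0:
        return
    if batch_size is None or batch_size <= 0:
        batch_size = num_rows  # the whole dataset as a single batch
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


# Made-up data: 2,500 short sentences with alternating labels.
columns = {
    "sentence": [f"sentence number {i}" for i in range(2500)],
    "label": [i % 2 for i in range(2500)],
}

batch_lengths = [len(b["sentence"]) for b in iter_batches(columns, batch_size=1000)]
single_batch = [len(b["sentence"]) for b in iter_batches(columns, batch_size=None)]
print(batch_lengths, single_batch)
```

This is also why the tokenizer lambda can pass `e["sentence"]` directly: in batched mode it is a list of strings, which the tokenizer accepts.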

In [None]:
encoded_data1 = data1.map(
    lambda e: tokenizer(e["sentence"]), batched=True, batch_size=1000
)

In [None]:
data1

In [None]:
encoded_data1

In [None]:
pprint(encoded_data1["train"][0])

In [None]:
encoded_data3 = data3.map(
    lambda e: tokenizer(e["sentence"], padding=True, truncation=True, max_length=12),
    batched=True,
    batch_size=1000,
)

In [None]:
data3

In [None]:
encoded_data3

In [None]:
pprint(encoded_data3["test"][12])