# Datasets

This notebook will:

    I. present the basic usage of the datasets module
    II. use the module in a data-processing pipeline
    III. compare the torch and datasets ways of preparing data
    IV. modify the training process from the Model notebook

Datasets is a dataset manipulation module developed by Hugging Face. The public dataset hub of HF is at https://huggingface.co/datasets.

## I. Usage

In [2]:
# import 
########

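# explicit imports avoid name clashes (e.g. with torch.utils.data.Dataset later on)
from datasets import Dataset, load_dataset, load_from_disk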

### A. Online Loading

In [7]:
# Load
######

# To load the data, pass the dataset's identifier on the Hub, which is composed
# of the namespace (user or organization) and the dataset name.
#
# There is no need to provide the URL.
#
# By default, the result is a DatasetDict.
#
# The content of this dict depends on the structure of the downloaded data.
#
# The following example has only one element in the dict since only one
# split is available.
#
# Otherwise, the dict may contain several elements such as train, valid, test.


data = load_dataset("davidberg/sentiment-reviews")
data

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 4084
    })
})

In [11]:
# split
#######

# If we are only interested in one part of the dataset (e.g. train), we can use
# the 'split' argument.
# The result is then a Dataset, not a DatasetDict.

data =  load_dataset("davidberg/sentiment-reviews", split="train")
data

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 4084
})

In [24]:
# Slice
#######

# We can also slice the dataset while loading it.
# The slice can be given as:
#  - start and end indices
#  - a percentage of the split size

data =  load_dataset("davidberg/sentiment-reviews", split="train[:50]")
# data =  load_dataset("yezhengli9/opus_books_demo", split="train[10:100]")
# data =  load_dataset("yezhengli9/opus_books_demo", split="train[:10%]")
# data =  load_dataset("yezhengli9/opus_books_demo", split=["train[:10%]", "train[90%:]"]) # return a list
data

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 50
})

In [14]:
# subset
########

# Sometimes a dataset contains several subsets.
# We can download one of them by passing its name as the second argument.
# The following dataset has a subset named 'en-fr'.

data_sub = load_dataset("yezhengli9/opus_books_demo", "en-fr")
data_sub

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 127085
    })
})

### B. Inspect dataset

Inspection is for looking at and checking the data.
Indexing returns plain Python lists or dicts, not Dataset objects.
So the results can't be used directly in subsequent operations such as a dataloader.

In [15]:
# by Slicing
############

# If the data is a DatasetDict, we first select a split by its key (e.g. train);
# a Dataset can be indexed directly.

data_sub["train"][:3]["translation"]


[{'en': 'The Wanderer', 'fr': 'Le grand Meaulnes'},
 {'en': 'Alain-Fournier', 'fr': 'Alain-Fournier'},
 {'en': 'First Part', 'fr': 'PREMIÈRE PARTIE'}]

In [16]:
# features
##########

# we can also inspect other attributes

# all columns
print("column names: ", data.column_names)
print("features: ", data.features)

column names:  ['Unnamed: 0', 'review', 'polarity', 'division']
features:  {'Unnamed: 0': Value(dtype='int64', id=None), 'review': Value(dtype='string', id=None), 'polarity': Value(dtype='float64', id=None), 'division': Value(dtype='string', id=None)}


### C. Splitting dataset

In [17]:
# split the data into different groups (e.g. train, test)
# this returns a DatasetDict

data.train_test_split(test_size=0.2)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 8
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 2
    })
})

### D. Filter & Manipulation

These functions process the data for later use, unlike the inspection calls in the section above.
Their results are still Dataset objects.

In [19]:
# select
########

# Select rows by index. This returns a Dataset, unlike the indexing used for
# inspection in the previous section, which returns a list or dict.

data.select([0,1])

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 2
})

In [25]:
# filter
########

# For more complex filtering, use the filter method with a function that encodes the filter criteria.
# This returns a Dataset.
# The operation is not in place.

# the function receives one example (row) at a time
data_neu = data.filter(lambda example: "neutral" in example["division"])
data_neu

Filter:   0%|          | 0/50 [00:00<?, ? examples/s]

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 10
})

In [29]:
# map
#####

# To apply a processing function to the data, use the map method.

def add_prefix(sample) :

    sample["review"] = "prefix: " + sample["review"]
    return sample

data_prefix = data.map(add_prefix)

data_prefix["review"][:5]

['prefix: able play youtube alexa',
 'prefix: able recognize indian accent really well drop function helpful call device talk person near device smart plug schedule work seamlessly con would sound kindloud but lack clarity mid frequency need tweeked optimum clarity rarely device doesnt respond call alexa',
 'prefix: absolute smart device amazon connect external sub woofer sound amaze recons voice even close room like almost collection songs english hindi must quite moneys worth',
 'prefix: absolutely amaze new member family control home voice connect home anywhere world',
 'prefix: absolutely amaze previously sceptical invest money but arrive worth ityou absolutely buy wont regret cheer']

#### Combined with Tokenizer

In [30]:
# a more complex example using tokenizer

from transformers import AutoTokenizer

# load tokenizer

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

In [31]:
# function to do the preprocessing
# https://huggingface.co/docs/transformers/tasks/translation

# Prefix the input with a prompt so T5 knows this is a translation task. 
# Some models capable of multiple NLP tasks require prompting for specific tasks.

source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def preprocess(example) :

    inputs = prefix + example["translation"][source_lang]
    targets = example["translation"][target_lang]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

In [33]:
## Apply preprocess function to data

preprocessed_data = data_sub.map(preprocess)
preprocessed_data

Map:   0%|          | 0/127085 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 127085
    })
})

In [35]:
## use multiple processes

# To speed up the mapping, we can set the number of processes with num_proc.
# More processes usually help on large datasets, but the gain is bounded by
# the number of CPU cores and by the cost of starting the processes.

preprocessed_data = data_sub.map(preprocess, num_proc=4)
preprocessed_data

Map (num_proc=4):   0%|          | 0/127085 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 127085
    })
})

In [37]:
## batched processing 

# If we use the function above with batched=True, we get an indexing error,
# because that function doesn't support batched data.
# Without batching, each example is data[i], so data[i]["translation"]['en']
# returns the i-th English sentence.
# With batching, the function receives data[i:i+batch_size], where "translation"
# is a list of dicts, so data[i:i+batch_size]["translation"]['en'] raises an error.

def preprocess_batched(examples):
    
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

preprocessed_data = data_sub.map(preprocess_batched, batched=True)
preprocessed_data

Map:   0%|          | 0/127085 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 127085
    })
})
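The layout difference behind that error can be reproduced without the library. The sketch below uses two hypothetical rows shaped like the `translation` column above, and builds by hand the column-wise dict that a batched function receives. It is an illustration of the data layout, not of how `map` is implemented.

```python
# Hypothetical rows mimicking the 'translation' column of the dataset above.
rows = [
    {"id": "0", "translation": {"en": "The Wanderer", "fr": "Le grand Meaulnes"}},
    {"id": "1", "translation": {"en": "First Part", "fr": "PREMIÈRE PARTIE"}},
]

# Non-batched: the function sees one row at a time.
example = rows[0]
print(example["translation"]["en"])

# Batched: the function sees one dict whose values are lists (column-wise).
examples = {key: [row[key] for row in rows] for key in rows[0]}

try:
    examples["translation"]["en"]  # a list cannot be indexed by a string
except TypeError as err:
    print("TypeError:", err)

# The batched version therefore loops over the list of dicts.
english = [pair["en"] for pair in examples["translation"]]
print(english)
```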

In [39]:
## batch + proc

preprocessed_data = data_sub.map(preprocess_batched, batched=True, num_proc=4)
preprocessed_data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 127085
    })
})

In [44]:
## remove columns

# we can remove the columns we don't need
# here the columns ['id', 'translation'] are removed

preprocessed_data = data_sub.map(preprocess_batched, batched=True, num_proc=4, remove_columns=data_sub["train"].column_names)
preprocessed_data

Map (num_proc=4):   0%|          | 0/127085 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 127085
    })
})

### E. Save & Load

In [45]:
## save to disk

# after save, there are:
#   - a folder called "train" with:
#       * data-00000-of-00001.arrow
#       * dataset_info.json
#       * state.json
#   - a file called "dataset_dict.json"

preprocessed_data.save_to_disk("./tmp/processed_data")

Saving the dataset (0/1 shards):   0%|          | 0/127085 [00:00<?, ? examples/s]

In [46]:
## load from disk

preprocessed_data = load_from_disk("./tmp/processed_data")
preprocessed_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 127085
    })
})

### F. Local data

In [None]:
# other format
##############

# we can load one or several files
# If there are several files, put all file names into a list and pass the list to data_files arg.

data = load_dataset("csv", data_files="./mydata.csv", split="train")

In [None]:
# there is a dedicated function to load a csv file

data = Dataset.from_csv("./mydata.csv")

In [None]:
# load all files in a folder

data = load_dataset("csv", data_dir="./myfilespath", split="train")

# or
# data = load_dataset("csv", data_files=["./myfilespath/mydata1.csv", "./myfilespath/mydata2.csv"], split="train")

In [None]:
# load from a pandas DataFrame

import pandas as pd
data_pd = pd.read_csv("./mydata.csv")

data = Dataset.from_pandas(data_pd)

In [None]:
# If the data is a list of strings, we can't load it directly as a dataset.
# If we try, we get an error: AttributeError: 'str' object has no attribute 'get'

data_list = ["sentence1", "sentence2"]

# we should wrap each item in a dict

data_list = [{"text": line} for line in data_list]

print(data_list)

Dataset.from_list(data_list)

[{'text': 'sentence1'}, {'text': 'sentence2'}]


Dataset({
    features: ['text'],
    num_rows: 2
})

# Process Data For Training

Let's put it all together and construct some useful data structures for training.
There is no single way to prepare data for training. The choice depends on the data, its quality, format, etc.


## Using the torch way

In [103]:
# step1. read data into pairs

from torch.utils.data import Dataset
from datasets import load_dataset

class TranslationDataset(Dataset) :

    def __init__(self) :

        super().__init__()
        self.data = load_dataset("opus_books", 'en-fr', split='train')
        
    def __len__(self) :
        return len(self.data)
    
    def __getitem__(self, index) :

        # read the row once, then return the (English, French) sentence pair
        pair = self.data[index]["translation"]
        return pair['en'], pair['fr']
    
dataset = TranslationDataset()

In [104]:
# step2. split data

from torch.utils.data import random_split

train_set, valid_set = random_split(dataset, lengths=[0.8, 0.2])

In [105]:
# step3. preprocessing

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

def preprocess(batch) :

    prefix = "translate English to French: "
    inputs = [prefix + example[0] for example in batch]
    targets = [example[1] for example in batch]
    model_inputs = tokenizer(inputs, text_target=targets, padding="max_length", max_length=500, truncation=True, return_tensors="pt")
    return model_inputs

In [106]:
# step4. construct dataloader

from torch.utils.data import DataLoader

train_loader = DataLoader(train_set, batch_size=16, shuffle=False, collate_fn=preprocess)
valid_loader = DataLoader(valid_set, batch_size=1, shuffle=False, collate_fn=preprocess)

In [107]:
next(iter(train_loader))

{'input_ids': tensor([[13959,  1566,    12,  ...,     0,     0,     0],
        [13959,  1566,    12,  ...,     0,     0,     0],
        [13959,  1566,    12,  ...,     0,     0,     0],
        ...,
        [13959,  1566,    12,  ...,     0,     0,     0],
        [13959,  1566,    12,  ...,     0,     0,     0],
        [13959,  1566,    12,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[ 8786,     3,    85,  ...,     0,     0,     0],
        [  325,     3,    40,  ...,     0,     0,     0],
        [  312,  1072,     9,  ...,     0,     0,     0],
        ...,
        [ 1022, 11857,     9,  ...,     0,     0,     0],
        [  312, 14879,     6,  ...,     0,     0,     0],
        [ 1636,  3307,    71,  ...,     0,     0,     0]])}

## The Hugging Face way

In [None]:
# step1. load data

data = load_dataset("opus_books", 'en-fr', split='train')


In [119]:
# step2. split data

data_set =  data.train_test_split(test_size=0.2)

In [140]:
# step3. preprocess data

# Batched processing is used here.
# tokenizer, prefix, source_lang and target_lang are the ones defined in section I.

def preprocess_batched(examples):
    
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

preprocessed_data = data_set.map(preprocess_batched, batched=True, remove_columns=data_set["train"].column_names)

In [109]:
# step4. define tokenizer

from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [113]:
# step5. define model

# The model is passed to the collator so that the collator can build the
# decoder inputs by shifting the labels one token to the right.
# This is vitally important for translation tasks, but not necessary for
# other tasks such as classification.

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

In [143]:
# step6. get collator

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
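The shift mentioned in step 5 can be sketched in plain Python. This is a rough illustration of the idea, not the library's code: `shift_right` is a hypothetical helper, the default ids (0 for both the decoder start token and padding) are the values used by T5, and -100 is the value the collator uses by default to pad labels. The label ids in the example are made up, except the final 1, which is T5's end-of-sequence id.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0, ignore_index=-100):
    """Build decoder inputs from labels by shifting one position to the right."""
    # Prepend the start token and drop the last label.
    shifted = [decoder_start_token_id] + labels[:-1]
    # Positions that were padded with ignore_index become regular padding.
    return [pad_token_id if tok == ignore_index else tok for tok in shifted]

# Made-up label ids ending with the end-of-sequence id (1), then label padding.
labels = [325, 3, 40, 1, -100, -100]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)  # [0, 325, 3, 40, 1, 0]
```

At each position the decoder then sees the previous target token and is trained to predict the current one.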

In [144]:
# step7. construct dataloader

from torch.utils.data import DataLoader

train_loader_col = DataLoader(preprocessed_data["train"], batch_size=10, shuffle=False, collate_fn=data_collator)
valid_loader_col = DataLoader(preprocessed_data["test"], batch_size=1, shuffle=False, collate_fn=data_collator)

In [145]:
# The data are not the same as the torch way since the split of the data is random.
next(iter(train_loader_col))

{'input_ids': tensor([[13959,  1566,    12,  2379,    10,   101,   646,    24,   294,    13,
             8,   684,  2111, 17310,   203,   977,    11,  1522,  1852,   470,
           281,   223,    12,    34,     5,     1,     0,     0],
        [13959,  1566,    12,  2379,    10,  1853, 10842, 10327,  3316,     1,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [13959,  1566,    12,  2379,    10,   901,     9,    77,    18,   371,
          1211,  8632,     1,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [13959,  1566,    12,  2379,    10,   101,   130,   840,    16,     8,
           740,    13,     8, 16808, 18867, 23081,    44,  2788,    15,    18,
           188,  5497,    88,    31,     7,  1121,     5,     1],
        [13959,  1566,    12,  2379,    10,   216,  4363,    44,    69,   234,
           

## IV. Update training

We take the training section from 'hf_transformers_basics_model.ipynb' and replace torch's data module with the Hugging Face datasets module explained above.

In [None]:
# Select which GPU to use.
# Since I have 2 GPUs and only want to use one, I need to run this.
# This cell should be run first.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [51]:
# 1) load dataset
#################

# replace the custom Dataset class with load_dataset

from datasets import load_dataset

ckp_data = "davidberg/sentiment-reviews"

data = load_dataset(ckp_data, split="train")

data = data.filter(lambda example: "neutral" not in example["division"]) # filter out neutral
data

Filter:   0%|          | 0/4084 [00:00<?, ? examples/s]

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 3548
})

In [55]:
# 2) split data
###############

# replace random_split with the train_test_split method of datasets

split_data = data.train_test_split(test_size=0.1)
split_data

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 3193
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 355
    })
})

In [60]:
# 3) tokenizer
##############

# we define a function for tokenization and use map to obtain the tokens

from transformers import AutoTokenizer
import torch

label2id = {"negative":0, "positive": 1}
id2label = {0: "negative", 1: "positive"}

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def process(batch):

    toks = tokenizer(batch["review"], max_length=128, truncation=True, padding="max_length", return_tensors="pt")
    toks["labels"] = torch.tensor([label2id.get(item) for item in batch["division"]])

    return toks

tokenized_data = split_data.map(process, batched=True, remove_columns=data.column_names)
tokenized_data

Map:   0%|          | 0/3193 [00:00<?, ? examples/s]

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3193
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 355
    })
})

In [61]:
# 4) dataloader
###############

# replace the user-defined collate function with the transformers data collator

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

trainset, validset = tokenized_data["train"], tokenized_data["test"]

trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=DataCollatorWithPadding(tokenizer))
validloader = DataLoader(validset, batch_size=32, shuffle=False, collate_fn=DataCollatorWithPadding(tokenizer))
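What a padding collator does to a batch can be sketched in plain Python: every sequence is padded to the length of the longest one in the batch. `pad_batch` is a hypothetical helper written for this illustration, not the transformers implementation. The token ids are made up, and `pad_id=0` matches the padding id of the BERT tokenizer used here. Note that in this notebook the sequences already have the same length, because `process` pads them to `max_length`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized reviews of different lengths.
batch = pad_batch([[101, 2204, 102], [101, 2025, 2204, 2012, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```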



In [62]:
# 5) load model
###############

# not changed

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

# sent to gpu

if torch.cuda.is_available():
    model = model.cuda()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [63]:
# 6) define optimizer
#####################

# not changed

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)

In [64]:
# 7) evaluation
###############

# not changed, except that gradient tracking is disabled during evaluation

def eval():

    acc_count = 0

    model.eval()

    # gradients are not needed for evaluation
    with torch.no_grad():

        for batch in validloader:

            # if there is GPU, send the data to GPU
            if torch.cuda.is_available():
                batch = {k: v.to(model.device) for k, v in batch.items()}

            output = model(**batch)

            pred = torch.argmax(output.logits, dim=-1)

            # count correct labels
            acc_count += (pred.int() == batch["labels"].int()).sum()

    return acc_count / len(validset)

In [65]:
# 8) Train
##########

# not changed

def train(epoch=3, log_step=50):

    gStep = 0

    for e in range(epoch):

        model.train()

        for batch in trainloader:
            
            # if there is GPU, send the data to GPU
            if torch.cuda.is_available():
                batch = {k: v.to(model.device) for k, v in batch.items()}

            optimizer.zero_grad()

            output = model(**batch)

            output.loss.backward()

            optimizer.step()

            if gStep % log_step == 0:

                print(f"{e+1} / {epoch} - global step: {gStep}, loss: {output.loss.item()}")

            gStep += 1

        acc = eval()

        print(f"{e+1} / {epoch} - acc: {acc}")

In [66]:
# not changed

train()

1 / 3 - global step: 0, loss: 0.822261393070221
1 / 3 - global step: 50, loss: 0.26876893639564514
1 / 3 - acc: 0.9492957592010498
2 / 3 - global step: 100, loss: 0.12767672538757324
2 / 3 - global step: 150, loss: 0.04294269159436226
2 / 3 - acc: 0.9605633616447449
3 / 3 - global step: 200, loss: 0.08602031320333481
3 / 3 - global step: 250, loss: 0.014622676186263561
3 / 3 - acc: 0.9436619281768799
