<a href="https://colab.research.google.com/github/not-sid-29/transformers_huggingface/blob/main/Preprocessing_GLUE_MRPC%26SST_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing a Language Dataset:

In [1]:
## Setup and Imports

!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate
!pip install -q accelerate


## Preprocessing the dataset:

### 1. Loading up the custom dataset:

The dataset used here is the Microsoft Research Paraphrase Corpus (MRPC), one of the tasks in the GLUE benchmark.

In [2]:
import torch
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

**Dataset information**:<br>
The dataset contains three splits: *train*, *validation*, and *test*.<br>
Features: <br>
→ `sentence1`: The first sentence of the pair. <br>
→ `sentence2`: The second sentence, which may or may not be a paraphrase of `sentence1`. <br>
→ `label`: Whether the two sentences are paraphrases of each other.<br>
> { 0 : 'not_equivalent', <br>
>   1 : 'equivalent'} <br>

→ `idx`: An integer identifier for the example.

In [4]:
train_data = dataset['train']

# Display the first example and the feature schema
print(train_data[0])
print("Features present in data: ")
print(train_data.features)

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}
Features present in data: 
{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None)}
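The `ClassLabel` feature printed above stores labels as integers and keeps the class names alongside them. A minimal sketch of turning those names into lookup tables in both directions follows; the variable names `id2label` and `label2id` are chosen here for illustration and are not part of the notebook.

```python
# Class names copied from the `label` feature printed above.
label_names = ["not_equivalent", "equivalent"]

# Build lookup tables in both directions.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[1])                  # equivalent
print(label2id["not_equivalent"])   # 0
```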


### 2. Tokenizing the loaded dataset:

In [5]:
from transformers import AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the train split:
tokenized_train = tokenizer(
    train_data["sentence1"],
    train_data["sentence2"],
    padding=True,
    truncation=True
)
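Passing two lists of sentences to the tokenizer encodes each pair as one sequence. The sketch below illustrates the layout a BERT-style model expects for such a pair. It assumes the sentences are already split into tokens, whereas the real tokenizer also applies WordPiece splitting and maps tokens to integer ids; `encode_pair` is a hypothetical helper, not a `transformers` function.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Every position holds a real token here, so nothing is masked out.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = encode_pair(["hello", "world"], ["hi", "there", "!"])
print(tokens)
print(type_ids)
print(mask)
```

When sequences are padded, the extra positions receive an attention mask value of 0 so the model ignores them.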


In [6]:
# tokenized_train is a dict-like object held entirely in memory, so tokenizing a much larger dataset this way can exhaust RAM

**To keep the data as a `Dataset` object, tokenize it with `Dataset.map`:**

In [7]:
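def tokenize_data(batch):
  # With batched=True, map passes a dict of lists (a batch of rows), not the whole dataset.
  # Note: padding=True pads each such batch only to that batch's longest sequence.
  return tokenizer(
      batch["sentence1"],
      batch["sentence2"],
      padding=True,
      truncation=True
  )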

In [8]:
tokenized_train_data = train_data.map(tokenize_data, batched=True)


In [9]:
tokenized_train_data

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})
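The output above shows that the original columns are kept and the tokenizer outputs are added as new ones. A rough in-memory sketch of that calling convention follows: the function receives a dict of lists for one slice of rows and returns a dict of new columns. `map_batched` and `add_lengths` are hypothetical helpers written for illustration; the real `Dataset.map` is Arrow-backed and does considerably more.

```python
def map_batched(columns, fn, batch_size=1000):
    """Apply fn to slices of a column dict and merge in the columns it returns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists holding at most batch_size rows.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; returned columns are added next to them.
    return {**columns, **new_columns}


def add_lengths(batch):
    # Returns one value per input row, as a batched function should.
    return {"num_words": [len(s.split()) for s in batch["sentence1"]]}


data = {"sentence1": ["a b c", "d e", "f"], "label": [1, 0, 1]}
out = map_batched(data, add_lengths, batch_size=2)
print(out)
```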

### 3. Preprocessing with Dynamic Padding:

In [10]:
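from transformers import DataCollatorWithPadding

# The collator pads each batch it receives to the longest sequence in that batch.
data_collator = DataCollatorWithPadding(tokenizer)

# Drop any raw-text or index columns so that only tokenizer outputs reach the collator.
train_ds = {k: v for k, v in tokenized_train.items() if k not in ["idx", "sentence1", "sentence2"]}
# tokenized_train was already padded to its longest sequence, and the whole split
# is passed as a single batch, so every row ends up with the same length.
batch_train = data_collator(train_ds)
{k: v.shape for k, v in batch_train.items()}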

{'input_ids': torch.Size([3668, 103]),
 'token_type_ids': torch.Size([3668, 103]),
 'attention_mask': torch.Size([3668, 103])}
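In the output above the whole split forms one batch, so every row has width 103. Dynamic padding pays off when the data is collated in smaller batches, because each batch is padded only to its own longest sequence. The sketch below shows that idea with plain lists; `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption for the padding token id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token-id sequences of lengths 3, 5, 2 and 9.
samples = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 1, 2, 3, 4, 5, 6, 7, 102],
]

# Collate two samples at a time; each batch gets its own width.
batches = [pad_batch(samples[i:i + 2]) for i in range(0, len(samples), 2)]
widths = [len(batch["input_ids"][0]) for batch in batches]
print(widths)  # [5, 9]
```

Padding all four samples together would give every row width 9, so the first batch saves four positions per row here.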

---

## II. Preprocessing the GLUE SST-2 Dataset:

In [15]:
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import load_dataset

def preprocess_sst2(model_id="bert-base-uncased"):
  data = load_dataset("glue", "sst2")

  # Print the dataset information
  print(data)

  # Initialize the tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # Helper that tokenizes a batch of sentences; padding is left to the collator
  def tokenize_function(batch):
    return tokenizer(batch["sentence"], truncation=True)

  # Tokenize the whole train split in memory (unpadded lists of token ids)
  sample_train = tokenize_function(data["train"])
  # Same tokenization through Dataset.map, which keeps the result as a Dataset
  # (not used further in this function)
  tknzd_train = data["train"].map(tokenize_function, batched=True)

  # Initialize the data collator with dynamic padding
  data_collator = DataCollatorWithPadding(tokenizer)

  # Keep only the tokenizer outputs, then pad the whole split as a single batch
  tknzd_train_split = {k: v for k, v in sample_train.items() if k not in ["idx", "sentence"]}
  train_batch = data_collator(tknzd_train_split)

  return train_batch

In [16]:
train_batch = preprocess_sst2()

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})



In [17]:
{k : v.shape for k, v in train_batch.items()}

{'input_ids': torch.Size([67349, 66]),
 'token_type_ids': torch.Size([67349, 66]),
 'attention_mask': torch.Size([67349, 66])}
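Because the whole split was collated as one batch, every row is padded to width 66. The attention mask records which positions are padding (value 0), so it can be used to estimate how much of a batch is padding. A minimal sketch on a toy mask follows; `padding_fraction` is a hypothetical helper, and the numbers are not taken from SST-2.

```python
def padding_fraction(attention_mask):
    """Return the share of positions that are padding (mask value 0)."""
    total = sum(len(row) for row in attention_mask)
    real = sum(sum(row) for row in attention_mask)
    return (total - real) / total


# Toy mask: three rows of width 4 with 3, 4 and 1 real tokens.
mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
print(round(padding_fraction(mask), 3))  # 0.333
```

The same function should work on the real batch after converting the tensor to nested lists, for example with `train_batch["attention_mask"].tolist()`.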