Download the Train splits of datasets "aalksii/ml-arxiv-papers" and “Salesforce/wikitext” (the smaller 'wikitext-2-raw-v1' version)

● Utilise only the “abstract” column of the arxiv dataset.

● Perform appropriate steps to join both the datasets by concatenating them together

● Reference versions of packages for this exam:

	○ Transformers version: 4.44.2
	○ Datasets version: 3.1.0
	○ Tokenizers version: 0.19.1

In [5]:
!pip install "transformers==4.44.2" "datasets==3.1.0" "tokenizers==0.19.1"

Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting datasets==3.1.0
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting tokenizers==0.19.1
  Downloading tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers, datasets
  Attempting uninstall: tokenizers
    Found exist

In [1]:
import transformers
import datasets
import tokenizers

print('transformers version: ', transformers.__version__)
print('datasets version: ', datasets.__version__)
print('tokenizers version: ', tokenizers.__version__)

transformers version:  4.44.2
datasets version:  3.1.0
tokenizers version:  0.19.1


In [2]:
from datasets import load_dataset

arxiv = load_dataset('aalksii/ml-arxiv-papers', split='train')
arxiv

Dataset({
    features: ['title', 'abstract'],
    num_rows: 105832
})

In [3]:
wikitext = load_dataset('Salesforce/wikitext', 'wikitext-2-raw-v1', split='train')
wikitext

Dataset({
    features: ['text'],
    num_rows: 36718
})

In [6]:
# Arxiv dataset
arxiv = arxiv.remove_columns('title')
arxiv = arxiv.rename_column('abstract', 'text')
print(arxiv)
print(wikitext)

Dataset({
    features: ['text'],
    num_rows: 105832
})
Dataset({
    features: ['text'],
    num_rows: 36718
})


In [8]:
from datasets import concatenate_datasets
ds = concatenate_datasets([arxiv, wikitext])
ds

Dataset({
    features: ['text'],
    num_rows: 142550
})

### Data prep done, now problems:

---

1. What is the size of the combined dataset, in thousands of rows?

In [10]:
len(ds) / 1000

142.55

In [11]:
ds.num_rows / 1000

142.55

2. What is the average length of sentences, assuming the text in each row is split on a single space, i.e. split(" ")?

In [15]:
ds = ds.map(lambda x: {'n_tokens': len(x['text'].split(" "))}, num_proc=3)

Map (num_proc=3):   0%|          | 0/142550 [00:00<?, ? examples/s]

In [17]:
sum(ds['n_tokens']) / len(ds)

138.88237811294283

In [21]:
total_tokens = 0
for sample in ds:
    total_tokens += sample['n_tokens']
print(total_tokens / len(ds))

138.88237811294283
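The choice of `split(" ")` affects the average. The sketch below uses made-up rows (assumptions chosen to resemble an empty line, a heading with padding spaces, and a double space, not rows copied from the dataset) to show that a single-space split counts empty strings that a plain `split()` would drop:

```python
# Hypothetical rows, chosen only to illustrate the splitting behaviour.
rows = ["", " = Heading = \n", "two  spaces here"]

def count_single_space(text):
    """Token count as used in the notebook: split on exactly one space."""
    return len(text.split(" "))

def count_whitespace(text):
    """Alternative: split on any run of whitespace, dropping empty strings."""
    return len(text.split())

single = [count_single_space(r) for r in rows]
whitespace = [count_whitespace(r) for r in rows]
print(single)      # [1, 5, 4]
print(whitespace)  # [0, 3, 3]
```

An empty row still counts as one "word" under `split(" ")`, so the reported average is specific to the splitting rule the question prescribes.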


3. How many rows have more than or equal to 150 words but less than or equal to 400 words (answer in thousands of number of rows)?

In [22]:
filtered_data = ds.filter(lambda x: x['n_tokens'] >= 150).filter(lambda x: x['n_tokens'] <= 400)
filtered_data

Filter:   0%|          | 0/142550 [00:00<?, ? examples/s]

Filter:   0%|          | 0/70581 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'n_tokens'],
    num_rows: 70519
})
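The two chained filters amount to a single inclusive range check. As a minimal sketch on a plain list of hypothetical word counts (the helper name `in_range` is an assumption, not part of the notebook), the predicate can be written with a chained comparison; the same function body could presumably be passed to one `ds.filter` call:

```python
def in_range(n_tokens, low=150, high=400):
    """True when low <= n_tokens <= high (both bounds inclusive)."""
    return low <= n_tokens <= high

# Hypothetical word counts, including both boundary values.
counts = [149, 150, 275, 400, 401]
kept = [n for n in counts if in_range(n)]
print(kept)  # [150, 275, 400]

# Row count reported by the notebook output above, expressed in thousands.
rows_in_thousands = 70519 / 1000
print(rows_in_thousands)
```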

4. Now, utilise the BertNormalizer and BertPreTokenizer

Answer the following based on the above mentioned steps.

Which of the following statements is True?

- [ ] BertNormalizer will not change to lowercase by default
- [ ] BertNormalizer by default takes care of accented characters.
- [ ] BertPreTokenizer will split words only on white spaces while ignoring punctuations
- [ ] BertPreTokenizer will split words on white spaces as well as punctuation.

In [23]:
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

In [25]:
BertNormalizer?

Init signature:
BertNormalizer(
    self,
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=None,
    lowercase=True,
)
Docstring:
BertNormalizer

Takes care of normalizing raw text before giving it to a Bert model.
This includes cleaning the text, handling accents, chinese chars and lowercasing

Args:
    clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
        Whether to clean the text, by removing any control characters
        and replacing all whitespaces by the classic one.

    handle_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
     

In [26]:
BertPreTokenizer?

Init signature: BertPreTokenizer(self)
Docstring:
BertPreTokenizer

This pre-tokenizer splits tokens on spaces, and also on punctuation.
Each occurence of a punctuation character will be treated separately.
File:           ~/conda/envs/dlp/lib/python3.12/site-packages/tokenizers/pre_tokenizers/__init__.py
Type:           type
Subclasses:

In [27]:
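# Option 1 is false: lowercase defaults to True.
# Option 2 is true: strip_accents=None falls back to the value of lowercase, so accents are stripped by default.
# Option 3 is false and option 4 is true: BertPreTokenizer splits on whitespace and on punctuation.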
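To make the two behaviours concrete without depending on the library, here is a rough standard-library approximation. The function names are hypothetical, and this is not the `tokenizers` implementation: the real `BertNormalizer` also cleans control characters and handles Chinese characters, and `\w` treats the underscore as a word character, which a BERT-style splitter may not.

```python
import re
import unicodedata

def approx_bert_normalize(text):
    """Lowercase, then strip combining accent marks (NFD + drop category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def approx_bert_pre_tokenize(text):
    """Split on whitespace and emit each punctuation character on its own."""
    return re.findall(r"\w+|[^\w\s]", text)

normalized = approx_bert_normalize("Héllo, Wörld!")
pieces = approx_bert_pre_tokenize(normalized)
print(normalized)  # hello, world!
print(pieces)      # ['hello', ',', 'world', '!']
```

In the notebook itself, calling `normalize_str` on a `BertNormalizer()` and `pre_tokenize_str` on a `BertPreTokenizer()` should let the same checks be made against the real objects.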

---

5. Consider the following text sentence to be encoded:

	○ From misfiring superstars Rohit and Kohli, to New Zealand’s spin attack making the most of the home pitches, here’s how India were handed a shock series loss.

● Suppose now you use the WordPiece model, along with its appropriate trainer.
● Train two Tokenizer models, of varying Vocabulary sizes: 5000 and 10000
● Answer the following based on the above-mentioned normalizer and pre-tokenizer.

Which of the following statements is true?

- [ ] The Tokenizer model trained on 5k vocabulary size encodes the given text sentence in fewer tokens.
- [ ] The Tokenizer model trained on 10k vocabulary size encodes the given text sentence in fewer tokens.
- [ ] The Tokenizer model trained on 10k vocabulary size encodes the given text sentence in half the number of tokens
- [ ] Difference in the number of tokens in both the model’s encoding is less than 10.

In [28]:
from tokenizers.models import WordPiece

In [29]:
from tokenizers.trainers import WordPieceTrainer

In [32]:
from tokenizers import Tokenizer

In [33]:
model_5k = WordPiece()
model_10k = WordPiece()

tok_5k = Tokenizer(model_5k)
tok_10k = Tokenizer(model_10k)

tok_5k.normalizer = BertNormalizer()
tok_10k.normalizer = BertNormalizer()

tok_5k.pre_tokenizer = BertPreTokenizer()
tok_10k.pre_tokenizer = BertPreTokenizer()

In [35]:
trainer_5k = WordPieceTrainer(vocab_size=5000)
trainer_10k = WordPieceTrainer(vocab_size=10000)

In [38]:
# Are we supposed to train on filtered data or on the original concatenated dataset? Let's try both.

def iterator_train():
    for sample in ds:
        yield sample['text']

In [39]:
tok_5k.train_from_iterator(iterator_train(), trainer=trainer_5k, length=len(ds))






In [40]:
tok_10k.train_from_iterator(iterator_train(), trainer=trainer_10k, length=len(ds))






In [41]:
sample = "From misfiring superstars Rohit and Kohli, to New Zealand’s spin attack making the most of the home pitches, here’s how India were handed a shock series loss."
enc_5k = tok_5k.encode(sample)
enc_10k = tok_10k.encode(sample)

In [42]:
enc_5k

Encoding(num_tokens=53, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [43]:
enc_10k

Encoding(num_tokens=48, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
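Each option can be checked directly against the two token counts. This is a small sketch that hard-codes the `num_tokens` values shown above; the counts come from this particular training run and may differ if the training data or settings change.

```python
n_5k, n_10k = 53, 48  # num_tokens reported by the two encodings above

statements = {
    "5k uses fewer tokens": n_5k < n_10k,
    "10k uses fewer tokens": n_10k < n_5k,
    "10k uses half as many tokens": n_10k * 2 == n_5k,
    "difference is below 10": abs(n_5k - n_10k) < 10,
}
for text, holds in statements.items():
    print(f"{holds!s:5}  {text}")
```

With these counts, the second and fourth statements hold.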

6. Select the true statements for WordPiece Tokenizer

- [ ] It is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.
- [ ] It is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.
- [ ] It will tokenize by looking for the most likely segmentation into tokens, according to the model.
- [ ] It will tokenize by looking for the longest subword starting from the beginning that is in the vocabulary, then repeat the process for the rest of the text.

From the [docs](https://huggingface.co/docs/transformers/main/en/tokenizer_summary#wordpiece): 

> WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules.
> In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

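Thus, option 1 is correct: WordPiece starts from a small vocabulary and learns merge rules. Option 4 is also correct, since at encoding time WordPiece greedily takes the longest subword in the vocabulary, starting from the beginning of the word, and repeats on the remainder. Options 2 and 3 describe the Unigram algorithm instead.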
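The longest-match-first encoding described in the fourth option can be sketched in a few lines of plain Python. The vocabulary below is a hypothetical toy set, not one learned by the tokenizers trained earlier, and the `##` continuation prefix and `[UNK]` token are the conventional BERT-style defaults assumed here.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for illustration.
toy_vocab = {"mis", "##fir", "##ing", "##f", "##i", "##r", "super", "##stars"}
print(wordpiece_encode("misfiring", toy_vocab))   # ['mis', '##fir', '##ing']
print(wordpiece_encode("superstars", toy_vocab))  # ['super', '##stars']
print(wordpiece_encode("kohli", toy_vocab))       # ['[UNK]']
```

A larger vocabulary tends to contain longer pieces, which is consistent with the 10k tokenizer producing fewer tokens than the 5k one.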

---

7.


● Suppose now you retain only those rows in your dataset that consist of greater than or equal to 150 words but fewer than or equal to 400 words, called Filtered_data.

● Download the "bert-base-uncased" pretrained auto-tokenizer, and tokenize the Filtered_data, and split it into train and test, where test size is 1% of the dataset.

● Use an appropriate DataCollator and DataLoader (from torch) over the train split of the tokenized data above to create batches of size 100.

● Suppose you use DistilBertConfig, DistilBertForMaskedLM for further tasks

How many batches of data exist in the DataLoader?

In [45]:
from transformers import AutoTokenizer
bbu = AutoTokenizer.from_pretrained('bert-base-uncased')



In [80]:
def _tokenize(sample):
    enc = bbu(sample['text'], truncation=True, padding=True)
    return {'input_ids': enc['input_ids'], 'attention_mask': enc['attention_mask']}

tokenized_filtered = filtered_data.map(_tokenize, remove_columns=filtered_data.column_names, num_proc=3)

Map (num_proc=3):   0%|          | 0/70519 [00:00<?, ? examples/s]

In [81]:
splitted = tokenized_filtered.train_test_split(test_size=0.01, seed=123)
splitted

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 69813
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 706
    })
})

In [54]:
from transformers import DataCollatorForLanguageModeling
from torch.utils.data import DataLoader

In [82]:
collator = DataCollatorForLanguageModeling(bbu, mlm=True)
loader = DataLoader(dataset=splitted['train'], collate_fn=collator, batch_size=100)

In [86]:
n_batches = 0
for batch in loader:
    n_batches += 1
print(n_batches)

699
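The batch count can be cross-checked with plain arithmetic. The sketch below assumes that the split rounds the test size up and the train size down (an assumption about the rounding rule, though it reproduces the 706 / 69813 split shown above) and that the final partial batch is kept, which is the `DataLoader` default (`drop_last=False`).

```python
import math

n_filtered = 70519    # rows in filtered_data, from the notebook output
test_fraction = 0.01
batch_size = 100

# Assumed rounding rule: test size rounded up, train size rounded down.
n_test = math.ceil(n_filtered * test_fraction)
n_train = math.floor(n_filtered * (1 - test_fraction))

# With drop_last=False the last, smaller batch still counts as a batch.
n_batches = math.ceil(n_train / batch_size)
print(n_test, n_train, n_batches)  # 706 69813 699
```

In the notebook, `len(loader)` should report the same number without iterating over every batch.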


8. Which among the following statements is true?

- [ ] ReLU activation is used in DistilBert
- [x] The default configuration has max_position_embeddings set to 512
- [x] The default configuration has 6 transformer layers
- [ ] Dropout is not used in self-attention layers of DistilBert

In [87]:
from transformers import DistilBertConfig, DistilBertForMaskedLM

In [91]:
db_model = DistilBertForMaskedLM(DistilBertConfig())

In [92]:
# 9. What is the total number of parameters of the default configuration of
# DistilBert? (Write your answer in millions)
n_params = 0
for p in db_model.parameters():
    n_params += p.numel()
print(n_params)

66985530


In [93]:
n_params / 1_000_000

66.98553

In [94]:
DistilBertConfig()

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.44.2",
  "vocab_size": 30522
}
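The parameter total can be reproduced by hand from the configuration values printed above. This is a sketch under two assumptions: each transformer layer consists of four attention projections, a two-layer feed-forward block and two layer norms, and the masked-LM projector shares its weight with the word embeddings, so only its bias is counted.

```python
# Values copied from the default DistilBertConfig printed above.
vocab_size, dim, hidden_dim = 30522, 768, 3072
n_layers, max_pos = 6, 512

def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # scale and shift vectors

embeddings = vocab_size * dim + max_pos * dim + layer_norm
attention = 4 * linear(dim, dim)  # query, key, value, output projections
feed_forward = linear(dim, hidden_dim) + linear(hidden_dim, dim)
per_layer = attention + feed_forward + 2 * layer_norm

# Masked-LM head: dense transform, layer norm, and the projector bias.
# Assumption: the projector weight is tied to the word embeddings,
# so it is not counted a second time.
mlm_head = linear(dim, dim) + layer_norm + vocab_size

total = embeddings + n_layers * per_layer + mlm_head
print(total, round(total / 1e6, 2))  # 66985530 66.99
```

The result matches the count obtained by summing `p.numel()` over the model's parameters, which suggests the breakdown above is a reasonable reading of the architecture.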

---

10. Consider rest of the configurations to remain constant, and make three variations of DistilBertConfig as described below:

    a. Config1 : max_position_embeddings = 256

    b. Config2 : max_position_embeddings = 512

    c. Config3 : max_position_embeddings = 1024


Let the total number of parameters in each configuration be equal to Param1, Param2 and Param3 respectively.

Which of the following statements is true?
- [ ] Param1 = Param2 = Param3
- [ ] Param1 > Param2 > Param3
- [x] Param1 < Param2 < Param3
- [ ] absolute(Param1 - Param2) > absolute (Param2 - Param3)
- [x] absolute(Param1 - Param2) < absolute (Param2 - Param3)

In [95]:
cfg1 = DistilBertConfig(max_position_embeddings=256)
cfg2 = DistilBertConfig(max_position_embeddings=512)
cfg3 = DistilBertConfig(max_position_embeddings=1024)

for i, config in enumerate([cfg1, cfg2, cfg3], start=1):
    _model = DistilBertForMaskedLM(config)
    n_params = 0
    for p in _model.parameters():
        n_params += p.numel()
    print(i, n_params)

1 66788922
2 66985530
3 67378746


In [96]:
param1 = 66788922
param2 = 66985530
param3 = 67378746

In [97]:
abs(param1 - param2) < abs(param2 - param3)

True
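The gaps between the three totals follow from the size of the position-embedding table alone, assuming it is the only component that depends on `max_position_embeddings` (position embeddings are learned here, since `sinusoidal_pos_embds` is false). A short arithmetic check:

```python
dim = 768  # embedding width in the default DistilBertConfig

def position_params(max_position_embeddings):
    """Parameters in a learned position-embedding table."""
    return max_position_embeddings * dim

delta_12 = position_params(512) - position_params(256)
delta_23 = position_params(1024) - position_params(512)
print(delta_12, delta_23)  # 196608 393216

# Compare with the totals printed by the notebook.
assert delta_12 == 66985530 - 66788922
assert delta_23 == 67378746 - 66985530
```

Doubling the maximum position doubles the added rows, so the second gap is exactly twice the first.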

---

11. How many dimensions does the tensor output by the base Transformer model have, and what are they?

In [98]:
from transformers import AutoModel

In [100]:
DistilBertForMaskedLM?

Init signature:
DistilBertForMaskedLM(
    config: transformers.configuration_utils.PretrainedConfig,
)
Docstring:
DistilBert Model with a `masked language modeling` head on top.

This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

Parameters:
    config ([`DistilBertConfig`]): Model configuration class with all the parameters of the model.
        Initializing with a confi