In [1]:
!pip install transformers[torch]


In [2]:
!pip install sentencepiece



## 1. Understanding BERT and XLM-RoBERTa
**Objective**: Learn how transformer models work and their role in NLP tasks.

**Instructions**:

- Read through the descriptions of BERT and XLM-RoBERTa.
  - BERT (Bidirectional Encoder Representations from Transformers): One of the first widely adopted pre-trained Transformer encoders. The original model was pre-trained on English text (BooksCorpus and English Wikipedia) with two objectives: Masked Language Modeling (MLM), which predicts tokens that were hidden in the input, and Next Sentence Prediction (NSP), which predicts whether the second sentence actually follows the first in the source text.

  - XLM-RoBERTa: A multilingual model built on the RoBERTa training recipe. It was pre-trained on a large filtered CommonCrawl corpus covering 100 languages. Like RoBERTa, it drops the Next Sentence Prediction task and trains with the MLM objective only, using dynamic masking (the masked positions are re-sampled each time a sequence is seen instead of being fixed once during preprocessing).

- Understand how these models process text using tokenization.
  - BERT: WordPiece Algorithm
  When BERT encounters a word that is not in its vocabulary, WordPiece breaks it down into sub-tokens that are. To indicate that a sub-token continues the current word rather than starting a new one, it adds a `##` prefix. If no combination of sub-tokens matches, the word is mapped to `[UNK]`.

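  - XLM-RoBERTa: SentencePiece (unigram language model)
  XLM-RoBERTa uses a SentencePiece tokenizer trained with a unigram language model rather than BPE. SentencePiece works directly on raw Unicode text and treats whitespace as an ordinary symbol, so no language-specific pre-tokenization is needed; this makes it well suited to languages without explicit word boundaries (like Chinese). The large shared vocabulary makes unknown tokens rare, although an `<unk>` token still exists. Instead of `##`, it uses the meta-symbol `▁` (U+2581, which looks like an underscore but is a different character) to mark a piece that begins a new word.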

- Learn about different pre-trained versions of these models and their characteristics.
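
The WordPiece splitting described above can be sketched as a greedy longest-match-first search. This is a simplified illustration, not the library's implementation: the function name `wordpiece_tokenize` and the toy vocabulary are invented for this example, and real tokenizers also apply text normalization and pre-tokenization first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this example (not BERT's real vocabulary).
toy_vocab = {"To", "##ken", "##ization", "is", "ama", "##zing", "!"}

print(wordpiece_tokenize("Tokenization", toy_vocab))  # ['To', '##ken', '##ization']
print(wordpiece_tokenize("amazing", toy_vocab))       # ['ama', '##zing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Note that only the first piece of a word appears without the `##` prefix, which is why the lookup adds the prefix whenever `start > 0`.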

In [3]:
from transformers import BertTokenizer, XLMRobertaTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

text = "Tokenization is amazing!"

# BERT tokenization
bert_tokens = bert_tokenizer.tokenize(text)
# XLM-R tokenization
xlmr_tokens = xlmr_tokenizer.tokenize(text)

print(f"BERT tokens: {bert_tokens}")
print(f"XLM-R tokens: {xlmr_tokens}")

BERT tokens: ['To', '##ken', '##ization', 'is', 'ama', '##zing', '!']
XLM-R tokens: ['▁To', 'ken', 'ization', '▁is', '▁amazing', '!']
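
To see what the two markers in this output encode, the sketch below rebuilds a string from each token list. The helper names `join_wordpiece` and `join_sentencepiece` are hypothetical and only approximate what the tokenizers' own decoding does.

```python
def join_wordpiece(tokens):
    """Merge '##' continuation pieces into the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


def join_sentencepiece(tokens):
    """Concatenate pieces; the meta-symbol U+2581 stands for a preceding space."""
    return "".join(tokens).replace("\u2581", " ").strip()


# Token lists copied from the cell output above.
bert_pieces = ['To', '##ken', '##ization', 'is', 'ama', '##zing', '!']
xlmr_pieces = ['▁To', 'ken', 'ization', '▁is', '▁amazing', '!']

print(join_wordpiece(bert_pieces))      # Tokenization is amazing !
print(join_sentencepiece(xlmr_pieces))  # Tokenization is amazing!
```

The WordPiece result has a space before `!` because the token list no longer records whether punctuation was attached to the previous word. SentencePiece keeps that information in the `▁` marker, so the original spacing is recovered.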


## 2. Tokenizing Text
**Objective**: Understand how to tokenize text using pre-trained tokenizers.

**Instructions**:

- Use the BertTokenizer and XLMRobertaTokenizer to convert sentences into tokenized input.
  - Done in the previous exercise.

- Explore the different token types, such as input_ids, attention_mask, and labels.
  - input_ids: These are the numerical representations of tokens. Each word or sub-token is converted into a unique index from the model's predefined vocabulary. This sequence also includes special tokens, such as [CLS] (to mark the beginning of a sequence) and [SEP] (to separate sentences).

  - attention_mask: A binary mask consisting of 1s and 0s. It tells the model which tokens should be attended to (1) and which are "padding" (0). Padding is used to ensure all sequences in a batch have the same length.

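  - labels: In training tasks, these represent the ground truth (correct answers). They are not produced by the tokenizer; you add them yourself. For sequence classification such as NLI, the label is a single class index per example. For Masked Language Modeling (MLM), labels is a copy of the original input_ids in which every position that was not masked is set to -100 so the loss ignores it; the `[MASK]` token is placed in input_ids, not in labels.

  - token_type_ids (used by BERT): Also known as "segment IDs," these mark each token as belonging to the first (0) or second (1) sentence of a pair (e.g., in Question Answering or Natural Language Inference tasks). XLM-RoBERTa does not use segment IDs, and its tokenizer does not return them by default.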
  
- Experiment with single-sentence and two-sentence tokenization.
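
The relationship between input_ids and labels in MLM can be sketched in plain Python. This is a minimal illustration under stated assumptions: `mask_for_mlm` is a hypothetical helper, the masked position is chosen by hand instead of sampled at random, and `MASK_ID = 103` is assumed to be the `[MASK]` ID of `bert-base-uncased` (check `tokenizer.mask_token_id`).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_for_mlm(input_ids, positions, mask_id):
    """Return (masked_input_ids, labels) for the chosen positions."""
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos in positions:
        labels[pos] = input_ids[pos]  # the label keeps the original token
        masked[pos] = mask_id         # the input hides it
    return masked, labels


# IDs of "[CLS] i love learning nl ##p . [SEP]" as produced by bert-base-uncased
# for sentence1 in this notebook.
ids = [101, 1045, 2293, 4083, 17953, 2361, 1012, 102]
MASK_ID = 103  # assumption, see the note above

masked_ids, mlm_labels = mask_for_mlm(ids, positions=[2], mask_id=MASK_ID)
print(masked_ids)  # [101, 1045, 103, 4083, 17953, 2361, 1012, 102]
print(mlm_labels)  # [-100, -100, 2293, -100, -100, -100, -100, -100]
```

Only position 2 contributes to the loss: the model sees `[MASK]` there and is trained to predict 2293 ("love").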

In [4]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

sentence1 = "I love learning NLP."
sentence2 = "It is very fascinating."

# --- Single Sentence Tokenization - BERT---
print("--- Single Sentence (BERT) ---")
inputs_single_bert = bert_tokenizer.encode_plus(sentence1, padding=True, truncation=True, return_tensors="pt")

print(f"Input IDs: {inputs_single_bert['input_ids']}")
print(f"Tokens: {bert_tokenizer.convert_ids_to_tokens(inputs_single_bert['input_ids'][0])}")
print(f"Attention Mask: {inputs_single_bert['attention_mask']}\n")

# --- Single Sentence Tokenization - XLM-RoBERTa---
print("--- Single Sentence (XLM-RoBERTa) ---")
inputs_single_xlmr = xlmr_tokenizer.encode_plus(sentence1, padding=True, truncation=True, return_tensors="pt")

print(f"Input IDs: {inputs_single_xlmr['input_ids']}")
print(f"Tokens: {xlmr_tokenizer.convert_ids_to_tokens(inputs_single_xlmr['input_ids'][0])}")
print(f"Attention Mask: {inputs_single_xlmr['attention_mask']}\n")

# --- Two-Sentence Tokenization - BERT ---
print("--- Two-Sentence Pair (BERT) ---")
inputs_pair_bert = bert_tokenizer.encode_plus(sentence1, sentence2, padding=True, truncation=True, return_tensors="pt")

print(f"Input IDs (Pair): {inputs_pair_bert['input_ids']}")
print(f"Note: Look for the separator IDs in the sequence.")

# --- Decoding for clarity - BERT  ---
decoded_bert = bert_tokenizer.decode(inputs_pair_bert['input_ids'][0])
print(f"Decoded string: {decoded_bert}")


# --- Two-Sentence Tokenization - XLM-RoBERTa ---
print("--- Two-Sentence Pair (XLM-RoBERTa) ---")
inputs_pair_xlmr = xlmr_tokenizer.encode_plus(sentence1, sentence2, padding=True, truncation=True, return_tensors="pt")

print(f"Input IDs (Pair): {inputs_pair_xlmr['input_ids']}")
print(f"Note: Look for the separator IDs in the sequence.")

# --- Decoding for clarity - XLM-RoBERTa  ---
decoded_xlmr = xlmr_tokenizer.decode(inputs_pair_xlmr['input_ids'][0])
print(f"Decoded string: {decoded_xlmr}")


--- Single Sentence (BERT) ---
Input IDs: tensor([[  101,  1045,  2293,  4083, 17953,  2361,  1012,   102]])
Tokens: ['[CLS]', 'i', 'love', 'learning', 'nl', '##p', '.', '[SEP]']
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1]])

--- Single Sentence (XLM-RoBERTa) ---
Input IDs: tensor([[    0,    87,  5161, 52080,   541, 37352,     5,     2]])
Tokens: ['<s>', '▁I', '▁love', '▁learning', '▁N', 'LP', '.', '</s>']
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1]])

--- Two-Sentence Pair (BERT) ---
Input IDs (Pair): tensor([[  101,  1045,  2293,  4083, 17953,  2361,  1012,   102,  2009,  2003,
          2200, 17160,  1012,   102]])
Note: Look for the separator IDs in the sequence.
Decoded string: [CLS] i love learning nlp. [SEP] it is very fascinating. [SEP]
--- Two-Sentence Pair (XLM-RoBERTa) ---
Input IDs (Pair): tensor([[     0,     87,   5161,  52080,    541,  37352,      5,      2,      2,
           1650,     83,   4552, 102919,   1916,      5,      2]])
Note: Look for the separator IDs in the sequence.
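
The pair layouts visible in these outputs differ between the two models: BERT uses `[CLS] A [SEP] B [SEP]`, while XLM-RoBERTa uses `<s> A </s></s> B </s>` with a doubled separator. The sketch below rebuilds both sequences by hand; the function names are hypothetical, and the default special-token IDs are the ones shown in the outputs above.

```python
def build_bert_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """[CLS] A [SEP] B [SEP], plus segment IDs (0 for A, 1 for B)."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


def build_xlmr_pair(ids_a, ids_b, bos_id=0, eos_id=2):
    """<s> A </s></s> B </s>; no segment IDs are used."""
    return [bos_id] + ids_a + [eos_id, eos_id] + ids_b + [eos_id]


# Content token IDs copied from the outputs above (special tokens removed).
bert_a = [1045, 2293, 4083, 17953, 2361, 1012]
bert_b = [2009, 2003, 2200, 17160, 1012]
xlmr_a = [87, 5161, 52080, 541, 37352, 5]
xlmr_b = [1650, 83, 4552, 102919, 1916, 5]

bert_ids, bert_segments = build_bert_pair(bert_a, bert_b)
xlmr_ids = build_xlmr_pair(xlmr_a, xlmr_b)

print(bert_ids)
print(bert_segments)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(xlmr_ids)
```

The resulting ID lists match the `Input IDs (Pair)` tensors printed by the tokenizers.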

## 3. Preparing Input Data for the Model
**Objective**: Format input data correctly for transformer models.

**Instructions**:

- Ensure that input sentences are padded and possibly truncated to max_length.
- Understand and set special tokens such as `<s>` and `</s>`.
- Learn about attention_mask and how it helps the model ignore padding tokens.
  - The attention_mask is a binary tensor that tells the model which tokens are meaningful and which are just "placeholders" (padding).

  - Why we need it: A batch is passed to the model as a single rectangular tensor, so every sequence in it must have the same length. Since sentences have different lengths, we add padding tokens (like `[PAD]` or `<pad>`) to the shorter ones.

  - How it works:
    - Value 1: Indicates a real token. The model performs calculations and pays "attention" to it.
    - Value 0: Indicates a padding token. Its attention scores are masked out, so other tokens give it effectively zero weight during self-attention.

  - Impact: The mask does not save computation, because padded positions are still processed. Its purpose is correctness: without it, the "noise" from padding tokens would leak into the representations of the real tokens.
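
The effect of the mask on attention weights can be shown with a few numbers. This is a minimal sketch for a single query position with made-up scores; `masked_softmax` is a hypothetical helper, and real implementations typically add a large negative value to the masked scores inside batched tensor operations instead of looping in Python.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores, with positions where the mask is 0 excluded."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw attention scores for 3 real tokens followed by 2 padding tokens.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]

with_mask = masked_softmax(scores, mask)
without_mask = masked_softmax(scores, [1] * len(scores))

print([round(w, 3) for w in with_mask])     # the last two weights are exactly 0.0
print([round(w, 3) for w in without_mask])  # padding takes most of the weight
```

The sketch assumes at least one position is unmasked; with an all-zero mask the normalization would be undefined.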

In [5]:
bert_tok = BertTokenizer.from_pretrained('bert-base-uncased')
xlmr_tok = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# two sentences of different lengths; only s1 is encoded below, s2 is left for experimenting with truncation
s1 = "Hello world."
s2 = "This is a very long sentence that we might want to truncate if it exceeds our limit."

# the max_length
MAX_LEN = 10

#  BERT
print("--- BERT (with Padding & Truncation) ---")
bert_input = bert_tok.encode_plus(
    s1,
    add_special_tokens=True,
    max_length=MAX_LEN,
    padding='max_length',
    truncation=True,
    return_tensors="pt"
)

print(f"Input IDs: {bert_input['input_ids']}")
print(f"Tokens: {bert_tok.convert_ids_to_tokens(bert_input['input_ids'][0])}")
print(f"Attention Mask: {bert_input['attention_mask']}")
print(f"Special Tokens Map: {bert_tok.special_tokens_map}\n")
print(f"Vocab Size: {bert_tok.vocab_size}")

# XLM-RoBERTa
print(f"{'=' * 70}")
print("--- XLM-RoBERTa (with Padding & Truncation) ---")
xlmr_input = xlmr_tok.encode_plus(
    s1,
    add_special_tokens=True,
    max_length=MAX_LEN,
    padding='max_length',
    truncation=True,
    return_tensors="pt"
)

print(f"Input IDs: {xlmr_input['input_ids']}")
print(f"Tokens: {xlmr_tok.convert_ids_to_tokens(xlmr_input['input_ids'][0])}")
print(f"Attention Mask: {xlmr_input['attention_mask']}")
print(f"Special Tokens Map: {xlmr_tok.special_tokens_map}")
print(f"Vocab Size: {xlmr_tok.vocab_size}")

--- BERT (with Padding & Truncation) ---
Input IDs: tensor([[ 101, 7592, 2088, 1012,  102,    0,    0,    0,    0,    0]])
Tokens: ['[CLS]', 'hello', 'world', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Attention Mask: tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
Special Tokens Map: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

Vocab Size: 30522
======================================================================
--- XLM-RoBERTa (with Padding & Truncation) ---
Input IDs: tensor([[    0, 35378,  8999,     5,     2,     1,     1,     1,     1,     1]])
Tokens: ['<s>', '▁Hello', '▁world', '.', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
Attention Mask: tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
Special Tokens Map: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}
Vocab Size: 250002
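
What `padding='max_length'` and `truncation=True` do to a single sequence can be reproduced by hand. The sketch below is a simplified stand-in for the tokenizer's behaviour, with a hypothetical helper name; the special-token and padding IDs are the ones visible in the outputs above. Note that XLM-RoBERTa pads with ID 1, while ID 0 is its `<s>` token.

```python
def pad_or_truncate(content_ids, max_length, cls_id, sep_id, pad_id):
    """Truncate the content, wrap it in special tokens, then pad to max_length."""
    room = max_length - 2  # two slots are reserved for the special tokens
    kept = content_ids[:room]
    input_ids = [cls_id] + kept + [sep_id]
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return input_ids + [pad_id] * n_pad, attention_mask


# "Hello world." without special tokens, IDs copied from the outputs above.
bert_padded, bert_mask = pad_or_truncate([7592, 2088, 1012], 10, cls_id=101, sep_id=102, pad_id=0)
xlmr_padded, xlmr_mask = pad_or_truncate([35378, 8999, 5], 10, cls_id=0, sep_id=2, pad_id=1)
print(bert_padded, bert_mask)
print(xlmr_padded, xlmr_mask)

# A sequence that is too long: 12 made-up content IDs are cut down to 8.
long_padded, long_mask = pad_or_truncate(list(range(1000, 1012)), 10, cls_id=101, sep_id=102, pad_id=0)
print(long_padded, long_mask)
```

Truncation removes content tokens, not the closing special token, so the truncated sequence still ends with `[SEP]` and its attention mask contains only 1s.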


## 4. Loading and Exploring the Dataset
**Objective**: Load the dataset and explore its structure.

**Instructions**:

- Load the training and testing data from CSV files.
- Display the first few rows to understand its structure.
- Identify the columns needed for training the model.

In [10]:
import pandas as pd

# Load the data from CSV files
train_df = pd.read_csv('train.csv')

# 1. Display the first few rows to understand its structure
print("--- Dataset Preview ---")
display(train_df.head())

# 2. Check the size of the dataset (Rows, Columns)
# This confirms the total number of training examples
print(f"Dataset Shape: {train_df.shape}")

# 3. Identify the columns needed for training
# For NLI, 'premise' and 'hypothesis' are the inputs, 'label' is the target
print(f"Columns in the dataset: {train_df.columns.tolist()}")

# Analyze languages (useful for multilingual models like XLM-R)
print(f"Number of unique languages: {train_df['language'].nunique()}")

--- Dataset Preview ---


|   | id | premise | hypothesis | lang_abv | language | label |
|---|---|---|---|---|---|---|
| 0 | 5130fd2cb5 | and these comments were considered in formulat... | The rules developed in the interim were put to... | en | English | 0 |
| 1 | 5b72532a0b | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | en | English | 2 |
| 2 | 3931fbe82a | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | fr | French | 0 |
| 3 | 5622f0c60b | you know they can't really defend themselves l... | They can't defend themselves because of their ... | en | English | 0 |
| 4 | 86aaa48b45 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร | th | Thai | 1 |


Dataset Shape: (12120, 6)
Columns in the dataset: ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label']
Number of unique languages: 15
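
A useful next check is how the labels and languages are distributed, since that affects both stratification and evaluation. The sketch below runs on the five preview rows only, using the standard library so it works without `train.csv`; on the full DataFrame the equivalent calls would be `train_df['label'].value_counts()` and `train_df['language'].value_counts()`.

```python
from collections import Counter

# (language, label) pairs copied from the five preview rows above.
preview = [
    ("English", 0),
    ("English", 2),
    ("French", 0),
    ("English", 0),
    ("Thai", 1),
]

label_counts = Counter(label for _, label in preview)
language_counts = Counter(language for language, _ in preview)

print(f"Label counts: {dict(label_counts)}")
print(f"Language counts: {dict(language_counts)}")
```

Five rows say nothing about the real class balance; the point is only the shape of the check.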


## 5. Creating Cross-Validation Folds
**Objective**: Implement k-fold cross-validation for training.

**Instructions**:

- Use StratifiedKFold to create 5 training-validation splits.
- Ensure that each fold maintains the same label distribution.
- Store the training and validation splits in separate lists.

In [12]:
from sklearn.model_selection import StratifiedKFold

# 1. Initialize StratifiedKFold; shuffle=True with a fixed random_state gives reproducible shuffled folds
# n_splits=5 creates the five training-validation splits
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Separate lists to store the training and validation splits
train_indices_list = []
val_indices_list = []

# We use 'label' to ensure each fold maintains the same label distribution
X = train_df[['premise', 'hypothesis']].values
y = train_df['label'].values

# 2. Use kf.split() to create the 5 training-validation splits
for train_idx, val_idx in kf.split(X, y):
    # 3. Store the splits in separate lists
    train_indices_list.append(train_idx)
    val_indices_list.append(val_idx)

# Output results to verify
print(f"Number of folds: {len(train_indices_list)}")
print(f"First fold - Train size: {len(train_indices_list[0])}, Val size: {len(val_indices_list[0])}")

Number of folds: 5
First fold - Train size: 9696, Val size: 2424
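
To make "maintains the same label distribution" concrete, the sketch below builds stratified folds by hand and counts the labels in each one. It is a simplified illustration with made-up labels, not scikit-learn's actual algorithm; `stratified_folds` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Deal the shuffled indices of each class round-robin into n_splits folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[i % n_splits].append(idx)
    return folds


# Made-up, imbalanced labels: 50% class 0, 30% class 1, 20% class 2.
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
folds = stratified_folds(toy_labels)

for k, val_idx in enumerate(folds):
    counts = sorted(Counter(toy_labels[i] for i in val_idx).items())
    print(f"Fold {k}: size={len(val_idx)}, label counts={counts}")
# Every fold prints size=20, label counts=[(0, 10), (1, 6), (2, 4)]
```

Each fold serves once as the validation set while the other four form the training set. The same per-fold count can be run on the real splits, for example with `Counter(y[val_indices_list[0]])`.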
