# Daily Challenge: Preprocess & fine-tune transformer-based models

## 1. Understanding BERT and XLM-RoBERTa

Objective: Learn how transformer models work and their role in NLP tasks.

Instructions:

- Read through the descriptions of BERT and XLM-RoBERTa.
- Understand how these models process text using tokenization.
- Learn about different pre-trained versions of these models and their characteristics.


In [1]:
from transformers import BertTokenizer, XLMRobertaTokenizer

BERT:

- Processes text by looking at both left and right context (bidirectional).

- Pre-trained using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

- Good for monolingual tasks (originally trained on English).

XLM-RoBERTa:

- Based on RoBERTa, but trained on 100+ languages.

- Drops NSP, uses dynamic masking, and is trained on multilingual CommonCrawl data.

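- Well suited to multilingual understanding tasks (e.g., cross-lingual classification, multilingual named-entity recognition). As an encoder-only model it does not generate text, so it is not a translation model on its own.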
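The masking idea behind MLM can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not code from either model's training pipeline: the helper `mask_tokens`, the toy vocabulary, and the whitespace tokenization are all assumptions. It follows the commonly described BERT recipe (select about 15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). Dynamic masking, as used by RoBERTa-style training, roughly amounts to re-sampling the mask each time a sentence is seen instead of fixing it once.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = tok                   # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK              # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab) # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels

# Toy input: whitespace tokens stand in for real subword tokens (assumption).
tokens = ("it is a truth universally acknowledged that a single man "
          "in possession of a good fortune must be in want of a wife").split()
vocab = sorted(set(tokens))

# Static masking: sample once and reuse the same view every epoch.
static_view = mask_tokens(tokens, vocab, random.Random(0))

# Dynamic masking: sample a fresh mask each epoch.
dynamic_views = [mask_tokens(tokens, vocab, random.Random(epoch)) for epoch in range(3)]

for epoch, (masked, _) in enumerate(dynamic_views):
    print(f"epoch {epoch}:", " ".join(masked))
```
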

## 2. Tokenizing Text

Objective: Understand how to tokenize text using pre-trained tokenizers.

Instructions:

  - Use the BertTokenizer and XLMRobertaTokenizer to convert sentences into tokenized input.
  - Explore the different tokenizer outputs, such as input_ids, attention_mask, and token_type_ids.
  - Experiment with single-sentence and two-sentence tokenization.

Functions to use:

    tokenizer.encode_plus()
    tokenizer.decode()


In [2]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
xlm_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

text = "I love cheesecake."
bert_tokens = bert_tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    return_tensors="pt")

print("BERT Token IDs:", bert_tokens["input_ids"])
print("BERT Attention Mask:", bert_tokens["attention_mask"])

xlm_tokens = xlm_tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    return_tensors="pt"
)
print("XLM-RoBERTa Token IDs:", xlm_tokens["input_ids"])
print("XLM-RoBERTa Attention Mask:", xlm_tokens["attention_mask"])


BERT Token IDs: tensor([[  101,  1045,  2293,  8808, 17955,  1012,   102]])
BERT Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1]])
XLM-RoBERTa Token IDs: tensor([[     0,     87,   5161,  96967, 107111,      5,      2]])
XLM-RoBERTa Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1]])


In [3]:
decoded = bert_tokenizer.decode(bert_tokens['input_ids'][0])
print("BERT Decoded:", decoded)

decoded = xlm_tokenizer.decode(xlm_tokens['input_ids'][0])
print("XLMRoBERTa Decoded:", decoded)

BERT Decoded: [CLS] i love cheesecake. [SEP]
XLMRoBERTa Decoded: <s> I love cheesecake.</s>
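The BERT output above has seven IDs for a sentence with only four surface units ("i", "love", "cheesecake", "."), plus `[CLS]` and `[SEP]`, so "cheesecake" was split into two subword pieces. The sketch below shows the greedy longest-match-first idea usually described for WordPiece. The function `wordpiece` and the tiny vocabulary are hypothetical; the real `bert-base-uncased` vocabulary has 30,522 entries and its exact split of "cheesecake" is not shown in the output above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                      # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"i", "love", "cheese", "##cake", "."}

for word in ["i", "love", "cheesecake", ".", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```
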


In [4]:
bert_tokens = bert_tokenizer.encode_plus("I read a book.", "He watched three movies.", return_tensors="pt")

In [5]:
decoded = bert_tokenizer.decode(bert_tokens['input_ids'][0])
print("BERT Decoded:", decoded)

BERT Decoded: [CLS] i read a book. [SEP] he watched three movies. [SEP]
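For a sentence pair, BERT's tokenizer also returns `token_type_ids`, which mark each position as belonging to the first or second segment. The layout can be sketched by hand as below. The helper `build_pair` is hypothetical and the token strings are written out manually rather than produced by the tokenizer, so treat this as an illustration of the layout only.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out a BERT-style sentence pair: [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # no padding here, so every position is real
    return tokens, token_type_ids, attention_mask

pair_tokens, pair_type_ids, pair_mask = build_pair(
    ["i", "read", "a", "book", "."],
    ["he", "watched", "three", "movies", "."],
)

for tok, seg in zip(pair_tokens, pair_type_ids):
    print(f"{tok:10s} segment {seg}")
```

With the real tokenizer, the same information should be available as `bert_tokens["token_type_ids"]` after the two-sentence `encode_plus` call above.
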


## 3. Preparing Input Data for the Model

Objective: Format input data correctly for transformer models.

Instructions:

  - Ensure that input sentences are padded or truncated to the desired length.
  - Understand the special tokens each tokenizer adds: `[CLS]` and `[SEP]` for BERT, `<s>` and `</s>` for XLM-RoBERTa.
  - Learn about attention_mask and how it helps the model ignore padding tokens.
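To make padding, truncation, and the attention mask concrete, here is a minimal pure-Python sketch of what happens to a list of token IDs. The helper `pad_or_truncate` is hypothetical and is not the library's implementation. The IDs 101 and 102 are the `[CLS]` and `[SEP]` IDs seen in the earlier BERT output; using 0 as the `[PAD]` ID is an assumption based on `bert-base-uncased`.

```python
def pad_or_truncate(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], then cut the content if needed.
    kept = content_ids[: max_length - 2]
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)   # 1 = real token the model should attend to

    # Fill the remaining positions with padding and mask them out.
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad           # 0 = padding the model should ignore
    return input_ids, attention_mask

# Content IDs for "i love cheesecake ." taken from the earlier BERT output.
content = [1045, 2293, 8808, 17955, 1012]

padded_ids, padded_mask = pad_or_truncate(content, max_length=10)
print(padded_ids)   # [101, 1045, 2293, 8808, 17955, 1012, 102, 0, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

short_ids, short_mask = pad_or_truncate(content, max_length=5)
print(short_ids)    # [101, 1045, 2293, 8808, 102]
print(short_mask)   # [1, 1, 1, 1, 1]
```
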

Functions to use:

    tokenizer.encode_plus()
    tokenizer.special_tokens_map
    tokenizer.vocab_size


In [6]:
tokens = bert_tokenizer.encode_plus(
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    max_length=16,
    truncation=True,
    padding='max_length',
    return_attention_mask=True,
    return_tensors="pt"
)

print("Special Tokens Map:", bert_tokenizer.special_tokens_map)
print("Vocab Size:", bert_tokenizer.vocab_size)

Special Tokens Map: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
Vocab Size: 30522


## 4. Loading and Exploring the Dataset

Objective: Load the dataset and explore its structure.

Instructions:

  - Load the training and testing data from CSV files.
  - Display the first few rows to understand its structure.
  - Identify the columns needed for training the model.

Functions to use:

    pd.read_csv()
    df.head()
    df.shape


In [8]:
import pandas as pd
df = pd.read_csv("sample_submission.csv")
df.head()

           id  prediction
0  c6d58c3f69           1
1  cefcc82292           1
2  e98005252c           1
3  58518c10ba           1
4  c32b0d16df           1


In [9]:
df.shape

(5195, 2)
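The file loaded above only has `id` and `prediction` columns. A training file would presumably also carry text and label columns and can be explored the same way. The sketch below uses a small in-memory CSV so that it runs without any files; the column names `id`, `text`, and `label` and all rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical training data; a real file would be read with pd.read_csv("<path>").
csv_text = """id,text,label
a1,I love cheesecake.,1
a2,I read a book.,0
a3,He watched three movies.,0
a4,Cheesecake is great.,1
"""

train_df = pd.read_csv(io.StringIO(csv_text))

print(train_df.head(2))                   # first rows, to see the structure
print(train_df.shape)                     # (number of rows, number of columns)
print(train_df["label"].value_counts())   # label balance, relevant for stratified folds
```
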


## 5. Creating Cross-Validation Folds

Objective: Implement k-fold cross-validation for training.

Instructions:

  - Use StratifiedKFold to create 5 training-validation splits.
  - Ensure that each fold maintains the same label distribution.
  - Store the training and validation splits in separate lists.

Functions to use:

    from sklearn.model_selection import StratifiedKFold
    kf.split()
    shuffle()




In [10]:
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import shuffle

In [11]:
df = shuffle(df, random_state=42)
kf = StratifiedKFold(n_splits=5)

In [14]:
folds = []
for train_index, val_index in kf.split(df['id'], df['prediction']):
    train_df = df.iloc[train_index]
    val_df = df.iloc[val_index]
    folds.append((train_df, val_df))
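One way to confirm that each fold keeps the label distribution is to compare the positive-label share of every validation split against the full dataset. The sketch below does this on synthetic data (100 rows, 20% positive); the frame `toy` and its `label` column are assumptions for illustration, not the challenge data.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 80 rows with label 0 and 20 rows with label 1.
toy = pd.DataFrame({
    "id": [f"row{i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

# shuffle=True lets the splitter shuffle within each class itself.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_stats = []
for fold, (train_idx, val_idx) in enumerate(skf.split(toy["id"], toy["label"])):
    val_share = float(toy["label"].iloc[val_idx].mean())  # share of label 1 in this fold
    fold_stats.append((len(train_idx), len(val_idx), val_share))
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)} positive share={val_share:.2f}")

print("overall positive share:", toy["label"].mean())
```

Passing `shuffle=True` with a `random_state` is an alternative to shuffling the DataFrame beforehand, as the cells above do; either approach should give reproducible folds.
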