# **Natural Language Processing**

## **Project:** Sentiment Analysis Using roBERTa and distilBERT


1. The `Transformers` library, developed by [Hugging Face](https://huggingface.co/docs/transformers/index), is an open-source library for working with state-of-the-art natural language processing (NLP) models.
2. `tqdm` takes its name from the Arabic word "taqaddum", meaning "progress". It adds progress bars to loops, which is especially useful for long-running tasks because it gives a visual indication of progress.

In [None]:
!pip install transformers scikit-learn tqdm


# 1. Imported Libraries

- `os`: Interacts with the operating system, providing functions for file and directory operations.

- `re`: Performs regular expression operations for pattern matching or string manipulation.

- `spacy`: An NLP library offering tools and pre-trained models for tasks like `tokenization` and `entity recognition`.

- `torch`: PyTorch, an open-source machine learning library that provides tensors and automatic differentiation.

- `numpy`: Supports array and matrix operations.

- `pandas`: A data manipulation library featuring data structures such as data frames for structured data analysis.

- `matplotlib`: A data visualization library that offers functions for creating various types of plots and charts.

- `BeautifulSoup`: Facilitates web scraping by parsing HTML and XML documents, allowing extraction of relevant information.

- `tqdm`: Adds progress bars to loops, providing a visual indicator of iterative process progress.

- `transformers`: A library by Hugging Face for working with pre-trained transformer models in NLP, including tokenization and model loading.

- `get_linear_schedule_with_warmup`: A function for scheduling learning rates during training, available in the Transformers library.

- `AdamW`: The Adam optimizer with decoupled weight decay. The training cells below use the PyTorch implementation, `torch.optim.AdamW`.

- `DataLoader`: A PyTorch utility for loading data in batches, enabling efficient processing of large datasets during training.

- `classification_report`, `confusion_matrix`, `accuracy_score`: Functions from scikit-learn for evaluating and analyzing classification model performance.


In [None]:
# Import necessary libraries
import os
import re
import spacy
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
from tqdm import tqdm
# AdamW is taken from torch.optim in the training cells, so it is not imported from transformers
from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    get_linear_schedule_with_warmup,
)
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


# 2. Create Directories 
We keep separate folders for the training and testing data. Each contains subfolders for positive and negative samples, which hold the subset of the dataset we extract.

1. **Directory hierarchy** 
    - `partial_data/`: This is the main directory.
       * `train/`: This is the training data directory.
            * `pos/`: This is the directory for positive training samples.
            * `neg/`: This is the directory for negative training samples.
       * `test/`: This is the testing data directory.
            * `pos/`: This is the directory for positive testing samples.
            * `neg/`: This is the directory for negative testing samples.


In [None]:
!mkdir partial_data
!mkdir partial_data/train
!mkdir partial_data/train/pos
!mkdir partial_data/train/neg
!mkdir partial_data/test
!mkdir partial_data/test/pos
!mkdir partial_data/test/neg
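
The same hierarchy can also be created from Python, which avoids errors when the cell is re-run. The following is a minimal sketch, not part of the original notebook: `make_data_dirs` is a hypothetical helper, and a temporary directory stands in for the real working directory.

```python
import os
import tempfile

def make_data_dirs(root):
    """Create <root>/{train,test}/{pos,neg}; `make_data_dirs` is a hypothetical helper."""
    created = []
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            path = os.path.join(root, split, label)
            # exist_ok=True makes the call safe to repeat
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created

# Demonstration in a throwaway directory (an assumption for this example)
demo_root = os.path.join(tempfile.mkdtemp(), "partial_data")
demo_dirs = make_data_dirs(demo_root)
print(demo_dirs)
```

Calling the helper a second time with the same root leaves the existing folders untouched.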


In [None]:
#!rm -r /kaggle/working/partial_data

# 3. Data Extracting

1. **Training Data:**
    - `2500 positive` reviews and `2500 negative` reviews are chosen for training.
        These reviews will teach the model to recognize positive and negative sentiments.
2. **Testing Data:**
    - `250 positive` and `250 negative` reviews are selected for evaluating the model's performance.

In [None]:
!cp -t /kaggle/working/partial_data/train/neg $(ls /kaggle/input/nlp-project/aclImdb/train/neg/* | head -n 2500)
!cp -t /kaggle/working/partial_data/train/pos $(ls /kaggle/input/nlp-project/aclImdb/train/pos/* | head -n 2500)


In [None]:
!cp -t /kaggle/working/partial_data/test/neg $(ls /kaggle/input/nlp-project/aclImdb/test/neg/* | head -n 250)
!cp -t /kaggle/working/partial_data/test/pos $(ls /kaggle/input/nlp-project/aclImdb/test/pos/* | head -n 250)
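
For readers who prefer to stay in Python, the copy step could be sketched as below. This is an illustration under assumptions: `copy_first_n` is a hypothetical helper, the files are taken in sorted name order to make the subset reproducible, and dummy files in temporary directories replace the real dataset.

```python
import os
import shutil
import tempfile

def copy_first_n(src_dir, dst_dir, n):
    """Copy the first n files of src_dir (in sorted name order) into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    names = sorted(os.listdir(src_dir))[:n]
    for name in names:
        shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return names

# Demonstration with five dummy review files in temporary directories
src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), "subset")
for i in range(5):
    with open(os.path.join(src, f"{i}_7.txt"), "w", encoding="utf-8") as f:
        f.write(f"review {i}")

copied = copy_first_n(src, dst, 3)
print(copied)
```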


# 4. Count the text files

* We define a function `count_files(directory_path)` that takes a directory path as its argument.

* It first checks whether the path exists. If it does not, the function returns `0`.

* Otherwise, `os.listdir(directory_path)` returns a list of the entries in the directory, and the function returns the length of that list using `len(files)`.

In [None]:
# Function to count files in a directory
def count_files(directory_path):
    if os.path.exists(directory_path):
        # Use os.listdir() to get a list of files in the directory
        files = os.listdir(directory_path)
        # Use len() to get the number of files in the directory
        return len(files)
    else:
        return 0  # Directory not found

In [None]:
# Count files in each directory and print the results
print(f"Number of files in train_positive: {count_files('/kaggle/working/partial_data/train/pos')}")
print(f"Number of files in train_negative: {count_files('/kaggle/working/partial_data/train/neg')}")
print(f"Number of files in test_positive: {count_files('/kaggle/working/partial_data/test/pos')}")
print(f"Number of files in test_negative: {count_files('/kaggle/working/partial_data/test/neg')}")


# 5. Reading the reviews 
* We define a function `load_dataset(directory)` that takes a directory as its argument.

* Each filename is checked with `filename.endswith(".txt")`, which filters out non-text files.

* `open(os.path.join(directory, filename), 'r', encoding='utf-8')` joins the directory and filename into a path and opens that file in `read` mode. Every review is appended as a string to a list, and the function returns that list.

In [None]:
def load_dataset(directory):
    data = []
    # Use os.listdir() to get a list of files in the directory
    for filename in os.listdir(directory):
        # filter out the .txt files
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                review = file.read()
                data.append(review)
    return data


In [None]:
train_pos = load_dataset('/kaggle/working/partial_data/train/pos')
train_neg = load_dataset('/kaggle/working/partial_data/train/neg')
test_pos = load_dataset('/kaggle/working/partial_data/test/pos')
test_neg = load_dataset('/kaggle/working/partial_data/test/neg')

# 6. Text preprocessing
* We define a function `preprocess_review(review)` that takes a review, expected to be a string, as its argument.

* If the review is `None` or empty, the function simply returns an empty string.

* `BeautifulSoup(review, "html.parser").get_text()` removes the `HTML tags` from the review.

* `re.sub()` then performs a regular-expression substitution: it searches the string for a pattern and replaces every match with another string.

* The pattern is `r'[^A-Za-z0-9]+'`. Inside the brackets, `^` means "not"; `A-Za-z0-9` is the character set of uppercase letters `A-Z`, lowercase letters `a-z`, and digits `0-9`; and `+` means one or more occurrences of the preceding set. Together, the pattern matches any run of special characters, punctuation, or symbols.

* Each match is replaced with a single space. Finally, the review is converted to lowercase and stripped of leading and trailing spaces.

In [None]:
def preprocess_review(review):
    # Check if the review is not empty or None
    if review is None or len(review) == 0:
        return ""
    
    # Remove HTML tags and formatting
    review = BeautifulSoup(review, "html.parser").get_text()

    # Replace special characters, punctuation, and symbols with spaces
    review = re.sub(r'[^A-Za-z0-9]+', ' ', review)

    # Convert to lowercase
    review = review.lower()

    return review.strip()  # Remove leading and trailing spaces
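
To see why the HTML tags are removed before the regular expression is applied, consider the small experiment below. It is only a sketch: `strip_tags` is a crude, hypothetical stand-in for `BeautifulSoup(...).get_text()` so that the example runs with the standard library alone, and the sample string is made up.

```python
import re

PATTERN = r'[^A-Za-z0-9]+'

def strip_tags(text):
    # Crude stand-in for BeautifulSoup(...).get_text(); an assumption for this demo only
    return re.sub(r'<[^>]+>', ' ', text)

sample = "Great movie!!!<br />10/10"

# Regex alone: the tag name "br" survives as a word
regex_only = re.sub(PATTERN, ' ', sample).lower().strip()

# Tags removed first, then the regex
tags_first = re.sub(PATTERN, ' ', strip_tags(sample)).lower().strip()

print(regex_only)   # great movie br 10 10
print(tags_first)   # great movie 10 10
```

Without the tag-removal step, tag names such as `br` would leak into the cleaned text as ordinary words.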


# 7. Calling the 'preprocess_review' function

We iterate over the raw reviews of each category (`train_pos`, `train_neg`, `test_pos`, `test_neg`) and store each preprocessed review in the list of the corresponding category.

In [None]:
# Preprocess every review, keeping one cleaned list per category.
# (Looking up each list with list.index() is slow and returns the wrong
# position when two lists compare equal, so each list is handled directly.)
cleaning_train_pos = [preprocess_review(review) for review in train_pos]
cleaning_train_neg = [preprocess_review(review) for review in train_neg]
cleaning_test_pos = [preprocess_review(review) for review in test_pos]
cleaning_test_neg = [preprocess_review(review) for review in test_neg]

# 8. Lemmatization  

**Why have we preferred lemmatization over stemming?** <br>
**Lemmatization** looks each token up in a vocabulary and reduces it to its base (dictionary) form, so there is little chance of losing information. For example, **"running" becomes "run," and "better" becomes "good."** <br>
**Stemming** also reduces words to a root form, but it does so by stripping prefixes or suffixes with fixed rules. It is suitable when the exact meaning of a word matters less, and it carries a higher chance of losing information. For example, **"running" might become "run," while "better" stays "better."**
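
The difference can be illustrated with a toy comparison. Everything in this sketch is hypothetical: `toy_stem` is a deliberately crude suffix stripper and `TOY_LEMMAS` is a three-entry lookup table. Real stemmers and spaCy's lemmatizer use far more elaborate rules and may produce different outputs.

```python
def toy_stem(word):
    """Rule-based suffix stripping; a deliberately crude, hypothetical stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lookup table standing in for a real vocabulary
TOY_LEMMAS = {"running": "run", "better": "good", "studies": "study"}

def toy_lemmatize(word):
    """Vocabulary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ("running", "better", "studies"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based version produces non-words such as "runn" and "studi" and cannot relate "better" to "good", whereas the lookup returns valid dictionary forms.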

* `spaCy` is a library designed for developers and researchers working with natural language processing. It provides facilities such as **tokenization, POS tagging, and lemmatization**.

* `spacy.load("en_core_web_sm")` loads the pre-trained English pipeline called "en_core_web_sm".

* `en_core_web_sm` is a small English pipeline trained on web text. It includes vocabulary, syntax, and NER.

* The function below first tokenizes each review, maps every token to its base form, and returns the lemmatized review as a single string.


In [None]:
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

def preprocess_and_lemmatize_review(review_text):
    # Process the review with spaCy
    doc = nlp(review_text)
    
    # Apply lemmatization and join the lemmatized words back into a string
    lemmatized_review = " ".join([token.lemma_ for token in doc])
    
    return lemmatized_review

# 9. Calling the preprocess_and_lemmatize_review() function 

We iterate over the cleaned reviews of each category (`cleaning_train_pos`, `cleaning_train_neg`, `cleaning_test_pos`, `cleaning_test_neg`), apply lemmatization, and store each result in the list of the corresponding category.


In [None]:
# Lemmatize every cleaned review, keeping one processed list per category
processed_train_pos = [preprocess_and_lemmatize_review(review) for review in cleaning_train_pos]
processed_train_neg = [preprocess_and_lemmatize_review(review) for review in cleaning_train_neg]
processed_test_pos = [preprocess_and_lemmatize_review(review) for review in cleaning_test_pos]
processed_test_neg = [preprocess_and_lemmatize_review(review) for review in cleaning_test_neg]

In [None]:
# Count the processed reviews in each category and print the results
print(f"Number of reviews in processed_train_pos: {len(processed_train_pos)}")
print(f"Number of reviews in processed_train_neg: {len(processed_train_neg)}")
print(f"Number of reviews in processed_test_pos: {len(processed_test_pos)}")
print(f"Number of reviews in processed_test_neg: {len(processed_test_neg)}")



# 10. Combine the reviews
We combine the positive (`pos`) and negative (`neg`) reviews, separately for the training data and the test data.

In [None]:
train_data = processed_train_pos + processed_train_neg
test_data = processed_test_pos + processed_test_neg

In [None]:
# checking the number of reviews in train and test data
print(f"Number of reviews in train_data: {len(train_data)}")
print(f"Number of reviews in test_data: {len(test_data)}")

# 11. Create labels
We create one label per review, matching the number of positive and negative reviews and following the order in which the reviews were combined (positive first, then negative).
* The positive review label is `1`.
* The negative review label is `0`.

**Note:** The training and test data must have the same lengths as the training and test labels, respectively.

In [None]:
# Create labels (1 for positive, 0 for negative)
train_labels = [1] * len(processed_train_pos) + [0] * len(processed_train_neg)
test_labels = [1] * len(processed_test_pos) + [0] * len(processed_test_neg)


In [None]:
print(f"Number of labels in train_labels: {len(train_labels)}")
print(f"Number of labels in test_labels: {len(test_labels)}")
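
The length requirement from the note can also be checked programmatically. A minimal sketch on made-up data follows; the `toy_*` lists are assumptions that stand in for the processed review lists.

```python
# Toy stand-ins for the processed review lists (assumed data)
toy_pos = ["great film", "loved it"]
toy_neg = ["boring plot"]

# Same construction as in the notebook: positives first, then negatives
toy_data = toy_pos + toy_neg
toy_labels = [1] * len(toy_pos) + [0] * len(toy_neg)

# Every review must have exactly one label
assert len(toy_data) == len(toy_labels)

# zip pairs each review with its label in order
pairs = list(zip(toy_data, toy_labels))
print(pairs)
```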

# 12. Overview of one review 
1. Original review.
2. Cleaned review.
3. Lemmatized review.

**Original review**

In [None]:
review = test_pos[5]
print(review)

**Preprocessed review (HTML tags, special characters, and punctuation removed)**

In [None]:
processed_tes_pos = preprocess_review(review)
print(processed_tes_pos)

**Lemmatize review**

In [None]:
lemm = preprocess_and_lemmatize_review(processed_tes_pos)
print(lemm)

# 13. Why did we use the RoBERTa and DistilBERT models for training?

* We tested many models: **BERT**, **mobileBERT**, **RoBERTa**, **tinyBERT**, **DistilBERT**, **mT5**, **XLNet**, **BART**, and **mBART**. <br>

* Because of dependency issues and differences between the models, only **BERT**, **mobileBERT**, **RoBERTa**, and **DistilBERT** ran successfully, which is why we chose from these models.<br>

* Some of the remaining models were too large and required more computational power than we had. For example, **mBART** (the multilingual variant of **BART**) and **BART** itself both ran into memory errors, and **mT5** is the multilingual variant of the **T5** model. Other models were not publicly available or required private access tokens.

# 14. Overview of RoBERTa Model 
* **RoBERTa** stands for `Robustly Optimized BERT Pretraining Approach`. It is a variant of the BERT model developed by researchers at Facebook AI and the University of Washington.<br>

* Its **architecture** is almost the **same** as **BERT**'s. The changes concern the pre-training procedure and are intended to improve on BERT's results.

#### **Modifications to BERT:**<br>
**->14.1. BERT uses two objectives `masked language modeling` and `next sentence prediction.`**

   * RoBERTa removes the NSP objective and uses only MLM.

   * **What is MLM (masked language modeling)?** MLM trains the model to learn the contextual relationships between the words in a sentence. Some tokens in the sentence are randomly replaced with a special mask token. **For example**, take the sentence **"The quick brown fox jumps over the lazy dog."** The sentence is first converted into tokens, and then some tokens are randomly selected for masking, say 'fox' and 'dog': **["The", "quick", "brown", "[MASK]", "jumps", "over", "the", "lazy", "[MASK]", "."]**. The model then has to predict the tokens behind the masks.

   * By learning to predict the masked tokens, the model becomes good at understanding the contextual relationships between words.

   * RoBERTa also uses a `dynamic masking strategy`: instead of masking each sequence once during preprocessing, as the original BERT does, it generates a new masking pattern every time a sequence is fed to the model. <br>
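
The difference between a fixed mask and a freshly drawn mask can be sketched with the standard library. This is a simplified illustration, not the actual pre-training code: `mask_tokens` is a hypothetical helper that always masks exactly two tokens, whereas BERT-style pre-training selects roughly 15% of the tokens and does not always replace them with the mask token.

```python
import random

def mask_tokens(tokens, num_masks, rng):
    """Replace num_masks randomly chosen tokens with "[MASK]"."""
    positions = set(rng.sample(range(len(tokens)), num_masks))
    return [("[MASK]" if i in positions else tok) for i, tok in enumerate(tokens)]

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

# Static masking: one pattern, created once and reused in every epoch
static_view = mask_tokens(tokens, 2, random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sentence is seen
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 2, rng) for _ in range(3)]

for view in dynamic_epochs:
    print(view)
```

With dynamic masking, the model is asked to predict different tokens of the same sentence over the course of training.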
   
   
**->14.2. Training with bigger batch size and huge corpus**

   * The original BERT (base) was trained for 1M steps with a batch size of 256 sequences. This is equivalent in computational cost to 125K steps with a batch size of 2K sequences, or to 31K steps with a batch size of 8K. Larger batches have two benefits: they improve the perplexity of the model, and they are easier to parallelize.
   
   * RoBERTa was trained on a 160GB text corpus, 10 times larger than BERT's training data. It combines different datasets: `BOOKCORPUS`, `CC-NEWS`, `OPENWEBTEXT`, and `STORIES`.
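
The equivalence of these training configurations can be checked with simple arithmetic: the number of sequences processed is steps times batch size. The sketch below assumes that "2K" and "8K" mean 2048 and 8192 sequences.

```python
# (steps, batch size) pairs; 2K and 8K are taken as 2048 and 8192 (an assumption)
configs = {
    "BERT (1M steps, 256)": (1_000_000, 256),
    "125K steps, 2K": (125_000, 2048),
    "31K steps, 8K": (31_000, 8192),
}

# Total sequences processed = number of steps * sequences per step
sequences_seen = {name: steps * batch for name, (steps, batch) in configs.items()}

for name, total in sequences_seen.items():
    print(f"{name}: {total / 1e6:.0f}M sequences")
```

All three settings process roughly 256M sequences, which is why they are comparable in cost.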

**->14.3. Training on large text sequence**

  * In the original BERT pre-training procedure, the model observes two concatenated document segments, which are either sampled contiguously from the same document or taken from distinct documents. `RoBERTa`, in contrast, is trained on `FULL-SENTENCES` inputs without NSP (next sentence prediction).
  
  * **FULL-SENTENCES:** Each input is packed with full sentences sampled contiguously from one or more documents, such that     the total length is at most `512 tokens.`<br>
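
The packing idea can be sketched as a greedy loop. This is an illustration under assumptions, not the procedure from the paper's code: `pack_full_sentences` is a hypothetical helper, whitespace splitting stands in for BPE tokenization, and document-boundary handling is left out.

```python
def pack_full_sentences(sentences, max_tokens=512):
    """Greedily pack consecutive sentences into inputs of at most max_tokens tokens."""
    inputs, current = [], []
    for sentence in sentences:
        # Whitespace split stands in for BPE; overly long sentences are cut
        tokens = sentence.split()[:max_tokens]
        # Start a new input when the next sentence would not fit
        if current and len(current) + len(tokens) > max_tokens:
            inputs.append(current)
            current = []
        current.extend(tokens)
    if current:
        inputs.append(current)
    return inputs

toy_sentences = ["a b c", "d e", "f g h", "i"]
packed = pack_full_sentences(toy_sentences, max_tokens=5)
print(packed)
```

With a limit of 5 tokens, the first two sentences share one input and the last two share another.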

**->14.4. Text Encoding using BPE (Byte-pair-encoding)**

  * RoBERTa uses a larger byte-level BPE vocabulary containing `50K subword` units, without any additional preprocessing or tokenization of the input. The original BERT model uses a character-level BPE vocabulary of size `30K`.
  
  
  * This adds approximately 15M and 20M additional parameters for BERT-BASE and BERT-LARGE, respectively.
  
**The remaining architecture is the same as the BERT-large architecture.**

### Reference:  [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)


In [None]:
#torch.cuda.empty_cache()

# 15. Model Implementation - roBERTa

* `RobertaTokenizer.from_pretrained('roberta-base')` loads the pre-trained tokenizer that corresponds to the `roberta-base` variant. The tokenizer converts text into the numerical token IDs that the model processes.

* We pass it the list of text samples to tokenize.

* `truncation=True` together with `max_length=512` truncates any text that is longer than `512` tokens. `padding=True` pads shorter texts to the length of the longest sequence in the call, so that all rows have the same length.

* `train_encodings` contains the token IDs and attention masks for the training data.

* The dataset is created using `TensorDataset`.

* `train_encodings['input_ids']` contains the token IDs of the training data after tokenization.

* `train_encodings['attention_mask']` contains the attention masks, which indicate which tokens are real and which are padding.

* `torch.tensor(train_labels)` holds the labels of the training data. `TensorDataset` combines `train_encodings['input_ids']`, `train_encodings['attention_mask']`, and the labels, so that each sample is one (token IDs, attention mask, label) triple.

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
train_encodings = tokenizer(train_data, truncation=True, padding=True, max_length=512, return_tensors='pt')
test_encodings = tokenizer(test_data, truncation=True, padding=True, max_length=512, return_tensors='pt')

train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], torch.tensor(train_labels))
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], torch.tensor(test_labels))
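
What truncation, padding, and the attention mask do to the token IDs can be shown on plain Python lists. The sketch below is a toy imitation and not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the IDs are made up, `pad_id=0` is arbitrary, and special tokens are ignored.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of truncation=True, padding=True on lists of token IDs."""
    # Cut every sequence to at most max_length tokens
    truncated = [seq[:max_length] for seq in sequences]
    # Pad to the longest remaining sequence
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

toy_ids = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
input_ids, attention_mask = pad_and_truncate(toy_ids, max_length=4)
print(input_ids)
print(attention_mask)
```

The first sequence is cut to four tokens, and the other two are padded up to that length with matching zeros in the mask.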


## 16. Create the Dataloader - roBERTa
- We set the `batch_size` to `16`, meaning the model processes 16 samples at a time during training.
  * Larger batch sizes caused out-of-memory errors, so we chose 16 samples per batch to fit our computational resources.
  
* The first argument is the wrapped dataset created above, which contains `train_encodings['input_ids']`, `train_encodings['attention_mask']`, and the labels.

* The second argument is the batch size, which we set to `16`.

* `shuffle=True` reshuffles the training data at every epoch. Because `train_data` lists all positive reviews before all negative ones, shuffling makes each batch likely to contain a mix of both classes.

In [None]:
# Set up the DataLoader
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
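
The number of batches per epoch, and the effect of shuffling on a dataset that is ordered by class, can be illustrated without PyTorch. The sketch assumes 5000 training and 500 test reviews, as selected earlier; `num_batches` is a hypothetical helper.

```python
import math
import random

def num_batches(num_samples, batch_size):
    # The last, smaller batch is kept, as DataLoader does by default
    return math.ceil(num_samples / batch_size)

print(num_batches(5000, 16))  # training set: 313 batches
print(num_batches(500, 16))   # test set: 32 batches

# Toy labels ordered like train_labels: all positives, then all negatives
labels = [1] * 8 + [0] * 8
first_batch_unshuffled = labels[:4]

# Shuffle a copy with a fixed seed so the example is reproducible
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
first_batch_shuffled = shuffled[:4]

print(first_batch_unshuffled, first_batch_shuffled)
```

Without shuffling, the first batches contain only positive reviews, which is what `shuffle=True` avoids.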

## 17. Model Loading - roBERTa

* Load the pre-trained `roberta-base` model with `RobertaForSequenceClassification` and 2 labels, to be fine-tuned on our own corpus.

- We use the `AdamW` optimizer, here with a learning rate of `1e-4`. We also tried the learning rates 0.01 and 0.25, which had no noticeable effect on the results.
   * Adam adapts the learning rate of each parameter based on its previous gradients. It maintains a per-parameter learning rate that is adjusted during training.
   * It copes well with sparse or noisy gradients, which is why we use it. The **original RoBERTa paper** also uses the Adam optimizer.
   
- `torch.device` checks whether a GPU is available. The model is moved to the `GPU` if so, and to the CPU otherwise.

In [None]:
# Initialize and configure the RoBERTa model for sequence classification
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
# 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=len(train_loader) * 1, num_training_steps=len(train_loader) * 10)

# Select the device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
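
The learning-rate schedule configured above warms up linearly for one epoch and then decays linearly to zero over the remaining steps. The following sketch reproduces that shape in plain Python as we understand it. `linear_warmup_factor` is a hypothetical name, and 313 steps per epoch assumes 5000 training reviews with a batch size of 16.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = 313  # ceil(5000 / 16), assuming 5000 training reviews
warmup = steps_per_epoch * 1
total = steps_per_epoch * 10
base_lr = 1e-4

for step in (0, warmup // 2, warmup, total // 2, total):
    print(step, base_lr * linear_warmup_factor(step, warmup, total))
```

The learning rate therefore peaks at the base value at the end of the first epoch and reaches zero at the last training step.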


## 18. Model training - roBERTa

- The loop trains the model for the specified number of epochs.
 * Our initial run with `10 epochs` gave good results. We also tried 5, 15, and 20 epochs, but the results did not change noticeably, so `we chose 10 epochs.`
 
- The `tqdm` library visualizes the training progress and running time.

- Computes training `losses` and `accuracies`.

- Saves the model after each epoch.

In [None]:
epochs = 10
training_accuracies = []
training_losses = []

# Create a directory to save the models
output_dir = "reBERT_5e"
os.makedirs(output_dir, exist_ok=True)

for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    correct = 0
    total = 0
    
    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}"):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        
        # Use the loss from the model
        loss = outputs.loss
        train_loss += loss.item()
        
        # Compute accuracy
        logits = outputs.logits
        _, predicted = torch.max(logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        
        loss.backward()
        optimizer.step()
        scheduler.step()

    avg_train_loss = train_loss / len(train_loader)
    training_losses.append(avg_train_loss)
    
    # Calculate training accuracy
    train_accuracy = correct / total
    training_accuracies.append(train_accuracy)
    
    print(f"Epoch {epoch + 1} - Average training loss: {avg_train_loss:.4f}, Training accuracy: {train_accuracy:.4f}")
    
    # Save the model after every epoch in the specified directory
    model.save_pretrained(os.path.join(output_dir, f"robert_epoch_{epoch + 1}_32_adam"))


In [None]:

# Data
epochs = range(1, len(training_accuracies) + 1)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Training Accuracy plot
ax1.plot(epochs, training_accuracies, marker='o', color='#ff7f0e', label='Training Accuracy')
ax1.set_title('roBERTa-Training Accuracy Over Epochs')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('roBERTa Accuracy')
ax1.set_xticks(range(1, len(training_accuracies) + 1))
ax1.legend()

# Training Loss plot
ax2.plot(epochs, training_losses, marker='o', color='#ff7f0e', label='Training Loss')
ax2.set_title('roBERTa-Training Loss Over Epochs')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('roBERTA Loss')
ax2.set_xticks(range(1, len(training_accuracies) + 1))
ax2.legend()

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()


## 19. Model Evaluation - roBERTa

- After training, we **evaluate** the model on the test set to see how well it has learned.

- At the end, we print the classification report and the **confusion matrix**.

In [None]:
model.eval()
y_true = []
y_pred = []

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_labels = np.argmax(logits.cpu().numpy(), axis=1)

        y_true.extend(labels.cpu().numpy())
        y_pred.extend(predicted_labels)

# Print classification metrics
print(classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
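
To read the confusion matrix, it helps to compute one by hand. In scikit-learn's layout, rows are true labels and columns are predicted labels, so for labels `0` and `1` the matrix is `[[TN, FP], [FN, TP]]`. The sketch below uses made-up predictions and a hypothetical helper `confusion_counts`.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels and predictions (1 = positive, 0 = negative)
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(toy_true, toy_pred)

# Metrics for the positive class, as reported by classification_report
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(toy_true)

print([[tn, fp], [fn, tp]], precision, recall, f1, accuracy)
```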


In [None]:
#torch.cuda.empty_cache()


# 20. Overview of DistilBERT model

DistilBERT is a `lightweight` and more `efficient` version of the original BERT model. It was introduced by `Hugging Face` in the paper "DistilBERT, a distilled version of BERT" in `2019`.
* DistilBERT has the `same` general architecture as BERT.
* `DistilBERT` is `40% smaller` and `60% faster` than the original BERT model. It is trained with knowledge distillation, in which the smaller DistilBERT (the student) learns to reproduce the behavior of the larger BERT (the teacher).
* Like BERT, DistilBERT understands the context of a word by considering both the `left` and `right` surrounding words in a sentence.
* It is first pre-trained on a `large corpus of text data` and then `fine-tuned` on specific tasks such as `classification` or `named entity recognition (NER)`.
* It also uses the **Masked Language Model** `(MLM)` objective during pre-training.
* It achieves its smaller size mainly by `reducing` the number of layers. The `hidden size` stays the same as in BERT-base.


* The BERT-base model has 12 layers, while DistilBERT has 6 layers.
* **Token type embeddings are removed:** Token type embeddings distinguish the segments of an input, such as the first and second sentence of a sentence pair. DistilBERT does not use token type embeddings, which reduces the number of parameters.
* **The pooler is removed.** The pooler is a layer applied to the hidden state of the first (`[CLS]`) token to produce a representation of the entire input. DistilBERT does not use the pooler, which further reduces the number of parameters.
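
The student-teacher idea behind distillation can be sketched with a softened softmax. This is a simplified, assumption-laden illustration and not DistilBERT's full training objective: the function names and logits are made up, and only the soft-target term is shown.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; higher temperatures give softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    t = softmax_with_temperature(teacher_logits, temperature)
    s = softmax_with_temperature(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# Made-up logits: one student close to the teacher, one far from it
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

A student whose outputs resemble the teacher's receives a lower loss, which is the signal that pushes the smaller model toward the larger model's behavior.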

### Reference:  [DistilBERT, a distilled version of BERT](https://arxiv.org/pdf/1910.01108v4.pdf)


# 21. Model Implementation - DistilBERT

- `DistilBertTokenizer` loads the pre-trained tokenizer corresponding to the `distilbert-base-uncased` variant. This tokenizer is responsible for converting text data into numerical tokens processed by the model.

- We provide it with a list of text samples that the `DistilBERT tokenizer` tokenizes.

- `truncation=True`: This argument specifies that if a text is longer than the maximum token length (512 tokens in this case), it will be truncated. 

- `padding=True`: This pads shorter texts so that all sequences have a uniform length, namely that of the longest sequence, which is at most `512` tokens after truncation.

- `train_encodings_distil` will contain the token IDs and attention masks for the training data.

- The dataset is created using `TensorDataset`.

- `train_encodings_distil['input_ids']` contains the token IDs of the training data after tokenization.

- `train_encodings_distil['attention_mask']` contains the attention masks, which indicate which tokens are actual words and which are padding tokens.

- `torch.tensor(train_labels)` holds the labels associated with the training data. `TensorDataset` combines `train_encodings_distil['input_ids']`, `train_encodings_distil['attention_mask']`, and the labels, wrapping each sample.

This setup allows us to efficiently use the DistilBERT model for tasks such as classification, sentiment analysis, and more, by leveraging the token IDs and attention masks produced by the tokenizer.


In [None]:
# Change tokenizer
tokenizer_distil = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_encodings_distil = tokenizer_distil(train_data, truncation=True, padding=True, max_length=512, return_tensors='pt')
test_encodings_distil = tokenizer_distil(test_data, truncation=True, padding=True, max_length=512, return_tensors='pt')

train_dataset_distil = TensorDataset(train_encodings_distil['input_ids'], train_encodings_distil['attention_mask'], torch.tensor(train_labels))
test_dataset_distil = TensorDataset(test_encodings_distil['input_ids'], test_encodings_distil['attention_mask'], torch.tensor(test_labels))


## 22. Create the Dataloader - DistilBERT
- We set the `batch_size` to `16`, meaning the model processes 16 samples at a time during training.
  * Larger batch sizes caused out-of-memory errors, so we chose 16 samples per batch to fit our computational resources.
  
* The first argument is the wrapped dataset created above, which contains `train_encodings_distil['input_ids']`, `train_encodings_distil['attention_mask']`, and the labels.

* The second argument is the batch size, which we set to `16`.

* `shuffle=True` reshuffles the training data at every epoch. Because `train_data` lists all positive reviews before all negative ones, shuffling makes each batch likely to contain a mix of both classes.





In [None]:
# Set up DataLoader for DistilBERT
batch_size_distil = 16
train_loader_distil = DataLoader(train_dataset_distil, batch_size=batch_size_distil, shuffle=True)
test_loader_distil = DataLoader(test_dataset_distil, batch_size=batch_size_distil)


## 23. Model Loading - DistilBERT

* Load the pre-trained `distilbert-base-uncased` model with `DistilBertForSequenceClassification` and 2 labels, to be fine-tuned on our own corpus.

- We use the `AdamW` optimizer, here with a learning rate of `2e-5`. We also tried the learning rates 0.01 and 0.25, which had no noticeable effect on the results.
   * Adam adapts the learning rate of each parameter based on its previous gradients. It maintains a per-parameter learning rate that is adjusted during training.
   
- `torch.device` checks whether a GPU is available. The model is moved to the `GPU` if so, and to the CPU otherwise.

In [None]:
# Initialize and configure the DistilBERT model for sequence classification
model_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
optimizer_distil = torch.optim.AdamW(model_distil.parameters(), lr=2e-5)
scheduler_distil = get_linear_schedule_with_warmup(optimizer_distil, num_warmup_steps=len(train_loader_distil) * 1, num_training_steps=len(train_loader_distil) * 10)

# Select the device for DistilBERT: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_distil.to(device)


## 24. Model training - DistilBERT

- The loop trains the model for the specified number of epochs.
 * Our initial run with `10 epochs` gave good results. We also tried 5, 15, and 20 epochs, but the results did not change noticeably, so `we chose 10 epochs.`
 
- The `tqdm` library visualizes the training progress and running time.

- Computes training `losses` and `accuracies`.

- Saves the model after each epoch.

In [None]:
epochs_distil = 10
training_accuracies_distil = []
training_losses_distil = []

output_dir_distil = "distilBERT_5e"
os.makedirs(output_dir_distil, exist_ok=True)

for epoch in range(epochs_distil):
    model_distil.train()
    train_loss_distil = 0.0
    correct_distil = 0
    total_distil = 0
    
    for batch in tqdm(train_loader_distil, desc=f"Epoch {epoch + 1}"):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        
        optimizer_distil.zero_grad()
        outputs = model_distil(input_ids, attention_mask=attention_mask, labels=labels)
        
        # Use the loss from the model
        loss = outputs.loss
        train_loss_distil += loss.item()
        
        # Compute accuracy
        logits = outputs.logits
        _, predicted = torch.max(logits, 1)
        total_distil += labels.size(0)
        correct_distil += (predicted == labels).sum().item()
        
        loss.backward()
        optimizer_distil.step()
        scheduler_distil.step()

    avg_train_loss_distil = train_loss_distil / len(train_loader_distil)
    training_losses_distil.append(avg_train_loss_distil)
    
    # Calculate training accuracy
    train_accuracy_distil = correct_distil / total_distil
    training_accuracies_distil.append(train_accuracy_distil)
    
    print(f"Epoch {epoch + 1} - Average training loss: {avg_train_loss_distil:.4f}, Training accuracy: {train_accuracy_distil:.4f}")
    
    # Save the model after every epoch in the specified directory
    model_distil.save_pretrained(os.path.join(output_dir_distil, f"distilbert_epoch_{epoch + 1}_16_adam"))


### Visualize the accuracy and loss of each epoch

In [None]:

# Data
epochs = range(1, len(training_accuracies_distil) + 1)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Training Accuracy plot
ax1.plot(epochs, training_accuracies_distil, marker='o', color='#1f77b4', label='Training Accuracy')
ax1.set_title('Training Accuracy Over Epochs')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('DistilBERT Accuracy')
ax1.set_xticks(epochs)  # one tick per DistilBERT epoch
ax1.legend()

# Training Loss plot
ax2.plot(epochs, training_losses_distil, marker='o', color='#1f77b4', label='Training Loss')
ax2.set_title('Training Loss Over Epochs')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('DistilBERT Loss')
ax2.set_xticks(epochs)  # one tick per DistilBERT epoch
ax2.legend()

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()


## 25. Model Evaluation - DistilBERT

- After training, we **evaluate** the model on the test set to see how well it has learned.

- At the end, print the classification report and the **confusion matrix**.
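
As a reading aid for the confusion matrix, the sketch below derives precision, recall, F1, and accuracy for the positive class from a 2x2 matrix laid out the way `sklearn.metrics.confusion_matrix` returns it (rows are true labels, columns are predicted labels). The helper name `binary_metrics` is hypothetical, the counts are made up for illustration and are not this notebook's results, and label 0 = Negative, label 1 = Positive is assumed.

```python
def binary_metrics(conf_matrix):
    """Positive-class metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts, for illustration only
example_matrix = [[45, 5],
                  [10, 40]]

metrics = binary_metrics(example_matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The same numbers appear in the "Positive" row of `classification_report`, so the matrix and the report can be cross-checked against each other.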

In [None]:
# Evaluation loop for DistilBERT
model_distil.eval()
y_true_distil = []
y_pred_distil = []

with torch.no_grad():
    for batch in tqdm(test_loader_distil, desc="Evaluating"):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        outputs = model_distil(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_labels = np.argmax(logits.cpu().numpy(), axis=1)

        y_true_distil.extend(labels.cpu().numpy())
        y_pred_distil.extend(predicted_labels)

# Print classification metrics
print(classification_report(y_true_distil, y_pred_distil, target_names=["Negative", "Positive"]))
conf_matrix_distil = confusion_matrix(y_true_distil, y_pred_distil)
print("Confusion Matrix:")
print(conf_matrix_distil)

# 26. Visualize the Results
* Plot the training accuracy and loss of both models for each epoch.

In [None]:
epochs = range(1, 11)

# Create subplots for accuracy and loss
plt.figure(figsize=(14, 5))

# Accuracy subplot
plt.subplot(1, 2, 1)
# training_accuracies holds the RoBERTa run; the *_distil list holds the DistilBERT run
plt.plot(epochs, training_accuracies, marker='o', color='#1f77b4', label='RoBERTa')
plt.plot(epochs, training_accuracies_distil, marker='o', color='#ff7f0e', label='DistilBERT')
plt.xticks(range(1, len(training_accuracies) + 1))
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Model Accuracy')
plt.legend()

# Loss subplot
plt.subplot(1, 2, 2)
# training_losses holds the RoBERTa run; the *_distil list holds the DistilBERT run
plt.plot(epochs, training_losses, marker='o', color='#1f77b4', label='RoBERTa')
plt.plot(epochs, training_losses_distil, marker='o', color='#ff7f0e', label='DistilBERT')
plt.xticks(range(1, len(training_accuracies) + 1))
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Model Loss')
plt.legend()

plt.tight_layout()

plt.show()


* Show the validation accuracy of both models as a bar plot.

In [None]:
# Data
models = ['RoBERTa', 'DistilBERT']
accuracies = [93, 91]

# Create the bar chart
plt.figure(figsize=(8, 6))
plt.bar(models, accuracies, color=['#1f77b4', '#ff7f0e'])

plt.xlabel('Models')
plt.ylabel('Accuracy (%)')
plt.title('Validation Accuracy Comparison between RoBERTa and DistilBERT')

# Set y-axis ticks with a difference of 10
plt.yticks(range(0, 101, 10))

plt.show()


# 27. RoBERTa vs. DistilBERT

* **Accuracy Improvement:** Both `roBERTa` and `distilBERT` show only slight improvements in
accuracy as training progresses over the epochs.

* **Training aspect:** `roBERTa` takes much longer to train than `distilBERT`, which trains considerably faster. Both models start with lower accuracy in the first epochs and improve slightly over the following epochs.

* **Model Preference:** `roBERTa` reaches 98% accuracy on training and 93% on validation. `distilBERT` reaches 99% on training and 90% on validation.

* **Overfitting:** Both models show signs of overfitting: training accuracy is noticeably higher than validation accuracy.
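
One simple way to quantify the overfitting noted above is the gap between training and validation accuracy. The snippet below is only a sketch: it uses the percentages quoted in the bullets of this section, and the variable names are hypothetical.

```python
# (training accuracy %, validation accuracy %) as quoted in this section
reported = {
    "RoBERTa": (98, 93),
    "DistilBERT": (99, 90),
}

# A larger train-validation gap suggests stronger overfitting
gaps = {name: train - val for name, (train, val) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train-validation gap = {gap} percentage points")
```

By this measure DistilBERT shows the larger gap, which is consistent with RoBERTa being the preferred model despite its slightly lower training accuracy.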

# 28. Conclusion

In conclusion, `RoBERTa` appears to be the stronger performer in this `sentiment analysis` task,
showing higher **precision, recall, F1-scores**, and **accuracy** compared to distilBERT on the test dataset.
It is recommended for use in this specific NLP application.

                                                
# **THE END!**