# **Natural Language Processing with PyTorch**

This notebook will guide you through key **Natural Language Processing (NLP) concepts** using **PyTorch**, from **tokenization** to **deep learning models** like **LSTMs and Transformers**.

**Key Topics Covered:**
- **Tokenization** (using Hugging Face tokenizers)
- **Word Embeddings** (Custom + Pretrained BERT)
- **Building an LSTM for NLP**
- **Using Transformers for NLP**



## **Part 1: Installs, Imports, Seed, and GPU Utilization**

In [None]:
!pip install transformers datasets

# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torch.utils.data import DataLoader
from datasets import load_dataset
import numpy as np
import random
import matplotlib.pyplot as plt
from collections import Counter
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, get_scheduler
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Set a random seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


## **Part 2: Tokenization & Vocabulary Building**

Before we can process text using deep learning, we need to **convert text into numerical format**. Tokenization is the process of **splitting text into smaller units** (words, subwords, or characters). We'll use **Hugging Face's AutoTokenizer** for modern subword tokenization instead of traditional methods like NLTK.

In [None]:
# Load a pretrained tokenizer (BERT-based)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sample dataset: A few example sentences
corpus = [
    "Deep learning is amazing",
    "Natural language processing is a branch of AI",
    "PyTorch makes NLP easier",
    "Transformers are powerful models"
]

# Tokenize the sentences using the BERT tokenizer
tokenized_corpus = [tokenizer.tokenize(sentence.lower()) for sentence in corpus]

# Let's check how the tokenizer splits words
print("\nTokenized Corpus:")
print(tokenized_corpus)




Tokenized Corpus:
[['deep', 'learning', 'is', 'amazing'], ['natural', 'language', 'processing', 'is', 'a', 'branch', 'of', 'ai'], ['p', '##yt', '##or', '##ch', 'makes', 'nl', '##p', 'easier'], ['transformers', 'are', 'powerful', 'models']]
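
The `##` pieces in the output above come from WordPiece, the subword scheme used by BERT's tokenizer: a word missing from the vocabulary is split greedily into the longest known pieces, and every piece after the first carries a `##` prefix. The following is a simplified sketch of that idea, not the library's implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the tiny vocabulary is chosen only to mirror the output above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption, not BERT's real vocabulary).
toy_vocab = {"deep", "p", "##yt", "##or", "##ch", "nl", "##p"}

print(wordpiece_tokenize("deep", toy_vocab))     # ['deep']
print(wordpiece_tokenize("pytorch", toy_vocab))  # ['p', '##yt', '##or', '##ch']
print(wordpiece_tokenize("nlp", toy_vocab))      # ['nl', '##p']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Because rare words fall back to smaller pieces, the vocabulary can stay fixed in size while still covering words it has never seen whole.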


### Building a Vocabulary
Once we have tokenized text, we need to assign **unique numerical indices** to each token to create a vocabulary.
We'll create a **word2idx mapping** where each unique token gets a unique number.

In [None]:
def build_vocab(corpus):
    """
    Builds a vocabulary dictionary from tokenized text.
    Assigns a unique index to each token.
    """
    counter = Counter()
    for tokens in corpus:
        counter.update(tokens)

    # Create a mapping of words to indices, starting from 2
    word2idx = {word: idx+2 for idx, word in enumerate(counter.keys())}

    # Add special tokens for unknown words and padding
    word2idx["<unk>"] = 0
    word2idx["<pad>"] = 1

    return word2idx

# Build vocabulary from tokenized corpus
vocab = build_vocab(tokenized_corpus)
print("Vocabulary size:", len(vocab))

Vocabulary size: 25
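
Once a `word2idx` mapping exists, turning tokens into indices is a dictionary lookup with a fallback to `<unk>` for anything unseen. The sketch below illustrates that round trip with a small hand-written mapping; `encode`, `decode`, and `small_vocab` are hypothetical names introduced for illustration.

```python
def encode(tokens, word2idx):
    """Map tokens to indices, sending unseen tokens to <unk>."""
    unk = word2idx["<unk>"]
    return [word2idx.get(token, unk) for token in tokens]


def decode(indices, word2idx):
    """Map indices back to tokens by inverting the vocabulary."""
    idx2word = {idx: word for word, idx in word2idx.items()}
    return [idx2word[i] for i in indices]


# Hand-written example vocabulary (assumption, smaller than the one built above).
small_vocab = {"<unk>": 0, "<pad>": 1, "deep": 2, "learning": 3, "is": 4}

ids = encode(["deep", "learning", "is", "fun"], small_vocab)
print(ids)                       # [2, 3, 4, 0]
print(decode(ids, small_vocab))  # ['deep', 'learning', 'is', '<unk>']
```

Note that decoding is lossy: once "fun" has been mapped to `<unk>`, the original word cannot be recovered.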


## **Part 3: Word Embeddings (Custom & Pretrained BERT)**

Neural networks **do not understand text** directly, so we need to **convert words into dense numerical vectors**.

We'll explore two approaches:
- **Custom Embeddings**: Learned during training (random initialization)
- **Pretrained Embeddings**: Extracted from BERT

In [None]:
# Custom Word Embeddings (random initialization)
VOCAB_SIZE = len(vocab)
EMBED_DIM = 50  # number of features per word

embedding_layer = nn.Embedding(VOCAB_SIZE, EMBED_DIM).to(device)

# Sample sentence to tensor
sample_sentence = "Deep learning is powerful"
sample_tokens = tokenizer.tokenize(sample_sentence.lower())
sample_indices = [vocab.get(word, vocab["<unk>"]) for word in sample_tokens]
sample_tensor = torch.tensor(sample_indices).unsqueeze(0).to(device)

# Get the embedding representation
embedded_sentence = embedding_layer(sample_tensor)
print("\nCustom Embedding Shape:", embedded_sentence.shape)


Custom Embedding Shape: torch.Size([1, 4, 50])
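
It can help to see that an embedding layer is a trainable lookup table: each index selects one row of the weight matrix. The small check below uses arbitrary sizes (5 tokens, 3 dimensions) chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny embedding table: 5 rows (one per token id), 3 features each.
emb = nn.Embedding(num_embeddings=5, embedding_dim=3)

token_ids = torch.tensor([[2, 4, 2]])  # shape (batch=1, seq_len=3)
looked_up = emb(token_ids)             # forward pass through the layer
manual = emb.weight[token_ids]         # plain row indexing into the weights

print(looked_up.shape)                                # torch.Size([1, 3, 3])
print(torch.equal(looked_up, manual))                 # True
print(torch.equal(looked_up[0, 0], looked_up[0, 2]))  # True
```

The last line shows why these embeddings are not contextual: the same token id always yields the same vector, wherever it appears in the sentence.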


## **Part 4: Using Pretrained Embeddings (BERT)**

Instead of learning word embeddings from scratch, we can use **pretrained embeddings** from models like BERT. These embeddings capture **semantic meaning** and are **contextualized**, meaning they depend on the sentence.

In [None]:
# Load a pretrained BERT model
bert_model = AutoModel.from_pretrained("bert-base-uncased").to(device)

# Sample sentence
sentence = "Deep learning is transforming AI"
inputs = tokenizer(sentence, return_tensors="pt")
inputs = {key: value.to(device) for key, value in inputs.items()}  # Move to GPU if available

# Get embeddings from BERT
with torch.no_grad():
    outputs = bert_model(**inputs)
    last_hidden_states = outputs.last_hidden_state

print("BERT Embedding Shape:", last_hidden_states.shape)  # Shape: (1, seq_length, 768)

BERT Embedding Shape: torch.Size([1, 7, 768])
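
BERT returns one 768-dimensional vector per token (7 here, including the `[CLS]` and `[SEP]` markers). One common way to reduce these to a single sentence vector, which is not part of the original notebook, is to average the token vectors while ignoring padding. The sketch below uses a random tensor in place of real BERT output; `mean_pool` is a hypothetical helper name.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, skipping padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in for outputs.last_hidden_state: 2 sentences, 7 tokens, 768 features.
hidden = torch.randn(2, 7, 768)
# The second sentence has only 4 real tokens; the rest is padding.
mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 0, 0, 0]])

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)  # torch.Size([2, 768])
```

With real model output, the mask would be `inputs["attention_mask"]` from the tokenizer.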


### 🔹 Custom vs Pretrained Embeddings: Key Differences
| Feature            | Custom Embeddings | Pretrained (BERT) |
|--------------------|------------------|------------------|
| Trained on dataset | Yes              | No (already trained) |
| Contextualized?    | No               | Yes |
| Captures semantics | Limited          | Strong |
| Computational cost | Low              | High |

## **Part 5: Applying Transformers for NLP**

### **Understanding the IMDB Dataset and the Task**

The IMDB dataset is a widely used benchmark dataset for sentiment analysis. It consists of 50,000 movie reviews from IMDB, labeled as either positive (1) or negative (0). The dataset is divided into:
* 25,000 training reviews
* 25,000 test reviews

Each set contains an equal number of positive and negative reviews. Here are two examples:
* ✅ Positive Review: "This movie was absolutely fantastic! The storyline was engaging, and the performances were top-notch."
* ❌ Negative Review: "Terrible film. The acting was bad, the plot was predictable, and I regret watching it."

The goal is to build a binary classification model that takes a movie review as input and predicts whether it is positive (1) or negative (0).

**Why is this useful?**

* Companies use sentiment analysis to analyze customer feedback.
* It helps in opinion mining from social media, news, and reviews.
* Used in automated content moderation (e.g., filtering harmful content).

**Challenges in Text-Based Sentiment Analysis**

Unlike structured numerical data, text data presents unique challenges:
* Different sentence structures:
  * "I didn't like this movie." (Negative)
  * "I didn't expect to like this movie, but I did." (Positive)
* Sarcasm & Negation Handling:
  * "Oh great, another terrible remake." (Negative)
* Word Importance:
  * Some words (e.g., "not") can flip the sentiment of a sentence.
* Large Vocabulary:
  * Many unique words → Requires efficient embedding techniques.
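
To make the negation and sarcasm problems concrete, here is a deliberately naive word-counting scorer. It is a sketch for illustration only: the `POSITIVE` and `NEGATIVE` word lists and the `lexicon_score` function are assumptions, not a method used elsewhere in this notebook.

```python
# Tiny hand-picked sentiment lexicons (assumption, for illustration only).
POSITIVE = {"like", "great", "fantastic"}
NEGATIVE = {"terrible", "bad", "regret"}


def lexicon_score(text):
    """Count positive words minus negative words, ignoring word order."""
    words = text.lower().replace(".", "").replace(",", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


print(lexicon_score("I didn't like this movie."))           # 1
print(lexicon_score("Oh great, another terrible remake."))  # 0
```

The first review is negative but scores +1 because "like" is counted without its negation, and the sarcastic second review cancels out to 0. Models that read words in context are meant to handle cases like these.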

**We Will Solve This Using BERT**

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained deep learning model that:
* Understands context in text (unlike older methods like TF-IDF)
* Is pretrained on vast amounts of text data
* Handles long-range dependencies in text

We will fine-tune DistilBERT, a smaller and faster distilled version of BERT, to classify IMDB reviews as positive or negative using PyTorch and Hugging Face’s transformers library.

### STEP 1: Load the Data

In [None]:
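# Load the IMDB dataset (already pre-split into train and test sets)
dataset = load_dataset("imdb")

# The IMDB splits appear to be stored sorted by label, so taking the first N rows
# would yield reviews from a single class. Shuffle before subsampling.
train_subset = dataset["train"].shuffle(seed=SEED).select(range(500))
test_subset = dataset["test"].shuffle(seed=SEED).select(range(100))

# Reduce dataset size to save memory (use only 500 training samples)
train_texts = list(train_subset["text"])
train_labels = list(train_subset["label"])

# Use a 100-sample subset of the test set for evaluation
test_texts = list(test_subset["text"])
test_labels = list(test_subset["label"])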

# Print dataset size
print(f"Training samples: {len(train_texts)}")
print(f"Testing samples: {len(test_texts)}")

Training samples: 500
Testing samples: 100


### STEP 2: Tokenize the Data

In [None]:
# Load the tokenizer for DistilBERT (a smaller version of BERT)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Function to tokenize text
def tokenize_function(texts):
    return tokenizer(texts, padding="max_length", truncation=True, max_length=256, return_tensors="pt")

# Tokenize training and testing texts
train_encodings = tokenize_function(train_texts)
test_encodings = tokenize_function(test_texts)

# Convert labels into PyTorch tensors
train_labels_tensor = torch.tensor(train_labels)
test_labels_tensor = torch.tensor(test_labels)
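
With `padding="max_length"` and `truncation=True`, every review ends up exactly `max_length` tokens long, and an attention mask marks which positions are real tokens. The sketch below shows the idea on plain lists. It is a simplification: `pad_or_truncate` is a hypothetical helper, the middle token ids are arbitrary, and unlike the real tokenizer it does not keep the closing `[SEP]` token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = token_ids[:max_length]      # truncation: drop anything past max_length
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# 101 and 102 stand for [CLS] and [SEP]; the ids in between are arbitrary.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
print(short_ids)   # [101, 7, 8, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 102], max_length=4)
print(long_ids)    # [101, 7, 8, 9]
print(long_mask)   # [1, 1, 1, 1]
```

Padding everything to 256 tokens is simple but wastes computation on short reviews; padding each batch only to its longest member is a common alternative.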

### STEP 3: Create Dataset Class

In [None]:
# Custom dataset class to handle tokenized data
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # Tokenized text
        self.labels = labels  # Corresponding labels (0 = negative, 1 = positive)

    def __len__(self):
        return len(self.labels)  # Number of samples in dataset

    def __getitem__(self, idx):
        # Retrieve tokenized text and label for a given index
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

# Create dataset objects for training and testing
train_dataset = IMDBDataset(train_encodings, train_labels_tensor)
test_dataset = IMDBDataset(test_encodings, test_labels_tensor)

### STEP 4: Create DataLoaders

In [None]:
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
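
The number of batches a `DataLoader` yields per epoch follows directly from the dataset size and batch size. The arithmetic can be sketched as follows; `num_batches` is a hypothetical helper, and `drop_last` mirrors the `DataLoader` option of the same name.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Batches per epoch; the final batch may be smaller unless dropped."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(num_batches(500, 8))                  # 63
print(num_batches(100, 8))                  # 13
print(num_batches(500, 8, drop_last=True))  # 62
print(500 - 62 * 8)                         # 4
```

So 500 training samples at batch size 8 give 63 batches per epoch, the last of which holds only 4 samples.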

### STEP 5: Load Pretrained DistilBERT Model

In [None]:
# Load the pre-trained DistilBERT model for binary classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Move model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


### STEP 7: Define Optimizer & Loss Function

In [None]:
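optimizer = optim.AdamW(model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()  # Shown for reference; the model computes this loss itself when given labels

# Total number of optimizer steps: batches per epoch times epochs.
# Keep num_epochs in sync with the training loop so the schedule reaches zero when training ends.
num_epochs = 1
num_training_steps = len(train_loader) * num_epochs

# Create a learning rate scheduler (linearly decreases learning rate)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)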
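
A linear schedule with warmup scales the base learning rate by a factor that rises from 0 to 1 during warmup and then falls linearly to 0 at the final step. The function below is a sketch of that usual formula, not the library code; `linear_lr` is a hypothetical name and the 100-step horizon is an arbitrary example.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


for step in (0, 50, 100):
    print(step, linear_lr(step, base_lr=5e-5, num_training_steps=100))
```

With no warmup, the rate starts at the full 5e-5, is halved at the midpoint, and reaches 0 at the last step. If training stops before `num_training_steps`, the rate never decays fully, so the step count should match the number of steps actually run.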

### STEP 8: Training Loop

In [None]:
import time  # Add timing to monitor training speed

# Enable mixed precision only when a GPU is present (reduces memory usage on CUDA)
use_amp = device.type == "cuda"
scaler = torch.amp.GradScaler("cuda", enabled=use_amp)
num_epochs = 1  # Reduce to 1 epoch for faster training
start_time = time.time()  # Start timer

model.train()  # Set model to training mode

for epoch in range(num_epochs):
    print(f"Starting epoch {epoch+1}...")

    for batch_idx, batch in enumerate(train_loader):
        optimizer.zero_grad()  # Reset gradients

        batch = {key: val.to(device) for key, val in batch.items()}  # Move to device

        with torch.amp.autocast(device_type=device.type, enabled=use_amp):  # Mixed precision
            outputs = model(**batch)  # Forward pass
            loss = outputs.loss  # Compute loss

        # Backpropagation with mixed precision scaling (a no-op passthrough when use_amp is False)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        lr_scheduler.step()

        # Print loss every 10 batches
        if batch_idx % 10 == 0:
            print(f"Batch {batch_idx}/{len(train_loader)} - Loss: {loss.item():.4f}")

        # If training takes too long (> 10 minutes), break
        if time.time() - start_time > 600:
            print("Training took too long, stopping early.")
            break

    print(f"Epoch {epoch+1} completed. Loss: {loss.item():.4f}")

print("Training finished.")



Starting epoch 1...
Batch 0/63 - Loss: 0.7323
Batch 10/63 - Loss: 0.0293
Batch 20/63 - Loss: 0.0067
Batch 30/63 - Loss: 0.0030
Batch 40/63 - Loss: 0.0021
Batch 50/63 - Loss: 0.0013
Training took too long, stopping early.
Epoch 1 completed. Loss: 0.0013
Training finished.


### STEP 9: Evaluation

In [None]:
# Set model to evaluation mode (disables dropout; gradients are turned off separately below)
model.eval()

predictions = []
true_labels = []

# Disable gradient computation to save memory during testing
with torch.no_grad():
    for batch in test_loader:
        batch = {key: val.to(device) for key, val in batch.items()}  # Move batch to device
        outputs = model(**batch)  # Forward pass
        logits = outputs.logits  # Get model outputs

        # Convert logits to class predictions (0 or 1)
        preds = torch.argmax(logits, dim=-1).cpu().numpy()
        labels = batch["labels"].cpu().numpy()

        # Store predictions and actual labels for accuracy calculation
        predictions.extend(preds)
        true_labels.extend(labels)

# Compute accuracy score
accuracy = accuracy_score(true_labels, predictions)
print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 1.0000
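
A score of 1.0000 on 100 reviews is worth a sanity check before it is trusted. One quick check, not part of the original notebook, is to compare against the majority-class baseline: if nearly all evaluation labels share one class, a model that always predicts that class scores just as well. The sketch below uses made-up label lists; `majority_baseline` is a hypothetical helper, and in the notebook it would be called on `true_labels`.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


balanced = [0, 1] * 50  # made-up labels: 50 negative, 50 positive
one_class = [0] * 100   # made-up labels: all negative

print(majority_baseline(balanced))   # (0, 0.5)
print(majority_baseline(one_class))  # (0, 1.0)
```

A model is only informative to the extent that it beats this baseline, so a baseline near 1.0 means the evaluation slice cannot distinguish a good classifier from a constant one.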


## **Part 7: Challenge Time!**

**Try modifying the training setup!**
- Change the learning rate and optimizer
- Use a different model (e.g., 'bert-base-uncased' or 'roberta-base')
- Train for more epochs
- Add regularization like dropout
- Experiment with different batch sizes
- Visualize loss over training steps
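
As a starting point for the loss visualization, per-batch losses are usually noisy, so smoothing them first tends to make the trend easier to read. The helper below is a minimal sketch of a trailing moving average; `moving_average` is a hypothetical name, and the sample values are the losses logged every 10 batches in the training output above.

```python
def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


logged_losses = [0.7323, 0.0293, 0.0067, 0.0030, 0.0021, 0.0013]
smoothed_losses = moving_average(logged_losses, window=3)
print(smoothed_losses)
```

To plot the result, append `loss.item()` to a list inside the training loop and pass the smoothed list to `plt.plot`.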