# Module 5: Deep learning for natural language processing

## Overview
At its core, Natural Language Processing is about enabling computers to understand, interpret, and produce human language in a way that is both meaningful and valuable. This interdisciplinary domain sits at the intersection of computer science, artificial intelligence, and linguistics. The applications of NLP are vast and varied, spanning from simple tasks like spell-checkers and search engines to more complex ones like chatbots, sentiment analysis, and machine translation.

## Deep learning in NLP
The last decade has witnessed a transformative shift in the way we model language, thanks to the advent of deep learning. Traditional NLP techniques often relied on handcrafted features and rule-based approaches. However, deep learning, with its ability to automatically learn representations from data, has provided a more powerful and scalable way to tackle intricate NLP challenges.

## Goals of this notebook
- gain familiarity working with text in Python and representing it in ways that work for neural networks
- initialize and train a neural network to create word embeddings using PyTorch
- train a neural network for text classification using a transformer model and PyTorch 

## Non-goals
- understand _all_ deep learning and text processing code in this notebook at a deeper level
- understand the underlying architecture of the pre-trained transformer model

## Text Preprocessing for Neural Networks
Before feeding text data into a neural network, it is crucial to preprocess and convert it into a format that the model can understand. The aim of text preprocessing is to clean and standardize the text data, making it free of any inconsistencies and irrelevant elements. Let's explore the fundamental steps involved in this process.

Tokenization: This is the process of breaking down text into smaller units, typically words or tokens. For example, the sentence "I love Anaconda!" can be tokenized into the words ["I", "love", "Anaconda", "!"].

In [None]:
import nltk
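nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases (3.9 and later) look for this resource instead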

In [None]:
from nltk.tokenize import word_tokenize

text = "I love Anaconda!"
tokens = word_tokenize(text)
print(tokens)

Lowercasing: Often, text data comes in a mix of upper and lower case. Since "Anaconda" and "anaconda" are treated as different tokens, converting all tokens to lowercase helps in reducing the size of our vocabulary and ensuring consistency.

In [None]:
text = text.lower()
print(text)

Stopwords: These are commonly used words in a language (like "and", "is", "in") which might not contribute significantly to the meaning of a sentence in certain NLP tasks. Removing them can often help reduce the dimensionality of the data.
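
As a minimal sketch of stopword removal, the snippet below filters tokens against a small hand-written set. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names used only for illustration; in practice you would likely use a fuller list, such as the one available through `nltk.corpus.stopwords`.

```python
# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "an", "and", "in", "is", "the", "of", "to"}

def remove_stopwords(tokens):
    """Return the tokens that are not in STOPWORDS (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOPWORDS]

example_tokens = ["The", "snake", "is", "in", "the", "garden"]
filtered_tokens = remove_stopwords(example_tokens)
print(filtered_tokens)  # ['snake', 'garden']
```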

Punctuation: Punctuation marks can often be extraneous in NLP tasks. Removing them further cleans the data.

In [None]:
import string

text = text.translate(str.maketrans('', '', string.punctuation))
print(text)

Stemming: This method reduces words to a root form by stripping suffixes, even if the result is not a valid word. For instance, the Porter stemmer reduces "running" to "run" and "studies" to "studi".
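
To make the idea of suffix stripping concrete, here is a deliberately crude sketch. It is not the Porter algorithm or any library's implementation; `SUFFIXES` and `naive_stem` are hypothetical names, and real stemmers apply many more rules.

```python
# Hypothetical suffix list; real stemmers such as Porter's use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumping", "jumped", "jumps", "jump"]]
print(stems)  # ['jump', 'jump', 'jump', 'jump']
```

Because the function only strips characters, its output need not be a dictionary word, which is the main difference from lemmatization.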

Lemmatization: It reduces words to their base or dictionary form. For instance, "running" would be lemmatized to "run".

In [None]:
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

tokens = ["I", "love", "anacondas"]

lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(tokens)

One-hot encoding: Each word in the vocabulary is represented by a vector where one element is "hot" or 1, indicating the presence of the word, while all other elements are zero.
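
A minimal sketch of one-hot encoding, assuming a hypothetical three-word vocabulary (`toy_vocab`) that maps each word to an index:

```python
# Hypothetical three-word vocabulary mapping each word to an index.
toy_vocab = {"I": 0, "love": 1, "Anaconda": 2}

def one_hot(word, vocab):
    """Return a list with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

encoded = [one_hot(word, toy_vocab) for word in ["I", "love", "Anaconda"]]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each vector is as long as the vocabulary, which is why this representation becomes impractical for vocabularies with tens of thousands of words.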

Word embeddings: A dense representation of words in a continuous vector space where semantically similar words are positioned closer together. While there are pre-trained embeddings like Word2Vec and GloVe, we'll be training our embeddings in the next section.

In [None]:
# Assuming a predefined vocabulary from our dataset.
vocab = {"I": 0, "love": 1, "Anaconda": 2}  # This is just a small example; in practice, it'll be larger.


tokens = ["I", "love", "Anaconda", "!"]

token_ids = [vocab[token] for token in tokens if token in vocab]
print(token_ids)

## Word Embeddings using PyTorch
Word embeddings transform words into dense vectors in a continuous space, capturing semantic relationships between words based on their co-occurrence patterns. For instance, in the embedding space, the vectors for words like "king" and "queen" might be closer together due to their similar contexts and meanings.
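
"Closer together" is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors (`toy_embeddings`) purely for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors chosen by hand; they are not the output of any trained model.
toy_embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```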

Word embeddings represent a major leap from traditional sparse representations like one-hot encoding. While a one-hot encoded vector has dimensions equal to the size of the vocabulary and is mostly filled with zeros, a word embedding is a dense vector with a much smaller dimension, e.g., 100 or 300, and captures semantic meaning.

Some popular pre-trained word embeddings are Word2Vec, GloVe, and FastText. They are trained on massive corpora and can be used off the shelf. However, for domain-specific tasks or languages with fewer resources, it is often beneficial to train your own embeddings.

## Training a Simple Word Embedding Model with PyTorch
In PyTorch, the `nn.Embedding` layer handles word embeddings. It holds a trainable matrix of shape `[vocab_size x embedding_dim]`, where each row is the vector representation of one word.

In [None]:
import torch
import torch.nn as nn

vocab_size = len(vocab)  # Assuming you've defined a vocab from your dataset previously.
embedding_dim = 100  # Typical dimensions are 50, 100, 200, 300, etc.

embedding_layer = nn.Embedding(vocab_size, embedding_dim)

To retrieve the embeddings for specific words:

In [None]:
# Using our previous vocab example
word_ids = torch.tensor([vocab["I"], vocab["Anaconda"]], dtype=torch.long)
word_embeddings = embedding_layer(word_ids)
print(word_embeddings)

The above is a very simple example of creating an embedding layer and looking up vectors in it. But what about training embeddings ourselves? Let's use the `AG_NEWS` dataset from `torchtext` to build a vocabulary and a tokenization pipeline.

In [None]:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Tokenize the AG_NEWS dataset
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

def yield_tokens(data_iter):
    # Yield one list of tokens per news article
    for _, line in data_iter:
        yield tokenizer(line)

# Build the vocabulary (torchtext 0.10 or later); unknown tokens map to <unk>
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Convert a string into a list of token ids
text_pipeline = lambda x: vocab(tokenizer(x))

# Example
text_pipeline('I love Anaconda')

Let's view a few samples from the dataset to see what we'll be working with.

In [None]:
train_iter = AG_NEWS(split='train')

i = 0
for (label, line) in train_iter:
    if i < 5:
        print(f"label: {label}, line: {line}")
        i += 1
    else:
        break

Now, let's create a lightweight `Dataset` class that turns a sequence of token ids into (center word, context word) training pairs.

In [None]:
from torch.utils.data import Dataset, DataLoader

class SkipGramDataset(Dataset):
    def __init__(self, text, window_size=4):
        self.window_size = window_size
        self.data = []
        for i, center_word in enumerate(text):
            for j in range(i - window_size, i + window_size + 1):
                if j != i and j >= 0 and j < len(text):
                    self.data.append((center_word, text[j]))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Example dataset usage
toy_text = text_pipeline('I love Anaconda')
skipgram_dataset = SkipGramDataset(toy_text)
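
To see which pairs `SkipGramDataset` stores, the sketch below mirrors its pairing loop as a standalone function that operates on word strings rather than token ids. `skipgram_pairs` is a hypothetical helper, not part of the notebook's classes.

```python
def skipgram_pairs(tokens, window_size=1):
    """Return (center, context) pairs for every token within window_size of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(i - window_size, i + window_size + 1):
            # Skip the center itself and positions outside the sequence.
            if j != i and 0 <= j < len(tokens):
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["i", "love", "anaconda"], window_size=1)
print(pairs)
# [('i', 'love'), ('love', 'i'), ('love', 'anaconda'), ('anaconda', 'love')]
```

With a larger window, each center word is paired with more of its neighbors, so the number of training pairs grows roughly in proportion to the window size.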

And one more class that will represent our skip-gram PyTorch model, which lets us initialize and train word embeddings!

In [None]:
import torch.optim as optim

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, center_words):
        embeds = self.embeddings(center_words)
        out = self.linear(embeds)
        log_probabilities = self.log_softmax(out)
        return log_probabilities

Finally, we can train the model.

In [None]:
# Training
embedding_dim = 100
learning_rate = 0.01
epochs = 5

model = SkipGram(len(vocab), embedding_dim)
criterion = nn.NLLLoss() # Negative Log Likelihood Loss
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(epochs):
    total_loss = 0
    dataloader = DataLoader(skipgram_dataset, batch_size=32, shuffle=True)
    for center_word, target_word in dataloader:
        optimizer.zero_grad()
        log_probs = model(center_word)
        loss = criterion(log_probs, target_word)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

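At the end of this training process, `model.embeddings` contains the learned word embeddings. Note that `skipgram_dataset` was built from a single three-word toy sentence and we trained for only a few epochs, so these embeddings will not be very useful. For meaningful results, build the dataset from the token ids of a much larger text and train for longer.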

This approach uses a simplified skip-gram model for clarity and brevity. In practice, models like Word2Vec employ optimizations such as negative sampling to make training more efficient.
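
As a rough sketch of why negative sampling is cheaper: instead of a softmax over the whole vocabulary, the loss for one (center, context) pair involves only the true context word and a handful of sampled "negative" words. The function below computes that loss with plain Python lists. The random vectors stand in for learned embeddings, and the sketch does not implement how negatives are drawn (Word2Vec samples them from a smoothed unigram distribution).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram negative-sampling loss for one (center, context) pair."""
    # Reward a high score for the true context word...
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    # ...and a low score for each sampled negative word.
    for neg_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg_vec)))
    return loss

random.seed(0)
dim, num_negatives = 4, 3  # illustrative sizes, not recommendations

def random_vec():
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]

center, context = random_vec(), random_vec()
negatives = [random_vec() for _ in range(num_negatives)]
loss = negative_sampling_loss(center, context, negatives)
print(f"loss for one pair: {loss:.4f}")
```

Each training pair now costs `num_negatives + 1` dot products rather than one per vocabulary word.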

## Text Classification using Transformer Layer in PyTorch
Transformers, introduced in the paper "Attention is All You Need" by Vaswani et al., revolutionized the NLP landscape by presenting a new way to handle sequential data without relying on recurrence. At the heart of the transformer architecture is the attention mechanism, which allows the model to focus on different parts of the input text.
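
The core computation of that attention mechanism, scaled dot-product attention, can be sketched in plain Python. This is a simplified illustration under stated assumptions: a real transformer first maps the input through learned query, key, and value projections and runs several attention heads in parallel, whereas here the toy input `x` is used directly for all three roles.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights); each output is a weighted average of the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        output = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for a 3-token sequence (self-attention: Q = K = V).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token attends to every token in the sequence, including itself.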

In this section, we'll fine-tune a pre-trained transformer model (BERT) to classify text.

## Setting up the dataset

AG News is a dataset for news classification. The goal is to classify a news article into one of 4 classes: World, Sports, Business, Science/Technology.

In [None]:
import torch
from torch.utils.data import DataLoader
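from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated; PyTorch's is a drop-in replacement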
from datasets import load_dataset

dataset = load_dataset("ag_news")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
batch_size = 8

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=256)
 
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_data = tokenized_datasets["train"].select(range(1000))
test_data = tokenized_datasets["test"].select(range(10))

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)
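
To illustrate what `padding='max_length'` and `truncation=True` do to each example, here is a simplified sketch. `pad_or_truncate` is a hypothetical helper and the token ids are illustrative. It assumes a padding id of 0, as in BERT's vocabulary, and ignores the special-token handling that the real tokenizer performs when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]              # truncation
    attention_mask = [1] * len(ids)           # 1 marks real tokens
    num_padding = max_length - len(ids)
    # Padding positions get pad_id and a mask value of 0 so the model ignores them.
    return ids + [pad_id] * num_padding, attention_mask + [0] * num_padding

input_ids, attention_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Fixing every example to the same length lets the `DataLoader` stack them into a single tensor per batch, at the cost of wasted computation on padding for short texts.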

## Initialize the model and optimizer
We'll use a pre-trained BERT model with a classification head. Since AG News has 4 categories, we set the `num_labels` parameter to 4.

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

And define the optimizer and scheduler we'll use to train this model.

In [None]:
from transformers import get_scheduler

num_epochs = 3

optimizer = AdamW(model.parameters(), lr=1e-5)

num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
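
To show what the "linear" schedule does to the learning rate, the sketch below computes the rate at a given step: a linear warmup (skipped here, since `num_warmup_steps=0`) followed by a linear decay to zero. `linear_decay_lr` is a hypothetical helper written for illustration, not the library's code.

```python
def linear_decay_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

# 1000 training examples / batch size 8 = 125 steps per epoch, times 3 epochs.
total_steps = 3 * 125
for step in (0, 125, 250, 375):
    lr = linear_decay_lr(step, base_lr=1e-5, num_training_steps=total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```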

## Fine-tuning the model
We'll train our model for a few epochs. For the sake of this demonstration, we'll set it to 3.

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(model.device) for k, v in batch.items()}  # move tensors to the model's device
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Remember, working with transformers can be resource-intensive, especially with longer sequences.

## Evaluating the model
The last step is to evaluate the performance of the model.

In [None]:
import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(model.device) for k, v in batch.items()}  # move tensors to the model's device
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
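
For intuition about what the accuracy metric measures, the sketch below takes the argmax of each row of logits and compares it with the reference label, using plain Python lists. The logits and labels are made up for illustration, and `argmax` and `accuracy` are hypothetical helpers.

```python
def argmax(values):
    """Index of the largest value in the list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits_batch, references):
    """Fraction of examples whose highest-scoring class equals the reference label."""
    predictions = [argmax(logits) for logits in logits_batch]
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

# Made-up logits for three examples over the four AG News classes.
toy_logits = [
    [2.1, 0.3, -1.0, 0.2],   # highest score at index 0
    [0.1, 0.2, 3.5, -0.4],   # highest score at index 2
    [-0.5, 1.9, 0.0, 0.7],   # highest score at index 1
]
toy_labels = [0, 2, 3]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # 0.6666666666666666
```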