# Natural Language Processing Concepts and Demonstrations

This Google Colab notebook provides explanations and runnable code snippets to demonstrate various Natural Language Processing (NLP) concepts discussed in the lecture.

## 1. Introduction

This notebook will cover the following key NLP concepts:

* Text Preprocessing
* Feature Generation (Bag of Words and TF-IDF)
* Topic Modeling (Latent Dirichlet Allocation)
* Sentiment Analysis
* Recurrent Neural Networks (RNNs, LSTMs, GRUs)

## 2. Text Preprocessing

Text preprocessing is a crucial step in NLP to transform raw text into an analyzable format. Common preprocessing techniques include:

* **Lowercasing:** Converting all text to lowercase to ensure consistency.
* **Cleaning and Normalization:** Removing HTML tags, stripping accents from characters, expanding contractions, standardizing spellings, removing extra whitespace, punctuation, and special characters, and converting number words to numeric form.
* **Tokenization:** Splitting text into smaller units called tokens (e.g., words).
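* **Stemming:** Reducing inflected words to a root form (stem) by heuristically stripping suffixes; the stem need not be a dictionary word (e.g., "exampl").
* **Lemmatization:** Converting words to their base dictionary form (lemma), e.g., "runs" to "run".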
* **Stop Word Removal:** Removing common words that often have little semantic value (e.g., "the", "a", "is").
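
Several of the cleaning steps listed above (HTML tags, accents, contractions) are not exercised by the NLTK cell in this section. The following is a minimal sketch of them using only the standard library. The `clean_text` helper and the tiny `CONTRACTIONS` table are hypothetical names introduced for illustration, not part of the lecture code, and a real contraction list would be far longer.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table (an assumption for this sketch).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}

def clean_text(raw):
    """Apply a few of the cleaning steps listed above, in order."""
    text = re.sub(r"<[^>]+>", " ", raw)            # drop HTML tags
    text = html.unescape(text)                     # decode entities such as &amp;
    text = unicodedata.normalize("NFKD", text)     # split accented chars into base + mark
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop the marks
    text = text.lower()
    for short, full in CONTRACTIONS.items():       # expand before apostrophes are removed
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", " ", text)           # remove punctuation and special characters
    return " ".join(text.split())                  # collapse extra whitespace

print(clean_text("<p>It's a  <b>café</b> &amp; I can't wait!</p>"))
# it is a cafe i cannot wait
```

The order matters here: contractions are expanded before punctuation is stripped, since removing the apostrophe first would leave fragments such as "can t".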

In [2]:
!pip uninstall -y nltk
!pip install nltk

In [3]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

text = "This is an example sentence with some common words like is and some inflected words like running and runs."

# Lowercasing
text_lower = text.lower()
print(f"Lowercased: {text_lower}")

# Cleaning (removing punctuation and extra spaces)
text_cleaned = re.sub(r'[^\w\s]', '', text_lower)
text_cleaned = " ".join(text_cleaned.split())
print(f"Cleaned: {text_cleaned}")

# Tokenization
tokens = nltk.word_tokenize(text_cleaned)
print(f"Tokens: {tokens}")

# Stop word removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w not in stop_words]
print(f"Filtered tokens (stop words removed): {filtered_tokens}")

# Stemming
porter = PorterStemmer()
stemmed_tokens = [porter.stem(w) for w in filtered_tokens]
print(f"Stemmed tokens: {stemmed_tokens}")

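# Lemmatization (WordNetLemmatizer treats each word as a noun unless a pos argument
# is given, which is why 'running' is left unchanged in the output)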
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in filtered_tokens]
print(f"Lemmatized tokens: {lemmatized_tokens}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Lowercased: this is an example sentence with some common words like is and some inflected words like running and runs.
Cleaned: this is an example sentence with some common words like is and some inflected words like running and runs
Tokens: ['this', 'is', 'an', 'example', 'sentence', 'with', 'some', 'common', 'words', 'like', 'is', 'and', 'some', 'inflected', 'words', 'like', 'running', 'and', 'runs']
Filtered tokens (stop words removed): ['example', 'sentence', 'common', 'words', 'like', 'inflected', 'words', 'like', 'running', 'runs']
Stemmed tokens: ['exampl', 'sentenc', 'common', 'word', 'like', 'inflect', 'word', 'like', 'run', 'run']
Lemmatized tokens: ['example', 'sentence', 'common', 'word', 'like', 'inflected', 'word', 'like', 'running', 'run']


## 3. Feature Generation
Feature generation involves transforming text data into numerical representations that machine learning models can understand. Two common techniques are:

- Bag of Words (BoW): Represents each document as a collection of words, ignoring grammar and word order, and encodes documents as vectors based on the frequency of each word in a vocabulary.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weights words based on their frequency in a document and their inverse frequency across the entire corpus, giving higher importance to words that are frequent in a specific document but rare in the corpus as a whole.
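
To make the two representations concrete, here is a hand-rolled sketch that computes both without scikit-learn. It assumes the smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` applies by default, and it uses the same four sentences as the scikit-learn cell in this section so the numbers can be compared. The variable names are illustrative only.

```python
import math
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Tokenize: lowercase, keep words of two or more characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag of Words: raw term counts per document, one column per vocabulary word
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocab])

# Smoothed inverse document frequency for each vocabulary word
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF: count * idf, then scale each row to unit Euclidean length
tfidf = []
for row in bow:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print(vocab)
print(bow[1])
print([round(v, 4) for v in tfidf[0]])
```

In the first document, "first" (present in two documents) ends up with a larger weight than "is", "the", or "this" (present in all four), even though each occurs once.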

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Bag of Words
vectorizer_bow = CountVectorizer()
bow_matrix = vectorizer_bow.fit_transform(documents)
print("Bag of Words Matrix:")
print(bow_matrix.toarray())
print("Vocabulary:", vectorizer_bow.vocabulary_)

# TF-IDF
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(documents)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
print("Vocabulary:", vectorizer_tfidf.vocabulary_)

Bag of Words Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

TF-IDF Matrix:
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


## 4. Topic Modeling (Latent Dirichlet Allocation - LDA)
Topic modeling is a technique to automatically discover the underlying topics in a collection of documents. Latent Dirichlet Allocation (LDA) is a popular probabilistic model that assumes each document is a mixture of topics and each topic is a mixture of words.
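
The "mixture" assumption can be read as a recipe for generating a document. The sketch below simulates that recipe with the standard library only. The two topics, their word probabilities, and the `alpha` value are made-up illustrative numbers, and the topic-word distributions are fixed by hand here, whereas full LDA also treats them as draws from a Dirichlet prior.

```python
import random

random.seed(0)

# Hypothetical toy setup: two topics over a six-word vocabulary.
vocab = ["health", "medicine", "exercise", "team", "baseball", "game"]
topic_word = [
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],   # topic 0 leans toward health words
    [0.02, 0.03, 0.05, 0.30, 0.30, 0.30],   # topic 1 leans toward sports words
]

def sample_dirichlet(alpha):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document: a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha)                               # document's topic mixture
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        w = random.choices(vocab, weights=topic_word[z])[0]      # pick a word from that topic
        words.append(w)
    return theta, words

theta, words = generate_document(8)
print([round(p, 3) for p in theta], words)
```

Training an LDA model runs this story in reverse: given only the words, it infers topic mixtures and topic-word distributions that plausibly produced them.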

In [5]:
import gensim
from gensim import corpora

documents = [
    "health medicine exercise cardio vascular cancer",
    "sports news team baseball injury recovery",
    "religion pray hope faith god",
    "health injury medicine recovery",
    "sports team baseball game"
]

# Tokenize the documents
tokenized_documents = [doc.split() for doc in documents]

# Create a dictionary (mapping from word to integer id)
dictionary = corpora.Dictionary(tokenized_documents)

# Convert each tokenized document into a sparse bag-of-words vector:
# a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in tokenized_documents]

# Train the LDA model
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

# Print the topics
print("Topics:")
for topic in lda_model.print_topics():
    print(topic)

Topics:
(0, '0.106*"god" + 0.106*"faith" + 0.106*"pray" + 0.106*"religion" + 0.106*"hope" + 0.037*"game" + 0.036*"sports" + 0.036*"team" + 0.036*"baseball" + 0.036*"recovery"')
(1, '0.086*"injury" + 0.086*"medicine" + 0.086*"health" + 0.086*"recovery" + 0.086*"baseball" + 0.086*"team" + 0.086*"sports" + 0.052*"exercise" + 0.052*"news" + 0.052*"cardio"')


## 5. Sentiment Analysis
Sentiment analysis aims to identify and extract subjective information (opinions, emotions, attitudes) from text. A simple rule-based approach uses sentiment lexicons, which are lists of words associated with positive or negative sentiment.

In [6]:
positive_words = ["good", "best", "beautiful", "amazing", "fantastic"]
negative_words = ["bad", "worst", "ugly", "awful", "poor"]

def analyze_sentiment(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    positive_count = 0
    negative_count = 0
    for token in tokens:
        if token in positive_words:
            positive_count += 1
        elif token in negative_words:
            negative_count += 1

    if positive_count > negative_count:
        return "Positive"
    elif negative_count > positive_count:
        return "Negative"
    else:
        return "Neutral"

review1 = "This is a fantastic and amazing movie!"
review2 = "The food was bad and the service was awful."
review3 = "The weather is cloudy today."

print(f"Review 1 Sentiment: {analyze_sentiment(review1)}")
print(f"Review 2 Sentiment: {analyze_sentiment(review2)}")
print(f"Review 3 Sentiment: {analyze_sentiment(review3)}")

Review 1 Sentiment: Positive
Review 2 Sentiment: Negative
Review 3 Sentiment: Neutral


## 6. Recurrent Neural Networks (RNNs, LSTMs, GRUs)
Recurrent Neural Networks (RNNs) are designed to process sequential data by maintaining an internal state (memory) that is updated at every time step. However, simple RNNs suffer from the vanishing gradient problem: gradients shrink as they are propagated back through many time steps, which limits the ability to learn long-range dependencies. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) are extensions of RNNs that address this issue by introducing gating mechanisms.
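
The vanishing gradient effect can be seen in a one-dimensional toy RNN, h_t = tanh(w * h_{t-1} + u * x_t). The sketch below multiplies the per-step derivatives to get the sensitivity of the final state to the initial state. The function name and the values of `w`, `u`, and the inputs are arbitrary choices for illustration, not taken from the lecture.

```python
import math

def gradient_through_time(w, u, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)   # chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
    return grad

inputs = [0.5] * 50
for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(w=0.5, u=1.0, inputs=inputs[:steps]))
```

Each factor is at most |w| in magnitude, so with |w| < 1 the product shrinks at least geometrically with sequence length. This is the signal that gated architectures are designed to preserve.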

In [7]:
import torch
import torch.nn as nn
from torch.nn import Embedding, RNN, LSTM, GRU, Linear
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from collections import Counter
import numpy as np

# Sample IMDB-like dataset (replace with actual IMDB loading if needed)
reviews = [
    "This movie is fantastic and I loved it.",
    "The acting was terrible and the plot was boring.",
    "I really enjoyed this film, it was great.",
    "Not a good movie, I would not recommend it.",
    "Excellent performance by the cast.",
    "Waste of time and money.",
    "A must-see movie, highly recommended.",
    "The story was confusing and the ending was bad."
]
labels = [1, 0, 1, 0, 1, 0, 1, 0] # 1 for positive, 0 for negative

# 1. Data Preprocessing
def preprocess_data(reviews, labels):
    # Tokenization
    tokens = [review.lower().split() for review in reviews]
    # Vocabulary creation
    all_tokens = [token for sublist in tokens for token in sublist]
    vocabulary = Counter(all_tokens)
    sorted_vocab = sorted(vocabulary, key=vocabulary.get, reverse=True)
    word_to_index = {word: index + 2 for index, word in enumerate(sorted_vocab)}
    word_to_index['<pad>'] = 0
    word_to_index['<unk>'] = 1
    index_to_word = {index: word for word, index in word_to_index.items()}
    # Convert tokens to indices and pad sequences
    indexed_tokens = [[word_to_index.get(token, word_to_index['<unk>']) for token in review] for review in tokens]
    max_length = max(len(seq) for seq in indexed_tokens)
    padded_tokens = [seq + [word_to_index['<pad>']] * (max_length - len(seq)) for seq in indexed_tokens]
    return np.array(padded_tokens), np.array(labels), word_to_index, index_to_word, max_length

padded_tokens, labels, word_to_index, index_to_word, max_length = preprocess_data(reviews, labels)

# Split data into training and testing sets
train_tokens, test_tokens, train_labels, test_labels = train_test_split(padded_tokens, labels, test_size=0.2, random_state=42)

# Create PyTorch Datasets
class SentimentDataset(Dataset):
    def __init__(self, tokens, labels):
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.tokens[idx], self.labels[idx]

train_dataset = SentimentDataset(train_tokens, train_labels)
test_dataset = SentimentDataset(test_tokens, test_labels)

# Create DataLoaders
batch_size = 2
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)



### Data Preprocessing:

A sample IMDB-like dataset of reviews and their corresponding sentiment labels (positive or negative) is created. In a real-world scenario, you would load the actual IMDB dataset using libraries like torchtext.
The preprocess_data function performs the following steps:

- Tokenization: Converts each review into a list of lowercase words.
- Vocabulary Creation: Creates a vocabulary (mapping of unique words to indices) from all the tokens. `<pad>` is used for padding sequences to the same length, and `<unk>` is used for words not in the vocabulary.
- Token to Index Conversion: Converts each word in the reviews to its corresponding index in the vocabulary.
- Padding: Pads shorter sequences with the `<pad>` index to make all sequences the same length, which is required for batch processing in RNNs.
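
The role of the two special tokens is easiest to see when encoding a review that contains an unseen word. The sketch below is a simplified, standalone variant of the steps above. `build_vocab` and `encode` are hypothetical helper names, and the truncation to a fixed length is an added assumption (the notebook's `preprocess_data` pads to the longest review instead).

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def build_vocab(token_lists):
    """Map each word to an index, reserving 0 for padding and 1 for unknown words."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    word_to_index = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common():   # most frequent words get the smallest indices
        word_to_index[word] = len(word_to_index)
    return word_to_index

def encode(tokens, word_to_index, length):
    """Convert tokens to indices, then truncate or right-pad to a fixed length."""
    ids = [word_to_index.get(tok, word_to_index[UNK]) for tok in tokens][:length]
    return ids + [word_to_index[PAD]] * (length - len(ids))

train = [["great", "movie"], ["boring", "movie", "plot"]]
vocab = build_vocab(train)
print(vocab)
print(encode(["great", "soundtrack"], vocab, length=4))
# [3, 1, 0, 0]
```

Here "soundtrack" never appeared in the training tokens, so it maps to index 1 (`<unk>`), and the two trailing zeros are padding.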


The preprocessed data is split into training and testing sets.
A PyTorch Dataset class (SentimentDataset) is created to load the tokenized reviews and their labels.
DataLoaders are created to handle batching and shuffling of the data during training.

### Model Architectures:

- SimpleRNN:
-- Embedding layer: Converts word indices into dense vector representations. vocab_size is the number of entries in the vocabulary (including the `<pad>` and `<unk>` tokens), and embedding_dim is the size of the embedding vectors.
-- RNN layer: The core recurrent layer. embedding_dim is the input size and hidden_dim is the size of the hidden state vectors. The code below keeps PyTorch's default tanh activation; passing nonlinearity='relu' would select ReLU instead.
-- Linear layer: A fully connected layer that maps the final hidden state to the output dimension (1 for binary sentiment classification).
-- forward method: Defines the forward pass of the network. It passes the input through the embedding layer, then the RNN layer. The hidden state of the last time step is used as the representation of the entire sequence, which is then passed through the fully connected layer and a sigmoid activation function to get the probability of the sentiment being positive.

- LSTMModel:
-- Similar structure to SimpleRNN, but uses an LSTM layer instead of RNN. The LSTM layer maintains a cell state alongside the hidden state, which helps it handle long-range dependencies.
-- The LSTM layer returns the per-step outputs together with the final hidden and cell states. The forward method uses the output at the last time step, which for this single-layer, unidirectional LSTM is the same as the final hidden state.

- GRUModel:
-- Similar structure to SimpleRNN, but uses a GRU layer. GRU is a simplified variant of LSTM with fewer parameters and no separate cell state.
-- The GRU layer returns the per-step outputs and the final hidden state. The forward method uses the final hidden state.
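
The "fewer parameters" remark can be checked with simple arithmetic. The sketch below counts the parameters of one recurrent layer, assuming the layout used by PyTorch's single-layer, unidirectional `nn.RNN`, `nn.GRU`, and `nn.LSTM`: an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors, repeated once per gate block (1, 3, and 4 respectively). `recurrent_param_count` is a hypothetical helper name, and the sizes match the `embedding_dim` and `hidden_dim` used in this notebook.

```python
def recurrent_param_count(input_dim, hidden_dim, n_gates):
    """Parameters in one recurrent layer with two bias vectors per gate block."""
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return n_gates * per_gate

embedding_dim, hidden_dim = 100, 128
for name, gates in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(name, recurrent_param_count(embedding_dim, hidden_dim, gates))
# RNN 29440
# GRU 88320
# LSTM 117760
```

Under these assumptions the GRU layer has three quarters of the LSTM layer's parameters, and both are several times larger than the plain RNN layer.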

In [8]:
# 2. Define Model Architectures
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SimpleRNN, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.rnn = RNN(embedding_dim, hidden_dim, batch_first=True) # batch_first=True for correct shape
        self.fc = Linear(hidden_dim, output_dim)


    def forward(self, text):
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)
        # output shape: (batch_size, seq_len, hidden_dim) because batch_first=True
        # hidden shape: (n_layers * n_directions, batch_size, hidden_dim)
        # Note: sequences are right-padded with index 0, so the final hidden state
        # is computed after any trailing <pad> tokens have been processed.
        last_hidden_states = hidden[-1]  # final hidden state of the last layer
        return torch.sigmoid(self.fc(last_hidden_states))  # probability of positive sentiment

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.lstm = LSTM(embedding_dim, hidden_dim, batch_first=True) # Added batch_first=True
        self.fc = Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        # Pass the embedded input to the LSTM
        output, (hidden, cell) = self.lstm(embedded)
        # output.shape = (batch_size, seq_len, hidden_size)
        # hidden.shape = (num_layers * num_directions, batch_size, hidden_size)
        output = output[:, -1, :]  # output at the last time step (equals hidden[-1] here)
        return torch.sigmoid(self.fc(output))  # probability of positive sentiment

class GRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(GRUModel, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.gru = GRU(embedding_dim, hidden_dim, batch_first=True) # Added batch_first=True
        self.fc = Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        # Pass the embedded input to the GRU
        output, hidden = self.gru(embedded)
        # output.shape = (batch_size, seq_len, hidden_size)
        # hidden.shape = (num_layers * num_directions, batch_size, hidden_size)
        output = hidden[-1]  # final hidden state of the last layer
        return torch.sigmoid(self.fc(output))  # probability of positive sentiment

# 3. Instantiate and Train Models
embedding_dim = 100
hidden_dim = 128
output_dim = 1
vocab_size = len(word_to_index)

rnn_model = SimpleRNN(vocab_size, embedding_dim, hidden_dim, output_dim)
lstm_model = LSTMModel(vocab_size, embedding_dim, hidden_dim, output_dim)
gru_model = GRUModel(vocab_size, embedding_dim, hidden_dim, output_dim)

def train_model(model, data_loader, optimizer, criterion, epochs=100):
    model.train()
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(data_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
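            # Log every 10th batch; with only 3 batches per epoch on this toy dataset,
            # the condition is never met, so no training loss is printed
            if (batch_idx + 1) % 10 == 0: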
                print(f'Epoch [{epoch+1}/{epochs}], Step [{batch_idx+1}/{len(data_loader)}], Loss: {loss.item():.4f}')

def evaluate_model(model, data_loader, criterion):
    model.eval()
    total_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in data_loader:
            output = model(data)
            loss = criterion(output, target)
            total_loss += loss.item()
            predictions = torch.round(output)
            correct += (predictions == target).sum().item()
    avg_loss = total_loss / len(data_loader)
    accuracy = correct / len(data_loader.dataset)
    print(f'Test Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}')

# Training Simple RNN
optimizer_rnn = torch.optim.Adam(rnn_model.parameters(), lr=0.001)
criterion = nn.BCELoss()
print("Training Simple RNN...")
train_model(rnn_model, train_loader, optimizer_rnn, criterion)
print("Evaluating Simple RNN...")
evaluate_model(rnn_model, test_loader, criterion)

# Training LSTM
optimizer_lstm = torch.optim.Adam(lstm_model.parameters(), lr=0.001)
print("\nTraining LSTM...")
train_model(lstm_model, train_loader, optimizer_lstm, criterion)
print("Evaluating LSTM...")
evaluate_model(lstm_model, test_loader, criterion)

# Training GRU
optimizer_gru = torch.optim.Adam(gru_model.parameters(), lr=0.001)
print("\nTraining GRU...")
train_model(gru_model, train_loader, optimizer_gru, criterion)
print("Evaluating GRU...")
evaluate_model(gru_model, test_loader, criterion)

Training Simple RNN...
Evaluating Simple RNN...
Test Loss: 3.6792, Accuracy: 0.5000

Training LSTM...
Evaluating LSTM...
Test Loss: 4.0651, Accuracy: 0.5000

Training GRU...
Evaluating GRU...
Test Loss: 4.0790, Accuracy: 0.5000


