## Transformer for Sequence Classification

In [1]:
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv('award_data.csv')
data.iloc[0]['Abstract'] 

: 

In [None]:
# Replace the dataset with the actual DataFrame you have
# Standardize the State column to uppercase and strip extra spaces
data['State'] = data['State'].str.strip().str.upper()

# Check unique states after cleaning
unique_states = data['State'].unique()
print("Unique States After Cleaning:", unique_states)

# If you need to handle specific cases (e.g., replace 'NAN' with 'Unknown' or drop them):
data['State'] = data['State'].replace('NAN', pd.NA)  # Convert the literal string 'NAN' to a proper missing value (assignment avoids chained in-place modification)
data.dropna(subset=['State'], inplace=True)  # Drop rows with missing State if necessary

# Re-check unique states to verify cleaning
unique_states_cleaned = data['State'].unique()
print("Final Unique States:", unique_states_cleaned)

: 
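As a quick check of how the cleaning step treats missing entries, here is a minimal sketch on a made-up Series (the values are illustrative assumptions, not taken from `award_data.csv`). The pandas `.str` accessor propagates missing values, so a missing state stays missing after `strip()` and `upper()`; a literal `'NAN'` string would typically only show up if missing values had been converted to text beforehand.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['State']; the values are made up for illustration
states = pd.Series([" or", "TN ", np.nan, "tn"])

# Same cleaning as in the notebook: strip whitespace, then uppercase
cleaned = states.str.strip().str.upper()

print(cleaned.dropna().unique().tolist())  # distinct state codes after cleaning
print(int(cleaned.isna().sum()))           # number of rows still missing
```

With this toy input, the two spellings of `tn` collapse into one code and the missing row is left for `dropna` to remove.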

In [None]:
import pandas as pd

# Assumes the cleaned DataFrame 'data' from the previous cell
# Adjust these two names if your dataset uses different column names
state_column = 'State'  # column holding the state code
proposal_title_column = 'Abstract'  # column holding the text; abstracts are used here rather than proposal titles

# Normalize the state codes to uppercase (a no-op if the cleaning cell above was run)
data[state_column] = data[state_column].str.upper()

# Create a dictionary with states as keys and lists of abstracts as values
state_proposals = data.groupby(state_column)[proposal_title_column].apply(list).to_dict()

# Display the resulting dictionary
print(state_proposals)

: 
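To make the shape of `state_proposals` concrete, here is a small sketch of the same `groupby(...).apply(list).to_dict()` pattern on a toy DataFrame. The frame and its contents are made up for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with the same two columns used in the notebook (contents are made up)
toy = pd.DataFrame({
    "State": ["OR", "TN", "OR"],
    "Abstract": ["abstract a", "abstract b", "abstract c"],
})

# One list of abstracts per state, in original row order within each state
toy_proposals = toy.groupby("State")["Abstract"].apply(list).to_dict()
print(toy_proposals)
# {'OR': ['abstract a', 'abstract c'], 'TN': ['abstract b']}
```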

In [None]:
len(state_proposals)

: 

# Dataset and Preprocessing

**Data Preprocessing**

Raw text needs to be converted into numeric tensors before it can be fed to a neural network. We will perform the following preprocessing steps:
- Loading and Tokenization: Collect the abstracts for the two states being compared and split each text into tokens (words). Clean the text by lowercasing and removing punctuation.
- Vocabulary Creation: Build a vocabulary of the most frequent words, mapping each word to an integer index.
- Encoding: Convert each abstract into a sequence of word indices using the vocabulary (unknown words get a special index).
- Padding/Truncation: Pad or truncate each sequence to a fixed length so they can be batched into tensors.


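## Loading and Tokenizing Abstracts
We'll first gather the training abstracts and their labels. The task is binary: abstracts with state code `OR` get label 1 and abstracts with state code `TN` get label 0. Unlike benchmark datasets such as IMDb, this data has no predefined train/test split, so we use the first three quarters of each state's abstracts for training and hold out the remainder for testing. Only the training data will be used to build the vocabulary. To load the abstracts: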

In [None]:
len(state_proposals['TN'])

: 

In [None]:
import re

OR_Proposal = state_proposals['OR']
TN_Proposal = state_proposals['TN']

Train_OR_Proposal = OR_Proposal[:len(OR_Proposal)//4*3]
Test_OR_Proposal = OR_Proposal[len(OR_Proposal)//4*3:]

Train_TN_Proposal = TN_Proposal[:len(TN_Proposal)//4*3]
Test_TN_Proposal = TN_Proposal[len(TN_Proposal)//4*3:]

train_texts = []
train_labels = []
test_texts = []
test_labels = []

# Label the OR training abstracts
for text in Train_OR_Proposal:
    train_texts.append(str(text))
    train_labels.append(1)  # OR -> label 1

# Label the TN training abstracts
for text in Train_TN_Proposal:
    train_texts.append(str(text))
    train_labels.append(0)  # TN -> label 0

# Label the OR test abstracts
for text in Test_OR_Proposal:
    test_texts.append(str(text))
    test_labels.append(1)  # OR -> label 1

# Label the TN test abstracts
for text in Test_TN_Proposal:
    test_texts.append(str(text))
    test_labels.append(0)  # TN -> label 0

print(f"Loaded {len(train_texts)} training abstracts and {len(test_texts)} test abstracts.")

: 
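The split above takes the first three quarters of each list in file order. If the CSV happens to be sorted (for example by date), a seeded shuffle before slicing may give a more representative test set. The helper below is a hypothetical sketch, not part of the original notebook; `shuffled_split` and the seed value are assumptions introduced for illustration.

```python
import random

def shuffled_split(items, seed=0):
    """Shuffle a copy of items with a fixed seed, then split roughly 75/25."""
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) // 4 * 3           # same cut point as the slicing in the notebook
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for one state's list of abstracts
toy_abstracts = [f"abstract {i}" for i in range(8)]
train_part, test_part = shuffled_split(toy_abstracts)
print(len(train_part), len(test_part))
```

In the notebook this would be applied to `OR_Proposal` and `TN_Proposal` separately, so that each state keeps the same train/test proportion.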

Next, we define a function to tokenize and clean a text string. We'll remove HTML tags (if any) and punctuation, and make everything lowercase. A simple regex keeps only letters, digits, and whitespace (more advanced tokenizers are also an option).

In [None]:
def tokenize(text):
    # Remove HTML tags
    text = re.sub(r"<.*?>", " ", text)
    # Keep only letters, digits, and whitespace (replace everything else with a space)
    text = re.sub(r"[^a-zA-Z0-9\s]", ' ', text)
    # Lowercase the text
    text = text.lower()
    # Split into tokens by whitespace
    tokens = text.split()
    return tokens

# Tokenize all reviews
train_tokens = [tokenize(review) for review in train_texts]
test_tokens  = [tokenize(review) for review in test_texts]

# Peek at one tokenized example
print(train_texts[0][:100], "->", train_tokens[0][:20])


: 
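To see what the tokenizer does to punctuation, hyphens, and markup, here is a short worked example. It repeats the `tokenize` function so the snippet runs on its own; the sample string is made up.

```python
import re

def tokenize(text):
    # Same steps as the notebook's tokenizer
    text = re.sub(r"<.*?>", " ", text)           # drop HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # keep letters, digits, whitespace
    return text.lower().split()

sample = "<p>State-of-the-art methods, funded in 2024!</p>"
sample_tokens = tokenize(sample)
print(sample_tokens)
# ['state', 'of', 'the', 'art', 'methods', 'funded', 'in', '2024']
```

Note that hyphenated terms are split into their parts, which may or may not be desirable for technical abstracts.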

## Building the Vocabulary

Now we build a vocabulary (a word-to-index mapping) from the training tokens. It's common to limit the vocabulary to the top N most frequent words. Keeping every unique token makes the embedding table large and sparse, with many words seen only once or twice, so we cap the vocabulary at the 10,000 most frequent words (10k or 20k is a common choice). The cell below prints the actual number of unique tokens in this training set; if it is below the cap, every token is kept. Words outside the top list will be treated as "unknown."

In [None]:
from collections import Counter

# Count frequency of each token in the training set
word_counts = Counter(token for review in train_tokens for token in review)
print(f"Total unique tokens in training data: {len(word_counts)}")

# Select most common words for the vocabulary
vocab_size = 10000  # limit vocab to top 10k
most_common = word_counts.most_common(vocab_size - 2)  # -2 to account for special tokens
# We will reserve indices 0 and 1 for special tokens (<PAD> and <UNK>)
word_to_idx = {"<PAD>": 0, "<UNK>": 1}
for i, (word, freq) in enumerate(most_common, start=2):
    word_to_idx[word] = i

print(f"Vocabulary size (including PAD/UNK): {len(word_to_idx)}")

: 

We added two special tokens: `<PAD>` for padding and `<UNK>` for any unknown/out-of-vocabulary word. We assigned `<PAD>` index 0 (we will use 0 for padding in sequences) and `<UNK>` index 1. Common words like "the" and "and" will get low indices since they appear frequently. For example, "the" might be index 2 and "and" index 3, depending on frequencies. This frequency-based indexing makes it easy to filter out rare words.
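The indexing scheme can be traced on a tiny corpus. This is a minimal sketch with made-up tokens and a vocabulary cap of 4 (both assumptions chosen to keep the example small); it uses the same `Counter.most_common` approach as the cell above.

```python
from collections import Counter

# Toy tokenized corpus (made up): 'the' x3, 'data' x2, 'model' x1, 'study' x1
toy_tokens = [
    ["the", "model", "the", "data"],
    ["the", "data", "study"],
]
counts = Counter(tok for doc in toy_tokens for tok in doc)

toy_vocab_size = 4  # two special tokens plus the two most frequent words
toy_word_to_idx = {"<PAD>": 0, "<UNK>": 1}
for i, (word, _) in enumerate(counts.most_common(toy_vocab_size - 2), start=2):
    toy_word_to_idx[word] = i

# 'study' falls outside the cap, so it maps to <UNK> (index 1)
encoded = [toy_word_to_idx.get(tok, 1) for tok in ["the", "study", "data"]]
print(toy_word_to_idx)  # {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'data': 3}
print(encoded)          # [2, 1, 3]
```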

## Encoding and Padding Sequences

With the vocabulary in hand, we encode each abstract's token list into a sequence of integers. Words not in `word_to_idx` (e.g., a rare word outside the top 10k) are mapped to `<UNK>` (index 1).

We also need to pad or truncate each sequence to a fixed length, because neural networks typically process batches of equal-length sequences. We'll pick a maximum sequence length of 250 tokens. This cutoff comes from IMDb-style review data, where the average review is around 230 words and most are under 500; for the abstracts used here it is worth checking the token-length distribution before settling on a value. Sequences longer than 250 tokens will be truncated, and shorter ones will be padded with `<PAD>` (zeros) at the end. (Padding at the end is common for LSTMs/Transformers; padding at the beginning is another option, and either works if handled consistently.)

Let's encode and pad.
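One way to sanity-check a cutoff such as 250 is to summarize the token lengths of the training set and see what fraction of sequences would fit without truncation. The helpers below are a hypothetical sketch (the names `length_summary` and `coverage` are assumptions, not part of the notebook), shown on toy data; in the notebook they would be called on `train_tokens`.

```python
import numpy as np

def length_summary(token_lists, percentiles=(50, 90, 95)):
    """Mean, max, and selected percentiles of sequence lengths."""
    lengths = np.array([len(tokens) for tokens in token_lists])
    summary = {"mean": float(lengths.mean()), "max": int(lengths.max())}
    for p in percentiles:
        summary[f"p{p}"] = float(np.percentile(lengths, p))
    return summary

def coverage(token_lists, max_length):
    """Fraction of sequences that fit within max_length without truncation."""
    lengths = np.array([len(tokens) for tokens in token_lists])
    return float((lengths <= max_length).mean())

# Toy token lists with lengths 2, 4, 6, and 8
toy_lists = [["w"] * n for n in (2, 4, 6, 8)]
toy_summary = length_summary(toy_lists)
toy_coverage = coverage(toy_lists, max_length=6)
print(toy_summary, toy_coverage)
```

If coverage at the chosen cutoff is low, a larger `max_length` (and the matching `max_len` in the model) may be worth the extra computation.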


In [None]:
max_length = 250

def encode_and_pad(tokens):
    # Encode tokens to indices
    indices = [word_to_idx.get(token, 1) for token in tokens]  # 1 is <UNK> for unknown
    # Truncate if longer than max_length
    if len(indices) > max_length:
        indices = indices[:max_length]
    # Pad with 0s if shorter than max_length
    if len(indices) < max_length:
        indices += [0] * (max_length - len(indices))
    return indices

# Encode all training and test sequences
train_sequences = [encode_and_pad(tok_list) for tok_list in train_tokens]
test_sequences  = [encode_and_pad(tok_list) for tok_list in test_tokens]

print("Example encoded review (first 20 indices):", train_sequences[0][:20])
print("Length of encoded review:", len(train_sequences[0]))


: 

Now we have `train_sequences` and `test_sequences`, which are lists of equal-length lists of integers (each of length 250). Each integer is an index in our vocabulary. For example, if "the" is index 2, you'll see 2 wherever "the" appeared in the text. Unknown words appear as 1, and padded positions are 0.

At this point, our data is ready to be fed into PyTorch models. We just need to wrap it in a PyTorch Dataset and DataLoader for convenient batching.

## Creating PyTorch Dataset and DataLoader

We'll create a custom Dataset to hold our encoded sequences and labels, and then use DataLoader to handle batching and shuffling.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class ProposalDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = [torch.tensor(seq, dtype=torch.long) for seq in sequences]
        self.labels    = torch.tensor(labels, dtype=torch.long)
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

# Create dataset instances
train_dataset = ProposalDataset(train_sequences, train_labels)
test_dataset  = ProposalDataset(test_sequences, test_labels)

# Create DataLoaders for batching
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

: 

We use shuffle=True for training data so that each epoch the model sees batches in a new order (which helps training). We keep shuffle=False for the test loader to evaluate in a fixed order (shuffling isn’t necessary for evaluation).

Now we're ready to define the model.
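To see what a DataLoader hands to the training loop, the sketch below batches ten toy sequences of length 5. It uses `TensorDataset` as a stand-in for the notebook's `ProposalDataset` (an assumption made to keep the snippet self-contained); the batching behavior is the same.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy sequences of length 5 (all padding) with alternating labels
toy_sequences = torch.zeros(10, 5, dtype=torch.long)
toy_labels = torch.tensor([0, 1] * 5)

toy_loader = DataLoader(TensorDataset(toy_sequences, toy_labels),
                        batch_size=4, shuffle=False)

# Each batch is (inputs, labels); the last batch holds the leftover examples
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(batch_shapes)
# [((4, 5), (4,)), ((4, 5), (4,)), ((2, 5), (2,))]
```

The smaller final batch is why the loops later divide by `len(train_loader)` (the number of batches) rather than assuming a fixed batch size.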

## Defining the Models
We'll implement the classifier as a PyTorch nn.Module. This notebook covers the Transformer encoder only: an Embedding layer (with added positional embeddings) followed by Transformer encoder layers and a final linear layer for classification.

The model outputs a prediction for each input abstract (binary classification: label 1 for OR, label 0 for TN). We will use a size-2 output (for classes 0 and 1) and train with CrossEntropyLoss, which expects raw logits of size 2; a softmax can be applied afterwards when probabilities are needed.
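The relationship between raw logits, softmax, and CrossEntropyLoss can be checked directly. In this minimal sketch the logits and labels are made-up numbers; it shows that `nn.CrossEntropyLoss` applied to raw logits matches a log-softmax followed by negative log-likelihood, so the model itself should not apply a softmax before the loss.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Two examples with size-2 logits (made-up values) and their true classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
labels = torch.tensor([0, 1])

# CrossEntropyLoss consumes raw logits...
loss = nn.CrossEntropyLoss()(logits, labels)
# ...and is equivalent to log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Softmax is only needed when probabilities are wanted; argmax is unchanged by it
probs = F.softmax(logits, dim=1)
preds = logits.argmax(dim=1)
print(loss.item(), manual.item(), preds.tolist())
```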

## Transformer Encoder Model


Transformers use self-attention to process all words in a sequence in parallel (not strictly left-to-right like an RNN). We will implement a Transformer encoder that produces contextualized embeddings for each position, then aggregate those into a single vector for classification. One common approach is to prepend a special [CLS] token and use its output embedding for classification (like BERT does), but for simplicity, we'll instead average the transformer outputs over the sequence.

One challenge is that self-attention on its own has no notion of token order, so we need to provide positional information. We'll add a positional embedding to the token embeddings.

Let's define a TransformerClassifier module:
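To illustrate why positional information is needed, the sketch below feeds an encoder layer a random sequence and a shuffled copy of it, with no positional embeddings added. The layer sizes are arbitrary assumptions for the demo. After mean pooling, the two results agree up to floating-point error, which means the model could not tell the two word orders apart.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4,
                                   dim_feedforward=32, batch_first=True)
layer.eval()  # disable dropout so both passes are deterministic

x = torch.randn(1, 6, 16)   # (batch, seq_len, embed_dim), no positional signal
perm = torch.randperm(6)    # a random reordering of the six positions

pooled = layer(x).mean(dim=1)
pooled_shuffled = layer(x[:, perm]).mean(dim=1)

# Without positional embeddings, pooling over a reordered sequence gives the same vector
order_blind = torch.allclose(pooled, pooled_shuffled, atol=1e-5)
print(order_blind)
```

Adding a position-dependent embedding to each token before the encoder breaks this symmetry, which is what `pos_embedding` does in the classifier.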

In [None]:
from torch import nn
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_heads=4, num_layers=2, ff_dim=256, max_len=250):
        super(TransformerClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.pos_embedding = nn.Embedding(max_len, embed_dim)  # learnable positional embeddings
        # Define a Transformer Encoder layer
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, 
                                                  dim_feedforward=ff_dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(embed_dim, 2)
    def forward(self, x):
        # x shape: (batch, seq_length)
        batch_size, seq_len = x.size()
        # Create position indices for each position in the sequence
        pos_indices = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, seq_len)
        # Get token embeddings and positional embeddings
        token_embeds = self.embedding(x)            # (batch, seq_len, embed_dim)
        pos_embeds = self.pos_embedding(pos_indices)  # (batch, seq_len, embed_dim)
        # Add token and position embeddings
        x_embedded = token_embeds + pos_embeds       # (batch, seq_len, embed_dim)
        # Create padding mask (True where padding token is present)
        pad_mask = (x == 0)  # shape: (batch, seq_len), True at padded indices
        # Pass through Transformer encoder
        enc_out = self.transformer(x_embedded, src_key_padding_mask=pad_mask)  # (batch, seq_len, embed_dim)
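        # Aggregate the encoder outputs with a mean over non-padded positions only
        keep = (~pad_mask).unsqueeze(-1).to(enc_out.dtype)  # (batch, seq_len, 1), 1.0 at real tokens
        seq_avg = (enc_out * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)  # (batch, embed_dim)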
        logits = self.fc(seq_avg)                   # (batch, 2)
        return logits


: 

Key points for the Transformer model:

- We use nn.TransformerEncoderLayer and stack two layers (you can adjust num_layers). We set batch_first=True so it expects input shape (batch, seq, embed).
- We use a small number of heads (4) and feed-forward dimension (256) for demonstration. In practice, you might use more heads or larger dimensions.
- We add positional embeddings to the word embeddings to give the model a sense of word order. These are learned embeddings for positions 0 to max_len-1.
- We create a pad_mask where positions with token index 0 (PAD) are marked True and pass it as src_key_padding_mask, so self-attention ignores those positions as keys and real tokens never attend to padding. The encoder still emits an output vector at every position, including padded ones.
- After the Transformer encoder, we average (mean) the output vectors across the sequence length to get a single vector per sequence. (Alternatively, one could use the output at the first position, use max-pooling, or a designated [CLS] token output. Averaging is a simple way to get a global representation.)
- Then we apply a linear layer to get class logits.
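The effect of the padding mask can be verified on a single encoder layer. In this sketch (layer sizes and inputs are arbitrary assumptions), the last two positions are marked as padding and their embeddings are then replaced with different random values; the outputs at the real positions do not change, because masked positions are never used as attention keys.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4,
                                   dim_feedforward=32, batch_first=True)
layer.eval()  # disable dropout for a deterministic comparison

x = torch.randn(1, 5, 16)
pad_mask = torch.tensor([[False, False, False, True, True]])  # True marks padding

# Change only the embeddings at the padded positions
x_changed = x.clone()
x_changed[:, 3:] = torch.randn(1, 2, 16)

out = layer(x, src_key_padding_mask=pad_mask)
out_changed = layer(x_changed, src_key_padding_mask=pad_mask)

# Outputs at the three real positions are unaffected by what sits in the padding
real_unchanged = torch.allclose(out[:, :3], out_changed[:, :3], atol=1e-5)
print(real_unchanged, tuple(out.shape))
```

The output still has one vector per position, padded ones included, so any pooling step has to decide how to treat those positions.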

The model is now defined.

## Training the Models


We will train the Transformer model from scratch on our dataset and then evaluate it on the held-out test set. We will use a simple training loop without any high-level libraries.

First, set up the training essentials: loss function, optimizer, and device (CPU/GPU). We'll use CrossEntropyLoss since it's a classification task with logits, and Adam as the optimizer. If a GPU is available, we'll use it for faster training.

In [None]:
import torch.optim as optim

# Select device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", device)

# Instantiate models
vocab_size = len(word_to_idx)
transformer_model = TransformerClassifier(vocab_size).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
trans_optimizer = optim.Adam(transformer_model.parameters(), lr=0.001)


: 

## Training the Transformer Model


In [None]:
num_epochs = 4
for epoch in range(num_epochs):
    transformer_model.train()
    total_loss = 0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        trans_optimizer.zero_grad()
        outputs = transformer_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        trans_optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Transformer Training loss: {avg_loss:.4f}")


: 

In [None]:
transformer_model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = transformer_model(inputs)
        preds = torch.argmax(outputs, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
accuracy = correct / total
print(f"Transformer Model Test Accuracy: {accuracy:.4f}")


: 
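Since the two states may contribute different numbers of abstracts, test accuracy is easier to interpret next to a majority-class baseline: the accuracy obtained by always predicting the more frequent label. The helper below is a hypothetical sketch (the name `majority_baseline_accuracy` is an assumption), shown on toy labels; in the notebook it would be called on `test_labels`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    most_frequent_count = counts.most_common(1)[0][1]
    return most_frequent_count / len(labels)

# Toy labels: three examples of class 1, one of class 0
toy_test_labels = [1, 1, 1, 0]
baseline = majority_baseline_accuracy(toy_test_labels)
print(f"Majority-class baseline accuracy: {baseline:.4f}")  # 0.7500
```

A model accuracy close to this baseline would suggest the classifier has learned little beyond the class proportions.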