<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/text_classification_advanced_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Text processing with linear and non-linear models



In [None]:
__author__ = "Pranava Madhyastha"
__version__ = "INM434/IN3045 City, University of London, Spring 2025"

# Importance of features

We will begin with experiments on features. But before we begin, let us install the `datasets` library from Hugging Face, which gives us access to a large number of datasets.

We will install the library locally (the installation is temporary and will disappear as soon as the session is restarted).

In [None]:
# preparation
!pip install datasets # we are installing huggingface datasets

# Loading the dataset and experimenting with features

For the following set of worked examples we will work with the sentiment portion of the tweet_eval dataset. Read the documentation for the dataset here: https://huggingface.co/datasets/tweet_eval.

The function `load_dataset` automatically loads the predefined train and test splits. This allows a proper comparison of models.

We will first begin with a linear model (logistic regression) and experiment with different types of features on the same dataset.

Please attempt the TODOs mentioned in the code below.

In [None]:
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Custom feature extraction function

# TODO - Experiments with features:
#   1. Experiment with different ngrams
#   2. Read the documentation for CountVectorizer - see what happens with the analyzer argument?
#   3. Try changing max_df and min_df
#   4. Try changing max_features
#   5. What happens when you binarise the features?
#   6. Experiment with stop words and observe their impact.
#   7. Experiment with lowercasing tokens
#   8. Advanced - can you write a new preprocessor function? It is passed to CountVectorizer.

def custom_feature_extraction(text, ngrams=3, compute=0, vectorizer=None):
    if compute:
        # Fit a new vectorizer on n-grams of length 1 up to `ngrams`
        vectorizer = CountVectorizer(ngram_range=(1, ngrams))
        features = vectorizer.fit_transform(text)
        return features, vectorizer
    else:
        features = vectorizer.transform(text)
        return features


# Load sentiment analysis dataset from Hugging Face
dataset = load_dataset('tweet_eval', 'sentiment')

# Extract input and output data
X = dataset['train']['text']
y = dataset['train']['label']

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)


# Further split training set into smaller training set and validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=123)


# Extract features using custom feature extraction function
X_train_feats, feat_extractor = custom_feature_extraction(X_train, compute=1)
X_val = custom_feature_extraction(X_val, compute=0, vectorizer=feat_extractor)
X_test = custom_feature_extraction(X_test, compute=0, vectorizer=feat_extractor)

# Train linear model with logistic regression
clf = LogisticRegression(verbose=True)
clf.fit(X_train_feats, y_train)

# Evaluate model on validation set
val_score = clf.score(X_val, y_val)
print('Validation score:', val_score)

# Evaluate model on test set
test_score = clf.score(X_test, y_test)
print('Test score:', test_score)

# Solutions

1. Experiment with different n-grams: set the `ngrams` argument to 1, 2 and 4 and observe the changes.
2. Experiment with different analyzers: modify `custom_feature_extraction` so that it accepts an `analyzer` argument. In scikit-learn's `CountVectorizer`, the `analyzer` parameter determines how the text is broken down into features, that is, the type of features extracted from the text. `CountVectorizer` offers three built-in analyzer options: `word`, `char` and `char_wb`. Refer to the documentation, experiment, and see how things change.
3. Again from the documentation: `max_df` ("maximum document frequency") filters out terms that appear in too many documents of the collection, and `min_df` is its counterpart for terms that appear in too few documents.
4. `max_features` requires similar modifications to the two above. Again, read the documentation: it limits the vocabulary to the most frequent terms.
5. Setting `binary=True` requires similar modifications; it converts counts to binary presence/absence features.
6. Stop word filtering (the `stop_words` parameter) removes stop words, which recur everywhere!
7. This is about case sensitivity: modify the main function (and hence the sub-functions) so that lowercasing can be switched on and off (the `lowercase` parameter).
8. Here is one that uses regular expressions from Lab 1:

```
import re

def custom_preprocessor(text):
    """
    Custom preprocessor to clean tweets: removes URLs, removes user mentions
    (@username), replaces some special characters with spaces, and normalises
    whitespace. These choices come from looking at a few samples.
    """
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove user mentions
    text = re.sub(r'@\w+', '', text)

    # Replace a special character with a space unless it is directly followed
    # by a word character or whitespace (negative lookahead (?![\w\s])).
    # Note: an emoticon such as ':)' is not preserved by this pattern (its ':'
    # is replaced), so adapt the pattern if emoticons must be kept intact.
    text = re.sub(r'[^\w\s](?![\w\s])', ' ', text)

    # Normalise whitespace (replace runs of whitespace with a single space)
    text = re.sub(r'\s+', ' ', text).strip()

    return text
```
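To see the effect of the options discussed in the solutions above, here is a minimal sketch on a made-up toy corpus (not tweet_eval). The helper names `vocab_size` and `strip_mentions` are hypothetical and exist only for this illustration. Note that passing a `preprocessor` to `CountVectorizer` overrides its built-in lowercasing step, so the preprocessor here lowercases the text itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus made up for illustration (not taken from tweet_eval)
toy_corpus = [
    "I love this phone",
    "I hate this phone",
    "love love love it",
]

def strip_mentions(text):
    """Hypothetical preprocessor: drop @mentions and lowercase the text."""
    return " ".join(tok for tok in text.split() if not tok.startswith("@")).lower()

def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    vectorizer = CountVectorizer(**kwargs)
    vectorizer.fit(toy_corpus)
    return len(vectorizer.vocabulary_)

sizes = {
    "unigrams": vocab_size(ngram_range=(1, 1)),
    "uni+bigrams": vocab_size(ngram_range=(1, 2)),
    "char_wb 2-3 grams": vocab_size(analyzer="char_wb", ngram_range=(2, 3)),
    "max_features=3": vocab_size(max_features=3),
    "english stop words": vocab_size(stop_words="english"),
    "custom preprocessor": vocab_size(preprocessor=strip_mentions),
}
for name, size in sizes.items():
    print(f"{name}: {size}")

# binary=True records presence/absence instead of counts
count_features = CountVectorizer().fit_transform(toy_corpus)
binary_features = CountVectorizer(binary=True).fit_transform(toy_corpus)
print("largest count:", count_features.max(), "largest binary value:", binary_features.max())
```

On the real data, the same keyword arguments could be forwarded from `custom_feature_extraction` to `CountVectorizer`, which is one way to approach TODOs 2 to 8.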




# Plot the learning curves

Let us now see how the learning curves vary with different design choices of the features.


In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


X_train_ngrams_1, feat_extractor = custom_feature_extraction(X_train, ngrams=1, compute=1)
X_train_ngrams_2, feat_extractor = custom_feature_extraction(X_train, ngrams=2, compute=1)
X_train_ngrams_3, feat_extractor = custom_feature_extraction(X_train, ngrams=3, compute=1)
X_train_ngrams_4, feat_extractor = custom_feature_extraction(X_train, ngrams=4, compute=1)



# Calculate learning curve
train_sizes_1, train_scores_1, cv_scores_1 = learning_curve(clf, X_train_ngrams_1, y_train, cv=5, train_sizes=[0.1, 0.3, 0.5, 0.7, 0.9])
#train_sizes_2, train_scores_2, cv_scores_2 = learning_curve(clf, X_train_ngrams_2, y_train, cv=5, train_sizes=[0.1, 0.3, 0.5, 0.7, 0.9])
#train_sizes_3, train_scores_3, cv_scores_3 = learning_curve(clf, X_train_ngrams_3, y_train, cv=5, train_sizes=[0.1, 0.3, 0.5, 0.7, 0.9])


plt.figure()

plt.plot(train_sizes_1, cv_scores_1.mean(axis=1), label='Cross-validation score for clf-1')
#plt.plot(train_sizes_2, cv_scores_2.mean(axis=1), label='Cross-validation score for clf-2')


plt.title('Learning curve')
plt.xlabel('Training set size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
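As a rough sketch of how several feature designs could be compared in one loop, the example below computes learning curves for two n-gram settings. It is not the notebook's own setup: it uses a tiny made-up corpus, and it wraps the vectorizer and classifier in a scikit-learn `Pipeline` (an assumption of this sketch) so that the vocabulary is refitted inside each cross-validation fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, repeated to get enough samples for cross-validation
positive = ["great phone", "love this film", "really good day", "so happy today"]
negative = ["awful phone", "hate this film", "really bad day", "so sad today"]
texts = (positive + negative) * 10
labels = ([1] * 4 + [0] * 4) * 10

curves = {}
for n in (1, 2):
    # One pipeline per feature design: n-grams of length 1 up to n
    pipe = make_pipeline(
        CountVectorizer(ngram_range=(1, n)),
        LogisticRegression(max_iter=1000),
    )
    sizes, train_scores, cv_scores = learning_curve(
        pipe, texts, labels, cv=4,
        train_sizes=[0.25, 0.5, 1.0],
        shuffle=True, random_state=0,  # shuffle so small subsets contain both classes
    )
    curves[n] = (sizes, cv_scores.mean(axis=1))

for n, (sizes, mean_cv) in curves.items():
    print(f"ngrams=1..{n}: sizes={list(sizes)}, mean CV accuracy={mean_cv.round(3).tolist()}")
```

Each entry of `curves` can then be passed to `plt.plot(sizes, mean_cv, label=...)` in the same way as the single curve plotted above.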


# TODO
1. Plot all the variants of different feature designs.
2. Experiment with SVM in place of Logistic Regression.
3. What is the loss for SVM?
4. Test your classifiers on a different sentiment dataset from Hugging Face.




1. I will leave this as an exercise.
2. For this, import the relevant classes with `from sklearn.svm import SVC, LinearSVC`. We will try a simple linear SVM; only the classifier variable from above changes: `clf = LinearSVC(random_state=123, max_iter=2000, dual="auto")`.
3. We saw this briefly in class: it is the max-margin (hinge) loss. Note that `LinearSVC` uses the squared hinge loss by default.
4. I leave this as an assignment; the `datasets` library should allow you to experiment with other datasets easily.
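To make the answer to question 3 concrete, here is a small sketch of the hinge loss and its squared variant for binary labels in {-1, +1}. The function names and the example scores are made up for illustration; this is not how scikit-learn computes the loss internally.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss: max(0, 1 - y * score), for labels in {-1, +1}."""
    losses = [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]
    return sum(losses) / len(losses)

def squared_hinge_loss(y_true, scores):
    """Mean squared hinge loss: max(0, 1 - y * score) ** 2."""
    losses = [max(0.0, 1.0 - y * s) ** 2 for y, s in zip(y_true, scores)]
    return sum(losses) / len(losses)

# Made-up labels and raw decision scores
y_true = [1, -1, 1]
scores = [2.0, -0.5, -1.0]

# First item: correct with margin >= 1, so no loss.
# Second item: correct but inside the margin, so a small loss.
# Third item: wrong side of the boundary, so a large loss.
print(hinge_loss(y_true, scores))
print(squared_hinge_loss(y_true, scores))
```

The key property is that a correctly classified example still incurs a loss when it lies inside the margin, which is what pushes the decision boundary away from the training points.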


# Non-linear models

We will only cover the multi-layer perceptron (a feed-forward model) in this lab.

We will first make use of scikit-learn's MLP classifier.

Let us build a simple feed-forward model with two hidden layers and ReLU non-linearity.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from datasets import load_dataset



# Split the dataset into training and testing sets (validation splits)
# It is always a good idea to have validation set - this is where we will tune the hyperparameters.
train_texts, val_texts, train_labels, val_labels = train_test_split(dataset['train']['text'], dataset['train']['label'], test_size=0.2, random_state=42)

# One-hot encode the text using CountVectorizer and calculate the mean pooling
vectorizer = CountVectorizer(binary=True)
train_data = vectorizer.fit_transform(train_texts)
train_data = train_data.multiply(1 / train_data.sum(axis=1)) # what is happening here?

val_data = vectorizer.transform(val_texts)
val_data = val_data.multiply(1 / val_data.sum(axis=1))

# Build the neural network model with two hidden layers
model = MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu', max_iter=5, verbose=True)

# Train the model

model.fit(train_data, train_labels)
train_acc = accuracy_score(train_labels, model.predict(train_data))
val_acc = accuracy_score(val_labels, model.predict(val_data))

print(train_acc, val_acc)
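If the row normalisation step in the cell above (`train_data.multiply(1 / train_data.sum(axis=1))`) is unclear, the following sketch applies the same operation to a tiny made-up binary matrix. Each row is divided by its number of active features, so a long text and a short text end up with vectors on the same scale.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two made-up "documents" over a 4-word vocabulary (binary presence features)
counts = csr_matrix(np.array([[1, 1, 0, 0],
                              [1, 1, 1, 1]], dtype=float))

# Number of active features per document, as a column vector of shape (2, 1)
row_sums = np.asarray(counts.sum(axis=1))

# Scale every row by 1 / (its sum); the column vector broadcasts across columns
normalised = counts.multiply(1 / row_sums).toarray()

print(normalised)
print(normalised.sum(axis=1))  # every non-empty row now sums to 1
```

Be aware that a document with no known words has a row sum of zero, which triggers a division-by-zero warning in `1 / row_sums`; filtering out empty rows first is one way to avoid it.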

# TODO

1. Evaluate the new model on the original test set.
2. Is it taking a long time to train? Why?
3. Can you experiment with different features?
4. Is this model better?
5. What would you change? (the hyperparameters) - how are these different from "feature combination" based design choices?
6. Can you perform model comparison with hypothesis testing?

1. Here is some sample code:  
```
test_data = vectorizer.transform(dataset['test']['text'])
test_data = test_data.multiply(1 / test_data.sum(axis=1))
test_acc = accuracy_score(dataset['test']['label'], model.predict(test_data))
print(f"Test accuracy: {test_acc:.4f}")
```
2. There are many reasons, but the dimensionality is likely the biggest factor: neural networks can be slow with high-dimensional sparse text features.

3. Try a few of the feature engineering variants that you explored in exercises 1 and 2.

4. To answer this, compare the accuracy of the models on the same test split. Also consider the trade-offs in training time and inference speed, and examine the error patterns.

5. You could do several things here:

```
hidden_layer_sizes - try different architectures (more/fewer layers, different neuron counts)
activation - try 'tanh' or 'logistic' instead of 'relu'
alpha - L2 regularization strength
learning_rate and learning_rate_init - control how weights are updated
batch_size - could impact convergence speed
max_iter - definitely increase from just 5
solver - try 'sgd' or 'lbfgs' instead of the default 'adam'
```

6. Consider hypothesis testing to compare the outputs of the models. If the scores are similar, tests such as McNemar's test or the Wilcoxon signed-rank test can help establish whether the difference between the models is statistically significant.
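As an illustration of the idea, here is a minimal sketch of the exact (binomial) form of McNemar's test on paired predictions. The function name `mcnemar_exact` and the example predictions are made up for this sketch; in practice you would pass the test labels and the predictions of your two classifiers.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A gets right
    and c counts items only model B gets right.
    """
    b = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a == y and p != y)
    c = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a != y and p == y)
    n = b + c
    if n == 0:
        # The models never disagree in correctness: no evidence of a difference
        return b, c, 1.0
    # Two-sided p-value under the null hypothesis that b ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Made-up example: model B fixes five of A's errors and introduces one
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
pred_a = [1, 1, 0, 0, 0, 0, 0, 0]
pred_b = [1, 0, 1, 1, 1, 1, 1, 0]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)  # 1 5 0.21875
```

Only the items on which the two models disagree in correctness enter the test, which is why it suits comparing two classifiers evaluated on the same test split.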




# Using pytorch + GPU acceleration

The code below reimplements the non-linear model using PyTorch. We will be using a GPU; to set this up, change the runtime type and select a runtime with a standard GPU.

Try going back to CPU. Increase the batch size.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset



# This line of code assigns the device that will be used for running of the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a custom dataset class
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, vocab):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Encode the text as a binary bag-of-words vector, normalised by its number of active features
        text = self.vocab.transform([self.texts[idx]])
        text = text.multiply(1 / text.sum(axis=1))

        # Convert to a dense tensor; the mean over the single row only drops the leading dimension
        text = torch.tensor(text.todense()[0])
        text = torch.mean(text, dim=0)

        # Convert the label to a tensor
        label = torch.tensor(self.labels[idx])

        return text, label

# Define a custom neural network model
class SentimentModel(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(SentimentModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.fc3 = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# One-hot encode the texts using CountVectorizer
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(train_texts)

# Define the dataset and data loader for training
train_dataset = SentimentDataset(train_texts, train_labels, vectorizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Define the dataset and data loader for validation
val_dataset = SentimentDataset(val_texts, val_labels, vectorizer)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Define the model, loss function, and optimizer
model = SentimentModel(input_size=len(vectorizer.vocabulary_), hidden_size1=100, hidden_size2=50, output_size=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(5):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs.float().to(device))
        loss = criterion(outputs, labels.to(device))
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 100 == 99:
            print(f"[Epoch {epoch+1}, Batch {i+1}] Loss: {running_loss/100:.3f}")
            running_loss = 0.0

# Evaluate the model on the validation set
model.eval()
total_correct = 0
total_samples = 0
with torch.no_grad():
    for data in val_loader:
        inputs, labels = data
        outputs = model(inputs.float().to(device))
        _, predicted = torch.max(outputs.data, 1)
        total_samples += labels.size(0)
        total_correct += (predicted == labels.to(device)).sum().item()

print(f"Accuracy on validation set: {(total_correct/total_samples)*100:.2f}%")




# TODO

1. Experiment with different activation functions.
2. Apply the classifier to different sentiment datasets from Hugging Face. What did you notice?
3. Make the classifiers comparable (similar scores) by tuning hyperparameters.
4. Are the features still sub-optimal?

1. Please try running the code with each of these functions: `['relu', 'leaky_relu', 'tanh', 'sigmoid', 'elu']`. Remember that we already have `relu` in our code (hint: `F.relu`).
- ReLU (Rectified Linear Unit): The default choice in many neural networks that sets negative values to zero
- Leaky ReLU: A variant of ReLU that allows a small gradient for negative values
- Tanh: Hyperbolic tangent that outputs values between -1 and 1
- Sigmoid: Outputs values between 0 and 1, historically used in neural networks
- ELU (Exponential Linear Unit): Similar to ReLU but with smooth negative values (similar activation functions to ELU are used in large language models)

You should see something similar to the following behaviours:
- ReLU and Leaky ReLU typically converge faster (this may not happen in all cases).
- Tanh often provides better accuracy for sentiment analysis tasks but may take a while to converge.
- Sigmoid tends to perform worse, especially as you add layers -- try this with your other hyperparameters.
- ELU sometimes provides a good balance between fast convergence and good performance
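For reference, the five activation functions listed above can be written out in a few lines of plain Python. This is only a sketch of the formulas on scalars; the default values `negative_slope=0.01` and `alpha=1.0` are chosen to match PyTorch's defaults. In the model itself you would swap `F.relu` for `F.leaky_relu`, `torch.tanh`, `torch.sigmoid` or `F.elu`.

```python
import math

def relu(x):
    """Set negative values to zero."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but keep a small slope for negative values."""
    return x if x >= 0 else negative_slope * x

def sigmoid(x):
    """Squash the input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def elu(x, alpha=1.0):
    """Identity for positive values, smooth exponential curve for negative ones."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

activations = {
    "relu": relu,
    "leaky_relu": leaky_relu,
    "tanh": math.tanh,  # output between -1 and 1
    "sigmoid": sigmoid,
    "elu": elu,
}

# Compare how each function treats a negative, a zero and a positive input
for name, fn in activations.items():
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Looking at the outputs for the negative input shows the main difference between the variants: ReLU discards it, Leaky ReLU and ELU keep a small signal, and tanh and sigmoid saturate.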

3. This depends on your hyperparameter combinations.

4. We are still using the features from exercises 1 and 2, and they are probably not the best. We will learn later about faster and simpler ways of training deep neural models.
