# Practical 3: Text Classification

In the previous practicals we created some IMDB movie review data for sentiment analysis and explored several text pre-processing and representation methods. By now you should have pre-processed both the reviews you scraped from IMDB and the full 50,000-review dataset, so we are ready to train a model to classify the sentiment of our movie reviews! We will explore several supervised and unsupervised approaches, using the existing movie review dataset for training and keeping ours as an additional test set.

In the first part of this practical we will explore two supervised classification algorithms, Naive Bayes and an Artificial Neural Network (ANN).

In the second part of this practical we will look at two unsupervised approaches: K-means clustering and semantic analysis using word embeddings.

The objectives of this practical are:

1. Apply a complete NLP workflow for text classification

2. Understand the probabilistic Naive Bayes classifier and consider different aspects of applying an ANN to text data

3. Consider appropriate representations for unsupervised text classification, including clustering and semantic analysis with word embeddings

# 1 Supervised Text Classification

## 1.0 Import libraries

1. [TensorFlow](https://www.tensorflow.org/) is a powerful Python library for machine learning.

2. [Keras](https://keras.io/) is a simple API for building machine learning models and is built into TensorFlow 2.

In [None]:
import os
import spacy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras import models, layers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
%matplotlib inline

# Set the directory to the data folder
data_dir = os.path.join('..', 'data', 'imdb')

# spaCy also needs its language model to be installed
# If you receive an error, uncomment the following line and re-run the cell
# !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

## 1.1 Load and pre-process data

1. First load the full IMDB dataset and our smaller reviews set.

2. Then we need to convert the 'positive' and 'negative' class labels to numerical values, 1 for positive and 0 for negative. Using the pandas `get_dummies` function creates two binary valued columns and then the `drop_first` parameter collapses these into a single column.

3. The next cell plots the distribution of review sentiment for the dataset and our reviews. As you can see, the classes are perfectly balanced within the dataset, but are they in your data?
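
As a small illustration of step 2, the sketch below applies `get_dummies` to a made-up four-row frame (the `toy` data is hypothetical and not part of the practical's datasets). Without `drop_first` there is one indicator column per class; with it, only the `positive` column remains and can serve as the binary label. The `astype(int)` call is included because some pandas versions return boolean rather than integer indicator columns.

```python
import pandas as pd

# Hypothetical toy labels, for illustration only
toy = pd.DataFrame({'sentiment': ['positive', 'negative', 'negative', 'positive']})

# One indicator column per class, ordered alphabetically
dummies = pd.get_dummies(toy['sentiment'])

# drop_first=True removes the first column ('negative'), leaving a single binary column
binary = pd.get_dummies(toy['sentiment'], drop_first=True)

# Convert to integers so the labels are 1 for positive and 0 for negative
labels = binary['positive'].astype(int).tolist()
print(list(dummies.columns), list(binary.columns), labels)
```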

In [None]:
# Load the imdb dataset
imdb_data = pd.read_csv(os.path.join(data_dir, 'imdb_dataset.csv'))

# Load your imdb reviews
imdb_reviews = pd.read_csv(os.path.join(data_dir, 'imdb_reviews.csv'), index_col=0)
# Just keep the review text and the sentiment columns
imdb_reviews = imdb_reviews[['review', 'sentiment']]

# Convert the sentiment to a binary value
imdb_data['sentiment'] = pd.get_dummies(imdb_data['sentiment'], drop_first=True)
imdb_reviews['sentiment'] = pd.get_dummies(imdb_reviews['sentiment'], drop_first=True)

imdb_data.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax1 = sns.countplot(imdb_data, x='sentiment', ax=ax[0])
ax1 = sns.countplot(imdb_reviews, x='sentiment', ax=ax[1])

### Load the vocabulary

Last week you should have created a vocabulary from the larger IMDB dataset, so let's load that and set the vocabulary size accordingly. We will also add some special padding and unknown tokens for use later.

In [None]:
# Load the vocabulary file and store each word in a list
with open(os.path.join(data_dir, 'imdb_vocab.txt'), 'r') as file:
    imdb_vocab = file.read().splitlines() 
    
# Set the vocab size
vocab_size = len(imdb_vocab)

# Print the vocabulary
print("Vocabulary size: " + str(vocab_size))
for i, word in enumerate(imdb_vocab[:50]):
    print(f'({str(i)}, {word})', end=' ')

### Process and vectorise the text

1. We will use sklearn's `CountVectorizer()` to tokenise the text and vectorise each review into a BOW.

2. Once the text is vectorised, split into training and validation sets.

In [None]:
# Create a CountVectorizer
bow_vectoriser = CountVectorizer(max_features=vocab_size)

# Vectorise the text
X = bow_vectoriser.fit_transform(imdb_data['review']).toarray()
print('Shape of X:', X.shape)
print(X[:5, :])

# Get the class labels
y = imdb_data['sentiment'].values
print('Shape of y:', y.shape)
print(y[:5])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## 1.2 Naive Bayes

Naïve Bayes is a generative classification algorithm which estimates the probability of a class based on prior knowledge (examples) of similar events. It is naïve because it assumes that the features are independent of each other (they do not affect each other) given the class, so the probability of each feature can be calculated independently.

The class defined below implements the following formulation of the algorithm:

$ \hat{y} = \underset{y}{\operatorname{argmax}} \left( \log P(y) + \sum_{i=1}^{n} \log P(x_i|y) \right) $

Where:

$ \hat{y} = $ the predicted label

$ P(y) = $ the probability of class y

$ P(x_i|y) = $ the probability that feature i in x occurs, given class y

By default the algorithm uses:
- a Laplace smoothing parameter `alpha=1.0`, which prevents zero likelihoods (and therefore `log(0)`) for words that do not appear in the training data for a given class.
- and `use_log=True`, to calculate probabilities in log space, which is numerically more stable.

<div class = "alert alert-block alert-info"><b>Note:</b> For a thorough discussion of Naïve Bayes, including smoothing, logs and several Python implementations see <a href='https://sidsite.com/posts/implementing-naive-bayes-in-python/'> this </a>.
</div>
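
The two defaults above can be illustrated with a few lines of NumPy. This is only a sketch with made-up numbers: the `counts` matrix and the vector `p` are hypothetical and are not taken from the IMDB data. It shows that smoothing removes zero likelihoods, and that multiplying many small probabilities underflows to zero while summing their logs does not.

```python
import numpy as np

# Hypothetical feature counts: rows are classes, columns are features
counts = np.array([[3, 0, 1],
                   [0, 2, 2]])
alpha = 1.0

# Without smoothing, features unseen in a class get a likelihood of exactly 0
unsmoothed = counts / counts.sum(axis=1, keepdims=True)

# With Laplace smoothing, every likelihood is positive and each row still sums to 1
smoothed = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)

# Multiplying 1000 small probabilities underflows, summing their logs stays finite
p = np.full(1000, 0.01)
product = np.prod(p)
log_sum = np.sum(np.log(p))

print(unsmoothed[0], smoothed[0])
print(product, log_sum)
```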

In [None]:
class NaiveBayes():
    """Naive Bayes classifier for categorical data."""

    def __init__(self, alpha=1.0, use_log=True):
        """ Arguments:
                alpha: Laplace smoothing parameter.
                use_log: Use log probabilities to avoid underflow.
        """
        self.alpha = alpha # Smoothing parameter. Prevents zero likelihoods for features that never occur in a class.
        self.use_log = use_log # Use log probabilities to avoid underflow.
        self.prior = None # The prior (mu) distribution of class labels. The probability of each class, P(class) within the training data.
        self.multinomial = None # The multinomial distribution (phi) is the probability/likelihood of each feature conditioned on the class, P(feature | class).


    def fit(self, X, y):
        """Fit training data for Naive Bayes classifier."""

        # N is the number of examples
        N = X.shape[0]

        # Calculate prior
        # Split the input array into sub-arrays depending on class label
        # (a plain list keeps one 2D array per class, even when the classes are the same size)
        X_by_class = [X[y == class_lbl] for class_lbl in np.unique(y)]
        
        # Count the number of examples in each class and divide by total number of examples
        self.prior = np.array([X_class.shape[0] / N for X_class in X_by_class])
        assert len(self.prior) == len(np.unique(y)), 'Number of priors should equal number of classes'

        # Calculate multinomial coefficients
        # Create array of shape (num_classes, num_features) to hold multinomial coefficients
        self.multinomial = np.zeros((len(np.unique(y)), X.shape[1]))

        for class_lbl in range(len(self.prior)):

            # Count the number of times each feature appears in all examples of a particular class + alpha
            class_feature_counts = X_by_class[class_lbl].sum(axis=0) + self.alpha

            # Probability of each feature given the class
            # Individual feature counts divided by the total number of times all features appear in the class
            self.multinomial[class_lbl] = class_feature_counts / class_feature_counts.sum()
        
        # Convert to log probabilities
        if self.use_log:
            self.prior = np.log(self.prior)
            self.multinomial = np.log(self.multinomial)
        return self

    def predict_proba(self, X):
        """Predict probability of class for each input example."""

        # Create array of shape (num_examples, num_classes) to store class probabilities (posterior) for each example
        class_probabilities = np.zeros(shape=(X.shape[0], self.prior.shape[0]))

        # Loop over each example and calculate individual conditional likelihoods for each class,
        # then multiply them all together (the product), and multiply by the class priors,
        # or in log space, add them all together (the sum), and add the class priors.
        for i, example in enumerate(X):
            example_likelihood = []

            # Loop over each class
            for class_lbl in range(len(self.prior)):
                feature_likelihood = []

                # Loop over each feature
                for feature in range(example.shape[0]):
                    # If the feature is present in the example
                    if example[feature] > 0:
                        # Look up the (log) probability of the feature given the class
                        mn_coefficient = self.multinomial[class_lbl][feature]
                        # A feature that occurs c times contributes P ** c,
                        # which in log space becomes c * log(P)
                        if self.use_log:
                            feature_likelihood.append(mn_coefficient * example[feature])
                        else:
                            feature_likelihood.append(mn_coefficient ** example[feature])

                # Append the probabilities of this class for this example
                example_likelihood.append(feature_likelihood)

            # Calculate joint probabilities
            # Multiply (or sum) all the individual feature probabilities together and multiply by (or add) class priors
            if self.use_log:
                class_probabilities[i] = np.asarray(example_likelihood).sum(axis=1) + self.prior
            else:
                class_probabilities[i] = np.asarray(example_likelihood).prod(axis=1) * self.prior
        
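        # Normalise so probabilities sum to 1
        if self.use_log:
            # Leave log space first, subtracting the row maximum to avoid underflow
            class_probabilities = np.exp(class_probabilities - class_probabilities.max(axis=1, keepdims=True))
        class_probabilities = class_probabilities / class_probabilities.sum(axis=1, keepdims=True)
        assert np.allclose(class_probabilities.sum(axis=1), 1), 'Rows should sum to 1'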

        return class_probabilities

    def predict(self, X):
        """Predict class with highest probability."""
        return self.predict_proba(X).argmax(axis=1)

### Train and evaluate the model

In [None]:
# Create and train a Naive Bayes classifier
nb = NaiveBayes()
nb.fit(X_train, y_train)

# Predict class labels for validation set
predictions = nb.predict(X_val)
print('Accuracy:', accuracy_score(y_val, predictions))

# Print confusion matrix
conf_matrix = ConfusionMatrixDisplay.from_predictions(y_val, predictions, display_labels=['negative', 'positive',], colorbar=False)
plt.show()

### Evaluate the model on your IMDB reviews

In [None]:
# Vectorise the text
X_test = bow_vectoriser.transform(imdb_reviews['review']).toarray()

# Get the class labels
y_test = imdb_reviews['sentiment'].values

# Predict class labels for test set
predictions = nb.predict(X_test)
print('Test Accuracy:', accuracy_score(y_test, predictions))

# Print confusion matrix
conf_matrix = ConfusionMatrixDisplay.from_predictions(y_test, predictions, display_labels=['negative', 'positive',], colorbar=False)
plt.show()


## 1.3 Exercise: Understanding Naive Bayes and evaluating pre-processing

1. Take some time to examine the code in the `NaiveBayes()` class defined above. You should ensure you understand what is happening in the `fit()` and `predict_proba()` methods and how that relates to the equation for the Naive Bayes algorithm.

2. We also used the datasets that were already pre-processed. However, the `CountVectorizer()` has some built-in pre-processing methods to strip accents, lowercase and remove stop words. Try loading the unprocessed datasets and experimenting with the built-in options. Does your pre-processed data result in an improvement?

3. Sklearn also has several implementations of Naive Bayes. Try comparing the implementation above to the `MultinomialNB()`.


## 1.4 Artificial Neural Network

ANNs are discriminative classification algorithms which learn a decision boundary to separate the classes and select appropriate labels based on the input features.

The cell below creates a Keras sequential model:
- The `input_shape` of the input layer is the size of the vocabulary, because each word is a feature.

- It uses two hidden layers with relu activation and intermediate Dropout layers to prevent overfitting.

- The output layer only needs one node, because this is a binary classification problem.

- Finally, we compile using the adam optimiser and binary cross-entropy loss.
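
To give a feel for what the binary cross-entropy loss measures, here is a minimal NumPy sketch. It is not the Keras implementation: the function name `binary_cross_entropy`, the clipping constant `eps` and the example arrays are all assumptions made for illustration. The loss is low when the predicted probabilities are confident and correct, and equals ln(2) when the model outputs 0.5 for everything.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    # Clip so that log(0) can never occur (eps is an assumed value)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Hypothetical labels and sigmoid outputs
y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])
unsure = np.array([0.5, 0.5, 0.5, 0.5])

confident_loss = binary_cross_entropy(y_true, confident)
unsure_loss = binary_cross_entropy(y_true, unsure)
print(confident_loss, unsure_loss)
```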

In [None]:
# Create a sequential model
model = models.Sequential()

# Input layer
model.add(layers.Dense(40, activation="relu", input_shape=(vocab_size, )))
# Hidden layers
model.add(layers.Dropout(0.3))
model.add(layers.Dense(20, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(10, activation="relu"))
# Output layer
model.add(layers.Dense(1, activation="sigmoid"))

# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

### Train and evaluate the model

Now train the model for a few epochs, then evaluate it on the validation set and on your own reviews (the test set). You should see an improvement over the Naive Bayes model. You can compare your model's performance to the 'state-of-the-art' listed [here](https://paperswithcode.com/sota/sentiment-analysis-on-imdb).

In [None]:
# Fit the model
results = model.fit(X_train, y_train, epochs=2, batch_size=500, validation_data=(X_val, y_val))
print("Validation Accuracy:", round(results.history["val_accuracy"][-1], 3))

# Predict class labels for test set
predictions = model.predict(X_test)
predictions = [0 if x < 0.5 else 1 for x in predictions] # Convert probabilities to binary
print('Test Accuracy:', accuracy_score(y_test, predictions))

# Print confusion matrix
conf_matrix = ConfusionMatrixDisplay.from_predictions(y_test, predictions, display_labels=['negative', 'positive',], colorbar=False)
plt.show()

## 1.5 Exercise: Different model parameters and input representations

1. Experiment with different numbers of nodes within the ANN layers, numbers of layers, dropout probabilities and optimisation algorithms to see what impact they have on model performance.

2. You may have also noticed that currently the model is only trained for two epochs. Try a different number and observe the effect.

3. This model works well with BOW inputs. Try TF-IDF to see if it improves performance.

# 2 Unsupervised Text Classification

## 2.0 Import libraries

In [None]:
import os
import spacy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
from sklearn.metrics.pairwise import cosine_similarity
%matplotlib inline

# Set the directory to the data folder
data_dir = os.path.join('..', 'data', 'imdb')

## 2.1 Load and pre-process data

In [None]:
# Load the imdb dataset
imdb_data = pd.read_csv(os.path.join(data_dir, 'imdb_dataset.csv'))

# Load your imdb reviews
imdb_reviews = pd.read_csv(os.path.join(data_dir, 'imdb_reviews.csv'), index_col=0)
# Just keep the review text and the sentiment columns
imdb_reviews = imdb_reviews[['review', 'sentiment']]

# Convert the sentiment to a binary value
imdb_data['sentiment'] = pd.get_dummies(imdb_data['sentiment'], drop_first=True)
imdb_reviews['sentiment'] = pd.get_dummies(imdb_reviews['sentiment'], drop_first=True)

imdb_data.head()

### Process and vectorise the text

Unlike our application of Naive Bayes and the ANN, for clustering we need to change our input representation, and there are two things to consider:
- **Zero values** - in Naive Bayes these simply result in low probability and in an ANN inputs of 0 will not impact weight updates.
- **Common words** - these have less impact on the probabilistic and ANN approach, but for clustering, since we are calculating the distance/similarity between examples, lots of common but uninformative words will tend to make examples appear closer/more similar.

1. We will use sklearn's `TfidfVectorizer()` to tokenise the text and vectorise each review into TF-IDF values. This should reduce the weight of very common words, which are unlikely to provide sentiment information. We will also use unigrams *and* bigrams to better capture the relationships between word pairs.

2. To avoid zeros within the input, and enhance important words, we will also scale the TF-IDF values.

3. Next use PCA to project the data onto its most important components. Here we use 2 so that the clusters can be plotted in 2D.

4. Finally, split into training and validation sets.

In [None]:
# Create a TfidfVectorizer, StandardScaler and PCA
# (vocab_size is the vocabulary size set in section 1.1)
tfidf_vectoriser = TfidfVectorizer(max_features=vocab_size, ngram_range=(1, 2))
scaler = StandardScaler()
pca = PCA(n_components=2)

# Vectorise the text, scale it and apply PCA
X = tfidf_vectoriser.fit_transform(imdb_data['review']).toarray()
X = scaler.fit_transform(X)
X = pca.fit_transform(X)
print('Shape of X:', X.shape)
print(X[:5, :])

# Get the class labels
y = imdb_data['sentiment'].values

print('Shape of y:', y.shape)
print(y[:5])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## 2.2 K-means

K-means is a clustering algorithm which defines a centroid (mid-point) for each cluster.

Examples are assigned to the closest cluster/centroid. Centroids are repeatedly moved to a new mid-point of all examples in the cluster, until no further changes are made.

Pseudocode:
```
Set K number of centroids and randomly assign to examples
Set converged = False

WHILE not converged:
	For each example i:
		For each cluster k:
			Calculate distance(i, k)
		Assign i to cluster with smallest distance
	
	IF no examples moved cluster:
		converged = True
	ELSE:
		For each cluster k:
			SET new cluster centroid = average position of examples in cluster
```
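
The pseudocode above can be sketched in NumPy as follows. This is a minimal illustration and not the sklearn `KMeans` implementation used below: the function name `kmeans_sketch`, the `max_iter` safety cap, the choice of initial centroids and the toy data are all assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Minimal K-means: returns (centroids, cluster index for each example)."""
    rng = np.random.default_rng(seed)
    # Use k distinct, randomly chosen examples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)

    for _ in range(max_iter):
        # Distance from every example to every centroid, shape (n_examples, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)

        # Converged when no example changes cluster
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Move each centroid to the mean position of its examples
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return centroids, assignments

# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, assignments = kmeans_sketch(X_toy, k=2)
print(assignments)
print(centroids)
```

Which group ends up as cluster 0 depends on the random initialisation, so only the grouping itself is meaningful.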

### Train and evaluate the model

In [None]:
# Create and train a Kmeans
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_train)

# Predict class labels for validation set
predictions = kmeans.predict(X_val)
print('Accuracy:', accuracy_score(y_val, predictions))

# Print confusion matrix
conf_matrix = ConfusionMatrixDisplay.from_predictions(y_val, predictions, display_labels=['negative', 'positive',], colorbar=False)
plt.show()
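
One caveat when reading the accuracy above: K-means numbers its clusters arbitrarily, so cluster 1 does not necessarily correspond to the positive class. An accuracy well below 0.5 can simply mean the cluster ids are swapped relative to the labels. The sketch below uses made-up arrays (`y_true` and `clusters` are hypothetical) to show one simple way to account for this with two clusters, by also scoring the flipped assignment.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and cluster ids: a perfect grouping with the ids swapped
y_true = np.array([0, 0, 1, 1, 1, 0])
clusters = np.array([1, 1, 0, 0, 0, 1])

# Score the clusters as they are, and with the two ids swapped
raw = accuracy_score(y_true, clusters)
flipped = accuracy_score(y_true, 1 - clusters)

# With two clusters, the better of the two is the aligned accuracy
aligned = max(raw, flipped)
print(raw, flipped, aligned)
```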

### Evaluate the model on your IMDB reviews

In [None]:

# Vectorise the text with the fitted TF-IDF vectoriser, scale it and apply PCA
X_test = tfidf_vectoriser.transform(imdb_reviews['review']).toarray()
X_test = scaler.transform(X_test)
X_test = pca.transform(X_test)

# Get the class labels
y_test = imdb_reviews['sentiment'].values

# Predict class labels for test set
predictions = kmeans.predict(X_test)
print('Test Accuracy:', accuracy_score(y_test, predictions))

# Print confusion matrix
conf_matrix = ConfusionMatrixDisplay.from_predictions(y_test, predictions, display_labels=['negative', 'positive',], colorbar=False)
plt.show()

### Plot the decision boundary, cluster centroids and validation and test datapoints

Code for plotting is adapted from [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py).

In [None]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = 0.01

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh using trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    # cmap=plt.cm.Paired,
    aspect="auto",
    origin="lower",
)

# Plot the validation and test data only
points = np.concatenate((X_val, X_test))
labels =  np.concatenate((y_val, y_test))

pos_points = np.array([list(points[i]) for i in range(len(points)) if labels[i] == 1])
neg_points = np.array([list(points[i]) for i in range(len(points)) if labels[i] == 0])

plt.scatter(pos_points[:, 0], pos_points[:, 1], color="green", s=2)
plt.scatter(neg_points[:, 0], neg_points[:, 1], color="red", s=2)

# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    s=169,
    linewidths=3,
    color="w",
    zorder=10,
)

plt.title("K-means clustering on the IMDB sentiment data")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

## 2.3 Semantic Similarity

We can use the semantic information captured by word embeddings to build a simple classifier using a heuristic approach.

1. Start with a list(s) of words that are representative of the categories e.g. ‘good’ and ‘bad’ for sentiment.

    - Optionally, find other similar words within the corpus using cosine similarity.

2. Score each word in an input example according to its similarity to each word in the category lists.

3. Average the word scores to produce semantic scores for each category.


We will use gensim to train Word2Vec embeddings on our corpus of IMDB reviews.
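
Before training real embeddings, the scoring heuristic in steps 2 and 3 can be sketched with made-up 2-dimensional vectors. Everything here is hypothetical: the `toy_vectors` dictionary stands in for a trained embedding model and the helper `semantic_score` is named for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 2-D "embeddings"; real Word2Vec vectors have many more dimensions
toy_vectors = {
    'good': [1.0, 0.1], 'great': [0.9, 0.2],
    'bad': [0.1, 1.0], 'awful': [0.2, 0.9],
    'film': [0.5, 0.5],
}

# Category lists, here with a single representative word each
pos_vecs = np.array([toy_vectors['good']])
neg_vecs = np.array([toy_vectors['bad']])

def semantic_score(tokens):
    """Average similarity of a review's known words to each category."""
    vecs = np.array([toy_vectors[t] for t in tokens if t in toy_vectors])
    pos_score = cosine_similarity(vecs, pos_vecs).mean()
    neg_score = cosine_similarity(vecs, neg_vecs).mean()
    return pos_score, neg_score

pos_a, neg_a = semantic_score(['great', 'film'])
pos_b, neg_b = semantic_score(['awful', 'film'])
print(pos_a > neg_a, pos_b > neg_b)
```

The neutral word 'film' is equally similar to both categories, so the label is decided by the sentiment-bearing word in each toy review.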


In [None]:
# Tokenise the reviews
imdb_corpus = imdb_data['review'].apply(lambda x: [token.text for token in nlp.tokenizer(x)])

# Train a word2vec model with gensim
# Note: this uses the gensim >= 4.0 API; in gensim 3.x use `size=` and `wv.vocab` instead
embedding_dim = 300
w2v_model = Word2Vec(sentences=imdb_corpus, vector_size=embedding_dim, window=5, min_count=5, sg=1, seed=1, workers=4)

print("W2v model vocabulary: " + str(w2v_model.wv.index_to_key[:50]))


### Create a list of positive and negative sentiment words

In [None]:
# Start with a selection of positive and negative words
pos_words = ['excellent', 'awesome', 'cool', 'decent', 'amazing', 'strong', 'good', 'great', 'funny', 'entertaining']
neg_words = ['terrible', 'awful', 'horrible', 'boring', 'bad', 'disappointing', 'weak', 'poor', 'senseless', 'confusing']

# Get the words most similar to the combined (averaged) vectors of the positive and negative lists
pos_sims = w2v_model.wv.most_similar(pos_words, topn=10)
print('Positive similar words:')
print(pos_sims)
neg_sims = w2v_model.wv.most_similar(neg_words, topn=10)
print('Negative similar words:')
print(neg_sims)

# Get the vectors for each word in the positive and negative lists
pos_words.extend([word for word, score in pos_sims])
pos_vectors = [w2v_model.wv.get_vector(word) for word in pos_words]

neg_words.extend([word for word, score in neg_sims])
neg_vectors = [w2v_model.wv.get_vector(word) for word in neg_words]


### Calculate the semantic score of each review

In [None]:
# Function maps each token to its vector representation, skipping out-of-vocabulary tokens
def tokens_to_vecs(tokens, model):
    vecs = []
    for word in tokens:
        if word in model.wv:
            vecs.append(model.wv.get_vector(word))
    return vecs

# Score each review based on the similarity between its word vectors and the positive and negative word vectors
semantic_scores = np.zeros((len(imdb_corpus), 2 ))
for i, review in enumerate(imdb_corpus):
    # Get the vectors for each word in the review
    review_tokens = tokens_to_vecs(review, w2v_model)
    # Calculate the similarity between review vectors and positive/negative word vectors
    pos_score = cosine_similarity(review_tokens, pos_vectors)
    neg_score = cosine_similarity(review_tokens, neg_vectors)
    # Take the average of the similarity scores
    semantic_scores[i][0] = np.mean(pos_score)
    semantic_scores[i][1] =  np.mean(neg_score)

print('Semantic scores:', semantic_scores[:10])

### Assign sentiment labels according to category with the highest score

In [None]:
# Get the scores and the class labels
X = semantic_scores
y = imdb_data['sentiment'].values

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Predict class labels for training set
predictions = [1 if X_train[i][0] >  X_train[i][1] else 0 for i in range(len(X_train))]
print('Train Accuracy:', accuracy_score(y_train, predictions))

# Predict class labels for validation set
predictions = [1 if X_val[i][0] >  X_val[i][1] else 0 for i in range(len(X_val))]
print('Validation Accuracy:', accuracy_score(y_val, predictions))

# Print confusion matrix
conf_matrix = ConfusionMatrixDisplay.from_predictions(y_val, predictions, display_labels=['negative', 'positive',], colorbar=False)
plt.show()


## 2.4 Exercise: Different embedding sizes and category words

1. Try repeating the semantic similarity classifier with different embedding dimensions and examine the impact on classification accuracy.

2. Similarly try reducing the list of positive and negative words.

3. You could also consider ignoring words with low positive/negative scores, because these tend to be less informative.