# Model Performance and Comparison

## Sklearn and Natural Language Processing
In this part, we will apply sklearn and related NLP libraries to perform data analysis on the [IMDB movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/).

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.svm import SVC
from sklearn.decomposition import LatentDirichletAllocation

from gensim.models import Word2Vec

import pandas as pd
import numpy as np
import scipy.sparse as sp

We begin by loading a subset of the dataset, which contains 5000 movie reviews and their associated sentiment labels (i.e., whether a review is considered positive or negative).

In [6]:
df_reviews = pd.read_csv("imdb_reviews.csv")

In [7]:
# this cell has been tagged with excluded_from_script
# it will be ignored by the autograder
df_reviews.head()

| Unnamed: 0 | review | processed_review | sentiment |
|---|---|---|---|
| 0 | Taran Adarsh a reputed critic praised such a d... | taran adarsh repute critic praise dubba movie ... | negative |
| 1 | Worth the entertainment value of a rental, esp... | worth entertainment value rental especially li... | negative |
| 2 | I liked Antz, but loved "A Bug's Life". The an... | like antz love bug life animation put paid def... | positive |
| 3 | This reboot is like a processed McDonald's mea... | reboot like process mcdonald meal compare ang ... | negative |
| 4 | The working title was: "Don't Spank Baby". <br... | work title spank baby wayne crawford go become... | positive |


The `review` column contains the raw review text from the original dataset. However, it's always a good idea to clean and process text data before analysis. The `processed_review` column was constructed by processing and tokenizing each raw review, then joining its tokens with a single space. From this point on, we only need the `processed_review` and `sentiment` columns.

Next, let's look at the distribution of class labels:

In [8]:
# this cell has been tagged with excluded_from_script
# it will be ignored by the autograder
display(df_reviews['sentiment'].value_counts())

sentiment
negative    2500
positive    2500
Name: count, dtype: int64

We see that there are 2500 positive reviews and 2500 negative reviews. In other words, our dataset is [perfectly balanced](https://i.imgflip.com/303krn.jpg).

### Count Vectorizer

The first feature construction task we will perform is building a term-frequency matrix. The function `count_vectorizer` uses sklearn's `CountVectorizer` API to construct the term-frequency training matrix and testing matrix, along with the feature names (i.e., the list of words corresponding to the columns in the matrices).


In [13]:
def dummy_fun(x):
    return x

def count_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the term-frequency matrices for reviews_train and reviews_test using CountVectorizer.
    
    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training
        
    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing
    
    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : TF matrix for training
            tf_test (scipy.sparse.csr_matrix) : TF matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the TF matrices
    """
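    # A callable analyzer takes over preprocessing and tokenization, so the tokenizer and
    # preprocessor arguments are ignored here; tokens are the whitespace-separated words.
    vectorizer = CountVectorizer(analyzer=str.split, tokenizer=str.split, preprocessor=dummy_fun)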
    tf_train = vectorizer.fit_transform(reviews_train)
    if reviews_test is not None:
        tf_test = vectorizer.transform(reviews_test)
    else:
        tf_test = None
    features = vectorizer.get_feature_names_out().tolist()
    
    return tf_train, tf_test, features

In [14]:
def test_count_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    count_vec_train, count_vec_test, features = count_vectorizer(reviews_train, reviews_test)
    assert count_vec_train.shape == (3750, 27242)
    assert count_vec_test.shape == (1250, 27242)
    assert np.allclose(
        count_vec_train.sum(axis = 1)[:10].ravel().tolist()[0],
        [70, 65, 168, 77, 139, 132, 28, 139, 453, 89]
    )
    assert np.allclose(
        count_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [168, 60, 59, 144, 494, 135, 69, 119, 76, 68]
    )
    assert features[:10] == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert features[-10:] == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")
    
test_count_vectorizer()

All tests passed!


### TF-IDF Vectorizer

Now let's use the TF-IDF feature construction method. The function `tfidf_vectorizer` uses sklearn's `TfidfVectorizer` API to construct the TF-IDF training and testing matrices, along with the feature names (i.e., the list of words corresponding to the columns in the matrices).

In [15]:
def tfidf_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the TF-IDF matrices for reviews_train and reviews_test using TfidfVectorizer.
    
    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training
    
    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing
    
    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : TF-IDF matrix for training
            tf_test (scipy.sparse.csr_matrix) : TF-IDF matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the TF-IDF matrices
    """
    vectorizer = TfidfVectorizer()
    
    tf_train = vectorizer.fit_transform(reviews_train)
    
    if reviews_test is not None:
        tf_test = vectorizer.transform(reviews_test)
    else:
        tf_test = None
    
    features = vectorizer.get_feature_names_out().tolist()
    
    return tf_train, tf_test, features

In [16]:
def test_tfidf_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    tfidf_vec_trains, tfidf_vec_test, features = tfidf_vectorizer(reviews_train, reviews_test)
    assert tfidf_vec_trains.shape == (3750, 27242)
    assert tfidf_vec_test.shape == (1250, 27242)
    assert np.allclose(
        tfidf_vec_trains.sum(axis = 1)[:10].ravel().tolist()[0],
        [7.03658925089979, 7.417196035144321, 11.492434722367015, 6.965673648338525, 9.428219597939362, 9.425632229448961, 3.9722806270035345, 9.635230284023372, 11.779155501275017, 7.44670396016231]
    )
    assert np.allclose(
        tfidf_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [7.2233277330801196, 4.869804242110142, 6.249091468966529, 9.689812079503804, 11.89432945296538, 9.115185225757216, 6.798492438570971, 8.57464867777901, 7.954528809138947, 6.81383392701789]
    )
    assert features[:10] == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert features[-10:] == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")
    
test_tfidf_vectorizer()

All tests passed!


### Predicting review sentiment
Let's now see which feature construction method, TF or TF-IDF, is better for predicting review sentiment in our dataset. Our learning algorithm will be a support vector machine with a Gaussian (RBF) kernel, which lets the model learn a non-linear decision boundary and therefore handle data that is not linearly separable.
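
The Gaussian kernel scores the similarity of two feature vectors as $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The following sketch computes it with NumPy on three made-up points. The helper name `rbf_kernel_matrix` and the value `gamma=0.5` are assumptions chosen for illustration, not the settings used by the SVM in this notebook.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Pairwise Gaussian (RBF) kernel values between the rows of X."""
    # Squared Euclidean distance between every pair of rows
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Three illustrative points: two close together, one far away
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_matrix(X, gamma=0.5)

print(np.round(K, 4))
```

Each point has similarity 1 with itself, nearby points score close to 1, and distant points score close to 0. This locality is what allows the resulting decision boundary to bend around the data.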

The function `predict_review_sentiment` takes as input the `processed_review` and `sentiment` columns of our IMDB dataset and performs the following tasks:
1. Convert the `sentiment` column to a vector `y` of 1s and -1s: `positive` corresponds to 1 and `negative` to -1.
1. Perform a [stratified k-fold split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) of the review and sentiment vectors, based on the provided `k`. Also set `shuffle` to `True` and `random_state` to the provided `seed`.
1. For $f$ from $1 \to k$:
     * Let fold $f$ be the test set, and the remaining $k-1$ folds be the training set.
     * Convert the training and testing reviews to feature matrices `X_train` and `X_test`, using either TF or TF-IDF. Which method to use is based on the function parameter `method`.
     * Train the SVM model on `X_train, y_train` and evaluate its accuracy $a_f$ on `X_test, y_test`.
1. Return $a_1, a_2, \ldots, a_k$.
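
The splitting step above can be sketched on a made-up balanced label vector to show what stratification guarantees. The labels, texts, and variable names here are illustrative assumptions, not the IMDB data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Twelve toy reviews with balanced labels (illustrative only)
labels = np.array(["positive"] * 6 + ["negative"] * 6)
texts = np.array([f"review {i}" for i in range(12)])

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_summaries = []
for train_idx, test_idx in skf.split(texts, labels):
    # Count each class in the held-out fold
    n_pos = int((labels[test_idx] == "positive").sum())
    n_neg = int((labels[test_idx] == "negative").sum())
    fold_summaries.append((len(train_idx), len(test_idx), n_pos, n_neg))

# (train size, test size, positives in test, negatives in test) per fold
print(fold_summaries)  # [(8, 4, 2, 2), (8, 4, 2, 2), (8, 4, 2, 2)]
```

Every held-out fold keeps the same class ratio as the full dataset, so each accuracy $a_f$ is measured on a balanced test set.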

In [17]:
def predict_review_sentiment(reviews, sentiments, method, k, seed = 0):
    """
    Compute the cross-validated accuracy of SVM with either TF or TF-IDF features
    in predicting review sentiment.
    
    args:
        reviews (pd.Series[str]) : a Series of all processed movie reviews
        sentiments (pd.Series[str]) : a Series of movie review sentiments,
            containing either "positive" or "negative"
        method (str) : a string which is either "TF" or "TF-IDF",
            specifying which feature construction method to use
        k (int) : the number of folds in stratified k-fold split
    
    kwargs:
        seed (int) : the random generator seed for kfold split
        
    return:
        List[float] : a list of k accuracy values from evaluating a trained SVM model
            on each of the k folds, using the remaining folds as training data
    """
    y = sentiments.map({'positive': 1, 'negative': -1}).values
    
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    
    accuracies = []
    
    for train_index, test_index in skf.split(reviews, sentiments):
        X_train, X_test = reviews.iloc[train_index], reviews.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        if method == 'TF':
            X_train_transformed, X_test_transformed, _ = count_vectorizer(X_train, X_test)
        elif method == 'TF-IDF':
            X_train_transformed, X_test_transformed, _ = tfidf_vectorizer(X_train, X_test)
        else:
            raise ValueError("Invalid method. Choose either 'TF' or 'TF-IDF'.")
        
        svm = SVC(kernel="rbf", C=10)
        svm.fit(X_train_transformed, y_train)
        
        accuracy = svm.score(X_test_transformed, y_test)
        accuracies.append(accuracy)
    
    return accuracies

In [18]:
def test_predict_review_sentiment():
    # prediction based on TF (np.allclose avoids brittle exact float comparison)
    count_vec_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "TF", 10)
    assert np.allclose(count_vec_accs, [0.878, 0.836, 0.854, 0.824, 0.826, 0.824, 0.824, 0.85, 0.844, 0.83])
    
    # prediction based on TF-IDF
    tf_idf_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "TF-IDF", 10)
    assert np.allclose(tf_idf_accs, [0.88, 0.862, 0.85, 0.868, 0.854, 0.846, 0.864, 0.874, 0.874, 0.846])
    print("All tests passed!")
    print("Cross-validated accuracy of SVM with TF matrices", np.mean(count_vec_accs))
    print("Cross-validated accuracy of SVM with TF-IDF matrices", np.mean(tf_idf_accs))
    
test_predict_review_sentiment()



All tests passed!
Cross-validated accuracy of SVM with TF matrices 0.8389999999999999
Cross-validated accuracy of SVM with TF-IDF matrices 0.8617999999999999


We see that TF-IDF features yield better cross-validated accuracy than TF features (about 0.862 versus 0.839, a gap of roughly 2.3 percentage points) when the learning algorithm is an SVM with RBF kernel and $C = 10$, although the difference is modest.
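
Because both runs use the same seed and therefore the same folds, the per-fold accuracies reported in the test above can also be compared pairwise. The sketch below copies those ten values from the test cell; the variable names are illustrative.

```python
import numpy as np

# Per-fold accuracies copied from the test cell above
tf_accs = np.array([0.878, 0.836, 0.854, 0.824, 0.826, 0.824, 0.824, 0.85, 0.844, 0.83])
tfidf_accs = np.array([0.88, 0.862, 0.85, 0.868, 0.854, 0.846, 0.864, 0.874, 0.874, 0.846])

# Fold-by-fold difference (TF-IDF minus TF)
diffs = tfidf_accs - tf_accs

mean_gain = float(diffs.mean())
n_folds_tfidf_wins = int((diffs > 0).sum())

print(round(mean_gain, 4))   # 0.0228
print(n_folds_tfidf_wins)    # 9
```

TF-IDF comes out ahead on 9 of the 10 folds, which suggests the gain is consistent across folds even though it is small. This is a descriptive check only, not a formal significance test.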

### Topic modeling and word distribution
Let's now try to understand the review texts a bit more. We can treat all the reviews as a corpus and perform Latent Dirichlet Allocation to extract the corpus topics. We can also see which words are most frequent in a given topic. The function `top_words_by_topic` takes as input the `processed_review` column in our IMDB dataset and performs the following tasks:

1. Build a term-frequency matrix out of this column.
1. Input this matrix to sklearn's `LatentDirichletAllocation`.
1. In the resulting topic-word matrix (`components_`, with one row per topic), identify the `n_top_words` most frequent words in each topic. Order each topic's words from lowest to highest frequency.
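
The last step can be sketched with NumPy alone. The matrix below is a made-up stand-in for a fitted topic-word matrix, and the word list and variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words)
feature_names = np.array(["film", "war", "dance", "funny", "plot"])
topic_word = np.array([
    [5.0, 0.1, 0.2, 3.0, 4.0],
    [0.3, 6.0, 2.5, 0.1, 1.0],
])
n_top_words = 3

top_words = []
for weights in topic_word:
    # argsort is ascending, so the last n_top_words indices are the heaviest
    # words, already ordered from lowest to highest weight
    top_idx = np.argsort(weights)[-n_top_words:]
    top_words.append(feature_names[top_idx].tolist())

print(top_words)  # [['funny', 'plot', 'film'], ['plot', 'dance', 'war']]
```

Note that this shortcut assumes `n_top_words` is at least 1, since a slice starting at `-0` would return every word.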

In [27]:
def top_words_by_topic(reviews, n_topics = 10, n_top_words = 20, seed = 0):
    """
    Perform topic modeling on the movie review corpus and return the most frequent words in each topic.
    
    args:
        reviews (pd.Series[str]) : a Series of all processed movie reviews
    
    kwargs:
        n_topics (int) : the number of topics to model by LDA
        n_top_words (int) : the number of most frequent words to identify in each topic
        seed (int) : the random generator seed for LDA
    
    return:
        List[List[str]] : a nested list of words, where each of the n_topics inner lists
            contains the n_top_words most frequent words in a given topic
    """
    vectorizer = CountVectorizer(analyzer=str.split, tokenizer=str.split, preprocessor=dummy_fun)
    tf_matrix = vectorizer.fit_transform(reviews)
    lda = LatentDirichletAllocation(n_components=n_topics, learning_method="online", random_state=seed)
    lda.fit(tf_matrix)
    feature_names = vectorizer.get_feature_names_out()
    top_words = []
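    # Each row of lda.components_ holds the word weights of one topic
    for topic in lda.components_:
        sorted_indices = topic.argsort()
        # indices of the n_top_words largest weights, from highest to lowest
        top_indices = sorted_indices[:-n_top_words-1:-1]
        topic_top_words = [feature_names[i] for i in top_indices]
        # reverse so that each list runs from lowest to highest weight
        top_words.append(topic_top_words[::-1])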

    return top_words

In [26]:
def test_top_words_by_topic():
    corpus = pd.Series([
        "I like to eat broccoli and bananas",
        "I ate a banana and spinach smoothie for breakfast",
        "Chinchillas and kittens are cute",
        "My sister adopted a kitten yesterday",
        "Look at this cute hamster munching on a piece of broccoli"
    ])
    top_words = top_words_by_topic(corpus, n_topics = 2, n_top_words = 5)
    assert top_words == [['Look', 'broccoli', 'and', 'cute', 'a'], ['I', 'eat', 'like', 'to', 'and']]
    
    top_words = top_words_by_topic(df_reviews["processed_review"], n_topics = 5, n_top_words = 5)
    assert top_words == [
        ['performance', 'play', 'version', 'jack', 'role'],
        ['dancer', 'paris', 'dance', 'cartoon', 'hitchcock'],
        ['make', 'like', 'one', 'film', 'movie'],
        ['film', 'father', 'world', 'american', 'war'],
        ['mad', 'sheriff', 'match', 'carmen', 'arthur']
    ]
    print("All tests passed!")
    
test_top_words_by_topic()



All tests passed!


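### Word embedding and word similarity
Finally, let's look at how we can train a word embedding model from our movie review corpus. The function `find_most_similar_words` trains a `Word2Vec` model on the tokenized reviews and, for each input word, returns the `n_similar_words` words whose embeddings are closest to it, ordered from lowest to highest similarity.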
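
Closeness between embeddings is measured by cosine similarity. As a minimal sketch of that ranking, the example below uses made-up 2-D vectors in place of trained embeddings; the words, vectors, and helper names are assumptions for illustration and do not come from the trained model.

```python
import numpy as np

# Made-up 2-D "embeddings" (the model below uses 100 dimensions)
embeddings = {
    "good": np.array([1.0, 0.0]),
    "nice": np.array([0.8, 0.6]),
    "decent": np.array([0.6, 0.8]),
    "boring": np.array([0.0, 1.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn):
    """Return the topn nearest words to `word`, from lowest to highest similarity."""
    scores = [
        (other, cosine_similarity(embeddings[word], vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    # Keep the topn highest scores, then list them in ascending order
    top = sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]
    return sorted(top, key=lambda pair: pair[1])

neighbors = most_similar("good", topn=2)
print([word for word, _ in neighbors])  # ['decent', 'nice']
```

The farthest word (`boring`) is dropped, and the remaining neighbors are listed with the closest one last, matching the ordering used in the outputs further down.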

In [21]:
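def find_most_similar_words(reviews, input_words, n_similar_words):
    """
    Train Word2Vec on the reviews and, for each input word, return its n_similar_words
    nearest neighbors as (word, similarity) tuples, sorted by increasing similarity.
    """
    # Word2Vec expects each document as a list of tokens
    corpus = [review.split() for review in reviews]
    # note: with workers > 1, results may vary slightly between runs
    model = Word2Vec(sentences = corpus, vector_size = 100, window = 5, workers = 4, min_count = 1)
    similar_words = []
    for inp_word in input_words:
        neighbors = model.wv.most_similar(inp_word, topn = n_similar_words)
        similar_words.append(sorted(neighbors, key = lambda x: x[1]))
    return similar_words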

In [22]:
def test_find_most_similar_words():
    input_words = ["see", "good", "bad", "watch", "check"]
    most_similar_words = find_most_similar_words(df_reviews["processed_review"], input_words, 7)
    for i in range(len(input_words)):
        print(f"Words most similar to '{input_words[i]}':")
        print(most_similar_words[i])
        print()
    
test_find_most_similar_words()

Words most similar to 'see':
[('tonight', 0.9114551544189453), ('believe', 0.912100076675415), ('watch', 0.918253481388092), ('remember', 0.9189435839653015), ('dumbest', 0.9263791441917419), ('rent', 0.9340086579322815), ('heard', 0.9407606720924377)]

Words most similar to 'good':
[('funny', 0.9429802298545837), ('fetch', 0.9457199573516846), ('nice', 0.9502775073051453), ('decent', 0.9511047005653381), ('awful', 0.952179491519928), ('terrible', 0.959258496761322), ('horrible', 0.9593119621276855)]

Words most similar to 'bad':
[('fetch', 0.8861246705055237), ('horrible', 0.8951228857040405), ('strangest', 0.8966148495674133), ('terrible', 0.8991929292678833), ('good', 0.9007490873336792), ('awful', 0.9033360481262207), ('stupidest', 0.9163578748703003)]

Words most similar to 'watch':
[('wait', 0.9476494193077087), ('remember', 0.9490033984184265), ('recommend', 0.9496187567710876), ('subspecies', 0.950914740562439), ('enjoy', 0.9528793692588806), ('tonight', 0.9564355611801147), ('