# NLP and the Web: Home Exercise 4 

Text classification is the task of assigning text to one of a set of predefined labels. The most important part of text classification is feature engineering: the process of extracting features from raw text data for machine learning models. We previously discussed what a bag of words (BoW) is and why it is useful. In today’s class, we saw how to transform raw text into a set of features that can be represented as a matrix or a vector. In this exercise, we will practice feature engineering for text classification with <a href="https://scikit-learn.org/stable/">scikit-learn</a>. The goal of this exercise is to explore different features and their representations, and to train and evaluate different classifiers that automatically identify the sentiment polarity of movie reviews.<br><br>
<b>Data:</b> The dataset provided for this exercise contains 5k movie reviews with positive and negative labels, which were taken from the IMDB dataset. It is saved in the file *IMDB_reviews.csv*, which has two columns separated by ','. The first column contains the reviews and the second column contains their sentiment labels (0=negative, 1=positive). <br><br>
<b>Note:</b> For this exercise, you may only use spaCy, scikit-learn, NumPy, Pandas and Python's built-in packages. Please follow the instructions given below. In case of questions, use our discussion forum in Moodle; we do not answer questions via email.<br><br>
Please use comments where appropriate to help tutors understand your code. 

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## Task 1 - 5 Points

**a)** Read the data from *IMDB_reviews.csv*. Shuffle and split it into training (60%), development (20%) and test (20%) sets. 

In [49]:
reviews_df = pd.read_csv("IMDB_reviews.csv",sep=",",header=0)
X_train_dev, X_test, y_train_dev, y_test = train_test_split(reviews_df.iloc[:,0], reviews_df.iloc[:,1], test_size=0.2, random_state=5,shuffle=True)
X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, test_size=0.25, random_state=5)
print("Number of training/dev/test samples {0} - {1} - {2}: ".format(X_train.shape[0],X_dev.shape[0],X_test.shape[0])) 

Number of training/dev/test samples 3000 - 1000 - 1000: 
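
The 60/20/20 split is obtained with two consecutive calls: the first holds out 20% for testing, and the second takes 25% of the remaining 80%, which is again 20% of the full dataset. The sketch below checks this arithmetic on synthetic stand-in data (the arrays `X` and `y` are assumptions, not the IMDB reviews) and additionally passes `stratify`, which is an optional extra that keeps the label ratio equal across the splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 5,000 reviews and their labels (assumption).
X = np.arange(5000)
y = np.tile([0, 1], 2500)

# First cut: hold out 20% of all samples for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5, shuffle=True, stratify=y
)

# Second cut: 25% of the remaining 80% equals 20% of the full dataset.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5, stratify=y_rest
)

split_sizes = (len(X_train), len(X_dev), len(X_test))
print(split_sizes)
```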


**b)** Using the CountVectorizer() and the MultinomialNB() provided by scikit-learn, train two multinomial Naive Bayes classifiers with the training set. One classifier uses the count matrix (absolute occurrence of each word) and the other one uses the binary count matrix (binary, whether a word occurs in a text) as features. Evaluate them on the development set and print the accuracy.  
**Requirement:** For the reusability of subsequent tasks, you should first implement the function `train_valid_cls`, and then call the function to train and evaluate the classifiers. The required input parameters are described as follows. 

In [50]:
def train_valid_cls(train_texts, train_labels, dev_texts, dev_labels, vectorizer, classifier):
    """
    Train the classifier on the training set with features from the given vectorizer
    and return its accuracy on the development set (the caller prints it).
    
    @param train_texts: array-like object containing review texts from the training set
    @param train_labels: array-like object with corresponding sentiment labels for train_texts
    @param dev_texts: array-like object containing review texts from the development set
    @param dev_labels: array-like object with corresponding sentiment labels for dev_texts
    @param vectorizer: a customized scikit-learn Vectorizer 
    @param classifier: a scikit-learn Classifier
    @return: accuracy of the trained classifier on the development set
    """

    # tokenize the training texts and build the vocabulary from them only
    vectorizer.fit(train_texts)

    # encode the documents as sparse feature matrices
    occ_train_vector = vectorizer.transform(train_texts)
    occ_dev_vector = vectorizer.transform(dev_texts)

    # train the classifier on the training features
    classifier.fit(occ_train_vector, train_labels)

    # predict the labels of the development set and compare them with the gold labels
    y_hat = classifier.predict(occ_dev_vector)
    count_acc = accuracy_score(dev_labels, y_hat)

    return count_acc

In [51]:
vectorizer = CountVectorizer()
clf = MultinomialNB()
acc = train_valid_cls(X_train, y_train, X_dev, y_dev, vectorizer, clf)
print("Accuracy for MultinomialNB with occurrence as feature: ", acc)

Accuracy for MultinomialNB with occurrence as feature:  0.829


In [52]:
def train_valid_cls_bin(train_texts, train_labels, dev_texts, dev_labels, vectorizer, classifier):
    """
    Variant of train_valid_cls that converts the feature matrices to dense arrays
    before training, and returns the accuracy on the development set.

    Note: .toarray() only changes the storage format from sparse to dense. The values
    are still the raw counts produced by the vectorizer, so this function yields the
    same features as train_valid_cls unless the vectorizer itself is binary,
    e.g. CountVectorizer(binary=True).
    
    @param train_texts: array-like object containing review texts from the training set
    @param train_labels: array-like object with corresponding sentiment labels for train_texts
    @param dev_texts: array-like object containing review texts from the development set
    @param dev_labels: array-like object with corresponding sentiment labels for dev_texts
    @param vectorizer: a customized scikit-learn Vectorizer 
    @param classifier: a scikit-learn Classifier
    @return: accuracy of the trained classifier on the development set
    """

    # tokenize the training texts and build the vocabulary from them only
    vectorizer.fit(train_texts)

    # encode the documents as sparse feature matrices
    occ_train_vector = vectorizer.transform(train_texts)
    occ_dev_vector = vectorizer.transform(dev_texts)

    # convert sparse matrices to dense arrays (the values are unchanged)
    bin_train_vector = occ_train_vector.toarray()
    bin_dev_vector = occ_dev_vector.toarray()

    # train on the training set and predict on the dev set
    y_hat = classifier.fit(bin_train_vector, train_labels).predict(bin_dev_vector)

    # calculate the accuracy on the dev set
    count_acc = accuracy_score(dev_labels, y_hat)

    return count_acc

In [53]:
vectorizer = CountVectorizer()
clf = MultinomialNB()
acc_bin = train_valid_cls_bin(X_train, y_train, X_dev, y_dev, vectorizer, clf)
print("Accuracy for MultinomialNB with one hot encoding as features: ", acc_bin)

Accuracy for MultinomialNB with one hot encoding as features:  0.829
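
The task distinguishes absolute counts from a binary presence/absence matrix. As a minimal sketch on a two-document toy corpus (the documents are made up, not taken from the IMDB data), scikit-learn's `CountVectorizer` produces the binary variant when constructed with `binary=True`; calling `.toarray()` on its output only converts the sparse matrix to a dense one and leaves the values as they are.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): "good" occurs twice in the first document.
docs = ["good good movie", "bad movie"]

# Absolute occurrence of each word per document.
count_matrix = CountVectorizer().fit_transform(docs).toarray()

# Presence (1) or absence (0) of each word per document.
binary_matrix = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary: bad, good, movie.
print(count_matrix)   # the "good" column holds 2 for the first document
print(binary_matrix)  # the "good" column is capped at 1
```

A vectorizer built this way can be passed directly to `train_valid_cls`, so no second training function is needed for the binary case.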


**c)** Create a list of at least 20 (stop) words that you think are useless for the training (and removing them could help improve the accuracy of the classifier). Call the function `train_valid_cls` from b), train and validate two new multinomial Naive Bayes classifiers using the count matrix and binary count matrix **without these (stop) words**. Compare the results with b), and briefly explain why you think this can improve the accuracy (and analyse the possible reason if it doesn't work as you expect) in 2-4 sentences.  
**Hint:** Checking the documentation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">here</a> may help you find an easy way to remove the stop words. You are allowed to take a look at existing stop word lists, but please choose the stop words based on your own consideration.

In [55]:
# stop_words="english" uses scikit-learn's built-in English stop word list
vectorizer_without_stopwords = CountVectorizer(stop_words="english")
clf = MultinomialNB()
acc_without_stop = train_valid_cls(X_train, y_train, X_dev, y_dev, vectorizer_without_stopwords, clf)
print("Accuracy for MultinomialNB with occurrence as feature without stopwords: ", acc_without_stop)

Accuracy for MultinomialNB with occurrence as feature without stopwords:  0.835


In [54]:
vectorizer = CountVectorizer(stop_words="english")
clf = MultinomialNB()
acc_bin = train_valid_cls_bin(X_train, y_train, X_dev, y_dev, vectorizer, clf)
print("Accuracy for MultinomialNB with one hot encoding as features without stopwords: ", acc_bin)

Accuracy for MultinomialNB with one hot encoding as features without stopwords:  0.835
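
The task asks for a hand-picked list of at least 20 stop words. `CountVectorizer` accepts such a list through its `stop_words` parameter. The sketch below is only an illustration: the list `my_stop_words` and the two example sentences are hypothetical, and whether domain words such as "film" or "movie" belong on the list is a judgment call that should be checked on the development set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical hand-picked stop words: function words plus two domain words
# that occur in reviews of either polarity (assumption, not a tuned list).
my_stop_words = [
    "the", "an", "and", "of", "to", "in", "is", "it", "this", "that",
    "was", "as", "for", "with", "on", "at", "by", "be", "are", "but",
    "film", "movie",
]

# Made-up example reviews.
docs = [
    "This movie was a waste of time",
    "The film is brilliant and moving",
]

vectorizer = CountVectorizer(stop_words=my_stop_words)
vectorizer.fit(docs)

# Only the words that are not on the list remain in the vocabulary.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Note that the default tokenizer already drops single-character tokens such as "a", so they do not need to be listed.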


**d)** Explore at least 3 different ranges of n-grams (introduce bigram, trigram ... features) and try to find the best one for training the multinomial Naive Bayes classifier with the count matrix or binary count matrix. Report the accuracy on the development set for every range you tried. 

In [76]:
# unigrams + bigrams: ngram_range=(1, 2)
bi_gram_vec = CountVectorizer(stop_words="english", analyzer='word', ngram_range=(1, 2))
classifier = MultinomialNB()
bi_gram_acc = train_valid_cls(X_train,y_train,X_dev,y_dev,bi_gram_vec, classifier)
print("Accuracy for MultinomialNB with bigram as features without stopwords: ", bi_gram_acc)

# unigrams + bigrams + trigrams: ngram_range=(1, 3)
tri_gram_vec = CountVectorizer(stop_words="english", analyzer='word', ngram_range=(1, 3))
classifier = MultinomialNB()
tri_gram_acc = train_valid_cls(X_train,y_train,X_dev,y_dev,tri_gram_vec, classifier)
print("Accuracy for MultinomialNB with trigram as features without stopwords: ", tri_gram_acc)

# unigrams + bigrams with min_df=1: keep terms that appear in at least 1 document.
# min_df=1 is the default, so the features are the same as in the first run above.
min_df_vec = CountVectorizer(stop_words="english", analyzer='word', ngram_range=(1, 2),min_df = 1)
classifier = MultinomialNB()
min_df_acc = train_valid_cls(X_train,y_train,X_dev,y_dev,min_df_vec, classifier)
print(min_df_acc)

Accuracy for MultinomialNB with bigram as features without stopwords:  0.842
Accuracy for MultinomialNB with trigram as features without stopwords:  0.846
0.842
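
The `ngram_range=(min_n, max_n)` parameter controls which n-gram lengths become features, so `(1, 2)` keeps unigrams and adds bigrams, whereas `(2, 2)` uses bigrams only. The following sketch lists the extracted features for several ranges on a single made-up sentence (an assumption for illustration, not part of the IMDB data).

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up sentence with four tokens.
docs = ["not good at all"]

features_by_range = {}
for ngram_range in [(1, 1), (1, 2), (2, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(docs)
    # vocabulary_ maps each extracted n-gram to its column index.
    features_by_range[ngram_range] = sorted(vec.vocabulary_)

for ngram_range, features in features_by_range.items():
    print(ngram_range, len(features), features)
```

With bigrams, a phrase such as "not good" becomes a feature of its own, which a pure unigram model cannot represent. The vocabulary also grows quickly with larger ranges, so the best range has to be determined on the development set.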


## Task 2 - 5 Points

**a)** Tokenize every review text in the training and development sets using spaCy (It may take a few minutes). Filter out tokens except verbs, adjectives and adverbs. You should store the remaining tokens as spaCy token objects (instead of strings) for the subsequent tasks b) and c).

In [59]:
import spacy

nlp = spacy.load("en_core_web_sm")
# keep only verbs, adjectives and adverbs (coarse-grained POS tags)
whitelist = ["VERB","ADJ","ADV"]

# each review becomes a list of spaCy Token objects with a whitelisted POS tag
filtered_reviews_train = [[token for token in nlp(doc) if token.pos_ in whitelist] for doc in X_train]
filtered_reviews_dev = [[token for token in nlp(doc) if token.pos_ in whitelist] for doc in X_dev]

**b)** Lemmatize the remaining tokens from a). Train and validate a new multinomial Naive Bayes classifier with the count matrix of the lemmatized tokens, print the accuracy on the development set.

In [60]:
# replace every token by its lemma and join the lemmas of a review into one string,
# because CountVectorizer expects raw strings as input
lemmatized_reviews_train = [" ".join(token.lemma_ for token in review) for review in filtered_reviews_train]
lemmatized_reviews_dev = [" ".join(token.lemma_ for token in review) for review in filtered_reviews_dev]

In [61]:
vectorizer = CountVectorizer()
clf = MultinomialNB()
lemmatized_acc = train_valid_cls(lemmatized_reviews_train, y_train, lemmatized_reviews_dev, y_dev, vectorizer, clf)
print("Accuracy for MultinomialNB with occurrence as features and lemmatized words: ",lemmatized_acc)

Accuracy for MultinomialNB with occurrence as features and lemmatized words:  0.857


**c)** Extract the spaCy word vectors of the remaining tokens from a). For each review text, calculate the average of all word vectors as its vector representation. Then train a Gaussian Naive Bayes classifier (sklearn.naive_bayes.GaussianNB) and a linear SVM classifier (sklearn.svm.LinearSVC) with the obtained vector representations, and evaluate and print their accuracy on the development set.

In [62]:
for i, v in enumerate(filtered_reviews_train):
    filtered_reviews_train[i] = " ".join( [t.text for t in filtered_reviews_train[i]] )

for i, v in enumerate(filtered_reviews_dev):
    filtered_reviews_dev[i] = " ".join( [t.text for t in filtered_reviews_dev[i]] )

In [63]:
from sklearn.svm import LinearSVC
import numpy as np 

clf_svc = LinearSVC()

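# tokenize and build vocab (reuses the CountVectorizer instance created in b))
vectorizer.fit(filtered_reviews_train)

# encode documents as sparse count matrices
filtered_train_vector = vectorizer.transform(filtered_reviews_train)   
filtered_dev_vector = vectorizer.transform(filtered_reviews_dev)

# sum the counts of each row: this gives ONE feature per review (its number of counted tokens).
# Note: this is neither an average nor based on spaCy word vectors (token.vector),
# so it does not implement the representation described in the task.
# np.asarray converts the np.matrix returned for sparse input into a regular array.
average_train = np.asarray(np.sum(filtered_train_vector, axis=1))
average_dev = np.asarray(np.sum(filtered_dev_vector, axis=1))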

In [64]:
## train
clf_svc.fit(average_train, y_train)

## Predict new observation's class
y_hat = clf_svc.predict(average_dev)
svc_average_acc = accuracy_score(y_dev,y_hat)
print(svc_average_acc)



0.522


In [65]:
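from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()

## train and predict
## NOTE: clf_svc is fitted here instead of clf_gnb, so the accuracy printed below comes from
## a second LinearSVC run and not from Gaussian Naive Bayes; clf_gnb.fit(...) is needed for that
y_hat = clf_svc.fit(average_train, y_train).predict(average_dev)

## Calculate the result
gnb_average_acc = accuracy_score(y_dev,y_hat)
print(gnb_average_acc)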



0.48
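
The representation described in c) is the mean of the word vectors of a review's remaining tokens, which gives one dense vector of fixed length per review. The sketch below illustrates only that averaging step and the two classifiers. It does not use spaCy: the random arrays produced by the hypothetical helper `fake_review` stand in for `[token.vector for token in review]`, and the dimension `DIM` is an arbitrary assumption. With spaCy, a pipeline that ships static word vectors (for example a medium or large English model) would be the natural source of `token.vector`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
DIM = 16  # hypothetical embedding size (assumption)

def fake_review(label, n_tokens):
    """Return an (n_tokens, DIM) array standing in for the token vectors of one review."""
    center = 1.0 if label == 1 else -1.0
    return rng.normal(loc=center, scale=1.0, size=(n_tokens, DIM))

def average_vector(token_vectors):
    """Mean over all token vectors; a zero vector if no tokens survived the filter."""
    if len(token_vectors) == 0:
        return np.zeros(DIM)
    return np.mean(token_vectors, axis=0)

# Synthetic dataset: 200 reviews with alternating labels and varying lengths.
labels = np.array([0, 1] * 100)
reviews = [fake_review(label, n_tokens=rng.integers(5, 30)) for label in labels]

# One averaged vector per review, stacked into a (n_reviews, DIM) feature matrix.
X = np.vstack([average_vector(review) for review in reviews])

X_train, y_train = X[:150], labels[:150]
X_dev, y_dev = X[150:], labels[150:]

scores = {}
for name, clf in [("GaussianNB", GaussianNB()), ("LinearSVC", LinearSVC())]:
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_dev, clf.predict(X_dev))

print(scores)
```

The zero-vector fallback matters in practice because a short review may contain no verb, adjective or adverb at all, and the mean of an empty set of vectors is undefined.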


**d)** Explain the general differences between discriminative and generative models in 2-3 sentences, and identify the corresponding model category for the two classifiers from c).

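A discriminative model, e.g. a support vector machine, directly learns the decision boundary between the classes (or the conditional distribution P(y|x)) without modelling how the data are generated. A generative model, e.g. a Naive Bayes classifier, explicitly models the distribution of the data within each class, P(x|y), together with the class prior P(y), and classifies by applying Bayes' rule. Accordingly, LinearSVC is a discriminative model and GaussianNB is a generative model.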
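
This difference is visible in what the two fitted scikit-learn estimators store. As a small sketch on synthetic two-dimensional data (the data are an assumption for illustration), `GaussianNB` keeps class priors and per-class feature means, which describe each class on its own, while `LinearSVC` keeps only the weights and intercept of the separating hyperplane.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Two synthetic Gaussian clusters, 50 points per class (assumption).
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB().fit(X, y)
svm = LinearSVC().fit(X, y)

# Generative: parameters of P(y) and of P(x|y) for every class.
print("class priors:", gnb.class_prior_)
print("per-class feature means:", gnb.theta_)

# Discriminative: only the decision boundary w·x + b = 0.
print("hyperplane weights:", svm.coef_)
print("hyperplane intercept:", svm.intercept_)
```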

**e)** Choose your best model from the whole exercise and test it on the test set, print the accuracy. Why is it important to evaluate your final model on a previously unused test set? Explain it in up to 2 sentences.

Testing on a separate test set gives an unbiased estimate of how the model performs on unseen data, because the test data were used neither for training nor for choosing between models on the development set. 

In [72]:
# apply the same preprocessing as for the training and development sets:
# keep only tokens with a whitelisted POS tag ...
filtered_reviews_test = [[token for token in nlp(doc) if token.pos_ in whitelist] for doc in X_test]

# ... and join their lemmas into one string per review
lemmatized_reviews_test = [" ".join(token.lemma_ for token in review) for review in filtered_reviews_test]



In [75]:
vectorizer = CountVectorizer(stop_words="english")
clf_final = MultinomialNB()
lemmatized_acc = train_valid_cls(lemmatized_reviews_train, y_train, lemmatized_reviews_dev, y_dev, vectorizer, clf_final)
occ_test_vector = vectorizer.transform(lemmatized_reviews_test)
y_hat = clf_final.predict(occ_test_vector)
final_acc = accuracy_score(y_test,y_hat)

print("Accuracy for MultinomialNB with occurrence as features and lemmatized words without stopwords: ",final_acc)

Accuracy for MultinomialNB with occurrence as features and lemmatized words without stopwords:  0.827


Please upload your working Jupyter notebook to Moodle <b>before the next lab session</b> <span style="color:red">(Dec 3rd, 4:14pm)</span>. Submission format: ExerciseX_YourName.zip<br>
The submission should contain your filled-out Jupyter notebook template (naming schema: ExerciseX_YourName.ipynb) and any auxiliary files that are necessary to run your code (e.g. datasets provided by us).