# Natural Language Processing - Assignment 2
# Sentiment analysis for movie reviews

This notebook was created for you to answer questions 2, 3 and 4 from assignment 2. Please read the steps and the provided code carefully and make sure you understand them.

The (red) comments at the beginning of each function explain what the function should do, which parameters it takes as input and which variables it should return. After the (green) comments "### student code here ###" you should write your own code.

**Please modify the next cell specifying your group number**

 *This is the Notebook of* ***Group 0***




### Prerequisite - Libraries
Make sure you have the needed libraries installed on your computer: scikit-learn, Pandas, NLTK...

### Prerequisite - Load Data

In the first step, we are going to load the data in a Pandas DataFrame. Pandas DataFrames are a useful way of storing data. DataFrames are tables in which data can be accessed as columns, as rows or as individual cells. You can find more info on DataFrames here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
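As a quick illustration of the three access styles mentioned above, here is a minimal sketch on a toy DataFrame. The rows are made up for this example and are not part of the movie dataset.

```python
import pandas as pd

# Toy data for illustration only; the real notebook loads reviews from disk.
toy = pd.DataFrame({
    "Review": ["great film", "dull plot", "loved it"],
    "Label": ["Pos", "Neg", "Pos"],
})

labels = toy["Label"]               # a whole column (a pandas Series)
second_row = toy.iloc[1]            # a whole row, selected by position
first_review = toy.at[0, "Review"]  # a single cell, by row label and column name

print(list(labels))
print(second_row["Review"])
print(first_review)
```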

Read the code below and make sure you understand what is happening. Run the code to load your data.

In [None]:
import os
import re
import pandas as pd
import numpy as np
import glob
### student code here: import the needed modules from scikit-learn ###
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


In [3]:
def get_path(filename):
    """
    Makes a list of all the paths that match the search pattern

    :param filename: A glob (shell-style wildcard) pattern for the filenames, relative to the movies folder
    :return: A list of all the matching pathnames
    """
    # place the movies folder in the same directory as this notebook
    current_directory = os.getcwd()
    # if you are using Google Colab, you will have to change the above line
    # to load the dataset from your Google Drive

    # glob.glob() is a pattern-matching path finder: it searches the movies folder for reviews
    # whose names match a shell-style wildcard pattern (this is not a regular expression)
    paths = glob.glob(current_directory + '/movies/' + filename)

    if len(paths) == 0:
        print('Your file list is empty. The code looked in '+current_directory+'/movies, but found no files matching the pattern.')
    else:
        print("Found ", len(paths), "files")
    return paths
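
The pattern passed to `get_path` follows shell-style wildcard rules, which differ from regular expressions: `[0-9]` matches exactly one digit, `*` matches any run of characters, and `.` is a literal dot. The sketch below uses the standard-library `fnmatch` module, which applies the same wildcard rules to plain strings, so it runs without the dataset. The file names are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "[NP]-train[0-9]*.txt"

# Hypothetical file names, for illustration only.
names = ["P-train12.txt", "N-train7.txt", "X-train1.txt", "P-test3.txt", "P-train.txt"]

# Keep the names that match the wildcard pattern (case-sensitive).
matches = [name for name in names if fnmatchcase(name, pattern)]
print(matches)
```

Note that `P-train.txt` is rejected: unlike in a regular expression, `[0-9]*` here means "one digit followed by anything", not "zero or more digits".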

In [4]:
def load_data(pathset):
    """
    Loads the data into a dataframe

    :param pathset:  A list of paths
    :return  A dataframe with three columns: Path, Review (Text) and Label
    """
    # Files are named by sentiment (P for positive, N for negative)
    pattern = re.compile(r'P-(train|test)[0-9]*\.txt')
    reviews = []
    labels = []
    for path in pathset:
        # Use a context manager so every file is closed after reading
        with open(path, "r") as review_file:
            reviews.append(review_file.read())
        # Anything that does not match the positive pattern is labelled negative
        labels.append('Pos' if pattern.search(path) else 'Neg')
    return pd.DataFrame({'Path': pathset, 'Review': reviews, 'Label': labels})

In [6]:
#Load the files in the Dataframe. This will take a while...
paths = get_path('train/[NP]-train[0-9]*.txt')
data = load_data(paths)
data.head()

Found  1200 files


| | Path | Review | Label |
|---|---|---|---|
| 0 | /Users/kornelovics/EIT/UT/NLP/project-nlp/home... | Let's see, cardboard characters like Muslim te... | Neg |
| 1 | /Users/kornelovics/EIT/UT/NLP/project-nlp/home... | "May contain spoilers" Sadly Lou Costellos' la... | Neg |
| 2 | /Users/kornelovics/EIT/UT/NLP/project-nlp/home... | I can't emphasize it enough, do *NOT* get this... | Neg |
| 3 | /Users/kornelovics/EIT/UT/NLP/project-nlp/home... | I am truly sad that this is the first bad revi... | Neg |
| 4 | /Users/kornelovics/EIT/UT/NLP/project-nlp/home... | I'm a Petty Officer 1st Class (E-6) and have b... | Pos |


### Part 2 - Tokenization

In this step, you should write a tokenizer and compare it with an off-the-shelf one.

#### Question 2.1 Making your own tokenizer

In [7]:
def my_tokenizer(text):
    """
    The implementation of your own tokenizer

    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """
    text = text.lower()
    # Replace punctuation with spaces; all apostrophes are kept so contractions like "won't" stay intact
    text = re.sub(r"[^\w\s']", " ", text)

    # Split on any run of whitespace; split() without arguments also returns [] for empty input
    tokenized_text = text.split()

    return tokenized_text




sample_string0 = "If you have the chance, watch it. Although, a warning, you'll cry your eyes out."
sample_string1 = "kaas is lekker"
sample_string2 = "Me and My bEstfriend are going to the city!"
sample_string3 = "me and my number 3 bEStfrienD like movies"
print(my_tokenizer(sample_string0))
print(my_tokenizer(sample_string1))
print(my_tokenizer(sample_string2))
print(my_tokenizer(sample_string3))

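['if', 'you', 'have', 'the', 'chance', 'watch', 'it', 'although', 'a', 'warning', "you'll", 'cry', 'your', 'eyes', 'out']
['kaas', 'is', 'lekker']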
['me', 'and', 'my', 'bestfriend', 'are', 'going', 'to', 'the', 'city']
['me', 'and', 'my', 'number', '3', 'bestfriend', 'like', 'movies']
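
The tokenizer above keeps every apostrophe, including quotation marks around a word. One possible variation, offered as a sketch rather than the expected answer, extracts tokens with `re.findall` so that an apostrophe survives only when it sits between word characters. The name `regex_tokenizer` is hypothetical.

```python
import re

# Word characters, optionally followed by apostrophe-joined parts ("you'll", "rock'n'roll").
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")

def regex_tokenizer(text):
    """Lowercase the text and return the tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text.lower())

print(regex_tokenizer("You'll love 'Alien', won't you?"))
```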


#### Question 2.2 Using an off-the-shelf tokenizer

In [8]:
# Now we are going to compare the tokenizer you just wrote with the one from NLTK.
# word_tokenize needs the 'punkt' tokenizer data; nltk.download() reports it as up-to-date if it is already installed.
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

def nltk_tokenizer(text):
    """
    This function should apply the word_tokenize (punkt) tokenizer of nltk to the input text

    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """
    tokenized_text = word_tokenize(text)
    return tokenized_text

test_sentences = ["I like this assignment because:\n-\tit is fun;\n-\tit helps me practice my Python skills.",
        "I won a prize, but I won't be able to attend the ceremony.",
        "“The strange case of Dr. Jekyll and Mr. Hyde” is a famous book... but I haven't read it.",
        "I work for the C.I.A.. And you?",
        "OMG #Twitter is sooooo coooool <3 :-) <-- lol...why do i write like this idk right? :) 🤷😂 🤖"]

for test_string in test_sentences:
    print(my_tokenizer(test_string))
    print(nltk_tokenizer(test_string))
    print("\n")


['i', 'like', 'this', 'assignment', 'because', 'it', 'is', 'fun', 'it', 'helps', 'me', 'practice', 'my', 'python', 'skills']
['I', 'like', 'this', 'assignment', 'because', ':', '-', 'it', 'is', 'fun', ';', '-', 'it', 'helps', 'me', 'practice', 'my', 'Python', 'skills', '.']


['i', 'won', 'a', 'prize', 'but', 'i', "won't", 'be', 'able', 'to', 'attend', 'the', 'ceremony']
['I', 'won', 'a', 'prize', ',', 'but', 'I', 'wo', "n't", 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']


['the', 'strange', 'case', 'of', 'dr', 'jekyll', 'and', 'mr', 'hyde', 'is', 'a', 'famous', 'book', 'but', 'i', "haven't", 'read', 'it']
['“', 'The', 'strange', 'case', 'of', 'Dr.', 'Jekyll', 'and', 'Mr.', 'Hyde', '”', 'is', 'a', 'famous', 'book', '...', 'but', 'I', 'have', "n't", 'read', 'it', '.']


['i', 'work', 'for', 'the', 'c', 'i', 'a', 'and', 'you']
['I', 'work', 'for', 'the', 'C.I.A', '..', 'And', 'you', '?']


['omg', 'twitter', 'is', 'sooooo', 'coooool', '3', 'lol', 'why', 'do', 'i', 'write', 'like

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kornelovics/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/kornelovics/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Part 3 - Text classification with a unigram language model

#### Training phase
You now need to create the model and train it on the documents in the DataFrame. Look at the scikit-learn documentation to learn how to use the CountVectorizer and MultinomialNB classes.
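
Conceptually, a count vectorizer builds a vocabulary from the training documents and turns each document into a vector of token counts. The following standard-library sketch is a simplified illustration of that idea, not scikit-learn's implementation. The helpers `tokenize`, `build_vocabulary` and `count_vector` are hypothetical names, and the toy documents are made up.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(doc):
    """Lowercase a document and extract its tokens."""
    return TOKEN.findall(doc.lower())

def build_vocabulary(docs):
    """Map every token seen in docs to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vector(doc, vocabulary):
    """Count vocabulary tokens in doc; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(doc))
    return [counts[tok] for tok in vocabulary]

train_docs = ["A great, great movie", "A boring movie"]
vocab = build_vocabulary(train_docs)
print(vocab)
print(count_vector("great movie, great cast", vocab))
```

The word "cast" does not occur in the toy training documents, so it has no column and is dropped. This is why the vocabulary is fitted on the training data and only applied to the test data.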

In [9]:
# Load the training files into a DataFrame. This will take a while...
train_paths = get_path('train/[NP]-train[0-9]*.txt')
train_data = load_data(train_paths)
train_data.head()

# Load the test files into a DataFrame.
test_paths = get_path('test/[NP]-test[0-9]*.txt')
test_data = load_data(test_paths)
test_data.head()

# Build the vocabulary on the training reviews only, then reuse it for the test reviews
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_data['Review'])
X_test = vectorizer.transform(test_data['Review'])


Found  1200 files
Found  100 files


#### Testing phase
Now that you have a trained model, you need to test its performance.

1. Load your test data.
2. Classify your test data using the classifier you trained before.
3. Compute the accuracy of your classifier on the test data
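
Accuracy, the measure used in step 3, is the fraction of test documents whose predicted label equals the true label. A minimal sketch with made-up labels (the function name `accuracy` is hypothetical):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Three of the four made-up predictions are correct.
print(accuracy(["Pos", "Neg", "Pos", "Neg"], ["Pos", "Neg", "Neg", "Neg"]))
```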

In [10]:
# First, read all the test data from the files.
# Then classify it using the classifier you trained before
# Finally, calculate the performance
classifier = MultinomialNB(alpha=1)  # Laplace smoothing
classifier.fit(X_train, train_data['Label'])
prediction = classifier.predict(X_test)
print("Accuracy with Laplace smoothing (alpha = 1):", accuracy_score(test_data['Label'], prediction))

Accuracy with Laplace smoothing (alpha = 1): 0.84


Now train two more models: one without Laplace smoothing, and one where stopwords are removed. Then test them on the same test data, and compare the performance with the results you previously obtained.
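
To see what the smoothing parameter does, consider the usual additive-smoothing estimate of a word probability within a class: (count + alpha) / (total + alpha * |V|), where |V| is the vocabulary size. The sketch below uses made-up counts, and `word_log_prob` is a hypothetical helper, not part of scikit-learn.

```python
import math

def word_log_prob(word, class_counts, vocabulary_size, alpha):
    """Log of the smoothed estimate (count + alpha) / (total + alpha * |V|)."""
    total = sum(class_counts.values())
    numerator = class_counts.get(word, 0) + alpha
    denominator = total + alpha * vocabulary_size
    # log(0) is undefined, so an unseen word without smoothing is reported as -inf.
    return math.log(numerator / denominator) if numerator > 0 else float("-inf")

# Made-up word counts for the positive class, for illustration only.
pos_counts = {"great": 3, "movie": 2}
vocabulary_size = 4  # e.g. great, movie, boring, awful

print(word_log_prob("boring", pos_counts, vocabulary_size, alpha=1))  # finite
print(word_log_prob("boring", pos_counts, vocabulary_size, alpha=0))  # -inf
```

With alpha = 0, a single unseen word drives the score of a whole class to minus infinity. If that happens for both classes the scores tie, which is one plausible reason an unsmoothed model can perform poorly.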

In [12]:
#Model without smoothing:
classifier_no_smooth = MultinomialNB(alpha=0)  # no smoothing
classifier_no_smooth.fit(X_train, train_data['Label'])
prediction_no_smooth = classifier_no_smooth.predict(X_test)
print("Accuracy without smoothing, alpha = 0:", accuracy_score(test_data['Label'], prediction_no_smooth))

#Model with stop words removed:
vectorizer_sw = CountVectorizer(stop_words='english')
X_train_sw = vectorizer_sw.fit_transform(train_data['Review'])
X_test_sw = vectorizer_sw.transform(test_data['Review'])

classifier_sw = MultinomialNB(alpha=1)
classifier_sw.fit(X_train_sw, train_data['Label'])
prediction_sw = classifier_sw.predict(X_test_sw)
print("Accuracy with stop word removal and alpha = 1:", accuracy_score(test_data['Label'], prediction_sw))

#Model without lowercasing the words
vectorizer_nolc = CountVectorizer(lowercase=False)
X_train_nolc = vectorizer_nolc.fit_transform(train_data['Review'])
X_test_nolc = vectorizer_nolc.transform(test_data['Review'])

classifier_nolc = MultinomialNB(alpha=1)
classifier_nolc.fit(X_train_nolc, train_data['Label'])
prediction_nolc = classifier_nolc.predict(X_test_nolc)
print("Accuracy with original casing and alpha = 1:", accuracy_score(test_data['Label'], prediction_nolc))


  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(


Accuracy without smoothing, alpha = 0: 0.5
Accuracy with stop word removal and alpha = 1: 0.85
Accuracy with original casing and alpha = 1: 0.81


### Part 4 - Text classification with a bigram language model

Now we will classify the same dataset again, but this time with a bigram language model.

#### Training phase
Build a Naïve Bayes classifier that uses bigrams instead of single words.
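
A bigram is a pair of adjacent tokens, so a bigram model can capture short phrases such as "not good" that single words miss. A minimal sketch of n-gram extraction from a token list (the function name `ngrams` is hypothetical):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()
print(ngrams(tokens, 2))
```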


In [13]:
### Student code here ###

print("Loading training data...")
train_paths = get_path('train/[NP]-train[0-9]*.txt')
train_data = load_data(train_paths)
print(f"Loaded {len(train_data)} training documents")

# Load test data
print("Loading test data...")
test_paths = get_path('test/[NP]-test[0-9]*.txt')
test_data = load_data(test_paths)
print(f"Loaded {len(test_data)} test documents")

vectorizer_bigram = CountVectorizer(ngram_range=(2,2), lowercase=True)
X_train_bigram = vectorizer_bigram.fit_transform(train_data['Review'])
X_test_bigram = vectorizer_bigram.transform(test_data['Review'])

# Train bigram model
classifier_bigram = MultinomialNB(alpha=1.0)
classifier_bigram.fit(X_train_bigram, train_data['Label'])

Loading training data...
Found  1200 files
Loaded 1200 training documents
Loading test data...
Found  100 files
Loaded 100 test documents


MultinomialNB parameters:

| Parameter | Value |
|---|---|
| alpha | 1.0 |
| force_alpha | True |
| fit_prior | True |
| class_prior | |


#### Testing phase
As before, calculate the performance on your test data, and note how it differs from the previous (unigram) model.

In [14]:
### Student code here ###
y_pred_bigram = classifier_bigram.predict(X_test_bigram)
accuracy_bigram = accuracy_score(test_data['Label'], y_pred_bigram)

print(f"Bigram model accuracy: {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")

Bigram model accuracy: 0.8900 (89.00%)


### Trigrams
When I asked students how to improve the classification performance on this dataset, the first answer was always "use trigrams" (or even higher-order n-grams). Let's see how much of an improvement that would be, by training a trigram model and testing it.

In [15]:
### Student code here ###

### Trigram model

vectorizer_trigram = CountVectorizer(ngram_range=(3,3), lowercase=True)
X_train_trigram = vectorizer_trigram.fit_transform(train_data['Review'])
X_test_trigram = vectorizer_trigram.transform(test_data['Review'])

# Train trigram model
classifier_trigram = MultinomialNB(alpha=1.0)
classifier_trigram.fit(X_train_trigram, train_data['Label'])

# Test trigram model
y_pred_trigram = classifier_trigram.predict(X_test_trigram)
accuracy_trigram = accuracy_score(test_data['Label'], y_pred_trigram)

print(f"Trigram model accuracy: {accuracy_trigram:.4f} ({accuracy_trigram*100:.2f}%)")

### 4-gram model

vectorizer_4gram = CountVectorizer(ngram_range=(4,4), lowercase=True)
X_train_4gram = vectorizer_4gram.fit_transform(train_data['Review'])
X_test_4gram = vectorizer_4gram.transform(test_data['Review'])

# Train 4-gram model
classifier_4gram = MultinomialNB(alpha=1.0)
classifier_4gram.fit(X_train_4gram, train_data['Label'])

# Test 4-gram model
y_pred_4gram = classifier_4gram.predict(X_test_4gram)
accuracy_4gram = accuracy_score(test_data['Label'], y_pred_4gram)

print(f"4-gram model accuracy: {accuracy_4gram:.4f} ({accuracy_4gram*100:.2f}%)")

Trigram model accuracy: 0.7700 (77.00%)
4-gram model accuracy: 0.6500 (65.00%)
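
A common explanation for this drop is data sparsity: the longer the n-gram, the less likely it is that an n-gram in a test review also occurred in the training reviews. The sketch below measures that overlap on made-up toy sentences. The helpers `ngram_set` and `coverage` are hypothetical names, and the numbers say nothing about the actual movie dataset.

```python
def ngram_set(text, n):
    """Set of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return set(zip(*(tokens[i:] for i in range(n))))

def coverage(train_texts, test_texts, n):
    """Fraction of distinct test n-grams that also occur in the training texts."""
    seen = set().union(*(ngram_set(t, n) for t in train_texts))
    test = set().union(*(ngram_set(t, n) for t in test_texts))
    return len(test & seen) / len(test) if test else 0.0

# Toy sentences, for illustration only.
train = ["the movie was really good", "the plot was really bad"]
test = ["the movie plot was good"]

for n in (1, 2, 3):
    print(n, coverage(train, test, n))
```

On this toy data every test unigram was seen in training, half of the test bigrams were, and none of the test trigrams were.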


In [None]:
train_texts = train_data['Review']
train_labels = train_data['Label']

test_texts = test_data['Review']
test_labels = test_data['Label']

# Suggestion 1: Pure bigrams with TF-IDF
vectorizer_tfidf = TfidfVectorizer(ngram_range=(2, 2), lowercase=True)
X_train_tfidf = vectorizer_tfidf.fit_transform(train_texts)
X_test_tfidf = vectorizer_tfidf.transform(test_texts)

clf_tfidf = MultinomialNB()
clf_tfidf.fit(X_train_tfidf, train_labels)
pred_tfidf = clf_tfidf.predict(X_test_tfidf)
acc_tfidf = accuracy_score(test_labels, pred_tfidf)
print(f"Accuracy for pure bigrams with TF-IDF: {acc_tfidf:.4f}")

# Suggestion 2: Mixed unigrams + bigrams with counts
vectorizer_mixed = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X_train_mixed = vectorizer_mixed.fit_transform(train_texts)
X_test_mixed = vectorizer_mixed.transform(test_texts)

clf_mixed = MultinomialNB()
clf_mixed.fit(X_train_mixed, train_labels)
pred_mixed = clf_mixed.predict(X_test_mixed)
acc_mixed = accuracy_score(test_labels, pred_mixed)
print(f"Accuracy for mixed unigrams + bigrams: {acc_mixed:.4f}")


Accuracy for pure bigrams with TF-IDF: 0.8500
Accuracy for mixed unigrams + bigrams: 0.8400
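
For reference, TF-IDF replaces raw counts with weights that shrink for n-grams occurring in many documents. The sketch below uses the smoothed variant idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document, which to my knowledge matches the documented scikit-learn defaults. It is a simplified illustration, not the library's code. The name `tfidf_rows` is hypothetical and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {tok: count * idf[tok] for tok, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows

docs = [["good", "movie"], ["bad", "movie"]]
rows = tfidf_rows(docs)
print(rows[0])
```

In the toy example "movie" occurs in both documents, so it receives a lower weight than "good" or "bad".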
