## Movie Reviews Sentiment Analysis

**Outline**

* [Introduction and dataset](#data)
* [Feature/Model variations](#model)
   * [M1 - Unigrams (absence/presence)](#uni)
   * [M2 - Unigrams with frequency count](#uni_multi)
   * [M3 - Unigrams (only adjectives/adverbs)](#adjadv)
   * [M4 - Unigrams (sublinear tf-idf), apply stopword removal](#tfidf)
      * [tf-idf introduction](#tfidf_deets)
   * [M5 - Bigrams (absence/presence)](#bigram)
* [Model learnings](#summary)
* [References](#ref)

In [66]:
import os, glob
import numpy as np
import string, re

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer #leaving only the word stem
from nltk import pos_tag
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text

from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, accuracy_score, make_scorer

## <a id="data">Introduction and Dataset</a>

The goal of this project is to classify movie review sentiment (positive or negative) using multinomial Naive Bayes, experimenting with different feature models and preprocessing techniques along the way to evaluate their performance and compare the pros and cons of each method.

**Steps included:**
* Data read in
* Split into train/test before vectorization
* Clean corpus based on language model specs and apply additional techniques such as stop word removal, lemmatization as needed for best performance
* Train model and predict on test set
* Calculate accuracy score and compare

Here's a graph from [Text Analytics for NLTK Beginners](https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk) that helps illustrate the process:

![Text Classification Process Flow](img/ta_flow_chart.png)

**Data read-in**

In [20]:
# read a pos/neg dir
# each document is a review
corpus_folder = './data/review_polarity/txt_sentoken/'

# Function to read one document
def readFile(file_path):
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        tokenized_words = f.read().split()
        return tokenized_words

def readDir(senti_folder, pattern, top_doc_num):
    """Read up to top_doc_num documents from a sentiment folder.
    
    Args:
        senti_folder: sub-folder of corpus_folder to read from ('pos' or 'neg')
        pattern: file name prefix in glob syntax, e.g. 'cv[0-8]'
        top_doc_num: maximum number of matching files to read
    Returns:
        A list of documents, each a list of tokens.
    """
    file_list = []
    path_pattern = os.path.join(corpus_folder, senti_folder, pattern + '*.txt')
    all_txt_paths = glob.glob(path_pattern)
    for file_path in all_txt_paths[:top_doc_num]:
        # print(file_path)
        word_List = readFile(file_path)
        # print(word_List)
        file_list.append(word_List)
    return file_list

#Read in train, test lists for pos and neg reviews, create train and test labels
train_pos_file_list = readDir('pos', 'cv[0-8]', top_doc_num = 900)
train_neg_file_list = readDir('neg', 'cv[0-8]', top_doc_num = 900)
train_pos_labels = [1 for i in range(len(train_pos_file_list))]
train_neg_labels = [0 for i in range(len(train_neg_file_list))]

test_pos_file_list = readDir('pos', 'cv9', top_doc_num = 1000)
test_neg_file_list = readDir('neg', 'cv9', top_doc_num = 1000)
test_pos_labels = [1 for i in range(len(test_pos_file_list))]
test_neg_labels = [0 for i in range(len(test_neg_file_list))]

train_file_list = train_pos_file_list + train_neg_file_list
test_file_list = test_pos_file_list + test_neg_file_list

train_labels = train_pos_labels + train_neg_labels
test_labels = test_pos_labels + test_neg_labels

## <a id="model">Feature/Model variations</a>
**M1 - Unigrams (absence/presence)/Bernoulli Naive Bayes**
* training corpus: for each review, the set of unique tokens that appear in it
* technique: Stop words removal, Porter Stemmer

**M2 - Unigrams with frequency count**
* training corpus: tokenized entire vocabulary
* technique: Porter Stemmer

**M3 - Unigrams (only adjectives/adverbs)**
* training corpus: tagged adjectives and adverbs only
* technique: Part of Speech (POS) Tagging

**M4 - Unigrams (sublinear tf-idf), apply stopword removal**
* training corpus: tokenized entire vocabulary
* technique: Porter Stemmer

**M5 - Bigrams (absence/presence)**
* training corpus: tokenized entire vocabulary
* technique: Porter Stemmer
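
To make the difference between the absence/presence features (M1) and the frequency features (M2) concrete, here is a minimal sketch on a made-up review. The token list is hypothetical and assumed to be already stemmed; it is an illustration, not the code used in the notebook.

```python
from collections import Counter

# Hypothetical, already-stemmed tokens from one short review.
tokens = ["good", "plot", "good", "act", "bad", "end", "good"]

# M1-style features: presence/absence, so each distinct token counts once.
presence = {token: 1 for token in set(tokens)}

# M2-style features: raw frequency of each token.
counts = Counter(tokens)

print(sorted(presence.items()))
print(sorted(counts.items()))
```

Both representations share the same vocabulary; they differ only in how much weight a repeated word such as "good" receives.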

In [50]:
def clean_corpus(tokenized, model):
    """Clean one tokenized review and keep the words/features required by the model specification.
    
    Args:
        tokenized: a single review, as a list of tokens
        model: model type ("m1" to "m5")
    Returns:
        The cleaned review as a single space-joined string.
    """
    #drop tokens that consist only of punctuation
    punctuation_free = [x for x in tokenized if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]
    
    #Apply porter stemmer to selected models to only keep word stem
    ps = PorterStemmer()
    stemmed = [ps.stem(word) for word in punctuation_free]
    
    #unigrams with absence/presence
    if model == "m1":
        unique_stemmed = set(stemmed)
        return ' '.join(unique_stemmed)
    
    #Unigrams with only adjectives/adverbs
    elif model == "m3":
    #else:
        tags = ['JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS']
        all_tags = pos_tag(punctuation_free)
        result = [word[0] for word in all_tags if word[1] in tags]
        result2 = ' '.join(result)
        #print("output results for m3")
        return result2
    
    #Stemmed tokens for the frequency, tf-idf and bigram models
    elif model in ("m2", "m4", "m5"):
        #print("output results for m2/4")
        return ' '.join(stemmed)
    else:
        raise ValueError("unknown model: " + str(model))
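
The punctuation step in `clean_corpus` can be tried in isolation. The sketch below assumes the same idea (drop tokens made up entirely of punctuation) but uses a hypothetical helper name, `drop_punctuation_tokens`, and a made-up token list; stemming is left out so that the example needs only the standard library.

```python
import re
import string

# Matches tokens made up entirely of ASCII punctuation characters.
PUNCT_ONLY = re.compile("[" + re.escape(string.punctuation) + "]+")

def drop_punctuation_tokens(tokens):
    """Keep every token that contains at least one non-punctuation character."""
    return [t for t in tokens if not PUNCT_ONLY.fullmatch(t)]

sample = ["a", "well-made", "film", ",", "really", "...", "(", "2001", ")"]
print(drop_punctuation_tokens(sample))
```

Tokens such as "well-made" survive because `fullmatch` only rejects tokens that are punctuation from start to end.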

In [98]:
def model_train_predict(train_clean, test_clean):
    """Vectorize the cleaned reviews, fit MultinomialNB and score it on the test set.

    Relies on the module-level train_labels and test_labels.
    """
    #extend the stop word list with film, movie, etc., which carry no sentiment signal
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["movie", "film", "movi", "hi"]))
    
    #Create features
    vectorizer = CountVectorizer(lowercase=True,stop_words=my_stop_words) #count word frequency
    train_features = vectorizer.fit_transform(train_clean)
    test_features = vectorizer.transform(test_clean)

    nb_clf = MultinomialNB()
    
    # Fit model and predict on test features
    nb_clf.fit(train_features, train_labels)
    predictions = nb_clf.predict(test_features)

    accuracy = accuracy_score(test_labels, predictions)
    return vectorizer, nb_clf, accuracy

In [94]:
def show_most_informative_features(vectorizer, classifier, n=5):
    class_labels = classifier.classes_
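    # get_feature_names_out() is the scikit-learn >= 1.0 name; older releases used get_feature_names()
    feature_names = vectorizer.get_feature_names_out()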
    topn_pos_class = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_neg_class = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]    

    print("Important words in positive reviews")
    for coef, feature in topn_pos_class:
        print(class_labels[1], coef, feature) 
    print("-----------------------------------------")
    print("Important words in negative reviews")
    for coef, feature in topn_neg_class:
        print(class_labels[0], coef, feature)  
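
Note that `feature_count_` ranks words by how often they occur in each class, so very common stems tend to top both lists. One common alternative, sketched here under stated assumptions, is to rank words by the smoothed log ratio of their per-class probabilities. The counts below are hypothetical (not taken from the corpus) and `log_ratio_ranking` is an illustrative helper, not part of scikit-learn; with a fitted `MultinomialNB`, the per-class log probabilities are available as `feature_log_prob_`.

```python
import math

# Hypothetical per-class token counts (illustrative, not from the corpus).
pos_counts = {"thi": 850, "like": 680, "brilliant": 40, "bad": 30}
neg_counts = {"thi": 860, "like": 710, "brilliant": 5, "bad": 120}

def log_ratio_ranking(pos, neg, alpha=1.0):
    """Rank tokens by log P(token | pos) - log P(token | neg), Laplace-smoothed."""
    vocab = sorted(set(pos) | set(neg))
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    scores = {}
    for token in vocab:
        p_pos = (pos.get(token, 0) + alpha) / pos_total
        p_neg = (neg.get(token, 0) + alpha) / neg_total
        scores[token] = math.log(p_pos) - math.log(p_neg)
    # Highest score first: most strongly associated with the positive class.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = log_ratio_ranking(pos_counts, neg_counts)
print("most positive:", ranking[0][0])
print("most negative:", ranking[-1][0])
```

Words that are frequent in both classes end up near zero, while words that separate the classes move to the two ends of the ranking.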

## <a id="uni">M1 - Unigrams (absence/presence)</a>

In [99]:
# cleans each individual review and keeps word features based on model specification 
train_file_list_clean = [clean_corpus(doc, "m1") for doc in train_file_list]
test_file_list_clean = [clean_corpus(doc,"m1") for doc in test_file_list]

# train once and unpack the vectorizer, the classifier and the accuracy
m1_vector, m1, m1_accuracy = model_train_predict(train_file_list_clean, test_file_list_clean)

In [100]:
print("m1_accuracy is: ",m1_accuracy,)
show_most_informative_features(m1_vector, m1)

m1_accuracy is:  0.865
Important words in positive reviews
1 852.0 thi
1 743.0 ha
1 677.0 like
1 668.0 make
1 666.0 time
-----------------------------------------
Important words in negative reviews
0 864.0 thi
0 716.0 ha
0 714.0 like
0 690.0 wa
0 648.0 make


## <a id="uni_multi">M2 - Unigrams with frequency count</a>

In [108]:
# cleans each individual review and keeps word features based on model specification 
train_file_list_clean = [clean_corpus(doc, "m2") for doc in train_file_list]
test_file_list_clean = [clean_corpus(doc,"m2") for doc in test_file_list]

# train once and unpack the vectorizer, the classifier and the accuracy
m2_vector, m2, m2_accuracy = model_train_predict(train_file_list_clean, test_file_list_clean)

In [109]:
print(m2_accuracy)

show_most_informative_features(m2_vector, m2)

0.825
Important words in positive reviews
1 4158.0 thi
1 2329.0 ha
1 2224.0 wa
1 1758.0 like
1 1752.0 charact
-----------------------------------------
Important words in negative reviews
0 4420.0 thi
0 2199.0 wa
0 1974.0 ha
0 1842.0 like
0 1568.0 charact


## <a id="adjadv">M3 - Unigrams (only adjectives/adverbs)</a>

In [110]:
# cleans each individual review and keeps word features based on model specification 
train_file_list_clean = [clean_corpus(doc, "m3") for doc in train_file_list]
test_file_list_clean = [clean_corpus(doc,"m3") for doc in test_file_list]

# train once and unpack the vectorizer, the classifier and the accuracy
m3_vector, m3, m3_accuracy = model_train_predict(train_file_list_clean, test_file_list_clean)

In [111]:
print(m3_accuracy)

show_most_informative_features(m3_vector, m3)

0.85
Important words in positive reviews
1 1199.0 just
1 1099.0 good
1 729.0 best
1 693.0 really
1 691.0 little
-----------------------------------------
Important words in negative reviews
0 1391.0 just
0 1012.0 good
0 928.0 bad
0 709.0 really
0 660.0 little


## <a id="tfidf">M4 - Unigrams (sublinear tf-idf), apply stopword removal</a>

In [112]:
train_file_list_clean = [clean_corpus(doc, "m4") for doc in train_file_list]
test_file_list_clean = [clean_corpus(doc,"m4") for doc in test_file_list]

#extend the stop word list with film, movie, etc., which carry no sentiment signal
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["movie", "film", "movi", "hi"]))

# Create features for TF-IDF
# min_df: ignore terms whose document frequency is strictly lower than this value (default=1)
# max_df: ignore terms that appear in more than this fraction of documents, here 80% (default=1.0)
vectorizer = TfidfVectorizer(min_df = 2, max_df = 0.8, stop_words=my_stop_words, sublinear_tf=True) 
train_features = vectorizer.fit_transform([doc for doc in train_file_list_clean])
test_features = vectorizer.transform([doc for doc in test_file_list_clean])
nb_clf = MultinomialNB()

# Fit model and predict on test features
nb_clf.fit(train_features, train_labels)
predictions = nb_clf.predict(test_features)


In [113]:
accuracy = accuracy_score(test_labels, predictions)
print(accuracy)

show_most_informative_features(vectorizer, nb_clf)

0.84
Important words in positive reviews
1 21.422062346052442 wa
1 19.37887511027754 charact
1 19.269493613295754 like
1 18.533446429908274 make
1 18.14754763977736 time
-----------------------------------------
Important words in negative reviews
0 23.292004621773454 wa
0 21.930512105992694 like
0 19.969588038385954 charact
0 19.688306525881615 just
0 18.638530042001538 bad


## <a id="tfidf_deets">tf-idf introduction</a>

**What is [Tf-idf](http://www.tfidf.com/)?**

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. 

Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

**TF: Term Frequency**

TF measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization.

$$TF(t) = \frac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$$

**IDF: Inverse Document Frequency**

IDF measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$$IDF(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
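
These are the textbook definitions. As far as I understand its documented defaults, scikit-learn's `TfidfVectorizer` (used for M4) differs in a few ways: the term frequency is the raw count rather than a length-normalized ratio, `sublinear_tf=True` replaces that count with `1 + ln(count)`, the default smoothed IDF is `ln((1 + n) / (1 + df)) + 1`, and each document vector is then L2-normalized. The sketch below reproduces that arithmetic in plain Python on hypothetical toy numbers; the helper names are illustrative and not part of scikit-learn.

```python
import math

def sublinear_tf(raw_count):
    """Sublinear scaling: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def l2_normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

# Hypothetical toy numbers: 4 documents, and the raw counts of one document.
n_docs = 4
doc_freq = {"plot": 4, "brilliant": 1}
raw_counts = {"plot": 5, "brilliant": 1}

weights = {t: sublinear_tf(c) * smoothed_idf(n_docs, doc_freq[t])
           for t, c in raw_counts.items()}
normalized = l2_normalize(weights)
print(normalized)
```

The sublinear scaling keeps a word that appears five times from counting five times as much as a word that appears once, and the `+ 1` in the IDF keeps words that occur in every document from being zeroed out.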

**Example**

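Consider a document containing 100 words wherein the word **cat** appears 3 times: $tf = 3 / 100 = 0.03$.

Now suppose we have 10 million documents and the word **cat** appears in 1,000 of these. Using a base-10 logarithm for round numbers: $idf = \log_{10}(10{,}000{,}000 / 1{,}000) = 4$. (The IDF formula above uses the natural logarithm; the choice of base only rescales every weight by the same constant.)

Thus, the tf-idf weight is the product of these quantities: $0.03 \times 4 = 0.12$.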
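
The arithmetic of this example can be checked in a few lines. This is a small sketch using only the numbers given above; the variable names are my own. It also shows the value obtained with the natural logarithm used in the IDF formula.

```python
import math

term_count, doc_length = 3, 100                  # "cat" appears 3 times in 100 words
total_docs, docs_with_term = 10_000_000, 1_000   # corpus size and document frequency

tf = term_count / doc_length
idf_log10 = math.log10(total_docs / docs_with_term)
idf_ln = math.log(total_docs / docs_with_term)

print("tf-idf with log10:", round(tf * idf_log10, 4))
print("tf-idf with ln:   ", round(tf * idf_ln, 4))
```

The two results differ only by the constant factor ln(10), so the ranking of terms is the same under either base.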

## <a id="bigram">M5 - Bigrams (absence/presence)</a>

In [106]:
# cleans each individual review and keeps word features based on model specification 
train_file_list_clean = [clean_corpus(doc, "m5") for doc in train_file_list]
test_file_list_clean = [clean_corpus(doc,"m5") for doc in test_file_list]

#Create features
vectorizer = CountVectorizer(lowercase=True,ngram_range = (2,2), binary = True) #bigram presence/absence (binary=True), not counts
train_features = vectorizer.fit_transform([doc for doc in train_file_list_clean])
test_features = vectorizer.transform([doc for doc in test_file_list_clean])

nb_clf = MultinomialNB()

# Fit model and predict on test features
nb_clf.fit(train_features, train_labels)
predictions = nb_clf.predict(test_features)

In [107]:
accuracy = accuracy_score(test_labels, predictions)
print(accuracy)

show_most_informative_features(vectorizer, nb_clf)

0.855
Important words in positive reviews
1 845.0 of the
1 787.0 in the
1 684.0 the film
1 657.0 to the
1 635.0 and the
-----------------------------------------
Important words in negative reviews
0 828.0 of the
0 796.0 in the
0 643.0 the film
0 625.0 to be
0 577.0 to the
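
To make the bigram absence/presence features concrete, here is a minimal sketch of how adjacent token pairs can be extracted from one review. The helper `bigram_presence` and the token list are hypothetical; `CountVectorizer(ngram_range=(2,2), binary=True)` does the equivalent work in the notebook.

```python
def bigram_presence(tokens):
    """Return the set of adjacent token pairs (presence only, no counts)."""
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}

# Hypothetical tokens from one short review.
tokens = ["not", "a", "good", "film", "not", "a", "good", "end"]
features = bigram_presence(tokens)
print(sorted(features))
```

Because the M5 vectorizer is created without a stop word list, pairs of function words such as "of the" and "in the" are kept, which is why they dominate the frequency ranking above.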


## <a id="summary">Model learnings</a>

In this case, based on accuracy, the best-performing model is **M1, the unigram absence/presence model with Porter stemming applied**, at 86.5% accuracy. Because each review is reduced to its set of unique stems, the feature vectors are effectively binary: each element records whether a word in the vocabulary is present or not (1 or 0). This resembles a Bernoulli Naive Bayes model, although the classifier is still `MultinomialNB` applied to binary features rather than a true Bernoulli model.

It is followed by **M5, bigrams with Porter stemming applied**, at 85.5% accuracy, and then by **M3, unigrams restricted to adjectives/adverbs**, at 85% accuracy.

A possible reason these models performed best: M1 considers only the absence or presence of each word, so it does not give extra weight to the words that appear most frequently (as M2 does, which performed the worst at 82.5%). M5 uses bigrams to create feature vectors, and two-word phrases may carry more sentiment signal than single words. M3 considers only adjectives and adverbs, which are often more indicative of the sentiment of a review.
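
One caveat on the Bernoulli comparison for M1: a multinomial model fed binary features is close to, but not the same as, a Bernoulli Naive Bayes model. The multinomial likelihood only sums over the words that are present, whereas the Bernoulli likelihood also adds a `log(1 - p)` term for every vocabulary word that is absent. The sketch below illustrates this for a single class with hypothetical parameters; it is not the notebook's fitted model.

```python
import math

# Hypothetical per-class parameters for a 3-word vocabulary (illustrative only).
vocab = ["good", "bad", "plot"]
p_present = {"good": 0.7, "bad": 0.2, "plot": 0.5}   # Bernoulli: P(word present | class)
theta = {"good": 0.5, "bad": 0.15, "plot": 0.35}     # multinomial: P(token is word | class)

present = {"good", "plot"}  # binary features of one review

# Multinomial likelihood on binary features: only present words contribute.
multinomial_ll = sum(math.log(theta[w]) for w in vocab if w in present)

# Bernoulli likelihood: absent words contribute log(1 - p) as well.
bernoulli_ll = sum(
    math.log(p_present[w]) if w in present else math.log(1.0 - p_present[w])
    for w in vocab
)

absence_term = math.log(1.0 - p_present["bad"])
print(round(multinomial_ll, 4), round(bernoulli_ll, 4), round(absence_term, 4))
```

In practice the two often behave similarly on short vocabularies, but the absence terms can matter when a strongly class-specific word is missing from a review.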

## <a id="ref">References</a>
* [Tf-idf](http://www.tfidf.com/)
* [Text Analytics for NLTK Beginners](https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk)