# Text Classification
> # Sentiment Analysis of IMDB Movie Reviews
> Pattern Recognition Course Project #1 @<a href="http://www.iust.ac.ir/">IUST</a>

<br>
Hosein Mohebbi<br>
hosein_mohebbi@comp.iust.ac.ir


# Project Overview

>Text classification is used in many interesting applications such as spam detection in email, fake news detection, and sentiment analysis. The main aim of this project is to examine different base classifiers on the sentiment analysis task, treated as a text classification problem. Sentiment analysis aims to estimate the sentiment polarity of a body of text based solely on its content. To tackle this problem, we tried every combination of four approaches to word embedding (BOW, BERT, TF-IDF, Word2Vec) and four base classifiers (Naive Bayes, SVM, Decision Tree, Random Forest), together with some text pre-processing techniques, on the IMDB movie reviews dataset, which contains 50K reviews, half positive and half negative. This dataset was compiled by <a href="http://ai.stanford.edu/~amaas/">Andrew Maas</a> and can be found here: <a href="http://ai.stanford.edu/~amaas/data/sentiment/">Large Movie Review Dataset</a>
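
As a rough illustration of the experimental grid described above, the sketch below enumerates the embedding/classifier pairs. The list names and string labels are placeholders chosen for this example and are not identifiers used elsewhere in the notebook.

```python
from itertools import product

# Labels are illustrative only; they mirror the names used in the overview.
embeddings = ["BOW", "BERT", "TF-IDF", "Word2Vec"]
classifiers = ["Naive Bayes", "SVM", "Decision Tree", "Random Forest"]

# Every (embedding, classifier) pair is one experiment.
experiments = list(product(embeddings, classifiers))

for embedding_name, classifier_name in experiments:
    print(f"{embedding_name} + {classifier_name}")
```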



# Downloading & Installing Prerequisites

In [0]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [0]:
!tar -zxvf aclImdb_v1.tar.gz > /dev/null

In [0]:
!pip3 install bert-embedding

In [0]:
!pip3 install mxnet-cu100

In [0]:
!pip3 install sentence-transformers

In [0]:
!ls

# Required Packages

In [0]:
import ipywidgets as widgets
import os
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
from nltk.tag import pos_tag
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import itertools
import mxnet as mx
from bert_embedding import BertEmbedding
from sentence_transformers import SentenceTransformer


# Loading DataSet


> The dataset is divided evenly into a training set and a test set, each containing 12.5K positive and 12.5K negative reviews. The training and test data are loaded into pandas data frames.



In [0]:
def loadDataset(data_dir):
    
    data = {}
    for partition in ["train", "test"]:
        data[partition] = []
        for sentiment in ["neg", "pos"]:
            label = 1 if sentiment == "pos" else -1

            path = os.path.join(data_dir, partition, sentiment)
            files = os.listdir(path)
            for f_name in files:
                # Read as UTF-8 regardless of the platform's default encoding
                with open(os.path.join(path, f_name), "r", encoding="utf-8") as f:
                    review = f.read()
                    data[partition].append([review, label])

    np.random.shuffle(data["train"])
    np.random.shuffle(data["test"])
    
    data["train"] = pd.DataFrame(data["train"],
                                 columns=['text', 'sentiment'])
    data["test"] = pd.DataFrame(data["test"],
                                columns=['text', 'sentiment'])

    return data["train"], data["test"]

In [0]:
data_dir = "aclImdb/"
train_data, test_data = loadDataset(data_dir)
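
The loader relies on a `partition/sentiment/` directory layout. The sketch below is a minimal, self-contained illustration of that layout: it builds a tiny stand-in directory tree in a temporary folder and checks that the labels come out balanced. The helper name `load_reviews`, the file names, and the review texts are hypothetical; only the directory structure and the -1/1 labels follow the notebook.

```python
import os
import tempfile
from collections import Counter

def load_reviews(data_dir, partition):
    """Return (text, label) pairs for one partition; label is 1 for pos, -1 for neg."""
    rows = []
    for sentiment, label in (("neg", -1), ("pos", 1)):
        folder = os.path.join(data_dir, partition, sentiment)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                rows.append((f.read(), label))
    return rows

# Build a miniature stand-in for aclImdb/train with two reviews per class.
with tempfile.TemporaryDirectory() as root:
    for sentiment, word in (("neg", "dull"), ("pos", "great")):
        folder = os.path.join(root, "train", sentiment)
        os.makedirs(folder)
        for i in range(2):
            with open(os.path.join(folder, f"{i}.txt"), "w", encoding="utf-8") as f:
                f.write(f"{word} film {i}")
    rows = load_reviews(root, "train")

label_counts = Counter(label for _, label in rows)
print(label_counts)
```

On the real dataset the same count should show 12.5K reviews per label in each partition.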

Here are the first 5 rows of the training data:

In [6]:
# Debugging
train_data.head()

| | text | sentiment |
|---|---|---|
| 0 | Lynne Ramsey makes arresting images, and Saman... | -1 |
| 1 | "I went to the movies, to see 'Beat Street' / ... | -1 |
| 2 | Well. this was not a surprise. many people wil... | -1 |
| 3 | Pyare Mohan can be safely included in the blac... | -1 |
| 4 | If you're looking to be either offended or amu... | -1 |


# Cleaning Dataset
> Since this dataset was scraped from the web, some HTML markup is mixed into the reviews, so the texts must be cleaned by removing HTML tags. It is also beneficial to remove numbers, punctuation, and stop words; to expand negative contractions such as won't into their full forms; to split hyphenated compounds such as state-of-the-art; and to normalize the text by lowercasing it.

> To remove stop words, the NLTK stop word set is used. However, some words that carry a negative meaning, such as not or nor, are removed from the set, and some contraction fragments such as 're or 'm are added to it.

> Because the BERT embedding was trained on Wikipedia data, for this model we keep some punctuation marks that may lead to a more reliable embedding, namely [,/():;]. We also keep !, ?, and . so that sentence boundaries can be detected later (when applying BERT to each sentence).

> Lemmatization according to the POS tag of each word, followed by stemming, is applied for all embeddings except BERT.

> Finally, runs of whitespace are collapsed into a single space.
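
The regex-based part of these cleaning steps can be sketched in isolation as follows. This is a simplified illustration, not the notebook's `cleanText`: the function name `basic_clean` and the sample sentence are made up for the example, and stop-word removal, the BERT-specific punctuation handling, and lemmatization are left out.

```python
import re

def basic_clean(text):
    text = re.sub(r"<.*?>", " ", text)         # strip HTML tags such as <br />
    text = re.sub(r"[0-9]+", " ", text)        # drop numbers
    text = text.replace("-", " ")              # split hyphenated compounds
    text = re.sub(r"won't", "will not", text)  # expand negative contractions
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # replace remaining punctuation
    return " ".join(text.split())              # collapse whitespace

sample = "I won't lie:<br />this state-of-the-art film isn't worth 10 dollars!"
cleaned = basic_clean(sample)
print(cleaned)
# i will not lie this state of the art film is not worth dollars
```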


In [0]:
# POS Tagging
def NormalizeWithPOS(word_list):
  
    lemmatizer = WordNetLemmatizer() 
    stemmer = PorterStemmer() 
    for word, tag in pos_tag(word_list):
        if tag.startswith('J'):
            w = lemmatizer.lemmatize(word, pos='a')
        elif tag.startswith('V'):
            w = lemmatizer.lemmatize(word, pos='v')
        elif tag.startswith('N'):
            w = lemmatizer.lemmatize(word, pos='n')
        elif tag.startswith('R'):
            w = lemmatizer.lemmatize(word, pos='r')
        else:
            w = word
        w = stemmer.stem(w)
        yield w
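
The branch chain in `NormalizeWithPOS` maps the first letter of a Penn Treebank tag to a WordNet part-of-speech code. A minimal sketch of the same mapping as a lookup table is shown below; the helper name `wordnet_pos` is hypothetical and the snippet does not call NLTK.

```python
# First letter of the Penn Treebank tag -> WordNet POS code
TAG_PREFIX_TO_WORDNET = {
    "J": "a",  # adjective
    "V": "v",  # verb
    "N": "n",  # noun
    "R": "r",  # adverb
}

def wordnet_pos(treebank_tag):
    """Return the WordNet POS code for a tag, or None if the word is left as-is."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1])

print(wordnet_pos("VBD"), wordnet_pos("JJ"), wordnet_pos("DT"))
```

A `None` result corresponds to the `else` branch above, where the word is passed to the stemmer without lemmatization.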

In [0]:
def cleanText(text):
    
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r"[0-9]+", ' ', text) # TODO: save 0-10 for IMDB rating
    text = re.sub(r"-", ' ', text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    
    text = text.strip().lower()
    
    if embedding in ('BOW', 'TFIDF_NO_STOP', 'Word2Vec_NO_STOP'):
        # Remove Stop words
        default_stop_words = set(stopwords.words('english'))
        default_stop_words.difference_update({'no', 'not', 'nor', 'too', 'any'})
        stop_words = default_stop_words.union({"'m", "n't", "'d", "'re", "'s",
                                               'would','must',"'ve","'ll",'may'})
    
        word_list = word_tokenize(text)
        filtered_list = [w for w in word_list if w not in stop_words]
        text = ' '.join(filtered_list)
    
    # Remove other contractions
    text = re.sub(r"'", ' ', text)
    
    # Replace punctuations with space
    if embedding == 'BERT': # keep ! ? . (sentence ends) and , / ( ) : ;
        filters='"\'#$%&*+-<=>@[\\]^_`{|}~\t\n'
    else:
        filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((i, " ") for i in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)
    
    text = ' '.join([w for w in text.split() if len(w)>1])
    # Replace multiple space with one space
    text = re.sub(' +', ' ', text)
    
    # Lemmatization & Stemming 
    if embedding != 'BERT':
        text = ' '.join(NormalizeWithPOS(word_tokenize(text)))
  
    return text
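
The stop-word adjustment inside `cleanText` can be illustrated without NLTK. In the sketch below, `BASE_STOP_WORDS` is a small hand-written stand-in for `stopwords.words('english')` (an assumption made so the example runs without downloads); the kept and added words are the ones used in the function above.

```python
# Tiny stand-in for NLTK's English stop word list (assumption for this example)
BASE_STOP_WORDS = {"i", "a", "the", "is", "this", "of", "was", "no", "not", "nor"}

# Negation-bearing words are kept in the text ...
KEEP = {"no", "not", "nor", "too", "any"}
# ... while contraction fragments and some modals are treated as stop words.
EXTRA = {"'m", "n't", "'d", "'re", "'s", "would", "must", "'ve", "'ll", "may"}

stop_words = (BASE_STOP_WORDS - KEEP) | EXTRA

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in stop_words)

filtered = remove_stop_words("this is not a film i would recommend")
print(filtered)
# not film recommend
```

Keeping "not" matters for sentiment: without it, "not ... recommend" and "recommend" would produce the same tokens.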

In [8]:
# Debugging
embedding = 'BOW'
txt = train_data.iloc[10]['text']
print("A review example of dataset before cleaning:")
print(txt, end="\n\n")
print("After cleaning:")
print(cleanText(txt))

A review example of dataset before cleaning:
Hood of the Living Dead and all of the other movies these guys directed look like they got together and filmed this with their buddies who have zero talent one afternoon when they were bored (lines are completely unrehearsed and unconvincing). I find that 95% of amateur movies and 90% of home video footage is better than this film (although the similarities between them warrant the comparison). "Hey lets see if anyone is dumb enough to buy our movies!". Hopefully nobody ELSE wasn't. My apologies to those involved in the flic as this review is somewhat harsh but i was the dope who read your fake reviews and purchased the movie.

After cleaning:
hood live dead movi guy direct look like get togeth film buddi zero talent one afternoon bore line complet unrehears unconvinc find amateur movi home video footag well film although similar warrant comparison hey let see anyon dumb enough buy movi hope nobodi els not apolog involv flic review somewhat 

# Vectorization

> To feed our data to the classifiers, we need to convert each review into numeric features; this is vectorization.

> <b>Bag Of Words (BOW):</b> In this approach, we build a list of all the unique words in the training data, called the vocabulary. Then, given an input text, the model outputs a numerical vector that counts the occurrences of each vocabulary word.<br> For example:<br>
Training data:<br> ['It was the best of times', 'It was the worst of times']
<br>=><br> Vocabulary: ['It', 'was', 'the', 'best', 'of', 'times', 'worst']
<br>Test data:<br> 'It was the age of the wisdom' ->
[1,1,2,0,1,0,0]<br>
As we can see, the values for 'best', 'times', and 'worst' are 0 because these words do not appear in the test data, while 'age' and 'wisdom' are ignored because they are not in the vocabulary.
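
The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch of the counting idea only; the helper name `bow_vector` is made up, and tokenization is a simple whitespace split.

```python
from collections import Counter

train = ["It was the best of times", "It was the worst of times"]

# Vocabulary: unique words in order of first appearance
vocabulary = []
for doc in train:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

def bow_vector(text):
    counts = Counter(text.split())
    # Words outside the vocabulary ('age', 'wisdom') are simply ignored
    return [counts[word] for word in vocabulary]

vector = bow_vector("It was the age of the wisdom")
print(vocabulary)  # ['It', 'was', 'the', 'best', 'of', 'times', 'worst']
print(vector)      # [1, 1, 2, 0, 1, 0, 0]
```

scikit-learn's `CountVectorizer` follows the same principle, but by default it lowercases the text and orders the vocabulary alphabetically, so its column order differs from this sketch.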

> To use BOW vectorization in Python, we can use the CountVectorizer class from the scikit-learn library. We pass our custom text-cleaning function as its preprocessor to remove uninformative words from the training data, thereby reducing the size of the BOW vectors to 49812.

> <b>Bidirectional Encoder Representations from Transformers (<a href="https://arxiv.org/abs/1810.04805">BERT</a>):</b>
BERT, published by Google, is a pre-trained language model that represents each word as a vector of size 768.<br>

> In this project, we used the BERT embedding in two ways and report the results independently. In the first, we tokenized each review into words, computed the BERT embedding of each word, and averaged all the word vectors to obtain the review representation. In the second, we tokenized each review into sentences and averaged the BERT embeddings of the sentences for each review.
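
Both variants reduce a list of vectors (one per word or one per sentence) to a single review vector by averaging each dimension. A minimal sketch of that pooling step is shown below, using short made-up 3-dimensional vectors in place of real 768-dimensional BERT outputs; the name `mean_pool` is hypothetical.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    return [sum(column) / len(column) for column in zip(*vectors)]

# Stand-ins for the embeddings of two tokens (or two sentences)
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]

review_vector = mean_pool(token_vectors)
print(review_vector)  # [2.0, 3.0, 4.0]
```

The pooled vector has the same dimensionality as the inputs, so every review ends up with a fixed-size feature vector regardless of its length.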

> To obtain BERT embeddings for these two purposes, we used the following two libraries, respectively: <br>
<a href= "https://pypi.org/project/bert-embedding/">bert-embedding 1.0.1</a> <br>
<a href= "https://github.com/UKPLab/sentence-transformers">Sentence Transformers</a>

> <b>TFIDF:</b>

> <b>Word2Vec</b>

# BOW

In [0]:
embedding = 'BOW'
# , max_features=30000
vectorizer = CountVectorizer(preprocessor=cleanText)

bow_training_features = vectorizer.fit_transform(train_data["text"])    
bow_test_features = vectorizer.transform(test_data["text"])

# TF-IDF

> With Stop Words



In [0]:
embedding = 'TFIDF_WITH_STOP'
vectorizer = TfidfVectorizer(preprocessor=cleanText)

tfidf_training_features = vectorizer.fit_transform(train_data["text"])    
tfidf_test_features = vectorizer.transform(test_data["text"])

# TF-IDF

> Without Stop Words



In [0]:
embedding = 'TFIDF_NO_STOP'
vectorizer = TfidfVectorizer(preprocessor=cleanText)

tfidf_NO_STOP_training_features = vectorizer.fit_transform(train_data["text"])    
tfidf_NO_STOP_test_features = vectorizer.transform(test_data["text"])
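
For intuition about what the TF-IDF features contain, the sketch below computes TF-IDF weights by hand on three tiny made-up documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn applies with its default `TfidfVectorizer` settings; the documents and the helper name `tfidf` are illustrative only.

```python
import math
from collections import Counter

docs = [["good", "film"], ["bad", "film"], ["good", "good", "plot"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc):
    tf = Counter(doc)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}  # L2-normalized

vec = tfidf(docs[0])
print(vec)
```

Terms that occur in fewer documents ("bad", "plot") receive a larger IDF than terms that occur in more of them ("good", "film").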

# BERT

In [0]:
# Cleaning before BERT
embedding = 'BERT'

train_data['clean_text'] = train_data['text'].apply(cleanText)
test_data['clean_text'] = test_data['text'].apply(cleanText)


(BERT) Word Tokenization Version



In [0]:
def mean(z):
    return sum(itertools.chain(z))/len(z)

In [0]:
def embeddToBERT(text):
    sentences = re.split(r'[!?.]', text)  # split on sentence-ending punctuation
    sentences = list(filter(None, sentences)) 

    if bert_version == 'WORD':
        result = bert(sentences, 'avg') # 'avg' selects how out-of-vocabulary tokens are handled
    
        bert_vocabs_of_sentence = []
        for sentence in range(len(result)):
            for word in range(len(result[sentence][1])):
                bert_vocabs_of_sentence.append(result[sentence][1][word])
        feature = [mean(x) for x in zip(*bert_vocabs_of_sentence)]

    elif bert_version == 'SENTENCE':
        result = bert_transformers.encode(sentences)
        feature = [mean(x) for x in zip(*result)]
  
    return feature
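
The sentence-splitting step at the top of `embeddToBERT` can be tried on its own. The sketch below splits on the three end-of-sentence marks kept during cleaning; unlike the notebook code, which only filters out empty strings, this hypothetical helper `split_sentences` also strips surrounding whitespace and drops whitespace-only fragments.

```python
import re

def split_sentences(text):
    """Split text on ! ? . and return the non-empty, stripped pieces."""
    pieces = re.split(r"[!?.]", text)
    return [piece.strip() for piece in pieces if piece.strip()]

parts = split_sentences("great cast! was it worth it? yes.")
print(parts)  # ['great cast', 'was it worth it', 'yes']
```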

In [0]:
ctx = mx.gpu(0)
bert = BertEmbedding(ctx=ctx)

In [0]:
bert_version = 'WORD'
bert_word_training_features = train_data['clean_text'].apply(embeddToBERT)
bert_word_test_features = test_data['clean_text'].apply(embeddToBERT)

In [0]:
feature = [x for x in bert_word_training_features.transpose()]
bert_word_training_features = np.asarray(feature)

feature = [x for x in bert_word_test_features.transpose()]
bert_word_test_features = np.asarray(feature)

print(bert_word_training_features.shape)

(BERT) Sentence Tokenization Version



In [0]:
bert_transformers = SentenceTransformer('bert-base-nli-mean-tokens')

In [0]:
bert_version = 'SENTENCE'
bert_sentence_training_features = train_data['clean_text'].apply(embeddToBERT)
bert_sentence_test_features = test_data['clean_text'].apply(embeddToBERT)

In [0]:
feature = [x for x in bert_sentence_training_features.transpose()]
bert_sentence_training_features = np.asarray(feature)

feature = [x for x in bert_sentence_test_features.transpose()]
bert_sentence_test_features = np.asarray(feature)

print(bert_sentence_training_features.shape)

# Classifiers

# Model & Training & Evaluation

In [0]:
# SVM 
model = SVC(kernel ='linear', C = 1)

# Training 
model.fit(bow_training_features, train_data["sentiment"])

# Evaluation
y_pred = model.predict(bow_test_features)

In [0]:
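# Naive Bayes classifier
# GaussianNB needs dense arrays, so the sparse BOW features are converted
# with toarray() (the dense copies are large)
training_features = bow_training_features.toarray()
test_features = bow_test_features.toarray()
model = GaussianNB()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)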

In [0]:
# Decision tree classifier
model = DecisionTreeClassifier()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

# Result

In [0]:
acc = accuracy_score(test_data["sentiment"], y_pred)
# Result
print("Accuracy: {:.2f}".format(acc*100))
cm = confusion_matrix(test_data["sentiment"],y_pred)
print(cm)
print(classification_report(test_data["sentiment"],y_pred))

Accuracy: 75.64
[[10032  2468]
 [ 3623  8877]]
              precision    recall  f1-score   support

          -1       0.73      0.80      0.77     12500
           1       0.78      0.71      0.74     12500

    accuracy                           0.76     25000
   macro avg       0.76      0.76      0.76     25000
weighted avg       0.76      0.76      0.76     25000
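
The figures in the report can be checked by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, with rows as true labels and columns as predicted labels, both ordered (-1, 1); the variable names are chosen for this example.

```python
# Confusion matrix from the output above
cm = [[10032, 2468],
      [3623, 8877]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Metrics for the positive class (label 1)
true_pos = cm[1][1]
false_pos = cm[0][1]
false_neg = cm[1][0]
precision_pos = true_pos / (true_pos + false_pos)
recall_pos = true_pos / (true_pos + false_neg)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print("Accuracy: {:.2f}".format(accuracy * 100))  # Accuracy: 75.64
print("Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}".format(
    precision_pos, recall_pos, f1_pos))           # 0.78, 0.71, 0.74
```

The values agree with the row for label 1 in the classification report.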

