In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import re
import nltk
import string
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://github.com/MehmetFiratKomurcu/IMDBReviewClassification/raw/master/imdb_master.csv", directory="data")
nltk.download(['punkt', 'stopwords', 'wordnet'])

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## The Naïve Bayes Classifier

In this notebook we will have a look at the naïve Bayes classifier. We are going to apply it to a simple natural language analysis task. There is a dataset of movie reviews scraped, along with numeric scores, from the [IMDB website](https://www.imdb.com/). The scores were transformed into two classes, indicating a positive or a negative review. We will show why the naïve Bayes classifier is a good candidate for such tasks in spite of its extremely simplifying assumptions.

### Loading the Data

As the first step, we will load our data from a CSV file. We specify the ISO-8859-1 charset.



In [None]:
df = pd.read_csv("data/imdb_master.csv", encoding="ISO-8859-1")
df.head()

The dataset contains a column dividing the samples into train and test. We will therefore use this column to split the dataset in that predefined way.



In [None]:
df_train = df[df['type'] == 'train']
df_test = df[df['type'] == 'test']

### Preprocessing the Text

Next, since the naïve Bayes classifier cannot handle raw text directly, we need to preprocess each review and eventually turn it into a fixed-size vector. We will go over this process step by step, using one review to illustrate exactly what happens.



In [None]:
review = df_train['review'].iloc[2]
print(review)

Remove any HTML tags using regular expressions.


In [None]:
html_re = re.compile(r"<[^>]*>")
without_tags = html_re.sub(' ', review)
print(without_tags)

Remove punctuation (by replacing all characters in `string.punctuation` with an empty string).


In [None]:
without_punctuation = without_tags

for char in string.punctuation:
    without_punctuation = without_punctuation.replace(char, "")
    
print(without_punctuation)
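
The same result can be obtained in a single pass with `str.translate`. The following is a minimal sketch on a made-up sentence (`sample` is hypothetical and not taken from the dataset); it is an alternative to the loop above, not a replacement the notebook depends on.

```python
import string

# Hypothetical sample sentence, used only for illustration.
sample = "Well... it's a great movie, isn't it?!"

# Build a translation table that deletes every punctuation character.
table = str.maketrans("", "", string.punctuation)
translated = sample.translate(table)

# The character-by-character loop gives the same result.
looped = sample
for char in string.punctuation:
    looped = looped.replace(char, "")

print(translated)
```

Note that contractions such as "it's" collapse into a single token ("its") with either approach, since the apostrophe is simply deleted.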

Transform all to lower case.


In [None]:
lower_case = without_punctuation.lower()
print(lower_case)

Split on whitespace to get the individual words.


In [None]:
words = lower_case.split()
print(words)

Remove stop words (auxiliary words such as "and", "or", "the", etc.) and anything that is not exclusively made of letters.


In [None]:
stop_words = set(stopwords.words('english'))
only_useful_words = [w for w in words if w.isalpha() and w not in stop_words]
print(only_useful_words)

Transform words into canonical forms by lemmatization.


In [None]:
lemmatizer = WordNetLemmatizer()
canonical = [lemmatizer.lemmatize(w) for w in only_useful_words]
print(canonical)

We join the resulting words back together.


In [None]:
preproced_text = " ".join(canonical)
print(preproced_text)

We define a function `preproc_text` with the same logic, which we will apply to each review in the dataset.



In [None]:
#@title -- function preproc_text -- { display-mode: "form" }
html_re = re.compile(r"<[^>]*>")
lemmatizer = WordNetLemmatizer()
# build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def preproc_text(text):
    # remove HTML tags
    text = html_re.sub(' ', text)

    # remove punctuation
    for char in string.punctuation:
        text = text.replace(char, "")

    # transform all to lower case
    text = text.lower()

    # split on whitespace
    words = text.split()

    # filter out anything that is not exclusively
    # made of letters or that is in stop words
    words = [w for w in words if w.isalpha() and w not in stop_words]

    # lemmatize the words, turning them into canonical forms
    canonical = [lemmatizer.lemmatize(w) for w in words]

    # join the words back together
    preproced_text = " ".join(canonical)
    return preproced_text

In [None]:
reviews_train = [preproc_text(text) for text in df_train['review']]
reviews_test = [preproc_text(text) for text in df_test['review']]

### Transforming the Text into a Fixed-Size Vector

#### Bag of Words

Even though we have now done a lot of preprocessing on the text, it is still a string and not a fixed-size vector. So how do we vectorize it? One way is to create a **bag of words** representation: simply count how many times each word occurs in a review. This is what scikit-learn's `CountVectorizer` class does.



In [None]:
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(reviews_train)
bag_of_words.shape

The bag of words is a huge matrix, with one row per review and one column per unique word. This is why we took such care transforming the words into their canonical forms: otherwise the matrix would be several times larger still. Also, given that each review contains only a small fraction of all the possible words, most entries will be zeros. For this reason, the bag of words is stored as a sparse matrix: only the non-zero elements are recorded.
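
To see what such a matrix looks like on a small scale, here is a conceptual sketch in plain Python. The two toy documents are hypothetical, and the dictionaries only imitate the idea of a sparse matrix; `CountVectorizer` additionally handles tokenization and returns a SciPy sparse matrix.

```python
from collections import Counter

# Hypothetical, already preprocessed toy documents.
docs = ["good movie good plot", "bad movie"]

# The vocabulary assigns one column index to each unique word.
vocabulary = sorted({word for doc in docs for word in doc.split()})
column = {word: i for i, word in enumerate(vocabulary)}

# Sparse representation: per document, store only the non-zero counts.
sparse_rows = [
    {column[word]: count for word, count in Counter(doc.split()).items()}
    for doc in docs
]

# Dense representation of the same matrix, for comparison.
dense_rows = [
    [row.get(j, 0) for j in range(len(vocabulary))]
    for row in sparse_rows
]

print(vocabulary)   # column order
print(sparse_rows)  # only non-zero entries
print(dense_rows)   # full rows, zeros included
```

Even in this tiny example the dense rows already contain zeros; with tens of thousands of columns, almost every entry would be zero.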

#### Bag of N-grams

Now, in general, apart from the presence of single words, we often also care about their relative order and about certain combinations of words. To capture this, we can use so-called **n-grams**: we look at words in their n-word contexts and count the occurrences of those instead. Here is how to do that for 2-grams:



In [None]:
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
grams_2 = count_vectorizer.fit_transform(reviews_train)
grams_2.shape

Note how the matrix now has many more columns than before. This is because there are more 2-gram combinations than there are individual words. Thankfully, not all combinations of words occurred in the texts, so the number of 2-grams does not reach the number of words squared.

If we want to keep track of individual words as well as 2-grams, we can also specify `ngram_range=(1, 2)` – which we are going to do.
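
As a rough illustration of what counting n-grams means, the sketch below extracts 1-grams and 2-grams from a short word list. The helper `ngrams` and the sample review are hypothetical; `CountVectorizer` does this internally.

```python
def ngrams(words, n):
    """Return all runs of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical preprocessed review.
words = "great plot twist".split()

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)

# Analogous to ngram_range=(1, 2): single words plus 2-grams.
features = unigrams + bigrams
print(features)
```

A review with k words yields k - 1 two-grams, each of which may become its own column in the matrix.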

#### The TF-IDF

Finally, there are common words (n-grams) that occur in a large number of documents. Intuitively, these are probably going to be less useful when differentiating among classes and we do not want them to have a disproportionately large effect on the predictions. To this end, instead of simple counts, we can compute the **TF-IDF**: the **term-frequency times inverse document-frequency**. Roughly, this down-weights the frequency (number of occurrences) of each term the more documents it appears in.

More precisely, if we denote the frequency (number of occurrences) of term $t$ in document $d$ with $\text{tf}(t, d)$, the inverse document-frequency is defined as [TfidfTransformer](#TfidfTransformer):

$$
\text{idf}(t) = \log \left[ \frac{1 + n}{1 + \text{df}(t)} \right] + 1,
$$
where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain term $t$. The TF-IDF is then simply [TfidfTransformer](#TfidfTransformer):

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t).
$$
To get the TF-IDF in Python, we can simply use scikit-learn's `TfidfVectorizer` instead of `CountVectorizer`. We will now use `TfidfVectorizer` to transform our reviews into `X_train` and `X_test`.



In [None]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

X_train = vectorizer.fit_transform(reviews_train)
Y_train = df_train['label']

X_test = vectorizer.transform(reviews_test)
Y_test = df_test['label']
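
To make the two formulas above concrete, the following sketch evaluates them by hand on a hypothetical three-document corpus. The helper names `df`, `idf` and `tf_idf` are illustrative only. Note that `TfidfVectorizer` by default also normalizes each document vector to unit L2 length, so its output values will not equal these raw products.

```python
import math

# Hypothetical toy corpus of preprocessed documents.
docs = [
    "good movie",
    "bad movie",
    "good good plot",
]
tokenized = [doc.split() for doc in docs]
n = len(tokenized)  # total number of documents

def df(term):
    """Number of documents that contain the term."""
    return sum(term in tokens for tokens in tokenized)

def idf(term):
    """Smoothed inverse document frequency, following the formula above."""
    return math.log((1 + n) / (1 + df(term))) + 1

def tf_idf(term, tokens):
    """Raw term count in one document times the idf of the term."""
    return tokens.count(term) * idf(term)

# "movie" appears in two documents, "plot" in only one,
# so "plot" receives the larger idf weight.
for term in ["movie", "plot"]:
    print(term, df(term), round(idf(term), 3))

print(round(tf_idf("good", tokenized[2]), 3))
```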

### Training a Model

Now that we have our data preprocessed, we can train a model on it. We have seen that the TF-IDF vectors are quite large: we even prefer to store them as sparse matrices. The reason a naïve Bayes classifier is not a bad fit for such a task (in spite of its rather extreme simplifying assumptions) is that it would be considerably more difficult to train a complex model on data of such dimensions. In the past, with less powerful hardware, training a complex model was often not realistic, and even now a simple model may be preferable in some cases, provided its performance is sufficient.

Also, any training method that is meant to train quickly on our dataset needs to support sparse matrices: if it converts them into a dense format, training will take considerably longer. Not every scikit-learn estimator accepts sparse input (`GaussianNB` does not, for instance). However, there are a couple of other simple methods in scikit-learn that do support sparse matrices, such as logistic regression. Those would probably not be wildly more difficult to train on the same data.
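
A back-of-the-envelope calculation illustrates why the dense format is so costly. The sizes below are hypothetical placeholders, not measurements from this dataset; in the notebook, the actual values can be read from `X_train.shape` and `X_train.nnz`. The sparse estimate assumes the CSR layout (values, column indices, row pointers) with 8-byte values and 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to store every entry of the matrix."""
    return n_rows * n_cols * bytes_per_value

def csr_bytes(n_rows, nnz, bytes_per_value=8, bytes_per_index=4):
    """Approximate CSR memory: values, column indices and row pointers."""
    return nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index

# Hypothetical sizes, chosen only for illustration.
n_rows, n_cols = 10_000, 1_000_000
nnz = n_rows * 200  # assume about 200 non-zero entries per document

print(dense_bytes(n_rows, n_cols) / 1e9, "GB dense")
print(csr_bytes(n_rows, nnz) / 1e6, "MB sparse")
```

Under these assumptions the dense matrix would need tens of gigabytes, while the sparse one stays in the tens of megabytes.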

In any case, we are now going to create and train a naïve Bayes classifier using class `MultinomialNB`:



In [None]:
model = MultinomialNB()
model.fit(X_train, Y_train)
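
To give an idea of what happens inside such a model, here is a minimal sketch of the multinomial naïve Bayes idea on raw word counts with add-one (Laplace) smoothing. It is not scikit-learn's implementation: the helpers `fit_nb` and `predict_nb` and the toy data are hypothetical, and the notebook's model is trained on TF-IDF values rather than integer counts.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed per-class word log-likelihoods."""
    vocabulary = sorted({w for doc in docs for w in doc.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior, log_likelihood = {}, {}
    for c in class_docs:
        log_prior[c] = math.log(class_docs[c] / len(docs))
        # Smoothing adds alpha to every word count of the class.
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total)
            for w in vocabulary
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log-posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        # The naive assumption: words contribute independently given the
        # class, so their log-likelihoods simply add up.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = ["good great movie", "great plot", "bad boring movie", "boring plot bad"]
labels = ["pos", "pos", "neg", "neg"]

log_prior, log_likelihood = fit_nb(docs, labels)
prediction = predict_nb("great movie", log_prior, log_likelihood)
print(prediction)
```

Training amounts to a single pass of counting, which is why this kind of model scales well to very high-dimensional sparse inputs.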

### Testing

Finally, let's see how accurate our model is.



In [None]:
# predict labels for the test set
y_pred = model.predict(X_test)

# confusion matrix: rows are actual labels, columns are predicted labels
cm = pd.crosstab(Y_test, y_pred,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm, "\n")

acc = accuracy_score(Y_test, y_pred)
print(f"Accuracy = {acc}")
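
The confusion matrix and the accuracy are closely related: the accuracy is the share of samples that fall on the diagonal of the matrix. The sketch below shows this on hypothetical labels, using plain Python instead of `pd.crosstab` and `accuracy_score`.

```python
from collections import Counter

# Hypothetical actual and predicted labels.
actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos"]

# Confusion matrix as counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))

# Accuracy is the share of samples where actual and predicted agree.
correct = sum(count for (a, p), count in confusion.items() if a == p)
accuracy = correct / len(actual)

print(confusion)
print(accuracy)
```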

### References

<a id="TfidfTransformer">[TfidfTransformer]</a> sklearn.feature_extraction.text.TfidfTransformer. [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).

