# Preparing the IMDb movie review data for text processing

## Obtaining the movie review dataset

## Preprocessing the movie dataset into a more convenient format

In [3]:
import pyprind
import pandas as pd
import os

basepath = '/home/jacob/Downloads/aclImdb'
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)  # number of documents
rows = []  # collect [review, label] pairs; DataFrame.append was removed in pandas 2.0

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

FileNotFoundError: [Errno 2] No such file or directory: '/home/jacob/Downloads/aclImdb/test/pos'

In [10]:
# store the assembled and shuffled movie review dataset as a CSV file
import numpy as np
import pandas as pd
import os
np.random.seed(0)
# the class labels in the assembled dataset are sorted, so we shuffle
# the rows using the permutation function
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [7]:
# read the csv file
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

EmptyDataError: No columns to parse from file

In [None]:
# view the size of data
df.shape

## Introducing the bag-of-words model
The idea behind bag-of-words is quite simple and can be summarized as follows:
 1. We create a vocabulary of unique tokens (for example, words) from the entire set of documents.
 2. We construct a feature vector from each document that contains the counts of how often each word occurs in that particular document.

Because the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them sparse.
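
The two steps above can be sketched in plain Python before turning to a library. This is only an illustration: the helper names `simple_tokens` and `to_vector` are made up for this example, and the tokenization (lowercasing, treating commas as whitespace) is a simplifying assumption.

```python
from collections import Counter

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def simple_tokens(text):
    # Illustrative tokenizer: lowercase, treat commas as whitespace, split on whitespace
    return text.lower().replace(",", " ").split()

# Step 1: vocabulary of unique tokens over all documents, indexed alphabetically
unique_tokens = sorted({tok for doc in docs for tok in simple_tokens(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document, ordered by vocabulary index
def to_vector(text):
    counts = Counter(simple_tokens(text))
    return [counts.get(tok, 0) for tok in vocabulary]

vectors = [to_vector(doc) for doc in docs]
for vec in vectors:
    print(vec)
# [0, 1, 0, 1, 1, 0, 1, 0, 0]
# [0, 1, 0, 0, 0, 1, 1, 0, 1]
# [2, 3, 2, 1, 1, 1, 2, 1, 1]
```

Even in this tiny example, more than half of the entries in the first two vectors are zero; with a realistic vocabulary of tens of thousands of words, the proportion of zeros is far higher.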

## Transforming words into feature vectors
We use the `CountVectorizer` class implemented in scikit-learn to construct a bag-of-words model based on the word counts in the respective documents.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()  # default is unigrams; pass ngram_range=(2, 2) to use bigrams
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])
bag = count.fit_transform(docs)

In [None]:
print(count.vocabulary_)

In [None]:
print(bag.toarray())

The values in the feature vectors are also called the **raw term frequencies**: $\mathrm{tf}(t,d)$, the number of times a term, $t$, occurs in a document, $d$.

It should be noted that, in the bag-of-words model, the word or term order in a sentence or document does not matter. The order in which the term frequencies appear in the feature vector is derived from the vocabulary indices, which are usually assigned alphabetically.
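
A short, self-contained sketch of this point: two sentences made of the same words in a different order produce identical count vectors. The sentences and the helper `count_vector` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def count_vector(text, vocabulary):
    # Count each token, then read the counts out in vocabulary order
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocabulary]

a = "the dog bit the man"
b = "the man bit the dog"

# Alphabetically sorted vocabulary shared by both sentences
vocabulary = sorted(set(a.split()) | set(b.split()))

vec_a = count_vector(a, vocabulary)
vec_b = count_vector(b, vocabulary)

print(vocabulary)      # ['bit', 'dog', 'man', 'the']
print(vec_a)           # [1, 1, 1, 2]
print(vec_a == vec_b)  # True
```

The two sentences mean very different things, yet the model cannot tell them apart, which is the price paid for the simplicity of the representation.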

## Assessing word relevancy via term frequency-inverse document frequency
Words that occur frequently across documents typically don't contain useful or discriminatory information. We will use the **term frequency-inverse document frequency (tf-idf)** technique, which can be used to downweight these frequently occurring words in the feature vectors.

The tf-idf can be defined as the product of the term frequency and the inverse document frequency:
$$
\text{tf-idf}(t,d)=\mathrm{tf}(t,d)\times \mathrm{idf}(t,d)
$$
Here, $\mathrm{tf}(t,d)$ is the term frequency that we introduced before, and $\mathrm{idf}(t,d)$ is the inverse document frequency, which can be calculated as follows:
$$
\mathrm{idf}(t,d)=\log\frac{n_d}{1+\mathrm{df}(d,t)}
$$
Here, $n_d$ is the total number of documents, and $\mathrm{df}(d,t)$ is the number of documents, $d$, that contain the term $t$.

Note that adding the constant $1$ to the denominator is optional; it avoids a division by zero for terms that occur in none of the training examples. The $\log$ is used to ensure that low document frequencies are not given too much weight.
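
To make the formula concrete, the following sketch evaluates it for three terms from the example documents used earlier. The natural logarithm is an assumption here, the document frequencies are counted by hand, and the function name `idf` is only illustrative.

```python
import math

n_docs = 3  # total number of example documents

# Hand-counted document frequencies: in how many of the three documents each term occurs
doc_freq = {"is": 3, "sun": 2, "two": 1}

def idf(df_t, n_d):
    # idf(t, d) = log(n_d / (1 + df(d, t))), using the natural logarithm
    return math.log(n_d / (1 + df_t))

for term, df_t in doc_freq.items():
    print(term, round(idf(df_t, n_docs), 3))
# is -0.288
# sun 0.0
# two 0.405
```

The rarer a term is across documents, the larger its idf. With only three documents, a term that occurs in every one of them ("is") even receives a negative value under this definition.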

We use the `TfidfTransformer` class, which takes the raw term frequencies from the `CountVectorizer` class as input and transforms them into tf-idfs:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                        norm='l2', # normalize the tf-idf directly
                        smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

**Remark**: scikit-learn implements the inverse document frequency as follows:
$$
\mathrm{idf}(t,d)=\log\frac{1+n_d}{1+\mathrm{df}(d,t)}
$$
Similarly, the tf-idf computed in scikit-learn deviates slightly from the default equation we defined earlier:
$$
\text{tf-idf}(t,d)=\mathrm{tf}(t,d)\times(\mathrm{idf}(t,d)+1)
$$
The $+1$ terms in the numerator and denominator of the idf are due to the setting `smooth_idf=True` in the previous code example. With this smoothing, a term that occurs in all documents gets $\mathrm{idf}(t,d)=\log(1)=0$, and the additional $+1$ in the tf-idf equation ensures that such a term is not ignored entirely.
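
As a rough check of the scikit-learn equations above, the sketch below recomputes the tf-idf vector of the third example document by hand and then applies L2 normalization, as `norm='l2'` does. The term counts and document frequencies are tallied manually from the three example documents, the natural logarithm is assumed, and the name `smoothed_idf` is illustrative rather than part of any library.

```python
import math

n_docs = 3

# Vocabulary in alphabetical order, with hand-tallied values for the third document
terms = ["and", "is", "one", "shining", "sun", "sweet", "the", "two", "weather"]
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]   # raw term frequencies in document 3
doc_freq = [1, 3, 1, 2, 2, 2, 3, 1, 2]  # number of documents containing each term

def smoothed_idf(df_t, n_d):
    # idf(t, d) = log((1 + n_d) / (1 + df(d, t)))
    return math.log((1 + n_d) / (1 + df_t))

# tf-idf(t, d) = tf(t, d) * (idf(t, d) + 1)
raw = [tf * (smoothed_idf(df_t, n_docs) + 1) for tf, df_t in zip(tf_doc3, doc_freq)]

# L2 normalization: divide by the Euclidean length of the vector
norm = math.sqrt(sum(x * x for x in raw))
tfidf_doc3 = [x / norm for x in raw]

print([round(x, 2) for x in raw])
# [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]
print([round(x, 2) for x in tfidf_doc3])
# [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]
```

The word "is" has the highest raw count in this document (3), yet after the transformation its weight (0.45) is below that of "and" and "one" (0.5), because it occurs in every document. The normalized values should agree, up to rounding, with the third row printed by the `TfidfTransformer` cell.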

## Cleaning text data
Before we build our bag-of-words model, we need to clean the text data by stripping it of all unwanted characters.

First, we display the last 50 characters from the first document in the reshuffled movie review dataset:

In [None]:
df.loc[0, 'review'][-50:]

The result implies that we need to remove all punctuation marks except for emoticon characters, since those are certainly useful for sentiment analysis.

We use Python's **regular expression (regex)** library, `re`, as shown here:

In [None]:
import re
def preprocessor(text):
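    text = re.sub(r'<[^>]*>', '', text)  # remove all the HTML markup from the movie reviews
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse runs of non-word characters into a space,
    # then append the emoticons (without the nose character '-') as separate tokens
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))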
    
    return text

In [None]:
preprocessor(df.loc[0, 'review'][-50:])

In [None]:
preprocessor("</a>This :) is :( a test :-)!")

Apply our `preprocessor` function to all the movie reviews in our `DataFrame`:

In [None]:
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens
We now think about how to split the text into individual elements.

One way to *tokenize* documents is to split them into individual words by splitting the cleaned documents at their whitespace characters:

In [None]:
def tokenizer(text):
    return text.split()

In [None]:
tokenizer('runners like running and thus they run')

Another useful technique is **word stemming**, which is the process of transforming a word into its root form. It allows us to map related words to the same stem.

The following code shows how to use the Porter stemming algorithm:

In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runner like running and thus they run')

Using the `PorterStemmer`, we modified our `tokenizer` function to reduce words to their root form, as illustrated by the preceding example, where the word `running` was stemmed to its root form `run`.

## Stop-word removal
Stop words are simply those words that are extremely common in all sorts of texts and probably bear no useful information that can be used to distinguish between different classes of documents. Examples of stop words are *is*, *and*, *has*, and *like*. Removing stop words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which already downweight frequently occurring words.

In [None]:
import nltk
nltk.set_proxy('SYSTEM PROXY')  # only needed behind a proxy; replace the placeholder with your proxy URL
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]
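
The filtering step itself does not depend on NLTK. The sketch below shows the same idea with a tiny, hand-picked stop list; both the list and the helper `remove_stop_words` are assumptions made for illustration, and NLTK's English stop-word list is considerably longer.

```python
# A tiny, hand-picked stop list for illustration only
stop_words = {"a", "and", "is", "has", "like", "the"}

def remove_stop_words(tokens, stop_words):
    # Keep only the tokens that are not in the stop list
    return [tok for tok in tokens if tok not in stop_words]

tokens = "the sun is shining and the weather is sweet".split()
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['sun', 'shining', 'weather', 'sweet']
```

Storing the stop words in a `set` keeps each membership test fast, which matters when the filter runs over every token of 50,000 reviews.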

In [8]:
%pwd

'/home/jacob/Documents/Coding/python_learn/python_machine_learning'

In [9]:
%ls

'Applying Mchine Learning to Segntiment Analysis.ipynb'
 Combine_different_models_for_ensemble_learning.ipynb
 confusion_matrix.png
'Model Evaluation and Hyperparameter Tuning.ipynb'
 movie_data.csv
