In [3]:
# -*- coding: utf-8 -*-
# @author: tongzi
# @description: combining different models for ensemble learning
# @created date: 2019/09/04
# @last modification: 2019/09/04

In [4]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Preparing the movie dataset into a more convenient format

In [5]:
import pyprind

In [6]:
import os

In [8]:
basepath = 'aclImdb'
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
labels = {'pos': 1, 'neg': 0}

In [11]:
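# DataFrame.append was removed in pandas 2.0 (and copied the whole frame on
# every call), so collect the rows in a list and build the DataFrame once.
rows = []
for s in ('test', 'train'):
    for label in ('pos', 'neg'):
        path = os.path.join(basepath, s, label)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[label]])
            pbar.update()
df = pd.DataFrame(rows)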



0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:08:09


In [15]:
df.columns = ['review', 'sentiment']

In [17]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

#### Transforming words into feature vectors  
To construct a bag-of-words model based on the word counts in the respective
documents, we can use the CountVectorizer class implemented in scikit-learn. As
we will see in the following code section, CountVectorizer takes an array of text
data, which can be documents or sentences, and constructs the bag-of-words model
for us:

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
count = CountVectorizer()

In [22]:
docs = np.array([
    'I am a boy',
    'I like baseketball',
    "But I don't play baseketball for a long time"
])

In [23]:
bag = count.fit_transform(docs)

In [25]:
count.vocabulary_

{'am': 0,
 'boy': 2,
 'like': 6,
 'baseketball': 1,
 'but': 3,
 'don': 4,
 'play': 8,
 'for': 5,
 'long': 7,
 'time': 9}

In [24]:
bag.toarray()

array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 1]], dtype=int64)

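The values in the feature vectors are also called the raw term frequencies, $tf(t, d)$: the number of times a term $t$ occurs in a document $d$. Note that the one-letter words "I" and "a" are missing from the vocabulary because CountVectorizer's default token pattern only keeps tokens of two or more characters.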
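
As a rough illustration of where these counts come from, the sketch below rebuilds the vocabulary and the count matrix in plain Python. It assumes CountVectorizer's default behaviour of lowercasing the text and keeping only runs of two or more word characters; the helper names `tokenize` and `bag_of_words` are made up for this example and are not part of scikit-learn.

```python
import re
from collections import Counter

docs = [
    'I am a boy',
    'I like baseketball',
    "But I don't play baseketball for a long time",
]

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')


def tokenize(text):
    """Lowercase the text and return its tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(documents):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    # Vocabulary: every distinct token, indexed in alphabetical order.
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}
    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        rows.append([term_counts[term] for term in terms])
    return vocabulary, rows


vocabulary, counts = bag_of_words(docs)
print(vocabulary)
for row in counts:
    print(row)
```

The printed rows should agree with the output of `bag.toarray()` above, which suggests that the default tokenization is the only step standing between the raw strings and the term frequencies.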

>To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of a document such as "the sun is shining" would be constructed as follows:  
• 1-gram: "the", "sun", "is", "shining"  
• 2-gram: "the sun", "sun is", "is shining"  
The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with *ngram_range=(2,2)*.
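
The word n-grams listed above can be produced with a few lines of plain Python. This is only a sketch of the idea, splitting on whitespace; `word_ngrams` is a hypothetical helper rather than a scikit-learn function, and CountVectorizer's own tokenization is more involved.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text` (whitespace tokenization)."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


doc = 'the sun is shining'
print(word_ngrams(doc, 1))  # ['the', 'sun', 'is', 'shining']
print(word_ngrams(doc, 2))  # ['the sun', 'sun is', 'is shining']
```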

#### Assessing word relevancy via term frequency-inverse document frequency  
  
  When we are analyzing text data, we often encounter words that occur across
multiple documents from both classes. These frequently occurring words typically
don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called **term frequency-inverse document frequency (tf-idf)** that can be used to downweight these frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:  
$$\text{tf-idf}(t, d) = tf(t, d) \times idf(t, d)$$


Here, $tf(t, d)$ is the term frequency that we introduced in the previous section, and $idf(t, d)$ is the inverse document frequency, which can be calculated as follows:  
$$idf(t, d) = \log \frac{n_d}{1+df(d, t)}$$

Here $n_d$ is the total number of documents, and $df(d, t)$ is the number of documents $d$ that contain the term $t$.
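
To make the textbook definition concrete, here is a small sketch that applies it to two terms from the three example documents. The document frequencies are read off the count matrix shown earlier; `textbook_idf` is a name invented for this example, and the natural logarithm is assumed.

```python
import math

n_docs = 3  # total number of documents, n_d

# Document frequencies taken from the count matrix of the example documents.
doc_freq = {'boy': 1, 'baseketball': 2}


def textbook_idf(df, n_d):
    """idf(t, d) = log(n_d / (1 + df(d, t))), using the natural logarithm."""
    return math.log(n_d / (1 + df))


for term, df in doc_freq.items():
    tf = 1  # each of these terms occurs once in the documents that contain it
    idf = textbook_idf(df, n_docs)
    print(f'{term}: idf = {idf:.2f}, tf-idf = {tf * idf:.2f}')
```

Under this definition, a term that appears in two of the three documents already receives a weight of zero, which shows how strongly common terms are downweighted in such a small corpus.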

The scikit-learn library implements yet another transformer, the TfidfTransformer
class, that takes the raw term frequencies from the CountVectorizer class as input
and transforms them into tf-idfs:

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer

In [27]:
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)

In [28]:
np.set_printoptions(precision=2)

In [29]:
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.71 0.   0.71 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.61 0.   0.   0.   0.   0.8  0.   0.   0.  ]
 [0.   0.3  0.   0.39 0.39 0.39 0.   0.39 0.39 0.39]]


However, if we calculated the tf-idfs of the individual terms in our
feature vectors by hand, we would notice that TfidfTransformer computes them slightly
differently from the standard textbook equations that we defined previously.
With smooth_idf=True, as used above, the inverse document frequency in scikit-learn is
computed as follows:  
$$idf(t, d) = \log \frac{1+n_d}{1+df(d, t)}$$

Similarly, the tf-idf computed in scikit-learn deviates slightly from the default equation we defined earlier:  
$$\text{tf-idf}(t, d) = tf(t, d) \times \left(idf(t, d) + 1\right)$$

By default (norm='l2'), scikit-learn's TfidfTransformer applies L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector $v$ by its L2-norm:  
$$v_{norm} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + ... + v_n^2}} = \frac{v}{\sqrt{\sum_{i=1}^{n} v_i^2}}$$
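
To check these equations against the output printed earlier, the following sketch recomputes the tf-idf vector of the second example document by hand. It uses only the standard library and assumes the natural logarithm; the names `smoothed_idf`, `raw`, and `normalized` are chosen for this example only.

```python
import math

n_docs = 3  # total number of documents, n_d

# Raw term frequencies and document frequencies for the second example
# document, 'I like baseketball' (the one-letter token 'I' is not counted).
tf = {'baseketball': 1, 'like': 1}
doc_freq = {'baseketball': 2, 'like': 1}


def smoothed_idf(df, n_d):
    """idf(t, d) = log((1 + n_d) / (1 + df(d, t))), natural logarithm."""
    return math.log((1 + n_d) / (1 + df))


# tf-idf(t, d) = tf(t, d) * (idf(t, d) + 1)
raw = {term: tf[term] * (smoothed_idf(doc_freq[term], n_docs) + 1) for term in tf}

# L2-normalization: divide every entry by the Euclidean length of the vector.
l2_norm = math.sqrt(sum(value ** 2 for value in raw.values()))
normalized = {term: value / l2_norm for term, value in raw.items()}

for term, value in normalized.items():
    print(f'{term}: {value:.2f}')
```

The two values should match the non-zero entries in the second row of the TfidfTransformer output (0.61 for "baseketball" and 0.8 for "like").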

#### Cleaning text data

In [41]:
df.loc[0, 'review']

"I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."

In [42]:
df.loc[0, 'review'][-50:]

'and I suggest that you go see it before you judge.'

In [52]:
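import re

def preprocessor(text):
    # Strip HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, then append the
    # emoticons with their '-' noses removed.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text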

In [53]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

Since we will make use of the cleaned text data over and over again during the
next sections, let us now apply our preprocessor function to all the movie reviews
in our DataFrame:

In [54]:
df['review'] = df['review'].apply(preprocessor)