<br>
<br>

In [27]:
# Version check: GridSearchCV moved to sklearn.model_selection in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

### Read in the Dataset

The dataset (50,000 movie reviews stored as individual text files in `train`/`test` and `pos`/`neg` folders) must be read into a pandas DataFrame before we can work with it.

In [28]:
import pyprind
import pandas as pd
import os

# Path to the extracted aclImdb dataset; change this to match your machine.
basepath = '/Users/moorejar/Documents/Course_Projects/CIS_365/Project_2/aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r') as infile:
                txt = infile.read()
            # Collect rows in a list and build the DataFrame once at the end;
            # DataFrame.append was removed in pandas 2.0 and is slow in a loop.
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:32


#### Shuffle the Dataset

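The reviews were read in folder by folder, so the rows are currently sorted by class label. Shuffling the DataFrame gives us a randomized ordering, and the fixed seed makes the shuffle reproducible.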

In [29]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

If you want, you can save the shuffled dataset to a CSV file for later use.

In [30]:
df.to_csv('./movie_data.csv', index=False)

In [31]:
# Read the dataset back in from the movie data CSV file.
# This way we do not need to reassemble the data from the raw files again.
import pandas as pd

df = pd.read_csv('./movie_data.csv')
df.head(10)

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |
| 5 | Leave it to Braik to put on a good show. Final... | 1 |
| 6 | Nathan Detroit (Frank Sinatra) is the manager ... | 1 |
| 7 | To understand "Crash Course" in the right cont... | 1 |
| 8 | I've been impressed with Chavez's stance again... | 1 |
| 9 | This movie is directed by Renny Harlin the fin... | 1 |


# Introducing the bag-of-words model

... (See the in-class slides.)

## Transforming documents into feature vectors

Let's build a bag-of-words model with code.  The scikit-learn library lets us do this simply, without having to code it ourselves.  The following code snippet builds a set of feature vectors from these three phrases:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [32]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

Let's print out the vocabulary that we have built.

In [33]:
print(count.vocabulary_)

{u'and': 0, u'sun': 4, u'is': 1, u'two': 7, u'one': 2, u'weather': 8, u'sweet': 5, u'the': 6, u'shining': 3}


The vocabulary is a dictionary, mapping unique words to integer indices.

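Using the indices from the vocabulary above, we can print out the feature vectors for the sentences.  Each column position corresponds to an integer index in the dictionary.  For example, index 0 is "and", which occurs twice in the third sentence, so the first entry of the third row is 2.  Also note that this representation stores the raw number of occurrences of each word, also called the raw term frequency.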

In [34]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
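
To make the mapping from words to columns concrete, here is a minimal sketch that rebuilds the same vocabulary and count matrix in plain Python. It assumes the default `CountVectorizer` behavior (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically); the names `tokenize`, `count_vector`, and `bag_manual` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern (an assumption
    # about the defaults, not a call into scikit-learn).
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: sorted unique tokens mapped to column indices.
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

def count_vector(text):
    # One entry per vocabulary term: how often it occurs in this text.
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in all_terms]

bag_manual = [count_vector(doc) for doc in docs]
for row in bag_manual:
    print(row)
```

Each printed row should match the corresponding row of `bag.toarray()` above.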


<br>

## Assessing word relevancy via term frequency-inverse document frequency

#### Many of the words occur across the data and in cases of both positive and negative sentiment.

These words typically don't contain useful information for deciding on sentiment.  Examples include "The", "It", "As", "And", "However", and "I".

Term frequency-inverse document frequency (tf-idf) allows us to discount frequently occurring words.

tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

Here, *tf(t, d)* is the term frequency, i.e. how often the term *t* occurs in the document *d*. The inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*.

Adding the constant 1 to the denominator is optional, but it allows us to assign non-zero values to terms that occur in all training data. The logarithm ensures that low document frequencies are not given too much weight.
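
As a quick worked example, the sketch below evaluates this idf formula for three terms from the toy corpus. The document frequencies are read off the count matrix above; `idf_textbook` and `doc_freq` are illustrative names introduced here.

```python
import math

n_docs = 3  # total number of documents in the toy corpus

# Document frequencies read off the count matrix:
# "and" occurs in 1 document, "sun" in 2, "is" in all 3.
doc_freq = {'and': 1, 'sun': 2, 'is': 3}

def idf_textbook(df, n_d):
    # idf(t, d) = log(n_d / (1 + df(d, t)))
    return math.log(n_d / (1.0 + df))

idf_values = {term: idf_textbook(df, n_docs) for term, df in doc_freq.items()}
for term, value in idf_values.items():
    print('%s: %.3f' % (term, value))
```

The rarer a term is across documents, the larger its idf: "and" receives the highest value of the three, while "is", which occurs in every document, receives the lowest.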

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [35]:
from sklearn.feature_extraction.text import TfidfTransformer

np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]

The word "is" has the largest term frequency (3) in the 3rd document. After transforming the feature vector into tf-idfs, "is" is given a *relatively* small tf-idf (0.45) in document 3, since it also occurs in documents 1 and 2 and so is unlikely to contain useful, discriminating information.


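You might have noticed that calculating the values by hand with the formulas above does not reproduce this output, for two reasons. First, with `smooth_idf=True`, scikit-learn computes the idf as $\log\frac{1+n_d}{1+\text{df}(d,t)}$ and adds 1 to it before multiplying by the term frequency. Second, scikit-learn applies L2-normalization, which returns a vector of length 1 by dividing a feature vector *v* by its L2-norm as follows: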

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

An example calculating the tf-idf of the word "is" in the 3rd document is as follows:

"is" has a term frequency of 3 (tf = 3) in document 3. 
"is" occurs in all three documents (df = 3). 

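The idf, in the smoothed form that scikit-learn uses (1 added to both the numerator and the denominator), is as follows:

$$\text{idf}("is", d3) = \log \frac{1+3}{1+3} = 0$$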

tf-idf is calculated by adding 1 to the inverse document frequency and multiplying by the term frequency:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$

In [12]:
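tf_is = 3    # term frequency of "is" in document 3
n_docs = 3   # total number of documents
df_is = 3    # number of documents that contain "is"
# Use floats so the division is not truncated under Python 2.
idf_is = np.log((n_docs + 1.0) / (df_is + 1.0))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)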

tf-idf of term "is" = 3.00


If we perform these calculations for all terms in the 3rd document, we get the tf-idf vector [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]. These values differ from the ones we obtained from the TfidfTransformer because the L2-normalization is still missing. Applying it gives:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}("is", d3) = 0.45$$

In [14]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

array([ 3.39,  3.  ,  3.39,  1.29,  1.29,  1.29,  2.  ,  1.69,  1.29])

In [15]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([ 0.5 ,  0.45,  0.5 ,  0.19,  0.19,  0.19,  0.3 ,  0.25,  0.19])
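
The same numbers can also be reproduced without scikit-learn. The following is a minimal sketch in plain Python, assuming the smoothed idf used in the worked example for "is" (1 added to both numerator and denominator, then 1 added to the idf). The lists `tf_doc3` and `df_terms` are typed in by hand from the count matrix, and all names are illustrative.

```python
import math

# Raw term counts for the 3rd document, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]
# Number of documents (out of 3) that contain each term.
df_terms = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_docs = 3

# Smoothed idf plus 1, multiplied by the term frequency.
raw = [tf * (math.log((1.0 + n_docs) / (1.0 + df)) + 1.0)
       for tf, df in zip(tf_doc3, df_terms)]

# L2-normalization: divide every entry by the vector's Euclidean length.
norm = math.sqrt(sum(v ** 2 for v in raw))
normalized = [v / norm for v in raw]

print([round(v, 2) for v in raw])
print([round(v, 2) for v in normalized])
```

The first printed list corresponds to `raw_tfidf` and the second to `l2_tfidf`, up to rounding.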

<br>

## Cleaning text data

It can be beneficial to remove special characters and HTML markup from text.  I'll leave that as an exercise for your group to do during the project.

<br>

## Processing documents into tokens

Now that we understand the basic process of creating feature vectors, let's get back to the movie review data.  After cleaning up the text as needed, the next step is to split each review into tokens so that we can build our bag-of-words model and derive the feature vectors.

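The Porter stemmer is one possible stemming algorithm for reducing words to their root form by removing suffixes such as '-ing', '-ed', and '-s'.  Note that the resulting stems are not always real words: in the output below, "thus" becomes "thu".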

In [14]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [17]:
# View a simple tokenizer (non-stemming) on a basic string.
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [16]:
# View the PorterStemmer on a basic string
tokenizer_porter('runners like running and thus they run')

[u'runner', 'like', u'run', 'and', u'thu', 'they', 'run']

In [20]:
# Stopwords are common words in English that we might not want.
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/moorejar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
# Apply the stopwords as follows:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', u'like', u'run', u'run', 'lot']

<br>
<br>

# Training a logistic regression model for document classification

In [36]:
# Create a training and test set.  
# This is a simplified approach.
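# Slice by position so that the split is exactly 25,000 / 25,000.
# Label-based df.loc[:25000] includes row 25000, which would then
# appear in both the training set and the test set.
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000]
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:]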

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

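# solver='liblinear' supports both the 'l1' and 'l2' penalties in the grid;
# the default solver in recent scikit-learn versions does not support 'l1'.
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])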

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [None]:
# This fitting process takes a long time!
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 19.8min


In [39]:
#print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.889


In [38]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.892


<hr>
<hr>