### Topics covered in the following sections

- Cleaning and preparing text data
- Building feature vectors from text documents
- Training a machine learning model to classify positive and negative movie reviews
- Working with large text datasets using out-of-core learning
- Inferring topics from document collections for categorization

#### Preparing the IMDb movie review data for text processing

In [1]:
import tarfile

with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()

#### Preprocessing the movie dataset into a more convenient format

To track progress while reading the individual review files, we will use the
Python Progress Indicator (PyPrind, https://pypi.python.org/pypi/PyPrind/)
package that I developed several years ago for such purposes. PyPrind can be
installed by executing the pip install pyprind command.

In [5]:
# pip install PyPrind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl.metadata (1.1 kB)
Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pyprind
import pandas as pd
import os

# change the 'basepath' to the directory of the 
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)

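# Collect the rows in a plain list first. Appending to a DataFrame
# inside the loop copies the whole frame on every iteration, and
# DataFrame._append is a private pandas method.
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()

# Build the DataFrame once, with the column names set directly
df = pd.DataFrame(rows, columns=['review', 'sentiment'])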

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:24:40


In [5]:
import numpy as np

np.random.seed(0)

df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [7]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0


### Transforming words into feature vectors

To construct a bag-of-words model based on the word counts in the respective
documents, we can use the CountVectorizer class implemented in scikit-learn. As
we will see in the following code section, CountVectorizer takes an array of text
data, which can be documents or sentences, and constructs the bag-of-words model
for us:

In [12]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

docs = np.array([
    'The sun is shining',
    'The Weather is sweet',
    'The sun is shining and the weather is sweet'
])

bag = count.fit_transform(docs)


In [13]:
print(count.vocabulary_)

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}


In [17]:
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
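
To make the counting behind a bag-of-words model more concrete, the following is a minimal hand-rolled sketch in plain Python. It is not how scikit-learn implements CountVectorizer; the `tokenize` helper is a hypothetical name introduced here, and its regular expression (lowercase tokens of two or more word characters) is an assumption meant to approximate CountVectorizer's default tokenization. The three sentences are spelled out again so the sketch is self-contained.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens made of
    # two or more word characters (an assumed approximation of the
    # default CountVectorizer behavior).
    return re.findall(r'\b\w\w+\b', text.lower())


# Vocabulary: every unique token mapped to a column index,
# assigned in alphabetical order.
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Feature vectors: one row of raw term counts per document,
# with columns ordered by the vocabulary indices.
bag = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bag.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in bag:
    print(row)
```

Each position in a row corresponds to the index stored in `vocabulary`, so a value of 2 in the column for 'is' means that the word occurs twice in that document. These raw counts are also called term frequencies.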


### Assessing word relevancy via term frequency-inverse document frequency (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import Tfidf