<DIV ALIGN=CENTER>

# Introduction to Practical Concepts
## Professor Robert J. Brunner
  
</DIV>  
-----
-----


## Introduction

In this IPython Notebook, we explore the following topics, using
sentiment analysis of movie reviews as the running example:

1. Repeating the unigram results (reading in the data and setting things up)
2. Stemming
3. Stemming results
4. n-grams
5. Bigram results
6. Trigram results
7. Comparison of results (grid search?)



-----

In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We do this to ignore several specific Pandas warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
import nltk
mvr = nltk.corpus.movie_reviews

from sklearn.datasets import load_files

data_dir = '/home/data_scientist/nltk_data/corpora/movie_reviews'
mvr = load_files(data_dir, shuffle = False)
print('Number of Reviews: {0}'.format(len(mvr.data)))

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

mvr_train, mvr_test, y_train, y_test = train_test_split(
    mvr.data, mvr.target, test_size=0.25, random_state=23)

Number of Reviews: 2000


## n-grams

Formally, an [_n-gram_][ngd] is a contiguous sequence of **n** items from a
parent sequence of items, such as characters or words in a text
document. In general, we will focus solely on words in a document. Thus
far, our approach has simply been to look at unigrams, or single
words, in a document when building a classification model. However,
sometimes a combination of words is more descriptive; for example,
_unbelievably bad_ is generally viewed as a more powerful description
than just _bad_. For this reason, sequences of words can themselves be
treated as features. In fact, Google allows a user to search for
[specific n-gram][gnv] combinations in books that they have scanned.
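
To make the definition concrete, here is a minimal sketch in plain Python that extracts word n-grams from a short sentence. The helper name `ngrams` and the example sentence are illustrative assumptions, not part of the notebook, and tokens are split on whitespace only.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace
tokens = "the plot was unbelievably bad".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of five tokens yields four bigrams and three trigrams, and the bigram `('unbelievably', 'bad')` now appears as a single feature.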

While this can improve classification power, it also increases
computational requirements, because the number of possible features
grows exponentially with the n-gram order. For example, given a
vocabulary of $n$ words, there are up to $n^2$ possible bigrams, $n^3$
possible trigrams, and so on for higher order combinations. While this
is not a problem for small vocabularies, for larger vocabularies (and
corresponding documents) the number of possible features can quickly
become very large. Thus, many text mining applications make use of
Hadoop or Spark clusters to leverage the inherent parallelism in these
tasks.
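
The growth in feature count can be illustrated with a small sketch in plain Python. The toy corpus, the whitespace tokenization, and the helper `distinct_ngrams` are all assumptions made for illustration. For each maximum order, the sketch counts the distinct n-grams actually observed (the cumulative feature count for orders 1 through that maximum) and compares this with the theoretical upper bound for the same vocabulary.

```python
def distinct_ngrams(docs, k):
    """Collect the distinct k-grams over whitespace-tokenized documents."""
    seen = set()
    for doc in docs:
        toks = doc.lower().split()
        seen.update(tuple(toks[i:i + k]) for i in range(len(toks) - k + 1))
    return seen

# Hypothetical toy corpus
docs = ["the film was good", "the film was bad", "a good film", "a bad plot"]

vocab_size = len(distinct_ngrams(docs, 1))

observed = []   # cumulative number of distinct features for orders 1..k
bounds = []     # cumulative upper bound: n + n^2 + ... + n^k
for k in (1, 2, 3):
    prev_obs = observed[-1] if observed else 0
    prev_bound = bounds[-1] if bounds else 0
    observed.append(prev_obs + len(distinct_ngrams(docs, k)))
    bounds.append(prev_bound + vocab_size ** k)
    print("orders 1..{0}: {1} observed features, at most {2} possible".format(
        k, observed[-1], bounds[-1]))
```

In real text only a small fraction of the possible combinations occurs, but the observed count still grows quickly with the maximum order.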

To demonstrate using n-grams, the following code example builds a
feature vector containing both single words and bigrams from the
documents. We use this new sparse matrix to classify the documents by
using our simple Naive Bayes classifier, which obtains slightly better
results than with unigrams alone.

-----
[gnv]: https://books.google.com/ngrams
[ngd]: https://en.wikipedia.org/wiki/N-gram

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
pclf = Pipeline(tools)


# Remove English stop words, lowercase the text, and use both unigrams and bigrams
pclf.set_params(cv__stop_words='english',
                cv__ngram_range=(1, 2),
                cv__lowercase=True)

pclf.fit(mvr_train, y_train)
y_pred = pclf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.84      0.71      0.77       259
        pos       0.73      0.85      0.79       241

avg / total       0.79      0.78      0.78       500



In [4]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 421010


In [5]:
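# Add trigrams, and keep only tokens that appear in at least two
# documents but in no more than half of all documents
pclf.set_params(cv__stop_words='english',
                cv__ngram_range=(1, 3),
                cv__lowercase=True,
                cv__min_df=2,
                cv__max_df=0.5)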

pclf.fit(mvr_train, y_train)
y_pred = pclf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.84      0.78      0.81       259
        pos       0.78      0.83      0.81       241

avg / total       0.81      0.81      0.81       500



In [6]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 62735


-----

Student Activity

-----

In [7]:
import string
import nltk
from nltk.stem.porter import PorterStemmer

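# Create the stemmer once, rather than on every call
stemmer = PorterStemmer()

def tokenize(text):
    # Split text into tokens and drop punctuation tokens
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]

    # Return a list (not a lazy map object) of stemmed tokens
    return [stemmer.stem(token) for token in tokens]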

pclf.set_params(cv__stop_words = 'english', \
                cv__ngram_range=(1,3), \
                cv__lowercase=True, \
                cv__tokenizer=tokenize)

pclf.fit(mvr_train, y_train)
y_pred = pclf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.83      0.77      0.80       259
        pos       0.77      0.83      0.80       241

avg / total       0.80      0.80      0.80       500



In [8]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 80529


### Clustering Analysis

We can also apply clustering analysis to our feature matrix. While
finding an unknown number of clusters in text documents can be
difficult, we can learn about our data by identifying the clusters for
our **known** labels. To demonstrate, in the following code cells, we
employ k-means to find two clusters in the movie review feature matrix
(one per known label) and then twenty clusters in the twenty newsgroups
data, after which we identify the most frequently used tokens in each
cluster.
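
One simple way to relate clusters to known labels is to tabulate cluster assignments against the labels and compute the cluster purity (the fraction of documents that belong to the majority label of their cluster). The sketch below uses plain Python and made-up values. In the notebook, the cluster ids would presumably come from `km.labels_` and the labels from the corresponding target array; the helper names here are hypothetical.

```python
from collections import Counter

def cluster_label_table(cluster_ids, true_labels):
    """Count how often each (cluster, label) pair occurs."""
    return Counter(zip(cluster_ids, true_labels))

def purity(cluster_ids, true_labels):
    """Fraction of items whose label is the majority label of their cluster."""
    table = cluster_label_table(cluster_ids, true_labels)
    majority_total = 0
    for cluster in set(cluster_ids):
        majority_total += max(count for (c, _), count in table.items()
                              if c == cluster)
    return majority_total / len(cluster_ids)

# Hypothetical cluster assignments and known labels
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = ['neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg']

table = cluster_label_table(cluster_ids, true_labels)
score = purity(cluster_ids, true_labels)
print(table)
print("Purity = {0:.2f}".format(score))
```

A purity near 1.0 would suggest that the clusters line up with the known labels, while for two balanced classes a value near 0.5 suggests the clusters capture some other structure in the text.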

-----

In [14]:
from sklearn.cluster import KMeans

true_k = 2

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

from sklearn.feature_extraction.text import CountVectorizer

# Verify attributes

cv = CountVectorizer(stop_words = 'english', \
                     ngram_range=(1, 3), max_features=100000)

train_counts = cv.fit_transform(mvr_train)
test_data = cv.transform(mvr_test)

km.fit(test_data)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=2, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [15]:
top_tokens = 20
labels = ['Neg', 'Pos']

print('Top {} tokens per cluster:\n'.format(top_tokens))

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = cv.get_feature_names()

for idx in range(true_k):
    print("Cluster {0}:".format(idx), end='')
    for jdx in order_centroids[idx, :top_tokens]:
        print(' {0}'.format(terms[jdx]), end='')
    print('\n')

Top 20 tokens per cluster:

Cluster 0: movie vacation vegas like good music heart series vegas vacation film music heart doesn directed craven griswold national streep family roberta isn

Cluster 1: film movie like just time good story character way characters make does plot really scene life people man little bad



In [11]:
# load dataset
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', subset='train', shuffle=True, random_state=23)
test = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', subset='test', shuffle=True, random_state=23)

In [12]:
from sklearn.cluster import KMeans

true_k = 20

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

from sklearn.feature_extraction.text import CountVectorizer

# Verify attributes

cv = CountVectorizer(stop_words = 'english', max_features=100000)
train_counts = cv.fit_transform(train['data'])
test_data = cv.transform(test['data'])

km.fit(test_data)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=20, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [13]:
top_tokens = 20
labels = test['target']

print('Top {} tokens per cluster:\n'.format(top_tokens))

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = cv.get_feature_names()

for idx in range(true_k):
    print("Cluster {0}:".format(idx), end='')
    for jdx in order_centroids[idx, :top_tokens]:
        print(' {0}'.format(terms[jdx]), end='')
    print('\n')

Top 20 tokens per cluster:

Cluster 0: ftp edu cancer file use files server available gopher information mac hiv health number 1993 software faq com medical mail

Cluster 1: 00 20 appears 40 art 50 80 10 wolverine 60 1st ghost rider hobgoblin punisher man annual sabretooth appear cover

Cluster 2: god people like don know just said say does jesus edu right think time way new did believe christ lord

Cluster 3: 92 12 10 hiv 17 11 aids patients et 03 30 medical 25 milk cd4 31 tb 1993 number 04

Cluster 4: com subject lines writes organization edu article don just like people know think posting host nntp time does use good

Cluster 5: stephanopoulos mr president general did think know attorney just don going decision george said statement yesterday house white responsibility mean

Cluster 6: 03 04 02 05 won lost 06 07 idle 08 01 10 09 edu sox berkeley york new chicago san

Cluster 7: edu graphics pub mail ray 128 send ftp 3d com server objects amiga rayshade archie image images file files

-----
## Dimension Reduction

The feature matrices are large, so we next reduce the number of features. PCA can be difficult to apply to matrices of this size; incremental PCA or truncated SVD are possible alternatives. Here, however, we simply select the best $k$ features.
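
To make the idea of selecting the best k features concrete, the following is a minimal sketch of chi-square feature scoring in plain Python. It compares the observed total weight of each feature within a class to the weight expected if the feature were independent of the class. This is an illustration of the idea behind `SelectKBest(chi2, ...)` as used in the cells below, not scikit-learn's implementation; the toy matrix, the feature names, and the helper functions are assumptions.

```python
def chi2_scores(X, y):
    """Chi-square statistic per feature for non-negative counts X (rows = documents)."""
    n_docs = len(X)
    n_features = len(X[0])
    # Total weight of each feature over all documents
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        class_prob = len(rows) / n_docs
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            expected = class_prob * feature_total[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

def select_k_best(scores, k):
    """Return the (sorted) column indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical term counts for the features ['bad', 'good', 'film']
X = [[2, 0, 1],
     [1, 0, 1],
     [0, 2, 1],
     [0, 1, 1]]
y = [0, 0, 1, 1]   # 0 = negative review, 1 = positive review

scores = chi2_scores(X, y)
kept = select_k_best(scores, 2)
print(scores, kept)
```

Here _film_ occurs equally often in both classes, so it scores zero and is dropped, while _bad_ and _good_ are kept.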

-----

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

In [53]:

train_counts = tf.fit_transform(train['data'])
test_data = tf.transform(test['data'])

nb = MultinomialNB()

clf = nb.fit(train_counts, train['target'])
predicted = clf.predict(test_data)


print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test_data, test['target'])))

NB prediction accuracy =  82.0%


In [66]:
from sklearn.feature_selection import SelectKBest, chi2

num_k = 10000

ch2 = SelectKBest(chi2, k=num_k)
xtr = ch2.fit_transform(train_counts, train['target'])
xt = ch2.transform(test_data)

In [67]:
clf = nb.fit(xtr, train['target'])
predicted = clf.predict(xt)

print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(xt, test['target'])))

NB prediction accuracy =  82.0%


In [82]:
# `tf` is the TfidfVectorizer fit above
feature_names = tf.get_feature_names()
feature_names = np.array([feature_names[i] for i in ch2.get_support(indices=True)])

In [92]:
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)

top_count = 20

for idx, target in enumerate(train['target_names']):
    # Indices of the features with the highest log probability for this class
    top_names = np.argsort(nb.feature_log_prob_[idx])[-top_count:]
    print('{0}:'.format(target))
    pp.pprint([name for name in feature_names[top_names]])

alt.atheism:
[ 'islam', 'islamic', 'say', 'allan', 'wpd', 'cco', 'article', 'solntze', 'don',
  'morality', 'sgi', 'schneider', 'people', 'atheism', 'com', 'livesey',
  'atheists', 'caltech', 'god', 'keith']
comp.graphics:
[ 'software', 'does', 'polygon', 'computer', 'version', 'format', 'images',
  'looking', 'need', 'help', 'file', 'program', 'com', 'nntp', 'host', 'files',
  '3d', 'thanks', 'image', 'graphics']
comp.os.ms-windows.misc:
[ 'help', 'nntp', 'version', 'ftp', 'host', 'problem', 'program', 'card',
  'using', 'com', 'win', 'thanks', 'use', 'drivers', 'ms', 'driver', 'files',
  'file', 'dos', 'windows']
comp.sys.ibm.pc.hardware:
[ 'does', 'monitor', 'computer', 'motherboard', 'nntp', 'host', 'help',
  'drives', 'dos', 'disk', 'isa', 'thanks', 'pc', 'com', 'controller', 'bus',
  'ide', 'scsi', 'card', 'drive']
comp.sys.mac.hardware:
[ 'new', 'computer', 'duo', 'use', 'scsi', 'lc', 'com', 'problem', 'does',
  'monitor', 'se', 'simms', 'thanks', 'host', 'nntp', 'centris', 'dri