<DIV ALIGN=CENTER>

# Introduction to Practical Concepts
## Professor Robert J. Brunner
  
</DIV>  
-----
-----


## Introduction

In this IPython Notebook, we explore the following topics:

1. Repeating the unigram results (reading in the data and setting things up).
2. Stemming.
3. Stemming results.
4. n-grams.
5. Bigram results.
6. Trigram results.
7. Comparison of results (grid search?).

The running example is sentiment analysis of movie reviews.



-----

In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We do this to ignore several specific Pandas warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
import nltk
mvr = nltk.corpus.movie_reviews

from sklearn.datasets import load_files

data_dir = '/home/data_scientist/nltk_data/corpora/movie_reviews'
mvr = load_files(data_dir, shuffle = False)
print('Number of Reviews: {0}'.format(len(mvr.data)))

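# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split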

mvr_train, mvr_test, y_train, y_test = train_test_split(
    mvr.data, mvr.target, test_size=0.25, random_state=23)

Number of Reviews: 2000


## n-grams

Formally, an [_n-gram_][ngd] is a contiguous sequence of **n** items from a
parent sequence of items, such as characters or words in a text
document. In general, we will focus solely on words in a document. Thus,
our initial approach has simply been to look at unigrams, or single
words, in a document when building a classification model. However,
sometimes a combination of words can be more descriptive; for example,
_unbelievably bad_ is generally viewed as a more powerful description
than just _bad_. With _n-grams_, sequences of adjacent words can be
treated as features. In fact, Google allows a user to search for
[specific n-gram][gnv] combinations in books that they have scanned.
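
To make the definition concrete, the following is a minimal sketch in plain Python of how word n-grams can be extracted from a sentence. The function name `word_ngrams` and its whitespace tokenization are assumptions made for illustration; this is not how scikit-learn tokenizes text.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    # Hypothetical tokenizer: lowercase and split on whitespace only
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

example_grams = word_ngrams('unbelievably bad movie', 1, 2)
print(example_grams)
```

Here the bigram _unbelievably bad_ becomes a feature in its own right, alongside the three unigrams.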

While this can improve classification power, it also increases
computational requirements, because the number of possible features
grows exponentially with the n-gram order. For example, given a
vocabulary of $n$ words, there are $n \times (n - 1)$ possible bigrams of
two different words ($n^2$ if a word may follow itself), roughly $n^3$
possible trigrams, and so on for higher order combinations. While this is
not a problem for small vocabularies, for larger vocabularies (and
corresponding documents) the number of possible features can quickly
become very large. Thus, many text mining applications make use of
Hadoop or Spark clusters to leverage the inherent parallelism in these
tasks.
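
As a rough sketch of this growth: if a word is allowed to follow itself, a vocabulary of $n$ words admits at most $n^k$ distinct n-grams of order $k$. The helper `possible_ngrams` and the vocabulary size of 10,000 are assumptions chosen for illustration; real text contains far fewer combinations than this upper bound.

```python
def possible_ngrams(vocab_size, order):
    """Upper bound on distinct n-grams of a given order."""
    return vocab_size ** order

# Hypothetical vocabulary of 10,000 words
for order in (1, 2, 3):
    print(order, possible_ngrams(10000, order))
```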

To demonstrate n-grams, we first apply the concept to a single sentence.

-----
[gnv]: https://books.google.com/ngrams
[ngd]: https://en.wikipedia.org/wiki/N-gram

In [47]:
my_text = 'INFO490 introduces many concepts in data science.'

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english', ngram_range=(1,4), lowercase=True)

tk_func = cv.build_analyzer()

import pprint
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

pp.pprint(tk_func(my_text))

[ 'info490', 'introduces', 'concepts', 'data', 'science', 'info490 introduces',
  'introduces concepts', 'concepts data', 'data science',
  'info490 introduces concepts', 'introduces concepts data',
  'concepts data science', 'info490 introduces concepts data',
  'introduces concepts data science']


In [57]:
# Wrap the sentence in a list, since fit expects an iterable of documents
my_list = [my_text]

cv = cv.fit(my_list)

import operator
my_voc = sorted(cv.vocabulary_.items(), key=operator.itemgetter(1))

print('Token mapping:')
print(40*'-')

for tokens, rank in my_voc:
    print(rank, tokens)

print(40*'-')
out_list = ['INFO490 is data science']
xsm = cv.transform(out_list)
print(out_list)
print(40*'-')
print(xsm.todense())

Token mapping:
----------------------------------------
0 concepts
1 concepts data
2 concepts data science
3 data
4 data science
5 info490
6 info490 introduces
7 info490 introduces concepts
8 info490 introduces concepts data
9 introduces
10 introduces concepts
11 introduces concepts data
12 introduces concepts data science
13 science
----------------------------------------
['INFO490 is data science']
----------------------------------------
[[0 0 0 1 1 1 0 0 0 0 0 0 0 1]]
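
The row of counts above can be understood with a small plain-Python sketch: each n-gram found in the new document is looked up in the fitted vocabulary, and n-grams that were never seen during fitting are ignored. The function `count_vector` and the five-entry vocabulary below are hypothetical, chosen only to keep the example short.

```python
def count_vector(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in tokens.

    vocabulary maps an n-gram to its column index.
    """
    row = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:  # n-grams not seen during fitting are skipped
            row[idx] += 1
    return row

# Hypothetical, shortened vocabulary (n-gram -> column index)
vocabulary = {'concepts': 0, 'data': 1, 'data science': 2,
              'info490': 3, 'science': 4}

# Unigrams and bigrams of 'INFO490 is data science' after stop word removal
tokens = ['info490', 'data', 'science', 'info490 data', 'data science']

print(count_vector(tokens, vocabulary))
```

The bigram `info490 data` does not appear in the vocabulary, so it contributes nothing to the row.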


-----

## Student Activity

In the preceding cells, we used XXX. Now that you
have run the Notebook, go back and make the following changes to see how
the results change.

1. Change 
2. Change 
3. Try making 

Finally, try applying 

-----

### N-gram classification

Using n-grams can improve classification, since word or token
combinations often carry more information than single words or tokens.
For example, _University Illinois_ means more than just _University_ and
_Illinois_. We can build on our previous simple text classification
pipeline to develop a more complete code example that builds a
feature vector containing both single words and bigrams from the
documents. We use this new sparse matrix to classify the documents with
our simple Naive Bayes classifier, which obtains slightly better
results.

-----

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
pclf = Pipeline(tools)


# Lowercase the text, remove English stop words, and build features
# from both unigrams and bigrams
pclf.set_params(cv__stop_words = 'english',
                cv__ngram_range=(1,2),
                cv__lowercase=True)

pclf.fit(mvr_train, y_train)
y_pred = pclf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.84      0.71      0.77       259
        pos       0.73      0.85      0.79       241

avg / total       0.79      0.78      0.78       500



In [4]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 421010


In [5]:
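# Extend to trigrams, and drop tokens that appear in fewer than two
# documents (min_df) or in more than half of the documents (max_df)
pclf.set_params(cv__stop_words = 'english',
                cv__ngram_range=(1,3),
                cv__lowercase=True,
                cv__min_df=2,
                cv__max_df=0.5)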

pclf.fit(mvr_train, y_train)
y_pred = pclf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.84      0.78      0.81       259
        pos       0.78      0.83      0.81       241

avg / total       0.81      0.81      0.81       500



In [6]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 62735
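
The drop in the number of features comes from the `min_df` and `max_df` settings. The following plain-Python sketch shows the idea of pruning by document frequency, assuming an integer `min_df` is an absolute document count and a float `max_df` is a fraction of the documents. The function `prune_by_document_frequency` and the four toy documents are hypothetical.

```python
def prune_by_document_frequency(docs, min_df=2, max_df=0.5):
    """Keep tokens found in at least min_df documents and in at most
    a fraction max_df of all documents. docs is a list of token lists."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        # Count each token once per document
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return sorted(tok for tok, count in df.items()
                  if count >= min_df and count <= max_df * n_docs)

# Hypothetical tokenized reviews
docs = [['film', 'great', 'plot'],
        ['film', 'bad', 'plot'],
        ['film', 'great', 'cast'],
        ['film', 'dull']]

print(prune_by_document_frequency(docs))
```

In this toy example, `film` is removed because it appears in every document, while `bad`, `cast`, and `dull` are removed because each appears in only one.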


-----

## Student Activity

In the preceding cells, we used XXX. Now that you
have run the Notebook, go back and make the following changes to see how
the results change.

1. Change 
2. Change 
3. Try making 

Finally, try applying 

-----

-----

### Stemming

-----

In [7]:
import string
import nltk
from nltk.stem.porter import PorterStemmer

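# Create the stemmer once, rather than on every call
stemmer = PorterStemmer()

def tokenize(text):
    # Split into word tokens and drop punctuation tokens
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]

    # Return a list (not a lazy map object) of stemmed tokens
    return [stemmer.stem(token) for token in tokens]

# min_df and max_df keep the values set in the earlier cell, since
# set_params only changes the parameters passed to it
pclf.set_params(cv__stop_words = 'english',
                cv__ngram_range=(1,3),
                cv__lowercase=True,
                cv__tokenizer=tokenize)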

pclf.fit(mvr_train, y_train)
y_pred = pclf.predict(mvr_test)
print(metrics.classification_report(y_test, y_pred, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.83      0.77      0.80       259
        pos       0.77      0.83      0.80       241

avg / total       0.80      0.80      0.80       500



In [8]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 80529
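
The `tokenize` function above relies on NLTK's Porter stemmer. As a rough illustration of what stemming does, the following toy sketch strips a few common suffixes so that inflected forms collapse onto one token. It is deliberately far simpler than the Porter algorithm, and the names `SUFFIXES` and `toy_stem` are hypothetical.

```python
# Hypothetical, deliberately tiny rule set (not the Porter rules)
SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['acting', 'acted', 'acts', 'act', 'movies']
print([toy_stem(w) for w in words])
```

The first four words all map to the same stem, so they would share a single feature. The stem of _movies_ is not a dictionary word, which is typical of rule-based stemmers.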


-----

## Student Activity

In the preceding cells, we used XXX. Now that you
have run the Notebook, go back and make the following changes to see how
the results change.

1. Change 
2. Change 
3. Try making 

Finally, try applying 

-----

### Clustering Analysis

We can also apply clustering analysis to our feature matrix. While
finding an unknown number of clusters in text documents can be
difficult, we can learn about our data by setting the number of clusters
to the number of **known** labels. To demonstrate, in the following code
cells, we employ k-means to find two clusters in the movie review
feature matrix, and then twenty clusters in the twenty newsgroups
feature matrix. In each case, we identify the most frequently used words
in each cluster.

-----

In [9]:
from sklearn.cluster import KMeans

true_k = 2

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

from sklearn.feature_extraction.text import CountVectorizer

# Verify attributes

cv = CountVectorizer(stop_words = 'english', \
                     ngram_range=(1, 3), max_features=100000)

train_counts = cv.fit_transform(mvr_train)
test_data = cv.transform(mvr_test)

km.fit(test_data)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=2, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [10]:
top_tokens = 20
labels = ['Neg', 'Pos']

print('Top {} tokens per cluster:\n'.format(top_tokens))

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# In scikit-learn 1.0 and later, use cv.get_feature_names_out() instead
terms = cv.get_feature_names()

for idx in range(true_k):
    print("Cluster {0}:".format(idx), end='')
    for jdx in order_centroids[idx, :top_tokens]:
        print(' {0}'.format(terms[jdx]), end='')
    print('\n')

Top 20 tokens per cluster:

Cluster 0: film like movie just time good character story way characters scene really make films does plot life man people scenes

Cluster 1: film movie like just good time story character way does plot characters make life little really man people bad movies
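
The listing above is produced by sorting each cluster center from largest to smallest weight and reading off the matching terms. The following plain-Python sketch mirrors that step without NumPy; the function `top_terms`, the four terms, and the centroid values are all made up for illustration.

```python
def top_terms(centroid, terms, k=3):
    """Return the k terms with the largest weights in one cluster center."""
    # Indices of the centroid sorted by weight, largest first
    order = sorted(range(len(centroid)), key=lambda i: centroid[i],
                   reverse=True)
    return [terms[i] for i in order[:k]]

# Hypothetical vocabulary and cluster centers
terms = ['bad', 'film', 'good', 'plot']
centroids = [[0.1, 2.0, 1.5, 0.4],
             [1.2, 1.8, 0.2, 0.9]]

for idx, centroid in enumerate(centroids):
    print('Cluster {0}: {1}'.format(idx, ' '.join(top_terms(centroid, terms))))
```

As in the real output, very common terms such as _film_ can dominate every cluster, which makes the clusters hard to tell apart.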



In [11]:
# load dataset
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', subset='train', shuffle=True, random_state=23)
test = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', subset='test', shuffle=True, random_state=23)

In [12]:
from sklearn.cluster import KMeans

true_k = 20

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

from sklearn.feature_extraction.text import CountVectorizer

# Verify attributes

cv = CountVectorizer(stop_words = 'english', max_features=100000)
train_counts = cv.fit_transform(train['data'])
test_data = cv.transform(test['data'])

km.fit(test_data)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=20, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [13]:
top_tokens = 20
labels = test['target']

print('Top {} tokens per cluster:\n'.format(top_tokens))

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# In scikit-learn 1.0 and later, use cv.get_feature_names_out() instead
terms = cv.get_feature_names()

for idx in range(true_k):
    print("Cluster {0}:".format(idx), end='')
    for jdx in order_centroids[idx, :top_tokens]:
        print(' {0}'.format(terms[jdx]), end='')
    print('\n')

Top 20 tokens per cluster:

Cluster 0: edu subject com lines organization writes article university posting host nntp don just like know think people does ca time

Cluster 1: jehovah elohim lord god christ father mcconkie unto son ps jesus said gods shall thou thee mormon thy earth stated

Cluster 2: dos windows microsoft tcp ms mouse amiga software pc graphics higher macintosh network mbytes version 00 ip memory support card

Cluster 3: people god edu don like just know think new does say time way 10 subject com said right did government

Cluster 4: jpeg image gif file color format images quality version files bit free programs available use jfif software don display edu

Cluster 5: gopher search edu client pub database software ftp information veronica macintosh data retrieve available unix world micro clients mail sites

Cluster 6: 25 mac files comp disk file software sys macintosh ftp faq questions stuffit 75 54 hard need available 102 apple

Cluster 7: edu image graphics data pub 

-----

## Student Activity

In the preceding cells, we used XXX. Now that you
have run the Notebook, go back and make the following changes to see how
the results change.

1. Change 
2. Change 
3. Try making 

Finally, try applying 

-----

-----
## Dimension Reduction

The feature matrices are large, so we now reduce the number of features. PCA can be difficult to apply to matrices of this size; incremental PCA or truncated SVD are possible alternatives. Here, however, we simply select the best k features.

-----

In [14]:
# The following example was inspired by a scikit-learn demo
# http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

In [15]:
# First, train on normal set of features, baseline performance.

train_counts = tf.fit_transform(train['data'])
test_data = tf.transform(test['data'])

nb = MultinomialNB()
nb = nb.fit(train_counts, train['target'])
predicted = nb.predict(test_data)

print("Prediction accuracy = {0:5.1f}%".format(100.0 * nb.score(test_data, test['target'])))
print('Number of Features = {}'.format(nb.feature_log_prob_.shape[1]))

Prediction accuracy =  82.0%
Number of Features = 129792


In [16]:
from sklearn.feature_selection import SelectKBest, chi2

num_k = 10000

ch2 = SelectKBest(chi2, k=num_k)
xtr = ch2.fit_transform(train_counts, train['target'])
xt = ch2.transform(test_data)

In [17]:
nb = nb.fit(xtr, train['target'])
predicted = nb.predict(xt)

print("NB prediction accuracy = {0:5.1f}%".format(100.0 * nb.score(xt, test['target'])))
print('Number of Features = {}'.format(nb.feature_log_prob_.shape[1]))

NB prediction accuracy =  82.0%
Number of Features = 10000
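
To give a feel for how features are ranked, the following is a minimal plain-Python sketch of a chi-squared score. It assumes the score compares the observed per-class total of each feature with the total expected if the feature were independent of the class. The function `chi2_scores` and the toy matrix are hypothetical, and the sketch is not scikit-learn's implementation.

```python
def chi2_scores(X, y):
    """Chi-squared score of each feature column in X against labels y.

    X is a list of rows of non-negative counts; y is a list of labels.
    """
    n_samples = len(X)
    n_features = len(X[0])
    # Total count of each feature over all documents
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            # Expected total if the feature were independent of the class
            expected = class_prob * feature_total[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

# Hypothetical counts: feature 0 occurs only in class 0,
# feature 1 occurs equally in both classes
X = [[3, 1], [2, 1], [0, 1], [0, 1]]
y = [0, 0, 1, 1]

print(chi2_scores(X, y))
```

Keeping the k highest-scoring features retains those whose counts differ most between classes, such as feature 0 in this toy example.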


In [18]:
# In scikit-learn 1.0 and later, use tf.get_feature_names_out() instead
feature_names = tf.get_feature_names()

indices = ch2.get_support(indices=True)
feature_names = np.array([feature_names[idx] for idx in indices])

In [19]:
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

top_count = 20

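for idx, target in enumerate(train['target_names']):
    # Rank features by their log probability within this class
    # (feature_log_prob_ is used in place of the removed coef_ attribute)
    top_names = np.argsort(nb.feature_log_prob_[idx])[-top_count:]
    tn_lst = [name for name in feature_names[top_names]]
    # Put the highest-ranked feature first
    tn_lst.reverse()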

    print('\n{0}:'.format(target))
    pp.pprint(tn_lst)


alt.atheism:
[ 'keith', 'god', 'caltech', 'atheists', 'livesey', 'com', 'atheism', 'people',
  'schneider', 'sgi', 'morality', 'don', 'solntze', 'article', 'cco', 'wpd',
  'allan', 'say', 'islamic', 'islam']

comp.graphics:
[ 'graphics', 'image', 'thanks', '3d', 'files', 'host', 'nntp', 'com',
  'program', 'file', 'help', 'need', 'looking', 'images', 'format', 'version',
  'computer', 'polygon', 'does', 'software']

comp.os.ms-windows.misc:
[ 'windows', 'dos', 'file', 'files', 'driver', 'ms', 'drivers', 'use', 'thanks',
  'win', 'com', 'using', 'card', 'program', 'problem', 'host', 'ftp', 'version',
  'nntp', 'help']

comp.sys.ibm.pc.hardware:
[ 'drive', 'card', 'scsi', 'ide', 'bus', 'controller', 'com', 'pc', 'thanks',
  'isa', 'disk', 'dos', 'drives', 'help', 'host', 'nntp', 'motherboard',
  'computer', 'monitor', 'does']

comp.sys.mac.hardware:
[ 'mac', 'apple', 'quadra', 'drive', 'centris', 'nntp', 'host', 'thanks',
  'simms', 'se', 'monitor', 'does', 'problem', 'com', 'lc', 'scsi

-----

## Student Activity

In the preceding cells, we used XXX. Now that you
have run the Notebook, go back and make the following changes to see how
the results change.

1. Change 
2. Change 
3. Try making 

Finally, try applying 

-----