<DIV ALIGN=CENTER>

# Introduction to Practical Concepts
## Professor Robert J. Brunner
  
</DIV>  
-----


## Introduction

In this IPython Notebook, we explore the following topics:

1. Repeating the unigram results (reading in the data and setting things up)
2. Stemming
3. Stemming results
4. n-grams
5. Bigram results
6. Trigram results
7. Comparison of results (possibly with a grid search)

A possible application is sentiment analysis, for example of movie reviews.



-----

In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings to keep the notebook output uncluttered
import warnings
warnings.filterwarnings("ignore")

In [2]:
# load dataset
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', subset='train', shuffle=True, random_state=23)
test = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', subset='test', shuffle=True, random_state=23)

## n-grams

Formally, an [_n-gram_][ngd] is a contiguous sequence of **n** items from a
parent sequence of items, such as characters or words in a text
document. In general, we will focus solely on words in a document. Thus,
our initial approach has simply been to look at unigrams, or single
words, in a document when building a classification model. However,
sometimes a combination of words is more descriptive; for example,
_unbelievably bad_ is generally viewed as a more powerful description
than just _bad_. The concept of an _n-gram_ captures this by allowing
sequences of words to be treated as features. In fact, Google
allows a user to search for [specific n-gram][gnv] combinations in books that
it has scanned.
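
To make the definition concrete, the following minimal sketch extracts word n-grams from a short phrase by using plain Python. The `ngrams` helper and the sample sentence are illustrative assumptions and are not part of the notebook's pipeline.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, split on whitespace
tokens = "the movie was unbelievably bad".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
```

Note that a sequence of five tokens yields four bigrams and three trigrams, and that the bigram `('unbelievably', 'bad')` is kept as a single feature.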

While this can clearly improve classification power, it also increases
computational requirements, because the number of possible features
grows exponentially with the n-gram order. For example, given a
vocabulary of $n$ words, there are up to $n^2$ possible bigrams (ordered
pairs, repeats allowed), up to $n^3$ possible trigrams, and so on for
higher order combinations. In practice only a fraction of these
combinations occur, but for larger vocabularies (and corresponding
documents) the number of observed features can still become very large.
Thus, many text mining applications make use of Hadoop or Spark clusters
to leverage the inherent parallelism in these tasks.
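
As a rough illustration of this growth, the sketch below counts the distinct n-grams observed in a tiny, made-up text and compares each count with the upper bound obtained by allowing any ordered sequence of vocabulary words (repeats included). The text and the `count_distinct_ngrams` helper are assumptions made only for this example.

```python
def count_distinct_ngrams(tokens, k):
    """Number of distinct contiguous k-token sequences in tokens."""
    return len({tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)})

# Hypothetical toy corpus
text = ("the cat sat on the mat and the dog sat on the rug "
        "while the cat and the dog sat on the floor")
tokens = text.split()
vocab_size = len(set(tokens))

for k in (1, 2, 3):
    observed = count_distinct_ngrams(tokens, k)
    upper_bound = vocab_size ** k  # every ordered k-word sequence
    print("k={}: {} observed, at most {} possible".format(k, observed, upper_bound))
```

The observed counts stay far below the upper bound, which is why sparse matrices are practical, but both grow as the n-gram order increases.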

To demonstrate using n-grams, the following code example builds a
feature vector containing both single words and bigrams from the
documents. We use this new sparse matrix to classify the documents by
using our simple Naive Bayes classifier, which obtains slightly better
results than with unigrams alone.

-----
[gnv]: https://books.google.com/ngrams
[ngd]: https://en.wikipedia.org/wiki/N-gram

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
pclf = Pipeline(tools)

# Lowercase the text, remove English stop words, and use both unigrams and bigrams
pclf.set_params(cv__stop_words='english',
                cv__ngram_range=(1, 2),
                cv__lowercase=True)

pclf = pclf.fit(train['data'], train['target'])
predicted = pclf.predict(test['data'])

print("NB (Bi-Grams with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * pclf.score(test['data'], test['target'])))

NB (Bi-Grams with Stop Words) prediction accuracy =  80.4%


In [4]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 1186573


In [8]:
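# As before, but also drop terms that occur in fewer than two documents
# (min_df=2) or in more than half of the documents (max_df=0.5)
pclf.set_params(cv__stop_words='english',
                cv__ngram_range=(1, 2),
                cv__lowercase=True,
                cv__min_df=2,
                cv__max_df=0.5)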

pclf = pclf.fit(train['data'], train['target'])
predicted = pclf.predict(test['data'])

print("NB (Bi-Grams with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * pclf.score(test['data'], test['target'])))

NB (Bi-Grams with Stop Words) prediction accuracy =  79.9%


In [9]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 305509
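
The drop in the number of features comes from the document-frequency limits. The following sketch mimics that filtering on a toy corpus with plain Python: a term is kept only if it appears in at least `min_count` documents and in at most a fraction `max_fraction` of all documents. The function, its parameter names, and the documents are hypothetical; the sketch approximates the idea and does not reproduce the internals of `CountVectorizer`.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_count=2, max_fraction=0.5):
    """Keep terms found in at least min_count docs and at most max_fraction of docs."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    max_count = max_fraction * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_count <= df <= max_count)

# Hypothetical toy documents
docs = [
    "the shuttle reached orbit",
    "the orbit of the moon",
    "the team won the game",
    "the game went to overtime",
]

kept = filter_by_document_frequency(docs)
print(kept)
```

Here the word _the_ is removed because it occurs in every document, and words that occur in only one document are removed as too rare.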


-----

Student Activity

-----

In [19]:
import string
import nltk
from nltk.stem.porter import PorterStemmer


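# Create the stemmer once, rather than on every call
stemmer = PorterStemmer()

def tokenize(text):
    # Split the text into tokens and drop punctuation tokens
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]

    # Return a list of stems (in Python 3, map returns a lazy iterator)
    return [stemmer.stem(token) for token in tokens]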

pclf.set_params(cv__stop_words = 'english', \
                cv__ngram_range=(1,1), \
                cv__lowercase=True, \
                cv__tokenizer=tokenize)

pclf = pclf.fit(train['data'], train['target'])
predicted = pclf.predict(test['data'])

print("NB (Stemming with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * pclf.score(test['data'], test['target'])))

NB (Stemming with Stop Words) prediction accuracy =  80.2%


In [20]:
# Extract the classifier
clf = pclf.steps[1][1]
print('Number of Features = {}'.format(clf.feature_log_prob_.shape[1]))

Number of Features = 54220
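
Stemming shrinks the vocabulary because different inflections of a word collapse to one feature. The sketch below shows the effect with a deliberately crude suffix stripper; `toy_stem` is a hypothetical stand-in written for this example and is **not** the Porter algorithm used above.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical word list
words = ["cluster", "clusters", "clustered", "clustering", "planet", "planets"]
stems = [toy_stem(w) for w in words]

print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
```

A real stemmer applies many more rules, but the consequence for the feature matrix is the same: fewer columns, each aggregating the counts of related word forms.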


### Clustering Analysis

We can also apply clustering analysis to our feature matrix. While
finding an unknown number of clusters in text documents can be
difficult, we can learn about our data by asking for as many clusters as
we have **known** labels. To demonstrate, in the following code cells, we
employ k-means to find twenty clusters (one per newsgroup) in our feature
matrix, after which we identify the most heavily weighted words in each
cluster center.
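
The idea behind listing the top words for a cluster is to sort the entries of its center by weight and map the largest ones back to vocabulary terms. The following minimal sketch does this in plain Python; the `terms` list and the centroid values are invented for illustration, and `top_terms` is a hypothetical helper.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest centroid weights, highest first."""
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]

# Hypothetical vocabulary and cluster centers (one weight per term)
terms = ["earth", "god", "jpeg", "moon", "team"]
centroids = [
    [2.1, 0.0, 0.1, 1.7, 0.0],
    [0.0, 3.0, 0.0, 0.1, 0.2],
]

for i, centroid in enumerate(centroids):
    print("Cluster {}: {}".format(i, " ".join(top_terms(centroid, terms))))
```

With NumPy arrays, the same descending ordering can be obtained for every row at once by reversing the result of `argsort`.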

-----

In [16]:
from sklearn.cluster import KMeans

true_k = 20

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

from sklearn.feature_extraction.text import CountVectorizer

# Build unigram and bigram counts, keeping at most 100,000 features

cv = CountVectorizer(stop_words = 'english', ngram_range=(1,2), max_features=100000)
train_counts = cv.fit_transform(train['data'])
test_data = cv.transform(test['data'])

km.fit(test_data)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=20, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [17]:
labels = test['target']

print("Top 10 tokens per cluster:\n")

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = cv.get_feature_names_out()

for i in range(true_k):
    print("Cluster {0}:".format(i), end='')
    for ind in order_centroids[i, :10]:
        print(' {0}'.format(terms[ind]), end='')
    print('\n')

Top 10 tokens per cluster:

Cluster 0: edu 00 people image new like don time said university

Cluster 1: planet earth spacecraft solar surface sun venus moon atmosphere planets

Cluster 2: edu subject com lines organization writes article university posting don

Cluster 3: 92 12 12 92 10 hiv 17 10 12 11 aids patients

Cluster 4: dos dos dos windows microsoft windows microsoft tcp ms mouse amiga software

Cluster 5: inches pc diagonal compatible horizontal frequencies price max vertical resolution

Cluster 6: mb m4 ms ma mz mm m1 mo mc mu

Cluster 7: team year winning division things runs win series games double

Cluster 8: god homosexuality people paul love homosexual christ christians church jesus

Cluster 9: edu graphics pub mail ray 128 send 3d ftp com

Cluster 10: jpeg image gif file color format images quality version files

Cluster 11: jehovah elohim lord god christ father mcconkie unto son ps

Cluster 12: slip com driver use phone packet file dos ip cwru

Cluster 13: cancer hiv 