# Text analytics using Scikit-learn

&nbsp;
&nbsp;

## Mark Wicks

# Goals for this session

* Dive deeper into Jupyter notebooks

* Describe steps in solving data science problems

* Introduce Scikit-Learn API

* Build a simple document classifier using Scikit-Learn


### Steps to build a typical prediction model

1. Get your data (and clean it up if necessary).
2. RANDOMLY split your data into training, validation, and test sets. The validation set is used to tune the model. The test set is used to confirm that the patterns learned are not unique to the training data. (Also check out [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).)
3. Extract features from your data.
4. Build a preliminary model.
5. Evaluate the model on the validation data.
6. If not satisfied, adjust "tunable" training parameters (e.g., tree depth, feature selection, etc.) and go back to step 3 or 4.
7. Evaluate the final model performance on the test data.

Note: Today we'll use the same data for testing and validation. 
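
The random split in step 2 can be sketched with scikit-learn's `train_test_split`. This is only an illustration, not the workflow used later in this session (which loads files that are already split); the toy `data` and `labels` lists and the 60/20/20 proportions are assumptions made for the example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with alternating binary labels
data = list(range(100))
labels = [i % 2 for i in data]

# First carve off 40% of the samples, then halve that remainder
# to get a 60/20/20 train/validation/test split.
train_x, rest_x, train_y, rest_y = train_test_split(
    data, labels, test_size=0.4, random_state=0)
valid_x, test_x, valid_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, random_state=0)

print(len(train_x), len(valid_x), len(test_x))
```

Fixing `random_state` makes the shuffle repeatable, which helps when comparing models across runs.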

### Introduction to the Scikit-Learn API

* Most classifiers, regressors, and feature extractors/transformers share a common API

### Feature extraction/transformation:

      # Choose an extractor right for the data
      extractor = sklearn.some_extractor()

      # Learn some features from the data
      extractor.fit(training_data) 

      # Extract features from the data
      features = extractor.transform(new_data)  

      # Or combine previous two steps
      features = extractor.fit_transform(data)

##### Note: Data and features are represented by various types of Python arrays/matrices

### Classification/regression
 
    # Choose a classifier
    cfier = sklearn.some_classifier()

    # Build a model
    cfier.fit(training_features, training_labels)

    # Make predictions
    predicted_labels = cfier.predict(new_features)
   
    # Make predictions as probabilities
    probabilities = cfier.predict_proba(new_features)
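
As a concrete instance of the pattern above, here is a small sketch using `MultinomialNB`. The feature matrix is made up for illustration: each row is assumed to hold the counts of two hypothetical words in one message, and the labels mark which messages are spam.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts per message: [count of "free", count of "meeting"]
training_features = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
training_labels = np.array([True, True, False, False])  # True means spam

# Choose a classifier and build a model
cfier = MultinomialNB()
cfier.fit(training_features, training_labels)

# Make predictions, as labels and as probabilities
new_features = np.array([[4, 0], [0, 1]])
predicted_labels = cfier.predict(new_features)
probabilities = cfier.predict_proba(new_features)

print(predicted_labels)
print(cfier.classes_)          # column order of the probabilities
print(probabilities.round(3))
```

The columns of `predict_proba` follow the order of `cfier.classes_`, so it is worth printing that attribute before indexing into the probabilities.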


# Example: Feature extraction/transformation

In [3]:
import sklearn.preprocessing
extractor = sklearn.preprocessing.LabelEncoder()

# Learn some features from the data
extractor.fit(
    ['CA', 'AZ', 'CA', 'DE', 'AZ', 'DE']) 

# Extract features from the data
features = extractor.transform(
    ['AZ', 'DE', 'DE', 'CA'])  
print(features)

# Or combine previous two steps (when data is the same)
features = extractor.fit_transform(
    ['CA', 'AZ', 'CA', 'DE', 'AZ', 'DE'])

print(features)


[0 2 2 1]
[1 0 1 2 0 2]
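
The integer codes above follow the sorted order of the labels seen during `fit`. A short sketch, using the same state codes, of how to inspect that mapping and convert codes back to labels:

```python
import sklearn.preprocessing

extractor = sklearn.preprocessing.LabelEncoder()
extractor.fit(['CA', 'AZ', 'CA', 'DE', 'AZ', 'DE'])

# classes_ holds the learned labels; a label's position is its code
classes = [str(c) for c in extractor.classes_]
print(classes)

# inverse_transform maps integer codes back to the original labels
decoded = [str(d) for d in extractor.inverse_transform([0, 2, 2, 1])]
print(decoded)
```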


# Word counts are common features in text analyzers:

We'll use a CountVectorizer() to learn the words in some sample data.  Then we'll extract a feature vector for some sample text.

In [87]:
import sklearn.feature_extraction.text

quotes = [ 
"Imagination is more important than knowledge",
"If music be the food of love, play on",
"the way to get started is to quit talking and begin doing",
"Obstacles are those frightful things you see when you take your eyes off the goal",
"when you come to a fork in the road take it",
"live as if you were to die tomorrow. "
    "Learn as if you were to live forever"]

vectorizer = sklearn.feature_extraction.text.CountVectorizer(
                                ngram_range=(1,1),
                                analyzer='word')
vectorizer.fit(quotes)

print( [ (word, vectorizer.vocabulary_[word])
        for word in sorted(vectorizer.vocabulary_) ] )

[(u'and', 0), (u'are', 1), (u'as', 2), (u'be', 3), (u'begin', 4), (u'come', 5), (u'die', 6), (u'doing', 7), (u'eyes', 8), (u'food', 9), (u'forever', 10), (u'fork', 11), (u'frightful', 12), (u'get', 13), (u'goal', 14), (u'if', 15), (u'imagination', 16), (u'important', 17), (u'in', 18), (u'is', 19), (u'it', 20), (u'knowledge', 21), (u'learn', 22), (u'live', 23), (u'love', 24), (u'more', 25), (u'music', 26), (u'obstacles', 27), (u'of', 28), (u'off', 29), (u'on', 30), (u'play', 31), (u'quit', 32), (u'road', 33), (u'see', 34), (u'started', 35), (u'take', 36), (u'talking', 37), (u'than', 38), (u'the', 39), (u'things', 40), (u'those', 41), (u'to', 42), (u'tomorrow', 43), (u'way', 44), (u'were', 45), (u'when', 46), (u'you', 47), (u'your', 48)]


# Define a simple helper function to make it easy to print the word count feature vectors

In [88]:
def print_counts(vectorizer, features):
    reverse = dict((v, k) for k,v in vectorizer.vocabulary_.items()) 
    print('counts:')
    print( [(reverse[i], count) 
            for i,count in zip(features.indices, features.data)])

In [89]:

features = vectorizer.transform([
"live as if you were to die tomorrow. "
        "Learn as if you were to live forever"])
print("vocabulary:\n{0}".format([ (word, vectorizer.vocabulary_[word])
        for word in sorted(vectorizer.vocabulary_) ] ))
print("features:\n{0}".format(features))

vocabulary:
[(u'and', 0), (u'are', 1), (u'as', 2), (u'be', 3), (u'begin', 4), (u'come', 5), (u'die', 6), (u'doing', 7), (u'eyes', 8), (u'food', 9), (u'forever', 10), (u'fork', 11), (u'frightful', 12), (u'get', 13), (u'goal', 14), (u'if', 15), (u'imagination', 16), (u'important', 17), (u'in', 18), (u'is', 19), (u'it', 20), (u'knowledge', 21), (u'learn', 22), (u'live', 23), (u'love', 24), (u'more', 25), (u'music', 26), (u'obstacles', 27), (u'of', 28), (u'off', 29), (u'on', 30), (u'play', 31), (u'quit', 32), (u'road', 33), (u'see', 34), (u'started', 35), (u'take', 36), (u'talking', 37), (u'than', 38), (u'the', 39), (u'things', 40), (u'those', 41), (u'to', 42), (u'tomorrow', 43), (u'way', 44), (u'were', 45), (u'when', 46), (u'you', 47), (u'your', 48)]
features:
  (0, 2)	2
  (0, 6)	1
  (0, 10)	1
  (0, 15)	2
  (0, 22)	1
  (0, 23)	2
  (0, 42)	2
  (0, 43)	1
  (0, 45)	2
  (0, 47)	2
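
The `(row, column)  count` pairs printed above are how SciPy displays a sparse matrix: only the non-zero counts are stored. The sketch below, on a one-sentence corpus chosen just for illustration, shows the same kind of matrix converted to a dense array so that it can be compared with the vocabulary.

```python
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(["to be or not to be"])

print(scipy.sparse.issparse(features))        # the result is a sparse matrix
print(features.shape)                         # (documents, vocabulary size)
print(sorted(vectorizer.vocabulary_.items())) # word -> column index

# Dense view: one row per document, one column per vocabulary word
dense = features.toarray()
print(dense)
```

Dense conversion is fine for a toy example, but for a real corpus the vocabulary can be very large, so it is usually better to keep the matrix sparse.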


In [90]:
features = vectorizer.transform([
"live as if you were to die tomorrow. "
        "Learn as if you were to live forever"])
print_counts(vectorizer, features)

features = vectorizer.transform(["Most of these words are new"])
print_counts(vectorizer, features)


counts:
[(u'as', 2), (u'die', 1), (u'forever', 1), (u'if', 2), (u'learn', 1), (u'live', 2), (u'to', 2), (u'tomorrow', 1), (u'were', 2), (u'you', 2)]
counts:
[(u'are', 1), (u'of', 1)]


# Word or character combinations can be very useful features:
    

In [91]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(
                                ngram_range=(1,2),
                                analyzer='char')
vectorizer.fit(quotes);


In [92]:
features = vectorizer.transform([
"live as if you were to die tomorrow. "
        "Learn as if you were to live forever"
    ])
# print(features)
print_counts(vectorizer, features)


counts:
[(u' ', 15), (u' a', 2), (u' d', 1), (u' f', 1), (u' i', 2), (u' l', 2), (u' t', 3), (u' w', 2), (u' y', 2), (u'.', 1), (u'. ', 1), (u'a', 3), (u'ar', 1), (u'as', 2), (u'd', 1), (u'di', 1), (u'e', 10), (u'e ', 5), (u'ea', 1), (u'er', 3), (u'ev', 1), (u'f', 3), (u'f ', 2), (u'fo', 1), (u'i', 5), (u'ie', 1), (u'if', 2), (u'iv', 2), (u'l', 3), (u'le', 1), (u'li', 2), (u'm', 1), (u'mo', 1), (u'n', 1), (u'n ', 1), (u'o', 8), (u'o ', 2), (u'om', 1), (u'or', 2), (u'ou', 2), (u'ow', 1), (u'r', 7), (u're', 3), (u'rn', 1), (u'ro', 1), (u'rr', 1), (u's', 2), (u's ', 2), (u't', 3), (u'to', 3), (u'u', 2), (u'u ', 2), (u'v', 3), (u've', 3), (u'w', 3), (u'w.', 1), (u'we', 2), (u'y', 2), (u'yo', 2)]
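
The same `ngram_range` parameter also works with `analyzer='word'`, producing word pairs in addition to single words. A minimal sketch on one of the quotes above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams of words rather than characters
vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
vectorizer.fit(["when you come to a fork in the road take it"])

vocabulary = sorted(vectorizer.vocabulary_)
print(len(vocabulary))
print(vocabulary)
```

Note that the default tokenizer keeps only tokens of two or more characters, so "a" is dropped and "to fork" appears as a bigram.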


# Example: Building a simple SPAM filter

&nbsp;

## (This is a classification problem)

### Import the libraries we need and take a look at the training data

In [93]:
import pandas

train = pandas.read_csv("datasets/smstrain.tsv",
                        sep='\t',
                        header=None,
                        names=('target', 'text'),
                        skipinitialspace = True)    
train.head(10)

| | target | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | I'm at home. Please call |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Hey! Congrats 2u2. id luv 2 but ive had 2 go h... |
| 5 | ham | After my work ah... Den 6 plus lor... U workin... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | ham | I had askd u a question some hours before. Its... |


## Extract features and train the classifier

In [94]:
import re
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as naive_bayes

preprocessor = lambda s: (re.sub("[0-9]+", ' WAS_A_NUMBER ',
                                 s.lower())
                     .replace(':)', ' SMILEY '))

vectorizer = text.CountVectorizer(lowercase=True,
                                  ngram_range=(1,3),
                                  stop_words='english',
                                  analyzer='word',
                                  preprocessor=preprocessor)
# Extract features and target...
train_features = vectorizer.fit_transform(train.text)
train_target = (train.target == 'spam');

# What does the preprocessor do?

In [95]:
preprocessor(
"Last year we had 350 new students "
    "and this year we have 450:)")

'last year we had  WAS_A_NUMBER  new students and this year we have  WAS_A_NUMBER  SMILEY '
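
To see which tokens the vectorizer would actually count after preprocessing, one option is `build_analyzer()`. The sketch below repeats the substitutions from the lambda above as a regular function and uses unigrams only to keep the output short; the sample message is made up.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocessor(s):
    # Same substitutions as the lambda above
    return (re.sub("[0-9]+", ' WAS_A_NUMBER ', s.lower())
            .replace(':)', ' SMILEY '))

# Unigrams only here, to keep the token list short
vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words='english',
                             analyzer='word',
                             preprocessor=preprocessor)

# The analyzer chains preprocessing, tokenizing, and stop-word removal
analyze = vectorizer.build_analyzer()
tokens = analyze("Win 500 prizes:)")
print(tokens)
```

Because a custom `preprocessor` replaces the built-in preprocessing step, the placeholder tokens keep their upper case even when `lowercase=True` is passed.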

# Train the model

In [96]:
# Train...
# MultinomialNB is a good choice when features are occurrence counts
# Other types of classifiers would work too
classifier = naive_bayes.MultinomialNB() # Default parameters are good choices
classifier.fit(train_features, train_target);

## At this point, we actually have a somewhat useful spam filter...

### Make predictions on the test data

In [97]:
import numpy

test = pandas.read_csv("datasets/smstest.tsv",
                        sep='\t',
                        header=None,
                        names=('target', 'text'),
                        skipinitialspace = True)
test_features = vectorizer.transform(test.text)

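# Hard predictions using the classifier's default decision rule
predictions = ['spam' if p else 'ham' 
               for p in classifier.predict(test_features)]

# Class probabilities as whole percentages (truncated);
# column 1 is P(spam)
probabilities = ((classifier.predict_proba(test_features)*100.0)
                 .astype(numpy.int64))

# Accuracy of the default decision rule
s = classifier.score(test_features, test.target == 'spam')
print('Accuracy: {0:5.3f}'.format(s))

# Replace the predictions above with a stricter rule:
# flag a message as spam only when P(spam) exceeds 60%
predictions = ['spam' if p > 60.0 else 'ham' 
               for p in probabilities[:,1]]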

Accuracy: 0.985
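
Raising the spam threshold trades missed spams for fewer hams flagged by mistake. A small sketch of the effect, using made-up probabilities rather than the model's output (the `label` helper is hypothetical):

```python
import numpy as np

# Hypothetical spam probabilities for four messages
spam_probability = np.array([0.05, 0.55, 0.65, 0.99])

def label(probabilities, threshold):
    """Label a message as spam only if its probability exceeds the threshold."""
    return ['spam' if p > threshold else 'ham' for p in probabilities]

for threshold in (0.5, 0.6):
    print(threshold, label(spam_probability, threshold))
```

The message at 0.55 is flagged at a threshold of 0.5 but passes at 0.6, which is the kind of borderline case a stricter threshold is meant to protect.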


## Evaluate the performance on the test set


In [98]:
import sklearn.metrics as metrics

# Rows are actual classes, columns are predicted classes,
# both in the order ('ham', 'spam')
confusion = metrics.confusion_matrix(test.target, predictions)
# True positive rate: share of actual spams that were detected
tpr=float(confusion[1,1])/(confusion[1,0]+confusion[1,1])
# False positive rate: share of actual hams that were flagged as spam
fpr=float(confusion[0,1])/(confusion[0,0]+confusion[0,1])
accuracy=float(confusion[0,0]+confusion[1,1])/confusion.sum()

check = pandas.DataFrame(list(zip(predictions, 
                                  probabilities[:,1], 
                                  test.target, 
                                  test.text)), 
                         columns=['Pred', 'Prob(%)',
                                  'Actual', 'Text'])

In [99]:
print(('Correct hams:  {0}\n' +
       'Correct spams:  {1}\n' + 
       'False Positives:  {2}\n' +
       'False Negatives: {3}\n\n' +
       '% of SPAMs detected: {4:4.1f}%\n' +
       'False positive rate: {5:4.1f}%\n' + 
       'Overall Accuracy:    {6:5.3f}%\n').format(
    confusion[0,0], confusion[1,1], confusion[0,1],
    confusion[1,0], tpr*100.0, fpr*100.0, accuracy*100.0))
print('AUROC: {0:5.5f}'
      .format(metrics.roc_auc_score(test.target == 'spam',
                                    probabilities[:,1])))

Correct hams:  1615
Correct spams:  218
False Positives:  3
False Negatives: 21

% of SPAMs detected: 91.2%
False positive rate:  0.2%
Overall Accuracy:    98.708%

AUROC: 0.96873
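
The rates can be re-derived by hand from the four counts printed above. The sketch below uses the standard definitions; the variable names are chosen for this example only.

```python
# Counts copied from the output above
true_neg, false_pos = 1615, 3    # actual hams:  kept / flagged as spam
false_neg, true_pos = 21, 218    # actual spams: missed / detected

total = true_neg + false_pos + false_neg + true_pos

# True positive rate (recall): share of actual spams that were detected
tpr = true_pos / (true_pos + false_neg)
# False positive rate: share of actual hams that were flagged as spam
fpr = false_pos / (false_pos + true_neg)
# False discovery rate: share of flagged messages that were actually ham
fdr = false_pos / (false_pos + true_pos)
accuracy = (true_pos + true_neg) / total

print('TPR: {0:.1%}  FPR: {1:.2%}  FDR: {2:.1%}  Accuracy: {3:.3%}'
      .format(tpr, fpr, fdr, accuracy))
```

The false positive rate and the false discovery rate share a numerator but have different denominators, so they can differ widely when hams greatly outnumber spams, as they do here.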


## What did it get right or wrong?
&nbsp;

### Hams that were classified correctly:

In [100]:
print('Hams that were classified correctly:')
check[(check.Pred == check.Actual) & (check.Pred == 'ham')]

Hams that were classified correctly:


| | Pred | Prob(%) | Actual | Text |
|---|---|---|---|---|
| 0 | ham | 0 | ham | I also thk too fast... Xy suggest one not me. ... |
| 1 | ham | 52 | ham | CAN I PLEASE COME UP NOW IMIN TOWN.DONTMATTER ... |
| 2 | ham | 0 | ham | Please sen :)my kind advice :-)please come her... |
| 3 | ham | 0 | ham | House-Maid is the murderer, coz the man was mu... |
| 4 | ham | 0 | ham | Where in abj are you serving. Are you staying ... |
| 5 | ham | 0 | ham | HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYARO... |
| 6 | ham | 0 | ham | Sorry battery died, yeah I'm here |
| 7 | ham | 0 | ham | Nah man, my car is meant to be crammed full of... |
| 8 | ham | 0 | ham | Why i come in between you people |
| 9 | ham | 0 | ham | Ok. There may be a free gym about. |


### Spams that were classified correctly:

In [101]:
check[(check.Pred == check.Actual) & (check.Pred == 'spam')]

| | Pred | Prob(%) | Actual | Text |
|---|---|---|---|---|
| 28 | spam | 100 | spam | URGENT! We are trying to contact U Todays draw... |
| 42 | spam | 100 | spam | Congrats! 2 mobile 3G Videophones R yours. cal... |
| 50 | spam | 100 | spam | Double mins and txts 4 6months FREE Bluetooth ... |
| 55 | spam | 99 | spam | Please CALL 08712402972 immediately as there i... |
| 58 | spam | 100 | spam | Last chance 2 claim ur £150 worth of discount ... |
| 63 | spam | 100 | spam | Sex up ur mobile with a FREE sexy pic of Jorda... |
| 74 | spam | 100 | spam | SMS SERVICES. for your inclusive text credits,... |
| 82 | spam | 100 | spam | Final Chance! Claim ur £150 worth of discount ... |
| 95 | spam | 99 | spam | 1000's flirting NOW! Txt GIRL or BLOKE & ur NA... |
| 102 | spam | 100 | spam | Free 1st week entry 2 TEXTPOD 4 a chance 2 win... |


### Classification errors:

In [102]:
print('Misclassified Messages:')
check[check.Pred != check.Actual]

Misclassified Messages:


| | Pred | Prob(%) | Actual | Text |
|---|---|---|---|---|
| 18 | spam | 99 | ham | Yun ah.the ubi one say if ü wan call by tomorr... |
| 78 | ham | 4 | spam | Dorothy@kiefer.com (Bank of Granite issues Str... |
| 147 | ham | 1 | spam | Oh my god! I've found your number again! I'm s... |
| 183 | ham | 0 | spam | LIFE has never been this much fun and great un... |
| 239 | ham | 8 | spam | ROMCAPspam Everyone around should be respondin... |
| 274 | ham | 1 | spam | (Bank of Granite issues Strong-Buy) EXPLOSIVE ... |
| 352 | ham | 3 | spam | TBS/PERSOLVO. been chasing us since Sept for£3... |
| 490 | ham | 0 | spam | Do you ever notice that when you're driving, a... |
| 522 | ham | 0 | spam | In The Simpsons Movie released in July 2007 na... |
| 572 | ham | 20 | spam | Block Breaker now comes in deluxe format with ... |


# Other applications
1. Sentiment Analysis (one approach is to treat it as a classification problem)
2. Topic Extraction [(LDA is a very effective algorithm)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

&nbsp;

# Additional resources

1. [Twenty newsgroups dataset (commonly used for text classification exercises and part of Scikit-Learn)](http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset)
2. [Reuters Corpus Volume I (RCV1, another common benchmark)](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/)
3. [Python Natural Language Toolkit](http://www.nltk.org/)
4. [Sentiment analysis demo trained on movie reviews](http://text-processing.com/demo/sentiment/)

# Questions?
