# Introduction to Natural Language Processing

In this notebook, we'll go over some fundamental feature extraction concepts (very shallow NLP) using NLTK and scikit-learn.  This will lay the foundation for using machine learning with text data.

### The term-document matrix representation of a corpus.

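This representation, also known as "bag of words", is extremely common for shallow NLP tasks such as document-level classification. Each row is a document, each column is a term, and each cell counts how often that term occurs in that document; word order is discarded.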

In [1]:
# example unicode text
docs = [u'The patient was seen for bird flu.', 
        u'The patient was seen for chickenpox.', 
        u'The patient was seen for dengue.']

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(max_df=1., min_df=0,
                                max_features=100,
                                ngram_range=(1, 1),
                                stop_words=None)

tf = tf_vectorizer.fit_transform(docs)
tf_vectorizer.get_feature_names()

[u'bird',
 u'chickenpox',
 u'dengue',
 u'flu',
 u'for',
 u'patient',
 u'seen',
 u'the',
 u'was']

`tf` is a sparse matrix.

To see its contents, we'll convert it to a dense matrix and use pandas to display it as a table.

In [3]:
import pandas as pd
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

Unnamed: 0,bird,chickenpox,dengue,flu,for,patient,seen,the,was
0,1,0,0,1,1,1,1,1,1
1,0,1,0,0,1,1,1,1,1
2,0,0,1,0,1,1,1,1,1
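
To make the table above concrete, here is a minimal pure-Python sketch of roughly what `CountVectorizer` computes for these three documents. The `tokenize` helper is an assumption made for illustration (lowercase the text, then keep runs of two or more word characters); the library's own tokenization may differ in edge cases.

```python
import re
from collections import Counter

docs = ['The patient was seen for bird flu.',
        'The patient was seen for chickenpox.',
        'The patient was seen for dengue.']

def tokenize(doc):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted to fix the column order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per term, cells hold raw counts.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The printed rows should match the table: all three documents share the columns for the common words and differ only in the disease terms.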


**Stop words.**  Words such as 'the', 'a', 'an', and many prepositions carry little information about a document's content, so they are often dropped from the analysis, as follows...

In [4]:
# apply stop words
tf_vectorizer = CountVectorizer(max_df=1., min_df=0,
                                max_features=100,
                                ngram_range=(1, 1),
                                stop_words='english')

tf = tf_vectorizer.fit_transform(docs)
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

Unnamed: 0,bird,chickenpox,dengue,flu,patient,seen
0,1,0,0,1,1,1
1,0,1,0,0,1,1
2,0,0,1,0,1,1


**N-grams.** Single words or tokens are called unigrams, or 1-grams.  Sometimes unigrams simply do not capture enough information. In that case, we can also count sequences of n consecutive tokens, called n-grams.

For example, we want the matrix to record that "bird flu" was present in the first document, not just that the first document mentioned "bird" and that it also mentioned "flu".

In [5]:
# Let's get the unigrams and bigrams from the documents.
tf_vectorizer = CountVectorizer(max_df=1., min_df=0,
                                max_features=100,
                                ngram_range=(1, 2),
                                stop_words='english')

tf = tf_vectorizer.fit_transform(docs)
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

Unnamed: 0,bird,bird flu,chickenpox,dengue,flu,patient,patient seen,seen,seen bird,seen chickenpox,seen dengue
0,1,1,0,0,1,1,1,1,1,0,0
1,0,0,1,0,0,1,1,1,0,1,0
2,0,0,0,1,0,1,1,1,0,0,1
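
The table above contains a "seen bird" column even though the original sentence reads "seen for bird flu". That is consistent with stop words being removed before the n-grams are formed. The sketch below reproduces this ordering in plain Python; `STOP_WORDS` is a hypothetical three-word list covering only the example sentences, not scikit-learn's English list, and `word_ngrams` is an illustrative helper.

```python
import re

# Hypothetical stop list covering only the words in the example documents.
STOP_WORDS = {'the', 'was', 'for'}

def word_ngrams(doc, min_n=1, max_n=2):
    # Tokenize and drop stop words first, then slide a window over the rest.
    tokens = [t for t in re.findall(r"\b\w\w+\b", doc.lower())
              if t not in STOP_WORDS]
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

print(word_ngrams('The patient was seen for bird flu.'))
# ['patient', 'seen', 'bird', 'flu', 'patient seen', 'seen bird', 'bird flu']
```

Because 'for' is gone by the time the window slides, "seen" and "bird" become adjacent and form a bigram that never appears verbatim in the text.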


We may want to add more complex features, such as tokens paired with their part-of-speech (POS) tags.

In [6]:
import nltk

def tokenize_and_tag(doc):
    # Split the document into tokens, then pair each token with its POS tag.
    return nltk.pos_tag(nltk.word_tokenize(doc))

tf_vectorizer = CountVectorizer(max_df=1., min_df=1,
                                max_features=100,
                                ngram_range=(1, 1),
                                tokenizer=tokenize_and_tag)

tf = tf_vectorizer.fit_transform(docs)
pd.DataFrame(tf.todense(),columns=tf_vectorizer.get_feature_names())

Unnamed: 0,"(., .)","(bird, NN)","(chickenpox, NN)","(dengue, NN)","(flu, NN)","(for, IN)","(patient, NN)","(seen, VBN)","(the, DT)","(was, VBD)"
0,1,1,0,0,1,1,1,1,1,1
1,1,0,1,0,0,1,1,1,1,1
2,1,0,0,1,0,1,1,1,1,1


Now let's look at the 20 Newsgroups data and check how the number of features scales as longer n-grams are included.

In [7]:
# Get the 20 newsgroups Corpus (just the train set for now)
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train',
                          categories=('rec.autos',
                             'rec.motorcycles',
                             'rec.sport.baseball',
                             'rec.sport.hockey'),
                          remove=('headers', 'footers', 'quotes'))

In [8]:
for n in range(1,5):
    tf_vectorizer = CountVectorizer(max_df=1., min_df=1,
                                    max_features=999999,
                                    ngram_range=(1, n),
                                    stop_words='english')
    tf = tf_vectorizer.fit_transform(news.data[:500])
    print("Using 1-grams up to %d-grams on the first 500 news groups documents yields %d features" % (n, tf.shape[1]))

Using 1-grams up to 1-grams on the first 500 news groups documents yields 9328 features
Using 1-grams up to 2-grams on the first 500 news groups documents yields 39452 features
Using 1-grams up to 3-grams on the first 500 news groups documents yields 71233 features
Using 1-grams up to 4-grams on the first 500 news groups documents yields 102880 features
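
The growth in the feature count can be illustrated on a much smaller scale. The following sketch counts the distinct n-grams, for n from 1 up to a maximum, in the three example sentences from the start of the notebook. The helpers `tokenize` and `count_features` are hypothetical names introduced here, and no stop words are removed.

```python
import re

docs = ['The patient was seen for bird flu.',
        'The patient was seen for chickenpox.',
        'The patient was seen for dengue.']

def tokenize(doc):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def count_features(docs, max_n):
    # Collect every distinct n-gram for n = 1..max_n across all documents.
    features = set()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                features.add(tuple(tokens[i:i + n]))
    return len(features)

feature_counts = [count_features(docs, n) for n in range(1, 5)]
print(feature_counts)  # [9, 17, 24, 30]
```

Each increase in the maximum n adds a new batch of columns, which is why `max_features` and `min_df` are useful for keeping the matrix manageable.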


Let's use 1- to 3-grams to predict the newsgroup category.

In [51]:
tf_vectorizer = CountVectorizer(max_df=1., min_df=10,
                                max_features=1000,
                                ngram_range=(1, 3),
                                stop_words='english')
tf = tf_vectorizer.fit_transform(news.data)
print("bigrams data shape = ", tf.shape)

bigrams data shape =  (2389, 1000)


In [52]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=.01)
clf.fit(tf, news.target)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
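
As a rough picture of what the classifier does with the count matrix, here is a minimal pure-Python sketch of multinomial naive Bayes with additive smoothing. It is not scikit-learn's implementation; the function names, the toy counts, and the labels are all made up for illustration, and `alpha` plays the same smoothing role as in the cell above.

```python
import math

def fit_multinomial_nb(X, y, alpha=0.01):
    # X: list of count rows; y: list of class labels, one per row.
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / float(len(y)))
        # Total count of each term within the class, smoothed by alpha.
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_prob[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_prob

def predict(x, classes, log_prior, log_prob):
    # Score = log prior + sum over terms of count * log P(term | class).
    scores = {c: log_prior[c] + sum(xi * lp for xi, lp in zip(x, log_prob[c]))
              for c in classes}
    return max(classes, key=lambda c: scores[c])

# Toy counts over the columns [car, engine, game, team]; labels are hypothetical.
X = [[3, 1, 0, 0], [2, 2, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
y = ['autos', 'autos', 'sport', 'sport']
model = fit_multinomial_nb(X, y)
print(predict([1, 1, 0, 0], *model))  # autos
```

A small `alpha` keeps unseen terms from producing a log of zero while barely changing the estimates for terms that were observed.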

In [53]:
tf_vectorizer.get_feature_names()

[u'00',
 u'000',
 u'01',
 u'02',
 u'03',
 u'04',
 u'05',
 u'06',
 u'10',
 u'100',
 u'11',
 u'110',
 u'12',
 u'13',
 u'130',
 u'14',
 u'15',
 u'150',
 u'16',
 u'17',
 u'18',
 u'19',
 u'1988',
 u'1990',
 u'1991',
 u'1992',
 u'1993',
 u'1st',
 u'20',
 u'200',
 u'21',
 u'22',
 u'23',
 u'24',
 u'25',
 u'250',
 u'26',
 u'27',
 u'28',
 u'29',
 u'2nd',
 u'30',
 u'300',
 u'31',
 u'32',
 u'33',
 u'34',
 u'35',
 u'36',
 u'37',
 u'38',
 u'39',
 u'3rd',
 u'40',
 u'400',
 u'41',
 u'42',
 u'43',
 u'44',
 u'45',
 u'46',
 u'47',
 u'48',
 u'49',
 u'50',
 u'500',
 u'51',
 u'52',
 u'53',
 u'54',
 u'55',
 u'56',
 u'57',
 u'58',
 u'59',
 u'60',
 u'600',
 u'61',
 u'62',
 u'63',
 u'64',
 u'65',
 u'66',
 u'67',
 u'68',
 u'69',
 u'70',
 u'71',
 u'72',
 u'73',
 u'74',
 u'75',
 u'76',
 u'77',
 u'78',
 u'79',
 u'80',
 u'81',
 u'82',
 u'83',
 u'84',
 u'85',
 u'86',
 u'87',
 u'88',
 u'89',
 u'90',
 u'91',
 u'91 92',
 u'92',
 u'93',
 u'94',
 u'95',
 u'97',
 u'99',
 u'__',
 u'___',
 u'abc',
 u'ability',
 u'able',
 u'a

In [54]:
import numpy as np
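def show_top10(classifier, vectorizer, categories):
    """Print the 10 highest-weighted features for each category."""
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        # argsort is ascending, so the last 10 indices have the largest
        # weights; for this classifier those are the most probable terms.
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: " % category)
        for j in top10:
            print("\t %s" % feature_names[j])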

show_top10(clf, tf_vectorizer, news.target_names)

rec.autos: 
	 think
	 know
	 engine
	 new
	 good
	 don
	 just
	 like
	 cars
	 car
rec.motorcycles: 
	 time
	 motorcycle
	 ride
	 good
	 know
	 don
	 dod
	 like
	 just
	 bike
rec.sport.baseball: 
	 games
	 like
	 just
	 00
	 don
	 think
	 team
	 good
	 game
	 year
rec.sport.hockey: 
	 12
	 11
	 season
	 55
	 play
	 25
	 hockey
	 10
	 game
	 team


In [55]:
from sklearn import metrics
news_test = fetch_20newsgroups(subset='test',
                               categories=('rec.autos',
                                 'rec.motorcycles',
                                 'rec.sport.baseball',
                                 'rec.sport.hockey'),
                               remove=('headers', 'footers', 'quotes'))
tf_test = tf_vectorizer.transform(news_test.data)
pred = clf.predict(tf_test)

In [56]:
print(metrics.classification_report(news_test.target, pred, target_names=news.target_names))

                    precision    recall  f1-score   support

         rec.autos       0.75      0.78      0.77       396
   rec.motorcycles       0.78      0.76      0.77       398
rec.sport.baseball       0.79      0.81      0.80       397
  rec.sport.hockey       0.77      0.75      0.76       399

       avg / total       0.77      0.77      0.77      1590
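
For readers unfamiliar with the columns of the report, the sketch below computes precision, recall, and F1 for a single class by hand. The labels are hypothetical and are not taken from the newsgroups run above; `precision_recall_f1` is an illustrative helper, not part of scikit-learn. The support column in the report is simply the number of true examples of each class.

```python
def precision_recall_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Precision: of everything predicted as `label`, how much was right.
    precision = tp / float(tp + fp) if tp + fp else 0.0
    # Recall: of everything truly `label`, how much was found.
    recall = tp / float(tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for illustration only.
y_true = ['autos', 'autos', 'autos', 'hockey', 'hockey', 'hockey']
y_pred = ['autos', 'autos', 'hockey', 'hockey', 'hockey', 'autos']
scores = precision_recall_f1(y_true, y_pred, 'autos')
print(scores)
```

Here two of the three 'autos' predictions are correct and two of the three true 'autos' documents are found, so all three scores come out to two thirds.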

