# Intro to scikit-learn

Notes from a workshop given by Lukas Biewald, CEO of CrowdFlower.

If you are running Python 2, put these imports at the top of your code so that it behaves like Python 3:
```
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
```

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [2]:
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

print ("Tweets:")
print (text[0:8])

print ("-" * 54)

print ("Sentiments:")
print (target[0:8])


Tweets:
0    .@wesley83 I have a 3G iPhone. After 3 hrs twe...
1    @jessedee Know about @fludapp ? Awesome iPad/i...
2    @swonderlin Can not wait for #iPad 2 also. The...
3    @sxsw I hope this year's festival isn't as cra...
4    @sxtxstate great stuff on Fri #SXSW: Marissa M...
5    @teachntech00 New iPad Apps For #SpeechTherapy...
6                                                  NaN
7    #SXSW is just starting, #CTIA is around the co...
Name: tweet_text, dtype: object
------------------------------------------------------
Sentiments:
0                      Negative emotion
1                      Positive emotion
2                      Positive emotion
3                      Negative emotion
4                      Positive emotion
5    No emotion toward brand or product
6    No emotion toward brand or product
7                      Positive emotion
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: object


## Creating features from text with CountVectorizer

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
count_vect.fit(text)

ValueError: np.nan is an invalid document, expected byte or unicode string.

...*oops*

## Dealing with Missing Data

Notice the __`NaN`__ value at index 6 in the tweets printed above. That tweet has no text, even though it was labeled __"No emotion toward brand or product"__, and a label without any text isn't useful for training a classifier. Let's drop rows like that here:

In [4]:
fixed_text = text[text.notnull()]
fixed_target = target[text.notnull()] # note getting rid of same lines in both Series based on null data in text.
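
The filtering step above can be sketched on a tiny, made-up frame. The column name `label` and the three rows are hypothetical stand-ins for the tweets data; the point is only that one mask, built from the text column, is applied to both Series so they stay aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tweets data: the middle tweet has no text.
toy = pd.DataFrame({
    "tweet_text": ["love my iphone", np.nan, "ipad battery died"],
    "label": ["Positive emotion",
              "No emotion toward brand or product",
              "Negative emotion"],
})

# Build the mask once from the text column, then apply it to both Series
# so that text and labels stay aligned row for row.
has_text = toy["tweet_text"].notnull()
toy_text = toy["tweet_text"][has_text]
toy_label = toy["label"][has_text]

# Equivalent one-liner on the whole frame.
toy_clean = toy.dropna(subset=["tweet_text"])

print(len(toy_text), len(toy_label))
print(list(toy_label))
```

Note that the surviving rows keep their original index labels (0 and 2 here), which is why the later cells slice the Series by position rather than looking rows up by label.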

## Back to CountVectorizer

Having fixed the missing data, we can create the vectorizer.


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
count_vect.fit(fixed_text)

counts = count_vect.transform(fixed_text)

print (count_vect.vocabulary_.get(u'iphone'))
print (count_vect.transform(["I love my iphone!!!"]))

4573
  (0, 4573)	1
  (0, 5169)	1
  (0, 5699)	1
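
The output above is a sparse matrix printed as `(row, column)  count` entries, where each column is one word from the fitted vocabulary. A minimal sketch on a made-up two-document corpus (the documents are assumptions, not the tweets data) makes the mapping easier to see:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, small enough to read the vocabulary by eye.
toy_docs = ["I love my iphone", "my ipad is great"]

vect = CountVectorizer()
vect.fit(toy_docs)

# vocabulary_ maps each token to a column index, assigned in sorted token order.
# The default tokenizer lowercases text and skips one-character tokens such as "I".
print(sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]))

# Transforming a new document counts only the tokens already in the vocabulary.
row = vect.transform(["I love my iphone!!!"])
print(row.toarray())
```

Calling `.toarray()` is fine for a toy example, but on the real tweets the matrix has thousands of mostly-zero columns, which is why scikit-learn keeps it sparse.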


## Machine Learning

Naive Bayes (NB) has a bunch of parameters, which can be somewhat scary for those who haven't
used it before. That said, scikit-learn mostly has sane defaults,
and usually it's not necessary to modify them. You can also try
swapping in a different algorithm, but that usually isn't the best way to spend
your time.

<img src= "http://scikit-learn.org/stable/_static/ml_map.png">
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/



In [7]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(counts, fixed_target)

predictions = nb.predict(counts)
print (sum(predictions == fixed_target))

7229


Great! ...but wait, what does that mean?

## Validation and Overfitting

It's a bad idea to validate against the training data set, because that tells us nothing about how the model does against data it has never seen before.

Let's split the data into two chunks: one to train on and one to test on.

In [8]:
nb = MultinomialNB()

nb.fit(counts[0:6000], fixed_target[0:6000])

predictions = nb.predict(counts[6000:9092])
print (sum(predictions == fixed_target[6000:9092]))

2053
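
The raw counts are easier to compare as fractions. This is just arithmetic on the numbers printed above, assuming the cleaned data has exactly 9092 rows, as the slice bound `6000:9092` suggests:

```python
# Counts printed by the two cells above.
train_correct = 7229            # scored on the same rows the model was fit on
total_rows = 9092               # assumption: inferred from the slice bound 6000:9092

held_out_correct = 2053         # scored on rows the model never saw
held_out_rows = total_rows - 6000

train_accuracy = train_correct / total_rows
held_out_accuracy = held_out_correct / held_out_rows

print(f"training accuracy: {train_accuracy:.3f}")
print(f"held-out accuracy: {held_out_accuracy:.3f}")
```

The gap between the two figures is the overfitting this section is about: the score on the training rows flatters the model.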


### Cross Validation

Splitting the data into two chunks gave us a more honest score. Let's take that to the next level: split the data into several chunks and let each chunk take a turn as the held-out set, like this:

<img src="https://pbs.twimg.com/media/CSL3Uv2UsAAoreU.jpg">

Source: [Rob Hall](https://twitter.com/hallr)'s [ODSC](http://opendatascicon.com/) talk on Machine Learning, picture credit [@nicholasvu](https://twitter.com/nicholasvu)

In [9]:
nb = MultinomialNB()

from sklearn.model_selection import cross_val_score  # lived in sklearn.cross_validation before scikit-learn 0.18

scores = cross_val_score(nb, counts, fixed_target, cv=10)
print (scores)
print (scores.mean())

[ 0.65824176  0.63076923  0.60659341  0.60879121  0.64395604  0.68901099
  0.70077008  0.66886689  0.65270121  0.62183021]
0.648153102333
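
To see what the folds look like, here is a small sketch using `KFold` on ten made-up samples. It is a simplification: for a classifier with an integer `cv`, `cross_val_score` uses stratified folds, which also balance the class proportions in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)   # ten toy samples, one feature each

kf = KFold(n_splits=5)
times_held_out = np.zeros(len(X), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold trains on 8 samples and tests on the remaining 2.
    times_held_out[test_idx] += 1
    print(fold, "train:", train_idx, "test:", test_idx)

# Every sample ends up in a test set exactly once.
print(times_held_out)
```

Each entry in the `scores` array above is the accuracy on one such held-out fold, and the mean summarizes all ten.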


### Dummy Validation

In [10]:
from sklearn.dummy import DummyClassifier

nb_dummy = DummyClassifier(strategy='most_frequent')

nb_dummy.fit(counts[0:6000], fixed_target[0:6000])

predictions = nb_dummy.predict(counts[6000:9092])
print (sum(predictions == fixed_target[6000:9092]))

from sklearn.model_selection import cross_val_score  # lived in sklearn.cross_validation before scikit-learn 0.18

scores = cross_val_score(nb_dummy, counts, fixed_target, cv=10)
print (scores)
print (scores.mean())

1890
[ 0.59230769  0.59230769  0.59230769  0.59230769  0.59230769  0.59230769
  0.5929593   0.5929593   0.59316428  0.59316428]
0.592609330138
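
The `most_frequent` strategy ignores the features and always predicts the most common training label. A hand-rolled sketch on made-up labels (the label lists are hypothetical) shows the idea:

```python
from collections import Counter

# Hypothetical labels; "neutral" is the majority class in the training set.
train_labels = ["neutral", "neutral", "positive", "neutral", "negative"]
test_labels = ["positive", "neutral", "neutral", "negative"]

# "Fit": remember the single most common training label.
majority_label, _ = Counter(train_labels).most_common(1)[0]

# "Predict": return that label for every test row, whatever the input is.
predictions = [majority_label] * len(test_labels)

n_correct = sum(p == t for p, t in zip(predictions, test_labels))
baseline_accuracy = n_correct / len(test_labels)
print(majority_label, baseline_accuracy)
```

This puts the earlier result in context: the dummy model averages about 0.593 in cross-validation, so the Naive Bayes score of about 0.648 is a gain of roughly 5.5 percentage points over always guessing the majority class.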


## scikit-learn Pipelines

In scikit-learn, multiple steps are often run one after the other, so it provides a framework for chaining them into a single model, called a "pipeline".

In [11]:
from sklearn.pipeline import Pipeline

p = Pipeline(steps=[('counts', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print (p.predict(["I love my iphone!"]))

['Positive emotion']
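
A pipeline does the same work as the separate vectorize-then-fit steps used earlier. The sketch below checks this on a made-up four-tweet corpus (the texts and the `pos`/`neg` labels are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data.
texts = ["love my iphone", "great ipad app", "iphone battery died", "hate this app"]
labels = ["pos", "pos", "neg", "neg"]
new_texts = ["love this ipad"]

# Manual version: vectorize, fit, then remember to vectorize new text the same way.
vect = CountVectorizer()
counts = vect.fit_transform(texts)
nb = MultinomialNB().fit(counts, labels)
manual = nb.predict(vect.transform(new_texts))

# Pipeline version: raw text goes in, the steps run in order.
p = Pipeline(steps=[('counts', CountVectorizer()),
                    ('multinomialnb', MultinomialNB())])
p.fit(texts, labels)
piped = p.predict(new_texts)

print(list(manual) == list(piped))
```

The practical benefit shows up in cross-validation: because the vectorizer is inside the pipeline, it is refit on each training fold instead of seeing the held-out text.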


## N-grams

Instead of using only single words in our vectorizer, what if we also used pairs of adjacent words (bigrams)? Would that capture sentiment that depends on how words are used together?

In [15]:
p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print (p.named_steps['counts'].vocabulary_.get(u'garage sale'))
print (len(p.named_steps['counts'].vocabulary_))

18967
59614
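
To see what `ngram_range=(1, 2)` adds, compare the vocabularies built from a single made-up document (the sentence is an assumption, chosen to echo the `garage sale` lookup above):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["garage sale this weekend"]   # hypothetical one-document corpus

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
both = CountVectorizer(ngram_range=(1, 2)).fit(doc)

# Single words only.
print(sorted(unigrams.vocabulary_))

# Single words plus every pair of adjacent words.
print(sorted(both.vocabulary_))
```

Every adjacent pair becomes its own feature, which is why the vocabulary on the tweets grows to tens of thousands of entries.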


In [16]:
p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print (p.predict(["I love my iphone!"]))

['Positive emotion']


In [14]:
p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)

from sklearn.model_selection import cross_val_score  # lived in sklearn.cross_validation before scikit-learn 0.18

scores = cross_val_score(p, fixed_text, fixed_target, cv=10)
print (scores)
print (scores.mean())

[ 0.68351648  0.66593407  0.65384615  0.64725275  0.68021978  0.69120879
  0.73267327  0.70517052  0.68026461  0.64829107]
0.678837748442


## Feature Selection

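Would it change things if we dropped the less informative words and n-grams and kept only the ones most strongly associated with the sentiment labels? `SelectKBest(chi2, k=10000)` keeps the 10,000 features with the highest chi-squared scores against the target.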

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                ('feature_selection', SelectKBest(chi2, k=10000)),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)

from sklearn.model_selection import cross_val_score  # lived in sklearn.cross_validation before scikit-learn 0.18

scores = cross_val_score(p, fixed_text, fixed_target, cv=10)
print (scores)
print (scores.mean())
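
A tiny sketch of what `SelectKBest(chi2, ...)` does, using a made-up count matrix with three features. The first two columns each appear in only one class, and the third has the same count in every row, so it carries no information about the label:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical word counts: rows are documents, columns are words.
X = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 3, 1],
              [0, 2, 1]])
y = np.array(["pos", "pos", "neg", "neg"])

selector = SelectKBest(chi2, k=2)
X_reduced = selector.fit_transform(X, y)

# get_support() marks which original columns were kept.
kept = selector.get_support()
print(kept)
print(X_reduced.shape)
```

The selection is driven by how unevenly a feature is spread across the classes, not by how often it occurs overall.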

## Grid Search

Grid search, like pipelines, is a scikit-learn framework that saves code: it tests a model with many combinations of parameter settings in just a few lines.

In [None]:
p = Pipeline(steps=[('counts', CountVectorizer()),
                ('feature_selection', SelectKBest(chi2)),
                ('multinomialnb', MultinomialNB())])

from sklearn.model_selection import GridSearchCV  # lived in sklearn.grid_search before scikit-learn 0.18

parameters = {
    'counts__max_df': (0.5, 0.75, 1.0),
    'counts__min_df': (1, 2, 3),
    'counts__ngram_range': ((1,1), (1,2)),
#    'feature_selection__k': (1000, 10000, 100000)
    }

grid_search = GridSearchCV(p, parameters, n_jobs=1, verbose=1, cv=10)

grid_search.fit(fixed_text, fixed_target)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
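
The grid above has 3 × 3 × 2 = 18 parameter combinations, and with `cv=10` each one is fit ten times, so expect 180 fits before the final refit. The sketch below runs the same idea on a made-up eight-tweet corpus with a much smaller grid; the texts, labels, and the choice to tune `alpha` are assumptions for illustration. Parameter names follow the `<step name>__<parameter>` pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four tweets per class.
texts = ["love my iphone", "great ipad app", "awesome new ipad", "love this app",
         "iphone battery died", "hate this app", "ipad keeps crashing", "terrible battery life"]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

p = Pipeline(steps=[('counts', CountVectorizer()),
                    ('multinomialnb', MultinomialNB())])

# Keys are "<step name>__<parameter name>".
parameters = {
    'counts__ngram_range': [(1, 1), (1, 2)],
    'multinomialnb__alpha': [0.1, 1.0],
}

grid = GridSearchCV(p, parameters, cv=2)
grid.fit(texts, labels)

# 2 x 2 = 4 combinations were tried.
print(len(grid.cv_results_['params']))

# best_params_ holds only the searched parameters, already at their best values.
print(grid.best_params_)
```

`best_params_` is a shorter route to the same information as looping over `best_estimator_.get_params()`.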