# Spacy Demo

If you haven't installed spacy yet, use:
```
conda install spacy
python -m spacy.en.download
```
This downloads about 500 MB of data.

Another popular package, `nltk`, can be installed as follows (you can skip this for now):

```
conda install nltk
python -m nltk.downloader all
```

This also downloads a lot of data.

## Load StumbleUpon dataset

In [1]:
# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

data = pd.read_csv("../dataset/stumbleupon.tsv", sep='\t',
                  encoding="utf-8")
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


In [4]:
## Load spacy

from spacy.en import English
nlp_toolkit = English()
nlp_toolkit

<spacy.en.English at 0x11862b210>

Another way to load `spacy`:
```
import spacy
nlp_toolkit = spacy.load("en")
```

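In [6]:
title = u"Mary had a little lamb."

# `title` holds the sentence to analyze. The "u" prefix makes it a unicode
# string (redundant here, given the unicode_literals import above).

parsed = nlp_toolkit(title)

# Calling nlp_toolkit runs the whole pipeline on the text: tokenization,
# lemmatization, dependency parsing, and entity recognition.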

for word in parsed:
    print("Word: {}".format(word))
    print("\t Phrase type: {}".format(word.dep_))
    print("\t Is the word a known entity type? {}".format(
        word.ent_type_  if word.ent_type_ else "No"))
    print("\t Lemma: {}".format(word.lemma_))
    print("\t Parent of this word: {}".format(word.head.lemma_))

Word: Mary
	 Phrase type: nsubj
	 Is the word a known entity type? PERSON
	 Lemma: mary
	 Parent of this word: have
Word: had
	 Phrase type: ROOT
	 Is the word a known entity type? No
	 Lemma: have
	 Parent of this word: have
Word: a
	 Phrase type: det
	 Is the word a known entity type? No
	 Lemma: a
	 Parent of this word: lamb
Word: little
	 Phrase type: amod
	 Is the word a known entity type? No
	 Lemma: little
	 Parent of this word: lamb
Word: lamb
	 Phrase type: dobj
	 Is the word a known entity type? No
	 Lemma: lamb
	 Parent of this word: have
Word: .
	 Phrase type: punct
	 Is the word a known entity type? No
	 Lemma: .
	 Parent of this word: have
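
The output above encodes a dependency tree: every word points to its parent, and the root ("had") is its own parent. As a minimal sketch, the tree can be rebuilt from the printed values alone. The tuples below are copied by hand from the output; no spaCy objects are involved, and the variable names are only illustrative.

```python
from collections import defaultdict

# (word, dependency label, lemma of the parent), copied from the output above
tokens = [
    ("Mary", "nsubj", "have"),
    ("had", "ROOT", "have"),
    ("a", "det", "lamb"),
    ("little", "amod", "lamb"),
    ("lamb", "dobj", "have"),
    (".", "punct", "have"),
]

# Group every non-root word under the lemma of its parent
children = defaultdict(list)
for word, dep, parent in tokens:
    if dep != "ROOT":  # the root is its own parent, so skip it
        children[parent].append((word, dep))

for parent in children:
    print("{} -> {}".format(parent, children[parent]))
```

Read this way, "have" governs the subject, the object and the punctuation, while "lamb" governs its determiner and adjective.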


## Investigate Page Titles

Let's see if we can find organizations in our page titles.

In [8]:
def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)

# The line above sets column 'references_organization' to True when spaCy tags
# at least one word of the page title as an organization (ORG)

# Take a look
data[data['references_organization']][['title']].head()

# This returns the first 5 titles (head) in which an organization was detected
# (such as IBM)

Unnamed: 0,title
0,IBM Sees Holographic Calls Air Breathing Batte...
5,Genital Herpes Treatment
6,fashion lane American Wild Child
8,Valet The Handbook 31 Days 31 days
10,Business Financial News Breaking US Internatio...


## Exercise:

Write a function to identify titles that mention an organization (ORG) and a person (PERSON).

.
.
.
.
.
.
.
.

In [13]:
## Exercise solution

def references_organization_person(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed]) and any([word.ent_type_ == 'PERSON' for word in parsed])

data['references_organization_person'] = data['title'].fillna(u'').map(references_organization_person)


# Take a look
data[data['references_organization_person']][['title']].head()


Unnamed: 0,title
11,A Tip of the Cap to The Greatest Iron Man of T...
29,Genevieve Morton Swimsuit by Tyler Rose Swimwe...
44,Alyssa Miller Swimsuit by Charlie by Matthew Z...
114,Baby Gorilla Tries To Act Tough Video
115,BBC News UK Sweet message in a bottle
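
The solution above scans the parsed title twice, once per entity type. One possible alternative is a single set comparison, which also extends to any number of required types. This is a sketch on plain lists of labels; `mentions_all` and the sample labels are hypothetical and invented for illustration.

```python
def mentions_all(entity_labels, required):
    """Return True if every label in `required` occurs in `entity_labels`."""
    return set(required) <= set(entity_labels)

# Hypothetical per-word entity labels ('' means "no entity"), invented for this sketch
labels_a = ["PERSON", "", "", "ORG"]
labels_b = ["ORG", "ORG", ""]

print(mentions_all(labels_a, ["ORG", "PERSON"]))
print(mentions_all(labels_b, ["ORG", "PERSON"]))
```

With spaCy, the first argument would be `[word.ent_type_ for word in parsed]`.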


## Predicting "Greenness" Of Content

This dataset comes from [stumbleupon](https://www.stumbleupon.com/), a web page recommender.  

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

> ### Let's try extracting some of the text content.
> ### Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?

In [15]:
# Option 1: Create a function to check for this

def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except Exception:
        return 0

# A try/except block runs the code under `try`; if that code raises an
# exception, the code under `except` runs instead
        
data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions

#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)


# Option 3: pandas string methods (case-insensitive; missing titles count as False)
data['recipe'] = data['title'].str.contains('recipe', case=False, na=False)
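
The cell above builds the feature but does not yet answer the question about the share of evergreen pages. One way to compare the two groups is sketched below in plain Python. The rows are hypothetical and invented for illustration, so the resulting rates say nothing about the real dataset; `evergreen_rate` is likewise an illustrative helper.

```python
# Hypothetical (title, label) rows, invented for illustration; label 1 = evergreen
rows = [
    ("Best Chocolate Cake Recipe", 1),
    ("Easy Chicken Recipes", 1),
    ("Recipe for Disaster at the Box Office", 0),
    ("Breaking Business News", 0),
    ("10 Tips for Better Sleep", 1),
    ("Sports Scores Tonight", 0),
]

def evergreen_rate(rows, with_recipe):
    """Share of evergreen rows among titles that do (or do not) contain 'recipe'."""
    labels = [label for title, label in rows
              if ('recipe' in title.lower()) == with_recipe]
    return sum(labels) / float(len(labels))

print("with recipe:    {:.2f}".format(evergreen_rate(rows, True)))
print("without recipe: {:.2f}".format(evergreen_rate(rows, False)))
```

On the actual DataFrame, the same comparison is a one-liner: `data.groupby('recipe')['label'].mean()`.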


### Demo: Use of the Count Vectorizer

In [17]:
titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

# Configure the vectorizer: keep the 1,000 most frequent words and two-word
# phrases, drop English stop words, and record presence instead of counts

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

# Use `fit` to learn the vocabulary of the titles

vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix: one column per feature (word or n-gram).
# `max_features=1000` keeps only the 1,000 most frequent features, and `binary=True` means each
# entry indicates whether the feature occurs in the title, not how often.

X = vectorizer.transform(titles)
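
To make the `fit` / `transform` split concrete, here is a minimal pure-Python sketch of a binary bag-of-words with unigrams and bigrams. It is a simplification, not scikit-learn's implementation: it splits on whitespace, removes no stop words and applies no `max_features` cap. The toy titles and the `ngrams` helper are invented for illustration.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """All word n-grams of length n_min..n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

docs = ["better sleep tips", "sleep tips"]  # toy titles
doc_grams = [set(ngrams(d.lower().split())) for d in docs]

# "fit": collect the vocabulary, sorted so that each feature gets a fixed column
vocabulary = sorted(set().union(*doc_grams))

# "transform": one row per document, 1 if the feature occurs in it, else 0
matrix = [[1 if term in grams else 0 for term in vocabulary]
          for grams in doc_grams]

print(vocabulary)
for row in matrix:
    print(row)
```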

In [23]:
# Printing a sparse matrix lists only the non-zero elements,
# each as (row, column) followed by its value
print(X)

# Shape of the original data: (number of rows / records, number of columns)
print(data.shape)

# Feature names (words or n-grams) that go with the matrix columns
print(vectorizer.get_feature_names())

  (0, 43)	1
  (2, 209)	1
  (2, 357)	1
  (2, 384)	1
  (2, 435)	1
  (2, 564)	1
  (2, 565)	1
  (3, 1)	1
  (3, 102)	1
  (3, 797)	1
  (3, 907)	1
  (4, 34)	1
  (4, 240)	1
  (4, 504)	1
  (6, 51)	1
  (6, 347)	1
  (6, 970)	1
  (7, 217)	1
  (7, 476)	1
  (7, 477)	1
  (8, 273)	1
  (9, 137)	1
  (9, 231)	1
  (9, 232)	1
  (9, 245)	1
  :	:
  (7391, 213)	1
  (7392, 311)	1
  (7392, 312)	1
  (7392, 691)	1
  (7392, 863)	1
  (7394, 23)	1
  (7394, 24)	1
  (7394, 217)	1
  (7394, 314)	1
  (7394, 316)	1
  (7394, 393)	1
  (7394, 462)	1
  (7394, 463)	1
  (7394, 583)	1
  (7394, 584)	1
  (7394, 662)	1
  (7394, 663)	1
  (7394, 787)	1
  (7394, 788)	1
  (7394, 789)	1
  (7394, 829)	1
  (7394, 830)	1
  (7394, 865)	1
  (7394, 868)	1
  (7394, 869)	1
(7395, 32)
[u'000', u'10', u'10 best', u'10 things', u'10 ways', u'100', u'101', u'101 cookbooks', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'20', u'2007', u'2008', u'2008 sports', u'2009', u'2010', u'2010 sports', u'2011', u'2011 sports', u'2012', u'2013', u'2
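
The `(row, column)  value` lines printed above are a coordinate listing of the non-zero entries. Titles containing none of the 1,000 features, such as rows 1 and 5 above, do not appear at all. A small sketch of the same idea on a hand-made dense matrix (the values are invented for illustration):

```python
dense = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # an all-zero row produces no entries
    [1, 0, 0, 1],
]

# Keep only the non-zero cells, remembering where each one sits
entries = [((r, c), value)
           for r, row in enumerate(dense)
           for c, value in enumerate(row)
           if value != 0]

for (r, c), value in entries:
    print("  ({}, {})\t{}".format(r, c, value))
```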

### Demo: Build a random forest model to predict evergreenness of a website using the title features

In [28]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']

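# Note: from scikit-learn 0.18 on, cross_val_score lives in sklearn.model_selection
# (the sklearn.cross_validation module was removed in 0.20)
from sklearn.cross_validation import cross_val_score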

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# AUC is the area under the ROC curve: 0.5 corresponds to random guessing
# and 1.0 to a perfect ranking of the samples

CV AUC [ 0.78852324  0.80190876  0.80378247], Average AUC 0.798071488129
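
One way to read the AUC values above: AUC equals the probability that a randomly chosen positive (evergreen) sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all pairs. The function name and the toy scores are invented for illustration; scikit-learn computes the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```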


In [29]:
model.fit(X, y)

all_feature_names = vectorizer.get_feature_names()
feature_importances = pd.DataFrame({'Features' : all_feature_names, 'Importance Score': model.feature_importances_})
feature_importances.sort_values('Importance Score', ascending=False).head()



Unnamed: 0,Features,Importance Score
715,recipe,0.046695
721,recipes,0.02424
192,chocolate,0.015319
347,fashion,0.014594
183,chicken,0.01321


In [None]:
model

### Exercise: Build a random forest model to predict evergreenness of a website using the title features and quantitative features

In [25]:
## TODO
X = vectorizer.transform(titles)
X_additional_cols = ['html_ratio', 'image_ratio']
X_additional_data = data[X_additional_cols]


# hstack lets you combine a regular matrix (with the quantitative data) with 
# the "sparse" matrix that captured the word information

from scipy.sparse import hstack
X = hstack((X, X_additional_data)).toarray()
y = data['label']

from sklearn.cross_validation import cross_val_score

model = RandomForestClassifier(n_estimators = 20)
scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))


CV AUC [ 0.78620425  0.79411561  0.79603556], Average AUC 0.79211847604


In [26]:
model.fit(X, y)

all_feature_names = vectorizer.get_feature_names() + X_additional_cols
feature_importances = pd.DataFrame({'Features' : all_feature_names, 'Importance Score': model.feature_importances_})
feature_importances.sort_values('Importance Score', ascending=False).head()


Unnamed: 0,Features,Importance Score
1000,html_ratio,0.154224
1001,image_ratio,0.096063
715,recipe,0.039067
721,recipes,0.022082
183,chicken,0.012421
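
The two quantitative columns show up at positions 1000 and 1001 because `hstack` appends them to the right of the 1,000 word columns, which is why the feature names have to be concatenated in the same order. A minimal sketch with plain lists in place of the sparse matrix; all values and the three-word vocabulary are invented for illustration.

```python
# Word-presence features for two pages (toy vocabulary of three words)
word_features = [[1, 0, 1],
                 [0, 0, 1]]
word_names = ["chicken", "fashion", "recipe"]

# Hypothetical quantitative features for the same two pages
extra_features = [[0.25, 0.10],
                  [0.40, 0.05]]
extra_names = ["html_ratio", "image_ratio"]

# Horizontal stacking: extend each row with the extra columns
combined = [w + e for w, e in zip(word_features, extra_features)]

# Names must follow the same left-to-right order as the columns
combined_names = word_names + extra_names

for name, value in zip(combined_names, combined[0]):
    print("{}: {}".format(name, value))
```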


### Exercise: Build a random forest model to predict evergreenness of a website using the body features

In [33]:
bodies = data['body'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

# Configure the vectorizer with the same settings as for the titles

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the bodies
vectorizer.fit(bodies)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)

X = vectorizer.transform(bodies).toarray()
y = data['label']

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# AUC is the area under the ROC curve: 0.5 corresponds to random guessing
# and 1.0 to a perfect ranking of the samples


CV AUC [ 0.83437006  0.84556555  0.84125276], Average AUC 0.840396122236
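
Each `cross_val_score` call in this notebook prints three scores, one per held-out fold. The sketch below shows how sample indices can be split into three train/test folds. It is a plain, contiguous k-fold and only an approximation of what scikit-learn does, since for classifiers the library also keeps the label proportions similar across folds (stratification). `k_fold_indices` is an illustrative helper, not a library function.

```python
def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # Distribute the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(7, k=3):
    print("train={} test={}".format(train, test))
```

The model is fitted on each `train` part and scored on the matching `test` part, so every sample is used for evaluation exactly once.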


### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# TfidfVectorizer weights each feature (= word or n-gram) by its inverse document
# frequency: features that occur in many records are down-weighted, and features
# that occur in few records are up-weighted. Each row is then normalized to unit length.

# If the features that occur in most records are the important ones, the
# CountVectorizer approach may work better than the TfidfVectorizer approach.

# fillna('') substitutes an empty string whenever title does not have a value

titles = data['title'].fillna('')

# Configure the vectorizer with the same settings as the CountVectorizer above

vectorizer = TfidfVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram),
# then convert the sparse result into a dense array

X = vectorizer.transform(titles).toarray()
y = data['label']

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# AUC is the area under the ROC curve: 0.5 corresponds to random guessing
# and 1.0 to a perfect ranking of the samples

CV AUC [ 0.80078166  0.81069796  0.80821534], Average AUC 0.806564987795
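
To see what the TF-IDF weighting changes compared with plain presence flags, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which to my understanding matches scikit-learn's defaults; treat that as an assumption, not a statement about the library's internals. With `binary=True`, as in the cell above, the term frequency is 0 or 1, so only the idf weight and the normalization matter. The toy documents are invented for illustration.

```python
import math

# Toy documents represented as sets of terms (presence only, like binary=True)
docs = [{"recipe", "chicken"}, {"recipe", "news"}, {"recipe"}]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {t: sum(1 for d in docs if t in d) for t in vocabulary}

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocabulary}

def tfidf_row(doc):
    """Idf-weighted presence vector for one document, scaled to unit length."""
    raw = [idf[t] if t in doc else 0.0 for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

for term in vocabulary:
    print("{}: idf={:.3f}".format(term, idf[term]))
print(tfidf_row(docs[0]))
```

"recipe" occurs in every toy document and therefore gets the smallest weight, while the rarer "chicken" and "news" weigh more within their rows.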


In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

bodies = data['body'].fillna('')

# Configure the vectorizer with the same settings as for the titles

vectorizer = TfidfVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the bodies
vectorizer.fit(bodies)

# Use `transform` to generate the sample-by-word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(bodies).toarray()
y = data['label']

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# AUC is the area under the ROC curve: 0.5 corresponds to random guessing
# and 1.0 to a perfect ranking of the samples



CV AUC [ 0.83532287  0.85512747  0.83624529], Average AUC 0.842231875901
