# spaCy Demo

If you haven't installed spaCy yet, use:
```
conda install spacy
python -m spacy.en.download
```
This downloads about 500 MB of data.

Another popular package, `nltk`, can be installed as follows (you can skip this for now):

```
conda install nltk
python -m nltk.downloader all
```

This also downloads a lot of data.

## Load StumbleUpon dataset

In [1]:
# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

data = pd.read_csv("../dataset/stumbleupon.tsv", sep='\t',
                  encoding="utf-8")
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
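
The `boilerplate` column stores each page's text as a JSON string, which is why the cell above parses it with `json.loads` and reads keys with `.get(..., '')`. Below is a minimal sketch of that step on three made-up records; the strings and the helper name `extract_field` are illustrative assumptions, not part of the dataset or the original notebook.

```python
import json

# Three made-up boilerplate strings (not rows from the dataset).
raw_records = [
    '{"title": "10 Foolproof Tips for Better Sleep", "body": "There was a period in my life"}',
    '{"title": "Untitled page"}',                       # no "body" key at all
    '{"title": null, "body": "Text without a title"}',  # key present, value null
]

def extract_field(raw, field):
    """Parse one JSON string and return `field`, or '' when the key is absent."""
    return json.loads(raw).get(field, '')

toy_titles = [extract_field(r, 'title') for r in raw_records]
toy_bodies = [extract_field(r, 'body') for r in raw_records]

print(toy_titles)
print(toy_bodies)
```

Note that a key which is present but set to `null` comes back as `None` rather than the `''` default, which is one reason to call `dropna()` or `fillna()` before using these columns.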


In [2]:
## Load spacy

from spacy.en import English
nlp_toolkit = English()
nlp_toolkit

<spacy.en.English at 0x573b908>

Another way to load `spacy`:
```
import spacy
nlp_toolkit = spacy.load("en")
```

In [3]:
title = u"IBM sees holographic calls, air breathing batteries"
parsed = nlp_toolkit(title)

for (i, word) in enumerate(parsed): 
    print "Word: {}".format(word)
    print "\t Phrase type: {}".format(word.dep_)
    print "\t Is the word a known entity type? {}".format(
        word.ent_type_  if word.ent_type_ else "No")
    print "\t Lemma: {}".format(word.lemma_)
    print "\t Parent of this word: {}".format(word.head.lemma_)

Word: IBM
	 Phrase type: nsubj
	 Is the word a known entity type? ORG
	 Lemma: ibm
	 Parent of this word: see
Word: sees
	 Phrase type: ROOT
	 Is the word a known entity type? No
	 Lemma: see
	 Parent of this word: see
Word: holographic
	 Phrase type: amod
	 Is the word a known entity type? No
	 Lemma: holographic
	 Parent of this word: call
Word: calls
	 Phrase type: dobj
	 Is the word a known entity type? No
	 Lemma: call
	 Parent of this word: see
Word: ,
	 Phrase type: punct
	 Is the word a known entity type? No
	 Lemma: ,
	 Parent of this word: call
Word: air
	 Phrase type: compound
	 Is the word a known entity type? No
	 Lemma: air
	 Parent of this word: breathing
Word: breathing
	 Phrase type: compound
	 Is the word a known entity type? No
	 Lemma: breathing
	 Parent of this word: battery
Word: batteries
	 Phrase type: appos
	 Is the word a known entity type? No
	 Lemma: battery
	 Parent of this word: call


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=False, stop_words='english', min_df=3)

docs = cv.fit_transform(data.body.dropna())
id2word = dict(enumerate(cv.get_feature_names()))

In [9]:
id2word

{0: u'00',
 1: u'000',
 2: u'000000',
 3: u'001',
 4: u'007',
 5: u'00am',
 6: u'00pm',
 7: u'01',
 8: u'01pm',
 9: u'02',
 10: u'0206790666',
 11: u'025',
 12: u'03',
 13: u'04',
 14: u'044',
 15: u'05',
 16: u'06',
 17: u'0674921071',
 18: u'07',
 19: u'075',
 20: u'0782835788',
 21: u'08',
 22: u'09',
 23: u'0g',
 24: u'0http',
 25: u'0px',
 26: u'0s',
 27: u'0sodium',
 28: u'10',
 29: u'100',
 30: u'1000',
 31: u'100000000000000000',
 32: u'1000000000000000000',
 33: u'1000px',
 34: u'1000s',
 35: u'1001',
 36: u'10013',
 37: u'100g',
 38: u'100k',
 39: u'100m',
 40: u'100ml',
 41: u'100px',
 42: u'100th',
 43: u'101',
 44: u'10184',
 45: u'102',
 46: u'1024',
 47: u'103',
 48: u'1034',
 49: u'1036',
 50: u'104',
 51: u'105',
 52: u'10522',
 53: u'10529',
 54: u'106',
 55: u'107',
 56: u'108',
 57: u'1080p',
 58: u'109',
 59: u'1090',
 60: u'10am',
 61: u'10g',
 62: u'10km',
 63: u'10lbs',
 64: u'10m',
 65: u'10mm',
 66: u'10oz',
 67: u'10pm',
 68: u'10px',
 69: u'10th',
 70: u'11'
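
To make the vocabulary above less opaque, here is a simplified sketch of what a count vectorizer with `stop_words` and `min_df` does: tokenize, drop stop words, keep only terms that occur in at least `min_df` documents, then count. The whitespace tokenizer, the two-word stop list, and every `*_toy` name are assumptions made for illustration; scikit-learn's own tokenizer and stop list differ.

```python
from collections import Counter

# Made-up documents, stop list, and threshold (assumptions for illustration).
docs_toy = [
    "add the butter and the sugar",
    "butter the pan",
    "sugar and cream",
]
stop_words_toy = {"the", "and"}
min_df_toy = 2

def tokenize(doc):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in doc.lower().split() if t not in stop_words_toy]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for doc in docs_toy:
    doc_freq.update(set(tokenize(doc)))

# Keep terms that appear in at least `min_df_toy` documents, in sorted order.
vocab_toy = sorted(t for t, n in doc_freq.items() if n >= min_df_toy)
id2word_toy = dict(enumerate(vocab_toy))

# One row per document, one column per vocabulary term.
counts_toy = [[tokenize(doc).count(t) for t in vocab_toy] for doc in docs_toy]

print(id2word_toy)
print(counts_toy)
```

Terms such as `add`, `pan`, and `cream` occur in only one toy document each, so they fall below the threshold and never get a column.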

In [12]:
from gensim.models.ldamodel import LdaModel
from gensim.matutils import Sparse2Corpus

# First, convert the document-term matrix into gensim's corpus format.
# Rows of `docs` are documents, hence documents_columns=False.
corpus = Sparse2Corpus(docs, documents_columns=False)

# Then fit an LDA model with 15 topics.
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=15)

In [13]:
lda_model

<gensim.models.ldamodel.LdaModel at 0x14b602e8>

In [15]:
lda_model.show_topics()

[(13,
  u'0.009*"sports" + 0.007*"world" + 0.005*"2012" + 0.004*"information" + 0.004*"com" + 0.003*"news" + 0.003*"videos" + 0.003*"like" + 0.003*"la" + 0.003*"site"'),
 (7,
  u'0.006*"size" + 0.005*"color" + 0.005*"use" + 0.004*"make" + 0.004*"wii" + 0.004*"just" + 0.004*"left" + 0.004*"like" + 0.004*"font" + 0.004*"margin"'),
 (5,
  u'0.009*"chicken" + 0.008*"soup" + 0.008*"recipes" + 0.006*"recipe" + 0.005*"oil" + 0.004*"just" + 0.004*"add" + 0.004*"like" + 0.004*"olive" + 0.004*"minutes"'),
 (6,
  u'0.014*"chocolate" + 0.013*"cup" + 0.011*"butter" + 0.009*"sugar" + 0.008*"cake" + 0.008*"baking" + 0.008*"cream" + 0.008*"add" + 0.008*"make" + 0.007*"recipe"'),
 (11,
  u'0.009*"chocolate" + 0.006*"recipes" + 0.004*"new" + 0.004*"best" + 0.004*"hot" + 0.004*"like" + 0.003*"cuisine" + 0.003*"just" + 0.003*"gastronomy" + 0.003*"make"'),
 (8,
  u'0.010*"2009" + 0.009*"2010" + 0.008*"2008" + 0.008*"2007" + 0.007*"2006" + 0.007*"calories" + 0.006*"april" + 0.006*"10" + 0.006*"september" + 

In [23]:
num_topics = 15
n_words_per_topic = 5
for ti, topic in enumerate(lda_model.show_topics(num_topics=num_topics, num_words=n_words_per_topic)):
    print("Topic: %d" % ti)
    print(topic)
    print("")  # a bare print() would emit "()" under Python 2's print statement

Topic: 0
(0, u'0.030*"food" + 0.014*"raw" + 0.008*"foods" + 0.006*"recipes" + 0.005*"make"')

Topic: 1
(1, u'0.009*"fashion" + 0.009*"image" + 0.008*"2011" + 0.007*"link" + 0.007*"images"')

Topic: 2
(2, u'0.030*"flashvars" + 0.011*"future" + 0.010*"technology" + 0.006*"com" + 0.006*"true"')

Topic: 3
(3, u'0.011*"swimsuit" + 0.010*"si" + 0.008*"models" + 0.007*"photo" + 0.006*"sports"')

Topic: 4
(4, u'0.011*"indie" + 0.009*"nav" + 0.008*"clothing" + 0.006*"recipe" + 0.006*"potato"')

Topic: 5
(5, u'0.009*"chicken" + 0.008*"soup" + 0.008*"recipes" + 0.006*"recipe" + 0.005*"oil"')

Topic: 6
(6, u'0.014*"chocolate" + 0.013*"cup" + 0.011*"butter" + 0.009*"sugar" + 0.008*"cake"')

Topic: 7
(7, u'0.006*"size" + 0.005*"color" + 0.005*"use" + 0.004*"make" + 0.004*"wii"')

Topic: 8
(8, u'0.010*"2009" + 0.009*"2010" + 0.008*"2008" + 0.008*"2007" + 0.007*"2006"')

Topic: 9
(9, u'0.006*"year" + 0.006*"like" + 0.005*"new" + 0.005*"said" + 0.004*"just"')

Topic: 10
(10, u'0.009

In [24]:
from gensim.models.word2vec import Word2Vec

# Tokenize each body on whitespace; rows with a missing body are dropped.
text = data.body.dropna().map(lambda x: x.split())

# `size` is the embedding dimension (renamed `vector_size` in gensim 4.0).
model = Word2Vec(text, size=100, window=5, min_count=5, workers=4)
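
The model above maps each sufficiently frequent word to a 100-dimensional vector, and such vectors are commonly compared with cosine similarity. The sketch below shows that comparison on made-up 3-dimensional vectors; the numbers and the `toy_vectors` name are assumptions for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings: two baking words and one finance word.
toy_vectors = {
    "cake": [0.9, 0.1, 0.0],
    "frosting": [0.8, 0.2, 0.1],
    "stock": [0.0, 0.1, 0.9],
}

sim_cake_frosting = cosine_similarity(toy_vectors["cake"], toy_vectors["frosting"])
sim_cake_stock = cosine_similarity(toy_vectors["cake"], toy_vectors["stock"])

print(sim_cake_frosting > sim_cake_stock)
```

Words whose vectors point in similar directions score close to 1, while unrelated words score close to 0.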

## Investigate Page Titles

Let's see if we can find organizations in our page titles.

In [None]:
def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)

# Take a look
data[data['references_organization']][['title']].head()

## Exercise:

Write a function to identify titles that mention an organization (ORG) and a person (PERSON).

.
.
.
.
.
.
.
.

In [None]:
## Exercise solution

def references_org_person(title):
    parsed = nlp_toolkit(title)
    contains_org = any([word.ent_type_ == 'ORG' for word in parsed])
    contains_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return contains_org and contains_person

data['references_org_person'] = data['title'].fillna(u'').map(references_org_person)

# Take a look
data[data['references_org_person']][['title']].head()


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

The columns are described below.

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of number of <embed> usage
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 <a> tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or it has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

> ### Let's try extracting some of the text content.
> ### Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?

In [None]:
# Option 1: Create a function to check for this

def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except Exception:  # e.g. a title that str() cannot convert
        return 0
        
data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions

#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)


# Option 3: string functions (case-insensitive; missing titles count as 0)
data['recipe'] = data['title'].str.contains('recipe', case=False, na=False).astype(int)
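
Once the flag exists, the question above comes down to comparing the mean of `label` in the two groups; in pandas that could be written as `data.groupby('recipe')['label'].mean()`. The plain-Python sketch below spells out the same arithmetic. The titles and labels are invented for illustration, so the resulting rates say nothing about the real dataset, and the helper names are hypothetical.

```python
# Made-up (title, label) pairs; label 1 means evergreen.
toy_rows = [
    ("Easy Chicken Soup Recipe", 1),
    ("Best Chocolate Cake recipe ever", 1),
    ("Recipe roundup: this week's news", 0),
    ("IBM sees holographic calls", 0),
    ("10 Foolproof Tips for Better Sleep", 1),
    ("Jersey sales figures for 2012", 0),
]

def title_has_recipe(title):
    """Case-insensitive check for the substring 'recipe'."""
    return 'recipe' in title.lower()

def evergreen_rate(pairs):
    """Fraction of rows whose label is 1."""
    labels = [label for _, label in pairs]
    return sum(labels) / float(len(labels))

with_recipe = [row for row in toy_rows if title_has_recipe(row[0])]
without_recipe = [row for row in toy_rows if not title_has_recipe(row[0])]

rate_with = evergreen_rate(with_recipe)
rate_without = evergreen_rate(without_recipe)

print(rate_with)
print(rate_without)
```

Running the equivalent comparison on the actual `data` frame is what answers the question for StumbleUpon pages.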

### Demo: Use of the Count Vectorizer

In [None]:
titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000, 
                             ngram_range=(1, 2), 
                             stop_words='english',
                             binary=True)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-term matrix: one row per title, one column per feature (word or n-gram)
X = vectorizer.transform(titles)
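
The settings `ngram_range=(1, 2)` and `binary=True` mean that each title contributes its single words and its adjacent word pairs, and that a feature records presence rather than a count. A minimal sketch of that idea follows; the helper `word_ngrams` and the example title are assumptions for illustration, and stop-word removal and the `max_features` cut-off are deliberately left out.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Made-up title in which "chocolate" appears twice.
toy_tokens = "chocolate cake chocolate frosting".split()
toy_grams = word_ngrams(toy_tokens)

# binary=True keeps only presence, so repeated n-grams collapse to one feature.
binary_features = sorted(set(toy_grams))

print(toy_grams)
print(binary_features)
```

With counts instead of presence, `chocolate` would carry the value 2 for this title; with the binary setting it is simply 1.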

### Demo: Build a random forest model to predict the evergreenness of a website using the title features

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

model = RandomForestClassifier(n_estimators=20)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample-by-term matrix: one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']

# Cross-validated ROC AUC of the random forest on the title features
scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

### Exercise: Build a random forest model to predict the evergreenness of a website using the title features and quantitative features

In [None]:
## TODO

### Exercise: Build a random forest model to predict the evergreenness of a website using the body features

In [None]:
## TODO

### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?
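
As background for this exercise, TF-IDF down-weights terms that appear in many documents. The sketch below computes only the inverse-document-frequency part, using the smoothed form `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents for its default `smooth_idf=True` setting; the toy documents and helper name are assumptions for illustration, and the term-frequency multiplication and row normalization are omitted.

```python
import math

# Made-up tokenized documents.
toy_docs = [
    ["chicken", "soup", "recipe"],
    ["chocolate", "cake", "recipe"],
    ["stock", "market", "news"],
]
n_docs = len(toy_docs)

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, where df is the number of documents containing term."""
    df = sum(1 for doc in toy_docs if term in doc)
    return math.log((1.0 + n_docs) / (1.0 + df)) + 1.0

idf_recipe = smoothed_idf("recipe")  # appears in 2 of 3 documents
idf_soup = smoothed_idf("soup")      # appears in 1 of 3 documents

print(idf_soup > idf_recipe)
```

The rarer term gets the larger weight, which is the behaviour to keep in mind when comparing the two vectorizers.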

In [None]:
## TODO