In [2]:
import pandas as pd
import json

data = pd.read_csv("https://github.com/ga-students/DAT-NYC-37/blob/master/lessons/lesson-13/assets/dataset/stumbleupon.tsv?raw=true", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

A description of the columns is below.

#### What are 'evergreen' sites?

Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

*In the sample rows shown above, `label = 1` marks 'evergreen' websites.*

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of `<embed>` usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an `<a>` with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of `<img>` tags vs text in the page
is_news|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 `<a>` tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink text
news_front_page|integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Number of alphanumeric characters in the page's text
numberOfLinks|integer|Number of `<a>` markups
numwords_in_url|double|Number of words in the URL
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or it has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

---

### Objective: Predict if a given site will be evergreen based on the above features

**Problem:** Some of the above features are text-only (`title`, `url`, `body`). How can we use the modeling techniques we've covered so far with text-based features?

**Solution:** Transform text features into many numerical features.
  - Count Vectorization
  - Term frequency/inverse document frequency (TF-IDF) Vectorization.
---

## Demo: Understanding Count Vectorization

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Count vectorization can be thought of as a simple word count across all documents. 
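To make the idea concrete, here is a minimal pure-Python sketch of a word count across documents. The helper names `tokenize` and `count_vectorize` are hypothetical, and the regular expression is assumed to mirror `CountVectorizer`'s default token pattern (lowercased runs of two or more word characters). It is an illustration, not scikit-learn's implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def count_vectorize(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct token seen in any document, sorted alphabetically.
    vocabulary = sorted(set(word for c in counts for word in c))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

docs = ["The Bulls won", "The Bulls lost to the Heat"]
vocabulary, rows = count_vectorize(docs)
print(vocabulary)
print(rows)
```

Each document becomes one row and each vocabulary word one column. Unlike the `CountVectorizer` demo that follows, this sketch does not remove stop words, so "the" stays in the vocabulary.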

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "IBM Sees Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races",  # without this comma, Python joins the last two titles into one string
    "The Chicago Bulls won"
]

count_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
count_vectorized_titles = count_vectorizer.fit_transform(titles)

print "Feature names: \n", count_vectorizer.get_feature_names()
print "Feature counts: \n", count_vectorized_titles.todense()
print

# Represent Count Vectorized results as a dataframe so we can preview it more easily.
pd.DataFrame(
    columns=count_vectorizer.get_feature_names(),
    index=['Article1', 'Article2', 'Article3'],
    data=count_vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'advantages races', u'air', u'air breathing', u'batteries', u'breathing', u'breathing batteries', u'bulls', u'bulls won', u'calls', u'calls air', u'chicago', u'chicago bulls', u'electronic', u'electronic calls', u'electronic futuristic', u'eliminates', u'eliminates advantages', u'fully', u'fully electronic', u'futuristic', u'futuristic starting', u'gun', u'gun eliminates', u'ibm', u'ibm sees', u'races', u'sees', u'sees electronic', u'starting', u'starting gun', u'won']
Feature counts: 
[[0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 0]
 [0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]



Unnamed: 0,advantages,advantages races,air,air breathing,batteries,breathing,breathing batteries,bulls,bulls won,calls,...,gun,gun eliminates,ibm,ibm sees,races,sees,sees electronic,starting,starting gun,won
Article1,0,0,1,1,1,1,1,0,0,1,...,0,0,1,1,0,1,1,0,0,0
Article2,1,1,0,0,0,0,0,0,0,0,...,1,1,0,0,1,0,0,1,1,0
Article3,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,1


In [4]:
# TODO: Apply count vectorization to all titles

## Demo: Term-frequency, Inverse document frequency (Tf-Idf)

An alternative bag-of-words approach to CountVectorizer is a Term Frequency - Inverse Document Frequency (TF-IDF) representation.

TF-IDF uses the product of two intermediate values, the **Term Frequency** and **Inverse Document Frequency**.

- **Term Frequency** is equivalent to CountVectorizer features, just the number of times a word appears in the document (i.e. count).

- **Document Frequency** is the percentage of documents that a particular word appears in. 

For example, “the” would be 100% while “Syria” is much lower.  

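Conceptually, Inverse Document Frequency is just 1 / Document Frequency, so rare words receive larger weights than common ones. Implementations typically take a logarithm of this ratio so that very rare words do not dominate.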
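As a rough sketch of how these pieces combine in practice: with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn's `TfidfVectorizer` computes the inverse document frequency as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term, and then scales each row to unit length. The function `tfidf` below is a hypothetical pure-Python helper written for illustration; the token lists are assumed to be the two demo titles after lowercasing and English stop-word removal.

```python
import math

def tfidf(tokenized_docs):
    """Return (vocabulary, idf, rows) using a smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted(set(t for doc in tokenized_docs for t in doc))
    # Document frequency: how many documents contain each term.
    df = {t: sum(1 for doc in tokenized_docs if t in doc) for t in vocabulary}
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        raw = [doc.count(t) * idf[t] for t in vocabulary]   # term frequency * IDF
        norm = math.sqrt(sum(w * w for w in raw))           # Euclidean length of the row
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocabulary, idf, rows

docs = [
    ["ibm", "sees", "electronic", "calls", "air", "breathing", "batteries"],
    ["fully", "electronic", "futuristic", "starting", "gun",
     "eliminates", "advantages", "races"],
]
vocabulary, idf, rows = tfidf(docs)
print(idf["ibm"])
print(idf["electronic"])
print(rows[0][vocabulary.index("ibm")])
```

"electronic" appears in both titles, so it gets the smallest possible weight, while words unique to one title are weighted more heavily.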


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "IBM Sees Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races"
]

tfidf_vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf = False)
tfidf_vectorized_titles = tfidf_vectorizer.fit_transform(titles)

print "Feature names: \n", tfidf_vectorizer.get_feature_names()
print "Feature counts: \n", tfidf_vectorized_titles.todense()
print "IDF weights", tfidf_vectorizer.idf_


# Represent the TF-IDF vectorized results as a dataframe so we can preview them more easily.
pd.DataFrame(
    columns=tfidf_vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=tfidf_vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'air', u'batteries', u'breathing', u'calls', u'electronic', u'eliminates', u'fully', u'futuristic', u'gun', u'ibm', u'races', u'sees', u'starting']
Feature counts: 
[[ 0.          0.39204401  0.39204401  0.39204401  0.39204401  0.27894255
   0.          0.          0.          0.          0.39204401  0.
   0.39204401  0.        ]
 [ 0.36499647  0.          0.          0.          0.          0.25969799
   0.36499647  0.36499647  0.36499647  0.36499647  0.          0.36499647
   0.          0.36499647]]
IDF weights [ 1.40546511  1.40546511  1.40546511  1.40546511  1.40546511  1.
  1.40546511  1.40546511  1.40546511  1.40546511  1.40546511  1.40546511
  1.40546511  1.40546511]


Unnamed: 0,advantages,air,batteries,breathing,calls,electronic,eliminates,fully,futuristic,gun,ibm,races,sees,starting
Article1,0.0,0.392044,0.392044,0.392044,0.392044,0.278943,0.0,0.0,0.0,0.0,0.392044,0.0,0.392044,0.0
Article2,0.364996,0.0,0.0,0.0,0.0,0.259698,0.364996,0.364996,0.364996,0.364996,0.0,0.364996,0.0,0.364996


In [6]:
# TODO: Determine Tf-Idf of title 
# Find the words with 100 highest and lowest inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf = False)

title_documents = data['title'].dropna()
tfidf_vectorizer.fit(title_documents)
tfidf_vectorized_titles = tfidf_vectorizer.transform(title_documents)  # transform the dataset titles, not the two demo titles

feature_names         = tfidf_vectorizer.get_feature_names()  # Returns all feature names
inverse_document_freq = tfidf_vectorizer.idf_  # Returns the inverse document frequency of each feature

tfidf = pd.DataFrame(
    {
        "feature_names": feature_names,
        "inverse_document_freq": inverse_document_freq
    }
)

sorted_tfidf = tfidf.sort_values(by='inverse_document_freq')

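sorted_tfidf.tail(100)  # Highest IDF (rarest words); not displayed, since only a cell's last expression is shown
sorted_tfidf.head(100)  # Lowest IDF (most common words)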

Unnamed: 0,feature_names,inverse_document_freq
7510,recipe,3.542320
2015,com,3.556185
7513,recipes,3.920619
3623,food,4.196644
1809,chocolate,4.342550
8616,sports,4.354111
9805,video,4.477725
6292,news,4.517999
1011,best,4.584061
3375,fashion,4.714114


### Demo: Use of the Count Vectorizer with ngrams

We can use the `ngram_range` parameter to find n-grams: sequences of n consecutive words.
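A minimal sketch of how n-grams can be built from a token list. The helpers `ngrams` and `ngram_features` are hypothetical names used only for illustration; `ngram_range=(1, 2)` is read here as "all unigrams plus all bigrams", which is how the parameter is described above.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(tokens, ngram_range=(1, 2)):
    """Collect the n-grams for every n from the low to the high end of the range."""
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = ["ibm", "sees", "electronic", "calls"]
features = ngram_features(tokens)
print(features)
```

In `CountVectorizer`, stop words are removed before n-grams are formed, which is why the output below contains bigrams such as "gun eliminates" even though the original title reads "Gun That Eliminates".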

In [7]:
# Note the inclusion of ngram_range
count_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
count_vectorized_titles = count_vectorizer.fit_transform(titles)

print "Feature names: \n", count_vectorizer.get_feature_names()
print "Feature counts: \n", count_vectorized_titles.todense()
print

# Represent Count Vectorized results as a dataframe so we can preview it more easily.
pd.DataFrame(
    columns=count_vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=count_vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'advantages races', u'air', u'air breathing', u'batteries', u'breathing', u'breathing batteries', u'calls', u'calls air', u'electronic', u'electronic calls', u'electronic futuristic', u'eliminates', u'eliminates advantages', u'fully', u'fully electronic', u'futuristic', u'futuristic starting', u'gun', u'gun eliminates', u'ibm', u'ibm sees', u'races', u'sees', u'sees electronic', u'starting', u'starting gun']
Feature counts: 
[[0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0]
 [1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1]]



Unnamed: 0,advantages,advantages races,air,air breathing,batteries,breathing,breathing batteries,calls,calls air,electronic,...,futuristic starting,gun,gun eliminates,ibm,ibm sees,races,sees,sees electronic,starting,starting gun
Article1,0,0,1,1,1,1,1,1,1,1,...,0,0,0,1,1,0,1,1,0,0
Article2,1,1,0,0,0,0,0,0,0,1,...,1,1,1,0,0,1,0,0,1,1


---

# Review Exercise

## Exercise Demo: Build a random forest model to predict evergreenness of a website using the title features

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # replaces the deprecated sklearn.cross_validation module

# 1. We need to fill NaN's with an empty string, otherwise the count vectorizer will fail.
titles = data['title'].fillna('')

# 2. Use `fit` to learn the vocabulary of the titles
count_vectorizer.fit(titles)

# 3. Use `transform` to generate the sample X word matrix - one column per feature (word or n-gram)
# Hint: Steps 2 & 3 can be combined by using `count_vectorizer.fit_transform(titles)`
X = count_vectorizer.transform(titles).toarray()
y = data['label']

# 4. Define our RandomForestClassifier model. It will fit 20 decision trees, each on a random bootstrap sample of the dataset.
rf_model = RandomForestClassifier(n_estimators = 20)

# 5. Score the model with 3-fold cross-validation, using ROC AUC as the metric.
scores = cross_val_score(rf_model, X, y, cv=3, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.82250428  0.82798353  0.81689352], Average AUC 0.822460444702
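The hint in the cell above says that `fit` followed by `transform` can be replaced by `fit_transform`. The following sketch checks that claim on a few hypothetical toy titles (stand-ins for `data['title']`, not taken from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles standing in for data['title'].
toy_titles = [
    "Easy Chocolate Cake Recipe",
    "Stock Market Falls Today",
    "Best Banana Bread Recipe",
]

# Two-step version: learn the vocabulary, then build the document-term matrix.
two_step = CountVectorizer(stop_words='english', ngram_range=(1, 2))
two_step.fit(toy_titles)
X_two_step = two_step.transform(toy_titles)

# One-step version: fit_transform learns the vocabulary and builds the matrix at once.
one_step = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X_one_step = one_step.fit_transform(toy_titles)

# vocabulary_ maps each term to its column index.
same_vocabulary = two_step.vocabulary_ == one_step.vocabulary_
same_counts = bool((X_two_step.toarray() == X_one_step.toarray()).all())
print('same vocabulary: {}, same counts: {}'.format(same_vocabulary, same_counts))
```

The two-step form still matters for new documents: call only `transform` on them, so that the columns keep the meaning learned during `fit`.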


### Exercise: Build a random forest model to predict evergreenness of a website using the title features and quantitative features

In [9]:
# To make our lives easier, let's define a simple utility function to convert an array or series of
# text documents into a count vectorized dataframe:
def count_vectorized_dataframe(documents, ngram_range=(1, 2)):
    count_vectorizer = CountVectorizer(stop_words='english', ngram_range=ngram_range)
    count_vectorized_results = count_vectorizer.fit_transform(documents)
    
    return pd.DataFrame(
        columns=count_vectorizer.get_feature_names(),
        data=count_vectorized_results.todense()
    )

# Example Usage
title_text = data['title'].fillna('')

print "Preview of documents to be vectorized (input): "
print title_text.head()

count_vectorized_titles = count_vectorized_dataframe(title_text)

print "\nVectorized Output sample: "
count_vectorized_titles.head()

Preview of documents to be vectorized (input): 
0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
3                  10 Foolproof Tips for Better Sleep 
4    The 50 Coolest Jerseys You Didn t Know Existed...
Name: title, dtype: object

Vectorized Output sample: 


Unnamed: 0,00,00 000,000,000 deaths,000 eye,000 feet,000 fine,000 lbs,000 legos,000 macs,...,アンブロワジー evan,スタイルアリーナ,スタイルアリーナ style,南方周末,南方周末 首页,可乐鸡翅,可乐鸡翅 cooking,東京のストリートファッション最新情報,東京のストリートファッション最新情報 スタイルアリーナ,首页
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
## TODO: We want to repeat the above, but with these quantitative features as well:

# Step 1: Prepare the input data by selecting the relevant columns
quantitative_columns = [
    'image_ratio',
    'html_ratio'
]

# No dummy encoding is needed here, because both selected columns are already numeric.

quantitative_features = data[quantitative_columns]

# Horizontally concatenate the count_vectorized_title features and the quantitative features into a single DF
X = pd.concat([count_vectorized_titles, quantitative_features], axis=1)

X.head()

Unnamed: 0,00,00 000,000,000 deaths,000 eye,000 feet,000 fine,000 lbs,000 legos,000 macs,...,スタイルアリーナ style,南方周末,南方周末 首页,可乐鸡翅,可乐鸡翅 cooking,東京のストリートファッション最新情報,東京のストリートファッション最新情報 スタイルアリーナ,首页,image_ratio,html_ratio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.003883,0.245831
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.088652,0.20349
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.120536,0.226402
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.035343,0.265656
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.050473,0.228887
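The dataframe above is almost entirely zeros, so storing it densely uses a lot of memory. One possible alternative, sketched below under the assumption that SciPy is available, is to keep the count matrix sparse and append the numeric columns with `scipy.sparse.hstack`. The toy titles and the numeric values are hypothetical stand-ins for `data['title']` and the `image_ratio` / `html_ratio` columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the title column and two quantitative columns.
toy_titles = ["chocolate cake recipe", "stock market news", "banana bread recipe"]
toy_numeric = np.array([[0.12, 0.22], [0.01, 0.25], [0.09, 0.20]])

vectorizer = CountVectorizer(stop_words='english')
title_counts = vectorizer.fit_transform(toy_titles)  # scipy sparse matrix, one row per title

# Append the numeric columns to the right of the word-count columns, staying sparse.
X_sparse = hstack([title_counts, csr_matrix(toy_numeric)]).tocsr()
print(X_sparse.shape)
```

scikit-learn's `RandomForestClassifier` accepts sparse input, so a matrix like `X_sparse` can be passed to `fit` or `cross_val_score` without calling `.toarray()`.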


In [None]:
# Step 2: Repeat the process in previous exercise, only with our new dataframe

rf_model = RandomForestClassifier(n_estimators=10)
scores = cross_val_score(rf_model, X.values, y, cv=2, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.77075899  0.77101785], Average AUC 0.770888419372


### Exercise: Build a random forest model to predict evergreenness of a website using only the features extracted from the `body` column

In [None]:
## TODO

body_documents = data['body'].fillna('')
X = count_vectorized_dataframe(body_documents)

# Same as before, but with a different input
rf_model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(rf_model, X.values, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

### Exercise: Repeat the above exercises using `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [None]:
## TODO
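One way to approach this comparison, offered as a sketch rather than the intended solution: wrap each vectorizer and the forest in a scikit-learn pipeline, so that the vocabulary is re-learned inside every cross-validation fold, and score both with the same metric. The helper `mean_auc` and the toy documents are hypothetical; on the real data, `toy_docs` and `toy_labels` would be replaced by `data['title'].fillna('')` (or the body text) and `data['label']`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def mean_auc(vectorizer, documents, labels, cv=3):
    """Mean cross-validated ROC AUC of a random forest on the vectorized documents."""
    model = make_pipeline(
        vectorizer,
        RandomForestClassifier(n_estimators=20, random_state=0),
    )
    return cross_val_score(model, documents, labels, cv=cv, scoring='roc_auc').mean()

# Hypothetical toy data: 1 = evergreen, 0 = not evergreen.
toy_docs = [
    "easy chocolate cake recipe", "best banana bread recipe",
    "simple chicken soup recipe", "homemade pizza dough recipe",
    "quick pasta salad recipe", "classic apple pie recipe",
    "election results announced tonight", "stock market falls today",
    "team wins championship game tonight", "storm hits coast today",
    "senate passes budget bill today", "company reports earnings tonight",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

results = {
    'count': mean_auc(CountVectorizer(stop_words='english'), toy_docs, toy_labels),
    'tfidf': mean_auc(TfidfVectorizer(stop_words='english'), toy_docs, toy_labels),
}
print(results)
```

The scores on twelve toy documents say nothing about the real dataset; the point is only that both vectorizers are evaluated under identical conditions, which is what the question "is this an improvement?" requires.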