# Predicting Evergreeness of Content with Decision Trees and Random Forests

In [1]:
## DATA DICTIONARY

In [2]:
import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()


# The boilerplate column holds JSON stored as a string.
# Series.map applies a function to each element, much like Series.apply.
# json.loads(x) is a function in the json module; it parses each boilerplate
# string into a dict, from which we pull the title and body.

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
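
The `boilerplate` parsing above assumes every row holds valid JSON containing a `title` and a `body`. The sketch below is one defensive variant, shown on a three-row toy frame rather than the real file. The helper name `extract_field` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import json

import pandas as pd


def extract_field(raw, field):
    """Return parsed[field] from a JSON string, or '' if it is missing or unparseable."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw is not a string (e.g. a missing value)
        # ValueError: raw is a string but not valid JSON
        return ''
    if not isinstance(parsed, dict):
        return ''
    value = parsed.get(field)
    return value if value is not None else ''


# Toy rows (hypothetical): a normal row, a JSON null title, and a missing value
toy = pd.DataFrame({'boilerplate': [
    '{"title": "Ten Sleep Tips", "body": "There was a period"}',
    '{"title": null, "body": "No title here"}',
    None,
]})

toy['title'] = toy.boilerplate.map(lambda x: extract_field(x, 'title'))
toy['body'] = toy.boilerplate.map(lambda x: extract_field(x, 'body'))
print(toy[['title', 'body']])
```

Returning an empty string for every failure case keeps later string operations on `title` and `body` from tripping over `None`.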


In [3]:
data.dtypes

#this is our computational schema, not our data dictionary

url                                object
urlid                               int64
boilerplate                        object
alchemy_category                   object
alchemy_category_score             object
avglinksize                       float64
commonlinkratio_1                 float64
commonlinkratio_2                 float64
commonlinkratio_3                 float64
commonlinkratio_4                 float64
compression_ratio                 float64
embed_ratio                       float64
framebased                          int64
frameTagRatio                     float64
hasDomainLink                       int64
html_ratio                        float64
image_ratio                       float64
is_news                            object
lengthyLinkDomain                   int64
linkwordscore                       int64
news_front_page                    object
non_markup_alphanum_characters      int64
numberOfLinks                       int64
numwords_in_url                   
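
Columns such as `alchemy_category_score` and `is_news` show up as `object` even though the data dictionary describes them as numeric. One likely cause, assumed here rather than checked for every column, is a `?` placeholder for missing values (it does appear in `is_news` later in this notebook). A minimal sketch of coercing such a column, on toy values:

```python
import pandas as pd

# Toy values (hypothetical); '?' stands in for a missing score
toy = pd.DataFrame({'alchemy_category_score': ['0.789131', '?', '0.574147']})
print(toy.dtypes)  # the column starts out as strings

# errors='coerce' turns anything that is not a number into NaN
toy['alchemy_category_score'] = pd.to_numeric(
    toy['alchemy_category_score'], errors='coerce')
print(toy.dtypes)
print(toy)
```

Whether the resulting NaN rows should be dropped or filled depends on the model you plan to fit.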

## Predicting "Evergreeness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> tag usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 <a> tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url|double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are 'evergreen' sites?

> #### Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

> #### A sample of URLs is below, where label = 1 are 'evergreen' websites

In [3]:
data[['url', 'label']].head()

#label is our target column
#per the data dictionary it was assigned by users: 1 = evergreen, 0 = non-evergreen

Unnamed: 0,url,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,0
1,http://www.popsci.com/technology/article/2012-...,1
2,http://www.menshealth.com/health/flu-fighting-...,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,1
4,http://bleacherreport.com/articles/1205138-the...,0


### Exercises to Get Started

> ### Exercise: 1. In a group: Brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.
 ###  Exercise: 2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
- E.g. if you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?
- E.g. if you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?
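
The two ideas above could be turned into columns roughly as follows. This is only a sketch on toy rows: the 0.3 cutoff, the `high_image` and `mentions_weather` column names, and the single keyword are all arbitrary assumptions to be tuned against the real data.

```python
import pandas as pd

# Toy rows (hypothetical); -1.0 mimics the negative image_ratio values seen in the data
toy = pd.DataFrame({
    'image_ratio': [0.02, 0.45, -1.0, 0.90],
    'body': ['Weather forecast for Tuesday', 'Slow cooker chili',
             None, 'Photo gallery of knots'],
})

# Flag image-heavy pages; the 0.3 threshold is an arbitrary starting point
toy['high_image'] = (toy['image_ratio'] > 0.3).astype(int)

# Flag pages whose body mentions weather, ignoring case and missing text
toy['mentions_weather'] = (toy['body'].fillna('')
                                      .str.lower()
                                      .str.contains('weather', regex=False)
                                      .astype(int))
print(toy)
```

Both features are 0/1 integers, so they can be passed straight to a tree-based model.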

### Split up and develop 1-3 of those features independently.

In [7]:
def checkForNewsSite(x):
    if x == "?" or x == 0:
        return 0
    return 1

In [6]:
#shows the ? placeholder among the unique values
data.is_news.unique()

#use a .apply to make the ? a zero.  I.e. transforming on the existing column
data["is_news"] = data.is_news.apply( lambda x: 0 if x == "?" else 1)

In [8]:
data.is_news.unique()

array([1, 0])

> ### Exercise: 3. Does being a news site affect evergreeness? 
Compute or plot the percentage of news related evergreen sites.

In [4]:
import seaborn as sb
%matplotlib inline

#Option 1: find out P(evergreen | is_news = 1) vs P(evergreen | is_news = ?)
#here we are only grouping by "is_news", so there are only 2 keys
#grabbing only the label from each group and then taking the mean of that
#data.groupby(['is_news'])[['label']].describe()
data.groupby(['is_news'])[['label']].mean()

#Option 2:

Unnamed: 0_level_0,label
is_news,Unnamed: 1_level_1
1,0.516916
?,0.507562


In [10]:
data.groupby(["is_news","label"]).apply(lambda x: len(x))

is_news  label
0        0        1400
         1        1443
1        0        2199
         1        2353
dtype: int64

In [9]:
data.groupby(["label", "is_news"]).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,frameTagRatio,framebased,hasDomainLink,html_ratio,image_ratio,lengthyLinkDomain,linkwordscore,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,urlid
label,is_news,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
0,0,count,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0
0,0,mean,3.065084,0.419253,0.181998,0.071233,0.03497,3.4662,-0.149428,0.062832,0.0,0.027143,0.239801,0.369961,0.533571,33.112143,4391.452143,144.112857,4.288571,0.190702,0.109513,5194.679286
0,0,std,7.943761,0.226565,0.150608,0.085095,0.060242,7.143394,0.359652,0.05096,0.0,0.162558,0.067079,1.894105,0.49905,23.106544,7246.136369,139.066067,3.396948,0.22257,0.107471,3030.44057
0,0,min,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564,-1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0
0,0,25%,1.471421,0.268293,0.057497,0.0,0.0,0.451943,0.0,0.027812,0.0,0.0,0.19543,0.017723,0.0,14.0,739.75,51.0,2.0,0.032258,0.061817,2616.75
0,0,50%,2.0847,0.424621,0.162033,0.041096,0.00921,0.521664,0.0,0.050351,0.0,0.0,0.235818,0.082129,1.0,28.5,2191.0,114.0,4.0,0.105263,0.090909,5157.0
0,0,75%,2.748144,0.588235,0.277655,0.113821,0.045226,0.724654,0.0,0.082508,0.0,0.0,0.27129,0.370421,1.0,48.0,4749.5,192.25,6.0,0.271196,0.121721,7842.5
0,0,max,161.0,1.0,1.0,0.719086,0.544355,21.0,0.25,0.444444,0.0,1.0,0.716883,32.5,1.0,100.0,67406.0,1237.0,19.0,1.0,1.0,10553.0
0,1,count,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0,2199.0
0,1,mean,2.479331,0.47095,0.213914,0.08823,0.048539,2.056776,-0.095039,0.065422,0.0,0.018645,0.234456,0.271402,0.71487,34.101864,5104.633015,176.622101,5.52342,0.158222,0.103742,5307.475671


In [13]:
news_and_ev = data.groupby(["label", "is_news"]).apply(lambda group: len(group))
type (news_and_ev)

pandas.core.series.Series

In [14]:
news_and_ev = data.groupby(["label", "is_news"]).apply(lambda group: len(group)).reset_index()
type (news_and_ev)

#reset_index() names the new counts column 0 (the integer zero), not "count"

pandas.core.frame.DataFrame

In [22]:
news_and_ev.columns = ["label", "is_news", "count_of_group"]
news_and_ev

Unnamed: 0,label,is_news,count_of_group
0,0,1,2199
1,0,?,1400
2,1,1,2353
3,1,?,1443


In [25]:
total = news_and_ev.count_of_group.sum()
print(total)

7395


In [31]:
news_and_ev["percent_of_group"] = news_and_ev.count_of_group.apply(lambda c: (float(c)/total)*100)
news_and_ev.head()

Unnamed: 0,label,is_news,count_of_group,percent_of_group
0,0,1,2199,29.736308
1,0,?,1400,18.931711
2,1,1,2353,31.818796
3,1,?,1443,19.513185


In [32]:
data.groupby("label").size()

label
0    3599
1    3796
dtype: int64
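
Note that `percent_of_group` above divides each count by the overall total, so it gives each group's share of all pages rather than the evergreen rate within a news group. A minimal sketch of the conditional version, reusing the four counts from the table above (the column name `percent_within_is_news` is an assumption made for this example):

```python
import pandas as pd

# Counts copied from the news_and_ev table above
counts = pd.DataFrame({
    'label': [0, 0, 1, 1],
    'is_news': ['1', '?', '1', '?'],
    'count_of_group': [2199, 1400, 2353, 1443],
})

# Total number of pages in each is_news group, aligned back to every row
group_totals = counts.groupby('is_news')['count_of_group'].transform('sum')

# Share of each label within its is_news group, i.e. P(label | is_news)
counts['percent_within_is_news'] = 100.0 * counts['count_of_group'] / group_totals
print(counts)
```

The rows with `label == 1` should agree with the group means computed earlier with `groupby(['is_news'])[['label']].mean()`.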

> ### Exercise: 4. Does category in general affect evergreeness? 
Plot the rate of evergreen sites for all Alchemy categories.

In [11]:
# Rate of evergreen pages within each Alchemy category, drawn as a bar chart
data.groupby('alchemy_category')['label'].mean().sort_values().plot(kind='bar')

> ### Exercise: 5. How many articles are there per category?
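
One possible way to count articles per category is to aggregate the count and the evergreen rate together, since a rate computed from only a handful of articles is not very trustworthy. The sketch below runs on toy rows; the output column names `evergreen_rate` and `n_articles` are assumptions for this example.

```python
import pandas as pd

# Toy rows (hypothetical); '?' mimics an unknown category
toy = pd.DataFrame({
    'alchemy_category': ['business', 'health', 'health', 'sports',
                         'recreation', 'health', 'sports', '?'],
    'label': [0, 1, 1, 0, 1, 0, 1, 0],
})

by_category = (toy.groupby('alchemy_category')['label']
                  .agg(['mean', 'count'])
                  .rename(columns={'mean': 'evergreen_rate',
                                   'count': 'n_articles'})
                  .sort_values('evergreen_rate', ascending=False))
print(by_category)
```

On the real data, replace `toy` with `data`. If only the counts are needed, `data.alchemy_category.value_counts()` gives them directly.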

> #### Let's try extracting some of the text content.
> ### Exercise: 6. Create a feature for the title containing 'recipe'. 
Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?

In [None]:
# ... #
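
A minimal sketch of one way to build the `recipe` feature and compare evergreen rates, shown on toy rows. The matching rule (case-insensitive substring, missing titles treated as empty) is an assumption; the later decision tree demo only requires that a numeric `recipe` column exists.

```python
import pandas as pd

# Toy rows (hypothetical), including a missing title
toy = pd.DataFrame({
    'title': ['Easy Chicken Recipe', 'IBM Sees Holographic Calls',
              None, 'Best brownie RECIPES'],
    'label': [1, 0, 0, 1],
})

# 1 if the title contains 'recipe' (any casing), else 0
toy['recipe'] = (toy['title'].fillna('')
                             .str.lower()
                             .str.contains('recipe', regex=False)
                             .astype(int))

# Evergreen rate for titles without (0) and with (1) 'recipe'
rate = toy.groupby('recipe')['label'].mean()
print(rate)
```

Applying the same two steps to `data` instead of `toy` gives the comparison the exercise asks for.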

###  Let's Explore Some Decision Trees

 ### Demo: Build a decision tree model to predict the "evergreeness" of a given website. 

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

# 'recipe' is the title feature from Exercise 6; it must exist before this cell runs
X = data[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
X = X.drop('label', axis=1)

# Fits the model
model.fit(X, y)

# Helper function to visualize decision trees (writes tree.dot, then renders tree.png).
# Rendering requires the Graphviz `dot` command to be installed and on the PATH.

from sklearn.tree import export_graphviz
from os import system

def build_tree_image(model):
    # The with-block closes the file even if export_graphviz raises
    with open("tree.dot", 'w') as dotfile:
        export_graphviz(model,
                        out_file=dotfile,
                        feature_names=list(X.columns))
    system("dot -Tpng tree.dot -o tree.png")

build_tree_image(model)
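
If Graphviz is not installed, a text rendering of the tree is an alternative. The sketch below uses scikit-learn's `export_text` on synthetic data; the generated features and the rule used to create the toy labels are assumptions made purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the notebook's X and y (not the real data)
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'image_ratio': rng.rand(200),
    'html_ratio': rng.rand(200),
    'recipe': rng.randint(0, 2, 200),
})
# Toy rule: a page is "evergreen" if it is a recipe or is very image heavy
y_toy = ((X_toy['recipe'] == 1) | (X_toy['image_ratio'] > 0.8)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Print the learned splits as indented if/else rules
rules = export_text(tree, feature_names=list(X_toy.columns))
print(rules)
```

Each indented line is one question the tree asks, which makes it easy to check whether the splits match your intuition about the features.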

## Decision Trees in scikit-learn

 ### Exercise: Evaluate the decision tree using cross-validation; use AUC as the evaluation metric.

In [None]:
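# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import cross_val_score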

# ... #
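
One way to approach this exercise is `cross_val_score` with `scoring='roc_auc'`. The sketch below runs on synthetic data from `make_classification` so it is self-contained; on the real data you would pass the notebook's `X` and `y` instead. The choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X and y
X_toy, y_toy = make_classification(n_samples=500, n_features=5,
                                   n_informative=3, random_state=0)

# One AUC value per fold; each fold is scored on data the tree did not see
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X_toy, y_toy, scoring='roc_auc', cv=5)
print('CV AUC: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))
```

An AUC of 0.5 corresponds to random guessing, so the mean and the spread across folds together indicate how much signal the features carry.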

###  Adjusting Decision Trees to Avoid Overfitting

 ### Demo: Control for overfitting in the decision model by adjusting the maximum number of questions (max_depth) or the minimum number of records in each final node (min_samples_leaf)

In [None]:
model = DecisionTreeClassifier(
                max_depth = 2,
                min_samples_leaf = 5)

model.fit(X, y)
build_tree_image(model)
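
To see why limiting `max_depth` matters, it can help to compare the AUC on the training data with the cross-validated AUC at several depths. This is a sketch on synthetic data, with the depths chosen arbitrarily; a large gap between the two numbers is a sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the notebook's X and y
X_toy, y_toy = make_classification(n_samples=500, n_features=5,
                                   n_informative=3, random_state=0)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_auc = cross_val_score(tree, X_toy, y_toy,
                             scoring='roc_auc', cv=5).mean()
    # AUC on the same rows the tree was fitted on
    train_auc = roc_auc_score(
        y_toy, tree.fit(X_toy, y_toy).predict_proba(X_toy)[:, 1])
    results[depth] = (train_auc, cv_auc)
    print('max_depth={}: train AUC={:.3f}, CV AUC={:.3f}'.format(
        depth, train_auc, cv_auc))
```

The unrestricted tree memorises the training rows, so its training AUC says little about how it will perform on unseen pages.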

 ### Demo: Build a random forest model to predict the evergreeness of a website. 

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
model.fit(X, y)

### Demo: Extracting importance of features

In [None]:
features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
# DataFrame.sort was removed from pandas; sort_values is its replacement
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head()

 ### Exercise: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.

In [None]:
# ... #
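
A possible starting point for this exercise: loop over a few forest sizes and record the mean cross-validated AUC for each. The sketch uses synthetic data and an arbitrary set of sizes; on the real data, substitute the notebook's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the notebook's X and y
X_toy, y_toy = make_classification(n_samples=500, n_features=5,
                                   n_informative=3, random_state=0)

mean_auc = {}
for n_trees in [5, 20, 100]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X_toy, y_toy, scoring='roc_auc', cv=5)
    mean_auc[n_trees] = scores.mean()
    print('n_estimators={}: CV AUC={:.3f}'.format(n_trees, scores.mean()))
```

Gains from adding trees typically flatten out at some point, while training time keeps growing, so it is worth looking for where the curve levels off.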

##  Independent Practice: Evaluate Random Forest Using Cross-Validation

1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation
  - Evaluate the _importance_ of the feature
3. **Bonus**: Just like the 'recipe' feature, add in similar text features and evaluate their performance.


In [None]:
# ... #