# Predicting Evergreeness of Content with Decision Trees and Random Forests

## DATA DICTIONARY

In [1]:
import pandas as pd
import json

data = pd.read_csv("../datasets/stumbleupon.tsv", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are 'evergreen' sites?

> #### Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

> #### A sample of URLs is below, where label = 1 are 'evergreen' websites

In [2]:
data[['url', 'label']].head()

Unnamed: 0,url,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,0
1,http://www.popsci.com/technology/article/2012-...,1
2,http://www.menshealth.com/health/flu-fighting-...,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,1
4,http://bleacherreport.com/articles/1205138-the...,0


### Exercises to Get Started

> ### Exercise: 1. In a group: Brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.
> ### Exercise: 2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
- E.g. if you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?
- E.g. if you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?
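
The two ideas above could be turned into columns roughly as follows. This is only a sketch on a made-up `toy` frame, not the notebook's `data`; the column names `is_weather` and `image_heavy` and the 0.3 cut-off are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the StumbleUpon data (hypothetical rows)
toy = pd.DataFrame({
    "title": ["Weather forecast for Tuesday", "Easy banana bread",
              None, "Storm WEATHER alert"],
    "image_ratio": [0.02, 0.45, 0.10, 0.01],
})

# Text idea: flag titles that mention weather.
# case=False ignores capitalisation; na=False treats missing titles as False.
toy["is_weather"] = toy["title"].str.contains("weather", case=False, na=False)

# Numeric idea: flag image-heavy pages using an assumed cut-off of 0.3
toy["image_heavy"] = toy["image_ratio"] > 0.3

print(toy[["title", "is_weather", "image_heavy"]])
```

Either column can then be added to the feature list passed to a model, in the same way as any of the numeric columns in the data dictionary.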

### Split up and develop 1-3 of those features independently.

> ### Exercise: 3. Does being a news site affect evergreeness? 
Compute or plot the percentage of news-related sites that are evergreen.

In [6]:
# think about pandas groupby function. Can you group by is_news? #
data.groupby(['is_news'])[['label']].mean()

Unnamed: 0_level_0,label
is_news,Unnamed: 1_level_1
1,0.516916
?,0.507562


> ### Exercise: 4. Does category in general affect evergreeness? 
Plot the rate of evergreen sites for all Alchemy categories.

In [7]:
# think about pandas groupby function #
data.groupby(['alchemy_category'])[['label']].mean()

Unnamed: 0_level_0,label
alchemy_category,Unnamed: 1_level_1
?,0.502135
arts_entertainment,0.371945
business,0.711364
computer_internet,0.246622
culture_politics,0.457726
gaming,0.368421
health,0.573123
law_crime,0.419355
recreation,0.684296
religion,0.416667


> ### Exercise: 5. How many articles are there per category?

In [9]:
# think about pandas groupby function #
data.groupby(['alchemy_category'])[['label']].count()

Unnamed: 0_level_0,label
alchemy_category,Unnamed: 1_level_1
?,2342
arts_entertainment,941
business,880
computer_internet,296
culture_politics,343
gaming,76
health,506
law_crime,31
recreation,1229
religion,72
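
Since some categories have very few articles, it can help to read the evergreen rate next to the article count. A minimal sketch on made-up rows (the `toy` frame and its values are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "alchemy_category": ["business", "business", "sports", "sports",
                         "sports", "health", "health"],
    "label": [1, 1, 0, 1, 0, 1, 0],
})

# One pass over the groups: evergreen rate (mean) and number of articles (count)
summary = (toy.groupby("alchemy_category")["label"]
              .agg(["mean", "count"])
              .sort_values("mean"))
print(summary)
```

With matplotlib installed, `summary["mean"].plot(kind="barh")` should draw the rates as a bar chart, which is one way to approach the plotting part of Exercise 4.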


> #### Let's try extracting some of the text content.
> ### Exercise: 6. Create a feature for the title containing 'recipe'. 
Is the percentage of evergreen websites higher or lower on pages that have 'recipe' in the title?

In [None]:
# try data['title'].str.contains('recipe') #

In [17]:
data['recipe'] = data['title'].str.contains('recipe')
data['recipe'].value_counts()

False    7030
True      353
Name: recipe, dtype: int64
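
The counts alone do not answer the exercise question; grouping by the new flag and averaging `label` does. A sketch on made-up rows follows. Note that, unlike the cell above, it assumes a case-insensitive match and treats missing titles as `False`, so its counts would differ from the output shown.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "title": ["Best Recipe for pancakes", "Election results",
              "chicken recipe", None, "Stock news"],
    "label": [1, 0, 1, 0, 0],
})

# Assumption: ignore case and count missing titles as "no recipe"
toy["recipe"] = toy["title"].str.contains("recipe", case=False, na=False)

# Evergreen rate for recipe vs. non-recipe titles
recipe_rates = toy.groupby("recipe")["label"].mean()
print(recipe_rates)
```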

###  Let's Explore Some Decision Trees

 ### Demo: Build a decision tree model to predict the "evergreeness" of a given website. 

In [18]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

X = data[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)
    
    
# Fits the model
model.fit(X, y)

# Helper function to visualize Decision Trees (creates a file tree.png)

from sklearn.tree import export_graphviz
from os import system

def build_tree_image(model):
    # Write the fitted tree in Graphviz DOT format
    with open("tree.dot", 'w') as dotfile:
        export_graphviz(model,
                        out_file=dotfile,
                        feature_names=list(X.columns))
    # Render the DOT file to PNG (requires the Graphviz `dot` executable)
    system("dot -Tpng tree.dot -o tree.png")
    
build_tree_image(model)

## Decision Trees in scikit-learn

 ### Exercise: Evaluate the decision tree using cross-validation; use AUC as the evaluation metric.

In [20]:
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation is deprecated)
from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y, scoring='roc_auc', cv=5)

array([ 0.54170446,  0.52060176,  0.54808138,  0.53045509,  0.56025055])

###  Adjusting Decision Trees to Avoid Overfitting

 ### Demo: Control for overfitting in the decision model by adjusting the maximum number of questions (max_depth) or the minimum number of records in each final node (min_samples_leaf)

In [21]:
model = DecisionTreeClassifier(
                max_depth = 2,
                min_samples_leaf = 5)

model.fit(X, y)
build_tree_image(model)

In [22]:
cross_val_score(model, X, y, scoring='roc_auc', cv=5)

array([ 0.57301991,  0.57120891,  0.58914935,  0.5781598 ,  0.56592356])
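
One way to see the overfitting directly is to compare the training score with the cross-validated score for an unconstrained tree and a constrained one. The sketch below uses synthetic data from `make_classification` rather than the StumbleUpon features, so the numbers it prints are illustrative only; the sample size and noise level are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (assumed settings)
X_toy, y_toy = make_classification(n_samples=500, n_features=5,
                                   n_informative=2, flip_y=0.2,
                                   random_state=0)

deep = DecisionTreeClassifier(random_state=0)
shallow = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5,
                                 random_state=0)

# Accuracy on the data the tree was fitted on
deep_train = deep.fit(X_toy, y_toy).score(X_toy, y_toy)
shallow_train = shallow.fit(X_toy, y_toy).score(X_toy, y_toy)

# Mean cross-validated AUC on held-out folds
deep_cv = cross_val_score(deep, X_toy, y_toy, scoring="roc_auc", cv=5).mean()
shallow_cv = cross_val_score(shallow, X_toy, y_toy, scoring="roc_auc", cv=5).mean()

print("deep:    train accuracy %.3f, CV AUC %.3f" % (deep_train, deep_cv))
print("shallow: train accuracy %.3f, CV AUC %.3f" % (shallow_train, shallow_cv))
```

An unconstrained tree typically memorises its training data, so a large gap between its training score and its cross-validated score is the usual sign of overfitting.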

 ### Demo: Build a random forest model to predict the evergreeness of a website. 

In [23]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

### Demo: Extracting importance of features

In [26]:
features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head()

Unnamed: 0,Features,Importance Score
1,html_ratio,0.522636
0,image_ratio,0.447262
2,recipe,0.030103


 ### Exercise: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.

In [30]:
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation is deprecated)
from sklearn.model_selection import cross_val_score

# create a for loop where you iterate n_trees in range(1, 100, 10)
# for each model, print the SCORES

for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators = n_trees)
    score = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
    print ('n_trees: ', n_trees, 'score: ', score)
    

('n_trees: ', 1, 'score: ', array([ 0.55414855,  0.523399  ,  0.55409522,  0.54112126,  0.55196566]))
('n_trees: ', 11, 'score: ', array([ 0.54375038,  0.56902177,  0.598454  ,  0.57604494,  0.57051249]))
('n_trees: ', 21, 'score: ', array([ 0.56166704,  0.56083189,  0.60437965,  0.5721827 ,  0.57013307]))
('n_trees: ', 31, 'score: ', array([ 0.56225251,  0.57718045,  0.60049261,  0.57535774,  0.59210207]))
('n_trees: ', 41, 'score: ', array([ 0.56945582,  0.57373826,  0.60312195,  0.57233796,  0.58176663]))
('n_trees: ', 51, 'score: ', array([ 0.5650598 ,  0.56614379,  0.59659822,  0.57918323,  0.58625267]))
('n_trees: ', 61, 'score: ', array([ 0.56453664,  0.56733645,  0.60510543,  0.57895815,  0.5840046 ]))
('n_trees: ', 71, 'score: ', array([ 0.56970503,  0.57644284,  0.60346922,  0.58059712,  0.57845562]))
('n_trees: ', 81, 'score: ', array([ 0.57117923,  0.56978048,  0.60006174,  0.57755988,  0.58586865]))
('n_trees: ', 91, 'score: ', array([ 0.57323889,  0.57245845,  0.60367409,
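
Five fold scores per row are hard to compare by eye, so a per-row mean and standard deviation may be easier to read. The sketch below simply re-uses two complete rows of the printed output above (`n_trees` of 1 and 81); inside the loop the same summary could be obtained with `score.mean()` and `score.std()`.

```python
import numpy as np

# Fold scores copied from the printed output above
scores_1 = np.array([0.55414855, 0.523399, 0.55409522, 0.54112126, 0.55196566])
scores_81 = np.array([0.57117923, 0.56978048, 0.60006174, 0.57755988, 0.58586865])

for n_trees, scores in [(1, scores_1), (81, scores_81)]:
    print("n_trees: %d  mean AUC: %.3f  std: %.3f"
          % (n_trees, scores.mean(), scores.std()))

mean_1, mean_81 = scores_1.mean(), scores_81.mean()
```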

##  Independent Practice: Evaluate Random Forest Using Cross-Validation

1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation
  - Evaluate the _importance_ of the feature
 
3. **Bonus**: Just like the 'recipe' feature, add in similar text features and evaluate their performance.
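
The steps above could be organised as a loop over candidate feature sets. The following is a sketch under stated assumptions: it runs on synthetic data with placeholder column names `f0` to `f3`, and the two feature sets are arbitrary; with the real data, `X_toy` and `y_toy` would be replaced by columns of `data` and the `label` column.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features (placeholder column names)
X_arr, y_toy = make_classification(n_samples=300, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

feature_sets = {
    "two features": ["f0", "f1"],
    "all features": ["f0", "f1", "f2", "f3"],
}

results = {}
for name, cols in feature_sets.items():
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    # Step 2a: cross-validated AUC for this feature set
    scores = cross_val_score(model, X_toy[cols], y_toy,
                             scoring="roc_auc", cv=5)
    # Step 2b: importances from a model fitted on all rows
    model.fit(X_toy[cols], y_toy)
    importances = pd.Series(model.feature_importances_,
                            index=cols).sort_values(ascending=False)
    results[name] = (scores.mean(), importances)
    print(name, "mean AUC: %.3f" % scores.mean())
    print(importances)
```

Fixing `random_state` keeps repeated runs comparable, which matters when the difference between feature sets is small.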


In [None]:
## 1. Building a model with more relevant features
## 2a. Evaluate predictive performance for the given feature set
## (BONUS): Adding in text features
## 2b. Evaluating feature importances

In [35]:
for n_trees in range(1, 100, 5):
    model = RandomForestClassifier(n_estimators = n_trees)
    score = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
    print ('n_trees: ', n_trees, 'score: ', score)
    

('n_trees: ', 1, 'score: ', array([ 0.54353507,  0.51247151,  0.53892004,  0.53884838,  0.53564394]))
('n_trees: ', 6, 'score: ', array([ 0.56356728,  0.56383371,  0.58655585,  0.56436911,  0.56722536]))
('n_trees: ', 11, 'score: ', array([ 0.5587104 ,  0.57087313,  0.59663772,  0.578779  ,  0.57003752]))
('n_trees: ', 16, 'score: ', array([ 0.56752351,  0.56919791,  0.58869919,  0.57912903,  0.57663842]))
('n_trees: ', 21, 'score: ', array([ 0.57832207,  0.56541719,  0.59810306,  0.57416159,  0.58847594]))
('n_trees: ', 26, 'score: ', array([ 0.56216547,  0.56528692,  0.59693538,  0.57064754,  0.58172437]))
('n_trees: ', 31, 'score: ', array([ 0.56448716,  0.57255203,  0.59618204,  0.56932185,  0.57837018]))
('n_trees: ', 36, 'score: ', array([ 0.56968762,  0.56787315,  0.6020737 ,  0.58144233,  0.5761561 ]))
('n_trees: ', 41, 'score: ', array([ 0.56505339,  0.56574104,  0.60119266,  0.58076249,  0.58269269]))
('n_trees: ', 46, 'score: ', array([ 0.56730729,  0.56881901,  0.5972753 , 

In [39]:
data.corr()

Unnamed: 0,urlid,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,framebased,frameTagRatio,...,html_ratio,image_ratio,lengthyLinkDomain,linkwordscore,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
urlid,1.0,-0.011162,0.002856,0.008407,0.005285,0.009573,-0.007343,0.01334,,0.010065,...,0.016989,-0.00059,-0.00778,-0.013668,0.016732,-0.002019,-0.017342,-0.005868,0.002292,0.01345
avglinksize,-0.011162,1.0,0.120467,0.161769,0.174554,0.134527,-0.003578,0.005254,,-0.04927,...,0.018974,-0.003002,0.020852,0.12255,-0.010982,0.00036,-0.03389,0.006089,0.035393,0.006172
commonlinkratio_1,0.002856,0.120467,1.0,0.808047,0.560584,0.388801,-0.017878,0.00528,,-0.29486,...,-0.201501,-0.064435,0.421284,0.2572,0.193914,0.317293,0.144354,-0.078026,-0.035019,0.083364
commonlinkratio_2,0.008407,0.161769,0.808047,1.0,0.75833,0.555148,-0.03246,0.019387,,-0.259222,...,-0.159702,-0.044663,0.398817,0.257594,0.177785,0.311492,0.09694,-0.079485,-0.027888,0.083488
commonlinkratio_3,0.005285,0.174554,0.560584,0.75833,1.0,0.850604,-0.016188,0.007578,,-0.218559,...,-0.13337,-0.050357,0.363159,0.109654,0.264022,0.283924,0.049203,-0.008652,-0.008599,0.105964
commonlinkratio_4,0.009573,0.134527,0.388801,0.555148,0.850604,1.0,-0.020415,0.005473,,-0.178064,...,-0.136561,-0.038071,0.287159,0.059223,0.162883,0.233898,0.026384,0.036387,-0.013507,0.080464
compression_ratio,-0.007343,-0.003578,-0.017878,-0.03246,-0.016188,-0.020415,1.0,-0.889345,,0.159335,...,0.106335,-0.188976,-0.090325,0.14647,-0.064163,-0.055388,-0.042614,-0.033772,0.364122,-0.059737
embed_ratio,0.01334,0.005254,0.00528,0.019387,0.007578,0.005473,-0.889345,1.0,,-0.130753,...,-0.090938,0.183808,0.075322,-0.108476,0.046484,0.042942,0.043343,0.037361,-0.342206,0.039536
framebased,,,,,,,,,,,...,,,,,,,,,,
frameTagRatio,0.010065,-0.04927,-0.29486,-0.259222,-0.218559,-0.178064,0.159335,-0.130753,,1.0,...,0.384937,-0.088847,-0.196673,0.158874,-0.303682,-0.362491,0.04933,-0.094557,0.033663,-0.187762


In [49]:
# data.info()

In [50]:
model = RandomForestClassifier(n_estimators = 20)
    
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [51]:
data.corr()['label']

urlid                             0.013450
avglinksize                       0.006172
commonlinkratio_1                 0.083364
commonlinkratio_2                 0.083488
commonlinkratio_3                 0.105964
commonlinkratio_4                 0.080464
compression_ratio                -0.059737
embed_ratio                       0.039536
framebased                             NaN
frameTagRatio                    -0.187762
hasDomainLink                    -0.004863
html_ratio                       -0.051149
image_ratio                      -0.017266
lengthyLinkDomain                 0.032824
linkwordscore                    -0.173800
non_markup_alphanum_characters    0.097580
numberOfLinks                     0.080187
numwords_in_url                  -0.024823
parametrizedLinkRatio             0.010668
spelling_errors_ratio            -0.058578
label                             1.000000
Name: label, dtype: float64

In [77]:
model = DecisionTreeClassifier()

X = data[['commonlinkratio_3', 'commonlinkratio_2', 'numberOfLinks', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)

In [78]:
model.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [79]:
score = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
score

array([ 0.51638341,  0.52683904,  0.52722058,  0.53146867,  0.55097018])

In [81]:
data['always'] = data['title'].str.contains('always')
data['always'].value_counts()

False    7379
True        4
Name: always, dtype: int64