<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Decision Trees - Lab

_Author: Adam Jones, PhD (San Francisco)_

---


In [1]:
# Load packages
import pandas as pd
import numpy as np
import json
from os import system 
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

## Predicting Evergreen-ness of Content with Decision Trees and Random Forests

Read in the .tsv (tab separated value) file
> See [here](https://www.kaggle.com/c/stumbleupon) for more info on the dataset.

In [2]:
data = pd.read_csv("../../datasets/stumbleupon.tsv", sep='\t')

# Look at first 1000 characters of first row in data['boilerplate'] column
print(data['boilerplate'][0][:1000])

{"title":"IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries","body":"A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest provider of computer services looks to Silicon Valley for input gleaning many ideas from its Almaden research center in San Jose 

In [3]:
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
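
The cell above decodes each `boilerplate` string twice, once per field. A minimal sketch of decoding it once and returning both fields follows. `parse_boilerplate` is a hypothetical helper, not part of the lab, and the sample string is made up to show how a key that is missing or `null` falls back to an empty string.

```python
import json

def parse_boilerplate(raw):
    """Decode one boilerplate JSON string and return (title, body)."""
    record = json.loads(raw)
    # `or ''` covers both a missing key and a key whose value is null
    title = record.get('title') or ''
    body = record.get('body') or ''
    return title, body

# Made-up record: the body is null and would otherwise come back as None
sample = '{"title": "10 Foolproof Tips for Better Sleep ", "body": null}'
title, body = parse_boilerplate(sample)
print(repr(title), repr(body))
```

On the real frame this could be applied with `data.boilerplate.map(parse_boilerplate)` and the resulting tuples unpacked into the `title` and `body` columns.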


### Predicting 'Evergreen-ness' Of Content
This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. The columns are described below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

#### What are 'evergreen' sites?

> Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

A sample of URLs is below, where label = 1 are 'evergreen' websites.

In [4]:
for i in range(5):
    print(data['url'].loc[i])
    print(data['title'].loc[i])
    print(data['label'].loc[i])

http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html
IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries
0
http://www.popsci.com/technology/article/2012-07/electronic-futuristic-starting-gun-eliminates-advantages-races
The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races
1
http://www.menshealth.com/health/flu-fighting-fruits?cm_mmc=Facebook-_-MensHealth-_-Content-Health-_-FightFluWithFruit
Fruits that Fight the Flu fruits that fight the flu | cold & flu | men's health
1
http://www.dumblittleman.com/2007/12/10-foolproof-tips-for-better-sleep.html
10 Foolproof Tips for Better Sleep 
1
http://bleacherreport.com/articles/1205138-the-50-coolest-jerseys-you-didnt-know-existed?show_full=
The 50 Cool

### Decision Trees in scikit-learn
**Objective:** Build a decision tree model to predict the "evergreen-ness" of a given website

In [5]:
model = DecisionTreeClassifier()

X = data[['image_ratio', 'html_ratio', 'label']].dropna()
#X = data[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)
    
# Fit the model
model.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
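
The fitted model reports `criterion='gini'`, meaning each split is chosen to reduce Gini impurity. As a rough illustration of that quantity, here is a pure-Python sketch. The helper names `gini_impurity` and `split_impurity` are illustrative and are not scikit-learn functions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(gini_impurity([0, 0, 1, 1]))                  # evenly mixed node: 0.5
print(gini_impurity([1, 1, 1, 1]))                  # pure node: 0.0
print(split_impurity([0, 0, 0, 1], [1, 1, 1, 0]))   # partially separating split: 0.375
```

A candidate split is attractive when the weighted impurity of its children is well below the impurity of the parent node.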

In [6]:
# Define custom function called 'print_cv_scores' that accepts a model and evaluates it using cross-validation
def print_cv_scores(model):
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
    print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
    
print_cv_scores(model)

CV AUC [0.49802632 0.50830588 0.5340168  0.5016469  0.55158588], Average AUC 0.5187163573189526
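
The `roc_auc` score can be read as the probability that a randomly chosen evergreen page receives a higher score than a randomly chosen non-evergreen page. Under that reading 0.5 is chance level, so an average of about 0.52 is only marginally better than guessing. A minimal sketch of the pairwise definition follows, using made-up labels and scores. `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
print(pairwise_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly: 0.75
print(pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5]))   # uninformative scores: 0.5
```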


In [8]:
## Visualize tree in separate file (creates a file tree.png)
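def build_tree_image(model):
    # Write the tree in Graphviz DOT format; the `with` block closes the file even on error
    with open("tree.dot", 'w') as dotfile:
        export_graphviz(model, out_file=dotfile, feature_names=X.columns,
                        filled=True, rounded=True)

    # Convert to PNG (requires the Graphviz `dot` executable on the PATH)
    system("dot -Tpng tree.dot -o tree.png")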
    
build_tree_image(model)

# ## (OPTIONAL ALTERNATIVE) Visualize tree inside notebook 
# dot_data = export_graphviz(model, out_file=None, 
#                          feature_names=X.columns.tolist(),  
#                          filled=True, rounded=True,  
#                          special_characters=True)
# 
# import graphviz  # needs the `graphviz` Python package, which is not imported above
# graphviz.Source(dot_data)

Take a look at the [image file](./tree.png) that you just created. Obviously this tree has grown too big! It's difficult to understand which features are most useful, since there are so many questions being asked in this tree. Let's try to improve on this.
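
Besides inspecting the image, the size of a fitted tree can be summarised numerically. The sketch below assumes scikit-learn 0.21 or later, where `get_depth()` and `get_n_leaves()` are available. It uses synthetic data, not the StumbleUpon features, so the printed numbers only illustrate the contrast between an unrestricted tree and a shallow one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two numeric features and labels unrelated to them
rng = np.random.RandomState(0)
X_demo = rng.rand(500, 2)
y_demo = rng.randint(0, 2, size=500)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

for name, tree in [('unrestricted', full_tree), ('max_depth=2', small_tree)]:
    print(name, 'depth:', tree.get_depth(), 'leaves:', tree.get_n_leaves())
```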

### Feature Extraction
Let's try extracting some of the text content.

**Exercise:** We might expect pages that have "recipe" in the title to be considered more evergreen. So, create a feature for the title containing '**recipe**'.

In [9]:
# Option 1: Create a function to check for this
def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except: 
        return 0
        
data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions
#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)

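# Option 3: string functions (case=False matches 'Recipe' too; na=False treats missing titles as no match)
#data['recipe'] = data['title'].str.contains('recipe', case=False, na=False).astype(int)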
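
pandas' `str.contains` is case-sensitive by default, and the `na` argument controls what a missing value becomes. The sketch below uses a few made-up titles to show how `case=False` and `na=False` give a clean 0/1 column comparable to the function-based options.

```python
import numpy as np
import pandas as pd

# Made-up titles, including mixed case and a missing value
titles = pd.Series(['Easy Pasta Recipe', 'best recipes ever',
                    'IBM Sees Holographic Calls', np.nan])

# Default call: case-sensitive, so 'Easy Pasta Recipe' is not matched
default_match = titles.str.contains('recipe')

# Case-insensitive, missing titles counted as "no match", cast to 0/1
recipe_flag = titles.str.contains('recipe', case=False, na=False).astype(int)

print(default_match.tolist())
print(recipe_flag.tolist())  # [1, 1, 0, 0]
```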

**Exercise:** Re-evaluate the decision tree using cross-validation, using 'AUC' as the evaluation metric. Has it improved?

In [10]:
# use cross_val_score()

# ... #

print_cv_scores(model)

CV AUC [0.50201023 0.50607982 0.53040184 0.49828173 0.54382734], Average AUC 0.5161201938160227
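
These scores are close to the earlier ones, which is what one would expect if `X` still holds only `image_ratio` and `html_ratio`. `print_cv_scores` reads the global `X`, and the cell above does not appear to rebuild it after `recipe` is created. The sketch below uses synthetic data (not the real dataset) in which the label deliberately follows `recipe` most of the time. It shows the pattern of rebuilding the feature matrix before re-scoring, and `cv_auc` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data; column names mirror the lab, values are random
rng = np.random.RandomState(0)
n = 400
demo = pd.DataFrame({
    'image_ratio': rng.rand(n),
    'html_ratio': rng.rand(n),
    'recipe': rng.randint(0, 2, size=n),
})
# Assumption for illustration only: the label agrees with 'recipe' about 80% of the time
flip = rng.rand(n) < 0.2
demo['label'] = np.where(flip, 1 - demo['recipe'], demo['recipe'])

def cv_auc(columns):
    """Rebuild the feature matrix from the named columns, then cross-validate."""
    X_demo = demo[columns]
    y_demo = demo['label']
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(tree, X_demo, y_demo, scoring='roc_auc', cv=5).mean()

auc_without = cv_auc(['image_ratio', 'html_ratio'])
auc_with = cv_auc(['image_ratio', 'html_ratio', 'recipe'])
print('without recipe: {:.3f}, with recipe: {:.3f}'.format(auc_without, auc_with))
```

In the lab itself the equivalent step would be to recreate `X` with `'recipe'` among the selected columns before calling `print_cv_scores(model)`.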


### Avoiding Overfitting
**Objective:** Adjust Decision Trees to Avoid Overfitting

You can control for overfitting in decision trees by adjusting one of the following parameters:
- `max_depth`:  Control the maximum number of questions asked along any path from the root to a leaf
- `min_samples_leaf`:  Control the minimum number of records in each leaf node
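
As a rough illustration of why these parameters matter, the sketch below fits trees on synthetic noisy data (not the lab dataset) and compares training accuracy with cross-validated AUC. The data-generating rule is an assumption made up for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends weakly on the first feature, plus noise
rng = np.random.RandomState(0)
X_demo = rng.rand(600, 2)
y_demo = ((X_demo[:, 0] + rng.normal(scale=0.5, size=600)) > 0.5).astype(int)

settings = [('unrestricted', {}),
            ('max_depth=2', {'max_depth': 2}),
            ('min_samples_leaf=25', {'min_samples_leaf': 25})]

results = {}
for name, params in settings:
    tree = DecisionTreeClassifier(random_state=0, **params)
    # Accuracy on the data the tree was trained on
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # AUC on held-out folds (cross_val_score refits clones of the tree)
    cv_auc = cross_val_score(tree, X_demo, y_demo, scoring='roc_auc', cv=5).mean()
    results[name] = (train_acc, cv_auc)
    print('{:<20} train accuracy: {:.2f}  CV AUC: {:.3f}'.format(name, train_acc, cv_auc))
```

An unrestricted tree can memorise the training rows, so a perfect training accuracy paired with a much lower cross-validated score is a typical sign of overfitting.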

**Exercise:** Try applying each of these parameters. How did it affect your score?

In [11]:
model = DecisionTreeClassifier(
                max_depth = 2,
                min_samples_leaf = 5)

model.fit(X, y)

build_tree_image(model)

print_cv_scores(model)

CV AUC [0.54153783 0.54151113 0.55758857 0.54473814 0.5484744 ], Average AUC 0.5467700134809986


Notice the higher average AUC for the adjusted model (roughly 0.55 versus 0.52): _very exciting!_

### Random Forest Models
**Objective:** Build a random forest model to predict the evergreen-ness of a website.

In [12]:
model = RandomForestClassifier(n_estimators = 20)

model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

**Demo:** Extracting importance of features.

In [13]:
features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head()

Unnamed: 0,Features,Importance Score
1,html_ratio,0.546708
0,image_ratio,0.453292
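
The two scores above add up to 1, which reflects that scikit-learn normalises `feature_importances_` across features. A small sketch on synthetic data follows, where one column drives the label and the other is pure noise. The column names `signal` and `noise` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 'signal' determines the label, 'noise' is unrelated
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'signal': rng.rand(300), 'noise': rng.rand(300)})
y_demo = (X_demo['signal'] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# A Series keyed by column name is a compact alternative to building a DataFrame
importances = pd.Series(forest.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)
print('sum:', importances.sum())
```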


**Exercise:** Tune the Random Forest model '*hyper-parameters*' using cross-validation.
- Try tweaking the number of estimators used by your `RandomForestClassifier` model and observe how that affects predictive performance.

In [12]:
print_cv_scores(model)

for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators = n_trees)
    print(n_trees)
    print_cv_scores(model)

CV AUC [0.54058114 0.54731921 0.57212432 0.556562   0.57035372], Average AUC 0.5573880764342866
1
CV AUC [0.53241959 0.50830588 0.54469971 0.51921937 0.52867491], Average AUC 0.5266638933357664
11
CV AUC [0.52824744 0.5362081  0.5738856  0.55202752 0.55715833], Average AUC 0.5495053963596751
21
CV AUC [0.5394216  0.54462835 0.56811777 0.56158414 0.56798895], Average AUC 0.5563481628105336
31
CV AUC [0.54308205 0.54440053 0.57270806 0.54121285 0.56773058], Average AUC 0.5538268146799458
41
CV AUC [0.54077485 0.5460886  0.5738984  0.55324623 0.56012871], Average AUC 0.5548273604876668
51
CV AUC [0.53758772 0.54304732 0.57778601 0.54425597 0.56454489], Average AUC 0.5534443809145626
61
CV AUC [0.53916667 0.54415349 0.57631661 0.55257008 0.56402722], Average AUC 0.555246814764321
71
CV AUC [0.54406798 0.54903016 0.5715415  0.55294796 0.57116457], Average AUC 0.5577504336703402
81
CV AUC [0.54449379 0.55400838 0.57347478 0.55282261 0.56692431], Average AUC 0.5583447738336474
91
CV AUC [0.54

In [14]:
# The same thing could also be accomplished with GridSearchCV
n_trees = np.arange(1, 100, 10)
gs = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'n_estimators': n_trees},
    scoring='roc_auc')

gs.fit(X, y)

print(gs.param_grid) # Parameter space explored
print(gs.best_score_) # Best mean cross-validated score (here 'roc_auc')
print(gs.best_estimator_) # Estimator with the best combination of parameters



{'n_estimators': array([ 1, 11, 21, 31, 41, 51, 61, 71, 81, 91])}
0.5577440862295518
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=71, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
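
Beyond the single best score, a fitted `GridSearchCV` keeps per-candidate results in `cv_results_`. The sketch below uses synthetic data and a smaller grid than the lab, purely for illustration, and tabulates those results with pandas.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data with a noisy dependence on the first feature
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 2)
y_demo = ((X_demo[:, 0] + rng.normal(scale=0.3, size=200)) > 0.5).astype(int)

gs_demo = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [1, 11, 21]},
    scoring='roc_auc',
    cv=5)
gs_demo.fit(X_demo, y_demo)

# One row per candidate, with the mean and spread of the fold scores
results = pd.DataFrame(gs_demo.cv_results_)[
    ['param_n_estimators', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results)
print(gs_demo.best_params_)
```

Looking at `std_test_score` alongside `mean_test_score` helps judge whether differences between candidates are larger than the fold-to-fold variation.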


### Independent Practice: Evaluate Random Forest Using Cross-Validation
1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
  - Evaluate the model for improved predictive performance using cross-validation
  - Evaluate the _importance_ of the feature (using `model.feature_importances_`)
3. **Bonus**: Just like the 'recipe' feature, add in similar text features and evaluate their performance.

In [15]:
data.describe()

Unnamed: 0,urlid,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,framebased,frameTagRatio,...,image_ratio,lengthyLinkDomain,linkwordscore,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,recipe
count,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,...,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0,7395.0
mean,5305.704665,2.761823,0.46823,0.21408,0.092062,0.049262,2.255103,-0.10375,0.0,0.056423,...,0.275709,0.660311,30.077079,5716.598242,178.754564,4.960649,0.172864,0.101221,0.51332,0.126302
std,3048.384114,8.619793,0.203133,0.146743,0.095978,0.072629,5.704313,0.306545,0.0,0.041446,...,1.91932,0.473636,20.393101,8875.43243,179.466198,3.233111,0.183286,0.079231,0.499856,0.332211
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,...,-1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,2688.5,1.602062,0.34037,0.105263,0.022222,0.0,0.442616,0.0,0.0,0.028502,...,0.0259,0.0,14.0,1579.0,82.0,3.0,0.040984,0.068739,0.0,0.0
50%,5304.0,2.088235,0.481481,0.202454,0.068627,0.022222,0.48368,0.0,0.0,0.045775,...,0.083051,1.0,25.0,3500.0,139.0,5.0,0.113402,0.089312,1.0,0.0
75%,7946.5,2.627451,0.616604,0.3,0.133333,0.065065,0.578227,0.0,0.0,0.073459,...,0.2367,1.0,43.0,6377.0,222.0,7.0,0.241299,0.112376,1.0,0.0
max,10566.0,363.0,1.0,1.0,0.980392,0.980392,21.0,0.25,0.0,0.444444,...,113.333333,1.0,100.0,207952.0,4997.0,22.0,1.0,1.0,1.0,1.0


In [16]:
## 1. Building a model with more relevant features
model = RandomForestClassifier(n_estimators=50)

# Continue to add features to X
#     Build dummy features, include quantitative features, or add text features
X = data[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)


## 2a. Evaluate predictive performance for the given feature set
scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
# -or- 'print_cv_scores(model)'

# 3 (BONUS): Adding in text features

# Check for keywords in the title
data['PhotoInTitle'] = data['title'].fillna('').str.lower().str.contains('photo').astype(int)
X = data[['image_ratio', 'html_ratio', 'recipe', 'PhotoInTitle', 'label']].dropna()
X.drop('label', axis=1, inplace=True)

print_cv_scores(model)

# Check for keywords in the body
data['WeatherInBody'] = data['body'].fillna('').str.lower().str.contains('weather').astype(int)
X = data[['image_ratio', 'html_ratio', 'recipe', 'PhotoInTitle', 'WeatherInBody', 'label']].dropna()
X.drop('label', axis=1, inplace=True)

print_cv_scores(model)

# Note: this is a substring match, so 'cat' also matches words such as 'category' or 'location'
data['CatInBody'] = data['body'].fillna('').str.lower().str.contains('cat').astype(int)
X = data[['image_ratio', 'html_ratio', 'recipe', 'PhotoInTitle', 'WeatherInBody', 'CatInBody', 'label']].dropna()
X.drop('label', axis=1, inplace=True)

print_cv_scores(model)

## 2b. Evaluating feature importances

# Fit a model on the whole dataset
model.fit(X, y)

# Get columns and their scores
features = X.columns
feature_importances = model.feature_importances_
features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df



CV AUC [0.61937665 0.63877372 0.63310499], Average AUC 0.6304184509969527
CV AUC [0.61508132 0.61764017 0.67016817 0.63215763 0.63435895], Average AUC 0.6338812472147743
CV AUC [0.61142727 0.62050487 0.66384406 0.63475425 0.6364122 ], Average AUC 0.6333885262864758
CV AUC [0.63089273 0.631164   0.66883875 0.64112502 0.63956857], Average AUC 0.6423178121000014


Unnamed: 0,Features,Importance Score
1,html_ratio,0.490769
0,image_ratio,0.404895
2,recipe,0.089449
5,CatInBody,0.006696
3,PhotoInTitle,0.005369
4,WeatherInBody,0.002823
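
The practice cell mentions building dummy features. One possible way to do that for a categorical column such as `alchemy_category` is `pd.get_dummies`, sketched below. The category values are the ones visible in the first five rows shown earlier, and the `html_ratio` numbers are made up for the example.

```python
import pandas as pd

# Small stand-in frame; html_ratio values are invented for illustration
demo = pd.DataFrame({
    'alchemy_category': ['business', 'recreation', 'health', 'health', 'sports'],
    'html_ratio': [0.20, 0.30, 0.25, 0.21, 0.40],
})

# One indicator column per category value
dummies = pd.get_dummies(demo['alchemy_category'], prefix='category')

# Combine the quantitative feature with the indicator columns
X_demo = pd.concat([demo[['html_ratio']], dummies], axis=1)
print(X_demo.columns.tolist())
```

The resulting frame can be passed to `RandomForestClassifier` like any other feature matrix, and each indicator column then receives its own importance score.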
