<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Decision Trees - Lab

_Author: Adam Jones, PhD (San Francisco)_

---


In [None]:
# Load packages
import pandas as pd
import numpy as np
import json
from os import system 
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

## Predicting Evergreen-ness of Content with Decision Trees and Random Forests

Read in the `.tsv` (tab-separated values) file.

In [None]:
data = pd.read_csv("../../datasets/stumbleupon.tsv", sep='\t')

# Look at first 1000 characters of first row in data['boilerplate'] column
print(data['boilerplate'][0][:1000]) 

In [None]:
# Parse each JSON string once, then pull out the fields of interest
boilerplate = data['boilerplate'].map(json.loads)
data['title'] = boilerplate.map(lambda d: d.get('title', ''))
data['body'] = boilerplate.map(lambda d: d.get('body', ''))
data.head()

### Predicting 'Evergreen-ness' Of Content
This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender. The columns are described below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> tag usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain|integer (0 or 1)|True (1) if the text of at least 3 <a> tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer|Number of alphanumeric characters in the page's text
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

#### What are 'evergreen' sites?

> Evergreen sites are those that are always relevant.  As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season. 

A sample of URLs is below; rows with `label = 1` are 'evergreen' websites.

In [None]:
for i in range(5):
    print(data['url'].loc[i])
    print(data['title'].loc[i])
    print(data['label'].loc[i])

### Decision Trees in scikit-learn
**Objective:** Build a decision tree model to predict the "evergreen-ness" of a given website

In [None]:
model = DecisionTreeClassifier()

X = data[['image_ratio', 'html_ratio', 'label']].dropna()
#X = data[['image_ratio', 'html_ratio', 'recipe', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)
    
# Fit the model
model.fit(X, y)

In [None]:
# Define custom function called 'print_cv_scores' that accepts a model and evaluates it using cross-validation

# Just wrap your function around the code on the lines below:
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
       
# After you're done, run it with: 'print_cv_scores(model)'
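
One possible shape for the helper described above is sketched here. It is an assumption rather than the lab's reference solution: unlike the `print_cv_scores(model)` call suggested in the cell, this version takes `X` and `y` as explicit arguments so it does not depend on global variables, and it runs on synthetic stand-in data (`X_demo`, `y_demo`) instead of the StumbleUpon set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def print_cv_scores(model, X, y, cv=5):
    """Print per-fold and mean ROC AUC for `model`, and return the fold scores."""
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
    print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
    return scores


# Synthetic stand-in data (hypothetical, not the lab dataset): 200 rows, 2 features
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 2)
y_demo = (X_demo[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

demo_scores = print_cv_scores(DecisionTreeClassifier(random_state=0), X_demo, y_demo)
```

Returning the scores as well as printing them makes it easier to compare models later in the lab.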

In [None]:
## Visualize tree in separate file (creates a file tree.png)
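def build_tree_image(model):
    # Write the tree in Graphviz DOT format; the context manager closes the file
    with open("tree.dot", 'w') as dotfile:
        # A plain list of feature names works across scikit-learn versions
        export_graphviz(model, out_file=dotfile, feature_names=X.columns.tolist(),
                        filled=True, rounded=True)
    # Render the DOT file to PNG (requires the Graphviz `dot` executable on the PATH)
    system("dot -Tpng tree.dot -o tree.png")

build_tree_image(model)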

# ## (OPTIONAL ALTERNATIVE) Visualize tree inside notebook (requires `import graphviz`)
# dot_data = export_graphviz(model, out_file=None, 
#                          feature_names=X.columns.tolist(),  
#                          filled=True, rounded=True,  
#                          special_characters=True)
# 
# graphviz.Source(dot_data) 

Take a look at the [image file](./tree.png) that you just created. This tree has grown too big! It's difficult to tell which features are most useful, since so many questions are being asked in this tree. Let's try to improve on this.
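
To put a number on how big a tree is without opening the image, a fitted scikit-learn tree reports its own depth and size. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) with labels unrelated to the features, so an unconstrained tree has to keep splitting until it has memorized the rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: random features and random labels
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 2)
y_demo = rng.randint(0, 2, size=300)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Depth = longest chain of questions; leaves = number of final buckets
print('depth :', full_tree.get_depth())
print('leaves:', full_tree.get_n_leaves())
print('nodes :', full_tree.tree_.node_count)
```

The same three calls can be run on the lab's fitted `model` to see how many questions it asks.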

### Feature Extraction
Let's try extracting some of the text content.

**Exercise:** We might expect pages that have 'recipe' in the title to be considered more evergreen. So, create a feature that indicates whether the title contains '**recipe**'.

**Exercise:** Re-evaluate the decision tree using cross-validation, using 'AUC' as the evaluation metric. Has it improved?
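
One way to build such a flag is sketched below on a small toy frame (`toy` is a hypothetical stand-in for `data`). It assumes some titles may be missing, so `na=False` is used to treat those as "no match"; the matching is case-insensitive.

```python
import pandas as pd

# Toy stand-in for the lab's `data` frame; only the 'title' column matters here
toy = pd.DataFrame({'title': ['Easy Chicken Recipe',
                              'Election results live',
                              None,
                              'Best cookie recipes']})

# 1 if the title contains 'recipe' (any case), else 0; missing titles count as 0
toy['recipe'] = toy['title'].str.contains('recipe', case=False, na=False).astype(int)
print(toy)
```

The same expression applied to `data['title']` would give a `recipe` column that can be added to the feature list before re-running cross-validation.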

### Avoiding Overfitting
**Objective:** Adjust Decision Trees to Avoid Overfitting

You can control overfitting in decision trees by adjusting either of the following parameters:
- `max_depth`: Limits the depth of the tree, i.e. the maximum number of questions asked along any path
- `min_samples_leaf`: Sets the minimum number of records required in each leaf node

**Exercise:** Try applying each of these parameters. How did it affect your score?
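
A minimal sketch of such a comparison follows, assuming synthetic stand-in data and illustrative settings (`max_depth=3`, `min_samples_leaf=20`) rather than tuned values; note that scikit-learn spells the second parameter `min_samples_leaf`. The scores it prints depend on the data, so treat them as a template for your own run, not as expected results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: one informative feature plus label noise
rng = np.random.RandomState(0)
X_demo = rng.rand(400, 2)
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(400) > 0.5).astype(int)

# Illustrative (untuned) settings for each constraint
candidates = {
    'unconstrained': DecisionTreeClassifier(random_state=0),
    'max_depth=3': DecisionTreeClassifier(max_depth=3, random_state=0),
    'min_samples_leaf=20': DecisionTreeClassifier(min_samples_leaf=20, random_state=0),
}

mean_auc = {}
for name, tree in candidates.items():
    scores = cross_val_score(tree, X_demo, y_demo, scoring='roc_auc', cv=5)
    mean_auc[name] = scores.mean()
    print('{:>20}: mean AUC {:.3f}'.format(name, mean_auc[name]))
```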

Notice the greater AUC for the adjusted model - _very exciting!_

### Random Forest Models
**Objective:** Build a random forest model to predict the evergreen-ness of a website.

In [None]:
model = RandomForestClassifier(n_estimators = 20)

model.fit(X, y)

**Demo:** Extracting importance of features.

In [None]:
features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head()

**Exercise:** Tune the Random Forest model's *hyperparameters* using cross-validation.
- Try changing the number of estimators used by your `RandomForestClassifier` model and observe whether, and by how much, predictive performance changes.
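
`GridSearchCV`, imported at the top of this lab, can automate this kind of search. The sketch below is one possible setup under stated assumptions: the data is synthetic (`X_demo`, `y_demo`) and the candidate values in `param_grid` are illustrative, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: two informative features and one noise feature
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 3)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + 0.3 * rng.randn(300) > 1.0).astype(int)

# Illustrative candidate values for the number of trees
param_grid = {'n_estimators': [5, 20, 50]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring='roc_auc', cv=5)
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Best mean CV AUC: {:.3f}'.format(search.best_score_))
```

Other hyperparameters such as `max_depth` can be added to `param_grid` as extra keys; every combination is then cross-validated.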

### Independent Practice: Evaluate Random Forest Using Cross-Validation
1. Continue adding input variables to the model that you think may be relevant
2. For each feature:
   - Evaluate the model for improved predictive performance using cross-validation
   - Evaluate the _importance_ of the feature (using `model.feature_importances_`)
3. **Bonus**: Just like the 'recipe' feature, add in similar text features and evaluate their performance.
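
The workflow in the list above could be organized as a loop that adds one candidate feature at a time. This is only a sketch under assumptions: the frame `demo` and its columns `signal`, `weak` and `noise` are invented stand-ins for real StumbleUpon columns, and the forest settings mirror the `n_estimators = 20` used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with hypothetical column names
rng = np.random.RandomState(0)
demo = pd.DataFrame({'signal': rng.rand(300),
                     'weak': rng.rand(300),
                     'noise': rng.rand(300)})
target = (demo['signal'] + 0.3 * demo['weak'] + 0.2 * rng.randn(300) > 0.65).astype(int)

selected = []   # features included so far
results = []    # (added feature, mean CV AUC, importances) per step
for feature in demo.columns:
    selected.append(feature)
    forest = RandomForestClassifier(n_estimators=20, random_state=0)
    # Cross-validated AUC with the current feature set
    auc = cross_val_score(forest, demo[selected], target, scoring='roc_auc', cv=5).mean()
    # Refit on all rows to read off the importances
    forest.fit(demo[selected], target)
    importances = dict(zip(selected, forest.feature_importances_))
    results.append((feature, auc, importances))
    print('added {:<7} mean AUC {:.3f} importances {}'.format(
        feature, auc, {k: round(float(v), 3) for k, v in importances.items()}))
```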