<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Predicting "Greenness" Of Content

_Authors: Joseph Nelson (DC), Kiefer Katovich (SF)_

---


This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender, and was made available [here](https://www.kaggle.com/c/stumbleupon/download/train.tsv).

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonLinkRatio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonLinkRatio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonLinkRatio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonLinkRatio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> tag usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
%matplotlib inline

# set max printout options for pandas:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300

### 1. Load the data
- Note it is a `.tsv` file and has a tab separator instead of comma.
- Clean the `is_news` column.
- Make two new columns, `title` and `body`, from the `boilerplate` column.

> **Note:** The `boilerplate` column is in json dictionary format. You can use the `json.loads()` function from the `json` module to convert this into a python dictionary.
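
A minimal sketch of that parsing step on a single made-up string (the `raw` value below is hypothetical and only shaped like the dataset's `boilerplate` entries). Using `.get` with a default avoids a `KeyError` when a key is absent:

```python
import json

# Hypothetical boilerplate value, shaped like the JSON strings in the dataset.
raw = '{"title": "Easy Banana Bread", "body": "Mash the bananas and mix", "url": "example com banana bread"}'

parsed = json.loads(raw)          # JSON text -> Python dict
title = parsed.get('title', '')   # fall back to '' if the key is absent
body = parsed.get('body', '')

print(title)
print(len(body.split()))          # number of whitespace-separated words in the body
```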

In [None]:
evergreen_tsv = '../../data/evergreen_sites.tsv'

In [None]:
data = pd.read_csv(evergreen_tsv, sep='\t', na_values={'is_news' : '?'}).fillna(0)

# Extract the title and body from the boilerplate JSON text
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

### 2. What are 'evergreen' sites?
- These are websites that are always relevant, like recipes or reviews (as opposed to current events).
- Stored as a binary indicator in the `label` column.
- Look at some examples.

In [None]:
data[['title', 'label']].head()

### 3. Does being a news site affect green-ness?

**3.A Investigate with plots/EDA.**

In [None]:
print(data.groupby('is_news')[['label']].mean())
# factorplot was renamed to catplot in newer versions of seaborn
sns.catplot(x='is_news', y='label', data=data, kind='bar')

**3.B Test the hypothesis with a logistic regression using statsmodels.**

> **Hint:** The `sm.logit` function from `statsmodels.formula.api` will perform a logistic regression using a formula string.

In [None]:
import statsmodels.formula.api as sm

In [None]:
news_data = data[['label','is_news']]

news_model = sm.logit("label ~ is_news", data=news_data).fit()
news_model.summary()

**3.C Interpret the results of your model.**

In [None]:
# The effect of being a news site on evergreen status is not statistically significant.
# More formally, we fail to reject the null hypothesis that news sites and
# non-news sites have equal probability of being evergreen.
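
To make the coefficients in the summary table more concrete, here is a small sketch of how an intercept and a single binary coefficient translate into predicted probabilities. The numbers are hypothetical placeholders, not the fitted values, and `inv_logit` is a helper defined only for this illustration:

```python
import math

def inv_logit(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, NOT taken from the fitted model above.
intercept = 0.05   # log-odds of evergreen when is_news == 0
beta_news = -0.02  # change in log-odds when is_news == 1

p_not_news = inv_logit(intercept)
p_news = inv_logit(intercept + beta_news)
print(round(p_not_news, 3), round(p_news, 3))
```

With a single 0/1 predictor, these two probabilities should line up with the group means printed in 3.A, which is a quick sanity check on the fit.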

### 4. Does the website category affect green-ness?

**4.A Investigate with plots/EDA.**

In [None]:
# ? and unknown should be the same category:
data['alchemy_category'] = data.alchemy_category.map(lambda x: 'unknown' if x == '?' else x)

In [None]:
print(data.groupby('alchemy_category')[['label']].mean())

# factorplot was renamed to catplot in newer versions of seaborn
sns.catplot(x='alchemy_category', y='label',
            data=data, kind='bar', aspect=3).set_xticklabels(rotation=45, horizontalalignment='right')


**4.B Test the hypothesis with a logistic regression.**

In [None]:
cat_model = sm.logit("label ~ C(alchemy_category, Treatment(reference='unknown'))", data=data).fit()
cat_model.summary()

**4.C Interpret the model results.**

In [None]:
# Many of the categories appear to have a significant effect on the likelihood of evergreen
# status. Note that I have set the reference category to be unknown, so it is absorbed into
# the intercept term. Each category's coefficient must be interpreted as that category being
# significantly different from unknown or not.

# Positive predictors of evergreen vs. unknown:
# 1. Business
# 2. Health
# 3. Recreation

# Negative predictors of evergreen vs. unknown:
# 1. Arts and entertainment
# 2. Computer and internet
# 3. Gaming
# 4. Sports

# The rest of the categories are not significantly different from the unknown category
# in their probability of being evergreen.
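
As a rough illustration of what treatment coding with a reference level does, the sketch below builds 0/1 indicator columns with pandas on toy data and drops the reference column. This only mimics the idea. It is not the code path the formula interface uses internally, and the toy categories are made up:

```python
import pandas as pd

# Toy categories, made up for illustration.
cats = pd.Series(['unknown', 'business', 'sports', 'unknown', 'health'],
                 name='alchemy_category')

# One 0/1 column per category, then drop the reference level.
dummies = pd.get_dummies(cats).astype(int).drop(columns='unknown')
print(dummies)
```

Rows belonging to the reference category are all zeros, so their predicted log-odds is the intercept alone. Every other coefficient is a shift relative to that baseline.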

### 5. Does the image ratio affect green-ness?

**5.A Investigate with plots/EDA.**

In [None]:
sns.distplot(data.image_ratio, bins=30, kde=False)

In [None]:
# qcut can divide things up by quantile - in this case into 5 bins
data['image_ratio_qbinned'] = pd.qcut(data['image_ratio'], 5)

# factorplot was renamed to catplot; kind='point' reproduces the old default
sns.catplot(x='image_ratio_qbinned', y='label', data=data, kind='point',
            aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')

**5.B Test the hypothesis using a logistic regression.**

> **Note**: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.

In [None]:
# a model using image ratio alone (ignoring the apparent nonlinear effect and skewed distribution):
image_model = sm.logit("label ~ image_ratio", data=data).fit()
image_model.summary()

In [None]:
# convert the image ratio to percentiles (this is what qcut is representing in bins):
# you can use the scipy.stats.percentileofscore for this:
from scipy import stats

data['image_ratio_pctl'] = data.image_ratio.map(lambda x: stats.percentileofscore(data.image_ratio.values, x))
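
Calling `percentileofscore` once per row rescans the whole column each time, which can be slow on larger data. A possible vectorized alternative, sketched here on toy values, is pandas' `rank(pct=True)`. With the default tie handling it should give the same numbers as `percentileofscore`'s default `kind='rank'`, but that equivalence is an assumption worth checking on your own data:

```python
import pandas as pd

# Toy values with a tie, made up for illustration.
toy = pd.Series([0.1, 0.4, 0.4, 0.9])

# Average rank as a fraction of n, scaled to the 0-100 range.
toy_pctl = toy.rank(pct=True) * 100
print(toy_pctl.tolist())
```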

In [None]:
sns.distplot(data.image_ratio_pctl, bins=30, kde=False)

In [None]:
# use the image ratio percentile (image_ratio_pctl) instead
# this is still ignoring the nonlinearity we see in the plot above!
image_model = sm.logit("label ~ image_ratio_pctl", data=data).fit()
image_model.summary()

In [None]:
# Fit a model with the percentile and the percentile squared (quadratic effect)
# This will let us model that inverted parabola
# Note: statsmodels formulas can take numpy functions!
image_model = sm.logit("label ~ image_ratio_pctl + np.power(image_ratio_pctl, 2)", data=data).fit()
image_model.summary()

**5.C Interpret the model.**

In [None]:
# Once the variable is modeled well (image ratio converted to percentiles, plus
# a quadratic term), we can see these significant effects:

# 1. There is a positive effect of the image ratio percentile score (its rank
# across image_ratios).

# 2. There is a negative quadratic effect of the image ratio percentile. That is,
# beyond a certain point the squared term of image_ratio_pctl outweighs the linear
# term and the predicted log-odds start to fall. Sites with image ratios around the
# median have the highest probability of being evergreen.
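
The turning point of that curve can be read off the two coefficients. A small sketch with hypothetical values (substitute the entries of `image_model.params` from your own fit):

```python
# Hypothetical coefficients, NOT the fitted values from the model above.
b_linear = 0.03      # coefficient on image_ratio_pctl
b_quad = -0.0003     # coefficient on image_ratio_pctl squared

# The log-odds curve b_linear*p + b_quad*p**2 peaks where its slope is zero:
# b_linear + 2*b_quad*p = 0  ->  p = -b_linear / (2*b_quad)
peak_pctl = -b_linear / (2 * b_quad)
print(peak_pctl)
```

If the peak lands inside the 0-100 range, the model predicts rising and then falling odds of evergreen as the image ratio percentile increases.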

### 6. Fit a logistic regression with multiple predictors.
- The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.
- Do any EDA you may need.
- Interpret the coefficients of the model.

> **Tip:** [This pdf is very useful for an overview of interpreting logistic regression coefficients.](https://www.unm.edu/~schrader/biostat/bio2/Spr06/lec11.pdf)

In [None]:
# look at the distribution of html_ratio
sns.distplot(data.html_ratio, bins=30, kde=False)

In [None]:
# cut can divide things up into linear (equal-width) bins - in this case into 5 bins
data['html_ratio_binned'] = pd.cut(data['html_ratio'], 5)
# factorplot was renamed to catplot; kind='point' reproduces the old default
sns.catplot(x='html_ratio_binned', y='label', data=data, kind='point',
            aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')

In [None]:
# qcut divides things up by quantile instead - in this case into 5 bins
data['html_ratio_qbinned'] = pd.qcut(data['html_ratio'], 5)
# factorplot was renamed to catplot; kind='point' reproduces the old default
sns.catplot(x='html_ratio_qbinned', y='label', data=data, kind='point',
            aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')

In [None]:
data['html_ratio_pctl'] = data.html_ratio.map(lambda x: stats.percentileofscore(data.html_ratio.values, x))

In [None]:
# You can see scipy puts percentiles from 0-100: important for interpreting coefs
data.html_ratio_pctl.head()

In [None]:
def title_len(x):
    # titles that are missing (e.g. None) have no .split method; count them as 0 words
    try:
        return len(x.split())
    except AttributeError:
        return 0

# calculate the number of words in the title and plot distribution
data['title_words'] = data.title.map(title_len)
sns.distplot(data.title_words, bins=30, kde=False)

In [None]:
data['title_words_binned'] = pd.cut(data['title_words'], 10)

# factorplot was renamed to catplot; kind='point' reproduces the old default
sns.catplot(x='title_words_binned', y='label', data=data, kind='point',
            aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')

In [None]:
# Build a model with the image ratio percentile (linear and squared), html ratio percentile, and title length
f = '''
label ~ image_ratio_pctl + np.power(image_ratio_pctl, 2) + html_ratio_pctl + title_words
'''
model = sm.logit(f, data=data).fit()
model.summary()

In [None]:
# exponentiate the coefficients to get the odds ratio:
np.exp(model.params)

In [None]:
# All of the predictors have significant effects here.
# The exponentiated coefficients are odds ratios: the factor by which the odds of
# evergreen are multiplied for a 1-unit increase in the predictor, holding the others fixed.
# 1. a 1 percentile increase in image_ratio multiplies the odds of evergreen by ~1.03 (an increase)
# 2. a 1 unit increase in image_ratio_pctl**2 multiplies the odds of evergreen by ~0.999 (a decrease)
# 3. a 1 percentile increase in html_ratio multiplies the odds of evergreen by ~0.992 (a decrease)
# 4. a 1 word increase in the length of the title multiplies the odds of evergreen by ~0.956 (a decrease)
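
Odds ratios compound multiplicatively, so a change of several units raises the odds ratio to that power. A short sketch using the approximate odds ratios quoted above (`pct_change_in_odds` is a helper name introduced here for illustration):

```python
# Odds ratios as quoted in the interpretation above (approximate).
or_html_pctl = 0.992
or_title_word = 0.956

def pct_change_in_odds(odds_ratio, units=1):
    """Percent change in the odds for a `units`-sized increase in the predictor."""
    return (odds_ratio ** units - 1) * 100

print(round(pct_change_in_odds(or_title_word), 1))      # one extra title word
print(round(pct_change_in_odds(or_title_word, 10), 1))  # ten extra title words
print(round(pct_change_in_odds(or_html_pctl, 10), 1))   # ten html_ratio percentiles
```

This shortcut does not apply directly to `image_ratio_pctl`, because its linear and squared terms move together. The combined effect there depends on the starting percentile.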