In [1]:
import pandas as pd
import json

data = pd.read_csv("https://github.com/ga-students/DAT-NYC-37/blob/master/lessons/lesson-13/assets/dataset/stumbleupon.tsv?raw=true", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender.

#### What are 'evergreen' sites?

Evergreen sites are those that are always relevant. Unlike breaking news or current events, evergreen websites stay relevant no matter the time or season.

*In the sample of rows shown above, `label = 1` marks 'evergreen' websites.*

A description of the columns is below.

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
title|string|Title of the article
body|string|Body text of article
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonlinkratio_1|double|# of links sharing at least 1 word with 1 other link / # of links
commonlinkratio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonlinkratio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonlinkratio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of <embed> usages
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with a URL with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if the text of at least 3 <a> tags contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its URL contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

---

### Objective: Predict if a given site will be evergreen based on the above features

**Problem:** Some of the above features are text-only (`title`, `url`, `body`). How can I leverage the modeling techniques we've covered so far to utilize text based features?

**Solution:** Transform text features into many numerical features.
  - Count Vectorization
  - Term frequency/inverse document frequency (TF-IDF) Vectorization.
---

In [2]:
data.title

0       IBM Sees Holographic Calls Air Breathing Batte...
1       The Fully Electronic Futuristic Starting Gun T...
2       Fruits that Fight the Flu fruits that fight th...
3                     10 Foolproof Tips for Better Sleep 
4       The 50 Coolest Jerseys You Didn t Know Existed...
5                               Genital Herpes Treatment 
6                       fashion lane American Wild Child 
7       Racing For Recovery by Dean Johnson racing for...
8                      Valet The Handbook 31 Days 31 days
9             Cookies and Cream Brownies How Sweet It Is 
10      Business Financial News Breaking US Internatio...
11      A Tip of the Cap to The Greatest Iron Man of T...
12                         9 Foods That Trash Your Teeth 
13                                                       
14      French Onion Steaks with Red Wine Sauce french...
15      Izabel Goulart Swimsuit by Kikidoll 2012 Sport...
16                    Liquid Mountaineering The Awesomer 
17            

In [10]:
# Q: add a feature that is True if a title contains recipe. Do such articles have a relationship with "evergreeness"?

def has_recipe(text):
    if "recipe" in str(text).lower():
        return 1
    else:
        return 0
    
    
data['recipe'] = data['title'].map(has_recipe)
data["recipe"]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u202c' in position 70: ordinal not in range(128)

### A primitive solution:

Maybe certain keywords in the title are predictive for whether a website is evergreen or not? For instance, maybe recipes are usually "evergreen"?

We could create a feature for whether the title contains 'recipe'. Is the percentage of evergreen websites higher or lower among pages that have 'recipe' in the title?

In [12]:
# Option 1: Create a function to check for this

def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except UnicodeEncodeError:
        # Under Python 2, str() cannot encode non-ASCII titles; treat those as "no recipe"
        return 0

# .map applies a function to each element of our Series and returns the results as a new Series.
data['recipe'] = data['title'].map(has_recipe)

# Option 2: We can also use a lambda function instead of the above function definition
# data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)


# Option 3: pandas string methods (this assignment overwrites the column with True/False values)
data['recipe'] = data['title'].str.lower().str.contains('recipe')

pd.crosstab(data['recipe'], data['label'])

label,0,1
recipe,Unnamed: 1_level_1,Unnamed: 2_level_1
False,3510,2939
True,82,852
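
The counts in this crosstab can be turned into an evergreen rate for each group. A minimal sketch, with the four counts copied by hand from the output above (the variable names are illustrative only):

```python
# Counts copied from the crosstab output: {recipe flag: (label 0 count, label 1 count)}
counts = {False: (3510, 2939), True: (82, 852)}

rates = {}
for recipe_flag, (non_evergreen, evergreen) in counts.items():
    total = non_evergreen + evergreen
    # float() keeps Python 2 from truncating the division
    rates[recipe_flag] = evergreen / float(total)
    print("recipe={}: {} of {} evergreen ({:.1%})".format(
        recipe_flag, evergreen, total, rates[recipe_flag]))
```

With pandas 0.18.1 or later, `pd.crosstab(data['recipe'], data['label'], normalize='index')` should give the same row-wise proportions directly.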


Recipe articles seem to be evergreen far more often than not, but most of our articles are not recipes. How can we apply this more generally?

**CountVectorizer:** converts a collection of text into a matrix of features.  Each row will be a sample (an article or piece of text) and each column will be a text feature (usually a count or binary feature per word).
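
To see what such a matrix looks like before turning to scikit-learn, here is a minimal pure-Python sketch of the same idea. It is not how `CountVectorizer` is implemented: the whitespace tokenizer, the two toy titles, and the helper name `count_vectorize` are assumptions made for illustration, and no stop words are removed.

```python
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary: one column per distinct word across all documents
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    matrix = []
    for tokens in tokenized:
        word_counts = Counter(tokens)  # missing words count as 0
        matrix.append([word_counts[word] for word in vocabulary])
    return vocabulary, matrix

# Two made-up titles: each becomes one row, each distinct word one column
vocabulary, matrix = count_vectorize([
    "easy chicken recipe",
    "chicken soup recipe chicken",
])
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, which is why even short titles turn into wide, mostly-zero feature vectors.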

### Demo: Understanding Count Vectorization

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "IBM Sees Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races"
]

vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))
vectorized_titles = vectorizer.fit_transform(titles)

print "Feature names: \n", vectorizer.get_feature_names()
print "Feature counts: \n", vectorized_titles.todense()
print

# Represent Count Vectorized results as a dataframe so we can preview it more easily.
pd.DataFrame(
    columns=vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'advantages races', u'air', u'air breathing', u'batteries', u'breathing', u'breathing batteries', u'calls', u'calls air', u'electronic', u'electronic calls', u'electronic futuristic', u'eliminates', u'eliminates advantages', u'fully', u'fully electronic', u'futuristic', u'futuristic starting', u'gun', u'gun eliminates', u'ibm', u'ibm sees', u'races', u'sees', u'sees electronic', u'starting', u'starting gun']
Feature counts: 
[[0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0]
 [1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1]]



Unnamed: 0,advantages,advantages races,air,air breathing,batteries,breathing,breathing batteries,calls,calls air,electronic,...,futuristic starting,gun,gun eliminates,ibm,ibm sees,races,sees,sees electronic,starting,starting gun
Article1,0,0,1,1,1,1,1,1,1,1,...,0,0,0,1,1,0,1,1,0,0
Article2,1,1,0,0,0,0,0,0,0,1,...,1,1,1,0,0,1,0,0,1,1



## Demo: Term-frequency, Inverse document frequency (Tf-Idf)

An alternative bag-of-words approach to CountVectorizer is a Term Frequency - Inverse Document Frequency (TF-IDF) representation.

TF-IDF uses the product of two intermediate values, the **Term Frequency** and **Inverse Document Frequency**.

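- **Term Frequency** is equivalent to the CountVectorizer features: the number of times a word appears in the document (i.e. its count).

- **Document Frequency** is the percentage of documents that a particular word appears in. For example, “the” would be close to 100% while “Syria” is much lower.

- **Inverse Document Frequency** is conceptually just 1/Document Frequency, so words that appear in nearly every document get a low weight. In practice a logarithm is applied: by default scikit-learn's `TfidfVectorizer` uses ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word.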
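
As a worked illustration of that product, the sketch below computes TF-IDF by hand for a toy corpus using the plain 1/Document Frequency definition given above. The three toy documents and the helper name `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` differs in detail (by default it uses a smoothed, logarithmic IDF and L2-normalizes each row), so its numbers will not match these.

```python
def tf_idf(documents):
    """Return one {word: tf * idf} dict per document, with idf = 1 / document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = float(len(tokenized))
    vocabulary = set(word for tokens in tokenized for word in tokens)
    # Number of documents that contain each word at least once
    doc_count = {
        word: sum(1 for tokens in tokenized if word in tokens)
        for word in vocabulary
    }
    weights = []
    for tokens in tokenized:
        weights.append({
            # n_docs / doc_count is 1 / (fraction of documents containing the word)
            word: tokens.count(word) * (n_docs / doc_count[word])
            for word in set(tokens)
        })
    return weights

weights = tf_idf([
    "the news from syria",
    "the best brownie recipe",
    "the news of the day",
])
print(weights[0]["the"])    # "the" is in every document, so its idf is 1
print(weights[0]["syria"])  # "syria" is in one of three documents, so its idf is 3
```

A word that is common everywhere keeps a weight close to its raw count, while a word that is rare across the corpus is boosted.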


In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "IBM Sees Electronic Calls Air Breathing Batteries",
    "The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races"
]

vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf = False)
vectorized_titles = vectorizer.fit_transform(data["title"].dropna())

print "Feature names: \n", vectorizer.get_feature_names()
print "Feature counts: \n", vectorized_titles.todense()
print "IDF weights", vectorizer.idf_


# Represent the TF-IDF weights of the first two titles as a dataframe so we can preview them more easily.
# The vectorizer was fit on every title, so slice two rows to match the two index labels.
pd.DataFrame(
    columns=vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=vectorized_titles[:2].todense()
)

Feature names: 
Feature counts: 
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
IDF weights [ 9.2139236   7.34212142  9.2139236  ...,  9.2139236   8.80845849
  9.2139236 ]

 ### Demo: Use of the Count Vectorizer with ngrams
 
 We can use the `ngram_range` parameter to find ngrams -- groups of n words.
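
For intuition, here is a small pure-Python sketch of what an n-gram is. The helper `word_ngrams` is hypothetical and not part of scikit-learn; it splits on whitespace only, whereas `CountVectorizer` also applies its own tokenization and stop-word removal before forming n-grams.

```python
def word_ngrams(text, n_min, n_max):
    """Return every run of n consecutive words, for n from n_min to n_max inclusive."""
    words = text.lower().split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words across the text
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams

# Analogous to ngram_range=(1, 2): unigrams followed by bigrams
print(word_ngrams("starting gun eliminates advantages", 1, 2))
```

Bigrams such as "starting gun" keep some word order that single-word counts throw away, at the cost of many more columns.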

In [110]:
# Note the inclusion of ngram_range
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
vectorized_titles = vectorizer.fit_transform(titles)

print "Feature names: \n", vectorizer.get_feature_names()
print "Feature counts: \n", vectorized_titles.todense()
print

# Represent Count Vectorized results as a dataframe so we can preview it more easily.
pd.DataFrame(
    columns=vectorizer.get_feature_names(),
    index=['Article1', 'Article2'],
    data=vectorized_titles.todense()
)

Feature names: 
[u'advantages', u'advantages races', u'air', u'air breathing', u'batteries', u'breathing', u'breathing batteries', u'calls', u'calls air', u'electronic', u'electronic calls', u'electronic futuristic', u'eliminates', u'eliminates advantages', u'fully', u'fully electronic', u'futuristic', u'futuristic starting', u'gun', u'gun eliminates', u'ibm', u'ibm sees', u'races', u'sees', u'sees electronic', u'starting', u'starting gun']
Feature counts: 
[[0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0]
 [1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1]]



Unnamed: 0,advantages,advantages races,air,air breathing,batteries,breathing,breathing batteries,calls,calls air,electronic,...,futuristic starting,gun,gun eliminates,ibm,ibm sees,races,sees,sees electronic,starting,starting gun
Article1,0,0,1,1,1,1,1,1,1,1,...,0,0,0,1,1,0,1,1,0,0
Article2,1,1,0,0,0,0,0,0,0,1,...,1,1,1,0,0,1,0,0,1,1


 ### Demo: Build a random forest model to predict evergreeness of a website using the title features

In [29]:
from sklearn.ensemble import RandomForestClassifier

titles = data['title'].fillna('')

model = RandomForestClassifier(n_estimators = 20)
    
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample x word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']

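# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import cross_val_score

# cv=3 matches the three fold scores shown below
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=3)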
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

CV AUC [ 0.81608807  0.82887978  0.82386112], Average AUC 0.822942990477
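
The `roc_auc` scores above can be read as the probability that the model ranks a randomly chosen evergreen page above a randomly chosen non-evergreen one. A minimal sketch of that pairwise definition, using made-up labels and scores (the helper name `pairwise_auc` is hypothetical; scikit-learn computes the same quantity more efficiently):

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy example: predicted probabilities of being evergreen for six pages
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(pairwise_auc(labels, scores))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the cross-validated values above in context.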


### Exercise: Build a random forest model to predict evergreeness of a website using the title features and quantitative features

In [None]:
## TODO

 ### Exercise: Build a random forest model to predict evergreeness of a website using the body features

In [None]:
## TODO

### Exercise: Use `TfidfVectorizer` instead of `CountVectorizer` - is this an improvement?

In [None]:
## TODO