# Classification with StumbleUpon data

Project 4 was changed because scraping proved untenable. The project now focuses on the StumbleUpon Kaggle dataset. For more information on this dataset, [see the competition page](https://www.kaggle.com/c/stumbleupon).

---

## 1. Load in the dataset

This is the only part completed for you.

---

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('white')

%matplotlib inline

In [29]:
su = pd.read_csv('../dataset/evergreen.tsv', delimiter='\t')

In [30]:
su.head(1)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,1,1,24,0,5424,170,8,0.152941,0.07913,0


## 2. Clean up/examine your data

Some of the columns may have values that need changing or that are of the wrong type. There could also be columns that aren't very useful.

---

In [31]:
# Check for nulls

print(su.isnull().any(axis=0).value_counts())
print(su.isnull().any(axis=1).value_counts())

# As you can see below, no nulls... yet

False    27
dtype: int64
False    7395
dtype: int64


In [32]:
# Convert alchemy_category_score

su.alchemy_category_score = pd.to_numeric(su.alchemy_category_score, errors='coerce')
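
The null check above reports no missing values, yet the conversion creates NaNs. A minimal sketch of why, assuming missing entries are stored as the string `'?'` (as the later comments for `is_news` suggest); the `scores` series here is a hypothetical toy column, not the real data:

```python
import pandas as pd

# Hypothetical toy column: '?' stands in for a missing score.
scores = pd.Series(['0.789131', '?', '0.5'])

# '?' is an ordinary string, so isnull() does not flag it.
print(scores.isnull().sum())

# errors='coerce' turns anything unparseable into NaN.
converted = pd.to_numeric(scores, errors='coerce')
print(converted.isnull().sum())
```

This is why the `dropna` in the next cell removes rows even though the earlier check found no nulls.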

In [33]:
# Drop the NaNs you created in the previous line of code

su = su.dropna(axis=0)

In [34]:
# Convert is_news and fill '?'s with '0'

su.is_news = pd.to_numeric(su.is_news, errors='coerce')
su.is_news = su.is_news.fillna(0)

In [35]:
# Convert news_front_page and fill '?'s with '0'

su.news_front_page = pd.to_numeric(su.news_front_page, errors='coerce')
su.news_front_page = su.news_front_page.fillna(0)

In [36]:
# Check shape

su.shape

(5053, 27)

__*Note*__

For brevity, I have omitted the pairplots I created to view relationships between the numeric data and the target.

## 3. Use sklearn to evaluate variable significance


### 3.1 Run a logistic regression predicting evergreen from the numeric columns

And print out the results as shown in the example above.

---

In [37]:
# List numeric features

numeric_features = ['avglinksize', 'alchemy_category_score', 'commonlinkratio_1', 'commonlinkratio_2', 'commonlinkratio_3', 
                    'commonlinkratio_4', 'compression_ratio', 'embed_ratio', 'is_news', 'framebased', 
                    'frameTagRatio', 'numwords_in_url', 'lengthyLinkDomain', 'linkwordscore', 'numberOfLinks', 
                    'numwords_in_url', 'parametrizedLinkRatio']

In [38]:
# Assign X and y

X = su[numeric_features]
y = su.label

In [39]:
# Import train_test_split

from sklearn.model_selection import train_test_split

# Create training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [40]:
# Confirm shape of train and test sets

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(3789, 17)
(3789,)
(1264, 17)
(1264,)


In [41]:
# Normalize X

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [42]:
# Fit a logistic regression model

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

logreg.fit(X_train_scaled, y_train)
pred_class = logreg.predict(X_test_scaled)

In [43]:
# Accuracy score of train_test_split on numeric data

from sklearn import metrics
print("Accuracy Score:")
print(metrics.accuracy_score(y_test, pred_class))

Accuracy Score:
0.614715189873


In [44]:
# Area under ROC curve

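# First predict probabilities for evergreen.
# The model was fit on scaled features, so score the scaled test set
# (the AUC recorded below came from the unscaled X_test).
probs = logreg.predict_proba(X_test_scaled)

# Next print auc_score, which is area under ROC curve
print("Area under ROC curve")
print(metrics.roc_auc_score(y_test, probs[:, 1]))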

Area under ROC curve
0.59518473956
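
Because the model above is fit on `X_train_scaled`, the inputs passed to `predict_proba` need the same scaling. The sketch below uses synthetic data (all names such as `X_demo` and `signal` are made up for illustration) to show how much the AUC can change when the unscaled test set is scored instead. It is written against a current scikit-learn, not the version used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 2000
signal = rng.normal(size=n)

# Synthetic data: an informative feature on a tiny scale,
# and a pure-noise feature on a huge scale.
X_demo = np.column_stack([0.01 * signal, 1000.0 * rng.normal(size=n)])
y_demo = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te = X_demo[:1500], X_demo[1500:]
y_tr, y_te = y_demo[:1500], y_demo[1500:]

# Fit the scaler on the training rows only, then fit the model on scaled data.
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)

auc_scaled = roc_auc_score(y_te, model.predict_proba(scaler.transform(X_te))[:, 1])
auc_raw = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print("AUC with scaled test data:   %.3f" % auc_scaled)
print("AUC with unscaled test data: %.3f" % auc_raw)
```

In this toy setup the unscaled score is dominated by the large noise feature, so the AUC falls toward 0.5. The size of the effect on the real data may differ.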


### 3.2 Run a logistic regression predicting evergreen from the numeric columns and a categorical variable of alchemy_category

And print out the results as shown in the example.

---

In [45]:
# Create dummy variables for alchemy_category

dummy_alc = pd.get_dummies(su.alchemy_category, drop_first = True)

In [46]:
# Attach dummy variable DataFrame to numeric features

su_new = su[numeric_features].join(dummy_alc)

In [47]:
# Reinstantiate logistic regression

logreg = LogisticRegression()
X = su_new[su_new.columns]
y = su.label

In [48]:
# Build train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y)

In [49]:
# Normalize X

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [50]:
# Fit model on normalized data

logreg.fit(X_train_scaled, y_train)
pred_class = logreg.predict(X_test_scaled)

In [51]:
# Print accuracy score

print("Accuracy Score:")
print(metrics.accuracy_score(y_test, pred_class))

Accuracy Score:
0.669303797468


In [52]:
# Area under ROC curve

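# First predict probabilities for evergreen.
# The model was fit on scaled features, so score the scaled test set
# (the AUC recorded below came from the unscaled X_test).
probs = logreg.predict_proba(X_test_scaled)

# Next print auc_score, which is area under ROC curve
print("Area under ROC curve:")
print(metrics.roc_auc_score(y_test, probs[:, 1]))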

Area under ROC curve:
0.609984602481


In [53]:
# Take a shot at confusion matrix

confusion = metrics.confusion_matrix(y_test, pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

In [54]:
# Print out recall - of all the actual positives, how many did you predict?

print("Recall score:")
print(metrics.recall_score(y_test, pred_class))

Recall score:
0.708661417323
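
The confusion-matrix cells extracted above (`TP`, `TN`, `FP`, `FN`) are enough to compute recall and related metrics by hand. A small sketch with made-up labels (`y_true` and `y_pred` here are illustrative, not the notebook's data):

```python
from sklearn import metrics

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
confusion = metrics.confusion_matrix(y_true, y_pred)
TN, FP = confusion[0, 0], confusion[0, 1]
FN, TP = confusion[1, 0], confusion[1, 1]

recall = TP / float(TP + FN)        # share of actual positives that were found
precision = TP / float(TP + FP)     # share of predicted positives that were right
accuracy = (TP + TN) / float(TP + TN + FP + FN)

print("Recall:    %.2f" % recall)
print("Precision: %.2f" % precision)
print("Accuracy:  %.2f" % accuracy)
```

The hand-computed recall should agree with `metrics.recall_score` on the same labels.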


__*Findings:*__

Although the accuracy (0.67) and recall (0.71) are reasonably high, the area under the ROC curve is 0.61, only marginally better than the 0.5 expected from random guessing.

## 4. Use sklearn to cross-validate the accuracy of the model above

Normalize the numeric and categorical columns of the predictor matrix.

In [55]:
# standardize X for cross validation

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

In [56]:
# Import necessary module

from sklearn.model_selection import cross_val_score

logreg = LogisticRegression()

print("Mean cross_val score (accuracy)")
print(cross_val_score(logreg, X_scaled, y, cv=10, scoring='accuracy').mean())

Mean cross_val score (accuracy)
0.677025789535


__*Findings*__

Cross-validation gave a mean accuracy of 0.677, about 0.01 higher than the 0.669 from the single train_test_split. Because it averages over 10 folds, it is a more durable estimate than the score from one split.

## 5. Gridsearch regularization parameters for logistic regression

Find the best regularization type (Ridge, Lasso) across a set of regularization strengths.

[NOTE: C is the inverse of the regularization strength, so lower C values mean stronger regularization. A C higher than 1 will significantly slow down the search. I'm not particularly interested in values over 1, since 1 is the default value of C in LogisticRegression.]

**After you find the best set of parameters, build a Logistic Regression with those parameters and crossvalidate the score.**

[NOTE 2: to run Lasso regularization the solver should be `'liblinear'`]

---

In [57]:
# Before running a gridsearch, play with L1 and L2
# First L1

logreg_L1 = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')

# Cross validation score

print("Mean cross_val score (accuracy):")
print(cross_val_score(logreg_L1, X_scaled, y, cv=10, scoring='accuracy').mean())

Mean cross_val score (accuracy):
0.679601220992


In [58]:
# Before running a gridsearch, play with L1 and L2
# Next L2

logreg_L2 = LogisticRegression(C=0.1, penalty='l2')

# Cross validation score

print("Mean cross_val score (accuracy):")
print(cross_val_score(logreg_L2, X_scaled, y, cv=10, scoring='accuracy').mean())

Mean cross_val score (accuracy):
0.677420263766


In [59]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Reinstantiate logreg; liblinear supports both the l1 and l2 penalties
logreg_for_grid = LogisticRegression(solver='liblinear')

# Generate C_range
C_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0]

# Penalty options include l1 and l2
penalty_options = ['l1', 'l2']

# Create parameter space
parameters = {'C': C_range, 'penalty': penalty_options}

# Use 10-fold cross validation in grid search
grid = GridSearchCV(logreg_for_grid, param_grid=parameters, cv=10, scoring='accuracy')

# Fit model
grid.fit(X_scaled, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

In [60]:
# Examine the best model

print("Best cross validation score derived from the grid search:")
print(grid.best_score_)

print("\n")

print("Best parameters derived from the grid search:")
print(grid.best_params_)

Best cross validation score derived from the grid search:
0.679596279438


Best parameters derived from the grid search:
{'penalty': 'l1', 'C': 0.1}
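
The grid search above runs on `X_scaled`, which was standardized using every row before cross-validation. One alternative, sketched here on synthetic data, is to put the scaler and the model in a pipeline so that scaling is refit inside each fold. The dataset and the shortened `C` list are assumptions for illustration, and the code targets a current scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    'logisticregression__C': [0.1, 0.5, 1.0],
    'logisticregression__penalty': ['l1', 'l2'],
}

grid_demo = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print(grid_demo.best_score_)
```

With this arrangement the held-out fold never influences the scaler's mean and standard deviation.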


__*Findings*__

With C and penalty as the grid-search parameters, GridSearchCV selected an L1 penalty with C = 0.1. The best cross-validation score from the grid search was 0.68, almost identical to the cross-validation scores I obtained before the grid search.

## 6. Gridsearch neighbors for kNN

Find the best number of neighbors with your predictors to predict the `label` target variable.

Start by building a kNN model with a set number of neighbors, then use gridsearch to run through a series of neighbors.

---

__*NOTE TO SELF: GridSearchCV will automatically perform a cross validation for you*__

In [61]:
# Begin by importing necessary modules and instantiating KNN

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

In [62]:
# Specify k_range

k_range = range(1, 31)

# Create dictionary for grid search

param_grid = dict(n_neighbors=k_range)

# Instantiate the grid

grid = GridSearchCV(knn, param_grid, cv = 10, scoring = 'accuracy')

# Fit grid to data

grid.fit(X_scaled, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

In [63]:
# examine the best model

print(grid.best_score_)
print(grid.best_params_)

0.697407480705
{'n_neighbors': 24}
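
Beyond the single best value, it can help to see how accuracy varies across the whole range of neighbors. A sketch on synthetic data, assuming a current scikit-learn where the fitted grid exposes `cv_results_`; the dataset and the shorter `k_values` range are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled predictors and the label.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = list(range(1, 16))
knn_grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': k_values},
                        cv=5, scoring='accuracy')
knn_grid.fit(X_demo, y_demo)

# mean_test_score holds one cross-validated accuracy per candidate k.
mean_scores = knn_grid.cv_results_['mean_test_score']
for k, score in zip(k_values, mean_scores):
    print("k=%2d  mean accuracy=%.3f" % (k, score))
```

A flat curve around the best k would suggest the exact choice matters little.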


__*Findings*__

The kNN grid search produced a slightly better accuracy (0.697 with 24 neighbors) than the logistic regression grid search (0.680).

## 7. Choose a new target from alchemy_category to predict with logistic regression

**Ideally your category choice will have a small fraction of the total rows, but not TOO small!**

---

__*Note*__

I've chosen "science_technology" as my new target.

### 7.1 Choose your target category, create the Y vector, and check the fraction of instances

---

In [64]:
# View value_counts of alchemy_categories

su.alchemy_category.value_counts()

recreation            1229
arts_entertainment     941
business               880
health                 506
sports                 380
culture_politics       343
computer_internet      296
science_technology     289
gaming                  76
religion                72
law_crime               31
unknown                  6
weather                  4
Name: alchemy_category, dtype: int64

In [65]:
# Choose "science_technology" as the category to use as target

su['sci_or_not'] = (su.alchemy_category == 'science_technology').astype(int)

In [66]:
# View head of new DataFrame with target column "sci_or_not"

su.head(1)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,sci_or_not
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,1,24,0.0,5424,170,8,0.152941,0.07913,0,0


### Intermediate Step - perform cross validation on new features and target

In [67]:
# Build features and target

features = ['avglinksize', 'commonlinkratio_1', 'commonlinkratio_2', 'commonlinkratio_3', 
                    'commonlinkratio_4', 'compression_ratio', 'embed_ratio', 'is_news', 'framebased', 
                    'frameTagRatio', 'numwords_in_url', 'lengthyLinkDomain', 'linkwordscore', 'numberOfLinks', 
                    'numwords_in_url', 'parametrizedLinkRatio', 'alchemy_category_score', 'label']

X = su[features]
y = su.sci_or_not

In [68]:
# Normalize features

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

In [69]:
# Instantiate logreg

logreg = LogisticRegression()

print("Mean cross_val score (accuracy):")
print(cross_val_score(logreg, X_scaled, y, cv=10, scoring='accuracy').mean())

Mean cross_val score (accuracy):
0.942806671972
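
The accuracy above is worth comparing with a majority-class baseline, because the new target is imbalanced. A quick check using the counts from the `value_counts()` output in 7.1 (289 `science_technology` rows out of 5053):

```python
# Counts taken from the value_counts() output in 7.1.
n_rows = 5053
n_science = 289

positive_fraction = n_science / float(n_rows)
majority_baseline = 1 - positive_fraction

print("Fraction of science_technology rows: %.4f" % positive_fraction)
print("Accuracy of always predicting 0:     %.4f" % majority_baseline)
```

The baseline comes out at about 0.9428, essentially the same as the cross-validated accuracy. This suggests the model may be doing little more than predicting the majority class, so accuracy alone is not very informative here.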


### 7.4 Gridsearch a logistic regression to predict accuracy on your new target from the interaction predictors

Include Ridge and Lasso.


In [70]:
# Import GridSearchCV
# Already done from first Grid Search

# Reinstantiate logreg; liblinear supports both the l1 and l2 penalties
logreg_for_grid = LogisticRegression(solver='liblinear')

# Penalty options include l1 and l2
penalty_options = ['l1', 'l2']

# Create parameter space
parameters = {'C': [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0], 'penalty': penalty_options}

# Use 10-fold cross validation in grid search
grid = GridSearchCV(logreg_for_grid, param_grid=parameters, cv=10, scoring='accuracy')

# Fit model
grid.fit(X_scaled, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

In [71]:
# Examine the best model

print("Best cross validation score derived from the grid search:")
print(grid.best_score_)

print("\n")

print("Best parameters derived from the grid search:")
print(grid.best_params_)

Best cross validation score derived from the grid search:
0.942806253711


Best parameters derived from the grid search:
{'penalty': 'l1', 'C': 0.1}


### 7.5 Build a logistic regression with the optimal parameters, and look at the coefficients

---

In [72]:
# Run a logistic regression with parameters determined above

logreg_refined = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')

# Fit the model to features and target

model = logreg_refined.fit(X_scaled, y)

# Look at coefficients

print("Coefficients of refined model:")

pd.DataFrame(list(zip(features, np.transpose(model.coef_))))

Coefficients of refined model:


Unnamed: 0,0,1
0,avglinksize,[0.0214403275306]
1,commonlinkratio_1,[0.0974102160723]
2,commonlinkratio_2,[0.0]
3,commonlinkratio_3,[0.0]
4,commonlinkratio_4,[0.0]
5,compression_ratio,[0.0]
6,embed_ratio,[0.0]
7,is_news,[0.0284667838802]
8,framebased,[0.0]
9,frameTagRatio,[0.121325784115]


### 7.6 Gridsearch parameters for a logistic regression with the same target and predictors, but score based on precision rather than accuracy

Look at the documentation.

---

In [73]:
# Import GridSearchCV
# Already done from first Grid Search

# Reinstantiate logreg; liblinear supports both the l1 and l2 penalties
logreg_for_grid = LogisticRegression(solver='liblinear')

# Penalty options include l1 and l2
penalty_options = ['l1', 'l2']

# Create parameter space
parameters = {'C': [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0], 'penalty': penalty_options}

# Use 10-fold cross validation in grid search.
# 'average_precision' scores the area under the precision-recall curve;
# scoring='precision' would score plain precision instead.
grid = GridSearchCV(logreg_for_grid, param_grid=parameters, cv=10, scoring='average_precision')

# Fit model
grid.fit(X_scaled, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring='average_precision',
       verbose=0)

In [74]:
# Examine the best model

print("Best precision score derived from the grid search:")
print(grid.best_score_)

print("\n")

print("Best parameters derived from the grid search:")
print(grid.best_params_)

Best precision score derived from the grid search:
0.0787738983661


Best parameters derived from the grid search:
{'penalty': 'l2', 'C': 1.0}
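
The search above uses `scoring='average_precision'`, which summarizes the whole precision-recall curve from predicted scores. Plain precision at a fixed threshold is a different quantity. A small sketch with made-up labels and scores to show the difference:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score

# Made-up labels and predicted probabilities for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Plain precision needs hard predictions, here thresholded at 0.5.
y_pred = (y_score >= 0.5).astype(int)
precision_at_half = precision_score(y_true, y_pred)

# Average precision works on the scores and considers every threshold.
avg_precision = average_precision_score(y_true, y_score)

print("Precision at threshold 0.5: %.3f" % precision_at_half)
print("Average precision:          %.3f" % avg_precision)
```

If plain precision is the intended metric, passing `scoring='precision'` to `GridSearchCV` is one option.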


## [BONUS] 8. Build models predicting from words

This is a bit of the NLP we covered in the pipeline lecture!

---

### 8.1 Choose 'body' or 'title' from the boilerplate to be the basis of your word predictors

You will need to parse the json from the boilerplate field.

---

In [75]:
import json
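
One possible way to pull the title out of each `boilerplate` string, sketched on hypothetical rows in the same JSON shape. The `demo` frame and the `extract_title` helper are illustrative names, and treating a missing or null title as an empty string is an assumption:

```python
import json
import pandas as pd

# Hypothetical boilerplate strings shaped like the dataset column.
demo = pd.DataFrame({'boilerplate': [
    '{"title": "A sample headline", "body": "some body text"}',
    '{"title": null, "body": "text without a title"}',
]})

def extract_title(raw):
    # Return the 'title' field, or '' when it is missing or null.
    return json.loads(raw).get('title') or ''

demo['title'] = demo.boilerplate.map(extract_title)
print(demo.title.tolist())
```

The same pattern would work for `'body'` by changing the key.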

### 8.2 Use CountVectorizer to create your predictor matrix from the string column

It is up to you what range of ngrams and features, and whether or not you want the columns binary or counts.
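
A minimal sketch of building a word predictor matrix with `CountVectorizer`. The three titles are made up, and the choices of unigrams plus bigrams, binary indicators and English stop words are just one possible configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles standing in for the parsed boilerplate column.
titles = ['easy chicken soup recipe',
          'chicken recipe for busy nights',
          'stock market news today']

# ngram_range=(1, 2) keeps single words and word pairs;
# binary=True records presence (0/1) rather than counts.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, stop_words='english')
X_words = vectorizer.fit_transform(titles)

print(X_words.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

The result is a sparse matrix with one row per document and one column per n-gram.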

---

### 8.3 Gridsearch a logistic regression predicting accuracy of your chosen target category from word predictor matrix

---

### 8.4 Do the same as above, but score the gridsearch based on precision rather than accuracy

---

### 8.5 Build a logistic regression with optimal precision categories

Print out the top 20 or 25 word features as ranked by their coefficients.
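
Ranking word features by coefficient could look roughly like the sketch below. The titles and the `evergreen` labels are made up for illustration, and default regularization is used instead of grid-searched parameters:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up titles and labels for illustration only.
titles = ['chicken soup recipe', 'easy cake recipe', 'best pasta recipe',
          'stock market news', 'election news today', 'breaking sports news']
evergreen = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_words = vectorizer.fit_transform(titles)
model = LogisticRegression(solver='liblinear').fit(X_words, evergreen)

# Order the vocabulary by column index so words line up with coefficients.
words = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# Sort coefficients from largest to smallest and print the top few words.
order = np.argsort(model.coef_[0])[::-1]
top_n = 5
for word, coef in zip(words[order][:top_n], model.coef_[0][order][:top_n]):
    print("%-10s %.3f" % (word, coef))
```

Raising `top_n` to 20 or 25 on the real vocabulary would give the list the exercise asks for.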

---