# Classification with StumbleUpon data

Project 4 has changed because scraping proved untenable. The project now focuses on the StumbleUpon Kaggle dataset. For more information on this dataset, [check out the website here](https://www.kaggle.com/c/stumbleupon).

---

## 1. Load in the dataset

This is the only part completed for you.

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('white')

%matplotlib inline

In [2]:
su = pd.read_csv('../dataset/evergreen.tsv', delimiter='\t')

## 2. Clean up/examine your data

Some of the columns may have values that need changing or that are of the wrong type. There could also be columns that aren't very useful.

---

In [3]:
su['alchemy_category'] = su['alchemy_category'].apply(lambda x: 'unknown' if x == '?' else x)
su['alchemy_category_score'] = su['alchemy_category_score'].apply(lambda x: 0 if x == '?' else float(x))
su['is_news'] = su['is_news'].apply(lambda x: 0 if x == '?' else int(x))
su['news_front_page'] = su['news_front_page'].apply(lambda x: 0 if x == '?' else int(x))

In [4]:
su.alchemy_category.tail()

7390     computer_internet
7391      culture_politics
7392            recreation
7393    arts_entertainment
7394               unknown
Name: alchemy_category, dtype: object
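
The cleanup above replaces the `'?'` placeholders one column at a time with `apply`. As a rough alternative sketch, assuming a made-up `toy` frame that uses the same placeholder convention (it is not the real dataset), `pd.to_numeric` with `errors='coerce'` turns anything non-numeric into `NaN` so it can be filled in one step:

```python
import pandas as pd

# Hypothetical toy data mimicking the '?' placeholder in the raw file
toy = pd.DataFrame({
    'alchemy_category': ['business', '?', 'sports'],
    'alchemy_category_score': ['0.75', '?', '0.5'],
    'is_news': ['1', '?', '1'],
})

# Categorical column: relabel the placeholder
toy['alchemy_category'] = toy['alchemy_category'].replace('?', 'unknown')

# Numeric columns: coerce non-numeric strings to NaN, then fill with 0
toy['alchemy_category_score'] = pd.to_numeric(
    toy['alchemy_category_score'], errors='coerce').fillna(0)
toy['is_news'] = pd.to_numeric(
    toy['is_news'], errors='coerce').fillna(0).astype(int)

print(toy.dtypes)
```

Filling with 0 treats a missing value as a real zero, which may or may not be what you want for a given column.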

## 3. Use statsmodels' logistic regression function to look at variable significance

The **`import statsmodels.formula.api as smf`** code below gives us access to a statsmodels api that can run logistic regressions using patsy-style formulas.

Ex:

```python
formula = 'target ~ var1 + var2 + C(var3) -1'
logreg = smf.logit(formula, data=data)
logreg_results = logreg.fit()
print logreg_results.summary()
```

---

In [5]:
import statsmodels.formula.api as smf

### 3.1 Run a logistic regression predicting evergreen from the numeric columns

And print out the results as shown in the example above.

---

In [6]:
# su.columns

In [7]:
import patsy
import statsmodels.formula.api as smf

formula = 'label ~  alchemy_category_score + avglinksize + commonlinkratio_1 \
+ commonlinkratio_2 + commonlinkratio_3 + commonlinkratio_4 + compression_ratio + embed_ratio \
+ frameTagRatio + hasDomainLink+html_ratio + is_news + image_ratio + lengthyLinkDomain + linkwordscore\
+ news_front_page + non_markup_alphanum_characters + numberOfLinks \
+ numwords_in_url+parametrizedLinkRatio+spelling_errors_ratio -1'

logreg = smf.logit(formula, data=su)
logreg_results = logreg.fit()
print logreg_results.summary()

Optimization terminated successfully.
         Current function value: 0.652585
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                  label   No. Observations:                 7395
Model:                          Logit   Df Residuals:                     7374
Method:                           MLE   Df Model:                           20
Date:                Tue, 17 May 2016   Pseudo R-squ.:                 0.05804
Time:                        10:33:51   Log-Likelihood:                -4825.9
converged:                       True   LL-Null:                       -5123.2
                                        LLR p-value:                3.821e-113
                                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------
alchemy_category_score            -0.1598      0.078     -2.057     


### 3.2 Run a logistic regression predicting evergreen from the numeric columns and a categorical variable of alchemy_category

And print out the results as shown in the example.

---

In [8]:
import patsy
import statsmodels.formula.api as smf

formula = 'label ~  C(alchemy_category) + alchemy_category_score + avglinksize + commonlinkratio_1 \
+ commonlinkratio_2 + commonlinkratio_3 + commonlinkratio_4 + compression_ratio + embed_ratio \
+ frameTagRatio + hasDomainLink+html_ratio + is_news + image_ratio + lengthyLinkDomain + linkwordscore\
+ news_front_page + non_markup_alphanum_characters + numberOfLinks \
+ numwords_in_url+parametrizedLinkRatio+spelling_errors_ratio -1'

logreg = smf.logit(formula, data=su)
logreg_results = logreg.fit()
print logreg_results.summary()

Optimization terminated successfully.
         Current function value: 0.612383
         Iterations 17
                           Logit Regression Results                           
Dep. Variable:                  label   No. Observations:                 7395
Model:                          Logit   Df Residuals:                     7361
Method:                           MLE   Df Model:                           33
Date:                Tue, 17 May 2016   Pseudo R-squ.:                  0.1161
Time:                        10:33:52   Log-Likelihood:                -4528.6
converged:                       True   LL-Null:                       -5123.2
                                        LLR p-value:                1.136e-228
                                              coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------
C(alchemy_category)[arts_entertainment]     0.567

## 4. Use sklearn to cross-validate the accuracy of the model above

Normalize the numeric and categorical columns of the predictor matrix.

---

In [9]:
su_cols = su[['url', 'urlid', 'boilerplate', 'framebased','alchemy_category','hasDomainLink',\
                'is_news','lengthyLinkDomain','news_front_page']]
su_cols.head(1)

| | url | urlid | boilerplate | framebased | alchemy_category | hasDomainLink | is_news | lengthyLinkDomain | news_front_page |
|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.bloomberg.com/news/2010-12-23/ibm-p... | 4042 | {"title":"IBM Sees Holographic Calls Air Breat... | 0 | business | 0 | 1 | 1 | 0 |


In [10]:
su_cols = su[[col for col in su if col not in su_cols]]
su_cols.head(1)

| | alchemy_category_score | avglinksize | commonlinkratio_1 | commonlinkratio_2 | commonlinkratio_3 | commonlinkratio_4 | compression_ratio | embed_ratio | frameTagRatio | html_ratio | image_ratio | linkwordscore | non_markup_alphanum_characters | numberOfLinks | numwords_in_url | parametrizedLinkRatio | spelling_errors_ratio | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.789131 | 2.055556 | 0.676471 | 0.205882 | 0.047059 | 0.023529 | 0.443783 | 0.0 | 0.090774 | 0.245831 | 0.003883 | 24 | 5424 | 170 | 8 | 0.152941 | 0.07913 | 0 |


In [11]:
su_ = su_cols.ix[:,'alchemy_category_score':]
# su.dtypes

In [12]:
ob = [c for c in su_.columns if c != 'label']

su_.ix[:,ob] = (su_.ix[:,ob] - su_.ix[:,ob].mean())/su_.ix[:,ob].std()

In [13]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cross_validation import cross_val_score

target = 'label'
cols = [c for c in su_.columns if c != target]
x = su_[cols]
y = su_[target]
model = LogisticRegression()

In [14]:
scores = cross_val_score(model,x,y,cv=5)
print scores

[ 0.61621622  0.63218391  0.62407032  0.60987153  0.63193505]
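
The standardization above uses the mean and standard deviation of the full dataset before cross-validating, so each held-out fold has already influenced the scaling. A minimal sketch of one way around that, assuming synthetic stand-in data and a scikit-learn version that provides `sklearn.model_selection` (newer than the `sklearn.cross_validation` module imported in this notebook), puts the scaler inside a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix and label (not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler is refit on the training part of every fold,
# so the held-out fold never contributes to the mean/std
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(cv_scores)
print(cv_scores.mean())
```

With only a scaler in front of the model the difference is usually small, but the same pattern matters more once feature selection or other fitted preprocessing is involved.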


In [15]:
# su.head()
model.fit(x,y)
model.score(x,y)  # call the method; this is accuracy on the training data

In [16]:
from sklearn.cross_validation import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.30)
print X_train.shape,Y_train.shape
print X_test.shape, Y_test.shape
tts = model.fit(X_train, Y_train)
tts.score(X_test,Y_test)

(5176, 17) (5176,)
(2219, 17) (2219,)


0.62280306444344302

## 5. Gridsearch regularization parameters for logistic regression

Find the best regularization type (Ridge/L2 or Lasso/L1) across a set of regularization strengths.

[NOTE: C is the inverse of the regularization strength, so lower C values mean stronger regularization. Having a C higher than 1 will significantly slow down the search. I'm not particularly interested in values over 1, since 1 is the default value of C in LogisticRegression.]

**After you find the best set of parameters, build a Logistic Regression with those parameters and crossvalidate the score.**

[NOTE 2: to run Lasso regularization the solver should be `'liblinear'`]

---
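
To make the note about C concrete, here is a small sketch on synthetic data (not the StumbleUpon predictors) comparing Lasso (`'l1'`) and Ridge (`'l2'`) penalties at a strong and a default strength. The data, variable names, and the two C values are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 4 of them informative (not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=4, random_state=0)

summary = {}
for penalty in ['l1', 'l2']:
    for C in [0.01, 1.0]:
        model = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        model.fit(X_demo, y_demo)
        coefs = model.coef_.ravel()
        # Record how many coefficients survive and their total magnitude
        summary[(penalty, C)] = (int(np.sum(coefs != 0)),
                                 float(np.abs(coefs).sum()))
        print('%s C=%.2f nonzero=%d sum|coef|=%.3f'
              % ((penalty, C) + summary[(penalty, C)]))
```

Smaller C should shrink the coefficients under both penalties, and the L1 penalty should additionally set some of them exactly to zero.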

In [17]:
from sklearn.grid_search import GridSearchCV
from sklearn import linear_model
from sklearn.metrics import classification_report

In [18]:
logistic = linear_model.LogisticRegression()

In [19]:
search_parameters = {
    "penalty":             ['l1','l2'],   # Used to specify the norm used in the penalization.
    "C":                   [.01,2.5],  # Regularization parameter; 2.5 is above the range suggested in the note, but we will try it
    #"dual":                [True, False], # Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features
    "fit_intercept":       [False, True], # Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
    #"class_weight":        [None, "balanced", "auto"], # The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
    "intercept_scaling":   [2, 1],        # Useful only if solver is liblinear. when self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equals to intercept_scaling is appended to the instance vector. 
    "solver":              ['liblinear'],
    "warm_start":          [False, True]
}

estimator = GridSearchCV(logistic, search_parameters)
estimator.fit(x,y)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'warm_start': [False, True], 'C': [0.01, 2.5], 'intercept_scaling': [2, 1], 'solver': ['liblinear'], 'fit_intercept': [False, True], 'penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [20]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [21]:
estimator.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'warm_start': [False, True], 'C': [0.01, 2.5], 'intercept_scaling': [2, 1], 'solver': ['liblinear'], 'fit_intercept': [False, True], 'penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [22]:
print "Best C / Regularization Param:", estimator.best_estimator_.C # This estimator.best_estimator_ object has many great reporting metrics
print "Best Params:", estimator.best_params_
print "Best Score:", estimator.best_score_

Best C / Regularization Param: 2.5
Best Params: {'warm_start': False, 'C': 2.5, 'intercept_scaling': 2, 'solver': 'liblinear', 'fit_intercept': False, 'penalty': 'l2'}
Best Score: 0.623738393218
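
The bolded instruction in this section asks for a model rebuilt from the best parameters and then cross-validated. A minimal sketch of that step, assuming synthetic stand-in data, a hypothetical three-value C grid, and the newer `sklearn.model_selection` module:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (not the real predictors)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

# Hypothetical grid: both penalties across three strengths
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1.0]}
search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Rebuild a fresh model from the winning parameters and cross-validate it
best_model = LogisticRegression(solver='liblinear', **search.best_params_)
best_scores = cross_val_score(best_model, X_demo, y_demo, cv=5)

print(search.best_params_)
print(best_scores.mean())
```

Cross-validating on the same rows that were used to pick the parameters gives a somewhat optimistic score; a held-out test split is the safer check.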


In [23]:
y_true, y_pred = y_test, estimator.predict(X_test)
# target_names follow the sorted label order: 0 (non-evergreen) first, then 1 (evergreen)
print classification_report(y_true, y_pred, target_names=["Non-Evergreen","Evergreen"])

               precision    recall  f1-score   support

Non-Evergreen       0.64      0.56      0.60      1198
    Evergreen       0.62      0.69      0.65      1243

  avg / total       0.63      0.63      0.63      2441
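
`classification_report` matches `target_names` to the class labels in sorted order, so with a 0/1 target the first name describes class 0. A small sketch with made-up predictions, assuming (as in the Kaggle data description) that `label` is 1 for evergreen pages:

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Made-up labels and predictions for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

# Index 0 of target_names is attached to label 0, index 1 to label 1
print(classification_report(y_true_demo, y_pred_demo,
                            target_names=['Non-Evergreen', 'Evergreen']))

# The 'Evergreen' row corresponds to treating label 1 as the positive class
precision_1 = precision_score(y_true_demo, y_pred_demo, pos_label=1)
recall_1 = recall_score(y_true_demo, y_pred_demo, pos_label=1)
print(precision_1)
print(recall_1)
```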



## 6. Gridsearch neighbors for kNN

Find the best number of neighbors with your predictors to predict the `label` target variable.

Start by building a kNN model with a set number of neighbors, then use gridsearch to run through a series of neighbors.

---

In [24]:
# Load gridsearch
from sklearn import svm, grid_search, datasets
from sklearn.neighbors import KNeighborsClassifier



In [25]:
# Set up our GridSearch parameters
search_parameters = {
    'n_neighbors':  [3,50], 
    'weights':      ("uniform", "distance"),
    'algorithm':    ("ball_tree", "kd_tree", "brute", "auto"),
    'p':            [1,2]
}


In [26]:
### Initialize KNN 
knn = KNeighborsClassifier()

# Initialize GridSearchCV
clf = grid_search.GridSearchCV(knn, search_parameters)
clf.fit(x,y)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [3, 50], 'weights': ('uniform', 'distance'), 'algorithm': ('ball_tree', 'kd_tree', 'brute', 'auto'), 'p': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [27]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [28]:
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [3, 50], 'weights': ('uniform', 'distance'), 'algorithm': ('ball_tree', 'kd_tree', 'brute', 'auto'), 'p': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [29]:
y_true, y_pred = y_test, clf.predict(X_test)
# target_names follow the sorted label order: 0 (non-evergreen) first, then 1 (evergreen)
print classification_report(y_true, y_pred, target_names=["Non-Evergreen","Evergreen"])

               precision    recall  f1-score   support

Non-Evergreen       0.69      0.52      0.59      1198
    Evergreen       0.63      0.78      0.69      1243

  avg / total       0.66      0.65      0.64      2441
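
The grid `[3, 50]` used above tries only two values of `n_neighbors`. A sketch of running through a series of neighbors instead, assuming synthetic stand-in data, a hypothetical step of 4, and the newer `sklearn.model_selection` module:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the real predictors)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2)

# range(3, 52, 4) covers 3, 7, 11, ..., 47, 51 rather than two endpoints
knn_grid = {
    'n_neighbors': list(range(3, 52, 4)),
    'weights': ['uniform', 'distance'],
}
knn_search = GridSearchCV(KNeighborsClassifier(), knn_grid, cv=5)
knn_search.fit(X_demo, y_demo)

print(knn_search.best_params_)
print(knn_search.best_score_)
```

Odd values of k avoid tied votes in a two-class problem when `weights='uniform'`.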



## 7. Choose a new target from alchemy_category to predict with logistic regression

**Ideally your category choice will have a small fraction of the total rows, but not TOO small!**

---

### 7.1 Choose your target category, create the Y vector, and check the fraction of instances

---

In [30]:
su.alchemy_category.unique()

array(['business', 'recreation', 'health', 'sports', 'unknown',
       'arts_entertainment', 'science_technology', 'gaming',
       'culture_politics', 'computer_internet', 'law_crime', 'religion',
       'weather'], dtype=object)

In [31]:
# new_target = su.alchemy_category[su.alchemy_category == 'recreation']


In [32]:
def step_function(x):
    return 1 if x == "recreation" else 0

## Binary target: 1 for rows in the recreation category, 0 otherwise
su['recreation'] = su.alchemy_category.apply(step_function)
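
This section also asks for the fraction of instances in the chosen category. A minimal sketch, using a made-up `categories` series in place of `su.alchemy_category`:

```python
import pandas as pd

# Hypothetical category column standing in for su.alchemy_category
categories = pd.Series(['business', 'recreation', 'health', 'recreation',
                        'sports', 'unknown', 'recreation', 'gaming'])

# A boolean comparison cast to int gives the 0/1 target directly
is_recreation = (categories == 'recreation').astype(int)

# The mean of a 0/1 vector is the fraction of positive rows
print(is_recreation.mean())

# Shares for every category, useful when picking a target
print(categories.value_counts(normalize=True))
```

On the real data the equivalent check would be `su['recreation'].mean()`.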

### 7.2 Use patsy to create an X matrix of the numeric predictors and all two-way interactions between them

Ex:

```python
import patsy

formula_interactions = '~ (var1 + var2 + var3)**2 -1'
X_interactions = patsy.dmatrix(formula_interactions, data=data)
```

Get the column names from the `design_info` property of the patsy X matrix.

---
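
A runnable sketch of the pattern described above, assuming a small random frame with placeholder columns `var1`, `var2` and `var3` (not columns from the dataset):

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical numeric data with three placeholder columns
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.rand(6, 3), columns=['var1', 'var2', 'var3'])

# **2 expands to main effects plus all two-way interactions; -1 drops the intercept
formula_interactions = '~ (var1 + var2 + var3)**2 - 1'
X_interactions = patsy.dmatrix(formula_interactions, data=demo)

# Column names live on the design_info attribute of the design matrix
column_names = X_interactions.design_info.column_names
print(column_names)
print(np.asarray(X_interactions).shape)
```

Interaction columns are named with a colon, for example `var1:var2`.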

In [33]:
# C(alchemy_category) + alchemy_category_score + avglinksize + commonlinkratio_1 \
# + commonlinkratio_2 + commonlinkratio_3 + commonlinkratio_4 + compression_ratio + embed_ratio \
# + frameTagRatio + hasDomainLink+html_ratio + is_news + image_ratio + lengthyLinkDomain + linkwordscore\
# + news_front_page + non_markup_alphanum_characters + numberOfLinks \
# + numwords_in_url+parametrizedLinkRatio+spelling_errors_ratio + news_front_page 

In [34]:
import patsy
import statsmodels.formula.api as smf

formula = 'recreation ~ (news_front_page + embed_ratio + spelling_errors_ratio +image_ratio)**2  -1'

logreg = smf.logit(formula, data=su)
logreg_results = logreg.fit()
print logreg_results.summary()

Optimization terminated successfully.
         Current function value: 0.464919
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:             recreation   No. Observations:                 7395
Model:                          Logit   Df Residuals:                     7385
Method:                           MLE   Df Model:                            9
Date:                Tue, 17 May 2016   Pseudo R-squ.:                -0.03362
Time:                        10:34:54   Log-Likelihood:                -3438.1
converged:                       True   LL-Null:                       -3326.3
                                        LLR p-value:                     1.000
                                            coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------------------
news_front_page                           0.3536      

In [35]:
import patsy

# The binary target created in 7.1
new_target = 'recreation'

# Standardized numeric predictors plus the new target; the old 'label' target is left out
su_rec = su_.drop('label', axis=1)
su_rec[new_target] = su[new_target]

# Get the non-target cols with a simple list comprehension
non_target_cols = [c for c in su_rec.columns if c != new_target]

# Use some string adding and joining to make the simple model formula:
formula_simple = new_target + ' ~ ' + ' + '.join(non_target_cols) + ' -1'
print formula_simple

# Make the complex formula:
formula_complex = new_target + ' ~ (' + ' + '.join(non_target_cols) + ')**2 -1'
print '\n',formula_complex

# Create the X and Y pairs for both!
Y, X = patsy.dmatrices(formula_simple, data=su_rec)
Yoverfit, Xoverfit = patsy.dmatrices(formula_complex, data=su_rec)

### 7.3 Normalize the predictor matrix columns

---
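
Interaction columns are products of other columns, so their scales can differ a lot even when the inputs were standardized. One possible sketch of normalizing a predictor matrix, assuming a made-up two-column matrix and scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor matrix with columns on very different scales
rng = np.random.RandomState(0)
X_demo = np.column_stack([rng.normal(50, 10, 200),
                          rng.normal(0, 0.01, 200)])

# Each column is centered to mean 0 and scaled to unit standard deviation
X_scaled = StandardScaler().fit_transform(X_demo)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample version (`ddof=1`), so the two approaches differ very slightly.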

### 7.4 Gridsearch a logistic regression to predict accuracy on your new target from the interaction predictors

Include Ridge and Lasso.

---

### 7.5 Build a logistic regression with the optimal parameters, and look at the coefficients

---
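
A sketch of looking at the coefficients once a model has been fit. The data, the feature names, and the `penalty`/`C` values here are placeholders; in practice they would come from the design matrix and from the grid search in 7.4:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=3)
feature_names = ['f%d' % i for i in range(X_demo.shape[1])]

# penalty and C are placeholders for whatever the grid search selects
model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
model.fit(X_demo, y_demo)

# Pair each coefficient with its column name and rank by magnitude
coefs = pd.DataFrame({'feature': feature_names, 'coef': model.coef_.ravel()})
coefs['abs_coef'] = coefs['coef'].abs()
coefs = coefs.sort_values('abs_coef', ascending=False)
print(coefs)
```

Comparing magnitudes like this is only meaningful when the predictors are on a common scale, which is why the normalization step comes first.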

### 7.6 Gridsearch parameters for a logistic regression with the same target and predictors, but score based on precision rather than accuracy

Look at the documentation.

---
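
`GridSearchCV` accepts a `scoring` argument, and the string `'precision'` selects precision for the positive class. A minimal sketch, assuming an imbalanced synthetic target, a hypothetical parameter grid, and the newer `sklearn.model_selection` module:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic target (roughly 20% positives), loosely like a minority category
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.8, 0.2], random_state=4)

# Hypothetical grid covering both penalties
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1.0]}

# scoring='precision' ranks parameter sets by positive-class precision, not accuracy
precision_search = GridSearchCV(LogisticRegression(solver='liblinear'),
                                param_grid, scoring='precision', cv=5)
precision_search.fit(X_demo, y_demo)

print(precision_search.best_params_)
print(precision_search.best_score_)
```

A heavily regularized model may predict no positives at all in some fold; scikit-learn then reports a precision of 0 for that fold and emits a warning.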

## [BONUS] 8. Build models predicting from words

This is a bit of the NLP we covered in the pipeline lecture!

---

### 8.1 Choose 'body' or 'title' from the boilerplate to be the basis of your word predictors

You will need to parse the json from the boilerplate field.

---

In [None]:
import json
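
A small sketch of the parsing step, assuming each boilerplate value is a JSON object with keys such as `title` and `body`. The `raw` string and the `extract_field` helper are hypothetical, written only for illustration:

```python
import json

# Hypothetical boilerplate string in the same JSON shape as the dataset's field
raw = '{"title": "IBM Sees Holographic Calls", "body": "A sample body text"}'

def extract_field(boilerplate, field):
    """Return one field from a boilerplate JSON string, or '' if unavailable."""
    try:
        parsed = json.loads(boilerplate)
    except ValueError:
        # Not valid JSON: fall back to an empty string
        return ''
    value = parsed.get(field) if isinstance(parsed, dict) else None
    # Missing keys and JSON nulls both become empty strings
    return value if value is not None else ''

print(extract_field(raw, 'title'))
```

On the real data this could be applied row by row, for example `su['title'] = su.boilerplate.apply(lambda b: extract_field(b, 'title'))`.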

### 8.2 Use CountVectorizer to create your predictor matrix from the string column

It is up to you what range of ngrams and features, and whether or not you want the columns binary or counts.

---
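
A sketch of the choices mentioned above (ngram range, number of features, binary versus counts), using made-up titles in place of the parsed boilerplate text. The specific settings are assumptions, not recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles standing in for the parsed boilerplate text
titles = ['easy chocolate cake recipe',
          'chocolate cake for beginners',
          'stock market news today']

# Unigrams and bigrams, binary indicators instead of counts, capped vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, max_features=1000)
X_words = vectorizer.fit_transform(titles)

print(X_words.shape)
# vocabulary_ maps each ngram to its column index
print(sorted(vectorizer.vocabulary_)[:5])
```

The result is a sparse matrix, which `LogisticRegression` can take directly as its predictor matrix.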

### 8.3 Gridsearch a logistic regression predicting accuracy of your chosen target category from word predictor matrix

---

### 8.4 Do the same as above, but score the gridsearch based on precision rather than accuracy

---

### 8.5 Build a logistic regression with optimal precision categories

Print out the top 20 or 25 word features as ranked by their coefficients.

---
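
One way to rank word features by coefficient, sketched with made-up values. Here `vocabulary` plays the role of a fitted `CountVectorizer`'s `vocabulary_` and `coef` the matching row of a fitted model's `coef_`; both are hypothetical:

```python
import numpy as np

# Hypothetical fitted values: word -> column index, and one coefficient per column
vocabulary = {'recipe': 0, 'news': 1, 'cake': 2, 'market': 3, 'garden': 4}
coef = np.array([1.8, -2.1, 1.2, -0.7, 0.3])

# Invert the mapping so that position i holds the word for column i
words = sorted(vocabulary, key=vocabulary.get)

top_n = 3
# argsort is ascending, so reverse it to put the largest coefficients first
order = np.argsort(coef)[::-1][:top_n]
top_features = [(words[i], float(coef[i])) for i in order]
print(top_features)
```

Sorting on `np.abs(coef)` instead would surface the strongest negative words alongside the positive ones.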