This time we employ the cross validation to figure out the best model for spam filter.

**Remark** The objective functions for logistic regression implemented in `sklearn` are:
<img src="L1.png">
and
<img src="L2.png">

where
- $w$ are the coefficients, which was denoted by $\beta_i$ in the class.
- $c$ is the intercept, which was denoted by $\beta_0$ in the class. We can change the parameter "fit_intercept" to keep or remove it.
- $C$ is the inverse of regularization strength. This is opposite to the $\alpha$ we used in Ridge and Lasso. Smaller values specify stronger regularization.
- Therefore the first objective function is of $L_1$ panelty and the second of $L_2$.

### Problem 1
Use the class <code>GridSearchCV</code> to find out the best combination of parameter for logistic regression. (Set <code>cv=5</code> and <code>scoring='accuracy'</code>). 

In [1]:
from __future__ import print_function
import pandas as pd
import numpy as np

spam_train_df = pd.read_csv('data/spam_train.csv')
x_train = spam_train_df.iloc[:, :57].values
y_train = spam_train_df.iloc[:, -1].values


spam_test_df = pd.read_csv('data/spam_test.csv')
x_test = spam_test_df.iloc[:, :57].values
y_test = spam_test_df.iloc[:, -1].values

In [2]:
para_grid = [{
    'penalty': ['l1', 'l2'],
    'fit_intercept': [False, True],
    'C': np.logspace(-5, 5, 100)
}]

# Your solution
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
logit = linear_model.LogisticRegression()

para_search = GridSearchCV(estimator=logit, param_grid=para_grid, scoring='accuracy', cv=5).fit(x_train, y_train)

    - What's the best combination?
    - What's the best score?
    - Refit the best estimator on the whole data set. How many coefficients were reduced to 0?(Hint: the absolute value of coefficients that are smaller than 1e-4.) 
    - What's the corresponding training error and test error? (Training error is the model performance on spam_train, while test error is the performance on spam_test.)

In [3]:
### your solution
print("Best Parameters: {}".format(para_search.best_params_))
print("Best Score:      {}".format(para_search.best_score_))

logit_best = para_search.best_estimator_
logit_best.fit(x_train,y_train)

lowest_coefficients = logit_best.coef_[1e-4>=logit_best.coef_]
print("# of Lowest Coefficients: {}".format(len(lowest_coefficients)))

print("Training Error: {}".format(1 - logit_best.score(x_train,y_train)))
print("Test Error:     {}".format(1 - logit_best.score(x_test,y_test)))

Best Parameters: {'C': 46.415888336127821, 'fit_intercept': True, 'penalty': 'l1'}
Best Score:      0.928695652173913
# of Lowest Coefficients: 27
Training Error: 0.0621739130434783
Test Error:     0.07083876575402004


### Problem 2

Set *scoring = 'roc_auc'* and search again, what's the best parameters? Fit the best estimator on the spam_train data set. What's the training error and test error?

In [4]:
### your solution
y_train_to_binary = (y_train == "email").astype(int)
y_test_to_binary = (y_test == "email").astype(int)

para_search = GridSearchCV(estimator=logit, param_grid=para_grid, scoring='roc_auc', cv=5).fit(x_train, y_train_to_binary)
print("Best Parameters: {}".format(para_search.best_params_))

roc_auc_best = para_search.best_estimator_
roc_auc_best.fit(x_train,y_train_to_binary)

print("Training Error: {}".format(1 - roc_auc_best.score(x_train,y_train_to_binary)))
print("Test Error:     {}".format(1 - roc_auc_best.score(x_test,y_test_to_binary)))

Best Parameters: {'C': 1.4174741629268048, 'fit_intercept': True, 'penalty': 'l1'}
Training Error: 0.06391304347826088
Test Error:     0.07170795306388522


### Problem 3

In this exercise, we will predict the number of applications received(*Apps*) using the other variables in the College data set.

The features and the target variable are prepared as $x$ and $y$.

In [5]:
import pandas as pd
college = pd.read_csv('data/college.csv')
x = college.iloc[:, 2:]
y = college.iloc[:, 1]
print(college.shape)
college.head()

(777, 18)


Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


- (1) Split this data into a training set and a test set with train_size=0.5.(Hint: Use the function **sklearn.cross_validation.train_test_split** , set *random_state=0* and *tran_size=0.5*.)

- (2) Fit a linear model on the training set and report the training error and test error(mean squared error, you can use the function *sklearn.metrics.mean_squared_error*).

- (3) Fit a ridge regression on the training set, with $\alpha$ chosen by the cross validation. Report the training error and test error.

- (4) Fit a lasso on the training set, with $\alpha$ chosen by the cross validation. Report the training error and test error

- (5) Compare the results obtained, what do you find?

In [6]:
### your solution
import sklearn.model_selection as ms
from sklearn import linear_model
from sklearn import metrics

x_train, x_test, y_train, y_test = ms.train_test_split(x, y, test_size=0.5, random_state=0)

logit = linear_model.LogisticRegression()
logit.fit(x_train,y_train)

print("Logit Train Error: {}".format(1 - logit.score(x_train,y_train)))
print("Logit Test Error:  {}".format(1 - logit.score(x_test,y_test)))

print("Mean Squared Error: {}".format(metrics.mean_squared_error(logit.predict(x_train),y_train)))

# ridge = linear_model.Ridge(alpha = 1)
ridge = linear_model.Ridge(alpha = 1)
ridge.fit(x_train,y_train)

random_divide = ms.KFold(n_splits=5)
ridge_scores = ms.cross_val_score(estimator=ridge, X=x, y=y, cv=random_divide)
print("Ridge STD chosen: {}".format(ridge_scores.std()))

ridge_new_alpha = linear_model.Ridge(alpha = ridge_scores.std())
ridge_new_alpha.fit(x_train,y_train)
print("Ridge Train Error: {}".format(1 - ridge_new_alpha.score(x_train,y_train)))
print("Rdige Test Error:  {}".format(1 - ridge_new_alpha.score(x_test,y_test)))

lasso = linear_model.Lasso(alpha=1) 
lasso.fit(x, y) 
lasso_scores = ms.cross_val_score(estimator=lasso, X=x, y=y, cv=random_divide)
print("Lasso STD chosen: {}".format(lasso_scores.std()))
lasso_new_alpha = linear_model.Lasso(alpha=lasso_scores.std())
lasso_new_alpha.fit(x,y)
print("Lasso Train Error: {}".format(1 - lasso_new_alpha.score(x_train,y_train)))
print("Lasso Test Error:  {}".format(1 - lasso_new_alpha.score(x_test,y_test)))

print("Results: Logit performed perfectly with train data but failed completely \
on test data. Ridge had best results for both train & test data. Lasso came in second place.")

Logit Train Error: 0.005154639175257714
Logit Test Error:  1.0
Mean Squared Error: 843.0360824742268
Ridge STD chosen: 0.02084579049336999
Ridge Train Error: 0.07346621434320233
Rdige Test Error:  0.08463722096531667
Lasso STD chosen: 0.02084673965401931
Lasso Train Error: 0.0788493453638105
Lasso Test Error:  0.06527547195489669
Results: Logit performed perfectly with train data but failed completely on test data. Ridge had best results for both train & test data. Lasso came in second place.


### Problem 4
This time  we will try to predict the variable *Private* using the other variables in the College data set. The features and target variable are prepared for you.

In [7]:
x = college.iloc[:, 1:]
y = college.iloc[:, 0]

- (1) Split this data into a training set and a test set with train_size=0.5(Hint: Use the function **sklearn.cross_validation.train_test_split** , set *random_state=1* and *tran_size=0.5*.)]

- (2) Fit a logistic regression with regularizaton. Use the function **GridSearchCV** to fint out the best parameters.

In [10]:
from sklearn import cross_validation

x_train, x_test, y_train, y_test = cross_validation.train_test_split(x, y, train_size=0.5, random_state=1)
grid_para_logit = [{'penalty': ['l1', 'l2'], 'C': np.logspace(-5, 5, 100)}]
para_search = GridSearchCV(estimator=logit, param_grid=grid_para_logit, scoring='accuracy', cv=5).fit(x_train, y_train)



    - What's the best parameters?
    - Refit the model on the training set with best parameters. What's the training error and test error?
    
- (3) Fit a KNN model. Use the function **GridSearchCV** to fint out the appropriate parameter *n_neighbors*. Refit the model on the training set and report the training error and test error.

- (4) Compare the results of logistic regression and KNN.

In [13]:
### your solution
print("Best Parameters: {}".format(para_search.best_params_))

update_logit = para_search.best_estimator_
update_logit.fit(x_train,y_train)
print("Logit Training Error: {}".format(1 - update_logit.score(x_train,y_train)))
print("Logit Test Error:     {}".format(1 - update_logit.score(x_test,y_test)))

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier()
knn_grid_param = [{'n_neighbors': range(3, 31)}]

para_search = GridSearchCV(estimator=knn, param_grid=knn_grid_param, scoring='accuracy', cv=5).fit(x_train,y_train)
update_knn = para_search.best_estimator_
update_knn.fit(x_train,y_train)

print("KNN Training Error: {}".format(1 - update_knn.score(x_train,y_train)))
print("KNN Test Error:     {}".format(1 - update_knn.score(x_test,y_test)))

print("Results: The results are identical between both models.")

Best Parameters: {'n_neighbors': 24}
Logit Training Error: 0.05670103092783507
Logit Test Error:     0.06683804627249357
KNN Training Error: 0.05670103092783507
KNN Test Error:     0.06683804627249357
Results: The results are identical between both models.
