# Part1. K-Nearest Neighbors

<p>KNN is a form of <i>instance</i>-based (or <i>memory</i>-based) learning: rather than learning an explicit function $f(X)$ to estimate $E[Y|X]$, it stores the training data and defers all computation to prediction time. It is a nonlinear, nonparametric model. To classify a given instance $\mathbf{x}$, we search the training data $\cal{D}$ for its $k$ nearest neighbors, as defined by some distance metric $d(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$. The prediction $h_{k,\cal{D}}(\mathbf{x})$ is then the majority class among those neighbors:<br><br>

<center>$h_{k,\cal{D}}(\mathbf{x}) = \underset{j}{\operatorname{argmax}}\big({\frac{1}{k} \sum\limits_{i \in \cal{N_k(\mathbf{x})}}} \mathbb{1}\{y^{(i)}=j\}\big)$</center><br><br>

<center>where $\cal{N_{k}(\mathbf{x})} = \{\mathbf{x}^{(i)}\in\cal{D} \text{: } \mathbf{x}^{(i)} \text{ is among the } k \text{ closest points to } \mathbf{x} \text{ in } \cal{D}\}$</center>

<br>
So, we need a distance metric to determine closest points. The most common distance function used in k-NN is the <i>Euclidean Distance</i>.<br><br>

Let $\mathbf{x} = \langle x_1,\ldots,x_p \rangle$ be a $p$-dimensional vector. Then for two instances $i \text{ and } j$:<br><br>
<center>$eud(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \sqrt{(x_1^{(i)}-x_1^{(j)})^2+...+(x_p^{(i)}-x_p^{(j)})^2} = \sqrt{\sum\limits_{t=1}^p (x_t^{(i)}-x_t^{(j)})^2}$
</center>
<br><br>

See the Wikipedia article for implementation details:
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
</p>
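
The prediction rule above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the reference solution for `predictKNN`: the name `knn_predict_sketch` is hypothetical, the argument names mirror the `predictKNN` call used later in this notebook, and breaking ties in favor of the smallest class label is an assumption.

```python
import numpy as np

def knn_predict_sketch(targetX, dataSet, labels, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    targetX = np.asarray(targetX, dtype=float)
    dataSet = np.asarray(dataSet, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (n_targets, n_train).
    diffs = targetX[:, None, :] - dataSet[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training points for every target row.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]

    # votes[i, c] counts the neighbors of target i that carry class classes[c].
    classes = np.unique(labels)
    votes = (labels[nearest][:, :, None] == classes[None, None, :]).sum(axis=1)

    # argmax returns the first maximum, so ties go to the smallest label (assumption).
    return classes[votes.argmax(axis=1)]

# Toy data, made up for illustration: two points of class 0, three of class 1.
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1, 1])
query_X = np.array([[0.0, 0.5], [5.5, 5.5]])
toy_pred = knn_predict_sketch(query_X, train_X, train_y, k=3)
print(toy_pred)
```

The full distance matrix is built at once here, which is simple but uses memory proportional to the number of targets times the number of training points.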

In [None]:
%matplotlib notebook 

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy.spatial.distance 
import math

from IPython.display import Image
from sklearn.preprocessing import scale 
# from sklearn.grid_search import GridSearchCV
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import GridSearchCV, train_test_split

from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.metrics import mean_squared_error

from YourAnswer import crossValidation_Ridge,crossValidation_Lasso
from YourAnswer import predictKNN
from utils import plotData, vis_decision_boundary, vis_coef, vis_mse

## 1. Data preparation

### What does the data look like?

In [None]:
data1 = pd.read_csv("ex2data1.txt", header=None, names=['test1', 'test2', 'accepted'])
data1.head()

In [None]:
ax = plotData(data1)
ax.set_ylim([20, 130])
ax.legend(['Admitted', 'Not admitted'], loc='best')
ax.set_xlabel('Exam 1 score')
ax.set_ylabel('Exam 2 score')

### Data shape

In [None]:
X = data1[['test1', 'test2']].values
y = data1.accepted.values
n, d = X.shape
n, d

## 2. Modeling

In [None]:
k=3
result_knn = predictKNN(X,X,y,k)

In [None]:
ax = plotData(data1)
ax.set_ylim([20, 130])
# Annotate every training point with its predicted label
for pred_label, xy in zip(result_knn, zip(X[:, 0], X[:, 1])):
    ax.annotate('(%s)' % int(pred_label), xy=xy, textcoords='data', size=8)
ax.legend(['Admitted', 'Not admitted'], loc='best')
ax.set_xlabel('Exam 1 score')
ax.set_ylabel('Exam 2 score')

In [None]:
print(f'K-nearest neighbors, k = {k}, training accuracy : {np.mean(result_knn == y)}')

## 3. Decision boundary

In [None]:
plotData(data1)
vis_decision_boundary(X, y, k)
# Adjust the layout before the figure is rendered
plt.tight_layout()
plt.show()

Let's plot how the training accuracy changes as k grows.

In [None]:

ks = list(range(1, 100))
accuracy_from_dif_k = []

for dif_k in ks:
    result_knn = predictKNN(targetX=X, dataSet=X, labels=y, k=dif_k)
    accuracy_from_dif_k.append(np.mean(result_knn == y))

# Plot against the actual k values so the curve starts at k = 1
# (the original array kept an unused zero at index 0)
plt.figure(figsize=(8, 5))
plt.plot(ks, accuracy_from_dif_k, linewidth=2.0)
plt.ylabel('training accuracy', fontsize=15)
plt.xlabel('k', fontsize=15)
plt.show()
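
Training accuracy is an optimistic measure for KNN: with $k=1$ every training point is its own nearest neighbor, so the training accuracy is 1 whenever no two identical points carry different labels. The sketch below makes this concrete on synthetic data. It uses scikit-learn's `KNeighborsClassifier` as a stand-in for `predictKNN`, and the data and the names `X_toy`, `y_toy`, `k_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with a noisy linear boundary (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Compare accuracy on the training split with accuracy on the held-out split.
for k_toy in (1, 5, 25):
    clf = KNeighborsClassifier(n_neighbors=k_toy).fit(X_tr, y_tr)
    print(k_toy, clf.score(X_tr, y_tr), clf.score(X_val, y_val))

# With k = 1 each training point votes for itself.
train_acc_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_tr, y_tr)
```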

# Part2. Regularizers

### Data Preparation
Let's explore what the data look like via info(), which shows the number of rows, the non-null count of each column, and each column's data type.

In [None]:

df = pd.read_csv('./Hitters.csv').dropna().drop('Unnamed: 0', axis=1)
df.info()

In [None]:
df.head()

In [None]:
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
dummies.info()
print(dummies.head())

In [None]:
y = df.Salary

# Drop the column with the dependent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')

# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)

# Dataset splitting (Train/Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, shuffle=False)

X.info()

### 1. Ridge Regression 

Scikit-learn Ridge regression uses linear least squares with L2 regularization.



The __sklearn Ridge()__ function has the standard L2 penalty:
### $$ \lambda ||\theta_1||^2_2 $$

The __sklearn Ridge()__ function calls this hyperparameter 'alpha'; it plays the same role as the 'lambda' we learned.
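
To check that 'alpha' really acts as the $\lambda$ in the L2 penalty, the sketch below compares `Ridge` with the closed-form solution $(X^\top X + \lambda I)^{-1} X^\top y$ computed on centered data (centering keeps the intercept out of the penalty). The synthetic data and the names `X_demo`, `y_demo`, `lam` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
lam = 4.0

# Center X and y so that the unpenalized intercept drops out of the problem.
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Closed-form ridge solution: (Xc^T Xc + lam * I)^{-1} Xc^T yc
theta_closed = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]), Xc.T @ yc)

# scikit-learn's estimate with alpha set to the same value.
theta_sklearn = Ridge(alpha=lam).fit(X_demo, y_demo).coef_

print(theta_closed)
print(theta_sklearn)
```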

In [None]:
from sklearn.preprocessing import StandardScaler
lambdas = 10**np.linspace(5,1,100)*0.5

ridge = Ridge()
coefs = []

scaled_X = StandardScaler().fit_transform(X)

for a in lambdas:
    ridge.set_params(alpha=a)
    ridge.fit(scaled_X, y)
    coefs.append(ridge.coef_)

vis_coef(lambdas, coefs, method='Ridge')

The above plot shows that the Ridge coefficients grow in magnitude as alpha decreases and shrink toward zero as alpha increases.

#### lambda = 4

In [None]:
ridge2 = Ridge(alpha=4)
# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)
ridge2.fit(scaled_X_train, y_train)
pred = ridge2.predict(scaled_X_test)
print('MSE : ', mean_squared_error(y_test, pred))

#### lambda = $10^{10}$ 
This very large penalty shrinks the coefficients almost to zero, so the model predicts roughly the mean training salary for every player. The added bias results in a higher MSE.
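
A minimal sketch of this limiting behavior, on synthetic data rather than the Hitters data (the names `X_big`, `y_big`, `ridge_big` are made up for illustration): with a huge alpha the fitted coefficients are numerically negligible and every prediction collapses to the mean of the training response.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(1)
X_big = rng.normal(size=(50, 3))
y_big = X_big @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

ridge_big = Ridge(alpha=1e10).fit(X_big, y_big)

# Largest coefficient magnitude, and how far predictions are from the mean of y.
max_abs_coef = np.abs(ridge_big.coef_).max()
max_gap_to_mean = np.abs(ridge_big.predict(X_big) - y_big.mean()).max()
print(max_abs_coef, max_gap_to_mean)
```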

In [None]:
ridge2.set_params(alpha=10**10)
ridge2.fit(scaled_X_train, y_train)
pred = ridge2.predict(scaled_X_test)
print('MSE : ',mean_squared_error(y_test, pred))

### Cross Validation for selecting the best lambda(=alpha)
#### Implement crossValidation_Ridge to find the best lambda
You should return these values/objects correctly.
1. MSE_set : list of MSEs in which each element is the cross-validated mean squared error for the corresponding lambda, so it has the same length as lambdas.
2. best_MSE : The lowest cross-validated MSE; the lambda that produced it is the best one
3. best_lambda : The lambda suggested by the CV estimate
4. test_MSE : MSE on the test data after refitting the model on the whole training set with best_lambda
5. ridge : The best model we obtained

Here, we divide the data into 5 groups, i.e. 5-fold CV
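
One way the procedure described above could look is sketched below. This is not the intended solution for `crossValidation_Ridge`: the function name `cross_validation_ridge_sketch`, the use of scikit-learn's `KFold` without shuffling, and the synthetic demo data are all assumptions. It only illustrates the five return values listed above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cross_validation_ridge_sketch(lambdas, num_folds, X_train, y_train, X_test, y_test):
    """Hypothetical helper: pick lambda by k-fold CV, then refit and score on the test set."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    kfold = KFold(n_splits=num_folds)

    MSE_set = []
    for lam in lambdas:
        fold_mse = []
        for train_idx, val_idx in kfold.split(X_train):
            model = Ridge(alpha=lam).fit(X_train[train_idx], y_train[train_idx])
            fold_mse.append(mean_squared_error(y_train[val_idx], model.predict(X_train[val_idx])))
        # Average validation MSE over the folds for this lambda.
        MSE_set.append(float(np.mean(fold_mse)))

    best_idx = int(np.argmin(MSE_set))
    best_MSE, best_lambda = MSE_set[best_idx], lambdas[best_idx]

    # Refit on the whole training set with the selected lambda.
    ridge = Ridge(alpha=best_lambda).fit(X_train, y_train)
    test_MSE = mean_squared_error(y_test, ridge.predict(np.asarray(X_test)))
    return MSE_set, best_MSE, best_lambda, test_MSE, ridge

# Synthetic demo data (illustrative only).
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(120, 4))
y_cv = X_cv @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=120)
lambdas_demo = [0.01, 1.0, 100.0]
MSE_demo, best_MSE_demo, best_lambda_demo, test_MSE_demo, ridge_demo = cross_validation_ridge_sketch(
    lambdas_demo, 5, X_cv[:100], y_cv[:100], X_cv[100:], y_cv[100:])
print(best_lambda_demo, best_MSE_demo, test_MSE_demo)
```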

In [None]:
MSE_set, best_MSE, best_lambda, test_MSE, ridge = crossValidation_Ridge(lambdas, 5, scaled_X_train, y_train, scaled_X_test, y_test)

In [None]:
print('best lambda : ',best_lambda)
print('test MSE : ',test_MSE)

In [None]:
vis_mse(lambdas, MSE_set, best_lambda, best_MSE)

In [None]:
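# Coefficients of the best model returned by cross-validation
# (ridge2 still holds the alpha = 10**10 fit from above)
pd.Series(ridge.coef_, index=X.columns)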

### 2. The Lasso

For the sklearn __Lasso()__ function, the standard L1 penalty is:
### $$ \lambda ||\theta_1||_1 $$
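
Unlike the L2 penalty, the L1 penalty can set coefficients to exactly zero. The sketch below illustrates this on synthetic data in which only two of five features matter. The data, the value `alpha=0.5`, and the names `X_sparse`, `y_sparse`, `lasso_sparse`, `ridge_sparse` are assumptions made for illustration. Note that scikit-learn's `Lasso` scales the squared-error term by $\frac{1}{2n}$, so its alpha is not on the same scale as the alpha of `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: features 1, 2 and 4 have no effect on the response (illustrative only).
rng = np.random.default_rng(0)
X_sparse = rng.normal(size=(100, 5))
true_theta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y_sparse = X_sparse @ true_theta + rng.normal(scale=0.1, size=100)

lasso_sparse = Lasso(alpha=0.5).fit(X_sparse, y_sparse)
ridge_sparse = Ridge(alpha=0.5).fit(X_sparse, y_sparse)

# Lasso zeroes out the irrelevant features; Ridge only shrinks them.
print(lasso_sparse.coef_)
print(ridge_sparse.coef_)
```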

In [None]:
lasso = Lasso(max_iter=10000)
coefs_lasso = []

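lasso_lambdas = lambdas * 2

for a in lasso_lambdas:
    lasso.set_params(alpha=a)
    lasso.fit(scaled_X_train, y_train)
    coefs_lasso.append(lasso.coef_)

# Plot against the same alpha values that were used to fit the models
vis_coef(lasso_lambdas, coefs_lasso, method='Lasso')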

See the scikit-learn Lasso documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

### Cross Validation for selecting best lambda(=alpha)
#### Implement crossValidation_Lasso to find the best lambda
You should return these values/objects correctly.
1. MSE_set : list of MSEs in which each element is the cross-validated mean squared error for the corresponding lambda, so it has the same length as lambdas.
2. best_MSE : The lowest cross-validated MSE; the lambda that produced it is the best one
3. best_lambda : The lambda suggested by the CV estimate
4. test_MSE : MSE on the test data after refitting the model on the whole training set with best_lambda
5. lasso : The best model we obtained

Here, we divide the data into 5 groups, i.e. 5-fold CV

In [None]:
MSE_set_lasso, best_MSE_lasso, best_lambda_lasso, test_MSE_lasso, lasso = crossValidation_Lasso(lambdas, 5, scaled_X_train, y_train, scaled_X_test, y_test)

In [None]:
print('best_lambda : ',best_lambda_lasso)
print('test_MSE : ',test_MSE_lasso)

In [None]:
vis_mse(lambdas,MSE_set_lasso,best_lambda_lasso,best_MSE_lasso)

### Compare the coefficients of Lasso with the ones of Ridge

In [None]:
# Some of the coefficients are now reduced to exactly zero.
pd.Series(lasso.coef_, index=X.columns)