# Introduction to Machine Learning
## Lesson 4: Naive Bayes Classifier, KNN, Cross-validation, Regularization
## Introduction

In this lab, we introduce basic methods for building and evaluating machine learning models: the Naive Bayes classifier, the k-nearest neighbors method (KNN), cross-validation, and regularization. We then apply these methods to real data.


### Objectives:
1. Lasso and Ridge
2. Naïve Bayes
3. KNN
4. Cross Validation

---
## Lasso and Ridge
Both models are regularized forms of linear regression:
Lasso uses L1 regularization and Ridge uses L2 regularization.
Each penalty acts as a constraint region in which the coefficients (weights) must lie.
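
For reference, the two penalties can be written as objective functions. The scaling below follows the convention in scikit-learn's documentation (the $\frac{1}{2n}$ factor for Lasso is that library's choice, not something stated in this lab). Here $n$ is the number of samples, $w$ the coefficient vector, and $\alpha \ge 0$ the regularization strength, which other texts often call lambda.

$$\text{Lasso:}\quad \min_w \; \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$$

$$\text{Ridge:}\quad \min_w \; \lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_2^2$$

Each penalized problem corresponds to a constrained one, $\lVert w\rVert_1 \le t$ for Lasso and $\lVert w\rVert_2^2 \le t$ for Ridge, which is the constraint region mentioned above. A larger $\alpha$ corresponds to a smaller region.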

### Issues:
1. When to use Lasso?

2. When to use Ridge?

3. Since it is hard to judge the influence of each parameter, how can we decide which regularization to use, and how do we choose the value of lambda?

[Pandas](https://pandas.pydata.org/) - For data analysis and manipulation

[Numpy](https://numpy.org/) - To deal with matrices

[Scikit-learn](https://scikit-learn.org/stable/) - Scikit-learn is one of the most widely used Python packages for Data Science and Machine Learning. It allows you to perform many operations and provides many algorithms. 

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

## Loading Boston dataset
Preparing data

### About dataset:
 The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
 prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics...', Wiley, 1980.   N.B. Various transformations are used in the table on pages 244-261 of the latter.

 Variables in order:
 CRIM   -  per capita crime rate by town
 ZN    -  proportion of residential land zoned for lots over 25,000 sq.ft.
 INDUS  -  proportion of non-retail business acres per town
 CHAS   -  Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
 NOX   -   nitric oxides concentration (parts per 10 million)
 RM    -   average number of rooms per dwelling
 AGE   -   proportion of owner-occupied units built prior to 1940
 DIS   -   weighted distances to five Boston employment centres
 RAD   -   index of accessibility to radial highways
 TAX   -   full-value property-tax rate per 10000 dollars
 PTRATIO - pupil-teacher ratio by town
 B    -    1000 (Bk - 0.63) ^2 where Bk is the proportion of blacks by town
 LSTAT  -  % lower status of the population
 MEDV   -  Median value of owner-occupied homes in $1000s

In [None]:
import pandas as pd
import numpy as np
# This dataset is biased, but we will use it for educational purposes
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=1/8, random_state=123)

## Fitting both Lasso and Ridge

### Exercise 1:

Fit two models, Lasso and Ridge, with the default alpha.
Then print their coefficients and note the difference.

Use the **fit()** method.
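
As a warm-up on separate data, the sketch below fits both models with their default alpha on a synthetic regression problem where only 3 of 10 features are informative. The dataset, the `_demo` names, and the noise level are assumptions made for illustration; they are not part of the lab's data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 3 informative features out of 10.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=1.0, random_state=0)

lasso_demo = Lasso().fit(X_demo, y_demo)  # default alpha=1.0, L1 penalty
ridge_demo = Ridge().fit(X_demo, y_demo)  # default alpha=1.0, L2 penalty

# L1 can set coefficients exactly to zero; L2 only shrinks them towards zero.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print('Exactly-zero coefficients (lasso):', n_zero_lasso)
print('Exactly-zero coefficients (ridge):', n_zero_ridge)
```

On data like this, Lasso typically zeroes out the uninformative features while Ridge keeps small non-zero values for them. Look for the same pattern in the exercise output.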

In [None]:
from sklearn.linear_model import Lasso, Ridge

# write code here
lasso = None
ridge = None

print("Lasso Coefficient:", *lasso.coef_, sep='\n\t')
print("Ridge Coefficient:", *ridge.coef_, sep='\n\t')
print('Sum of lasso abs values:', np.sum(np.abs(lasso.coef_)))
print('Sum of ridge abs values:', np.sum(np.abs(ridge.coef_)))

### Exercise 2:
In this exercise we study how the regularization parameter alpha affects the mean squared error (MSE) of the Lasso and Ridge models on the validation set, and we pick the alpha that minimizes the error for each model.

Use **Lasso**, **Ridge**, **fit()**, **predict()**, and **mean_squared_error()**.
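
The general pattern (fit on the training split, score on the validation split, keep the alpha with the lowest loss) can be sketched on synthetic data as follows. The helper `validation_losses`, the `_demo` variables, and the alpha list are hypothetical names and values chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data and split (assumptions for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=3,
                                 noise=10.0, random_state=0)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)


def validation_losses(model_class, alphas):
    """Fit one model per alpha on the training split and return validation MSEs."""
    losses = []
    for alpha in alphas:
        model = model_class(alpha=alpha).fit(Xd_train, yd_train)
        losses.append(mean_squared_error(yd_val, model.predict(Xd_val)))
    return losses


demo_alphas = [0.01, 0.1, 1.0, 10.0]
for model_class in (Lasso, Ridge):
    demo_losses = validation_losses(model_class, demo_alphas)
    best_alpha = demo_alphas[int(np.argmin(demo_losses))]
    print(model_class.__name__, 'best alpha on validation split:', best_alpha)
```

The test split is never touched during this search, so it remains available for an unbiased final evaluation.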

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
%matplotlib inline

lasso_alphas = [0.3, 0.5, 1, 1.1, 1.2, 1.3, 1.5, 2, 2.2, 2.5]
ridge_alphas = [50, 200, 300, 350, 400, 500, 600, 700, 1000, 1200]
lasso_losses = []
ridge_losses = []
for i in range(len(lasso_alphas)):
    # Create a Lasso regressor with the alpha value.
    # Fit it to the training set, then get the prediction of the validation set (x_val).
    # calculate the mean squared error loss, then append it to the losses array
    lasso = None

    y_pred = None
    mse = None
    lasso_losses.append(mse)

    ridge = None

    y_pred = None
    mse = None
    ridge_losses.append(mse)

# Create the figure first, then title it (a bare plt.suptitle would open an extra empty figure).
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
fig.suptitle('The effect of changing alpha on MSE for lasso and ridge')

ax1.plot(lasso_alphas, lasso_losses, label='lasso')
ax1.legend()
ax1.set(xlabel='alpha', ylabel='MSE')

ax2.plot(ridge_alphas, ridge_losses, label='ridge')
ax2.legend()
ax2.set(xlabel='alpha', ylabel='MSE')

plt.show()

lasso_best_alpha = lasso_alphas[np.argmin(lasso_losses)]
ridge_best_alpha = ridge_alphas[np.argmin(ridge_losses)]
print("Best value of alpha for lasso:", lasso_best_alpha)
print("Best value of alpha for ridge:", ridge_best_alpha)


Next, we do the following:
1) Take the best alpha values for the Lasso and Ridge models, found above by minimizing the MSE on the validation set.
2) Train Lasso and Ridge models with these alpha values and evaluate their performance on the test set.
3) Investigate the effect of different alpha values on the coefficients of the Lasso and Ridge models, and plot the results.


In [None]:
lasso = Lasso(alpha=lasso_best_alpha)
lasso.fit(x_train, y_train)
y_pred = lasso.predict(x_test)
print("Lasso MSE on test set:", mean_squared_error(y_test, y_pred))

ridge = Ridge(alpha=ridge_best_alpha)
ridge.fit(x_train, y_train)
y_pred = ridge.predict(x_test)
print("Ridge MSE on test set:", mean_squared_error(y_test, y_pred))

In [None]:
# feature_names =

lasso_alphas = [1, 1.1, 1.2, 1.3, 1.5, 2, 2.2, 2.5, 3, 5]
ridge_alphas = [100, 200, 300, 350, 400, 500, 600, 700, 1000, 1200, 2000, 3000]
lasso_coefs_ = np.zeros((len(lasso_alphas), len(X[0])))
ridge_coefs_ = np.zeros((len(ridge_alphas), len(X[0])))
for i in range(len(lasso_alphas)):
    lasso = Lasso(alpha=lasso_alphas[i])
    lasso.fit(x_train, y_train)
    lasso_coefs_[i] = lasso.coef_

for i in range(len(ridge_alphas)):
    ridge = Ridge(alpha=ridge_alphas[i])
    ridge.fit(x_train, y_train)
    ridge_coefs_[i] = ridge.coef_

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
fig.suptitle('The effect of changing alpha on the coefficients of lasso and ridge')

for idx in range(len(X[0])):
    ax1.plot(lasso_alphas, lasso_coefs_[:, idx], label=f'feature {idx}')
ax1.legend()
ax1.set(xlabel='alpha', ylabel='coefs')

for idx in range(len(X[0])):
    ax2.plot(ridge_alphas, ridge_coefs_[:, idx], label=f'feature {idx}')
ax2.legend()
ax2.set(xlabel='alpha', ylabel='coefs')

plt.show()

## Loading the digits dataset
We split the dataset into training and test samples and prepare the data for further use in training and evaluating the machine learning model. 

In [None]:
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# We will show why we didn't split a validation set.

## Naïve Bayes
We will use Gaussian Naïve Bayes, which assumes that, within each class, every continuous feature follows a Gaussian distribution, and uses that assumption to compute the likelihood:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

Here $\mu_y$ and $\sigma_y^2$ are the mean and the variance of feature $i$ for class $y$.

Note: The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i|y)$.
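
To connect the formula to code, the sketch below estimates the per-class means, variances, and priors by hand, sums the log-likelihoods over features, and compares the resulting predictions with `GaussianNB`. It uses the iris dataset purely as a small stand-in (an assumption, not the lab's data), and all variable names are illustrative. scikit-learn also adds a tiny smoothing term to the variances (`var_smoothing`), so agreement is expected to be very close rather than guaranteed identical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
classes = np.unique(y_demo)

# Per-class statistics: mu_y and sigma_y^2 for every feature, plus the prior P(y).
means = np.array([X_demo[y_demo == c].mean(axis=0) for c in classes])
variances = np.array([X_demo[y_demo == c].var(axis=0) for c in classes])
priors = np.array([np.mean(y_demo == c) for c in classes])


def log_gaussian(X, mean, var):
    """Element-wise log of the Gaussian density from the formula above."""
    return -0.5 * np.log(2.0 * np.pi * var) - (X - mean) ** 2 / (2.0 * var)


# Naive assumption: features are independent given the class, so log-likelihoods add up.
joint_log_prob = np.column_stack([
    np.log(priors[k]) + log_gaussian(X_demo, means[k], variances[k]).sum(axis=1)
    for k in range(len(classes))
])
manual_pred = classes[np.argmax(joint_log_prob, axis=1)]

sk_model = GaussianNB().fit(X_demo, y_demo)
agreement = np.mean(manual_pred == sk_model.predict(X_demo))
print('Agreement with GaussianNB:', agreement)
```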

___
What are the pros and cons of the Naive Bayes classifier?
___

Let's train a Naive Bayes model and check the test accuracy.


In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gauss_nb = GaussianNB()
gauss_nb.fit(x_train, y_train)
y_pred = gauss_nb.predict(x_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

## K nearest neighbour classifier
1. What are the pros and cons of KNN?

___
Let's do the same with the KNN classifier.


KNN relies on distances between samples, so rescale the features first.
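
A minimal sketch of what standardization does, on hypothetical toy numbers (the arrays and `_demo` names are made up for illustration): the scaler learns the mean and standard deviation from the training data only and reuses them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 is below 1.
# Without scaling, Euclidean distances would be dominated by column 0.
train_demo = np.array([[1000.0, 0.1], [2000.0, 0.5], [3000.0, 0.9]])
test_demo = np.array([[2500.0, 0.2]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train only
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean/std

print('Train column means after scaling:', train_scaled.mean(axis=0))
print('Train column stds after scaling: ', train_scaled.std(axis=0))
print('Scaled test sample:', test_scaled)
```

Fitting the scaler on the test data as well would leak information about the test set into preprocessing, which is why only `transform` is called on it.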

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

Let's tune the hyperparameter $n\_neighbors$ of the KNN classifier using cross-validation.

___
## Cross Validation
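Cross-validation is an alternative to holding out a single validation set: the training data is split into k folds, and each fold serves once as the validation set while the model is trained on the remaining folds.

Note: this is why we did not create a separate validation set.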

We investigate the effect of the n_neighbors parameter on the classification accuracy of the K-Nearest Neighbors (KNN) model using cross-validation. For each value of K from a given range, the average cross-validation accuracy is calculated, and then a graph of this dependence is plotted. Based on the graph, the optimal value of K that provides the highest accuracy of the model is determined.
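
To make the mechanics concrete, the following sketch runs the fold loop by hand and checks that it reproduces `cross_val_score` when both use the same splitter. The iris data, the 5 folds, and the `_demo` names are assumptions for illustration; the lab's own cell uses `cv=7` on the lab data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
knn_demo = KNeighborsClassifier(n_neighbors=5)
splitter = StratifiedKFold(n_splits=5)  # keeps class proportions similar in every fold

manual_scores = []
for train_idx, val_idx in splitter.split(X_demo, y_demo):
    fold_model = clone(knn_demo)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # For classifiers, score() returns accuracy on the held-out fold.
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(knn_demo, X_demo, y_demo,
                                 cv=splitter, scoring='accuracy')
print('Manual fold accuracies: ', manual_scores)
print('cross_val_score results:', library_scores)
```

The mean of these fold accuracies is the number plotted for each K in the cell below.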

In [None]:
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import numpy as np
%matplotlib inline

Ks = list(range(1, 20))
cv_scores = []
for K in Ks:
    knn = KNeighborsClassifier(n_neighbors=K)
    scores = cross_val_score(knn, x_train, y_train,
                             cv=7, scoring='accuracy')
    avg_score = np.mean(scores)
    cv_scores.append(avg_score)

plt.title('The effect of changing K on accuracy')
plt.plot(Ks, cv_scores)
plt.xlabel('K')
plt.xticks(Ks)
plt.ylabel('CV Average accuracy')
plt.show()
print('Best K:', Ks[np.argmax(cv_scores)])

The KNN classifier has several hyperparameters to tune, and tuning them one
by one is tedious. Let's try a better approach called GridSearchCV.

### GridSearchCV
In grid search cross-validation, you provide a list of candidate values for each hyperparameter, and it tries every combination for you.
At the end, it returns the combination of hyperparameters that achieved the best cross-validation score.
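
To see what "all combinations" means in practice, the sketch below enumerates a small, hypothetical grid with `ParameterGrid` and counts how many model fits a 7-fold search over it would need. The grid values here are illustrative and smaller than the one in the exercise.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical small grid, for illustration only.
demo_grid = {
    'n_neighbors': [1, 3, 5],
    'weights': ['uniform', 'distance'],
}

combinations = list(ParameterGrid(demo_grid))
n_folds = 7
total_fits = len(combinations) * n_folds  # one fit per combination per fold

for params in combinations:
    print(params)
print('Combinations:', len(combinations))
print('Model fits with', n_folds, 'folds:', total_fits)
```

The count is the product of the list lengths (here 3 x 2 = 6), so every extra hyperparameter multiplies the cost.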

### Exercise 3:
Use GridSearchCV to tune 3 hyperparameters:

1. $n\_neighbors$: [1, 2, . . ., 10]
2. $weights$: ['uniform', 'distance']
3. $metric$: ['euclidean', 'manhattan', 'chebyshev', 'cosine']

Check this [link](https://scikit-learn.org/stable/modules/grid_search.html)
for help.

Then measure the accuracy on the test set.

In [None]:
from sklearn.model_selection import GridSearchCV

# Modify the next lines to run GridSearchCV with cv=7
param_grid = {'n_neighbors':list(range(1, 11)),
              'weights':['uniform', 'distance'],
              'metric':['euclidean', 'manhattan', 'chebyshev', 'cosine']
              }

# create a GridSearch cross validation with cv=7,
# and accuracy as scoring, and specify param_grid


grid_search_clf = None
# then train on the train dataset


means = grid_search_clf.cv_results_['mean_test_score']
stds = grid_search_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_search_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
        % (mean, std * 2, params))
print()
print("Best parameters set found on development set:")
print()
print(grid_search_clf.best_params_)

y_pred = grid_search_clf.predict(x_test)
print(accuracy_score(y_test, y_pred))

---
When there are many hyperparameters or their ranges are large, grid search becomes inefficient: the number of combinations grows exponentially with the number of hyperparameters.
What other approaches can we use to solve this problem?
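
One commonly used option is randomized search, which samples a fixed number of combinations instead of trying them all. The sketch below is a minimal example with scikit-learn's `RandomizedSearchCV`; the iris data, the distributions, `n_iter=15`, and `cv=5` are assumptions chosen for illustration, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Lists are sampled uniformly; randint(1, 31) draws integers from 1 to 30.
param_distributions = {
    'n_neighbors': randint(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'chebyshev'],
}

random_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=15,        # evaluate only 15 sampled combinations out of 180 possible
    cv=5,
    scoring='accuracy',
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print('Best parameters:', random_search.best_params_)
print('Best CV accuracy:', random_search.best_score_)
```

The cost is controlled directly by `n_iter`, independent of how many hyperparameters are searched.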

## Further reading

[L1 vs L2](https://www.analyticssteps.com/blogs/l2-and-l1-regularization-machine-learning),
What if we used something other than a norm, such as $\sum_j \ln(w_j^2)$?

[Elastic net](https://scikit-learn.org/stable/modules/linear_model.html#elastic-net)

[Huber regularization](http://www.stephanmandt.com/papers/ECML_2016.pdf),
what if we switch the condition?

[Pruned Cross Validation](https://piotrekga.github.io/Pruned-Cross-Validation/)