## KNN

Now that you're familiar with `sklearn`, you're ready to do a KNN regression.  

Sklearn's regressor is called `sklearn.neighbors.KNeighborsRegressor`. Its main parameter is the number of nearest neighbors, `n_neighbors`. There are other parameters, such as the distance metric (the default is the Minkowski distance with `p=2`, which is the Euclidean distance). For a list of all the parameters, see the [Sklearn kNN Regressor Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html).
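
To make these parameters concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy` and `y_toy` are illustrative and not part of the cars dataset). It spells out the default metric explicitly and shows that the prediction is the average target of the K closest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data, made up for illustration: y is exactly 2 * x
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# n_neighbors is K; metric="minkowski" with p=2 is the Euclidean distance.
# These metric settings are the defaults, written out here for clarity.
knn = KNeighborsRegressor(n_neighbors=2, metric="minkowski", p=2)
knn.fit(X_toy, y_toy)

# The two closest training points to x=2.4 are x=2 and x=3,
# so the prediction is the mean of their targets: (4 + 6) / 2
pred = knn.predict(np.array([[2.4]]))
print(pred)  # [5.]
```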

In [1]:
#import pandas and numpy
import pandas as pd
import numpy as np

Let's start by first getting our dataset.

Our goal is to predict the **miles per gallon** (`mpg`) of a car, given a bunch of other factors.

You've already seen this dataset! It was part of your HW2!

In [2]:
url = 'https://drive.google.com/uc?id=1eJR43LOmkbtKercI2iJHl_eJaZo2VfpY'
dfcars = pd.read_csv(url)
dfcars = dfcars.rename(columns={"Unnamed: 0":"name"})

dfcars.head()

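| | name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |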


In [3]:
dfcars.columns

Index(['name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')

Now that we know what our data looks like, we split it into train, validation, and test sets.

In [4]:
# Exercise: train-validation-test split with size 8:1:1
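
One possible way to approach this exercise is to call `train_test_split` twice. The sketch below is only an illustration on a synthetic stand-in frame: `df_demo` and the `*_demo` names are hypothetical, and the values are random rather than taken from `dfcars`. On the real data you would pass `dfcars` instead and name the pieces to match the later cells, which use `traindf`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dfcars (hypothetical values, same kind of columns)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "name": [f"car_{i}" for i in range(100)],
    "mpg": rng.uniform(10, 35, size=100),
    "hp": rng.uniform(50, 300, size=100),
    "wt": rng.uniform(1.5, 5.5, size=100),
})

# Step 1: hold out 20% of the rows
traindf_demo, holdoutdf_demo = train_test_split(
    df_demo, test_size=0.2, random_state=42
)
# Step 2: split the holdout in half, giving an 80/10/10 split overall
valdf_demo, testdf_demo = train_test_split(
    holdoutdf_demo, test_size=0.5, random_state=42
)

print(len(traindf_demo), len(valdf_demo), len(testdf_demo))
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across runs.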


We're trying to predict `mpg`, and we don't really need `name` to predict it.

So, let's make our Y = `mpg`, and X = all other columns except `mpg` and `name`.

In [6]:
y_train = np.array(traindf.mpg)
y_train.shape

(25,)

In [7]:
X_train = np.array(traindf.drop(['mpg','name'], axis=1))
X_train.shape

(25, 10)

## Model Selection with KNN

Exercise: What's the hyperparameter in this case?

In [8]:
from sklearn.neighbors import KNeighborsRegressor

Let's code the model selection pipeline. This should be very similar to the regularized logistic regression and regularized linear regression cases.
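
A minimal sketch of such a pipeline might look like the following. It assumes that K is selected by mean squared error on a held-out validation set, and it runs on synthetic data from `make_regression` rather than on the cars data. The names `X_tr`, `X_val`, `k_values`, and `best_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the cars dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit one model per candidate K and record its validation error
k_values = list(range(1, 16))
val_mse = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_tr, y_tr)
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

# Keep the K with the lowest validation MSE
best_k = k_values[int(np.argmin(val_mse))]
print("best K:", best_k)
```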

Let's apply the same visualization to find the best value of K.
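
As a rough sketch of that visualization, the block below plots training and validation MSE against K. It assumes `matplotlib` is available and again uses synthetic data, so the curve shapes are only indicative of what you would see on the cars data.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the cars dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Record training and validation error for each candidate K
k_values = list(range(1, 16))
train_mse, val_mse = [], []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    train_mse.append(mean_squared_error(y_tr, model.predict(X_tr)))
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

# Plot both curves; a good K sits near the minimum of the validation curve
fig, ax = plt.subplots()
ax.plot(k_values, train_mse, marker="o", label="train MSE")
ax.plot(k_values, val_mse, marker="o", label="validation MSE")
ax.set_xlabel("K (n_neighbors)")
ax.set_ylabel("MSE")
ax.legend()
plt.show()
```

With K=1 each training point is its own nearest neighbor, so the training error is zero. That is why the validation curve, not the training curve, is the one to read K from.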

In the end, let's evaluate the final performance on the test set.
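
One common way to do this, sketched here under the assumption that K has already been chosen, is to refit on the training and validation data combined and score once on the untouched test set. The value `chosen_k = 5` is a hypothetical placeholder, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the cars dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
# X_trval plays the role of train + validation; X_te is the held-out test set
X_trval, X_te, y_trval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

chosen_k = 5  # hypothetical; use the K picked on the validation set

# Refit with the chosen K, then evaluate once on the test set
final_model = KNeighborsRegressor(n_neighbors=chosen_k)
final_model.fit(X_trval, y_trval)
y_te_pred = final_model.predict(X_te)

test_mse = mean_squared_error(y_te, y_te_pred)
print("test MSE: ", test_mse)
print("test RMSE:", np.sqrt(test_mse))
print("test R2:  ", r2_score(y_te, y_te_pred))
```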

In [17]:
# Exercise: what modifications are needed to handle a classification problem with KNN?
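
As a hint for this exercise, the sketch below shows the two changes that are usually needed: swapping in `KNeighborsClassifier`, which predicts by majority vote among the neighbors, and using a classification metric such as accuracy instead of MSE. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data with two well-separated classes (made up for illustration)
X_cls = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_cls = np.array([0, 0, 0, 1, 1, 1])

# Same n_neighbors hyperparameter, but the prediction is a majority vote
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_cls, y_cls)

X_new = np.array([[1.5], [10.5]])
y_new_true = np.array([0, 1])
y_new_pred = knn_clf.predict(X_new)

print(y_new_pred)                              # [0 1]
print(accuracy_score(y_new_true, y_new_pred))  # 1.0
```

The model selection loop itself stays the same; only the estimator and the score being compared across values of K change.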

## SVM

SVMs can be used for both classification and regression. Let's start with a regression model for the same problem we just solved with KNN, using SVR (Support Vector Regression).

In [None]:
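from sklearn import svm
# Fit a support vector regressor with a linear kernel
clf = svm.SVR(kernel='linear')
clf.fit(X_train, y_train)
# X_test (and y_test below) are assumed to be built from the test split
# the same way X_train and y_train were built from traindf
y_pred = clf.predict(X_test)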

In [None]:
y_pred

array([19.17864632, 10.15816705, 15.07846834, 26.77321215, 23.3523077 ,
       19.8458688 , 11.38011547])

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test,y_pred)

print("MSE: ", mse)
print("RMSE: ", rmse)
print("R2: ", r2)


MSE:  8.971973118935926
RMSE:  2.9953252108804356
R2:  0.7756113745123772


Let's try the same thing, but with a polynomial kernel.

In [None]:
from sklearn import svm
# kernel='poly' uses a degree-3 polynomial by default (see the `degree` parameter)
clf = svm.SVR(kernel='poly')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test,y_pred)

print("MSE: ", mse)
print("RMSE: ", rmse)
print("R2: ", r2)


MSE:  19.803214114905785
RMSE:  4.450080236906497
R2:  0.5047225469164343


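Looks like the linear kernel works better on this test set: its RMSE is lower (about 3.00 vs. 4.45) and its R2 is higher (about 0.78 vs. 0.50). Keep in mind that the test set here has only 7 cars, so the comparison is noisy.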

[This tutorial](https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/) does a pretty good job of explaining how SVR works!