## <center> K-Nearest Neighbors (KNN) 
## <center> and 
## <center> Support Vector Machines (SVM)

## <center> KNN

<center><img src='knn_anim.gif'>

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from sklearn.datasets import make_blobs,make_classification
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
# hypercube is a boolean flag; class_sep=0.25 makes the two classes overlap
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_redundant=0, class_sep=0.25, hypercube=True, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05)

In [None]:
# Training points are colored by class; test points get a third color (label 2)
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(np.hstack((X_train[:,0],X_test[:,0])), np.hstack((X_train[:,1], X_test[:,1])), 
            c=np.hstack((y_train,[2]*len(y_test))), cmap='Paired', s=250)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train,y_train)
y_pred = knn.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print('Actual')
print(y_test)
print('Predicted')
print(y_pred)
accuracy_score(y_test, y_pred)

In [None]:
# Predict over a dense grid of points to draw the classifier's decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = .05
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap='coolwarm', alpha=.8)
plt.scatter(X[:,0],X[:,1],c=y)

In [None]:
# Test accuracy for each k from 1 to 29
ks = range(1, 30)
scores = [accuracy_score(y_test, KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)) for k in ks]
plt.plot(ks, scores)
plt.xlabel('k')
plt.ylabel('Accuracy')
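
With `test_size=0.05` on 100 samples, the test set above holds only 5 points, so each accuracy value can only move in steps of 0.2. One hedged alternative is to score each k with cross-validation instead. The sketch below regenerates the same synthetic data so it runs on its own; the 5-fold setting and the names `X_cv`, `y_cv`, `cv_scores`, and `best_k` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same synthetic data as above, regenerated so this cell is self-contained
X_cv, y_cv = make_classification(n_samples=100, n_features=2, n_classes=2,
                                 n_redundant=0, class_sep=0.25, random_state=4)

ks = list(range(1, 30))
# Mean accuracy over 5 folds for each k (5 folds is an arbitrary choice here)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_cv, y_cv, cv=5).mean()
             for k in ks]

# argmax returns the first (smallest) k that reaches the best mean accuracy
best_k = ks[int(np.argmax(cv_scores))]
print('Best k by 5-fold CV:', best_k)
```

Every sample is used for testing exactly once, so the curve is usually less jumpy than one built from a 5-point test set.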

## <center> SVM

<center><img src="supportvectors.png">

<center><img src="supportvectors2.png">

## <center> Regularization
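<center>How much you want to avoid misclassifying training points. In scikit-learn's <code>SVC</code> this is the <code>C</code> parameter: a larger <code>C</code> penalizes each misclassified training point more heavily.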

Low regularization value
<center><img src='low_regular.png'>

High regularization value
<center><img src='high_regular.png'>
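
To make the effect concrete, here is a minimal sketch that fits a linear `SVC` with three values of `C` on overlapping synthetic blobs. The data set, the chosen `C` values, and the helper `fit_with_c` are assumptions made for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, chosen for illustration only)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=4)

def fit_with_c(c_value):
    """Fit a linear SVM with the given C and summarize the result."""
    model = SVC(kernel='linear', C=c_value).fit(X_demo, y_demo)
    return {
        'C': c_value,
        'n_support_vectors': int(model.n_support_.sum()),
        'train_accuracy': model.score(X_demo, y_demo),
    }

c_results = [fit_with_c(c) for c in (0.01, 1, 100)]
for result in c_results:
    print(result)
```

A small `C` tolerates more points inside the margin, so it typically ends up with more support vectors than a large `C`.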

## <center> Gamma
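<center>How far the influence of a single training point reaches. With a low gamma, far-away points are also considered; with a high gamma, only close points are considered.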

<center><img src="low_gamma.png">

<center><img src="high_gamma.png">
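
A quick way to explore this is to train an RBF `SVC` with several gamma values and compare training and test accuracy. The sketch below is only illustrative: the `make_moons` data, the three gamma values, and the variable names are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data (synthetic, for illustration only)
X_g, y_g = make_moons(n_samples=300, noise=0.3, random_state=4)
X_g_train, X_g_test, y_g_train, y_g_test = train_test_split(
    X_g, y_g, test_size=0.3, random_state=4)

gamma_results = []
for gamma in (0.01, 1, 1000):
    model = SVC(kernel='rbf', gamma=gamma).fit(X_g_train, y_g_train)
    gamma_results.append({
        'gamma': gamma,
        'train_accuracy': model.score(X_g_train, y_g_train),
        'test_accuracy': model.score(X_g_test, y_g_test),
    })

for result in gamma_results:
    print(result)
```

A very high gamma tends to fit each training point individually, which usually shows up as a gap between training and test accuracy.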

## <center> Margin
<center>The amount of separation between the hyperplane and the support vectors.

<center><img src="support_vectors.jpg">
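
For a linear SVM with weight vector w, the distance from the hyperplane to the margin boundary is 1 / ||w||. The sketch below checks this on well-separated synthetic blobs. The data, the large `C` (used to approximate a hard margin), and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters so that a hard margin exists (synthetic data)
X_sep, y_sep = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                          cluster_std=0.8, random_state=4)

# A large C approximates a hard-margin SVM on separable data
linear_svm = SVC(kernel='linear', C=1000).fit(X_sep, y_sep)

w_norm = np.linalg.norm(linear_svm.coef_[0])
margin = 1.0 / w_norm  # distance from the hyperplane to the margin boundary

# Geometric distance of each support vector from the hyperplane
sv_distances = np.abs(
    linear_svm.decision_function(linear_svm.support_vectors_)) / w_norm

print('Margin:', margin)
print('Support vector distances:', sv_distances)
```

On separable data like this, every support vector should sit at (approximately) the margin distance from the hyperplane.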

## <center> Kernels
<center> Feature transformations that allow classes to be more easily separated.

<center><img src="kernel.png">
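
One way to see the idea is to apply a feature transformation by hand. In the sketch below, concentric circles cannot be split by a straight line in two dimensions, but adding the squared radius as a third feature lets a linear SVM separate them. Kernels achieve a similar effect without building the new features explicitly. The `factor=0.5` setting and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Inner and outer circle; factor=0.5 is chosen here to keep the rings well apart
X_c, y_c = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=4)

# Linear SVM on the raw 2-D coordinates
raw_acc = SVC(kernel='linear').fit(X_c, y_c).score(X_c, y_c)

# Hand-made feature map: append the squared distance from the origin
r_squared = (X_c ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X_c, r_squared])

# The same linear SVM on the lifted 3-D features
lifted_acc = SVC(kernel='linear').fit(X_lifted, y_c).score(X_lifted, y_c)

print('Linear SVM, raw features:   ', raw_acc)
print('Linear SVM, lifted features:', lifted_acc)
```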

## <center> Kernel Options

<center><a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html">Sklearn SVM documentation</a>

In [None]:
from sklearn.datasets import make_circles
X,y = make_circles(n_samples=1000, noise=0.05, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05)
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(np.hstack((X_train[:,0],X_test[:,0])), np.hstack((X_train[:,1], X_test[:,1])), 
            c=np.hstack((y_train,[2]*len(y_test))), cmap='Paired', s=250)

In [None]:
from sklearn.svm import SVC
svm = SVC(kernel='linear').fit(X_train, y_train)
print('Linear Kernel')
print('Acc:', accuracy_score(y_test,svm.predict(X_test)))

In [None]:
svm = SVC(kernel='rbf').fit(X_train, y_train)
print('RBF Kernel')
print('Acc:', accuracy_score(y_test,svm.predict(X_test)))

## <center> Classification Methods Guide </center>
<center> When to use each algorithm </center>

<b>Logistic Regression:</b>
- Binary classification

<b>KNN:</b>
- Low-medium number of data points (<10000)
- Not many features

<b>SVM:</b>
- Low-medium number of data points
- High dimensional data (many features)

<b>Random Forest:</b>
- Lots of categorical features
- High number of data points (>10000)
- High dimensional data

## <center> Activity
<center> Using the data in <i>clf_data.csv</i>, build a classification model.

<b>1)</b> Load in the data and perform a train-test split with a test size of 0.15 and set the random state to 4.

<b>2)</b> Looking at the data, which classifier would be best?

<b>3)</b> Fit a KNN, SVM, and Random Forest model to the training data. Record how long it takes to train each model.

In [None]:
import time
start = time.time()
## ... train ...
print('Time:', np.round(time.time()-start,2), 'seconds')

<b>4)</b> Make predictions using each model on the test data. Record the accuracy and F1 score of each. Which model performs best?

In [None]:
from sklearn.metrics import accuracy_score, f1_score


<b>5)</b> Choose one of the models and perform a grid-search to choose the best hyperparameters.

In [None]:
## https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV
params = {}
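
As a hedged illustration of the `GridSearchCV` workflow, the sketch below runs a small search for a KNN classifier on synthetic stand-in data. It does not use `clf_data.csv`, and the grid values, the `f1` scoring choice, and the names `example_params` and `search` are assumptions, not the intended solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with the real features and labels
X_gs, y_gs = make_classification(n_samples=300, n_features=5, random_state=4)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.15, random_state=4)

# Hypothetical grid: keys must match the estimator's constructor arguments
example_params = {
    'n_neighbors': [1, 3, 5, 11, 21],
    'weights': ['uniform', 'distance'],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(KNeighborsClassifier(), example_params, cv=5, scoring='f1')
search.fit(X_gs_train, y_gs_train)

print('Best parameters:', search.best_params_)
print('Best CV F1:', search.best_score_)
print('Test F1:', search.score(X_gs_test, y_gs_test))
```

After fitting, `search` behaves like the best estimator refit on the full training set, so `search.predict` can be used directly.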