# Scikit-Learn: Classification and Regression Models & Examples

This notebook explores various classification and regression techniques available through the scikit-learn Python library. The techniques covered in this notebook are:
- Nearest neighbor (NN)
- Naive Bayes
- Support vector machines (SVM)
- Decision trees/random forest
- Linear discriminant analysis
- Logistic regression

The dataset we will use is scikit-learn's built-in handwritten digits dataset (```datasets.load_digits()```): 1,797 grayscale images of 8x8 pixels, a much smaller cousin of MNIST.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets


data = datasets.load_digits()
images = data.images
targets = data.target

images_train, images_test, labels_train, labels_test = train_test_split(images, targets, test_size=.2, shuffle=False)

image = np.asarray(images[2]).squeeze()
plt.imshow(image)
plt.show()
print(f"This is a number {targets[2]}\n\n")

# Reshape the data so that each 8x8 image becomes a flat 64-element vector
images_train = np.reshape(images_train, (np.shape(images_train)[0], 64))
images_test = np.reshape(images_test, (np.shape(images_test)[0], 64))

print("Shapes of our train/test data and labels:")
print(f"\timages_train: {np.shape(images_train)}")
print(f"\tlabels_train: {np.shape(labels_train)}")
print(f"\timages_test: {np.shape(images_test)}")
print(f"\tlabels_test: {np.shape(labels_test)}")

## 1. Nearest Neighbor Algorithm

The nearest neighbor algorithm is one of the more basic classification techniques, and in some cases it is used as a building block in other, richer machine learning algorithms. Its simplicity is, in some regard, where its power lies: it's fast to implement and easy to understand. The algorithm looks at the Euclidean distance from a given data point to its nearest neighboring data points to help us understand the information contained in the data point of interest. Data points separated by a small Euclidean distance are more likely to be similar than ones separated by a large distance.
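To make the distance idea concrete, here is a minimal brute-force sketch in NumPy. The five 2-D points and the query point are made up for illustration; they are not part of the digits data.

```python
import numpy as np

# Hypothetical toy data: five 2-D points and one query point
points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [1.0, 0.0]])
query = np.array([0.9, 0.2])

# Euclidean distance from the query to every point
distances = np.linalg.norm(points - query, axis=1)

# Indices of the points sorted from closest to farthest
order = np.argsort(distances)
nearest_index = order[0]

print(f"Nearest point: {points[nearest_index]} at distance {distances[nearest_index]:.3f}")
```

scikit-learn performs the same kind of search, but can use tree structures to avoid computing every distance.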

The nearest neighbor algorithm can come in a few different flavors.

- **unsupervised**: we don't know the label of any of the neighboring data points

- **supervised**: we do know the label of the neighboring data points

    - **classification**: the label takes one of a discrete, finite set of values. In supervised nearest neighbor classification, we typically predict the class of a data point by taking a "vote" among its neighboring data points. Whichever class receives the most votes is the class we assign
    
    - **regression**: the label can take any value in a continuous range. In this scenario, we usually take an average of the values of the nearest neighbors
    
- **radius vs. k-nearest neighbors**: this determines whether we look at a fixed number of neighbors, or at all neighbors within a fixed distance
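The flavors above can be sketched by hand on a tiny example. The 1-D data, the value of k, and the radius below are all hypothetical choices made for illustration only.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: 1-D features, each with a class label and a continuous target
x = np.array([1.0, 1.2, 1.9, 2.7, 3.1, 6.0])
classes = np.array([0, 0, 1, 1, 1, 0])
values = np.array([10.0, 12.0, 19.0, 27.0, 31.0, 60.0])
query = 2.0

distances = np.abs(x - query)

# k-nearest neighbors: a fixed number of neighbors (here k = 3)
k_idx = np.argsort(distances)[:3]

# Radius neighbors: every point within a fixed distance (here radius = 1.05)
r_idx = np.where(distances <= 1.05)[0]

# Classification: majority vote among the neighbors' labels
predicted_class = Counter(classes[k_idx].tolist()).most_common(1)[0][0]

# Regression: average of the neighbors' target values
predicted_value = values[k_idx].mean()

print(f"k-NN indices: {k_idx}, radius indices: {r_idx}")
print(f"Voted class: {predicted_class}, averaged value: {predicted_value:.2f}")
```

Note that the radius search returns four neighbors here while the k-nearest search always returns exactly three.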


Let's pick a sample image and plot its 9 nearest neighbors, so that we can get a feel for the ```neighbors``` interface and for how unsupervised learning lets us explore our data without actually knowing which labels the data corresponds to.

In [None]:
# Display a digit in the training set
seed = 43  # index of the sample image (an arbitrary choice)
random_digit = images_train[seed] # get an image from our training images
random_digit_label = labels_train[seed] # get its corresponding label

# Plot the image and display its label
fig = plt.figure()
plt.imshow(random_digit.reshape(8,8))
plt.title(f"This is a number {random_digit_label}")


Now we can import ```NearestNeighbors```, which is the interface required when we are finding nearest neighbors without their labels (i.e. unsupervised).

The key input parameters for the ```NearestNeighbors``` constructor are as follows:
```NearestNeighbors(n_neighbors, algorithm).fit(data)```

- ```n_neighbors```: number of neighbors we want to find, default: 5. You can also set ```radius``` to search within a fixed distance instead of for a fixed number of neighbors (queried with ```.radius_neighbors```)
- ```algorithm```: which algorithm to use to search for the neighbors. Valid parameter values are ```auto```, ```ball_tree```, ```kd_tree```, ```brute```. Default is ```auto```, and that's what we will use in this notebook. The two tree searches are more efficient in certain scenarios which are beyond the scope of this notebook, so we'll let ```auto``` figure it out for us
- ```.fit(data)```: the data set within which the nearest neighbors will be searched for

After fitting the ```NearestNeighbors``` object to our dataset, we can call the object's ```.kneighbors``` method to specify the point whose neighbors we'd like to find, in our case the point ```random_digit```. This method returns two arrays: one holding the distances to the neighbors, and one holding the indices of the neighbors in the input data that we ```.fit()``` our model to.

In [None]:
from sklearn.neighbors import NearestNeighbors

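# Initialize our neighbors object. We ask for 10 neighbors because the query image is
# itself in the training set, so its closest "neighbor" is typically the image itself
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(images_train)
# Return the distances and indices of the neighbors nearest to our sample image
distances, indices = nbrs.kneighbors(images_train[seed].reshape((1,-1)))

# Plot the 9 nearest neighbors, skipping position 0 (the query image itself)
fig = plt.figure()
for i in range(1,10):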
    img = np.reshape(images_train[indices[0][i], :], (8,8))
    fig.add_subplot(3, 3, i)
    plt.imshow(img)
plt.show()
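For comparison, here is a sketch of the radius-based search. It reloads the digits so that it runs on its own, and the radius of 20.0 is an arbitrary illustrative value, not a tuned one.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors

digits = datasets.load_digits()
X = digits.data  # already flattened to shape (n_samples, 64)

# radius=20.0 is an assumption made for illustration
nbrs_r = NearestNeighbors(radius=20.0).fit(X)
query = X[43].reshape(1, -1)
distances_r, indices_r = nbrs_r.radius_neighbors(query)

# One array per query point; its length depends on how crowded the neighborhood is
print(f"Found {len(indices_r[0])} neighbors within the radius")
```

Unlike ```.kneighbors```, the number of neighbors returned is not fixed in advance, so a radius that is too small can return only the query point itself.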

### KNN classifier

Let's use the supervised classifier, KNeighborsClassifier, to get a feel for how all of this works. 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

nbrs_k = KNeighborsClassifier(n_neighbors=20, weights='uniform', algorithm='auto').fit(images_train, labels_train)
prediction = nbrs_k.predict(images_train[seed].reshape(1,-1))
print(f"Predicted class: {prediction[0]}")
print(f"Actual class: {labels_train[seed]}")

# See if there were any other digits the algorithm thought it could be
print(f"Probability of each class for this example: {nbrs_k.predict_proba(images_train[seed].reshape(1,-1))[0]}\n\n")

# Now let's do this for every image in our test set, to see how accurate the KNN classifier can be.
# Both classifiers are fit on the training data only, so the test images are unseen
nbrs_k = KNeighborsClassifier(n_neighbors=20, weights='uniform', algorithm='auto').fit(images_train, labels_train)

# We'll also run a weighted version to see how much this helps (if at all)
nbrs_k_weighted = KNeighborsClassifier(n_neighbors=20, weights='distance', algorithm='auto').fit(images_train, labels_train)

# Some variables to help keep track of how we're doing
number_correct = 0
number_correct_weighted = 0

# Loop over all test images
for index_test in range(len(labels_test)):
    
    prediction = nbrs_k.predict(images_test[index_test].reshape(1,-1))
    if prediction[0] == labels_test[index_test]:
        number_correct += 1
    
    prediction_weighted = nbrs_k_weighted.predict(images_test[index_test].reshape(1,-1))
    if prediction_weighted[0] == labels_test[index_test]:
        number_correct_weighted += 1
print("Results from running this on all test data...")
print(f"\tPercent correct for all test data: {100*number_correct/len(labels_test)}%")
print(f"\tPercent correct for all test data (weighted): {100*number_correct_weighted/len(labels_test)}%")

Weighting the neighbors by distance can help, but this is not guaranteed to be the case. The closest neighbors may happen to carry the wrong class label, in which case weighting them more heavily is detrimental rather than beneficial. Note that the classifiers must be fit on the training set only: a distance-weighted classifier scored on the same data it was fit to finds every point as its own zero-distance neighbor and reports a meaningless 100%.
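A small hand-worked sketch shows how uniform and distance-weighted voting can disagree. The neighbor distances and labels are hypothetical, and the ```vote``` helper is written for illustration only; it is not a scikit-learn function.

```python
import numpy as np

# Hypothetical neighbors of one query point: their distances and class labels
neighbor_distances = np.array([0.5, 2.0, 2.5, 3.0])
neighbor_labels = np.array([7, 1, 1, 1])

def vote(labels, weights):
    """Return the label with the largest total weight (illustrative helper)."""
    candidates = np.unique(labels)
    totals = [weights[labels == c].sum() for c in candidates]
    return candidates[int(np.argmax(totals))]

# weights='uniform': every neighbor counts equally
uniform_winner = vote(neighbor_labels, np.ones_like(neighbor_distances))

# weights='distance': each neighbor is weighted by the inverse of its distance
distance_winner = vote(neighbor_labels, 1.0 / neighbor_distances)

print(f"Uniform vote: {uniform_winner}, distance-weighted vote: {distance_winner}")
```

Here the single very close neighbor outvotes three farther ones once weights are applied, which is helpful if it carries the right label and harmful if it does not.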

### KNN Regression

If we had a dataset whose labels took on continuous values, we could still use the nearest neighbors algorithm: instead of ```KNeighborsClassifier```, which assigns data to discrete classes, we would use ```KNeighborsRegressor```. The regressor predicts the value at the point of interest by averaging the target values of its K nearest data points. The constructor and the ```.fit()```/```.predict()``` calls look the same for both estimators, even though they target different applications.
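Since the digits labels are discrete, the sketch below uses hypothetical synthetic data instead (noisy samples of a sine curve) to show the regressor in action and to check that its prediction really is the neighbors' average.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: noisy samples of a sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

knr = KNeighborsRegressor(n_neighbors=5, weights='uniform').fit(X, y)

X_new = np.array([[np.pi / 2]])
prediction = knr.predict(X_new)

# With uniform weights, the prediction is the mean target of the 5 nearest neighbors
_, idx = knr.kneighbors(X_new)
manual = y[idx[0]].mean()

print(f"Regressor prediction: {prediction[0]:.3f}, manual average: {manual:.3f}")
```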

### Nearest Centroid Classifier

An alternative flavor of supervised neighbor-based classification is nearest-centroid classification: each class is represented by the centroid (mean) of its training points, and a data point is assigned the label of the nearest centroid. Running this classifier is similar to before:

In [None]:
from sklearn.neighbors import NearestCentroid

nbrs_k_centroid = NearestCentroid().fit(images_train, labels_train)

number_correct_centroid = 0

# Loop over all test images
for index_test in range(len(labels_test)):
    
    prediction_centroid = nbrs_k_centroid.predict(images_test[index_test].reshape(1,-1))
    if prediction_centroid[0] == labels_test[index_test]:
        number_correct_centroid += 1

print(f"Percent correct for all test data (centroid method): {100*number_correct_centroid/len(labels_test)}%")
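The centroid rule is simple enough to reproduce by hand. The sketch below reloads the digits so that it runs on its own, computes one mean vector per class, and compares the result with ```NearestCentroid``` under its default settings; it fits and predicts on the same data purely to compare the two implementations, not to measure accuracy.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestCentroid

digits = datasets.load_digits()
X, y = digits.data, digits.target

# One centroid per class: the mean of that class's vectors
classes = np.unique(y)
centroids = np.stack([X[y == c].mean(axis=0) for c in classes])

# Assign each sample to the class whose centroid is closest (Euclidean distance)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
manual_pred = classes[np.argmin(dists, axis=1)]

# Compare against scikit-learn's implementation with default parameters
sk_pred = NearestCentroid().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sk_pred)

print(f"Fraction of identical predictions: {agreement:.3f}")
```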

## 2. Naive Bayes Classifier

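Below we use the multinomial variant: the pixel values in this dataset are non-negative integers (0 to 16), which suits a multinomial model of the features better than a Gaussian one (which assumes continuous, normally distributed features).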

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB().fit(images_train, labels_train)

number_correct_mnb = 0

for index_test in range(len(labels_test)):
    prediction_mnb = mnb.predict(images_test[index_test,:].reshape(1,-1))[0]
    if prediction_mnb == labels_test[index_test]:
        number_correct_mnb += 1

print("Naive Bayes Classifier (multinomial distribution)...")
print(f"\tPercent correct for all test data: {100*number_correct_mnb/len(labels_test)}%")

This is not quite as accurate on this particular data set as the nearest neighbor classifiers were. Naive Bayes models are simple and hence fast (both to develop and to compute a result), and they work relatively well even when training data is scarce. Because of the core assumption that the features are conditionally independent given the class, the Naive Bayes method also extends well to high-dimensional data with many features: the joint likelihood simply reduces to a product of that many (assumed) independent single-variable likelihoods.
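The factorization behind the independence assumption can be written out explicitly. As a sketch, with $n$ features $x_1, \dots, x_n$ and class label $y$ (notation introduced here for illustration), Bayes' rule combined with conditional independence gives:

$$P(y | x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i | y)$$

The predicted class is the $y$ that maximizes the right-hand side, so only the one-dimensional distributions $P(x_i | y)$ and the class prior $P(y)$ need to be estimated.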

$P(x_i | y)$, the distribution of a single feature given the class, was taken to be multinomial above. When that assumption doesn't hold, there are alternative distributions (such as Gaussian, complement, and Bernoulli) in the scikit-learn ```naive_bayes``` module, all of which rest on the same assumption of conditional independence.
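Swapping the assumed distribution is a one-line change. The sketch below reloads the digits so that it runs on its own and uses the same unshuffled 80/20 split as above; the ```binarize=8.0``` threshold for the Bernoulli model is an arbitrary illustrative choice (the midpoint of the 0 to 16 pixel range), not a tuned value.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, shuffle=False)

# binarize=8.0 is an assumption made for illustration
models = {
    "multinomial": MultinomialNB(),
    "complement": ComplementNB(),
    "gaussian": GaussianNB(),
    "bernoulli": BernoulliNB(binarize=8.0),
}

# Fit each variant on the training set and score it on the held-out test set
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {100 * score:.1f}%")
```

Which variant does best depends on how well its distributional assumption matches the features, so it is worth comparing them on held-out data rather than choosing by name.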

## 3. Logistic Regression


In [None]:
from sklearn.linear_model import LogisticRegression

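# The default solver fits multiclass problems with a multinomial (softmax) loss, so the multi_class argument (deprecated in recent scikit-learn releases) is omitted
lr = LogisticRegression(max_iter=100000).fit(images_train, labels_train)
print(lr.predict(images_test[0,:].reshape(1,-1)))
# Mean accuracy over the whole test set
print(lr.score(images_test, labels_test))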