# More Supervised Learning Models


**Lesson Goals**

In this lesson we will expand our repertoire of supervised learning models by introducing Naive Bayes and K-Nearest Neighbors, two models that are typically used for classification problems.


**Introduction**

So far, we have covered a few models for supervised learning. In this lesson, we will explore two more classification models. Naive Bayes is a probabilistic model for classification. K-Nearest Neighbors is a model that makes a prediction based on the observations closest to the one being classified. Both models rely on assumptions that we should check before applying them.


# Naive Bayes

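You may recall Bayes' Theorem for conditional probability. By definition, the probability of A given B is the probability of the intersection of A and B divided by the probability of B. Bayes' Theorem rewrites that intersection as the probability of B given A times the probability of A, which lets us compute the probability of A given B from the probability of B given A.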
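In symbols, the statement above reads as follows. The second equality uses the identity P(A ∩ B) = P(B | A) P(A) and is the form usually quoted as Bayes' Theorem:

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B)}
```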

We use this rule to create a general model. Say we would like to predict purchases on our e-commerce site. Using Bayes' Theorem, we can estimate the probability of a customer making a purchase given the version of the website they see and the customer group they are in.

It is important to note that the Naive Bayes algorithm makes a conditional independence assumption. This means that, once the outcome is known, the value of one predictor tells us nothing about the values of the other predictors. This is a simplifying assumption that does not hold in every scenario. Therefore, we should check whether it is reasonable for our data before relying on the model.
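Written out, with y standing for the outcome and x_1, ..., x_n for the predictor values (notation assumed here for illustration, not taken from the lesson), the conditional independence assumption says that the joint likelihood factors into one term per predictor:

```latex
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
```

For the e-commerce example, this would mean that once we know whether a customer purchased, the site version they saw tells us nothing further about which customer group they belong to.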

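Suppose we want the probability of a customer making a purchase given that they are a millennial and are looking at site version 1. We use the model to compute this probability for each possible outcome and compare the results. The denominator is the same for every outcome, so it does not affect the comparison. We can therefore drop it and replace the equality with a proportionality: the probability of a purchase given customer group and site version is proportional to the probability of a purchase, times the probability of the customer group given a purchase, times the probability of the site version given a purchase.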
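As a rough numeric sketch of that comparison, the snippet below multiplies a prior by one likelihood per predictor and picks the outcome with the larger score. Every probability here is invented for illustration, and the dictionary and function names are hypothetical.

```python
# Hypothetical probabilities, invented purely for illustration.
prior = {"purchase": 0.10, "no_purchase": 0.90}

# P(customer group | outcome) and P(site version | outcome).
p_group_given = {
    "purchase": {"millennial": 0.40},
    "no_purchase": {"millennial": 0.30},
}
p_version_given = {
    "purchase": {"v1": 0.60},
    "no_purchase": {"v1": 0.45},
}


def unnormalized_posterior(outcome, group, version):
    """Numerator of Bayes' Theorem under the conditional independence assumption."""
    return (
        prior[outcome]
        * p_group_given[outcome][group]
        * p_version_given[outcome][version]
    )


# Scores are proportional to the posterior, so they can be compared directly.
scores = {o: unnormalized_posterior(o, "millennial", "v1") for o in prior}
prediction = max(scores, key=scores.get)

# Dividing by the sum restores the dropped denominator if actual
# probabilities are needed.
total = sum(scores.values())
posteriors = {o: s / total for o, s in scores.items()}

print(scores)
print(prediction)
print(posteriors)
```

Normalizing is optional: the predicted outcome is the same whether we compare the raw scores or the normalized posteriors.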

# Gaussian Naive Bayes

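Instead of looking up likelihoods in a probability distribution table, we can assume that the likelihood of each predictor comes from a Gaussian (or normal) distribution. This is useful when the predictors are continuous measurements rather than categories.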
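To make the Gaussian assumption concrete, here is a minimal sketch that estimates a mean and variance for one feature within each class and then evaluates the normal density at a new value. The measurements and function names are made up for illustration; Scikit-Learn's GaussianNB performs the estimation internally, so this only conveys the idea.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and (population) variance of one feature in one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def gaussian_likelihood(x, mean, variance):
    """Normal density evaluated at x."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)


# Hypothetical measurements of a single feature for two classes.
class_a_values = [1.4, 1.3, 1.5, 1.4]
class_b_values = [4.7, 4.5, 4.9, 4.0]

new_value = 1.6
likelihood_a = gaussian_likelihood(new_value, *fit_gaussian(class_a_values))
likelihood_b = gaussian_likelihood(new_value, *fit_gaussian(class_b_values))

# The class whose distribution makes the new value more plausible
# receives the larger likelihood.
print(likelihood_a > likelihood_b)
```

With several features, the per-feature likelihoods for a class are multiplied together and then by the class prior, as in the categorical case.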

To examine the code in Scikit-Learn, we will look at the famous Iris dataset. This dataset was first introduced by Ronald Fisher in 1936 and is used in many classification examples. The Iris dataset contains 4 features for Iris flowers (petal length, petal width, sepal length, and sepal width). The measurements in these variables are used to classify the type of Iris flower. We will import the dataset from Scikit-Learn and then fit the GaussianNB model to the data.



In [1]:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
iris.data[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

The iris object stores the features in `data` and the class labels in `target`.

Next, we will initialize the GaussianNB model, fit it, and generate predictions for the same observations.

In [2]:
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

We can then compare the predictions with the observed labels using a confusion matrix.



In [3]:
from sklearn import metrics

# Reuse y_pred from the previous cell. Rows are true classes, columns are predicted classes.
metrics.confusion_matrix(iris.target, y_pred)

array([[50,  0,  0],
       [ 0, 47,  3],
       [ 0,  3, 47]])

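The accuracy of a model is the fraction of observations that are classified correctly. In a confusion matrix, the correctly classified observations appear along the diagonal. So out of 150 observations, only 6 are incorrectly classified. Keep in mind that these predictions were made on the same data used to fit the model, so the figure is likely optimistic for unseen data.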
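The accuracy can be read off the confusion matrix directly. The short sketch below copies the matrix printed above into a plain list (the variable names are illustrative) and divides the diagonal total by the overall total:

```python
# Confusion matrix copied from the output above.
confusion = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 3, 47],
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy = {accuracy:.2f}")
```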



# K-Nearest Neighbors

This algorithm is based on the idea that observations in the same "neighborhood" tend to have the same classification. We decide which observations count as neighbors using a distance metric of our choice. Two common choices are Euclidean distance (the square root of the sum of squared differences between coordinates) and L1 distance (the sum of the absolute values of those differences). To make a prediction, we look at the labels of the k closest observations and assign the most common label (the mode) to the observation that we are trying to predict.
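The procedure just described can be sketched in a few lines of plain Python. The function names and the toy points are assumptions made for illustration, and tie handling is deliberately simplistic; this is not how Scikit-Learn implements the algorithm internally.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k, distance=euclidean_distance):
    """Return the most common label among the k training points closest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))    # 5.0
print(manhattan_distance((0.0, 0.0), (3.0, 4.0)))    # 7.0
print(knn_predict(points, labels, (1.1, 1.0), k=3))  # a
```

Passing `distance=manhattan_distance` switches the metric without changing the voting logic.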

The value of k is ours to choose. We can fit models with several values of k and select the one with the highest accuracy, ideally measured on observations that were held out from the model. This is the most common way to optimize k.
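One way to carry out this search is sketched below. It assumes leave-one-out evaluation (each observation is predicted from all the others) as the way to measure accuracy, which is a choice made here and not something the lesson prescribes. The data and function names are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (point, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Predict each observation from all the others and report the hit rate."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, point, k) == label:
            hits += 1
    return hits / len(data)


# Toy two-class data with some overlap, made up for illustration.
data = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((2.5, 2.5), "a"), ((3.0, 3.0), "a"),
    ((5.0, 5.0), "b"), ((5.5, 4.0), "b"), ((6.0, 5.5), "b"),
    ((4.0, 4.0), "b"), ((3.5, 3.5), "b"),
]

candidate_ks = [1, 3, 5]
accuracy_by_k = {k: leave_one_out_accuracy(data, k) for k in candidate_ks}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)

print(accuracy_by_k)
print(best_k)
```

Odd values of k are used here so that a two-class vote cannot end in a tie.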

**Advantages and Disadvantages of K-Nearest Neighbors**

The main advantage is that there is no separate training step to perform: although we can fit a model and then apply it to new data, we do not have to. We can use the k closest observations with known labels to predict the label of a new observation and make predictions on the fly. However, this means that every time we make a prediction, we have to compute the distance between the new observation and all of the labeled data. This can be computationally intensive and is the main disadvantage of the algorithm.

**K-Nearest Neighbors with Scikit-Learn**

We are able to apply a K-Nearest Neighbors model to our data using the KNeighborsClassifier in Scikit-Learn. In the example below, we will use the abalone data from the UCI dataset repository. This data contains 8 features describing different abalone observations. Using these features, we are able to predict the sex of the abalone (Male, Female, or Infant).

We'll start by loading the data and examining it.

In [4]:
import pandas as pd

abalone_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone_cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole_Weight',
                'Shucked_Weight', 'Viscera_Weight', 'Shell_Weight', 'Rings']
abalone = pd.read_csv(abalone_url, names=abalone_cols)
abalone.head()

|   | Sex | Length | Diameter | Height | Whole_Weight | Shucked_Weight | Viscera_Weight | Shell_Weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


Next, we will load the KNeighborsClassifier from Scikit-Learn and create a model with k=3. 

In [5]:
from sklearn.neighbors import KNeighborsClassifier

# We create a list of all feature columns.
cols = [x for x in abalone.columns.values if x != 'Sex']

neighbor_model = KNeighborsClassifier(n_neighbors=3)
neighbor_model.fit(abalone[cols], abalone['Sex']) 

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

Now that we have created our model, we will create a single observation and predict the sex of this observation.

In [6]:
import numpy as np

obs = np.array([[0.5, 0.3, 0.05, 0.6, 0.2, 0.1, 0.1, 8]])
print(neighbor_model.predict(obs))

['I']
