<h2 align="center">Machine Learning</h2> 
<h3 align="center">Travis Millburn<br>Spring 2020</h3> 

<center>
<img src="../images/logo.png" alt="drawing" style="width: 300px;"/>
</center>

<h3 align="center">Class 3: K Nearest Neighbors Algorithms</h3> 


# Outline 
1. Supervised vs Unsupervised models
2. K-NN 
    * K-NN Classifiers
    * Pseudocode
    * Build K-NN in SkLearn
    * Determine and critically compare effectiveness


# Until now, we've focused on data analysis ... 

# Today we build some models !

# Supervised vs Unsupervised
  
  Within the field of machine learning, there are two main types of models: supervised learning and unsupervised learning.



<center>
<img src="../images/supervised_unsupervised_tldr.png" alt="drawing" style="width: 600px;"/>
</center>


# Supervised Learning
* Built using known values from historical data  
* Known inputs matched with known outputs  
* Generalizing from known examples:
    * The model will be able to create an output when presented with not-seen-before input in the future

Examples:
* Identifying an email as spam or not
* Flagging a financial transaction as fraudulent or not
* Determining whether an elephant or monkey is present in a photo
* Predicting a profitable financial trade
    
Two Common Types:
1. Classification  
    A) Discrete Labels
2. Regression  
    A) Continuous Labels
   

# Unsupervised Learning
* No known outputs, so we are learning about the structure of the data  
* Often used for dimensionality reduction (PCA - Principal Component Analysis)  
* Very useful for exploratory analysis, or understanding our data

# Today, we will focus on K-NN Models
K Nearest Neighbors Algorithm  

* K-NN is one of the simpler machine learning algorithms  
* The model simply stores the training dataset.
    * When we run a prediction using the model, we find the nearest data point (or points) in the training dataset and use the known output  

# Let's back up.  What is classification ?
Discrete Labels:
1. Dog | Cat
2. Spam | Not Spam
3. Profitable Trade | Loser Trade
4. Apple | Orange | Pear | Banana



### Formal Classification Problem 

Mathematically, a classifier is a function or model $f$ that predicts the class label  for a given input example ${\bf x}$, that is

$$\hat{y} = f({\bf x})$$

* The value $\hat{y}$ belongs to a set $\{c_1,c_2,...,c_k\}$, each $c_i$ is a class label.


The quality of a model is limited by the quality and accuracy of its training set, which is an important consideration when evaluating any model.

Several standardized data sets have been tested and evaluated over many years.

Scikit-learn ships a good sample of these standardized data sets, easily accessed through its standard API, as [discussed here](http://scikit-learn.org/stable/datasets/).
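
As a quick illustration of how those bundled data sets are reached, here is a minimal sketch that loads the Iris data set with `load_iris`. The choice of Iris is only an example and is not the data used later in this class.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data set (it ships with scikit-learn, no download needed)
iris = load_iris()

X = iris.data      # feature matrix: one row per flower, one column per measurement
y = iris.target    # integer class label for each row

print(X.shape)              # (150, 4)
print(iris.target_names)    # names of the three classes
```

The same pattern (a `load_*` function returning `data` and `target`) applies to the other small data sets listed on that page.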


# Classification Model using K-NN  
  
A very simple K-NN model will take only one nearest-neighbor into account.

For a new input, we find the nearest input in the training dataset and use its output as the prediction.

<center>
<img src="../images/one-neighbor-knn.png" alt="drawing" style="width: 600px;"/>
 Introduction to Machine Learning with Python; Andreas Müller, Sarah Guido
</center>



Three new inputs (stars)

# It's a Party: More Neighbors

We can build a more-sophisticated model by adding a few more neighbors to our prediction (not too many, though!)


<center>
<img src="../images/three_knn.png" alt="drawing" style="width: 600px;"/>
 Introduction to Machine Learning with Python; Andreas Müller, Sarah Guido
</center>

The "K" in K-NN is the number of neighbors the algorithm consults, which can be any positive integer.

Each neighbor gets one vote. The class with the most votes becomes the prediction.
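
The voting step can be sketched in a few lines of plain Python. The helper name `majority_vote` is made up for this illustration. Ties go to whichever label was seen first, which is an arbitrary choice for this sketch; using an odd K avoids ties in two-class problems.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the label that occurs most often among the neighbors."""
    counts = Counter(neighbor_labels)
    # most_common(1) gives a list with one (label, count) pair
    label, _ = counts.most_common(1)[0]
    return label

# Three neighbors: two vote "cat", one votes "dog"
print(majority_vote(["cat", "dog", "cat"]))   # cat
```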

# We can display the regions of predictions

<center>
<img src="../images/knn_boundaries.JPG" alt="drawing" style="width: 1200px;"/>
</center>

# Another Example

<center>
<img src="../images/knn_example.jpg" alt="drawing" style="width: 1200px;"/>
</center>

What happens to the prediction with one neighbor instead of 5 ?

# K-NN: One Neighbor

<center>
<img src="../images/knn_1.jpg" alt="drawing" style="width: 1200px;"/>
</center>


# K-NN Five Neighbors

<center>
<img src="../images/knn_5.jpg" alt="drawing" style="width: 1200px;"/>
</center>


### Algorithm

The KNN algorithm is very simple to implement, as it has no real training step: the training phase merely stores the training data. For each test point, we calculate its distance to every stored data point and find the $K$ closest ones. We then return the most common class among those $K$ nearest neighbors. Here's the pseudocode for _K_ Nearest Neighbors:

```
kNN:

    Learn:
        Store training set T to X_train: X_train <-- T


    Predict:
        for every point xp in X_predict:
            D <-- empty list
            for every point x in X_train:
                calculate the distance d between x and xp and add it to D
            sort D in increasing order
            take the "k" items in X_train with the smallest distances to xp
            predict for xp the majority class among these k items
```
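
The pseudocode above translates almost line for line into Python. This is a minimal sketch, assuming Euclidean distance and small in-memory lists; the function name `knn_predict` and the toy points are invented for the example, and it is not how scikit-learn implements K-NN internally.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, X_predict, k=3):
    """Predict a label for each point in X_predict by majority vote of its k nearest training points."""
    predictions = []
    for xp in X_predict:
        # Distance from xp to every stored training point, paired with that point's label
        distances = [(math.dist(x, xp), label) for x, label in zip(X_train, y_train)]
        # Sort by distance only and keep the k closest
        nearest = sorted(distances, key=lambda pair: pair[0])[:k]
        # Each of the k neighbors casts one vote
        votes = Counter(label for _, label in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# "Learn": just keep the training data around
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
y_train = ["blue", "blue", "blue", "red", "red", "red"]

# "Predict": one point near each cluster
print(knn_predict(X_train, y_train, [(1.2, 1.1), (6.8, 6.9)], k=3))   # ['blue', 'red']
```

Note that every prediction scans the whole training set, so the cost of predicting grows with the amount of stored data.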

### KNN is a typical example of a lazy learner. It is called "lazy" not because of its apparent simplicity, but because it doesn't learn a discriminative function from the training data but memorizes the training dataset instead.

-Python Machine Learning - Third Edition
    Mirjalili & Raschka

### What does closest mean in this sense?  How do we calculate it?

* Distance Metric
    * Minkowski distance is a generalization of Euclidean/Manhattan distance
* sklearn lets us choose the distance through the `metric` parameter of `KNeighborsClassifier`.
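
To make the "generalization" concrete: for two points $a$ and $b$, the Minkowski distance of order $p$ is

$$d_p(a, b) = \left(\sum_i |a_i - b_i|^p\right)^{1/p}$$

With $p=1$ this is the Manhattan distance, and with $p=2$ it is the Euclidean distance. A small sketch (the helper name `minkowski` is ours, written only for illustration):

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=1))   # 7.0  (Manhattan: 3 + 4)
print(minkowski(a, b, p=2))   # 5.0  (Euclidean: the 3-4-5 triangle)
```

In sklearn, `KNeighborsClassifier` defaults to `metric='minkowski'` with `p=2`, so passing `p=1` switches it to Manhattan distance.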

# K-NN Models seem straightforward.  Let's build one.

In [1]:
# let's import the census and income data we used in prior class

# First: some imports
import postgresql
import pandas
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer

with postgresql.open("pq://new_haven_ds_read:new_haven_ds_secret_99@nhds.cwroivw0q1rc.us-east-1.rds.amazonaws.com/nhds") as db:
    ps = db.prepare('select * from nhds.uci_adults')
    res = ps()
df = pandas.DataFrame(res, columns=ps.column_names)
df.tail()

ClientCannotConnectError: could not establish connection to server
  CODE: 08001
  LOCATION: CLIENT
CONNECTION: [failed]
  failures[0]:
    socket('54.152.120.194', 5432)
    Traceback (most recent call last):
      File "c:\program files\python37\lib\site-packages\postgresql\protocol\client3.py", line 136, in connect
        self.socket = self.socket_factory(timeout = timeout)
      File "c:\program files\python37\lib\site-packages\postgresql\python\socket.py", line 64, in __call__
        s.connect(self.socket_connect)
    TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

    The above exception was the direct cause of the following exception:

    postgresql.exceptions.ConnectionRejectionError: could not connect
      CODE: 08004
      LOCATION: CLIENT
CONNECTOR: [Host] pq://new_haven_ds_read:***@nhds.cwroivw0q1rc.us-east-1.rds.amazonaws.com:5432/nhds
  category: None
DRIVER: postgresql.driver.pq3.Driver

# See that income column at the end ?  That is the discrete variable we are estimating

We are building a model that will estimate if an adult makes over $50k..... or not.

First things first: we need to split our data into training and testing

SkLearn can do this for us!

In [None]:
# We need X and Y
x_df = df.drop(columns=['income'])    # We want everything but the response variable...in this case income
y_df = df[['income']]                 # We ONLY want the response variable

In [None]:
# Split data into training and testing
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df)
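
A note on what the split does: by default `train_test_split` shuffles the rows and holds out 25% of them for testing. The sketch below uses a tiny made-up DataFrame (the column values are invented, not taken from the census data) and passes `random_state` so the split is repeatable and `stratify` so both classes appear in each part.

```python
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real census table
toy_df = pandas.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "hours_per_week": [40, 50, 45, 60, 35, 40, 20, 55],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
})
toy_x = toy_df.drop(columns=["income"])
toy_y = toy_df["income"]

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y,
    test_size=0.25,     # hold out a quarter of the rows (this is also the default)
    random_state=0,     # fixed seed so the same rows are chosen every run
    stratify=toy_y,     # keep the class proportions the same in both parts
)
print(len(toy_x_train), len(toy_x_test))   # 6 2
```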

In [None]:
# Build a K-NN Classifier
clf = KNeighborsClassifier(n_neighbors=1)

In [None]:
# Fit to the data.......what happened !?!? 
clf.fit(x_train, y_train)

In [None]:
# Let's look at that dataFrame again:
x_train.tail()

In [None]:
df.dtypes

In [None]:
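# Let's use a little Encoding magic.....we'll come back to this later.
# Replace every text (object) column with integer codes so K-NN can compute distances.
for column in df.columns:
    if df[column].dtype == object:   # object dtype means the column holds strings
        le = sklearn.preprocessing.LabelEncoder()
        df[column] = le.fit_transform(df[column])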
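
To see what the encoder is doing, here is a minimal sketch on a handful of made-up category values. `LabelEncoder` sorts the distinct values and replaces each one with its position in that sorted list.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Private", "State-gov", "Private", "Self-emp"])

print(codes)          # [0 2 0 1]
print(le.classes_)    # ['Private' 'Self-emp' 'State-gov']

# The mapping can be reversed to recover the original strings
print(le.inverse_transform(codes))
```

One caveat worth keeping in mind: the integer codes imply an ordering and spacing between categories that the original strings do not have, and a distance-based model like K-NN will treat those numbers literally.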

In [None]:
df.dtypes

In [None]:
df.tail()

In [None]:
x_df = df.drop(columns=['income'])
y_df = df['income']
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df)
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(x_train, y_train)

In [None]:
clf.predict(x_test)

In [None]:
print('Accuracy: {:.2f}'.format(clf.score(x_test, y_test)))

# Wow, not bad for our first model !

In [None]:
# Do things get better or worse using more neighbors?

clf_3 = KNeighborsClassifier(n_neighbors=3)
clf_3.fit(x_train, y_train)
clf_3.predict(x_test)
print('Accuracy: {:.2f}'.format(clf_3.score(x_test, y_test)))

# Things have improved!  What if we add even more?

In [None]:
# Do things get better or worse using more neighbors?

clf_10 = KNeighborsClassifier(n_neighbors=10)
clf_10.fit(x_train, y_train)
clf_10.predict(x_test)
print('Accuracy: {:.2f}'.format(clf_10.score(x_test, y_test)))

In [None]:
# Remember how many columns we have:
len(df.columns) - 1  # one is response var

# Should we keep adding to the number of neighbors?  No.

Let's try anyway ...

In [None]:
# Do things get better or worse using more neighbors?

clf_20 = KNeighborsClassifier(n_neighbors=20)
clf_20.fit(x_train, y_train)
clf_20.predict(x_test)
print('Accuracy: {:.2f}'.format(clf_20.score(x_test, y_test)))

# Overfitting is always a concern

### "$K$-Nearest-Neighbor" Algorithm - dealing with overfitting

As the name implies, we take a vote among the $K$ nearest samples, so a single outlier in the training data cannot decide a prediction on its own.

<center>
<img src="../images/kNN_1_3_5_9.png" width=1100/>
</center>
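
One way to see the effect of $K$ on overfitting is to compare accuracy on the training data with accuracy on the held-out test data for several values of $K$. The sketch below uses a synthetic data set from `make_classification` as a stand-in (an assumption made so the example is self-contained), so the numbers will not match the census results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, used here only as a stand-in
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 3, 5, 10, 20):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this k
    scores[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print('k={:>2}  train: {:.2f}  test: {:.2f}'.format(k, *scores[k]))
```

With one neighbor, every training point is its own nearest neighbor, so training accuracy is perfect. A large gap between training and test accuracy is the usual sign of overfitting.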



# For the lab, let's build another one.

# Let's go back to the Titanic Dataset from Class 1
# https://www.kaggle.com/c/titanic/data

# Build a K-NN model to predict survival


In [None]:
train_file = 'C:\\Users\\tmillburn\\OneDrive - University of New Haven\\Class Materials\\Fall 2019\\Week 3\\titanic\\train.csv'
titanic_df = pandas.read_csv(train_file)

test_file = 'C:\\Users\\tmillburn\\OneDrive - University of New Haven\\Class Materials\\Fall 2019\\Week 3\\titanic\\test.csv'
test_df = pandas.read_csv(test_file)

In [None]:
titanic_df.tail()

In [None]:
titanic_df.drop(columns=['Ticket', 'Cabin', 'Name', 'Embarked'], inplace=True)
test_df.drop(columns=['Ticket', 'Cabin', 'Name', 'Embarked'], inplace=True)


In [None]:
#titanic_df = titanic_df.fillna()
titanic_df.tail()

In [None]:
titanic_df.dtypes

In [None]:
test_df.dtypes

In [None]:
le = sklearn.preprocessing.LabelEncoder()
titanic_df['Sex'] = le.fit_transform(titanic_df['Sex'])
# Reuse the encoder fitted on the training file so both files share the same codes
test_df['Sex'] = le.transform(test_df['Sex'])

In [None]:
t_x_df = titanic_df.drop(columns=['Survived'])
t_y_df = titanic_df['Survived']

In [None]:
t_x_df.fillna(0, inplace=True)
#t_y_df.fillna(0, inplace=True)

In [None]:
#t_x_

In [None]:
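t_clf = KNeighborsClassifier(n_neighbors=3)
t_clf.fit(t_x_df, t_y_df)
#t_clf.predict(t_x_test)
# Note: this scores the model on the same rows it was fit on, so it overstates accuracy on unseen data
print('Training accuracy: {:.2f}'.format(t_clf.score(t_x_df, t_y_df)))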