<h2 align="center">Machine Learning</h2> 
<h3 align="center">Travis Millburn<br>Spring 2020</h3> 

<center>
<img src="../images/logo.png" alt="drawing" style="width: 300px;"/>
</center>

<h3 align="center">Class 4: Clustering and Principal Component Analysis</h3> 


# Outline 
      
1. Review
    * Supervised vs Unsupervised models
2. Preprocessing Intro
3. LAB A
    * Wine Dataset, Predict Quality
    * Split data, K-NN 
4. PCA 
5. Clustering Concepts
6. K Means Clustering
    
7. Lab B
    * PCA


# Review from last week: Supervised vs Unsupervised
  
Within the field of machine learning, there are two main types of models: supervised learning and unsupervised learning.



<center>
<img src="../images/supervised_unsupervised_tldr.png" alt="drawing" style="width: 600px;"/>
</center>


## Midterm: 3/9/2020

### Many Concepts from (dense) HW Reading!
* Regression vs Classification
* K-NN, decision boundary
* Linear Models
    - OLS, Ridge, Lasso, LogisticRegression
* Naive Bayes
* Decision Trees
    - Feature Importance
* Ensembles, Random Forest
* Support Vector Machines (SVM)
* Neural Networks

### That's a lot.  We will revisit these topics in detail.

# Review: Unsupervised learning includes all kinds of machine learning where there is no known output when training.


Why do we do this?
1. Visualization
2. Data Compression
3. Improvement: Making the data a better input for a supervised model.

# From the top: It may be difficult to evaluate whether an unsupervised model has done well

It is hard to determine whether the model has done WELL or not, because we do not have any "actuals" (known outputs) to compare its results against.

# Unsupervised models are frequently used in a "tell me more about the data" stage: exploration

# Preprocessing Methods

Chart first (from Müller and Guido's *Introduction to Machine Learning with Python*)

<center><img src="../images/scaling_example.jpg" alt="drawing" style="width: 1000px;"/></center>


### Why Preprocessing?

Simple.  We want to increase the efficacy of our predictions.

Recall the Mastercard bounty example from Class 1.

# We see Four Different Scalers  

1. Standard Scaler
    * Rescales each feature so that its mean is 0 and its variance is 1
2. Robust Scaler
    * Similar to Standard Scaler, but uses the median and quartiles instead of the mean and variance
    * Result: Resistant to outliers, just like robust regressors
3. Min/Max Scaler
    * Shifts and rescales each feature so that all of its values lie between 0 and 1
4. Normalizer
    * Scales each sample (row) so that its feature vector has a Euclidean length of 1
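To make the four definitions concrete, here is a minimal NumPy sketch that applies each transformation by hand. The toy array and the variable names are made up for illustration. The formulas are meant to mirror what the sklearn scalers compute with their default settings, but treat this as a sketch rather than a replacement for the library classes.

```python
import numpy as np

# Toy data (made up): 5 samples, 2 features; the last row holds an outlier.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])

# 1. Standard scaling: per feature, subtract the mean and divide by the standard deviation.
standard = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Robust scaling: per feature, subtract the median and divide by the interquartile range.
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
robust = (X - median) / (q3 - q1)

# 3. Min/max scaling: per feature, map the minimum to 0 and the maximum to 1.
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 4. Normalizing: per sample (row), divide by the row's Euclidean length.
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

print(standard.round(2))
print(robust.round(2))
print(minmax.round(2))
print(normalized.round(2))
```

Note that the first three operate column by column, while the normalizer operates row by row. Compare the first column of `standard` and `robust` to see how differently the outlier affects the two.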

# Sklearn has some great visualizations on this:
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

They have also made a notebook file available at that link.  Let's check it out!

### Recall last week's lab: Building K-NN on our now-familiar census income data.

* We got a prediction accuracy of roughly 75% with one neighbor and 79% with five neighbors.
* Can simply scaling the data improve this?  Or make it worse?  Let's try.

In [41]:
import pandas
import sklearn.preprocessing  # makes sklearn.preprocessing.LabelEncoder available below
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [42]:
# Load last week's pickled census data into a DataFrame
df = pandas.read_pickle('../Week 3/df.pkl')

# Label-encode every string (object-dtype) column as integers
for column in df.columns:
    if df[column].dtype == object:
        le = sklearn.preprocessing.LabelEncoder()
        df[column] = le.fit_transform(df[column])
        
x_df = df.drop(columns=['income', 'fnlwgt'])
y_df = df['income']
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df)
        
# What does that give us?
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income,id
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,0,32557
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,1,32558
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,0,32559
32559,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,0,32560
32560,52,5,287927,11,9,2,4,5,4,0,15024,0,40,39,1,32561
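One caveat about the encoding above: label encoding turns each category into an arbitrary integer, and K-NN then treats the gaps between those integers as real distances. The sklearn documentation also describes `LabelEncoder` as intended for target values rather than input features. A common alternative is one-hot encoding. Here is a minimal sketch using `pandas.get_dummies` on a made-up toy frame (the rows and values are illustrative, not the census data):

```python
import pandas as pd

# Made-up toy frame standing in for the census data.
toy = pd.DataFrame({
    'age': [25, 40, 58],
    'workclass': ['Private', 'State-gov', 'Private'],
})

# One-hot encode the string column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=['workclass'])
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so no category ends up "closer" to another just because of its integer code. The trade-off is a wider feature matrix.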


In [43]:
# 1 Neighbor, no preprocessing

clf = KNeighborsClassifier(n_neighbors=1, p=1)
clf.fit(x_train, y_train)
print('Accuracy: {:.2f}'.format(clf.score(x_test, y_test)))

Accuracy: 0.75


In [44]:
# 5 Neighbors, no preprocessing

clf = KNeighborsClassifier(n_neighbors=5, p=1)
clf.fit(x_train, y_train)
print('Accuracy: {:.2f}'.format(clf.score(x_test, y_test)))

Accuracy: 0.79


### Okay, that gets us back to where we were before.  What does preprocessing do?

In [45]:
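from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: fitting the scaler on all rows before splitting lets test-set statistics
# influence the transform; fitting on the training split only is the stricter approach.
x_df_scaled = scaler.fit_transform(x_df)
x_train, x_test, y_train, y_test = train_test_split(x_df_scaled, y_df)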

In [46]:
clf = KNeighborsClassifier(n_neighbors=1, p=1)
clf.fit(x_train, y_train)
print('Accuracy: {:.2f}'.format(clf.score(x_test, y_test)))

Accuracy: 0.80


In [47]:
clf = KNeighborsClassifier(n_neighbors=5, p=1)
clf.fit(x_train, y_train)
print('Accuracy: {:.2f}'.format(clf.score(x_test, y_test)))

Accuracy: 0.83


### What about RobustScaler?

In [55]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
x_df_scaled = scaler.fit_transform(x_df)
x_train, x_test, y_train, y_test = train_test_split(x_df_scaled, y_df)

In [56]:
clf = KNeighborsClassifier(n_neighbors=5, p=1)
clf.fit(x_train, y_train)
print('Accuracy: {:.2f}'.format(clf.score(x_test, y_test)))

Accuracy: 0.85


### We got several percentage points of improvement by scaling the data prior to fitting the K-NN model.
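
Two caveats on the comparison above: each cell draws a fresh random train/test split, and the scaler is fit on the full dataset before splitting. A stricter setup fixes the split and fits the scaler on the training rows only. Below is a minimal sketch using a sklearn pipeline. Since the census pickle is not available here, it uses synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about the census results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[:, 0] *= 1000  # put one feature on a much larger scale than the rest

# Fixing random_state means both models are scored on the same split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unscaled = KNeighborsClassifier(n_neighbors=5, p=1).fit(X_train, y_train)

# The pipeline fits the scaler on the training rows only,
# then applies the same transform to the test rows when scoring.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5, p=1))
scaled.fit(X_train, y_train)

acc_unscaled = unscaled.score(X_test, y_test)
acc_scaled = scaled.score(X_test, y_test)
print('Unscaled accuracy: {:.2f}'.format(acc_unscaled))
print('Scaled accuracy:   {:.2f}'.format(acc_scaled))
```

Because the split is identical for both models, any difference in accuracy comes from the scaling step rather than from which rows happened to land in the test set.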

# At this point we have introduced scaling. We will later integrate it with some of the supervised models we have seen.

# PCA: Principal Component Analysis

One of the most prevalent unsupervised algorithms is PCA.

It is a dimensionality reduction technique.

The method rotates the data such that the rotated features are uncorrelated.

This can be done for accuracy OR performance.

* Sometimes, running data through PCA and then using the result as input for a supervised model can be good for performance, because the supervised model has fewer features to process.


<center><img src="../images/pca.jpg" alt="drawing" style="width: 1000px;"/></center>

Figure from *Introduction to Machine Learning with Python*, Müller and Guido

# Example: Let's explore the popular wine dataset

In [4]:
import pandas
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
%matplotlib inline


In [5]:
# Load the red wine quality data from the UCI repository (semicolon-delimited CSV)
red_df = pandas.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', delimiter=';')
red_df.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5
1595,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6


### PCA via Eigenvector decomposition

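* Data matrix $\mathbf{X}$ - _rows are features_, columns are samples (note that this is the transpose of the DataFrame layout, where rows are samples)

* Consider the matrix $\boldsymbol\Sigma = \frac{1}{n-1} \mathbf{X} \mathbf{X}^T$, where $n$ is the number of samples and the mean of each feature has first been subtracted from $\mathbf{X}$. This is the covariance estimate: each element is the covariance between a pair of features.

* The eigenvalue decomposition of $\boldsymbol\Sigma$ gives the directions of variation: the eigenvectors are the principal directions, and the eigenvalues are the variances along them.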
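
The steps above can be sketched directly in NumPy. This is a minimal illustration on made-up correlated data, following the slide's convention that rows are features. The variable names are our own, and sklearn's `PCA` arrives at the same directions by a different numerical route (an SVD), so this is not how the library is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated data: 3 features (rows) by 200 samples (columns).
n_samples = 200
z = rng.normal(size=n_samples)
X = np.vstack([z + 0.1 * rng.normal(size=n_samples),
               2 * z + 0.1 * rng.normal(size=n_samples),
               rng.normal(size=n_samples)])

# Remove the mean of each feature, then estimate the covariance matrix.
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / (n_samples - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so we re-sort to put the largest variance first.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the top two principal directions.
X_pca = eigvecs[:, :2].T @ Xc

# The covariance of the projected data is diagonal: the new features are uncorrelated.
cov_pca = np.cov(X_pca)
print(cov_pca.round(6))
```

The off-diagonal entries of the printed matrix are zero up to rounding, and the diagonal entries equal the two largest eigenvalues.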

In [6]:
red_x_train, red_x_test, red_y_train, red_y_test = train_test_split(red_df.drop(columns=['quality']), red_df['quality'])

In [7]:
pca = PCA(n_components=2)
pca.fit(red_x_train)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [9]:
x_pca = pca.transform(red_x_train)
x_pca_test = pca.transform(red_x_test)

In [10]:
print("Original shape: {}".format(str(red_x_train.shape)))
print("Reduced shape: {}".format(str(x_pca.shape)))

Original shape: (1199, 11)
Reduced shape: (1199, 2)
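
The shapes tell us how many components we kept, but not how much of the variance they carry. A fitted `PCA` object exposes this as `explained_variance_ratio_`. PCA is also sensitive to feature scale, which ties back to the preprocessing discussion. Here is a minimal sketch on made-up data (not the wine data) where one feature is on a much larger scale than the others:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up data: three independent features, the first on a much larger scale.
X = rng.normal(size=(300, 3))
X[:, 0] *= 100

# Fraction of total variance captured by each of the two kept components.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print('Unscaled:', raw_ratio.round(3))
print('Scaled:  ', scaled_ratio.round(3))
```

Without scaling, the first component is essentially just the large-scale feature. It would be worth checking `pca.explained_variance_ratio_` on the wine data too, since its features (for example total sulfur dioxide and density) live on very different scales.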


In [11]:
clf = KNeighborsRegressor()
clf.fit(x_pca, red_y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [12]:
red_y_results = pandas.DataFrame(red_y_test)
red_y_results['prediction'] = clf.predict(x_pca_test)
red_y_results['pred_rounded'] = red_y_results['prediction'].round().astype(int)
red_y_results['error'] = red_y_results['pred_rounded'] - red_y_results['quality']

In [58]:
red_y_results['error'].abs().mean()

# How does this compare to just a KNN model on the same data?

0.5925

In [14]:
red_y_results.tail()

Unnamed: 0,quality,prediction,pred_rounded,error
245,6,5.2,5,-1
1580,6,5.8,6,0
756,6,6.0,6,0
1340,6,5.8,6,0
834,5,6.0,6,1


# Now, to some Clustering

K-Means Clustering is likely the most commonly used clustering algorithm.

We are going to try to group data into classes without any labels to train on!

Why would we do this?
* Search engines
* Customer profiling
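
Before calling sklearn, it helps to see what K-Means does under the hood. It alternates between assigning each point to its nearest center and moving each center to the mean of its points. Below is a minimal NumPy sketch of that loop. The function name `kmeans_sketch`, its parameters, and the toy data are our own, and it uses plain random initialization, whereas sklearn's `KMeans` defaults to the smarter `k-means++` initialization.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) with random initial centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return labels, centers

# Made-up data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

labels, centers = kmeans_sketch(X, k=2)
print(np.bincount(labels))
print(centers.round(2))
```

Notice that no labels are used anywhere: the algorithm only ever looks at distances between points and centers.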

In [25]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(red_x_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [26]:
print("Cluster memberships:\n{}".format(kmeans.labels_))

Cluster memberships:
[0 0 2 ... 0 2 0]
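
The label array is only one of the things a fitted model exposes. Here is a minimal sketch, on made-up blob data rather than the wine data, of three other things we can ask for: cluster sizes, cluster centers, and assignments for new points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three blobs of 40 points each in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2)) for loc in (0, 4, 8)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# How many points landed in each cluster?
sizes = np.bincount(kmeans.labels_)
print(sizes)

# One center per cluster, expressed in the original feature space.
print(kmeans.cluster_centers_.round(2))

# New points are assigned to the nearest learned center.
new_points = np.array([[0.1, -0.2], [7.9, 8.1]])
new_labels = kmeans.predict(new_points)
print(new_labels)
```

Keep in mind that the cluster numbers themselves are arbitrary: cluster 0 in one run may be cluster 2 in another, so they should not be read as an ordering or compared directly to class labels.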
