<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Unsupervised Learning: Principal Component Analysis & Clustering</h1></center>

In this notebook, we'll add two important techniques to our Data Science toolbox: **_Principal Component Analysis (PCA)_** and **_Clustering_**.  Unlike the Supervised Learning techniques we've used so far, these two are examples of **_Unsupervised Learning_**.  Whereas we need labeled data to check the accuracy of algorithms like Decision Trees and Naive Bayes Classifiers, these techniques work on unlabeled data.  This makes them applicable to many more kinds of data, but it also means we have no ground-truth labels against which to measure how well the algorithm is working.  


<center><h3>Challenge: Apply PCA to the Iris Data Set</h3></center> 

To help us explore the concept of PCA, we're going to start by applying PCA to the _Iris Dataset_.  We'll then use it to fit a model and classify the flower types as we have done in previous examples.


Before we begin, watch this primer from StatQuest to understand what PCA is, and read this [short (interactive) article](http://setosa.io/ev/principal-component-analysis/) about using PCA on the Iris dataset.  These two examples should help you better understand how PCA works and, more importantly, how it can be useful to you.  

In [1]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/_UVHneBUBW0" \
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')
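
To make the idea concrete, here is a minimal sketch of what PCA does under the hood, assuming the standard formulation: standardize the features, compute their covariance matrix, and take its eigenvectors as the principal components. The variable names (`manual_ratio`, `manual_scores`, `sk_ratio`) are made up for this illustration and are not part of the challenges below.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four Iris features to zero mean and unit variance
X = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the standardized features (columns are variables)
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance captured by each principal component
manual_ratio = eigvals / eigvals.sum()

# Projecting the data onto the eigenvectors gives the principal component scores
manual_scores = X @ eigvecs

# Compare the hand-computed ratios with scikit-learn's PCA
sk_ratio = PCA().fit(X).explained_variance_ratio_
print(np.allclose(manual_ratio, sk_ratio))
```

The signs of individual components may differ from scikit-learn's output, since an eigenvector is only defined up to a sign flip; the explained variance ratios are unaffected.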

<center><h3>Challenge: Apply PCA to the Iris Dataset</h3></center>

For our first challenge, we'll import the _Iris Dataset_ from `sklearn.datasets` and use PCA on it. By examining the explained variance of Principal Components, we'll see that we can actually drop 1 or 2 columns (reducing our **_dimensionality_**) while only losing a minimal amount of predictive accuracy.  

Follow these steps in the code block below:

1. Call `load_iris()` and store the results in the `iris` variable. 
<br>
<br>
1. Create a `StandardScaler()` object and store it in `scaler`.
<br>
<br>
1. Call `scaler.fit()` on `iris.data`, and then use `scaler.transform()` to create a scaled version of your data.  Store the results in `scaled_X`.
<br>
<br>
1. Store the labels for _iris_ (`iris.target`) in `labels`.
<br>
<br>
1.  Create a `PCA()` object and store it in `pca`.  Fit it to the scaled data using `pca.fit()`.  Then, call `pca.transform()` on `scaled_X` and store the results in `X_with_pca`.
<br>
<br>
1. Complete the `enumerate` statement to loop through `pca.explained_variance_ratio_` and print out the variance captured by each of the Principal Components.

If you followed these steps correctly, you will now have created 4 _Principal Components_ from your original dataset.  Be sure to use the information printed by the cell below to answer the questions that follow it!

In [16]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
scaler = StandardScaler()
#Fit the scaler to iris.data
scaler.fit(iris.data)

# call scaler.transform() on iris.data and store the result in scaled_X
scaled_X = scaler.transform(iris.data)

# Store the labels contained in iris.target below
labels = iris.target

# Create a PCA() object
pca = PCA()

#Fit the pca object to scaled_X
pca.fit(scaled_X)

# Call pca.transform() on scaled_X and store the results below
X_with_pca = pca.transform(scaled_X)

# Enumerate through pca.explained_variance_ratio_ to see the amount of variance captured by each Principal Component
for ind, var in enumerate(pca.explained_variance_ratio_):
    print("Explained Variance for Principal Component {}: {}".format(ind, var))

Explained Variance for Principal Component 0: 0.7277045209380135
Explained Variance for Principal Component 1: 0.2303052326768062
Explained Variance for Principal Component 2: 0.036838319576273953
Explained Variance for Principal Component 3: 0.0051519268089063085


<center><h3>Understanding our Results</h3></center>
<br>
<br>
<center>**_Challenge: Use your results from above to answer the following questions._**</center>
<br>
<br>
<center>**_1.) Complete the following table using your results from above._**</center>

| Principal Component | Variance Explained  |
|---------------------|---------------------|
|      PC1            |0.7277045209380135                     |
|      PC2            |0.2303052326768062                     |
|      PC3            |0.036838319576273953                     |
|      PC4            |0.0051519268089063085                     |

<center>**_2.) Based on the explained variances in the table above, do you recommend dropping any of the columns to reduce dimensionality? Explain your answer._**</center>

Answer: Drop PC3 and PC4. Their explained variance is very small (about 0.037 and 0.005, roughly 4% combined), so PC1 and PC2 alone retain about 96% of the variance in the data.
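
One way to back up a decision like this is to look at the cumulative explained variance. The sketch below uses the ratios printed in the notebook output above; the 95% cutoff is an assumed, illustrative threshold rather than a fixed rule.

```python
import numpy as np

# Explained variance ratios copied from the notebook output above
ratios = np.array([0.7277045209380135, 0.2303052326768062,
                   0.036838319576273953, 0.0051519268089063085])

# Running total of the variance captured as components are added
cumulative = np.cumsum(ratios)

threshold = 0.95  # assumed cutoff; choose a value that suits your problem

# argmax returns the first index where the condition is True; add 1 to get a count
n_keep = int(np.argmax(cumulative >= threshold)) + 1

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print("PC{}: {:.4f} (cumulative {:.4f})".format(i, r, c))
print("Components needed to reach {:.0%} of the variance: {}".format(threshold, n_keep))
```

With these numbers, the first two components already pass the assumed threshold, which matches the choice of keeping `PC1` and `PC2`.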


<center><h3>Challenge: Fit a model using Principal Components</h3></center>

Using the data from above, complete the following steps:

1.  Import your PCA data into a dataframe. Name the columns `PC1`, `PC2`, `PC3`, and `PC4`.
1.  Drop `PC3` and `PC4` columns.
1.  Split your scaled data (currently stored in `scaled_X` and `labels`) into training and testing data using `train_test_split()`.
1.  Split your PCA data (the two remaining columns, `PC1` and `PC2`, together with `labels`) into training and testing sets using `train_test_split()`.
1.  Create two `DecisionTreeClassifier` objects.  Store one in `clf_for_pca` and one in `clf`.
1.  Fit each model on its respective dataset, and make predictions from each.  Compare the accuracy of each. Did the model fitted on the 2-dimensional PCA data perform comparably to the model fitted on all four scaled features? How can you tell?  

**_Stretch Challenge:_** Use `K-Fold Cross Validation` on each to run the models multiple times and get an average performance for each.  Try this with K >= 5.  
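
As a starting point for the stretch challenge, here is one possible sketch using `KFold` and `cross_val_score` from `sklearn.model_selection`. It rebuilds the data so that it runs on its own; the name `X_pca_2d`, the choice of K = 5, and the fixed random seeds are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
scaled_X = StandardScaler().fit_transform(iris.data)
labels = iris.target

# Keep only the first two principal components
X_pca_2d = PCA(n_components=2).fit_transform(scaled_X)

# Shuffle before splitting because the iris rows are ordered by class
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each call fits and scores a fresh tree on every one of the 5 folds
reg_scores = cross_val_score(DecisionTreeClassifier(random_state=0), scaled_X, labels, cv=cv)
pca_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_pca_2d, labels, cv=cv)

print("All 4 scaled features: {:.3f} +/- {:.3f}".format(reg_scores.mean(), reg_scores.std()))
print("First 2 components:    {:.3f} +/- {:.3f}".format(pca_scores.mean(), pca_scores.std()))
```

Note that this sketch fits the scaler and PCA on the full dataset before cross-validating, mirroring the steps above. A stricter setup would wrap the scaler, PCA, and classifier in a `Pipeline` so they are refit inside each fold.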

In [15]:
# import PCA data into a dataframe. Name the columns PC1, PC2, PC3, and PC4.
df = pd.DataFrame(data=X_with_pca, columns=['PC1', 'PC2', 'PC3', 'PC4'])
# Drop PC3 and PC4 columns.
# Pass axis as a keyword; recent pandas versions no longer accept it positionally
cleaned_df = df.drop(['PC3', 'PC4'], axis=1)
cleaned_df

|     | PC1       | PC2       |
|-----|-----------|-----------|
| 0   | -2.264542 | 0.505704  |
| 1   | -2.086426 | -0.655405 |
| 2   | -2.367950 | -0.318477 |
| 3   | -2.304197 | -0.575368 |
| 4   | -2.388777 | 0.674767  |
| 5   | -2.070537 | 1.518549  |
| 6   | -2.445711 | 0.074563  |
| 7   | -2.233842 | 0.247614  |
| 8   | -2.341958 | -1.095146 |
| 9   | -2.188676 | -0.448629 |


In [None]:
# Split your 2-component PCA data (PC1 and PC2, stored in cleaned_df) and labels into training and testing sets
# Using the same random_state in both calls makes both splits select the same rows
pca_X_train, pca_X_test, pca_y_train, pca_y_test = train_test_split(cleaned_df, labels, random_state=42)
# Split your scaled data (currently stored in scaled_X and labels) into training and testing data
reg_X_train, reg_X_test, reg_y_train, reg_y_test = train_test_split(scaled_X, labels, random_state=42)

clf = DecisionTreeClassifier()
clf_for_pca = DecisionTreeClassifier()

# Fit both models on the appropriate datasets
clf.fit(reg_X_train, reg_y_train)
clf_for_pca.fit(pca_X_train, pca_y_train)

# Use each fitted model to make predictions on the appropriate test sets
reg_pred = clf.predict(reg_X_test)
pca_pred = clf_for_pca.predict(pca_X_test)

print("Accuracy for regular model: {}".format(accuracy_score(reg_y_test, reg_pred)))
print("Accuracy for model with PCA: {}".format(accuracy_score(pca_y_test, pca_pred)))

<center><h3>What is PCA?</h3></center>

**_TASK:_** Answer the following questions about PCA based on what you learned from class, the video, and the reading listed above. 
<br>
<br>

<center>**_How would you explain how PCA works to someone non-technical?_**</center>
<br>
Answer:
<br>
<br>
<center>**_In what way(s) can PCA be useful in Data Science and Machine Learning? Provide at least 2 examples._**</center>
<br>
Answer:
<br>
<center><h3>Challenge: Apply PCA and Clustering to Wholesale Customer Data</h3></center>

In this notebook, we'll examine the [**_Wholesale Customers Dataset_**](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), which we'll get from the UCI Machine Learning Datasets repository.  This dataset contains the purchase records from clients of a wholesale distributor.  It details the total annual purchases across categories seen in the data dictionary below:

**Category** | **Description** 
:-----:|:-----:
CHANNEL| 1 = Hotel/Restaurant/Cafe, 2 = Retailer (Nominal)
REGION| Geographic region of Portugal for each order (Nominal)
FRESH| Annual spending (m.u.) on fresh products (Continuous)
MILK| Annual spending (m.u.) on milk products (Continuous)
GROCERY| Annual spending (m.u.) on grocery products (Continuous)
FROZEN| Annual spending (m.u.) on frozen products (Continuous)
DETERGENTS\_PAPER| Annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN| Annual spending (m.u.) on delicatessen products (Continuous)

**_TASK:_** Read in `wholesale_customers_data.csv` from the `datasets` folder and store it in a dataframe.  Store the `Channel` column in a separate variable, and then drop the `Channel` and `Region` columns from the dataframe. Scale the data and use PCA to engineer new features (Principal Components).  Print out the explained variance for each principal component.  Be sure to make your code portable, because we'll reuse it in our next Jupyter Notebook on K-Means Clustering!

In [None]:
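# Read the data from the datasets folder
df = pd.read_csv('datasets/wholesale_customers_data.csv')
# Store the Channel column separately before dropping it
channel = df['Channel']

# Now drop the Channel and Region columns
df = df.drop(columns=['Channel', 'Region'])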
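
One way to keep the scaling and PCA steps portable is to wrap them in a small helper function. The sketch below is only a suggestion: `scale_and_pca` is a hypothetical name, and the demo dataframe is randomly generated stand-in data whose column names are assumed from the data dictionary above. Replace it with the real dataframe once `Channel` and `Region` have been dropped.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_pca(features):
    """Standardize a numeric DataFrame and return (fitted PCA, DataFrame of components)."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA()
    components = pca.fit_transform(scaled)
    columns = ["PC{}".format(i + 1) for i in range(components.shape[1])]
    return pca, pd.DataFrame(components, columns=columns, index=features.index)


# Hypothetical stand-in for the six spending columns (random values, not real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.lognormal(mean=8, sigma=1, size=(50, 6)),
    columns=["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"],
)

pca, pcs = scale_and_pca(demo)
for i, var in enumerate(pca.explained_variance_ratio_, start=1):
    print("Explained Variance for Principal Component {}: {:.4f}".format(i, var))
```

Returning the fitted `PCA` object alongside the transformed data keeps `explained_variance_ratio_` available for inspection, and the component dataframe can be passed straight to a clustering step later.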