<center><h1>Unsupervised Learning: Principal Component Analysis & Clustering</h1></center>

In this notebook, we'll add two important techniques to our Data Science toolbox: **_Principal Component Analysis (PCA)_** and **_Clustering_**.  Unlike the Supervised Learning techniques we've used so far, these two are examples of **_Unsupervised Learning_**.  Whereas algorithms like Decision Trees and Naive Bayes Classifiers need labeled data to train on and to check their accuracy against, these techniques work on unlabeled data.  This makes them applicable to many more kinds of data, but it also means we have no ground-truth labels against which to measure how well the algorithm is working.  

<center><h2>Challenge 1: Apply PCA to the Iris Data Set</h2></center> 

To help us explore the concept of PCA, we're going to start by applying PCA to the _Iris Dataset_.  We'll then use it to fit a model and classify the flower types as we have done in previous examples.


Before we begin, watch this [primer from StatQuest](https://www.youtube.com/embed/_UVHneBUBW0) to understand what PCA is, and read this [short (interactive) article](http://setosa.io/ev/principal-component-analysis/) about using PCA on the Iris dataset.  These two examples should help you better understand how PCA works and, more importantly, how it can be useful to you. 

For our first challenge, we'll import the _Iris Dataset_ from `sklearn.datasets` and use PCA on it. By examining the explained variance of Principal Components, we'll see that we can actually drop 1 or 2 columns (reducing our **_dimensionality_**) while only losing a minimal amount of predictive accuracy.  

Follow these steps in the code block below:

1. Call `load_iris()` and store the results in the `iris` variable. 
<br>
<br>
1. Create a `StandardScaler()` object and store it in `scaler`.
<br>
<br>
1. Call `scaler.fit()` on `iris.data`, and then use `scaler.transform()` to create a scaled version of your data.  Store the results in `scaled_X`.
<br>
<br>
1. Store the labels for _iris_ in `labels`.
<br>
<br>
1.  Create a `PCA()` object and store it in `pca`.  Fit it to the scaled data using `pca.fit()`.  Then, call `pca.transform()` on `scaled_X` and store the results in `X_with_pca`.
<br>
<br>
1. Complete the `enumerate` statement to loop through `pca.explained_variance_ratio_` and print out the variance captured by each of the Principal Components.

If you followed these steps correctly, you will have now created 4 _Principal Components_ from your original dataset.  Be sure to use the information printed out by running the cell below to answer the questions that follow it!

In [8]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
scaler = StandardScaler()
# Fit the scaler to iris.data and transform it in a single step;
# fit_transform() is equivalent to calling fit() and then transform()
scaled_X = scaler.fit_transform(iris.data)

# Store the labels contained in iris.target below
labels = iris.target

# Create a PCA() object
pca = PCA()

#Fit the pca object to scaled_X
pca.fit(scaled_X)

# Call pca.transform() on scaled_X and store the results below
X_with_pca = pca.transform(scaled_X)

# Enumerate through pca.explained_variance_ratio_ to see the amount of variance captured by each Principal Component
for ind, var in enumerate(pca.explained_variance_ratio_):
    print("Explained Variance for Principal Component {}: {}".format(ind, var))

Explained Variance for Principal Component 0: 0.7296244541329986
Explained Variance for Principal Component 1: 0.22850761786701776
Explained Variance for Principal Component 2: 0.03668921889282878
Explained Variance for Principal Component 3: 0.005178709107154798


<center><h3>Understanding our Results</h3></center>
<br>
<br>
<center>Challenge: Use your results from above to answer the following questions:</center>
<br>
<br>
<center>1) Complete the following table using your results from above.</center>

| Principal Component | Variance Explained   |
|---------------------|----------------------|
| PC1                 | 0.7296244541329986   |
| PC2                 | 0.22850761786701776  |
| PC3                 | 0.03668921889282878  |
| PC4                 | 0.005178709107154798 |

<center>2) Based on the explained variances in the table above, do you recommend dropping any of the columns to reduce dimensionality? Explain your answer.</center>

<center><b>Answer:</b> We could drop the third and fourth principal components because together they explain only about 4% of the variance. The first two components capture roughly 96%, so removing the others should not change the results drastically.</center>
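One way to back up an answer like this is to look at the running total of explained variance. The sketch below rebuilds the scaled data so it can run on its own; the 0.95 threshold is an assumption chosen for illustration, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled data so this cell runs on its own
scaled_X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(scaled_X)

# Running total of the variance captured by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, total in enumerate(cumulative, start=1):
    print("First {} component(s): {:.4f}".format(k, total))

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.95  # assumed cutoff; choose one that suits your problem
n_keep = int(np.argmax(cumulative >= threshold) + 1)
print("Components needed for {:.0%} of the variance: {}".format(threshold, n_keep))
```

A stricter threshold would keep more components, so the cutoff is a judgment call that depends on how much information you can afford to lose.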

<center><h3>Challenge: Fit a model using Principal Components</h3></center>

Using the data from above, complete the following steps:

1.  Import your PCA data into a dataframe. Name the columns `PC1`, `PC2`, `PC3`, and `PC4`.
1.  Drop `PC3` and `PC4` columns.
1.  Split your scaled data (currently stored in `scaled_X` and `labels`) into training and testing data using `train_test_split()`.
1.  Split your PCA data (currently stored in `X_with_pca` and `labels`) into training and testing sets using `train_test_split()`
1.  Create two `DecisionTreeClassifier` objects.  Store one in `pca_clf` and one in `reg_clf`.
1.  Fit each model on its respective dataset, and make predictions from each.  Compare the accuracy of each. Did the model fitted on the 2-dimensional PCA data perform comparably? How can you tell?  
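Steps 1 through 4 could be sketched as follows. This is one possible approach, not the only one: it passes both feature sets to a single `train_test_split()` call so that the two models see the same rows, and the names `pca_df`, `y_train`, and `y_test` as well as the seed `random_state=42` are choices made here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
scaled_X = StandardScaler().fit_transform(iris.data)
X_with_pca = PCA().fit_transform(scaled_X)

# Step 1: wrap the principal components in a labeled dataframe
pca_df = pd.DataFrame(X_with_pca, columns=["PC1", "PC2", "PC3", "PC4"])

# Step 2: drop the two components that explain the least variance
pca_df = pca_df.drop(columns=["PC3", "PC4"])

# Steps 3-4: passing both feature sets to one call keeps the rows aligned,
# so both models are trained and tested on the same flowers.
# random_state=42 is an arbitrary seed chosen here for repeatability.
(reg_X_train, reg_X_test,
 pca_X_train, pca_X_test,
 y_train, y_test) = train_test_split(scaled_X, pca_df, iris.target, random_state=42)

print(pca_df.head())
print(reg_X_train.shape, pca_X_train.shape)
```

Sharing one split means any difference in accuracy comes from the features rather than from which rows happened to land in the test set.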

**_Stretch Challenge:_** Use `K-Fold Cross Validation` on each to run the models multiple times and get an average performance for each.  Try this with K >= 5.  

In [19]:
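# Slice the array to keep only the first 2 columns (PC1 and PC2)
X_with_pca = X_with_pca[:, 0:2]

# Note: each train_test_split() call shuffles independently, so the two models
# are evaluated on different test rows and the accuracies vary between runs
pca_X_train, pca_X_test, pca_y_train, pca_y_test = train_test_split(X_with_pca, labels)
reg_X_train, reg_X_test, reg_y_train, reg_y_test = train_test_split(scaled_X, labels)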

# Fit both models on the appropriate datasets
reg_clf = DecisionTreeClassifier()
reg_clf = reg_clf.fit(reg_X_train, reg_y_train)

pca_clf = DecisionTreeClassifier()
pca_clf = pca_clf.fit(pca_X_train, pca_y_train)


# Use each fitted model to make predictions on the appropriate test sets
reg_pred = reg_clf.predict(reg_X_test)
pca_pred = pca_clf.predict(pca_X_test)

print("Accuracy for regular model: {}".format(accuracy_score(reg_y_test, reg_pred)))
print("Accuracy for model with PCA: {}".format(accuracy_score(pca_y_test, pca_pred)))

Accuracy for regular model: 0.9210526315789473
Accuracy for model with PCA: 0.8947368421052632
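For the stretch challenge, a single train/test split gives a noisy estimate on a dataset this small, so averaging over several folds is more informative. The sketch below is one way to do it with `cross_val_score`. It differs from the cells above in one respect: the scaler and PCA sit inside a pipeline, so they are refit on each training fold rather than on the full dataset. The seeds and the choice of K = 5 are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# shuffle=True matters here: the iris rows are sorted by species
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Putting the scaler (and PCA) inside a pipeline refits them on each training fold
reg_model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
pca_model = make_pipeline(StandardScaler(), PCA(n_components=2),
                          DecisionTreeClassifier(random_state=0))

reg_scores = cross_val_score(reg_model, iris.data, iris.target, cv=folds)
pca_scores = cross_val_score(pca_model, iris.data, iris.target, cv=folds)

print("Regular model: mean accuracy {:.3f} over {} folds".format(reg_scores.mean(), len(reg_scores)))
print("PCA model:     mean accuracy {:.3f} over {} folds".format(pca_scores.mean(), len(pca_scores)))
```

Because both models are scored on the same folds, their mean accuracies can be compared directly.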


<center><h3>What is PCA?</h3></center>

**_TASK:_** Answer the following questions about PCA based on what you learned from class, the video, and the reading listed above. 
<br>
<br>

<b>Q1: </b>How would you explain how PCA works to someone non-technical?  
<b>Answer: </b>PCA is a dimensionality reduction algorithm: it helps you describe your data with fewer features. It combines the original features into new ones, called principal components, and ranks them by variance (how spread out the data is along each one). The components with the most variance hold most of the information in your data, so the ones with very little variance can be dropped. This can make your model more efficient at little cost in accuracy.
<br>
<br>
<b>Q2: </b>In what way(s) can PCA be useful in Data Science and Machine Learning? Provide at least 2 examples.  
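<b>Answer: </b>PCA can make models more efficient by shrinking the number of input features they have to train on. It can also identify low-variance components that can be dropped with little loss of information, for example to plot high-dimensional data in 2 dimensions.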
<br>
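For readers who want to see the mechanics behind the answers above, here is a minimal sketch of PCA done by hand with NumPy: an eigendecomposition of the covariance matrix of the scaled Iris data. It is meant as an illustration of the idea, and is not necessarily how `sklearn` computes it internally. The names `manual_ratio` and `manual_projection` are introduced here for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled_X = StandardScaler().fit_transform(load_iris().data)

# 1. Covariance matrix of the (already centered) features
cov = np.cov(scaled_X, rowvar=False)

# 2. Eigenvectors are the new axes; eigenvalues are the variance along each axis
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder to put the largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 3. Share of the total variance that falls along each new axis
manual_ratio = eigenvalues / eigenvalues.sum()

# 4. Project the data onto the new axes
manual_projection = scaled_X @ eigenvectors

sklearn_ratio = PCA().fit(scaled_X).explained_variance_ratio_
print("Manual: ", manual_ratio)
print("sklearn:", sklearn_ratio)
print("Match:", np.allclose(manual_ratio, sklearn_ratio))
```

The projected columns may differ from `pca.transform()` by a sign flip, since an eigenvector pointing the opposite way describes the same axis.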

<center><h2>Challenge 2: Apply PCA and Clustering to Wholesale Customer Data</h2></center>

For this challenge, we'll apply PCA to the [Wholesale Customers Dataset](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), which we'll get from the UCI Machine Learning Datasets repository.  This dataset contains the purchase records from clients of a wholesale distributor.  It details the total annual purchases across the categories listed in the data dictionary below:

**Category** | **Description**
:-----:|:-----:
CHANNEL| 1 = Hotel/Restaurant/Cafe, 2 = Retailer (Nominal)
REGION| Geographic region of Portugal for each order (Nominal)
FRESH| Annual spending (m.u.) on fresh products (Continuous)
MILK| Annual spending (m.u.) on milk products (Continuous)
GROCERY| Annual spending (m.u.) on grocery products (Continuous)
FROZEN| Annual spending (m.u.) on frozen products (Continuous)
DETERGENTS\_PAPER| Annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN| Annual spending (m.u.) on delicatessen products (Continuous)

**_TASK:_** Read in `Wholesale customers data.csv` from the `datasets` folder and store it in a dataframe.  Store the `Channel` column in a separate variable, and then drop the `Channel` and `Region` columns from the dataframe. Scale the data and use PCA to engineer new features (Principal Components).  Print out the explained variance for each principal component.  Be sure to make your code portable, since we'll be using this in our next Jupyter Notebook on K-Means Clustering!

In [23]:
df = pd.read_csv('../datasets/Wholesale customers data.csv')
# Store the Channel column separately before dropping it
channel = df['Channel']
# Now drop the Channel and Region columns
df = df.drop(['Channel', 'Region'], axis=1)
df.head()

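|   | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|-------|------|---------|--------|------------------|------------|
| 0 | 12669 | 9656 | 7561    | 214    | 2674             | 1338       |
| 1 | 7057  | 9810 | 9568    | 1762   | 3293             | 1776       |
| 2 | 6353  | 8808 | 7684    | 2405   | 3516             | 7844       |
| 3 | 13265 | 1196 | 4221    | 6404   | 507              | 1788       |
| 4 | 22615 | 5410 | 7198    | 3915   | 1777             | 5185       |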
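The remaining steps of the task (scaling, PCA, and printing the explained variance) could be wrapped in reusable functions so they are easy to carry into the next notebook. The sketch below is one way to do that. `scale_and_pca` and `print_explained_variance` are hypothetical helper names introduced here; they are not part of the original notebook or of `sklearn`.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_pca(features, n_components=None):
    """Standardize a dataframe of numeric features and run PCA on it.

    Hypothetical helper (not part of the original notebook).
    Returns the principal components as a dataframe plus the fitted PCA object.
    """
    scaled = StandardScaler().fit_transform(features)
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled)
    columns = ["PC{}".format(i + 1) for i in range(components.shape[1])]
    return pd.DataFrame(components, columns=columns, index=features.index), pca


def print_explained_variance(pca):
    """Print the share of variance captured by each principal component."""
    for ind, var in enumerate(pca.explained_variance_ratio_, start=1):
        print("Explained Variance for Principal Component {}: {}".format(ind, var))
```

With the dataframe from the cell above, this would be called as `pca_df, pca = scale_and_pca(df)` followed by `print_explained_variance(pca)`. Returning the fitted `pca` object alongside the components keeps the explained variance available for later inspection.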
