<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Unsupervised Learning: Principal Component Analysis & Clustering</h1></center>

In this notebook, we'll add two important techniques to our Data Science toolbox: **_Principal Component Analysis (PCA)_** and **_Clustering_**.  Unlike the Supervised Learning techniques we've used so far, these two are examples of **_Unsupervised Learning_**.  Whereas we need labeled data to train and check the accuracy of algorithms like Decision Trees and Naive Bayes Classifiers, these techniques work on unlabeled data.  This makes them applicable to many more kinds of data, but it also means that we have no ground-truth labels against which to measure how well the algorithm is working.  


<center><h3>Challenge: Apply PCA to the Iris Data Set</h3></center> 

To help us explore the concept of PCA, we're going to start by applying PCA to the _Iris Dataset_.  We'll then use it to fit a model and classify the flower types as we have done in previous examples.


Before we begin, watch this primer from StatQuest to understand what PCA is, and read this [short (interactive) article](http://setosa.io/ev/principal-component-analysis/) about using PCA on the Iris dataset.  These two examples should help you better understand how PCA works and, more importantly, how it can be useful to you.  

In [1]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/_UVHneBUBW0" \
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')
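
To make the mechanics concrete, here is a minimal from-scratch sketch of PCA on the standardized Iris features, using an eigendecomposition of the covariance matrix. The names `manual_ratio`, `manual_scores`, and `sklearn_ratio` are illustrative assumptions, not part of the exercise. This is one common way to compute PCA and not necessarily the method scikit-learn uses internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris features (zero mean, unit variance).
X = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the features (columns are the variables).
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue's share of the total is that component's explained variance ratio.
manual_ratio = eigvals / eigvals.sum()

# Projecting the data onto the eigenvectors gives the principal component scores.
manual_scores = X @ eigvecs

# Compare against scikit-learn's PCA on the same standardized data.
sklearn_ratio = PCA().fit(X).explained_variance_ratio_
print(manual_ratio)
print(sklearn_ratio)
```

The two printed arrays should agree to within floating-point error. The scores themselves may differ from scikit-learn's by a sign flip per component, because the direction of an eigenvector is arbitrary.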



<center><h3>Challenge: Apply PCA to the Iris Dataset</h3></center>

For our first challenge, we'll import the _Iris Dataset_ from `sklearn.datasets` and use PCA on it. By examining the explained variance of each Principal Component, we'll see that we can actually drop 1 or 2 of the components (reducing our **_dimensionality_**) while losing only a small amount of the information in the data.  

Follow these steps in the code block below:

1. Call `load_iris()` and store the results in the `iris` variable. 
<br>
<br>
1. Create a `StandardScaler()` object and store it in `scaler`.
<br>
<br>
1. Call `scaler.fit()` on `iris.data`, and then use `scaler.transform()` to create a scaled version of your data.  Store the results in `scaled_X`.
<br>
<br>
1. Store the labels for _iris_ (found in `iris.target`) in `labels`.
<br>
<br>
1.  Create a `PCA()` object and store it in `pca`.  Fit it to the scaled data using `pca.fit()`.  Then, call `pca.transform()` on `scaled_X` and store the results in `X_with_pca`.
<br>
<br>
1. Complete the `enumerate` statement to loop through `pca.explained_variance_ratio_` and print out the variance captured by each of the Principal Components.

If you followed these steps correctly, you will have created 4 _Principal Components_ from your original dataset.  Be sure to use the information printed out by running the cell below to answer the questions that follow it!

In [10]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
scaler = StandardScaler()
#Fit the scaler to iris.data
scaler.fit(iris.data)

# call scaler.transform() on iris.data and store the result in scaled_X
scaled_X = scaler.transform(iris.data)

# Store the labels contained in iris.target below
labels = iris.target

# Create a PCA() object
pca = PCA()

#Fit the pca object to scaled_X
pca.fit(scaled_X)

# Call pca.transform() on scaled_X and store the results below
X_with_pca = pca.transform(scaled_X)

# Enumerate through pca.explained_variance_ratio_ to see the amount of variance captured by each Principal Component
for ind, var in enumerate(pca.explained_variance_ratio_):
    print("Explained Variance for Principal Component {}: {}".format(ind, var))

Explained Variance for Principal Component 0: 0.7296244541329986
Explained Variance for Principal Component 1: 0.22850761786701776
Explained Variance for Principal Component 2: 0.03668921889282878
Explained Variance for Principal Component 3: 0.005178709107154798
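
As a side note, the separate `fit()` and `transform()` calls above can be combined with `fit_transform()`, and `PCA` accepts an `n_components` argument to keep only the leading components. The sketch below uses hypothetical variable names (`scaled_demo`, `all_pcs`, `first_two`) so it does not overwrite anything from the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# fit_transform() fits the scaler and returns the scaled data in one call.
scaled_demo = StandardScaler().fit_transform(load_iris().data)

# Keep every component (the default) versus only the first two.
all_pcs = PCA().fit_transform(scaled_demo)
first_two = PCA(n_components=2).fit_transform(scaled_demo)

print(all_pcs.shape)    # (150, 4)
print(first_two.shape)  # (150, 2)

# The two retained components should match the first two columns of the full result.
print(np.allclose(all_pcs[:, :2], first_two))
```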


<center><h3>Understanding our Results</h3></center>
<br>
<br>
<center>**_Challenge: Use your results from above to answer the following questions._**</center>
<br>
<br>
<center>**_1.) Complete the following table using your results from above._**</center>

| Principal Component | Variance Explained  |
|---------------------|---------------------|
|      PC1               |  0.7296244541329986                   |
|      PC2            |  0.22850761786701776                   |
|      PC3               |  0.03668921889282878                   |
|      PC4            |  0.005178709107154798                   |

<center>**_2.) Based on the explained variances in the table above, do you recommend dropping any of the columns to reduce dimensionality? Explain your answer._**</center>

Answer: Yes. PC3 and PC4 together account for only about 4.2% of the variance (0.0367 + 0.0052), so dropping them loses very little information.
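
One way to back up this kind of decision is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen cutoff. The sketch below assumes a 95% cutoff; the `threshold` value and the names `cumulative` and `n_keep` are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_iris().data)
ratios = PCA().fit(scaled).explained_variance_ratio_

# Running total of variance explained by PC1, PC1+PC2, and so on.
cumulative = np.cumsum(ratios)

# Assumed cutoff: keep enough components to explain at least 95% of the variance.
threshold = 0.95

# argmax returns the first index where the condition is True; add 1 to get a count.
n_keep = int(np.argmax(cumulative >= threshold)) + 1

for i, total in enumerate(cumulative, start=1):
    print("PC1 through PC{}: {:.4f}".format(i, total))
print("Components needed for {:.0%}: {}".format(threshold, n_keep))
```

With the explained variances in the table above, PC1 and PC2 together cover roughly 95.8% of the variance, so this cutoff keeps two components.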


<center><h3>Challenge: Fit a model using Principal Components</h3></center>

Using the data from above, complete the following steps:

1.  Import your PCA data into a dataframe. Name the columns `PC1`, `PC2`, `PC3`, and `PC4`.
1.  Drop `PC3` and `PC4` columns.
1.  Split your scaled data (currently stored in `scaled_X` and `labels`) into training and testing data using `train_test_split()`.
1.  Split your PCA data (currently stored in `X_with_pca` and `labels`) into training and testing sets using `train_test_split()`
1.  Create two `DecisionTreeClassifier` objects.  Store one in `clf` (for the scaled data) and one in `clf_for_pca` (for the PCA data).
1.  Fit each model on its respective dataset, and make predictions from each.  Compare the accuracy of the two. Did the model fitted on the 2-dimensional PCA data perform comparably? How can you tell?  

**_Stretch Challenge:_** Use `K-Fold Cross Validation` on each to run the models multiple times and get an average performance for each.  Try this with K >= 5.  

In [40]:
df = pd.DataFrame(data=X_with_pca,columns=['PC1','PC2','PC3','PC4'])
df = df.drop(labels=['PC3', 'PC4'],axis=1)
print(df)

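# Use the same random_state for both splits so both models are evaluated on the same rows
pca_X_train, pca_X_test, pca_y_train, pca_y_test = train_test_split(df, labels, random_state=42)
reg_X_train, reg_X_test, reg_y_train, reg_y_test = train_test_split(scaled_X, labels, random_state=42)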

clf = DecisionTreeClassifier()
clf_for_pca = DecisionTreeClassifier()

# Fit both models on the appropriate datasets
clf.fit(reg_X_train, reg_y_train)
clf_for_pca.fit(pca_X_train, pca_y_train)

# Use each fitted model to make predictions on the appropriate test sets
reg_pred = clf.predict(reg_X_test)
pca_pred = clf_for_pca.predict(pca_X_test)

print("Accuracy for regular model: {}".format(accuracy_score(reg_y_test, reg_pred)))
print("Accuracy for model with PCA: {}".format(accuracy_score(pca_y_test, pca_pred)))

          PC1       PC2
0   -2.264703  0.480027
1   -2.080961 -0.674134
2   -2.364229 -0.341908
3   -2.299384 -0.597395
4   -2.389842  0.646835
5   -2.075631  1.489178
6   -2.444029  0.047644
7   -2.232847  0.223148
8   -2.334640 -1.115328
9   -2.184328 -0.469014
10  -2.166310  1.043691
11  -2.326131  0.133078
12  -2.218451 -0.728676
13  -2.633101 -0.961507
14  -2.198741  1.860057
15  -2.262215  2.686284
16  -2.207588  1.483609
17  -2.190350  0.488838
18  -1.898572  1.405019
19  -2.343369  1.127849
20  -1.914323  0.408856
21  -2.207013  0.924121
22  -2.774345  0.458344
23  -1.818670  0.085559
24  -2.227163  0.137254
25  -1.951846 -0.625619
26  -2.051151  0.242164
27  -2.168577  0.527150
28  -2.139563  0.313218
29  -2.265261 -0.337732
..        ...       ...
120  2.037716  0.910467
121  0.977981 -0.571764
122  2.897651  0.413641
123  1.333232 -0.481811
124  1.700734  1.013922
125  1.954327  1.007778
126  1.175104 -0.316394
127  1.020951  0.064346
128  1.788350 -0.187361
129  1.863648  0

In [45]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=8)
reg_scores = []
# scaled data
for train_index, test_index in kf.split(scaled_X):
    kX_train, kX_test = scaled_X[train_index], scaled_X[test_index]
    ky_train, ky_test = labels[train_index], labels[test_index]
    kclf = DecisionTreeClassifier()
    kclf.fit(kX_train, ky_train)
    kreg_pred = kclf.predict(kX_test)
    kscore = accuracy_score(ky_test, kreg_pred)
    reg_scores.append(kscore)
    print("Accuracy for regular model: {}".format(kscore))
avg_score = sum(reg_scores) / len(reg_scores)
print(f"Average accuracy for regular model: {avg_score}")
print("\n")

# pca data
pca_scores = []
pca_array = df.to_numpy()
for train_index, test_index in kf.split(pca_array):
    kX_train, kX_test = pca_array[train_index], pca_array[test_index]
    ky_train, ky_test = labels[train_index], labels[test_index]
    kclf = DecisionTreeClassifier()
    kclf.fit(kX_train, ky_train)
    kreg_pred = kclf.predict(kX_test)
    kscore = accuracy_score(ky_test, kreg_pred)
    pca_scores.append(kscore)
    print("Accuracy for PCA model: {}".format(kscore))
avg_score = sum(pca_scores) / len(pca_scores)
print(f"Average accuracy for PCA model: {avg_score}")

Accuracy for regular model: 1.0
Accuracy for regular model: 1.0
Accuracy for regular model: 1.0
Accuracy for regular model: 0.8947368421052632
Accuracy for regular model: 0.8947368421052632
Accuracy for regular model: 0.9473684210526315
Accuracy for regular model: 0.8888888888888888
Accuracy for regular model: 0.8333333333333334
Average accuracy for regular model: 0.9323830409356725


Accuracy for PCA model: 1.0
Accuracy for PCA model: 1.0
Accuracy for PCA model: 0.9473684210526315
Accuracy for PCA model: 0.7368421052631579
Accuracy for PCA model: 0.6842105263157895
Accuracy for PCA model: 0.8421052631578947
Accuracy for PCA model: 0.9444444444444444
Accuracy for PCA model: 0.8888888888888888
Average accuracy for PCA model: 0.8804824561403508
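
One caveat about the loops above: `load_iris()` returns its rows ordered by species, and `KFold` does not shuffle by default, so each contiguous test fold contains only one or two species. A hedged alternative is sketched below using `StratifiedKFold` with shuffling and `cross_val_score`, with the scaler and PCA wrapped in a pipeline so they are re-fitted on each training fold. The names `reg_model`, `pca_model`, and the `random_state` values are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_raw, y = load_iris(return_X_y=True)

# Shuffled, stratified folds keep the class mix similar in every fold.
cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)

# Pipelines re-fit the scaler (and PCA) on each training fold only.
reg_model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=2), DecisionTreeClassifier(random_state=0)
)

reg_cv_scores = cross_val_score(reg_model, X_raw, y, cv=cv)
pca_cv_scores = cross_val_score(pca_model, X_raw, y, cv=cv)

print("Average accuracy for regular model: {:.4f}".format(reg_cv_scores.mean()))
print("Average accuracy for PCA model: {:.4f}".format(pca_cv_scores.mean()))
```

Because both models are scored on identical folds, the difference between the two averages is easier to attribute to the dimensionality reduction than to the luck of the split.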


<center><h3>What is PCA?</h3></center>

**_TASK:_** Answer the following questions about PCA based on what you learned from class, the video, and the reading listed above. 
<br>
<br>

<center>**_How would you explain how PCA works to someone non-technical?_**</center>
<br>
Answer:
<br>
<br>
<center>**_In what way(s) can PCA be useful in Data Science and Machine Learning? Provide at least 2 examples._**</center>
<br>
Answer:
<br>
<center><h3>Challenge: Apply PCA and Clustering to Wholesale Customer Data</h3></center>

In this part of the notebook, we'll examine the [**_Wholesale Customers Dataset_**](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), which comes from the UCI Machine Learning Repository.  This dataset contains the purchase records from clients of a wholesale distributor.  It details the total annual purchases across the categories listed in the data dictionary below:

**Category** | **Description** 
:-----:|:-----:
CHANNEL| 1 = Hotel/Restaurant/Cafe, 2 = Retailer (Nominal)
REGION| Geographic region of Portugal for each order (Nominal)
FRESH| Annual spending (m.u.) on fresh products (Continuous)
MILK| Annual spending (m.u.) on milk products (Continuous)
GROCERY| Annual spending (m.u.) on grocery products (Continuous)
FROZEN| Annual spending (m.u.) on frozen products (Continuous)
DETERGENTS\_PAPER| Annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN| Annual spending (m.u.) on delicatessen products (Continuous)

**_TASK:_** Read in `Wholesale customers data.csv` from the `datasets` folder and store it in a dataframe.  Store the `Channel` column in a separate variable, and then drop the `Channel` and `Region` columns from the dataframe. Scale the data and use PCA to engineer new features (Principal Components).  Print out the explained variance for each principal component.  Be sure to make your code portable, since we'll be using it in our next Jupyter Notebook on K-Means Clustering!

In [53]:
df = pd.read_csv('../datasets/Wholesale customers data.csv')
channel = df['Channel']

# Now Drop the Channel and Region Columns
df = df.drop(labels=['Channel','Region'],axis=1)
df.head()

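| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| 0 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |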


In [54]:
uci_scaler = StandardScaler()
uci_scaler.fit(df)
scaled_uci = uci_scaler.transform(df)

uci_pca = PCA()
uci_pca.fit(scaled_uci)
uci_with_pca = uci_pca.transform(scaled_uci)

for ind, var in enumerate(uci_pca.explained_variance_ratio_):
    print("Explained Variance for Principal Component {}: {}".format(ind, var))

Explained Variance for Principal Component 0: 0.4408289288112801
Explained Variance for Principal Component 1: 0.2837639952661695
Explained Variance for Principal Component 2: 0.12334412896786472
Explained Variance for Principal Component 3: 0.09395503752971504
Explained Variance for Principal Component 4: 0.04761272400688682
Explained Variance for Principal Component 5: 0.010495185418083766
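
Since the task asks for portable code, one option is to wrap the scale-then-PCA steps in a small function that can be copied into the next notebook. The sketch below is only a suggestion: the function name `scale_and_pca`, its `drop_cols` parameter, and the synthetic `toy` dataframe are all hypothetical and stand in for the real CSV so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_pca(frame, drop_cols=("Channel", "Region")):
    """Drop label-like columns, standardize the rest, and run PCA (hypothetical helper)."""
    features = frame.drop(columns=[c for c in drop_cols if c in frame.columns])
    scaled = StandardScaler().fit_transform(features)
    pca = PCA().fit(scaled)
    # Name the output columns PC1, PC2, ... to match the earlier exercise.
    names = ["PC{}".format(i + 1) for i in range(pca.n_components_)]
    components = pd.DataFrame(pca.transform(scaled), columns=names, index=frame.index)
    return components, pca


# Small synthetic stand-in for the wholesale data (values are random, not real).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(1, 1000, size=(20, 4)),
    columns=["Channel", "Fresh", "Milk", "Grocery"],
)
toy_components, toy_pca = scale_and_pca(toy)

print(toy_components.shape)  # (20, 3)
for ind, var in enumerate(toy_pca.explained_variance_ratio_, start=1):
    print("Explained Variance for PC{}: {}".format(ind, var))
```

On the real data, the same helper could be called as `scale_and_pca(pd.read_csv('../datasets/Wholesale customers data.csv'))`, which should yield six components after `Channel` and `Region` are dropped.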
