![Make School Logo](./img/ms-logo.png)

# Principal Component Analysis & Clustering

In this notebook, we'll add two important techniques to our Data Science toolbox: **Principal Component Analysis (PCA)** and **Clustering**. Unlike the Supervised Learning techniques we've used so far, these two are examples of **Unsupervised Learning**. Whereas we need labeled data to check the accuracy of algorithms like Decision Trees and Naive Bayes Classifiers, these techniques work on unlabeled data. This makes them applicable to many more kinds of data, but it also means we have no labels against which to measure how well the algorithm is working.

## `CHALLENGE 1` Apply PCA to the *Iris Dataset*
To help us explore the concept of PCA, we're going to start by applying PCA to the *Iris Dataset*. We'll then use it to fit a model and classify the flower types as we have done in previous examples.
There are two prerequisite resources to check out before we begin:
- Watch this [primer from StatQuest](https://www.youtube-nocookie.com/embed/_UVHneBUBW0) to understand what PCA is.
- Read this [short interactive article](http://setosa.io/ev/principal-component-analysis/) about using PCA on the *Iris Dataset*.

These two examples should help you better understand how PCA works, and more importantly, how it can be useful to you.
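
As a complement to the video and article, here is a minimal sketch of what PCA computes under the hood, assuming the standardized *Iris Dataset*. It eigen-decomposes the covariance matrix with NumPy and compares the result with scikit-learn's `PCA`. The variable names (`manual_ratio`, `manual_scores`, and so on) are made up for this illustration, and scikit-learn itself uses an SVD-based solver rather than this exact procedure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize so every feature has mean 0 and unit variance.
X = StandardScaler().fit_transform(load_iris().data)

# Eigen-decompose the covariance matrix of the standardized features.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA ranks components
# from most to least variance, so reorder them.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each component.
manual_ratio = eigenvalues / eigenvalues.sum()

# Projecting the centered data onto the eigenvectors gives the component scores.
manual_scores = X @ eigenvectors

sklearn_ratio = PCA().fit(X).explained_variance_ratio_
print(manual_ratio)
print(sklearn_ratio)
```

The two printed arrays should agree up to floating-point error. The scores themselves may differ from `PCA().transform(X)` by a sign flip per column, because an eigenvector's direction is only defined up to sign.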

### Getting Started
For our first challenge, we'll import the *Iris Dataset* from `sklearn.datasets` and use PCA on it. By examining the explained variance of the Principal Components, we'll see that we can drop 1 or 2 of them (reducing our **dimensionality**) while losing only a small amount of predictive accuracy.

Follow these steps in the code block below:
1. Call `load_iris()` and store the results in the `iris` variable.
1. Create a `StandardScaler()` object and store it in `scaler`.
1. Call `scaler.fit()` on `iris.data`, and then use `scaler.transform()` to create a scaled version of your data. Store the results in `scaled_x`.
1. Store the labels for iris in `labels`.
1. Create a `PCA()` object and store it in `pca`. Fit it to the scaled data using `pca.fit()`. Then, call `pca.transform()` on `scaled_x` and store the results in `X_with_pca`.
1. Complete the `enumerate` statement to loop through `pca.explained_variance_ratio_` and print out the variance captured by each of the Principal Components.

If you follow these steps correctly, you will create 4 *Principal Components* from the *Iris Dataset*. Use the information printed by the cells below to answer the questions that follow.
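
The scaling step matters because PCA ranks directions by variance, so a feature measured on a larger scale would otherwise dominate the components. A small sketch of what `StandardScaler` does to the iris features, assuming the variable names from the steps above (`scaled_again` is a name introduced only for this illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Two-step version, as in the instructions above.
scaler = StandardScaler()
scaler.fit(iris.data)
scaled_x = scaler.transform(iris.data)

# One-step shortcut that fits and transforms the same data.
scaled_again = StandardScaler().fit_transform(iris.data)

# After scaling, every column has mean ~0 and standard deviation ~1.
print(scaled_x.mean(axis=0).round(6))
print(scaled_x.std(axis=0).round(6))
print(np.allclose(scaled_x, scaled_again))
```

Keeping `fit()` and `transform()` separate becomes useful once you have a held-out test set: you fit the scaler on the training data only and reuse it to transform the test data.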

In [81]:
# standard imports --
# numpy for math stuff
# pandas for data stuff
import numpy
import pandas

# needed for first section
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [82]:
# get iris dataset from sklearn.datasets
iris_json = load_iris()
print(iris_json.DESCR)

# create a pandas dataframe from the data
iris = pandas.DataFrame(iris_json['data'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [83]:
# check out the dataframe!
iris.head()

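| | 0 | 1 | 2 | 3 |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |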


In [84]:
# okay, now add in some column names...
iris.columns = iris_json.feature_names
iris.head()

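| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |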


In [86]:
# create the "standard scaler"
scaler = StandardScaler()

# fit the scaler to the iris dataframe
scaler.fit(iris)

# call scaler.transform() on the iris dataframe and store the result in iris_scaled_x
iris_scaled_x = scaler.transform(iris)

In [87]:
# get the PCA
pca = PCA()

# fit the pca object to iris_scaled_x
pca.fit(iris_scaled_x)

# enumerate through "pca.explained_variance_ratio_"
# to see the amount of variance captured by each Principal Component
for ind, var in enumerate(pca.explained_variance_ratio_):
	print(f'Explained Variance for Principal Component {ind}: {var}')

# Call pca.transform() on iris_scaled_x and store the results below
iris_pca_x = pca.transform(iris_scaled_x)

Explained Variance for Principal Component 0: 0.7296244541329985
Explained Variance for Principal Component 1: 0.22850761786701793
Explained Variance for Principal Component 2: 0.036689218892828786
Explained Variance for Principal Component 3: 0.005178709107154802
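
One way to read these numbers is as a running total: how much of the overall variance do the first *k* components capture together? The sketch below recomputes the ratios and accumulates them with `np.cumsum`. The 95% cutoff is an assumption chosen purely for illustration, not a rule, and `n_keep` is a name made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_iris().data)
ratios = PCA().fit(scaled).explained_variance_ratio_

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(ratios)
for k, total in enumerate(cumulative, start=1):
    print(f"First {k} component(s): {total:.4f}")

# Assumed cutoff for illustration only; pick one that suits your problem.
threshold = 0.95
# Index of the first running total that reaches the threshold, plus one.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed to reach {threshold:.0%}: {n_keep}")
```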


## Understanding our Results
#### Deducting Variance
Based on the explained variances in the output above, do you recommend dropping any of the principal components to reduce dimensionality? Explain your answer.

> *Answer will be here.*

### Challenge: Fit a model using Principal Components
Using the data from above, complete the following steps:
1. Import your PCA data into a dataframe. Name the columns `PC1`, `PC2`, `PC3`, and `PC4`.
1. Drop `PC3` and `PC4` columns.
1. Split your scaled data (currently stored in `scaled_x` and `labels`) into training and testing sets using `train_test_split()`.
1. Split your PCA data (currently stored in `X_with_pca` and `labels`) into training and testing sets using `train_test_split()`.
1. Create two `DecisionTreeClassifier` objects. Store one in `pca_clf` and one in `reg_clf`.
1. Fit each model on its respective dataset, and make predictions with each. Compare their accuracy. Did the model fitted on the 2-dimensional PCA data perform comparably? How can you tell?
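
The steps above can be sketched roughly as follows. This is one possible approach, not the reference solution: `test_size=0.25`, `random_state=42`, and `stratify=labels` are assumed values, and `PCA(n_components=2)` is used as a shortcut with the same effect as building all four components and dropping `PC3` and `PC4`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
labels = iris.target
scaled_x = StandardScaler().fit_transform(iris.data)

# Keep only the first two principal components.
pca_x = PCA(n_components=2).fit_transform(scaled_x)

# Assumed split settings; using the same ones twice puts the same rows
# in both test sets, so the two accuracies are directly comparable.
split_args = dict(test_size=0.25, random_state=42, stratify=labels)
reg_X_train, reg_X_test, reg_y_train, reg_y_test = train_test_split(
    scaled_x, labels, **split_args)
pca_X_train, pca_X_test, pca_y_train, pca_y_test = train_test_split(
    pca_x, labels, **split_args)

reg_clf = DecisionTreeClassifier(random_state=42).fit(reg_X_train, reg_y_train)
pca_clf = DecisionTreeClassifier(random_state=42).fit(pca_X_train, pca_y_train)

reg_acc = accuracy_score(reg_y_test, reg_clf.predict(reg_X_test))
pca_acc = accuracy_score(pca_y_test, pca_clf.predict(pca_X_test))
print(f"Accuracy for regular model: {reg_acc:.3f}")
print(f"Accuracy for model with PCA: {pca_acc:.3f}")
```

A single split of 150 rows leaves fewer than 40 test samples, so a difference of one or two misclassified flowers can look large. Treat the comparison as a rough indication only.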

**Stretch Challenge:** Use `K-Fold Cross Validation` on each to run the models multiple times and get an average performance for each. Try this with K >= 5.
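
A possible sketch of the stretch challenge, under a few assumptions: it uses `StratifiedKFold` (a K-Fold variant that keeps the class balance in every fold) with K = 5, and it wraps scaling and PCA in a pipeline so they are refit on each training fold instead of seeing the held-out fold. The names `reg_model` and `pca_model` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Model on all four scaled features.
reg_model = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=0),
)

# Model on the first two principal components only.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    DecisionTreeClassifier(random_state=0),
)

# Shuffle because the iris rows are ordered by class; K = 5 is an assumed value.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
reg_scores = cross_val_score(reg_model, X, y, cv=cv)
pca_scores = cross_val_score(pca_model, X, y, cv=cv)

print(f"Regular model: mean accuracy {reg_scores.mean():.3f} (std {reg_scores.std():.3f})")
print(f"PCA model:     mean accuracy {pca_scores.mean():.3f} (std {pca_scores.std():.3f})")
```

Comparing the mean and spread across folds gives a steadier picture than a single train/test split.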

In [15]:
# Placeholders: replace each None with the appropriate call before running this cell
pca_X_train, pca_X_test, pca_y_train, pca_y_test = None
reg_X_train, reg_X_test, reg_y_train, reg_y_test = None

reg_clf = None
pca_clf = None

# Fit both models on the appropriate datasets

# Use each fitted model to make predictions on the appropriate test sets
reg_pred = None
pca_pred = None

print("Accuracy for regular model: {}".format(accuracy_score(reg_y_test, reg_pred)))
print("Accuracy for model with PCA: {}".format(accuracy_score(pca_y_test, pca_pred)))

## What is PCA?

### TASK: Answer the following questions about PCA based on what you learned from class, the video, and the reading listed above.

**How would you explain how PCA works to someone non-technical?**
Answer:

**In what way(s) can PCA be useful in Data Science and Machine Learning? Provide at least 2 examples.**
Answer:

### Challenge: Apply PCA and Clustering to Wholesale Customer Data
In this notebook, we'll examine the [**Wholesale Customers Dataset**](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), which comes from the UCI Machine Learning Repository. This dataset contains the purchase records from clients of a wholesale distributor. It details the total annual purchases across the categories listed in the data dictionary below:

**Category** | **Description**
:-----:|:-----:
CHANNEL | 1 = Hotel/Restaurant/Cafe, 2 = Retailer (Nominal)
REGION | Geographic region of Portugal for each order (Nominal)
FRESH | Annual spending (m.u.) on fresh products (Continuous)
MILK | Annual spending (m.u.) on milk products (Continuous)
GROCERY | Annual spending (m.u.) on grocery products (Continuous)
FROZEN | Annual spending (m.u.) on frozen products (Continuous)
DETERGENTS\_PAPER | Annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN | Annual spending (m.u.) on delicatessen products (Continuous)

**TASK:** Read in `wholesale_customers_data.csv` from the `datasets` folder and store it in a dataframe. Store the `Channel` column in a separate variable, and then drop the `Channel` and `Region` columns from the dataframe. Scale the data and use PCA to engineer new features (Principal Components). Print out the explained variance for each principal component. Be sure to make your code portable, since we'll be using it in our next Jupyter Notebook on K-Means Clustering!
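
One way to keep this code portable is to wrap the scale-then-PCA steps in a small function that takes any numeric dataframe. The sketch below is illustrative only: `scale_and_pca` is a hypothetical helper, the spending column names mirror the data dictionary and may not match the CSV header exactly, and the dataframe is filled with synthetic random numbers so the example runs without the file.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_and_pca(features):
    """Standardize a numeric DataFrame and return (fitted PCA, components DataFrame)."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA().fit(scaled)
    names = [f"PC{i + 1}" for i in range(pca.n_components_)]
    components = pd.DataFrame(pca.transform(scaled), columns=names, index=features.index)
    return pca, components


# Synthetic stand-in data (made up for this sketch). With the real file you
# would instead call something like:
#   df = pd.read_csv("datasets/wholesale_customers_data.csv")
rng = np.random.default_rng(0)
spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]
df = pd.DataFrame(rng.lognormal(mean=8, sigma=1, size=(50, 6)), columns=spend_cols)
df.insert(0, "Region", rng.integers(1, 4, size=50))
df.insert(0, "Channel", rng.integers(1, 3, size=50))

# Keep the channel aside, then drop the two nominal columns before scaling.
channel = df["Channel"]
features = df.drop(columns=["Channel", "Region"])

pca, components = scale_and_pca(features)
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Explained Variance for PC{i}: {ratio:.4f}")
```

Because the function returns both the fitted `PCA` object and a dataframe of components, a later notebook can reuse the components directly as clustering input.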

In [16]:
df = None
channel = None

# Now Drop the Channel and Region Columns