# Principal Component Analysis
- **Unsupervised** learning model: unlike supervised learning models, it works with unlabeled data.
    - Supervised models can be scored against known labels (for example, with accuracy); unsupervised models have no labels to score against.
- A PCA plot is used to visualise correlations in a 2D graph.
- Tight clusters indicate high correlation.
- The number of dimensions is reduced so that patterns in the correlation between features can be identified more easily.
- In other words, PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one.
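
To make "directions of maximum variance" concrete, here is a minimal NumPy sketch of the idea. It uses made-up toy data (not the wholesale dataset) and eigendecomposes the covariance matrix directly. This is one common way to compute PCA and is not necessarily what `sklearn` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 samples, 3 features,
# where the second feature closely tracks the first.
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Centre each feature, then eigendecompose the covariance matrix.
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the top two directions (3 dimensions -> 2).
X_projected = X_centred @ eigvecs[:, :2]
explained_ratio = eigvals / eigvals.sum()

print(X_projected.shape)
print(explained_ratio)
```

Because two of the three toy features are strongly correlated, most of the variance should land on the first direction, which is the pattern PCA is designed to expose.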

## Tasks

1. [X] Import your PCA data into a dataframe. Name the columns `PC1`, `PC2`, `PC3`, `PC4`, etc.
2. [X] Drop all columns following `PC3`.
3. [ ] Split your scaled data (currently stored in `scaled_X` and `labels`) into training and testing data using `train_test_split()`.  
4. [ ] Split your PCA data (currently stored in `X_with_pca` and `labels`) into training and testing sets using `train_test_split()`.
5. [ ] Create two `DecisionTreeClassifier` objects. Store one in `pca_clf` and one in `reg_clf`.
6. [ ] Fit each model on its respective dataset, and make predictions with each. Compare their accuracies. Did the model fitted on the reduced PCA data perform comparably? How would you tell?

## Imports

In [5]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## STEP 1: *Initialize dataframe*

In [9]:
# Data
pca_df = pd.read_csv('../datasets/Wholesale customers data.csv')
pca_df.head(3)

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844


## STEP 2: *Data Scaling & Transform*

In [29]:
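# Scaling: standardise every column to zero mean and unit variance.
# Note that this also scales the categorical `Channel` and `Region` codes.
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.fit(pca_df)
scaled_df = scaler.transform(pca_df)  # a NumPy array, despite the name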
# scaled_df
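
As a rough illustration of what the scaling step does, the sketch below standardises two columns by hand, using the `Fresh` and `Frozen` values from the three rows shown in Step 1. `StandardScaler` with its default settings applies the same formula (subtract the column mean, divide by the population standard deviation). The variable names here are only for illustration.

```python
import numpy as np

# Fresh and Frozen values from the three rows displayed in Step 1.
X = np.array([
    [12669.0, 214.0],
    [7057.0, 1762.0],
    [6353.0, 2405.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# z-score: each column ends up with mean 0 and standard deviation 1.
X_scaled = (X - mean) / std

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Scaling matters for PCA because, without it, columns with large raw values would dominate the variance and therefore the leading components.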



## STEP 3: *PCA Model*

In [31]:
# Initialise the model
pca = PCA() 

# Fit the model with scaled data
pca.fit(scaled_df)

# Transform the scaled data
pca_transformed = pca.transform(scaled_df)

# Loop over pca.explained_variance_ratio_ to see the share of
# variance captured by each principal component
for ind, var in enumerate(pca.explained_variance_ratio_):
    print(f'Explained Variance for Principal Component {ind}: {var}')    

Explained Variance for Principal Component 0: 0.3875012291159263
Explained Variance for Principal Component 1: 0.2237458795103888
Explained Variance for Principal Component 2: 0.1264717345230583
Explained Variance for Principal Component 3: 0.09229903718659847
Explained Variance for Principal Component 4: 0.06957904970880058
Explained Variance for Principal Component 5: 0.0574135443531443
Explained Variance for Principal Component 6: 0.03514075682001749
Explained Variance for Principal Component 7: 0.007848768782065951
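
One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below reuses the ratios printed above. The 70% threshold is an arbitrary assumption chosen for illustration, not a rule.

```python
import numpy as np

# Explained variance ratios copied from the output above.
ratios = np.array([
    0.3875012291159263,
    0.2237458795103888,
    0.1264717345230583,
    0.09229903718659847,
    0.06957904970880058,
    0.0574135443531443,
    0.03514075682001749,
    0.007848768782065951,
])

cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.70  # hypothetical cut-off
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

for i, total in enumerate(cumulative):
    print(f"First {i + 1} component(s): {total:.3f}")
print(f"Components needed for {threshold:.0%}: {n_keep}")
```

With these numbers the first three components together account for roughly 74% of the variance, which is consistent with keeping components 0, 1 and 2.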


### STEP 3.1: *Dropping Principal Components*

#### Converting Principal Components into a Data Frame

### Now that we see the explained variance, we can drop the components with low variance
##### Components 0, 1 and 2 explain the most variance (about 74% combined), so we'll keep those

In [113]:
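# NOTE: this slices the first three *scaled feature* columns
# (Channel, Region, Fresh), which is what the output below shows.
# To keep the first three principal components instead, slice the
# PCA output: pca_transformed[:, 0:3]
pc_scaled_df = scaled_df[:, 0:3]
pc_scaled_df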

array([[ 1.44865163,  0.59066829,  0.05293319],
       [ 1.44865163,  0.59066829, -0.39130197],
       [ 1.44865163,  0.59066829, -0.44702926],
       ...,
       [ 1.44865163,  0.59066829,  0.20032554],
       [-0.69029709,  0.59066829, -0.13538389],
       [-0.69029709,  0.59066829, -0.72930698]])

In [118]:
# Initialise the model
pca = PCA() 

# Fit the model with scaled data
pca.fit(pc_scaled_df)

# Transform the scaled data
pca_transformed = pca.transform(pc_scaled_df)

# Loop over pca.explained_variance_ratio_ to see the share of
# variance captured by each principal component
for ind, var in enumerate(pca.explained_variance_ratio_):
    print(f'Explained Variance for Principal Component {ind}: {var}') 
print()
print(pca.explained_variance_ratio_)

Explained Variance for Principal Component 0: 0.3897748386022531
Explained Variance for Principal Component 1: 0.3445814912712074
Explained Variance for Principal Component 2: 0.2656436701265396

[0.38977484 0.34458149 0.26564367]


In [119]:
pca_transformed

array([[-1.01359198,  0.91589454,  0.76423229],
       [-1.32508214,  0.7868632 ,  0.47497519],
       [-1.36415717,  0.7706768 ,  0.4386892 ],
       ...,
       [-0.91024296,  0.95870572,  0.86020462],
       [ 0.37785955,  0.33841711, -0.7657832 ],
       [-0.03858925,  0.16590782, -1.15250737]])
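
Tasks 1 and 2 ask for the component scores in a dataframe with columns `PC1`, `PC2`, ... and everything after `PC3` dropped. A minimal sketch follows, assuming the scores come from a PCA fitted on all eight scaled features. The random `features` array is a hypothetical stand-in for the real data, and `pca_components_df` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))  # hypothetical stand-in for the 8 numeric columns

scaled = StandardScaler().fit_transform(features)
pca_transformed = PCA().fit_transform(scaled)

# Task 1: wrap the component scores in a dataframe named PC1, PC2, ...
columns = [f"PC{i + 1}" for i in range(pca_transformed.shape[1])]
pca_components_df = pd.DataFrame(pca_transformed, columns=columns)

# Task 2: drop every column after PC3.
X_with_pca = pca_components_df.drop(columns=columns[3:])

print(X_with_pca.head(3))
```

Passing `n_components=3` to `PCA` would produce the same three columns directly, without the separate drop step.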

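
The remaining tasks (3 to 6) could be approached along the following lines. This is only a sketch: the notebook does not yet define `labels`, so the data here is generated with `make_classification` as a hypothetical stand-in, and the split size and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 8 features and a binary label.
features, labels = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

scaled_X = StandardScaler().fit_transform(features)
X_with_pca = PCA(n_components=3).fit_transform(scaled_X)

# Tasks 3 and 4: the same random_state gives both splits the same rows,
# so the two models are compared on identical test samples.
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, labels, test_size=0.25, random_state=42
)
X_pca_train, X_pca_test, y_pca_train, y_pca_test = train_test_split(
    X_with_pca, labels, test_size=0.25, random_state=42
)

# Task 5: one classifier per dataset.
reg_clf = DecisionTreeClassifier(random_state=42)
pca_clf = DecisionTreeClassifier(random_state=42)

# Task 6: fit, predict, and compare accuracy.
reg_clf.fit(X_train, y_train)
pca_clf.fit(X_pca_train, y_pca_train)

reg_acc = accuracy_score(y_test, reg_clf.predict(X_test))
pca_acc = accuracy_score(y_pca_test, pca_clf.predict(X_pca_test))

print(f"Accuracy on all scaled features: {reg_acc:.3f}")
print(f"Accuracy on 3 principal components: {pca_acc:.3f}")
```

If the two accuracies are close, the three components retain most of the information the classifier needs. A stricter workflow would fit the scaler and PCA on the training split only, to avoid leaking information from the test set.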