# Principal Component Analysis
- PCA is an **unsupervised** learning method: unlike supervised learning models, it works with unlabeled data.
    - You can measure the accuracy of a supervised model against known labels; an unsupervised model has no labels to score against.
- A PCA plot is used to visualise correlations in the data on a 2D graph.
- Tight clusters indicate high correlation.
- The number of dimensions is reduced so that patterns in the correlation between features can be identified more easily.
- In other words, PCA aims to find the directions of maximum variance in high-dimensional data and to project the data onto a new subspace with equal or fewer dimensions than the original one.
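To make "directions of maximum variance" concrete, here is a minimal NumPy sketch that does PCA by hand on synthetic two-feature data. The data and every variable name are illustrative assumptions, not part of the wholesale dataset used later; scikit-learn's `PCA` class is the tool actually used in this notebook.

```python
import numpy as np

# Synthetic, strongly correlated 2-D data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Centre the data, then eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the direction with the largest variance comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variance captured along each direction
explained_ratio = eigvals / eigvals.sum()

# Project the 2-D data onto the first principal direction (2-D -> 1-D)
projected = centered @ eigvecs[:, :1]

print(explained_ratio)
print(projected.shape)
```

Because the two synthetic features are almost perfectly correlated, nearly all of the variance lies along one direction, so a single component is enough to describe the data.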

## Tasks

1. [ ] Import your PCA data into a dataframe. Name the columns `PC1`, `PC2`, `PC3`, and `PC4`.  
2. [ ] Drop PC3 and PC4 columns.
3. [ ] Split your scaled data (currently stored in `scaled_X` and `labels`) into training and testing data using `train_test_split()`.  
4. [ ] Split your PCA data (currently stored in `X_with_pca` and `labels`) into training and testing sets using `train_test_split()`.
5. [ ] Create two `DecisionTreeClassifier` objects. Store one in `pca_clf` and one in `reg_clf`.
6. [ ] Fit each model on its respective dataset and make predictions from each. Compare the accuracy of the two. Did the model fitted on the 2-dimensional PCA data perform comparably? How would you tell?
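Tasks 3 to 6 could be sketched roughly as follows. This is only an outline under stated assumptions: the data is synthetic (`raw_X` and the generated `labels` are hypothetical stand-ins), and the `test_size` and `random_state` values are arbitrary choices. The names `scaled_X`, `X_with_pca`, `pca_clf` and `reg_clf` come from the task list.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 300 samples, 6 features, 2 classes
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=300)
raw_X = rng.normal(size=(300, 6)) + labels[:, None] * 2.0  # class 1 is shifted

# Scale, then keep the first two principal components
scaled_X = StandardScaler().fit_transform(raw_X)
X_with_pca = PCA(n_components=2).fit_transform(scaled_X)

# Using the same random_state puts the same rows in both test sets
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, labels, test_size=0.25, random_state=0)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_with_pca, labels, test_size=0.25, random_state=0)

# One tree per representation of the data
reg_clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pca_clf = DecisionTreeClassifier(random_state=0).fit(Xp_train, yp_train)

reg_acc = accuracy_score(y_test, reg_clf.predict(X_test))
pca_acc = accuracy_score(yp_test, pca_clf.predict(Xp_test))
print(f"Scaled features: {reg_acc:.3f}")
print(f"2 PCA components: {pca_acc:.3f}")
```

Splitting both datasets with the same `random_state` means the two accuracy scores are computed on the same held-out rows, which makes the comparison fair. The scaler and PCA are fitted before the split here to mirror the task order; a stricter setup would fit them on the training rows only.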

## Imports

In [5]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## STEP 1: *Initialize dataframe*

In [9]:
# Data
pca_df = pd.read_csv('../datasets/Wholesale customers data.csv')
pca_df.head(3)

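| | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 2 | 3 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |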


## STEP 2: *Data Scaling & Transform*

In [29]:
# Scaling
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.fit(pca_df)
scaled_df = scaler.transform(pca_df)
# scaled_df
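PCA is driven by variance, so a feature measured in large units would dominate the components unless the features are standardised first. The check below is a small sketch on a made-up array (`X` is hypothetical, not the wholesale data) showing what `StandardScaler` does to each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative columns on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(X)

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

The same two checks can be run on `scaled_df` to confirm the wholesale features were standardised as expected.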



## STEP 3: *PCA Model*

In [31]:
# Initialise the model
pca = PCA() 

# Fit the model with scaled data
pca.fit(scaled_df)

# Transform the scaled data
pca_transformed = pca.transform(scaled_df)

# Enumerate "pca.explained_variance_ratio_"
# to see the fraction of variance captured by each principal component
for ind, var in enumerate(pca.explained_variance_ratio_):
    print(f'Explained Variance for Principal Component {ind}: {var}')

Explained Variance for Principal Component 0: 0.3875012291159263
Explained Variance for Principal Component 1: 0.2237458795103888
Explained Variance for Principal Component 2: 0.1264717345230583
Explained Variance for Principal Component 3: 0.09229903718659847
Explained Variance for Principal Component 4: 0.06957904970880058
Explained Variance for Principal Component 5: 0.0574135443531443
Explained Variance for Principal Component 6: 0.03514075682001749
Explained Variance for Principal Component 7: 0.007848768782065951
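A cumulative sum of these ratios is a common way to decide how many components to keep. The sketch below copies the ratios from the output above (rounded to 8 decimals); the 80% `threshold` is an arbitrary illustrative choice, not something this notebook prescribes.

```python
import numpy as np

# Explained variance ratios copied from the printed output (rounded)
ratios = np.array([0.38750123, 0.22374588, 0.12647173, 0.09229904,
                   0.06957905, 0.05741354, 0.03514076, 0.00784877])

# Running total of variance captured by the first k components
cumulative = np.cumsum(ratios)
for i, total in enumerate(cumulative):
    print(f"First {i + 1} component(s): {total:.4f}")

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.80  # hypothetical cut-off
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(f"Components needed for {threshold:.0%}: {n_keep}")
```

With these numbers the first two components capture about 61% of the variance and the first four about 83%, so dropping to two components for plotting gives up a noticeable share of the information.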


### STEP 3.1: *Dropping Principal Components*

#### Converting Principal Components into a Data Frame

In [15]:
print(pca.explained_variance_ratio_)

[0.38750123 0.22374588 0.12647173 0.09229904 0.06957905 0.05741354
 0.03514076 0.00784877]


In [16]:
def create_pca_df(values_list):
    # create empty df
    df_pca = pd.DataFrame()

    # loop through principal components
    for i in range(len(values_list)):
        # feature name
        feature = "PCA" + str(i+1)
        # principal component explained variance ratio value
        value = values_list[i]
        # create new feature with corresponding value
        # NOTE: df_pca has no rows, so assigning a scalar only adds the
        # column name; the resulting frame has headers but no data
        df_pca[feature] = value

    return df_pca

df_pca = create_pca_df(pca.explained_variance_ratio_)
df_pca

| | PCA1 | PCA2 | PCA3 | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 |
|---|---|---|---|---|---|---|---|---|
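The frame returned by `create_pca_df` has column names but no rows, because a scalar assigned to a column of an empty DataFrame has no index to fill. One possible fix, sketched here with a hypothetical helper name (`create_pca_ratio_df`) and the ratios copied from the earlier output, is to build the row in a single constructor call.

```python
import numpy as np
import pandas as pd

def create_pca_ratio_df(values_list):
    """Return a one-row DataFrame with one column per principal component."""
    columns = [f"PCA{i + 1}" for i in range(len(values_list))]
    # Wrapping the values in an outer list makes them a single row
    return pd.DataFrame([list(values_list)], columns=columns)

# Explained variance ratios copied from the printed output (rounded)
ratios = np.array([0.38750123, 0.22374588, 0.12647173, 0.09229904,
                   0.06957905, 0.05741354, 0.03514076, 0.00784877])

df_ratio = create_pca_ratio_df(ratios)
print(df_ratio.shape)
print(df_ratio.columns.tolist())
```

Note that this frame holds the explained variance ratios, one value per component; it is not the transformed data itself.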


In [17]:
pca_transformed

array([[ 0.84393893, -0.51535075, -0.76763222, ..., -0.93944129,
         0.65476177,  0.01810169],
       [ 1.06267645, -0.48460126, -0.67297526, ..., -0.86722684,
         0.51102248,  0.0778948 ],
       [ 1.26914052,  0.68205455, -0.6640946 , ..., -1.07844165,
        -0.20315184, -0.2540374 ],
       ...,
       [ 3.86514909, -0.47985376, -0.52534452, ...,  0.28032041,
        -0.57529675, -0.08900336],
       [-1.09706738, -0.06989568, -0.63012755, ...,  0.33517   ,
        -0.15374358, -0.03730795],
       [-1.16595067, -0.90215675, -0.59770486, ...,  0.50872064,
         0.02436002,  0.01866823]])
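For tasks 1 and 2, the transformed array (one row per sample, one column per component) is what needs to go into a DataFrame. The sketch below uses a small random array as a stand-in for the scaled data, so `scaled_demo`, `pca_demo` and `pc_df` are hypothetical names; the column names and `X_with_pca` follow the task list.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for the scaled data: 10 samples, 8 features
rng = np.random.default_rng(1)
scaled_demo = rng.normal(size=(10, 8))

# Keep four components so the columns match PC1..PC4 from the task list
pca_demo = PCA(n_components=4)
scores = pca_demo.fit_transform(scaled_demo)

# Task 1: wrap the component scores in a DataFrame with named columns
pc_df = pd.DataFrame(scores, columns=["PC1", "PC2", "PC3", "PC4"])

# Task 2: drop PC3 and PC4, leaving the 2-dimensional PCA data
X_with_pca = pc_df.drop(columns=["PC3", "PC4"])
print(X_with_pca.head(3))
```

With the notebook's own objects, the same idea would apply to the first four columns of `pca_transformed`, since that array was produced with all eight components.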