# PCA Transformer

This notebook demonstrates the `PCATransformer` class, which applies `sklearn.decomposition.PCA` to the selected columns of the input `X`. <br>
Principal component analysis (PCA) is a dimensionality reduction technique that projects the data onto a lower-dimensional space. <br>
The transformer is based on the [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) class.
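
To make the projection concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is an illustration under stated assumptions, not the code that `PCATransformer` or scikit-learn actually runs; the helper name `pca_project` and the synthetic data are hypothetical.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal directions (illustrative only)."""
    X = np.asarray(X, dtype=float)
    # PCA works on centred data, so subtract the per-column mean first
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each row in the lower-dimensional space
    return X_centered @ components.T, components


# Synthetic data: three columns with decreasing spread
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])

scores, components = pca_project(X, n_components=2)
print(scores.shape)
```

Each row of `scores` holds the coordinates of one observation along the two directions of greatest variance.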



In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

In [2]:
import tubular
from tubular.numeric import PCATransformer

In [3]:
tubular.__version__

'0.3.3'

## Load California housing dataset from sklearn

In [4]:
cali = fetch_california_housing()
cali_df = pd.DataFrame(cali["data"], columns=cali["feature_names"])
print(cali_df.shape)
cali_df.head()

(20640, 8)


,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


## Simple usage    
### Initialising PCATransformer

The user can specify the following: <br>
- `columns`: the columns in the `DataFrame` passed to the `fit` and `transform` methods that will be transformed <br>
- `n_components`: the number of principal components to keep. The value "mle" can also be provided to have the dimension estimated automatically (default value is 2) <br>
- `svd_solver`: the solver used to compute the singular value decomposition. Available solvers: 'auto', 'full', 'arpack', 'randomized' (default value is 'auto') <br>
- `random_state`: used when the 'arpack' or 'randomized' solver is selected (default value is None) <br>
- `pca_column_prefix`: prefix added to the name of each generated component column (default value is `pca_`) <br>
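
The parameters above can be read as roughly mirroring arguments of `sklearn.decomposition.PCA`. The sketch below is one hypothetical way such a wrapper could work; `add_pca_columns` is an illustrative helper, not tubular's implementation, and the toy data is made up.

```python
import pandas as pd
from sklearn.decomposition import PCA


def add_pca_columns(df, columns, n_components=2, svd_solver="auto",
                    random_state=None, pca_column_prefix="pca_"):
    """Fit PCA on the chosen columns and append the component columns (hypothetical helper)."""
    pca = PCA(n_components=n_components, svd_solver=svd_solver,
              random_state=random_state)
    scores = pca.fit_transform(df[columns])
    # Name columns from the actual output width, which also covers n_components="mle"
    names = [f"{pca_column_prefix}{i}" for i in range(scores.shape[1])]
    scores_df = pd.DataFrame(scores, columns=names, index=df.index)
    # Return a new frame so the input is left unchanged
    return pd.concat([df, scores_df], axis=1)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [0.5, 0.1, 0.9, 0.3],
})
result = add_pca_columns(toy, columns=["a", "b", "c"], n_components=2)
print(list(result.columns))
```

Note that in scikit-learn, `n_components="mle"` is not supported by the 'arpack' or 'randomized' solvers.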

In [5]:
pca_transformer = PCATransformer(
    columns=["HouseAge", "Population", "MedInc"],
    n_components=2,
)

### PCATransformer fit
The `PCATransformer` must be `fit` on data before `transform` is run, because the SVD is computed during `fit`.

In [6]:
pca_transformer.fit(cali_df)

PCATransformer(columns=['HouseAge', 'Population', 'MedInc'])

### PCATransformer transform
When `transform` is run with this configuration, the new PCA component columns (`pca_0` and `pca_1`) are appended to the input `X`.

In [7]:
cali_df_2 = pca_transformer.transform(cali_df)
cali_df_2.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,pca_0,pca_1
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,-1103.511425,8.636318
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,975.543158,-4.514731
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,-929.548596,20.228204
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,-867.548945,20.464517
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,-860.548998,20.523405
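
In the output above, `pca_0` moves almost one-for-one with `Population`: between rows 0 and 1, `Population` changes by 2079 and `pca_0` by about 2079.05. This is consistent with PCA being sensitive to column scale, since `Population` has by far the largest spread of the three input columns. A small sketch on synthetic data (hypothetical, not the housing set) shows the effect using scikit-learn's `PCA` directly:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two independent synthetic columns on very different scales
large = rng.normal(0.0, 1000.0, size=500)
small = rng.normal(0.0, 10.0, size=500)
X = np.column_stack([large, small])

pca = PCA(n_components=1).fit(X)
# Absolute weight of each input column in the first component
weights = np.abs(pca.components_[0])
# Correlation between the first component and the large-scale column
first = pca.transform(X)[:, 0]
corr = abs(np.corrcoef(first, large)[0, 1])
print(weights, corr)
```

If components that balance the input columns are wanted, one common option is to standardise the columns before applying PCA.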


### Use a different solver


In [8]:
pca_transformer_arpack = PCATransformer(
    columns=["HouseAge", "Population", "MedInc"],
    n_components=1,
    svd_solver="arpack",
    random_state=32,
)

In [9]:
cali_df_3 = pca_transformer_arpack.fit_transform(cali_df)
cali_df_3.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,pca_0
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,-1103.511425
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,975.543158
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,-929.548596
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,-867.548945
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,-860.548998
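
The `pca_0` values here match those produced by the default solver in the earlier output, which is what one would expect: the solvers are different numerical routes to the same decomposition. A hedged sketch of that check, using scikit-learn's `PCA` directly on synthetic data (hypothetical, not the housing set):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(32)
# Synthetic data with three columns of different spread
X = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 1.0])

full = PCA(n_components=1, svd_solver="full").fit_transform(X)
arpack = PCA(n_components=1, svd_solver="arpack",
             random_state=32).fit_transform(X)

# Compare absolute values in case the two solvers choose opposite signs
same = np.allclose(np.abs(full), np.abs(arpack), atol=1e-5)
print(same)
```

The 'arpack' solver requires `n_components` to be strictly smaller than both the number of rows and the number of columns, which is why a single component is requested here.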
