## Principal Component Analysis (PCA)

PCA is essentially a preprocessing step, used here to improve how well our model generalizes. It reduces the dimensionality of the data by projecting the original features onto a smaller set of uncorrelated directions (the principal components) that capture most of the variance. The goal is to reduce overfitting by ensuring the model does not measure the same underlying characteristic multiple times.
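
To make the projection idea concrete, here is a minimal sketch of the mechanics using only NumPy. The helper name `pca_svd` and the toy data are hypothetical, chosen for illustration; in practice a library implementation such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance[:n_components]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Hypothetical toy data: two nearly redundant columns plus one independent column
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

Z, components, variances = pca_svd(X, n_components=2)
print(Z.shape)  # (200, 2)
```

The two projected columns are uncorrelated with each other, which is what lets a downstream model avoid counting the same signal twice.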

For example, think of a housing price model. A home has many characteristics: number of bedrooms, number of bathrooms, square footage, lot size, etc. Many of these features are proxies for the size of the home: larger lots tend to hold homes with more square footage, which in turn tend to have more bathrooms and bedrooms. Predicting price from all of these correlated features can reduce generalizability, because the model may be over-tuned to patterns that are specific to the training data.
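
The housing intuition can be sketched with synthetic data. Everything below is an assumption made for illustration: the feature values are generated from a single hypothetical "size" factor and are not real housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 500
# Hypothetical synthetic homes: every feature is driven by one latent "size" factor
size = rng.normal(size=n)
homes = np.column_stack([
    3 + size + 0.3 * rng.normal(size=n),            # bedrooms
    2 + 0.8 * size + 0.3 * rng.normal(size=n),      # bathrooms
    1800 + 600 * size + 100 * rng.normal(size=n),   # square footage
    7000 + 2500 * size + 800 * rng.normal(size=n),  # lot size
])

# Standardize first so that large-valued columns do not dominate the components
homes_sc = StandardScaler().fit_transform(homes)
pca_homes = PCA().fit(homes_sc)

# Share of the total variance captured by the first principal component
first_share = pca_homes.explained_variance_ratio_[0]
print(pca_homes.explained_variance_ratio_.round(3))
```

When the features are this strongly correlated, the first component accounts for most of the variance, so the four columns behave much like a single "size" measurement.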

Let's take a look at how we can apply PCA to a dataset; scikit-learn thankfully makes this rather simple. The dataset we will be working with describes cars and their miles per gallon (MPG).

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

In [6]:
# import the data and clean the training split
cars = pd.read_csv('cars.csv')
X_train, X_test, y_train, y_test = train_test_split(cars.drop('mpg', axis=1),
                                                    cars['mpg'],
                                                    random_state=20)

# work on explicit copies so the cleaning below does not modify slices of `cars`
X_train, X_test = X_train.copy(), X_test.copy()

# note: the column names in this CSV carry a leading space
# blank strings mark missing values: convert to NaN, cast to float, fill with the column mean
for col in [' cubicinches', ' weightlbs']:
    X_train[col] = X_train[col].replace(' ', np.nan).astype(float)
    X_train[col] = X_train[col].fillna(X_train[col].mean())

# cast the remaining numeric columns to float
for col in [' cylinders', ' hp', ' time-to-60', ' year']:
    X_train[col] = X_train[col].astype(float)

In [14]:
# drop the ' brand' column, then standardize the training data

ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train.drop(' brand', axis=1))

In [15]:
# clean & scale test data

def clean(df):
    # blank strings -> NaN, cast to float, then fill gaps with the column mean
    df = df.copy()
    for col in [' cubicinches', ' weightlbs']:
        df[col] = df[col].replace(' ', np.nan).astype(float)
        df[col] = df[col].fillna(df[col].mean())
    return df

def to_float(df):
    # cast the remaining numeric columns to float
    df = df.copy()
    for col in [' cylinders', ' hp', ' time-to-60', ' year']:
        df[col] = df[col].astype(float)
    return df

def drop(df):
    return df.drop(' brand', axis=1)

def scale(df):
    # reuse the scaler that was fitted on the training data
    return ss.transform(df)

test_cleaned = clean(X_test)
test_floated = to_float(test_cleaned)
test_dropped = drop(test_floated)
test_scaled = scale(test_dropped)

In [28]:
# baseline: linear regression on the scaled features; score() returns R^2 on the test data
lr1 = LinearRegression()
lr1.fit(X_train_sc, y_train)
lr1.score(test_scaled, y_test)

0.7504046085411239

### Now with PCA

In [16]:
# instantiate PCA, keeping 5 principal components

pca = PCA(n_components=5)

In [24]:
# fit PCA on the training data and transform it

X_train_pca = pca.fit_transform(X_train_sc)
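
One way to sanity-check a choice such as `n_components=5` is to look at how much variance each component explains. The sketch below assumes synthetic stand-in data (`X_demo`, hypothetical) rather than the cars dataset; `explained_variance_ratio_`, `n_components_`, and a float value for `n_components` are part of scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical stand-in for six numeric features (not the cars data)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.2 * rng.normal(size=(300, 6))
X_demo_sc = StandardScaler().fit_transform(X_demo)

# Fit with all components and accumulate the explained-variance shares
pca_full = PCA().fit(X_demo_sc)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)

# A float n_components asks scikit-learn to choose that number itself
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo_sc)
print(k, pca_95.n_components_)
```

On the fitted `pca` object from the cell above, the same attribute (`pca.explained_variance_ratio_`) would show how much variance the five retained components keep.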

In [20]:
# transform test data with the PCA fitted on the training data

test_pca = pca.transform(test_scaled)

In [21]:
lr = LinearRegression()

In [26]:
lr.fit(X_train_pca, y_train)
lr.score(test_pca, y_test)

0.7525161654890221
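
The scale, project, and fit steps above can also be chained into a single estimator, which makes it harder to accidentally fit the scaler or PCA on test data. This is a hedged sketch on hypothetical synthetic data (`X_demo`, `y_demo`), not a rerun of the cars example; `make_pipeline` is scikit-learn's helper for building a `Pipeline`.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(20)
# Hypothetical stand-in data: six numeric features and a noisy linear target
X_demo = rng.normal(size=(250, 6))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0, 1.0, -0.5]) + rng.normal(size=250)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=20)

# fit() learns the scaler, the PCA projection and the regression from training data only;
# score() then pushes the test data through the same fitted steps and returns R^2
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
model.fit(Xtr, ytr)
r2_test = model.score(Xte, yte)
print(round(r2_test, 3))
```

The pipeline produces the same result as applying the three steps by hand; the benefit is that one `fit` and one `score` call replace the separate train and test transformations.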

## Conclusion

We can see that the regression with PCA performed slightly better on the testing data (R² of 0.7525 versus 0.7504). The model was fitted to training data that had been processed through PCA, which allowed it to generalize a little more effectively, although the gap here is small.

Remember, this is in relation to the TEST data. Performance on the training data will typically be lower for PCA models (and, for linear regression, never higher), because the level of overfitting was reduced.
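
The train-versus-test comparison can be checked directly. The sketch below uses hypothetical synthetic data (not the cars dataset) and compares R² on both splits with and without PCA; all variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
# Hypothetical data: six correlated features built from three latent factors
latent = rng.normal(size=(120, 3))
X_demo = latent @ rng.normal(size=(3, 6)) + 0.3 * rng.normal(size=(120, 6))
y_demo = latent @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=120)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=7)

# Fit the scaler and PCA on the training split only, then apply both to the test split
scaler = StandardScaler().fit(Xtr)
Xtr_sc, Xte_sc = scaler.transform(Xtr), scaler.transform(Xte)
pca_demo = PCA(n_components=3).fit(Xtr_sc)
Xtr_pca, Xte_pca = pca_demo.transform(Xtr_sc), pca_demo.transform(Xte_sc)

full = LinearRegression().fit(Xtr_sc, ytr)        # all six scaled features
reduced = LinearRegression().fit(Xtr_pca, ytr)    # three principal components

train_full, train_pca = full.score(Xtr_sc, ytr), reduced.score(Xtr_pca, ytr)
test_full, te_pca = full.score(Xte_sc, yte), reduced.score(Xte_pca, yte)
print(f"train R^2: full={train_full:.3f}  pca={train_pca:.3f}")
print(f"test  R^2: full={test_full:.3f}  pca={te_pca:.3f}")
```

Only the training-side relationship is guaranteed: the PCA model's training R² cannot exceed that of the full model, because its inputs are linear combinations of the same features. Which model scores higher on the test split depends on the data.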