# Principal Component Analysis

1. Standardize the data so that each feature has mean 0 and variance 1.

2. Compute the covariance matrix of the features (dimensions).

3. Obtain the eigenvectors and eigenvalues of the covariance matrix.

4. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues  
(k is the number of dimensions of the new feature subspace, with k ≤ d, where d is the number of original dimensions).

5. Construct the projection matrix W from the selected k eigenvectors.

6. Transform the original data set X via W to obtain the new k-dimensional feature subspace Y.
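
The six steps above can be sketched directly in NumPy. This is a minimal illustration rather than the notebook's own implementation: the function name `pca_project` and the synthetic `X_demo` array are hypothetical, and the sketch assumes every column is numeric and has non-zero variance.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # Step 1: standardize each column to mean 0 and variance 1.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: d x d covariance matrix of the features (columns).
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigendecomposition; eigh is meant for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues in descending order and keep the k largest.
    order = np.argsort(eigvals)[::-1][:k]
    # Step 5: projection matrix W (d x k); columns are the chosen eigenvectors.
    W = eigvecs[:, order]
    # Step 6: transform the standardized data into the k-dimensional subspace.
    Y = X_std @ W
    return Y, W, eigvals[order]

# Hypothetical demo data: 100 samples, 4 features, two of them correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[:, 1] += 2.0 * X_demo[:, 0]

Y_demo, W_demo, top_eigvals = pca_project(X_demo, k=2)
print(Y_demo.shape)  # (100, 2)
```

The sample variance of each column of `Y_demo` equals the corresponding eigenvalue, which is why sorting the eigenvalues picks the directions that retain the most variance.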

Dataset used: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists

In [6]:
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

In [17]:
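# Combine the train and test files, then drop every row that has a missing value.
# Note: if aug_test.csv has no target column, dropna() also removes all of its rows.
full_dataset = pd.concat([pd.read_csv('aug_train.csv'), pd.read_csv('aug_test.csv')]).dropna()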
full_dataset

| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |
| 7 | 402 | city_46 | 0.762 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 13 | <10 | Pvt Ltd | >4 | 18 | 1.0 |
| 8 | 27107 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 7 | 50-99 | Pvt Ltd | 1 | 46 | 1.0 |
| 11 | 23853 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 5 | 5000-9999 | Pvt Ltd | 1 | 108 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19147 | 21319 | city_21 | 0.624 | Male | No relevent experience | Full time course | Graduate | STEM | 1 | 100-500 | Pvt Ltd | 1 | 52 | 1.0 |
| 19149 | 251 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Masters | STEM | 9 | 50-99 | Pvt Ltd | 1 | 36 | 1.0 |
| 19150 | 32313 | city_160 | 0.920 | Female | Has relevent experience | no_enrollment | Graduate | STEM | 10 | 100-500 | Public Sector | 3 | 23 | 0.0 |
| 19152 | 29754 | city_103 | 0.920 | Female | Has relevent experience | no_enrollment | Graduate | Humanities | 7 | 10/49 | Funded Startup | 1 | 25 | 0.0 |


In [18]:
full_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8955 entries, 1 to 19155
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             8955 non-null   int64  
 1   city                    8955 non-null   object 
 2   city_development_index  8955 non-null   float64
 3   gender                  8955 non-null   object 
 4   relevent_experience     8955 non-null   object 
 5   enrolled_university     8955 non-null   object 
 6   education_level         8955 non-null   object 
 7   major_discipline        8955 non-null   object 
 8   experience              8955 non-null   object 
 9   company_size            8955 non-null   object 
 10  company_type            8955 non-null   object 
 11  last_new_job            8955 non-null   object 
 12  training_hours          8955 non-null   int64  
 13  target                  8955 non-null   float64
dtypes: float64(2), int64(2), object(10)
mem

In [19]:
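# Features: every column except the last (this still includes enrollee_id and the string columns).
X = full_dataset.iloc[:,:-1].values
# Labels: the last column, target.
y = full_dataset.iloc[:,-1].values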

In [20]:
X

array([[29725, 'city_40', 0.7759999999999999, ..., 'Pvt Ltd', '>4', 47],
       [666, 'city_162', 0.767, ..., 'Funded Startup', '4', 8],
       [402, 'city_46', 0.762, ..., 'Pvt Ltd', '>4', 18],
       ...,
       [32313, 'city_160', 0.92, ..., 'Public Sector', '3', 23],
       [29754, 'city_103', 0.92, ..., 'Funded Startup', '1', 25],
       [24576, 'city_103', 0.92, ..., 'Pvt Ltd', '4', 44]], dtype=object)

In [21]:
y

array([0., 0., 1., ..., 0., 0., 0.])
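
The array `X` above has `dtype=object` because it mixes strings and numbers, and `StandardScaler` and `PCA` expect purely numeric input. One possible way to proceed, shown here only as a sketch, is to restrict the data to numeric columns before scaling. The small `demo` frame is a hypothetical stand-in that reuses a few values from the rows displayed earlier; how the categorical columns should be encoded, and whether an identifier such as `enrollee_id` should be dropped, are left open.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for full_dataset: two numeric columns and one string column.
demo = pd.DataFrame({
    "city_development_index": [0.776, 0.767, 0.762, 0.920, 0.920],
    "gender": ["Male", "Male", "Male", "Male", "Male"],
    "training_hours": [47, 8, 18, 46, 108],
})

# Keep only numeric columns; string columns would need encoding before PCA.
numeric = demo.select_dtypes(include="number")

# Standardize to mean 0 and variance 1, then project onto 2 components.
scaled = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

print(projected.shape)                # (5, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Because this demo keeps as many components as there are numeric columns, the explained variance ratios sum to 1; choosing a smaller `n_components` is what actually reduces dimensionality.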