### Linear Discriminant Analysis  (Normal Discriminant Analysis / Discriminant Function Analysis)
LDA is a dimensionality reduction technique which is commonly used for the supervised classification problems. 
It is used for modeling differences in groups i.e. separating two or more classes. 
It is used to project the features in higher dimension space into a lower dimension space.

### Principal Component Analysis:
PCA is a statistical procedure that uses an orthogonal transformation which converts a set of correlated variables 
to a set of uncorrelated variables. 
PCA is a most widely used tool in exploratory data analysis and in machine learning for predictive models. 
Moreover, PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables. 

It is also known as a general factor analysis where regression determines a line of best fit.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt  
%matplotlib inline

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

In [3]:
data = pd.read_csv("train.csv", nrows=20000)
data.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [4]:
X =data.drop("TARGET", axis=1)
y =data["TARGET"]

X.shape , y.shape

((20000, 370), (20000,))

In [5]:
X_train,X_test,y_train,y_test= train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

### Remove Constant, Quasi Constant, duplicated feature removal

In [6]:
constant_filter =VarianceThreshold(threshold=0)
constant_filter.fit(X_train)

VarianceThreshold(threshold=0)

In [7]:
X_train_filter =constant_filter.transform(X_train)
X_test_filter =constant_filter.transform(X_test)

In [8]:
quasi_constant_filter =VarianceThreshold(threshold=0.01)
quasi_constant_filter.fit(X_train_filter)

VarianceThreshold(threshold=0.01)

In [9]:
X_train_quasi_filter =quasi_constant_filter.transform(X_train_filter)
X_test_quasi_filter =quasi_constant_filter.transform(X_test_filter)

In [10]:
X_train_T =X_train_quasi_filter.T
X_test_T =X_test_quasi_filter.T

In [11]:
X_train_T =pd.DataFrame(X_train_T)
X_test_T =pd.DataFrame(X_test_T)

In [12]:
duplicated =X_train_T.duplicated()

feature_keep =[not index for index in duplicated]

X_train_unique =X_train_T[feature_keep].T
X_test_unique =X_test_T[feature_keep].T

In [13]:
X_train_unique.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,235,236,237,238,239,240,241,242,243,244
0,17282.0,2.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,63200.7
1,38270.0,2.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,88640.61
2,31526.0,2.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,96314.16
3,38737.0,2.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117568.02
4,16469.0,2.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016


In [14]:
X_train_unique.shape

(16000, 227)

### Remove correlated features

In [15]:
corrmat =X_train_unique.corr()

In [19]:
def get_correlation(data, threshold):
    corr_col= set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]
                corr_col.add(colname)
    return corr_col

corr_features =get_correlation(X_train_unique, .70)
print("correlated feature:", len(set(corr_features)))

correlated feature: 148


In [20]:
xtrain_uncorr = X_train_unique.drop(labels =corr_features, axis=1)
xtest_uncorr = X_test_unique.drop(labels =corr_features, axis=1)

In [21]:
xtrain_uncorr.shape, xtrain_uncorr.shape

((16000, 79), (16000, 79))

### LDA

In [22]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [24]:
lda = LDA(n_components=1)
X_train_lda =lda.fit_transform(xtrain_uncorr, y_train)

In [25]:
X_train_lda.shape

(16000, 1)

In [28]:
X_test_lda =lda.transform(xtest_uncorr)

In [29]:
def randomforest(X_train, X_test, y_train, y_test):
    rf= RandomForestClassifier(n_estimators=100 , n_jobs=-1, random_state=0)
    rf.fit(X_train, y_train)
    y_pred =rf.predict(X_test)
    print("Accuracy on test set:", accuracy_score(y_test, y_pred))

In [None]:
%%time
randomforest(X_train_lda, X_test_lda, y_train, y_test)

In [31]:
X_train_lda.shape, X_test_lda.shape, y_train.shape, y_test.shape

((16000, 1), (4000, 1), (16000,), (4000,))

In [32]:
%%time
randomforest(X_train, X_test, y_train, y_test)

Accuracy on test set: 0.9585
Wall time: 3.2 s


### PCA

In [33]:
from sklearn.decomposition import PCA

In [34]:
pca =PCA(n_components=2, random_state=42)
pca.fit(xtest_uncorr)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=42,
    svd_solver='auto', tol=0.0, whiten=False)

In [35]:
X_train_pca =pca.transform(xtrain_uncorr)
X_test_pca =pca.transform(xtest_uncorr)

In [36]:
X_train_pca.shape, X_test_pca.shape

((16000, 2), (4000, 2))

In [37]:
%%time
randomforest(X_train_pca, X_test_pca, y_train, y_test)

Accuracy on test set: 0.95925
Wall time: 601 ms


In [38]:
for components in range(1,79):
    pca =PCA(n_components=components, random_state=42)
    pca.fit(xtest_uncorr)   
    X_train_pca =pca.transform(xtrain_uncorr)
    X_test_pca =pca.transform(xtest_uncorr)
    
    print("No of components:",components)
    randomforest(X_train_pca, X_test_pca, y_train, y_test)

No of components: 1
Accuracy on test set: 0.95925
No of components: 2
Accuracy on test set: 0.95925
No of components: 3
Accuracy on test set: 0.95925
No of components: 4
Accuracy on test set: 0.959
No of components: 5
Accuracy on test set: 0.9585
No of components: 6
Accuracy on test set: 0.95875
No of components: 7
Accuracy on test set: 0.95825
No of components: 8
Accuracy on test set: 0.9195
No of components: 9
Accuracy on test set: 0.9535
No of components: 10
Accuracy on test set: 0.95625
No of components: 11
Accuracy on test set: 0.95575
No of components: 12
Accuracy on test set: 0.95625
No of components: 13
Accuracy on test set: 0.95625
No of components: 14
Accuracy on test set: 0.9565
No of components: 15
Accuracy on test set: 0.9565
No of components: 16
Accuracy on test set: 0.95675
No of components: 17
Accuracy on test set: 0.95675
No of components: 18
Accuracy on test set: 0.9575
No of components: 19
Accuracy on test set: 0.95725
No of components: 20
Accuracy on test set: 0.956

 components = 1 gives better accuracy