## Description
- Removing constant features
- Removing quasi-constant features
- Removing duplicate features
- Removing correlated features
- Performing LDA and PCA

## What we have done

We remove constant features, which have zero variance and therefore carry no information for the classifier.
We remove quasi-constant features, i.e. features that are almost constant. In other words, these features have the same value for a very large subset of the observations. Such features are not very useful for making predictions. There is no rule as to what the variance threshold for quasi-constant features should be, but in this project we used 0.01 as the threshold value.
We remove duplicate features, i.e. columns whose values are identical to those of another column.
We remove features that are mutually correlated because they convey redundant information to the model; only one feature from each correlated group should be retained to reduce the number of features. In the code below, a feature is dropped when its absolute correlation with an earlier feature exceeds 0.85.
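
The filters described above can be sketched on a small toy table. This is only an illustration under stated assumptions: the column names and values are made up, and a single `VarianceThreshold(threshold=0.01)` is used here to catch both constant and quasi-constant columns, whereas the class below applies the thresholds 0 and 0.01 in two separate steps.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): 'const' never changes, 'quasi' changes in one row,
# 'dup' copies 'a', and 'b' is almost a linear function of 'a'.
rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'const': np.ones(200),
    'quasi': np.r_[np.zeros(199), 1.0],
    'dup': a,
    'b': 2 * a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

# 1) Constant and quasi-constant columns: variance not above the threshold.
selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)
kept = toy.loc[:, selector.get_support()]

# 2) Duplicate columns: transpose so columns become rows, then de-duplicate.
kept = kept.loc[:, ~kept.T.duplicated()]

# 3) Correlated columns: drop the later column of every pair with |r| > 0.85.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
reduced = kept.drop(columns=to_drop)

print(list(reduced.columns))  # ['a', 'c']
```

Only `a` and `c` survive: `const` and `quasi` fall below the variance threshold, `dup` is an exact copy of `a`, and `b` is almost perfectly correlated with `a`.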

We perform LDA and PCA for dimensionality reduction.

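We divide the data into 5 folds with shuffled K-fold cross-validation, so in each round 80% of the rows form the training set and the remaining 20% form the test set. The feature-selection filters are fitted on the training set of each fold and then applied to its test set.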
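
As a quick check of the split sizes, here is a minimal sketch with a made-up 100-row array standing in for the real feature matrix (`X_demo` is a hypothetical name, not part of the project code). With five folds, every round trains on 80% of the rows and tests on the other 20%.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the feature matrix: 100 rows, 3 columns.
X_demo = np.arange(300).reshape(100, 3)

# Five shuffled folds; each row lands in a test set exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kfold.split(X_demo)]

print(fold_sizes)  # [(80, 20), (80, 20), (80, 20), (80, 20), (80, 20)]
```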

Finally, we run Random Forest, SVM, Decision Tree and KNN classifiers on the selected features.
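
The comparison can be sketched with `cross_val_score` on synthetic data. This is an illustrative sketch, not the project's pipeline: the data comes from `make_classification`, the RBF kernel and `k=3` are assumptions chosen for the example, and the scaler is wrapped in a pipeline so that it is fitted on the training part of each fold only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the filtered features (not the project's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, random_state=0)

classifiers = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    # SVM is scale-sensitive, so scaling happens inside each training fold.
    'SVM (rbf)': make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1)),
    'Decision Tree': DecisionTreeClassifier(criterion='gini', random_state=100,
                                            max_depth=30, min_samples_leaf=5),
    'KNN (k=3)': KNeighborsClassifier(n_neighbors=3),
}

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
mean_accuracy = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=kfold, scoring='accuracy')
    mean_accuracy[name] = scores.mean() * 100
    print(f'{name}: {mean_accuracy[name]:.2f}')
```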

#### Short Description of LDA and PCA

LDA (Linear Discriminant Analysis) is a supervised data compression technique aimed at increasing class separability.
The general concept behind LDA is very similar to PCA: while PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal of LDA is to find the feature subspace that optimizes class separability, and for this purpose it requires the class labels.
PCA (Principal Component Analysis) is an unsupervised linear transformation technique.
PCA helps us identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with the same number of dimensions as the original feature space or fewer.
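
The practical difference shows up in the `fit` calls. The sketch below uses synthetic three-class data from `make_classification` (an assumption for illustration, not the project's dataset): PCA is fitted on the features alone, while LDA also takes the labels and can return at most one component fewer than the number of classes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Synthetic three-class data (a stand-in, not the project's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)

# PCA is unsupervised: fitting uses only the features.
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_demo)

# LDA is supervised: fitting also needs the class labels, and it yields
# at most (number of classes - 1) components, so 2 here.
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_demo, y_demo)

print(X_pca.shape, X_lda.shape)  # (300, 2) (300, 2)
print(pca.explained_variance_ratio_)
```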


In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import KFold

In [8]:
class Model:
    def __init__(self,location,numOfFold):
        # Loading is delegated to loadData so the two code paths cannot drift apart
        self.loadData(location,numOfFold)
    def loadData(self,location,numOfFold):
        self.fold = numOfFold
        # shuffle and random_state must be passed by keyword in current scikit-learn
        self.kFold = KFold(n_splits=numOfFold,shuffle=True,random_state=1)
        # Reset the per-fold accuracies so earlier runs do not leak into the average
        self.avg_accuracy = []
        self.data = pd.read_csv(location)
        # Impute missing values with the column mean (assumes every column is numeric)
        self.data = self.data.fillna(self.data.mean())
        self.X = self.data.drop('label',axis=1)
        self.Y = self.data['label']
        print('X shape:',str(self.X.shape))
        print('Y shape:',str(self.Y.shape))
    def removeContantFeature(self):
        #print('Removing constant feature')
        constant_filter = VarianceThreshold(threshold=0)
        constant_filter.fit(self.X_train)
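        #print('Number of constant feature ',(~constant_filter.get_support()).sum())
        # get_support() is True for kept columns, so invert it to list the constant ones
        constant_list = [not temp for temp in constant_filter.get_support()]
        self.constant_columns = self.X.columns[constant_list]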
        self.X_train_filter = constant_filter.transform(self.X_train)
        self.X_test_filter = constant_filter.transform(self.X_test)
        #print('Shape of the dataset after removal of constant features')
        #print(self.X_train_filter.shape,self.X_test_filter.shape,self.X_train.shape,'\n')
    def removeQuasiConstant(self):
        #print('Removing Quasi constant feature')
        quasi_constant_filter = VarianceThreshold(threshold = 0.01)
        quasi_constant_filter.fit(self.X_train_filter)
        #print('Number of quasi constant feature ',quasi_constant_filter.get_support().sum())
        self.X_train_quasi_filter = quasi_constant_filter.transform(self.X_train_filter)
        self.X_test_quasi_filter = quasi_constant_filter.transform(self.X_test_filter)
        #print('Shape of the dataset after removal of quasi constant features')
        #print(self.X_train_quasi_filter.shape,self.X_test_quasi_filter.shape,self.X_train.shape,'\n')
        
    def removeDuplicateFeature(self):
        X_train_T = self.X_train_quasi_filter.T
        X_test_T = self.X_test_quasi_filter.T
        X_train_T = pd.DataFrame(X_train_T)
        X_test_T = pd.DataFrame(X_test_T)
        #print('Number of duplicate feature ',X_train_T.duplicated().sum())
        duplicated_feature = X_train_T.duplicated()
        features_to_keep = [not index for index in duplicated_feature]
        self.X_train_unique = X_train_T[features_to_keep].T
        self.X_test_unique = X_test_T[features_to_keep].T
        #print('Shape of the dataset after removal of duplicate features')
        #print(self.X_train_unique.shape,self.X_test_unique.shape,self.X_train.shape,'\n')
    def get_correlation(self,data, threshold):
        corr_col = set()
        corrmat = data.corr()
        for i in range(len(corrmat.columns)):
            for j in range(i):
                if abs(corrmat.iloc[i, j])> threshold:
                    colname = corrmat.columns[i]
                    corr_col.add(colname)
        return corr_col
    def removeCorrelatedFeature(self):
        # get_correlation builds the correlation matrix itself; 0.85 is the cut-off for |r|
        corr_features = self.get_correlation(self.X_train_unique, 0.85)
        self.X_train_uncorr = self.X_train_unique.drop(labels=corr_features, axis = 1)
        self.X_test_uncorr = self.X_test_unique.drop(labels = corr_features, axis = 1)
        #print('Shape of the dataset after removal of correlated features')
        #print(self.X_train_uncorr.shape,self.X_test_uncorr.shape,self.X_train.shape)

    def runRandomForest(self,corrParm):#pass corrParm='Y' to also remove correlated features
        count = 1
        for train_index,test_index in self.kFold.split(self.data):
            self.X_train, self.X_test, self.y_train, self.y_test = self.X.iloc[train_index], self.X.iloc[test_index],self.Y.iloc[train_index], self.Y.iloc[test_index]
            #print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
            self.removeContantFeature()
            self.removeQuasiConstant()
            self.removeDuplicateFeature()
            if corrParm == 'Y':
                # Train on the features that survive the correlation filter
                self.removeCorrelatedFeature()
                X_train_final, X_test_final = self.X_train_uncorr, self.X_test_uncorr
            else:
                X_train_final, X_test_final = self.X_train_unique, self.X_test_unique
            clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
            clf.fit(X_train_final, self.y_train)
            self.y_pred = clf.predict(X_test_final)

            accuracy = accuracy_score(self.y_test, self.y_pred)*100
            print('Accuracy of fold ',str(count),': ',accuracy)
            self.avg_accuracy.append(accuracy)
            count = count+1
        accDF = pd.DataFrame(self.avg_accuracy,columns = ['Accuracy per fold'],index = None)
        print(accDF)
        print('Average accuracy of Random forest ', sum(self.avg_accuracy)/self.fold)
            
        return
    def runSVM(self,kernelTrick):
        count = 1
        scaler = StandardScaler()
        for train_index,test_index in self.kFold.split(self.data):
            self.X_train, self.X_test, self.y_train, self.y_test = self.X.iloc[train_index], self.X.iloc[test_index],self.Y.iloc[train_index], self.Y.iloc[test_index]
            #print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
            self.removeContantFeature()
            self.removeQuasiConstant()
            self.removeDuplicateFeature()
            # Fit the scaler on the training fold only and reuse its statistics on the test fold
            X_train_scaled = scaler.fit_transform(self.X_train_unique)
            X_test_scaled = scaler.transform(self.X_test_unique)
            clf = SVC(kernel = kernelTrick , C = 1)
            clf.fit(X_train_scaled, self.y_train)
            self.y_pred = clf.predict(X_test_scaled)
            accuracy = accuracy_score(self.y_test, self.y_pred)*100
            #print('Accuracy of fold ',str(count),': ',accuracy)
            self.avg_accuracy.append(accuracy)
            count = count+1
        accDF = pd.DataFrame(self.avg_accuracy,columns = ['Accuracy per fold'],index = None)
        print(accDF)
        print('Average accuracy of SVM with',kernelTrick,' : ', sum(self.avg_accuracy)/self.fold)
    def runDecisionTree(self,Criterion,corrParm):
        count = 1
        for train_index,test_index in self.kFold.split(self.data):
            self.X_train, self.X_test, self.y_train, self.y_test = self.X.iloc[train_index], self.X.iloc[test_index],self.Y.iloc[train_index], self.Y.iloc[test_index]
            #print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
            self.removeContantFeature()
            self.removeQuasiConstant()
            self.removeDuplicateFeature()
            if corrParm == 'Y':
                # Drop correlated features only when requested, then train on what is left
                self.removeCorrelatedFeature()
                X_train_final, X_test_final = self.X_train_uncorr, self.X_test_uncorr
            else:
                X_train_final, X_test_final = self.X_train_unique, self.X_test_unique
            clf = DecisionTreeClassifier(criterion = Criterion, random_state = 100,
                           max_depth=30, min_samples_leaf=5)
            clf.fit(X_train_final, self.y_train)
            self.y_pred = clf.predict(X_test_final)


            accuracy = accuracy_score(self.y_test, self.y_pred)*100
            #print('Accuracy of fold ',str(count),': ',accuracy)
            self.avg_accuracy.append(accuracy)
            count = count+1
        accDF = pd.DataFrame(self.avg_accuracy,columns = ['Accuracy per fold'],index = None)
        print(accDF)
        print('Average accuracy of Decision Tree with ',Criterion,' as criterion: ', sum(self.avg_accuracy)/self.fold)
    def runKNNClassifier(self,neighbor,corrParm):
        count = 1
        for train_index,test_index in self.kFold.split(self.data):
            self.X_train, self.X_test, self.y_train, self.y_test = self.X.iloc[train_index], self.X.iloc[test_index],self.Y.iloc[train_index], self.Y.iloc[test_index]
            #print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
            self.removeContantFeature()
            self.removeQuasiConstant()
            self.removeDuplicateFeature()
            if corrParm == 'Y':
                # Train on the features that survive the correlation filter
                self.removeCorrelatedFeature()
                X_train_final, X_test_final = self.X_train_uncorr, self.X_test_uncorr
            else:
                X_train_final, X_test_final = self.X_train_unique, self.X_test_unique
            clf = KNeighborsClassifier(n_neighbors=neighbor)
            clf.fit(X_train_final, self.y_train)
            self.y_pred = clf.predict(X_test_final)

            accuracy = accuracy_score(self.y_test, self.y_pred)*100
            #print('Accuracy of fold ',str(count),': ',accuracy)
            self.avg_accuracy.append(accuracy)
            count = count+1
        accDF = pd.DataFrame(self.avg_accuracy,columns = ['Accuracy per fold'],index = None)
        print(accDF)
        print('Average accuracy of KNN Classifier', sum(self.avg_accuracy)/self.fold)
    def runLDA(self,corrParm):
        count=0
        for train_index,test_index in self.kFold.split(self.data):
            self.X_train, self.X_test, self.y_train, self.y_test = self.X.iloc[train_index], self.X.iloc[test_index],self.Y.iloc[train_index], self.Y.iloc[test_index]
            #print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
            self.removeContantFeature()
            self.removeQuasiConstant()
            self.removeDuplicateFeature()
            if corrParm == 'Y':
                self.removeCorrelatedFeature()
                lda = LDA(n_components=1)
                X_train_lda = lda.fit_transform(self.X_train_uncorr, self.y_train)
                X_test_lda = lda.transform(self.X_test_uncorr)
                clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
                clf.fit(X_train_lda, self.y_train)
                self.y_pred = clf.predict(X_test_lda)
            else:
                lda = LDA(n_components=1)
                X_train_lda = lda.fit_transform(self.X_train_unique, self.y_train)
                X_test_lda = lda.transform(self.X_test_unique)
                clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
                clf.fit(X_train_lda, self.y_train)
                self.y_pred = clf.predict(X_test_lda)
            accuracy = accuracy_score(self.y_test, self.y_pred)*100
            #print('Accuracy of fold ',str(count),': ',accuracy)
            self.avg_accuracy.append(accuracy)
            count = count+1
        accDF = pd.DataFrame(self.avg_accuracy,columns = ['Accuracy per fold'],index = None)
        print(accDF)
        print('Average accuracy of running LDA', sum(self.avg_accuracy)/self.fold)

    def runPCA(self,corrParm):
        count=0
        for train_index,test_index in self.kFold.split(self.data):
            self.X_train, self.X_test, self.y_train, self.y_test = self.X.iloc[train_index], self.X.iloc[test_index],self.Y.iloc[train_index], self.Y.iloc[test_index]
            #print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
            self.removeContantFeature()
            self.removeQuasiConstant()
            self.removeDuplicateFeature()
            if corrParm == 'Y':
                self.removeCorrelatedFeature()
                pca = PCA(n_components=2, random_state=42)
                pca.fit(self.X_train_uncorr)
                X_train_pca = pca.transform(self.X_train_uncorr)
                X_test_pca = pca.transform(self.X_test_uncorr)
                clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
                clf.fit(X_train_pca, self.y_train)
                self.y_pred = clf.predict(X_test_pca)
            else:
                pca = PCA(n_components=2, random_state=42)
                pca.fit(self.X_train_unique)
                X_train_pca = pca.transform(self.X_train_unique)
                X_test_pca = pca.transform(self.X_test_unique)
                clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
                clf.fit(X_train_pca, self.y_train)
                self.y_pred = clf.predict(X_test_pca)
            accuracy = accuracy_score(self.y_test, self.y_pred)*100
            #print('Accuracy of fold ',str(count),': ',accuracy)
            self.avg_accuracy.append(accuracy)
            count = count+1
        accDF = pd.DataFrame(self.avg_accuracy,columns = ['Accuracy per fold'],index = None)
        print(accDF)
        print('Average accuracy of running PCA', sum(self.avg_accuracy)/self.fold)

    def showData(self):
        return self.data.head()

In [9]:
location = r'/home/mirsahib/Desktop/Project-Andromeda/Dataset/Fusing_Geometric_Feature_Extracted/fusing_geometric.csv'
FilterModel = Model(location,5)

X shape: (22797, 141)
Y shape: (22797,)


## Random Forest

In [None]:
FilterModel.runRandomForest('N')

Accuracy of fold  1 :  89.51754385964912


## Decision tree


In [None]:
FilterModel.loadData(location,5)
FilterModel.runDecisionTree('gini','N')

In [None]:
FilterModel.loadData(location,5)
FilterModel.runDecisionTree('entropy','N')

## KNN Classifier

In [None]:
FilterModel.loadData(location,5)
FilterModel.runKNNClassifier(4,'N')

In [None]:
FilterModel.loadData(location,5)
FilterModel.runKNNClassifier(3,'N')

## LDA (after removing correlated data)

In [None]:
FilterModel.loadData(location,5)
FilterModel.runLDA('Y')

## LDA (before removing correlated data)

In [None]:
FilterModel.loadData(location,5)
FilterModel.runLDA('N')

## PCA (before removing correlated data)

In [None]:
FilterModel.loadData(location,5)
FilterModel.runPCA('N')

## PCA (after removing correlated data)

In [None]:
FilterModel.loadData(location,5)
FilterModel.runPCA('Y')