<a href="https://colab.research.google.com/github/maneri13/Data-Science-Experiments/blob/master/5_XOR_Feature_Selection_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Selection Framework

This notebook is a small framework for testing combinations of feature selection methods and classification algorithms with K-fold cross-validation.

## Imports

In [1]:
import scipy.io 
import numpy as np
import pandas as pd
import csv
import seaborn as sn

from sklearn.model_selection import *
from sklearn.feature_selection import *
from sklearn.ensemble import *
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import *
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.utils import class_weight
from sklearn.metrics import *


!pip install pymrmr
import pymrmr

Collecting pymrmr
  Downloading https://files.pythonhosted.org/packages/b3/ab/903712947a2f5cd1af249132885dbd81ae8bf8cfd30fb3b3f2beddab23e8/pymrmr-0.1.8.tar.gz (65kB)
Building wheels for collected packages: pymrmr
  Building wheel for pymrmr (setup.py) ... done
  Created wheel for pymrmr: filename=pymrmr-0.1.8-cp36-cp36m-linux_x86_64.whl size=256754 sha256=187be002a9bb1c8083db4bead75f262db3b65f118c3a0fce3e04cf15258edab8
  Stored in directory: /root/.cache/pip/wheels/5b/ce/3a/bc9b80047f68973d909a35bb8e3062b7c737

## Data Initialization

To test more models, add them to the models list as (name, estimator) tuples. **Note: Every model added to the models list must have the fit(), predict() and score() methods.**

To test more feature selection methods, add them to the fs list in the same form. **Note: Every FS method added to the fs list must have the fit(), transform() and get_support() methods.**

In both cases, first read the functions in the Methods section below and make sure every method they call is available.

Source: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
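
The interface requirements above can be checked before starting a long run. The following is a minimal sketch, not part of the original notebook: the names `MODEL_METHODS`, `FS_METHODS`, `missing_methods`, `validate` and `DummyModel` are hypothetical, and get_support() is included for FS methods on the assumption that the feature_selection function in the Methods section keeps calling it.

```python
# Hypothetical helpers, not part of the original notebook.
MODEL_METHODS = ("fit", "predict", "score")
FS_METHODS = ("fit", "transform", "get_support")


def missing_methods(obj, required):
    """Return the names in `required` that `obj` does not expose as callables."""
    return [name for name in required if not callable(getattr(obj, name, None))]


def validate(entries, required):
    """Raise TypeError if any (name, object) pair lacks a required method."""
    for name, obj in entries:
        missing = missing_methods(obj, required)
        if missing:
            raise TypeError(f"{name} is missing: {', '.join(missing)}")


class DummyModel:
    """Stand-in estimator used only to demonstrate the check."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]

    def score(self, X, y):
        return 0.0


validate([("Dummy", DummyModel())], MODEL_METHODS)  # passes silently
print(missing_methods(DummyModel(), FS_METHODS))    # ['transform', 'get_support']
```

In the notebook this would be called as `validate(models, MODEL_METHODS)` and `validate(fs, FS_METHODS)` once both lists are defined.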

In [28]:
filename = '2_5XOR_rep1.csv'

metrics_columns = ['K-fold', 'FS', 'No. of columns', 'Columns selected', 'Algorithm', 'Accuracy', 'ROC', 'F1', 'APS']
metrics_df = pd.DataFrame(columns = metrics_columns)

models = [
          ('LogReg', LogisticRegression()), 
          ('RF', RandomForestClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('SVM', SVC()), 
          ('GNB', GaussianNB()),
          ('ADA', AdaBoostClassifier()),
          ('XGB', XGBClassifier()),
          # ('Isolation Forest', IsolationForest()),
          # ('Stacking Classifier', StackingClassifier(estimators= 5)),
          # ('Voting Classifier', VotingClassifier(estimators= 5)),
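          # Estimator passed positionally: the keyword is base_estimator in older scikit-learn releases and estimator in newer ones.
          ('Bagging', BaggingClassifier(RandomForestClassifier())),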
          ('Hist Gradient', HistGradientBoostingClassifier())
        ]

fs = [
      ('Chi-Squared(Best 2)', SelectKBest(chi2, k=2)),
      ('F-Classification(Best 2)', SelectKBest(f_classif, k=2)),
      ('FPR test', SelectFpr(chi2, alpha=0.8)),
      # ('FDR Test', SelectFdr(f_classif, alpha=0.8)),
      # ('FWE Test', SelectFwe(f_classif, alpha=0.9)),
      ('Variance Threshold', VarianceThreshold(threshold=(.2 * (1 - .2)))),
      ('RFE', RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)),
      # ('Mutual Information Classification', mutual_info_classif(X,y)),
      ('Lasso(SVC)', SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))),
      ('Tree Classifier', SelectFromModel(RandomForestClassifier())),
      ('Logistic Regression', SelectFromModel(LogisticRegression())),
      ('AdaBoost', SelectFromModel(AdaBoostClassifier())),
      ('Extra Trees', SelectFromModel(ExtraTreesClassifier())),
      ('Gradient Boost', SelectFromModel(GradientBoostingClassifier()))
]

In [18]:
metrics_df.head()

Unnamed: 0,K-fold,FS,No. of columns,Columns selected,Algorithm,Accuracy,ROC,F1,APS


## File Upload

In [11]:
from google.colab import files
uploaded = files.upload()

## Initial Dataset Analysis
**Note: Needs to be changed for every data set**

In [12]:
df = pd.read_csv(filename, header= None,
                 delimiter=',', encoding="utf-8")

y = df[5]
X = df.drop(labels= 5, axis= 1)
print(X.size)
print(X.head())

3500
   0  1  2  3  4
0  0  0  1  0  0
1  0  1  0  0  1
2  1  0  0  0  0
3  0  1  0  1  0
4  0  1  0  0  1


## Methods

*   data_split: Reads the CSV file and returns X (features) and y (target). **Note: Needs to be changed for every data set.**
*   model_test: Takes the fold number, the FS method name, the number and names of the selected columns, and the train/test split. Loops over the models list, fits and scores each model, and appends one row per model to metrics_df.
*   feature_selection: Takes the fold number and the train/test split. Loops over the fs list, fits each feature selection method on the training data, and calls model_test on the reduced data at the end of each iteration.
*   Main: Splits the data into K folds and calls feature_selection once per fold.
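
The call structure described above amounts to a three-level loop: folds, then feature selection methods, then models. The sketch below uses placeholder labels only (nothing is fitted) and assumes the uncommented entries of the fs and models lists; it shows how many rows metrics_df should contain after a full run and in which order they are appended.

```python
from itertools import product

# Placeholder labels standing in for the notebook's KFold splits and its
# `fs` and `models` lists; only the uncommented entries are counted.
n_folds = 10
fs_names = ["Chi-Squared(Best 2)", "F-Classification(Best 2)", "FPR test",
            "Variance Threshold", "RFE", "Lasso(SVC)", "Tree Classifier",
            "Logistic Regression", "AdaBoost", "Extra Trees", "Gradient Boost"]
model_names = ["LogReg", "RF", "KNN", "SVM", "GNB", "ADA", "XGB",
               "Bagging", "Hist Gradient"]

# Main -> feature_selection -> model_test, flattened into one sequence.
rows = list(product(range(n_folds), fs_names, model_names))

print(len(rows))  # 10 * 11 * 9 = 990
print(rows[0])    # (0, 'Chi-Squared(Best 2)', 'LogReg')
```

In this ordering, row 145 is fold 1, Lasso(SVC), RF, which agrees with the table in the Metrics section.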



In [13]:
def data_split():
  df = pd.read_csv(filename, header= None,
                 delimiter=',', encoding="utf-8")
  y = df[5]
  X = df.drop(labels= 5, axis= 1)
  return X,y

def model_test(k_fold, fs, no_columns, sel_columns, X_train, y_train, X_test, y_test):
  for name, model in models:
    clf = model
    print(name)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    roc = roc_auc_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    aps = average_precision_score(y_test, predictions)
    score = clf.score(X_test, y_test)
    print('Accuracy Score: ' + str(score))
    print('ROC Score: ' + str(roc))
    print('F1 score: ' + str(f1))
    print('Average Precision Score: ' + str(aps))
    row_metrics = pd.DataFrame([
                         [k_fold, fs, no_columns, sel_columns,name, score, roc, f1, aps] 
                         ], columns = metrics_columns)
    global metrics_df
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
    metrics_df = pd.concat([metrics_df, row_metrics], ignore_index=True)


def feature_selection(k_fold, X_train, y_train, X_test, y_test):
  for name, selection in fs:
    print('\nOriginal number of rows and columns in training data set: ' + str(X_train.shape))
    print('Original number of rows and columns in testing data set: ' + str(X_test.shape))
    print('Feature Selection Method: ' + name)
    sel = selection
    sel.fit(X_train, y_train)
    support = sel.get_support()  # boolean mask over the input columns
    no_col = int(support.sum())
    # Use the columns of the data passed in rather than the global X.
    sel_col = X_train.columns[support].tolist()
    print('Number of features selected: ' + str(no_col))
    print('Number of features not selected: ' + str(len(support) - no_col))
    print('columns selected: ' + str(sel_col))
    X_new_train = sel.transform(X_train)
    X_new_test = sel.transform(X_test)
    print('The new training set has rows and columns: ' + str(X_new_train.shape))
    print('The new testing set has rows and columns: ' + str(X_new_test.shape)+ '\n')
    model_test(k_fold, name, no_col, sel_col, X_new_train, y_train, X_new_test, y_test)

def Main():
  # Load the data once; the original call inside the loop discarded its return value.
  X, y = data_split()
  kf = KFold(n_splits=10, random_state=1, shuffle=True)
  count = 0
  for train_index, test_index in kf.split(X):
    print("This is Fold: " + str(count))
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print('This is the size of the training set: ' + str(X_train.shape))
    print('This is the size of the testing set: ' + str(X_test.shape))

    feature_selection(count, X_train, y_train, X_test, y_test)
    count = count + 1

## Program Call

In [29]:
metrics_df.head()
Main()

Streaming output truncated to the last 5000 lines.
GNB
Accuracy Score: 0.6
ROC Score: 0.6003276003276002
F1 score: 0.588235294117647
Average Precision Score: 0.532034632034632
ADA
Accuracy Score: 0.6
ROC Score: 0.6003276003276002
F1 score: 0.588235294117647
Average Precision Score: 0.532034632034632
XGB
Accuracy Score: 0.6
ROC Score: 0.6036036036036037
F1 score: 0.611111111111111
Average Precision Score: 0.5332112332112332
Bagging
Accuracy Score: 0.4714285714285714
ROC Score: 0.48525798525798525
F1 score: 0.5647058823529413
Average Precision Score: 0.46423576423576424
Hist Gradient
Accuracy Score: 0.6
ROC Score: 0.6036036036036037
F1 score: 0.611111111111111
Average Precision Score: 0.5332112332112332

Original number of rows and columns in training data set: (630, 5)
Original number of rows and columns in testing data set: (70, 5)
Feature Selection Method: Extra Trees
Number of features selected: 2
Number of features not selected: 3
columns selected: [3, 4]
The new train

## Metrics

In [30]:
print(metrics_df.shape) # Should be 990 rows: 10 folds x 11 FS methods x 9 models
metrics_df.head(150)


(990, 9)


| | K-fold | FS | No. of columns | Columns selected | Algorithm | Accuracy | ROC | F1 | APS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Chi-Squared(Best 2) | 2 | [0, 1] | LogReg | 0.428571 | 0.426699 | 0.393939 | 0.440903 |
| 1 | 0 | Chi-Squared(Best 2) | 2 | [0, 1] | RF | 0.428571 | 0.426699 | 0.393939 | 0.440903 |
| 2 | 0 | Chi-Squared(Best 2) | 2 | [0, 1] | KNN | 0.571429 | 0.573301 | 0.571429 | 0.513315 |
| 3 | 0 | Chi-Squared(Best 2) | 2 | [0, 1] | SVM | 0.428571 | 0.426699 | 0.393939 | 0.440903 |
| 4 | 0 | Chi-Squared(Best 2) | 2 | [0, 1] | GNB | 0.428571 | 0.426699 | 0.393939 | 0.440903 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 1 | Lasso(SVC) | 5 | [0, 1, 2, 3, 4] | RF | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 146 | 1 | Lasso(SVC) | 5 | [0, 1, 2, 3, 4] | KNN | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 147 | 1 | Lasso(SVC) | 5 | [0, 1, 2, 3, 4] | SVM | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 148 | 1 | Lasso(SVC) | 5 | [0, 1, 2, 3, 4] | GNB | 0.514286 | 0.519247 | 0.540541 | 0.481354 |


In [33]:
from google.colab import drive
drive.mount('drive')
metrics_df.to_csv('Dataset_5_XOR.csv')
!cp Dataset_5_XOR.csv "drive/My Drive/"