<a href="https://colab.research.google.com/github/maneri13/Data-Science-Experiments/blob/master/100_XOR_Feature_Selection_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Selection Framework

This notebook provides a small framework for testing combinations of feature selection methods and classification algorithms under K-fold cross-validation.

## Imports

In [None]:
import scipy.io 
import numpy as np
import pandas as pd
import csv
import seaborn as sn

from sklearn.model_selection import *
from sklearn.feature_selection import *
from sklearn.ensemble import *
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import *
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.utils import class_weight
from sklearn.metrics import *


!pip install pymrmr
import pymrmr


## Data Initialization

To test more models, add them to the `models` list as `(name, estimator)` tuples. **Note: Any model added to the list must provide `fit()`, `predict()` and `score()` methods.**

To test more feature selection methods, add them to the `fs` list as `(name, selector)` tuples. **Note: Any FS method added to the list must provide `fit()`, `transform()` and `get_support()` methods.**

In both cases, first read the functions in the Methods section below and make sure the methods they call are available.

Source: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
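
As a rough illustration of the model interface required above, the sketch below defines a hypothetical `MajorityClassifier` (not part of scikit-learn or of this notebook) that exposes the three methods the framework calls. It assumes the labels are hashable and that `X` supports `len()`.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Remember the most common label seen during training.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # One prediction per row of X.
        return [self.majority_] * len(X)

    def score(self, X, y):
        # Mean accuracy, mirroring what scikit-learn classifiers return from score().
        predictions = self.predict(X)
        hits = sum(1 for p, t in zip(predictions, y) if p == t)
        return hits / len(X)


# Tiny usage example with made-up data.
X_demo = [[0, 1], [1, 1], [1, 0], [0, 0]]
y_demo = [1, 0, 1, 1]
baseline = MajorityClassifier().fit(X_demo, y_demo)
demo_predictions = baseline.predict(X_demo)
demo_accuracy = baseline.score(X_demo, y_demo)
```

A class like this could then be registered as `('Majority', MajorityClassifier())`, which also gives a useful floor to compare the real models against.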

In [None]:
filename = '2_100XOR_rep1.csv'

metrics_columns = ['K-fold', 'FS', 'No. of columns', 'Columns selected', 'Algorithm', 'Accuracy', 'ROC', 'F1', 'APS']
metrics_df = pd.DataFrame(columns = metrics_columns)

# fs_columns = ['K-Fold', 'FS', 'No. of columns', 'Columns selected']
# fs_table = pd.DataFrame(columns= fs_columns)

models = [
          ('LogReg', LogisticRegression()), 
          ('RF', RandomForestClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('SVM', SVC()), 
          ('GNB', GaussianNB()),
          ('ADA', AdaBoostClassifier()),
          ('XGB', XGBClassifier()),
          # ('Isolation Forest', IsolationForest()),
          # ('Stacking Classifier', StackingClassifier(estimators= 5)),
          # ('Voting Classifier', VotingClassifier(estimators= 5)),
          ('Bagging', BaggingClassifier(base_estimator= RandomForestClassifier())),
          ('Hist Gradient', HistGradientBoostingClassifier())
        ]

fs = [
      ('Chi-Squared(Best 2)', SelectKBest(chi2, k=2)),
      ('F-Classification(Best 2)', SelectKBest(f_classif, k=2)),
      ('FPR test', SelectFpr(chi2, alpha=0.10)),
      ('FDR Test', SelectFdr(f_classif, alpha=0.5)),
      ('FWE Test', SelectFwe(f_classif, alpha=0.5)),
      ('Variance Threshold', VarianceThreshold(threshold=(.2 * (1 - .2)))),
      ('RFE', RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)),
      # ('Mutual Information Classification', mutual_info_classif(X,y)),
      ('Lasso(SVC)', SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))),
      ('Tree Classifier', SelectFromModel(RandomForestClassifier())),
      ('Logistic Regression', SelectFromModel(LogisticRegression())),
      ('AdaBoost', SelectFromModel(AdaBoostClassifier())),
      ('Extra Trees', SelectFromModel(ExtraTreesClassifier())),
      ('Gradient Boost', SelectFromModel(GradientBoostingClassifier()))
]

In [None]:
metrics_df.head()

Unnamed: 0,K-fold,FS,No. of columns,Columns selected,Algorithm,Accuracy,ROC,F1,APS


## File Upload

In [None]:
from google.colab import files
uploaded = files.upload()

Saving 2_100XOR_rep1.csv to 2_100XOR_rep1.csv


## Initial Dataset Analysis
**Note: Needs to be changed for every data set**
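
One way to reduce the per-dataset changes is to treat the last column as the target instead of hard-coding its index. The helper below is a hypothetical sketch (`load_xy` is not part of this notebook) and assumes a headerless CSV whose final column holds the label.

```python
import io

import pandas as pd


def load_xy(source):
    """Hypothetical loader: features are all columns but the last, target is the last."""
    df = pd.read_csv(source, header=None, delimiter=',')
    X = df.iloc[:, :-1]  # every column except the last
    y = df.iloc[:, -1]   # last column holds the label
    return X, y


# Made-up three-row sample standing in for a real file path.
sample = io.StringIO("0,1,1\n1,1,0\n1,0,1\n")
X_demo, y_demo = load_xy(sample)
print(X_demo.shape)
print(list(y_demo))
```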

In [None]:
df = pd.read_csv(filename, header= None,
                 delimiter=',', encoding="utf-8")

y = df[100]
X = df.drop(labels= 100, axis= 1)
print(X.size)
print(X.head())

70000
   0   1   2   3   4   5   6   7   8   ...  91  92  93  94  95  96  97  98  99
0   0   1   0   0   1   0   1   0   1  ...   1   1   0   1   0   1   0   1   0
1   0   1   1   1   1   1   0   0   0  ...   1   1   0   1   1   0   1   1   0
2   1   0   1   1   0   0   0   0   0  ...   1   1   1   1   1   0   0   1   0
3   1   1   0   1   1   1   1   0   0  ...   0   1   1   1   0   0   1   1   1
4   0   1   0   1   0   0   1   1   0  ...   0   0   0   0   0   0   1   0   1

[5 rows x 100 columns]


## Methods

*   data_split: Reads the CSV file and returns X (features) and y (target). **Note: Needs to be changed for every data set.**
*   model_test: Takes the fold number, the feature selection name, the number and names of the selected columns, and the train/test split (X_train, y_train, X_test, y_test). Loops over the models list, fits and scores each model, and appends one row per model to metrics_df.
*   feature_selection: Takes the fold number and the train/test split. Loops over the fs list, fits each feature selection method on the training data, and calls model_test on the reduced data at the end of each iteration.
*   Main: Splits the data into K folds and runs feature_selection on each fold.
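
The nesting described above (folds, then feature selection methods, then models) determines how many rows `metrics_df` ends up with. A minimal sketch, using only the names from the `fs` and `models` collections defined under Data Initialization and assuming 10 folds as in `Main`:

```python
from itertools import product

n_folds = 10

fs_names = [
    'Chi-Squared(Best 2)', 'F-Classification(Best 2)', 'FPR test', 'FDR Test',
    'FWE Test', 'Variance Threshold', 'RFE', 'Lasso(SVC)', 'Tree Classifier',
    'Logistic Regression', 'AdaBoost', 'Extra Trees', 'Gradient Boost',
]

model_names = [
    'LogReg', 'RF', 'KNN', 'SVM', 'GNB', 'ADA', 'XGB', 'Bagging', 'Hist Gradient',
]

# One metrics row is produced per (fold, feature selection method, model) triple.
runs = list(product(range(n_folds), fs_names, model_names))
expected_rows = len(runs)
print(expected_rows)
```

Adding or commenting out an entry in either collection changes this count, so it is worth recomputing before checking the shape of the results.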



In [None]:
def data_split():
  df = pd.read_csv(filename, header= None,
                 delimiter=',', encoding="utf-8")
  y = df[100]
  X = df.drop(labels= 100, axis= 1)
  return X,y

def model_test(k_fold, fs, no_columns, sel_columns, X_train, y_train, X_test, y_test):
  for name, model in models:
    clf = model
    print(name)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    roc = roc_auc_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    aps = average_precision_score(y_test, predictions)
    score = clf.score(X_test, y_test)
    print('Accuracy Score: ' + str(score))
    print('ROC Score: ' + str(roc))
    print('F1 score: ' + str(f1))
    print('Average Precision Score: ' + str(aps))
    row_metrics = pd.DataFrame([
                         [k_fold, fs, no_columns, sel_columns, name, score, roc, f1, aps]
                         ], columns = metrics_columns)
    global metrics_df
    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    metrics_df = pd.concat([metrics_df, row_metrics], ignore_index=True)


def feature_selection(k_fold, X_train, y_train, X_test, y_test):
  for name, selection in fs:
    print('\nOriginal number of rows and columns in training data set: ' + str(X_train.shape))
    print('Original number of rows and columns in testing data set: ' + str(X_test.shape))
    print('Feature Selection Method: ' + name)
    sel = selection
    sel.fit(X_train, y_train)
    support = sel.get_support()  # boolean mask over the input columns
    no_col = int(support.sum())
    # Use the training frame's own columns rather than the global X
    sel_col = list(X_train.columns[support])
    print('Number of features selected: ' + str(no_col))
    print('Number of features not selected: ' + str(len(support) - no_col))
    print('columns selected: ' + str(sel_col))
    X_new_train = sel.transform(X_train)
    X_new_test = sel.transform(X_test)
    print('The new training set has rows and columns: ' + str(X_new_train.shape))
    print('The new testing set has rows and columns: ' + str(X_new_test.shape)+ '\n')
    model_test(k_fold, name, no_col, sel_col, X_new_train, y_train, X_new_test, y_test)

def Main():
  # Load the data once and reuse it for every fold
  X, y = data_split()
  kf = KFold(n_splits=10, random_state= 1, shuffle = True)
  for count, (train_index, test_index) in enumerate(kf.split(X)):
    print("This is Fold: " + str(count))

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print('This is the size of the training set: ' + str(X_train.shape))
    print('This is the size of the testing set: ' + str(X_test.shape))

    feature_selection(count, X_train, y_train, X_test, y_test)

## Program Call

In [None]:
metrics_df.head()
Main()

Streaming output truncated to the last 5000 lines.
Average Precision Score: 0.44493087557603683
SVM
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064
GNB
Accuracy Score: 0.45714285714285713
ROC Score: 0.4598842018196857
F1 score: 0.4411764705882353
Average Precision Score: 0.4247353344127538
ADA
Accuracy Score: 0.45714285714285713
ROC Score: 0.4598842018196857
F1 score: 0.4411764705882353
Average Precision Score: 0.4247353344127538
XGB
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064
Bagging
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064
Hist Gradient
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064

Original number of rows a

## Metrics

In [None]:
print(metrics_df.shape)  # Expect 1170 rows: 10 folds x 13 FS methods x 9 models
metrics_df.head(150)


(1170, 9)


Unnamed: 0,K-fold,FS,No. of columns,Columns selected,Algorithm,Accuracy,ROC,F1,APS
0,0,Chi-Squared(Best 2),2,"[0, 25]",LogReg,0.485714,0.485714,0.454545,0.493088
1,0,Chi-Squared(Best 2),2,"[0, 25]",RF,0.557143,0.557143,0.626506,0.530952
2,0,Chi-Squared(Best 2),2,"[0, 25]",KNN,0.500000,0.500000,0.666667,0.500000
3,0,Chi-Squared(Best 2),2,"[0, 25]",SVM,0.557143,0.557143,0.626506,0.530952
4,0,Chi-Squared(Best 2),2,"[0, 25]",GNB,0.485714,0.485714,0.454545,0.493088
...,...,...,...,...,...,...,...,...,...
145,1,FDR Test,3,"[0, 25, 91]",RF,0.500000,0.499183,0.520548,0.513878
146,1,FDR Test,3,"[0, 25, 91]",KNN,0.485714,0.487745,0.454545,0.508333
147,1,FDR Test,3,"[0, 25, 91]",SVM,0.528571,0.531863,0.476190,0.531481
148,1,FDR Test,3,"[0, 25, 91]",GNB,0.514286,0.513072,0.540541,0.520969


In [None]:
from google.colab import drive
drive.mount('drive')
metrics_df.to_csv('Dataset_1.csv')
!cp Dataset_1.csv "drive/My Drive/"

Mounted at drive
