<a href="https://colab.research.google.com/github/maneri13/Data-Science-Experiments/blob/master/Feature_Selection_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Selection Framework

The purpose of this notebook is to make it a bit easier to test feature selection and classification algorithms.

## Imports

In [2]:
import scipy.io 
import numpy as np
import pandas as pd
import csv
import seaborn as sn

from sklearn.model_selection import *
from sklearn.feature_selection import *
from sklearn.ensemble import *
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import *
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.utils import class_weight
from sklearn.metrics import *


!pip install pymrmr
import pymrmr

Collecting pymrmr
  Downloading https://files.pythonhosted.org/packages/b3/ab/903712947a2f5cd1af249132885dbd81ae8bf8cfd30fb3b3f2beddab23e8/pymrmr-0.1.8.tar.gz (65kB)
Building wheels for collected packages: pymrmr
  Building wheel for pymrmr (setup.py) ... done
  Created wheel for pymrmr: filename=pymrmr-0.1.8-cp36-cp36m-linux_x86_64.whl size=256745 sha256=716e20d803f7b6be168f0d2e03ff8b9d52af177a2abae63143b46d00e0273a67
  Stored in directory: /root/.cache/pip/wheels/5b/ce/3a/bc9b80047f68973d909a35bb8e3062b7c737

## Data Initialization

To test more models, add them to the `models` list as `(name, estimator)` tuples. **Note: Every estimator added to `models` must implement `fit()`, `predict()` and `score()`.**

To test more feature selection methods, add them to the `fs` list in the same way. **Note: Every selector added to `fs` must implement `fit()`, `transform()` and `get_support()`.**

In both cases, first read the functions in the Methods section below and make sure the methods they call are available on whatever you add.

Source: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
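
As a rough illustration of the requirements above, the sketch below checks whether a candidate estimator or selector exposes the methods the framework calls before it is added to `models` or `fs`. The helper `has_methods` and the two candidate entries (`DecisionTreeClassifier`, `SelectPercentile`) are hypothetical additions for this example and are not part of the original lists.

```python
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.tree import DecisionTreeClassifier


def has_methods(obj, names):
    """Return True if obj exposes every name in `names` as a callable attribute."""
    return all(callable(getattr(obj, name, None)) for name in names)


# Hypothetical candidates, in the same (name, object) form as the lists below.
extra_model = ('DT', DecisionTreeClassifier())
extra_fs = ('F-Classification (top 10%)', SelectPercentile(f_classif, percentile=10))

# Methods called by model_test and feature_selection respectively.
model_ok = has_methods(extra_model[1], ('fit', 'predict', 'score'))
fs_ok = has_methods(extra_fs[1], ('fit', 'transform', 'get_support'))
print(model_ok, fs_ok)
```

If both checks pass, the entries can be appended with `models.append(extra_model)` and `fs.append(extra_fs)` once those lists exist.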

In [4]:
filename = '2_100XOR_rep1.csv'

model_columns = ['K-Fold', 'FS', 'Algorithm', 'Accuracy', 'ROC', 'F1', 'APS']
model_table = pd.DataFrame(columns = model_columns)

fs_columns = ['K-Fold', 'FS', 'No. of columns', 'Columns selected']
fs_table = pd.DataFrame(columns= fs_columns)

models = [
          ('LogReg', LogisticRegression()), 
          ('RF', RandomForestClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('SVM', SVC()), 
          ('GNB', GaussianNB()),
          ('ADA', AdaBoostClassifier()),
          ('XGB', XGBClassifier()),
          # ('Isolation Forest', IsolationForest()),
          # ('Stacking Classifier', StackingClassifier(estimators= 5)),
          # ('Voting Classifier', VotingClassifier(estimators= 5)),
          ('Bagging', BaggingClassifier(base_estimator= RandomForestClassifier())),
          ('Hist Gradient', HistGradientBoostingClassifier())
        ]

fs = [
      ('Chi-Squared(Best 2)', SelectKBest(chi2, k=2)),
      ('F-Classification(Best 2)', SelectKBest(f_classif, k=2)),
      ('FPR test', SelectFpr(chi2, alpha=0.10)),
      ('FDR Test', SelectFdr(f_classif, alpha=0.5)),
      ('FWE Test', SelectFwe(f_classif, alpha=0.5)),
      ('Variance Threshold', VarianceThreshold(threshold=(.2 * (1 - .2)))),
      ('RFE', RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)),
      # ('Mutual Information Classification', mutual_info_classif(X,y)),
      ('Lasso(SVC)', SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))),
      ('Tree Classifier', SelectFromModel(RandomForestClassifier())),
      ('Logistic Regression', SelectFromModel(LogisticRegression())),
      ('AdaBoost', SelectFromModel(AdaBoostClassifier())),
      ('Extra Trees', SelectFromModel(ExtraTreesClassifier())),
      ('Gradient Boost', SelectFromModel(GradientBoostingClassifier()))
]

In [7]:
model_table.head()
fs_table.head()

Unnamed: 0,K-Fold,FS,No. of columns,Columns selected


## File Upload

In [22]:
from google.colab import files
uploaded = files.upload()

Saving 2_100XOR_rep1.csv to 2_100XOR_rep1 (1).csv


## Initial Dataset Analysis

In [23]:
df = pd.read_csv(filename, header= None,
                 delimiter=',', encoding="utf-8")

y = df[100]
X = df.drop(labels= 100, axis= 1)
print(X.size)
print(X.head())

70000
   0   1   2   3   4   5   6   7   8   ...  91  92  93  94  95  96  97  98  99
0   0   1   0   0   1   0   1   0   1  ...   1   1   0   1   0   1   0   1   0
1   0   1   1   1   1   1   0   0   0  ...   1   1   0   1   1   0   1   1   0
2   1   0   1   1   0   0   0   0   0  ...   1   1   1   1   1   0   0   1   0
3   1   1   0   1   1   1   1   0   0  ...   0   1   1   1   0   0   1   1   1
4   0   1   0   1   0   0   1   1   0  ...   0   0   0   0   0   0   1   0   1

[5 rows x 100 columns]


## Methods

*   data_split: Reads the CSV file and returns X (features) and y (target). **Note: Needs to be changed for every data set.**
*   model_test: Takes the split data set as 4 inputs (X_train, y_train, X_test, y_test). Loops over the `models` list, fits each model on the training split and prints its metrics on the test split.
*   feature_selection: Takes the split data set as the same 4 inputs. Loops over the `fs` list and applies each feature selection method. The last line of the loop body calls model_test on the reduced data.
*   Main: Splits the data into K folds and runs feature_selection on each fold.
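
The flow described in the list above can be sketched compactly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's own dataset. The names `selectors`, `classifiers`, `rows` and `results` are placeholders chosen for this example, and the column names only loosely follow `model_columns`. It also collects one row per (fold, selector, model) into a DataFrame instead of printing.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data (assumption: binary target, numeric features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0)

selectors = [('F-Classification(Best 5)', SelectKBest(f_classif, k=5))]
classifiers = [('LogReg', LogisticRegression(max_iter=1000))]

rows = []
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
    y_tr, y_te = y_demo[train_idx], y_demo[test_idx]
    for fs_name, selector in selectors:
        # Fit the selector on the training fold only, then reduce both splits.
        selector.fit(X_tr, y_tr)
        X_tr_sel = selector.transform(X_tr)
        X_te_sel = selector.transform(X_te)
        for clf_name, clf in classifiers:
            clf.fit(X_tr_sel, y_tr)
            rows.append({
                'K-Fold': fold,
                'FS': fs_name,
                'Algorithm': clf_name,
                'Accuracy': clf.score(X_te_sel, y_te),
            })

results = pd.DataFrame(rows)
print(results)
```

Fitting the selector inside the fold keeps the test rows out of the feature ranking, which is the same order of operations the functions below follow.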



In [24]:
def data_split():
  df = pd.read_csv(filename, header= None,
                 delimiter=',', encoding="utf-8")
  y = df[100]
  X = df.drop(labels= 100, axis= 1)
  return X,y

def model_test(X_train, y_train, X_test, y_test):
  for name, model in models:
    clf = model
    print(name)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    roc = roc_auc_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    aps = average_precision_score(y_test, predictions)
    score = clf.score(X_test, y_test)
    print('Accuracy Score: ' + str(score))
    print('ROC Score: ' + str(roc))
    print('F1 score: ' + str(f1))
    print('Average Precision Score: ' + str(aps))


def feature_selection(X_train, y_train, X_test, y_test):
  for name, selection in fs:
    print('\nOriginal number of rows and columns in training data set: ' + str(X_train.shape))
    print('Original number of rows and columns in testing data set: ' + str(X_test.shape))
    print('Feature Selection Method: ' + name)
    sel = selection
    sel.fit(X_train, y_train)
    # Use the columns of the training split rather than the global X.
    support = sel.get_support()
    selected = list(X_train.columns[support])
    print('Number of features selected: ' + str(len(selected)))
    print('Number of features not selected: ' + str(len(support) - len(selected)))
    print('columns selected: ' + str(selected))
    X_new_train = sel.transform(X_train)
    X_new_test = sel.transform(X_test)
    print('The new training set has rows and columns: ' + str(X_new_train.shape))
    print('The new testing set has rows and columns: ' + str(X_new_test.shape)+ '\n')
    model_test(X_new_train, y_train, X_new_test, y_test)

def Main():
  # Load the data once and reuse it for every fold.
  X, y = data_split()
  kf = KFold(n_splits=10, random_state= 1, shuffle = True)
  count = 0
  for train_index, test_index in kf.split(X):
    print("This is Fold: " + str(count))

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print('This is the size of the training set: ' + str(X_train.size))
    print('This is the size of the testing set: ' + str(X_test.size))

    feature_selection(X_train, y_train, X_test, y_test)
    count = count + 1
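
`model_test` above passes hard class predictions to `roc_auc_score` and `average_precision_score`. Both metrics are defined over ranking scores, so continuous scores are usually more informative when a model can provide them. The sketch below shows one possible way to obtain such scores. The helper `positive_class_scores` is hypothetical, and the example assumes a binary target and uses synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def positive_class_scores(clf, X):
    """Return a continuous score per row for the positive class (binary case)."""
    if hasattr(clf, 'predict_proba'):
        # Column 1 corresponds to clf.classes_[1], the positive class.
        return clf.predict_proba(X)[:, 1]
    # Fallback for models such as SVC() that only expose a decision margin.
    return clf.decision_function(X)


X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

for name, clf in [('LogReg', LogisticRegression(max_iter=1000)), ('SVM', SVC())]:
    clf.fit(X_tr, y_tr)
    scores = positive_class_scores(clf, X_te)
    print(name, roc_auc_score(y_te, scores), average_precision_score(y_te, scores))
```

Accuracy and F1 still need the hard predictions from `predict()`, so a change along these lines would only affect the ROC and average precision lines.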

## Program Call

In [25]:
Main()

Streaming output truncated to the last 5000 lines.
Average Precision Score: 0.44493087557603683
SVM
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064
GNB
Accuracy Score: 0.45714285714285713
ROC Score: 0.4598842018196857
F1 score: 0.4411764705882353
Average Precision Score: 0.4247353344127538
ADA
Accuracy Score: 0.45714285714285713
ROC Score: 0.4598842018196857
F1 score: 0.4411764705882353
Average Precision Score: 0.4247353344127538
XGB
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064
Bagging
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064
Hist Gradient
Accuracy Score: 0.5285714285714286
ROC Score: 0.553763440860215
F1 score: 0.5925925925925926
Average Precision Score: 0.4716129032258064

Original number of rows a

## MRMR (raises an error on the current dataset)

In [None]:
df = pd.read_csv("2_100XOR_rep1.csv", header= None,
                 delimiter=',', encoding="utf-8")

y = df[100]
X = df.drop(labels= 100, axis= 1)


pymrmr.mRMR(df, 'MIQ',6)


AttributeError: ignored
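
One possible cause of the error, not verified here, is the shape of the input frame. The pymrmr documentation describes the input as a DataFrame whose first column is the target, whereas `df` above has the target in the last column and integer column labels. The sketch below only shows the pandas reshaping on a toy frame. The helper `to_mrmr_frame`, the `'class'` column name and the `'f'` prefix are assumptions made for this example, and it is an assumption that string column names are required.

```python
import pandas as pd


def to_mrmr_frame(X, y, target_name='class'):
    """Return a frame with the target first and string column names."""
    features = X.copy()
    features.columns = ['f' + str(c) for c in features.columns]
    target = pd.Series(y.to_numpy(), index=features.index, name=target_name)
    return pd.concat([target, features], axis=1)


# Toy stand-in for the real data: two features, target in the last column.
toy = pd.DataFrame([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
toy_y = toy[2]
toy_X = toy.drop(labels=2, axis=1)

mrmr_df = to_mrmr_frame(toy_X, toy_y)
print(mrmr_df.columns.tolist())  # ['class', 'f0', 'f1']
```

With the real data, the reshaped frame would be passed in place of `df`, for example `pymrmr.mRMR(to_mrmr_frame(X, y), 'MIQ', 6)`. Whether this resolves the error has not been tested.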

In [None]:
print(table)

[['Variance Threshold', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]], ['K Best (Best 5)', [0, 1, 25, 26, 91]], ['Tree Classifier', [0, 1, 2, 5, 6, 7, 8, 12, 14, 17, 24, 25, 30, 31, 33, 39, 40, 41, 47, 50, 51, 52, 54, 57, 58, 59, 67, 72, 73, 75, 76, 78, 79, 81, 83, 84, 87, 91, 94, 95, 98, 99]], ['Lasso', [0, 1, 2, 5, 6, 7, 9, 14, 15, 20, 21, 22, 25, 26, 29, 31, 35, 36, 37, 38, 40, 52, 55, 57, 59, 61, 62, 63, 68, 69, 70, 71, 72, 74, 75, 76, 78, 83, 85, 89, 91, 93, 98, 99]]]
