
# Feature Selection

Import necessary libraries

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd

Read the Iris dataset into a dataframe

In [2]:
iris = pd.read_csv("https://datahub.io/machine-learning/iris/r/iris.csv")

Choose feature values

In [3]:
features = iris[['sepallength', 'sepalwidth', 'petallength', 'petalwidth']]

Choose target values

In [4]:
target = iris['class']

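Select the two highest-scoring of the four available features, ranked by the chi-squared (chi2) test. chi2 requires non-negative feature values, which holds for these measurements.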

In [5]:
selector = SelectKBest(chi2, k=2)
selector.fit(features, target)
cols = selector.get_support(indices=True)

bestFeatures = features.iloc[:,cols]

dic = {'feature': features.columns, 'score': selector.scores_}
scores = pd.DataFrame(data=dic)
print(scores)

       feature       score
0  sepallength   10.817821
1   sepalwidth    3.594499
2  petallength  116.169847
3   petalwidth   67.244828
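To make the scores above less opaque, here is a minimal sketch of how a chi-squared score of this kind can be computed by hand. It assumes the usual formulation for non-negative features: the observed per-class sum of a feature is compared with the sum expected if the feature were independent of the class. The function name `chi2_scores` and the toy data are hypothetical and are not part of this notebook.

```python
from collections import defaultdict


def chi2_scores(rows, labels):
    """Return one chi-squared score per feature column (non-negative features)."""
    n_samples = len(rows)
    n_features = len(rows[0])

    # Observed: the sum of each feature within each class.
    observed = defaultdict(lambda: [0.0] * n_features)
    class_counts = defaultdict(int)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    # Total of each feature over all samples (assumed to be positive).
    feature_totals = [sum(row[j] for row in rows) for j in range(n_features)]

    scores = []
    for j in range(n_features):
        score = 0.0
        for label, count in class_counts.items():
            # Expected: the class's share of the samples times the feature total.
            expected = feature_totals[j] * count / n_samples
            score += (observed[label][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


# Hypothetical toy data: feature 0 is identical in both classes, feature 1 is not.
toy_rows = [[5.0, 1.0], [5.0, 1.2], [5.0, 4.0], [5.0, 4.2]]
toy_labels = ["a", "a", "b", "b"]
toy_scores = chi2_scores(toy_rows, toy_labels)
print(toy_scores)
```

A feature whose per-class sums match the class proportions scores zero, while a feature whose values differ between classes scores higher. This is why the petal measurements rank above the sepal measurements in the table above.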


Instantiate a logistic regression model (a classifier, despite the name)

In [6]:
model = LogisticRegression()

Fit the model using the entire dataset.

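Note: A later notebook covers separate training/test sets and validation. Until then, the accuracy reported here is measured on the same data the model was trained on, so it is likely optimistic.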

In [7]:
model.fit(bestFeatures, target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
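The selection and fitting steps can also be chained so that they are always applied together. The following is a sketch rather than this notebook's method. It assumes scikit-learn's bundled copy of Iris (`load_iris`) in place of the CSV download, so the column names differ and the chi2 scores may not match the table above exactly. The step names `select` and `model` and the `max_iter=1000` setting are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumption: the bundled Iris copy stands in for the CSV used in this notebook.
bunch = load_iris()
X, y = bunch.data, bunch.target

# Chain feature selection and the classifier into a single estimator.
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=2)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# get_support() returns a boolean mask over the original columns.
mask = pipe.named_steps["select"].get_support()
kept = [name for name, keep in zip(bunch.feature_names, mask) if keep]
print(kept)

# Accuracy on the training data itself, as in the rest of this notebook.
train_accuracy = pipe.score(X, y)
print(train_accuracy)
```

Calling `pipe.predict` on new rows with all four columns applies the same column selection before the model sees the data, which removes the need to track `cols` by hand.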

Make predictions with the fitted model

In [8]:
predict = model.predict(bestFeatures)

Create a new dataframe with truth/prediction data

In [9]:
dic = {'truth': target, 'predict': predict}
classes = pd.DataFrame(data=dic)
classes['correct'] = classes.predict == classes.truth

Print the first few observations in the new dataframe

In [10]:
classes.head()

         truth      predict  correct
0  Iris-setosa  Iris-setosa     True
1  Iris-setosa  Iris-setosa     True
2  Iris-setosa  Iris-setosa     True
3  Iris-setosa  Iris-setosa     True
4  Iris-setosa  Iris-setosa     True


Calculate and print the model accuracy

In [11]:
print(f"Model accuracy: {classes.correct.mean() * 100:.2f}%")

Model accuracy: 96.67%


Print the incorrect predictions

In [12]:
classes.loc[~classes.correct]

               truth          predict  correct
 70  Iris-versicolor   Iris-virginica    False
 77  Iris-versicolor   Iris-virginica    False
 83  Iris-versicolor   Iris-virginica    False
106   Iris-virginica  Iris-versicolor    False
119   Iris-virginica  Iris-versicolor    False
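The misclassified rows above can be tallied by (truth, predict) pair to see which classes are confused, and the same count reproduces the reported accuracy. This is a small sketch using only the standard library. The pairs are copied from the output above, and the total of 150 observations is an assumption based on the standard Iris dataset.

```python
from collections import Counter

# (truth, predict) pairs copied from the misclassified rows printed above.
errors = [
    ("Iris-versicolor", "Iris-virginica"),
    ("Iris-versicolor", "Iris-virginica"),
    ("Iris-versicolor", "Iris-virginica"),
    ("Iris-virginica", "Iris-versicolor"),
    ("Iris-virginica", "Iris-versicolor"),
]

# Count how often each kind of confusion occurs.
confusions = Counter(errors)
for (truth, predicted), count in confusions.items():
    print(f"{truth} predicted as {predicted}: {count}")

# Assumption: the dataset holds the standard 150 Iris observations.
n_samples = 150
accuracy = round((n_samples - len(errors)) / n_samples * 100, 2)
print(f"Model accuracy: {accuracy}%")
```

All five errors are between Iris-versicolor and Iris-virginica. No Iris-setosa observation is misclassified.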
