### Feature Selection Exercise

In this exercise, we will apply feature selection to the Iris flowers dataset, where the target variable is Species. The goal is to identify the features that are most relevant for distinguishing the species of each Iris flower. The dataset is from: https://www.kaggle.com/datasets/uciml/iris
Demos of some of the methods are available in the repository.

1. Load the dataset from the exercise's GitHub repository (Iris.csv).
2. Using business logic and common sense, drop features that are clearly irrelevant to the target variable.
3. Preprocess your data (split it into training and testing sets).
4. Apply feature selection using any 3 (three) different methods:
    - Pearson's correlation coefficient (r)
    - Kendall's tau (τ)
    - Mutual Information (MI)
    - Logistic Regression with L1 penalty
    - Any other feature selection method or model

    (Hint) Since the target variable, Species, is categorical, you can instead apply the numerical methods between the numerical predictor variables themselves to reduce feature redundancy.
5. Compare the results of each feature selection method:
    - Which features did you manually drop before applying the feature selection methods? Explain why.
    - Are there any common features selected across multiple methods?
    - Can you explain why certain features were selected based on their characteristics?
(Optional) Visualize the feature importances with bar charts or heatmaps to make the comparison easier.
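
The hint in step 4 can be sketched as follows. This is a minimal illustration rather than the exercise solution: it assumes scikit-learn's bundled copy of the Iris measurements as a stand-in for Iris.csv, and both the helper `redundant_pairs` and the 0.9 threshold are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
iris = load_iris(as_frame=True)
features = iris.data  # the four numeric measurement columns


def redundant_pairs(df, method="pearson", threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked-out entries are NaN and fail the comparison, so they drop out here
    return pairs[pairs > threshold].sort_values(ascending=False)


pearson_pairs = redundant_pairs(features, method="pearson")
kendall_pairs = redundant_pairs(features, method="kendall")

print("Pearson:", pearson_pairs, sep="\n")
print("Kendall:", kendall_pairs, sep="\n")
```

A pair that shows up here is a candidate for dropping one of its two features, since both carry largely the same information. Which one to keep is a judgment call that the correlation alone does not settle.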



In [8]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.preprocessing import LabelEncoder
from scipy.stats import pearsonr, kendalltau
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [9]:
data = pd.read_csv('Iris.csv')

In [10]:
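# Encode the categorical columns as integers. Each column gets its own encoder
# so that the fitted class mappings stay available afterwards.
colour_encoder = LabelEncoder()
species_encoder = LabelEncoder()
data["FlowerColour"] = colour_encoder.fit_transform(data["FlowerColour"])
data["Species"] = species_encoder.fit_transform(data["Species"])

# Separate the predictors from the target
X = data.drop("Species", axis=1)
y = data["Species"]

# Manually drop columns judged irrelevant to the species
X = X.drop(['Id', 'MonthCollected', 'YearCollected'], axis=1)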

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Feature Selection

In [20]:
# Mutual information: keep the 3 highest-scoring features
selector = SelectKBest(mutual_info_classif, k=3)
selected_features = selector.fit(X_train, y_train).get_support(indices=True)
print("Mutual Info Selected Features:", X_train.columns[selected_features])

Mutual Info Selected Features: Index(['PetalLengthCm', 'PetalWidthCm', 'StigmaLegnth'], dtype='object')


In [25]:
# ANOVA F-test: keep the 3 highest-scoring features
selector = SelectKBest(f_classif, k=3)
selected_features = selector.fit(X_train, y_train).get_support(indices=True)
print("ANOVA Selected Features:", X_train.columns[selected_features])

ANOVA Selected Features: Index(['PetalLengthCm', 'PetalWidthCm', 'StigmaLegnth'], dtype='object')
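
`SelectKBest` only reports which columns survive, so it can help to look at the underlying scores as well. The sketch below calls the two scoring functions directly and places the results side by side. It assumes scikit-learn's bundled Iris data in place of Iris.csv, so the extra columns of the exercise dataset do not appear, and the variable names are chosen for the example only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# f_classif returns (F-statistics, p-values); mutual_info_classif returns scores only
f_scores, p_values = f_classif(X_tr, y_tr)
# random_state fixes the nearest-neighbour estimate so reruns give the same scores
mi_scores = mutual_info_classif(X_tr, y_tr, random_state=0)

score_table = pd.DataFrame(
    {"anova_f": f_scores, "anova_p": p_values, "mutual_info": mi_scores},
    index=X_tr.columns,
).sort_values("anova_f", ascending=False)

print(score_table)
```

A table like this shows how far apart the features are, not just their ranking, which is useful when choosing `k`.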


In [24]:
# Chi-squared test (requires non-negative feature values): keep the 3 highest-scoring features
selector = SelectKBest(chi2, k=3)
selected_features = selector.fit(X_train, y_train).get_support(indices=True)
print("Chi-squared Info Selected Features:", X_train.columns[selected_features])

Chi-squared Info Selected Features: Index(['PetalLengthCm', 'PetalWidthCm', 'StigmaLegnth'], dtype='object')
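
The exercise also lists logistic regression with an L1 penalty, which is not run above. A minimal sketch of that method follows, again assuming scikit-learn's bundled Iris data rather than Iris.csv. The value `C=0.5`, the `saga` solver, and the "sum of absolute coefficients" importance measure are illustrative choices, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Standardize with training statistics only, so the penalty treats all features alike
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)

# The saga solver supports the L1 penalty; C=0.5 is an arbitrary example value
model = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
model.fit(X_tr_scaled, y_tr)

# coef_ has one row per class; a feature whose column is all zeros was eliminated
importance = pd.Series(np.abs(model.coef_).sum(axis=0), index=X_tr.columns)
kept = importance[importance > 0].sort_values(ascending=False)

print("Features with non-zero coefficients:")
print(kept)
```

Smaller values of `C` strengthen the penalty and push more coefficients to exactly zero, so the set of surviving features depends on that setting.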
