<center>
  <a href="MLSD-04-FeatureEngineering-Ex-3.ipynb" target="_self">Feature Engineering Exercise 3</a> | <a href="./">Content Page</a> | <a href="MLSD-05-FeatureSelection-B.ipynb">Feature Selection B</a> | <a href="MLSD-05-FeatureSelection-Ex-1.ipynb">Feature Selection Exercise 1</a>
</center>

# <center>FEATURE SELECTION A</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Definition
Feature selection is the process of choosing, automatically or manually, the features that contribute most to the target variable you want to predict.

# Importance
Having irrelevant features in your data can decrease the accuracy of the machine learning models.

The top reasons to use feature selection are:
- It enables the machine learning algorithm to train faster.
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- It reduces overfitting.

# Feature Selection Methods
**Univariate Selection**
- Statistical tests can help to select independent features that have the strongest relationship with the target feature in your dataset e.g. chi-square test.
- The Scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.
- <b>Chi-square Test</b>
- <b>Logit (Logistic Regression model)</b>

**Feature Importance**
- Feature importance gives you a score for each feature of your data. 
- The higher the score, the more important or relevant that feature is to your target feature.
- Feature importance is available as a built-in attribute (`feature_importances_`) of fitted tree-based classifiers such as Random Forest Classifiers and <b>Extra Trees Classifiers</b>.

**Correlation**
- Correlation shows how the features are related to each other or to the target feature.
- Correlation can be positive (as the value of the feature increases, the value of the target tends to increase) or negative (as the value of the feature increases, the value of the target tends to decrease).
- <b>Pearson Correlation</b>
- <b>Correlation Matrix Heatmap</b>

**Iterative Search**
- <b>Forward stepwise selection</b>
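
Forward stepwise selection starts with no features and repeatedly adds the one that most improves a cross-validated score. The cell below is a minimal sketch of the idea, assuming scikit-learn 0.24 or later (for `SequentialFeatureSelector`) and using logistic regression on the iris data as an illustrative estimator; neither choice comes from this notebook.

```python
# Forward stepwise selection sketch (estimator and dataset are illustrative choices)
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Estimator whose cross-validated accuracy guides the search
estimator = LogisticRegression(max_iter=1000)

# Start from zero features and greedily add one at a time until two are kept
selector = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction='forward', cv=5
)
selector.fit(X, y)

# get_support() returns a boolean mask over the original columns
selected = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print('Selected features:', selected)
```

Setting `direction='backward'` instead would start from all features and remove one at a time.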

# Univariate Selection

In [None]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
# Load iris data
iris_dataset = load_iris()

In [None]:
# Create features and target
X = iris_dataset.data
y = iris_dataset.target

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# Truncate the measurements to integers so they behave like counts
# (chi2 only requires non-negative feature values, so this step is optional)
X = X.astype(int)

In [None]:
# Two features with highest chi-squared statistics are selected
chi2_features = SelectKBest(chi2, k = 2)
X_kbest_features = chi2_features.fit_transform(X, y)

In [None]:
# Reduced features
print('Original feature number:', X.shape[1])
print('Reduced  feature number:', X_kbest_features.shape[1])

**Observations**:
- SelectKBest kept the two features, out of the original four, with the highest chi-squared scores, i.e. the strongest statistical relationship with the target feature.

In [None]:
X[0:10]

In [None]:
X_kbest_features[0:10]
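
The transformed array does not say which columns were kept. A short sketch, using the fitted selector's `scores_` attribute and `get_support()` method, shows one way to recover the feature names and their chi-squared scores; the variable names here are illustrative.

```python
# Inspect which features SelectKBest kept and how each one scored
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer truncation as in the cells above
y = iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# scores_ holds one chi-squared statistic per original feature;
# get_support() is a boolean mask marking the k features that were kept
for name, score, keep in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f'{name:20s} chi2 = {score:8.2f}  selected = {keep}')

selected_names = [n for n, keep in zip(iris.feature_names, selector.get_support()) if keep]
print('Selected:', selected_names)
```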

# Feature Importance

In [None]:
# Import libraries
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
# Load iris data
iris_dataset = load_iris()

In [None]:
# Create features and target
X = iris_dataset.data
y = iris_dataset.target

In [None]:
# Truncate the measurements to integers, as in the chi-square example
# (tree-based models do not require this step)
X = X.astype(int)

In [None]:
# Building the model
extra_tree_forest = ExtraTreesClassifier(n_estimators = 5,
                                        criterion ='entropy', max_features = 2)

In [None]:
# Training the model
extra_tree_forest.fit(X, y)

In [None]:
# Computing the importance of each feature
feature_importance = extra_tree_forest.feature_importances_

In [None]:
# Averaging the per-tree importances (each tree's importances already sum to 1)
feature_importance_normalized = np.mean([tree.feature_importances_ for tree in 
                                        extra_tree_forest.estimators_],
                                        axis = 0)

In [None]:
# Plotting a bar graph to compare the features
plt.bar(iris_dataset.feature_names, feature_importance_normalized)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()

**Observations**:
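- The most important features are typically petal length (cm) and petal width (cm).
- The least important feature is typically sepal width (cm).
- Because the model uses only 5 randomized trees and no fixed `random_state`, the exact ranking can vary between runs.
- This suggests you can train your model on the most important features alone and still expect good performance.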
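
One way to check this claim is to compare cross-validated accuracy with all features against accuracy with only the top-ranked ones. The cell below is a rough sketch under assumed settings (100 trees, `random_state=0`, 5-fold cross-validation) that are not from this notebook. Note that the ranking is computed on the full dataset before cross-validation, which is slightly optimistic; a stricter comparison would rank features inside each fold.

```python
# Compare accuracy using all features vs. only the two most important ones
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# A fixed random_state makes the ranking reproducible (assumed setting)
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Indices of features sorted from most to least important
ranking = np.argsort(model.feature_importances_)[::-1]
top2 = ranking[:2]
print('Top two features:', [iris.feature_names[i] for i in top2])

# Mean 5-fold accuracy with every feature and with the top two only
score_all = cross_val_score(model, X, y, cv=5).mean()
score_top2 = cross_val_score(model, X[:, top2], y, cv=5).mean()
print(f'All features : {score_all:.3f}')
print(f'Top 2 only   : {score_top2:.3f}')
```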

# Correlation

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load data
dataframe = pd.read_csv('./data/housePrices/train.csv')
dataframe.head()

In [None]:
# Keep a subset of numeric columns plus the target (SalePrice)
columns = ['OverallQual', 'TotalBsmtSF', 'GarageArea', 'GarageCars', '1stFlrSF',
           'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'FullBath', 'GrLivArea',
           'TotRmsAbvGrd', 'MoSold', 'YrSold', 'SalePrice']
df = dataframe[columns]

In [None]:
df.head()

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
# save heatmap as .png file
# dpi - sets the resolution of the saved image in dots/inches
# bbox_inches - when set to 'tight' - does not allow the labels to be cropped
plt.savefig('heatmap.png', dpi=300, bbox_inches='tight')

**Observations**:
- The correlation coefficient ranges from -1 to 1. 
- If the value is close to 1, it means that there is a strong positive correlation between the two features. 
- When it is close to -1, the features have a strong negative correlation.
- Features that are strongly correlated with each other convey largely the same information, so it is usually reasonable to keep only one of them.
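
Reading redundant pairs off a heatmap becomes tedious with many columns. The cell below is a small sketch of one common approach: scan the upper triangle of the absolute correlation matrix and drop one feature from each pair above a threshold. The toy data, its column names (`area`, `rooms`, `month_sold`), and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
# Drop one feature from each highly correlated pair (toy data, hypothetical columns)
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    'area': area,
    'rooms': area / 250 + rng.normal(0, 0.3, n),  # built to track 'area' closely
    'month_sold': rng.integers(1, 13, n),         # unrelated to the others
})

# Absolute Pearson correlations; keep only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cut-off for "strongly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print('Dropped :', to_drop)
print('Kept    :', list(reduced.columns))
```

The same steps could be applied to `df` from the cells above after excluding the target column, so that `SalePrice` itself is never dropped.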
