<a href="https://colab.research.google.com/github/mkjubran/ENCS5141-INTELLIGENT-SYSTEMS-LAB/blob/main/ENCS5141_Exp3_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## EXPLORATORY DATA ANALYSIS – FEATURE SELECTION AND FEATURE ENGINEERING

In this notebook, we will demonstrate Feature Selection and Feature Engineering as part of Exploratory Data Analysis (EDA). We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). The dataset has been cleaned as presented in another notebook (https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb). 



# Import Libraries

First, we import the libraries used throughout this notebook for data handling and plotting.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preparation

**Clone the dataset Repository**

The modified dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_train_cleaned.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/). Note that the file uses ';' as the field separator.

In [None]:
df = pd.read_csv("/content/AIData/cardio_train_cleaned.csv",sep=";")
df.head()

**Display Data Info and Check NAN**

To display the columns of the dataframe and the type of each feature, use the info() method.

In [None]:
df.info()

Here the dataframe consists of 69917 rows with 12 variables (features). Ten features are numerical and two features are objects (gender, smoke). We notice that the number of non-null values of all features is equal to 69917 which means that none of the features has missing values. We can also use the isnull() method to check for any missing data.

In [None]:
df.isnull().sum()

As expected there are no NaN values as the data has been cleaned.
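
To illustrate what `isnull().sum()` reports when values are missing, here is a minimal sketch on a hypothetical toy dataframe. The column names echo the dataset, but the values are invented for illustration and are not taken from cardio_train_cleaned.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberately missing entries.
toy = pd.DataFrame({
    "weight": [70.0, np.nan, 82.5],
    "smoke": ["no", "yes", None],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

A column total of zero for every feature, as in the cleaned dataset, means no imputation or row removal is needed before the analysis.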

# Feature Selection and Feature Engineering

Now that we have identified the type of each feature, let us take a closer look at the statistical properties of the numeric features.

In [None]:
df.describe()

**Is the dataset balanced?**

The target feature "cardio" equals 1 when a patient has cardiovascular disease and 0 when the patient is healthy. Let us get the target variable distribution.

In [None]:
No_cardio_0 = df[df.cardio==0].shape[0] ## number of records with no cardiovascular disease
No_cardio_1 = df[df.cardio==1].shape[0] ## number of records with cardiovascular disease
print('cardio_0 = {}, cardio_1 = {} , Percentage of cardio_1 = {} %'.format(No_cardio_0,No_cardio_1,(No_cardio_1/(No_cardio_0+No_cardio_1))*100))

The dataset is balanced because the two target classes occur in nearly equal proportions. This can also be observed from the bar plot below.

In [None]:
rcParams['figure.figsize'] = 10, 6
df['cardio'].value_counts().plot(kind='bar').set_title('Target class : Cardio')
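
The class shares can also be obtained in one step with `value_counts(normalize=True)`. The sketch below uses a hypothetical eight-record target column (invented values), and the `imbalance_ratio` name is an illustrative helper, not something defined in this notebook.

```python
import pandas as pd

# Hypothetical target column: four healthy (0) and four diseased (1) records.
cardio = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="cardio")

# Fraction of records in each class, ordered by class label.
class_share = cardio.value_counts(normalize=True).sort_index()

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = float(class_share.max() / class_share.min())

print(class_share)
print("imbalance ratio:", imbalance_ratio)
```

A ratio close to 1 supports treating the dataset as balanced; a much larger ratio would suggest considering resampling or class weights.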

**Examine 'age' and 'age_years' features**

Let us now examine how each feature relates to the target variable by checking how its values are distributed across the two target classes. Let us start with the 'age_years' feature.

In [None]:
rcParams['figure.figsize'] = 10, 6
pd.crosstab(df.age_years,df.cardio).plot(kind='bar', ylabel='count', xlabel='years')

It can be observed that people aged 55 and above are more likely to have cardiovascular disease. The age, whether given by 'age' or 'age_years', is therefore an important feature for the prediction of cardiovascular disease and will be selected for ML. Deciding which features to keep in this way is called feature selection.
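
The same relationship can be summarised numerically as the share of positive records per age. This is a minimal sketch on hypothetical data (the ages and labels are invented); because 'cardio' is coded 0/1, its mean within a group equals the disease rate of that group.

```python
import pandas as pd

# Hypothetical records: age in years and the 0/1 target.
toy = pd.DataFrame({
    "age_years": [40, 40, 50, 50, 60, 60, 60, 60],
    "cardio":    [0,  0,  0,  1,  1,  1,  0,  1],
})

# Mean of a 0/1 column per group = fraction of positives in that group.
rate_by_age = toy.groupby("age_years")["cardio"].mean()

print(rate_by_age)
```

A rate that rises steadily with age, as in this toy example, is the numeric counterpart of the pattern seen in the bar chart.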

**Examine 'height' and 'weight' features**

Let us check the 'height' and 'weight' features as well by plotting the distribution of these features according to the target feature

In [None]:
rcParams['figure.figsize'] = 12, 6
pd.crosstab(df.height,df.cardio).plot(kind='bar', ylabel='count', xlabel='height (cm)', title='height')

In [None]:
rcParams['figure.figsize'] = 20, 6
pd.crosstab(df.weight.round(),df.cardio).plot(kind='bar', ylabel='count', xlabel='weight (kg)', title='weight')

Although the trend is not pronounced, the height chart suggests that records with low 'height' values (shorter persons) are more likely to have cardiovascular disease than records with high 'height' values (taller persons). The trend is clearer in the weight chart: records with low 'weight' values are less likely to have cardiovascular disease than records with high 'weight' values. We could therefore create a new feature as a function of 'height' and 'weight', the Body Mass Index (BMI), and check the distribution of this feature with the target feature.

The formula is $\text{BMI} = \dfrac{\text{weight}}{\text{height}^2}$, where weight is in kilograms and height is in meters, so BMI is expressed in kg/m$^2$. Since 'height' is recorded in centimeters in this dataset, it is divided by 100 before squaring.
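
As a quick check of the formula, here is a small worked example. The `bmi` function name and the sample person (70 kg, 175 cm) are hypothetical and used only for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100  # convert centimeters to meters
    return weight_kg / height_m ** 2


# 70 kg at 1.75 m: 70 / 3.0625
example = bmi(70, 175)
print(round(example, 1))  # → 22.9
```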

In [None]:
df['BMI'] = df['weight']/((df['height']/100)**2)
rcParams['figure.figsize'] = 20, 6
pd.crosstab(df.BMI.round(),df.cardio).plot(kind='bar', ylabel='count', xlabel='BMI', title='BMI')

Now we can observe clearly that records with low BMI values are less likely to have cardiovascular disease than records with high BMI values. So we could use the BMI feature instead of the 'height' and 'weight' features. Creating a new feature from existing ones in this way is called feature engineering.

**Examine the 'ap_hi' and 'ap_lo' features**

Let us examine the 'ap_hi' and 'ap_lo' features and their distribution with the target feature

In [None]:
rcParams['figure.figsize'] = 12, 6
pd.crosstab((df.ap_hi/5).round()*5,df.cardio).plot(kind='bar', ylabel='count', xlabel='ap_hi', title='ap_hi')

In [None]:
rcParams['figure.figsize'] = 12, 6
pd.crosstab((df.ap_lo/5).round()*5,df.cardio).plot(kind='bar', ylabel='count', xlabel='ap_lo', title='ap_lo')

We observe clearly that records with low 'ap_lo' and 'ap_hi' values are less likely to have cardiovascular disease than records with high 'ap_lo' and 'ap_hi' values. So we should use the 'ap_lo' and 'ap_hi' features for the classification of cardiovascular disease.
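
The two plots above group the readings into 5-unit bins using the expression `(x/5).round()*5`, which snaps each value to the nearest multiple of 5 so the bar chart stays readable. A minimal sketch with hypothetical readings (invented values):

```python
import pandas as pd

# Hypothetical systolic readings.
ap_hi = pd.Series([118, 120, 122, 123, 127])

# Divide by the bin width, round to the nearest integer, scale back up.
binned = (ap_hi / 5).round() * 5

print(binned.tolist())  # → [120.0, 120.0, 120.0, 125.0, 125.0]
```

The bin width of 5 is a presentation choice; a wider bin gives fewer, taller bars, while a narrower one shows more detail at the cost of clutter.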

**Examine the 'cholesterol' feature**

In [None]:
rcParams['figure.figsize'] = 8, 4
pd.crosstab(df.cholesterol.round(3),df.cardio).plot(kind='bar', ylabel='count', xlabel='cholesterol')

We observe clearly that records with low cholesterol values are less likely to have cardiovascular disease than records with high cholesterol values. So the 'cholesterol' feature is important for the prediction of cardiovascular disease.

**Examine the 'gluc' feature**

In [None]:
rcParams['figure.figsize'] = 8, 4
pd.crosstab(df.gluc,df.cardio).plot(kind='bar', ylabel='count', xlabel='gluc')

As with the 'cholesterol' feature, the chart shows clearly that the 'gluc' feature is important for the prediction of cardiovascular disease.

**Examine the 'alco' feature**

Let us examine the 'alco' feature

In [None]:
rcParams['figure.figsize'] = 8, 4
pd.crosstab(df.alco,df.cardio).plot(kind='bar', ylabel='count', xlabel='alco')

The heights of the bars in the chart are very close, so let us get numeric results (proportions).

For persons who do not consume alcohol, the distribution of the target feature is as follows

In [None]:
pd.crosstab(df.alco,df.cardio).iloc[0]/pd.crosstab(df.alco,df.cardio).iloc[0].sum()

For persons who consume alcohol, the distribution of the target feature is as follows

In [None]:
pd.crosstab(df.alco,df.cardio).iloc[1]/pd.crosstab(df.alco,df.cardio).iloc[1].sum()

For persons who do not consume alcohol, the two target classes are almost equally likely. For persons who consume alcohol, however, 45.8% have cardiovascular disease compared to 54.2% who do not.
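
The per-row proportions computed above can also be obtained in a single call with the `normalize="index"` option of `pd.crosstab`. The sketch below uses a hypothetical toy dataframe (invented values), so its numbers do not correspond to the dataset.

```python
import pandas as pd

# Hypothetical records: alcohol flag and 0/1 target.
toy = pd.DataFrame({
    "alco":   [0, 0, 0, 0, 1, 1, 1, 1],
    "cardio": [0, 1, 0, 1, 0, 0, 0, 1],
})

# normalize="index" divides each row by its row total,
# giving the target distribution within each 'alco' group.
row_share = pd.crosstab(toy["alco"], toy["cardio"], normalize="index")

print(row_share)
```

Each row sums to 1, so the rows can be compared directly even when the groups differ greatly in size.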

**Examine the 'active' feature**

Let us examine the 'active' feature

In [None]:
rcParams['figure.figsize'] = 8, 4
pd.crosstab(df.active,df.cardio).plot(kind='bar', ylabel='count', xlabel='active')

The chart shows that the 'active' feature is important: records with low 'active' values are more likely to have cardiovascular disease than records with high 'active' values. So we will keep this feature for the prediction of cardiovascular disease.

**Examine the 'gender' feature**

Let us examine the 'gender' feature

In [None]:
pd.crosstab(df.gender,df.cardio).plot(kind='bar', ylabel='count', xlabel='gender')

The heights of the bars in the chart are very close, so let us get numeric results (proportions).

In [None]:
pd.crosstab(df.cardio,df.gender)/pd.crosstab(df.cardio,df.gender).sum()

The numeric crosstab values are also very close across genders. So the 'gender' feature could be ignored during the training of the model for the prediction of cardiovascular disease.
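
One rough way to quantify "very close" is the gap between the highest and lowest disease rate across the categories of a feature. The helper `positive_rate_gap` below is hypothetical (it is not part of this notebook or of pandas), and the toy values are invented. A small gap only hints at a weak marginal relationship; it is a sketch, not a formal statistical test.

```python
import pandas as pd


def positive_rate_gap(frame, feature, target="cardio"):
    """Largest difference in the target's positive rate across feature categories."""
    rates = frame.groupby(feature)[target].mean()
    return float(rates.max() - rates.min())


# Hypothetical records with invented category labels.
toy = pd.DataFrame({
    "gender": ["f", "m", "f", "m", "f", "m", "f", "m"],
    "active": [0, 0, 0, 0, 1, 1, 1, 1],
    "cardio": [1, 1, 0, 1, 0, 0, 1, 0],
})

gender_gap = positive_rate_gap(toy, "gender")
active_gap = positive_rate_gap(toy, "active")

print("gender gap:", gender_gap)
print("active gap:", active_gap)
```

In this toy data the gap is zero for 'gender' and large for 'active', mirroring the kind of contrast used above to decide which features to keep.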

**Examine the 'smoke' feature**

Let us examine the 'smoke' feature

In [None]:
pd.crosstab(df.smoke,df.cardio).plot(kind='bar', ylabel='count', xlabel='smoke')

The heights of the bars in the chart are very close, so let us get numeric results (proportions).

In [None]:
pd.crosstab(df.cardio,df.smoke)/pd.crosstab(df.cardio,df.smoke).sum()

For non-smokers, the two target classes are almost equally likely. For smokers, however, 45.8% have cardiovascular disease compared to 54.2% who do not. This is very similar to the 'alco' feature.

**Conclusion**

Based on the analysis above, we will drop the 'height' and 'weight' features, which are replaced by 'BMI', and the 'gender' feature, which shows little relationship with the target. We will also drop the 'age_years' feature because it is redundant with the 'age' feature.

In [None]:
# drop() returns a new dataframe, so assign the result back to keep the change
df = df.drop(['height', 'weight', 'gender', 'age_years'], axis=1)
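
Note that `DataFrame.drop` returns a new dataframe by default; the original is changed only if the result is assigned back (or `inplace=True` is passed). A minimal sketch on a hypothetical one-row dataframe (invented values):

```python
import pandas as pd

# Hypothetical single record.
toy = pd.DataFrame({"height": [170], "weight": [70.0], "BMI": [24.2], "cardio": [0]})

# The result is discarded here, so 'toy' still has all four columns.
toy.drop(["height", "weight"], axis=1)
columns_after_discarded_drop = list(toy.columns)

# Assigning the result keeps the reduced dataframe.
reduced = toy.drop(columns=["height", "weight"])

print(columns_after_discarded_drop)
print(list(reduced.columns))
```

This matters for the save step that follows: only a dataframe that actually had its columns removed will produce a CSV file without them.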

# Save Data

Now, we will save the resulting dataset, with the selected and engineered features, into a CSV file to be used in the next session.

In [None]:
df.to_csv("/content/AIData/cardio_EDA.csv",sep=";",index=False)

Check the '/content/AIData/' folder for the 'cardio_EDA.csv' file and download it for future usage.
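
Because the file is written with `sep=";"`, it must be read back with the same separator. The sketch below checks this round trip on a hypothetical toy dataframe (invented values), using an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row dataframe.
toy = pd.DataFrame({"BMI": [22.5, 27.25], "cardio": [0, 1]})

# Write with ';' as separator and without the index column.
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index=False)
header_line = buffer.getvalue().splitlines()[0]

# Read back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")

print(header_line)
print(restored.equals(toy))
```

Passing `index=False` avoids writing the row index as an extra unnamed column, which would otherwise reappear as a spurious feature when the file is loaded.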