<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_FEATURE_SELECTION_AND_FEATURE_ENGINEERING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## EXPLORATORY DATA ANALYSIS – FEATURE SELECTION AND FEATURE ENGINEERING

In this notebook, we will demonstrate Feature Selection and Feature Engineering as part of Exploratory Data Analysis (EDA). We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). The dataset has been cleaned as presented in another notebook (https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb). 



# Import Libraries

First, we import the libraries that will be used throughout this notebook.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import rcParams

# Data Preparation

**Clone the dataset Repository**

The modified dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as shown below.

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_train_cleaned.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/cardio_train_cleaned.csv",sep=";")
df.head()

**Display Data Info and Check NAN**

To display the columns of the data and the type of each feature, use the info() method.

In [None]:
df.info()

Here the dataframe consists of 69917 rows with 12 variables (features). Ten features are numerical and two are objects (gender, smoke). The number of non-null values for every feature equals 69917, which means that none of the features has missing values. We can also use the isnull() method to check for any missing data.

In [None]:
df.isnull().sum()

As expected, there are no NaN values because the data has already been cleaned.
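
The same checks can be tried on a small stand-alone example. The sketch below uses a hypothetical toy dataframe (the values are made up and are not taken from the cardiovascular dataset) with one deliberately missing entry, so that isnull() has something to report. It also uses select_dtypes() to list the numerical columns, which is one possible way to separate them from the object columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from the cardiovascular dataset
toy = pd.DataFrame({
    "age": [18393, 20228, np.nan],        # one missing value on purpose
    "gender": ["female", "male", "male"],
    "weight": [62.0, 85.0, 64.0],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()

# Names of the numerical columns; the remaining columns are non-numerical
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
other_columns = [c for c in toy.columns if c not in numeric_columns]

print(missing_per_column)
print("numerical:", numeric_columns)
print("non-numerical:", other_columns)
```

On the cleaned dataset the per-column counts are expected to be all zeros, as shown above.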

# Feature Selection and Feature Engineering

Now that we have identified the type of each feature, let us take a closer look at the statistical properties of the numeric features.

In [None]:
df.describe()

Age is measured in days, height in centimeters, and weight in kilograms. Let us convert the age into years so that it is easier to interpret.

In [None]:
df['age_years'] = (df['age'] / 365).round().astype('int')
df.drop('age', axis=1, inplace=True)
df.head()
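
To see what this conversion does to individual values, here is a minimal sketch on a hypothetical toy dataframe (the three ages in days are illustrative only). It applies the same rounding as the cell above and, for comparison, a floor-division variant that yields completed years. The variant is an assumption about what a reader might prefer, not part of the original notebook.

```python
import pandas as pd

# Hypothetical ages in days, for illustration only
toy = pd.DataFrame({"age": [18393, 20228, 14791]})

# Same approach as in the notebook: divide by 365 and round to the nearest year
toy["age_years"] = (toy["age"] / 365).round().astype("int")

# Alternative (assumption): completed years, obtained by floor division
toy["age_years_floor"] = toy["age"] // 365

print(toy)
```

The two columns differ only when the fractional part of the age is at least half a year, as for the third row (about 40.5 years).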

The target feature "cardio" equals 1 when a patient has cardiovascular disease and 0 when the patient is healthy. Let's look at the numerical variables and how they are distributed across this target feature.

In [None]:
rcParams['figure.figsize'] = 11, 8
pd.crosstab(df.age_years,df.cardio).plot(kind='bar', ylabel='count', xlabel='years')
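
The table behind this bar chart can be inspected directly. The sketch below builds a hypothetical toy dataframe (made-up values) and computes the same kind of cross-tabulation. As an optional extra that is not part of the original notebook, it also passes normalize="index" to pd.crosstab to obtain the share of each target class within every age group, which can be easier to compare when the age groups have different sizes.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age_years": [40, 40, 50, 50, 50, 60],
    "cardio":    [0,  0,  0,  1,  1,  1],
})

# Counts of each target class per age, as plotted in the notebook
counts = pd.crosstab(toy.age_years, toy.cardio)

# Optional: row-wise shares, so that each age group sums to 1
shares = pd.crosstab(toy.age_years, toy.cardio, normalize="index")

print(counts)
print(shares)
```

Calling .plot(kind='bar') on either table produces a bar chart analogous to the one in the cell above.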