<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## EXPLORATORY DATA ANALYSIS – DATA CLEANING

# Import Libraries

First, we need to import some libraries that will be used during data cleaning.

In [1]:
import numpy as np
import pandas as pd

# Preparing Data

In this notebook, we demonstrate data cleaning as part of Exploratory Data Analysis (EDA). We work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). The dataset consists of 70000 patient records with 13 columns: an 'id', 11 features, and the target class 'cardio', which equals 1 when a patient has cardiovascular disease and 0 when the patient is healthy.

***Clone the dataset Repository***

The modified dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [2]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

Cloning into 'AIData'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 16 (delta 4), reused 12 (delta 3), pack-reused 0
Unpacking objects: 100% (16/16), done.


***Read the dataset***

The data is stored in the cardio_train_modified.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [3]:
df = pd.read_csv("/content/AIData/cardio_train_modified.csv",sep=";")
df.head()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0.0,18393.0,male,168.0,62.0,110.0,80.0,1.0,1.0,No,0.0,1.0,0.0
1,1.0,20228.0,female,156.0,85.0,140.0,90.0,3.0,1.0,No,0.0,1.0,1.0
2,2.0,18857.0,female,165.0,64.0,130.0,70.0,3.0,1.0,No,0.0,0.0,1.0
3,3.0,17623.0,male,169.0,82.0,150.0,100.0,1.0,1.0,No,0.0,1.0,1.0
4,4.0,17474.0,female,156.0,56.0,100.0,60.0,1.0,1.0,No,0.0,0.0,0.0
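The `sep=";"` argument matters because the file uses semicolons rather than commas between fields. As a minimal sketch, the snippet below parses a small in-memory string (three columns of the first two records shown above) with and without the argument. The names `sample`, `with_sep`, and `default_sep` are illustrative assumptions.

```python
import io
import pandas as pd

# Two records in a semicolon-delimited layout (illustration only).
sample = "id;age;gender\n0;18393;male\n1;20228;female\n"

# With the correct separator, each field becomes its own column.
with_sep = pd.read_csv(io.StringIO(sample), sep=";")

# With the default separator (","), each line is read as a single field.
default_sep = pd.read_csv(io.StringIO(sample))

print(with_sep.shape)
print(default_sep.shape)
```

If a freshly loaded dataframe has only one column, a wrong separator is a likely cause.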


***Display Data Info and Check NAN***

To display the columns, their non-null counts, and their data types, use the info() method.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           69997 non-null  float64
 1   age          69997 non-null  float64
 2   gender       69008 non-null  object 
 3   height       68996 non-null  float64
 4   weight       69993 non-null  float64
 5   ap_hi        69992 non-null  float64
 6   ap_lo        69991 non-null  float64
 7   cholesterol  69398 non-null  float64
 8   gluc         69995 non-null  float64
 9   smoke        69003 non-null  object 
 10  alco         69997 non-null  float64
 11  active       69997 non-null  float64
 12  cardio       69997 non-null  float64
dtypes: float64(11), object(2)
memory usage: 6.9+ MB


Here the dataframe consists of 70000 rows and 13 columns. Eleven columns are numerical (float64) and two are objects (gender, smoke). We notice that the number of non-null values is below 70000 for every column, which means that some values in the data are missing.

We can get the exact number of missing values for each feature using the isnull() method as below

In [5]:
df.isnull().sum()

id                3
age               3
gender          992
height         1004
weight            7
ap_hi             8
ap_lo             9
cholesterol     602
gluc              5
smoke           997
alco              3
active            3
cardio            3
dtype: int64

We can also get the number and percentage of patients' records that have one or more missing values

In [6]:
print(df.isnull().any(axis=1).sum())
print(100*df.isnull().any(axis=1).sum()/df.shape[0],'%')

3530
5.042857142857143 %
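The same idea extends to per-column percentages: `isna()` returns booleans, and the mean of a boolean column is the fraction of `True` values. The sketch below uses a small made-up frame (not the real dataset); `toy` and the two result names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for the real dataset.
toy = pd.DataFrame({
    "height": [168.0, np.nan, 165.0, 169.0],
    "weight": [62.0, 85.0, np.nan, 82.0],
    "smoke": ["No", None, "No", "Yes"],
})

# Fraction of missing values per column, expressed as a percentage.
pct_missing_per_column = 100 * toy.isna().mean()

# Percentage of rows that contain at least one missing value.
pct_rows_with_missing = 100 * toy.isna().any(axis=1).mean()

print(pct_missing_per_column)
print(pct_rows_with_missing)
```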


To display the records that contain NaN values:

In [7]:
df[df.isnull().any(axis=1)]

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
8,13.0,17668.0,female,,71.0,110.0,,1.0,1.0,No,0.0,1.0,0.0
11,16.0,18815.0,male,173.0,,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0
14,23.0,14532.0,male,181.0,95.0,130.0,90.0,1.0,1.0,,1.0,1.0,0.0
21,31.0,21413.0,female,157.0,69.0,,80.0,1.0,1.0,No,0.0,1.0,0.0
22,32.0,23046.0,female,,90.0,145.0,85.0,2.0,2.0,No,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69919,99871.0,17312.0,female,159.0,45.0,110.0,70.0,,2.0,No,0.0,1.0,0.0
69928,99890.0,14420.0,female,,55.0,140.0,90.0,1.0,1.0,No,0.0,1.0,0.0
69962,99949.0,21151.0,female,178.0,69.0,130.0,90.0,1.0,1.0,,0.0,1.0,1.0
69974,99962.0,18226.0,female,,75.0,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0


# Data Cleaning

**Data Cleaning: drop all empty records**

The first step is usually to drop all empty records, i.e., records in which every feature is NaN.

In [8]:
df.dropna(how='all', inplace=True)
df.isnull().sum()

id                0
age               0
gender          989
height         1001
weight            4
ap_hi             5
ap_lo             6
cholesterol     599
gluc              2
smoke           994
alco              0
active            0
cardio            0
dtype: int64

By comparing the NaN counts before and after this step, we notice that every count dropped by 3, so there were 3 empty records in the dataset. We also notice that the number of missing values for the 'weight', 'ap_hi', 'ap_lo', and 'gluc' features is very low (at most 6 each). So a reasonable choice is to delete these patients' records from the dataset.
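To make the difference between the `dropna` variants used in this notebook concrete, here is a minimal sketch on a made-up four-row frame. The frame `toy` and the result names are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is fully empty, rows 2 and 3 are partly empty.
toy = pd.DataFrame({
    "weight": [62.0, np.nan, np.nan, 82.0],
    "gluc": [1.0, np.nan, 2.0, np.nan],
})

only_empty_dropped = toy.dropna(how="all")       # drops row 1 only
any_missing_dropped = toy.dropna(how="any")      # drops rows 1, 2, and 3
weight_required = toy.dropna(subset=["weight"])  # drops rows 1 and 2

print(len(only_empty_dropped), len(any_missing_dropped), len(weight_required))
```

`how="all"` is the most conservative choice, which is why it is a natural first step; `subset=[...]` restricts the check to the listed columns.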

**Data Cleaning: 'weight' feature**

List the patients' records whose 'weight' feature is NaN

In [9]:
df[df.weight.isnull()]

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
11,16.0,18815.0,male,173.0,,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0
2160,3049.0,16160.0,male,196.0,,140.0,90.0,1.0,1.0,Yes,1.0,1.0,1.0
16105,22993.0,22468.0,male,176.0,,130.0,80.0,3.0,1.0,No,0.0,1.0,1.0
58630,83670.0,22551.0,female,,,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0


List the patients' records whose 'weight' feature is not NaN

In [10]:
df[df.weight.notna()]

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0.0,18393.0,male,168.0,62.0,110.0,80.0,1.0,1.0,No,0.0,1.0,0.0
1,1.0,20228.0,female,156.0,85.0,140.0,90.0,3.0,1.0,No,0.0,1.0,1.0
2,2.0,18857.0,female,165.0,64.0,130.0,70.0,3.0,1.0,No,0.0,0.0,1.0
3,3.0,17623.0,male,169.0,82.0,150.0,100.0,1.0,1.0,No,0.0,1.0,1.0
4,4.0,17474.0,female,156.0,56.0,100.0,60.0,1.0,1.0,No,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,99993.0,19240.0,male,168.0,76.0,120.0,80.0,,1.0,Yes,0.0,1.0,0.0
69996,99995.0,22601.0,female,158.0,126.0,140.0,90.0,2.0,2.0,No,0.0,1.0,1.0
69997,99996.0,19066.0,male,183.0,105.0,180.0,90.0,3.0,1.0,No,1.0,0.0,1.0
69998,99998.0,22431.0,female,163.0,72.0,135.0,80.0,1.0,2.0,No,0.0,0.0,1.0


Delete (drop) the records whose 'weight' feature is NaN, keeping only the rows where 'weight' is not NaN.

In [11]:
print(df.shape)
df.dropna(subset=['weight'], inplace=True)
print(df.shape)

(69997, 13)
(69993, 13)


In [12]:
df.isnull().sum()

id                0
age               0
gender          989
height         1000
weight            0
ap_hi             5
ap_lo             6
cholesterol     599
gluc              2
smoke           994
alco              0
active            0
cardio            0
dtype: int64

As can be observed, the number of records in the data frame was reduced by 4 (from 69997 to 69993), and there are no NaN values left in the 'weight' feature.

**Data Cleaning: 'ap_hi', 'ap_lo', and 'gluc' features**

We will do the same for the 'ap_hi', 'ap_lo', and 'gluc' features.

In [13]:
print(df.shape)
df.dropna(subset=['ap_hi','ap_lo','gluc'], inplace=True)
print(df.shape)

(69993, 13)
(69981, 13)


In [14]:
df.isnull().sum()

id               0
age              0
gender         989
height         999
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke          994
alco             0
active           0
cardio           0
dtype: int64

**Data Cleaning: 'gender' feature**

The 'gender' feature is a string ('male' or 'female') and has many missing values (989). One option is to drop all records whose 'gender' feature is NaN. However, this means dropping ~1.4% of the records, a decision that should be made with the domain experts.

In [15]:
dfgender = df.copy()
print(dfgender.isnull()['gender'].sum())
print(100*dfgender.isnull()['gender'].sum()/dfgender.shape[0],'%')
print(dfgender.shape)
dfgender.dropna(subset=['gender'], inplace=True)
print(dfgender.shape)

989
1.4132407367714095 %
(69981, 13)
(68992, 13)


Another option is to replace all missing values in the 'gender' feature with the majority value (male or female).

In [16]:
df['gender'].value_counts()

female    44885
male      24107
Name: gender, dtype: int64

In [17]:
dfc = df.copy()
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
dfc['gender'] = dfc['gender'].fillna('female')
dfc['gender'].value_counts()

female    45874
male      24107
Name: gender, dtype: int64

As can be observed, the number of 'female' records increased by 989, from 44885 to 45874.
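Rather than hard-coding the majority value, it can be computed with `Series.mode()`. The following is a minimal sketch on a made-up series; the names `toy`, `majority`, and `filled` are illustrative assumptions.

```python
import pandas as pd

# Made-up sample with two missing entries.
toy = pd.Series(["female", "male", None, "female", None], name="gender")

# mode() ignores missing values and returns a Series (ties are possible),
# so take its first entry.
majority = toy.mode().iloc[0]

# Fill the gaps with the majority value and assign the result.
filled = toy.fillna(majority)

print(majority)
print(filled.value_counts())
```

Keep in mind that this kind of fill makes the majority class even more dominant, which is one reason to consider the other options discussed here.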

A third option is to try to set the missing 'gender' feature values based on other values in the record. For example, we can check the correlation between 'gender' and 'height' features. 

In [18]:
df[['gender','height']].apply(lambda x: x.factorize()[0]).corr()

,gender,height
gender,1.0,-0.085125
height,-0.085125,1.0
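One caveat when reading this table: `factorize()` numbers the distinct values in order of first appearance, so for a numeric column such as 'height' the codes do not follow the size of the values. The sketch below, on a made-up six-row frame (not the real dataset), contrasts the correlation of the factorized codes with the correlation computed on the actual heights using a 0/1 encoding of gender. The names `toy`, `corr_codes`, `corr_values`, and `mean_by_gender` are illustrative assumptions.

```python
import pandas as pd

# Made-up data in which the two groups clearly differ in height.
toy = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male", "female"],
    "height": [180.0, 160.0, 175.0, 165.0, 182.0, 158.0],
})

# Approach used in the notebook: factorize every column, then correlate the codes.
codes = toy.apply(lambda col: col.factorize()[0])
corr_codes = codes.corr().loc["gender", "height"]

# Alternative: encode only the categorical column and keep the real heights.
gender_code = (toy["gender"] == "male").astype(int)
corr_values = gender_code.corr(toy["height"])

# A per-group summary is another simple way to inspect the relationship.
mean_by_gender = toy.groupby("gender")["height"].mean()

print(corr_codes, corr_values)
print(mean_by_gender)
```

On this toy data the second correlation is much larger in magnitude than the first, so a low value in a factorize-based table does not by itself rule out a relationship with a numeric feature.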


Measured this way, there is not much correlation (about -0.09). Let us check the other features.

In [19]:
df.apply(lambda x: x.factorize()[0]).corr()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
id,1.0,0.135931,-0.00175,0.004088,0.015556,0.005325,0.005312,0.002675,0.002483,-0.005228,0.001255,-0.003896,0.003769
age,0.135931,1.0,-0.008592,0.004392,0.001836,-0.004034,0.002397,-0.007037,-0.010129,0.006977,0.003834,0.005602,-0.024464
gender,-0.00175,-0.008592,1.0,-0.085125,-0.019587,-0.002249,0.0036,0.028888,0.021616,-0.278516,-0.1551,0.002539,-0.007021
height,0.004088,0.004392,-0.085125,1.0,0.085462,0.007998,0.02358,0.030008,0.006707,0.040973,0.030093,-0.01688,0.010824
weight,0.015556,0.001836,-0.019587,0.085462,1.0,0.02196,0.033945,0.055983,0.049485,0.029892,0.034908,0.002427,0.045406
ap_hi,0.005325,-0.004034,-0.002249,0.007998,0.02196,1.0,0.342503,0.018547,0.009641,0.008142,0.028904,0.000102,0.016902
ap_lo,0.005312,0.002397,0.0036,0.02358,0.033945,0.342503,1.0,0.031497,0.013947,0.006435,0.03037,0.002622,0.033604
cholesterol,0.002675,-0.007037,0.028888,0.030008,0.055983,0.018547,0.031497,1.0,0.264386,0.022127,0.045329,-0.00091,0.161295
gluc,0.002483,-0.010129,0.021616,0.006707,0.049485,0.009641,0.013947,0.264386,1.0,-0.001139,0.011291,0.006804,0.089344
smoke,-0.005228,0.006977,-0.278516,0.040973,0.029892,0.008142,0.006435,0.022127,-0.001139,1.0,0.307583,-0.022913,-0.013444


It seems that the 'gender' feature has its highest correlation, in magnitude, with the 'smoke' feature (about -0.28).

In [20]:
df[['gender','smoke']].apply(lambda x: x.factorize()[0]).corr()

,gender,smoke
gender,1.0,-0.278516
smoke,-0.278516,1.0


Let us explore the correlation using crosstab

In [21]:
pd.crosstab(df['gender'],df['smoke'])

smoke,No,Yes
gender,,
female,43453,795
male,18548,5214
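The counts are easier to compare as proportions within each smoking group. As a minimal sketch, the snippet below rebuilds the table from the counts above and divides each column by its total; on the raw data, `pd.crosstab(df['gender'], df['smoke'], normalize='columns')` should give the same kind of table. The names `counts` and `share_within_smoke` are illustrative assumptions.

```python
import pandas as pd

# Counts copied from the crosstab above.
counts = pd.DataFrame(
    {"No": [43453, 18548], "Yes": [795, 5214]},
    index=pd.Index(["female", "male"], name="gender"),
)
counts.columns.name = "smoke"

# Share of each gender within each smoking group (each column sums to 1).
share_within_smoke = counts / counts.sum(axis=0)

print(share_within_smoke.round(3))
```

Roughly 70% of the non-smokers are female and roughly 87% of the smokers are male.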


This implies that most non-smokers in the dataset are females and most smokers are males. So let us set the missing 'gender' values to 'male' for smokers and to 'female' for non-smokers.

In [22]:
dfsmoke = df.copy()
dfsmoke.loc[(dfsmoke.gender.isnull()) & (dfsmoke['smoke'] == 'Yes'),'gender']='male'
dfsmoke.loc[(dfsmoke.gender.isnull()) & (dfsmoke['smoke'] == 'No'),'gender']='female'
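The two `.loc` assignments above can also be expressed as a single lookup followed by `fillna`. This is a minimal sketch on a made-up frame; the mapping dictionary and the names `toy` and `guess_from_smoke` are illustrative assumptions.

```python
import pandas as pd

# Made-up records; row 3 has both 'gender' and 'smoke' missing.
toy = pd.DataFrame({
    "gender": ["male", None, None, None, "female"],
    "smoke": ["Yes", "Yes", "No", None, "No"],
})

# Hypothetical lookup that encodes the rule: smoker -> male, non-smoker -> female.
guess_from_smoke = toy["smoke"].map({"Yes": "male", "No": "female"})

# fillna aligns on the index and only changes rows where 'gender' is missing.
toy["gender"] = toy["gender"].fillna(guess_from_smoke)

print(toy)
```

Rows where 'smoke' is also missing get no guess and therefore keep a missing 'gender', so a follow-up check for remaining NaN values is still needed.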

Let us check the correlation using crosstab again.

In [23]:
pd.crosstab(dfsmoke['gender'],dfsmoke['smoke'])

smoke,No,Yes
gender,,
female,44361,795
male,18548,5283


We observe that the number of female non-smokers increased (from 43453 to 44361) and the number of male smokers also increased (from 5214 to 5283). We also need to check whether any NaN values remain in the 'gender' feature, which can happen because the 'smoke' feature also has NaN values.

In [24]:
dfsmoke.isnull().sum()

id               0
age              0
gender          12
height         999
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke          994
alco             0
active           0
cardio           0
dtype: int64

There are 12 NaN values left in the 'gender' feature. We will drop these records because they make up only a very small percentage of the records in the dataset.

In [25]:
print(dfsmoke.shape)
dfsmoke.dropna(subset=['gender'], inplace=True)
print(dfsmoke.shape)

(69981, 13)
(69969, 13)


In this notebook, we adopt the third option to deal with the NaN values in the 'gender' feature.

In [26]:
df = dfsmoke.copy()
df.isnull().sum()

id               0
age              0
gender           0
height         999
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke          982
alco             0
active           0
cardio           0
dtype: int64

**Data Cleaning: 'smoke' feature**

Now, for the 'smoke' feature, is there any correlation with the other features?

In [27]:
df.apply(lambda x: x.factorize()[0]).corr()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
id,1.0,0.135961,-0.003195,0.004054,0.015487,0.005289,0.005301,0.002652,0.002478,-0.005679,0.001282,-0.003967,0.003778
age,0.135961,1.0,-0.010891,0.004453,0.001735,-0.004057,0.002381,-0.007055,-0.010224,0.006935,0.003721,0.005585,-0.024445
gender,-0.003195,-0.010891,1.0,-0.093059,-0.022726,0.000447,0.004397,0.026514,0.020176,-0.308838,-0.171638,0.006441,-0.0087
height,0.004054,0.004453,-0.093059,1.0,0.085497,0.00799,0.02357,0.030051,0.0067,0.041084,0.030083,-0.016897,0.010838
weight,0.015487,0.001735,-0.022726,0.085497,1.0,0.021934,0.033929,0.055939,0.049431,0.029553,0.034865,0.002443,0.045389
ap_hi,0.005289,-0.004057,0.000447,0.00799,0.021934,1.0,0.342503,0.018543,0.009625,0.008004,0.028897,0.000108,0.016897
ap_lo,0.005301,0.002381,0.004397,0.02357,0.033929,0.342503,1.0,0.0315,0.013946,0.006357,0.030377,0.002635,0.033607
cholesterol,0.002652,-0.007055,0.026514,0.030051,0.055939,0.018543,0.0315,1.0,0.264369,0.022048,0.045276,-0.000869,0.16126
gluc,0.002478,-0.010224,0.020176,0.0067,0.049431,0.009625,0.013946,0.264369,1.0,-0.001202,0.011092,0.006839,0.089309
smoke,-0.005679,0.006935,-0.308838,0.041084,0.029553,0.008004,0.006357,0.022048,-0.001202,1.0,0.308018,-0.022858,-0.013459


Yes: the 'smoke' feature correlates most strongly with the 'gender' (about -0.31) and 'alco' (about 0.31) features. However, because we already used 'smoke' to fill the NaN values in 'gender', the correlation between those two is partly of our own making. We will therefore use the 'alco' feature to deal with the NaN values in the 'smoke' feature.

In [28]:
pd.crosstab(df['smoke'],df['alco'])

alco,0.0,1.0
smoke,,
No,61028,1881
Yes,4249,1829


We can observe from the crosstab results that most people who do not drink alcohol (alco = 0) are non-smokers (61028 of 65277), whereas people who drink (alco = 1) are split almost evenly between non-smokers (1881) and smokers (1829). So we will set all NaN values in the 'smoke' feature to 'No' for records with alco = 0.

In [29]:
df.loc[(df.smoke.isnull()) & (df['alco'] == 0.0),'smoke']='No'

Let us check the correlation using crosstab again.

In [30]:
pd.crosstab(df['smoke'],df['alco'])

alco,0.0,1.0
smoke,,
No,61958,1881
Yes,4249,1829


We observe that the number of non-smokers among people who do not drink alcohol increased by 930, from 61028 to 61958. We will drop the remaining records whose 'smoke' feature is still NaN.

In [31]:
print(df.shape)
df.dropna(subset=['smoke'], inplace=True)
print(df.shape)
df.isnull().sum()

(69969, 13)
(69917, 13)


id               0
age              0
gender           0
height         998
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke            0
alco             0
active           0
cardio           0
dtype: int64

**Data Cleaning: 'height' feature**

Now, for the 'height' feature, is there any correlation with the other features?

In [32]:
df.apply(lambda x: x.factorize()[0]).corr()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
id,1.0,0.13575,-0.003385,0.004219,0.015402,0.005457,0.005461,0.00281,0.002627,-0.003686,0.001432,-0.004101,0.003803
age,0.13575,1.0,-0.011127,0.004844,0.001505,-0.004124,0.002323,-0.007065,-0.01016,0.009356,0.003848,0.00551,-0.024644
gender,-0.003385,-0.011127,1.0,-0.099598,-0.022678,0.000403,0.004372,0.026536,0.020134,-0.340032,-0.170451,0.006235,-0.00862
height,0.004219,0.004844,-0.099598,1.0,0.089185,0.007583,0.022865,0.029728,0.005919,0.047471,0.03082,-0.016796,0.010837
weight,0.015402,0.001505,-0.022678,0.089185,1.0,0.021467,0.033512,0.055897,0.049445,0.030471,0.034921,0.002436,0.045327
ap_hi,0.005457,-0.004124,0.000403,0.007583,0.021467,1.0,0.341715,0.018361,0.009575,0.006676,0.028608,0.00021,0.016778
ap_lo,0.005461,0.002323,0.004372,0.022865,0.033512,0.341715,1.0,0.031473,0.013973,0.004213,0.029911,0.002734,0.033584
cholesterol,0.00281,-0.007065,0.026536,0.029728,0.055897,0.018361,0.031473,1.0,0.264303,0.022511,0.045554,-0.000909,0.161304
gluc,0.002627,-0.01016,0.020134,0.005919,0.049445,0.009575,0.013973,0.264303,1.0,-0.004817,0.01125,0.006876,0.089176
smoke,-0.003686,0.009356,-0.340032,0.047471,0.030471,0.006676,0.004213,0.022511,-0.004817,1.0,0.341181,-0.025663,-0.015437


As measured by this factorize-based table, the correlations of the 'height' feature with the other features are weak; the largest in magnitude are with 'gender' (about -0.10) and 'weight' (about 0.09). Moreover, 'height' is continuous, so we cannot handle it with a simple rule as we did for the 'gender' feature. A better approach is to build a model that predicts 'height' from the 'gender' and 'weight' features, which we will study in the next modules.

So, for now, we have two options: drop all records where the 'height' feature is NaN, or replace these NaN values with a statistical measure (mean or median) of the 'height' feature. In this notebook, we will replace the NaN values with the median of the values in the 'height' feature.

In [33]:
print(df.height.median())
df['height'] = df['height'].fillna(df['height'].median())
print(df.height.median())
df.isnull().sum()


165.0
165.0


id               0
age              0
gender           0
height           0
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke            0
alco             0
active           0
cardio           0
dtype: int64
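The single median used above ignores any relationship between 'height' and 'gender'. A middle ground between a global median and a predictive model is to fill each gap with the median of the record's own group. This is a swapped-in technique shown only as a hedged sketch on made-up data, not the method adopted in this notebook; the names `toy`, `group_median`, and `height_filled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up records with one missing height per group.
toy = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female"],
    "height": [180.0, 176.0, np.nan, 160.0, np.nan, 164.0],
})

# Median height within each gender, broadcast back to every row.
group_median = toy.groupby("gender")["height"].transform("median")

# Fill each missing height with the median of its own group.
toy["height_filled"] = toy["height"].fillna(group_median)

print(toy)
```

Whether this is preferable to the global median depends on how strongly the grouping feature actually relates to the feature being filled.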

**Data Cleaning: 'cholesterol' feature**

To handle the NaN values in the 'cholesterol' feature, we will use the same method we used for the 'height' feature. However, because the 'cholesterol' feature takes only a few discrete values (1, 2, and 3 in the records shown above), we will use the mean instead of the median (the median would return one of these integer values).

In [34]:
print(df.cholesterol.mean())
df['cholesterol'] = df['cholesterol'].fillna(df['cholesterol'].mean())
print(df.cholesterol.mean())
df.isnull().sum()


1.3668888311838194
1.3668888311838194


id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
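For a feature with only a few distinct levels, the choice of statistic changes what the filled values look like. The records shown earlier contain the values 1.0, 2.0, and 3.0 for 'cholesterol'; the sketch below uses a made-up series with the same levels to compare the mean, median, and mode. The names `levels` and `filled_with_mode` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample of a three-level feature with two missing entries.
levels = pd.Series([1.0, 1.0, 1.0, 2.0, 3.0, np.nan, np.nan])

print(levels.mean())          # a value between the levels
print(levels.median())        # one of the original levels
print(levels.mode().iloc[0])  # the most frequent level

# Filling with the mode keeps every value on an original level.
filled_with_mode = levels.fillna(levels.mode().iloc[0])
print(sorted(filled_with_mode.unique()))
```

Filling with the mean, as done above, introduces a value that is not one of the original levels, which is worth keeping in mind if a later step treats the feature as categorical.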

# Save Data

Now, we will save the clean dataset into a CSV file to be used in the next session.

In [35]:
df.to_csv("/content/AIData/cardio_train_cleaned.csv",index=False)

Check the '/content/AIData/' folder for the 'cardio_train_cleaned.csv' file and download it for future use.
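A quick round-trip check can confirm that a saved file reloads cleanly. The sketch below writes a made-up two-row frame to a temporary folder and reads it back; the file name `toy_cleaned.csv` and the variable names are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for the cleaned dataframe.
toy = pd.DataFrame({"gender": ["male", "female"], "height": [168.0, 156.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")  # hypothetical file name
    toy.to_csv(path, index=False)   # index=False avoids an extra index column
    reloaded = pd.read_csv(path)    # to_csv writes commas by default

print(reloaded.columns.tolist())
print(reloaded.isnull().sum().sum())
```

Note that `to_csv` uses a comma as its default separator, so the cleaned file should be read back without `sep=";"`, unlike the original file.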