<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

EXPLORATORY DATA ANALYSIS – DATA CLEANING

https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data

https://towardsdatascience.com/exploratory-data-analysis-on-heart-disease-uci-data-set-ae129e47b323





The task is to predict the presence or absence of cardiovascular disease (CVD) using the patient examination results.

**Implementation**

Import some libraries 

In [58]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import os

# Notebook

In this notebook, we will demonstrate Data Cleaning as part of Exploratory Data Analysis (EDA). We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). The dataset consists of 70000 patient records, each described by 12 variables (11 features and the target class). The target class "cardio" equals 1 when a patient has cardiovascular disease and 0 when a patient is healthy.

***Clone the dataset Repository***

The modified dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [59]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

Cloning into 'AIData'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 16 (delta 4), reused 12 (delta 3), pack-reused 0
Unpacking objects: 100% (16/16), done.


***Read the dataset***

The data is stored in the cardio_train_modified.csv file, which uses ';' as the field separator. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [60]:
import pandas as pd
df = pd.read_csv("/content/AIData/cardio_train_modified.csv",sep=";")
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0.0,18393.0,male,168.0,62.0,110.0,80.0,1.0,1.0,No,0.0,1.0,0.0
1,1.0,20228.0,female,156.0,85.0,140.0,90.0,3.0,1.0,No,0.0,1.0,1.0
2,2.0,18857.0,female,165.0,64.0,130.0,70.0,3.0,1.0,No,0.0,0.0,1.0
3,3.0,17623.0,male,169.0,82.0,150.0,100.0,1.0,1.0,No,0.0,1.0,1.0
4,4.0,17474.0,female,156.0,56.0,100.0,60.0,1.0,1.0,No,0.0,0.0,0.0


***Display Data Info and Check NAN***

To display the columns of the data, their non-null counts, and their data types, use the info() method.

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           69997 non-null  float64
 1   age          69997 non-null  float64
 2   gender       69008 non-null  object 
 3   height       68996 non-null  float64
 4   weight       69993 non-null  float64
 5   ap_hi        69992 non-null  float64
 6   ap_lo        69991 non-null  float64
 7   cholesterol  69398 non-null  float64
 8   gluc         69995 non-null  float64
 9   smoke        69003 non-null  object 
 10  alco         69997 non-null  float64
 11  active       69997 non-null  float64
 12  cardio       69997 non-null  float64
dtypes: float64(11), object(2)
memory usage: 6.9+ MB


Here the dataframe consists of 70000 rows and 13 columns: the record 'id', 11 patient features, and the target 'cardio'. Eleven columns are numerical (float64) and two are objects (gender, smoke). No column has 70000 non-null values, which means that some values in the data are missing.

We can get the exact number of missing values for each feature by chaining the isnull() and sum() methods, as below.

In [62]:
df.isnull().sum()

id                3
age               3
gender          992
height         1004
weight            7
ap_hi             8
ap_lo             9
cholesterol     602
gluc              5
smoke           997
alco              3
active            3
cardio            3
dtype: int64
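
The raw counts are easier to compare as percentages. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration, not taken from the dataset); it relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cardiovascular data.
toy = pd.DataFrame({
    "weight": [62.0, np.nan, 64.0, 82.0],
    "gender": ["male", None, None, "female"],
    "cardio": [0.0, 1.0, 1.0, 0.0],
})

# isnull() yields booleans; their column-wise mean is the missing fraction.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

On the real data, the same expression applied to df would rank the features by how much of each is missing.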

We can also get the number and percentage of patients' records that have one or more missing values.

In [63]:
print(df.isnull().any(axis=1).sum())
print(100*df.isnull().any(axis=1).sum()/df.shape[0],'%')

3530
5.042857142857143 %


To display the records that contain NaN values:

In [64]:
df[df.isnull().any(axis=1)]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
8,13.0,17668.0,female,,71.0,110.0,,1.0,1.0,No,0.0,1.0,0.0
11,16.0,18815.0,male,173.0,,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0
14,23.0,14532.0,male,181.0,95.0,130.0,90.0,1.0,1.0,,1.0,1.0,0.0
21,31.0,21413.0,female,157.0,69.0,,80.0,1.0,1.0,No,0.0,1.0,0.0
22,32.0,23046.0,female,,90.0,145.0,85.0,2.0,2.0,No,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69919,99871.0,17312.0,female,159.0,45.0,110.0,70.0,,2.0,No,0.0,1.0,0.0
69928,99890.0,14420.0,female,,55.0,140.0,90.0,1.0,1.0,No,0.0,1.0,0.0
69962,99949.0,21151.0,female,178.0,69.0,130.0,90.0,1.0,1.0,,0.0,1.0,1.0
69974,99962.0,18226.0,female,,75.0,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0


**Data Cleaning**

The first step is usually to drop all empty records, i.e., records in which all features are NaN.

In [65]:
df.dropna(how='all', inplace=True)
df.isnull().sum()

id                0
age               0
gender          989
height         1001
weight            4
ap_hi             5
ap_lo             6
cholesterol     599
gluc              2
smoke           994
alco              0
active            0
cardio            0
dtype: int64

By comparing the NaN counts before and after the last step, we notice that every count dropped by 3, so there were 3 empty records in the dataset. We also notice that the number of missing values for the 'weight', 'ap_hi', 'ap_lo', and 'gluc' features is very low (at most 6 records each). A reasonable choice is therefore to delete these patients' records from the dataset.
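
The `how` and `subset` arguments of `dropna()` control which rows are removed. The sketch below compares them on a hypothetical two-column toy frame (illustrative values only); it returns new frames rather than modifying the original in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is completely empty, rows 2 and 3 are partial.
toy = pd.DataFrame({
    "weight": [62.0, np.nan, np.nan, 82.0],
    "ap_hi": [110.0, np.nan, 130.0, np.nan],
})

only_empty_removed = toy.dropna(how="all")       # drop rows where every value is NaN
any_missing_removed = toy.dropna(how="any")      # drop rows with at least one NaN
weight_required = toy.dropna(subset=["weight"])  # drop rows where 'weight' is NaN

print(len(only_empty_removed), len(any_missing_removed), len(weight_required))
```

Using `how="any"` on the full dataset would discard every record with a missing value, which is why this notebook handles each feature separately.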

List the patients' records in which the 'weight' feature is NaN.

In [66]:
df[df.weight.isnull()]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
11,16.0,18815.0,male,173.0,,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0
2160,3049.0,16160.0,male,196.0,,140.0,90.0,1.0,1.0,Yes,1.0,1.0,1.0
16105,22993.0,22468.0,male,176.0,,130.0,80.0,3.0,1.0,No,0.0,1.0,1.0
58630,83670.0,22551.0,female,,,120.0,80.0,1.0,1.0,No,0.0,1.0,0.0


List the patients' records in which the 'weight' feature is not NaN.

In [67]:
df[df.weight.notna()]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0.0,18393.0,male,168.0,62.0,110.0,80.0,1.0,1.0,No,0.0,1.0,0.0
1,1.0,20228.0,female,156.0,85.0,140.0,90.0,3.0,1.0,No,0.0,1.0,1.0
2,2.0,18857.0,female,165.0,64.0,130.0,70.0,3.0,1.0,No,0.0,0.0,1.0
3,3.0,17623.0,male,169.0,82.0,150.0,100.0,1.0,1.0,No,0.0,1.0,1.0
4,4.0,17474.0,female,156.0,56.0,100.0,60.0,1.0,1.0,No,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,99993.0,19240.0,male,168.0,76.0,120.0,80.0,,1.0,Yes,0.0,1.0,0.0
69996,99995.0,22601.0,female,158.0,126.0,140.0,90.0,2.0,2.0,No,0.0,1.0,1.0
69997,99996.0,19066.0,male,183.0,105.0,180.0,90.0,3.0,1.0,No,1.0,0.0,1.0
69998,99998.0,22431.0,female,163.0,72.0,135.0,80.0,1.0,2.0,No,0.0,0.0,1.0


Delete (drop) the records in which the 'weight' feature is NaN, keeping only the rows with a valid weight.

In [68]:
print(df.shape)
df.dropna(subset=['weight'], inplace=True)
print(df.shape)

(69997, 13)
(69993, 13)


In [69]:
df.isnull().sum()

id                0
age               0
gender          989
height         1000
weight            0
ap_hi             5
ap_lo             6
cholesterol     599
gluc              2
smoke           994
alco              0
active            0
cardio            0
dtype: int64

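As can be observed, the number of records in the data frame was reduced by 4 (from 69997 to 69993), and there are no NaN values left in the 'weight' feature. The 'height' count also fell from 1001 to 1000 because one of the dropped records was missing its height as well.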

We will do the same for the 'ap_hi', 'ap_lo', and 'gluc' features.

In [70]:
print(df.shape)
df.dropna(subset=['ap_hi','ap_lo','gluc'], inplace=True)
print(df.shape)

(69993, 13)
(69981, 13)
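
Before this step the three features had 5, 6, and 2 missing values (13 in total), yet only 12 rows were removed. This is expected when a record is missing more than one of the listed features, because `dropna(subset=...)` removes each such row only once. A minimal sketch of the effect on a hypothetical toy frame (illustrative values only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is missing both ap_hi and ap_lo.
toy = pd.DataFrame({
    "ap_hi": [110.0, np.nan, 130.0, np.nan],
    "ap_lo": [80.0, np.nan, np.nan, 90.0],
    "gluc": [1.0, 1.0, 1.0, 1.0],
})
cols = ["ap_hi", "ap_lo", "gluc"]

missing_cells = int(toy[cols].isnull().sum().sum())        # number of NaN cells
rows_affected = int(toy[cols].isnull().any(axis=1).sum())  # rows with at least one NaN
rows_dropped = len(toy) - len(toy.dropna(subset=cols))

print(missing_cells, rows_affected, rows_dropped)
```

The number of dropped rows matches the number of affected rows, not the number of missing cells.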


In [71]:
df.isnull().sum()

id               0
age              0
gender         989
height         999
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke          994
alco             0
active           0
cardio           0
dtype: int64

The 'gender' feature is a string ('male' or 'female') and it has many missing values (989). One option is to drop all records whose 'gender' value is NaN. However, this means dropping ~1.4% of the records, a decision that should be made by the domain experts.

In [92]:
dfgender = df.copy()
print(dfgender.isnull()['gender'].sum())
print(100*dfgender.isnull()['gender'].sum()/dfgender.shape[0],'%')
print(dfgender.shape)
dfgender.dropna(subset=['gender'], inplace=True)
print(dfgender.shape)

989
1.4132407367714095 %
(69981, 13)
(68992, 13)


Another option is to replace all missing values in the 'gender' feature with the majority category (male or female).

In [72]:
df['gender'].value_counts()

female    44885
male      24107
Name: gender, dtype: int64

In [73]:
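dfc = df.copy()
# Assign the result back: fillna(inplace=True) on a selected column may not update dfc in recent pandas versions
dfc['gender'] = dfc['gender'].fillna('female')
dfc['gender'].value_counts()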

female    45874
male      24107
Name: gender, dtype: int64

As can be observed, the number of female records increased by 989 (from 44885 to 45874), which is exactly the number of missing 'gender' values.
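
Rather than hard-coding the majority category, it can be looked up with `mode()`. This is a sketch on a hypothetical toy column (illustrative values only), not the notebook's own cell.

```python
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({"gender": ["female", "male", None, "female", None]})

# mode() ignores missing values by default; [0] takes the most frequent category.
majority = toy["gender"].mode()[0]
toy["gender"] = toy["gender"].fillna(majority)

print(majority)
print(toy["gender"].value_counts())
```

Keep in mind that majority imputation makes the class imbalance slightly larger, so it is best suited to features with few missing values.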

A third option is to try to set the missing 'gender' feature values based on other values in the record. For example, we can check the correlation between 'gender' and 'height' features. 

In [74]:
df[['gender','height']].apply(lambda x: x.factorize()[0]).corr()

Unnamed: 0,gender,height
gender,1.0,-0.085125
height,-0.085125,1.0


It seems that there is not much correlation. Let us try to check with other features.
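
One caveat: `factorize()` numbers values in order of first appearance, so the codes it assigns to a numeric column such as 'height' carry no ordering, and missing values are encoded as -1. A hedged alternative, sketched here on hypothetical toy values, is to encode only the categorical column as 0/1 and keep 'height' numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values, not from the dataset).
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", None, "female"],
    "height": [178.0, 160.0, 165.0, 182.0, 170.0, np.nan],
})

# Map the two categories to 0/1; missing values stay NaN.
is_male = toy["gender"].map({"female": 0, "male": 1})

# Series.corr skips pairs in which either value is NaN.
r = is_male.corr(toy["height"])

# Group means give a direct view of the same relationship.
mean_height = toy.groupby("gender")["height"].mean()

print(round(r, 3))
print(mean_height)
```

The result on the real data may differ from the factorize-based value above, so this is offered as a cross-check rather than a replacement.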

In [75]:
df.apply(lambda x: x.factorize()[0]).corr()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
id,1.0,0.135931,-0.00175,0.004088,0.015556,0.005325,0.005312,0.002675,0.002483,-0.005228,0.001255,-0.003896,0.003769
age,0.135931,1.0,-0.008592,0.004392,0.001836,-0.004034,0.002397,-0.007037,-0.010129,0.006977,0.003834,0.005602,-0.024464
gender,-0.00175,-0.008592,1.0,-0.085125,-0.019587,-0.002249,0.0036,0.028888,0.021616,-0.278516,-0.1551,0.002539,-0.007021
height,0.004088,0.004392,-0.085125,1.0,0.085462,0.007998,0.02358,0.030008,0.006707,0.040973,0.030093,-0.01688,0.010824
weight,0.015556,0.001836,-0.019587,0.085462,1.0,0.02196,0.033945,0.055983,0.049485,0.029892,0.034908,0.002427,0.045406
ap_hi,0.005325,-0.004034,-0.002249,0.007998,0.02196,1.0,0.342503,0.018547,0.009641,0.008142,0.028904,0.000102,0.016902
ap_lo,0.005312,0.002397,0.0036,0.02358,0.033945,0.342503,1.0,0.031497,0.013947,0.006435,0.03037,0.002622,0.033604
cholesterol,0.002675,-0.007037,0.028888,0.030008,0.055983,0.018547,0.031497,1.0,0.264386,0.022127,0.045329,-0.00091,0.161295
gluc,0.002483,-0.010129,0.021616,0.006707,0.049485,0.009641,0.013947,0.264386,1.0,-0.001139,0.011291,0.006804,0.089344
smoke,-0.005228,0.006977,-0.278516,0.040973,0.029892,0.008142,0.006435,0.022127,-0.001139,1.0,0.307583,-0.022913,-0.013444


It seems that the 'gender' feature has the strongest correlation (in absolute value) with the 'smoke' feature.

In [76]:
df[['gender','smoke']].apply(lambda x: x.factorize()[0]).corr()

Unnamed: 0,gender,smoke
gender,1.0,-0.278516
smoke,-0.278516,1.0


Let us explore the relationship between the two features using a cross-tabulation (crosstab).

In [77]:
pd.crosstab(df['gender'],df['smoke'])

smoke,No,Yes
gender,,
female,43453,795
male,18548,5214


This implies that most non-smokers in the dataset are female and most smokers are male. So let us set every NaN 'gender' value to 'male' for smokers and to 'female' for non-smokers.

In [78]:
dfsmoke = df.copy()
dfsmoke.loc[(dfsmoke.gender.isnull()) & (dfsmoke['smoke'] == 'Yes'),'gender']='male'
dfsmoke.loc[(dfsmoke.gender.isnull()) & (dfsmoke['smoke'] == 'No'),'gender']='female'
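
The two `.loc` assignments above can also be written as a single lookup followed by `fillna()`. This is an equivalent sketch on a hypothetical toy frame (illustrative values only); it also shows what happens when 'smoke' itself is missing.

```python
import pandas as pd

# Hypothetical toy frame; the last row is missing both gender and smoke.
toy = pd.DataFrame({
    "gender": ["male", None, None, "female", None],
    "smoke": ["Yes", "Yes", "No", "No", None],
})

# Guess gender from smoking status; rows with unknown smoke map to NaN.
guess = toy["smoke"].map({"Yes": "male", "No": "female"})

# fillna with a Series fills only the missing entries, aligned on the index.
toy["gender"] = toy["gender"].fillna(guess)
print(toy["gender"].tolist())
```

Existing gender values are left untouched, and a record with neither value known stays NaN.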

Let us check the correlation using crosstab again.

In [79]:
pd.crosstab(dfsmoke['gender'],dfsmoke['smoke'])

smoke,No,Yes
gender,,
female,44361,795
male,18548,5283


We observe that the numbers of female non-smokers and of male smokers both increased. We also need to check whether any NaN values remain in the 'gender' feature, which can happen because the 'smoke' feature also has NaN values.

In [80]:
dfsmoke.isnull().sum()

id               0
age              0
gender          12
height         999
weight           0
ap_hi            0
ap_lo            0
cholesterol    599
gluc             0
smoke          994
alco             0
active           0
cardio           0
dtype: int64

There are 12 NaN values left in the 'gender' feature. We will drop them because they make up only a very small percentage of the population (records in the dataset).

In [81]:
print(dfsmoke.shape)
dfsmoke.dropna(subset=['gender'], inplace=True)
print(dfsmoke.shape)

(69981, 13)
(69969, 13)
