<a href="https://colab.research.google.com/github/mkjubran/ENCS5141-INTELLIGENT-SYSTEMS-LAB/blob/main/ENCS5141_Exp2_Handling_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Experiment #2: Data Visualization and Data Cleaning

This experiment discusses concepts and implements code snippets that demonstrate various data visualization and data cleaning techniques used in Exploratory Data Analysis (EDA). EDA plays a crucial role in comprehending and examining datasets. Throughout the experiment, you will also need to solve a few exercises to demonstrate your comprehension and acquire the necessary skills. The topics discussed in the experiment are:
##2.1 Data Visualization
2.1.1 Using Matplotlib \
2.1.2 Using Seaborn \
2.1.3 Using Pandas \
2.1.4 Boxplot \
##2.2 Descriptive statistics
2.2.1 Central Tendency \
2.2.2 Variation \
2.2.3 Shape of Distribution \
2.2.4 Quantiles \
##2.3 Handling Missing Data
2.3.1 Missing Numeric Data \
2.3.2 Missing Categorical Data \
##2.4 Handling Outliers
2.4.1 Statistical Outlier Detection Using Z-Score \
2.4.2 Using Interquartile Range and Boxplots \

---


#2.1. Data Visualization


##2.1.1 Using Matplotlib
Matplotlib is an open-source library used to create professional figures and plots. To make a plot, first import the **pyplot** sub-module and then call its **plot** function with the proper arguments.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# generate a sine-wave signal
t = np.arange(0, 1, 0.001)
sig = np.sin(2 * np.pi * 2 * t)

plt.plot(t,sig)
plt.xlabel('Time (sec)')
plt.ylabel('Amplitude')
plt.grid(True)

plt.show()

The **Matplotlib** package can also be used to plot multiple axes in the same figure or two curves on the same axis.

In [None]:
# generate three sine-wave signals with different phases
t = np.arange(0, 2, 0.001)
sig1 = np.sin(2 * np.pi * 2 * t)
sig2 = np.sin(2 * np.pi * 2 * t - np.pi/6)
sig3 = np.sin(2 * np.pi * 2 * t + np.pi/4)

fig, axs = plt.subplots(2, 1)
axs[0].plot(t, sig1)
axs[0].set_xlim(0, 2)
axs[0].set_xlabel('Time (sec)')
axs[0].set_ylabel('Amplitude')
axs[0].grid(True)

axs[1].plot(t, sig2, t, sig3)
axs[1].set_xlim(0, 2)
axs[1].set_xlabel('Time (sec)')
axs[1].set_ylabel('Amplitude')
axs[1].grid(True)

fig.tight_layout()
plt.show()

**Task 2.1**: Refer to the Matplotlib documentation (for example, the named colors reference at https://matplotlib.org/stable/gallery/color/named_colors.html) to plot two curves in the same figure, add markers of a specific size, add a legend, and add a figure title.

In [None]:
#write your code here


**Task 2.2**: Create two axes next to each other, and plot a sine wave on one axis and a cosine wave on the other. Add markers, a legend, and a title for the figure.

In [None]:
#write your code here



##2.1.2 Using Seaborn
Seaborn is a data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. To make a plot, first import the **seaborn** package (conventionally under the alias **sns**) and then call a specific function with the proper arguments. For example, you may use the **relplot** function to create relational plots.

In [None]:
import seaborn as sns
import pandas as pd

# generate sine-wave and cosine-wave signals
t = np.arange(0, 1, 0.001)
sin = np.sin(2 * np.pi * 2 * t)
cos = np.cos(2 * np.pi * 2 * t)

#Creating a dataframe from the time, sin, and cos curves
df = pd.DataFrame({'time':t, 'sin':sin, 'cos':cos})

# Create a visualization
sns.relplot(data=df,kind="line",x="time",y="sin").set(title='Sinwave')

You can also use the **relplot** function to create more advanced visualizations of your data. For instance, let's take the "tips" dataset as an example. This dataset contains information about tips received by a waiter over a few months in a restaurant. It has details like how much tip was given, the bill amount, whether the person paying the bill is male or female, if there were smokers in the group, the day of the week, the time of day, and the size of the group. To import the dataset use the following code

In [None]:
# Load an example dataset
tips = sns.load_dataset("tips")
tips.info()

Execute the following code to view the first 5 rows of the dataset (the default of **head()**)

In [None]:
tips.head()

Using the **relplot()** function helps us understand patterns in the dataset and how different factors might be connected.

In [None]:
# Create a visualization
sns.relplot(data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

From observing the visualization of the tips dataset, we can infer that as the total bill size grows, the tip value tends to increase proportionally. Additionally, it's apparent that both the total bill and tip value are higher when the group size is larger. **Can you observe any other patterns?**

**Task 2.3**: Load a dataset from the sns repository and then use the **relplot()** method to visualize and understand patterns in the dataset. You may list the datasets in the sns repository using the **sns.get_dataset_names()** method.

In [None]:
#write your code here


The **histplot()** is another method in the **sns** submodule that can be used to plot univariate or bivariate histograms to show distributions of datasets.

In [None]:
fig, axs = plt.subplots(figsize=(16, 4),ncols=3)

#Create histograms displaying the distribution of tip values
sns.histplot(data=tips, x="tip", ax=axs[0])

#Create histograms displaying the distribution of tip values based on the time of day
sns.histplot(data=tips, x="tip", hue="time",ax=axs[1])

#Create histograms displaying the distribution of tip values based on the time of day, and overlay a kernel density estimate (KDE) curve.
sns.histplot(data=tips, x="tip", hue="time",ax=axs[2], kde=True)

**Task 2.4**: Create a histogram plot for the dataset you loaded in Task 2.3 and overlay a kernel density estimate (KDE) curve.

In [None]:
#write your code here


**Task 2.5**: Use the **scatterplot()** function from **seaborn** (**sns**) and the **subplots()** function from **matplotlib.pyplot** to generate visual representations for the dataset you loaded in Task 2.3.

In [None]:
#write your code here


##2.1.3 Using Pandas
Data visualization using pandas is a common task in data analysis and manipulation. As explained in experiment #1, pandas provides an easy-to-use DataFrame structure that allows you to store, manipulate, and analyze data efficiently. When combined with data visualization libraries like Matplotlib or Seaborn, pandas can generate a wide range of visualizations to explore and communicate insights from your data. In this section, we will present a few types of visualizations that can be created using pandas.

In [None]:
import pandas as pd
import numpy as np


# Create a sample DataFrame
data = {
    'Category': ['Fruits','Fruits','Fruits','Fruits','Fruits', 'Vegetables','Vegetables','Vegetables','Vegetables','Vegetables','Grains','Grains','Grains'],
    'Item': ['Apple', 'Banana', 'Orange', 'Mango', 'Grapes', 'Spinach', 'Tomato', 'Cucumber','Cauliflower','Eggplant','Rice', 'Wheat', 'Corn'],
    'Weight': [200, 150, 250, 150, 200, 120, 200, 120, 120, 250, 120, 300, 300],
    'Cost': [0.5, 0.3, 0.2, 2.5, 1.0, 1.5, 0.3, 0.3, 0.5, 1.2, 1.6, 0.8, 0.5],
    'Calories': [95, 23, 205, 335, 120, 33, 50, 250, 350, 300, 420, 200, 250]
}


food = pd.DataFrame(data)
food.head()

**Line Plot**: To create a line plot, you can use the **plot()** method on the DataFrame.

In [None]:
import matplotlib.pyplot as plt
food.plot(x='Weight', y='Cost', kind='line', color='red')
plt.xlabel('Weight')
plt.ylabel('Cost')
plt.title('Line Plot')
plt.show()

**Scatter Plot**: To create a scatter plot use the **plot()** method with **kind='scatter'**.

In [None]:
food.plot(x='Weight', y='Cost', kind='scatter')
plt.xlabel('Weight')
plt.ylabel('Cost')
plt.title('Scatter Plot')
plt.show()

**Histogram**: To create a histogram use the **plot()** method with **kind='hist'**.

In [None]:
food['Cost'].plot(kind='hist', bins=10)
plt.xlabel('Cost')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

**Bar Plot**: To create a bar plot use the **plot()** method with **kind='bar'**.

In [None]:
food.plot(x='Weight', y='Cost', kind='bar')
plt.xlabel('Weight')
plt.ylabel('Cost')
plt.title('Bar Plot')
plt.show()

You can also utilize the **groupby()** function along with the **plot()** function with the **kind='bar'** option to aggregate column values and generate insightful bar visualizations.

In [None]:
# Grouping by 'Category' and summing 'Cost'
grouped_data = food.groupby('Category')['Cost'].sum()

# Creating a bar plot
grouped_data.plot(kind='bar')
plt.xlabel('Food Category')
plt.ylabel('Total Cost')
plt.title('Total Cost by Food Category')
plt.show()

**Task 2.7**: Create a histogram for the food weights in the Food DataFrame defined in this section.

In [None]:
#write your code here


**Task 2.8**: Generate informative bar charts illustrating the calorie distribution across food categories using the Food DataFrame introduced in this section.

In [None]:
#write your code here


##2.1.4 Boxplot

A boxplot is a graphical representation that provides insights into the distribution and variability of data. It helps us visualize the spread and central tendency of the data. In a boxplot, a rectangular "box" is drawn to represent the interquartile range (IQR), which spans from the first quartile (Q1) to the third quartile (Q3) of the data. The first quartile (Q1) is the value below which a quarter (25%) of the data values fall when arranged in increasing order. The third quartile (Q3) is the value below which three-quarters (75%) of the data values fall when arranged in increasing order.
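
As a minimal sketch of the quantities behind the box, the quartiles and the IQR can be computed directly with pandas. The sample values and the variable names below are illustrative assumptions and are not taken from any dataset used in this experiment.

```python
import pandas as pd

# Small made-up sample used only to illustrate the quartiles
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)  # first quartile (lower edge of the box)
q3 = values.quantile(0.75)  # third quartile (upper edge of the box)
iqr = q3 - q1               # interquartile range (height of the box)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")  # Q1 = 3.0, Q3 = 7.0, IQR = 4.0
```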

The boxplot can be created using the **boxplot()** method of a **pandas** DataFrame.

In [None]:
tips.boxplot(by ='day', column =['total_bill'], grid = False)

The boxplot can also be generated using the **boxplot()** method within the **sns** submodule.

In [None]:
sns.boxplot(x = 'day', y = 'total_bill', data = tips)

**Task 2.9**: Use the **boxplot()** function from **seaborn** (**sns**) to generate a boxplot for the dataset you loaded in Task 2.3.

In [None]:
#write your code here


#2.2 Descriptive statistics
Descriptive statistics involves analyzing and summarizing data to gain insight into its central tendency, variability, and overall distribution. It plays a crucial role in machine learning, both for data understanding and exploration and for data cleaning and preprocessing. In data understanding and exploration, descriptive statistics provide an initial overview of the dataset, helping you understand its distribution, central tendency, and variability. This exploration phase is vital for identifying patterns, anomalies, and potential issues that might affect the quality of your machine learning models. In data cleaning and preprocessing, descriptive statistics help you identify missing values, outliers, and inconsistencies that need to be addressed, so that your model receives accurate and reliable input data.

##2.2.1 Central Tendency
**Mean**: The average value of the data. \
**Median**: The middle value when the data is sorted. \
**Mode**: The value that appears most frequently in the data.

##2.2.2 Variation
**Variance**: A measure of how much the data points deviate from the mean. \
**Standard Deviation**: The square root of the variance, indicating the spread of the data.

##2.2.3 Shape of Distribution
**Skewness**: Measures the asymmetry of the data distribution. \
**Kurtosis**: Measures how heavy the tails of the data distribution are (often loosely described as its peakedness).

##2.2.4 Quantiles
**Percentiles**: Values below which a given percentage of the data falls. \
**Interquartile Range (IQR)**: The range between the 25th and 75th percentiles.

Pandas provides functions like mean(), median(), mode(), std(), var(), min(), max(), quantile(), and more to calculate these descriptive statistics. These functions can be applied to entire DataFrames, specific columns, or Series.
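
The following is a minimal sketch of these pandas functions applied to a small made-up Series. The values and the names `sample` and `stats` are assumptions chosen for illustration only.

```python
import pandas as pd

# Made-up sample used only for illustration
sample = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

stats = {
    "mean": sample.mean(),
    "median": sample.median(),
    "mode": sample.mode().iloc[0],  # mode() returns a Series; take the first mode
    "variance": sample.var(),       # sample variance (pandas default, ddof=1)
    "std": sample.std(),            # square root of the sample variance
    "skewness": sample.skew(),
    "kurtosis": sample.kurt(),      # excess kurtosis (0 for a normal distribution)
    "iqr": sample.quantile(0.75) - sample.quantile(0.25),
}

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Note that `var()` and `std()` in pandas use the sample definition (dividing by n - 1) by default; pass `ddof=0` if the population definition is needed.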

#2.3. Handling Missing Data


##2.3.1 Missing Numeric Data

## EXPLORATORY DATA ANALYSIS – DATA CLEANING

In this notebook, we will demonstrate Data Cleaning as part of Exploratory Data Analysis (EDA). We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). The dataset consists of 70000 records of patient data in 12 features. The target class "cardio" equals 1, when a patient has cardiovascular disease, and it's 0 if a patient is healthy.

# Import Libraries

First, we need to import some libraries that will be used during data cleaning.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import rcParams
import matplotlib.pyplot as plt

# Data Preparation

***Clone the dataset Repository***

The modified dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

***Read the dataset***

The data is stored in the cardio_train_modified.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/cardio_train_modified.csv",sep=";")
df.head()

***Display Data Info and Check NAN***

To display the content of the data and type of features use the info() method

In [None]:
df.info()

Here the dataframe consists of 70000 rows with 12 variables (features). Ten features are numerical and two features are objects (gender, smoke). We notice that for some of the features the number of non-null values does not equal 70000 which means that some feature values in the data are missing.

We can get the exact number of missing values for each feature using the isnull() method as below

In [None]:
df.isnull().sum()

We can also get the number and percentage of patients' records that have one or more missing values

In [None]:
print(df.isnull().any(axis=1).sum())
print(100*df.isnull().any(axis=1).sum()/df.shape[0],'%')

To display the records with NaN values

In [None]:
df[df.isnull().any(axis=1)]
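
As a minimal sketch, the same counting steps can be traced on a small made-up DataFrame where the result is easy to verify by hand. The name `demo` and its values are illustrative assumptions and are not taken from the cardiovascular dataset.

```python
import numpy as np
import pandas as pd

# Made-up records with some missing values
demo = pd.DataFrame({
    "weight": [62.0, np.nan, 64.0, np.nan],
    "smoke": ["No", "Yes", None, None],
    "cardio": [0, 1, 0, 1],
})

missing_per_column = demo.isnull().sum()      # NaN count for each feature
rows_with_missing = demo.isnull().any(axis=1)  # True for rows with at least one NaN
n_rows = rows_with_missing.sum()               # number of incomplete records
pct = 100 * rows_with_missing.mean()           # percentage of incomplete records

print(missing_per_column)
print(n_rows, "records,", pct, "%")  # 3 records, 75.0 %
```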

# Data Cleaning

**Data Cleaning: drop all empty records**

The first step is usually to drop all empty records, i.e., records in which all features are NaN.

In [None]:
df.dropna(how='all', inplace=True)
df.isnull().sum()

By comparing the number of NaN values before and after the last step, we notice that there were 3 empty records in the dataset. We also notice that the number of missing values for the 'weight', 'ap_hi', 'ap_lo', and 'gluc' features is very low, so a reasonable choice is to delete these patients' records from the dataset.
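
The difference between dropping fully empty records and dropping records with a missing value in specific features can be sketched on a small made-up DataFrame. The name `demo` and its values are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up records: row 1 is completely empty, rows 2 and 3 are partially missing
demo = pd.DataFrame({
    "weight": [62.0, np.nan, np.nan, 80.0],
    "ap_hi": [110.0, np.nan, 120.0, np.nan],
})

no_empty = demo.dropna(how="all")              # drops only rows where every feature is NaN
with_weight = demo.dropna(subset=["weight"])   # drops rows where 'weight' is NaN
complete = demo.dropna()                       # default how='any': keeps only complete rows

print(no_empty.index.tolist())     # [0, 2, 3]
print(with_weight.index.tolist())  # [0, 3]
print(complete.index.tolist())     # [0]
```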

**Data Cleaning: target feature (class, label)**

The target feature (class, label) "cardio" equals 1, when a patient has cardiovascular disease, and it's 0 if a patient is healthy. Notice that this feature 'cardio' does not have any missing data. Had there been any missing values in the target feature, then the corresponding patient records must be dropped.

In [None]:
print(df.shape)
df.dropna(subset=['cardio'], inplace=True)
print(df.shape)

As expected no record is dropped.

**Data Cleaning: 'weight' feature**

List the patients' records in which the 'weight' feature is NaN

In [None]:
df[df.weight.isnull()]

List the patients' records in which the 'weight' feature is not NaN

In [None]:
df[df.weight.notna()]

Delete (drop) the records in which the 'weight' feature is NaN, keeping only the rows with a non-missing weight.

In [None]:
print(df.shape)
df.dropna(subset=['weight'], inplace=True)
print(df.shape)

In [None]:
df.isnull().sum()

As can be observed, the number of records in the data frame was reduced by 4 (69996) and there is no NaN value left in the 'weight' feature.

**Data Cleaning: 'ap_hi', 'ap_lo', and 'gluc' features**

We will do the same for the 'ap_hi', 'ap_lo', and 'gluc' features.

In [None]:
print(df.shape)
df.dropna(subset=['ap_hi','ap_lo','gluc'], inplace=True)
print(df.shape)

In [None]:
df.isnull().sum()

**Data Cleaning: 'gender' feature**

The 'gender' feature is a string ('male' or 'female') and it has many missing values. One option is to drop all records in which the 'gender' feature is NaN. However, this means dropping ~1.4% of the records, a decision that should be made together with domain experts.

In [None]:
dfgender = df.copy()
print(dfgender.isnull()['gender'].sum())
print(100*dfgender.isnull()['gender'].sum()/dfgender.shape[0],'%')
print(dfgender.shape)
dfgender.dropna(subset=['gender'], inplace=True)
print(dfgender.shape)

Another option is to replace all missing values in the 'gender' feature with the majority class (male or female).

In [None]:
df['gender'].value_counts()

In [None]:
dfc = df.copy()
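# Assign the result back; inplace=True on a column selection may not update dfc in recent pandas versions
dfc['gender'] = dfc['gender'].fillna(value='female')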
dfc['gender'].value_counts()

As can be observed the number of female records increased.
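
Instead of hard-coding the majority value, it can be looked up with **mode()**. The sketch below uses a small made-up DataFrame; the name `demo` and its values are illustrative assumptions.

```python
import pandas as pd

# Made-up 'gender' column with two missing values
demo = pd.DataFrame({"gender": ["female", "male", "female", None, "female", None]})

majority = demo["gender"].mode()[0]               # most frequent value, NaN is ignored
demo["gender"] = demo["gender"].fillna(majority)  # replace NaN with the majority value

print(majority)                        # female
print(demo["gender"].value_counts())
```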

A third option is to try to set the missing 'gender' feature values based on other values in the record. For example, we can check the correlation between 'gender' and 'height' features.

In [None]:
df[['gender','height']].apply(lambda x: x.factorize()[0]).corr()

It seems that there is not much correlation. Let us try to check with other features.

In [None]:
df.apply(lambda x: x.factorize()[0]).corr()

It seems that the 'gender' feature has the highest correlation with the 'smoke' feature.

In [None]:
df[['gender','smoke']].apply(lambda x: x.factorize()[0]).corr()
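
To make the cells above easier to read, here is a minimal sketch of what **factorize()** returns. It replaces each distinct value with an integer code assigned in order of first appearance, and missing values receive the code -1. The Series below are made-up examples, not rows from the dataset.

```python
import pandas as pd

# Made-up categorical values with one missing entry
gender = pd.Series(["male", "female", "male", None, "female"])
codes, uniques = gender.factorize()

print(codes.tolist())  # [0, 1, 0, -1, 1]
print(list(uniques))   # ['male', 'female']

# For numeric columns the codes also follow order of appearance, not magnitude
heights = pd.Series([170, 150, 180, 150])
height_codes = heights.factorize()[0]
print(height_codes.tolist())  # [0, 1, 2, 1]
```

Because the codes of a numeric column do not preserve its ordering, correlations computed on factorized numeric features should be interpreted with care.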

Let us explore the correlation using crosstab

In [None]:
pd.crosstab(df['gender'],df['smoke'])

This implies that most non-smokers are females and most smokers are males in the dataset. So let us set the missing 'gender' values to 'male' for smokers and to 'female' for non-smokers.

In [None]:
dfsmoke = df.copy()
dfsmoke.loc[(dfsmoke.gender.isnull()) & (dfsmoke['smoke'] == 'Yes'),'gender']='male'
dfsmoke.loc[(dfsmoke.gender.isnull()) & (dfsmoke['smoke'] == 'No'),'gender']='female'
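
The same rule can also be written with **map()** and **fillna()**. This is only an alternative sketch on a made-up DataFrame (the names `demo` and `guess` are assumptions), not the approach used in the rest of this notebook.

```python
import pandas as pd

# Made-up records: three missing 'gender' values, one of them with 'smoke' also missing
demo = pd.DataFrame({
    "gender": ["male", None, None, "female", None],
    "smoke": ["Yes", "Yes", "No", "No", None],
})

# Proposed gender for each record based on 'smoke'; NaN where 'smoke' is missing
guess = demo["smoke"].map({"Yes": "male", "No": "female"})

# Fill only the missing 'gender' values; existing values are left unchanged
demo["gender"] = demo["gender"].fillna(guess)

print(demo)
```

The last record keeps a missing 'gender' value because its 'smoke' value is missing as well.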

Let us check the correlation using crosstab again.

In [None]:
pd.crosstab(dfsmoke['gender'],dfsmoke['smoke'])

We observe that the numbers of female non-smokers and male smokers both increased. We also need to check whether any NaN values remain in the 'gender' feature, which can happen when the 'smoke' feature is also NaN.

In [None]:
dfsmoke.isnull().sum()

There are 12 NaN values in the 'gender' feature. We will drop them because they make only very small percentage of the population (records in the dataset).

In [None]:
print(dfsmoke.shape)
dfsmoke.dropna(subset=['gender'], inplace=True)
print(dfsmoke.shape)

In this notebook, we will consider the third option to deal with the 'NaN' values in the 'gender' feature.

In [None]:
df = dfsmoke.copy()
df.isnull().sum()

**Data Cleaning: 'smoke' feature**

Let us now handle the missing values of the 'smoke' feature. This feature takes only two values, 'Yes' and 'No'. Is there any correlation with the other features?

In [None]:
df.apply(lambda x: x.factorize()[0]).corr()

Yes, there is a high correlation between the 'smoke' feature and both the 'gender' and 'alco' features. But since we already used the 'smoke' feature to deal with the NaN values in the 'gender' feature and thus the correlation between them might be affected, we will use the 'alco' feature to deal with the NaN values in the 'smoke' feature.

In [None]:
pd.crosstab(df['smoke'],df['alco'])

We can observe from the crosstab results that most persons in the dataset who do not drink alcohol are non-smokers, whereas persons who drink alcohol may or may not be smokers. So we will set all NaN values in the 'smoke' feature to 'No' for the records of persons who do not drink alcohol.

In [None]:
df.loc[(df.smoke.isnull()) & (df['alco'] == 0.0),'smoke']='No'

Let us check the correlation using crosstab again.

In [None]:
pd.crosstab(df['smoke'],df['alco'])

We observe that the number of non-smokers among the persons who do not drink alcohol increased. Let us now check the status of the missing values.

In [None]:
df.isnull().sum()

As the number of remaining missing values of the 'smoke' feature is small, we will drop all other records with the 'smoke' feature equal to NaN.

In [None]:
print(df.shape)
df.dropna(subset=['smoke'], inplace=True)
print(df.shape)
df.isnull().sum()

**Data Cleaning: 'cholesterol' feature**

Let us now handle the missing values of the 'cholesterol' feature. This feature takes three values.

In [None]:
df.cholesterol.unique()

 Is there any correlation with the other features?

In [None]:
df.apply(lambda x: x.factorize()[0]).corr()

Yes, there is a high correlation between the 'cholesterol' feature and the 'gluc' feature. Let us explore the correlation using crosstab.

In [None]:
(pd.crosstab(df['cholesterol'],df['gluc'])/pd.crosstab(df['cholesterol'],df['gluc']).sum())*100

We observe that 81% of the persons with a 'gluc' value of 1.0 also have a 'cholesterol' value of 1.0, and that 65% of the persons with a 'gluc' value of 3.0 also have a 'cholesterol' value of 3.0. We will use these two observations to handle missing values of the 'cholesterol' feature. However, for the persons with a 'gluc' value of 2.0, 43% and 46% have 'cholesterol' values of 1.0 and 2.0, respectively, which implies that we cannot use the 'gluc' value to fill missing 'cholesterol' values for these persons.

In [None]:
df.loc[(df.cholesterol.isnull()) & (df['gluc'] == 1.0),'cholesterol']=1.0
df.loc[(df.cholesterol.isnull()) & (df['gluc'] == 3.0),'cholesterol']=3.0
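
Column-wise percentages like the ones used above to choose the fill values can also be obtained with the `normalize` argument of **pd.crosstab()**. The sketch below uses a small made-up DataFrame; the name `demo` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up records with 'cholesterol' and 'gluc' levels
demo = pd.DataFrame({
    "cholesterol": [1, 1, 1, 2, 3, 3, 1, 2],
    "gluc":        [1, 1, 1, 1, 3, 3, 3, 3],
})

# normalize="columns" divides each column by its total, so each 'gluc' column sums to 100%
pct = pd.crosstab(demo["cholesterol"], demo["gluc"], normalize="columns") * 100

print(pct)
```

Here 75% of the records with a 'gluc' value of 1 have a 'cholesterol' value of 1, and 50% of the records with a 'gluc' value of 3 have a 'cholesterol' value of 3.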

Let us now check the status of missing values

In [None]:
df.isnull().sum()

Since the number of missing values in the 'cholesterol' feature is reduced to 39, we will remove these records from the dataset.

In [None]:
print(df.shape)
df.dropna(subset=['cholesterol'], inplace=True)
print(df.shape)
df.isnull().sum()

**Data Cleaning: 'height' feature**

Now, for the 'height' feature, is there any correlation with the other features?

In [None]:
df.apply(lambda x: x.factorize()[0]).corr()

Yes, there is a high correlation between the 'height' feature and both the 'gender' and 'weight' features. However, the 'height' feature is continuous, so we cannot handle it the way we handled the 'gender' feature. Instead, we should create a model that predicts the 'height' feature based on the 'gender' and 'weight' features, which we will study in the next modules. So, for now, we have two options: either drop all records where the 'height' feature is NaN, or replace all these NaN values with some statistical measure (mean, median) of the 'height' feature. In this notebook, we will replace the NaN values with the median of the values in the 'height' feature.

In [None]:
print(df.height.median())
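# Assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['height'] = df['height'].fillna(df.height.median())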
print(df.height.median())
df.isnull().sum()
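
As a small sketch of why the median is often preferred over the mean for this kind of imputation, consider a made-up height sample with one extreme value. The name `heights` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up heights (cm) with one missing value and one extreme value
heights = pd.Series([160.0, 165.0, np.nan, 170.0, 250.0])

median = heights.median()  # 167.5, barely affected by the extreme value
mean = heights.mean()      # 186.25, pulled upward by the extreme value

filled = heights.fillna(median)  # replace the missing value with the median

print(median, mean)
print(filled.tolist())
```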


# Remove Outliers

Let us have a close look at the statistical properties of the numeric features

In [None]:
df.describe()

Usually, the 'id' feature will not have outliers, so let us check the 'age' feature. According to the description of the dataset, the age is in days. Let us convert the Age into years so that it is easier to understand and interpret.

In [None]:
df['age_years'] = (df['age'] / 365).round().astype('int')
df.head()

Let us have a close look again at the statistical properties of the numeric features

In [None]:
df.describe()

The minimum age in the dataset is about 30 years, the maximum is about 65 years, and the average is about 53.33 years.

**Remove Outliers: 'height' and 'weight' feature**

Next, let us examine the 'height' feature. The minimum height is 55 cm, which is too short for persons with a minimum age of 30. Similarly, the maximum height is 250 cm, which is implausibly tall. So there must be errors in the 'height' feature. Let us also examine the 'weight' feature. The minimum weight is 10 kg, which is too low for persons with a minimum age of 30. So, again, there must be errors in the 'weight' feature. Let us get the box plot of these two features.

In [None]:
rcParams['figure.figsize'] = 10, 6
df.boxplot(column=['height', 'weight'])

As can be observed, there are outliers, so let us remove the records whose heights or weights fall below the 5th percentile or above the 95th percentile of the feature.

In [None]:
df.drop(df[(df['height'] > df['height'].quantile(0.95)) | (df['height'] < df['height'].quantile(0.05))].index,inplace=True)
df.drop(df[(df['weight'] > df['weight'].quantile(0.95)) | (df['weight'] < df['weight'].quantile(0.05))].index,inplace=True)
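
The same percentile-based filtering can be sketched with **between()** on a small made-up DataFrame, where the kept rows are easy to check by hand. The names `demo` and `trimmed` and the values are assumptions for illustration.

```python
import pandas as pd

# Made-up heights (cm) with two implausible values
demo = pd.DataFrame({"height": [55, 150, 155, 160, 162, 165, 168, 170, 175, 250]})

# 5th and 95th percentiles of the feature
low, high = demo["height"].quantile([0.05, 0.95])

# Keep only the rows within [low, high]; between() is inclusive on both ends
trimmed = demo[demo["height"].between(low, high)]

print(low, high)
print(trimmed["height"].tolist())  # [150, 155, 160, 162, 165, 168, 170, 175]
```

Note that this rule always discards roughly the lowest and highest 5% of the values, whether or not they are erroneous.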

Let us get the box plot of these two features again.

In [None]:
rcParams['figure.figsize'] = 10, 6
df.boxplot(column=['height', 'weight'])

As can be observed, the outliers for the 'height' and 'weight' features are removed.
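
The outline of this experiment also lists statistical outlier detection using the Z-score. The following is a minimal sketch of that idea on synthetic data, not the method applied to the cardiovascular dataset above. The generated heights, the random seed, and the common threshold of 3 standard deviations are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic heights (cm): 200 plausible values plus two injected implausible ones
rng = np.random.default_rng(0)
heights = pd.Series(np.append(rng.normal(165, 7, 200), [55.0, 250.0]))

# Z-score: distance of each value from the mean, in units of standard deviation
z = (heights - heights.mean()) / heights.std()

outliers = heights[z.abs() > 3]   # values more than 3 standard deviations from the mean
cleaned = heights[z.abs() <= 3]   # remaining values

print(outliers.tolist())
print(len(cleaned), "values kept")
```

Unlike the fixed percentile rule, the Z-score rule removes values only when they lie far from the mean, but extreme values also inflate the mean and the standard deviation used to detect them.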

**Remove Outliers: 'ap_hi' and 'ap_lo' feature**

Similarly, we will do the same for the 'ap_hi' and 'ap_lo' features especially since the blood pressure can not be negative. Below is the box plot for the 'ap_hi' and 'ap_lo' features.

In [None]:
rcParams['figure.figsize'] = 10, 6
df.boxplot(column=['ap_hi', 'ap_lo'])

Here we will remove the records whose 'ap_hi' or 'ap_lo' values fall below the 5th percentile or above the 95th percentile of the feature.

In [None]:
df.drop(df[(df['ap_hi'] > df['ap_hi'].quantile(0.95)) | (df['ap_hi'] < df['ap_hi'].quantile(0.05))].index,inplace=True)
df.drop(df[(df['ap_lo'] > df['ap_lo'].quantile(0.95)) | (df['ap_lo'] < df['ap_lo'].quantile(0.05))].index,inplace=True)

Then we plot again the box plot of the 'ap_hi' and 'ap_lo' features.

In [None]:
rcParams['figure.figsize'] = 10, 6
df.boxplot(column=['ap_hi', 'ap_lo'])

As can be observed, the outliers for the 'ap_hi' and 'ap_lo' features are removed. Let us also make sure that the systolic pressure 'ap_hi' is always higher than the diastolic pressure 'ap_lo'.

In [None]:
print("Systolic pressure is higher than diastolic pressure in {0}% of the patient records".format(100*df[df['ap_hi']> df['ap_lo']].shape[0]/df.shape[0]))
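
As a small sketch of this consistency check on made-up values, the percentage can be computed from a boolean mask, and the same mask could be used to keep only the consistent records. Filtering the records is an assumption added here for illustration; the notebook itself only reports the percentage.

```python
import pandas as pd

# Made-up blood pressure records; the second one has the two values swapped
demo = pd.DataFrame({
    "ap_hi": [120, 80, 130, 110],
    "ap_lo": [80, 120, 85, 70],
})

valid = demo["ap_hi"] > demo["ap_lo"]  # True where systolic exceeds diastolic
pct = 100 * valid.mean()               # percentage of consistent records
consistent = demo[valid]               # keep only the consistent records

print(pct)                        # 75.0
print(consistent.index.tolist())  # [0, 2, 3]
```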

**Remove Outliers: the other features**

The values of the other features are limited within a small range as can be observed from the min and max values in the statistical description table. Let us check if these features take only discrete values.

In [None]:
print('The discrete values of the \'cholesterol\' feature are {}'.format(set(df['cholesterol'].unique())))
print('The discrete values of the \'gluc\' feature are {}'.format(set(df['gluc'].unique())))
print('The discrete values of the \'active\' feature are {}'.format(set(df['active'].unique())))

Since the range of the other features is limited and their values are discrete, there is no need to apply outlier removal techniques to these features.

# Save Data

Now, we will save the clean dataset into a CSV file to be used in the next session.

In [None]:
df.to_csv("/content/AIData/cardio_train_cleaned.csv",sep=";",index=False)

Check the '/content/AIData/' folder for the 'cardio_train_cleaned.csv' file and download it for future usage.
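
When the cleaned file is loaded again in a later session, the same separator must be passed to **read_csv()**. The following sketch shows a save-and-reload round trip on a small made-up DataFrame written to a temporary directory; the file name `demo_cleaned.csv` and the values are assumptions.

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned records
demo = pd.DataFrame({"age_years": [50, 55], "gender": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv")
    demo.to_csv(path, sep=";", index=False)  # index=False avoids writing an extra index column
    restored = pd.read_csv(path, sep=";")    # the separator must match the one used when saving

print(restored)
```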