# EDA - Database 2

We will conduct an exploratory analysis by doing the following:

- **Understanding the structure of the dataset**: We examine how the data is organized, which variables it contains, and the type of each variable.

- **Identifying missing, duplicated or erroneous data**: We look for null values, duplicated columns and duplicated rows, and inspect value ranges for implausible entries.

## Import libraries

In [2]:
import pandas as pd
import os
from getpass import getpass
import pymysql
import sqlalchemy as alch

## Read the database

In [3]:
df = pd.read_csv('data/heart_disease.csv')

## Explore the database

In [4]:
df.head()

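| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |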


In [5]:
df.BMI.max()

94.85
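
The single `max()` call above can be extended into a range check across several numeric columns. The following is a minimal sketch on a toy frame. The bounds (0 to 30 days for `PhysicalHealth`, 0 to 24 hours for `SleepTime`) and the helper name `out_of_range_counts` are assumptions made for illustration, not values taken from the dataset's documentation.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only.
toy = pd.DataFrame({
    "BMI": [16.6, 20.34, 94.85],
    "PhysicalHealth": [3.0, 0.0, 31.0],   # 31.0 is deliberately out of range
    "SleepTime": [5.0, 7.0, 8.0],
})

# Assumed plausible bounds (hypothetical; adjust to the data dictionary).
bounds = {
    "PhysicalHealth": (0, 30),
    "SleepTime": (0, 24),
}


def out_of_range_counts(frame, limits):
    """Return the number of values outside [low, high] for each listed column."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in limits.items()
    }


violations = out_of_range_counts(toy, bounds)
print(toy.describe())   # min/max per column give a quick view of extremes
print(violations)
```

On the real data, the same helper could be called as `out_of_range_counts(df, bounds)` once the bounds have been confirmed.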

In [11]:
df.AgeCategory.unique()

array(['55-59', '80 or older', '65-69', '75-79', '40-44', '70-74',
       '60-64', '50-54', '45-49', '18-24', '35-39', '30-34', '25-29'],
      dtype=object)

In [13]:
df.Diabetic.unique()

array(['Yes', 'No', 'No, borderline diabetes', 'Yes (during pregnancy)'],
      dtype=object)
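
The per-column `unique()` calls above can be generalized to every non-numeric column at once. This is a minimal sketch on a toy frame (the rows are illustrative) that reports the number of distinct labels and how often each one occurs.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only.
toy = pd.DataFrame({
    "Diabetic": ["Yes", "No", "No, borderline diabetes", "No"],
    "Sex": ["Female", "Female", "Male", "Female"],
    "BMI": [16.6, 20.34, 26.58, 24.21],
})

# Keep only the non-numeric (label) columns.
categorical = toy.select_dtypes(exclude="number")

# Number of distinct labels per column.
cardinality = categorical.nunique()
print(cardinality)

# Frequency of each label, column by column.
for col in categorical.columns:
    print(toy[col].value_counts())
```

Frequencies are often more informative than `unique()` alone, because rare labels such as `'Yes (during pregnancy)'` stand out by their counts.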

In [5]:
df.shape

(319795, 18)

In [6]:
df.columns

Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer'],
      dtype='object')

In [7]:
df.dtypes

HeartDisease         object
BMI                 float64
Smoking              object
AlcoholDrinking      object
Stroke               object
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime           float64
Asthma               object
KidneyDisease        object
SkinCancer           object
dtype: object
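
Most columns above are stored as `object`, although each holds a small, fixed set of labels. One option, which this notebook does not apply, is to convert them to the pandas `category` dtype, which usually reduces memory use. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only.
toy = pd.DataFrame({
    "HeartDisease": ["No", "No", "Yes", "No"] * 250,
    "GenHealth": ["Very good", "Fair", "Good", "Very good"] * 250,
    "BMI": [16.6, 20.34, 26.58, 24.21] * 250,
})

before = toy.memory_usage(deep=True).sum()

# Convert every non-numeric column to the category dtype.
label_cols = toy.select_dtypes(exclude="number").columns
converted = toy.astype({col: "category" for col in label_cols})

after = converted.memory_usage(deep=True).sum()
print(converted.dtypes)
print(f"bytes before: {before}, after: {after}")
```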

### Column cleaning

In [8]:
# Count null values for each column
df.isnull().sum()

HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHealth        0
DiffWalking         0
Sex                 0
AgeCategory         0
Race                0
Diabetic            0
PhysicalActivity    0
GenHealth           0
SleepTime           0
Asthma              0
KidneyDisease       0
SkinCancer          0
dtype: int64

In [9]:
# Check if there are duplicated columns
df.columns[df.T.duplicated()]

Index([], dtype='object')

### Row cleaning

In [10]:
# Count duplicated rows -> 18078 duplicated rows found
df.duplicated().sum()

18078

In [11]:
df.drop_duplicates(inplace=True)
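
The removal can be verified with simple row arithmetic: 319795 original rows minus 18078 duplicates should leave 301717. The sketch below shows the same check on toy data. It assigns the result instead of using `inplace=True`, so the original frame stays available for comparison.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only.
toy = pd.DataFrame({
    "HeartDisease": ["No", "No", "Yes", "No"],
    "BMI": [16.6, 16.6, 26.58, 16.6],
})

n_rows = len(toy)
n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row

# drop_duplicates keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Sanity check: rows removed must equal the duplicates counted.
assert len(deduped) == n_rows - n_dupes
print(n_rows, n_dupes, len(deduped))
```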

### Save the data as a CSV file

In [12]:
# Save the cleaned dataframe as a CSV file (forward slashes work on every OS)
df.to_csv('data/data_clean/heart_disease_clean.csv', index=False)
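
`to_csv` does not create missing directories, so the save fails if the target folder does not exist yet. This minimal sketch creates the folder first with the `os` module, which is already imported at the top of the notebook. A temporary directory stands in for the project folder here.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned dataframe; values are illustrative only.
toy = pd.DataFrame({"HeartDisease": ["No", "Yes"], "BMI": [16.6, 26.58]})

# A temporary directory stands in for the project root in this sketch.
root = tempfile.mkdtemp()
out_dir = os.path.join(root, "data", "data_clean")

# Create the target folder (and its parents); no error if it already exists.
os.makedirs(out_dir, exist_ok=True)

out_path = os.path.join(out_dir, "heart_disease_clean.csv")
toy.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserved the shape.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```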

### Import data into SQL

In [13]:
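# Prompt for the MySQL password without echoing it
password = getpass("Please input your password: ")

# Build the connection string and create the SQLAlchemy engine
dbName = "heart_disease"
connectionData = f"mysql+pymysql://root:{password}@localhost/{dbName}"
engine = alch.create_engine(connectionData)

# Write the cleaned dataframe to MySQL, replacing the table if it already exists
df.to_sql('heart_disease_table', con=engine, if_exists='replace', index=False)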

Please input your password: ········


301717
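
A password containing characters such as `@`, `:` or `/` would break the connection string built above, because those characters act as URL delimiters. Below is a minimal sketch of escaping the password with the standard library's `quote_plus` before interpolating it. The password value and the `db_name` variable are made-up placeholders.

```python
from urllib.parse import quote_plus

# Placeholder credentials for illustration only (not real values).
password = "p@ss/w:rd"
db_name = "heart_disease"

# Percent-encode the password so reserved characters stay inside the password field.
safe_password = quote_plus(password)
connection_url = f"mysql+pymysql://root:{safe_password}@localhost/{db_name}"

print(connection_url)
```

The resulting string can be passed to `alch.create_engine` in the same way as `connectionData`.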