# EDA - Database 2

We will conduct an exploratory analysis by doing the following:

- **Understanding the structure of the dataset**: We aim to comprehend how the data is organized and what information it contains. This involves examining the different variables in the dataset and their types.

- **Identifying missing, duplicated or erroneous data**: We will search for any data that is incomplete or contains mistakes.

- **Exploring the distribution of data**: We will analyze how the data is spread out, including the range of values, measures of central tendency such as the mean or median, and the presence of outliers or unusual patterns.
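As a minimal sketch of the distribution step, the snippet below summarizes a small sample and flags values outside 1.5 times the interquartile range. The sample is limited to the five rows displayed by `df.head()` in this notebook; the helper name `iqr_outliers` and the 1.5 multiplier are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# First five rows as displayed by df.head(), restricted to three numeric columns
sample = pd.DataFrame({
    'age': [27.0, 25.0, 31.0, 32.0, 28.0],
    'height_cm': [172.3, 165.0, 179.6, 174.5, 173.8],
    'weight_kg': [75.24, 55.8, 78.0, 71.1, 67.7],
})

# Count, mean, standard deviation, min, quartiles and max for each column
summary = sample.describe()


def iqr_outliers(series, k=1.5):
    """Hypothetical helper: return values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


outliers = iqr_outliers(sample['weight_kg'])
print(summary)
print(outliers)
```

On the full dataset the same calls would be applied to `df` once it has been loaded, for example `df.describe()` and `iqr_outliers(df['weight_kg'])`.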

## Import libraries

In [1]:
import pandas as pd
import os
from getpass import getpass
import pymysql
import sqlalchemy as alch

## Read the database

In [2]:
df = pd.read_csv('data/body_performance.csv')

## Explore the database

In [3]:
df.head()

,age,gender,height_cm,weight_kg,body fat_%,diastolic,systolic,gripForce,sit and bend forward_cm,sit-ups counts,broad jump_cm,class
0,27.0,M,172.3,75.24,21.3,80.0,130.0,54.9,18.4,60.0,217.0,C
1,25.0,M,165.0,55.8,15.7,77.0,126.0,36.4,16.3,53.0,229.0,A
2,31.0,M,179.6,78.0,20.1,92.0,152.0,44.8,12.0,49.0,181.0,C
3,32.0,M,174.5,71.1,18.4,76.0,147.0,41.4,15.2,53.0,219.0,B
4,28.0,M,173.8,67.7,17.1,70.0,127.0,43.5,27.1,45.0,217.0,B


In [4]:
df.shape

(13393, 12)

In [5]:
df.columns

Index(['age', 'gender', 'height_cm', 'weight_kg', 'body fat_%', 'diastolic',
       'systolic', 'gripForce', 'sit and bend forward_cm', 'sit-ups counts',
       'broad jump_cm', 'class'],
      dtype='object')

In [6]:
df.dtypes

age                        float64
gender                      object
height_cm                  float64
weight_kg                  float64
body fat_%                 float64
diastolic                  float64
systolic                   float64
gripForce                  float64
sit and bend forward_cm    float64
sit-ups counts             float64
broad jump_cm              float64
class                       object
dtype: object

### Column cleaning

In [7]:
df.columns = df.columns.str.split().str.join('_')
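Joining on whitespace still leaves characters such as `%` and `-`, and the camelCase `gripForce`, in the column names. If fully SQL-friendly identifiers are wanted before the import step, one option is sketched below. The helper `sql_friendly` and the `pct` suffix are illustrative assumptions; the notebook itself keeps the names produced by the cell above.

```python
import re


def sql_friendly(name):
    """Hypothetical helper: turn a raw column label into snake_case."""
    name = name.replace('%', 'pct')                      # spell out the percent sign
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)  # split camelCase: gripForce -> grip_Force
    name = re.sub(r'[^0-9a-zA-Z]+', '_', name)           # spaces, hyphens, etc. -> single underscore
    return name.strip('_').lower()


raw_columns = ['body fat_%', 'gripForce', 'sit and bend forward_cm',
               'sit-ups counts', 'broad jump_cm']
cleaned = [sql_friendly(col) for col in raw_columns]
print(cleaned)
```

Applied to the DataFrame, this would read `df.columns = [sql_friendly(col) for col in df.columns]`.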

In [8]:
# Count null values for each column -> no null values
df.isnull().sum()

age                        0
gender                     0
height_cm                  0
weight_kg                  0
body_fat_%                 0
diastolic                  0
systolic                   0
gripForce                  0
sit_and_bend_forward_cm    0
sit-ups_counts             0
broad_jump_cm              0
class                      0
dtype: int64

In [9]:
# Check if there are duplicated columns
df.columns[df.T.duplicated()]

Index([], dtype='object')

In [10]:
# Map each age range (inclusive bounds) to its label
age_labels = {
    (20, 39): '20-39',
    (40, 59): '40-59',
    (60, 79): '60-79'
}

# Create a new column based on age ranges
df['age_group'] = df['age'].apply(lambda x: next((v for k, v in age_labels.items() if k[0] <= x <= k[1]), 'Other'))
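The same grouping can also be expressed with `pd.cut`, which bins the whole column at once instead of calling a lambda per row. This is a sketch of an alternative, not the method used above: the function name `add_age_group` is hypothetical, and with `right=False` the bins are half-open (`[20, 40)`, `[40, 60)`, `[60, 80)`), so a fractional age such as 39.5 would fall into `20-39` rather than `Other`.

```python
import pandas as pd


def add_age_group(frame):
    """Hypothetical helper: return a copy of frame with an 'age_group' column."""
    bins = [20, 40, 60, 80]                 # edges of the half-open bins [20, 40), [40, 60), [60, 80)
    labels = ['20-39', '40-59', '60-79']
    groups = pd.cut(frame['age'], bins=bins, labels=labels, right=False)
    out = frame.copy()
    # Ages outside every bin come back as NaN; label them 'Other' as in the cell above
    out['age_group'] = groups.cat.add_categories(['Other']).fillna('Other')
    return out


sample = pd.DataFrame({'age': [20.0, 39.0, 40.0, 59.0, 64.0, 85.0]})
grouped = add_age_group(sample)
print(grouped)
```

The result is a categorical column, which keeps the groups in a fixed order for later grouping or plotting.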

### Row cleaning

In [11]:
# Check if there are duplicated rows -> 1 duplicated row
df.duplicated().sum()

1

In [12]:
# Drop duplicated rows
df.drop_duplicates(inplace = True)
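Before dropping, it can help to look at the repeated rows themselves. The sketch below uses a small made-up frame (the third row is a deliberate, hypothetical repeat of the first) to show `duplicated(keep=False)`, which marks every copy of a repeated row, followed by a drop that also resets the index.

```python
import pandas as pd

# Illustrative sample: the third row deliberately repeats the first
sample = pd.DataFrame({
    'age': [27.0, 25.0, 27.0],
    'gender': ['M', 'M', 'M'],
    'height_cm': [172.3, 165.0, 172.3],
})

# keep=False flags all copies, so both members of the pair are shown
dupes = sample[sample.duplicated(keep=False)]

# Drop the repeats and renumber the rows so the index has no gaps
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dupes)
print(deduped)
```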

### Save the data as a CSV file

In [13]:
# Save the cleaned DataFrame as a CSV file (forward slashes work on every OS)
df.to_csv('data/data_clean/body_performance_clean.csv', index=False)
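`to_csv` does not create missing folders, so the save fails if `data_clean` does not exist yet. A minimal sketch of a guard follows; the helper name `save_clean_csv` is hypothetical, and the example writes to a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean_csv(frame, path):
    """Hypothetical helper: create the parent folder if needed, then write the CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


sample = pd.DataFrame({'age': [27.0, 25.0], 'gender': ['M', 'M']})
with tempfile.TemporaryDirectory() as tmp:
    target = save_clean_csv(sample, Path(tmp) / 'data_clean' / 'body_performance_clean.csv')
    reloaded = pd.read_csv(target)   # read it back to confirm the round trip
print(reloaded.shape)
```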

### Import data into SQL

In [14]:
password = getpass("Please input your password: ")

dbName = "body_performance"
connectionData=f"mysql+pymysql://root:{password}@localhost/{dbName}"
engine = alch.create_engine(connectionData)

df.to_sql('body_performance_table', con=engine, if_exists='replace', index=False)

Please input your password: ········


13392
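One caveat with building the connection string through an f-string: a password containing characters such as `@` or `/` would break the URL. A hedged sketch of one way to guard against this is to percent-encode the password first. The helper `build_connection_url` and the sample password are illustrative assumptions; the user, host and database name are the ones used in the cell above.

```python
from urllib.parse import quote


def build_connection_url(password, db_name="body_performance",
                         user="root", host="localhost"):
    """Hypothetical helper: build the MySQL URL with a percent-encoded password."""
    safe_password = quote(password, safe='')   # encode every reserved character
    return f"mysql+pymysql://{user}:{safe_password}@{host}/{db_name}"


# Made-up password containing characters that are reserved in URLs
url = build_connection_url("p@ss/word")
print(url)
```

The resulting string can be passed to `alch.create_engine` in place of `connectionData`.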