# EDA - Database 3

We will conduct an exploratory analysis by doing the following:

- **Understanding the structure of the dataset**: We aim to comprehend how the data is organized and what information it contains. This involves examining the different variables in the dataset and their types.
<br>

- **Identifying missing, duplicated or erroneous data**: We will search for any data that is incomplete or contains mistakes.
<br>

- **Exploring the distribution of data**: We will analyze how the data is spread out, including the range of values, central tendencies like the mean or median, and the presence of outliers or unusual patterns.

## Import libraries

In [1]:
import pandas as pd
import os
from getpass import getpass
import pymysql
import sqlalchemy as alch

## Read the database

In [2]:
df = pd.read_csv('data/gym_exercise.csv')

## Explore the database

In [3]:
df.head()

,Unnamed: 0,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
0,0,Partner plank band row,The partner plank band row is an abdominal exe...,Strength,Abdominals,Bands,Intermediate,0.0,
1,1,Banded crunch isometric hold,The banded crunch isometric hold is an exercis...,Strength,Abdominals,Bands,Intermediate,,
2,2,FYR Banded Plank Jack,The banded plank jack is a variation on the pl...,Strength,Abdominals,Bands,Intermediate,,
3,3,Banded crunch,The banded crunch is an exercise targeting the...,Strength,Abdominals,Bands,Intermediate,,
4,4,Crunch,The crunch is a popular core exercise targetin...,Strength,Abdominals,Bands,Intermediate,,


In [4]:
df.shape

(2918, 9)

In [5]:
df.columns

Index(['Unnamed: 0', 'Title', 'Desc', 'Type', 'BodyPart', 'Equipment', 'Level',
       'Rating', 'RatingDesc'],
      dtype='object')

In [6]:
df.dtypes

Unnamed: 0      int64
Title          object
Desc           object
Type           object
BodyPart       object
Equipment      object
Level          object
Rating        float64
RatingDesc     object
dtype: object
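As a complement to the structural overview, the distribution of individual columns can be inspected with `value_counts()` for categorical columns and `describe()` for numeric ones. The following is only a minimal sketch: the `sample` DataFrame is made-up data that mimics a few column names of the dataset, not the actual contents of `gym_exercise.csv`.

```python
import pandas as pd

# Hypothetical sample mimicking a few columns of the gym exercise dataset
sample = pd.DataFrame({
    "Type": ["Strength", "Strength", "Cardio", "Stretching"],
    "Level": ["Intermediate", "Beginner", "Beginner", "Intermediate"],
    "Rating": [8.1, None, 0.0, 9.3],
})

# Frequency of each category, most common first
type_counts = sample["Type"].value_counts()

# Summary statistics of the numeric column (NaN values are skipped)
rating_summary = sample["Rating"].describe()

print(type_counts)
print(rating_summary)
```

On the real data, the same two calls could be applied to `df["Type"]`, `df["BodyPart"]`, `df["Level"]` and `df["Rating"]`.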

### Column cleaning

In [7]:
# Drop the column Unnamed:0
df = df.drop("Unnamed: 0", axis=1)
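An `Unnamed: 0` column usually appears when a CSV was written together with its index. As an alternative to dropping it afterwards, the column can be read as the index directly. The sketch below uses a small in-memory CSV (`csv_text` is invented for illustration) to show the difference; whether this applies to `gym_exercise.csv` depends on how that file was produced.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first, unnamed column is a saved index
csv_text = ",Title,Level\n0,Crunch,Intermediate\n1,Banded crunch,Intermediate\n"

# Default read: the unnamed column is loaded as 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index so no extra column is created
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```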

In [8]:
# Count null values for each column -> Desc, Equipment, Rating and RatingDesc contain null values
df.isnull().sum()

Title            0
Desc          1550
Type             0
BodyPart         0
Equipment       32
Level            0
Rating        1887
RatingDesc    2056
dtype: int64

In [9]:
# Compute the share of null values in each column as a percentage of the total number of rows.
# The Desc column is more than 50% null, but it is not critical for our analysis since we have the exercise title.
# The Rating and RatingDesc columns are more than 60% null, so they cannot be used for the analysis.
round((df.isnull().sum() / df.shape[0]) * 100, 2)

Title          0.00
Desc          53.12
Type           0.00
BodyPart       0.00
Equipment      1.10
Level          0.00
Rating        64.67
RatingDesc    70.46
dtype: float64
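The null counts and percentages from the two cells above can also be combined into a single table, which makes the columns easier to compare. This is a sketch only: `null_summary` is a hypothetical helper, and `sample` is invented data rather than the real dataset.

```python
import pandas as pd

def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages per column (hypothetical helper)."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"null_count": counts, "null_percent": percent})
    # Columns with the highest share of nulls come first
    return summary.sort_values("null_percent", ascending=False)

# Made-up data: one complete column and one mostly empty column
sample = pd.DataFrame({
    "Title": ["Crunch", "Plank", "Squat", "Lunge"],
    "Rating": [8.1, None, None, None],
})

summary = null_summary(sample)
print(summary)
```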

In [10]:
# Check if there are duplicated columns
df.columns[df.T.duplicated()]

Index([], dtype='object')

### Row cleaning

In [11]:
# Check if there are duplicated rows -> 9 duplicated rows
df.duplicated().sum()

9
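Before dropping duplicates, it can be useful to look at the affected rows. `duplicated()` flags only the repeated occurrences by default, while `keep=False` flags every member of a duplicated group. A minimal sketch on invented data:

```python
import pandas as pd

# Made-up data with one repeated row
sample = pd.DataFrame({
    "Title": ["Crunch", "Crunch", "Plank"],
    "Level": ["Beginner", "Beginner", "Intermediate"],
})

# Default: only the second and later occurrences are counted
n_repeats = sample.duplicated().sum()

# keep=False marks every row that belongs to a duplicated group
dupes = sample[sample.duplicated(keep=False)]

print(n_repeats)
print(dupes)
```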

In [12]:
# Drop duplicated rows
df.drop_duplicates(inplace = True)

In [13]:
df = df.reset_index(drop=True)  # assign the result; reset_index returns a new DataFrame
df

,Title,Desc,Type,BodyPart,Equipment,Level,Rating,RatingDesc
0,Partner plank band row,The partner plank band row is an abdominal exe...,Strength,Abdominals,Bands,Intermediate,0.0,
1,Banded crunch isometric hold,The banded crunch isometric hold is an exercis...,Strength,Abdominals,Bands,Intermediate,,
2,FYR Banded Plank Jack,The banded plank jack is a variation on the pl...,Strength,Abdominals,Bands,Intermediate,,
3,Banded crunch,The banded crunch is an exercise targeting the...,Strength,Abdominals,Bands,Intermediate,,
4,Crunch,The crunch is a popular core exercise targetin...,Strength,Abdominals,Bands,Intermediate,,
...,...,...,...,...,...,...,...,...
2904,EZ-bar skullcrusher-,The EZ-bar skullcrusher is a popular exercise ...,Strength,Triceps,E-Z Curl Bar,Intermediate,8.1,Average
2905,Lying Close-Grip Barbell Triceps Press To Chin,,Strength,Triceps,E-Z Curl Bar,Beginner,8.1,Average
2906,EZ-Bar Skullcrusher - Gethin Variation,The EZ-bar skullcrusher is a popular exercise ...,Strength,Triceps,E-Z Curl Bar,Intermediate,,
2907,TBS Skullcrusher,The EZ-bar skullcrusher is a popular exercise ...,Strength,Triceps,E-Z Curl Bar,Intermediate,,


In [14]:
# Check if there are rows with only null values -> no rows with only null values
df.isnull().all(axis=1).sum()

0

### Save the data as a CSV file

In [15]:
# Save the dataframe as a CSV file (forward slashes keep the path portable across operating systems)
df.to_csv('data/data_clean/gym_exercise_clean.csv', index=False)

### Import data into SQL

In [16]:
password = getpass("Please input your password: ")

# Build the connection string for the local MySQL server and create the engine
dbName = "gym_exercise"
connectionData = f"mysql+pymysql://root:{password}@localhost/{dbName}"
engine = alch.create_engine(connectionData)

df.to_sql('gym_exercise_table', con=engine, if_exists='replace', index=False)

Please input your password: ········


2909
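One caveat with building the connection string by hand: a password containing characters such as `@` or `/` would break the URL unless it is escaped first. A minimal sketch using the standard library's `quote_plus`; the function name `build_mysql_url` and the example password are hypothetical.

```python
from urllib.parse import quote_plus

def build_mysql_url(user: str, password: str, host: str, db_name: str) -> str:
    """Build a MySQL connection URL with an escaped password (hypothetical helper)."""
    # Escape special characters so they are not read as URL delimiters
    safe_password = quote_plus(password)
    return f"mysql+pymysql://{user}:{safe_password}@{host}/{db_name}"

# Example with a made-up password containing '@' and '/'
url = build_mysql_url("root", "p@ss/word", "localhost", "gym_exercise")
print(url)
```

The resulting string could then be passed to `alch.create_engine` in place of `connectionData`.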