This notebook performs an Exploratory Data Analysis (EDA) of the processed dataset. The goal is to understand how the data is distributed, how the predictor variables relate to the target variable, and how the features correlate with one another, so that potential dependencies within the dataset become visible.

# 1. Introduction

Data analysis is a crucial step in the machine learning pipeline: it examines the dataset to understand its characteristics and to uncover patterns that predictive models can exploit. Exploratory Data Analysis (EDA) does this with statistical and visualization techniques. In this notebook, we look at the distribution of each variable, the relationships between variables, and any patterns that could be useful for building predictive models.

# 2. Data Overview

Our datasets contain time series of heart rate, speed, and cadence recorded from individuals during cycling and running activities.

In [130]:
import pandas as pd
import os

# Import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [131]:
data_running = pd.read_csv("./datasets/cleaned/running.csv", index_col=0)
data_biking = pd.read_csv("./datasets/cleaned/biking.csv", index_col=0)

In [132]:
# The index holds the date and time of each sample, but it is not in chronological order, so we sort it
data_running.index = pd.to_datetime(data_running.index)
data_running = data_running.sort_index()
data_running.head()

timestamp,heart-rate,speed,cadence
2021-03-24 21:40:55,99.5,1.623,81.0
2021-03-24 21:40:56,99.5,1.623,81.0
2021-03-24 21:40:57,99.5,1.623,81.0
2021-03-24 21:40:58,99.5,1.623,97.0
2021-03-24 21:40:59,99.5,1.922,97.0


In [133]:
data_biking.index = pd.to_datetime(data_biking.index)
data_biking = data_biking.sort_index()
data_biking.head()

timestamp,heart-rate,speed,cadence
2020-08-18 14:43:19,102.0,4.325,64.0
2020-08-18 14:43:20,103.0,4.336,64.0
2020-08-18 14:43:21,105.0,4.409,66.0
2020-08-18 14:43:22,106.0,4.445,66.0
2020-08-18 14:43:23,106.0,4.441,67.0
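
Before aggregating, it can help to confirm that a frame really is sorted, has no missing values, and is sampled at the expected interval. The following is a minimal sketch on a small hand-made frame; `sample` and its values are made up for illustration and only mimic the column layout of `data_running`.

```python
import pandas as pd

# Hypothetical stand-in for data_running (values invented for illustration)
sample = pd.DataFrame(
    {
        "heart-rate": [99.5, 99.5, 101.0, 102.0],
        "speed": [1.623, 1.623, 1.922, 2.010],
        "cadence": [81.0, 81.0, 97.0, 97.0],
    },
    index=pd.to_datetime(
        [
            "2021-03-24 21:40:57",
            "2021-03-24 21:40:55",
            "2021-03-24 21:40:56",
            "2021-03-24 21:40:59",
        ]
    ),
).sort_index()

# True once the index is in chronological order
is_sorted = sample.index.is_monotonic_increasing

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Gap between consecutive samples, in seconds (the first row has no predecessor)
gaps = sample.index.to_series().diff().dt.total_seconds().dropna()

print(is_sorted)
print(missing_per_column)
print(gaps.value_counts())
```

On the real data, a gap distribution dominated by 1-second steps would suggest a regular sampling rate, while larger gaps would point to pauses or separate activities.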


## Data by Date

In [134]:
# Calculate the min, max and average heart rate for each activity by day
running_hr = data_running.groupby(data_running.index.date).agg(
    min_hr=pd.NamedAgg(column="heart-rate", aggfunc="min"),
    max_hr=pd.NamedAgg(column="heart-rate", aggfunc="max"),
    avg_hr=pd.NamedAgg(column="heart-rate", aggfunc="mean"),
)

biking_hr = data_biking.groupby(data_biking.index.date).agg(
    min_hr=pd.NamedAgg(column="heart-rate", aggfunc="min"),
    max_hr=pd.NamedAgg(column="heart-rate", aggfunc="max"),
    avg_hr=pd.NamedAgg(column="heart-rate", aggfunc="mean"),
)

# Calculate the min, max and average speed for each activity by day
running_speed = data_running.groupby(data_running.index.date).agg(
    min_speed=pd.NamedAgg(column="speed", aggfunc="min"),
    max_speed=pd.NamedAgg(column="speed", aggfunc="max"),
    avg_speed=pd.NamedAgg(column="speed", aggfunc="mean"),
)

biking_speed = data_biking.groupby(data_biking.index.date).agg(
    min_speed=pd.NamedAgg(column="speed", aggfunc="min"),
    max_speed=pd.NamedAgg(column="speed", aggfunc="max"),
    avg_speed=pd.NamedAgg(column="speed", aggfunc="mean"),
)

# Calculate the min, max and average cadence for each activity by day
running_cadence = data_running.groupby(data_running.index.date).agg(
    min_cadence=pd.NamedAgg(column="cadence", aggfunc="min"),
    max_cadence=pd.NamedAgg(column="cadence", aggfunc="max"),
    avg_cadence=pd.NamedAgg(column="cadence", aggfunc="mean"),
)

biking_cadence = data_biking.groupby(data_biking.index.date).agg(
    min_cadence=pd.NamedAgg(column="cadence", aggfunc="min"),
    max_cadence=pd.NamedAgg(column="cadence", aggfunc="max"),
    avg_cadence=pd.NamedAgg(column="cadence", aggfunc="mean"),
)

# Visualize the data
fig = make_subplots(
    rows=3,
    cols=2,
    shared_xaxes=False,
    subplot_titles=(
        "Running HR (bpm)",
        "Biking HR (bpm)",
        "Running Speed (m/s)",
        "Biking Speed (m/s)",
        "Running Cadence (rpm)",
        "Biking Cadence (rpm)",
    ),
    vertical_spacing=0.1,
)

fig.add_trace(
    go.Scatter(x=running_hr.index, y=running_hr["min_hr"], mode="lines", name="Min HR"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=running_hr.index, y=running_hr["max_hr"], mode="lines", name="Max HR"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=running_hr.index, y=running_hr["avg_hr"], mode="lines", name="Avg HR"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=biking_hr.index, y=biking_hr["min_hr"], mode="lines", name="Min HR"),
    row=1,
    col=2,
)
fig.add_trace(
    go.Scatter(x=biking_hr.index, y=biking_hr["max_hr"], mode="lines", name="Max HR"),
    row=1,
    col=2,
)
fig.add_trace(
    go.Scatter(x=biking_hr.index, y=biking_hr["avg_hr"], mode="lines", name="Avg HR"),
    row=1,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=running_speed.index,
        y=running_speed["min_speed"],
        mode="lines",
        name="Min Speed",
    ),
    row=2,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=running_speed.index,
        y=running_speed["max_speed"],
        mode="lines",
        name="Max Speed",
    ),
    row=2,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=running_speed.index,
        y=running_speed["avg_speed"],
        mode="lines",
        name="Avg Speed",
    ),
    row=2,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=biking_speed.index,
        y=biking_speed["min_speed"],
        mode="lines",
        name="Min Speed",
    ),
    row=2,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=biking_speed.index,
        y=biking_speed["max_speed"],
        mode="lines",
        name="Max Speed",
    ),
    row=2,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=biking_speed.index,
        y=biking_speed["avg_speed"],
        mode="lines",
        name="Avg Speed",
    ),
    row=2,
    col=2,
)




fig.add_trace(
    go.Scatter(
        x=running_cadence.index,
        y=running_cadence["min_cadence"],
        mode="lines",
        name="Min Cadence",
    ),
    row=3,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=running_cadence.index,
        y=running_cadence["max_cadence"],
        mode="lines",
        name="Max Cadence",
    ),
    row=3,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=running_cadence.index,
        y=running_cadence["avg_cadence"],
        mode="lines",
        name="Avg Cadence",
    ),
    row=3,
    col=1,
)

fig.add_trace(
    go.Scatter(
        x=biking_cadence.index,
        y=biking_cadence["min_cadence"],
        mode="lines",
        name="Min Cadence",
    ),
    row=3,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=biking_cadence.index,
        y=biking_cadence["max_cadence"],
        mode="lines",
        name="Max Cadence",
    ),
    row=3,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=biking_cadence.index,
        y=biking_cadence["avg_cadence"],
        mode="lines",
        name="Avg Cadence",
    ),
    row=3,
    col=2,
)


fig.update_layout(
    title_text="Activity Metrics",
    showlegend=False,
    height=1500,
    title_x=0.5,
    title_font_size=50,
)

fig.show()
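
The six aggregations in the cell above all follow the same pattern: group by calendar day, then take the minimum, maximum, and mean of one column. One possible way to express that pattern once is a small helper. `daily_stats` is a hypothetical name, not part of the notebook, and the toy frame below uses invented values.

```python
import pandas as pd


def daily_stats(df: pd.DataFrame, column: str, suffix: str) -> pd.DataFrame:
    """Return the min, max and mean of `column` for each calendar day.

    Assumes `df` has a DatetimeIndex, as data_running and data_biking do.
    """
    return df.groupby(df.index.date)[column].agg(
        **{
            f"min_{suffix}": "min",
            f"max_{suffix}": "max",
            f"avg_{suffix}": "mean",
        }
    )


# Toy data covering two days (values invented for illustration)
toy = pd.DataFrame(
    {"heart-rate": [100.0, 110.0, 120.0, 140.0]},
    index=pd.to_datetime(
        [
            "2021-03-24 21:40:55",
            "2021-03-24 21:40:56",
            "2021-03-25 08:00:00",
            "2021-03-25 08:00:01",
        ]
    ),
)

daily = daily_stats(toy, "heart-rate", "hr")
print(daily)
```

With such a helper, `running_hr` could be written as `daily_stats(data_running, "heart-rate", "hr")`, and the speed and cadence frames would follow the same form.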

## Distribution

In [135]:
# Draw distribution of heart rate

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Running HR (bpm)", "Biking HR (bpm)"),
    vertical_spacing=0.1,
)

fig.add_trace(
    go.Histogram(x=data_running["heart-rate"], nbinsx=50, name="Running HR"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Histogram(x=data_biking["heart-rate"], nbinsx=50, name="Biking HR"),
    row=1,
    col=2,
)

fig.update_layout(
    title_text="Heart Rate Distribution",
    showlegend=True,
    height=600,
    title_x=0.5,
    title_font_size=50,
    yaxis_title="Count",
)

fig.show()

In [136]:
# Draw distribution of speed

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Running Speed (m/s)", "Biking Speed (m/s)"),
    vertical_spacing=0.1,
)

fig.add_trace(
    go.Histogram(x=data_running["speed"], nbinsx=50, name="Running Speed"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Histogram(x=data_biking["speed"], nbinsx=50, name="Biking Speed"),
    row=1,
    col=2,
)

fig.update_layout(
    title_text="Speed Distribution",
    showlegend=True,
    height=600,
    title_x=0.5,
    title_font_size=50,
    yaxis_title_text="Count",
)

fig.show()

In [137]:
# Draw distribution of cadence

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Running Cadence (rpm)", "Biking Cadence (rpm)"),
    vertical_spacing=0.1,
)

fig.add_trace(
    go.Histogram(x=data_running["cadence"], nbinsx=50, name="Running Cadence"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Histogram(x=data_biking["cadence"], nbinsx=50, name="Biking Cadence"),
    row=1,
    col=2,
)

fig.update_layout(
    title_text="Cadence Distribution",
    showlegend=True,
    height=600,
    title_x=0.5,
    title_font_size=50,
    yaxis_title="Count",
)

fig.show()

## Correlation

In [138]:
# Running data correlation
fig = px.scatter_matrix(
    data_running,
    dimensions=["heart-rate", "speed", "cadence"],
    title="Running Data Correlation",
)
fig.update_layout(title_x=0.5, title_font_size=50)
fig.show()

In [139]:
# Draw the correlation matrix
fig = px.imshow(data_running.corr(), title="Running Data Correlation")
fig.update_layout(title_x=0.5, title_font_size=50)
fig.show()

As the correlation matrix shows, heart rate is only weakly correlated with speed and cadence. Heart rate therefore does not appear to depend linearly on either of them, and other factors may be influencing it. However, speed and cadence have a strong positive correlation: they are closely related and tend to increase or decrease together.
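
A heatmap conveys correlation strength through color, so it can be useful to read the coefficients as numbers as well. The sketch below does this on a toy frame; the values are invented so that the result is easy to verify by hand and do not come from the running dataset. `DataFrame.corr()` computes Pearson correlation by default.

```python
from itertools import combinations

import pandas as pd

# Toy data (invented): cadence is an exact linear function of speed,
# while heart rate only loosely follows speed
toy = pd.DataFrame(
    {
        "heart-rate": [120.0, 150.0, 130.0, 160.0, 140.0],
        "speed": [2.0, 2.5, 3.0, 3.5, 4.0],
        "cadence": [80.0, 85.0, 90.0, 95.0, 100.0],
    }
)

corr = toy.corr()  # Pearson correlation by default

# Collect each unordered pair of columns once, skipping the diagonal
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

The same dictionary comprehension could be applied to `data_running.corr()` or `data_biking.corr()` to list the actual coefficients behind the heatmaps.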

In [140]:
# Biking data correlation
fig = px.scatter_matrix(
    data_biking,
    dimensions=["heart-rate", "speed", "cadence"],
    title="Biking Data Correlation",
)
fig.update_layout(title_x=0.5, title_font_size=50)
fig.show()

In [141]:
# Draw the correlation matrix
fig = px.imshow(data_biking.corr(), title="Biking Data Correlation")
fig.update_layout(title_x=0.5, title_font_size=50)
fig.show()

## Feature Engineering

In [156]:
# Feature engineering
# Calculate the distance for each activity
# Distance is calculated as speed * time, where time is the difference between the current and previous timestamps on the same day


# Calculate the time for running for each day from the current and previous timestamp
data_running["timestamp"] = pd.to_datetime(data_running.index)
data_running["distance"] = (
    data_running.groupby(data_running.index.date)["timestamp"]
    .transform(lambda x: x - x.shift())
    .dt.total_seconds()
    * data_running["speed"]
)
# Fill nan values with 0
data_running["distance"] = data_running["distance"].fillna(0)

# Get total distance from the start of the day
data_running["distance"] = data_running.groupby(data_running.index.date)[
    "distance"
].cumsum()

# Drop the timestamp column
data_running = data_running.drop(columns=["timestamp"])

# Calculate the time for biking for each day from the current and previous timestamp
data_biking["timestamp"] = pd.to_datetime(data_biking.index)
data_biking["distance"] = (
    data_biking.groupby(data_biking.index.date)["timestamp"]
    .transform(lambda x: x - x.shift())
    .dt.total_seconds()
    * data_biking["speed"]
)
# Fill nan values with 0
data_biking["distance"] = data_biking["distance"].fillna(0)

# Get total distance from the start of the day
data_biking["distance"] = data_biking.groupby(data_biking.index.date)["distance"].cumsum()

# Drop the timestamp column
data_biking = data_biking.drop(columns=["timestamp"])
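
The distance calculation can be checked by hand on a tiny example. The sketch below applies the same idea (elapsed seconds since the previous sample of the same day, multiplied by the current speed, then accumulated per day) to an invented frame. It uses `groupby(...).diff()` rather than the `transform` call above, which should be equivalent here; `toy`, `days`, and `elapsed` are illustrative names.

```python
import pandas as pd

# Toy data (invented): three samples on one day, two on the next
toy = pd.DataFrame(
    {"speed": [2.0, 2.0, 3.0, 4.0, 4.0]},
    index=pd.to_datetime(
        [
            "2021-03-24 21:40:55",
            "2021-03-24 21:40:56",
            "2021-03-24 21:40:58",
            "2021-03-25 08:00:00",
            "2021-03-25 08:00:01",
        ]
    ),
)

days = toy.index.date

# Seconds since the previous sample of the same day; 0 for the first sample of a day
elapsed = toy.index.to_series().groupby(days).diff().dt.total_seconds().fillna(0)

# Distance covered in each step, accumulated from the start of the day
toy["distance"] = (elapsed * toy["speed"]).groupby(days).cumsum()

print(toy)
```

Note that any pause between two samples on the same day is also multiplied by the speed, so days containing several separate activities may show an inflated distance with this approach.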

In [159]:
# Visualize the data
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Running Distance (m)", "Biking Distance (m)"),
    vertical_spacing=0.1,
)

# Get the total distance for each day
running_distance = data_running.groupby(data_running.index.date).agg(
    distance=pd.NamedAgg(column="distance", aggfunc="max")
)

biking_distance = data_biking.groupby(data_biking.index.date).agg(
    distance=pd.NamedAgg(column="distance", aggfunc="max")
)

fig.add_trace(
    go.Scatter(x=running_distance.index, y=running_distance["distance"], mode="lines"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=biking_distance.index, y=biking_distance["distance"], mode="lines"),
    row=1,
    col=2,
)

fig.update_layout(
    title_text="Distance",
    showlegend=False,
    height=600,
    title_x=0.5,
    title_font_size=50,
    yaxis_title="Distance (m)",
)

fig.show()

In [160]:
# Save the data
data_running.to_csv("./datasets/processed/running.csv")
data_biking.to_csv("./datasets/processed/biking.csv")