# Workout Metrics Data Analysis
## Table of Contents
1. [Setup Packages and Config](#setup-packages-and-config)
2. [Import Data](#import-data)
3. [Clean the Data](#clean-the-data)
   - [Flatten Nested Columns](#flatten-nested-columns)
   - [Convert Dates to DateTime](#convert-dates-to-datetime)
4. [Validate and Save Cleaned Data](#validate-and-save-cleaned-data)
5. [Data Analysis](#data-analysis)

# 1. Setup Packages and Config 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 2. Import Data

In [2]:
# Load the JSON data
data = pd.read_json('../original_data/data.json')["data"]

# Extract 'workouts' DataFrame from the JSON data
wdf = pd.DataFrame(data["workouts"])

# 3. Clean the Data
## 3.1 Flatten Nested Columns

In [3]:
# Flatten the nested columns in the 'workouts' DataFrame
def extract_qty_column(df, column_name):
    if column_name in df.columns:
        df[f'{column_name}_qty'] = df[column_name].apply(lambda x: x['qty'] if isinstance(x, dict) else x)
    else:
        df[f'{column_name}_qty'] = np.nan
    return df

# Extract the qty from all relevant columns
columns_to_extract = ['activeEnergyBurned', 'distance', 'lapLength', 'intensity', 'humidity', 'temperature']
for column_name in columns_to_extract: 
    wdf = extract_qty_column(wdf, column_name)

# Drop the original nested columns; errors='ignore' skips any that were absent from the export
wdf = wdf.drop(columns=columns_to_extract, errors='ignore')
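
To make the flattening step concrete, here is a minimal sketch on made-up records. It assumes each nested cell is a dict with a `qty` key, which is how `extract_qty_column` reads it. The sample rows, the `units` key, and the helper name `extract_qty` are illustrative assumptions, not taken from the real export.

```python
import numpy as np
import pandas as pd

# Hypothetical sample records; the real export may carry different keys or units.
sample = pd.DataFrame([
    {"name": "Running", "distance": {"qty": 5.2, "units": "km"}},
    {"name": "Yoga", "distance": None},
])

def extract_qty(df, column_name):
    """Return a copy of df with a '<column_name>_qty' column added."""
    out = df.copy()
    if column_name in out.columns:
        # Pull 'qty' out of dict cells; pass anything else (e.g. None) through.
        out[f"{column_name}_qty"] = out[column_name].apply(
            lambda x: x["qty"] if isinstance(x, dict) else x
        )
    else:
        # Column absent from the export: fill with NaN so later code can rely on it.
        out[f"{column_name}_qty"] = np.nan
    return out

flat = sample
for col in ["distance", "humidity"]:
    flat = extract_qty(flat, col)

# errors="ignore" skips columns that were never present (here: 'humidity').
flat = flat.drop(columns=["distance", "humidity"], errors="ignore")
```

Returning a copy instead of mutating the input is a design choice for this sketch; it keeps the sample frame unchanged if the cell is re-run.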

## 3.2 Convert Dates to DateTime


In [None]:
# Convert Dates to DateTime objects
wdf['start'] = pd.to_datetime(wdf['start'], format='%Y-%m-%d %H:%M:%S %z')
wdf['end'] = pd.to_datetime(wdf['end'], format='%Y-%m-%d %H:%M:%S %z')
# Count the number of missing values in each column 
print(wdf.isnull().sum())
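
One caveat worth sketching: if the export spans a daylight-saving change, the rows carry different UTC offsets, and parsing them without `utc=True` may yield an object column or an error, depending on the pandas version. The example below uses made-up timestamps in the same layout and normalizes everything to UTC. The sample values and the `minutes` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamps in the 'YYYY-MM-DD HH:MM:SS +/-HHMM' layout used above.
sample = pd.DataFrame({
    "start": ["2024-01-05 07:30:00 -0500", "2024-07-05 07:30:00 -0400"],
    "end":   ["2024-01-05 08:15:00 -0500", "2024-07-05 08:00:00 -0400"],
})

fmt = "%Y-%m-%d %H:%M:%S %z"
for col in ["start", "end"]:
    # utc=True converts every row to UTC, so mixed offsets share one dtype.
    sample[col] = pd.to_datetime(sample[col], format=fmt, utc=True)

# Elapsed time is unaffected by the conversion to UTC.
sample["minutes"] = (sample["end"] - sample["start"]).dt.total_seconds() / 60
```

If local wall-clock times matter for the analysis (for example, morning versus evening workouts), the UTC column can be converted back with `dt.tz_convert` and a named time zone.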

# 4. Validate and Save Cleaned Data

In [None]:
# Validate cleaned data
print(wdf.head()) 
# Save the cleaned data
wdf.to_csv('../cleaned_data/cl_workouts.csv', index=False)
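
Printing the first rows only gives a visual check. As a possible complement, the sketch below runs a few programmatic checks before saving. The function name `find_problems`, the specific rules, and the sample rows are assumptions for illustration; the column names `start`, `end`, `duration`, and the `_qty` suffix follow the notebook.

```python
import pandas as pd

def find_problems(df):
    """Return a list of human-readable problems found in a workouts frame."""
    problems = []
    # A workout cannot end before it starts.
    if (df["end"] < df["start"]).any():
        problems.append("end before start")
    # Durations should be non-negative.
    if (df["duration"] < 0).any():
        problems.append("negative duration")
    # Flattened *_qty columns should hold numbers, not leftover dicts.
    for col in [c for c in df.columns if c.endswith("_qty")]:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"non-numeric values in {col}")
    return problems

# Hypothetical rows: the second one has its start and end swapped.
sample = pd.DataFrame({
    "start": pd.to_datetime(["2024-01-05 07:30", "2024-01-06 09:00"]),
    "end": pd.to_datetime(["2024-01-05 08:15", "2024-01-06 08:30"]),
    "duration": [2700.0, 1800.0],
    "distance_qty": [5.2, 3.1],
})
problems = find_problems(sample)
```

An empty list would mean none of these particular checks failed; it does not prove the data is otherwise clean.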

# 5. Data Analysis
#### Compute summary statistics, including count, mean, and standard deviation, for each workout group.

In [None]:
# Group workouts by the 'name' column for analysis
grouped = wdf.groupby('name')
print(grouped['duration'].describe())
# print(grouped['activeEnergyBurned_qty'].describe())

#### Find and display the correlation matrix, which shows the pairwise correlation between the selected workout metrics.

In [None]:
# Select the numeric workout columns, indexed by workout id (no pivot is needed for this data)
pivoted_df = wdf[["id", "duration", "activeEnergyBurned_qty", "intensity_qty"]].set_index("id")
# Calculate correlations
correlation_matrix = pivoted_df.corr()
print(correlation_matrix)
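
For reference, `DataFrame.corr()` computes Pearson correlation by default and excludes missing values pair by pair, so each cell may be based on a different number of rows. The sketch below shows this on made-up numbers and cross-checks one cell against NumPy. All values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the NaN mimics a workout with no recorded intensity.
sample = pd.DataFrame({
    "duration": [1800.0, 2700.0, 3600.0, 2400.0],
    "activeEnergyBurned_qty": [210.0, 330.0, 400.0, 290.0],
    "intensity_qty": [6.5, np.nan, 7.1, 5.9],
})

corr = sample.corr()  # Pearson by default

# Cross-check one cell against NumPy, using only rows where both values exist.
pair = sample[["duration", "intensity_qty"]].dropna()
r_numpy = np.corrcoef(pair["duration"], pair["intensity_qty"])[0, 1]
```

Here the `duration` / `intensity_qty` cell is computed from three rows, while the `duration` / `activeEnergyBurned_qty` cell uses all four.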

#### Fit a linear regression to understand and predict the relationship between a dependent variable (target) and one or more independent variables (predictors).

In [None]:
from functions import linear_regression
linear_regression(pivoted_df, 'duration', 'activeEnergyBurned_qty')
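
The `linear_regression` helper lives in the local `functions` module and is not shown in this notebook, so its behavior and argument order are unknown here. As a rough idea of what a single-predictor fit involves, here is a minimal sketch using `numpy.polyfit`. The name `fit_line`, its return values, and the sample data are assumptions, not the helper's actual interface.

```python
import numpy as np
import pandas as pd

def fit_line(df, x_col, y_col):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept, r2)."""
    pair = df[[x_col, y_col]].dropna()  # the fit cannot use incomplete rows
    x = pair[x_col].to_numpy(dtype=float)
    y = pair[y_col].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    # R^2 = 1 - SS_res / SS_tot
    r2 = 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return slope, intercept, r2

# Hypothetical data lying exactly on y = 2x + 1, plus one incomplete row.
sample = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [3.0, 5.0, 7.0, np.nan]})
slope, intercept, r2 = fit_line(sample, "x", "y")
```

On real workout data the points will not fall on a line, and `r2` indicates how much of the variation in the target the fitted line accounts for.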

#### Analyze trends over time to understand the shape of the data and anticipate near-future trends.

In [None]:
from functions import plot_workout_trends

# Plot trends for each metric
metrics_to_plot = ["duration", "activeEnergyBurned_qty", "intensity_qty"]
for metric in metrics_to_plot:
    plot_workout_trends(wdf, metric, ylabel=metric)
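
`plot_workout_trends` is also defined in the local `functions` module and is not shown here. One common way to expose a trend before plotting is to average the metric per week, which smooths out day-to-day noise. The sketch below does only that aggregation; the name `weekly_mean`, the choice of weekly bins, and the sample rows are assumptions for illustration.

```python
import pandas as pd

def weekly_mean(df, metric):
    """Average `metric` per calendar week, indexed by each week's end date (Sunday)."""
    series = df.set_index("start")[metric].sort_index()
    return series.resample("W").mean()

# Hypothetical workouts: two in the first week of 2024, one in the second.
sample = pd.DataFrame({
    "start": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-08"]),
    "duration": [30.0, 50.0, 60.0],
})
weekly = weekly_mean(sample, "duration")
```

The resulting Series can be drawn with `weekly.plot()`, which relies on matplotlib being installed.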

#### Analyze the correlation between two specific metrics at a time

In [None]:
from functions import analyze_workout_correlation

# Uses the flattened workouts DataFrame created earlier (wdf)
analyze_workout_correlation(wdf, "duration", "activeEnergyBurned_qty", show_plot=True)
analyze_workout_correlation(wdf, "activeEnergyBurned_qty", "intensity_qty", show_plot=True)
