In [1]:
import pandas as pd

DATA = '/kaggle/input/workout-and-fitness-tracker-data/workout_fitness_tracker_data.csv'
df = pd.read_csv(filepath_or_buffer=DATA, index_col=['User ID'])
df.head()

User ID,Age,Gender,Height (cm),Weight (kg),Workout Type,Workout Duration (mins),Calories Burned,Heart Rate (bpm),Steps Taken,Distance (km),Workout Intensity,Sleep Hours,Water Intake (liters),Daily Calories Intake,Resting Heart Rate (bpm),VO2 Max,Body Fat (%),Mood Before Workout,Mood After Workout
1,39,Male,175,99,Cycling,79,384,112,8850,14.44,High,8.2,1.9,3195,61,38.4,28.5,Tired,Fatigued
2,36,Other,157,112,Cardio,73,612,168,2821,1.1,High,8.6,1.9,2541,73,38.4,28.5,Happy,Energized
3,25,Female,180,66,HIIT,27,540,133,18898,7.28,High,9.8,1.9,3362,80,38.4,28.5,Happy,Fatigued
4,56,Male,154,89,Cycling,39,672,118,14102,6.55,Medium,5.8,1.9,2071,65,38.4,28.5,Neutral,Neutral
5,53,Other,194,59,Strength,56,410,170,16518,3.17,Medium,7.3,1.9,3298,59,38.4,28.5,Stressed,Energized


In [2]:
df['Gender'].value_counts().to_dict()

{'Other': 3392, 'Male': 3370, 'Female': 3238}

We have exactly 10,000 records, and they are split roughly evenly across three genders. That is an early indication that the data is synthetic.
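
As a quick sanity check on that observation, the total and the per-gender shares can be worked out from the counts printed above. This is a minimal sketch that hard-codes those three counts instead of recomputing them from `df`; the names `counts`, `total`, and `shares` are introduced here for illustration only.

```python
# Counts copied from the value_counts() output above (not recomputed from df).
counts = {'Other': 3392, 'Male': 3370, 'Female': 3238}

# Total number of records across all three categories.
total = sum(counts.values())

# Share of each category, rounded for readability.
shares = {gender: round(count / total, 4) for gender, count in counts.items()}

print(total)
print(shares)
```

Each share lands close to one third, which is what the prose means by "roughly evenly".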

Let's project the numeric columns down to two dimensions with t-SNE and plot the result colored by gender; this should tell us whether the data has any signal in it.

In [3]:
# Keep only the numeric (float64 / int64) columns for dimensionality reduction.
COLUMNS = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
RANDOM_STATE = 2025
TARGET = 'Gender'

In [4]:
from sklearn.manifold import TSNE

reducer = TSNE(random_state=RANDOM_STATE)
plot_df = pd.DataFrame(columns=['x', 'y'], data=reducer.fit_transform(X=df[COLUMNS]))
plot_df[TARGET] = df[TARGET].tolist()

In [5]:
from plotly import express
from plotly import io

io.renderers.default = 'iframe'
express.scatter(data_frame=plot_df, x='x', y='y', color=TARGET)

Our data definitely looks synthetic. Let's make some more plots.

In [6]:
express.histogram(data_frame=df, x='Height (cm)', facet_col='Gender')

In [7]:
express.histogram(data_frame=df, x='Weight (kg)', facet_col='Gender')

Yes. This is synthetic data. The histograms look roughly uniform and much the same for every gender, but real height and weight are not uniformly distributed, especially accounting for sex.
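
One way to put a number on "looks uniform" is a Pearson chi-square statistic of equal-width bin counts against a flat expectation. The sketch below is not part of the original notebook: `chi_square_uniform` is a hypothetical helper, and the two samples are simulated purely to show how a flat column and a bell-shaped column differ. A small statistic is consistent with a uniform distribution; a large one is not.

```python
import random


def chi_square_uniform(values, n_bins=10):
    """Chi-square statistic of equal-width bin counts vs. a flat expectation."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # Clamp the maximum value into the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    expected = len(values) / n_bins
    return sum((count - expected) ** 2 / expected for count in counts)


# Simulated illustration only: these are NOT values from the dataset.
rng = random.Random(2025)
flat = [rng.uniform(150, 200) for _ in range(10_000)]  # uniform "heights"
bell = [rng.gauss(170, 8) for _ in range(10_000)]      # bell-shaped "heights"

flat_stat = chi_square_uniform(flat)
bell_stat = chi_square_uniform(bell)
print(flat_stat, bell_stat)
```

In the notebook the same helper could be applied to a real column, for example `chi_square_uniform(df['Height (cm)'].tolist())`, and repeated per gender.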

In [8]:
express.scatter(data_frame=df, x='Height (cm)', y='Weight (kg)', color='Gender', facet_col='Gender')

This is definitely synthetic data.
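
The scatter plot can also be summarized with a correlation coefficient. This is a hedged sketch that assumes the giveaway in the plot is a lack of any relationship between height and weight, which the notebook does not state explicitly. `pearson_r` is a hypothetical helper, and the sample is simulated from two independent uniform draws, not taken from the dataset.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Simulated illustration only: independent uniform draws, NOT dataset values.
rng = random.Random(2025)
heights = [rng.uniform(150, 200) for _ in range(10_000)]
weights = [rng.uniform(50, 120) for _ in range(10_000)]

r = pearson_r(heights, weights)
print(r)
```

Independent columns give a coefficient near zero, whereas height and weight in a real population tend to be positively correlated. On the notebook's data the call would be `pearson_r(df['Height (cm)'].tolist(), df['Weight (kg)'].tolist())`.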