# General Exploration of the Data

This notebook explores the data used in the WOD prediction project.


## Importing Required Libraries and Modules

In [1]:
# import sys

# PATH_TO_WORKSPACE = '/Users/hassan/Documents/wod-prediction/WOD-prediction'
# sys.path.append(PATH_TO_WORKSPACE)

from IPython.display import display

import pandas as pd
from pprint import pprint

from wod_predictor.data_loader import DataLoader
from wod_predictor.preprocessor import DataPreprocessor
from wod_predictor.modeling import RandomForestModel




## Loading the Data

We use the `DataLoader` class (defined in `data_loader.py`) to load the data. The loaded data is stored in the `data` dictionary.



**Note about 'athlete_info' data (exploratory):**

Athlete info data is stored in `*_info.csv` files. This notebook loads
those files and explores their contents.

Athlete info is currently supported only in the `DataLoader` class. The
rest of the pipeline does not use it yet, in particular the
`DataPreprocessor` class and the `feature_engineering_parts/` module.

The goal is to support athlete info throughout the pipeline, from
loading to modelling, as modular functions in the `DataLoader` and
`DataPreprocessor` classes (`data_loader.py` and `preprocessor.py`) and
in the relevant classes and files in `feature_engineering_parts/`.

In [2]:
# loader = DataLoader(root_path = f'{PATH_TO_WORKSPACE}/Data',objects= ['open_results','descriptions','benchmark_stats'])
loader = DataLoader(
    root_path="../../Data",
    objects=["open_results", "descriptions", "benchmark_stats", "athlete_info"],
)
data = loader.load()

# NOTE: The 'descriptions' object is stored under the key
# 'workout_descriptions' in the data dictionary.


  df = pd.read_csv(os.path.join(self.root_path, file))

In [3]:

# Display the info for each data object except 'workout_descriptions',
# which is a dictionary; the rest are pandas DataFrames.
# DataFrame.info() prints its report and returns None, so display()
# also shows `None` after each report.
for key, value in data.items():
    if key != "workout_descriptions":
        print(f"\nData object: {key}:\n")
        display(value.info())
        print("\n")


Data object: open_results:

<class 'pandas.core.frame.DataFrame'>
Index: 570004 entries, 469656 to 1954019
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   23.1    272025 non-null  object
 1   23.2A   264235 non-null  object
 2   23.2B   264280 non-null  object
 3   23.3    241274 non-null  object
 4   19.1    178391 non-null  object
 5   19.2    164658 non-null  object
 6   19.3    158692 non-null  object
 7   19.4    153653 non-null  object
 8   19.5    137382 non-null  object
 9   22.1    138522 non-null  object
 10  22.2    133571 non-null  object
 11  22.3    119763 non-null  object
 12  20.1    120481 non-null  object
 13  20.2    114610 non-null  object
 14  20.3    107941 non-null  object
 15  20.4    105228 non-null  object
 16  20.5    100123 non-null  object
 17  21.1    117346 non-null  object
 18  21.2    111373 non-null  object
 19  21.3    101150 non-null  object
 20  21.4    101176 non-null  object
dtypes: 

None




Data object: athlete_info:

<class 'pandas.core.frame.DataFrame'>
Index: 570004 entries, 153604 to 2148950
Data columns (total 20 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   19.name    195562 non-null  object 
 1   19.age     195562 non-null  float64
 2   19.height  125933 non-null  object 
 3   19.weight  134870 non-null  object 
 4   20.name    133874 non-null  object 
 5   20.age     133874 non-null  float64
 6   20.height  86885 non-null   object 
 7   20.weight  91551 non-null   object 
 8   23.name    302228 non-null  object 
 9   23.age     302231 non-null  float64
 10  23.height  146886 non-null  object 
 11  23.weight  144511 non-null  object 
 12  21.name    137464 non-null  object 
 13  21.age     137464 non-null  float64
 14  21.height  82114 non-null   object 
 15  21.weight  86501 non-null   object 
 16  22.name    154815 non-null  object 
 17  22.age     154815 non-null  float64
 18  22.height  85131 non-null   obje

None




Data object: benchmark_stats:

<class 'pandas.core.frame.DataFrame'>
Index: 77227 entries, 469656 to 884354
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            77226 non-null  object 
 1   Back Squat      32782 non-null  float64
 2   Chad1000x       820 non-null    float64
 3   Clean and Jerk  32246 non-null  float64
 4   Deadlift        33220 non-null  float64
 5   Fight Gone Bad  50 non-null     float64
 6   Filthy 50       4354 non-null   float64
 7   Fran            18766 non-null  float64
 8   Grace           14287 non-null  float64
 9   Helen           10083 non-null  float64
 10  L1 Benchmark    248 non-null    float64
 11  Max Pull-ups    13972 non-null  float64
 12  Run 5k          14094 non-null  float64
 13  Snatch          31253 non-null  float64
 14  Sprint 400m     7675 non-null   float64
dtypes: float64(14), object(1)
memory usage: 9.4+ MB


None







## Data Exploration

Let's preview the keys in our data dictionary and the first few rows of each DataFrame.



In [4]:
# Extracting the keys of the data
keys = list(data.keys())
keys = [str(key) for key in keys]
print(f"`data` dict keys:\n{keys}")

# Preview the head(), shape, and size of each of the data objects except
# for the 'workout_descriptions' object.
for key in keys:
    if key != "workout_descriptions":
        print(f"\n{key} data object:")
        print(f"\nShape: {data[key].shape}")
        print(f"Size: {data[key].size}")
        display(data[key].head())

# Preview the "workout_descriptions" dictionary. Look it up by name
# rather than by its position in `keys`.
n = 2  # Number of entries to preview
descriptions = data["workout_descriptions"]
preview = {k: descriptions[k] for k in list(descriptions)[:n]}
print("\nworkout_descriptions data object:\n")
pprint(preview)

`data` dict keys:
['open_results', 'athlete_info', 'workout_descriptions', 'benchmark_stats']

open_results data object:

Shape: (570004, 21)
Size: 11970084


id,23.1,23.2A,23.2B,23.3,19.1,19.2,19.3,19.4,19.5,22.1,...,22.3,20.1,20.2,20.3,20.4,20.5,21.1,21.2,21.3,21.4
469656,298 reps,181 reps,312 lbs,8:01,384 reps,17:48,9:54,9:09,8:14,367 reps,...,4:52,9:06,1016 reps,6:23,12:41,10:59,11:55,9:14,8:15,317 lbs
300638,277 reps,175 reps,327 lbs,7:44,352 reps,19:34,9:37,9:05,10:04,328 reps,...,5:26,9:41,884 reps,6:22,14:30,11:27,488 reps,12:07,9:44,330 lbs
676693,274 reps,180 reps,302 lbs,9:29,332 reps,19:45,177 reps,8:48,7:31,361 reps,...,4:24,8:31,884 reps,6:36,239 reps,12:39,13:47,9:04,8:06,296 lbs
663689,278 reps,172 reps,317 lbs,8:55,368 reps,17:27,8:53,9:07,8:36,350 reps,...,4:43,8:40,986 reps,7:08,17:02,11:10,14:33,9:55,8:39,312 lbs
1031875,273 reps,170 reps,317 lbs,9:10,335 reps,423 reps,9:37,9:42,10:08,329 reps,...,4:59,10:10,822 reps,7:19,239 reps,13:32,484 reps,11:00,8:39,305 lbs



athlete_info data object:

Shape: (570004, 20)
Size: 11400080


id,19.name,19.age,19.height,19.weight,20.name,20.age,20.height,20.weight,23.name,23.age,23.height,23.weight,21.name,21.age,21.height,21.weight,22.name,22.age,22.height,22.weight
153604,Mathew Fraser,29.0,67 in,195 lb,Mathew Fraser,30.0,67 in,195 lb,,,,,,,,,,,,
81616,Björgvin Karl Guðmundsson,26.0,178 cm,190 lb,Björgvin Karl Guðmundsson,27.0,178 cm,190 lb,Björgvin Karl Guðmundsson,30.0,178 cm,190 lb,Björgvin Karl Guðmundsson,28.0,178 cm,185 lb,Björgvin Karl Guðmundsson,29.0,178 cm,185 lb
199938,Jacob Heppner,29.0,67 in,195 lb,Jacob Heppner,30.0,67 in,195 lb,,,,,,,,,,,,
514502,Lefteris Theofanidis,29.0,171 cm,81 kg,Lefteris Theofanidis,30.0,171 cm,81 kg,,,,,,,,,,,,
308712,Jean-Simon Roy-Lemaire,25.0,176 cm,195 lb,Jean-Simon Roy-Lemaire,26.0,176 cm,195 lb,Jean-Simon Roy-Lemaire,29.0,176 cm,195 lb,,,,,Jean-Simon Roy-Lemaire,28.0,176 cm,195 lb



benchmark_stats data object:

Shape: (77227, 15)
Size: 1158405


id,name,Back Squat,Chad1000x,Clean and Jerk,Deadlift,Fight Gone Bad,Filthy 50,Fran,Grace,Helen,L1 Benchmark,Max Pull-ups,Run 5k,Snatch,Sprint 400m
469656,Jeffrey Adler,475.0,,377.0,567.0,,917.0,122.0,76.0,438.0,,54.0,1155.0,290.0,59.0
300638,Tola Morakinyo,500.0,,390.0,615.0,,,,,,,,,340.0,
676693,Colten Mertens,555.0,,335.0,545.0,,,,,,,,,275.0,
663689,Tyler Christophel,,,,,,,,,,,,,,
1031875,Roldan Goldbaum,445.0,,350.0,515.0,,,130.0,67.0,,,,1080.0,280.0,56.0



workout_descriptions data object:

{'17.1': {'description': '10 dumbbell snatches, 15 burpee box jump-overs, 20 '
                         'dumbbell snatches, 15 burpee box jump-overs, 30 '
                         'dumbbell snatches, 15 burpee box jump-overs, 40 '
                         'dumbbell snatches, 15 burpee box jump-overs, 50 '
                         'dumbbell snatches, 15 burpee box jump-overs',
          'goal': 'For Time',
          'time_cap': 20,
          'total_reps': 225},
 '17.2': {'description': 'Complete as many rounds and reps as possible in 12 '
                         'minutes of: 2 rounds of: 50-ft. weighted walking '
                         'lunge, 16 toes-to-bars, 8 power cleans. Then, 2 '
                         'rounds of: 50-ft. weighted walking lunge, 16 bar '
                         'muscle-ups, 8 power cleans. Etc., alternating '
                         'between toes-to-bars and bar muscle-ups every 2 '
                         'rounds. Men us

## Athlete Info Exploration

In [5]:
df_athinfo = data['athlete_info']
display(df_athinfo.head())
display(df_athinfo.info())


id,19.name,19.age,19.height,19.weight,20.name,20.age,20.height,20.weight,23.name,23.age,23.height,23.weight,21.name,21.age,21.height,21.weight,22.name,22.age,22.height,22.weight
153604,Mathew Fraser,29.0,67 in,195 lb,Mathew Fraser,30.0,67 in,195 lb,,,,,,,,,,,,
81616,Björgvin Karl Guðmundsson,26.0,178 cm,190 lb,Björgvin Karl Guðmundsson,27.0,178 cm,190 lb,Björgvin Karl Guðmundsson,30.0,178 cm,190 lb,Björgvin Karl Guðmundsson,28.0,178 cm,185 lb,Björgvin Karl Guðmundsson,29.0,178 cm,185 lb
199938,Jacob Heppner,29.0,67 in,195 lb,Jacob Heppner,30.0,67 in,195 lb,,,,,,,,,,,,
514502,Lefteris Theofanidis,29.0,171 cm,81 kg,Lefteris Theofanidis,30.0,171 cm,81 kg,,,,,,,,,,,,
308712,Jean-Simon Roy-Lemaire,25.0,176 cm,195 lb,Jean-Simon Roy-Lemaire,26.0,176 cm,195 lb,Jean-Simon Roy-Lemaire,29.0,176 cm,195 lb,,,,,Jean-Simon Roy-Lemaire,28.0,176 cm,195 lb


<class 'pandas.core.frame.DataFrame'>
Index: 570004 entries, 153604 to 2148950
Data columns (total 20 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   19.name    195562 non-null  object 
 1   19.age     195562 non-null  float64
 2   19.height  125933 non-null  object 
 3   19.weight  134870 non-null  object 
 4   20.name    133874 non-null  object 
 5   20.age     133874 non-null  float64
 6   20.height  86885 non-null   object 
 7   20.weight  91551 non-null   object 
 8   23.name    302228 non-null  object 
 9   23.age     302231 non-null  float64
 10  23.height  146886 non-null  object 
 11  23.weight  144511 non-null  object 
 12  21.name    137464 non-null  object 
 13  21.age     137464 non-null  float64
 14  21.height  82114 non-null   object 
 15  21.weight  86501 non-null   object 
 16  22.name    154815 non-null  object 
 17  22.age     154815 non-null  float64
 18  22.height  85131 non-null   object 
 19  22.weight  89002 non-n

None
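The height and weight columns hold strings with mixed units (`67 in`, `178 cm`, `195 lb`, `81 kg`), so they need converting before they can serve as numeric features. The following is a minimal sketch of one way to do that. The helper name `to_metric` is hypothetical and is not part of `DataPreprocessor`; it also assumes that only the four units seen in the preview occur.

```python
import re

# Conversion factors to metric units (both are exact by definition).
CM_PER_INCH = 2.54
KG_PER_LB = 0.45359237

# A number followed by one of the four units seen in the preview.
_MEASURE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(in|cm|lb|kg)\s*$")


def to_metric(value):
    """Convert '67 in' or '195 lb' to cm or kg; return None if unparseable."""
    if not isinstance(value, str):
        return None  # missing values arrive as float NaN, not as strings
    match = _MEASURE.match(value)
    if match is None:
        return None
    number, unit = float(match.group(1)), match.group(2)
    if unit == "in":
        return number * CM_PER_INCH
    if unit == "lb":
        return number * KG_PER_LB
    return number  # already in cm or kg


for raw in ["67 in", "178 cm", "195 lb", "81 kg"]:
    print(raw, "->", to_metric(raw))
```

Applied to a column, this could look like `df_athinfo["19.height"].map(to_metric)`. Returning `None` for unexpected input keeps unknown units visible as missing values instead of silently guessing.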

In [7]:
# Count the missing values in each column of the 'athlete_info'
# dataframe
missing_values = df_athinfo.isnull().sum()
print(f"\nMissing values in 'athlete_info' dataframe:\n{missing_values}")


Missing values in 'athlete_info' dataframe:
19.name      374442
19.age       374442
19.height    444071
19.weight    435134
20.name      436130
20.age       436130
20.height    483119
20.weight    478453
23.name      267776
23.age       267773
23.height    423118
23.weight    425493
21.name      432540
21.age       432540
21.height    487890
21.weight    483503
22.name      415189
22.age       415189
22.height    484873
22.weight    481002
dtype: int64
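Raw counts are easier to compare as a share of the 570004 rows. The sketch below does this for a few of the counts printed above and then averages the shares per field, assuming every column name follows the `<year>.<field>` pattern seen in the output. The variable names are illustrative only.

```python
from collections import defaultdict

TOTAL_ROWS = 570004  # rows in the athlete_info frame, from the info() output

# A few of the missing-value counts printed above.
missing_counts = {
    "19.height": 444071,
    "19.weight": 435134,
    "23.height": 423118,
    "23.weight": 425493,
}

# Share of rows missing in each column.
missing_share = {col: n / TOTAL_ROWS for col, n in missing_counts.items()}

# Average share per field across years; names look like '<year>.<field>'.
by_field = defaultdict(list)
for col, share in missing_share.items():
    _year, field = col.split(".", 1)
    by_field[field].append(share)
field_share = {field: sum(v) / len(v) for field, v in by_field.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
for field, share in field_share.items():
    print(f"{field} (mean over years): {share:.1%} missing")
```

On the real frame, `df_athinfo.isnull().mean()` returns the same per-column shares directly.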


## Analysis

### Sampling

In [None]:
# To test the pipeline faster, sample the data. Remove this line to use all the data.
data['open_results'] = data['open_results'].sample(10000)
# Print the data to see what we have
data['open_results'].head()

id,23.1,23.2A,23.2B,23.3,19.1,19.2,19.3,19.4,19.5,22.1,...,22.3,20.1,20.2,20.3,20.4,20.5,21.1,21.2,21.3,21.4
105191,180 reps - s,106 reps - s,200 lbs - s,128 reps - s,,101 reps,91 reps,66 reps,,,...,,,,,,,,,,
1176459,,,,,285 reps,167 reps,,,,,...,,,,,,,,,,
1577867,,,,,,75 reps,,72 reps - s,87 reps - s,,...,,,,,,,,,,
706816,,,,,229 reps,104 reps,98 reps,83 reps,128 reps,,...,,,,,,,,,,
1174117,,,,,261 reps,167 reps,96 reps,67 reps,138 reps,,...,,,,,,,,,,
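The `open_results` scores are strings in several shapes: rep counts (`298 reps`), loads (`312 lbs`), times (`8:01`), and values with a trailing `- s`. The sketch below shows one way to turn them into numbers. It assumes the `- s` suffix marks a scaled result, which this notebook does not state, and the helper name `parse_score` is hypothetical rather than part of the pipeline.

```python
import re

# Either 'M:SS' or '<number> reps|lbs', optionally followed by '- s'.
_SCORE = re.compile(
    r"^\s*(?:(?P<min>\d+):(?P<sec>\d{2})|(?P<num>\d+)\s*(?P<unit>reps|lbs))"
    r"(?P<scaled>\s*-\s*s)?\s*$"
)


def parse_score(raw):
    """Return (kind, value, scaled) for a raw score, or None if unparseable."""
    if not isinstance(raw, str):
        return None  # missing scores arrive as float NaN
    m = _SCORE.match(raw)
    if m is None:
        return None
    scaled = m.group("scaled") is not None  # assumed meaning of '- s'
    if m.group("min") is not None:
        seconds = int(m.group("min")) * 60 + int(m.group("sec"))
        return ("time_s", seconds, scaled)
    return (m.group("unit"), int(m.group("num")), scaled)


for raw in ["298 reps", "8:01", "312 lbs", "180 reps - s"]:
    print(raw, "->", parse_score(raw))
```

Keeping the kind alongside the value matters because a single workout column can mix times and rep counts, as the `19.2` column does in the preview above.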


### Preprocessing the Data

In [None]:
preprocessing_config = {
    "athlete_info": {},
    "benchmark_stats": {"remove_outliers": True, "missing_method": "zero"},
    "open_results": {
        "create_description_embeddings": False,
    },
    # 'workout_descriptions':{},
}
preprocessor = DataPreprocessor(config=preprocessing_config)
preprocessed_data = preprocessor.transform(data=data)
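
This notebook does not define what `remove_outliers` and `missing_method: "zero"` do inside `DataPreprocessor`. As a rough illustration only, the sketch below shows one plausible reading for a single numeric column: values outside an interquartile-range fence are treated as missing, and missing values are then replaced with zero. The helper `clean_column`, the IQR rule, and the factor 1.5 are all assumptions, not the project's implementation.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fence at k * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_column(values, remove_outliers=True, missing_method="zero"):
    """Hypothetical cleaning step; None stands in for a missing value."""
    present = [v for v in values if v is not None]
    if remove_outliers and len(present) >= 2:
        lower, upper = iqr_bounds(present)
        # Treat values outside the fence as missing.
        values = [v if v is None or lower <= v <= upper else None
                  for v in values]
    if missing_method == "zero":
        values = [0.0 if v is None else v for v in values]
    return values


# Made-up lift values with one missing entry and one implausible entry.
lifts = [445.0, 475.0, 500.0, None, 515.0, 545.0, 555.0, 567.0, 615.0, 5000.0]
print(clean_column(lifts))
```

Zero-filling makes a missing lift indistinguishable from a real value of zero, so the choice of `missing_method` is worth revisiting once the model is evaluated.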

### Modelling

In [None]:
rf_modeler = RandomForestModel(n_estimators=100)
rf_modeler.fit(**preprocessed_data)

rf_modeler.show_results()