# Raw Data Description

In [1]:
# import relevant libraries
import pandas as pd
import warnings

In [2]:
# set options
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

There are two datasets for the CrossFit Open 2019. Both datasets are provided by [kaggle](https://www.kaggle.com/jeanmidev/crossfit-games). The data originates from the official Reebok CrossFit website. The dataset "2019_opens_athletes.csv" contains information on all athletes who participated in the Open 2019, and the dataset "2019_opens_scores.csv" contains the results of the five competition workouts.

The third dataset was created by web scraping the official [CrossFit](https://games.crossfit.com) website and contains the athletes' benchmark statistics, where provided.

In [3]:
# read the two datasets of 2019 - Open Athletes & Open Scores
df_19_ath = pd.read_csv('./data/2019_opens_athletes.csv')
df_19_sco = pd.read_csv('./data/2019_opens_scores.csv')

# read the scraped dataset - Athlete's Benchmark Statistics
df_19_bs = pd.read_csv('./data/2019_opens_bs.csv')

---

## Athlete Dataset

### Dataset Information

In [4]:
df_19_ath.shape

(572653, 19)

In [5]:
df_19_ath.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572653 entries, 0 to 572652
Data columns (total 19 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   competitorid         572653 non-null  int64  
 1   competitorname       572653 non-null  object 
 2   firstname            572653 non-null  object 
 3   lastname             572653 non-null  object 
 4   postcompstatus       302 non-null     object 
 5   gender               572653 non-null  object 
 6   profilepics3key      572653 non-null  object 
 7   countryoforigincode  572348 non-null  object 
 8   countryoforiginname  572653 non-null  object 
 9   divisionid           572653 non-null  int64  
 10  affiliateid          572653 non-null  int64  
 11  affiliatename        540821 non-null  object 
 12  age                  572653 non-null  int64  
 13  height               304823 non-null  float64
 14  weight               323754 non-null  float64
 15  overallrank      

* 572,653 entries in the dataset
* 19 features: 8 numeric and 11 categorical
* 10.3 % missing cells
* no duplicate rows
* memory size 380 MB
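
Figures like the ones in the list above can be recomputed directly with pandas. The following is a minimal sketch on a toy frame; the helper name `summarize` is an assumption for illustration, and the same call could be applied to `df_19_ath`.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return basic quality figures for a DataFrame (hypothetical helper)."""
    n_cells = df.shape[0] * df.shape[1]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # share of cells that pandas recognizes as missing (NaN/None)
        "missing_pct": round(100 * df.isna().sum().sum() / n_cells, 1) if n_cells else 0.0,
        # number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # deep=True also counts the contents of object (string) columns
        "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
    }


# Toy frame standing in for df_19_ath
toy = pd.DataFrame({"age": [37, 36, 35, 35], "height": [1.70, None, 1.57, None]})
stats = summarize(toy)
print(stats)
```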

### Feature Explanation

In [6]:
df_19_ath.head()

Unnamed: 0,competitorid,competitorname,firstname,lastname,postcompstatus,gender,profilepics3key,countryoforigincode,countryoforiginname,divisionid,affiliateid,affiliatename,age,height,weight,overallrank,overallscore,is_scaled,division
0,2536,Samantha Briggs,Samantha,Briggs,accepted,F,0e63d-P2536_14-184.jpg,GB,United Kingdom,19,4098,CrossFit Black Five,37,1.7,61.23,1,33,0,Women (35-39)
1,485089,Renata Pimentel,Renata,Pimentel,accepted,F,04e97-P485089_5-184.jpg,BR,Brazil,19,15868,CrossFit Gurkha,36,1.74,73.0,2,66,0,Women (35-39)
2,16973,Carleen Mathews,Carleen,Mathews,,F,b663a-P16973_6-184.jpg,US,United States,19,10471,CrossFit Saint Helens,35,1.57,62.14,3,101,0,Women (35-39)
3,751083,Danila Capaccetti,Danila,Capaccetti,,F,pukie.png,IT,Italy,19,9329,CrossFit Black Shark,35,1.7,71.0,4,139,0,Women (35-39)
4,313257,Hope Cicero,Hope,Cicero,,F,f204b-P313257_1-184.jpg,US,United States,19,438,CrossFit Billings,36,1.55,61.23,5,176,0,Women (35-39)


* competitorid: each athlete is identified by the unique id (also available in the other datasets)
* competitorname: the full name of an athlete containing both first name and last name
* firstname: the first name of an athlete
* lastname: the last name of an athlete
* postcompstatus: (meaning not clarified)
* gender: gender of an athlete (male, female)
* profilepics3key: name of the picture in athlete profile
* countryoforigincode: the abbreviated code of an athlete's country of origin
* countryoforiginname: full name of an athlete's country of origin
* divisionid: numeric id of the division (redundant with division)
* affiliateid: the unique id of an athlete's affiliate
* affiliatename: the official name of an athlete's affiliate
* age: the age of an athlete
* height: the height of an athlete (in meters)
* weight: the weight of an athlete (in kg)
* overallrank: the rank of an athlete considering all five competition workouts
* overallscore: the score an athlete has reached after all five workouts
* is_scaled: displays if an athlete belongs to a scaling division (overall)
* division: the division (grouped by age) an athlete belongs to

### Neglected Features

* competitorname: contains the information of firstname and lastname
* postcompstatus: almost all values are missing (only 302 non-null entries)
* profilepics3key: just the name of a picture
* countryoforiginname: contains same information as countryoforigincode
* divisionid: information on age-grouped divisions is already in division
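
Dropping the features listed above could look like the following sketch. The list name `ATH_DROP` and the toy frame are assumptions for illustration; on the real data the same `drop` call would be applied to `df_19_ath`.

```python
import pandas as pd

# Columns listed above as neglected (spelled as in the dataset)
ATH_DROP = ["competitorname", "postcompstatus", "profilepics3key",
            "countryoforiginname", "divisionid"]

# Toy frame with a subset of the athlete columns
toy = pd.DataFrame({
    "competitorid": [2536],
    "competitorname": ["Samantha Briggs"],
    "firstname": ["Samantha"],
    "lastname": ["Briggs"],
    "divisionid": [19],
    "division": ["Women (35-39)"],
})

# errors="ignore" skips names that are absent from the frame
reduced = toy.drop(columns=ATH_DROP, errors="ignore")
print(list(reduced.columns))
```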

---

## Score Dataset

### Dataset Information

In [7]:
df_19_sco.shape

(2863265, 13)

In [8]:
df_19_sco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2863265 entries, 0 to 2863264
Data columns (total 13 columns):
 #   Column           Dtype  
---  ------           -----  
 0   affiliate        object 
 1   breakdown        object 
 2   competitorid     int64  
 3   division         object 
 4   is_scaled        int64  
 5   judge            object 
 6   ordinal          int64  
 7   rank             int64  
 8   scaled           int64  
 9   score            int64  
 10  scoredisplay     object 
 11  scoreidentifier  object 
 12  time             float64
dtypes: float64(1), int64(6), object(6)
memory usage: 284.0+ MB


* 572,653 athletes from the athlete dataset with 5 competition workouts each result in 2,863,265 score observations
* 13 features, 9 categorical & 4 numerical
* 14.7 % missing cells
* no duplicate rows
* memory size 1.2 GB
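
The "5 workouts per athlete" relationship can be checked rather than assumed. The sketch below uses a toy frame in place of `df_19_sco`; it also compares shallow and deep memory usage, since `info()` reports the shallow figure by default (marked with a "+"), which may explain why it is smaller than the memory size listed above.

```python
import pandas as pd

# Toy frame standing in for df_19_sco: two athletes, five workouts each
toy = pd.DataFrame({
    "competitorid": [96511] * 5 + [2536] * 5,
    "ordinal": [1, 2, 3, 4, 5] * 2,
    "division": ["Men (45-49)"] * 5 + ["Women (35-39)"] * 5,
})

# Each athlete should have exactly five distinct workout numbers
workouts_per_athlete = toy.groupby("competitorid")["ordinal"].nunique()
all_five = bool((workouts_per_athlete == 5).all())
rows_match = len(toy) == 5 * toy["competitorid"].nunique()

# Shallow usage ignores string contents; deep usage includes them
shallow_bytes = int(toy.memory_usage().sum())
deep_bytes = int(toy.memory_usage(deep=True).sum())
print(all_five, rows_match, shallow_bytes, deep_bytes)
```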

### Feature Explanation

In [9]:
df_19_sco.head()

Unnamed: 0,affiliate,breakdown,competitorid,division,is_scaled,judge,ordinal,rank,scaled,score,scoredisplay,scoreidentifier,time
0,CrossFit RDU,9 rounds +\n10 wall-ball shots\n,96511,Men (45-49),0,Erin Miller,1,1,0,13520000,352 reps,27f30f9a8c0a564ae799,
1,CrossFit RDU,Within 16 minutes:\n3 rounds +\n25 toes-to-bar...,96511,Men (45-49),0,Harper Thorsen,2,4,0,13420368,342 reps,0ed3d1264f25a8f1890d,
2,CrossFit RDU,200-ft. OH lunge\n50 box step-ups\n50 strict H...,96511,Men (45-49),0,Harper Thorsen,3,1,0,11800018,9:42,f2a143399a330c95321b,582.0
3,CrossFit RDU,132 reps\n6 rounds,96511,Men (45-49),0,Harper Thorsen,4,36,0,11320009,11:51,89101e401c6c85997363,711.0
4,CrossFit RDU,210 reps,96511,Men (45-49),0,Harper Thorsen,5,1,0,12100573,10:27,f7588c9174f1fe90f5c4,627.0


* affiliate: the official name of an athlete's affiliate
* breakdown: describes the completed workout; it shows the number of rounds or reps of every exercise and, in addition, the tiebreak time
* competitorid: the unique id of an athlete (also available in the other datasets)
* division: the division (grouped by age) an athlete belongs to, same as in the other datasets
* is_scaled: displays if an athlete belongs to a scaling division (overall), same as in athlete dataset
* judge: the person who judged the performance of an athlete for each workout
* ordinal: describes the workout number (1-5)
* rank: shows the rank of an athlete regarding one specific workout
* scaled: indicates whether the athlete performed a scaled version of a workout
* score: the score an athlete reached regarding one workout
* scoredisplay: shows the amount of total reps of a workout or the time when finished earlier than timecap
* scoreidentifier: (meaning not clarified)
* time: if a workout was completed before timecap, time feature shows the time in seconds
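
In the rows shown above, the time feature matches scoredisplay converted to seconds (for example 9:42 and 582.0). A minimal sketch of that conversion follows; the function name `display_to_seconds` is an assumption for illustration and is not part of the dataset or of pandas.

```python
def display_to_seconds(scoredisplay: str):
    """Convert an 'mm:ss' display to seconds; return None for rep-based displays."""
    parts = scoredisplay.strip().split(":")
    # Anything that is not exactly 'digits:digits' (e.g. '352 reps') is not a time
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    minutes, seconds = int(parts[0]), int(parts[1])
    return float(minutes * 60 + seconds)


for display in ["9:42", "11:51", "10:27", "352 reps"]:
    print(display, display_to_seconds(display))
```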

### Neglected Features

* judge: name of the judge is not interesting
* scoreidentifier: seems to contain useless information

---

## Benchmark Statistics Dataset

### Dataset Information

In [10]:
df_19_bs.shape

(338271, 21)

In [11]:
df_19_bs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 338271 entries, 0 to 338270
Data columns (total 21 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Unnamed: 0       338271 non-null  int64 
 1   fullname         338271 non-null  object
 2   countryoforigin  101784 non-null  object
 3   competitorid     338271 non-null  int64 
 4   division         338271 non-null  object
 5   age              338271 non-null  int64 
 6   height           338271 non-null  object
 7   weight           338271 non-null  object
 8   affiliate        338271 non-null  object
 9   bs_backsquat     338271 non-null  object
 10  bs_cleanandjerk  338271 non-null  object
 11  bs_snatch        338271 non-null  object
 12  bs_deadlift      338271 non-null  object
 13  bs_fightgonebad  338271 non-null  object
 14  bs_maxpull_ups   338271 non-null  object
 15  bs_fran          338271 non-null  object
 16  bs_grace         338271 non-null  object
 17  bs_helen  

* 338,271 entries in the dataset (for each given athlete in the database)
* 20 features: 2 numerical and 18 categorical
* most features given as strings
* almost no missing values reported (placeholder strings such as "--" are not identified as missing data)
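
To make the hidden missing values visible, the "--" placeholder seen in the preview below can be converted to NaN. This is a sketch on a toy frame; it assumes "--" is the only placeholder in use, which has not been verified for the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_19_bs
toy = pd.DataFrame({
    "competitorid": [92, 1617],
    "affiliate": ["CrossFitSoCal", "--"],
    "bs_fightgonebad": ["--", "--"],
    "bs_fran": ["4:22", "2:26"],
})

# Placeholders are ordinary strings, so nothing counts as missing yet
missing_before = int(toy.isna().sum().sum())

# Replace the placeholder with NaN so pandas treats it as missing
cleaned = toy.replace("--", np.nan)
missing_after = int(cleaned.isna().sum().sum())
print(missing_before, missing_after)
```

Alternatively, the placeholder can be handled at load time by passing `na_values=["--"]` to `pd.read_csv`.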

### Feature Explanation

In [12]:
df_19_bs.head()

Unnamed: 0.1,Unnamed: 0,fullname,countryoforigin,competitorid,division,age,height,weight,affiliate,bs_backsquat,bs_cleanandjerk,bs_snatch,bs_deadlift,bs_fightgonebad,bs_maxpull_ups,bs_fran,bs_grace,bs_helen,bs_filthy50,bs_sprint400m,bs_run5k
0,0,Justin Bergh,UnitedStates,86,Men(35-39),39,"6'5""",225lb,CrossFitSS(SouthSide),335lb,265lb,210lb,415lb,393,32,3:46,2:54,8:16,21:27,1:04,22:05
1,1,Cary Hair,UnitedStates,88,Men(35-39),36,"6'0""",191lb,CrossFitRoots,455lb,315lb,265lb,518lb,407,54,2:20,1:49,7:28,18:23,0:54,21:34
2,2,Tim Chan,UnitedStates,92,Men(50-54),51,"5'6""",161lb,CrossFitSoCal,225lb,205lb,155lb,305lb,--,35,4:22,4:41,10:45,27:41,--,--
3,3,Leif Edmundson,UnitedStates,93,Men(35-39),38,"6'0""",205lb,CrossFitHomeOfficeScottsValley,305lb,235lb,185lb,355lb,403,45,3:40,2:34,8:13,25:00,1:02,20:39
4,4,John Mclaughlin,UnitedStates,1617,Men(50-54),53,"5'10""",187lb,--,355lb,255lb,195lb,435lb,--,50,2:26,2:34,7:24,--,--,--


* fullname: first name and surname of the athlete
* countryoforigin: the athlete's country of origin
* competitorid: each athlete is identified by the unique id (also available in the other datasets)
* division: the division (grouped by age) an athlete belongs to, same as in the other datasets
* age: the age of an athlete
* height: the height of an athlete, given in feet and inches or in cm
* weight: the weight of an athlete, given in lb or kg
* affiliate: the official name of an athlete's affiliate
* benchmarks:
  - bs_backsquat: weight for 1-repetition-maximum (1RM) of Backsquat
  - bs_cleanandjerk: weight for 1RM of Clean&Jerk
  - bs_snatch: weight for 1RM of Snatch
  - bs_deadlift: weight for 1RM of Deadlift
  - bs_fightgonebad: total number of repetitions for workout "Fight Gone Bad", consisting of 3 rounds of: 1min Wallball Shots (20/14lb) - 1min Sumo Deadlift High Pulls (75/55lb) - 1min Boxjumps (20in) - 1min Push Press (75/55lb) - 1min Rowing (calories) - 1min Rest
  - bs_maxpull_ups: number of Pull-Ups in a row
  - bs_fran: time for completing "Fran"-workout, consisting of 21-15-9 repetitions of Thrusters (95/65lb) and Pull-Ups
  - bs_grace: time for completing "Grace"-workout, consisting of 30 Clean&Jerks (135/95lb)
  - bs_helen: time for completing "Helen"-workout, consisting of 3 rounds of: 400m Run - 21 Kettlebell-Swings (1.5/1pood) - 12 Pull-Ups
  - bs_filthy50: time for completing "Filthy Fifty"-workout, consisting of: 50 Box Jumps (24/20in) - 50 Jumping Pull-Ups - 50 KB-Swings (1/0.75pood) - 50 Walking Lunge Steps - 50 Knees-to-Elbows - 50 Push Press (45/35lb) - 50 Back Extensions - 50 Wallball Shots (20/14lb) - 50 Burpees - 50 Double Unders
  - bs_sprint400m: time for running 400m
  - bs_run5k: time for running 5km
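
Because height and weight are stored as strings in mixed units, they need parsing before numeric use. The sketch below handles the formats visible in the preview (`6'5"`, `225lb`); the `cm` and `kg` spellings are assumptions, since no such rows are shown here, and the function names are hypothetical.

```python
import re

LB_TO_KG = 0.45359237   # exact definition of the pound in kg
INCH_TO_M = 0.0254      # exact definition of the inch in m


def height_to_m(text):
    """Parse heights like 6'5" or 196cm into meters; None if unparseable."""
    text = text.strip()
    m = re.fullmatch(r"(\d+)'(\d+)\"", text)
    if m:
        return (int(m.group(1)) * 12 + int(m.group(2))) * INCH_TO_M
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)  # assumed metric format
    if m:
        return float(m.group(1)) / 100
    return None


def weight_to_kg(text):
    """Parse weights like 225lb or 85kg into kg; None if unparseable."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(lb|kg)", text.strip())
    if not m:
        return None
    value = float(m.group(1))
    return value * LB_TO_KG if m.group(2) == "lb" else value


print(height_to_m("6'5\""), weight_to_kg("225lb"), height_to_m("--"))
```

The same `weight_to_kg` function could be applied to the 1RM columns (bs_backsquat, bs_cleanandjerk, bs_snatch, bs_deadlift), which use the same `lb` suffix in the preview.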

### Neglected Features

Since only the benchmark statistics are of interest, the other features can be dropped. The competitor id is kept so that each benchmark record can be assigned to its athlete.
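
A minimal sketch of this selection, assuming all benchmark columns share the `bs_` prefix as in the preview above. The toy frames and the left join onto an athlete table are illustrative, not the notebook's actual next step.

```python
import pandas as pd

# Toy frame standing in for df_19_bs
toy_bs = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "fullname": ["Justin Bergh", "Cary Hair"],
    "competitorid": [86, 88],
    "age": [39, 36],
    "bs_backsquat": ["335lb", "455lb"],
    "bs_fran": ["3:46", "2:20"],
})

# Keep the id plus every benchmark column
keep = ["competitorid"] + [c for c in toy_bs.columns if c.startswith("bs_")]
bench = toy_bs[keep]

# Toy athlete frame; competitorid 99 has no benchmark record
toy_ath = pd.DataFrame({"competitorid": [86, 99], "gender": ["M", "F"]})

# A left join keeps all athletes and leaves benchmarks empty where none exist
merged = toy_ath.merge(bench, on="competitorid", how="left")
print(merged)
```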