# Raw Data Description

In [1]:
# import relevant libraries

import pandas as pd
import seaborn as sns

import warnings

In [2]:
# set options

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

There are two datasets for the CrossFit Open 2019, both provided by [Kaggle](https://www.kaggle.com/jeanmidev/crossfit-games). The data originates from the official Reebok CrossFit website. The dataset "2019_opens_athletes.csv" contains information on all athletes who participated in the Open 2019, and the dataset "2019_opens_scores.csv" contains the results of the five competition workouts.

In [3]:
# read the two datasets of 2019 - Open Athletes & Open Scores

df_19_ath = pd.read_csv('./data/2019_opens_athletes.csv')
df_19_sco = pd.read_csv('./data/2019_opens_scores.csv')

---

## Athlete Dataset

### dataset information

In [4]:
df_19_ath.head()

Unnamed: 0,competitorid,competitorname,firstname,lastname,postcompstatus,gender,profilepics3key,countryoforigincode,countryoforiginname,divisionid,affiliateid,affiliatename,age,height,weight,overallrank,overallscore,is_scaled,division
0,2536,Samantha Briggs,Samantha,Briggs,accepted,F,0e63d-P2536_14-184.jpg,GB,United Kingdom,19,4098,CrossFit Black Five,37,1.7,61.23,1,33,0,Women (35-39)
1,485089,Renata Pimentel,Renata,Pimentel,accepted,F,04e97-P485089_5-184.jpg,BR,Brazil,19,15868,CrossFit Gurkha,36,1.74,73.0,2,66,0,Women (35-39)
2,16973,Carleen Mathews,Carleen,Mathews,,F,b663a-P16973_6-184.jpg,US,United States,19,10471,CrossFit Saint Helens,35,1.57,62.14,3,101,0,Women (35-39)
3,751083,Danila Capaccetti,Danila,Capaccetti,,F,pukie.png,IT,Italy,19,9329,CrossFit Black Shark,35,1.7,71.0,4,139,0,Women (35-39)
4,313257,Hope Cicero,Hope,Cicero,,F,f204b-P313257_1-184.jpg,US,United States,19,438,CrossFit Billings,36,1.55,61.23,5,176,0,Women (35-39)


In [5]:
df_19_ath.shape

(572653, 19)

In [6]:
df_19_ath.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572653 entries, 0 to 572652
Data columns (total 19 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   competitorid         572653 non-null  int64  
 1   competitorname       572653 non-null  object 
 2   firstname            572653 non-null  object 
 3   lastname             572653 non-null  object 
 4   postcompstatus       302 non-null     object 
 5   gender               572653 non-null  object 
 6   profilepics3key      572653 non-null  object 
 7   countryoforigincode  572348 non-null  object 
 8   countryoforiginname  572653 non-null  object 
 9   divisionid           572653 non-null  int64  
 10  affiliateid          572653 non-null  int64  
 11  affiliatename        540821 non-null  object 
 12  age                  572653 non-null  int64  
 13  height               304823 non-null  float64
 14  weight               323754 non-null  float64
 15  overallrank      

In [11]:
df_19_ath.describe()

Unnamed: 0,competitorid,divisionid,affiliateid,age,height,weight,overallrank,overallscore,is_scaled
count,572653.0,572653.0,572653.0,572653.0,304823.0,323754.0,572653.0,572653.0,572653.0
mean,1109414.0,8.567262,10433.175534,37.164978,1.743674,78.343412,56143.535611,269979.915326,0.115452
std,501751.3,7.082857,6943.016568,9.012462,1.439308,44.720039,55319.364617,245739.184643,0.319567
min,86.0,1.0,0.0,16.0,0.01,-9054.15,1.0,9.0,0.0
25%,738900.0,2.0,4079.0,31.0,1.67,67.0,9055.0,49581.0,0.0
50%,1235059.0,5.0,10484.0,37.0,1.75,78.93,31697.0,175493.0,0.0
75%,1588160.0,18.0,16626.0,43.0,1.8,88.45,99381.0,480742.0,0.0
max,1715660.0,19.0,23323.0,125.0,266.24,16960.0,185551.0,792781.0,1.0


* 572,653 entries in the dataset
* 19 features: 8 numeric and 11 categorical
* 10.3 % missing cells
* no duplicate rows
* memory size 380 MB
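The overview figures above (row and column counts, share of missing cells, duplicate rows, memory size) can be recomputed with plain pandas. The following is a minimal sketch on a toy frame; `toy_ath` and `summarize` are illustrative names that are not part of the notebook, and the toy values are not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_19_ath (illustrative values only)
toy_ath = pd.DataFrame({
    "competitorid": [2536, 485089, 16973, 751083],
    "postcompstatus": ["accepted", "accepted", None, None],
    "height": [1.70, 1.74, np.nan, 1.70],
    "weight": [61.23, 73.0, 62.14, np.nan],
})

def summarize(df: pd.DataFrame) -> dict:
    """Return basic overview figures for a DataFrame."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        # share of empty cells over the whole table, in percent
        "missing_cells_pct": round(100 * df.isna().sum().sum() / df.size, 1),
        # rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # deep=True also counts the contents of string columns
        "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
    }

summary = summarize(toy_ath)
print(summary)
```

Calling `summarize(df_19_ath)` on the real frame should give comparable figures, although the memory value depends on the pandas version and on whether `deep=True` is used.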

### feature explanation

* competitorid: the unique id that identifies each athlete (also present in the score dataset)
* competitorname: the full name of an athlete containing both first name and last name
* firstname: the first name of an athlete
* lastname: the last name of an athlete
* postcompstatus: meaning unclear; only 302 entries are filled (e.g. "accepted")
* gender: gender of an athlete (male, female)
* profilepics3key: name of the picture in athlete profile
* countryoforigincode: the abbreviated code of an athlete's country of origin
* countryoforiginname: full name of an athlete's country of origin
* divisionid: the numeric id of the division an athlete belongs to (see division)
* affiliateid: the unique id of an athlete's affiliate
* affiliatename: the official name of an athlete's affiliate
* age: the age of an athlete
* height: the height of an athlete (in meters)
* weight: the weight of an athlete (in kg)
* overallrank: the rank of an athlete considering all five competition workouts
* overallscore: the score an athlete has reached after all five workouts
* is_scaled: indicates whether an athlete belongs to a scaled division (overall)
* division: the division (grouped by age) an athlete belongs to

### neglected features

* competitorname: contains the information of firstname and lastname
* postcompstatus: almost all values are missing (only 302 of 572,653 are non-null)
* profilepics3key: just the name of a picture
* countryoforiginname: contains same information as countryoforigincode
* divisionid: information on the age-grouped divisions is already contained in division
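As a rough sketch of how the neglected features could be checked for redundancy and then removed, the snippet below uses a toy frame built from the first two rows of `df_19_ath.head()`. The names `toy_ath`, `ATH_DROP` and `toy_ath_reduced` are hypothetical, and the column names are spelled as they appear in the dataset.

```python
import pandas as pd

# Columns to drop, spelled as in the dataset header
ATH_DROP = [
    "competitorname",
    "postcompstatus",
    "profilepics3key",
    "countryoforiginname",
    "divisionid",
]

# Toy stand-in for df_19_ath, taken from the first two rows shown by head()
toy_ath = pd.DataFrame({
    "competitorid": [2536, 485089],
    "competitorname": ["Samantha Briggs", "Renata Pimentel"],
    "firstname": ["Samantha", "Renata"],
    "lastname": ["Briggs", "Pimentel"],
    "postcompstatus": ["accepted", "accepted"],
    "profilepics3key": ["0e63d-P2536_14-184.jpg", "04e97-P485089_5-184.jpg"],
    "countryoforigincode": ["GB", "BR"],
    "countryoforiginname": ["United Kingdom", "Brazil"],
    "divisionid": [19, 19],
    "division": ["Women (35-39)", "Women (35-39)"],
})

# Check: competitorname is just firstname + " " + lastname
name_is_redundant = (
    toy_ath["competitorname"] == toy_ath["firstname"] + " " + toy_ath["lastname"]
).all()

# Check: each country code maps to at most one country name
code_maps_to_one_name = (
    toy_ath.groupby("countryoforigincode")["countryoforiginname"].nunique().le(1).all()
)

# Drop the neglected columns; the original frame is left untouched
toy_ath_reduced = toy_ath.drop(columns=ATH_DROP)
print(list(toy_ath_reduced.columns))
```

Whether both checks also hold on the full dataset is not shown here; they are worth running on `df_19_ath` before dropping anything.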

---

## Score Dataset

### dataset information

In [8]:
df_19_sco.head()

Unnamed: 0,affiliate,breakdown,competitorid,division,is_scaled,judge,ordinal,rank,scaled,score,scoredisplay,scoreidentifier,time
0,CrossFit RDU,9 rounds +\n10 wall-ball shots\n,96511,Men (45-49),0,Erin Miller,1,1,0,13520000,352 reps,27f30f9a8c0a564ae799,
1,CrossFit RDU,Within 16 minutes:\n3 rounds +\n25 toes-to-bar...,96511,Men (45-49),0,Harper Thorsen,2,4,0,13420368,342 reps,0ed3d1264f25a8f1890d,
2,CrossFit RDU,200-ft. OH lunge\n50 box step-ups\n50 strict H...,96511,Men (45-49),0,Harper Thorsen,3,1,0,11800018,9:42,f2a143399a330c95321b,582.0
3,CrossFit RDU,132 reps\n6 rounds,96511,Men (45-49),0,Harper Thorsen,4,36,0,11320009,11:51,89101e401c6c85997363,711.0
4,CrossFit RDU,210 reps,96511,Men (45-49),0,Harper Thorsen,5,1,0,12100573,10:27,f7588c9174f1fe90f5c4,627.0


In [9]:
df_19_sco.shape

(2863265, 13)

In [10]:
df_19_sco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2863265 entries, 0 to 2863264
Data columns (total 13 columns):
 #   Column           Dtype  
---  ------           -----  
 0   affiliate        object 
 1   breakdown        object 
 2   competitorid     int64  
 3   division         object 
 4   is_scaled        int64  
 5   judge            object 
 6   ordinal          int64  
 7   rank             int64  
 8   scaled           int64  
 9   score            int64  
 10  scoredisplay     object 
 11  scoreidentifier  object 
 12  time             float64
dtypes: float64(1), int64(6), object(6)
memory usage: 284.0+ MB


In [12]:
df_19_sco.describe()

Unnamed: 0,competitorid,is_scaled,ordinal,rank,scaled,score,time
count,2863265.0,2863265.0,2863265.0,2863265.0,2863265.0,2863265.0,194821.0
mean,1109414.0,0.1154521,3.0,53995.98,0.2754048,6642891.0,907.35063
std,501751.0,0.3195668,1.414214,52060.22,0.4467181,5278561.0,195.589391
min,86.0,0.0,1.0,1.0,0.0,0.0,63.0
25%,738900.0,0.0,2.0,8904.0,0.0,930220.0,719.0
50%,1235059.0,0.0,3.0,30652.0,0.0,10660270.0,931.0
75%,1588160.0,0.0,4.0,99204.0,1.0,11411110.0,1080.0
max,1715660.0,1.0,5.0,178392.0,1.0,14300340.0,1200.0


* 572,653 athletes from the athlete dataset with 5 competition workouts each result in 2,863,265 score observations
* 13 features, 9 categorical & 4 numerical
* 14.7 % missing cells
* no duplicate rows
* memory size 1.2 GB
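The relation between athletes and score rows (five workouts per athlete) can be verified with a group-by. This is a minimal sketch on toy data; `toy_sco` is a hypothetical stand-in for `df_19_sco` and contains only the two columns needed for the check.

```python
import pandas as pd

# Toy stand-in for df_19_sco: two athletes, five workouts each
toy_sco = pd.DataFrame({
    "competitorid": [96511] * 5 + [2536] * 5,
    "ordinal": [1, 2, 3, 4, 5] * 2,
})

# Number of score rows per athlete
rows_per_athlete = toy_sco.groupby("competitorid")["ordinal"].count()

# Every athlete should have workouts 1..5, each exactly once
has_all_workouts = (
    toy_sco.groupby("competitorid")["ordinal"]
    .apply(lambda s: sorted(s) == [1, 2, 3, 4, 5])
)

print(rows_per_athlete.eq(5).all(), has_all_workouts.all())

# The same arithmetic for the full dataset: athletes x workouts
expected_rows = 572_653 * 5
print(expected_rows)
```

On the real data, comparing `expected_rows` with `df_19_sco.shape[0]` gives the same consistency check in one line.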

### feature explanation

In [13]:
df_19_sco.head()

Unnamed: 0,affiliate,breakdown,competitorid,division,is_scaled,judge,ordinal,rank,scaled,score,scoredisplay,scoreidentifier,time
0,CrossFit RDU,9 rounds +\n10 wall-ball shots\n,96511,Men (45-49),0,Erin Miller,1,1,0,13520000,352 reps,27f30f9a8c0a564ae799,
1,CrossFit RDU,Within 16 minutes:\n3 rounds +\n25 toes-to-bar...,96511,Men (45-49),0,Harper Thorsen,2,4,0,13420368,342 reps,0ed3d1264f25a8f1890d,
2,CrossFit RDU,200-ft. OH lunge\n50 box step-ups\n50 strict H...,96511,Men (45-49),0,Harper Thorsen,3,1,0,11800018,9:42,f2a143399a330c95321b,582.0
3,CrossFit RDU,132 reps\n6 rounds,96511,Men (45-49),0,Harper Thorsen,4,36,0,11320009,11:51,89101e401c6c85997363,711.0
4,CrossFit RDU,210 reps,96511,Men (45-49),0,Harper Thorsen,5,1,0,12100573,10:27,f7588c9174f1fe90f5c4,627.0


* affiliate: the official name of an athlete's affiliate
* breakdown: contains information about the completed workout: the number of rounds or reps of each exercise and, in addition, the tiebreak time
* competitorid: the unique id of an athlete (also available in the athlete dataset)
* division: the division (grouped by age) an athlete belongs to, same as in athlete dataset
* is_scaled: indicates whether an athlete belongs to a scaled division (overall), same as in athlete dataset
* judge: the person who judged the performance of an athlete for each workout
* ordinal: describes the workout number (1-5)
* rank: shows the rank of an athlete regarding one specific workout
* scaled: indicates whether the athlete performed a scaled version of a workout
* score: the score an athlete reached regarding one workout
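* scoredisplay: shows the total number of reps for a workout, or the finishing time if the athlete finished before the time cap
* scoreidentifier: meaning unclear; appears to be a 20-character hexadecimal identifier string for each score
* time: if a workout was completed before the time cap, this feature shows the finishing time in seconds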
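In the sample rows, `scoredisplay` and `time` carry the same information for timed results: "9:42" corresponds to 582.0 seconds. The sketch below illustrates that relationship. The helper `display_to_seconds` is hypothetical and assumes the display format is either "mm:ss" or a rep count such as "352 reps"; other formats may exist in the full data.

```python
import pandas as pd

# Toy stand-in for df_19_sco, using values from the head() output
toy_sco = pd.DataFrame({
    "ordinal": [1, 3, 4, 5],
    "scoredisplay": ["352 reps", "9:42", "11:51", "10:27"],
    "time": [float("nan"), 582.0, 711.0, 627.0],
})

def display_to_seconds(display: str) -> float:
    """Convert 'mm:ss' to seconds; return NaN for rep-based scores."""
    if ":" not in display:
        return float("nan")
    minutes, seconds = display.split(":")
    return float(int(minutes) * 60 + int(seconds))

toy_sco["seconds_from_display"] = toy_sco["scoredisplay"].map(display_to_seconds)

# True if the parsed values match the time column (NaN positions included)
matches_time = toy_sco["seconds_from_display"].equals(toy_sco["time"])
print(toy_sco)
print(matches_time)
```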

### neglected features

* judge: the judge's name is not relevant for the analysis
* scoreidentifier: seems to contain no useful information
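One way to test the hunch about `scoreidentifier` before dropping it is to check whether every row has its own value; a column that is unique per row cannot group or explain anything. The snippet is a sketch on toy data taken from the sample rows, with `toy_sco` and `toy_sco_reduced` as hypothetical names.

```python
import pandas as pd

# Toy stand-in for df_19_sco, using values from the head() output
toy_sco = pd.DataFrame({
    "competitorid": [96511, 96511, 96511],
    "ordinal": [1, 2, 3],
    "judge": ["Erin Miller", "Harper Thorsen", "Harper Thorsen"],
    "scoreidentifier": [
        "27f30f9a8c0a564ae799",
        "0ed3d1264f25a8f1890d",
        "f2a143399a330c95321b",
    ],
    "score": [13520000, 13420368, 11800018],
})

# True if no identifier occurs more than once
identifier_is_row_unique = toy_sco["scoreidentifier"].is_unique

# Drop the neglected columns; the original frame is left untouched
toy_sco_reduced = toy_sco.drop(columns=["judge", "scoreidentifier"])
print(identifier_is_row_unique, list(toy_sco_reduced.columns))
```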