# CrossFit Open Part 1: Scraping the Data

## Introduction

The CrossFit Games is a once year worldwide fitness competition that aims to find the fittest man and woman on Earth. Though the process to qualify for the CrossFit Games has changed over the years, the Crossfit Open is generally considered the first step to qualify. The CrossFit Open consists of five workouts over five weeks, with each workout released weekly on Thursday and scores due on Monday. After five weeks, the top competitors move on to secondary stages to qualify, or in some years can directly qualify for the CrossFit Games out of the Open. What makes Open unique is that anybody can sign up to participate. There are different versions of eah workout offered for different age groups and skill levels to allow the competition to allow more people the ability to participate. All workouts are either judged or videoed and scores are entered on a worldwide leaderboard for everyone to see their placing.

The Open began in 2011, and I personally have participated each year since 2016. Though I am not an athlete at a CrossFit Games level, I enjoy seeing how my fitness improves from year to year with my Open workout scores as concrete data points that show me what skills I need to improve on to be a more competitive and better athlete. As the sport of CrossFit grows and the overall abilities and fitness levels of athletes rise, I thought it would be an interesting project to pull old Open leaderboard data and create a tool that helps any Crossfit Open participant analyze what they need to work on to improve their scores and their fitness.

## Part 1: Scraping the Data

This notebook is the first step in creating a CrossFit Open performance analyzer. In this notebook, I scrape down all the Open leaderbaord data that is available.

First, I imported the necessary libraries:

In [2]:
import urllib
import json
import os
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Set local path variable:

In [2]:
path=os.path.abspath(os.getcwd())+'\\'

This is a helper function that flattens some of the data that is originally stored as a list of lists.

In [3]:
def flatten_list(x):
    return [i for c in x for i in c]

## 2020 and 2019 CrossFit Open Data

I started with the CrossFit Open 2020 and 2019 leaderboards. The data is loaded in JSON format using AJAX calls on games.crossfit.com/competitions. I opened the developer window in Chrome, went to the Network then XHR tab and found the url that calls the data in. In the cell below I pulled in the first page of the 2020 leaderboard for the men using that url that I found. A preview of the data is also show below:

In [4]:
url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2019/leaderboards?view=0&division=1&scaled=0&sort=0'
content=requests.get(url)
data=json.loads(content.content)
data['leaderboardRows'][0]

{'entrant': {'competitorId': '153604',
  'competitorName': 'Mathew Fraser',
  'firstName': 'Mathew',
  'lastName': 'Fraser',
  'status': 'ACT',
  'postCompStatus': 'accepted',
  'gender': 'M',
  'profilePicS3key': '9e218-P153604_4-184.jpg',
  'countryOfOriginCode': 'US',
  'countryOfOriginName': 'United States',
  'divisionId': '1',
  'affiliateId': '3220',
  'affiliateName': 'CrossFit Mayhem',
  'age': '29',
  'height': '67 in',
  'weight': '195 lb'},
 'ui': {'highlight': False, 'countryChampion': True},
 'scores': [{'ordinal': 1,
   'rank': '59',
   'score': '13870000',
   'scoreDisplay': '387 reps',
   'mobileScoreDisplay': '',
   'scoreIdentifier': '59bc1278f2d9cff4667a',
   'scaled': '0',
   'video': '0',
   'breakdown': '10 rounds +\n7 wall-ball shots\n',
   'judge': 'Shane Orr',
   'affiliate': 'CrossFit Mayhem',
   'heat': '',
   'lane': ''},
  {'ordinal': 2,
   'rank': '3',
   'score': '14300212',
   'scoreDisplay': '16:28',
   'mobileScoreDisplay': '',
   'scoreIdentifier': '

After examining the JSON data, I used the cell below to help pull out column names I wanted to include in my final dataset. There were 5 separately scored workouts in the 2020 Open, so I added an underscore and the week number to have unique column names for each week.

In [5]:
#Athlete info
cols_1=list(data['leaderboardRows'][0]['entrant'].keys())

#National Champ flag
cols_2=list(data['leaderboardRows'][0]['ui'].keys())

#Fields for each workout
cols_3=list(data['leaderboardRows'][0]['scores'][0].keys())
cols_3.insert(8,'time')

#Mutiply by 5 for the 5 workouts and add "_"+week # to each column to create unique column headers
cols_3=cols_3*5
n=1
count=1
for i in range(0,len(cols_3)):
    cols_3[i]=cols_3[i]+"_"+str(n)
    count+=1
    if count==15:
        n+=1
        count=1
cols_4=['overallRank','overallScore']

#Combine all columns
cols=[cols_1,cols_2,cols_3,cols_4]

#Create master column list
cols=flatten_list(cols)

#Columns that go with the scoring data to be used as keys to help pull out the JSON values
score_cols=['ordinal',
'rank',
'score',
'scoreDisplay',
'mobileScoreDisplay',
'scoreIdentifier',
'scaled',
'video',
'breakdown',
'time',
'judge',
'affiliate',
'heat',
'lane']

After organizing the columns, I created a function that can pull the CrossFit Open leaderboard data. It takes a year, a gender (men=1 and women=2), a start page, and an end page. I looked at the number of pages the leaderboard had on the CrossFit Games website for a given year and gender to find the end page number. It also takes the list of the score columns that I found from examining the JSON data and the column names I created above. This function returns all of the CrossFit Open leaderboard data for a given year and gender in a data frame.

In [6]:
def scrape_open_data(year,gender,start_page,end_page,score_cols,cols):
    master_list=[]
    for p in range(start_page,end_page+1):
        if p == 1:
            url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/'+str(year)+'/leaderboards?view=0&division='+str(gender)+'&scaled=0&sort=0'
        else:
            url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/'+str(year)+'/leaderboards?view=0&division='+str(gender)+'&scaled=0&sort=0&page='+str(p)
        try:
            content=requests.get(url)
            data=json.loads(content.content)
            for i in data['leaderboardRows']:
                info = []
                info.append(list(i['entrant'].values()))
                info.append(list(i['ui'].values())[0:2])
                scores=dict.fromkeys(score_cols)
                for j in i['scores']:
                    for k in j.keys():
                        scores[k]=j[k]
                    info.append(list(scores.values()))
                info = flatten_list(info)
                info.append(i['overallRank'])
                info.append(i['overallScore'])
                master_list.append(info)
            print("Done with page:",p)
        except ValueError:
            print("Error on page:",p)
    df = pd.DataFrame(master_list,columns=cols)
    return df  

The below cells use the above function to pull the 2020 and 2019 data for men and women and stores them to a local csv.

In [None]:
df_men_2020=scrape_open_data(2020,1,1,2678,score_cols,cols)
df_men_2020.to_csv(path+"open_2020_men.csv",encoding="utf-8-sig",index=False)

In [24]:
df_men_2020

Unnamed: 0,competitorId,competitorName,firstName,lastName,status,postCompStatus,gender,profilePicS3key,countryOfOriginCode,countryOfOriginName,...,scaled_5,video_5,breakdown_5,time_5,judge_5,affiliate_5,heat_5,lane_5,overallRank,overallScore
0,158264,Patrick Vellner,Patrick,Vellner,ACT,accepted,M,d471c-P158264_7-184.jpg,CA,Canada,...,0,0,240 reps,609.0,Matt O'Keefe,CrossFit New England,,,1,64
1,153604,Mathew Fraser,Mathew,Fraser,ACT,accepted,M,9e218-P153604_4-184.jpg,US,United States,...,0,0,240 reps,645.0,Kelley Jackson,CrossFit Mayhem,,,2,74
2,514502,Lefteris Theofanidis,Lefteris,Theofanidis,ACT,accepted,M,931eb-P514502_2-184.jpg,GR,Greece,...,0,1,240 reps,671.0,,,,,3,94
3,81616,Björgvin Karl Guðmundsson,Björgvin Karl,Guðmundsson,ACT,accepted,M,4c5dc-P81616_4-184.jpg,IS,Iceland,...,0,0,240 reps,611.0,Throstur Olason,Simmagym CrossFit,,,4,97
4,469656,Jeffrey Adler,Jeffrey,Adler,ACT,accepted,M,e480e-P469656_1-184.jpg,CA,Canada,...,0,0,240 reps,659.0,Caroline Lambray,CrossFit Wonderland,,,5,100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133869,701430,Anthony Gallego,Anthony,Gallego,ACT,,M,f06dc-P701430_3-184.jpg,GB,United Kingdom,...,0,0,,,,,,,126461,548388
133870,941007,Thomas Woodward,Thomas,Woodward,ACT,,M,a2006-P941007_1-184.jpg,GB,United Kingdom,...,0,0,,,,,,,126461,548388
133871,464456,Stephen Hipskind,Stephen,Hipskind,ACT,,M,7856b-P464456_3-184.jpg,US,United States,...,0,0,,,,,,,126461,548388
133872,81415,Shane Lemon,Shane,Lemon,ACT,,M,8be3e-P81415_13-184.jpg,CA,Canada,...,0,0,,,,,,,126461,548388


In [None]:
df_women_2020=scrape_open_data(2020,1,1884,score_cols,cols)
df_women_2020.to_csv(path+"open_2020_women.csv",encoding="utf-8-sig",index=False)

In [23]:
df_women_2020

Unnamed: 0,competitorId,competitorName,firstName,lastName,status,postCompStatus,gender,profilePicS3key,countryOfOriginCode,countryOfOriginName,...,scaled_5,video_5,breakdown_5,time_5,judge_5,affiliate_5,heat_5,lane_5,overallRank,overallScore
0,8859,Ragnheiður Sara Sigmundsdottir,Ragnheiður Sara,Sigmundsdottir,ACT,accepted,F,5bee0-P8859_7-184.jpg,IS,Iceland,...,0,0,240 reps,695.0,Andri Hreidarsson,Simmagym CrossFit,,,1,24
1,18588,Annie Thorisdottir,Annie,Thorisdottir,ACT,accepted,F,15f17-P18588_4-184.jpg,IS,Iceland,...,0,0,240 reps,767.0,Frederik Aegidius,CrossFit Reykjavík,,,2,39
2,120480,Kristin Holte,Kristin,Holte,ACT,accepted,F,df164-P120480_7-184.jpg,NO,Norway,...,0,0,240 reps,776.0,Joakim Rygh,CrossFit Oslo,,,3,55
3,163097,Tia-Clair Toomey,Tia-Clair,Toomey,ACT,accepted,F,b8a69-P163097_3-184.jpg,AU,Australia,...,0,1,240 reps,774.0,Shane Orr,CrossFit Torian,,,4,56
4,264512,Jamie Simmonds,Jamie,Simmonds,ACT,accepted,F,101af-P264512_2-184.jpg,NZ,New Zealand,...,0,0,240 reps,732.0,Elliot Simmonds,CrossFit Yas,,,5,65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94152,1598450,Nicole Ferrarella,Nicole,Ferrarella,ACT,,F,5bf13-P1598450_1-184.jpg,US,United States,...,0,0,,,,,,,90041,394073
94153,1552533,Martine Dun,Martine,Dun,ACT,,F,f1681-P1552533_1-184.jpg,NL,Netherlands,...,0,0,,,,,,,90041,394073
94154,1548768,Luna Brown,Luna,Brown,ACT,,F,pukie.png,US,United States,...,0,0,,,,,,,90041,394073
94155,1753916,Danielle Helfrick,Danielle,Helfrick,ACT,,F,0ee1c-P1753916_1-184.jpg,US,United States,...,0,0,,,,,,,90041,394073


In [None]:
df_men_2019=scrape_open_data(2019,1,1,3912,score_cols,cols)
df_men_2019.to_csv(path+"open_2019_men.csv",encoding="utf-8-sig",index=False)

In [21]:
df_men_2019

Unnamed: 0,competitorId,competitorName,firstName,lastName,status,postCompStatus,gender,profilePicS3key,countryOfOriginCode,countryOfOriginName,...,scaled_5,video_5,time_5,breakdown_5,judge_5,affiliate_5,heat_5,lane_5,overallRank,overallScore
0,153604,Mathew Fraser,Mathew,Fraser,ACT,accepted,M,9e218-P153604_4-184.jpg,US,United States,...,0,0,210 reps,413,Daniel Lopez,CrossFit HQ,,,1,66
1,81616,Björgvin Karl Guðmundsson,Björgvin Karl,Guðmundsson,ACT,accepted,M,4c5dc-P81616_4-184.jpg,IS,Iceland,...,0,0,210 reps,477,Hafsteinn Gunnlaugsson,CrossFit Reykjavík,,,2,93
2,199938,Jacob Heppner,Jacob,Heppner,ACT,accepted,M,d1ef5-P199938_3-184.jpg,US,United States,...,0,0,210 reps,441,Andrew Kuechler,Cobra Command CrossFit,,,3,168
3,514502,Lefteris Theofanidis,Lefteris,Theofanidis,ACT,accepted,M,931eb-P514502_2-184.jpg,GR,Greece,...,0,1,210 reps,440,,,,,4,183
4,308712,Jean-Simon Roy-Lemaire,Jean-Simon,Roy-Lemaire,ACT,accepted,M,47f1f-P308712_3-184.jpg,CA,Canada,...,0,0,210 reps,492,Mathieu Gravel,Tonic CrossFit,,,5,187
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195557,1158517,Chad Davis,Chad,Davis,ACT,,M,f9b60-P1158517_14-184.jpg,US,United States,...,0,0,,,,,,,185551,792781
195558,1206149,Elliott Sidey,Elliott,Sidey,ACT,,M,pukie.png,US,United States,...,0,0,,,,,,,185551,792781
195559,1206003,Hieu Tran,Hieu,Tran,ACT,,M,pukie.png,US,United States,...,0,0,,,,,,,185551,792781
195560,1192449,Miles Alden,Miles,Alden,ACT,,M,bde29-P1192449_1-184.jpg,US,United States,...,0,0,,,,,,,185551,792781


In [None]:
df_women_2019=scrape_open_data(2019,1,2928,score_cols,cols)
df_women_2019.to_csv(path+"open_2019_women.csv",encoding="utf-8-sig",index=False)

In [4]:
df_women_2019

Unnamed: 0,competitorId,competitorName,firstName,lastName,status,postCompStatus,gender,profilePicS3key,countryOfOriginCode,countryOfOriginName,...,scaled_5,video_5,breakdown_5,time_5,judge_5,affiliate_5,heat_5,lane_5,overallRank,overallScore
0,8859,Ragnheiður Sara Sigmundsdottir,Ragnheiður Sara,Sigmundsdottir,ACT,accepted,F,5bee0-P8859_7-184.jpg,IS,Iceland,...,0,0,210 reps,459.0,Phil Mansfield,Tagoror CrossFit,,,1,40
1,18588,Annie Thorisdottir,Annie,Thorisdottir,ACT,accepted,F,15f17-P18588_4-184.jpg,IS,Iceland,...,0,0,210 reps,485.0,Frederik Aegidius,Reebok CrossFit Reykjavík,,,2,72
2,120480,Kristin Holte,Kristin,Holte,ACT,accepted,F,df164-P120480_7-184.jpg,NO,Norway,...,0,0,210 reps,450.0,Joakim Rygh,CrossFit Oslo,,,3,93
3,264512,Jamie Greene,Jamie,Greene,ACT,accepted,F,101af-P264512_2-184.jpg,NZ,New Zealand,...,0,0,210 reps,459.0,Elliot Simmonds,CrossFit Club La Santa,,,4,94
4,670000,Dani Speegle,Dani,Speegle,ACT,accepted,F,47a8d-P670000_4-184.jpg,US,United States,...,0,0,210 reps,528.0,Miguel Senior,SUBU CrossFit,,,5,97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146358,146572,Andréa Maria Cecil Topper,Andréa Maria,Cecil Topper,ACT,,F,8b50b-P146572_4-184.jpg,US,United States,...,0,0,,,,,,,140133,604776
146359,1147531,Bary Morton,Bary,Morton,ACT,,F,pukie.png,US,United States,...,0,0,,,,,,,140133,604776
146360,681265,Melissa Yinger,Melissa,Yinger,ACT,,F,5eb33-P681265_1-184.jpg,US,United States,...,0,0,,,,,,,140133,604776
146361,332049,Tamara Meyer,Tamara,Meyer,ACT,,F,4ed03-P332049_7-184.jpg,US,United States,...,0,0,,,,,,,140133,604776


## 2018 Data

In 2018, there was an additional scored workout meaning that we now need 6 score columns instead of 5. 2018 was also the last year that competitors were assigned to a region, and there were extra columns with that extra data. I previewed the 2018 data using the cell below:

In [15]:
url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2018/leaderboards?view=0&division=1&scaled=0&sort=0'
content=requests.get(url)
data=json.loads(content.content)
data['leaderboardRows'][0]

{'entrant': {'competitorId': '153604',
  'competitorName': 'Mathew Fraser',
  'firstName': 'Mathew',
  'lastName': 'Fraser',
  'status': 'ACT',
  'postCompStatus': 'accepted',
  'gender': 'M',
  'profilePicS3key': '9e218-P153604_4-184.jpg',
  'countryShortCode': '',
  'regionalCode': '3',
  'regionId': '6',
  'regionName': 'Central East',
  'divisionId': '1',
  'profession': '0',
  'affiliateId': '3220',
  'affiliateName': 'CrossFit Mayhem',
  'age': '28',
  'height': '67 in',
  'weight': '195 lb',
  'teamCaptain': '0'},
 'ui': {'highlight': False},
 'scores': [{'ordinal': 1,
   'rank': '4',
   'score': '14760000',
   'scoreDisplay': '476 reps',
   'mobileScoreDisplay': '',
   'scoreIdentifier': '7db296c93cdcd76984d0',
   'scaled': '0',
   'video': '0',
   'breakdown': '14 rounds +\n8 toes-to-bars\n10 clean & jerks\n10 calories',
   'judge': 'Anthony Jay Wilkerson',
   'affiliate': 'CrossFit Mayhem',
   'heat': '',
   'lane': ''},
  {'ordinal': 2,
   'rank': '49',
   'score': '11100478

I used the code below to update the column names that we would like to have added to the final dataframe:

In [17]:
#Athlete info
cols_1=list(data['leaderboardRows'][0]['entrant'].keys())

#National Champ flag
cols_2=list(data['leaderboardRows'][0]['ui'].keys())

#Fields for each workout
cols_3=list(data['leaderboardRows'][0]['scores'][0].keys())
cols_3.insert(8,'time')

#Mutiply by 5 for the 5 workouts and add "_"+week # to each column to create unique column headers
cols_3=cols_3*6
n=1
count=1
for i in range(0,len(cols_3)):
    cols_3[i]=cols_3[i]+"_"+str(n)
    count+=1
    if count==15:
        n+=1
        count=1
cols_4=['overallRank','overallScore']

#Combine all columns
cols=[cols_1,cols_2,cols_3,cols_4]

#Create master column list
cols=flatten_list(cols)

#Columns that go with the scoring data to be used as keys to help pull out the JSON values
score_cols=['ordinal',
'rank',
'score',
'scoreDisplay',
'mobileScoreDisplay',
'scoreIdentifier',
'scaled',
'video',
'breakdown',
'time',
'judge',
'affiliate',
'heat',
'lane']

There is a page missing that is blank

Once I updated the master column list, the I was able to use the scrape_open_data function again.

In [None]:
#Pulling the 2018 men data
df_men_2018 = scrape_open_data(2018,1,1,4552,score_cols,cols)
df_men_2018.to_csv(path+"open_2018_men.csv",encoding="utf-8-sig",index=False)

In [5]:
df_men_2018

Unnamed: 0,competitorId,competitorName,firstName,lastName,status,postCompStatus,gender,profilePicS3key,countryShortCode,regionalCode,...,scaled_6,video_6,time_6,breakdown_6,judge_6,affiliate_6,heat_6,lane_6,overallRank,overallScore
0,153604,Mathew Fraser,Mathew,Fraser,ACT,accepted,M,9e218-P153604_4-184.jpg,,3.0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,399.0,Lindy Barber,CrossFit Mayhem,,,1,97
1,180541,Alex Vigneault,Alex,Vigneault,ACT,accepted,M,ebe1c-P180541_4-184.jpg,,0.0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,408.0,Benjamin Hebert,CrossFit Quebec City,,,2,439
2,702092,Willy Georges,Willy,Georges,ACT,accepted,M,9ca60-P702092_3-184.jpg,,0.0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,501.0,Mylene Jankovits,CrossFit DBS 83,,,3,596
3,308712,Jean-Simon Roy-Lemaire,Jean-Simon,Roy-Lemaire,ACT,accepted,M,47f1f-P308712_3-184.jpg,,0.0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,463.0,Isabelle Chouinard,Tonic CrossFit,,,4,681
4,158264,Patrick Vellner,Patrick,Vellner,ACT,accepted,M,d471c-P158264_7-184.jpg,,0.0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,409.0,Jonathan Mulder,CrossFit Solid Ground,,,5,706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227507,5650,Brandon Petersen,Brandon,Petersen,ACT,,M,24bff-P5650_5-184.jpg,,4.0,...,0,0,,,,,,,216825,1149324
227508,668752,孙 庭旺,孙,庭旺,ACT,,M,pukie.png,,0.0,...,0,0,,,,,,,216825,1149324
227509,63540,Nicholas Thomlison,Nicholas,Thomlison,ACT,,M,pukie.png,,0.0,...,0,0,,,,,,,216825,1149324
227510,670063,Jonathan Gary,Jonathan,Gary,ACT,,M,fbcb0-P670063_1-184.jpg,,0.0,...,0,0,,,,,,,216825,1149324


Page 4,214 is missing from the CrossFit Games website, so I am missing 50 entries for the 2018 men.

In [None]:
#Pulling the 2018 women data
df_women_2018 = scrape_open_data(2018,2,1,3440,score_cols,cols)
df_women_2018.to_csv(path+"open_2018_women.csv",encoding="utf-8-sig",index=False)

In [6]:
df_women_2018

Unnamed: 0,competitorId,competitorName,firstName,lastName,status,postCompStatus,gender,profilePicS3key,countryShortCode,regionalCode,...,scaled_6,video_6,time_6,breakdown_6,judge_6,affiliate_6,heat_6,lane_6,overallRank,overallScore
0,123582,Cassidy Lance-Mcwherter,Cassidy,Lance-Mcwherter,ACT,accepted,F,415f2-P123582_15-184.jpg,,1,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,390.0,Kimberly Marczynski,CrossFit WaterSide,,,1,189
1,2942,Kara Saunders,Kara,Saunders,ACT,accepted,F,1fb24-P2942_13-184.jpg,,0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,411.0,Matthew Saunders,CrossFit Kova,,,2,193
2,239148,Carolyne Prevost,Carolyne,Prevost,ACT,accepted,F,513d9-P239148_11-184.jpg,,0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,409.0,Paul McIntyre,CrossFit Colosseum,,,3,260
3,8404,Camille Leblanc-Bazinet,Camille,Leblanc-Bazinet,ACT,accepted,F,cfdc3-P8404_13-184.jpg,,7,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,412.0,Kirsten Ahrendt,CrossFit Invictus,,,4,310
4,18588,Annie Thorisdottir,Annie,Thorisdottir,ACT,accepted,F,15f17-P18588_4-184.jpg,,0,...,0,0,3 thr. 3 PU \n6 thr. 6 PU \n9 thr. 9 PU \n12 t...,394.0,Karl Steadman,Reebok CrossFit Reykjavík,,,5,336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171971,1206428,Olivia LeFort,Olivia,LeFort,ACT,,F,pukie.png,,4,...,0,0,,,,,,,165829,888711
171972,993494,Kari Thacker,Kari,Thacker,ACT,,F,acac0-P993494_1-184.jpg,,0,...,0,0,,,,,,,165829,888711
171973,226604,Kyla Hayden,Kyla,Hayden,ACT,,F,ee1ae-P226604_2-184.jpg,,0,...,0,0,,,,,,,165829,888711
171974,110102,Heather Rosenberg,Heather,Rosenberg,ACT,,F,d7fad-P110102_5-184.jpg,,0,...,0,0,,,,,,,165829,888711


## 2017 Data

The 2017 data required some extra wrangling because the JSON data was organized differently than the previous years:

In [None]:
url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2017/leaderboards?view=0&division=1&scaled=0&sort=0'
content=requests.get(url)
data=json.loads(content.content)
data['athletes'][0].keys()

In order to try to keep column names relatively consistent with the column names from the previous years, I tried to match up the column names from the 2017 JSON data with a manual list of column names:

In [35]:
cols = ['competitorid','competitorname','regionid','affiliateid','divisionid','highlight','age','region','height',
       'weight','profilepic','overallrank','overallscore','affiliate','division']

score_cols=['workoutrank','workoutresult','scoreidentifier','scoredisplay','time','breakdown','judge',
            'affiliate','video']

#Mutiply by 5 for the 5 workouts and add "_"+week # to each column to create unique column headers
score_cols=score_cols*5
n=1
count=1
for i in range(0,len(score_cols)):
    score_cols[i]=score_cols[i]+"_"+str(n)
    count+=1
    if count==10:
        n+=1
        count=1

cols=cols+score_cols+['nextstage']

The score data was als set up differently than the previous years:

In [115]:
data['athletes'][0]['scores']

[{'workoutrank': '20',
  'workoutresult': '--',
  'scoreidentifier': '79865929d2890ec36985',
  'scoredisplay': '10:23',
  'scoredetails': {'time': 623,
   'breakdown': "225 reps\nJudged by Matt O'Keefe\nat Champlain Valley CrossFit"},
  'video': '0'},
 {'workoutrank': '15',
  'workoutresult': '--',
  'scoreidentifier': 'b41b7a390c967ab0e238',
  'scoredisplay': '222 reps',
  'scoredetails': {'time': 654,
   'breakdown': '6 Rounds\n50-ft lunges\n8 Bar MU\n',
   'judge': 'Margaux Alvarez',
   'affiliate': 'CrossFit Columbus'},
  'video': 0},
 {'workoutrank': '3',
  'workoutresult': '--',
  'scoreidentifier': '72b5ac6ceb59985fbc33',
  'scoredisplay': '17:47',
  'scoredetails': {'time': 1067,
   'breakdown': '216 reps',
   'judge': 'Todd Widman',
   'affiliate': 'Alamo City CrossFit'},
  'video': 0},
 {'workoutrank': '1',
  'workoutresult': '--',
  'scoreidentifier': '09108c63c459551104fe',
  'scoredisplay': '327 reps',
  'scoredetails': {'time': 664,
   'breakdown': '1 Round\n55 Deadlifts\

I wrote a separate function to process the 2017 data:

In [39]:
def scrape_open_data_2017(gender,start_page,end_page,cols):
    master_list=[]
    for p in range(start_page,end_page+1):
        if p == 1:
            url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2017/leaderboards?view=0&division='+str(gender)+'&scaled=0&sort=0'
        else:
            url = 'https://games.crossfit.com/competitions/api/v1/competitions/open/2017/leaderboards?view=0&division='+str(gender)+'&scaled=0&sort=0&page='+str(p)
        try:
            content=requests.get(url)
            data=json.loads(content.content)
            for i in data['athletes']:
                info=[]
                vals=list(i.values())
                info.append(vals[:15])
                for i in range(5):
                    info.append(list(vals[15][i].values())[:4])
                    if list(vals[15][i].values())[4]==None:
                        info.append(["","","",""])
                    elif len(list(vals[15][i].values())[4].values())==2:
                        info.append(list(list(vals[15][i].values())[4].values())+["",""])
                    else:
                        info.append(list(vals[15][i].values())[4].values())
                    info.append([list(vals[15][i].values())[5]])
                if len(vals) != 17:
                    info.append([""])
                else:
                    info.append([vals[16]])
                master_list.append(flatten_list(info))
            print("Done with page:",p)
        except ValueError:
            print("Error on page:",p)
    df=pd.DataFrame(master_list,columns=cols)       
    return df

In [None]:
df_men_2017=scrape_open_data_2017(1,1,4291,cols)

After creating the 2017 data frame, the data needed some additional cleaning and some solumns added. I had to created a flag for whether the athlete scaled (did the modified version of) the workout or not. The first week's score also did not have judge, affiliate or score breakdown broken out for every athlete for some reason. The cell below makes sure each athlete has a judge, affiliate and score breakdown for the first score for each athlete.

In [None]:
def clean_2017_data(df):
    #Create scaled flag
    df['scaled_1'] = np.where(df['scoredisplay_1'].str.contains("- s"),1,0)
    df['scaled_2'] = np.where(df['scoredisplay_2'].str.contains("- s"),1,0)
    df['scaled_3'] = np.where(df['scoredisplay_3'].str.contains("- s"),1,0)
    df['scaled_4'] = np.where(df['scoredisplay_4'].str.contains("- s"),1,0)
    df['scaled_5'] = np.where(df['scoredisplay_5'].str.contains("- s"),1,0)
    
    #Split breakdown_1 at new line character
    df['judge_1']=np.where(df['breakdown_1'].str.contains("\n"),
                                    df['breakdown_1'].str.split("\n").str.get(1).str.slice(start=10),
                                    df['judge_1'])
    df['affiliate_1']=np.where(df['breakdown_1'].str.contains("\n"),
                                    df['breakdown_1'].str.split("\n").str.get(2).str.slice(start=3),
                                    df['affiliate_1'])
    df['breakdown_1']=np.where(df['breakdown_1'].str.contains("\n"),
                                    df['breakdown_1'].str.split("\n").str.get(0),
                                    df['breakdown_1'])
    return df

In [None]:
df_men_2017=clean_2017_data(df_men_2017)

Save the data locally:

In [99]:
df_men_2017.to_csv(path+"open_2017_men_cleaned.csv",encoding="utf-8-sig",index=False)

In [25]:
df_men_2017

Unnamed: 0.1,Unnamed: 0,competitorid,competitorname,regionid,affiliateid,divisionid,highlight,age,region,height,...,breakdown_5,judge_5,affiliate_5,video_5,nextstage,scaled_1,scaled_2,scaled_3,scaled_4,scaled_5
0,0,153604,Mathew Fraser,11,2080,1,0,27,North East,"5'7""",...,440 reps,Matt O'Keefe,CrossFit Connex,0,accepted,0,0,0,0,0
1,1,2725,Noah Ohlsen,15,2509,1,0,26,South East,"5'7""",...,440 reps,Peter Kazanas,Peak 360 CrossFit,0,accepted,0,0,0,0,0
2,2,180541,Alex Vigneault,4,10990,1,0,25,Canada East,"5'11""",...,440 reps,Carol,CrossFit Quebec City,0,accepted,0,0,0,0,0
3,3,81616,Björgvin Karl Guðmundsson,7,4860,1,0,24,Europe,178 cm,...,440 reps,Evert Viglundsson,Reebok CrossFit Reykjavík,0,accepted,0,0,0,0,0
4,4,388740,Anthony Davis,10,1289,1,0,22,North Central,"5'9""",...,440 reps,Tony Koens,Timberwolf CrossFit,0,team,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214464,214464,1119772,Vinicius Peixoto Falcao,8,16213,1,0,27,Latin America,187 cm,...,,,,--,,0,0,0,0,0
214465,214465,116254,Brian Shook,16,4829,12,0,42,Southern California,"5'9""",...,,,,--,,0,0,0,0,0
214466,214466,1119905,Ruslan Nabinyuk,7,0,18,0,37,Europe,172 cm,...,,,,--,,0,0,0,0,0
214467,214467,1112526,Karim Alaoui,8,0,16,0,17,Latin America,183 cm,...,,,,--,,0,0,0,0,0


I then pulled, cleaned, and saved the 2017 women data:

In [None]:
df_women_2017=scrape_open_data_2017(2,1,3192,cols)

In [None]:
df_women_2017=clean_2017_data(df_women_2017)

In [None]:
df_women_2017.to_csv(path+"open_2017_women_cleaned.csv",encoding="utf-8-sig",index=False)

In [26]:
df_women_2017

Unnamed: 0,competitorid,competitorname,regionid,affiliateid,divisionid,highlight,age,region,height,weight,...,breakdown_5,judge_5,affiliate_5,video_5,nextstage,scaled_1,scaled_2,scaled_3,scaled_4,scaled_5
0,8859,Ragnheiður Sara Sigmundsdottir,6,0,2,0,24,Central East,173 cm,69 kg,...,440 reps,Lindy Barber,CrossFit Mayhem,0,accepted,0,0,0,0,0
1,305891,Kari Pearce,11,18553,2,0,28,North East,"5'3""",139 lb,...,440 reps,Michael Varrato III,Golden Phoenix CrossFit South,0,accepted,0,0,0,0,0
2,8404,Camille Leblanc-Bazinet,17,386,2,0,28,South West,"5'2""",130 lb,...,440 reps,Darren Hunsucker,CrossFit Mayhem,0,accepted,0,0,0,0,0
3,264512,Jamie Greene,1,10868,2,0,26,Africa,163 cm,135 lb,...,440 reps,Sabine Whitfield,CrossFit Yas,0,accepted,0,0,0,0,0
4,123582,Cassidy Lance-Mcwherter,15,16524,2,0,29,South East,"5'3""",140 lb,...,440 reps,Tim Ducat,CrossFit Westchase,0,accepted,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159558,1113771,Jeri Villarreal,10,0,13,0,40,North Central,"5'8""",166 lb,...,,,,--,,0,0,0,0,0
159559,691494,Aurora Cabello,7,8094,2,0,30,Europe,178 cm,73 kg,...,,,,--,,0,0,0,0,0
159560,1114887,Tara Routsis,17,0,4,0,47,South West,,,...,,,,--,,0,0,0,0,0
159561,1118645,Sydney Cole,1,0,2,0,23,Africa,,140 lb,...,,,,--,,0,0,0,0,0


## 2016 Data

After 2017, the rest of the CrossFit Open data is stored on the Legacy Leaderboard. The main data cannot be pulled into a nice JSON format directly from an AJAX call. I decided to use BeautifulSoup to help parse the direct HTML of the pages from the 2016 leaderboard.

In [116]:
url="https://games.crossfit.com/scores/leaderboard.php?stage=0&sort=0&page=1&division=1&region=0&numberperpage=60&competition=0&frontpage=0&expanded=1&year=16&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&=&is_mobile=&scaled=0&fittest=1&fitSelect=0&regional=5&occupation=0"
content=requests.get(url)
soup=BeautifulSoup(content.content)

I examined the HTML and saw that the data stored in the leaderboard is stored in an HTML table. I wrote a function that parses through the table and pulls out all of the relevant information:

In [4]:
#Columns included in the HTML table
cols=['overallrank','athletepage','competitorname','scoreidentifier_1','score_1','scoreidentifier_2','score_2',
     'scoreidentifier_3','score_3','scoreidentifier_4','score_4','scoreidentifier_5','score_5']

In [52]:
def scrape_open_data_2016(gender,start,end,cols):
    main_table = []
    for p in range(start,end+1):
        #Pull the HTML of the page and create a BeautifulSoup object
        url="https://games.crossfit.com/scores/leaderboard.php?stage=0&sort=0&page="+str(p)+"&division="+str(gender)+"&region=0&numberperpage=60&competition=0&frontpage=0&expanded=1&year=16&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&=&is_mobile=&scaled=0&fittest=1&fitSelect=0&regional=5&occupation=0"
        content=requests.get(url)
        soup=BeautifulSoup(content.content)
        
        #Find the first table
        table=soup.find_all("table")[0]
        #Start at the second set of tr tags, this is where all the table rows are
        rows=table.find_all('tr')[1]
        #Pull out all the data cells of the rows
        columns=rows.find_all('td')
        new_table=[]
        #Loops through all of the data cells
        for column in columns:
            
            #Pull out athelete href if data cell has it
            if column.find('a') !=None:
                if "athlete" in column.find('a')['href']:
                    new_table.append(column.find('a')['href'])
            #Pull out the score identifier if data cell has it
            if column.find("span",{"class":"display"}) != None:
                new_table.append(column.find("span",{"class":"display"})['data-scoreid'])
            #Get text of data cell
            new_table.append(column.get_text().strip())
            if len(new_table)==13:
                main_table.append(new_table)
                new_table=[]
        print("Done with page",p)
    df=pd.DataFrame(main_table,columns=cols,index=False)
    return df

In [None]:
df_women_2016=scrape_open_data_2016(2,1,2170,cols)

In [10]:
df_men_2016=scrape_open_data_2016(2,1,2976,cols)

After I pulled the 2016 data, I did some data cleaning and and added some columns:

In [8]:
def clean_2016_data(df):
    df['competitorid']=df['athletepage'].apply(lambda x: x[x.rfind("/")+1:])

    #Pull out and create a separate rank column from the score_ columns:
    df['overallrank'] = df['overallrank'].apply(lambda x: x[:x.find("(")].strip())
    df['rank_1']=df['score_1'].apply(lambda x: x[:x.find("(")].strip())
    df['rank_2']=df['score_2'].apply(lambda x: x[:x.find("(")].strip())
    df['rank_3']=df['score_3'].apply(lambda x: x[:x.find("(")].strip())
    df['rank_4']=df['score_4'].apply(lambda x: x[:x.find("(")].strip())
    df['rank_5']=df['score_5'].apply(lambda x: x[:x.find("(")].strip())
    

    #Pull out and create a separate scoredisplay column from the score_ columns:
    df['scoredisplay_1']=df['score_1'].apply(lambda x: "No score" if "No score" in x 
                                                               else x[x.find("(")+1:x.find(")")].strip())
    df['scoredisplay_2']=df['score_2'].apply(lambda x: "No score" if "No score" in x 
                                                               else x[x.find("(")+1:x.find(")")].strip())
    df['scoredisplay_3']=df['score_3'].apply(lambda x: "No score" if "No score" in x 
                                                               else x[x.find("(")+1:x.find(")")].strip())
    df['scoredisplay_4']=df['score_4'].apply(lambda x: "No score" if "No score" in x 
                                                               else x[x.find("(")+1:x.find(")")].strip())
    df['scoredisplay_5']=df['score_5'].apply(lambda x: "No score" if "No score" in x 
                                                               else x[x.find("(")+1:x.find(")")].strip())
    
    #Create scaled flag - A note that the score_2 does not have a way to determine if the workout was
    #scaled or not. I will have to find an alternate method to create a scaled flag at a later point.
    df['scaled_1']=df['score_1'].apply(lambda x: 1 if '- s' in x else 0)
    df['scaled_2']=df['score_2'].apply(lambda x: 1 if '- s' in x else 0)
    df['scaled_3']=df['score_3'].apply(lambda x: 1 if '- s' in x else 0)
    df['scaled_4']=df['score_4'].apply(lambda x: 1 if '- s' in x else 0)
    df['scaled_5']=df['score_5'].apply(lambda x: 1 if '- s' in x else 0)
    
    #Drop score columns bc they are now redundant
    df=df.drop(['score_1','score_2','score_3','score_4','score_5'],axis=1)
    
    return df

I then cleaned and saved the 2016 data:

In [None]:
df_women_2016=pd.read_csv(path+'open_2016_women.csv',encoding='utf-8-sig',low_memory=False)
df_women_2016=clean_2016_data(df_women_2016)

In [None]:
df_men_2016=pd.read_csv(path+'open_2016_men.csv',encoding='utf-8-sig',low_memory-False)
df_men_2016=clean_2016_data(df_men_2016)

I wanted to pull the judge, affiliate and breakdown information for each score. This information only shows up in the tooltip when you hover over a score for the 2016 data on the website. Each score has a unique identifier that I was able to pull from the original HTML. To pull the additional information that is included in the tool tip for each score, I found that the information for each score is rendered by a request to a specific endpoint identified by the unique score identifier. I wrote the below function to pull this data. I had to add an option to chunk the pulling of the score data. We are talking about doing almost 200,000 unique requests 5 separate times for each workout for the 2016 men or women, so trying to run this all at once was causing my computer to run out of memory. For each score week, I had to run this function for a smaller amount of data, store it in a list, append the lists together, create a data frame and then append that data frame tp the main 2016 data frame. This was a pretty time intensive process, but it was the only way my machine could handle pulling and processing all the data without memory errors and losing parts of the data.

In [2]:
def get_score_details_2016(df,num,start_range,end_range,run_num,gender,path):
    df_scores=pd.DataFrame(columns=['judge_'+str(num),'affiliate_'+str(num),'breakdown_'+str(num),'reps_'+str(num)])
    if start_range==0:
        df_scores.to_csv(path+'2016_scores_'+str(num)+'_'+gender+'.csv',index=False)
    if end_range=="end":
        identifiers=df['scoreidentifier_'+str(num)].values[start_range:]
    else:
        identifiers=df['scoreidentifier_'+str(num)].values[start_range:end_range]
    judges=[]
    affiliates=[]
    breakdowns=[]
    reps=[]
    n=0
    for i in identifiers:
        if not np.isnan(i):
            url="https://games.crossfit.com/scores/getTooltip.php?id="+str(i)+"&year=16"
            content=requests.get(url)
            data=json.loads(content.content)
            judges.append(data['judge_details'])
            affiliates.append(data['affiliate_name'])
            breakdowns.append(data['round_breakdown'])
            reps.append(data['reps'])
            n+=1
            print("Done with ",n,i)
        else:
            judges.append("")
            affiliates.append("")
            breakdowns.append("")
            reps.append("")
            n+=1
            print("Done with ",n,i)
    df_scores['judge_'+str(num)]=judges
    df_scores['affiliate_'+str(num)]=affiliates
    df_scores['breakdown_'+str(num)]=breakdowns
    df_scores['reps_'+str(num)]=reps
    with open(path+'\\2016_scores_'+str(num)+'_'+gender+'.csv','a',newline='',encoding='utf-8-sig') as f:# Open file as append mode
        df_scores.to_csv(f, header = False,index=False, encoding='utf-8-sig')
    return "Run "+str(run_num)+" is done!" 

Below I looped through the function I wrote above to pull and store all of the score data for the women:

In [13]:
vals=[0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,110000,120000,"end"]

In [None]:
#Score 1
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_women_2016,1,vals[i],vals[i+1],run_count,"women",path)
    run_count+=1

In [None]:
#Score 2
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_women_2016,2,vals[i],vals[i+1],run_count,"women",path)
    run_count+=1

In [None]:
#Score 3
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_women_2016,3,vals[i],vals[i+1],run_count,"women",path)
    run_count+=1

In [None]:
#Score 4
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_women_2016,4,vals[i],vals[i+1],run_count,"women",path)
    run_count+=1

In [None]:
#Score 5
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_women_2016,5,vals[i],vals[i+1],run_count,"women",path)
    run_count+=1

I repeated the same process for the men:

In [14]:
vals=[0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,110000,120000,130000,140000,
      150000,160000,170000,"end"]

In [None]:
#Score 1
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_men_2016,1,vals[i],vals[i+1],run_count,"men",path)
    run_count+=1

In [None]:
#Score 2
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_men_2016,2,vals[i],vals[i+1],run_count,"men",path)
    run_count+=1

In [None]:
#Score 3
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_men_2016,3,vals[i],vals[i+1],run_count,"men",path)
    run_count+=1

In [None]:
#Score 4
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_men_2016,4,vals[i],vals[i+1],run_count,"men",path)
    run_count+=1

In [None]:
#Score 5
run_count=1
for i in range(0,len(vals)-1):
    get_score_details_2016(df_men_2016,5,vals[i],vals[i+1],run_count,"men",path)
    run_count+=1

I then wrote a function to concatenate all of the score data I pulled with the main dataframes for men and women:

In [18]:
def add_final_score_data_2016(gender,df,path):
    for i in range(1,6):
        if gender == 2:
            df_scores=pd.read_csv(path+'2016_scores_'+str(i)+'_women.csv',encoding='latin-1')
        else:
            df_scores=pd.read_csv(path+'2016_scores_'+str(i)+'.csv',encoding='latin-1')
        df=pd.concat([df,df_scores],axis=1)
    return df

In [16]:
df_women_2016=add_final_score_data_2016(2,df_women_2016)
df_women_2016.to_csv(path+'open_2016_women_cleaned.csv',encoding='utf-8-sig',index=False)
df_women_2016

Unnamed: 0,overallrank,athletepage,competitorname,scoreidentifier_1,scoreidentifier_2,scoreidentifier_3,scoreidentifier_4,scoreidentifier_5,competitorid,rank_1,...,breakdown_3,reps_3,judge_4,affiliate_4,breakdown_4,reps_4,judge_5,affiliate_5,breakdown_5,reps_5
0,1,http://games.crossfit.com/athlete/264512,Jamie Greene,215780.0,490055.0,790289.0,1090808.0,1348661.0,264512,3,...,11 Full Rounds<br/>4 snatches<br/>(6:51)<br/>,147 Reps,Elliot Simmonds,CrossFit Yas,1 Full Rounds<br/>55 deadlifts<br/>30 wall-bal...,305 Reps,Elliot Simmonds,CrossFit Yas,,Time: 07:43 (Rx)
1,2,http://games.crossfit.com/athlete/2536,Samantha Briggs,225341.0,501497.0,801792.0,955366.0,1218954.0,2536,2,...,12 Full Rounds<br/>2 snatches<br/>(6:51)<br/>,158 Reps,Juris Vjacirs,CrossFit Black Five,1 Full Rounds<br/>55 deadlifts<br/>33 wall-bal...,308 Reps,Craig Massey,CrossFit Black Five,,Time: 07:36 (Rx)
2,3,http://games.crossfit.com/athlete/2942,Kara Webb,56243.0,363876.0,681328.0,990659.0,1255989.0,2942,20,...,10 Full Rounds<br/>9 snatches<br/>(6:39)<br/>,139 Reps,Tom Henderson,CrossFit Roar,1 Full Rounds<br/>55 deadlifts<br/>14 wall-bal...,289 Reps,Tom Henderson,CrossFit Roar,,Time: 08:05 (Rx)
3,4,http://games.crossfit.com/athlete/8859,Ragnheiður Sara Sigmundsdottir,299807.0,585234.0,879498.0,1164895.0,1502664.0,8859,14,...,10 Full Rounds<br/>(6:58)<br/>,130 Reps,Andri Gudjonsson,CrossFit Sudurnes,1 Full Rounds<br/>55 deadlifts<br/>41 wall-bal...,316 Reps,John Singleton,CrossFit Hengill,,Time: 08:26 (Rx)
4,5,http://games.crossfit.com/athlete/3407,Michele Letendre,265578.0,461762.0,741085.0,1114258.0,1305066.0,3407,22,...,11 Full Rounds<br/>4 snatches<br/>(6:48)<br/>,147 Reps,Maxime Dufault,Deka CrossFit,1 Full Rounds<br/>55 deadlifts<br/>4 wall-ball...,279 Reps,Patrick Thibeault,CrossFit Cloverdale,,Time: 08:02 (Rx)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130149,--,http://games.crossfit.com/athlete/693668,Valentina Zurro,,,,,,693668,--,...,,,,,,,,,,
130150,--,http://games.crossfit.com/athlete/754069,Amanda Zwernemann,,,,,,754069,--,...,,,,,,,,,,
130151,--,http://games.crossfit.com/athlete/562525,Lyndsay Zwirlein,,,,,,562525,--,...,,,,,,,,,,
130152,--,http://games.crossfit.com/athlete/134126,Katarzyna Zyra,,,,,,134126,--,...,,,,,,,,,,


In [19]:
df_men_2016=add_final_score_data_2016(1,df_men_2016)
df_men_2016.to_csv(path+'open_2016_men_cleaned.csv',encoding='utf-8-sig',index=False)
df_men_2016

Unnamed: 0,overallrank,athletepage,competitorname,scoreidentifier_1,scoreidentifier_2,scoreidentifier_3,scoreidentifier_4,scoreidentifier_5,competitorid,rank_1,...,breakdown_3,reps_3,judge_4,affiliate_4,breakdown_4,reps_4,judge_5,affiliate_5,breakdown_5,reps_5
0,1,http://games.crossfit.com/athlete/2725,Noah Ohlsen,28336.0,593643.0,824688.0,1127681.0,1520410.0,2725,2,...,10 Full Rounds<br/>10 snatches<br/>(6:37)<br/>,140 Reps,Zach Martin,Peak 360 CrossFit,1 Full Rounds<br/>55 deadlifts<br/>39 wall-bal...,314 Reps,Guido Trinidad,Peak 360 CrossFit,,Time: 07:38 (Rx)
1,2,http://games.crossfit.com/athlete/11435,Richard Froning Jr.,255602.0,345256.0,868013.0,1160992.0,1515045.0,11435,51,...,10 Full Rounds<br/>10 snatches<br/>(6:33)<br/>,140 Reps,Darren Hunsucker,CrossFit Mayhem,1 Full Rounds<br/>55 deadlifts<br/>40 wall-bal...,315 Reps,George Krauss,CrossFit Mayhem,,Time: 08:02 (Rx)
2,3,http://games.crossfit.com/athlete/1690,Travis Mayer,250219.0,539825.0,824816.0,1119789.0,1470878.0,1690,17,...,10 Full Rounds<br/>8 snatches<br/>(6:39)<br/>,138 Reps,Marjorie Greene,CrossFit Passion,1 Full Rounds<br/>55 deadlifts<br/>39 wall-bal...,314 Reps,Marjorie Greene,CrossFit Passion,,Time: 08:05 (Rx)
3,4,http://games.crossfit.com/athlete/34796,Scott Panchik,295507.0,610310.0,856108.0,1142206.0,1523761.0,34796,51,...,10 Full Rounds<br/>10 snatches<br/>1 bar-muscl...,141 Reps,Saxon Panchik,CrossFit Mentality,1 Full Rounds<br/>55 deadlifts<br/>24 wall-bal...,299 Reps,Christin Handley,CrossFit Mentality,,Time: 08:03 (Rx)
4,5,http://games.crossfit.com/athlete/18670,Kyle Frankenfeld,218759.0,473708.0,864228.0,1202462.0,1518123.0,18670,8,...,11 Full Rounds<br/>3 snatches<br/>(6:52)<br/>,146 Reps,Jasmin Wood,CrossFit Moorabbin,1 Full Rounds<br/>55 deadlifts<br/>19 wall-bal...,294 Reps,Hayden Miller,CrossFit Moorabbin,,Time: 08:23 (Rx)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178505,--,http://games.crossfit.com/athlete/680723,Marin Škara,,,,,,680723,--,...,,,,,,,,,,
178506,--,http://games.crossfit.com/athlete/456913,Marcin Żaworonek,,,,,,456913,--,...,,,,,,,,,,
178507,--,http://games.crossfit.com/athlete/713921,Adriano Čubrić,,,,,,713921,--,...,,,,,,,,,,
178508,--,http://games.crossfit.com/athlete/809754,Hüseyin İnceoglu,,,,,,809754,--,...,,,,,,,,,,


## 2015 Data

This notebook is a work in progress. I will be pulling the 2015 data next!

## Part 2

The next step will be uploading the data to a PostgreSQL database.