## 2021: Week 29 - PD x WOW - Tokyo 2020 Calendar

Challenge by Tom Prowse in collaboration with the Workout Wednesday team!

This week it's time for our annual get-together with Workout Wednesday: a joint challenge that gives you a full solution, from data prep through to visualisation.

Unfortunately, the Olympics were postponed in 2020, so for last year's collaboration we looked at historical winners across the history of the Games. This year, however, Tokyo 2020 is going ahead, so we thought it would be the perfect time to create an event calendar to help us keep track of the events that we don't want to miss.

### Inputs
The data comes from the Olympics website. (Note: this was taken on Wednesday 14th July, so the schedule for some events may have changed since!)

1. Event Schedule

A list of all the event dates, times and locations throughout the games

2. Venue Details

A list of all of the different venue locations

### Requirements
- Input the Data 
- Create a correctly formatted DateTime field 
- Parse the event list so each event is on a separate row 
- Group similar sports into a Sport Type field 
- Combine the Venue table 
- Calculate whether the event is a 'Victory Ceremony' or 'Gold Medal' event. (Note, this might not pick up all of the medal events.)
- Output the Data

In [872]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [873]:
data = pd.read_excel("./data/Olympic Events.xlsx", sheet_name=["Olympics Events", "Venues"])
olympics = data["Olympics Events"].copy()
venues = data["Venues"].copy()



In [874]:
olympics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 709 entries, 0 to 708
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    709 non-null    object
 1   Time    709 non-null    object
 2   Sport   709 non-null    object
 3   Venue   709 non-null    object
 4   Events  709 non-null    object
dtypes: object(5)
memory usage: 27.8+ KB


In [875]:
olympics.head()

Unnamed: 0,Date,Time,Sport,Venue,Events
0,21st_July_2021,1:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,"Australia vs Japan, Italy vs United States, Me..."
1,21st_July_2021,8:30,Football,Sapporo Dome,"Women's Group E: Great Britain vs Chile, Women..."
2,21st_July_2021,9:00,Football,Miyagi Stadium,"Women's Group F: China vs Brazil, Women's Grou..."
3,21st_July_2021,9:30,Football,Tokyo Stadium,"Women's Group G: Sweden vs United States, Wome..."
4,22nd_July_2021,1:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,"United States vs Canada, Mexico vs Japan, Ital..."


In [876]:
venues.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Venue     59 non-null     object
 1   Sport     59 non-null     object
 2   Location  59 non-null     object
dtypes: object(3)
memory usage: 1.5+ KB


In [877]:
venues.head()

Unnamed: 0,Venue,Sport,Location
0,Olympic Stadium,Opening Ceremony,"35.67786383266573, 139.71366292613558"
1,Olympic Stadium,Closing Ceremony,"35.67786383266573, 139.71366292613558"
2,Olympic Stadium,Athletics,"35.67786383266573, 139.71366292613558"
3,Olympic Stadium,Football,"35.67786383266573, 139.71366292613558"
4,Tokyo Metropolitan Gymnasium,Table Tennis,"35.679538129089025, 139.71224149090568"


In [878]:
# Create a correctly formatted DateTime field
import re
date_time = olympics["Date"].str.split("_").apply(pd.Series)

def find_numbers(x):
    pattern = re.compile("[0-9]")
    result = pattern.findall(x)
    result = "".join(result)
    return result

date_time[0] = date_time[0].map(find_numbers)
# Rebuild the date as day/month/year from the three split parts
date_time = date_time[0] + "/" + date_time[1] + "/" + date_time[2]
date_time

0       21/July/2021
1       21/July/2021
2       21/July/2021
3       21/July/2021
4       22/July/2021
           ...      
704    8/August/2021
705    8/August/2021
706    8/August/2021
707    8/August/2021
708    8/August/2021
Length: 709, dtype: object
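The same clean-up can be done without a helper function. The sketch below is an alternative rather than the approach used in this notebook: it assumes every date looks like `21st_July_2021` and every time looks like `8:30`, strips the ordinal suffix with a regular expression, and parses date and time together. The sample frame and the `DateTime` column name are illustrative only.

```python
import pandas as pd

# Small illustrative sample in the same shape as the Date/Time columns above.
sample = pd.DataFrame({
    "Date": ["21st_July_2021", "22nd_July_2021", "8th_August_2021"],
    "Time": ["1:00", "8:30", "5:40"],
})

# Drop the ordinal suffix (st, nd, rd, th) only when it directly follows a digit,
# so the "st" inside "August" is left alone.
clean_date = sample["Date"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# Parse date and time in one step with an explicit format string.
sample["DateTime"] = pd.to_datetime(
    clean_date + " " + sample["Time"], format="%d_%B_%Y %H:%M"
)
print(sample["DateTime"])
```

If some rows hold a time that is not in `H:MM` form, passing `errors="coerce"` to `pd.to_datetime` would turn those rows into `NaT` instead of raising.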

In [879]:
olympics["Date"] = pd.to_datetime(date_time)
olympics

Unnamed: 0,Date,Time,Sport,Venue,Events
0,2021-07-21,1:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,"Australia vs Japan, Italy vs United States, Me..."
1,2021-07-21,8:30,Football,Sapporo Dome,"Women's Group E: Great Britain vs Chile, Women..."
2,2021-07-21,9:00,Football,Miyagi Stadium,"Women's Group F: China vs Brazil, Women's Grou..."
3,2021-07-21,9:30,Football,Tokyo Stadium,"Women's Group G: Sweden vs United States, Wome..."
4,2021-07-22,1:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,"United States vs Canada, Mexico vs Japan, Ital..."
...,...,...,...,...,...
704,2021-08-08,5:40,Water Polo,Tatsumi Water Polo Centre,Men's Bronze Medal Match
705,2021-08-08,6:00,Boxing.,Kokugikan Arena,"Women's Light (57-60kg) Final, Men's Light (57..."
706,2021-08-08,7:00,Handball,Yoyogi National Stadium,Women's Gold Medal Match
707,2021-08-08,8:30,Water Polo,Tatsumi Water Polo Centre,Men's Gold Medal Match


In [880]:
# Parse the event list so each event is on a separate row
events = olympics["Events"].str.split(",").apply(pd.Series)
events = pd.concat([olympics, events], axis=1).melt(id_vars=["Date", "Time", "Sport", "Venue", "Events"], value_name="Events Split").drop(["variable", "Events"], axis=1)
events = events.dropna()
events.shape

(1895, 5)
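For comparison, splitting a delimited column into one row per item can also be written with `Series.explode`, which avoids the wide intermediate frame and the `melt`. This is a minimal sketch on an illustrative two-row sample, not the notebook's data; the extra `str.strip()` is an addition that removes any space left after each comma.

```python
import pandas as pd

# Illustrative sample: one row with two comma-separated events, one row with a single event.
sample = pd.DataFrame({
    "Sport": ["Cycling Road", "Water Polo"],
    "Events": [
        "Men's Individual Time Trial, Men's Individual Time Trial Victory Ceremony",
        "Men's Bronze Medal Match",
    ],
})

split_events = (
    sample.assign(**{"Events Split": sample["Events"].str.split(",")})
    .explode("Events Split")   # one row per list element
    .drop(columns="Events")
    .reset_index(drop=True)
)

# Remove the whitespace left behind after each comma.
split_events["Events Split"] = split_events["Events Split"].str.strip()
print(split_events)
```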

In [881]:
# Group similar sports into a Sport Type field
events["Sport"].value_counts(dropna=False).index.sort_values()

Index(['3x3 Basketball', 'Archery', 'Artistic Gymnastic',
       'Artistic Gymnastics', 'Artistic Swimming', 'Athletics', 'Badminton',
       'Baseball', 'Baseball/Softball', 'Basketball', 'Beach Volley',
       'Beach Volleybal', 'Beach Volleyball', 'Beach volleyball', 'Boxing',
       'Boxing.', 'Canoe Slalom', 'Canoe Sprint', 'Closing Ceremony',
       'Cycling BMX Freestyle', 'Cycling BMX Racing', 'Cycling Mountain Bike',
       'Cycling Road', 'Cycling Track', 'Diving', 'Equestrian', 'Fencing',
       'Football', 'Golf', 'Handball', 'Hockey', 'Judo', 'Karate',
       'Marathon Swimming', 'Modern Pentathlon', 'Opening Ceremony',
       'Rhythmic Gymnastics', 'Rowing', 'Rugby', 'Rugby.', 'Sailing',
       'Shooting', 'Skateboarding', 'Skateboarding.', 'Softball',
       'Softball/Baseball', 'Sport Climbing', 'Surfing', 'Swimming',
       'Table Tennis', 'Taekwondo', 'Tennis', 'Trampoline Gymnastics',
       'Triathlon', 'Volleyball', 'Water Polo', 'Weightlifting', 'Wrestling',
     

In [882]:
sport_groups_list = pd.read_excel("./output/Olympic Event Schedule.xlsx")["Sport Group"].drop_duplicates()
sport_groups_list = sport_groups_list.sort_values()
sport_groups_list.values

array(['Archery', 'Athletics', 'Badminton', 'Baseball', 'Basketball',
       'Boxing', 'Canoeing', 'Ceremony', 'Cycling', 'Diving',
       'Equestrian', 'Fencing', 'Football', 'Golf', 'Gymnastics',
       'Handball', 'Hockey', 'Martial Arts', 'Modern Pentathlon',
       'Rhythmic Gymnastics', 'Rowing', 'Rugby', 'Sailing', 'Shooting',
       'Skateboarding', 'Sport Climbing', 'Surfing', 'Swimming', 'Tennis',
       'Triathlon', 'Volleyball', 'Water Polo', 'Weightlifting',
       'Wrestling', 'diving', 'football'], dtype=object)

In [883]:
events["Sport Group"] = events["Sport"].map({"3x3 Basketball": "Basketball",
                                             "Artistic Gymnastic": "Gymnastics",
                                             "Artistic Gymnastics": "Gymnastics",
                                             "Baseball/Softball": "Baseball",
                                             "Beach Volley": "Volleyball",
                                             "Beach Volleybal": "Volleyball",
                                             "Boxing.": "Boxing",
                                             "Canoe Slalom": "Canoeing",
                                             "Canoe Sprint": "Canoeing",
                                             "Closing Ceremony": "Ceremony",
                                             "Cycling BMX Freestyle": "Cycling",
                                             "Cycling Mountain Bike": "Cycling",
                                             "Cycling Road": "Cycling",
                                             "Cycling Track": "Cycling",
                                             "Cycling BMX Racing": "Cycling",
                                             "Judo": "Martial Arts",
                                             "Karate": "Martial Arts",
                                             "Marathon Swimming": "Swimming",
                                             "Opening Ceremony": "Ceremony",
                                             "Rugby.": "Rugby",
                                             "Skateboarding.": "Skateboarding",
                                             "Softball": "Baseball",
                                             "Softball/Baseball": "Baseball",
                                             "Table Tennis": "Tennis",
                                             "Taekwondo": "Martial Arts",
                                             "Trampoline Gymnastics": "Gymnastics",
                                             "Wrestling.": "Wrestling",
                                             "boxing": "Boxing",
                                             "rugby": "Rugby",
                                             "volleyball": "Volleyball",
                                             "Artistic Swimming": "Swimming",
                                             "Beach Volleyball": "Volleyball",
                                             "Beach volleyball": "Volleyball"
                     })

In [884]:
sport_origin_idx = events[events["Sport Group"].isna()].index

In [885]:
sport_origin_series = events.loc[sport_origin_idx, "Sport"]

In [886]:
events.loc[sport_origin_idx, "Sport Group"] = sport_origin_series

In [887]:
events["Sport Group"].value_counts(dropna=False).index.sort_values() == sport_groups_list.values

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True])
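The map-then-backfill steps above can be condensed. The sketch below, on a small illustrative series, first normalises obvious typos (stray whitespace and trailing full stops) so they need no dictionary entry, then uses `map(...).fillna(...)` so that any sport without a mapping keeps its own name. `group_map` is a hypothetical subset of the full mapping used in this notebook.

```python
import pandas as pd

sports = pd.Series(["Boxing.", "Beach volleyball", "Canoe Slalom", "Golf", "Rugby."])

# Normalise obvious typos first: surrounding whitespace and trailing full stops.
normalised = sports.str.strip().str.rstrip(".")

# Only genuine regroupings need an explicit entry (a subset, for illustration).
group_map = {
    "Beach volleyball": "Volleyball",
    "Canoe Slalom": "Canoeing",
}

# Sports missing from the mapping fall back to their normalised name.
sport_group = normalised.map(group_map).fillna(normalised)
print(sport_group.tolist())
```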

In [888]:
# Combine the Venue table
events["Venue"] = events["Venue"].map(lambda x: x.strip())

venues = venues[["Venue", "Location"]]
venues = venues.drop_duplicates(subset="Venue").reset_index(drop=True)
venues["Venue"] = venues["Venue"].map(lambda x: x.strip())

events = events.merge(venues, how="left", on="Venue")

In [889]:
events[events["Location"].isna()]

Unnamed: 0,Date,Time,Sport,Venue,Events Split,Sport Group,Location
31,2021-07-24,3:00,Cycling Road,Fuji international Speedway,Men's Road Race,Cycling,
91,2021-07-25,5:00,Cycling Road,Fuji international Speedway,Women's Road Race,Cycling,
265,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Women's Individual Time Trial,Cycling,
900,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Women's Individual Time Trial Victory Ceremony,Cycling,
1213,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Men's Individual Time Trial,Cycling,
1396,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Men's Individual Time Trial Victory Ceremony,Cycling,


In [890]:
fuji_speedway = venues.loc[venues["Venue"] == "Fuji International Speedway", "Location"]
fuji_speedway.values[0]

'35.372702911045124, 138.92936859906928'

In [891]:
events.loc[events["Location"].isna(), "Location"] = fuji_speedway.values[0]
events[events["Venue"] == "Fuji international Speedway"]

Unnamed: 0,Date,Time,Sport,Venue,Events Split,Sport Group,Location
31,2021-07-24,3:00,Cycling Road,Fuji international Speedway,Men's Road Race,Cycling,"35.372702911045124, 138.92936859906928"
91,2021-07-25,5:00,Cycling Road,Fuji international Speedway,Women's Road Race,Cycling,"35.372702911045124, 138.92936859906928"
265,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Women's Individual Time Trial,Cycling,"35.372702911045124, 138.92936859906928"
900,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Women's Individual Time Trial Victory Ceremony,Cycling,"35.372702911045124, 138.92936859906928"
1213,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Men's Individual Time Trial,Cycling,"35.372702911045124, 138.92936859906928"
1396,2021-07-28,3:30,Cycling Road,Fuji international Speedway,Men's Individual Time Trial Victory Ceremony,Cycling,"35.372702911045124, 138.92936859906928"
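The unmatched rows come from a capitalisation difference in the venue name, so one way to avoid patching them afterwards is to join on a normalised key. The following is a sketch under that assumption, using small illustrative frames; `venue_key` is a hypothetical helper column, and `indicator=True` adds a `_merge` column that makes any remaining mismatches easy to list.

```python
import pandas as pd

events_sample = pd.DataFrame({
    "Venue": ["Fuji international Speedway", "Sapporo Dome "],
    "Events Split": ["Men's Road Race", "Women's Group E: Great Britain vs Chile"],
})
venues_sample = pd.DataFrame({
    "Venue": ["Fuji International Speedway", "Sapporo Dome"],
    "Location": [
        "35.372702911045124, 138.92936859906928",
        "43.01517544330762, 141.41041300340524",
    ],
})

# Hypothetical helper column: a trimmed, lower-cased copy of the venue name.
events_sample["venue_key"] = events_sample["Venue"].str.strip().str.lower()
venues_sample["venue_key"] = venues_sample["Venue"].str.strip().str.lower()

merged = events_sample.merge(
    venues_sample[["venue_key", "Location"]].drop_duplicates("venue_key"),
    how="left",
    on="venue_key",
    indicator=True,  # adds a _merge column showing which rows found a match
).drop(columns="venue_key")

# Rows that found no venue would be flagged as "left_only".
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

The original spelling in `Venue` is preserved, so the output still reflects the source data.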


In [892]:
# Calculate whether the event is a 'Victory Ceremony' or 'Gold Medal' event.
events.head()

Unnamed: 0,Date,Time,Sport,Venue,Events Split,Sport Group,Location
0,2021-07-21,1:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,Australia vs Japan,Baseball,"37.72216480340486, 140.3640114979229"
1,2021-07-21,8:30,Football,Sapporo Dome,Women's Group E: Great Britain vs Chile,Football,"43.01517544330762, 141.41041300340524"
2,2021-07-21,9:00,Football,Miyagi Stadium,Women's Group F: China vs Brazil,Football,"38.33557331725407, 140.95096377309127"
3,2021-07-21,9:30,Football,Tokyo Stadium,Women's Group G: Sweden vs United States,Football,"35.66446761779039, 139.52756286092847"
4,2021-07-22,1:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,United States vs Canada,Baseball,"37.72216480340486, 140.3640114979229"


In [893]:
check = pd.read_excel("./output/Olympic Event Schedule.xlsx")
check[check["Medal Ceremony?"] == True]["Events Split"]

19                       Men's 10000m Victory Ceremony
30                 Men's Discus Throw Victory Ceremony
33                       Women's 100m Victory Ceremony
38                   Women's Shot Put Victory Ceremony
42                    Men's High Jump Victory Ceremony
                             ...                      
1802                     Men's Keirin Victory Ceremony
1817    Women's Individual Time Trial Victory Ceremony
1819      Men's Individual Time Trial Victory Ceremony
1835                          Softball Gold Medal Game
1852                    Baseball Gold Medal Game (#10)
Name: Events Split, Length: 145, dtype: object

In [894]:
events[events["Events Split"] == "Men's 10000m Victory Ceremony"]

Unnamed: 0,Date,Time,Sport,Venue,Events Split,Sport Group,Location
442,2021-07-31,11:00,Athletics,Olympic Stadium,Men's 10000m Victory Ceremony,Athletics,"35.67786383266573, 139.71366292613558"


In [895]:
pattern = re.compile("Victory Ceremony")

pattern.search(events["Events Split"].iloc[442]).group()

'Victory Ceremony'

In [896]:
def check_event(x):
    vic = re.compile("Victory Ceremony")
    gold = re.compile("Gold Medal|Gold medal")
    if vic.search(x):
        return True
    elif gold.search(x):
        return True
    else:
        return False

events["Medal Ceremony?"] = events["Events Split"].map(lambda x: check_event(x))

In [897]:
events[events["Medal Ceremony?"] == True].shape[0]

145

In [898]:
check[check["Medal Ceremony?"] == True].shape[0]

145
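The same flag can be computed without a Python-level function by using the vectorised `Series.str.contains`. This is a sketch on illustrative event names. Note that `case=False` is slightly broader than `check_event` above, because it ignores case for both phrases rather than only accepting "Gold Medal" and "Gold medal".

```python
import pandas as pd

# Illustrative event names; the third one is made up to show the case-insensitive match.
names = pd.Series([
    "Men's 10000m Victory Ceremony",
    "Softball Gold Medal Game",
    "Women's Gold medal Match",
    "Australia vs Japan",
])

# True where the name mentions either phrase, regardless of case.
is_medal = names.str.contains(r"Victory Ceremony|Gold Medal", case=False, regex=True)
print(is_medal.tolist())
```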

In [745]:
# Output the Data
time = pd.to_datetime(events["Time"], format="%H:%M", errors="coerce")
events["Time"] = time.map(lambda x: x.time() if not pd.isnull(x) else "")
events.head()

Unnamed: 0,Date,Time,Sport,Venue,Events Split,Sport Group,Location,Medal Ceremony?
0,2021-07-21,01:00:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,Australia vs Japan,Baseball,"37.72216480340486, 140.3640114979229",False
1,2021-07-21,08:30:00,Football,Sapporo Dome,Women's Group E: Great Britain vs Chile,Football,"43.01517544330762, 141.41041300340524",False
2,2021-07-21,09:00:00,Football,Miyagi Stadium,Women's Group F: China vs Brazil,Football,"38.33557331725407, 140.95096377309127",False
3,2021-07-21,09:30:00,Football,Tokyo Stadium,Women's Group G: Sweden vs United States,Football,"35.66446761779039, 139.52756286092847",False
4,2021-07-22,01:00:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,United States vs Canada,Baseball,"37.72216480340486, 140.3640114979229",False


In [746]:
events["Latitude"] = events["Location"].map(lambda x: x.split(",")[0])
events["Longitude"] = events["Location"].map(lambda x: x.split(",")[1])
events.head()

Unnamed: 0,Date,Time,Sport,Venue,Events Split,Sport Group,Location,Medal Ceremony?,Latitude,Longitude
0,2021-07-21,01:00:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,Australia vs Japan,Baseball,"37.72216480340486, 140.3640114979229",False,37.72216480340486,140.3640114979229
1,2021-07-21,08:30:00,Football,Sapporo Dome,Women's Group E: Great Britain vs Chile,Football,"43.01517544330762, 141.41041300340524",False,43.01517544330762,141.41041300340524
2,2021-07-21,09:00:00,Football,Miyagi Stadium,Women's Group F: China vs Brazil,Football,"38.33557331725407, 140.95096377309127",False,38.33557331725407,140.95096377309127
3,2021-07-21,09:30:00,Football,Tokyo Stadium,Women's Group G: Sweden vs United States,Football,"35.66446761779039, 139.52756286092847",False,35.66446761779039,139.52756286092847
4,2021-07-22,01:00:00,Baseball/Softball,Fukushima Azuma Baseball Stadium,United States vs Canada,Baseball,"37.72216480340486, 140.3640114979229",False,37.72216480340486,140.3640114979229
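The two `split` calls above leave both columns as strings, and the longitude keeps the space that followed the comma. If numeric coordinates are wanted for mapping, a sketch along these lines would convert them; the sample values are taken from the rows shown above.

```python
import pandas as pd

locations = pd.Series([
    "37.72216480340486, 140.3640114979229",
    "43.01517544330762, 141.41041300340524",
])

# Split once into two columns, then trim and convert each part to a float.
parts = locations.str.split(",", expand=True)
coords = pd.DataFrame({
    "Latitude": pd.to_numeric(parts[0].str.strip()),
    "Longitude": pd.to_numeric(parts[1].str.strip()),
})
print(coords.dtypes)
```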


In [747]:
events = events.loc[:, ["Latitude", "Longitude", "Medal Ceremony?", "Sport Group", 
                        "Events Split", "Time", "Date", "Sport", "Venue"]]
events.head()

Unnamed: 0,Latitude,Longitude,Medal Ceremony?,Sport Group,Events Split,Time,Date,Sport,Venue
0,37.72216480340486,140.3640114979229,False,Baseball,Australia vs Japan,01:00:00,2021-07-21,Baseball/Softball,Fukushima Azuma Baseball Stadium
1,43.01517544330762,141.41041300340524,False,Football,Women's Group E: Great Britain vs Chile,08:30:00,2021-07-21,Football,Sapporo Dome
2,38.33557331725407,140.95096377309127,False,Football,Women's Group F: China vs Brazil,09:00:00,2021-07-21,Football,Miyagi Stadium
3,35.66446761779039,139.52756286092847,False,Football,Women's Group G: Sweden vs United States,09:30:00,2021-07-21,Football,Tokyo Stadium
4,37.72216480340486,140.3640114979229,False,Baseball,United States vs Canada,01:00:00,2021-07-22,Baseball/Softball,Fukushima Azuma Baseball Stadium


In [800]:
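# "." is a directory rather than a file, so to_csv needs a file path.
# The file name below is a placeholder, not the name required by the challenge.
events.to_csv("./output/olympic_events.csv", index=False)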

In [782]:
import pytz
import datetime as dt

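# Build a timezone-aware datetime for the first event. pytz.timezone("GMT") replaces
# the invalid datetime.tzinfo("GMT") call, which raised
# "TypeError: 'getset_descriptor' object is not callable".
dt.datetime(year=events["Date"][0].year, month=events["Date"][0].month, day=events["Date"][0].day,
            hour=events["Time"][0].hour, minute=events["Time"][0].minute, tzinfo=pytz.timezone("GMT"))

events["Date"][0].year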

In [762]:
events["Time"][0].minute

0
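Building timezone-aware values one row at a time, as attempted above, can instead be done for the whole column with pandas. The sketch below is illustrative: it assumes, as the GMT attempt above suggests, that the schedule times are in GMT (treated here as UTC), and the conversion to `Asia/Tokyo` is only an example of a target zone.

```python
import datetime as dt
import pandas as pd

# Illustrative sample with the same types as the Date and Time columns at this point.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-07-21", "2021-07-22"]),
    "Time": [dt.time(1, 0), dt.time(8, 30)],
})

# Turn each time-of-day into an offset from midnight; blank strings become NaT.
offset = pd.to_timedelta(sample["Time"].astype(str), errors="coerce")

# Add the offset to the date, then attach the assumed source time zone.
start = (sample["Date"] + offset).dt.tz_localize("UTC")

# Convert to local time at the venues (example target zone).
tokyo = start.dt.tz_convert("Asia/Tokyo")
print(tokyo)
```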

In [772]:
for tz in pytz.all_timezones:
    print(tz)

Africa/Abidjan
Africa/Accra
Africa/Addis_Ababa
Africa/Algiers
Africa/Asmara
Africa/Asmera
Africa/Bamako
Africa/Bangui
Africa/Banjul
Africa/Bissau
Africa/Blantyre
Africa/Brazzaville
Africa/Bujumbura
Africa/Cairo
Africa/Casablanca
Africa/Ceuta
Africa/Conakry
Africa/Dakar
Africa/Dar_es_Salaam
Africa/Djibouti
Africa/Douala
Africa/El_Aaiun
Africa/Freetown
Africa/Gaborone
Africa/Harare
Africa/Johannesburg
Africa/Juba
Africa/Kampala
Africa/Khartoum
Africa/Kigali
Africa/Kinshasa
Africa/Lagos
Africa/Libreville
Africa/Lome
Africa/Luanda
Africa/Lubumbashi
Africa/Lusaka
Africa/Malabo
Africa/Maputo
Africa/Maseru
Africa/Mbabane
Africa/Mogadishu
Africa/Monrovia
Africa/Nairobi
Africa/Ndjamena
Africa/Niamey
Africa/Nouakchott
Africa/Ouagadougou
Africa/Porto-Novo
Africa/Sao_Tome
Africa/Timbuktu
Africa/Tripoli
Africa/Tunis
Africa/Windhoek
America/Adak
America/Anchorage
America/Anguilla
America/Antigua
America/Araguaina
America/Argentina/Buenos_Aires
America/Argentina/Catamarca
America/Argentina/ComodRivad