<a href="https://www.kaggle.com/code/kartikpradyumna92/data-story-telling-on-paris-2024-olympics?scriptVersionId=212174387" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <p style="text-align: center;">Data Storytelling on Paris 2024 Olympics</p>
Objective: Understand the data around the Paris 2024 Olympics and visualize interesting findings.

Author: Karteek Pradyumna Bulusu

In [1]:
import numpy as np
import pandas as pd
from datetime import date, datetime
import warnings
import statistics
warnings.filterwarnings('ignore')
import os

In [2]:
# Uncomment below to print all datasets that are part of the Paris Olympics repo.

# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

In [3]:
'''Plotly importing and settings'''

# Initialize notebook mode
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Possible renderer values include "colab", "notebook" and "png"
pio.renderers.default = "colab"

In [4]:
!pip install plotly

Looking in links: /kaggle/input/pm-65775380-at-12-09-2024-21-22-23/


In [5]:
!pip install -U kaleido

Looking in links: /kaggle/input/pm-65775380-at-12-09-2024-21-22-23/


In [6]:
# Install dependency package - kaleido
# pip install -U kaleido

# SCHEDULE

In [7]:
schedules_df = pd.read_csv('/kaggle/input/paris-2024-olympic-summer-games/schedules.csv')
print(f"Shape of Schedule data - {schedules_df.shape}")

Shape of Schedule data - (3895, 16)


In [8]:
finished_per_day = schedules_df[['day', 'status']].loc[schedules_df['status'] == 'FINISHED'].groupby(by='day').count()
finished_per_day['day'] = finished_per_day.index
finished_per_day.columns = ['Total_Events', 'Event_Date']
finished_per_day["Total_Event_till_date"] = finished_per_day["Total_Events"].cumsum()

# Schedule for Basketball
basketball_events_per_day = schedules_df[['day', 'status']].loc[(schedules_df['status'] == 'FINISHED') & (schedules_df['discipline'] == 'Basketball')].groupby(by='day').count()
basketball_events_per_day['day'] = basketball_events_per_day.index
basketball_events_per_day.columns = ['Total_Events', 'Event_Date']
basketball_events_per_day["Total_Event_till_date"] = basketball_events_per_day["Total_Events"].cumsum()

In [9]:
# Subplot around Schedule
# Creating subplot of 2 rows and 1 column
schedule_fig = make_subplots(rows=2, cols=1, vertical_spacing=0.3)

# Time series line chart to show events tally each day and cumsum of it.
all_events_fig = px.line(finished_per_day, x=finished_per_day["Event_Date"], y=["Total_Events", "Total_Event_till_date" ], title='Total events Per date')

# Time series line chart to show Basketball events tally each day and cumsum of it.
basketball_events_fig = px.line(basketball_events_per_day, x=basketball_events_per_day["Event_Date"], y=["Total_Events", "Total_Event_till_date" ], title='Total events Per date')

for trace in all_events_fig.data:
    schedule_fig.add_trace(trace, row=1, col=1)

for trace in basketball_events_fig.data:
    schedule_fig.add_trace(trace.update(showlegend=False), row=2, col=1)
    

schedule_fig.update_layout(title="Data story on Paris Olympics 2024 schedule", showlegend=True)
# Add x-axis and y-axis titles
schedule_fig.update_xaxes(title_text="Event Date", row=1, col=1)
schedule_fig.update_yaxes(title_text="Count of events", row=1, col=1)
schedule_fig.update_xaxes(title_text="Event Date", row=2, col=1)
schedule_fig.update_yaxes(title_text="Count of Basketball events", row=2, col=1)

# schedule_fig.show()
pio.show(schedule_fig, renderer='iframe')

Events seem to be well spread out, with more events at the beginning and a gradual decline afterwards. This is understandable, since the early rounds before the quarter-finals involve more matches.<br>
The trend looks very similar for the overall schedule and the Basketball schedule.
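
One way to check the "similar trend" impression, offered as a sketch and not as part of the original analysis, is to convert each daily count into a share of its own total so that both curves sit on the same scale. The toy frame below stands in for `schedules_df`; its rows are invented, and `daily_share` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for schedules_df; the values are invented for illustration.
toy_schedule = pd.DataFrame({
    "day": ["2024-07-27", "2024-07-27", "2024-07-27", "2024-07-28",
            "2024-07-28", "2024-07-29", "2024-07-27", "2024-07-28"],
    "status": ["FINISHED"] * 8,
    "discipline": ["Basketball", "Judo", "Judo", "Basketball",
                   "Swimming", "Swimming", "Basketball", "Judo"],
})


def daily_share(df, discipline=None):
    """Return each day's share of finished events (hypothetical helper)."""
    finished = df[df["status"] == "FINISHED"]
    if discipline is not None:
        finished = finished[finished["discipline"] == discipline]
    counts = finished.groupby("day").size()
    return counts / counts.sum()


# Align both series on the day index; days without Basketball get a share of 0.
comparison = pd.DataFrame({
    "all_events": daily_share(toy_schedule),
    "basketball": daily_share(toy_schedule, "Basketball"),
}).fillna(0.0)
print(comparison)
```

Plotting the two columns of `comparison` on one axis would then show whether the shapes really track each other, independent of the difference in absolute counts.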

# EVENTS

In [10]:
events_df = pd.read_csv('/kaggle/input/paris-2024-olympic-summer-games/events.csv')
print(f"Shape of events data - {events_df.shape}")

Shape of events data - (329, 5)


In [11]:
events_per_sport = events_df[['sport','event']].groupby(by='sport').count()
events_per_sport = events_per_sport.sort_values(by='event', ascending=False)
events_per_sport['sport'] = events_per_sport.index
events_per_sport.columns = ['Event', 'Sport']

In [12]:
fig = px.histogram(events_per_sport, x="Sport", y="Event", title="Total events for each Sport/Discipline")
fig.update_layout(bargap=0.1)
fig.update_xaxes(title_text="Sport/Discipline")
fig.update_yaxes(title_text="Total Events")
# fig.show()
pio.show(fig, renderer='iframe')

Athletics combines track and field events, so its event count is understandably higher than that of other sports. Swimming also has many events, such as freestyle over several distances, backstroke, etc., resulting in a higher total event count.

# ATHLETES

In [13]:
athletes_df = pd.read_csv('/kaggle/input/paris-2024-olympic-summer-games/athletes.csv')

In [14]:
'''
Filter on current = True
'''
athletes_df = athletes_df.loc[athletes_df['current'] == True]
print(f"Shape of Athletes data - {athletes_df.shape}")
print(f"Total athletes part of Paris 2024 Olympics - {len(athletes_df.loc[athletes_df['current'] == True].code.unique())}")

Shape of Athletes data - (11110, 36)
Total athletes part of Paris 2024 Olympics - 11110


#### Identifying age from birth date

In [15]:
def calculate_age(dob):
    today = date.today()
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

In [16]:
age_list = []
for index, row in athletes_df.iterrows():
    dob = datetime.strptime(row['birth_date'], '%Y-%m-%d').date()
    age = calculate_age(dob)
    age_list.append(age)
athletes_df['age'] = age_list
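
`calculate_age` measures age against `date.today()`, so the result changes with the date the notebook is run. A possible alternative, sketched here under the assumption that age at the opening ceremony (26 July 2024) is the quantity of interest, is to fix the reference date and compute all ages in one vectorized step. `GAMES_START` and `age_on` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Reference date: the Paris 2024 opening ceremony (26 July 2024).
GAMES_START = pd.Timestamp("2024-07-26")


def age_on(birth_dates, reference=GAMES_START):
    """Completed years of age on `reference` (hypothetical helper)."""
    dob = pd.to_datetime(birth_dates, format="%Y-%m-%d")
    # True when the birthday has already occurred in the reference year.
    had_birthday = (dob.dt.month < reference.month) | (
        (dob.dt.month == reference.month) & (dob.dt.day <= reference.day)
    )
    # Subtract one year for athletes whose birthday is still to come.
    return reference.year - dob.dt.year - (~had_birthday).astype(int)


# Birth dates copied from athlete records printed in this notebook.
sample = pd.Series(["2012-08-11", "1954-12-01", "1959-05-11"])
print(age_on(sample).tolist())
```

With this reference date the three sample athletes come out at 11, 69 and 65, so the first two are one year younger than the run-date ages reported in this notebook.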

## Data Story around age of Athletes

In [17]:
fig = px.violin(athletes_df, y=athletes_df["age"], box=True, # draw box plot inside the violin
                points='all', # can be 'outliers', or False)
                title = 'Violin chart of Athletes age'
               )
# fig.show()
pio.show(fig, renderer='iframe')

* The violin plot shows that the youngest athlete at the Paris 2024 Olympics is 12 and the oldest is 70. Ages are computed by `calculate_age` relative to the date the notebook was run, not the date of the Games. <br>
* Zheng Haohao is the youngest Olympian this time and the youngest ever from the People’s Republic of China. <br>
* Mary Hanna is the oldest Olympian this time; she competes in equestrian. <br>

More details on these two athletes follow.

In [18]:
print(f"minimum age - {min(athletes_df['age'])}, maximum age - {max(athletes_df['age'])}")
print(f"Median age of players is {statistics.median(athletes_df['age'])} years")
print(f"Average age of players is {round(statistics.mean(athletes_df['age']))} years")
print(f"Most common age of players is {statistics.mode(athletes_df['age'])} years")

minimum age - 12, maximum age - 70
Median age of players is 26.0 years
Average age of players is 27 years
Most common age of players is 25 years


In [19]:
youngest_player = athletes_df.loc[athletes_df['age'] == 12]
print(f"Youngest player details\n")
youngest_player[['name', 'gender', 'birth_date', 'age', 'country_code', 'country', 'country_long', 'lang','disciplines', 'events', 'occupation', 'family', 'hobbies', 'reason']].iloc[0].to_dict()


Youngest player details



{'name': 'ZHENG Haohao',
 'gender': 'Female',
 'birth_date': '2012-08-11',
 'age': 12,
 'country_code': 'CHN',
 'country': 'China',
 'country_long': "People's Republic of China",
 'lang': 'Mandarin',
 'disciplines': "['Skateboarding']",
 'events': '["Women\'s Park"]',
 'occupation': 'Student, athlete',
 'family': 'Mother, Wang Zhe',
 'hobbies': 'Painting',
 'reason': '"Somebody told me skateboarding was fun and I bought one. It is fun indeed." (chinadaily.com.cn, 28 Jun 2024; newsgd.com, 28 Jun 2024)'}

In [20]:
oldest_player = athletes_df.loc[athletes_df['age'] == 70]
oldest_player[['name', 'gender', 'birth_date', 'age', 'country_code', 'country', 'country_long', 'lang','disciplines', 'events', 'occupation', 'family', 'hobbies', 'reason']].iloc[0].to_dict()

{'name': 'HANNA Mary',
 'gender': 'Female',
 'birth_date': '1954-12-01',
 'age': 70,
 'country_code': 'AUS',
 'country': 'Australia',
 'country_long': 'Australia',
 'lang': 'English',
 'disciplines': "['Equestrian']",
 'events': "['Dressage Team']",
 'occupation': 'Athlete, breeder, coach, horse trainer',
 'family': 'Husband, Rob. Four children',
 'hobbies': nan,
 'reason': 'Her family has always been involved with horses. Initially, she was a jumping and event rider before switching to dressage in her 20s. "Ours was a very horsey family, it was compulsory. If you didn\'t ride you didn\'t get fed, practically. We rode to do the stock work, it was just part of life. I didn\'t really get into dressage properly until I married my first husband, who was Danish. He brought me over to Europe and introduced me to true competitive dressage." (myInfo)'}

In [21]:
# fig = px.violin(athletes_df, y=athletes_df["age"], x=athletes_df["gender"])
fig = px.violin(athletes_df, y=athletes_df["age"], color=athletes_df["gender"],
                violinmode='overlay', # draw violins on top of each other
                # default violinmode is 'group' as in example above
                hover_data=athletes_df.columns,
                title = 'Violin chart of Athletes age per Gender')
# fig.show()
pio.show(fig, renderer='iframe')

#### Female athletes show slightly more variance in age than male athletes, but the two distributions are otherwise very comparable.

In [22]:
print(f"Minimum age for Male is {min(athletes_df['age'].loc[athletes_df['gender'] == 'Male'])} and Female is {min(athletes_df['age'].loc[athletes_df['gender'] == 'Female'])}")
print(f"Maximum age for Male is {max(athletes_df['age'].loc[athletes_df['gender'] == 'Male'])} and Female is {max(athletes_df['age'].loc[athletes_df['gender'] == 'Female'])}")
print(f"Median age for Male is {statistics.median(athletes_df['age'].loc[athletes_df['gender'] == 'Male'])} and Female is {statistics.median(athletes_df['age'].loc[athletes_df['gender'] == 'Female'])}")
print(f"Average age for Male is {round(statistics.mean(athletes_df['age'].loc[athletes_df['gender'] == 'Male']))} and Female is {round(statistics.mean(athletes_df['age'].loc[athletes_df['gender'] == 'Female']))}")
print(f"Most common age for Male is {statistics.mode(athletes_df['age'].loc[athletes_df['gender'] == 'Male'])} and Female is {statistics.mode(athletes_df['age'].loc[athletes_df['gender'] == 'Female'])}")

Minimum age for Male is 14 and Female is 12
Maximum age for Male is 65 and Female is 70
Median age for Male is 27 and Female is 26
Average age for Male is 27 and Female is 27
Most common age for Male is 25 and Female is 25
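
The per-gender statistics above can also be gathered in a single table, with the standard deviation added as a direct measure of the spread discussed for the violin plot. This is a sketch on an invented toy frame standing in for `athletes_df`; only the column names `gender` and `age` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for athletes_df; ages are invented for illustration.
toy_athletes = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "age": [12, 26, 70, 14, 27, 65],
})

# One row per gender, one column per statistic.
age_summary = toy_athletes.groupby("gender")["age"].agg(
    ["min", "max", "median", "mean", "std"]
)
print(age_summary.round(2))
```

Comparing the `std` column gives a number to back the visual impression of which group is more spread out.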


In [23]:
hist_data = athletes_df['age']
arr_hist_data = [hist_data.to_numpy()]
group_labels = ['athletes']

In [24]:
group_labels = ['athletes']

fig = ff.create_distplot(arr_hist_data, group_labels)
fig.update_layout(
    title='Histogram distribution of Age',
    xaxis_title="Age",
    yaxis_title="Density"
)
# fig.show()
pio.show(fig, renderer='iframe')

The peak of the distribution above matches the most common (mode) age of 25 years.

In [25]:
athletes_country_df = athletes_df[['country_long', 'disciplines','code']].groupby(by=['country_long','disciplines']).count()

In [26]:
athletes_country_df = athletes_df[['country_long','code']].groupby(by=['country_long']).count()
athletes_country_df = athletes_country_df.sort_values(by='code', ascending=False)
athletes_country_df['country_long'] = athletes_country_df.index

In [27]:
top_ten_countries_by_playerCount_df = athletes_country_df.country_long.head(10).tolist()
top_ten_countries_by_playerCount_details = athletes_df.loc[athletes_df['country_long'].isin(top_ten_countries_by_playerCount_df)]

## Data Story around Country represented by athletes

In [28]:
# Subplot around Country
# Creating subplot of 1 row and 2 columns
country_fig = make_subplots(rows=1, cols=2)

# Histogram to show gender parity per Country
fig1 = px.histogram(top_ten_countries_by_playerCount_details, x="country", color="gender",nbins=5).update_xaxes(categoryorder='total descending')

# Violin Chart to show variance of age per Country
fig2 = px.violin(top_ten_countries_by_playerCount_details, y="age", x="country").update_xaxes(categoryorder='total descending')

for trace in fig1.data:
    country_fig.add_trace(trace, row=1, col=1)

for trace in fig2.data:
    country_fig.add_trace(trace, row=1, col=2)
    

country_fig.update_layout(title="Data story on countries with most athletes", showlegend=True)
# Add x-axis and y-axis titles
country_fig.update_xaxes(title_text="Gender per Country", row=1, col=1)
country_fig.update_yaxes(title_text="Count of athletes", row=1, col=1)
country_fig.update_xaxes(title_text="Country", row=1, col=2)
country_fig.update_yaxes(title_text="Violin Chart of Age", row=1, col=2)
    
# country_fig.show()
pio.show(country_fig, renderer='iframe')

#### Gender parity
* The chart above shows the gender split for the top countries (top in terms of the number of athletes representing the country). There are more female than male athletes from Canada, the USA, Australia and Great Britain, and China has almost twice as many female athletes as male athletes.
* Overall, there are more male athletes than female athletes. The numbers are displayed below.

#### Variance around Age
* Variance is low for China. The youngest athlete is the 12-year-old identified above, and the oldest is only 37.
* The large variance for Australia was expected, since the 70-year-old athlete identified earlier is Australian.
* Spain also shows a large variance, which is interesting. Details of its 65-year-old athlete are identified below.
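
The two observations above (gender split and age spread per country) can be put into numbers with a cross-tabulation and a grouped aggregation. The sketch below uses an invented toy frame in place of `top_ten_countries_by_playerCount_details`; the column names `country`, `gender` and `age` come from the notebook, while `female_share` and `range` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for top_ten_countries_by_playerCount_details; rows are invented.
toy = pd.DataFrame({
    "country": ["China", "China", "China", "Spain", "Spain", "Spain", "Spain"],
    "gender": ["Female", "Female", "Male", "Female", "Male", "Male", "Male"],
    "age": [12, 30, 37, 25, 28, 40, 65],
})

# Athlete counts by country and gender, plus the female share of each team.
gender_counts = pd.crosstab(toy["country"], toy["gender"])
gender_counts["female_share"] = gender_counts["Female"] / gender_counts.sum(axis=1)

# Age range per country, a simple proxy for the spread shown by the violins.
age_range = toy.groupby("country")["age"].agg(["min", "max"])
age_range["range"] = age_range["max"] - age_range["min"]

print(gender_counts)
print(age_range)
```

A `female_share` above 0.5 marks a team with more women than men, and a large `range` flags countries whose violin is stretched by a single very young or very old athlete.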



In [29]:
print("Details of 65 year old Athlete from Spain")
top_ten_countries_by_playerCount_details.loc[(top_ten_countries_by_playerCount_details['age'] == 65) & (top_ten_countries_by_playerCount_details['country'] == "Spain")][['name', 'gender', 'birth_date', 'age', 'country_code', 'country', 'country_long', 'lang','disciplines', 'events', 'occupation', 'family', 'hobbies', 'reason']].iloc[0].to_dict()

Details of 65 year old Athlete from Spain


{'name': 'JIMENEZ COBO Juan Antonio',
 'gender': 'Male',
 'birth_date': '1959-05-11',
 'age': 65,
 'country_code': 'ESP',
 'country': 'Spain',
 'country_long': 'Spain',
 'lang': 'English, Spanish',
 'disciplines': "['Equestrian']",
 'events': "['Dressage Individual', 'Dressage Team']",
 'occupation': 'Athlete, business owner, coach, horse trainer',
 'family': nan,
 'hobbies': nan,
 'reason': 'His father was a professional rider and a great help in his career. "My whole life is linked to horses." (juanantoniojimenez.com)'}

In [30]:
print(f"Female athletes count from Top ten countries - {len(top_ten_countries_by_playerCount_details.loc[top_ten_countries_by_playerCount_details['gender'] == 'Female'].code.unique())}")
print(f"Male athletes count from Top ten countries - {len(top_ten_countries_by_playerCount_details.loc[top_ten_countries_by_playerCount_details['gender'] == 'Male'].code.unique())}")

print(f"Female athletes count overall - {len(athletes_df.loc[athletes_df['gender'] == 'Female'].code.unique())}")
print(f"Male athletes count overall - {len(athletes_df.loc[athletes_df['gender'] == 'Male'].code.unique())}")

Female athletes count from Top ten countries - 2353
Male athletes count from Top ten countries - 2099
Female athletes count overall - 5455
Male athletes count overall - 5655


#### There are more female athletes coming from the top 10 countries (top in terms of total athlete count), but overall there are more male athletes.

## Data Story around Athletes participating in Disciplines representing the Country

In [31]:
athletes_discipline_country_hist = px.histogram(top_ten_countries_by_playerCount_details, x="country", color="disciplines", title='Data Story around # of Athletes per Discipline per Country').update_xaxes(categoryorder='total descending')
athletes_discipline_country_hist.update_xaxes(title_text="Country")
athletes_discipline_country_hist.update_yaxes(title_text="Count of Athletes")
# athletes_discipline_country_hist.show()
pio.show(athletes_discipline_country_hist, renderer='iframe')

Based on the chart, most athletes are part of the Athletics discipline, followed by Swimming, Rowing, Football, Water Polo and Volleyball. <br> *Note - the Athletics discipline covers a wide range of running, throwing and walking events in track and field*<br>
This is expected, since these are team sports or include team competitions such as relays in Swimming and Athletics. So we see more athletes in these disciplines than in Fencing or Canoe Slalom.
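
The `disciplines` column stores each value as a string such as `"['Athletics']"`, so an athlete entered in more than one discipline would form a separate combined category when grouping or colouring by that column. A minimal sketch of one way to handle this, assuming every value is a valid Python list literal, is to parse the strings and count one row per athlete and discipline. The toy frame and the multi-discipline row are invented; `discipline_list` is a hypothetical column name.

```python
import ast

import pandas as pd

# Toy stand-in for athletes_df; the multi-discipline row is invented.
toy = pd.DataFrame({
    "code": [1, 2, 3],
    "disciplines": [
        "['Athletics']",
        "['Swimming']",
        "['Swimming', 'Marathon Swimming']",
    ],
})

# Turn the string "['Athletics']" into the Python list ['Athletics'].
toy["discipline_list"] = toy["disciplines"].apply(ast.literal_eval)

# One row per (athlete, discipline) pair, then count athletes per discipline.
per_discipline = (
    toy.explode("discipline_list")
       .groupby("discipline_list")["code"]
       .nunique()
       .sort_values(ascending=False)
)
print(per_discipline)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for strings read from a CSV file.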

## Data Story around Disciplines

In [32]:
athletes_disciplines_df = athletes_df[['disciplines','code']].groupby(by=['disciplines']).count()
athletes_disciplines_df = athletes_disciplines_df.sort_values(by='code', ascending=False)
athletes_disciplines_df['disciplines'] = athletes_disciplines_df.index

In [33]:
top_ten_disciplines_by_playerCount_df = athletes_disciplines_df.disciplines.head(10).tolist()
top_ten_disciplines_by_playerCount_details = athletes_df.loc[athletes_df['disciplines'].isin(top_ten_disciplines_by_playerCount_df)]
top_ten_disciplines_by_playerCount_details.disciplines.unique()

array(["['Athletics']", "['Judo']", "['Swimming']", "['Sailing']",
       "['Rowing']", "['Shooting']", "['Football']", "['Hockey']",
       "['Rugby Sevens']", "['Handball']"], dtype=object)

In [34]:
# Subplot around Discipline
# Creating subplot of 1 row and 2 columns
discipline_fig = make_subplots(rows=1, cols=2)

# Histogram to show gender parity per Discipline
fig1 = px.histogram(top_ten_disciplines_by_playerCount_details, x="disciplines", color="gender",nbins=5).update_xaxes(categoryorder='total descending')

# Violin Chart to show variance of age per Discipline
fig2 = px.violin(top_ten_disciplines_by_playerCount_details, y="age", x="disciplines").update_xaxes(categoryorder='total descending')

for trace in fig1.data:
    discipline_fig.add_trace(trace, row=1, col=1)

for trace in fig2.data:
    discipline_fig.add_trace(trace, row=1, col=2)
    

discipline_fig.update_layout(title="Data story on top Disciplines with most athletes", showlegend=True)
# Add x-axis and y-axis titles
discipline_fig.update_xaxes(title_text="Gender per Discipline", row=1, col=1)
discipline_fig.update_yaxes(title_text="Count of athletes", row=1, col=1)
discipline_fig.update_xaxes(title_text="Discipline", row=1, col=2)
discipline_fig.update_yaxes(title_text="Violin Chart of Age", row=1, col=2)
    
# discipline_fig.show()
pio.show(discipline_fig, renderer='iframe')

#### Gender parity
* Gender- and team-specific events like Sailing and Shooting have exactly the same number of male and female athletes. The same was expected for Rugby Sevens, but it has one fewer female athlete.

#### Variance around Age
* Variance across disciplines is as expected, except for Shooting, where there is a 16-year-old athlete as well as a 61-year-old athlete, which is an interesting observation.
* I was curious about the 16-year-old shooter and the 47-year-old athlete in the Athletics discipline.


In [35]:
print("Details of 47 year old Athlete competing in Women's marathon")
top_ten_disciplines_by_playerCount_details.loc[(top_ten_disciplines_by_playerCount_details['age'] == 47) & (top_ten_disciplines_by_playerCount_details['disciplines'] == "['Athletics']")][['name', 'gender', 'birth_date', 'age', 'country_code', 'country', 'country_long', 'lang','disciplines', 'events', 'occupation', 'family', 'hobbies', 'reason']].iloc[0].to_dict()

Details of 47 year old Athlete competing in Women's marathon


{'name': 'DIVER Sinead',
 'gender': 'Female',
 'birth_date': '1977-02-17',
 'age': 47,
 'country_code': 'AUS',
 'country': 'Australia',
 'country_long': 'Australia',
 'lang': 'English, French, Irish',
 'disciplines': "['Athletics']",
 'events': '["Women\'s Marathon"]',
 'occupation': 'Senior analyst engineer in information technology',
 'family': 'Husband, Colin. Sons, Eddie and Dara',
 'hobbies': nan,
 'reason': '"I was always better over the longer distances, so it made sense to try it out. After my first (marathon), I knew that was the distance for me." (smh.com.au, 8 Jan 2023)'}

In [36]:
print("Details of 16 year old athlete participating in Shooting")
top_ten_disciplines_by_playerCount_details.loc[(top_ten_disciplines_by_playerCount_details['age'] == 16) & (top_ten_disciplines_by_playerCount_details['disciplines'] == "['Shooting']")][['name', 'gender', 'birth_date', 'age', 'country_code', 'country', 'country_long', 'lang','disciplines', 'events', 'occupation', 'family', 'hobbies', 'reason']].iloc[0].to_dict()

Details of 16 year old athlete participating in Shooting


{'name': 'BEYRANVAND Mohammad',
 'gender': 'Male',
 'birth_date': '2008-08-25',
 'age': 16,
 'country_code': 'IRI',
 'country': 'IR Iran',
 'country_long': 'Islamic Republic of Iran',
 'lang': 'Persian',
 'disciplines': "['Shooting']",
 'events': "['Trap Men']",
 'occupation': 'Student',
 'family': nan,
 'hobbies': nan,
 'reason': nan}

# MEDALISTS

In [37]:
medalists_df = pd.read_csv('/kaggle/input/paris-2024-olympic-summer-games/medallists.csv')
print(f"Shape of Medalists data - {medalists_df.shape}")

Shape of Medalists data - (2315, 21)


In [38]:
# Remove the rows that are not medalists
medalists_df = medalists_df.loc[medalists_df['is_medallist'] == True]

# To see the medal tally for a single country or discipline,
# uncomment one of the lines below to filter on country code or discipline.
# medalists_df = medalists_df.loc[medalists_df['country_code'] == 'USA']
# medalists_df = medalists_df.loc[medalists_df['discipline'] == 'Swimming']

### Data preparation for Gold, Silver and Bronze medal charts

In [39]:
'''
Gold medalist dataframe
'''
gold_medals_df = medalists_df.loc[medalists_df['medal_type'] == 'Gold Medal']
# Per country
gold_country = gold_medals_df[['country', 'name']].groupby(by='country').count()
gold_country = gold_country.sort_values(by='name', ascending=False)
gold_country['country'] = gold_country.index
gold_country.columns = ['Medal_Count', 'Country']

gold_country_top10_list = gold_country.Country.head(10).tolist()
gold_country_top10_details = gold_country.loc[gold_country['Country'].isin(gold_country_top10_list)]

# Per discipline
gold_discipline = gold_medals_df[['discipline', 'name']].groupby(by='discipline').count()
gold_discipline = gold_discipline.sort_values(by='name', ascending=False)
gold_discipline['discipline'] = gold_discipline.index
gold_discipline.columns = ['Medal_Count', 'Discipline']

gold_discipline_top10_list = gold_discipline.Discipline.head(10).tolist()
gold_discipline_top10_details = gold_discipline.loc[gold_discipline['Discipline'].isin(gold_discipline_top10_list)]

# Per athlete
gold_athlete = gold_medals_df[['name', 'country_code']].groupby(by='name').count()
gold_athlete = gold_athlete.sort_values(by='country_code', ascending=False)
gold_athlete['name'] = gold_athlete.index
gold_athlete.columns = ['Medal_Count', 'Athlete_Name']

gold_athlete_top10_list = gold_athlete.Athlete_Name.head(10).tolist()
gold_athlete_top10_details = gold_athlete.loc[gold_athlete['Athlete_Name'].isin(gold_athlete_top10_list)]

In [40]:
'''
Silver medalist dataframe
'''
silver_medals_df = medalists_df.loc[medalists_df['medal_type'] == 'Silver Medal']
# Per country
silver_country = silver_medals_df[['country', 'name']].groupby(by='country').count()
silver_country = silver_country.sort_values(by='name', ascending=False)
silver_country['country'] = silver_country.index
silver_country.columns = ['Medal_Count', 'Country']

silver_country_top10_list = silver_country.Country.head(10).tolist()
silver_country_top10_details = silver_country.loc[silver_country['Country'].isin(silver_country_top10_list)]

# Per discipline
silver_discipline = silver_medals_df[['discipline', 'name']].groupby(by='discipline').count()
silver_discipline = silver_discipline.sort_values(by='name', ascending=False)
silver_discipline['discipline'] = silver_discipline.index
silver_discipline.columns = ['Medal_Count', 'Discipline']

silver_discipline_top10_list = silver_discipline.Discipline.head(10).tolist()
silver_discipline_top10_details = silver_discipline.loc[silver_discipline['Discipline'].isin(silver_discipline_top10_list)]

# Per athlete
silver_athlete = silver_medals_df[['name', 'country_code']].groupby(by='name').count()
silver_athlete = silver_athlete.sort_values(by='country_code', ascending=False)
silver_athlete['name'] = silver_athlete.index
silver_athlete.columns = ['Medal_Count', 'Athlete_Name']

silver_athlete_top10_list = silver_athlete.Athlete_Name.head(10).tolist()
silver_athlete_top10_details = silver_athlete.loc[silver_athlete['Athlete_Name'].isin(silver_athlete_top10_list)]

In [41]:
'''
Bronze medalist dataframe
'''
bronze_medals_df = medalists_df.loc[medalists_df['medal_type'] == 'Bronze Medal']
# Per country
bronze_country = bronze_medals_df[['country', 'name']].groupby(by='country').count()
bronze_country = bronze_country.sort_values(by='name', ascending=False)
bronze_country['country'] = bronze_country.index
bronze_country.columns = ['Medal_Count', 'Country']

bronze_country_top10_list = bronze_country.Country.head(10).tolist()
bronze_country_top10_details = bronze_country.loc[bronze_country['Country'].isin(bronze_country_top10_list)]

# Per discipline
bronze_discipline = bronze_medals_df[['discipline', 'name']].groupby(by='discipline').count()
bronze_discipline = bronze_discipline.sort_values(by='name', ascending=False)
bronze_discipline['discipline'] = bronze_discipline.index
bronze_discipline.columns = ['Medal_Count', 'Discipline']

bronze_discipline_top10_list = bronze_discipline.Discipline.head(10).tolist()
bronze_discipline_top10_details = bronze_discipline.loc[bronze_discipline['Discipline'].isin(bronze_discipline_top10_list)]

# Per athlete
bronze_athlete = bronze_medals_df[['name', 'country_code']].groupby(by='name').count()
bronze_athlete = bronze_athlete.sort_values(by='country_code', ascending=False)
bronze_athlete['name'] = bronze_athlete.index
bronze_athlete.columns = ['Medal_Count', 'Athlete_Name']

bronze_athlete_top10_list = bronze_athlete.Athlete_Name.head(10).tolist()
bronze_athlete_top10_details = bronze_athlete.loc[bronze_athlete['Athlete_Name'].isin(bronze_athlete_top10_list)]
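
The Gold, Silver and Bronze cells repeat the same group, sort and top-10 steps. As a sketch of how that preparation could be shared, the hypothetical helper `top_n_counts` below wraps those steps and is applied once per medal type on an invented toy frame standing in for `medalists_df`. It uses `size()` instead of `count()`, so rows with a missing name are counted too, and it returns a frame with a fresh integer index where the cells above keep the group label as the index.

```python
import pandas as pd


def top_n_counts(df, group_col, label, n=10):
    """Count rows per `group_col` and keep the n largest groups (hypothetical helper)."""
    counts = (
        df.groupby(group_col)
          .size()
          .sort_values(ascending=False)
          .head(n)
          .rename("Medal_Count")
          .reset_index()
          .rename(columns={group_col: label})
    )
    # Match the column order used in the cells above: count first, label second.
    return counts[["Medal_Count", label]]


# Toy stand-in for medalists_df; rows are invented for illustration.
toy_medalists = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "country": ["X", "X", "Y", "X", "Y"],
    "discipline": ["Judo", "Judo", "Judo", "Swimming", "Swimming"],
    "medal_type": ["Gold Medal", "Gold Medal", "Silver Medal",
                   "Gold Medal", "Bronze Medal"],
})

# One summary frame per medal type, keyed by the medal_type value.
summaries = {
    medal: top_n_counts(rows, "country", "Country")
    for medal, rows in toy_medalists.groupby("medal_type")
}
print(summaries["Gold Medal"])
```

The same call with `"discipline"` or `"name"` as `group_col` would cover the per-discipline and per-athlete tables.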

## Data Story around Gold Medalists

In [42]:
# Subplot to show Countries, Disciplines and Players who won Gold ranked in descending order
# Creating subplot of 1 row and 3 columns to depict all three aspects.
gold_fig = make_subplots(rows=1, cols=3, vertical_spacing=0.3)

# Line chart to rank Countries winning Gold medal
gold_fig1 = px.line(gold_country_top10_details, x="Country", y="Medal_Count").update_xaxes(categoryorder='total descending')
gold_fig1.add_trace(
    go.Scatter(
        x=gold_country_top10_details['Country'], 
        y=gold_country_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#FFD700')
    )
)

# Line chart to rank Disciplines in terms of Gold medal tally
gold_fig2 = px.line(gold_discipline_top10_details, x="Discipline", y="Medal_Count").update_xaxes(categoryorder='total descending')
gold_fig2.add_trace(
    go.Scatter(
        x=gold_discipline_top10_details['Discipline'], 
        y=gold_discipline_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#FFD700')
    )
)

# Line chart to rank Athletes in terms of their Gold medal tally
gold_fig3 = px.line(gold_athlete_top10_details, x="Athlete_Name", y="Medal_Count").update_xaxes(categoryorder='total descending')
gold_fig3.add_trace(
    go.Scatter(
        x=gold_athlete_top10_details['Athlete_Name'], 
        y=gold_athlete_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#FFD700')
    )
)

for trace in gold_fig1.data:
    gold_fig.add_trace(trace, row=1, col=1)

for trace in gold_fig2.data:
    gold_fig.add_trace(trace, row=1, col=2)

for trace in gold_fig3.data:
    gold_fig.add_trace(trace, row=1, col=3)

gold_fig.update_layout(title="Gold Medalists Summary", showlegend= False)

# Add x-axis and y-axis titles
gold_fig.update_xaxes(title_text="Country", row=1, col=1)
gold_fig.update_yaxes(title_text="Count of Gold medals", row=1, col=1)
gold_fig.update_xaxes(title_text="Discipline", row=1, col=2)
gold_fig.update_yaxes(title_text="Count of Gold medals", row=1, col=2)
gold_fig.update_xaxes(title_text="Athlete_Name", row=1, col=3)
gold_fig.update_yaxes(title_text="Count of Gold medals", row=1, col=3, tickmode='linear')

# gold_fig.show()
pio.show(gold_fig, renderer='iframe')
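
The medalists file has 2315 rows for 329 events, which suggests one row per medal-winning athlete, so a team event would contribute several rows for a single medal. If a medal tally per country is wanted instead of a medalist count, one possible approach is to drop duplicate events before counting. The sketch below uses an invented toy frame in place of `gold_medals_df`; the `event` column name is an assumption and should be checked against the real file.

```python
import pandas as pd

# Toy stand-in for gold_medals_df; `event` is an assumed column name
# and all rows are invented for illustration.
toy_gold = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "country": ["X", "X", "X", "Y"],
    "discipline": ["Rowing", "Rowing", "Judo", "Judo"],
    "event": ["Men's Pair", "Men's Pair", "Men -60 kg", "Women -48 kg"],
})

# Row count: every member of a winning team counts once.
medallist_rows = toy_gold.groupby("country").size()

# Medal count: one row per (country, discipline, event) combination.
medals = (
    toy_gold.drop_duplicates(subset=["country", "discipline", "event"])
            .groupby("country")
            .size()
)
print(medallist_rows.to_dict(), medals.to_dict())
```

In the toy data country X has three medalist rows but two medals, because the two rowers share one event.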

## Data Story around Silver Medalists

In [43]:
# Subplot to show Countries, Disciplines and Players who won Silver ranked in descending order
# Creating subplot of 1 row and 3 columns to depict all three aspects.
silver_fig = make_subplots(rows=1, cols=3, vertical_spacing=0.3)

# Line chart to rank Countries winning Silver medal
silver_fig1 = px.line(silver_country_top10_details, x="Country", y="Medal_Count").update_xaxes(categoryorder='total descending')
silver_fig1.add_trace(
    go.Scatter(
        x=silver_country_top10_details['Country'], 
        y=silver_country_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#C0C0C0')
    )
)

# Line chart to rank Disciplines in terms of Silver medal tally
silver_fig2 = px.line(silver_discipline_top10_details, x="Discipline", y="Medal_Count").update_xaxes(categoryorder='total descending')
silver_fig2.add_trace(
    go.Scatter(
        x=silver_discipline_top10_details['Discipline'], 
        y=silver_discipline_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#C0C0C0')
    )
)

# Line chart to rank Athletes in terms of their Silver medal tally
silver_fig3 = px.line(silver_athlete_top10_details, x="Athlete_Name", y="Medal_Count").update_xaxes(categoryorder='total descending')
silver_fig3.add_trace(
    go.Scatter(
        x=silver_athlete_top10_details['Athlete_Name'], 
        y=silver_athlete_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#C0C0C0')
    )
)

for trace in silver_fig1.data:
    silver_fig.add_trace(trace, row=1, col=1)

for trace in silver_fig2.data:
    silver_fig.add_trace(trace, row=1, col=2)

for trace in silver_fig3.data:
    silver_fig.add_trace(trace, row=1, col=3)

silver_fig.update_layout(title="Silver Medalists Summary", showlegend = False)

# Add x-axis and y-axis titles
silver_fig.update_xaxes(title_text="Country", row=1, col=1)
silver_fig.update_yaxes(title_text="Count of Silver medals", row=1, col=1)
silver_fig.update_xaxes(title_text="Discipline", row=1, col=2)
silver_fig.update_yaxes(title_text="Count of Silver medals", row=1, col=2)
silver_fig.update_xaxes(title_text="Athlete_Name", row=1, col=3)
silver_fig.update_yaxes(title_text="Count of Silver medals", row=1, col=3, tickmode='linear')

# silver_fig.show()
pio.show(silver_fig, renderer='iframe')

## Data Story around Bronze Medalists

In [44]:
# Subplot to show Countries, Disciplines and Players who won Bronze ranked in descending order
# Creating subplot of 1 row and 3 columns to depict all three aspects.
bronze_fig = make_subplots(rows=1, cols=3, vertical_spacing=0.3)

# Line chart to rank Countries winning Bronze medal
bronze_fig1 = px.line(bronze_country_top10_details, x="Country", y="Medal_Count").update_xaxes(categoryorder='total descending')
bronze_fig1.add_trace(
    go.Scatter(
        x=bronze_country_top10_details['Country'], 
        y=bronze_country_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#CD7F32')
    )
)

# Line chart to rank Disciplines in terms of Bronze medal tally
bronze_fig2 = px.line(bronze_discipline_top10_details, x="Discipline", y="Medal_Count").update_xaxes(categoryorder='total descending')
bronze_fig2.add_trace(
    go.Scatter(
        x=bronze_discipline_top10_details['Discipline'], 
        y=bronze_discipline_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#CD7F32')
    )
)

# Line chart to rank Athletes in terms of their Bronze medal tally
bronze_fig3 = px.line(bronze_athlete_top10_details, x="Athlete_Name", y="Medal_Count").update_xaxes(categoryorder='total descending')
bronze_fig3.add_trace(
    go.Scatter(
        x=bronze_athlete_top10_details['Athlete_Name'], 
        y=bronze_athlete_top10_details['Medal_Count'], 
        fill='tozeroy',
        line=dict(color='#CD7F32')
    )
)


for trace in bronze_fig1.data:
    bronze_fig.add_trace(trace, row=1, col=1)

for trace in bronze_fig2.data:
    bronze_fig.add_trace(trace, row=1, col=2)

for trace in bronze_fig3.data:
    bronze_fig.add_trace(trace, row=1, col=3)

bronze_fig.update_layout(title="Bronze Medalists Summary", showlegend = False)

# Add x-axis and y-axis titles
bronze_fig.update_xaxes(title_text="Country", row=1, col=1)
bronze_fig.update_yaxes(title_text="Count of Bronze medals", row=1, col=1)
bronze_fig.update_xaxes(title_text="Discipline", row=1, col=2)
bronze_fig.update_yaxes(title_text="Count of Bronze medals", row=1, col=2)
bronze_fig.update_xaxes(title_text="Athlete", row=1, col=3)
bronze_fig.update_yaxes(title_text="Count of Bronze medals", row=1, col=3, tickmode='linear')

# bronze_fig.show()
pio.show(bronze_fig, renderer='iframe')

## Players Point of View
### Novak Djokovic's Journey to Paris Olympics Gold (2024)

In [45]:
tennis_df = pd.read_csv('/kaggle/input/paris-2024-olympic-summer-games/results/Tennis.csv')
tennis_men_single = tennis_df.loc[tennis_df['event_code'] == 'TENMSINGLES']

In [46]:
# Rows where Djokovic is the listed participant
djoko_games = tennis_men_single.loc[tennis_men_single['participant_name'] == 'DJOKOVIC Novak']
djoko_games_date = djoko_games.date.tolist()
# Opponent rows: same match dates, but a different participant.
# .copy() avoids assigning columns on a slice of tennis_men_single.
djoko_single_games = tennis_men_single.loc[(tennis_men_single['date'].isin(djoko_games_date)) & (tennis_men_single['participant_name'] != 'DJOKOVIC Novak')].copy()
djoko_single_games['Opponent_player'] = 'DJOKOVIC Novak'
djoko_single_games["date"] = pd.to_datetime(djoko_single_games["date"])
# Djokovic won every match on the way to Gold
djoko_single_games["result_WLT"] = 'W'
print(f"Total Single Tennis games Novak Djokovic played to win his first Olympics Gold medal - {len(djoko_single_games)}")

Total Single Tennis games Novak Djokovic played to win his first Olympics Gold medal - 6


In [47]:
# Scatter plot to show the journey of Novak Djokovic's singles game.
fig = px.scatter(djoko_single_games, 
                 x="date", 
                 y="stage", 
                 color="result_WLT", 
                 text="participant_name", 
                 title="Novak Djokovic's Journey to Paris Olympics Gold (2024)", 
                 labels={"date": "Match Date", "stage": "Tournament Round"},
                 hover_data=["stage"])

# Customize the layout for better visibility
fig.update_layout(
    xaxis_title="Match Date",
    yaxis_title="Tournament Round",
    height=500,
    width=1200,
    showlegend=False
)

# fig.show()
pio.show(fig, renderer='iframe')


Looking at this chart, the match dates for the first two rounds appear to be incorrect.<br>
Overall, though, the visualization helps me understand Novak's journey to his Gold and the players he faced in each round.<br> My favorite was the final (Gold Medal match) against Carlos Alcaraz.
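
One way to spot date problems like the one noted above is to check that match dates increase with the round order. The sketch below assumes a `stage` and a `date` column as used in the chart. The round labels, the helper name `flag_out_of_order_dates` and the toy dates are hypothetical, and the real stage labels in the results file may differ.

```python
import pandas as pd

# Hypothetical round labels in playing order; adjust to the labels in the data
ROUND_ORDER = ["Round 1", "Round 2", "Round 3", "Final"]


def flag_out_of_order_dates(matches, round_order):
    """Return rows whose date is not later than the previous round's date."""
    ordered = matches.copy()
    ordered["date"] = pd.to_datetime(ordered["date"])
    # Rank each stage by its position in the expected round order
    rank = {stage: i for i, stage in enumerate(round_order)}
    ordered["round_rank"] = ordered["stage"].map(rank)
    ordered = ordered.sort_values("round_rank")
    # Each round should be played strictly after the round before it
    suspicious = ordered["date"].diff() <= pd.Timedelta(0)
    return ordered.loc[suspicious, ["stage", "date"]]


# Toy matches (made-up dates): Round 2 is dated before Round 1
demo_matches = pd.DataFrame({
    "stage": ["Round 1", "Round 2", "Round 3", "Final"],
    "date": ["2024-07-30", "2024-07-29", "2024-07-31", "2024-08-04"],
})

flagged = flag_out_of_order_dates(demo_matches, ROUND_ORDER)
print(flagged)
```

A check like this only flags where the ordering breaks; it cannot tell which of the two neighboring dates is the wrong one, so flagged rows still need to be compared against the official schedule.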

# __*Conclusion notes*__
This data covers the events and athletes of the Paris 2024 Olympics. I looked at it from a few points of view (POV):

1. Events POV
   * Ranked events by how often they occur. Athletics has the most events, since it bins all track and field events, followed by swimming.
2. Schedule POV
   * Time series plot of total events per day, with a cumulative count to show the percentage completed each day.
   * Compared with the Basketball event schedule; the two trends are highly correlated.
3. Overall Athletes POV
   * Identified that a 12-year-old and a 70-year-old athlete competed.
   * Most athletes compete in track and field, followed by swimming and football.
   * The gender gap is small in the top ten countries by number of athletes, but overall there were more male athletes than female athletes.
   * The most common athlete age is 27, and the variance of age is nearly the same across genders.
   * An interesting find was that the age variance of athletes from China is low, with the youngest aged 12 and the oldest aged 37.
4. Medals POV
   * An interesting finding was that several athletes, such as Yufei Zhang, won multiple medals.
5. Single-athlete POV: Novak Djokovic
   * This is how I watch the Games most of the time as a viewer. I am a fan of Novak Djokovic, and seeing him win his first Gold medal in what is possibly his last Olympics was very satisfying.
   * Scatter plot to visualize his journey to that Gold medal and the athletes he played along the way.
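
As a small illustration of the age findings in the Athletes POV above, the most common age and the per-gender variance can be computed with pandas along these lines. The column names `gender` and `age` and the toy values are assumptions; the notebook's athlete table may use different names.

```python
import pandas as pd

# Toy athlete table (made-up values) standing in for the real athletes data
athletes = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "age": [27, 27, 31, 24, 27, 27],
})

# Most common age overall; mode() can return several values on ties,
# so take the first one
most_common_age = athletes["age"].mode().iloc[0]

# Sample variance of age per gender (pandas uses ddof=1 by default)
age_variance = athletes.groupby("gender")["age"].var()

print(most_common_age)
print(age_variance)
```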