
# Analyzing Gender Representation in the Olympics with Python

This project explores how male and female participation in the Olympic Games has evolved over time using a historical dataset. It compares trends across both the Summer and Winter Olympics, identifies key moments of change, and contextualizes drops in participation with relevant historical events.

# Importing libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Dataset 1 (1896 - 2016)

## Importing and reading the dataset (1896 - 2016)

In [2]:
df = pd.read_csv('athlete_events.csv')
df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [3]:
df.shape

(271116, 15)

## Cleaning and managing the dataset (1896 - 2016)

### Removing duplicates

In [4]:
# Rows that exactly repeat an earlier row (duplicated() flags every occurrence after the first)
duplicated = df[df.duplicated()].sort_values(by='ID')
duplicated

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
1252,704,Dsir Antoine Acket,M,27.0,,,Belgium,BEL,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
4282,2449,William Truman Aldrich,M,48.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Painting, Drawings And ...",
4283,2449,William Truman Aldrich,M,48.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Painting, Drawings And ...",
4862,2777,Hermann Reinhard Alker,M,43.0,,,Germany,GER,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Architecture, Designs F...",
4864,2777,Hermann Reinhard Alker,M,43.0,,,Germany,GER,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Architecture, Architect...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
269994,135072,Anna Katrina Zinkeisen (-Heseltine),F,46.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Paintings",
269995,135072,Anna Katrina Zinkeisen (-Heseltine),F,46.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Paintings",
269997,135072,Anna Katrina Zinkeisen (-Heseltine),F,46.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
269999,135073,Doris Clare Zinkeisen (-Johnstone),F,49.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",


In [5]:
df = df.drop_duplicates().reset_index()

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269731 entries, 0 to 269730
Data columns (total 16 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   index   269731 non-null  int64  
 1   ID      269731 non-null  int64  
 2   Name    269731 non-null  object 
 3   Sex     269731 non-null  object 
 4   Age     260416 non-null  float64
 5   Height  210917 non-null  float64
 6   Weight  208204 non-null  float64
 7   Team    269731 non-null  object 
 8   NOC     269731 non-null  object 
 9   Games   269731 non-null  object 
 10  Year    269731 non-null  int64  
 11  Season  269731 non-null  object 
 12  City    269731 non-null  object 
 13  Sport   269731 non-null  object 
 14  Event   269731 non-null  object 
 15  Medal   39772 non-null   object 
dtypes: float64(3), int64(3), object(10)
memory usage: 32.9+ MB
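
The extra `index` column in the output above is a by-product of `reset_index()`, which by default keeps the old index as a new column. A minimal sketch on a toy frame (the rows are made up for illustration, not taken from the dataset) shows the difference that `drop=True` makes:

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative data only)
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
})

# Default reset_index() moves the old index into a new 'index' column
kept = toy.drop_duplicates().reset_index()

# drop=True discards the old index instead
dropped = toy.drop_duplicates().reset_index(drop=True)

print(kept.columns.tolist())     # ['index', 'ID', 'Name']
print(dropped.columns.tolist())  # ['ID', 'Name']
```

The leftover column is harmless for the grouping that follows, since only `Games`, `Year`, `Season`, `Sex` and `ID` are used.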


### Selecting data to be used

In [7]:
df_grouped = df.groupby(['Games', 'Year', 'Season', 'Sex'])['ID'].nunique().reset_index()
df_grouped.head()

Unnamed: 0,Games,Year,Season,Sex,ID
0,1896 Summer,1896,Summer,M,176
1,1900 Summer,1900,Summer,F,23
2,1900 Summer,1900,Summer,M,1201
3,1904 Summer,1904,Summer,F,6
4,1904 Summer,1904,Summer,M,644


In [8]:
df_grouped.rename(columns={'ID':'Count'}, inplace=True)
df_grouped.head()

Unnamed: 0,Games,Year,Season,Sex,Count
0,1896 Summer,1896,Summer,M,176
1,1900 Summer,1900,Summer,F,23
2,1900 Summer,1900,Summer,M,1201
3,1904 Summer,1904,Summer,F,6
4,1904 Summer,1904,Summer,M,644
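
Grouping on `ID` with `nunique()` counts each athlete once per Games, even when they entered several events. A small sketch with made-up rows illustrates how this differs from a plain row count:

```python
import pandas as pd

# Made-up rows: athlete 1 enters two events at the same Games
toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Games": ["1900 Summer"] * 4,
    "Sex": ["M", "M", "M", "F"],
    "Event": ["100 m", "200 m", "100 m", "Tennis"],
})

# Row count: one entry per athlete-event pair
entries = toy.groupby(["Games", "Sex"])["ID"].count()

# Unique count: one entry per athlete
athletes = toy.groupby(["Games", "Sex"])["ID"].nunique()

print(entries.loc[("1900 Summer", "M")])   # 3
print(athletes.loc[("1900 Summer", "M")])  # 2
```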


# Dataset 2 (Tokyo 2020)

## Importing and reading the dataset (2020)

In [9]:
df_2020 = pd.read_excel('/content/EntriesGender.xlsx')

In [10]:
df_2020.head()

Unnamed: 0,Discipline,Female,Male,Total
0,3x3 Basketball,32,32,64
1,Archery,64,64,128
2,Artistic Gymnastics,98,98,196
3,Artistic Swimming,105,0,105
4,Athletics,969,1072,2041


## Cleaning and managing the dataset (2020)

In [11]:
df_2020['Games'] = '2020 Summer'
df_2020['Year'] = '2020'
df_2020['Season'] = 'Summer'
df_2020.head()

Unnamed: 0,Discipline,Female,Male,Total,Games,Year,Season
0,3x3 Basketball,32,32,64,2020 Summer,2020,Summer
1,Archery,64,64,128,2020 Summer,2020,Summer
2,Artistic Gymnastics,98,98,196,2020 Summer,2020,Summer
3,Artistic Swimming,105,0,105,2020 Summer,2020,Summer
4,Athletics,969,1072,2041,2020 Summer,2020,Summer


In [12]:
df_2020_grouped = df_2020.groupby(['Games', 'Year', 'Season'])[['Female', 'Male']].sum().reset_index()
df_2020_grouped.head()

Unnamed: 0,Games,Year,Season,Female,Male
0,2020 Summer,2020,Summer,5432,5884


In [13]:
df_2020_grouped = df_2020_grouped.rename(columns={'Female':'F', 'Male':'M'})
df_2020_grouped

Unnamed: 0,Games,Year,Season,F,M
0,2020 Summer,2020,Summer,5432,5884


In [14]:
# Reshape from wide (one F and one M column) to long (one row per sex) to match df_grouped
df_2020_grouped = pd.melt(df_2020_grouped,
                  id_vars=['Games', 'Year', 'Season'],
                  value_vars=['F', 'M'],
                  var_name='Sex',
                  value_name='Count')

In [15]:
df_2020_grouped

Unnamed: 0,Games,Year,Season,Sex,Count
0,2020 Summer,2020,Summer,F,5432
1,2020 Summer,2020,Summer,M,5884


# Dataset 3 (Paris 2024)

## Importing and reading the dataset (2024)

In [16]:
df_2024 = pd.read_csv('/content/athletes_2024.csv')
df_2024.head()

Unnamed: 0,code,current,name,name_short,name_tv,gender,function,country_code,country,country_long,...,family,lang,coach,reason,hero,influence,philosophy,sporting_relatives,ritual,other_sports
0,1532872,True,ALEKSANYAN Artur,ALEKSANYAN A,Artur ALEKSANYAN,Male,Athlete,ARM,Armenia,Armenia,...,"Father, Gevorg Aleksanyan","Armenian, English, Russian","Gevorg Aleksanyan (ARM), father",He followed his father and his uncle into the ...,"Footballer Zinedine Zidane (FRA), World Cup wi...","His father, Gevorg Aleksanyan","""Wrestling is my life."" (mediamax.am. 18 May 2...",,,
1,1532873,True,AMOYAN Malkhas,AMOYAN M,Malkhas AMOYAN,Male,Athlete,ARM,Armenia,Armenia,...,,Armenian,,,,,"""To become a good athlete, you first have to b...","Uncle, Roman Amoyan (wrestling), 2008 Olympic ...",,
2,1532874,True,GALSTYAN Slavik,GALSTYAN S,Slavik GALSTYAN,Male,Athlete,ARM,Armenia,Armenia,...,,Armenian,Personal: Martin Alekhanyan (ARM).<br>National...,,,,,,,
3,1532944,True,HARUTYUNYAN Arsen,HARUTYUNYAN A,Arsen HARUTYUNYAN,Male,Athlete,ARM,Armenia,Armenia,...,"Wife, Diana (married October 2022). Daughter, ...",Armenian,National: Habetnak Kurghinyan,While doing karate he noticed wrestlers traini...,"Wrestler Armen Nazaryan (ARM, BUL), two-time O...",,"“Nothing is impossible, set goals in front of ...",,,
4,1532945,True,TEVANYAN Vazgen,TEVANYAN V,Vazgen TEVANYAN,Male,Athlete,ARM,Armenia,Armenia,...,"Wife, Sona (married November 2023)","Armenian, Russian",National: Habetnak Kurghinyan (ARM),“My family did not like wrestling very much. A...,,,,,,


In [17]:
df_2024.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11113 entries, 0 to 11112
Data columns (total 36 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   code                11113 non-null  int64  
 1   current             11113 non-null  bool   
 2   name                11113 non-null  object 
 3   name_short          11110 non-null  object 
 4   name_tv             11110 non-null  object 
 5   gender              11113 non-null  object 
 6   function            11113 non-null  object 
 7   country_code        11113 non-null  object 
 8   country             11113 non-null  object 
 9   country_long        11113 non-null  object 
 10  nationality         11110 non-null  object 
 11  nationality_long    11110 non-null  object 
 12  nationality_code    11110 non-null  object 
 13  height              11110 non-null  float64
 14  weight              11108 non-null  float64
 15  disciplines         11113 non-null  object 
 16  even

## Cleaning and managing the dataset (2024)

In [18]:
df_2024_grouped = df_2024['gender'].value_counts().reset_index()
df_2024_grouped

Unnamed: 0,gender,count
0,Male,5658
1,Female,5455


In [19]:
df_2024_grouped['Games'] = '2024 Summer'
df_2024_grouped['Year'] = '2024'
df_2024_grouped['Season'] = 'Summer'
df_2024_grouped.head()

Unnamed: 0,gender,count,Games,Year,Season
0,Male,5658,2024 Summer,2024,Summer
1,Female,5455,2024 Summer,2024,Summer


In [20]:
df_2024_grouped.replace({'Male': 'M', 'Female': 'F'}, inplace=True)

In [21]:
df_2024_grouped.rename(columns={'gender': 'Sex', 'count': 'Count'}, inplace=True)
df_2024_grouped

Unnamed: 0,Sex,Count,Games,Year,Season
0,M,5658,2024 Summer,2024,Summer
1,F,5455,2024 Summer,2024,Summer


In [22]:
df_2024_grouped = df_2024_grouped[['Games', 'Year', 'Season', 'Sex', 'Count']]
df_2024_grouped

Unnamed: 0,Games,Year,Season,Sex,Count
0,2024 Summer,2024,Summer,M,5658
1,2024 Summer,2024,Summer,F,5455


# Combining the datasets

In [40]:
df_merged = pd.concat([df_grouped, df_2020_grouped, df_2024_grouped]).reset_index(drop=True)
df_merged.head()

Unnamed: 0,Games,Year,Season,Sex,Count
0,1896 Summer,1896,Summer,M,176
1,1900 Summer,1900,Summer,F,23
2,1900 Summer,1900,Summer,M,1201
3,1904 Summer,1904,Summer,F,6
4,1904 Summer,1904,Summer,M,644


In [39]:
df_merged_summer = df_merged[df_merged['Season'] == 'Summer'].reset_index(drop=True)
df_merged_summer.head()

Unnamed: 0,Games,Year,Season,Sex,Count
0,1896 Summer,1896,Summer,M,176
1,1900 Summer,1900,Summer,F,23
2,1900 Summer,1900,Summer,M,1201
3,1904 Summer,1904,Summer,F,6
4,1904 Summer,1904,Summer,M,644


In [25]:
df_merged_summer['Year'] = df_merged_summer['Year'].astype(int)

In [26]:
df_merged_summer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Games   61 non-null     object
 1   Year    61 non-null     int64 
 2   Season  61 non-null     object
 3   Sex     61 non-null     object
 4   Count   61 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.5+ KB


In [38]:
df_merged_winter = df_merged[df_merged['Season'] == 'Winter'].reset_index(drop=True)
df_merged_winter.head()

Unnamed: 0,Games,Year,Season,Sex,Count
0,1924 Winter,1924,Winter,F,13
1,1924 Winter,1924,Winter,M,300
2,1928 Winter,1928,Winter,F,28
3,1928 Winter,1928,Winter,M,433
4,1932 Winter,1932,Winter,F,21


In [28]:
df_merged_winter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Games   44 non-null     object
 1   Year    44 non-null     object
 2   Season  44 non-null     object
 3   Sex     44 non-null     object
 4   Count   44 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 1.8+ KB


In [29]:
df_merged_winter['Year'] = df_merged_winter['Year'].astype(int)
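
`Year` shows up as `object` in the `info()` output above because the 2020 and 2024 frames store the year as a string, while the historical frame stores it as an integer. A minimal sketch, with toy frames standing in for the real ones, shows how the mixed types survive `pd.concat` and how a single cast on the merged frame would cover both seasons at once:

```python
import pandas as pd

# Toy stand-ins: integer years in the historical frame, a string year in the newer one
old = pd.DataFrame({"Year": [1896, 1900], "Count": [176, 1224]})
new = pd.DataFrame({"Year": ["2020"], "Count": [11316]})

merged = pd.concat([old, new]).reset_index(drop=True)
print(merged["Year"].dtype)  # mixed int/str values, so not an integer dtype

# Casting once on the merged frame normalises every row
merged["Year"] = merged["Year"].astype(int)
print(merged["Year"].tolist())  # [1896, 1900, 2020]
```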

# Analysing gender participation over time

## Summer Olympic Games

In [35]:
gender_color_map = {'M': 'blue', 'F': 'red'}
gender_order = ['M', 'F']

fig = px.line(
    df_merged_summer,
    x='Year',
    y='Count',
    color='Sex',
    markers=True,
    title="Summer Olympic Games Participation by Gender Over Time",
    labels={'Count': 'Number of Athletes'},
    color_discrete_map=gender_color_map,
    category_orders={'Sex': gender_order}
)

fig.show()

### Visual analysis
### Female trend:
Female participation in the Summer Olympic Games has risen substantially over time, as the graph above shows. No women appear in the data for the first edition (1896), while the most recent edition (2024) is close to parity, with 5,455 female and 5,658 male athletes.
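
One way to put a number on this trend is the female share of athletes at each Games. The sketch below assumes a frame shaped like `df_merged_summer` (columns `Year`, `Sex`, `Count`) and uses three Games whose counts appear in the tables earlier in this notebook; `female_share` is a column name chosen here for illustration.

```python
import pandas as pd

# Counts copied from the tables above (1896, 1900 and 2024 only)
sample = pd.DataFrame({
    "Year":  [1896, 1900, 1900, 2024, 2024],
    "Sex":   ["M", "F", "M", "M", "F"],
    "Count": [176, 23, 1201, 5658, 5455],
})

# One row per Games, one column per sex; a missing sex (1896 has no F row) becomes 0
wide = sample.pivot_table(index="Year", columns="Sex", values="Count",
                          aggfunc="sum", fill_value=0)

# Share of female athletes among all athletes at each Games
wide["female_share"] = wide["F"] / (wide["F"] + wide["M"])
print(wide["female_share"].round(3))  # 1896: 0.0, 1900: 0.019, 2024: 0.491
```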


### Drops in male participation:
Three major declines in male participation stand out, each coinciding with a global economic or political event:
- Los Angeles 1932\
  These Games were held during the Great Depression, and many countries could not afford to send full delegations.
- Melbourne 1956\
  Several countries boycotted these Games in protest at the Soviet invasion of Hungary or the Suez Canal conflict.
- Moscow 1980\
  Over 60 countries boycotted the Games in response to the Soviet invasion of Afghanistan.
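
The declines listed above were read off the chart. A rough programmatic check, assuming the `Year`/`Sex`/`Count` layout of `df_merged_summer`, is to difference the male series and keep the Games where the count fell. The counts in this sketch are invented to keep it self-contained, and `find_drops` is a hypothetical helper name.

```python
import pandas as pd

def find_drops(df: pd.DataFrame, sex: str) -> pd.DataFrame:
    """Return the Games at which the count for `sex` fell versus the previous Games."""
    series = (
        df[df["Sex"] == sex]
        .sort_values("Year")
        .set_index("Year")["Count"]
    )
    change = series.diff()  # difference from the previous edition
    return pd.DataFrame({"Count": series, "Change": change})[change < 0]

# Invented counts, not the real figures
toy = pd.DataFrame({
    "Year":  [1924, 1928, 1932, 1936],
    "Sex":   ["M"] * 4,
    "Count": [100, 120, 90, 150],
})

drops = find_drops(toy, "M")
print(drops)  # a single row for 1932 with Change == -30.0
```

On the real data, `find_drops(df_merged_summer, 'M')` would list every decline, not only the three large ones discussed here, so the result still needs to be read alongside the chart.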

## Winter Olympic Games

In [36]:
fig = px.line(
    df_merged_winter,
    x='Year',
    y='Count',
    color='Sex',
    markers=True,
    title="Winter Olympic Games Participation by Gender Over Time",
    labels={'Count': 'Number of Athletes'},
    color_discrete_map=gender_color_map,
    category_orders={'Sex': gender_order}
)

fig.show()

### Visual analysis
### Female trend:
As in the Summer Games, female participation in the Winter Olympic Games has increased over time. However, it has remained below male participation and has not yet reached parity.


### Drops in male participation:
Three declines in male participation are also visible in the Winter Games:
- Lake Placid 1932\
  Like the Summer edition of the same year, these Games were held during the Great Depression, and many countries could not afford to send full delegations.
- Squaw Valley 1960\
  Bobsleigh was left off the programme at this edition because the organisers chose not to build a run. As it was a men-only sport at the time, its absence likely contributed to the lower male count.
- Lillehammer 1994\
  The 1994 Games took place only two years after the previous edition, the first held under the IOC's decision to alternate Summer and Winter Games. The short gap may explain the slight decrease in male participation, although female participation shows no comparable dip.
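
To check whether a dip affects only one sex, as suggested for 1994, the percentage change can be computed for both series side by side. This sketch assumes the same `Year`/`Sex`/`Count` layout as `df_merged_winter` and uses invented counts rather than the real figures.

```python
import pandas as pd

# Invented counts for two consecutive editions (not the real figures)
toy = pd.DataFrame({
    "Year":  [1992, 1992, 1994, 1994],
    "Sex":   ["M", "F", "M", "F"],
    "Count": [1000, 400, 900, 450],
})

# One column per sex, then the percentage change from the previous edition
wide = toy.pivot(index="Year", columns="Sex", values="Count").sort_index()
pct = wide.pct_change() * 100

print(pct.loc[1994].round(1))  # F: 12.5, M: -10.0
```

A negative value for `M` next to a positive value for `F` in the same row would support the observation that only male participation fell.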