# **Track 3**

*Researched and performed by GentleFatCat*

The challenge was to analyze the Tokyo 2020 Olympics dataset to predict who wins the Paris 2024 Olympics. The Tokyo 2020 Summer Olympics, held in 2021 because of the COVID-19 pandemic, left a lasting impact:

1. Tokyo 2020 introduced new sports: skateboarding, surfing, sport climbing, and karate. These additions brought fresh energy to the Olympic stage during a period of uncertainty.

2. Athletes from countries such as Turkmenistan and San Marino won their nations' first-ever Olympic medals at Tokyo 2020.

The Paris 2024 Summer Olympics, which will host the world's top athletes from July 26 to August 11, have drawn particular attention because of their host city. Other notable changes include:

1. Paris 2024 will feature 28 returning sports, along with four additional sports: sport climbing, skateboarding, surfing, and breaking. Breaking, in particular, will make its Olympic debut.
2. In a historic decision, the International Olympic Committee awarded the hosting rights for both Paris (2024) and Los Angeles (2028) simultaneously.

Our initial thought process was to find suitable datasets with information on the Paris 2024 and Tokyo 2020 Olympics. Following this, our ML skills were put to an interesting test!

**Datasets used:**

1. https://www.kaggle.com/datasets/piterfm/paris-2024-olympic-summer-games

   events.csv lists the events taking place at Paris 2024.
   The full dataset covers participating countries, athletes, sports disciplines, medal standings, and key event details.

2. https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo/data

   All Tokyo 2020 files (Athletes.xlsx and Medals.xlsx) were imported from here.

**Importing necessary libraries for data manipulation and visualization**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Dataset about the events in Paris 2024 is read.

In [None]:
# df here is the dataframe into which the events.csv is being read into

df = pd.read_csv("/content/events.csv")

In [None]:
df

Unnamed: 0,event,tag,sport,sport_code,sport_url
0,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery
1,Women's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery
2,Men's Team,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery
3,Women's Team,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery
4,Mixed Team,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery
...,...,...,...,...,...
324,Men's Freestyle 65kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...
325,Men's Freestyle 74kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...
326,Men's Freestyle 86kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...
327,Men's Freestyle 97kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...


**Initially, we looked for a relationship between the sports common to Paris 2024 and Tokyo 2020, based on the available data, to predict which sports are likelier to yield medals in 2024.**

Preprocessing and understanding the sport column

In [None]:
df['sport']

0        Archery
1        Archery
2        Archery
3        Archery
4        Archery
         ...    
324    Wrestling
325    Wrestling
326    Wrestling
327    Wrestling
328    Wrestling
Name: sport, Length: 329, dtype: object

In [None]:
# Get all unique values from event's sport column
df['sport'].unique()

array(['Archery', 'Artistic Gymnastics', 'Artistic Swimming', 'Athletics',
       'Badminton', 'Basketball', 'Basketball 3x3', 'Beach Volleyball',
       'Boxing', 'Breaking', 'Canoe Slalom', 'Canoe Sprint',
       'Cycling BMX Freestyle', 'Cycling BMX Racing',
       'Cycling Mountain Bike', 'Cycling Road', 'Cycling Track ',
       'Diving', 'Equestrian', 'Fencing', 'Football', 'Golf', 'Handball',
       'Hockey', 'Judo', 'Marathon Swimming', 'Modern Pentathlon',
       'Rhythmic Gymnastics', 'Rowing', 'Rugby Sevens', 'Sailing',
       'Shooting', 'Skateboarding', 'Sport Climbing', 'Surfing',
       'Swimming', 'Table Tennis', 'Taekwondo', 'Tennis', 'Trampoline',
       'Triathlon', 'Volleyball', 'Water Polo', 'Weightlifting',
       'Wrestling'], dtype=object)
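
The array above lists `'Cycling Track '` with a trailing space. If the athletes file spells the discipline without that space (an assumption, not verified here), a join on this column would silently miss it. A minimal sketch on toy data of normalizing the key before merging:

```python
import pandas as pd

# Toy stand-ins for the events and athletes tables (values are illustrative).
events = pd.DataFrame({"sport": ["Cycling Track ", "Archery"]})
athletes = pd.DataFrame({"Discipline": ["Cycling Track", "Archery"]})

# Strip leading/trailing whitespace so the join keys compare equal.
events["sport"] = events["sport"].str.strip()

merged = events.merge(athletes, left_on="sport", right_on="Discipline", how="left")

# Number of event rows that found no matching discipline.
unmatched = int(merged["Discipline"].isna().sum())
print(unmatched)
```

The same `str.strip()` call could be applied to `df['sport']` before any of the merges in this notebook.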

In [None]:
df['sport'].value_counts()

sport
Athletics                48
Swimming                 35
Wrestling                18
Judo                     15
Shooting                 15
Rowing                   14
Artistic Gymnastics      14
Boxing                   13
Fencing                  12
Cycling Track            12
Canoe Sprint             10
Sailing                  10
Weightlifting            10
Taekwondo                 8
Diving                    8
Canoe Slalom              6
Equestrian                6
Table Tennis              5
Tennis                    5
Archery                   5
Badminton                 5
Skateboarding             4
Sport Climbing            4
Cycling Road              4
Triathlon                 3
Artistic Swimming         2
Water Polo                2
Volleyball                2
Trampoline                2
Basketball                2
Basketball 3x3            2
Beach Volleyball          2
Surfing                   2
Football                  2
Golf                      2
Breaking      

Dataset about athletes from Tokyo 2020 Olympics is read.

In [None]:
dz = pd.read_excel("/content/Athletes.xlsx")

In [None]:
dz

Unnamed: 0,Name,NOC,Discipline
0,AALERUD Katrine,Norway,Cycling Road
1,ABAD Nestor,Spain,Artistic Gymnastics
2,ABAGNALE Giovanni,Italy,Rowing
3,ABALDE Alberto,Spain,Basketball
4,ABALDE Tamara,Spain,Basketball
...,...,...,...
11080,ZWICKER Martin Detlef,Germany,Hockey
11081,ZWOLINSKA Klaudia,Poland,Canoe Slalom
11082,ZYKOVA Yulia,ROC,Shooting
11083,ZYUZINA Ekaterina,ROC,Sailing


Dataset about Medals won by countries in Tokyo 2020 Olympics is read.

In [None]:
dm = pd.read_excel("/content/Medals.xlsx")

  warn("Workbook contains no default style, apply openpyxl's default")


In [None]:
dm

Unnamed: 0,Rank,Team/NOC,Gold,Silver,Bronze,Total,Rank by Total
0,1,United States of America,39,41,33,113,1
1,2,People's Republic of China,38,32,18,88,2
2,3,Japan,27,14,17,58,5
3,4,Great Britain,22,21,22,65,4
4,5,ROC,20,28,23,71,3
...,...,...,...,...,...,...,...
88,86,Ghana,0,0,1,1,77
89,86,Grenada,0,0,1,1,77
90,86,Kuwait,0,0,1,1,77
91,86,Republic of Moldova,0,0,1,1,77


Merging the events dataset with the athletes dataset to see which Paris 2024 sports match the Discipline column of the athletes dataset.

In [None]:
merged_data = pd.merge(df, dz, left_on='sport', right_on='Discipline', how='left')

In [None]:
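# Merging with medal count data to form an appropriate training data
# NOTE: 'sport' holds discipline names while 'Team/NOC' holds country names, so this
# join finds no matches and every medal column is NaN; left_on='NOC' was probably intended.
final_merged_data = pd.merge(merged_data, dm , left_on='sport', right_on='Team/NOC', how='left')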

In [None]:
#merging events from Paris 2024 w/ athlete count of tokyo 2020

In [None]:
mdf = pd.merge(df, dz, left_on='sport', right_on='Discipline', how='left')

In [None]:
mdf

Unnamed: 0,event,tag,sport,sport_code,sport_url,Name,NOC,Discipline
0,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ABDULLIN Ilfat,Kazakhstan,Archery
1,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ACOSTA GIRALDO Valentina,Colombia,Archery
2,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ADAM Amal,Egypt,Archery
3,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,AGUILAR Andres,Chile,Archery
4,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ALVAREZ Luis,Mexico,Archery
...,...,...,...,...,...,...,...,...
179515,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHADRAYEV Demeu,Kazakhstan,Wrestling
179516,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHOU Feng,People's Republic of China,Wrestling
179517,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHOU Qian,People's Republic of China,Wrestling
179518,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHUMANAZAROVA Meerim,Kyrgyzstan,Wrestling
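
The merged table has roughly 180,000 rows because every event is paired with every athlete of the same sport, not only with the athletes who would actually enter that event. The sketch below reproduces the effect on toy data and uses the `indicator=True` option of `merge` to list sports that found no Tokyo 2020 athletes; the toy values are illustrative.

```python
import pandas as pd

events = pd.DataFrame({
    "event": ["Men's Individual", "Women's Individual", "B-Boys"],
    "sport": ["Archery", "Archery", "Breaking"],
})
athletes = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Discipline": ["Archery", "Archery", "Archery"],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
checked = events.merge(
    athletes, left_on="sport", right_on="Discipline", how="left", indicator=True
)

# 2 Archery events x 3 Archery athletes = 6 rows, plus 1 unmatched Breaking row.
n_rows = len(checked)
unmatched_sports = sorted(
    checked.loc[checked["_merge"] == "left_only", "sport"].unique()
)
print(n_rows, unmatched_sports)
```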


In [None]:
# Grouping the data by 'NOC' and 'Discipline', counting the number of athletes in each group, and resetting the index

athletes_paris_2024_agg = mdf.groupby(['NOC', 'Discipline']).size().reset_index(name='Athlete_Count')
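
Because `mdf` repeats each athlete once per event in their sport, the group sizes computed above are athlete counts multiplied by event counts. This is visible in the comparison table further down, where the Athletics counts are 48 and 96 and Athletics has 48 events. A sketch on toy data contrasting that count with one taken directly from the athletes table:

```python
import pandas as pd

athletes = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NOC": ["Norway", "Norway", "Spain"],
    "Discipline": ["Athletics", "Athletics", "Rowing"],
})
events = pd.DataFrame({"sport": ["Athletics"] * 3 + ["Rowing"] * 2})

# Counting on the merged table: every athlete is repeated once per event.
merged = events.merge(athletes, left_on="sport", right_on="Discipline", how="left")
inflated = (
    merged.groupby(["NOC", "Discipline"]).size().reset_index(name="Athlete_Count")
)

# Counting on the athletes table alone: one row per athlete.
direct = (
    athletes.groupby(["NOC", "Discipline"]).size().reset_index(name="Athlete_Count")
)
print(inflated)
print(direct)
```

In the notebook's terms, grouping `dz` instead of `mdf` would give the plain head count per country and discipline.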

Performing a left join of the athletes_paris_2024_agg dataframe with the medals dataframe dm, matching the 'NOC' column of athletes_paris_2024_agg against the 'Team/NOC' column of dm.

In [None]:
comparison_df = pd.merge(athletes_paris_2024_agg, dm, left_on='NOC', right_on='Team/NOC', how='left')

In [None]:
comparison_df

Unnamed: 0,NOC,Discipline,Athlete_Count,Rank,Team/NOC,Gold,Silver,Bronze,Total,Rank by Total
0,Afghanistan,Athletics,96,,,,,,,
1,Afghanistan,Shooting,15,,,,,,,
2,Afghanistan,Swimming,35,,,,,,,
3,Afghanistan,Taekwondo,8,,,,,,,
4,Albania,Artistic Gymnastics,14,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
1999,Zambia,Swimming,70,,,,,,,
2000,Zimbabwe,Athletics,48,,,,,,,
2001,Zimbabwe,Golf,2,,,,,,,
2002,Zimbabwe,Rowing,14,,,,,,,


In [None]:
# Confirming that all values aren't 0

comparison_df['Bronze'].unique()

array([nan,  2., 22.,  5.,  4.,  0.,  3.,  1.,  8., 11.,  6., 16.,  7.,
       20., 17., 14., 18., 23., 10.,  9., 12., 33.])

In [None]:
# Filling all NaN values with zero so a supervised model can be trained on the data

comparison_df.fillna(0, inplace=True)
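
A blanket `fillna(0)` also writes 0 into the text column `Team/NOC` and into `Rank`, where 0 would read as better than first place. One alternative, sketched on toy data, fills only the medal columns and records whether the country appeared in the medal table at all; the `In_Medal_Table` column name is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NOC": ["Afghanistan", "Japan"],
    "Team/NOC": [np.nan, "Japan"],
    "Gold": [np.nan, 27.0],
    "Bronze": [np.nan, 17.0],
})

# Fill only the numeric medal columns; leave the text key untouched.
medal_cols = ["Gold", "Bronze"]
toy[medal_cols] = toy[medal_cols].fillna(0)

# Hypothetical flag: True when the country has a row in the medal table.
toy["In_Medal_Table"] = toy["Team/NOC"].notna()
print(toy)
```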

In [None]:
comparison_df

Unnamed: 0,NOC,Discipline,Athlete_Count,Rank,Team/NOC,Gold,Silver,Bronze,Total,Rank by Total
0,Afghanistan,Athletics,96,0.0,0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,Shooting,15,0.0,0,0.0,0.0,0.0,0.0,0.0
2,Afghanistan,Swimming,35,0.0,0,0.0,0.0,0.0,0.0,0.0
3,Afghanistan,Taekwondo,8,0.0,0,0.0,0.0,0.0,0.0,0.0
4,Albania,Artistic Gymnastics,14,0.0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
1999,Zambia,Swimming,70,0.0,0,0.0,0.0,0.0,0.0,0.0
2000,Zimbabwe,Athletics,48,0.0,0,0.0,0.0,0.0,0.0,0.0
2001,Zimbabwe,Golf,2,0.0,0,0.0,0.0,0.0,0.0,0.0
2002,Zimbabwe,Rowing,14,0.0,0,0.0,0.0,0.0,0.0,0.0


'NOC' and 'Discipline' are encoded as categorical variables because we expect them to be strong factors in medal counts and rank.

# **Proving High Correlation between Athlete Count from dz and df dataset and Medals from dm dataset**

In [None]:
analysis = comparison_df.groupby('NOC').agg({
    'Athlete_Count': 'sum',
    'Gold': 'sum',
    'Silver': 'sum',
    'Bronze': 'sum',
    'Total': 'sum'
}).reset_index()

In [None]:
analysis

Unnamed: 0,NOC,Athlete_Count,Gold,Silver,Bronze,Total
0,Afghanistan,154,0.0,0.0,0.0,0.0
1,Albania,182,0.0,0.0,0.0,0.0
2,Algeria,848,0.0,0.0,0.0,0.0
3,American Samoa,113,0.0,0.0,0.0,0.0
4,Andorra,54,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...
201,"Virgin Islands, British",131,0.0,0.0,0.0,0.0
202,"Virgin Islands, US",123,0.0,0.0,0.0,0.0
203,Yemen,98,0.0,0.0,0.0,0.0
204,Zambia,262,0.0,0.0,0.0,0.0


In [None]:
correlation_df = analysis.drop(columns=['NOC'])

In [None]:
# Calculate the correlation between athlete count and medals
correlation = correlation_df.corr()

In [None]:
print("Correlation between Athlete Count and Medal Counts:\n", correlation)

Correlation between Athlete Count and Medal Counts:
                Athlete_Count      Gold    Silver    Bronze     Total
Athlete_Count       1.000000  0.822767  0.821662  0.909125  0.871268
Gold                0.822767  1.000000  0.953511  0.911418  0.981271
Silver              0.821662  0.953511  1.000000  0.910428  0.979838
Bronze              0.909125  0.911418  0.910428  1.000000  0.962944
Total               0.871268  0.981271  0.979838  0.962944  1.000000


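The correlation matrix suggests that the number of athletes competing for a country is strongly associated with the medals it wins (Gold, Silver, Bronze and Total). Correlation alone does not establish causation, and both sides are inflated here: athlete counts are repeated once per event, and a country's medal totals are summed once per discipline row.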
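
As a hedged cross-check, the sketch below aggregates toy data two ways: summing the medal column, as in the `analysis` cell, and taking the first value per country, which keeps the country's actual count. It also computes a rank-based (Spearman) correlation by ranking both columns first. All numbers are made up for illustration.

```python
import pandas as pd

comparison = pd.DataFrame({
    "NOC": ["A", "A", "B", "B", "B", "C"],
    "Athlete_Count": [10, 5, 20, 10, 10, 2],
    "Gold": [1.0, 1.0, 3.0, 3.0, 3.0, 0.0],
})

# Summing repeats a country's gold count once per discipline row.
summed = comparison.groupby("NOC").agg(
    Athlete_Count=("Athlete_Count", "sum"), Gold=("Gold", "sum")
)

# Taking the first value keeps the gold count as reported for the country.
per_country = comparison.groupby("NOC").agg(
    Athlete_Count=("Athlete_Count", "sum"), Gold=("Gold", "first")
)

pearson = per_country["Athlete_Count"].corr(per_country["Gold"])
# Spearman correlation is the Pearson correlation of the ranks.
spearman = per_country["Athlete_Count"].rank().corr(per_country["Gold"].rank())
print(summed["Gold"].tolist(), per_country["Gold"].tolist(), pearson, spearman)
```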

# **Applying Models:**

In [None]:
# Converting categorical variable(s) into dummy/indicator variables

categorical_features = ['NOC', 'Discipline']
comparison_df_encoded = pd.get_dummies(comparison_df, columns=categorical_features, drop_first=True)
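
For readers unfamiliar with `pd.get_dummies`, a small illustration on toy data of what the call above produces: one indicator column per category, with the first category dropped by `drop_first=True` so it serves as the baseline.

```python
import pandas as pd

toy = pd.DataFrame({
    "NOC": ["Japan", "Norway", "Spain"],
    "Athlete_Count": [3, 1, 2],
})

# 'Japan' is the first category alphabetically, so it becomes the baseline.
encoded = pd.get_dummies(toy, columns=["NOC"], drop_first=True)
cols = list(encoded.columns)
print(cols)
```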

Deciding X and y to split and train our model with.

In [None]:
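# Drop the medal columns so the target does not leak into the features.
# Note: 'Rank', which is derived from the gold count, is still part of X here.
X = comparison_df_encoded.drop(columns=['Gold', 'Silver', 'Bronze', 'Total', 'Rank by Total', 'Team/NOC'])
y = comparison_df_encoded['Gold']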

In [None]:
# X: features, y: target variable
# test_size=0.2: 20% of the data will be used for testing
# random_state=42: ensures reproducibility of the split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
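
With a purely random split, rows for the same country can land in both the training and test sets, and each of those rows carries the same gold count. A possible alternative, sketched here on toy data rather than taken from the original notebook, is scikit-learn's `GroupShuffleSplit`, which keeps all rows of a country on one side of the split. The sketch also drops `Rank` from the features.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

toy = pd.DataFrame({
    "NOC": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Athlete_Count": [5, 3, 9, 4, 1, 2, 7, 6],
    "Rank": [1, 1, 2, 2, 3, 3, 4, 4],
    "Gold": [9, 9, 4, 4, 1, 1, 0, 0],
})

# Remove the target and anything derived from it, plus the grouping key.
features = toy.drop(columns=["Gold", "Rank", "NOC"])
target = toy["Gold"]

# test_size=0.25 of four groups puts exactly one country in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(features, target, groups=toy["NOC"]))

train_countries = set(toy["NOC"].iloc[train_idx])
test_countries = set(toy["NOC"].iloc[test_idx])
print(train_countries, test_countries)
```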

**The models we chose to train to predict the champions of Paris 2024 are:**

1. **Random Forest**: An ensemble learning method that builds multiple decision trees and combines their votes for a more accurate and stable prediction.
2. **Logistic Regression**: A linear model for classification. We considered it because whether a country wins Gold can be framed as a 1-or-0 outcome; note that the target `y` used here is the gold-medal count, so the model is actually fit as a multi-class classifier.
3. **Support Vector Classifier (SVC)**: A classification method that finds the hyperplane that best separates the classes. We included it because 'NOC' and 'Discipline' appear to influence gold medals in the past data.
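
A minimal sketch of the binary framing described in item 2, on synthetic data rather than the notebook's dataframe. Scaling the features and raising `max_iter` are the two remedies scikit-learn suggests when the solver stops before converging; the data, threshold, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic delegation sizes and a binary "won at least one gold" target.
athlete_count = rng.integers(1, 500, size=200).astype(float)
won_gold = (athlete_count + rng.normal(0, 50, size=200) > 250).astype(int)
X_toy = athlete_count.reshape(-1, 1)

# Standardize the feature, then fit logistic regression with a higher iteration cap.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_toy, won_gold)

train_accuracy = clf.score(X_toy, won_gold)
print(train_accuracy)
```

The same pipeline shape would also work for `SVC`, which is likewise sensitive to feature scale.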

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# drop() returns a new dataframe; combining the assignment with inplace=True would set the variable to None
final_merged_data = final_merged_data.drop(columns=['sport','sport_code','sport_url','Name'])

In [None]:
# Instantiate the three classifiers with a fixed seed for reproducibility
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression(random_state=42)
svc_model = SVC(random_state=42)

In [None]:
rf_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)
svc_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
rf_preds = rf_model.predict(X_test)
lr_preds = lr_model.predict(X_test)
svc_preds = svc_model.predict(X_test)

# **Results:**

In [None]:
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_preds))
print("SVC Accuracy:", accuracy_score(y_test, svc_preds))

Random Forest Accuracy: 1.0
Logistic Regression Accuracy: 0.3541147132169576
SVC Accuracy: 0.36159600997506236


In [None]:
print("Random Forest Classification Report:\n", classification_report(y_test, rf_preds))
print("Logistic Regression Classification Report:\n", classification_report(y_test, lr_preds))
print("SVC Classification Report:\n", classification_report(y_test, svc_preds))

Random Forest Classification Report:
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       145
         1.0       1.00      1.00      1.00        72
         2.0       1.00      1.00      1.00        27
         3.0       1.00      1.00      1.00        39
         4.0       1.00      1.00      1.00        14
         6.0       1.00      1.00      1.00         7
         7.0       1.00      1.00      1.00        28
        10.0       1.00      1.00      1.00        26
        17.0       1.00      1.00      1.00         5
        20.0       1.00      1.00      1.00         6
        22.0       1.00      1.00      1.00         5
        27.0       1.00      1.00      1.00         8
        38.0       1.00      1.00      1.00         7
        39.0       1.00      1.00      1.00        12

    accuracy                           1.00       401
   macro avg       1.00      1.00      1.00       401
weighted avg       1.00      1.00      1.0


**Result:**

Logistic Regression and SVC reached an accuracy of only about 35-36%. Random Forest scored 100% on the test set, which points to overfitting or target leakage rather than real predictive power: `Rank` is still among the features, and the random split lets rows for the same country, each carrying the same gold count, fall into both the training and test sets.

In an attempt to salvage our preprocessing and the underlying idea, we tried Linear Regression. The near-perfect scores below should be read with the same caution, since the features are unchanged.

In [None]:
# Note: this reuses the name lr_model, replacing the logistic regression model above
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [None]:
y_train_preds = lr_model.predict(X_train)
y_test_preds = lr_model.predict(X_test)

In [None]:
print("Linear Regression Training Performance:")
print("Mean Squared Error:", mean_squared_error(y_train, y_train_preds))
print("R-squared:", r2_score(y_train, y_train_preds))

Linear Regression Training Performance:
Mean Squared Error: 8.910562431979137e-24
R-squared: 1.0


In [None]:
print("\nLinear Regression Testing Performance:")
print("Mean Squared Error:", mean_squared_error(y_test, y_test_preds))
print("R-squared:", r2_score(y_test, y_test_preds))


Linear Regression Testing Performance:
Mean Squared Error: 6.719477564120869e-05
R-squared: 0.9999992264466419

# **Creation of an informational training dataset**

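We assembled a combined dataset intended as training data for predicting the champions of the Paris 2024 Olympics. In the rows shown below the medal columns are all NaN, most likely because the medal table was joined on the sport name rather than on the country.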

In [None]:
final_merged_data

Unnamed: 0,event,tag,sport,sport_code,sport_url,Name,NOC,Discipline,Rank,Team/NOC,Gold,Silver,Bronze,Total,Rank by Total
0,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ABDULLIN Ilfat,Kazakhstan,Archery,,,,,,,
1,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ACOSTA GIRALDO Valentina,Colombia,Archery,,,,,,,
2,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ADAM Amal,Egypt,Archery,,,,,,,
3,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,AGUILAR Andres,Chile,Archery,,,,,,,
4,Men's Individual,archery,Archery,ARC,https://olympics.com/en/paris-2024/sports/archery,ALVAREZ Luis,Mexico,Archery,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179515,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHADRAYEV Demeu,Kazakhstan,Wrestling,,,,,,,
179516,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHOU Feng,People's Republic of China,Wrestling,,,,,,,
179517,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHOU Qian,People's Republic of China,Wrestling,,,,,,,
179518,Men's Freestyle 125kg,wrestling,Wrestling,WRE,https://olympics.com/en/paris-2024/sports/wres...,ZHUMANAZAROVA Meerim,Kyrgyzstan,Wrestling,,,,,,,
