# In this notebook, we explore how age, height, and weight relate to the likelihood that an athlete will win an Olympic competition.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



We first load the data from the CSV file and drop rows that have null entries in the age, weight, or height columns. We then compile the set of sports that remain usable.

In [9]:
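# Load the athlete data; read_csv already returns a DataFrame.
df = pd.read_csv('Olympics2021.csv')

# Drop rows missing any of the three features used by the model.
df = df.dropna(subset=['Height', 'Weight', 'Age'])

# Sports that still have at least one usable row.
sports = set(df['Sport'])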

set()


We filter the data by sport and sex. After formatting the features (x) and labels (y), we fit a logistic regression on each subset.

In [11]:
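def formatter(x):
  # Most weights convert directly to int.
  try:
    return int(x)
  except ValueError:
    # Fall back for string entries containing "-": keep the part before the dash.
    return int(x[:x.find("-")])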

for sport in sports:
  for sex in ['M', 'F']:
    # Boolean masks avoid building a query string, which breaks on sport names
    # containing quotes. Copy so the assignment below does not write into a view of df.
    filtered_df = df[(df['Sport'] == sport) & (df['Sex'] == sex)].copy()

    filtered_df['Weight'] = filtered_df['Weight'].apply(formatter)

    x = filtered_df[['Age', 'Height', 'Weight']]

    # Label is 1 for any medal (Bronze, Silver, or Gold) and 0 otherwise.
    y = filtered_df['Medal'].isin(['Bronze', 'Silver', 'Gold']).astype(int).to_numpy()

    if len(y) > 50 and np.any(y == 1) and np.any(y == 0):
      X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)


      model = LogisticRegression()


      model.fit(X_train, y_train)

      y_pred = model.predict(X_test)

      accuracy = accuracy_score(y_test, y_pred)


      print(f"{sport} {sex}, Coefficients: {model.coef_}, Model Accuracy: {accuracy}")









Football M, Coefficients: [[-0.13287837  0.14999418 -0.10523698]], Model Accuracy: 0.5789473684210527
Football F, Coefficients: [[ 0.08499181 -0.15926932  0.2502858 ]], Model Accuracy: 0.6388888888888888
Table Tennis M, Coefficients: [[-0.08290395 -0.08769887  0.02895855]], Model Accuracy: 0.84
Table Tennis F, Coefficients: [[-0.21692871 -0.19477148 -0.12784757]], Model Accuracy: 0.9166666666666666
Rowing M, Coefficients: [[-0.05220198 -0.09355564  0.01928158]], Model Accuracy: 0.5428571428571428
Rowing F, Coefficients: [[-0.07612992 -0.02730287  0.08352551]], Model Accuracy: 0.6551724137931034
Fencing M, Coefficients: [[ 0.06091175  0.02102989 -0.0858476 ]], Model Accuracy: 0.7586206896551724
Fencing F, Coefficients: [[ 0.19090455  0.11227426 -0.10219386]], Model Accuracy: 0.6666666666666666
Sailing M, Coefficients: [[-0.04519418  0.04173965 -0.04158381]], Model Accuracy: 0.9666666666666667
Sailing F, Coefficients: [[ 0.02653654 -0.02110006  0.01660687]], Model Accuracy: 0.78260869565
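
To put the accuracies above in context, it can help to compare each one with a majority-class baseline, that is, the accuracy obtained by always predicting the more common label. The sketch below is not part of the original analysis: `majority_baseline_accuracy` is a hypothetical helper, and the labels are made-up toy values rather than rows from `Olympics2021.csv`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label in `labels`."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy test labels (illustrative only): 1 = medal, 0 = no medal.
toy_y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_y_test)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Applied to `y_test` inside the loop, this would give a reference point for each sport and sex: a model accuracy that is not clearly above the baseline suggests that age, height, and weight add little for that subset.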

Since the 2024 Olympics participant statistics have not been released, we instead analyze the models trained on each subset of the data.

Findings:
- Height, weight, and age are predictive in sports like men's road cycling, women's tennis, and men's sailing. For these, the model reaches high test accuracy when predicting whether an athlete won from age, weight, and height alone.
- Some coefficients have unusually large magnitudes, which likely indicates the attributes that matter most to an athlete's chances of winning in that sport.
- Being older appears quite important to an athlete's chances of winning in men's handball, women's hockey, and women's artistic gymnastics. Being younger appears quite important in men's beach volleyball.
- Being tall is particularly important in women's swimming.
- Being heavy is particularly important in women's football.
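
One way to read the logistic regression coefficients behind these findings is to convert them to odds ratios: for a coefficient b, exp(b) is the factor by which the model's odds of a medal are multiplied per one-unit increase in that feature, with the other features held fixed. The sketch below applies this to the Football F coefficients printed above. The helper name `odds_ratios` is hypothetical, and the feature order is assumed to follow the columns `['Age', 'Height', 'Weight']` used to build `x`.

```python
import math

def odds_ratios(coefficients, feature_names):
    """Map each feature name to exp(coefficient), the per-unit odds multiplier."""
    return {name: math.exp(c) for name, c in zip(feature_names, coefficients)}

features = ["Age", "Height", "Weight"]
# Coefficients copied from the "Football F" line of the output above.
football_f = [0.08499181, -0.15926932, 0.2502858]

for name, ratio in odds_ratios(football_f, features).items():
    print(f"{name}: odds multiplied by {ratio:.3f} per unit increase")
```

A ratio above 1 means the feature raises the modeled odds of a medal, and a ratio below 1 lowers them. Because age, height, and weight are measured on different scales, these ratios are per unit of each feature and are not directly comparable across features.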