Access here: 👉 https://marathonpredictingfinishtime.streamlit.app/
- Name: Marvin Adorian Zanchi dos Santos
- Student Number: C00288302
- Course: BSc Software Development
- Module: Data Science & Machine Learning 1
- Lecturer: Ben OShaughnessy
- Submission Date: 4 November 2025
This project applies a Linear Regression model to predict marathon finish times using runners’ demographic and event-related data.
The dataset, sourced from Kaggle (2023 Marathon Results), contains over 420,000 entries from 600+ marathon events across the United States.
Each record includes:
- Age
- Gender
- Race name
- Finish time (seconds)
The aim is to analyze how these variables influence marathon performance and develop a model that estimates a runner’s expected finish time.
- Clean and prepare a large, real-world marathon dataset for machine learning.
- Explore relationships between age, gender, race, and finish time.
- Train and evaluate a Linear Regression model.
- Interpret the model coefficients to understand what drives performance.
- Summarize insights and propose future directions for improvement.
Source: Kaggle – 2023 Marathon Results
Dataset Summary
- ~429,000 marathon results
- Columns: Name, Race, Year, Gender, Age, Finish, Age Bracket
- Age = -1 indicates unknown (removed during cleaning)
- All events occurred in 2023
Note:
Due to its size, the dataset is not included in this repository.
Download it from Kaggle and place it inside a folder named /data.
- Language: Python
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
- Environment: Jupyter Notebook / Visual Studio Code
- Version Control: GitHub
- Removed invalid ages (<16 or >90, or age = -1)
- Kept only Male/Female categories
- Encoded Gender numerically (Male = 1, Female = 0)
- Removed columns not required for modelling: Name, Age Bracket, Year
- Removed extreme finish times (> 20,000 seconds, roughly 5.6 hours) to avoid ultra-marathon outliers
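The cleaning rules above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_results`, the raw gender labels `"M"`/`"F"`, and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Assumption: the raw CSV labels gender as "M" / "F"; adjust if it differs.
GENDER_CODES = {"M": 1, "F": 0}

def clean_results(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules listed above to a raw results table."""
    # Keep ages 16..90 inclusive; this also drops the -1 "unknown" marker.
    out = df[df["Age"].between(16, 90)]
    # Keep only the Male/Female categories.
    out = out[out["Gender"].isin(list(GENDER_CODES))]
    # Drop extreme finish times (more than 20,000 seconds).
    out = out[out["Finish"] <= 20_000]
    # Encode Gender numerically (Male = 1, Female = 0).
    out = out.assign(Gender=out["Gender"].map(GENDER_CODES))
    # Remove columns not required for modelling.
    out = out.drop(columns=["Name", "Age Bracket", "Year"])
    return out.reset_index(drop=True)

# Made-up rows that exercise each rule.
raw = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Race": ["Race X", "Race X", "Race Y", "Race Y", "Race Y", "Race X"],
    "Year": [2023] * 6,
    "Gender": ["M", "F", "X", "F", "M", "F"],
    "Age": [30, -1, 40, 55, 28, 45],
    "Finish": [12600, 15000, 14000, 25000, 16200, 15500],
    "Age Bracket": ["30-34", "Unknown", "40-44", "55-59", "Under 30", "45-49"],
})
cleaned = clean_results(raw)
print(cleaned)
```

Rows B, C, and D are dropped for an unknown age, an unsupported gender label, and an extreme finish time respectively.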
- Visualized distribution of finish times (in hours)
- Analysed age distribution
- Explored how Age, Gender, and Race relate to finish time
- Created a correlation heatmap
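As a numerical counterpart to the plots listed above, the same quantities can be computed directly in pandas. The snippet below is only a sketch on made-up rows; the `FinishHours` column name is an assumption introduced for the example.

```python
import pandas as pd

# Made-up sample in the cleaned layout (Gender: Male = 1, Female = 0).
df = pd.DataFrame({
    "Age": [25, 32, 41, 47, 53, 60],
    "Gender": [1, 0, 1, 0, 1, 0],
    "Finish": [12600, 14400, 13800, 16200, 15000, 18000],  # seconds
})

# Finish time in hours, the unit used for the distribution plot.
df["FinishHours"] = df["Finish"] / 3600

# Mean finish time per gender (what the boxplot summarises visually).
by_gender = df.groupby("Gender")["FinishHours"].mean()

# Pairwise Pearson correlations (the matrix behind a correlation heatmap).
corr = df[["Age", "Gender", "Finish"]].corr()

print(by_gender)
print(corr.round(2))
```

The `corr` matrix is the kind of input that `seaborn.heatmap(corr, annot=True)` would render as a heatmap.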
- Selected features: Age, Gender, Race
- Applied one-hot encoding to Race
- Split data into training and testing (80/20)
- Trained a Linear Regression model using scikit-learn
- Predicted Finish (seconds)
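The modelling steps above could look roughly like this with scikit-learn. It is a sketch on synthetic data, not the notebook's code: the generated values, `random_state=42`, and the use of `drop_first=True` in the one-hot encoding are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (values are made up).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Age": rng.integers(16, 91, size=n),
    "Gender": rng.integers(0, 2, size=n),  # Male = 1, Female = 0
    "Race": rng.choice(["Race A", "Race B", "Race C"], size=n),
})
df["Finish"] = (
    14_000 + 35 * df["Age"] - 1_100 * df["Gender"] + rng.normal(0, 2_000, size=n)
)

# Selected features, with Race one-hot encoded.
# Assumption: drop_first=True, so one race serves as the baseline category.
X = pd.get_dummies(
    df[["Age", "Gender", "Race"]], columns=["Race"], drop_first=True, dtype=int
)
y = df["Finish"]

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model and predict Finish (seconds) on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(dict(zip(X.columns, model.coef_.round(1))))
```

With one-hot encoding, each race coefficient is read relative to the dropped baseline race.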
- R²: 0.186
- MAE: 32.29 minutes
- RMSE: 39.16 minutes
- Age effect: +34.7 seconds/year (~0.58 minutes per year)
- Gender effect: Male runners finish ~19 minutes faster on average, with age and race held constant
- Race: Significant impact, likely reflecting course difficulty, terrain, and climate
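Since the model predicts seconds while the metrics above are quoted in minutes, a unit conversion is involved. The sketch below shows one way to compute them; the helper name `regression_report` and the three sample values are assumptions for illustration and are not the project's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true_sec, y_pred_sec):
    """Return R2 plus MAE and RMSE converted from seconds to minutes."""
    mae_sec = mean_absolute_error(y_true_sec, y_pred_sec)
    rmse_sec = np.sqrt(mean_squared_error(y_true_sec, y_pred_sec))
    return {
        "r2": r2_score(y_true_sec, y_pred_sec),
        "mae_min": mae_sec / 60,
        "rmse_min": rmse_sec / 60,
    }

# Made-up finish times in seconds.
y_true = [12_000, 15_000, 18_000]
y_pred = [12_600, 14_400, 18_000]
report = regression_report(y_true, y_pred)
print(report)

# Converting the reported age coefficient from seconds to minutes per year.
age_coef_sec = 34.7
age_coef_min = age_coef_sec / 60
print(f"{age_coef_min:.2f} minutes per year of age")
```

The same division by 60 links the age coefficient of 34.7 seconds per year to the quoted figure of about 0.58 minutes per year.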
- Distribution of finish times
- Distribution of runner ages
- Age vs Finish scatterplot
- Gender vs Finish time boxplot
- Actual vs Predicted scatterplot with perfect-fit line
- Feature coefficient ranking
The linear regression model explains around 19% of the variance in marathon finish times. This is expected, given that marathon performance depends on many additional factors not captured in the dataset:
- training load
- pacing strategy
- athlete experience
- physiological variables (VO₂ max)
- course elevation
- weather conditions
Even so, the model provides clear and interpretable insights into how age, gender, and race influence performance.
- Install dependencies:
  pip install -r requirements.txt
- Place the dataset file here:
  /data/Results.csv
- Open and run the notebook:
  notebook/marathon_predicting_finish_time.ipynb
- Dataset: Kaggle – 2023 Marathon Results
- Python Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn
- Course Content: Data Science & Machine Learning 1
- YouTube Tutorials: Exploratory data analysis & regression modelling
- ChatGPT (OpenAI, 2025): Assisted with documentation, structure, and formatting
Developed by Marvin Adorian Zanchi dos Santos
BSc in Software Development, 4th-year student at South East Technological University (SETU), Carlow Campus