This repo is a small full-stack project that predicts Formula 1 finishing positions from historical race data. I built it as something I could walk through in an interview: where the data comes from, how features are built without peeking at the future, what the model is doing, and where it falls short.
The upstream data is the Jolpica F1 API, which serves Ergast-compatible JSON at https://api.jolpi.ca/ergast/f1/. The old Ergast host is gone; Jolpica is the practical replacement. Use of their API is subject to their terms.
Repository: github.com/malik-builds/ApexPredict
- A Python pipeline that downloads seasons 2018–2024, merges results, qualifying, and pit stops, and writes `backend/data/processed_races.csv`.
- A regression model (RandomForest + XGBoost averaged in a `VotingRegressor`) with time-series cross-validation so we do not train on future races.
- A FastAPI service with `/predict`, `/health`, and `/feature-importance/{circuit_id}`.
- A Next.js 14 dashboard (dark UI, Recharts) that talks to the API via `NEXT_PUBLIC_API_URL` only.
| Piece | Why it is there |
|---|---|
| pandas | Joining multi-source race tables |
| scikit-learn + XGBoost | Tabular data, nonlinear effects, tree importances |
| FastAPI | Quick JSON API and CORS for the frontend |
| Next.js 14 + Tailwind + Recharts | App Router, styling, horizontal bar chart for importances |
| Docker (optional) | Railway-style deploy with PORT |
Work from the repo root (the directory that contains backend/, not inside backend/). On macOS you usually want python3, not python.

    cd ApexPredict
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r backend/requirements.txt

If XGBoost fails to load on a Mac with a missing `libomp.dylib`, install OpenMP (`brew install libomp`) and try again.

    export PYTHONPATH="$(pwd)"
    python3 -m backend.data.fetch
    python3 -m backend.model.train
    python3 -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

The fetch step hits the network for a long time and sleeps between requests on purpose. Training writes:

- `backend/models/apexpredict_model.pkl`
- `backend/models/encoders.pkl`
- `backend/models/feature_cols.pkl`
- `backend/models/training_metrics.json`

If `No module named 'backend'` appears, `PYTHONPATH` was probably set from the wrong directory (for example after `cd backend`). If pandas is missing, you are likely running a global `uvicorn` instead of the one in `.venv`; use `python3 -m uvicorn` after activating the venv.

    cd frontend
    cp .env.local.example .env.local

Set `NEXT_PUBLIC_API_URL=http://localhost:8000` in `.env.local`, then:

    npm install
    npm run dev

After you regenerate `processed_races.csv`, refresh the dropdown lists:

    python3 scripts/refresh_meta.py

| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Status, whether the model loaded, and CV metrics when `training_metrics.json` exists |
| POST | `/predict` | JSON body with driver id, circuit id, grid, pit stats, points, wet flag. Missing historical circuit averages are filled from the CSV. |
| GET | `/feature-importance/{circuit_id}` | Top five features from a RandomForest fit on that circuit’s rows (falls back to full data if the circuit is too small) |

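As a rough sketch of calling `/predict` from Python with only the standard library: the payload field names below (`driver_id`, `circuit_id`, `grid`, `avg_pit_time`, `pit_stops`, `points`, `is_wet`) and their values are assumptions made up for illustration, since the body is only described in words here. Check the request schema in the backend for the real names. The base URL assumes the API is running locally on port 8000.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed: API running locally on port 8000


def build_predict_request(payload: dict, base_url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to /predict."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs the API running)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Hypothetical field names and values; the real schema lives in the backend code.
example_payload = {
    "driver_id": "hamilton",
    "circuit_id": "monaco",
    "grid": 3,
    "avg_pit_time": 24.1,
    "pit_stops": 2,
    "points": 120.0,
    "is_wet": False,
}

request = build_predict_request(example_payload)
# print(send(request))  # uncomment once the API is up
```

Building the request separately from sending it keeps the payload easy to inspect when the API rejects a field.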
CORS is open for local dev and typical preview deployments.
Ensemble
RandomForest gives stable global feature importances; gradient boosting often squeezes out a bit of error. Averaging the two with VotingRegressor is simple to explain and deploy.
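To make the averaging concrete: with no weights, `VotingRegressor` predicts the plain mean of its members' predictions. The sketch below reproduces that arithmetic in pure Python with two made-up stand-in predictors. They are not the repo's trained models, and the coefficients are illustrative only.

```python
from statistics import fmean
from typing import Callable, Sequence

Predictor = Callable[[dict], float]


def average_predictions(models: Sequence[Predictor], row: dict) -> float:
    """Unweighted mean of each model's prediction for one input row."""
    return fmean(model(row) for model in models)


# Toy stand-ins for the two fitted models (hypothetical rules, not learned).
def forest_like(row: dict) -> float:
    return row["grid"] * 0.8 + 1.0


def boosting_like(row: dict) -> float:
    return row["grid"] * 0.9 + 1.0


# For grid=5 the stand-ins predict 5.0 and 5.5, so the ensemble predicts 5.25.
predicted = average_predictions([forest_like, boosting_like], {"grid": 5})
print(predicted)
```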
TimeSeriesSplit
Race rows are ordered in time. Shuffled k-fold would leak information from later races into training when scoring earlier periods. Time-series splits keep training strictly in the past relative to each validation fold.
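A minimal pure-Python sketch of the expanding-window idea, assuming rows are already sorted by race date. It is written in the spirit of scikit-learn's `TimeSeriesSplit` but is not that class; the function name and fold sizing are choices made for illustration.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists where train always precedes validation."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough rows for the requested number of splits")
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train = list(range(0, val_start))  # everything before the fold
        validation = list(range(val_start, val_start + fold_size))
        yield train, validation


folds = list(expanding_window_splits(n_samples=10, n_splits=4))
for train, validation in folds:
    # Every training index is earlier than every validation index.
    assert max(train) < min(validation)
    print(f"train={train} validation={validation}")
```

Each fold trains on a longer prefix of the timeline than the last, so no validation race is ever older than a race the model trained on.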
MAE
The target is a finishing position, so mean absolute error is reported in “positions wrong,” which is easy to reason about. R² is secondary.
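As a worked example of the metric, here is MAE computed by hand on made-up numbers. The finishes and predictions below are illustrative, not output from this model.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted|, in the same unit as the target (positions)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only: three drivers' real finishes vs. predicted finishes.
actual_positions = [2, 10, 1]
predicted_positions = [3.5, 8.0, 1.5]

# Absolute errors are 1.5, 2.0 and 0.5, so the mean is 4.0 / 3.
mae = mean_absolute_error(actual_positions, predicted_positions)
print(f"MAE: {mae:.2f} positions")
```

An MAE of about 1.33 here reads directly as "off by roughly one and a third places per driver on average".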
Circuit-specific importances
One global model blurs what matters where. Monaco is famously grid-heavy; high-speed tracks lean more on strategy and car traits. The circuit endpoint retrains a small forest on filtered rows so you can show something local for a talk or a chart.
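The row-selection and top-five steps behind the circuit endpoint can be sketched without any ML library. The `circuit_id` key, the `min_rows` threshold of 30, the feature names, and the importance values are all assumptions for illustration; the actual cutoff and column names are in the backend code.

```python
from typing import Dict, List, Tuple


def rows_for_circuit(rows: List[dict], circuit_id: str, min_rows: int = 30) -> List[dict]:
    """Return the circuit's rows, or all rows if the circuit has too few."""
    subset = [row for row in rows if row["circuit_id"] == circuit_id]
    return subset if len(subset) >= min_rows else rows


def top_features(importances: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Sort features by importance, highest first, and keep the top k."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Tiny made-up dataset: 40 Monaco rows and 5 Spa rows.
rows = [{"circuit_id": "monaco"}] * 40 + [{"circuit_id": "spa"}] * 5
monaco_rows = rows_for_circuit(rows, "monaco")  # enough rows: filtered subset
spa_rows = rows_for_circuit(rows, "spa")        # too few rows: falls back to all

# Hypothetical importances, shaped like what a fitted forest would report.
importances = {
    "grid": 0.55, "points": 0.20, "pit_stops": 0.10,
    "avg_pit_time": 0.08, "is_wet": 0.04, "round": 0.03,
}
print(top_features(importances))
```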
What I would add next
Real weather, tyre compounds, sector times, and a calibration layer for podium probability instead of the simple heuristic mapping from predicted position.
From the repo root:

    docker build -f backend/Dockerfile -t apexpredict-api .
    docker run -e PORT=8000 -p 8000:8000 apexpredict-api

The image expects `backend/data` and `backend/models` to be present in the copied tree (run fetch and train before building, or bake that into CI).

- API (e.g. Railway): set `PORT`; run the same `uvicorn` command as in the Dockerfile.
- Frontend (e.g. Vercel): set `NEXT_PUBLIC_API_URL` to the public API URL. No other API host is hardcoded in the app.

MIT. See LICENSE. F1 data remains subject to Jolpica’s terms.