ApexPredict

This repo is a small full-stack project that predicts Formula 1 finishing positions from historical race data. I built it as something I could walk through in an interview: where the data comes from, how features are built without peeking at the future, what the model is doing, and where it falls short.

The upstream data is the Jolpica F1 API, which serves Ergast-compatible JSON at https://api.jolpi.ca/ergast/f1/. The old Ergast host is gone; Jolpica is the practical replacement. Use of their API is subject to their terms.
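As a rough illustration of what a polite client for this API could look like, here is a minimal sketch using only the standard library. The base URL comes from this README; the `/{season}/{round}/results.json` path follows the Ergast convention and is an assumption here, as are the helper names `results_url` and `fetch_json` and the one-second pause. It is not the code in backend/data/fetch.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.jolpi.ca/ergast/f1"  # base URL given in this README


def results_url(season: int, round_no: int) -> str:
    """Build an Ergast-style race results URL (path layout is an assumption)."""
    return f"{BASE_URL}/{season}/{round_no}/results.json"


def fetch_json(url: str, pause_s: float = 1.0) -> dict:
    """GET one URL, parse the JSON body, then sleep to space out requests."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    time.sleep(pause_s)  # hypothetical pause; tune it to the API's rate limits
    return payload


# Example (needs network access):
# data = fetch_json(results_url(2024, 1))
print(results_url(2024, 1))
```

Sleeping after every request keeps a multi-season download slow but friendly to a free public service.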

Repository: github.com/malik-builds/ApexPredict

What you get

A Python pipeline that downloads seasons 2018–2024, merges results, qualifying, and pit stops, and writes backend/data/processed_races.csv.
A regression model (RandomForest + XGBoost averaged in a VotingRegressor), evaluated with time-series cross-validation so that no fold trains on races that happen after the ones it is scored on.
A FastAPI service with /predict, /health, and /feature-importance/{circuit_id}.
A Next.js 14 dashboard (dark UI, Recharts) that talks to the API via NEXT_PUBLIC_API_URL only.
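To make the merge step and the "no peeking at the future" idea concrete, here is a small pandas sketch on toy data. The column names (`season`, `round`, `driver_id`, `grid`, `duration_s`, `prior_mean_finish`) are assumptions for illustration and may not match processed_races.csv.

```python
import pandas as pd

# Toy rows standing in for the three sources; column names are assumptions.
results = pd.DataFrame({
    "season": [2024, 2024, 2024, 2024],
    "round": [1, 1, 2, 2],
    "driver_id": ["a", "b", "a", "b"],
    "position": [1, 2, 3, 1],
})
qualifying = pd.DataFrame({
    "season": [2024, 2024, 2024, 2024],
    "round": [1, 1, 2, 2],
    "driver_id": ["a", "b", "a", "b"],
    "grid": [2, 1, 1, 3],
})
pit_stops = pd.DataFrame({
    "season": [2024, 2024, 2024],
    "round": [1, 1, 2],
    "driver_id": ["a", "a", "b"],
    "duration_s": [22.0, 24.0, 23.0],
})

keys = ["season", "round", "driver_id"]

# Pit stops are one row per stop, so aggregate to one row per driver per race.
pit_summary = pit_stops.groupby(keys, as_index=False).agg(
    pit_count=("duration_s", "count"),
    pit_mean_s=("duration_s", "mean"),
)

# Left joins keep every result row even when a source has no match.
races = (
    results.merge(qualifying, on=keys, how="left", validate="one_to_one")
    .merge(pit_summary, on=keys, how="left", validate="one_to_one")
    .sort_values(keys)
    .reset_index(drop=True)
)

# Leak-free feature: mean finish over the driver's earlier races only.
# shift(1) removes the current race from its own average.
races["prior_mean_finish"] = races.groupby("driver_id")["position"].transform(
    lambda s: s.shift(1).expanding().mean()
)
print(races)
```

The first race for each driver gets a missing value for `prior_mean_finish`, which is the honest answer: there is no history yet.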

Stack (short version)

| Piece | Why it is there |
| --- | --- |
| pandas | Joining multi-source race tables |
| scikit-learn + XGBoost | Tabular data, nonlinear effects, tree importances |
| FastAPI | Quick JSON API and CORS for the frontend |
| Next.js 14 + Tailwind + Recharts | App Router, styling, horizontal bar chart for importances |
| Docker (optional) | Railway-style deploy with `PORT` |

Run the backend

Work from the repo root (the directory that contains backend/, not inside backend/). On macOS you usually want python3, not python.

cd ApexPredict
python3 -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt

If XGBoost fails to load on a Mac with a missing libomp.dylib, install OpenMP (brew install libomp) and try again.

export PYTHONPATH="$(pwd)"
python3 -m backend.data.fetch
python3 -m backend.model.train
python3 -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

The fetch step makes many network requests and deliberately sleeps between them, so expect it to run for a while. Training writes:

backend/models/apexpredict_model.pkl
backend/models/encoders.pkl
backend/models/feature_cols.pkl
backend/models/training_metrics.json

If you see No module named 'backend', PYTHONPATH was probably set from the wrong directory (for example after cd backend); set it again from the repo root. If pandas is reported missing, you are likely running a global uvicorn instead of the one in .venv; activate the venv and use python3 -m uvicorn.
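When it is unclear which of those two problems you have, a quick check like the following can help. It is a hypothetical helper (the name `diagnose` is not part of the repo) that only reads interpreter and environment state.

```python
import importlib.util
import os
import sys


def diagnose() -> dict:
    """Collect the facts that explain the two import errors above."""
    return {
        # True when the interpreter belongs to a virtual environment.
        "in_venv": sys.prefix != sys.base_prefix,
        "interpreter": sys.executable,
        # The repo root should contain backend/ for `backend.*` imports.
        "cwd_has_backend": os.path.isdir(os.path.join(os.getcwd(), "backend")),
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        # find_spec returns None when a module cannot be located.
        "pandas_found": importlib.util.find_spec("pandas") is not None,
    }


for key, value in diagnose().items():
    print(f"{key}: {value}")
```

If `in_venv` is False or `interpreter` points outside .venv, activate the venv first; if `cwd_has_backend` is False, you are not in the repo root.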

Run the frontend

cd frontend
cp .env.local.example .env.local

Set NEXT_PUBLIC_API_URL=http://localhost:8000 in .env.local, then:

npm install
npm run dev

After you regenerate processed_races.csv, refresh the dropdown lists. Run this from the repo root, since scripts/ sits next to backend/ and frontend/:

python3 scripts/refresh_meta.py

API

| Method | Path | Notes |
| --- | --- | --- |
| GET | `/health` | Status, whether the model loaded, and CV metrics when `training_metrics.json` exists |
| POST | `/predict` | JSON body with driver id, circuit id, grid, pit stats, points, wet flag. Missing historical circuit averages are filled from the CSV. |
| GET | `/feature-importance/{circuit_id}` | Top five features from a RandomForest fit on that circuit’s rows (falls back to full data if the circuit is too small) |

CORS is open for local dev and typical preview deployments.
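A call to /predict from Python might be assembled as below. This is only a sketch: the JSON field names are hypothetical placeholders for "driver id, circuit id, grid, pit stats, points, wet flag", so check the request model in the backend for the real ones. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # matches the local uvicorn command

# Field names are hypothetical; the real request schema lives in the backend.
payload = {
    "driver_id": "max_verstappen",
    "circuit_id": "monaco",
    "grid": 1,
    "pit_count": 1,
    "pit_mean_s": 23.5,
    "points": 100.0,
    "is_wet": False,
}


def build_predict_request(base_url: str, body: dict) -> urllib.request.Request:
    """Create (but do not send) a JSON POST request for /predict."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(API_URL, payload)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=10) as resp:
#     print(json.load(resp))
```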

Modeling notes

Ensemble
RandomForest gives stable global feature importances; gradient boosting often squeezes out a bit of error. Averaging the two with VotingRegressor is simple to explain and deploy.
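The averaging can be checked on synthetic data. In this sketch scikit-learn's GradientBoostingRegressor stands in for XGBoost so the example has no extra dependency; the features, sample size, and hyperparameters are made up and are not the repo's training configuration.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

# Synthetic data: two made-up features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(1, 20, size=(200, 2))
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, 200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
# Stand-in for xgboost.XGBRegressor, which exposes the same fit/predict API.
boosted = GradientBoostingRegressor(random_state=0)

ensemble = VotingRegressor([("rf", forest), ("gb", boosted)])
ensemble.fit(X, y)

# With no weights given, the ensemble prediction is the plain mean
# of the fitted members' predictions.
member_preds = np.column_stack([m.predict(X[:5]) for m in ensemble.estimators_])
matches_mean = np.allclose(ensemble.predict(X[:5]), member_preds.mean(axis=1))
print(matches_mean)
```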

TimeSeriesSplit
Race rows are ordered in time. Shuffled k-fold would leak information from later races into training when scoring earlier periods. Time-series splits keep training strictly in the past relative to each validation fold.
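A tiny example shows the shape of the folds. It uses scikit-learn's TimeSeriesSplit on 12 placeholder rows that are assumed to be sorted oldest to newest; the row count and number of splits are arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

race_order = np.arange(12)  # 12 rows already sorted oldest to newest

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(race_order))
for train_idx, val_idx in folds:
    print(
        f"train {train_idx.min()}..{train_idx.max()} -> "
        f"validate {val_idx.min()}..{val_idx.max()}"
    )
```

Each fold trains on rows 0..2, 0..5, and 0..8 and validates on the next three rows, so the training window only ever grows forward in time. One caveat: splitting by row can cut a single race across two folds, so splitting over unique race identifiers may be cleaner in practice.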

MAE
The target is a finishing position, so mean absolute error is reported in “positions wrong,” which is easy to reason about. R² is secondary.
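As a worked example with made-up numbers, MAE is just the average size of the miss, in positions:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted finishing positions."""
    gaps = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(gaps) / len(gaps)


actual = [1, 4, 10, 15]            # illustrative finishing positions
predicted = [2.0, 4.5, 7.0, 15.5]  # illustrative model outputs

mae = mean_absolute_error(actual, predicted)
print(mae)  # 1.25
```

The misses are 1, 0.5, 3, and 0.5 positions, which average to 1.25: on these four rows the model is off by a little over one place.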

Circuit-specific importances
One global model blurs what matters where. Monaco is famously grid-heavy; high-speed tracks lean more on strategy and car traits. The circuit endpoint retrains a small forest on filtered rows so you can show something local for a talk or a chart.
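The filter-then-fall-back logic can be sketched as follows. This is an illustration on synthetic data, not the endpoint's code: the function name, the `min_rows=30` threshold, and the column names are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def circuit_importances(df, circuit_id, feature_cols, target="position",
                        min_rows=30, top_n=5):
    """Rank features for one circuit; use all rows if the circuit is too small."""
    subset = df[df["circuit_id"] == circuit_id]
    scope = "circuit"
    if len(subset) < min_rows:  # min_rows is a hypothetical threshold
        subset, scope = df, "all"
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(subset[feature_cols], subset[target])
    ranked = pd.Series(forest.feature_importances_, index=feature_cols)
    return scope, ranked.sort_values(ascending=False).head(top_n)


# Synthetic demo: the finish tracks the grid slot closely in every row.
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "circuit_id": ["monaco"] * 60 + ["monza"] * 55 + ["tiny"] * 5,
    "grid": rng.integers(1, 21, n),
    "pit_mean_s": rng.normal(23, 1, n),
    "points": rng.uniform(0, 200, n),
})
demo["position"] = demo["grid"] + rng.normal(0, 0.5, n)

features = ["grid", "pit_mean_s", "points"]
scope, top = circuit_importances(demo, "monaco", features)
print(scope, top.index[0])
```

Returning the scope alongside the ranking lets a caller label the chart honestly when the numbers come from the full dataset rather than the requested circuit.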

What I would add next
Real weather, tyre compounds, sector times, and a calibration layer for podium probability instead of the simple heuristic mapping from predicted position.

Docker

From the repo root:

docker build -f backend/Dockerfile -t apexpredict-api .
docker run -e PORT=8000 -p 8000:8000 apexpredict-api

The image expects backend/data and backend/models to be present in the copied tree (run fetch and train before building, or bake that into CI).

Deploy hints

API (e.g. Railway): set PORT; run the same uvicorn command as in the Dockerfile.
Frontend (e.g. Vercel): set NEXT_PUBLIC_API_URL to the public API URL. No other API host is hardcoded in the app.
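If you want a launcher that respects the platform's PORT, one option is a small Python wrapper like this. It is a sketch under assumptions: the Dockerfile's actual command is not shown in this README, and `uvicorn_args` is a hypothetical helper that simply mirrors the local uvicorn command with the port taken from the environment.

```python
import os


def uvicorn_args(env=None) -> list:
    """Build the uvicorn command, reading the port from PORT (default 8000)."""
    env = os.environ if env is None else env
    port = int(env.get("PORT", "8000"))  # int() rejects non-numeric values early
    return [
        "python3", "-m", "uvicorn", "backend.main:app",
        "--host", "0.0.0.0", "--port", str(port),
    ]


print(" ".join(uvicorn_args({"PORT": "9000"})))

# To actually start the server:
# import subprocess
# subprocess.run(uvicorn_args(), check=True)
```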

License

MIT. See LICENSE. F1 data remains subject to Jolpica’s terms.
