ApexPredict

This repo is a small full-stack project that predicts Formula 1 finishing positions from historical race data. I built it as something I could walk through in an interview: where the data comes from, how features are built without peeking at the future, what the model is doing, and where it falls short.

The upstream data is the Jolpica F1 API, which serves Ergast-compatible JSON at https://api.jolpi.ca/ergast/f1/. The old Ergast host is gone; Jolpica is the practical replacement. Use of their API is subject to their terms.
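For orientation, Ergast-compatible responses nest race results under MRData, RaceTable, Races, and then Results, with numeric values delivered as strings. The sketch below flattens one such payload into rows. The function name flatten_results and the chosen columns are illustrative assumptions, not necessarily what backend/data/fetch.py does, and the sample payload is hand-written rather than real data.

```python
from typing import Any


def flatten_results(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one Ergast-style results payload into one flat row per driver."""
    rows = []
    for race in payload["MRData"]["RaceTable"]["Races"]:
        for result in race.get("Results", []):
            rows.append(
                {
                    # Ergast-style JSON delivers numbers as strings, so cast them.
                    "season": int(race["season"]),
                    "round": int(race["round"]),
                    "circuit_id": race["Circuit"]["circuitId"],
                    "driver_id": result["Driver"]["driverId"],
                    "grid": int(result["grid"]),
                    "position": int(result["position"]),
                }
            )
    return rows


# Hand-written sample trimmed to the fields used above (placeholder ids, not real data).
sample = {
    "MRData": {
        "RaceTable": {
            "Races": [
                {
                    "season": "2024",
                    "round": "1",
                    "Circuit": {"circuitId": "example_circuit"},
                    "Results": [
                        {"position": "1", "grid": "2", "Driver": {"driverId": "driver_a"}},
                        {"position": "2", "grid": "1", "Driver": {"driverId": "driver_b"}},
                    ],
                }
            ]
        }
    }
}

rows = flatten_results(sample)
```

Rows in this shape are easy to load into pandas and join with qualifying and pit-stop tables on season, round, and driver_id.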

Repository: github.com/malik-builds/ApexPredict

What you get

  • A Python pipeline that downloads seasons 2018–2024, merges results, qualifying, and pit stops, and writes backend/data/processed_races.csv.
  • A regression model (RandomForest + XGBoost averaged in a VotingRegressor) with time-series cross-validation so we do not train on future races.
  • A FastAPI service with /predict, /health, and /feature-importance/{circuit_id}.
  • A Next.js 14 dashboard (dark UI, Recharts) that talks to the API via NEXT_PUBLIC_API_URL only.

Stack (short version)

| Piece | Why it is there |
| --- | --- |
| pandas | Joining multi-source race tables |
| scikit-learn + XGBoost | Tabular data, nonlinear effects, tree importances |
| FastAPI | Quick JSON API and CORS for the frontend |
| Next.js 14 + Tailwind + Recharts | App Router, styling, horizontal bar chart for importances |
| Docker (optional) | Railway-style deploy with PORT |

Run the backend

Work from the repo root (the directory that contains backend/, not inside backend/). On macOS you usually want python3, not python.

cd ApexPredict
python3 -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt

If XGBoost fails to load on a Mac with a missing libomp.dylib, install OpenMP (brew install libomp) and try again.

export PYTHONPATH="$(pwd)"
python3 -m backend.data.fetch
python3 -m backend.model.train
python3 -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

The fetch step makes many requests and deliberately sleeps between them, so expect it to run for a while. Training writes:

  • backend/models/apexpredict_model.pkl
  • backend/models/encoders.pkl
  • backend/models/feature_cols.pkl
  • backend/models/training_metrics.json

If you see No module named 'backend', PYTHONPATH was probably set from the wrong directory (for example, after cd backend). If pandas is reported missing, you are likely running a global uvicorn instead of the one in .venv; activate the venv and use python3 -m uvicorn.
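Both failure modes can be checked from Python before starting the server. This is a small diagnostic sketch using only the standard library; the function name environment_report is made up for illustration and is not part of the repo.

```python
import importlib.util
import sys


def environment_report() -> dict[str, object]:
    """Collect the facts that explain the two import errors described above."""
    return {
        # The interpreter actually running; it should live under .venv/ once activated.
        "executable": sys.executable,
        # True when running inside a virtual environment.
        "in_venv": sys.prefix != sys.base_prefix,
        # True when the top-level 'backend' package can be found on sys.path.
        "backend_importable": importlib.util.find_spec("backend") is not None,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Run it from the repo root with the venv active. If in_venv is False or backend_importable is False, fix the activation or PYTHONPATH before looking elsewhere.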

Run the frontend

cd frontend
cp .env.local.example .env.local

Set NEXT_PUBLIC_API_URL=http://localhost:8000 in .env.local, then:

npm install
npm run dev

After you regenerate processed_races.csv, refresh the dropdown lists:

python3 scripts/refresh_meta.py

API

| Method | Path | Notes |
| --- | --- | --- |
| GET | /health | Status, whether the model loaded, and CV metrics when training_metrics.json exists |
| POST | /predict | JSON body with driver id, circuit id, grid, pit stats, points, wet flag. Missing historical circuit averages are filled from the CSV. |
| GET | /feature-importance/{circuit_id} | Top five features from a RandomForest fit on that circuit’s rows (falls back to full data if the circuit is too small) |
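As a rough illustration of calling /predict from Python, the sketch below builds a JSON POST request with the standard library. The field names in payload are guesses based on the description above, not the confirmed schema; check the request model in the backend (or the interactive docs FastAPI serves at /docs by default) for the real names.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # matches the uvicorn command in this README

# Hypothetical field names and placeholder values; the real schema may differ.
payload = {
    "driver_id": "example_driver",
    "circuit_id": "example_circuit",
    "grid": 3,
    "avg_pit_time": 24.5,
    "points": 120.0,
    "is_wet": False,
}


def build_predict_request(base_url: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /predict."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(API_URL, payload)

# To actually send it, with the API running locally:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read()))
```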

CORS is open for local dev and typical preview deployments.

Modeling notes

Ensemble
RandomForest gives stable global feature importances; gradient boosting often trims the error a little further. Averaging the two with VotingRegressor is simple to explain and deploy.
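With default (equal) weights, a voting regressor simply returns the mean of its members' predictions. The sketch below shows that idea in plain Python. The two predictor functions are hypothetical stand-ins, not the fitted RandomForest and XGBoost models from this repo.

```python
from statistics import fmean
from typing import Callable, Sequence

Predictor = Callable[[Sequence[float]], float]


def average_ensemble(models: Sequence[Predictor]) -> Predictor:
    """Return a predictor whose output is the unweighted mean of its members."""

    def predict(features: Sequence[float]) -> float:
        return fmean(model(features) for model in models)

    return predict


# Hypothetical stand-ins: each maps a feature row (here just [grid]) to a position.
def forest_like(features: Sequence[float]) -> float:
    return features[0] + 1.0


def boosting_like(features: Sequence[float]) -> float:
    return features[0] - 0.5


ensemble = average_ensemble([forest_like, boosting_like])
prediction = ensemble([4.0])  # mean of 5.0 and 3.5
```

Because the combination is a plain average, one model's overshoot can be partly cancelled by the other's undershoot, which is the main appeal of this setup.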

TimeSeriesSplit
Race rows are ordered in time. Shuffled k-fold would leak information from later races into training when scoring earlier periods. Time-series splits keep training strictly in the past relative to each validation fold.
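The fold layout can be sketched without any libraries. This helper mirrors the expanding-window scheme that scikit-learn's TimeSeriesSplit uses with default settings, as far as I understand it; the function name is made up, and the repo itself relies on the scikit-learn class rather than this code.

```python
from typing import Iterator


def expanding_window_splits(
    n_samples: int, n_splits: int
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train, validation) index lists over rows sorted oldest-first.

    Every validation fold starts after the last row of its training set.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_start = n_samples - n_splits * fold_size
    for start in range(first_start, n_samples, fold_size):
        yield list(range(start)), list(range(start, start + fold_size))


splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, valid_idx in splits:
    print(f"train 0..{train_idx[-1]}  validate {valid_idx[0]}..{valid_idx[-1]}")
```

With 12 rows and 3 splits, training grows from rows 0..2 to 0..5 to 0..8 while validation moves forward through 3..5, 6..8, and 9..11, so no fold is scored on races that precede its training data.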

MAE
The target is a finishing position, so mean absolute error is reported in “positions wrong,” which is easy to reason about. R² is secondary.
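A worked example makes the unit concrete. The positions below are made up for illustration and are not results from this model.

```python
def mean_absolute_error(actual: list[int], predicted: list[float]) -> float:
    """Average absolute gap between true and predicted finishing positions."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up finishing positions for four drivers in one race.
actual = [1, 2, 3, 4]
predicted = [2.0, 2.0, 5.0, 3.0]
mae = mean_absolute_error(actual, predicted)  # (1 + 0 + 2 + 1) / 4 = 1.0
```

An MAE of 1.0 here reads as "off by one place on average," even though one driver was missed by two places and another was predicted exactly.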

Circuit-specific importances
One global model blurs what matters where. Monaco is famously grid-heavy; high-speed tracks lean more on strategy and car traits. The circuit endpoint retrains a small forest on filtered rows so you can show something local for a talk or a chart.
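The row selection with its fallback can be sketched as follows. The function name, the circuit_id column name, and the min_rows threshold are all assumptions for illustration; the actual cutoff in the backend is not stated here.

```python
def rows_for_circuit(
    rows: list[dict], circuit_id: str, min_rows: int = 50
) -> list[dict]:
    """Return the circuit's rows, or every row when the circuit has too few."""
    subset = [row for row in rows if row["circuit_id"] == circuit_id]
    return subset if len(subset) >= min_rows else rows


# Made-up rows: three for circuit "a", one for circuit "b".
rows = [
    {"circuit_id": "a", "grid": 1},
    {"circuit_id": "a", "grid": 5},
    {"circuit_id": "a", "grid": 9},
    {"circuit_id": "b", "grid": 2},
]

enough = rows_for_circuit(rows, "a", min_rows=2)    # circuit-specific rows
fallback = rows_for_circuit(rows, "b", min_rows=2)  # too few, so all rows
```

A small forest would then be fit on whichever set comes back, which keeps the endpoint from reporting importances learned from only a handful of races.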

What I would add next
Real weather, tyre compounds, sector times, and a calibration layer for podium probability instead of the simple heuristic mapping from predicted position.

Docker

From the repo root:

docker build -f backend/Dockerfile -t apexpredict-api .
docker run -e PORT=8000 -p 8000:8000 apexpredict-api

The image expects backend/data and backend/models to be present in the copied tree (run fetch and train before building, or bake that into CI).

Deploy hints

  • API (e.g. Railway): set PORT; run the same uvicorn command as in the Dockerfile.
  • Frontend (e.g. Vercel): set NEXT_PUBLIC_API_URL to the public API URL. No other API host is hardcoded in the app.

License

MIT. See LICENSE. F1 data remains subject to Jolpica’s terms.
