## Powerlifting EDA and Linear Regression
This notebook performs exploratory data analysis (EDA) on the [OpenPowerlifting](https://www.openpowerlifting.org/) dataset and fits linear regression models to it. It aims to highlight the features most strongly associated with squat, bench press and deadlift results over time.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

### Download dataset
The following cell downloads the latest OpenPowerlifting archive, extracts the CSV and reads it into a DataFrame. If the archive or the extracted CSV is already present, it is reused.

In [None]:
import zipfile
from pathlib import Path

import requests

DATA_ZIP = Path("openpowerlifting-latest.zip")
CSV_NAME = Path("openpowerlifting-latest.csv")

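if not CSV_NAME.exists():
    if not DATA_ZIP.exists():
        # A timeout keeps the cell from hanging on a stalled connection.
        r = requests.get("https://openpowerlifting.gitlab.io/opl-csv/files/openpowerlifting-latest.zip", timeout=60)
        r.raise_for_status()
        DATA_ZIP.write_bytes(r.content)
    with zipfile.ZipFile(DATA_ZIP) as z:
        # The CSV inside the archive may carry a dated name or sit in a subfolder,
        # so locate it by extension instead of assuming it is called CSV_NAME.
        member = next(n for n in z.namelist() if n.endswith(".csv"))
        CSV_NAME.write_bytes(z.read(member))

# low_memory=False avoids mixed-dtype warnings on this wide, sparse file.
df = pd.read_csv(CSV_NAME, low_memory=False)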
df.head()

### Basic cleaning
For this example only a subset of columns is used and rows with missing data are dropped.

In [None]:
cols = ["Date", "Name", "Sex", "BodyweightKg", "Best3SquatKg", "Best3BenchKg", "Best3DeadliftKg", "TotalKg", "Age"]
df = df[cols].dropna()
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df.head()
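
Dropping missing values does not remove implausible ones. As a minimal sketch, assuming the `Best3*Kg` columns can occasionally hold zero or negative values (worth checking against the dataset's documentation), the hypothetical helper `keep_positive_lifts` below keeps only rows where all three lifts are strictly positive. The `demo` frame is made-up data for illustration, not real OpenPowerlifting rows.

```python
import pandas as pd

LIFT_COLS = ["Best3SquatKg", "Best3BenchKg", "Best3DeadliftKg"]


def keep_positive_lifts(frame: pd.DataFrame, lift_cols=LIFT_COLS) -> pd.DataFrame:
    """Return a copy containing only rows where every lift column is > 0."""
    mask = (frame[lift_cols] > 0).all(axis=1)
    return frame.loc[mask].copy()


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "Best3SquatKg": [200.0, -180.0, 150.0],
        "Best3BenchKg": [120.0, 100.0, 0.0],
        "Best3DeadliftKg": [230.0, 210.0, 170.0],
    }
)
clean_demo = keep_positive_lifts(demo)
print(len(demo), "rows before,", len(clean_demo), "rows after")
```

In the notebook this could be applied as `df = keep_positive_lifts(df)` right after the `dropna()` step, if such rows turn out to exist.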

### EDA
We inspect summary statistics and correlations.

In [None]:
df.describe()

In [None]:
corr = df[["BodyweightKg", "Age", "Year", "Best3SquatKg", "Best3BenchKg", "Best3DeadliftKg", "TotalKg"]].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="coolwarm");

### Linear Regression
A simple multiple linear regression is fitted separately for squat, bench and deadlift using bodyweight, age and year as predictors.

In [None]:
from IPython.display import display

features = ["BodyweightKg", "Age", "Year"]
results = {}
for target in ["Best3SquatKg", "Best3BenchKg", "Best3DeadliftKg"]:
    # Hold out a test split (25% by default) so R² is measured on unseen rows.
    X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    r2 = r2_score(y_test, preds)
    results[target] = {"model": model, "r2": r2}
    display({"target": target, "r2": r2})
    # Raw coefficients: kg of lift per one-unit change in each predictor.
    coefs = pd.Series(model.coef_, index=features)
    display(coefs.sort_values(key=abs, ascending=False))
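
R² is unitless, so it can help to report an error measure in kilograms alongside it. The sketch below is one way to do that with root mean squared error (RMSE), computed from scikit-learn's `mean_squared_error`. It runs on synthetic data generated here purely for illustration; the `demo` frame, its coefficients and the noise level are assumptions, not properties of the OpenPowerlifting data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned DataFrame (illustrative values only).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame(
    {
        "BodyweightKg": rng.uniform(60, 120, n),
        "Age": rng.uniform(18, 60, n),
        "Year": rng.integers(2015, 2021, n).astype(float),
    }
)
demo["Best3SquatKg"] = (
    1.5 * demo["BodyweightKg"] - 0.4 * demo["Age"] + rng.normal(0, 5, n)
)

demo_features = ["BodyweightKg", "Age", "Year"]
X_train, X_test, y_train, y_test = train_test_split(
    demo[demo_features], demo["Best3SquatKg"], random_state=42
)
demo_model = LinearRegression().fit(X_train, y_train)
demo_preds = demo_model.predict(X_test)

demo_r2 = r2_score(y_test, demo_preds)
# RMSE is in the same unit as the target (kg), unlike R².
demo_rmse = float(np.sqrt(mean_squared_error(y_test, demo_preds)))
print({"r2": round(demo_r2, 3), "rmse_kg": round(demo_rmse, 2)})
```

The same two lines that compute `demo_r2` and `demo_rmse` could be added inside the loop above to store an RMSE next to each R² in `results`.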

### Feature importance
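Each coefficient estimates the change in the lift (in kg) for a one-unit change in that predictor, with the other predictors held fixed. Because bodyweight (kg), age (years) and calendar year are measured on different scales, the raw magnitudes are not directly comparable; to rank predictors by strength of association, standardize the features first or compare each coefficient multiplied by the standard deviation of its feature.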
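
One way to put coefficients on a comparable footing is to standardize the predictors before fitting. The sketch below compares raw and standardized coefficients using scikit-learn's `StandardScaler` and `make_pipeline`. It uses synthetic data whose true effects are chosen for illustration, so the names `demo`, `raw` and `standardized` and all numeric values are assumptions rather than results from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: Year has a large per-unit effect but a narrow range,
# BodyweightKg has a smaller per-unit effect but a wide range.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame(
    {
        "BodyweightKg": rng.uniform(60, 120, n),
        "Age": rng.uniform(18, 60, n),
        "Year": rng.integers(2015, 2021, n).astype(float),
    }
)
y = (
    1.5 * demo["BodyweightKg"]
    - 0.4 * demo["Age"]
    + 3.0 * (demo["Year"] - 2015)
    + rng.normal(0, 5, n)
)

cols = ["BodyweightKg", "Age", "Year"]

# Raw coefficients: kg per one unit of each feature.
raw_model = LinearRegression().fit(demo[cols], y)
raw = pd.Series(raw_model.coef_, index=cols)

# Standardized coefficients: kg per one standard deviation of each feature.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(demo[cols], y)
standardized = pd.Series(pipe[-1].coef_, index=cols)

print(pd.DataFrame({"raw": raw, "standardized": standardized}))
```

In this synthetic case the ranking changes: `Year` has the largest raw coefficient, while `BodyweightKg` has the largest standardized one, because its values vary over a much wider range.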