
# 🏥 Project 1 — Healthcare: Predicting 30-Day Readmission

**Goal:** Predict whether a patient will be readmitted within 30 days of discharge using synthetic (safe) data.  
**Pipeline:** Collect → Clean → Explore → Visualize → Model → Evaluate → Interpret.

> Run cells top-to-bottom. Works in Google Colab or in a local Jupyter notebook after `pip install numpy pandas scikit-learn matplotlib`.


In [None]:

# 1) Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

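# No global seed is needed: step 2 seeds its own generator with
# np.random.default_rng(42), and step 5 passes random_state=42 to the split.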


In [None]:

# 2) Create a synthetic dataset (N=500) — no real patient data
N = 500
rng = np.random.default_rng(42)

df = pd.DataFrame({
    "patient_id": np.arange(1, N+1),
    "age": rng.integers(18, 90, N),               # ages 18-89 (upper bound is exclusive)
    "prior_admissions": rng.poisson(1, N),        # counts with a mean of 1
    "days_in_hospital": rng.integers(1, 15, N),   # stays of 1-14 days
    "has_diabetes": rng.integers(0, 2, N),        # 0/1 flag
    "has_heart_disease": rng.integers(0, 2, N),   # 0/1 flag
    "has_copd": rng.integers(0, 2, N),            # 0/1 flag
})

# Build each patient's readmission probability from a hand-picked logistic formula
# (the coefficients are illustrative, not clinical estimates)
logit = (
    -2
    + 0.02*(df["age"]-50)
    + 0.3*df["prior_admissions"]
    + 0.2*(df["days_in_hospital"]-5)
    + 0.5*df["has_diabetes"]
    + 0.4*df["has_heart_disease"]
    + 0.3*df["has_copd"]
)
prob = 1 / (1 + np.exp(-logit))
df["readmitted_30d"] = (rng.random(N) < prob).astype(int)

# Save a copy (optional)
df.to_csv("patients_readmission_synthetic.csv", index=False)

# Show first 20 rows
df.head(20)
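
To see what the logistic formula in step 2 does for a single patient, here is a minimal sketch that applies the same coefficients to two hand-picked profiles. The helper name `readmission_probability` and both profiles are illustrative assumptions, not part of the notebook.

```python
import math

def readmission_probability(age, prior_admissions, days_in_hospital,
                            has_diabetes, has_heart_disease, has_copd):
    """Apply the step-2 logistic formula to one patient (hypothetical helper)."""
    logit = (
        -2
        + 0.02 * (age - 50)
        + 0.3 * prior_admissions
        + 0.2 * (days_in_hospital - 5)
        + 0.5 * has_diabetes
        + 0.4 * has_heart_disease
        + 0.3 * has_copd
    )
    return 1 / (1 + math.exp(-logit))

# Reference patient: age 50, no prior admissions, 5-day stay, no conditions.
# Every term except the intercept is zero, so logit = -2.
low_risk = readmission_probability(50, 0, 5, 0, 0, 0)

# Older patient with 3 prior admissions, a 12-day stay, and all three conditions.
# logit = -2 + 0.6 + 0.9 + 1.4 + 0.5 + 0.4 + 0.3 = 2.1
high_risk = readmission_probability(80, 3, 12, 1, 1, 1)

print(f"low-risk profile:  {low_risk:.3f}")   # about 0.119
print(f"high-risk profile: {high_risk:.3f}")  # about 0.891
```

The intercept of -2 sets the baseline risk near 12%, and each feature shifts the log-odds up or down from there.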


In [None]:

# 3) Quick EDA (explore)
print("Rows, Columns:", df.shape)
print("Columns:", list(df.columns))
print("\nReadmission rate:", round(df["readmitted_30d"].mean(), 3))
print("\nNumeric summary:")
display(df.describe())
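
Beyond the overall readmission rate, a per-group rate is often the quickest way to check whether a flag such as `has_diabetes` matters. The sketch below shows the idea on a tiny hand-made frame; the `toy` data is hypothetical and only reuses two of the notebook's column names.

```python
import pandas as pd

# Tiny hand-made frame (hypothetical values) with two of the notebook's columns.
toy = pd.DataFrame({
    "has_diabetes":   [0, 0, 0, 0, 1, 1, 1, 1],
    "readmitted_30d": [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 target within each group is that group's readmission rate.
rate_by_diabetes = toy.groupby("has_diabetes")["readmitted_30d"].mean()
print(rate_by_diabetes)
```

On the notebook's data the same pattern would be `df.groupby("has_diabetes")["readmitted_30d"].mean()`.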


In [None]:

# 4) Visualizations
# (A) Age distribution
df["age"].hist(bins=20)
plt.title("Age Distribution")
plt.xlabel("Age"); plt.ylabel("Count"); plt.show()

# (B) Prior admissions vs readmission (boxplot)
df.boxplot(column="prior_admissions", by="readmitted_30d")
plt.title("Prior Admissions by Readmission")
plt.suptitle("")
plt.xlabel("readmitted_30d"); plt.ylabel("prior_admissions"); plt.show()


In [None]:

# 5) Train a simple model (Logistic Regression)
X = df[["age","prior_admissions","days_in_hospital","has_diabetes","has_heart_disease","has_copd"]]
y = df["readmitted_30d"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", round(acc, 3))

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["Not readmitted","Readmitted"]).plot(values_format='d')
plt.title("Confusion Matrix (Logistic Regression)")
plt.show()
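
The pipeline ends with "Interpret", and one common way to interpret a logistic regression is to exponentiate its coefficients into odds ratios. The following is a sketch under stated assumptions: it rebuilds the step-2 data recipe with a larger sample (N=5000, seed 0, both chosen here only to make the estimates less noisy) and fits on all rows rather than on a train split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Rebuild a synthetic sample with the step-2 recipe.
# N=5000 and seed 0 are assumptions made for this sketch, not the notebook's values.
N = 5000
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 90, N),
    "prior_admissions": rng.poisson(1, N),
    "days_in_hospital": rng.integers(1, 15, N),
    "has_diabetes": rng.integers(0, 2, N),
    "has_heart_disease": rng.integers(0, 2, N),
    "has_copd": rng.integers(0, 2, N),
})
logit = (
    -2 + 0.02 * (X["age"] - 50) + 0.3 * X["prior_admissions"]
    + 0.2 * (X["days_in_hospital"] - 5) + 0.5 * X["has_diabetes"]
    + 0.4 * X["has_heart_disease"] + 0.3 * X["has_copd"]
)
y = (rng.random(N) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of readmission
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False).round(2))
```

An odds ratio above 1 means the feature raises the odds of readmission. Because the features are on different scales (years, days, 0/1 flags), the ratios are per unit and are not directly comparable in size.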



## 📝 Insights (ready to copy into GitHub README)
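- Patients with **more prior admissions** are more likely to be readmitted (the generator adds 0.3 to the log-odds per prior admission).
- Chronic conditions (**diabetes, heart disease, COPD**) increase risk, adding 0.5, 0.4, and 0.3 to the log-odds respectively.
- **Longer stays** correlate with higher readmission likelihood (0.2 log-odds per extra day).
- Because the data is synthetic, these patterns are built in by the formula in step 2; they are not clinical findings.
- Logistic Regression is a reasonable baseline; try Random Forest or threshold tuning next.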
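
As one possible starting point for the threshold tuning mentioned above, the sketch below compares precision and recall at a few cutoffs on the predicted probabilities instead of relying on the default 0.5 used by `predict`. The three thresholds are arbitrary assumptions, and the data is rebuilt from the step-2 recipe so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Rebuild synthetic data with the step-2 recipe (N=500, seed 42).
N = 500
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.integers(18, 90, N),
    "prior_admissions": rng.poisson(1, N),
    "days_in_hospital": rng.integers(1, 15, N),
    "has_diabetes": rng.integers(0, 2, N),
    "has_heart_disease": rng.integers(0, 2, N),
    "has_copd": rng.integers(0, 2, N),
})
logit = (
    -2 + 0.02 * (X["age"] - 50) + 0.3 * X["prior_admissions"]
    + 0.2 * (X["days_in_hospital"] - 5) + 0.5 * X["has_diabetes"]
    + 0.4 * X["has_heart_disease"] + 0.3 * X["has_copd"]
)
y = (rng.random(N) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class (readmitted) for each test patient.
proba = model.predict_proba(X_test)[:, 1]

# Illustrative cutoffs; lower thresholds flag more patients as likely readmissions.
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }
    print(threshold, {k: round(v, 3) for k, v in results[threshold].items()})
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision. Which trade-off is acceptable depends on how costly a missed readmission is relative to a false alarm.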
