# Disease Prediction from Medical Data

**Goal:** Predict disease risk (for example, heart disease or diabetes). This notebook uses the Kaggle API to download a dataset; set `DATASET_SLUG` below.


## 1. Kaggle API setup (same as credit notebook)

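Set `DATASET_SLUG` to your dataset's slug in `owner/dataset-name` form, then run the download cell. The cell shells out to the `kaggle` command-line tool, so the tool must be installed and configured with your API credentials.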

In [None]:
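import os

DATASET_SLUG = 'DATASET_OWNER/DATASET_NAME'  # <-- REPLACE with owner/dataset-name

if DATASET_SLUG == 'DATASET_OWNER/DATASET_NAME':
    print('Please set DATASET_SLUG to the dataset you want to download from Kaggle.')
else:
    # os.system returns the command's exit status; 0 means the kaggle CLI succeeded
    status = os.system(f'kaggle datasets download -d {DATASET_SLUG} -p ./data --unzip')
    if status == 0:
        print('Download finished. Check ./data for files.')
    else:
        print('Download failed with exit status', status, '(is the kaggle CLI installed and configured?)')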
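The download step can also be written with `subprocess`, which avoids passing the slug through a shell string. The following is a minimal sketch, not the notebook's required approach; `build_kaggle_command` and `download_dataset` are hypothetical helper names, and the CLI flags are the same ones used in the cell above.

```python
import shutil
import subprocess


def build_kaggle_command(slug, dest="./data"):
    """Return the argument list for downloading and unzipping a Kaggle dataset."""
    # A slug must look like 'owner/dataset-name': exactly one slash, both parts non-empty
    if slug.count("/") != 1 or not all(slug.split("/")):
        raise ValueError(f"expected 'owner/dataset-name', got {slug!r}")
    return ["kaggle", "datasets", "download", "-d", slug, "-p", dest, "--unzip"]


def download_dataset(slug, dest="./data"):
    """Run the kaggle CLI and return True if it exited with status 0."""
    if shutil.which("kaggle") is None:
        print("kaggle CLI not found on PATH")
        return False
    result = subprocess.run(build_kaggle_command(slug, dest))
    return result.returncode == 0
```

Calling `download_dataset(DATASET_SLUG)` would then replace the `os.system` call, and the boolean result can gate the rest of the notebook.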

## 2. Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report, confusion_matrix

print('Libraries imported')

## 3. Load dataset

Common filenames: `heart.csv`, `diabetes.csv`. Update the `POSSIBLE_PATHS` list to match your dataset.

In [None]:
POSSIBLE_PATHS = ['./data/heart.csv','./data/diabetes.csv','./data/data.csv']
found = None
for p in POSSIBLE_PATHS:
    if os.path.exists(p):
        found = p
        break

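if not found:
    print('No common file found in ./data. Please place the dataset in ./data and update POSSIBLE_PATHS.')
else:
    df = pd.read_csv(found)
    print('Loaded', found, 'shape:', df.shape)
    # A bare df.head() inside an if/else block is not rendered, so display it explicitly
    display(df.head())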
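If the downloaded file name is not known in advance, the `./data` folder can be scanned instead of checking a fixed list of paths. This is a small sketch under that assumption; `find_csv_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def find_csv_files(data_dir="./data"):
    """Return all CSV files under data_dir (searched recursively), sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.csv"))


# List the candidates so one can be picked and passed to pd.read_csv
for csv_path in find_csv_files("./data"):
    print(csv_path)
```

Listing the candidates rather than loading the first match keeps the choice of file explicit when an archive unzips to several CSVs.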

## 4. EDA & preprocessing (example)

Set `TARGET` to the name of your dataset's target column (e.g., `target`, `Outcome`, `heart_disease`).

In [None]:
# Quick EDA
df.info()  # prints its summary directly and returns None, so it is not wrapped in print()
print('\nMissing values:\n', df.isnull().sum())
print('\nDescribe:\n', df.describe())

In [None]:
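TARGET = 'target'  # <-- change as needed
if TARGET not in df.columns:
    print('Target column not found. Please update TARGET variable to your dataset target column name.')
else:
    X = df.drop(columns=[TARGET])
    y = df[TARGET]
    # Stratify so train and test keep the same class balance (skipped if only one class is present)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42,
        stratify=y if y.nunique() > 1 else None,
    )
    # Only numeric columns are kept; non-numeric columns are dropped here, not encoded
    num_cols = X.select_dtypes(include=[np.number]).columns
    # Fit the scaler on the training split only, so no test statistics leak into it.
    # The scaled arrays are for scale-sensitive models; the Random Forest below uses unscaled columns.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train[num_cols])
    X_test_scaled = scaler.transform(X_test[num_cols])
    print('Prepared numeric features for modeling.')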
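The cell above keeps only numeric columns, so any categorical feature is silently dropped. One way to keep them is to one-hot encode inside a scikit-learn `Pipeline`; the sketch below illustrates this on made-up toy data. The `build_pipeline` helper and the column names `age` and `chest_pain` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(X):
    """Scale numeric columns, one-hot encode the rest, then apply a Random Forest."""
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" avoids errors on categories unseen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])


# Hypothetical toy data; the column names and values are made up for illustration
toy = pd.DataFrame({
    "age": [63, 45, 58, 39, 71, 50],
    "chest_pain": ["typical", "none", "atypical", "none", "typical", "atypical"],
    "target": [1, 0, 1, 0, 1, 0],
})
toy_X, toy_y = toy.drop(columns=["target"]), toy["target"]
pipe = build_pipeline(toy_X).fit(toy_X, toy_y)
toy_pred = pipe.predict(toy_X)
```

Because the preprocessing lives inside the pipeline, it is fitted on whatever data is passed to `fit`, which keeps the train/test separation intact when the pipeline is fitted on `X_train` only.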

## 5. Train a Random Forest (example)

In [None]:
if TARGET in df.columns:
    # Tree ensembles are insensitive to feature scale, so the unscaled numeric columns are used
    X_train_num = X_train.select_dtypes(include=[np.number])
    X_test_num = X_test.select_dtypes(include=[np.number])
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train_num, y_train)
    y_pred = rf.predict(X_test_num)
    print('Accuracy:', accuracy_score(y_test, y_pred))
    # f1_score defaults to average='binary'; a multiclass target needs e.g. average='macro'
    print('F1:', f1_score(y_test, y_pred, zero_division=0))
    try:
        # Column 1 is the probability of the second class in rf.classes_ (the positive class for 0/1 labels)
        y_proba = rf.predict_proba(X_test_num)[:, 1]
        print('ROC-AUC:', roc_auc_score(y_test, y_proba))
    except Exception as e:
        print('ROC-AUC error:', e)
    print('\nClassification report:\n', classification_report(y_test, y_pred, zero_division=0))
    print('\nConfusion matrix:\n', confusion_matrix(y_test, y_pred))
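To make the printed metrics easier to read, here is a small worked example of how accuracy, precision, recall, and F1 follow from the four cells of a binary confusion matrix. The counts are hypothetical, and `binary_metrics` is an illustrative helper, not a scikit-learn function. For 0/1 labels, `confusion_matrix(y_test, y_pred).ravel()` yields the counts in the order `tn, fp, fn, tp`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive common metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 50 true negatives, 10 false positives, 5 false negatives, 35 true positives
metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a disease-risk setting, recall (the share of true cases that are caught) is often worth watching alongside accuracy, since a false negative is a missed case.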

## 6. Feature importance & interpretation

For tree-based models you can inspect `feature_importances_` (impurity-based scores that sum to 1). Consider SHAP for richer, per-prediction explanations.

In [None]:
if TARGET in df.columns:
    importances = rf.feature_importances_
    feat_names = X.select_dtypes(include=[np.number]).columns
    fi = pd.DataFrame({'feature': feat_names, 'importance': importances}).sort_values('importance', ascending=False)
    display(fi.head(20))
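Impurity-based importances are computed from the training data and can favor high-cardinality features. As a complementary check (a different technique from both `feature_importances_` and SHAP), scikit-learn's `permutation_importance` measures how much the held-out score drops when a feature is shuffled. The sketch below uses synthetic data from `make_classification` as a stand-in; in the notebook, `rf`, the numeric test columns, and `y_test` would take the place of `model`, `X_te`, and `y_te`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not a medical dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and record the drop in accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

# Rank features from most to least important by mean score drop
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean, std = result.importances_mean[idx], result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```

A feature whose mean drop is close to zero, or within its standard deviation of zero, contributes little to held-out performance even if its impurity-based importance looks large.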

## 7. Save and push to GitHub

See the final section for GitHub instructions. Create a `requirements.txt` using `pip freeze > requirements.txt` and include your notebooks and dataset information in the repo README.