# Group 1 Final Project Work 

#### Specific data on our dataset

### Maternal Health Risk Dataset Summary

**Shape:** 808 records × 7 columns  

**Columns:**
- `Age`
- `SystolicBP` (Systolic Blood Pressure)
- `DiastolicBP` (Diastolic Blood Pressure)
- `BS` (Blood Sugar level)
- `BodyTemp` (Body Temperature, °F)
- `HeartRate` (Heart Rate, bpm)
- `RiskLevel` (Target: maternal health risk category)

---

#### First 5 Records
| Age | SystolicBP | DiastolicBP | BS   | BodyTemp | HeartRate | RiskLevel  |
|-----|------------|--------------|------|----------|-----------|------------|
| 25  | 130        | 80           | 15.0 | 98.0     | 86        | high risk  |
| 35  | 140        | 90           | 13.0 | 98.0     | 70        | high risk  |
| 29  | 90         | 70           | 8.0  | 100.0    | 80        | high risk  |
| 30  | 140        | 85           | 7.0  | 98.0     | 70        | high risk  |
| 35  | 120        | 60           | 6.1  | 98.0     | 76        | low risk   |

---

#### Summary Statistics
- **Age:** 10–70 years (mean = 30.6, std = 13.9)  
- **SystolicBP:** 70–160 mmHg (mean = 113, std = 19.9)  
- **DiastolicBP:** 49–100 mmHg (mean = 77.5, std = 14.8)  
- **BS:** 6–19 mmol/L (mean = 9.26, std = 3.62)  
- **BodyTemp:** 98–103 °F (mean = 98.6, std = 1.39)  
- **HeartRate:** 7–90 bpm (mean = 74.3, std = 8.82); the 7 bpm minimum is physiologically implausible and likely a data-entry error  

---

#### Target Variable: RiskLevel
- **Low risk:** 478 records (~59.2%)  
- **High risk:** 330 records (~40.8%)  
- **Medium risk:** Not present in this dataset version  

**Note:** The dataset is binary-labeled (low vs. high risk). A 3-class model (low/mid/high) would need a dataset version that includes medium-risk records, or additional preprocessing or augmentation.
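
The percentages above follow directly from the class counts. The minimal sketch below recomputes them and also derives the accuracy that a trivial "always predict the majority class" model would reach, which gives a floor for reading the model metrics later on. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Class counts as reported in the summary above.
class_counts = {"low risk": 478, "high risk": 330}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline_accuracy = proportions[majority_label]

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline_accuracy:.1%}")
```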

### Week 3 - Training and Feature Engineering

#### Environment (auto role + auto bucket) and constants

In [2]:
# This notebook:
#   • does EDA and writes plots/summary
#   • engineers features
#   • creates stratified splits: train(40%), val(10%), test(10%), production(40%)
#   • uploads artifacts to S3 (auto default bucket)
#   • creates OFFLINE Feature Store groups (auto execution role)
#   • writes a tracker update (JSON + Markdown)
#
# NO MANUAL SETTINGS: bucket/role are auto-detected from your Studio kernel.

import os, json, time
from pathlib import Path

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

plt.rcParams["figure.dpi"] = 120

# Paths
LOCAL_DATA_PATH = Path("Maternal_Risk.csv")          # change only if your CSV is elsewhere
ARTIFACTS_DIR   = Path("week3_outputs")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

# AWS context (auto)
boto_sess  = boto3.session.Session()
region     = boto_sess.region_name
sm_session = Session(boto_sess)
role       = get_execution_role()                     # auto from Studio kernel
bucket     = sm_session.default_bucket()              # auto default bucket

# Unique ids (prevent FG name collisions + make runs auditable)
RUN_ID    = time.strftime("%Y%m%d-%H%M%S")
S3_PREFIX = f"aai540/maternal-risk/week3/{RUN_ID}"

print("Region:", region)
print("Role:  ", role)
print("S3:    ", f"s3://{bucket}/{S3_PREFIX}")

Region: us-east-1
Role:   arn:aws:iam::533267301342:role/LabRole
S3:     s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647


#### Load data + lightweight EDA (plots + JSON summary)

In [3]:
assert LOCAL_DATA_PATH.exists(), f"Missing: {LOCAL_DATA_PATH.resolve()}"
df = pd.read_csv(LOCAL_DATA_PATH)

assert "RiskLevel" in df.columns, "Expected target column 'RiskLevel'."
print("Shape:", df.shape)
print("Columns:", list(df.columns))

# EDA summary (for tracker)
eda_summary = {
    "rows": int(df.shape[0]),
    "cols": int(df.shape[1]),
    "columns": df.columns.tolist(),
    "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    "missing_counts": df.isna().sum().to_dict(),
    "class_counts": df["RiskLevel"].value_counts().to_dict(),
}
with open(ARTIFACTS_DIR / "eda_summary.json", "w") as f:
    json.dump(eda_summary, f, indent=2)

# Simple plots (defaults only; no custom colors)
df["RiskLevel"].value_counts().plot(kind="bar"); plt.title("Class Distribution")
plt.tight_layout(); plt.savefig(ARTIFACTS_DIR / "chart_class_distribution.png"); plt.clf()

df["Age"].plot(kind="hist", bins=20); plt.title("Age Distribution")
plt.tight_layout(); plt.savefig(ARTIFACTS_DIR / "chart_age_hist.png"); plt.clf()

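# Note: Matplotlib 3.9+ renames `labels` to `tick_labels`; `labels` still works here
# but emits a deprecation warning on newer versions.
plt.boxplot([df["SystolicBP"], df["DiastolicBP"]], labels=["SystolicBP","DiastolicBP"])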
plt.title("Blood Pressure Boxplots"); plt.tight_layout()
plt.savefig(ARTIFACTS_DIR / "chart_bp_box.png"); plt.clf()

num_cols = df.select_dtypes(include=[np.number]).columns
corr = df[num_cols].corr()
plt.imshow(corr, interpolation="nearest")
plt.xticks(range(len(num_cols)), num_cols, rotation=45, ha="right")
plt.yticks(range(len(num_cols)), num_cols); plt.colorbar(); plt.title("Correlation Heatmap")
plt.tight_layout(); plt.savefig(ARTIFACTS_DIR / "chart_corr_heatmap.png"); plt.clf()

print("EDA done →", ARTIFACTS_DIR)

Shape: (808, 7)
Columns: ['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate', 'RiskLevel']



EDA done → week3_outputs



#### Feature engineering (clinically-motivated features + z-scaling)

In [4]:
# We derive simple vitals-based features and also add z-scaled versions for linear models.

X = df.copy()

# Clinically motivated derived features
X["PulsePressure"]     = X["SystolicBP"] - X["DiastolicBP"]
X["SBP_to_DBP"]        = X["SystolicBP"] / (X["DiastolicBP"].replace(0, np.nan))
X["Fever"]             = (X["BodyTemp"] > 99.5).astype(int)
X["Tachycardia"]       = (X["HeartRate"] >= 100).astype(int)
X["HypertensionFlag"]  = ((X["SystolicBP"] >= 140) | (X["DiastolicBP"] >= 90)).astype(int)

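# Optional standardization for linear models.
# Note: the scaler is fit on all rows here (before splitting); fit on the training
# split only if leakage of validation/test statistics must be avoided.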
cont = ["Age","SystolicBP","DiastolicBP","BS","BodyTemp","HeartRate","PulsePressure","SBP_to_DBP"]
X[[f"z_{c}" for c in cont]] = StandardScaler().fit_transform(X[cont])

# Label encoding (binary in this dataset)
label_map = {"low risk": 0, "high risk": 1}
if set(df["RiskLevel"].unique()) == set(label_map):
    y = df["RiskLevel"].map(label_map)
else:
    cats = sorted(df["RiskLevel"].unique())
    label_map = {v:i for i,v in enumerate(cats)}
    y = df["RiskLevel"].map(label_map)

with open(ARTIFACTS_DIR / "label_map.json", "w") as f:
    json.dump(label_map, f, indent=2)

X_no_target = X.drop(columns=["RiskLevel"])
engineered_full = pd.concat([X_no_target, y.rename("label")], axis=1)
engineered_full.to_csv(ARTIFACTS_DIR / "maternal_features_full.csv", index=False)

print("Feature engineering done.")

Feature engineering done.
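
The cell above fits `StandardScaler` on every row before the data is split, so the `z_*` columns carry mean and standard-deviation information from rows that later land in the validation, test, and production sets. A stricter alternative is to estimate the statistics on the training rows only and reuse them everywhere else. The sketch below shows that idea with the standard library and made-up values; `fit_zscore` and `apply_zscore` are hypothetical helpers, not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_zscore(train_values):
    """Estimate mean and population std (ddof=0) from training values only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)
    # Guard against constant columns to avoid division by zero.
    return mu, (sigma if sigma > 0 else 1.0)

def apply_zscore(values, mu, sigma):
    """Scale any split with statistics that were learned on the training split."""
    return [(v - mu) / sigma for v in values]

# Illustrative systolic BP values, not taken from the dataset.
train_sbp = [90, 110, 120, 130, 150]
val_sbp = [100, 140]

mu, sigma = fit_zscore(train_sbp)
z_train = apply_zscore(train_sbp, mu, sigma)
z_val = apply_zscore(val_sbp, mu, sigma)
print(mu, sigma, z_val)
```

With scikit-learn the same pattern would be `fit` on the training frame followed by `transform` on each of the other splits.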


#### Stratified splits: 40% prod, 40% train, 10% val, 10% test

In [5]:
# We first carve out 40% as "production" holdout for future batch inference/monitoring.
# The remaining 60% --> train (40%), val (10%), test (10%) of the original dataset.

from sklearn.model_selection import train_test_split

# 40% set aside for future batch inference/monitoring
X_tmp, X_prod, y_tmp, y_prod = train_test_split(
    X_no_target, y, test_size=0.40, random_state=42, stratify=y
)
# remaining 60% -> 40/10/10
X_train, X_rem, y_train, y_rem = train_test_split(
    X_tmp, y_tmp, test_size=(1/3), random_state=42, stratify=y_tmp
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=42, stratify=y_rem
)

def _save(name, Xd, yd):
    out = Xd.copy(); out["label"] = yd.values
    out.to_csv(ARTIFACTS_DIR / f"{name}.csv", index=False)
    return out

train_df = _save("train",      X_train, y_train)
val_df   = _save("val",        X_val,   y_val)
test_df  = _save("test",       X_test,  y_test)
prod_df  = _save("production", X_prod,  y_prod)

print({"train":len(train_df), "val":len(val_df), "test":len(test_df), "production":len(prod_df)})

{'train': 322, 'val': 81, 'test': 81, 'production': 324}
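
The printed sizes can be reproduced by hand. The sketch below assumes scikit-learn's rounding for a fractional `test_size` (the test side is rounded up and the remainder goes to the train side); `split_sizes` is an illustrative helper, not a notebook function.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_keep, n_test), rounding the test side up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

n_total = 808
n_tmp, n_prod = split_sizes(n_total, 0.40)   # carve out production first
n_train, n_rem = split_sizes(n_tmp, 1 / 3)   # a third of the remaining 60% is 20% overall
n_val, n_test = split_sizes(n_rem, 0.5)      # halve that into validation and test

sizes = {"train": n_train, "val": n_val, "test": n_test, "production": n_prod}
print(sizes)
print({name: round(n / n_total, 3) for name, n in sizes.items()})
```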


#### Upload artifacts to S3 (no manual bucket)

In [6]:
# Upload the CSVs, label map, EDA summary, and figures to your default bucket/prefix.

s3 = boto3.client("s3")

def s3_upload(local: Path, key: str):
    s3.upload_file(str(local), bucket, f"{S3_PREFIX}/{key}")
    print("Uploaded", f"s3://{bucket}/{S3_PREFIX}/{key}")

# CSVs + summaries
for fname in ["train.csv","val.csv","test.csv","production.csv",
              "maternal_features_full.csv","label_map.json","eda_summary.json"]:
    s3_upload(ARTIFACTS_DIR / fname, fname)

# Plots
for fname in ["chart_class_distribution.png","chart_age_hist.png","chart_bp_box.png","chart_corr_heatmap.png"]:
    s3_upload(ARTIFACTS_DIR / fname, f"figures/{fname}")

Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/train.csv
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/val.csv
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/test.csv
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/production.csv
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/maternal_features_full.csv
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/label_map.json
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/eda_summary.json
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/figures/chart_class_distribution.png
Uploaded s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647/figures/chart_age_hist.png
Uploaded s3://sagemaker-u

#### Sanitize column names (Feature Store regex) & write sanitized splits

In [7]:
# FS rules: names must be letters/numbers/hyphens only; must start with alnum; <=64 chars.

def sanitize_col(name: str) -> str:
    if name == "SBP_to_DBP": name = "SBPtoDBP"   # preserve meaning
    if name.startswith("z_"): name = "z" + name[2:]
    name = name.replace("_", "")
    name = "".join(ch for ch in name if ch.isalnum() or ch == "-")
    if not name or not name[0].isalnum(): name = "f" + name
    return name[:64]

def sanitize_df_cols(df: pd.DataFrame) -> pd.DataFrame:
    newcols, seen = [], set()
    for c in df.columns:
        s = sanitize_col(c)
        if s in seen:
            i, base = 2, s
            while f"{base}{i}" in seen: i += 1
            s = f"{base}{i}"
        newcols.append(s); seen.add(s)
    out = df.copy(); out.columns = newcols
    return out

label_col = "label"
def sanitize_split(df):
    feats = df.drop(columns=[label_col])
    feats = sanitize_df_cols(feats)
    feats[label_col] = df[label_col].values
    return feats

train_s = sanitize_split(train_df); train_s.to_csv(ARTIFACTS_DIR/"train_sanitized.csv", index=False)
val_s   = sanitize_split(val_df);   val_s.to_csv(ARTIFACTS_DIR/"val_sanitized.csv",   index=False)
test_s  = sanitize_split(test_df);  test_s.to_csv(ARTIFACTS_DIR/"test_sanitized.csv", index=False)
prod_s  = sanitize_split(prod_df);  prod_s.to_csv(ARTIFACTS_DIR/"production_sanitized.csv", index=False)

print("Sanitized splits saved.")

Sanitized splits saved.
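
A quick way to confirm that sanitized names satisfy the rules stated in the cell's comment (letters, digits, and hyphens only; alphanumeric first character; at most 64 characters) is a regular-expression check. The pattern below encodes that comment only. It is an assumption and has not been checked against the Feature Store API reference, and `is_valid_feature_name` is a hypothetical helper.

```python
import re

# Encodes the rules from the comment above: first character alphanumeric,
# then up to 63 more letters, digits, or hyphens (64 characters in total).
NAME_RULE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]{0,63}")

def is_valid_feature_name(name: str) -> bool:
    return NAME_RULE.fullmatch(name) is not None

for candidate in ["SBP_to_DBP", "SBPtoDBP", "z_Age", "zAge", "-Age", "A" * 65]:
    print(f"{candidate[:20]!r:>24} -> {is_valid_feature_name(candidate)}")
```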


#### Create & ingest Feature Store (OFFLINE, unique names per run)

In [8]:
import time
import boto3
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum

sm      = boto3.client("sagemaker")
session = Session(boto3.session.Session(region_name=region))

def ensure_id_time(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    if "recordid" not in df.columns:
        df["recordid"] = range(1, len(df)+1)
    if "eventtime" not in df.columns:
        df["eventtime"] = pd.Timestamp.utcnow().isoformat()
    return df

def to_boto_feature_defs(df: pd.DataFrame):
    out = []
    for c, d in df.dtypes.items():
        if c == "eventtime":
            t = "String"
        elif pd.api.types.is_integer_dtype(d):
            t = "Integral"
        elif pd.api.types.is_float_dtype(d):
            t = "Fractional"
        else:
            t = "String"
        out.append({"FeatureName": c, "FeatureType": t})
    return out

def create_fg_boto3(name: str, df_local: pd.DataFrame, s3_uri: str):
    fdefs = to_boto_feature_defs(df_local)
    try:
        resp = sm.create_feature_group(
            FeatureGroupName=name,
            RecordIdentifierFeatureName="recordid",
            EventTimeFeatureName="eventtime",
            FeatureDefinitions=fdefs,
            OfflineStoreConfig={"S3StorageConfig": {"S3Uri": s3_uri}},
            OnlineStoreConfig={"EnableOnlineStore": False},
            RoleArn=role,
            Description=f"Maternal Health Risk – {name}",
        )
        return resp
    except sm.exceptions.ResourceInUse:
        # Already exists --> safe to reuse after we confirm it's Created
        return {"FeatureGroupArn": f"arn:aws:sagemaker:{region}:{boto3.client('sts').get_caller_identity()['Account']}:feature-group/{name}"}

def wait_fg_created(name: str, timeout_s: int = 900, poll_s: int = 10):
    start = time.time()
    last = ""
    while True:
        desc = sm.describe_feature_group(FeatureGroupName=name)
        status = desc.get("FeatureGroupStatus", "")
        if status == "Created":
            print(f"[READY] {name}")
            return desc
        if status == "CreateFailed":
            raise RuntimeError(f"{name} failed: {desc.get('FailureReason')}")
        if time.time() - start > timeout_s:
            raise TimeoutError(f"Timeout waiting for {name} (last status={status})")
        if status != last:
            print(f"Status {name}: {status}")
            last = status
        time.sleep(poll_s)

def create_and_ingest(name_base: str, df_local: pd.DataFrame):
    # unique FG names per run to avoid collisions
    name = f"{name_base}-{RUN_ID}"             # e.g., mhr-train-fg-20250920-154301
    assert "_" not in name, "FG name must not contain underscores."
    df_local = ensure_id_time(df_local)
    s3_uri   = f"s3://{bucket}/{S3_PREFIX}/feature-store/{name}"

    create_fg_boto3(name, df_local, s3_uri)
    wait_fg_created(name)

    fg = FeatureGroup(name=name, sagemaker_session=session)
    fg.load_feature_definitions(data_frame=df_local)    # make sure SDK knows schema
    fg.ingest(data_frame=df_local, max_workers=4, wait=True)
    print(f"[OK] Ingested {len(df_local)} rows → {name}")
    return name

# Load sanitized splits
train_s = pd.read_csv(ARTIFACTS_DIR/"train_sanitized.csv")
val_s   = pd.read_csv(ARTIFACTS_DIR/"val_sanitized.csv")
prod_s  = pd.read_csv(ARTIFACTS_DIR/"production_sanitized.csv")

# Create OFFLINE FGs with unique names
FG_TRAIN = create_and_ingest("mhr-train-fg", train_s)
FG_VAL   = create_and_ingest("mhr-val-fg",   val_s)
FG_BATCH = create_and_ingest("mhr-batch-fg", prod_s)

print("Feature Store complete:", FG_TRAIN, FG_VAL, FG_BATCH)

Status mhr-train-fg-20250925-102647: Creating
[READY] mhr-train-fg-20250925-102647
[OK] Ingested 322 rows → mhr-train-fg-20250925-102647
Status mhr-val-fg-20250925-102647: Creating
[READY] mhr-val-fg-20250925-102647
[OK] Ingested 81 rows → mhr-val-fg-20250925-102647
Status mhr-batch-fg-20250925-102647: Creating
[READY] mhr-batch-fg-20250925-102647
[OK] Ingested 324 rows → mhr-batch-fg-20250925-102647
Feature Store complete: mhr-train-fg-20250925-102647 mhr-val-fg-20250925-102647 mhr-batch-fg-20250925-102647
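
`wait_fg_created` is an instance of a general poll-until-terminal-state loop. The sketch below isolates that pattern so it can be exercised without AWS. `poll_until` and the fake status source are hypothetical stand-ins; only the status strings `Creating`, `Created`, and `CreateFailed` come from the cell above.

```python
import time

def poll_until(get_status, success="Created", failure="CreateFailed",
               timeout_s=900, poll_s=10):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    start = time.monotonic()
    while True:
        status = get_status()
        if status == success:
            return status
        if status == failure:
            raise RuntimeError(f"Resource entered {failure}")
        if time.monotonic() - start > timeout_s:
            raise TimeoutError(f"Timed out (last status={status})")
        time.sleep(poll_s)

# Fake status source standing in for sm.describe_feature_group(...).
fake_statuses = iter(["Creating", "Creating", "Created"])
final_status = poll_until(lambda: next(fake_statuses), poll_s=0)
print(final_status)
```

Using `time.monotonic()` rather than `time.time()` keeps the timeout unaffected by wall-clock adjustments; either works for a short notebook run.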


#### Tracker update (JSON + Markdown) and upload

In [9]:
tracker = {
    "run_id": RUN_ID,
    "s3_prefix": f"s3://{bucket}/{S3_PREFIX}",
    "dataset": {
        "rows": eda_summary["rows"], "cols": eda_summary["cols"],
        "class_counts": eda_summary["class_counts"],
        "dtypes": eda_summary["dtypes"],
        "missing": eda_summary["missing_counts"],
    },
    "splits": {
        "train_rows": len(train_df), "val_rows": len(val_df),
        "test_rows": len(test_df), "prod_rows": len(prod_df),
    },
    "feature_store_groups": [FG_TRAIN, FG_VAL, FG_BATCH],
}
with open(ARTIFACTS_DIR / "team_tracker_update_week3.json", "w") as f:
    json.dump(tracker, f, indent=2)

md = f"""# Week 3 Tracker — Maternal Health Risk (RUN: {RUN_ID})

**S3 prefix:** s3://{bucket}/{S3_PREFIX}

## Dataset
- Rows: {eda_summary['rows']} | Cols: {eda_summary['cols']}
- Classes: {eda_summary['class_counts']}

## Splits
- Train: {len(train_df)} (~40%)
- Val:   {len(val_df)} (~10%)
- Test:  {len(test_df)} (~10%)
- Prod:  {len(prod_df)} (~40%)

## Feature Store (offline)
- {FG_TRAIN}
- {FG_VAL}
- {FG_BATCH}
"""
with open(ARTIFACTS_DIR / "team_tracker_update_week3.md", "w") as f:
    f.write(md)

# Upload tracker docs
s3 = boto3.client("s3")
s3.upload_file(str(ARTIFACTS_DIR/"team_tracker_update_week3.json"), bucket, f"{S3_PREFIX}/team_tracker_update_week3.json")
s3.upload_file(str(ARTIFACTS_DIR/"team_tracker_update_week3.md"),   bucket, f"{S3_PREFIX}/team_tracker_update_week3.md")

print("Tracker written & uploaded.")

Tracker written & uploaded.


In [10]:
# View

import boto3, json
s3 = boto3.client("s3")
obj = s3.get_object(Bucket=bucket, Key=f"{S3_PREFIX}/team_tracker_update_week3.json")
tracker = json.load(obj["Body"])
tracker

{'run_id': '20250925-102647',
 's3_prefix': 's3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647',
 'dataset': {'rows': 808,
  'cols': 7,
  'class_counts': {'low risk': 478, 'high risk': 330},
  'dtypes': {'Age': 'int64',
   'SystolicBP': 'int64',
   'DiastolicBP': 'int64',
   'BS': 'float64',
   'BodyTemp': 'float64',
   'HeartRate': 'int64',
   'RiskLevel': 'object'},
  'missing': {'Age': 0,
   'SystolicBP': 0,
   'DiastolicBP': 0,
   'BS': 0,
   'BodyTemp': 0,
   'HeartRate': 0,
   'RiskLevel': 0}},
 'splits': {'train_rows': 322,
  'val_rows': 81,
  'test_rows': 81,
  'prod_rows': 324},
 'feature_store_groups': ['mhr-train-fg-20250925-102647',
  'mhr-val-fg-20250925-102647',
  'mhr-batch-fg-20250925-102647']}

### Week 4 - Model Development and Deployment

#### Auto settings; continues from Week 3

In [11]:
import os, io, json, time, tarfile
from pathlib import Path
import boto3, sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
import pandas as pd
import numpy as np

# Reuse Week-3 objects if they exist; otherwise, auto-init (no manual config)
try:
    bucket
    sm_session
    role
except NameError:
    boto_sess  = boto3.session.Session()
    sm_session = Session(boto_sess)
    role       = get_execution_role()
    bucket     = sm_session.default_bucket()

# Use the Week-3 S3 prefix if it’s still in memory; otherwise pick the latest run
s3 = boto3.client("s3")
try:
    WEEK3_PREFIX = S3_PREFIX  # from Week 3 cells
except NameError:
    base = "aai540/maternal-risk/week3/"
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=base, Delimiter="/")
    runs = [cp["Prefix"].rstrip("/") for cp in resp.get("CommonPrefixes", [])]
    assert runs, f"No Week-3 artifacts found under s3://{bucket}/{base}"
    WEEK3_PREFIX = sorted(runs)[-1]

# Create a unique Week-4 prefix
RUN_ID = time.strftime("%Y%m%d-%H%M%S")
W4_PREFIX = f"aai540/maternal-risk/week4/{RUN_ID}"

print("Using Week-3:", f"s3://{bucket}/{WEEK3_PREFIX}")
print("Writing Week-4:", f"s3://{bucket}/{W4_PREFIX}")

Using Week-3: s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week3/20250925-102647
Writing Week-4: s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week4/20250925-102857
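
Picking the latest run with `sorted(runs)[-1]` works because the run IDs use the zero-padded `%Y%m%d-%H%M%S` layout, so string order matches chronological order. A small sketch with mostly made-up prefixes illustrates this by cross-checking against parsed timestamps:

```python
from datetime import datetime

# Hypothetical run prefixes; only the first one matches the run logged above.
runs = [
    "aai540/maternal-risk/week3/20250925-102647",
    "aai540/maternal-risk/week3/20250920-154301",
    "aai540/maternal-risk/week3/20241231-235959",
]

latest_by_string = sorted(runs)[-1]

def run_time(prefix: str) -> datetime:
    """Parse the timestamp that forms the last path segment of a run prefix."""
    return datetime.strptime(prefix.rsplit("/", 1)[-1], "%Y%m%d-%H%M%S")

latest_by_time = max(runs, key=run_time)
print(latest_by_string == latest_by_time, latest_by_string)
```

Note that `list_objects_v2` returns at most 1,000 entries per call, so a bucket with very many runs would need pagination before sorting.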


#### Load Week-3 splits (train/val/test) from S3

In [12]:
# LOAD SPLITS

def read_csv_from_s3(key: str) -> pd.DataFrame:
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

train = read_csv_from_s3(f"{WEEK3_PREFIX}/train.csv")
val   = read_csv_from_s3(f"{WEEK3_PREFIX}/val.csv")
test  = read_csv_from_s3(f"{WEEK3_PREFIX}/test.csv")

label_col = "label"
X_train, y_train = train.drop(columns=[label_col]), train[label_col]
X_val,   y_val   = val.drop(columns=[label_col]),   val[label_col]
X_test,  y_test  = test.drop(columns=[label_col]),  test[label_col]

print("Loaded:", train.shape, val.shape, test.shape)
print("Train label balance:", y_train.value_counts().to_dict())

Loaded: (322, 20) (81, 20) (81, 20)
Train label balance: {0: 190, 1: 132}
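
The label balance suggests that stratification preserved the class ratio. A sketch of that check, using only counts printed in this notebook (`positive_rate` is an illustrative helper):

```python
# Counts printed above and in the dataset summary (0 = low risk, 1 = high risk).
full_counts = {0: 478, 1: 330}
train_counts = {0: 190, 1: 132}

def positive_rate(counts):
    """Share of high-risk (label 1) records."""
    return counts[1] / (counts[0] + counts[1])

full_rate = positive_rate(full_counts)
train_rate = positive_rate(train_counts)
print(f"full: {full_rate:.4f}  train: {train_rate:.4f}  diff: {abs(full_rate - train_rate):.4f}")
```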


#### Benchmark model in SageMaker (very simple: Logistic Regression on 2 features)

In [18]:
# Why: have a simple, interpretable baseline for comparison (MVP).
from sagemaker.sklearn import SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

bm_feats = ["Age", "SystolicBP"]
assert all(f in X_train.columns for f in bm_feats), "Expected baseline features missing."

# Stage small CSVs (filenames don't matter inside channels)
w4_local = Path("w4_benchmark"); w4_local.mkdir(exist_ok=True)
pd.concat([X_train[bm_feats], y_train], axis=1).to_csv(w4_local/"train_benchmark.csv", index=False)
pd.concat([X_val[bm_feats],   y_val],   axis=1).to_csv(w4_local/"val_benchmark.csv",   index=False)

bm_train_s3 = sm_session.upload_data(str(w4_local/"train_benchmark.csv"), key_prefix=f"{W4_PREFIX}/benchmark")
bm_val_s3   = sm_session.upload_data(str(w4_local/"val_benchmark.csv"),   key_prefix=f"{W4_PREFIX}/benchmark")

# Entry script: read first *.csv in each channel, fit LR, write metrics + model
with open("baseline_train.py","w") as f:
    f.write("""
import os, glob, json, pathlib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def first_csv_in(d):
    files = sorted(glob.glob(os.path.join(d, '*.csv')))
    assert files, f'No CSV found in {d}'
    return files[0]

if __name__ == '__main__':
    train_dir = os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')
    val_dir   = os.environ.get('SM_CHANNEL_VAL',   '/opt/ml/input/data/val')
    model_dir = os.environ.get('SM_MODEL_DIR',     '/opt/ml/model')

    df_tr = pd.read_csv(first_csv_in(train_dir))
    df_va = pd.read_csv(first_csv_in(val_dir))

    Xtr, ytr = df_tr[['Age','SystolicBP']], df_tr['label']
    Xva, yva = df_va[['Age','SystolicBP']], df_va['label']

    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

    pred  = clf.predict(Xva)
    proba = clf.predict_proba(Xva)[:,1]
    acc = accuracy_score(yva, pred)
    p,r,f1,_ = precision_recall_fscore_support(yva, pred, average='binary', zero_division=0)
    try:
        auc = roc_auc_score(yva, proba)
    except ValueError:  # raised when the validation labels contain a single class
        auc = float('nan')

    pathlib.Path(model_dir).mkdir(parents=True, exist_ok=True)
    import joblib
    joblib.dump(clf, os.path.join(model_dir, 'model.joblib'))
    with open(os.path.join(model_dir, 'metrics.json'), 'w') as f:
        json.dump({'accuracy':acc,'precision':p,'recall':r,'f1':f1,'roc_auc':auc}, f)
""")

bm_est = SKLearn(
    entry_point="baseline_train.py",
    framework_version="1.2-1",     # use a tag compatible with your Studio image
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    sagemaker_session=sm_session,
)
bm_est.fit({"train": bm_train_s3, "val": bm_val_s3})

# Also compute baseline metrics on our held-out TEST for a clean comparison
bm_clf   = LogisticRegression(max_iter=1000).fit(train[bm_feats], y_train)
bm_proba = bm_clf.predict_proba(test[bm_feats])[:,1]
bm_pred  = (bm_proba >= 0.5).astype(int)

bm_acc = accuracy_score(y_test, bm_pred)
bm_p, bm_r, bm_f1, _ = precision_recall_fscore_support(y_test, bm_pred, average='binary', zero_division=0)
try:
    bm_auc = roc_auc_score(y_test, bm_proba)
except ValueError:  # raised when y_test contains a single class
    bm_auc = float("nan")

baseline_metrics = {"accuracy":bm_acc,"precision":bm_p,"recall":bm_r,"f1":bm_f1,"roc_auc":bm_auc}
baseline_metrics

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2025-09-25-11-29-32-936


2025-09-25 11:29:33 Starting - Starting the training job...
2025-09-25 11:30:00 Starting - Preparing the instances for training...
2025-09-25 11:30:21 Downloading - Downloading input data...
2025-09-25 11:30:46 Downloading - Downloading the training image......
2025-09-25 11:31:58 Training - Training image download completed. Training in progress.
  import pkg_resources
2025-09-25 11:31:51,816 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2025-09-25 11:31:51,820 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2025-09-25 11:31:51,823 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-09-25 11:31:51,839 sagemaker_sklearn_container.training INFO     Invoking user training script.
2025-09-25 11:31:52,170 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2025-09-25 11:31:52,174 sagemaker-t

{'accuracy': 0.7654320987654321,
 'precision': 0.71875,
 'recall': 0.696969696969697,
 'f1': 0.7076923076923077,
 'roc_auc': 0.790719696969697}
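
To see how these metrics relate to one another, they can be recomputed from confusion-matrix counts. The notebook does not print the baseline's confusion matrix at this point; the counts below (TP=23, FP=9, FN=10, TN=39) are back-solved from the reported precision, recall, and accuracy on the 81 test rows, so treat them as inferred rather than observed. `binary_metrics` is an illustrative helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Inferred counts (see the paragraph above); they sum to the 81 test rows.
baseline = binary_metrics(tp=23, fp=9, fn=10, tn=39)
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```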

#### MAIN MODEL in SageMaker (Built-in XGBoost, CSV mode)

In [14]:
# Why built-in? No entry_point needed; just CSV with label first (no header).
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
import sagemaker

# Helper to reorder columns to [label, features...] for CSV training
def reorder_for_xgb(df):
    cols = [label_col] + [c for c in df.columns if c != label_col]
    return df[cols]

xgb_train_local = reorder_for_xgb(train)
xgb_val_local   = reorder_for_xgb(val)

xgb_train_path = Path("w4_xgb_train.csv"); xgb_val_path = Path("w4_xgb_val.csv")
xgb_train_local.to_csv(xgb_train_path, index=False, header=False)
xgb_val_local.to_csv(xgb_val_path,   index=False, header=False)

s3_xgb_train = sm_session.upload_data(str(xgb_train_path), key_prefix=f"{W4_PREFIX}/xgb")
s3_xgb_val   = sm_session.upload_data(str(xgb_val_path),   key_prefix=f"{W4_PREFIX}/xgb")

def get_xgb_image():
    for ver in ["1.7-1", "1.5-1", "1.3-1"]:
        try:
            return sagemaker.image_uris.retrieve("xgboost", sm_session.boto_region_name, version=ver)
        except Exception as e:
            print(f"xgboost {ver} not available → trying next … ({e})")
    raise RuntimeError("No compatible built-in XGBoost image found.")

xgb_image_uri = get_xgb_image()

xgb_est = Estimator(
    image_uri=xgb_image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=sm_session,
    hyperparameters={
        "objective":"binary:logistic",
        "eval_metric":"auc",
        "max_depth":5,
        "eta":0.2,
        "min_child_weight":1,
        "subsample":0.8,
        "colsample_bytree":0.8,
        "num_round":200,
        "verbosity":1,
    },
)

# tell container the training data is CSV (otherwise it expects libsvm)
train_input = TrainingInput(s3_data=s3_xgb_train, content_type="text/csv")
val_input   = TrainingInput(s3_data=s3_xgb_val,   content_type="text/csv")

xgb_est.fit({"train": train_input, "validation": val_input}, wait=True)
xgb_model_artifact = xgb_est.model_data
print("XGBoost model artifact:", xgb_model_artifact)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-09-25-10-31-56-289


2025-09-25 10:32:01 Starting - Starting the training job...
2025-09-25 10:32:16 Starting - Preparing the instances for training...
2025-09-25 10:32:36 Downloading - Downloading input data...
2025-09-25 10:33:16 Downloading - Downloading the training image......
  import pkg_resources
[2025-09-25 10:34:31.168 ip-10-0-120-196.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-09-25 10:34:31.238 ip-10-0-120-196.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-09-25:10:34:31:INFO] Imported framework sagemaker_xgboost_container.training
[2025-09-25:10:34:31:INFO] Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
[2025-09-25:10:34:31:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2025-09-25:10:34:31:INFO] No GPUs detected (normal if no gpus installed)
[2025

#### EVALUATE (compare main vs baseline on TEST)

In [19]:
import xgboost as xgb

def parse_s3_uri(uri: str):
    assert uri.startswith("s3://")
    p = uri[5:]; b, k = p.split("/", 1)
    return b, k

tmp_dir = Path("w4_tmp"); tmp_dir.mkdir(exist_ok=True)
bkt, key = parse_s3_uri(xgb_model_artifact)
boto3.client("s3").download_file(bkt, key, str(tmp_dir/"model.tar.gz"))
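with tarfile.open(tmp_dir/"model.tar.gz") as t:
    # Python 3.12+ emits a DeprecationWarning when no extraction filter is given;
    # passing filter="data" (where supported) silences it and is the safer choice.
    t.extractall(tmp_dir)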

# Score test set with the trained booster
dtest   = xgb.DMatrix(test.drop(columns=[label_col]), label=test[label_col])
booster = xgb.Booster(); booster.load_model(str(tmp_dir/"xgboost-model"))
xgb_proba = booster.predict(dtest)
xgb_pred  = (xgb_proba >= 0.5).astype(int)

from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
xgb_acc = accuracy_score(y_test, xgb_pred)
xgb_p, xgb_r, xgb_f1, _ = precision_recall_fscore_support(y_test, xgb_pred, average='binary', zero_division=0)
xgb_auc = roc_auc_score(y_test, xgb_proba)

metrics_compare = {
    "baseline": baseline_metrics,
    "xgboost":  {"accuracy":xgb_acc,"precision":xgb_p,"recall":xgb_r,"f1":xgb_f1,"roc_auc":xgb_auc},
}
metrics_compare


{'baseline': {'accuracy': 0.7654320987654321,
  'precision': 0.71875,
  'recall': 0.696969696969697,
  'f1': 0.7076923076923077,
  'roc_auc': 0.790719696969697},
 'xgboost': {'accuracy': 0.9876543209876543,
  'precision': 1.0,
  'recall': 0.9696969696969697,
  'f1': 0.9846153846153847,
  'roc_auc': 0.999368686868687}}
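
The gap between the two models is easier to read as per-metric differences. A short sketch using the values printed above, rounded to four decimals:

```python
# Values copied from the comparison above, rounded to four decimals.
metrics_compare = {
    "baseline": {"accuracy": 0.7654, "precision": 0.7188, "recall": 0.6970,
                 "f1": 0.7077, "roc_auc": 0.7907},
    "xgboost":  {"accuracy": 0.9877, "precision": 1.0000, "recall": 0.9697,
                 "f1": 0.9846, "roc_auc": 0.9994},
}

# Improvement of XGBoost over the baseline for each metric.
deltas = {
    name: round(metrics_compare["xgboost"][name] - metrics_compare["baseline"][name], 4)
    for name in metrics_compare["baseline"]
}
for name, delta in deltas.items():
    print(f"{name:<10} {delta:+.4f}")
```

With only 81 test rows, a single misclassification moves accuracy by about 1.2 percentage points, so the XGBoost figures are best read with that granularity in mind.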

#### Deploy via Batch Transform (score Week-3 production.csv)

In [21]:
# Inference expects FEATURES ONLY (no label) in the SAME order as training.

from sagemaker.inputs import TransformInput

prod_df = read_csv_from_s3(f"{WEEK3_PREFIX}/production.csv")

# Same feature order as used to create training CSVs
FEATURE_COLS = [c for c in train.columns if c != label_col]
print("Feature cols count:", len(FEATURE_COLS))

prod_features = prod_df[FEATURE_COLS].copy()
bt_local = Path("w4_production_features_only.csv")
prod_features.to_csv(bt_local, index=False, header=False)

s3_bt_input = sm_session.upload_data(str(bt_local), key_prefix=f"{W4_PREFIX}/batch")

transformer = xgb_est.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{W4_PREFIX}/batch/outputs",
    accept="text/csv",
    assemble_with="Line",
)

transformer.transform(data=s3_bt_input, content_type="text/csv", split_type="Line")
transformer.wait()

batch_output_s3 = transformer.output_path
print("Batch output:", batch_output_s3)

Feature cols count: 19


INFO:sagemaker:Creating model with name: sagemaker-xgboost-2025-09-25-11-35-20-162
INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2025-09-25-11-35-20-993
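
Once the transform job finishes, the output prefix should contain one `.out` object per input file. A minimal sketch of turning that text into labels follows, assuming the model emits one probability per line (as `booster.predict` does above) and reusing the 0.5 cutoff. `parse_batch_output` and `attach_predictions` are hypothetical helpers, not part of the SageMaker SDK, and the sample values are made up for illustration.

```python
def parse_batch_output(text, threshold=0.5):
    """Parse one probability per line and apply a decision threshold."""
    probs = [float(line) for line in text.splitlines() if line.strip()]
    preds = [int(p >= threshold) for p in probs]
    return probs, preds


def attach_predictions(rows, text, threshold=0.5):
    """Append (probability, label) to each feature row, preserving row order."""
    probs, preds = parse_batch_output(text, threshold)
    if len(probs) != len(rows):
        # A length mismatch usually means the input and output files do not correspond.
        raise ValueError(f"expected {len(rows)} predictions, got {len(probs)}")
    return [list(row) + [p, y] for row, p, y in zip(rows, probs, preds)]


# Illustrative values only; real input would be the downloaded `.out` file.
sample_rows = [[25, 130, 80], [35, 120, 60], [29, 90, 70]]
sample_out = "0.91\n0.12\n0.50\n"
scored = attach_predictions(sample_rows, sample_out)
print(scored)
```

Because the input CSV was written without a header or index, row order is the only link between a production record and its score, which is why the sketch checks the lengths before zipping.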



#### Artifacts + design-doc snippet + tracker (upload to S3)

In [22]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_cm(cm, title, path):
    """Render a confusion matrix as a heatmap and save it to `path`."""
    plt.figure()
    plt.imshow(cm, interpolation='nearest')
    plt.title(title)
    plt.colorbar()
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()

# Confusion matrices (test set)
bm_pred = (LogisticRegression(max_iter=1000).fit(train[bm_feats], y_train)
           .predict_proba(test[bm_feats])[:,1] >= 0.5).astype(int)
bm_cm   = confusion_matrix(y_test, bm_pred)
xgb_cm  = confusion_matrix(y_test, xgb_pred)

art_dir = Path("w4_artifacts"); art_dir.mkdir(exist_ok=True)
plot_cm(bm_cm,  "Baseline CM", art_dir/"baseline_cm.png")
plot_cm(xgb_cm, "XGBoost CM",  art_dir/"xgb_cm.png")
with open(art_dir/"metrics_compare.json","w") as f:
    json.dump(metrics_compare, f, indent=2)

def up(local, key):
    boto3.client("s3").upload_file(str(local), bucket, f"{W4_PREFIX}/{key}")
    return f"s3://{bucket}/{W4_PREFIX}/{key}"

metrics_s3 = up(art_dir/"metrics_compare.json", "metrics_compare.json")
bm_cm_s3   = up(art_dir/"baseline_cm.png",      "baseline_cm.png")
xgb_cm_s3  = up(art_dir/"xgb_cm.png",           "xgb_cm.png")

design_doc_snippet = f"""
### Week 4 Findings — Model Development & Deployment

**Benchmark (LogReg on Age + SystolicBP)**  
Acc: {baseline_metrics['accuracy']:.3f} | Prec: {baseline_metrics['precision']:.3f} | Rec: {baseline_metrics['recall']:.3f} | F1: {baseline_metrics['f1']:.3f} | AUC: {baseline_metrics['roc_auc']:.3f}

**XGBoost (full features)**  
Acc: {metrics_compare['xgboost']['accuracy']:.3f} | Prec: {metrics_compare['xgboost']['precision']:.3f} | Rec: {metrics_compare['xgboost']['recall']:.3f} | F1: {metrics_compare['xgboost']['f1']:.3f} | AUC: {metrics_compare['xgboost']['roc_auc']:.3f}

**Artifacts**  
- Metrics JSON: {metrics_s3}  
- Baseline CM:  {bm_cm_s3}  
- XGBoost CM:   {xgb_cm_s3}  
- XGBoost Model Artifact: {xgb_model_artifact}  
- Batch Transform Output: {batch_output_s3}
"""
print(design_doc_snippet)

# Tracker (JSON + Markdown)
w4_tracker_dir = Path("w4_tracker"); w4_tracker_dir.mkdir(exist_ok=True)
tracker_w4 = {
    "week": "4",
    "run_id": RUN_ID,
    "week3_prefix": f"s3://{bucket}/{WEEK3_PREFIX}",
    "week4_prefix": f"s3://{bucket}/{W4_PREFIX}",
    "benchmark": baseline_metrics,
    "xgboost": metrics_compare["xgboost"],
    "artifacts": {
        "metrics_json": metrics_s3,
        "baseline_cm": bm_cm_s3,
        "xgb_cm": xgb_cm_s3,
        "model_artifact": xgb_model_artifact,
        "batch_output": batch_output_s3
    }
}
with open(w4_tracker_dir/"team_tracker_update_week4.json","w") as f:
    json.dump(tracker_w4, f, indent=2)

md = f"""# Week 4 Tracker – Maternal Health Risk (RUN: {RUN_ID})

**Week-3 prefix:** s3://{bucket}/{WEEK3_PREFIX}  
**Week-4 prefix:** s3://{bucket}/{W4_PREFIX}

## Benchmark (LogReg on Age + SystolicBP)
Acc: {baseline_metrics['accuracy']:.3f} | Prec: {baseline_metrics['precision']:.3f} | Rec: {baseline_metrics['recall']:.3f} | F1: {baseline_metrics['f1']:.3f} | AUC: {baseline_metrics['roc_auc']:.3f}

## XGBoost (full features)
Acc: {metrics_compare['xgboost']['accuracy']:.3f} | Prec: {metrics_compare['xgboost']['precision']:.3f} | Rec: {metrics_compare['xgboost']['recall']:.3f} | F1: {metrics_compare['xgboost']['f1']:.3f} | AUC: {metrics_compare['xgboost']['roc_auc']:.3f}

## Artifacts
- Metrics JSON: {metrics_s3}
- Baseline CM:  {bm_cm_s3}
- XGBoost CM:   {xgb_cm_s3}
- Model:        {xgb_model_artifact}
- Batch Output: {batch_output_s3}
"""
with open(w4_tracker_dir/"team_tracker_update_week4.md","w") as f:
    f.write(md)

# Reuse the upload helper defined above instead of creating a new S3 client per file
up(w4_tracker_dir/"team_tracker_update_week4.json", "team_tracker_update_week4.json")
up(w4_tracker_dir/"team_tracker_update_week4.md",   "team_tracker_update_week4.md")

print("Week-4 tracker written & uploaded →", f"s3://{bucket}/{W4_PREFIX}/team_tracker_update_week4.*")


### Week 4 Findings — Model Development & Deployment

**Benchmark (LogReg on Age + SystolicBP)**  
Acc: 0.765 | Prec: 0.719 | Rec: 0.697 | F1: 0.708 | AUC: 0.791

**XGBoost (full features)**  
Acc: 0.988 | Prec: 1.000 | Rec: 0.970 | F1: 0.985 | AUC: 0.999

**Artifacts**  
- Metrics JSON: s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week4/20250925-102857/metrics_compare.json  
- Baseline CM:  s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week4/20250925-102857/baseline_cm.png  
- XGBoost CM:   s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week4/20250925-102857/xgb_cm.png  
- XGBoost Model Artifact: s3://sagemaker-us-east-1-533267301342/sagemaker-xgboost-2025-09-25-10-31-56-289/output/model.tar.gz  
- Batch Transform Output: s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week4/20250925-102857/batch/outputs

Week-4 tracker written & uploaded → s3://sagemaker-us-east-1-533267301342/aai540/maternal-risk/week4/20250925-102857/team_tracker_u