# Group 1 Final Project Work 

#### Dataset details

### Maternal Health Risk Dataset Summary

**Shape:** 808 records × 7 columns  

**Columns:**
- `Age`
- `SystolicBP` (Systolic Blood Pressure)
- `DiastolicBP` (Diastolic Blood Pressure)
- `BS` (Blood Sugar level)
- `BodyTemp` (Body Temperature, °F)
- `HeartRate` (Heart Rate, bpm)
- `RiskLevel` (Target: maternal health risk category)

---

#### First 5 Records
| Age | SystolicBP | DiastolicBP | BS   | BodyTemp | HeartRate | RiskLevel  |
|-----|------------|--------------|------|----------|-----------|------------|
| 25  | 130        | 80           | 15.0 | 98.0     | 86        | high risk  |
| 35  | 140        | 90           | 13.0 | 98.0     | 70        | high risk  |
| 29  | 90         | 70           | 8.0  | 100.0    | 80        | high risk  |
| 30  | 140        | 85           | 7.0  | 98.0     | 70        | high risk  |
| 35  | 120        | 60           | 6.1  | 98.0     | 76        | low risk   |

---

#### Summary Statistics
- **Age:** 10–70 years (mean = 30.6, std = 13.9)  
- **SystolicBP:** 70–160 mmHg (mean = 113, std = 19.9)  
- **DiastolicBP:** 49–100 mmHg (mean = 77.5, std = 14.8)  
- **BS:** 6–19 mmol/L (mean = 9.26, std = 3.62)  
- **BodyTemp:** 98–103 °F (mean = 98.6, std = 1.39)  
- **HeartRate:** 7–90 bpm (mean = 74.3, std = 8.82); the 7 bpm minimum is physiologically implausible and is probably a data-entry error  
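
The ranges and moments listed above are the kind of figures pandas produces with a single aggregation. A minimal sketch, assuming only the five sample rows from the "First 5 Records" table as stand-in data (the `sample` frame is illustrative, so its numbers differ from the full 808-row statistics):

```python
import pandas as pd

# Stand-in data: the five rows shown under "First 5 Records".
sample = pd.DataFrame({
    "Age":         [25, 35, 29, 30, 35],
    "SystolicBP":  [130, 140, 90, 140, 120],
    "DiastolicBP": [80, 90, 70, 85, 60],
    "BS":          [15.0, 13.0, 8.0, 7.0, 6.1],
    "BodyTemp":    [98.0, 98.0, 100.0, 98.0, 98.0],
    "HeartRate":   [86, 70, 80, 70, 76],
})

# One row per feature, with min / max / mean / std as columns.
summary = sample.agg(["min", "max", "mean", "std"]).T.round(2)
print(summary)
```

Running the same aggregation on the full DataFrame would be one way to spot out-of-range values such as an implausibly low heart rate.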

---

#### Target Variable: RiskLevel
- **Low risk:** 478 records (~59.2%)  
- **High risk:** 330 records (~40.8%)  
- **Medium risk:** Not present in this dataset version  

**Note:** This dataset version is binary-labeled (low vs. high risk). A 3-class model (low/mid/high) would need a version of the data that includes mid-risk records; preprocessing alone cannot recover a class that is absent.
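
The class shares quoted above follow directly from the counts. A small sketch of that arithmetic, using only the counts reported in this section (the variable names are illustrative):

```python
# Class counts as reported in the summary above.
class_counts = {"low risk": 478, "high risk": 330}

total = sum(class_counts.values())
# Percentage share of each class, rounded to one decimal place.
shares = {k: round(100 * v / total, 1) for k, v in class_counts.items()}
# Majority-to-minority ratio as a quick imbalance indicator.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(total, shares, round(imbalance_ratio, 2))
```

A ratio of roughly 1.45 is a mild imbalance, which is one reason to keep the splits stratified.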

### Week 3 - Training and Feature Engineering

#### Environment (auto role + auto bucket) and constants

In [19]:
# This notebook:
#   • does EDA and writes plots/summary
#   • engineers features
#   • creates stratified splits: train(40%), val(10%), test(10%), production(40%)
#   • uploads artifacts to S3 (auto default bucket)
#   • creates OFFLINE Feature Store groups (auto execution role)
#   • writes a tracker update (JSON + Markdown)
#
# NO MANUAL SETTINGS: bucket/role are auto-detected from your Studio kernel.

import os, json, time
from pathlib import Path

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

plt.rcParams["figure.dpi"] = 120

# Paths
LOCAL_DATA_PATH = Path("Maternal_Risk.csv")          # change only if your CSV is elsewhere
ARTIFACTS_DIR   = Path("week3_outputs")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

# AWS context (auto)
boto_sess  = boto3.session.Session()
region     = boto_sess.region_name
sm_session = Session(boto_sess)
role       = get_execution_role()                     # auto from Studio kernel
bucket     = sm_session.default_bucket()              # auto default bucket

# Unique ids (prevent FG name collisions + make runs auditable)
RUN_ID    = time.strftime("%Y%m%d-%H%M%S")
S3_PREFIX = f"aai540/maternal-risk/week3/{RUN_ID}"

print("Region:", region)
print("Role:  ", role)
print("S3:    ", f"s3://{bucket}/{S3_PREFIX}")

Region: us-east-1
Role:   arn:aws:iam::590183777926:role/LabRole
S3:     s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407


#### Load data + lightweight EDA (plots + JSON summary)

In [20]:
assert LOCAL_DATA_PATH.exists(), f"Missing: {LOCAL_DATA_PATH.resolve()}"
df = pd.read_csv(LOCAL_DATA_PATH)

assert "RiskLevel" in df.columns, "Expected target column 'RiskLevel'."
print("Shape:", df.shape)
print("Columns:", list(df.columns))

# EDA summary (for tracker)
eda_summary = {
    "rows": int(df.shape[0]),
    "cols": int(df.shape[1]),
    "columns": df.columns.tolist(),
    "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    "missing_counts": df.isna().sum().to_dict(),
    "class_counts": df["RiskLevel"].value_counts().to_dict(),
}
with open(ARTIFACTS_DIR / "eda_summary.json", "w") as f:
    json.dump(eda_summary, f, indent=2)

# Simple plots (defaults only; no custom colors)
df["RiskLevel"].value_counts().plot(kind="bar"); plt.title("Class Distribution")
plt.tight_layout(); plt.savefig(ARTIFACTS_DIR / "chart_class_distribution.png"); plt.clf()

df["Age"].plot(kind="hist", bins=20); plt.title("Age Distribution")
plt.tight_layout(); plt.savefig(ARTIFACTS_DIR / "chart_age_hist.png"); plt.clf()

plt.boxplot([df["SystolicBP"], df["DiastolicBP"]])
plt.xticks([1, 2], ["SystolicBP", "DiastolicBP"])  # set labels via xticks; boxplot's `labels=` is deprecated in Matplotlib 3.9+
plt.title("Blood Pressure Boxplots"); plt.tight_layout()
plt.savefig(ARTIFACTS_DIR / "chart_bp_box.png"); plt.clf()

num_cols = df.select_dtypes(include=[np.number]).columns
corr = df[num_cols].corr()
plt.imshow(corr, interpolation="nearest")
plt.xticks(range(len(num_cols)), num_cols, rotation=45, ha="right")
plt.yticks(range(len(num_cols)), num_cols); plt.colorbar(); plt.title("Correlation Heatmap")
plt.tight_layout(); plt.savefig(ARTIFACTS_DIR / "chart_corr_heatmap.png"); plt.clf()

print("EDA done →", ARTIFACTS_DIR)

Shape: (808, 7)
Columns: ['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate', 'RiskLevel']



EDA done → week3_outputs


#### Feature engineering (clinically-motivated features + z-scaling)

In [21]:
# We derive simple vitals-based features and also add z-scaled versions for linear models.

X = df.copy()

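# Clinically motivated derived features
X["PulsePressure"]     = X["SystolicBP"] - X["DiastolicBP"]
X["SBP_to_DBP"]        = X["SystolicBP"] / (X["DiastolicBP"].replace(0, np.nan))  # NaN rather than inf if DBP == 0
X["Fever"]             = (X["BodyTemp"] > 99.5).astype(int)
# HeartRate tops out at 90 bpm in this dataset, so this flag is 0 for every row here.
X["Tachycardia"]       = (X["HeartRate"] >= 100).astype(int)
X["HypertensionFlag"]  = ((X["SystolicBP"] >= 140) | (X["DiastolicBP"] >= 90)).astype(int)

# Optional standardization for linear models.
# NOTE: the scaler is fit on all rows before the split below, so val/test/production
# statistics leak into the z_ columns; refit on the training split alone before modeling.
cont = ["Age","SystolicBP","DiastolicBP","BS","BodyTemp","HeartRate","PulsePressure","SBP_to_DBP"]
X[[f"z_{c}" for c in cont]] = StandardScaler().fit_transform(X[cont])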

# Label encoding (binary in this dataset)
label_map = {"low risk": 0, "high risk": 1}
if set(df["RiskLevel"].unique()) == set(label_map):
    y = df["RiskLevel"].map(label_map)
else:
    cats = sorted(df["RiskLevel"].unique())
    label_map = {v:i for i,v in enumerate(cats)}
    y = df["RiskLevel"].map(label_map)

with open(ARTIFACTS_DIR / "label_map.json", "w") as f:
    json.dump(label_map, f, indent=2)

X_no_target = X.drop(columns=["RiskLevel"])
engineered_full = pd.concat([X_no_target, y.rename("label")], axis=1)
engineered_full.to_csv(ARTIFACTS_DIR / "maternal_features_full.csv", index=False)

print("Feature engineering done.")

Feature engineering done.
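
To make the derived features concrete, here is a worked example that applies the same formulas as the cell above to the five rows from the "First 5 Records" table. It is only a sketch on stand-in data; the `rows` and `derived` names are illustrative.

```python
import pandas as pd

# Vitals for the five rows shown under "First 5 Records".
rows = pd.DataFrame({
    "SystolicBP":  [130, 140, 90, 140, 120],
    "DiastolicBP": [80, 90, 70, 85, 60],
    "BodyTemp":    [98.0, 98.0, 100.0, 98.0, 98.0],
    "HeartRate":   [86, 70, 80, 70, 76],
})

# Same thresholds as the feature-engineering cell.
derived = pd.DataFrame({
    "PulsePressure":    rows["SystolicBP"] - rows["DiastolicBP"],
    "Fever":            (rows["BodyTemp"] > 99.5).astype(int),
    "Tachycardia":      (rows["HeartRate"] >= 100).astype(int),
    "HypertensionFlag": ((rows["SystolicBP"] >= 140) | (rows["DiastolicBP"] >= 90)).astype(int),
})
print(derived)
```

On these rows the pulse pressures are 50, 50, 20, 55 and 60 mmHg, only the third row is flagged as feverish, and the second and fourth rows trip the hypertension flag.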


#### Stratified splits: 40% prod, 40% train, 10% val, 10% test

In [22]:
# We first carve out 40% as "production" holdout for future batch inference/monitoring.
# The remaining 60% --> train (40%), val (10%), test (10%) of the original dataset.

from sklearn.model_selection import train_test_split

# 40% set aside for future batch inference/monitoring
X_tmp, X_prod, y_tmp, y_prod = train_test_split(
    X_no_target, y, test_size=0.40, random_state=42, stratify=y
)
# remaining 60% -> 40/10/10
X_train, X_rem, y_train, y_rem = train_test_split(
    X_tmp, y_tmp, test_size=(1/3), random_state=42, stratify=y_tmp
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=42, stratify=y_rem
)

def _save(name, Xd, yd):
    out = Xd.copy(); out["label"] = yd.values
    out.to_csv(ARTIFACTS_DIR / f"{name}.csv", index=False)
    return out

train_df = _save("train",      X_train, y_train)
val_df   = _save("val",        X_val,   y_val)
test_df  = _save("test",       X_test,  y_test)
prod_df  = _save("production", X_prod,  y_prod)

print({"train":len(train_df), "val":len(val_df), "test":len(test_df), "production":len(prod_df)})

{'train': 322, 'val': 81, 'test': 81, 'production': 324}
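
Once the splits exist, z-scaling can be fit on the training rows alone and then applied to the held-out rows, so that held-out statistics never influence the scaler. The sketch below is one possible way to do this and is not part of the notebook's pipeline; it uses synthetic stand-in data (`demo`, `labels`) and a single 80/20 split purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the maternal dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age":        rng.integers(15, 50, size=100),
    "SystolicBP": rng.integers(90, 160, size=100),
})
labels = pd.Series(rng.integers(0, 2, size=100))

X_tr, X_te = train_test_split(demo, test_size=0.2, random_state=42, stratify=labels)

cols = ["Age", "SystolicBP"]
z_names = [f"z_{c}" for c in cols]

# Fit on training rows only, then reuse the same mean/std for held-out rows.
scaler = StandardScaler().fit(X_tr[cols])
z_tr = pd.DataFrame(scaler.transform(X_tr[cols]), columns=z_names, index=X_tr.index)
z_te = pd.DataFrame(scaler.transform(X_te[cols]), columns=z_names, index=X_te.index)

print(z_tr.mean().round(6).to_dict(), len(z_tr), len(z_te))
```

The training columns come out with mean 0 and unit variance by construction, while the held-out columns are only approximately standardized, which is the expected behavior.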


#### Upload artifacts to S3 (no manual bucket)

In [23]:
# Upload the CSVs, label map, EDA summary, and figures to your default bucket/prefix.

s3 = boto3.client("s3")

def s3_upload(local: Path, key: str):
    s3.upload_file(str(local), bucket, f"{S3_PREFIX}/{key}")
    print("Uploaded", f"s3://{bucket}/{S3_PREFIX}/{key}")

# CSVs + summaries
for fname in ["train.csv","val.csv","test.csv","production.csv",
              "maternal_features_full.csv","label_map.json","eda_summary.json"]:
    s3_upload(ARTIFACTS_DIR / fname, fname)

# Plots
for fname in ["chart_class_distribution.png","chart_age_hist.png","chart_bp_box.png","chart_corr_heatmap.png"]:
    s3_upload(ARTIFACTS_DIR / fname, f"figures/{fname}")

Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/train.csv
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/val.csv
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/test.csv
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/production.csv
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/maternal_features_full.csv
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/label_map.json
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/eda_summary.json
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/figures/chart_class_distribution.png
Uploaded s3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407/figures/chart_age_hist.png
Uploaded s3://sagemaker-u
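
After an upload loop like the one above, it can be useful to confirm that every expected artifact actually landed under the run prefix. The helpers below are a hypothetical sketch (`list_keys` and `missing_artifacts` are not in the notebook); they assume a boto3 S3 client is passed in and rely only on its `list_objects_v2` paginator.

```python
def list_keys(s3_client, bucket, prefix):
    """Return every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects omit the "Contents" key.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def missing_artifacts(keys, prefix, expected):
    """Return the expected file names that are absent under `prefix`."""
    present = set(keys)
    return [name for name in expected if f"{prefix}/{name}" not in present]


# Hypothetical usage inside the notebook, where `s3`, `bucket` and S3_PREFIX exist:
#   keys = list_keys(s3, bucket, S3_PREFIX)
#   print(missing_artifacts(keys, S3_PREFIX, ["train.csv", "val.csv", "test.csv"]))
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub instead of real AWS credentials.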

#### Sanitize column names (Feature Store regex) & write sanitized splits

In [24]:
# Conservative naming used here: letters, digits, and hyphens only; must start with an alphanumeric; <= 64 chars.

def sanitize_col(name: str) -> str:
    if name == "SBP_to_DBP": name = "SBPtoDBP"   # preserve meaning
    if name.startswith("z_"): name = "z" + name[2:]
    name = name.replace("_", "")
    name = "".join(ch for ch in name if ch.isalnum() or ch == "-")
    if not name or not name[0].isalnum(): name = "f" + name
    return name[:64]

def sanitize_df_cols(df: pd.DataFrame) -> pd.DataFrame:
    newcols, seen = [], set()
    for c in df.columns:
        s = sanitize_col(c)
        if s in seen:
            i, base = 2, s
            while f"{base}{i}" in seen: i += 1
            s = f"{base}{i}"
        newcols.append(s); seen.add(s)
    out = df.copy(); out.columns = newcols
    return out

label_col = "label"
def sanitize_split(df):
    feats = df.drop(columns=[label_col])
    feats = sanitize_df_cols(feats)
    feats[label_col] = df[label_col].values
    return feats

train_s = sanitize_split(train_df); train_s.to_csv(ARTIFACTS_DIR/"train_sanitized.csv", index=False)
val_s   = sanitize_split(val_df);   val_s.to_csv(ARTIFACTS_DIR/"val_sanitized.csv",   index=False)
test_s  = sanitize_split(test_df);  test_s.to_csv(ARTIFACTS_DIR/"test_sanitized.csv", index=False)
prod_s  = sanitize_split(prod_df);  prod_s.to_csv(ARTIFACTS_DIR/"production_sanitized.csv", index=False)

print("Sanitized splits saved.")

Sanitized splits saved.
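
For reference, this is how the sanitizer renames the engineered columns. The function body is copied from the cell above so the example runs on its own; the list of column names is a representative sample rather than the full schema.

```python
def sanitize_col(name: str) -> str:
    if name == "SBP_to_DBP": name = "SBPtoDBP"   # preserve meaning
    if name.startswith("z_"): name = "z" + name[2:]
    name = name.replace("_", "")
    name = "".join(ch for ch in name if ch.isalnum() or ch == "-")
    if not name or not name[0].isalnum(): name = "f" + name
    return name[:64]

# Map a few original column names to their sanitized form.
examples = ["Age", "PulsePressure", "SBP_to_DBP", "z_Age", "z_SBP_to_DBP"]
renamed = {c: sanitize_col(c) for c in examples}
print(renamed)
```

Names without underscores pass through unchanged, while `z_Age` becomes `zAge` and both ratio columns lose their underscores (`SBPtoDBP`, `zSBPtoDBP`).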


#### Create & ingest Feature Store (OFFLINE, unique names per run)

In [25]:
import time
import boto3
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum

sm      = boto3.client("sagemaker")
session = Session(boto3.session.Session(region_name=region))

def ensure_id_time(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    if "recordid" not in df.columns:
        df["recordid"] = range(1, len(df)+1)
    if "eventtime" not in df.columns:
        df["eventtime"] = pd.Timestamp.utcnow().isoformat()
    return df

def to_boto_feature_defs(df: pd.DataFrame):
    out = []
    for c, d in df.dtypes.items():
        if c == "eventtime":
            t = "String"
        elif pd.api.types.is_integer_dtype(d):
            t = "Integral"
        elif pd.api.types.is_float_dtype(d):
            t = "Fractional"
        else:
            t = "String"
        out.append({"FeatureName": c, "FeatureType": t})
    return out

def create_fg_boto3(name: str, df_local: pd.DataFrame, s3_uri: str):
    fdefs = to_boto_feature_defs(df_local)
    try:
        resp = sm.create_feature_group(
            FeatureGroupName=name,
            RecordIdentifierFeatureName="recordid",
            EventTimeFeatureName="eventtime",
            FeatureDefinitions=fdefs,
            OfflineStoreConfig={"S3StorageConfig": {"S3Uri": s3_uri}},
            OnlineStoreConfig={"EnableOnlineStore": False},
            RoleArn=role,
            Description=f"Maternal Health Risk – {name}",
        )
        return resp
    except sm.exceptions.ResourceInUse:
        # Already exists --> safe to reuse after we confirm it's Created
        return {"FeatureGroupArn": f"arn:aws:sagemaker:{region}:{boto3.client('sts').get_caller_identity()['Account']}:feature-group/{name}"}

def wait_fg_created(name: str, timeout_s: int = 900, poll_s: int = 10):
    start = time.time()
    last = ""
    while True:
        desc = sm.describe_feature_group(FeatureGroupName=name)
        status = desc.get("FeatureGroupStatus", "")
        if status == "Created":
            print(f"[READY] {name}")
            return desc
        if status == "CreateFailed":
            raise RuntimeError(f"{name} failed: {desc.get('FailureReason')}")
        if time.time() - start > timeout_s:
            raise TimeoutError(f"Timeout waiting for {name} (last status={status})")
        if status != last:
            print(f"Status {name}: {status}")
            last = status
        time.sleep(poll_s)

def create_and_ingest(name_base: str, df_local: pd.DataFrame):
    # unique FG names per run to avoid collisions
    name = f"{name_base}-{RUN_ID}"             # e.g., mhr-train-fg-20250920-154301
    assert "_" not in name, "FG name must not contain underscores."
    df_local = ensure_id_time(df_local)
    s3_uri   = f"s3://{bucket}/{S3_PREFIX}/feature-store/{name}"

    create_fg_boto3(name, df_local, s3_uri)
    wait_fg_created(name)

    fg = FeatureGroup(name=name, sagemaker_session=session)
    fg.load_feature_definitions(data_frame=df_local)    # make sure SDK knows schema
    fg.ingest(data_frame=df_local, max_workers=4, wait=True)
    print(f"[OK] Ingested {len(df_local)} rows → {name}")
    return name

# Load sanitized splits
train_s = pd.read_csv(ARTIFACTS_DIR/"train_sanitized.csv")
val_s   = pd.read_csv(ARTIFACTS_DIR/"val_sanitized.csv")
prod_s  = pd.read_csv(ARTIFACTS_DIR/"production_sanitized.csv")

# Create OFFLINE FGs with unique names
FG_TRAIN = create_and_ingest("mhr-train-fg", train_s)
FG_VAL   = create_and_ingest("mhr-val-fg",   val_s)
FG_BATCH = create_and_ingest("mhr-batch-fg", prod_s)

print("Feature Store complete:", FG_TRAIN, FG_VAL, FG_BATCH)

Status mhr-train-fg-20250920-160407: Creating
[READY] mhr-train-fg-20250920-160407
[OK] Ingested 322 rows → mhr-train-fg-20250920-160407
Status mhr-val-fg-20250920-160407: Creating
[READY] mhr-val-fg-20250920-160407
[OK] Ingested 81 rows → mhr-val-fg-20250920-160407
Status mhr-batch-fg-20250920-160407: Creating
[READY] mhr-batch-fg-20250920-160407
[OK] Ingested 324 rows → mhr-batch-fg-20250920-160407
Feature Store complete: mhr-train-fg-20250920-160407 mhr-val-fg-20250920-160407 mhr-batch-fg-20250920-160407
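
The schema sent to Feature Store is derived from pandas dtypes. The sketch below mirrors the mapping in `to_boto_feature_defs` on a tiny stand-in frame, with no AWS calls involved; `feature_type`, the `sample` values, and the timestamp string are illustrative.

```python
import pandas as pd

def feature_type(col, dtype):
    """Same dtype -> Feature Store type mapping as to_boto_feature_defs."""
    if col == "eventtime":
        return "String"
    if pd.api.types.is_integer_dtype(dtype):
        return "Integral"
    if pd.api.types.is_float_dtype(dtype):
        return "Fractional"
    return "String"

# Stand-in rows with a mix of integer, float, and string columns.
sample = pd.DataFrame({
    "Age":       [25, 35],
    "BS":        [15.0, 13.0],
    "Fever":     [0, 0],
    "label":     [1, 1],
    "recordid":  [1, 2],
    "eventtime": ["2025-09-20T16:04:07Z", "2025-09-20T16:04:07Z"],
})

defs = [{"FeatureName": c, "FeatureType": feature_type(c, d)}
        for c, d in sample.dtypes.items()]
print(defs)
```

Integer columns (including the binary flags and `recordid`) map to `Integral`, float columns to `Fractional`, and everything else, including `eventtime`, to `String`.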


#### Tracker update (JSON + Markdown) and upload

In [26]:
tracker = {
    "run_id": RUN_ID,
    "s3_prefix": f"s3://{bucket}/{S3_PREFIX}",
    "dataset": {
        "rows": eda_summary["rows"], "cols": eda_summary["cols"],
        "class_counts": eda_summary["class_counts"],
        "dtypes": eda_summary["dtypes"],
        "missing": eda_summary["missing_counts"],
    },
    "splits": {
        "train_rows": len(train_df), "val_rows": len(val_df),
        "test_rows": len(test_df), "prod_rows": len(prod_df),
    },
    "feature_store_groups": [FG_TRAIN, FG_VAL, FG_BATCH],
}
with open(ARTIFACTS_DIR / "team_tracker_update_week3.json", "w") as f:
    json.dump(tracker, f, indent=2)

md = f"""# Week 3 Tracker — Maternal Health Risk (RUN: {RUN_ID})

**S3 prefix:** s3://{bucket}/{S3_PREFIX}

## Dataset
- Rows: {eda_summary['rows']} | Cols: {eda_summary['cols']}
- Classes: {eda_summary['class_counts']}

## Splits
- Train: {len(train_df)} (~40%)
- Val:   {len(val_df)} (~10%)
- Test:  {len(test_df)} (~10%)
- Prod:  {len(prod_df)} (~40%)

## Feature Store (offline)
- {FG_TRAIN}
- {FG_VAL}
- {FG_BATCH}
"""
with open(ARTIFACTS_DIR / "team_tracker_update_week3.md", "w") as f:
    f.write(md)

# Upload tracker docs
s3 = boto3.client("s3")
s3.upload_file(str(ARTIFACTS_DIR/"team_tracker_update_week3.json"), bucket, f"{S3_PREFIX}/team_tracker_update_week3.json")
s3.upload_file(str(ARTIFACTS_DIR/"team_tracker_update_week3.md"),   bucket, f"{S3_PREFIX}/team_tracker_update_week3.md")

print("Tracker written & uploaded.")

Tracker written & uploaded.


In [28]:
# View

import boto3, json
s3 = boto3.client("s3")
obj = s3.get_object(Bucket=bucket, Key=f"{S3_PREFIX}/team_tracker_update_week3.json")
tracker = json.load(obj["Body"])
tracker

{'run_id': '20250920-160407',
 's3_prefix': 's3://sagemaker-us-east-1-590183777926/aai540/maternal-risk/week3/20250920-160407',
 'dataset': {'rows': 808,
  'cols': 7,
  'class_counts': {'low risk': 478, 'high risk': 330},
  'dtypes': {'Age': 'int64',
   'SystolicBP': 'int64',
   'DiastolicBP': 'int64',
   'BS': 'float64',
   'BodyTemp': 'float64',
   'HeartRate': 'int64',
   'RiskLevel': 'object'},
  'missing': {'Age': 0,
   'SystolicBP': 0,
   'DiastolicBP': 0,
   'BS': 0,
   'BodyTemp': 0,
   'HeartRate': 0,
   'RiskLevel': 0}},
 'splits': {'train_rows': 322,
  'val_rows': 81,
  'test_rows': 81,
  'prod_rows': 324},
 'feature_store_groups': ['mhr-train-fg-20250920-160407',
  'mhr-val-fg-20250920-160407',
  'mhr-batch-fg-20250920-160407']}
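
A quick consistency check on the tracker is to confirm that the split sizes add up to the dataset row count and land near the intended 40/10/10/40 proportions. The sketch below uses only the numbers shown in the tracker output above; `tracker_excerpt` is an illustrative subset of that dictionary.

```python
# Subset of the tracker shown above.
tracker_excerpt = {
    "dataset": {"rows": 808},
    "splits": {"train_rows": 322, "val_rows": 81, "test_rows": 81, "prod_rows": 324},
}

rows = tracker_excerpt["dataset"]["rows"]
splits = tracker_excerpt["splits"]

# Every record should belong to exactly one split.
split_total = sum(splits.values())
# Percentage of the dataset in each split, rounded to one decimal place.
split_shares = {k: round(100 * v / rows, 1) for k, v in splits.items()}

print(split_total == rows, split_shares)
```

The shares come out at about 39.9%, 10.0%, 10.0% and 40.1%; the small deviations from the targets come from rounding row counts to whole numbers during the stratified splits.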