# Group 1 Final Project Work 

## Specific Data on Our Dataset

### Maternal Health Risk Dataset Summary

**Shape:** 808 records × 7 columns  

**Columns:**
- `Age`
- `SystolicBP` (Systolic Blood Pressure)
- `DiastolicBP` (Diastolic Blood Pressure)
- `BS` (Blood Sugar level)
- `BodyTemp` (Body Temperature, °F)
- `HeartRate` (Heart Rate, bpm)
- `RiskLevel` (Target: maternal health risk category)

---

#### First 5 Records
| Age | SystolicBP | DiastolicBP | BS   | BodyTemp | HeartRate | RiskLevel  |
|-----|------------|--------------|------|----------|-----------|------------|
| 25  | 130        | 80           | 15.0 | 98.0     | 86        | high risk  |
| 35  | 140        | 90           | 13.0 | 98.0     | 70        | high risk  |
| 29  | 90         | 70           | 8.0  | 100.0    | 80        | high risk  |
| 30  | 140        | 85           | 7.0  | 98.0     | 70        | high risk  |
| 35  | 120        | 60           | 6.1  | 98.0     | 76        | low risk   |

---

#### Summary Statistics
- **Age:** 10–70 years (mean = 30.6, std = 13.9)  
- **SystolicBP:** 70–160 mmHg (mean = 113, std = 19.9)  
- **DiastolicBP:** 49–100 mmHg (mean = 77.5, std = 14.8)  
- **BS:** 6–19 mmol/L (mean = 9.26, std = 3.62)  
- **BodyTemp:** 98–103 °F (mean = 98.6, std = 1.39)  
- **HeartRate:** 7–90 bpm (mean = 74.3, std = 8.82)  
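The ranges above include a HeartRate minimum of 7 bpm, which is far below any plausible resting heart rate and is likely a data-entry error. A minimal sketch of a range check that would surface such rows follows. The bounds in `PLAUSIBLE` and the helper `flag_implausible` are illustrative assumptions, not values from the dataset documentation, and the sample frame combines the first five records with one hypothetical row that mimics the 7 bpm outlier.

```python
import pandas as pd

# Assumed plausibility bounds (illustrative only; adjust with clinical input).
PLAUSIBLE = {
    "Age": (10, 70),
    "SystolicBP": (60, 200),
    "DiastolicBP": (40, 130),
    "BS": (3.0, 25.0),
    "BodyTemp": (95.0, 106.0),
    "HeartRate": (40, 200),
}

def flag_implausible(df: pd.DataFrame, bounds: dict = PLAUSIBLE) -> pd.DataFrame:
    """Return a boolean frame that is True where a value lies outside its bounds."""
    flags = pd.DataFrame(False, index=df.index, columns=list(bounds))
    for col, (low, high) in bounds.items():
        flags[col] = ~df[col].between(low, high)
    return flags

# First five records from the table above, plus one hypothetical outlier row.
sample = pd.DataFrame({
    "Age":         [25, 35, 29, 30, 35, 16],
    "SystolicBP":  [130, 140, 90, 140, 120, 120],
    "DiastolicBP": [80, 90, 70, 85, 60, 75],
    "BS":          [15.0, 13.0, 8.0, 7.0, 6.1, 7.9],
    "BodyTemp":    [98.0, 98.0, 100.0, 98.0, 98.0, 98.0],
    "HeartRate":   [86, 70, 80, 70, 76, 7],  # last value is the hypothetical outlier
})

flags = flag_implausible(sample)
bad_rows = sample[flags.any(axis=1)]
print(bad_rows)
```

Whether flagged rows should be dropped, clipped, or imputed is a modelling decision that this sketch does not make.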

---

#### Target Variable: RiskLevel
- **Low risk:** 478 records (~59.2%)  
- **High risk:** 330 records (~40.8%)  
- **Medium risk:** Not present in this dataset version  

**Note:** This version of the dataset is binary-labeled (low vs. high risk). A 3-class model (low/mid/high) cannot be trained on it as-is; that would require additional data, such as a dataset version that includes mid-risk records, rather than preprocessing alone.
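The percentages above follow directly from the counts (478 + 330 = 808). A minimal sketch that reproduces them and checks which labels a 3-class setup would be missing is shown here. The label Series is rebuilt from the reported counts rather than loaded from the file, and the label string `mid risk` in `EXPECTED_3CLASS` is an assumption about how a medium-risk class would be named.

```python
import pandas as pd

# Rebuild the label column from the reported counts (not loaded from the CSV).
labels = pd.Series(["low risk"] * 478 + ["high risk"] * 330, name="RiskLevel")

counts = labels.value_counts()
shares = (labels.value_counts(normalize=True) * 100).round(1)

# Assumed label names for a 3-class problem; "mid risk" is hypothetical here.
EXPECTED_3CLASS = {"low risk", "mid risk", "high risk"}
missing_labels = EXPECTED_3CLASS - set(labels.unique())

print(counts.to_dict())
print(shares.to_dict())
print("Missing for 3-class modelling:", sorted(missing_labels))
```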

### Week 3 - Training and Feature Engineering

#### EDA --> feature engineering --> stratified splits (40/10/10/40)
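The 40/10/10/40 proportions come from three chained stratified splits, so each later `test_size` is a fraction of what remains, not of the full dataset. The sketch below checks that arithmetic in isolation. It uses synthetic labels that match the reported class counts instead of the real file, and it splits row indices rather than feature frames.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels matching the reported class counts (478 low, 330 high).
y = np.array([0] * 478 + [1] * 330)
idx = np.arange(len(y))

# Step 1: hold out 40% of the full data as the production set.
idx_temp, idx_prod = train_test_split(
    idx, test_size=0.40, random_state=42, stratify=y)
# Step 2: 1/3 of the remaining 60% (20% of full) is set aside; 40% stays as train.
idx_train, idx_rem = train_test_split(
    idx_temp, test_size=1 / 3, random_state=42, stratify=y[idx_temp])
# Step 3: the 20% remainder is halved into validation and test (10% each).
idx_val, idx_test = train_test_split(
    idx_rem, test_size=0.5, random_state=42, stratify=y[idx_rem])

parts = {"train": idx_train, "val": idx_val, "test": idx_test, "production": idx_prod}
for name, part in parts.items():
    share = len(part) / len(y)
    high_rate = y[part].mean()
    print(f"{name:<10} n={len(part):>3}  share={share:.3f}  high-risk rate={high_rate:.3f}")
```

Because of rounding, the shares land close to, but not exactly on, 0.40/0.10/0.10/0.40, and stratification keeps the high-risk rate near the overall 40.8% in every part.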

In [3]:
# Imports
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

DATA_PATH = Path("Maternal_Risk.csv") 
OUT_DIR = Path("./week3_outputs"); OUT_DIR.mkdir(parents=True, exist_ok=True)

# Load
df = pd.read_csv(DATA_PATH)
assert 'RiskLevel' in df.columns, "Expected target column 'RiskLevel'"

# Basic EDA (shape, dtypes, nulls, class balance)
eda = {
    "rows": int(df.shape[0]),
    "cols": int(df.shape[1]),
    "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    "missing": df.isna().sum().to_dict(),
    "class_counts": df['RiskLevel'].value_counts().to_dict(),
}
print(pd.Series(eda["class_counts"], name="class_counts"))  # a bare expression mid-cell is not displayed

# Quick EDA charts (matplotlib defaults; no custom colors)
df['RiskLevel'].value_counts().plot(kind='bar'); plt.title("Class Distribution"); plt.tight_layout()
plt.savefig(OUT_DIR/"chart_class_distribution.png", dpi=150); plt.clf()

df['Age'].plot(kind='hist', bins=20); plt.title("Age Distribution"); plt.tight_layout()
plt.savefig(OUT_DIR/"chart_age_hist.png", dpi=150); plt.clf()

plt.boxplot([df['SystolicBP'], df['DiastolicBP']])
plt.xticks([1, 2], ['SystolicBP', 'DiastolicBP'])  # replaces the `labels=` keyword, deprecated in Matplotlib 3.9
plt.title("Blood Pressure Boxplots"); plt.tight_layout()
plt.savefig(OUT_DIR/"chart_bp_box.png", dpi=150); plt.clf()

corr = df.select_dtypes(include=np.number).corr()
plt.imshow(corr, interpolation='nearest'); plt.xticks(range(len(corr)), corr.columns, rotation=45, ha='right')
plt.yticks(range(len(corr)), corr.columns); plt.colorbar(); plt.title("Correlation Heatmap"); plt.tight_layout()
plt.savefig(OUT_DIR/"chart_corr_heatmap.png", dpi=150); plt.clf()

# Feature engineering (keep raw + engineered)
X = df.copy()
X['PulsePressure']  = X['SystolicBP'] - X['DiastolicBP']
X['SBP_to_DBP']     = X['SystolicBP'] / (X['DiastolicBP'].replace(0, np.nan))
X['Fever']          = (X['BodyTemp'] > 99.5).astype(int)
X['Tachycardia']    = (X['HeartRate'] >= 100).astype(int)
X['HypertensionFlag']= ((X['SystolicBP'] >= 140) | (X['DiastolicBP'] >= 90)).astype(int)

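# Continuous features to z-scale; the scaler is fitted after the split, on training rows only
cont = ['Age','SystolicBP','DiastolicBP','BS','BodyTemp','HeartRate','PulsePressure','SBP_to_DBP']

# Encode label (binary in this file); fall back to category codes for any other label set
label_map = {'low risk':0, 'high risk':1}
if set(df['RiskLevel'].unique()) == set(label_map):
    y = X['RiskLevel'].map(label_map)
else:
    cats = X['RiskLevel'].astype('category')
    label_map = {name: code for code, name in enumerate(cats.cat.categories)}
    y = cats.cat.codes
X = X.drop(columns=['RiskLevel'])
pd.Series(label_map).to_json(OUT_DIR/"label_map.json")  # saved map always matches the encoding used

# Stratified splits: prod 40%, then remaining 60% -> train 40% (of full),
#    val 10%, test 10% (of full)
X_temp, X_prod, y_temp, y_prod = train_test_split(X, y, test_size=0.40, random_state=42, stratify=y)
X_train, X_rem, y_train, y_rem  = train_test_split(X_temp, y_temp, test_size=1/3, random_state=42, stratify=y_temp)
X_val, X_test, y_val, y_test    = train_test_split(X_rem, y_rem, test_size=0.5, random_state=42, stratify=y_rem)

# Optional z-scaling: fit on the training split only, so that val/test/production
# statistics do not leak into the scaler, then apply the same transform to every split
scaler = StandardScaler().fit(X_train[cont])
def _add_z(part):
    part = part.copy()
    part[[f"z_{c}" for c in cont]] = scaler.transform(part[cont])
    return part
X_train, X_val, X_test, X_prod = map(_add_z, (X_train, X_val, X_test, X_prod))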

def _save(name, Xd, yd):
    out = Xd.copy()
    out['label'] = yd.values
    out.to_csv(OUT_DIR/f"{name}.csv", index=False)
    return out

train = _save("train", X_train, y_train)
val   = _save("val",   X_val,   y_val)
test  = _save("test",  X_test,  y_test)
prod  = _save("production", X_prod, y_prod)

# Save the full engineered dataset for reference + tracker
df_engineered = pd.concat([X, y.rename('label')], axis=1)
df_engineered.to_csv(OUT_DIR/"maternal_features_full.csv", index=False)

import json
with open(OUT_DIR/"eda_summary.json", "w") as f:
    json.dump(eda, f, indent=2)

print("Done. Files written to:", OUT_DIR)


Done. Files written to: week3_outputs
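As a worked check of the engineered features computed in the cell above, the sketch below applies the same formulas to the first five records from the dataset summary. The helper name `add_engineered_features` is hypothetical; the thresholds (99.5 °F, 100 bpm, 140/90 mmHg) are copied from the cell.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's derived-feature formulas to a copy of df."""
    out = df.copy()
    out["PulsePressure"] = out["SystolicBP"] - out["DiastolicBP"]
    # Guard against division by zero by turning a 0 diastolic value into NaN.
    out["SBP_to_DBP"] = out["SystolicBP"] / out["DiastolicBP"].replace(0, np.nan)
    out["Fever"] = (out["BodyTemp"] > 99.5).astype(int)
    out["Tachycardia"] = (out["HeartRate"] >= 100).astype(int)
    out["HypertensionFlag"] = (
        (out["SystolicBP"] >= 140) | (out["DiastolicBP"] >= 90)
    ).astype(int)
    return out

# First five records from the summary table.
first_five = pd.DataFrame({
    "Age":         [25, 35, 29, 30, 35],
    "SystolicBP":  [130, 140, 90, 140, 120],
    "DiastolicBP": [80, 90, 70, 85, 60],
    "BS":          [15.0, 13.0, 8.0, 7.0, 6.1],
    "BodyTemp":    [98.0, 98.0, 100.0, 98.0, 98.0],
    "HeartRate":   [86, 70, 80, 70, 76],
})

engineered = add_engineered_features(first_five)
print(engineered[["PulsePressure", "SBP_to_DBP", "Fever", "Tachycardia", "HypertensionFlag"]])
```

In these five rows only the third record (100.0 °F) sets `Fever`, the second and fourth set `HypertensionFlag`, and none reach the tachycardia threshold.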



#### Upload splits to S3 (so Feature Store can ingest)

In [5]:
# In our Studio environment: set bucket/prefix, then run this cell
import boto3
bucket = "<OUR_BUCKET>"  # To be set later
prefix = "aai540/maternal_risk/week3"
s3 = boto3.client("s3")
for fname in ["train.csv","val.csv","test.csv","production.csv","maternal_features_full.csv","label_map.json","eda_summary.json"]:
    s3.upload_file(f"./week3_outputs/{fname}", bucket, f"{prefix}/{fname}")
    print(f"Uploaded s3://{bucket}/{prefix}/{fname}")
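A missing local file would stop the upload loop partway through, leaving the prefix half populated, so it can help to verify the artifacts and build the object keys before any S3 call. The sketch below covers only that local step. `plan_uploads` is a hypothetical helper, and the demo writes empty placeholder files to a temporary directory instead of touching the real outputs or any bucket.

```python
from pathlib import Path
import tempfile

FILES = [
    "train.csv", "val.csv", "test.csv", "production.csv",
    "maternal_features_full.csv", "label_map.json", "eda_summary.json",
]

def plan_uploads(out_dir, prefix, files=FILES):
    """Return (local_path, s3_key) pairs; raise if any local file is missing."""
    out_dir = Path(out_dir)
    missing = [f for f in files if not (out_dir / f).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {out_dir}: {missing}")
    clean_prefix = prefix.rstrip("/")
    return [(out_dir / f, f"{clean_prefix}/{f}") for f in files]

# Demo with placeholder files in a temporary directory (no AWS calls are made).
with tempfile.TemporaryDirectory() as tmp:
    for f in FILES:
        (Path(tmp) / f).touch()
    plan = plan_uploads(tmp, "aai540/maternal_risk/week3")
    keys = [key for _, key in plan]

print(keys)
```

Each pair can then be passed to the existing call as `s3.upload_file(str(local_path), bucket, key)`.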

#### Initialize Feature Store & design feature groups (train/val/batch)

In [None]:
# Minimal excerpt (full file is provided above)
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

# Create three offline-only feature groups and ingest the split CSVs
fg_train = FeatureGroup(name="mhr_train_fg", sagemaker_session=Session())
# fg_train.load_feature_definitions(data_frame=train_df)  # train_df: the training split loaded as a DataFrame
# fg_train.create(...)
# fg_train.ingest(...)

# Repeat for mhr_val_fg and mhr_batch_fg (production)
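Feature Store requires every record to carry a unique record identifier and an event-time feature, and the split CSVs written earlier contain neither. A minimal sketch of preparing a split for ingestion follows. The column names `record_id` and `event_time` and the helper `prepare_for_feature_store` are assumptions chosen for illustration, and the two-row frame is a stand-in for `train.csv`.

```python
import time
import pandas as pd

def prepare_for_feature_store(df: pd.DataFrame, event_time: float = None) -> pd.DataFrame:
    """Return a copy of df with an integer record id and a Unix-seconds event time."""
    out = df.reset_index(drop=True).copy()
    # Integer id, unique within this split (assumed column name).
    out.insert(0, "record_id", list(range(len(out))))
    # One shared timestamp for the whole batch, stored as a float (assumed column name).
    stamp = time.time() if event_time is None else event_time
    out["event_time"] = float(stamp)
    return out

# Stand-in for the training split; the real frame would come from train.csv.
train_stub = pd.DataFrame({
    "Age": [25, 35],
    "SystolicBP": [130, 140],
    "DiastolicBP": [80, 90],
    "label": [1, 1],
})

prepared = prepare_for_feature_store(train_stub, event_time=1_700_000_000.0)
print(prepared.dtypes)
```

With a frame like this, `load_feature_definitions(data_frame=prepared)` can infer the feature types, and `create` would then be given `record_identifier_name="record_id"`, `event_time_feature_name="event_time"`, and `enable_online_store=False` for an offline-only group, along with an S3 URI and role ARN that this document does not specify.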