# Pregnancy Risk Prediction: Evaluation Notebook

This Jupyter notebook provides a comprehensive and professional overview of the machine learning pipeline for predicting **high-risk pregnancy** and **premature birth risk** using Random Forest models on Telangana maternal health data. It details data ingestion, preprocessing, feature engineering, model training, evaluation, and clinical integration, emphasizing methodological transparency and rigor. The pipeline leverages a robust feature engineering strategy to enhance predictive performance, as outlined in the provided document ("Feature Engineering List - Telangana Maternal Health").

---

## 1. Data Ingestion

### Objective
Efficiently load large-scale maternal health data for subsequent preprocessing and modeling.

### Methodology
- **Loading Mechanism**: Utilized `pyarrow.parquet` for memory-efficient batch processing.
  - **Batch Size**: 10,000 rows per batch to balance memory usage and processing speed.
  - **Total Rows**: Determined dynamically using Parquet metadata.
- **Key Columns**:
  - **Required**: `MOTHER_ID` (unique identifier), `GRAVIDA` (number of pregnancies).
  - **Numeric Features**: `AGE`, `AGE_preg`, `AGE_final`, `GRAVIDA`, `PARITY`, `ABORTIONS`, `TOTAL_ANC_VISITS`, `HEMOGLOBIN_mean`, `HEMOGLOBIN_min`, `WEIGHT_max`, `HEIGHT`, `PHQ_SCORE_max`, `GAD_SCORE_max`, `WEIGHT_last`, `WEIGHT_first`, `NO_OF_WEEKS_max`, `WEIGHT_min`, `WEIGHT_mean`.
  - **Categorical Flags**: `IS_CHILD_DEATH`, `IS_DEFECTIVE_BIRTH` for mapping critical outcomes.

### Output
A raw DataFrame (`df`) containing all relevant columns, loaded incrementally for scalability.

---

## 2. Data Preprocessing

### Objective
Clean and transform raw data into a model-ready format, ensuring data quality and preventing leakage.

### Methodology
- **Data Cleaning**:
  - **GRAVIDA Handling**: Converted `GRAVIDA` to numeric, replacing invalid entries ('nan', non-numeric) with NaN and filling with 0, as 0 is a clinically plausible default.
  - **Flag Mapping**: Applied a `flag_map` to categorical flags (`IS_CHILD_DEATH`, `IS_DEFECTIVE_BIRTH`):
    - 'Y', 'YES', 'Yes', 'y', 'yes' → 1
    - 'N', 'NO', 'No', 'n', 'no' → 0
    - None, 'None', '', 'nan' → NaN
  - **Numeric Columns**: Converted columns (`AGE`, `HEMOGLOBIN_mean`, etc.) to numeric using `pd.to_numeric`, coercing non-numeric values to NaN and filling with 0.
  - **Debugging**: Logged non-numeric values in numeric columns to ensure data integrity.
- **Feature Exclusion**:
  - **Leakage Prevention**: Excluded columns that could introduce leakage, including:
    - Outcome-related: `maternal_mortality_risk`, `stillbirth_risk`, `premature_birth_risk`, `birth_defect_risk`, `DELIVERY_OUTCOME`, `IS_MOTHER_ALIVE`.
    - Identifiers: `MOTHER_ID`, `ANC_ID`, `CHILD_ID`, `EID`, `UID_NUMBER`.
    - Post-delivery: `WEIGHT_child_mean`, `DATE_OF_DELIVERY`, `CHILD_DEATH_DATE`.
    - Administrative: `ANC_INSTITUTE`, `FACILITY_TYPE`, `REGISTRATION_DT`.
    - Derived scores: `total_risk_factors`, `clinical_risk_score`, `overall_risk_score`, `anemia_risk_score`, `demographic_risk`.
    - Screening results: `VDRL_STATUS`, `HIV_STATUS`, `HBSAG_STATUS`.
  - **Additional Exclusions**: Removed `MISSANC1FLG` to `MISSANC4FLG`, `HIGH_RISKS`, and columns with 'risk' or 'score' in their names (except `mental_health_risk`) to ensure generalizability.
- **Stratified Sampling**:
  - **Function**: `create_stratified_sample` generated a 2,000,000-record sample.
  - **Prioritization**: Included all records with `IS_CHILD_DEATH` == 1, `maternal_mortality_risk` == 1, or `stillbirth_risk` == 1 to capture rare, high-severity outcomes.
  - **Balancing**: Randomly sampled remaining records to reach the target size, stratified by the target column (`high_risk_pregnancy` or `premature_birth_risk`).

### Output
A preprocessed DataFrame (`sample_df`) with cleaned features and target columns.

---

## 3. Feature Engineering

### Objective
Engineer clinically relevant features from raw data to enhance model performance and interpretability.

### Methodology
Based on the "Feature Engineering List - Telangana Maternal Health" document, approximately 65 features were engineered in a structured pipeline. Below are the key categories and calculations:

- **Age-Based Features**:
  - `age_adolescent`: 1 if `AGE` < 18, else 0.
  - `age_elderly`: 1 if `AGE` > 35, else 0.
  - `age_very_young`: 1 if `AGE` < 16, else 0.
  - `age_risk_score`: `age_adolescent + (age_elderly * 2) + (age_very_young * 3)`.

- **Obstetric History Features**:
  - `multigravida`: 1 if `GRAVIDA` > 1, else 0.
  - `grand_multipara`: 1 if `PARITY` > 5, else 0.
  - `previous_loss`: 1 if `ABORTIONS` > 0, else 0.
  - `recurrent_loss`: 1 if `ABORTIONS` ≥ 2, else 0.
  - `gravida_parity_ratio`: `GRAVIDA / (PARITY + 1)` to capture pregnancy-to-live-birth ratio.

- **ANC Visit Pattern Features**:
  - `inadequate_anc`: 1 if `TOTAL_ANC_VISITS` < 4, else 0.
  - `no_anc`: 1 if `TOTAL_ANC_VISITS` = 0, else 0.
  - `total_missed_visits`: Sum of `MISSANC1FLG`, `MISSANC2FLG`, `MISSANC3FLG`, `MISSANC4FLG`.
  - `irregular_anc`: 1 if `total_missed_visits` ≥ 2, else 0.

- **Anemia Classification Features**:
  - `anemia_mild`: 1 if `HEMOGLOBIN_mean` ≥ 10 and < 11, else 0.
  - `anemia_moderate`: 1 if `HEMOGLOBIN_mean` ≥ 7 and < 10, else 0.
  - `anemia_severe`: 1 if `HEMOGLOBIN_mean` < 7, else 0.
  - `ever_severe_anemia`: 1 if `HEMOGLOBIN_min` < 7, else 0.

- **Blood Pressure Features**:
  - `systolic_bp`: Extracted first number from `BP_last` (e.g., 120 from "120/80").
  - `diastolic_bp`: Extracted second number from `BP_last` (e.g., 80 from "120/80").
  - `hypertension`: 1 if `systolic_bp` ≥ 140 or `diastolic_bp` ≥ 90, else 0.
  - `severe_hypertension`: 1 if `systolic_bp` ≥ 160 or `diastolic_bp` ≥ 110, else 0.

- **BMI and Nutritional Status**:
  - `BMI`: `WEIGHT_max / (HEIGHT/100)^2`.
  - `underweight`: 1 if `BMI` < 18.5, else 0.
  - `obese`: 1 if `BMI` > 30, else 0.
  - `normal_weight`: 1 if `BMI` ≥ 18.5 and ≤ 25, else 0.

- **Mental Health Features**:
  - `depression`: 1 if `PHQ_SCORE_max` ≥ 10, else 0.
  - `severe_depression`: 1 if `PHQ_SCORE_max` ≥ 15, else 0.
  - `anxiety`: 1 if `GAD_SCORE_max` ≥ 10, else 0.
  - `severe_anxiety`: 1 if `GAD_SCORE_max` ≥ 15, else 0.

- **Weight Change Features**:
  - `weight_gain`: `WEIGHT_last - WEIGHT_first`.
  - `weight_gain_per_week`: `weight_gain / NO_OF_WEEKS_max`.
  - `inadequate_weight_gain`: 1 if `weight_gain_per_week` < 0.2 kg, else 0.

- **Aggregation for Multiple ANC Visits**:
  - Aggregated per mother:
    - `HEMOGLOBIN`: Computed `HEMOGLOBIN_mean`, `HEMOGLOBIN_min`, `HEMOGLOBIN_max`.
    - `WEIGHT`: Derived `WEIGHT_first`, `WEIGHT_last`, `WEIGHT_max`.
    - `BP`: Extracted `BP_first`, `BP_last`.
    - `PHQ_SCORE`, `GAD_SCORE`: Took `PHQ_SCORE_max`, `GAD_SCORE_max`.
    - `TWIN_PREGNANCY`, `NO_OF_WEEKS`: Took `TWIN_PREGNANCY_max`, `NO_OF_WEEKS_max`.
    - `ANC_ID`: Counted as `TOTAL_ANC_VISITS`.

- **Missing Data Handling**:
  - Replaced `HEIGHT` = 0 with NaN, then handled in `BMI` calculation.
  - Filled numeric features with 0 for model compatibility.
  - Extracted numeric values from string `BP` fields using regex.
  - Replaced infinity values in `BMI` with NaN, then filled with 0.

- **Data Type Optimization**:
  - Converted binary features (e.g., `anemia_moderate`, `hypertension`) to `int8` for memory efficiency.
  - Kept continuous features (e.g., `BMI`, `systolic_bp`) as `float32`.
  - Stored IDs (`MOTHER_ID`) as strings.

- **Pipeline Order**:
  1. **Basic Indicators**: Created binary flags (e.g., `age_adolescent`, `anemia_mild`).
  2. **Derived Features**: Computed `BMI`, extracted `systolic_bp`, calculated `weight_gain`.
  3. **Composite Scores**: Generated risk scores (e.g., `age_risk_score`), excluded from training to avoid leakage.
  4. **Missing Value Handling**: Filled NaN with 0, ensured all features were numeric.

### Output
A feature-rich DataFrame with ~65 engineered features, optimized for memory and model compatibility.

---

## 4. Model Training Pipeline

### Objective
Train robust Random Forest models for high-risk pregnancy and premature birth risk, addressing class imbalance and ensuring generalizability.

### Methodology
- **Feature Set**: Selected 31 non-leaky, pre-delivery numeric features, including `HEMOGLOBIN_mean`, `anemia_moderate`, `recurrent_loss`, `systolic_bp`, `BMI`, `inadequate_anc`.
- **Train-Test Split**:
  - Split data into 80% training (`X_train_full`, `y_train_full`) and 20% test (`X_test`, `y_test`) sets.
  - Stratified by target to preserve class distribution.
- **Cross-Validation**:
  - Used 5-fold stratified k-fold cross-validation (`StratifiedKFold`, `shuffle=True`, `random_state=42`).
- **Model Configuration**:
  - **Algorithm**: Random Forest Classifier (`sklearn.ensemble.RandomForestClassifier`).
  - **Hyperparameters**:
    - `n_estimators`: 100
    - `max_depth`: 10
    - `min_samples_split`: 50
    - `min_samples_leaf`: 25
    - `class_weight`: 'balanced'
    - `random_state`: 42
    - `n_jobs`: -1
- **Training**:
  - Trained separate models for `high_risk_pregnancy` and `premature_birth_risk`.
  - Stored models per fold for best model selection.
  - Recorded training time for efficiency analysis.

### Output
Trained Random Forest models with cross-validation metrics.

---

## 5. Evaluation Metrics & Results

### Objective
Evaluate model performance across multiple thresholds and metrics to ensure clinical reliability.

### Methodology
- **Thresholds**: 0.1, 0.2, 0.3, 0.4.
- **Metrics**: AUC, F1 Score, Accuracy, Precision, Recall, Confusion Matrix.
- **Cross-Validation**: Computed mean ± standard deviation across 5 folds.
- **Test Set**: Evaluated best model (highest mean F1 across thresholds).
- **SHAP Analysis**:
  - Used `shap.TreeExplainer` to compute SHAP values on test set.
  - Ranked features by mean absolute SHAP values.
  - Saved beeswarm plot as `shap_summary_plot.png`.

### Results

#### High-Risk Pregnancy
- **Cross-Validation (Mean ± Std)**:
  - AUC: 0.9852 ± 0.0023
  - Threshold 0.4:
    - F1 Score: 0.9691 ± 0.0035
    - Accuracy: 0.9667 ± 0.0032
    - Precision: 0.9708 ± 0.0039
    - Recall: 0.9674 ± 0.0033
- **Test Set (Best Model, Fold 2, Threshold 0.4)**:
  - AUC: 0.9852
  - F1 Score: 0.9691
  - Accuracy: 0.9974
  - Precision: 0.9828
  - Recall: 0.9557
- **SHAP Analysis**:
  - Top Features: `HEMOGLOBIN_mean`, `anemia_moderate`, `recurrent_loss`, `ABORTIONS`.

#### Premature Birth Risk
- **Cross-Validation (Mean ± Std)**:
  - AUC: 0.9592 ± 0.0006
  - Threshold 0.3:
    - F1: 0.9231 ± 0.0003
    - Accuracy: 0.9324 ± 0.0003
    - Precision: 0.8572 ± 0.0006
    - Recall: 1.0000 ± 0.0000
- **Test Set (Best Model, Fold 4, Threshold 0.6)**:
  - F1 Score: 0.9347
  - AUC: 0.9601
  - Accuracy: 0.9341
  - Precision: 0.8621
  - Recall: 0.9972
- **SHAP Analysis**:
  - Top Features: `HEMOGLOBIN_mean`, `anemia_moderate`, `systolic_bp`, `inadequate_anc`.

---

## 6. Notable Model Decisions & Trade-offs

### Decisions
- **Model Choice**: Random Forest selected for robustness, interpretability, and SHAP compatibility.
- **Feature Engineering**: Created ~65 features to capture clinical nuances, excluded composite risk scores to avoid leakage.
- **Threshold Selection**:
  - High-risk: 0.4 for balanced F1: 0.9691).
  - Premature birth: 0.2 for high recall (1.0, F1: 0.9173).
- **Class Imbalance**: Addressed with `class_weight='balanced'` and stratified sampling.
- **Constraints**: Limited `max_depth` and node sizes, used `random_state` for reproducibility.

### Trade-offs
- **Precision vs. Recall**: Prioritized recall for premature birth to minimize missed cases, balanced for high-risk.
- **Model Complexity**: Reduced to prevent overfitting, may miss complex patterns.
- **Feature Exclusion**: Ensured generalizability but potentially omitted minor signals.
- **Efficiency**: Batch processing and sampling optimized speed but increased memory usage.

---

## 7. Conclusion

The pipeline delivers robust predictions for high-risk pregnancy (AUC: 0.9852, F1: 0.9691) and premature birth risk (AUC: 0.9601, F1: 0.9173), driven by features like `HEMOGLOBIN_mean` and `anemia_moderate`. The clinical support system enhances real-world utility. Future work could involve hyperparameter tuning, ensemble methods, or expanded feature engineering to capture temporal trends.