# PCOS Diagnosis Prediction: A Logistic Regression Analysis

## Research Question

**"Which clinical and lifestyle factors are most predictive of PCOS diagnosis, and do these effects vary across demographic groups (e.g., by ethnicity or socioeconomic status)?"**

---

## Dataset Overview

- **Source:** Global PCOS diagnostic data
- **Size:** 120,000 observations × 17 variables
- **Target Variable:** `Diagnosis` (Yes/No), with 10.5% positive cases (negative-to-positive class imbalance of about 8.5:1)

### Variable Categories

| Category | Variables |
|----------|-----------|
| **Target** | `Diagnosis` (Binary: Yes/No) |
| **Clinical Symptoms** | `Menstrual Regularity`, `Hirsutism`, `Acne Severity`, `BMI`, `Insulin Resistance` |
| **Medical History** | `Family History of PCOS` |
| **Lifestyle Factors** | `Lifestyle Score`, `Stress Levels` |
| **Demographics** | `Age`, `Ethnicity`, `Country` |
| **Socioeconomic** | `Socioeconomic Status`, `Urban/Rural` |
| **Awareness/Concerns** | `Awareness of PCOS`, `Fertility Concerns` |
| **Other** | `Undiagnosed PCOS Likelihood` (continuous probability score) |

---

## Analysis Pipeline

### 1. Data Preparation & Exploration
- Load and inspect data
- Handle missing values
- Encode categorical variables
- **Calculate and plot correlation matrix** to explore underlying relationships
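
The preparation steps above can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the project's actual loading code: the column names come from the variable table, but all values, the median imputation, and the one-hot encoding scheme are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for a few columns of the dataset (values are invented).
df = pd.DataFrame({
    "BMI": rng.normal(26, 4, n),
    "Age": rng.integers(18, 45, n).astype(float),
    "Insulin Resistance": rng.choice(["Yes", "No"], n),
    "Ethnicity": rng.choice(["A", "B", "C"], n),
    "Diagnosis": rng.choice(["Yes", "No"], n, p=[0.105, 0.895]),
})
df.loc[rng.choice(n, 20, replace=False), "BMI"] = np.nan  # inject missing values

# Handle missing values: median imputation for numeric columns (one simple option).
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Encode: binary Yes/No columns become 1/0, multi-level categoricals are one-hot encoded.
for col in ["Insulin Resistance", "Diagnosis"]:
    df[col] = (df[col] == "Yes").astype(int)
encoded = pd.get_dummies(df, columns=["Ethnicity"], drop_first=True, dtype=float)

# Pairwise Pearson correlations between all encoded columns.
corr = encoded.corr()
print(corr.round(2))
```

The resulting `corr` DataFrame can be passed to any heatmap routine for the plot.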

### 2. Homogeneous Model (Baseline)
- Fit logistic regression using core clinical predictors only
- Predictors: `BMI`, `Hirsutism`, `Acne Severity`, `Menstrual Regularity`, `Insulin Resistance`, `Family History of PCOS`
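- Check for multicollinearity by inspecting the fitted coefficients (implausible signs or inflated standard errors among correlated predictors)
- Confirm that the fit converges without errors or warnings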
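
As a rough sketch of what the baseline fit involves, the block below estimates a logistic regression by Newton-Raphson on synthetic data and reports coefficients, standard errors, and a convergence flag. The helper `fit_logit`, the three predictors used, and the "true" coefficients are all assumptions for illustration; a statistics library would normally do this step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-ins for three of the clinical predictors (values are invented).
bmi = rng.normal(0.0, 1.0, n)                   # standardised BMI
insulin = rng.integers(0, 2, n).astype(float)   # Insulin Resistance (0/1)
family = rng.integers(0, 2, n).astype(float)    # Family History of PCOS (0/1)
X = np.column_stack([np.ones(n), bmi, insulin, family])
true_beta = np.array([-3.0, 0.6, 1.0, 0.8])     # assumed, for illustration only
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

def fit_logit(X, y, max_iter=50, tol=1e-10):
    """Hypothetical helper: maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    converged = False
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # Fisher information
        step = np.linalg.solve(info, X.T @ (y - p))   # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            converged = True
            break
    # Standard errors from the inverse information at the final estimate.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, np.sqrt(np.diag(cov)), converged

beta_hat, se, converged = fit_logit(X, y)
for name, b, s in zip(["Intercept", "BMI", "Insulin Resistance", "Family History"], beta_hat, se):
    print(f"{name:20s} coef={b: .3f}  se={s:.3f}")
print("converged:", converged)
```

Standard errors that are large relative to the others, or a fit that fails to converge, are the kind of warning signs the checks above are meant to catch.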

### 3. Heterogeneous Model (Extended)
- Add demographic and socioeconomic predictors: `Age`, `Ethnicity`, `Socioeconomic Status`, `Urban/Rural`
- Include interaction terms (e.g., `Ethnicity × Insulin Resistance`)
- Compare complexity and fit against the homogeneous model
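
One way to build the interaction terms is to multiply the one-hot ethnicity columns by the insulin-resistance indicator. The sketch below does this on synthetic data; the ethnicity levels `A`/`B`/`C`, the data-generating process, and the column naming are invented for the example. scikit-learn penalises coefficients by default, so a large `C` is used here to approximate an unpenalised fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 6000

# Synthetic stand-in data (values and group labels are invented).
df = pd.DataFrame({
    "BMI": rng.normal(0, 1, n),
    "Insulin Resistance": rng.integers(0, 2, n),
    "Ethnicity": rng.choice(["A", "B", "C"], n),
})
# Assumed data-generating process: insulin resistance matters more in group B.
logit = (-2.2 + 0.5 * df["BMI"] + 0.8 * df["Insulin Resistance"]
         + 1.0 * df["Insulin Resistance"] * (df["Ethnicity"] == "B"))
y = rng.binomial(1, (1 / (1 + np.exp(-logit))).to_numpy())

# Main effects: one-hot encode Ethnicity with group A as the reference level.
X = pd.get_dummies(df, columns=["Ethnicity"], drop_first=True, dtype=float)

# Interaction terms: Ethnicity x Insulin Resistance.
for col in ["Ethnicity_B", "Ethnicity_C"]:
    X[f"{col} x IR"] = X[col] * X["Insulin Resistance"]

model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)  # large C: close to unpenalised
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs.round(2))
```

Each interaction coefficient is read as the change in the insulin-resistance log-odds effect for that group relative to the reference group.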

### 4. Coefficient Significance Assessment
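- Use simulations (e.g., resampling the data or drawing coefficient values from their estimated sampling distribution) to assess coefficient significance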
- Plot simulation results
- Interpret coefficients in terms of **odds ratios** (not log-odds)
- Discuss magnitude, direction, and statistical significance
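
The plan does not say which simulation scheme is intended. The sketch below uses a nonparametric bootstrap as one plausible reading: rows are resampled with replacement, the model is refitted, and percentile intervals are reported on the odds-ratio scale. The data, the coefficients, and the helper `fit_coefs` are assumptions. As a reminder of the conversion, a log-odds coefficient of 0.5 corresponds to an odds ratio of exp(0.5) ≈ 1.65.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000

# Synthetic data: columns are [standardised BMI, Insulin Resistance (0/1)].
X = np.column_stack([rng.normal(0, 1, n), rng.integers(0, 2, n)])
true_beta = np.array([0.5, 1.0])   # assumed, for illustration only
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + X @ true_beta))))

def fit_coefs(X, y):
    """Hypothetical helper: slope coefficients of a (nearly) unpenalised logistic fit."""
    return LogisticRegression(C=1e6, max_iter=1000).fit(X, y).coef_[0]

# Bootstrap: resample rows with replacement and refit.
n_boot = 200
boot = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, n)
    boot[b] = fit_coefs(X[idx], y[idx])

# Report on the odds-ratio scale: exponentiate coefficients and interval endpoints.
odds_ratios = np.exp(fit_coefs(X, y))
lo, hi = np.exp(np.percentile(boot, [2.5, 97.5], axis=0))
for name, o, l, h in zip(["BMI", "Insulin Resistance"], odds_ratios, lo, hi):
    print(f"{name}: OR={o:.2f} (95% interval {l:.2f} to {h:.2f})")
```

An interval that excludes 1 on the odds-ratio scale corresponds to a log-odds interval that excludes 0. The rows of `boot` can be plotted as histograms for the simulation plots.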

### 5. Model Comparison
- Compare the homogeneous and heterogeneous models using **AIC**
- Identify which model offers the better balance of fit and parsimony (lower AIC is preferred)
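
AIC is defined as 2k − 2 ln L, where k is the number of estimated parameters and L is the maximised likelihood. The sketch below computes it by hand for two nested models on synthetic data. The helper `aic`, the predictors, and the assumed age effect are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000

# Synthetic data in which Age genuinely affects the outcome (assumed for illustration).
bmi = rng.normal(0, 1, n)
insulin = rng.integers(0, 2, n).astype(float)
age = rng.normal(0, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.5 + 0.6 * bmi + 1.0 * insulin + 0.8 * age))))

X_hom = np.column_stack([bmi, insulin])        # clinical predictors only
X_het = np.column_stack([bmi, insulin, age])   # plus a demographic predictor

def aic(X, y):
    """Hypothetical helper: AIC = 2k - 2 ln L for a (nearly) unpenalised logistic fit."""
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    p = model.predict_proba(X)[:, 1]
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1                          # slopes plus intercept
    return 2 * k - 2 * loglik

aic_hom, aic_het = aic(X_hom, y), aic(X_het, y)
print(f"AIC homogeneous:   {aic_hom:.1f}")
print(f"AIC heterogeneous: {aic_het:.1f}")
```

Because the simulated outcome depends on age, the extended model attains the lower AIC here despite its extra parameter. On real data the ranking could go either way.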

### 6. Model Selection & Validation
- Perform model selection via **cross-validation** or **train-validation-test split**
- Use complexity-adjusted metrics (AIC/BIC) to guide selection
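
A minimal sketch of cross-validated model selection, assuming 5-fold stratified splits and log-loss as the selection score (neither choice is specified above). The data and the two candidate feature sets are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
n = 5000

# Synthetic data (assumed data-generating process, for illustration only).
bmi = rng.normal(0, 1, n)
insulin = rng.integers(0, 2, n).astype(float)
age = rng.normal(0, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.5 + 0.6 * bmi + 1.0 * insulin + 0.8 * age))))

candidates = {
    "homogeneous": np.column_stack([bmi, insulin]),
    "heterogeneous": np.column_stack([bmi, insulin, age]),
}

# Stratified folds keep the minority-class share similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_log_loss = {}
for name, Xc in candidates.items():
    scores = cross_val_score(
        LogisticRegression(C=1e6, max_iter=1000), Xc, y, cv=cv, scoring="neg_log_loss"
    )
    cv_log_loss[name] = -scores.mean()   # flip sign: lower log-loss is better

best = min(cv_log_loss, key=cv_log_loss.get)
print(cv_log_loss, "-> selected:", best)
```

Stratification matters with roughly 10% positives, since unstratified folds can end up with noticeably different class balances.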

### 7. Predicted Probabilities & Uncertainty
- Generate predicted probabilities for observations
- Plot **posterior predictive distributions**
- Visualise uncertainty using histograms or credible intervals
- Compare predicted probabilities against true class for specific observations
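
The sketch below shows one way to attach uncertainty to a single observation's predicted probability. It draws coefficient vectors from a normal approximation centred on the fitted estimate (an approximate posterior under a flat prior, which is an assumption of this example), then simulates replicated outcomes from those draws. Data and coefficients are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 3000

# Synthetic data: columns are [standardised BMI, Insulin Resistance (0/1)].
X = np.column_stack([rng.normal(0, 1, n), rng.integers(0, 2, n)])
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.6 * X[:, 0] + 1.0 * X[:, 1]))))

model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
beta_hat = np.concatenate([model.intercept_, model.coef_[0]])

# Approximate coefficient covariance: inverse Fisher information at the estimate.
X1 = np.column_stack([np.ones(n), X])
p_fit = model.predict_proba(X)[:, 1]
cov = np.linalg.inv(X1.T @ (X1 * (p_fit * (1 - p_fit))[:, None]))

# Simulate coefficient draws, then the implied probability for observation 0.
draws = rng.multivariate_normal(beta_hat, cov, size=1000)
x0 = X1[0]
p_draws = 1 / (1 + np.exp(-(draws @ x0)))

# Posterior predictive outcomes: one simulated diagnosis per coefficient draw.
y_rep = rng.binomial(1, p_draws)

p_hat = p_fit[0]
lo, hi = np.percentile(p_draws, [2.5, 97.5])
print(f"observation 0: true class={y[0]}, p_hat={p_hat:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
print(f"share of simulated positive outcomes: {y_rep.mean():.3f}")
```

A histogram of `p_draws` gives the uncertainty plot for that observation, and the printed line sets the prediction next to the true class.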

### 8. Generalisation Error
- Compute generalisation error using an appropriate metric (e.g., log-loss, Brier score, or AUC-ROC)
- Report and interpret model performance on held-out data
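
A minimal sketch of the held-out evaluation, assuming an 80/20 stratified split and synthetic data. All three metrics named above are computed, together with the log-loss of a constant base-rate prediction as a reference point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 5000

# Synthetic data: columns are [standardised BMI, Insulin Resistance (0/1)].
X = np.column_stack([rng.normal(0, 1, n), rng.integers(0, 2, n)])
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.5 + 0.8 * X[:, 0] + 1.2 * X[:, 1]))))

# Hold out 20% of the data, stratified on the outcome.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)
p_test = model.predict_proba(X_test)[:, 1]

metrics = {
    "log_loss": log_loss(y_test, p_test),
    "brier": brier_score_loss(y_test, p_test),
    "auc": roc_auc_score(y_test, p_test),
}

# Reference: always predict the training-set prevalence.
baseline_log_loss = log_loss(y_test, np.full(len(y_test), y_train.mean()))

print({k: round(v, 4) for k, v in metrics.items()})
print("baseline log-loss:", round(baseline_log_loss, 4))
```

With imbalanced classes, accuracy alone can look good for a model that never predicts a positive case, which is why probability-based metrics are used here.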

---

## Expected Outputs

1. Correlation matrix heatmap
2. Fitted homogeneous and heterogeneous logistic regression models
3. Coefficient significance plots (simulation-based)
4. Odds ratio interpretations with confidence intervals
5. AIC comparison table
6. Predicted probability distributions with uncertainty visualisation
7. Individual observation predictions vs. true class
8. Generalisation error metrics

---

## Key Assessment Criteria Addressed

| Criterion | Section |
|-----------|---------|
| Homogeneous & heterogeneous models fitted | Sections 2, 3 |
| Multicollinearity check | Section 2 |
| Simulation-based significance testing | Section 4 |
| Predicted probabilities & posterior predictive plots | Section 7 |
| Coefficient interpretation (odds, not log-odds) | Section 4 |
| AIC model comparison | Section 5 |
| Correlation matrix | Section 1 |
| Model selection (CV/train-test/AIC) | Section 6 |
| Predicted probabilities for observations | Section 7 |
| Comparison against true class | Section 7 |
| Generalisation error computation | Section 8 |