# 03b – Merge COPD cohort with RAAS exposures
## ACEi / ARB / RAAS inhibitor Exposure Definition and Validation

---
## 0. Overview
This notebook merges the baseline COPD ICU cohort with detailed pre-ICU RAAS inhibitor exposure information (ACE inhibitors and ARBs).  
The resulting dataset enables subclass-specific and combined RAAS analyses used in downstream
survival models ([04a](04a_outcomes_and_modeling.ipynb), [04b](04b_outcomes_and_modeling_raas_subgroups.ipynb), and [04c](04c_extended_covariate_cox_model.ipynb)).

---

## 1. Purpose
The objectives of this notebook are to:

- Integrate **pre-ICU ACEi and ARB exposure flags** into the baseline COPD ICU cohort
- Define mutually exclusive **RAAS inhibitor exposure groups**
- Validate exposure completeness and internal consistency
- Produce a standardized analysis-ready table for downstream modeling

---

## 2. Data Sources
- **MIMIC-IV v3.1** (BigQuery public dataset)
- Project: `mimic-iv-portfolio`
- Working datasets:
  - `copd_raas.cohort_copd`<br>
    (created in 02 using [02_build_cohort_copd.sql](../sql/02_build_cohort_copd.sql))
  - `copd_raas.cohort_copd_raas_detailed`<br>
    (created in 02 using [02_exposure_raas_detailed.sql](../sql/02_exposure_raas_detailed.sql))

---

## 3. Exposure Definitions
RAAS exposure is defined strictly as **medication use before or at ICU admission**, preserving temporal ordering and avoiding reverse causation.

Derived variables include:
- `acei_pre_icu` (0/1)
- `arb_pre_icu` (0/1)
- `raas_any_pre_icu` (ACEi or ARB)

A four-category exposure variable is constructed:

| Category    | Definition                |
|-------------|---------------------------|
| acei_only   | ACEi = 1, ARB = 0         |
| arb_only    | ACEi = 0, ARB = 1         |
| both        | ACEi = 1, ARB = 1         |
| neither     | ACEi = 0, ARB = 0         |
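
The mapping in the table above can be sketched in Python as follows. This is a minimal illustration only: in this project the variable is derived in SQL, and the helper functions below are hypothetical names that merely reuse the column names for readability.

```python
def exposure_group_4cat(acei_pre_icu: int, arb_pre_icu: int) -> str:
    """Map two 0/1 exposure flags to one of four mutually exclusive labels."""
    if acei_pre_icu not in (0, 1) or arb_pre_icu not in (0, 1):
        raise ValueError("exposure flags must be 0 or 1")
    if acei_pre_icu and arb_pre_icu:
        return "both"
    if acei_pre_icu:
        return "acei_only"
    if arb_pre_icu:
        return "arb_only"
    return "neither"


def raas_any_pre_icu(acei_pre_icu: int, arb_pre_icu: int) -> int:
    """Return 1 if either flag is set, else 0 (ACEi or ARB)."""
    return int(bool(acei_pre_icu or arb_pre_icu))


# Enumerate all four flag combinations
for acei in (0, 1):
    for arb in (0, 1):
        print(acei, arb, exposure_group_4cat(acei, arb), raas_any_pre_icu(acei, arb))
```

Because every combination of the two flags maps to exactly one label, the categories are exhaustive as well as mutually exclusive.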

---

## 4. Cohort Merge (BigQuery SQL)
The cohort merge is performed entirely in BigQuery to ensure:
- Reproducibility
- Consistency across notebooks
- Minimal reliance on in-memory post-processing

The merge operation LEFT JOINs the baseline COPD ICU cohort with detailed RAAS exposure flags
using `(subject_id, hadm_id, stay_id)` as composite keys.

The resulting table is saved as:
- **Table:** `copd_raas.cohort_copd_with_raas`
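
To make the join semantics concrete, here is a small pure-Python sketch of a LEFT JOIN on the composite key. It is not the contents of the SQL script: the function name and toy rows are made up for illustration, and treating unmatched stays as unexposed (0) is an assumption about how the SQL handles missing matches.

```python
KEY_COLS = ("subject_id", "hadm_id", "stay_id")
FLAG_COLS = ("acei_pre_icu", "arb_pre_icu")


def left_join_exposures(cohort_rows, exposure_rows):
    """Left-join exposure flags onto cohort rows using the composite key."""
    exposures_by_key = {}
    for row in exposure_rows:
        key = tuple(row[c] for c in KEY_COLS)
        # A duplicated key on the right side would inflate the row count.
        if key in exposures_by_key:
            raise ValueError(f"duplicate exposure key: {key}")
        exposures_by_key[key] = row

    merged = []
    for row in cohort_rows:
        key = tuple(row[c] for c in KEY_COLS)
        match = exposures_by_key.get(key, {})
        out = dict(row)
        for col in FLAG_COLS:
            # Assumption: stays without an exposure record count as unexposed.
            out[col] = match.get(col, 0)
        merged.append(out)
    return merged


# Toy data (made up, not from MIMIC-IV)
cohort = [
    {"subject_id": 1, "hadm_id": 10, "stay_id": 100},
    {"subject_id": 2, "hadm_id": 20, "stay_id": 200},
]
exposures = [
    {"subject_id": 1, "hadm_id": 10, "stay_id": 100,
     "acei_pre_icu": 1, "arb_pre_icu": 0},
]
merged = left_join_exposures(cohort, exposures)
print(len(cohort), len(merged))
```

A left join keeps every cohort row, so the output has the same number of rows as the cohort as long as the exposure table has at most one row per key.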

---

## 5. Data Preparation and Sanity Checks
### 5.1 Connect to BigQuery


In [8]:
# Use Application Default Credentials (my user account)
# This account already has PhysioNet BigQuery access.

from google.cloud import bigquery
from google.auth import default
from pathlib import Path

# Define project ID
PROJECT_ID = "mimic-iv-portfolio"

# Get ADC credentials
creds, adc_project = default()

# Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID, credentials=creds)

print("Connected to BigQuery project:", PROJECT_ID)
print("ADC default project:", adc_project)

# Helper to run a SQL script file (DDL, CREATE TABLE, etc.)
def run_sql_script(path):
    """
    Read a .sql file from disk, execute it in BigQuery,
    and wait until the job finishes.
    Use this for CREATE TABLE / INSERT INTO scripts.
    """
    sql_path = Path(path)
    with sql_path.open("r") as f:
        query = f.read()
    job = client.query(query)
    job.result()
    print(f"Executed SQL script: {sql_path.name}")

# Helper for SELECT queries → DataFrame
def query_to_df(query):
    """
    Run a SELECT query in BigQuery and return a pandas DataFrame.
    """
    job = client.query(query)
    return job.to_dataframe(create_bqstorage_client=False)

Connected to BigQuery project: mimic-iv-portfolio
ADC default project: mimic-iv-portfolio


### 5.2 Execute Merge SQL

This step constructs the standardized COPD ICU cohort table with detailed pre-ICU RAAS inhibitor exposure information, which serves as the common input for downstream survival and subgroup analyses ([04a](04a_outcomes_and_modeling.ipynb), [04b](04b_outcomes_and_modeling_raas_subgroups.ipynb), and [04c](04c_extended_covariate_cox_model.ipynb)).

- SQL script: [03_merge_exposures.sql](../sql/03_merge_exposures.sql)
- Output table: `mimic-iv-portfolio.copd_raas.cohort_copd_with_raas`

The SQL script starts from the predefined COPD ICU cohort table `copd_raas.cohort_copd` and enriches each ICU stay with detailed RAAS inhibitor exposure variables by left-joining the exposure table `copd_raas.cohort_copd_raas_detailed` using standardized ICU stay identifiers (`subject_id`, `hadm_id`, `stay_id`).

The merged exposure variables include binary indicators for angiotensin-converting enzyme inhibitor (ACEi) use and angiotensin receptor blocker (ARB) use prior to or at ICU admission, as well as composite RAAS exposure definitions. These include any RAAS inhibitor exposure, dual ACEi/ARB exposure, mutually exclusive exposure
categories, and monotherapy indicators to support subgroup and sensitivity analyses.

All exposure variables are defined strictly before or at ICU admission to preserve temporal ordering and minimize the risk of reverse causation.

The resulting table is created using CREATE OR REPLACE TABLE, ensuring a clean and fully reproducible merge of exposure information in SQL. This table provides a consistent exposure framework for subsequent Cox proportional hazards modeling and stratified analyses performed in Python.

In [9]:
# Run merge SQL to create cohort_copd_with_raas in BigQuery
run_sql_script("../sql/03_merge_exposures.sql")

Executed SQL script: 03_merge_exposures.sql


### 5.3 Row Count Validation
Baseline and post-merge row counts were compared to ensure no row loss or inflation.

In [10]:
# Sanity check: row counts before and after merge

sql_row_counts = """
SELECT
  'cohort_copd' AS table_name,
  COUNT(*) AS n
FROM `mimic-iv-portfolio.copd_raas.cohort_copd`
UNION ALL
SELECT
  'cohort_copd_with_raas' AS table_name,
  COUNT(*) AS n
FROM `mimic-iv-portfolio.copd_raas.cohort_copd_with_raas`
"""
row_counts = query_to_df(sql_row_counts)
row_counts

|   | table_name            | n     |
|---|-----------------------|-------|
| 0 | cohort_copd           | 11964 |
| 1 | cohort_copd_with_raas | 11964 |


- Baseline COPD ICU cohort: **11,964**
- Post-merge cohort with RAAS exposures: **11,964**

No discrepancies were observed.
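
The comparison can also be turned into an explicit assertion so that a mismatch fails loudly on re-runs. The sketch below is a hypothetical helper, not part of the original notebook; it takes a plain mapping from table name to row count, filled here with the two counts reported above.

```python
def assert_same_row_count(counts, before, after):
    """Raise if the merged table gained or lost rows relative to the baseline."""
    n_before = counts[before]
    n_after = counts[after]
    if n_before != n_after:
        raise AssertionError(
            f"row count changed: {before}={n_before}, {after}={n_after}"
        )
    return n_before


# Counts as reported in the output above
counts = {"cohort_copd": 11964, "cohort_copd_with_raas": 11964}
n_rows = assert_same_row_count(counts, "cohort_copd", "cohort_copd_with_raas")
print("Row counts match:", n_rows)
```

In the notebook, the mapping could be built from the query result with `dict(zip(row_counts["table_name"], row_counts["n"]))`.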

### 5.4 Missing Exposure Values

In [11]:
# Sanity check: count NULLs in key exposure variables
sql_nulls = """
SELECT
  SUM(CASE WHEN acei_pre_icu IS NULL THEN 1 ELSE 0 END) AS null_acei_pre_icu,
  SUM(CASE WHEN arb_pre_icu IS NULL THEN 1 ELSE 0 END) AS null_arb_pre_icu,
  SUM(CASE WHEN raas_any_pre_icu IS NULL THEN 1 ELSE 0 END) AS null_any_pre_icu,
  SUM(CASE WHEN exposure_group_4cat IS NULL THEN 1 ELSE 0 END) AS null_group
FROM `mimic-iv-portfolio.copd_raas.cohort_copd_with_raas`
"""
query_to_df(sql_nulls)

|   | null_acei_pre_icu | null_arb_pre_icu | null_any_pre_icu | null_group |
|---|-------------------|------------------|------------------|------------|
| 0 | 0                 | 0                | 0                | 0          |


All derived exposure variables (`acei_pre_icu`, `arb_pre_icu`, `raas_any_pre_icu`, and the four-category exposure variable) were verified to contain no missing values.

### 5.5 Exposure Group Distribution

The distribution of the four-category RAAS exposure variable was examined.

In [12]:
# Sanity check: distribution of exposure groups
sql_grp = """
SELECT exposure_group_4cat, COUNT(*) AS n
FROM `mimic-iv-portfolio.copd_raas.cohort_copd_with_raas`
GROUP BY exposure_group_4cat
ORDER BY exposure_group_4cat
"""
query_to_df(sql_grp)

|   | exposure_group_4cat | n     |
|---|---------------------|-------|
| 0 | acei_only           | 643   |
| 1 | arb_only            | 290   |
| 2 | both                | 13    |
| 3 | neither             | 11018 |


### 5.6 Cross-check of RAAS Inhibitor Exposure Counts

In [13]:
# Sanity check: counts of users in each RAAS inhibitor category
sql_cross = """
SELECT
  SUM(CASE WHEN acei_pre_icu = 1 THEN 1 ELSE 0 END) AS acei_users,
  SUM(CASE WHEN arb_pre_icu = 1 THEN 1 ELSE 0 END) AS arb_users,
  SUM(CASE WHEN raas_any_pre_icu = 1 THEN 1 ELSE 0 END) AS any_users
FROM `mimic-iv-portfolio.copd_raas.cohort_copd_with_raas`
"""

query_to_df(sql_cross)

|   | acei_users | arb_users | any_users |
|---|------------|-----------|-----------|
| 0 | 656        | 303       | 946       |


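Aggregate counts were internally consistent with the exposure-group distribution in 5.5:
- ACEi users: **656** (643 `acei_only` + 13 `both`)
- ARB users: **303** (290 `arb_only` + 13 `both`)
- Any RAAS users: **946** (643 + 290 + 13)

Together with the 11,018 stays in the `neither` group, these account for the full cohort of 11,964.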
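
The arithmetic behind this cross-check can be written out as a short sketch. The function name is hypothetical; the numbers are the group counts from 5.5, the aggregate counts from 5.6, and the cohort size from 5.3.

```python
def check_exposure_consistency(group_counts, acei_users, arb_users,
                               any_users, cohort_size):
    """Return {name: (observed, expected)} for every aggregate that disagrees."""
    both = group_counts["both"]
    exposed = group_counts["acei_only"] + group_counts["arb_only"] + both
    expected = {
        "acei_users": group_counts["acei_only"] + both,
        "arb_users": group_counts["arb_only"] + both,
        "any_users": exposed,
        "cohort_size": exposed + group_counts["neither"],
    }
    observed = {
        "acei_users": acei_users,
        "arb_users": arb_users,
        "any_users": any_users,
        "cohort_size": cohort_size,
    }
    return {
        name: (observed[name], expected[name])
        for name in expected
        if observed[name] != expected[name]
    }


group_counts = {"acei_only": 643, "arb_only": 290, "both": 13, "neither": 11018}
mismatches = check_exposure_consistency(
    group_counts, acei_users=656, arb_users=303, any_users=946, cohort_size=11964
)
print("Mismatches:", mismatches)
```

An empty result means the binary flags, the combined flag, and the four-category variable all describe the same partition of the cohort.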

---

## 6. Output and Downstream Use
This notebook produces the standardized, analysis-ready cohort table `copd_raas.cohort_copd_with_raas`, which is the primary input for all downstream survival analyses:
- **[04a](04a_outcomes_and_modeling.ipynb):** Combined RAAS exposure analysis
- **[04b](04b_outcomes_and_modeling_raas_subgroups.ipynb):** ACEi vs ARB subgroup analysis
- **[04c](04c_extended_covariate_cox_model.ipynb):** Extended covariate Cox models