# 02. Cohort and Exposure Construction
## COPD ICU Cohort and Pre-ICU RAAS Inhibitor Exposure Flags (BigQuery)

## 0. Overview
This notebook constructs a COPD ICU cohort and derives pre-ICU RAAS inhibitor exposure variables entirely in BigQuery.
It outputs standardized tables that the downstream notebooks consume:
- **03a:** baseline cohort table construction
- **03b:** exposure integration into an analysis-ready dataset
- **04a–04c:** survival analyses (Kaplan–Meier, Cox models, and diagnostics)

---

## 1. Purpose
1) Build the **COPD ICU cohort** in BigQuery
2) Create **pre-ICU RAAS inhibitor exposure flags** (simple + detailed) in BigQuery  
3) Run quick **sanity checks** to confirm expected cohort/exposure distributions

---

## 2. Data Sources
- **MIMIC-IV v3.1** (BigQuery public dataset via PhysioNet access)
- Google Cloud Platform (GCP): `mimic-iv-portfolio`
- Working dataset: `copd_raas`

---

## 3. Common BigQuery Utilities
The shared BigQuery utility functions from Notebook **01** are redefined in the cell below.

In [None]:
# Use Application Default Credentials (my user account)
# This account already has PhysioNet BigQuery access.

from google.cloud import bigquery
from google.auth import default
from pathlib import Path

# Define project ID
PROJECT_ID = "mimic-iv-portfolio"

# Get ADC credentials
creds, adc_project = default()

# Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID, credentials=creds)

print("Connected to BigQuery project:", PROJECT_ID)
print("ADC default project:", adc_project)

# Helper to run a SQL script file (DDL, CREATE TABLE, etc.)
def run_sql_script(path):
    """
    Read a .sql file from disk, execute it in BigQuery,
    and wait until the job finishes.
    Use this for CREATE TABLE / INSERT INTO scripts.
    """
    sql_path = Path(path)
    with sql_path.open("r") as f:
        query = f.read()
    job = client.query(query)
    job.result()
    print(f"Executed SQL script: {sql_path.name}")

# Helper for SELECT queries → DataFrame
def query_to_df(query):
    """
    Run a SELECT query in BigQuery and return a pandas DataFrame.
    """
    job = client.query(query)
    return job.to_dataframe(create_bqstorage_client=False)

Connected to BigQuery project: mimic-iv-portfolio
ADC default project: mimic-iv-portfolio


## 4. Build Cohort and Exposure Tables in BigQuery

### 4.1 Build COPD ICU Cohort

This step constructs the base COPD ICU cohort entirely in BigQuery.

- **SQL script:** `02_build_cohort_copd.sql`
- **Output table:** `copd_raas.cohort_copd`

The SQL script identifies hospital admissions with a COPD diagnosis using ICD-9
(491*, 492*, 496) and ICD-10 (J41–J44) codes from
`mimiciv_3_1_hosp.diagnoses_icd`, then keeps only the ICU stays belonging to those
admissions by joining with the ICU cohort table (`copd_raas.cohort_icu`) defined in Notebook 01.
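
The code set above can be written as a small predicate. The following is a minimal Python sketch of the matching logic only, not the contents of `02_build_cohort_copd.sql`. The function name `is_copd_code` is hypothetical, and the sketch assumes each diagnosis arrives as an `icd_code` string plus an `icd_version` of 9 or 10, as in `diagnoses_icd`.

```python
def is_copd_code(icd_code: str, icd_version: int) -> bool:
    """Return True if the code belongs to the COPD code set described above.

    Hypothetical helper for illustration; the SQL script may implement
    the same rule differently (e.g. with LIKE or REGEXP_CONTAINS).
    """
    # Normalize: drop padding and dots, compare in upper case.
    code = icd_code.strip().upper().replace(".", "")
    if icd_version == 9:
        # 491* and 492* are prefix matches; 496 is a single code.
        return code.startswith(("491", "492")) or code == "496"
    if icd_version == 10:
        # J41-J44: match on the three-character category.
        return code[:3] in {"J41", "J42", "J43", "J44"}
    return False
```

Checking the version alongside the code matters because the same character string can mean different things in ICD-9 and ICD-10.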

In [2]:
# Build the COPD ICU cohort in BigQuery using ICD-9/10 diagnosis codes
# Output: copd_raas.cohort_copd
run_sql_script("../sql/02_build_cohort_copd.sql")

Executed SQL script: 02_build_cohort_copd.sql


### 4.2 Define Binary Pre-ICU RAAS Inhibitor Exposure

This step defines a simple binary indicator of RAAS inhibitor exposure initiated
prior to or at ICU admission for the COPD ICU cohort.

- **SQL script:** `02_exposure_raas.sql`
- **Output table:** `copd_raas.cohort_copd_raas`

The SQL script identifies prescriptions for RAAS inhibitors from the MIMIC-IV
`physionet-data.mimiciv_3_1_hosp.prescriptions` table and derives a binary exposure
flag (`raas_pre_icu`) at the ICU-stay level based on whether any RAAS inhibitor was
initiated on or before ICU admission.

This simplified exposure definition is primarily used for sanity checks and baseline
comparisons, and serves as an intermediate input for downstream cohort construction.
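
The temporal rule behind `raas_pre_icu` can be illustrated in a few lines of Python. This is a sketch under stated assumptions rather than a transcription of `02_exposure_raas.sql`: the helper name `raas_pre_icu_flag` is hypothetical, and it assumes the comparison is between each prescription's `starttime` and the ICU stay's `intime`.

```python
from datetime import datetime
from typing import Iterable, Optional


def raas_pre_icu_flag(
    rx_starttimes: Iterable[Optional[datetime]],
    icu_intime: datetime,
) -> int:
    """Return 1 if any RAAS inhibitor prescription starts on or before
    ICU admission, else 0. Missing start times never count as exposure.

    Hypothetical helper; assumes starttime <= intime is the rule applied.
    """
    return int(any(t is not None and t <= icu_intime for t in rx_starttimes))


# Example: one prescription the day before ICU admission, one afterwards.
intime = datetime(2150, 1, 2, 12, 0)
flag = raas_pre_icu_flag([datetime(2150, 1, 1, 8, 0), datetime(2150, 1, 3, 8, 0)], intime)
```

Here `flag` is 1 because at least one start time precedes `intime`; a stay with no qualifying prescription gets 0.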

In [3]:
# Create binary pre-ICU RAAS inhibitor exposure flag (raas_pre_icu)
# Output: copd_raas.cohort_copd_raas
run_sql_script("../sql/02_exposure_raas.sql")

Executed SQL script: 02_exposure_raas.sql


### 4.3 Create Detailed RAAS Inhibitor Exposure Flags

This step defines medication-based pre-ICU RAAS inhibitor exposure variables for the COPD ICU cohort.

- **SQL script:** `02_exposure_raas_detailed.sql`
- **Output table:** `copd_raas.cohort_copd_raas_detailed`

The SQL script identifies ACE inhibitor (ACEi) and angiotensin receptor blocker (ARB)
prescriptions from the MIMIC-IV `mimiciv_3_1_hosp.prescriptions` table using drug-name
pattern matching, and links them to ICU stays via patient and admission identifiers.

Only prescriptions initiated on or before ICU admission are considered to preserve
temporal ordering and avoid reverse causation.

At the ICU-stay level, the script derives binary indicators for ACEi and ARB use prior
to or at ICU admission, along with composite exposure definitions including any RAAS
inhibitor use, dual ACEi+ARB exposure, and mutually exclusive exposure groups (ACEi only,
ARB only, both, or neither).
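
The name matching and the composite definitions can be sketched in Python as follows. The drug-name lists are illustrative examples of ACE inhibitors and ARBs, not the patterns used in `02_exposure_raas_detailed.sql` (which are not shown in this notebook). The function name and the dictionary keys `acei`, `arb`, `any_raas`, and `dual` are hypothetical; only the `exposure_group_4cat` labels are taken from the table output in Section 5.3. The sketch assumes the prescriptions passed in have already been restricted to those initiated on or before ICU admission.

```python
import re

# Illustrative, non-exhaustive name lists (assumption, not the SQL patterns).
ACEI_PATTERN = re.compile(
    r"lisinopril|enalapril|captopril|ramipril|benazepril|quinapril", re.IGNORECASE
)
ARB_PATTERN = re.compile(
    r"losartan|valsartan|irbesartan|candesartan|olmesartan|telmisartan", re.IGNORECASE
)


def exposure_flags(drug_names):
    """Derive per-stay exposure flags from a list of prescribed drug names."""
    acei = int(any(ACEI_PATTERN.search(d) for d in drug_names))
    arb = int(any(ARB_PATTERN.search(d) for d in drug_names))

    # Mutually exclusive 4-category grouping.
    if acei and arb:
        group = "both"
    elif acei:
        group = "acei_only"
    elif arb:
        group = "arb_only"
    else:
        group = "neither"

    return {
        "acei": acei,
        "arb": arb,
        "any_raas": int(acei or arb),
        "dual": int(acei and arb),
        "exposure_group_4cat": group,
    }
```

Substring matching of this kind also catches combination products whose names contain one of the listed drugs, so the pattern list deserves a manual review against the distinct drug names in the data.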

In [4]:
# Execute SQL script to create detailed pre-ICU RAAS inhibitor exposure flags (ACEi / ARB / both / neither / monotherapy)
# Output: copd_raas.cohort_copd_raas_detailed
run_sql_script("../sql/02_exposure_raas_detailed.sql")

Executed SQL script: 02_exposure_raas_detailed.sql


## 5. Sanity Checks
### 5.1 Binary pre-ICU RAAS Inhibitor Exposure

This check confirms the distribution of the binary pre-ICU RAAS inhibitor exposure flag (`raas_pre_icu`) derived from medication records.

Expected (and observed) distribution:
- `raas_pre_icu = 0`: 11,018
- `raas_pre_icu = 1`: 946

In [5]:
# Sanity check: simple pre-ICU RAAS inhibitor exposure flag
# From: copd_raas.cohort_copd_raas (02_exposure_raas.sql)

sql_simple = """
SELECT 
  raas_pre_icu,
  COUNT(*) AS n
FROM `mimic-iv-portfolio.copd_raas.cohort_copd_raas`
GROUP BY raas_pre_icu
ORDER BY raas_pre_icu
"""

simple_counts = query_to_df(sql_simple)
df = simple_counts.copy()

# Map the 0/1 codes to readable labels (on the copy, so simple_counts keeps 0/1)
df["raas_pre_icu"] = df["raas_pre_icu"].map({
    0: "No RAAS inhibitor use before or at ICU admission",
    1: "RAAS inhibitor use before or at ICU admission"
})

display(df)

| | raas_pre_icu | n |
|---|---|---|
| 0 | No RAAS inhibitor use before or at ICU admission | 11018 |
| 1 | RAAS inhibitor use before or at ICU admission | 946 |


### 5.2 Cohort Integrity Check: One Row per ICU Stay

In [6]:
# Sanity check: one row per ICU stay
query_to_df("""
SELECT COUNT(*) AS n_rows, COUNT(DISTINCT stay_id) AS n_unique_stays
FROM `mimic-iv-portfolio.copd_raas.cohort_copd`
""")



| | n_rows | n_unique_stays |
|---|---|---|
| 0 | 11964 | 11964 |


### 5.3 Detailed Exposure Group Counts
This check validates the mutually exclusive 4-category exposure grouping.

Expected counts:
- ACEi only: 643
- ARB only: 290
- Both ACEi and ARB: 13
- Neither: 11,018

In [7]:
# Quick sanity check: number of ICU stays per exposure group
sql_check = """
SELECT exposure_group_4cat, COUNT(*) AS n
FROM `mimic-iv-portfolio.copd_raas.cohort_copd_raas_detailed`
GROUP BY exposure_group_4cat
ORDER BY exposure_group_4cat
"""
exposure_counts = query_to_df(sql_check)
exposure_counts

| | exposure_group_4cat | n |
|---|---|---|
| 0 | acei_only | 643 |
| 1 | arb_only | 290 |
| 2 | both | 13 |
| 3 | neither | 11018 |
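
The counts in Sections 5.1 to 5.3 should agree with one another: the three exposed groups must sum to the `raas_pre_icu = 1` count, and `neither` must equal the `raas_pre_icu = 0` count. A minimal sketch of that cross-check, using plain dictionaries and the counts reported above (the function name `check_exposure_consistency` is hypothetical):

```python
def check_exposure_consistency(simple_counts, group_counts):
    """Cross-check the binary flag counts against the 4-category counts.

    simple_counts: {0: n_unexposed, 1: n_exposed}
    group_counts:  {"acei_only": n, "arb_only": n, "both": n, "neither": n}
    Returns the total number of ICU stays if the two tables agree.
    """
    exposed = sum(group_counts[g] for g in ("acei_only", "arb_only", "both"))
    if exposed != simple_counts[1]:
        raise ValueError(f"exposed groups sum to {exposed}, flag says {simple_counts[1]}")
    if group_counts["neither"] != simple_counts[0]:
        raise ValueError("'neither' count does not match raas_pre_icu = 0")
    return sum(group_counts.values())


total = check_exposure_consistency(
    {0: 11018, 1: 946},
    {"acei_only": 643, "arb_only": 290, "both": 13, "neither": 11018},
)
print(total)  # 11964
```

The total matches the 11,964 rows found in Section 5.2. In the notebook, the two dictionaries could be built from the query results, for example with `dict(zip(exposure_counts["exposure_group_4cat"], exposure_counts["n"]))`.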


---

## 6. Outputs and Downstream Use

The following tables are created in BigQuery as part of the cohort and exposure
construction steps in this notebook:

- **`copd_raas.cohort_copd`**  
  Standardized COPD ICU cohort defining the base analysis population
  (created in `02_build_cohort_copd.sql`).

- **`copd_raas.cohort_copd_raas`**  
  Binary indicator of pre-ICU RAAS inhibitor exposure (`raas_pre_icu`), defined
  prior to or at ICU admission
  (created in `02_exposure_raas.sql`).

- **`copd_raas.cohort_copd_raas_detailed`**  
  Detailed pre-ICU RAAS inhibitor exposure definitions, including:
  - ACE inhibitor (ACEi) use
  - Angiotensin receptor blocker (ARB) use
  - Any RAAS inhibitor use
  - Dual exposure
  - Mutually exclusive categorical exposure groupings  
  (created in `02_exposure_raas_detailed.sql`).

These tables are designed to serve as standardized and reproducible inputs for
downstream cohort assembly and survival analyses performed in subsequent notebooks
(03a–04c).