# 03a - Baseline Cohort Construction

## COPD ICU Cohort Definition and Pre-ICU RAAS Exposure Ascertainment

---

## 0. Overview
This notebook constructs the baseline ICU cohort of patients with COPD and defines pre-ICU exposure to RAAS inhibitors. The resulting dataset serves as the common input for all downstream survival analyses ([04a](04a_outcomes_and_modeling.ipynb), [04b](04b_outcomes_and_modeling_raas_subgroups.ipynb), and [04c](04c_extended_covariate_cox_model.ipynb)).

---

## 1. Introduction
The objective of this notebook is to establish a reproducible baseline cohort for COPD patients admitted to the ICU, with a clear temporal definition of RAAS inhibitor exposure before or at ICU admission. This design ensures appropriate temporal ordering between exposure and outcome for subsequent survival analyses.

---

## 2. Data Sources
- **MIMIC-IV v3.1** (BigQuery public dataset)
- Project: `mimic-iv-portfolio`
- Working datasets:
  - `copd_raas.cohort_copd`<br>
    (created in [02](02_cohort_and_exposures.ipynb) using [02_build_cohort_copd.sql](../sql/02_build_cohort_copd.sql))
  - `copd_raas.cohort_copd_raas`<br>
    (created in [02](02_cohort_and_exposures.ipynb) using [02_exposure_raas.sql](../sql/02_exposure_raas.sql))

---

## 3. Cohort Definition
The baseline cohort includes adult ICU admissions meeting the following criteria:
- Diagnosis consistent with COPD
- First ICU stay per hospitalization
- Available follow-up for in-hospital mortality

These inclusion criteria are enforced in the upstream cohort construction SQL
executed in steps 01–02; this notebook focuses on assembling the baseline table
and integrating covariates and exposure indicators.
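
As an illustration of the "first ICU stay per hospitalization" criterion, the following is a minimal pandas sketch. It is not the upstream SQL from steps 01–02; the toy `icustays` frame and its values are hypothetical, and it assumes that "first" means the earliest `intime` within each `hadm_id`.

```python
import pandas as pd

# Hypothetical toy data: hospitalization 10 has two ICU stays, 20 has one.
icustays = pd.DataFrame({
    "hadm_id": [10, 10, 20],
    "stay_id": [101, 102, 201],
    "intime": pd.to_datetime(["2150-01-03", "2150-01-01", "2150-03-05"]),
})

# Keep the earliest ICU stay (by intime) within each hospitalization.
first_stays = (
    icustays.sort_values(["hadm_id", "intime"])
    .drop_duplicates("hadm_id", keep="first")
    .reset_index(drop=True)
)
print(first_stays)
```

Sorting before `drop_duplicates` makes the choice of retained row explicit rather than dependent on input order.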

---

## 4. Exposure Definition: Pre-ICU RAAS Inhibitor Use
RAAS inhibitor exposure is defined as **any documented use prior to or at ICU admission**.
This definition was chosen to:
- Preserve temporal ordering between exposure and outcome
- Avoid reverse causation due to treatment escalation after ICU admission

The exposure variable is binary:
- `raas_pre_icu = 1`: RAAS inhibitor use before or at ICU
- `raas_pre_icu = 0`: No RAAS inhibitor use before or at ICU
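
The temporal rule above can be sketched in pandas as follows. This is an illustrative approximation, not the logic in `02_exposure_raas.sql`: the function name `flag_pre_icu_exposure`, the input frames, and the choice to match medication orders on `hadm_id` using a `starttime` column are all assumptions made for the example.

```python
import pandas as pd

def flag_pre_icu_exposure(stays: pd.DataFrame, rx: pd.DataFrame) -> pd.DataFrame:
    """Return one row per ICU stay with a binary raas_pre_icu flag.

    Hypothetical inputs:
      stays: stay_id, hadm_id, intime
      rx:    hadm_id, starttime (RAAS inhibitor orders only)
    """
    merged = stays.merge(rx, on="hadm_id", how="left")
    # An order counts only if it started before or at ICU admission.
    # Stays with no order have starttime = NaT, which compares as False.
    merged["pre_icu"] = merged["starttime"] <= merged["intime"]
    flag = (
        merged.groupby("stay_id")["pre_icu"].any().astype(int).rename("raas_pre_icu")
    )
    return stays.merge(flag.reset_index(), on="stay_id", how="left")

stays = pd.DataFrame({
    "stay_id": [1, 2, 3],
    "hadm_id": [10, 20, 30],
    "intime": pd.to_datetime(
        ["2150-01-02 08:00", "2150-03-05 12:00", "2150-06-01 00:00"]
    ),
})
rx = pd.DataFrame({
    "hadm_id": [10, 20],
    "starttime": pd.to_datetime(["2150-01-01 09:00", "2150-03-06 09:00"]),
})

result = flag_pre_icu_exposure(stays, rx)
print(result[["stay_id", "raas_pre_icu"]])
```

In this toy example, stay 1 is exposed (order before ICU admission), stay 2 is not (order only after admission), and stay 3 is not (no order at all).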

---

## 5. Data Preparation and Sanity Checks
### 5.1 Common BigQuery Utilities
See **01** for shared BigQuery utility functions.

In [1]:
# Use Application Default Credentials (my user account)
# This account already has PhysioNet BigQuery access.

from google.cloud import bigquery
from google.auth import default
from pathlib import Path

# Define project ID
PROJECT_ID = "mimic-iv-portfolio"

# Get ADC credentials
creds, adc_project = default()

# Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID, credentials=creds)

print("Connected to BigQuery project:", PROJECT_ID)
print("ADC default project:", adc_project)

# Helper to run a SQL script file (DDL, CREATE TABLE, etc.)
def run_sql_script(path):
    """
    Read a .sql file from disk, execute it in BigQuery,
    and wait until the job finishes.
    Use this for CREATE TABLE / INSERT INTO scripts.
    """
    sql_path = Path(path)
    with sql_path.open("r") as f:
        query = f.read()
    job = client.query(query)
    job.result()
    print(f"Executed SQL script: {sql_path.name}")

# Helper for SELECT queries → DataFrame
def query_to_df(query):
    """
    Run a SELECT query in BigQuery and return a pandas DataFrame.
    """
    job = client.query(query)
    return job.to_dataframe(create_bqstorage_client=False)

Connected to BigQuery project: mimic-iv-portfolio
ADC default project: mimic-iv-portfolio


### 5.2 Execute SQL to construct the baseline COPD ICU cohort

This step constructs the standardized baseline COPD ICU cohort table, which serves as the common analytical input for all downstream ICU-based survival
analyses ([04a](04a_outcomes_and_modeling.ipynb), [04b](04b_outcomes_and_modeling_raas_subgroups.ipynb), and [04c](04c_extended_covariate_cox_model.ipynb)).

- SQL script: [03_build_baseline.sql](../sql/03_build_baseline.sql)
- Output table: `mimic-iv-portfolio.copd_raas.cohort_copd_baseline`

The SQL script starts from the predefined COPD ICU cohort table `copd_raas.cohort_copd` and enriches each ICU stay with patient-level demographic and mortality-related information from the `physionet-data.mimiciv_3_1_hosp.patients` table using `subject_id`.

Patient-level variables from the patients table include age (taken directly from `anchor_age`, i.e., the patient's age in their `anchor_year` rather than an age recomputed at ICU admission), sex (`gender`), de-identified anchor variables (`anchor_year`, `anchor_year_group`), and date of death (`dod`).

Hospital length of stay (`hosp_los`, in days) is computed by linking ICU stays to the `physionet-data.mimiciv_3_1_hosp.admissions` table using `hadm_id` and calculating the time difference between hospital discharge and admission timestamps (`dischtime` minus `admittime`, expressed in hours and divided by 24).
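
The length-of-stay arithmetic can be checked with a small pandas sketch. The toy `admissions` frame is hypothetical, and the example keeps fractional hours; whether the SQL script truncates to whole hours before dividing is not shown here, so small differences from the table values are possible.

```python
import pandas as pd

# Hypothetical admissions: 3 days 12 hours, and exactly 1 day.
admissions = pd.DataFrame({
    "hadm_id": [10, 20],
    "admittime": pd.to_datetime(["2150-01-01 00:00", "2150-03-05 06:00"]),
    "dischtime": pd.to_datetime(["2150-01-04 12:00", "2150-03-06 06:00"]),
})

# Difference in hours, then divided by 24 to express the stay in days.
hours = (admissions["dischtime"] - admissions["admittime"]).dt.total_seconds() / 3600
admissions["hosp_los"] = hours / 24
print(admissions[["hadm_id", "hosp_los"]])
```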

Pre-ICU exposure to renin–angiotensin–aldosterone system (RAAS) inhibitors is incorporated by left-joining the RAAS exposure table `copd_raas.cohort_copd_raas` on (`subject_id`, `hadm_id`, `stay_id`). ICU stays without a matching exposure record are explicitly classified as non-exposed (`raas_pre_icu = 0`) using `COALESCE`.
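
For readers more familiar with pandas than SQL, the left join followed by `COALESCE` corresponds roughly to a left `merge` followed by `fillna(0)`. The sketch below uses hypothetical toy frames and is not a translation of `03_build_baseline.sql`; the `validate="one_to_one"` argument is an added safeguard that assumes one exposure row per ICU stay.

```python
import pandas as pd

keys = ["subject_id", "hadm_id", "stay_id"]

# Hypothetical cohort of three ICU stays; only the first has an exposure record.
cohort = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "hadm_id": [10, 20, 30],
    "stay_id": [100, 200, 300],
})
exposure = pd.DataFrame({
    "subject_id": [1],
    "hadm_id": [10],
    "stay_id": [100],
    "raas_pre_icu": [1],
})

# Left join keeps every cohort row; unmatched stays get NaN for raas_pre_icu.
baseline = cohort.merge(exposure, on=keys, how="left", validate="one_to_one")

# Equivalent of COALESCE(raas_pre_icu, 0): missing means non-exposed.
baseline["raas_pre_icu"] = baseline["raas_pre_icu"].fillna(0).astype("Int64")
print(baseline)
```

Because the join is a left join on unique keys, the row count of the cohort is preserved, which is why `raas_pre_icu` has no nulls in the resulting table.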

This table serves as the fixed input for subsequent feature engineering and survival modeling steps performed in Python.

In [2]:
# Execute SQL script (03_build_baseline.sql) to create the baseline table (copd_raas.cohort_copd_baseline) for downstream modeling
run_sql_script("../sql/03_build_baseline.sql")

Executed SQL script: 03_build_baseline.sql


### 5.3 Perform basic sanity checks on cohort size and exposure distribution

In [3]:
# Verify updated baseline table
df_baseline = query_to_df("""
    SELECT *
    FROM `mimic-iv-portfolio.copd_raas.cohort_copd_baseline`
""")

df_baseline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11964 entries, 0 to 11963
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   subject_id         11964 non-null  Int64         
 1   hadm_id            11964 non-null  Int64         
 2   stay_id            11964 non-null  Int64         
 3   intime             11964 non-null  datetime64[us]
 4   outtime            11962 non-null  datetime64[us]
 5   icu_los            11962 non-null  float64       
 6   age                11964 non-null  Int64         
 7   gender             11964 non-null  object        
 8   anchor_year        11964 non-null  Int64         
 9   anchor_year_group  11964 non-null  object        
 10  hosp_los           11964 non-null  float64       
 11  raas_pre_icu       11964 non-null  Int64         
 12  dod                6413 non-null   dbdate        
dtypes: Int64(6), datetime64[us](2), dbdate(1), float64(2), object(2)

In [4]:
# Check distribution of raas_pre_icu
df_baseline.groupby('raas_pre_icu').size().reset_index(name='count')

| raas_pre_icu | count |
|--------------|-------|
| 0            | 11018 |
| 1            | 946   |


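The two exposure groups sum to 11,964 stays, matching the row count of the baseline table; 946 stays (7.9%) are classified as exposed. These counts were reviewed for consistency with expectations based on prior literature and dataset characteristics.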
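
The checks in this section could be bundled into a reusable helper along the following lines. This is a sketch only: `check_baseline` is a hypothetical function name, the toy frame stands in for `df_baseline`, and the specific checks (unique `stay_id`, a complete binary exposure flag, a count of missing `outtime`) are assumptions about what is worth verifying rather than checks performed in the original notebook.

```python
import pandas as pd

def check_baseline(df: pd.DataFrame) -> dict:
    """Run basic integrity checks and return summary counts."""
    if not df["stay_id"].is_unique:
        raise ValueError("stay_id is not unique: expected one row per ICU stay")
    if df["raas_pre_icu"].isna().any():
        raise ValueError("raas_pre_icu contains missing values")
    if not set(df["raas_pre_icu"].unique()) <= {0, 1}:
        raise ValueError("raas_pre_icu is not binary")

    n_exposed = int((df["raas_pre_icu"] == 1).sum())
    return {
        "n_stays": len(df),
        "n_exposed": n_exposed,
        "pct_exposed": round(100 * n_exposed / len(df), 1),
        "n_missing_outtime": int(df["outtime"].isna().sum()),
    }

# Hypothetical toy frame standing in for df_baseline.
toy = pd.DataFrame({
    "stay_id": [1, 2, 3, 4],
    "raas_pre_icu": [0, 1, 0, 0],
    "outtime": pd.to_datetime(["2150-01-05", "2150-02-01", None, "2150-04-09"]),
})
summary = check_baseline(toy)
print(summary)
```

In the notebook, the same helper could be called as `check_baseline(df_baseline)` after the query in cell 3.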

---


## 6. Outputs and Downstream Use

This step produces the standardized baseline cohort table `copd_raas.cohort_copd_baseline`, which serves as the primary analysis-ready input for all downstream survival analyses.

The baseline table is used in:

- **[04a](04a_outcomes_and_modeling.ipynb):** Kaplan–Meier and Cox models with combined RAAS inhibitor exposure
- **[04b](04b_outcomes_and_modeling_raas_subgroups.ipynb):** Subgroup analyses comparing ACE inhibitors and ARBs
- **[04c](04c_extended_covariate_cox_model.ipynb):** Extended Cox models with additional covariates and sensitivity analyses

By consolidating cohort definition, covariates, and pre-ICU exposure indicators into a single SQL-generated table, this step ensures consistent data inputs, reproducibility, and comparability across all downstream modeling stages.