# 03b Baseline Characteristics Stratified by Early RAAS Inhibitor Exposure (Non-ICU)
## 0. Overview

This notebook summarizes baseline characteristics of non-ICU hospital admissions stratified by early RAAS inhibitor exposure using a pre-constructed, admission-level analysis dataset.

The goal is to describe cohort composition and baseline differences between exposure groups by summarizing demographic characteristics, comorbidity burden, and admission-related variables prior to outcome modeling.

This notebook is purely descriptive and does not fit outcome models, estimate effect measures, or perform causal inference.

## 1. Introduction

Patients receiving early RAAS inhibitors may differ systematically from unexposed patients with respect to demographic characteristics, comorbidity burden, and admission context.

Before estimating adjusted outcomes, it is therefore important to characterize baseline differences between exposure groups to understand potential sources of confounding and to contextualize subsequent multivariable analyses.

This notebook provides descriptive summaries only and does not attempt to estimate associations or causal effects.

## 2. Data Sources

- **MIMIC-IV v3.1** (BigQuery public dataset)
- Project: `mimic-iv-portfolio`

**Source Tables:**
  - `mimic-iv-portfolio.nonicu_raas.analysis_dataset`<br>
    (created in 02 using `03_build_analysis_dataset.sql`)

This intermediate table was generated by an SQL-based preprocessing pipeline to ensure reproducibility and to keep data extraction separate from analysis.

## 3. Cohort Definition

### 3.1 Inclusion Criteria
- Adult admissions (age ≥ 18 years)
- Non-ICU hospital admissions
- Exposure indicators available at the admission level (with unexposed admissions explicitly coded)

Each row corresponds to a unique hospital admission.
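The row-level criteria above can be asserted directly on the loaded data. The following is a minimal sketch on a toy frame, assuming the column names `hadm_id`, `age`, and `raas_any_early` from the analysis dataset. `check_cohort` is a hypothetical helper, not part of the original pipeline, and the non-ICU restriction is assumed to be applied upstream, so it is not checked here.

```python
import pandas as pd

# Toy admission-level frame; column names follow the analysis dataset,
# but the values are invented for illustration.
toy = pd.DataFrame({
    "hadm_id": [101, 102, 103],
    "age": [18, 54, 90],
    "raas_any_early": [0, 1, 0],
})


def check_cohort(frame: pd.DataFrame) -> dict:
    """Return a pass/fail flag for each inclusion-related check (hypothetical helper)."""
    return {
        # One row per hospital admission
        "one_row_per_admission": bool(frame["hadm_id"].is_unique),
        # Adult admissions only
        "all_adults": bool((frame["age"] >= 18).all()),
        # Exposure explicitly coded as 0 or 1 (missing values fail this check)
        "exposure_coded_0_1": bool(frame["raas_any_early"].isin([0, 1]).all()),
    }


cohort_checks = check_cohort(toy)
print(cohort_checks)
```

Returning a dictionary of flags rather than raising immediately makes it possible to see every failed criterion at once.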

## 4. Exposure Definition: Early RAAS Inhibitor Use

### 4.1 Exposure Construction
Early RAAS inhibitor exposure was defined upstream using pre-specified, time-restricted criteria and incorporated into the analysis dataset prior to this notebook.

For descriptive purposes, exposure status is represented using a binary admission-level indicator (`raas_any_early`), which classifies hospital admissions as either exposed or unexposed to early RAAS inhibitor therapy.

Exposure definitions are not modified or re-derived in this notebook.

### 4.2 Exposure Group Labels
For descriptive clarity, exposure groups were labeled as:
- RAAS early
- No RAAS early

## 5. Data Preparation and Sanity Checks

### 5.1 Common BigQuery Utilities

In [1]:
from google.cloud import bigquery
from google.auth import default
import pandas as pd
import numpy as np
from pathlib import Path

# 1. Define project ID, dataset, and table references
PROJECT_ID = "mimic-iv-portfolio"
DATASET = "nonicu_raas"

TABLE_ANALYSIS = f"{PROJECT_ID}.{DATASET}.analysis_dataset"  # Created in 03_build_analysis_dataset.sql

# 2. Get ADC credentials

creds, adc_project = default()
client = bigquery.Client(project=PROJECT_ID, credentials=creds)

print("Connected to:", PROJECT_ID, "| ADC default:", adc_project)

Connected to: mimic-iv-portfolio | ADC default: mimic-iv-portfolio


### 5.2 Dataset Loading and Initial Sanity Checks

In [2]:
# 3. Helper for read-only SELECT queries → DataFrame
def query_to_df(query: str) -> pd.DataFrame:
    """
    Run a SELECT query in BigQuery and return a pandas DataFrame.
    """
    job = client.query(query)
    return job.to_dataframe(create_bqstorage_client=False)

In [3]:
q = f"""
SELECT
  *
FROM `{TABLE_ANALYSIS}`
"""
df = query_to_df(q)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460786 entries, 0 to 460785
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   subject_id            460786 non-null  Int64         
 1   hadm_id               460786 non-null  Int64         
 2   admittime             460786 non-null  datetime64[us]
 3   dischtime             460786 non-null  datetime64[us]
 4   deathtime             2324 non-null    datetime64[us]
 5   hospital_expire_flag  460786 non-null  Int64         
 6   admission_type        460786 non-null  object        
 7   admission_location    460785 non-null  object        
 8   discharge_location    311810 non-null  object        
 9   insurance             452862 non-null  object        
 10  language              460377 non-null  object        
 11  marital_status        454118 non-null  object        
 12  race                  460786 non-null  object        
 13 

### 5.3 Exposure Definition Validation and Consistency Checks

In [4]:
# ================================
# Basic dataset integrity checks
# ================================

# Number of rows (each row represents one hospital admission)
print("Rows:", len(df))

# Number of unique hospital admissions (should match row count)
print("Unique hadm_id:", df["hadm_id"].nunique())


# ================================
# Exposure prevalence
# ================================

# Proportion of admissions with early RAAS inhibitor exposure
# Since raas_any_early is coded as 0/1, the mean corresponds to prevalence
prev = df["raas_any_early"].mean()
print(f"Early RAAS exposure prevalence: {prev:.4f} ({prev*100:.2f}%)")


# ================================
# Internal consistency checks for exposure definitions
# ================================

# Check that raas_any_early is correctly defined as:
# 1 if either ACE inhibitor OR ARB was administered early
raas_any_definition_check = (
    ((df["acei_early"] == 1) | (df["arb_early"] == 1))
    == (df["raas_any_early"] == 1)
)

# Check that raas_both_early is correctly defined as:
# 1 if BOTH ACE inhibitor AND ARB were administered early
raas_both_definition_check = (
    ((df["acei_early"] == 1) & (df["arb_early"] == 1))
    == (df["raas_both_early"] == 1)
)

# Proportion of rows where definitions are internally consistent
# A value of 1.00 indicates perfect agreement across all admissions
print(
    "RAAS-any indicator matches ACEi or ARB exposure:",
    f"{raas_any_definition_check.mean():.2f}"
)

print(
    "RAAS-both indicator matches simultaneous ACEi and ARB exposure:",
    f"{raas_both_definition_check.mean():.2f}"
)

Rows: 460786
Unique hadm_id: 460786
Early RAAS exposure prevalence: 0.1233 (12.33%)
RAAS-any indicator matches ACEi or ARB exposure: 1.00
RAAS-both indicator matches simultaneous ACEi and ARB exposure: 1.00


**Descriptive Summary**

- The analytic dataset consisted of 460,786 hospital admissions, with a one-to-one correspondence between rows and unique hospital admission identifiers, confirming the absence of duplicate admissions.
	
- Early RAAS inhibitor exposure was observed in 12.33% of admissions, indicating that early use of ACE inhibitors or ARBs was relatively uncommon in this non-ICU cohort.

- Internal consistency checks demonstrated perfect agreement (100%) between the composite exposure indicators and their component definitions:
  - The RAAS-any indicator correctly reflected early exposure to either an ACE inhibitor or an ARB.
  - The RAAS-both indicator correctly reflected simultaneous early exposure to both drug classes.

- These results confirm that exposure variables were internally consistent and reliably constructed, supporting their use in subsequent descriptive and outcome analyses.
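One caveat on the check above: an agreement proportion printed with two decimals shows `1.00` for any value of 0.995 or higher, which in a dataset of 460,786 rows could still hide a couple of thousand discordant admissions. A stricter variant counts mismatches directly. The sketch below uses a toy frame with the same column names; `count_mismatches` is a hypothetical helper and the values are invented.

```python
import pandas as pd

# Toy data with the same exposure columns as the analysis dataset (values invented).
toy = pd.DataFrame({
    "acei_early":      [1, 0, 1, 0],
    "arb_early":       [0, 1, 1, 0],
    "raas_any_early":  [1, 1, 1, 0],
    "raas_both_early": [0, 0, 1, 0],
})


def count_mismatches(frame: pd.DataFrame) -> dict:
    """Count rows where a composite indicator disagrees with its components."""
    any_expected = (frame["acei_early"] == 1) | (frame["arb_early"] == 1)
    both_expected = (frame["acei_early"] == 1) & (frame["arb_early"] == 1)
    return {
        "raas_any_early": int((any_expected != (frame["raas_any_early"] == 1)).sum()),
        "raas_both_early": int((both_expected != (frame["raas_both_early"] == 1)).sum()),
    }


mismatches = count_mismatches(toy)
# Fail loudly if even a single row is inconsistent.
assert all(n == 0 for n in mismatches.values()), mismatches
print(mismatches)
```

An exact count of zero is a stronger statement than a rounded proportion, and a non-zero count points straight at the rows that need inspection.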

### 5.4 Exposure Group Label Construction for Descriptive Analysis

In [5]:
df["expo_group"] = np.where(df["raas_any_early"] == 1, "RAAS early", "No RAAS early")

In [6]:
# Exposure group counts with percentages
expo_summary = (
    df["expo_group"]
    .value_counts(dropna=False)
    .rename("n")
    .rename_axis("Exposure group")  # name the index so reset_index() yields a labeled column
    .to_frame()
)

expo_summary["pct"] = (expo_summary["n"] / expo_summary["n"].sum() * 100).round(2)

expo_summary.reset_index().rename(
    columns={
        "n": "Number of admissions",
        "pct": "Percentage (%)"
    }
)

| Exposure group | Number of admissions | Percentage (%) |
|---|---|---|
| No RAAS early | 403961 | 87.67 |
| RAAS early | 56825 | 12.33 |


**Descriptive Summary**

Early RAAS inhibitor exposure was observed in 56,825 admissions (12.33%), whereas 403,961 admissions (87.67%) had no early RAAS exposure, indicating that early RAAS use was relatively uncommon in the study population.
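The label construction in this section relies on the indicator being fully coded as 0 or 1. If a missing value ever slipped in, `np.where` would silently assign it to the unexposed group. The sketch below contrasts that behavior with an explicit mapping on an invented toy series. It is an optional safeguard, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy indicator with one deliberately missing value (invented data).
indicator = pd.Series([1, 0, np.nan, 1], name="raas_any_early")

# np.where sends anything that is not 1, including missing values,
# to the unexposed label.
labels_where = pd.Series(
    np.where(indicator == 1, "RAAS early", "No RAAS early"),
    index=indicator.index,
)

# An explicit mapping leaves missing indicators as missing.
label_map = {1: "RAAS early", 0: "No RAAS early"}
labels_map = indicator.map(label_map)

print(labels_where.value_counts(dropna=False))
print(labels_map.value_counts(dropna=False))
```

With the explicit mapping, any unlabeled rows show up as a separate missing category in the counts instead of inflating the unexposed group.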

### 5.5 Baseline Continuous Variables by Exposure Group

In [7]:
# ------------------------------------------------------------
# Helper functions for formatted descriptive statistics
# ------------------------------------------------------------

def format_mean_sd(x):
    """
    Return mean and standard deviation formatted as:
    mean (sd)
    """
    x = x.dropna()
    if len(x) == 0:
        return "NA"
    return f"{x.mean():.2f} ({x.std():.2f})"


def format_median_iqr(x):
    """
    Return median and interquartile range formatted as:
    median [Q1, Q3]
    """
    x = x.dropna()
    if len(x) == 0:
        return "NA"
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    return f"{x.median():.2f} [{q1:.2f}, {q3:.2f}]"


# ------------------------------------------------------------
# Continuous variables to summarize
# ------------------------------------------------------------
cont_vars = ["age", "hosp_los"]


# ------------------------------------------------------------
# Build Table 1: Continuous baseline characteristics
# Stratified by early RAAS exposure (raas_any_early)
# ------------------------------------------------------------
rows = []

for v in cont_vars:
    g0 = df.loc[df["raas_any_early"] == 0, v]
    g1 = df.loc[df["raas_any_early"] == 1, v]

    rows.append({
        "variable": v,
        "No RAAS early (mean (sd))": format_mean_sd(g0),
        "RAAS early (mean (sd))": format_mean_sd(g1),
        "No RAAS early (median [IQR])": format_median_iqr(g0),
        "RAAS early (median [IQR])": format_median_iqr(g1),
    })

table1_cont = pd.DataFrame(rows)
table1_cont

| variable | No RAAS early (mean (sd)) | RAAS early (mean (sd)) | No RAAS early (median [IQR]) | RAAS early (median [IQR]) |
|---|---|---|---|---|
| age | 56.65 (19.55) | 68.63 (14.01) | 58.00 [41.00, 72.00] | 69.00 [59.00, 79.00] |
| hosp_los | 3.75 (5.45) | 3.98 (5.13) | 2.33 [0.92, 4.58] | 2.75 [1.50, 4.79] |


**Descriptive Summary**

- Admissions with early RAAS inhibitor exposure were markedly older than those without early exposure.
The mean age was 68.6 years in the RAAS early group compared with 56.6 years in the non-RAAS group, and this difference was also evident in the median age (69 vs. 58 years).

- Hospital length of stay was slightly longer among admissions with early RAAS exposure.
The median hospital stay was 2.75 days in the RAAS early group versus 2.33 days in the non-RAAS group, with similar patterns observed for mean length of stay.

- These summaries represent unadjusted baseline characteristics stratified by exposure group.
The substantial age difference between groups suggests that crude comparisons of outcomes may be confounded by baseline risk, underscoring the need for multivariable adjustment in subsequent analyses.
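The size of these differences can also be put on a common scale with a standardized mean difference (SMD). The SMD is not computed in the original notebook. The sketch below is a supplementary illustration that plugs the rounded group means and SDs from the table above into the usual pooled-SD formula; `standardized_mean_difference` is a hypothetical helper.

```python
import math


def standardized_mean_difference(mean_1: float, sd_1: float,
                                 mean_0: float, sd_0: float) -> float:
    """Difference in means divided by the pooled SD (square root of the average variance)."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_0 ** 2) / 2)
    return (mean_1 - mean_0) / pooled_sd


# Group summaries copied from the continuous-variable table above
# (RAAS early first, No RAAS early second).
smd_age = standardized_mean_difference(68.63, 14.01, 56.65, 19.55)
smd_los = standardized_mean_difference(3.98, 5.13, 3.75, 5.45)

print(f"SMD age: {smd_age:.2f}")
print(f"SMD hosp_los: {smd_los:.2f}")
```

On these inputs the SMD is about 0.70 for age and about 0.04 for length of stay, which matches the qualitative reading above: a large imbalance in age and a negligible one in length of stay.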

### 5.6 Baseline Categorical Variables by Exposure Group

In [8]:
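def cat_table_wide(df: pd.DataFrame, var: str, group_col: str = "expo_group") -> pd.DataFrame:
    """
    Cross-tabulate a categorical variable by exposure group and return
    "n (pct%)" cells, with percentages computed within each exposure group.
    """
    # Count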
    ct = pd.crosstab(df[var], df[group_col], dropna=False)

    # Percent within each exposure group (column-wise)
    pct = ct.div(ct.sum(axis=0), axis=1) * 100

    # Combine as "n (pct%)"
    out = pd.DataFrame(index=ct.index)
    for g in ct.columns:
        out[g] = ct[g].astype(int).astype(str) + " (" + pct[g].round(2).astype(str) + "%)"

    # Optional: add percentage-point difference (RAAS - No RAAS) if both exist
    if ("RAAS early" in ct.columns) and ("No RAAS early" in ct.columns):
        out["pp_diff (RAAS - No RAAS)"] = (pct["RAAS early"] - pct["No RAAS early"]).round(2)

    out = out.reset_index().rename(columns={var: "category"})
    return out

In [9]:
display(cat_table_wide(df, "gender"))
display(cat_table_wide(df, "admission_type"))
display(cat_table_wide(df, "anchor_year_group"))

| category | No RAAS early | RAAS early | pp_diff (RAAS - No RAAS) |
|---|---|---|---|
| F | 218834 (54.17%) | 27549 (48.48%) | -5.69 |
| M | 185127 (45.83%) | 29276 (51.52%) | 5.69 |


| category | No RAAS early | RAAS early | pp_diff (RAAS - No RAAS) |
|---|---|---|---|
| AMBULATORY OBSERVATION | 5939 (1.47%) | 1231 (2.17%) | 0.7 |
| DIRECT EMER. | 17046 (4.22%) | 2007 (3.53%) | -0.69 |
| DIRECT OBSERVATION | 20122 (4.98%) | 4192 (7.38%) | 2.4 |
| ELECTIVE | 9132 (2.26%) | 1192 (2.1%) | -0.16 |
| EU OBSERVATION | 111478 (27.6%) | 7439 (13.09%) | -14.51 |
| EW EMER. | 112138 (27.76%) | 21606 (38.02%) | 10.26 |
| OBSERVATION ADMIT | 60166 (14.89%) | 11704 (20.6%) | 5.7 |
| SURGICAL SAME DAY ADMISSION | 29595 (7.33%) | 4362 (7.68%) | 0.35 |
| URGENT | 38345 (9.49%) | 3092 (5.44%) | -4.05 |


| category | No RAAS early | RAAS early | pp_diff (RAAS - No RAAS) |
|---|---|---|---|
| 2008 - 2010 | 173705 (43.0%) | 26819 (47.2%) | 4.2 |
| 2011 - 2013 | 85239 (21.1%) | 11968 (21.06%) | -0.04 |
| 2014 - 2016 | 65477 (16.21%) | 9263 (16.3%) | 0.09 |
| 2017 - 2019 | 49710 (12.31%) | 5706 (10.04%) | -2.26 |
| 2020 - 2022 | 29830 (7.38%) | 3069 (5.4%) | -1.98 |


Baseline categorical characteristics by early RAAS inhibitor exposure

Values are shown as *n (% within exposure group)*.  
**pp_diff** denotes the absolute difference in percentage points between the RAAS early group and the No RAAS early group (RAAS − No RAAS).

**Descriptive Summary**

- Sex distribution differed modestly between exposure groups. Admissions with early RAAS inhibitor exposure were more frequently male (51.5%) compared with the non-exposed group (45.8%), corresponding to a difference of +5.7 percentage points.

- Admission type showed pronounced differences between groups. EW emergency admissions made up a larger share of RAAS-exposed admissions (38.0% vs. 27.8%, +10.3 percentage points), as did OBSERVATION ADMIT admissions (20.6% vs. 14.9%, +5.7 percentage points). In contrast, EU observation admissions were less frequent in the RAAS-exposed group (13.1% vs. 27.6%, −14.5 percentage points), as were URGENT admissions (5.4% vs. 9.5%, −4.1 percentage points).

- Calendar period (`anchor_year_group`) was broadly similar between groups, although a slightly larger share of RAAS-exposed admissions fell in the earliest period (2008–2010: 47.2% vs. 43.0%, +4.2 percentage points) and a smaller share in the most recent period (2020–2022: 5.4% vs. 7.4%, −2.0 percentage points).

- Taken together, these baseline differences indicate that early RAAS inhibitor exposure is associated with systematic variation in patient composition and admission context, particularly with respect to sex and admission pathway. These baseline differences motivate the use of multivariable adjustment in subsequent outcome analyses.
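The percentages reported in this section are column-wise: each is the share of a category within an exposure group. A complementary row-wise view asks how common early RAAS exposure is within each admission type. The sketch below recomputes both views from counts copied from the admission type and exposure group tables above. Only two categories are included for brevity, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the admission_type table above (two categories only).
counts = pd.DataFrame(
    {"No RAAS early": [111478, 112138], "RAAS early": [7439, 21606]},
    index=["EU OBSERVATION", "EW EMER."],
)

# Exposure group totals copied from the exposure group count table.
group_totals = pd.Series({"No RAAS early": 403961, "RAAS early": 56825})

# Column-wise view (as reported): share of each category within an exposure group.
pct_within_group = counts.div(group_totals, axis=1) * 100

# Row-wise view: prevalence of early RAAS exposure within each admission type.
prevalence_within_category = counts["RAAS early"] / counts.sum(axis=1) * 100

print(pct_within_group.round(2))
print(prevalence_within_category.round(2))
```

The column-wise values reproduce the table (for example 38.02% EW emergency among RAAS-exposed admissions), while the row-wise view shows exposure prevalence of roughly 6.3% among EU observation admissions and 16.2% among EW emergency admissions. The two views answer different questions and should not be read interchangeably.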

### 5.7 Unadjusted Baseline Summary by Early RAAS Inhibitor Exposure

In [10]:
summary = df.groupby("expo_group").agg(
    n=("hadm_id", "count"),
    age_mean=("age", "mean"),
    age_sd=("age", "std"),
    los_median=("hosp_los", "median"),
    raas_any=("raas_any_early", "mean"),
    acei=("acei_early", "mean"),
    arb=("arb_early", "mean"),
    both=("raas_both_early", "mean"),
)
summary

| expo_group | n | age_mean | age_sd | los_median | raas_any | acei | arb | both |
|---|---|---|---|---|---|---|---|---|
| No RAAS early | 403961 | 56.648924 | 19.554315 | 2.333333 | 0.0 | 0.0 | 0.0 | 0.0 |
| RAAS early | 56825 | 68.631236 | 14.007736 | 2.75 | 1.0 | 0.682288 | 0.325966 | 0.008253 |


Baseline summary statistics by early RAAS inhibitor exposure

Values are reported by exposure group. Continuous variables are shown as means (SD) or medians, as appropriate. Binary variables represent proportions within each group.

**Descriptive Summary**

This summary table presents key baseline characteristics stratified by early RAAS inhibitor exposure status.

- The cohort consisted of 460,786 non-ICU hospital admissions, of which 12.3% (n = 56,825) received RAAS inhibitors early after admission.

- Admissions in the RAAS early group involved substantially older patients on average (mean age 68.6 vs. 56.6 years), which may reflect a higher burden of chronic comorbidity.

- Median hospital length of stay was slightly longer among RAAS-exposed patients (2.75 vs. 2.33 days), although the difference was modest.

- By construction, RAAS exposure indicators behaved as expected:

  - All patients in the RAAS early group had raas_any = 1, while none in the non-exposed group did.
  - Among RAAS-exposed admissions, 68.2% received ACE inhibitors, 32.6% received ARBs, and 0.8% received both, indicating minimal overlap between drug classes.

Overall, these baseline differences point to potential confounding by age and clinical context, reinforcing the need for multivariable adjustment in downstream outcome analyses.

## 6. Outputs and Downstream Use

This notebook generates baseline descriptive summaries stratified by early RAAS inhibitor exposure and saves them as reusable interim artifacts for downstream analyses. These outputs are used as baseline descriptive inputs (Table 1–style summaries) for adjusted outcome modeling and marginal effect estimation in subsequent notebooks (04a, 04b).

The following outputs are produced:

- **Baseline continuous-variable summary table (table1_cont.csv)**
	- Includes age and hospital length of stay summarized as mean (SD) and median [IQR]
	- Stratified by early RAAS exposure group
	- Used for baseline characterization and reporting

- **Group-level baseline summary table (baseline_summary.csv)**
	- Includes sample size, age distribution, length of stay, and exposure composition
	- Serves as a compact reference for interpreting adjusted analyses

- **On-screen categorical distribution tables**
	- Sex, admission type, and calendar period distributions by exposure group
	- Used for exploratory checks and cohort understanding (not exported)

## 7. Export of Baseline Summary Tables for Downstream Analysis

In [11]:
from pathlib import Path

out_path = Path("../data/interim")
out_path.mkdir(parents=True, exist_ok=True)

table1_cont.to_csv(out_path / "table1_cont.csv", index=False)
summary.to_csv(out_path / "baseline_summary.csv")

print("Saved to:", out_path)

Saved to: ../data/interim
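Because `baseline_summary.csv` is written with its index while `table1_cont.csv` is written with `index=False`, downstream code has to read the two files slightly differently. The sketch below shows a round trip for the indexed case, using a temporary directory and a small stand-in frame with rounded values from the summary table above. How notebooks 04a and 04b actually load these files is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the group-level summary (two columns, rounded values).
summary_demo = pd.DataFrame(
    {"n": [403961, 56825], "age_mean": [56.65, 68.63]},
    index=pd.Index(["No RAAS early", "RAAS early"], name="expo_group"),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "baseline_summary.csv"
    # The index (expo_group) is written as the first CSV column.
    summary_demo.to_csv(csv_path)
    # Restore it as the index when reading back.
    reloaded = pd.read_csv(csv_path, index_col="expo_group")

print(reloaded)
```

Reading with `index_col="expo_group"` restores the exposure group labels as the index, so lookups such as `reloaded.loc["RAAS early", "n"]` work as they do on the in-memory `summary` table.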
