# 03a - Validation and Descriptive Summary of the Analysis Dataset

## 0. Overview

This notebook loads and validates the component tables that together form the admission-level analysis dataset used in downstream descriptive and outcome analyses.

Specifically, it inspects:
- A non-ICU adult hospital admission cohort table defined at the hospital admission (HADM) level
- A corresponding early RAAS inhibitor exposure table, also defined at the HADM level and constructed upstream using a fixed early-in-admission exposure window

The primary objective of this notebook is to verify structural integrity and internal consistency prior to analysis. Validation checks include confirmation of a one-to-one correspondence between rows and hospital admissions, alignment of exposure indicators with admission-level identifiers, logical consistency among exposure variables, and the absence of unexpected missingness in key fields.
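As a rough illustration of the structural checks described above, the following sketch computes a small report for an admission-level table. The function name `check_admission_table` and the toy DataFrame are hypothetical and are not part of the notebook; the same idea could be applied to the cohort and exposure tables once they are loaded.

```python
import pandas as pd

def check_admission_table(df, key="hadm_id", required=("subject_id", "hadm_id")):
    """Return simple structural checks for an admission-level table (illustrative helper)."""
    return {
        "n_rows": len(df),
        "n_unique_keys": int(df[key].nunique()),
        # True only if every admission identifier appears exactly once
        "one_row_per_admission": bool(df[key].is_unique),
        # Count of missing values in each required identifier column
        "missing_in_required": {c: int(df[c].isna().sum()) for c in required},
    }

# Toy data standing in for an admission-level table (assumption, not real MIMIC-IV rows)
toy = pd.DataFrame({"subject_id": [1, 1, 2], "hadm_id": [10, 11, 12]})
report = check_admission_table(toy)
print(report)
```

A table passes this sketch when `n_rows` equals `n_unique_keys` and all missing counts are zero.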

No tables are created or modified in this notebook; all data are read-only and assumed to have been materialized upstream in BigQuery. The validated tables serve as stable inputs for subsequent baseline characterization and outcome modeling notebooks.

## 1. Introduction

In observational clinical analyses, careful validation of analytic inputs is essential before performing descriptive summaries or outcome modeling. Errors such as duplicate admissions, unintended row expansion during joins, inconsistent exposure definitions, or silent missingness can introduce bias or invalidate downstream results.
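To make the risk of unintended row expansion concrete, here is a minimal sketch on toy data (the DataFrames are hypothetical). It shows how a duplicated key on one side of a join silently adds rows, and how the `validate` argument of `DataFrame.merge` can turn that into an explicit error.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy cohort with one row per admission
cohort = pd.DataFrame({"hadm_id": [10, 11, 12]})
# Toy exposure table in which hadm_id 11 is accidentally duplicated
exposure_dup = pd.DataFrame(
    {"hadm_id": [10, 11, 11, 12], "raas_any_early": [1, 0, 0, 0]}
)

# An unchecked left join silently expands the cohort
expanded = cohort.merge(exposure_dup, on="hadm_id", how="left")
rows_before, rows_after = len(cohort), len(expanded)
print(rows_before, rows_after)

# Asking pandas to enforce a one-to-one relationship raises instead
try:
    cohort.merge(exposure_dup, on="hadm_id", how="left", validate="one_to_one")
    caught = False
except MergeError:
    caught = True
print("duplicate key detected:", caught)
```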

This notebook focuses on validating the admission-level component tables used to define the analytic population and early renin-angiotensin-aldosterone system (RAAS) inhibitor exposure in a non-ICU cohort derived from the MIMIC-IV database. The cohort table captures demographic, administrative, and hospitalization characteristics at the level of individual hospital admissions, while the exposure table encodes early use of angiotensin-converting enzyme (ACE) inhibitors and angiotensin receptor blockers (ARBs) based on pre-specified, time-restricted definitions applied upstream.

By systematically inspecting row counts, key identifiers, data types, and logical relationships among exposure indicators, this notebook establishes confidence that the analytic inputs preserve the intended admission-level structure and are suitable for downstream descriptive and outcome analyses.

## 2. Data Sources

- **MIMIC-IV v3.1** (BigQuery public dataset)
- Project: `mimic-iv-portfolio`

**Source Tables:**
  - `nonicu_raas.nonicu_admissions`<br>
    (created in [02_exposure.ipynb](02_exposure.ipynb) using [02_exclude_icu_admissions.sql](../sql/02_exclude_icu_admissions.sql))
    
  - `nonicu_raas.exposure_raas_early`<br>
    (created in [02_exposure.ipynb](02_exposure.ipynb) using [03_define_exposure_raas_early.sql](../sql/03_define_exposure_raas_early.sql))

These intermediate tables were generated using SQL-based preprocessing pipelines in BigQuery to ensure reproducibility and a clear separation between data extraction, variable construction, and downstream analysis.

## 3. Cohort Definition

The analytic cohort consists of adult, non-ICU hospital admissions derived from the MIMIC-IV database, with inclusion and exclusion criteria applied upstream using SQL-based data extraction procedures.

Specifically, hospital admissions with any recorded ICU stay were excluded by linking admissions to the MIMIC-IV ICU module and removing all admissions with at least one ICU encounter. As a result, the cohort represents adult non-ICU hospital admissions defined at the hospital admission (HADM) level.
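The exclusion itself is performed upstream in SQL and is not reproduced in this notebook. Purely as an illustration of the logic, the sketch below expresses the same anti-join in pandas on toy data. The DataFrames are hypothetical stand-ins for the admissions and ICU stay tables, and the adult age criterion is not shown.

```python
import pandas as pd

# Toy stand-ins for the hospital admissions and ICU stay tables (assumption)
admissions = pd.DataFrame(
    {"subject_id": [1, 1, 2, 3], "hadm_id": [10, 11, 12, 13]}
)
icustays = pd.DataFrame(
    {"hadm_id": [11, 11, 13], "stay_id": [100, 101, 102]}
)

# Admissions with at least one ICU encounter
icu_hadm_ids = icustays["hadm_id"].unique()

# Anti-join: keep only admissions that never appear in the ICU table
nonicu = admissions[~admissions["hadm_id"].isin(icu_hadm_ids)].reset_index(drop=True)
print(nonicu)
```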

In this notebook, the cohort is represented by a pre-materialized BigQuery table containing one row per hospital admission. Each row is keyed by `hadm_id`, with `subject_id` identifying the patient to whom the admission belongs. The table is loaded in read-only mode for inspection and validation only; no cohort construction or modification is performed here.

## 4. Exposure Definition

Exposure information was merged into the analytic cohort at the hospital admission level using a left join on `hadm_id` (and `subject_id` as a secondary identifier). Admissions without recorded early RAAS inhibitor exposure were retained and explicitly coded as unexposed, rather than treated as missing.
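The join described above happens upstream. The following pandas sketch, on hypothetical toy data, only illustrates the pattern: a left join keeps every admission, and admissions with no matching exposure record are coded as 0 rather than left missing.

```python
import pandas as pd

# Toy cohort and a toy exposure table containing only exposed admissions (assumption)
cohort = pd.DataFrame({"subject_id": [1, 2, 3], "hadm_id": [10, 11, 12]})
exposed_only = pd.DataFrame({"hadm_id": [11], "acei_early": [1], "arb_early": [0]})

flag_cols = ["acei_early", "arb_early"]

# Left join: every cohort admission is retained
merged = cohort.merge(exposed_only, on="hadm_id", how="left")

# Admissions with no exposure record become explicit zeros, not missing values
merged = merged.fillna({c: 0 for c in flag_cols}).astype({c: "int64" for c in flag_cols})
print(merged)
```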

Exposure status is encoded as admission-level binary indicators for early ACE inhibitor use, early ARB use, combined exposure to both drug classes, and a composite indicator reflecting exposure to either class. These variables were constructed upstream using time-restricted prescription records and are loaded here without modification.

In this notebook, exposure variables are evaluated solely for row alignment, logical consistency (e.g., agreement between component and composite indicators), and completeness, without redefining exposure criteria.
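One way to express the logical consistency checks is sketched below on toy data. It assumes, based on the description above, that `raas_any_early` should equal "ACE inhibitor OR ARB" and `raas_both_early` should equal "ACE inhibitor AND ARB"; the exact upstream definitions live in the SQL and are not restated here.

```python
import pandas as pd

# Toy exposure table using the column names of the exposure table (values are made up)
expo = pd.DataFrame({
    "hadm_id":         [10, 11, 12, 13],
    "acei_early":      [0, 1, 0, 1],
    "arb_early":       [0, 0, 1, 1],
    "raas_both_early": [0, 0, 0, 1],
    "raas_any_early":  [0, 1, 1, 1],
})
flag_cols = ["acei_early", "arb_early", "raas_both_early", "raas_any_early"]

acei = expo["acei_early"].astype(bool)
arb = expo["arb_early"].astype(bool)

# Every indicator should take only the values 0 or 1
binary_ok = bool(expo[flag_cols].isin([0, 1]).all().all())
# Composite indicator should agree with "either class" (assumed definition)
any_ok = bool((expo["raas_any_early"].astype(bool) == (acei | arb)).all())
# Combined indicator should agree with "both classes" (assumed definition)
both_ok = bool((expo["raas_both_early"].astype(bool) == (acei & arb)).all())

print(binary_ok, any_ok, both_ok)
```

Any `False` here would point to a mismatch between the component and composite indicators and would warrant a look at the upstream exposure SQL.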


## 5. Data Preparation and Sanity Checks
### 5.1 Dataset Loading and Initial Inspection

In [1]:
from google.cloud import bigquery
from google.auth import default
import pandas as pd
import numpy as np
from pathlib import Path

# 1. Define project ID, dataset, and table references
PROJECT_ID = "mimic-iv-portfolio"
DATASET = "nonicu_raas"

TABLE_NONICU = f"{PROJECT_ID}.{DATASET}.nonicu_admissions"   # Created in 02_exclude_icu_admissions.sql
TABLE_EXPO = f"{PROJECT_ID}.{DATASET}.exposure_raas_early"   # Created in 03_define_exposure_raas_early.sql

# 2. Obtain Application Default Credentials (ADC) and create the BigQuery client
creds, adc_project = default()
client = bigquery.Client(project=PROJECT_ID, credentials=creds)

print("Connected to:", PROJECT_ID, "| ADC default:", adc_project)


Connected to: mimic-iv-portfolio | ADC default: mimic-iv-portfolio


### 5.2 Dataset Overview and Descriptive Checks

In [2]:
# Helper for read-only SELECT queries → DataFrame
def query_to_df(query: str) -> pd.DataFrame:
    """
    Run a SELECT query in BigQuery and return a pandas DataFrame.
    """
    job = client.query(query)
    return job.to_dataframe(create_bqstorage_client=False)

### 5.3 Loading Intermediate BigQuery Tables into pandas DataFrames

In [None]:
q_nonicu = f"SELECT * FROM {TABLE_NONICU}"
q_expo = f"SELECT * FROM {TABLE_EXPO}"

df_nonicu = query_to_df(q_nonicu)
df_expo = query_to_df(q_expo)

### 5.4 Schema and Data Type Inspection of the Non-ICU Admissions Table

In [None]:
df_nonicu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460786 entries, 0 to 460785
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   subject_id            460786 non-null  Int64         
 1   hadm_id               460786 non-null  Int64         
 2   admittime             460786 non-null  datetime64[us]
 3   dischtime             460786 non-null  datetime64[us]
 4   deathtime             2324 non-null    datetime64[us]
 5   hospital_expire_flag  460786 non-null  Int64         
 6   admission_type        460786 non-null  object        
 7   admission_location    460785 non-null  object        
 8   discharge_location    311810 non-null  object        
 9   insurance             452862 non-null  object        
 10  language              460377 non-null  object        
 11  marital_status        454118 non-null  object        
 12  race                  460786 non-null  object        
 13 

### 5.5 Schema and Data Type Inspection of the Early RAAS Exposure Table

In [None]:
df_expo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460786 entries, 0 to 460785
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   subject_id       460786 non-null  Int64
 1   hadm_id          460786 non-null  Int64
 2   acei_early       460786 non-null  Int64
 3   arb_early        460786 non-null  Int64
 4   raas_both_early  460786 non-null  Int64
 5   raas_any_early   460786 non-null  Int64
dtypes: Int64(6)
memory usage: 23.7 MB
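
Both tables report the same number of rows, but equal row counts alone do not guarantee that they describe the same admissions. A minimal sketch of a key-alignment check is shown below. The helper `compare_keys` and the toy DataFrames are hypothetical; in the notebook the same call could be applied to `df_nonicu` and `df_expo`.

```python
import pandas as pd

def compare_keys(left, right, key="hadm_id"):
    """Compare the sets of key values in two tables (illustrative helper)."""
    left_keys, right_keys = set(left[key]), set(right[key])
    return {
        "same_row_count": len(left) == len(right),
        "only_in_left": sorted(left_keys - right_keys),
        "only_in_right": sorted(right_keys - left_keys),
    }

# Toy tables with identical row counts but one mismatched admission (assumption)
toy_cohort = pd.DataFrame({"hadm_id": [10, 11, 12]})
toy_expo = pd.DataFrame({"hadm_id": [10, 11, 99]})

alignment = compare_keys(toy_cohort, toy_expo)
print(alignment)
```

Two tables are aligned in this sense when both `only_in_left` and `only_in_right` are empty.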
