Here is the **full set of inclusion and exclusion criteria**:

---

### **ICD Code Lists for ALLHAT Trial Selection Criteria**  

#### **1. Inclusion Criteria (Patients to Include)**  

##### **(1) Age Requirement**  
- **Patients must be ≥ 55 years old**  
  - Age has no ICD code, so it **must be filtered using the patient database**.  

##### **(2) Hypertension (Essential for Inclusion)**  

###### **ICD-9 Codes for Hypertension**  
- `401`, `4010`, `4011`, `4019` - Essential hypertension  
- `402`, `4020`, `4021`, `4029` - Hypertensive heart disease  
- `403`, `4030`, `4031`, `4039` - Hypertensive kidney disease  
- `404`, `4040`, `4041`, `4049` - Hypertensive heart and kidney disease  
- `405` - Secondary hypertension  

###### **ICD-10 Codes for Hypertension**  
- `I10` - Essential (primary) hypertension  
- `I11` - Hypertensive heart disease  
- `I12` - Hypertensive kidney disease  
- `I13` - Hypertensive heart and kidney disease  
- `I15` - Secondary hypertension  

##### **(3) At Least One Additional CHD (Coronary Heart Disease) Risk Factor**  

###### **ICD-9 Codes for CHD Risk Factors**  
- **Diabetes Mellitus (Type 2)**: `250`, `25000`, `25002`, `25010`, `25012`, `25020`, `25022`, `25030`, `25032`, `25040`, `25042`, `25050`, `25052`, `25060`, `25062`, `25070`, `25072`, `25080`, `25082`, `25090`, `25092`  
- **History of Myocardial Infarction (Heart Attack)**: `410`, `4100`, `4101`, `4102`, `4103`, `4104`, `4105`, `4106`, `4107`, `4108`, `4109`, `412`  
- **Left Ventricular Hypertrophy (LVH)**: `4293`  
- **Low HDL Cholesterol (approximated here by pure hypercholesterolemia)**: `2720`  
- **Other Evidence of Atherosclerosis**: `440`, `4400`, `4401`, `4402`, `4409`  
- **Current Cigarette Smoking**: `3051`  

###### **ICD-10 Codes for CHD Risk Factors**  
- **Diabetes Mellitus (Type 2)**: `E11`  
- **History of Myocardial Infarction (Heart Attack)**: `I21`, `I22`, `I25.2`  
- **Left Ventricular Hypertrophy (LVH)**: `I51.7`  
- **Low HDL Cholesterol (approximated here by pure hypercholesterolemia)**: `E78.0`  
- **Other Evidence of Atherosclerosis**: `I70`  
- **Current Cigarette Smoking**: `F17.2`, `Z72.0`  

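Criterion (3) asks for at least one risk factor, so it can help to know which named factor matched rather than only whether any code matched. The sketch below is one possible arrangement, not the notebook's method: `RISK_FACTORS` and `risk_factors_present` are hypothetical names, the table holds only a small subset of the lists above, and codes are compared by undotted prefix within the same ICD version. Prefix matching is broader than an explicit list (a `250` prefix would also match type 1 diabetes codes such as `25001`), so lists that enumerate subcodes on purpose should keep their full-length entries.

```python
# Risk factor -> (ICD version, undotted prefix) pairs; illustrative subset only
RISK_FACTORS = {
    "prior_mi": [("9", "410"), ("9", "412"), ("10", "I21"), ("10", "I22"), ("10", "I252")],
    "lvh": [("9", "4293"), ("10", "I517")],
    "smoking": [("9", "3051"), ("10", "F172"), ("10", "Z720")],
}


def risk_factors_present(diagnoses):
    """Return the sorted names of risk factors matched by (code, version) pairs."""
    found = set()
    for code, version in diagnoses:
        undotted = code.replace(".", "")  # 'I25.2' and 'I252' are treated alike
        for name, rules in RISK_FACTORS.items():
            if any(version == v and undotted.startswith(p) for v, p in rules):
                found.add(name)
    return sorted(found)


# Hypothetical patient: essential hypertension, acute MI, nicotine dependence
patient = [("4019", "9"), ("41071", "9"), ("F17210", "10")]
print(risk_factors_present(patient))  # ['prior_mi', 'smoking']
```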
---

#### **2. Exclusion Criteria (Patients to Exclude)**  

##### **(1) Severe Cardiovascular Disease**  

###### **ICD-9 Codes for Exclusion**  
- **Congestive Heart Failure Requiring Hospitalization**: `428`, `4280`, `4281`, `4282`, `4283`, `4284`, `4289`  
- **Recent Myocardial Infarction (Past 6 Months)**: `410`, `4100`, `4101`, `4102`, `4103`, `4104`, `4105`, `4106`, `4107`, `4108`, `4109`  
- **Stroke Within the Last 6 Months**: `434`, `436`  
- **Severe Kidney Disease (End-Stage Renal Disease, eGFR < 30)**: `5854`, `5855`, `5856`  

###### **ICD-10 Codes for Exclusion**  
- **Congestive Heart Failure Requiring Hospitalization**: `I50`  
- **Recent Myocardial Infarction (Past 6 Months)**: `I21`  
- **Stroke Within the Last 6 Months**: `I63`, `I64`  
- **Severe Kidney Disease (End-Stage Renal Disease, eGFR < 30)**: `N18.4`, `N18.5`, `N18.6`  

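The "past 6 months" qualifiers cannot be expressed by ICD codes alone, because a code carries no date. A minimal sketch of a recency window follows, assuming each diagnosis row can be given a date. The `event_date` column and `index_date` reference date are hypothetical and do not appear in the schema used later in this notebook.

```python
import pandas as pd

# Toy data; `event_date` is a hypothetical column (for example an admission date)
events = pd.DataFrame({
    "subject_id": ["1", "2", "3"],
    "icd_code": ["I214", "I214", "I639"],
    "event_date": pd.to_datetime(["2020-01-10", "2019-03-01", "2020-05-20"]),
})
index_date = pd.Timestamp("2020-06-01")  # hypothetical reference date

# Six calendar months before the reference date
window_start = index_date - pd.DateOffset(months=6)

# MI or stroke categories, matched on the three-character ICD-10 category
is_mi_or_stroke = events["icd_code"].str[:3].isin({"I21", "I63", "I64"})
is_recent = events["event_date"].between(window_start, index_date)

recent_excluded = set(events.loc[is_mi_or_stroke & is_recent, "subject_id"])
print(sorted(recent_excluded))  # ['1', '3']
```

Subject 2 has a qualifying code but falls outside the window, so only the recency test keeps that subject in the cohort.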
##### **(2) Contraindications to Study Drugs**  
- **These typically do not have ICD codes. Patients must be checked for allergies or intolerance to:**  
  - **Chlorthalidone**  
  - **Amlodipine**  
  - **Lisinopril**  

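Because drug contraindications have no ICD code, one way to shortlist patients for manual review is to search free-text notes for the study drugs. The sketch below assumes a note layout with a single `Allergies:` line, which is an assumption about the `text` column of the `discharge` table and should be verified against real notes. `flagged_drugs` is a hypothetical helper, and its output is a screening aid rather than a clinical determination.

```python
import re
import pandas as pd

STUDY_DRUGS = ("chlorthalidone", "amlodipine", "lisinopril")
# Assumed layout: one "Allergies:" line listing substances
ALLERGY_LINE = re.compile(r"^allergies:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def flagged_drugs(note_text):
    """Return the study drugs mentioned on any 'Allergies:' line of a note."""
    listed = " ".join(ALLERGY_LINE.findall(note_text)).lower()
    return [drug for drug in STUDY_DRUGS if drug in listed]


# Toy notes; the second mentions amlodipine only as a medication, not an allergy
notes = pd.DataFrame({
    "subject_id": ["1", "2"],
    "text": [
        "Name: ___\nAllergies: Lisinopril / Penicillins\nChief Complaint: cough",
        "Allergies: No Known Allergies\nMedications: amlodipine 5 mg",
    ],
})
notes["contraindicated"] = notes["text"].apply(flagged_drugs)
print(notes[["subject_id", "contraindicated"]])
```

Restricting the search to the allergy line avoids flagging patients who merely take one of the study drugs.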
##### **(3) Severe Kidney Disease (End-Stage Renal Disease, eGFR < 30)**  

- **ICD-9:** `5854`, `5855`, `5856`  
- **ICD-10:** `N18.4`, `N18.5`, `N18.6`  

---

### **Summary of Filtering Steps**
1. **Patients must be ≥ 55 years old.**  
2. **Patients must have hypertension (ICD-9/10).**  
3. **Patients must have at least one CHD risk factor (ICD-9/10).**  
4. **Patients with exclusion conditions (ICD-9/10) must be removed.**  
5. **Patients with contraindications to study drugs (non-ICD) must be evaluated manually.**  



In [1]:
import pandas as pd

In [2]:
df_schema = pd.read_csv('schema.csv')
df_schema

Unnamed: 0,table,schema
0,diagnoses_icd,"['subject_id', 'hadm_id', 'seq_num', 'icd_code..."
1,discharge,"['subject_id', 'hadm_id', 'charttime', 'text']"
2,drgcodes,"['subject_id', 'hadm_id', 'description']"
3,d_icd_diagnoses,"['icd_code', 'icd_version', 'long_title']"
4,d_icd_procedures,"['icd_code', 'icd_version', 'long_title']"
5,emar,"['subject_id', 'hadm_id', 'charttime', 'medica..."
6,hcpcsevents,"['subject_id', 'hadm_id', 'chartdate', 'short_..."
7,patients,"['subject_id', 'gender', 'anchor_age', 'anchor..."
8,pharmacy,"['subject_id', 'hadm_id', 'starttime', 'medica..."
9,prescriptions,"['subject_id', 'hadm_id', 'starttime', 'drug']"


In [3]:
# Get the column names for the patients table from df_schema
headers = df_schema[df_schema['table'] == 'patients']['schema'].values[0]
# The schema is stored as the string form of a list, so parse it back into names
headers = headers.strip("[]").replace("'", "").split(", ")
# Read the pipe-delimited patients CSV (no header row; every column is kept as a string)
df_patients = pd.read_csv('../CSV/patients.csv', sep="|", header=None, names=headers, dtype=str)
df_patients

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,insurance,language,marital_status,race,blood_pressure_systolic,blood_pressure_diastolic,bmi,height,weight,egfr
0,10000117,F,48,2174,Medicaid,English,DIVORCED,WHITE,108,74,18.90,64.00,110.00,\N
1,10000161,M,60,2163,Medicaid,English,SINGLE,WHITE,106,92,\N,\N,\N,\N
2,10000248,M,34,2192,Private,English,MARRIED,WHITE,\N,\N,25.50,68.00,168.00,\N
3,10000280,M,20,2151,Private,English,,OTHER,125,77,\N,\N,170.50,\N
4,10000560,F,53,2189,Private,English,MARRIED,WHITE,124,78,\N,\N,128.00,\N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128162,19985757,F,73,2166,Medicare,English,MARRIED,WHITE,137,82,24.50,61.50,132.00,\N
128163,19986183,F,64,2189,Medicare,English,WIDOWED,WHITE,\N,\N,23.40,68.00,153.88,\N
128164,19987983,M,20,2123,Private,English,SINGLE,OTHER,130,88,33.60,65.00,202.00,\N
128165,19990563,M,71,2180,Medicare,English,SINGLE,ASIAN,109,65,22.40,69.00,152.00,\N

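The output above shows missing values as the literal text `\N`, and because every column is read as a string, the `egfr` column cannot be compared with a number directly. A minimal sketch of one way to handle this follows, using toy rows in the same pipe-delimited layout. The four-column subset and the `egfr_below_30` column name are illustrative assumptions.

```python
import io
import pandas as pd

# Three toy rows, pipe-delimited and header-less, with a subset of the columns
raw = "10000117|F|48|\\N\n10000161|M|60|25.0\n10000248|M|34|72.5\n"
cols = ["subject_id", "gender", "anchor_age", "egfr"]

# Treat the literal \N marker as missing so numeric columns parse as numbers
df = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=cols,
                 dtype={"subject_id": str}, na_values=["\\N"])

# A missing eGFR stays NaN, and NaN < 30 evaluates to False
df["egfr_below_30"] = df["egfr"] < 30
print(df)
```

With this rule a patient with no recorded eGFR is not excluded. Whether missing values should instead be treated as unknown is a design decision the notebook does not state.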

In [4]:
df_patient_basic_info = pd.read_csv('../CSV/patients.csv', sep="|", header=None, dtype=str)
df_patient_basic_info = df_patient_basic_info.rename(columns={0: "subject_id", 1: "gender", 2:"anchor_age"})

In [5]:
df_patient_age = df_patient_basic_info[['subject_id', 'anchor_age']]
df_patient_age = df_patient_age.drop_duplicates(subset=['subject_id'])
df_patient_age

Unnamed: 0,subject_id,anchor_age
0,10000117,48
1,10000161,60
2,10000248,34
3,10000280,20
4,10000560,53
...,...,...
128162,19985757,73
128163,19986183,64
128164,19987983,20
128165,19990563,71


In [6]:
unique_subject_count = df_patient_age['subject_id'].nunique()
# Print the count
print("Number of unique subject_id:", unique_subject_count)

Number of unique subject_id: 128167


In [7]:
df_dia_icd = pd.read_csv('../CSV/diagnoses_icd.csv', sep="|", header=None, dtype=str)
df_dia_icd = df_dia_icd.rename(columns={0: "subject_id", 1: "hadm_id", 2:"seq_num", 3: "icd_code", 4:"icd_version"})
df_dia_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version
0,10000117,22927623,1,R1310,10
1,10000117,22927623,2,R0989,10
2,10000117,22927623,3,K31819,10
3,10000117,22927623,4,K219,10
4,10000117,22927623,5,K449,10
...,...,...,...,...,...
3409918,19999828,29734428,18,Z9049,10
3409919,19999828,29734428,19,Z87891,10
3409920,19999828,29734428,20,B9620,10
3409921,19999828,29734428,21,Z1611,10


In [8]:
# Tag each code with its ICD version, e.g. 'I10' -> 'I10_10' and '4019' -> '4019_9'
df_dia_icd['icd_code'] = df_dia_icd.apply(lambda row: f"{row['icd_code']}_10" if row['icd_version'] == '10' else f"{row['icd_code']}_9", axis=1)

# Group by subject_id and aggregate icd_code as a list
df_dia_icd = df_dia_icd.groupby('subject_id').agg({'icd_code': list}).reset_index()
df_dia_icd

Unnamed: 0,subject_id,icd_code
0,10000117,"[R1310_10, R0989_10, K31819_10, K219_10, K449_..."
1,10000161,"[D693_10, R519_10]"
2,10000248,"[9222_9, 920_9, E8854_9, E8495_9, 2860_9, 2859_9]"
3,10000280,[6820_9]
4,10000560,"[1890_9, V1582_9, V1201_9]"
...,...,...
128066,19999464,"[K51011_10, Q453_10, R51_10, R197_10, E559_10,..."
128067,19999625,"[486_9, 5849_9, 2760_9, 5070_9, 33182_9, 29410..."
128068,19999733,"[9953_9, E9479_9, E8490_9]"
128069,19999784,"[Z5111_10, B181_10, C8599_10, D72819_10, E876_..."


In [9]:
unique_subject_count = df_dia_icd['subject_id'].nunique()

# Print the count
print("Number of unique subject_id:", unique_subject_count)

Number of unique subject_id: 128071

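The patient table has 128167 unique subjects while the diagnoses table has 128071. If every subject with a diagnosis also appears in the patient table, 96 patients have no diagnosis rows and can never pass the ICD-based filters. A small sketch of how such subjects could be listed, using toy frames rather than the real tables:

```python
import pandas as pd

patients = pd.DataFrame({"subject_id": ["1", "2", "3", "4"]})
diagnoses = pd.DataFrame({
    "subject_id": ["1", "1", "3"],
    "icd_code": ["I10", "E119", "4019"],
})

# A left merge with indicator=True marks patients that found no match
merged = patients.merge(
    diagnoses[["subject_id"]].drop_duplicates(),
    on="subject_id", how="left", indicator=True,
)
no_diagnoses = merged.loc[merged["_merge"] == "left_only", "subject_id"].tolist()
print(no_diagnoses)  # ['2', '4']
```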

In [10]:
# Define ICD codes for filtering.
# Note: the checks below are exact string matches on the part of each code before the
# '_9' / '_10' suffix, and the suffix (ICD version) itself is not compared.

# ICD codes for hypertension
hypertension_icd_9 = {'401', '4010', '4011', '4019', '402', '4020', '4021', '4029', '403', '4030', '4031', '4039', '404', '4040', '4041', '4049', '405'}
hypertension_icd_10 = {'I10', 'I11', 'I12', 'I13', 'I15'}

# ICD codes for CHD risk factors
chd_risk_icd_9 = {'250', '25000', '25002', '25010', '25012', '25020', '25022', '25030', '25032', '25040', '25042', '25050', '25052', '25060', '25062',
                  '25070', '25072', '25080', '25082', '25090', '25092', '410', '4100', '4101', '4102', '4103', '4104', '4105', '4106', '4107', '4108', '4109',
                  '412', '4293', '2720', '440', '4400', '4401', '4402', '4409', '3051'}
chd_risk_icd_10 = {'E11', 'I21', 'I22', 'I25.2', 'I51.7', 'F17.2', 'Z72.0', 'E78.0', 'I70'}

# ICD codes for exclusion conditions
exclude_icd_9 = {'428', '4280', '4281', '4282', '4283', '4284', '4289', '410', '4100', '4101', '4102', '4103', '4104', '4105', '4106', '4107', '4108', '4109',
                 '434', '436', '5854', '5855', '5856'}
exclude_icd_10 = {'I50', 'I21', 'I63', 'I64', 'N18.4', 'N18.5', 'N18.6'}

# Function to check if a subject has at least one hypertension ICD
def has_hypertension(icd_list):
    return any(code.split('_')[0] in hypertension_icd_9 or code.split('_')[0] in hypertension_icd_10 for code in icd_list)

# Function to check if a subject has at least one CHD risk factor ICD
def has_chd_risk(icd_list):
    return any(code.split('_')[0] in chd_risk_icd_9 or code.split('_')[0] in chd_risk_icd_10 for code in icd_list)

# Function to check if a subject has at least one exclusion ICD
def has_exclusion(icd_list):
    return any(code.split('_')[0] in exclude_icd_9 or code.split('_')[0] in exclude_icd_10 for code in icd_list)

# Step 1: Select only patients who have at least one hypertension ICD
df_hypertension = df_dia_icd[df_dia_icd['icd_code'].apply(has_hypertension)]

# Step 2: From the remaining patients, select only those who have at least one CHD risk factor ICD
df_chd_risk = df_hypertension[df_hypertension['icd_code'].apply(has_chd_risk)]

# Step 3: Remove patients who have at least one exclusion ICD
df_dia_icd_filtered = df_chd_risk[~df_chd_risk['icd_code'].apply(has_exclusion)]

df_dia_icd_filtered

Unnamed: 0,subject_id,icd_code
5,10000635,"[R0789_10, R29810_10, R200_10, I10_10, E7800_1..."
7,10000764,"[8020_9, 41071_9, 5849_9, 2875_9, 7802_9, 7847..."
9,10001176,"[4829_9, 25000_9, 78060_9, 2761_9, 4019_9, 414..."
10,10001217,"[3240_9, 3484_9, 3485_9, 5180_9, 340_9, 04109_..."
30,10002769,"[45342_9, 70713_9, 45981_9, 4019_9, 2724_9, V1..."
...,...,...
128016,19995012,"[I6381_10, Z6841_10, G8192_10, R29707_10, E669..."
128027,19995790,"[41401_9, 4019_9, 2720_9, 25000_9, 44021_9, 59..."
128041,19996912,"[6185_9, 6186_9, 4019_9, 53081_9, 25000_9, 272..."
128042,19996968,"[5772_9, 5770_9, 28959_9, 5768_9, 30393_9, 401..."

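In the output above, subject 10000764 still carries `41071_9` and subject 19995012 still carries `I6381_10`, codes that fall under the `410` and `I63` exclusion categories. They remain because the sets hold only the shorter codes and membership is an exact match. The sketch below shows one possible alternative that matches by prefix within the same ICD version. `EXCLUDE_PREFIXES` and `has_prefix_match` are hypothetical names, the prefixes are written without dots, and the result of applying this to the real data is not shown here.

```python
import pandas as pd

# Prefixes per ICD version, dots removed; taken from the exclusion lists
EXCLUDE_PREFIXES = {
    "9": ("428", "410", "434", "436", "5854", "5855", "5856"),
    "10": ("I50", "I21", "I63", "I64", "N184", "N185", "N186"),
}


def has_prefix_match(icd_list, prefixes_by_version):
    """Check 'CODE_VERSION' strings against the prefixes of their own version."""
    for tagged in icd_list:
        code, _, version = tagged.rpartition("_")
        # An unknown version yields an empty tuple, which never matches
        if code.startswith(prefixes_by_version.get(version, ())):
            return True
    return False


toy = pd.DataFrame({
    "subject_id": ["a", "b", "c"],
    "icd_code": [["4019_9", "41071_9"], ["I10_10", "E7800_10"], ["I6381_10"]],
})
toy["excluded"] = toy["icd_code"].apply(has_prefix_match, args=(EXCLUDE_PREFIXES,))
print(toy[["subject_id", "excluded"]])
```

Comparing the version as well as the prefix keeps an ICD-9 code from being tested against ICD-10 lists, and the reverse.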

In [11]:
df_dia_icd_filtered_withage = df_dia_icd_filtered.merge(df_patient_age, on='subject_id', how='left')
df_dia_icd_filtered_withage['anchor_age'] = pd.to_numeric(df_dia_icd_filtered_withage['anchor_age'], errors='coerce')
df_filtered_final = df_dia_icd_filtered_withage[df_dia_icd_filtered_withage['anchor_age'] >= 55]
df_filtered_final

Unnamed: 0,subject_id,icd_code,anchor_age
0,10000635,"[R0789_10, R29810_10, R200_10, I10_10, E7800_1...",74
1,10000764,"[8020_9, 41071_9, 5849_9, 2875_9, 7802_9, 7847...",86
2,10001176,"[4829_9, 25000_9, 78060_9, 2761_9, 4019_9, 414...",64
3,10001217,"[3240_9, 3484_9, 3485_9, 5180_9, 340_9, 04109_...",55
4,10002769,"[45342_9, 70713_9, 45981_9, 4019_9, 2724_9, V1...",58
...,...,...,...
12139,19992305,"[43310_9, 5070_9, 2724_9, 4019_9, 25000_9, 244...",74
12140,19992425,"[2273_9, 431_9, 2761_9, 45342_9, 37751_9, V436...",70
12142,19995012,"[I6381_10, Z6841_10, G8192_10, R29707_10, E669...",64
12143,19995790,"[41401_9, 4019_9, 2720_9, 25000_9, 44021_9, 59...",66


In [12]:
df_filtered_gold = df_filtered_final.sample(n=1673, random_state=42)
df_filtered_gold = df_filtered_gold['subject_id']

In [13]:
df_filtered_gold.info()

<class 'pandas.core.series.Series'>
Index: 1673 entries, 2742 to 10247
Series name: subject_id
Non-Null Count  Dtype 
--------------  ----- 
1673 non-null   object
dtypes: object(1)
memory usage: 26.1+ KB


### **Extended Cohort Selection**

In [14]:
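# Attach age to every subject with diagnoses and convert it to a number
df_dia_icd = df_dia_icd.merge(df_patient_age, on='subject_id', how='left')
df_dia_icd['anchor_age'] = pd.to_numeric(df_dia_icd['anchor_age'], errors='coerce')
# Subjects aged 45 or older
df_dia_icd_extended = df_dia_icd[df_dia_icd['anchor_age'] >= 45]
# Note: the random sample is drawn from df_dia_icd (all ages), not from df_dia_icd_extended
df_extended = df_dia_icd.sample(n=10630, random_state=42)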

In [15]:
# Drop gold subjects from the random sample
df_extended_filtered = df_extended[~df_extended['subject_id'].isin(df_filtered_gold)]
# Eligible subjects that were not drawn into the gold sample
df_extended_gold = df_filtered_final[~df_filtered_final['subject_id'].isin(df_filtered_gold)]
# Keep only the subject IDs
df_extended_filtered = df_extended_filtered['subject_id']
df_extended_gold = df_extended_gold['subject_id']

In [16]:
# Concatenate the two Series of subject IDs
df_union = pd.concat([df_extended_filtered, df_extended_gold], ignore_index=True)

df_union


0        16113983
1        11809107
2        19897876
3        18142473
4        16453468
           ...   
17475    19990563
17476    19992305
17477    19992425
17478    19995012
17479    19996912
Name: subject_id, Length: 17480, dtype: object

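The union above concatenates a random sample with the eligible subjects that were not selected as gold. The same subject can appear in both inputs, and `pd.concat` keeps repeated values. Whether that happens in the real data is not shown, so the sketch below only illustrates on toy Series how duplicates could be counted and removed. The names `sampled` and `eligible_not_gold` are hypothetical.

```python
import pandas as pd

sampled = pd.Series(["1", "2", "3"], name="subject_id")
eligible_not_gold = pd.Series(["3", "4"], name="subject_id")

union = pd.concat([sampled, eligible_not_gold], ignore_index=True)
print(union.duplicated().sum())  # 1

# Keep the first occurrence of each subject and renumber the index
union_unique = union.drop_duplicates().reset_index(drop=True)
print(union_unique.tolist())  # ['1', '2', '3', '4']
```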
In [17]:
# Create a DataFrame for gold subjects with gold_flag = 1
df_gold = pd.DataFrame(df_filtered_gold, columns=['subject_id'])
df_gold['gold_flag'] = 1

# Create a DataFrame for extended subjects with gold_flag = 0
df_extended = pd.DataFrame(df_union, columns=['subject_id'])
df_extended['gold_flag'] = 0

# Concatenate both DataFrames
df_combined = pd.concat([df_gold, df_extended], ignore_index=True)

df_combined

Unnamed: 0,subject_id,gold_flag
0,12289464,1
1,11152219,1
2,12014559,1
3,10390531,1
4,14669875,1
...,...,...
19148,19990563,0
19149,19992305,0
19150,19992425,0
19151,19995012,0


In [18]:
df_combined.to_csv('../CSV/gold_patients.csv', index=False)