# Common Features in Predictive Models for Graft Failure in Kidney Transplant

The analysis of common features is based on the article by Kaboré et al. [[1](#references)].
Their list is based on a systematic review of 39 papers.

## Results
![image](../img/predictors-figure-2.png "Commonly used Predictors")

They differentiate between predictors related to the recipient, the transplantation (surgery), and the donor.
In this visualization the authors do not indicate the timeframe of the features.
Their paper elaborates that some factors are recorded before or during the transplant, like information on the recipient, whereas others, such as eGFR/creatinine, are measured in the week after the transplant.
While such post-transplant measurements could be an indicator of graft function, we decided to include only data from before the transplantation in our models.

For further consideration, I transformed the figure into a table and added information on time of measurement and availability in MSDW.

| Group | Name | Number of mentions | Time relative to transplantation | Available in MSDW |
|-------|------|--------------------|----------------------------------|-------------------|
| Recipient | Age | 29 | - | + |
| Recipient | Gender | 16 | - | + |
| Recipient | Race | 13 | - | + |
| Recipient | Diabetes | 13 | - | + |
| Recipient | Smoking Status | 6 | - | + |
| Recipient | History of Angina | 3 | - | + |
| Recipient | Charlson Comorbidity Index | 4 | - | ? |
| Recipient | Duration of dialysis | 9 | - | ? |
| Recipient | Hypertension history | 9 | - | + |
| Recipient | BMI or Height/Weight | 14 | - | + |
| Recipient | Number of previous transplant | 5 | - | + |
| Recipient | Proteinuria | 5 | + | + |
| Recipient | Davies index | 2 | - | ? |
| Recipient | Time on waiting list | 4 | - | ? |
| Recipient | Creatinine / eGFR | 12 | + | + |
| Recipient | Serum albumin | 5 | ? | + |
| Recipient | Primary source of Payment | 5 | - | ? |
| Recipient | Cardiovascular disease | 11 | - | + |
| Recipient | Wright-Khan index | 2 | - | ? |
| Recipient | Cause of ESRD | 10 | - | ? |
| Recipient | Pre-transplant dialysis (yes/no) | 8 | - | + |
| Transplantation | Previous transplant | 3 | -  | + |
| Transplantation | Hepatitis C antibodies | 3 | - | + |
| Transplantation | Peak panel-reactive antibody | 6 | - | ? |
| Transplantation | Acute rejection | 7 | + | + |
| Transplantation | HLA-DR mismatch | 15 | - | ? |
| Transplantation | Immunosuppressive regimen | 5 | - | + |
| Transplantation | Graft cold ischemia time | 7 | - | ? |
| Transplantation | Year of transplantation | 6 | - | + (but effort) |
| Transplantation | Delayed graft function | 6 | + | + |
| Transplantation | CMV serology | 3 | - | ?/+ |
| Transplantation | Acute tubular necrosis | 3 | + | + |
| Donor | Gender | 8 | - | ? |
| Donor | BMI or Height/Weight | 7 | - | ? |
| Donor | Serum creatinine | 4 | - | ? |
| Donor | Age | 16 | - | ? |
| Donor | History of Hypertension | 5 | - | ? |
| Donor | Cause of death | 9 | - | ? |
| Donor | Type | 8 | - | ? |
| Donor | Race | 11 | - | ? |
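
As a minimal sketch of how the table can drive feature selection, the cell below copies a handful of rows into a pandas DataFrame and keeps only predictors that are measured before the transplant and are available in MSDW. The column names (`time`, `available`) are our own, and reading `-` as "before or at transplantation" and `+` as "after" is our interpretation of the table.

```python
import pandas as pd

# A few rows copied from the table above.
# Assumption: time "-" = before/at transplant, "+" = after transplant.
features = pd.DataFrame(
    [
        ("Recipient", "Age", 29, "-", "+"),
        ("Recipient", "Creatinine / eGFR", 12, "+", "+"),
        ("Recipient", "Cause of ESRD", 10, "-", "?"),
        ("Transplantation", "HLA-DR mismatch", 15, "-", "?"),
        ("Transplantation", "Acute rejection", 7, "+", "+"),
        ("Donor", "Age", 16, "-", "?"),
    ],
    columns=["group", "name", "mentions", "time", "available"],
)

# Keep predictors measured before the transplant that are available in MSDW.
candidates = features[(features.time == "-") & (features.available == "+")]
print(candidates.sort_values("mentions", ascending=False)[["group", "name", "mentions"]])
```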

## Feature Hunting

As we want accurate models, we will build on the aforementioned list of common features and try to find as many of them as possible in MSDW.
We will also estimate how reproducible each feature is in other patient cohorts.

In [None]:
%run 00_load_cohort.ipynb

### Recipient Features

| Name | Source in MSDW | Ease of Reproducibility | 
|------|----------------|-------------------------|
| Age | `age_in_days` | +++ |
| Gender | `gender` | +++ |
| Race | `race` | ++ |
| Diabetes | from condition store with ICD-9 and ICD-10 | ++ |
| Smoking Status | from metadata | - |
| History of Angina | self-derived ICD-9 and ICD-10 codes | ++ |
| Charlson Comorbidity Index | *not found* | --- |
| Duration of dialysis | self-derived ICD-9 and ICD-10, transformed into length (proxy) | + |
| Hypertension history | from condition store with ICD-9 and ICD-10 | ++ |
| BMI or Height/Weight | both possible, but lots of miscodings | + |
| Number of previous transplant | possible with precondition method (via CPT-4 code definition of cohort) | ++ |
| Proteinuria | self-derived ICD-9 and ICD-10 codes; however, the authors only report it as predictor after tx | ++ |
| Davies index | *not found* | --- |
| Time on waiting list | n/a | --- |
| Creatinine / eGFR | available from EPIC Lab; however, the authors only report it as predictor after tx | +++ |
| Serum albumin | available from EPIC Lab | +++ |
| Primary source of Payment | *not ingested* | - |
| Cardiovascular disease | *underspecified* but could be made available with list of codes | ? |
| Wright-Khan index | *not found* | --- |
| Cause of ESRD | n/a; however, the length of ESRD could be made available via ICD-9 / ICD-10 codes | --- (++) |
| Pre-transplant dialysis (yes/no) | see `Duration of dialysis` | +++ |

Some phenotypes will be derived from Elixhauser comorbidities as per [[2](#references)].

In [None]:
patient_data = cohort.merge_patient_data()

In [None]:
# Diabetes
from fiber.condition import Diagnosis
from fiber.storage.yaml import get_condition

diabetes_complicated = get_condition(
    condition_class=Diagnosis,
    name='diabetes complicated', 
    coding_schemes=['ICD-9', 'ICD-10'])

diabetes_complicated_df = cohort.has_precondition(
    name='diabetes complicated',
    condition=diabetes_complicated
)

diabetes_uncomplicated = get_condition(
    condition_class=Diagnosis,
    name='diabetes uncomplicated', 
    coding_schemes=['ICD-9', 'ICD-10'])

diabetes_uncomplicated_df = cohort.has_precondition(
    name='diabetes uncomplicated',
    condition=diabetes_uncomplicated
)

In [None]:
# Smoking Status
import math
from fiber.condition import TobaccoUse

tobacco_use_df = cohort.values_for(
    target=TobaccoUse(),
    before=cohort.condition,
)[[
    'medical_record_number', 'age_in_days', 'time_delta_in_days', 'value'
]].sort_values([
    'medical_record_number', 'age_in_days', 'time_delta_in_days'
], ascending=False)

smoking_status_df = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=tobacco_use_df,
    aggregation_functions={'value': lambda x: x.iloc[0]},
    name='smoking status'
)
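
The cell above sorts descending and then takes `x.iloc[0]` to pick the most recent record before the transplant. A minimal pandas-only sketch of the same idea, on hypothetical toy data, assuming that `time_delta_in_days` is negative for days before the transplant (the value labels are made up and need not match MSDW):

```python
import pandas as pd

# Hypothetical toy data; time_delta_in_days < 0 means "before the transplant".
tobacco = pd.DataFrame({
    "medical_record_number": ["A", "A", "B"],
    "time_delta_in_days": [-400, -30, -90],
    "value": ["Yes", "Quit", "Never"],
})

# Sort so that the record closest to the transplant comes first,
# then keep the first record per patient.
latest = (
    tobacco.sort_values("time_delta_in_days", ascending=False)
    .groupby("medical_record_number", as_index=False)
    .first()
)
print(latest)
```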

In [None]:
# History of Angina

# Diagnosis('I20%', 'ICD-10').patients_per('description', 'context_diagnosis_code')
# Diagnosis(['413.%', '411.1'], 'ICD-9').patients_per('description', 'context_diagnosis_code')

angina_df = cohort.has_precondition(
    name='angina pectoris',
    condition=Diagnosis('I20%', 'ICD-10') | Diagnosis(['413.%', '411.1'], 'ICD-9')
)

#### Duration of dialysis

This proves to be more difficult. We suggest the following steps.

* Identify relevant Procedure and Diagnosis codes (the codes used here may not be exhaustive!)
* Define length as one of the following proxies:
  * number of days with occurrence
  * days between first and last day of dialysis, probably equivalent to days from first day until tx
  * find periods of dialysis (might be interrupted due to recovery)
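
The three proxies can be sketched in plain pandas on hypothetical toy data for a single patient. The day offsets and the `max_gap` threshold are assumptions for illustration only; negative values denote days before the transplant.

```python
import pandas as pd

# Hypothetical dialysis code occurrences for one patient,
# in days relative to the transplant (negative = before).
days = pd.Series([-400, -399, -397, -200, -199, -10], name="time_delta_in_days")

# Proxy 1: number of distinct days with a dialysis code.
n_days = days.nunique()

# Proxy 2: days between the first and the last occurrence.
span = days.max() - days.min()

# Proxy 3: periods of dialysis; a gap of more than `max_gap` days
# between two consecutive occurrences starts a new period.
max_gap = 30  # assumed threshold
sorted_days = days.sort_values()
period_id = (sorted_days.diff() > max_gap).cumsum()
periods = sorted_days.groupby(period_id).agg(["min", "max"])
periods["length"] = periods["max"] - periods["min"] + 1

print(n_days, span)
print(periods)
```

The third proxy depends heavily on the chosen gap threshold, which would need clinical input.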

In [None]:
from fiber.condition import Procedure
# Procedure(
#     description='%dialysis%', 
#     code='%', 
#     context='ICD-9'
# ).patients_per('context_procedure_code', 'procedure_description')

# Diagnosis(
#     description='%dialysis%', 
#     code='%', 
#     context='ICD-%'
# ).patients_per('context_diagnosis_code', 'description', 'diagnosis_type').head(10)

After testing these codes, we decided to approximate the length of dialysis as the number of days between the earliest occurrence of one of the codes and the transplant.

In [None]:
dialysis_proc_df = cohort.values_for(
    target=Procedure('39.95', 'ICD-9'),
    before=cohort.condition,
)

dialysis_procedure_data = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=dialysis_proc_df,
    aggregation_functions={'time_delta_in_days': lambda x: abs(x.min())},
    name='dialysis procedure'
).rename(columns={
    'time_delta_in_days_dialysis_procedure_from_inf_days_before_to_0_days_after': 'length_of_dialysis'
})

dialysis_diag_df = cohort.values_for(
    target=Diagnosis('V45.11', 'ICD-9') | Diagnosis('Z99.2', 'ICD-10'),
    before=cohort.condition,
)

dialysis_diagnosis_data = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=dialysis_diag_df,
    aggregation_functions={'time_delta_in_days': lambda x: abs(x.min())},
    name='dialysis diagnosis'
).rename(columns={
    'time_delta_in_days_dialysis_diagnosis_from_inf_days_before_to_0_days_after': 'length_of_dialysis'
})

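import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
# Taking the max keeps the longest duration, i.e. the earliest occurrence across both sources.
dialysis_df = pd.concat(
    [dialysis_procedure_data, dialysis_diagnosis_data]
).groupby(
    ['medical_record_number', 'age_in_days']
).max().reset_index()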

In [None]:
# Hypertension
hypertension_complicated = get_condition(
    condition_class=Diagnosis,
    name='hypertension complicated', 
    coding_schemes=['ICD-9', 'ICD-10']
)

hypertension_complicated_df = cohort.has_precondition(
    name='hypertension complicated',
    condition=hypertension_complicated
)

hypertension_uncomplicated = get_condition(
    condition_class=Diagnosis,
    name='hypertension uncomplicated', 
    coding_schemes=['ICD-9', 'ICD-10']
)

hypertension_uncomplicated_df = cohort.has_precondition(
    name='hypertension uncomplicated',
    condition=hypertension_uncomplicated
)

In [None]:
# BMI / Height/Weight

from fiber.condition import Height, Weight

height_values = cohort.values_for(
    target=Height(),
    before=cohort.condition,
)[[
    'medical_record_number', 'age_in_days', 'time_delta_in_days', 'numeric_value'
]].sort_values([
    'medical_record_number', 'age_in_days', 'time_delta_in_days', 'numeric_value'
], ascending=False)

# account for miscodings
height_values = height_values[height_values.numeric_value > 10]

height_measurements = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=height_values,
    aggregation_functions={'numeric_value': lambda x: x.iloc[0]},
    name='height'
).rename(columns={
    'numeric_value_height_from_inf_days_before_to_0_days_after': 'height'
})

weight_values = cohort.values_for(
    target=Weight(),
    before=cohort.condition,
)[[
    'medical_record_number', 'age_in_days', 'time_delta_in_days', 'numeric_value'
]].sort_values([
    'medical_record_number', 'age_in_days', 'time_delta_in_days', 'numeric_value'
], ascending=False)

# account for miscodings
weight_values = weight_values[weight_values.numeric_value > 10]

weight_measurements = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=weight_values,
    aggregation_functions={'numeric_value': lambda x: x.iloc[0]},
    name='weight'
).rename(columns={
    'numeric_value_weight_from_inf_days_before_to_0_days_after': 'weight'
})

bmi_df = height_measurements[height_measurements.height > 0].merge(weight_measurements)


bmi_df['bmi'] = bmi_df.weight / ((bmi_df.height / 100) ** 2)
del bmi_df['height']
del bmi_df['weight']

# account for miscodings
bmi_df = bmi_df[bmi_df.bmi < 50]

height_df = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=height_values,
    aggregation_functions={'numeric_value': 'mean'},
    name='height'
).rename(columns={
    'numeric_value_height_from_inf_days_before_to_0_days_after': 'height'
})

weight_df = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=weight_values,
    aggregation_functions={'numeric_value': 'mean'},
    name='weight'
).rename(columns={
    'numeric_value_weight_from_inf_days_before_to_0_days_after': 'weight'
})
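
The BMI computation above divides the height by 100, so it assumes heights in centimetres and weights in kilograms. A self-contained sketch of the same filtering and computation on hypothetical measurements (the thresholds mirror the cell above; the example values are made up):

```python
import pandas as pd

# Hypothetical measurements; the units (cm, kg) are an assumption.
measurements = pd.DataFrame({
    "medical_record_number": ["A", "B", "C"],
    "height": [172.0, 1.8, 165.0],  # "B" looks like metres entered instead of cm
    "weight": [70.0, 80.0, 650.0],  # "C" looks like a misplaced decimal point
})

# Drop implausible raw values first.
plausible = measurements[
    (measurements.height > 10) & (measurements.weight > 10)
].copy()

# BMI = weight [kg] / (height [m])^2
plausible["bmi"] = plausible.weight / (plausible.height / 100) ** 2

# Drop implausible BMI values.
plausible = plausible[plausible.bmi < 50]
print(plausible)
```

Note that a hard cutoff of 50 also removes genuinely severely obese patients, so the threshold is a trade-off between miscodings and real outliers.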

In [None]:
# Previous transplant
previous_transplant_df = cohort.has_precondition(
    name='previous transplant', 
    condition=cohort.condition, 
    time_windows=[[-math.inf, -1]]
)

In [None]:
# Proteinuria
# Diagnosis('R80%', 'ICD-10').patients_per('description', 'context_diagnosis_code', 'diagnosis_type')
# Diagnosis(code=['791.0', '593.6'], context='ICD-9').patients_per('description', 'context_diagnosis_code', 'diagnosis_type')

proteinuria_cond = Diagnosis('R80%', 'ICD-10') | Diagnosis(code=['791.0', '593.6'], context='ICD-9')

proteinuria_df = cohort.has_precondition(
    name='proteinuria',
    condition=proteinuria_cond, 
    time_windows=[[-math.inf, 0]]
)

In [None]:
# Lab values
# LabValue('CREATININE-SERUM').patients_per('test_name')
# LabValue('%ALBUMIN%').patients_per('test_name')

In [None]:
# End-stage renal disease
renal_disease_df = cohort.values_for(
    target=Diagnosis('585.6', 'ICD-9') | Diagnosis('N18.6', 'ICD-10'),
    before=cohort.condition,
)

renal_disease_data = cohort.aggregate_values_in(
    time_windows=[[-math.inf, 0]],
    df=renal_disease_df,
    aggregation_functions={'time_delta_in_days': lambda x: abs(x.min())},
    name='esrd diagnosis'
).rename(columns={
    'time_delta_in_days_esrd_diagnosis_from_inf_days_before_to_0_days_after': 'length_of_esrd'
})

### Transplantation Features

| Name | Source in MSDW | Ease of Reproducibility |
|------|----------------|-------------------------|
| Previous transplant | `see above` | ++  |
| Hepatitis C antibodies | self-derived ICD-9 and ICD-10 codes | ++ |
| Peak panel-reactive antibody | ? | --- |
| Acute rejection | `out of scope` | |
| HLA-DR mismatch | no donor information, so best effort: number of HLA tests | -- |
| Immunosuppressive regimen | should probably be generalized into Drugs, but the drug data are sparse and the codes hard to use | -- |
| Graft cold ischemia time | not available | |
| Year of transplantation | can be engineered (ethics?) | + |
| Delayed graft function | `out of scope` | |
| CMV serology | taking the disease as proof for serology; self-derived list of ICD-9 and ICD-10 codes | ++ |
| Acute tubular necrosis | self-derived list of ICD-9 and ICD-10 codes | ++ |

In [None]:
# Hepatitis C Antibodies
# EPIC Lab does not contain many antibody lab tests. Also the procedures do not contain information on outcome of antibody panel.
# LabValue('%Hepatitis%').patients_per(LabValue.description_column)
# Procedure(description='%Hepatitis C%antibody%').patients_per(Procedure.description_column, Procedure.code_column, Procedure.category_column)
# However, from: https://www.icd10data.com/ICD10CM/Index/H/Hepatitis, we can take diagnoses.

hepatitis_c_icd_10_cond = Diagnosis(['B17.1%', 'B19.2%', 'B18.2', 'Z22.52'], 'ICD-10')
hepatitis_c_icd_9_cond = Diagnosis(['070.54', '070.70', '070.51', '070.44', '070.71', 'V02.62', '070.41', '070.7'], 'ICD-9')

hepatitis_c_df = cohort.has_precondition(
    name='hepatitis c',
    condition=hepatitis_c_icd_10_cond | hepatitis_c_icd_9_cond
)

In [None]:
# (peak) panel-reactive antibodies

# Procedure(description='%reactive%').patients_per('procedure_description')

# LabValue(name='%panel%').patients_per('test_name')

In [None]:
# HLA DR test
from fiber.condition import LabValue
# LabValue('HLA%').patients_per('test_name')
# LabValue('HLA-DR%').patients_per('test_name')

hla_df = cohort.values_for(
    target=LabValue('HLA-DR%'),
    before=cohort.condition
)

In [None]:
# Immunosuppressive regimen
from fiber.condition import Drug

drug_occ_df = cohort.pivot_all_for(
    condition=Drug(),
    pivot_table_kwargs={
        'columns': ['code'],
        'aggfunc': {'code': ['count']},
    },
    threshold=0,
    window=[-365, -1]
)

drugs_df = cohort.merge_patient_data(drug_occ_df)

# however a lot of null values or low incidence. Or unusable codes ...
drugs_df.columns
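
To illustrate the sparsity problem, here is a pandas-only sketch of a per-patient count table and a simple prevalence filter. It does not use the `pivot_all_for` call from the cell above; the drug names and the `min_share` threshold are hypothetical.

```python
import pandas as pd

# Hypothetical drug records in the year before the transplant.
records = pd.DataFrame({
    "medical_record_number": ["A", "A", "A", "B", "C"],
    "code": ["tacrolimus", "tacrolimus", "prednisone", "tacrolimus", "rare_drug"],
})

# One row per patient, one column per drug code, cell = number of records.
counts = pd.crosstab(records.medical_record_number, records.code)

# Keep only codes that occur in at least `min_share` of the patients.
min_share = 0.5  # assumed threshold
prevalence = (counts > 0).mean()
dense = counts.loc[:, prevalence >= min_share]
print(dense)
```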

In [None]:
# graft cold ischemia time
# Diagnosis(description='%ischemia%').patients_per('description')

In [None]:
# living / deceased donor
# Procedure(description='%nephrectomy%', code='%', context='CPT-4', mrns=cohort.mrns(), data_columns=['context_name', 'context_procedure_code', 'procedure_description', 'medical_record_number']).get_data(included_mrns=cohort.mrns()).groupby(['context_name', 'context_procedure_code', 'procedure_description']).count()

In [None]:
# year of transplant
import pandas as pd

patient_data_incl_years = patient_data.copy()

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# astype("category", categories=...) is no longer supported by pandas;
# pd.Categorical yields the same month codes (0-11).
patient_data_incl_years['year_of_transplant'] = (
    patient_data_incl_years.date_of_birth.astype(str).str[0:4].astype(int) 
    + patient_data_incl_years.age_in_days / 365 
    + (pd.Categorical(patient_data_incl_years.month_of_birth, categories=months).codes + 1) / 12
)
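
A worked example of the approximation above on hypothetical patients; `year_of_birth` stands in for the first four characters of `date_of_birth`. The formula ignores leap years and counts the full birth month, so the result is a rough, slightly late estimate rather than an exact date.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical patients; age_in_days is the age at transplantation.
patients = pd.DataFrame({
    "year_of_birth": [1960, 1975],
    "month_of_birth": ["March", "December"],
    "age_in_days": [18250, 14600],
})

# Month name -> month number (1-12).
month_number = pd.Categorical(patients.month_of_birth, categories=months).codes + 1

# Birth year + age in years + fraction of the birth year that has passed.
patients["year_of_transplant"] = (
    patients.year_of_birth + patients.age_in_days / 365 + month_number / 12
)
print(patients)
```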

In [None]:
# CMV serology (here before, in paper probably after 1 year or so)
cmv_cond = Diagnosis(code='078.5', context='ICD-9') | Diagnosis(code=['B25%'], context='ICD-10')
cmv_df = cohort.has_precondition(
    name='cmv',
    condition=cmv_cond
)

In [None]:
# Acute tubular necrosis (here before, in paper probably after 1 year or so)
# ICD-9 584.5: acute kidney failure with lesion of tubular necrosis (078.5 is the CMV code)
atn_cond = Diagnosis(code='584.5', context='ICD-9') | Diagnosis(code='N17.0', context='ICD-10')
atn_df = cohort.has_precondition(
    name='acute tubular necrosis',
    condition=atn_cond
)

### Donor Information

We cannot extract any donor information.

In [None]:
df = cohort.has_onset(
    'infection',
    Diagnosis('T86.13', 'ICD-10'),
    time_windows=[[0, 365]]
)

In [None]:
df.infection_onset_from_0_days_after_to_365_days_after.sum()

## Other reported predictors for infection and rejection

- dialysis modality
- [[3]](#references)
    - systemic lupus erythematosus
    - cancer
    - (previous renal transplant)
    - history of anti-rejection therapy
    - basal serum albumin concentration < 3.5 mg/dl
    - dyslipidemia
    - (end-stage renal disease of unknown etiology)
    - (# haplotype matches)
    
- [[4]](#references)
    - vaccinations

In [None]:
# Vaccinations
from fiber.condition import Material
vacc_proc_df = cohort.pivot_all_for(
    condition=Procedure(description='%vaccin%', data_columns=['medical_record_number', 'age_in_days', 'procedure_description']),
    pivot_table_kwargs={
        'columns': ['description'],
        'aggfunc': {'description': ['count']},
    },
    threshold=0,
    window=[-math.inf, -1]
)

vacc_proc_df = cohort.merge_patient_data(vacc_proc_df)

vacc_material_df = cohort.pivot_all_for(
    condition=Material(description='%vaccin%', data_columns=['medical_record_number', 'age_in_days', Material.description_column]),
    pivot_table_kwargs={
        'columns': ['description'],
        'aggfunc': {'description': ['count']},
    },
    threshold=0,
    window=[-math.inf, -1]
)

vacc_material_df = cohort.merge_patient_data(vacc_material_df)

# References

1. Rémi Kaboré, Maria C. Haller, Jérôme Harambat, Georg Heinze, Karen Leffondré, Risk prediction models for graft failure in kidney transplantation: a systematic review, Nephrology Dialysis Transplantation, Volume 32, Issue suppl_2, 1 April 2017, Pages ii68–ii76, https://doi.org/10.1093/ndt/gfw405

2. Quan, Hude, et al. "Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data." Medical care (2005): 1130-1139.

3. Valdez-Ortiz, R., Sifuentes-Osornio, J., Morales-Buenrostro, L. E., Ayala-Palma, H., Dehesa-López, E., Alberú, J., & Correa-Rotter, R. (2011). Risk factors for infections requiring hospitalization in renal transplant recipients: a cohort study. International Journal of Infectious Diseases, 15(3), e188-e196.

4. Karuthu, S., & Blumberg, E. A. (2012). Common infections in kidney transplant recipients. Clinical Journal of the American Society of Nephrology, 7(12), 2058-2070.