**Get outcomes in (Time, Event) format.**

Colorectal cancer (CRC) outcomes are taken exclusively from the cancer registry up to its administrative censoring date, and from Hospital Episode Statistics (HES) thereafter. The administrative censoring dates of the registry and HES differ between England, Wales and Scotland, so censoring has to be handled per country. CRC is defined as an ICD-10 code beginning with C18, C19 or C20, or an ICD-9 code beginning with 153, 1540 or 1541.
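
As a minimal sketch of the code-prefix rule above, the toy table below uses made-up participants and codes (not UK Biobank data) and assumes the columns are named `ICD10` and `ICD9`, as in the long-format files loaded later. `str.startswith` accepts a tuple of prefixes, and missing codes are filled with an empty string so they never match.

```python
import pandas as pd

# Toy records; eids and codes are illustrative only
toy = pd.DataFrame({
    "eid": [1, 2, 3, 4],
    "ICD10": ["C184", "C509", None, "C20"],
    "ICD9": [None, None, "1540", None],
})

crc_icd10 = ("C18", "C19", "C20")
crc_icd9 = ("153", "1540", "1541")

# Missing codes become '' so that startswith returns False rather than NaN
is_crc = (
    toy["ICD10"].fillna("").str.startswith(crc_icd10)
    | toy["ICD9"].fillna("").str.startswith(crc_icd9)
)
crc_eids = toy.loc[is_crc, "eid"].tolist()
```

Here participant 2 (breast cancer code C509) is the only one not flagged.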

Date (of event or censoring) determined as earliest of:
- CRC diagnosis date (from registry or HES)
- Death date (from <a href='https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100093'>death register</a>)
- Date <a href='https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=191'>lost to follow up</a>
- Latest cancer follow-up date
  - Depends on the administrative censoring dates of HES and the registry in each country


In [1]:
import pandas as pd
import numpy as np

Censoring dates (from <a href='https://biobank.ndph.ox.ac.uk/showcase/exinfo.cgi?src=Data_providers_and_dates'>UKB dates of data availability</a>):

In [2]:
censoring_dates = {
    "Country": ["England", "Wales", "Scotland"],
    "Registry censoring": ["31/12/2020", "31/12/2016", "30/11/2021"],
    "HES censoring": ["31/10/2022", "31/05/2022", "31/08/2022"],
    "Death censoring": ["30/11/2022", "30/11/2022", "30/11/2022"]
}

censoring_dates = pd.DataFrame(censoring_dates)
for col in censoring_dates.columns:
    if col != "Country":
        censoring_dates[col] = pd.to_datetime(censoring_dates[col], dayfirst=True)

In [3]:
censoring_dates

Unnamed: 0,Country,Registry censoring,HES censoring,Death censoring
0,England,2020-12-31,2022-10-31,2022-11-30
1,Wales,2016-12-31,2022-05-31,2022-11-30
2,Scotland,2021-11-30,2022-08-31,2022-11-30
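
The outcome definition relies on HES follow-up extending beyond registry follow-up in every country. A quick check along these lines, rebuilding the relevant columns of the table so the snippet runs on its own (the names `censoring` and `HES-only days` are introduced here for illustration), could look like:

```python
import pandas as pd

# Same registry and HES censoring dates as in the table above
censoring = pd.DataFrame({
    "Country": ["England", "Wales", "Scotland"],
    "Registry censoring": ["31/12/2020", "31/12/2016", "30/11/2021"],
    "HES censoring": ["31/10/2022", "31/05/2022", "31/08/2022"],
})
for col in ["Registry censoring", "HES censoring"]:
    censoring[col] = pd.to_datetime(censoring[col], format="%d/%m/%Y")

# Length of the period covered by HES only, per country, in days
censoring["HES-only days"] = (
    censoring["HES censoring"] - censoring["Registry censoring"]
).dt.days

# True if the registry is censored before HES in every country
registry_before_hes = bool((censoring["HES-only days"] > 0).all())
```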


Create df with dates to be used for time-to-event/censoring computation for all participants

In [4]:
# Load ID, recruitment date (to compute time from recruitment) and assessment centre (to infer country and apply the appropriate censoring)
dates = pd.read_feather('../data/all_showcase_baseline.feather', columns=['eid', '53.0.0', '54.0.0']).rename(columns={'53.0.0': 'recruitment date'})

# Replace assessment centre with country
assessment_centre_country = {
    'England': [10003, 11001, 11002, 11006, 11007, 11008, 11009, 11010, 11011, 11012, 11013, 11014, 11016, 11017, 11018, 11020, 11021, 11024, 11025, 11026, 11027, 11028],
    'Wales': [11003, 11022, 11023],
    'Scotland': [11004, 11005]
}
replace_dict = {centre: country for country, centres in assessment_centre_country.items() for centre in centres}
dates['54.0.0'] = dates['54.0.0'].astype(int).replace(replace_dict).astype('category')
dates = dates.rename(columns={'54.0.0': 'Country'})

# Join country-specific censoring dates for data sources onto dates df
dates = pd.merge(dates, censoring_dates, on='Country', how='left')
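
One thing worth checking at this step is that every assessment centre code maps to a country: `replace` leaves unmapped codes untouched, so they would silently receive no censoring dates in the left merge. The sketch below uses a small subset of the mapping and a made-up unmapped code (99999) to show how `map` exposes such codes as missing values.

```python
import pandas as pd

# Subset of the centre-to-country mapping; 99999 is a hypothetical unmapped code
centre_to_country = {11001: "England", 11003: "Wales", 11004: "Scotland"}
centres = pd.Series([11001, 11003, 11004, 99999], name="54.0.0")

replaced = centres.replace(centre_to_country)  # unmapped code stays as 99999
mapped = centres.map(centre_to_country)        # unmapped code becomes NaN

# Codes that would not match any row of the censoring table
unmapped_codes = centres[mapped.isna()].unique().tolist()
```

An empty `unmapped_codes` list on the real data would confirm that the mapping is complete.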

Exclude those with pre-existing CRC

In [5]:
# Select prevalent CRCs from cancer registry
crc_icd10 = ('C18', 'C19', 'C20')
crc_icd9 = ('153', '1540', '1541')

cancers = pd.read_csv("../data/cancers_long.csv", parse_dates=['date_diagnosis'])
registry_crc = cancers.loc[(cancers['ICD10'].fillna('').str.startswith(crc_icd10)) | (cancers['ICD9'].fillna('').str.startswith(crc_icd9)), ['eid', 'date_diagnosis']]
registry_crc = pd.merge(registry_crc, dates, on='eid', how='left', validate='m:1')
prevalent_crc = registry_crc.loc[registry_crc['date_diagnosis'] <= registry_crc['recruitment date'], 'eid'].unique().tolist()

# Exclude
dates = dates[~dates['eid'].isin(prevalent_crc)]

CRC: Registry

In [6]:
# Select first incident CRCs from registry
registry_crc = registry_crc.sort_values(by='date_diagnosis').drop_duplicates(subset=['eid'], keep='first')
registry_crc_incident = registry_crc[registry_crc['date_diagnosis'] > registry_crc['recruitment date']]
# Merge onto dates df
dates = pd.merge(dates, registry_crc_incident[['eid', 'date_diagnosis']].rename(columns={'date_diagnosis': 'CRC (registry)'}), on='eid', how='left', validate='1:1')
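
The prevalent/incident split in the two cells above can be illustrated on a toy registry extract. The eids and dates are made up; only the column names follow the notebook.

```python
import pandas as pd

# Toy registry rows: participant 1 has two CRC records, one before recruitment
toy = pd.DataFrame({
    "eid": [1, 1, 2, 3],
    "date_diagnosis": pd.to_datetime(
        ["2005-03-01", "2012-06-15", "2011-01-10", "2015-09-30"]
    ),
    "recruitment date": pd.to_datetime(
        ["2008-01-01", "2008-01-01", "2009-05-05", "2010-02-02"]
    ),
})

# Prevalent: any CRC diagnosis on or before recruitment
prevalent = toy.loc[
    toy["date_diagnosis"] <= toy["recruitment date"], "eid"
].unique().tolist()

# Incident: earliest diagnosis per participant, excluding prevalent cases
first = toy.sort_values("date_diagnosis").drop_duplicates(subset="eid", keep="first")
incident = first[~first["eid"].isin(prevalent)].set_index("eid")["date_diagnosis"]
```

Participant 1 is excluded as prevalent even though a second record falls after recruitment; participants 2 and 3 contribute incident dates.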

CRC: HES

In [7]:
# Select incident CRCs from HES
hes = pd.read_feather('../data/hes_long.feather')
crc_icd10_condition = hes['ICD10'].fillna('').str.startswith(crc_icd10)
crc_icd9_condition = hes['ICD9'].fillna('').str.startswith(crc_icd9)
hes_crc = hes[(crc_icd10_condition | crc_icd9_condition)]

# Join registry censoring date from dates df and keep only HES records dated after registry censoring (varies by country)
hes_crc = pd.merge(hes_crc, dates[['eid', 'Registry censoring']], on='eid', how='inner')
hes_crc = hes_crc[hes_crc['date'] > hes_crc['Registry censoring']]
# Only keep first record of CRC
hes_crc = hes_crc.sort_values(by='date').drop_duplicates(subset='eid', keep='first')
# Merge onto dates df
dates = pd.merge(dates, hes_crc[['eid', 'date']].rename(columns={'date': 'CRC (HES)'}), on='eid', how='left', validate='1:1')
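
A small sketch of the country-specific filter, using made-up HES records and the England and Wales registry censoring dates from the table above. For brevity it joins on a `Country` column, whereas the notebook joins the censoring date per participant via `eid`.

```python
import pandas as pd

# Toy HES CRC records (hypothetical eids and dates)
toy_hes = pd.DataFrame({
    "eid": [1, 1, 2],
    "date": pd.to_datetime(["2019-05-01", "2021-03-01", "2017-06-01"]),
    "Country": ["England", "England", "Wales"],
})
registry_censoring = pd.DataFrame({
    "Country": ["England", "Wales"],
    "Registry censoring": pd.to_datetime(["2020-12-31", "2016-12-31"]),
})

merged = toy_hes.merge(registry_censoring, on="Country", how="left")

# Keep only records dated after the registry was censored in that country
after_registry = merged[merged["date"] > merged["Registry censoring"]]

# First qualifying HES record per participant
first_hes = (
    after_registry.sort_values("date")
    .drop_duplicates(subset="eid", keep="first")
    .set_index("eid")["date"]
)
```

The 2019 record for participant 1 is dropped because the English registry still covers that date, while a 2017 record in Wales is kept because the Welsh registry was censored at the end of 2016.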

Censored individuals: death (field 40000) and lost to follow-up (field 191)

In [8]:
censored = pd.read_feather('../data/all_showcase_baseline.feather', columns=['eid','40000.0.0', '191.0.0'])
censored = censored.rename(columns={'40000.0.0': 'Death', '191.0.0': 'Lost'})
censored = censored.dropna(how='all', subset=['Death', 'Lost'])
dates = pd.merge(dates, censored, on='eid', how='left', validate='1:1')

Define events (E)

In [9]:
# Check if any cancers recorded after death. None, all good :)
dates[(dates['CRC (registry)'] > dates['Death']) | (dates['CRC (HES)'] > dates['Death'])]

Unnamed: 0,eid,recruitment date,Country,Registry censoring,HES censoring,Death censoring,CRC (registry),CRC (HES),Death,Lost


In [10]:
# Define E = 0 by default; 1 if there is a date for cancer (from registry or HES)
dates['E'] = 0
dates.loc[(dates['CRC (registry)'].notna()) | (dates['CRC (HES)'].notna()), 'E'] = 1

Compute times (T), in days.
For participants with CRC (E=1), T is the time from recruitment to CRC diagnosis (earliest of the registry and HES event dates).
For all others (E=0), T is the time from recruitment to the earliest of:
- death
- lost to follow-up
- latest cancer follow-up linkage, which depends on:
    - registry administrative censoring date
    - HES administrative censoring date

In [11]:
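# Cancer follow-up ends at the later of the registry and HES censoring dates (HES supplies CRC events after registry censoring)
dates['Follow-up end'] = dates[['Registry censoring', 'HES censoring']].max(axis=1)
dates.loc[dates['E'] == 1, 'T'] = (dates[['CRC (registry)', 'CRC (HES)']].min(axis=1) - dates['recruitment date']).dt.days
dates.loc[dates['E'] == 0, 'T'] = (dates[['Death', 'Lost', 'Follow-up end']].min(axis=1) - dates['recruitment date']).dt.days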
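
A worked toy example may make the rule concrete. It is a sketch under the assumption that "latest cancer follow-up" means the later of the registry and HES censoring dates. The participants and dates are made up, and the intermediate names (`event_date`, `followup_end`, `censor_date`) are introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "eid": [1, 2, 3],
    "recruitment date": pd.to_datetime(["2010-01-01"] * 3),
    "CRC (registry)": pd.to_datetime(["2012-01-01", None, None]),
    "CRC (HES)": pd.to_datetime([None, None, None]),
    "Death": pd.to_datetime([None, "2015-01-01", None]),
    "Lost": pd.to_datetime([None, None, None]),
    "Registry censoring": pd.to_datetime(["2020-12-31"] * 3),
    "HES censoring": pd.to_datetime(["2022-10-31"] * 3),
})

# Event date: earliest CRC diagnosis from either source (NaT if none)
event_date = toy[["CRC (registry)", "CRC (HES)"]].min(axis=1)
toy["E"] = event_date.notna().astype(int)

# Assumption: cancer follow-up ends at the later of the two censoring dates
followup_end = toy[["Registry censoring", "HES censoring"]].max(axis=1)

# Censoring date: earliest of death, loss to follow-up and end of follow-up
censor_date = pd.concat([toy["Death"], toy["Lost"], followup_end], axis=1).min(axis=1)

# T in days: to the event if there is one, otherwise to censoring
end_date = event_date.fillna(censor_date)
toy["T"] = (end_date - toy["recruitment date"]).dt.days
```

Participant 1 has an event two years after recruitment, participant 2 is censored at death, and participant 3 is followed to the HES censoring date.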

Save in T,E format

In [12]:
dates[['eid', 'T', 'E']].to_feather('../data/surv_outcomes_crc.feather')