## GSK PAH index and lookback dates

Index and lookback dates are extracted from the merged dataset (APC, A&E and OP).

In [190]:
import pandas as pd
import numpy as np
data = pd.read_csv("merged_mock.csv", low_memory=False, index_col=0)

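# define date variables - make sure pandas understands them as dates
data.insert(0, "ADMINDATE", pd.to_datetime(data.ADMIDATE))
# drop the original column; columns= is needed because drop() targets row labels by default
data.drop(columns="ADMIDATE", inplace=True)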
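A minimal sketch of the same date conversion on toy data, in case the real extract contains malformed dates. The frame `raw` and its values are made up for illustration; `errors="coerce"` is an assumption about how bad values should be handled (they become `NaT`), not something the brief specifies.

```python
import pandas as pd

# hypothetical toy frame standing in for the merged data
raw = pd.DataFrame({"ADMIDATE": ["2010-06-02", "not a date", None], "X": [1, 2, 3]})

# parse the string column into a new datetime column, then drop the original
parsed = raw.assign(
    ADMINDATE=pd.to_datetime(raw["ADMIDATE"], errors="coerce")
).drop(columns="ADMIDATE")
```

Rows that cannot be parsed end up as `NaT`, which is one way patients can later appear without a usable date.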

The mock data doesn't have a PROCODE field, and Harvey hasn't provided a list of PROCODEs and MAINSPEFs yet, so we:
- Generate fake PROCODEs and add them to the data.
- Define the lists of PROCODEs and MAINSPEFs that we want to keep.

In [191]:
n, p = data.shape
# cycle through the codes 0-9 so that every row gets a fake provider code
random_procode = np.tile(np.arange(0, 10), int(np.ceil(n / 10)))[:n]
data.insert(0, "PROCODE", random_procode)
procode_to_keep = np.array([0, 3, 5, 8])
mainspef_to_keep = np.array(["320", "300", "110", "800", "420"])

## GSK PAH: Define index date of positives

__Brief__

For PAH positive cohort, this is the date that a patient visits the specialty centre which is TCD by the PROCODE and MAINSPEF. Code should be prepared to input the specialties and providers via a list.

__Steps taken__

- Find events that match both a PROCODE and a MAINSPEF of interest
- Turn ADMIDATE into a date column
- Find the first matching event by date for each patient

In [202]:
# find patients that went to the right provider and right specialty
pos_index_df = data.loc[data.PROCODE.isin(procode_to_keep) & data.MAINSPEF.isin(mainspef_to_keep)]
pos_index_df = pos_index_df[["ADMINDATE", "ENCRYPTED_HESID"]]

# sort by date, group by patients
pos_index_df = pos_index_df.sort_values(by="ADMINDATE")
pos_index_dates = pos_index_df.groupby("ENCRYPTED_HESID").first()
pos_index_dates

ENCRYPTED_HESID,ADMINDATE
0017E9B890C74E54AF5B93AB663D2F65,2010-06-02
006EAF56DD6A4A04950B5E3787FA392B,2012-03-11
014F5A15EB2E4393B9DE138B35C997B5,NaT
0179E0F45D5D41DBA73593FDF31B4768,NaT
01CDC87EEC6740A1843010FDE7DBBA96,2009-12-06
024D2F127EB94E81A6A3A79D6BD6D89C,2015-06-10
044E3A98BC6D4E63B74B0FD25F751629,2008-12-22
046B06A93C5C459597D9BF07AA033599,2017-01-18
049BC400721B49E2BC870BF33333A120,2006-08-08
04B6C21B666F46E5828BE2F7AEBB6A1E,2012-10-02
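
The brief asks for providers and specialties to be passed in as lists. A minimal sketch of how the positive index date could be wrapped in a function, assuming the column names used in this notebook. The function name `first_matching_date` and the toy data are hypothetical, and `groupby(...).min()` is swapped in for the sort-then-`first()` approach used above.

```python
import pandas as pd

def first_matching_date(df, procodes, mainspefs,
                        date_col="ADMINDATE", id_col="ENCRYPTED_HESID"):
    """Earliest date per patient among events at the given providers and specialties."""
    mask = df["PROCODE"].isin(procodes) & df["MAINSPEF"].isin(mainspefs)
    # min() skips NaT, so a patient is NaT only if all matching events lack a date
    return df.loc[mask].groupby(id_col)[date_col].min()

# hypothetical toy events
toy = pd.DataFrame({
    "ENCRYPTED_HESID": ["A", "A", "B", "B", "C"],
    "PROCODE": [0, 3, 1, 5, 2],
    "MAINSPEF": ["320", "300", "320", "999", "110"],
    "ADMINDATE": pd.to_datetime(
        ["2012-05-01", "2010-01-15", "2011-03-03", "2009-09-09", "2013-07-07"]),
})

toy_index_dates = first_matching_date(
    toy, procodes=[0, 3, 5, 8], mainspefs=["320", "300", "110", "800", "420"])
```

In this toy example only patient `A` has events that satisfy both conditions, so `B` and `C` do not appear in the result at all.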


## GSK PAH index date of negatives

__Brief__

For PAH negative, the index date should be the latest date of the set of relevant ICD codes or MAINSPEF visit. Patients who do not have a valid index date should be excluded. Code should be flexible so that it can take these in as a list.

- Define the specialties we're interested in
- Define the ICD codes we need to match for negatives
- Build a regex that matches strings that start with any of the ICD codes
- Define the columns with diagnoses, i.e. ICD codes

In [172]:
# specialties that we're interested in 
mainspef_to_keep = np.array(["320", "300", "110", "800", "420"])

# ICD codes we're interested in 
ICDs = [
    "I420",
    "E039",
    "I050",
    "I342",
    "Q232",
    "M351",
    "G473",
    "M32",
    "K766",
    "I370",
    "L940",
    "L941",
    "M43",
    "I20",
    "I21",
    "I22",
    "I23",
    "I24",
    "I25",
    "I26",
    "I27",
    "I28",
    "I50",
    "J40",
    "J41",
    "J42",
    "J43",
    "J44",
    "J45",
    "J47",
    "J849"
]

# regexp matching if any of the above ICDs are at the beginning of an ICD code
ICD_regex = "^"+"|^".join(ICDs)

# list of diagnoses columns
diag_cols = ["DIAG_0" + str(i+1) for i in range(9)] + ["DIAG_" + str(i+10) for i in range(11)]
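
To illustrate how the prefix matching behaves, here is a small sketch on made-up codes. The names `icd_prefixes`, `icd_pattern` and `codes` are hypothetical; `re.escape` and `na=False` are defensive additions assumed here, not part of the original cell.

```python
import re
import pandas as pd

icd_prefixes = ["I27", "M32", "J849"]  # illustrative subset of the list above

# escape each code and join them into a single alternation
icd_pattern = "(?:" + "|".join(re.escape(code) for code in icd_prefixes) + ")"

codes = pd.Series(["I270", "J84", "M321", None, "XI27"])

# str.match only matches at the start of each string; na=False maps missing codes to False
is_match = codes.str.match(icd_pattern, na=False)
```

Because `str.match` anchors at the start of the string, `"XI27"` is rejected even though it contains `I27`, and `"J84"` is rejected because it is shorter than the prefix `J849`.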

__Steps taken__

- Check if any of the diagnoses columns start with any of the ICD codes
- For each event get a Boolean value
- Subset the data file to only retain events that either had one of the ICD codes or one of the specialties
- Sort by date in __descending__ order
- Group by patient 

In [210]:
# for each diag column run a regexp that tests all ICD codes
list_of_diag_columns = [data[c].str.match(ICD_regex, na=False) for c in diag_cols]

# merge the diag cols back into a df
diag_df = pd.concat(list_of_diag_columns, axis=1, join='inner')

# merge the resulting diag df with the boolean series of mainspec
neg_index_df = pd.concat([diag_df, data.MAINSPEF.isin(mainspef_to_keep)], axis=1)

# check for each row/event if we have a single true
neg_index_df = neg_index_df.sum(axis=1)
neg_index_df = data.loc[neg_index_df > 0, ["ADMINDATE", "ENCRYPTED_HESID"]]

# sort by date, group by patients, find last date
neg_index_df = neg_index_df.sort_values(by="ADMINDATE", ascending=False)
neg_index_dates = neg_index_df.groupby("ENCRYPTED_HESID").first()

# exclude patients who are missing dates
neg_index_no_date_found_ix = pd.isnull(neg_index_dates).values.ravel()

# remaining ones
neg_index_dates.loc[~neg_index_no_date_found_ix]

ENCRYPTED_HESID,ADMINDATE
0017E9B890C74E54AF5B93AB663D2F65,2017-06-14
006EAF56DD6A4A04950B5E3787FA392B,2017-03-25
0179E0F45D5D41DBA73593FDF31B4768,2014-08-23
01CDC87EEC6740A1843010FDE7DBBA96,2017-03-20
024D2F127EB94E81A6A3A79D6BD6D89C,2016-08-03
0315BA6844B246588629589AE77EEBCC,2017-03-01
044E3A98BC6D4E63B74B0FD25F751629,2017-05-17
046B06A93C5C459597D9BF07AA033599,2017-02-19
049BC400721B49E2BC870BF33333A120,2017-03-18
04B6C21B666F46E5828BE2F7AEBB6A1E,2017-05-11


Number of negative patients discarded:

In [211]:
np.sum(neg_index_no_date_found_ix)

108
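
The negative index-date steps can also be sketched end to end on toy data. Everything prefixed `toy` is made up for illustration, and `any(axis=1)` with `groupby(...).max()` is an alternative formulation assumed here, not the approach used in the cells above.

```python
import pandas as pd

# hypothetical toy events with two diagnosis columns
toy = pd.DataFrame({
    "ENCRYPTED_HESID": ["A", "A", "B", "C", "C"],
    "DIAG_01": ["I270", "Z000", "Z000", "J440", None],
    "DIAG_02": [None, "M321", None, None, None],
    "MAINSPEF": ["100", "100", "100", "100", "320"],
    "ADMINDATE": pd.to_datetime(
        ["2015-01-01", "2016-02-02", "2014-03-03", None, None]),
})
toy_diag_cols = ["DIAG_01", "DIAG_02"]
toy_regex = "^I27|^M32|^J44"
toy_specialties = ["320"]

# True for events where any diagnosis column starts with one of the ICD prefixes
diag_hit = pd.concat(
    [toy[c].str.match(toy_regex, na=False) for c in toy_diag_cols], axis=1
).any(axis=1)

# keep events with a relevant diagnosis or a relevant specialty
keep = diag_hit | toy["MAINSPEF"].isin(toy_specialties)

# latest date per patient; max() skips NaT, so NaT means no dated relevant event
latest = toy.loc[keep].groupby("ENCRYPTED_HESID")["ADMINDATE"].max()
n_discarded = int(latest.isna().sum())
latest = latest.dropna()
```

Here patient `B` has no relevant event and never enters the grouping, while patient `C` has relevant events but no dates and is the one counted as discarded.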

## GSK PAH - lookback date for positives

__Brief__

For PAH positive, the lookback date should be the earliest date of a list of given ICD codes or a visit to a set of MAINSPEF. If these events are not observed, it should be the date of the earliest entry in any table (APC, A&E or OP). No positive patients should be excluded at this point.

__Steps taken__
- This is very similar to the negative index date, but we take the earliest date instead of the latest.
- If we find a NaT, we look for the patient's earliest date across all three tables

In [205]:
# merge the diag df with the boolean series of mainspec
pos_lookback_df = pd.concat([diag_df, data.MAINSPEF.isin(mainspef_to_keep)], axis=1)

# check for each row/event if we have a single true
pos_lookback_df = pos_lookback_df.sum(axis=1)
pos_lookback_df = data.loc[pos_lookback_df > 0, ["ADMINDATE", "ENCRYPTED_HESID"]]

# sort by date, group by patients, find first date
pos_lookback_df = pos_lookback_df.sort_values(by="ADMINDATE")
pos_lookback_dates = pos_lookback_df.groupby("ENCRYPTED_HESID").first()
pos_lookback_dates

ENCRYPTED_HESID,ADMINDATE
0017E9B890C74E54AF5B93AB663D2F65,2010-06-02
005891E201304E68859CFC1C390AEC3F,NaT
006EAF56DD6A4A04950B5E3787FA392B,2012-03-11
014F5A15EB2E4393B9DE138B35C997B5,NaT
0179E0F45D5D41DBA73593FDF31B4768,2014-08-23
01CDC87EEC6740A1843010FDE7DBBA96,2007-11-11
024D2F127EB94E81A6A3A79D6BD6D89C,2015-05-17
0315BA6844B246588629589AE77EEBCC,2016-03-31
044E3A98BC6D4E63B74B0FD25F751629,2008-12-22
046B06A93C5C459597D9BF07AA033599,2016-09-22


Find dates for missing ones

__NOTE!!!__ We still have missing dates, what to do here?



In [229]:
# identify patients who are missing a lookback date
pos_index_no_date_found_ix = pd.isnull(pos_lookback_dates).values.ravel()
pos_index_no_date_found_patient_id = pos_lookback_dates.index[pos_index_no_date_found_ix]

# take patients with missing dates, get all of their events and find the first
no_date_found_mask = data.ENCRYPTED_HESID.isin(pos_index_no_date_found_patient_id)
pos_lookback_dates2 = data.loc[no_date_found_mask, ["ADMINDATE", "ENCRYPTED_HESID"]]

# sort by date, group by patients
pos_lookback_dates2 = pos_lookback_dates2.sort_values(by="ADMINDATE")
pos_lookback_dates2 = pos_lookback_dates2.groupby("ENCRYPTED_HESID").first()

Update `pos_lookback_dates` where it had missing values. 

In [236]:
pos_lookback_dates.loc[pos_index_no_date_found_ix, "ADMINDATE"] = pos_lookback_dates2.values.ravel()
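
The assignment above relies on both frames being sorted by patient ID in the same order. A minimal sketch of an index-aligned alternative, using made-up frames `lookback` and `fallback` in place of the real ones:

```python
import pandas as pd

# hypothetical lookback dates, two of them missing
lookback = pd.DataFrame(
    {"ADMINDATE": pd.to_datetime(["2010-06-02", None, None])},
    index=pd.Index(["A", "B", "C"], name="ENCRYPTED_HESID"),
)
# hypothetical fallback: earliest date across all events for the missing patients
fallback = pd.DataFrame(
    {"ADMINDATE": pd.to_datetime(["2016-08-23", None])},
    index=pd.Index(["B", "C"], name="ENCRYPTED_HESID"),
)

# fillna with a Series aligns on the index, so row order does not matter
filled = lookback["ADMINDATE"].fillna(fallback["ADMINDATE"])
```

Patient `C` stays `NaT` because none of its events carries a date. That may be what lies behind the dates that are still missing after the update, though this would need checking against the real data.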

In [237]:
pos_lookback_dates

ENCRYPTED_HESID,ADMINDATE
0017E9B890C74E54AF5B93AB663D2F65,2010-06-02
005891E201304E68859CFC1C390AEC3F,2016-08-23
006EAF56DD6A4A04950B5E3787FA392B,2012-03-11
014F5A15EB2E4393B9DE138B35C997B5,NaT
0179E0F45D5D41DBA73593FDF31B4768,2014-08-23
01CDC87EEC6740A1843010FDE7DBBA96,2007-11-11
024D2F127EB94E81A6A3A79D6BD6D89C,2015-05-17
0315BA6844B246588629589AE77EEBCC,2016-03-31
044E3A98BC6D4E63B74B0FD25F751629,2008-12-22
046B06A93C5C459597D9BF07AA033599,2016-09-22


## GSK PAH - lookback date for negatives

__Brief__

For PAH negative, the lookback date should be the earliest date of the set of relevant ICD codes or MAINSPEF visit. Code should be flexible so that it can take these in as a list. Negative patients who do not have a valid lookback date should be excluded.

__Steps taken__

- Check if any of the diagnoses columns start with any of the ICD codes
- For each event get a Boolean value
- Subset the data file to only retain events that either had one of the ICD codes or one of the specialties
- Sort by date in __ascending__ order
- Group by patient 

In [239]:
# merge the resulting diag df with the boolean series of mainspec
neg_lookback_df = pd.concat([diag_df, data.MAINSPEF.isin(mainspef_to_keep)], axis=1)

# check for each row/event if we have a single true
neg_lookback_df = neg_lookback_df.sum(axis=1)
neg_lookback_df = data.loc[neg_lookback_df > 0, ["ADMINDATE", "ENCRYPTED_HESID"]]

# sort by date, group by patients, find first date
neg_lookback_df = neg_lookback_df.sort_values(by="ADMINDATE")
# use a separate name so the negative index dates computed earlier are not overwritten
neg_lookback_dates = neg_lookback_df.groupby("ENCRYPTED_HESID").first()
neg_lookback_dates

ENCRYPTED_HESID,ADMINDATE
0017E9B890C74E54AF5B93AB663D2F65,2010-06-02
005891E201304E68859CFC1C390AEC3F,NaT
006EAF56DD6A4A04950B5E3787FA392B,2012-03-11
014F5A15EB2E4393B9DE138B35C997B5,NaT
0179E0F45D5D41DBA73593FDF31B4768,2014-08-23
01CDC87EEC6740A1843010FDE7DBBA96,2007-11-11
024D2F127EB94E81A6A3A79D6BD6D89C,2015-05-17
0315BA6844B246588629589AE77EEBCC,2016-03-31
044E3A98BC6D4E63B74B0FD25F751629,2008-12-22
046B06A93C5C459597D9BF07AA033599,2016-09-22


Remove patients with missing dates

In [240]:
# exclude patients who are missing dates
neg_lookback_no_date_found_ix = pd.isnull(neg_lookback_dates).values.ravel()

# remaining ones
neg_lookback_dates.loc[~neg_lookback_no_date_found_ix]

ENCRYPTED_HESID,ADMINDATE
0017E9B890C74E54AF5B93AB663D2F65,2010-06-02
006EAF56DD6A4A04950B5E3787FA392B,2012-03-11
0179E0F45D5D41DBA73593FDF31B4768,2014-08-23
01CDC87EEC6740A1843010FDE7DBBA96,2007-11-11
024D2F127EB94E81A6A3A79D6BD6D89C,2015-05-17
0315BA6844B246588629589AE77EEBCC,2016-03-31
044E3A98BC6D4E63B74B0FD25F751629,2008-12-22
046B06A93C5C459597D9BF07AA033599,2016-09-22
049BC400721B49E2BC870BF33333A120,2006-08-08
04B6C21B666F46E5828BE2F7AEBB6A1E,2008-09-30


In [241]:
np.sum(neg_lookback_no_date_found_ix)

108