# PRMT-1953 Part two: Investigate the impact of data pipeline Duplicate EHR fix on analytics dataset

We re-ran the data pipeline with the Duplicate EHR fix. The output is in transfers-sample-5, together with a description of the inputs we used to generate the data.

We did a quick analysis of the status breakdown of transfers between September 2020 and February 2021. The total number of transfers is the same in both datasets: 1343234. The following compares the output from the old branch (transfers-duplicates-hypothesis), where the duplicate EHR fix is applied within the notebook, with the output of the current data pipeline, which includes the duplicate EHR fix:

In [1]:
import pandas as pd
import numpy as np

### Old branch (transfers-duplicates-hypothesis) and Duplicate EHR notebook fix

In [2]:
# Import the monthly transfer files (Sept 2020 to Feb 2021) from the old branch
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-duplicates-hypothesis/"
transfer_files = [
    "9-2020-transfers.parquet",
    "10-2020-transfers.parquet",
    "11-2020-transfers.parquet",
    "12-2020-transfers.parquet",
    "1-2021-transfers.parquet",
    "2-2021-transfers.parquet",
]

transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))

# In the data from the PRMT-1742-duplicates-analysis branch, these columns have been added, but contain only empty values.
transfers_raw = transfers_raw.drop(["sending_supplier", "requesting_supplier"], axis=1)

# Given the findings in PRMT-1742 (many duplicate EHR errors are misclassified), reclassify the relevant data
has_at_least_one_successful_integration_code = lambda errors: any((np.isnan(e) or e==15 for e in errors))
successful_transfers_bool = transfers_raw['request_completed_ack_codes'].apply(has_at_least_one_successful_integration_code)
transfers = transfers_raw.copy()
transfers.loc[successful_transfers_bool, "status"] = "INTEGRATED"

# Correctly interpret certain sender errors as failed.
# This is explained in PRMT-1974. Eventually this will be fixed upstream in the pipeline.
pending_sender_error_codes=[6,7,10,24,30,23,14,99]
transfers_with_pending_sender_code_bool=transfers['sender_error_code'].isin(pending_sender_error_codes)
transfers_with_pending_with_error_bool=transfers['status']=='PENDING_WITH_ERROR'
transfers_which_need_pending_to_failure_change_bool=transfers_with_pending_sender_code_bool & transfers_with_pending_with_error_bool
transfers.loc[transfers_which_need_pending_to_failure_change_bool,'status']='FAILED'

# Add integrated Late status
eight_days_in_seconds=8*24*60*60
transfers_after_sla_bool=transfers['sla_duration']>eight_days_in_seconds
transfers_with_integrated_bool=transfers['status']=='INTEGRATED'
transfers_integrated_late_bool=transfers_after_sla_bool & transfers_with_integrated_bool
transfers.loc[transfers_integrated_late_bool,'status']='INTEGRATED LATE'

# If the record integrated after 28 days, change the status back to pending.
# This handles each month consistently and always reflects a transfer's status 28 days after it was made.
# TBD how this is handled upstream in the pipeline
twenty_eight_days_in_seconds=28*24*60*60
transfers_after_month_bool=transfers['sla_duration']>twenty_eight_days_in_seconds
transfers_pending_at_month_bool=transfers_after_month_bool & transfers_integrated_late_bool
transfers.loc[transfers_pending_at_month_bool,'status']='PENDING'
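# Note: `~` binds more tightly than `>`, so the second term is (~len) > 0, which is always False;
# in effect only sender_error_code is checked here.
transfers_with_early_error_bool=(~transfers.loc[:,'sender_error_code'].isna()) |(~transfers.loc[:,'intermediate_error_codes'].apply(len)>0)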
transfers.loc[transfers_with_early_error_bool & transfers_pending_at_month_bool,'status']='PENDING_WITH_ERROR'

# Supplier name mapping
supplier_renaming = {
    "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)":"EMIS",
    "IN PRACTICE SYSTEMS LTD":"Vision",
    "MICROTEST LTD":"Microtest",
    "THE PHOENIX PARTNERSHIP":"TPP",
    None: "Unknown"
}

asid_lookup_file = "s3://prm-gp2gp-data-sandbox-dev/asid-lookup/asidLookup-Mar-2021.csv.gz"
asid_lookup = pd.read_csv(asid_lookup_file)
lookup = asid_lookup[["ASID", "MName", "NACS","OrgName"]]

transfers = transfers.merge(lookup, left_on='requesting_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'requesting_supplier', 'ASID': 'requesting_supplier_asid', 'NACS': 'requesting_ods_code','OrgName':'requesting_practice_name'}, axis=1)
transfers = transfers.merge(lookup, left_on='sending_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'sending_supplier', 'ASID': 'sending_supplier_asid', 'NACS': 'sending_ods_code','OrgName':'sending_practice_name'}, axis=1)

transfers["sending_supplier"] = transfers["sending_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
transfers["requesting_supplier"] = transfers["requesting_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
transfers_old_branch = transfers.copy()
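
The reclassification step near the top of the cell above marks a transfer as INTEGRATED when any of its `request_completed_ack_codes` is missing (NaN) or equal to 15. The following is a minimal sketch of that rule on made-up rows; the conversation ids, statuses and code lists are hypothetical and only serve to show which rows the rule picks up.

```python
import numpy as np
import pandas as pd


def has_successful_integration_code(ack_codes):
    """True if any ack code is NaN or 15, the two values the cell above treats as success."""
    return any(np.isnan(code) or code == 15 for code in ack_codes)


# Hypothetical rows, not taken from the real dataset.
toy_transfers = pd.DataFrame({
    "conversation_id": ["A", "B", "C"],
    "status": ["FAILED", "FAILED", "PENDING"],
    "request_completed_ack_codes": [[12.0, np.nan], [12.0, 15.0], [12.0, 12.0]],
})

is_successful = toy_transfers["request_completed_ack_codes"].apply(has_successful_integration_code)
toy_transfers.loc[is_successful, "status"] = "INTEGRATED"

reclassified_statuses = toy_transfers["status"].tolist()
print(reclassified_statuses)
```

Rows A and B are reclassified because one of their codes is NaN or 15; row C keeps its original status because every code is 12.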

### New data pipeline with Duplicate EHR Fix (PRMT-1617)

In [3]:
# Import the monthly transfer files (Sept 2020 to Feb 2021) from the new pipeline output
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-sample-5/"
transfer_files = [
    "2020-9-transfers.parquet",
    "2020-10-transfers.parquet",
    "2020-11-transfers.parquet",
    "2020-12-transfers.parquet",
    "2021-1-transfers.parquet",
    "2021-2-transfers.parquet"
]

transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))

# In the data from the PRMT-1742-duplicates-analysis branch, these columns have been added, but contain only empty values.
transfers_raw = transfers_raw.drop(["sending_supplier", "requesting_supplier"], axis=1)
transfers = transfers_raw.copy()

# Correctly interpret certain sender errors as failed.
# This is explained in PRMT-1974. Eventually this will be fixed upstream in the pipeline.
pending_sender_error_codes=[6,7,10,24,30,23,14,99]
transfers_with_pending_sender_code_bool=transfers['sender_error_code'].isin(pending_sender_error_codes)
transfers_with_pending_with_error_bool=transfers['status']=='PENDING_WITH_ERROR'
transfers_which_need_pending_to_failure_change_bool=transfers_with_pending_sender_code_bool & transfers_with_pending_with_error_bool
transfers.loc[transfers_which_need_pending_to_failure_change_bool,'status']='FAILED'

# Add integrated Late status
eight_days_in_seconds=8*24*60*60
transfers_after_sla_bool=transfers['sla_duration']>eight_days_in_seconds
transfers_with_integrated_bool=transfers['status']=='INTEGRATED'
transfers_integrated_late_bool=transfers_after_sla_bool & transfers_with_integrated_bool
transfers.loc[transfers_integrated_late_bool,'status']='INTEGRATED LATE'

# If the record integrated after 28 days, change the status back to pending.
# This handles each month consistently and always reflects a transfer's status 28 days after it was made.
# TBD how this is handled upstream in the pipeline
twenty_eight_days_in_seconds=28*24*60*60
transfers_after_month_bool=transfers['sla_duration']>twenty_eight_days_in_seconds
transfers_pending_at_month_bool=transfers_after_month_bool & transfers_integrated_late_bool
transfers.loc[transfers_pending_at_month_bool,'status']='PENDING'
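# Note: `~` binds more tightly than `>`, so the second term is (~len) > 0, which is always False;
# in effect only sender_error_code is checked here.
transfers_with_early_error_bool=(~transfers.loc[:,'sender_error_code'].isna()) |(~transfers.loc[:,'intermediate_error_codes'].apply(len)>0)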
transfers.loc[transfers_with_early_error_bool & transfers_pending_at_month_bool,'status']='PENDING_WITH_ERROR'

# Supplier name mapping
supplier_renaming = {
    "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)":"EMIS",
    "IN PRACTICE SYSTEMS LTD":"Vision",
    "MICROTEST LTD":"Microtest",
    "THE PHOENIX PARTNERSHIP":"TPP",
    None: "Unknown"
}

asid_lookup_file = "s3://prm-gp2gp-data-sandbox-dev/asid-lookup/asidLookup-Mar-2021.csv.gz"
asid_lookup = pd.read_csv(asid_lookup_file)
lookup = asid_lookup[["ASID", "MName", "NACS","OrgName"]]

transfers = transfers.merge(lookup, left_on='requesting_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'requesting_supplier', 'ASID': 'requesting_supplier_asid', 'NACS': 'requesting_ods_code','OrgName':'requesting_practice_name'}, axis=1)
transfers = transfers.merge(lookup, left_on='sending_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'sending_supplier', 'ASID': 'sending_supplier_asid', 'NACS': 'sending_ods_code','OrgName':'sending_practice_name'}, axis=1)

transfers["sending_supplier"] = transfers["sending_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
transfers["requesting_supplier"] = transfers["requesting_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())

# Make the status more human-readable
transfers["status"] = transfers["status"].str.replace("_", " ").str.title()

transfers_new = transfers.copy()
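
Both cells above build `transfers_with_early_error_bool` using `~...apply(len)>0`. In Python, `~` is applied before `>`, so on an integer Series this compares the bitwise inverse of each length with zero. The sketch below shows the difference on a made-up Series; the error codes in it are hypothetical.

```python
import pandas as pd

# Hypothetical intermediate error code lists for three transfers.
intermediate_error_codes = pd.Series([[], [29], [29, 31]])
code_counts = intermediate_error_codes.apply(len)  # 0, 1, 2

# As written in the cells above: parsed as (~code_counts) > 0.
# ~n is -(n + 1) for integers, so this is never positive.
as_written = (~code_counts > 0).tolist()

# An explicit "has at least one intermediate error code" check.
has_any_code = (code_counts > 0).tolist()

print(as_written)
print(has_any_code)
```

If the intent is "has a sender error or at least one intermediate error", the second form is probably what was meant. Switching to it could change the Pending With Error counts reported below, so it would need a re-run to confirm.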

## Comparison

In [4]:
print(f"The total number of transfers between the old transfer set and new one match: {transfers_old_branch.shape[0]} and {transfers_new.shape[0]}")

The total number of transfers between the old transfer set and new one match: 1343234 and 1343234


In [5]:
transfers_new["status"].value_counts()

Integrated            1174129
Integrated Late         72639
Pending                 49073
Failed                  44168
Pending With Error       3225
Name: status, dtype: int64

In [6]:
transfers_old_branch["status"].value_counts()

INTEGRATED            1175456
INTEGRATED LATE         71547
PENDING                 46860
FAILED                  46146
PENDING_WITH_ERROR       3225
Name: status, dtype: int64
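
To make the two breakdowns easier to compare, the counts can be put side by side. This sketch copies the figures from the two `value_counts()` outputs above and normalises the old labels in the same way the notebook normalises the new ones; the variable names are ours.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
new_counts = pd.Series({
    "Integrated": 1174129,
    "Integrated Late": 72639,
    "Pending": 49073,
    "Failed": 44168,
    "Pending With Error": 3225,
})
old_counts = pd.Series({
    "INTEGRATED": 1175456,
    "INTEGRATED LATE": 71547,
    "PENDING": 46860,
    "FAILED": 46146,
    "PENDING_WITH_ERROR": 3225,
})

# Bring the old labels into the same human-readable form as the new ones.
old_counts.index = old_counts.index.str.replace("_", " ").str.title()

comparison = pd.DataFrame({"old": old_counts, "new": new_counts})
comparison["change"] = comparison["new"] - comparison["old"]
print(comparison)
```

The changes sum to zero, as expected given that both datasets contain 1343234 transfers. Most of the movement is out of Integrated and Failed and into Integrated Late and Pending.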

### Direct comparison of status change

In [7]:
# Joining the datasets
transfers_previous_status = transfers_old_branch.copy().loc[:, ["status", "conversation_id"]]
transfers_previous_status = transfers_previous_status.rename({"status": "Previous Status"}, axis=1)
transfers_new_status = transfers_new.copy().loc[:, ["status", "conversation_id"]]
transfers_new_status = transfers_new_status.rename({"status": "New Status"}, axis=1)
transfer_status_change = transfers_previous_status.merge(transfers_new_status, left_on="conversation_id", right_on="conversation_id", how="outer").fillna("Unknown")

In [8]:
transfer_status_change.groupby(by=["Previous Status", "New Status"]).agg("count").fillna(0).astype(int).rename({"conversation_id": "Number of transfers"}, axis=1)

Previous Status,New Status,Number of transfers
FAILED,Failed,44167
FAILED,Pending,1979
INTEGRATED,Integrated,1174121
INTEGRATED,Integrated Late,1100
INTEGRATED,Pending,235
INTEGRATED LATE,Integrated,8
INTEGRATED LATE,Integrated Late,71539
PENDING,Failed,1
PENDING,Pending,46859
PENDING_WITH_ERROR,Pending With Error,3225
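
The table above contains no "Unknown" rows, which suggests that every `conversation_id` appears in both datasets. A more explicit way to check this is sketched below using the `indicator` and `validate` options of `DataFrame.merge` and `pd.crosstab`; the rows are made up, and this is an alternative to the join above rather than what the notebook ran.

```python
import pandas as pd

# Hypothetical status tables keyed by conversation_id.
previous = pd.DataFrame({
    "conversation_id": ["A", "B", "C"],
    "Previous Status": ["INTEGRATED", "FAILED", "PENDING"],
})
new = pd.DataFrame({
    "conversation_id": ["A", "B", "D"],
    "New Status": ["Integrated", "Pending", "Failed"],
})

# validate raises if a conversation_id is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from.
joined = previous.merge(
    new, on="conversation_id", how="outer", validate="one_to_one", indicator=True
)

unmatched_ids = sorted(joined.loc[joined["_merge"] != "both", "conversation_id"])

# Count status transitions for conversations present in both datasets.
matched = joined[joined["_merge"] == "both"]
transition_counts = pd.crosstab(matched["Previous Status"], matched["New Status"])

print(unmatched_ids)
print(transition_counts)
```

In this toy example, C and D are reported as unmatched, and only A and B contribute to the transition counts.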


### Looking at transfers whose status changed from Integrated to Pending

In [9]:
transfers_with_previous_integrated_status_bool = transfer_status_change["Previous Status"] == "INTEGRATED"
transfers_with_new_pending_status_bool = transfer_status_change["New Status"] == "Pending"
integrated_to_pending_conversation_ids= transfer_status_change.loc[transfers_with_previous_integrated_status_bool & transfers_with_new_pending_status_bool, "conversation_id"].values

In [10]:
transfers_old_branch.set_index("conversation_id").loc[integrated_to_pending_conversation_ids]

conversation_id,sla_duration,requesting_practice_asid,sending_practice_asid,sender_error_code,final_error_code,intermediate_error_codes,status,date_requested,date_completed,request_completed_ack_codes,requesting_supplier_asid,requesting_supplier,requesting_ods_code,requesting_practice_name,sending_supplier_asid,sending_supplier,sending_ods_code,sending_practice_name
697E0720-F131-11EA-B60A-4104D775471F,24.0,028449383042,666501433047,,12.0,[],INTEGRATED,2020-09-07 17:41:51.847,2020-09-30 16:04:35.556,"[12.0, 12.0, 12.0, 12.0, nan]",028449383042,TPP,A83023,STANLEY MEDICAL GROUP,666501433047,EMIS,A88014,STANHOPE PARADE HEALTH CENTRE
EBC46930-F691-11EA-96E7-F7B11ACB2AA2,4.0,417376381048,727574574016,,12.0,[],INTEGRATED,2020-09-14 13:55:22.451,2020-09-30 09:02:04.049,"[12.0, nan]",417376381048,TPP,K82615,WALNUT TREE HEALTH CENTRE,727574574016,EMIS,M82008,CHURCH STRETTON MEDICAL CENTRE
1BD32920-F6AD-11EA-9354-DDB8251F3610,3.0,296554937015,686979807019,,12.0,[],INTEGRATED,2020-09-14 17:09:54.854,2020-10-05 11:28:09.848,"[12.0, 12.0, 12.0, nan]",296554937015,TPP,E87011,LISSON GROVE HEALTH CENTRE,686979807019,EMIS,H85049,BATTERSEA RISE GROUP PRACTICE
650495A0-F984-11EA-B317-EBDB0CCCCEBE,2.0,969175028014,200000004793,,12.0,[],INTEGRATED,2020-09-18 07:55:15.170,2020-10-06 13:25:56.718,"[12.0, 12.0, 12.0, 12.0, nan]",969175028014,TPP,F86657,YORK ROAD SURGERY,200000004793,EMIS,F86022,ILFORD MEDICAL CENTRE
26B67630-F739-11EA-B980-55445A8C4022,2.0,380336392044,200000017766,,12.0,[],INTEGRATED,2020-09-15 09:52:22.735,2020-09-29 10:15:39.444,"[12.0, nan]",380336392044,TPP,P82003,KILDONAN HOUSE,200000017766,EMIS,P82016,HARWOOD MEDICAL CENTRE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4D5AFF10-64A6-11EB-9F44-59F9B2FAD74E,4.0,734276839010,163425916042,,12.0,[],INTEGRATED,2021-02-01 15:58:19.410,2021-02-08 11:55:12.944,"[12.0, 12.0, nan]",734276839010,TPP,N81059,CULCHETH MEDICAL CENTRE,163425916042,EMIS,C84023,THE UNIV OF NOTTINGHAM HEALTH SERV
8A6BD080-6615-11EB-B291-9D9CEBFCE4CD,4.0,245920374049,550010203044,,12.0,[],INTEGRATED,2021-02-03 11:47:07.130,2021-02-04 10:05:04.652,"[12.0, nan]",245920374049,TPP,D82063,WATTON MEDICAL PRACTICE,550010203044,EMIS,G82094,CHARING SURGERY
4B165810-678C-11EB-B0F7-112D2178EAAE,3.0,300611837018,413718379048,,12.0,[],INTEGRATED,2021-02-05 08:29:42.869,2021-03-08 14:48:14.319,"[12.0, nan]",300611837018,TPP,B81045,ASHBY TURN PRIMARY CARE PARTNERS,413718379048,EMIS,B81043,SOUTH AXHOLME PRACTICE
F54A8E60-6646-11EB-98FC-2D5DBB14FCE6,2.0,581977920015,261187902042,,12.0,[],INTEGRATED,2021-02-03 17:40:51.888,2021-03-03 09:47:47.100,"[12.0, nan]",581977920015,TPP,C83052,THE INGHAM SURGERY,261187902042,EMIS,C83044,CASKGATE STREET SURGERY


In [11]:
transfers_new.set_index("conversation_id").loc[integrated_to_pending_conversation_ids]

conversation_id,sla_duration,requesting_practice_asid,sending_practice_asid,sender_error_code,final_error_codes,intermediate_error_codes,status,date_requested,date_completed,requesting_supplier_asid,requesting_supplier,requesting_ods_code,requesting_practice_name,sending_supplier_asid,sending_supplier,sending_ods_code,sending_practice_name
697E0720-F131-11EA-B60A-4104D775471F,3693294.0,028449383042,666501433047,,"[12.0, 12.0, 12.0, 12.0, nan]",[],Pending,2020-09-07 17:41:51.847,2020-10-20 11:38:20.420,028449383042,TPP,A83023,STANLEY MEDICAL GROUP,666501433047,EMIS,A88014,STANHOPE PARADE HEALTH CENTRE
EBC46930-F691-11EA-96E7-F7B11ACB2AA2,2753200.0,417376381048,727574574016,,"[12.0, nan]",[],Pending,2020-09-14 13:55:22.451,2020-10-16 10:42:12.086,417376381048,TPP,K82615,WALNUT TREE HEALTH CENTRE,727574574016,EMIS,M82008,CHURCH STRETTON MEDICAL CENTRE
1BD32920-F6AD-11EA-9354-DDB8251F3610,3100796.0,296554937015,686979807019,,"[12.0, 12.0, 12.0, nan]",[],Pending,2020-09-14 17:09:54.854,2020-10-20 14:30:01.702,296554937015,TPP,E87011,LISSON GROVE HEALTH CENTRE,686979807019,EMIS,H85049,BATTERSEA RISE GROUP PRACTICE
650495A0-F984-11EA-B317-EBDB0CCCCEBE,2689269.0,969175028014,200000004793,,"[12.0, 12.0, 12.0, 12.0, nan]",[],Pending,2020-09-18 07:55:15.170,2020-10-19 10:56:31.084,969175028014,TPP,F86657,YORK ROAD SURGERY,200000004793,EMIS,F86022,ILFORD MEDICAL CENTRE
26B67630-F739-11EA-B980-55445A8C4022,3552864.0,380336392044,200000017766,,"[12.0, nan]",[],Pending,2020-09-15 09:52:22.735,2020-10-26 12:46:52.547,380336392044,TPP,P82003,KILDONAN HOUSE,200000017766,EMIS,P82016,HARWOOD MEDICAL CENTRE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4D5AFF10-64A6-11EB-9F44-59F9B2FAD74E,3614287.0,734276839010,163425916042,,"[12.0, 12.0, nan]",[],Pending,2021-02-01 15:58:19.410,2021-03-15 11:57:06.605,734276839010,TPP,N81059,CULCHETH MEDICAL CENTRE,163425916042,EMIS,C84023,THE UNIV OF NOTTINGHAM HEALTH SERV
8A6BD080-6615-11EB-B291-9D9CEBFCE4CD,3098638.0,245920374049,550010203044,,"[12.0, nan]",[],Pending,2021-02-03 11:47:07.130,2021-03-11 08:31:27.664,245920374049,TPP,D82063,WATTON MEDICAL PRACTICE,550010203044,EMIS,G82094,CHARING SURGERY
4B165810-678C-11EB-B0F7-112D2178EAAE,2771277.0,300611837018,413718379048,,"[12.0, nan]",[],Pending,2021-02-05 08:29:42.869,2021-03-09 10:17:54.735,300611837018,TPP,B81045,ASHBY TURN PRIMARY CARE PARTNERS,413718379048,EMIS,B81043,SOUTH AXHOLME PRACTICE
F54A8E60-6646-11EB-98FC-2D5DBB14FCE6,2503760.0,581977920015,261187902042,,"[12.0, nan]",[],Pending,2021-02-03 17:40:51.888,2021-03-04 17:10:26.663,581977920015,TPP,C83052,THE INGHAM SURGERY,261187902042,EMIS,C83044,CASKGATE STREET SURGERY
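
In the new pipeline output, the `sla_duration` values in the displayed rows are all large. The sketch below checks them against the 28-day threshold that the notebook uses to move late integrations back to Pending. It covers only the nine rows shown above; the elided rows would need the same check against the full dataset.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TWENTY_EIGHT_DAYS_IN_SECONDS = 28 * SECONDS_PER_DAY  # 2419200

# sla_duration values copied from the nine displayed rows of the new pipeline output.
displayed_sla_durations = [
    3693294.0,
    2753200.0,
    3100796.0,
    2689269.0,
    3552864.0,
    3614287.0,
    3098638.0,
    2771277.0,
    2503760.0,
]

durations_in_days = [round(seconds / SECONDS_PER_DAY, 1) for seconds in displayed_sla_durations]
all_beyond_28_days = all(
    seconds > TWENTY_EIGHT_DAYS_IN_SECONDS for seconds in displayed_sla_durations
)

print(durations_in_days)
print(all_beyond_28_days)
```

Every displayed value exceeds 28 days (the smallest, 2503760 seconds, is just under 29 days). This is consistent with these transfers being set to Pending by the 28-day rule, rather than with the pipeline failing to recognise the integration.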
