# [HYPOTHESIS] Duplicate EHR errors are for EHRs that have been successfully received previously

### Context
We know that a lot of GP2GP errors are due to duplicate EHRs, and these are currently counted as failed transfers.

### Hypothesis
We believe that transfers that result in a duplicate error will have previously been successfully integrated by the receiving practice

We will know this to be true when we see multiple transfers with the same conversation ID, where the first has succeeded and the second has resulted in a duplicate error

### Value
An understanding of whether, and how, we need to improve the accuracy of the metrics produced by our data pipeline

### Scope
Look at six months of GP2GP data

Identify what proportion of transfers that resulted in a duplicate error had previously had the same EHR successfully received by a practice

Identify what proportion of transfers that resulted in a duplicate error had previously had the same EHR sent but not received by a practice
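
The two proportions in scope can be sketched on a toy table. This is only an illustration under stated assumptions: the column name `request_completed_ack_codes`, the duplicate code 12, and the convention that a missing code (NaN) or code 15 counts as a successful acknowledgement are borrowed from the analysis cells in this notebook, while the helper names and the sample rows are made up.

```python
import numpy as np
import pandas as pd

DUPLICATE_CODE = 12   # ack code for a duplicate EHR
SUCCESS_CODES = {15}  # assumption: 15 is treated as a success, as is a missing (NaN) code


def has_duplicate(codes):
    """True if any acknowledgement in the conversation is a duplicate error."""
    return any(code == DUPLICATE_CODE for code in codes)


def has_success(codes):
    """True if any acknowledgement in the conversation counts as a success."""
    return any(pd.isna(code) or code in SUCCESS_CODES for code in codes)


# Made-up sample rows, for illustration only
toy = pd.DataFrame({
    "conversation_id": ["a", "b", "c", "d"],
    "request_completed_ack_codes": [
        [np.nan, 12],  # success and duplicate
        [12],          # duplicate only
        [np.nan],      # plain success, no duplicate
        [12, 15],      # duplicate and code 15
    ],
})

duplicates = toy[toy["request_completed_ack_codes"].apply(has_duplicate)]
received = duplicates["request_completed_ack_codes"].apply(has_success)

proportion_received = received.mean()
proportion_not_received = 1 - proportion_received
```

Here the two proportions sum to one by construction, so any transfer with a duplicate error and no visible success lands in the "sent but not received" bucket.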

### Acceptance Criteria
We know whether to classify duplicates as failures or not (where failure means paper fallback)

In [1]:
import pandas as pd
import numpy as np

In [2]:
def Series_of_lists_value_counts(Series):
    # Replace any nan values in list
    Series=Series.apply(lambda row: ['None' if np.isnan(x) else x for x in row])
    # Convert this into a dataframe of list items in order
    journey_frame=pd.DataFrame.from_records(Series.tolist())
    # To ensure grouping of different list lengths, fill gaps
    journey_frame=journey_frame.fillna('n/a')
    # Store index for grouping
    grouping_index=list(journey_frame.columns)
    # Add column to aggregate on for group
    journey_frame['Total Occurences']=1

    # Now do the actual aggregate
    journey_frame=journey_frame.groupby(grouping_index).agg('count').sort_values(by='Total Occurences',ascending=False)
    
    return journey_frame.reset_index().replace({'n/a':np.nan})
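
As a cross-check on the helper above, the same kind of count can be sketched without building a wide DataFrame. This is an alternative illustration, not the method used in the rest of this notebook; `count_code_sequences` and the sample series are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd


def count_code_sequences(series_of_lists):
    """Count how often each ordered sequence of ack codes occurs."""
    return Counter(
        # NaN never equals itself, so swap it for a hashable, comparable label
        tuple("None" if pd.isna(code) else code for code in codes)
        for codes in series_of_lists
    )


# Made-up sample, for illustration only
toy_codes = pd.Series([[12.0, np.nan], [12.0], [12.0, np.nan], [11.0, 12.0]])
sequence_counts = count_code_sequences(toy_codes)
most_common_sequence, most_common_total = sequence_counts.most_common(1)[0]
```

Tuples keep the order of the codes, so `[12, None]` and `[None, 12]` are counted as different journeys, which matches the behaviour of the DataFrame-based helper.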

In [3]:
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-duplicates-hypothesis/"
transfer_files = [
    "9-2020-transfers.parquet",
    "10-2020-transfers.parquet",
    "11-2020-transfers.parquet",
    "12-2020-transfers.parquet",
    "1-2021-transfers.parquet",
    "2-2021-transfers.parquet"
]
transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))


asid_lookup_file = "s3://prm-gp2gp-data-sandbox-dev/asid-lookup/asidLookup-Mar-2021.csv.gz"
asid_lookup = pd.read_csv(asid_lookup_file)

In [4]:
# Add in who the supplier is
supplier_renaming = {
    "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)":"EMIS",
    "IN PRACTICE SYSTEMS LTD":"Vision",
    "MICROTEST LTD":"Microtest",
    "THE PHOENIX PARTNERSHIP":"TPP",
    None: "Unknown"
}

lookup = asid_lookup[["ASID", "MName"]]
transfers = transfers.merge(lookup, left_on='requesting_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'requesting_supplier', 'ASID': 'requesting_supplier_asid'}, axis=1)
transfers = transfers.merge(lookup, left_on='sending_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'sending_supplier', 'ASID': 'sending_supplier_asid'}, axis=1)

transfers["sending_supplier"] = transfers["sending_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
transfers["requesting_supplier"] = transfers["requesting_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
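
The lookup-and-rename step can be illustrated on a toy example. The ASIDs and rows below are made up, and passing the mapping as a dict to `replace` followed by `fillna("Unknown")` is one possible way to label unmatched practices, offered as a sketch rather than as what the cell above does.

```python
import pandas as pd

supplier_renaming = {
    "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)": "EMIS",
    "THE PHOENIX PARTNERSHIP": "TPP",
}

# Made-up ASIDs, for illustration only
toy_lookup = pd.DataFrame({
    "ASID": ["100", "200"],
    "MName": [
        "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)",
        "THE PHOENIX PARTNERSHIP",
    ],
})
toy_transfers = pd.DataFrame({"requesting_practice_asid": ["100", "200", "999"]})

# A left merge keeps every transfer; ASIDs missing from the lookup get NaN
merged = toy_transfers.merge(
    toy_lookup, left_on="requesting_practice_asid", right_on="ASID", how="left"
)

# Shorten the known supplier names and label the unmatched ones explicitly
merged["requesting_supplier"] = (
    merged["MName"].replace(supplier_renaming).fillna("Unknown")
)
```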

In [5]:
print('Total Transfers:')
print(transfers.shape[0])

Total Transfers:
1343234


In [6]:
# Use only conversations with a duplicate error
transfers_with_duplicate_error_bool=transfers['request_completed_ack_codes'].apply(lambda x: 12 in x)
transfers_with_duplicate_error=transfers[transfers_with_duplicate_error_bool]
print('Total transfers with duplicate error:')
print(transfers_with_duplicate_error.shape[0])

Total transfers with duplicate error:
20679


In [7]:
print("NOTE: Addition of '15' as an automatic success : IS THIS VALID??")
# Of the transfers with a duplicate error, how many contain a successful transfer?
transfers_with_duplicate_error_and_success_bool=transfers_with_duplicate_error['request_completed_ack_codes'].apply(lambda x: any(np.isnan(i) or i==15 for i in x))
print('Total transfers with a duplicate error and successful transfer:')
print(transfers_with_duplicate_error_and_success_bool.sum())


NOTE: Addition of '15' as an automatic success : IS THIS VALID??
Total transfers with a duplicate error and successful transfer:
18385
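
The check above does not look at the order of the codes, whereas the hypothesis is about a success that comes before the duplicate error. A minimal sketch of an order-aware check follows. It assumes that the order of the list reflects the order in which acknowledgements arrived, which this notebook has not confirmed, and it carries over the open assumption that code 15 counts as a success. The function names are hypothetical.

```python
import numpy as np
import pandas as pd

DUPLICATE_CODE = 12
SUCCESS_CODE = 15  # assumption carried over from the cell above


def is_success(code):
    """A missing code (NaN) or code 15 is treated as a success."""
    return pd.isna(code) or code == SUCCESS_CODE


def success_precedes_duplicate(codes):
    """True if a success appears earlier in the list than the first duplicate error."""
    seen_success = False
    for code in codes:
        if code == DUPLICATE_CODE:
            # The first duplicate decides the answer
            return seen_success
        if is_success(code):
            seen_success = True
    # No duplicate error in this conversation at all
    return False
```

Applied with `Series.apply`, this would split the transfers counted above into "success then duplicate" and "duplicate then success" groups.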


In [8]:
overview_data=transfers_with_duplicate_error.copy()
overview_data['Successful Transfer']=transfers_with_duplicate_error_and_success_bool

pd.pivot_table(overview_data,index='Successful Transfer',columns='status',values='conversation_id',aggfunc='count').fillna(0).astype(int)


| Successful Transfer (rows) / status (columns) | FAILED | INTEGRATED | PENDING |
|---|---|---|---|
| False | 2294 | 0 | 0 |
| True | 4284 | 14099 | 2 |


## Sidebar: Let's look at duplicates data with no "success" recorded

In [9]:
# Let's take a quick look at those without a success
transfers_with_duplicate_error_no_success=transfers_with_duplicate_error[~transfers_with_duplicate_error_and_success_bool]
print(transfers_with_duplicate_error_no_success['status'].value_counts())

integrated_transfers=transfers_with_duplicate_error_no_success.loc[transfers_with_duplicate_error_no_success['status']=='INTEGRATED']
print()
print('Integrated Data with Duplicates but no "success"')
print("The reason for the discrepancy appears to be that it has code 15 (i.e. boomerang patient), which is actually an integration")
Series_of_lists_value_counts(integrated_transfers['request_completed_ack_codes'])


FAILED    2294
Name: status, dtype: int64

Integrated Data with Duplicates but no "success"
The reason for the discrepancy appears to be that it has code 15 (i.e. boomerang patient), which is actually an integration


| index | Total Occurences |
|---|---|

(empty result: no rows)


In [10]:
non_integrated_transfers=transfers_with_duplicate_error_no_success.loc[~(transfers_with_duplicate_error_no_success['status']=='INTEGRATED')]
print('Non-Integrated Data with Duplicates and no "success"')
print("These have a duplicate code but we can't actually see the original transfer")
Series_of_lists_value_counts(non_integrated_transfers['request_completed_ack_codes'])

Non-Integrated Data with Duplicates and no "success"
These have a duplicate code but we can't actually see the original transfer


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | Total Occurences |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | | | | | | | ... | 1456 |
| 1 | 12.0 | 12.0 | | | | | | ... | 316 |
| 2 | 12.0 | 12.0 | 12.0 | | | | | ... | 110 |
| 3 | 11.0 | 12.0 | | | | | | ... | 65 |
| 4 | 28.0 | 12.0 | | | | | | ... | 38 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87 | 12.0 | 12.0 | 28.0 | 12.0 | | | | ... | 1 |
| 88 | 12.0 | 12.0 | 31.0 | 12.0 | 12.0 | 12.0 | 12.0 | ... | 1 |
| 89 | 12.0 | 12.0 | 31.0 | 12.0 | | | | ... | 1 |
| 90 | 12.0 | 12.0 | 99.0 | | | | | ... | 1 |

(Columns 7 to 54 are collapsed into `...`; the displayed ones among them are empty for these rows.)


In [11]:
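# Note: set() does not guarantee an order, so the same combination of codes can show up as two rows (e.g. 28/12 and 12/28 in the output)
Series_of_lists_value_counts(non_integrated_transfers['request_completed_ack_codes'].apply(set))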

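| | 0 | 1 | 2 | Total Occurences |
|---|---|---|---|---|
| 0 | 12.0 | | | 1979 |
| 1 | 11.0 | 12.0 | | 114 |
| 2 | 12.0 | 31.0 | | 82 |
| 3 | 28.0 | 12.0 | | 52 |
| 4 | 12.0 | 28.0 | | 27 |
| 5 | 17.0 | 12.0 | | 23 |
| 6 | 26.0 | 12.0 | | 4 |
| 7 | 99.0 | 12.0 | | 4 |
| 8 | 99.0 | 12.0 | 31.0 | 3 |
| 9 | 11.0 | 12.0 | 28.0 | 2 |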
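
In the table above the same pair of codes appears twice, once as 28/12 and once as 12/28, because a plain `set` is turned back into a list in whatever order it happens to iterate. One way to count combinations regardless of order is to use `frozenset` keys. This is a sketch on made-up paths, not a re-run of the analysis.

```python
from collections import Counter

# Made-up ack code paths, for illustration only
toy_paths = [[28.0, 12.0], [12.0, 28.0], [12.0, 12.0], [12.0]]

# frozenset is hashable and compares equal regardless of element order,
# so 28/12 and 12/28 fall into the same bucket
combination_counts = Counter(frozenset(path) for path in toy_paths)
```

With this grouping, `[28.0, 12.0]` and `[12.0, 28.0]` are counted together, and `[12.0, 12.0]` collapses into the same bucket as `[12.0]`.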


In [12]:
#non_integrated_transfers.loc[non_integrated_transfers['request_completed_ack_codes'].apply(lambda x : len(set(x))==1 and x[0]==12)]

# A quick inspection of E1361C40-0332-11EB-9274-95976CACBE28 showed why only 12 appears in so many failed transfers:
# the original transfer was not acknowledged (i.e. it was left pending) and was then overwritten as "12".
# Our hypothesis is that these are classified as failed but may have been left pending and then followed by a duplicate.

## Let's look at duplicate data where there is also a "success"

In [13]:
# Of those with a duplicate error, but also a success, what is the status?
transfers_with_duplicate_error.loc[transfers_with_duplicate_error_and_success_bool,'status'].value_counts()

INTEGRATED    14099
FAILED         4284
PENDING           2
Name: status, dtype: int64

In [14]:
# What's the deal with the two pending transfers?

#transfers_with_duplicate_error.loc[transfers_with_duplicate_error['status']=='PENDING']

# Weird: how come they have a successful acknowledgement but are listed as pending?
# Maybe they integrated in October; our code picked up on this but the original status code did not?

In [15]:
# For transfers not listed as integrated, how does this look?
transfers_with_duplicate_and_success_error=transfers_with_duplicate_error.loc[transfers_with_duplicate_error_and_success_bool]
non_integrated_transfers_with_duplicate_and_success_error=transfers_with_duplicate_and_success_error.loc[~(transfers_with_duplicate_and_success_error['status']=='INTEGRATED')]

non_integrated_pathways=Series_of_lists_value_counts(non_integrated_transfers_with_duplicate_and_success_error['request_completed_ack_codes'])
print('5 most common paths for non-integrated conversations with duplicates and success')
non_integrated_pathways.head().dropna(axis=1,how='all')

5 most common paths for non-integrated conversations with duplicates and success


| | 0 | 1 | 2 | 3 | Total Occurences |
|---|---|---|---|---|---|
| 0 | 12.0 | | | | 3186 |
| 1 | 12.0 | 12.0 | | | 571 |
| 2 | | 12.0 | | | 192 |
| 3 | 12.0 | 12.0 | 12.0 | | 134 |
| 4 | | 12.0 | 12.0 | | 39 |


In [16]:
# For transfers listed as a success, how does the journey look?
integrated_transfers_with_duplicate_and_success_error=transfers_with_duplicate_and_success_error.loc[(transfers_with_duplicate_and_success_error['status']=='INTEGRATED')]

integrated_pathways=Series_of_lists_value_counts(integrated_transfers_with_duplicate_and_success_error['request_completed_ack_codes'])
print('5 most common paths for integrated conversations with duplicates and success')

integrated_pathways.head().dropna(axis=1,how='all')

5 most common paths for integrated conversations with duplicates and success


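| | 0 | 1 | 2 | 3 | 4 | Total Occurences |
|---|---|---|---|---|---|---|
| 0 | 12 | | | | | 10891 |
| 1 | 12 | 12.0 | | | | 1502 |
| 2 | 12 | 15.0 | | | | 824 |
| 3 | 12 | 12.0 | 12.0 | | | 366 |
| 4 | 12 | 12.0 | 12.0 | 12.0 | | 132 |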


In [17]:
# is it the last value that really matters?
print('What do non-integrated paths end in?')
Nonintegrated_end=non_integrated_pathways.copy()
Nonintegrated_end['Final State']=non_integrated_pathways.apply(lambda row: row.dropna().drop('Total Occurences').values[-1],axis=1)
print(Nonintegrated_end.groupby('Final State')['Total Occurences'].agg('sum'))

print()

print('What do integrated paths end in?')
Integrated_end=integrated_pathways.copy()
Integrated_end['Final State']=integrated_pathways.apply(lambda row: row.dropna().drop('Total Occurences').values[-1],axis=1)
print(Integrated_end.groupby('Final State')['Total Occurences'].agg('sum'))

print()
print('If it ends in error 12, it might fail to list as integrated')

What do non-integrated paths end in?
Final State
11.0       4
12.0     322
17.0       1
None    3959
Name: Total Occurences, dtype: int64

What do integrated paths end in?
Final State
12.0        1
15.0     1008
25.0        1
None    13089
Name: Total Occurences, dtype: int64

If it ends in error 12, it might fail to list as integrated


1. Why do transfers whose final acknowledgement code is None (i.e. a success) end up as FAILED rather than INTEGRATED?
2. Why does the None come after the 12?

## Does our final state match the final error code?

In [18]:
compare_final_codes_table=transfers.loc[:,['status','final_error_code','request_completed_ack_codes']]
compare_final_codes_table['Final req complete Ack Code']=compare_final_codes_table['request_completed_ack_codes'].apply(lambda x: x[-1] if len(x)>0 else np.nan)
compare_final_codes_table=compare_final_codes_table.fillna('None').groupby(['status','final_error_code','Final req complete Ack Code']).agg('count').rename({'request_completed_ack_codes':'Total Occurences'},axis=1)
compare_final_codes_table['Matches']=(compare_final_codes_table.reset_index()['final_error_code']==compare_final_codes_table.reset_index()['Final req complete Ack Code']).values
compare_final_codes_table.sort_values(by='Total Occurences',ascending=False)

| status | final_error_code | Final req complete Ack Code | Total Occurences | Matches |
|---|---|---|---|---|
| INTEGRATED | | | 1174844 | True |
| INTEGRATED | 15.0 | 15.0 | 75442 | True |
| PENDING | | | 39105 | True |
| PENDING_WITH_ERROR | | | 26574 | True |
| FAILED | 99.0 | 99.0 | 13602 | True |
| FAILED | 12.0 | | 3957 | False |
| FAILED | 12.0 | 12.0 | 2525 | True |
| FAILED | 30.0 | 30.0 | 2157 | True |
| FAILED | 31.0 | 31.0 | 1243 | True |
| FAILED | 25.0 | 25.0 | 951 | True |


In [19]:
print('Number of times it lists as Failed due to duplicate error 12 despite actually ending in a success')
compare_final_codes_table.loc[('FAILED',12,'None'),'Total Occurences']

Number of times it lists as Failed due to duplicate error 12 despite actually ending in a success


3957
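
If these transfers were to be counted as integrated, one possible reclassification could look like the sketch below. It is an illustration on made-up rows, not a decision this notebook has reached. The `adjusted_status` column and the `final_ack_code` helper are hypothetical, and the rule assumes that a missing final acknowledgement code means success.

```python
import numpy as np
import pandas as pd

DUPLICATE_CODE = 12


def final_ack_code(codes):
    """Last acknowledgement code in the list, or NaN if the list is empty."""
    return codes[-1] if len(codes) > 0 else np.nan


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "status": ["FAILED", "FAILED", "INTEGRATED", "FAILED"],
    "final_error_code": [12, 12, np.nan, 99],
    "request_completed_ack_codes": [[12, np.nan], [12, 12], [np.nan], [99]],
})

final_codes = toy["request_completed_ack_codes"].apply(final_ack_code)
# Guard against empty lists, where a NaN final code means "no acknowledgement"
has_ack = toy["request_completed_ack_codes"].apply(len) > 0

# FAILED on a duplicate error, yet the last acknowledgement was a success
candidate = (
    (toy["status"] == "FAILED")
    & (toy["final_error_code"] == DUPLICATE_CODE)
    & has_ack
    & final_codes.isna()
)

toy["adjusted_status"] = toy["status"].where(~candidate, "INTEGRATED")
```

Only the first row changes here: it failed with code 12 even though its last acknowledgement was a success, which is the pattern counted in the cell above.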