# [HYPOTHESIS] Changing the data pipeline to accommodate duplicate EHRs correctly will affect other error codes

**We believe that** changing the data pipeline to accommodate duplicate EHRs correctly, as proposed in PRMT-1742 (Duplicate EHR errors), will

**Result** in transfers that don't contain error code 12 (duplicate EHR) being reassigned as integrated rather than failed

**We will know this to be true when** we test the effect of the data pipeline change using the 6 months of data generated for PRMT-1742 (Duplicate EHR errors) and compare the new status to the status produced by the original data pipeline.

In [1]:
import pandas as pd
import numpy as np

In [2]:
def Series_of_lists_value_counts(Series):
    # Replace any nan values in list
    Series=Series.apply(lambda row: ['None' if np.isnan(x) else x for x in row])
    # Convert this into a dataframe of list items in order
    journey_frame=pd.DataFrame.from_records(Series.tolist())
    # To ensure grouping of different list lengths, fill gaps
    journey_frame=journey_frame.fillna('n/a')
    # Store index for grouping
    grouping_index=list(journey_frame.columns)
    # Add column to aggregate on for group
    journey_frame['Total Occurrences']=1

    # Now do the actual aggregate
    journey_frame=journey_frame.groupby(grouping_index).agg('count').sort_values(by='Total Occurrences',ascending=False)
    
    return journey_frame.reset_index().replace({'n/a':np.nan})
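To make the intent of this helper concrete, here is a minimal sketch of the same idea on toy data: count how often each combination of codes occurs across a collection of lists. It is a simplified stand-in, not the function above; `toy_code_lists`, `normalise_codes` and `combination_counts` are hypothetical names, and it uses `collections.Counter` instead of a grouped DataFrame.

```python
from collections import Counter

import numpy as np

# Toy stand-in for a column of request completed ack code lists (made-up values)
toy_code_lists = [[12.0, np.nan], [np.nan], [np.nan, 12.0], [11.0]]

def normalise_codes(codes):
    # Replace nan with the label 'None' and sort the labels, so that lists
    # holding the same codes in a different order count as one combination
    labels = ['None' if np.isnan(code) else str(code) for code in codes]
    return tuple(sorted(labels))

# Count how many lists fall into each combination of codes
combination_counts = Counter(normalise_codes(codes) for codes in toy_code_lists)

for combination, occurrences in combination_counts.most_common():
    print(combination, occurrences)
```

Like the `apply(set)` call used later in this notebook, the sorting step deliberately discards the order in which codes were received.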

In [3]:
error_code_lookup_file = pd.read_csv("https://raw.githubusercontent.com/nhsconnect/prm-gp2gp-data-sandbox/master/data/gp2gp_response_codes.csv")

### Import data generated during the duplicates issue hypothesis

In [4]:
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-duplicates-hypothesis/"
transfer_files = [
    "9-2020-transfers.parquet",
    "10-2020-transfers.parquet",
    "11-2020-transfers.parquet",
    "12-2020-transfers.parquet",
    "1-2021-transfers.parquet",
    "2-2021-transfers.parquet"
]
transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))

### Add a "New status" column that simulates the effect that the proposed pipeline change would have

In [5]:
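# Start from the original status, then overwrite it where the simulated rule applies
transfers=transfers_raw.copy()
transfers['New status']=transfers['status']
# Simulated rule: a transfer counts as integrated if any request completed ack code is nan or 15
successful_transfers_bool = transfers['request_completed_ack_codes'].apply(lambda x: any(np.isnan(i) or i==15 for i in x))
transfers.loc[successful_transfers_bool,'New status']='INTEGRATED'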
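The rule in the cell above can be checked on a small made-up frame. This is only a sketch: the rows are invented, `has_successful_ack` and `toy` are hypothetical names, and it assumes (as the cell above does) that a nan entry or code 15 among the request completed ack codes marks a successful integration.

```python
import numpy as np
import pandas as pd

def has_successful_ack(ack_codes):
    # True when at least one ack code is nan or 15 (the simulated success rule)
    return any(np.isnan(code) or code == 15 for code in ack_codes)

# Made-up transfers covering the main cases
toy = pd.DataFrame({
    "status": ["FAILED", "FAILED", "PENDING", "INTEGRATED"],
    "request_completed_ack_codes": [[12.0, np.nan], [12.0], [np.nan], [np.nan]],
})

# Copy the original status, then overwrite rows that satisfy the rule
toy["New status"] = toy["status"]
mask = toy["request_completed_ack_codes"].apply(has_successful_ack)
toy.loc[mask, "New status"] = "INTEGRATED"

print(toy[["status", "New status"]])
```

In this toy frame the transfer with codes `[12.0, nan]` moves from FAILED to INTEGRATED, while the one with only `[12.0]` stays FAILED.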

### Extract transfers that would have their status changed by the new pipeline

In [6]:
different_status_bool=transfers['status']!=transfers['New status']
changed_transfers=transfers.loc[different_status_bool]


### Check the volume of status changes that have occurred

In [7]:
print('Total Number of transfers:')
print(transfers.shape[0])
print('Total Number of changed transfers:')
print(changed_transfers.shape[0])
changed_transfers.groupby(['status','New status']).agg({'conversation_id':'count'}).rename({'conversation_id':'Count'},axis=1)

Total Number of transfers:
1343234
Total Number of changed transfers:
4513


status,New status,Count
FAILED,INTEGRATED,4493
PENDING,INTEGRATED,19
PENDING_WITH_ERROR,INTEGRATED,1
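To put these volumes in proportion, the following sketch recomputes the share of transfers whose status changes, using only the counts printed above. The variable names are illustrative.

```python
# Counts copied from the output above
total_transfers = 1343234
changed_by_original_status = {"FAILED": 4493, "PENDING": 19, "PENDING_WITH_ERROR": 1}

# The per-status counts should add up to the total number of changed transfers
total_changed = sum(changed_by_original_status.values())

# Share of all transfers that would change status under the new pipeline
share_changed = total_changed / total_transfers
# Share of the changed transfers that were originally FAILED
share_failed = changed_by_original_status["FAILED"] / total_changed

print(f"{total_changed} of {total_transfers} transfers change status ({share_changed:.2%})")
print(f"{share_failed:.2%} of the changed transfers were originally FAILED")
```

The change touches roughly 0.34% of all transfers, almost all of which were originally marked FAILED.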


### Which request completed acknowledgement codes were in these changed transfers?
I.e. which error types prevented these transfers from being assigned as integrated in the original pipeline?

#### Let's investigate the pending-with-error transfer

In [8]:
changed_transfers.loc[changed_transfers['status']=="PENDING_WITH_ERROR",['sender_error_code','final_error_code','intermediate_error_codes','request_completed_ack_codes']]

,sender_error_code,final_error_code,intermediate_error_codes,request_completed_ack_codes
16618,20.0,,[29],[15.0]


#### Let's investigate the pending transfers

In [9]:
changed_transfers.loc[changed_transfers['status']=="PENDING",['sender_error_code','final_error_code','intermediate_error_codes','request_completed_ack_codes']]

,sender_error_code,final_error_code,intermediate_error_codes,request_completed_ack_codes
16376,,,[],[nan]
40566,,,[],"[12.0, nan]"
89593,,,[],[nan]
122940,,,[],[nan]
113944,,,[],[nan]
115002,,,[],[nan]
197413,,,[],[nan]
209017,,,[],[nan]
6735,,,[],[nan]
28427,,,[],[nan]


#### Let's investigate the failed transfers

In [14]:
# Find the common sets of request completed acknowledgement codes (note: apply(set) discards the order in which the codes originally appeared in Spine)
original_error_codes_failed_transfers=Series_of_lists_value_counts(changed_transfers.loc[changed_transfers['status']=="FAILED",'request_completed_ack_codes'].apply(set))

# Rename Error Code columns to make table more readable
original_error_codes_failed_transfers=original_error_codes_failed_transfers.rename({0:'Error Code 1',1:'Error Code 2',2:'Error Code 3'},axis=1)

# Add in Error Descriptions
error_descriptions=original_error_codes_failed_transfers[['Error Code 1','Error Code 2','Error Code 3']]
error_descriptions=error_descriptions.replace(error_code_lookup_file["ErrorCode"].values,error_code_lookup_file['ErrorName'].values)
original_error_codes_failed_transfers[['Error 1','Error 2','Error 3']]=error_descriptions

original_error_codes_failed_transfers=original_error_codes_failed_transfers.fillna('')[['Error Code 1','Error Code 2','Error Code 3','Error 1','Error 2','Error 3','Total Occurrences']]
original_error_codes_failed_transfers

,Error Code 1,Error Code 2,Error Code 3,Error 1,Error 2,Error 3,Total Occurrences
0,,12.0,,,Duplicate EHR,,4210
1,,11.0,,,Failed to integrate,,178
2,,11.0,12.0,,Failed to integrate,Duplicate EHR,28
3,12.0,15.0,,Duplicate EHR,ABA suppressed,,23
4,,31.0,,,Missing LM,,17
5,,12.0,31.0,,Duplicate EHR,Missing LM,14
6,11.0,15.0,,Failed to integrate,ABA suppressed,,6
7,,25.0,,,Timeout,,5
8,,25.0,12.0,,Timeout,Duplicate EHR,3
9,11.0,12.0,15.0,Failed to integrate,Duplicate EHR,ABA suppressed,2
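The description lookup in the cell above can also be expressed as a dictionary mapping. The sketch below is an alternative illustration, not the approach used in this notebook: `lookup`, `code_to_name`, `toy_codes` and `toy_names` are hypothetical names, and the small lookup table only contains code and name pairs that appear in the output above rather than the full CSV.

```python
import numpy as np
import pandas as pd

# Small stand-in for the error code lookup file (pairs taken from the table above)
lookup = pd.DataFrame({
    "ErrorCode": [11, 12, 15],
    "ErrorName": ["Failed to integrate", "Duplicate EHR", "ABA suppressed"],
})

# Build a float-keyed dictionary, since the ack codes are stored as floats
code_to_name = {
    float(code): name
    for code, name in zip(lookup["ErrorCode"], lookup["ErrorName"])
}

# Made-up column of codes, including a missing value
toy_codes = pd.Series([12.0, 11.0, np.nan, 15.0])

# Codes without a match (including nan) become an empty description
toy_names = toy_codes.map(lambda code: code_to_name.get(code, ""))

print(pd.DataFrame({"Error Code": toy_codes, "Error": toy_names}))
```

An explicit dictionary makes it easy to see what happens to codes that are missing from the lookup: here they are given an empty description instead of being left as raw numbers.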
