# PRMT-1960 Can we use the presence of an error code at a particular point in the process to designate a transfer as failed?
### Context

Data range: 01/09/2020 - 28/02/2021 (6 months)

### Hypothesis

**We believe that** certain error codes, when they appear at certain points in the GP2GP process,

**Can** automatically be considered failures.

**We will know this to be true when** we can see in the data that whenever a given error code appears at a given stage of the transfer process (e.g. in intermediate, sender or final message(s)), those transfers have no successful integrations.

### Scope

We have:
- looked at the effect of re-designating as failed any transfers that have a pending with error status and contain the fatal intermediate error codes (see the fatal error codes in Notebook 16: PRMT-1622)
- for each error code, at each stage in the process, looked at the eventual status of the transfer
- identified which error codes, appearing at which stage, can automatically be assumed to be failures.

This analysis covers a 6 month time frame, from September 2020 to February 2021 (using the transfers-duplicates-hypothesis dataset).
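
As a rough illustration of the test described in the hypothesis, the sketch below flags each (error code, stage) pair for which no transfer was integrated. The data frame, its column names (`error_code`, `stage`, `integrated`, `always_fails`) and its values are hypothetical toy data, not figures from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per (transfer, error code, stage) observation.
observations = pd.DataFrame(
    {
        "error_code": [30, 30, 30, 12, 12, 99],
        "stage": ["sender", "sender", "intermediate", "final", "final", "final"],
        "status": [
            "PENDING_WITH_ERROR",
            "PENDING_WITH_ERROR",
            "INTEGRATED",
            "FAILED",
            "INTEGRATED",
            "FAILED",
        ],
    }
)

# Count transfers per (error code, stage) and how many of them were integrated.
summary = (
    observations.assign(integrated=observations["status"] == "INTEGRATED")
    .groupby(["error_code", "stage"])["integrated"]
    .agg(total="count", integrated="sum")
)

# Under the hypothesis, a pair is a candidate for automatic failure only if
# no transfer carrying it was ever integrated.
summary["always_fails"] = summary["integrated"] == 0
print(summary)
```

In this toy example, code 30 would qualify only when it appears as a sender error, which is the kind of stage-dependent pattern the analysis looks for.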

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
error_code_lookup_file = pd.read_csv("https://raw.githubusercontent.com/nhsconnect/prm-gp2gp-data-sandbox/master/data/gp2gp_response_codes.csv")

In [3]:
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-duplicates-hypothesis/"
transfer_files = [
    "9-2020-transfers.parquet",
    "10-2020-transfers.parquet",
    "11-2020-transfers.parquet",
    "12-2020-transfers.parquet",
    "1-2021-transfers.parquet",
    "2-2021-transfers.parquet",
]
transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))
# This is only needed when using transfers-duplicates-hypothesis datasets
transfers_raw = transfers_raw.drop(["sending_supplier", "requesting_supplier"], axis=1)

In [4]:
# Given the findings in PRMT-1742 (many duplicate EHR errors are misclassified), the code below reclassifies the relevant data as integrated
successful_transfers_bool = transfers_raw['request_completed_ack_codes'].apply(lambda x: True in [(np.isnan(i) or i==15) for i in x])
transfers = transfers_raw.copy()
transfers.loc[successful_transfers_bool, "status"] = "INTEGRATED"


# Part 1: Pending with Error
## Fatal Error Codes - Effect on pending with error transfers

We want to find the:
- Number of transfers with pending with error as the status, alongside the total number of transfers
   - broken down by the (4) Likely Fatal Error: Common errors with no integrations
   - broken down by the (4) Likely Fatal Error: Common errors with no integrations & (2) Seems Fatal: Tiny chance of Integration

### Data set information

In [5]:
start_time = transfers['date_requested'].min()
end_time = transfers['date_requested'].max()

start_date = start_time.date()
end_date = end_time.date()

print(f"Min time of dataset: {start_time}")
print(f"Max time of dataset: {end_time}")

total_number_transfers = transfers["status"].value_counts().sum()
print(f"Total number of transfers: {total_number_transfers}")

print("Breakdown by status:")
transfers["status"].value_counts()

Min time of dataset: 2020-09-01 04:51:16.148000
Max time of dataset: 2021-02-28 23:04:58.544000
Total number of transfers: 1343234
Breakdown by status:


INTEGRATED            1254802
PENDING                 39087
PENDING_WITH_ERROR      26574
FAILED                  22771
Name: status, dtype: int64

In [6]:
transfers_with_pending_bool = transfers.loc[:, "status"] == "PENDING"
transfers_with_pending = transfers.loc[transfers_with_pending_bool]
print("To confirm that no pending transfers have any intermediate error codes")
transfers_with_pending["intermediate_error_codes"].apply(len).value_counts()

To confirm that no pending transfers have any intermediate error codes


0    39087
Name: intermediate_error_codes, dtype: int64

In [7]:
print("To confirm that no pending transfers have a sender error code")
transfers_with_pending["sender_error_code"].value_counts(dropna=False)

To confirm that no pending transfers have a sender error code


NaN    39087
Name: sender_error_code, dtype: int64

### Fatal Errors

In [None]:
transfers_with_pending_with_error_bool = transfers.loc[:, "status"] == "PENDING_WITH_ERROR"
# Take an explicit copy so that adding a column below does not trigger SettingWithCopyWarning
transfers_with_pending_with_error = transfers.loc[transfers_with_pending_with_error_bool].copy()
# Combine the intermediate error codes and the sender error code into a single array per transfer
transfers_with_pending_with_error["intermediate_and_sender_error_codes"] = transfers_with_pending_with_error.apply(lambda row: np.append(row["intermediate_error_codes"], row["sender_error_code"]), axis=1)

print("Total number of transfers with pending with error status:")
print(transfers["status"].value_counts()["PENDING_WITH_ERROR"])

print("Validating transfers_with_pending_with_error data frame is the correct size")
transfers_with_pending_with_error.shape

In [None]:
print('Do pending with error transfers contain fatal error codes? Only error codes which are 100% fatal [PRMT-1622]:')
fatal_error_codes = [10, 6, 7, 24, 99, 15]
transfers_with_fatal_errors_bool = transfers_with_pending_with_error["intermediate_and_sender_error_codes"].apply(lambda interm_error_codes: list(set(interm_error_codes) & set(fatal_error_codes))).apply(len) > 0
transfers_with_fatal_errors_bool.value_counts().iloc[[1,0]]

In [None]:
print('Do pending with error transfers contain fatal error codes? All error codes which are 99%+ fatal [PRMT-1622]:')
extended_fatal_error_codes = fatal_error_codes + [30, 14, 23]
transfers_with_extended_fatal_errors_bool = transfers_with_pending_with_error["intermediate_and_sender_error_codes"].apply(lambda interm_error_codes: list(set(interm_error_codes) & set(extended_fatal_error_codes))).apply(len) > 0
transfers_with_extended_fatal_errors_bool.value_counts()

In [None]:
pd.pivot_table(transfers, index="sender_error_code", columns="status", aggfunc="count", values="conversation_id").fillna(0).astype(int)

From the above figures, it appears that almost all transfers with a 'pending with error' status contain sender error 30 (Large Message general failure) or 14 (Message not sent because requesting practice is not Large Message compliant). Both are large message issues that are deemed to be usually fatal, so we may be able to classify the vast majority of these transfers as failed instead.
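
A minimal sketch of what such a re-designation could look like is shown below. It uses hypothetical toy data with the notebook's column names; the set `assumed_fatal_codes` and the helper `has_fatal_code` are illustrative assumptions, not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the notebook's column names.
toy_transfers = pd.DataFrame(
    {
        "conversation_id": ["a", "b", "c", "d"],
        "status": ["PENDING_WITH_ERROR", "PENDING_WITH_ERROR", "PENDING", "INTEGRATED"],
        "sender_error_code": [30.0, 20.0, np.nan, np.nan],
        "intermediate_error_codes": [[], [29], [], []],
    }
)

# Codes treated as usually fatal for this illustration (an assumption).
assumed_fatal_codes = {30, 14}


def has_fatal_code(row):
    """Return True if the sender code or any intermediate code is in the fatal set."""
    codes = set(row["intermediate_error_codes"])
    if pd.notna(row["sender_error_code"]):
        codes.add(int(row["sender_error_code"]))
    return bool(codes & assumed_fatal_codes)


is_pending_with_error = toy_transfers["status"] == "PENDING_WITH_ERROR"
contains_fatal_code = toy_transfers.apply(has_fatal_code, axis=1)

# Work on a copy so the original statuses remain available for comparison.
redesignated = toy_transfers.copy()
redesignated.loc[is_pending_with_error & contains_fatal_code, "status"] = "FAILED"

print(toy_transfers["status"].value_counts())
print(redesignated["status"].value_counts())
```

Comparing the two `value_counts()` outputs shows how many transfers would move from pending with error to failed under the chosen set of codes.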

#### Given this finding, let's open this up to all error types (e.g. sender, final, intermediate, request completed acknowledgement)

# Part 2: All Error Types
## Error Code, Error Type (Sender/Intermediate/Final) and transfer status
Here we look at all transfers that have any error code (as a sender error code, final error code, or intermediate error code) and at their final transfer status (failed, integrated, pending, or pending with error), in order to see any patterns.
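
For error codes stored as a list per transfer, the general approach is to produce one row per (transfer, code), count each code at most once per transfer, and then tabulate code against status. A minimal sketch on hypothetical toy data, using `pd.crosstab` as a stand-in for the pivot table built later:

```python
import pandas as pd

# Hypothetical toy data: each transfer holds a list of intermediate error codes.
toy_transfers = pd.DataFrame(
    {
        "conversation_id": ["a", "b", "c"],
        "status": ["INTEGRATED", "FAILED", "PENDING_WITH_ERROR"],
        "intermediate_error_codes": [[29, 29], [29, 30], []],
    }
)

# One row per (transfer, code); an empty list becomes NaN and is dropped.
exploded = toy_transfers.explode("intermediate_error_codes").dropna(
    subset=["intermediate_error_codes"]
)

# Count each code at most once per transfer.
unique_codes = exploded.drop_duplicates(
    subset=["conversation_id", "intermediate_error_codes"]
)

# Cross-tabulate error code against the eventual transfer status.
status_by_code = pd.crosstab(
    unique_codes["intermediate_error_codes"].astype(int), unique_codes["status"]
)
print(status_by_code)
```

Without the de-duplication step, transfer "a" would count code 29 twice and inflate the integrated column for that code.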

In [None]:
# Sender Errors
transfers_with_sender_error_bool = transfers["sender_error_code"].apply(lambda sender_error_code: np.isfinite(sender_error_code))
transfers_with_sender_error = transfers[transfers_with_sender_error_bool]
transfers_with_sender_error = transfers_with_sender_error[["sender_error_code", "status"]]
transfers_with_sender_error["Error Type"] = "Sender"
transfers_with_sender_error = transfers_with_sender_error.rename({ "sender_error_code": "Error Code" }, axis=1)

In [None]:
# Final Errors
transfers_with_final_error_bool = transfers["final_error_code"].apply(lambda final_error_code: np.isfinite(final_error_code))
transfers_with_final_error = transfers[transfers_with_final_error_bool]
transfers_with_final_error = transfers_with_final_error[["final_error_code", "status"]]
transfers_with_final_error["Error Type"] = "Final"
transfers_with_final_error = transfers_with_final_error.rename({ "final_error_code": "Error Code" }, axis=1)

In [None]:
# Intermediate Errors
has_intermediate_errors_bool = transfers["intermediate_error_codes"].apply(len) > 0
transfers_with_intermediate_errors_exploded = transfers[has_intermediate_errors_bool].explode("intermediate_error_codes")
# A single transfer may contain the same error code repeatedly - let's only count each one once by dropping duplicates
transfers_with_unique_interm_errors = transfers_with_intermediate_errors_exploded.drop_duplicates(subset=["conversation_id", "intermediate_error_codes"])
transfers_with_unique_interm_errors = transfers_with_unique_interm_errors[["intermediate_error_codes", "status"]]
transfers_with_unique_interm_errors["Error Type"] = "intermediate"
transfers_with_unique_interm_errors = transfers_with_unique_interm_errors.rename({ "intermediate_error_codes": "Error Code" }, axis=1)

In [None]:
# Request Completed Acknowledgement Errors (added by the pipeline branch created for PRMT-1622; "final" error codes that are lost by the current pipeline are stored here)
has_req_ack_errors_bool = transfers['request_completed_ack_codes'].apply(len) > 0
transfers_with_req_ack_errors_exploded = transfers[has_req_ack_errors_bool].explode("request_completed_ack_codes")
# A single transfer may contain the same error code repeatedly - let's only count each one once by dropping duplicates
transfers_with_req_ack_errors = transfers_with_req_ack_errors_exploded.drop_duplicates(subset=["conversation_id", "request_completed_ack_codes"])
transfers_with_req_ack_errors = transfers_with_req_ack_errors[["request_completed_ack_codes", "status"]]
transfers_with_req_ack_errors["Error Type"] = "Request completed acknowledgement"
transfers_with_req_ack_errors = transfers_with_req_ack_errors.rename({ "request_completed_ack_codes": "Error Code" }, axis=1).dropna()

In [None]:
transfers_with_errors = pd.concat([transfers_with_unique_interm_errors, transfers_with_final_error, transfers_with_sender_error,transfers_with_req_ack_errors])
transfers_with_errors["Error Type"].value_counts()

In [None]:
transfers_with_errors["Error Description"] = transfers_with_errors["Error Code"]
transfers_with_errors["Error Description"] = transfers_with_errors["Error Description"].replace(error_code_lookup_file["ErrorCode"].values, error_code_lookup_file["ErrorName"].values)
error_code_summary_pivot_table = pd.pivot_table(transfers_with_errors, index=["Error Code", "Error Description", "Error Type"], columns="status", aggfunc=lambda x: len(x), fill_value=0, margins=True, margins_name="Total")
pd.set_option('display.max_rows', len(error_code_summary_pivot_table))
#error_code_summary_pivot_table

In [None]:
# Groups of error codes, used below to slice the summary pivot table
sender_pending = [6, 7, 10, 14, 23, 24]
failure_when_sender = [30]
sender_mixed_outcome = [19, 20]
intermediate_mixed_outcome = [29]
end_mixed_outcome = [11, 12, 31]
end_failures = [17, 21, 25, 9, 99]
distinct = [15, 205]

In [None]:
# All Sender Error Codes: Almost always end up Pending with Error
error_code_summary_pivot_table.loc[sender_pending]

In [None]:
# 30 is a mix but when it is a sender error code, it almost always ends up Pending with Error
error_code_summary_pivot_table.loc[failure_when_sender]

In [None]:
error_code_summary_pivot_table.loc[sender_mixed_outcome]

In [None]:
error_code_summary_pivot_table.loc[intermediate_mixed_outcome]

In [None]:
error_code_summary_pivot_table.loc[end_mixed_outcome]

In [None]:
error_code_summary_pivot_table.loc[end_failures]

In [None]:
error_code_summary_pivot_table.loc[distinct]

### Verification of numbers

In [None]:
print("To verify above values")
transfers_with_sender_error["Error Code"].value_counts()

In [None]:
transfers_with_unique_interm_errors["Error Code"].value_counts()

In [None]:
transfers_with_final_error["Error Code"].value_counts()
