### PRMT-2039 ["HYPOTHESIS"] Table showing the distribution of the number of messages per pathway and status

We have so far assumed that each transfer status is associated with a particular number of messages:
e.g. a Pending transfer would typically be expected to have 3 messages, and an Integrated transfer would usually have 4.

However, when inspecting some Vision to Vision transfers, we have seen Pending transfers with fewer than 3 messages.

By generating a full table:
- for 6 months of transfers,
- broken down by status and supplier pathway,
- showing what % of transfers had what number of messages,

we can:
- check the degree to which this assumption holds,
- identify areas (e.g. by pathway and/or status) for further investigation.

This may allow us to redefine our statuses from their current 4.




In [1]:
import pandas as pd
import numpy as np
# Using data generated from branch PRMT-1742-duplicates-analysis.
# This is needed to correctly handle duplicates.
# Once the upstream pipeline has a fix for duplicate EHRs, then we can go back to using the main output.
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-duplicates-hypothesis/"
transfer_files = [
    "9-2020-transfers.parquet",
    "10-2020-transfers.parquet",
    "11-2020-transfers.parquet",
    "12-2020-transfers.parquet",
    "1-2021-transfers.parquet",
    "2-2021-transfers.parquet"
]

transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))

# In the data from the PRMT-1742-duplicates-analysis branch, these columns have been added but contain only empty values.
transfers_raw = transfers_raw.drop(["sending_supplier", "requesting_supplier"], axis=1)


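# PRMT-1742 found that many duplicate EHR errors are misclassified; the code below reclassifies the affected transfers.
# A transfer counts as integrated if at least one request-completed ack code is missing (NaN) or is 15.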
has_at_least_one_successful_integration_code = lambda errors: any((np.isnan(error) or error==15 for error in errors))
successful_transfers_bool = transfers_raw['request_completed_ack_codes'].apply(has_at_least_one_successful_integration_code)
transfers = transfers_raw.copy()
transfers.loc[successful_transfers_bool, "status"] = "INTEGRATED"

# Correctly interpret certain sender errors as failed.
# This is explained in PRMT-1974. Eventually this will be fixed upstream in the pipeline.
pending_sender_error_codes=[6,7,10,24,30,23,14,99]
transfers_with_pending_sender_code_bool=transfers['sender_error_code'].isin(pending_sender_error_codes)
transfers_with_pending_with_error_bool=transfers['status']=='PENDING_WITH_ERROR'
transfers_which_need_pending_to_failure_change_bool=transfers_with_pending_sender_code_bool & transfers_with_pending_with_error_bool
transfers.loc[transfers_which_need_pending_to_failure_change_bool,'status']='FAILED'

# Add an INTEGRATED LATE status for transfers integrated after 8 days
eight_days_in_seconds=8*24*60*60
transfers_after_sla_bool=transfers['sla_duration']>eight_days_in_seconds
transfers_with_integrated_bool=transfers['status']=='INTEGRATED'
transfers_integrated_late_bool=transfers_after_sla_bool & transfers_with_integrated_bool
transfers.loc[transfers_integrated_late_bool,'status']='INTEGRATED LATE'

# If the record was integrated after 28 days, change the status back to pending or pending with error.
# This is to handle each month consistently and to always reflect a transfer's status 28 days after it was made.
# TBD how this is handled upstream in the pipeline
twenty_eight_days_in_seconds=28*24*60*60
transfers_after_month_bool=transfers['sla_duration']>twenty_eight_days_in_seconds
transfers_pending_at_month_bool=transfers_after_month_bool & transfers_integrated_late_bool
transfers.loc[transfers_pending_at_month_bool,'status']='PENDING'
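# An early error is a sender error code or at least one intermediate error code.
# The length comparison is parenthesised without a leading `~`: `~` binds more tightly than `>`,
# so `~x.apply(len)>0` evaluates `(-len-1) > 0`, which is always False.
transfers_with_early_error_bool=(~transfers.loc[:,'sender_error_code'].isna()) | (transfers.loc[:,'intermediate_error_codes'].apply(len)>0)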
transfers.loc[transfers_with_early_error_bool & transfers_pending_at_month_bool,'status']='PENDING_WITH_ERROR'

# Supplier name mapping
supplier_renaming = {
    "EGTON MEDICAL INFORMATION SYSTEMS LTD (EMIS)":"EMIS",
    "IN PRACTICE SYSTEMS LTD":"Vision",
    "MICROTEST LTD":"Microtest",
    "THE PHOENIX PARTNERSHIP":"TPP",
    None: "Unknown"
}

asid_lookup_file = "s3://prm-gp2gp-data-sandbox-dev/asid-lookup/asidLookup-Mar-2021.csv.gz"
asid_lookup = pd.read_csv(asid_lookup_file)
lookup = asid_lookup[["ASID", "MName", "NACS","OrgName"]]

transfers = transfers.merge(lookup, left_on='requesting_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'requesting_supplier', 'ASID': 'requesting_supplier_asid', 'NACS': 'requesting_ods_code','OrgName':'requesting_practice_name'}, axis=1)
transfers = transfers.merge(lookup, left_on='sending_practice_asid',right_on='ASID',how='left')
transfers = transfers.rename({'MName': 'sending_supplier', 'ASID': 'sending_supplier_asid', 'NACS': 'sending_ods_code','OrgName':'sending_practice_name'}, axis=1)

transfers["sending_supplier"] = transfers["sending_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
transfers["requesting_supplier"] = transfers["requesting_supplier"].replace(supplier_renaming.keys(), supplier_renaming.values())
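
The status adjustments in the cell above can be sanity-checked on a toy frame. The sketch below is illustrative only: the rows, the durations and the intermediate error code 29 are invented, and it mirrors the intended 8-day and 28-day rules rather than reusing the notebook's variables.

```python
import numpy as np
import pandas as pd

EIGHT_DAYS = 8 * 24 * 60 * 60
TWENTY_EIGHT_DAYS = 28 * 24 * 60 * 60
DAY = 24 * 60 * 60

# Invented rows: integrated on time, integrated late, integrated after
# 28 days with no early error, and after 28 days with an intermediate error.
toy = pd.DataFrame({
    "status": ["INTEGRATED"] * 4,
    "sla_duration": [1 * DAY, 10 * DAY, 30 * DAY, 40 * DAY],
    "sender_error_code": [np.nan, np.nan, np.nan, np.nan],
    "intermediate_error_codes": [[], [], [], [29]],
})

integrated = toy["status"] == "INTEGRATED"
late = integrated & (toy["sla_duration"] > EIGHT_DAYS)
after_month = late & (toy["sla_duration"] > TWENTY_EIGHT_DAYS)
# Parenthesise the comparison: `~` binds more tightly than `>`.
early_error = toy["sender_error_code"].notna() | (
    toy["intermediate_error_codes"].apply(len) > 0
)

toy.loc[late, "status"] = "INTEGRATED LATE"
toy.loc[after_month, "status"] = "PENDING"
toy.loc[after_month & early_error, "status"] = "PENDING_WITH_ERROR"

print(toy["status"].tolist())
```

Each of the four rows should end up with a different status, which makes it easy to spot a rule that never fires.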

In [2]:
# Import the message data - output of PRMT-2038 (see 33-PRMT-2038-generate-new-fields-raw-splunk-data notebook)
interaction_file_name='s3://prm-gp2gp-data-sandbox-dev/extra-fields-data-from-splunk/Sept_20_Feb_21_conversations_interaction_messages.parquet'
message_lists=pd.read_parquet(interaction_file_name)

In [3]:
# Merge the datasets
combined_transfers=transfers.merge(message_lists,left_on='conversation_id',right_index=True ,how='left')
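
Because this is a left join, any transfer whose `conversation_id` is missing from the message data ends up with a missing `interaction name`, and `len` fails on such a value. A minimal sketch of how that could be checked, using invented stand-ins for `transfers` and `message_lists` (the frame names and values below are hypothetical):

```python
import pandas as pd

# Invented stand-ins for `transfers` and `message_lists`.
toy_transfers = pd.DataFrame({"conversation_id": ["a", "b", "c"]})
toy_messages = pd.DataFrame(
    {"interaction name": [["req", "ack"], ["req"]]},
    index=["a", "b"],
)

merged = toy_transfers.merge(
    toy_messages,
    left_on="conversation_id",
    right_index=True,
    how="left",
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Transfers with no matching message list.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))

# A missing value is a float NaN with no length, so count messages only
# where a list-like is present and treat the rest as 0 messages.
lengths = merged["interaction name"].apply(
    lambda names: len(names) if hasattr(names, "__len__") else 0
)
print(lengths.tolist())
```

Whether unmatched transfers should count as 0 messages or be excluded is a judgement call; the sketch only shows how to make them visible.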

In [4]:
# Add in the number of messages as the main field
combined_transfers['interaction length']=combined_transfers['interaction name'].apply(len)

In [5]:
# Generate a table to count the transfers of each message length broken down by pathway/status
message_count_table=combined_transfers.pivot_table(index=['requesting_supplier','sending_supplier','status'],columns='interaction length',values='conversation_id',aggfunc='count').fillna(0).astype(int)
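
To see what the pivot produces, here is the same call on a small invented frame (five made-up transfers). It is a sketch of the shape of the result, not real data.

```python
import pandas as pd

# Five invented transfers with a precomputed message count.
toy = pd.DataFrame({
    "conversation_id": ["a", "b", "c", "d", "e"],
    "requesting_supplier": ["EMIS", "EMIS", "EMIS", "TPP", "TPP"],
    "sending_supplier": ["EMIS", "EMIS", "EMIS", "EMIS", "EMIS"],
    "status": ["INTEGRATED", "INTEGRATED", "PENDING", "INTEGRATED", "INTEGRATED"],
    "interaction length": [4, 4, 3, 4, 7],
})

# One row per pathway/status, one column per observed message count.
counts = (
    toy.pivot_table(
        index=["requesting_supplier", "sending_supplier", "status"],
        columns="interaction length",
        values="conversation_id",
        aggfunc="count",
    )
    .fillna(0)  # combinations that never occur would otherwise be NaN
    .astype(int)
)
print(counts)
```

Only message counts that actually occur become columns, which is why the real table jumps from small counts to values in the thousands.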

In [6]:
message_count_table

requesting_supplier,sending_supplier,status,1,2,3,4,5,6,7,8,9,10,...,3381,3413,3585,3711,4229,4845,4862,5033,5897,5960
EMIS,EMIS,FAILED,0,3461,74,878,220,286,419,132,123,118,...,0,0,0,0,0,0,1,0,0,0
EMIS,EMIS,INTEGRATED,0,0,210,442995,774,135,47498,208,29284,983,...,1,1,1,1,1,1,0,1,1,0
EMIS,EMIS,INTEGRATED LATE,0,0,9,27119,73,17,3383,22,1630,148,...,0,0,0,0,0,0,0,0,0,0
EMIS,EMIS,PENDING,1883,27,3554,5741,85,326,636,229,355,158,...,0,0,0,0,0,0,0,0,0,0
EMIS,EMIS,PENDING_WITH_ERROR,0,1884,23,12,16,7,1,2,1,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Vision,Unknown,PENDING,2,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Vision,Vision,FAILED,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Vision,Vision,INTEGRATED,0,0,1,1745,12,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Vision,Vision,INTEGRATED LATE,0,0,0,122,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# For each pathway/status, what is the percentage in each message length?
message_count_table_percentages=message_count_table.div(message_count_table.sum(axis=1),axis=0).multiply(100).round(2)
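
The row-wise normalisation can be illustrated on a tiny invented count table. Each row is divided by its own total, so every row describes the distribution of message counts within one pathway/status.

```python
import pandas as pd

# Invented counts: columns are message counts, rows are statuses.
counts = pd.DataFrame(
    {3: [1, 0], 4: [3, 1]},
    index=pd.Index(["PENDING", "INTEGRATED"], name="status"),
)

row_totals = counts.sum(axis=1)
# axis=0 aligns the totals with the row index, so each row is divided by its own total.
percentages = counts.div(row_totals, axis=0).multiply(100).round(2)
print(percentages)
```

Because of the rounding to two decimal places, rows in the real table may not sum to exactly 100.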

In [8]:
message_count_table_percentages.reset_index()

interaction length,requesting_supplier,sending_supplier,status,1,2,3,4,5,6,7,...,3381,3413,3585,3711,4229,4845,4862,5033,5897,5960
0,EMIS,EMIS,FAILED,0.00,42.42,0.91,10.76,2.70,3.51,5.14,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
1,EMIS,EMIS,INTEGRATED,0.00,0.00,0.03,61.67,0.11,0.02,6.61,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
2,EMIS,EMIS,INTEGRATED LATE,0.00,0.00,0.02,62.76,0.17,0.04,7.83,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
3,EMIS,EMIS,PENDING,11.51,0.17,21.73,35.11,0.52,1.99,3.89,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
4,EMIS,EMIS,PENDING_WITH_ERROR,0.00,93.13,1.14,0.59,0.79,0.35,0.05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92,Vision,Unknown,PENDING,66.67,0.00,0.00,33.33,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
93,Vision,Vision,FAILED,0.00,50.00,50.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
94,Vision,Vision,INTEGRATED,0.00,0.00,0.06,99.09,0.68,0.06,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
95,Vision,Vision,INTEGRATED LATE,0.00,0.00,0.00,97.60,2.40,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0
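
With columns running up to 5960 messages, the full table is hard to scan. One possible way to check the "Pending has 3, Integrated has 4" assumption is to collapse the long tail into a single bucket. The sketch below uses invented counts, and the cut-off at 4 is an assumption rather than something taken from the analysis.

```python
import pandas as pd

# Invented counts: columns are message counts, rows are statuses.
counts = pd.DataFrame(
    {1: [0, 2], 3: [1, 5], 4: [8, 1], 7: [1, 0], 5960: [0, 1]},
    index=pd.Index(["INTEGRATED", "PENDING"], name="status"),
)

def bucket(n):
    """Keep 1 to 4 as their own buckets and group everything larger as '5+'."""
    return str(n) if n <= 4 else "5+"

# Group the columns by transposing, grouping on the index, and transposing back.
bucketed = counts.T.groupby(bucket).sum().T
bucketed_percentages = (
    bucketed.div(bucketed.sum(axis=1), axis=0).multiply(100).round(2)
)
print(bucketed_percentages)
```

The same bucketing could be applied to `message_count_table` before computing percentages, keeping one row per pathway and status.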


In [9]:
# Uncomment to output CSV to S3 location
# message_count_table_percentages.reset_index().to_csv('s3://prm-gp2gp-data-sandbox-dev/notebook-outputs/34--PRMT-2039-message-length-counts-by-pathway-and-status.csv')