# PRMT-2183 Is it worth using Attachment MIDs in the pipeline to categorise transfers?

## Hypothesis

**We believe that** we should use attachment MID data to more accurately classify transfers.

**We will know this to be true** when we see a significant number of transfers that are currently understood to be "pending" be re-categorised into either "pending integration" or "still transferring".

Alternatively: how many transfers actually get stuck waiting for COPC large message fragments?

## Approach

Take a sample of pending transfers that include some number of large attachment COPC fragments, all of which are acknowledged.
Measure how many of these GP2GP conversations appear to have sent all COPC messages that are referenced as attachment MIDs in the core EHR message.

See: https://gpitbjss.atlassian.net/wiki/spaces/TW/pages/2552529087/EHR+structure+in+GP2GP

As the data we are analysing is "in the past", a conversation that is still pending was not a successful outcome for the patient.

To simplify this analysis we will use only transfers with no duplicate core EHR (error code 12).

So, using spine data we will calculate:
1. Number of Conversations
2. Number of Conversations without any error code 12
3. Number of above which have core extract received, but do not have final ack message (transfers that would not be categorised as anything else under our new status proposal).
4. Number of above that have COPC messages according to Attachment MIDs
5. Number of above where all COPCs sent are acknowledged
6. Number of the above for which the number of COPC sent is less than the number in Attachment MID 
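
Steps 1 to 3 reduce to set operations on conversation IDs. The following is a minimal sketch in plain Python on invented messages, assuming the spine column names used in the cells of this notebook (`conversationID`, `interaction_name`, `GUID`, `messageRef`, `jdiEvent`). The `msg` helper and every value are hypothetical, and the real analysis uses pandas rather than sets.

```python
def msg(conversation_id, name, guid, message_ref=None, jdi_event=None):
    """Build one toy spine message (hypothetical helper, invented values)."""
    return {
        "conversationID": conversation_id,
        "interaction_name": name,
        "GUID": guid,
        "messageRef": message_ref,
        "jdiEvent": jdi_event,
    }


messages = [
    # c1: core EHR sent and then referenced by a later message
    msg("c1", "req start", "m1"),
    msg("c1", "req complete", "m2"),
    msg("c1", "ack", "m3", message_ref="m2"),
    # c2: core EHR sent, nothing ever references it
    msg("c2", "req start", "m4"),
    msg("c2", "req complete", "m5"),
    # c3: duplicate core EHR, rejected with error code 12
    msg("c3", "req start", "m6"),
    msg("c3", "req complete", "m7"),
    msg("c3", "ack", "m8", message_ref="m7", jdi_event="12"),
]

# 1. Conversations that started within the dataset
started = {m["conversationID"] for m in messages if m["interaction_name"] == "req start"}

# 2. Remove conversations containing any error code 12
with_error_12 = {m["conversationID"] for m in messages if m["jdiEvent"] == "12"}
without_error_12 = started - with_error_12

# 3. Core extract sent, but no message refers back to it
ehr_messages = [
    m for m in messages
    if m["conversationID"] in without_error_12 and m["interaction_name"] == "req complete"
]
ehr_guids = {m["GUID"] for m in ehr_messages}
with_ehr = {m["conversationID"] for m in ehr_messages}
with_ehr_ack = {m["conversationID"] for m in messages if m["messageRef"] in ehr_guids}
without_ehr_ack = with_ehr - with_ehr_ack

print(len(started), len(without_error_12), sorted(without_ehr_ack))
```

Here only `c2` survives to step 3: `c1` has a message referencing its core EHR and `c3` is excluded by the error code 12 filter.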

These numbers correspond roughly to this categorisation hierarchy:

```
Looking back at a collection of transfers, we could categorise them as follows

-> A - Completed (successfully or otherwise)
-> B - Not Completed (aka "pending")
  -> C - EHR was not sent or request was not acknowledged 
  -> D - EHR was sent (This is more or less our current level of determination)
    These next categorisations are viable using the attachment dataset:
    -> E - Those where all the attachment fragments were sent (sending complete)
    -> F - Those where some attachment fragments were not sent (sending incomplete)
    -> G - ???
```
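
The hierarchy can be read as a decision function over a few per-conversation facts. This is only a sketch: `ConversationSummary` and `categorise` are hypothetical names, not part of the pipeline. It assumes that a final ack marks completion, that a conversation with no expected fragments counts as "sending complete", and it leaves the open category G out entirely.

```python
from dataclasses import dataclass


@dataclass
class ConversationSummary:
    """Hypothetical per-conversation facts; not a structure from the pipeline."""
    ehr_sent: bool        # core EHR ("req complete") was seen
    has_final_ack: bool   # a message references the core EHR
    copcs_expected: int   # distinct attachment MIDs for the conversation
    copcs_sent: int       # distinct COPC messages sent by the Sender


def categorise(summary: ConversationSummary) -> str:
    """Return the leaf category letter from the hierarchy above."""
    if summary.has_final_ack:
        return "A"  # completed, successfully or otherwise
    if not summary.ehr_sent:
        return "C"  # pending: EHR not sent or request not acknowledged
    if summary.copcs_sent < summary.copcs_expected:
        return "F"  # pending: some attachment fragments were not sent
    return "E"      # pending: all attachment fragments were sent


print(categorise(ConversationSummary(ehr_sent=True, has_final_ack=False,
                                     copcs_expected=3, copcs_sent=2)))
```

Categories B and D are not returned because they are parents of the leaves: B is C, E and F together, and D is E and F together.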

Two ways of interpreting the impact of the enhancement:

1. We can now correctly categorise cases where some COPC messages were not sent and the transfer got stuck - size of category F

This would be calculated by taking (6) as a percentage of (2).

2. By process of elimination, we could potentially categorise any transfer where the EHR was sent but the final ack is missing as "stuck" or "not stuck", depending on whether we are still waiting for the transfer to complete. - size of category D

This would only be feasible if there is no other significant variation within category D (category G), e.g. attachment fragments that are sent but not acked.

This would be calculated by taking (3) as a percentage of (2).

In [1]:
import pandas as pd

In [2]:
# Raw gp2gp spine data

gp2gp_spine_data_files = [
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Mar-2021.csv.gz",
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Apr-2021.csv.gz"
]

gp2gp_spine = pd.concat((
    pd.read_csv(f, parse_dates=["_time"])
    for f in gp2gp_spine_data_files
))

interaction_name_mapping = {
    "urn:nhs:names:services:gp2gp/RCMR_IN010000UK05": "req start",
    "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06": "req complete",
    "urn:nhs:names:services:gp2gp/COPC_IN000001UK01": "COPC",
    "urn:nhs:names:services:gp2gp/MCCI_IN010000UK13": "ack"
}

gp2gp_spine['interaction_name']=gp2gp_spine['interactionID'].replace(interaction_name_mapping)
gp2gp_spine = gp2gp_spine.drop_duplicates()

Splunk query used to extract the attachment mid data:
```sql
index="spine2vfmmonitor" logReference="MPS0208" attachmentType="mid"
| table _time, attachmentID, conversationID
```

In [3]:
attachment_mids_folder="s3://prm-gp2gp-data-sandbox-dev/43-PRMT-2167-attachment-mids/"
attachment_mids_files=["attachment_mids_april_2021.csv","attachment_mids_march_2021.csv"]
attachment_mids=pd.concat([pd.read_csv(attachment_mids_folder + file) for file in attachment_mids_files])

In [4]:
# 1. Filtering for conversations that have started within the dataset
all_messages = gp2gp_spine.copy()
conversation_ids_with_req_start = all_messages.loc[all_messages['interaction_name']=='req start','conversationID'].unique()
messages_from_started_conversations = all_messages[all_messages["conversationID"].isin(conversation_ids_with_req_start)]

print(f"Total number of conversations: {conversation_ids_with_req_start.shape[0]}")

Total number of conversations: 545063


In [5]:
# 2. Filtering for conversations that do not have error code 12
is_message_with_error_code_12 = messages_from_started_conversations['jdiEvent']=='12'

conversation_ids_with_error_code_12 = messages_from_started_conversations.loc[is_message_with_error_code_12,'conversationID'].unique()
conversation_ids_without_error_code_12 = list(set(messages_from_started_conversations['conversationID']) - set(conversation_ids_with_error_code_12))

messages_from_conversations_without_duplicate_ehr_bool = messages_from_started_conversations["conversationID"].isin(conversation_ids_without_error_code_12)
messages_from_conversations_without_duplicate_ehr = messages_from_started_conversations[messages_from_conversations_without_duplicate_ehr_bool]

print(f"Total number of conversations without error code 12: {len(conversation_ids_without_error_code_12)}")

Total number of conversations without error code 12: 537440


In [6]:
# 3. Conversations that have core extract received, but do not have final ack message

# First filtering for conversations with core extract
is_ehr_message = messages_from_conversations_without_duplicate_ehr['interaction_name']=='req complete'
conversation_ids_with_core_extract = messages_from_conversations_without_duplicate_ehr.loc[is_ehr_message,'conversationID'].unique() 
is_message_in_conversation_with_ehr = messages_from_conversations_without_duplicate_ehr["conversationID"].isin(conversation_ids_with_core_extract)
messages_from_conversations_with_ehr = messages_from_conversations_without_duplicate_ehr[is_message_in_conversation_with_ehr]

# Second filtering for conversations that do not have final ack message
ids_of_req_complete_messages = messages_from_conversations_with_ehr.loc[messages_from_conversations_with_ehr['interaction_name']=='req complete','GUID'].unique()

is_message_ehr_ack = messages_from_conversations_with_ehr["messageRef"].isin(ids_of_req_complete_messages)
conversation_ids_with_ehr_ack = messages_from_conversations_with_ehr.loc[is_message_ehr_ack, "conversationID"]
conversation_ids_without_ehr_ack= list(set(messages_from_conversations_with_ehr['conversationID']) - set(conversation_ids_with_ehr_ack))

is_message_in_conversation_without_ehr_ack = messages_from_conversations_with_ehr["conversationID"].isin(conversation_ids_without_ehr_ack)
messages_from_conversations_without_ehr_ack = messages_from_conversations_with_ehr[is_message_in_conversation_without_ehr_ack]

print(f"Total number of conversations that have core extract received, but do not have final ack message: {len(conversation_ids_without_ehr_ack)}")

Total number of conversations that have core extract received, but do not have final ack message: 24895


In [7]:
# 4. Number of above that have COPC messages according to Attachment MID
all_conversations_with_attachment_ids = attachment_mids["conversationID"].unique()

is_message_in_conversation_with_large_attachment = messages_from_conversations_without_ehr_ack["conversationID"].isin(all_conversations_with_attachment_ids)
messages_from_conversations_with_large_attachments = messages_from_conversations_without_ehr_ack[is_message_in_conversation_with_large_attachment]
count_of_conversations_that_should_have_copcs = messages_from_conversations_with_large_attachments["conversationID"].unique().shape[0]

print("Total number of conversations that have COPC messages according to attachment MID:")
count_of_conversations_that_should_have_copcs

Total number of conversations that have COPC messages according to attachment MID:


10865

In [8]:
is_message_copc = messages_from_conversations_with_large_attachments['interaction_name']=='COPC'
conversations_with_copc = messages_from_conversations_with_large_attachments.loc[is_message_copc, 'conversationID'].unique()

is_message_in_conversation_with_copc = messages_from_conversations_with_large_attachments['conversationID'].isin(conversations_with_copc)
messages_from_conversations_with_copcs = messages_from_conversations_with_large_attachments[is_message_in_conversation_with_copc]

print(f"Total number of conversations that actually have any COPC messages: ")
print(messages_from_conversations_with_copcs["conversationID"].unique().shape[0])

print(f"Total number of conversations that expect to have COPC messages but don't ")
print(count_of_conversations_that_should_have_copcs - messages_from_conversations_with_copcs["conversationID"].unique().shape[0])

Total number of conversations that actually have any COPC messages: 
10436
Total number of conversations that expect to have COPC messages but don't 
429


In [9]:
# 5. Number of above where all COPCs sent are acknowledged (that are sent by Sender)

# Add in whether message is sent from Sender or Requester
is_req_started_message = messages_from_conversations_with_copcs['interaction_name']=='req start'
requester_lookup = messages_from_conversations_with_copcs.loc[is_req_started_message, ['messageSender', 'conversationID']].rename({"messageSender": "requester"}, axis=1)
messages_from_conversations_with_copcs = messages_from_conversations_with_copcs.merge(requester_lookup, left_on="conversationID", right_on="conversationID", how="left")

messages_from_conversations_with_copcs["Message sender type"] = "Sender"
is_message_from_requester = messages_from_conversations_with_copcs["messageSender"] == messages_from_conversations_with_copcs["requester"]
messages_from_conversations_with_copcs.loc[is_message_from_requester, "Message sender type"] = "Requester" 

# Filtering for conversations where COPCs are sent by the Sender
is_copc_message_from_sender = (messages_from_conversations_with_copcs['interaction_name']=='COPC') & (messages_from_conversations_with_copcs['Message sender type'] == "Sender")
copc_message_guids = messages_from_conversations_with_copcs.loc[is_copc_message_from_sender,'GUID'].unique()
all_messagerefs = messages_from_conversations_with_copcs.loc[messages_from_conversations_with_copcs['interaction_name']=='ack','messageRef'].unique()

copcs_guids_without_ack = list(set(copc_message_guids) - set(all_messagerefs))

messages_from_conversations_missing_copc_ack = messages_from_conversations_with_copcs["GUID"].isin(copcs_guids_without_ack)
conversation_ids_copcs_without_ack = messages_from_conversations_with_copcs.loc[messages_from_conversations_missing_copc_ack, "conversationID"].unique()

# All conversations in which the Sender sent at least one COPC
is_message_a_sender_copc = (messages_from_conversations_with_copcs['Message sender type'] == "Sender") & (messages_from_conversations_with_copcs['interaction_name']=='COPC')
conversation_ids_with_sender_copcs = messages_from_conversations_with_copcs.loc[is_message_a_sender_copc, "conversationID"].unique()

# Keep only those conversations where no sender COPC is missing an ack
conversation_ids_copcs_with_ack = list(set(conversation_ids_with_sender_copcs) - set(conversation_ids_copcs_without_ack))

is_message_in_conversation_with_sender_copcs_acked = messages_from_conversations_with_copcs["conversationID"].isin(conversation_ids_copcs_with_ack)
messages_from_conversations_with_copcs_acked = messages_from_conversations_with_copcs[is_message_in_conversation_with_sender_copcs_acked]

print(f"Total number of conversations where all COPCs sent are acknowledged:")
messages_from_conversations_with_copcs_acked["conversationID"].unique().shape[0]

Total number of conversations where all COPCs sent are acknowledged:


10164
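
The logic of step 5 is easier to see on a toy example: a conversation passes only if none of its sender COPCs lacks an ack that references it. A minimal sketch in plain Python, with invented IDs and hypothetical variable names:

```python
# (conversationID, GUID) for each COPC fragment sent by the Sender (invented data)
sender_copcs = [
    ("c1", "f1"),
    ("c1", "f2"),
    ("c2", "f3"),
    ("c2", "f4"),
]

# messageRef values carried by the ack messages that were seen (invented data)
acked_refs = {"f1", "f2", "f3"}

# A conversation is excluded as soon as one of its sender COPCs has no ack
conversations_with_unacked_copc = {
    conversation_id for conversation_id, guid in sender_copcs if guid not in acked_refs
}
conversations_with_sender_copcs = {conversation_id for conversation_id, _ in sender_copcs}
conversations_all_acked = conversations_with_sender_copcs - conversations_with_unacked_copc

print(sorted(conversations_all_acked))
```

In this example `c2` is dropped because fragment `f4` was never acknowledged, which is the kind of case category G would need to cover.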

In [10]:
# 6. Number of the above for which the number of COPC sent is less than the number in Attachment MID
copc_expected = attachment_mids.drop("_time", axis=1).drop_duplicates().groupby("conversationID").agg("count").rename({"attachmentID": "Number of COPCs expected"}, axis=1).fillna(0)

is_message_acked_sender_copc = (messages_from_conversations_with_copcs_acked['interaction_name']=='COPC') & (messages_from_conversations_with_copcs_acked['Message sender type'] == "Sender")
copcs_seen = messages_from_conversations_with_copcs_acked.loc[is_message_acked_sender_copc, ["conversationID", "GUID"]].fillna(0)
copcs_seen = copcs_seen.drop_duplicates().groupby("conversationID").agg("count").rename({"GUID": "Number of COPCs seen"}, axis=1).fillna(0)

copc_comparison_table = copc_expected.merge(copcs_seen, left_index=True, right_index=True, how="right").fillna(0)
missing_copc_messages = (copc_comparison_table["Number of COPCs seen"] < copc_comparison_table["Number of COPCs expected"]).value_counts()

print(f"Number of the above for which the number of COPC sent is less than the number in Attachment MID:")
missing_copc_messages[True]

Number of the above for which the number of COPC sent is less than the number in Attachment MID:


260
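
The comparison in step 6 boils down to counting distinct attachment MIDs and distinct sender COPCs per conversation. A minimal sketch with `collections.Counter` on invented data (all names and values are hypothetical); like the right merge above, it only considers conversations that have at least one sender COPC:

```python
from collections import Counter

# (conversationID, attachmentID) rows; the last row is a deliberate duplicate
attachment_mid_rows = [
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "b1"), ("c2", "b2"), ("c2", "b3"), ("c2", "b3"),
]

# (conversationID, GUID) for acknowledged COPCs sent by the Sender
sender_copc_rows = [
    ("c1", "f1"), ("c1", "f2"),
    ("c2", "f3"),
]

# De-duplicate before counting, mirroring drop_duplicates()
copcs_expected = Counter(conversation_id for conversation_id, _ in set(attachment_mid_rows))
copcs_seen = Counter(conversation_id for conversation_id, _ in set(sender_copc_rows))

# Counter returns 0 for unknown keys, which plays the role of fillna(0)
conversations_missing_copcs = {
    conversation_id
    for conversation_id in copcs_seen
    if copcs_seen[conversation_id] < copcs_expected[conversation_id]
}

print(sorted(conversations_missing_copcs))
```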

In [11]:
missing_copc_messages

False    9904
True      260
dtype: int64
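
To see the whole funnel in one place, the counts printed by the cells above can be laid out with each step as a share of the previous one. This is a presentation sketch only: the counts are copied from the outputs above and the labels are paraphrased.

```python
# Counts copied from the cell outputs above
funnel = [
    ("1. Conversations", 545063),
    ("2. Without error code 12", 537440),
    ("3. Core extract sent, no final ack", 24895),
    ("4. With attachment MIDs", 10865),
    ("   Any COPC message actually seen", 10436),
    ("5. All sender COPCs acknowledged", 10164),
    ("6. Fewer COPCs seen than expected", 260),
]

rows = []
previous = None
for label, count in funnel:
    # Share of the previous step; undefined for the first row
    share = None if previous is None else count / previous
    rows.append((label, count, share))
    previous = count

for label, count, share in rows:
    share_text = "" if share is None else f"{share:.2%} of previous step"
    print(f"{label:<38}{count:>8}  {share_text}")
```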

## Findings: Impact of the enhancement

1. We can now correctly categorise cases where some COPC messages were not sent and the transfer got stuck - size of category F

In [12]:
print("% of transfers that would be re-categorised as a consequence of using attachment MIDs data")
(missing_copc_messages[True] / len(conversation_ids_without_error_code_12)) * 100

% of transfers that would be re-categorised as a consequence of using attachment MIDs data


0.04837749330157785

2. By process of elimination, we could potentially categorise any transfer where the EHR was sent but the final ack is missing as "stuck" or "not stuck", depending on whether we are still waiting for the transfer to complete. - size of category D

This would only be feasible if there is no other significant variation within category D (category G), e.g. attachment fragments that are sent but not acked.

This would be calculated by taking (3) as a percentage of (2).

In [13]:
print(f"{(len(conversation_ids_without_ehr_ack) / len(conversation_ids_without_error_code_12)) * 100}%")

4.632144983626079%
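
Both interpretations use the same arithmetic, so they can be checked side by side from the counts reported above. The sketch below also derives category F as a share of category D, a figure that is not computed elsewhere in this notebook; `as_percentage` is a hypothetical helper.

```python
# Counts copied from the cell outputs above
conversations_without_error_12 = 537440  # step 2
ehr_sent_no_final_ack = 24895            # step 3, category D
missing_copcs = 260                      # step 6, category F


def as_percentage(part, whole):
    """Express part as a percentage of whole."""
    return part / whole * 100


category_f = as_percentage(missing_copcs, conversations_without_error_12)
category_d = as_percentage(ehr_sent_no_final_ack, conversations_without_error_12)
f_within_d = as_percentage(missing_copcs, ehr_sent_no_final_ack)

print(f"Category F as % of (2): {category_f:.3f}%")
print(f"Category D as % of (2): {category_d:.3f}%")
print(f"Category F as % of D:   {f_within_d:.2f}%")
```

On these numbers category F is roughly 1% of category D, so almost all of the transfers in D are not explained by missing fragments.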


## Addendum

Context

We recently did analysis using the attachment MIDs data to determine whether it would help us distinguish transfers that have fully transferred from transfers that have missing attachments. We identified a small subset of transfers that had not fully transferred.

Scope

Perform analysis on a sample of transfers (longer than 1 month, maybe 3?) to identify any patterns in these transfers.

Are they specific to one supplier?

Are they across a small group of practices, or across many?

Anything else?

In [14]:
missing_copc_messages_bool = (copc_comparison_table["Number of COPCs seen"] < copc_comparison_table["Number of COPCs expected"])
conversations_with_missing_copcs=copc_comparison_table[missing_copc_messages_bool].index

In [15]:
transfer_file_location = "s3://prm-gp2gp-data-sandbox-dev/transfers-sample-6/"
transfer_files = [
    "2021-3-transfers.parquet",
    "2021-4-transfers.parquet",
]
transfer_input_files = [transfer_file_location + f for f in transfer_files]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_input_files
))
transfers=transfers_raw.copy().set_index("conversation_id")

In [16]:
transfers_by_supplier_pathway=transfers.groupby(by=["sending_supplier", "requesting_supplier"]).agg({"date_requested": "count"}).rename({"date_requested": "Number of Transfers"}, axis=1)
transfers_with_missing_copcs = transfers.loc[conversations_with_missing_copcs]
missing_copcs_by_supplier_pathway=transfers_with_missing_copcs.groupby(by=["sending_supplier", "requesting_supplier"]).agg({"date_requested": "count"}).rename({"date_requested": "Number of Transfers with Missing COPC"}, axis=1)

supplier_pathways_missing_copc_comparison_table=transfers_by_supplier_pathway.merge(missing_copcs_by_supplier_pathway, left_index=True, right_index=True, how="outer").fillna(0)
supplier_pathways_missing_copc_comparison_table["Estimated % Missing"] = supplier_pathways_missing_copc_comparison_table["Number of Transfers with Missing COPC"]/supplier_pathways_missing_copc_comparison_table["Number of Transfers"]*100
supplier_pathways_missing_copc_comparison_table.sort_values(by="Number of Transfers", ascending=False)

| sending_supplier | requesting_supplier | Number of Transfers | Number of Transfers with Missing COPC | Estimated % Missing |
|---|---|---|---|---|
| EMIS | EMIS | 331997 | 171.0 | 0.051506 |
| EMIS | SystmOne | 105542 | 69.0 | 0.065377 |
| SystmOne | EMIS | 91422 | 20.0 | 0.021877 |
| Vision | EMIS | 6251 | 0.0 | 0.0 |
| EMIS | Vision | 5954 | 0.0 | 0.0 |
| Vision | SystmOne | 1434 | 0.0 | 0.0 |
| Vision | Vision | 1285 | 0.0 | 0.0 |
| SystmOne | Vision | 1127 | 0.0 | 0.0 |
| SystmOne | SystmOne | 51 | 0.0 | 0.0 |
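
Because the question is whether the issue is specific to one supplier, it can help to roll the pathway rows up by sending supplier. The sketch below re-uses the counts from the table above; the aggregation itself is an addition and not part of the original analysis.

```python
from collections import defaultdict

# (sending_supplier, requesting_supplier, transfers, transfers with missing COPC),
# copied from the table above
pathways = [
    ("EMIS", "EMIS", 331997, 171),
    ("EMIS", "SystmOne", 105542, 69),
    ("SystmOne", "EMIS", 91422, 20),
    ("Vision", "EMIS", 6251, 0),
    ("EMIS", "Vision", 5954, 0),
    ("Vision", "SystmOne", 1434, 0),
    ("Vision", "Vision", 1285, 0),
    ("SystmOne", "Vision", 1127, 0),
    ("SystmOne", "SystmOne", 51, 0),
]

# Per sending supplier: [total transfers, transfers with missing COPC]
totals_by_sender = defaultdict(lambda: [0, 0])
for sender, _requester, transfer_count, missing_count in pathways:
    totals_by_sender[sender][0] += transfer_count
    totals_by_sender[sender][1] += missing_count

for sender, (transfer_count, missing_count) in sorted(totals_by_sender.items()):
    percent_missing = missing_count / transfer_count * 100
    print(f"{sender:<10}{transfer_count:>8}{missing_count:>5}  {percent_missing:.4f}%")
```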


In [17]:
# check if there's anything going on with the statuses
transfers_with_missing_copcs.groupby(by=["status", "failure_reason"]).agg({"date_requested": "count"}).rename({"date_requested": "Number of Transfers with Missing COPC"}, axis=1)

| status | failure_reason | Number of Transfers with Missing COPC |
|---|---|---|
| PROCESS_FAILURE | Integrated Late | 8 |
| PROCESS_FAILURE | Transferred, not integrated | 151 |
| TECHNICAL_FAILURE | Final Error | 66 |
| UNCLASSIFIED_FAILURE | Ambiguous COPC messages | 4 |
| UNCLASSIFIED_FAILURE | Transferred, not integrated, with error | 10 |


In [18]:
transfers_with_missing_copcs["requesting_practice_asid"].value_counts().value_counts()


1    232
2     14
Name: requesting_practice_asid, dtype: int64

In [19]:
transfers_with_missing_copcs["sending_practice_asid"].value_counts().value_counts()

1    238
2     11
Name: sending_practice_asid, dtype: int64
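
The double `value_counts()` answers "how many practices appear N times?". In the outputs above, 232 requesting practices appear once and 14 appear twice (232 + 2 × 14 = 260 transfers), and likewise 238 and 11 for sending practices. A minimal sketch of the same idea with `collections.Counter`, using invented practice IDs:

```python
from collections import Counter

# One entry per affected transfer (invented practice identifiers)
requesting_practice_asids = ["p1", "p2", "p2", "p3", "p4", "p4", "p5"]

# First count: affected transfers per practice
transfers_per_practice = Counter(requesting_practice_asids)

# Second count: number of practices sharing each per-practice total
practices_per_transfer_count = Counter(transfers_per_practice.values())

print(dict(practices_per_transfer_count))
```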

## Addendum Findings

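1. Missing COPCs appear more likely when EMIS is the sender: 240 of the 260 affected transfers were sent by EMIS, at an estimated 0.05% to 0.07% of transfers on the EMIS to EMIS and EMIS to SystmOne pathways, compared with about 0.02% for SystmOne to EMIS and none on the remaining pathways.
2. Of the failure reasons recorded, the majority (151) are "Transferred, not integrated", but there is also a large volume (66) of technical failures with a final error.
3. There does not appear to be a practice-specific issue: no single practice has this issue more than twice as either a sender or requester.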
