# PRMT-2183 Is it worth using Attachment MIDs in the pipeline to categorise transfers?

## Hypothesis

**We believe that** we should use attachment MID data to more accurately classify transfers.

**We will know this to be true** when ~we see a significant number of Transferred, not Integrated transfers re-categorised as Technical Failures~

... we see a significant number of transfers that are currently understood to be "pending" re-categorised as either "pending integration" or "still transferring".

Alternatively: how many transfers actually get stuck waiting for COPC large message fragments?

## Approach

Take a sample of pending transfers that include some number of large attachment COPC fragments, all of which are acknowledged.
Measure how many of these GP2GP conversations appear to have sent all of the COPC messages that are referenced as attachment MIDs in the core EHR message.

See: https://gpitbjss.atlassian.net/wiki/spaces/TW/pages/2552529087/EHR+structure+in+GP2GP

Because the data we are analysing is "in the past", a conversation that is still pending was not a successful outcome for the patient.

To simplify this analysis, we will only use transfers that have no duplicate core EHR (error code 12).

Using the spine data, we will calculate:
1. Number of conversations
2. Number of conversations without any error code 12
3. Number of the above that have a core extract received but no final ack message (transfers that would not be categorised as anything else under our new status proposal)
4. Number of the above that have COPC messages according to the attachment MIDs
5. Number of the above where all COPCs sent are acknowledged
6. Number of the above for which the number of COPCs sent is less than the number of attachment MIDs

These numbers correspond roughly to this categorisation hierarchy:

```
Looking back at a collection of transfers, we could categorise them as follows

-> A - Completed (successfully or otherwise)
-> B - Not Completed (aka "pending")
  -> C - EHR was not sent or request was not acknowledged 
  -> D - EHR was sent (This is more or less our current level of determination)
    These next categorisations are viable using the attachment dataset:
    -> E - Those where all the attachment fragments were sent (sending complete)
    -> F - Those where some attachment fragments were not sent (sending incomplete)
    -> G - ???
```
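
The hierarchy above can be read as a small decision procedure. The following is a minimal sketch only: `categorise_transfer` and its parameters are hypothetical names, not part of the pipeline, and it assumes that a transfer with no expected attachment fragments counts as "sending complete". Category D is the parent of E and F, and the open category G is not modelled.

```python
def categorise_transfer(completed: bool, ehr_sent: bool,
                        fragments_expected: int, fragments_sent: int) -> str:
    """Return a category letter from the hierarchy above (hypothetical helper)."""
    if completed:
        return "A"  # completed, successfully or otherwise
    if not ehr_sent:
        return "C"  # pending: EHR not sent or request not acknowledged
    # From here on the transfer is in category D (pending, EHR was sent).
    if fragments_sent >= fragments_expected:
        return "E"  # every expected attachment fragment was sent
    return "F"      # some attachment fragments were never sent


# A pending transfer that expected 3 fragments but only sent 2
print(categorise_transfer(completed=False, ehr_sent=True,
                          fragments_expected=3, fragments_sent=2))
```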

Two ways of interpreting the impact of the enhancement:

1. We can now correctly categorise cases where some COPC messages were not sent and the transfer got stuck. This is the size of category F.

This would be calculated by taking (6) as a percentage of (2).

2. By process of elimination, we can now correctly categorise any transfer where the EHR was sent but the final ack is missing as either "stuck" or "not stuck", depending on whether we are still waiting for the transfer to complete. This is the size of category D.

This would only be feasible if there is no other significant variation within category D (category G), for example attachment fragments that are sent but not acknowledged.

This would be calculated by taking (3) as a percentage of (2).

In [1]:
import pandas as pd

In [2]:
# Raw gp2gp spine data

gp2gp_spine_data_files = [
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Mar-2021.csv.gz",
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Apr-2021.csv.gz"
]

gp2gp_spine = pd.concat((
    pd.read_csv(f, parse_dates=["_time"])
    for f in gp2gp_spine_data_files
))

interaction_name_mapping = {
    "urn:nhs:names:services:gp2gp/RCMR_IN010000UK05": "req start",
    "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06": "req complete",
    "urn:nhs:names:services:gp2gp/COPC_IN000001UK01": "COPC",
    "urn:nhs:names:services:gp2gp/MCCI_IN010000UK13": "ack"
}

gp2gp_spine['interaction_name']=gp2gp_spine['interactionID'].replace(interaction_name_mapping)
gp2gp_spine = gp2gp_spine.drop_duplicates()
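
One detail of the cell above is that `replace` leaves any interaction ID that is not in the mapping unchanged, whereas `map` would turn it into `NaN`. The toy example below illustrates the difference; the third ID is a made-up placeholder and does not come from the spine data.

```python
import pandas as pd

mapping = {
    "urn:nhs:names:services:gp2gp/RCMR_IN010000UK05": "req start",
    "urn:nhs:names:services:gp2gp/MCCI_IN010000UK13": "ack",
}
interaction_ids = pd.Series([
    "urn:nhs:names:services:gp2gp/RCMR_IN010000UK05",
    "urn:nhs:names:services:gp2gp/MCCI_IN010000UK13",
    "urn:example:unmapped-interaction",  # hypothetical ID for illustration
])

replaced = interaction_ids.replace(mapping)  # unmapped values are kept as-is
mapped = interaction_ids.map(mapping)        # unmapped values become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```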

Splunk query used to extract the attachment MID data:
```sql
index="spine2vfmmonitor" logReference="MPS0208" attachmentType="mid"
| table _time, attachmentID, conversationID
```

In [3]:
attachment_mids_folder="s3://prm-gp2gp-data-sandbox-dev/43-PRMT-2167-attachment-mids/"
attachment_mids_files=["attachment_mids_april_2021.csv","attachment_mids_march_2021.csv"]
attachment_mids=pd.concat([pd.read_csv(attachment_mids_folder + file) for file in attachment_mids_files])

In [4]:
attachment_mids

| Unnamed: 0 | _time | attachmentID | conversationID |
|---|---|---|---|
| 0 | 2021-04-30T17:20:10.015+0000 | 0C66B7F0-9722-4E65-91D3-FAAF2B06DD6D | ADF88575-CC9C-4066-BFD9-3EE3C4B4C845 |
| 1 | 2021-04-30T17:20:10.013+0000 | 10370E86-C1E4-4DB9-91B9-2DDFA4900D93 | ADF88575-CC9C-4066-BFD9-3EE3C4B4C845 |
| 2 | 2021-04-30T17:20:10.012+0000 | F15AA0E4-E0B3-4B06-A6D9-6DEF1762F190 | ADF88575-CC9C-4066-BFD9-3EE3C4B4C845 |
| 3 | 2021-04-30T16:50:59.252+0000 | 50A81888-42DA-4CB9-8BBA-4FCAE2B89BC7 | A5299960-9D95-4323-975F-DACF34F6569C |
| 4 | 2021-04-30T16:50:59.251+0000 | 7C150673-B75C-4258-83F6-FB11589D4011 | A5299960-9D95-4323-975F-DACF34F6569C |
| ... | ... | ... | ... |
| 3102027 | 2021-03-01T07:29:00.345+0000 | 9D6EBFF4-642D-4D78-8226-FDFFD7E71136 | ACE98900-7A5F-11EB-A96A-99FF0384246D |
| 3102028 | 2021-03-01T07:29:00.344+0000 | 3B238F0D-8ED0-4498-9232-16ABA663C847 | ACE98900-7A5F-11EB-A96A-99FF0384246D |
| 3102029 | 2021-03-01T07:29:00.342+0000 | A6B50E2D-9484-4F2D-8EE5-985DA8B1255A | ACE98900-7A5F-11EB-A96A-99FF0384246D |
| 3102030 | 2021-03-01T07:28:56.682+0000 | 4C1E79FE-D1C3-46C7-8FB9-3C7AFD4A34BB | ACE98900-7A5F-11EB-A96A-99FF0384246D |


In [5]:
# 1. Filtering for conversations that have started within the dataset
all_messages = gp2gp_spine.copy()
conversation_ids_with_req_start = all_messages.loc[all_messages['interaction_name']=='req start','conversationID'].unique()
messages_from_started_conversations = all_messages[all_messages["conversationID"].isin(conversation_ids_with_req_start)]

print(f"Total number of conversations: {conversation_ids_with_req_start.shape[0]}")

Total number of conversations: 545063


In [6]:
# 2. Filtering for conversations that do not have error code 12
conversation_ids_with_error_code_12 = messages_from_started_conversations.loc[messages_from_started_conversations['jdiEvent']=='12','conversationID'].unique()
conversation_ids_without_error_code_12 = list(set(messages_from_started_conversations['conversationID']) - set(conversation_ids_with_error_code_12))
messages_from_conversations_without_duplicate_ehr = messages_from_started_conversations[messages_from_started_conversations["conversationID"].isin(conversation_ids_without_error_code_12)]

print(f"Total number of conversations without error code 12: {len(conversation_ids_without_error_code_12)}")

Total number of conversations without error code 12: 537440


In [7]:
# 3. Conversations that have core extract received, but do not have final ack message
# First filtering for conversations with core extract
conversation_ids_with_core_extract = messages_from_conversations_without_duplicate_ehr.loc[messages_from_conversations_without_duplicate_ehr['interaction_name']=='req complete','conversationID'].unique() 
messages_from_conversations_with_ehr = messages_from_conversations_without_duplicate_ehr[messages_from_conversations_without_duplicate_ehr["conversationID"].isin(conversation_ids_with_core_extract)]

# Second filtering for conversations that do not have final ack message
ids_of_req_complete_messages = messages_from_conversations_with_ehr.loc[messages_from_conversations_with_ehr['interaction_name']=='req complete','GUID'].unique()
conversation_ids_with_request_complete_ack = messages_from_conversations_with_ehr.loc[messages_from_conversations_with_ehr["messageRef"].isin(ids_of_req_complete_messages), "conversationID"]
conversation_ids_without_request_complete_ack= list(set(messages_from_conversations_with_ehr['conversationID']) - set(conversation_ids_with_request_complete_ack))
messages_from_conversations_without_final_ack = messages_from_conversations_with_ehr[messages_from_conversations_with_ehr["conversationID"].isin(conversation_ids_without_request_complete_ack)]

print(f"Total number of conversations that have core extract received, but do not have final ack message: {len(conversation_ids_without_request_complete_ack)}")

Total number of conversations that have core extract received, but do not have final ack message: 24895
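
To make the "no final ack" filter easier to follow, here is a simplified sketch on a toy frame. The column names match the spine data used above, but the rows are invented. It assumes at most one core EHR message per conversation, which is roughly what excluding error code 12 is meant to give us.

```python
import pandas as pd

# Toy messages: c1 has its core EHR acknowledged, c2 does not.
messages = pd.DataFrame({
    "conversationID": ["c1", "c1", "c1", "c2", "c2"],
    "GUID": ["m1", "m2", "m3", "m4", "m5"],
    "interaction_name": ["req start", "req complete", "ack",
                         "req start", "req complete"],
    "messageRef": [None, None, "m2", None, None],
})

# Core EHR messages, and the set of GUIDs that some message refers back to
core_ehr = messages[messages["interaction_name"] == "req complete"]
acked_guids = set(messages["messageRef"].dropna())

# Conversations whose core EHR message was never referenced by an ack
pending = core_ehr.loc[~core_ehr["GUID"].isin(acked_guids), "conversationID"].unique()
print(list(pending))
```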


In [8]:
# 4. Number of above that have COPC messages according to Attachment MID
all_conversations_with_attachment_ids = attachment_mids["conversationID"].unique()
messages_from_conversations_with_large_attachments = messages_from_conversations_without_final_ack[messages_from_conversations_without_final_ack["conversationID"].isin(all_conversations_with_attachment_ids)]
count_of_conversations_that_should_have_copcs = messages_from_conversations_with_large_attachments["conversationID"].unique().shape[0]

print("Total number of conversations that have COPC messages according to attachment MIDs:")
count_of_conversations_that_should_have_copcs

Total number of conversations that have COPC messages according to attachment MIDs:


10865

In [9]:
conversations_with_copc = messages_from_conversations_with_large_attachments.loc[(messages_from_conversations_with_large_attachments['interaction_name']=='COPC'), 'conversationID'].unique()
messages_from_conversations_with_copcs = messages_from_conversations_with_large_attachments[messages_from_conversations_with_large_attachments['conversationID'].isin(conversations_with_copc)]

print(f"Total number of conversations that actually have any COPC messages: ")
print(messages_from_conversations_with_copcs["conversationID"].unique().shape[0])

print(f"Total number of conversations that expect to have COPC messages but don't ")
print(count_of_conversations_that_should_have_copcs - messages_from_conversations_with_copcs["conversationID"].unique().shape[0])

Total number of conversations that actually have any COPC messages: 
10436
Total number of conversations that expect to have COPC messages but don't 
429


In [10]:
# 5. Number of above where all COPCs sent are acknowledged (that are sent by Sender)

# Add in whether message is sent from Sender or Requester
requester_lookup = messages_from_conversations_with_copcs.loc[messages_from_conversations_with_copcs['interaction_name']=='req start', ['messageSender', 'conversationID']].rename({"messageSender": "requester"}, axis=1)
messages_from_conversations_with_copcs = messages_from_conversations_with_copcs.merge(requester_lookup, left_on="conversationID", right_on="conversationID", how="left")

messages_from_conversations_with_copcs["Message sender type"] = "Sender"
requester_bool = messages_from_conversations_with_copcs["messageSender"] == messages_from_conversations_with_copcs["requester"]
messages_from_conversations_with_copcs.loc[requester_bool, "Message sender type"] = "Requester" 

# Filtering for conversations where COPCs are sent by the Sender
copc_message_guids = messages_from_conversations_with_copcs.loc[(messages_from_conversations_with_copcs['interaction_name']=='COPC') & (messages_from_conversations_with_copcs['Message sender type'] == "Sender"),'GUID'].unique()
all_messagerefs = messages_from_conversations_with_copcs.loc[messages_from_conversations_with_copcs['interaction_name']=='ack','messageRef'].unique()

copcs_guids_without_ack = list(set(copc_message_guids) - set(all_messagerefs))

conversation_ids_copcs_without_ack = messages_from_conversations_with_copcs.loc[messages_from_conversations_with_copcs["GUID"].isin(copcs_guids_without_ack), "conversationID"].unique()
conversation_ids_with_acked_sender_copcs = messages_from_conversations_with_copcs.loc[(messages_from_conversations_with_copcs['Message sender type'] == "Sender") & (messages_from_conversations_with_copcs['interaction_name']=='COPC'), "conversationID"].unique() 

conversation_ids_copcs_with_ack = list(set(conversation_ids_with_acked_sender_copcs) - set(conversation_ids_copcs_without_ack))

messages_from_conversations_with_copcs_acked = messages_from_conversations_with_copcs[messages_from_conversations_with_copcs["conversationID"].isin(conversation_ids_copcs_with_ack)]

print(f"Total number of conversations where all COPCs sent are acknowledged:")
messages_from_conversations_with_copcs_acked["conversationID"].unique().shape[0]

Total number of conversations where all COPCs sent are acknowledged:


10164
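
The set-difference logic in step 5 can also be expressed as a per-conversation "all acknowledged" check. This is a sketch on invented rows, and for brevity it leaves out the Sender/Requester distinction made in the cell above.

```python
import pandas as pd

# Toy messages: both COPCs in c1 are acked, only one of two in c2 is.
messages = pd.DataFrame({
    "conversationID": ["c1", "c1", "c1", "c1", "c2", "c2", "c2"],
    "GUID": ["a1", "a2", "a3", "a4", "b1", "b2", "b3"],
    "interaction_name": ["COPC", "COPC", "ack", "ack", "COPC", "COPC", "ack"],
    "messageRef": [None, None, "a1", "a2", None, None, "b1"],
})

copcs = messages[messages["interaction_name"] == "COPC"]
acked_refs = set(messages.loc[messages["interaction_name"] == "ack", "messageRef"])

# Flag each COPC as acked or not, then require every COPC in a conversation to be acked
copcs = copcs.assign(acked=copcs["GUID"].isin(acked_refs))
all_acked = copcs.groupby("conversationID")["acked"].all()
fully_acked_ids = all_acked[all_acked].index.tolist()
print(fully_acked_ids)
```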

In [12]:
# 6. Number of the above for which the number of COPC sent is less than the number in Attachment MID
copc_expected = attachment_mids.drop("_time", axis=1).drop_duplicates().groupby("conversationID").agg("count").rename({"attachmentID": "Number of COPCs expected"}, axis=1).fillna(0)
copcs_seen = messages_from_conversations_with_copcs_acked.loc[(messages_from_conversations_with_copcs_acked['interaction_name']=='COPC') & (messages_from_conversations_with_copcs_acked['Message sender type'] == "Sender"),["conversationID", "GUID"]].fillna(0)
copcs_seen = copcs_seen.drop_duplicates().groupby("conversationID").agg("count").rename({"GUID": "Number of COPCs seen"}, axis=1).fillna(0)
copc_comparison_table = copc_expected.merge(copcs_seen, left_index=True, right_index=True, how="right").fillna(0)
missing_copc_messages = (copc_comparison_table["Number of COPCs seen"] < copc_comparison_table["Number of COPCs expected"]).value_counts()

print(f"Number of the above for which the number of COPC sent is less than the number in Attachment MID:")
missing_copc_messages[True]

Number of the above for which the number of COPC sent is less than the number in Attachment MID:


260
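
One caveat with step 6: the attachment MID frame still carries the `Unnamed: 0` column from the exported CSVs, so `drop_duplicates()` over all remaining columns may not collapse an attachment that was logged more than once. The sketch below, on invented rows, counts expected attachments using only the two key columns. Whether repeated rows actually occur in the real data is an assumption that would need checking.

```python
import pandas as pd

attachment_mids = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "attachmentID": ["x1", "x2", "x2", "y1"],  # x2 is logged twice (invented)
    "conversationID": ["c1", "c1", "c1", "c2"],
})
copc_messages = pd.DataFrame({
    "conversationID": ["c1", "c2"],
    "GUID": ["g1", "g2"],
})

# Distinct attachment IDs per conversation, ignoring the index-like column
expected = (
    attachment_mids[["conversationID", "attachmentID"]]
    .drop_duplicates()
    .groupby("conversationID")["attachmentID"]
    .nunique()
    .rename("expected")
)
seen = copc_messages.groupby("conversationID")["GUID"].nunique().rename("seen")

comparison = pd.concat([expected, seen], axis=1).fillna(0)
incomplete = comparison.index[comparison["seen"] < comparison["expected"]].tolist()
print(incomplete)
```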

In [17]:
print("% of transfers that would be re-categorised as a consequence of using attachment MIDs data")
(missing_copc_messages[True] / len(conversation_ids_without_error_code_12)) * 100

% of transfers that would be re-categorised as a consequence of using attachment MIDs data


0.04837749330157785
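
For reference, both interpretations described in the approach can be computed from the counts printed above. The variable names here are illustrative; the three counts are copied from the outputs of steps 2, 3 and 6.

```python
conversations_without_error_12 = 537440  # step 2
ehr_sent_no_final_ack = 24895            # step 3
fewer_copcs_than_expected = 260          # step 6

# Interpretation 1: size of category F, (6) as a percentage of (2)
category_f_pct = fewer_copcs_than_expected / conversations_without_error_12 * 100
# Interpretation 2: size of category D, (3) as a percentage of (2)
category_d_pct = ehr_sent_no_final_ack / conversations_without_error_12 * 100

print(f"Category F: {category_f_pct:.3f}% of transfers")
print(f"Category D: {category_d_pct:.3f}% of transfers")
```

The second figure is only an upper bound on what could be re-categorised, since it depends on there being no significant category G.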