# PRMT-2183 Is it worth using Attachment MIDs in the pipeline to categorise transfers?

## Hypothesis
**We believe that** we should use attachment MID data to more accurately classify transfers
**We will know this to be true** when we see a significant number of Transferred, not Integrated transfers re-categorised as Technical Failures

## Context
Take a sample of pending transfers that involve COPC messages and where every COPC message sent has been acknowledged. For these, measure how often the number of COPC messages sent matches the number expected according to the Attachment MID data.

Using spine data:
1. Number of Conversations
2. Number of Conversations without any error code 12
3. Number of above which have core extract received, but do not have final ack message (transfers that would not be categorised as anything else under our new status proposal).
4. Number of above that have COPC messages according to Attachment MID
5. Number of above where all COPCs sent are acknowledged
6. Number of the above for which the number of COPC sent is less than the number in Attachment MID 

Step 6 as a percentage of step 2 is our ultimate output: it tells us what proportion of transfers would be re-categorised as a consequence of using Attachment MID data.
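
The comparison at the heart of step 6 can be sketched on toy data before running it against the spine data. This is a minimal illustration in plain Python, not the notebook's pandas implementation. The conversation IDs, attachment IDs and the helper `conversations_missing_copcs` are hypothetical. Unlike the notebook, it iterates over every conversation with expected attachments rather than only those with COPC messages seen.

```python
from collections import Counter

# Hypothetical toy data: one (conversationID, attachmentID) pair per expected attachment.
attachment_mids = [
    ("conv-a", "att-1"),
    ("conv-a", "att-2"),
    ("conv-b", "att-3"),
]

# Hypothetical toy data: one (conversationID, GUID) pair per COPC message seen from the Sender.
copcs_seen = [
    ("conv-a", "msg-1"),
    ("conv-b", "msg-2"),
]


def conversations_missing_copcs(expected_pairs, seen_pairs):
    """Return conversation IDs with fewer COPC messages seen than expected."""
    # De-duplicate pairs first, mirroring the drop_duplicates() calls in the notebook.
    expected = Counter(conv for conv, _ in set(expected_pairs))
    seen = Counter(conv for conv, _ in set(seen_pairs))
    # A Counter returns 0 for conversations with no COPC messages seen.
    return sorted(conv for conv, count in expected.items() if seen[conv] < count)


missing = conversations_missing_copcs(attachment_mids, copcs_seen)
print(missing)
```

Here `conv-a` expects two COPC messages but only one was seen, so it is the only conversation that would be re-categorised.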


In [1]:
import pandas as pd

In [2]:
# Raw gp2gp spine data

gp2gp_spine_data_files = [
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Mar-2021.csv.gz",
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Apr-2021.csv.gz"
]

gp2gp_spine = pd.concat((
    pd.read_csv(f, parse_dates=["_time"])
    for f in gp2gp_spine_data_files
))

interaction_name_mapping = {
    "urn:nhs:names:services:gp2gp/RCMR_IN010000UK05": "req start",
    "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06": "req complete",
    "urn:nhs:names:services:gp2gp/COPC_IN000001UK01": "COPC",
    "urn:nhs:names:services:gp2gp/MCCI_IN010000UK13": "ack"
}

gp2gp_spine['interaction_name']=gp2gp_spine['interactionID'].replace(interaction_name_mapping)
gp2gp_spine = gp2gp_spine.drop_duplicates()

In [4]:
attachment_mids_folder="s3://prm-gp2gp-data-sandbox-dev/43-PRMT-2167-attachment-mids/"
attachment_mids_files=["attachment_mids_april_2021.csv","attachment_mids_march_2021.csv"]
attachment_mids=pd.concat([pd.read_csv(attachment_mids_folder + file) for file in attachment_mids_files])

In [24]:
# 1. Filtering for conversations that have started within the dataset
filter_data = gp2gp_spine.copy()
conversation_ids_with_req_start = filter_data.loc[filter_data['interaction_name']=='req start','conversationID'].unique()
filter_data = filter_data[filter_data["conversationID"].isin(conversation_ids_with_req_start)]

print(f"Total number of conversations: {conversation_ids_with_req_start.shape[0]}")

Total number of conversations: 545063


In [25]:
# 2. Filtering for conversations that do not have error code 12
conversation_ids_with_error_code_12 = filter_data.loc[filter_data['jdiEvent']=='12','conversationID'].unique()
conversation_ids_without_error_code_12 = list(set(filter_data['conversationID'].unique()) - set(conversation_ids_with_error_code_12))
filter_data = filter_data[filter_data["conversationID"].isin(conversation_ids_without_error_code_12)]

print(f"Total number of conversations without error code 12: {len(conversation_ids_without_error_code_12)}")

Total number of conversations without error code 12: 537440


In [26]:
# 3. Conversations that have core extract received, but do not have final ack message
# First filtering for conversations with core extract
conversation_ids_with_core_extract = filter_data.loc[filter_data['interaction_name']=='req complete','conversationID'].unique() 
filter_data = filter_data[filter_data["conversationID"].isin(conversation_ids_with_core_extract)]

# Second filtering for conversations that do not have final ack message
ids_of_req_complete_messages = filter_data.loc[filter_data['interaction_name']=='req complete','GUID'].unique()
conversation_ids_with_request_complete_ack = filter_data.loc[filter_data["messageRef"].isin(ids_of_req_complete_messages), "conversationID"]
conversation_ids_without_request_complete_ack= list(set(filter_data['conversationID'].unique()) - set(conversation_ids_with_request_complete_ack))
filter_data = filter_data[filter_data["conversationID"].isin(conversation_ids_without_request_complete_ack)]

print(f"Total number of conversations that have core extract received, but do not have final ack message: {len(conversation_ids_without_request_complete_ack)}")

Total number of conversations that have core extract received, but do not have final ack message: 24895
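
The "no final ack" rule in step 3 relies on matching each `messageRef` against the `GUID` of a "req complete" message. A minimal sketch of that rule on toy data follows, in plain Python rather than pandas. The message tuples and the helper `conversations_without_final_ack` are hypothetical.

```python
# Hypothetical toy messages: (conversationID, GUID, interaction_name, messageRef)
messages = [
    ("conv-a", "g1", "req start", "NotProvided"),
    ("conv-a", "g2", "req complete", "NotProvided"),
    ("conv-a", "g3", "ack", "g2"),  # acknowledges the core extract: a final ack
    ("conv-b", "g4", "req start", "NotProvided"),
    ("conv-b", "g5", "req complete", "NotProvided"),
    ("conv-b", "g6", "ack", "g4"),  # acknowledges the request, not the core extract
]


def conversations_without_final_ack(msgs):
    """Return conversations that have a core extract which no message refers to."""
    req_complete_guids = {guid for _, guid, name, _ in msgs if name == "req complete"}
    with_core_extract = {conv for conv, _, name, _ in msgs if name == "req complete"}
    # Any message whose messageRef points at a "req complete" GUID counts as the final ack.
    with_final_ack = {conv for conv, _, _, ref in msgs if ref in req_complete_guids}
    return sorted(with_core_extract - with_final_ack)


pending = conversations_without_final_ack(messages)
print(pending)
```

Only `conv-b` is returned, because its single ack refers to the request rather than to the core extract.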


In [27]:
# 4. Number of above that have COPC messages according to Attachment MID
all_conversations_with_attachment_ids = attachment_mids["conversationID"].unique()
filter_data = filter_data[filter_data["conversationID"].isin(all_conversations_with_attachment_ids)]
count_of_conversations_that_should_have_copcs = filter_data["conversationID"].unique().shape[0]

print("Total number of conversations that have COPC messages according to Attachment MID:")
count_of_conversations_that_should_have_copcs

Total number of conversations that have COPC messages according to Attachment MID:


10865

In [28]:
# Keep only conversations in which at least one COPC message was actually seen
conversations_with_copc = filter_data.loc[(filter_data['interaction_name']=='COPC'), 'conversationID'].unique()
filter_data = filter_data[filter_data['conversationID'].isin(conversations_with_copc)]

print(f"Total number of conversations that actually have any COPC messages: ")
print(filter_data["conversationID"].unique().shape[0])

print(f"Total number of conversations that expect to have COPC messages but don't ")
print(count_of_conversations_that_should_have_copcs - filter_data["conversationID"].unique().shape[0])

Total number of conversations that actually have any COPC messages: 
10436
Total number of conversations that expect to have COPC messages but don't 
429


In [29]:
# 5. Number of above where all COPCs sent are acknowledged (that are sent by Sender)

# Add in whether message is sent from Sender or Requester
requester_lookup = filter_data.loc[filter_data['interaction_name']=='req start', ['messageSender', 'conversationID']].rename({"messageSender": "requester"}, axis=1)
filter_data = filter_data.merge(requester_lookup, left_on="conversationID", right_on="conversationID", how="left")

filter_data["Message sender type"] = "Sender"
requester_bool = filter_data["messageSender"] == filter_data["requester"]
filter_data.loc[requester_bool, "Message sender type"] = "Requester" 

# Filtering for conversations where COPCs are sent by the Sender
copc_message_guids = filter_data.loc[(filter_data['interaction_name']=='COPC') & (filter_data['Message sender type'] == "Sender"),'GUID'].unique()
all_messagerefs = filter_data.loc[filter_data['interaction_name']=='ack','messageRef'].unique()
copcs_guids_without_ack = list(set(copc_message_guids) - set(all_messagerefs))

conversation_ids_copcs_without_ack = filter_data.loc[filter_data["GUID"].isin(copcs_guids_without_ack), "conversationID"].unique()
conversation_ids_copcs_with_ack = list(set(filter_data['conversationID'].unique()) - set(conversation_ids_copcs_without_ack))

filter_data = filter_data[filter_data["conversationID"].isin(conversation_ids_copcs_with_ack)]

print(f"Total number of conversations where all COPCs sent are acknowledged:")
filter_data["conversationID"].unique().shape[0]

Total number of conversations where all COPCs sent are acknowledged:


10208
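
Step 5 combines two ideas: labelling each message as coming from the Requester or the Sender, and then dropping conversations in which a Sender COPC has no acknowledgement. A minimal sketch of both on toy data follows, in plain Python rather than pandas. The practice identifiers, message tuples and the helper `conversations_with_all_sender_copcs_acked` are hypothetical.

```python
# Hypothetical toy messages:
# (conversationID, GUID, interaction_name, messageSender, messageRef)
messages = [
    ("conv-a", "a1", "req start", "practice-R", "NotProvided"),
    ("conv-a", "a2", "COPC", "practice-S", "NotProvided"),
    ("conv-a", "a3", "ack", "practice-R", "a2"),
    ("conv-b", "b1", "req start", "practice-R", "NotProvided"),
    ("conv-b", "b2", "COPC", "practice-S", "NotProvided"),  # never acknowledged
    ("conv-b", "b3", "COPC", "practice-R", "NotProvided"),  # sent by the Requester, ignored
]


def conversations_with_all_sender_copcs_acked(msgs):
    """Return conversations in which every COPC sent by the Sender has an ack."""
    # The practice that sent "req start" is the Requester; any other practice is the Sender.
    requester = {conv: sender for conv, _, name, sender, _ in msgs if name == "req start"}
    acked_guids = {ref for _, _, name, _, ref in msgs if name == "ack"}
    all_conversations = {conv for conv, *_ in msgs}
    with_unacked_copc = {
        conv
        for conv, guid, name, sender, _ in msgs
        if name == "COPC" and sender != requester.get(conv) and guid not in acked_guids
    }
    return sorted(all_conversations - with_unacked_copc)


fully_acked = conversations_with_all_sender_copcs_acked(messages)
print(fully_acked)
```

As in the notebook, a conversation with no Sender COPC messages at all would pass this filter, which is worth bearing in mind when reading the counts in the following steps.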

In [30]:
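# Inspect a conversation that survives step 5 but is absent from copc_comparison_table,
# i.e. one with no COPC message sent by the Sender.
# Note: copc_comparison_table is built in step 6 below, so that cell must be run before this one.
missing_conversation_ids = list(set(filter_data["conversationID"].unique()) - set(copc_comparison_table.index))
filter_data[filter_data["conversationID"] == missing_conversation_ids[0]]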

| Unnamed: 0 | _time | conversationID | GUID | interactionID | messageSender | messageRecipient | messageRef | jdiEvent | toSystem | fromSystem | interaction_name | requester | Message sender type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 358143 | 2021-04-28 11:56:52.182000+00:00 | 77FB9B15-7275-4394-A0A2-B90667BE93D8 | D543DE73-60A9-4849-A943-6B5280947296 | urn:nhs:names:services:gp2gp/COPC_IN000001UK01 | 550635141017 | 847520267018 | NotProvided | NONE | EMIS | EMIS | COPC | 550635141017 | Requester |
| 358145 | 2021-04-28 11:56:49.435000+00:00 | 77FB9B15-7275-4394-A0A2-B90667BE93D8 | CD8CF236-BFC5-4286-8A37-49F3612C352B | urn:nhs:names:services:gp2gp/MCCI_IN010000UK13 | 847520267018 | 550635141017 | 77FB9B15-7275-4394-A0A2-B90667BE93D8 | NONE | EMIS | EMIS | ack | 550635141017 | Sender |
| 398169 | 2021-04-28 11:56:15.269000+00:00 | 77FB9B15-7275-4394-A0A2-B90667BE93D8 | 77FB9B15-7275-4394-A0A2-B90667BE93D8 | urn:nhs:names:services:gp2gp/RCMR_IN010000UK05 | 550635141017 | 847520267018 | NotProvided | NONE | EMIS | EMIS | req start | 550635141017 | Requester |
| 443378 | 2021-04-28 11:56:46.942000+00:00 | 77FB9B15-7275-4394-A0A2-B90667BE93D8 | F572FC12-F53F-492A-9AF9-36E7C8A04BF1 | urn:nhs:names:services:gp2gp/RCMR_IN030000UK06 | 847520267018 | 550635141017 | NotProvided | NONE | EMIS | EMIS | req complete | 550635141017 | Sender |


In [35]:
# 6. Number of the above for which the number of COPC sent is less than the number in Attachment MID
copc_expected = attachment_mids.drop("_time", axis=1).drop_duplicates().groupby("conversationID").agg("count").rename({"attachmentID": "Number of COPCs expected"}, axis=1).fillna(0)
copcs_seen = filter_data.loc[(filter_data['interaction_name']=='COPC') & (filter_data['Message sender type'] == "Sender"),["conversationID", "GUID"]].fillna(0)
copcs_seen = copcs_seen.drop_duplicates().groupby("conversationID").agg("count").rename({"GUID": "Number of COPCs seen"}, axis=1).fillna(0)
copc_comparison_table = copc_expected.merge(copcs_seen, left_index=True, right_index=True, how="right").fillna(0)
missing_copc_messages = (copc_comparison_table["Number of COPCs seen"] < copc_comparison_table["Number of COPCs expected"]).value_counts()

print(f"Number of the above for which the number of COPC sent is less than the number in Attachment MID:")
missing_copc_messages[True]

Number of the above for which the number of COPC sent is less than the number in Attachment MID:


260

In [32]:
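# Sanity check: number of conversations in the comparison table.
# This is lower than the step 5 count (10208), presumably because the table only
# includes conversations with at least one COPC sent by the Sender.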
missing_copc_messages.sum()

10164

In [33]:
print("% of transfers that would be re-categorised as a consequence of using attachment MIDs data")
(missing_copc_messages[True] / len(conversation_ids_without_error_code_12)) * 100

% of transfers that would be re-categorised as a consequence of using attachment MIDs data


0.04837749330157785
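
To put the final figure in context, the counts printed by each step above can be collected into one funnel and expressed as a percentage of step 2. This is a small sketch in plain Python that only reuses the counts reported in this notebook. The `funnel` list, its labels and the helper `percent_of_baseline` are illustrative names, not part of the pipeline.

```python
# Counts copied from the outputs of the cells above.
funnel = [
    ("1. Conversations", 545063),
    ("2. Without error code 12", 537440),
    ("3. Core extract received, no final ack", 24895),
    ("4. COPCs expected according to Attachment MID", 10865),
    ("   of which any COPC message was seen", 10436),
    ("5. All Sender COPCs acknowledged", 10208),
    ("6. Fewer COPCs sent than expected", 260),
]

# Step 2 is the baseline used for the final percentage in this notebook.
baseline = funnel[1][1]


def percent_of_baseline(count, base=baseline):
    """Express a count as a percentage of the step 2 baseline."""
    return (count / base) * 100


for label, count in funnel:
    print(f"{label:<50}{count:>8}{percent_of_baseline(count):>10.3f}%")
```

The last row reproduces the roughly 0.048% computed above, alongside the share of the baseline that survives each earlier filter.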