# PRMT-2167 Using attachment data to extract the expected number of COPC messages

## Context
Is it possible to extract the expected number of COPC messages for each transfer from the following dataset? Spine log reference: MPS0208

If so, this would help us distinguish pending transfers that have not fully transferred from those that have fully transferred and are awaiting integration.

In [2]:
import pandas as pd

In [3]:
# Raw gp2gp spine data

gp2gp_spine_data_files = [
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Mar-2021.csv.gz",
  "s3://prm-gp2gp-data-sandbox-dev/spine-gp2gp-data/Apr-2021.csv.gz"
]

gp2gp_spine = pd.concat((
    pd.read_csv(f, parse_dates=["_time"])
    for f in gp2gp_spine_data_files
))

interaction_name_mapping = {
    "urn:nhs:names:services:gp2gp/RCMR_IN010000UK05": "req start",
    "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06": "req complete",
    "urn:nhs:names:services:gp2gp/COPC_IN000001UK01": "COPC",
    "urn:nhs:names:services:gp2gp/MCCI_IN010000UK13": "ack"
}

gp2gp_spine['interaction_name']=gp2gp_spine['interactionID'].replace(interaction_name_mapping)

In [4]:
gp2gp_spine.shape

(14022754, 11)

In [5]:
gp2gp_spine["GUID"].unique().shape

(14021775,)

In [6]:
gp2gp_spine.drop_duplicates().shape

(14022499, 11)

In [7]:
gp2gp_spine = gp2gp_spine.drop_duplicates()

## Part 1. Exploratory

In [8]:
conversation_counts = gp2gp_spine.groupby("conversationID").agg({'GUID': 'count'}).rename({'GUID': 'messageCount'}, axis=1)

In [9]:
conversation_counts.sample(10)

conversationID,messageCount
A996B470-F91D-4C78-8265-429E8BF45FB5,4
1C01D62D-B29A-4CF2-88A7-930E6CFE7FEB,7
5046FF24-ED70-44BC-A409-68E7D178B5FD,4
11119360-A9A3-11EB-88E8-B585230C5066,12
484173FE-DD37-4252-9F4F-675EA7B46FC6,4
22FAA3AF-EB64-4EE6-AF5B-DBBA4899C650,4
74C88F95-9AEB-443D-9122-77D458AAD317,13
545E8ADB-8560-4BB2-BE80-413C3C6B7CF2,7
870A6E5F-7C28-4D58-A7BA-9A21919A93C1,27
4116F224-D605-4004-8B32-7ED680D0F29D,47


In [10]:
target_conversation = gp2gp_spine["conversationID"] == "407ADCAF-E59E-468D-B05B-C9BD1EA7CDAB"

gp2gp_spine[target_conversation]\
    .sort_values("_time")[["_time", "GUID", "interaction_name", "messageSender", "messageRecipient", "messageRef", "jdiEvent"]]


Unnamed: 0,_time,GUID,interaction_name,messageSender,messageRecipient,messageRef,jdiEvent
4933961,2021-04-07 12:44:24.465000+00:00,407ADCAF-E59E-468D-B05B-C9BD1EA7CDAB,req start,44100627049,507231158043,NotProvided,NONE
3484659,2021-04-07 12:44:43.518000+00:00,04CC2C51-5601-4572-B09F-80B895559E06,req complete,507231158043,44100627049,NotProvided,NONE
3242909,2021-04-07 12:44:44.750000+00:00,5925DBC7-6E09-40D2-A8B5-C929B87B744C,ack,507231158043,44100627049,407ADCAF-E59E-468D-B05B-C9BD1EA7CDAB,NONE
3242899,2021-04-07 12:44:46.986000+00:00,017C5DA8-8B66-48AC-B245-A66ECD9A71DF,COPC,44100627049,507231158043,NotProvided,NONE
5179297,2021-04-07 12:44:47.927000+00:00,D875241A-2A8D-492E-8FC5-51EAD69DF0CB,COPC,507231158043,44100627049,NotProvided,NONE
4227828,2021-04-07 12:44:49.545000+00:00,28ACBBC7-D8B0-4706-9546-EDBC5B4D7F4C,COPC,507231158043,44100627049,NotProvided,NONE
4227832,2021-04-07 12:44:50.484000+00:00,7DAA9445-67CC-4964-8A99-21D3158AA3E4,ack,44100627049,507231158043,D875241A-2A8D-492E-8FC5-51EAD69DF0CB,NONE
6106155,2021-04-07 12:45:41.794000+00:00,754E03DA-E87A-4B7A-9532-59F60ACC908A,ack,44100627049,507231158043,28ACBBC7-D8B0-4706-9546-EDBC5B4D7F4C,NONE
2308284,2021-04-07 19:10:56.821000+00:00,1C619F79-12A4-4E9E-B513-F98E4A8F14C9,ack,44100627049,507231158043,04CC2C51-5601-4572-B09F-80B895559E06,NONE


## Comparison of COPC message counts for three conversations

- SPINE: Manual count of number of COPC messages
- MI RR / MI SR: values of fields
- Attachment MIDS: Manual count of number of attachment MIDS

##### "3AD5FC47-AACE-4B24-813A-B4CC0804E04A"
- SPINE: timed out, 4 LM COPC fragments sent (i.e. 5 COPC messages)
- MI RR: LargeMessageFragmentCount: 5, LargeMessageFragmentSuccessCount: 4
- MI SR: LargeMessageFragmentCount: 5, LargeMessageFragmentSentCount: 4, LargeMessageFragmentSuccessCount: 4
- Attachment MIDS: 5 (four of them matching ones that we can see were sent in SPINE)

##### "B02437A9-5F79-4749-A8D1-CEB7F17F0C16"
- SPINE: completed OK, 8 LM COPC fragments sent (i.e. 9 COPC messages)
- MI RR: LargeMessageFragmentCount: 8, LargeMessageFragmentSuccessCount: 8
- MI SR: LargeMessageFragmentCount: 8, LargeMessageFragmentSentCount: 8, LargeMessageFragmentSuccessCount: 8
- Attachment MIDS: 8 (all of them matching ones that we can see were sent in SPINE)

##### "407ADCAF-E59E-468D-B05B-C9BD1EA7CDAB"
- SPINE: completed OK, 2 LM COPC fragments sent (i.e. 3 COPC messages)
- MI RR: LargeMessageFragmentCount: 2, LargeMessageFragmentSuccessCount: 2
- MI SR: LargeMessageFragmentCount: 2, LargeMessageFragmentSentCount: 2, LargeMessageFragmentSuccessCount: 2
- Attachment MIDS: 2 (all of them matching ones that we can see were sent in SPINE)
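
The manual Spine counts above rely on the rule that a large-message transfer has one initial COPC message plus one COPC message per fragment. A minimal sketch of that rule on toy data follows; the frame `toy_spine` and its conversation IDs are invented for illustration and are not taken from the Spine extract.

```python
import pandas as pd

# Toy stand-in for the Spine message table; rows and IDs are hypothetical.
toy_spine = pd.DataFrame({
    "conversationID": ["conv-a"] * 5 + ["conv-b"] * 3,
    "interaction_name": [
        "req start", "req complete", "COPC", "COPC", "COPC",
        "req start", "req complete", "ack",
    ],
})

# Count COPC messages per conversation.
copc_per_conversation = (
    toy_spine[toy_spine["interaction_name"] == "COPC"]
    .groupby("conversationID")
    .size()
)

# The first COPC message is not a fragment, so fragments = COPC messages - 1.
fragments_sent = (copc_per_conversation - 1).rename("fragments_sent")
print(fragments_sent)
```

Conversations with no COPC messages at all (such as `conv-b` here) do not appear in the result, so they would need separate handling.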

In [12]:
gp2gp_spine["month"] = gp2gp_spine["_time"].dt.month

In [13]:
copc_interaction = gp2gp_spine["interaction_name"] == "COPC"
copc_messages = gp2gp_spine[copc_interaction]

In [14]:
(copc_messages.value_counts("conversationID") - 1).sum()

5812220

## Part 2. COPC count using attachment MIDs

We have tried using the MI data, but this is inconsistent and likely to cause issues.

We are now comparing 2 sets of data:
1. The normal message data from Spine
2. The attachment metadata, which contains a list of expected COPC fragments

We want to see whether the metadata consistently (accurately and reliably) tells us the number of COPC fragments that are required to complete the transfer.

This will allow us to then check whether there are enough COPC fragments sent by looking at the Spine data and therefore distinguish between awaiting COPC and awaiting integration.
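
As a rough sketch of the intended check, the comparison could look like the function below. The helper name `classify_pending_transfer` and its labels are hypothetical, and it assumes that the attachment metadata gives a reliable expected fragment count, which is exactly what the rest of this section tests.

```python
def classify_pending_transfer(expected_fragments: int, copc_messages_sent: int) -> str:
    """Hypothetical helper: label a pending transfer from two counts.

    expected_fragments: number of attachment MIDs listed in the metadata.
    copc_messages_sent: number of COPC messages seen in Spine.
    """
    # One COPC message is the initial (non-fragment) message.
    fragments_sent = max(copc_messages_sent - 1, 0)
    if fragments_sent < expected_fragments:
        return "awaiting COPC"
    return "awaiting integration"


# Counts taken from two of the manually inspected conversations above.
print(classify_pending_transfer(expected_fragments=5, copc_messages_sent=5))
print(classify_pending_transfer(expected_fragments=2, copc_messages_sent=3))
```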

First, to check the consistency of the metadata, we will look at transfers that we know integrated successfully and see whether we can match the COPC message patterns from Spine to the COPC data in the attachment metadata.

In [15]:
attachment_mids_folder="s3://prm-gp2gp-data-sandbox-dev/43-PRMT-2167-attachment-mids/"
attachment_mids_files=["attachment_mids_april_2021.csv","attachment_mids_march_2021.csv"]
attachment_mids=pd.concat([pd.read_csv(attachment_mids_folder + file) for file in attachment_mids_files])

In [16]:
# Find conversation_ids for transfers which completed successfully
ack_data=gp2gp_spine.loc[gp2gp_spine['interaction_name']=='ack',['messageRef','jdiEvent']].set_index('messageRef')
req_complete_interactions_bool=gp2gp_spine['interaction_name']=='req complete'
conversations_completed=gp2gp_spine.loc[req_complete_interactions_bool,['GUID','conversationID']].set_index('GUID')
conversations_with_completions=conversations_completed.merge(ack_data,left_index=True,right_index=True,how='inner')
successful_transfers_bool=conversations_with_completions['jdiEvent'].isin(['NONE',15])
successful_transfers_conversation_ids=conversations_with_completions.loc[successful_transfers_bool,'conversationID'].values

# Just select conversations for which we have the start as well
conversations_with_start=gp2gp_spine.loc[gp2gp_spine['interaction_name']=='req start','conversationID'].unique()
successful_transfers_conversation_ids=set(successful_transfers_conversation_ids).intersection(set(conversations_with_start))

In [17]:
attachment_mids_successful_transfers_bool=attachment_mids['conversationID'].isin(successful_transfers_conversation_ids)
attachment_mids_successful_transfers=attachment_mids[attachment_mids_successful_transfers_bool]

In [18]:
successful_transfers_attachment_counts=attachment_mids_successful_transfers.drop('_time',axis=1).drop_duplicates().groupby('conversationID').agg('count').rename({'attachmentID':'Number of Attachments Expected (mids)'},axis=1)

In [19]:
copc_messages_in_successful_transfers_bool=copc_messages['conversationID'].isin(successful_transfers_conversation_ids)
copc_messages_in_successful_transfers=copc_messages[copc_messages_in_successful_transfers_bool]
copc_messages_in_successful_transfers=copc_messages_in_successful_transfers.drop('_time',axis=1).drop_duplicates()
successful_transfers_copc_messages_sent=copc_messages_in_successful_transfers.groupby('conversationID').agg({'GUID':'count'}).rename({'GUID':'Number of COPC messages sent'},axis=1)

# We expect the number of COPC messages to exceed the number of fragments by 1 as the first COPC message is not a fragment; it is just a starting COPC message
successful_transfers_copc_messages_sent=successful_transfers_copc_messages_sent-1

In [20]:
successful_transfers_copc_counts_table=successful_transfers_attachment_counts.merge(successful_transfers_copc_messages_sent,left_index=True,right_index=True,how='outer')

successful_transfers_copc_counts_table['difference in COPC counts']=successful_transfers_copc_counts_table['Number of COPC messages sent']-successful_transfers_copc_counts_table['Number of Attachments Expected (mids)']
print('Proportion of successful transfers for which the number of expected Attachments matches the number of COPC messages sent')
successful_transfers_copc_counts_table['difference in COPC counts'].value_counts()[0]/successful_transfers_copc_counts_table['difference in COPC counts'].shape[0]

Proportion of successful transfers for which the number of expected Attachments matches the number of COPC messages sent


0.9770181193213376

In [119]:
COPC_stats=pd.DataFrame(index=[],columns=['Number of Conversations'])
COPC_stats.loc['Total successful transfers']=successful_transfers_copc_counts_table['difference in COPC counts'].shape[0]
COPC_stats.loc['Successful Transfers which do match']=successful_transfers_copc_counts_table['difference in COPC counts'].value_counts()[0]
COPC_stats.loc['Successful Transfers which do not match']=successful_transfers_copc_counts_table['difference in COPC counts'].shape[0]-successful_transfers_copc_counts_table['difference in COPC counts'].value_counts()[0]
COPC_stats

Unnamed: 0,Number of Conversations
Total successful transfers,190846
Successful Transfers which do match,186460
Successful Transfers which do not match,4386


We see that in approximately 2.3% of cases, the number of COPC fragments sent does not match the number of attachments expected. Given how important it is to draw a clear line between the two pending states, these cases are worth inspecting further.
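
As a quick arithmetic check on the figures reported above, the sketch below recomputes the proportions from the printed counts. It also shows, on an invented toy series, one way to compute the match rate that treats missing differences (from the outer merge) as non-matching; `toy_difference` is hypothetical.

```python
import pandas as pd

# Toy differences (fragments sent minus attachments expected). NaN stands for a
# conversation present in only one of the two tables after an outer merge.
toy_difference = pd.Series([0, 0, 1, -2, float("nan")])
toy_match_rate = toy_difference.eq(0).mean()  # NaN never equals 0

# Arithmetic on the counts printed in the COPC_stats table.
total_successful = 190846
matching = 186460
non_matching = total_successful - matching
match_rate = matching / total_successful
non_matching_pct = round(100 * non_matching / total_successful, 1)

print(toy_match_rate, non_matching, match_rate, non_matching_pct)
```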

## Part 3. Exploration of transfers with a discrepancy between attachment MID and COPC message counts

In [22]:
successful_transfers_COPC_message_discrepency_bool=successful_transfers_copc_counts_table['difference in COPC counts']!=0
successful_transfers_COPC_message_discrepency=successful_transfers_copc_counts_table[successful_transfers_COPC_message_discrepency_bool]
successful_transfers_COPC_message_discrepency

conversationID,Number of Attachments Expected (mids),Number of COPC messages sent,difference in COPC counts
000E6FA0-833A-11EB-BA23-21CBCDBE5D76,3,2,-1
003530D0-8E17-11EB-A5E8-A9A7EFF7CF10,40,38,-2
003D1780-92E2-11EB-AD75-F7977F03F961,5,2,-3
003D9290-87F3-11EB-A8E7-B9B24A64CBA5,10,5,-5
005FD477-79D2-4881-801D-F9EAE1B02562,34,35,1
...,...,...,...
FF90140E-DFE1-4C4B-AFCC-729CE6BBAAB3,36,37,1
FF929737-C911-4380-9806-6563C1A51771,84,83,-1
FFB4169E-11E5-4B94-898C-90348295B1E1,6,5,-1
FFD4E36D-73FA-4F91-B427-D0D2B35A88B0,115,113,-2


In [23]:
successful_transfers_copc_counts_table['difference in COPC counts'].value_counts()

 0      186460
 1        1591
-1        1430
-2         423
-3         201
         ...  
-157         1
-88          1
-154         1
-50          1
-131         1
Name: difference in COPC counts, Length: 124, dtype: int64

We will inspect a single transfer where the number of COPC fragments sent is 2, yet 3 were expected. Despite this, it integrated successfully.
We will manually inspect this here and in Splunk. 

In [24]:
"000E6FA0-833A-11EB-BA23-21CBCDBE5D76".lower()

'000e6fa0-833a-11eb-ba23-21cbcdbe5d76'

In [25]:
gp2gp_spine.loc[gp2gp_spine['conversationID']=="000E6FA0-833A-11EB-BA23-21CBCDBE5D76"].sort_values(by='_time')

Unnamed: 0,_time,conversationID,GUID,interactionID,messageSender,messageRecipient,messageRef,jdiEvent,toSystem,fromSystem,interaction_name,month
4120968,2021-03-12 13:51:10.144000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,urn:nhs:names:services:gp2gp/RCMR_IN010000UK05,187633304048,498177902018,NotProvided,NONE,EMIS,SystmOne,req start,3
4454032,2021-03-12 13:51:42.804000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,DE720D25-C035-4492-B691-DB7C279C2504,urn:nhs:names:services:gp2gp/RCMR_IN030000UK06,498177902018,187633304048,NotProvided,NONE,SystmOne,EMIS,req complete,3
4404986,2021-03-12 13:51:44.197000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,148CB682-833A-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,187633304048,498177902018,NotProvided,NONE,EMIS,SystmOne,COPC,3
4404985,2021-03-12 13:51:44.309000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,0E5EA933-B2CF-4D5D-B5FE-61E7A2ED13EE,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,498177902018,187633304048,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,NONE,SystmOne,EMIS,ack,3
4541187,2021-03-12 13:51:48.283000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,0FB38286-5DF4-41D2-A357-8CC4F0444536,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,498177902018,187633304048,NotProvided,NONE,SystmOne,EMIS,COPC,3
4541178,2021-03-12 13:51:51.945000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,B082E4A2-E8E0-4E9B-B6A4-D30F552FACBE,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,498177902018,187633304048,NotProvided,NONE,SystmOne,EMIS,COPC,3
4541173,2021-03-12 13:51:53.578000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,1A22EB59-833A-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,187633304048,498177902018,B082E4A2-E8E0-4E9B-B6A4-D30F552FACBE,NONE,EMIS,SystmOne,ack,3
4244620,2021-03-12 13:52:47.170000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,3A13094B-833A-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,187633304048,498177902018,0FB38286-5DF4-41D2-A357-8CC4F0444536,NONE,EMIS,SystmOne,ack,3
4166577,2021-03-15 08:10:01.986000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,3CA0296E-0F44-4ABE-92B5-3F280C28F528,urn:nhs:names:services:gp2gp/RCMR_IN030000UK06,498177902018,187633304048,NotProvided,NONE,SystmOne,EMIS,req complete,3
3990740,2021-03-15 08:10:03.081000+00:00,000E6FA0-833A-11EB-BA23-21CBCDBE5D76,D8284147-8565-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,187633304048,498177902018,3CA0296E-0F44-4ABE-92B5-3F280C28F528,12,EMIS,SystmOne,ack,3


In [26]:
attachment_mids.loc[attachment_mids['conversationID']=="000E6FA0-833A-11EB-BA23-21CBCDBE5D76"]

Unnamed: 0,_time,attachmentID,conversationID
1848570,2021-03-12T13:51:42.816+0000,0FB38286-5DF4-41D2-A357-8CC4F0444536,000E6FA0-833A-11EB-BA23-21CBCDBE5D76
1933875,2021-03-12T13:51:48.285+0000,B082E4A2-E8E0-4E9B-B6A4-D30F552FACBE,000E6FA0-833A-11EB-BA23-21CBCDBE5D76
1986978,2021-03-15T08:10:01.994+0000,5CBEF397-4366-4081-B578-7041037ADBD2,000E6FA0-833A-11EB-BA23-21CBCDBE5D76


Note the presence of error code 12 (duplicates) above.
This is the cause of the discrepancy (based on inspection in the Splunk environment).
Let's see what proportion of the successful transfers with the COPC count discrepancy have error code 12.

In [120]:
# Is the reason for the discrepancy related to error code 12 (duplicates)?

print('What % of conversations have error code 12?')
conversations_with_error_12=gp2gp_spine.groupby('conversationID')['jdiEvent'].apply(list).apply(lambda jdis: '12' in jdis)
print(round(100*conversations_with_error_12.mean(),1))

print('What % of transfers where the fragment count does not match, have error code 12?')
print(round(conversations_with_error_12[successful_transfers_COPC_message_discrepency.index.unique()].mean()*100,1))

print('Total number of these transfers with error code 12')
print(conversations_with_error_12[successful_transfers_COPC_message_discrepency.index.unique()].sum())

COPC_stats.loc['Non matching successful transfers with Error Code 12']=conversations_with_error_12[successful_transfers_COPC_message_discrepency.index.unique()].sum()


What % of conversations have error code 12?
1.5
What % of transfers where the fragment count does not match, have error code 12?
50.4
Total number of these transfers with error code 12
2210


So 50.4% of successful transfers with the COPC discrepancy have error code 12 somewhere in the transfer; this is the likely cause of the discrepancy for those transfers and will need to be accounted for if this approach is implemented.
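
The per-conversation error-code flag used above can also be built without collecting lists. This is a minimal sketch on invented data; it assumes `jdiEvent` may hold a mix of strings and integers, which is why the values are cast to strings before comparison.

```python
import pandas as pd

# Toy stand-in for the Spine message table; rows and IDs are hypothetical.
toy_spine = pd.DataFrame({
    "conversationID": ["conv-a", "conv-a", "conv-b", "conv-b", "conv-c"],
    "jdiEvent": ["NONE", "12", "NONE", "NONE", 12],
})

# True for each conversation where any message carries error code 12.
has_error_12 = (
    toy_spine["jdiEvent"].astype(str).eq("12")
    .groupby(toy_spine["conversationID"])
    .any()
)
print(has_error_12)
```

Casting to `str` first means the check does not silently miss codes that were parsed as integers.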

We should now inspect successful transfers with the discrepancy that don't have error code 12.

In [121]:
conversations_without_error_12=~conversations_with_error_12
discrepency_transfers_without_12_bool=conversations_without_error_12[successful_transfers_COPC_message_discrepency.index.unique()]
conversation_ids_with_discrepency_without_12=discrepency_transfers_without_12_bool[discrepency_transfers_without_12_bool].index
COPC_stats.loc['Non matching successful transfers without Error Code 12']=len(conversation_ids_with_discrepency_without_12)

In [29]:
successful_transfers_COPC_message_discrepency.loc[conversation_ids_with_discrepency_without_12]

conversationID,Number of Attachments Expected (mids),Number of COPC messages sent,difference in COPC counts
003530D0-8E17-11EB-A5E8-A9A7EFF7CF10,40,38,-2
007D390D-8D9B-4154-AC4A-95CC58CB0918,103,99,-4
00854B50-8350-11EB-90EC-2DA8D845319E,198,192,-6
0087080E-BB11-469C-9DC7-0152CD5C52C2,82,80,-2
009AF571-1D52-4AF9-9DB2-B7E83E23E6DF,178,175,-3
...,...,...,...
FF6104D0-906F-11EB-88D5-CD16F5EF3B6C,150,148,-2
FF929737-C911-4380-9806-6563C1A51771,84,83,-1
FFB4169E-11E5-4B94-898C-90348295B1E1,6,5,-1
FFD4E36D-73FA-4F91-B427-D0D2B35A88B0,115,113,-2


In [30]:
successful_transfers_COPC_message_discrepency.loc[conversation_ids_with_discrepency_without_12,'difference in COPC counts'].value_counts().head(10)

-1    1328
-2     370
-3     161
 1     130
-4      68
-6      23
-5      20
 2      13
-7      12
 3       5
Name: difference in COPC counts, dtype: int64

In [31]:
# Quick check to see if there's any high volumes of a particular error code
discrepencies_without_12_jdi_counts=gp2gp_spine.loc[gp2gp_spine['conversationID'].isin(conversation_ids_with_discrepency_without_12)].pivot_table(index='conversationID',columns='jdiEvent',values='GUID',aggfunc='count').fillna(0)
discrepencies_without_12_jdi_occurences=discrepencies_without_12_jdi_counts.copy()
discrepencies_without_12_jdi_occurences[discrepencies_without_12_jdi_occurences>0]=1
print('% of successful transfers with a COPC discrepency but no error code 12 which have other error codes')
100*discrepencies_without_12_jdi_occurences.sum()/discrepencies_without_12_jdi_occurences.shape[0]

% of successful transfers with a COPC discrepency but no error code 12 which have other error codes


jdiEvent
11        0.137868
17        0.045956
19        0.045956
20        3.906250
25        0.873162
28        0.137868
29        4.136029
31        2.941176
99        1.332721
NONE    100.000000
dtype: float64

In [32]:
conversation_of_interest="FFB4169E-11E5-4B94-898C-90348295B1E1"
gp2gp_spine.loc[gp2gp_spine['conversationID']==conversation_of_interest].sort_values(by='_time')

Unnamed: 0,_time,conversationID,GUID,interactionID,messageSender,messageRecipient,messageRef,jdiEvent,toSystem,fromSystem,interaction_name,month
1761926,2021-03-24 14:11:15.698000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,6147B96C-FFFB-4665-9944-0601DE250076,urn:nhs:names:services:gp2gp/RCMR_IN010000UK05,200000010568,892009055013,NotProvided,NONE,SystmOne,EMIS,req start,3
1651750,2021-03-24 14:12:18.722000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,EFFBFD1A-8CAA-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/RCMR_IN030000UK06,892009055013,200000010568,NotProvided,NONE,EMIS,SystmOne,req complete,3
1778321,2021-03-24 14:12:33.396000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,D3D4CD52-9D4D-431D-85CF-F7C03EDC12B4,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,200000010568,892009055013,NotProvided,NONE,SystmOne,EMIS,COPC,3
1579045,2021-03-24 14:15:11.023000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,F0377F76-8CAA-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,892009055013,200000010568,NotProvided,NONE,EMIS,SystmOne,COPC,3
1834429,2021-03-24 14:15:11.753000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,F0305387-8CAA-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,892009055013,200000010568,NotProvided,NONE,EMIS,SystmOne,COPC,3
1834428,2021-03-24 14:15:12.236000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,F03C3A62-8CAA-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,892009055013,200000010568,NotProvided,NONE,EMIS,SystmOne,COPC,3
1579037,2021-03-24 14:15:12.998000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,F04A9247-8CAA-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,892009055013,200000010568,NotProvided,NONE,EMIS,SystmOne,COPC,3
1579033,2021-03-24 14:15:13.996000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,F02B9896-8CAA-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,892009055013,200000010568,NotProvided,NONE,EMIS,SystmOne,COPC,3
1834423,2021-03-24 14:15:14.485000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,36A3F254-4732-46ED-BB15-569A58226144,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,200000010568,892009055013,F0377F76-8CAA-11EB-BDC2-48DF37DF7D10,NONE,SystmOne,EMIS,ack,3
1834416,2021-03-24 14:15:15.329000+00:00,FFB4169E-11E5-4B94-898C-90348295B1E1,27B328CA-C068-4EA9-A79B-64B40D1F7236,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,200000010568,892009055013,F0305387-8CAA-11EB-BDC2-48DF37DF7D10,NONE,SystmOne,EMIS,ack,3


In [33]:
attachment_mids.loc[attachment_mids['conversationID']==conversation_of_interest].sort_values(by='_time')

Unnamed: 0,_time,attachmentID,conversationID
817592,2021-03-24T14:12:18.754+0000,F02B9896-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1
817591,2021-03-24T14:12:18.756+0000,F0305387-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1
817590,2021-03-24T14:12:18.760+0000,F0377F76-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1
817589,2021-03-24T14:12:18.762+0000,F03C3A62-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1
817588,2021-03-24T14:12:18.765+0000,F0436657-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1
817587,2021-03-24T14:12:18.767+0000,F04A9247-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1


Expected COPC messages in Spine:

F02B9896-8CAA-11EB-BDC2-48DF37DF7D10: Sent and acknowledged

F0305387-8CAA-11EB-BDC2-48DF37DF7D10: Sent and acknowledged

F0377F76-8CAA-11EB-BDC2-48DF37DF7D10: Sent and acknowledged

F03C3A62-8CAA-11EB-BDC2-48DF37DF7D10: **Sent but not acknowledged**

F0436657-8CAA-11EB-BDC2-48DF37DF7D10: **Acknowledged but not sent**

F04A9247-8CAA-11EB-BDC2-48DF37DF7D10: Sent and acknowledged

So in the Spine messages, we see 5 COPC fragments sent and 5 messages acknowledged. But they don't match: one fragment is missing an acknowledgement, and one acknowledgement does not correspond to any sent message.

Yet all of these messages are expected according to the MID file.

GP2GP nevertheless continues and completes the integration despite this mismatch.

Do the IDs change midway through a transfer?
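
The per-MID classification listed above can be expressed with plain set membership. The sketch below uses invented IDs, and the function name `classify_mids` is hypothetical; it is an illustration of the idea rather than the code used in this notebook.

```python
def classify_mids(expected_mids, sent_guids, acked_refs):
    """Hypothetical helper: label each expected MID by how it appears in Spine.

    expected_mids: attachment IDs from the metadata.
    sent_guids: GUIDs of messages seen in Spine for the conversation.
    acked_refs: messageRef values of acknowledgements in the conversation.
    """
    sent = set(sent_guids)
    acked = set(acked_refs)
    statuses = {}
    for mid in expected_mids:
        if mid in sent and mid in acked:
            statuses[mid] = "sent and acknowledged"
        elif mid in sent:
            statuses[mid] = "sent but not acknowledged"
        elif mid in acked:
            statuses[mid] = "acknowledged but not sent"
        else:
            statuses[mid] = "missing"
    return statuses


# Invented IDs, for illustration only.
statuses = classify_mids(
    expected_mids=["mid-1", "mid-2", "mid-3", "mid-4"],
    sent_guids=["mid-1", "mid-2"],
    acked_refs=["mid-1", "mid-3"],
)
print(statuses)
```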


We can now check whether this is common: for the discrepancies without error code 12, do we see all the MIDs in Spine, but not always both sent AND acknowledged?

In [34]:
attachment_mids_discrepencies_without_12=attachment_mids.loc[attachment_mids['conversationID'].isin(conversation_ids_with_discrepency_without_12)]
spine_messages_discrepencies_without_12=gp2gp_spine.loc[gp2gp_spine['conversationID'].isin(conversation_ids_with_discrepency_without_12)]
spine_messages_discrepencies_without_12

Unnamed: 0,_time,conversationID,GUID,interactionID,messageSender,messageRecipient,messageRef,jdiEvent,toSystem,fromSystem,interaction_name,month
233,2021-03-31 19:01:07.915000+00:00,ED770FBC-E884-44E0-8640-4BC37E6F5FE9,747D1A6A-A315-4A7F-851A-46018C07714D,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,200000011174,50625659045,A86287C0-5B45-4D04-9387-B0871BEED52F,NONE,EMIS,EMIS,ack,3
585,2021-03-31 18:20:25.481000+00:00,C2841230-9201-11EB-AC9D-11926ABB2185,C371E310-924D-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,75700954011,706945148015,E127DD66-F144-4926-B74F-12537DB87F43,NONE,EMIS,SystmOne,ack,3
5892,2021-03-31 17:27:44.296000+00:00,B3667460-9205-11EB-B705-25F564AD4013,673BA421-9246-11EB-BDC2-48DF37DF7D10,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,118930647011,268433151019,44045B55-28BB-434F-8206-09E911524910,NONE,EMIS,SystmOne,ack,3
10356,2021-03-31 16:54:04.945000+00:00,1C69EFF5-A47E-413C-BC2E-8B8D5CC7839E,8B3DF0F7-C2B1-4067-8F7C-82F8A5DECC18,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,228500756017,685720180049,766068C7-D84E-4DF6-9F7F-60B7BC872E73,NONE,EMIS,EMIS,ack,3
21939,2021-03-31 16:25:08.939000+00:00,CE8D94CA-00A7-4111-A840-E22AE9F6C87F,607C1950-62B1-4468-996D-614DB6255CFC,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,124544614017,800792063045,AAC71089-9200-11EB-BDC2-48DF37DF7D10,NONE,SystmOne,EMIS,ack,3
...,...,...,...,...,...,...,...,...,...,...,...,...
6557036,2021-04-01 14:21:24.188000+00:00,87CA4265-6DC2-4648-8F73-681D6AB0BF0C,97A13F62-591E-4D00-B666-54640AB4DEB1,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,200000015334,200000000193,NotProvided,NONE,EMIS,EMIS,COPC,4
6557038,2021-04-01 14:21:23.637000+00:00,87CA4265-6DC2-4648-8F73-681D6AB0BF0C,6AF42D60-2D8B-47EA-A242-AEF6951AC611,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,200000000193,200000015334,3843330D-B4D7-4E0D-8FBD-F794071F9D8D,NONE,EMIS,EMIS,ack,4
6557041,2021-04-01 14:21:22.910000+00:00,87CA4265-6DC2-4648-8F73-681D6AB0BF0C,D8FBA316-641B-41AF-B37C-968FFDD43785,urn:nhs:names:services:gp2gp/COPC_IN000001UK01,200000015334,200000000193,NotProvided,NONE,EMIS,EMIS,COPC,4
6557048,2021-04-01 14:21:22.727000+00:00,87CA4265-6DC2-4648-8F73-681D6AB0BF0C,F93C0524-445E-4621-9692-0321EF2F1A0B,urn:nhs:names:services:gp2gp/MCCI_IN010000UK13,200000000193,200000015334,F0BD32DB-7502-411C-8240-6481D656CA1D,NONE,EMIS,EMIS,ack,4


In [35]:
discrepency_counts=attachment_mids_discrepencies_without_12.copy()
message_sent_list=pd.DataFrame(spine_messages_discrepencies_without_12[['conversationID','GUID']])
message_sent_list['message sent']=True
message_sent_list=message_sent_list.drop_duplicates()
discrepency_counts=discrepency_counts.merge(message_sent_list,left_on=['conversationID','attachmentID'],right_on=['conversationID','GUID'],how='left').drop("GUID",axis=1)

message_acknowledged_list=pd.DataFrame(spine_messages_discrepencies_without_12[['conversationID','messageRef']])
message_acknowledged_list['message acknowledged']=True
message_acknowledged_list=message_acknowledged_list.drop_duplicates()
discrepency_counts=discrepency_counts.merge(message_acknowledged_list,left_on=['conversationID','attachmentID'],right_on=['conversationID','messageRef'],how='left').drop("messageRef",axis=1)
discrepency_counts=discrepency_counts.fillna(False)
discrepency_counts=discrepency_counts.drop('_time',axis=1).drop_duplicates()

Unnamed: 0,attachmentID,conversationID,message sent,message acknowledged,messages seen,summary
0,59A31618-D0DB-4207-A0D3-CD80D8A0DA1E,C91B212E-7390-4087-8216-40997C7A5DC3,True,True,True,"Message Sent,Message Acknowledged,"
1,FD0D68D8-127F-48AE-88F6-31D7797FF726,C91B212E-7390-4087-8216-40997C7A5DC3,True,True,True,"Message Sent,Message Acknowledged,"
2,EDF7ADA9-0F83-45EF-8046-8B976FA3035E,C91B212E-7390-4087-8216-40997C7A5DC3,True,True,True,"Message Sent,Message Acknowledged,"
3,DCECCD1A-C388-4A2F-845B-924085C78B8E,C91B212E-7390-4087-8216-40997C7A5DC3,True,True,True,"Message Sent,Message Acknowledged,"
4,6AC24946-571C-45C4-9346-E985EFCCFE0D,C91B212E-7390-4087-8216-40997C7A5DC3,True,True,True,"Message Sent,Message Acknowledged,"
...,...,...,...,...,...,...
189388,4ED92459-7A83-11EB-BDC1-48DF37DF7D10,E331BB0D-43EC-4140-AB6A-1CD1089DC8A7,True,True,True,"Message Sent,Message Acknowledged,"
189389,4ED6B36B-7A83-11EB-BDC1-48DF37DF7D10,E331BB0D-43EC-4140-AB6A-1CD1089DC8A7,True,True,True,"Message Sent,Message Acknowledged,"
189390,4ED6B364-7A83-11EB-BDC1-48DF37DF7D10,E331BB0D-43EC-4140-AB6A-1CD1089DC8A7,True,True,True,"Message Sent,Message Acknowledged,"
189391,4ED6B359-7A83-11EB-BDC1-48DF37DF7D10,E331BB0D-43EC-4140-AB6A-1CD1089DC8A7,True,True,True,"Message Sent,Message Acknowledged,"


In [36]:
discrepency_counts.loc[discrepency_counts['attachmentID']=="F0436657-8CAA-11EB-BDC2-48DF37DF7D10"]

Unnamed: 0,attachmentID,conversationID,message sent,message acknowledged
121189,F0436657-8CAA-11EB-BDC2-48DF37DF7D10,FFB4169E-11E5-4B94-898C-90348295B1E1,False,True


In [37]:
# We expect to see a discrepancy here (not every MID is both sent and acknowledged, as seen in the conversation above)
(discrepency_counts['message sent'] | discrepency_counts['message acknowledged']).value_counts()

True     187519
False      1874
dtype: int64

In [38]:
discrepency_counts['messages seen']=(discrepency_counts['message sent'] | discrepency_counts['message acknowledged']) 
discrepencies_by_conversation=discrepency_counts.groupby('conversationID').agg({'messages seen':['count','sum']})['messages seen'].rename({'count':'Attachment MIDs in metadata','sum':'Attachments in Spine (sent OR acknowledged)'},axis=1)
discrepencies_by_conversation['attachment count difference']=(discrepencies_by_conversation['Attachment MIDs in metadata']-discrepencies_by_conversation['Attachments in Spine (sent OR acknowledged)'])

In [39]:
discrepencies_by_conversation['attachment count difference'].value_counts()[0]

2091

In [40]:
discrepency_counts['summary']=''
discrepency_counts.loc[discrepency_counts['message sent'],'summary']=discrepency_counts['summary'] +'Message Sent,'
discrepency_counts.loc[discrepency_counts['message acknowledged'],'summary']=discrepency_counts['summary'] +'Message Acknowledged,'
discrepency_counts.loc[discrepency_counts['summary']=='','summary']='Message missing'

In [41]:
discrepencies_message_distributions_by_conversation=discrepency_counts.pivot_table(index='conversationID',columns='summary',values='attachmentID',aggfunc='count').fillna(0)
discrepencies_message_distributions_by_conversation

conversationID,"Message Acknowledged,","Message Sent,","Message Sent,Message Acknowledged,",Message missing
003530D0-8E17-11EB-A5E8-A9A7EFF7CF10,2.0,0.0,38.0,0.0
007D390D-8D9B-4154-AC4A-95CC58CB0918,4.0,0.0,99.0,0.0
00854B50-8350-11EB-90EC-2DA8D845319E,6.0,2.0,190.0,0.0
0087080E-BB11-469C-9DC7-0152CD5C52C2,2.0,1.0,79.0,0.0
009AF571-1D52-4AF9-9DB2-B7E83E23E6DF,0.0,4.0,170.0,4.0
...,...,...,...,...
FF6104D0-906F-11EB-88D5-CD16F5EF3B6C,2.0,3.0,145.0,0.0
FF929737-C911-4380-9806-6563C1A51771,1.0,1.0,82.0,0.0
FFB4169E-11E5-4B94-898C-90348295B1E1,1.0,1.0,4.0,0.0
FFD4E36D-73FA-4F91-B427-D0D2B35A88B0,2.0,3.0,110.0,0.0


In [68]:
print((discrepencies_message_distributions_by_conversation['Message missing']>0).mean())

0.0390625


In [70]:
print((discrepencies_message_distributions_by_conversation['Message Acknowledged,']==discrepencies_message_distributions_by_conversation['Message Sent,']).mean())

0.2931985294117647


In only about 29% of the remaining cases does the number of MIDs that were acknowledged but not sent match the number that were sent but not acknowledged.

This suggests that the main issue is not the message "switching" ID.
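
A sketch of the same per-conversation comparison on invented data is shown below, using `pd.crosstab` in place of a pivot table. The status labels and the frame `toy_statuses` are hypothetical.

```python
import pandas as pd

# Toy per-MID statuses; conversations and labels are invented for illustration.
toy_statuses = pd.DataFrame({
    "conversationID": ["conv-a"] * 4 + ["conv-b"] * 3 + ["conv-c"] * 2,
    "summary": [
        "sent only", "acknowledged only", "both", "both",
        "acknowledged only", "acknowledged only", "both",
        "both", "both",
    ],
})

# One row per conversation, one column per status, values are MID counts.
counts = pd.crosstab(toy_statuses["conversationID"], toy_statuses["summary"])

# Plain equality, as in the cell above.
balanced = counts["acknowledged only"].eq(counts["sent only"])

# Stricter variant: equal AND non-zero, i.e. at least one unmatched pair.
swapped_like = balanced & counts["sent only"].gt(0)
print(balanced, swapped_like, sep="\n")
```

One caveat: a plain equality test also counts conversations where both counts are zero (`conv-c` here). If the counts reported in this notebook are taken at face value, 2176 - 1938 = 238 conversations have neither a sent-only nor an acknowledged-only MID, and all of them satisfy the equality, so the share with a genuinely "swapped" pattern may be lower than 29%.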

In [44]:
(discrepencies_message_distributions_by_conversation.drop(['Message Sent,Message Acknowledged,',"Message missing"],axis=1).sum(axis=1)>0).mean()

0.890625

In [122]:
#COPC_stats.loc['Non-matching successful transfers without error 12']=len(conversation_ids_with_discrepency_without_12)
COPC_stats.loc['Non-matching successful transfers without error 12 missing EITHER sent OR acknowledged messages']=(discrepencies_message_distributions_by_conversation.drop(['Message Sent,Message Acknowledged,',"Message missing"],axis=1).sum(axis=1)>0).sum()
COPC_stats.loc['Non-matching successful transfers without error 12 where missing number sent matches missing number acknowledged']=(discrepencies_message_distributions_by_conversation['Message Acknowledged,']==discrepencies_message_distributions_by_conversation['Message Sent,']).sum()
COPC_stats.loc['Non-matching successful transfers without error 12 with a missing attachmentID']=(discrepencies_message_distributions_by_conversation['Message missing']>0).sum()

In [45]:
(discrepencies_message_distributions_by_conversation.drop(['Message Sent,Message Acknowledged,',"Message missing"],axis=1).sum(axis=1)>0).sum()

1938

So of the remaining discrepancies (we think about 50% of all discrepancies are due to error code 12), all the expected COPC MIDs appear in Spine in about 96% of conversations.

However, in 89% of these conversations, there are COPC MIDs which do not show up as being both sent and acknowledged (they appear as only one of the two).
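
The percentages quoted above can be recomputed from the counts printed in this notebook. The variable names below are illustrative; the numbers are the ones reported by the cells above.

```python
# Counts reported earlier in this notebook.
non_matching = 4386          # successful transfers where counts do not match
with_error_12 = 2210         # of those, conversations containing error code 12
without_error_12 = 2176      # of those, conversations without error code 12
with_missing_mid = 85        # without error 12, at least one MID absent from Spine
sent_or_acked_only = 1938    # without error 12, some MID only sent OR only acknowledged

share_error_12 = with_error_12 / non_matching
share_all_mids_seen = 1 - with_missing_mid / without_error_12
share_partial = sent_or_acked_only / without_error_12

print(round(100 * share_error_12, 1))       # share attributed to error code 12
print(round(100 * share_all_mids_seen, 1))  # share where every MID appears in Spine
print(round(100 * share_partial, 1))        # share with a MID seen only one way
```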

In [123]:
COPC_stats['% of successful transfers']=(COPC_stats['Number of Conversations']/COPC_stats.loc['Total successful transfers','Number of Conversations']).multiply(100).astype(float).round(2)
COPC_stats

Unnamed: 0,Number of Conversations,% of successful transfers
Total successful transfers,190846,100.0
Successful Transfers which do match,186460,97.7
Successful Transfers which do not match,4386,2.3
Non matching successful transfers with Error Code 12,2210,1.16
Non matching successful transfers without Error Code 12,2176,1.14
Non-matching successful transfers without error 12 missing EITHER sent OR acknowledged messages,1938,1.02
Non-matching successful transfers without error 12 where missing number sent matches missing number acknowledged,638,0.33
Non-matching successful transfers without error 12 with a missing attachmentID,85,0.04
