# PRMR-1528 Attachment metadata deduplication

### Context

We suspect that the attachments dataset contains duplicates. We believe attachments are being logged multiple times:
- once in the `EHR request completed` message
- again in a COPC message when large messaging is used (due to the overall size of the EHR and its attachments exceeding the 5 MB or 99-attachment limit)
- again in multiple COPC messages when an attachment is broken down into fragments (due to being over 5 MB)
- again if there are duplicate `EHR request completed` messages

Duplication caused by COPC messages could be resolved by filtering them out, so that for each transfer we only consider attachments referenced in the manifest contained in the `EHR request completed` message. We propose enriching the attachments dataset with the interaction ID to enable this filtering.

Duplication caused by multiple `EHR request completed` messages could be resolved by counting attachments only from the latest one.
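
As a rough illustration of the two filters proposed above, the sketch below applies them to a handful of made-up records in plain Python. The record fields and the helper name `deduplicate_attachments` are assumptions for illustration only; the actual analysis is done in SQL in the notebook cells.

```python
EHR_REQUEST_COMPLETED = "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06"
COPC = "urn:nhs:names:services:gp2gp/COPC_IN000001UK01"


def deduplicate_attachments(messages, attachments):
    """Keep only attachments logged against the latest EHR request completed
    message of each conversation (hypothetical helper, toy data model)."""
    latest = {}
    for message in messages:
        if message["interaction_id"] != EHR_REQUEST_COMPLETED:
            continue  # ignore COPC messages entirely
        current = latest.get(message["conversation_id"])
        # ISO-style timestamps of equal length compare correctly as strings
        if current is None or message["time"] > current["time"]:
            latest[message["conversation_id"]] = message
    keep = {message["internal_id"] for message in latest.values()}
    return [a for a in attachments if a["internal_id"] in keep]


# Made-up example: one attachment logged three times in one conversation.
messages = [
    {"internal_id": "m1", "conversation_id": "c1",
     "interaction_id": EHR_REQUEST_COMPLETED, "time": "2020-12-21 08:00:00"},
    {"internal_id": "m2", "conversation_id": "c1",
     "interaction_id": EHR_REQUEST_COMPLETED, "time": "2020-12-21 09:00:00"},
    {"internal_id": "m3", "conversation_id": "c1",
     "interaction_id": COPC, "time": "2020-12-21 09:05:00"},
]
attachments = [
    {"attachment_id": "a1", "internal_id": "m1"},
    {"attachment_id": "a1", "internal_id": "m2"},
    {"attachment_id": "a1", "internal_id": "m3"},
]

result = deduplicate_attachments(messages, attachments)
print(result)
```

Only the copy attached to `m2`, the later of the two `EHR request completed` messages, is kept.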

### Requirements

In order to replicate this notebook, perform the following steps:

1. Log into Splunk and run the following query for the time frame 21/12/2020 00:00:00 to 03/01/2021 24:00:00:

```
index="spine2vfmmonitor" logReference=MPS0208
| table _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64, internalID
```

2. Download the results as a CSV and place it in a directory called `attachments_metadata`. Set the `INPUT_DATA_DIR` environment variable to point to the _parent_ of this directory.

3. Run the following Splunk query for the same time range, and place the resulting CSV in a directory called `gp2gp_messages`, alongside the first one.

```
index="spine2vfmmonitor" service="gp2gp" logReference="MPS0053c"
| table _time, conversationID, internalID, interactionID
```

Example directory layout, where `INPUT_DATA_DIR` is `/attachments`.
```
attachments/attachments_metadata/attachments.csv
attachments/gp2gp_messages/gp2gp_messages.csv
```
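
Before running the notebook it can be useful to check that the input directory matches this layout. The following is a minimal sketch of such a check; the helper `missing_input_dirs` is hypothetical and not part of the project's `scripts` package, and the directory names are taken from the example layout above.

```python
import tempfile
from pathlib import Path

# Directory names as shown in the example layout (assumed, not verified
# against what construct_attachments_db actually reads).
EXPECTED_SUBDIRS = ("attachments_metadata", "gp2gp_messages")


def missing_input_dirs(input_data_dir):
    """Return the expected subdirectories that are absent or hold no CSV files."""
    root = Path(input_data_dir)
    return [
        name for name in EXPECTED_SUBDIRS
        if not any((root / name).glob("*.csv"))
    ]


# Demonstration on a temporary directory that only has the first data set.
with tempfile.TemporaryDirectory() as tmp:
    metadata_dir = Path(tmp) / "attachments_metadata"
    metadata_dir.mkdir()
    (metadata_dir / "attachments.csv").touch()
    missing = missing_input_dirs(tmp)

print(missing)
```

In the notebook itself the call would be `missing_input_dirs(os.environ["INPUT_DATA_DIR"])`, with an empty list meaning both data sets were found.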

In [1]:
import paths, os
import duckdb
from scripts.attachments import construct_attachments_db

In [2]:
attachment_data_dir = os.environ["INPUT_DATA_DIR"]
cursor = duckdb.connect()
construct_attachments_db(cursor, attachment_data_dir)
attachments = cursor.table("attachment_metadata")

In [3]:
# Attach the interaction ID of the carrying message to each attachment row.
cursor.execute("""
    create or replace view attachment_messages as
        select attachment_metadata.*, interaction_id from attachment_metadata
        left join gp2gp_messages
        on attachment_metadata.internal_id=gp2gp_messages.internal_id;
""")

# Count attachment rows that could not be matched to a GP2GP message.
cursor.execute("""
    select count(*) as conversations_with_no_interaction_id
    from attachment_messages where interaction_id is null
""").df()

Unnamed: 0,conversations_with_no_interaction_id
0,0


### Count of attachments logged in EHR request completed and COPC messages

In [4]:
cursor.execute("""
    select count(*) as count, interaction_id
    from attachment_messages group by interaction_id""").df()

Unnamed: 0,count,interaction_id
0,2130721,urn:nhs:names:services:gp2gp/RCMR_IN030000UK06
1,372045,urn:nhs:names:services:gp2gp/COPC_IN000001UK01


### Conversations containing more than one EHR request completed message

In [5]:
cursor.execute("""
    select * from (
    select count(*) as count, conversation_id
    from gp2gp_messages
    where interaction_id='urn:nhs:names:services:gp2gp/RCMR_IN030000UK06'
    group by conversation_id) request_completed_message_per_conversation 
    where count > 1
""").df()

Unnamed: 0,count,conversation_id
0,2,EC1AD9F3-EFC2-4C6F-9C30-B696C1EF8074
1,2,F5EB37B0-B344-41C3-BD1A-9AA9B09E65FD
2,2,8A05247C-9A68-491B-9652-A3E3A9378BC5
3,2,5D701E8E-6E67-4BE5-920B-8D057D61503D
4,3,FBA54C48-66F8-46DF-9C31-00401C781EDE
...,...,...
702,3,EDD96FD2-BD57-4CFE-A2C3-32BAB34683C2
703,2,788F5DCF-68A3-40D3-97F5-A9964CA4BBC2
704,3,3F6E8C90-412F-11EB-A1CF-874D6673CFE6
705,2,B79B3994-3AA4-4550-9DB1-37C771A0E88B


In [6]:
cursor.execute("""
    create or replace view ehr_request_completed_messages as 
        select * from gp2gp_messages
        where interaction_id='urn:nhs:names:services:gp2gp/RCMR_IN030000UK06'
""")

<duckdb.DuckDBPyConnection at 0x7f7f4022ae70>

In [7]:
cursor.execute("""
create or replace view unique_ehr_request_completed_messages as 
select a.* from ehr_request_completed_messages a
inner join (
    select conversation_id, max(time) as time
    from ehr_request_completed_messages
    group by conversation_id
) b on a.conversation_id = b.conversation_id and a.time = b.time""")

<duckdb.DuckDBPyConnection at 0x7f7f4022ae70>
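
One caveat with the `max(time)` join above: if two `EHR request completed` messages in the same conversation share a timestamp, both rows survive the join. The sketch below reproduces that on made-up rows and shows one possible alternative using `row_number()`. It uses the standard-library `sqlite3` module instead of DuckDB purely so that it runs without extra dependencies (window functions need SQLite 3.25 or later), and the tie-break on `internal_id` is an arbitrary assumption rather than something taken from this analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ehr_request_completed_messages (
        internal_id text, conversation_id text, time text
    );
    insert into ehr_request_completed_messages values
        ('m1', 'c1', '2020-12-21 08:00:00.000'),
        ('m2', 'c1', '2020-12-21 09:00:00.000'),
        ('m3', 'c1', '2020-12-21 09:00:00.000'),
        ('m4', 'c2', '2020-12-22 10:00:00.000');
""")

# Same pattern as the notebook cell: join back on the maximum time.
# Conversation c1 has two messages with the same latest timestamp.
joined = conn.execute("""
    select a.internal_id
    from ehr_request_completed_messages a
    inner join (
        select conversation_id, max(time) as time
        from ehr_request_completed_messages
        group by conversation_id
    ) b on a.conversation_id = b.conversation_id and a.time = b.time
    order by a.internal_id
""").fetchall()

# Alternative: number the messages within each conversation, newest first,
# and keep exactly one row per conversation.
latest = conn.execute("""
    select internal_id, conversation_id from (
        select *, row_number() over (
            partition by conversation_id
            order by time desc, internal_id desc
        ) as row_num
        from ehr_request_completed_messages
    )
    where row_num = 1
    order by conversation_id
""").fetchall()

print(joined)
print(latest)
```

With these rows the join returns three messages (`m2`, `m3`, `m4`), while the `row_number()` variant returns one per conversation. Whether such ties occur in the real data is not checked here.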

In [8]:
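# Keep only attachments logged against the latest EHR request completed
# message of each conversation. "or replace" lets the cell be re-run.
cursor.execute("""
    create or replace table ehr_attachments as
    select a.* from attachment_metadata a
    join unique_ehr_request_completed_messages b
    on a.internal_id = b.internal_id
""")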

cursor.execute("select * from ehr_attachments").df()

Unnamed: 0,time,attachment_id,conversation_id,from_system,to_system,attachment_type,compressed,content_type,large_attachment,length,original_base64,internal_id
0,2021-01-03 18:25:05.371,FE4B9C70-4DF0-11EB-93BE-48DF371F565C,0F17219E-5549-4DF0-BE9A-6601E34D5F7B,SystmOne,EMIS Web,mid,True,image/tiff,False,1875188.0,False,20210103182505020335_EC3E43_1522284092
1,2021-01-03 18:25:05.370,FE46E183-4DF0-11EB-93BE-48DF371F565C,0F17219E-5549-4DF0-BE9A-6601E34D5F7B,SystmOne,EMIS Web,mid,False,image/tiff,False,454000.0,False,20210103182505020335_EC3E43_1522284092
2,2021-01-03 18:25:05.369,FE46E180-4DF0-11EB-93BE-48DF371F565C,0F17219E-5549-4DF0-BE9A-6601E34D5F7B,SystmOne,EMIS Web,mid,False,image/tiff,False,350536.0,False,20210103182505020335_EC3E43_1522284092
3,2021-01-03 18:25:05.368,FE447080-4DF0-11EB-93BE-48DF371F565C,0F17219E-5549-4DF0-BE9A-6601E34D5F7B,SystmOne,EMIS Web,mid,False,image/tiff,False,345202.0,False,20210103182505020335_EC3E43_1522284092
4,2021-01-03 18:25:05.366,attachment20.0@test.com,0F17219E-5549-4DF0-BE9A-6601E34D5F7B,SystmOne,EMIS Web,cid,False,image/tiff,False,339098.0,False,20210103182505020335_EC3E43_1522284092
...,...,...,...,...,...,...,...,...,...,...,...,...
2091050,2020-12-21 08:27:32.750,attachment5.0@test.com,BAD45EEF-57D4-4BAA-AF85-637D462DC602,SystmOne,EMIS Web,cid,False,text/rtf,False,684.0,False,20201221082732676178_D03B72_1522284092
2091051,2020-12-21 08:27:32.749,attachment4.0@test.com,BAD45EEF-57D4-4BAA-AF85-637D462DC602,SystmOne,EMIS Web,cid,False,text/rtf,False,610.0,False,20201221082732676178_D03B72_1522284092
2091052,2020-12-21 08:27:32.748,attachment3.0@test.com,BAD45EEF-57D4-4BAA-AF85-637D462DC602,SystmOne,EMIS Web,cid,False,text/rtf,False,298.0,False,20201221082732676178_D03B72_1522284092
2091053,2020-12-21 08:27:32.745,attachment2.0@test.com,BAD45EEF-57D4-4BAA-AF85-637D462DC602,SystmOne,EMIS Web,cid,False,text/rtf,False,234.0,False,20201221082732676178_D03B72_1522284092
