# PRMR-1528 Attachment metadata deduplication

### Context

We suspect that the attachments dataset contains duplicates. We believe a single attachment can be logged multiple times:
- once in the `EHR request completed` message
- again in a COPC message when large messaging is used (because the overall size of the EHR and its attachments exceeds 5 MB, or the limit of 99 attachments is exceeded)
- again in multiple COPC messages when an attachment is split into fragments (because it is over 5 MB)
- again if there are duplicated `EHR request completed` messages

Duplication caused by COPC messages could be resolved by filtering them out, so that for each transfer we only consider attachments referenced in the manifest contained in the `EHR request completed` message. We propose enriching the attachments dataset with the interaction ID to enable this filtering.

Duplication caused by multiple `EHR request completed` messages could be resolved by counting attachments only from the latest one.
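
The two filtering rules above can be sketched in plain Python on made-up records. This is only an illustration: the record layout, the helper names and the COPC interaction ID value are assumptions, and the analysis in this notebook uses DuckDB views rather than this code.

```python
EHR_REQUEST_COMPLETED = "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06"
# Assumed COPC interaction ID, used here for illustration only.
COPC = "urn:nhs:names:services:gp2gp/COPC_IN000001UK01"

# Hypothetical messages: (conversation_id, internal_id, interaction_id, time)
messages = [
    ("conv-1", "msg-a", EHR_REQUEST_COMPLETED, "2020-12-21T10:00:00"),
    ("conv-1", "msg-b", COPC, "2020-12-21T10:01:00"),
    ("conv-1", "msg-c", EHR_REQUEST_COMPLETED, "2020-12-21T10:05:00"),
    ("conv-2", "msg-d", EHR_REQUEST_COMPLETED, "2020-12-22T09:00:00"),
]

# Hypothetical attachment log rows: (attachment_id, internal_id)
attachments = [
    ("att-1", "msg-a"),
    ("att-1", "msg-b"),
    ("att-1", "msg-c"),
    ("att-2", "msg-d"),
]


def latest_ehr_message_ids(messages):
    """Return the internal ID of the latest EHR request completed message per conversation."""
    latest = {}
    for conversation_id, internal_id, interaction_id, time in messages:
        if interaction_id != EHR_REQUEST_COMPLETED:
            continue  # rule 1: ignore COPC (and any other) messages
        current = latest.get(conversation_id)
        # rule 2: keep only the most recent message (ISO timestamps sort as strings)
        if current is None or time > current[0]:
            latest[conversation_id] = (time, internal_id)
    return {internal_id for _, internal_id in latest.values()}


def deduplicate(attachments, messages):
    """Keep attachment rows logged against the retained messages only."""
    keep = latest_ehr_message_ids(messages)
    return [row for row in attachments if row[1] in keep]


deduplicated = deduplicate(attachments, messages)
print(deduplicated)
```

In this toy data `att-1` is logged three times for `conv-1`, and only the row attached to the latest `EHR request completed` message survives.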

### Requirements

In order to replicate this notebook, perform the following steps:

1. Log into Splunk and run the following query for the time frame 21/12/2020 00:00:00 to 03/01/2021 24:00:00. Export the result as a CSV file named `attachment_metadata.csv`.

```
index="spine2vfmmonitor" logReference=MPS0208
| table _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64, internalID
```

2. Run the following Splunk query for the same time range. Export the result as a csv named `gp2gp_messages.csv`.

```
index="spine2vfmmonitor" service="gp2gp" logReference="MPS0053c"
| table _time, conversationID, internalID, interactionID
```

3. Place both CSV files in a directory called `attachments`. Set the `INPUT_DATA_DIR` environment variable to point to this directory.

Example directory layout, where `INPUT_DATA_DIR` is `attachments`:
```
attachments/attachment_metadata.csv
attachments/gp2gp_messages.csv
```
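
Before running the notebook it can help to confirm that both exports are where the code expects them. The following is a minimal sketch; the helper `missing_input_files` is hypothetical and not part of this repository, and the fallback to `attachments` is only for illustration.

```python
import os
from pathlib import Path

EXPECTED_FILES = ("attachment_metadata.csv", "gp2gp_messages.csv")


def missing_input_files(data_dir):
    """Return the expected CSV file names that are not present in data_dir."""
    directory = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


# Fall back to "attachments" only for this illustration.
data_dir = os.environ.get("INPUT_DATA_DIR", "attachments")
missing = missing_input_files(data_dir)
if missing:
    print(f"Missing from {data_dir}: {', '.join(missing)}")
```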

In [None]:
import paths
import os
import duckdb
from scripts.attachments import construct_attachments_db

In [None]:
# Load both CSV exports into an in-memory DuckDB database.
attachment_data_dir = os.environ["INPUT_DATA_DIR"]
cursor = duckdb.connect()
construct_attachments_db(cursor, attachment_data_dir)
attachments = cursor.table("attachment_metadata")

In [None]:
# Enrich each attachment row with the interaction ID of the message it was logged against.
cursor.execute("""
    create or replace view attachment_messages as
        select attachment_metadata.*, interaction_id from attachment_metadata
        left join gp2gp_messages
        on attachment_metadata.internal_id=gp2gp_messages.internal_id;
""")

# Attachment rows with no matching GP2GP message have a null interaction ID.
cursor.execute("select count(*) as attachments_with_no_interaction_id from attachment_messages where interaction_id is null").df()
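
The null check works because a left join keeps every attachment row and fills the message columns with nulls when no message shares its `internal_id`. A minimal sketch on made-up rows, using `sqlite3` from the standard library instead of DuckDB so that it runs without extra dependencies (the table contents are assumptions):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    create table attachment_metadata (attachment_id text, internal_id text);
    create table gp2gp_messages (internal_id text, interaction_id text);
    insert into attachment_metadata values
        ('att-1', 'msg-a'), ('att-2', 'msg-b'), ('att-3', 'msg-x');
    insert into gp2gp_messages values
        ('msg-a', 'urn:nhs:names:services:gp2gp/RCMR_IN030000UK06'),
        ('msg-b', 'urn:nhs:names:services:gp2gp/RCMR_IN030000UK06');
""")

# 'att-3' was logged against 'msg-x', which has no row in gp2gp_messages,
# so its interaction_id is null after the left join.
unmatched = connection.execute("""
    select count(*)
    from attachment_metadata
    left join gp2gp_messages
    on attachment_metadata.internal_id = gp2gp_messages.internal_id
    where interaction_id is null
""").fetchone()[0]
print(unmatched)
```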

### Count of the EHR request completed and COPC messages

In [None]:
cursor.execute("""
    select count(*) as count, interaction_id
    from attachment_messages group by interaction_id""").df()

### Conversations containing more than one EHR request completed message

In [None]:
conversations_with_duplicate_ehr = cursor.execute("""
    select * from (
    select count(*) as count, conversation_id
    from gp2gp_messages
    where interaction_id='urn:nhs:names:services:gp2gp/RCMR_IN030000UK06'
    group by conversation_id) request_completed_message_per_conversation 
    where count > 1
""").df()
conversations_with_duplicate_ehr

In [None]:
# Surplus EHR request completed messages: total messages minus one per affected conversation.
conversations_with_duplicate_ehr['count'].sum() - len(conversations_with_duplicate_ehr)
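
The subtraction above counts surplus messages: every affected conversation is expected to keep exactly one `EHR request completed` message, so the surplus is the total number of messages minus the number of conversations. A small worked example with made-up counts:

```python
# Hypothetical number of EHR request completed messages per conversation.
message_counts = {"conv-1": 3, "conv-2": 2, "conv-3": 1}

# Only conversations with more than one message are affected.
duplicated = {conv: n for conv, n in message_counts.items() if n > 1}

# (3 + 2) messages across 2 conversations leaves 3 surplus messages.
surplus_messages = sum(duplicated.values()) - len(duplicated)
print(surplus_messages)
```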

In [None]:
cursor.execute("""
    create or replace view ehr_request_completed_messages as 
        select * from gp2gp_messages
        where interaction_id='urn:nhs:names:services:gp2gp/RCMR_IN030000UK06'
""")

In [None]:
cursor.execute("""
create or replace view unique_ehr_request_completed_messages as 
select a.* from ehr_request_completed_messages a
inner join (
    select conversation_id, max(time) as time
    from ehr_request_completed_messages
    group by conversation_id
) b on a.conversation_id = b.conversation_id and a.time = b.time""")
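
One property of joining on `max(time)` is worth keeping in mind: if two `EHR request completed` messages in a conversation share the same latest timestamp, both are kept. The sketch below reproduces the same join on made-up rows, using `sqlite3` from the standard library instead of DuckDB so that it runs without extra dependencies. Whether such ties occur in the real data is not established here.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    create table ehr_request_completed_messages (
        conversation_id text, internal_id text, time text
    );
    insert into ehr_request_completed_messages values
        ('conv-1', 'msg-a', '2020-12-21 10:00:00'),
        ('conv-1', 'msg-b', '2020-12-21 10:05:00'),
        ('conv-2', 'msg-c', '2020-12-22 09:00:00'),
        ('conv-2', 'msg-d', '2020-12-22 09:00:00');
""")

# Same "latest message per conversation" join as in the view above.
rows = connection.execute("""
    select a.conversation_id, a.internal_id
    from ehr_request_completed_messages a
    inner join (
        select conversation_id, max(time) as time
        from ehr_request_completed_messages
        group by conversation_id
    ) b on a.conversation_id = b.conversation_id and a.time = b.time
    order by a.conversation_id, a.internal_id
""").fetchall()

# conv-1 keeps only its latest message; conv-2 keeps both tied messages.
print(rows)
```

If ties turn out to exist, attachments from both tied messages would be counted, so a tie-breaker (for example on `internal_id`) may be needed.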

In [None]:
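# Keep only attachments logged against the latest EHR request completed message
# of each conversation. "create or replace" lets the cell be re-run safely.
cursor.execute("""
    create or replace table ehr_attachments as
    select a.* from attachment_metadata a
    join unique_ehr_request_completed_messages b
    on a.internal_id = b.internal_id
""")

cursor.execute("select * from ehr_attachments").df()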