# PRMT-1724 Attachment data for NME (New Market Entrant)

### Context

NME have requested some insights into attachment data. They want to know how many attachments exceed certain sizes, so that they can plan for the sizes and volumes they will need to handle.

They would like to know:
- total number of attachments
- how many attachments are over 5 MB (volume and %)
- how many attachments are over 20 MB (volume and %)

The underlying data needs to be de-duplicated first. The approach is explored in notebook `10-PRMT-1528`.

### Requirements

To replicate this notebook, perform the following steps:

1. Log into Splunk and run the following query for each of these time ranges, exporting each result as a CSV:
- 01/01/2021 00:00:00 to 31/01/2021 24:00:00, exported as `1-2021-attachment-metadata.csv`.
- 01/02/2021 00:00:00 to 28/02/2021 24:00:00, exported as `2-2021-attachment-metadata.csv`.
- 01/03/2021 00:00:00 to 17/03/2021 24:00:00, exported as `3-2021-partial-attachment-metadata.csv`.

```
index="spine2vfmmonitor" logReference=MPS0208
| table _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64, internalID
```

2. Run the following Splunk query for the same time ranges. Export the results as CSVs named `1-2021-gp2gp-messages.csv`, `2-2021-gp2gp-messages.csv` and `3-2021-partial-gp2gp-messages.csv`.

```
index="spine2vfmmonitor" service="gp2gp" logReference="MPS0053c"
| table _time, conversationID, internalID, interactionID
```
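The loading cells further down read `.csv.gz` files, whereas the Splunk exports above are plain `.csv`. A minimal sketch of the compression step, assuming the exports sit in a local directory; the `compress_csv` helper is hypothetical and the upload to S3 is not shown:

```python
import gzip
import shutil
from pathlib import Path


def compress_csv(csv_path):
    """Gzip a CSV next to the original and return the new path (hypothetical helper)."""
    src = Path(csv_path)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes across so large exports are never held fully in memory.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Calling `compress_csv("1-2021-attachment-metadata.csv")` would produce `1-2021-attachment-metadata.csv.gz` alongside the original file.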

In [1]:
import pandas as pd
import numpy as np

In [2]:
attachments_metadata_prefix = "s3://<bucket-name>"
attachment_files = [
    "1-2021-attachment-metadata.csv.gz",
    "2-2021-attachment-metadata.csv.gz",
    "3-2021-partial-attachment-metadata.csv.gz"
]
attachment_input_files = [attachments_metadata_prefix + f for f in attachment_files]

In [3]:
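def convert_to_int(val):
    # A missing length appears in the export as the string "Unknown"; map it to NaN.
    if val == "Unknown":
        return np.nan
    else:
        # Use the built-in int: the np.int alias was removed in NumPy 1.24.
        return int(val)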

attachments = pd.concat((
    pd.read_csv(f, converters={"Length": convert_to_int}, parse_dates=["_time"])
    for f in attachment_input_files
))
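To illustrate how the `Length` converter behaves, here is a small self-contained sketch that parses an in-memory CSV. The rows are made up for illustration and are not real Splunk output:

```python
import io

import numpy as np
import pandas as pd


def convert_to_int(val):
    # Same rule as the notebook cell: "Unknown" becomes NaN, anything else is parsed as an int.
    return np.nan if val == "Unknown" else int(val)


# Made-up rows for illustration only.
sample_csv = io.StringIO(
    "_time,attachmentID,Length\n"
    "2021-01-01T00:00:01,a1,1048576\n"
    "2021-01-01T00:00:02,a2,Unknown\n"
)
sample = pd.read_csv(sample_csv, converters={"Length": convert_to_int}, parse_dates=["_time"])

# The "Unknown" row ends up as a missing value rather than raising a ValueError.
print(sample["Length"].isna().tolist())
```

Handling `"Unknown"` in a converter keeps the column numeric, so the size comparisons later in the notebook work without further cleaning.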

In [4]:
gp2gp_messages_prefix = "s3://<bucket-name>"
gp2gp_messages_files = [
    "1-2021-gp2gp-messages.csv.gz",
    "2-2021-gp2gp-messages.csv.gz",
    "3-2021-partial-gp2gp-messages.csv.gz"
]
gp2gp_messages_input_files = [gp2gp_messages_prefix + f for f in gp2gp_messages_files]

In [5]:
gp2gp_messages = pd.concat((
    pd.read_csv(f, parse_dates=["_time"])
    for f in gp2gp_messages_input_files
))

## Deduplicate Attachment data

In [6]:
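# Keep only the messages with interaction ID RCMR_IN030000UK06.
ehr_request_completed_messages = gp2gp_messages[gp2gp_messages["interactionID"] == "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06"]

# Where a conversation has several such messages, keep only the most recent one.
unique_ehr_request_completed_messages = ehr_request_completed_messages.sort_values(by="_time").drop_duplicates(subset=["conversationID"], keep="last")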

In [7]:
ehr_attachments = pd.merge(attachments, unique_ehr_request_completed_messages[["internalID", "interactionID"]], on="internalID", how="inner")
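The de-duplication above can be sketched on toy data: keep the latest message per conversation, then inner-join attachments on `internalID` so that attachments belonging to superseded messages drop out. All rows and the `toy_` names below are made up for illustration:

```python
import pandas as pd

# Made-up rows for illustration only: conversation c1 has two messages.
toy_messages = pd.DataFrame({
    "_time": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 10:05", "2021-01-01 11:00"]),
    "conversationID": ["c1", "c1", "c2"],
    "internalID": ["m1", "m2", "m3"],
})
toy_attachments = pd.DataFrame({
    "attachmentID": ["a1", "a2", "a3"],
    "internalID": ["m1", "m2", "m3"],
})

# Keep only the most recent message per conversation (m2 for c1, m3 for c2).
latest = toy_messages.sort_values(by="_time").drop_duplicates(subset=["conversationID"], keep="last")

# The inner join discards a1, which belongs to the superseded message m1.
kept = pd.merge(toy_attachments, latest[["internalID"]], on="internalID", how="inner")
print(sorted(kept["attachmentID"]))
```

Note that the join only de-duplicates correctly if each `internalID` appears once in the right-hand frame; otherwise the merge itself would multiply rows.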

## Attachment sizes

In [8]:
ehr_attachments_with_size_in_mb = ehr_attachments.assign(LengthInMB=lambda x: x["Length"] / (1024 * 1024))

In [9]:
# The buckets do not overlap: "over 5 MB" counts sizes from 5 MB up to, but not including, 20 MB.
attachments_over_5_mb = np.sum((ehr_attachments_with_size_in_mb["LengthInMB"] >= 5) & (ehr_attachments_with_size_in_mb["LengthInMB"] < 20))
attachments_over_20_mb = np.sum(ehr_attachments_with_size_in_mb["LengthInMB"] >= 20)
attachments_under_5_mb = np.sum(ehr_attachments_with_size_in_mb["LengthInMB"] < 5)
# Comparisons against NaN are False, so unknown sizes fall in none of the buckets above.
attachments_size_unknown = np.sum(ehr_attachments_with_size_in_mb["LengthInMB"].isnull())

attachment_sizes = pd.DataFrame([[attachments_over_5_mb, attachments_over_20_mb, attachments_under_5_mb, attachments_size_unknown]],
                  columns=['over 5 MB', 'over 20 MB', 'under 5 MB', 'Unknown'])
attachment_sizes['Total'] = attachment_sizes.sum(axis=1)
attachment_sizes

| | over 5 MB | over 20 MB | under 5 MB | Unknown | Total |
|---|---|---|---|---|---|
| 0 | 161844 | 28038 | 19404108 | 139 | 19594129 |
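The `over 5 MB` count in the cell above is computed as 5 MB ≤ size < 20 MB, so the four buckets do not overlap and their sum equals the number of rows. A minimal sketch on made-up sizes shows this, including how unknown (NaN) sizes stay out of the size buckets; the bucket labels are illustrative and differ from the notebook's column names:

```python
import numpy as np
import pandas as pd

# Made-up sizes in MB for illustration only; NaN stands for an "Unknown" length.
size_mb = pd.Series([0.5, 5.0, 19.9, 20.0, 64.0, np.nan])

counts = {
    "5 MB to under 20 MB": int(((size_mb >= 5) & (size_mb < 20)).sum()),
    "20 MB and over": int((size_mb >= 20).sum()),
    "under 5 MB": int((size_mb < 5).sum()),
    "Unknown": int(size_mb.isnull().sum()),
}

# Comparisons against NaN evaluate to False, so the unknown row lands in no size
# bucket and the four counts add up to the number of rows.
assert sum(counts.values()) == len(size_mb)
print(counts)
```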


In [17]:
attachment_size_percentages = attachment_sizes.iloc[:, 0:4].apply(lambda x: x / attachment_sizes.iloc[:, 4] * 100)

attachment_size_percentages = attachment_size_percentages.add_suffix(' (%)')

attachment_size_percentages.round(2)

| | over 5 MB (%) | over 20 MB (%) | under 5 MB (%) | Unknown (%) |
|---|---|---|---|---|
| 0 | 0.83 | 0.14 | 99.03 | 0.0 |
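Because the `over 5 MB` column excludes attachments of 20 MB and above, the cumulative figure for everything at or above 5 MB is the sum of the two columns. A short sketch of that arithmetic, using the counts from the `attachment_sizes` output earlier in this notebook (the variable names are illustrative):

```python
# Counts copied from the attachment_sizes output earlier in the notebook.
between_5_and_20_mb = 161844
over_20_mb = 28038
total = 19594129

# "Over 5 MB" in the cumulative sense also includes the attachments of 20 MB and above.
over_5_mb_cumulative = between_5_and_20_mb + over_20_mb
over_5_mb_cumulative_pct = round(over_5_mb_cumulative / total * 100, 2)

print(over_5_mb_cumulative, over_5_mb_cumulative_pct)  # 189882 0.97
```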
