# PRMT-1862 Attachment Size Analysis for NME (New Market Entrant)

### Context

The NME/GPC team need to understand the size of the attachments being transferred via GP2GP, in order to inform their decisions for GP Connect and reduce the need to chunk attachments. They would like to know:
- The max file size that can be transferred without the request timing out.

There is some effort required to de-duplicate the underlying data. This is explored in notebooks `10-PRMT-1528` and `PRMT-1724`. The date range used was 1 January 2021 00:00:00 to 31 March 2021 24:00:00.

### Requirements

In order to replicate this notebook, perform the following steps:

1. Log into Splunk and run the attachment metadata query below for each of the following date ranges:
- 01/01/2021 00:00:00 to 17/01/2021 24:00:00 (using Date Range) and export the result as a CSV named `1-jan-17-jan-2021-attachment-data`
- 18/01/2021 00:00:00 to 31/01/2021 24:00:00 (using Date Range) and export the result as a CSV named `18-jan-31-jan-2021-attachment-data`
- 01/02/2021 00:00:00 to 14/02/2021 24:00:00 (using Date Range) and export the result as a CSV named `1-feb-14-feb-2021-attachment-data`
- 15/02/2021 00:00:00 to 28/02/2021 24:00:00 (using Date Range) and export the result as a CSV named `15-feb-28-feb-2021-attachment-data`
- 01/03/2021 00:00:00 to 14/03/2021 24:00:00 (using Date Range) and export the result as a CSV named `1-mar-14-mar-2021-attachment-data`
- 15/03/2021 00:00:00 to 21/03/2021 24:00:00 (using Date Range) and export the result as a CSV named `15-mar-21-mar-2021-attachment-data`
- 22/03/2021 00:00:00 to 31/03/2021 24:00:00 (using Date Range) and export the result as a CSV named `22-mar-31-mar-2021-attachment-data`


Splunk Query for all attachment metadata:
```
index="spine2vfmmonitor" logReference=MPS0208
| table  *
```

2. Run the GP2GP messages query below for each of the following monthly date ranges:
- 01/01/2021 00:00:00 to 31/01/2021 24:00:00 (using Date Range) and export the result as `1-2021-gp2gp-messages.csv`
- 01/02/2021 00:00:00 to 28/02/2021 24:00:00 (using Date Range) and export the result as `2-2021-gp2gp-messages.csv`
- 01/03/2021 00:00:00 to 31/03/2021 24:00:00 (using Date Range) and export the result as `3-2021-gp2gp-messages.csv`

Splunk query for GP2GP messages:
```
index="spine2vfmmonitor" service="gp2gp" logReference="MPS0053c"
| table _time, conversationID, internalID, interactionID
```
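
The cells below read files ending in `.csv.gz`, whereas the steps above export plain CSV files. A minimal sketch of one way to compress an export before uploading it, assuming the files are gzip-compressed locally (the helper name `gzip_csv` is hypothetical, and the upload to S3 is not shown):

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_csv(csv_path):
    """Write a gzip-compressed copy of csv_path next to it and return the new path."""
    csv_path = Path(csv_path)
    gz_path = csv_path.with_name(csv_path.name + ".gz")
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Demonstration on a throwaway file that stands in for a Splunk export.
with tempfile.TemporaryDirectory() as tmp_dir:
    export = Path(tmp_dir) / "1-2021-gp2gp-messages.csv"
    export.write_text("_time,conversationID,internalID,interactionID\n")
    compressed = gzip_csv(export)
    compressed_name = compressed.name
    # Read the compressed copy back to confirm the content survived.
    with gzip.open(compressed, "rt") as f:
        round_trip = f.read()
```

pandas infers gzip compression from the `.gz` extension by default, so `pd.read_csv` can read the compressed files directly.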

In [1]:
import pandas as pd
import numpy as np

In [2]:
attachments_metadata_prefix = "s3://prm-gp2gp-data-sandbox-dev/attachment-insights/attachments-metadata--all-fields/"
attachment_files = [
    "1-jan-17-jan-2021-attachment-data.csv.gz",
    "18-jan-31-jan-2021-attachment-data.csv.gz",
    "1-feb-14-feb-2021-attachment-data.csv.gz",
    "15-feb-28-feb-2021-attachment-data.csv.gz",
    "1-mar-14-mar-2021-attachment-data.csv.gz",
    "15-mar-21-mar-2021-attachment-data.csv.gz",
    "22-mar-31-mar-2021-attachment-data.csv.gz"
]
attachment_input_files = [attachments_metadata_prefix + f for f in attachment_files]

In [None]:
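def convert_to_int(val):
    # Splunk reports attachments of unknown size as the string "Unknown".
    # np.NaN and np.int were removed from recent NumPy releases, so use np.nan and the built-in int.
    if val == "Unknown":
        return np.nan
    else:
        return int(val)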

attachments = pd.concat((
    pd.read_csv(f, converters={"Length": convert_to_int}, parse_dates=["_time"])
    for f in attachment_input_files
))
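
As an alternative to a custom converter, the `na_values` parameter of `pd.read_csv` can mark `"Unknown"` as missing while the rest of the column is parsed as numbers. The following is a sketch on made-up rows, not the approach this notebook uses. The sample values are hypothetical, and the resulting column is floating point rather than integer because it contains a missing value.

```python
import io

import pandas as pd

# Hypothetical sample in the shape of the Splunk export (only two columns shown).
sample_csv = io.StringIO(
    "internalID,Length\n"
    "abc,1048576\n"
    "def,Unknown\n"
    "ghi,20971520\n"
)

# Treat "Unknown" in the Length column as a missing value.
sample = pd.read_csv(sample_csv, na_values={"Length": ["Unknown"]})

unknown_count = int(sample["Length"].isna().sum())
```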

In [None]:
gp2gp_messages_prefix = "s3://prm-gp2gp-data-sandbox-dev/attachment-insights/gp2gp-messages/"
gp2gp_messages_files = [
    "1-2021-gp2gp-messages.csv.gz",
    "2-2021-gp2gp-messages.csv.gz",
    "3-2021-gp2gp-messages.csv.gz"
]
gp2gp_messages_input_files = [gp2gp_messages_prefix + f for f in gp2gp_messages_files]

In [None]:
gp2gp_messages = pd.concat((
    pd.read_csv(f, parse_dates=["_time"])
    for f in gp2gp_messages_input_files
))

## Deduplicate Attachment data

In [None]:
ehr_request_completed_messages = gp2gp_messages[gp2gp_messages["interactionID"] == "urn:nhs:names:services:gp2gp/RCMR_IN030000UK06"]

unique_ehr_request_completed_messages = ehr_request_completed_messages.sort_values(by="_time").drop_duplicates(subset=["conversationID"], keep="last")

In [None]:
ehr_attachments = pd.merge(attachments, unique_ehr_request_completed_messages[["internalID", "interactionID"]], on="internalID", how="inner")
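
The two cells above keep only the most recent EHR request completed message per conversation, then keep only the attachments whose `internalID` matches one of those messages. A small sketch of the same two steps on made-up data (all IDs, timestamps, and lengths below are hypothetical):

```python
import pandas as pd

# Conversation c1 has two messages; only the later one (i2) should survive.
messages = pd.DataFrame({
    "_time": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 11:00", "2021-01-02 09:00"]),
    "conversationID": ["c1", "c1", "c2"],
    "internalID": ["i1", "i2", "i3"],
})
latest = messages.sort_values(by="_time").drop_duplicates(subset=["conversationID"], keep="last")

# Attachments for i1 (superseded message) and i9 (no matching message) are dropped by the inner join.
attachments_sample = pd.DataFrame({
    "internalID": ["i1", "i2", "i2", "i3", "i9"],
    "Length": [100, 200, 300, 400, 500],
})
joined = pd.merge(attachments_sample, latest[["internalID"]], on="internalID", how="inner")
```

Because `latest` holds one row per conversation, the inner join cannot multiply attachment rows unless two conversations share an `internalID`.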

## Attachment sizes

In [None]:
# Length is in bytes; 1 MB is taken here as 1024 * 1024 bytes.
ehr_attachments_with_size_in_mb = ehr_attachments.assign(LengthInMB=lambda x: x["Length"] / (1024 * 1024))

In [None]:
# Count attachments in each size band. Comparisons against NaN are False,
# so attachments of unknown size are only counted in the last bucket.
attachments_5_to_20_mb = np.sum((ehr_attachments_with_size_in_mb["LengthInMB"] >= 5) & (ehr_attachments_with_size_in_mb["LengthInMB"] < 20))
attachments_over_20_mb = np.sum(ehr_attachments_with_size_in_mb["LengthInMB"] >= 20)
attachments_under_5_mb = np.sum(ehr_attachments_with_size_in_mb["LengthInMB"] < 5)
attachments_size_unknown = np.sum(ehr_attachments_with_size_in_mb["LengthInMB"].isnull())

attachment_sizes = pd.DataFrame([[attachments_5_to_20_mb, attachments_over_20_mb, attachments_under_5_mb, attachments_size_unknown]],
                  columns=['5 MB to under 20 MB', '20 MB and over', 'under 5 MB', 'Unknown'])
attachment_sizes['Total'] = attachment_sizes.sum(axis=1)
attachment_sizes
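
The same size bands could also be expressed with `pd.cut`, which keeps the band boundaries in one place. This is a sketch on made-up sizes rather than the notebook's data, and the labels are illustrative. With `right=False` each band includes its lower bound and excludes its upper bound, matching the `>=` and `<` comparisons above.

```python
import numpy as np
import pandas as pd

# Hypothetical attachment sizes in MB, including one unknown size.
length_in_mb = pd.Series([0.5, 4.9, 5.0, 19.9, 20.0, 150.0, np.nan])

# Bands: [0, 5), [5, 20), [20, inf). Missing sizes stay missing.
bands = pd.cut(
    length_in_mb,
    bins=[0, 5, 20, np.inf],
    right=False,
    labels=["under 5 MB", "5 MB to under 20 MB", "20 MB and over"],
)

band_counts = bands.value_counts()       # counts per band, excluding unknown sizes
unknown_count = int(bands.isna().sum())  # attachments with no recorded size
```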

In [None]:
attachment_size_percentages = attachment_sizes.iloc[:, 0:4].apply(lambda x: x / attachment_sizes.iloc[:, 4] * 100)

attachment_size_percentages = attachment_size_percentages.add_suffix(' (%)')

attachment_size_percentages.round(2)
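
The percentage calculation divides each count by the row total. A sketch of the same arithmetic using `DataFrame.div`, on hypothetical counts chosen so the result is easy to check by hand:

```python
import pandas as pd

# Hypothetical counts per size band; they sum to 8.
counts = pd.DataFrame(
    [[2, 1, 4, 1]],
    columns=["5 MB to under 20 MB", "20 MB and over", "under 5 MB", "Unknown"],
)
total = counts.sum(axis=1)

# Divide each row by its total (aligned on the row index), then scale to percent.
percentages = counts.div(total, axis=0).mul(100).add_suffix(" (%)").round(2)
```

Here 2 of 8 attachments fall in the first band, giving 25%, and the four percentages sum to 100.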