# PRMT-2351 Look at attachment sizes for TPP-EMIS transfers

## Context

The largest contributor to technical failures is TPP-EMIS general failures (error code 30), which have been increasing over time. These failures were thought to be caused by a 60MB attachment limit, which has since been raised to 100MB.

The fix was rolled out in June and seems to have reduced errors when TPP is receiving the transfer. However, errors appear to have increased when TPP is sending the transfer.

Previously we sent TPP a sample of transfers that failed with error 30, and TPP responded that the attachments on those transfers exceeded a total limit of 500MB. We believe this 500MB limit is causing some of these failures.

## Scope

**Hypothesis:** We believe that there is a total attachment size limit of 500MB on transfers involving TPP as either the sender or the requester.

We will know this to be true if we cannot find any transfers with attachments over 500MB in total.

If possible, find example conversation IDs to send to TPP where there are attachments, and the combined attachment size is greater than 500MB. 

In [1]:
import pandas as pd
import numpy as np
import datetime

## Load attachment data

1. Log into Splunk and run the following query for each date range below (using Date Range). Export each result as a CSV, gzip it, and give it the name shown:
- 01/06/2021 00:00:00 to 30/06/2021 24:00:00: `6-2021-attachment-metadata.csv.gz`
- 01/07/2021 00:00:00 to 31/07/2021 24:00:00: `7-2021-attachment-metadata.csv.gz`
- 01/08/2021 00:00:00 to 31/08/2021 24:00:00: `8-2021-attachment-metadata.csv.gz`

index="spine2vfmmonitor" logReference=MPS0208
| table _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64, internalID, DomainData

2. Run the following Splunk query for the same time ranges. Export the results as CSVs named `6-2021-ehr-request-completed-messages.csv`, `7-2021-ehr-request-completed-messages.csv` and `8-2021-ehr-request-completed-messages.csv`, then gzip each file.

index="spine2vfmmonitor" service="gp2gp" logReference="MPS0053d" interactionID="urn:nhs:names:services:gp2gp/RCMR_IN030000UK06"
| table _time, conversationID, internalID, interactionID
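
The gzip step for the exported CSVs can be done from Python as well as from the command line. The following is a minimal sketch using only the standard library; the helper name `gzip_csv` is hypothetical, and the demo file contains a made-up header row rather than a real Splunk export.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_csv(csv_path: Path) -> Path:
    """Compress csv_path to '<name>.gz' next to it and return the new path."""
    gz_path = csv_path.with_name(csv_path.name + ".gz")
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Demonstrate on a throwaway file; the content here is made up.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "6-2021-ehr-request-completed-messages.csv"
    export.write_text("_time,conversationID,internalID,interactionID\n")
    compressed = gzip_csv(export)
    compressed_name = compressed.name
    # Read the compressed file back to confirm the round trip.
    with gzip.open(compressed, "rt") as f:
        round_trip = f.read()
```

The resulting `.csv.gz` files can be read directly by `pd.read_csv`, which infers gzip compression from the file extension.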

In [2]:
data_folder="s3://prm-gp2gp-notebook-data-prod/PRMT-2351-supplier-attachments/"
attachment_metadata_files = [
    "6-2021-attachment-metadata.csv.gz",
    "7-2021-attachment-metadata.csv.gz",
    "8-2021-attachment-metadata.csv.gz"
]

attachments = pd.concat([pd.read_csv(data_folder+file, parse_dates=["_time"], na_values=["Unknown"], dtype={"Length": pd.Int64Dtype()}) for file in attachment_metadata_files])

In [3]:
non_core_attachments_bool=attachments['DomainData']!='X-GP2GP-Skeleton: Yes'
attachments_real=attachments.loc[non_core_attachments_bool]

In [4]:
gp2gp_messages_files = [
    "6-2021-ehr-request-completed-messages.csv.gz",
    "7-2021-ehr-request-completed-messages.csv.gz",
    "8-2021-ehr-request-completed-messages.csv.gz"
]
ehr_request_completed_messages = pd.concat([pd.read_csv(data_folder+file, parse_dates=["_time"]) for file in gp2gp_messages_files])

## Deduplicate attachment data

In [5]:
unique_ehr_request_completed_messages = (
    ehr_request_completed_messages
        .sort_values(by="_time")
        .drop_duplicates(subset=["conversationID"], keep="last")
)
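
To illustrate why the sort comes before the deduplication, here is a small sketch on made-up rows (the IDs and timestamps are invented for illustration). With `keep="last"`, the row that survives for each `conversationID` is whichever appears last, so sorting by `_time` first means the most recent message is the one kept.

```python
import pandas as pd

# Toy data, deliberately out of time order; values are invented.
messages = pd.DataFrame({
    "_time": pd.to_datetime([
        "2021-06-01 10:05:00",
        "2021-06-02 09:00:00",
        "2021-06-01 10:00:00",
    ]),
    "conversationID": ["A", "B", "A"],
    "internalID": ["id-2", "id-3", "id-1"],
})

# Sort by time, then keep the last (most recent) row per conversation.
latest = (
    messages
    .sort_values(by="_time")
    .drop_duplicates(subset=["conversationID"], keep="last")
)

# Without the sort, the surviving row for conversation "A" would be
# the earlier message, because it happens to appear last in the frame.
unsorted_latest = messages.drop_duplicates(subset=["conversationID"], keep="last")
```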

In [6]:
ehr_attachments = pd.merge(attachments_real, unique_ehr_request_completed_messages[["internalID", "interactionID"]], on="internalID", how="inner")

## Find total average size of attachments for a transfer with TPP as Sender

In [7]:
# Selecting TPP as sender
from_tpp_bool = ehr_attachments["FromSystem"]=="SystmOne"
tpp_sender_attachments = ehr_attachments[from_tpp_bool]

In [8]:
# Total attachments size per conversation
attachments_grouped_by_conversation = tpp_sender_attachments.groupby(by="conversationID").agg({"Length": "sum"})
attachments_grouped_by_conversation["Length in Mb"] = attachments_grouped_by_conversation["Length"].fillna(0)/(1024**2)
attachments_grouped_by_conversation["Length in Mb"].describe()

count    99924.000000
mean        24.965082
std         26.530671
min          0.000801
25%          7.497465
50%         15.894673
75%         32.552813
max        344.731819
Name: Length in Mb, dtype: float64
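
Note that dividing by `1024**2` gives mebibytes (MiB), even though the column is labelled "Mb". Whether TPP's 500MB limit is defined in decimal or binary units is not stated in this analysis, so the sketch below simply shows that the two readings differ. The byte total used is an invented illustrative value, not one from the dataset.

```python
BYTES_PER_MIB = 1024 ** 2  # binary megabyte, as used in this notebook
BYTES_PER_MB = 1000 ** 2   # decimal megabyte


def bytes_to_mib(n_bytes: int) -> float:
    return n_bytes / BYTES_PER_MIB


def bytes_to_mb(n_bytes: int) -> float:
    return n_bytes / BYTES_PER_MB


# A 500 "MB" limit expressed in bytes under each interpretation.
limit_binary_bytes = 500 * BYTES_PER_MIB   # 524,288,000 bytes
limit_decimal_bytes = 500 * BYTES_PER_MB   # 500,000,000 bytes

# Illustrative total that falls between the two thresholds.
total_bytes = 510_000_000
over_decimal_limit = bytes_to_mb(total_bytes) > 500
over_binary_limit = bytes_to_mib(total_bytes) > 500
```

A transfer in this gap would count as over the limit under the decimal reading but under it with the `1024**2` conversion, so totals close to 500 are worth checking under both.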

## Find total average size of attachments for a transfer with TPP as Requester

In [9]:
# Selecting TPP as requester
to_tpp_bool = ehr_attachments["ToSystem"]=="SystmOne"
requester_tpp_attachments = ehr_attachments[to_tpp_bool]

In [10]:
# Total attachments size per conversation
tpp_requester_attachments_grouped_by_conversation = requester_tpp_attachments.groupby(by=["conversationID","FromSystem", "ToSystem" ]).agg({"Length": "sum"})
tpp_requester_attachments_grouped_by_conversation["Length in Mb"] = tpp_requester_attachments_grouped_by_conversation["Length"].fillna(0)/(1024**2)
tpp_requester_attachments_grouped_by_conversation["Length in Mb"].describe()

count    70063.000000
mean        29.150257
std         55.622234
min          0.000000
25%          5.268707
50%          9.823219
75%         24.895061
max        990.651524
Name: Length in Mb, dtype: float64

In [11]:
# Identify conversations with more than 500MB in attachments
attachment_500_bool = tpp_requester_attachments_grouped_by_conversation["Length in Mb"] > 500
attachments_over_500 = tpp_requester_attachments_grouped_by_conversation[attachment_500_bool]
attachments_over_500

| conversationID | FromSystem | ToSystem | Length | Length in Mb |
| --- | --- | --- | --- | --- |
| 12790C40-D58D-11EB-931B-E5D9FACE3C9D | EMIS Web | SystmOne | 669349420 | 638.341351 |
| 376CC740-C77C-11EB-AE43-8D0EC28DC28C | EMIS Web | SystmOne | 763669324 | 728.291821 |
| 5F1A7760-FA80-11EB-92F5-87B534863116 | EMIS Web | SystmOne | 645235140 | 615.344181 |
| 631BF210-F393-11EB-B9DD-A7D5327E348A | EMIS Web | SystmOne | 536615492 | 511.756413 |
| 834C2C90-DE66-11EB-A3EB-4BD788CC54C7 | EMIS Web | SystmOne | 827807876 | 789.45911 |
| ADAE08D0-C77F-11EB-AE43-8D0EC28DC28C | EMIS Web | SystmOne | 698180728 | 665.837029 |
| AFB58770-EA43-11EB-BD9C-0B0CFF304BAB | EMIS Web | SystmOne | 590789720 | 563.420982 |
| CBAF6430-DB09-11EB-BDB3-69270508194F | EMIS Web | SystmOne | 1038773412 | 990.651524 |
| F8A33B10-E973-11EB-8A2C-5FAA8C650CA1 | EMIS Web | SystmOne | 556104372 | 530.342457 |
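
The per-conversation total and threshold filter can be tried out on a small made-up frame. This is a sketch only: the conversation IDs and sizes are invented, and the 500 threshold is applied to values converted with `1024**2`, matching the notebook's convention.

```python
import pandas as pd

MIB = 1024 ** 2

# Toy attachment rows; IDs and sizes are invented for illustration.
toy_attachments = pd.DataFrame({
    "conversationID": ["A", "A", "B", "C"],
    "Length": [300 * MIB, 250 * MIB, 10 * MIB, 499 * MIB],
})

# Sum attachment bytes per conversation, then convert to MiB.
totals_mib = toy_attachments.groupby("conversationID")["Length"].sum() / MIB

# Conversation IDs whose combined attachments exceed the threshold.
over_limit_ids = totals_mib[totals_mib > 500].index.tolist()
```

Only conversation "A" (550 in total) is selected; "C" stays under the threshold at 499. A list like `over_limit_ids` is the kind of output that could be shared with TPP as example conversation IDs.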


### Read in transfers to be able to check outcomes of these transfers

In [12]:
transfer_files = [
    "s3://prm-gp2gp-transfer-data-prod/v4/2021/6/transfers.parquet",
    "s3://prm-gp2gp-transfer-data-prod/v4/2021/7/transfers.parquet",
    "s3://prm-gp2gp-notebook-data-prod/PRMT-2355-half-august-data-with-14-day-cutoff/transfers/v4/2021/8/transfers.parquet"
]
transfers_raw = pd.concat((
    pd.read_parquet(f)
    for f in transfer_files
))

transfers = transfers_raw.copy()

In [13]:
transfers.rename(columns=({'conversation_id' : 'conversationID' }), inplace=True)

In [14]:
ehr_attachments_over_500MB = pd.merge(attachments_over_500, transfers, on="conversationID", how="left")
ehr_attachments_over_500MB

| | conversationID | Length | Length in Mb | sla_duration | requesting_practice_asid | sending_practice_asid | requesting_supplier | sending_supplier | sender_error_codes | final_error_codes | intermediate_error_codes | status | failure_reason | date_requested | date_completed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12790C40-D58D-11EB-931B-E5D9FACE3C9D | 669349420 | 638.341351 | 5.0 | 341270550013 | 598420833017 | SystmOne | EMIS | [nan] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-06-25 08:12:25.239 | 2021-06-25 08:20:55.458 |
| 1 | 376CC740-C77C-11EB-AE43-8D0EC28DC28C | 763669324 | 728.291821 | 3.0 | 961329258019 | 200000015074 | SystmOne | EMIS | [nan] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-06-07 10:36:29.239 | 2021-06-07 10:56:19.849 |
| 2 | 5F1A7760-FA80-11EB-92F5-87B534863116 | 645235140 | 615.344181 | 2.0 | 230783848016 | 200000015074 | SystmOne | EMIS | [nan] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-08-11 08:44:43.116 | 2021-08-11 08:54:25.112 |
| 3 | 631BF210-F393-11EB-B9DD-A7D5327E348A | 536615492 | 511.756413 | 3.0 | 830487381048 | 494422195011 | SystmOne | EMIS | [20.0] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-08-02 13:13:11.781 | 2021-08-02 13:24:00.125 |
| 4 | 834C2C90-DE66-11EB-A3EB-4BD788CC54C7 | 827807876 | 789.45911 | 5.0 | 481668188040 | 910792286049 | SystmOne | EMIS | [nan] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-07-06 14:29:04.045 | 2021-07-06 14:36:49.944 |
| 5 | ADAE08D0-C77F-11EB-AE43-8D0EC28DC28C | 698180728 | 665.837029 | 7.0 | 961329258019 | 200000015074 | SystmOne | EMIS | [nan] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-06-07 11:01:15.890 | 2021-06-07 11:11:03.367 |
| 6 | AFB58770-EA43-11EB-BD9C-0B0CFF304BAB | 590789720 | 563.420982 | 7.0 | 133421929042 | 342804386015 | SystmOne | EMIS | [20.0, 20.0] | [99.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-07-21 16:50:00.044 | 2021-07-26 10:51:45.002 |
| 7 | CBAF6430-DB09-11EB-BDB3-69270508194F | 1038773412 | 990.651524 | 2.0 | 200000009097 | 200000015415 | SystmOne | EMIS | [20.0] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-07-02 07:47:49.321 | 2021-07-02 08:02:18.873 |
| 8 | F8A33B10-E973-11EB-8A2C-5FAA8C650CA1 | 556104372 | 530.342457 | 1.0 | 200000000326 | 200000015334 | SystmOne | EMIS | [20.0] | [30.0] | [] | TECHNICAL_FAILURE | Final Error | 2021-07-20 16:03:08.131 | 2021-07-20 16:14:02.430 |
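
One way to summarise the outcomes of a merge like this, and to spot any conversations that have no matching transfer record, is sketched below on invented data. The frames, IDs, sizes and the `INTEGRATED` status value are all hypothetical; only `merge(..., how="left", indicator=True)` and `value_counts` are standard pandas calls.

```python
import pandas as pd

# Hypothetical over-limit conversations (sizes invented).
over_limit = pd.DataFrame({
    "conversationID": ["A", "B", "C"],
    "Length in Mb": [640.0, 730.0, 512.0],
})

# Hypothetical transfer outcomes; "C" has no record, "D" is unrelated.
toy_transfers = pd.DataFrame({
    "conversationID": ["A", "B", "D"],
    "status": ["TECHNICAL_FAILURE", "TECHNICAL_FAILURE", "INTEGRATED"],
})

# indicator=True adds a "_merge" column recording where each row matched.
merged = over_limit.merge(
    toy_transfers, on="conversationID", how="left", indicator=True
)

# Count outcomes, keeping NaN so unmatched conversations stay visible.
status_counts = merged["status"].value_counts(dropna=False)

# Conversations present in the attachment data but absent from transfers.
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "conversationID"].tolist()
failure_count = int(merged["status"].eq("TECHNICAL_FAILURE").sum())
```

A left merge keeps every over-limit conversation even when no transfer record exists, so checking `_merge` avoids mistaking a missing record for a missing outcome.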
