# PRMT-1181 Attachment Metadata Insights

### Context

PCSE want to understand the scope and size of the GP2GP fallback service, and in particular the types of attachments that are sent via GP2GP.

They would like to know:
- Average and maximum file sizes
- A graph showing the distribution of file sizes
- The file types that got as far as the transfer
- A graph of the number of attachments

There is some effort required to de-duplicate the underlying data. This is explored in notebook `10-PRMT-1528`.

### Requirements

In order to replicate this notebook, perform the following steps:

1. Log into Splunk and run the following query for the time frame 21/12/2020 00:00:00 to 03/01/2021 24:00:00. Export the result as a csv named `attachment_metadata.csv`.

```
index="spine2vfmmonitor" logReference=MPS0208
| table _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64, internalID
```

2. Run the following Splunk query for the same time range. Export the result as a csv named `gp2gp_messages.csv`.

```
index="spine2vfmmonitor" service="gp2gp" logReference="MPS0053c"
| table _time, conversationID, internalID, interactionID
```

3. Place both the csv files in a directory called `attachments`. Set the `INPUT_DATA_DIR` environment variable to point to this directory.

Example directory layout, where `INPUT_DATA_DIR` is `attachments`.
```
attachments/attachment_metadata.csv
attachments/gp2gp_messages.csv
```


In [None]:
import paths, os
import duckdb
from scripts.attachments import construct_attachments_db
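
Before building the database, it can help to confirm that the directory named by `INPUT_DATA_DIR` holds both exports described in the requirements. The following is a minimal sketch only: `missing_input_files` and `EXPECTED_FILES` are hypothetical names that are not part of this notebook or of `scripts.attachments`, and the demonstration runs against a temporary directory rather than real data.

```python
import os
import tempfile

# Hypothetical names for illustration; the file names come from the requirements above.
EXPECTED_FILES = ("attachment_metadata.csv", "gp2gp_messages.csv")


def missing_input_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


# Demonstration: a temporary directory that contains only one of the two exports.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "attachment_metadata.csv"), "w").close()
    demo_missing = missing_input_files(demo_dir)

print(demo_missing)
```

In the notebook itself, the same check could be run as `missing_input_files(os.environ["INPUT_DATA_DIR"])`, with an empty list meaning both files are in place.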

In [None]:
attachment_data_dir = os.environ["INPUT_DATA_DIR"]
cursor = duckdb.connect()
construct_attachments_db(cursor, attachment_data_dir)


In [None]:
cursor.execute("""
    create or replace view ehr_request_completed_messages as 
        select * from gp2gp_messages
        where interaction_id='urn:nhs:names:services:gp2gp/RCMR_IN030000UK06'
""")

In [None]:
cursor.execute("""
create table unique_ehr_request_completed_messages as 
select a.* from ehr_request_completed_messages a
inner join (
    select conversation_id, max(time) as time
    from ehr_request_completed_messages
    group by conversation_id
) b on a.conversation_id = b.conversation_id and a.time = b.time""")
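
The query above keeps, for each conversation, only the message with the latest timestamp. The sketch below shows the same join pattern on a tiny made-up table. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies; the table name `messages` and all row values are invented for illustration.

```python
import sqlite3

# In-memory stand-in database; the notebook itself uses DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table messages (conversation_id text, internal_id text, time text)"
)
conn.executemany(
    "insert into messages values (?, ?, ?)",
    [
        ("conv-a", "id-1", "2020-12-21 09:00:00"),  # older duplicate
        ("conv-a", "id-2", "2020-12-21 09:05:00"),  # latest for conv-a
        ("conv-b", "id-3", "2020-12-22 10:00:00"),  # only message for conv-b
    ],
)

# Join each row to the maximum time of its conversation, keeping only the match.
latest_messages = conn.execute("""
    select a.conversation_id, a.internal_id
    from messages a
    inner join (
        select conversation_id, max(time) as time
        from messages
        group by conversation_id
    ) b on a.conversation_id = b.conversation_id and a.time = b.time
    order by a.conversation_id
""").fetchall()

print(latest_messages)
```

One property of this pattern worth noting: if two messages in the same conversation share exactly the same maximum timestamp, both rows survive the join, so the result is not guaranteed to be unique per conversation in that case.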

In [None]:
cursor.execute("""
    create table ehr_attachments as 
    select a.* from attachment_metadata a
    join unique_ehr_request_completed_messages b 
    on a.internal_id = b.internal_id
""")


In [None]:
attachments = cursor.table("ehr_attachments")
ehr_requests = cursor.table("unique_ehr_request_completed_messages")

## Attachment types

In [None]:
(start_time, end_time) = attachments\
    .aggregate("MIN(time) as start_time, MAX(time) as end_time") \
    .execute().fetchone()

start_date = start_time.date()
end_date = end_time.date()

In [None]:
attachment_count = attachments.aggregate("COUNT(*)").df()
print(f"{attachment_count.iat[0, 0]} attachments in dataset ({start_date} to {end_date})")

### Number of attachments by file type

In [None]:
attachments_per_content_type = attachments.aggregate("content_type, count(*) as count").order("count DESC").df()
attachment_type_bars = attachments_per_content_type.plot.bar(
    x="content_type", y="count",
    title=f"Count of attachments by file type, {start_date} to {end_date}",
    rot=45,
    figsize=(16,8),
    legend=False,
)
attachment_type_bars.set(xlabel="Attachment file type", ylabel="Number of attachments")
attachment_type_bars.ticklabel_format(style='plain', axis='y')

In [None]:
attachments_per_content_type.set_index("content_type").style.set_caption(f"Count of attachments by file type, {start_date} to {end_date}")

### Number of transfers with attachment

In [None]:
transfer_count, with_attachment_count, without_attachment_count = cursor.execute("""
    select
        count(*) as transfers,
        sum(case when attachments.internal_id is not null then 1 else 0 end) as with_attachments,
        sum(case when attachments.internal_id is null then 1 else 0 end) as without_attachments
    from unique_ehr_request_completed_messages
    left join (select distinct internal_id from attachment_metadata) attachments
    on attachments.internal_id=unique_ehr_request_completed_messages.internal_id
""").fetchone()
percent_with_attachment = with_attachment_count/transfer_count*100
print(f"Out of {transfer_count} transfers made between {start_date} and {end_date}, {with_attachment_count} had at least one attachment ({percent_with_attachment:.1f}%).")
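
The percentage reported above is a plain ratio of two counts. A minimal sketch of that calculation with a guard for an empty dataset follows; `percent_of` is a hypothetical helper and the counts are made up for illustration, not taken from the real data.

```python
def percent_of(part, whole):
    """Return part as a percentage of whole, or 0.0 when whole is zero."""
    if whole == 0:
        return 0.0
    return part / whole * 100


# Made-up counts: 250 transfers with attachments out of 1000 transfers.
example_percent = percent_of(250, 1000)
print(f"{example_percent:.1f}%")
```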

### Number of attachments per transfer (excluding transfers with no attachments)

In [None]:
attachments_per_transfer = attachments.aggregate("conversation_id, count(*) as count").df()

In [None]:
attachments_per_transfer_truncated = attachments.aggregate("conversation_id, count(*) as count").filter("count <= 1000").df()

In [None]:
attachments_per_transfer_hist_truncated = attachments_per_transfer_truncated.plot.hist(
    title=f"Histogram of attachments per transfer (where transfer has 1000 or fewer attachments), {start_date} to {end_date}",
    legend=False,
    bins=150, figsize=(14,8)
)
attachments_per_transfer_hist = attachments_per_transfer.plot.hist(
    title=f"Histogram of attachments per transfer, {start_date} to {end_date}",
    legend=False,
    bins=150, figsize=(14,8)
)
attachments_per_transfer_hist_truncated.set(xlabel="Attachments per transfer", ylabel="Frequency")
attachments_per_transfer_hist_truncated.ticklabel_format(style='plain', axis='y')
attachments_per_transfer_hist.set(xlabel="Attachments per transfer", ylabel="Frequency")
attachments_per_transfer_hist.ticklabel_format(style='plain', axis='y')

In [None]:
attachments_per_transfer.describe().style.set_caption(f"Count of attachments per transfer, {start_date} to {end_date}")

## Size of attachments

In [None]:
attachment_lengths = attachments.project("content_type, length/(1024.0*1024.0) as megabytes").df()

In [None]:
attachment_lengths.describe().style.set_caption(f"Attachment file sizes (megabytes), {start_date} to {end_date}")

In [None]:
attachment_lengths_hist = attachment_lengths.plot.hist(
    title=f"Histogram of attachment size (logarithmic), {start_date} to {end_date}",
    legend=False,
    bins=100, log=True, figsize=(14,8)
)
attachment_lengths_hist.set(xlabel="Attachment size (MB)", ylabel="Frequency (logarithmic)")
attachment_lengths_hist.ticklabel_format(style='plain', axis='x', useOffset=False)

In [None]:
attachment_lengths_hist_lin = attachment_lengths.plot.hist(
    title=f"Histogram of attachment size (linear), {start_date} to {end_date}",
    legend=False,
    bins=100, log=False, figsize=(14,8)
)
attachment_lengths_hist_lin.set(xlabel="Attachment size (MB)", ylabel="Frequency (linear)")
attachment_lengths_hist_lin.ticklabel_format(style='plain', axis='y', useOffset=False)

### Attachment size by file type

In [None]:
attachment_lengths.groupby('content_type').describe().style.set_caption("Attachment file sizes by content type")


In [None]:
attachment_lengths_boxplot = attachment_lengths.boxplot(
    'megabytes', by='content_type',
    figsize=(14,8),
    showfliers=False,
    rot=45,
)
attachment_lengths_boxplot.set_title(f"Boxplot of attachment size by content type (outliers removed), {start_date} to {end_date}")
attachment_lengths_boxplot.set(xlabel="Attachment type", ylabel="Size (MB)")

## Size of combined attachments per transfer

In [None]:
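# Divide by a float so the result is not truncated by integer division, matching the per-attachment calculation above.
combined_attachment_size = attachments.aggregate("conversation_id, SUM(length)/(1024.0*1024.0) as megabytes").df()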

In [None]:
combined_attachment_size.describe().style.set_caption(f"Size of combined attachments per transfer, {start_date} to {end_date}")
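
The aggregation above sums the attachment lengths within each conversation and converts the total from bytes to megabytes, taking one megabyte as 1024 × 1024 bytes. The same calculation can be sketched in plain Python; the rows below are made up for illustration and `BYTES_PER_MEGABYTE` is a name introduced here, not one used by the notebook.

```python
from collections import defaultdict

BYTES_PER_MEGABYTE = 1024 * 1024

# Made-up (conversation_id, length in bytes) pairs.
rows = [
    ("conv-a", 1_048_576),
    ("conv-a", 524_288),
    ("conv-b", 2_097_152),
]

# Sum the byte lengths per conversation.
total_bytes = defaultdict(int)
for conversation_id, length in rows:
    total_bytes[conversation_id] += length

# True division keeps the fractional part of the size in megabytes.
combined_megabytes = {
    conversation_id: total / BYTES_PER_MEGABYTE
    for conversation_id, total in total_bytes.items()
}

print(combined_megabytes)
```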

In [None]:
transfer_sizes_hist = combined_attachment_size.plot.hist(
    title=f"Histogram of combined attachment size per transfer (logarithmic), {start_date} to {end_date}",
    legend=False,
    bins=100, log=True, figsize=(14,8)
)
transfer_sizes_hist.set(xlabel="Combined attachment size (MB)", ylabel="Frequency (logarithmic)")
transfer_sizes_hist.ticklabel_format(style='plain', axis='x', useOffset=False)

In [None]:
# The summed length is converted to megabytes, so the column is named accordingly.
transfer_sizes_hist_lin = attachments.aggregate("conversation_id, SUM(length)/(1024.0*1024.0) as megabytes").df().plot.hist(
    title=f"Histogram of combined attachment size per transfer (linear), {start_date} to {end_date}",
    legend=False,
    bins=100, log=False, figsize=(14,8)
)
transfer_sizes_hist_lin.set(xlabel="Combined attachment size (MB)", ylabel="Frequency (linear)")
transfer_sizes_hist_lin.ticklabel_format(style='plain', axis='x', useOffset=False)