# PRMT-1181 Attachment Metadata Insights

### Context

PCSE want to understand the scope and size of the GP2GP fallback service, and in particular the types of attachments that are sent via GP2GP.

They would like to know:
- The average and maximum file sizes
- The distribution of file sizes, shown as a graph
- Which file types got as far as the transfer
- The number of attachments, shown as a graph

### Requirements

To reproduce this notebook, perform the following steps:

1. Log into Splunk and run the following query twice, once for 21/12/2020 00:00:00 to 27/12/2020 24:00:00 and once for 28/12/2020 00:00:00 to 03/01/2021 24:00:00. The query is split into two time frames because there are currently issues with downloading large data sets in Splunk:

```
index="spine2vfmmonitor" logReference=MPS0208
| fields _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64
| fields - _raw
```

2. Download the two data sets as CSVs and place them in a directory. Set the `INPUT_DATA_DIR` environment variable to point to this directory.
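
Before running the notebook, it can help to confirm that `INPUT_DATA_DIR` is set and that the downloaded CSVs are where the notebook expects them. The following is a minimal sketch of such a check, using only the standard library. The helper name `find_input_csvs` is hypothetical and is not part of this repository; the sketch also assumes the Splunk exports were saved with a `.csv` extension.

```python
import os
from pathlib import Path
from typing import List


def find_input_csvs(env_var: str = "INPUT_DATA_DIR") -> List[Path]:
    """Return the CSV files in the directory named by env_var, sorted by name.

    Hypothetical helper for illustration; assumes the exports end in ".csv".
    """
    data_dir = os.environ.get(env_var)
    if not data_dir:
        raise RuntimeError(f"{env_var} is not set")

    directory = Path(data_dir)
    if not directory.is_dir():
        raise RuntimeError(f"{env_var} does not point to a directory: {directory}")

    csv_files = sorted(directory.glob("*.csv"))
    if not csv_files:
        raise RuntimeError(f"No CSV files found in {directory}")
    return csv_files
```

Calling `find_input_csvs()` before the first cell should return two paths, one per weekly export, if the steps above were followed.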

In [None]:
import os
import paths
import duckdb
from scripts.attachments import construct_attachments_db

In [None]:
attachment_data_dir = os.environ["INPUT_DATA_DIR"]
cursor = duckdb.connect()
attachments = construct_attachments_db(cursor, attachment_data_dir)

## Number of attachments

In [None]:
attachment_count = attachments.aggregate("count(*)").df()
print(f"{attachment_count.iat[0, 0]} attachments in dataset")

In [None]:
attachments_per_content_type = attachments.aggregate("content_type, count(*) as count").df()
attachments_per_content_type.plot.bar(x="content_type", y="count", rot=45, figsize=(16,8))

In [None]:
attachments_per_transfer = attachments.aggregate("conversation_id, count(*) as count").df()

In [None]:
attachments_per_transfer.hist(bins=200, figsize=(14,8))

In [None]:
attachments_per_transfer.describe()

## Size of attachments

In [None]:
attachment_lengths = attachments.project("content_type, length").df()

In [None]:
attachment_lengths.describe()
#attachments_stats.style.set_caption("Attachment file sizes (bytes)")
#TODO: ADD date range to table title

In [None]:
attachment_lengths.hist(bins=100, log=True, figsize=(14,8))
#TODO: add sensible axis labels (e.g. MB)
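
One possible way to address the axis label TODO is a small formatter that turns byte counts into human-readable strings. The sketch below is an assumption rather than part of the notebook: `format_bytes` is a hypothetical helper, it assumes the `length` column is in bytes (as the commented-out table caption suggests), and it uses decimal units (1 MB = 1,000,000 bytes) rather than binary ones.

```python
def format_bytes(num_bytes: float) -> str:
    """Format a byte count using decimal units (1 MB = 1,000,000 bytes).

    Hypothetical helper for illustration; not part of the notebook's scripts.
    """
    units = ["B", "kB", "MB", "GB", "TB"]
    value = float(num_bytes)
    # Divide by 1000 until the value fits the current unit.
    for unit in units[:-1]:
        if abs(value) < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    # Anything larger than the second-to-last unit is reported in the last one.
    return f"{value:.1f} {units[-1]}"
```

A function like this could be wrapped in `matplotlib.ticker.FuncFormatter`, for example `FuncFormatter(lambda x, pos: format_bytes(x))`, and set as the major formatter on the histogram's x-axis.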

### Attachment size by file type

In [None]:
attachment_lengths.groupby('content_type').describe()
#attachment_statistics_by_content_type_df = query_attachment_statistics_by_content_type(attachments)
#attachment_statistics_by_content_type_df.style.set_caption("Attachment file sizes (bytes)")
#TODO: Format cells, add date range in title

In [None]:
attachment_lengths.boxplot('length', by='content_type', figsize=(14,8), showfliers=False)

## Appendix: Attachment uniqueness

In [None]:
attachments.aggregate("count(*)").df()

In [None]:
cursor.execute("select count(*) from (select distinct conversation_id, attachment_id from attachment_metadata) uniq").df()

In [None]:
conversation_attachment_counts = attachments\
    .aggregate("conversation_id, count(*) as rows, count(distinct(attachment_id)) as unique_attachments")\
    .create_view("convo_attachment_counts")

In [None]:
conversation_attachment_counts.filter("rows != unique_attachments").df()
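
To illustrate what the uniqueness checks above are looking for, here is a small self-contained sketch on made-up data. It uses the standard library's `sqlite3` module as a stand-in for DuckDB so that it runs without the Splunk exports. The rows are invented for illustration, and only the `attachment_metadata` table name and its two columns are taken from the queries above.

```python
import sqlite3

# Toy (conversation_id, attachment_id) rows; the values are made up.
rows = [
    ("conv-a", "att-1"),
    ("conv-a", "att-2"),
    ("conv-b", "att-3"),
    ("conv-b", "att-3"),  # the same attachment logged twice
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "create table attachment_metadata (conversation_id text, attachment_id text)"
)
connection.executemany("insert into attachment_metadata values (?, ?)", rows)

# Total number of rows, duplicates included.
total_rows = connection.execute(
    "select count(*) from attachment_metadata"
).fetchone()[0]

# Number of distinct (conversation_id, attachment_id) pairs.
unique_pairs = connection.execute(
    "select count(*) from "
    "(select distinct conversation_id, attachment_id from attachment_metadata) uniq"
).fetchone()[0]

# Conversations where the row count differs from the unique attachment count.
conversations_with_duplicates = connection.execute(
    "select conversation_id, count(*) as row_count, "
    "count(distinct attachment_id) as unique_attachments "
    "from attachment_metadata group by conversation_id "
    "having count(*) != count(distinct attachment_id)"
).fetchall()

print(total_rows, unique_pairs, conversations_with_duplicates)
```

If the total row count and the distinct pair count match on the real data, every row corresponds to a unique attachment within its conversation; any conversation returned by the last query contains repeated attachment IDs.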