# PRMT-1181 Attachment Metadata Insights

### Context

PCSE want to understand the scope and size of the GP2GP fallback service, and in particular the types of attachments that are sent via GP2GP.

They would like:
- The average and maximum file sizes
- A graph showing the distribution of file sizes
- The file types that got as far as the transfer
- A graph of the number of attachments

### Requirements

To replicate this notebook, perform the following steps:

1. Log into Splunk and run the following query twice, once for 21/12/2020 00:00:00 to 27/12/2020 24:00:00 and once for 28/12/2020 00:00:00 to 03/01/2021 24:00:00. The date range is split in two because Splunk currently has issues with downloading large data sets:

```
index="spine2vfmmonitor" logReference=MPS0208
| fields _time, attachmentID, conversationID, FromSystem, ToSystem, attachmentType, Compressed, ContentType, LargeAttachment, Length, OriginalBase64
| fields - _raw
```

2. Download the two data sets as CSV files and place them in a single directory. Set the `attachment_data_dir` variable to the location of that directory.
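Before building the database, it can help to confirm that each export contains the columns requested in the query. The following is a minimal sketch using only the standard library; the names `EXPECTED_COLUMNS`, `missing_columns` and `check_exports` are hypothetical helpers, not part of `scripts.attachment_insights`, and it assumes the first row of each CSV is a header row.

```python
import csv
from pathlib import Path

# Columns kept by the `fields` clause of the Splunk query above.
EXPECTED_COLUMNS = {
    "_time", "attachmentID", "conversationID", "FromSystem", "ToSystem",
    "attachmentType", "Compressed", "ContentType", "LargeAttachment",
    "Length", "OriginalBase64",
}


def missing_columns(csv_lines, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_lines), [])
    return sorted(set(expected) - set(header))


def check_exports(data_dir):
    """Map each CSV file name in data_dir to its list of missing columns."""
    report = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            report[path.name] = missing_columns(handle)
    return report
```

Calling `check_exports(attachment_data_dir)` should return an empty list for every file if the exports are complete.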

In [1]:
import paths
import duckdb
from scripts.attachment_insights import construct_db_database

In [2]:
# To rerun this cell, first delete the generated attachment.db and attachment.db.wal files.
attachment_data_dir = "/Users/helen.zhou/Desktop/attachmentData"
database_file = f"{attachment_data_dir}/attachment.db"
construct_db_database(attachment_data_dir, database_file)

Loading /Users/helen.zhou/Desktop/attachmentData/attachment21Dec.csv
Loading /Users/helen.zhou/Desktop/attachmentData/attachment28Dec.csv
Done
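
Rerunning the cell above requires the generated database files to be deleted first. A minimal sketch of that clean-up step follows; `database_artifacts` and `remove_stale_database` are hypothetical helper names, and the sketch assumes the write-ahead log sits next to the database file with a `.wal` suffix, as the comment in the cell describes.

```python
from pathlib import Path


def database_artifacts(database_file):
    """Return the database file and its write-ahead log as Path objects."""
    db_path = Path(database_file)
    return [db_path, db_path.with_name(db_path.name + ".wal")]


def remove_stale_database(database_file):
    """Delete any existing database artifacts and return the paths removed."""
    removed = []
    for path in database_artifacts(database_file):
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_stale_database(database_file)` before `construct_db_database(...)` would then make the cell safe to rerun.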


In [3]:
attachment_cursor = duckdb.connect(database_file)
attachment_rel = attachment_cursor.table("attachment_metadata")

In [4]:
print("The total number of transfers involving attachments:")

attachment_rel.aggregate("count(*)")

The total number of transfers involving attachments:


---------------------
-- Expression Tree --
---------------------
Aggregate [count_star()]
  Scan Table [attachment_metadata]

---------------------
-- Result Columns  --
---------------------
- count_star() (BIGINT)

---------------------
-- Result Preview  --
---------------------
count_star()	
BIGINT	
[ Rows: 1]
2502766	



The row count above matches the total number of events returned by the equivalent Splunk query, which confirms that the two data sets were concatenated successfully.

The Splunk query used, with the date range set to 21/12/2020 00:00:00 to 03/01/2021 24:00:00:

```
index="spine2vfmmonitor" logReference=MPS0208
| stats count, dc(conversationID) as distinct_conversations
```

Results:

Count: 2502766 and distinct_conversations: 30314
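
The same two figures can also be computed directly from the CSV exports, independently of DuckDB, as a further cross-check. This is a minimal sketch using only the standard library; `summarise_events` and `summarise_csv_files` are hypothetical names, and it assumes each export has a header row containing a `conversationID` column, as requested in the query.

```python
import csv


def summarise_events(rows):
    """Return (event_count, distinct_conversation_count) for an iterable of row dicts."""
    event_count = 0
    conversations = set()
    for row in rows:
        event_count += 1
        conversations.add(row["conversationID"])
    return event_count, len(conversations)


def summarise_csv_files(paths):
    """Summarise the rows of several CSV exports as if they were one data set."""
    def all_rows():
        for path in paths:
            with open(path, newline="") as handle:
                yield from csv.DictReader(handle)
    return summarise_events(all_rows())
```

Because the conversation IDs from all files are collected into a single set, a conversation that appears in both weekly exports is only counted once, which mirrors `dc(conversationID)` over the full date range.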
