Detect problems in spreadsheet attachments #8047
Labels
enhancement
Adds new functionality
f:request-analysis
reduce-admin
Reduce issues coming to us in the first place
x:uk
Milestone
Authorities making data breaches by including hidden data in Excel spreadsheets has been a long-known issue. As WhatDoTheyKnow has grown we’ve seen increasing numbers of Excel attachments, and therefore Excel-based data breaches. We now know that our reporting of these to authorities and ICO has not improved behaviour. Enough is enough.
Goal: Scan Excel files for potential problems and act appropriately if issues are detected.
At a basic level, we want to automatically scan Excel files for known signs of hidden data and then alert appropriate people so that the problem can be handled. What this latter step looks like is TBC, so this issue focuses on the problem detection and recording of issues.
Which files?
The key problem file type to target is XLSX, as that’s now the most commonly released.
🕳️⚠️ There are a variety of other Excel file formats, but they’re a tiny proportion of releases. The slight exception to this is XLS, which was the default but we’re seeing less each year. We could add a basic file size check here. It’s not worth jumping through hoops to add advanced checking mechanisms until we’ve got XLSX checking in operation.
Checking for problems
At present we only create attachments the first time the IncomingMessage is viewed in a browser, which could be months after receiving the response. The reason for this is to even out processing load and not take up space extracting attachments that no one wants to see.
In the context of checking for problems, we should scan attachments as soon as possible after they’re received. That means we should “peek” inside the mail while we’re ingesting it and force a full parse of the raw email immediately if there’s a scannable file type attached:
🕳️⚠️ If “peeking” gets complicated and we can’t easily prevent publication prior to the analyser finishing, let’s discuss. We can probably live with it if there’s a minor chance that an attachment may be briefly publicly available.
We only want to automatically do this scanning the first time we receive a response. We may choose to force manual rescans in future – if we add enhanced checks – but what we don’t want is this automation to start firing if an attachment is recreated e.g. due to a digest mismatch, which happens from time to time as underlying libraries change. Ideally we’ll avoid any complexity by having this scanning process kick off from the creation of a RawEmail or IncomingMessage – that way we won’t have to maintain state about what’s been checked.
What are the signs?
As well as a large file size, we have a growing collection of knowledge about how to detect hidden content. We don’t need to ship with every possible check in place from the outset; we just want the key checks there and have the ability to add to them later.
The text was updated successfully, but these errors were encountered: