Detect problems in spreadsheet attachments #8047

garethrees · 2023-12-14T09:47:24Z

Authorities making data breaches by including hidden data in Excel spreadsheets has been a long-known issue. As WhatDoTheyKnow has grown we’ve seen increasing numbers of Excel attachments, and therefore Excel-based data breaches. We now know that our reporting of these to authorities and ICO has not improved behaviour. Enough is enough.

Goal: Scan Excel files for potential problems and act appropriately if issues are detected.

At a basic level, we want to automatically scan Excel files for known signs of hidden data and then alert appropriate people so that the problem can be handled. What this latter step looks like is TBC, so this issue focuses on the problem detection and recording of issues.

Which files?

The key problem file type to target is XLSX, as that’s now the most commonly released.

🕳️⚠️ There are a variety of other Excel file formats, but they’re a tiny proportion of releases. The slight exception to this is XLS, which was the default but we’re seeing less each year. We could add a basic file size check here. It’s not worth jumping through hoops to add advanced checking mechanisms until we’ve got XLSX checking in operation.

Checking for problems

At present we only create attachments the first time the IncomingMessage is viewed in a browser, which could be months after receiving the response. The reason for this is to even out processing load and not take up space extracting attachments that no one wants to see.

In the context of checking for problems, we should scan attachments as soon as possible after they’re received. That means we should “peek” inside the mail while we’re ingesting it and force a full parse of the raw email immediately if there’s a scannable file type attached:

class IncomingMessage
  after_create :parse_raw_email!, if: -> { raw_email.contains_spreadsheet? }

class RawEmail
  def contains_spreadsheet?
    mail.attachments.any? do
      |attachment| spreadsheet_extensions.include?(File.extname(attachment.filename)
    end
  end

🕳️⚠️ If “peeking” gets complicated and we can’t easily prevent publication prior to the analyser finishing, let’s discuss. We can probably live with it if there’s a minor chance that an attachment may be briefly publicly available.

We only want to automatically do this scanning the first time we receive a response. We may choose to force manual rescans in future – if we add enhanced checks – but what we don’t want is this automation to start firing if an attachment is recreated e.g. due to a digest mismatch, which happens from time to time as underlying libraries change. Ideally we’ll avoid any complexity by having this scanning process kick off from the creation of a RawEmail or IncomingMessage – that way we won’t have to maintain state about what’s been checked.

What are the signs?

As well as a large file size, we have a growing collection of knowledge about how to detect hidden content. We don’t need to ship with every possible check in place from the outset; we just want the key checks there and have the ability to add to them later.

The text was updated successfully, but these errors were encountered:

garethrees · 2023-12-14T09:51:04Z

Much of this has been added in #8045.

I think what's left at this point is:

scan attachments as soon as possible after they’re received
only want to automatically do this scanning the first time we receive a response

We want to parse these email automatically so the Excel Analyzer runs and detects issues with hidden data. Fixes #8047

garethrees added x:uk f:request-analysis enhancement Adds new functionality reduce-admin Reduce issues coming to us in the first place labels Dec 14, 2023

garethrees added this to the Spreadsheet Problem Detection milestone Dec 14, 2023

garethrees mentioned this issue Dec 14, 2023

Notify authority when problems are detected in spreadsheet releases #8048

Open

garethrees assigned gbp Dec 14, 2023

gbp added a commit that referenced this issue Feb 9, 2024

Add automatic parsing of emails with spreadsheets

6f6b6a3

We want to parse these email automatically so the Excel Analyzer runs and detects issues with hidden data. Fixes #8047

gbp added a commit that referenced this issue Feb 9, 2024

Add automatic parsing of emails with spreadsheets

64fae22

We want to parse these email automatically so the Excel Analyzer runs and detects issues with hidden data. Fixes #8047

gbp mentioned this issue Feb 9, 2024

[#8047] Add automatic parsing of emails with spreadsheets #8123

Merged

gbp added a commit that referenced this issue Feb 9, 2024

Add automatic parsing of emails with spreadsheets

15819b1

We want to parse these email automatically so the Excel Analyzer runs and detects issues with hidden data. Fixes #8047

gbp closed this as completed in bcd107e Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect problems in spreadsheet attachments #8047

Detect problems in spreadsheet attachments #8047

garethrees commented Dec 14, 2023

garethrees commented Dec 14, 2023

Detect problems in spreadsheet attachments #8047

Detect problems in spreadsheet attachments #8047

Comments

garethrees commented Dec 14, 2023

Which files?

Checking for problems

What are the signs?

garethrees commented Dec 14, 2023