Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect problems in spreadsheet attachments #8047

Closed
garethrees opened this issue Dec 14, 2023 · 1 comment · Fixed by #8123
Closed

Detect problems in spreadsheet attachments #8047

garethrees opened this issue Dec 14, 2023 · 1 comment · Fixed by #8123
Assignees
Labels
enhancement Adds new functionality f:request-analysis reduce-admin Reduce issues coming to us in the first place x:uk

Comments

@garethrees
Copy link
Member

Authorities making data breaches by including hidden data in Excel spreadsheets has been a long-known issue. As WhatDoTheyKnow has grown we’ve seen increasing numbers of Excel attachments, and therefore Excel-based data breaches. We now know that our reporting of these to authorities and ICO has not improved behaviour. Enough is enough.

Goal: Scan Excel files for potential problems and act appropriately if issues are detected.

At a basic level, we want to automatically scan Excel files for known signs of hidden data and then alert appropriate people so that the problem can be handled. What this latter step looks like is TBC, so this issue focuses on the problem detection and recording of issues.

Which files?

The key problem file type to target is XLSX, as that’s now the most commonly released.

🕳️⚠️ There are a variety of other Excel file formats, but they’re a tiny proportion of releases. The slight exception to this is XLS, which was the default but we’re seeing less each year. We could add a basic file size check here. It’s not worth jumping through hoops to add advanced checking mechanisms until we’ve got XLSX checking in operation.

Checking for problems

At present we only create attachments the first time the IncomingMessage is viewed in a browser, which could be months after receiving the response. The reason for this is to even out processing load and not take up space extracting attachments that no one wants to see.

In the context of checking for problems, we should scan attachments as soon as possible after they’re received. That means we should “peek” inside the mail while we’re ingesting it and force a full parse of the raw email immediately if there’s a scannable file type attached:

class IncomingMessage
  after_create :parse_raw_email!, if: -> { raw_email.contains_spreadsheet? }

class RawEmail
  def contains_spreadsheet?
    mail.attachments.any? do
      |attachment| spreadsheet_extensions.include?(File.extname(attachment.filename)
    end
  end

🕳️⚠️ If “peeking” gets complicated and we can’t easily prevent publication prior to the analyser finishing, let’s discuss. We can probably live with it if there’s a minor chance that an attachment may be briefly publicly available.

We only want to automatically do this scanning the first time we receive a response. We may choose to force manual rescans in future – if we add enhanced checks – but what we don’t want is this automation to start firing if an attachment is recreated e.g. due to a digest mismatch, which happens from time to time as underlying libraries change. Ideally we’ll avoid any complexity by having this scanning process kick off from the creation of a RawEmail or IncomingMessage – that way we won’t have to maintain state about what’s been checked.

What are the signs?

As well as a large file size, we have a growing collection of knowledge about how to detect hidden content. We don’t need to ship with every possible check in place from the outset; we just want the key checks there and have the ability to add to them later.

@garethrees
Copy link
Member Author

Much of this has been added in #8045.

I think what's left at this point is:

  • scan attachments as soon as possible after they’re received
  • only want to automatically do this scanning the first time we receive a response

gbp added a commit that referenced this issue Feb 9, 2024
We want to parse these email automatically so the Excel Analyzer runs
and detects issues with hidden data.

Fixes #8047
gbp added a commit that referenced this issue Feb 9, 2024
We want to parse these email automatically so the Excel Analyzer runs
and detects issues with hidden data.

Fixes #8047
gbp added a commit that referenced this issue Feb 9, 2024
We want to parse these email automatically so the Excel Analyzer runs
and detects issues with hidden data.

Fixes #8047
@gbp gbp closed this as completed in bcd107e Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adds new functionality f:request-analysis reduce-admin Reduce issues coming to us in the first place x:uk
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants