Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#3546] Fix issue applying censor rule to binary data #5474

Merged
merged 1 commit into from Jul 20, 2020

Conversation

gbp
Copy link
Member

@gbp gbp commented Nov 21, 2019

Relevant issue(s)

Fixes #3546
Fixes #4064

What does this do?

The original implementation assumed the binary would be in ASCII-8BIT
encoding but this is not always the case. I'm unsure at this stage if
something has changed, such as how uncompressed data is extracted from
to the pdftk tool, or if this case has never been handled.

This change forces censor rule into the same encoding and ensures the
correct number of replacement bytes are used.

Why was this needed?

Increasing amount of exception emails.

@gbp gbp self-assigned this Nov 21, 2019
@gbp gbp added this to In progress in transparency-current-sprint via automation Nov 21, 2019
@gbp gbp changed the title [WiP] [#3546] Fix issue applying censor rule to binary data [#3546] Fix issue applying censor rule to binary data Nov 22, 2019
@gbp gbp marked this pull request as ready for review November 22, 2019 09:16
@gbp gbp requested a review from garethrees November 22, 2019 09:16
@gbp
Copy link
Member Author

gbp commented Nov 22, 2019

@garethrees this is green so should fix the exception but I'm not 100% sure this is the right change to make. For instance why is regexp being converted to ASCII-8BIT in the first place?

If we didn't convert then we probably could do:

    binary_to_censor.gsub(to_replace(binary_to_censor.encoding)) do |match|
      match.gsub(single_char_regexp, 'x')
    end

@gbp gbp moved this from In progress to Awaiting review in transparency-current-sprint Dec 2, 2019
@gbp gbp assigned garethrees and unassigned gbp Dec 2, 2019
Copy link
Member

@garethrees garethrees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Minor comments but happy to merge as-is or you can fix and then merge – no need for re-review.

I've found another case that I think we may want to address, but have created a new issue for that to avoid blocking this getting out.

app/models/censor_rule.rb Outdated Show resolved Hide resolved
spec/lib/alaveteli_text_masker_spec.rb Show resolved Hide resolved
@garethrees garethrees moved this from Awaiting review to Reviewer approved in transparency-current-sprint Dec 11, 2019
@garethrees
Copy link
Member

Also fixes #4064 (updated PR description)

@gbp gbp force-pushed the 3546-encoding-compatibility-error branch from 1ccfcab to cfe9033 Compare December 16, 2019 09:24
The original implementation assumed the binary would be in ASCII-8BIT
encoding but this is not always the case. I'm unsure at this stage if
something has changed, such as how uncompressed data is extracted from
to the pdftk tool, or if this case has never been handled.

This change forces the data into ASCII-8BIT and back into the original
encoding regardless of what it initially was.
@gbp gbp force-pushed the 3546-encoding-compatibility-error branch from cfe9033 to e7f0b30 Compare December 16, 2019 13:58
@gbp
Copy link
Member Author

gbp commented Jan 29, 2020

Before merging, as this will expose more attachments (instead of erroring) we want to check existing censor rules apply correctly in these new cases.

@gbp
Copy link
Member Author

gbp commented Jul 20, 2020

#5821 unblocks this.

@gbp gbp removed the has-blockers label Jul 20, 2020
@gbp gbp merged commit 5bf0d01 into develop Jul 20, 2020
transparency-current-sprint automation moved this from Reviewer approved to Done Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
3 participants