New guidance document: Correctly using Regular Expressions for Secure Input Validation #461

Open
david-a-wheeler opened this issue Apr 9, 2024 · 2 comments


@david-a-wheeler

I propose creating a new guidance document, Correctly using Regular Expressions for Secure Input Validation.

A draft is here:
https://docs.google.com/document/d/1Ors5T04Pgh3dcBfelbBrEBrvY3OKB7loUBUJPYBmmZw/edit

Here's the background. Seth Larson’s post “Regex character ‘$’ doesn't mean ‘end-of-string’” noted that many people think “$” always means “end of string”, even though its meaning is platform-specific. That it is platform-specific isn't technically new; a 2009 book noted this, and some documents also note it. However, many developers are unaware of it. Regexes are widely used for input validation, so this misunderstanding can lead to vulnerabilities.
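
To make the risk concrete, here is a minimal sketch in Python (chosen only because its behavior is well documented; it says nothing about Java or .NET). The validator names and the "lowercase letters only" rule are hypothetical, made up for illustration. It shows a validator written with "^...$" accepting input that ends in a newline, while the same pattern anchored with "\A...\Z" rejects it.

```python
import re

# Hypothetical rule for illustration: a name is one or more lowercase ASCII letters.
NAIVE = re.compile(r"^[a-z]+$")      # "$" also matches just before a trailing "\n" in Python
STRICT = re.compile(r"\A[a-z]+\Z")   # "\Z" matches only at the very end of the string in Python


def is_valid_naive(value: str) -> bool:
    """Validate with the commonly assumed (but leaky) anchors."""
    return NAIVE.search(value) is not None


def is_valid_strict(value: str) -> bool:
    """Validate with Python's unambiguous start/end-of-string anchors."""
    return STRICT.search(value) is not None


if __name__ == "__main__":
    for candidate in ["admin", "admin\n"]:
        print(repr(candidate), is_valid_naive(candidate), is_valid_strict(candidate))
```

In Python, `is_valid_naive("admin\n")` returns True while `is_valid_strict("admin\n")` returns False, which is exactly the gap a downstream consumer of the "validated" value might not expect.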

The proposed solution is simple: Let's create a narrowly-focused guidance document on this one topic. Briefly explain what you should do, followed by text that justifies it.

Completing this will require:

  • Verifying Java semantics. Basically, does "$" by default match only the end of the string, or does it also match before a "\n" that is followed by the end of the string? Similarly, does "^" match only the beginning of the string, or does it also match after a newline? Documentation suggests that "$" also matches before a final newline and that "^" matches only the beginning of the string, but we need to verify this. This should take ~10 minutes for a developer who knows Java & has a working environment.
  • Verifying .NET semantics, same questions. Again, ~10 min, if you know C# and have a working environment.
  • Review of material.
  • Conversion to Markdown & posting the results to the world.
@david-a-wheeler

david-a-wheeler commented Apr 9, 2024

The Best Practices WG agreed today (2024-04-09) to work on this guidance document.

We need someone to check on Java, and someone (not necessarily the same person) to check on .NET, to answer the following:

The biggest question is determining if a regular expression like /x$/ only matches inputs like “ax” or if it will also match other inputs such as “ax\n”. Similarly, we must check if /^d/ matches only “dog”, or if it will match other beginnings like “\ndog” or “x\ndog”. We also need to determine if there are symbols for beginning of string (typically “\A”) or end of string (typically “\z” though Python uses “\Z”).
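
As a rough template for the check described above, the sketch below runs those same probe inputs through Python's `re` module. It is only an illustration of the method: the `probe` helper and the `CASES` list are assumptions introduced here, and the Java and .NET checks would need equivalent snippets in those environments, where the end-of-string symbol is expected to be "\z" rather than Python's "\Z".

```python
import re


def probe(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text (default flags, no MULTILINE)."""
    return re.search(pattern, text) is not None


# (pattern, input) pairs taken from the questions in this comment.
CASES = [
    (r"x$", "ax"),
    (r"x$", "ax\n"),
    (r"x\Z", "ax\n"),   # Python spells the strict end-of-string anchor "\Z"
    (r"^d", "dog"),
    (r"^d", "\ndog"),
    (r"^d", "x\ndog"),
    (r"\Ad", "dog"),
]

results = {(pattern, text): probe(pattern, text) for pattern, text in CASES}

if __name__ == "__main__":
    for (pattern, text), matched in results.items():
        print(f"{pattern!r:8} against {text!r:10} -> {matched}")
```

In Python this reports that /x$/ matches both "ax" and "ax\n", that /x\Z/ does not match "ax\n", and that /^d/ matches "dog" but neither "\ndog" nor "x\ndog".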

I believe Java & .NET, by default, use Perl semantics (not POSIX/JavaScript/Go semantics). That is, "^" only matches the beginning of a string, but "$" will also match before a newline at the end (at least), and you need to use \z to reliably match the end of the string.

@david-a-wheeler

All: We now have a complete draft of the guide: https://docs.google.com/document/d/1Ors5T04Pgh3dcBfelbBrEBrvY3OKB7loUBUJPYBmmZw/edit

For now, please comment on that Google document.

I propose that the group review & hopefully eventually accept this material. If accepted, I propose that it be converted into 2 markdown documents: A short guidance document (covering the first two sections) and a rationale (the rest of the document). While we were doing the research it was helpful to have a single document (so we could keep them consistent), but now that it's near the final form, it's probably best to have a short guidance document (so we don't scare readers) while preserving the longer rationale.

Does this seem reasonable?
