[pseudo] Only expand UCNs for raw_identifiers
It turns out clang::expandUCNs only works on tokens that contain valid UCNs
and no other random escapes, and clang only uses it on raw_identifiers.

Currently we can hit an assertion by creating tokens with stray non-valid-UCN
backslashes in them.

Fortunately, expanding UCNs in raw_identifiers is actually all we need.
Most tokens (keywords, punctuation) can't have them. UCNs in literals can be
treated as escape sequences like \n, even though this isn't the standard's
interpretation. This more or less matches how clang works.
(See https://isocpp.org/files/papers/P2194R0.pdf, which points out that the
standard's description of how UCNs work is misaligned with real implementations.)
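
A minimal standalone sketch of the idea above, not the code from this commit: UCNs are decoded only when the token is a raw identifier, and every other token keeps its text untouched. All names here (SketchKind, appendUTF8, expandUCNsSketch, cookTextSketch) are hypothetical and exist only for illustration. Unlike clang::expandUCNs, which the message above says can hit an assertion on stray backslashes, this sketch copies malformed escapes through unchanged.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical token kinds for this sketch; not clang's tok::TokenKind.
enum class SketchKind { RawIdentifier, StringLiteral, Comment, Unknown };

// Appends the UTF-8 encoding of CP. Assumption: no validation of surrogates
// or of values above U+10FFFF is attempted in this sketch.
inline void appendUTF8(std::string &Out, std::uint32_t CP) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Replaces well-formed \uXXXX and \UXXXXXXXX sequences with UTF-8.
// Anything else, including a stray backslash, is copied through as-is.
inline std::string expandUCNsSketch(std::string_view In) {
  std::string Out;
  for (std::size_t I = 0; I < In.size();) {
    std::size_t Digits = 0;
    if (In[I] == '\\' && I + 1 < In.size())
      Digits = In[I + 1] == 'u' ? 4 : In[I + 1] == 'U' ? 8 : 0;
    bool WellFormed = Digits != 0 && I + 2 + Digits <= In.size();
    std::uint32_t CP = 0;
    for (std::size_t J = 0; WellFormed && J < Digits; ++J) {
      unsigned char C = static_cast<unsigned char>(In[I + 2 + J]);
      if (!std::isxdigit(C)) {
        WellFormed = false;
        break;
      }
      CP = CP * 16 + (std::isdigit(C) ? C - '0' : std::tolower(C) - 'a' + 10);
    }
    if (WellFormed) {
      appendUTF8(Out, CP);
      I += 2 + Digits;
    } else {
      Out += In[I++];
    }
  }
  return Out;
}

// Mirrors the shape of the fix: only raw identifiers get UCN expansion.
inline std::string cookTextSketch(SketchKind Kind, std::string_view Cleaned) {
  if (Kind == SketchKind::RawIdentifier)
    return expandUCNsSketch(Cleaned);
  return std::string(Cleaned);
}
```

Under these assumptions, an identifier spelled caf\u00E9 is decoded to UTF-8, while the same escape inside a string literal token, or a lone backslash in an unknown token, is left exactly as written.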

Differential Revision: https://reviews.llvm.org/D125049
sam-mccall committed May 6, 2022
1 parent 8175509 commit 232cc44
Showing 4 changed files with 29 additions and 9 deletions.
13 changes: 8 additions & 5 deletions clang-tools-extra/pseudo/include/clang-pseudo/Token.h
@@ -199,12 +199,15 @@ clang::LangOptions genericLangOpts(
 clang::Language = clang::Language::CXX,
 clang::LangStandard::Kind = clang::LangStandard::lang_unspecified);

-/// Derives a token stream by decoding escapes, interpreting raw_identifiers and
-/// splitting the greatergreater token.
+/// Decoding raw tokens written in the source code, returning a derived stream.
 ///
-/// Tokens containing UCNs, escaped newlines, trigraphs etc are decoded and
-/// their backing data is owned by the returned stream.
-/// raw_identifier tokens are assigned specific types (identifier, keyword etc).
+/// - escaped newlines within tokens are removed
+/// - trigraphs are replaced with the characters they encode
+/// - UCNs within raw_identifiers are replaced by the characters they encode
+/// (UCNs within strings, comments etc are not translated)
+/// - raw_identifier tokens are assigned their correct keyword type
+/// - the >> token is split into separate > > tokens
+/// (we use a modified grammar where >> is a nonterminal, not a token)
 ///
 /// The StartsPPLine flag is preserved.
 ///
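
The first two bullets of the comment above (escaped newlines and trigraphs) can be illustrated with a small standalone sketch. This is not the clang-pseudo implementation; decodeTrigraphSketch and cleanTokenSketch are hypothetical names, and the sketch assumes a plain "\n" line ending with no whitespace between the backslash and the newline. Trigraphs are replaced first, so a trigraph that encodes a backslash can still form an escaped newline.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns the character encoded by a trigraph whose third character is C,
// or 0 if C does not complete a trigraph.
inline char decodeTrigraphSketch(char C) {
  switch (C) {
  case '=': return '#';
  case '/': return '\\';
  case '\'': return '^';
  case '(': return '[';
  case ')': return ']';
  case '!': return '|';
  case '<': return '{';
  case '>': return '}';
  case '-': return '~';
  default: return 0;
  }
}

// Cleans the raw spelling of one token in two passes.
inline std::string cleanTokenSketch(std::string_view Raw) {
  // Pass 1: replace each three-character trigraph with the character it encodes.
  std::string NoTrigraphs;
  for (std::size_t I = 0; I < Raw.size();) {
    char Decoded = 0;
    if (I + 2 < Raw.size() && Raw[I] == '?' && Raw[I + 1] == '?')
      Decoded = decodeTrigraphSketch(Raw[I + 2]);
    if (Decoded) {
      NoTrigraphs += Decoded;
      I += 3;
    } else {
      NoTrigraphs += Raw[I++];
    }
  }
  // Pass 2: drop every backslash that is immediately followed by a newline.
  std::string Out;
  for (std::size_t I = 0; I < NoTrigraphs.size();) {
    if (NoTrigraphs[I] == '\\' && I + 1 < NoTrigraphs.size() &&
        NoTrigraphs[I + 1] == '\n')
      I += 2;
    else
      Out += NoTrigraphs[I++];
  }
  return Out;
}
```

With this sketch, the keyword int split across two lines by a backslash-newline is cleaned back to int, and a lone question mark is left untouched.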
19 changes: 15 additions & 4 deletions clang-tools-extra/pseudo/lib/Lex.cpp
@@ -90,12 +90,23 @@ TokenStream cook(const TokenStream &Code, const LangOptions &LangOpts) {
 assert(CharSize != 0 && "no progress!");
 Pos += CharSize;
 }
-// Remove universal character names (UCN).
+llvm::StringRef Text = CleanBuffer;
 llvm::SmallString<64> UCNBuffer;
-clang::expandUCNs(UCNBuffer, CleanBuffer);
+// A surface reading of the standard suggests UCNs might appear anywhere.
+// But we need only decode them in raw_identifiers.
+// - they cannot appear in punctuation/keyword tokens, because UCNs
+// cannot encode basic characters outside of literals [lex.charset]
+// - they can appear in literals, but we need not unescape them now.
+// We treat them as escape sequences when evaluating the literal.
+// - comments are handled similarly to literals
+// This is good fortune, because expandUCNs requires its input to be a
+// reasonably valid identifier (e.g. without stray backslashes).
+if (Tok.Kind == tok::raw_identifier) {
+clang::expandUCNs(UCNBuffer, CleanBuffer);
+Text = UCNBuffer;
+}

-llvm::StringRef Text = llvm::StringRef(UCNBuffer).copy(*CleanedStorage);
-Tok.Data = Text.data();
+Tok.Data = Text.copy(*CleanedStorage).data();
 Tok.Length = Text.size();
 Tok.Flags &= ~static_cast<decltype(Tok.Flags)>(LexFlags::NeedsCleaning);
 }
4 changes: 4 additions & 0 deletions clang-tools-extra/pseudo/test/crash/backslashes.c
@@ -0,0 +1,4 @@
+// We used to try to interpret these backslashes as UCNs.
+// RUN: clang-pseudo -source=%s -print-tokens
+\
+\ x
2 changes: 2 additions & 0 deletions clang-tools-extra/pseudo/tool/ClangPseudo.cpp
@@ -17,6 +17,7 @@
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/FormatVariadic.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Support/Signals.h"

 using clang::pseudo::Grammar;
 using llvm::cl::desc;
@@ -52,6 +53,7 @@ static std::string readOrDie(llvm::StringRef Path) {

 int main(int argc, char *argv[]) {
 llvm::cl::ParseCommandLineOptions(argc, argv, "");
+llvm::sys::PrintStackTraceOnErrorSignal(argv[0]);

 clang::LangOptions LangOpts = clang::pseudo::genericLangOpts();
 std::string SourceText;