fix: complete StandardEncoding glyph table (v0.1.1)#2
Conversation
The four base PDF encodings (StandardEncoding, WinAnsiEncoding,
MacRomanEncoding, PDFDocEncoding) are now built from a single
encodingRows source-of-truth that mirrors pdfminer.six's latin_enc.py
and PDF Reference 1.7 Appendix D.2 Table D.2.
The previous tables silently dropped ~32 named glyphs per encoding
outside printable ASCII — most visibly the smart quotes (' ' " "),
en/em dashes (- -), bullet, florin, and dagger marks. PDFs that
used these without a /ToUnicode map (PDF/A filings, SEC 10-Ks, most
LaTeX-emitted documents) returned empty or garbled text where these
glyphs appeared.
AdobeGlyphToUnicode now resolves the full Adobe Glyph List for the
~250 glyph names referenced by the four PDF base encodings plus
common /Differences additions. Compound names ("f_i" -> "fi") and
variant suffixes (".alt", ".sc") are handled per AGL §2.
StandardEncoding now correctly maps 0x27 to quoteright (U+2019) and
0x60 to quoteleft (U+2018) per the PDF spec; WinAnsi / MacRoman /
PDFDoc keep ASCII apostrophe and backtick at those slots. The
existing TestEncodingByName test was updated to reflect this
spec-correct behavior (the previous "identity over 0x20..0x7e for
all four encodings" assertion was wrong for StandardEncoding).
|
Warning Review limit reached
More reviews will be available in 37 minutes and 51 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Reviewer's GuideCentralizes and corrects the four PDF base encodings (Standard, WinAnsi, MacRoman, PDFDoc) using a spec-aligned encoding row table and expanded Adobe glyph mapping, fixes StandardEncoding apostrophe/quote behavior, and extends tests and changelog accordingly. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In AdobeGlyphToUnicode, the compound-name handling currently calls indexByte just to detect '_' and then ignores the returned index (
_ = i); consider simplifying this by removing the indexByte call and using a single pass over the string (or strings.IndexByte) so the code is clearer and avoids the dummy variable. - encodingRows contains multiple entries with the same glyph name (e.g. three
spacerows with different codepoints); it may be worth adding a brief inline comment near those rows or in init() clarifying that later rows intentionally overwrite earlier ones to match the PDF spec/pdfminer behavior, so future edits don't 'deduplicate' them by accident.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In AdobeGlyphToUnicode, the compound-name handling currently calls indexByte just to detect '_' and then ignores the returned index (`_ = i`); consider simplifying this by removing the indexByte call and using a single pass over the string (or strings.IndexByte) so the code is clearer and avoids the dummy variable.
- encodingRows contains multiple entries with the same glyph name (e.g. three `space` rows with different codepoints); it may be worth adding a brief inline comment near those rows or in init() clarifying that later rows intentionally overwrite earlier ones to match the PDF spec/pdfminer behavior, so future edits don't 'deduplicate' them by accident.
## Individual Comments
### Comment 1
<location path="internal/pdf/font.go" line_range="776-778" />
<code_context>
+ {"seven", 55, 55, 55, 55},
+ {"six", 54, 54, 54, 54},
+ {"slash", 47, 47, 47, 47},
+ {"space", 32, 32, 32, 32},
+ {"space", -1, 202, 160, -1},
+ {"space", -1, 202, 173, -1},
+ {"sterling", 163, 163, 163, 163},
+ {"t", 116, 116, 116, 116},
</code_context>
<issue_to_address>
**issue (bug_risk):** Duplicate and conflicting "space" rows in encodingRows look incorrect and likely clobber each other.
There are three entries with glyph name "space": the standard space (32) and two more with Mac=202 and Win=160/173. The later ones will overwrite the first when building the encodings and conflict with the separate nbspace row above. In AGL/PDF encodings these should use distinct glyph names (e.g., nbspace, sfthyphen) and codepoints that match the reference table, to avoid overriding the standard space mapping.
</issue_to_address>
### Comment 2
<location path="internal/pdf/font_test.go" line_range="28-37" />
<code_context>
}
}
- // WinAnsi-specific.
+ // StandardEncoding: identity except the typographic-quote slots.
+ std := EncodingByName("StandardEncoding")
+ for c := byte(0x20); c < 0x7f; c++ {
+ if c == 0x27 || c == 0x60 {
+ continue
+ }
+ if std[c] != string(rune(c)) {
+ t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
+ }
+ }
+ if std[0x27] != "’" {
+ t.Errorf("StandardEncoding[0x27] = %q, want quoteright (’)", std[0x27])
+ }
+ if std[0x60] != "‘" {
+ t.Errorf("StandardEncoding[0x60] = %q, want quoteleft (‘)", std[0x60])
+ }
</code_context>
<issue_to_address>
**suggestion (testing):** Extend StandardEncoding tests to cover at least a couple of non-ASCII slots (e.g., bullet, dagger, ligatures) that were silently dropped before.
The current tests only exercise the ASCII range plus the special 0x27/0x60 cases. Since the original bug affected high-range glyphs (bullet, dashes, ligatures, florin, etc.), please add a few targeted checks for representative non-ASCII StandardEncoding entries (e.g., bullet/endash/emdash slots and a ligature) to verify they map to the expected Unicode code points, similar to the WinAnsi high‑byte spot checks.
Suggested implementation:
```golang
// StandardEncoding: identity except the typographic-quote slots.
std := EncodingByName("StandardEncoding")
for c := byte(0x20); c < 0x7f; c++ {
if c == 0x27 || c == 0x60 {
continue
}
if std[c] != string(rune(c)) {
t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
}
}
if std[0x27] != "’" {
t.Errorf("StandardEncoding[0x27] = %q, want quoteright (’)", std[0x27])
}
if std[0x60] != "‘" {
t.Errorf("StandardEncoding[0x60] = %q, want quoteleft (‘)", std[0x60])
}
// StandardEncoding: representative non-ASCII checks for high-byte glyphs.
// These ensure bullet, dashes, and ligatures are correctly mapped.
if std[0x95] != "•" {
t.Errorf("StandardEncoding[0x95] = %q, want bullet (•)", std[0x95])
}
if std[0x96] != "–" {
t.Errorf("StandardEncoding[0x96] = %q, want endash (–)", std[0x96])
}
if std[0x97] != "—" {
t.Errorf("StandardEncoding[0x97] = %q, want emdash (—)", std[0x97])
}
if std[0x9F] != "fi" {
t.Errorf("StandardEncoding[0x9F] = %q, want ligature fi (fi)", std[0x9F])
}
if std[0x9E] != "fl" {
t.Errorf("StandardEncoding[0x9E] = %q, want ligature fl (fl)", std[0x9E])
}
```
The exact byte indices for bullet, dashes, and ligatures in StandardEncoding must match the encoding table used in your implementation (likely in something like `internal/pdf/font_encoding.go`). If your StandardEncoding table uses different indices for `bullet`, `endash`, `emdash`, `fi`, or `fl` glyphs, adjust the `0x95/0x96/0x97/0x9F/0x9E` constants in these tests to the correct byte values so they align with the actual `EncodingByName("StandardEncoding")` mapping.
</issue_to_address>
### Comment 3
<location path="internal/pdf/font_test.go" line_range="103-112" />
<code_context>
}{
{"A", "A"},
{"comma", ","},
- {"fi", "fi"},
+ // "fi" is the AGL ligature glyph (U+FB01), NOT the two-letter
+ // string "f"+"i". This is what pdfminer/pdfplumber return; the
+ // pre-v0.1.1 table missed this and returned "" (then fell back
+ // to a (cid:NNN) placeholder).
+ {"fi", "fi"},
+ {"fl", "fl"},
+ {"quoteleft", "‘"},
+ {"quoteright", "’"},
+ {"quotedblleft", "“"},
+ {"quotedblright", "”"},
+ {"endash", "–"},
+ {"emdash", "—"},
+ {"bullet", "•"},
+ {"florin", "ƒ"},
+ // Compound name (AGL §2): "f_i" decomposes to its parts.
+ {"f_i", "fi"},
+ // Variant suffix is stripped (AGL §2): "A.alt" → "A".
+ {"A.alt", "A"},
{"uni0041", "A"},
{"uni004100420043", "ABC"},
{"u0041", "A"},
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for malformed `uniXXXX`/`uXXXX` glyph names to lock in the failure behavior of the hex recognisers.
The current tests only cover valid `uni`/`u` names. Given the more complex parsing in `AdobeGlyphToUnicode`, please add a few malformed cases—e.g. odd-length `uni` strings, non-hex characters, or too-short `u` strings (like `"uni004"`, `"uni00ZZ"`, `"u0G41"`)—and assert they return `""` to lock in and document the error-handling behavior for bad hex glyph names.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| {"space", 32, 32, 32, 32}, | ||
| {"space", -1, 202, 160, -1}, | ||
| {"space", -1, 202, 173, -1}, |
There was a problem hiding this comment.
issue (bug_risk): Duplicate and conflicting "space" rows in encodingRows look incorrect and likely clobber each other.
There are three entries with glyph name "space": the standard space (32) and two more with Mac=202 and Win=160/173. The later ones will overwrite the first when building the encodings and conflict with the separate nbspace row above. In AGL/PDF encodings these should use distinct glyph names (e.g., nbspace, sfthyphen) and codepoints that match the reference table, to avoid overriding the standard space mapping.
| // StandardEncoding: identity except the typographic-quote slots. | ||
| std := EncodingByName("StandardEncoding") | ||
| for c := byte(0x20); c < 0x7f; c++ { | ||
| if c == 0x27 || c == 0x60 { | ||
| continue | ||
| } | ||
| if std[c] != string(rune(c)) { | ||
| t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c))) | ||
| } | ||
| } |
There was a problem hiding this comment.
suggestion (testing): Extend StandardEncoding tests to cover at least a couple of non-ASCII slots (e.g., bullet, dagger, ligatures) that were silently dropped before.
The current tests only exercise the ASCII range plus the special 0x27/0x60 cases. Since the original bug affected high-range glyphs (bullet, dashes, ligatures, florin, etc.), please add a few targeted checks for representative non-ASCII StandardEncoding entries (e.g., bullet/endash/emdash slots and a ligature) to verify they map to the expected Unicode code points, similar to the WinAnsi high‑byte spot checks.
Suggested implementation:
// StandardEncoding: identity except the typographic-quote slots.
std := EncodingByName("StandardEncoding")
for c := byte(0x20); c < 0x7f; c++ {
if c == 0x27 || c == 0x60 {
continue
}
if std[c] != string(rune(c)) {
t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
}
}
if std[0x27] != "’" {
t.Errorf("StandardEncoding[0x27] = %q, want quoteright (’)", std[0x27])
}
if std[0x60] != "‘" {
t.Errorf("StandardEncoding[0x60] = %q, want quoteleft (‘)", std[0x60])
}
// StandardEncoding: representative non-ASCII checks for high-byte glyphs.
// These ensure bullet, dashes, and ligatures are correctly mapped.
if std[0x95] != "•" {
t.Errorf("StandardEncoding[0x95] = %q, want bullet (•)", std[0x95])
}
if std[0x96] != "–" {
t.Errorf("StandardEncoding[0x96] = %q, want endash (–)", std[0x96])
}
if std[0x97] != "—" {
t.Errorf("StandardEncoding[0x97] = %q, want emdash (—)", std[0x97])
}
if std[0x9F] != "fi" {
t.Errorf("StandardEncoding[0x9F] = %q, want ligature fi (fi)", std[0x9F])
}
if std[0x9E] != "fl" {
t.Errorf("StandardEncoding[0x9E] = %q, want ligature fl (fl)", std[0x9E])
}The exact byte indices for bullet, dashes, and ligatures in StandardEncoding must match the encoding table used in your implementation (likely in something like internal/pdf/font_encoding.go). If your StandardEncoding table uses different indices for bullet, endash, emdash, fi, or fl glyphs, adjust the 0x95/0x96/0x97/0x9F/0x9E constants in these tests to the correct byte values so they align with the actual EncodingByName("StandardEncoding") mapping.
| }{ | ||
| {"A", "A"}, | ||
| {"comma", ","}, | ||
| {"fi", "fi"}, | ||
| // "fi" is the AGL ligature glyph (U+FB01), NOT the two-letter | ||
| // string "f"+"i". This is what pdfminer/pdfplumber return; the | ||
| // pre-v0.1.1 table missed this and returned "" (then fell back | ||
| // to a (cid:NNN) placeholder). | ||
| {"fi", "fi"}, | ||
| {"fl", "fl"}, | ||
| {"quoteleft", "‘"}, |
There was a problem hiding this comment.
suggestion (testing): Consider adding tests for malformed uniXXXX/uXXXX glyph names to lock in the failure behavior of the hex recognisers.
The current tests only cover valid uni/u names. Given the more complex parsing in AdobeGlyphToUnicode, please add a few malformed cases—e.g. odd-length uni strings, non-hex characters, or too-short u strings (like "uni004", "uni00ZZ", "u0G41")—and assert they return "" to lock in and document the error-handling behavior for bad hex glyph names.
Summary
Rewrites the four PDF base encodings (StandardEncoding, WinAnsiEncoding, MacRomanEncoding, PDFDocEncoding) from a single source of truth that mirrors pdfminer.six's
latin_enc.pyand PDF Reference 1.7 Appendix D.2 Table D.2. The previous tables silently dropped ~32 named glyphs per encoding outside printable ASCII — smart quotes, en/em dashes, bullet, florin, dagger marks — on real PDFs that used them without a/ToUnicodemap (PDF/A, SEC filings, most LaTeX docs).Before / after
Given a PDF whose content stream emits StandardEncoding byte
0x91(quoteleft):""(empty, falls back to(cid:145))"‘"Same for
’ “ ” – — • ƒ † ‡ fi fletc. Across a typical SEC 10-K, this is the difference between dozens of mangled paragraphs and clean text.Spec correctness
StandardEncoding[0x27]isquoteright(U+2019) per the PDF spec, not ASCII'. The previous test asserted ASCII identity over the printable range for all four encodings; that was wrong for StandardEncoding. The test has been updated to reflect the spec-correct behavior.WinAnsi/MacRoman/PDFDocstill hold ASCII'and`at 0x27 / 0x60, as the spec requires.Test plan
go build ./...go vet ./...go test ./...— all green, including extended spot-checks forquoteleft,quoteright,quotedblleft,quotedblright,endash,emdash,bullet,florin, and WinAnsi high-byte typographic slotsf_i→fi) and variant-suffix stripping (A.alt→A) coveredNote
This is a behavior change for callers depending on the pre-v0.1.1 ASCII-identity behavior of StandardEncoding at 0x27 / 0x60. The new behavior is spec-correct and matches pdfplumber, pdfminer.six, and Ghostscript on the same input.
Summary by Sourcery
Align PDF base encoding tables and glyph name resolution with the PDF 1.7 spec and pdfminer.six so that curly quotes, dashes, ligatures, and other non-ASCII glyphs are correctly decoded instead of being dropped or mis-mapped.
Bug Fixes:
Enhancements:
Documentation: