DOC: CMaps (#811)
MartinThoma committed Apr 24, 2022
1 parent 7541047 commit 6729b80
Showing 3 changed files with 66 additions and 1 deletion.
52 changes: 52 additions & 0 deletions docs/dev/cmaps.md
@@ -0,0 +1,52 @@
# CMaps

To look at the CMap of `crazyones.pdf`, first uncompress the file:

```bash
pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
```

The uncompressed file then contains this CMap:

```text
begincmap
/CMapName /T1Encoding-UTF16 def
/CMapType 2 def
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<1B> <FB00>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
```

## codespacerange

A `codespacerange` declares which byte sequences are valid character codes; it
does not map anything to Unicode by itself. The mapping for an individual code
comes from a `bfchar` entry:

```text
1 beginbfchar
<1B> <FB00>
```

That means that `1B` (hex for 27) maps to the Unicode character [`FB00`](https://unicode-table.com/en/FB00/), the ligature ff (two lowercase f's).
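
The lookup described above can be sketched in a few lines of Python. This is only an illustration and not PyPDF2's own CMap parser; the function name `parse_bfchar` and the `CMAP_TEXT` constant are made up for this example, and the sketch assumes that every target is a UTF-16BE hex string, as the CMap name `T1Encoding-UTF16` suggests.

```python
import re

# Excerpt of the CMap shown above (hypothetical constant for this sketch).
CMAP_TEXT = """\
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<1B> <FB00>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict:
    """Return {character code: Unicode string} for all bfchar entries."""
    mapping = {}
    # Each bfchar section sits between "beginbfchar" and "endbfchar".
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, flags=re.S):
        # Every entry is a pair of hex strings: <source code> <target>.
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Assumption: the target is encoded as UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


mapping = parse_bfchar(CMAP_TEXT)
print(mapping[0x1B] == "\ufb00")  # the ff ligature
```

Keying the dictionary by the integer code keeps the lookup independent of how the code was written in the content stream (hex, octal escape, or raw byte).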

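The two values in `begincodespacerange`, `<00>` and `<FF>`, are the lowest and
highest valid character code: every code is one byte long and lies between 0
and FF (dec: 255). They are not offsets, so `1B ➜ FB00` follows from the
`bfchar` entry alone; a contiguous run of targets such as FB00 to
[FBFF](https://www.compart.com/de/unicode/U+FBFF) would be written as a
`bfrange` entry instead.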
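
As a rough sketch of how a reader might use the `codespacerange` block, assuming it is read the way the PDF specification describes it (as the set of valid input codes), the following Python checks whether a code falls inside the declared range. The names `parse_codespaceranges` and `is_valid_code` are hypothetical, and comparing whole codes as integers is a simplification that is only adequate for one-byte ranges like the one here.

```python
import re

# The codespacerange block from the CMap above (hypothetical constant).
CODESPACE_TEXT = """\
1 begincodespacerange
<00> <FF>
endcodespacerange
"""


def parse_codespaceranges(cmap_text: str) -> list:
    """Return a list of (code length in bytes, lowest code, highest code)."""
    ranges = []
    pattern = r"begincodespacerange(.*?)endcodespacerange"
    for section in re.findall(pattern, cmap_text, flags=re.S):
        for low, high in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Two hex digits per byte give the length of a code in bytes.
            ranges.append((len(low) // 2, int(low, 16), int(high, 16)))
    return ranges


def is_valid_code(code: bytes, ranges: list) -> bool:
    """Simplified check: same length and integer value within the range."""
    value = int.from_bytes(code, "big")
    return any(len(code) == n and low <= value <= high for n, low, high in ranges)


ranges = parse_codespaceranges(CODESPACE_TEXT)
print(is_valid_code(b"\x1b", ranges))      # one-byte code inside <00>..<FF>
print(is_valid_code(b"\x00\x1b", ranges))  # two-byte code, not covered
```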

Within the text stream, there is

```text
(The)-342(mis\034ts.)
```

`\034` is octal for 28 decimal (hex `1C`).
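
To make the octal arithmetic concrete, here is a minimal Python sketch that turns `\ddd` escapes in a literal string into bytes. The helper `decode_octal_escapes` is made up for this example, and it deliberately ignores the other escape sequences of PDF literal strings (such as `\n`, `\(` or `\\`), so it is not a complete string parser.

```python
import re


def decode_octal_escapes(literal: str) -> bytes:
    """Replace \\ddd octal escapes (1 to 3 digits) with the byte they denote."""

    def to_char(match):
        # int(..., 8) interprets the digits as octal; keep only the low byte.
        return chr(int(match.group(1), 8) & 0xFF)

    decoded = re.sub(r"\\([0-7]{1,3})", to_char, literal)
    # latin-1 maps code points 0..255 one-to-one onto bytes.
    return decoded.encode("latin-1")


raw = decode_octal_escapes(r"mis\034ts.")
print(raw[3])       # 28
print(hex(raw[3]))  # 0x1c
```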
14 changes: 13 additions & 1 deletion docs/dev/pdf-format.md
@@ -84,7 +84,7 @@ startxref 1234

Let's go through it:

-* `trailer <<` indicates that the *trailer dictionary` starts. It ends with `>>`.
+* `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`.
* `startxref` is a keyword followed by the byte-location of the `xref` keyword.
As the trailer is always at the bottom of the file, this allows readers to
quickly find the xref table.
@@ -99,3 +99,15 @@ Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required
* `R` is the keyword that indicates that the object is a reference to the
catalog dictionary.
* `/Size` (integer) contains the total number of entries in the file's xref table.
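
The `startxref` lookup described in the list above could be sketched as follows. This is not how PyPDF2 itself is implemented; `find_startxref` is a hypothetical helper and the `tail` bytes are a made-up file ending that reuses the offset `1234` from the example.

```python
import re


def find_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset that follows the last `startxref` keyword."""
    # Search from the end, since the trailer sits at the bottom of the file.
    pos = pdf_bytes.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", pdf_bytes[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


# Made-up file ending, for illustration only.
tail = b"trailer\n<< /Size 5 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
offset = find_startxref(tail)
print(offset)  # 1234
```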


## Reading PDF files

Most PDF files are compressed. If you want to read them, first uncompress them:

```bash
pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
```

Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
your favorite IDE / text editor.
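
As an alternative to opening the file in an editor, the CMap sections can also be pulled out with a short script. The sketch below assumes the file has already been uncompressed as shown above; `extract_cmaps` is a hypothetical helper and `sample` is a made-up stand-in for the file contents.

```python
import re


def extract_cmaps(pdf_bytes: bytes) -> list:
    """Return every `begincmap ... endcmap` section of an uncompressed PDF."""
    # latin-1 maps each byte to one character, so decoding never fails.
    text = pdf_bytes.decode("latin-1")
    return re.findall(r"begincmap.*?endcmap", text, flags=re.S)


# Made-up stand-in for the file contents; with the real file one would use
# pathlib.Path("crazyones-uncomp.pdf").read_bytes() instead.
sample = b"stream\nbegincmap\n1 beginbfchar\n<1B> <FB00>\nendbfchar\nendcmap\nendstream\n"
cmaps = extract_cmaps(sample)
print(cmaps[0])
```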
1 change: 1 addition & 0 deletions docs/index.rst
@@ -54,6 +54,7 @@ You can contribute to `PyPDF2 on Github <https://github.com/py-pdf/PyPDF2>`_.

dev/intro
dev/pdf-format
dev/cmaps

.. toctree::
:caption: About PyPDF2