DOC: The PDF Format + commit prefixes (#810)
MartinThoma committed Apr 24, 2022
1 parent b3247e8 commit 7541047
Showing 3 changed files with 154 additions and 1 deletion.
51 changes: 51 additions & 0 deletions docs/dev/intro.md
@@ -9,6 +9,57 @@ the users, but for people who want to work on PyPDF2 itself.
pip install -r requirements/dev.txt
```

## Running Tests

```
pytest .
```

## Tools: git and pre-commit

Git is a command line application for version control. If you don't know it,
you can [play ohmygit](https://ohmygit.org/) to learn it.

GitHub is the service where the PyPDF2 project is hosted. While git is free and
open source, GitHub is a commercial service owned by Microsoft, although it is
free to use in a lot of cases.

[pre-commit](https://pypi.org/project/pre-commit/) is a command line application
that uses git hooks to automatically execute code. This allows you to avoid
style issues and other code quality issues. After you have run `pre-commit install`
once in your local copy of PyPDF2, it is executed automatically whenever
you `git commit`.

## Commit Messages

Having a clean commit message helps people to quickly understand what the commit
was about, without actually looking at the changes. The first line of the
commit message is used to [auto-generate the CHANGELOG](https://github.com/py-pdf/PyPDF2/blob/main/make_changelog.py). For this reason, the format should be:

```
PREFIX: DESCRIPTION
BODY
```

The `PREFIX` can be:

* `BUG`: A bug was fixed. There are likely one or more related issues. If so,
write `Closes #123` in the `BODY`, where 123 is the issue number on GitHub.
It would be absolutely amazing if you could write a regression test in those
cases, that is, a test that would fail without the fix.
* `ENH`: A new feature! Describe in the body what it can be used for.
* `DEP`: A deprecation - either marking something as "this is going to be removed"
or actually removing it.
* `ROB`: A robustness change. Dealing better with broken PDF files.
* `DOC`: A documentation change.
* `TST`: Adding / adjusting tests.
* `DEV`: Developer experience improvements - e.g. pre-commit or setting up CI
* `MAINT`: Quite a lot of different stuff. Performance improvements are for sure
the most interesting changes in here. Refactorings as well.
* `STY`: A style change. Something that makes PyPDF2 code more consistent.
Typically a small change.
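
The format above can be checked mechanically. The following is a minimal sketch, not part of PyPDF2 or of `make_changelog.py`: the helper name `check_first_line` and the regular expression are assumptions made for illustration, and only the prefix list is taken from this page.

```python
import re

# Prefixes listed above. Everything else in this block is illustrative.
PREFIXES = ("BUG", "ENH", "DEP", "ROB", "DOC", "TST", "DEV", "MAINT", "STY")
FIRST_LINE = re.compile(r"^(?P<prefix>[A-Z]+): (?P<description>\S.*)$")


def check_first_line(message: str) -> bool:
    """Return True if the first line looks like 'PREFIX: DESCRIPTION'."""
    lines = message.splitlines()
    if not lines:
        return False
    match = FIRST_LINE.match(lines[0])
    return match is not None and match.group("prefix") in PREFIXES


print(check_first_line("DOC: The PDF Format + commit prefixes (#810)"))
print(check_first_line("Fixed stuff"))
```

A check like this could be wired into a local git hook, but that wiring is left out here on purpose.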

## Benchmarks

We need to keep an eye on performance and thus we have a few benchmarks.
101 changes: 101 additions & 0 deletions docs/dev/pdf-format.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# The PDF Format

It's recommended to look in the PDF specification for details and clarifications.
This is only intended to give a very rough overview of the format.

## Overall Structure

A PDF consists of:

1. Header: Contains the version of the PDF, e.g. `%PDF-1.7`
2. Body: Contains a sequence of indirect objects
3. Cross-reference table (xref): Contains a list of the indirect objects in the body
4. Trailer
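
As a small illustration of the first item, the version can be read from the first bytes of a file. This is a sketch under the assumption that the header sits at offset 0; `pdf_version` is a hypothetical helper, not a PyPDF2 function.

```python
import re


def pdf_version(data: bytes) -> str:
    """Return the version string from a '%PDF-x.y' header (hypothetical helper)."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("missing %PDF- header")
    return match.group(1).decode("ascii")


version = pdf_version(b"%PDF-1.7\n1 0 obj << >> endobj\n")
```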

## The xref table

A cross-reference table (xref) is a table of the indirect objects in the body.
It allows quick access to those objects by pointing to their location in the file.

It looks like this:

```text
xref 42 5
0000001000 65535 f
0000001234 00000 n
0000001987 00000 n
0000011987 00000 n
0000031987 00000 n
```

Let's go through it step-by-step:

* `xref` is just a keyword that specifies the start of the xref table.
* `42` is the object number of the first entry in this subsection; `5` is the
number of entries in the xref table.
* Every entry has three fields `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
a 5-digit generation number, and a literal keyword which is either `n` or `f`.
* `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
the object is in the file.
* `ggggg` is the generation number. It tells the reader how old the object is.
* `n` means that the object is a normal in-use object, `f` means that the object
is a free object.
* The first free object always has a generation number of 65535. It forms
the head of a linked list of all free objects.
* The generation number of a normal object is usually 0. It only increases
when an object number is reused after the object was freed. The generation
number allows the PDF format to contain multiple versions of the same
object. This is a version history mechanism.
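
The entry layout described above can be sketched in a few lines of Python. `XrefEntry` and `parse_xref_entry` are names invented for this example and are not how PyPDF2 reads the table. The sketch only covers the `nnnnnnnnnn ggggg n` entry lines, not the `xref` header line.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # value of the 10-digit field
    generation: int  # value of the 5-digit generation number
    in_use: bool     # True for "n", False for "f"


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one 'nnnnnnnnnn ggggg n' entry (hypothetical helper)."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"not an xref entry: {line!r}")
    offset, generation, keyword = parts
    if len(offset) != 10 or len(generation) != 5 or keyword not in ("n", "f"):
        raise ValueError(f"not an xref entry: {line!r}")
    return XrefEntry(int(offset), int(generation), keyword == "n")


entries = [
    parse_xref_entry(line)
    for line in ("0000001000 65535 f", "0000001234 00000 n")
]
```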

## The body

The body is a sequence of indirect objects:

`counter generationnumber obj << the_object >> endobj`

* `counter` (integer) is a unique identifier for the object.
* `generationnumber` (integer) is the generation number of the object.
* `obj` marks the start of the object.
* `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
specify which kind of object it is.
* `endobj` marks the end of the object.

A concrete example can be found in `test_reader.py::test_get_images_raw`:

```text
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]
/MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R
/Resources << /Font << >> >>
/Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj
```
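
The object headers in an excerpt like the one above can be picked out with a regular expression. This is only a sketch: the pattern `OBJ_HEADER` is an assumption made for illustration, and a naive scan like this could also match bytes inside stream data, which a real reader has to avoid.

```python
import re

# Matches "<object number> <generation number> obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

body = (
    b"1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj\n"
    b"2 0 obj << >> endobj\n"
    b"5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj\n"
)

# Each header becomes an (object number, generation number) pair.
headers = [(int(num), int(gen)) for num, gen in OBJ_HEADER.findall(body)]
```

Note that references such as `4 0 R` are not matched, because they end in `R` rather than `obj`.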

## The trailer

The trailer looks like this:

```text
trailer << /Root 5 0 R
/Size 6
>>
startxref 1234
%%EOF
```

Let's go through it:

* `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`.
* `startxref` is a keyword followed by the byte-location of the `xref` keyword.
As the trailer is always at the bottom of the file, this allows readers to
quickly find the xref table.
* `%%EOF` is the end-of-file marker.

The trailer dictionary is a key-value list. The keys are specified in
Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).

* `/Root` (dictionary) contains the document catalog.
* The `5` is the object number of the catalog dictionary
* `0` is the generation number of the catalog dictionary
* `R` is the keyword that indicates that the object is a reference to the
catalog dictionary.
* `/Size` (integer) contains the total number of entries in the file's xref table.
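
Because the trailer is at the end of the file, a reader can start from the last bytes. The sketch below shows that idea on the trailer example from this page. `find_startxref` and the 1024-byte window are assumptions made for illustration, not PyPDF2's implementation.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the number that follows the last 'startxref' keyword."""
    tail = data[-1024:]  # assumed window size; the trailer sits at the end
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found")
    return int(matches[-1])


sample = b"trailer << /Root 5 0 R\n/Size 6\n>>\nstartxref 1234\n%%EOF\n"

xref_offset = find_startxref(sample)

# The /Root value is an indirect reference: object number, generation, "R".
root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R\b", sample)
root_ref = (int(root.group(1)), int(root.group(2))) if root else None
```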
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -49,10 +49,11 @@ You can contribute to `PyPDF2 on Github <https://github.com/py-pdf/PyPDF2>`_.
modules/PageRange

.. toctree::
:caption: PyPDF Developers
:caption: Developer Guide
:maxdepth: 1

dev/intro
dev/pdf-format

.. toctree::
:caption: About PyPDF2