Skip to content
Nicolas Canceill edited this page Nov 10, 2013 · 6 revisions

Welcome to the pdf_hide wiki! This page explains how the algorithm works.

You should begin by checking the quick start guide, and maybe learning more about the project.

The PDF HIDE algorithm

Pre-requisites

PDF

This algorithm is based on PDF (Portable Document Format), a platform-independent file format used to represent documents. Text and images inside PDF files are displayed in the same way on every platform.

A PDF document consists of a collection of objects that determines the output and functionality of the document. The way objects are displayed is controlled by specific commands inside the object, called operators.

One of the most used objects is the stream object. For example, paragraphs of text are contained in a stream object. The PDF specification defines numerous operators for controlling how text is displayed.

Displaying text in PDF

There are various programs capable of producing a PDF output, and most of them have the ability to justify text: ie have text aligned on both sides. Variable space between characters is also often used to create a better-looking output — within the field of typography, this concept is known as kerning.

In such cases, those programs usually make a heavy usage of a specific text operator, so that the justified/kerned text renders properly: the TJ operator.

The TJ operator

The TJ operator is used to display text strings in a PDF file. It contains an array of strings and numbers which respectively consists of the characters and the space values that are used between these characters.

The characters are displayed in the same way as when the normal text operator Tj is used. However, each TJ space value is subtracted from the current text position, which shifts the corresponding string to the left by that value (or to the right, in case of a negative number).

The TJ space values are expressed in scaled text space units. The default unit is 1/1000 of an em. An em is a unit relative to the specified font size. For example, 1 em with a font size of 12 pt is equal to 12 pt.

The TJ operator is virtually used in almost every PDF file containing paragraphs of text. Each line of text is represented by one TJ operator. Each TJ operator contains one or more space values. If a text is justified, which means that it is both aligned with the left and right margin, the TJ operator is used more often to introduce variable spacing between words and characters to meet the justification rules.

Principle

The basic method is simple:

  • embed: the program selects some TJ operators from the file, and changes the values in order to embed data;
  • extract: the program finds the proper TJ operators in the file, reads the values, and extracts the data.

Parameters

The embedding algorithm takes three entry parameters:

  • the PDF file to use (any valid PDF file)
  • the data to embed (any binary data)
  • the key (any UTF-8 string)

The extracting algorithm takes two entry parameters:

  • the PDF file to use
  • the key

The algorithm can be tuned by several parameters:

  • the bit depth n, between 1 and 8, defaults to 4
  • the redundancy r, strictly between 0 and 1, defaults to 0.1

Start/End codes

The data needs to be extractable by the recipient of the PDF file, but it will probably not span over all the available TJ operators in the file. As a result, the embedding process needs to take care of writing a specific end code so the data an be found. A start code is also written in order to embed a checksum of the data.

Implementation details

Pseudo-randomness

Pseudo-randomness is implemented using logistic maps of the type x -> µ•x•(1-x) with µ in ]3.57,4[.

Embedding

The algorithm encodes the data into a list of n-bit integers. It appends 20 integers at the beginning, containing a checksum of the data. It also appends 20 integers at the end, containing a checksum of the key.

Together, they constitue the list of integers to embed. The checksum of the key will work as an end code for extracting, while the checksum of the data itself will be used to check integrity after decoding.

The algorithm initializes a pseudo-random number generator with the key as seed.

Then, the algorithm parses the PDF file for TJ operators. At each of them, it proceeds in several steps:

  • It checks if the absolute value is in the range [1,2^n]
  • It generates a random number in ]0,1[ and checks if it is greater than r
  • It checks if there still is any integer to embed.

If all the above conditions are satisfied, the algorithm embeds the next integer in the list as the new value. Otherwise, in the case of the [1,2^n] range, a random integer in the same range is embedded — otherwise, the original value is left untouched. In any case, the sign of the original value is left untouched, only the absolute value may change.

When the algorithm has gone through all the TJ operators, it checks if all data was successfully embedded, and in that case it writes the new PDF file.

Extracting

The algorithm encodes the checksum of the key into a list of n-bit integers. This is the end code.

The algorithm initializes a pseudo-random number generator with the key as seed.

Then, the algorithm parses the PDF file for TJ operators. At each of them, it proceeds in several steps:

  • It checks if the absolute value is in the range [1,2^n]
  • It generates a random number in ]0,1[ and checks if it is greater than r

If all the above conditions are satisfied, the algorithm records the absolute value of the operator in a list.

When the algorithm has gone through all the TJ operators, it goes through the list of extracted integers and looks for the end code. When it is found, the algorithm discards all subsequent integers in the list.

Finally, the algorithm decodes the list of integers into binary data — excluding the 20 integers at the beginning and at the end. It decodes the first 20 integers, and compares them with the checksum of the binary data extracted.

Improvements

The improvements enable several features.

Pseudo-randomness

The improvements use Python's inner implementation of the Mersenne Twister pseudo-random number generator, instead of the tuned logistic maps of the default algorithm.

This is a choice for efficiency: it is a fast, fully-deterministic generator, providing high entropy at a low cost.

Low bits

If n is an integer, then the representation of any integer x in base n has several digits iff x is n or larger. In that case each digit is a coefficient that applies to a distinct power of n (starting with 1, then n, then n square, etc.), and some of these digits will be more meaningful than others depending on which power of n they apply to.

In the decimal system, n is 10, and the number 222 has three digits which apply, one to 100, one to 10, one to 1. As a result, the leftmost digit 2 is more meaningful than the rightmost 2: if you shift the former, 222 becomes 122 or 322, while if you shift the latter, then 222 becomes 221 or 223, which is a less meaningful change.

In the binary system, digits are called bits, and the least meaningful bits of a number are called low bits. Modifying the k lower bits of a number will shift it by less than 2 to the power of k.

Instead of only using the [1,2^n], the improvements use all TJ operators (except for the unusual case of a value equal to zero), and embeds the integers in the n lower bits of the value.

Custom range

For a specific type of PDF files (eg created by a specific suite of software), there may be a specific range of values where embedding will be even more unnoticeable. In that case, the improvements make it possible to only use TJ values from that range.

In that case, embedded integers are shifted by 1 — ie in the range [0,2^n-1]. The custom range will have to be aligned on multiples of 2^n, which may restrict some values.

Currently, there is one specific range implemented:

  • for files created with pdflatex, values in [-250,-450] are uniformly distributed, except for -333 and -334

Roadmap

Future improvements include:

  • Random start position
  • Preserving the line length (and thus the alignment)

Distributed under GPLv3. Copyright © 2013 Nicolas Canceill.