Skip to content
This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

Latest commit

 

History

History
416 lines (285 loc) · 15.3 KB

configuration.rst

File metadata and controls

416 lines (285 loc) · 15.3 KB

Configuration

Paperless provides a wide range of customizations. Depending on how you run paperless, these settings have to be defined in different places.

  • If you run paperless on docker, paperless.conf is not used. Rather, configure paperless by copying necessary options to docker-compose.env.
  • If you are running paperless on anything else, paperless will search for the configuration file in these locations and use the first one it finds:

    /path/to/paperless/paperless.conf
    /etc/paperless.conf
    /usr/local/etc/paperless.conf

Required services

PAPERLESS_REDIS=<url>

This is required for processing scheduled tasks such as email fetching, index optimization and for training the automatic document matcher.

Defaults to redis://localhost:6379.

PAPERLESS_DBHOST=<hostname>

By default, sqlite is used as the database backend. This can be changed here. Set PAPERLESS_DBHOST and PostgreSQL will be used instead of mysql.

PAPERLESS_DBPORT=<port>

Adjust port if necessary.

Default is 5432.

PAPERLESS_DBNAME=<name>

Database name in PostgreSQL.

Defaults to "paperless".

PAPERLESS_DBUSER=<name>

Database user in PostgreSQL.

Defaults to "paperless".

PAPERLESS_DBPASS=<password>

Database password for PostgreSQL.

Defaults to "paperless".

Paths and folders

PAPERLESS_CONSUMPTION_DIR=<path>

This where your documents should go to be consumed. Make sure that it exists and that the user running the paperless service can read/write its contents before you start Paperless.

Don't change this when using docker, as it only changes the path within the container. Change the local consumption directory in the docker-compose.yml file instead.

Defaults to "../consume", relative to the "src" directory.

PAPERLESS_DATA_DIR=<path>

This is where paperless stores all its data (search index, SQLite database, classification model, etc).

Defaults to "../data", relative to the "src" directory.

PAPERLESS_MEDIA_ROOT=<path>

This is where your documents and thumbnails are stored.

You can set this and PAPERLESS_DATA_DIR to the same folder to have paperless store all its data within the same volume.

Defaults to "../media", relative to the "src" directory.

PAPERLESS_STATICDIR=<path>

Override the default STATIC_ROOT here. This is where all static files created using "collectstatic" manager command are stored.

Unless you're doing something fancy, there is no need to override this.

Defaults to "../static", relative to the "src" directory.

PAPERLESS_FILENAME_FORMAT=<format>

Changes the filenames paperless uses to store documents in the media directory. See advanced-file_name_handling for details.

Default is none, which disables this feature.

Hosting & Security

PAPERLESS_SECRET_KEY=<key>

Paperless uses this to make session tokens. If you expose paperless on the internet, you need to change this, since the default secret is well known.

Use any sequence of characters. The more, the better. You don't need to remember this. Just face-roll your keyboard.

Default is listed in the file src/paperless/settings.py.

PAPERLESS_ALLOWED_HOSTS<comma-separated-list>

If you're planning on putting Paperless on the open internet, then you really should set this value to the domain name you're using. Failing to do so leaves you open to HTTP host header attacks: https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation

Just remember that this is a comma-separated list, so "example.com" is fine, as is "example.com,www.example.com", but NOT " example.com" or "example.com,"

Defaults to "*", which is all hosts.

PAPERLESS_CORS_ALLOWED_HOSTS<comma-separated-list>

You need to add your servers to the list of allowed hosts that can do CORS calls. Set this to your public domain name.

Defaults to "http://localhost:8000".

PAPERLESS_FORCE_SCRIPT_NAME=<path>

To host paperless under a subpath url like example.com/paperless you set this value to /paperless. No trailing slash!

Note

I don't know if this works in paperless-ng. Probably not.

Defaults to none, which hosts paperless at "/".

PAPERLESS_STATIC_URL=<path>

Override the STATIC_URL here. Unless you're hosting Paperless off a subdomain like /paperless/, you probably don't need to change this.

Defaults to "/static/".

PAPERLESS_AUTO_LOGIN_USERNAME=<username>

Specify a username here so that paperless will automatically perform login with the selected user.

Danger

Do not use this when exposing paperless on the internet. There are no checks in place that would prevent you from doing this.

Defaults to none, which disables this feature.

OCR settings

Paperless uses OCRmyPDF for performing OCR on documents and images. Paperless uses sensible defaults for most settings, but all of them can be configured to your needs.

PAPERLESS_OCR_LANGUAGE=<lang>

Customize the language that paperless will attempt to use when parsing documents.

It should be a 3-letter language code consistent with ISO 639: https://www.loc.gov/standards/iso639-2/php/code_list.php

Set this to the language most of your documents are written in.

This can be a combination of multiple languages such as deu+eng, in which case tesseract will use whatever language matches best. Keep in mind that tesseract uses much more cpu time with multiple languages enabled.

Defaults to "eng".

PAPERLESS_OCR_MODE=<mode>

Tell paperless when and how to perform ocr on your documents. Four modes are available:

  • skip: Paperless skips all pages and will perform ocr only on pages where no text is present. This is the safest option.
  • skip_noarchive: In addition to skip, paperless won't create an archived version of your documents when it finds any text in them. This is useful if you don't want to have two almost-identical versions of your digital documents in the media folder. This is the fastest option.
  • redo: Paperless will OCR all pages of your documents and attempt to replace any existing text layers with new text. This will be useful for documents from scanners that already performed OCR with insufficient results. It will also perform OCR on purely digital documents.

    This option may fail on some documents that have features that cannot be removed, such as forms. In this case, the text from the document is used instead.

  • force: Paperless rasterizes your documents, converting any text into images and puts the OCRed text on top. This works for all documents, however, the resulting document may be significantly larger and text won't appear as sharp when zoomed in.

The default is skip, which only performs OCR when necessary and always creates archived documents.

PAPERLESS_OCR_OUTPUT_TYPE=<type>

Specify the the type of PDF documents that paperless should produce.

  • pdf: Modify the PDF document as little as possible.
  • pdfa: Convert PDF documents into PDF/A-2b documents, which is a subset of the entire PDF specification and meant for storing documents long term.
  • pdfa-1, pdfa-2, pdfa-3 to specify the exact version of PDF/A you wish to use.

If not specified, pdfa is used. Remember that paperless also keeps the original input file as well as the archived version.

PAPERLESS_OCR_PAGES=<num>

Tells paperless to use only the specified amount of pages for OCR. Documents with less than the specified amount of pages get OCR'ed completely.

Specifying 1 here will only use the first page.

When combined with PAPERLESS_OCR_MODE=redo or PAPERLESS_OCR_MODE=force, paperless will not modify any text it finds on excluded pages and copy it verbatim.

Defaults to 0, which disables this feature and always uses all pages.

PAPERLESS_OCR_IMAGE_DPI=<num>

Paperless will OCR any images you put into the system and convert them into PDF documents. This is useful if your scanner produces images. In order to do so, paperless needs to know the DPI of the image. Most images from scanners will have this information embedded and paperless will detect and use that information. In case this fails, it uses this value as a fallback.

Set this to the DPI your scanner produces images at.

Default is none, which causes paperless to fail if no DPI information is present in an image.

PAPERLESS_OCR_USER_ARG=<json>

OCRmyPDF offers many more options. Use this parameter to specify any additional arguments you wish to pass to OCRmyPDF. Since Paperless uses the API of OCRmyPDF, you have to specify these in a format that can be passed to the API. See the API reference of OCRmyPDF for valid parameters. All command line options are supported, but they use underscores instead of dashed.

Caution

Paperless has been tested to work with the OCR options provided above. There are many options that are incompatible with each other, so specifying invalid options may prevent paperless from consuming any documents.

Specify arguments as a JSON dictionary. Keep note of lower case booleans and double quoted parameter names and strings. Examples:

{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}    

Software tweaks

PAPERLESS_TASK_WORKERS=<num>

Paperless does multiple things in the background: Maintain the search index, maintain the automatic matching algorithm, check emails, consume documents, etc. This variable specifies how many things it will do in parallel.

PAPERLESS_THREADS_PER_WORKER=<num>

Furthermore, paperless uses multiple threads when consuming documents to speed up OCR. This variable specifies how many pages paperless will process in parallel on a single document.

Caution

Ensure that the product

PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER

does not exceed your CPU core count or else paperless will be extremely slow. If you want paperless to process many documents in parallel, choose a high worker count. If you want paperless to process very large documents faster, use a higher thread per worker count.

The default is a balance between the two, according to your CPU core count, with a slight favor towards threads per worker, and using as much cores as possible.

If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust PAPERLESS_THREADS_PER_WORKER automatically.

PAPERLESS_TIME_ZONE=<timezone>

Set the time zone here. See https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE for details on how to set it.

Defaults to UTC.

PAPERLESS_CONSUMER_POLLING=<num>

If paperless won't find documents added to your consume folder, it might not be able to automatically detect filesystem changes. In that case, specify a polling interval in seconds here, which will then cause paperless to periodically check your consumption directory for changes.

Defaults to 0, which disables polling and uses filesystem notifications.

PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>

When the consumer detects a duplicate document, it will not touch the original document. This default behavior can be changed here.

Defaults to false.

PAPERLESS_CONSUMER_RECURSIVE=<bool>

Enable recursive watching of the consumption directory. Paperless will then pickup files from files in subdirectories within your consumption directory as well.

Defaults to false.

PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=<bool>

Set the names of subdirectories as tags for consumed files. E.g. <CONSUMPTION_DIR>/foo/bar/file.pdf will add the tags "foo" and "bar" to the consumed file. Paperless will create any tags that don't exist yet.

PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work.

Defaults to false.

PAPERLESS_CONVERT_MEMORY_LIMIT=<num>

On smaller systems, or even in the case of Very Large Documents, the consumer may explode, complaining about how it's "unable to extend pixel cache". In such cases, try setting this to a reasonably low value, like 32. The default is to use whatever is necessary to do everything without writing to disk, and units are in megabytes.

For more information on how to use this value, you should search the web for "MAGICK_MEMORY_LIMIT".

Defaults to 0, which disables the limit.

PAPERLESS_CONVERT_TMPDIR=<path>

Similar to the memory limit, if you've got a small system and your OS mounts /tmp as tmpfs, you should set this to a path that's on a physical disk, like /home/your_user/tmp or something. ImageMagick will use this as scratch space when crunching through very large documents.

For more information on how to use this value, you should search the web for "MAGICK_TMPDIR".

Default is none, which disables the temporary directory.

PAPERLESS_OPTIMIZE_THUMBNAILS=<bool>

Use optipng to optimize thumbnails. This usually reduces the size of thumbnails by about 20%, but uses considerable compute time during consumption.

Defaults to true.

PAPERLESS_POST_CONSUME_SCRIPT=<filename>

After a document is consumed, Paperless can trigger an arbitrary script if you like. This script will be passed a number of arguments for you to work with. For more information, take a look at advanced-post_consume_script.

The default is blank, which means nothing will be executed.

PAPERLESS_FILENAME_DATE_ORDER=<format>

Paperless will check the document text for document date information. Use this setting to enable checking the document filename for date information. The date order can be set to any option as specified in https://dateparser.readthedocs.io/en/latest/settings.html#date-order. The filename will be checked first, and if nothing is found, the document text will be checked as normal.

Defaults to none, which disables this feature.

PAPERLESS_FILENAME_PARSE_TRANSFORMS

Transforms filenames before they are processed by paperless. See advanced-transforming_filenames for details.

Defaults to none, which disables this feature.

Binaries

There are a few external software packages that Paperless expects to find on your system when it starts up. Unless you've done something creative with their installation, you probably won't need to edit any of these. However, if you've installed these programs somewhere where simply typing the name of the program doesn't automatically execute it (ie. the program isn't in your $PATH), then you'll need to specify the literal path for that program.

PAPERLESS_CONVERT_BINARY=<path>

Defaults to "/usr/bin/convert".

PAPERLESS_GS_BINARY=<path>

Defaults to "/usr/bin/gs".

PAPERLESS_OPTIPNG_BINARY=<path>

Defaults to "/usr/bin/optipng".