
Document Lists and Where They're Stored

Adam Hooper edited this page Feb 16, 2018 · 27 revisions

Overview stores and searches documents. If you're hacking Overview or a plugin, you'll want to know how to access them.

First: what's a document?

From Overview's perspective, a document consists of:

  • Raw data the user uploaded -- a .doc or .pdf or .csv file, for instance. We call this the "uploaded file". One uploaded file can spawn multiple documents if, for instance, the user opted to generate one document per page.
  • Rendered data for display -- a specially-generated PDF. We call this the "file view." Don't confuse this with the uploaded file, even if the uploaded file was a PDF. The file view can have embedded OCR data, for instance.
  • Thumbnail -- generated from the file view.
  • Extracted text.
  • User data: "metadata" and "PDF notes". (Overview renders "PDF Notes" atop the file view.) The user can overwrite these fields.
  • A 64-bit integer ID. The most-significant 32 bits are the document-set ID. The lower 32 bits are in the range [0,N), where N is the number of documents in the document set.
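
The ID layout above can be sketched in a few lines. This is illustrative Python, not Overview's actual code, and the helper names are made up:

```python
# Illustrative helpers (not Overview's code) for the 64-bit ID layout
# described above: upper 32 bits = document-set ID, lower 32 bits = the
# document's index within its set.

def make_document_id(document_set_id: int, index: int) -> int:
    """Combine a document-set ID and an in-set index into one 64-bit ID."""
    return (document_set_id << 32) | index

def split_document_id(document_id: int) -> tuple[int, int]:
    """Recover (document_set_id, index) from a 64-bit document ID."""
    return document_id >> 32, document_id & 0xFFFFFFFF

assert split_document_id(make_document_id(123, 4)) == (123, 4)
```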

File contents, file views and thumbnails are stored in "blob storage". Blob storage is an abstraction over services like Amazon S3: it's scalable and does not require a filesystem.

Text, metadata and PDF notes are stored in Postgres. Overview also stores values derived from them in document-set-specific Lucene indexes, to allow advanced searches.

Many computers share document data: web servers (which act on behalf of users), workers (which create documents in the first place), Lucene, Postgres, plugins -- and, of course, the end user.

Overview transmits documents using five broad patterns:

  • Single-document data: Overview transmits large quantities of data about a single document, without overwhelming memory usage.
  • Pages of document data: Overview transmits document text and user data in reasonably-sized batches.
  • Document ID Lists: Overview transmits an ordered list of document IDs which can be looked up later. (The canonical example is a search result.)
  • Document ID Sets: Overview transmits an unordered set of document IDs used while generating search results.
  • Selection IDs: Overview transmits a handle to clients so they can paginate through search results.

Single-document data

For: user requests (when downloading or viewing a document); worker (when generating documents).

Sometimes, the user requests exactly one document. That makes things easy: we can just stream all the data.

Canonical example: when viewing a document in the PDF viewer, the file view is streamed to the client in one request.

Maximum size:

  • uploaded files can be hundreds of megabytes large
  • file views can also be hundreds of megabytes large
  • extracted text can be at most triple max_n_chars_per_document: around 1.5MB at the time of writing
  • user data is limited by the amount we let users upload. Right now we stick with our framework default, 100KB for metadata and 100KB for PDF notes.

Maximum size in memory: uploaded files and file views must always be streamed. Their footprint in memory is the size of the buffer used during streaming -- maybe a few hundred kilobytes. Text and user data won't add up to more than 2MB. (Even perverse metadata JSON like {"foo":[[[[[[[[[[...]]]]]]]]]]} can only consume tens of thousands of Objects.)

Pages of document data: 0-50 documents

For: showing the user a list of documents; responding to an API request.

Canonical examples: Overview's document list, or the API endpoint /api/v1/document-sets/{documentSetId}/documents

(Uploaded file views and file contents aren't transmitted in pages: they're only transmitted one-at-a-time.)

Overview's most prominent component is its "document list": the search results. We only fetch the handful of documents the user can currently see.

Maximum size in memory: an API request for all fields, including text, can consume 2MB per document. On Overview's front-end, text isn't included and we only show 20 documents: a user who uploads perverse data may achieve a maximum of 4MB. More realistically, with 5KB user data per document, the page costs 100KB (uncompressed).

Internally, Overview streams documents to itself (for search-indexing, for instance) and to API clients by paginating. It creates a DocumentIdList and then transmits a page of documents at a time. Overview's internal components limit their memory consumption by assuming 1 document <= 2MB and specifying the page size that makes sense for the task.

Of course, to know which documents to fetch, we need document IDs. And so we have...:

DocumentIdList: 0-10M document IDs, in order

For: supporting streaming documents and loading pages of documents.

Example: These lists are stored server-side. We store lists of every document in a document set in Postgres; we store search results in Redis. You can fetch search-result IDs by passing ?fields=id to the API.

Picture it: The user searches and finds 200k documents. Overview downloads the first page of results. Now, how does the user load the second page?

We can't simply re-search and return page two of the new search results. What if the user searches by tag, then alters documents' tags while paginating through search results? Subsequent pages would be based on a search that happened at a different time -- and thus had different results. (In practice: documents would never appear in the document list.) The upshot: a search result is specific to the time the search was run.

So we need a structure to store a "snapshot" of a list of documents. The obvious answer: store the IDs.

To be specific: a DocumentIdList stores the lower 32 bits of each document's ID. (The upper 32 bits of a document ID are the document-set ID.)

Maximum size in memory: (4b / document) * (10M documents) = 40MB

This is an important number: a search result costs at most 40MB. This is too large to pass to the end user. It's quite large even on an intranet: it can take ~0.5s to pass it from one computer to another.

A search result is a DocumentIdList. We cache it in Redis after every search. When the user tries to load page N of search-result documents, we just ask Redis for the IDs stored in bytes 4*N*pageSize .. 4*(N+1)*pageSize. That's tiny.
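
That byte-range lookup might look like the following sketch. This is illustrative Python, not Overview's code; it assumes each lower-32-bit ID is stored as a 4-byte big-endian integer, and a plain bytes slice stands in for a call like Redis GETRANGE:

```python
import struct

def page_byte_range(page: int, page_size: int) -> tuple[int, int]:
    """Page N covers bytes [4 * N * pageSize, 4 * (N + 1) * pageSize)."""
    return 4 * page * page_size, 4 * (page + 1) * page_size

def decode_ids(blob: bytes) -> list[int]:
    """Decode 4-byte big-endian document IDs from a DocumentIdList slice."""
    return [struct.unpack(">I", blob[i:i + 4])[0] for i in range(0, len(blob), 4)]

# A fake cached DocumentIdList of 10 document IDs:
cached = b"".join(struct.pack(">I", doc_id) for doc_id in range(100, 110))
start, end = page_byte_range(page=1, page_size=3)   # second page, 3 IDs per page
assert decode_ids(cached[start:end]) == [103, 104, 105]
```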

The document_set and document_id_list tables in Postgres store pre-sorted DocumentIdLists, for reasons we'll soon see.

DocumentIdSet: 0-10M document IDs, for when ordering doesn't matter

For: joining document sets; filtering document lists

Examples: The ?documentIdsBitSetBase64 API option lets you specify one; plugins can serve them via ViewFilter, too. Mostly, they're used internally.

DocumentIdLists are bad at solving these problems:

  1. During search, we receive document IDs from Lucene, from other plugins, and from Postgres. How do we intersect those results, so we have a single set of document IDs? If those search results are all Arrays, we seem doomed to spend several seconds at 100% CPU, just merging results.
  2. How do we sort the final DocumentIdList, given that [sorting documents can take minutes](https://github.com/overview/overview-server/wiki/Search-Engine-Internals)?

The answer involves document ID sets: unordered lists. We can make them compact, using a special property of Overview's document IDs.

Within a given DocumentSet, document IDs start at (documentSetId << 32L) and are assigned incrementally. That means documentId.toInt (or documentId & 0xffffffff) of the first document in any document set is 0; the second document in any document set has 1; and so on. If we allow a maximum of 10M documents, the maximum 32-bit ID is 9,999,999.

A document ID set tells us, for each document ID, whether the document ID is included or excluded. That's one boolean per document.

DocumentIdSet is a Bitset. For a 10M-document document set, the DocumentIdSet is an array of exactly 10,000,000 boolean values: one bit (not byte) per document ID.

For example: here's the in-memory representation of a 20-document DocumentIdSet. It costs 3 bytes:

 byte 1   byte 2   byte 3
01011011 00111010 01100000
│  │        │        │──┬─
│  │        │        │  │
│  │        │        │  └ the last bits of the last byte are unused
│  │        │        │
│  │        │        └ document 19: excluded (bit is unset)
│  │        │
│  │        └ document 11: included (bit is set)
│  │
│  └ document 3: included (bit is set)
│
└ document 0: excluded (bit is unset)

... total list of included document IDs:
        1, 3, 4, 6, 7; 10, 11, 12, 14; 17, 18
        (11 document IDs)
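
A decoder for this transmission format fits in a few lines. Here's a Python sketch (the function name is made up) that reproduces the example above:

```python
def decode_bitset(data: bytes, n_documents: int) -> list[int]:
    """Return included document IDs from an MSB-first, big-endian bitset.

    Bit i (counting from the most-significant bit of the first byte)
    represents document i: set = included, unset = excluded.
    """
    return [
        i for i in range(n_documents)
        if data[i // 8] & (0x80 >> (i % 8))
    ]

example = bytes([0b01011011, 0b00111010, 0b01100000])
assert decode_bitset(example, 20) == [1, 3, 4, 6, 7, 10, 11, 12, 14, 17, 18]
```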

Maximum size in memory: (1 bit/document) * (10M documents) * (1 byte / 8 bits) = 1.25MB.

A DocumentIdSet, maxing out at 1.25MB, is suitable for passing between datacenter computers. Overview uses DocumentIdSets extensively:

  • Overview's API allows callers to search for documents by ID bitset.
  • Lucene, plugins with ViewFilters, and Postgres all return search results as DocumentIdSets. Overview intersects them instantaneously during search.
  • To sort search results, Overview grabs a pre-sorted DocumentIdList for all documents in the document set. It filters the DocumentIdList, only including documents in the DocumentIdSet. (This is fast: O(n).) The bottleneck is in transferring DocumentIdLists: max ~0.5s to read the complete DocumentIdList from Postgres, max ~0.5s to write the filtered DocumentIdList to Redis. (See Search Engine Internals for a thorough explanation.)
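
The filtering step in that last bullet can be sketched like this (illustrative Python, not Overview's code; the bitset uses the transmission format described above, one MSB-first bit per lower-32-bit document ID):

```python
def filter_sorted_ids(sorted_ids: list[int], bitset: bytes) -> list[int]:
    """One pass over a pre-sorted DocumentIdList, keeping only IDs whose
    bit is set in the DocumentIdSet. O(n) in the document-set size."""
    return [
        doc_id for doc_id in sorted_ids
        if bitset[doc_id // 8] & (0x80 >> (doc_id % 8))
    ]

# All 8 documents pre-sorted by some key (say, title), and a search
# result covering documents 1, 3 and 6:
presorted = [3, 0, 6, 1, 5, 2, 4, 7]
result_set = bytes([0b01010010])
assert filter_sorted_ids(presorted, result_set) == [3, 6, 1]
```

The output keeps the pre-sorted order, which is the whole point: sorting happened once, ahead of time, not per-search.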

DocumentIdSet and DocumentIdList let us search 10M documents in 1-2 seconds. We cache the resulting DocumentIdLists in Redis; after that, pagination and streaming take milliseconds of overhead.

Implementation note: most bitset libraries are written for in-memory calculations, not for transmission. Bitset libraries typically optimize in two ways: they store data in machine-word sized integers (32- or 64-bit) instead of bytes (8-bit); and they store the bit representing the first document as the last bit of the word rather than the first. These speedups impede transmission because there's no standard. Overview stores Java's optimized bitsets within its components. But when Overview communicates with plugins and end-users, it transmits and receives bitsets as illustrated above: big-endian, most-significant bit first. If you're implementing a plugin, you may need to massage your bitset library's data into this standard format. (Here's how Overview converts from transmission format to internal format, for example.)
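
As an example of that massaging: if your bitset library hands you 64-bit words with the first document in each word's least-significant bit (as Java's BitSet.toLongArray() does), converting to the transmission format means emitting each word's bytes least-significant byte first while reversing the bit order within each byte. A Python sketch, with a made-up function name:

```python
def words_to_transmission(words: list[int]) -> bytes:
    """Convert 64-bit little-endian-bit words (bit 0 of word 0 = first
    document) into MSB-first transmission bytes (bit 0 = first document)."""
    out = bytearray()
    for word in words:
        for byte_index in range(8):            # least-significant byte first
            byte = (word >> (8 * byte_index)) & 0xFF
            # Reverse the bit order within the byte: bit 0 becomes the MSB.
            out.append(int(f"{byte:08b}"[::-1], 2))
    return bytes(out)

# Documents {1,3,4,6,7,10,11,12,14,17,18} -- the 20-document example above:
word = sum(1 << i for i in [1, 3, 4, 6, 7, 10, 11, 12, 14, 17, 18])
assert words_to_transmission([word])[:3] == bytes([0b01011011, 0b00111010, 0b01100000])
```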

Selection IDs: 16-byte handles to cached DocumentIdLists

For: transmitting a handle to a document list over the Internet.

Example: The API's selectionId parameter lets you paginate through search results; Overview's web UI uses it, too. Overview's server responds to every search with a selectionId, and the client can use that selectionId to fetch subsequent pages.

Even 1.25MB is too large to pass in every HTTP request and response. So when Overview caches a DocumentIdList in Redis, it assigns a 128-bit UUID and passes that over the wire.
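
The pattern looks roughly like this sketch (illustrative Python; a plain dict stands in for Redis, and the function name is made up):

```python
import uuid

cache: dict[bytes, bytes] = {}   # stand-in for Redis

def cache_selection(document_id_list: bytes) -> bytes:
    """Cache a DocumentIdList server-side; return a 16-byte handle."""
    selection_id = uuid.uuid4().bytes    # 128 bits: tiny enough for every request
    cache[selection_id] = document_id_list   # up to ~40MB stays server-side
    return selection_id

selection_id = cache_selection(b"\x00" * 40)   # a (tiny) stand-in DocumentIdList
assert len(selection_id) == 16                 # only this crosses the wire
assert cache[selection_id] == b"\x00" * 40
```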

When new DocumentIdLists are created, Redis evicts the least-recently-used ones. So a selection ID will work for minutes, or maybe hours ... but not days. Once evicted, the ID will never be useful again.

Maximum size in memory: 16 bytes.
