Architecture Overview

Service interaction

Here's what you need to know if you're developing Overview or designing your own deployment.

Piece by piece

The User: the most important person. Everything is for the user.
Postgres: stores user information and document sets -- including document text, tags, metadata fields and user-created notes.
BlobStorage: stores "blobs" -- raw file data. This includes uploaded documents (exactly the bytes uploaded), PDF versions of the documents (for viewing), thumbnails, and -- if splitting by page -- the split versions of said documents. BlobStorage can be configured as S3 or as a directory on the filesystem. Postgres contains the information needed to read each blob.
Web: responds to the user's requests. A large part of this is JavaScript code. There's a public API (which stays steady and authenticates via an "API token") and our private API (which changes and authenticates via a cookie). Web listens on port 9000.
Reverse Proxy: Users expect to access your website on port 443 -- or, if you're being insecure, port 80. So you'll need a process that forwards requests on port 443 to the web server on port 9000. Elastic Load Balancer is nice on the cloud; HAProxy is great locally; we also have users who use nginx.
Redis: Our web server is very fast, except for one key operation: paginating through lists of millions of documents. We cache each search result as a list of integer document IDs (up to 80MB) on Redis. Fetching a page means asking Redis for that page of IDs, and then fetching the corresponding documents from Postgres.
Plugins: Every plugin is a website. Web presents a plugin ("View") in an iframe; the iframe's URL contains an API token, so the plugin can query the Web API.
Worker: this is where the document-processing happens:
- Tree Plugin: an architectural relic, still integrated into Overview proper. Runs multi-minute/hour processing jobs.
- Search Index: a full-text search engine. Each document set's text and metadata are indexed into a Lucene Directory on the worker's filesystem. The engine can search, extract snippets for search-result presentation, and highlight all search matches in a document.
- File Importer, CSV Importer and DocumentCloud Importer: processing pipelines. Web sends Worker user-provided data (storing it in Postgres along the way), and Worker converts that user-provided data into documents.
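The Redis pagination described above can be sketched in a few lines. This is a minimal illustration, not Overview's actual code: the byte encoding (64-bit big-endian integers), the cache key and the function names are all assumptions, and plain dictionaries stand in for Redis and Postgres so the example runs on its own.

```python
import struct

ID_SIZE = 8  # assumption: each document ID is packed as a 64-bit big-endian integer


def pack_ids(ids):
    """Serialize a search result (a list of integer document IDs) into bytes."""
    return struct.pack(f">{len(ids)}q", *ids)


def page_of_ids(blob, offset, limit):
    """Slice one page of IDs out of the cached bytes, as a byte-range read would."""
    chunk = blob[offset * ID_SIZE:(offset + limit) * ID_SIZE]
    return list(struct.unpack(f">{len(chunk) // ID_SIZE}q", chunk))


def fetch_page(cache, documents, search_key, offset, limit):
    """Return one page of documents: IDs from the cache, rows from the document store."""
    ids = page_of_ids(cache[search_key], offset, limit)
    return [documents[doc_id] for doc_id in ids]


# In-memory stand-ins: `cache` plays the role of Redis, `documents` the role of Postgres.
documents = {i: {"id": i, "title": f"Document {i}"} for i in range(1, 101)}
cache = {"search:example": pack_ids([i for i in range(1, 101) if i % 3 == 0])}

# Fetch five documents starting at position 10 of the cached search result.
page = fetch_page(cache, documents, "search:example", offset=10, limit=5)
```

The point of the design is that the expensive part (running the search) happens once, and each page request only touches a small slice of the cached ID list plus one small lookup in the document store.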

How they communicate

The User connects to the reverse proxy on port 443 (or, if you're being insecure, port 80).
The reverse proxy connects to Web on port 9000. The protocol is HTTP.
Each plugin connects to the reverse proxy the same way the user does.
Web connects to Worker via akka-remote, on Worker's port 9030.
Worker responds to Web via akka-remote as well, on Web's port 9031. (Worker never initiates any communications, but Akka is built for peer-to-peer connections. It's simplest to just open the port.)
Web and Worker connect to Postgres on port 9010 (in development/Docker) or port 5432 (in production).
Web and Worker connect to BlobStorage via a shared filesystem (in development/Docker) or through HTTP requests to S3.
Web connects to Redis on port 9020 (in development/Docker) or port 6379 (production).
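If you're planning firewall rules or a deployment config, it can help to write the ports above down as data. The sketch below does that; the port numbers come from the list above, but the service labels (`web-http`, `worker-akka`, `web-akka`) and the `port_for` helper are made up for illustration and aren't part of Overview.

```python
# Ports that the list above gives without an environment distinction.
FIXED_PORTS = {
    "web-http": 9000,     # reverse proxy -> Web, HTTP
    "worker-akka": 9030,  # Web -> Worker, akka-remote
    "web-akka": 9031,     # Worker -> Web responses, akka-remote
}

# Ports that differ between development/Docker and production.
ENV_PORTS = {
    "postgres": {"development": 9010, "production": 5432},
    "redis": {"development": 9020, "production": 6379},
}


def port_for(service, environment):
    """Look up the port a service listens on in the given environment."""
    if service in FIXED_PORTS:
        return FIXED_PORTS[service]
    return ENV_PORTS[service][environment]


# Example: collect every port a production deployment needs reachable internally.
production_ports = sorted(
    port_for(name, "production") for name in list(FIXED_PORTS) + list(ENV_PORTS)
)
```

Only the reverse proxy's port 443 (or 80) needs to be reachable by users; everything in this table is internal traffic between the services.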
