Architecture Overview

Adam Hooper edited this page Aug 10, 2017 · 1 revision

Service interaction

Here's what you need to know if you're developing Overview or designing your own deployment.

Piece by piece

  • The User: the most important person. Everything is for the user.
  • Postgres: stores user information and document sets -- including document text, tags, metadata fields and user-created notes.
  • BlobStorage: stores "blobs" -- raw file data. This includes uploaded documents (exactly the bytes uploaded), PDF versions of the documents (for viewing), thumbnails, and -- if splitting by page -- the split versions of said documents. This can be configured as S3 or a directory on the filesystem. Postgres contains the information needed to read each blob.
  • Web: responds to the user's requests. A large part of this is JavaScript code. There's a public API (which stays steady and authenticates via an "API token") and our private API (which changes and authenticates via a cookie). This listens on port 9000.
  • Reverse Proxy: Users expect to access your website on port 443 -- or, if you're being insecure, port 80. So you'll need a process that forwards requests on port 443 to the web server on port 9000. Elastic Load Balancer is nice on the cloud; HAProxy is great locally; we also have users who use nginx.
  • Redis: Our web server is very fast, except for one key operation: paginating through lists of millions of documents. We cache each search result as a list of document integer IDs (up to 80MB) on Redis. Fetching a page means asking Redis for the page of IDs, and then fetching the original documents from Postgres.
  • Plugins: Every plugin is a website. Web presents a plugin ("View") in an iframe; the iframe's URL contains an API token, so the plugin can query the Web API.
  • Worker: this is where the document processing happens:
    • Tree Plugin: an architectural relic, still integrated into Overview proper. Runs multi-minute/hour processing jobs.
    • Search Index: a full-text search engine. Each document set's text and metadata are indexed into a Lucene Directory on the worker's filesystem. The engine can search, extract snippets for search-result presentation, and highlight all search matches in a document.
    • File Importer, CSV Importer and DocumentCloud Importer: processing pipelines. Web sends Worker user-provided data (storing it in Postgres along the way), and Worker converts that user-provided data into documents.
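
The Redis bullet above describes a two-step page fetch: ask the cache for one page of document IDs, then load those documents from the database. The following is a minimal sketch of that idea only, not Overview's actual code. `FakeIdCache`, `fetch_page` and the `search:1` key are hypothetical names, and plain in-memory Python objects stand in for Redis and Postgres.

```python
from typing import Dict, List


class FakeIdCache:
    """Hypothetical stand-in for Redis: maps a search key to an ordered list of document IDs."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[int]] = {}

    def store(self, key: str, ids: List[int]) -> None:
        # Cache the full, ordered search result once.
        self._lists[key] = list(ids)

    def page(self, key: str, offset: int, limit: int) -> List[int]:
        # Return only the requested slice of IDs, never the whole list.
        return self._lists[key][offset:offset + limit]


def fetch_page(cache: FakeIdCache, documents: Dict[int, str],
               key: str, offset: int, limit: int) -> List[str]:
    """Fetch one page: IDs from the cache, then documents by ID.

    `documents` is a dict standing in for a Postgres lookup by primary key.
    """
    ids = cache.page(key, offset, limit)
    return [documents[doc_id] for doc_id in ids]


if __name__ == "__main__":
    cache = FakeIdCache()
    cache.store("search:1", [5, 3, 9, 1])
    documents = {1: "a", 3: "b", 5: "c", 9: "d"}
    print(fetch_page(cache, documents, "search:1", 1, 2))
```

The point of the split is that the expensive part (computing and ordering millions of IDs) happens once per search, while each page request touches only a small slice of IDs and a handful of rows.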

How they communicate

  • The User connects to the reverse proxy on port 443 (or, if you're being insecure, port 80).
  • The reverse proxy connects to Web on port 9000. The protocol is HTTP.
  • Each plugin connects to the reverse proxy the same way the user does.
  • Web connects to Worker via akka-remote, on Worker's port 9030.
  • Worker responds to Web via akka-remote as well, on Web's port 9031. (Worker never initiates any communication, but Akka is built for peer-to-peer connections, so it's simplest to just open the port.)
  • Web and Worker connect to Postgres on port 9010 (in development/Docker) or port 5432 (in production).
  • Web and Worker connect to BlobStorage via a shared filesystem (in development/Docker) or through HTTP requests to S3.
  • Web connects to Redis on port 9020 (in development/Docker) or port 6379 (in production).
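
When designing a deployment, it can help to write the ports listed above down as data. The sketch below does that in Python. The port numbers come from the list above; the service keys (`web_http`, `worker_akka`, `web_akka`), the environment names and the `port_for` helper are hypothetical and do not correspond to any Overview configuration file.

```python
from typing import Dict

# Ports that the list above gives without an environment qualifier.
FIXED_PORTS: Dict[str, int] = {
    "web_http": 9000,     # reverse proxy -> Web, HTTP
    "worker_akka": 9030,  # Web -> Worker, akka-remote
    "web_akka": 9031,     # Worker's replies -> Web, akka-remote
}

# Ports that differ between development/Docker and production.
ENV_PORTS: Dict[str, Dict[str, int]] = {
    "development": {"postgres": 9010, "redis": 9020},
    "production": {"postgres": 5432, "redis": 6379},
}


def port_for(service: str, environment: str) -> int:
    """Look up the port a service listens on in the given environment.

    Raises KeyError for an unknown service or environment.
    """
    if service in FIXED_PORTS:
        return FIXED_PORTS[service]
    return ENV_PORTS[environment][service]


if __name__ == "__main__":
    for env in ("development", "production"):
        print(env, port_for("postgres", env), port_for("redis", env))
```

A table like this doubles as a firewall checklist: only the reverse proxy's port needs to be reachable by users, while the rest only need to be reachable between the services themselves.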