Why a monolithic back end makes sense for Mwmbl

One of the main advantages of implementing a service using microservices is that you can easily scale up and down the compute and storage capacities for each service as necessary. This can lead to more efficient use of resources. The downside is that there is additional overhead in communication between the services.

In the case of Mwmbl, we plan to process a lot of data. Our medium-term goal is to crawl and index 1 billion pages. We currently have the following components:

  • Distributed crawler clients (Firefox extension)
  • Crawler API to control crawling and collect data for indexing (FastAPI and PostgreSQL on a Hetzner cloud server). This stores the batches of crawled pages in Backblaze (similar to S3).
  • Indexer (currently run manually on a local machine)
  • Search back-end (FastAPI on Google Cloud Run)
  • Search front-end (static vanilla JS web site on GitHub pages)

Our current goal is to make a page that has been crawled by one of our volunteers' Firefox extensions available in search, ideally fairly quickly. To do that, we need to automate the manual indexing step.

Solution 1: Automate the indexing as a new process on its own server(s)

Benefits:

  • The indexing process can be scaled up or down as required
  • Indexing code can be updated independently of crawling and serving search results

Downsides:

  • Adds another thing to deploy, making deployment and maintenance more complex
  • Requires moving a lot of data around

The last point is what originally motivated the idea of a monolithic back end. Specifically:

  • We need to read the crawled batches from Backblaze
  • Once the new index is created, we need to somehow transfer it to the search results server

At the moment, the amount of data being crawled is small enough that transferring it is not costly. Nevertheless, it can still take minutes to sync the data from Backblaze to local storage, and that is with only one or two crawler clients active. With hundreds or thousands of clients this will become a major cost.
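
To make that transfer step concrete, a sync of crawled batches via Backblaze's S3-compatible API might look something like the sketch below. The endpoint, bucket name, prefix and local directory are illustrative placeholders, not our actual configuration.

```python
"""Hedged sketch: pulling crawled batches from Backblaze to local storage.

Backblaze B2 exposes an S3-compatible API, so boto3 works with a custom
endpoint. All names below are placeholders for illustration only.
"""
import os
import boto3

ENDPOINT_URL = "https://s3.us-west-002.backblazeb2.com"  # placeholder region/endpoint
BUCKET = "mwmbl-crawl-batches"                           # hypothetical bucket name
LOCAL_DIR = "/data/batches"                              # hypothetical local directory

s3 = boto3.client("s3", endpoint_url=ENDPOINT_URL)


def sync_batches(prefix: str = "batches/") -> None:
    """Download any batch objects that are not already present locally."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            local_path = os.path.join(LOCAL_DIR, obj["Key"])
            if os.path.exists(local_path):
                continue  # already synced in an earlier run
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], local_path)
```

Even in this simple form, every new batch has to cross the network twice: once from the crawler client to Backblaze, and again from Backblaze to wherever the indexer runs.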

Similarly, we are currently building a very small (500 MB) index. When that grows to 10 TB, moving it regularly from the indexing server to the search results server will be very wasteful.

Solution 2: The halfway-house solution

Another option is to combine the crawler and indexer, but keep the search results server separate. This would eliminate the need to transfer data from Backblaze to the server (except on a new deployment), but still require syncing the index with the search results server.

Solution 3: The monolith

In this solution, we have a single server for crawler coordination, indexing and search results serving. The front-end and crawler clients obviously remain separate.

This solution has the benefit that no data has to be moved between servers, reducing costs and our dependency on proprietary storage solutions (Backblaze).

It has the following downsides though:

  • It is harder to scale components as necessary. If more resources are needed for indexing, for example, that might require a code change, such as increasing the number of threads used for indexing.
  • We have to worry about concurrency issues. If the indexer is writing to the same index that the search results server is reading from, the server could try to read a page at the same time the indexer is writing to it.

Managing concurrency

There are several possible solutions here:

  • Use per-page read-write locks to keep track of who is reading from or writing to what. This adds considerable complexity to the code and also increases the memory requirements slightly.
  • Write to a new local index, then tell the search results server to switch to the new one. This doubles the hard disk usage.
  • Allow concurrent read-writes, expect random failures, and retry when they occur. Manage the writes so that concurrent read-writes are rare.

The last option is currently the most attractive. As batches come in, we can keep track of which pages need to be updated. We then prioritise the pages with the most pending updates and update them one page at a time. Since at most one page is being written at any moment, the probability that a given read collides with a write is less than 1/NUM_PAGES.
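
A rough sketch of what this could look like is below. The Index class, its get_page/write_page methods and the PageCorruptedError exception are stand-ins for illustration, not our actual index code; the point is the shape of the update loop and the retry on read.

```python
"""Hedged sketch of the retry approach to concurrent read-writes.

The indexer counts pending updates per page and rewrites the page with the
most pending updates first, one page at a time. The search server simply
retries a read on the rare occasion that it collides with a write.
"""
import time
from collections import Counter


class PageCorruptedError(Exception):
    """Raised when a read races with a write and the page fails to decode."""


def index_pending_updates(index, pending_updates: dict) -> None:
    """Write pending updates one page at a time, most-updated page first."""
    counts = Counter({page: len(items) for page, items in pending_updates.items()})
    for page_index, _count in counts.most_common():
        index.write_page(page_index, pending_updates.pop(page_index))


def read_page_with_retry(index, page_index: int, retries: int = 3):
    """Read a page, retrying on the rare concurrent read-write failure."""
    for attempt in range(retries):
        try:
            return index.get_page(page_index)
        except PageCorruptedError:
            # Only one page is written at a time, so the chance that this
            # read collided with a write is less than 1/NUM_PAGES; back off
            # briefly and try again.
            time.sleep(0.01 * (attempt + 1))
    raise PageCorruptedError(f"page {page_index} unreadable after {retries} attempts")
```

The retry cost stays bounded because a collision can only happen when a read lands on the single page currently being written, and writing page by page keeps that window small.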