Storage

MonkeyCI needs to store various kinds of information:

  • Project information (repository URL, etc.)
  • Build history (pipelines and jobs executed)
  • Logs
  • Artifacts
  • Caches
  • Customer and billing information (including invoices)

Most of this information is fairly small and structured, except for logs, caches and artifacts, which can be large blobs. The structured information needs to be searchable to some extent, and of course it must be durable. I would like to keep an open mind about which technology is best suited for this, so I don't want to blindly fall back on a relational database. Currently I'm thinking that keeping the information in edn (or json) files in object storage could be useful. This can then be augmented with some kind of indexing system to allow for searching. The indices themselves could also be stored in edn, and could be loaded into Redis or Elasticsearch. As long as there is no income, I will focus on the cheapest solution that gets the job done without re-inventing the wheel. OCI also offers an autonomous JSON database, which could serve our needs as well.

Storing Entities

The build process itself only needs to store entities; it does not need to read any, apart from caches and build parameters. Currently, caches and artifacts are manipulated directly, information is retrieved through the API (only build parameters), and updates are sent through events.

Buckets

Initially, we will store everything in object storage, as edn files. The advantage of edn over json is that an edn file can be appended to, so a single file can hold multiple objects. This could be useful for adding log statements or updating build progress. The information is stored in a single bucket, organized like <customer>/<repository>/<build>. The build id is generated by MonkeyCI and could be as simple as a UUID. Each build "folder" contains the following information:

  • Build metadata (timestamp, trigger type, branch, commit id, result, etc.)
  • Per pipeline and step: the logs, artifacts, and results.
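The layout above can be sketched as a small set of key-building helpers. This is only an illustration: the function names and the file names used in the example (such as `metadata.edn`) are assumptions, not part of MonkeyCI.

```python
import uuid
from pathlib import PurePosixPath


def new_build_id() -> str:
    """Generate a build id; the text suggests a UUID would be enough."""
    return str(uuid.uuid4())


def build_prefix(customer: str, repository: str, build_id: str) -> str:
    """Return the object-name prefix <customer>/<repository>/<build>."""
    return str(PurePosixPath(customer, repository, build_id))


def object_key(prefix: str, *parts: str) -> str:
    """Return the key of an object stored inside a build "folder"."""
    return str(PurePosixPath(prefix, *parts))
```

For example, `object_key(build_prefix("acme", "website", new_build_id()), "metadata.edn")` would give the key for the (hypothetically named) build metadata file, and extra path parts could address the logs of one pipeline step.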

Depending on the configuration, this could also just be stored locally, which is what we will do initially, or in development mode.

Update: We now know that buckets are fairly slow, and OCI also imposes a request limit, which we hit pretty early, even in development mode with only one user. So this is clearly not the way to go, unless we want to put something in front of it, like a microservice that does caching and request grouping.
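To give an idea of what the caching half of such a front service might look like, here is a minimal read-through cache. Everything in it is an assumption made for illustration: the class name, the `fetch` callable (which stands in for a real object storage call) and the time-to-live. Request grouping is not covered.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serve repeated reads from memory so the slow store sees fewer requests."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call into object storage
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # fresh enough: no request to the store
        value = self._fetch(key)     # miss or expired: one request to the store
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, for example right after writing a new version."""
        self._entries.pop(key, None)
```

A cache like this only helps when the same objects are read repeatedly, and with multiple replicas each one would hold its own copy, so stale reads are possible until the entry expires.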

Files

Instead of buckets, we could also use files, especially if we're prepared to build a microservice that handles the requests. We could use ZeroMQ for this, or something similar. (After playing with it I know it also has its issues, but let's talk about that later.) File storage is faster than buckets, and we don't have to take request limits into account. The biggest downside is that it is not easy to scale. We could use NFS and mount it on multiple replicas, but we would still need some way to "lock" the files so that changes don't get overwritten by another replica.
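As a rough sketch of what such locking could look like on a single POSIX host, the helper below holds an exclusive advisory lock while it reads, transforms and rewrites a file. The function name is made up for this example. Whether advisory locks are honoured across NFS clients depends on the NFS setup, so that would have to be verified before relying on it.

```python
import fcntl
import os
from typing import Callable


def update_locked(path: str, update: Callable[[str], str]) -> str:
    """Read, transform and rewrite a file while holding an exclusive lock."""
    # Open without truncating so the old content can be read first.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until other holders release
        try:
            new_content = update(f.read())
            f.seek(0)
            f.truncate()
            f.write(new_content)
            f.flush()
            os.fsync(f.fileno())
            return new_content
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Appending a build status could then be written as `update_locked(path, lambda old: old + "{:status :running}\n")`. The lock is advisory, so it only protects against writers that go through the same helper.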

Another concern is that files are harder to search through. We have to structure the data carefully in order to find matches quickly, and even then it's not always possible without duplicating information. Sometimes you just need to access the data from different points of view. If we were to solve this, we would in reality be re-inventing the relational database, so we may as well use one.

MySQL

A good, easy-to-use relational database is MySQL, which is owned by Oracle, so it has good support on OCI. This does mean we'll introduce another third-party service that we need to host and maintain. We could use the cloud-provided service, but this comes at a cost (about €33/month for a basic system). Initially we could set one up in the cluster.

Using an RDBMS would make it a lot more flexible for us to look up data, and we could use JSON fields for the more dynamic parts (like job definitions and results). However, since this data type is not standard and not supported by all database systems, it may be better to just use edn stored in VARCHAR fields. The biggest hurdle here is that we would need to rewrite most of the current entity code, because it is currently oriented towards working with files.
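The idea of searchable columns plus one serialized text column for the dynamic parts could look roughly like the sketch below. It makes several assumptions: SQLite stands in for MySQL so the example runs without a server, JSON stands in for edn because Python has no edn support in its standard library, and the table and column names are invented for illustration.

```python
import json
import sqlite3

# The "details" column holds the serialized dynamic part of a build.
# The text proposes edn in a VARCHAR field; JSON in a TEXT column is the stand-in here.
SCHEMA = """
CREATE TABLE builds (
    id         TEXT PRIMARY KEY,
    customer   TEXT NOT NULL,
    repository TEXT NOT NULL,
    status     TEXT NOT NULL,
    details    TEXT NOT NULL
)
"""


def save_build(conn, build_id, customer, repository, status, details):
    """Store the searchable fields as columns and the rest as serialized text."""
    conn.execute(
        "INSERT INTO builds (id, customer, repository, status, details) "
        "VALUES (?, ?, ?, ?, ?)",
        (build_id, customer, repository, status, json.dumps(details)),
    )


def find_builds(conn, customer, status):
    """Look builds up by indexed columns and deserialize the dynamic part."""
    rows = conn.execute(
        "SELECT id, details FROM builds "
        "WHERE customer = ? AND status = ? ORDER BY id",
        (customer, status),
    )
    return [(build_id, json.loads(details)) for build_id, details in rows]
```

The trade-off is that anything kept inside the serialized column cannot be queried by the database, so fields that need to be searchable have to be promoted to real columns.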

Artifacts

Artifacts are just blobs that will be put into storage after each build step. Since storage is not free, we will have to limit either the amount of data or the period for which we store it. Artifacts are configured at step level, and each has a name and one or more paths whose contents are added to it. We will probably use tar and gzip to put all files in one package.
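Packaging the configured paths with tar and gzip could be sketched as follows. The function names and the idea of resolving paths relative to a step's working directory are assumptions made for this example.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def pack_artifact(archive: Path, work_dir: Path, paths: Iterable[str]) -> None:
    """Bundle the given paths (relative to work_dir) into one gzipped tar file."""
    with tarfile.open(archive, "w:gz") as tar:
        for rel in paths:
            # arcname keeps entries relative, so the archive can be unpacked anywhere.
            # Directories are added recursively.
            tar.add(work_dir / rel, arcname=rel)


def list_artifact(archive: Path) -> List[str]:
    """Return the names of all entries in a packed artifact."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

The resulting archive is the single blob that would be uploaded to storage under the artifact's name.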

Caching

Caches are similar to artifacts, except that they are not publicly available but are reused between builds. Similar to CircleCI or GitLab, we could assign a key to each cache. This means that caches won't be stored along with the build, but higher up, most likely at repository level. Each build step can hold a cache configuration entry that has a key and a list of paths that need to be cached and restored. Before the step is executed, the cache is restored (if found), and after the step it is updated. Depending on the configuration, the update will happen only if the step was successful, or regardless of status.
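The restore, run, update sequence described above can be sketched like this. It is a simplified model under stated assumptions: a plain dictionary stands in for the cache storage, a step is a function that receives the restored cache (or `None`) and returns its success flag together with the new cache content, and the `save_on_failure` flag is a made-up name for the configuration option.

```python
from typing import Callable, Dict, Optional, Tuple

# A step receives the restored cache (or None) and returns (success, new content).
Step = Callable[[Optional[bytes]], Tuple[bool, bytes]]


def run_step_with_cache(store: Dict[Tuple[str, str], bytes], repository: str,
                        key: str, step: Step,
                        save_on_failure: bool = False) -> bool:
    """Restore the cache, run the step, then update the cache if allowed."""
    cache_id = (repository, key)      # caches live at repository level, not per build
    restored = store.get(cache_id)    # None when no cache was found
    success, content = step(restored)
    if success or save_on_failure:
        store[cache_id] = content
    return success
```

Because the cache is addressed by repository and key rather than by build, a later build of the same repository that uses the same key picks up what an earlier build saved.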