

StrokeDB Core.

Minimalistic modular database engine.

Authors: Oleg Andreev & Yurii Rashkovskii

Copyright 2008 IDBNS, inc.

= Architecture overview

StrokeDB stores a set of versioned documents identified by UUID.
A document is a hash-like container of slots: a flat set of values tagged with string keys.
Slots store arbitrary serializable values (the most common types are booleans,
numbers, strings, arrays, and times).
The basic repository implements three retrieval methods:
  * get(uuid) 
  * get_version(version)
  * each{|doc| ... }
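
The three retrieval methods can be illustrated with a minimal in-memory sketch. The class name MemoryRepository, the store method, and the choice that get(uuid) returns the most recently stored version are assumptions made for this example, not StrokeDB's actual implementation.

```ruby
# Hypothetical in-memory repository; class name, store, and internals are
# illustrative assumptions, not StrokeDB's actual implementation.
class MemoryRepository
  def initialize
    @versions = {}   # version => document (every stored version)
    @heads    = {}   # uuid    => most recently stored version
  end

  # Stores a document (a Hash with "uuid" and "version" slots).
  def store(doc)
    @versions[doc["version"]] = doc
    @heads[doc["uuid"]] = doc["version"]
    doc
  end

  # Returns the most recently stored version of the document, or nil.
  def get(uuid)
    @versions[@heads[uuid]]
  end

  # Returns the document exactly as stored under the given version, or nil.
  def get_version(version)
    @versions[version]
  end

  # Yields the most recently stored version of every document.
  def each
    @heads.each_value { |version| yield @versions[version] }
  end
end

repo = MemoryRepository.new
repo.store({ "uuid" => "doc-1", "version" => "v1", "title" => "Draft" })
repo.store({ "uuid" => "doc-1", "version" => "v2",
             "previous_version" => "v1", "title" => "Final" })
```

Here get("doc-1") resolves to the "v2" document, while get_version("v1") still returns the older one.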

Efficient indexing is implemented with views. A view is an indexed, sorted list of key-value
pairs. Each view relies on a map(doc) method to produce key-value pairs (as in map-reduce).
View requests are implemented as prefix-based searches using the find() method.
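
A rough sketch of the idea, assuming string keys: the view keeps its pairs sorted by key, so a prefix query is a binary search for the first matching key followed by a linear scan. The class name PrefixView and the update method are hypothetical; only map and find come from the description above.

```ruby
# Hypothetical view: keeps [key, value] pairs sorted by key and answers
# prefix queries with a binary search. Names are illustrative assumptions.
class PrefixView
  def initialize(&map)
    @map   = map   # block: doc -> array of [key, value] pairs
    @pairs = []    # kept sorted by key
  end

  # Runs map(doc) and merges the emitted pairs into the sorted list.
  def update(doc)
    @pairs.concat(@map.call(doc))
    @pairs.sort_by! { |key, _| key }
    self
  end

  # Returns the values of all pairs whose key starts with the prefix.
  def find(prefix)
    start = @pairs.bsearch_index { |key, _| key >= prefix }
    return [] if start.nil?
    @pairs.drop(start)
          .take_while { |key, _| key.start_with?(prefix) }
          .map { |_, value| value }
  end
end

by_title = PrefixView.new { |doc| [[doc["title"].downcase, doc["uuid"]]] }
by_title.update({ "uuid" => "a", "title" => "Stroke" })
by_title.update({ "uuid" => "b", "title" => "Strokedb" })
by_title.update({ "uuid" => "c", "title" => "Other" })
```

Re-sorting on every update keeps the sketch short; a real index would more likely insert each pair at its sorted position.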

StrokeDB Core API relies on: 
  * two methods of document: doc["slot"], doc["slot"]= 
  * slots: "uuid", "version", "previous_version"
  * ability to store Arrays and Strings in a "previous_version" slot. (Array is used when 
  the version is a result of a merge operation.)
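
Because the "previous_version" slot may be absent, a String, or an Array, code that walks history has to normalize it. The sketch below assumes nothing beyond the doc["slot"] and doc["slot"]= methods listed above; previous_versions and SlotDocument are hypothetical names.

```ruby
# Hypothetical helper: returns the previous versions of a document as an
# Array, whatever shape the "previous_version" slot has
# (absent => [], String => one element, Array => as is).
def previous_versions(doc)
  Array(doc["previous_version"])
end

# Any object that responds to [] and []= qualifies as a document here.
# SlotDocument is an illustrative stand-in, not a StrokeDB class.
class SlotDocument
  def initialize(slots = {})
    @slots = slots
  end

  def [](name)
    @slots[name]
  end

  def []=(name, value)
    @slots[name] = value
  end
end
```

A plain Hash satisfies the same contract, so previous_versions works on both.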

== Versions

Every document version references zero or more previous versions.
The very first version of a document has no reference to any previous version.
A regular version references exactly one previous version.
A merge version references two or more previous versions.
Deleting a document simply creates a new version with the {"deleted"=>true} slot.
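
The four kinds of versions can be sketched over plain Hashes as follows. All helper names are hypothetical, random UUIDs stand in for version numbers, and the merge resolves slot conflicts naively (later heads win), which is an assumption of this example only.

```ruby
require "securerandom"

# Illustrative helpers over plain Hashes; none of these names come from StrokeDB.

# First version: no "previous_version" slot at all.
def create_document(slots)
  slots.merge("uuid" => SecureRandom.uuid, "version" => SecureRandom.uuid)
end

# Regular version: references exactly one previous version (a String).
def update_document(doc, changes)
  doc.merge(changes).merge("version" => SecureRandom.uuid,
                           "previous_version" => doc["version"])
end

# Merge version: references two or more previous versions (an Array).
# Slot conflicts are resolved naively here: later heads win.
def merge_documents(*heads)
  merged = heads.reduce({}) { |slots, head| slots.merge(head) }
  merged.merge("version" => SecureRandom.uuid,
               "previous_version" => heads.map { |head| head["version"] })
end

# Deletion: just another regular version carrying the {"deleted"=>true} slot.
def delete_document(doc)
  update_document(doc, { "deleted" => true })
end
```

Note that a deleted document keeps its UUID and its full history; only the newest version says it is gone.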

== Guarantees

StrokeDB doesn't guarantee referential integrity. Any document may reference any UUID, even
one that is not available in the repository.

== Views

== Synchronization (fetch and merge operations)

Data cases:
1. Many new documents, few versions.
2. Many documents with many versions organized linearly.

Branches must be handled efficiently as well, but the first priority is to make the two cases above
as performant as possible.

Fetch contexts:
1. Occasional fetch (hourly, daily etc.)
2. Streamed syncing (e.g. master-slave replication)

The concept.

1. Every repository is identified by UUID.
2. Every repository writes a log of commits.
3. Commit tuples:
   (timestamp, "store", uuid, version)
   (timestamp, "pull",  repo_uuid, repo_timestamp)
4. When you pull from a repository:
   1. Find out the latest timestamp in your history (index: repo_uuid -> repo_timestamp)
   2. If there is no timestamp yet, pull the whole log.
   3. If there is a timestamp for the repository UUID, pull the tail of the log.
   4. For each "store" record: fetch the version.
   5. For each "pull" record: add the repository to a deferred list of repositories waiting for update.
   6. When the whole log is fetched, fetch the deferred repositories. There are two options here:
     1. Fetch from the same repository we were fetching from a few moments ago (say, fetch the B log from A).
     2. Or fetch directly from the desired repository (the B log from the B repository).
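
The steps above can be sketched with an in-memory repository. Everything here is an assumption made for illustration: the class and method names, the use of a per-repository counter instead of wall-clock timestamps, and the decision to return the deferred list to the caller rather than fetch it (step 6 is left open, as in the text).

```ruby
require "securerandom"

# Hypothetical in-memory repository with a commit log. All names are
# assumptions for this sketch; timestamps are a simple per-repository counter.
class LoggedRepository
  attr_reader :uuid, :log, :versions

  def initialize
    @uuid     = SecureRandom.uuid
    @log      = []   # commit tuples, oldest first
    @versions = {}   # version => document
    @pulled   = {}   # index: repo_uuid => repo_timestamp of the last pull
    @clock    = 0
  end

  # Stores a document version and logs (timestamp, "store", uuid, version).
  def store(doc)
    @versions[doc["version"]] = doc
    @log << [tick, "store", doc["uuid"], doc["version"]]
    doc
  end

  # Log entries newer than the given timestamp (the whole log when nil).
  def log_since(timestamp)
    timestamp.nil? ? @log.dup : @log.select { |entry| entry[0] > timestamp }
  end

  # Pulls from remote; returns the UUIDs of the repositories named in the
  # remote's "pull" records (the deferred list from step 5).
  def pull(remote)
    entries  = remote.log_since(@pulled[remote.uuid])   # steps 1-3
    deferred = []
    entries.each do |_, kind, id, ref|
      case kind
      when "store"                                       # step 4
        @versions[ref] ||= remote.versions[ref]
      when "pull"                                        # step 5
        deferred << id unless id == @uuid
      end
    end
    unless entries.empty?
      @pulled[remote.uuid] = entries.last[0]
      @log << [tick, "pull", remote.uuid, entries.last[0]]
    end
    deferred.uniq
  end

  private

  def tick
    @clock += 1
  end
end
```

A caller would then fetch each returned repository UUID either through the same remote or directly, matching the two options in step 6.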

Partial fetches.

A repository may expose only a limited set of documents for synchronization, specified with a view.
Thus, the view needs to keep its own update log.
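
One possible shape for such a view, as a sketch only: it filters documents with a predicate and appends a "store" tuple to its own log for every document it accepts, so a peer can pull the tail of the view's log instead of the repository's. SyncView and its methods are hypothetical names, and the counter-based timestamp is an assumption.

```ruby
# Hypothetical view that exposes a subset of documents for synchronization
# and records its own update log. Names are illustrative assumptions.
class SyncView
  attr_reader :log

  def initialize(&selector)
    @selector = selector   # block: doc -> true if the doc is exposed
    @log      = []         # (timestamp, "store", uuid, version) tuples
    @clock    = 0          # counter standing in for a timestamp
  end

  # Logs the document version if the view exposes it; returns true if logged.
  def update(doc)
    return false unless @selector.call(doc)
    @clock += 1
    @log << [@clock, "store", doc["uuid"], doc["version"]]
    true
  end

  # Log entries newer than the given timestamp (the whole log when nil).
  def log_since(timestamp)
    timestamp.nil? ? @log.dup : @log.select { |entry| entry[0] > timestamp }
  end
end
```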

TODO: how to efficiently determine which versions are already fetched?


We may add git-like security features this way:
1. Each version number is a UUIDv5 (SHA-1) of the doc's content.
2. Each subsequent doc UUID is a UUIDv5 of the previous commit's content.
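
Point 1 could look roughly like the sketch below, which computes an RFC 4122 version-5 UUID by hand with SHA-1. The namespace constant, the content_version helper, and the canonical form (JSON of all slots except "version", with top-level keys sorted) are assumptions for illustration; values that JSON cannot represent faithfully, such as times, would need their own encoding.

```ruby
require "digest"
require "json"

# RFC 4122 DNS namespace, reused here only as a placeholder; a real
# deployment would pick its own namespace UUID.
PLACEHOLDER_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# Computes a version-5 (SHA-1, name-based) UUID as described in RFC 4122.
def uuid_v5(namespace_uuid, name)
  namespace_bytes = [namespace_uuid.delete("-")].pack("H*")
  bytes = Digest::SHA1.digest(namespace_bytes + name.b).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50   # set the version nibble to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80   # set the RFC 4122 variant bits
  hex = bytes.pack("C*").unpack1("H*")
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

# Hypothetical content-derived version number: hashes a canonical JSON form
# of every slot except "version" itself (top-level keys sorted so that slot
# order does not change the result).
def content_version(doc)
  content = doc.reject { |slot, _| slot == "version" }.sort.to_h
  uuid_v5(PLACEHOLDER_NAMESPACE, JSON.generate(content))
end
```

With this scheme, two repositories that store identical content independently arrive at the same version number, and any change to a slot changes it.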

== Availability