LuMongo is a real-time distributed search and storage system designed to scale both vertically and horizontally across servers. LuMongo provides the flexibility and power of Lucene queries with the scalability and ease of use of MongoDB. By intelligently leveraging MongoDB, LuMongo makes Lucene scale without sacrificing Lucene's rich query syntax and without degrading MongoDB's scalability. All data in LuMongo is stored in MongoDB, including indexes, documents, and associated documents. MongoDB can be sharded and replicated to meet I/O needs, and since MongoDB is the only data store, backup requires only a simple filesystem snapshot.
LuMongo operates as a cluster, which can range from a single server to hundreds of servers. A server in a LuMongo cluster is referred to as a LuMongo node. Nodes communicate with each other and with clients using Google Protocol Buffers, and a client can connect to any node in the cluster. On error, a client can fail over to another cluster node. Nodes can be added to and removed from the cluster dynamically. To facilitate this dynamic membership, a LuMongo node stores data only in MongoDB and does not rely on local storage. In addition to facilitating dynamic membership, this allows data persistence requirements such as sharding, replication, and backup to be managed separately from index and search requirements such as CPU and memory. This can be a significant benefit for cloud applications. A basic architecture is shown in the diagram below.
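The client-side failover behavior described above can be sketched as follows. This is a minimal illustration, not the LuMongo client API: the `send` callable and the error handling are hypothetical stand-ins for a real Protocol Buffers transport.

```python
import random

def send_with_failover(nodes, request, send):
    """Try nodes in random order; any node can serve any request.

    `nodes` is a list of node addresses, `send(node, request)` is a
    hypothetical transport call that raises ConnectionError on failure.
    """
    last_error = None
    for node in random.sample(nodes, len(nodes)):
        try:
            return send(node, request)
        except ConnectionError as e:
            last_error = e  # fail over to the next node in the cluster
    raise last_error
```

Because every node can accept every request, the client needs no routing logic; it only needs the cluster membership list.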
## Index Segments

LuMongo indexes are broken down into segments. The number of segments is configurable and determines the maximum number of servers the index can be scaled across; once an index is created, the number of segments cannot be changed. Because every segment is a Lucene index, each segment can hold up to 2 billion documents, so the maximum number of documents a cluster can index is approximately the number of segments multiplied by 2 billion. When the first node of the cluster is started, it holds all the segments. When a second node is started, the cluster splits the segments across the two nodes, and this process continues as additional nodes are started. An example of segments being distributed as nodes are added is shown in the diagram below.
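The splitting described above can be illustrated with a simple round-robin assignment. This is only a sketch of the idea; LuMongo's actual balancing strategy is not specified here and may differ.

```python
def distribute_segments(num_segments, nodes):
    """Assign index segments to nodes round-robin (illustrative only)."""
    assignment = {node: [] for node in nodes}
    for seg in range(num_segments):
        # Each segment lives on exactly one node at a time.
        assignment[nodes[seg % len(nodes)]].append(seg)
    return assignment
```

With 4 segments, a single node holds all of them; adding a second node splits them evenly, which is why the segment count caps how far an index can be spread.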
## Indexing

Documents are indexed into the cluster using a user-provided unique identifier. A hash of the document's unique identifier determines which segment the document's indexed fields are stored in. This ensures that the same document is always indexed into the same segment within an index. To help keep the index size small, the unique identifier is the only indexed field whose value is retrievable from the index segment. The full document and any associated documents are stored in MongoDB, along with document types and user-defined metadata about the documents. To index a document, a client sends an indexing request to any node in the cluster. The receiving node calculates the segment from the unique identifier, determines which node owns that segment, and forwards the request to it. The node that owns the segment updates or inserts the document in the index by its unique identifier. If the segment has exceeded its writes before commit, the segment is committed, as seen in the diagram below. The document itself is stored in MongoDB and can be text, BSON, or binary. Since the document is a MongoDB object, the maximum size of a document is 16 MB as of MongoDB 2.0. Associated documents can be any size because they are stored in GridFS.
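The hash-based routing above can be sketched in a few lines. The hash function shown (MD5 of the identifier, reduced modulo the segment count) is an assumption for illustration, not necessarily the one LuMongo uses; the property that matters is that a given identifier always maps to the same segment.

```python
import hashlib

def segment_for(unique_id, num_segments):
    """Deterministically map a document's unique id to a segment number.

    Illustrative hash choice: any stable hash works, as long as the
    same id always lands in the same segment of a given index.
    """
    digest = hashlib.md5(unique_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments
```

A node receiving an index request would compute `segment_for(doc_id, num_segments)` and forward the request to whichever node currently owns that segment.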
## Searching

Searches in LuMongo are federated across all segments of the index being searched. Because a document is stored in only a single segment within an index (its segment is determined by hashing the unique identifier), deduplication of the search results is not necessary when searching a single index. A LuMongo search returns only the unique identifiers of the documents that match the query. The unique identifier and a last-updated timestamp are all that is stored in Lucene, which helps keep the indexes small. The returned unique identifiers can then be used to fetch or delete the documents returned by the search.
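The federation step can be sketched as a merge of per-segment result lists. This is a conceptual illustration, not LuMongo's internal code: it assumes each segment returns hits as `(score, unique_id)` pairs sorted by descending score.

```python
import heapq

def federated_search(segment_results, limit):
    """Merge per-segment hit lists into a single top-`limit` result.

    No deduplication is needed: hashing guarantees each unique id
    lives in exactly one segment of the index.
    """
    merged = heapq.merge(*segment_results, key=lambda hit: -hit[0])
    return [uid for _, uid in list(merged)[:limit]]
```

Only identifiers (and timestamps, in the real system) flow back to the client; the documents themselves are fetched separately.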
## Fetching

Because a LuMongo search returns only a list of unique identifiers and last-updated timestamps, document retrieval is handled with a separate fetch request. The unique identifiers provided by the search results are the keys used to retrieve the document and its associated documents. Because these documents are stored separately from the index, heterogeneous documents can coexist in the same repository. When a document is fetched, its type and any metadata provided at storage time are returned as well. A client-side document cache is available to speed up fetching. The cache compares the timestamp returned with the timestamp of the document in the cache to determine whether a fetch is needed.
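The timestamp-based cache check can be sketched as follows. This is a minimal illustration of the idea, assuming a hypothetical `fetch_from_server` callable; it is not the LuMongo client's actual cache implementation.

```python
class DocumentCache:
    """Client-side cache keyed by unique id, validated by timestamp."""

    def __init__(self):
        self._docs = {}  # unique_id -> (timestamp, document)

    def fetch(self, unique_id, timestamp, fetch_from_server):
        cached = self._docs.get(unique_id)
        if cached is not None and cached[0] == timestamp:
            return cached[1]  # timestamps match: no round trip needed
        # Cache miss or stale copy: fetch and remember the fresh document.
        doc = fetch_from_server(unique_id)
        self._docs[unique_id] = (timestamp, doc)
        return doc
```

Because search results already carry the last-updated timestamp, the client can decide locally whether a fetch is necessary, without an extra validation request.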
## Failover

On a normal shutdown of a LuMongo node, all of its segments are committed and redistributed to the remaining nodes. This allows rolling shutdowns of the nodes to update them. On an unexpected shutdown, the segments fail over to the remaining nodes without committing, and those indexes could require rollback or repair. Currently this is not handled automatically, but future releases will handle it using Lucene's built-in index repair. Since the documents are stored in MongoDB rather than in the index, another possible solution is to fetch the documents belonging to a corrupted segment and reindex them. MongoDB itself provides seamless failover through replication, and because MongoDB's replication is data-center aware, replication across data centers is possible.
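The reindex-from-MongoDB repair idea can be sketched in outline. The function names here (`fetch_document`, `index_document`) are hypothetical placeholders, not LuMongo APIs; the point is simply that the document store, not the index, is the source of truth.

```python
def reindex_segment(unique_ids, fetch_document, index_document):
    """Rebuild a corrupted segment from the documents stored in MongoDB.

    `unique_ids` lists the documents the segment owned; each is
    re-fetched from the document store and re-indexed from scratch.
    """
    for uid in unique_ids:
        index_document(uid, fetch_document(uid))
```

This works only because LuMongo keeps the full documents in MongoDB: a lost Lucene segment loses derived data, not the documents themselves.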
## Disaster Recovery

Seamless failover is always desired, but catastrophic failures may require recovery. The design of LuMongo lends itself well to disaster recovery. All data in LuMongo is stored in MongoDB, and since MongoDB can be backed up with a file-level snapshot, restoring all indexes, documents, and associated documents requires only a snapshot of the MongoDB data directory. This allows point-in-time backups that can restore the system to a given state. Such snapshots can also be taken before content reloads and other large changes, making it possible to revert to the previous state.