Doc/nessDB.txt

================================================================
nessDB with LSM-Tree
Copyright (C) 2012 BohuTANG________________ 
_____________________________  __ \__  __ )
__  __ \  _ \_  ___/_  ___/_  / / /_  __  |
_  / / /  __/(__  )_(__  )_  /_/ /_  /_/ / 
/_/ /_/\___//____/ /____/ /_____/ /_____/  
================================================================
nessDB is a fast Key-Value database, supports Redis-Protocol(PING,SET,MSET,GET,MGET,DEL,EXISTS,INFO).
Which is written in ANSI C with BSD LICENSE and works in most POSIX systems without external dependencies.


OVERALL LAYOUT
==============
			       +------------------------------------+
			       |      nessDB  architecture          |
			       +------------------------------------+

	+--------------------------------+       +---------------------------------+ 
	|          memtable              |       |         background-merge        |
	|--------------------------------|------>|---------------------------------|
	|  key-sorted memory table       |       |background merge detached thread |
	+--------------+-----------------+       +---------------------------------+
		       |
		       |
	+--------------+-----------------+
	|          meta list             |
	|--------------------------------|
	| key-range informtation(memory) |
	+--------------+-----------------+
		       |
		       |
	+--------------+-----------------+
	|         on-disk sst            |
	|--------------------------------|
	|  key-sorted disk index files   |
	+--------------+-----------------+
		       |
		       |
	+--------------+------------------+
	|         on-disk db              |
	|---------------------------------|
	|      disk database file         |
	+---------------------------------+

	+--------------+------------------+
	|         on-disk log             |
	|---------------------------------|
	|         disk log file           |
	+---------------------------------+

	nessDB has three kinds file on disk, there are: '*.sst', '*.db' and '*.log'
	'*.sst' is index file, all keys are sorted and all indexes file don't contain overlapping keys
	'*.db' is data file 
	'*.log' is data recovery file

1) memtable(in memory)
======================
 It is a key sorted 'LSM-Tree' in memory and active for ADD/UPDATE/DELETE operations
 (skiplist.c)

 'memtable' structured:
 +--------+--------------+-------------+
 |   key  | value-offset |  operation |
 +--------+--------------+-------------+
 'key' is entry key(is distint,or will be covered)
 'value-offset' is the offset of one entry data which stored in the 'db' file
 'operation' is 'ADD' or 'DEL' action.


2) meta(in memory)
==================
 'meta' is key-range lists in memory, meta lists are formated as: 
 +---------------+-----------+---------------+--------------------+
 |  begin key    |  end key  | sst file name |  sequential number |
 +---------------+-----------+---------------+--------------------+
 |                  ... all the other items ...                   |
 +---------------+-----------+---------------+--------------------+
 

3) sst(on-disk)
================
  '*.sst' is on-disk index file(key is sorted), the max number of one '*.sst' file is limited by 'SST_MAX' macro,
   which is defined in 'config.h', default count is 25000.
 
 a '*.sst' index file layout:
 +------------------------------------------+
 |          sst block 1                     |
 +------------------------------------------+
 |          sst block 2                     |
 +------------------------------------------+
 |     ... more sst blocks ...              |
 +------------------------------------------+
 |          sst block N                     |
 +------------------------------------------+
 |             FOOTER                       |
 +------------------------------------------+

 'sst block' structured:
 +--------+--------------------------+
 |   key  | value-offset(big-endian) | 
 +--------+--------------------------+
 'key' is entry key(is distint,or will be covered)
 'value-offset' is the offset of one entry data which stored in the 'db' file
 NOTICE: all 'sst block' operations are 'ADD'('DEL' operation entries are filtered when written to index file)


 'FOOTER' structured:
 +-------------+-------+------+
 |   last-key  | count |  crc |
 +-------------+-------+------+
 'last-key' is the last one key of current sorted index file, due to as meta lists end-key.
 'count' is all entries's count of current sorted index file
 'crc' is used to verify the current index file


4) db(on-disk)
=============
 It is the 'value' storage file.
 'db' structured as follow:
 +-----------------------------------+
 | 	         magic number        |
 +-----------------+-----------------+
 |  value 1 length |   value 1       |
 +-----------------+-----------------+
 |      ... all the others ...       |
 +-----------------+-----------------+
 |  value N length |   value N       |
 +-----------------+-----------------+
 'magic number' is an integer for datas not start from '0' offset
  

5) log(on-disk)
===============
 '*.log' is prepared for data recovery when storage engine shutdown than expected.
 Each memtable is mapping one log file on disk. Actually, there are at most two 'log' files exists:
 one is merging memtable's(immutable, read only), and another is current active memtable's.
 When background detached-thread is done, the merging memtable's log will be deleted immediately.

 'log' structured:
 +---------------+----------+-------------+------------+
 |  key 1 length |   key 1  | data offset | operation  |
 +---------------+----------+-------------+------------+
 |               ... all the others ...                |
 +---------------+----------+-------------+------------+
 |  key N length |   key N  | data offset | operation  |
 +---------------+----------+-------------+------------+

 When nessDB reopen again, it checks is some '*.log' files exists
 if exists, recovery it(read log file and add entry to current active memtable)

6) background-merge
===================

  +-------------------------+      +---------------------------+
  | active memtable is full |----->| become immutable memtable |
  +----------+--------------+      +------------+--------------+
                                                |
                                                |
                                                v
                                 +---------------------------------+
                                 | start background merging thread |
                                 +---------------------------------+

 This is a detached-thread to merge immutable memtable(full) entries to on-disk file(*.sst)


All of the above is about nessDB's architecture and main structure layout.
Now, let's show the processes of 'WRITE' and 'READ', 'DB-OPEN'.

WRITE PROCESS
=============

     +----------------------+     +-------------------------------+   +--------------------------+
---->|  write to log first  |---->|  write to active memtable     |-->|    create new  memtable  |---> add to the memtable
     +----------------------+     +-------------------------------+   +--------------------------+
	                                         |
	                                         | if active is full(memtable becomes immutable) 
	                                         v
	                           +-------------------------------+
	                           | create one thread to do merge |
	                           +-------------------------------+
	                                         |
	                                         |
	                                         v
	                           +-------------------------------+
	                           |delete merging memtable's  log |
	                           +-------------------------------+
	                                         |
	                                         v
	                           +-------------------------------+
	                           |      exit the merging thread  |
	                           +-------------------------------+


 Summarize: 
	since merging is on background thread, so can't block the front-end writing, 
	and nessDB has read/write-locking, so concurrent writing/reading is allowed on the same index file.

READ PROCESS
=============

        +----------------------------+
 +----->| lookup from active metable |
^       +-----------+----------------+
|                   |
|            exists | not exists
|            +------+------------------------+
|            |                               |
|            v                               v
| +------------------------+     +-----------------------+
| | read data from db file |     |  get data from LLRU   |
| +------------------------+     +-----------------------+
|                                     exists | not exists
|<-------------------------------------------+------------------+
|                                                               |
|                                               +---------------v--------------------+
|                                               | get index file info from meta list |
|                                               +---------------+--------------------+
|                                                               |
|                                                               |
|                                               +---------------v--------------------+
|                                               | a)read data offset from index file |
|                                               +---------------+--------------------+
|                                                               |
|                                                               |
|                                               +---------------v--------------------+
|                                               | b)read data length from index file |
|                                               +---------------+--------------------+
|                                                               |
|                                                               |
|                                               +---------------v--------------------+
|                                               |     c)read data from db file       |
|                                               +---------------+--------------------+
|                                                               |
\---------------------------------------------------------------/


DB-OPEN PROCESS
=============== 

             +---------------+
             |    db open    |
             +-------+-------+
                     |
                     |
                     v
      +---------------------------------+
      | read footer info from sst files |
      +--------------+------------------+
                     |
                     |
                     v
      +---------------------------------+
      |       create meta list          |
      +--------------+------------------+
                     |
                     |
                     v
      +---------------------------------+
      |   recovery data from log file   |
      +--------------+------------------+
                     |
                     |
                     v
      +----------------------------------+
      |  recovery log entry to  memtable |
      +----------------------------------+


nessDB?
=======
I think coding this database storage engine is without end, this is the reason why it is named 'nessDB'.
It is ness.