# LevelDB parameter tuning using MLOS

## What is LevelDB

LevelDB is a key-value store built on log-structured merge trees (LSM trees; see [Wiki](https://en.wikipedia.org/wiki/Log-structured_merge-tree)). It supports read, write, delete, and range query (sorted iteration) operations.

Like most database systems, LevelDB exposes a number of parameters that can be tuned to the workload to improve performance. Before turning to the parameters, we briefly describe how LevelDB works. The source code, the architecture, and a simple usage example can be found [here](https://github.com/google/leveldb).

## LevelDB working
<p align="center">
  <img width="460" height="300" src=images/leveldb-architecture.png>
  <img width="460" height="100" src=images/memtablesstable.png>
</p>

As shown in the diagram above, the main components of LevelDB are the MemTable, the SSTable files, and the log file. LevelDB is primarily optimized for writes.
The MemTable is an in-memory data structure to which incoming writes are added after they are appended to the log file. MemTables are typically implemented using skip lists or balanced trees; LevelDB uses a skip list. The parameter write_buffer_size (set when the DB is opened) controls the size of the MemTable and the log file.

Once the MemTable reaches write_buffer_size (default 4 MB), a new MemTable and log file are created and the original MemTable is made immutable. The immutable MemTable is converted to a new SSTable in the background and added to Level 0 of the LSM tree.
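
The write path described above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the class name `ToyLSMWriter` is made up, a plain dict stands in for the skip list, a list stands in for the log file, and the flush happens synchronously rather than in the background as it does in LevelDB.

```python
class ToyLSMWriter:
    """Toy model: log append, MemTable insert, flush at write_buffer_size."""

    def __init__(self, write_buffer_size=4 * 1024 * 1024):
        self.write_buffer_size = write_buffer_size
        self.log = []            # stands in for the on-disk log file
        self.memtable = {}       # stands in for the skip list
        self.memtable_bytes = 0
        self.level0 = []         # each entry is one flushed, key-sorted table

    def put(self, key, value):
        self.log.append((key, value))      # 1. append to the log first
        self.memtable[key] = value         # 2. then update the MemTable
        self.memtable_bytes += len(key) + len(value)
        if self.memtable_bytes >= self.write_buffer_size:
            self._flush()

    def _flush(self):
        # The full MemTable is written out sorted by key as a Level 0 table;
        # a fresh MemTable and log take over for new writes.
        self.level0.append(sorted(self.memtable.items()))
        self.memtable, self.log, self.memtable_bytes = {}, [], 0


# Ten 10-byte entries with a 32-byte buffer: a flush happens on every 4th put.
db = ToyLSMWriter(write_buffer_size=32)
for i in range(10):
    db.put(f"key{i}".encode(), b"v" * 6)
print(len(db.level0), "tables flushed,", len(db.memtable), "entries still in memory")
```

A larger write_buffer_size means fewer, larger flushes, which is the trade-off the tuning exercise later in this document explores.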

SSTable: a file in which key-value pairs are stored sorted by key. The size of an SSTable is controlled by the parameter max_file_size (default 2 MB).
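
Because the entries are sorted by key, both point lookups and range queries reduce to binary search. The sketch below illustrates the idea on an in-memory list of pairs; the function names are hypothetical, and real SSTable files are organized into blocks with an index, which this example does not model.

```python
import bisect

def sstable_get(sstable, key):
    """Return the value for key in a key-sorted list of (key, value), or None."""
    keys = [k for k, _ in sstable]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sstable[i][1]
    return None

def sstable_range(sstable, start, end):
    """Return all entries with start <= key < end, in key order."""
    keys = [k for k, _ in sstable]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return sstable[lo:hi]

table = [(b"a", b"1"), (b"c", b"3"), (b"e", b"5")]
print(sstable_get(table, b"c"))
print(sstable_range(table, b"b", b"e"))
```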

Once the number of SSTables at Level 0 reaches the threshold set by kL0_CompactionTrigger (default 4), these files are merged with the overlapping files in the next level (Level 1). If the next level has no files yet, the Level 0 files are merge-sorted among themselves and written to that level. A new output file is started for every 2 MB of data by default (max_file_size).

For levels 1 and above, compaction (the merging process) is triggered when the level fills up, that is, when the combined size of its files exceeds the level's limit (10^L MB for level L: 10 MB for Level 1, 100 MB for Level 2, and so on).
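
The compaction steps can be sketched as follows. This is a simplified illustration, not LevelDB's implementation: the function names are hypothetical, deletions are ignored, and the per-level limit of 10^L MB is taken from the LevelDB implementation notes linked below.

```python
import heapq

def merge_tables(tables):
    """Merge key-sorted tables (ordered newest first); the newest value wins."""
    # Tag each entry with its table's age so equal keys sort newest first.
    tagged = ([(key, age, value) for key, value in table]
              for age, table in enumerate(tables))
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:   # skip older duplicates
            merged.append((key, value))
    return merged

def split_into_files(entries, max_file_size=2 * 1024 * 1024):
    """Cut the merged run into output files of roughly max_file_size bytes."""
    files, current, size = [], [], 0
    for key, value in entries:
        current.append((key, value))
        size += len(key) + len(value)
        if size >= max_file_size:
            files.append(current)
            current, size = [], 0
    if current:
        files.append(current)
    return files

def needs_compaction(level, num_files, level_bytes, l0_trigger=4):
    """Level 0 is triggered by file count, higher levels by total size."""
    if level == 0:
        return num_files >= l0_trigger
    return level_bytes > (10 ** level) * 1024 * 1024

newer = [(b"a", b"new"), (b"c", b"3")]
older = [(b"a", b"old"), (b"b", b"2")]
merged = merge_tables([newer, older])
print(merged)
print(split_into_files(merged, max_file_size=5))
```

A smaller max_file_size produces more output files per compaction, while a larger one produces fewer files that each cover a wider key range.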

A detailed explanation of how LevelDB works is available [here](https://github.com/google/leveldb/blob/master/doc/impl.md).


## LevelDB parameter tuning using MLOS

In this lab we tune some of the important parameters of LevelDB and observe how they affect performance. The parameters are write_buffer_size and max_file_size, and the goal is to optimize the throughput and latency of LevelDB for sequential and random workloads.

## LevelDB installation: Instructions for Ubuntu 18.04

Follow the commands below to get, compile, and install LevelDB:

- sudo apt update
- sudo apt-get install cmake
- git clone --recurse-submodules https://github.com/google/leveldb.git
- cd leveldb
- mkdir -p build && cd build
- cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .

Now, from the ~/leveldb/build directory, you should be able to execute ./db_bench, the microbenchmark which can be used to measure the performance of LevelDB for different workloads. 

Please take a look at the db_bench.cc file in the ~/leveldb/benchmarks directory and get an idea about the input parameters and workloads that are possible. 

An example command that runs a random-write workload of 1M entries with a value size of 100 B is:
./db_bench --benchmarks=fillrandom --value_size=100 --num=1000000
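
When driving db_bench from a script, it helps to build the command from named flags and to parse the result line with a pattern rather than by position. The sketch below assumes that write benchmarks report a line of the form `name : <x> micros/op; <y> MB/s`; the helper names and the sample output text are made up for illustration, and benchmarks that do not report MB/s are simply skipped.

```python
import re

# Assumed shape of a result line, e.g.
#   fillrandom   :       2.460 micros/op;   45.0 MB/s
RESULT_LINE = re.compile(
    r"^(\w+)[ \t]*:[ \t]*([\d.]+) micros/op;[ \t]*([\d.]+) MB/s", re.MULTILINE
)

def build_command(db_bench="./db_bench", **flags):
    """Turn keyword arguments into a db_bench argument list."""
    return [db_bench] + [f"--{name}={value}" for name, value in flags.items()]

def parse_db_bench(output):
    """Map benchmark name to (latency in micros/op, throughput in MB/s)."""
    return {
        name: (float(micros), float(mbps))
        for name, micros, mbps in RESULT_LINE.findall(output)
    }

# Made-up sample output, used only to exercise the parser.
sample = """Keys:       16 bytes each
Values:     100 bytes each
fillrandom   :       2.460 micros/op;   45.0 MB/s
"""
print(build_command(benchmarks="fillrandom", value_size=100, num=1000000))
print(parse_db_bench(sample))
```

The argument list returned by `build_command` can be passed directly to `subprocess.check_output` without `shell=True`.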

In [1]:
# import the required classes and tools
import grpc
import pandas as pd
import logging

from mlos.Grpc.BayesianOptimizerFactory import BayesianOptimizerFactory
from mlos.Logger import create_logger

from mlos.Examples.SmartCache import HitRateMonitor, SmartCache, SmartCacheWorkloadGenerator, SmartCacheWorkloadLauncher
from mlos.Mlos.SDK import MlosExperiment
from mlos.Optimizers.OptimizationProblem import OptimizationProblem, Objective
from mlos.Spaces import Point, SimpleHypergrid, ContinuousDimension, DiscreteDimension

# The optimizer will be in a remote process via grpc, we pick the port here:
grpc_port = 50051

In [2]:
import subprocess
optimizer_microservice = subprocess.Popen(f"start_optimizer_microservice launch --port {grpc_port}", shell=True)
logger = create_logger('Optimizing LevelDB', logging_level=logging.WARN)
optimizer_service_grpc_channel = grpc.insecure_channel(f'localhost:{grpc_port}')
bayesian_optimizer_factory = BayesianOptimizerFactory(grpc_channel=optimizer_service_grpc_channel, logger=logger)

In [3]:
parameter_search_space = SimpleHypergrid(
        name='write_buffer_size_config',
        dimensions=[
            DiscreteDimension('write_buffer_size', min=64*1024, max=32*1024*1024)
        ]
    )

In [4]:
# Optimization Problem
#
optimization_problem = OptimizationProblem(
    parameter_space=parameter_search_space,
    objective_space=SimpleHypergrid(name="objectives", dimensions=[ContinuousDimension(name="throughput", min=0, max=1000)]),
    objectives=[Objective(name="throughput", minimize=False)]
)
# create an optimizer proxy that connects to the remote optimizer via grpc:
optimizer = bayesian_optimizer_factory.create_remote_optimizer(optimization_problem=optimization_problem)

In [5]:
# This block can be removed
from mlos.Mlos.SDK import mlos_globals, MlosAgent
mlos_globals.init_mlos_global_context()
mlos_agent = MlosAgent(
    logger=logger,
    communication_channel=mlos_globals.mlos_global_context.communication_channel,
    shared_config=mlos_globals.mlos_global_context.shared_config,
)

In [15]:
# Please change the leveldb_path to the build directory of your leveldb installation
leveldb_path = "/users/nithinv/leveldb/build/"
command = "db_bench --benchmarks=fillrandom"

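import subprocess
def run_workload(write_buffer_size):
    # Run db_bench with the given write_buffer_size (in bytes) and
    # return (latency in micros/op, throughput in MB/s).
    result = subprocess.check_output(f"{leveldb_path}{command} --write_buffer_size={write_buffer_size}", shell=True)
    # The result line looks like "fillrandom : <x> micros/op; <y> MB/s":
    # take the text after the last ":" and split the two stats on ";".
    stats = result.decode().split(":")[-1].split(";")
    latency, throughput = float(stats[0].split()[0]), float(stats[1].split()[0])
    return latency, throughput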

def initialize_optimizer():
    pass

optimizer = bayesian_optimizer_factory.create_remote_optimizer(optimization_problem=optimization_problem)
def run_optimizer():
    # Suggested ranges: write_buffer_size from 64 KB to 1 GB, max_file_size from 1 MB to 1 GB.
    # This loop tunes only write_buffer_size, within the range set in parameter_search_space (64 KB to 32 MB).
    # run_workload returns (latency, throughput); only throughput is used as the objective here.
    for i in range(10):
        new_config_values = optimizer.suggest()
        write_buffer_size = new_config_values["write_buffer_size"]
        throughput = run_workload(str(write_buffer_size))[1]
        print(str(write_buffer_size / (1024*1024)) + " MB", write_buffer_size,  str(throughput) + " MB/s")
        if i > 0:
            print("optimal: ", optimizer.optimum())
        objectives_df = pd.DataFrame({'throughput': [throughput]})
        features_df = new_config_values.to_dataframe()
        optimizer.register(features_df, objectives_df)

run_optimizer()

9.023958206176758 MB 9462306 15.5 MB/s
21.169506072998047 MB 22197836 36.7 MB/s
optimal:  ({"write_buffer_size": 9462306}, {"throughput": 15.5})
10.208983421325684 MB 10704895 24.8 MB/s
optimal:  ({"write_buffer_size": 22197836}, {"throughput": 36.7})
13.992161750793457 MB 14671845 35.6 MB/s
optimal:  ({"write_buffer_size": 22197836}, {"throughput": 36.7})
16.306455612182617 MB 17098558 36.2 MB/s
optimal:  ({"write_buffer_size": 22197836}, {"throughput": 36.7})
7.275952339172363 MB 7629389 16.0 MB/s
optimal:  ({"write_buffer_size": 22197836}, {"throughput": 36.7})
30.3869047164917 MB 31862979 36.9 MB/s
optimal:  ({"write_buffer_size": 22197836}, {"throughput": 36.7})
15.966145515441895 MB 16741717 36.8 MB/s
optimal:  ({"write_buffer_size": 31862979}, {"throughput": 36.9})
10.518720626831055 MB 11029678 25.4 MB/s
optimal:  ({"write_buffer_size
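
The nine complete measurements printed above can be summarized with a few lines of Python. This is a small post-processing sketch, not part of the MLOS run: the 5% tolerance is an arbitrary choice for illustration, and the values are copied from the output as printed.

```python
# (write_buffer_size in bytes, throughput in MB/s), copied from the run above.
observations = [
    (9462306, 15.5), (22197836, 36.7), (10704895, 24.8),
    (14671845, 35.6), (17098558, 36.2), (7629389, 16.0),
    (31862979, 36.9), (16741717, 36.8), (11029678, 25.4),
]

# Best configuration seen so far.
best = max(observations, key=lambda obs: obs[1])

# Smallest buffer whose throughput is within 5% of the best (assumed tolerance).
tolerance = 0.05
near_best = [obs for obs in observations if obs[1] >= best[1] * (1 - tolerance)]
smallest_near_best = min(near_best, key=lambda obs: obs[0])

for size, throughput in sorted(observations):
    print(f"{size / (1024 * 1024):6.2f} MB  {throughput:5.1f} MB/s")
print("best:", best)
print("smallest buffer within 5% of best:", smallest_near_best)
```

In this run, throughput rises steeply up to roughly 14 MB and then flattens, so a buffer well below the 32 MB upper bound already achieves close to the best observed throughput.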

### Reference
https://wiesen.github.io/post/leveldb-storage-memtable/

https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/