# LevelDB parameter tuning using MLOS

## What is LevelDB

LevelDB is a key-value store built on [log-structured merge-trees](https://en.wikipedia.org/wiki/Log-structured_merge-tree) (LSM trees). It supports read, write, delete, and range query (sorted iteration) operations.

Like most database systems, LevelDB exposes a number of parameters that can be tuned to the workload for best performance. Before turning to the parameters, we briefly describe how LevelDB works. The source code, the architecture, and a simple usage example can be found [here](https://github.com/google/leveldb).

## LevelDB working

![LevelDB Architecture Diagram](./images/leveldb-architecture.png)

![MemTable SSTable Diagrams](./images/memtablesstable.png)

LevelDB uses 7 levels to store the data. The amount of data that each level after level 0 can hold is roughly $10^{level}$ MB, so level 1 can store around 10 MB of data, level 2 around 100 MB, and so on.
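
As a quick check of the arithmetic, the sketch below tabulates the approximate capacity of each level. It assumes the $10^{level}$ MB rule stated above and numbers the 7 levels from 0 to 6; the function name is illustrative and not part of LevelDB.

```python
NUM_LEVELS = 7  # levels 0..6, as described above


def level_capacity_mb(level):
    """Approximate capacity of a level in MB under the 10^level rule (levels 1..6)."""
    if level < 1 or level >= NUM_LEVELS:
        raise ValueError("the rule applies to levels 1..6 only")
    return 10 ** level


for level in range(1, NUM_LEVELS):
    print(f"level {level}: ~{level_capacity_mb(level):,} MB")
```

Level 0 is left out on purpose: as described further down, it is bounded by a file count rather than by a size.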

As shown in the diagrams above, the main components of LevelDB are the _MemTable_, the _SSTable_ files and the _log_ file. LevelDB is primarily optimized for writes.

The _MemTable_ is an in-memory data structure to which incoming writes are added after they have been appended to the log file. MemTables are typically implemented using skip lists or B+ trees; LevelDB uses a skip list. The parameter `write_buffer_size` (supplied when the DB is opened) controls the size of the MemTable and of the log file.

Once the MemTable reaches `write_buffer_size` (default 4 MB), a new MemTable is created and the original MemTable is made immutable. The immutable MemTable is converted to a new SSTable in the background and added to level 0 of the LSM tree.
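
The write path described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the class `ToyLSM` and its fields are hypothetical, a plain dict stands in for the skip list, sizes are counted as key plus value lengths, and the flush happens inline instead of in a background thread.

```python
class ToyLSM:
    """Toy model of the write path: log append, MemTable insert, flush when full."""

    def __init__(self, write_buffer_size):
        self.write_buffer_size = write_buffer_size  # bytes, like the LevelDB option
        self.log = []            # stand-in for the on-disk log file
        self.memtable = {}       # a dict here; LevelDB itself uses a skip list
        self.memtable_bytes = 0
        self.level0 = []         # each entry is one sorted "SSTable"

    def put(self, key, value):
        self.log.append((key, value))             # 1. append to the log first
        if key in self.memtable:                  # overwriting: drop the old value size
            self.memtable_bytes -= len(self.memtable[key])
        else:
            self.memtable_bytes += len(key)
        self.memtable[key] = value                # 2. then insert into the MemTable
        self.memtable_bytes += len(value)
        if self.memtable_bytes >= self.write_buffer_size:
            self._flush()

    def _flush(self):
        # The full MemTable is frozen and written out sorted by key.
        self.level0.append(sorted(self.memtable.items()))
        self.memtable, self.memtable_bytes = {}, 0
        self.log = []  # the old log is not needed once the data is in an SSTable


db = ToyLSM(write_buffer_size=32)
for i in range(8):
    db.put(f"key{7 - i}", "v" * 6)   # 10 bytes per entry, written in reverse key order
print(len(db.level0), [k for k, _ in db.level0[0]])
```

With a 32-byte buffer and 10-byte entries, every fourth write triggers a flush, so the eight writes end up as two sorted level-0 tables. A larger `write_buffer_size` means fewer, larger flushes, which is the trade-off the tuning exercise explores.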

An _SSTable_ is a file in which the key-value pairs are stored sorted by key. The size of an SSTable is controlled by the parameter `max_file_size` (default 2 MB).

Once the number of SSTables at level 0 reaches a threshold controlled by the parameter `kL0_CompactionTrigger` (default 4), these files are merged with the overlapping files in the next level. If the next level contains no overlapping files, the level-0 files are combined with a merge sort and the result is added to the next level. By default, a new file is created for every 2 MB of data.

For the higher levels, from level 1 up to the last level, compaction (the merging process) is triggered when the level fills up, that is, when the total size of its files exceeds the limit for that level.
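
The merge step of a compaction can be sketched as a k-way merge of sorted runs. The function below is a simplified illustration, not LevelDB's implementation: the name `compact` and its arguments are hypothetical, output files are bounded by entry count instead of bytes, and deletion markers and the selection of overlapping files are ignored.

```python
import heapq


def compact(newer_files, older_files, max_file_entries):
    """Merge sorted runs of (key, value) pairs; the newest value for a key wins.

    `newer_files` must be ordered newest first; all of them are treated as
    newer than anything in `older_files`.
    """
    # Tag each entry with the age of its run (0 = newest) so that, for equal
    # keys, the merge yields the newest version first.
    runs = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(list(newer_files) + list(older_files))
    ]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*runs):
        if key != last_key:          # skip older versions of a key already emitted
            merged.append((key, value))
            last_key = key
    # Split the merged output into files of bounded size.
    return [merged[i:i + max_file_entries]
            for i in range(0, len(merged), max_file_entries)]


level0 = [[("a", "1-new"), ("c", "3")], [("a", "1-old"), ("b", "2")]]
level1 = [[("b", "2-stale"), ("d", "4")]]
print(compact(level0, level1, max_file_entries=2))
```

The output is two files of two entries each, with the stale versions of `a` and `b` dropped. The bound on output file size plays the role that `max_file_size` plays in LevelDB.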

A detailed explanation of how LevelDB works is presented [here](https://github.com/google/leveldb/blob/master/doc/impl.md).


## LevelDB parameter tuning using MLOS

In this lab we will tune some of the important startup parameters of LevelDB and observe how they affect performance. The parameters we will tune are `write_buffer_size` and `max_file_size`, with the goal of optimizing the throughput and latency of LevelDB for sequential and random workloads.

## LevelDB installation: instructions for Ubuntu 18.04

Follow the commands below to get, compile and install LevelDB

```sh
sudo apt update
sudo apt-get install cmake
git clone --recurse-submodules https://github.com/google/leveldb.git
cd leveldb
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .
```

Now, from the `~/leveldb/build` directory, you should be able to execute `./db_bench`, the microbenchmark which can be used to measure the performance of LevelDB for different workloads. 

Please take a look at the `db_bench.cc` file in the `~/leveldb/benchmarks` directory and get an idea about the input parameters and workloads that are possible. 

An example command to run a workload that does random writes of 1M values with value size 100 B is:

```sh
./db_bench --benchmarks=fillrandom --value_size=100 --num=1000000
```

The output of the command will look like this (the numbers might be different):

```txt
LevelDB:    version 1.22
Date:       Thu Oct  8 13:56:00 2020
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
------------------------------------------------
Opening the DB now
In the collect stats thread
Total data written = 421.9 MB   
fillrandom :      31.731 micros/op;    3.5 MB/s
```
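
To use these numbers programmatically, the result line can be parsed with a regular expression. The sketch below assumes result lines of the form shown above (`name : <latency> micros/op; <throughput> MB/s`); the helper name `parse_db_bench` is hypothetical, and benchmarks that do not report a throughput in MB/s will simply not match.

```python
import re

# Matches lines such as "fillrandom :      31.731 micros/op;    3.5 MB/s".
RESULT_LINE = re.compile(
    r"^(?P<name>\w+)\s*:\s*(?P<latency>[\d.]+) micros/op;\s*(?P<throughput>[\d.]+) MB/s",
    re.MULTILINE,
)


def parse_db_bench(output):
    """Return {benchmark: (latency in micros/op, throughput in MB/s)}."""
    return {
        m.group("name"): (float(m.group("latency")), float(m.group("throughput")))
        for m in RESULT_LINE.finditer(output)
    }


sample = """Entries:    1000000
Total data written = 421.9 MB
fillrandom :      31.731 micros/op;    3.5 MB/s
"""
print(parse_db_bench(sample))  # {'fillrandom': (31.731, 3.5)}
```

Matching on the full line shape is less fragile than splitting on `:` and `;`, because header lines such as `Date:` also contain colons.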


In the subsequent cells, we will use Bayesian optimization in MLOS to tune the startup parameters and find the values that give the best throughput and latency.

In [38]:
import subprocess
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import t
from mlos.Optimizers.OptimizationProblem import OptimizationProblem, Objective
from mlos.Optimizers.BayesianOptimizer import BayesianOptimizer
from mlos.Spaces import SimpleHypergrid, ContinuousDimension, DiscreteDimension

from mlos.Optimizers.BayesianOptimizerConfigStore import bayesian_optimizer_config_store
from mlos.Optimizers.BayesianOptimizerFactory import BayesianOptimizerFactory
from mlos.Spaces import Point

# configure the optimizer, start from the default configuration
optimizer_config = bayesian_optimizer_config_store.default
# set the fraction of randomly sampled configurations to 10% of suggestions
optimizer_config.experiment_designer_config.fraction_random_suggestions = .1
# configure the random forest surrogate model
random_forest_config = optimizer_config.homogeneous_random_forest_regression_model_config
# refit the model after each observation
random_forest_config.decision_tree_regression_model_config.n_new_samples_before_refit = 1
# Use the best split in trees (not random as in extremely randomized trees)
random_forest_config.decision_tree_regression_model_config.splitter = 'best'
# right now we're sampling without replacement so we need to subsample
# to make the trees different when using the 'best' splitter
random_forest_config.samples_fraction_per_estimator = .9
# Use 10 trees in the random forest (usually more are better, 10 makes it run pretty quickly)
random_forest_config.n_estimators = 10
# Set multiplier for the confidence bound
optimizer_config.experiment_designer_config.confidence_bound_utility_function_config.alpha = 0.1
# optimizer = optimizer_factory.create_local_optimizer(
#     optimization_problem=optimization_problem,
#     optimizer_config=optimizer_config
# )

In [43]:
# You might have to change the min and max values depending on the startup parameter that you want to explore
parameter_search_space = SimpleHypergrid(
        name='parameter_config',
        dimensions=[
            DiscreteDimension('parameter', min=1*1024*1024, max=64*1024*1024)
        ]
    )

In [44]:
# Optimization Problem
# You might have to change the min and max value based on the objective that you are using 
optimization_problem = OptimizationProblem(
    parameter_space=parameter_search_space,
    objective_space=SimpleHypergrid(name="objectives", dimensions=[ContinuousDimension(name="objective", min=0, max=100)]),
    objectives=[Objective(name="objective", minimize=False)]
)

optimizer_factory = BayesianOptimizerFactory()
optimizer = optimizer_factory.create_local_optimizer(
    optimization_problem=optimization_problem,
    optimizer_config=optimizer_config
)

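def initialize_optimizer():
    # Rebind the module-level optimizer so that run_optimizer() starts from a fresh model
    global optimizer
    optimizer_factory = BayesianOptimizerFactory()
    optimizer = optimizer_factory.create_local_optimizer(
        optimization_problem=optimization_problem,
        optimizer_config=optimizer_config
    )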


In [45]:
# Remember to initialize the optimizer each time before you call run_optimizer()
initialize_optimizer()


In [46]:
# Please change the leveldb_path to the build directory of your leveldb installation
leveldb_path = "$HOME/leveldb/build/"
# Name of the benchmark binary; the workload itself is selected with --benchmarks (take a look at db_bench.cc to see the possible workloads)
command = "db_bench"

# You might have to change run_workload to explore a combination of parameters simultaneously
def run_workload(workload, input_parameter, parameter_value):
    # Build the db_bench command line; add further flags here if you want to
    # specify other input parameters
    cmd = "{}{} --benchmarks={} --{}={}".format(
        leveldb_path, command, workload, input_parameter, parameter_value)
    # shell=True lets the shell expand $HOME in leveldb_path
    result = subprocess.check_output(cmd, shell=True).decode("utf-8")
    # The last line of the output looks like
    # "fillrandom :      31.731 micros/op;    3.5 MB/s"
    stats = result.split(":")[-1].split(";")
    latency = float(stats[0].strip().split(" ")[0])
    throughput = float(stats[1].strip().split(" ")[0])
    return latency, throughput

#optimizer = bayesian_optimizer_factory.create_remote_optimizer(optimization_problem=optimization_problem)
def run_optimizer():
    # Parameter 1: write_buffer_size, e.g. min_value = 1 MB, max_value = 128 MB
    # Parameter 2: max_file_size, e.g. min_value = 1 MB, max_value = 128 MB
    # (the range actually searched is the one set in parameter_search_space)
    # Objectives: latency and throughput, both returned by run_workload
    for i in range(10):
        new_config_values = optimizer.suggest()
        new_parameter_value = new_config_values["parameter"]
        latency, throughput = run_workload("fillrandom", "max_file_size", new_parameter_value)
        print("Parameter value: {}, Objective value: {}".format(str(new_parameter_value),  str(throughput) + " MB/s"))
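        if i > 0:
            # optimum() only sees the observations registered so far, so it does not
            # yet include the value measured in this iteration
            optimum_parameter, optimum_value = optimizer.optimum()
            print("Optimal parameter: {}, Optimal value: {}".format(optimum_parameter["parameter"], optimum_value["objective"]))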
        objectives_df = pd.DataFrame({'objective': [throughput]})
        features_df = new_config_values.to_dataframe()
        optimizer.register(features_df, objectives_df)

# Remember to call initialize_optimizer() before run_optimizer(),
# so that the optimizer does not remember the optimal values from a previous run
run_optimizer()

Parameter value: 14231436, Objective value: 21.8 MB/s
Parameter value: 7465366, Objective value: 21.1 MB/s
Optimal parameter: 14231436, Optimal value: 21.8
Parameter value: 61708892, Objective value: 21.9 MB/s
Optimal parameter: 14231436, Optimal value: 21.8
Parameter value: 13801008, Objective value: 21.5 MB/s
Optimal parameter: 61708892, Optimal value: 21.9
Parameter value: 37672330, Objective value: 21.3 MB/s
Optimal parameter: 61708892, Optimal value: 21.9
Parameter value: 56044662, Objective value: 21.7 MB/s
Optimal parameter: 61708892, Optimal value: 21.9
Parameter value: 4266076, Objective value: 20.8 MB/s
Optimal parameter: 61708892, Optimal value: 21.9
Parameter value: 25616874, Objective value: 22.4 MB/s
Optimal parameter: 61708892, Optimal value: 21.9
Parameter value: 55095476, Objective value: 20.5 MB/s
Optimal parameter: 25616874, Optimal value: 22.4
Parameter value: 2542654, Objective value: 22.0 MB/s
Optimal parameter: 25616874, Optimal value: 22.4


### Verification
Manually run the benchmark for various values of the parameter that you are testing, plot the results, and verify whether the optimum returned by the optimizer matches the one obtained manually.
For example, if the `input_parameter` is `write_buffer_size`, you can start from 2 MB (2097152) and go up to 64 MB (67108864), trying values such as 2 MB, 4 MB, 8 MB, 16 MB, 32 MB and 64 MB. Look for the turning point, i.e. the point where throughput starts to decrease after increasing, or where latency starts to increase after decreasing, and check whether it matches the value returned by the optimizer.
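
A manual sweep of this kind can be sketched as follows. The helper `sweep` is hypothetical; it takes any function that maps a parameter value to a `(latency, throughput)` pair, such as a wrapper around the notebook's `run_workload`, and reports the value with the highest throughput. Plotting is left out to keep the sketch dependency-free.

```python
def sweep(run, sizes):
    """Run `run(size)` for each size and return (results, best_size) by throughput."""
    results = {size: run(size) for size in sizes}   # size -> (latency, throughput)
    best_size = max(results, key=lambda size: results[size][1])
    return results, best_size


# 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB
sizes = [2 ** k * 1024 * 1024 for k in range(1, 7)]
print(sizes)

# Inside the notebook, the sweep could be driven by run_workload, for example:
# results, best = sweep(lambda s: run_workload("fillrandom", "write_buffer_size", s), sizes)
```

Because single benchmark runs are noisy (the optimizer output above varies by about 1 MB/s between neighbouring values), it may help to repeat each measurement a few times and compare averages before drawing conclusions.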

## Going further

1. Choose two parameters from the `leveldb/include/leveldb/options.h` file (these can include `write_buffer_size` and `max_file_size`), tune them both manually and with the optimizer, and compare the results.
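
When tuning two parameters at once, the benchmark command needs one flag per parameter. The sketch below shows one way to build such a command; the helper `build_db_bench_command` is hypothetical, and it assumes that each chosen option has a matching `--name=value` flag in `db_bench.cc`, which should be checked for any option other than the two used in this lab.

```python
import shlex


def build_db_bench_command(binary, workload, params):
    """Build an argument list for db_bench with several --flag=value options."""
    args = [binary, f"--benchmarks={workload}"]
    args += [f"--{name}={value}" for name, value in params.items()]
    return args


cmd = build_db_bench_command(
    "./db_bench",
    "fillrandom",
    {"write_buffer_size": 8 * 1024 * 1024, "max_file_size": 4 * 1024 * 1024},
)
print(shlex.join(cmd))
# ./db_bench --benchmarks=fillrandom --write_buffer_size=8388608 --max_file_size=4194304
```

An argument list like this can be passed to `subprocess.check_output` without `shell=True`, provided the binary path contains no shell variables. The search space would then need one dimension per parameter, with names matching the keys used here.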

### Reference

- <https://wiesen.github.io/post/leveldb-storage-memtable/>
- <https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/>