-
Notifications
You must be signed in to change notification settings - Fork 115
Home
ALEX is a ML-enhanced range index, similar in functionality to a B+ Tree. Our implementation is a near drop-in replacement for std::map or std::multimap. You can learn more about ALEX in our SIGMOD 2020 paper.
Getting Started
Design Overview
API Documentation
Contributing
ALEX can be used as a header-only library.
All relevant header files are found in src/core.
You will need to compile your program with at least the C++14 standard (e.g., -std=c++14
).
In this repository, we include three programs that you can compile and run:
- An example program of how to use ALEX.
- A simple benchmark that measures the throughput of running point lookups and inserts on ALEX (explained in detail below).
- Unit tests.
On Windows, simply load this repository into Visual Studio as a CMake project. On Linux/Mac, use the following commands:
# Build using CMake, which creates a new build directory
./build.sh
# Run example program
./build/example
# Run unit tests
./build/test_alex
To run the benchmark on a synthetic dataset with 1000 normally-distributed keys:
./build/benchmark \
--keys_file=resources/sample_keys.bin \
--keys_file_type=binary \
--init_num_keys=500 \
--total_num_keys=1000 \
--batch_size=1000 \
--insert_frac=0.5
However, to observe the true performance of ALEX, we must run on a much larger dataset. You can download a 200M-key (1.6GB) dataset from Google Drive. To run one example workload on this dataset:
./build/benchmark \
--keys_file=[download location] \
--keys_file_type=binary \
--init_num_keys=10000000 \
--total_num_keys=20000000 \
--batch_size=1000000 \
--insert_frac=0.5 \
--lookup_distribution=zipf \
--print_batch_stats
You can also run this benchmark on your own dataset.
Your keys will need to be in either binary format or text format (one key per line).
If the data type of your keys is not double
, you will need to modify #define KEY_TYPE double
to
#define KEY_TYPE [your data type]
in src/benchmark/main.cpp.
The four datasets we used in our SIGMOD 2020 paper are publicly available (all in binary format):
- Longitudes (200M 8-byte floats)
- Longlat (200M 8-byte floats)
- Lognormal (190M 8-byte signed ints)
- YCSB (200M 8-byte unsigned ints)
Like the B+ Tree, ALEX is a data structure that indexes sorted data and supports workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. Internally, ALEX uses a collection of linear regressions, organized hierarchically into a tree, to model the distribution of keys. ALEX uses this model to efficiently search for data records by their key. ALEX also automatically adapts its internal models and tree structure to efficiently support writes.
ALEX is inspired by the original learned index from Kraska et al.. However, that work only supports reads (i.e., point lookups and range queries), while ALEX also efficiently supports write (i.e., inserts, updates, and deletes).
In our paper, we show that ALEX outperforms alternatives in both speed and size:
- On read-only workloads, ALEX beats the original learned index from Kraska et al. by up to 2.2X on performance with up to 15X smaller index size.
- Across the spectrum of read-write workloads, ALEX beats B+ Trees (implemented by STX B+ Tree) by up to 4.1X while never performing worse, with up to 2000X smaller index size.
You can find many more details about ALEX here.
ALEX currently operates in memory, single threaded, and on numerical keys. We are considering ways to add support for persistence, concurrency, and string keys to ALEX.
In terms of performance, ALEX has a couple known limitations:
- The premise of ALEX is to model the key distribution using a collection of linear regressions. Therefore, ALEX performs poorly when the key distribution is difficult to model with linear regressions, i.e., when the key distribution is highly nonlinear at small scales. A possible future research direction is to use a broader class of modeling techniques (e.g., also consider polynomial regression models).
- ALEX can have poor performance in the presence of extreme outlier keys, which can cause the key domain and ALEX's tree depth to become unnecessarily large (see Section 5.1 of our paper). A possible future research direction is to add special logic for handling extreme outliers, or to have a modeling strategy that is robust to sparse key spaces.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.