IML-Predictor is a tool for predicting the Impala query memory limit based on supervised learning. We collect a certain number of Impala queries and their details as samples, extract features, and label them as training data for models, which are then used to predict memory limits for newly incoming Impala queries.
- Impala(>=impala-2.5.0+cdh5.7.2)
- Python(>=3.5)
- Scala(>=2.10.1)
Important: tested OK on CDH 5.7.2 and 5.12.1; other versions are not guaranteed to work.
git clone https://github.com/gridsum/IML-predictor.git
cd IML-predictor
sudo apt-get -y install --no-install-recommends $(cat depend_ubuntu)
python3 -m pip install -r depend_pip3
IML-Predictor depends on impala-toolbox/data-divertor, which collects query details from the Cloudera Manager API.
You need to deploy data-divertor in your Hadoop cluster and keep it running.
features is the Spark application for collecting Impala queries and their details. Package it and upload the jar to a Spark node:
cd features
mvn clean package
cd target
mv feature-engineering-1.0-SNAPSHOT-jar-with-dependencies.jar feature-engineering.jar
scp feature-engineering.jar username@remoteip:/home/username
Generate a shell script for running the Spark task and upload it to the Spark node:
python generate_spark_submit.py
chmod +x spark.sh
scp spark.sh username@remoteip:/home/username
Log in to the remote server and create the directory that will hold the feature data:
ssh username@remoteip
hadoop fs -mkdir -p iml-predictor/feature
Before building models, run spark.sh on the Spark node with start_day and end_day parameters in the format "%Y%m%d", for example:
./spark.sh 20180312 20180318
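Since the collection window typically moves forward over time, a small wrapper can compute the dates before invoking spark.sh. The sketch below is illustrative and not part of the project; it assumes GNU `date` (as shipped on Ubuntu), and the actual spark.sh invocation is left commented out:

```shell
# Hypothetical wrapper: compute a trailing 7-day window in %Y%m%d format,
# then hand it to spark.sh (uncomment the last line on a real Spark node).
start_day=$(date -d '7 days ago' +%Y%m%d)
end_day=$(date -d 'yesterday' +%Y%m%d)
echo "collection window: $start_day $end_day"
# ./spark.sh "$start_day" "$end_day"
```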
Start the IML-Predictor application:
python -m memory.server
If your local server cannot run Spark applications, set SparkSubmit.LOCAL=False in memory/model/settings.py
and make sure you have executed the spark.sh script on the Spark node before building models.
Post http://ip:port/v1/impala/memory/model_build
Parameters(json format)
Name | Type | description |
---|---|---|
start_day | int or string | start day of the query collection window, in format "%Y%m%d", e.g. "20180101" |
end_day | int or string | end day of the query collection window, in format "%Y%m%d", e.g. "20180110" |
generate_feature(optional) | bool | set it to False to reuse features collected before |
cross_validate(optional) | bool | set it to True to save the cross-validation accuracy results |
Name | Type | description |
---|---|---|
message | string | information about whether model building was started |
error_code | int | 0 if model building started successfully, otherwise -1 |
Status: 200 OK
{"message": "Models are being built now, try after some time", "error_code": -1}
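The build endpoint above can be driven from Python. The sketch below is illustrative only: the helper names and the use of the standard-library `urllib` are assumptions, not part of the project, and `http://ip:port` must be replaced with your deployment's address.

```python
import json
import urllib.request

def build_payload(start_day, end_day, generate_feature=True, cross_validate=False):
    """Assemble the JSON body for POST /v1/impala/memory/model_build.
    (Hypothetical helper; field names follow the parameter table above.)"""
    return {
        "start_day": start_day,
        "end_day": end_day,
        "generate_feature": generate_feature,
        "cross_validate": cross_validate,
    }

def post_model_build(base_url, payload):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + "/v1/impala/memory/model_build",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

payload = build_payload("20180312", "20180318")
print(json.dumps(payload))
# post_model_build("http://ip:port", payload)  # needs a running server
```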
Get http://ip:port/v1/impala/memory/model_status
Name | Type | description |
---|---|---|
status_code | int | 0 is for Finished, 1 is for Running, 2 is for Failed |
status_str | string | one of "Finished", "Running", or "Failed" |
Status: 200 OK
{"status_code": 1, "status_str": "Running"}
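Because model building is asynchronous, a client typically polls this status endpoint until the build leaves the Running state. A minimal polling sketch (the helper names and `urllib` usage are assumptions, not part of the project):

```python
import json
import time
import urllib.request

# Status codes as documented above: 0 = Finished, 1 = Running, 2 = Failed.
STATUS = {0: "Finished", 1: "Running", 2: "Failed"}

def parse_status(body):
    """Decode a model_status JSON body and sanity-check the two fields."""
    doc = json.loads(body)
    assert STATUS[doc["status_code"]] == doc["status_str"]
    return doc["status_code"]

def wait_for_build(base_url, interval=30):
    """Poll until the build finishes or fails; return the final status code."""
    while True:
        url = base_url + "/v1/impala/memory/model_status"
        with urllib.request.urlopen(url) as resp:
            code = parse_status(resp.read().decode("utf-8"))
        if code != 1:  # 1 means still Running
            return code
        time.sleep(interval)

print(parse_status('{"status_code": 1, "status_str": "Running"}'))  # → 1
# wait_for_build("http://ip:port")  # needs a running server
```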
Post http://ip:port/v1/impala/memory/predict
Parameters(json format)
Name | Type | description |
---|---|---|
sql | string | the SQL statement to predict a memory limit for |
db | string | the database the SQL will execute in |
pool(optional) | string | the Impala pool the SQL will execute in |
Name | Type | description |
---|---|---|
mem | int | predicted memory limit for the SQL, in MB |
error_code | int | 0 on success |
{"mem": 400, "error_code": 0}
Name | Type | description |
---|---|---|
message | string | message explaining why the prediction failed |
error_code | int | error code identifying the failure (see the table below) |
Value | description |
---|---|
1 | PARAMETER_ERROR |
2 | IMPALA_CONNECT_ERROR |
3 | IMPALA_QUERY_ERROR |
4 | NetWorkError |
5 | MODEL_FILE_NOT_FOUND_ERROR |
6 | UNKNOWN_ERROR |
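Putting the predict request and the error-code table together, a client sketch might look like the following. The helper names and `urllib` usage are assumptions, not part of the project; the error-code mapping mirrors the table above:

```python
import json
import urllib.request

# Error codes for a failed predict call, as listed in the table above.
ERROR_CODES = {
    1: "PARAMETER_ERROR",
    2: "IMPALA_CONNECT_ERROR",
    3: "IMPALA_QUERY_ERROR",
    4: "NetWorkError",
    5: "MODEL_FILE_NOT_FOUND_ERROR",
    6: "UNKNOWN_ERROR",
}

def predict_payload(sql, db, pool=None):
    """JSON body for POST /v1/impala/memory/predict; pool is optional."""
    body = {"sql": sql, "db": db}
    if pool is not None:
        body["pool"] = pool
    return body

def predict(base_url, sql, db, pool=None):
    """POST a predict request; return (mem_mb, None) on success,
    or (None, error_name) on failure."""
    req = urllib.request.Request(
        base_url + "/v1/impala/memory/predict",
        data=json.dumps(predict_payload(sql, db, pool)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        doc = json.loads(resp.read().decode("utf-8"))
    if doc.get("error_code", 0) == 0:
        return doc["mem"], None
    return None, ERROR_CODES.get(doc["error_code"], "UNKNOWN_ERROR")

print(predict_payload("select count(*) from t1", "default", pool="root.default"))
# predict("http://ip:port", "select count(*) from t1", "default")  # needs a server
```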
impala-toolbox-help@gridsum.com
IML-Predictor is licensed under the Apache License 2.0.