Commit 1221b0f

Merge pull request #149 from postgresml/montana/lightgbm

add support for lightgbm

2 parents: e17d409 + bc38e02

File tree

9 files changed: +105 −34 lines changed


pgml-docs/docs/user_guides/setup/gpu_support.md (25 additions, 4 deletions)

````diff
@@ -1,12 +1,33 @@
 # GPU Support
 
-PostgresML is capable of leveraging GPUs when the underlying libraries and hardware are properly configured on the database.
+PostgresML is capable of leveraging GPUs when the underlying libraries and hardware are properly configured on the database server.
+
+!!! tip
+    Models trained on GPU will also require GPU support to make predictions.
 
 ## XGBoost
-XGBoost is currently the only integrated library that provides GPU accellaration. GPU setup for this library is covered in the [xgboost documentation](https://xgboost.readthedocs.io/en/stable/gpu/index.html). Additionally, you'll need to pass `pgml.train('GPU project', hyperparams => '{tree_method: "gpu_hist"}')` to take advantage during training.
+GPU setup for XGBoost is covered in the [xgboost documentation](https://xgboost.readthedocs.io/en/stable/gpu/index.html).
+
+!!! example
+    ```sql linenums="1"
+    pgml.train(
+        'GPU project',
+        algorithm => 'xgboost',
+        hyperparams => '{"tree_method" : "gpu_hist"}'
+    );
+    ```
+
+## LightGBM
+GPU setup for LightGBM is covered in the [lightgbm documentation](https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html).
 
-!!! warning
-    XGBoost models trained on GPU will also require GPU support to make predictions.
+!!! example
+    ```sql linenums="1"
+    pgml.train(
+        'GPU project',
+        algorithm => 'lightgbm',
+        hyperparams => '{"device" : "gpu"}'
+    );
+    ```
 
 ## Scikit-learn
 None of the scikit-learn algorithms natively support GPU devices. There are a few projects to improve scikit performance with additional parralellism, although we currently have not integrated these with PostgresML:
````
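The rewritten docs above quote the JSON keys in the hyperparams string; the removed one-liner passed `'{tree_method: "gpu_hist"}'`, which is not valid JSON. As a minimal sketch (not PostgresML's actual implementation), a hyperparams string like these can be parsed with the standard `json` module and then splatted into an estimator constructor:

```python
import json

def parse_hyperparams(hyperparams: str) -> dict:
    """Parse a pgml.train-style JSON hyperparams string into a kwargs dict."""
    return json.loads(hyperparams)

# Keys must be quoted, as in the corrected docs above.
xgb_params = parse_hyperparams('{"tree_method" : "gpu_hist"}')
lgbm_params = parse_hyperparams('{"device" : "gpu"}')

# The resulting dicts would then be forwarded as keyword arguments,
# e.g. xgboost.XGBRegressor(**xgb_params) or lightgbm.LGBMRegressor(**lgbm_params).
```

Unquoted keys such as `{tree_method: "gpu_hist"}` make `json.loads` raise `JSONDecodeError`, which is presumably why the example was corrected.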

pgml-docs/docs/user_guides/training/algorithm_selection.md (2 additions, 1 deletion)

```diff
@@ -10,11 +10,12 @@ The PostgresML dashboard makes it easy to compare various algorithms on your dat
 ![Model Selection](/images/dashboard/models.png)
 
 
-## XGBoost
+## Gradient Boosting
 Algorithm | Regression | Classification
 --- | --- | ---
 `xgboost` | [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor) | [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)
 `xgboost_random_forest` | [XGBRFRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFRegressor) | [XGBRFClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFClassifier)
+`lightgbm` | [LGBMRegressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor) | [LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)
 
 ## Scikit Ensembles
 Algorithm | Regression | Classification
```
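The renamed "Gradient Boosting" table pairs each algorithm name with a regressor class and a classifier class. A minimal sketch of that lookup, using class-name strings rather than the real imports (the `task` values here are illustrative, not PostgresML's API):

```python
# Gradient boosting algorithms from the table above, as
# (regressor, classifier) class-name pairs.
GRADIENT_BOOSTING = {
    "xgboost": ("XGBRegressor", "XGBClassifier"),
    "xgboost_random_forest": ("XGBRFRegressor", "XGBRFClassifier"),
    "lightgbm": ("LGBMRegressor", "LGBMClassifier"),
}

def class_name_for(algorithm: str, task: str) -> str:
    """Pick the estimator class name for an algorithm given the task type."""
    regressor, classifier = GRADIENT_BOOSTING[algorithm]
    return regressor if task == "regression" else classifier
```

All three libraries expose a scikit-learn-compatible estimator pair, which is what lets them share one dispatch table.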

pgml-extension/examples/binary_classification.sql (12 additions, 5 deletions)

```diff
@@ -48,8 +48,11 @@ SELECT malignant, pgml.predict(
 FROM pgml.breast_cancer
 LIMIT 10;
 
+--
 -- After a project has been trained, ommited parameters will be reused from previous training runs
 -- In these examples we'll reuse the training data snapshots from the initial call.
+--
+
 -- linear models
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'ridge');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'stochastic_gradient_descent');
@@ -60,20 +63,24 @@ SELECT * FROM pgml.train('Breast Cancer', algorithm => 'passive_aggressive');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'svm');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'nu_svm');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'linear_svm');
+
 -- ensembles
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'ada_boost');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'bagging');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
--- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
--- SELECT * FROM pgml.train('Breast Cancer', algorithim => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');
 SELECT * FROM pgml.train('Breast Cancer', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');
+
 -- other
 -- Gaussian Process is too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train('Breast Cancer', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');
--- XGBoost
-SELECT * FROM pgml.train('Breast Cancer', algorithm => 'xgboost');
-SELECT * FROM pgml.train('Breast Cancer', algorithm => 'xgboost_random_forest');
+
+-- Gradient Boosting
+SELECT * FROM pgml.train('Breast Cancer', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Breast Cancer', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Breast Cancer', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
+-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train('Breast Cancer', algorithim => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');
 
 
 -- check out all that hard work
```

pgml-extension/examples/image_classification.sql (13 additions, 4 deletions)

```diff
@@ -25,30 +25,39 @@ SELECT target, pgml.predict('Handwritten Digit Image Classifier', image) AS pred
 FROM pgml.digits
 LIMIT 10;
 
+--
 -- After a project has been trained, ommited parameters will be reused from previous training runs
 -- In these examples we'll reuse the training data snapshots from the initial call.
+--
+
 -- linear models
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'ridge');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'stochastic_gradient_descent');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'perceptron');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'passive_aggressive');
+
 -- support vector machines
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'svm');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'nu_svm');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'linear_svm');
+
 -- ensembles
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'ada_boost');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'bagging');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
--- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
--- SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');
 SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');
+
 -- other
 -- Gaussian Process is too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');
-SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'xgboost');
-SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'xgboost_random_forest');
+
+-- gradient boosting
+SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
+-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train('Handwritten Digit Image Classifier', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');
 
 
 -- check out all that hard work
```

pgml-extension/examples/joint_regression.sql (15 additions, 8 deletions)

```diff
@@ -18,7 +18,7 @@ SELECT weight, waste, pulse, pgml.predict_joint('Exercise vs Physiology', ARRAY[
 FROM pgml.linnerud
 LIMIT 10;
 
--- -- linear models
+-- linear models
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'ridge');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'lasso');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'elastic_net');
@@ -34,25 +34,32 @@ SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'theil_sen
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'huber');
 -- Quantile Regression too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'quantile');
---- support vector machines
+
+-- support vector machines
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');
--- -- ensembles
+
+-- ensembles
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
--- -- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
--- SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');
 SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');
+
 -- other
---SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'kernel_ridge');
-SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'xgboost');
-SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'xgboost_random_forest');
+-- Kernel Ridge is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'kernel_ridge');
 -- Gaussian Process is too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'gaussian_process');
 
+-- gradient boosting
+SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
+-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train_joint('Exercise vs Physiology', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');
+
 -- check out all that hard work
 SELECT trained_models.* FROM pgml.trained_models
 JOIN pgml.models on models.id = trained_models.id
```

pgml-extension/examples/multi_classification.sql (13 additions, 5 deletions)

```diff
@@ -18,31 +18,39 @@ SELECT target, pgml.predict('Iris Classifier', ARRAY[sepal_length, sepal_width,
 FROM iris_view
 LIMIT 10;
 
+--
 -- After a project has been trained, ommited parameters will be reused from previous training runs
 -- In these examples we'll reuse the training data snapshots from the initial call.
+--
+
 -- linear models
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'ridge');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'stochastic_gradient_descent');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'perceptron');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'passive_aggressive');
+
 -- support vector machines
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'svm');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'nu_svm');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'linear_svm');
+
 -- ensembles
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'ada_boost');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'bagging');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
--- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
--- SELECT * FROM pgml.train('Iris Classifier', algorithim => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');
 SELECT * FROM pgml.train('Iris Classifier', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');
+
 -- other
 -- Gaussian Process is too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train('Iris Classifier', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');
--- XGBoost
-SELECT * FROM pgml.train('Iris Classifier', algorithm => 'xgboost');
-SELECT * FROM pgml.train('Iris Classifier', algorithm => 'xgboost_random_forest');
+
+-- gradient boosting
+SELECT * FROM pgml.train('Iris Classifier', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Iris Classifier', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Iris Classifier', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
+-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train('Iris Classifier', algorithim => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');
 
 
 -- check out all that hard work
```

pgml-extension/examples/regression.sql (19 additions, 6 deletions)

```diff
@@ -37,6 +37,11 @@ CROSS JOIN LATERAL (
 ) models
 LIMIT 10;
 
+--
+-- After a project has been trained, ommited parameters will be reused from previous training runs
+-- In these examples we'll reuse the training data snapshots from the initial call.
+--
+
 -- linear models
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
@@ -53,25 +58,33 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyper
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');
 -- Quantile Regression too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');
---- support vector machines
+
+-- support vector machines
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');
+
 -- ensembles
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
--- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
--- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');
 SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');
+
 -- other
---SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
-SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost');
-SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest');
+-- Kernel Ridge is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
 -- Gaussian Process is too expensive for normal tests on even a toy dataset
 -- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process');
 
+-- gradient boosting
+SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
+SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
+-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
+-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');
+
+
 -- check out all that hard work
 SELECT trained_models.* FROM pgml.trained_models
 JOIN pgml.models on models.id = trained_models.id
```

pgml-extension/pgml_extension/model.py (4 additions, 0 deletions)

```diff
@@ -10,6 +10,7 @@
 import numpy
 import xgboost as xgb
 import diptest
+import lightgbm
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import (
     mean_squared_error,
@@ -445,6 +446,8 @@ def algorithm_from_name_and_objective(cls, name: str, objective: str):
             "xgboost_classification": xgb.XGBClassifier,
             "xgboost_random_forest_regression": xgb.XGBRFRegressor,
             "xgboost_random_forest_classification": xgb.XGBRFClassifier,
+            "lightgbm_regression": lightgbm.LGBMRegressor,
+            "lightgbm_classification": lightgbm.LGBMClassifier,
         }[name + "_" + objective]
 
     @classmethod
@@ -659,6 +662,7 @@ def algorithm(self):
             "linear_svm",
             "ada_boost",
             "gradient_boosting_trees",
+            "lightgbm",
         ]:
             self._algorithm = sklearn.multioutput.MultiOutputRegressor(self._algorithm)
```
pgml-extension/setup.py (2 additions, 1 deletion)

```diff
@@ -84,9 +84,10 @@ def install_sql(filename, database_url):
         'install': InstallCommand,
     },
     install_requires=[
+        "diptest",
         "sklearn",
         "xgboost",
-        "diptest",
+        "lightgbm",
     ],
     extras_require={"dev": "pytest"},
     packages=setuptools.find_packages(exclude=("tests",)),
```
