# Fit LightGBM to Wine Quality data

## Load Dependencies

In [103]:
import joblib
import lightgbm as lgb

from sklearn.compose import make_column_transformer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

from optuna.distributions import IntUniformDistribution, UniformDistribution
from optuna.integration import OptunaSearchCV

## Load Data

In [104]:
%run ../data/data.py

To simplify the problem to a binary classification task, we redefine the target variable to flag 'high quality' wines, defined as wines with a rating of 7 or higher.

In [105]:
X, y = load_wine_quality(return_X_y=True, binary=True)
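
The binarization described above amounts to a simple threshold on the rating. A minimal sketch, assuming a hypothetical `quality` series of integer ratings (the `load_wine_quality` helper lives in `../data/data.py` and is not shown here, so its internals may differ):

```python
import pandas as pd

# Hypothetical quality ratings; the real data comes from load_wine_quality().
quality = pd.Series([5, 6, 7, 8, 4, 7], name="quality")

# A wine counts as 'high quality' when its rating is 7 or higher.
y_binary = (quality >= 7).astype(int)

print(y_binary.tolist())
```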

## Fit LightGBM

Data preprocessing for this task is relatively simple. The Wine Quality data has no missing values, and there is only a single categorical variable: the type of wine, which we one-hot encode. As a tree-based method, LightGBM is not affected by feature scale, so no normalization is required.
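
The scale-invariance claim can be checked on toy data. The sketch below uses scikit-learn's `DecisionTreeClassifier` as a stand-in for LightGBM (an assumption made to keep the example light; the data and variable names are hypothetical). Tree splits depend only on the ordering of feature values, so rescaling a feature by a positive constant should leave the predictions unchanged:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy integer-valued features and a label that depends on the first feature.
X_toy = rng.integers(0, 100, size=(200, 3)).astype(float)
y_toy = (X_toy[:, 0] + rng.integers(0, 20, size=200) > 60).astype(int)

# Rescale every feature by a constant factor (a power of two keeps the
# floating-point arithmetic exact).
X_scaled = X_toy * 1024.0

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y_toy)

# Both trees partition the samples identically, so predictions agree.
same_predictions = np.array_equal(tree_raw.predict(X_toy), tree_scaled.predict(X_scaled))
print(same_predictions)
```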

In [107]:
columns_categorical = X.select_dtypes('object').columns
columns_numeric = X.select_dtypes(exclude='object').columns

In [108]:
feature_pipeline = make_column_transformer(
    (OneHotEncoder(), columns_categorical),
    ('passthrough', columns_numeric),
)
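
To see what such a column transformer produces, here is a small sketch on a hypothetical three-row frame. The `Alcohol` and `pH` column names and all values are made up for illustration; only `Type` is taken from the notebook:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the wine data: one categorical, two numeric columns.
toy = pd.DataFrame({
    "Type": ["Red", "White", "White"],
    "Alcohol": [9.4, 10.1, 12.8],
    "pH": [3.51, 3.20, 3.16],
})

toy_transformer = make_column_transformer(
    (OneHotEncoder(), ["Type"]),
    ("passthrough", ["Alcohol", "pH"]),
)

# fit_transform may return a sparse matrix, so convert defensively.
encoded = toy_transformer.fit_transform(toy)
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded

print(encoded.shape)  # 2 one-hot columns + 2 passthrough columns
```

The one-hot columns come first because the encoder is listed first; the numeric columns are passed through unchanged.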

In [109]:
gbm = lgb.LGBMClassifier()

In [110]:
gbm_pipeline = make_pipeline(feature_pipeline, gbm)

### Hyperparameter tuning

The LightGBM model has a long list of hyperparameters to tune; a complete list can be found in the [parameter documentation](https://lightgbm.readthedocs.io/en/latest/Parameters.html). We will use Bayesian search as implemented in the Optuna package, via its scikit-learn-compatible `OptunaSearchCV` wrapper. Compared to naive random search, this approach typically finds a good set of hyperparameters in fewer iterations, because it uses the results of earlier trials to decide which areas of the parameter space are worth exploring next.

The exact parameter ranges used are motivated by the explanations in the official [parameter tuning guide](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html) as well as this [article](https://towardsdatascience.com/machine-learning-kaggle-competition-part-three-optimization-db04ea415507).

In [111]:
param_distributions = {
    'num_leaves': IntUniformDistribution(8, 128),
    'learning_rate': UniformDistribution(0.005, 0.5),
    'min_child_samples': IntUniformDistribution(10, 200), 
    'min_child_weight': UniformDistribution(1e-5, 1e-2),
    'subsample': UniformDistribution(0.2, 1.0), 
    'colsample_bytree': UniformDistribution(0.4, 1.0),
    'reg_alpha': UniformDistribution(0., 100.),
}

In [112]:
model_name = gbm_pipeline.steps[-1][0]
param_distributions = {model_name+'__'+key: value for key, value in param_distributions.items()}
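
The renaming step is needed because parameters of an estimator inside a scikit-learn pipeline are addressed as `<step name>__<parameter>`, and `make_pipeline` names each step after its lowercased class name. A small sketch of the same pattern, using `DecisionTreeClassifier` as a stand-in for `LGBMClassifier` (the toy names are hypothetical):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline; DecisionTreeClassifier replaces LGBMClassifier here.
toy_pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier())

# make_pipeline names each step after its lowercased class name.
step_name = toy_pipeline.steps[-1][0]

# Prefix a parameter with '<step name>__' so the pipeline can route it.
toy_params = {step_name + "__" + key: value for key, value in {"max_depth": 3}.items()}
toy_pipeline.set_params(**toy_params)

print(step_name, toy_pipeline.steps[-1][1].max_depth)
```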

If cross-validation folds are created purely at random, there is a chance that a training fold does not contain all possible values of the categorical variable 'Type'. We therefore use stratified folds, stratifying on 'Type' rather than on the target, to ensure that both 'Red' and 'White' wines are always represented.

In [113]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=142)
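
As a quick illustration of why stratifying on the wine type helps, the sketch below splits a hypothetical, heavily imbalanced `wine_type` array and checks that every fold sees both categories. This matters because `OneHotEncoder` with its default `handle_unknown='error'` raises when it meets a category at transform time that was absent during fitting:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced 'Type' column: 90 white wines, 10 red wines.
wine_type = np.array(["White"] * 90 + ["Red"] * 10)
X_dummy = np.zeros((len(wine_type), 1))  # features are irrelevant for splitting

toy_skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=142)

# Stratifying on wine_type keeps the Red/White ratio roughly constant per fold.
folds_ok = all(
    set(wine_type[train_idx]) == {"Red", "White"}
    and set(wine_type[test_idx]) == {"Red", "White"}
    for train_idx, test_idx in toy_skf.split(X_dummy, wine_type)
)
print(folds_ok)
```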

In [114]:
optuna_search = OptunaSearchCV(
    gbm_pipeline,
    param_distributions,
    n_trials=25*len(param_distributions),
    n_jobs=-1,
    cv=skf.split(X, X['Type']),
    scoring='f1',
    random_state=142,
)

In [117]:
optuna_search.fit(X, y)

h value: 0.7004532379393243.
[I 2021-05-18 23:22:15,211] Trial 133 finished with value: 0.6932861171956944 and parameters: {'lgbmclassifier__num_leaves': 39, 'lgbmclassifier__learning_rate': 0.4399559031898858, 'lgbmclassifier__min_child_samples': 33, 'lgbmclassifier__min_child_weight': 0.0010428609221597896, 'lgbmclassifier__subsample': 0.4732487643233036, 'lgbmclassifier__colsample_bytree': 0.8363961112119015, 'lgbmclassifier__reg_alpha': 1.5777889487329357}. Best is trial 103 with value: 0.7004532379393243.
[I 2021-05-18 23:22:16,977] Trial 145 finished with value: 0.2736132412447302 and parameters: {'lgbmclassifier__num_leaves': 49, 'lgbmclassifier__learning_rate': 0.4369507635815919, 'lgbmclassifier__min_child_samples': 41, 'lgbmclassifier__min_child_weight': 0.005264274832692974, 'lgbmclassifier__subsample': 0.5048085398671435, 'lgbmclassifier__colsample_bytree': 0.7856626679742762, 'lgbmclassifier__reg_alpha': 68.1919339123242}. Best is trial 103 with v

OptunaSearchCV(cv=<generator object _BaseKFold.split at 0x7f8883962820>,
               estimator=Pipeline(steps=[('columntransformer',
                                          ColumnTransformer(transformers=[('onehotencoder',
                                                                           OneHotEncoder(),
                                                                           Index(['Type'], dtype='object')),
                                                                          ('passthrough',
                                                                           'passthrough',
                                                                           Index(['Fixed Acidity', 'Volatile Acidity', 'Citric Acid', 'Residual Sugar',
       'Chlorides', 'Free Sulfur Dioxide', 'Total Su...
                                    'lgbmclassifier__min_child_samples': IntUniformDistribution(high=200, low=10, step=1),
                                    'lgbmclassifier__min_chi

In [None]:
trained_model = optuna_search.best_estimator_

In [7]:
trained_model.training_data = {'X': X, 'y': y}

In [8]:
_ = joblib.dump(trained_model, 'wine_quality_lightgbm.pkl')
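
A saved model can later be restored with `joblib.load`. The sketch below shows the round trip on a stand-in estimator (`DummyClassifier` and the file name `toy_model.pkl` are assumptions for illustration, not part of this notebook) and checks that an attribute attached before saving survives:

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Stand-in for the tuned pipeline; any picklable estimator behaves the same way.
toy_model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 1, 1])
toy_model.training_data = {"X": [[0], [1], [2]], "y": [0, 1, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# Both the fitted state and the extra attribute survive the round trip.
print(restored.predict([[5]]), restored.training_data["y"])
```

Since joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same library versions that were used for saving.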