# 6. Deployment

In the previous sections we factored the shared code out into `util.py`. Here we reuse that code to quickly train a LightGBM model.

In [1]:
DIRECTORY='./data'
TRAIN_FILE='adult/adult.data'
TEST_FILE='adult/adult.test'
MODEL_FILE='model_best.txt'
COLS=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
          'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
          'hours-per-week', 'native-country', 'income']
LABEL_COL='income'

In [2]:
# coding: utf-8
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

import util

## 1. Model Training

Using the techniques covered earlier, we train a first version of the model.

In [3]:
# Read the CSV file
csv_file = util.gen_abspath(DIRECTORY, TRAIN_FILE)
df = util.read_csv(csv_file, sep=',', header=None)
df.columns=COLS

# Features and label
X = df.drop(LABEL_COL, axis=1)  # features
y = df[LABEL_COL].apply(lambda e: 0 if e == ' <=50K' else 1)  # label

# Handle categorical features
cat_feats = [col for col in X.columns if X[col].dtypes == np.dtype('object')]
X = util.label_encoder(X)

# Split the dataset
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
lgb_train = lgb.Dataset(X_train,
                        y_train,
                        categorical_feature=cat_feats, 
                        free_raw_data=True)
lgb_eval = lgb.Dataset(X_test,
                    y_test,
                    reference=lgb_train,
                    categorical_feature=cat_feats, 
                    free_raw_data=True)

# Training parameters: hyperparameters tuned on the Adult dataset
best_params = {
    "objective": "binary",
    "boosting_type": "gbdt",
    "metric": 'auc',
    "num_leaves": 34,
    "learning_rate": 0.08,
    "feature_fraction": 0.4,
    "bagging_fraction": 0.9,  # "subsample" is an alias of this parameter, so it is set only once
    "bagging_freq": 5,
    "verbose": 1
}

# Handle class imbalance
positive_ratio = sum(y_train) / len(y_train)
tolerance = 0.1
if positive_ratio > 0.5 + tolerance or positive_ratio < 0.5 - tolerance:
    # if dataset is highly imbalanced
    weight = util.gen_scale_pos_weight(y_train)
    best_params["scale_pos_weight"] = weight
    print(f"Warning: Sample imbalance, set scale_pos_weight={weight:.3f}")
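The implementation of `util.gen_scale_pos_weight` is not shown in this section. A common choice for `scale_pos_weight` is the ratio of negative to positive samples, and the sketch below assumes that convention. The function name `neg_pos_ratio` is hypothetical, and the class counts are taken from the training log further down (6270 positive, 19778 negative).

```python
def neg_pos_ratio(labels):
    """Return (#negative / #positive), a common value for scale_pos_weight.

    Assumption: labels are 0/1 integers. This is not necessarily what
    util.gen_scale_pos_weight does internally.
    """
    labels = list(labels)
    n_pos = sum(1 for v in labels if v == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples, the ratio is undefined")
    return n_neg / n_pos


# Class counts as reported by LightGBM for this training split
labels = [1] * 6270 + [0] * 19778
positive_ratio = sum(labels) / len(labels)
weight = neg_pos_ratio(labels)

print(f"positive_ratio={positive_ratio:.4f}")  # about 0.24, outside 0.5 +/- 0.1
print(f"scale_pos_weight={weight:.3f}")
```

With these counts the positive ratio falls well outside the tolerance band, so the imbalance branch above is taken.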



In [4]:
# Train
gbm = lgb.train(best_params,
                lgb_train,
                num_boost_round=300,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=30),
                           lgb.log_evaluation(10),
                           util.AdaptiveLearningRate(learning_rate=0.1, decay_rate=0.9, patience=10).callback],
                categorical_feature=cat_feats
)

[LightGBM] [Info] Number of positive: 6270, number of negative: 19778
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000567 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 695
[LightGBM] [Info] Number of data points in the train set: 26048, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.240709 -> initscore=-1.148794
[LightGBM] [Info] Start training from score -1.148794
Training until validation scores don't improve for 30 rounds
Learning rate ==> 0.090 (-0.0100)
[10]	valid_0's auc: 0.913107
Learning rate ==> 0.081 (-0.0090)
[20]	valid_0's auc: 0.922718
Learning rate ==> 0.073 (-0.0081)
[30]	valid_0's auc: 0.926202
Learning rate ==> 0.066 (-0.0073)
[40]	valid_0's auc: 0.928409
Learning rate ==> 0.059 (-0.0066)
[50]	valid_0's auc: 0.929186
Learning rate ==> 0.053 (-0.0059)
[60]	valid_0's auc: 0.93045
Learni
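The learning rates printed in the log are consistent with a step schedule that starts at 0.1 and multiplies by 0.9 roughly every 10 rounds. Whether `util.AdaptiveLearningRate` actually works this way (rather than, say, decaying only when the metric stalls) is an assumption, since its code is not shown here. A minimal sketch of such a schedule, with the hypothetical name `step_decay`:

```python
def step_decay(iteration, base_lr=0.1, decay_rate=0.9, every=10):
    """Learning rate for a 0-based boosting iteration.

    The rate is multiplied by `decay_rate` once per `every` iterations.
    """
    return base_lr * decay_rate ** (iteration // every)


# Rates at the start of each 10-round block, rounded like the log output
schedule = [round(step_decay(i), 3) for i in (0, 10, 20, 30, 40, 50, 60)]
print(schedule)
```

A function like this could be passed to LightGBM's built-in callback as `lgb.reset_parameter(learning_rate=step_decay)`, which calls it with the current iteration index.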

In [5]:
# Evaluate the model
y_pred = gbm.predict(X_test)
y_label, threshold = util.eval_binary(y_true=y_test, y_pred=y_pred, ret=True)

threshold: 0.70668
accuracy: 0.88085
precision: 0.76412
recall: 0.73202
f1_score: 0.74772
auc: 0.93199
cross-entropy loss: 0.32999
True Positive (TP): 1150
True Negative (TN): 4587
False Positive (FP): 355
False Negative (FN): 421
confusion matrix:
[[4587  355]
 [ 421 1150]]
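As a sanity check, the reported accuracy, precision, recall and F1 score can be recomputed from the four confusion-matrix counts. The sketch below only uses the numbers printed above and the standard metric definitions; it does not reproduce how `util.eval_binary` picks the threshold.

```python
# Counts copied from the evaluation output above
tp, tn, fp, fn = 1150, 4587, 355, 421

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.5f}")
print(f"precision: {precision:.5f}")
print(f"recall:    {recall:.5f}")
print(f"f1_score:  {f1:.5f}")
```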


In [6]:
# Save the model
model_path = util.gen_abspath(DIRECTORY, MODEL_FILE)
gbm.save_model(model_path)

<lightgbm.basic.Booster at 0x12ad9bfa0>

## 2. Model Deployment

In [7]:
# Load the model from the model file
bst = lgb.Booster(model_file=model_path)

### 2.1 Offline Deployment

Suppose a single row of data arrives at this point. How do we run inference on it?

In [8]:
# Suppose the row looks like this
csv_file = util.gen_abspath(DIRECTORY, TEST_FILE)
df_test = util.read_csv(csv_file, sep=',', header=None)
df_test.columns=COLS

df_one = df_test.iloc[3:4,:]
df_one

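|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K. |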


First, preprocess it:

In [9]:
# Convert to features and label
XX = df_one.drop(LABEL_COL, axis=1)  # features
yy = df_one[LABEL_COL].apply(lambda e: 0 if e == ' <=50K.' else 1)  # label

# Handle categorical features
XX = util.label_encoder(XX)
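One thing to watch at this step: the category codes used at inference time need to match the ones used during training. How `util.label_encoder` assigns codes is not shown in this section. If it fits a fresh encoder on whatever frame it receives, a single row would get codes unrelated to the training ones. The sketch below illustrates the idea with plain dictionaries; the helper names `fit_category_maps` and `encode_row` and the toy rows are hypothetical.

```python
def fit_category_maps(rows, cat_cols):
    """Build one value -> integer code mapping per categorical column."""
    maps = {}
    for col in cat_cols:
        values = sorted({row[col] for row in rows})
        maps[col] = {value: code for code, value in enumerate(values)}
    return maps


def encode_row(row, maps, unknown=-1):
    """Encode one row with mappings fitted earlier; unseen values get `unknown`."""
    encoded = dict(row)
    for col, mapping in maps.items():
        encoded[col] = mapping.get(row[col], unknown)
    return encoded


# Toy "training" rows (illustrative values only)
train_rows = [
    {"age": 39, "workclass": "State-gov", "sex": "Male"},
    {"age": 50, "workclass": "Private", "sex": "Female"},
    {"age": 38, "workclass": "Self-emp-inc", "sex": "Male"},
]
cat_cols = ["workclass", "sex"]
train_maps = fit_category_maps(train_rows, cat_cols)

one = {"age": 44, "workclass": "State-gov", "sex": "Male"}

# Reusing the training mappings keeps the codes consistent
consistent = encode_row(one, train_maps)

# Refitting on the single row collapses every category to code 0
refit = encode_row(one, fit_category_maps([one], cat_cols))

print(consistent)
print(refit)
```

In practice the fitted mappings (or encoder objects) would be saved next to the model file and loaded together with it.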

Then run the prediction:

In [10]:
y_guess = bst.predict(XX)[0]
y_label = 1 if y_guess > threshold else 0

print(f'y_guess: {y_guess:.5f}')
print(f'y_label: {y_label}')
print(f'y_true: {list(yy)[0]}')

y_guess: 0.95410
y_label: 1
y_true: 1


### 2.2 Online Deployment

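FastAPI can be used for online deployment. Note that in practice you need to validate the incoming fields and design dedicated strategies for problematic data, such as fallback handling on errors, error alerting, and null-value filling.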
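The validation, null-filling and fallback strategies mentioned above might look like the following framework-independent sketch, which a request handler (for example a FastAPI endpoint) could call. All names, default values and the range check here are assumptions chosen for illustration, and `predict_fn` stands in for whatever wraps `bst.predict`.

```python
import math

# Hypothetical defaults used to fill missing numeric fields (assumed values)
NUMERIC_DEFAULTS = {"age": 38, "hours-per-week": 40, "capital-gain": 0, "capital-loss": 0}
FALLBACK_SCORE = 0.0  # assumed score returned when validation or inference fails


def clean_record(record):
    """Fill missing numeric fields and validate ranges.

    Returns (cleaned_record, warnings); raises ValueError on invalid data.
    """
    cleaned, warnings = dict(record), []
    for col, default in NUMERIC_DEFAULTS.items():
        value = cleaned.get(col)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            cleaned[col] = default
            warnings.append(f"{col}: missing, filled with {default}")
    if not 0 <= cleaned["age"] <= 120:
        raise ValueError(f"age out of range: {cleaned['age']}")
    return cleaned, warnings


def safe_predict(record, predict_fn):
    """Run prediction with a fallback so the service always returns a response."""
    try:
        cleaned, warnings = clean_record(record)
        score = float(predict_fn(cleaned))
        return {"score": score, "fallback": False, "warnings": warnings}
    except Exception as exc:
        # A real service would also log the error and trigger an alert here
        return {"score": FALLBACK_SCORE, "fallback": True, "warnings": [str(exc)]}


# Usage with a stub predictor in place of the real model
ok = safe_predict({"age": 44, "hours-per-week": None}, lambda r: 0.9)
bad = safe_predict({"age": 500}, lambda r: 0.9)
print(ok)
print(bad)
```

Keeping this logic separate from the web framework makes it easy to unit test and to reuse in the offline path.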

For more on FastAPI, see my blog post [FastAPI 初见](https://www.luochang.ink/posts/fastapi/) and the accompanying code [calendar-api](https://github.com/luochang212/calendar-api).