# Gradient Boosting Machine
Boosting trains multiple weak learners sequentially, giving more weight to the data that earlier learners predicted incorrectly.

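- AdaBoost: combines the weak learners by weight, increasing the weight of wrongly predicted data
- GBM (Gradient Boosting Machine): updates the weights using gradient descent
    - What is gradient descent?
        - An iterative method that chooses the update direction that minimizes the difference between the prediction function and the actual values (it follows the slope, i.e. the rate of change, of the error curve)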
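
The idea above can be sketched for the simplest case, regression with squared error, where the negative gradient of the loss with respect to the current prediction is just the residual `y - pred`. This is a minimal illustration, not how `GradientBoostingClassifier` is implemented internally; the helper names `fit_gbm` and `predict_gbm` and the synthetic sine data are assumptions made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbm(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    # Hypothetical helper: start from a constant prediction (the mean
    # minimizes squared error), then add small trees one at a time.
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):
        # For L = 0.5 * (y - F)^2 the negative gradient w.r.t. F is y - F.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Take a small step in the direction that reduces the loss.
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_gbm(base, trees, X, learning_rate=0.1):
    # Sum the constant start value and every tree's scaled correction.
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred


# Synthetic data (assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_gbm(X, y)
mse_start = float(np.mean((y - base) ** 2))
mse_end = float(np.mean((y - predict_gbm(base, trees, X)) ** 2))
print("MSE with constant model: {0:.4f}".format(mse_start))
print("MSE after boosting:      {0:.4f}".format(mse_end))
```

Each round fits a new weak learner to what the current ensemble still gets wrong, so the training error shrinks step by step. The `learning_rate` controls how large each step is.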

In [1]:
import pandas as pd


def get_new_feature_name_df(old_feature_name_df):
    # Number repeated feature names: 0 for the first occurrence, 1 for the second, ...
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby("column_name").cumcount(),
                                  columns=["dup_cnt"])
    feature_dup_df = feature_dup_df.reset_index()
    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how="outer")
    # Append "_<n>" to duplicated names so that every column name is unique.
    # Label-based access avoids the deprecated positional x[0] / x[1] on a Series.
    new_feature_name_df["column_name"] = new_feature_name_df[["column_name", "dup_cnt"]] \
        .apply(lambda x: x["column_name"] + '_' + str(x["dup_cnt"])
               if x["dup_cnt"] > 0 else x["column_name"], axis=1)
    new_feature_name_df = new_feature_name_df.drop(["index"], axis=1)
    return new_feature_name_df


def get_human_dataset():
    # Feature names are whitespace-separated: "<index> <name>" per line
    feature_name_df = pd.read_csv("../human_activity/features.txt", sep=r"\s+",
                                  header=None, names=["column_index", "column_name"])

    # features.txt contains duplicated names, so make them unique first
    new_feature_name_df = get_new_feature_name_df(feature_name_df)

    # Column 1 holds the (now unique) feature names
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()

    X_train = pd.read_csv("../human_activity/train/X_train.txt", sep=r"\s+", names=feature_name)
    X_test = pd.read_csv("../human_activity/test/X_test.txt", sep=r"\s+", names=feature_name)

    # Labels are a single column named "action"
    y_train = pd.read_csv("../human_activity/train/y_train.txt", sep=r"\s+", header=None, names=["action"])
    y_test = pd.read_csv("../human_activity/test/y_test.txt", sep=r"\s+", header=None, names=["action"])

    return X_train, X_test, y_train, y_test

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
import time
import warnings
warnings.filterwarnings("ignore")

X_train, X_test, y_train, y_test = get_human_dataset()

start_time = time.time()

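# Train with default hyperparameters; ravel() turns the one-column label
# DataFrame into the 1-D array that scikit-learn expects for y
gb_clf = GradientBoostingClassifier(random_state=0)
gb_clf.fit(X_train, y_train.values.ravel())
gb_prd = gb_clf.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_prd)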

print("GBM accuracy: {0:.4f}".format(gb_accuracy))
print("GBM running time: {0:.4f}".format(time.time() - start_time))

GBM accuracy: 0.9386
GBM running time: 480.5931


## GBM hyperparameters
- loss: the loss function to minimize during gradient descent
- learning_rate: (default: 0.1) how strongly each weak learner's correction is applied. If it is too high, training is fast but predictive performance can degrade; if it is too low, performance can improve, but training takes much longer
- n_estimators: (default: 100) the number of weak learners
- subsample: (default: 1.0) the fraction of the training data sampled to fit each weak learner
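
One way to explore these hyperparameters is a small grid search with cross-validation. The sketch below uses a synthetic dataset from `make_classification` so that it runs in seconds; the dataset and the grid values are assumptions chosen for illustration, not settings tuned for the human activity data above, where a full search would take far longer than the single fit timed earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

# Candidate values for three of the hyperparameters listed above
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "subsample": [0.8, 1.0],
}

# 3-fold cross-validation over all 8 combinations
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)

print("best params:", grid.best_params_)
print("best CV accuracy: {0:.4f}".format(grid.best_score_))
```

A lower `learning_rate` usually needs a larger `n_estimators` to reach the same accuracy, so the two are best tuned together rather than one at a time.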