<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Spark-MLlib-Tuning" data-toc-modified-id="Spark-MLlib-Tuning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><a href="https://spark.apache.org/docs/latest/ml-tuning.html" target="_blank">Spark MLlib Tuning</a></a></span></li><li><span><a href="#Hyperopt" data-toc-modified-id="Hyperopt-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><a href="https://github.com/hyperopt/hyperopt" target="_blank">Hyperopt</a></a></span><ul class="toc-item"><li><span><a href="#XGBoost-Tuning" data-toc-modified-id="XGBoost-Tuning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><a href="https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/" target="_blank">XGBoost Tuning</a></a></span><ul class="toc-item"><li><span><a href="#Objective-function" data-toc-modified-id="Objective-function-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Objective function</a></span></li><li><span><a href="#Tune-number-of-trees" data-toc-modified-id="Tune-number-of-trees-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Tune number of trees</a></span></li><li><span><a href="#Tune-tree-specific-parameters" data-toc-modified-id="Tune-tree-specific-parameters-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Tune tree-specific parameters</a></span><ul class="toc-item"><li><span><a href="#Tune-max_depth,-min_child_weight" data-toc-modified-id="Tune-max_depth,-min_child_weight-2.1.3.1"><span class="toc-item-num">2.1.3.1&nbsp;&nbsp;</span>Tune max_depth, min_child_weight</a></span></li><li><span><a href="#Tune-gamma" data-toc-modified-id="Tune-gamma-2.1.3.2"><span class="toc-item-num">2.1.3.2&nbsp;&nbsp;</span>Tune gamma</a></span></li><li><span><a href="#Tune-subsample,-colsample_bytree" data-toc-modified-id="Tune-subsample,-colsample_bytree-2.1.3.3"><span class="toc-item-num">2.1.3.3&nbsp;&nbsp;</span>Tune subsample, colsample_bytree</a></span></li></ul></li><li><span><a 
href="#Tune-regularization-parameters" data-toc-modified-id="Tune-regularization-parameters-2.1.4"><span class="toc-item-num">2.1.4&nbsp;&nbsp;</span>Tune regularization parameters</a></span></li><li><span><a href="#Lower-the-learning-rate-and-decide-the-optimal-parameters" data-toc-modified-id="Lower-the-learning-rate-and-decide-the-optimal-parameters-2.1.5"><span class="toc-item-num">2.1.5&nbsp;&nbsp;</span>Lower the learning rate and decide the optimal parameters</a></span></li></ul></li><li><span><a href="#LogisticRegression-Tuning" data-toc-modified-id="LogisticRegression-Tuning-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LogisticRegression Tuning</a></span></li><li><span><a href="#Optional-MongoTrials" data-toc-modified-id="Optional-MongoTrials-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Optional <a href="https://hyperopt.github.io/hyperopt/scaleout/mongodb/" target="_blank">MongoTrials</a></a></span><ul class="toc-item"><li><span><a href="#XGBoost-Tuning" data-toc-modified-id="XGBoost-Tuning-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>XGBoost Tuning</a></span></li></ul></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Results</a></span></li></ul></div>

We continue working on the CTR prediction task using the Criteo dataset.

The task and the data are described in the notebook from the previous practice session (`sgd_logreg_nn/notebooks/ctr_prediction_mllib.ipynb`).

In [1]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import os
import sys
import glob
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import pyspark
import pyspark.sql.functions as F
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import Row

COMMON_PATH = '/workspace/common'

sys.path.append(os.path.join(COMMON_PATH, 'utils'))

os.environ['PYSPARK_SUBMIT_ARGS'] = """
--jars {common}/xgboost4j-spark-0.72.jar,{common}/xgboost4j-0.72.jar
--py-files {common}/sparkxgb.zip pyspark-shell
""".format(common=COMMON_PATH).replace('\n', ' ')

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("spark_sql_examples") \
    .config("spark.executor.memory", "6g") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

from metrics import rocauc, logloss, ne, get_ate
from processing import split_by_col

from sparkxgb.xgboost import *

In [2]:
DATA_PATH = '/workspace/data/criteo'

TRAIN_PATH = os.path.join(DATA_PATH, 'train.csv')

In [3]:
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('file:///' + TRAIN_PATH)

**Remark** You do not have to use half of the dataset and only two categorical variables. You can use more data if your configuration allows it.

In [4]:
df = df.sample(False, 0.5)

In [5]:
num_columns = ['_c{}'.format(i) for i in range(1, 14)]
cat_columns = ['_c{}'.format(i) for i in range(14, 40)][:2]
len(num_columns), len(cat_columns)

(13, 2)

In [6]:
df = df.fillna(0, subset=num_columns)

In [9]:
from pyspark.ml import PipelineModel

PIPELINE_MODEL_PATH = '../../sgd_logreg_nn/notebooks/transforming_pipeline'
pipeline_model = PipelineModel.load(PIPELINE_MODEL_PATH)

In [10]:
df = pipeline_model \
    .transform(df) \
    .select(F.col('_c0').alias('label'), 'features', 'id') \
    .cache()

df.count()

1833655

In [11]:
train_df, val_df, test_df = split_by_col(df, 'id', [0.8, 0.1, 0.1])

# [Spark MLlib Tuning](https://spark.apache.org/docs/latest/ml-tuning.html)

The hyperparameter optimization (HPO) method built into Spark has two significant drawbacks that make it poorly suited to our task:

1. `ParamGridBuilder` - supports grid search only
2. `TrainValidationSplit` - splits the data randomly
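To make the first drawback concrete, here is a rough count of how many models a full grid would need. The candidate lists are copied from the tuning cells later in this notebook; comparing the full grid with the staged search is only an illustration, since the staged search does not explore the joint space.

```python
from functools import reduce
from operator import mul

# Candidate lists copied from the tuning cells later in this notebook.
candidates = {
    'num_round': [10, 20, 40, 100],
    'eta': [0.5, 0.10, 0.15, 0.20, 0.30],
    'max_depth': [5, 6, 8, 10, 12],
    'min_child_weight': [20., 35., 50., 75., 100.],
    'gamma': [0.7, 0.8, 0.85, 0.9, 0.95],
    'subsample': [0.6, 0.7, 0.8, 0.9],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9],
    'alpha': [0., .001, .01, .1, .2, 0.3, 0.5],
    'reg_lambda': [0., .001, .01, .1, .2, 0.3, 0.5],
}

# A full grid trains one model per point of the Cartesian product.
grid_size = reduce(mul, (len(values) for values in candidates.values()), 1)

# Budget spent by the staged search in this notebook: sum of max_evals per stage.
staged_budget = 20 + 20 + 5 + 15 + 20

print('full grid: {} fits, staged search: {} fits'.format(grid_size, staged_budget))
```

The gap between the two numbers is why a sequential, budget-limited search is used in the rest of the notebook.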

# [Hyperopt](https://github.com/hyperopt/hyperopt)

Install `hyperopt`:

In [12]:
!pip3.5 install hyperopt

Collecting hyperopt
  Downloading hyperopt-0.2.3-py3-none-any.whl (1.9 MB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting networkx==2.2
  Downloading networkx-2.2.zip (1.7 MB)
Collecting cloudpickle
  Downloading cloudpickle-1.3.0-py2.py3-none-any.whl (26 kB)
Installing collected packages: future, networkx, cloudpickle, hyperopt
    Running setup.py install for future ... done
    Running setup.py install for networkx ... done
Successfully installed cloudpickle-1.3.0 future-0.18.2 hyperopt-0.2.3 networkx-2.2


## [XGBoost Tuning](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

> [Notes on Parameter Tuning](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)

### Objective function

In [80]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, STATUS_FAIL
import scipy.stats as st


def objective(space):
    estimator = XGBoostEstimator(**space)
    print('SPACE:', estimator._input_kwargs_processed())

    # Training occasionally fails, so allow one retry before giving up.
    model = None
    attempts = 0
    while model is None and attempts < 2:
        try:
            model = estimator.fit(train_df)
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')

    # If every attempt failed, mark the trial as failed instead of
    # crashing in the metric helpers with model=None.
    if model is None:
        return {'status': STATUS_FAIL}

    log_loss = logloss(model, val_df, probabilities_col='probabilities')
    roc_auc = rocauc(model, val_df, probabilities_col='probabilities')

    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK}
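The retry logic inside the objective can be factored into a small helper. The sketch below is a standalone illustration: `call_with_retries` and `flaky_fit` are hypothetical names, and the flaky function only stands in for `estimator.fit(train_df)`.

```python
def call_with_retries(fn, max_attempts=2):
    """Call fn() up to max_attempts times and return (result, error)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), None
        except Exception as exc:  # broad on purpose, mirroring the objective above
            last_error = exc
            print('Attempt {} failed: {}'.format(attempt, exc))
    # Every attempt failed: hand the last error back to the caller.
    return None, last_error


# Hypothetical stand-in for a fit call that fails once and then succeeds.
calls = {'n': 0}

def flaky_fit():
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('transient failure')
    return 'fitted-model'


model, error = call_with_retries(flaky_fit)
print(model, error)
```

Returning the error instead of raising lets the caller decide how to report a failed trial to the optimizer.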

In [14]:
static_params = {
    'featuresCol': "features", 
    'labelCol': "label", 
    'predictionCol': "prediction",
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'nthread': 1,
    'silent': 0,
    'nworkers': 1
}

Fix baseline parameters and train baseline model

In [15]:
CONTROL_NAME = 'xgb baseline'

baseline_params = {
    'colsample_bytree': 0.9,
    'eta': 0.15,
    'gamma': 0.9,
    'max_depth': 6,
    'min_child_weight': 50.0,
    'subsample': 0.9,
    'num_round': 20
}

baseline_model = XGBoostEstimator(**{**static_params, **baseline_params}).fit(train_df)

In [16]:
baseline_rocauc = rocauc(baseline_model, val_df, probabilities_col='probabilities')
baseline_rocauc

0.7260253274714449

In [17]:
all_metrics = {}

In [18]:
baseline_test_metrics = {
    'logloss': logloss(baseline_model, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(baseline_model, test_df, probabilities_col='probabilities')
}

all_metrics[CONTROL_NAME] = baseline_test_metrics

### Tune number of trees

> Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate.

In [19]:
%%time

num_round_choice = [10, 20, 40, 100]
eta_choice = [0.5, 0.10, 0.15, 0.20, 0.30]

space = {
    # Optimize
    'num_round': hp.choice('num_round', num_round_choice),
    'eta': hp.choice('eta', eta_choice),
    
    # Fixed    
    'max_depth': baseline_params['max_depth'],
    'min_child_weight': baseline_params['min_child_weight'],
    'subsample': baseline_params['subsample'],
    'gamma': baseline_params['gamma'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)

SPACE:                                                
{'num_round': 40, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 50.0, 'max_depth': 6, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.5038699868920585, ROC-AUC: 0.7342957047805975
SPACE:                                                                          
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 50.0, 'max_depth': 6, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.15, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.5028733353263354, ROC-AUC: 0.7357985639428266                       
SPACE:                                                                          
{'num_round': 100, 'silent

LOG-LOSS: 0.5104978457337404, ROC-AUC: 0.7260253274714473                        
SPACE:                                                                           
{'num_round': 40, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 50.0, 'max_depth': 6, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.5, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.503105960714235, ROC-AUC: 0.7351260251276981                         
SPACE:                                                                           
{'num_round': 20, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 50.0, 'max_depth': 6, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.15, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.5104978457337404, ROC-AUC: 0.72602532747144

In [20]:
best

{'eta': 4, 'num_round': 3}

Note that with `hp.choice`, the `best` variable stores not the hyperparameter value itself but its index in the list of candidates (for example, in `num_round_choice`).
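The index-to-value mapping can be written once and reused for every stage. This is a minimal sketch in plain Python; `decode_choices` is a hypothetical helper name. Hyperopt also provides `space_eval(space, best)`, which resolves a whole search space from the same `best` dictionary.

```python
def decode_choices(best, choice_lists):
    """Map the indices returned for hp.choice parameters back to their values."""
    return {name: choice_lists[name][index] for name, index in best.items()}


# Values taken from this notebook's first tuning stage.
best_indices = {'eta': 4, 'num_round': 3}
choice_lists = {
    'eta': [0.5, 0.10, 0.15, 0.20, 0.30],
    'num_round': [10, 20, 40, 100],
}

decoded = decode_choices(best_indices, choice_lists)
print(decoded)
```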

In [23]:
eta = eta_choice[best['eta']]  # change me!
num_round = num_round_choice[best['num_round']]  # change me!
eta, num_round

(0.3, 100)

In [24]:
space

{'colsample_bytree': 0.9,
 'eta': <hyperopt.pyll.base.Apply at 0x7ffb5c9416a0>,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': 6,
 'min_child_weight': 50.0,
 'nthread': 1,
 'num_round': <hyperopt.pyll.base.Apply at 0x7ffb5b68f860>,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'silent': 0,
 'subsample': 0.9}

In [26]:
space['eta'] = eta
space['num_round'] = num_round
space

{'colsample_bytree': 0.9,
 'eta': 0.3,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': 6,
 'min_child_weight': 50.0,
 'nthread': 1,
 'num_round': 100,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'silent': 0,
 'subsample': 0.9}

In [28]:
model_2 = XGBoostEstimator(**space).fit(train_df)
rocauc(model_2, test_df, probabilities_col='probabilities')

0.7383147008900963

In [30]:
all_metrics['model_2'] = {
    'logloss': logloss(model_2, test_df, probabilities_col='probabilities'),
    'rocauc' :  rocauc(model_2, test_df, probabilities_col='probabilities')
}
all_metrics['model_2']

{'logloss': 0.5042670249156437, 'rocauc': 0.7383147008900977}

### Tune tree-specific parameters

> Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
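The overall procedure is a staged search: tune one group of parameters, freeze the winner, then move to the next group. The sketch below shows that control flow on a toy objective. It is an illustration under stated assumptions: `staged_search` and `toy_objective` are hypothetical, and each stage is searched exhaustively here, whereas the notebook runs TPE via `fmin` within each stage.

```python
import itertools


def staged_search(objective, fixed, stages):
    """Tune parameter groups one stage at a time, freezing each stage's best values."""
    params = dict(fixed)
    for stage in stages:  # each stage maps parameter name -> list of candidates
        names = list(stage)
        best_loss, best_combo = float('inf'), None
        for combo in itertools.product(*(stage[name] for name in names)):
            trial = dict(params)
            trial.update(zip(names, combo))
            loss = objective(trial)
            if loss < best_loss:
                best_loss, best_combo = loss, combo
        # Freeze the winner before moving on to the next group.
        params.update(zip(names, best_combo))
    return params


# Toy objective with a known minimum; it stands in for validation log-loss.
def toy_objective(p):
    return (p['eta'] - 0.2) ** 2 + (p['max_depth'] - 8) ** 2 + (p['gamma'] - 0.85) ** 2


start = {'eta': 0.15, 'max_depth': 6, 'gamma': 0.9}
stages = [
    {'eta': [0.1, 0.2, 0.3]},
    {'max_depth': [5, 6, 8, 10]},
    {'gamma': [0.7, 0.85, 0.9]},
]

tuned = staged_search(toy_objective, start, stages)
print(tuned)
```

Because later stages never revisit earlier choices, this scheme can miss interactions between parameter groups; it trades that risk for a much smaller number of model fits.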

#### Tune max_depth, min_child_weight

In [31]:
max_depth_choice        = [5,   6,    8,  10,  12]
min_child_weight_choice = [20., 35., 50., 75., 100.]

space['max_depth']        = hp.choice('max_depth',        max_depth_choice)
space['min_child_weight'] = hp.choice('min_child_weight', min_child_weight_choice)

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)

SPACE:                                                
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 20.0, 'max_depth': 8, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.5001364242111265, ROC-AUC: 0.7392419353176597
SPACE:                                                                           
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 50.0, 'max_depth': 6, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.5011382280025449, ROC-AUC: 0.7378595320029839                        
SPACE:                                                                           
{'num_round': 100, 'sil

{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 75.0, 'max_depth': 8, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.499834150388143, ROC-AUC: 0.739598595730302                            
SPACE:                                                                             
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 35.0, 'max_depth': 6, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.500928483339883, ROC-AUC: 0.7382149989036276                           
SPACE:                                                                             
{'num_round': 100, 'silent': 0, 'predictionCol

In [32]:
best

{'max_depth': 3, 'min_child_weight': 4}

In [33]:
max_depth        = max_depth_choice[best['max_depth']]
min_child_weight = min_child_weight_choice[best['min_child_weight']]

In [34]:
space

{'colsample_bytree': 0.9,
 'eta': 0.3,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': <hyperopt.pyll.base.Apply at 0x7ffb5b612208>,
 'min_child_weight': <hyperopt.pyll.base.Apply at 0x7ffb5b564780>,
 'nthread': 1,
 'num_round': 100,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'silent': 0,
 'subsample': 0.9}

In [40]:
space['max_depth']        = max_depth
space['min_child_weight'] = min_child_weight
space

{'colsample_bytree': 0.9,
 'eta': 0.3,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': 10,
 'min_child_weight': 100.0,
 'nthread': 1,
 'num_round': 100,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'silent': 0,
 'subsample': 0.9}

In [41]:
model_3 = XGBoostEstimator(**space).fit(train_df)

all_metrics['model_3'] = {
    'logloss': logloss(model_3, test_df, probabilities_col='probabilities'),
    'rocauc' :  rocauc(model_3, test_df, probabilities_col='probabilities')
}
all_metrics['model_3']

{'logloss': 0.5028366702156377, 'rocauc': 0.7398397571705004}

#### Tune gamma

In [44]:
gamma_choice = [0.7, 0.8, 0.85, 0.9, 0.95]

space['gamma'] = hp.choice('gamma', gamma_choice)

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=5,
            trials=trials)

SPACE:                                               
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'max_depth': 10, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.95, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.4996699093773705, ROC-AUC: 0.7395581316768496
SPACE:                                                                          
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'max_depth': 10, 'subsample': 0.9, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.7, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.4996612091134824, ROC-AUC: 0.7396993673699801                       
SPACE:                                                                          
{'num_round': 100, 'si

In [45]:
best

{'gamma': 3}

In [46]:
gamma = gamma_choice[best['gamma']]
gamma

0.9

In [47]:
space['gamma'] = gamma
space

{'colsample_bytree': 0.9,
 'eta': 0.3,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': 10,
 'min_child_weight': 100.0,
 'nthread': 1,
 'num_round': 100,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'silent': 0,
 'subsample': 0.9}

In [48]:
model_4 = XGBoostEstimator(**space).fit(train_df)

all_metrics['model_4'] = {
    'logloss': logloss(model_4, test_df, probabilities_col='probabilities'),
    'rocauc' :  rocauc(model_4, test_df, probabilities_col='probabilities')
}
all_metrics['model_4']

{'logloss': 0.5028366702156377, 'rocauc': 0.7398397571705}

#### Tune subsample, colsample_bytree

In [49]:
subsample_choice        = [0.6, 0.7, 0.8, 0.9]
colsample_bytree_choice = [0.6, 0.7, 0.8, 0.9]

space['subsample']        = hp.choice('subsample',        subsample_choice)
space['colsample_bytree'] = hp.choice('colsample_bytree', colsample_bytree_choice)  

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=15,
            trials=trials)

SPACE:                                                
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'max_depth': 10, 'subsample': 0.8, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.7, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.49945962149383016, ROC-AUC: 0.7399387722167209
SPACE:                                                                            
{'num_round': 100, 'silent': 0, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'max_depth': 10, 'subsample': 0.6, 'labelCol': 'label', 'objective': 'binary:logistic', 'gamma': 0.9, 'featuresCol': 'features', 'nworkers': 1, 'eta': 0.3, 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.4999414094641043, ROC-AUC: 0.7390897290014236                         
SPACE:                                                                            
{'num_round': 1

In [50]:
best

{'colsample_bytree': 2, 'subsample': 2}

In [54]:
subsample        = subsample_choice[best['subsample']]
colsample_bytree = colsample_bytree_choice[best['colsample_bytree']]
subsample, colsample_bytree

(0.8, 0.8)

In [55]:
space['subsample'] = subsample
space['colsample_bytree'] = colsample_bytree
space

{'colsample_bytree': 0.8,
 'eta': 0.3,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': 10,
 'min_child_weight': 100.0,
 'nthread': 1,
 'num_round': 100,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'silent': 0,
 'subsample': 0.8}

In [56]:
model_5 = XGBoostEstimator(**space).fit(train_df)

all_metrics['model_5'] = {
    'logloss': logloss(model_5, test_df, probabilities_col='probabilities'),
    'rocauc' :  rocauc(model_5, test_df, probabilities_col='probabilities')
}
all_metrics['model_5']

{'logloss': 0.5028679651765141, 'rocauc': 0.7398801728520573}

In [57]:
get_ate(all_metrics, CONTROL_NAME)

| | metric | model_2 ate % | model_3 ate % | model_4 ate % | model_5 ate % |
|---|---|---|---|---|---|
| 0 | logloss | -1.938934 | -2.217085 | -2.217085 | -2.210999 |
| 1 | rocauc | 1.761944 | 1.972143 | 1.972143 | 1.977713 |


### Tune regularization parameters

> Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.

In [82]:
alpha_choice      = [0., .001, .01, .1, .2, 0.3, 0.5]
reg_lambda_choice = [0., .001, .01, .1, .2, 0.3, 0.5]

space['alpha']      = hp.choice('alpha',      alpha_choice)
space['reg_lambda'] = hp.choice('reg_lambda', reg_lambda_choice)


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)

SPACE:                                                
{'num_round': 100, 'silent': 0, 'alpha': 0.0, 'max_depth': 10, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'labelCol': 'label', 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic', 'featuresCol': 'features', 'lambda': 0.0, 'gamma': 0.9, 'nworkers': 1, 'eta': 0.3, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.4993349101887657, ROC-AUC: 0.7400513770208195
SPACE:                                                                           
{'num_round': 100, 'silent': 0, 'alpha': 0.0, 'max_depth': 10, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'labelCol': 'label', 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic', 'featuresCol': 'features', 'lambda': 0.3, 'gamma': 0.9, 'nworkers': 1, 'eta': 0.3, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.49945865763407926, ROC-AUC: 0.7398523978016651                       
SPACE:                                     

{'num_round': 100, 'silent': 0, 'alpha': 0.5, 'max_depth': 10, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'labelCol': 'label', 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic', 'featuresCol': 'features', 'lambda': 0.5, 'gamma': 0.9, 'nworkers': 1, 'eta': 0.3, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.499606235816009, ROC-AUC: 0.7397498899905193                          
SPACE:                                                                            
{'num_round': 100, 'silent': 0, 'alpha': 0.5, 'max_depth': 10, 'predictionCol': 'prediction', 'min_child_weight': 100.0, 'labelCol': 'label', 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic', 'featuresCol': 'features', 'lambda': 0.01, 'gamma': 0.9, 'nworkers': 1, 'eta': 0.3, 'nthread': 1, 'eval_metric': 'logloss'}
LOG-LOSS: 0.499552133431115, ROC-AUC: 0.7398778589033365                          
SPACE:                                                                

In [83]:
best

{'alpha': 2, 'reg_lambda': 5}

In [85]:
space['alpha']      = alpha_choice[best['alpha']]
space['reg_lambda'] = reg_lambda_choice[best['reg_lambda']]
space

{'alpha': 0.01,
 'colsample_bytree': 0.8,
 'eta': 0.3,
 'eval_metric': 'logloss',
 'featuresCol': 'features',
 'gamma': 0.9,
 'labelCol': 'label',
 'max_depth': 10,
 'min_child_weight': 100.0,
 'nthread': 1,
 'num_round': 100,
 'nworkers': 1,
 'objective': 'binary:logistic',
 'predictionCol': 'prediction',
 'reg_lambda': 0.3,
 'silent': 0,
 'subsample': 0.8}

In [86]:
model_6 = XGBoostEstimator(**space).fit(train_df)

all_metrics['model_6'] = {
    'logloss': logloss(model_6, test_df, probabilities_col='probabilities'),
    'rocauc' :  rocauc(model_6, test_df, probabilities_col='probabilities')
}
all_metrics['model_6']

{'logloss': 0.5029611272915919, 'rocauc': 0.7396112532601704}

In [87]:
get_ate(all_metrics, CONTROL_NAME)

| | metric | model_2 ate % | model_3 ate % | model_4 ate % | model_5 ate % | model_6 ate % |
|---|---|---|---|---|---|---|
| 0 | logloss | -1.938934 | -2.217085 | -2.217085 | -2.210999 | -2.192883 |
| 1 | rocauc | 1.761944 | 1.972143 | 1.972143 | 1.977713 | 1.940648 |


### Lower the learning rate and decide the optimal parameters

In [None]:
######################################
######### YOUR CODE HERE #############
######################################

---
## LogisticRegression Tuning

Let's tune the hyperparameters of the logistic regression from the previous practice sessions.

In [67]:
from hyperopt import STATUS_FAIL
from pyspark.ml.classification import LogisticRegression

def log_reg_objective(space):
    estimator = LogisticRegression(**space)
    print('SPACE:', estimator._input_kwargs)

    # Training occasionally fails, so allow one retry before giving up.
    model = None
    attempts = 0
    while model is None and attempts < 2:
        try:
            model = estimator.fit(train_df)
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')

    # If every attempt failed, mark the trial as failed instead of
    # crashing in the metric helpers with model=None.
    if model is None:
        return {'status': STATUS_FAIL}

    log_loss = logloss(model, val_df, probabilities_col='probability')
    roc_auc  =  rocauc(model, val_df, probabilities_col='probability')

    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK}

In [68]:
log_reg_metrics = {}

In [69]:
log_reg_space = {
    'featuresCol': 'features',
    'labelCol'   : 'label',
    
    'maxIter'    : 10,
    'regParam'       : 0.,
    'elasticNetParam': 0.
} 

log_reg_baseline = LogisticRegression(**log_reg_space).fit(train_df)

In [70]:
log_reg_metrics['log_reg_baseline'] = {
    'logloss': logloss(log_reg_baseline, test_df, probabilities_col='probability'),
    'rocauc' :  rocauc(log_reg_baseline, test_df, probabilities_col='probability')
}
log_reg_metrics['log_reg_baseline']

{'logloss': 0.531046765928655, 'rocauc': 0.7035777223524002}

In [72]:
regParam_choice = [0., .001, .01, .05, .1]
elasticNetParam_choice = [0., .001, 0.01, .05, .1]

log_reg_space['regParam'] = hp.choice('regParam', regParam_choice)
log_reg_space['elasticNetParam'] = hp.choice('elasticNetParam', elasticNetParam_choice)

trials = Trials()
best = fmin(fn=log_reg_objective,
            space=log_reg_space,
            algo=tpe.suggest,
            max_evals=25,
            trials=trials)

SPACE:                                                
{'elasticNetParam': 0.001, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.05}
LOG-LOSS: 0.530404117433663, ROC-AUC: 0.7004914094808631
SPACE:                                                                         
{'elasticNetParam': 0.001, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.05}
LOG-LOSS: 0.530404117433663, ROC-AUC: 0.7004914094808647                       
SPACE:                                                                         
{'elasticNetParam': 0.05, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.05}
LOG-LOSS: 0.5347408299329838, ROC-AUC: 0.6999968635869094                      
SPACE:                                                                         
{'elasticNetParam': 0.01, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.01}
LOG-LOSS: 0.5274751895001702, ROC-AUC: 0.7023127241967498 

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 40, in __del__
    if SparkContext._active_spark_context and self._java_obj is not None:
AttributeError: 'LogisticRegression' object has no attribute '_java_obj'

LOG-LOSS: 0.5347408299329838, ROC-AUC: 0.6999968635869128                       
SPACE:                                                                          
{'elasticNetParam': 0.001, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.0}
LOG-LOSS: 0.5281816810597477, ROC-AUC: 0.7026160227101298                       
SPACE:                                                                          
{'elasticNetParam': 0.0, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.1}
LOG-LOSS: 0.5337244915605497, ROC-AUC: 0.7000764796699932                       
SPACE:                                                                          
{'elasticNetParam': 0.01, 'maxIter': 10, 'labelCol': 'label', 'featuresCol': 'features', 'regParam': 0.05}
LOG-LOSS: 0.530862325061015, ROC-AUC: 0.7016420822640823                        
SPACE:                                                                           
{'elasticNetParam': 0.1, 'maxIte

In [73]:
best

{'elasticNetParam': 4, 'regParam': 1}

In [74]:
regParam        =        regParam_choice[best['regParam']]
elasticNetParam = elasticNetParam_choice[best['elasticNetParam']]
regParam, elasticNetParam

(0.001, 0.1)

In [75]:
log_reg_space['regParam'] = regParam
log_reg_space['elasticNetParam'] = elasticNetParam
log_reg_space

{'elasticNetParam': 0.1,
 'featuresCol': 'features',
 'labelCol': 'label',
 'maxIter': 10,
 'regParam': 0.001}

In [76]:
model = LogisticRegression(**log_reg_space).fit(train_df)

In [77]:
log_reg_metrics['log_reg_regularized'] = {
    'logloss': logloss(model, test_df, probabilities_col='probability'),
    'rocauc' :  rocauc(model, test_df, probabilities_col='probability')
}
log_reg_metrics['log_reg_regularized']

{'logloss': 0.5300497171179218, 'rocauc': 0.7036302896689541}

In [78]:
get_ate(log_reg_metrics, 'log_reg_baseline')

| | log_reg_regularized ate % | metric |
|---|---|---|
| 0 | -0.187752 | logloss |
| 1 | 0.007471 | rocauc |
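`get_ate` comes from the course's `metrics` module, whose source is not shown in this notebook. The reported values are consistent with a relative change against the control model, expressed in percent. The sketch below assumes that definition (the name `relative_change_pct` is hypothetical) and recomputes the logistic regression comparison from the test metrics printed above.

```python
def relative_change_pct(value, control):
    """Relative change of value versus control, in percent (assumed meaning of 'ate %')."""
    return (value - control) / control * 100.0


# Test-set metrics printed earlier in this section.
baseline = {'logloss': 0.531046765928655, 'rocauc': 0.7035777223524002}
regularized = {'logloss': 0.5300497171179218, 'rocauc': 0.7036302896689541}

ate = {
    metric: relative_change_pct(regularized[metric], baseline[metric])
    for metric in baseline
}

for metric in sorted(ate):
    print('{}: {:+.6f} %'.format(metric, ate[metric]))
```

Under this reading, a negative change in log-loss and a positive change in ROC-AUC both indicate an improvement over the control.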


---
## Optional [MongoTrials](https://hyperopt.github.io/hyperopt/scaleout/mongodb/)

> For parallel search, hyperopt includes a MongoTrials implementation that supports asynchronous updates.

**TLDR** Advantages of using `MongoTrials`:
* `MongoTrials` lets several evaluations of the objective function run in parallel
* Dynamic level of parallelism: workers that evaluate the objective function can be added or removed
* All results are stored in the database, so the run history is never lost

*Completing this task earns an additional +0.4 toward the final grade*

### XGBoost Tuning

In [None]:
######################################
######### YOUR CODE HERE #############
######################################

# Results

Let's summarize.

Train the models with the (optimal) hyperparameters you found and compare them on the held-out set.

In [None]:
######################################
######### YOUR CODE HERE #############
######################################

Final table

In [23]:
get_ate(all_metrics, CONTROL_NAME)

| | metric | xgb opt ate % |
|---|---|---|
| 0 | logloss | 0.0 |
| 1 | rocauc | -1.554312e-13 |
