# Master's in Data Science - UDD
## DBAnalytics (Applied Data Science)
### Sprint 2: Transformed Data II

Run a hyperparameter search to train the model using the Gradient Boosting Trees algorithm.<br>
Explore the impact of each feature on the model in two ways:<br>
Univariate KS tests comparing the Churn vs. No-Churn populations<br>
Model performance when trained with and without the feature (sensitivity)<br>
Compute the uplift of the final result for the segment of the top 10K msno’s (% churn rate in the top 10K vs. % overall churn rate).<br>
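
The univariate KS test mentioned above compares, for a single feature, its empirical distribution among churners with its distribution among non-churners. Below is a minimal sketch of the two-sample KS statistic in plain Python; the function name `ks_statistic` and the toy values are illustrative assumptions, not part of the notebook. In practice `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical values of one feature (e.g. day_listen), split by label.
churn = [1, 2, 2, 3, 4]
no_churn = [10, 12, 15, 20, 33]
print(ks_statistic(churn, no_churn))
```

A statistic close to 1 means the feature separates the two populations almost completely; a value close to 0 means the distributions are nearly identical.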

In [1]:
import pandas as pd

* Reading the training file

In [2]:
!gsutil ls -a gs://kk_data_udd

gs://kk_data_udd/day_listen.csv#1564861559107461
gs://kk_data_udd/df_test.csv#1565240427109815
gs://kk_data_udd/df_train.csv#1565240406697715
gs://kk_data_udd/fea_rank.csv#1565661603594903
gs://kk_data_udd/members_v3.csv#1563566790239785
gs://kk_data_udd/sample_submission_v2.csv#1563580288727022
gs://kk_data_udd/sample_submission_zero.csv#1563580145138161
gs://kk_data_udd/sub_age_xgb_pred.csv#1565055727433942
gs://kk_data_udd/sub_day_listen.csv#1564861658307683
gs://kk_data_udd/sub_reg_via_xgb_pred.csv#1565055581900599
gs://kk_data_udd/sub_user_satisfaction.cvs#1564861608790636
gs://kk_data_udd/test_sorted_v1.csv#1565242819914404
gs://kk_data_udd/train.csv#1563565831541482
gs://kk_data_udd/train_sorted_v1.csv#1565243444686022
gs://kk_data_udd/train_sorted_v2.csv#1565583113260731
gs://kk_data_udd/train_v2.csv#1563580263806878
gs://kk_data_udd/transactions.csv#1563580088583483
gs://kk_data_udd/transactions_v2.csv#1563580288931202
gs://kk_data_udd/user_label_201702.csv#1563681052454642
gs

In [3]:
df_train = pd.read_csv("gs://kk_data_udd/train_sorted_v2.csv", nrows=10000)

In [4]:
df_train.head()

| Unnamed: 0 | msno | is_churn | city | bd | gender | registered_via | registration_init_time | payment_method_id | payment_plan_days | plan_list_price | ... | discount_120 | discount_149 | discount_180 | discount_20 | discount_30 | discount_50 | md_-1 | md_0 | day_listen | user_latent_satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= | 1 | 10.0 | 38.0 | 1.0 | 9.0 | 2005.0 | 39 | 30 | 149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33.0 | 0.820694 |
| 1 | QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= | 1 | 10.0 | 38.0 | 1.0 | 9.0 | 2005.0 | 39 | 30 | 149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33.0 | 0.820694 |
| 2 | Nb1ZGEmagQeba5E+nQj8VlQoWl+8SFmLZu+Y8ytIamw= | 1 | 18.0 | 22.0 | 2.0 | 9.0 | 2006.0 | 38 | 30 | 149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 152.0 | 0.919394 |
| 3 | Nb1ZGEmagQeba5E+nQj8VlQoWl+8SFmLZu+Y8ytIamw= | 0 | 18.0 | 22.0 | 2.0 | 9.0 | 2006.0 | 38 | 30 | 149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 152.0 | 0.919394 |
| 4 | MkuWz0Nq6/Oq5fKqRddWL7oh2SLUSRe3/g+XmAWqW1Q= | 1 | 11.0 | 31.0 | 2.0 | 9.0 | 2006.0 | 38 | 30 | 149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.0 | 1.0 |


In [5]:
df_train = df_train.drop(['transaction_date', 'membership_expire_date', 'registration_init_time'], axis=1)

* Importing the modeling libraries

In [6]:
import pandas as pd
import sklearn
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

In [7]:
import gc
import warnings
from datetime import datetime

In [8]:
gc.enable()
warnings.filterwarnings('ignore')

* Defining helper functions

In [9]:
def timer(start_time=None):
    """Call without arguments to get a start time; pass it back to print the elapsed time."""
    if not start_time:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

In [10]:
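from sklearn.metrics import log_loss

def xgb_score(preds, dtrain):
    # Custom evaluation metric in the (name, value) form used by the native xgboost API.
    labels = dtrain.get_label()
    return 'log_loss', log_loss(labels, preds)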

* Building the training features

In [11]:
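# Note: 'Unnamed: 0' (the row index saved in the CSV) is not excluded here, so it is used as a feature.
cols = [c for c in df_train.columns if c not in ['is_churn', 'msno']]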
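
The feature list built above is also a natural starting point for the sensitivity analysis requested in the brief (model performance with and without each feature). The following is only a sketch of a drop-one-feature loop: `drop_one_sensitivity`, `fit_and_score`, and the toy scorer are hypothetical names introduced for illustration, and the score is assumed to be "higher is better" (as with `neg_log_loss`).

```python
def drop_one_sensitivity(features, fit_and_score):
    """Score lost when each feature is left out; fit_and_score(cols) -> float."""
    baseline = fit_and_score(list(features))
    impact = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        # Positive impact: the score drops when this feature is removed.
        impact[feature] = baseline - fit_and_score(reduced)
    return impact

# Toy stand-in for "train a model on these columns and return a validation score".
toy_weights = {"day_listen": 0.10, "city": 0.01, "bd": 0.0}

def toy_score(cols):
    return 0.5 + sum(toy_weights[c] for c in cols)

impact = drop_one_sensitivity(list(toy_weights), toy_score)
print(impact)
```

With the real data, `fit_and_score` would presumably train the tuned classifier on `X_train[cols]` and evaluate it on a held-out set, which means one full training run per feature.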

In [12]:
df_train = df_train.fillna(0)

In [13]:
Y = df_train['is_churn'].values
X = df_train[cols]

* Creating the training and test sets from the master table

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

* Defining the hyperparameter search space for fine-tuning

In [15]:
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5, 6, 7],
    'subsample': [0.7, 0.75, 0.8]
}
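
To put the search space in perspective, the short sketch below counts how many combinations the full grid contains and what share a randomized search would sample. The dictionary is copied from the cell above; `n_iter = 5` is taken from the `param_comb` value set a few cells further down.

```python
from math import prod

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5, 6, 7],
    'subsample': [0.7, 0.75, 0.8],
}

# Size of the exhaustive grid: product of the number of values per parameter.
n_grid = prod(len(values) for values in params.values())
n_iter = 5  # number of sampled candidates in the randomized search

print(f"{n_grid} combinations in the full grid")
print(f"random search with n_iter={n_iter} covers {n_iter / n_grid:.2%} of it")
```

With so small a sample, the best candidate found is only a rough indication; increasing `n_iter` trades run time for coverage.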

* Instantiating the base model (fixed learning rate and number of trees; the remaining parameters are tuned by the search)

In [16]:
model = xgb.XGBClassifier(learning_rate=0.002, n_estimators=600, objective='binary:logistic', silent=True, nthread=1)

In [17]:
folds = 3
param_comb = 5

In [18]:
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)

In [19]:
random_search = RandomizedSearchCV(model, 
                                   param_distributions=params, 
                                   n_iter=param_comb, 
                                   scoring='neg_log_loss', 
                                   n_jobs=4,
                                   cv=skf.split(X_train, y_train), 
                                   verbose=3, 
                                   random_state=1001)

* Starting the hyperparameter search

In [20]:
start_time = timer(None)  
random_search.fit(X_train, y_train)
timer(start_time)  

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] max_depth=4, colsample_bytree=0.6, gamma=5, subsample=0.7, min_child_weight=1 
[CV] max_depth=4, colsample_bytree=0.6, gamma=5, subsample=0.7, min_child_weight=1 
[CV] max_depth=4, colsample_bytree=0.6, gamma=5, subsample=0.7, min_child_weight=1 
[CV] max_depth=4, colsample_bytree=0.8, gamma=1, subsample=0.75, min_child_weight=10 
[CV]  max_depth=4, colsample_bytree=0.6, gamma=5, subsample=0.7, min_child_weight=1, score=-0.5068190921492399, total=  21.9s
[CV] max_depth=4, colsample_bytree=0.8, gamma=1, subsample=0.75, min_child_weight=10 
[CV]  max_depth=4, colsample_bytree=0.6, gamma=5, subsample=0.7, min_child_weight=1, score=-0.505730340322727, total=  22.1s
[CV] max_depth=4, colsample_bytree=0.8, gamma=1, subsample=0.75, min_child_weight=10 
[CV]  max_depth=4, colsample_bytree=0.6, gamma=5, subsample=0.7, min_child_weight=1, score=-0.5081240729334219, total=  22.3s
[CV] max_depth=6, colsample_bytree=1.0, gamma=1, subs

[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:  1.9min finished



 Time taken: 0 hours 2 minutes and 13.28 seconds.


* Reviewing the results

1. Best estimator

In [21]:
random_search.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.8, gamma=0.5,
       learning_rate=0.002, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=600, n_jobs=1,
       nthread=1, objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.8, verbosity=1)

In [22]:
print('2. Gini score for %d-fold CV found with %d parameter combinations' % (folds, param_comb))

2. Gini score for 3-fold CV found with 5 parameter combinations


In [23]:
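# Caution: best_score_ is the mean negative log loss (scoring='neg_log_loss'), not an AUC,
# so this expression is not a valid Gini coefficient (which would be 2 * AUC - 1).
random_search.best_score_ * 2 - 1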

-2.0102053115112444
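
A Gini coefficient is normally derived from the ROC AUC as `Gini = 2 * AUC - 1`, which lies in [-1, 1]; a value such as -2.01 indicates that the input was not an AUC. The sketch below computes both quantities in plain Python on made-up labels and scores. The helper names `roc_auc` and `gini` are illustrative; in the notebook the same number could presumably be obtained with `sklearn.metrics.roc_auc_score`, or by running the search with `scoring='roc_auc'`.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    """Gini coefficient derived from the AUC."""
    return 2 * roc_auc(labels, scores) - 1

# Made-up example: 3 of the 4 positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores), gini(labels, scores))
```

The pairwise loop is quadratic in the number of rows, so it is meant for illustration only.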

3. Best hyperparameters:

In [24]:
random_search.best_params_

{'colsample_bytree': 0.8,
 'gamma': 0.5,
 'max_depth': 3,
 'min_child_weight': 1,
 'subsample': 0.8}

4. Saving the results

In [25]:
results = pd.DataFrame(random_search.cv_results_)
results.to_csv('xgboost_random_grid_search_results_01.csv', index=False)

* Checking the model

In [26]:
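# `model` itself was never fitted (RandomizedSearchCV fits clones of it), so calling
# model.predict(X_test) raises "XGBoostError: need to call fit or load_model beforehand".
# Use the best estimator instead, which the search refits on X_train by default (refit=True).
pred = random_search.best_estimator_.predict_proba(X_test)[:, 1]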

In [None]:
pred.clip(0.0000001, 0.999999)

In [None]:
X_test['is_churn'] = pred.clip(0.0000001, 0.999999)
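
The last item of the sprint brief, the uplift for the top 10K msno’s, compares the churn rate among the highest-scored users with the overall churn rate. The sketch below expresses it as a ratio, which is one reading of "vs." in the brief, and assumes that true labels and predicted probabilities are available as two aligned sequences. `top_k_uplift` and the toy data are hypothetical. Note that with the 10,000-row sample read above, the 30% test split has 3,000 rows, so `k = 10000` would require scoring the full dataset.

```python
def top_k_uplift(y_true, y_score, k):
    """Churn rate among the k highest-scored rows divided by the overall churn rate."""
    # Rank rows by predicted churn probability, highest first.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    top_rate = sum(top_labels) / len(top_labels)
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

# Toy data: 3 churners out of 10 users, all ranked in the top 3 by score.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.2, 0.8, 0.1, 0.3, 0.4, 0.7, 0.6, 0.05, 0.15]
print(top_k_uplift(y_true, y_score, k=3))
```

An uplift above 1 means the model concentrates churners in the top segment; a value near 1 means the ranking is no better than random selection.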