# Cross-validation of XGB parameters

I'm going to use the same workflow as in the first kernel. For a fuller explanation of it, see [the previous notebook][1].

But as a refresher:

1. We process X by dropping columns where more than 20% of the values are NaN and dummy-encoding the text variables.
2. We define the RMSLE metric for scoring.
3. We create a pipeline to impute missing values and scale the frame.
4. XGBoost performed the best, so that is the model we will tune with cross-validation.

So step 1: process the data.


  [1]: https://www.kaggle.com/mwaddoups/sberbank-russian-housing-market/i-regression-workflow-various-models/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

train = pd.read_csv('../input/train.csv')
macro = pd.read_csv('../input/macro.csv')
test = pd.read_csv('../input/test.csv')

train = pd.merge(train, macro, how='left', on='timestamp')

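# Process to get X and y variables
# Drop columns where more than 20% of the values are missing
percent_null = train.isnull().mean(axis=0) > 0.20
df = train.loc[:, ~percent_null]
df = df.drop(['id', 'price_doc'], axis=1)
# Convert the timestamp to a number and rescale it to a small range
df['timestamp'] = pd.to_numeric(pd.to_datetime(df['timestamp'])) / 1e18
# One-hot encode the text columns
df = pd.get_dummies(df).astype(np.float64)
X = df
# Model the log of the price, since we score with RMSLE
y = np.log(train['price_doc'])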

# Make our train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
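
To make the null-filter step concrete, here is a minimal sketch on a made-up three-column frame (the column names and values are illustrative only, not taken from the housing data). `isnull().mean()` gives the fraction of missing values per column, and comparing it with 0.20 produces the boolean mask used to drop columns.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'b' is 40% missing, 'c' is 20% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [np.nan, np.nan, 3.0, 4.0, 5.0],
    'c': [np.nan, 2.0, 3.0, 4.0, 5.0],
})

null_fraction = toy.isnull().mean(axis=0)   # fraction of NaN per column
too_sparse = null_fraction > 0.20           # strict inequality, so exactly 20% is kept
kept = toy.loc[:, ~too_sparse]

print(null_fraction.to_dict())
print(list(kept.columns))
```

Because the comparison is strict, a column sitting exactly at the threshold survives the filter.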

Step 2: Define the metric.

In [None]:
from sklearn.metrics import make_scorer

def rmsle_exp(y_true_log, y_pred_log):
    y_true = np.exp(y_true_log)
    y_pred = np.exp(y_pred_log)
    return np.sqrt(np.mean(np.power(np.log(y_true + 1) - np.log(y_pred + 1), 2)))

def score_model(model, pipe):
    train_error = rmsle_exp(y_train, model.predict(pipe.transform(X_train)))
    test_error = rmsle_exp(y_test, model.predict(pipe.transform(X_test)))
    return train_error, test_error

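# RMSLE is an error, so lower is better; tell sklearn so that a search minimizes it.
# Scores reported by sklearn are then the negated error.
rmsle_exp_scorer = make_scorer(rmsle_exp, greater_is_better=False)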
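
As a quick sanity check on the metric, the sketch below uses made-up prices (not taken from the data) to show that `rmsle_exp` expects log-prices as input and that, for prices in the millions, the `+ 1` inside the logarithm barely changes the result compared with a plain RMSE on the log values.

```python
import numpy as np

def rmsle_exp(y_true_log, y_pred_log):
    # Same definition as in the notebook: both inputs are log-prices
    y_true = np.exp(y_true_log)
    y_pred = np.exp(y_pred_log)
    return np.sqrt(np.mean(np.power(np.log(y_true + 1) - np.log(y_pred + 1), 2)))

# Hypothetical prices, chosen only for illustration
true_price = np.array([5e6, 7e6, 1e7])
pred_price = np.array([6e6, 7e6, 8e6])

y_true_log = np.log(true_price)
y_pred_log = np.log(pred_price)

rmsle = rmsle_exp(y_true_log, y_pred_log)
rmse_on_logs = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

print(rmsle, rmse_on_logs)
```

The two numbers agree to many decimal places here, which is why fitting on `np.log(price)` is a reasonable stand-in for optimizing RMSLE directly at this price scale.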

Step 3: Define the preprocessing pipeline.

In [None]:

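# Imputer was removed from sklearn.preprocessing; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Make a pipeline that transforms X: fill missing values with the column mean, then standardize
pipe = make_pipeline(SimpleImputer(), StandardScaler())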
pipe.fit(X_train)
pipe.transform(X_train)

Step 4: Check the baseline performance of XGBoost, with default parameters, on the held-out test set.

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(pipe.transform(X_train), y_train)

print("Train error: {:.4f}, Test error: {:.4f}".format(*score_model(xgb, pipe)))

## Cross-validation
We will now use the routines available in sklearn to tune the hyperparameters of this regressor with cross-validation. We'll use random search, since it usually finds good settings in fewer fits than an exhaustive grid search over the same ranges.
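
When setting up a search space it is worth remembering how the scipy distributions are parameterized. The sketch below draws samples from two example distributions (the bounds and sample size are chosen for illustration): `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers `[loc, loc + scale]` rather than `[loc, scale]`.

```python
from scipy.stats import randint, uniform

# Draw samples with a fixed seed so the check is repeatable
depths = randint(3, 6).rvs(size=1000, random_state=0)
rates = uniform(0.01, 0.45).rvs(size=1000, random_state=0)

print(sorted(set(depths.tolist())))   # distinct integer values that were drawn
print(rates.min(), rates.max())       # observed range of the uniform draws
```

None of the integer draws reaches 6, and the uniform draws stay between 0.01 and 0.46.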

In [None]:
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'max_depth': randint(3, 6),              # integers 3 to 5 (upper bound is exclusive)
    'learning_rate': uniform(0.01, 0.45),    # loc=0.01, scale=0.45, so the range is [0.01, 0.46]
    'n_estimators': randint(50, 200),
    'min_child_weight': randint(1, 50),
    'gamma': uniform(0, 1),
}

random_search = RandomizedSearchCV(xgb, param_distributions=param_dist, 
                                   n_iter=8, cv=3, scoring=rmsle_exp_scorer, 
                                   verbose=1, n_jobs=-1, refit=False)
random_search.fit(pipe.transform(X_train), y_train)

In [None]:
print(random_search.best_params_)
pd.DataFrame(random_search.cv_results_)

In [None]:
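# Refit on everything, including our held-out test set, using the parameters with the
# lowest cross-validated RMSLE (abs() handles scorers that report the negated error).
cv_error = np.abs(random_search.cv_results_['mean_test_score'])
xgb.set_params(**random_search.cv_results_['params'][int(np.nanargmin(cv_error))])
pipe.fit(X)
xgb.fit(pipe.transform(X), y)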

In [None]:
# Apply the same steps to process the test data
test_data = pd.merge(test, macro, how='left', on='timestamp')
test_data['timestamp'] = pd.to_numeric(pd.to_datetime(test_data['timestamp'])) / 1e18
test_data = pd.get_dummies(test_data).astype(np.float64)

# Make sure it has the same columns, in the same order, as the training data.
# Columns missing from the test data are filled with NaN.
df_test = test_data.reindex(columns=df.columns)

# Make the predictions
predictions = np.exp(xgb.predict(pipe.transform(df_test)))

# And put this in a dataframe
predictions_df = pd.DataFrame()
predictions_df['id'] = test['id']
predictions_df['price_doc'] = predictions
predictions_df.head()
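
The column-alignment step in the cell above can be checked on a toy example. This sketch uses `DataFrame.reindex`, which is one way to force a frame into a given column layout; the frames and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical training layout and a test frame with a different set of columns
train_columns = ['area', 'type_A', 'type_B']
toy_test = pd.DataFrame({
    'type_A': [1.0, 0.0],
    'extra': [9.0, 9.0],    # not seen in training, so it is dropped
    'area': [50.0, 70.0],
})

aligned = toy_test.reindex(columns=train_columns)

print(list(aligned.columns))
print(aligned['type_B'].isnull().all())
```

Columns that exist only in the training layout come back as all-NaN, so the imputer in the pipeline fills them in; columns that exist only in the test data disappear.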

In [None]:
# Now, output it to CSV
predictions_df.to_csv('predictions.csv', index=False)