# XGBoost
`XGBoost` is a leading gradient boosting library for standard tabular data (the kind of data you store in Pandas DataFrames, as opposed to more exotic data such as images and videos). With careful parameter tuning, it can train highly accurate models.

In this notebook we will train an ML model using **gradient boosting**.
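To build some intuition for what gradient boosting does, here is a minimal sketch, assuming squared-error loss and one-split decision "stumps" as weak learners. It is not how XGBoost is implemented; the helper names (`fit_stump`, `predict`, `boost`, `mae`) and the toy data are made up for illustration.

```python
# Minimal gradient boosting sketch for squared-error loss (illustrative only).
# Each round fits a one-split stump to the current residuals and adds a
# shrunken copy of that stump to the ensemble.

LEARNING_RATE = 0.1


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(base, stumps, x):
    """Sum the base prediction and the shrunken output of every stump."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += LEARNING_RATE * (left_value if x <= threshold else right_value)
    return total


def boost(xs, ys, n_estimators=50):
    base = sum(ys) / len(ys)  # start from the mean of the targets
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the residual y - prediction is the negative gradient.
        residuals = [y - predict(base, stumps, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mae(xs, ys, base, stumps):
    return sum(abs(y - predict(base, stumps, x)) for x, y in zip(xs, ys)) / len(xs)


# Toy data (made up): one feature, one target.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]

base, stumps = boost(xs, ys)
print("MAE with the mean only:", mae(xs, ys, base, []))
print("MAE after boosting:    ", mae(xs, ys, base, stumps))
```

The two knobs in this sketch, the number of rounds and the learning rate, play the same role as the `n_estimators` and `learning_rate` parameters of `XGBRegressor`.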

In [1]:
import pandas as pd

In [2]:
# load the dataset
X = pd.read_csv('train.csv', index_col = 'Id')
X_test_full = pd.read_csv('test.csv', index_col = 'Id')
X.shape, X_test_full.shape

((1460, 80), (1459, 79))

In [3]:
# Drop the rows with missing values in the SalePrice column.
# SalePrice is the variable we want to predict, so there is no point in keeping
# observations (rows) that lack the label for the model we are going to train.

X.dropna(axis = 0,
        subset=['SalePrice'],
        inplace = True)

y = X.SalePrice

X.drop(['SalePrice'],
       axis=1,
       inplace = True)

In [4]:
from sklearn.model_selection import train_test_split

X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y,
                                                                 train_size = 0.8,
                                                                 test_size = 0.2,
                                                                 random_state = 0)

In [5]:
X_train_full.shape, y_train.shape

((1168, 79), (1168,))

In [6]:
X_valid_full.shape, y_valid.shape

((292, 79), (292,))

For the purposes of this notebook, we keep only the numeric features and the categorical features with a cardinality below 10 (that is, the columns with fewer than 10 unique values).
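As a quick illustration of what "cardinality" means here, the sketch below counts unique values per column on a toy frame. The column names mimic the housing data, but the values and the threshold of 3 are made up for the example.

```python
import pandas as pd

# Toy frame (made-up values).
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],   # 2 unique values
    "Neighborhood": ["A", "B", "C", "D"],         # 4 unique values
    "LotArea": [8450, 9600, 11250, 9550],         # numeric
})

max_cardinality = 3  # illustrative threshold; the notebook uses 10

# Treat every non-numeric column as categorical.
categorical = df.select_dtypes(exclude="number").columns
low_cardinality = [c for c in categorical if df[c].nunique() < max_cardinality]
numeric = df.select_dtypes(include="number").columns.tolist()

print(low_cardinality, numeric)
```

Low-cardinality columns are the ones worth one-hot encoding, because each unique value becomes its own column.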

In [7]:
low_cardinality_cols = [col for col in X_train_full.columns if X_train_full[col].nunique() < 10 and
                       X_train_full[col].dtype == 'object']

numeric_cols = [col for col in X_train_full.columns if X_train_full[col].dtype in ['int64','float64']]

my_cols = low_cardinality_cols + numeric_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

In [8]:
# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
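The `align` calls above matter because `get_dummies` creates columns only for the categories it actually sees in each frame. A minimal sketch on made-up data shows what `join='left'` does:

```python
import pandas as pd

# Toy frames (made-up values): 'Grvl' appears only in train, 'Alley' only in valid.
train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotArea": [8450, 9600]})
valid = pd.DataFrame({"Street": ["Pave", "Alley"], "LotArea": [11250, 9550]})

train_enc = pd.get_dummies(train)
valid_enc = pd.get_dummies(valid)

# join='left' keeps exactly the training columns:
# columns seen only in valid are dropped, and columns missing from valid are filled with NaN.
train_enc, valid_enc = train_enc.align(valid_enc, join="left", axis=1)

print(list(valid_enc.columns))
print(valid_enc["Street_Grvl"].isna().all())
```

After alignment both frames share the same columns, which is what the model expects at prediction time. The NaN-filled columns are tolerable here because XGBoost treats NaN as a missing value.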

# Step 1: Build model

In [11]:
from xgboost import XGBRegressor

# 1. Define the model
my_model = XGBRegressor(random_state = 0)

# 2. Train the model
my_model.fit(X_train, y_train)

In [17]:
# 3. Use the model to predict SalePrice for the validation data
predictions_1 = my_model.predict(X_valid)

# 4. Validation: here we use the mean absolute error (MAE)
from sklearn.metrics import mean_absolute_error
mae_1 = mean_absolute_error(y_valid, predictions_1)

print('-'*50, 'RESULTS','-'*50)
print(f'MAE from model_1: {mae_1:.2f}')
print('-'*50, 'RESULTS','-'*50)

-------------------------------------------------- RESULTS --------------------------------------------------
MAE from model_1: 17662.74
-------------------------------------------------- RESULTS --------------------------------------------------
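For reference, the mean absolute error is the average of the absolute differences between the true and predicted values. A minimal hand-rolled sketch follows; the function name and the prices are made up for illustration.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |true - predicted| over all observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices and predictions.
y_true = [200000, 150000, 320000]
y_pred = [210000, 140000, 300000]

mae = mean_absolute_error_manual(y_true, y_pred)
print(f"{mae:.2f}")
```

Because the errors are not squared, the MAE is expressed in the same units as the target (here, the sale price).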


# Step 2: Improve the model


In [21]:
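# 1. Define model_2
# Note: newer XGBoost releases expect early_stopping_rounds in the constructor
# rather than as an argument to fit().
my_model_2 = XGBRegressor(n_estimators = 500,
                          learning_rate = 0.05,
                          n_jobs = 4,
                          early_stopping_rounds = 5)

# 2. Train model_2, using the validation set to decide when to stop
my_model_2.fit(X_train, y_train,
              eval_set = [(X_valid, y_valid)],
              verbose = False)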
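The `early_stopping_rounds` setting above tells training to stop once the validation score has not improved for that many consecutive rounds. The sketch below illustrates the idea on a made-up list of validation errors; `early_stopping_round` is a hypothetical helper, not part of the XGBoost API.

```python
def early_stopping_round(validation_errors, patience):
    """Return the index of the best round, stopping once `patience`
    rounds have passed without a new best validation error."""
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round


# Made-up validation errors, one per boosting round.
errors = [30.0, 25.0, 22.0, 21.0, 21.5, 21.2, 21.3, 21.4, 21.6, 20.0]

print(early_stopping_round(errors, patience=5))
print(early_stopping_round(errors, patience=10))
```

With a small patience the loop stops before it reaches the later, lower error. That is the trade-off: a low value saves training time but can stop too early.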

In [24]:
# 3. Use model_2 to predict SalePrice for the validation data
predictions_2 = my_model_2.predict(X_valid)

# 4. Validation
mae_2 = mean_absolute_error(y_valid, predictions_2)
print('-'*50, 'RESULTS','-'*50)
print(f'MAE from model_2: {mae_2:.2f}')
print('-'*50, 'RESULTS','-'*50)

-------------------------------------------------- RESULTS --------------------------------------------------
MAE from model_2: 16802.97
-------------------------------------------------- RESULTS --------------------------------------------------


This result shows that our manual tuning of the model parameters did produce a better model: the mean absolute error fell from 17662.74 to 16802.97.