# Modelling

Now that we have the data in the right format, we can start building our model for making predictions.

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

Let's load our data and hold out the two most recent months (by date_block_num) as a test set.

In [2]:
train_X = pd.read_csv("train_data.csv")

# hold out the last two months (date_block_num >= max - 1) as the test set
cutoff = train_X["date_block_num"].max() - 1
test_X = train_X[train_X["date_block_num"] >= cutoff].copy()
train_X = train_X[train_X["date_block_num"] < cutoff].copy()

# separate the target column from the features
train_y = train_X.pop("item_cnt_month")
test_y = test_X.pop("item_cnt_month")
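
To make the time-based split concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and the names df, holdout, train_target and holdout_target are hypothetical; only the column names come from the cell above.

```python
import pandas as pd

# Toy data standing in for train_data.csv (values are invented for illustration).
df = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2, 3, 3],
    "item_cnt_month": [1.0, 0.0, 2.0, 1.0, 0.0, 3.0, 1.0, 2.0],
})

# Blocks >= max - 1 (the last two months) form the hold-out set.
cutoff = df["date_block_num"].max() - 1
holdout = df[df["date_block_num"] >= cutoff].copy()
train = df[df["date_block_num"] < cutoff].copy()

# pop() removes the target column from the features and returns it.
train_target = train.pop("item_cnt_month")
holdout_target = holdout.pop("item_cnt_month")

print(train["date_block_num"].unique().tolist())    # [0, 1]
print(holdout["date_block_num"].unique().tolist())  # [2, 3]
```

Every training month is strictly earlier than every held-out month, so the model is evaluated on data from the future relative to what it was trained on.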

Now that we have our data, we can start defining our model.
For this use case we will use the XGBRegressor from the xgboost module and run multiple setups with GridSearchCV (grid search with cross-validation) to find the best parameters for our model.

In [3]:
param_dict = {
    "n_estimators": [1000, 5000],
    "max_depth": [5, 10],
    "learning_rate": [0.1, 0.3],
    # needs a CUDA-capable GPU; XGBoost 2.0+ prefers tree_method="hist" with device="cuda"
    "tree_method": ["gpu_hist"],
    "min_child_weight": [0.3, 0.5],
    "colsample_bytree": [0.3, 0.6],
    "subsample": [0.8, 1.0],
}

cross_val = GridSearchCV(
    estimator=xgb.XGBRegressor(),
    param_grid=param_dict,
    verbose=2,
    cv=5)
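
Before starting a grid search it helps to know how many models it will train. The sketch below counts the combinations in the grid above using only the standard library; the names cv_folds, n_candidates and n_fits are introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above.
param_dict = {
    "n_estimators": [1000, 5000],
    "max_depth": [5, 10],
    "learning_rate": [0.1, 0.3],
    "tree_method": ["gpu_hist"],
    "min_child_weight": [0.3, 0.5],
    "colsample_bytree": [0.3, 0.6],
    "subsample": [0.8, 1.0],
}
cv_folds = 5  # matches cv=5 in GridSearchCV

# A grid search tries every combination of the listed values.
n_candidates = len(list(product(*param_dict.values())))
n_fits = n_candidates * cv_folds

print(n_candidates)  # 64
print(n_fits)        # 320
```

With the default refit=True, GridSearchCV adds one more fit of the best candidate on the full training data, so expect 321 model fits in total. Some of these use 5000 trees, which is why early stopping matters for the run time.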

# Training
Now let's run the training:

In [4]:
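# note: passing early_stopping_rounds and eval_metric to fit() is deprecated
# since XGBoost 1.6; newer releases expect them in the XGBRegressor constructor
cross_val.fit(
    train_X, train_y,
    early_stopping_rounds=20,
    eval_set=[(train_X, train_y), (test_X, test_y)],
    eval_metric="rmse",
    verbose=False,
)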

Let's use the best model that we found through cross-validation to make predictions for the submission and save them in a CSV file.

In [5]:
model = cross_val.best_estimator_
# load prediction data
submission_data = pd.read_csv("test.csv")
# load submission sample
submission = pd.read_csv("data/sample_submission.csv", index_col="ID")
submission["item_cnt_month"] = model.predict(submission_data)
# some values are slightly negative, indicating no sales, so let's set them to 0
# (use .loc: chained indexing would assign to a temporary copy and change nothing)
submission.loc[submission["item_cnt_month"] < 0, "item_cnt_month"] = 0
# save data
submission.to_csv("submission.csv")

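
Setting negative predictions to zero is easy to get subtly wrong in pandas: an assignment of the form df[mask][col] = 0 writes to a temporary copy and leaves df unchanged. The sketch below shows two forms that do modify the data, using a made-up frame (preds, masked and clipped are hypothetical names).

```python
import pandas as pd

# Invented predictions, some of them slightly negative.
preds = pd.DataFrame({"item_cnt_month": [-0.2, 0.0, 1.5, -0.01, 3.0]})

# Option 1: a boolean mask through .loc writes into the frame itself.
masked = preds.copy()
masked.loc[masked["item_cnt_month"] < 0, "item_cnt_month"] = 0

# Option 2: clip() bounds the values from below in a single call.
clipped = preds.copy()
clipped["item_cnt_month"] = clipped["item_cnt_month"].clip(lower=0)

print(masked["item_cnt_month"].tolist())   # [0.0, 0.0, 1.5, 0.0, 3.0]
print(clipped["item_cnt_month"].tolist())  # [0.0, 0.0, 1.5, 0.0, 3.0]
```

Both options give the same result; clip is the shorter choice when the only goal is to enforce a lower bound.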