# Building a model

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor, Pool
from boruta import BorutaPy

%load_ext nb_black

In [2]:
TUNING = False  # set to True to re-run the Boruta feature selection below

In [3]:
train_df = pd.read_csv(
    "data/train_cleaned.csv", parse_dates=["Scheduled Date", "Delivery Date"]
)
test_df = pd.read_csv(
    "data/test_cleaned.csv", parse_dates=["Scheduled Date", "Delivery Date"]
)

## Boruta for Feature importance

Columns we're going to be using (`Cost` is the target; the rest are candidate features):
```python
[
    "Artist Reputation",
    "Height",
    "Width",
    "Weight",
    "Material",
    "Price Of Sculpture",
    "Base Shipping Price",
    "International",
    "Express Shipment",
    "Installation Included",
    "Transport",
    "Fragile",
    "Customer Information",
    "Remote Location",
    "Cost",
    "delivery_offset",
    "scheduled_year_month",
    "scheduled_month",
    "Customer State",
    "Area",
    "Price per unit weight",
]
```

In [4]:
train_X_dummified = pd.get_dummies(
    train_df[
        [
            "Artist Reputation",
            "Height",
            "Width",
            "Weight",
            "Material",
            "Price Of Sculpture",
            "Base Shipping Price",
            "International",
            "Express Shipment",
            "Installation Included",
            "Transport",
            "Fragile",
            "Customer Information",
            "Remote Location",
            "delivery_offset",
            "scheduled_year_month",
            "scheduled_month",
            "Customer State",
            "Area",
            "Price per unit weight",
        ]
    ]
)
train_y = train_df[["Cost"]]
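
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The values are hypothetical; only the column names mirror the notebook. It shows why a name such as `Customer State_ID` can show up in the Boruta output further down.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "Weight": [120.0, 45.5, 300.0],
        "Material": ["Brass", "Clay", "Brass"],
        "Customer State": ["ID", "NY", "ID"],
    }
)

toy_dummified = pd.get_dummies(toy)

# Numeric columns are kept as-is; each string column becomes one
# indicator column per distinct value, named "<column>_<value>".
dummy_columns = toy_dummified.columns.to_list()
print(dummy_columns)
```

Each categorical column therefore contributes several columns to `train_X_dummified`, and Boruta ranks every one of them separately.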


In [5]:
if TUNING:
    # initialize Boruta
    rfr_boruta = RandomForestRegressor(n_jobs=-1, max_depth=5)
    boruta = BorutaPy(
        estimator=rfr_boruta,
        n_estimators="auto",
        max_iter=100,  # number of trials to perform
    )

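    # fit Boruta (it accepts NumPy arrays, not pd.DataFrame; y must be 1-D)
    boruta.fit(np.array(train_X_dummified), np.array(train_y).ravel())

    # green area: features Boruta confirmed as important
    green_area = train_X_dummified.columns[boruta.support_].to_list()
    # blue area: tentative features Boruta could neither confirm nor reject
    blue_area = train_X_dummified.columns[boruta.support_weak_].to_list()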

    print("features in the green area:", green_area)
    print("features in the blue area:", blue_area)

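According to Boruta (the green area holds the confirmed features from `support_`, the blue area the tentative ones from `support_weak_`):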

* features in the green area: ['Artist Reputation', 'Price Of Sculpture', 'Base Shipping Price', 'Customer State_ID']
* features in the blue area: ['Weight', 'scheduled_month']
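
The green and blue areas come from Boruta comparing each real feature against shuffled "shadow" copies of the features. The sketch below illustrates a single round of that comparison on synthetic data. It is not BorutaPy's implementation: it uses absolute Pearson correlation as a stand-in importance score, whereas BorutaPy takes importances from the fitted estimator and repeats the comparison over many iterations before applying a statistical test. The data and the helper `abs_corr_importance` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends on column 0 only.
n_samples = 500
X = rng.normal(size=(n_samples, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n_samples)


def abs_corr_importance(features, target):
    """Stand-in importance: absolute Pearson correlation with the target."""
    return np.array(
        [
            abs(np.corrcoef(features[:, j], target)[0, 1])
            for j in range(features.shape[1])
        ]
    )


# Shadow features: each column shuffled independently, which destroys any
# relationship with y while keeping the column's distribution.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

real_importance = abs_corr_importance(X, y)
shadow_max = abs_corr_importance(X_shadow, y).max()

# A feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_max
print(hits)
```

Features that collect hits consistently across rounds end up confirmed, features that rarely do are rejected, and the ones in between stay tentative.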

In [6]:
if TUNING:
    pd.DataFrame(
        {"column": train_X_dummified.columns.to_list(), "rank": boruta.ranking_}
    ).sort_values(by="rank").to_csv("data/boruta_feature_ranking.csv", index=False)

Based on the rankings, these are the features we'll be using:

```python
[
    "Artist Reputation",
    "Customer State",
    "Price Of Sculpture",
    "Base Shipping Price",
    "scheduled_month",
    "Area",
    "Weight",
    "delivery_offset",
    "Transport",
    "Material",
    "Customer Information",
    "Installation Included",
    "Express Shipment",
    "Fragile",
    "Remote Location",
    "International",
]
```
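
The saved ranking has one row per dummified column (for example `Customer State_ID`), while the list above names the source columns. One possible way to bridge the two, sketched here with made-up ranks, is to keep the best (lowest) rank among each source column's dummies. The helper `to_source_column`, the `source_columns` list, and the rank values are hypothetical, and this is not necessarily how the list above was derived.

```python
import pandas as pd

# Toy ranking with made-up ranks; the real file would be
# data/boruta_feature_ranking.csv as written by cell In [6].
ranking = pd.DataFrame(
    {
        "column": [
            "Artist Reputation",
            "Customer State_ID",
            "Customer State_NY",
            "Material_Clay",
            "Height",
        ],
        "rank": [1, 1, 7, 4, 9],
    }
)

# Source columns that were passed to pd.get_dummies (a subset, for illustration).
source_columns = ["Artist Reputation", "Customer State", "Material", "Height"]


def to_source_column(dummy_name, sources):
    """Map 'Customer State_ID' -> 'Customer State'; leave numeric columns as-is."""
    for source in sources:
        if dummy_name == source or dummy_name.startswith(source + "_"):
            return source
    return dummy_name


ranking["source"] = ranking["column"].map(
    lambda name: to_source_column(name, source_columns)
)

# Best (lowest) Boruta rank per source column.
best_rank = ranking.groupby("source")["rank"].min().sort_values()
print(best_rank)
```

Taking the minimum keeps a categorical column whenever at least one of its levels ranks well, which is a deliberate but debatable choice; a mean or median rank would be stricter.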

## Random Forests