# Applying CatBoost

### Introduction

In this lesson, we'll begin to see how to work with the catboost library.  We'll do so by working with the `imdb_movies` dataset.

### Loading Data

As a first step, install catboost if you haven't already installed it, or if you are working in Google Colab.

> Uncomment and run the cell below

In [None]:
# !pip install catboost



Let's get started by loading our imdb data.

In [2]:
import pandas as pd

df = pd.read_csv('./imdb_movies.csv')

In [3]:
df[:3]

Unnamed: 0,title,genre,budget,runtime,year,month,revenue
0,Avatar,Action,237000000,162.0,2009,12,2787965087
1,Pirates of the Caribbean: At World's End,Adventure,300000000,169.0,2007,5,961000000
2,Spectre,Action,245000000,148.0,2015,10,880674609


Next, we'll need to separate our data into training, validation, and test sets.  Because our data has a time component, let's begin by ordering it by year and then month.

In [4]:
df_sorted = df.sort_values(['year', 'month'])
df_sorted[:2]

Unnamed: 0,title,genre,budget,runtime,year,month,revenue
1108,Pinocchio,Animation,2600000,88.0,1940,2,84300000
862,Lolita,Drama,2000000,153.0,1962,6,9250000


We can see that our data is now ordered earliest to latest.  Next, we'll split our data into our features and targets.

In [5]:
X = df_sorted.iloc[:, 1:-1]
y = df_sorted['revenue']

Let's see if there's any additional feature engineering we need to do.  Here, we'll begin by seeing if any of our columns have na values.

In [5]:
df.isna().sum()

title       0
genre      84
budget      0
runtime     0
year        0
month       0
revenue     0
dtype: int64

Only genre does, so we'll select it and fill the na values with `-999`.
> Because we'll treat genre as a categorical feature, the `-999` placeholder simply acts as its own category, so our trees will be able to separate the missing values from the rest.

In [7]:
genre = X['genre']
genre_filled = genre.fillna(-999)
X_updated = X.assign(genre = genre_filled)

That's all of the feature engineering we need to do to use catboost.  We'll see that it will take care of our non-numeric categorical columns for us.
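
As a small variation on the fill step above, here is a sketch that uses a string placeholder instead of `-999`, so that every value in the column has the same type.  The toy column and the `'missing'` label are made up for illustration and are not part of our dataset.

```python
import pandas as pd

# Toy genre column with two missing values (illustrative data only).
genres = pd.Series(['Action', None, 'Drama', None, 'Action'], name='genre')

# Fill the gaps with a string label so the column holds only strings.
# 'missing' is an arbitrary placeholder name chosen for this sketch.
genres_filled = genres.fillna('missing')

# The placeholder now shows up as a category of its own.
print(genres_filled.value_counts())
```

Either placeholder does the same job.  What matters is that no na values remain in a categorical column.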

Next, let's split our data.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_updated, y, shuffle = False, test_size = .2)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, shuffle = False, test_size = .5)
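
To see why we set `shuffle = False`, here is a minimal sketch of a chronological split written with plain positional slicing.  The `chronological_split` helper and the toy data are hypothetical and only stand in for the two `train_test_split` calls above; the exact rounding of the split sizes may differ from sklearn's.

```python
import pandas as pd

def chronological_split(frame, train_frac=0.8, validate_frac=0.1):
    """Split rows by position; assumes `frame` is already sorted oldest to newest."""
    n = len(frame)
    train_end = round(n * train_frac)
    validate_end = round(n * (train_frac + validate_frac))
    return (frame.iloc[:train_end],
            frame.iloc[train_end:validate_end],
            frame.iloc[validate_end:])

# Toy data standing in for the movies table (values are made up).
toy = pd.DataFrame({
    'year': [2001, 1999, 2005, 2003, 2010, 2008, 2015, 2012, 2020, 2018],
    'month': [5, 7, 1, 3, 9, 11, 2, 6, 4, 12],
})
toy_sorted = toy.sort_values(['year', 'month'])

train, validate, test = chronological_split(toy_sorted)

# Every training row is older than every validation row, and so on.
print(train['year'].max(), validate['year'].min(), test['year'].min())
```

Because the rows stay in time order, the model is always evaluated on movies released after the ones it trained on.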

### Fitting Catboost Model

Now it's time to fit a model.  Remember that we still have one categorical column, genre.  We don't want to translate this column to a number ourselves, because we want catboost to do this for us through mean target encoding.  For catboost to do this, we need to find the indices of all of our categorical columns.
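
To make the idea of mean target encoding concrete, here is a sketch of the plain version in pandas, where each category is replaced by the average target of its rows.  This is a simplification on made-up data: catboost's own encoding is more elaborate (it computes the statistics in an ordered fashion to limit target leakage), so treat this as the basic idea rather than what the library does internally.

```python
import pandas as pd

# Toy data (made up): two genres and their revenues.
toy = pd.DataFrame({
    'genre': ['Action', 'Action', 'Drama', 'Drama', 'Drama'],
    'revenue': [100.0, 300.0, 50.0, 70.0, 90.0],
})

# Average revenue per genre.
genre_means = toy.groupby('genre')['revenue'].mean()

# Replace each genre with the mean revenue of that genre.
toy['genre_encoded'] = toy['genre'].map(genre_means)
print(toy)
```

With that picture in mind, all catboost needs from us is to know which columns are categorical.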

So let's do that now.

> We'll find all of the columns that are of type object.

In [12]:
import numpy as np
cat_indices = np.where(X.dtypes == object)[0]
cat_indices

array([0])
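
An alternative way to find the same positions is to let pandas select the non-numeric columns for us.  The sketch below uses a made-up `toy_X` frame, and the variable names are only for illustration.

```python
import pandas as pd

# Toy feature frame (made up) with one non-numeric column.
toy_X = pd.DataFrame({
    'genre': ['Action', 'Drama', 'Comedy'],
    'budget': [237000000, 2000000, 15000000],
    'runtime': [162.0, 153.0, 95.0],
})

# Keep only the columns that are not numeric.
cat_columns = toy_X.select_dtypes(exclude='number').columns

# Translate the column names into positional indices.
cat_positions = [toy_X.columns.get_loc(name) for name in cat_columns]
print(list(cat_columns), cat_positions)
```

This selects by "not numeric" rather than by the object dtype, so it would also pick up boolean or datetime columns if the data had any.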

Now we can import catboost, and pass through the indices with the `cat_features` argument.

> We should also set `logging_level = 'Silent'` or catboost will print the score for each iteration.

In [13]:
from catboost import CatBoostRegressor

In [14]:
cbr = CatBoostRegressor(cat_features=cat_indices, logging_level = 'Silent')
cbr.fit(X_train, y_train)

<catboost.core.CatBoostRegressor at 0x11882a310>

In [20]:
cbr.score(X_validate, y_validate)

0.28271027269811
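
The number returned by `score` for a regressor is the R² metric (the coefficient of determination).  Here is a small numpy sketch of how that metric is computed; the `r_squared` helper and the numbers are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual error / total variation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of our predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)      # predictions match exactly
baseline = r_squared(y_true, [6.0] * 4)  # always predict the mean of y_true
print(perfect, baseline)
```

So a score of roughly 0.28 means our model explains about 28% of the variation in revenue on the validation set, which leaves room for improvement.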

Catboost has feature importances out of the box.

In [19]:
importances = pd.Series(index = cbr.feature_names_, data = cbr.feature_importances_)
importances.sort_values()

year        9.189228
month      14.528827
genre      18.506504
runtime    18.754304
budget     39.021137
dtype: float64

### Understanding Pooling

To fit our catboost model, we needed to pass through the `cat_features` values.  To avoid doing this every time we fit a model, we can work with a data pool.  A pool bundles our features, our target, and our categorical feature indices into a single object.

In [21]:
from catboost import Pool
train_pool = Pool(X_train, 
                  y_train, 
                  cat_features=[0])

validate_pool = Pool(X_validate, 
                  y_validate, 
                  cat_features=[0])

test_pool = Pool(X_test, 
                  y_test, 
                  cat_features=[0])

Now when we train our model, we simply pass through the `train_pool` to do so. 

In [22]:
cbr = CatBoostRegressor(logging_level = 'Silent')
cbr.fit(train_pool)

<catboost.core.CatBoostRegressor at 0x119b92690>

Similarly, we can score on the validate pool.

In [23]:
cbr.score(validate_pool)

0.28271027269811

### Finding the ideal number of trees

By default catboost trains with 1000 trees.

In [26]:
cbr.tree_count_

1000

But 1000 trees may be too many, and our model may be overfitting to the training data.  So instead, we can pass through both the training and validation pools when fitting the model, and ask catboost to return the model with the ideal number of trees.

In [28]:
cbr_best = CatBoostRegressor(logging_level = 'Silent', use_best_model=True).fit(train_pool, eval_set=validate_pool)

In [29]:
cbr_best.score(validate_pool)

0.5060700445435766

Now that's more like it.

In [30]:
cbr_best.tree_count_

64

So we can see that 64 was the ideal number of trees.  We can set this with the `iterations` hyperparameter.

In [50]:
cbr_smaller = CatBoostRegressor(logging_level = 'Silent', iterations=64).fit(train_pool)

In [51]:
cbr_smaller.score(validate_pool)

0.3230233318966579

In [53]:
cbr_smaller.tree_count_

64

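So we can see that we did not achieve the same score that we did previously, but we still saw an increase in our score by reducing the number of trees.  One likely reason for the gap is that, unless we set `learning_rate` ourselves, catboost chooses it automatically based in part on the number of iterations, so a model trained with `iterations=64` is not the same as the first 64 trees of a 1000 iteration model.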

### Summary

In this lesson, we saw some features of the catboost library.  We saw that catboost will perform mean target encoding on our non-numeric features, so long as we specify their indices in our `cat_features` argument.  We saw that we can combine our features, target, and `cat_features` in a catboost `Pool`.

Then we saw that we can learn about how many trees to train by using the `eval_set` argument when training and setting `use_best_model` as `True`.  Finally, we used this knowledge to set the number of `iterations` when training our model going forward.

### Resources

[Catboost parameter docs](https://catboost.ai/docs/concepts/python-reference_parameters-list.html)

[Catboost Description and Hyperparameters](https://towardsdatascience.com/https-medium-com-talperetz24-mastering-the-new-generation-of-gradient-boosting-db04062a7ea2)

[Catboost Tutorial](https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb)