# IMDB Mean Encoding

### Introduction

In this lesson, we'll practice mean encoding and cross validation with the IMDB dataset.  Let's again start by loading the data.

### Loading the Data

In [5]:
import pandas as pd

movies_df = pd.read_csv('./coerced_movies.csv', index_col = 0)

In [6]:
movies_df.shape

(799, 6)

### Assigning and Coercing

Now, let's try mean encoding.

In [7]:
X = movies_df.drop('revenue', axis = 1)

In [8]:
y = movies_df['revenue']

### Feature Engineering 

In [9]:
from sklearn.model_selection import train_test_split

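# Hold out 40% of the rows; shuffle = False keeps the rows in their original order.
movies_train, movies_test = train_test_split(movies_df, test_size = .4, shuffle = False)
# Split the held-out rows evenly, giving 60% train, 20% validation, 20% test overall.
movies_validate, movies_test = train_test_split(movies_test, test_size = .5, shuffle = False)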
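
To make the proportions concrete, here is a small sketch of the same two-step split on a stand-in DataFrame with 799 rows. The `rows` frame is made up for illustration; only its row count matches `movies_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 799 rows, the same number of rows as movies_df.
rows = pd.DataFrame({'row': range(799)})

# First split: 60% train, 40% held out. shuffle = False preserves row order.
train, held_out = train_test_split(rows, test_size = .4, shuffle = False)
# Second split: divide the held-out rows evenly into validation and test.
validate, test = train_test_split(held_out, test_size = .5, shuffle = False)

split_sizes = (len(train), len(validate), len(test))
print(split_sizes)
```

With `shuffle = False`, each piece is a contiguous block of rows in the original order, so the training set is simply the first 479 rows. That count matches the total of the genre counts shown for `movies_train` below.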

In [10]:
movies_train[:3]

| | genre | budget | runtime | year | month | revenue |
|---|---|---|---|---|---|---|
| 235 | Fantasy | 97250400 | 116.0 | 2008 | 1 | 132900000 |
| 849 | Crime | 55000000 | 141.0 | 2008 | 1 | 113020255 |
| 1413 | Drama | 35000000 | 101.0 | 2008 | 1 | 32483410 |


In [12]:
movies_train['genre'].value_counts()

Action             109
Comedy             102
Drama               83
Adventure           57
Animation           33
Fantasy             18
na                  17
Thriller            15
Horror              13
Crime               11
Romance             11
Science Fiction     10
Name: genre, dtype: int64

Now, to perform mean encoding, we group our training data by genre and find the average revenue for each genre.

In [13]:
# Select the revenue column before averaging, so only the target is aggregated.
genre_means = movies_train.groupby('genre')['revenue'].mean()

In [15]:
genre_means.sort_values()

genre
Horror             7.330516e+07
Crime              7.757479e+07
Drama              9.985017e+07
Romance            1.059148e+08
Comedy             1.305874e+08
Thriller           1.880501e+08
Fantasy            1.977663e+08
na                 2.003466e+08
Action             2.187402e+08
Adventure          2.961663e+08
Animation          3.645671e+08
Science Fiction    4.507836e+08
Name: revenue, dtype: float64

So we can see that Science Fiction has the highest mean revenue, followed by Animation, Adventure, and Action.  This seems about right.

We can turn these genres and their corresponding mean target values into a dictionary.

In [16]:
genre_mean_dict = genre_means.to_dict()
genre_mean_dict

{'Action': 218740154.56880733,
 'Adventure': 296166336.3684211,
 'Animation': 364567144.6969697,
 'Comedy': 130587423.88235295,
 'Crime': 77574786.9090909,
 'Drama': 99850173.27710843,
 'Fantasy': 197766303.7222222,
 'Horror': 73305161.0,
 'Romance': 105914778.45454545,
 'Science Fiction': 450783550.9,
 'Thriller': 188050065.13333333,
 'na': 200346551.70588234}

Then we can replace each genre with its mean target value by mapping the dictionary over the genre column.

In [17]:
genres_coerced = movies_train['genre'].map(genre_mean_dict)

Then we add the encoded column to the dataframe and drop the original genre column.

In [19]:
movies_coerced = movies_train.assign(genre_means = genres_coerced)
movies_coerced = movies_coerced.drop('genre', axis = 1)

In [20]:
movies_coerced[:3]

| | budget | runtime | year | month | revenue | genre_means |
|---|---|---|---|---|---|---|
| 235 | 97250400 | 116.0 | 2008 | 1 | 132900000 | 197766300.0 |
| 849 | 55000000 | 141.0 | 2008 | 1 | 113020255 | 77574790.0 |
| 1413 | 35000000 | 101.0 | 2008 | 1 | 32483410 | 99850170.0 |
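
The steps above (group, build a mapping, map, then drop the original column) can be sketched end to end on a tiny DataFrame. The `toy` data below is made up for illustration and is not taken from the movies dataset.

```python
import pandas as pd

# Made-up toy data, not taken from the movies dataset.
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Comedy', 'Comedy', 'Horror'],
    'revenue': [10.0, 30.0, 40.0, 60.0, 5.0],
})

# Step 1: mean of the target for each category.
toy_means = toy.groupby('genre')['revenue'].mean().to_dict()

# Step 2: replace each category with its mean, then drop the original column.
toy_encoded = toy.assign(genre_means = toy['genre'].map(toy_means)).drop(columns = 'genre')
print(toy_encoded)
```

Every Drama row receives 20.0, every Comedy row 50.0, and the single Horror row receives 5.0, which is its own revenue.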


Ok, now it's time to train our model.

In [21]:
X_train = movies_coerced.drop('revenue', axis = 1)
y_train = movies_coerced['revenue']

In [22]:
from sklearn.ensemble import RandomForestRegressor
est = RandomForestRegressor(n_estimators = 100, max_depth = 30, max_features = 'log2', random_state = 1)
est.fit(X_train, y_train).score(X_train, y_train)

0.9292899404740033

We can see that this gives us a high score on the training set, an R² of about 0.93.

However, on the validation set we do not perform as well.

In [23]:
genre_means_val = movies_validate['genre'].map(genre_mean_dict)

movies_val_coerced = movies_validate.assign(genre_means = genre_means_val)
X_validate = movies_val_coerced.drop(columns = ['genre', 'revenue'])
y_validate = movies_val_coerced['revenue']

In [24]:
est.score(X_validate, y_validate)

0.5007806403778132

The drop in score, from about 0.93 on the training set to about 0.50 on the validation set, is a sign of overfitting to our training data.

### Regularization via Cross Validation

The overfitting can come in because we are using knowledge about an observation's target and embedding it as a feature.
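
A small made-up example shows how direct this leak can be. When a category appears only once in the training data, its mean encoding is exactly that row's target, so the feature hands the model the answer for that row. The `toy` data is hypothetical.

```python
import pandas as pd

# Made-up toy data: 'Horror' appears in a single row.
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Horror'],
    'revenue': [10.0, 20.0, 60.0, 5.0],
})

# Plain mean encoding, computed on the same rows it is applied to.
encoded = toy['genre'].map(toy.groupby('genre')['revenue'].mean())

# For the lone Horror row, the encoded feature equals the target itself.
horror_row = toy['genre'] == 'Horror'
leaked = (encoded[horror_row] == toy.loc[horror_row, 'revenue']).all()
print(leaked)
```

Larger categories leak less, because each row contributes only a small share of its category's mean, but every row still influences its own encoding.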

To avoid this, we can use cross validation.  We divide our training data into K folds, and for each fold we compute the genre means from the other K - 1 folds and use them to encode the rows in the holdout fold.  This way, no row's encoding includes its own target.

Let's try this by looping through our folds.

In [152]:
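from sklearn.model_selection import KFold

col = 'genre'
target = 'revenue'
# Work from movies_train, which still has both the genre and revenue columns.
train_new = movies_train.copy()
ts_folds = list(KFold(5, shuffle = True).split(train_new))

# Default every row to the overall mean, in case a genre is missing from the training folds.
overall_mean = train_new[target].mean()
train_new = train_new.assign(genre_mean = overall_mean)
mean_col = train_new.columns.get_loc('genre_mean')

for tr_ind, val_ind in ts_folds:
    fold_tr, fold_val = train_new.iloc[tr_ind], train_new.iloc[val_ind]
    # Compute the genre means from the training folds only.
    avg_mapper = fold_tr.groupby(col)[target].mean().to_dict()
    # Encode the holdout fold; genres unseen in the training folds keep the overall mean.
    grouped_means = fold_val[col].map(avg_mapper).fillna(overall_mean)
    train_new.iloc[val_ind, mean_col] = grouped_means.values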
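
The same loop can be wrapped in a reusable function and checked on toy data. This is a minimal sketch: the function name `out_of_fold_mean_encode`, its `seed` parameter, and the `toy` data are hypothetical and are not part of the lesson's notebook.

```python
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_mean_encode(df, col, target, n_splits = 5, seed = 0):
    """Return a Series where each row's encoding comes from the other folds only."""
    overall_mean = df[target].mean()
    encoded = pd.Series(overall_mean, index = df.index, dtype = float)
    folds = KFold(n_splits, shuffle = True, random_state = seed).split(df)
    for tr_ind, val_ind in folds:
        # Category means from the training folds only.
        fold_means = df.iloc[tr_ind].groupby(col)[target].mean()
        # Encode the holdout fold; unseen categories fall back to the overall mean.
        holdout = df[col].iloc[val_ind].map(fold_means).fillna(overall_mean)
        encoded.iloc[val_ind] = holdout.values
    return encoded

# Made-up toy data: the single 'Horror' row can never see its own revenue.
toy = pd.DataFrame({
    'genre': ['Drama'] * 5 + ['Comedy'] * 4 + ['Horror'],
    'revenue': [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 15.0, 25.0, 35.0, 999.0],
})
toy_oof = out_of_fold_mean_encode(toy, 'genre', 'revenue')
print(toy_oof)
```

Because the lone Horror row is always in the holdout fold when it is encoded, no training fold contains its genre, and it receives the overall mean rather than its own revenue of 999.0.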

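Because each training row was encoded with means from a different set of folds, a genre no longer has a single encoded value in the training data.  So we need one final mapping to apply to our validation set, and for that we use the genre means of revenue over the whole training set.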

In [31]:
target = 'revenue'
col = 'genre'
avg_cross_mapper = train_new.groupby(col).mean()[target].to_dict()
avg_cross_mapper

{'Action': 218740154.56880733,
 'Adventure': 296166336.3684211,
 'Animation': 364567144.6969697,
 'Comedy': 130587423.88235295,
 'Crime': 77574786.9090909,
 'Drama': 99850173.27710843,
 'Fantasy': 197766303.7222222,
 'Horror': 73305161.0,
 'Romance': 105914778.45454545,
 'Science Fiction': 450783550.9,
 'Thriller': 188050065.13333333,
 'na': 200346551.70588234}

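So these are the values we'll update our validation and test sets with.  Note that they match `genre_mean_dict` from earlier, because both are the per-genre means of revenue over all of the training rows.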
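
When applying these values, a validation or test row may carry a genre that never appeared in training, in which case `map` returns `NaN`. One way to handle this, sketched below with a hypothetical helper `apply_mean_encoding` and made-up values, is to fall back to a default such as the overall training mean. This is an assumption about how one might handle unseen categories, not something the lesson's code does.

```python
import pandas as pd

def apply_mean_encoding(df, col, mapper, fallback):
    """Encode df[col] with mapper; categories missing from mapper get fallback."""
    encoded = df[col].map(mapper).fillna(fallback)
    # Add the encoded column as '<col>_means' and drop the original column.
    return df.assign(**{f'{col}_means': encoded}).drop(columns = col)

# Made-up values for illustration, not the real genre means.
toy_mapper = {'Drama': 20.0, 'Comedy': 50.0}
toy_validate = pd.DataFrame({
    'genre': ['Drama', 'Western', 'Comedy'],
    'budget': [1.0, 2.0, 3.0],
})

# 'Western' never appeared in training, so it gets the fallback value.
toy_val_encoded = apply_mean_encoding(toy_validate, 'genre', toy_mapper, fallback = 35.0)
print(toy_val_encoded)
```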

### Summary

In this lesson, we saw the procedure for mean target encoding.  We perform mean target encoding by grouping our data by a categorical feature and replacing each category with the mean of the target for that category.  One issue with this is that by using information about our target to predict our target, we suffer from data leakage: we have information available in our training data that will not be available when we deploy our model.  To limit this data leakage, we used cross validation, so that each observation is encoded with mean values computed from other observations.

The CatBoost library applies a form of target encoding to categorical features out of the box for us.

### Resources 

[Mean Encoding](https://github.com/scikit-learn-contrib/category_encoders)

[Kaggle Mean Encoding](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)