# Mean encodings

In this programming assignment you will work with the `1C` dataset from the final competition. You are asked to encode `item_id` in 4 different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between each resulting encoding and the target variable, up to 4 decimal places.

### General tips

* Fill NaNs in the encoding with `0.3343`.
* Some encoding schemes depend on the sorting order, so to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

In [1]:
import pandas as pd
import numpy as np
from itertools import product
import os
import zipfile

In [2]:
print(np.__version__)
print(pd.__version__)

1.18.1
1.0.1


# Read data

In [3]:
zf = zipfile.ZipFile('../final_project/competitive-data-science-predict-future-sales.zip')

In [4]:
sales = pd.read_csv(zf.open('sales_train.csv'))

In [5]:
sales.head()

|  | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 0 | 02.01.2013 | 0 | 59 | 22154 | 999.0 | 1.0 |
| 1 | 03.01.2013 | 0 | 25 | 2552 | 899.0 | 1.0 |
| 2 | 05.01.2013 | 0 | 25 | 2552 | 899.0 | -1.0 |
| 3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
| 4 | 15.01.2013 | 0 | 25 | 2555 | 1099.0 | 1.0 |


# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.

In [6]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

In [7]:
grid = []

In [8]:
# For every month, build a grid of all shop/item combinations seen in that month.
# `product` iterates over the Cartesian product; the result is turned into a DataFrame below.
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)    

#get aggregated values for (shop_id, item_id, month)
gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':'sum'})

# rename item_cnt_day as target
gb.rename(columns = {'item_cnt_day': 'target'}, inplace= True)

#join aggregated data to the grid
all_data = pd.merge(grid, gb,how= 'left', on = index_cols).fillna(0)

all_data.sort_values(by= ['date_block_num','shop_id','item_id'],inplace=True)

In [9]:
print(all_data.shape)
all_data.head()


(10913850, 4)


|  | shop_id | item_id | date_block_num | target |
|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0.0 |
| 141495 | 0 | 27 | 0 | 0.0 |
| 144968 | 0 | 28 | 0 | 0.0 |
| 142661 | 0 | 29 | 0 | 0.0 |
| 138947 | 0 | 32 | 0 | 6.0 |
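
To make the grid construction easier to follow, here is a minimal sketch of the same aggregation on a hand-made table. The frame `toy_sales` and the names `toy_grid`, `monthly` and `toy_all` are hypothetical and only illustrate the idea; the real data is much larger.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical daily sales: two months, two shops, two items
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [1, 1, 2, 1],
    "item_id":        [10, 10, 11, 10],
    "item_cnt_day":   [1.0, 2.0, 1.0, 4.0],
})
index_cols = ["shop_id", "item_id", "date_block_num"]

# One block per month: every shop seen that month crossed with every item seen that month
blocks = []
for block_num, month in toy_sales.groupby("date_block_num"):
    shops = month["shop_id"].unique()
    items = month["item_id"].unique()
    blocks.append(np.array(list(product(shops, items, [block_num])), dtype="int32"))
toy_grid = pd.DataFrame(np.vstack(blocks), columns=index_cols)

# Monthly totals per (shop, item, month), renamed to `target`
monthly = (toy_sales.groupby(index_cols, as_index=False)["item_cnt_day"].sum()
           .rename(columns={"item_cnt_day": "target"}))

# Left join keeps every grid row; pairs without sales get target 0
toy_all = toy_grid.merge(monthly, how="left", on=index_cols).fillna(0)
print(toy_all)
```

Month 0 contributes 2 shops x 2 items = 4 rows and month 1 contributes a single row, so shop/item pairs that had no sales in a month still appear, with a target of zero.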



# Mean encodings without regularization

Now that the technical work is done, we are ready to actually *mean encode* the desired `item_id` variable. 

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 

#### Method 1

In [10]:
# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True)

In [11]:
all_data

|  | shop_id | item_id | date_block_num | target | item_target_enc |
|---|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0.0 | 0.022222 |
| 141495 | 0 | 27 | 0 | 0.0 | 0.056834 |
| 144968 | 0 | 28 | 0 | 0.0 | 0.141176 |
| 142661 | 0 | 29 | 0 | 0.0 | 0.037383 |
| 138947 | 0 | 32 | 0 | 6.0 | 1.319042 |
| ... | ... | ... | ... | ... | ... |
| 10768834 | 59 | 22162 | 33 | 0.0 | 1.556793 |
| 10769024 | 59 | 22163 | 33 | 0.0 | 0.581395 |
| 10769690 | 59 | 22164 | 33 | 0.0 | 1.235589 |
| 10771216 | 59 | 22166 | 33 | 0.0 | 0.295918 |


#### Method 2

In [12]:
'''
   Unlike `.target.mean()`, the `transform` function
   returns a Series with the same index as `all_data`.
   This single line of code is equivalent to the first two lines of Method 1.
'''
# Same idea as Method 1, but instead of mapping a Series of means onto `item_id`,
# `transform` computes the group mean and broadcasts it to every row directly
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 


In [13]:
# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621791


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. You need to **compute the correlation coefficient** for each encoding that you implement and **submit those values to Coursera**.
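
Since the same check is repeated for every scheme, it can be wrapped in a small helper. This is only a sketch: the function name `encoding_corr` and the toy frame are hypothetical, and the fill value `0.3343` is the one prescribed in the general tips.

```python
import numpy as np
import pandas as pd

GLOBAL_MEAN = 0.3343  # fill value prescribed in the general tips


def encoding_corr(df, enc_col, target_col="target", fill_value=GLOBAL_MEAN):
    """Pearson correlation between an encoded column and the target."""
    # Fill missing encodings on a copy, so the caller's frame is not modified
    encoded = df[enc_col].fillna(fill_value).to_numpy()
    return np.corrcoef(df[target_col].to_numpy(), encoded)[0, 1]


# Toy check: an encoding that is an exact multiple of the target correlates perfectly
toy = pd.DataFrame({
    "target": [0.0, 1.0, 2.0, 3.0],
    "enc":    [0.0, 2.0, 4.0, 6.0],
})
toy_corr = encoding_corr(toy, "enc")
print(round(toy_corr, 4))
```

With the real data, a call such as `round(encoding_corr(all_data, 'item_target_enc'), 4)` would give the four-decimal value to submit.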

# 1. KFold scheme

Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

**Now it's your turn to write the code!** 

You may use 'Regularization' video as a reference for all further tasks.

First, implement KFold scheme with five folds. Use KFold(5) from sklearn.model_selection. 

1. Split your data in 5 folds with `sklearn.model_selection.KFold` with `shuffle=False` argument.
2. Iterate through folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold.

    *  See **Method 1** from the example implementation. In particular, learn what the built-in `map` and `pd.Series.map` functions do. They are handy in many situations.
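
The two steps above can be sketched on a tiny frame where the result is easy to verify by hand. This is an illustration under assumptions, not the reference solution: the helper `kfold_mean_encode` is hypothetical, and it builds contiguous folds with `np.array_split`, which is intended to mirror what `KFold` does with `shuffle=False`.

```python
import numpy as np
import pandas as pd


def kfold_mean_encode(df, col, target, n_splits):
    """Out-of-fold mean encoding over contiguous, unshuffled folds."""
    enc = pd.Series(np.nan, index=df.index, dtype="float64")
    all_pos = np.arange(len(df))
    # Contiguous position blocks play the role of the validation folds
    for val_idx in np.array_split(all_pos, n_splits):
        train_idx = np.setdiff1d(all_pos, val_idx)
        # Category means are estimated on the other folds only
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # Map those means onto the current fold; unseen categories stay NaN
        enc.iloc[val_idx] = df[col].iloc[val_idx].map(means).to_numpy()
    return enc


toy = pd.DataFrame({
    "item_id": [1, 1, 2, 2, 1, 2],
    "target":  [1.0, 3.0, 0.0, 2.0, 5.0, 4.0],
})
toy_enc = kfold_mean_encode(toy, "item_id", "target", n_splits=3)
print(toy_enc.tolist())
```

For the first fold (rows 0 and 1) only rows 2 to 5 are used, where item 1 appears once with target 5, so both rows are encoded as 5. No row ever sees its own target, which is the point of the scheme.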

In [14]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)

print('-'*50)
print('Number of folds in Regularization: {}'.format(kf.get_n_splits(all_data)))
print(' ')
print(kf)
print('-'*50)

for train_index, test_index in kf.split(all_data):
    #print("TRAIN:", train_index, "TEST:", test_index)
    
    # split the rows into train and test; copy them to avoid a SettingWithCopyWarning
    X_train, X_test = all_data.iloc[train_index].copy(), all_data.iloc[test_index].copy()
    
    # estimate the target mean per item on the train folds and map it onto the test fold
    item_id_target_mean = X_train.groupby('item_id').target.mean()
    X_test['item_target_enc'] = X_test['item_id'].map(item_id_target_mean)
    
    # take your newly created test set and assign it into your all_data dataframe
    all_data.iloc[test_index] = X_test
    

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 
encoded_feature = all_data['item_target_enc'].values   
    
# You will need to compute correlation like that
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

--------------------------------------------------
Number of folds in Regularization: 5
 
KFold(n_splits=5, random_state=None, shuffle=False)
--------------------------------------------------
0.41645907127988024


# 2. Leave-one-out scheme

Now, implement leave-one-out scheme. Note that if you just simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time. 

To implement a faster version, note that to calculate the mean target value using all the objects but one *given object*, you can:

1. Calculate sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. 

Note that you do not need to perform `1.` for every object. And `2.` can be implemented without any `for` loop.

It is most convenient to use the `.transform` function, as in **Method 2**.
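
As a quick sanity check of the trick described above, here is a minimal sketch on a toy frame (the data and the column name `item_loo_enc` are hypothetical). It computes the per-item sum and count once with `transform` and then removes each row's own target without any loop.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2, 3],
    "target":  [1.0, 2.0, 6.0, 4.0, 0.0, 5.0],
})

grp = toy.groupby("item_id")["target"]
group_sum = grp.transform("sum")    # step 1: per-item sum, broadcast to every row
group_cnt = grp.transform("count")  # number of rows per item

# Step 2: drop the row's own target and divide by (n_objects - 1).
# An item with a single row gives 0 / 0 = NaN, which is filled afterwards.
toy["item_loo_enc"] = (group_sum - toy["target"]) / (group_cnt - 1)
toy["item_loo_enc"] = toy["item_loo_enc"].fillna(0.3343)
print(toy)
```

For item 1 the sum is 9 over 3 rows, so the first row gets (9 - 1) / 2 = 4. Item 3 occurs only once, has no other rows to average, and falls back to the fill value.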

In [15]:
# YOUR CODE GOES HERE

# For each row: (sum of the item's targets - this row's target) / (number of rows for the item - 1).
# `transform` passes each item's targets to the lambda as a Series and keeps the index of `all_data`.
# Items that occur only once give 0 / 0 = NaN, which is filled below
all_data['item_target_enc'] = (all_data.groupby('item_id')['target']
                                             .transform(lambda x: (x.sum() - x) / (x.count() - 1)))

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 
encoded_feature = all_data['item_target_enc'].values   

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

0.480384831129305


# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).
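
To see what the smoothing does, here is a small sketch on toy data. It assumes the formula `(mean(target) * nrows + globalmean * alpha) / (nrows + alpha)`; the values `alpha = 4` and `global_mean = 1.0` are chosen only to keep the arithmetic readable and are not the ones required by the assignment.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "item_id": [1, 1, 1, 1, 2],
    "target":  [2.0, 2.0, 2.0, 2.0, 10.0],
})

alpha = 4          # toy regularization strength (the assignment uses 100)
global_mean = 1.0  # toy prior (the assignment uses 0.3343)

grp = toy.groupby("item_id")["target"]
n_rows = grp.transform("count")    # rows per category, not rows in the dataset
item_mean = grp.transform("mean")

# Weighted average of the category mean and the global mean
toy["item_smooth_enc"] = (item_mean * n_rows + global_mean * alpha) / (n_rows + alpha)
print(toy)
```

Item 1 has four rows, so its encoding (1.5) sits halfway between its own mean (2.0) and the prior. Item 2 has a single row, so its encoding (2.8) is pulled much further from its raw mean (10.0) towards the prior.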

In [16]:
# set alpha and global mean
globalmean = 0.3343
alpha = 100

# Smoothed mean per item: (mean(target) * nrows + globalmean * alpha) / (nrows + alpha),
# where nrows is the number of rows for that item.
# `transform` evaluates the lambda per item and broadcasts the result to every row
all_data['item_target_enc'] = (all_data.groupby('item_id')['target']
                                             .transform(lambda x: (x.mean()*x.count() + globalmean*alpha)/ (x.count() + alpha)))

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 
encoded_feature = all_data['item_target_enc'].values   

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

0.4818198797097282


# 4. Expanding mean scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas.
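
A minimal sketch of the idea on toy data, assuming the rows are already in the order that defines "past" and "future" (the frame and the column name `item_exp_enc` are hypothetical). Each row is encoded with the mean target of the earlier rows that share its `item_id`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "item_id": [1, 2, 1, 1, 2],
    "target":  [1.0, 4.0, 3.0, 5.0, 0.0],
})

grp = toy.groupby("item_id")
# Running sum of targets per item, minus the current row's own target
prev_sum = grp["target"].cumsum() - toy["target"]
# Number of earlier rows with the same item: 0 for the first occurrence
prev_cnt = grp.cumcount()

# The first occurrence of an item gives 0 / 0 = NaN and is filled afterwards
toy["item_exp_enc"] = (prev_sum / prev_cnt).fillna(0.3343)
print(toy)
```

The third row (item 1) only sees the first row's target, so it gets 1.0; the fourth row sees targets 1 and 3 and gets 2.0. Because the result depends on row order, the frame has to be sorted consistently before encoding.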

In [17]:
# First attempt, kept for reference. It does not give the expanding mean: the cumulative sum
# still includes the current row's target, while np.arange counts only the previous rows
# (and is 0 for the first row of each item).
# all_data['item_target_enc'] = (all_data.groupby('item_id')['target']
#                                              .transform(lambda x: x.cumsum()/ np.arange(len(x.index))))

# I tried to implement cumcount by hand according to the pandas documentation, but
# it did not work out. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumcount.html

In [18]:
# I am using the pedestrian approach on this one:
# for each row, average the targets of the previous rows with the same item_id
cumsum = all_data.groupby('item_id')['target'].cumsum() - all_data['target']
cumcnt = all_data.groupby('item_id').cumcount()
# write to `item_target_enc`, the column that is filled and correlated below
all_data['item_target_enc'] = cumsum / cumcnt

# Fill NaNs (the first occurrence of each item has no history, so 0 / 0 gives NaN)
all_data['item_target_enc'].fillna(0.3343, inplace=True) 
encoded_feature = all_data['item_target_enc'].values   

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

