Version 1.1.0

# Mean encodings

In this programming assignment you will work with the `1C` dataset from the final competition. You are asked to encode `item_id` in four different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between each resulting encoding and the target variable, to 4 decimal places.

### General tips

* Fill NaNs in the encoding with `0.3343`.
* Some encoding schemes depend on the sorting order, so to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

In [1]:
import pandas as pd
import numpy as np
from itertools import product
from grader import Grader

# Read data

In [2]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.

In [3]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
#named aggregation yields a flat `target` column directly, so no column renaming is needed
#(the nested-dict form of `agg` is deprecated and fails in recent pandas versions)
gb = sales.groupby(index_cols,as_index=False).agg(target=('item_cnt_day','sum'))

#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)


In [5]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


In [6]:
all_data.date_block_num.unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33])

# Mean encodings without regularization

Now that the technical work is done, we are ready to actually *mean encode* the desired `item_id` variable. 

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 

#### Method 1

In [8]:
# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)


In [13]:
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.483038698862


#### Method 2

In [14]:
'''
    Unlike `.target.mean()`, `transform` returns a Series
    with the same index as `all_data`.
    This single line of code is equivalent to the first two lines of Method 1.
'''
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.483038698862


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. For each encoding you implement below, you need to **compute this correlation coefficient** and **submit it to Coursera**.
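As a small illustration of what is being submitted: `np.corrcoef` returns a 2×2 correlation matrix, so the coefficient between the two inputs is the off-diagonal entry. The arrays below are made-up toy values (not taken from the `1C` data), and the names `target` and `encoding` are chosen only for this sketch.

```python
import numpy as np

# Toy values, made up for illustration only (not from the 1C data)
target = np.array([0.0, 1.0, 2.0, 3.0])
encoding = np.array([0.1, 0.9, 2.2, 2.8])

# corrcoef returns a symmetric 2x2 matrix with ones on the diagonal
corr_matrix = np.corrcoef(target, encoding)

# The off-diagonal entry is the Pearson correlation between the two arrays
corr = corr_matrix[0, 1]

print(corr_matrix.shape)
print(round(corr, 4))  # rounded to 4 decimal places, as the task asks
```

Indexing with `[0][1]`, as the notebook does, selects the same entry as `[0, 1]`.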

In [4]:
grader = Grader()

In [5]:
all_data_original=all_data.copy()

# 1. KFold scheme

Explained starting at 0:41 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

**Now it's your turn to write the code!** 

You may use the 'Regularization' video as a reference for all further tasks.

First, implement the KFold scheme with five folds, using `KFold(5)` from `sklearn.model_selection`. 

1. Split your data into 5 folds with `sklearn.model_selection.KFold`, passing the `shuffle=False` argument.
2. Iterate through the folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold with those means.

    *  See **Method 1** from the example implementation. In particular, learn what `map` and `pd.Series.map` do. They are pretty handy in many situations.
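The two steps above can be sketched on a tiny made-up frame. This is only an illustration under assumptions: the frame `toy`, the column `enc` and the choice of 3 folds are hypothetical and chosen so the result is easy to check by hand; the assignment itself asks for 5 folds on `all_data`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame, made up for illustration; `toy` and `enc` are hypothetical names
toy = pd.DataFrame({
    'item_id': [1, 1, 2, 2, 1, 2],
    'target':  [1.0, 3.0, 0.0, 2.0, 5.0, 4.0],
})
toy['enc'] = np.nan

# 3 folds without shuffling: rows [0, 1], [2, 3], [4, 5]
kf = KFold(n_splits=3, shuffle=False)
for tr_ind, val_ind in kf.split(toy):
    # Per-item means are estimated on the training folds only
    means = toy.iloc[tr_ind].groupby('item_id')['target'].mean()
    # The held-out rows receive those means through Series.map
    val_labels = toy.index[val_ind]
    toy.loc[val_labels, 'enc'] = toy.loc[val_labels, 'item_id'].map(means)

# Items unseen in the training folds would be NaN; fill as the tips suggest
toy['enc'] = toy['enc'].fillna(0.3343)
print(toy)
```

Each row's encoding never uses that row's own target, which is the point of the out-of-fold scheme.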

In [5]:
y_tr=all_data.target.values

from sklearn.model_selection import KFold

kf=KFold(n_splits=5, random_state=123, shuffle=True)

all_data['item_target_enc']=np.nan

In [8]:
X_train.groupby('item_id').target.mean()

item_id
0        0.024390
1        0.028436
2        0.012048
3        0.026316
4        0.025000
5        0.028571
6        0.022727
7        0.031250
8        0.025974
9        0.023810
10       0.026316
11       0.000000
12       0.025641
13       0.000000
14       0.027027
15       0.023810
16       0.023256
17       0.023810
18       0.000000
19       0.027027
20       0.025641
21       0.025000
22       0.025641
23       0.026316
24       0.025641
25       0.023810
26       0.021277
27       0.057432
28       0.150515
29       0.032129
           ...   
22140    0.200000
22141    0.261745
22142    0.058824
22143    2.158501
22144    0.286567
22145    0.676166
22146    0.041739
22147    0.081531
22148    0.025641
22149    0.058824
22150    0.101145
22151    1.160000
22152    0.149533
22153    0.025547
22154    0.113636
22155    0.095679
22156    0.027778
22157    0.014286
22158    0.033333
22159    0.161290
22160    0.095823
22161    0.027027
22162    1.520115
22163    0.603960
22

In [9]:


for tr_ind,val_ind in kf.split(y_tr):
#     print(tr_ind)
#     print(val_ind)
    X_train,X_val=all_data.iloc[tr_ind,:],all_data.iloc[val_ind,:]
#     print(all_data.values[tr_ind])
    X_val['item_target_enc']=X_val['item_id'].map(X_train.groupby('item_id').target.mean())
#     all_data.iloc[val_ind,:]=X_val
    
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [10]:
X_val.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.030303
144968,0,28,0,0.0,0.147122
142661,0,29,0,0.0,0.027778
138949,0,34,0,0.0,0.153846
139247,0,35,0,1.0,0.861702


In [16]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.0
141495,0,27,0,0.0,0.060403
144968,0,28,0,0.0,0.119048
142661,0,29,0,0.0,0.072464
138947,0,32,0,6.0,1.213018


In [11]:
all_data.iloc[val_ind,:].head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138949,0,34,0,0.0
139247,0,35,0,1.0


In [37]:
val_ind

array([       0,        2,        3, ..., 10913841, 10913843, 10913849])

In [39]:
all_data.values

array([[  0.00000000e+00,   1.90000000e+01,   0.00000000e+00,
          0.00000000e+00,   2.22222222e-02],
       [  0.00000000e+00,   2.70000000e+01,   0.00000000e+00,
          0.00000000e+00,   5.68335589e-02],
       [  0.00000000e+00,   2.80000000e+01,   0.00000000e+00,
          0.00000000e+00,   1.41176471e-01],
       ..., 
       [  5.90000000e+01,   2.21640000e+04,   3.30000000e+01,
          0.00000000e+00,   1.23558897e+00],
       [  5.90000000e+01,   2.21660000e+04,   3.30000000e+01,
          0.00000000e+00,   2.95918367e-01],
       [  5.90000000e+01,   2.21670000e+04,   3.30000000e+01,
          0.00000000e+00,   1.08108108e+00]])

In [6]:
# YOUR CODE GOES HERE

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)

all_data['item_target_enc'] = np.nan
enc_col = all_data.columns.get_loc('item_target_enc')

for tr_ind, val_ind in kf.split(all_data):
    X_train, X_val = all_data.iloc[tr_ind], all_data.iloc[val_ind]
    # Mean target per item_id, estimated on the training folds only
    fold_means = X_train.groupby('item_id')['target'].mean()
    # Write straight into all_data instead of assigning to X_val, which is a copy
    # of a slice (that is what triggered the SettingWithCopyWarning)
    all_data.iloc[val_ind, enc_col] = X_val['item_id'].map(fold_means).values

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)
encoded_feature = all_data['item_target_enc'].values

# You will need to compute correlation like that
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)


0.41645907128
Current answer for task KFold_scheme is: 0.41645907128


In [8]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


# 2. Leave-one-out scheme

Now, implement leave-one-out scheme. Note that if you just simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time. 

To implement a faster version, note that to calculate the mean target value using all the objects but one *given object*, you can:

1. Calculate sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. 

Note that you do not need to perform `1.` for every object. And `2.` can be implemented without any `for` loop.

It is most convenient to use the `.transform` function, as in **Method 2**.
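A minimal sketch of this trick on a made-up frame, so the numbers can be checked by hand. The names `toy`, `group_sum`, `group_cnt` and `loo_enc` are hypothetical and used only here.

```python
import pandas as pd

# Toy frame, made up for illustration only
toy = pd.DataFrame({
    'item_id': [1, 1, 1, 2, 2],
    'target':  [1.0, 2.0, 6.0, 4.0, 8.0],
})

grp = toy.groupby('item_id')['target']

# Step 1: per-item sum and count, broadcast back to every row by transform
group_sum = grp.transform('sum')
group_cnt = grp.transform('count')

# Step 2: remove the row's own target and divide by n_objects - 1, with no loop
toy['loo_enc'] = (group_sum - toy['target']) / (group_cnt - 1)
print(toy)
```

For item `1` the sum is 9 over 3 rows, so the first row gets `(9 - 1) / 2 = 4.0`. An item that occurs only once would yield `0 / 0`, that is NaN, which is where the `0.3343` fill from the general tips would apply.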

In [7]:
all_data.shape

(10913850, 4)

In [9]:
target_sum = all_data.groupby('item_id')['target'].transform('sum')

In [19]:
target_sum.iloc[143085]

34.0

In [20]:
target_sum.iloc[151200]

34.0

In [17]:
all_data.iloc[144968]

all_data[all_data.item_id == 18800]

Unnamed: 0,shop_id,item_id,date_block_num,target
143085,0,18800,0,2.0
151200,1,18800,0,2.0
118740,2,18800,0,0.0
126855,3,18800,0,0.0
102510,4,18800,0,0.0
110625,6,18800,0,2.0
134970,7,18800,0,0.0
175545,8,18800,0,1.0
183660,10,18800,0,0.0
208005,12,18800,0,0.0


In [None]:
# transform always returns a Series with the same length as the original dataset:
# it groups, computes the aggregate per group, and writes to every index the value of the group that row belongs to

# see above: for a single item_id, the aggregated value is the same at a couple of arbitrary indexes

In [21]:
target_sum = all_data.groupby('item_id')['target'].transform('sum')

n = all_data.groupby('item_id')['target'].transform('count')


all_data['item_target_enc'] = (target_sum - all_data['target']) / (n - 1)

In [29]:
all_data=all_data_original.copy()

In [7]:
target_sum = all_data.groupby('item_id')['target'].transform('sum')

n = all_data.groupby('item_id')['target'].transform('count')

all_data['item_target_enc'] = (target_sum - all_data['target']) / (n - 1)

# for every row we take the sum of targets for its item_id and subtract that row's own target,
# so every row is left out when calculating the mean for its item_id

encoded_feature = all_data['item_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

0.480384831129
Current answer for task Leave-one-out_scheme is: 0.480384831129


# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement the smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).
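As a hedged sketch, the formula is assumed here to be `(mean(target) * nrows + globalmean * alpha) / (nrows + alpha)`, which matches the computation used in the cells of this notebook. The frame `toy` and the column `smooth_enc` are made up for illustration.

```python
import pandas as pd

alpha = 100          # smoothing strength from the task statement
globalmean = 0.3343  # global mean from the task statement

# Toy frame, made up for illustration only
toy = pd.DataFrame({
    'item_id': [1, 1, 1, 1, 2],
    'target':  [2.0, 2.0, 2.0, 2.0, 10.0],
})

grp = toy.groupby('item_id')['target']
mean = grp.transform('mean')    # per-item mean target
nrows = grp.transform('count')  # rows in the item's category, not in the dataset

# Assumed smoothing formula: a weighted blend of the item mean and the global mean
toy['smooth_enc'] = (mean * nrows + globalmean * alpha) / (nrows + alpha)
print(toy)
```

With only 4 and 1 rows against `alpha = 100`, both items end up close to `globalmean`: rare categories are pulled strongly towards the global mean, while frequent ones keep an encoding near their own mean.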

In [32]:
all_data_means=all_data.groupby('item_id')['target'].transform('mean')

all_data_counts=all_data.groupby('item_id')['target'].transform('count')

In [44]:
all_data_means.head()




139255    0.022222
141495    0.056834
144968    0.141176
142661    0.037383
138947    1.319042
Name: target, dtype: float64

In [46]:
all_data_counts.head()

139255      45.0
141495     739.0
144968     595.0
142661     321.0
138947    1586.0
Name: target, dtype: float64

In [47]:
45 *0.022222

0.9999899999999999

In [43]:
all_data_means * all_data_counts

139255         1.0
141495        42.0
144968        84.0
142661        12.0
138947      2092.0
138948       836.0
138949       122.0
139247       222.0
142672        94.0
142065        43.0
139208       109.0
142670         9.0
139207       103.0
138950       176.0
143764        46.0
141505        53.0
139199        61.0
138952       185.0
139176         2.0
138951       236.0
139177        63.0
139178       190.0
139179        40.0
143769        94.0
142671        39.0
144539       156.0
139180       243.0
138953        39.0
144265        12.0
141744        18.0
             ...  
10772600    1209.0
10770510     126.0
10769953     108.0
10769955    1664.0
10768833     184.0
10769961     420.0
10770625     143.0
10769956     695.0
10771598    2807.0
10767854    3447.0
10768086    5273.0
10768087    1192.0
10768088    2089.0
10767847     390.0
10769954     260.0
10767848    1708.0
10767849     310.0
10768089     104.0
10770726     131.0
10772844     187.0
10771530     121.0
10768090    

In [48]:
all_data['item_target_enc']=(all_data_means*all_data_counts +  0.3343 * 100 ) / (all_data_counts + 100)

# when alpha goes to infinity this converges to global mean!

In [8]:
alpha = 100
globalmean = 0.3343

all_data_means=all_data.groupby('item_id')['target'].transform('mean')

all_data_counts=all_data.groupby('item_id')['target'].transform('count')

# mean * nrows is the per-item target sum; alpha pulls small groups towards globalmean
all_data['item_target_enc']=(all_data_means*all_data_counts + globalmean*alpha) / (all_data_counts + alpha)

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(globalmean)
encoded_feature = all_data['item_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

0.48181987971
Current answer for task Smoothing_scheme is: 0.48181987971


# 4. Expanding mean scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas.
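A minimal sketch of the idea on a made-up frame: for each row, only the rows of the same item that come earlier in the current sort order contribute. The names `toy`, `prev_sum`, `prev_cnt` and `exp_enc` are hypothetical.

```python
import pandas as pd

# Toy frame, made up for illustration; row order matters for this scheme
toy = pd.DataFrame({
    'item_id': [1, 2, 1, 1, 2],
    'target':  [1.0, 4.0, 3.0, 5.0, 6.0],
})

# Sum of targets of earlier rows with the same item_id (own target excluded)
prev_sum = toy.groupby('item_id')['target'].cumsum() - toy['target']
# Number of earlier rows with the same item_id (0 for the first occurrence)
prev_cnt = toy.groupby('item_id').cumcount()

# First occurrences give 0 / 0 = NaN, which is filled with the global mean
toy['exp_enc'] = (prev_sum / prev_cnt).fillna(0.3343)
print(toy)
```

Because the result depends on row order, this is one of the schemes for which the fixed sorting of `all_data` matters.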

In [9]:
all_data_cumsum = all_data.groupby('item_id')['target'].cumsum() - all_data['target']


In [10]:
all_data_cumcount = all_data.groupby('item_id').cumcount()


In [11]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.237448
141495,0,27,0,0.0,0.089905
144968,0,28,0,0.0,0.168964
142661,0,29,0,0.0,0.10791
138947,0,32,0,6.0,1.260635


In [12]:
all_data_cumcount.head()
# counts only the rows before the one we are looking at, which is exactly what we need!

139255    0
141495    0
144968    0
142661    0
138947    0
dtype: int64

In [13]:
all_data['item_target_enc'] = all_data_cumsum / all_data_cumcount

In [14]:

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)
encoded_feature = all_data['item_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

0.502524521108
Current answer for task Expanding_mean_scheme is: 0.502524521108
