Version 1.1.0

# Mean encodings

In this programming assignment you will be working with the `1C` dataset from the final competition. You are asked to encode `item_id` in four different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between each resulting encoding and the target variable, to 4 decimal places.

### General tips

* Fill NaNs in the encoding with `0.3343`.
* Some encoding schemes depend on the sorting order of the rows, so to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

In [1]:
import pandas as pd
import numpy as np
from itertools import product
from grader import Grader

# Read data

In [2]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.

In [40]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
#(renaming via a nested dict inside .agg is deprecated, so aggregate first and rename after)
gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':'sum'})

#name the aggregated column `target`
gb = gb.rename(columns={'item_cnt_day':'target'})
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)



In [58]:
len(all_data.item_id.unique())

21807

In [59]:
len(all_data.shop_id.unique())

60

In [63]:
60*21807*34

44486280

In [62]:
len(all_data.date_block_num.unique())

34

In [56]:
all_data.shape

(10913850, 4)
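
The frame has far fewer rows than the full product `60*21807*34` computed above, because the grid is built month by month from only the shops and items that appear in that month. The toy sketch below illustrates this; `toy_sales` and its values are made up for illustration and are not part of the assignment data.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical toy sales log: two months, only a few shop/item pairs sold.
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [1, 2, 2, 1],
    'item_id':        [10, 10, 11, 12],
})

toy_grid = []
for block_num in toy_sales['date_block_num'].unique():
    month = toy_sales[toy_sales['date_block_num'] == block_num]
    shops = month['shop_id'].unique()
    items = month['item_id'].unique()
    # Cartesian product of the shops and items seen in this month only
    toy_grid.append(np.array(list(product(shops, items, [block_num])), dtype='int32'))

toy_grid = pd.DataFrame(np.vstack(toy_grid),
                        columns=['shop_id', 'item_id', 'date_block_num'])

# Size of the naive product over all shops, items and months
full_product = (toy_sales['shop_id'].nunique()
                * toy_sales['item_id'].nunique()
                * toy_sales['date_block_num'].nunique())

print(len(toy_grid), full_product)
```

Month 0 contributes 2 shops x 2 items and month 1 contributes 1 shop x 1 item, so the per-month grid is smaller than the product over all months.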

# Mean encodings without regularization

Now that the technical work is done, we are ready to actually *mean encode* the desired `item_id` variable.

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 

#### Method 1

In [4]:
all_data.target.mean()

0.33427305671234259

In [5]:
# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.483038698862


#### Method 2

In [33]:
'''
   Unlike `.target.mean()`, `transform` returns a Series
   aligned with the index of `all_data`.
   So this single line is equivalent to the first two lines of Method 1.
'''
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.483038698862


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. For each encoding that you implement, you need to **compute this correlation coefficient** and **submit it to Coursera**.

In [36]:
grader = Grader()

# 1. KFold scheme

Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

**Now it's your turn to write the code!** 

You may use 'Regularization' video as a reference for all further tasks.

First, implement KFold scheme with five folds. Use KFold(5) from sklearn.model_selection. 

1. Split your data in 5 folds with `sklearn.model_selection.KFold` with `shuffle=False` argument.
2. Iterate through folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold with those means.

    *  See **Method 1** from the example implementation. In particular, learn what the built-in `map` and `pd.Series.map` functions do. They are handy in many situations.
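
The two steps above can be sketched on a toy frame. This is a minimal illustration under assumptions, not the graded solution: the `toy` data, the variable names and the choice of 3 folds are made up, while `KFold`, `groupby` and `map` are used as described in the task.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

GLOBAL_MEAN = 0.3343  # fill value from the General tips

# Hypothetical toy data; item 'c' occurs once, so its own training folds never see it
toy = pd.DataFrame({
    'item_id': ['a', 'a', 'b', 'b', 'a', 'c'],
    'target':  [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

enc = pd.Series(np.nan, index=toy.index)
for train_idx, val_idx in KFold(n_splits=3, shuffle=False).split(toy):
    # Means are estimated on the other folds only
    fold_means = toy.iloc[train_idx].groupby('item_id')['target'].mean()
    # ... and mapped onto the held-out fold
    enc.iloc[val_idx] = toy['item_id'].iloc[val_idx].map(fold_means).values

# Items unseen in the training folds come out as NaN and get the global mean
toy['item_target_enc'] = enc.fillna(GLOBAL_MEAN)
print(toy)
```

Both `b` rows fall into the same fold, so their training folds contain no `b` at all and they receive the fill value. This is why the NaN tip matters for this scheme.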

In [122]:
from sklearn.model_selection import KFold

kfolder = KFold(n_splits=5,shuffle=False)
kfolder.get_n_splits(all_data.index)

5

In [139]:
all_set=[]
for train_index, test_index in kfolder.split(all_data.index):
    # The train part (4 folds) provides the means; the held-out fold receives them
    X_train, X_test = all_data.iloc[train_index], all_data.iloc[test_index]
    X_test=X_test.copy()
    # Calculate a mapping: {item_id: target_mean} on the train part only
    item_id_target_mean = X_train.groupby('item_id').target.mean()

    # Map the out-of-fold means to the `item_id`'s of the held-out fold
    X_test['item_target_enc'] = X_test['item_id'].map(item_id_target_mean)

    # Fill NaNs (items that never occur in the train part)
    X_test['item_target_enc'] = X_test['item_target_enc'].fillna(0.3343)
    all_set.append(X_test)

# shuffle=False keeps the folds contiguous, so the concatenation preserves the row order of all_data
X_final=pd.concat(all_set)

In [145]:
encoded_feature=X_final.item_target_enc.values

# You will need to compute correlation like that
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)

0.41645907128
Current answer for task KFold_scheme is: 0.41645907128


# 2. Leave-one-out scheme

Now, implement leave-one-out scheme. Note that if you just simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time. 

To implement a faster version, note, that to calculate mean target value using all the objects but one *given object*, you can:

1. Calculate sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. 

Note that you do not need to perform `1.` for every object. And `2.` can be implemented without any `for` loop.

It is most convenient to use the `.transform` function, as in **Method 2**.
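
As a small check of the arithmetic, the sketch below applies the sum-minus-own-target trick to a toy frame. The data and variable names are illustrative assumptions; filling the single-row category with `0.3343` follows the General tips.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; item 'c' has a single row, so its leave-one-out mean is undefined
toy = pd.DataFrame({
    'item_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'target':  [1.0, 0.0, 2.0, 4.0, 2.0, 5.0],
})

grouped = toy.groupby('item_id')['target']
group_sum = grouped.transform('sum')    # step 1: computed once per category
group_cnt = grouped.transform('count')

# Step 2, vectorized: remove each row's own target, divide by n_objects - 1
loo = (group_sum - toy['target']) / (group_cnt - 1)

# 0/0 for the single-row category gives NaN, which is filled with the global mean
toy['item_target_enc'] = loo.fillna(0.3343)
print(toy)
```

For item `a` the group sum is 3, so the first row gets (3 - 1) / 2 = 1.0, without any loop over rows.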

In [187]:
sum_all=all_data.groupby("item_id").target.transform('sum')
count_all=all_data.groupby("item_id").target.transform('count')-1

In [188]:
# YOUR CODE GOES HERE

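# count_all already holds n_objects - 1; an item with a single row would give 0/0 = NaN
all_data['item_target_enc'] = ((sum_all-all_data.target)/count_all).fillna(0.3343)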



encoded_feature=all_data.item_target_enc.values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

0.480384831129
Current answer for task Leave-one-out_scheme is: 0.480384831129


In [176]:
all_data.item_target_enc

139255      0.334273
141495      0.334273
144968      0.334273
142661      0.334273
138947      0.334273
138948      0.334273
138949      0.334273
139247      0.334273
142672      0.334273
142065      0.334273
139208      0.334273
142670      0.334273
139207      0.334273
138950      0.334273
143764      0.334273
141505      0.334273
139199      0.334273
138952      0.334273
139176      0.334273
138951      0.334273
139177      0.334273
139178      0.334273
139179      0.334273
143769      0.334273
142671      0.334273
144539      0.334273
139180      0.334273
138953      0.334273
144265      0.334273
141744      0.334273
              ...   
10772600    0.334273
10770510    0.334273
10769953    0.334273
10769955    0.334273
10768833    0.334273
10769961    0.334273
10770625    0.334273
10769956    0.334273
10771598    0.334273
10767854    0.334273
10768086    0.334273
10768087    0.334273
10768088    0.334273
10767847    0.334273
10769954    0.334273
10767848    0.334273
10767849    0

# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).
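
A vectorized sketch of the smoothing computation on toy data follows. It assumes the formula is $\frac{mean(target) \cdot nrows + globalmean \cdot \alpha}{nrows + \alpha}$; the `toy` frame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

ALPHA = 100
GLOBAL_MEAN = 0.3343

# Hypothetical toy data: a frequent item 'a' and a rare item 'b', both with mean target 1.0
toy = pd.DataFrame({
    'item_id': ['a'] * 300 + ['b'] * 2,
    'target':  [1.0] * 300 + [1.0] * 2,
})

grouped = toy.groupby('item_id')['target']
item_mean = grouped.transform('mean')
nrows = grouped.transform('count')  # objects per category, not rows in the dataset

toy['item_target_enc'] = (item_mean * nrows + GLOBAL_MEAN * ALPHA) / (nrows + ALPHA)
print(toy.groupby('item_id')['item_target_enc'].first())
```

Both items have the same raw mean, but the rare item is pulled much closer to the global mean, which is the regularizing effect of $\alpha$.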

In [157]:
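# YOUR CODE GOES HERE
def smooth(group):
    # rows counts the objects of this item_id, not the rows of the whole dataset
    rows=group.shape[0]
    # (mean(target)*nrows + globalmean*alpha) / (nrows + alpha), with alpha=100
    return ((group.mean()*rows+0.3343*100)/(rows+100))
all_data["item_target_enc"]=all_data.groupby("item_id").target.transform(smooth)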

encoded_feature=all_data.item_target_enc.values
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

0.48181987971
Current answer for task Smoothing_scheme is: 0.48181987971


# 4. Expanding mean scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to write it on your own. You will need the [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas.
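
A minimal sketch of how `cumsum` and `cumcount` combine, on made-up toy data: each row is encoded with the mean target of the earlier rows of the same item, so the result depends on the row order. The frame and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, already in the order the encoding should respect
toy = pd.DataFrame({
    'item_id': ['a', 'b', 'a', 'a', 'b'],
    'target':  [1.0, 0.0, 0.0, 1.0, 2.0],
})

grouped = toy.groupby('item_id')['target']
prev_sum = grouped.cumsum() - toy['target']  # sum of targets strictly before this row
prev_cnt = grouped.cumcount()                # number of earlier rows of the same item

# The first row of every item has no history (0/0 = NaN) and gets the global mean
toy['item_target_enc'] = (prev_sum / prev_cnt).fillna(0.3343)
print(toy)
```

No row ever sees its own target, which is what makes the scheme a form of regularization.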

In [67]:
all_data2=all_data.query("item_id in [19,27]").copy()

In [68]:
all_data2

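| | shop_id | item_id | date_block_num | target |
|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0.0 |
| 141495 | 0 | 27 | 0 | 0.0 |
| 147370 | 1 | 19 | 0 | 0.0 |
| 149610 | 1 | 27 | 0 | 1.0 |
| 114910 | 2 | 19 | 0 | 0.0 |
| 117150 | 2 | 27 | 0 | 1.0 |
| 123025 | 3 | 19 | 0 | 0.0 |
| 125265 | 3 | 27 | 0 | 0.0 |
| 98680 | 4 | 19 | 0 | 0.0 |
| 100920 | 4 | 27 | 0 | 0.0 |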


In [69]:
cumsum = all_data2.groupby("item_id").target.cumsum()-all_data2.target
cumcn=all_data2.groupby("item_id").cumcount()
all_data2["item_target_enc"]=cumsum/cumcn
all_data2['item_target_enc'] = all_data2['item_target_enc'].fillna(0.3343)
all_data2

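| | shop_id | item_id | date_block_num | target | item_target_enc |
|---|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0.0 | 0.334300 |
| 141495 | 0 | 27 | 0 | 0.0 | 0.334300 |
| 147370 | 1 | 19 | 0 | 0.0 | 0.000000 |
| 149610 | 1 | 27 | 0 | 1.0 | 0.000000 |
| 114910 | 2 | 19 | 0 | 0.0 | 0.000000 |
| 117150 | 2 | 27 | 0 | 1.0 | 0.500000 |
| 123025 | 3 | 19 | 0 | 0.0 | 0.000000 |
| 125265 | 3 | 27 | 0 | 0.0 | 0.666667 |
| 98680 | 4 | 19 | 0 | 0.0 | 0.000000 |
| 100920 | 4 | 27 | 0 | 0.0 | 0.500000 |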


139255    0.0
141495    0.0
147370    0.0
149610    0.0
606011    1.0
581507    1.0
Name: target, dtype: float64

In [159]:
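# YOUR CODE GOES HERE
# Same steps as on the two-item subset, applied to the full (sorted) frame
cumsum = all_data.groupby("item_id").target.cumsum()-all_data.target
cumcn = all_data.groupby("item_id").cumcount()
all_data['item_target_enc'] = (cumsum/cumcn).fillna(0.3343)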

encoded_feature=all_data.item_target_enc.values
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

0.502524521108
Current answer for task Expanding_mean_scheme is: 0.502524521108


## Authorization & Submission
To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment's page. Note: the token expires 30 minutes after generation.

In [192]:
STUDENT_EMAIL = "artirj@gmail.com"
STUDENT_TOKEN = "yz6bVTzbthxjs0nA"
grader.status()

You want to submit these numbers:
Task KFold_scheme: 0.41645907128
Task Leave-one-out_scheme: 0.480384831129
Task Smoothing_scheme: 0.48181987971
Task Expanding_mean_scheme: 0.502524521108


In [193]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
