# Converting ``shopper.csv``

For the shopper table there are two main objectives:

* Fill the NaN values observed in its pandas profile.
* Re-tokenize seniority in the table so similar values are adjacent.

Let's check the table again...

In [1]:
import pandas as pd

In [3]:
shopper = pd.read_csv('./data/shoppers.csv')
shopper

Unnamed: 0,shopper_id,seniority,found_rate,picking_speed,accepted_rate,rating
0,1fc20b0bdf697ac13dd6a15cbd2fe60a,41dc7c9e385c4d2b6c1f7836973951bf,0.8606,1.94,1.00,4.87
1,e1c679ac73a69c01981fdd3c5ab8beda,6c90661e6d2c7579f5ce337c3391dbb9,0.8446,1.23,0.92,4.92
2,09d369c66ca86ebeffacb133410c5ee1,6c90661e6d2c7579f5ce337c3391dbb9,0.8559,1.56,1.00,4.88
3,db39866e62b95bb04ebb1e470f2d1347,50e13ee63f086c2fe84229348bc91b5b,,2.41,,
4,8efbc238660053b19f00ca431144fdae,6c90661e6d2c7579f5ce337c3391dbb9,0.8770,1.31,0.92,4.88
...,...,...,...,...,...,...
2859,da24da1311f7913f6d2d29d8238b439c,6c90661e6d2c7579f5ce337c3391dbb9,0.8951,1.53,0.88,4.80
2860,cf95eda5ffc1d4b9586de2ca08ab40f8,50e13ee63f086c2fe84229348bc91b5b,0.8695,3.00,0.56,5.00
2861,e8482e3ad8bc820ec756566a472b84b1,6c90661e6d2c7579f5ce337c3391dbb9,0.9152,1.47,0.88,4.96
2862,a55a3765a02530a97eb9af7aee327486,6c90661e6d2c7579f5ce337c3391dbb9,0.8695,1.20,0.96,4.80


## Addressing the seniority tokenization
Seniority probably has ordered levels, say beginner, medium, semi-senior and senior.
We would like to keep beginner numerically adjacent to medium and far away
from senior, so it's enough to enumerate them from zero to three.

First, let's check if there's an easy way of recognizing them:

In [4]:
shopper['seniority'].value_counts()

6c90661e6d2c7579f5ce337c3391dbb9    1643
50e13ee63f086c2fe84229348bc91b5b     719
41dc7c9e385c4d2b6c1f7836973951bf     440
bb29b8d0d196b5db5a5350e5e3ae2b1f      62
Name: seniority, dtype: int64

The seniority levels are easy to tell apart by frequency! This is good:
otherwise we would need some clustering or cosine similarity metric to order them.
Here it's enough to number them by descending frequency:

In [5]:
seniority_dict = {'6c90661e6d2c7579f5ce337c3391dbb9': 0,
                  '50e13ee63f086c2fe84229348bc91b5b': 1,
                  '41dc7c9e385c4d2b6c1f7836973951bf': 2,
                  'bb29b8d0d196b5db5a5350e5e3ae2b1f': 3}
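
As a hedged alternative to typing the hashes by hand, the same kind of mapping can be derived from the frequency ranking itself. This is only a sketch: it assumes, as the notebook does, that frequency rank is an acceptable proxy for adjacency. `build_rank_mapping` is a hypothetical helper, and the toy tokens stand in for the real hashes.

```python
import pandas as pd

def build_rank_mapping(series):
    """Map each distinct value to its frequency rank (0 = most frequent)."""
    # value_counts() sorts by count in descending order by default
    ordered = series.value_counts().index
    return {token: rank for rank, token in enumerate(ordered)}

# toy stand-in for the hashed seniority column
toy = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
mapping = build_rank_mapping(toy)

# apply the mapping to get the numeric encoding
encoded = toy.map(mapping)
```

On the real table this would be called as `build_rank_mapping(shopper['seniority'])`. Ties in frequency would make the order ambiguous, so the hand-written dictionary remains the safer choice when counts are close.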

First, we know by frequency that the '0' value is either senior or junior shoppers
(juniors are probably more frequent). It doesn't matter which, because we care mostly
about the adjacency making sense. Second, we know there's no missing data on seniority,
so not only will we rewrite it to make more sense, but we can also use seniority
to group shoppers by experience and impute missing values in the other fields
from the group averages.


In [6]:
# helper: fill NaNs in each numeric column with the mean of the row's seniority group
def specific_imputation(df):
    df_copy = df.copy()
    # numeric columns to impute; seniority itself is the grouping key
    num_cols = df_copy.select_dtypes('number').columns.drop('seniority', errors='ignore')
    # transform('mean') returns group means aligned row by row with df_copy
    group_means = df_copy.groupby('seniority')[num_cols].transform('mean')
    df_copy[num_cols] = df_copy[num_cols].fillna(group_means)

    return df_copy


# change seniority to numeric plus fill NaNs
def fix_shoppers(df, seniority_dic):
    # first, create a copy df
    df_copy = df.copy()

    # second, we pass through seniority and apply the dictionary
    df_copy['seniority'] = df_copy['seniority'].apply(lambda x: seniority_dic[x])

    # third, impute NaNs with the global mean of each numeric column
    df_copy.fillna(df_copy.mean(numeric_only=True), inplace=True)
    # alternative: per-seniority means; left disabled, so the output below uses global means
    # df_copy = specific_imputation(df_copy)
    return df_copy
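
Boolean indexing such as `df[df['seniority'] == 0]` returns a new object, so filling it in place leaves the parent frame unchanged. A `groupby(...).transform('mean')` sidesteps that. The sketch below, on made-up toy data rather than the real table, compares global-mean and per-seniority imputation for a single column.

```python
import numpy as np
import pandas as pd

# toy data: two seniority groups with one missing rating each
toy = pd.DataFrame({
    'seniority': [0, 0, 1, 1, 1],
    'rating':    [2.0, np.nan, 5.0, np.nan, 3.0],
})

# global mean: every NaN receives the same value
global_fill = toy['rating'].fillna(toy['rating'].mean())

# group mean: each NaN receives the mean of its own seniority group
group_mean = toy.groupby('seniority')['rating'].transform('mean')
group_fill = toy['rating'].fillna(group_mean)
```

With these toy numbers the two strategies fill different values, which is the reason to prefer the grouped version whenever seniority is informative about the missing field.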

In [7]:
shopper_fixed = fix_shoppers(shopper, seniority_dict)
shopper_fixed


Unnamed: 0,shopper_id,seniority,found_rate,picking_speed,accepted_rate,rating
0,1fc20b0bdf697ac13dd6a15cbd2fe60a,2,0.860600,1.94,1.000000,4.870000
1,e1c679ac73a69c01981fdd3c5ab8beda,0,0.844600,1.23,0.920000,4.920000
2,09d369c66ca86ebeffacb133410c5ee1,0,0.855900,1.56,1.000000,4.880000
3,db39866e62b95bb04ebb1e470f2d1347,1,0.861082,2.41,0.908276,4.848428
4,8efbc238660053b19f00ca431144fdae,0,0.877000,1.31,0.920000,4.880000
...,...,...,...,...,...,...
2859,da24da1311f7913f6d2d29d8238b439c,0,0.895100,1.53,0.880000,4.800000
2860,cf95eda5ffc1d4b9586de2ca08ab40f8,1,0.869500,3.00,0.560000,5.000000
2861,e8482e3ad8bc820ec756566a472b84b1,0,0.915200,1.47,0.880000,4.960000
2862,a55a3765a02530a97eb9af7aee327486,0,0.869500,1.20,0.960000,4.800000


Eventually, we'll join this table with our main order table to build the complete dataset.
For now, we keep processing the other tables.

In [8]:
# save the processed table
shopper_fixed.to_csv('./data/shoppers_processed.csv')
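
The `Unnamed: 0` column seen in the tables above is the name pandas gives a column with an empty header, which is most likely the index written by an earlier `to_csv` call with default settings. A minimal sketch on toy data (the frame and in-memory buffers are illustrative, not the real file) shows the effect and how `index=False` avoids it.

```python
import io
import pandas as pd

toy = pd.DataFrame({'shopper_id': ['x', 'y'], 'seniority': [0, 1]})

# default to_csv writes the index as a first column with an empty header
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out, so the round trip preserves the columns
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
without_index = pd.read_csv(buf_no_index)
```

Saving the processed table as written above adds one more such column on the next read; passing `index=False` to `to_csv`, or `index_col=0` to `read_csv`, are two ways to avoid that.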