# Predict Future Sales (Kaggle playground)

See [Predict Future Sales on Kaggle](https://www.kaggle.com/c/competitive-data-science-predict-future-sales).

After accepting the competition's terms and conditions, I downloaded the data with the Kaggle CLI:
`kaggle competitions download -c competitive-data-science-predict-future-sales`

'Prelude' copied from lesson3-rossmann:

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
import functools
from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)
from IPython.display import HTML

PATH='data/predict-future-sales/'

In [2]:
!ls {PATH}

item_categories.csv  sales_train.csv	       shops.csv
items.csv	     sample_submission.csv.gz  test.csv.gz


## Import data

In [3]:
table_names = ['item_categories', 'items', 'shops', 'sales_train', 'test']

def load_table(table, root_path):
    fname = os.path.join(root_path, f"{table}.csv")
    if os.path.exists(fname):
        compression = None
    else:
        fname = f"{fname}.gz"
        compression = 'gzip'
    
    return pd.read_csv(fname, compression=compression)

tables = [load_table(t, root_path=PATH) for t in table_names]
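
As an aside, `pd.read_csv` defaults to `compression='infer'`, which picks gzip from a `.gz` extension, so the explicit `compression` argument is optional. A minimal sketch under that assumption; `load_table_inferred` is a hypothetical helper, and the toy files in a temporary directory stand in for the competition data:

```python
import os
import tempfile

import pandas as pd


def load_table_inferred(table, root_path):
    """Hypothetical variant of load_table that lets pandas infer compression."""
    fname = os.path.join(root_path, f"{table}.csv")
    if not os.path.exists(fname):
        fname = f"{fname}.gz"
    # compression='infer' is the read_csv default: '.gz' is read as gzip
    return pd.read_csv(fname)


# Toy files standing in for the competition data
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5]})
    toy.to_csv(os.path.join(tmp, "plain.csv"), index=False)
    toy.to_csv(os.path.join(tmp, "zipped.csv.gz"), index=False)  # gzip inferred on write as well
    plain = load_table_inferred("plain", tmp)
    zipped = load_table_inferred("zipped", tmp)
```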

In [4]:
for t in tables: display(t.head())

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


In [5]:
for t in tables:
    display(DataFrameSummary(t).summary())

Unnamed: 0,item_category_name,item_category_id
count,,84
mean,,41.5
std,,24.3926
min,,0
25%,,20.75
50%,,41.5
75%,,62.25
max,,83
counts,84,84
uniques,84,84


Unnamed: 0,item_name,item_id,item_category_id
count,,22170,22170
mean,,11084.5,46.2908
std,,6400.07,15.9415
min,,0,0
25%,,5542.25,37
50%,,11084.5,40
75%,,16626.8,58
max,,22169,83
counts,22170,22170,22170
uniques,22170,22170,84


Unnamed: 0,shop_name,shop_id
count,,60
mean,,29.5
std,,17.4642
min,,0
25%,,14.75
50%,,29.5
75%,,44.25
max,,59
counts,60,60
uniques,60,60


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,,2.93585e+06,2.93585e+06,2.93585e+06,2.93585e+06,2.93585e+06
mean,,14.5699,33.0017,10197.2,890.853,1.24264
std,,9.42299,16.227,6324.3,1729.8,2.61883
min,,0,0,0,-1,-22
25%,,7,22,4476,249,1
50%,,14,31,9343,399,1
75%,,23,47,15684,999,1
max,,33,59,22169,307980,2169
counts,2935849,2935849,2935849,2935849,2935849,2935849
uniques,1034,34,60,21807,19993,198


Unnamed: 0,ID,shop_id,item_id
count,214200,214200,214200
mean,107100,31.6429,11019.4
std,61834.4,17.5619,6252.64
min,0,2,30
25%,53549.8,16,5381.5
50%,107100,34.5,11203
75%,160649,47,16071.5
max,214199,59,22167
counts,214200,214200,214200
uniques,214200,42,5100


In [6]:
item_categories, items_full, shops_full, sales_train_full, test = tables
# !!!!! REDUCE DATA SIZE TO SPEED UP EXPLORATORY PHASE !!!!!
item_thrsh = items_full['item_id'].median()
shop_thrsh = shops_full['shop_id'].median()
items = items_full.drop(items_full.query(f'item_id > {item_thrsh}').index)
shops = shops_full.drop(shops_full.query(f'shop_id > {shop_thrsh}').index)
# drop all data from sales_train_full for which we've just dropped the item/shop info:
sales_train = pd.merge(
    pd.merge(
        sales_train_full, pd.DataFrame({'item_id': items['item_id']}), how='inner', on='item_id'),
    pd.DataFrame({'shop_id': shops['shop_id']}),
    how='inner',
    on='shop_id')
# ^^^^^ REDUCE DATA SIZE TO SPEED UP EXPLORATORY PHASE ^^^^^
sales_train.memory_usage(), sales_train.shape, len(pd.unique(sales_train['item_id'])) * len(pd.unique(sales_train['shop_id'])) * sales_train['date_block_num'].max()

(Index             6245552
 date              6245552
 date_block_num    6245552
 shop_id           6245552
 item_id           6245552
 item_price        6245552
 item_cnt_day      6245552
 dtype: int64, (780694, 6), 8607060)

The goal is to predict next month's sales, so add a `date_block_num` column to the test data holding the month that follows the last month in the training data. (Ideally the value of `test['date_block_num']` would be derived by working out from `sales_train` what `date_block_num == 0` means, but I kept it simple.)

In [7]:
next_month = 1 + sales_train['date_block_num'].max()
test['date_block_num'] = next_month
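
The "ideal" derivation mentioned above could look roughly like the sketch below. It assumes that block 0 is the calendar month of the earliest sale and that dates use the `dd.mm.yyyy` format seen in the `sales_train` sample; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy rows in the same 'dd.mm.yyyy' format as sales_train['date']
toy = pd.DataFrame({"date": ["02.01.2013", "15.06.2014", "31.10.2015"]})
parsed = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# Assumption: block 0 is the month of the earliest sale in the data
first = parsed.min()
toy["date_block_num"] = ((parsed.dt.year - first.year) * 12
                         + (parsed.dt.month - first.month))
```

With such a mapping, the block number for any target month could be computed from its date instead of taking `max() + 1`.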

Since we're interested in predicting a month's sales, aggregate away all time information that is finer grained than a month. The mapping from `date_block_num` to the `Year` and `Month` columns treats block 0 as January 2013, which I established by examining the data in earlier investigations.

In [8]:
def add_Year_Month_cols(df):
    df['Year'] = 2013 + df['date_block_num'] // 12
    df['Month'] = 1 + df['date_block_num'] % 12
    return df

data_monthly_sum = sales_train.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False).agg({
    'item_cnt_day': 'sum',
    'item_price': ['min', 'max']})
# Reduce multi-level column indices to one level:
data_monthly_sum.columns = ['_'.join(x for x in col if x != 'sum').rstrip('_')
                            for col in data_monthly_sum.columns.values]
data_monthly_sum.head(2)

   date_block_num  shop_id  item_id  item_cnt_day  item_price_min  \
0               0        0       32           6.0           221.0   
1               0        0       33           3.0           347.0   

   item_price_max  
0           221.0  
1           347.0  
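
An alternative that avoids flattening the multi-level column index afterwards is pandas' named aggregation. This is only a sketch on made-up rows, assuming pandas 0.25 or later; the notebook itself uses the dict form above.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [0, 0, 0],
    "shop_id": [0, 0, 0],
    "item_id": [32, 32, 33],
    "item_cnt_day": [4.0, 2.0, 3.0],
    "item_price": [221.0, 221.0, 347.0],
})

# Named aggregation: each keyword is an output column,
# given as (input column, aggregation function)
monthly = toy.groupby(["date_block_num", "shop_id", "item_id"], as_index=False).agg(
    item_cnt_day=("item_cnt_day", "sum"),
    item_price_min=("item_price", "min"),
    item_price_max=("item_price", "max"),
)
```

The resulting column names match those produced by the flattening step, so the rest of the pipeline would not need to change.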

Make sure the data is complete even for products that were introduced after data collection started or discontinued partway through the data set: build one row with 0 sales for every `(item_id, shop_id, date_block_num)` combination, then overwrite those rows with the aggregated monthly data wherever sales exist. Every combination then has exactly one row per month.

In [9]:
NaN = float("nan")
def df_crossjoin(df1, df2, **kwargs):
    # Adapted from
    # https://mkonrad.net/2016/04/16/cross-join--cartesian-product-between-pandas-dataframes.html
    # See documentation of Pandas `merge': when several rows contain
    # the same value in the column used for the join, the cartesian
    # product is made.  Add a temporary column with a common value.
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey', **kwargs)
    res.drop('_tmpkey', axis=1, inplace=True)
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res
_its = pd.DataFrame({'item_id': items['item_id']})
_shs = pd.DataFrame({'shop_id': shops['shop_id']})
_dbs = pd.DataFrame({'date_block_num': range(sales_train['date_block_num'].min(), next_month)})
its_shs_dbs = df_crossjoin(_its, df_crossjoin(_shs, _dbs))
its_shs_dbs['item_price_min'] = NaN
its_shs_dbs['item_price_max'] = NaN
its_shs_dbs['item_cnt_day'] = 0
its_shs_dbs = add_Year_Month_cols(its_shs_dbs)
data_monthly_sum.set_index(['item_id', 'shop_id', 'date_block_num'], inplace=True)
its_shs_dbs.set_index(['item_id', 'shop_id', 'date_block_num'], inplace=True)
its_shs_dbs.update(data_monthly_sum)
# Restore item_id, shop_id, date_block_num as columns instead of indices
its_shs_dbs.reset_index(level=['item_id', 'shop_id', 'date_block_num'], inplace=True)
display(its_shs_dbs.head(2))

Unnamed: 0,item_id,shop_id,date_block_num,item_price_min,item_price_max,item_cnt_day,Year,Month
0,0,0,0,,,0.0,2013,1
1,0,0,1,,,0.0,2013,2


In [10]:
its_shs_dbs.count()

item_id           11306700
shop_id           11306700
date_block_num    11306700
item_price_min      400201
item_price_max      400201
item_cnt_day      11306700
Year              11306700
Month             11306700
dtype: int64
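
The grid-completion step can be tried on a toy example. The sketch below assumes pandas 1.2 or later, where `merge(how='cross')` builds the cartesian product directly, so no temporary key column is needed; all frames here are made up.

```python
import pandas as pd

items_toy = pd.DataFrame({"item_id": [0, 1]})
shops_toy = pd.DataFrame({"shop_id": [0, 1]})
months_toy = pd.DataFrame({"date_block_num": range(3)})

# how='cross' (pandas >= 1.2) yields every (item, shop, month) combination
grid = items_toy.merge(shops_toy, how="cross").merge(months_toy, how="cross")
grid["item_cnt_day"] = 0.0

# Observed monthly sales: only one combination actually has data
observed = pd.DataFrame({"item_id": [1], "shop_id": [0], "date_block_num": [2],
                         "item_cnt_day": [5.0]})

keys = ["item_id", "shop_id", "date_block_num"]
grid = grid.set_index(keys)
grid.update(observed.set_index(keys))  # overwrite the zeros where sales exist
grid = grid.reset_index()
```

The 2 x 2 x 3 grid has 12 rows, of which exactly one carries a non-zero count, mirroring the sparse fill seen in the `count()` output above.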

Copied from lesson3-rossmann:

`join_df` is a function for joining tables on specific fields. By default it does a left outer join of `right` onto `left` (i.e. an inner join plus the rows of the left table that match nothing in the right table), using the given fields for each table.

Pandas does joins using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a `"_y"` to those on the right.


In [11]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    """@param left: Dataframe
    @param right: Dataframe
    @param left_on: column name in left table
    @param right_on: (default: left_on) column name in right table
    @param suffix: (default: "_y") appended to duplicate column names from the right table"""
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

def denormalize(t):
    """Denormalize table by adding shop names, item names & item categories
    
    This function is specific to the data model of this Kaggle competition."""
    t = join_df(t, shops, 'shop_id')
    t = join_df(t, items, 'item_id')
    t = join_df(t, item_categories, 'item_category_id')
    return t
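
To see what the left join and the suffix convention do in practice, here is a small sketch on made-up frames. It repeats the `merge` call that `join_df` makes instead of importing the function.

```python
import pandas as pd

left = pd.DataFrame({"shop_id": [0, 1, 2], "name": ["a", "b", "c"]})
right = pd.DataFrame({"shop_id": [0, 1], "name": ["x", "y"]})

# Same call that join_df makes: left join, suffix only on right-hand duplicates
joined = left.merge(right, how="left", left_on="shop_id", right_on="shop_id",
                    suffixes=("", "_y"))
```

All three left rows survive; shop 2 has no match on the right, so its `name_y` is NaN.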

### Aggregate training data

First, enrich the sales data with all other tables we have, so that we can e.g. group by categories instead of items later.

In [12]:
sales_train = denormalize(its_shs_dbs)
test = denormalize(test)
sales_train.head(3)

   item_id  shop_id  date_block_num  item_price_min  item_price_max  \
0        0        0               0             NaN             NaN   
1        0        0               1             NaN             NaN   
2        0        0               2             NaN             NaN   

   item_cnt_day  Year  Month                      shop_name  \
0           0.0  2013      1  !Якутск Орджоникидзе, 56 фран   
1           0.0  2013      2  !Якутск Орджоникидзе, 56 фран   
2           0.0  2013      3  !Якутск Орджоникидзе, 56 фран   

                                   item_name  item_category_id  \
0  ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D                40   
1  ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D                40   
2  ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D                40   

  item_category_name  
0         Кино - DVD  
1         Кино - DVD  
2         Кино - DVD  

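Introduce new columns for the revenue brought in by a product. Only the monthly minimum and maximum prices were kept, so this gives a lower and an upper bound: `gross_min = item_cnt_day * item_price_min` and `gross_max = item_cnt_day * item_price_max`: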

In [13]:
sales_train['gross_min'] = sales_train['item_price_min'].fillna(0) * sales_train['item_cnt_day']
sales_train['gross_max'] = sales_train['item_price_max'].fillna(0) * sales_train['item_cnt_day']

Looking at the data, it seemed that at least one shop was open on any given day, but some shops were not open every day.  Record per shop and month how many items (all lumped together) were sold and what revenue they generated.

In [14]:
sales_and_monthly_revenue = sales_train.groupby(['date_block_num', 'shop_id'], as_index=False).agg({
    x: 'sum' for x in ('item_cnt_day', 'gross_min', 'gross_max')})
sales_and_monthly_revenue.rename(inplace=True, columns={
    'item_cnt_day': 'all_shop_items_sold',
    'gross_min': 'all_shop_gross_min',
    'gross_max': 'all_shop_gross_max'
});
sales_and_monthly_revenue.head(5)

   date_block_num  shop_id  all_shop_items_sold  all_shop_gross_min  \
0               0        0               2866.0        1.925282e+06   
1               0        1               1543.0        1.024454e+06   
2               0        2                690.0        7.183558e+05   
3               0        3                477.0        4.102358e+05   
4               0        4               1285.0        9.710970e+05   

   all_shop_gross_max  
0          1957675.00  
1          1040763.00  
2           739713.92  
3           411439.30  
4          1012548.74  

How many items of each sort were sold in all stores for each time period and what revenue did each item generate?

In [15]:
items_globally = sales_train.groupby(['date_block_num', 'item_id'], as_index=False).agg({
    x: 'sum' for x in ('item_cnt_day', 'gross_min', 'gross_max')})
items_globally.rename(inplace=True, columns={
    'item_cnt_day': 'global_sold',
    'gross_min': 'global_gross_min',
    'gross_max': 'global_gross_max'
});
items_globally.head(5)

   date_block_num  item_id  global_sold  global_gross_min  global_gross_max
0               0        0          0.0               0.0               0.0
1               0        1          0.0               0.0               0.0
2               0        2          0.0               0.0               0.0
3               0        3          0.0               0.0               0.0
4               0        4          0.0               0.0               0.0

How much (count and revenue) was sold per category for each time period (globally & per store)?

In [16]:
cats_globally = sales_train.groupby(['date_block_num', 'item_category_id'], as_index=False).agg({
    x: 'sum' for x in ('item_cnt_day', 'gross_min', 'gross_max')})
cats_globally.rename(inplace=True, columns={
    'item_cnt_day': 'global_cat_sold',
    'gross_min': 'global_cat_gross_min',
    'gross_max': 'global_cat_gross_max'
});
cats_per_shop = sales_train.groupby(['date_block_num', 'shop_id', 'item_category_id'], as_index=False).agg({
    x: 'sum' for x in ('item_cnt_day', 'gross_min', 'gross_max')})
cats_per_shop.rename(inplace=True, columns={
    'item_cnt_day': 'shop_cat_sold',
    'gross_min': 'shop_cat_gross_min',
    'gross_max': 'shop_cat_gross_max'
});
cats_globally.head(5)

   date_block_num  item_category_id  global_cat_sold  global_cat_gross_min  \
0               0                 0              0.0                  0.00   
1               0                 1              0.0                  0.00   
2               0                 2            693.0            1269502.17   
3               0                 3              1.0               2490.00   
4               0                 4             63.0              13921.50   

   global_cat_gross_max  
0                  0.00  
1                  0.00  
2            1303480.02  
3               2490.00  
4              13921.50  

Join all the aggregated data together; the goal is a table with unique `(date_block_num, shop_id, item_id)` rows.  Start from `sales_train`, then join the other aggregates onto it.

In [17]:
data = functools.reduce(
    lambda src, extra: join_df(src, *extra),
    [[sales_and_monthly_revenue, ('date_block_num', 'shop_id')],
     [items_globally, ('date_block_num', 'item_id')],
     [cats_globally, ('date_block_num', 'item_category_id')],
     [cats_per_shop, ('date_block_num', 'shop_id', 'item_category_id')]],
    sales_train)
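
The `functools.reduce` pattern used here can be checked on a toy example: each step left-joins one more aggregate table onto the accumulated frame. The frames and column names below are made up, and `merge` is called directly instead of `join_df`.

```python
import functools

import pandas as pd

base = pd.DataFrame({"date_block_num": [0, 0], "shop_id": [0, 1], "item_id": [7, 7]})
per_shop = pd.DataFrame({"date_block_num": [0, 0], "shop_id": [0, 1],
                         "shop_total": [10.0, 20.0]})
per_item = pd.DataFrame({"date_block_num": [0], "item_id": [7],
                         "item_total": [30.0]})

# Each (table, key columns) pair is left-joined onto the accumulator in turn
wide = functools.reduce(
    lambda acc, extra: acc.merge(extra[0], how="left", on=list(extra[1])),
    [(per_shop, ("date_block_num", "shop_id")),
     (per_item, ("date_block_num", "item_id"))],
    base)
```

The row count stays the same while the frame grows wider, which is the property the join above relies on.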

In [18]:
data.columns

Index(['item_id', 'shop_id', 'date_block_num', 'item_price_min',
       'item_price_max', 'item_cnt_day', 'Year', 'Month', 'shop_name',
       'item_name', 'item_category_id', 'item_category_name', 'gross_min',
       'gross_max', 'all_shop_items_sold', 'all_shop_gross_min',
       'all_shop_gross_max', 'global_sold', 'global_gross_min',
       'global_gross_max', 'global_cat_sold', 'global_cat_gross_min',
       'global_cat_gross_max', 'shop_cat_sold', 'shop_cat_gross_min',
       'shop_cat_gross_max'],
      dtype='object')

Now, this data is a heavily denormalized representation of the training data: it has been enriched with, e.g., the aggregate sales of items in the same category, globally or in the same shop, *in the same month*.  Unfortunately, these same-month aggregates are not available at prediction time (they would have to be predicted as well, then aggregated, which is too complex at this stage).  So we will time-shift the data and join it onto the training and test data, so that the model gets information about the past (moving-average style).

Which features to add is a somewhat arbitrary decision: let's use the previous `past_months` months of data (not too many, to avoid losing too many rows, i.e. those for which no past data is known) and the median and mean of the same calendar month over the whole data range (this expresses the assumption that there is a yearly pattern).

In [19]:
index_cols = ['item_id', 'shop_id', 'date_block_num']
aggregation_cols = [
    'item_cnt_day', 'gross_min', 'gross_max', 'all_shop_items_sold',
    'all_shop_gross_min', 'all_shop_gross_max', 'global_sold', 'global_gross_min', 'global_gross_max',
    'global_cat_sold', 'global_cat_gross_min', 'global_cat_gross_max',
    'shop_cat_sold', 'shop_cat_gross_min', 'shop_cat_gross_max']
dependent_cols = ['item_cnt_day']
# Incorporate yearly pattern (median & mean):
data_monthly_median = data.groupby(['Month', 'shop_id', 'item_id'], as_index=False).agg({
    k: 'median' for k in aggregation_cols})
data_monthly_median.rename(inplace=True, columns={k: "{}_median".format(k) for k in aggregation_cols});
data_monthly_mean = data.groupby(['Month', 'shop_id', 'item_id'], as_index=False).agg({
    k: 'mean' for k in aggregation_cols})
data_monthly_mean.rename(inplace=True, columns={k: "{}_mean".format(k) for k in aggregation_cols});

Continue the aggregation started above:

1. All data except `index_cols` & `item_cnt_day` (the dependent variable) in `with_ma_data` should come from the past
2. Rows for which there is no past data should be dropped (e.g. we can't use the 3rd month's sales data as dependent variable if we want to use 3 months or more historical data to predict it)
3. Join data (shifted in time) from `agg_data` to `with_ma_data`, renaming columns to indicate time shift.  This needs to be a function so it can be applied to the test set.

In [None]:
def augment_with_past_data(all_data, past_months=4):
    global aggregation_cols, index_cols, dependent_cols
    # Start with only the data that is available in the test set as well (i.e. doesn't need time shifting)
    with_ma_data = all_data[index_cols + dependent_cols].copy()
    # Drop rows for which no historical data would be available to predict the dependent variable
    with_ma_data = with_ma_data.loc[with_ma_data['date_block_num'] >= past_months]
    # agg_data will be shifted into the past 1 month at a time & joined onto with_ma_data
    agg_data = all_data.drop(columns=[col for col in all_data.columns if not (
        col in index_cols or
        col in aggregation_cols)],
                         inplace=False)
    # IPython.core.debugger.set_trace()
    for m in range(past_months):
        # Shift agg_data one month into future to line up the past data with the present:
        # ... drop data that isn't going to be used
        agg_data.drop(agg_data[agg_data['date_block_num'] >= next_month - 1].index,
                      inplace=True)
        # ... update 'date_block_num' column to make it line up with the rows in with_ma_data we want to join with
        agg_data['date_block_num'] += 1
        with_ma_data = with_ma_data.merge(
            agg_data,
            how='inner',
            left_on=index_cols, right_on=index_cols,
            suffixes=("", "_{}".format(m + 1)))
    # IPython.core.debugger.set_trace()
    # the first merge operation (when m==0) had no column name clashes except
    # for the dependent variable, hence the suffix '_1' is missing on some
    # columns, add it:
    with_ma_data.rename(inplace=True, columns={
        k: "{}_1".format(k) for k in aggregation_cols if k not in dependent_cols})
    return with_ma_data
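
The time-shifting idea is easier to follow on a toy frame. The sketch below is a simplified illustration with made-up data, not the function above: it shifts `date_block_num` forward by each lag and inner-joins the result, so that every row carries the values from the preceding months.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_id": [1] * 4,
    "shop_id": [0] * 4,
    "date_block_num": [0, 1, 2, 3],
    "item_cnt_day": [5.0, 6.0, 7.0, 8.0],
})
keys = ["item_id", "shop_id", "date_block_num"]

lagged = toy.copy()
for lag in (1, 2):
    past = toy.copy()
    past["date_block_num"] += lag  # month t-lag now lines up with month t
    past = past.rename(columns={"item_cnt_day": f"item_cnt_day_{lag}"})
    # The inner join drops months that have no data `lag` months earlier
    lagged = lagged.merge(past, how="inner", on=keys)
```

Only blocks 2 and 3 remain, because blocks 0 and 1 lack two months of history; this is the row loss that keeps `past_months` small.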

In [15]:
cat_vars = ['shop_id', 'item_id', 'item_name', 'item_category_id', 'item_category_name', 'shop_name', 'Year', 'Month']
cont_vars = ['item_price',
       'gross', 'open_days', 'all_shop_items_sold', 'all_shop_gross',
       'global_sold', 'global_gross', 'global_cat_sold', 'global_cat_gross',
       'cat_shop_sold', 'cat_shop_gross']

In [16]:
len(data.columns) - len(cat_vars) - len(cont_vars) - 2 # 2: dependent variable & time

0

Reorder the data and cast it to the types expected by PyTorch: float32 for continuous variables, and an explicit categorical dtype for categorical variables (copied from lesson3-rossmann).

In [25]:
dep = dependent_cols[0]

for v in cat_vars: 
    data[v] = data[v].astype('category').cat.as_ordered()

for v in cont_vars:
    data[v] = data[v].astype('float32')

data = data[cat_vars+cont_vars+[dep, 'date_block_num']].copy()
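
What the two casts do can be seen on a made-up frame. This is a sketch only; the values are arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({"shop_id": [31, 38, 31], "item_price": [399.0, 299.0, 399.0]})
toy["shop_id"] = toy["shop_id"].astype("category").cat.as_ordered()
toy["item_price"] = toy["item_price"].astype("float32")

# Each category is backed by an integer code that indexes the sorted categories
codes = toy["shop_id"].cat.codes.tolist()
```

Here the categories are `[31, 38]`, so the codes are `[0, 1, 0]`.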

Todo:

  1. Embedding for item name & item category name
  2. Skip shop name?
  3. Drop 'gross' column: it is too correlated with 'item_cnt_day' and I don't have it in the test set
  4. Need to apply same enrichment to test data (i.e. add item categories)
  5. This is a time series, yet only data from the same month is used, which I don't even have for the test data!  Incorporate results of previous months
  6. Define fitness function, clip output to [0, 20]
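
Todo item 6 could be sketched as follows. `clipped_rmse` is a hypothetical helper, not part of the notebook or of fastai, and it clips only the predictions; whether the targets should be clipped as well is left open here.

```python
import numpy as np


def clipped_rmse(y_pred, targ, lo=0.0, hi=20.0):
    """Hypothetical metric helper: RMSE after clipping predictions to [lo, hi]."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lo, hi)
    targ = np.asarray(targ, dtype=float)
    return float(np.sqrt(np.mean((targ - y_pred) ** 2)))


# Predictions of -1 and 25 are clipped to 0 and 20 before scoring
score = clipped_rmse([-1.0, 25.0, 3.0], [0.0, 20.0, 1.0])
```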

Select a subset of the data to speed up exploration:

In [27]:
n = data.shape[0]

In [29]:
idxs = get_cv_idxs(n, val_pct=150000/n) 
joined_samp = data.iloc[idxs].set_index("date_block_num") 
samp_size = len(joined_samp); samp_size

150000

In [31]:
joined_samp.head(2)

date_block_num,shop_id,item_id,item_name,item_category_id,item_category_name,shop_name,Year,Month,item_price,gross,open_days,all_shop_items_sold,all_shop_gross,global_sold,global_gross,global_cat_sold,global_cat_gross,cat_shop_sold,cat_shop_gross,item_cnt_day
17,38,2196,COLDPLAY Ghost Stories,55,Музыка - CD локального производства,"Омск ТЦ ""Мега""",2014,6,299.0,598.0,30.0,1467.0,1533191.0,145.0,43012.449219,10769.0,3077388.25,126.0,35179.0,2.0
19,31,9964,ВОЗДУШНЫЙ МАРШАЛ,40,Кино - DVD,"Москва ТЦ ""Семеновский""",2014,8,399.0,2793.0,31.0,8248.0,5763062.5,93.0,36641.898438,15821.0,4307263.5,1841.0,474910.09375,7.0


In [36]:
df, y, nas, mapper = proc_df(joined_samp, dep, do_scale=True, skip_flds=['gross'])
# Caution: y contains zeros (counted in the next cell), and np.log(0) is -inf
yl = np.log(y)

In [37]:
sum(y == 0.0)

256
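
Those zeros matter for the log transform. One common workaround, which the notebook does not use, is `np.log1p` with `np.expm1` as its inverse. A minimal sketch on made-up values; note that it still breaks down for counts of -1 or below, which returns could produce:

```python
import numpy as np

y_toy = np.array([0.0, 1.0, 7.0])

with np.errstate(divide="ignore"):
    plain_log = np.log(y_toy)      # log(0) evaluates to -inf

shifted_log = np.log1p(y_toy)      # log(1 + y) is finite at y == 0
restored = np.expm1(shifted_log)   # inverse of log1p
```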

In [38]:
df.head(2)

date_block_num,shop_id,item_id,item_name,item_category_id,item_category_name,shop_name,Year,Month,item_price,open_days,all_shop_items_sold,all_shop_gross,global_sold,global_gross,global_cat_sold,global_cat_gross,cat_shop_sold,cat_shop_gross
17,39,2152,2152,56,56,39,2,6,-0.324435,-0.054396,-0.721803,-0.610916,0.375586,-0.044211,0.168194,-0.316104,-0.401182,-0.542709
19,32,9776,9776,41,41,32,2,8,-0.258757,0.47273,1.852293,1.225618,0.149892,-0.05819,0.749037,-0.052203,3.097769,1.584466


In [39]:
val_idx = np.flatnonzero(df.index == max(df.index))

In [40]:
val_idx.shape

(2956,)
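
The validation split above takes the rows of the most recent month, which presumably is meant to resemble the forecasting task. How `np.flatnonzero` picks them can be seen on a made-up frame indexed by `date_block_num`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4]},
                   index=pd.Index([17, 19, 19, 18], name="date_block_num"))

# Positions (not labels) of the rows belonging to the latest month
val_idx_toy = np.flatnonzero(toy.index == toy.index.max())
```

The result holds integer positions, so it can be used with `iloc` regardless of the index labels.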

In [41]:
n

1609124

In [17]:
test.head(5)

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


The [competition page](https://www.kaggle.com/c/competitive-data-science-predict-future-sales#evaluation) specifies root mean squared error (RMSE) as the evaluation metric:

In [1]:
import math
import numpy as np

def inv_y(a): return np.exp(a)

def exp_rmse(y_pred, targ):
    # Undo the log transform on both arguments, then compute the plain RMSE
    targ = inv_y(targ)
    return math.sqrt(((targ - inv_y(y_pred))**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)