## Inference Pipeline

Now we have to predict future sales. This process runs every business day, in the middle of the night, and predicts how much each reseller will buy on their next purchase.

We use the maximum date in the dataset plus one day as the prediction date because our extract is not kept up to date. In production we can use the current system date instead and trust that the transactional system holds all the relevant sales history.

Note that computing the features now requires only the previous 30 days of sales.
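As a rough sketch of the date logic described above, the snippet below derives the prediction date and the 30-day feature window. The helper `feature_window` and its `today` parameter are hypothetical names introduced only for illustration; they are not part of the notebook.

```python
import pandas as pd

def feature_window(df, days=30, today=None):
    """Return (prediction_date, window_start, window_end).

    `today` is a hypothetical override: pass the system date in production,
    or leave it as None to fall back to the dataset's max date + 1 day.
    """
    if today is None:
        # Offline case: the last day we have data for is the dataset's max date.
        window_end = pd.to_datetime(df['date']).max()
    else:
        # Production case: assume history is complete up to yesterday.
        window_end = pd.Timestamp(today).normalize() - pd.Timedelta(days=1)
    prediction_date = window_end + pd.Timedelta(days=1)
    window_start = window_end - pd.Timedelta(days=days)
    return prediction_date, window_start, window_end

# Toy data standing in for the real sales extract.
toy = pd.DataFrame({'date': ['2019-05-19', '2019-05-21'], 'bill': [10.0, 20.0]})
prediction_date, window_start, window_end = feature_window(toy)
prod_date, prod_start, prod_end = feature_window(toy, today='2019-05-22 03:00')
```

Both calls yield the same window here, which is the intent: switching to the system clock should only change where the reference date comes from, not how the window is built.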


In [1]:
import sagemaker
import boto3
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import os 
import time
from sagemaker.predictor import csv_serializer,RealTimePredictor
import datetime
import pickle
import awswrangler

In [2]:
%store -r df

In [3]:
%store -r df_r

In [4]:
df['date'] = pd.to_datetime(df['date'])

In [5]:
max_date = df['date'].max()

In [6]:
min_date = max_date - pd.Timedelta(days=30)

In [7]:
df = df[(df['date'] > min_date)]

For each reseller, we fill every day that has no sales with an amount of 0.

In [8]:
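def completeItem(dfItem,max_date,min_date):
    # Build the full daily range for the window, then reindex so that
    # days without sales appear as rows with bill 0.0
    r = pd.date_range(start=min_date, end=max_date)
    dfItemNew = dfItem.set_index('date').reindex(r).fillna(0.0).rename_axis('date').reset_index()
    # fillna also zeroed id_reseller on the new rows, so restore it from the original group
    dfItemNew['id_reseller'] = dfItem['id_reseller'].max()
    return dfItemNew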


In [9]:
dfCompletedList = []
for nid,item in df.groupby('id_reseller'):
    dfCompletedList.append(completeItem(item,max_date,min_date))
dfCompleted = pd.concat(dfCompletedList).copy()

In [10]:
df = dfCompleted

In [11]:
del dfCompleted

In [12]:
df.head(10)

| | date | id_reseller | bill |
|---|---|---|---|
| 0 | 2019-04-21 | 499921276 | 0.0 |
| 1 | 2019-04-22 | 499921276 | 7940.451 |
| 2 | 2019-04-23 | 499921276 | 0.0 |
| 3 | 2019-04-24 | 499921276 | 0.0 |
| 4 | 2019-04-25 | 499921276 | 0.0 |
| 5 | 2019-04-26 | 499921276 | 9206.969 |
| 6 | 2019-04-27 | 499921276 | 0.0 |
| 7 | 2019-04-28 | 499921276 | 0.0 |
| 8 | 2019-04-29 | 499921276 | 7559.732 |
| 9 | 2019-04-30 | 499921276 | 0.0 |
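The per-reseller loop above can also be expressed as a single reindex over every (reseller, day) pair. This is only a sketch of an alternative, not the notebook's method, and it assumes at most one row per reseller per day; the toy `sales` frame and the names `full_index` and `completed` are made up for illustration.

```python
import pandas as pd

# Toy sales data: reseller 1 bought on two days, reseller 2 on one day.
sales = pd.DataFrame({
    'date': pd.to_datetime(['2019-05-01', '2019-05-03', '2019-05-02']),
    'id_reseller': [1, 1, 2],
    'bill': [100.0, 50.0, 70.0],
})

# Every reseller crossed with every day in the window.
days = pd.date_range('2019-05-01', '2019-05-03')
full_index = pd.MultiIndex.from_product(
    [sales['id_reseller'].unique(), days], names=['id_reseller', 'date']
)

# Reindexing adds the missing (reseller, day) pairs with bill 0.0.
completed = (
    sales.set_index(['id_reseller', 'date'])['bill']
         .reindex(full_index, fill_value=0.0)
         .reset_index()
)
```

Because only the `bill` column is reindexed, `id_reseller` never gets overwritten by the fill value and does not need to be restored afterwards.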


### Features for each reseller

In [13]:
def complete_info(group):
    # Features describe the day we are predicting: the day after max_date.
    # Timestamp.weekday_name was removed in pandas 1.0; day_name() returns the same strings.
    prediction_date = max_date + pd.Timedelta(days=1)
    weekday = prediction_date.day_name()
    # Zero-filled days are turned into NaN so they do not drag the statistics down
    mean_last_30 = group['bill'].replace(0,np.nan).mean()
    std_last_30 = group['bill'].replace(0,np.nan).std()
    date_last_bill = group[group['bill'] != 0]['date'].max()
    days_without_purchase = (prediction_date - date_last_bill).days

    mean_last_7 = group[(group['date'] >= max_date - pd.Timedelta(days=6))]['bill'].replace(0,np.nan).mean()
    # Most recent non-zero bill in the window
    last_bill = group[group['bill'] > 0].sort_values('date',ascending=False).head(1)['bill'].values[0]
    return {'weekday':weekday,'mean-last-30':mean_last_30,
           'std-last-30':std_last_30,'mean-last-7':mean_last_7,'last_bill':last_bill, 
           'id_reseller':int(group['id_reseller'].max()), 'days_without_purchase':days_without_purchase}

In [14]:
features = []
for index,group in df.groupby('id_reseller'):
    features.append(complete_info(group))


In [15]:
df_features = pd.DataFrame(features)

In [16]:
df_features.shape

(1197, 7)
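To make the feature definitions concrete, here is a small worked example on a single made-up reseller. The values in `group` are invented for illustration; the point is that zero-filled days are excluded from the mean and that `days_without_purchase` is counted up to the prediction date.

```python
import numpy as np
import pandas as pd

# Five days of zero-filled history for one hypothetical reseller.
group = pd.DataFrame({
    'date': pd.date_range('2019-05-17', '2019-05-21'),
    'bill': [100.0, 0.0, 300.0, 0.0, 0.0],
})
max_date = group['date'].max()

# Zero-filled days become NaN, and mean() skips NaN values.
purchases = group['bill'].replace(0, np.nan)
mean_bill = purchases.mean()          # average over purchase days only
naive_mean = group['bill'].mean()     # average over all days, zeros included

# Days between the last real purchase and the day being predicted.
last_purchase = group.loc[group['bill'] != 0, 'date'].max()
days_without_purchase = (max_date + pd.Timedelta(days=1) - last_purchase).days
```

Here `mean_bill` is 200.0 while `naive_mean` is 80.0, and `days_without_purchase` is 3 because the last purchase was on 2019-05-19 and the prediction date is 2019-05-22.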

### Merge with reseller info and compute dummy variables

In [17]:
df_features.head()

| | weekday | mean-last-30 | std-last-30 | mean-last-7 | last_bill | id_reseller | days_without_purchase |
|---|---|---|---|---|---|---|---|
| 0 | Wednesday | 8634.657444 | 2449.592207 | 8835.6455 | 6863.729 | 499921276 | 2 |
| 1 | Wednesday | 8953.074875 | 5013.080449 | 6932.085 | 7728.092 | 499921342 | 1 |
| 2 | Wednesday | 22855.883769 | 43823.031063 | 6963.73875 | 718.29 | 499921344 | 1 |
| 3 | Wednesday | 21024.691 | 5522.351798 | 27126.844 | 27126.844 | 499921352 | 5 |
| 4 | Wednesday | 2601.613375 | 701.502616 | 2173.063 | 1642.154 | 499921458 | 1 |


In [18]:
df_features = df_features.merge(df_r,how='inner',on='id_reseller')

In [19]:
df_features.shape

(1197, 9)

In [20]:
df_features['zone'] = df_features['zone'].apply(lambda x: x if x in [1019,1050,1031,1033,1051,1067] else 0)
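The cell above keeps a fixed set of zone ids and maps every other zone to 0. A vectorized equivalent, shown here on a made-up Series, could look like the following; `KNOWN_ZONES` is a name introduced for this sketch.

```python
import pandas as pd

# Zone ids taken from the cell above; everything else is bucketed as 0.
KNOWN_ZONES = [1019, 1050, 1031, 1033, 1051, 1067]

zones = pd.Series([1019, 9999, 1067, 1234])
# where() keeps values where the condition is True and substitutes 0 elsewhere.
bucketed = zones.where(zones.isin(KNOWN_ZONES), 0)
```

Keeping the list in one named constant may also make it easier to keep the inference and training notebooks in sync.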

In [21]:
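# Load the encoders fitted during training; the context manager closes the file afterwards
with open("preprocessing.pkl","rb") as pickle_in:
    pipe_list = pickle.load(pickle_in)
# pipe_list order: [le_cluster,ohe_cluster,le_zone,ohe_zone,le_weekday,ohe_weekday]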

In [22]:
df_cluster = pd.DataFrame(
    pipe_list[1].transform(pipe_list[0].transform(df_features['cluster']).reshape(-1, 1)).todense()
)
df_cluster = df_cluster.add_prefix('cluster_')

In [23]:
df_zone = pd.DataFrame(
    pipe_list[3].transform(pipe_list[2].transform(df_features['zone']).reshape(-1, 1)).todense()
)
df_zone = df_zone.add_prefix('zone_')

In [24]:
df_weekday = pd.DataFrame(
    pipe_list[5].transform(pipe_list[4].transform(df_features['weekday']).reshape(-1, 1)).todense()
)
df_weekday = df_weekday.add_prefix('weekday_')

In [25]:
df_to_predict = pd.concat([df_features,df_cluster,df_zone,df_weekday],axis=1)
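One thing worth keeping in mind with `pd.concat(..., axis=1)` is that rows are matched by index label, not by position. In this notebook the merge produces a fresh default index, so the frames should line up; the sketch below, on made-up frames, shows what happens otherwise and one way to guard against it.

```python
import pandas as pd

# A frame whose index is not the default 0..n-1 (for example, after filtering).
left = pd.DataFrame({'id_reseller': [10, 20, 30]}, index=[5, 6, 7])
# Encoder output wrapped in a DataFrame gets the default index 0..2.
dummies = pd.DataFrame({'zone_0': [1.0, 0.0, 1.0]})

# Index labels do not overlap, so the result has 6 rows padded with NaN.
misaligned = pd.concat([left, dummies], axis=1)

# Resetting the index first makes the concat positional in effect.
aligned = pd.concat([left.reset_index(drop=True), dummies], axis=1)
```

A quick check that the concatenated frame has the same number of rows as `df_features` would catch this kind of misalignment early.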

### Re-order features
Now we have to make sure that the features are in the same order used for training and that there are no extra columns.

You will need the same columns, in the same order, as displayed in notebook PROD1.

In [26]:
df_to_predict.columns

Index(['weekday', 'mean-last-30', 'std-last-30', 'mean-last-7', 'last_bill',
       'id_reseller', 'days_without_purchase', 'zone', 'cluster', 'cluster_0',
       'cluster_1', 'cluster_2', 'cluster_3', 'cluster_4', 'zone_0', 'zone_1',
       'zone_2', 'zone_3', 'zone_4', 'zone_5', 'zone_6', 'weekday_0',
       'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5',
       'weekday_6'],
      dtype='object')

In [27]:
%store -r pred_columns

In [28]:
df_to_predict_feats = df_to_predict[pred_columns]
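Since the model depends on the exact training column order, it may be worth failing loudly when a column is missing. The helper `select_features` below is a hypothetical addition, not part of the notebook; it selects the columns in the given order and drops everything else.

```python
import pandas as pd

def select_features(frame, columns):
    """Return `frame` restricted to `columns`, in that exact order."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return frame[list(columns)]

# Toy frame with columns in a different order plus one extra column.
toy = pd.DataFrame({'b': [1], 'a': [2], 'extra': [3]})
ordered = select_features(toy, ['a', 'b'])
```

In the notebook this would be called with `df_to_predict` and `pred_columns`.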

In [29]:
df_to_predict_feats.to_csv('to_predict.csv',header=False,index=False)

In [30]:
df_to_predict[['id_reseller']].to_csv('id_reseller_to_predict.csv',header=False,index=False)