## Create training and testing sample DataFrames

This code creates the training and testing DataFrames to which model features can later be appended. For each item purchased in the target period, the final DataFrames contain the customers who purchased the item together with a randomly sampled group of customers who did not, giving a balanced model DataFrame on which to run the classification algorithms.

In [1]:
import shutil
import pandas as pd
import numpy as np
import pickle
import boto3
import os
import sys
from sagemaker import get_execution_role
from sklearn.model_selection import train_test_split

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from Grocery_Recommender.Create_Target_Dataframe.create_train_test_dataframes import *

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

### Import the customer transaction file

In [2]:
role = get_execution_role()
region = boto3.Session().region_name
bucket = "udacity-machine-learning-capstone-data"
key = "udacity_capstone_data/all_trans.pkl"

In [3]:
s3 = boto3.resource("s3")
all_cust_trans = pickle.loads(s3.Bucket(bucket).Object(key).get()["Body"].read())
all_cust_trans.head()

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
0,200607,20060415,7,19,1,0.93,PRD0900033,CL00201,DEP00067,G00021,D00005,CUST0000410727,UM,OT,994100100398294,L,MM,Full Shop,Mixed,STORE00001,LS,E02
1,200607,20060413,5,20,1,1.03,PRD0900097,CL00001,DEP00001,G00001,D00001,CUST0000634693,LA,YF,994100100532898,L,LA,Top Up,Fresh,STORE00001,LS,E02
2,200607,20060416,1,14,1,0.98,PRD0900121,CL00063,DEP00019,G00007,D00002,,,,994100100135562,L,MM,Top Up,Grocery,STORE00001,LS,E02
3,200607,20060415,7,19,1,3.07,PRD0900135,CL00201,DEP00067,G00021,D00005,CUST0000410727,UM,OT,994100100398294,L,MM,Full Shop,Mixed,STORE00001,LS,E02
4,200607,20060415,7,19,1,4.81,PRD0900220,CL00051,DEP00013,G00005,D00002,CUST0000410727,UM,OT,994100100398294,L,MM,Full Shop,Mixed,STORE00001,LS,E02


### Set up the target DataFrame

The target DataFrame holds all items purchased by customers who shopped during the target window. For model training, the following time periods are used:

**Observation period**  200716 to 200815  
**Target period** 200816

In [4]:
target_custs = get_targets(all_cust_trans, 200816)
target_custs.head()

Unnamed: 0,CUST_CODE,PROD_CODE,TARGET
2,CUST0000307323,PRD0900939,1
4,CUST0000307323,PRD0901465,1
8,CUST0000634693,PRD0903074,1
9,CUST0000634693,PRD0903399,1
11,CUST0000307323,PRD0903542,1
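
`get_targets` is imported from the project module and its source is not shown in this notebook. As a rough illustration only, a function of this kind might keep the distinct customer/item pairs seen in the target week and flag them with TARGET = 1. The name `get_targets_sketch` and the toy data below are hypothetical, and the real helper may differ.

```python
import pandas as pd


def get_targets_sketch(trans: pd.DataFrame, target_week: int) -> pd.DataFrame:
    """Hypothetical stand-in for get_targets (assumption, not the project code).

    Returns one row per (customer, item) pair purchased in the target week.
    """
    in_week = trans[trans["SHOP_WEEK"] == target_week]
    targets = (
        in_week.dropna(subset=["CUST_CODE"])  # ignore transactions with no customer code
        .loc[:, ["CUST_CODE", "PROD_CODE"]]
        .drop_duplicates()  # repeat purchases count once
        .copy()
    )
    targets["TARGET"] = 1
    return targets


# Toy transactions: one row outside the target week, one duplicate, one anonymous
toy_trans = pd.DataFrame(
    {
        "SHOP_WEEK": [200815, 200816, 200816, 200816, 200816],
        "CUST_CODE": ["C1", "C1", "C1", "C2", None],
        "PROD_CODE": ["P1", "P1", "P1", "P2", "P2"],
    }
)
toy_targets = get_targets_sketch(toy_trans, 200816)
print(toy_targets)
```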


### Get the active customers

Only customers who shopped in the last 8 weeks of the observation period (200808 to 200815) are considered active. Only active customers are used in the modelling.

In [5]:
active_custs = get_active_custs(all_cust_trans, 200808, 200815)
active_custs.head()

Unnamed: 0,CUST_CODE
1,CUST0000659646
8,CUST0000634693
19,CUST0000425522
41,CUST0000715467
42,CUST0000089820
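
The source of `get_active_custs` is also not shown. A minimal sketch, assuming it simply returns the distinct customer codes with at least one transaction between the two week codes (inclusive), could look like this. `get_active_custs_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def get_active_custs_sketch(
    trans: pd.DataFrame, start_week: int, end_week: int
) -> pd.DataFrame:
    """Hypothetical stand-in for get_active_custs (assumption, not the project code)."""
    # Series.between is inclusive on both ends by default
    in_window = trans["SHOP_WEEK"].between(start_week, end_week)
    return trans.loc[in_window, ["CUST_CODE"]].dropna().drop_duplicates()


toy_trans = pd.DataFrame(
    {
        "SHOP_WEEK": [200807, 200808, 200812, 200815, 200816],
        "CUST_CODE": ["A", "B", "B", "C", "D"],
    }
)
toy_active = get_active_custs_sketch(toy_trans, 200808, 200815)
print(toy_active)
```

Because the week codes are `YYYYWW` integers, a plain numeric range comparison preserves chronological order.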


In [6]:
print(
    "There are {} active customers that will be modelled".format(
        active_custs.count()[0]
    )
)

There are 3519 active customers that will be modelled


### Split active customers into train and test

In [7]:
active_custs_train, active_custs_test = train_test_split(
    active_custs, test_size=0.30, random_state=42
)

In [8]:
print(
    "There are {} customers in the training set".format(active_custs_train.count()[0])
)

print(
    "There are {} customers in the test set".format(active_custs_test.count()[0])
)

There are 2463 customers in the training set
There are 1056 customers in the test set


### Create DataFrame including non-target customers

Create the DataFrame that includes all non-target customers (active customers who did not purchase the item in the target period).

In [9]:
active_custs_train_target = add_non_target(active_custs_train, target_custs)
active_custs_test_target = add_non_target(active_custs_test, target_custs)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [10]:
active_custs_train_target["TARGET"].value_counts()

0.0    10502459
1.0       14551
Name: TARGET, dtype: int64
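
The row total above (10,502,459 + 14,551 = 10,517,010) equals 2,463 training customers times 4,270, which is consistent with `add_non_target` pairing every customer in the split with every item in the target DataFrame and marking the pairs that were not purchased as TARGET = 0. The module source is not shown here, so the sketch below is only one way such a function might look. `add_non_target_sketch` and the toy data are hypothetical, and `merge(how="cross")` needs pandas 1.2 or later.

```python
import pandas as pd


def add_non_target_sketch(custs: pd.DataFrame, targets: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for add_non_target (assumption, not the project code)."""
    items = targets[["PROD_CODE"]].drop_duplicates()
    # Every customer in the split paired with every target item
    grid = custs[["CUST_CODE"]].drop_duplicates().merge(items, how="cross")
    # Purchased pairs pick up TARGET = 1; the rest are NaN after the left merge
    out = grid.merge(targets, on=["CUST_CODE", "PROD_CODE"], how="left")
    out["TARGET"] = out["TARGET"].fillna(0.0)
    return out


toy_custs = pd.DataFrame({"CUST_CODE": ["C1", "C2", "C3"]})
toy_targets = pd.DataFrame(
    {
        "CUST_CODE": ["C1", "C2", "C9"],  # C9 is not in this split
        "PROD_CODE": ["P1", "P2", "P1"],
        "TARGET": [1, 1, 1],
    }
)
toy_full = add_non_target_sketch(toy_custs, toy_targets)
print(toy_full["TARGET"].value_counts())
```

The left merge introduces missing values before they are filled, which turns TARGET into a float column; that would explain the `0.0` and `1.0` labels in the output above.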

### Create balanced sample

The DataFrame is now very large and heavily imbalanced, with far more TARGET = 0 rows than TARGET = 1 rows. Create a balanced DataFrame for the classification by downsampling the TARGET = 0 rows so that each PROD_CODE keeps an equal number of TARGET = 0 and TARGET = 1 rows.

In [11]:
# Get the training DataFrame
train_df = active_custs_train_target.groupby("PROD_CODE").apply(sample_non_target)
train_df["TARGET"].value_counts()

1.0    14551
0.0    14551
Name: TARGET, dtype: int64
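
`sample_non_target` comes from the project module, so its exact logic is not visible here. A minimal sketch of per-item downsampling, assuming each item has at least as many non-purchasers as purchasers, is shown below. `sample_non_target_sketch`, the fixed seed, and the toy data are hypothetical. The sketch uses an explicit loop over the groups rather than `groupby(...).apply(...)`, which keeps a flat index in the result.

```python
import pandas as pd


def sample_non_target_sketch(group: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Hypothetical stand-in for sample_non_target (assumption, not the project code).

    Keeps all TARGET = 1 rows and an equally sized random sample of TARGET = 0 rows.
    """
    pos = group[group["TARGET"] == 1]
    neg = group[group["TARGET"] == 0]
    n = min(len(pos), len(neg))  # cannot sample more negatives than exist
    return pd.concat([neg.sample(n=n, random_state=seed), pos])


toy = pd.DataFrame(
    {
        "CUST_CODE": ["C1", "C2", "C3", "C4", "C5"] * 2,
        "PROD_CODE": ["P1"] * 5 + ["P2"] * 5,
        "TARGET": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    }
)

# Balance each item separately, then stack the results
toy_balanced = pd.concat(
    [sample_non_target_sketch(g) for _, g in toy.groupby("PROD_CODE")]
)
print(toy_balanced["TARGET"].value_counts())
```

Balancing within each PROD_CODE, rather than across the whole DataFrame, keeps every item represented by both classes.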

In [12]:
train_df.head()

PROD_CODE,,CUST_CODE,PROD_CODE,TARGET
PRD0900001,1235174,CUST0000203043,PRD0900001,0.0
PRD0900001,7187554,CUST0000240308,PRD0900001,0.0
PRD0900001,10458374,CUST0000285663,PRD0900001,0.0
PRD0900001,9045004,CUST0000620533,PRD0900001,1.0
PRD0900001,10133854,CUST0000728571,PRD0900001,1.0


In [13]:
# Get the test DataFrame
test_df = active_custs_test_target.groupby("PROD_CODE").apply(sample_non_target)
test_df["TARGET"].value_counts()

1.0    6462
0.0    6462
Name: TARGET, dtype: int64

### Upload training and test DataFrames to S3

In [14]:
train_df.to_csv("train_df.csv", index=False)
key = "train_df.csv"  # filepath in s3
s3.Bucket(bucket).Object(key).upload_file("train_df.csv")

In [15]:
test_df.to_csv("test_df.csv", index=False)
key = "test_df.csv"  # filepath in s3
s3.Bucket(bucket).Object(key).upload_file("test_df.csv")
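
The balanced DataFrames carry a two-level index from the `groupby(...).apply(...)` step, with PROD_CODE appearing both as an index level and as a column. Writing with `index=False` drops the index without losing any information. The round trip below illustrates this on a hand-built toy frame; the `toy` data and the in-memory buffer are assumptions for illustration, and no S3 call is made.

```python
import io

import pandas as pd

# Toy frame shaped like train_df: PROD_CODE is both an index level and a column
toy = pd.DataFrame(
    {
        "CUST_CODE": ["C1", "C2"],
        "PROD_CODE": ["P1", "P1"],
        "TARGET": [0.0, 1.0],
    },
    index=pd.MultiIndex.from_tuples([("P1", 10), ("P1", 11)], names=["PROD_CODE", None]),
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same option as used for train_df and test_df
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))
```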