# CAPSTONE PROJECT - STARBUCKS
# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer, "buy 10 dollars, get 2 off", on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
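The check described above can be sketched on a toy event log. This is only an illustration under stated assumptions: `time` is taken to be in hours and the validity period in days, the `duration_days` mapping and all values are made up, and repeated sends of the same offer to the same person are ignored.

```python
import pandas as pd

# Toy event log for two customers and one offer (times in hours, assumed).
events = pd.DataFrame({
    'person': ['a', 'a', 'b', 'b', 'b'],
    'offer':  ['x', 'x', 'x', 'x', 'x'],
    'event':  ['offer received', 'offer completed',
               'offer received', 'offer viewed', 'offer completed'],
    'time':   [0, 48, 0, 24, 72],
})
duration_days = {'x': 10}  # hypothetical validity period per offer

# Earliest time of each event type per (person, offer) pair.
summary = events.pivot_table(index=['person', 'offer'], columns='event',
                             values='time', aggfunc='min').reset_index()

# A completion counts as influenced only if the offer was viewed
# before it was completed, and completed before it expired.
deadline = summary['offer received'] + summary['offer'].map(duration_days) * 24
summary['influenced'] = (
    summary['offer viewed'].notna()
    & (summary['offer viewed'] <= summary['offer completed'])
    & (summary['offer completed'] <= deadline)
)
print(summary[['person', 'offer', 'influenced']])
```

Customer `a` completes the offer without ever viewing it, so the completion is not counted as influenced; customer `b` views it first, so it is.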

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).

# About this Notebook
The code in this notebook was developed on Amazon SageMaker notebook instances and uses several AWS services, such as SageMaker endpoints and models and S3 buckets. It will not run outside that environment.

## Load Data

In [1]:
import pandas as pd
import numpy as np
import math
import json
import seaborn as sns
import boto3
import sagemaker
import os
from sklearn.model_selection import train_test_split

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)



## AWS Services initialization

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Take a look at the datasets

### 1. The Portfolio dataset describes the several offers made to customers

In [3]:
portfolio.head()

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5


### 2. The Profile dataset describes every customer

In [4]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


### 3. The Transcript dataset describes every transaction and offer action made by the several customers 

In [5]:
transcript.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{u'offer id': u'9b98b8c7a33c4b65b9aebfe6a799e6...
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{u'offer id': u'0b1e1539f2cc45b7b9fa7c272da2e1...
2,offer received,e2127556f4f64592b11af22de27a7932,0,{u'offer id': u'2906b810c7d4411798c6938adc9daa...
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{u'offer id': u'fafdcd668e3743c1bb461111dcafc2...
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{u'offer id': u'4d5c57ea9a6940dd891ad53e9dbe8d...


# 1. Pre Processing
___
## 1.1 Data Cleaning

___
Remove invalid customers by removing all rows with NaN values in the *income* feature
___

In [6]:
profile = profile.dropna(subset=['income'])

In [7]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
5,68,20180426,M,e2127556f4f64592b11af22de27a7932,70000.0
8,65,20180209,M,389bc3fa690240e798340f5a15918d5c,53000.0
12,58,20171111,M,2eeac8d8feae4a8cad5a6af0499a211d,51000.0
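In the sample rows shown earlier, every profile with a missing income also has age 118 and no gender, which suggests that 118 is a placeholder value. A minimal sketch of how this could be checked before dropping the rows, using toy rows that mirror that sample (the frame `toy_profile` is illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy profile rows mirroring the sample output (ids omitted).
toy_profile = pd.DataFrame({
    'age': [118, 55, 118, 75],
    'gender': [None, 'F', None, 'F'],
    'income': [np.nan, 112000.0, np.nan, 100000.0],
})

missing_income = toy_profile['income'].isna()
placeholder_age = toy_profile['age'] == 118

# True when the rows without income are exactly the rows with age 118,
# and none of them has a gender either.
same_rows = bool((missing_income == placeholder_age).all()
                 and toy_profile.loc[missing_income, 'gender'].isna().all())

cleaned = toy_profile.dropna(subset=['income'])
print(same_rows, len(cleaned))
```

If the same check holds on the full profile table, dropping rows with missing income also removes the placeholder ages.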


## 1.2 Feature Engineering

___
One Hot Encode the *channels* feature in the portfolio dataset
___

In [8]:
# join each list of channels into a comma-separated string, e.g. "web,email,mobile"
portfolio['channels'] = portfolio['channels'].apply(lambda x: ','.join(x))
portfolio.head()

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"email,mobile,social",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"web,email,mobile,social",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"web,email,mobile",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"web,email,mobile",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"web,email",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5


In [9]:
dumdum = portfolio.channels.str.split(r'\s*,\s*', expand=True).stack().str.get_dummies().groupby(level=0).sum()

In [10]:
portfolio[['mobile','social','web']] = dumdum.reset_index()[['mobile','social','web']]

In [11]:
portfolio = portfolio.drop(columns = ['channels'])
portfolio.head()

Unnamed: 0,difficulty,duration,id,offer_type,reward,mobile,social,web
0,10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10,1,1,0
1,10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,1,1,1
2,0,4,3f207df678b143eea3cee63160fa8bed,informational,0,1,0,1
3,5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,1,0,1
4,20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5,0,0,1
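The same encoding can be obtained without the intermediate string conversion. The sketch below is an alternative, not the notebook's method; it assumes a pandas version that provides `Series.explode` (0.25 or later) and uses a toy frame with made-up rows.

```python
import pandas as pd

toy_portfolio = pd.DataFrame({
    'channels': [['email', 'mobile', 'social'],
                 ['web', 'email', 'mobile', 'social'],
                 ['web', 'email']],
})

# explode() yields one row per (offer, channel); the original row index is repeated.
exploded = toy_portfolio['channels'].explode()

# One indicator column per channel, collapsed back to one row per offer.
channel_dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(channel_dummies)
```

Unlike the notebook, this keeps the `email` column; it can be dropped afterwards if it carries no information.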


___
Transform the *became_member_on* feature into the approximate number of days of membership, counted from the start of the year in which the first member joined Starbucks
___
In this case, the first member joined in 2013
___

In [12]:
profile.became_member_on.min() // 10000

2013

In [13]:
first_member = int(profile.became_member_on.min() / 10000) * 10000

In [14]:
profile.became_member_on = profile.became_member_on - first_member

In [15]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
1,55,40715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
3,75,40509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
5,68,50426,M,e2127556f4f64592b11af22de27a7932,70000.0
8,65,50209,M,389bc3fa690240e798340f5a15918d5c,53000.0
12,58,41111,M,2eeac8d8feae4a8cad5a6af0499a211d,51000.0


In [16]:
profile['days_by_year'] = (profile['became_member_on'] /(10000)).astype(int)*365 
profile.became_member_on = profile['became_member_on'].apply(lambda x: x - (int(x/10000) * 10000) if x >= 10000 else x) 
profile['days_by_month'] = (profile['became_member_on'] /(100)).astype(int)*30 
profile.became_member_on = profile['became_member_on'].apply(lambda x: x - (int(x/100) * 100) if x >= 100 else x) 
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income,days_by_year,days_by_month
1,55,15,F,0610b486422d4921ae7d2bf64640c50b,112000.0,1460,210
3,75,9,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0,1460,150
5,68,26,M,e2127556f4f64592b11af22de27a7932,70000.0,1825,120
8,65,9,M,389bc3fa690240e798340f5a15918d5c,53000.0,1825,60
12,58,11,M,2eeac8d8feae4a8cad5a6af0499a211d,51000.0,1460,330


In [17]:
profile['days'] = profile['became_member_on'] + profile['days_by_year'] + profile['days_by_month']
profile = profile.drop(columns = ['became_member_on','days_by_year','days_by_month'])

In [18]:
profile.head()

Unnamed: 0,age,gender,id,income,days
1,55,F,0610b486422d4921ae7d2bf64640c50b,112000.0,1685
3,75,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0,1619
5,68,M,e2127556f4f64592b11af22de27a7932,70000.0,1971
8,65,M,389bc3fa690240e798340f5a15918d5c,53000.0,1894
12,58,M,2eeac8d8feae4a8cad5a6af0499a211d,51000.0,1801
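The `days` feature above treats every year as 365 days and every month as 30 days, so it is an approximation. As a hedged alternative, the sketch below parses the dates and counts exact days from 1 January 2013; the frame `toy` and the column names `days_exact` and `days_approx` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'became_member_on': [20170715, 20170509, 20180426]})

# Exact day count from a reference date (1 January of the earliest joining year).
joined = pd.to_datetime(toy['became_member_on'].astype(str), format='%Y%m%d')
reference = pd.Timestamp(year=2013, month=1, day=1)
toy['days_exact'] = (joined - reference).dt.days

# Approximation used in the notebook: 365 days per year, 30 days per month.
offset = toy['became_member_on'] - 2013 * 10000
toy['days_approx'] = ((offset // 10000) * 365
                      + ((offset % 10000) // 100) * 30
                      + offset % 100)
print(toy)
```

For the first row the approximation gives 1685, matching the notebook output, while the exact count is 1656. Both preserve the ordering of members in most cases, so the choice mainly matters if the absolute number of days is interpreted.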


### Transcript

___
Get the information out of the *value* feature
___

In [19]:
transcript.event.unique()

array([u'offer received', u'offer viewed', u'transaction',
       u'offer completed'], dtype=object)

In [20]:
transcript[transcript.event == 'offer completed'].head()

Unnamed: 0,event,person,time,value
12658,offer completed,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,{u'offer_id': u'2906b810c7d4411798c6938adc9daa...
12672,offer completed,fe97aa22dd3e48c8b143116a8403dd52,0,{u'offer_id': u'fafdcd668e3743c1bb461111dcafc2...
12679,offer completed,629fc02d56414d91bca360decdfa9288,0,{u'offer_id': u'9b98b8c7a33c4b65b9aebfe6a799e6...
12692,offer completed,676506bad68e4161b9bbaffeb039626b,0,{u'offer_id': u'ae264e3637204a6fb9bb56bc8210dd...
12697,offer completed,8f7dd3b2afe14c078eb4f6e6fe4ba97d,0,{u'offer_id': u'4d5c57ea9a6940dd891ad53e9dbe8d...


In [21]:
transactions = transcript[transcript.event == 'transaction']

In [22]:
transactions.head()

Unnamed: 0,event,person,time,value
12654,transaction,02c083884c7d45b39cc68e1314fec56c,0,{u'amount': 0.83}
12657,transaction,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,{u'amount': 34.56}
12659,transaction,54890f68699049c2a04d415abc25e717,0,{u'amount': 13.23}
12670,transaction,b2f1cd155b864803ad8334cdf13c4bd2,0,{u'amount': 19.51}
12671,transaction,fe97aa22dd3e48c8b143116a8403dd52,0,{u'amount': 18.97}


In [23]:
offer_events =  transcript[~(transcript.event == 'transaction')]

In [24]:
offer_events.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{u'offer id': u'9b98b8c7a33c4b65b9aebfe6a799e6...
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{u'offer id': u'0b1e1539f2cc45b7b9fa7c272da2e1...
2,offer received,e2127556f4f64592b11af22de27a7932,0,{u'offer id': u'2906b810c7d4411798c6938adc9daa...
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{u'offer id': u'fafdcd668e3743c1bb461111dcafc2...
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{u'offer id': u'4d5c57ea9a6940dd891ad53e9dbe8d...


___
Get the offer id from transactions that correspond to interactions with offers ( **offer received, offer viewed, offer completed**) from the *value* feature
___

In [25]:
offer_events = offer_events.copy()  # work on a copy to avoid SettingWithCopyWarning
offer_events['offer'] = [d.get('offer id') if d.get('offer id') is not None else (d.get('offer_id')) for d in offer_events.value]


In [26]:
offer_events = offer_events.drop(columns = ['value'])
offer_events.head()

Unnamed: 0,event,person,time,offer
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,offer received,e2127556f4f64592b11af22de27a7932,0,2906b810c7d4411798c6938adc9daaa5
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,fafdcd668e3743c1bb461111dcafc2a4
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,4d5c57ea9a6940dd891ad53e9dbe8da0


___
Get the value from simple transactions (**transaction** event)
___

In [27]:
transactions = transactions.copy()  # work on a copy to avoid SettingWithCopyWarning
transactions['value'] = [d.get('amount') for d in transactions.value]


In [28]:
transactions.head()

Unnamed: 0,event,person,time,value
12654,transaction,02c083884c7d45b39cc68e1314fec56c,0,0.83
12657,transaction,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56
12659,transaction,54890f68699049c2a04d415abc25e717,0,13.23
12670,transaction,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51
12671,transaction,fe97aa22dd3e48c8b143116a8403dd52,0,18.97
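Both extractions above read keys out of the dictionaries in *value*. A possible single-pass alternative is sketched below on a toy transcript (ids and amounts are made up): it expands the dictionaries into columns and merges the two spellings of the offer key, `offer id` and `offer_id`.

```python
import pandas as pd

toy_transcript = pd.DataFrame({
    'event': ['offer received', 'transaction', 'offer completed'],
    'person': ['p1', 'p1', 'p1'],
    'time': [0, 6, 6],
    'value': [{'offer id': 'o1'}, {'amount': 12.5}, {'offer_id': 'o1'}],
})

# One column per key found in the dictionaries; missing keys become NaN.
unpacked = pd.DataFrame(toy_transcript['value'].tolist(),
                        index=toy_transcript.index)

# The offer id is stored under two spellings, so merge them into one column.
unpacked['offer'] = unpacked['offer id'].fillna(unpacked['offer_id'])

flat = pd.concat([toy_transcript.drop(columns='value'),
                  unpacked[['offer', 'amount']]], axis=1)
print(flat)
```

Transactions end up with a missing `offer` and offer events with a missing `amount`, so the two subsets can still be separated by filtering on `event`.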


___
Compute the customers lifetime value by calculating the monetary, frequency and recency features for each of them
___
- monetary: total amount spent
- frequency: frequency of transactions
- recency: how recent was the last transaction
___

In [29]:
monetary_frequency = transactions.groupby('person').value.agg(['sum','count']).rename(columns = {'sum': 'monetary','count': 'frequency'})
monetary_frequency.head()

person,monetary,frequency
0009655768c64bdeb2e877511632db8f,127.6,8
00116118485d4dfda04fdbaba9a87b5c,4.09,3
0011e0d4e6b944f998e987f904e8c1e5,79.46,5
0020c2b971eb4e9188eac86d93036a77,196.86,8
0020ccbbb6d84e358d3414a3ff76cffd,154.05,12


In [30]:
time_max = transactions.time.max()
recency = transactions.groupby('person').time.max().reset_index().rename(columns = {'time': 'recency'})
recency['recency'] = time_max - recency['recency']
recency.head()

Unnamed: 0,person,recency
0,0009655768c64bdeb2e877511632db8f,18
1,00116118485d4dfda04fdbaba9a87b5c,240
2,0011e0d4e6b944f998e987f904e8c1e5,60
3,0020c2b971eb4e9188eac86d93036a77,6
4,0020ccbbb6d84e358d3414a3ff76cffd,42


In [31]:
rfm = transactions.merge(monetary_frequency, right_on='person',left_on = 'person')
rfm = rfm.merge(recency,right_on='person',left_on = 'person')
rfm = rfm[['person','recency','frequency','monetary']].drop_duplicates()
rfm = rfm.merge(profile, right_on='id',left_on='person')
rfm.head()

Unnamed: 0,person,recency,frequency,monetary,age,gender,id,income,days
0,02c083884c7d45b39cc68e1314fec56c,294,10,29.89,20,F,02c083884c7d45b39cc68e1314fec56c,30000.0,1316
1,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,174,12,320.48,42,M,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,96000.0,1142
2,54890f68699049c2a04d415abc25e717,30,7,118.85,36,M,54890f68699049c2a04d415abc25e717,56000.0,1848
3,b2f1cd155b864803ad8334cdf13c4bd2,144,8,195.35,55,F,b2f1cd155b864803ad8334cdf13c4bd2,94000.0,1776
4,fe97aa22dd3e48c8b143116a8403dd52,6,11,562.77,39,F,fe97aa22dd3e48c8b143116a8403dd52,67000.0,1837
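The three RFM features computed above can also be derived in one grouped aggregation, which avoids the merge with the full transaction table and the later `drop_duplicates`. This is a sketch on toy transactions, assuming a pandas version that supports named aggregation (0.25 or later); `toy_tx` and `rfm_toy` are illustrative names.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    'person': ['a', 'a', 'b', 'b', 'b'],
    'time':   [0, 30, 6, 12, 60],
    'value':  [5.0, 7.5, 1.0, 2.0, 3.0],
})

time_max = toy_tx['time'].max()
rfm_toy = toy_tx.groupby('person').agg(
    monetary=('value', 'sum'),      # total amount spent
    frequency=('value', 'count'),   # number of transactions
    last_time=('time', 'max'),      # time of the most recent transaction
)

# Recency: time between a customer's last purchase and the last purchase overall.
rfm_toy['recency'] = time_max - rfm_toy.pop('last_time')
print(rfm_toy)
```

A recency of 0 therefore marks the customer who made the most recent purchase in the data, and larger values mark customers who have been inactive for longer.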


___
Create a matrix *person x offer* , which states if the offer was **received, viewed and completed**
___

In [32]:
offer_events.head()

Unnamed: 0,event,person,time,offer
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,offer received,e2127556f4f64592b11af22de27a7932,0,2906b810c7d4411798c6938adc9daaa5
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,fafdcd668e3743c1bb461111dcafc2a4
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,4d5c57ea9a6940dd891ad53e9dbe8da0


In [33]:
offer_events = offer_events[['person','event','offer']].drop_duplicates()
offer_events['val'] = int(1)

In [34]:
offer_events['person_offer'] = offer_events['person'] + ':' + offer_events['offer']

In [35]:
offer_acceptance = offer_events.pivot(index = 'person_offer',columns = 'event',values = 'val').reset_index().fillna(0)

In [36]:
offer_acceptance['person'] = offer_acceptance.apply(lambda x : x['person_offer'].split(':')[0],axis=1)
offer_acceptance['offer'] = offer_acceptance.apply(lambda x : x['person_offer'].split(':')[1],axis=1)
offer_acceptance = offer_acceptance.drop(columns = ['person_offer'])
offer_acceptance.head()

event,offer completed,offer received,offer viewed,person,offer
0,1.0,1.0,0.0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5
1,0.0,1.0,1.0,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed
2,0.0,1.0,1.0,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837
3,1.0,1.0,1.0,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d
4,1.0,1.0,1.0,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4
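The matrix above is built by concatenating *person* and *offer* into one key and splitting it again afterwards. A possible alternative, sketched here on toy events, is to cross-tabulate on both columns directly; `toy_events` and `flags` are illustrative names.

```python
import pandas as pd

toy_events = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b', 'b'],
    'offer':  ['x', 'x', 'y', 'x', 'x'],
    'event':  ['offer received', 'offer completed', 'offer received',
               'offer received', 'offer viewed'],
})

# Count events per (person, offer) pair, then clip the counts to 0/1 flags.
flags = (pd.crosstab([toy_events['person'], toy_events['offer']],
                     toy_events['event'])
           .clip(upper=1)
           .reset_index())
flags.columns.name = None
print(flags)
```

Clipping plays the role of the `drop_duplicates` call in the notebook: an offer sent twice to the same person still yields a single row with flags of 0 or 1.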


## 2. Data Preparation

### 2.1 Combine all the information into one dataframe


___
Add info (**rfm, income, days, age, gender**) about the customers to the previously computed matrix
___

In [37]:
data = rfm.merge(offer_acceptance,left_on = 'person',right_on = 'person')
data.head()

Unnamed: 0,person,recency,frequency,monetary,age,gender,id,income,days,offer completed,offer received,offer viewed,offer
0,02c083884c7d45b39cc68e1314fec56c,294,10,29.89,20,F,02c083884c7d45b39cc68e1314fec56c,30000.0,1316,0.0,1.0,0.0,0b1e1539f2cc45b7b9fa7c272da2e1d7
1,02c083884c7d45b39cc68e1314fec56c,294,10,29.89,20,F,02c083884c7d45b39cc68e1314fec56c,30000.0,1316,0.0,1.0,1.0,ae264e3637204a6fb9bb56bc8210ddfd
2,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,174,12,320.48,42,M,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,96000.0,1142,1.0,1.0,1.0,0b1e1539f2cc45b7b9fa7c272da2e1d7
3,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,174,12,320.48,42,M,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,96000.0,1142,1.0,1.0,1.0,2298d6c36e964ae4a3e7e9706d1fb8c2
4,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,174,12,320.48,42,M,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,96000.0,1142,1.0,1.0,1.0,2906b810c7d4411798c6938adc9daaa5


___
One-Hot-Encode the customer *gender*
___

In [38]:
data_gender = pd.get_dummies(data.gender, prefix='gender')

In [39]:
data = pd.concat([data, data_gender], axis=1).drop(columns = ['gender','person','id'])

In [40]:
data = data.merge(portfolio, left_on = 'offer', right_on = 'id').drop(columns = ['offer','id'])

In [41]:
data.head()

Unnamed: 0,recency,frequency,monetary,age,income,days,offer completed,offer received,offer viewed,gender_F,gender_M,gender_O,difficulty,duration,offer_type,reward,mobile,social,web
0,294,10,29.89,20,30000.0,1316,0.0,1.0,0.0,1,0,0,20,10,discount,5,0,0,1
1,174,12,320.48,42,96000.0,1142,1.0,1.0,1.0,0,1,0,20,10,discount,5,0,0,1
2,30,7,118.85,36,56000.0,1848,1.0,1.0,0.0,0,1,0,20,10,discount,5,0,0,1
3,48,3,66.05,52,72000.0,2010,1.0,1.0,1.0,0,1,0,20,10,discount,5,0,0,1
4,108,8,121.86,75,69000.0,246,1.0,1.0,1.0,1,0,0,20,10,discount,5,0,0,1


In [42]:
data_gender.corr()

Unnamed: 0,gender_F,gender_M,gender_O
gender_F,1.0,-0.972369,-0.098738
gender_M,-0.972369,1.0,-0.136299
gender_O,-0.098738,-0.136299,1.0


In [43]:
aux = float(len(data_gender[data_gender.gender_O == 1])) / float(len(data_gender))
print("{:.2f}%".format(aux*100))

1.37%


___
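Since the female and male indicator columns are almost perfectly negatively correlated, and O accounts for only a small portion of the population (1.37%), we keep only the *gender_F* column. The *offer received* column is dropped in the same cell.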
___

In [44]:
data = data.drop(columns = ['gender_M','gender_O','offer received'])
data.head()

Unnamed: 0,recency,frequency,monetary,age,income,days,offer completed,offer viewed,gender_F,difficulty,duration,offer_type,reward,mobile,social,web
0,294,10,29.89,20,30000.0,1316,0.0,0.0,1,20,10,discount,5,0,0,1
1,174,12,320.48,42,96000.0,1142,1.0,1.0,0,20,10,discount,5,0,0,1
2,30,7,118.85,36,56000.0,1848,1.0,0.0,0,20,10,discount,5,0,0,1
3,48,3,66.05,52,72000.0,2010,1.0,1.0,0,20,10,discount,5,0,0,1
4,108,8,121.86,75,69000.0,246,1.0,1.0,1,20,10,discount,5,0,0,1


### 2.2 Data Transformation

___
Divide the data into three datasets, according to the offer_type (so we can create a different model for each one)
___

In [45]:
bogo = data[data.offer_type == 'bogo'].drop(columns = ['offer_type'])
info = data[data.offer_type == 'informational'].drop(columns = ['offer_type'])
disc = data[data.offer_type == 'discount'].drop(columns = ['offer_type'])

In [46]:
info['offer completed'].unique()

array([0.])

___
As the name suggests, and as the cell above shows, offers of type *'Informational'* are never completed, so they will not be covered by any ML model
___

In [47]:
from sklearn.preprocessing import MinMaxScaler

___
In order to normalize our data we use MinMaxScaler in both datasets **(bogo and disc)**.
___
Split into train and test datasets
___

In [48]:
bogo_y = bogo['offer completed']
bogo_X = bogo.drop(columns = ['offer completed'])

bogo_X_train, bogo_X_test, bogo_y_train, bogo_y_test = train_test_split(bogo_X, bogo_y, test_size=0.33, random_state=42)

scaler = MinMaxScaler()
scaler.fit(bogo_X_train)

bogo_X_train = pd.DataFrame(scaler.transform(bogo_X_train), columns = bogo_X_train.columns,index=bogo_X_train.index)
bogo_X_test = pd.DataFrame(scaler.transform(bogo_X_test), columns = bogo_X_test.columns,index=bogo_X_test.index)




In [49]:
disc_y = disc['offer completed']
disc_X = disc.drop(columns = ['offer completed'])

disc_X_train, disc_X_test, disc_y_train, disc_y_test = train_test_split(disc_X, disc_y, test_size=0.33, random_state=42)

scaler = MinMaxScaler()
scaler.fit(disc_X_train)

disc_X_train = pd.DataFrame(scaler.transform(disc_X_train), columns = disc_X_train.columns,index=disc_X_train.index)
disc_X_test = pd.DataFrame(scaler.transform(disc_X_test), columns = disc_X_test.columns,index=disc_X_test.index)
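In both cells the scaler is fitted on the training split only and then applied to the test split. The sketch below reproduces the min-max formula in plain NumPy on toy values to show what that implies; it illustrates the arithmetic, not scikit-learn's implementation, and assumes no column is constant.

```python
import numpy as np

train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
test = np.array([[15.0, 6.0]])

# Minimum and range are taken from the training split only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# x_scaled = (x - min_train) / (max_train - min_train)
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range
print(test_scaled)
```

Training values land in [0, 1] by construction, but a test value above the training maximum (6.0 in the second column) is scaled to more than 1. This is expected and avoids leaking test statistics into the model.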


### 2.3 Upload Data to S3 bucket

In [50]:
def make_csv(x, y, filename, data_dir):
    """Write labels y (first column) and features x to a header-less CSV file in data_dir."""
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    # insert the labels as column 0, followed by the feature columns
    data = np.insert(x, 0, y, axis=1)
    df = pd.DataFrame(data).dropna()
    path = os.path.join(data_dir, filename)
    df.to_csv(path, index=False, header=False)

    print('Path created: ' + path)

In [51]:
data_dir = 'model_data/bogo'
make_csv(bogo_X_train.to_numpy(), bogo_y_train.to_numpy(), filename='train.csv', data_dir=data_dir)
make_csv(bogo_X_test.to_numpy(), bogo_y_test.to_numpy(), filename='test.csv', data_dir=data_dir)

Path created: model_data/bogo/train.csv
Path created: model_data/bogo/test.csv


In [52]:
data_dir = 'model_data/disc'
make_csv(disc_X_train.to_numpy(), disc_y_train.to_numpy(), filename='train.csv', data_dir=data_dir)
make_csv(disc_X_test.to_numpy(), disc_y_test.to_numpy(), filename='test.csv', data_dir=data_dir)

Path created: model_data/disc/train.csv
Path created: model_data/disc/test.csv


___
Upload
___

In [53]:
# should be the name of directory you created to save your features data
bogo_data_dir = 'model_data/bogo'
# set prefix, a descriptive name for a directory  
bogo_prefix = 'bogo'


# upload to S3
bogo_input_data = sagemaker_session.upload_data(path=bogo_data_dir, bucket=bucket, key_prefix=bogo_prefix)
print(bogo_input_data)

s3://sagemaker-us-east-2-341076436662/bogo


In [54]:
# should be the name of directory you created to save your features data
disc_data_dir = 'model_data/disc'
# set prefix, a descriptive name for a directory  
disc_prefix = 'disc'


# upload to S3
disc_input_data = sagemaker_session.upload_data(path=disc_data_dir, bucket=bucket, key_prefix=disc_prefix)
print(disc_input_data)

s3://sagemaker-us-east-2-341076436662/disc


# 3. Model Training

## 3.1 Create two PyTorch estimators

___
The source code for the neural network created for this project is in the **source** folder
___
For the estimator we also need to provide the entry point (*train.py*), the output path, role, the instance type for the machine we want to use for our training, the session and some hyperparameters
___

In [55]:
from sagemaker.pytorch import PyTorch

bogo_output_path = 's3://{}/{}'.format(bucket, bogo_prefix)

bogo_estimator = PyTorch(entry_point='train.py',
                    source_dir='source',
                    role=role,
                    framework_version='1.0',
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    output_path = bogo_output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'input_dim': 14,  
                        'hidden_dim': 20,
                        'output_dim': 1,
                        'epochs': 1000 
                    })

In [56]:
disc_output_path = 's3://{}/{}'.format(bucket, disc_prefix)

disc_estimator = PyTorch(entry_point='train.py',
                    source_dir='source',
                    role=role,
                    framework_version='1.0',
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    output_path = disc_output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'input_dim': 14,  
                        'hidden_dim': 20,
                        'output_dim': 1,
                        'epochs': 1000 
                    })
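The hyperparameters given to the estimators reach the entry point as command-line arguments. Since *train.py* is not shown in this notebook, the parser below is a hypothetical sketch of how such a script might read them; the function name `parse_args`, the defaults and the local fallback paths are assumptions. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables set inside the SageMaker training container.

```python
import argparse
import os

def parse_args(argv=None):
    """Hypothetical argument parser for a SageMaker entry point."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator arrive as --name value pairs.
    parser.add_argument('--input_dim', type=int, default=14)
    parser.add_argument('--hidden_dim', type=int, default=20)
    parser.add_argument('--output_dim', type=int, default=1)
    parser.add_argument('--epochs', type=int, default=1000)
    # Locations provided by the training container (local fallbacks are assumed).
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'model_data/bogo'))
    return parser.parse_args(argv)

# Simulate the arguments a training job would pass.
args = parse_args(['--input_dim', '14', '--hidden_dim', '20', '--epochs', '10'])
print(args.input_dim, args.hidden_dim, args.output_dim, args.epochs)
```

The value 14 for `input_dim` matches the number of feature columns left after *offer completed* and *offer_type* are removed from each dataset.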

___
Fit the estimator with the input data
___

In [57]:
%%time 
# train the estimator on S3 training data
bogo_estimator.fit({'train': bogo_input_data})

2021-01-28 21:45:43 Starting - Starting the training job...
2021-01-28 21:45:45 Starting - Launching requested ML instances......
2021-01-28 21:46:48 Starting - Preparing the instances for training............
2021-01-28 21:49:00 Downloading - Downloading input data
2021-01-28 21:49:00 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-01-28 21:49:22,635 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2021-01-28 21:49:22,638 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-01-28 21:49:22,651 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-01-28 21:49:25,674 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-01-28 21:49:26,132 sagemaker-containers INFO     Module train does not

Epoch: 46, Loss: 0.4088536979705301
Epoch: 47, Loss: 0.4099125814975792
Epoch: 48, Loss: 0.4102934057390212
Epoch: 49, Loss: 0.4143013233939807
Epoch: 50, Loss: 0.4082246762474955
Epoch: 51, Loss: 0.4090437745240055
Epoch: 52, Loss: 0.4107492548229371
Epoch: 53, Loss: 0.40908012260395626
Epoch: 54, Loss: 0.4101018072782216
Epoch: 55, Loss: 0.4115309367474184
Epoch: 56, Loss: 0.41043733056753173
Epoch: 57, Loss: 0.4116279121957214
Epoch: 58, Loss: 0.4104375741515376
Epoch: 59, Loss: 0.41078414685024595
Epoch: 60, Loss: 0.4080515616569529
Epoch: 61, Loss: 0.4108186184453107
Epoch: 62, Loss: 0.4072970837241304
Epoch: 63, Loss: 0.4087137705485438
Epoch: 64, Loss: 0.409596054380654
Epoch: 65, Loss: 0.4082562198519789
Epoch: 66, Loss: 0.4084324728372928
Epoch: 67, Loss: 0.4054427554787112
Epo

Epoch: 229, Loss: 0.4028743646322733
Epoch: 230, Loss: 0.4032157128715416
Epoch: 231, Loss: 0.40266842205453546
Epoch: 232, Loss: 0.40504141735173194
Epoch: 233, Loss: 0.40512228422742824
Epoch: 234, Loss: 0.4034271242759122
Epoch: 235, Loss: 0.40109355113936657
Epoch: 236, Loss: 0.4037705526235069
Epoch: 237, Loss: 0.4058099310706018
Epoch: 238, Loss: 0.40329363765413995
Epoch: 239, Loss: 0.40061855007524644
Epoch: 240, Loss: 0.4047763151777483
Epoch: 241, Loss: 0.40412939670927217
Epoch: 242, Loss: 0.4024544573629875
Epoch: 243, Loss: 0.4060148722331141
Epoch: 244, Loss: 0.4047480466444545
Epoch: 245, Loss: 0.40267104733822917
Epoch: 246, Loss: 0.40253473602410506
Epoch: 247, Loss: 0.4056037684147701
Epoch: 248, Loss: 0.4033397327062582
Epoch: 249, Loss: 0.4052702898718674
Epoch: 250, Loss: 0.

Epoch: 415, Loss: 0.4030823893690505
Epoch: 416, Loss: 0.3975896771126433
Epoch: 417, Loss: 0.3970983088026815
Epoch: 418, Loss: 0.3976215238877434
Epoch: 419, Loss: 0.3960987243152135
Epoch: 420, Loss: 0.3996907145887572
Epoch: 421, Loss: 0.39861930961389597
Epoch: 422, Loss: 0.4011128917542566
Epoch: 423, Loss: 0.40151455723017165
Epoch: 424, Loss: 0.4000651706396544
Epoch: 425, Loss: 0.3992174552884958
Epoch: 426, Loss: 0.399417232211942
Epoch: 427, Loss: 0.39721009089388626
Epoch: 428, Loss: 0.3998758300139284
Epoch: 429, Loss: 0.40226780688377256
Epoch: 430, Loss: 0.3990600879045196
Epoch: 431, Loss: 0.4001826873541208
Epoch: 432, Loss: 0.4009950059011352
Epoch: 433, Loss: 0.3986109114518675
Epoch: 434, Loss: 0.3966707574309238
Epoch: 435, Loss: 0.39895158457115065
Epoch: 436, Loss: 0.39633

Epoch: 597, Loss: 0.4003347878551747
Epoch: 598, Loss: 0.39591910847221834
Epoch: 599, Loss: 0.3986015708377665
Epoch: 600, Loss: 0.39884259232305724
Epoch: 601, Loss: 0.3986549597403568
Epoch: 602, Loss: 0.39885937816168426
Epoch: 603, Loss: 0.395359914373725
Epoch: 604, Loss: 0.4003330104898038
Epoch: 605, Loss: 0.3968507433838287
Epoch: 606, Loss: 0.3971005532900863
Epoch: 607, Loss: 0.3995828022343077
Epoch: 608, Loss: 0.4007079996127393
Epoch: 609, Loss: 0.39755263437940785
Epoch: 610, Loss: 0.39825209926808686
Epoch: 611, Loss: 0.3976160654203031
Epoch: 612, Loss: 0.40058569874444444
Epoch: 613, Loss: 0.3939918571822912
Epoch: 614, Loss: 0.39816406136314236
Epoch: 615, Loss: 0.39807099280917546
Epoch: 616, Loss: 0.3999334687513929
Epoch: 617, Loss: 0.39862645719391016
Epoch: 618, Loss: 0.3

Epoch: 784, Loss: 0.3943408693870939
Epoch: 785, Loss: 0.39467452425267524
Epoch: 786, Loss: 0.3982773126614374
Epoch: 787, Loss: 0.3960227478053395
Epoch: 788, Loss: 0.3992129467005222
Epoch: 789, Loss: 0.39849649403208187
Epoch: 790, Loss: 0.3989563784993577
Epoch: 791, Loss: 0.39883164943995814
Epoch: 792, Loss: 0.39969640806894413
Epoch: 793, Loss: 0.39562932116959765
Epoch: 794, Loss: 0.39898762595872
Epoch: 795, Loss: 0.3981062886028204
Epoch: 796, Loss: 0.39973918129052366
Epoch: 797, Loss: 0.39631289923834107
Epoch: 798, Loss: 0.4007091284942083
Epoch: 799, Loss: 0.3953075273990384
Epoch: 800, Loss: 0.39735168881230004
Epoch: 801, Loss: 0.39883872576026996
Epoch: 802, Loss: 0.3997798339150753
Epoch: 803, Loss: 0.39867743774011916
Epoch: 804, Loss: 0.3973121815779727
Epoch: 805, Loss: 0.3

Epoch: 966, Loss: 0.3952525891621248
Epoch: 967, Loss: 0.3983021286827606
Epoch: 968, Loss: 0.39555082481514203
Epoch: 969, Loss: 0.3976542955138212
Epoch: 970, Loss: 0.39419703121666755
Epoch: 971, Loss: 0.4007449895864865
Epoch: 972, Loss: 0.3982395235691245
Epoch: 973, Loss: 0.3970126702165538
Epoch: 974, Loss: 0.3949876933630406
Epoch: 975, Loss: 0.39876465154695806
Epoch: 976, Loss: 0.39892315861986716
Epoch: 977, Loss: 0.3959302465620748
Epoch: 978, Loss: 0.3970866443210618
Epoch: 979, Loss: 0.3973766089176159
Epoch: 980, Loss: 0.39898916709019105
Epoch: 981, Loss: 0.39818750225254856
Epoch: 982, Loss: 0.39494601955552316
Epoch: 983, Loss: 0.3989934204957792
Epoch: 984, Loss: 0.3974983581698617
Epoch: 985, Loss: 0.39880604122251395
Epoch: 986, Loss: 0.39758045372667816
Epoch: 987, Loss: 0.

In [58]:
%%time 
# train the estimator on S3 training data
disc_estimator.fit({'train': disc_input_data})

2021-01-28 22:05:39 Starting - Starting the training job...
2021-01-28 22:05:41 Starting - Launching requested ML instances......
2021-01-28 22:07:05 Starting - Preparing the instances for training.........
2021-01-28 22:08:34 Downloading - Downloading input data
2021-01-28 22:08:34 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-01-28 22:08:49,059 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2021-01-28 22:08:49,062 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-01-28 22:08:49,075 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-01-28 22:08:50,499 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-01-28 22:08:50,995 sagemaker-containers INFO     Module train does not pr

Epoch: 44, Loss: 0.3515972903883342
Epoch: 45, Loss: 0.352343225959709
Epoch: 46, Loss: 0.35212490708150423
Epoch: 47, Loss: 0.35312955217044906
Epoch: 48, Loss: 0.35199958582804924
Epoch: 49, Loss: 0.3524666999065374
Epoch: 50, Loss: 0.34977186672152816
Epoch: 51, Loss: 0.34928733102966986
Epoch: 52, Loss: 0.35113326069842304
Epoch: 53, Loss: 0.35093072036960427
Epoch: 54, Loss: 0.3502667433394438
Epoch: 55, Loss: 0.3513608867230501
Epoch: 56, Loss: 0.34833422138288783
Epoch: 57, Loss: 0.348883515635495
Epoch: 58, Loss: 0.3505654149220546
Epoch: 59, Loss: 0.3506187096891606
Epoch: 60, Loss: 0.350279724251129
Epoch: 61, Loss: 0.3501800368668625
Epoch: 62, Loss: 0.34999389146089305
Epoch: 63, Loss: 0.3487775720763972
Epoch: 64, Loss: 0.3496148744083488
Epoch: 65, Loss: 0.3516656969627623

Epoch: 229, Loss: 0.345225236010712
Epoch: 230, Loss: 0.343808784502049
Epoch: 231, Loss: 0.3446645486009055
Epoch: 232, Loss: 0.3467826467002253
Epoch: 233, Loss: 0.34520238557140964
Epoch: 234, Loss: 0.3445999061235983
Epoch: 235, Loss: 0.34432625201484446
Epoch: 236, Loss: 0.345637937069958
Epoch: 237, Loss: 0.34508239387313305
Epoch: 238, Loss: 0.344614299984124
Epoch: 239, Loss: 0.3450427041530547
Epoch: 240, Loss: 0.3447168143877749
Epoch: 241, Loss: 0.34416701391246685
Epoch: 242, Loss: 0.3458782660862359
Epoch: 243, Loss: 0.34622260908601366
Epoch: 244, Loss: 0.3457021365245773
Epoch: 245, Loss: 0.3453196400817669
Epoch: 246, Loss: 0.3431116306963917
Epoch: 247, Loss: 0.3455908954521713
Epoch: 248, Loss: 0.3443629318920288
Epoch: 249, Loss: 0.34458352915057655
Epoch: 250, Loss: 0.3476565

Epoch: 409, Loss: 0.347058793900339
Epoch: 410, Loss: 0.34408236395320674
Epoch: 411, Loss: 0.34177284256313306
Epoch: 412, Loss: 0.3434905904880339
Epoch: 413, Loss: 0.3440410076249494
Epoch: 414, Loss: 0.34313730280254634
Epoch: 415, Loss: 0.3433337443822079
Epoch: 416, Loss: 0.34550332349791957
Epoch: 417, Loss: 0.34300364196063815
Epoch: 418, Loss: 0.34374924474884627
Epoch: 419, Loss: 0.3430364396119521
Epoch: 420, Loss: 0.34366190644593403
Epoch: 421, Loss: 0.3440899238227069
Epoch: 422, Loss: 0.34408168374041087
Epoch: 423, Loss: 0.34404632352416026
Epoch: 424, Loss: 0.34379366185974064
Epoch: 425, Loss: 0.34370041559115877
Epoch: 426, Loss: 0.3415367040855203
Epoch: 427, Loss: 0.3421846505869386
Epoch: 428, Loss: 0.3424501450300134
Epoch: 429, Loss: 0.34558938351300295
Epoch: 430, Loss:

Epoch: 594, Loss: 0.341941709362629
Epoch: 595, Loss: 0.34156826713755317
Epoch: 596, Loss: 0.34200912253874566
Epoch: 597, Loss: 0.3419764480741376
Epoch: 598, Loss: 0.3437499926664779
Epoch: 599, Loss: 0.3420869376193938
Epoch: 600, Loss: 0.34267892174227216
Epoch: 601, Loss: 0.34201659886706726
Epoch: 602, Loss: 0.34315983600688077
Epoch: 603, Loss: 0.34235815778992124
Epoch: 604, Loss: 0.34377335159632877
Epoch: 605, Loss: 0.3420511996978122
Epoch: 606, Loss: 0.3437715177552888
Epoch: 607, Loss: 0.3427801000324814
Epoch: 608, Loss: 0.340877040806458
Epoch: 609, Loss: 0.3446840853456058
Epoch: 610, Loss: 0.3445030234224106
Epoch: 611, Loss: 0.34225909038091806
Epoch: 612, Loss: 0.3452943783222559
Epoch: 613, Loss: 0.34387459956078653
Epoch: 614, Loss: 0.34305576919912467
Epoch: 615, Loss: 0.3

Epoch: 780, Loss: 0.34355261233681555
Epoch: 781, Loss: 0.3426146129923178
Epoch: 782, Loss: 0.34309935522167967
Epoch: 783, Loss: 0.34240639288086466
Epoch: 784, Loss: 0.3412883101527718
Epoch: 785, Loss: 0.3417357302178212
Epoch: 786, Loss: 0.3393572543577924
Epoch: 787, Loss: 0.3421100302964696
Epoch: 788, Loss: 0.34377294938074865
Epoch: 789, Loss: 0.34241938777782477
Epoch: 790, Loss: 0.34475460876993713
Epoch: 791, Loss: 0.34236767804455936
Epoch: 792, Loss: 0.34360737645695016
Epoch: 793, Loss: 0.34309782788095716
Epoch: 794, Loss: 0.3439990449773961
Epoch: 795, Loss: 0.3414094856765265
Epoch: 796, Loss: 0.34247371065301707
Epoch: 797, Loss: 0.34426948522829875
Epoch: 798, Loss: 0.3440837310987283
Epoch: 799, Loss: 0.3412713045511621
Epoch: 800, Loss: 0.3423852091603681
Epoch: 801, Loss:

Epoch: 960, Loss: 0.3416191020759224
Epoch: 961, Loss: 0.33970751129450966
Epoch: 962, Loss: 0.34335979217025375
Epoch: 963, Loss: 0.34406780752157845
Epoch: 964, Loss: 0.3414331116708438
Epoch: 965, Loss: 0.3391862560490671
Epoch: 966, Loss: 0.3439151958513552
Epoch: 967, Loss: 0.3437718276644527
Epoch: 968, Loss: 0.341210953155481
Epoch: 969, Loss: 0.3441395012569197
Epoch: 970, Loss: 0.3443696746211602
Epoch: 971, Loss: 0.3451196898304914
Epoch: 972, Loss: 0.34415326331798707
Epoch: 973, Loss: 0.3422365121412199
Epoch: 974, Loss: 0.3418886175054995
Epoch: 975, Loss: 0.3444486130517779
Epoch: 976, Loss: 0.34364623250697035
Epoch: 977, Loss: 0.34545875876081994
Epoch: 978, Loss: 0.34109449283173565
Epoch: 979, Loss: 0.341725237744742
Epoch: 980, Loss: 0.34488417735858434
Epoch: 981, Loss: 0.342

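The discount model's loss moves only from about 0.3516 at epoch 44 to about 0.3416 at epoch 960, so most of the run sits on a plateau. To inspect or plot this, the `Epoch: N, Loss: X` lines can be pulled out of the captured output. The following is a minimal sketch, not part of the original notebook; `parse_losses` and `EPOCH_RE` are hypothetical names, and lines that were cut off in the captured output would yield partial values and should be checked by hand.

```python
import re

# Matches "Epoch: 981, Loss: 0.342" anywhere in a line, so any color-code
# prefix or suffix around the message is simply ignored.
EPOCH_RE = re.compile(r"Epoch:\s*(\d+),\s*Loss:\s*([0-9]*\.?[0-9]+)")


def parse_losses(log_text):
    """Return a list of (epoch, loss) tuples found in captured training output."""
    pairs = []
    for line in log_text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


# Three complete lines taken from the discount-model log above.
sample = """\
Epoch: 44, Loss: 0.3515972903883342
Epoch: 45, Loss: 0.352343225959709
Epoch: 960, Loss: 0.3416191020759224
"""

losses = parse_losses(sample)
first, last = losses[0], losses[-1]
print(f"epoch {first[0]}: {first[1]:.4f} -> epoch {last[0]}: {last[1]:.4f}")
```
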
# 4. Model Deployment

In [59]:
%%time

from sagemaker.pytorch import PyTorchModel

# wrap the trained BOGO model artifacts in a deployable PyTorchModel
bogo_model = PyTorchModel(model_data=bogo_estimator.model_data,
                          role=role,
                          framework_version='1.0',
                          entry_point='predict.py',
                          source_dir='source')

# deploy and create a predictor
bogo_predictor = bogo_model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

---------------!
CPU times: user 1 s, sys: 24.5 ms, total: 1.03 s
Wall time: 7min 32s


In [60]:
%%time

# wrap the trained discount model artifacts in a deployable PyTorchModel
disc_model = PyTorchModel(model_data=disc_estimator.model_data,
                          role=role,
                          framework_version='1.0',
                          entry_point='predict.py',
                          source_dir='source')

# deploy and create a predictor
disc_predictor = disc_model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

-------------!
CPU times: user 823 ms, sys: 54.1 ms, total: 877 ms
Wall time: 6min 34s


In [61]:
bogo_predictor

<sagemaker.pytorch.model.PyTorchPredictor at 0x7fa7406f5710>

In [62]:
disc_predictor

<sagemaker.pytorch.model.PyTorchPredictor at 0x7fa73ee20d10>
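
Both endpoints keep running on their `ml.t2.medium` instances until they are deleted. Once the evaluation in section 5 is finished, a small helper along these lines could tear them down. This is a sketch only: `cleanup` is a hypothetical name that does not appear in the notebook, and it assumes each object exposes the SageMaker SDK predictor method `delete_endpoint()`.

```python
def cleanup(predictors):
    """Delete the endpoint behind each predictor and return how many were deleted."""
    deleted = 0
    for predictor in predictors:
        # delete_endpoint() is the predictor method provided by the SageMaker SDK
        predictor.delete_endpoint()
        deleted += 1
    return deleted


# Intended use in the notebook, after all predictions have been collected:
# cleanup([bogo_predictor, disc_predictor])
```
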

# 5. Model Testing and Results

In [63]:
import os
# read in test data, assuming it is stored locally
bogo_test_data = pd.read_csv(os.path.join(bogo_data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
bogo_test_y = bogo_test_data.iloc[:,0]
bogo_test_x = bogo_test_data.iloc[:,1:]

bogo_test_y_preds = np.squeeze(np.round(bogo_predictor.predict(bogo_test_x)))


assert len(bogo_test_y_preds)==len(bogo_test_y), 'Unexpected number of predictions.'
print('Test passed!')


Test passed!
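
The call `np.squeeze(np.round(...))` is what turns the endpoint response into class labels. The sketch below illustrates it on made-up values, assuming the endpoint returns one probability per row in an array of shape `(n, 1)`; the notebook does not show the raw response, so that shape is an assumption. Note that `np.round` rounds halves to the nearest even number, so a score of exactly 0.5 becomes 0.

```python
import numpy as np

# Hypothetical endpoint output: one probability per row, shape (4, 1).
probs = np.array([[0.10], [0.50], [0.51], [0.90]])

# np.round maps each probability to 0.0 or 1.0; np.squeeze drops the
# trailing axis so the result lines up with a 1-D label vector.
labels = np.squeeze(np.round(probs))

print(labels.shape)  # (4,)
print(labels)        # [0. 0. 1. 1.]
```
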


In [64]:
# read in test data, assuming it is stored locally
disc_test_data = pd.read_csv(os.path.join(disc_data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
disc_test_y = disc_test_data.iloc[:,0]
disc_test_x = disc_test_data.iloc[:,1:]

disc_test_y_preds = np.squeeze(np.round(disc_predictor.predict(disc_test_x)))


assert len(disc_test_y_preds)==len(disc_test_y), 'Unexpected number of predictions.'
print('Test passed!')


Test passed!


## 5.1 Evaluate the Results
Both endpoints are scored on their test sets with four binary classification metrics:
- Accuracy
- Precision
- Recall
- F1 score
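
For reference, these four metrics can be written in terms of the true/false positive and negative counts (TP, FP, TN, FN). The forms below are the standard definitions and match the expressions used in the `evaluate` function in this section:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{TP}{TP + \tfrac{1}{2}(FP + FN)} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
$$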

In [65]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: Currently unused; the metrics are returned, not printed
    :return: A dictionary of performance metrics.
    """
    
    # rounding and squeezing array
    test_preds = np.squeeze(np.round(predictor.predict(test_features)))
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = float(np.logical_and(test_labels, test_preds).sum())
    fp = float(np.logical_and(1-test_labels, test_preds).sum())
    tn = float(np.logical_and(1-test_labels, 1-test_preds).sum())
    fn = float(np.logical_and(test_labels, 1-test_preds).sum())
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1score = tp/(tp+(fp+fn)/2)
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy, 'F1Score': f1score}



In [66]:
bogo_eval = evaluate(bogo_predictor,bogo_test_x, bogo_test_y)

In [67]:
bogo_eval

{'Accuracy': 0.838691562543872,
 'F1Score': 0.8745496233213232,
 'FN': 356.0,
 'FP': 793.0,
 'Precision': 0.834722801167153,
 'Recall': 0.9183673469387755,
 'TN': 1969.0,
 'TP': 4005.0}
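
The reported BOGO figures follow directly from the four confusion-matrix counts. As a sanity check, the sketch below recomputes them with the same formulas as `evaluate`, without calling the endpoint; `metrics_from_counts` is a hypothetical helper name, not part of the notebook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Recompute the binary classification metrics from confusion-matrix counts."""
    return {
        'Precision': tp / (tp + fp),
        'Recall': tp / (tp + fn),
        'Accuracy': (tp + tn) / (tp + fp + tn + fn),
        'F1Score': tp / (tp + (fp + fn) / 2),
    }


# Counts copied from the bogo_eval output above.
bogo_check = metrics_from_counts(tp=4005, fp=793, fn=356, tn=1969)
for name, value in bogo_check.items():
    print(f"{name}: {value:.4f}")
```
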

In [68]:
disc_eval = evaluate(disc_predictor,disc_test_x, disc_test_y)

In [69]:
disc_eval

{'Accuracy': 0.8531144781144782,
 'F1Score': 0.898300145701797,
 'FN': 252.0,
 'FP': 795.0,
 'Precision': 0.8532939656763241,
 'Recall': 0.948318293683347,
 'TN': 1457.0,
 'TP': 4624.0}
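
Accuracy is easier to read next to a baseline. In both test sets the positive class is the majority (TP + FN exceeds FP + TN), so a model that always predicts 1 would already reach roughly 61% accuracy on BOGO and 68% on discount. The sketch below derives those baselines from the counts reported above; `always_positive_accuracy` is a hypothetical helper, and this comparison is an added sanity check rather than part of the original evaluation.

```python
def always_positive_accuracy(tp, fp, fn, tn):
    """Accuracy obtained by predicting the positive class for every example."""
    positives = tp + fn  # examples whose true label is 1
    total = tp + fp + fn + tn
    return positives / total


# Counts and accuracies copied from the bogo_eval and disc_eval outputs above.
bogo_base = always_positive_accuracy(tp=4005, fp=793, fn=356, tn=1969)
disc_base = always_positive_accuracy(tp=4624, fp=795, fn=252, tn=1457)

bogo_lift = 0.838691562543872 - bogo_base
disc_lift = 0.8531144781144782 - disc_base

print(f"BOGO:     baseline {bogo_base:.3f}, lift {bogo_lift:.3f}")
print(f"Discount: baseline {disc_base:.3f}, lift {disc_lift:.3f}")
```

On these numbers the discount model has the higher raw accuracy, while the BOGO model shows the larger gain over its baseline.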