# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer buy 10 dollars get 2 off on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).

# Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer ie BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (ie transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record

**Note:** If you are using the workspace, you will need to go to the terminal and run the command `conda update pandas` before reading in the files. This is because the version of pandas in the workspace cannot read in the transcript.json file correctly, but the newest version of pandas can. You can access the termnal from the orange icon in the top left of this notebook.  

You can see how to access the terminal and how the install works using the two images below.  First you need to access the terminal:

<img src="pic1.png"/>

Then you will want to run the above command:

<img src="pic2.png"/>

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.

### Business Understanding

 **The questions of interest for the Starbucks dataset are as follows:**

* Which demographic groups respond best to which offer type?
* Which demographic groups will make purchases even if they don't receive an offer?
* Which demographic groups spends the most amount?
* Build a machine learning model that predicts how much someone will spend based on demographics and offer type.
* Build a model that predicts whether or not someone will respond to an offer.



### Exploratory Data Analysis

In [1]:
import pandas as pd
import numpy as np
import math
import json
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
% matplotlib inline

# read in the json files
portfolio = pd.read_json('../data/dataset/portfolio.json', orient='records', lines=True)
profile = pd.read_json('../data/dataset/profile.json', orient='records', lines=True)
transcript = pd.read_json('../data/dataset/transcript.json', orient='records', lines=True)

In [2]:
portfolio

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2


In [3]:
# Check shape
portfolio.shape

(10, 6)

In [4]:
# Check for null values
portfolio.isna().sum()

channels      0
difficulty    0
duration      0
id            0
offer_type    0
reward        0
dtype: int64

In [5]:
# Unique offers
list(portfolio['offer_type'].unique())

[u'bogo', u'informational', u'discount']

In [6]:
# Min, Max and Median
portfolio.describe()

Unnamed: 0,difficulty,duration,reward
count,10.0,10.0,10.0
mean,7.7,6.5,4.2
std,5.831905,2.321398,3.583915
min,0.0,3.0,0.0
25%,5.0,5.0,2.0
50%,8.5,7.0,4.0
75%,10.0,7.0,5.0
max,20.0,10.0,10.0


In [7]:
# Top offer type
portfolio['offer_type'].describe()

count       10
unique       3
top       bogo
freq         4
Name: offer_type, dtype: object

In [8]:
# Cleaning Channels

def channel_1(x):
    
    try:
        value = x[0]
        
        return value
    
    except:
       
        return float("NAN") 
    
    
def channel_2(x):
    
    try:
        value = x[1]
        
        return value
    
    except:
       
        return float("NAN")
        
def channel_3(x):
    
    try:
        value = x[2]
        
        return value
    
    except:
       
        return float("NAN") 
    
    
def channel_4(x):
    
    try:
        value = x[3]
        
        return value
    
    except:
       
        return float("NAN")     
    

portfolio['channel_1'] = portfolio['channels'].apply(channel_1)
portfolio['channel_2'] = portfolio['channels'].apply(channel_2)
portfolio['channel_3'] = portfolio['channels'].apply(channel_3)
portfolio['channel_4'] = portfolio['channels'].apply(channel_4)

In [9]:
portfolio

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward,channel_1,channel_2,channel_3,channel_4
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10,email,mobile,social,
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,web,email,mobile,social
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0,web,email,mobile,
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,web,email,mobile,
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5,web,email,,
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3,web,email,mobile,social
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2,web,email,mobile,social
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0,email,mobile,social,
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5,web,email,mobile,social
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2,web,email,mobile,


In [10]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [11]:
# Check shape
profile.shape

(17000, 5)

In [12]:
# Check for null values
profile.isna().sum()

age                    0
became_member_on       0
gender              2175
id                     0
income              2175
dtype: int64

In [13]:
# Min, Max and Median
profile.describe()

Unnamed: 0,age,became_member_on,income
count,17000.0,17000.0,14825.0
mean,62.531412,20167030.0,65404.991568
std,26.73858,11677.5,21598.29941
min,18.0,20130730.0,30000.0
25%,45.0,20160530.0,49000.0
50%,58.0,20170800.0,64000.0
75%,73.0,20171230.0,80000.0
max,118.0,20180730.0,120000.0


In [14]:
# Top gender
profile['gender'].describe()

count     14825
unique        3
top           M
freq       8484
Name: gender, dtype: object

In [15]:
# Unique id
len(list(profile['id'].unique()))

17000

In [16]:
# Unique gender
list(profile['gender'].unique())

[None, u'F', u'M', u'O']

In [17]:
transcript.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{u'offer id': u'9b98b8c7a33c4b65b9aebfe6a799e6...
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{u'offer id': u'0b1e1539f2cc45b7b9fa7c272da2e1...
2,offer received,e2127556f4f64592b11af22de27a7932,0,{u'offer id': u'2906b810c7d4411798c6938adc9daa...
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{u'offer id': u'fafdcd668e3743c1bb461111dcafc2...
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{u'offer id': u'4d5c57ea9a6940dd891ad53e9dbe8d...


In [18]:
# Check shape
transcript.shape

(306534, 4)

In [19]:
# Check for null values
transcript.isna().sum()

event     0
person    0
time      0
value     0
dtype: int64

In [20]:
# Min, Max and Median
transcript.describe()

Unnamed: 0,time
count,306534.0
mean,366.38294
std,200.326314
min,0.0
25%,186.0
50%,408.0
75%,528.0
max,714.0


In [21]:
# Top event
transcript['event'].describe()

count          306534
unique              4
top       transaction
freq           138953
Name: event, dtype: object

In [22]:
# Unique event
list(transcript['event'].unique())

[u'offer received', u'offer viewed', u'transaction', u'offer completed']

In [23]:
# Unique person
len(list(transcript['person'].unique()))

17000

In [24]:
# Cleaning value
def offer(x):
    try:
       
        value =  x['offer id']
        
        return value
        
    except:
        
        try: 
            
            value =  x['offer_id']
        
            return value
        
        except:
            
            return float("NAN")
  
    
def amount(x):
    try:
       
        value =  x['amount']
        
        return value
        
    except:
        
        return float("NAN")     


In [25]:
transcript.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{u'offer id': u'9b98b8c7a33c4b65b9aebfe6a799e6...
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{u'offer id': u'0b1e1539f2cc45b7b9fa7c272da2e1...
2,offer received,e2127556f4f64592b11af22de27a7932,0,{u'offer id': u'2906b810c7d4411798c6938adc9daa...
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{u'offer id': u'fafdcd668e3743c1bb461111dcafc2...
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{u'offer id': u'4d5c57ea9a6940dd891ad53e9dbe8d...


### Data Understanding

Now we have the question, we need to move the question into the data. Find the columns from the datasets that would answer these questions.

**The columns identified to answer the necessary questions are as below:**


* Which demographic groups respond best to which offer type?
 -  age, gender, became_member_on, income, offer type
 
* Which demographic groups will make purchases even if they don't receive an offer?
 -  age, offer id, gender, became_member_on, income, amount, person  

* Which demographic groups spends the most amount?
 -  age, offer id, gender, became_member_on, income, amount, person  
 
* Build a machine learning model that predicts how much someone will spend based on demographics and offer type.
 - difficulty, duration, reward, age, income, amount, year, month, offer_type, channel_1, channel_2, channel_3,         channel_4, gender 
 
* Build a model that predicts whether or not someone will respond to an offer.
 - 


### Data Preparation

In [26]:
portfolio.head()

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward,channel_1,channel_2,channel_3,channel_4
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10,email,mobile,social,
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,web,email,mobile,social
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0,web,email,mobile,
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,web,email,mobile,
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5,web,email,,


In [27]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [28]:
transcript.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{u'offer id': u'9b98b8c7a33c4b65b9aebfe6a799e6...
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{u'offer id': u'0b1e1539f2cc45b7b9fa7c272da2e1...
2,offer received,e2127556f4f64592b11af22de27a7932,0,{u'offer id': u'2906b810c7d4411798c6938adc9daa...
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{u'offer id': u'fafdcd668e3743c1bb461111dcafc2...
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{u'offer id': u'4d5c57ea9a6940dd891ad53e9dbe8d...


In [29]:
def clean_data(profile, portfolio, transcript, offer, amount):
    """
    Function to clean the data

    INPUT:
    profile - (pandas dataframe) profile as defined at the top of the notebook
    portfolio - (pandas dataframe) portfolio as defined at the top of the notebook
    transcript - (pandas dataframe) transcript as defined at the top of the notebook

    OUTPUT:
    offer_type_df - merged dataframe containing columns offer id, offer type, age,
                    became_member_on, gender, person, income
    amount_df - merged dataframe containing columns event, amount, age,
                became_member_on, gender, person, income

    """
    
    # Converting None to NAN
    transcript['offer id'] = transcript['value'].apply(offer)
    transcript['amount'] = transcript['value'].apply(amount)
    
    # Rename column 'id' to 'person'
    profile = profile.rename(columns={'id': 'person'})
    
    # Rename column 'id' to 'offer id'
    portfolio = portfolio.rename(columns={'id': 'offer id'})
    
    # Merge dataframes proflie and transcript
    merged_df = profile.merge(transcript, how='right', on='person')
    
    # Drop Nan values in column 'Gender', 'Income'
    merged_df = merged_df.dropna(subset=['income'])
    
    # Drop column 'value'
    merged_df.drop(columns=['value'], inplace=True)
    
    # Create offer dataframe - offer_df
    offer_df = merged_df.dropna(subset=['offer id'])
    
    # Drop column 'amount' from offer_df dataframe
    offer_df.drop(columns=['amount'], inplace=True)
    
    # Merge dataframes portfolio and offer_df, map columns 'offer id' to 'offer type'
    offer_type_df = portfolio.merge(offer_df, how='right', on='offer id')
    
    # Create amount dataframe - amount_df
    amount_df = merged_df.dropna(subset=['amount'])
    
    # Drop column 'offer id'
    amount_df.drop(columns=['offer id'], inplace=True)
    
    return offer_type_df, amount_df
    
    
offer_type_df, amount_df = clean_data(profile, portfolio, transcript, offer, amount)    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [30]:
# Which demographic groups respond best to which offer type
# Which demographic groups will make purchases even if they don't receive an offer
# Which demographic groups spends the most amount?
# Build a machine learning model that predicts how much someone will spend based on demographics and offer type.
# Build a model that predicts whether or not someone will respond to an offer

### Statistical Analysis

* Which demographic groups respond best to which offer type?

In [31]:
# Demographic groups and offer
#-----------------------------

In [32]:
# Filter by gender as 'M' and age as '35'

# List of persons who has viewed at least an offer
#persons_viewed_offer = list(offer_type_df['person'][offer_type_df['event'] == 'offer viewed'].unique())

# List of persons who has completed at least an offer
#persons_completed_offer = list(offer_type_df['person'][offer_type_df['event'] == 'offer completed'].unique())

# Intersection of lists
#intersection_list = list(set(persons_viewed_offer) & set(persons_completed_offer))

# Filter rows with the intersection list
#offer_type_df = offer_type_df[offer_type_df['person'].isin(intersection_list)]

# Filter by offer viewed
offer_type_viewed_df = offer_type_df[offer_type_df['event'] == 'offer viewed'] 

# Filter by offer completed
offer_type_completed_df = offer_type_df[offer_type_df['event'] == 'offer completed'] 

# Select obly columns 'person' and 'time' from the dataframe offer_type_completed_df
offer_type_completed_df = offer_type_completed_df[['person','time']]

In [33]:
#offer_type_df = offer_type_df[offer_type_df['person'].isin(intersection_list)]

In [34]:
#offer_type_viewed_df = offer_type_df[offer_type_df['event'] =='offer viewed'] 

In [35]:
#offer_type_completed_df = offer_type_df[offer_type_df['event'] =='offer completed'] 

In [36]:
#offer_type_completed_df = offer_type_completed_df[['person','time']]

In [37]:
offer_type_viewed_df.shape

(49860, 17)

In [38]:
merged_offer = offer_type_viewed_df.merge(offer_type_completed_df, how='inner', on=['person','time'])

In [39]:
merged_offer = merged_offer.sort_values(by=['time'])

In [40]:
merged_offer.head()

Unnamed: 0,channels,difficulty,duration,offer id,offer_type,reward,channel_1,channel_2,channel_3,channel_4,age,became_member_on,gender,person,income,event,time
921,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,web,email,mobile,,55,20151208,M,83d2641895054948946aa6e898b85632,85000.0,offer viewed,0
2034,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2,web,email,mobile,social,59,20150829,F,37f7b59a483e4201bf5fc0d99978622f,55000.0,offer viewed,0
681,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,web,email,mobile,social,29,20161021,F,bf5783772fee4f2ab126f07bf3be80f1,60000.0,offer viewed,0
680,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,web,email,mobile,social,94,20160618,M,906e0eeff3bc43d79e3686adf7232594,117000.0,offer viewed,0
2712,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5,web,email,mobile,social,37,20160401,M,9096694eb31b4b9fb9c12fa9c626d028,45000.0,offer viewed,0


In [41]:
df_gender_age = merged_offer[(merged_offer['gender'] == 'M') | (offer_type_df['age'] == 35)]  

  """Entry point for launching an IPython kernel.


In [42]:
df_gender_age

Unnamed: 0,channels,difficulty,duration,offer id,offer_type,reward,channel_1,channel_2,channel_3,channel_4,age,became_member_on,gender,person,income,event,time
921,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,web,email,mobile,,55,20151208,M,83d2641895054948946aa6e898b85632,85000.0,offer viewed,0
680,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,web,email,mobile,social,94,20160618,M,906e0eeff3bc43d79e3686adf7232594,117000.0,offer viewed,0
2712,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5,web,email,mobile,social,37,20160401,M,9096694eb31b4b9fb9c12fa9c626d028,45000.0,offer viewed,0
2037,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2,web,email,mobile,social,74,20161220,M,53879247fce049dd9e8d55da657fe9a1,70000.0,offer viewed,0
1003,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,web,email,mobile,,31,20160606,M,329a9b32e7e0475cb2643125919c4d90,46000.0,offer viewed,0
674,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,web,email,mobile,social,49,20161121,M,1f99f39237164b17a8b6848f4ce881c3,47000.0,offer viewed,0
1059,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,web,email,mobile,,57,20160109,M,b34eb5a525e5497897275dcd8b5e7ff2,64000.0,offer viewed,0
3001,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2,web,email,mobile,,44,20150821,M,41486bbaab7a49e2afc05d2b48d3b00f,75000.0,offer viewed,0
142,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10,email,mobile,social,,66,20180421,M,292117548f3e4adebb5b5f896e479f13,46000.0,offer viewed,0
1590,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3,web,email,mobile,social,58,20160402,M,f8e1a46daab849268feecf21ba67ae50,79000.0,offer viewed,0


In [43]:
df_gender_age['offer_type'].value_counts(normalize=True) * 100

bogo             54.606142
discount         42.456609
informational     2.937250
Name: offer_type, dtype: float64

In [44]:
# 42.6 %

In [45]:
# Age

group_offer_age_df = offer_type_df.groupby(['age'])['offer_type'].describe() 

In [46]:
group_offer_age_df.sort_values(by=['top'])

Unnamed: 0_level_0,count,unique,top,freq
age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
18,668,3,bogo,290
54,3593,3,bogo,1543
55,3628,3,bogo,1556
57,3708,3,bogo,1600
100,126,3,bogo,60
61,3127,3,bogo,1358
62,3260,3,bogo,1414
63,3493,3,bogo,1574
66,2832,3,bogo,1226
69,2490,3,bogo,1057


In [47]:
# Gender

group_offer_gender_df = offer_type_df.groupby(['gender'])['offer_type'].describe()

In [48]:
group_offer_gender_df.sort_values(by=['top'])

Unnamed: 0_level_0,count,unique,top,freq
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
F,63719,3,bogo,27619
M,82896,3,bogo,35301
O,2190,3,discount,920


In [49]:
# Became_member_on

group_offer_became_member_on_df = offer_type_df.groupby(['became_member_on'])['offer_type'].describe()

In [50]:
group_offer_became_member_on_df.sort_values(by=['top'])

Unnamed: 0_level_0,count,unique,top,freq
became_member_on,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
20160325,130,3,bogo,59
20160826,87,3,bogo,37
20160825,109,3,bogo,58
20160823,46,3,bogo,23
20160822,20,3,bogo,9
20160820,80,3,bogo,38
20160818,57,3,bogo,36
20160830,65,3,bogo,34
20160814,93,3,bogo,46
20160810,129,3,bogo,54


In [51]:
# Income

group_offer_income_df = offer_type_df.groupby(['income'])['offer_type'].describe()

In [52]:
group_offer_income_df.sort_values(by=['top'])

Unnamed: 0_level_0,count,unique,top,freq
income,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30000.0,778,3,bogo,354
67000.0,2431,3,bogo,1070
70000.0,2722,3,bogo,1189
71000.0,2998,3,bogo,1304
73000.0,3234,3,bogo,1496
78000.0,1676,3,bogo,736
81000.0,1570,3,bogo,672
83000.0,1366,3,bogo,608
84000.0,1557,3,bogo,718
85000.0,1611,3,bogo,712


* Which demographic groups will make purchases even if they don't receive an offer?

In [None]:
def purchase_without_offer(offer_type_df, amount_df):
    """
    Function to find the demographic groups that will make purchases even if they don't receive an offer

    INPUT:
    offer_type_df - (pandas dataframe) offer_type_df returned by function clean_data
    amount_df - (pandas dataframe) amount_df returned by function clean_data

    OUTPUT:
    match_df - (pandas dataframe) dataframe which contains demographic groups that will
                make purchases even if they don't receive an offer

    """

    persons_viewed_offer = list(offer_type_df['person'][offer_type_df['event'] == 'offer viewed'].unique())
    
    persons_completed_offer = list(offer_type_df['person'][offer_type_df['event'] == 'offer completed'].unique())
    
    intersection_list = list(set(persons_viewed_offer) & set(persons_completed_offer))

    match_df = amount_df[amount_df['age'] == 144]

    for person in intersection_list:

        match_df = pd.concat([match_df, amount_df[amount_df['person'].isin([person])]])
    
    
    return match_df


match_df = purchase_without_offer(offer_type_df, amount_df)

In [None]:
# Build a machine learning model that predicts how much someone will spend based on demographics and offer type.

match_df.head()

In [None]:
offer_type_df.head()

In [None]:
# Find difference of dataframes

diff_df = amount_df[~amount_df.apply(tuple,1).isin(match_df.apply(tuple,1))]

In [None]:
# The demographic groups that will make purchases even if they don't receive an offer

diff_df.head()

In [None]:
diff_df.shape

In [None]:
# make barchart

* Which demographic groups spends the most amount?

In [None]:
# Make demographic groups with amount

In [None]:
amount_age_df = amount_df.groupby(['age'])['amount'].sum().to_frame()

In [None]:
# Top 10 age which spends the most amount

amount_age_df = amount_age_df.sort_values(by=['amount'], ascending=False)

amount_age_df.head(10)

In [None]:
amount_gender_df = amount_df.groupby(['gender'])['amount'].sum().to_frame()

In [None]:
# Top gender which spends the most amount

amount_gender_df = amount_gender_df.sort_values(by=['amount'], ascending=False)

amount_gender_df.head()

In [None]:
amount_became_member_on_df = amount_df.groupby(['became_member_on'])['amount'].sum().to_frame()

In [None]:
# Top 10 became_member_on which spends the most amount

amount_became_member_on_df = amount_became_member_on_df.sort_values(by=['amount'], ascending=False)

amount_became_member_on_df.head(10)

In [None]:
amount_income_df = amount_df.groupby(['income'])['amount'].sum().to_frame()

In [None]:
# Top 10 income which spends the most amount

amount_income_df = amount_income_df.sort_values(by=['amount'], ascending=False)

amount_income_df.head(10)

### Modelling and Evaluation

* Build a machine learning model that predicts how much someone will spend based on demographics and offer type.

In [None]:
# Prepare data for training the model

In [None]:
def generate_features(portfolio, transcript, amount_df):
    """
    Function to generate features for training the model

    INPUT:
    portfolio - (pandas dataframe) portfolio as defined at the top of the notebook
    transcript - (pandas dataframe) transcript as defined at the top of the notebook
    amount_df - (pandas dataframe) amount_df returned by function clean_data

    OUTPUT:
    df_offer_type_amount - (pandas dataframe) dataframe which contains the features for training the model

    """
    
    # Find duplicated rows based on duplicted time
    duplicated_df = transcript[transcript.duplicated(subset=['time'])]

    # Get rows with event as 'offer completed'
    df_offer_completed = duplicated_df[duplicated_df['event'] == 'offer completed']

    # Drop column amount
    df_offer_completed.drop(columns=['amount', 'event'], inplace=True)
    
    # Merge dataframes amount_df and df_offer_completed, map columns 'amount' to 'offer id'
    df_offer_amount = amount_df.merge(df_offer_completed, how='right', on=['person','time'])

    # Rename column 'id' to 'offer id'
    portfolio = portfolio.rename(columns={'id': 'offer id'})

    # Merge dataframes portfolio and df_offer_amount, map columns 'offer id' to 'offer type'
    df_offer_type_amount = portfolio.merge(df_offer_amount, how='right', on='offer id')
    
    # Drop unnecessary columns
    df_offer_type_amount.drop(columns=['offer id', 'channels', 'person', 'event', 'time', 'value'], inplace=True)
    
    # Generate year and month from column became_member_on 
    df_became_member_on = pd.to_datetime(df_offer_type_amount['became_member_on'], format='%Y%m%d', errors='ignore').to_frame()
    df_offer_type_amount['year'] = pd.DatetimeIndex(df_became_member_on['became_member_on']).year
    df_offer_type_amount['month'] = pd.DatetimeIndex(df_became_member_on['became_member_on']).month
    
    # Drop column became_member_on
    df_offer_type_amount.drop(columns=['became_member_on'], inplace=True)
    
    return df_offer_type_amount


df_offer_type_amount = generate_features(portfolio, transcript, amount_df)

In [None]:
"""Predict how much someone will spend based on demographics and offer type"""

In [None]:
# Drop rows for which there were no amount match found
df_offer_type_amount = df_offer_type_amount.dropna(subset=['age'])

In [None]:
# check for Nan values

df_offer_type_amount.isna().sum()

In [None]:
# numeric cols- difficulty, duration, reward, age, income, amount, year, month

df_offer_type_amount_numeric =  df_offer_type_amount[['reward', 'age', 'income', 'amount', 'year']]

#categoric cols- , offer_type, channel_1, channel_2, channel_3, channel_4, gender

df_offer_type_amount_categoric =  df_offer_type_amount[['offer_type', 'gender']]


In [None]:
df_offer_type_amount_categoric.isna().sum()

In [None]:
# corelation matrix(Seattle) for numeric type

corrMatrix =  df_offer_type_amount_numeric.corr()
sn.heatmap(corrMatrix, annot=True)
plt.show()


In [None]:
def create_dummy_df(num_df, cat_df, dummy_na):
    """
    Function to create dummy variables for categorical data

    INPUT:
    num_df - pandas dataframe with numerical variables
    cat_df - pandas dataframe with categorical variables
    dummy_na - Bool holding whether you want to dummy NA vals of categorical columns or not


    OUTPUT:
    num_df - a new dataframe that has the following characteristics:
            1. dummy columns for each of the categorical columns in cat_df
            2. if dummy_na is True - it also contains dummy columns for the NaN values
            3. Use a prefix of the column name with an underscore (_) for separating
    """
            
    cat_df = pd.get_dummies(cat_df, dummy_na=dummy_na)    
    
    num_df = pd.concat([num_df, cat_df], axis=1)

    return num_df

In [None]:
concat_df = create_dummy_df(df_offer_type_amount_numeric, df_offer_type_amount_categoric, dummy_na=False)

In [None]:
concat_df

In [None]:
def fit_linear_mod(concat_df, test_size=.3, rand_state=42):
    '''
    INPUT:
    concat_df - a dataframe holding all the variables of interest
    test_size - a float between [0,1] about what proportion of data should
                be in the test dataset
    rand_state - an int that is provided as the random state for splitting 
                 the data into training and test 
    
    OUTPUT:
    test_score - float - r2 score on the test data
    train_score - float - r2 score on the test data
    lm_model - model object from sklearn
    X_train, X_test, y_train, y_test - output from sklearn train test split used for optimal model
    '''
    
    X = concat_df.drop('amount', axis=1)
    y = concat_df['amount']

    #Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size,
                                                        random_state=rand_state) 

    lm_model = LinearRegression(normalize=True) # Instantiate

    lm_model.fit(X_train, y_train) #Fit

    #Predict and score the model
    y_train_preds = lm_model.predict(X_train) 
    y_test_preds = lm_model.predict(X_test) 

    train_score = r2_score(y_train, y_train_preds)
    test_score = r2_score(y_test, y_test_preds) 
    
    return test_score, train_score, lm_model, X_train, X_test, y_train, y_test

#Test your function with the above dataset
test_score, train_score, lm_model, X_train, X_test, y_train, y_test = fit_linear_mod(concat_df)



In [None]:
#Print training and testing score
print("The rsquared on the training data was {}.  The rsquared on the test data was {}.".\
      format(train_score, test_score))

In [None]:
# Build a model that predicts whether or not someone will respond to an offer.