### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks respondents whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’, and answers of ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [2]:
data = pd.read_csv('data/coupons.csv')

In [3]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [4]:
# display general info about the dataframe, including total number of entries.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

In [5]:
# fix spelling of 'passenger' column
data = data.rename(columns={'passanger':'passenger'})

In [6]:
# list all columns of the dataframe
data.columns

Index(['destination', 'passenger', 'weather', 'temperature', 'time', 'coupon',
       'expiration', 'gender', 'age', 'maritalStatus', 'has_children',
       'education', 'occupation', 'income', 'car', 'Bar', 'CoffeeHouse',
       'CarryAway', 'RestaurantLessThan20', 'Restaurant20To50',
       'toCoupon_GEQ5min', 'toCoupon_GEQ15min', 'toCoupon_GEQ25min',
       'direction_same', 'direction_opp', 'Y'],
      dtype='object')

In [7]:
# sum up all null values in the dataframe per column
data.isnull().sum()

destination                 0
passenger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64

The series above shows the number of null values per column. 'car' has by far the most: 12576 of its 12684 entries are null.

In [8]:
# look at the unique entries for the car column.
data.car.unique()

array([nan, 'Scooter and motorcycle', 'crossover', 'Mazda5',
       'do not drive', 'Car that is too old to install Onstar :D'],
      dtype=object)

3. Decide what to do about your missing data -- drop, replace, other...

Given that the 'car' column is almost entirely empty and its few non-empty values carry little information, I decided to drop it completely.

However, the other columns with missing data ('Bar', 'CoffeeHouse', 'CarryAway', 'RestaurantLessThan20', and 'Restaurant20To50') are each missing fewer than 2% of their values (at most 217 of 12684 rows). I'm going to start by dropping every row that has a missing value in any of these columns and see how many rows I'm left with.
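
To put the null counts in proportion, the share of missing values per column can be computed with `isnull().mean()`. The sketch below is only an illustration: it uses a small made-up frame named `toy` (not the coupon data) so that it runs on its own. On the real data, the same expression would presumably be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the coupon data
toy = pd.DataFrame({
    'car': [np.nan, np.nan, np.nan, 'crossover'],
    'Bar': ['never', np.nan, '1~3', 'gt8'],
    'Y':   [1, 0, 1, 0],
})

# isnull() yields booleans; their mean is the fraction of missing entries per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```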

In [9]:
# drop car column completely
data_dropped = data.drop(columns='car')

# drop all rows with any N/A values
data_dropped = data_dropped.dropna()

# see how many rows are left
data_dropped.shape

(12079, 25)

This removed 605 rows (12684 - 12079), which is just under 5% of the data. Since that is a small share of the observations, I'm happy with the results of this approach.
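
The arithmetic behind that estimate can be checked in a few lines. The two row counts are copied from the outputs above; the variable names are just for illustration.

```python
# Row counts taken from the data.info() and data_dropped.shape outputs above
rows_before = 12684
rows_after = 12079

# Rows removed by dropna(), as a count and as a share of the original data
rows_dropped = rows_before - rows_after
pct_dropped = rows_dropped / rows_before * 100
print(f'{rows_dropped} rows dropped ({pct_dropped:.2f}% of the data)')
```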

4. What proportion of the total observations chose to accept the coupon? 



In [10]:
# calculate the total number of rows with Y=1 divided by the total number of rows
accepted_ratio = data_dropped.Y.sum()/data_dropped.shape[0]
print(f'{round(accepted_ratio*100,1)} percent of total observations chose to accept the coupon.')

56.9 percent of total observations chose to accept the coupon.
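
Because `Y` only takes the values 0 and 1, dividing its sum by the row count is the same as taking its mean. A minimal sketch on made-up responses (the frame `toy` is hypothetical, not the coupon data):

```python
import pandas as pd

# Hypothetical toy responses: 1 = accepted, 0 = declined
toy = pd.DataFrame({'Y': [1, 0, 1, 1, 0]})

# For a 0/1 column, sum / count and mean give the same proportion
ratio_via_sum = toy.Y.sum() / toy.shape[0]
ratio_via_mean = toy.Y.mean()
print(ratio_via_sum, ratio_via_mean)
```

Using `.mean()` keeps later group comparisons shorter, since it can be applied directly after a `groupby`.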


5. Use a bar plot to visualize the `coupon` column.

In [11]:
# A bar plot of a categorical column is really just a histogram
fig = px.histogram(data_dropped,y='coupon')
fig.update_layout(
    title='Number of Coupons Offered per Type of Coupon',
    xaxis_title='Number of Coupons Offered',
    yaxis_title='Type of Coupon Offered'
)

6. Use a histogram to visualize the temperature column.

In [12]:
# Create a histogram of the temperature column
fig = px.histogram(data_dropped,x='temperature')
fig.update_layout(
    title='Number of Coupons Offered per Temperature',
    xaxis_title='Temperature (°F)',
    yaxis_title='Number of Coupons Offered'
)

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [13]:
# filter the dropped dataframe down to just those entries where a Bar coupon was offered
df_bar = data_dropped[data_dropped['coupon']=='Bar']

df_bar.head()

Unnamed: 0,destination,passenger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
24,No Urgent Place,Friend(s),Sunny,80,10AM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,0,1,1
35,Home,Alone,Sunny,55,6PM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,1,0,1
39,Work,Alone,Sunny,55,7AM,Bar,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,1,1,0,1,1
46,No Urgent Place,Friend(s),Sunny,80,10AM,Bar,1d,Male,46,Single,...,4~8,1~3,1~3,never,1,0,0,0,1,0
57,Home,Alone,Sunny,55,6PM,Bar,1d,Male,46,Single,...,4~8,1~3,1~3,never,1,0,0,1,0,0


2. What proportion of bar coupons were accepted?


In [14]:
# calculate the total ratio of bar coupons that were accepted relative to the total number of bar coupons
bar_coupon_accepted_ratio = df_bar.Y.sum()/df_bar.shape[0]
print(f'{round(bar_coupon_accepted_ratio*100,1)}% of Bar coupons were accepted.')

41.2% of Bar coupons were accepted.


3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [15]:
# look at the unique values of the Bar column, which is how many times a person went to a bar per month
df_bar.Bar.unique()

array(['never', 'less1', '1~3', 'gt8', '4~8'], dtype=object)

In [16]:
# split observations into bar visits of 3 or fewer per month vs. more than 3
df_bar_lte_3 = df_bar.query("Bar=='never' or Bar=='less1' or Bar=='1~3'")
df_bar_gt_3 = df_bar.query("Bar=='4~8' or Bar=='gt8'")

# calculate the acceptance ratio for each
print(f'{round(df_bar_lte_3.Y.sum()/df_bar_lte_3.shape[0]*100,1)}% of those who went to a bar 3 or fewer times per month accepted the coupon.')
print(f'{round(df_bar_gt_3.Y.sum()/df_bar_gt_3.shape[0]*100,1)}% of those who went to a bar more than 3 times per month accepted the coupon.')


37.3% of those who went to a bar 3 or fewer times per month accepted the coupon.
76.2% of those who went to a bar more than 3 times per month accepted the coupon.
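
The same two-way comparison can also be written as a single `groupby`, which avoids building a separate DataFrame per group. This is a sketch on made-up rows (the frame `toy` and the group labels are assumptions for illustration); only the `Bar` labels match the dataset.

```python
import pandas as pd

# Hypothetical toy bar-coupon rows; 'Bar' uses the same labels as the dataset
toy = pd.DataFrame({
    'Bar': ['never', 'less1', '1~3', '4~8', 'gt8', '1~3'],
    'Y':   [0, 0, 1, 1, 1, 0],
})

# Label each row by visit frequency, then average Y within each label
frequent = toy['Bar'].isin(['4~8', 'gt8'])
group = frequent.map({False: '3 or fewer', True: 'more than 3'})
rates = toy.groupby(group)['Y'].mean()
print(rates)
```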


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others.  Is there a difference?


In [17]:
# look at unique values of age column
df_bar.age.unique()

array(['21', '46', '26', '31', '41', '50plus', '36', 'below21'],
      dtype=object)

In [18]:
# df_bar is a filtered subset of data_dropped, so silence pandas' chained-assignment warning
pd.options.mode.chained_assignment = None

# convert age to numeric, using 50 for '50plus' and 20 for 'below21' as stand-in values
df_bar['age'] = pd.to_numeric(df_bar['age'].replace({'50plus': '50', 'below21': '20'}))
df_bar.age.unique()

array([21, 46, 26, 31, 41, 50, 36, 20])
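
Since the numeric values are only stand-ins for age brackets, an "over 25" filter could also be expressed against the original labels, with no conversion at all. A minimal sketch on a made-up Series (`ages` and `over_25_labels` are hypothetical names; the labels are the ones listed in the `unique()` output above):

```python
import pandas as pd

# Hypothetical sample of age labels as they appear in the raw column
ages = pd.Series(['21', '50plus', 'below21', '26', '36'])

# Brackets strictly above 25, using the labels seen in the dataset
over_25_labels = ['26', '31', '36', '41', '46', '50plus']
over_25 = ages.isin(over_25_labels)
print(over_25.tolist())
```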

In [19]:
# mask of drivers over 25 who go to a bar at least once a month ('1~3' is the closest bucket to "more than once"), and its inverse
gt_25_gt_1_mask = (df_bar['age'] > 25) & (df_bar['Bar'].isin(['1~3','4~8','gt8']))
all_others_mask = ~gt_25_gt_1_mask

In [20]:
# calculate the acceptance ratios
gt_25_gt_1_acceptance_ratio = df_bar[gt_25_gt_1_mask].Y.sum()/gt_25_gt_1_mask.sum()
all_others_acceptance_ratio = df_bar[all_others_mask].Y.sum()/all_others_mask.sum()

print(f'{round(gt_25_gt_1_acceptance_ratio*100,1)}% accepted the coupon where age >25 and went to a bar >= 1 time per month.')
print(f'{round(all_others_acceptance_ratio*100,1)}% accepted the coupon for all other groups.')


69.0% accepted the coupon where age >25 and went to a bar >= 1 time per month.
33.8% accepted the coupon for all other groups.
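
The "mask, then divide" pattern repeats in the questions that follow, so it could be wrapped in a small helper. The function below is a hypothetical convenience, not part of the original analysis, and it is shown on made-up rows.

```python
import pandas as pd

def acceptance_rate(df: pd.DataFrame, mask: pd.Series) -> float:
    """Share of the rows selected by `mask` whose Y equals 1 (hypothetical helper)."""
    return df.loc[mask, 'Y'].mean()

# Hypothetical toy rows
toy = pd.DataFrame({
    'age': [21, 26, 31, 46],
    'Bar': ['never', '1~3', 'gt8', 'less1'],
    'Y':   [0, 1, 1, 0],
})

# Study group: over 25 and in one of the bar-going buckets; everyone else is the comparison group
study_mask = (toy['age'] > 25) & toy['Bar'].isin(['1~3', '4~8', 'gt8'])
print(acceptance_rate(toy, study_mask), acceptance_rate(toy, ~study_mask))
```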


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry. 


In [21]:
# look at passenger column unique values
df_bar['passenger'].unique()

array(['Friend(s)', 'Alone', 'Kid(s)', 'Partner'], dtype=object)

In [22]:
# look at occupation unique values
df_bar['occupation'].unique()

array(['Architecture & Engineering', 'Student',
       'Education&Training&Library', 'Unemployed', 'Healthcare Support',
       'Healthcare Practitioners & Technical', 'Sales & Related',
       'Management', 'Arts Design Entertainment Sports & Media',
       'Computer & Mathematical', 'Life Physical Social Science',
       'Personal Care & Service', 'Office & Administrative Support',
       'Construction & Extraction', 'Legal', 'Retired',
       'Community & Social Services', 'Installation Maintenance & Repair',
       'Transportation & Material Moving', 'Business & Financial',
       'Protective Service', 'Food Preparation & Serving Related',
       'Production Occupations',
       'Building & Grounds Cleaning & Maintenance',
       'Farming Fishing & Forestry'], dtype=object)

In [23]:
# mask the study group, and invert the 'others'
study_group_mask = (df_bar['Bar'].isin(['1~3','gt8','4~8'])) & \
    (df_bar['passenger'].isin(['Friend(s)','Alone','Partner'])) & \
    (df_bar['occupation'] != 'Farming Fishing & Forestry')

other_group_mask = ~study_group_mask

In [24]:
# calculate the acceptance ratios
study_group_acceptance_ratio = df_bar[study_group_mask].Y.sum()/study_group_mask.sum()
other_group_acceptance_ratio = df_bar[other_group_mask].Y.sum()/other_group_mask.sum()

print(f'{round(study_group_acceptance_ratio*100,1)}% accepted the coupon where went to bar >=1, passenger was not a kid, and occupation was not Farming Fishing & Forestry.')
print(f'{round(other_group_acceptance_ratio*100,1)}% accepted the coupon for all other groups.')

70.9% accepted the coupon where went to bar >=1, passenger was not a kid, and occupation was not Farming Fishing & Forestry.
29.8% accepted the coupon for all other groups.


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 



In [25]:
# look at the various column unique values
df_bar['maritalStatus'].unique()

array(['Single', 'Married partner', 'Unmarried partner', 'Divorced',
       'Widowed'], dtype=object)

In [26]:
df_bar['RestaurantLessThan20'].unique()

array(['4~8', '1~3', 'less1', 'gt8', 'never'], dtype=object)

In [27]:
df_bar['income'].unique()

array(['$62500 - $74999', '$12500 - $24999', '$75000 - $87499',
       '$50000 - $62499', '$37500 - $49999', '$25000 - $37499',
       '$100000 or More', '$87500 - $99999', 'Less than $12500'],
      dtype=object)

In [28]:
# mask the study group, and invert the 'others'
bars_no_kids_not_widowed_mask = (
        (df_bar['Bar'].isin(['1~3','gt8','4~8'])) & \
        (df_bar['passenger'].isin(['Friend(s)','Alone','Partner'])) & \
        (df_bar['maritalStatus'] != 'Widowed')
)

bars_under_30_mask = (
        (df_bar['Bar'].isin(['1~3','gt8','4~8'])) & \
        (df_bar['age'] < 30)
)

cheap_restaurants_low_income_mask = (
        (df_bar['RestaurantLessThan20'].isin(['4~8','gt8'])) & \
        (df_bar['income'].isin(['Less than $12500','$12500 - $24999','$25000 - $37499','$37500 - $49999']))
)


In [29]:
# calculate the acceptance ratios
bars_no_kids_not_widowed_acceptance_ratio = df_bar[bars_no_kids_not_widowed_mask].Y.sum()/bars_no_kids_not_widowed_mask.sum()
bars_under_30_acceptance_ratio = df_bar[bars_under_30_mask].Y.sum()/bars_under_30_mask.sum()
cheap_restaurants_low_income_acceptance_ratio = df_bar[cheap_restaurants_low_income_mask].Y.sum()/cheap_restaurants_low_income_mask.sum()

print(f'{round(bars_no_kids_not_widowed_acceptance_ratio*100,1)}% accepted the coupon where the observation goes to bars >=1 time per month, has no kids in the car, and was not widowed.')
print(f'{round(bars_under_30_acceptance_ratio*100,1)}% accepted the coupon where the observation goes to bars >=1 time per month and is less than 30.')
print(f'{round(cheap_restaurants_low_income_acceptance_ratio*100,1)}% accepted the coupon where the observation has an income below $50k and eats at cheap restaurants >=4 times per month.')

70.9% accepted the coupon where the observation goes to bars >=1 time per month, has no kids in the car, and was not widowed.
72.0% accepted the coupon where the observation goes to bars >=1 time per month and is less than 30.
45.6% accepted the coupon where the observation has an income below $50k and eats at cheap restaurants >=4 times per month.
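
The question joins the three conditions with *OR*, so another reading is to treat them as one combined group and compare it with everyone else. The sketch below illustrates that reading on made-up rows; the frame `toy` and the intermediate names are assumptions, and no result for the real data is implied.

```python
import pandas as pd

# Hypothetical toy rows with only the columns the three conditions need
toy = pd.DataFrame({
    'Bar': ['1~3', 'never', 'gt8', 'never'],
    'passenger': ['Alone', 'Kid(s)', 'Kid(s)', 'Alone'],
    'maritalStatus': ['Single', 'Widowed', 'Single', 'Married partner'],
    'age': [36, 41, 21, 46],
    'RestaurantLessThan20': ['less1', 'gt8', '1~3', 'never'],
    'income': ['$100000 or More', '$12500 - $24999', '$62500 - $74999', '$75000 - $87499'],
    'Y': [1, 1, 1, 0],
})

goes_to_bars = toy['Bar'].isin(['1~3', '4~8', 'gt8'])
cond_a = goes_to_bars & (toy['passenger'] != 'Kid(s)') & (toy['maritalStatus'] != 'Widowed')
cond_b = goes_to_bars & (toy['age'] < 30)
low_income = ['Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999']
cond_c = toy['RestaurantLessThan20'].isin(['4~8', 'gt8']) & toy['income'].isin(low_income)

# A row belongs to the combined group if it meets any one of the three conditions
combined = cond_a | cond_b | cond_c
print(toy.loc[combined, 'Y'].mean(), toy.loc[~combined, 'Y'].mean())
```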


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

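Based on these results, I think the following factors contribute to higher ratios of accepting bar coupons:
* Going to a bar >=1 time per month
* Not having kids in the car

Income and cheap-restaurant habits look like a weaker signal: drivers with income below 50K who eat at cheap restaurants >=4 times per month accepted at 45.6%, only slightly above the 41.2% overall rate for bar coupons, and the higher-income group was not measured separately.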

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

I will look at Coffee House coupons, as this was the most frequently offered coupon type.

**Question**: What's the overall acceptance rate of Coffee House coupons?

In [30]:
# Filter down to coffee coupons
df_coffee = data_dropped[data_dropped['coupon']=='Coffee House']
df_coffee.head()

Unnamed: 0,destination,passenger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
23,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,0,1,0
26,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,0,1,0
27,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Male,21,Single,...,less1,4~8,4~8,less1,1,1,0,0,1,0
28,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Male,21,Single,...,less1,4~8,4~8,less1,1,1,0,0,1,0
30,No Urgent Place,Friend(s),Sunny,80,6PM,Coffee House,2h,Male,21,Single,...,less1,4~8,4~8,less1,1,0,0,0,1,0


In [31]:
coffee_coupon_accepted_ratio = df_coffee.Y.sum()/df_coffee.shape[0]
print(f'{round(coffee_coupon_accepted_ratio*100,1)}% of Coffee House coupons were accepted.')

49.6% of Coffee House coupons were accepted.


In [32]:
# Look at acceptance ratio based on how often one goes to a coffee house per month
accepted_ratio = df_coffee.groupby('CoffeeHouse').Y.sum()/df_coffee.groupby('CoffeeHouse').size()

fig = px.bar(accepted_ratio)
fig.update_layout(
    title='Coffee House Coupon Accepted Ratio as a function of Number of Times visiting a Coffee House per Month',
    xaxis_title='Number of Times Frequenting a Coffee House per Month',
    yaxis_title='Coffee House Coupon Acceptance Ratio',
    showlegend=False,
)

Going to a coffee house more frequently is clearly associated with a higher likelihood of accepting a coffee coupon.
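
One caveat when reading such a chart: `groupby` sorts string keys alphabetically, so the categories come out as '1~3', '4~8', 'gt8', 'less1', 'never' rather than in order of frequency. A minimal sketch of reordering them with `reindex` before plotting, on made-up rows (the frame `toy` and the name `freq_order` are assumptions):

```python
import pandas as pd

# Hypothetical toy coffee-coupon rows using the dataset's frequency labels
toy = pd.DataFrame({
    'CoffeeHouse': ['never', 'gt8', '1~3', 'less1', '4~8', 'never'],
    'Y':           [0, 1, 1, 0, 1, 1],
})

# Put the groups in order of increasing visit frequency instead of alphabetical order
freq_order = ['never', 'less1', '1~3', '4~8', 'gt8']
rates = toy.groupby('CoffeeHouse')['Y'].mean().reindex(freq_order)
print(rates)
```

Passing the reordered series to `px.bar` should then draw the bars from least to most frequent visitors.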

In [33]:
# look at the ratio of acceptance as a function of time from coupon redemption location
accepted_ratio = df_coffee.groupby('toCoupon_GEQ15min').Y.sum()/df_coffee.groupby('toCoupon_GEQ15min').size()

fig = px.bar(accepted_ratio)
fig.update_layout(
    title='Coffee House Coupon Accepted Ratio as a function of whether the redemption location was >=15 minutes away',
    xaxis_title='Whether the Coffee House was >= 15 Minutes Away (0 = No, 1 = Yes)',
    yaxis_title='Coffee House Coupon Acceptance Ratio',
    showlegend=False,
)

Being further away from the redemption location reduced the likelihood of acceptance.
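
A rate is easier to judge alongside the number of rows behind it, because a small group can produce an extreme ratio by chance. The sketch below shows one way to get both in a single table; it runs on made-up rows (the frame `toy` is hypothetical), not on the coupon data.

```python
import pandas as pd

# Hypothetical toy rows: the distance flag and the accept/decline response
toy = pd.DataFrame({
    'toCoupon_GEQ15min': [0, 0, 0, 1, 1],
    'Y':                 [1, 1, 0, 0, 1],
})

# Acceptance rate and group size side by side
summary = toy.groupby('toCoupon_GEQ15min')['Y'].agg(['mean', 'count'])
print(summary)
```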

In [34]:
# Look at acceptance ratio as a function of income
accepted_ratio = df_coffee.groupby('income').Y.sum()/df_coffee.groupby('income').size()

fig = px.bar(accepted_ratio)
fig.update_layout(
    title='Coffee House Coupon Accepted Ratio as a function of income',
    xaxis_title='Income Range',
    yaxis_title='Coffee House Coupon Acceptance Ratio',
    showlegend=False,
)

Interestingly, there's no clear correlation between acceptance ratio and income for Coffee House coupons.

In [35]:
# Look at acceptance ratio as a function of destination
accepted_ratio = df_coffee.groupby('destination').Y.sum()/df_coffee.groupby('destination').size()

fig = px.bar(accepted_ratio)
fig.update_layout(
    title='Coffee House Coupon Accepted Ratio as a function of destination',
    xaxis_title='Destination',
    yaxis_title='Coffee House Coupon Acceptance Ratio',
    showlegend=False,
)

The coupon is more likely to be accepted if there is no urgent destination for the driver.