### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver at all, but what factors determine whether a driver accepts the coupon once it is delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks the respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled ‘Y = 1’, and answers of ‘no, I do not want the coupon’ are labeled ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \$20), coffee houses, carry out & take away, bars, and more expensive restaurants (\$20 - \$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, divorced, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \$12500, \$12500 - \$24999, \$25000 - \$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \$20 to \$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - Time before it expires: 2 hours or one day

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'browser'

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [2]:
data = pd.read_csv('data/coupons.csv')

In [3]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,has_children,education,occupation,income,car,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [4]:
print(data.info())
print(data.isna().sum())
rows_with_missing = data[data.isna().any(axis=1)]
empty_cols = data.columns[data.isna().all()]
empty_cols

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

Index([], dtype='object')
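The missing-value check above can be condensed into a per-column summary of counts and percentages. The sketch below is one possible way to do this; `missing_summary` is a hypothetical helper and `toy` is a small made-up frame standing in for `data`, so the numbers it prints are not from the coupon dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the coupon data (values are made up for illustration).
toy = pd.DataFrame({
    "coupon": ["Bar", "Coffee House", "Bar", "Restaurant(<20)"],
    "car": [np.nan, np.nan, np.nan, "Mazda5"],
    "Bar": ["never", np.nan, "1~3", "less1"],
    "Y": [0, 1, 1, 0],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the real frame would make it easier to see at a glance which columns are nearly empty and which have only a small share of gaps.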

3. Decide what to do about your missing data -- drop, replace, other...

In [5]:
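# 'car' is almost entirely missing (108 non-null values out of 12684), so drop it.
data = data.drop(columns=["car"])
# Defensive: treat any missing target value as "not accepted".
data['Y'] = data['Y'].fillna(0)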
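The visit-frequency columns (such as `Bar` and `CoffeeHouse`) also contain some missing values. One option, among several, is to fill each with its most common category. The sketch below illustrates that option on a made-up frame; `toy`, `freq_cols`, and `filled` are hypothetical names, and whether mode imputation is appropriate for the real data is a judgment call rather than something this notebook establishes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the frequency columns (values are made up).
toy = pd.DataFrame({
    "Bar": ["never", np.nan, "1~3", "never"],
    "CoffeeHouse": ["less1", "less1", np.nan, "4~8"],
    "Y": [0, 1, 1, 0],
})

freq_cols = ["Bar", "CoffeeHouse"]

# Work on a copy so the original frame keeps its missing values.
filled = toy.copy()
for col in freq_cols:
    # mode() ignores NaN; take the first value if there is a tie.
    most_common = filled[col].mode().iloc[0]
    filled[col] = filled[col].fillna(most_common)

print(filled)
```

An alternative with fewer assumptions would be to leave these rows as they are and simply exclude them from comparisons that depend on the missing column.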

4. What proportion of the total observations chose to accept the coupon?



5. Use a bar plot to visualize the `coupon` column.

In [6]:
# px.histogram on a categorical column counts rows per category, which renders as a bar plot.
fig = px.histogram(data, x='coupon')
fig.show()

6. Use a histogram to visualize the temperature column.

In [7]:
# 'temperature' has no missing values, so it can be plotted directly.
fig = px.histogram(data, x='temperature')
fig.show()

In [8]:
# Question 4: percentage of all observations in which the coupon was accepted.
prop = (data['Y'] == 1).mean() * 100
print(prop)

56.84326710816777


**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [9]:
bar_df = data[data['coupon'] == "Bar"]
bar_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2017 entries, 9 to 12682
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           2017 non-null   object
 1   passanger             2017 non-null   object
 2   weather               2017 non-null   object
 3   temperature           2017 non-null   int64 
 4   time                  2017 non-null   object
 5   coupon                2017 non-null   object
 6   expiration            2017 non-null   object
 7   gender                2017 non-null   object
 8   age                   2017 non-null   object
 9   maritalStatus         2017 non-null   object
 10  has_children          2017 non-null   int64 
 11  education             2017 non-null   object
 12  occupation            2017 non-null   object
 13  income                2017 non-null   object
 14  Bar                   1996 non-null   object
 15  CoffeeHouse           1978 non-null   obje

2. What proportion of bar coupons were accepted?


In [10]:
bar_prop = (bar_df['Y'] ==1).mean()*100
bar_prop

41.00148735746158

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [11]:
print(bar_df['Bar'].unique())
# Rows with a missing 'Bar' value become the string 'nan' after astype(str),
# so they match neither list and are excluded from both groups.
bar_freq = bar_df['Bar'].astype(str).str.strip().str.lower()
bar_less_equal_three = bar_freq.isin(['never', 'less1', '1~3'])
bar_greater_three = bar_freq.isin(['4~8', 'gt8'])
prop_less_three = bar_df.loc[bar_less_equal_three, 'Y'].mean()
prop_more_three = bar_df.loc[bar_greater_three, 'Y'].mean()

print('The percentage of bar coupons accepted by drivers who go to a bar 3 or fewer times a month is', prop_less_three * 100)
print('The percentage of bar coupons accepted by drivers who go to a bar more than 3 times a month is', prop_more_three * 100)

['never' 'less1' '1~3' 'gt8' nan '4~8']
The percentage of bar coupons accepted by drivers who go to a bar 3 or fewer times a month is 37.061769616026716
The percentage of bar coupons accepted by drivers who go to a bar more than 3 times a month is 76.88442211055276
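The same two-group comparison can also be written as a single `groupby`, which reports the group sizes alongside the rates. This is a minimal sketch on made-up values; `toy_bar`, `visit_group`, and `rates` are hypothetical names, and the category strings are assumed to match the ones printed by `bar_df['Bar'].unique()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bar_df (values are made up for illustration).
toy_bar = pd.DataFrame({
    "Bar": ["never", "less1", "1~3", "4~8", "gt8", np.nan, "4~8", "never"],
    "Y":   [0, 0, 1, 1, 1, 1, 0, 1],
})

# Map each raw category to one of the two groups; NaN stays NaN.
visit_group = toy_bar["Bar"].map({
    "never": "3 or fewer", "less1": "3 or fewer", "1~3": "3 or fewer",
    "4~8": "more than 3", "gt8": "more than 3",
})

# groupby drops rows whose key is NaN, so missing 'Bar' values fall in neither group.
rates = toy_bar.groupby(visit_group)["Y"].agg(accept_rate="mean", n="count")
rates["accept_rate"] = rates["accept_rate"].mul(100).round(1)
print(rates)
```

Reporting `n` next to each rate helps judge how much weight a difference deserves, since a small group can produce an extreme rate by chance.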


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others.  Is there a difference?


In [16]:
# '1~3', '4~8', and 'gt8' are treated here as "more than once a month".
bar_greater_equal_one = bar_df['Bar'].astype(str).str.strip().str.lower().isin(['1~3', '4~8', 'gt8'])
print(data['age'].unique())
ages_greater_25 = ['26', '31', '36', '41', '46', '50plus']
age_mask = bar_df['age'].astype(str).str.strip().str.lower().isin(ages_greater_25)
age_bar_mask = bar_greater_equal_one & age_mask
prop_25_greater_one = bar_df.loc[age_bar_mask, 'Y'].mean() * 100
print('The acceptance rate for drivers over 25 who go to a bar more than once a month is: ', prop_25_greater_one)

['21' '46' '26' '31' '41' '50plus' '36' 'below21']
The acceptance rate for drivers over 25 who go to a bar more than once a month is:  69.52380952380952
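The question asks for a comparison against all other drivers, which the cell above does not compute. Negating the combined mask with `~` gives that complementary group. The sketch below shows the idea on made-up values; `toy_bar`, `target`, `target_rate`, and `others_rate` are hypothetical names, and no figure here comes from the real data.

```python
import pandas as pd

# Toy stand-in for bar_df (values are made up for illustration).
toy_bar = pd.DataFrame({
    "Bar": ["1~3", "4~8", "never", "less1", "gt8", "1~3"],
    "age": ["26", "21", "31", "below21", "50plus", "41"],
    "Y":   [1, 1, 0, 0, 1, 0],
})

frequent = toy_bar["Bar"].isin(["1~3", "4~8", "gt8"])
over_25 = toy_bar["age"].isin(["26", "31", "36", "41", "46", "50plus"])
target = frequent & over_25

# Acceptance rate inside the target group and in its complement.
target_rate = toy_bar.loc[target, "Y"].mean() * 100
others_rate = toy_bar.loc[~target, "Y"].mean() * 100

print(f"Over 25 and more than once a month: {target_rate:.1f}%")
print(f"All others: {others_rate:.1f}%")
```

On the real data, the equivalent of `others_rate` would be `bar_df.loc[~age_bar_mask, 'Y'].mean() * 100`.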


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [26]:
bar_greater_equal_one = bar_df['Bar'].astype(str).str.strip().str.lower().isin(['1~3', '4~8', 'gt8'])

print(data['passanger'].unique())
# Note: 'passanger' is the column name as spelled in the dataset.
passangers_not_kids = ['Alone', 'Friend(s)', 'Partner']
passanger_mask = bar_df['passanger'].astype(str).str.strip().isin(passangers_not_kids)

print(data['occupation'].unique())
# Exclude the single farming, fishing, and forestry category instead of listing every other occupation.
occupation_mask = bar_df['occupation'].astype(str).str.strip() != 'Farming Fishing & Forestry'

passanger_occupation_bar_mask = bar_greater_equal_one & occupation_mask & passanger_mask

prop_nokid_greater_one_noFarming = bar_df.loc[passanger_occupation_bar_mask, 'Y'].mean() * 100
print('The acceptance rate for drivers who go to a bar more than once a month, had no kid passengers, and do not work in farming, fishing, or forestry is: ', prop_nokid_greater_one_noFarming)

['Alone' 'Friend(s)' 'Kid(s)' 'Partner']
['Unemployed' 'Architecture & Engineering' 'Student'
 'Education&Training&Library' 'Healthcare Support'
 'Healthcare Practitioners & Technical' 'Sales & Related' 'Management'
 'Arts Design Entertainment Sports & Media' 'Computer & Mathematical'
 'Life Physical Social Science' 'Personal Care & Service'
 'Community & Social Services' 'Office & Administrative Support'
 'Construction & Extraction' 'Legal' 'Retired'
 'Installation Maintenance & Repair' 'Transportation & Material Moving'
 'Business & Financial' 'Protective Service'
 'Food Preparation & Serving Related' 'Production Occupations'
 'Building & Grounds Cleaning & Maintenance' 'Farming Fishing & Forestry']
The acceptance rate for drivers who go to a bar more than once a month, had no kid passengers, and do not work in farming, fishing, or forestry is:  71.32486388384754


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [33]:
print(data['maritalStatus'].unique())
not_widowed = ['Unmarried partner', 'Single', 'Married partner', 'Divorced']
not_widowed_mask = bar_df['maritalStatus'].astype(str).str.strip().isin(not_widowed)

print(data['age'].unique())
age_less_30 = ['21', '26', 'below21']
age_less30_mask = bar_df['age'].astype(str).str.strip().isin(age_less_30)

# Note: the restaurant and income masks below are built on `data` (all coupon types),
# while the bar-related masks are built on `bar_df`; pandas aligns them by index when combined.
print(data['RestaurantLessThan20'].unique())
cheap_restaurant_greater_four = ['4~8', 'gt8']
cheap_restaurant_mask = data['RestaurantLessThan20'].astype(str).str.strip().isin(cheap_restaurant_greater_four)

print(data['income'].unique())
income_less_50k = ['$37500 - $49999', '$12500 - $24999', '$25000 - $37499',
                   'Less than $12500']
income_less_50k_mask = data['income'].astype(str).str.strip().isin(income_less_50k)

bars_not_widowed_nokids_mask = bar_greater_equal_one & not_widowed_mask & passanger_mask
bars_under30_mask = bar_greater_equal_one & age_less30_mask
cheapRestaurants_less50k_mask = cheap_restaurant_mask & income_less_50k_mask

# Acceptance rate over rows that satisfy at least one of the three conditions.
prop_multiple_columns = data.loc[(bars_not_widowed_nokids_mask) | (bars_under30_mask) | (cheapRestaurants_less50k_mask), 'Y'].mean() * 100

print('The acceptance rate for drivers meeting at least one of the conditions is: ', prop_multiple_columns)

['Unmarried partner' 'Single' 'Married partner' 'Divorced' 'Widowed']
['21' '46' '26' '31' '41' '50plus' '36' 'below21']
['4~8' '1~3' 'less1' 'gt8' nan 'never']
['$37500 - $49999' '$62500 - $74999' '$12500 - $24999' '$75000 - $87499'
 '$50000 - $62499' '$25000 - $37499' '$100000 or More' '$87500 - $99999'
 'Less than $12500']
The acceptance rate for drivers meeting at least one of the conditions is:  58.891752577319586
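Because two of the masks above are built on the full `data` frame, the combined rate also includes rows for coupon types other than bar. If the intent is to stay within bar coupons, one approach is to build every mask on the same bar-only frame and report each condition separately. The sketch below assumes that reading of the question and uses a made-up frame; `toy`, `toy_bar`, `conditions`, and `rates` are hypothetical names, and none of the printed numbers come from the real data.

```python
import pandas as pd

# Toy stand-in for `data` (values are made up for illustration).
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Coffee House", "Bar", "Bar"],
    "Bar": ["1~3", "never", "gt8", "4~8", "never"],
    "passanger": ["Alone", "Kid(s)", "Alone", "Kid(s)", "Alone"],
    "maritalStatus": ["Single", "Widowed", "Single", "Married partner", "Single"],
    "age": ["41", "21", "21", "26", "50plus"],
    "RestaurantLessThan20": ["1~3", "gt8", "4~8", "never", "4~8"],
    "income": ["$100000 or More", "$12500 - $24999", "Less than $12500",
               "$50000 - $62499", "$75000 - $87499"],
    "Y": [1, 0, 1, 1, 0],
})
toy_bar = toy[toy["coupon"] == "Bar"]

# Every mask is built on toy_bar, so all of them share one index.
frequent = toy_bar["Bar"].isin(["1~3", "4~8", "gt8"])
no_kid = toy_bar["passanger"] != "Kid(s)"
not_widowed = toy_bar["maritalStatus"] != "Widowed"
under_30 = toy_bar["age"].isin(["below21", "21", "26"])
cheap_often = toy_bar["RestaurantLessThan20"].isin(["4~8", "gt8"])
low_income = toy_bar["income"].isin(
    ["Less than $12500", "$12500 - $24999", "$25000 - $37499", "$37500 - $49999"]
)

conditions = {
    "bar > 1/month, no kid passenger, not widowed": frequent & no_kid & not_widowed,
    "bar > 1/month, under 30": frequent & under_30,
    "cheap restaurants > 4/month, income < 50K": cheap_often & low_income,
}

# Acceptance rate (in percent) for each condition on its own.
rates = {name: toy_bar.loc[mask, "Y"].mean() * 100 for name, mask in conditions.items()}

# Rows meeting at least one condition, and the remaining bar-coupon rows.
any_mask = frequent & no_kid & not_widowed | frequent & under_30 | cheap_often & low_income
any_rate = toy_bar.loc[any_mask, "Y"].mean() * 100
none_rate = toy_bar.loc[~any_mask, "Y"].mean() * 100

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"At least one condition: {any_rate:.1f}%")
print(f"None of the conditions: {none_rate:.1f}%")
```

Keeping the per-condition rates separate makes it possible to see which of the three groups drives the combined figure.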


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [34]:
ans = 'They should target drivers who already go to a bar, who are older, with a higher income'
print(ans)

They should target drivers who already go to a bar, who are older, with a higher income


### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

In [38]:
income_counts = data.groupby(data['Y'])[['income']].value_counts(normalize = True)*100
print(income_counts)

Y  income          
0  $25000 - $37499     14.961637
   $37500 - $49999     14.377055
   $12500 - $24999     14.230910
   $100000 or More     13.372305
   $50000 - $62499     12.257947
   $75000 - $87499      8.092802
   Less than $12500     7.745707
   $87500 - $99999      7.654366
   $62500 - $74999      7.307271
1  $25000 - $37499     16.560333
   $12500 - $24999     14.590846
   $37500 - $49999     14.119279
   $100000 or More     13.925104
   $50000 - $62499     13.703190
   Less than $12500     8.571429
   $87500 - $99999      6.601942
   $62500 - $74999      6.185853
   $75000 - $87499      5.742025
Name: proportion, dtype: float64


In [39]:
weather_counts = data.groupby(data['Y'])[['weather']].value_counts(normalize = True)*100
print(weather_counts)

Y  weather
0  Sunny      74.534161
   Snowy      13.591524
   Rainy      11.874315
1  Sunny      83.065187
   Snowy       9.167822
   Rainy       7.766990
Name: proportion, dtype: float64


In [40]:
where_are_they_going_counts = data.groupby(data['Y'])[['destination']].value_counts(normalize = True)*100
print(where_are_they_going_counts)

Y  destination    
0  No Urgent Place    42.035075
   Home               29.192547
   Work               28.772379
1  No Urgent Place    55.228849
   Home               22.732316
   Work               22.038835
Name: proportion, dtype: float64
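The three tables above show how each outcome group (`Y = 0` and `Y = 1`) is distributed across categories. A complementary view is the acceptance rate within each category, which answers more directly how likely a coupon is to be accepted in a given situation. The sketch below illustrates this on made-up values; `toy` and `rate_by_weather` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for `data` (values are made up for illustration).
toy = pd.DataFrame({
    "weather": ["Sunny", "Sunny", "Sunny", "Rainy", "Rainy", "Snowy"],
    "Y":       [1, 1, 0, 0, 1, 0],
})

# Mean of Y within each category is the acceptance rate for that category.
rate_by_weather = (
    toy.groupby("weather")["Y"]
       .agg(accept_rate="mean", n="count")
       .assign(accept_rate=lambda d: d["accept_rate"].mul(100).round(1))
       .sort_values("accept_rate", ascending=False)
)
print(rate_by_weather)
```

The same pattern should carry over to `income` or `destination` on the real frame by swapping the column passed to `groupby`.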


In [43]:
acceptance = (data.groupby(data['coupon'])['Y'].mean().sort_values(ascending = False)
       .rename('accept_rate').mul(100).round(1))
print(acceptance)

coupon
Carry out & Take away    73.5
Restaurant(<20)          70.7
Coffee House             49.9
Restaurant(20-50)        44.1
Bar                      41.0
Name: accept_rate, dtype: float64


In [46]:
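# Turn the Series into a DataFrame so both axes are named explicitly.
fig = px.bar(acceptance.reset_index(), x='accept_rate', y='coupon', orientation='h',
             labels={'coupon': 'Coupon', 'accept_rate': 'Acceptance Rate (%)'},
             title="Acceptance Rate by Coupon")
fig.show()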
