### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver, but what factors determine whether a driver accepts the coupon once it is delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning Repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks each respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled ‘Y = 1’, and answers of ‘no, I do not want the coupon’ are labeled ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \$20), coffee houses, carry out & take away, bars, and more expensive restaurants (\$20 - \$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data, you will use your knowledge of plotting, statistical summaries, and visualization in Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, divorced, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelor's degree, associate's degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \$12500, \$12500 - \$24999, \$25000 - \$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \$20 to \$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical location of the user, destination, and the venue, and we mark the distance between each two places with time of driving. The user can see whether the venue is in the same direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - Time before it expires: 2 hours or one day

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [3]:
data = pd.read_csv('data/coupons.csv')

In [4]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null
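One way to quantify the gaps that `info()` hints at is to count the nulls in each column. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from `coupons.csv`); the same two calls could be applied to `data`.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not from coupons.csv.
toy = pd.DataFrame({
    "car": [None, None, None, "Scooter"],
    "Bar": ["never", None, "1~3", "less1"],
    "Y": [1, 0, 1, 0],
})

# Count of missing values per column, and the share of rows they represent.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

# Show only the columns that actually have gaps.
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary[summary["missing"] > 0])
```

A column that is almost entirely empty is a candidate for dropping, while a column with a small share of gaps may be worth filling.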

3. Decide what to do about your missing data -- drop, replace, other...

In [6]:
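def handle_missing_data(df):
    # 'car' is populated in only 108 of 12684 rows, so drop the column.
    df_cleaned = df.drop(columns=['car'])
    # Treat a missing visit frequency as 'never'. This is an assumption about what a blank answer means.
    freq_cols = ['Bar', 'CoffeeHouse', 'CarryAway', 'RestaurantLessThan20', 'Restaurant20To50']
    df_cleaned[freq_cols] = df_cleaned[freq_cols].fillna('never')
    # Correct the misspelled column name from the source file.
    df_cleaned = df_cleaned.rename(columns={'passanger': 'passenger'})
    return df_cleaned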

data_cleaned = handle_missing_data(data)
data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passenger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  Bar                   12684 non-null  object
 15  CoffeeHouse           12684 non-null
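Filling with `'never'` is only one of the options the prompt lists. As a rough sketch of two alternatives, on made-up values rather than the survey data, rows with a missing frequency could be dropped, or the gap could be filled with the column's most common label.

```python
import pandas as pd

# Made-up frequency answers with two gaps.
toy = pd.DataFrame({"Bar": ["1~3", None, "1~3", "never", None]})

# Option 1: drop rows that are missing the value.
dropped = toy.dropna(subset=["Bar"])

# Option 2: fill with the most common label in the column.
most_common = toy["Bar"].mode().iloc[0]
filled = toy.fillna({"Bar": most_common})

print(len(dropped), filled["Bar"].tolist())
```

Dropping shrinks the sample, while either fill strategy changes the distribution of the column, so the choice can shift acceptance rates computed per frequency bucket.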

4. What proportion of the total observations chose to accept the coupon?



In [7]:
proportion_accepted_coupons = data['Y'].mean()
print(proportion_accepted_coupons)

0.5684326710816777


5. Use a bar plot to visualize the `coupon` column.

In [8]:
px.bar(data, x='coupon', y='Y', color='coupon')
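The call above passes one row per survey response to `px.bar`, so each bar is a stack of individual 0/1 values. A lighter alternative, sketched here on made-up rows, is to aggregate first and plot one bar per coupon type. The column names `offered` and `acceptance_rate` are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up rows standing in for the survey data.
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Coffee House", "Coffee House", "Coffee House"],
    "Y": [1, 0, 1, 1, 0],
})

# One row per coupon type: how many were offered and what share was accepted.
coupon_summary = (
    toy.groupby("coupon")["Y"]
    .agg(offered="count", acceptance_rate="mean")
    .reset_index()
)
print(coupon_summary)

# With plotly imported as px, the summary could then be plotted, e.g.:
# px.bar(coupon_summary, x="coupon", y="offered", color="coupon")
```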

6. Use a histogram to visualize the temperature column.

In [9]:
px.histogram(data, x='temperature', color='temperature')

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar-related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [10]:
bar_data = data_cleaned[data_cleaned['coupon'] == 'Bar']
bar_data[['coupon','Bar', 'Y']].sample(10)

Unnamed: 0,coupon,Bar,Y
959,Bar,less1,1
11016,Bar,1~3,1
5247,Bar,less1,0
11166,Bar,less1,0
8753,Bar,less1,0
10154,Bar,less1,1
11434,Bar,1~3,1
79,Bar,less1,1
8834,Bar,less1,0
12303,Bar,never,0


2. What proportion of bar coupons were accepted?


In [11]:
proportion_accepted_bar_coupons = bar_data['Y'].mean()
print(proportion_accepted_bar_coupons)

0.41001487357461575


3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [12]:
lte_3 = ['never', 'less1', '1~3']

bar_lte_3 = bar_data[bar_data['Bar'].isin(lte_3)]
bar_gt_3 = bar_data[~bar_data['Bar'].isin(lte_3)]

lte_3_acceptance = bar_lte_3['Y'].mean()
gt_3_acceptance = bar_gt_3['Y'].mean()
ratio_gt_lte = gt_3_acceptance / lte_3_acceptance

print(f"Acceptance rate for drivers who go to a bar 3 or fewer times a month: {lte_3_acceptance:.2%}")
print(f"Acceptance rate for drivers who go to a bar more than 3 times a month: {gt_3_acceptance:.2%}")
print(f"Drivers who go to a bar more than 3 times a month accept at {ratio_gt_lte:.2f} times the rate of those who go 3 or fewer times a month.")

Acceptance rate for drivers who go to a bar 3 or fewer times a month: 37.07%
Acceptance rate for drivers who go to a bar more than 3 times a month: 76.88%
Drivers who go to a bar more than 3 times a month accept at 2.07 times the rate of those who go 3 or fewer times a month.
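The two-way split above can be complemented by a per-bucket view. The sketch below, on made-up rows, orders the frequency labels seen in the `Bar` column and reports the acceptance rate together with the group size for each one.

```python
import pandas as pd

# Made-up rows; the category labels mirror those seen in the `Bar` column.
toy = pd.DataFrame({
    "Bar": ["never", "never", "less1", "1~3", "4~8", "gt8", "gt8"],
    "Y":   [0, 0, 1, 1, 1, 1, 0],
})

# Order the labels from least to most frequent so the table reads naturally.
order = ["never", "less1", "1~3", "4~8", "gt8"]
toy["Bar"] = pd.Categorical(toy["Bar"], categories=order, ordered=True)

# Acceptance rate and group size for each visit-frequency bucket.
by_freq = toy.groupby("Bar", observed=False)["Y"].agg(["mean", "size"])
print(by_freq)
```

Seeing the size next to each rate helps judge how much weight a bucket's rate deserves.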


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others.  Is there a difference?


In [13]:
bar_data['age'].value_counts()

age
21         417
26         395
31         339
50plus     283
36         209
41         178
46         109
below21     87
Name: count, dtype: int64

In [14]:
freqs_prob4 = ['never', 'less1']
ages_prob4 = ['below21', '21']

# Drivers who go to a bar less than once a month AND are 25 or younger.
bar_lte1_lte25 = bar_data.query("Bar in @freqs_prob4 and age in @ages_prob4")
# Complement of the group above. By De Morgan's law this keeps drivers who go to a bar
# more than once a month OR are over 25, which is broader than requiring both conditions.
bar_gt1_gt25 = bar_data.query("not (Bar in @freqs_prob4 and age in @ages_prob4)")

lte1_lte25_acceptance = bar_lte1_lte25['Y'].mean()
gt1_gt25_acceptance = bar_gt1_gt25['Y'].mean()

print(f"Acceptance rate for drivers who go to bars more than once a month or are over the age of 25: {gt1_gt25_acceptance:.2%}")
print(f"Acceptance rate for drivers who go to bars less than once a month and are 25 or younger: {lte1_lte25_acceptance:.2%}")
# bar_gt1_gt25[['Bar', 'Y','age','coupon','passenger']].sample(10)

Acceptance rate for drivers who go to bars more than once a month or are over the age of 25: 41.33%
Acceptance rate for drivers who go to bars less than once a month and are 25 or younger: 39.33%
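Note that `not (A and B)` is equivalent to `(not A) or (not B)`, so the negated query above keeps any driver who fails at least one of the two low-frequency and young-age conditions. The prompt asks for drivers who meet both conditions (more than once a month and over 25) versus everyone else. A minimal sketch of that split on made-up rows, with `low_freq` and `young` as names chosen for this example:

```python
import pandas as pd

# Made-up rows covering all four combinations of the two conditions.
toy = pd.DataFrame({
    "Bar": ["never", "never", "1~3", "1~3"],
    "age": ["21", "31", "21", "31"],
    "Y":   [0, 0, 1, 1],
})

low_freq = ["never", "less1"]   # less than once a month
young = ["below21", "21"]       # 25 or younger

# Both conditions must hold: more than once a month AND over 25.
both = toy.query("Bar not in @low_freq and age not in @young")

# Everyone else is the complement of that group.
others = toy.drop(both.index)

# Negating "low frequency AND young" is looser: it keeps a row if EITHER part fails.
either = toy.query("not (Bar in @low_freq and age in @young)")

print(len(both), len(others), len(either))  # 1 3 3
```

On the real data, the rates for `both` and `others` would need to be recomputed; the figures printed above describe the looser split.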


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month, had passengers that were not a kid, and had occupations other than farming, fishing, or forestry, and all other drivers.


In [15]:
print(bar_data['passenger'].value_counts())
print()
print(bar_data['occupation'].value_counts())

passenger
Alone        1200
Friend(s)     337
Partner       274
Kid(s)        206
Name: count, dtype: int64

occupation
Unemployed                                   301
Student                                      251
Computer & Mathematical                      232
Sales & Related                              178
Education&Training&Library                   140
Management                                   119
Office & Administrative Support              105
Arts Design Entertainment Sports & Media     100
Business & Financial                          89
Retired                                       75
Food Preparation & Serving Related            48
Community & Social Services                   44
Healthcare Support                            44
Healthcare Practitioners & Technical          41
Transportation & Material Moving              35
Legal                                         34
Architecture & Engineering                    27
Personal Care & Service                       2

In [16]:
freqs_prob5 = ['never', 'less1']
passenger_prob5 = 'Kid(s)'
occupation_prob5 = 'Farming Fishing & Forestry'

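# Target group: all three criteria hold.
df_prob5 = bar_data.query("Bar not in @freqs_prob5 and passenger != @passenger_prob5 and occupation != @occupation_prob5")
# Comparison group: all three criteria negated at once. This is narrower than "all other drivers" (the complement of df_prob5).
df_prob5_inv = bar_data.query("Bar in @freqs_prob5 and passenger == @passenger_prob5 and occupation == @occupation_prob5")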

# print(df_prob5[['Bar', 'Y','age','coupon','passenger','occupation']].sample(10))
# print(f'======')
# print(df_prob5_inv[['Bar', 'Y','age','coupon','passenger','occupation']].sample())

prob5_acceptance = df_prob5['Y'].mean()
prob5_inv_acceptance = df_prob5_inv['Y'].mean()

print('Criteria: Drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry')
print(f"Acceptance rate for criteria: {prob5_acceptance:.2%}")
print(f"Acceptance rate for inverse criteria: {prob5_inv_acceptance:.2%}")

df_prob5_inv

Criteria: Drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry
Acceptance rate for criteria: 71.32%
Acceptance rate for inverse criteria: 33.33%


Unnamed: 0,destination,passenger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
8431,No Urgent Place,Kid(s),Rainy,55,10PM,Bar,1d,Male,41,Married partner,...,never,never,1~3,never,1,1,0,0,1,0
8432,No Urgent Place,Kid(s),Snowy,30,6PM,Bar,1d,Male,41,Married partner,...,never,never,1~3,never,1,1,0,0,1,0
9565,Home,Kid(s),Sunny,30,6PM,Bar,2h,Male,31,Married partner,...,less1,less1,1~3,less1,1,1,0,0,1,1
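The comparison group above contains only three rows, so its 33.33% rate rests on very little data. One hedged way to keep this visible is to report the group size next to every rate and to compare against all remaining drivers. The sketch below does so on made-up rows using only two of the three criteria; `meets` and `group` are names chosen for this example.

```python
import pandas as pd

# Made-up rows, not taken from the survey.
toy = pd.DataFrame({
    "Bar": ["1~3", "4~8", "never", "less1", "gt8"],
    "passenger": ["Alone", "Friend(s)", "Kid(s)", "Alone", "Kid(s)"],
    "Y": [1, 1, 0, 0, 1],
})

# True where the driver goes to bars more than once a month and has no kid passenger.
meets = ~toy["Bar"].isin(["never", "less1"]) & (toy["passenger"] != "Kid(s)")

# Label each row, then compute the rate and size of both groups in one pass.
group = meets.map({True: "target", False: "others"})
summary = toy.groupby(group)["Y"].agg(rate="mean", n="size")
print(summary)
```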


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [17]:
bar_data['age'].value_counts()

age
21         417
26         395
31         339
50plus     283
36         209
41         178
46         109
below21     87
Name: count, dtype: int64

In [18]:
freqs_6a = ['never', 'less1']
passenger_6a = 'Kid(s)'
marital_6a = 'Widowed'

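# Target group: all three criteria hold.
df_prob6a = bar_data.query("Bar not in @freqs_6a and passenger != @passenger_6a and maritalStatus != @marital_6a")
# Comparison group: all three criteria negated at once (not the complement of df_prob6a).
df_prob6a_inv = bar_data.query("Bar in @freqs_6a and passenger == @passenger_6a and maritalStatus == @marital_6a")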

prob6a_acceptance = df_prob6a['Y'].mean()

print(f'Criteria: Drivers who go to bars more than once a month, had passengers that were not a kid, and were not widowed')
print(f"Acceptance rate for criteria: {prob6a_acceptance:.2%}")
print(f"The inverse criteria has {df_prob6a_inv.shape[0]} observations")

Criteria: Drivers who go to bars more than once a month, had passengers that were not a kid, and were not widowed
Acceptance rate for criteria: 71.32%
The inverse criteria has 0 observations


In [19]:
freqs_6b = ['never', 'less1']
ages_6b = ['below21', '21', '26']

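# Target group: both criteria hold. The '26' label covers ages 26 to 30.
df_prob6b = bar_data.query("Bar not in @freqs_6b and age in @ages_6b")
# Comparison group: both criteria negated at once (not the complement of df_prob6b).
df_prob6b_inv = bar_data.query("Bar in @freqs_6b and age not in @ages_6b")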

prob6b_acceptance = df_prob6b['Y'].mean()
prob6b_inv_acceptance = df_prob6b_inv['Y'].mean()

print(f'Criteria: Drivers who go to bars more than once a month and are under the age of 30')
print(f"Acceptance rate for criteria: {prob6b_acceptance:.2%}")
print(f"Acceptance rate for inverse criteria: {prob6b_inv_acceptance:.2%}")

Criteria: Drivers who go to bars more than once a month and are under the age of 30
Acceptance rate for criteria: 72.17%
Acceptance rate for inverse criteria: 26.07%
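The third criterion in the prompt (cheap restaurants more than 4 times a month and income under 50K) is not computed above. A minimal sketch of how it might be expressed follows, on made-up rows. The exact income strings are an assumption about how the file labels its brackets and should be checked against `bar_data['income'].unique()`; reading "more than 4 times" as the `4~8` and `gt8` buckets is also an assumption.

```python
import pandas as pd

# Made-up rows. The income strings are assumed labels, not verified against the file.
toy = pd.DataFrame({
    "RestaurantLessThan20": ["4~8", "gt8", "1~3", "gt8"],
    "income": ["$25000 - $37499", "$100000 or More", "Less than $12500", "$37500 - $49999"],
    "Y": [1, 0, 0, 1],
})

frequent = ["4~8", "gt8"]  # assumed reading of "more than 4 times a month"
under_50k = ["Less than $12500", "$12500 - $24999", "$25000 - $37499", "$37500 - $49999"]

# True where both parts of the criterion hold.
is_6c = toy["RestaurantLessThan20"].isin(frequent) & toy["income"].isin(under_50k)

print(f"Acceptance rate for criteria: {toy.loc[is_6c, 'Y'].mean():.2%}")
print(f"Acceptance rate for all others: {toy.loc[~is_6c, 'Y'].mean():.2%}")
```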


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [20]:
ans_7 = 'Drivers between the ages of 21 and 30 who are not widowed and who go to bars more than once a month are more likely to accept bar coupons.'

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of drivers who accept the coupons.  

In [21]:
coffee_data = data_cleaned[data_cleaned['coupon'] == 'Coffee House']
coffee_data.sample(10)

Unnamed: 0,destination,passenger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
4743,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,50plus,Single,...,never,less1,4~8,less1,1,1,0,0,1,0
905,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,31,Unmarried partner,...,never,4~8,never,never,1,1,0,0,1,1
6504,No Urgent Place,Kid(s),Sunny,55,2PM,Coffee House,2h,Male,36,Unmarried partner,...,4~8,4~8,1~3,1~3,1,1,0,0,1,1
3733,Work,Alone,Sunny,55,7AM,Coffee House,2h,Female,50plus,Single,...,less1,1~3,1~3,1~3,1,1,0,0,1,0
2429,Work,Alone,Sunny,55,7AM,Coffee House,2h,Female,21,Single,...,4~8,4~8,4~8,less1,1,1,0,0,1,0
3441,Home,Alone,Sunny,80,6PM,Coffee House,2h,Male,26,Single,...,less1,1~3,4~8,1~3,1,0,0,0,1,0
9922,No Urgent Place,Alone,Sunny,80,10AM,Coffee House,2h,Female,21,Single,...,gt8,gt8,gt8,4~8,1,1,0,0,1,1
7341,Home,Alone,Snowy,30,6PM,Coffee House,1d,Male,36,Unmarried partner,...,less1,1~3,1~3,less1,1,1,0,0,1,1
11003,Home,Alone,Rainy,55,10PM,Coffee House,2h,Female,36,Single,...,less1,1~3,less1,less1,1,0,0,1,0,0
4351,No Urgent Place,Partner,Sunny,80,10AM,Coffee House,1d,Female,below21,Single,...,never,4~8,4~8,never,1,0,0,0,1,0


In [22]:
coffee_acceptance = coffee_data['Y'].mean()
print(f"Acceptance rate for coffee coupons: {coffee_acceptance:.2%}")

Acceptance rate for coffee coupons: 49.92%


In [23]:
coffee_freqs = coffee_data['CoffeeHouse'].value_counts()
coffee_ages = coffee_data['age'].value_counts()
coffee_marital = coffee_data['maritalStatus'].value_counts()
coffee_time = coffee_data['time'].value_counts()
coffee_weather = coffee_data['weather'].value_counts()
coffee_carryaway = coffee_data['CarryAway'].value_counts()
coffee_occupation = coffee_data['occupation'].value_counts()

print(f'{coffee_freqs}\n')
print(f'{coffee_ages}\n')
print(f'{coffee_marital}\n')
print(f'{coffee_time}\n')
print(f'{coffee_weather}\n')
print(f'{coffee_carryaway}\n')
print(f'{coffee_occupation}')

CoffeeHouse
less1    1075
1~3      1042
never     999
4~8       538
gt8       342
Name: count, dtype: int64

age
21         883
26         843
31         623
50plus     545
36         402
41         325
46         220
below21    155
Name: count, dtype: int64

maritalStatus
Single               1550
Married partner      1541
Unmarried partner     717
Divorced              151
Widowed                37
Name: count, dtype: int64

time
6PM     1093
7AM      913
10AM     899
2PM      794
10PM     297
Name: count, dtype: int64

weather
Sunny    3467
Snowy     303
Rainy     226
Name: count, dtype: int64

CarryAway
1~3      1496
4~8      1341
less1     574
gt8       485
never     100
Name: count, dtype: int64

occupation
Unemployed                                   570
Student                                      499
Computer & Mathematical                      449
Sales & Related                              355
Management                                   298
Education&Training&Library      

In [24]:
times = ['7AM', '10AM']
weather_a = 'Sunny'

# Before 2PM, Sunny vs. Non-Sunny
df_a11 = coffee_data.query("time in @times and weather == @weather_a")
df_a12 = coffee_data.query("time in @times and weather != @weather_a")

# 2PM or later, Sunny vs. Non-Sunny
df_a21 = coffee_data.query("time not in @times and weather == @weather_a")
df_a22 = coffee_data.query("time not in @times and weather != @weather_a")

a11_acceptance = df_a11['Y'].mean()
a12_acceptance = df_a12['Y'].mean()
a21_acceptance = df_a21['Y'].mean()
a22_acceptance = df_a22['Y'].mean()

print(f"Acceptance rate for sunny weather before 2 PM: {a11_acceptance:.2%}")
print(f"Acceptance rate for non-sunny weather before 2 PM: {a12_acceptance:.2%}")
print(f"Acceptance rate for sunny weather at 2 PM or later: {a21_acceptance:.2%}")
print(f"Acceptance rate for non-sunny weather at 2 PM or later: {a22_acceptance:.2%}")

Acceptance rate for sunny weather before 2 PM: 52.57%
Acceptance rate for non-sunny weather before 2 PM: 69.66%
Acceptance rate for sunny weather at 2 PM or later: 48.39%
Acceptance rate for non-sunny weather at 2 PM or later: 35.61%
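The same two-by-two comparison can be produced in a single call by deriving the two binary factors and letting `pivot_table` average `Y` in each cell. The sketch below uses made-up rows; the `period` and `sky` columns are hypothetical helpers introduced for this example.

```python
import pandas as pd

# Made-up rows, not taken from the survey.
toy = pd.DataFrame({
    "time": ["7AM", "10AM", "2PM", "6PM", "7AM", "10PM"],
    "weather": ["Sunny", "Rainy", "Sunny", "Snowy", "Sunny", "Sunny"],
    "Y": [1, 1, 0, 0, 0, 1],
})

# Derive readable labels for the two factors.
toy["period"] = toy["time"].isin(["7AM", "10AM"]).map(
    {True: "before 2PM", False: "2PM or later"}
)
toy["sky"] = (toy["weather"] == "Sunny").map({True: "sunny", False: "not sunny"})

# Mean of Y in each (period, sky) cell is the acceptance rate for that cell.
rates = toy.pivot_table(index="period", columns="sky", values="Y", aggfunc="mean")
print(rates)
```

Passing `aggfunc="size"` instead would show how many responses sit in each cell, which matters here because non-sunny scenarios are a small share of the coffee coupons.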


In [25]:
ages_b = ['below21', '21', '26']
marital_b = ['Married partner', 'Unmarried partner']

# Older than 30, with vs. without partner
df_b11 = coffee_data.query("age not in @ages_b and maritalStatus in @marital_b")
df_b12 = coffee_data.query("age not in @ages_b and maritalStatus not in @marital_b")

# Younger than 30, with vs. without partner
df_b21 = coffee_data.query("age in @ages_b and maritalStatus in @marital_b")
df_b22 = coffee_data.query("age in @ages_b and maritalStatus not in @marital_b")

b11_acceptance = df_b11['Y'].mean()
b12_acceptance = df_b12['Y'].mean()
b21_acceptance = df_b21['Y'].mean()
b22_acceptance = df_b22['Y'].mean()

print(f"Acceptance rate for people older than 30 with a partner: {b11_acceptance:.2%}")
print(f"Acceptance rate for people older than 30 without a partner: {b12_acceptance:.2%}")
print(f"Acceptance rate for people younger than 30 with a partner: {b21_acceptance:.2%}")
print(f"Acceptance rate for people younger than 30 without a partner: {b22_acceptance:.2%}")

Acceptance rate for people older than 30 with a partner: 46.29%
Acceptance rate for people older than 30 without a partner: 47.80%
Acceptance rate for people younger than 30 with a partner: 52.93%
Acceptance rate for people younger than 30 without a partner: 53.85%
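Several of the gaps above are only a point or two wide, so it is worth asking whether they could be sampling noise. One common check, offered here as a sketch rather than as part of the assignment, is a two-proportion z-test under the normal approximation. The function name and the counts below are hypothetical; real counts would come from the group sizes in `coffee_data`.

```python
import math

def two_proportion_z(accepted_a, n_a, accepted_b, n_b):
    """Return the z statistic and two-sided p-value for the difference p_a - p_b."""
    p_a, p_b = accepted_a / n_a, accepted_b / n_b
    # Pooled acceptance rate under the null hypothesis of equal proportions.
    pooled = (accepted_a + accepted_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution, via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not taken from the survey.
z, p = two_proportion_z(accepted_a=530, n_a=1000, accepted_b=540, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A large p-value would suggest the observed difference is compatible with chance, while a small one would support treating the two groups as genuinely different.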
