### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaraunt near where you are driving. Would you accept that coupon and take a short detour to the restaraunt? Would you accept the coupon but use it on a sunbsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaraunt? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor on whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios including the destination, current time, weather, passenger, etc., and then ask the person whether he will accept the coupon if he is the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’.  There are five different types of coupons -- less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50). 

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons.  To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public facing github repository as your first portfolio piece. 





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [158]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [159]:
data = pd.read_csv('data/coupons.csv')

In [160]:
# Print info on dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

In [161]:
# Print first 5 rows
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [162]:
# Determine how many values are 'NaN'
data.isnull().sum()

destination                 0
passanger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64

In [163]:
# check values for car where not NaN
data['car'].notnull().value_counts()

car
False    12576
True       108
Name: count, dtype: int64

In [164]:
# determine percentage of missing data
(data.isnull().sum() / len(data)) * 100

destination              0.000000
passanger                0.000000
weather                  0.000000
temperature              0.000000
time                     0.000000
coupon                   0.000000
expiration               0.000000
gender                   0.000000
age                      0.000000
maritalStatus            0.000000
has_children             0.000000
education                0.000000
occupation               0.000000
income                   0.000000
car                     99.148534
Bar                      0.843582
CoffeeHouse              1.710817
CarryAway                1.190476
RestaurantLessThan20     1.024913
Restaurant20To50         1.490066
toCoupon_GEQ5min         0.000000
toCoupon_GEQ15min        0.000000
toCoupon_GEQ25min        0.000000
direction_same           0.000000
direction_opp            0.000000
Y                        0.000000
dtype: float64

3. Decide what to do about your missing data -- drop, replace, other...

In [165]:
# 99+% of car data is missing - drop it
# <= 2% of Bar, CoffeeHouse, CarryAway, RestaurantLessThan20 and Restaurant20To50 is missing - replace NaN with Unknown
data_ = data.drop(columns=['car'])
data_['Bar'] = data_['Bar'].fillna('Uknown')
data_['CoffeeHouse'] = data_['CoffeeHouse'].fillna('Uknown')
data_['CarryAway'] = data_['CarryAway'].fillna('Uknown')
data_['RestaurantLessThan20'] = data_['RestaurantLessThan20'].fillna('Uknown')
data_['Restaurant20To50'] = data_['Restaurant20To50'].fillna('Uknown')

In [166]:
# recheck for percentage of missing data - expected value is 0 for all columns
(data_.isnull().sum() / len(data)) * 100

destination             0.0
passanger               0.0
weather                 0.0
temperature             0.0
time                    0.0
coupon                  0.0
expiration              0.0
gender                  0.0
age                     0.0
maritalStatus           0.0
has_children            0.0
education               0.0
occupation              0.0
income                  0.0
Bar                     0.0
CoffeeHouse             0.0
CarryAway               0.0
RestaurantLessThan20    0.0
Restaurant20To50        0.0
toCoupon_GEQ5min        0.0
toCoupon_GEQ15min       0.0
toCoupon_GEQ25min       0.0
direction_same          0.0
direction_opp           0.0
Y                       0.0
dtype: float64

4. What proportion of the total observations chose to accept the coupon? 



In [167]:
# Check values for column Y
data_['Y'].value_counts()

Y
1    7210
0    5474
Name: count, dtype: int64

In [168]:
# Given values for Y are either 0 or 1, proportion of total obervations that chose to accept the coupon can be
# calculated as mean of the column.
print("Proportion of coupons accepted:", data_['Y'].mean())
# 56.84 % of observations have accepted the coupon

Proportion of coupons accepted: 0.5684326710816777


5. Use a bar plot to visualize the `coupon` column.

In [169]:
# Bar plot to visualize the coupon column
px.bar(data_, x="coupon", color="Y", title="Coupon usage by category count", color_continuous_scale="viridis")

6. Use a histogram to visualize the temperature column.

In [170]:
# Histogram to visualize the temperature column
px.histogram(data_, x="temperature", nbins=10, title="Distribution of Temperature")

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [171]:
# Filter the DataFrame to include only rows where the coupon column is 'Bar'
data_bar = data_[data['coupon'] == 'Bar']

# Display the first few rows of the new DataFrame to verify
print(data_bar.head())

        destination  passanger weather  temperature  time coupon expiration  \
9   No Urgent Place     Kid(s)   Sunny           80  10AM    Bar         1d   
13             Home      Alone   Sunny           55   6PM    Bar         1d   
17             Work      Alone   Sunny           55   7AM    Bar         1d   
24  No Urgent Place  Friend(s)   Sunny           80  10AM    Bar         1d   
35             Home      Alone   Sunny           55   6PM    Bar         1d   

    gender age      maritalStatus  ...  CoffeeHouse CarryAway  \
9   Female  21  Unmarried partner  ...        never    Uknown   
13  Female  21  Unmarried partner  ...        never    Uknown   
17  Female  21  Unmarried partner  ...        never    Uknown   
24    Male  21             Single  ...        less1       4~8   
35    Male  21             Single  ...        less1       4~8   

   RestaurantLessThan20 Restaurant20To50 toCoupon_GEQ5min toCoupon_GEQ15min  \
9                   4~8              1~3               

2. What proportion of bar coupons were accepted?


In [172]:
print("Proportion of bar coupons accepted:", data_bar['Y'].mean())
# 41% of bar coupons accepted

Proportion of bar coupons accepted: 0.41001487357461575


3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [173]:
data_bar['Bar'].value_counts()

Bar
never     830
less1     570
1~3       397
4~8       150
gt8        49
Uknown     21
Name: count, dtype: int64

In [174]:
# Categorize based on the 'Bar' frequency
# Assuming the 'Bar' column exists and categorizes frequency of visits
data_bar_lt_three = data_bar[data_bar['Bar'].isin(['never', 'less than once a month', '1–3 times a month'])]
data_bar_gt_three = data_bar[~data_bar['Bar'].isin(['never', 'less than once a month', '1–3 times a month'])]

print("Acceptance rate for those who visit a bar 3 or fewer times a month:", data_bar_lt_three['Y'].mean())
print("Acceptance rate for those who visit a bar more than 3 times a month:", data_bar_gt_three['Y'].mean())


Acceptance rate for those who visit a bar 3 or fewer times a month: 0.18795180722891566
Acceptance rate for those who visit a bar more than 3 times a month: 0.565290648694187


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to the all others.  Is there a difference?


In [175]:
# convert age to int for comparision operations
data_['age_numeric'] = pd.to_numeric(data_['age'], errors='coerce')

# population who go to the Bar more than once a month and are older than 25 years
data_bar_gt_one_age_gt_25 = data_[(data_['Bar'].isin(['2~3', '4~8', 'gt8'])) & (data_['age_numeric'] > 25)]
# All other population (based on the previous filter)
data_bar_all_others = data_[~((data_['Bar'].isin(['2~3', '4~8', 'gt8'])) & (data_['age_numeric'] > 25))]

# Calculate the acceptance rate for each group
acceptance_rate_group_1 = data_bar_gt_one_age_gt_25['Y'].mean()
acceptance_rate_group_2 = data_bar_all_others['Y'].mean()

print(f"Acceptance rate for drivers who go to a bar more than once a month and are over 25: {acceptance_rate_group_1}")
print(f"Acceptance rate for all other drivers: {acceptance_rate_group_2}")


Acceptance rate for drivers who go to a bar more than once a month and are over 25: 0.6261160714285714
Acceptance rate for all other drivers: 0.5640481845945029


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry. 


In [176]:
data_['passanger'].value_counts()

passanger
Alone        7305
Friend(s)    3298
Partner      1075
Kid(s)       1006
Name: count, dtype: int64

In [177]:
data_['occupation'].value_counts()

occupation
Unemployed                                   1870
Student                                      1584
Computer & Mathematical                      1408
Sales & Related                              1093
Education&Training&Library                    943
Management                                    838
Office & Administrative Support               639
Arts Design Entertainment Sports & Media      629
Business & Financial                          544
Retired                                       495
Food Preparation & Serving Related            298
Healthcare Practitioners & Technical          244
Healthcare Support                            242
Community & Social Services                   241
Legal                                         219
Transportation & Material Moving              218
Architecture & Engineering                    175
Personal Care & Service                       175
Protective Service                            175
Life Physical Social Science           

In [178]:
# Step 1: Filter drivers who go to bars more than once a month
data_bar_gt_one = data_['Bar'].isin(['1–3 times a month', '4–8 times a month', 'gt8'])

# Step 2: Filter based on passengers not being kids
# Note typo in column name
data_psgr_not_kid = data_['passanger'] != 'Kid(s)'

# Step 3: and occupation not being farming, fishing, or forestry
data_occupation_not_ffo = ~data_['occupation'].isin(['Farming Fishing & Forestry'])

# Combined above filters
data_group_filter = data_bar_gt_one & data_psgr_not_kid & data_occupation_not_ffo

# Calculate acceptance rate for the specific group
data_group_acceptance_rate = data_[data_group_filter]['Y'].mean()

# Calculate overall acceptance rate for comparison
data_overall_acceptance_rate = data_['Y'].mean()

print("Specific group acceptance rate:", data_group_acceptance_rate)
print("Overall acceptance rate:", data_overall_acceptance_rate)

# Let's also calculate the acceptance rate for population that doesn't fall into above categories
data_complementary_group_acceptance_rate = data_[~data_group_filter]['Y'].mean()
print("Complementary group acceptance rate:", data_complementary_group_acceptance_rate)

Specific group acceptance rate: 0.5714285714285714
Overall acceptance rate: 0.5684326710816777
Complementary group acceptance rate: 0.5683494044242768


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 



In [179]:
# Define filters based on conditions
# Apply filters to determine groups
# Identify coupon acceptance rates 

# Group 1 Conditions
group1_condition = (data_['Bar'].isin(['1–3 times a month', '4–8 times a month', 'gt8'])) & \
                    (data_['passanger'] != 'Kid(s)') & \
                    (data_['maritalStatus'] != 'Widowed')

# Group 2 Conditions
group2_condition = (data_['Bar'].isin(['1–3 times a month', '4–8 times a month', 'gt8'])) & \
                    (data_['age'].isin(['below21', '21', '22', '23', '24', '25', '26', '27', '28', '29']))

# Group 3 Conditions
# Assuming 'RestaurantLessThan20' is cheap restaurants
group3_condition = (data_['RestaurantLessThan20'].isin(['4–8 times a month', 'gt8'])) & \
                    (data_['income'].isin(['Less than $12500', '$12500 - $24999', '$25000 - $37499', '$37500 - $49999'])) 

# Filter the DataFrame for each group and calculate acceptance rates
group1_acceptance_rate = data_[group1_condition]['Y'].mean()
group2_acceptance_rate = data_[group2_condition]['Y'].mean()
group3_acceptance_rate = data_[group3_condition]['Y'].mean()

print("Acceptance rate for Group 1 (Bars >1/month, Not Kid, Not Widowed):", group1_acceptance_rate)
print("Acceptance rate for Group 2 (Bars >1/month, <30 years old):", group2_acceptance_rate)
print("Acceptance rate for Group 3 (Cheap Restaurants >4/month, <50K income):", group3_acceptance_rate)

Acceptance rate for Group 1 (Bars >1/month, Not Kid, Not Widowed): 0.5714285714285714
Acceptance rate for Group 2 (Bars >1/month, <30 years old): 0.583969465648855
Acceptance rate for Group 3 (Cheap Restaurants >4/month, <50K income): 0.6631892697466468


7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

Given the acceptance rates of groups we have looked at it can be observed that folks who are repeat customers for
Bars i.e., visit Bars > 1/month, and are young and/or without kids have a greater acceptance rate (57-58%). Given
the acceptance rates are not very high e.g. 80+%, we should look into other factors that may increase the
acceptance rate further.

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

In [180]:
# focus on population using coupons for Coffee House to understand potential target demographics
data_temp1 = data_[(data_['coupon'] == 'Coffee House')]
px.histogram(data_temp1, x="occupation").update_xaxes(categoryorder="total ascending")
# Noting that 'Unemployed', 'Student', 'Computer & Mathematical', 'Sales & Relateed', 'Management', 'Education&Training&Library' are the top N occupations consuming this coupon


In [181]:
# focus on the top N categories that consume Coffe House coupons to understand effect of temperature
data_temp2 = data_[((data_['coupon'] == 'Coffee House') & (data_['occupation'].isin(['Unemployed', 'Student', 'Computer & Mathematical', 'Sales & Relateed', 'Management', 'Education&Training&Library'])) & (data_['Y'] == 1))]
px.histogram(data_temp2, x="occupation", color="temperature").update_xaxes(categoryorder="total ascending")
# Coffee consumption seems higher during warm days (temperature = 80F)

In [182]:
# focus on the top N categories that consume Coffe House coupons on warm days
# Understand distribution of gender and age
data_temp3 = data_[((data_['coupon'] == 'Coffee House') & (data_['occupation'].isin(['Unemployed', 'Student', 'Computer & Mathematical', 'Sales & Relateed', 'Management', 'Education&Training&Library'])) & (data_['Y'] == 1) & (data_['temperature'] == 80))]
px.histogram(data_temp3, x="occupation", color="gender").update_xaxes(categoryorder="total ascending")
# Female with occuptaions "Unemployed" and "Education&Training&Library" consume more "Coffee House" coupons than male
# In all other occuptaions Male population consumes more "Coffee House" coupons than female

In [183]:
# focus on the top N categories that consume Coffe House coupons on warm days and are Female with occupation "Unemployed"
# Understand distribution of age
data_temp4 = data_[((data_['coupon'] == 'Coffee House') & (data_['occupation'].isin(['Unemployed'])) & (data_['Y'] == 1) & (data_['temperature'] == 80) & (data_['gender'] == 'Female'))]
px.histogram(data_temp4, x="age").update_xaxes(categoryorder="total ascending")

In [184]:
# Determine acceptance rate for "Coffee House" in population: female, under the age of 31 (not inclusive), Unemployed and on warm days, time is 10am, 2pm and 6pm, destination is "No Urgent Place"", passenger either Alone or with Friend(s), destination "No Urgent Place" and Income less than $12500
data_temp5 = data_[((data_['coupon'] == 'Coffee House') & (data_['occupation'].isin(['Unemployed'])) & (data_['temperature'] == 80) & (data_['gender'] == 'Female') & (data_['age'].isin(['21', '26'])) & data_['time'].isin(['10AM','6PM', '2PM']) & data_['passanger'].isin(['Friend(s)', 'Alone']) & (data_['destination'] == 'No Urgent Place') & (data_['income'].isin(['Less than $12500'])))]
print("Specific group acceptance rate:", data_temp5['Y'].mean())

Specific group acceptance rate: 0.9444444444444444


For the following category of population, the acceptance rate for "Coffee House" coupon is nearly 100% (94.44%).
Characteristics:
* Occupation: Unemployed
* Gender: Female
* Age: 21, 26 (Under 31)
* Temperature : 80F
* Passenger: 'Alone', 'Friend(s)'
* Destination: 'No Urgent Place'
* Income: 'Less than $12500'

This would be one good demographic to target for future 'Coffee House' coupons.

