### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town when a coupon for a restaurant near your route is delivered to your cell phone. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver at all, but what factors determine whether a driver accepts the coupon once it has been delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks each respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons -- less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 - \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical location of the user, destination, and the venue, and we mark the distance between each two places with time of driving. The user can see whether the venue is in the same direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - Time before it expires: 2 hours or one day

In [450]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [453]:
data = pd.read_csv('data/coupons.csv')

In [455]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,has_children,education,occupation,income,car,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [458]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

In [460]:
# Count null/NaN values of each column
data.isna().sum()

destination                 0
passanger                   0
weather                     0
temperature                 0
time                        0
coupon                      0
expiration                  0
gender                      0
age                         0
maritalStatus               0
has_children                0
education                   0
occupation                  0
income                      0
car                     12576
Bar                       107
CoffeeHouse               217
CarryAway                 151
RestaurantLessThan20      130
Restaurant20To50          189
toCoupon_GEQ5min            0
toCoupon_GEQ15min           0
toCoupon_GEQ25min           0
direction_same              0
direction_opp               0
Y                           0
dtype: int64
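
To judge how much each gap matters, the counts above can be expressed as a share of the 12,684 rows. The following is a minimal sketch that rebuilds the non-zero counts by hand from the output above; the names `missing_counts`, `total_rows`, and `missing_pct` are illustrative and not part of the original notebook. On the live `data` frame, `data.isna().mean()` should give the same shares directly.

```python
import pandas as pd

# Non-zero missing-value counts copied from the isna().sum() output above
missing_counts = pd.Series({
    "car": 12576,
    "Bar": 107,
    "CoffeeHouse": 217,
    "CarryAway": 151,
    "RestaurantLessThan20": 130,
    "Restaurant20To50": 189,
})
total_rows = 12684  # row count reported by data.info()

# Share of rows missing in each column, as a percentage
missing_pct = (missing_counts / total_rows * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

Seen this way, `car` is missing in almost every row, while each of the five frequency columns is missing in only a small fraction of rows.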

In [462]:
# Explore value set of each column

for col in data.columns:
    print(data[col].value_counts())
    print("-----------------------------------")

destination
No Urgent Place    6283
Home               3237
Work               3164
Name: count, dtype: int64
-----------------------------------
passanger
Alone        7305
Friend(s)    3298
Partner      1075
Kid(s)       1006
Name: count, dtype: int64
-----------------------------------
weather
Sunny    10069
Snowy     1405
Rainy     1210
Name: count, dtype: int64
-----------------------------------
temperature
80    6528
55    3840
30    2316
Name: count, dtype: int64
-----------------------------------
time
6PM     3230
7AM     3164
10AM    2275
2PM     2009
10PM    2006
Name: count, dtype: int64
-----------------------------------
coupon
Coffee House             3996
Restaurant(<20)          2786
Carry out & Take away    2393
Bar                      2017
Restaurant(20-50)        1492
Name: count, dtype: int64
-----------------------------------
expiration
1d    7091
2h    5593
Name: count, dtype: int64
-----------------------------------
gender
Female    6511
Male      6173
Name:

In [464]:
# Find abnormal data in toCoupon_GEQxxmin columns
# The first count is valid rows (>= 25 min implies >= 15 min).
# The last two counts are inconsistent rows; both are zero, so these columns are valid.

print(len(data[(data["toCoupon_GEQ15min"] == 1) & (data["toCoupon_GEQ25min"] == 1)]))
print(len(data[(data["toCoupon_GEQ15min"] == 0) & (data["toCoupon_GEQ25min"] == 1)]))
print(len(data[(data["toCoupon_GEQ5min"] == 0) & ((data["toCoupon_GEQ15min"] == 1) | (data["toCoupon_GEQ25min"] == 1))]))

1511
0
0


In [466]:
# Check for abnormal data in direction to coupon
# No abnormal data found

print(len(data[(data["direction_same"] ^ data["direction_opp"]) == 0]))

0


In [468]:
# Check for distribution of missing data after dropping "car" column

tmp = data.drop(columns=["car"])
tmp = tmp[tmp.isna().any(axis=1)]
print(f"{len(tmp)} rows with missing value")
print("-----------------------------------")

for col in tmp.columns:
    print(tmp[col].value_counts())
    print("-----------------------------------")

605 rows with missing value
-----------------------------------
destination
No Urgent Place    313
Home               152
Work               140
Name: count, dtype: int64
-----------------------------------
passanger
Alone        336
Friend(s)    150
Kid(s)        68
Partner       51
Name: count, dtype: int64
-----------------------------------
weather
Sunny    468
Snowy     74
Rainy     63
Name: count, dtype: int64
-----------------------------------
temperature
80    306
55    178
30    121
Name: count, dtype: int64
-----------------------------------
time
6PM     152
7AM     140
10AM    117
10PM    103
2PM      93
Name: count, dtype: int64
-----------------------------------
coupon
Coffee House             180
Restaurant(<20)          133
Carry out & Take away    113
Bar                      104
Restaurant(20-50)         75
Name: count, dtype: int64
-----------------------------------
expiration
1d    331
2h    274
Name: count, dtype: int64
-----------------------------------
gender

### Conclusion

#### Missing data

    - The car column will be dropped since it contains mostly NA values.
    - After dropping the car column, 605 rows (~5% of the dataset) still contain missing values. Those rows are spread across the other columns in roughly the same proportions as the full dataset, so we will drop them.
    
#### Convert to numerical data

    - The time column will be converted to a numerical value in 24-hour format, e.g., 10AM -> 10.0, 10PM -> 22.0.
    - In the age column, we replace "below21" with 20 and "50plus" with 50, then convert the column to a numerical type.
    - Add a new income_low_bound column holding the numerical lower bound of each income bracket, e.g., "Less than $12500" -> 0 and "$100000 or More" -> 100000.
      
#### Redundant columns

    - toCoupon_GEQ5min, toCoupon_GEQ15min, and toCoupon_GEQ25min will be merged into a single time_to_coupon column with numerical values 5, 15, or 25.
    - We'll drop the direction_opp column since it is always the complement of direction_same and carries no extra information.
   

3. Decide what to do about your missing data -- drop, replace, other...

In [553]:
# Make a copy of data into df
df = data.copy()


In [555]:
# Drop "car" column since it mostly contains null values
df.drop(columns=["car"], inplace=True)


In [557]:
# Drop rows with missing values
df.dropna(inplace=True)


In [559]:
# Convert time column into 24-hour numerical value
def convert_time(v):
    period = v[-2:].upper()     # "AM" or "PM"
    hour = float(v[:-2]) % 12   # maps 12 to 0 so 12AM -> 0.0 and 12PM -> 12.0
    if period == "PM":
        hour += 12
    return hour

df["time"] = df["time"].apply(convert_time)
df["time"].value_counts()

time
18.0    3078
7.0     3024
10.0    2158
14.0    1916
22.0    1903
Name: count, dtype: int64

In [561]:
# Convert age to float values
def convert_age(v):
    r = v
    if v == "50plus": r = "50"
    if v == "below21": r = "20"
    return float(r)

df["age"] = df["age"].apply(convert_age)
df["age"].value_counts()

age
21.0    2537
26.0    2399
31.0    1925
50.0    1732
36.0    1253
41.0    1065
46.0     664
20.0     504
Name: count, dtype: int64

In [563]:
# Add income_low_bound column
def convert_income(v):
    r = ""
    if v == "Less than $12500":
        r = "0"
    else:
        r = v[1:v.index(' ')]

    return float(r)

df["income_low_bound"] = df["income"].apply(convert_income)
df[["income_low_bound", "income"]].value_counts()

income_low_bound  income          
25000.0           $25000 - $37499     1919
12500.0           $12500 - $24999     1728
100000.0          $100000 or More     1692
37500.0           $37500 - $49999     1689
50000.0           $50000 - $62499     1565
0.0               Less than $12500    1014
62500.0           $62500 - $74999      840
87500.0           $87500 - $99999      818
75000.0           $75000 - $87499      814
Name: count, dtype: int64

In [567]:
# Add time_to_coupon column
def calc_time_to_coupon(v):       
    if v["toCoupon_GEQ25min"] == 1:
        return 25.0
    elif v["toCoupon_GEQ15min"] == 1:
        return 15.0
    else:
        return 5.0

df["time_to_coupon"] = df[["toCoupon_GEQ5min", "toCoupon_GEQ15min", "toCoupon_GEQ25min"]].apply(calc_time_to_coupon, axis=1)
df[["toCoupon_GEQ5min", "toCoupon_GEQ15min", "toCoupon_GEQ25min", "time_to_coupon"]].value_counts()


Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,has_children,education,occupation,income,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y,income_low_bound,time_to_coupon
22,No Urgent Place,Alone,Sunny,55,14.0,Restaurant(<20),1d,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,1,0,0,0,1,1,62500.0,5.0
23,No Urgent Place,Friend(s),Sunny,80,10.0,Coffee House,2h,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,1,0,0,0,1,0,62500.0,5.0
24,No Urgent Place,Friend(s),Sunny,80,10.0,Bar,1d,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,1,0,0,0,1,1,62500.0,5.0
25,No Urgent Place,Friend(s),Sunny,80,10.0,Carry out & Take away,2h,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,1,1,0,0,1,0,62500.0,15.0
26,No Urgent Place,Friend(s),Sunny,80,14.0,Coffee House,1d,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,1,0,0,0,1,0,62500.0,5.0
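
The row-wise `apply` above can also be written as plain column arithmetic. This is only a sketch on a toy frame (the `flags` rows are hypothetical), and it relies on the earlier check that a row flagged `toCoupon_GEQ25min` is always flagged `toCoupon_GEQ15min` as well.

```python
import pandas as pd

# Toy frame with the three valid flag combinations (hypothetical rows)
flags = pd.DataFrame({
    "toCoupon_GEQ5min":  [1, 1, 1],
    "toCoupon_GEQ15min": [0, 1, 1],
    "toCoupon_GEQ25min": [0, 0, 1],
})

# Start at 5 minutes and add 10 for each further threshold that is met
time_to_coupon = (
    5 + 10 * flags["toCoupon_GEQ15min"] + 10 * flags["toCoupon_GEQ25min"]
).astype(float)
print(time_to_coupon.tolist())
```

The vectorized form avoids calling a Python function once per row, which tends to matter more as the frame grows.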


In [569]:
# Drop redundant columns
df.drop(columns=["toCoupon_GEQ5min", "toCoupon_GEQ15min", "toCoupon_GEQ25min", "direction_opp"], inplace=True)


In [573]:
df.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,has_children,education,occupation,income,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,direction_same,Y,income_low_bound,time_to_coupon
22,No Urgent Place,Alone,Sunny,55,14.0,Restaurant(<20),1d,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,0,1,62500.0,5.0
23,No Urgent Place,Friend(s),Sunny,80,10.0,Coffee House,2h,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,0,0,62500.0,5.0
24,No Urgent Place,Friend(s),Sunny,80,10.0,Bar,1d,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,0,1,62500.0,5.0
25,No Urgent Place,Friend(s),Sunny,80,10.0,Carry out & Take away,2h,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,0,0,62500.0,15.0
26,No Urgent Place,Friend(s),Sunny,80,14.0,Coffee House,1d,Male,21.0,Single,0,Bachelors degree,Architecture & Engineering,$62500 - $74999,never,less1,4~8,4~8,less1,0,0,62500.0,5.0


4. What proportion of the total observations chose to accept the coupon? 



In [638]:
df.groupby(["coupon", "Y"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,destination,passanger,weather,temperature,time,expiration,gender,age,maritalStatus,has_children,education,occupation,income,Bar,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,direction_same,income_low_bound,time_to_coupon
coupon,Y,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Bar,0,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125
Bar,1,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788,788
Carry out & Take away,0,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598,598
Carry out & Take away,1,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682,1682
Coffee House,0,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922,1922
Coffee House,1,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894,1894
Restaurant(20-50),0,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785,785
Restaurant(20-50),1,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632,632
Restaurant(<20),0,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772,772
Restaurant(<20),1,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881,1881
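
The grouped counts above hold everything needed to answer the question, but they do not state the proportion itself. A minimal sketch follows, with the counts copied by hand from the table above into a small frame; `counts`, `overall_rate`, and `per_coupon_rate` are illustrative names. On the live notebook, `df["Y"].mean()` should give the overall figure directly.

```python
import pandas as pd

# Row counts per (coupon, Y) copied from the groupby output above
counts = pd.DataFrame(
    {
        "rejected": [1125, 598, 1922, 785, 772],
        "accepted": [788, 1682, 1894, 632, 1881],
    },
    index=["Bar", "Carry out & Take away", "Coffee House",
           "Restaurant(20-50)", "Restaurant(<20)"],
)

# Overall proportion of observations with Y == 1
overall_rate = counts["accepted"].sum() / counts.to_numpy().sum()
print(f"Overall acceptance rate: {overall_rate:.3f}")

# Acceptance rate within each coupon type
per_coupon_rate = counts["accepted"] / counts.sum(axis=1)
print(per_coupon_rate.round(3))
```

With these counts, 6,877 of the 12,079 remaining observations accepted the coupon, which is roughly 57%.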


5. Use a bar plot to visualize the `coupon` column.

6. Use a histogram to visualize the temperature column.
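
One possible way to approach prompts 5 and 6 is sketched below with plain matplotlib. To keep the example self-contained, it rebuilds the data from the `value_counts()` output of the raw `data` frame shown earlier, so the numbers are from before any rows were dropped. In the notebook itself, something like `sns.countplot(data=df, x="coupon")` and `sns.histplot(data=df, x="temperature")` would be a natural alternative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Counts copied from the value_counts() output of the raw data above
coupon_counts = pd.Series({
    "Coffee House": 3996, "Restaurant(<20)": 2786,
    "Carry out & Take away": 2393, "Bar": 2017, "Restaurant(20-50)": 1492,
})
temperature_counts = {30: 2316, 55: 3840, 80: 6528}

fig, (ax_coupon, ax_temp) = plt.subplots(1, 2, figsize=(12, 4))

# Bar plot: one bar per coupon type
ax_coupon.bar(coupon_counts.index, coupon_counts.values)
ax_coupon.set_title("Observations per coupon type")
ax_coupon.set_ylabel("Count")
ax_coupon.tick_params(axis="x", rotation=45)

# Histogram: expand the counts back into one value per observation
temperatures = np.repeat(
    list(temperature_counts.keys()), list(temperature_counts.values())
)
# Bin edges chosen so that each of the three temperatures gets its own bin
ax_temp.hist(temperatures, bins=[17.5, 42.5, 67.5, 92.5], edgecolor="white")
ax_temp.set_title("Temperature (F)")
ax_temp.set_xlabel("Temperature")
ax_temp.set_ylabel("Count")

fig.tight_layout()
```

Because temperature takes only three values, the bin edges are placed midway between them; default bins would leave visually confusing gaps.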

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar-related coupons.

1. Create a new `DataFrame` that contains just the bar coupons.


2. What proportion of bar coupons were accepted?


3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to that of all others. Is there a difference?


5. Use the same process to compare the acceptance rate of drivers who go to bars more than once a month, had passengers that were not a kid, and had occupations other than farming, fishing, or forestry to that of all others.


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 
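
The comparisons in prompts 3 to 6 all follow the same pattern: build a boolean mask for the group of interest, then compare the mean of `Y` inside and outside the mask. The sketch below shows that pattern on a few hypothetical rows. The helper `compare_acceptance` and the frame `bar_df` are illustrative, the label `gt8` is assumed to be how the dataset encodes "greater than 8", and treating `1~3` as "more than once a month" is a judgment call.

```python
import pandas as pd

def compare_acceptance(frame, mask):
    """Return (rate inside the mask, rate for everyone else) for column Y."""
    return frame.loc[mask, "Y"].mean(), frame.loc[~mask, "Y"].mean()

# Hypothetical bar-coupon rows, for illustration only
bar_df = pd.DataFrame({
    "Bar": ["never", "less1", "1~3", "4~8", "gt8", "1~3"],
    "age": [21.0, 26.0, 31.0, 46.0, 50.0, 20.0],
    "passanger": ["Alone", "Kid(s)", "Friend(s)", "Partner", "Alone", "Friend(s)"],
    "Y": [0, 0, 1, 1, 1, 0],
})

# Goes to a bar more than once a month (assumed labels) and is over 25
frequent = bar_df["Bar"].isin(["1~3", "4~8", "gt8"])
over_25 = bar_df["age"] > 25
in_group, others = compare_acceptance(bar_df, frequent & over_25)
print(f"group: {in_group:.2f}, others: {others:.2f}")
```

The *OR* conditions in prompt 6 can be expressed the same way by combining masks with `|` before passing the result to the helper.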



7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of drivers who accept the coupons.