### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver, but what factors determine whether a driver accepts the coupon once it is delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks the person whether they will accept the coupon if they are the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data, you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 - \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - Time before it expires: 2 hours or one day

In [None]:
import plotly.express as px
import pandas as pd
import numpy as np
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

pio.renderers.default = "notebook"


### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [None]:
data = pd.read_csv('data/coupons.csv')

In [None]:
data.info()

In [None]:
data.sample(5)

In [None]:
data.describe(include='all')

2. Investigate the dataset for missing or problematic data.

In [None]:
# Show columns with missing data
is_missing_cols = data.isna().sum().sort_values(ascending=False)
is_missing_cols[is_missing_cols > 0]

In [None]:
columns_with_missing_data = is_missing_cols[is_missing_cols > 0].index

In [None]:
# Show count of columns with missing data
data[columns_with_missing_data].isna().sum()

In [None]:
# Show percentage of columns with missing data
data[columns_with_missing_data].isna().mean().round(3) * 100

In [None]:
# Show most frequent responses for columns with missing data
data[columns_with_missing_data].describe()

### Missing Columns Analysis
Only six columns contain missing data: `Restaurant20To50`, `RestaurantLessThan20`, `CarryAway`, `CoffeeHouse`, `Bar`, and `car`. All six are categorical features. Every column except `car` is missing in at most `1.7%` of rows, while `car` is unavailable `99.1%` of the time. Given these findings, I will drop the `car` column even though it could be useful to know. I simply can't impute that much data, or assume with confidence that the user is in a car.

The five other columns describe the user's behavior: how often they visit each type of business. Even without visualizing the dataset, I can infer that these columns are important, and since so little of the data is missing I ought to keep them. The question now is how to impute the gaps. One naive approach is to impute the most frequent response, or to impute a response of `never` on the assumption that this is what the user meant. Either way, these would be guesses. Another option is a k-nearest-neighbors approach: find the 5 nearest neighbors of a row with missing values and impute the most frequent response among those neighbors. This is an advanced technique that I am not confident I can pull off just yet. The idea was inspired by SMOTE (Synthetic Minority Oversampling Technique), a library I have used before, which relies on nearest neighbors to create synthetic samples for an under-represented class.

For the purpose of this project I will impute the most frequent response. The table above shows the most frequent response for each column. These responses have a frequency of `25%--50%`.
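The nearest-neighbor idea described above can be sketched with plain pandas and NumPy. This is only an illustration on an invented toy frame: the `knn_mode_impute` helper, the toy rows, and the choice of Hamming distance (number of mismatching feature values) are assumptions made for the example, and the notebook itself sticks with the most frequent response.

```python
import numpy as np
import pandas as pd


def knn_mode_impute(df, target, features, k=5):
    """Fill missing `target` values with the most common value among the
    k complete rows that differ in the fewest `features` (Hamming distance)."""
    out = df.copy()
    known = out[out[target].notna()]
    known_features = known[features].to_numpy()
    for idx in out.index[out[target].isna()]:
        row = out.loc[idx, features].to_numpy()
        # Count mismatching feature values against every complete row.
        distances = (known_features != row).sum(axis=1)
        nearest = known.iloc[np.argsort(distances, kind="stable")[:k]]
        out.loc[idx, target] = nearest[target].mode().iloc[0]
    return out


# Invented toy rows for illustration, not rows from coupons.csv.
toy = pd.DataFrame({
    "age": ["21", "21", "21", "50plus", "50plus", "21"],
    "passanger": ["Friend(s)", "Friend(s)", "Alone", "Alone", "Alone", "Friend(s)"],
    "Bar": ["1~3", "1~3", "4~8", "never", "never", None],
})

imputed = knn_mode_impute(toy, target="Bar", features=["age", "passanger"], k=3)
print(imputed.loc[5, "Bar"])
```

On the real data the feature list would need more thought, since every feature counts equally under this distance.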

3. Decide what to do about your missing data -- drop, replace, other...

In [None]:
# drop car column
data = data.drop(columns=['car'])

In [None]:
# Impute rest of columns
data = data.fillna(data.mode().iloc[0])

In [None]:
# Double check if data is missing
is_missing_cols = data.isna().sum().sort_values(ascending=False)
is_missing_cols[is_missing_cols > 0].index

4. What proportion of the total observations chose to accept the coupon?



In [None]:
total_observations = data.shape[0]
total_accepted_coupon = data.query('Y == 1').count().iloc[0]
total_declined_coupon = data.query('Y == 0').count().iloc[0]
percentage_accepted_coupon = total_accepted_coupon / total_observations
percentage_declined_coupon = total_declined_coupon / total_observations
print(
    f'Total Observations: {total_observations}\nAccepted Coupon: {total_accepted_coupon}\nDeclined Coupon: {total_declined_coupon}\nAccepted Coupon Percentage: {percentage_accepted_coupon:.2%}\nDeclined Coupon Percentage: {percentage_declined_coupon:.2%}')

coupon_counts = data[['Y']].value_counts(normalize=True).reset_index()
coupon_counts.columns = ['Coupon Acceptance', 'Percentage']
fig = px.bar(data[['Y']].value_counts().reset_index(), x='Y', y='count',
             title='More people accepted coupons than declined', labels={'Y': 'Acceptance', 'count': 'Count'})
fig.show()
fig.write_image('images/coupon_acceptance.png')

### Proportion of the total observations that accept/decline the coupon
Out of the 12,684 observations, 7,210 users accepted the coupon accounting for 56.84% of the observations. On the flip side 5,474 users declined the coupon or 43.16%.
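As a quick cross-check of these figures: because `Y` is coded 0/1, its mean is the acceptance rate. The sketch below rebuilds the outcome column from the counts reported here instead of reading the file, so the `outcome` Series is a stand-in for `data['Y']`.

```python
import pandas as pd

# Stand-in for data['Y'], rebuilt from the reported counts.
outcome = pd.Series([1] * 7210 + [0] * 5474, name="Y")

acceptance_rate = outcome.mean()               # share of rows with Y == 1
shares = outcome.value_counts(normalize=True)  # share of each outcome

print(f"{len(outcome)} observations, acceptance rate {acceptance_rate:.2%}")
print(shares.round(4))
```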

5. Use a bar plot to visualize the `coupon` column.

In [None]:
fig = px.bar(data, x='coupon', title='Coffee House & Restaurant(<20) coupons were offered the most',
             labels={'coupon': 'Coupon Type', 'count': 'Count'}, color='coupon',
             color_discrete_sequence=px.colors.qualitative.Bold)
fig = fig.update_traces(marker_line_width=0)
fig.show()
fig.write_image('images/coupon_bar_plot.png')

6. Use a histogram to visualize the temperature column.

In [None]:
data['temperature'].unique()

In [None]:
fig = px.histogram(data, x='temperature', title='Of three possible temperatures 80 is most common',
                   labels={'temperature': 'Temperature', 'count': 'Count'}, nbins=3)
fig.show()
fig.write_image('images/temperature_histogram.png')

The problem description calls out that the temperature column contains only 3 unique values. Temperature is normally a continuous variable, so this isn't typical. Had it not been called out, the histogram above would have revealed the issue. We can take the values to mean: 80 -> hot, 55 -> moderate, 30 -> cold. It makes sense to create a new column to capture this.

In [None]:
data['temperature_category'] = data['temperature'].map({30: 'cold', 55: 'moderate', 80: 'hot'})

In [None]:
data.sample(2)

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [None]:
bar_coupons = data.query('coupon == "Bar"').copy()
bar_coupons.sample(2)

2. What proportion of bar coupons were accepted?


In [None]:
total_observations = bar_coupons.shape[0]

total_accepted_coupon = bar_coupons.query('Y == 1').count().iloc[0]
total_declined_coupon = bar_coupons.query('Y == 0').count().iloc[0]
percentage_accepted_coupon = total_accepted_coupon / total_observations
percentage_declined_coupon = total_declined_coupon / total_observations
print(
    f'Total Bar Observations: {total_observations}\nAccepted Bar Coupon: {total_accepted_coupon}\nDeclined Bar Coupon: {total_declined_coupon}\nAccepted Bar Coupon Percentage: {percentage_accepted_coupon:.2%}\nDeclined Bar Coupon Percentage: {percentage_declined_coupon:.2%}')

In [None]:
bar_coupon_count = bar_coupons[['Y']].value_counts()
fig = px.bar(bar_coupon_count.reset_index(), x='Y', y='count',
             title='More people declined the bar coupon than accepted',
             labels={'Y': 'Acceptance', 'count': 'Count'})
fig.write_image('images/bar_coupon_acceptance.png')
fig.show()

Interestingly, drivers who received bar coupons were, on average, less likely to accept than those who received other coupons.
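One way to make that comparison explicit is to compute the acceptance rate for every coupon type in a single call. The rows below are invented for illustration; on the real data the same `groupby` would be applied to `data`.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Bar", "Bar",
               "Coffee House", "Coffee House",
               "Restaurant(20-50)", "Restaurant(20-50)"],
    "Y": [1, 0, 0, 0, 1, 0, 1, 1],
})

# Y is 0/1, so the group mean is the acceptance rate per coupon type.
rate_by_coupon = toy.groupby("coupon")["Y"].mean().sort_values()
print(rate_by_coupon)
```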

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [None]:
bar_coupons['Bar'].unique()

In [None]:
fewer_than_3 = ['never', 'less1', '1~3']
greater_than_3 = ['gt8', '4~8']
total_3_or_fewer_bar = bar_coupons.query(f'Bar in {fewer_than_3}')
total_gt_than_3_bar = bar_coupons.query(f'Bar in {greater_than_3}')

total_accepted_coupon_from_3_or_fewer = total_3_or_fewer_bar.query(f'Y == 1').count().iloc[0]
total_accepted_coupon_from_gt_than_3 = total_gt_than_3_bar.query(f'Y == 1').count().iloc[0]
percentage_3_or_fewer_accepted_coupon = total_accepted_coupon_from_3_or_fewer / total_3_or_fewer_bar.shape[0]
percentage_gt_than_3_accepted_coupon = total_accepted_coupon_from_gt_than_3 / total_gt_than_3_bar.shape[0]
print(total_3_or_fewer_bar.shape[0])
print(total_gt_than_3_bar.shape[0])
print(
    f'3 or Fewer Accepted Bar Coupon: {total_accepted_coupon_from_3_or_fewer}\nGreater than 3 Accepted Bar Coupon: {total_accepted_coupon_from_gt_than_3}')
print(
    f'3 Or Fewer Accepted Bar Coupon Percentage: {percentage_3_or_fewer_accepted_coupon:.2%}\nGreater than 3 Accepted Bar Coupon Percentage: {percentage_gt_than_3_accepted_coupon:.2%}')

In [None]:
bar_avg_visit_coupon_ratio = bar_coupons.groupby('Bar')['Y'].value_counts(normalize=True).unstack()

# Values are normalized, so the y-axis shows a rate rather than a count
fig = px.bar(bar_avg_visit_coupon_ratio, title='More likely to accept the bar coupon if you go more',
             labels={'Y': 'Acceptance', 'value': 'Rate'})
fig.show()
fig.write_image('images/bar_coupon_acceptance_by_avg_visits.png')

As expected those that visit the bar more often are more likely to accept the coupon. 

4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others.  Is there a difference?


In [None]:
at_least_once = ['1~3', 'gt8', '4~8']
over_25 = ['46', '26', '31', '41', '50plus', '36']
goes_to_bar_at_least_once = bar_coupons['Bar'].isin(at_least_once)
over_25_age_indicator = bar_coupons['age'].isin(over_25)
bar_coupons.loc[:, 'over_25_goes_to_bar'] = goes_to_bar_at_least_once & over_25_age_indicator
bar_avg_visit_coupon_ratio = bar_coupons.groupby('over_25_goes_to_bar')['Y'].value_counts(normalize=True).unstack()
fig = px.bar(bar_avg_visit_coupon_ratio,
             title='Higher acceptance rate for those over 25 and going to bar more than once',
             labels={'over_25_goes_to_bar': 'Over 25 and goes to bar more than once a month', 'value': 'Rate',
                     'Y': 'Acceptance'})
fig.show()
fig.write_image('images/bar_coupon_acceptance_by_avg_visits_and_age.png')

This was an interesting find. I would have expected those aged 21 to 25 to be more inclined to go to a bar and therefore more likely to accept a coupon.

5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [None]:
with_friend_partner = ['Friend(s)', 'Partner']
# Passenger present who is not a kid (this also excludes driving alone)
with_friend_partner_indicator = bar_coupons['passanger'].isin(with_friend_partner)
not_farming_fishing_forestry = bar_coupons["occupation"] != "Farming Fishing & Forestry"
bar_coupons.loc[:, 'goes_to_bar_not_farmer_fisher_forestry_and_with_someone_not_kid'] = (
    goes_to_bar_at_least_once & with_friend_partner_indicator & not_farming_fishing_forestry
)
goes_to_bar_not_farmer_fisher_forestry_and_with_someone_not_kid_coupon_ratio = \
    bar_coupons.groupby('goes_to_bar_not_farmer_fisher_forestry_and_with_someone_not_kid')['Y'].value_counts(
        normalize=True).unstack()
fig = px.bar(goes_to_bar_not_farmer_fisher_forestry_and_with_someone_not_kid_coupon_ratio, labels={
    'goes_to_bar_not_farmer_fisher_forestry_and_with_someone_not_kid': 'Not with kid, not fishing, farming & forestry & bar more than once a month',
    'value': 'Rate', 'Y': 'Acceptance'})
title = 'Higher acceptance rate from those who go to bars more than once a month & had passengers that were not with a kid & had occupation other than farming, fishing, or forestry.'
wrapped_title = '<br>'.join(title[i:i + 89] for i in range(0, len(title), 89))
fig.update_layout(title=wrapped_title)

fig.show()
fig.write_image('images/bar_coupon_acceptance_by_avg_visits_and_occupation_and_passenger.png')

This is not a surprise. Drivers who have a friend or partner in the car, rather than a kid, are more likely to go to a bar and therefore more likely to accept a coupon, probably to have a drink together.
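The comparisons in this section all follow one pattern: build a boolean mask, then compare the acceptance rate of the rows inside the mask with the rest. A minimal sketch of that pattern as a reusable helper follows. The `acceptance_by_mask` function and the toy rows are hypothetical, introduced only for illustration.

```python
import pandas as pd


def acceptance_by_mask(df, mask, outcome="Y"):
    """Return acceptance rate and row count for rows where `mask` is True
    ("matches") versus False ("rest"). Assumes `outcome` is coded 0/1."""
    groups = mask.map({True: "matches", False: "rest"}).rename("group")
    return df.groupby(groups)[outcome].agg(rate="mean", n="size")


# Invented rows for illustration only.
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "4~8", "less1", "gt8", "never"],
    "passanger": ["Friend(s)", "Alone", "Partner", "Kid(s)", "Friend(s)", "Partner"],
    "Y": [1, 0, 1, 0, 1, 0],
})

goes_to_bar = toy["Bar"].isin(["1~3", "4~8", "gt8"])
adult_passenger = toy["passanger"].isin(["Friend(s)", "Partner"])

summary = acceptance_by_mask(toy, goes_to_bar & adult_passenger)
print(summary)
```

Reporting the row count next to the rate makes it easier to spot segments that are too small to trust.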

6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.

^Henceforth this will be labeled as condition1

In [None]:
under_30_list = ['21', '26', 'below21']
four_or_more = ['gt8', '4~8']
less_than_50k = ['$37500 - $49999', '$12500 - $24999',
                 '$25000 - $37499', 'Less than $12500']

In [None]:
not_widowed = bar_coupons['maritalStatus'] != 'Widowed'
under_30 = bar_coupons['age'].isin(under_30_list)
cheap_restaurants_more_than_4_times_a_month_income_less_than_50K = bar_coupons['RestaurantLessThan20'].isin(
    four_or_more) & bar_coupons['income'].isin(less_than_50k)

first_condition = goes_to_bar_at_least_once & with_friend_partner_indicator & not_widowed
second_condition = goes_to_bar_at_least_once & under_30
third_condition = cheap_restaurants_more_than_4_times_a_month_income_less_than_50K
bar_coupons.loc[:, 'condition1'] = first_condition | second_condition | third_condition
condition1_ratio = bar_coupons.groupby('condition1')['Y'].value_counts(normalize=True).unstack()
fig = px.bar(condition1_ratio,
             labels={
                 'condition1': 'Bar goer w/o kid not widow or Bar goer under30 or cheap restaurant goer making under 50k',
                 'value': 'Rate', 'Y': 'Acceptance'})
title = 'Higher Acceptance among Bar goer without kid not widow or Bar goer under30 or cheap restaurant goer making under 50k'
wrapped_title = '<br>'.join(title[i:i + 83] for i in range(0, len(title), 83))
fig.update_layout(title=wrapped_title)
fig.show()
fig.write_image('images/bar_coupon_by_condition1.png')

7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

## Bar Coupon Acceptor Analysis

### General Coupon Acceptance vs Bar Coupon Acceptance
Across the 12,684 observations, overall coupon acceptance was **\~57%**, but bar coupon acceptance stood at **\~41%**.

Contrast general coupon acceptance:

<img src='images/coupon_acceptance.png'>

With bar coupon acceptance:

<img src='images/bar_coupon_acceptance.png'>

### Bar Coupon Acceptance x Bar visits


As I drilled down into further categories, I began to uncover which features best characterized the minority of drivers who accept bar coupons. I quickly found that those who accepted the bar coupon went to a bar more often than those who declined. This is intuitive. _People who go to a place more often are more willing to accept a coupon for that place than those who do not._

<img src='images/bar_coupon_acceptance_by_avg_visits.png'>

### Bar Coupon Acceptance x Age

The next feature I looked at was age. My gut feeling was that those under 25 would be more likely to accept a bar coupon. However, I was wrong, and it seems I forgot what it is like to be 21–25. _People in this age group are either just starting their careers or are in college, meaning money is scarce and probably shouldn't be spent at a bar._ People in this group don't need to go to a bar to drink with friends; they can just drink beer or sip wine on the couch. I recall that I didn't frequent bars at that age simply because of the cost. The following graph shows this disparity.

<img src='images/bar_coupon_acceptance_by_avg_visits_and_age.png'>

### Bar Acceptance x Goes to Bar once a month x Passenger not a kid x Not in Farming, Fishing or Forestry

This combination of conditions represents a different kind of individual. I filtered out those who drove alone or with children, then those who worked in farming, fishing, or forestry, since I imagine these occupations are further away from city life. I also filtered out those who go to bars less than once a month. This shows a very different picture. _Users fitting these traits accepted the coupon at a very high rate._

<img src='images/bar_coupon_acceptance_by_avg_visits_and_occupation_and_passenger.png'>

### More complicated categories

Finally, I looked at drivers who met the following characteristics:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.

These characteristics represent 3 distinct types of people. Although the acceptance rate for this group was not as high as for the previous one, it gave us a _new segment or cluster of people to target with bar coupons._

<img src='images/bar_coupon_by_condition1.png'>

### Conclusion

This exercise shows that we should not generalize about an entire population without considering what makes individuals within it unique. In short, you need to know your customer. Even within a targeted customer group, there will be subgroups or clusters that respond to different stimuli. It was a red herring to assume that the average acceptance rate for bar coupons could be generalized across a broader population. As we broke down the characteristics of the drivers, we found which traits lead to higher acceptance rates. We now know a little more about whom to target with bar coupons, and this suggests doing a similar exercise for the other coupon types. If we repeat this process, we can improve our overall coupon acceptance rate by targeting those more likely to accept the type of coupon we send them.

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.

## Investigating Restaurant(20-50) Coupons

In [None]:
expensive_restaurant_df = data.query('coupon == "Restaurant(20-50)"').copy()
expensive_restaurant_df.sample(2)

In [None]:
# Start with general acceptance rate
total_observations = expensive_restaurant_df.shape[0]

total_accepted_rest = expensive_restaurant_df.query('Y == 1').count().iloc[0]
total_declined_rest = expensive_restaurant_df.query('Y == 0').count().iloc[0]
percentage_accepted_coupon = total_accepted_rest / total_observations
percentage_declined_coupon = total_declined_rest / total_observations
print(
    f'Total Expensive Restaurants Observations: {total_observations}\nAccepted Expensive Restaurants Coupon: {total_accepted_rest}\nDeclined Expensive Restaurants Coupon: {total_declined_rest}\nAccepted Expensive Restaurants Coupon Percentage: {percentage_accepted_coupon:.2%}\nDeclined Expensive Restaurants Coupon Percentage: {percentage_declined_coupon:.2%}')

# Count accepted vs declined directly; grouping Y by itself is unnecessary
fig = px.bar(expensive_restaurant_df['Y'].value_counts(), labels={'value': 'Count', 'Y': 'Acceptance'},
             title='General population more likely to decline Restaurant(20-50) coupons')
fig.show()
fig.write_image('images/restaurant_acceptance.png')

In [None]:
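# Plotly can interpret text between "$" signs as LaTeX, so strip them from the labels.
# regex=False makes sure "$" is treated as a literal character, not a regex anchor.
expensive_restaurant_df.income = expensive_restaurant_df.income.str.replace('$', '', regex=False)
expensive_restaurant_df.income.unique()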

In [None]:
# Look into acceptance rate by Income
by_income = expensive_restaurant_df.groupby('income')['Y'].value_counts(normalize=True)

by_income = by_income.reset_index()
fig = px.bar(by_income, x='income', y='proportion', barmode='group', color='Y',
             labels={'income': 'Income', 'Y': 'Acceptance', 'proportion': 'Rate'},
             title='Excluding the 25,000 to 37,499 income group, most income groups <br>have a higher decline rate')
fig.update_xaxes(tickangle=45)

fig.show()
fig.write_image('images/restaurant_acceptance_by_income.png')

I was confident that there would be a higher acceptance rate for higher income individuals but I was clearly wrong. There doesn't seem to be a correlation at all.
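Because `groupby` sorts string labels alphabetically, a trend across income brackets can be hard to read off the chart. One option is to treat income as an ordered categorical so the brackets run from lowest to highest. In this sketch the bracket list is limited to labels that appear in this notebook (higher brackets are left out) and the rows are invented.

```python
import pandas as pd

# Bracket labels as they look once the "$" is stripped; higher brackets omitted.
income_order = ["Less than 12500", "12500 - 24999", "25000 - 37499",
                "37500 - 49999", "50000 - 62499"]

# Invented rows for illustration only.
toy = pd.DataFrame({
    "income": ["50000 - 62499", "Less than 12500", "25000 - 37499",
               "Less than 12500", "12500 - 24999", "50000 - 62499"],
    "Y": [1, 0, 1, 1, 0, 0],
})

toy["income"] = pd.Categorical(toy["income"], categories=income_order, ordered=True)

# observed=False keeps brackets with no rows, so gaps stay visible as NaN.
rate_by_income = toy.groupby("income", observed=False)["Y"].mean()
print(rate_by_income)
```

With the brackets in order, a rising or falling pattern (or the lack of one) is easier to judge by eye.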

In [None]:
# Look into acceptance rate by marital status
by_marital_status = expensive_restaurant_df.groupby('maritalStatus')['Y'].value_counts(normalize=True)

by_marital_status = by_marital_status.reset_index()
fig = px.bar(by_marital_status, x='maritalStatus', y='proportion', barmode='group', color='Y',
             labels={'maritalStatus': 'Marital Status', 'Y': 'Acceptance', 'proportion': 'Rate'},
             title='Unmarried individuals have a higher acceptance rate, <br>widows have a very high decline rate')
fig.update_xaxes(tickangle=45)

fig.show()
fig.write_image('images/restaurant_acceptance_by_marital_status.png')



This is what I expected. I would imagine that someone who is in a relationship but not yet married is more likely to want to impress their partner.

In [None]:
# Look into acceptance rate by monthly attendance
by_visits = expensive_restaurant_df.groupby('Restaurant20To50')['Y'].value_counts(normalize=True)

by_visits = by_visits.reset_index()
fig = px.bar(by_visits, x='Restaurant20To50', y='proportion', barmode='group', color='Y',
             labels={'Restaurant20To50': 'Visits', 'Y': 'Acceptance', 'proportion': 'Rate'},
             title='If you visit more then you are more likely to accept the coupon')
fig.update_xaxes(tickangle=45)

fig.show()
fig.write_image('images/restaurant_acceptance_by_visits.png')

To the surprise of no one, if you visit a place a lot, you are more likely to accept a coupon to that place.

In [None]:
# Look into acceptance rate by passenger
by_passenger = expensive_restaurant_df.groupby('passanger')['Y'].value_counts(normalize=True)

by_passenger = by_passenger.reset_index()
fig = px.bar(by_passenger, x='passanger', y='proportion', barmode='group', color='Y',
             labels={'passanger': 'Passenger type', 'Y': 'Acceptance', 'proportion': 'Rate'},
             title='If you are with your partner you are more likely to accept the coupon')
fig.update_xaxes(tickangle=45)

fig.show()
fig.write_image('images/restaurant_acceptance_by_passenger.png')

Confirming our earlier assessment concerning marital status, we see here that if you are driving with your partner you are more likely to accept the coupon.

In [None]:
# Look into acceptance rate when distance to restaurant is less than 15 min and in same direction
# Assumes direction_same uses 1 for "same direction" (0 would select the opposite direction)
same_direction = expensive_restaurant_df['direction_same'] == 1
less_than_15 = expensive_restaurant_df['toCoupon_GEQ15min'] == 0

expensive_restaurant_df.loc[:, 'same_direction_and_close'] = (same_direction & less_than_15)

same_direction_and_close = expensive_restaurant_df.groupby('same_direction_and_close')['Y'].value_counts(
    normalize=True).unstack()
fig = px.bar(same_direction_and_close,
             title='Acceptance rate when the restaurant is close and in the same direction',
             labels={'same_direction_and_close': 'Restaurant is close and in the same direction',
                     'value': 'Ratio',
                     'Y': 'Acceptance'}
             )
fig.show()
fig.write_image('images/restaurant_acceptance_by_same_direction_and_close.png')

In [None]:
# Look into acceptance rate by gender
by_gender = expensive_restaurant_df.groupby('gender')['Y'].value_counts(normalize=True)

by_gender = by_gender.reset_index()
fig = px.bar(by_gender, x='gender', y='proportion', barmode='group', color='Y',
             labels={'gender': 'Gender', 'Y': 'Acceptance', 'proportion': 'Rate'},
             title='Males accept the coupon more than females')
fig.update_xaxes(tickangle=45)

fig.show()
fig.write_image('images/restaurant_acceptance_by_gender.png')

Males accept the coupon about 3% more often than females. Next, I'll try to break this down to males with a partner and without a kid.

In [None]:
# Look into acceptance rate of Males with Partner, no children
# or Males who have their partner in the car
with_someone = expensive_restaurant_df['maritalStatus'] != 'Unmarried Partner'
no_kids = expensive_restaurant_df['has_children'] == 0
male = expensive_restaurant_df['gender'] == 'Male'
partner_in_car = expensive_restaurant_df['passanger'] == 'Partner'

expensive_restaurant_df.loc[:, 'male_with_impressing_opportunity'] = (male & with_someone & no_kids) | (
            male & partner_in_car)

male_with_impressing_opp = expensive_restaurant_df.groupby('male_with_impressing_opportunity')['Y'].value_counts(
    normalize=True).unstack()
fig = px.bar(male_with_impressing_opp,
             title='5% higher acceptance rate for males with an opportunity to impress<br> their partner',
             labels={'male_with_impressing_opportunity': 'Male with opportunity to impress partner', 'value': 'Ratio',
                     'Y': 'Acceptance'}
             )
fig.show()
fig.write_image('images/male_with_opportunity_to_impress.png')

In [None]:
# Let's do the same with females
female = expensive_restaurant_df['gender'] == 'Female'

expensive_restaurant_df.loc[:, 'female_with_impressing_opportunity'] = (female & with_someone & no_kids) | (
            female & partner_in_car)

female_with_impressing_opp = expensive_restaurant_df.groupby('female_with_impressing_opportunity')['Y'].value_counts(
    normalize=True).unstack()
fig = px.bar(female_with_impressing_opp,
             title='2% higher acceptance rate for females with an <br>opportunity to impress their partner',
             labels={'female_with_impressing_opportunity': 'Female with opportunity to impress partner',
                     'value': 'Ratio',
                     'Y': 'Acceptance'}
             )

fig.show()
fig.write_image('images/female_with_opportunity_to_impress.png')

It looks like being with a partner may increase the acceptance rate, but not by much.
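A difference of a few percentage points can easily arise by chance, so it helps to put a number on it. One common check is a two-proportion z-test; the sketch below uses only the standard library. The counts are hypothetical placeholders, not values from this dataset, and `two_proportion_z_test` is a helper named here for illustration.

```python
import math


def two_proportion_z_test(accepted_a, n_a, accepted_b, n_b):
    """Two-sided z-test for the difference between two acceptance rates,
    using the pooled proportion under the null hypothesis of equal rates."""
    p_a = accepted_a / n_a
    p_b = accepted_b / n_b
    pooled = (accepted_a + accepted_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 230 of 500 accept in one group, 430 of 1000 in the other.
z, p_value = two_proportion_z_test(230, 500, 430, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A small p-value would suggest the gap is unlikely to be chance alone; a large one means the difference should be read with caution.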

In [None]:
expensive_restaurant_df.income.unique()

In [None]:
# Try the same as the above but look at income as well
income_groups = ['37500 - 49999', '12500 - 24999',
                 '50000 - 62499', '25000 - 37499',
                 'Less than 12500']
income_lower = expensive_restaurant_df['income'].isin(income_groups)
expensive_restaurant_df.loc[:, 'male_with_impressing_opportunity_and_lower_income'] = (
    (male & with_someone & no_kids & income_lower) | (male & partner_in_car & income_lower)
)

male_with_impressing_opportunity_and_lower_income = (
    expensive_restaurant_df
    .groupby('male_with_impressing_opportunity_and_lower_income')['Y']
    .value_counts(normalize=True)
    .unstack()
)
fig = px.bar(male_with_impressing_opportunity_and_lower_income,
             title='3% higher acceptance rate for males with an <br>opportunity to impress their partner with lower income',
             labels={
                 'male_with_impressing_opportunity_and_lower_income': 'Male with opportunity to impress partner with lower income',
                 'value': 'Ratio',
                 'Y': 'Acceptance'}
             )
fig.show()
fig.write_image('images/male_with_opportunity_to_impress_and_lower_income.png')

In [None]:
# Repeat the lower-income comparison for females
expensive_restaurant_df.loc[:, 'female_with_impressing_opportunity_and_lower_income'] = (
    (female & with_someone & no_kids & income_lower) | (female & partner_in_car & income_lower)
)

female_with_impressing_opportunity_and_lower_income = (
    expensive_restaurant_df
    .groupby('female_with_impressing_opportunity_and_lower_income')['Y']
    .value_counts(normalize=True)
    .unstack()
)
fig = px.bar(female_with_impressing_opportunity_and_lower_income,
             title='4% higher acceptance rate for females with an <br>opportunity to impress their partner with lower income',
             labels={
                 'female_with_impressing_opportunity_and_lower_income': 'Female with opportunity to impress partner with lower income',
                 'value': 'Ratio',
                 'Y': 'Acceptance'}
             )
fig.show()
fig.write_image('images/female_with_opportunity_to_impress_and_lower_income.png')

In [None]:
# Comparing Female & Male with impressing opportunity
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=("Male with Impressing Opportunity",
                    "Female with Impressing Opportunity",
                    "Male with Impressing Opportunity & Lower Income",
                    "Female with Impressing Opportunity & Lower Income")
)

fig_male_impress = px.bar(male_with_impressing_opp,
                          title='5% higher acceptance rate for males with an opportunity to impress<br> their partner',
                          labels={'male_with_impressing_opportunity': 'Male with opportunity to impress partner',
                                  'value': 'Ratio',
                                  'Y': 'Acceptance'},
                          )
fig_female_impress = px.bar(female_with_impressing_opp,
                            title='2% higher acceptance rate for females with an <br>opportunity to impress their partner',
                            labels={'female_with_impressing_opportunity': 'Female with opportunity to impress partner',
                                    'value': 'Ratio',
                                    'Y': 'Acceptance'}
                            )

fig_male_impress_income = px.bar(male_with_impressing_opportunity_and_lower_income,
                                 title='3% higher acceptance rate for males with an <br>opportunity to impress their partner with lower income',
                                 labels={
                                     'male_with_impressing_opportunity_and_lower_income': 'Male with opportunity to impress partner with lower income',
                                     'value': 'Ratio',
                                     'Y': 'Acceptance'}
                                 )
fig_female_impress_income = px.bar(female_with_impressing_opportunity_and_lower_income,
                                   title='4% higher acceptance rate for females with an <br>opportunity to impress their partner with lower income',
                                   labels={
                                       'female_with_impressing_opportunity_and_lower_income': 'Female with opportunity to impress partner with lower income',
                                       'value': 'Ratio',
                                       'Y': 'Acceptance'}
                                   )

# Each px.bar figure has one trace per Y value; data[1] is the accepted (Y = 1) trace
fig.add_trace(fig_male_impress.data[1], row=1, col=1)
fig.add_trace(fig_female_impress.data[1], row=1, col=2)
fig.add_trace(fig_male_impress_income.data[1], row=2, col=1)
fig.add_trace(fig_female_impress_income.data[1], row=2, col=2)

# Update layout
fig.update_layout(
    height=800,  # Increase height
    width=1000,  # Increase width
    title_text="In all cases, the income & opportunity to impress improve acceptance",
    showlegend=False
)

fig.show()
fig.write_image('images/subplot_male_female_impress_opp.png')

## Restaurant(20-50) Coupon Acceptance Analysis

Across the 12,784 observations, overall coupon acceptance was **\~57%**, but acceptance of expensive restaurant coupons stood at only **\~44%**. This result surprised me. I would have expected more people to accept a coupon for the more expensive places.

Compare general coupon acceptance:

<img src="images/coupon_acceptance.png">

With expensive restaurants:

<img src='images/restaurant_acceptance.png'>
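The comparison above comes down to taking the mean of the `Y` column, once over all rows and once over the expensive restaurant rows only. The following is a minimal sketch on made-up rows, assuming the coupon type is stored in a column named `coupon` with the label `Restaurant(20-50)`. The numbers it produces are illustrative and are not the survey results.

```python
import pandas as pd

# Toy stand-in for the survey data; these rows are made up for illustration.
toy_df = pd.DataFrame({
    'coupon': ['Bar', 'Restaurant(20-50)', 'Coffee House', 'Restaurant(20-50)',
               'Restaurant(<20)', 'Restaurant(20-50)', 'Coffee House', 'Restaurant(20-50)'],
    'Y': [0, 1, 1, 0, 1, 0, 1, 1],
})

# Because Y is 0/1, its mean is the acceptance rate.
overall_rate = toy_df['Y'].mean()

# Restrict to the expensive restaurant coupons and repeat.
is_expensive = toy_df['coupon'] == 'Restaurant(20-50)'
expensive_rate = toy_df.loc[is_expensive, 'Y'].mean()

print(f"overall: {overall_rate:.1%}, expensive restaurants: {expensive_rate:.1%}")
```

The same pattern with `groupby('coupon')['Y'].mean()` would give the acceptance rate for every coupon type at once.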

### Restaurant(20-50) Coupon Acceptance Analysis by Visits

The outcome of this analysis seemed obvious, but I felt it prudent to double-check. As anyone would expect, a person is more likely to accept the coupon if they visit that type of place more often.

<img src='images/restaurant_acceptance_by_visits.png'>

### Restaurant(20-50) Coupon Acceptance Analysis by Distance and Direction

I decided to see how much distance and direction mattered. I expected that a driver who is close to the restaurant and already heading in the same direction would be more likely to accept the coupon. I found a 2% increase in acceptance rate for people who are close and in the same direction. That is not as much as I expected, but every single percentage point counts.

<img src='images/restaurant_acceptance_by_same_direction_and_close.png'>

### Restaurant(20-50) Coupon Acceptance Analysis by Income

I then turned my attention to income. I expected a higher acceptance rate for people with higher incomes, but I was wrong. In fact, there does not seem to be a correlation at all. Interestingly, the \\$25,000 to \\$37,499 income group has a higher acceptance rate than the others, at 51%. This finding got me thinking about the desire to make a good impression on a partner.

<img src='images/restaurant_acceptance_by_income.png'>

### Restaurant(20-50) Coupon Acceptance Analysis by Marital Status

I believe I was right to look at marital status, considering the above finding about income. I was not surprised to see that someone who was not married but had a partner was more likely to accept the coupon, 4% more likely than the general population. It also wasn't surprising to see that singles were 2% more likely to accept the coupon.

<img src='images/restaurant_acceptance_by_marital_status.png'>

### Restaurant(20-50) Coupon Acceptance Analysis by Passenger

Now that I had seen the acceptance rates by marital status, I thought that surely a driver with their partner in the car would be more likely to accept the coupon. I was right. I found that drivers who are with their partner accept the coupon 63% of the time.

<img src='images/restaurant_acceptance_by_passenger.png'>

### Restaurant(20-50) Coupon Acceptance Analysis by Gender

I was now interested to see if there was a difference between males and females. I found that males were 3% more likely to accept the coupon than females. I wanted to drill into this comparison further.

<img src='images/restaurant_acceptance_by_gender.png'>

### Restaurant(20-50) Coupon Acceptance Analysis—Males with an opportunity to impress their partner

I wanted to see if there was a difference between males with an opportunity to impress their partner and males without. I did this by creating `male_with_impressing_opportunity` which would be true if the male had a partner and no kids or if the male was in the car with a partner. I found that males with an opportunity to impress their partner were 5% more likely to accept the coupon than males without. This was a considerable jump so I wanted to see how income affected this.

<img src='images/male_with_opportunity_to_impress.png'>
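The flag described above is a combination of boolean masks. The sketch below shows one way it could be built on made-up rows. The category labels (`'Male'`, `'Unmarried partner'`, `'Married partner'`, `'Partner'`) and the way `with_someone`, `no_kids`, and `partner_in_car` are defined here are assumptions for illustration and may differ from the definitions used earlier in the notebook.

```python
import pandas as pd

# Made-up rows; the category labels are assumptions about the survey data.
toy_df = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Female', 'Male', 'Male'],
    'maritalStatus': ['Unmarried partner', 'Single', 'Married partner',
                      'Unmarried partner', 'Single', 'Married partner'],
    'has_children': [0, 0, 1, 0, 0, 1],
    'passanger': ['Alone', 'Alone', 'Partner', 'Partner', 'Friend(s)', 'Kid(s)'],
    'Y': [1, 0, 1, 1, 0, 0],
})

# Building blocks (hypothetical definitions).
male = toy_df['gender'] == 'Male'
with_someone = toy_df['maritalStatus'].isin(['Unmarried partner', 'Married partner'])
no_kids = toy_df['has_children'] == 0
partner_in_car = toy_df['passanger'] == 'Partner'

# True if the male has a partner and no kids, or has a partner in the car.
flag = (male & with_someone & no_kids) | (male & partner_in_car)
toy_df['male_with_impressing_opportunity'] = flag

# Acceptance rate inside and outside the flagged group.
rate_with = toy_df.loc[flag, 'Y'].mean()
rate_without = toy_df.loc[~flag, 'Y'].mean()
print(f"with opportunity: {rate_with:.0%}, without: {rate_without:.0%}")
```

The parentheses around each side of `|` matter for readability: `&` already binds tighter than `|` in Python, but spelling the grouping out makes the two conditions easy to audit.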

### Restaurant(20-50) Coupon Acceptance Analysis—Males with an opportunity to impress their partner with lower income

I defined lower income as anything lower than \\$62,499. I found that males with an opportunity to impress their partner with lower income were 3% more likely to accept the coupon than males without.

<img src='images/male_with_opportunity_to_impress_and_lower_income.png'>

### Restaurant(20-50) Coupon Acceptance Analysis—Females with an opportunity to impress their partner

Next, I wanted to compare males with females. Females, when given the chance to impress their partner with an expensive date, were 2% more likely to accept the coupon than females without.

<img src="images/female_with_opportunity_to_impress.png">

### Restaurant(20-50) Coupon Acceptance Analysis—Females with an opportunity to impress their partner with lower income

Given this, I began to think lower income females would also be more likely to accept the coupon to impress their partner. I was not wrong. 47% of this group accepted the coupon, versus 43% of those outside the group.

<img src="images/female_with_opportunity_to_impress_and_lower_income.png">
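The gap between 47% and 43% can be read two ways, and it helps to be explicit about which one the "4% higher" in the chart title refers to. The short sketch below uses the two rates quoted above and computes both the difference in percentage points and the relative increase. The variable names are only illustrative.

```python
# Acceptance rates quoted in the text above.
rate_in_group = 0.47
rate_outside_group = 0.43

# Absolute gap, in percentage points (what the "4% higher" title describes).
percentage_point_gap = (rate_in_group - rate_outside_group) * 100

# Relative increase over the baseline rate.
relative_increase = (rate_in_group - rate_outside_group) / rate_outside_group

print(f"{percentage_point_gap:.0f} percentage points, "
      f"{relative_increase:.1%} relative increase")
```

A gap of 4 percentage points on a 43% baseline is a relative increase of a little over 9%, so the two readings are far enough apart that the distinction matters when comparing groups.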

### Restaurant(20-50) Coupon Acceptance Analysis - Male/Female with Impressing Opportunity Subplots

I decided to put all of these plots together to get a better picture and tell a better story. In the following image, you will see that, for expensive restaurant coupons, those with a partner are more likely to accept the coupon when given the opportunity. This is likely because these individuals would love the chance to treat their partner to a nicer meal. I found this result quite wholesome.

<img src="images/subplot_male_female_impress_opp.png">

### Conclusion

In conclusion, I found that we can increase the rate of acceptance for expensive restaurant coupons by targeting certain attributes in the dataset. Proximity and direction had a moderate impact on acceptance, but income showed no correlation (barring one income bracket). Relationship dynamics were of particular interest to me. People with partners who were not married were more likely to accept a coupon. There was a gender dynamic here too: both genders were out to impress, but females of lower income had a stronger boost than males of lower income. These findings suggest that targeting couples, particularly when the partner is present, would be more effective than targeting other groups. The data reveals a wholesome pattern in which both females and males are more likely to accept a coupon to treat their partners to a nice meal.

## Decision Tree & Random Forest models for expensive restaurants

In [None]:
expensive_restaurant_df.columns

In [None]:
# call out desired features
category_columns = [
    'passanger',  # spelled this way in the dataset
    'gender', 'age', 'maritalStatus',
    'income', 'Restaurant20To50',
]

bool_columns = [
    'same_direction_and_close',
    'male_with_impressing_opportunity',
    'female_with_impressing_opportunity',
    'male_with_impressing_opportunity_and_lower_income',
    'female_with_impressing_opportunity_and_lower_income',
]

numerical_columns = [
    'has_children',
    'toCoupon_GEQ5min', 'toCoupon_GEQ15min', 'toCoupon_GEQ25min',
    'direction_same', 'direction_opp',
]

# Keep only the selected features and the target
cleaned_expensive_restaurant_df = expensive_restaurant_df[
    category_columns + bool_columns + numerical_columns + ['Y']
].copy()

cleaned_expensive_restaurant_df.info()

In [None]:
# Split test and train data
X = cleaned_expensive_restaurant_df[category_columns + bool_columns + numerical_columns]
y = cleaned_expensive_restaurant_df['Y']

# Fix the seed so the split, and the metrics reported below, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), category_columns),
        ('bool', 'passthrough', bool_columns),
        ('numerical', 'passthrough', numerical_columns)
    ]
)


decision_tree_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', DecisionTreeClassifier(max_depth=4))])


decision_tree_pipeline.fit(X_train, y_train)


In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), category_columns),
        ('bool', 'passthrough', bool_columns),
        ('numerical', 'passthrough', numerical_columns)
    ]
)


random_forest_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', RandomForestClassifier(max_depth=4))])


random_forest_pipeline.fit(X_train, y_train)


In [None]:

def threshold_model_stats(model, model_name, X_test, y_test):
    """Return accuracy, precision, and recall for a range of decision thresholds."""
    # Predicted probability of the positive class (Y = 1)
    y_pred_probability = model.predict_proba(X_test)[:, 1]
    rows = []
    for thresh in np.arange(.3, .6, .025):
        # Predict acceptance when the probability exceeds the threshold
        y_pred = (y_pred_probability > thresh).astype(int)
        rows.append({
            'model_name': model_name,
            'threshold': thresh,
            'accuracy': accuracy_score(y_test, y_pred),
            # zero_division=0 avoids a warning when no positives are predicted
            'precision': precision_score(y_test, y_pred, zero_division=0),
            'recall': recall_score(y_test, y_pred),
        })
    # Build the frame once instead of concatenating inside the loop
    return pd.DataFrame(rows)


In [None]:
decision_tree_stats = threshold_model_stats(decision_tree_pipeline, 'Decision Tree', X_test, y_test)
random_forest_stats = threshold_model_stats(random_forest_pipeline, 'Random Forest', X_test, y_test)
stats = pd.concat([decision_tree_stats, random_forest_stats])

In [None]:
stats.sample(10)

In [None]:
stats.describe()

In [None]:
fig = px.scatter(
    stats,
    x='recall',
    y='precision',
    color='model_name',
    size='accuracy',
    title='Decision Tree vs Random Forest - Precision vs Recall',
    labels={
        'precision': 'Precision',
        'recall': 'Recall',
        'model_name': 'Model',
    }
)

# Connect each model's points with a line in the same color as its markers.
# The color is read from the marker trace because the row index is not a
# reliable color index (it starts at 0 for every model).
for trace in list(fig.data):
    model_stats = stats[stats['model_name'] == trace.name]
    fig.add_scatter(
        x=model_stats['recall'],
        y=model_stats['precision'],
        mode='lines',
        line=dict(color=trace.marker.color),
        showlegend=False
    )

fig.show()
fig.write_image('images/decision_tree_vs_random_forest.png')

In [None]:
stats.nunique()

## Expensive Restaurant Coupon Acceptance Model Analysis

I decided to create two models for expensive restaurant coupon acceptance: a decision tree and a random forest. For these models, I decided to use only the columns I analyzed above. I wanted to see if there was a difference between the two models, since I have seen such a difference before at work, and I fully expected the random forest to perform better. I found that the random forest was easier to tune by threshold. The decision tree also showed no difference between certain thresholds, most likely because a tree with `max_depth=4` has at most 16 leaves and therefore outputs only a handful of distinct probabilities. Because we are only looking to predict coupon acceptance, I believe precision should be our primary metric. Across these models, the maximum precision was 70%, and I would recommend the model and threshold that achieve it. Precision helps minimize the number of false positives: in this case, people who we predict will accept but who actually decline the coupon. From a business perspective, this means more efficient targeting, and it reduces the feeling of spam that comes from sending coupons to people who are unlikely to accept them. However, the poor recall means we will have a lot of missed opportunities. This should be a target area for future research. With better feature discovery, we should end up with a better model.

<img src='images/decision_tree_vs_random_forest.png'>
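To make the precision and recall trade-off concrete, here is a small worked sketch on made-up labels and probabilities (not model output from this notebook). The helper `precision_recall_at` is hypothetical and computes the two metrics directly from their definitions, using the same "probability greater than threshold" rule as `threshold_model_stats`.

```python
import numpy as np


def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall when predicting 1 for probabilities above threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) > threshold).astype(int)

    tp = int(((y_pred == 1) & (y_true == 1)).sum())  # predicted accept, did accept
    fp = int(((y_pred == 1) & (y_true == 0)).sum())  # predicted accept, declined
    fn = int(((y_pred == 0) & (y_true == 1)).sum())  # predicted decline, did accept

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth and predicted probabilities for eight drivers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_proba = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.35, 0.1]

results = {t: precision_recall_at(y_true, y_proba, t) for t in (0.3, 0.6, 0.85)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, raising the threshold sends coupons to fewer drivers, so fewer of them decline (higher precision) but more drivers who would have accepted never receive one (lower recall). That is the same trade-off behind the missed opportunities mentioned above.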