### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver, but what factors determine whether a driver accepts the coupon once it is delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks the respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers of ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will use your knowledge of plotting, statistical summaries, and visualization in Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [None]:
import plotly.express as px
import pandas as pd

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [None]:
data = pd.read_csv('data/coupons.csv')

In [None]:
data.info()

In [None]:
data.sample(5)

In [None]:
data.describe(include='all')

2. Investigate the dataset for missing or problematic data.

In [None]:
# Show columns with missing data
is_missing_cols = data.isna().sum().sort_values(ascending=False)
is_missing_cols[is_missing_cols > 0].index

In [None]:
columns_with_missing_data = is_missing_cols[is_missing_cols > 0].index

In [None]:
# Show count of columns with missing data
data[columns_with_missing_data].isna().sum()

In [None]:
# Show percentage of columns with missing data
data[columns_with_missing_data].isna().mean().round(3) * 100

In [None]:
# Show most frequent responses for columns with missing data
data[columns_with_missing_data].describe()

### Missing Columns Analysis
Only six columns contain missing data: `Restaurant20to50`, `RestaurantLessThan20`, `CarryAway`, `CoffeeHouse`, `Bar`, and `car`. All six are categorical features. Every column except `car` is missing at most `1.7%` of its values, while `car` is missing `99.1%` of the time. I will therefore drop the `car` column, even though it could be useful to know: I simply can't impute that much data or assume with confidence that the user is in a car.

The other five columns describe the user's behavior, namely how often they visit each type of business. Even without visualizing the dataset I can infer that these columns are important, and because so little of the data is missing I ought to keep them. The question is how to impute the gaps. A naive approach is to impute the most frequent response, or to impute `never` on the assumption that this is what the user meant. Either way, these would be guesses.

Another option is a k-nearest-neighbors approach: find the 5 nearest neighbors of a row with missing values and impute the most frequent response among them. This is an advanced technique that I am not confident I can pull off just yet. I have used SMOTE (Synthetic Minority Oversampling Technique) before, but it is designed for oversampling the minority class in imbalanced datasets rather than for imputation. I mention it because it uses k-nearest neighbors to create synthetic data, which inspired the suggestion above.

For the purpose of this project I will impute the most frequent response. The table above shows the most frequent response for each column; these account for roughly `25%` to `50%` of responses.
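For reference, the nearest-neighbor idea described above could look roughly like the sketch below. It is not what this notebook does (the notebook imputes the column mode), and it is only an illustration: the helper `impute_by_neighbors`, the toy frame, and the column names `age` and `passenger` are all hypothetical, and similarity is simply the number of matching categorical values.

```python
import pandas as pd

def impute_by_neighbors(df, target, features, k=5):
    """Fill missing `target` values with the most common value among the
    k complete rows that match the incomplete row on the most `features`."""
    out = df.copy()
    known = df[df[target].notna()]
    for idx in df.index[df[target].isna()]:
        # Similarity = number of feature columns with an identical value.
        matches = known[features].eq(df.loc[idx, features], axis="columns").sum(axis=1)
        nearest = matches.nlargest(k).index
        # Most frequent response among the nearest neighbors.
        out.loc[idx, target] = known.loc[nearest, target].mode().iloc[0]
    return out

# Hypothetical toy data; the last row is missing its `Bar` response.
toy = pd.DataFrame({
    "age": ["21", "21", "21", "50plus", "50plus", "21"],
    "passenger": ["Friend(s)", "Friend(s)", "Alone", "Alone", "Alone", "Friend(s)"],
    "Bar": ["1~3", "1~3", "less1", "never", "never", None],
})

filled = impute_by_neighbors(toy, target="Bar", features=["age", "passenger"], k=3)
print(filled)
```

In this toy case the three most similar complete rows mostly answered `1~3`, so that value fills the gap, whereas plain mode imputation over the whole column could pick a different label.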

3. Decide what to do about your missing data -- drop, replace, other...

In [None]:
# drop car column
data = data.drop(columns=['car'])

In [None]:
# Impute rest of columns
data = data.fillna(data.mode().iloc[0])

In [None]:
# Double check if data is missing
is_missing_cols = data.isna().sum().sort_values(ascending=False)
is_missing_cols[is_missing_cols > 0].index

4. What proportion of the total observations chose to accept the coupon?



In [None]:
total_observations = data.shape[0]
total_accepted_coupon = data.query('Y == 1').count().iloc[0]
total_declined_coupon = data.query('Y == 0').count().iloc[0]
percentage_accepted_coupon = total_accepted_coupon / total_observations
percentage_declined_coupon = total_declined_coupon / total_observations
print(f'Total Observations: {total_observations}\nAccepted Coupon: {total_accepted_coupon}\nDeclined Coupon: {total_declined_coupon}\nAccepted Coupon Percentage: {percentage_accepted_coupon:.2%}\nDeclined Coupon Percentage: {percentage_declined_coupon:.2%}' )

### Proportion of the total observations that accept/decline the coupon
Out of the 12,684 observations, 7,210 users accepted the coupon accounting for 56.84% of the observations. On the flip side 5,474 users declined the coupon or 43.16%.
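Because `Y` is coded as 0 or 1, the same proportions can also be read off directly from the column mean. The snippet below is a minimal sketch on a made-up five-row frame (not the real data), just to show the idiom.

```python
import pandas as pd

# Hypothetical toy outcomes: 1 = accepted, 0 = declined.
toy = pd.DataFrame({"Y": [1, 1, 0, 1, 0]})

# The mean of a 0/1 column is the share of rows equal to 1.
acceptance_rate = toy["Y"].mean()

# value_counts(normalize=True) returns both shares at once.
shares = toy["Y"].value_counts(normalize=True)

print(f"Accepted: {acceptance_rate:.2%}")
print(shares)
```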

5. Use a bar plot to visualize the `coupon` column.

In [None]:
fig = px.bar(data, x='coupon', title='Coffee House & Restaurant(<20) coupons were offered the most', labels={'coupon': 'Coupon Type', 'count': 'Count'}, color='coupon', color_discrete_sequence=px.colors.qualitative.Bold)
fig = fig.update_traces(marker_line_width=0)
fig.show()
fig.write_image('images/coupon_bar_plot.png')

6. Use a histogram to visualize the temperature column.

In [None]:
data['temperature'].unique()

In [None]:
fig = px.histogram(data, x='temperature', title='Of three possible temperatures 80 is most common', labels={'temperature': 'Temperature', 'count': 'Count'}, nbins=3)
fig.show()
fig.write_image('images/temperature_histogram.png')

The problem description notes that the `temperature` column contains only 3 unique values. Temperature is normally a continuous variable, so this is unusual. Had it not been called out, the histogram above would have revealed it. We can read the values as 80 -> hot, 55 -> moderate, and 30 -> cold, so it is worth creating a new column that captures these categories.

In [None]:
data['temperature_category'] = data['temperature'].map({30: 'cold', 55: 'moderate', 80: 'hot'})

In [None]:
data.sample(2)

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [None]:
bar_coupons = data.query('coupon == "Bar"')
bar_coupons.sample(2)

2. What proportion of bar coupons were accepted?


In [None]:
total_observations = bar_coupons.shape[0]

total_accepted_coupon = bar_coupons.query('Y == 1').count().iloc[0]
total_declined_coupon = bar_coupons.query('Y == 0').count().iloc[0]
percentage_accepted_coupon = total_accepted_coupon / total_observations
percentage_declined_coupon = total_declined_coupon / total_observations
print(f'Total Bar Observations: {total_observations}\nAccepted Bar Coupon: {total_accepted_coupon}\nDeclined Bar Coupon: {total_declined_coupon}\nAccepted Bar Coupon Percentage: {percentage_accepted_coupon:.2%}\nDeclined Bar Coupon Percentage: {percentage_declined_coupon:.2%}' )

In [None]:
bar_coupon_count = bar_coupons[['Y']].value_counts()
fig = px.bar(bar_coupon_count.reset_index(), x='Y', y='count', title='More declined the bar coupon', labels={'Y': 'Acceptance', 'count': 'Count'})
fig.write_image('images/bar_coupon_acceptance.png')
fig.show()

It is interesting to note that, on average, those who received bar coupons were less likely to accept than those who received other coupons.
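One way to check a comparison like this across coupon types is to group by `coupon` and take the mean of `Y`. The sketch below uses a small invented frame, so the rates it prints are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real data has many more rows and coupon types.
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Bar", "Coffee House", "Coffee House",
               "Restaurant(<20)", "Restaurant(<20)"],
    "Y": [0, 0, 1, 1, 0, 1, 1],
})

# Acceptance rate per coupon type, lowest first.
rate_by_coupon = toy.groupby("coupon")["Y"].mean().sort_values()
print(rate_by_coupon)
```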

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [None]:
bar_coupons['Bar'].unique()

In [None]:
# Frequency labels: 3 or fewer visits a month vs. more than 3
three_or_fewer = ['never', 'less1', '1~3']
more_than_three = ['4~8', 'gt8']
total_3_or_fewer_bar = bar_coupons.query('Bar in @three_or_fewer')
total_gt_than_3_bar = bar_coupons.query('Bar in @more_than_three')

total_accepted_coupon_from_3_or_fewer = total_3_or_fewer_bar.query(f'Y == 1').count().iloc[0]
total_accepted_coupon_from_gt_than_3 = total_gt_than_3_bar.query(f'Y == 1').count().iloc[0]
percentage_3_or_fewer_accepted_coupon = total_accepted_coupon_from_3_or_fewer / total_3_or_fewer_bar.shape[0]
percentage_gt_than_3_accepted_coupon = total_accepted_coupon_from_gt_than_3 / total_gt_than_3_bar.shape[0]
print(total_3_or_fewer_bar.shape[0])
print(total_gt_than_3_bar.shape[0])
print(f'3 or Fewer Accepted Bar Coupon: {total_accepted_coupon_from_3_or_fewer}\nGreater than 3 Accepted Bar Coupon: {total_accepted_coupon_from_gt_than_3}')
print(f'3 Or Fewer Accepted Bar Coupon Percentage: {percentage_3_or_fewer_accepted_coupon:.2%}\nGreater than 3 Accepted Bar Coupon Percentage: {percentage_gt_than_3_accepted_coupon:.2%}' )

In [None]:
bar_avg_visit_coupon_ratio = bar_coupons.groupby('Bar')['Y'].value_counts(normalize=True).unstack()

fig = px.bar(bar_avg_visit_coupon_ratio, title='More likely to accept the bar coupon if you go more', labels={'Y': 'Acceptance', 'value': 'Proportion'})
fig.show()
fig.write_image('images/bar_coupon_acceptance_by_avg_visits.png')

As expected, those who visit bars more often are more likely to accept the coupon.

4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others.  Is there a difference?
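A possible way to set up this comparison is to build a boolean mask for the group of interest and compare the mean of `Y` inside and outside it. The sketch below runs on a made-up frame, not the real data. The helper `compare_acceptance`, the `age` labels, and the reading of "more than once a month" as the `Bar` labels `1~3`, `4~8`, and `gt8` are assumptions to verify against `data['age'].unique()` and `data['Bar'].unique()`.

```python
import pandas as pd

def compare_acceptance(df, mask):
    """Acceptance rate (mean of Y) for rows inside `mask` and for all others."""
    return pd.Series({
        "group": df.loc[mask, "Y"].mean(),
        "all others": df.loc[~mask, "Y"].mean(),
    })

# Hypothetical toy data with assumed label spellings.
toy = pd.DataFrame({
    "Bar": ["1~3", "gt8", "never", "less1", "4~8", "1~3"],
    "age": ["26", "31", "21", "50plus", "below21", "41"],
    "Y":   [1, 1, 0, 0, 1, 0],
})

more_than_once = ["1~3", "4~8", "gt8"]                # assumed reading of "more than once"
over_25 = ["26", "31", "36", "41", "46", "50plus"]    # assumed age labels

mask = toy["Bar"].isin(more_than_once) & toy["age"].isin(over_25)
rates = compare_acceptance(toy, mask)
print(rates)
```

The same helper works for any other condition: only the mask changes.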


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.
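The three conditions above can be expressed as separate boolean masks and then combined with `|`. The following is a sketch on invented rows only. Apart from `Bar`, `RestaurantLessThan20`, and `Y`, the column names (`passenger`, `maritalStatus`, `age`, `income`) and all label spellings are assumptions that should be checked against the real columns, as is the reading of "more than 4 times" as `4~8` or `gt8`.

```python
import pandas as pd

# Hypothetical toy data; column names and labels are assumptions.
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "gt8", "less1", "never", "4~8"],
    "passenger": ["Friend(s)", "Alone", "Kid(s)", "Alone", "Partner", "Alone"],
    "maritalStatus": ["Single", "Married partner", "Widowed", "Single",
                      "Married partner", "Unmarried partner"],
    "age": ["21", "41", "26", "50plus", "36", "31"],
    "RestaurantLessThan20": ["1~3", "4~8", "less1", "never", "1~3", "gt8"],
    "income": ["$50000 - $62499", "$25000 - $37499", "$75000 - $87499",
               "$100000 or More", "$62500 - $74999", "Less than $12500"],
    "Y": [1, 1, 0, 0, 0, 1],
})

bar_often = toy["Bar"].isin(["1~3", "4~8", "gt8"])
under_30 = toy["age"].isin(["below21", "21", "26"])
cheap_often = toy["RestaurantLessThan20"].isin(["4~8", "gt8"])
low_income = toy["income"].isin(["Less than $12500", "$12500 - $24999",
                                 "$25000 - $37499", "$37500 - $49999"])

cond_a = bar_often & (toy["passenger"] != "Kid(s)") & (toy["maritalStatus"] != "Widowed")
cond_b = bar_often & under_30
cond_c = cheap_often & low_income

rules = {"a": cond_a, "b": cond_b, "c": cond_c, "any": cond_a | cond_b | cond_c}
# Acceptance rate among the rows that satisfy each rule.
rates = pd.Series({name: toy.loc[m, "Y"].mean() for name, m in rules.items()})
print(rates)
```

Keeping each condition as its own named mask makes it easy to report the rate per rule as well as for the combined group.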



7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of drivers who accept the coupons.