### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor on whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios including the destination, current time, weather, passenger, etc., and then asks the person whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons -- less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

### Imports

In [None]:
import warnings

warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import random
import plotly.express as px
from IPython.display import Image
from typing import Union, List

### Data Load

#### Read in the `coupons.csv` file

In [None]:
df_in = pd.read_csv("./data/coupons.csv")

The function below generates a compact data frame whose number of rows equals the maximum number of unique values found in any column  
- Columns with fewer unique values than that maximum are padded with random draws from their own unique set  
- This frame scrambles the data within each row, so it is not meant for analysis, only for a quick look at every value that appears in the data

In [None]:
def random_unique_vals(series: pd.Series = None, nunique: int = 0) -> pd.Series:
    unique_vals = series.unique().tolist()
    return pd.Series(
        unique_vals + random.choices(unique_vals, k=nunique - len(unique_vals))
    )


def unique_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    random.seed(0)
    return df.apply(
        random_unique_vals, nunique=df.apply(lambda x: len(x.unique())).max()
    )
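
To make the padding behavior concrete, here is a minimal sketch that restates the two helpers and runs them on a tiny frame. The `toy` frame and its values are made up for illustration and are not taken from `coupons.csv`.

```python
import random

import pandas as pd


def random_unique_vals(series: pd.Series, nunique: int) -> pd.Series:
    # Unique values first, then random repeats to pad up to `nunique` rows
    unique_vals = series.unique().tolist()
    padding = random.choices(unique_vals, k=nunique - len(unique_vals))
    return pd.Series(unique_vals + padding)


def unique_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    random.seed(0)  # fixed seed so the padded values are reproducible
    return df.apply(
        random_unique_vals, nunique=df.apply(lambda x: len(x.unique())).max()
    )


# Hypothetical toy data: 3 unique weather values, 2 unique times
toy = pd.DataFrame(
    {
        "weather": ["Sunny", "Rainy", "Sunny", "Snowy"],
        "time": ["10AM", "10AM", "6PM", "6PM"],
    }
)

compact = unique_dataframe(toy)
print(compact)  # 3 rows: all weather values, and both times plus one random repeat
```

Note that a row of `compact` no longer corresponds to any row of `toy`, which is why the compact frame is only useful for eyeballing the values.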

#### Investigate the dataset for missing or problematic data

In [None]:
pd.set_option("display.max_columns", None)
display(unique_dataframe(df_in).head(5))

Just viewing the first few rows of the compact frame reveals several problems:  
- NaN values are present in several columns (`car`, `Bar`, etc.)  
- The values are often unwieldy. For example, in the `coupon` column, "Carry out & Take away" is cumbersome to query for 
- There is ostensibly numeric data hiding in awkward range-like strings, such as 4~8 meaning 4 to 8 visits per month  
- Many of the column names, such as `toCoupon_GEQ5min` and `RestaurantLessThan20`, are hard to interpret. A column name should be self-evident rather than require a separate description
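
The missing-value problem can also be quantified rather than eyeballed. The sketch below counts NaN per column on a made-up frame; the `toy` data is hypothetical and only mimics the kind of gaps described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a mostly-empty column and a sparse one
toy = pd.DataFrame(
    {
        "car": [np.nan, np.nan, "Scooter", np.nan],
        "Bar": ["never", "4~8", np.nan, "1~3"],
        "Y": [1, 0, 1, 1],
    }
)

# Count and fraction of missing values per column, worst first
missing = pd.DataFrame(
    {"n_missing": toy.isna().sum(), "frac_missing": toy.isna().mean()}
).sort_values(by="n_missing", ascending=False)
print(missing)
```

Running the same two lines on `df_in` would show which columns are candidates for dropping and which only need a few rows removed.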

### Data Cleaning

#### Decide what to do about your missing data -- drop, replace, other...

#### Change Summary

Here is a summary of changes I made. A complete description is in the file in ``./data/coupons_cleanup_dict.json``

##### Dropped
- `car` because out of 12,000 rows in the data frame, only about 100 were non-null
- `direction_opp` redundant because it is the logical negation of `direction_same`
- - i.e. `all(df_in.direction_opp == (1 - df_in.direction_same))`
- - - using `1 - val` rather than `~val` because `direction_opp` is int not bool
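
A small sketch of that redundancy check, and of why `~` does not work on integer flags. The `toy` frame is made up; only the two column names come from the data set.

```python
import pandas as pd

# Hypothetical toy frame with the two direction flags stored as 0/1 integers
toy = pd.DataFrame(
    {"direction_same": [1, 0, 0, 1], "direction_opp": [0, 1, 1, 0]}
)

# True only if every row satisfies direction_opp == 1 - direction_same
is_redundant = bool((toy["direction_opp"] == 1 - toy["direction_same"]).all())
print(is_redundant)

# On integer columns ~ is a bitwise NOT, so ~1 is -2 and ~0 is -1
bitwise_not = (~toy["direction_same"]).tolist()
print(bitwise_not)
```
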
##### Renamed - various renames to make the column names self-documenting
- `passanger` renamed `passenger`
- `toCoupon_GEQ<N>min` renamed `AtLeast<N>MinutesDriveToRedeem`
- `income` renamed `AvgAnnualIncome`
- `Bar`, `CoffeeHouse`, `CarryAway`, `Restaurant`* gained prefix `AvgMonthlyVisitsTo`
- - For example `Bar` renamed `AvgMonthlyVisitsToBar`
- - I refer to these as "frequency of visit" columns
##### Numeric Range Strings Converted to Floats
- Columns such as `income` and all the "frequency of visit" columns contained ranges, each of which I converted to the mean of its two endpoints
- - For example the visit frequency "4~8" was converted to 6.0
- - This made them easy to exploit as numeric values
##### Simplified
- Multi-word cell values were reduced to single-word values with equivalent meaning wherever possible
- - For example "Married partner" was simplified to "Married"
- Suffixes which add no additional meaning were removed, e.g. `Kid(s)` was shortened to `Kid`

#### Perform Cleanups

##### Cleanup Functions

In [None]:
# Convert a string representing a range to its mean, e.g. str("1 - 2") -> float(1.5)
def range_to_mean(element: str, sep: str = "-") -> float:
    if isinstance(element, str):
        return np.array(list(map(float, element.replace(" ", "").split(sep)))).mean()
    else:
        return element
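
As a quick illustration, the sketch below restates the function so it runs on its own and applies it to a few inputs. The `sep="~"` call assumes the separator is passed explicitly for strings like "4~8"; how the cleanup dictionary actually handles those strings is not shown here.

```python
import numpy as np


def range_to_mean(element, sep: str = "-") -> float:
    # Strings such as "1 - 2" become the mean of their endpoints;
    # anything that is not a string (floats, NaN) passes through unchanged
    if isinstance(element, str):
        bounds = [float(part) for part in element.replace(" ", "").split(sep)]
        return float(np.mean(bounds))
    return element


print(range_to_mean("1 - 2"))         # 1.5
print(range_to_mean("4~8", sep="~"))  # 6.0
print(range_to_mean(3.0))             # 3.0
```

Open-ended strings with no numeric endpoints would raise a `ValueError` in `float()`, so they need to be replaced with numeric text before this function is applied.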

In [None]:
# Use a context manager so the file handle is closed after loading
with open("./data/coupons_cleanup_dict.json", "r") as f:
    cleanup_dict = json.load(f)


def cleanup_series(
    series: pd.Series = None,
) -> pd.Series:
    # Get cleanup dict for this series or return
    if series.name not in cleanup_dict:
        return series

    column_cleanup_dict = cleanup_dict[series.name]
    if "key" in column_cleanup_dict:
        # Take a different cleanup if a key is provided
        column_cleanup_dict = cleanup_dict[column_cleanup_dict["key"]]

    # Substring replace (many to one) with regex
    if "str.replace" in column_cleanup_dict:
        pattern = "|".join(column_cleanup_dict["str.replace"][0])
        series = series.str.replace(
            pattern, column_cleanup_dict["str.replace"][1], regex=True
        )

    # Whole-word replace (one to one) without regex
    if "replace" in column_cleanup_dict:
        series = series.replace(column_cleanup_dict["replace"], regex=False)

    # Apply a function
    if "apply" in column_cleanup_dict:
        series = series.apply(eval(column_cleanup_dict["apply"]))

    return series

##### Run the Cleanup On all Columns

In [None]:
# Extract iterables for dropping and renaming columns
drop_list = [key for key in cleanup_dict if "drop" in cleanup_dict[key]]
rename_dict = {
    key: cleanup_dict[key]["rename"]
    for key in cleanup_dict
    if "rename" in cleanup_dict[key]
}

# Perform the cleanups per column as defined by cleanups dict
df = (
    df_in.copy(deep=True)
    .drop(drop_list, axis=1)
    .apply(cleanup_series)
    .rename(columns=rename_dict)
)

The awkward extraction for the `drop_list` and `rename_dict` results from the way I organized the data in the `cleanup_dict` as

```
{
Column-A:{drop:T/F, rename:rename-to, cleanup:yada-yada-yada},
...
Column-Z:{drop:T/F, rename:rename-to, cleanup:blah-blah-blah},
}
```

which makes it easy to look at by column (tell me everything you did to column X)  
but harder to look at by cleanup category (what columns did you rename?)  

This column-wise organization is appealing, even if it costs a bit of cumbersome code when looking across columns.
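
The trade-off can be seen on a small example. The dictionary below is a hypothetical excerpt: the key names (`drop`, `rename`, `replace`) match what the cleanup code looks for, but the entries themselves are made up and are not the contents of `coupons_cleanup_dict.json`.

```python
# Hypothetical excerpt of a column-wise cleanup dictionary
cleanup_dict = {
    "car": {"drop": True},
    "passanger": {"rename": "passenger", "replace": {"Kid(s)": "Kid"}},
    "Bar": {"rename": "AvgMonthlyVisitsToBar"},
}

# "Everything done to one column" is a single lookup
print(cleanup_dict["passanger"])

# "Which columns were dropped or renamed?" needs a pass over all columns
drop_list = [key for key in cleanup_dict if "drop" in cleanup_dict[key]]
rename_dict = {
    key: spec["rename"] for key, spec in cleanup_dict.items() if "rename" in spec
}
print(drop_list)
print(rename_dict)
```
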

#### Finish Cleanup Processing and Save Data

##### Drop Remaining NaN's and Display Final Compares

In [None]:
df.dropna(inplace=True)

display(unique_dataframe(df_in).head(5))
display(unique_dataframe(df).head(5))

The display shows that many multi-word values were simplified, range strings and "numeric strings" were converted to numeric values, NaN's were removed, and many columns gained readily understood names

### Initial Analyses

##### What proportion of the total observations chose to accept the coupon?

In [None]:
def acceptance_rate(
    df: pd.DataFrame = None,
    mask: Union[pd.Series, None] = None,
) -> Union[float, List[float]]:
    # Without a mask: percentage of rows with Y == 1
    if mask is None:
        return len(df.query("Y == 1")) / len(df) * 100.0
    # With a mask: [rate where mask is True, rate where mask is False]
    return [acceptance_rate(df[mask]), acceptance_rate(df[~mask])]

In [None]:
acceptance_rate(df)

About 57% of all observations resulted in an accepted coupon
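
Because `Y` is coded as 0 or 1, the acceptance rate is just the mean of the column. The sketch below shows that equivalent formulation, with and without a mask, on made-up data; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: Y == 1 means the coupon was accepted
toy = pd.DataFrame(
    {
        "coupon": ["Bar", "Bar", "Coffee", "Coffee", "Coffee"],
        "Y": [1, 0, 1, 1, 0],
    }
)

# The mean of a 0/1 column is the proportion of ones
overall = toy["Y"].mean() * 100.0

# Rates for the rows inside and outside a boolean mask
mask = toy["coupon"] == "Bar"
inside = toy.loc[mask, "Y"].mean() * 100.0
outside = toy.loc[~mask, "Y"].mean() * 100.0
print(overall, inside, outside)
```
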

##### Use a bar plot to visualize the `coupon` column

This is a bar plot of the number of coupons offered per establishment type  
The plot shows that more coupons were offered for coffee houses than any other establishment

In [None]:
df.groupby("coupon")["Y"].agg(["count"]).reset_index().sort_values(
    by="count", ascending=False
).plot(
    x="coupon",
    y="count",
    kind="bar",
    legend=False,
    xlabel="Establishment Type",
    ylabel="Coupons Offered",
    title="Coupons Offered per Establishment Type",
    grid=True,
    rot=-45,
);

##### Use a histogram to visualize the temperature column

This is a histogram of the temperatures during days when coupons were offered  
The plot shows that coupons were offered most often on days when the temperature was 80 degrees

In [None]:
df.hist("temperature", grid=True)
plt.xlabel("Temperature (degrees F)")
plt.ylabel("Counts")
plt.title("Temperature Histogram");

### Investigating the Bar Coupons

Now, we will lead you through an exploration of just the bar related coupons.  

##### 1. Create a new `DataFrame` that contains just the bar coupons

In [None]:
df_bar = df.query("coupon == 'Bar'")

##### 2. What proportion of bar coupons were accepted?

In [None]:
acceptance_rate(df_bar)

41% of bar coupons were accepted

##### 3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more


In [None]:
acceptance_rate(df_bar, mask=df_bar["AvgMonthlyVisitsToBar"] <= 3)

People who went to a bar 3 or fewer times a month accepted at about half the rate (37%) of those who went more often (76%)

##### 4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others - is there a difference?

In [None]:
mask_more_than_once = df_bar["AvgMonthlyVisitsToBar"] > 1
acceptance_rate(df_bar, mask=mask_more_than_once & (df_bar["age"] > 25))

Yes, there is a big difference  
Drivers who go to a bar more than once a month and are over the age of 25 accepted coupons more than twice as often (69%) as all others (34%)

##### 5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry (FF&F)

In [None]:
# Show that the last condition is irrelevant
assert df_bar[
    mask_more_than_once
    & (df_bar["passenger"] != "Kid")
    & (df_bar["occupation"] == "Farming Fishing & Forestry")
].empty

acceptance_rate(
    df_bar,
    mask=mask_more_than_once
    & (df_bar["passenger"] != "Kid")
    & (df_bar["occupation"] != "Farming Fishing & Forestry"),
)

Drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than FF&F accepted the coupons more than twice as often (71%) as all others (30%)

The last condition - occupations other than FF&F - has no impact on the results because once reduced to the first 2 conditions, no one had an occupation of FF&F

##### 6. Compare the acceptance rates between those drivers who:

- Go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- Go to bars more than once a month and are under the age of 30 *OR*
- Go to cheap restaurants more than 4 times a month and income is less than 50K

In [None]:
rate1 = acceptance_rate(
    df_bar[mask_more_than_once].query(
        "passenger != 'Kid' and maritalStatus != 'Widowed'"
    )
)
rate2 = acceptance_rate(df_bar[mask_more_than_once].query("age < 30"))
rate3 = acceptance_rate(
    df_bar.query("AvgMonthlyVisitsToRestaurantLessThan20 > 4 & AvgAnnualIncome < 50e3")
)
display([rate1, rate2, rate3])

Drivers who go to bars more than once a month accept coupons about 70% of the time, but drivers with incomes less than 50k who frequently visit cheaper restaurants only accepted 45% of coupons
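
Since the question joins the three conditions with *OR*, another possible reading is to pool them into a single group and compare it with everyone else. The sketch below shows that alternative on made-up rows. The `toy` frame is hypothetical, uses the cleaned column names from this notebook, and assumes `age` is numeric after cleanup.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real data
toy = pd.DataFrame(
    {
        "AvgMonthlyVisitsToBar": [2.0, 0.0, 6.0, 0.5, 2.0],
        "passenger": ["Alone", "Kid", "Friend", "Alone", "Kid"],
        "maritalStatus": ["Single", "Married", "Single", "Widowed", "Married"],
        "age": [23, 41, 35, 50, 26],
        "AvgMonthlyVisitsToRestaurantLessThan20": [2.0, 6.0, 2.0, 0.5, 2.0],
        "AvgAnnualIncome": [30e3, 40e3, 90e3, 60e3, 80e3],
        "Y": [1, 0, 1, 0, 1],
    }
)

frequent = toy["AvgMonthlyVisitsToBar"] > 1
cond1 = frequent & (toy["passenger"] != "Kid") & (toy["maritalStatus"] != "Widowed")
cond2 = frequent & (toy["age"] < 30)
cond3 = (toy["AvgMonthlyVisitsToRestaurantLessThan20"] > 4) & (
    toy["AvgAnnualIncome"] < 50e3
)

# Union of the three groups versus all remaining drivers
any_condition = cond1 | cond2 | cond3
rate_in = toy.loc[any_condition, "Y"].mean() * 100.0
rate_out = toy.loc[~any_condition, "Y"].mean() * 100.0
print(rate_in, rate_out)
```

The three separate rates above answer "how does each group behave", while the pooled mask answers "how does anyone matching at least one condition behave".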

##### 7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

I hypothesize that:

- These drivers are probably younger on average and perhaps single, with no children, essentially "young and single".  

- Interestingly, price-consciousness alone does not predict acceptance of bar coupons. For example, drivers who go to cheaper restaurants more than 4 times per month with income less than $50k might be described as "price conscious" and yet they accepted bar coupons only 45% of the time

### Independent Investigation: Investigating the Coffee House Coupons

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons

I decided to focus on the coffee house coupons because that establishment type has the largest count

##### Create a new `DataFrame` that contains just the coffee coupons

In [None]:
df_coffee = df.query("coupon == 'CoffeeHouse'")

#### Acceptance Rates of Coffee Coupon by Single Categories

In [None]:
def compute_grouped_acceptance_rates(
    df: pd.DataFrame = None, columns: Union[str, List[str]] = ""
) -> pd.DataFrame:
    # One row per group: acceptance rate (%) and group size, lowest rate first
    grouped = df.groupby(columns)
    return (
        pd.DataFrame(
            {
                "Acceptance Rate": grouped.apply(acceptance_rate),
                "Count": grouped.size(),
            }
        )
        .reset_index()
        .sort_values(by="Acceptance Rate", ascending=True)
    )
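
To show the shape of the result, here is a minimal standalone sketch of the same idea on made-up data. It uses the mean of `Y` instead of calling `acceptance_rate`, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two occupations with different group sizes
toy = pd.DataFrame(
    {
        "occupation": ["Student", "Student", "Retired", "Retired", "Retired"],
        "Y": [1, 1, 0, 1, 0],
    }
)

accepted_by_group = toy.groupby("occupation")["Y"]
grouped = (
    pd.DataFrame(
        {
            "Acceptance Rate": accepted_by_group.mean() * 100.0,
            "Count": accepted_by_group.size(),
        }
    )
    .reset_index()
    .sort_values(by="Acceptance Rate", ascending=True)
)
print(grouped)
```

Keeping `Count` next to the rate makes it easy to spot groups whose high rate rests on only a handful of rows.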

In [None]:
def plot_grouped_acceptance_rates(
    df: pd.DataFrame = None,
    show_dynamic: bool = False,
) -> bytes:
    # The grouping variables are every column except the last two
    # ("Acceptance Rate" and "Count"); use them to label the plot
    columns = list(df.columns[:-2])

    # Plot it
    fig = px.bar(
        df,
        x=columns[0] if len(columns) == 1 else None,
        y="Acceptance Rate",
        color="Count",
        title="Acceptance Rate of Coffee Coupon by " + " and ".join(columns),
        height=600,
        text_auto="d",
    )

    fig.update_layout(xaxis_type="category")

    if show_dynamic:
        fig.show()

    return fig.to_image(format="png", width=1200, scale=2)

##### By Occupation

Here is a bar plot of acceptance rate vs. occupation. The color is the count (sample size) in the category analyzed. This is helpful for recognizing that a category may have a high rate of acceptance but only a limited number of samples in the result.

In [None]:
Image(
    plot_grouped_acceptance_rates(
        compute_grouped_acceptance_rates(df_coffee, columns="occupation")
    )
)

Those in the healthcare occupation had the highest rate of acceptance (76%).  
Surprisingly, retirees were among the least likely to accept at 40%
- One might guess retirees have both more time and less income than all others and therefore would accept coupons more often

##### By Number of Visits to a Coffee House in the Last Month

In [None]:
Image(
    plot_grouped_acceptance_rates(
        compute_grouped_acceptance_rates(
            df_coffee, columns="AvgMonthlyVisitsToCoffeeHouse"
        )
    )
)

Not too surprisingly, drivers who on average visited a coffee house at least 6 times in the last month accepted often - about 67% of the time

##### By Income

In [None]:
Image(
    plot_grouped_acceptance_rates(
        compute_grouped_acceptance_rates(
            df_coffee, columns="AvgAnnualIncome"
        ).sort_values(by="AvgAnnualIncome", ascending=True)
    )
)

The acceptance rate vs income reveals no consistent trend. Acceptance rates over 50% were seen at both extrema as well as the center of the income scale

#### Acceptance Rates of Coffee Coupon by Multiple Categories

From the above results, healthcare workers and frequent coffee house visitors each accept coupons at a fairly high rate.  
Combine these variables to look for even higher rates

In [None]:
display(
    compute_grouped_acceptance_rates(
        df_coffee, columns=["occupation", "AvgMonthlyVisitsToCoffeeHouse"]
    ).tail(10)
)

Healthcare workers who visited a coffee house 2 times in the last month accepted 90% of coupons. However, the sample size of 30 is quite small.  
Those who visited 6 times per month accepted 100% of the coupons, but the sample size of 6 was smaller still.  
Combine those 2 results for the healthcare workers to form a larger sample

In [None]:
visits = [2, 6]
display(
    compute_grouped_acceptance_rates(
        df_coffee.query("AvgMonthlyVisitsToCoffeeHouse in @visits"),
        columns=["occupation"],
    ).tail(5)
)

Healthcare workers who visited coffee houses 2 or 6 times accepted 92% of coupons
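
As a sanity check, the pooled figure should equal the count-weighted average of the two group rates quoted above. The sketch below redoes that arithmetic from the rounded numbers in the text, so it is only approximate.

```python
# Figures quoted in the text: visits per month -> (acceptance rate in %, sample size)
groups = {2: (90.0, 30), 6: (100.0, 6)}

# Recover the (approximate) number of accepted coupons in each group and pool them
accepted = sum(rate / 100.0 * n for rate, n in groups.values())
total = sum(n for _, n in groups.values())
pooled_rate = accepted / total * 100.0
print(round(pooled_rate, 1))  # 91.7
```

That is about 33 accepted out of 36, which rounds to the 92% shown in the table. The combined sample is still small, so the rate should be read with caution.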

##### View results from a similar experiment as a bar graph

In this example, restrict the results to groups with sample sizes greater than 100

In [None]:
gf = compute_grouped_acceptance_rates(
    df_coffee, columns=["education", "AvgMonthlyVisitsToCoffeeHouse", "maritalStatus"]
).query("Count > 100")
Image(plot_grouped_acceptance_rates(gf))

Display the grouped frame to understand the indices in terms of the variables

In [None]:
display(gf)

Single people with some college who visited coffee houses twice accepted 68% of coupons