### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor on whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios including the destination, current time, weather, passenger, etc., and then asks the respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\$20 - \$50).
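As a minimal sketch of the labeling rule described above, the snippet below maps survey answers to `Y`. The answer strings and the helper `label_answer` are illustrative assumptions; the dataset itself already ships with `Y` encoded.

```python
# Hypothetical answer strings; the exact wording stored in the raw survey is an assumption.
ACCEPT_ANSWERS = {"right away", "later before the coupon expires"}


def label_answer(answer: str) -> int:
    """Return 1 if the answer counts as accepting the coupon, otherwise 0."""
    return 1 if answer.strip().lower() in ACCEPT_ANSWERS else 0


answers = [
    "Right away",
    "no, I do not want the coupon",
    "later before the coupon expires",
]
labels = [label_answer(a) for a in answers]
print(labels)
```

Both "accept" answers collapse into a single positive class, so `Y` cannot distinguish an immediate detour from a later visit.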

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons.  To explore the data you will utilize your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [None]:
%pip install seaborn



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [None]:
data = pd.read_csv('data/coupons.csv')

In [None]:
print(data)

In [None]:
data.head()

2. Investigate the dataset for missing or problematic data.

In [None]:
# Investigating missing data
missingData = data.isna().sum()

# Select the columns that have at least one missing value
missing_columns = missingData[missingData > 0]

sns.barplot(x=missing_columns.index, y=missing_columns.values, palette='viridis')
plt.title('Number of Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
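Raw counts can be hard to judge without the row total, so a share of missing values per column is a useful companion to the plot. The sketch below uses a small made-up frame (`toy` is hypothetical, not the coupon data) to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the coupon data
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "Scooter"],
    "Bar": ["never", np.nan, "1~3", "less1"],
    "Y": [1, 0, 1, 1],
})

# isna() gives booleans, so the column mean is the fraction of missing rows
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is missing in most rows is a candidate for dropping, while a column with only a small missing share is a candidate for imputation.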

3. Decide what to do about your missing data -- drop, replace, other...

In [None]:
# Based on the null counts above, drop the 'car' column
data_cleanv2 = data.drop(columns=['car'])

# The remaining columns with missing data are categorical (string-valued),
# so fill their missing values with the most frequently occurring value (the mode)
missing_columns_excluding_car = missing_columns.drop('car', errors='ignore').index.tolist()

for col in missing_columns_excluding_car:
    mode_value = data[col].mode()[0]  # Most frequent value in the column
    data_cleanv2[col] = data_cleanv2[col].fillna(mode_value)  # Assign back instead of a chained inplace fill
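The same mode imputation can be written without a loop. This is a minimal sketch on a hypothetical frame (`toy`), assuming every column with gaps is categorical as in the cell above.

```python
import pandas as pd

# Hypothetical toy frame with gaps in two categorical columns
toy = pd.DataFrame({
    "Bar": ["never", None, "never", "1~3"],
    "CoffeeHouse": ["less1", "less1", None, "gt8"],
})

# DataFrame.mode() returns the modes per column; the first row holds the
# most frequent value of each column, and fillna aligns it by column name
filled = toy.fillna(toy.mode().iloc[0])
print(filled)
```

Mode imputation keeps every row but slightly inflates the most common category, which is worth remembering when comparing acceptance rates across categories later.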


4. What proportion of the total observations chose to accept the coupon?



In [None]:

# Calculate the proportion of total observations that chose to accept the coupon
total_accepted = data['Y'].sum()  # Count of accepted coupons
total_observations = len(data)    # Total number of observations
proportion_accepted = total_accepted / total_observations

print(proportion_accepted)
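Because `Y` is coded 0/1, its mean is the acceptance proportion. A short sketch on hypothetical values shows that this agrees with the sum-over-length calculation and with `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical 0/1 outcomes
toy = pd.DataFrame({"Y": [1, 0, 1, 1, 0]})

proportion_accepted = toy["Y"].mean()            # same as sum / len for 0/1 data
shares = toy["Y"].value_counts(normalize=True)   # share of each outcome

print(f"Accepted: {proportion_accepted:.1%}")
print(shares)
```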

5. Use a bar plot to visualize the `coupon` column.

In [None]:

sns.barplot(x=data['coupon'].value_counts().index, y=data['coupon'].value_counts().values)
plt.title('Coupon Type Distribution')
plt.xlabel('Coupon Type')
plt.ylabel('Count')
plt.xticks(rotation=45)

6. Use a histogram to visualize the temperature column.

In [None]:
# The temperature column has three distinct values, so three bins give one bar per value
sns.histplot(data=data, x='temperature', bins=3)
plt.title('Temperature Distribution')
plt.xlabel('Temperature (F)')
plt.ylabel('Count')
plt.xticks(rotation=45)

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [None]:
df_barCoupons=data[data['coupon']=='Bar']


2. What proportion of bar coupons were accepted?


In [None]:
df_barCoupons_yes=df_barCoupons['Y'].sum()
df_barCoupons_proportionAccepted=df_barCoupons_yes/len(df_barCoupons)

print(df_barCoupons_proportionAccepted)
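To put the bar coupon rate in context, the acceptance rate of every coupon type can be computed in one pass with `groupby`. The frame below is a hypothetical stand-in for the coupon data.

```python
import pandas as pd

# Hypothetical toy rows; the real data has five coupon types
toy = pd.DataFrame({
    "coupon": ["Bar", "Bar", "Coffee House", "Coffee House", "Coffee House", "Bar"],
    "Y": [1, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 target within each coupon type is that type's acceptance rate
rate_by_coupon = toy.groupby("coupon")["Y"].mean().sort_values(ascending=False)
print(rate_by_coupon)
```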

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [None]:
# Note: these rates are computed over all rows in `data`, not only the bar coupons in `df_barCoupons`
# Drivers who go to a bar 3 or fewer times a month
bar_3_or_fewer = data[data['Bar'].isin(['never', 'less1', '1~3'])]
bar_3_or_fewer_proportion = bar_3_or_fewer['Y'].mean()
print(bar_3_or_fewer_proportion)

# Drivers who go to a bar more than 3 times a month
bar_more_than_3 = data[data['Bar'].isin(['4~8', 'gt8'])]
bar_more_than_3_proportion = bar_more_than_3['Y'].mean()
print(bar_more_than_3_proportion)

# Comparing the two groups, bar_more_than_3 has the higher acceptance rate
print('Those who went more than 3 times have a higher acceptance rate')

4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to the all others.  Is there a difference?


In [None]:
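# Convert the age labels to numbers. 'below21' and '50plus' are mapped to 20 and 50
# so those rows stay in the age comparisons instead of being coerced to NaN.
data['age'] = pd.to_numeric(data['age'].replace({'below21': '20', '50plus': '50'}), errors='coerce')

# Drivers who go to a bar more than once a month ('1~3', '4~8', 'gt8') and are over 25
bar_more_than_once_and_over_25 = data['Bar'].isin(['1~3', '4~8', 'gt8']) & (data['age'] > 25)
drivers_bar_more_than_once_and_over_25 = data[bar_more_than_once_and_over_25]
print(drivers_bar_more_than_once_and_over_25)

drivers_bar_more_than_once_and_over_25_proportion = drivers_bar_more_than_once_and_over_25['Y'].mean()
print(drivers_bar_more_than_once_and_over_25_proportion)

# Acceptance rate for all other drivers
acceptance_rate_others = data[~bar_more_than_once_and_over_25]['Y'].mean()
print(acceptance_rate_others)

# Report which group has the higher acceptance rate
if drivers_bar_more_than_once_and_over_25_proportion > acceptance_rate_others:
    print('Yes, there is a difference: drivers who go to a bar more than once a month and are over 25 have a higher acceptance rate')
else:
    print('Drivers who go to a bar more than once a month and are over 25 do not have a higher acceptance rate than all others')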
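The "group versus all others" comparison recurs in the following prompts, so it can be wrapped in a small helper. This is a sketch only: `compare_acceptance` and the toy frame are hypothetical names introduced for illustration, and the ages are assumed to be numeric already.

```python
import pandas as pd


def compare_acceptance(df, mask, target="Y"):
    """Return (rate inside the mask, rate outside the mask) for a 0/1 target column."""
    return df.loc[mask, target].mean(), df.loc[~mask, target].mean()


# Hypothetical toy rows
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "gt8", "less1", "4~8", "never"],
    "age": [31, 26, 21, 46, 36, 50],
    "Y": [1, 0, 1, 0, 1, 1],
})

# More than once a month at a bar AND over 25
mask = toy["Bar"].isin(["1~3", "4~8", "gt8"]) & (toy["age"] > 25)
in_rate, out_rate = compare_acceptance(toy, mask)
print(f"group: {in_rate:.2f}, others: {out_rate:.2f}")
```

Building the boolean mask once and negating it with `~` guarantees that the two groups are exact complements.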

5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [None]:
# Drivers who go to bars more than once a month, had passengers that were not kids,
# and have occupations other than farming, fishing, or forestry.
# The occupation label is matched with and without commas because its exact
# spelling in the CSV is an assumption.
special_group_mask = (data['Bar'].isin(['1~3', '4~8', 'gt8']) &
                      (data['passanger'] != 'Kid(s)') &
                      (~data['occupation'].isin(['Farming Fishing & Forestry', 'Farming, Fishing, & Forestry'])))

drivers_bar_with_conditions = data[special_group_mask]

# Calculate acceptance rate for this group
acceptance_rate_special_group = drivers_bar_with_conditions['Y'].mean()

# Calculate acceptance rate for all others (the exact complement of the mask)
acceptance_rate_all_others = data[~special_group_mask]['Y'].mean()

print(f"Acceptance Rate for Drivers Who Go to Bars More Than Once a Month (Special Group): {acceptance_rate_special_group:.2f}")
print(f"Acceptance Rate for All Other Drivers: {acceptance_rate_all_others:.2f}")

# Determine if the special group has a higher acceptance rate
if acceptance_rate_special_group > acceptance_rate_all_others:
    print("\nObservation: Drivers who go to bars more than once a month, do not have 'Kid(s)' as passengers, and have occupations other than 'Farming, Fishing, & Forestry' show a higher acceptance rate compared to all other drivers.")
else:
    print("\nObservation: The special group does not show a higher acceptance rate than all other drivers.")
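The comparison above looks only at which raw rate is larger. One common way to check whether a gap is bigger than chance would suggest is a two-proportion z-test; it is not part of the original analysis and is offered here as a hedged sketch. The function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(accepted_a, n_a, accepted_b, n_b):
    """Two-sided z-test for the difference between two acceptance proportions."""
    p_a = accepted_a / n_a
    p_b = accepted_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (accepted_a + accepted_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 70 of 100 accepted in one group, 50 of 100 in the other
z, p_value = two_proportion_z_test(70, 100, 50, 100)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With the real data, the counts would come from `Y.sum()` and `len(...)` of each group. The normal approximation assumes reasonably large groups.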


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [None]:
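# Filter 1: Drivers who go to bars more than once a month, had passengers that were not a kid, and were not widowed
filter_1 = data[(data['Bar'].isin(['1~3', '4~8', 'gt8'])) & 
                (data['passanger'] != 'Kid(s)') & 
                (data['maritalStatus'] != 'Widowed')]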

# Acceptance rate for filter 1
acceptance_rate_filter_1 = filter_1['Y'].mean()

# Filter 2: Drivers who go to bars more than once a month and are under the age of 30
filter_2 = data[(data['Bar'].isin(['1~3', '4~8', 'gt8'])) & 
                (data['age'] < 30)]

# Acceptance rate for filter 2
acceptance_rate_filter_2 = filter_2['Y'].mean()

# Filter 3: Drivers who go to cheap restaurants more than 4 times a month and income is less than 50K
# Remove commas and extract the first number in each income label as a float
data['income_numeric'] = data['income'].str.replace(',', '', regex=False).str.extract(r'(\d+)', expand=False).astype(float)

# Filter for cheap restaurants and income
filter_3 = data[(data['RestaurantLessThan20'].isin(['4~8', 'gt8'])) & 
                (data['income_numeric'] < 50000)]

# Acceptance rate for filter 3
acceptance_rate_filter_3 = filter_3['Y'].mean()
# Print the calculated acceptance rates and observations
print(f"Acceptance Rate for Filter 1 (Drivers who go to bars 1~3 times or more, do not have Kid(s) as passengers, and are not widowed): {acceptance_rate_filter_1:.2f}")
print(f"Acceptance Rate for Filter 2 (Drivers who go to bars more than once a month and are under the age of 30): {acceptance_rate_filter_2:.2f}")
print(f"Acceptance Rate for Filter 3 (Drivers who go to cheap restaurants more than 4 times a month and have income less than 50K): {acceptance_rate_filter_3:.2f}")

# Observation based on comparison of the three groups
print("\n### Observations ###")

if acceptance_rate_filter_1 > acceptance_rate_filter_2 and acceptance_rate_filter_1 > acceptance_rate_filter_3:
    print("Observation: The highest acceptance rate is for Filter 1. Drivers who go to bars frequently, do not have Kid(s) as passengers, and are not widowed are more likely to accept coupons.")
elif acceptance_rate_filter_2 > acceptance_rate_filter_1 and acceptance_rate_filter_2 > acceptance_rate_filter_3:
    print("Observation: The highest acceptance rate is for Filter 2. Younger drivers (under 30) who go to bars more than once a month are more likely to accept coupons.")
else:
    print("Observation: The highest acceptance rate is for Filter 3. Drivers who go to cheap restaurants more than 4 times a month and have an income less than 50K are more likely to accept coupons.")
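The income filter depends on turning bracket labels into numbers. The sketch below shows how the first number in each label is extracted; the label strings are assumed to resemble those in the `income` column and are hypothetical here.

```python
import pandas as pd

# Hypothetical income labels in the style of the survey brackets
income = pd.Series([
    "Less than $12500",
    "$37500 - $49999",
    "$50000 - $62499",
    "$100000 or More",
])

# Take the first run of digits in each label; expand=False returns a Series
first_number = (
    income.str.replace(",", "", regex=False)
          .str.extract(r"(\d+)", expand=False)
          .astype(float)
)
below_50k = first_number < 50000
print(pd.DataFrame({"income": income, "first_number": first_number, "below_50k": below_50k}))
```

For range labels the first number is the lower bound, and for a "Less than" label it is the upper bound. Both land on the correct side of a 50K threshold only because that threshold coincides with a bracket edge.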

7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [None]:
#Acceptance Rate for Each Group:

#Filter 1: Drivers who go to bars frequently (1~3 times or more), do not have Kid(s) as passengers, and are not widowed have an acceptance rate of 62%.
#Filter 2: Younger drivers (under 30) who go to bars more than once a month have a slightly higher acceptance rate of 63%.
#Filter 3: Drivers who go to cheap restaurants more than 4 times a month and have an income less than 50K have an acceptance rate of 60%.
#Conclusion:
#The group with the highest acceptance rate is Filter 2 (Younger drivers under 30 who go to bars more than once a month).
#This suggests that younger individuals who frequent bars are more likely to accept coupons, making them a prime target demographic for bar-related promotions.

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

In [None]:
# Calculate the acceptance rate (proportion of Y = 1) for each value in each column except 'Y' and 'coupon'
columns_to_analyze = [col for col in data_cleanv2.columns if col not in ['Y', 'coupon']]  # Exclude 'Y' and 'coupon' columns

# Dictionary to store maximum acceptance rate and corresponding column and value
max_acceptance_info = {'column': None, 'value': None, 'acceptance_rate': 0}

# Iterate through each column to find acceptance rates for each unique value
for col in columns_to_analyze:
    # Group by the column and calculate acceptance rate for each unique value
    acceptance_rate_by_value = data_cleanv2.groupby(col)['Y'].mean()
    
    # Find the maximum acceptance rate for this column
    max_value = acceptance_rate_by_value.idxmax()  # Value with highest acceptance rate
    max_rate = acceptance_rate_by_value.max()      # Maximum acceptance rate
    
    # Update if this column's value has a higher acceptance rate than the previous maximum
    if max_rate > max_acceptance_info['acceptance_rate']:
        max_acceptance_info['column'] = col
        max_acceptance_info['value'] = max_value
        max_acceptance_info['acceptance_rate'] = max_rate
# We analyze each column/feature to see which one has the maximum acceptance rate, as that gives an overview of which factor contributes most to coupon acceptance
print(max_acceptance_info)


# After inspecting max_acceptance_info, the education feature had the maximum coupon acceptance value, so it is analyzed further
education_acceptance_rate = data.groupby('education')['Y'].mean()

plt.figure(figsize=(10, 6))
sns.barplot(x=education_acceptance_rate.index, y=education_acceptance_rate.values, palette='viridis')
plt.title('Acceptance Rate by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Acceptance Rate')
plt.xticks(rotation=45)

# Based on the results, respondents in the high school graduate category seem more likely to accept the coupons, and those with a graduate degree seem to have a lower acceptance rate
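A maximum acceptance rate can come from a category with very few rows, so it helps to report the group size next to each rate. The following sketch uses hypothetical rows and an arbitrary minimum size of 2 chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "education": [
        "Some High School",
        "Bachelors degree", "Bachelors degree", "Bachelors degree",
        "Graduate degree", "Graduate degree",
    ],
    "Y": [1, 1, 0, 1, 0, 1],
})

# Acceptance rate and number of rows per education level
summary = (
    toy.groupby("education")["Y"]
       .agg(acceptance_rate="mean", n="count")
       .sort_values("acceptance_rate", ascending=False)
)

# Keep only levels with enough rows to be worth comparing (threshold is an assumption)
reliable = summary[summary["n"] >= 2]
print(summary)
print(reliable)
```

In this toy data the top-ranked level has a single row, which is exactly the situation the size column is meant to expose.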

