### Self-Study Colab Activity 4.4: More Sophisticated Plotting

#### Customer Profiling

This activity is meant to give you practice exploring data, including the use of visualizations with `matplotlib`, `seaborn`, and `plotly`.  The dataset contains demographic information on the customers, information on customer purchases, engagement of customers with promotions, and information on where customer purchases happened.  A complete data dictionary can be found below.  

Your task is to explore the data and use visualizations to inform answers to specific questions about it.  The questions and resulting visualizations should be posted in the group discussion related to this activity.  Some example questions to explore could be:

-----

- Does income differentiate customers who purchase wine? 
- Which customers were more likely to participate in the last promotional campaign?
- Are customers with children more likely to purchase products online?
- Do married people purchase more wine?
- What kinds of purchases led to customer complaints?

-----

### Data Dictionary

Attributes


```
ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if a customer complained in the last 2 years, 0 otherwise


MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years


Promotion

AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise


NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month
```

In [None]:
import pandas as pd
import plotly.express as px
from datetime import datetime

In [None]:
df = pd.read_csv('module 4/colab_activity4_4_starter/data/marketing_campaign.csv', sep='\t')

In [None]:
df.info()

In [None]:
df.sample(5)


Post your questions with an accompanying visualization in Canvas.  You should generate at least three different questions and resulting visualizations.  Include complete-sentence explanations of your interpretations of the visualizations.

#### Do customers who accept an offer spend more than customers who do not? Do their incomes differ?

In [None]:
# Add total offers accepted and total amount spent over 2 years
df['totalOffersAccepted'] = df[
    ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']].sum(axis=1)
df['totalAmountSpentOver2Years'] = df[
    ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)

In [None]:
df.sample(3)

In [None]:
q1 = df['Income'].quantile(0.25)
q3 = df['Income'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

income_filtered_df = df.query(f'Income >= {lower_bound} and Income <= {upper_bound}')

px.histogram(income_filtered_df, x='Income', title="Distribution of customer incomes without outliers")
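
The IQR filter above is repeated later for other columns, so it may be convenient to wrap it in a helper. The following is a minimal sketch, not part of the original activity: the function names `iqr_bounds` and `drop_iqr_outliers` are hypothetical, and the toy incomes are made up purely to show the behavior.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose value in `column` lies within the fences (inclusive)."""
    lower, upper = iqr_bounds(frame[column], k)
    # between() is inclusive on both ends and is False for missing values,
    # which mirrors the `>=` / `<=` comparisons in the query above.
    return frame[frame[column].between(lower, upper)]


# Toy data, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({'Income': [30_000, 35_000, 40_000, 45_000, 50_000, 666_666]})
filtered = drop_iqr_outliers(toy, 'Income')
print(iqr_bounds(toy['Income']))
print(filtered)
```

With a helper like this, the income filter could be written as `drop_iqr_outliers(df, 'Income')`, assuming `df` is the DataFrame loaded earlier.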

In [None]:
px.box(income_filtered_df, x='Income', title="Most customers earn between $35k - $68k")

In [None]:
fig = px.box(df, x='totalOffersAccepted', y='totalAmountSpentOver2Years', color='totalOffersAccepted',
       title="On average, customers who accept an offer spend more than customers who do not.")
fig.write_image('module 4/colab_activity4_4_starter/images/offers_accepted_x_amount_spent.png')

In [None]:
px.box(df, x='totalOffersAccepted', y='Income', color='totalOffersAccepted',
       title="Income may play a small role in the number of offers accepted.")


In [None]:
corr_matrix = df[['totalOffersAccepted', 'totalAmountSpentOver2Years', 'Income']].corr(numeric_only=True).round(2)
px.imshow(corr_matrix, title='Correlation matrix: total spend and income have a positive correlation of 0.67',
          color_continuous_scale='RdBu_r', aspect='auto')

In [None]:
fig = px.scatter(df, x='Income', y='totalAmountSpentOver2Years', color='totalOffersAccepted',
           title="Those with higher incomes spend more than those with lower incomes.")
fig.write_image('module 4/colab_activity4_4_starter/images/income_spending_scatter.png')

In [None]:
fig = px.histogram(df, x='Income', y='totalAmountSpentOver2Years', color='totalOffersAccepted',
             title="Those in the higher income group accept more offers and spend more money.")
fig.write_image('module 4/colab_activity4_4_starter/images/income_spending_power.png')


#### Answers to "Do customers who accept an offer spend more than customers who do not? Do their incomes differ?"

On average, customers who accept an offer spend more than customers who do not. The largest gap is between 0 and 1 offers accepted, where the median spend differs by about $550. The incremental increase in spend tapers off with each subsequent offer accepted, which suggests diminishing returns. A customer's income may play a small role in offer acceptance: income and total offers accepted have a positive correlation of 0.29. Income and total spend are more strongly correlated, at 0.67. Total offers accepted and total amount spent are also positively correlated, at 0.46. The scatter and histogram plots show the same relationships. From a business perspective, we may find more success by targeting higher-income customers with adverts containing offers. We may also want to consider why offers are not catching the eyes of customers with lower incomes.
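
The median gaps described above can also be read off a small table instead of the box plot. The sketch below is one possible way to do that; the helper name `median_spend_by_offers` is hypothetical, and the toy numbers are invented and do not reproduce the dataset's values.

```python
import pandas as pd


def median_spend_by_offers(frame: pd.DataFrame,
                           offers_col: str = 'totalOffersAccepted',
                           spend_col: str = 'totalAmountSpentOver2Years') -> pd.DataFrame:
    """Median spend per number of offers accepted, plus the step up from the previous level."""
    medians = frame.groupby(offers_col)[spend_col].median()
    return pd.DataFrame({
        'median_spend': medians,
        # diff() gives the increase relative to the previous offer count;
        # shrinking values would be consistent with diminishing returns.
        'step_increase': medians.diff(),
    })


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'totalOffersAccepted': [0, 0, 0, 1, 1, 2, 2],
    'totalAmountSpentOver2Years': [10, 20, 30, 100, 120, 150, 170],
})
summary = median_spend_by_offers(toy)
print(summary)
```

Running `median_spend_by_offers(df)` on the real data, after the two total columns have been added, should make it easier to check the size of each step.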


#### What products have the highest spending on average among web customers aged between the 1st and 3rd quartiles?


In [None]:
df['age'] = datetime.now().year - df['Year_Birth']
online_shoppers = df.query('NumWebPurchases > 0')
px.histogram(online_shoppers, x='age',
             title="Age of customers who purchased online. Most are 48-66 years old.", marginal='rug')


In [None]:
fig = px.box(online_shoppers, x='age', title="Age of customers who purchased online. Median 55 years old.")
fig.write_image('module 4/colab_activity4_4_starter/images/online_shoppers_age.png')


In [None]:
for field in ['age', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']:
    q1 = df[field].quantile(0.25)
    q3 = df[field].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    online_shoppers = online_shoppers.query(f'{field} >= {lower_bound} and {field} <= {upper_bound}')

average_by_product = online_shoppers[
    ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].mean()
average_by_product.rename('Average', inplace=True)
fig = px.bar(average_by_product, y='Average',
       title='Wine, followed by meat products, accounts for the highest average spending among online shoppers.')
fig.write_image('module 4/colab_activity4_4_starter/images/average_spending_by_product.png')

#### Answers to "What products have the highest spending on average among web customers aged between the 1st and 3rd quartiles?"

Most online customers are 48 to 66 years old. This suggests to me that we are potentially dealing with some stale data. I would like to verify when this dataset was created to confirm its freshness and accuracy. I removed outliers from age and from all of the amount fields using the IQR method. From the data, I saw that on average, online customers between the first and third age quartiles spend the most on wine by a large margin. Meat comes second, but well behind wine. Given that wine prices can vary greatly, this is not much of a surprise. I would like to see purchase counts for these products to help decide inventory needs.
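
One way to reduce the staleness concern is to measure age against the most recent enrollment date in the data rather than against today's date. The sketch below assumes that `Dt_Customer` is stored as day-month-year text (for example `04-09-2012`); that format is an assumption and should be checked against `df['Dt_Customer'].head()`. The helper name `age_at_latest_enrollment` is hypothetical and the toy rows are made up.

```python
import pandas as pd


def age_at_latest_enrollment(frame: pd.DataFrame,
                             date_col: str = 'Dt_Customer',
                             birth_col: str = 'Year_Birth',
                             date_format: str = '%d-%m-%Y') -> pd.Series:
    """Age in the year of the latest enrollment date found in the data."""
    # ASSUMPTION: dates are day-month-year strings; adjust date_format if not.
    enrolled = pd.to_datetime(frame[date_col], format=date_format)
    reference_year = enrolled.max().year
    return reference_year - frame[birth_col]


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Year_Birth': [1960, 1985],
    'Dt_Customer': ['04-09-2012', '29-06-2014'],
})
toy['age_at_reference'] = age_at_latest_enrollment(toy)
print(toy)
```

The latest enrollment year is only a proxy for when the dataset was assembled, but it keeps the computed ages from drifting as the notebook is rerun in later years.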

#### What is the relationship between marital status, presence of children and customer complaints?


In [None]:
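# size() counts rows per group; value_counts() would instead count unique rows across every remaining column
df.groupby(['Marital_Status', 'Kidhome', 'Complain']).size()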

In [None]:
with_only_married_divorced_single = df.query('Marital_Status in ["Married", "Divorced", "Single"]')

In [None]:
with_only_married_divorced_single.groupby(['Marital_Status', 'Kidhome', 'Complain']).size()

In [None]:
complaint_groups = with_only_married_divorced_single.groupby(['Marital_Status', 'Kidhome'])['Complain'].agg(
    ['sum', 'count']).reset_index()
complaint_groups['complaint_rate'] = (complaint_groups['sum'] / complaint_groups['count']) * 100

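# histfunc='avg' averages the 0/1 Complain flag, so the color scale is a fraction (0 to 1), not a percentage
fig = px.density_heatmap(with_only_married_divorced_single,
                         x='Marital_Status',
                         y='Kidhome',
                         z='Complain',
                         histfunc='avg',
                         title='Customer Complaint Rate (fraction of customers) by Marital Status and Number of Children',
                         labels={'Complain': 'Complaint Rate'})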
fig.show()
fig.write_image('module 4/colab_activity4_4_starter/images/complaint_rate_by_marital_status_and_children.png')

In [None]:
complaint_groups

#### Answers to "What is the relationship between marital status, presence of children and customer complaints?"

I filtered the data down to only Single, Married, and Divorced customers. Single customers with 2 kids had the highest complaint rate. However, this rate reflects a single individual. After looking at the group counts, I determined that there were not enough observations to say anything definitive. The data therefore does not show a clear relationship between marital status, presence of children, and customer complaints.
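
To make the small-sample problem visible next to each rate, the group size can be reported alongside it. This is a minimal sketch: the helper name `complaint_rates` and the `min_count` threshold are assumptions chosen for illustration (the default of 30 is arbitrary), and the toy rows are invented.

```python
import pandas as pd


def complaint_rates(frame: pd.DataFrame,
                    group_cols=('Marital_Status', 'Kidhome'),
                    min_count: int = 30) -> pd.DataFrame:
    """Complaint rate per group, with the group size and a flag for thin groups."""
    summary = (frame.groupby(list(group_cols))['Complain']
               .agg(complaints='sum', customers='count')
               .reset_index())
    summary['complaint_rate_pct'] = 100 * summary['complaints'] / summary['customers']
    # ASSUMPTION: min_count is an arbitrary cutoff, not a statistical test.
    summary['enough_data'] = summary['customers'] >= min_count
    return summary


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Marital_Status': ['Single', 'Single', 'Single', 'Married', 'Married', 'Married', 'Married'],
    'Kidhome': [2, 0, 0, 0, 0, 0, 0],
    'Complain': [1, 0, 0, 0, 1, 0, 0],
})
rates = complaint_rates(toy, min_count=3)
print(rates)
```

In the toy output, the Single customers with 2 kids show a 100% complaint rate from one customer and are flagged as not having enough data, which is the same pattern described above.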