In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Unit 6 - EDA example on marketing analytics

<div>
<img src="https://github.com/nlihin/EDA-course/blob/main/images/CRISP-DM.png?raw=true" width="600"/>
</div>

In this unit we will walk through an EDA case end-to-end:
- Fix problems
- 1. Data Understanding
  - 1.1 Numeric Data
  - 1.2 Binary Data
  - 1.3 Categorical Data
- 2. Data Preparation
  - 2.1 Missing Values
  - 2.2 Outliers
  - 2.3 Transformations
- 3. Correlations & Observations


<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/marketing.PNG" width="400"/>
</div>

#### Information on the data
We'll work with a small marketing analytics dataset, taken from [iFood](https://www.crunchbase.com/organization/ifood)


In [None]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/marketing_data.csv'
mrkt_df = pd.read_csv(url)
mrkt_df.head()

In [None]:
mrkt_df.shape

Data description is [here](https://www.kaggle.com/jackdaoud/marketing-data)

##  Before we begin: fix problems

You may already notice something strange about the Income column: is it aligned to the right?  
Let's look at the data types:

In [None]:
mrkt_df.dtypes

So here is the problem: the `Income` column name contains extra whitespace. Strip it from the column names:

In [None]:
mrkt_df.columns = mrkt_df.columns.str.replace(' ', '')

Also, the `Income` column should be converted to numeric (float is the better choice).\
`regex` stands for regular expression. We want the `$` sign treated as a literal character rather than as a regex pattern, so `regex` should be set to `False`.
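
As a rough illustration of why the literal setting matters: in a regular expression, `$` is an anchor for the end of the string, so replacing it as a pattern leaves the text unchanged. The sketch below uses two made-up income strings (an assumption, not rows from the dataset) and Python's built-in `re` module for the regex side.

```python
import re
import pandas as pd

# Hypothetical sample values, for illustration only
raw = pd.Series(['$84,835.00', '$57,091.00'])

# As a regular expression, '$' matches the (empty) end of the string,
# so the dollar character itself is not removed
regex_result = re.sub(r'$', '', raw[0])

# As a literal string, '$' is the dollar character itself
cleaned = (raw.str.replace('$', '', regex=False)
              .str.replace(',', '', regex=False)
              .astype(float))

print(regex_result)
print(cleaned.tolist())
```

The first print still shows the dollar sign, while the second shows plain floats.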

In [None]:
mrkt_df.Income

Turn `Income` into a number:

In [None]:
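# Remove the currency symbol ('$' is taken literally, not as a regex anchor)
mrkt_df['Income'] = mrkt_df['Income'].str.replace('$', '', regex = False)
# Remove the thousands separator
mrkt_df['Income'] = mrkt_df['Income'].str.replace(',', '', regex = False)
# Convert the cleaned strings to floats
mrkt_df['Income'] = mrkt_df['Income'].astype(float)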

---

Sanity check:

In [None]:
mrkt_df.Income.sum()

## 1. Data Understanding 

We have: 
> 🔢 Numeric Variables  
> 🔘 Binary Variables (0/1)  
> 🏷️ Categorical Variables  

We begin with the numeric columns.  

### 🔢 Numeric Variables
To compare their distributions side by side, we use `pd.melt()` to reshape the data and then plot them with `boxplot`.
With `FacetGrid`, we can display multiple boxplots side by side.

Select only numeric columns:

In [None]:
features = mrkt_df.select_dtypes(include='number').columns
features

That will be a lot of boxplots!  
So let's **melt** the data into **long** format, so that we can explore all of it at once.


In [None]:
melted_mrkt_df = pd.melt(mrkt_df, id_vars='ID', value_vars=features.drop('ID'))
melted_mrkt_df
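
To see what melting does on a small scale, here is a minimal sketch on a made-up two-row table (the values are assumptions, chosen only for illustration). Each `(ID, column)` pair in the wide table becomes its own row in the long table.

```python
import pandas as pd

# Hypothetical 2-row table, for illustration only
wide_df = pd.DataFrame({
    'ID': [1, 2],
    'Income': [50000.0, 62000.0],
    'Recency': [10, 45],
})

# Each (ID, column) pair becomes one row with 'variable' and 'value' columns
long_df = pd.melt(wide_df, id_vars='ID', value_vars=['Income', 'Recency'])
print(long_df)
```

Two rows and two value columns turn into four rows, which is the shape `FacetGrid` needs in order to split the plots by `variable`.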

Now use `FacetGrid`  
We are interested in creating a different boxplot for each `variable`, which is why `col="variable"`

In [None]:
g = sns.FacetGrid(data = melted_mrkt_df, col="variable",  col_wrap=4, height=2.5)
g.map_dataframe(sns.boxplot, y = 'value')
plt.show()

---

#### <span style="color:blue"> Exercise:</span>
>
> It's hard to read this because the y axis is shared. What can we do? Hint: ask Google or your favorite LLM for advice.
>


### 🔘 Binary Variables (0/1)

Boxplots are **not useful** for binary variables — they only have two values.  
Instead, we use a **histogram (or barplot)** to show how often we see 0 and 1.  
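
Before plotting, we need to know which columns are binary. One possible way, sketched here on a made-up mini-table (the values are assumptions), is to keep the columns that hold exactly two distinct values.

```python
import pandas as pd

# Hypothetical mini-table, for illustration only
toy = pd.DataFrame({
    'Response': [0, 1, 0, 1],
    'Complain': [0, 0, 1, 0],
    'Recency': [3, 40, 12, 77],
})

# A column is treated as binary here if it holds exactly two distinct values
binary_cols = [col for col in toy.columns if toy[col].nunique() == 2]

# For 0/1 columns, the mean is the share of 1s
share_of_ones = toy[binary_cols].mean()
print(binary_cols)
print(share_of_ones)
```

Note that this rule would also pick up a two-valued text column, so it is worth checking the resulting list by eye.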


#### <span style="color:blue"> Exercise:</span>
>
> Boxplots are not ideal for binary data. Create a FacetGrid of histograms
>

### 🏷️ Categorical Variables

For categorical variables, we use `countplot`.  
If there are not too many categories, we can also use pie charts.

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/categories.jpeg" width="400"/>
</div>


In [None]:
non_numeric_columns = mrkt_df.select_dtypes(include='object')
non_numeric_columns.info()

Three of these features are categorical: `Education`, `Marital_Status`, `Country`

### <span style="color:blue"> Exercise:</span>
> Visualize these features, using `sns.countplot`


In [None]:
fig, axes = plt.subplots(figsize=(20, 5), ncols=3)

#YOUR CODE HERE

plt.show()

**Note:** You can also use a FacetGrid

In [None]:
categorical_cols = non_numeric_columns.columns.drop('Dt_Customer')

In [None]:
melted_cat_df = pd.melt(mrkt_df, id_vars='ID', value_vars=categorical_cols)

In [None]:
g = sns.FacetGrid(melted_cat_df, col='variable', col_wrap=3, sharex=False, sharey=False)
g.map(sns.countplot, 'value', order=None)

# Rotate x-axis labels for clarity
for ax in g.axes.flat:
    ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

---

📌 Summary:
Use the right plot for the right variable type:
- **Numeric** → `boxplot`
- **Binary** → `histplot` or `barplot`
- **Categorical** → `countplot` or pie chart (if few categories)

 

---

<a id='section2'></a>
## 2. Data preparation


### 2.1 Missing values

In [None]:
mrkt_df.isnull().sum().sort_values(ascending=False)

The feature `Income` contains 24 null values.

Plot this feature to identify the best strategy for imputation:

In [None]:
sns.histplot(mrkt_df, x = 'Income')

We see that most incomes fall between 0 and 100,000. We can confirm this with a box plot:

In [None]:
plt.figure(figsize=(2,4))
sns.boxplot(data = mrkt_df, y= 'Income')

In [None]:
mrkt_df['Income'] = mrkt_df['Income'].fillna(mrkt_df['Income'].median())

Use a log scale to display the income, so that the bulk of the distribution sits in the center of the plot:

In [None]:
sns.histplot(mrkt_df.Income, log_scale=True)

#### Be Careful When Filling Missing Values

Filling missing values with the **mean** is easy — but not always right.

Sometimes, it hides important patterns in the data.

---

#### ⚠️ Example: Wait Time and Resolved Tickets

We look at two groups:
- ✅ Customers whose issue was **resolved** — they had **short wait times**
- ❌ Customers whose issue was **not resolved** — many are missing wait time, but those we have are **very long**

---
Look at the plots below:

We compare wait times between customers whose issues were **resolved** (1) and those that were **not resolved** (0), using four versions of the data:

1. **Before Imputation** – only complete cases  
2. **After Mean Imputation** – fill missing values with the overall average  
3. **After Median Imputation** – fill with the median value  
4. **After Group Mean Imputation** – fill using the average within each group


<div>
    <center>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/waittime_imputation_comparison.png?raw=true" />
    </center>
</div>

What do you see?

- In the original data, unresolved customers have much higher wait times.
- After filling, especially with the global **mean** or **median**, the difference between groups becomes smaller.
- **Group-wise imputation** preserves the difference more clearly — but still smooths it.

📌 Different imputation choices can **change the story** the data tells.
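
The effect can be reproduced on a tiny made-up table. This is only a sketch: the column names `resolved` and `wait_time` and all values are assumptions, not the data behind the figure above.

```python
import numpy as np
import pandas as pd

# Hypothetical ticket data, for illustration only
tickets = pd.DataFrame({
    'resolved':  [1, 1, 1, 1, 0, 0, 0, 0],
    'wait_time': [5.0, 6.0, 4.0, 5.0, 40.0, 50.0, np.nan, np.nan],
})

# Global mean: every gap gets the same overall average
global_fill = tickets['wait_time'].fillna(tickets['wait_time'].mean())

# Group mean: each gap gets the average of its own group
group_means = tickets.groupby('resolved')['wait_time'].transform('mean')
group_fill = tickets['wait_time'].fillna(group_means)

# Average wait time per group under each version of the data
mean_before = tickets.groupby('resolved')['wait_time'].mean()
mean_global = global_fill.groupby(tickets['resolved']).mean()
mean_group = group_fill.groupby(tickets['resolved']).mean()
print(mean_before, mean_global, mean_group, sep='\n\n')
```

In this toy case the global fill pulls the unresolved group's average down toward the resolved group, while the group-wise fill keeps each group's average where it was. The filled values all sit exactly at the group mean, though, so the spread inside the group shrinks.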

---

📌 **Bottom line**: Filling missing values can change your results.  
Think before you fill!

#### ✅ When is filling missing values helpful?
- You want to preserve as much data as possible for modeling.
- The missingness is **random** and unrelated to the target.
- You use a statistically sound imputation method (e.g., regression or group-based means).

#### ⚠️ When can filling be misleading?
- The missingness is **not random** — e.g., only customers who were dissatisfied skipped the question.
- You fill with a constant value (e.g., the mean), which may **flatten real variation**.
- You're analyzing relationships (e.g., group comparisons) that can be **distorted** by naive imputation.


### 2.2 Outliers

Should we remove outliers?

✅ **Sometimes yes**:
- The value is clearly wrong (e.g., typing error or system bug)
- It pulls the mean too far and affects analysis
- It hurts visualizations or model accuracy

❌ **Sometimes no**:
- The value is real (e.g., a very high-earning customer)
- It’s important for business decisions (e.g., targeting premium users)
- Removing it would hide useful variation
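
To decide, we first need to find the candidates. One common convention, which is also what boxplot whiskers use by default, is the 1.5 × IQR rule. This unit does not prescribe a rule, so treat the sketch below, with its made-up incomes, as one possible approach.

```python
import pandas as pd

# Hypothetical incomes with one extreme value, for illustration only
income = pd.Series([30000, 42000, 51000, 58000, 66000, 75000, 600000])

# Interquartile range: distance between the 25th and 75th percentiles
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than drop, so the decision to remove stays explicit
is_outlier = (income < lower) | (income > upper)
print(income[is_outlier])
```

Flagging first keeps the choice in our hands: a flagged value may be a typing error, or it may be a real premium customer.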

#### Treat the outliers
> Do something with the birth years  
> What about the income?


In [None]:
plt.figure(figsize=(3,3))
sns.boxplot( y = 'Year_Birth', data = mrkt_df)

In [None]:
mrkt_df.Year_Birth.sort_values(ascending=True)

In [None]:
mrkt_df = mrkt_df[mrkt_df.Year_Birth>1922].copy()

In [None]:
plt.figure(figsize=(3,3))
sns.boxplot( y = 'Year_Birth', data = mrkt_df)

In [None]:
mrkt_df.info()

Convert the date column to a datetime type:

In [None]:
mrkt_df['Dt_Customer'] = pd.to_datetime(mrkt_df['Dt_Customer'])

In [None]:
mrkt_df.dtypes

### 2.3 Data Transformations

To find patterns in the data, we look for correlations between the features.  
Advanced note: we saw that the data is not normally distributed, so we'll use Kendall's rank correlation instead of Pearson's.
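
As a rough illustration of the difference, the sketch below uses made-up data in which `y` always grows with `x`, but not along a straight line. Kendall's tau is computed by hand from its pair-counting definition (there are no ties here), to show what it measures. In practice `corr(method='kendall')` does this for us.

```python
from itertools import combinations
import pandas as pd

# Hypothetical data: y grows with x, but not linearly
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
toy['y'] = toy['x'] ** 4

# Pearson (the default) measures how well a straight line fits
pearson = toy['x'].corr(toy['y'])

# Kendall: compare every pair of rows and count how often
# x and y move in the same direction (concordant pairs)
pairs = list(combinations(range(len(toy)), 2))
concordant = sum(
    (toy['x'][i] - toy['x'][j]) * (toy['y'][i] - toy['y'][j]) > 0
    for i, j in pairs
)
discordant = len(pairs) - concordant  # valid because there are no ties
kendall = (concordant - discordant) / len(pairs)

print(round(pearson, 2), kendall)
```

Kendall's tau is exactly 1 because the ordering is perfectly preserved, while Pearson stays below 1 because the relationship is not a straight line.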

In [None]:
corrs = mrkt_df.corr(method = 'kendall', numeric_only=True).round(2)
plt.figure(figsize=(15,6))  #figure size
sns.heatmap(corrs, cmap='coolwarm', center=0, annot = True)

It is difficult to spot correlations this way, so we first transform the data by aggregating some of the fields.

The total number of dependents in the home ('Dependents') can be engineered from the sum of 'Kidhome' and 'Teenhome'

In [None]:
mrkt_df['Dependents'] = mrkt_df['Kidhome'] + mrkt_df['Teenhome']

The year of becoming a customer ('Year_Customer') can be engineered from 'Dt_Customer'

In [None]:
mrkt_df['Year_Customer'] = pd.DatetimeIndex(mrkt_df['Dt_Customer']).year

The total amount spent ('TotalMnt') can be engineered from the sum of all features containing the keyword 'Mnt'

In [None]:
mrkt_df.columns

In [None]:
mnt_cols = ['MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds']
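
Instead of typing the names by hand, the list can also be built from the keyword. A minimal sketch on an empty table with made-up column names (an assumption for illustration):

```python
import pandas as pd

# Hypothetical column names, for illustration only
toy = pd.DataFrame(columns=['ID', 'MntWines', 'MntFruits', 'NumWebPurchases'])

# Keep every column whose name contains the keyword 'Mnt'
mnt_like = [col for col in toy.columns if 'Mnt' in col]
print(mnt_like)
```

One caveat: a column such as `TotalMnt` also contains the keyword, so if the list is built after the total has been added, the total would be included in its own sum.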

Until now we used `.sum()` to add up the values down each column. Here we want one total per row, adding across the columns, so we set `axis=1`.

In [None]:
mrkt_df['TotalMnt'] = mrkt_df[mnt_cols].sum(axis=1)
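
The two directions are easy to mix up, so here is a small sketch on a made-up two-customer table (the values are assumptions):

```python
import pandas as pd

# Hypothetical spending for two customers, for illustration only
toy = pd.DataFrame({'MntWines': [10, 20], 'MntFruits': [1, 2]})

col_totals = toy.sum()         # axis=0 (default): one total per column
row_totals = toy.sum(axis=1)   # axis=1: one total per row (per customer)

print(col_totals)
print(row_totals)
```

The default gives one number per product, while `axis=1` gives one number per customer, which is what a per-customer total needs.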

The total purchases ('TotalPurchases') can be engineered from the sum of all features containing the keyword 'Purchases'

In [None]:
purchases_cols = ['NumDealsPurchases', 'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases']
mrkt_df['TotalPurchases'] = mrkt_df[purchases_cols].sum(axis=1)

The total number of campaigns accepted ('TotalCampaignsAcc') can be engineered from the sum of all features containing the keywords 'Cmp' and 'Response' (the latest campaign)

In [None]:
campaigns_cols = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2'] + ['Response'] # 'Response' is for the latest campaign
mrkt_df['TotalCampaignsAcc'] = mrkt_df[campaigns_cols].sum(axis=1)

Look at our new columns

In [None]:
mrkt_df[['ID', 'Dependents', 'Year_Customer', 'TotalMnt', 'TotalPurchases', 'TotalCampaignsAcc', 'NumDealsPurchases']].head(10)

## 3. Correlations & Observations

So now we are ready to search for correlations again

In [None]:
agg_features = ['ID', 'Income', 'Dependents','TotalMnt','TotalPurchases', 'TotalCampaignsAcc', 'NumWebVisitsMonth', 'NumWebPurchases', 'NumDealsPurchases']

In [None]:
corrs = mrkt_df[agg_features].corr(method = 'kendall')
plt.figure(figsize=(8,4))  #figure size
sns.heatmap(corrs, cmap='coolwarm', center=0, annot = True)

<div>   
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/mrkt_correlations.png?raw=true" width = 500/>
</div>

We found some correlations :-)
> Income and spending (`Income` & `TotalMnt`)  
> Kids and spending (`Dependents` & `TotalMnt`)  
> Kids and deals (`Dependents` & `NumDealsPurchases`)  
> Campaigns and spending (`TotalCampaignsAcc` & `TotalMnt`)  
> Web visits and purchases (`NumWebVisitsMonth` & `NumWebPurchases` & `NumDealsPurchases`)

### Income and spending (`Income` & `TotalMnt`)

In [None]:
plt.figure(figsize=(4,3))
sns.regplot(x='Income', y='TotalMnt', data=mrkt_df)

#### <span style="color:blue"> Exercise:</span>
> Plot the same graph, but remove the outlier


---
### <span style="color:green"> Observation 1:</span>
>
>The higher the income is, the more you spend
>
> (a bit trivial, not a very good observation)
---

### Kids and spending (`Dependents` & `TotalMnt`)
A plot illustrating the negative effect of having dependents (kids & teens) on spending:
    

In [None]:
plt.figure(figsize=(4,3))
sns.regplot(x='Dependents', y='TotalMnt', data=mrkt_df);

A regression line doesn't suit this plot, since the data is discrete (the same holds for ordinal data, e.g. none, few, many).

---

#### <span style="color:blue"> Exercise:</span>
> Try `boxplot` instead. What looks better?


In [None]:
plt.figure(figsize=(4,4))
#YOUR CODE HERE
plt.show()

---

### Kids and deals (`Dependents` & `NumDealsPurchases`)

A plot illustrating the positive effect of having dependents (kids & teens) on the number of deals purchased:

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(x='Dependents', y='NumDealsPurchases', data=mrkt_df)

Plot side by side so it will be easier to compare:

In [None]:
fig, ax = plt.subplots(1,2, figsize = (10,5))
plt.subplots_adjust(wspace = 0.3)
sns.boxplot(x='Dependents', y='TotalMnt', data=mrkt_df, ax = ax[0])
sns.boxplot(x='Dependents', y='NumDealsPurchases', data=mrkt_df, ax = ax[1])
plt.show()

---
### <span style="color:green"> Observation 2:</span>

>
>People with more kids spend less
>
>People with more kids buy more deals


---

Plots illustrating the positive effect of campaigns

### Campaigns and spending (`TotalCampaignsAcc` & `TotalMnt`)

In [None]:
plt.figure(figsize=(5.5,4))
sns.boxplot(x='TotalCampaignsAcc', y='TotalMnt', data=mrkt_df);

---
### <span style="color:green"> Observation 3:</span>
>
>Campaigns seem to be working

---

### <span style="color:green"> Observation???</span>
>
> Campaigns don't seem to correlate with deals, even for different countries (this may be a good thing)
>
> This is not really an observation, since it doesn't focus on finding a new connection. It is only an observation if the company currently believes that there is a connection and you're proving them wrong.
---

### Web visits and purchases (`NumWebVisitsMonth` & `NumWebPurchases` & `NumDealsPurchases`)

What about web visits and web purchases?

In [None]:
fig, ax = plt.subplots(1,2,figsize=(8, 4))
sns.regplot(x='NumWebVisitsMonth', y='NumWebPurchases', data=mrkt_df, ax=ax[0])
sns.regplot(x='NumWebVisitsMonth', y='NumDealsPurchases', data=mrkt_df, ax = ax[1])
plt.subplots_adjust(wspace = 0.3)

---

### <span style="color:green"> Observation 4:</span>
>
> Number of web visits in the last month is not positively correlated with number of web purchases
>
> Instead, it is positively correlated with the number of deals purchased, suggesting that the website is effective in stimulating purchases

---

<a id='section4'></a>

### Correlations in categorical data


In [None]:
mrkt_df[["Marital_Status"]].value_counts()

##### Is there a correlation between marital status and spending?

Create a df that holds only the statuses we're interested in: the four biggest categories

In [None]:
# Keep only the rows whose marital status is one of the four biggest categories
top_statuses = ['Single', 'Married', 'Together', 'Divorced']
status_mrkt = mrkt_df.loc[mrkt_df.Marital_Status.isin(top_statuses)].copy()

Create one-hot encodings for the categorical variables  
Note: use smartly, so as not to add too many dimensions

In [None]:
features2 = ['Income', 'Dependents','TotalMnt','TotalPurchases', 'TotalCampaignsAcc',\
             'NumDealsPurchases', 'NumWebVisitsMonth', 'NumWebPurchases', 'Marital_Status', 'Response']

status_mrkt_with_dummies = pd.get_dummies(status_mrkt[features2])

One-hot encoding doesn't affect variables that are not categorical
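
A minimal sketch on a made-up three-row table (the values are assumptions) shows both effects: the text column is split into one indicator column per category, and the numeric column passes through unchanged.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'Income': [50000.0, 62000.0, 48000.0],
    'Marital_Status': ['Single', 'Married', 'Single'],
})

# Only the non-numeric column is expanded into indicator columns
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each category adds a column, which is why a feature with many categories can quickly inflate the number of dimensions.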

In [None]:
corrs = status_mrkt_with_dummies.corr(method = 'kendall')
plt.figure(figsize=(10,7))  #figure size
sns.heatmap(corrs, cmap='coolwarm', center=0, annot = True);

---

### <span style="color:green"> Observation???</span>
>
> There isn't any correlation between marital status and spending
>
>##### Observations should focus on what there IS, not what there isn't
>
>Let's try to look at the response to campaigns

---



#### <span style="color:blue"> Exercise:</span>
Is there a connection between the status, dependents and response?
> groupby `Marital_Status` and `Dependents` to find out  
> use `unstack()` on the result to create a table

---

Or we can use a pivot table to obtain the same results:

In [None]:
status_mrkt.pivot_table('Response', index='Marital_Status', columns='Dependents', aggfunc='mean')  #aggfunc = 'mean' is the default

Why would we want a table? Because with a table it's easier to figure out what's going on and what we should plot
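
A table of averages is easier to trust when we also know how many rows stand behind each cell. Pairing the mean table with a count table is a suggestion rather than part of the original unit; the sketch below uses made-up rows to show the idea.

```python
import pandas as pd

# Hypothetical customers, for illustration only
toy = pd.DataFrame({
    'Marital_Status': ['Single', 'Single', 'Single', 'Married', 'Married', 'Married'],
    'Dependents':     [0, 0, 1, 0, 1, 1],
    'Response':       [1, 0, 0, 0, 0, 1],
})

# Average response per (status, dependents) cell
rates = toy.pivot_table(values='Response', index='Marital_Status',
                        columns='Dependents', aggfunc='mean')

# Number of customers behind each average
sizes = toy.pivot_table(values='Response', index='Marital_Status',
                        columns='Dependents', aggfunc='count')
print(rates)
print(sizes)
```

A cell with a high average but only a handful of customers deserves more caution than the same average over hundreds of customers.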

Single & Divorced with no kids are more likely to respond to a campaign

In [None]:
plt.figure(figsize=(5,4))
#plt.xticks(rotation=90)
ax = sns.barplot(data = status_mrkt, x='Marital_Status', y='Response', errorbar=None, hue = 'Dependents')
ax.set(ylabel='average response')
plt.legend(title='Dependents', loc=('upper right')) #the legend position
plt.show()

---

### <span style="color:green"> Observation 5:</span>
>
> Single & Divorced with no kids are more likely to respond to a campaign - average response is higher than 30%
>
> Married & Together with no kids respond at around 20%
>

 

---

Is this it? No!! There is always more to do. We haven't touched country, education, or campaign responses and much more. 


#### Summary of commands used in this unit:
>

> 🧹 **Missing values**
>
>* `isnull().sum()` – check how many missing values exist in each column.
>
>* `fillna()` – fill missing values with a constant or calculated value (e.g., mean or median). [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
>
>* `fillna(df.mean())` – fill missing values using the mean of the column.
>
>* `dropna()` – remove rows with missing values. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
>

---

> 🔄 **Reshaping and selecting data**
>
>* `pd.melt()` – reshape the DataFrame from wide to long format for easy comparison in FacetGrid. [documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
>
>* `drop()` – remove a column or index label from the DataFrame. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
>
>* `reset_index(drop=True)` – reset the index to default integers; use `drop=True` to discard the old index. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html)
>
>* `select_dtypes()` – select only columns of a specific data type (e.g., numeric or categorical). [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html)
>
>* `nunique()` – count the number of unique values in a column.
>
>* `list comprehension` – create a list using a compact condition-based syntax, e.g., `[col for col in df.columns if df[col].nunique() == 2]`. [documentation](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)

---

> 📊 **Visualization**
>
>* `FacetGrid()` – create a grid of plots split by category. [documentation](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)
>
>* `.map()` – apply a plot type (e.g., countplot or histplot) to each facet in a FacetGrid.
---
