# marketing_analytics_ds_template

## Sources
* https://github.com/KelvinLam05/marketing_analytics/blob/main/marketing_analytics.ipynb
* https://github.com/sundar911/marketing_analytics/blob/main/marketing_analytics.ipynb

## Algo
* EDA:
    * Data cleaning
    * Stat Evals
* General Questions:
    * For any stat eval, should something be normalized?
    * Is a normal distribution required for the eval to be valid?
    * Is the population homogeneous except for the basis of comparison?
    * Are seasons a factor? Special events?

## EDA

### Outlier Checks

#### Describe DF
```
df.describe()
```

#### Review outliers graphically
#### Treat outliers by dropping, imputing, or log-transforming
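As a numeric complement to the graphical review, the three treatments named above can be sketched as follows. This is a minimal sketch, not a prescribed method: the 1.5 × IQR rule, the `iqr_outlier_mask` helper, and the `income` demo column are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside k * IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical data with one extreme value
df_demo = pd.DataFrame({'income': [30_000, 42_000, 51_000, 58_000, 64_000, 666_666]})
mask = iqr_outlier_mask(df_demo['income'])

# Option 1: drop the flagged rows
dropped = df_demo[~mask]

# Option 2: impute flagged values with the column median
imputed = df_demo['income'].mask(mask, df_demo['income'].median())

# Option 3: keep every row but compress the long tail
logged = np.log1p(df_demo['income'])
```

Which option fits depends on whether the extreme values are data errors (drop or impute) or genuine but skewed observations (log transform).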

## Data Cleaning
### Check Nulls
```
df.isnull().sum()
```
### Clean Nulls
```
# Assign the result back; calling fillna(inplace=True) on a selected column may not update df
df[' Income '] = df[' Income '].fillna(df[' Income '].median())
```

## Review Distributions
### Example 1: .value_counts()
```
df['education'].value_counts()
```


## Feature Engineering
### Example 1: Categorical Recode with `apply(lambda ...)`
```
# Merge 'yolo', 'absurd', and 'alone' under 'single'
df['marital_status'] = df['marital_status'].apply(lambda x: 'single' if str(x) in ['alone', 'yolo', 'absurd'] else str(x))
```


### One Hot Encoding
#### Example 1:
```
one_hot_columns = list(bundled_pipeline.named_steps['preprocessor'].named_transformers_['categorical'].named_steps['one_hot_encoder'].get_feature_names_out(input_features = categorical_columns))
```


## Review Correlation
### Example 1: Using ColumnTransformer, SimpleImputer, associations
```
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from dython.nominal import associations

transformer = ColumnTransformer(transformers = [('simple_imputer', SimpleImputer(strategy = 'median'), ['income'])], remainder = 'passthrough')

complete_correlation = associations(X_tr, figsize = (32, 16))

# Rule of thumb: an absolute correlation above 0.8 is too collinear.
# Over-correlated variables can also be dropped automatically rather than by hand.
```
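One common way to drop over-correlated variables automatically is to scan the upper triangle of the absolute correlation matrix and remove the later column of every pair above the threshold. The sketch below assumes numeric columns and Pearson correlation; `drop_highly_correlated` and the demo frame are hypothetical names, not part of the source notebooks.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.8):
    """Drop one column from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical data: 'b' is almost a copy of 'a', 'c' is only weakly related
df_demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.1, 2.0, 2.9, 4.2, 5.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped_columns = drop_highly_correlated(df_demo, threshold=0.8)
print(dropped_columns)
```

Because the later column of each pair is the one removed, column order decides which variable survives; reorder the frame first if a particular variable should be kept.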
### Example 2: Seaborn Heatmap
```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,10))
sns.heatmap(df_income_marketing.corr('spearman'), center=0, annot=True);
```

### Chi Squared
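This section is a stub in the notes. One common reading is a chi-squared test of independence between two categorical columns, which could be sketched as below. The `education` and `response` columns and their values are made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: does campaign response depend on education level?
df_demo = pd.DataFrame({
    'education': ['phd'] * 30 + ['basic'] * 30,
    'response': [1] * 20 + [0] * 10 + [1] * 10 + [0] * 20,
})

# Build the contingency table of observed counts
contingency = pd.crosstab(df_demo['education'], df_demo['response'])

# H0: the two variables are independent
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(chi2, p_value, dof)
```

A p-value below the chosen significance level (0.05 elsewhere in these notes) would suggest the two variables are associated.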

## Feature Importance
### Example 1: Using ELI5
```
import eli5
eli5.explain_weights(bundled_pipeline.named_steps['model'], top = 50, feature_names = numeric_features_list)
```

### Feature Importance Using Shapley Values
```
import matplotlib.pyplot as plt
import shap

# calculate shap values 
ex = shap.Explainer(model, x_train)
shap_values = ex(x_test)

# plot
plt.title('SHAP summary for NumStorePurchases', size=16)
shap.plots.beeswarm(shap_values, max_display=5);
```

## Eval Key X-Y Relationships from Correlation Study

## Evaluate Variance
### Get Equal Sized Samples
```
below_average = below_average.sample(703)
```
### Use Levene Test:
```
import scipy.stats as stats
stats.levene(above_average['numstorepurchases'], below_average['numstorepurchases'])
# LeveneResult(statistic=15.638759392178962, pvalue=8.048902576767652e-05)
# The p-value is less than 0.05, so we reject the null hypothesis that the variances are equal.
```
### Evaluate Residual Distribution:
* Normality
```
# Check the residual distribution using the Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import kstest
# Assumption: `scale` is sklearn's standardizer; the import is not shown in the source
from sklearn.preprocessing import scale

# Standardize the element-wise differences (both samples must be the same size)
diff = scale(np.array(above_average['numstorepurchases']) - np.array(below_average['numstorepurchases']))
kstest(diff, 'norm')
# KstestResult(statistic=0.0711000050880169, pvalue=0.0015527924807426162)

# The p-value is significant (p < 0.05), so we conclude that the residuals are not normally
# distributed. The Mann-Whitney U test is therefore more appropriate for comparing the two samples.
```

### Perform Mann-Whitney Test
```
# Perform a one-sided Mann-Whitney U test.
stats.mannwhitneyu(x = above_average['numstorepurchases'], y = below_average['numstorepurchases'], alternative = 'greater')
# MannwhitneyuResult(statistic=378065.5, pvalue=2.3990673662361752e-67)
# The p-value is significant (p < 0.05), so we conclude that people who spend more on gold
# make more store purchases than people who spend less on gold. The supervisor's claim is supported.
```
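The three steps in this section (Levene, Kolmogorov-Smirnov, then the two-sample test) can be chained in one self-contained script. The sketch below uses synthetic Poisson samples in place of the real `numstorepurchases` data, and the fallback to a t-test when normality is not rejected is an assumption added for completeness, not something the source notebook does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical equal-sized samples standing in for the two customer groups
above_average = rng.poisson(lam=8, size=300)
below_average = rng.poisson(lam=4, size=300)

# 1. Levene's test. H0: the two groups have equal variances
levene_p = stats.levene(above_average, below_average).pvalue

# 2. Kolmogorov-Smirnov test on standardized differences. H0: normally distributed
diff = (above_average - below_average).astype(float)
diff = (diff - diff.mean()) / diff.std()
ks_p = stats.kstest(diff, 'norm').pvalue

# 3. Pick the two-sample test based on the normality check
if ks_p < 0.05:
    # Normality rejected: use the rank-based test
    result = stats.mannwhitneyu(above_average, below_average, alternative='greater')
else:
    # Normality not rejected: t-test, with Welch's correction if variances differ
    result = stats.ttest_ind(above_average, below_average,
                             equal_var=bool(levene_p >= 0.05), alternative='greater')

print(levene_p, ks_p, result.pvalue)
```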

### MANOVA
#### Source:
* https://github.com/sundar911/marketing_analytics/blob/main/marketing_analytics.ipynb
#### Imports:
```
from statsmodels.multivariate.manova import MANOVA
```
#### Prep with Normalization
```
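from sklearn.preprocessing import StandardScaler

# Assumption: `scaler` is a StandardScaler; its definition is not shown in the source
scaler = StandardScaler()
products = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntSweetProducts', 'MntFishProducts','MntGoldProds']
countries = ['GER', 'IND', 'SP', 'US', 'CA', 'SA', 'AUS']

# Country labels, in the same order the scaled values are appended below
df_scaled = pd.DataFrame({'Country': [j for j in countries for _ in range((df.Country == j).sum())]})

for i in products:
  final = []

  for j in countries:
    # Scale each product within each country
    scaled = scaler.fit_transform(np.asarray(df[df.Country == j][i]).reshape(-1, 1))
    final.extend(scaled[:, 0])

  df_scaled[i] = final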
```
#### Perform Levene's Test
```
for i in ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntSweetProducts', 'MntFishProducts','MntGoldProds']:
  sp = df_scaled.query('Country == "SP"')[i]
  us = df_scaled.query('Country == "US"')[i]
  ca = df_scaled.query('Country == "CA"')[i]
  aus = df_scaled.query('Country == "AUS"')[i]
  ger = df_scaled.query('Country == "GER"')[i]
  ind = df_scaled.query('Country == "IND"')[i]
  sa = df_scaled.query('Country == "SA"')[i]

  # Levene's test in Python with SciPy:
  stat, p = stats.levene(sp, us, ca, aus, ger, ind, sa)

  # Get the results:
  print(stat, p)
  
# Levene's test indicates that the variance of each product's sales does not differ
# significantly across countries, which supports going ahead with the MANOVA.
```
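The notes import `MANOVA` but stop before the fit itself. A minimal sketch of that final step with statsmodels' formula interface might look like the following. The random demo data and the choice of only two product columns are assumptions for illustration; in practice the formula would list all of the `Mnt*` columns from `df_scaled`.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)

# Hypothetical stand-in for df_scaled: two product columns, three countries
df_demo = pd.DataFrame({
    'Country': np.repeat(['SP', 'US', 'CA'], 40),
    'MntWines': rng.normal(size=120),
    'MntFruits': rng.normal(size=120),
})

# Dependent variables on the left, grouping factor on the right
manova = MANOVA.from_formula('MntWines + MntFruits ~ Country', data=df_demo)
result = manova.mv_test()
print(result)

# Test statistics for the Country effect as a DataFrame
country_stats = result.results['Country']['stat']
```

The printed summary reports several multivariate statistics (Wilks' lambda among them) for the `Country` term, each with its own p-value.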


## Common Web Analytic Metric Calcs
### CPC, CTR, and Conversion Ratio
```
def average_calculation(df):
    
    # Cost Per Click 
    df['Cost Per Click'] = df['Facebook Cost']/df['Facebook Clicks']

    # Click Through Rate
    df['Click Through Rate'] = df['Facebook Clicks']/df['Facebook Impressions']*100

    # Conversion Ratio
    df['Converstion Ratio'] = df['Facebook Conversions']/df['Facebook Clicks']*100
    
    return df
```
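A small worked example of the same three ratios, using the column names from the function above on made-up campaign rows. The zero-click guard is an addition not present in the original function: without it, a row with cost but no clicks would yield `inf` for Cost Per Click.

```python
import numpy as np
import pandas as pd

# Hypothetical campaign rows; the third one has no clicks
ads = pd.DataFrame({
    'Facebook Cost': [100.0, 250.0, 40.0],
    'Facebook Clicks': [200, 500, 0],
    'Facebook Impressions': [10_000, 20_000, 5_000],
    'Facebook Conversions': [10, 50, 0],
})

# Treat zero clicks as missing so per-click ratios become NaN instead of inf
clicks = ads['Facebook Clicks'].replace(0, np.nan)

ads['Cost Per Click'] = ads['Facebook Cost'] / clicks
ads['Click Through Rate'] = ads['Facebook Clicks'] / ads['Facebook Impressions'] * 100
ads['Converstion Ratio'] = ads['Facebook Conversions'] / clicks * 100

print(ads[['Cost Per Click', 'Click Through Rate', 'Converstion Ratio']])
```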
### Cost Per Lead

```
def cost_per_lead_calc(amount,data):
    
    # Empty Dictionary
    cost_lead = dict()
    
    # Amount
    cost_lead['Budget Assumption'] = amount
    
    # Average CPC
    cost_lead['Average CPC'] = round(data['Cost Per Click'],2)
    
    # Average Total Clicks 
    cost_lead['Average Total Clicks'] = round(amount/data['Cost Per Click'],0)
    
    # Average Lead Conversion 
    cost_lead['Average Lead CNV'] = round(data['Converstion Ratio'],2)
    
    # Average Total Leads
    cost_lead['Average Total Leads'] = round(cost_lead['Average Total Clicks']*data['Converstion Ratio']/100,0)
    
    # Cost Per Lead 
    cost_lead['Cost Per Lead'] = round(amount/cost_lead['Average Total Leads'],2)
    
    # Dataframe Formation
    df = pd.DataFrame(cost_lead)
    
    return df.T
```
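The arithmetic inside `cost_per_lead_calc` is easier to check with plain numbers. The sketch below follows the same chain (budget to clicks to leads to cost per lead) for scalar inputs; the `cost_per_lead` helper and the figures are hypothetical.

```python
def cost_per_lead(budget, cost_per_click, conversion_pct):
    """Estimate cost per lead from a budget, an average CPC, and a conversion rate in percent."""
    clicks = round(budget / cost_per_click)        # clicks the budget can buy
    leads = round(clicks * conversion_pct / 100)   # clicks that convert to leads
    return round(budget / leads, 2)

# 1000 budget at 0.50 per click buys 2000 clicks; 5% of those is 100 leads
print(cost_per_lead(budget=1000, cost_per_click=0.5, conversion_pct=5))  # 10.0
```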
### CLV/LTV
```
# Customer Total Lifetime Value

def cust_life_value(lead_overall,margin,discount,retention_rate):

    # Empty Dict
    customer_lifetime_table = dict()

    # Average Total Purchase Cost
    atpc = round(lead_overall['Average Total Purchase'],2)
    customer_lifetime_table['Average Total Purchase Cost'] = atpc

    # Margin (Assumpt.), in percent
    customer_lifetime_table['Margin (Assumpt.)'] = margin

    # Average Gross Profit
    agp = round(margin/100*atpc,2)
    customer_lifetime_table['Average Gross Profit'] = agp

    # Customer Retention Rate (percent in the table, fraction in the formula)
    crt = round(retention_rate/100,2)
    customer_lifetime_table['Customer Retention Rate'] = retention_rate

    # Average Discount Rate (percent in the table, fraction in the formula)
    dis = round(discount/100,2)
    customer_lifetime_table['Average Discount Rate'] = discount

    # Average CLV = gross profit * retention / (1 + discount - retention)
    clv = round(agp*(crt/(1+dis-crt)),2)
    customer_lifetime_table['Average CLV'] = clv

    # Average Cust Acq Cost
    acqc = round(lead_overall['Customer Acquisition Cost'],2)
    customer_lifetime_table['Average Cust Acq Cost'] = acqc

    # Average CLV Net
    aclv_net = round(clv-acqc,2)
    customer_lifetime_table['Average CLV Net'] = aclv_net

    return pd.DataFrame(customer_lifetime_table)
```
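The CLV formula used above is gross profit × retention / (1 + discount − retention). A scalar version makes the numbers easy to verify by hand; `simple_clv` and the input figures are hypothetical.

```python
def simple_clv(avg_purchase, margin_pct, discount_pct, retention_pct, acquisition_cost):
    """Return (CLV, net CLV) using gross_profit * r / (1 + d - r)."""
    gross_profit = margin_pct / 100 * avg_purchase
    r = retention_pct / 100
    d = discount_pct / 100
    clv = round(gross_profit * (r / (1 + d - r)), 2)
    return clv, round(clv - acquisition_cost, 2)

# 200 average purchase at 30% margin gives 60 gross profit;
# 80% retention and a 10% discount rate give a multiplier of 0.8 / 0.3
clv, clv_net = simple_clv(avg_purchase=200, margin_pct=30, discount_pct=10,
                          retention_pct=80, acquisition_cost=50)
print(clv, clv_net)  # 160.0 110.0
```

The multiplier grows quickly as retention approaches 1 + discount, so small changes in the assumed retention rate move the CLV estimate a lot.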