# Institute for Machine Learning, LLC
## **Author:** Dr. Giancarlo Crocetti
## **Course:** AIM-315 Introduction to Business Analytics


**License**: This code is licensed under the Creative Commons License for non-commercial use **CC BY-NC**. Refer to https://creativecommons.org/licenses/by-nc/2.0/ for more information.


# Use Case: Customer Segmentation for Marketing Strategy Optimization
## Objective:

The primary aim of this use case is to identify distinct segments within the bank's customer base using a Gaussian Mixture Model (GMM). By focusing on features like age, balance, and the number of contacts during the campaign, the bank aims to understand the underlying patterns in customer behavior and tailor its marketing strategies accordingly.
## Dataset:

The dataset is sourced from the UCI Machine Learning Repository and pertains to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The campaigns aimed to promote term deposits among the bank's customers.
## Features:

  - **Age**: This represents the age of the clients.
  - **Balance**: This represents the average yearly balance (in euros) of the clients.
  - **Campaign**: This represents the number of contacts performed during this campaign for a client.
  - **Marital**: This represents the marital status of the clients.
  - **Education**: This represents the education level of the clients.

## Methodology:

  1. **Data Preprocessing**: The relevant features are selected, and any negative values in the 'balance' feature are handled to ensure data consistency.
  
  2. **Standardization**: The selected features are standardized to bring them to a common scale.
  
  3. **Modeling**: A GMM is applied to the standardized features to create customer segments. The model with the lowest AIC value across different component numbers is selected as the best model.
  
  4. **Segmentation**: The best GMM model is used to predict the segments and assign them back to the original dataset.
  
  5. **Summary Statistics Generation**: Summary statistics for each segment are generated to understand the characteristics of the identified segments.

## Outcome:

The resulting segments represent distinct groups within the customer base with varying age, balance, and campaign interaction levels. These segments are then analyzed to tailor marketing strategies, including the type and frequency of contact, offer customization, and personalized communications, to enhance the effectiveness of the marketing campaigns.
## Business Impact:

By leveraging the insights obtained from the customer segmentation, the bank can optimize its marketing strategies to target the right customers with the right offers. This can lead to increased customer engagement, higher conversion rates, and enhanced customer satisfaction, ultimately contributing to the bank's overall growth and profitability.

## Challenges & Considerations:

The selected features and the number of components should be validated and may need adjustment based on the actual business context and dataset characteristics.

The interpretation of the segments needs to align with the business understanding and domain knowledge to drive actionable insights.

Ethical considerations and data privacy concerns need to be addressed, especially when dealing with sensitive financial data.

In [None]:
import pandas as pd
import numpy as np
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from scipy.stats import boxcox, anderson
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.mixture import GaussianMixture


# 1. Load the data

In [None]:
features = ['age', 'balance', 'campaign', 'job', 'marital', 'education', 'housing']
df = pd.read_csv('./bank-full.csv', sep=';', encoding='latin1', usecols=features)
df.head()

In [None]:
df['balance'].max()

# 2. Standardization & Categorical Variables

## 2.1 Standardization

To prevent variables from having a higher influence on the model simply because they take larger values (e.g., monetary amounts), we convert each variable into units of standard deviation by applying a transformation called **standardization**, which transforms a variable as:

$z = \frac{{x - \bar{x}}}{{s}}$

Standardization is performed by the `StandardScaler`, while the `OneHotEncoder` converts categorical variables into a numeric form so they can be provided to machine learning algorithms.
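As a quick check of the formula above, the sketch below standardizes a small made-up balance column by hand and with `StandardScaler`. The values are illustrative and do not come from the dataset. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), so the manual version uses the same convention.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical balances in euros (illustrative values, not from the dataset)
balances = np.array([[100.0], [250.0], [400.0], [1250.0]])

# Manual z-score; ddof=0 matches what StandardScaler computes
z_manual = (balances - balances.mean(axis=0)) / balances.std(axis=0, ddof=0)

# Same transformation through scikit-learn
z_sklearn = StandardScaler().fit_transform(balances)

# Both versions agree, and the result has mean 0 and standard deviation 1
print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```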

## 2.2 Categorical Variables

Categorical data are variables that contain label values rather than numeric values. With one-hot encoding, each label of a given attribute is mapped to its own binary column, which holds 1 when the observation has that label and 0 otherwise.


In [None]:
# FOR EXPLANATORY PURPOSES ONLY
# Sample DataFrame
df_example = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue', 'Blue','Yellow']
})
print(df_example)
print()
# Initializing OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')  # scikit-learn >= 1.2; older versions use sparse=False

# Fitting and transforming the data and converting it to DataFrame
one_hot_encoded_arr = encoder.fit_transform(df_example[['Color']])
encoded_df = pd.DataFrame(one_hot_encoded_arr, columns=encoder.get_feature_names_out(['Color']))

print(encoded_df)

Back to the GMM Analysis

In [None]:
# Selecting relevant numerical and categorical features
numerical_features = ['age', 'balance']
categorical_features = ['job', 'marital', 'education', 'housing']
selected_features = numerical_features + categorical_features
df_selected = df[selected_features]

## 2.3 Creating the Pipeline
### Numerical Features
#### Checking for Normality
For numerical features, we need to check for normality. We use the Anderson-Darling test because we have quite a lot of data available; the Shapiro-Wilk test can be overly sensitive with larger sample sizes (>>100).
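To make the test concrete, here is a minimal sketch on synthetic data, assuming the classic `scipy.stats.anderson` interface that returns a statistic together with critical values. The log-normal sample is made up and only stands in for a skewed variable such as `balance`. Normality is rejected at a given significance level when the statistic exceeds the matching critical value.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (log-normal), illustrative only
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

result = anderson(skewed, dist='norm')

# For dist='norm', the significance levels are reported in percent: 15, 10, 5, 2.5, 1
for level, critical in zip(result.significance_level, result.critical_values):
    decision = "reject normality" if result.statistic > critical else "cannot reject normality"
    print(f"{level:>4}%: critical value {critical:.3f} -> {decision}")

# Index 2 corresponds to the 5% significance level
rejected_at_5pct = result.statistic > result.critical_values[2]
print(rejected_at_5pct)
```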

If a variable is not normally distributed, we will apply a Box-Cox transformation, after making sure its values are positive.

#### Positive Values
If the variable contains negative or zero values, we will shift the values before applying the Box-Cox transformation. Shifting will not change the shape of the distribution.
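The shift-then-transform idea can be sketched on synthetic data as follows. The sample is made up (an exponential draw moved below zero), and the shift rule mirrors the one used in this notebook. Skewness is used here as a simple, assumed proxy for "shape": it is identical before and after the shift, and it shrinks after Box-Cox.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(42)

# Synthetic right-skewed values that include negatives (illustrative only)
values = rng.exponential(scale=1000.0, size=5000) - 200.0

# Shift so the minimum becomes 1 when the data are not strictly positive
shift = 0 if values.min() > 0 else -values.min() + 1
shifted = values + shift

# Adding a constant leaves the skewness unchanged
print(round(skew(values), 4), round(skew(shifted), 4))

# Without lmbda, boxcox estimates lambda by maximum likelihood
transformed, lam = boxcox(shifted)
print(f"lambda = {lam:.3f}, skew after Box-Cox = {skew(transformed):.3f}")
```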

#### Standardization
Finally, we will standardize the values so that all variables are expressed in the same unit.

### Categorical Features
For categorical features we will simply apply a one-hot encoding.

In [None]:
class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, alpha=0.05):
        self.alpha = alpha

    def fit(self, X, y=None):
        self.lambdas_ = {}
        self.shifts_ = {}
        for col in X.columns:
            # Shift data if necessary
            min_val = X[col].min()
            shift = 0 if min_val > 0 else -min_val + 1
            self.shifts_[col] = shift
            shifted_data = X[col] + shift

            # Apply Anderson-Darling test
            ad_test_result = anderson(shifted_data.dropna())
            if ad_test_result.statistic > ad_test_result.critical_values[0]:  # critical_values[0] is the critical value at the 15% significance level (self.alpha is not used here)
                _, maxlog = boxcox(shifted_data.dropna())
                self.lambdas_[col] = maxlog
        return self

    def transform(self, X):
        X_transformed = pd.DataFrame(index=X.index)
        for col in X.columns:
            shifted_data = X[col] + self.shifts_.get(col, 0)
            if col in self.lambdas_:
                transformed_col = boxcox(shifted_data, lmbda=self.lambdas_[col])
                X_transformed[col] = transformed_col
            else:
                X_transformed[col] = shifted_data
        return X_transformed

# Create a pipeline for numerical features
numerical_pipeline = Pipeline([
    ('boxcox', BoxCoxTransformer()),
    ('scaler', StandardScaler())
])

# Create a pipeline for categorical features
categorical_pipeline = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a column transformer to combine the pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    remainder='passthrough'
)

# Apply the transformations
transformed_data = preprocessor.fit_transform(df_selected)

# Manually construct feature names for the one-hot encoded categorical columns
ohe_categories = preprocessor.named_transformers_['cat'].named_steps['onehot'].categories_
cat_features = [f"{col}_{subcat}" for col, subcats in zip(categorical_features, ohe_categories) for subcat in subcats]

# Combining the feature names for both numerical and categorical columns
all_feature_names = numerical_features + cat_features

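# Convert to a dense array (the ColumnTransformer returns a sparse matrix only when the output is sparse enough)
dense_transformed_data = transformed_data.toarray() if hasattr(transformed_data, 'toarray') else transformed_data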

# Create the final DataFrame
transformed_df = pd.DataFrame(data=dense_transformed_data, columns=all_feature_names, index = df.index)


In [None]:
transformed_df

**NOTE:** While you can use one-hot encoded variables in a GMM, the effectiveness and interpretability of the model might be compromised.

Instead, we will assign a segment to each observation and then profile the segments using the original data.

## 2.4 Checking for Normality
A GMM has a strong normality assumption. We will check whether these fields are normally distributed and, if not, apply a Box-Cox transformation.

In [None]:
columns = df_selected.shape[1]

# Check normality assumption for numerical values (no dummies)
for c in df_selected[numerical_features].columns:
    data = df_selected[c]

    # Visual Inspection: Histogram and Q-Q plot
    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    sns.histplot(data, kde=True)
    plt.title(f'Histogram of {c}')

    plt.subplot(1, 2, 2)
    stats.probplot(data, dist="norm", plot=plt)
    plt.title(f'Q-Q plot of {c}')

    plt.tight_layout()
    plt.show()

With the exception of `campaign`, which is really more of a discrete variable, `age` and `balance` seem to have been normalized quite nicely. We will allow some 'slack' here.

# Generating the model

## Estimating k - the Number of Segments

In [None]:
def find_best_gmm(X, k_range=range(2, 9)):  # k_range is iterated directly, so it must list every candidate k
    best_aic = np.inf
    best_bic = np.inf
    best_k = None
    best_gmm = None
    aic_values = []
    bic_values = []

    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        aic = gmm.aic(X)
        bic = gmm.bic(X)

        # Store the various AICs and BICs
        aic_values.append(aic)
        bic_values.append(bic)

        if aic < best_aic:
            best_aic = aic
            best_bic = bic
            best_k = k
            best_gmm = gmm

    return best_gmm, best_k, best_aic, best_bic, aic_values, bic_values

# Example usage
# Assuming 'transformed_df' is the DataFrame from the pipeline
k_range = range(3,12)
best_gmm_model, best_k, best_aic, best_bic, aics, bics = find_best_gmm(transformed_df, k_range)
print(f"Best GMM Model: {best_k} components")
print(f"AIC: {best_aic}")
print(f"BIC: {best_bic}")

In [None]:
# Plotting AIC and BIC
plt.plot ([k for k in k_range], aics, label='AIC')
plt.plot ([k for k in k_range], bics, label='BIC')
plt.legend()
plt.xlabel('Number of Components')

There is strong agreement between the two measures, supporting a 5-component solution. Although the AIC and BIC continue to decrease as we add components, we are looking for the simplest (least complex) model.

If the graph has two or more valleys, I usually go with the lowest number of components because I prefer explainability over precision, meaning I will pick the least complex model.
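The "first valley" rule of thumb can be written down as a small helper. This is only a sketch: `first_valley` is a hypothetical function, not part of scikit-learn, and the BIC values below are made up for illustration.

```python
def first_valley(ks, scores):
    """Return the smallest k whose score is lower than both of its neighbours.

    Falls back to the k with the overall minimum score when the curve
    has no interior valley (for example, when it decreases monotonically).
    """
    ks, scores = list(ks), list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]:
            return ks[i]
    return ks[scores.index(min(scores))]

# Made-up BIC values for k = 3..9 with valleys at k = 5 and k = 8
example_ks = range(3, 10)
example_bics = [980.0, 940.0, 910.0, 925.0, 930.0, 905.0, 915.0]
print(first_valley(example_ks, example_bics))
```

In this notebook, the same helper could be called as `first_valley(k_range, bics)`. It deliberately prefers the earlier valley at k = 5 even though the later one at k = 8 has a lower score.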

# Generating Segmentation from the Best Model
Let's segment the original dataset using the best model.

In [None]:
# Let's build the model with 5 components
gmm = GaussianMixture(n_components=5, random_state=2).fit(transformed_df)

# Assigning the segments to the original DataFrame
df['Segment'] = gmm.predict(transformed_df)

**NOTE:** We apply the GMM on the transformed data, but store the field `Segment` in the original dataset. This 'trick' allows us to interpret the segments using the unstandardized and untransformed data.
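The same pattern can be sketched on a tiny synthetic table. All names prefixed with `toy` are hypothetical and the values are made up. The model is fitted on the standardized copy, while the labels are written back to the readable table, so the segment profiles come out in years and euros rather than in z-scores.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic customers: two well-separated groups (illustrative values only)
toy = pd.DataFrame({
    'age': np.concatenate([rng.normal(30, 3, 100), rng.normal(60, 3, 100)]),
    'balance': np.concatenate([rng.normal(500, 50, 100), rng.normal(5000, 300, 100)]),
})

# The model only ever sees the standardized copy
toy_scaled = StandardScaler().fit_transform(toy)
toy_gmm = GaussianMixture(n_components=2, random_state=0).fit(toy_scaled)

# The labels are attached to the original, human-readable table
toy['Segment'] = toy_gmm.predict(toy_scaled)

# Profiles are computed on the original units
profile = toy.groupby('Segment')[['age', 'balance']].mean().round(1)
print(profile)
```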

In [None]:
df.head()

# 6. Displaying the Segmentation & Generating Summaries

In [None]:
# Set Seaborn style
sns.set(style="whitegrid")

# Prepare data
segment_counts = df['Segment'].value_counts()

# Plot
fig, ax = plt.subplots()
ax.pie(segment_counts, labels=segment_counts.index, autopct='%1.1f%%', startangle=90)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.title('Customer Segmentation')
plt.show()

Using the original dataset, let's build summary statistics for both numerical and categorical variables. We will achieve this by calculating these summary statistics separately and then combining them into a single DataFrame.

In [None]:
summary_stats = pd.DataFrame()

for segment in df['Segment'].unique():
    segment_data = df[df['Segment'] == segment]

    # Summary for numerical features
    num_summary = segment_data[numerical_features].agg(['mean', 'std', 'min', 'max', 'count']).unstack()

    # Summary for categorical features
    cat_mode = segment_data[categorical_features].apply(lambda x: x.mode()[0])  # Mode
    cat_freq = segment_data[categorical_features].apply(lambda x: x.value_counts().values[0])  # Frequency of mode

    cat_mode.index = [f"{feat}_mode" for feat in categorical_features]
    cat_freq.index = [f"{feat}_freq" for feat in categorical_features]

    # Combining the summaries for both types of variables
    segment_summary = pd.concat([num_summary, cat_mode, cat_freq])

    # Adding the segment summary to the overall summary
    summary_stats = pd.concat([summary_stats, segment_summary], axis=1)

# Naming the columns after the segments
summary_stats.columns = [f"Segment_{seg}" for seg in df['Segment'].unique()]

print("Summary Statistics for Each Segment:")
print(summary_stats)


# 7. Describing Your Segments
One of the more critical steps is to elaborate and describe the segments and provide a high-level strategy on what to do next.

Take notice of this, as this represents the most critical step; otherwise, why did we perform a segmentation analysis in the first place?

Notice how I spent time and engaged with domain expert colleagues to write down this section of the notebook. You should do the same in a real use case.

**NOTE:** If you do not have the necessary domain understanding (yet), feel free to invite experienced colleagues to an analytical session to describe the segments and develop a strategy to move forward. **There is nothing wrong with this, and it is encouraged**.

## Segment_0: The Established Management Group

  - **Age Profile**: Aged around 39 years on average with a broad range (20-81 years).
  - **Financial Profile**: Second-highest average balance (€1755) with a large standard deviation, indicating significant variability in financial status.
  - **Demographics and Preferences**: Predominantly in management roles, married, with tertiary education, and less likely to have housing loans.
  - **Opportunities**: Offer high-value investment products, wealth management services, and retirement planning. Upsell premium credit cards and insurance products.

## Segment_1: The Retired Group

  - **Age Profile**: Significantly older (average age 53), with the oldest members up to 95 years.
  - **Financial Profile**: High average balance (€1810), suggesting accumulated wealth.
  - **Demographics and Preferences**: Mostly retired, married, with secondary education, and less inclined towards housing loans.
  - **Opportunities**: Offer estate planning, health insurance, and senior citizen savings schemes. Promote leisure and travel-related offers.

## Segment_2: The Mature Managers

  - **Age Profile**: Older (around 44 years), spanning a wide age range.
  - **Financial Profile**: Moderate average balance (€1221) with a high standard deviation.
  - **Demographics and Preferences**: Management jobs, married, secondary education, with housing loans.
  - **Opportunities**: Retirement planning services, fixed deposits, and tax-saving investment options. Also, offer refinancing options for housing loans.

## Segment_3: The Stable Middle-Aged Group

  - **Age Profile**: Middle-aged group (average age around 42).
  - **Financial Profile**: Moderate average balance (€1258) with a substantial standard deviation.
  - **Demographics and Preferences**: Predominantly blue-collar, married, with secondary education, less likely to have housing loans.
  - **Opportunities**: Target this segment with medium-term investment products, education loans for children, and life insurance.

## Segment_4: The Working-Class Savers

  - **Age Profile**: Similar to Segment_0 in age, but slightly younger on average.
  - **Financial Profile**: Lower average balance (€1088) with a moderate standard deviation.
  - **Demographics and Preferences**: Mainly blue-collar workers, married, with secondary education, and more likely to have housing loans.
  - **Opportunities**: Focus on savings accounts with better interest rates, personal loans, and home loan products. Financial literacy programs can be beneficial.

# General Insights and Opportunities:

  - **Cross-Selling and Up-Selling**: Tailored products based on life stage and financial capacity. Upsell premium services to high-balance segments.
  - **Financial Education**: Focus on segments with lower balances and less tertiary education with financial literacy programs.
  - **Personalized Marketing**: Use demographics to create targeted marketing campaigns for each segment.
  - **Digital Banking Services**: Enhance digital banking solutions for tech-savvy younger segments and simplified services for older customers.
  - **Customer Retention**: Offer loyalty programs and benefits to retain long-standing customers, especially in older and wealthier segments.
  - **Community Engagement**: Engage in community programs or sponsor events in areas with a high concentration of certain segments.

# Expanding Your Strategy
Based on the detailed profiling and analysis of the customer segments, there are several strategic opportunities for expanding the bank's business. These opportunities focus on both deepening relationships with existing customers and attracting new ones:

  - **Tailored Financial Products and Services**: Develop and offer financial products tailored to the specific needs of each segment. For example, wealth management services for the high-balance segments, affordable loan products for the working-class savers, and retirement planning for the older segments.

  - **Enhanced Digital Banking Solutions**: Invest in digital banking technologies to appeal to younger and tech-savvy customers. Simplify the digital experience for older customers to encourage adoption. Offer online financial advisory services, mobile banking apps, and easy-to-use online platforms.

  - **Financial Literacy and Education Programs**: Create programs targeting segments with lower average balances and educational levels. These programs can educate customers on saving strategies, investment options, and effective financial management, fostering a more financially literate customer base.

  - **Personalized Marketing and Customer Engagement**: Implement data-driven marketing strategies to offer personalized banking experiences. Use the insights from the segmentation to create targeted marketing campaigns, loyalty programs, and personalized offers that resonate with each segment.

  - **Community Involvement and Social Responsibility Initiatives**: Engage with the community through local events, sponsorships, and corporate social responsibility initiatives. This approach can improve brand perception and attract customers who value community involvement.

  - **Cross-Selling and Up-Selling Strategies**: Utilize customer data to identify opportunities for cross-selling and up-selling appropriate banking products and services, such as insurance, loans, and credit cards.

  - **Partnerships and Collaborations**: Form strategic partnerships with businesses, educational institutions, and other organizations. These partnerships can provide a channel for reaching new customers and offering exclusive deals or services.

  - **Expansion into New Markets or Demographics**: Explore opportunities to expand into new geographic markets or target demographics that are underrepresented in the current customer base.

  - **Customer Feedback and Continuous Improvement**: Regularly collect customer feedback to understand evolving needs and preferences. Use this feedback to continuously improve products, services, and customer experiences.

  - **Diversification of Investment and Savings Options**: Offer a diverse range of investment products, catering to different risk appetites and financial goals. This could include stocks, bonds, mutual funds, and retirement savings plans.

  - **Robust Customer Service and Support**: Strengthen customer service channels, providing quick and effective support. Offer financial advisory services to help customers make informed decisions.

By focusing on these areas, the bank can not only improve its offerings and customer relationships but also attract new customers, thereby expanding its business and increasing its market share.

# Congratulations
You have carried out a complete GMM analysis in Python and learned a different way of segmenting your customers.

I have spent a lot of time studying the segments and developing a strategy to improve the current situation. The strategy implementation should follow the strategy put forward by the business.