# Bank Telemarketing Campaign - Predictive Modeling Project

**Project Goal:** Build data-driven models to predict the success of telemarketing calls for long-term bank deposits

**Dataset Period:** 2008-2013 (Global Financial Crisis)

**Methodology:** CRISP-DM (Cross-Industry Standard Process for Data Mining)

---

## 1. Business Understanding

### 1.1 Business Objectives
- TODO: Define the business problem
- TODO: Identify key stakeholders
- TODO: Define success criteria for the project

### 1.2 Project Goals
- TODO: Translate business objectives into data mining goals
- TODO: Define target variable
- TODO: Identify evaluation metrics (accuracy, precision, recall, F1-score, ROC-AUC)

### 1.3 Business Context
- TODO: Describe the telemarketing campaign process
- TODO: Explain the financial crisis context (2008-2013)
- TODO: Define constraints and requirements

In [None]:
# Import necessary libraries
# TODO: Add imports as needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.combine import SMOTETomek, SMOTEENN

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split 
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.preprocessing import StandardScaler

# TODO: Add scikit-learn imports
# TODO: Add any other libraries needed

from scipy.stats import chi2_contingency
import math

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---
## 2. Data Understanding

### 2.1 Data Collection
- TODO: Load the dataset - DONE
- TODO: Document data sources

In [None]:
# Load the dataset
df = pd.read_csv('bank.csv', sep=';')
df

### 2.2 Data Description
- TODO: Examine dataset structure
- TODO: Identify features and their types
- TODO: Document feature definitions

In [None]:
# Basic dataset information

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

### 2.3 Data Exploration
- TODO: Analyze target variable distribution - DONE
- TODO: Check for class imbalance - DONE
- TODO: Explore feature distributions - DONE

In [None]:
# Target variable analysis

The target variable 'y' represents whether a customer will buy a long-term bank deposit or not.

In [None]:
# target value
goal = df['y']
counts = goal.value_counts()
percent = goal.value_counts(normalize=True)
percent100 = goal.value_counts(normalize=True).mul(100).round(1).astype(str)+'%'
pd.DataFrame({'y': counts,'percent': percent100})

In [None]:
subscription_summary = pd.DataFrame({
    'class': counts.index,
    'count': counts.values,
    'percent': percent.values
})

print(subscription_summary)

sns.countplot(data=df, x='y')
plt.title('Subscription Class Distribution')
plt.xlabel('Subscription (y)')
plt.ylabel('Count')
plt.show()

In [None]:
imbalance_ratio = counts.min() / counts.max()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")

The imbalance ratio is below 0.2, which indicates a severe class imbalance: the dataset is dominated by non-buyers.
Without addressing this imbalance, a model that always predicts "no purchase" would achieve high accuracy but fail to identify potential buyers.

To handle the class imbalance, we will later apply SMOTE combined with Tomek links.
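To make the accuracy argument concrete, the sketch below scores a model that always predicts "no" on a made-up label vector. The counts (890 "no", 110 "yes") are illustrative assumptions, not figures taken from `bank.csv`.

```python
# Hypothetical labels: 890 "no" and 110 "yes" (illustrative, not taken from the dataset)
y_true = ["no"] * 890 + ["yes"] * 110

# A trivial model that always predicts the majority class
y_pred = ["no"] * len(y_true)

# Accuracy: share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the "yes" class: share of actual buyers that the model finds
true_positives = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))
recall_yes = true_positives / y_true.count("yes")

print(f"Accuracy: {accuracy:.2f}")        # 0.89
print(f"Recall (yes): {recall_yes:.2f}")  # 0.00
```

The high accuracy comes entirely from the majority class, which is why metrics such as recall, F1-score or ROC-AUC are more informative here.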

In [None]:
# Univariate analysis
# TODO: Analyze numerical features
# TODO: Analyze categorical features

In [None]:
# Separate numerical and categorical features
numericFeatures = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categFeatures = [col for col in df.select_dtypes(include=['object', 'category']).columns if col != 'y']

print("Numerical features:", numericFeatures)
print("Categorical features:", categFeatures)


In [None]:
# Bivariate analysis - shouldnt it be later???
# TODO: Analyze relationships with target variable
# TODO: Correlation analysis

### 2.4 Data Quality Assessment
- TODO: Check for missing values - DONE
- TODO: Identify outliers - DONE
- TODO: Check for duplicates - DONE
- TODO: Identify data quality issues

In [None]:
# Data quality checks

In [None]:
# Count and percentage of missing values by column

nulls = df.isnull().sum()
null_percent = round(nulls/df.shape[0]*100,3)
nullvalues = pd.concat([nulls,null_percent], axis=1, keys=('Count','%'))
nullvalues

In [None]:
def check_unknown():
    """Print the count and percentage of 'unknown' values per categorical column."""
    categorical_cols_with_unknown = ['job', 'marital', 'education', 'default', 'housing', 'loan']
    # Count 'unknown' in each column
    unknown_counts = {col: (df[col] == 'unknown').sum() for col in categorical_cols_with_unknown}
    unknown_df = pd.DataFrame.from_dict(unknown_counts, orient='index', columns=['Count'])
    unknown_df['Percent'] = round(unknown_df['Count'] / df.shape[0] * 100, 3)
    print(unknown_df)

In [None]:
check_unknown()

There are 

In [None]:
num_plots = len(numericFeatures)
cols = 2
rows = math.ceil(num_plots / cols)

plt.figure(figsize=(cols * 4, rows * 5)) 

for i, feature in enumerate(numericFeatures):
    plt.subplot(rows, cols, i + 1)
    sns.boxplot(data=df, x='y', y=feature, hue='y', palette={"yes": "red", "no": "green"})
    plt.title(feature)

plt.tight_layout()
plt.show()


In [None]:
Q1 = df[numericFeatures].quantile(0.25)
Q3 = df[numericFeatures].quantile(0.75)
IQR = Q3 - Q1

outliers = ((df[numericFeatures] < (Q1 - 1.5 * IQR)) | (df[numericFeatures] > (Q3 + 1.5 * IQR)))
print("Number of outliers per numeric feature:")
print(outliers.sum())

We performed outlier detection on all numerical features using the Interquartile Range (IQR) method and visualized the results with boxplots.

Features such as *previous*, *duration* and *campaign* contained a high number of outliers.
*age* and *pdays* also contained a noticeable number of outliers.
Several features, including *emp.var.rate*, *cons.price.idx*, *euribor3m* and *nr.employed*, showed no outliers according to the IQR method.

In [None]:
df.duplicated().sum()

In [None]:
# Remove duplicate lines, if they exist

shape_before = df.shape
print('Shape before deleting duplicate values:',shape_before)

df = df.drop_duplicates()

shape_after = df.shape
print('Shape after deleting duplicate values:',shape_after)

percent = round((1-shape_after[0]/shape_before[0])*100,3)
print(f"Percentage of duplicate rows dropped: {percent}%")

There are 12 duplicated rows.

---
## 3. Data Preparation

### 3.1 Data Cleaning
- TODO: Handle missing values - DONE
- TODO: Remove/treat outliers - We will use Robust Scaler.
- TODO: Remove duplicates - DONE
- TODO: Fix data inconsistencies - DONE (there were none)
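As a rough illustration of why a robust scaler is a reasonable way to treat outliers, the sketch below compares median/IQR scaling with mean/standard-deviation scaling on a toy array. The values are made up; the formula `(x - median) / IQR` corresponds to the default behaviour of scikit-learn's `RobustScaler`, which centers on the median and scales by the 25th to 75th percentile range.

```python
import numpy as np

# Toy feature with one extreme value (illustrative data, not from the dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust scaling: center on the median, divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Standard scaling: center on the mean, divide by the standard deviation
standard = (x - x.mean()) / x.std()

print("robust:  ", np.round(robust, 2))
print("standard:", np.round(standard, 2))
```

Under robust scaling the four ordinary points keep a usable spread, while under standard scaling the single extreme value inflates the mean and standard deviation and squeezes them together. The extreme value itself is not removed by either method.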

The first step is to handle the missing values identified in Section 2.4 [...].
For the attributes with few missing values, namely *job*, *marital* and *education*, the missing values (encoded as 'unknown') are replaced with the most common value of that attribute.

In [None]:
impute_cols = ['job', 'marital', 'education']

for col in impute_cols:
    most_common = df.loc[df[col] != 'unknown', col].mode()[0]
    df.loc[df[col] == 'unknown', col] = most_common
    print(f"Imputed 'unknown' in '{col}' with: {most_common}")

In [None]:
# check_unknown()

For the attributes with a more significant number of missing values (*default*, *housing* and *loan*), the 'unknown' entries are changed to 'no' because [...]

In [None]:
replace_cols = ['default', 'housing', 'loan']

for col in replace_cols:
    df.loc[df[col] == 'unknown', col] = 'no'

In [None]:
# check_unknown()

The duplicates are then removed to avoid redundant data. [...]

In [None]:
# Remove duplicate lines, if they exist

shape_before = df.shape
print('Shape before deleting duplicate values:',shape_before)

df = df.drop_duplicates()

shape_after = df.shape
print('Shape after deleting duplicate values:',shape_after)

percent = round((1-shape_after[0]/shape_before[0])*100,3)
print(f"Percentage of duplicate rows dropped: {percent}%")

In [None]:
for col in df.select_dtypes(include='object').columns:
    print(f"{col} unique values: {df[col].unique()}")

### 3.2 Bivariate analysis
- TODO: Analyze relationships with target variable
- TODO: Correlation analysis

In [None]:
fig, PlotCanvas = plt.subplots(nrows=math.ceil(len(categFeatures)/2), ncols=2, figsize=(16, 40))

# Create grouped bar plots for each categorical predictor against the target variable 'y'
lin = 0
for i, Categcol in enumerate(categFeatures):
    col = i%2   
    CrossTabResult=pd.crosstab(index=df[Categcol], columns=df['y'])
    CrossTabResult.plot.bar(color=['green','red'], ax=PlotCanvas[lin,col])
    if i%2 == 1:
        lin = lin+1
    

These grouped bar charts show the frequency on the Y-axis and the category values on the X-axis. If the proportions of the target classes ("yes" vs. "no") are similar across all categories of a feature, there is likely little to no relationship between that feature and the target.

For example, if the plot of *day_of_week* vs. *y* shows a similar "yes" to "no" ratio for each day of the week, the day of the week likely has no significant influence on *y*. The same pattern appears in the plots of *default* vs. *y*, *housing* vs. *y* and *loan* vs. *y*, which suggests that *default*, *housing* and *loan* are likely not associated with *y*.

However, a few variables seem to be associated with *y*, namely *marital*, *poutcome*, *contact*, *job* and *education*. For these variables, the proportions of the target classes differ across categories.
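Because the bar charts show raw counts, categories of very different sizes can hide or exaggerate a pattern. One way to read the "yes" share directly is a row-normalized crosstab. The sketch below uses a small made-up frame; the column names `contact` and `y` mirror the dataset, but the values are purely illustrative.

```python
import pandas as pd

# Toy data (illustrative only): two contact channels with different sizes
toy = pd.DataFrame({
    "contact": ["cellular"] * 6 + ["telephone"] * 4,
    "y": ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "no", "no"],
})

# normalize="index" turns each row into proportions that sum to 1,
# so categories of different sizes become directly comparable
rates = pd.crosstab(index=toy["contact"], columns=toy["y"], normalize="index")
print(rates)
```

In this toy frame the "yes" share is 0.50 for `cellular` and 0.25 for `telephone`, a difference that is harder to see from the raw counts alone.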

In [None]:
num_features = len(numericFeatures)
cols = 3
rows = math.ceil(num_features / cols)

# Create histograms
plt.figure(figsize=(5 * cols, 4 * rows))

for i, feature in enumerate(numericFeatures):
    plt.subplot(rows, cols, i + 1)
    sns.histplot(df, x=feature, hue='y', kde=True, bins=30, palette={"yes": "red", "no": "green"})
    plt.title(feature)

plt.tight_layout()
plt.show()

### 3.3 Data Transformation
- TODO: Encode categorical variables - DONE
- TODO: Scale/normalize numerical features - DONE
- TODO: Handle skewed distributions - DONE 

In [None]:
# Encode categorical variables

In [None]:
dfML = df.copy()

for feature in categFeatures:
    print(feature)
    print(dfML[feature].unique())
    if dfML[feature].dropna().isin(['yes', 'no']).all():
        dfML[feature] = (dfML[feature].values == 'yes').astype(int)
        print(dfML[feature].unique())

dfML.info()

In [None]:
dfML = pd.get_dummies(dfML, drop_first=True, dtype=int)

dfML.rename({'y_yes': 'y'}, axis='columns', inplace = True)

dfML.info()

In [None]:
# Feature scaling

In [None]:
numeric_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 
                    'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

# Initialize the scaler
scaler = RobustScaler()

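# Fit the scaler on the numeric features of the encoded dataframe and transform them
# NOTE: fitting before the train/test split lets test-set statistics leak into training;
# fitting on the training split only would avoid this
dfML[numeric_features] = scaler.fit_transform(dfML[numeric_features])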

### 3.3 Feature Engineering ???
- TODO: Create new features from existing ones
- TODO: Create interaction features
- TODO: Create time-based features if applicable
- TODO: Create domain-specific features

In [None]:
# Feature engineering
# TODO: Create new features based on domain knowledge and EDA insights

### 3.4 Data Splitting
- TODO: Split data into training and test sets - DONE
- TODO: Handle class imbalance if necessary (SMOTE, undersampling, etc.) - DONE 
- TODO: Set up cross-validation strategy ??? why
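On the open question of why a cross-validation strategy is needed: a single 80/20 split gives one performance estimate that depends on which rows happen to land in the test set, whereas k-fold cross-validation averages over several splits. With an imbalanced target, stratified folds keep the minority share roughly constant in every fold. A minimal sketch on synthetic labels, assuming scikit-learn's `StratifiedKFold` and a made-up 10% positive rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with a 10% positive rate (illustrative, not the real data)
y_toy = np.array([1] * 20 + [0] * 180)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Collect the share of positives in each validation fold
fold_rates = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]
print(fold_rates)
```

If resampling such as SMOTE is combined with cross-validation, it would normally be applied to the training folds only, so that the validation folds keep the original class distribution.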

In [None]:
# Train-test split

In [None]:
X = df.drop(columns=['y'])
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Original dataframe
# Combine features and target for training set
train_df = X_train.copy()
train_df['y'] = y_train

# Combine features and target for testing set
test_df = X_test.copy()
test_df['y'] = y_test


# This second dataframe is processed

X = dfML.drop(columns=['y'])
y = dfML['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Handle class imbalance

In [None]:
smote_tomek = SMOTETomek(random_state=42)
X_train_resampled, y_train_resampled = smote_tomek.fit_resample(X_train, y_train)

In [None]:
train_dfML = X_train_resampled.copy()
train_dfML['y'] = y_train_resampled

test_dfML = X_test.copy()
test_dfML['y'] = y_test

### 3.5 Feature Selection
- TODO: Identify highly correlated features - DONE
- TODO: Apply feature importance analysis - DONE
- TODO: Select relevant features for modeling - DONE - two dataframes prepared

#### Filter Methods

**Statistical Feature Selection**

* Continuous vs Continuous: Correlation matrix
* Categorical vs Continuous: ANOVA test
* Categorical vs Categorical: Chi-Square test

**Categorical vs categorical using Chi-Square Test**

The Chi-Square test checks whether two categorical variables are associated.
 - Assumption (H0): The two columns are NOT related to each other
 - Result of Chi-Sq Test: The p-value, i.e. the probability of observing an association at least this strong if H0 were true
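The test can be worked through by hand on a small table. The sketch below uses a hypothetical 2x2 contingency table (the counts are made up) and computes the expected counts, the Pearson chi-square statistic and the p-value. For one degree of freedom the upper-tail probability has the closed form `erfc(sqrt(x / 2))`, so no statistics library is needed.

```python
import math
import numpy as np

# Hypothetical 2x2 contingency table (rows: y = no / yes, columns: loan = no / yes)
observed = np.array([[800, 150],
                     [ 40,  10]])

# Expected counts under H0 (independence): row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic, without continuity correction
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# A 2x2 table has 1 degree of freedom, for which the upper-tail
# probability of the chi-square distribution equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.3f}")
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its result for the same table would differ slightly from this uncorrected calculation.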

In [None]:
def FunctionChisq(inpData, TargetVariable, CategoricalVariablesList):
    # Collect the predictors that are NOT associated with the target (candidates for removal)
    FiltPredictors=[]

    for predictor in CategoricalVariablesList:
        CrossTabResult=pd.crosstab(index=inpData[TargetVariable], columns=inpData[predictor])
        ChiSqResult = chi2_contingency(CrossTabResult)
        
        # If the ChiSq P-Value is <0.05, that means we reject H0
        if (ChiSqResult[1] < 0.05):
            print(predictor, 'is correlated with', TargetVariable, '| P-Value:', ChiSqResult[1])
        else:
            print(predictor, 'is NOT correlated with', TargetVariable, '| P-Value:', ChiSqResult[1]) 
            FiltPredictors.append(predictor)
            
    return(FiltPredictors)

In [None]:
filterCateg = FunctionChisq(inpData=train_df, TargetVariable='y', CategoricalVariablesList= categFeatures)

The Chi-square test indicates that *default*, *housing* and *loan* are not significantly associated with *y*, which is consistent with the plots in Section 3.2.
For example, while the raw counts of the loan categories vary significantly, the proportion of clients who purchased the deposit (y = yes) is similar between those with and without loans. This is reflected in a high p-value (~0.98), indicating no statistically significant association between loan status and the target variable.

**Continuous vs categorical using ANOVA test**
 
   - Assumption (H0): There is NO relation between the given variables (i.e. the mean value of the numeric predictor is the same for all groups of the categorical target variable)

ANOVA test result: the p-value, i.e. the probability of observing group differences at least this large if H0 were true
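The F statistic behind the ANOVA test compares the variation between group means with the variation inside each group. The sketch below computes it by hand for two made-up groups of call durations; `scipy.stats.f_oneway` computes the same statistic and additionally converts it into a p-value using the F distribution.

```python
import numpy as np

# Toy call durations in seconds for the two target groups (made-up values)
groups = {
    "no":  np.array([100.0, 120.0, 140.0]),
    "yes": np.array([300.0, 320.0, 340.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n = all_values.size        # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# F statistic: ratio of the between-group to the within-group mean square
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f_stat:.1f}")  # F = 150.0
```

A large F value means the group means differ by much more than the noise within the groups would explain, which corresponds to a small p-value.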

In [None]:
def FunctionAnova(inpData, TargetVariable, ContinuousPredictorList):
    from scipy.stats import f_oneway

    # Collect the predictors that are NOT associated with the target (candidates for removal)
    FiltPredictors=[]
    
    print('##### ANOVA Results ##### \n')
    for predictor in ContinuousPredictorList:
        CategoryGroupLists=inpData.groupby(TargetVariable)[predictor].apply(list)
        AnovaResults = f_oneway(*CategoryGroupLists)
        
        # If the ANOVA P-Value is <0.05, that means we reject H0
        if (AnovaResults[1] < 0.05):
            print(predictor, 'is correlated with', TargetVariable, '| P-Value:', AnovaResults[1])
        else:
            print(predictor, 'is NOT correlated with', TargetVariable, '| P-Value:', AnovaResults[1])
            FiltPredictors.append(predictor)
            
    return(FiltPredictors)

In [None]:
# Calling the function to check which numeric variables are correlated with target

filterNumeric = FunctionAnova(inpData=train_df, TargetVariable='y', ContinuousPredictorList = numericFeatures)


In [None]:
FilterColumns = filterCateg + filterNumeric

print(f"Removed features by Filter methods: {FilterColumns}")

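# Prepare df1 to be used by KNN

df1 = train_dfML.copy()

# Drop the features rejected by the filter methods, including any one-hot columns
# derived from them (e.g. a rejected 'job' appears in the encoded frame as 'job_...' columns)
drop_cols = [c for c in df1.columns
             if c in FilterColumns or any(c.startswith(f"{f}_") for f in FilterColumns)]
df1.drop(columns=drop_cols, inplace=True)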

df1.info()

In [None]:
def lasso_regularization(df):

    X = df.iloc[:,:-1].copy()          
    y = df.iloc[:,-1].copy() 
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    scaler = StandardScaler()
    scaler.fit(X_train)

    # Fit a logistic regression with an L1 (lasso) penalty, which performs
    # feature selection while fitting.
    # SelectFromModel keeps the features whose coefficients are non-zero.

    sel_ = SelectFromModel(LogisticRegression(C=0.5, penalty='l1', solver='liblinear', random_state=10))

    sel_.fit(scaler.transform(X_train), y_train)

    # Features whose coefficients were shrunk to exactly zero are removed
    zero_mask = (sel_.estimator_.coef_ == 0).ravel()
    print("Number of features whose coefficients were shrunk to zero: ", zero_mask.sum())
    removed_feats = X_train.columns[zero_mask]
    print('Removed features by Lasso: ',removed_feats)

    # Return the names of the features with non-zero coefficients
    return X_train.columns[~zero_mask]

In [None]:
Lasso_SelectedColumns = lasso_regularization(train_dfML)

Lasso_SelectedColumns

In [None]:
df2 = train_dfML[Lasso_SelectedColumns].copy()

df2['y'] = train_dfML['y']

df2.info()

---
## 4. Modeling

### 4.1 Baseline Model
- TODO: Create a simple baseline model (e.g., majority class classifier)
- TODO: Evaluate baseline performance

In [None]:
# Baseline model
# TODO: Implement baseline model

### 4.2 Model Selection
- TODO: Train multiple algorithms from class:
  - Logistic Regression
  - Decision Trees
  - Random Forest
  - Gradient Boosting (XGBoost, LightGBM)
  - Support Vector Machines
  - Neural Networks
  - K-Nearest Neighbors
  - Naive Bayes
  - TODO: Add others as covered in class

In [None]:
# Model 1: Logistic Regression
# TODO: Train and evaluate logistic regression model

In [None]:
# Model 2: Decision Tree
# TODO: Train and evaluate decision tree model

In [None]:
# Model 3: Random Forest
# TODO: Train and evaluate random forest model

In [None]:
# Model 4: Gradient Boosting
# TODO: Train and evaluate gradient boosting model

In [None]:
# Model 5: Support Vector Machine
# TODO: Train and evaluate SVM model

In [None]:
# Model 6: [Add more models as needed]
# TODO: Train and evaluate additional models

### 4.3 Hyperparameter Tuning
- TODO: Define hyperparameter search space
- TODO: Apply Grid Search or Random Search
- TODO: Use cross-validation for tuning

In [None]:
# Hyperparameter tuning - Model 1
# TODO: Implement GridSearchCV or RandomizedSearchCV

In [None]:
# Hyperparameter tuning - Model 2
# TODO: Implement hyperparameter tuning for other promising models

### 4.4 Ensemble Methods
- TODO: Create ensemble models (voting, stacking, blending)
- TODO: Combine best performing models

In [None]:
# Ensemble models
# TODO: Implement ensemble techniques

---
## 5. Evaluation

### 5.1 Model Performance Metrics
- TODO: Calculate accuracy, precision, recall, F1-score
- TODO: Generate ROC curves and calculate AUC
- TODO: Create confusion matrices
- TODO: Calculate business-relevant metrics (cost/benefit analysis)
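As a reference for the metrics listed above, the sketch below derives them from the four cells of a confusion matrix. The counts are hypothetical placeholders, not results from any model in this notebook.

```python
# Hypothetical confusion-matrix counts for the "yes" class (illustrative only)
tp, fp, fn, tn = 60, 40, 50, 850

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)      # of the customers flagged, how many bought
recall = tp / (tp + fn)         # of the actual buyers, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In a telemarketing setting, precision relates to the share of calls that are not wasted, while recall relates to the share of potential buyers that are actually reached, which is one possible starting point for the cost/benefit analysis.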

In [None]:
# Model evaluation metrics
# TODO: Calculate and compare all metrics across models

In [None]:
# Visualize model performance
# TODO: Create ROC curves, precision-recall curves
# TODO: Create confusion matrices
# TODO: Create comparison charts

### 5.2 Model Interpretation
- TODO: Analyze feature importance
- TODO: Interpret model predictions
- TODO: Validate model behavior

In [None]:
# Feature importance analysis
# TODO: Extract and visualize feature importance from models

In [None]:
# Model interpretation
# TODO: Use SHAP, LIME, or other interpretation methods if applicable

### 5.3 Model Validation
- TODO: Perform cross-validation
- TODO: Test on holdout set
- TODO: Check for overfitting/underfitting

In [None]:
# Cross-validation
# TODO: Perform k-fold cross-validation on best models

In [None]:
# Final model evaluation on test set
# TODO: Evaluate final model(s) on unseen test data

### 5.4 Business Impact Assessment
- TODO: Translate model performance to business value
- TODO: Calculate expected ROI or cost savings
- TODO: Provide actionable recommendations

In [None]:
# Business impact analysis
# TODO: Calculate business metrics (conversion rate improvement, cost reduction, etc.)

---
## 6. Conclusions and Recommendations

### 6.1 Summary of Findings
- TODO: Summarize key insights from data exploration
- TODO: Summarize model performance
- TODO: Identify most important predictive features

### 6.2 Best Model Selection
- TODO: Select and justify the best model
- TODO: Document model strengths and limitations

### 6.3 Recommendations
- TODO: Provide actionable business recommendations
- TODO: Suggest customer prioritization strategy
- TODO: Recommend campaign optimization strategies

### 6.4 Future Work
- TODO: Suggest model improvements
- TODO: Identify additional data needs
- TODO: Propose deployment strategy

### 6.5 Lessons Learned
- TODO: Document challenges faced
- TODO: Share insights from the project
- TODO: Note what would be done differently

---

## Project Notes and Team Collaboration

### Team Members
- TODO: List team members and responsibilities
- Julia Kardasz 1250264

### Project Timeline
- TODO: Document project milestones and deadlines

### References
- TODO: Add references to papers, documentation, and resources used

---
*This notebook follows the CRISP-DM methodology for data mining projects*