# **Week-8 Python for Machine Learning - MLS**

## Term Deposit Prediction

### Problem Statement
DirectMap Bank, located in the UAE, runs direct marketing campaigns to promote term deposits to its customers, reaching potential subscribers through phone calls. Despite these efforts, the bank struggles to identify which customers are most likely to subscribe. The challenges include difficulty understanding customer preferences, inefficient use of resources, and a lack of personalized targeting strategies. To address these issues, the bank has launched several initiatives, such as increasing the number of campaign contacts and refining its communication methods.

### Objective
As a Data Scientist hired by DirectMap Bank, your objective is to analyze the direct marketing campaign data and develop a predictive model that identifies customers who are likely to subscribe to term deposits.


### Import the required libraries

In [1]:
# Importing necessary libraries
import pandas as pd
import sklearn
import joblib

# Fetching dataset from sklearn's openml module
from sklearn.datasets import fetch_openml

# Importing preprocessing modules from sklearn
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Importing make_pipeline function from pipeline module
from sklearn.pipeline import make_pipeline

# Importing train_test_split and RandomizedSearchCV from model_selection module
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Importing LogisticRegression model and evaluation metrics from linear_model and metrics modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Configure scikit-learn to display pipeline diagrams for visualizing the structure of machine learning pipelines
sklearn.set_config(display='diagram')

With display='diagram', scikit-learn renders a pipeline as an HTML diagram in the notebook instead of plain text, which makes the structure of nested preprocessing steps easier to read.

# Data

In [3]:
# Read the data
data_df = pd.read_csv("Bank_Telemarketing.csv")

In [4]:
# Print the top 5 rows from the data
data_df.head()

Unnamed: 0,customer_id,email_id,first_name,last_name,Age,Job,Marital Status,Education,Defaulter,Home Loan,Personal Loan,Communication Type,Last Contacted,Day of Week,Duration(Sec),CC Contact Freq,Days Since PC,PC Contact Freq,PC Outcome,subscribed
0,61e41ab36fb571a283ba252b,jared84@example.org,Aaron,Austin,56.0,housemaid,married,experience,no,no,no,telephone,may,mon,261,1,0,0,nonexistent,0
1,61e41ab36fb571a283ba252c,gsanchez@example.net,Aaron,Gray,57.0,services,married,high school,unknown,no,no,telephone,may,mon,149,1,0,0,nonexistent,0
2,61e41ab36fb571a283ba252d,donald41@example.net,Aaron,Walker,37.0,services,married,high school,no,yes,no,telephone,may,mon,226,1,0,0,nonexistent,0
3,61e41ab36fb571a283ba252e,ariel87@example.com,Aaron,Shelton,40.0,admin.,married,experience,no,no,no,telephone,may,mon,151,1,0,0,nonexistent,0
4,61e41ab36fb571a283ba252f,thomasjeff@example.com,Aaron,Johnson,56.0,services,married,high school,no,no,yes,telephone,may,mon,307,1,0,0,nonexistent,0


### Data Description

1. customer_id: unique customer ID
2. email_id: email ID of a customer
3. first_name: first name of the customer
4. last_name: last name of the customer
5. age: age of a customer
6. job: type of job (admin,blue-collar,entrepreneur,housemaid,management,retired,self-employed,services,student,technician,unemployed,unknown)
7. marital_status: marital status (divorced, married, single, unknown)
8. education: education (basic_4y,basic_6y,basic_9y,high_school,illiterate,professional_course,university_degree,unknown)
9. defaulter: has credit in default (yes,unknown,no)
10. home_loan: customer has home loan? (yes,no,unknown)
11. personal_loan: customer has personal loan? (yes,no,unknown)
12. communication_type: how the customer was contacted, either 'cellular' or 'telephone'
13. last_contacted: customer last contacted month (mar,apr,may,jun,jul,aug,sep,oct,nov,dec)
14. day_of_week: last contact day of the week (mon,tue,wed,thu,fri)
15. duration: total call duration for each customer, in seconds
16. cc_contact_freq: the number of campaigns in which the customer was contacted
17. days_since_pc: the number of days since the bank last contacted the customer about any other product (not term deposits). A value of '-1' means the customer has never been contacted about any product
18. pc_contact_freq: the number of times the customer was contacted in previous campaigns or about any other product (not term deposits)
19. pc_outcome: the outcome of previous contacts about any of the bank's products (other than term deposits)
*   Unknown - This represents that the customer has not been reached so far
*   Success - This represents that the previous call was a successful conversion of the customer
*   Failure - This represents that the customer is not interested in the last product
20. subscribed: has the customer subscribed to a term deposit? (yes, no; stored as 1 and 0 in this dataset)

In [5]:
# Get the column names in the dataset
data_df.columns

Index(['customer_id', 'email_id', 'first_name', 'last_name', 'Age', 'Job',
       'Marital Status', 'Education', 'Defaulter', 'Home Loan',
       'Personal Loan', 'Communication Type', 'Last Contacted', 'Day of Week',
       'Duration(Sec)', 'CC Contact Freq', 'Days Since PC', 'PC Contact Freq',
       'PC Outcome', 'subscribed'],
      dtype='object')

In [6]:
# Get the shape of the data
data_df.shape

(41183, 20)

In [7]:
# Check whether there are any missing values in the dataset
data_df.isnull().sum()

customer_id            0
email_id               0
first_name             0
last_name              0
Age                   15
Job                   16
Marital Status         0
Education              0
Defaulter              0
Home Loan              0
Personal Loan          0
Communication Type     0
Last Contacted         0
Day of Week            0
Duration(Sec)          0
CC Contact Freq        0
Days Since PC          0
PC Contact Freq        0
PC Outcome             0
subscribed             0
dtype: int64

As the output shows, Age and Job contain missing values. We will impute numerical features with the median and categorical features with the most frequent value as part of the sklearn pipeline.
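To make the two imputation strategies concrete, here is a minimal sketch on a made-up frame. The values and the variable names (toy, num_imputer, cat_imputer) are illustrative assumptions, not rows or names from Bank_Telemarketing.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (made-up values, not the bank data)
toy = pd.DataFrame({
    "Age": [25.0, 40.0, np.nan, 90.0],
    "Job": pd.Series(["services", "admin.", "services", np.nan], dtype=object),
})

# Median imputation: the missing age becomes the median of 25, 40 and 90
num_imputer = SimpleImputer(strategy="median")
age_filled = num_imputer.fit_transform(toy[["Age"]])

# Most-frequent imputation: the missing job becomes the most common category
cat_imputer = SimpleImputer(strategy="most_frequent")
job_filled = cat_imputer.fit_transform(toy[["Job"]])

print(age_filled.ravel())
print(job_filled.ravel())
```

The median is often preferred over the mean when a column has outliers, such as the age of 90 in this toy column, because a single extreme value does not shift it.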

In [8]:
# Drop the unnecessary features that are not required for model training
data_df = data_df.drop(columns=['customer_id', 'email_id', 'first_name', 'last_name'])

In [9]:
# Store the numerical feature names in a new variable called numerical_features
numerical_features = data_df[['Age', 'Duration(Sec)', 'CC Contact Freq', 'Days Since PC', 'PC Contact Freq']].columns

In [10]:
numerical_features

Index(['Age', 'Duration(Sec)', 'CC Contact Freq', 'Days Since PC',
       'PC Contact Freq'],
      dtype='object')

In [11]:
# Store the categorical feature names in a new variable called categorical_features
categorical_features = data_df.select_dtypes(include=['object']).columns

In [12]:
categorical_features

Index(['Job', 'Marital Status', 'Education', 'Defaulter', 'Home Loan',
       'Personal Loan', 'Communication Type', 'Last Contacted', 'Day of Week',
       'PC Outcome'],
      dtype='object')

In [13]:
# Count how many customers did and did not subscribe
data_df['subscribed'].value_counts()

subscribed
0    36545
1     4638
Name: count, dtype: int64

# Model Estimation

In [14]:
X = data_df.drop('subscribed',axis=1)
y = data_df['subscribed']

In [15]:
# Split the independent (X) and dependent (y) features into train and test sets, with a test size of 0.2 (20%) and random_state=42
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
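The target is imbalanced (4,638 subscribers out of 41,183 rows, roughly 11%). The split above does not stratify; one option, shown here only as a hedged sketch on made-up labels, is to pass stratify so both subsets keep the same class ratio. The arrays X_toy and y_toy are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 zeros and 10 ones
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 10% positive rate in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results on the bank data is not something the notebook measures; the sketch only shows the mechanism.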

In [16]:
# Creating a pipeline for numerical feature processing, including imputation of missing values with the median and standard scaling.
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])


This code creates a pipeline named numerical_pipeline to process numerical features. It consists of two steps:

'imputer': Imputes missing values using the median strategy with SimpleImputer.

'scaler': Standardizes the numerical features using StandardScaler.
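As a quick check of what these two steps do together, the following sketch runs the same kind of pipeline on a single made-up column. The array toy_numeric and the name toy_numerical_pipeline are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One made-up numeric column with a missing entry
toy_numeric = np.array([[20.0], [30.0], [np.nan], [40.0]])

toy_numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# The missing entry is first filled with the median (30.0); every value is then
# centred on the column mean and divided by the column standard deviation
scaled = toy_numerical_pipeline.fit_transform(toy_numeric)

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1, which is what puts features such as Age and Duration(Sec) on a comparable scale for logistic regression.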

In [17]:
# Creating a pipeline for categorical feature processing, including imputation of missing values with the most frequent value and one-hot encoding with handling of unknown categories.
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

This code constructs a pipeline named categorical_pipeline for processing categorical features, involving imputation of missing values using the most frequent value and one-hot encoding with handling of unknown categories.
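The handle_unknown='ignore' setting matters when new data contains a category that was absent during fitting. A minimal sketch, using made-up job values (the category "astronaut" and the frame names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_toy = pd.DataFrame({"Job": ["services", "admin.", "services"]})
# "astronaut" never appears in the training frame
new_toy = pd.DataFrame({"Job": ["admin.", "astronaut"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_toy)

# transform returns a sparse matrix by default; convert it for display
encoded = encoder.transform(new_toy).toarray()

print(encoder.categories_[0])
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, so the pipeline can still score customers whose category values were not present in the training split.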

In [18]:
# Creating a column transformer named preprocessor to apply specific pipelines to numerical and categorical features separately.
preprocessor = make_column_transformer(
    (numerical_pipeline, numerical_features),
    (categorical_pipeline, categorical_features)
)

In [19]:
# Creating a logistic regression model with parallel processing enabled (-1 indicates using all available cores) for improved training efficiency.
model_logistic_regression = LogisticRegression(n_jobs=-1)

In [20]:
# Creating a pipeline combining preprocessing steps (imputation and encoding) with logistic regression modeling.
model_pipeline = make_pipeline(
    preprocessor,  # Applying preprocessing steps
    model_logistic_regression  # Training logistic regression model
)

In [21]:
# Fit the model on training data
model_pipeline.fit(Xtrain, ytrain)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The pipeline diagram displayed by this cell shows the structure of the fitted Pipeline:

*   A ColumnTransformer step named columntransformer, containing two sub-pipelines:
    *   pipeline-1: SimpleImputer followed by StandardScaler.
    *   pipeline-2: SimpleImputer followed by OneHotEncoder.
*   A LogisticRegression model, applied after preprocessing.
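The fit also printed a convergence warning that suggests increasing max_iter. The sketch below shows how that parameter can be raised and how to check the number of iterations actually used. It runs on synthetic data from make_classification, and max_iter=1000 is an arbitrary illustrative value, not a setting validated on the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the bank dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)

# Allow the solver more iterations than the default before it stops
clf = LogisticRegression(max_iter=1000)
clf.fit(X_syn, y_syn)

# n_iter_ reports how many iterations the solver actually needed
print(clf.n_iter_)
```

If n_iter_ equals max_iter, the solver stopped at the limit and the warning would appear again; a value below the limit indicates convergence.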

# Model Evaluation

In [22]:
# Make predictions on the test data
model_pipeline.predict(Xtest)

array([1, 0, 0, ..., 0, 0, 1])

In [23]:
# Evaluate the model performance using accuracy_score metric
accuracy_score(ytest, model_pipeline.predict(Xtest))

0.9084618186232851

In [24]:
# Display the classification report metric which comprises recall, precision, f1 score
print(classification_report(ytest, model_pipeline.predict(Xtest)))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95      7312
           1       0.67      0.37      0.47       925

    accuracy                           0.91      8237
   macro avg       0.80      0.67      0.71      8237
weighted avg       0.90      0.91      0.90      8237
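In the report above, accuracy is 0.91 while recall for class 1 is only 0.37, meaning roughly 37% of actual subscribers in the test set are identified. The following sketch shows how precision and recall are derived, using made-up labels (y_true and y_pred are illustrative assumptions, not model output):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 4 actual subscribers; the model flags 3 customers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision: of the 3 customers flagged as subscribers, 2 really are
precision = precision_score(y_true, y_pred)

# Recall: of the 4 actual subscribers, 2 were found
recall = recall_score(y_true, y_pred)

# Accuracy: 7 of the 10 labels match
accuracy = accuracy_score(y_true, y_pred)

print(precision, recall, accuracy)
```

With an imbalanced target, accuracy is dominated by the majority class, so the class 1 row of the report is the more informative one for a campaign that wants to find subscribers.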



# Hyperparameter Tuning

In [25]:
preprocessor = make_column_transformer(
    (numerical_pipeline, numerical_features),
    (categorical_pipeline, categorical_features)
)

In [26]:
model_logistic_regression = LogisticRegression(n_jobs=-1)

In [27]:
model_pipeline = make_pipeline(
    preprocessor,
    model_logistic_regression
)

In [28]:
# Inspect the named steps of model_pipeline
model_pipeline.named_steps

{'columntransformer': ColumnTransformer(transformers=[('pipeline-1',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(strategy='median')),
                                                  ('scaler', StandardScaler())]),
                                  Index(['Age', 'Duration(Sec)', 'CC Contact Freq', 'Days Since PC',
        'PC Contact Freq'],
       dtype='object')),
                                 ('pipeline-2',
                                  Pipeline(steps=[('imputer',
                                                   SimpleImputer(strategy='most_frequent')),
                                                  ('onehot',
                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                  Index(['Job', 'Marital Status', 'Education', 'Defaulter', 'Home Loan',
        'Personal Loan', 'Communication Type', 'Last Contacted', 'Day of Week',
    

In [29]:
# Candidate values for tuning the logistic regression model's C parameter (inverse of regularization strength)
param_distribution = {
    "logisticregression__C": [0.001, 0.01, 0.1, 0.5, 1, 5, 10]
}
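The key logisticregression__C follows scikit-learn's naming convention for nested parameters: the step name, a double underscore, then the parameter name. A small sketch with a simplified pipeline (toy_pipeline is an assumption; the notebook's pipeline uses a ColumnTransformer as its first step):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name
step_names = list(toy_pipeline.named_steps)
print(step_names)

# Nested parameters are addressed as <step name>__<parameter name>
toy_pipeline.set_params(logisticregression__C=0.5)
print(toy_pipeline.named_steps["logisticregression"].C)
```

This is why the named_steps output is worth checking before writing the parameter grid: a misspelled step name makes the search fail with an invalid parameter error.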

In [30]:
# Creating a randomized search cross-validation object to search for the best hyperparameters for the model pipeline.
rand_search_cv = RandomizedSearchCV(
    model_pipeline,  # Model pipeline to be optimized
    param_distribution,  # Hyperparameter distribution to sample from
    n_iter=3,  # Number of parameter settings that are sampled
    cv=3,  # Number of folds for cross-validation
    random_state=42  # Random state for reproducibility
)

In [31]:
# Fit the model on training data
rand_search_cv.fit(Xtrain, ytrain)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [32]:
# Retrieve the best performing estimator (model) found during the randomized search cross-validation process.
rand_search_cv.best_estimator_

In [33]:
# Retrieve the mean cross-validated score of the best estimator found during the randomized search cross-validation process.
rand_search_cv.best_score_

0.9056941662113762
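With n_iter=3, only three of the seven candidate C values are sampled, and best_score_ is the highest mean cross-validated score among them. The sketch below, on synthetic data with a simplified pipeline (both assumptions), shows how cv_results_ and best_params_ can be inspected to see which candidates were tried:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the bank dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=42)

search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    {"logisticregression__C": [0.001, 0.01, 0.1, 0.5, 1, 5, 10]},
    n_iter=3,
    cv=3,
    random_state=42,
)
search.fit(X_syn, y_syn)

# One row per sampled candidate, with its mean score across the 3 folds
results = pd.DataFrame(search.cv_results_)[
    ["param_logisticregression__C", "mean_test_score", "rank_test_score"]
]
print(results)
print(search.best_params_)
```

Looking at the full table, not just best_score_, shows how much the score actually varies across the sampled values of C.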

In [34]:
# Display the classification report for the best estimator on the test data
print(classification_report(ytest, rand_search_cv.best_estimator_.predict(Xtest)))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95      7312
           1       0.67      0.37      0.47       925

    accuracy                           0.91      8237
   macro avg       0.80      0.67      0.71      8237
weighted avg       0.90      0.91      0.90      8237



# Serialization

In [35]:
# This command displays information about the installed version of the scikit-learn library.
!pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /opt/anaconda3/lib/python3.11/site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: imbalanced-learn


In [36]:
%%writefile requirements.txt
scikit-learn==1.2.2

Writing requirements.txt


The command %%writefile requirements.txt is a magic command in notebooks that writes the following lines of text to a file named requirements.txt. In this case, it writes scikit-learn==1.2.2, specifying the version of scikit-learn required for the project.
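Pinning the version matters because a model serialized with one scikit-learn version is not guaranteed to load correctly under another. As a hedged sketch, the pinned line can be compared with the installed version at runtime; the variable names here are assumptions, and the check is illustrative and not part of the notebook's workflow.

```python
from importlib.metadata import version

# The same line that the cell above writes to requirements.txt
pinned = "scikit-learn==1.2.2"
package, expected_version = pinned.split("==")

# Look up the version of the package installed in the current environment
installed_version = version(package)

if installed_version != expected_version:
    print(f"Warning: {package} {installed_version} is installed, expected {expected_version}")
else:
    print(f"{package} matches the pinned version {expected_version}")
```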

In [37]:
%%writefile train.py

import joblib
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

from sklearn.model_selection import train_test_split, RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

data_df = pd.read_csv("Bank_Telemarketing.csv")

target = 'subscribed'
numerical_features = ['Age', 'Duration(Sec)', 'CC Contact Freq', 'Days Since PC','PC Contact Freq']
categorical_features = ['Job', 'Marital Status', 'Education', 'Defaulter', 'Home Loan',
       'Personal Loan', 'Communication Type', 'Last Contacted', 'Day of Week',
       'PC Outcome']

print("Creating data subsets")

X = data_df[numerical_features + categorical_features]
y = data_df[target]

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = make_column_transformer(
    (numerical_pipeline, numerical_features),
    (categorical_pipeline, categorical_features)
)

model_logistic_regression = LogisticRegression(n_jobs=-1)

print("Estimating Best Model Pipeline")

model_pipeline = make_pipeline(
    preprocessor,
    model_logistic_regression
)

param_distribution = {
    "logisticregression__C": [0.001, 0.01, 0.1, 0.5, 1, 5, 10]
}

rand_search_cv = RandomizedSearchCV(
    model_pipeline,
    param_distribution,
    n_iter=3,
    cv=3,
    random_state=42
)

rand_search_cv.fit(Xtrain, ytrain)

print("Logging Metrics")
print(f"Mean cross-validated accuracy: {rand_search_cv.best_score_}")

print("Serializing Model")

saved_model_path = "model.joblib"

joblib.dump(rand_search_cv.best_estimator_, saved_model_path)

Writing train.py


In [38]:
# Run the script
!python train.py

Creating data subsets
Estimating Best Model Pipeline
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also 

# Test Predictions

In [39]:
# Load the saved model
saved_model = joblib.load("model.joblib")

In [40]:
# Display the structure of the saved model pipeline
saved_model

In [41]:
# Make predictions on the test data
saved_model.predict(Xtest)

array([1, 0, 0, ..., 0, 0, 1])
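The predictions match those from the in-memory pipeline earlier in the notebook, which is what a successful round trip should produce. The sketch below checks this property explicitly on synthetic data with a simplified pipeline; the file name toy_model.joblib and all variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the bank dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)
fitted = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_syn, y_syn)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(fitted, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(fitted.predict(X_syn), reloaded.predict(X_syn))
print(same_predictions)
```

Because the whole pipeline is serialized, the reloaded object applies the same imputation, scaling, and encoding before predicting, so raw feature rows can be passed to it directly.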

## Power Ahead!