# Project - 2

# PRCP-1000-PortugueseBank

### Problem Definition

The goal is to predict whether a customer will subscribe to a term deposit based on various personal and socio-economic features. The dataset is derived from marketing campaigns conducted by a Portuguese bank, where the outcome variable (y) represents whether a customer subscribes to a term deposit or not. The challenge involves working with an imbalanced dataset, where the majority of customers did not subscribe to the deposit (No), while a minority did (Yes).

## **Attribute Information** 
**The features of the dataset are explained below**
1. Age : (numeric)
2. Job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. Marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. Education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. Default: has credit in default? (categorical: 'no','yes','unknown')
6. Housing: has housing loan? (categorical: 'no','yes','unknown')
7. Loan: has personal loan? (categorical: 'no','yes','unknown')
8. Contact: contact communication type (categorical: 'cellular','telephone')
9. Month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. Day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no')
12. Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. Previous: number of contacts performed before this campaign and for this client (numeric)
15. Poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
16. Emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. Cons.price.idx: consumer price index - monthly indicator (numeric)
18. Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. Euribor3m: euribor 3 month rate - daily indicator (numeric)
20. Nr.employed: number of employees - quarterly indicator (numeric)
21. y - has the client subscribed a term deposit? (binary: 'yes','no')
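
The `Pdays` description above uses 999 as a sentinel for "not previously contacted". A minimal sketch, assuming a toy frame with hypothetical values, of how that sentinel could be separated into an explicit flag before any numeric summary is taken. This is an illustration only, not a step the notebook performs; the column names `Previously_contacted` and `Pdays_clean` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 999 means "client was not previously contacted"
toy = pd.DataFrame({'Pdays': [999, 3, 999, 6, 999]})

# Explicit 0/1 flag for clients reached in an earlier campaign
toy['Previously_contacted'] = (toy['Pdays'] != 999).astype(int)

# Replace the sentinel with NaN so it does not distort averages
toy['Pdays_clean'] = toy['Pdays'].replace(999, np.nan)

print(toy)
print('Mean with sentinel   :', toy['Pdays'].mean())
print('Mean without sentinel:', toy['Pdays_clean'].mean())
```

With the sentinel left in place, the mean is dominated by the 999 entries rather than by real day counts.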

## Business Goal
**The goal is to build and optimize classification models that can accurately predict whether a customer will subscribe to a term deposit (binary outcome: Yes or No) while handling the imbalance in the data to improve predictions for the minority class (Yes).**

In [None]:
## Importing Libraries
import numpy as np
import pandas as pd
from scipy import stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
## Loading Dataset
data=pd.read_csv('PortugeseBank.csv')

In [None]:
pd.set_option('display.max_columns',None)

In [None]:
## This profiling report will provide complete info about the data
from ydata_profiling import ProfileReport
Profile = ProfileReport(data, title = 'Portuguese Bank Report')
Profile.to_notebook_iframe()

## Basic Checks

In [None]:
data ## Checking head and tail

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe() ## Summary Statistics

In [None]:
data.describe(include='O')

# EDA - Exploratory Data Analysis

In [None]:
## Splitting numerical and categorical features
data_cat=data[['Job', 'Martial', 'Education', 'Default', 'Housing', 'Loan', 'Contact',
       'Month', 'Day_of_week', 'Poutcome', 'y']]
data_num=data[['Age', 'Duration', 'Campaign', 'Pdays', 'Previous', 'Emp.var.rate',
       'Cons.price.idx', 'Cons.conf.idx', 'Euribor3m', 'Nr.employed']]

## Univariate Analysis

**Numerical Features**

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
for column in data_num:
    if plotnumber<=16:
        ax = plt.subplot(5,2,plotnumber)
        sns.histplot(x=data_num[column], kde=True, color='k')  # distplot is deprecated in recent seaborn versions
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

**Categorical Features**

In [None]:
## Checking count of categorical features
value_counts = {col: data_cat[col].value_counts() for col in data_cat.columns}
for col, counts in value_counts.items():
    print(f"Value counts for {col}:")
    print(counts)
    print()

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
for column in data_cat:
    if plotnumber<=16:
        ax = plt.subplot(6,2,plotnumber)
        sns.countplot(x=data_cat[column])
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

## Insights from univariate analysis

**Numerical Features**
* Age : The age distribution is right-skewed, with most customers falling between 30-40 years old. A few older customers exist as outliers.
* Duration : The duration of the last contact call shows that longer calls tend to lead to better outcomes, though the majority of calls are short (less than 200 seconds).
* Campaign and Previous : Most customers have been contacted a limited number of times, and few customers have been contacted through multiple campaigns.

**Categorical Features**
* Job : Common jobs include "blue-collar", "management", and "technician", with "student" and "retired" customers having a higher likelihood of subscribing.
* Marital Status : Most customers are married, with a smaller proportion being single or divorced. Single customers are more likely to subscribe than married ones.
* Education : Most customers have secondary or tertiary education. Higher education correlates with a higher subscription rate.
* Housing : A large portion of customers have housing loans.
* Default and Loan : Most customers have no credit default and no personal loan.
* Contact : More customers were contacted via cellular than via telephone.
* y(Target) : The classes no and yes are highly imbalanced with majority no values and very few positive cases.

## Bivariate Analysis

**Categorical Features VS Target**

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
for column in data_cat:
    if plotnumber<=16:
        ax = plt.subplot(6,2,plotnumber)
        sns.countplot(x=data_cat[column],hue=data_cat.y)
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

**Numerical Features VS Target**

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
for column in data_num:
    if plotnumber<=16:
        ax = plt.subplot(5,2,plotnumber)
        sns.histplot(x=data_num[column],hue=data_cat.y)
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.tight_layout()

## Insights from Bivariate Analysis
* Age vs Subscription: Older customers and younger ones (20-30) tend to subscribe more frequently than middle-aged customers (30-40), who have lower subscription rates.
* Job vs Subscription: Job types like "admin", "blue-collar", and "technician" have higher subscription rates, while other workers have lower subscription rates.
* Duration vs Subscription: Longer call durations are positively correlated with higher subscription rates. Calls lasting over 300 seconds (5 minutes) are more likely to result in a subscription.
* Campaign: The length of contact calls and previous campaign outcomes are strong predictors. Longer calls and successful prior campaigns boost the likelihood of a successful subscription.
* Education vs Subscription: Customers with higher levels of education (tertiary) tend to subscribe more frequently than those with lower levels of education (primary).
* Default and Loan vs Subscriptions : People without default and loans tend to subscribe more frequently.
* Previous Campaign Outcome vs Subscription: If the customer had a successful outcome in the previous campaign, they are much more likely to subscribe again.
* Demographic Variables: Variables like age, job, and education are strong indicators of whether a customer will subscribe. Younger, highly educated customers with professional jobs are more likely to subscribe.
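
Count plots show absolute counts, so a large category can look like a strong subscriber group even when its subscription rate is low. A minimal sketch, using a hypothetical toy frame rather than the bank data, of how the rate per category could be computed to check statements like the ones above:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    'Job': ['admin.', 'admin.', 'admin.', 'admin.',
            'student', 'student', 'retired', 'retired'],
    'y':   ['no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'no'],
})

# Per-category size, number of subscribers, and subscription rate
rates = (
    toy.assign(subscribed=toy['y'].eq('yes'))
       .groupby('Job')['subscribed']
       .agg(customers='size', subscribers='sum', rate='mean')
       .sort_values('rate', ascending=False)
)
print(rates)
```

In this toy frame `admin.` has the most customers but the lowest rate, which is the pattern a count plot alone can hide.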

# Data Preprocessing

In [None]:
## Handling Missing Values
data.isnull().sum()

**There are no missing values, so we will continue with the next steps**

In [None]:
## Duplicates
data.duplicated().sum()

In [None]:
## Dropping Duplicates
data.drop_duplicates(inplace=True)

## Outliers

In [None]:
## Creating box plot to check the outliers
plt.figure(figsize=(20,25), facecolor='white')
plotnumber=1

for column in data_num:
    if plotnumber<=16:
        ax=plt.subplot(6,2,plotnumber)
        sns.boxplot(x=data_num[column], color='k')
        plt.xlabel(column, fontsize=15)
    plotnumber+=1
plt.tight_layout()

**Outliers Treatment**

In [None]:
## Creating def function for outlier treatment
def Outliers_IQR(data,column):
    """Print the IQR fences and the outlier count and percentage for one column."""
    IQR = st.iqr(data[column],interpolation='midpoint')
    print(f'IQR: {IQR}')
    Q1 = data[column].quantile(0.25)
    print(f'The 25\u1d57\u02b0 percentile of {column} : {Q1}' )
    Q3 = data[column].quantile(0.75)
    print(f'The 75\u1d57\u02b0 percentile of {column} : {Q3}' )
    Upper_limit = Q3 + 1.5*IQR
    print(f'The Upper_limit of {column} : {Upper_limit}' )
    Lower_limit = Q1 - 1.5*IQR
    print(f'The Lower_limit of {column} : {Lower_limit}' )
    Lower_Out_data = data.loc[data[column]<Lower_limit]
    print(f'The Lower Outlier data of {column} : {len(Lower_Out_data)}')
    Upper_Out_data = data.loc[data[column]>Upper_limit]
    print(f'The Upper Outlier data of {column} : {len(Upper_Out_data)}')
    # Share of rows falling outside either fence, computed for the given column
    Outlier_Per = ((len(Lower_Out_data) + len(Upper_Out_data))/len(data))*100
    print(f'The Outlier percentage of {column} : {Outlier_Per}')

In [None]:
# Treating the extreme values of age feature
Outliers_IQR(data,'Age')
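
To make the fence arithmetic concrete, here is a small worked sketch on a hypothetical list of ages (not the bank data). It uses `numpy.percentile` with its default linear interpolation, so the quartiles may differ slightly from other interpolation choices.

```python
import numpy as np

# Hypothetical ages, already sorted for readability
ages = np.array([22, 25, 30, 31, 35, 38, 40, 42, 45, 95])

q1, q3 = np.percentile(ages, [25, 75])   # 30.25 and 41.5 for this sample
iqr = q3 - q1                            # 11.25

lower = q1 - 1.5 * iqr                   # 13.375
upper = q3 + 1.5 * iqr                   # 58.375

# Rows outside either fence count as outliers under the 1.5*IQR rule
outliers = ages[(ages < lower) | (ages > upper)]
outlier_pct = 100 * len(outliers) / len(ages)

print('Fences  :', lower, upper)
print('Outliers:', outliers, f'({outlier_pct:.1f}%)')
```

Only the value 95 falls outside the fences here, so the outlier share is one row in ten.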

**It is unrealistic for customers under 18 years to open a bank term deposit or to hold a housing or personal loan, so these records are removed.**

In [None]:
data[data['Age']<18]

In [None]:
# Filtering the dataset to include only individuals who are 18 years old or older.
data=data[data['Age']>=18]

In [None]:
# Treating the extreme values of the Campaign feature
Outliers_IQR(data,'Campaign')

In [None]:
data.loc[data['Campaign']>10,'Campaign'].value_counts()[-6:]

**Observing the above data, a lot of values lie beyond the 75th percentile, so it would make no sense to remove them all; this is visible in the box plot above as well. So we will remove only the extreme value 56.**

In [None]:
#Dropping the extreme value row in Campaign for filtering the dataset
data=data[data['Campaign']<56]

# Encoding - Handling Categorical Data

In [None]:
data_cat.columns

In [None]:
## One hot encoding for Job, Martial, Contact, Month, Day_of_week, Poutcome as they do not follow any hierarchy or rank
data = pd.get_dummies(data=data,columns=['Job','Martial','Contact','Month','Day_of_week','Poutcome'], drop_first=True)

In [None]:
## Label encoding for Education, Default, Housing, Loan
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data['Education'] = labelencoder.fit_transform(data['Education'])
data['Default'] = labelencoder.fit_transform(data['Default'])
data['Housing'] = labelencoder.fit_transform(data['Housing'])
data['Loan'] = labelencoder.fit_transform(data['Loan'])
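
`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, which is not necessarily an educational ranking. A minimal sketch on a few of the education levels listed earlier; the explicit ordering passed to `OrdinalEncoder` is an assumption made for illustration, not something taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = np.array(['university.degree', 'basic.4y', 'high.school',
                   'illiterate', 'basic.9y'])

# LabelEncoder: codes follow the alphabetical order of the labels
le = LabelEncoder().fit(levels)
print(dict(zip(le.classes_, le.transform(le.classes_).tolist())))

# OrdinalEncoder with an explicit, assumed ranking (lowest to highest)
assumed_order = ['illiterate', 'basic.4y', 'basic.9y',
                 'high.school', 'university.degree']
oe = OrdinalEncoder(categories=[assumed_order])
codes = oe.fit_transform(levels.reshape(-1, 1)).ravel()
print(dict(zip(levels.tolist(), codes.tolist())))
```

With the alphabetical codes, `illiterate` receives a higher code than `high.school`; whether that matters depends on the model, since tree-based models are less sensitive to the code order than linear ones.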

In [None]:
## Manual Encoding for dependent variable y
data['y'] = data['y'].map({'yes': 1, 'no': 0})  # only the target still holds 'yes'/'no' strings at this point

In [None]:
data

# Feature Selection

In [None]:
## Checking correlation
plt.figure(figsize=(20,10))
sns.heatmap(data_num.corr(),annot=True,fmt='.1g',xticklabels=data_num.columns.values,yticklabels=data_num.columns.values,cmap="YlGnBu",cbar=True)
plt.show()


**The Duration feature highly affects the output target. However, the duration is not known before a call is performed, and once the call ends y is obviously known. This input is therefore useful only for benchmark purposes and will be dropped.**

In [None]:
## Dropping Duration column
data.drop('Duration', axis=1, inplace=True)

In [None]:
data

In [None]:
## Final data
Final_data=data.copy()

# Model Creation

In [None]:
## Assigning Independent and Dependent Variable
X=data.drop('y', axis=1)
y=data['y']

In [None]:
y

In [None]:
## Splitting training and testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3, random_state=10)

In [None]:
## Scaling
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()

In [None]:
## Columns to scale; the scaler is fitted on the training split only
scale_cols = ['Age','Campaign','Pdays','Emp.var.rate','Cons.price.idx','Cons.conf.idx','Euribor3m','Nr.employed']
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])
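
Min-max scaling maps each value to `(x - min) / (max - min)`, with `min` and `max` learned from the training split only. A small sketch with hypothetical ages (the names `train_age`, `test_age` and `scaler_demo` are illustrative) shows one consequence: a test value outside the training range lands outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits
train_age = np.array([[20.0], [30.0], [40.0], [60.0]])
test_age = np.array([[25.0], [70.0]])

# min=20 and max=60 are learned from the training data only
scaler_demo = MinMaxScaler().fit(train_age)

train_scaled = scaler_demo.transform(train_age).ravel()
test_scaled = scaler_demo.transform(test_age).ravel()

print('Train scaled:', train_scaled)
print('Test scaled :', test_scaled)   # 70 exceeds the training max, so it maps above 1
```

Fitting on the training split and only transforming the test split keeps test-set statistics out of the preprocessing.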

In [None]:
## Smote - Balancing data
from imblearn.over_sampling import SMOTE
smote = SMOTE() ## object creation

In [None]:
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

In [None]:
from collections import Counter
print("Actual Classes",Counter(y_train))
print("SMOTE Classes",Counter(y_train_smote))
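
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified illustration of that single interpolation step in NumPy, with hypothetical 2-D points and a hypothetical helper `synthesize`; it is not the `imblearn` implementation, which also handles the neighbour search and the number of samples to generate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(point, neighbour, rng):
    """Return a random point on the segment between point and neighbour."""
    gap = rng.random()                      # uniform in [0, 1)
    return point + gap * (neighbour - point)

# Nearest minority neighbour of the first point (Euclidean distance)
dists = np.linalg.norm(minority[1:] - minority[0], axis=1)
neighbour = minority[1:][np.argmin(dists)]

new_point = synthesize(minority[0], neighbour, rng)
print('Neighbour      :', neighbour)
print('Synthetic point:', new_point)
```

Because every synthetic point lies between two existing minority points, SMOTE should be applied to the training split only, as the notebook does.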

# Model Building 

In [None]:
## Importing Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix, recall_score

In [None]:
def predict(Class_Model): ## Defining a function
    print(f'IMBALANCE DATA') 
    print(f'Model Name : {Class_Model}') ## Object Creation
    model = Class_Model.fit(X_train, y_train) ## Training the data
    print(f'Training score : {model.score(X_train,y_train)}') ## Training data score
    y_predict = model.predict(X_test) ## Predicting test data
    print(f' Predictions are : {y_predict}') ## Predicted data
    print('\n')
    print(f'MODEL EVALUATION')
    conf_matrix= confusion_matrix(y_test, y_predict)
    print(f'confusion matrix')
    print(conf_matrix)
    print(f'Classification Report')
    print(classification_report(y_test,y_predict))
    print(f'f1_score: {f1_score(y_test,y_predict)}')
    print('\n')
    print(f'SMOTE DATA')
    print(f'Model Name : {Class_Model}')
    model = Class_Model.fit(X_train_smote, y_train_smote)
    print(f'Training score : {model.score(X_train_smote, y_train_smote)}')
    y_predict_Smote = model.predict(X_test)
    print(f' Predictions are : {y_predict_Smote}')
    print('\n')
    print(f'MODEL EVALUATION')
    conf_matrix= confusion_matrix(y_test, y_predict_Smote)
    print(f'Confusion matrix')
    print(conf_matrix)
    print(f'Classification Report')
    print(classification_report(y_test,y_predict_Smote))
    print(f'f1_score: {f1_score(y_test,y_predict_Smote)}')
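
The classification report printed by `predict` derives its numbers from the confusion matrix. A minimal sketch with hypothetical counts (not results from this notebook) makes the relationship explicit for the positive class:

```python
# Hypothetical confusion-matrix counts, laid out as scikit-learn prints them:
# rows = actual class, columns = predicted class
tn, fp = 900, 100   # actual 'no'
fn, tp = 60, 40     # actual 'yes'

precision = tp / (tp + fp)            # share of predicted 'yes' that are correct
recall = tp / (tp + fn)               # share of actual 'yes' that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.3f} recall={recall:.3f} '
      f'f1={f1:.3f} accuracy={accuracy:.3f}')
```

Accuracy is driven mostly by the large `tn` count, while recall and F1 depend on how the few positive cases are handled.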

In [None]:
## Logistic Regression
predict(LogisticRegression())

In [None]:
## Support Vector Classifier
predict(SVC())

In [None]:
## Decision Tree Classifier
predict(DecisionTreeClassifier())

In [None]:
## Random Forest Classifier
predict(RandomForestClassifier())

In [None]:
## Bagging Classifier
predict(BaggingClassifier())

In [None]:
## Gradient Boosting Classifier
predict(GradientBoostingClassifier())

In [None]:
Models=[]
Models.append(('LR',LogisticRegression()))
Models.append(('SVM',SVC()))
Models.append(('DT',DecisionTreeClassifier()))
Models.append(('RF',RandomForestClassifier()))
Models.append(('BC',BaggingClassifier()))
Models.append(('GB',GradientBoostingClassifier()))
Models

In [None]:
## Checking Cross validation scores
from sklearn.model_selection import KFold, cross_val_score
my_cv = []
my_names = []

for name, model in Models:
    cv = cross_val_score(model,X_train_smote,y_train_smote,cv=10,scoring='f1')
    my_names.append(name)
    my_cv.append(cv)
    scores = ('%s %f (%f)' % (name, cv.mean(), cv.std()))
    print(scores)

# Hyperparameter Tuning

After creating multiple classification models, both RandomForest and GradientBoosting performed well in their base models, so we will tune both the models with different Hyperparameters.

**Hyperparameter tuning for RandomForest**

In [None]:
## Importing randomizedsearchCV
from sklearn.model_selection import RandomizedSearchCV

In [None]:
## Creating dictionary for Parameters
param_distributions = {
    'n_estimators': [100,120,150,180,200,240],
    'max_depth': [5,10,15,20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':[2,4,6,12,15,20],
    'max_features':['sqrt','log2'],
    'criterion': ['gini', 'entropy'],'bootstrap': [True, False]  
}
# Initialize the model
rf = RandomForestClassifier()

# RandomizedSearchCV
random_search = RandomizedSearchCV(rf, param_distributions, n_iter=100, scoring='f1', cv=5, random_state=42, n_jobs=-1)
random_search.fit(X_train_smote, y_train_smote)

# Best hyperparameters
print("Best Parameters:", random_search.best_params_)
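
To see why a randomized search is used here rather than an exhaustive grid, it helps to count the candidates. The sketch below copies the Random Forest search space from the cell above (under the illustrative name `rf_grid`) and compares its size with `n_iter=100` and `cv=5`.

```python
from math import prod

# Same search space as the Random Forest cell above
rf_grid = {
    'n_estimators': [100, 120, 150, 180, 200, 240],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 4, 6, 12, 15, 20],
    'max_features': ['sqrt', 'log2'],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
}

n_combinations = prod(len(values) for values in rf_grid.values())
n_iter, cv_folds = 100, 5

coverage = n_iter / n_combinations    # fraction of the grid that is sampled
total_fits = n_iter * cv_folds        # cross-validation fits, excluding the final refit

print(f'{n_combinations} combinations, {coverage:.1%} sampled, {total_fits} CV fits')
```

An exhaustive grid over the same space with 5 folds would need `n_combinations * cv_folds` fits, which is why only a sample is evaluated.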

In [None]:
predict(RandomForestClassifier(n_estimators=100, min_samples_split=5,min_samples_leaf=2,max_depth=20,max_features='sqrt',bootstrap=False,random_state=42,criterion='entropy'))

**Hyperparameter tuning for GradientBoosting**

In [None]:
## Creating dictionary for Parameters
param_distributions = {
    'n_estimators': [100,120,150,180,200,240],
    'max_depth': [5,10,15,20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':[2,4,6,12,15,20],
    'max_features':['sqrt','log2'],
    'loss':['log_loss','exponential'],
    'learning_rate':[0.1,0.001,0.0001,0.02],
    'criterion':['friedman_mse','squared_error']
}
# Initialize the model
GB = GradientBoostingClassifier()

# RandomizedSearchCV
random_search = RandomizedSearchCV(GB, param_distributions, n_iter=50, scoring='f1', cv=5, random_state=42, n_jobs=-1)
random_search.fit(X_train_smote, y_train_smote)

# Best hyperparameters
print("Best Parameters:", random_search.best_params_)

In [None]:
predict(GradientBoostingClassifier(n_estimators=120, min_samples_split=10,min_samples_leaf=2,max_depth=20,learning_rate=0.1,max_features='log2',loss='log_loss',criterion='friedman_mse'))

## Applying Recursive Feature Elimination(RFE) with cross-fold evaluation

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.feature_selection import RFE

In [None]:
## Creating def function for elimination_crossval
def elimination_crossval(model):

    # RFE keeps the 10 features ranked most important by a Random Forest
    rfe=RFE(estimator=RandomForestClassifier(),n_features_to_select=10)

    # Creating pipeline to avoid data leakage: RFE is re-fitted on each training fold
    pipeline=Pipeline(steps=[('s',rfe),('m',model)])

    cv=RepeatedStratifiedKFold(n_splits=10,n_repeats=3,random_state=1)

    scores =cross_val_score(pipeline,X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)
    print('Accuracy for model with cross val: %.3f (%.3f)' % (np.mean(scores)*100, np.std(scores)*100))

    # Fitting the pipeline on the full SMOTE training data
    fitted_model=pipeline.fit(X_train_smote,y_train_smote)

    # The pipeline applies the fitted RFE to X_test before predicting
    y_preds=fitted_model.predict(X_test)

    # Printing the classification report
    print(classification_report(y_test,y_preds))

In [None]:
elimination_crossval(LogisticRegression(max_iter=7600))

In [None]:
elimination_crossval(GradientBoostingClassifier())
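
As a self-contained illustration of the same idea, the sketch below runs RFE inside a `Pipeline` on synthetic data from `make_classification` (a stand-in for the bank data). Because the selector is a pipeline step, it is re-fitted on each training fold during cross-validation. Logistic Regression is used as the RFE estimator here only to keep the example fast, whereas the notebook uses a Random Forest; the names `X_demo`, `y_demo` and `pipe` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 10 features, 4 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

pipe = Pipeline(steps=[
    ('select', RFE(estimator=LogisticRegression(max_iter=1000),
                   n_features_to_select=4)),
    ('model', LogisticRegression(max_iter=1000)),
])

# Recall is scored here because the report focuses on the positive class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X_demo, y_demo, scoring='recall', cv=cv)
print('Recall per fold:', np.round(scores, 3))

# Fit once on all data to inspect which feature indices were kept
pipe.fit(X_demo, y_demo)
selected = np.flatnonzero(pipe.named_steps['select'].support_)
print('Selected feature indices:', selected)
```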

## **Data Analysis Report**

#### **1. Introduction**
 
   The purpose of the analysis is to understand how the bank can run a marketing campaign that brings customers on board with term deposits.


#### **2. Data Overview**

 - Number of rows: 41188
 - Number of columns: 21
 - Features: Age, Duration, Campaign, Pdays, Previous, Emp.var.rate, Cons.price.idx, Cons.conf.idx, Euribor3m, Nr.employed, Job, Martial, Education, Default, Housing, Loan, Contact, Month, Day_of_week, Poutcome, y (Target)
 - Target Variable : y (Bank Term Deposit Subscription).

#### **3. Data Preprocessing and Feature Engineering**
   
- **Handling Missing Values** : The dataset contains no missing values, ensuring data completeness and consistency.
- **Handling categorical data** : For the categorical features (Job, Marital, Education, Default, Housing, Loan, Contact, Month, Day_of_week, Poutcome, and the target y), a combination of one-hot encoding and label encoding was applied based on real-world hierarchies and domain understanding. One-hot encoding was used for features without inherent order: Job, Marital, Contact, Month, Day_of_week and Poutcome. Label encoding was applied to Education, Default, Housing and Loan, which were treated as ordinal, to improve model interpretability and performance. The target variable (y) was encoded manually. This approach aims to capture both nominal and ordinal relationships.
- **Outliers** : Handling outliers was important for improving model accuracy, but extreme values can also represent valid customer behaviour, so each case was analysed before removal. The data includes records of customers under 18, which is unrealistic for opening a bank term deposit or obtaining housing or personal loans, so these records were removed to keep the data relevant. In the Campaign feature, a lot of data lies beyond the 75th percentile, so it would make no sense to remove all of those values; this is also visible in the box plot. Only the extreme value 56 was removed.
- **Feature Transformation** : MinMax scaling was applied to the features Age, Campaign, Pdays, Emp.var.rate, Cons.price.idx, Cons.conf.idx, Euribor3m, and Nr.employed because the data was not normally distributed, making standard scaling less appropriate. MinMaxScaler scales the features to a range of 0 to 1, preserving the shape of the original distribution while normalizing the feature values. This ensures that no feature disproportionately influences the model due to different scales.

#### **4. Exploratory Data Analysis (EDA)**

- **Age vs Subscription**: Older customers and younger ones (20-30) tend to subscribe more frequently than middle-aged customers (30-40), who have lower subscription rates.
- **Job vs Subscription**: Job types like "admin", "blue-collar", and "technician" have higher subscription rates, while other workers have lower subscription rates.
- **Duration vs Subscription**: Longer call durations are positively correlated with higher subscription rates. Calls lasting over 300 seconds (5 minutes) are more likely to result in a subscription.
- **Education vs Subscription**: Customers with higher levels of education (tertiary) tend to subscribe more frequently than those with lower levels of education (primary).
- **Previous Campaign Outcome vs Subscription**: If the customer had a successful outcome in the previous campaign, they are much more likely to subscribe again.
- **Default and loan vs Subscription**: People without loans tend to subscribe more frequently.
- **Demographic Variables**: Variables like age, job, and education are strong indicators of whether a customer will subscribe. Younger, highly educated customers with professional jobs are more likely to subscribe.

## **Overview of Models Evaluated**

#### Logistic Regression

Performance Metrics:
- Accuracy Score: 0.81
- Precision: 0.30
- Recall: 0.58
- F1 Score: 0.39
  
#### Support Vector Classifier (SVC)

Performance Metrics:
- Accuracy Score: 0.83
- Precision: 0.34
- Recall: 0.59
- F1 Score: 0.43

#### Decision Tree Classifier

Performance Metrics:
- Accuracy Score: 0.84
- Precision: 0.30
- Recall: 0.37
- F1 Score: 0.33

#### Random Forest Classifier

Performance Metrics:
- Accuracy Score: 0.88
- Precision: 0.46
- Recall: 0.40
- F1 Score: 0.42

#### Bagging Classifier

Performance Metrics:
- Accuracy Score: 0.88
- Precision: 0.44
- Recall: 0.34
- F1 Score: 0.38


#### Gradient Boosting Classifier

Performance Metrics:
- Accuracy Score: 0.88
- Precision: 0.44
- Recall: 0.53
- F1 Score: 0.48
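
F1 is the harmonic mean of precision and recall, so the figures listed above can be cross-checked. The sketch below recomputes F1 from the reported two-decimal precision and recall; small differences are expected because the inputs are already rounded. The dictionary `reported` and the helper `f1_from` are names introduced for this example.

```python
# (precision, recall, reported F1) for class 1, copied from the lists above
reported = {
    'Logistic Regression': (0.30, 0.58, 0.39),
    'SVC':                 (0.34, 0.59, 0.43),
    'Decision Tree':       (0.30, 0.37, 0.33),
    'Random Forest':       (0.46, 0.40, 0.42),
    'Bagging':             (0.44, 0.34, 0.38),
    'Gradient Boosting':   (0.44, 0.53, 0.48),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, (p, r, f1) in reported.items():
    print(f'{name:<20} reported F1={f1:.2f}  recomputed F1={f1_from(p, r):.3f}')

# Model with the highest reported F1
best = max(reported, key=lambda name: reported[name][2])
print('Highest reported F1:', best)
```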

## Model Performance Comparison

#### Best Model: 

After creating multiple classification models, both RandomForest and GradientBoosting performed well as base models, achieving the overall scores below.

1. RandomForest Classifier:
- Precision (Class 1): 0.46
- Recall (Class 1): 0.40
- Accuracy: 0.88
- F1 Score: 0.42

2. GradientBoosting Classifier:
- Precision (Class 1): 0.44
- Recall (Class 1): 0.53
- Accuracy: 0.88
- F1 Score: 0.48


## **Model Tuning Summary**

The primary goal of model tuning was to enhance the performance of the Random Forest (RF) and Gradient Boosting (GB) models, particularly focusing on improving recall for the minority class (customers likely to subscribe to term deposits).

**Tuning Process**

In this tuning process, the RandomForest and GradientBoosting classifiers were tuned with RandomizedSearchCV to improve performance on the minority class (class 1). The effect on class 1 recall (the model's ability to correctly identify positive instances) was mixed: for RandomForest, recall increased slightly from 40% to 43%, while for GradientBoosting it decreased from 53% to 37%, because the search optimised F1, a balance of precision and recall, rather than recall alone.

To further improve the results, Recursive Feature Elimination (RFE) was applied with Logistic Regression and Gradient Boosting, with the goal of enhancing the recall for class 1. This method yielded a significant improvement, achieving a recall of 70% for class 1 with Logistic Regression, although at the cost of some overall accuracy.

**Conclusion**

The tuning efforts for RandomForest and GradientBoosting did not deliver a meaningful gain in recall for class 1 (a marginal improvement for RandomForest and a decline for GradientBoosting), and both models still struggled to correctly predict the minority class. By switching to Logistic Regression combined with Recursive Feature Elimination (RFE), a much better recall was achieved for class 1, improving from 58% to 70% with an accuracy of 75%. This suggests that RFE with Logistic Regression can be a more effective approach when the primary objective is to improve recall for the minority class, especially in an imbalanced dataset.

## **Suggestions to the bank's marketing team to help customers buy the product**

The most important features which the bank should focus on to attract more customers to buy term deposit are:
- Duration
- Age
- Campaign
- Euribor3m
- Nr.employed


1. Duration is one of the most influential factors: the higher the call duration, the higher the chances of a sale. So the bank should focus on enhancing the quality of calls by building a rapport with customers, decreasing wait time, checking in with them, and, most importantly, taking feedback from them.

2. The Age feature demonstrates that most of the term deposit purchasing capacity lies within the 25-58 age group. So the bank should target this age group more and allocate more resources to bringing in customers from this particular age group.

3. The Campaign feature is important as it indicates the number of calls made during the current campaign. Customers do not like to be bothered with too many calls, so a sweet spot lies within 1-5 calls, again depending on the interest of the customer. So the bank should focus on training the sales team to distinguish interested from non-interested customers based on the behavior, voice modulation, tone, and pitch of the customer.

4. Euribor3m is indicative of the trend that higher interest rates attract more customers. So there are two things the bank can pursue:
   - Target in particular the age group that is liable to get higher interest rates (4.5-5).
   - Increase the marketing campaign when interest rates are higher, which can help in bringing more clients on board with term deposits.

5. The Nr.employed trend indicates that a larger number of employees leads to a larger number of customers, which makes sense because with more employees, more leads can be targeted and proper follow-ups and check-ins can be done. In addition, customer satisfaction could be improved by creating a dedicated after-sales team. So the bank should focus on hiring more people.

## **Report on Challenges faced**

**1.Imbalanced Dataset**:

- Challenges : The dataset exhibited significant class imbalance, with a predominant number of 'No' responses compared to 'Yes' for term deposits. This imbalance led to models that performed well on accuracy but failed to adequately predict the minority class.
- Solution : Implementing oversampling techniques like SMOTE effectively balanced the classes, allowing for improved model training and better identification of the minority class.

**2.Model Performance Variability**:

- Challenges : Different models, including Logistic Regression, Random Forest, and Gradient Boosting, demonstrated varying performance metrics, particularly in terms of recall for the positive class. Initial iterations yielded low recall rates for potential customers.
- Solution : Utilizing advanced metrics such as F1-score and precision-recall curves provided a clearer evaluation of model performance, ensuring a focus on improving recall for the positive class.

**3.Hyperparameter Tuning**:

- Challenges : Tuning hyperparameters for complex models like Random Forest and Gradient Boosting was time-consuming and resource-intensive. Despite various tuning attempts, improvements were minimal, leading to a need for an effective strategy to optimize model performance without excessive computational load.
- Solution : Using more efficient search strategies like RandomizedSearchCV, along with systematic evaluation of a reduced set of hyperparameters, streamlined the tuning process and reduced computational load.

**4.Feature Selection Complexity**:

- Challenges : Identifying relevant features that contributed significantly to model performance was challenging. Implementing Recursive Feature Elimination (RFE) revealed that while some features were beneficial, others introduced noise, complicating the model training process.
- Solution : Combining RFE with domain knowledge helped in selecting impactful features while reducing noise, leading to enhanced model performance.

**5.Evaluation Metrics Confusion**:
- Challenges : The reliance on accuracy as a metric was misleading due to the imbalanced nature of the dataset. This necessitated a shift in focus toward metrics like precision, recall, and F1-score to better assess model performance, particularly in correctly identifying the minority class.
- Solution : Establishing a consistent evaluation framework centered on precision, recall, and F1-score ensured a comprehensive assessment of model effectiveness, particularly for the minority class.
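
A small sketch shows why accuracy alone misleads on imbalanced data. The 90/10 class split below is hypothetical, chosen only for illustration: a classifier that always answers "no" scores 90% accuracy while finding no subscribers.

```python
# Hypothetical labels: 10 subscribers (1) among 100 customers
y_true = [1] * 10 + [0] * 90

# A trivial "model" that predicts 'no' (0) for everyone
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f'accuracy={accuracy:.2f}  recall for class 1={recall:.2f}')
```

This is the reason the evaluation above centres on precision, recall, and F1 for class 1 rather than on accuracy.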

**6.Computational Resource Limitations**:
- Challenges : Some models, particularly ensemble methods like Gradient Boosting, required substantial computational resources, resulting in longer training times and kernel crashes. This created bottlenecks in the model development process.
- Solution : To address computational resource limitations, I relied solely on local resources without utilizing cloud computing or distributed systems. This approach led to longer training times, requiring patience while the kernel executed the model training processes.