# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

The dataset used in this assignment is derived from a series of telemarketing campaigns conducted by a Portuguese banking institution. These campaigns aimed to promote term deposit subscriptions to existing or potential clients via phone calls.

As stated in the supporting paper by Moro et al. (2014), titled “A Data-Driven Approach to Predict the Success of Bank Telemarketing”, the data captures the outcomes of 17 different marketing campaigns carried out between May 2008 and November 2010.

Each row in the dataset corresponds to a unique client contact during one of these campaigns. Various attributes related to the client’s personal information, contact timing, previous interactions, and economic indicators are included, along with the binary target variable y, which indicates whether the client subscribed to a term deposit (“yes” or “no”).

This rich dataset allows us to explore the effectiveness of these marketing campaigns and apply machine learning models to predict campaign success based on the available features.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [3]:
import pandas as pd

In [5]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [9]:
df.head()

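|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |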


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



- No explicit `NaN` values were found in the dataset.
- Several categorical features use `"unknown"` to represent missing or ambiguous information: `job`, `marital`, `education`, `default`, `housing`, and `loan`.
- `pdays` uses the sentinel value 999 to mean the client was not previously contacted, so it should not be read as an ordinary count of days.
- The `duration` column is not usable for realistic prediction, since it is only known after the call ends and leaks the target. We drop this column during preprocessing.
- Categorical features need label encoding or one-hot encoding before modeling.
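
The checks behind these observations can be sketched as follows. This is a minimal illustration on a tiny hypothetical frame (`sample` and its values are made up for the example), not the full dataset; on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Tiny hypothetical frame that mimics a few columns of the bank data.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 999, 6, 999],
})

# Count the "unknown" placeholder in each categorical column.
text_cols = ["job", "default"]
unknown_counts = (sample[text_cols] == "unknown").sum()

# Share of rows where pdays holds the 999 "not previously contacted" sentinel.
never_contacted_share = (sample["pdays"] == 999).mean()

print(unknown_counts)
print(f"pdays == 999 share: {never_contacted_share:.2f}")
```

Whether `"unknown"` is kept as its own category or treated as missing is a modeling choice; the cells below keep it as a category.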

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

The business objective of this task is to predict whether a client will subscribe to a term deposit following a telemarketing campaign. This prediction will be based on various client attributes (e.g., age, job, education), previous interactions, and economic indicators.

By accurately predicting campaign outcomes, the bank can improve the efficiency of future marketing efforts and prioritize leads that are more likely to convert.

The ultimate goal is to support the bank’s marketing team with data-driven insights that help increase the success rate of term deposit subscriptions while minimizing cost and effort.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [6]:
# Drop 'duration' as it's not usable for realistic modeling
df = df.drop('duration', axis=1)

# Map the target column 'y': 'yes' -> 1, 'no' -> 0
df['y'] = df['y'].map({'yes': 1, 'no': 0})

In [7]:
# Identify categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

In [8]:
# Define X (features) and y (target)
X = df_encoded.drop('y', axis=1)
y = df_encoded['y']

# Preview the shape of final data
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Features shape: (41188, 52)
Target shape: (41188,)


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [10]:
print(f"Training feature set: {X_train.shape}")
print(f"Test feature set: {X_test.shape}")
print(f"Training target set: {y_train.shape}")
print(f"Test target set: {y_test.shape}")

Training feature set: (32950, 52)
Test feature set: (8238, 52)
Training target set: (32950,)
Test target set: (8238,)


In [11]:
print("Target distribution in training set:")
print(y_train.value_counts(normalize=True))

print("\nTarget distribution in test set:")
print(y_test.value_counts(normalize=True))

Target distribution in training set:
y
0    0.887344
1    0.112656
Name: proportion, dtype: float64

Target distribution in test set:
y
0    0.887351
1    0.112649
Name: proportion, dtype: float64


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [12]:
# Most frequent class in y_train
baseline_class = y_train.mode()[0]
print(f"Most frequent class (baseline prediction): {baseline_class}")

Most frequent class (baseline prediction): 0


In [14]:
# Create a baseline prediction: always predict the majority class
y_pred_baseline = [baseline_class] * len(y_test)

In [15]:
from sklearn.metrics import accuracy_score, classification_report

# Baseline accuracy
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline Accuracy: {baseline_accuracy:.4f}")

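# Baseline classification report
# zero_division=0 reports 0.00 for the never-predicted class without emitting warnings
print("\nClassification Report for Baseline Model:")
print(classification_report(y_test, y_pred_baseline, zero_division=0))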

Baseline Accuracy: 0.8874

Classification Report for Baseline Model:
              precision    recall  f1-score   support

           0       0.89      1.00      0.94      7310
           1       0.00      0.00      0.00       928

    accuracy                           0.89      8238
   macro avg       0.44      0.50      0.47      8238
weighted avg       0.79      0.89      0.83      8238
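
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which keeps the baseline in the same fit/predict form as the later models. The sketch below uses small hypothetical label arrays with roughly the same 89/11 class split; the `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly the same imbalance as the bank data.
y_train_demo = np.array([0] * 89 + [1] * 11)
y_test_demo = np.array([0] * 18 + [1] * 2)

# DummyClassifier ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_demo, y_train_demo)

demo_accuracy = accuracy_score(y_test_demo, dummy.predict(X_test_demo))
print(f"Dummy baseline accuracy: {demo_accuracy:.2f}")
```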





### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [19]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the model
logreg = LogisticRegression(max_iter=10000, random_state=42)
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred_logreg = logreg.predict(X_test)

### Problem 9: Score the Model

What is the accuracy of your model?

In [20]:
# Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Accuracy
logreg_accuracy = accuracy_score(y_test, y_pred_logreg)
print(f"Logistic Regression Accuracy: {logreg_accuracy:.4f}")

# Classification report
print("\nClassification Report for Logistic Regression:")
print(classification_report(y_test, y_pred_logreg))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logreg))

Logistic Regression Accuracy: 0.9020

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.91      0.99      0.95      7310
           1       0.70      0.23      0.34       928

    accuracy                           0.90      8238
   macro avg       0.81      0.61      0.64      8238
weighted avg       0.89      0.90      0.88      8238

Confusion Matrix:
[[7222   88]
 [ 719  209]]
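
The figures in the classification report can be recomputed by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for labels 0 and 1. The short calculation below is only an arithmetic check on the numbers printed above.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 7222, 88
fn, tp = 719, 209

total = tn + fp + fn + tp

# Accuracy counts both classes; the other metrics focus on the positive class.
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted subscribers, how many were right
recall = tp / (tp + fn)      # of actual subscribers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The model beats the baseline accuracy by about 1.5 percentage points, yet it finds fewer than a quarter of the actual subscribers, which is why accuracy alone is a weak yardstick here.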


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [21]:
from sklearn.neighbors import KNeighborsClassifier
import time

# Initialize the model
knn = KNeighborsClassifier()

# Record training time
start_time = time.time()
knn.fit(X_train, y_train)
train_time_knn = time.time() - start_time

# Calculate accuracies
train_acc_knn = knn.score(X_train, y_train)
test_acc_knn = knn.score(X_test, y_test)

# Show results
print(f"KNN Train Time: {train_time_knn:.4f} seconds")
print(f"KNN Train Accuracy: {train_acc_knn:.4f}")
print(f"KNN Test Accuracy: {test_acc_knn:.4f}")

KNN Train Time: 0.0127 seconds
KNN Train Accuracy: 0.9122
KNN Test Accuracy: 0.8934


In [22]:
from sklearn.tree import DecisionTreeClassifier
import time

# Initialize the model
dtree = DecisionTreeClassifier(random_state=42)

# Record training time
start_time = time.time()
dtree.fit(X_train, y_train)
train_time_dtree = time.time() - start_time

# Calculate accuracies
train_acc_dtree = dtree.score(X_train, y_train)
test_acc_dtree = dtree.score(X_test, y_test)

# Show results
print(f"Decision Tree Train Time: {train_time_dtree:.4f} seconds")
print(f"Decision Tree Train Accuracy: {train_acc_dtree:.4f}")
print(f"Decision Tree Test Accuracy: {test_acc_dtree:.4f}")

Decision Tree Train Time: 0.1181 seconds
Decision Tree Train Accuracy: 0.9954
Decision Tree Test Accuracy: 0.8403


In [23]:
from sklearn.svm import SVC
import time

# Initialize the model
svm = SVC(random_state=42)

# Record training time
start_time = time.time()
svm.fit(X_train, y_train)
train_time_svm = time.time() - start_time

# Calculate accuracies
train_acc_svm = svm.score(X_train, y_train)
test_acc_svm = svm.score(X_test, y_test)

# Show results
print(f"SVM Train Time: {train_time_svm:.4f} seconds")
print(f"SVM Train Accuracy: {train_acc_svm:.4f}")
print(f"SVM Test Accuracy: {test_acc_svm:.4f}")

SVM Train Time: 6.3299 seconds
SVM Train Accuracy: 0.8975
SVM Test Accuracy: 0.8977


| Model               | Train Time (s) | Train Accuracy | Test Accuracy |
| ------------------- | -------------- | -------------- | ------------- |
| Logistic Regression | not recorded   | not recorded   | 0.9020        |
| KNN                 | 0.0127         | 0.9122         | 0.8934        |
| Decision Tree       | 0.1181         | 0.9954         | 0.8403        |
| SVM                 | 6.3299         | 0.8975         | 0.8977        |
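
The three cells above repeat the same timing and scoring steps. One way to collect the results directly into a `DataFrame`, including Logistic Regression, is a single loop over the models. The sketch below runs on synthetic data from `make_classification` (an assumption made so the example is self-contained); on the notebook's data, `X_train`, `X_test`, `y_train`, and `y_test` would take the place of the `_demo` arrays.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.89, 0.11], random_state=42
)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}

rows = []
for name, model in models.items():
    # perf_counter is a monotonic clock intended for timing short intervals.
    start = time.perf_counter()
    model.fit(Xtr_demo, ytr_demo)
    fit_seconds = time.perf_counter() - start
    rows.append({
        "Model": name,
        "Train Time (s)": fit_seconds,
        "Train Accuracy": model.score(Xtr_demo, ytr_demo),
        "Test Accuracy": model.score(Xte_demo, yte_demo),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results.round(4))
```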

### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

In [24]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid_knn = {'n_neighbors': list(range(3, 16))}

# Grid search
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='f1', n_jobs=-1)
grid_knn.fit(X_train, y_train)

# Best params and score
print("Best parameters for KNN:", grid_knn.best_params_)
print("Best F1 Score (CV):", grid_knn.best_score_)

Best parameters for KNN: {'n_neighbors': 5}
Best F1 Score (CV): 0.3622981700139366


In [25]:
param_grid_tree = {
    'max_depth': [3, 5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid_tree = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_tree, cv=5, scoring='f1', n_jobs=-1)
grid_tree.fit(X_train, y_train)

print("Best parameters for Decision Tree:", grid_tree.best_params_)
print("Best F1 Score (CV):", grid_tree.best_score_)

Best parameters for Decision Tree: {'max_depth': 10, 'min_samples_split': 2}
Best F1 Score (CV): 0.3690642005255639


In [26]:
from sklearn.metrics import f1_score

# Pick best model (example: tuned Decision Tree)
best_tree = grid_tree.best_estimator_
y_pred_best_tree = best_tree.predict(X_test)

print("F1 Score on Test Set (Best Decision Tree):", f1_score(y_test, y_pred_best_tree))

F1 Score on Test Set (Best Decision Tree): 0.3994232155731795


In [27]:
# Train logistic regression with class_weight='balanced'
logreg_balanced = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
logreg_balanced.fit(X_train, y_train)

# Predict on test set
y_pred_logreg_balanced = logreg_balanced.predict(X_test)

# Evaluate performance
from sklearn.metrics import classification_report, accuracy_score, f1_score

print(f"Balanced Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_logreg_balanced):.4f}")
print(f"Balanced Logistic Regression F1 Score: {f1_score(y_test, y_pred_logreg_balanced):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_logreg_balanced))

Balanced Logistic Regression Accuracy: 0.8297
Balanced Logistic Regression F1 Score: 0.4593

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.85      0.90      7310
           1       0.36      0.64      0.46       928

    accuracy                           0.83      8238
   macro avg       0.65      0.75      0.68      8238
weighted avg       0.88      0.83      0.85      8238



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
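
The warning above means the solver stopped at `max_iter` before converging, and it suggests scaling the data as one remedy. A minimal sketch of that suggestion, assuming a `StandardScaler` placed in front of the model in a pipeline, is shown below on synthetic data; one column is deliberately inflated to imitate large-valued features such as `nr.employed`. This was not run on the bank data, so the scores reported above would likely change if it were adopted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.89, 0.11], random_state=42
)
# Inflate one column so the features sit on very different scales.
X_demo[:, 0] = X_demo[:, 0] * 1000 + 5000

# Scaling happens inside the pipeline, so it is fit on training data only.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)
scaled_logreg.fit(X_demo, y_demo)

# n_iter_ holds the number of solver iterations actually used.
iterations_used = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
print(f"Solver iterations used: {iterations_used}")
```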


In [28]:

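# Compute each F1 score from the earlier predictions instead of hard-coding it
f1_baseline = f1_score(y_test, y_pred_baseline, zero_division=0)
f1_original = f1_score(y_test, y_pred_logreg)
f1_balanced = f1_score(y_test, y_pred_logreg_balanced)

print(f"Baseline Model F1 Score: {f1_baseline}")
print(f"Logistic Regression (Original) F1 Score: {f1_original:.2f}")
print(f"Logistic Regression (Balanced) F1 Score: {f1_balanced:.4f}")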

Baseline Model F1 Score: 0.0
Logistic Regression (Original) F1 Score: 0.34
Logistic Regression (Balanced) F1 Score: 0.4593




To improve our model’s performance and better capture the minority class (`yes` for term deposit), we explored several enhancements:

---

## 1. Hyperparameter Tuning
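We used `GridSearchCV` (5-fold cross-validation, F1 scoring) to tune:
- **KNN**: Number of neighbors, searched from 3 to 15; the best value was `n_neighbors=5` (CV F1 ≈ 0.36)
- **Decision Tree**: Depth and split configuration; the best setting was `max_depth=10`, `min_samples_split=2` (CV F1 ≈ 0.37, test F1 ≈ 0.40)
- **SVM**: Regularization (`C`) and kernel choice (no results for this search appear in the cells above)

The **Decision Tree** had the most room to gain from tuning: with default settings it reached 0.9954 training accuracy but only 0.8403 test accuracy, a clear sign of overfitting.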
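
The list above names an SVM search over `C` and kernel. A search of that kind might look like the sketch below. The grid values, the scaling step, and the synthetic data are all assumptions made for illustration, and the small sample size is chosen only so the example runs quickly; `SVC` training time grows quickly with the number of rows, so the same grid on the full training set would take far longer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the bank data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=42
)

# SVMs are sensitive to feature scale, so scaling is part of the pipeline.
svm_pipeline = make_pipeline(StandardScaler(), SVC(random_state=42))

# Hypothetical grid; parameter names are prefixed with the pipeline step name.
param_grid_svm = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

grid_svm = GridSearchCV(svm_pipeline, param_grid_svm, cv=3, scoring="f1", n_jobs=1)
grid_svm.fit(X_demo, y_demo)

print("Best parameters for SVM:", grid_svm.best_params_)
print("Best F1 Score (CV):", grid_svm.best_score_)
```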

---

## 2. Addressing Class Imbalance
The dataset is highly imbalanced: about 89% of clients in both the training and test sets did not subscribe to a term deposit.

To address this:
- We retrained **Logistic Regression** using `class_weight='balanced'`
- This raised the test **F1-score** (a better measure of performance in imbalanced classification) from 0.34 to 0.46 and recall on subscribers from 0.23 to 0.64, while accuracy fell from 0.90 to 0.83

| Model                          | F1 Score |
| ------------------------------ | -------- |
| Baseline Model                 | 0.00     |
| Logistic Regression (Default)  | 0.34     |
| Logistic Regression (Balanced) | 0.46     |
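
To see what `class_weight='balanced'` does, the weights can be worked out by hand. scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`. The class counts below are derived from the training set size (32950) and the positive proportion (0.112656) printed earlier, so treat them as reconstructed rather than read directly from the data.

```python
# Training set size and positive-class share reported after the split.
n_samples = 32950
n_positive = round(n_samples * 0.112656)
n_negative = n_samples - n_positive
n_classes = 2

# "balanced" weighting: each class gets n_samples / (n_classes * class_count).
weight_negative = n_samples / (n_classes * n_negative)
weight_positive = n_samples / (n_classes * n_positive)

print(f"negatives={n_negative}, positives={n_positive}")
print(f"weight for class 0: {weight_negative:.4f}")
print(f"weight for class 1: {weight_positive:.4f}")
print(f"ratio: {weight_positive / weight_negative:.2f}")
```

Each subscriber therefore counts roughly eight times as much as a non-subscriber in the loss, which explains why recall rises and precision falls.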

---

## 3. Evaluation Metric

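- **F1-score**: The harmonic mean of precision and recall, used as the primary metric because accuracy is dominated by the majority class
- **Classification Report**: To understand per-class performance
- **Confusion Matrix**: To track false negatives (interested clients the model misses), which are especially important in this case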

---

## 4. Recommendation

Based on our results, we recommend:

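- Using **Logistic Regression with class weighting** as the production model: it had the highest test F1 (0.46) of the models scored on that metric and found 64% of actual subscribers
- Treating a **tuned SVM** as an alternative, provided it is first evaluated on F1 rather than accuracy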


##### Questions