# Practical Application III: Comparing Classifiers
**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section: K-Nearest Neighbors, Logistic Regression, Decision Trees, and Support Vector Machines. We will use a dataset related to marketing bank products over the telephone.


### Getting Started
Our dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data is from a Portuguese banking institution and collects the results of multiple marketing campaigns. We will use the article accompanying the dataset for more information on the data and features.

### Problem 1: Understanding the Data
How many marketing campaigns does this data represent?
> Answer: The data represents **17 marketing campaigns** (from May 2008 to November 2010) as stated in the CRISP-DM paper.

### Problem 2: Read in the Data

In [2]:
import pandas as pd
df = pd.read_csv('module17_starter/data/bank-additional.csv', sep=';')
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


### Problem 3: Understanding the Features
Examine the data description. Are any features missing or misformatted?

In [3]:
# Check for data types and missing values
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4119 non-null   int64  
 1   job             4119 non-null   object 
 2   marital         4119 non-null   object 
 3   education       4119 non-null   object 
 4   default         4119 non-null   object 
 5   housing         4119 non-null   object 
 6   loan            4119 non-null   object 
 7   contact         4119 non-null   object 
 8   month           4119 non-null   object 
 9   day_of_week     4119 non-null   object 
 10  duration        4119 non-null   int64  
 11  campaign        4119 non-null   int64  
 12  pdays           4119 non-null   int64  
 13  previous        4119 non-null   int64  
 14  poutcome        4119 non-null   object 
 15  emp.var.rate    4119 non-null   float64
 16  cons.price.idx  4119 non-null   float64
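 17  cons.conf.idx   4119 non-null   float64
 18  euribor3m       4119 non-null   float64
 19  nr.employed     4119 non-null   float64
 20  y               4119 non-null   object 
dtypes: float64(5), int64(5), object(11)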

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
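
`isnull()` reports no missing values, but the sample rows above contain the string `unknown` (for example in `housing` and `loan`), and `pdays` uses `999` as a sentinel for clients who were not previously contacted. The sketch below is one possible way to surface such placeholders. The helper name `count_placeholders` and the small `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_placeholders(frame: pd.DataFrame, token: str = "unknown") -> pd.Series:
    """Count cells equal to `token` per column; keep only columns with a hit."""
    # Cast to str so numeric columns can be compared without type errors
    counts = frame.apply(lambda col: col.astype(str).eq(token).sum())
    return counts[counts > 0]

# Hypothetical stand-in for a few rows of the bank marketing data
sample = pd.DataFrame({
    "housing": ["yes", "no", "yes", "unknown", "yes"],
    "loan": ["no", "no", "no", "unknown", "no"],
    "default": ["no", "no", "no", "no", "no"],
    "pdays": [999, 999, 999, 999, 999],
})

unknown_counts = count_placeholders(sample)
# Share of rows where pdays holds the 999 sentinel
never_contacted_share = (sample["pdays"] == 999).mean()

print(unknown_counts)
print(never_contacted_share)
```

On the real data, the same helper could be called as `count_placeholders(df)` before any encoding step.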

### Problem 4: Understanding the Task
**Business Objective**: Predict whether a client will subscribe to a term deposit based on their attributes and past campaign performance.

### Problem 5: Engineering Features
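We drop `duration` because the length of the last call is only known once the call has ended, at which point the outcome is known too, so keeping it would leak the target into the features. We then encode the categorical variables as integers.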

In [4]:
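from sklearn.preprocessing import LabelEncoder

# Remove the call duration, which is not available before a call is made
df = df.drop('duration', axis=1)

# Integer-encode every text column, including the target 'y' (no -> 0, yes -> 1)
categorical_cols = df.select_dtypes(include='object').columns
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# Separate the features from the target
X = df.drop('y', axis=1)
y = df['y']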
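
Label encoding assigns arbitrary integers to categories, which distance-based and linear models may read as an ordering. As a hedged alternative (not what this notebook uses, so the reported scores would change), the sketch below one-hot encodes the categorical columns with `ColumnTransformer`. The tiny `sample` frame and the column lists are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for a few columns of the bank marketing data
sample = pd.DataFrame({
    "age": [30, 39, 25, 38],
    "job": ["blue-collar", "services", "services", "admin."],
    "contact": ["cellular", "telephone", "telephone", "cellular"],
})

categorical = ["job", "contact"]
numeric = ["age"]

# One-hot encode text columns and standardize numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

encoded = preprocess.fit_transform(sample)
print(encoded.shape)  # 3 job levels + 2 contact levels + 1 numeric column
```

A transformer like this can be placed as the first step of a `Pipeline`, in front of the classifier, so that the encoding is fit on the training split only.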

### Problem 6: Train/Test Split

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### Problem 7: A Baseline Model
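Check the class distribution to determine the baseline. A model that always predicts the majority class ("no", encoded as 0) achieves an accuracy equal to that class's share of the data, which is the score our classifiers need to beat.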

In [6]:
y_test.value_counts(normalize=True)

0    0.890777
1    0.109223
Name: y, dtype: float64
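
The majority-class baseline can also be expressed as a model, which makes it easy to score alongside the real classifiers. This is a minimal sketch using scikit-learn's `DummyClassifier` on synthetic labels with roughly the same imbalance; `X_demo` and `y_demo` are made-up placeholders rather than the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels: 89 negatives and 11 positives, mimicking the imbalance
y_demo = np.array([0] * 89 + [1] * 11)
# The dummy model ignores the features, so a constant column is enough
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Accuracy of always predicting the majority class
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)
```

With the notebook's variables, the equivalent would be fitting on `X_train, y_train` and scoring on `X_test, y_test`.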

### Problem 8: A Simple Model
Using Logistic Regression:

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)
log_reg_acc = accuracy_score(y_test, pipeline.predict(X_test))
log_reg_acc

0.9004854368932039

### Problem 9: Score the Model
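What is the accuracy of your model?
> Answer: The logistic regression pipeline reaches a test accuracy of about **0.900**, only slightly above the majority-class baseline of about 0.891.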
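
Because the classes are imbalanced, accuracy alone can hide how few subscribers a model actually finds. The sketch below, on small hand-made label arrays (an assumption for illustration, not the notebook's predictions), shows how a confusion matrix, recall, and F1 can sit next to accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Hand-made example: 8 negatives, 2 positives, one positive is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
# Share of actual positives that the model found
recall = recall_score(y_true, y_pred)
# Harmonic mean of precision and recall for the positive class
f1 = f1_score(y_true, y_pred)

print(accuracy, recall, f1)
print(cm)
```

Here the accuracy is 0.9 even though half of the positives are missed, which is the kind of gap these metrics are meant to expose.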

### Problem 10: Model Comparisons
Train and compare all 4 classifiers.

In [8]:
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(probability=True)
}

results = []
for name, model in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', model)])
    start = time.time()
    pipe.fit(X_train, y_train)
    train_time = time.time() - start
    train_acc = pipe.score(X_train, y_train)
    test_acc = pipe.score(X_test, y_test)
    results.append([name, train_time, train_acc, test_acc])

# Collect the results in a table (pandas is already imported as pd)
pd.DataFrame(results, columns=['Model', 'Train Time', 'Train Accuracy', 'Test Accuracy'])



Unnamed: 0,Model,Train Time,Train Accuracy,Test Accuracy
0,Logistic Regression,0.014052,0.902276,0.900485
1,KNN,0.002164,0.911684,0.894417
2,Decision Tree,0.016385,1.0,0.817961
3,SVM,1.520947,0.91047,0.900485


### Problem 11: Improving the Model
Now that we have some basic models on the board, we want to try to improve these. Below, we list a few things to explore in this pursuit:
- More feature engineering and exploration.
- Hyperparameter tuning and grid search.
- Adjust your performance metric (e.g., F1-score, ROC AUC).

- Try more advanced models like Random Forest, Gradient Boosting, or XGBoost.
- Apply feature selection and hyperparameter tuning (e.g., GridSearchCV).
- Consider resampling techniques (like SMOTE) if class imbalance is significant.
- Use Lift and AUC for campaign optimization insights.
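
As one way to combine hyperparameter tuning with a metric other than accuracy, the sketch below runs `GridSearchCV` over a decision tree with `scoring="roc_auc"`. It uses synthetic data from `make_classification` with a similar class imbalance, and the parameter grid is an illustrative assumption rather than a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank marketing features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.89, 0.11], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Illustrative grid: shallower trees and larger leaves limit overfitting
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
    cv=5,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
# score() applies the same ROC AUC scorer to the refit best estimator
test_auc = search.score(X_te, y_te)
print(best_params, test_auc)
```

Constraining depth and leaf size targets the overfitting visible in the unconstrained tree's perfect training accuracy. In the notebook, the same search could wrap the scaler-plus-classifier `Pipeline`, with grid keys prefixed by the step name (for example `clf__max_depth`).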