### Overview:
In this third practical application assignment, your goal is to compare the performance of the classifiers (k-nearest neighbors, logistic regression, decision trees, and support vector machines) you encountered in this section of the program. You will use a dataset related to the marketing of bank products over the telephone.

The dataset bank-full.csv contains the following columns:

- age - Age of the client.
- job - Type of job.
- marital - Marital status.
- education - Education level.
- default - Has credit in default? (binary: "yes","no")
- balance - Average yearly balance, in euros.
- housing - Has a housing loan? (binary: "yes","no")
- loan - Has a personal loan? (binary: "yes","no")
- contact - Contact communication type.
- day - Last contact day of the month.
- month - Last contact month of the year.
- duration - Last contact duration, in seconds.
- campaign - Number of contacts performed during this campaign and for this client.
- pdays - Number of days since the client was last contacted in a previous campaign (-1 means the client was not previously contacted).
- previous - Number of contacts performed before this campaign and for this client.
- poutcome - Outcome of the previous marketing campaign.
- y - Has the client subscribed to a term deposit? (binary: "yes","no")

### Next steps:

- Preprocess the data, handling categorical variables and missing values (recorded as "unknown" in this dataset).
- Apply different classification models (k-nearest neighbors, logistic regression, decision trees, and support vector machines).
- Evaluate the performance of each model.

Let's start with preprocessing the data.

In [15]:
import pandas as pd

# Load the bank-full.csv dataset
bank_df = pd.read_csv('data/bank-full.csv', sep=';')

# Display the first few rows of the dataset
bank_df.head()


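| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |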
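
The sample rows show the string "unknown" in several categorical columns, so a check with `isna()` alone would report no missing values. The sketch below counts both kinds of gap. The `sample` frame is an illustrative stand-in built from four of the columns in the rows shown above; on the real data the same two expressions would be applied to `bank_df`.

```python
import pandas as pd

# Illustrative stand-in: four categorical columns copied from the rows shown above
sample = pd.DataFrame({
    "job": ["management", "technician", "entrepreneur", "blue-collar", "unknown"],
    "education": ["tertiary", "secondary", "secondary", "unknown", "unknown"],
    "contact": ["unknown"] * 5,
    "poutcome": ["unknown"] * 5,
})

# True NaN values per column
null_counts = sample.isna().sum()

# Placeholder "unknown" strings per column
unknown_counts = (sample == "unknown").sum()

print(null_counts)
print(unknown_counts)
```

Whether to keep "unknown" as its own category, impute it, or drop the affected rows is a modeling decision; the notebook keeps it as a category.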


In [16]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Handle categorical variables using Label Encoding
label_encoders = {}
for column in bank_df.select_dtypes(include=['object']).columns:
    if column != 'y':
        le = LabelEncoder()
        bank_df[column] = le.fit_transform(bank_df[column])
        label_encoders[column] = le

# Encode target variable 'y' with binary values
bank_df['y'] = bank_df['y'].map({'yes': 1, 'no': 0})

# Separate features and target variable
X = bank_df.drop('y', axis=1)
y = bank_df['y']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply StandardScaler to features (excluding the target variable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

(X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape)


((36168, 16), (9043, 16), (36168,), (9043,))

The data has been successfully preprocessed and split into training and testing sets. The shapes of these sets are as follows:

- Training features (X_train_scaled): 36,168 samples, 16 features
- Testing features (X_test_scaled): 9,043 samples, 16 features
- Training labels (y_train): 36,168 samples
- Testing labels (y_test): 9,043 samples
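
Label encoding assigns integer codes to nominal columns such as job, and distance-based or linear models (k-NN, logistic regression, SVM) may read those codes as an ordering. One possible alternative, not the approach used in this notebook, is one-hot encoding with scikit-learn's `ColumnTransformer`. The sketch below uses a toy frame with a subset of the columns; `toy`, `numeric_cols`, and `categorical_cols` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a subset of the dataset's columns (values from the sample rows)
toy = pd.DataFrame({
    "age": [58, 44, 33, 47, 33],
    "balance": [2143, 29, 2, 1506, 1],
    "job": ["management", "technician", "entrepreneur", "blue-collar", "unknown"],
    "marital": ["married", "single", "married", "married", "single"],
})

numeric_cols = ["age", "balance"]
categorical_cols = ["job", "marital"]

# Scale numeric columns and one-hot encode categorical ones.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

encoded = preprocess.fit_transform(toy)
print(encoded.shape)  # 2 numeric + 5 job levels + 2 marital levels
```

As with the scaler in the cell above, the transformer would be fit on the training split only and then applied to the test split with `transform`.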

Next, we will build and evaluate the performance of the following classifiers:

- k-Nearest Neighbors (k-NN)
- Logistic Regression
- Decision Trees
- Support Vector Machines (SVM)

We will use accuracy as the primary evaluation metric. Let's start by training and evaluating each classifier.
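
Accuracy is easier to interpret next to a trivial baseline: the score obtained by always predicting the most frequent class. If one class dominates the target, that baseline can already be high. A minimal sketch in plain Python follows; `majority_baseline_accuracy` and `toy_labels` are hypothetical names, and the toy labels are not drawn from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Toy example: nine negatives and one positive
toy_labels = [0] * 9 + [1]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2%}")
```

On the real data the function would be called on `y_test`. scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same reference point through the usual `fit`/`predict` interface.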

In [18]:
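# Imports needed by this cell
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Instantiate the classifiers. The cell that originally defined them is not
# shown, so default hyperparameters are assumed here.
knn = KNeighborsClassifier()
log_reg = LogisticRegression()
dt = DecisionTreeClassifier()
svm = SVC()

# Initialize result variables
accuracy_knn = accuracy_log_reg = accuracy_dt = accuracy_svm = None
report_knn = report_log_reg = report_dt = report_svm = None
knn_error = log_reg_error = dt_error = svm_error = None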

# Train and evaluate k-NN
try:
    knn.fit(X_train_scaled, y_train)
    y_pred_knn = knn.predict(X_test_scaled)
    accuracy_knn = accuracy_score(y_test, y_pred_knn)
    report_knn = classification_report(y_test, y_pred_knn)
except ValueError as e:
    knn_error = str(e)

# Train and evaluate Logistic Regression
try:
    log_reg.fit(X_train_scaled, y_train)
    y_pred_log_reg = log_reg.predict(X_test_scaled)
    accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
    report_log_reg = classification_report(y_test, y_pred_log_reg)
except ValueError as e:
    log_reg_error = str(e)

# Train and evaluate Decision Tree
try:
    dt.fit(X_train_scaled, y_train)
    y_pred_dt = dt.predict(X_test_scaled)
    accuracy_dt = accuracy_score(y_test, y_pred_dt)
    report_dt = classification_report(y_test, y_pred_dt)
except ValueError as e:
    dt_error = str(e)

# Train and evaluate SVM
try:
    svm.fit(X_train_scaled, y_train)
    y_pred_svm = svm.predict(X_test_scaled)
    accuracy_svm = accuracy_score(y_test, y_pred_svm)
    report_svm = classification_report(y_test, y_pred_svm)
except ValueError as e:
    svm_error = str(e)

# Compile results including error messages
results_with_errors = {
    'Model': ['k-NN', 'Logistic Regression', 'Decision Tree', 'SVM'],
    'Accuracy': [accuracy_knn, accuracy_log_reg, accuracy_dt, accuracy_svm],
    'Classification Report': [report_knn, report_log_reg, report_dt, report_svm],
    'Error': [knn_error, log_reg_error, dt_error, svm_error]
}

results_with_errors_df = pd.DataFrame(results_with_errors)
results_with_errors_df


| | Model | Accuracy | Classification Report | Error |
|---|---|---|---|---|
| 0 | k-NN | 0.891187 | precision recall f1-score ... | |
| 1 | Logistic Regression | 0.88798 | precision recall f1-score ... | |
| 2 | Decision Tree | 0.870176 | precision recall f1-score ... | |
| 3 | SVM | 0.896384 | precision recall f1-score ... | |


All four classifiers were trained and evaluated without raising errors, so the Error column is empty. The accuracy and classification report for each model are compiled in the table above.

Here is a summary of the accuracy for each model:

- k-Nearest Neighbors (k-NN): 89.12%
- Logistic Regression: 88.80%
- Decision Tree: 87.02%
- Support Vector Machine (SVM): 89.64%
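
Percentages in a summary like this can be derived from the accuracy values in the results table rather than typed by hand, which avoids transcription slips. A minimal sketch, where `accuracies` is an illustrative dict holding the values copied from the table:

```python
# Accuracy values copied from the results table
accuracies = {
    "k-NN": 0.891187,
    "Logistic Regression": 0.88798,
    "Decision Tree": 0.870176,
    "SVM": 0.896384,
}

# Format each accuracy as a percentage with two decimal places
summary = {model: f"{acc:.2%}" for model, acc in accuracies.items()}

for model, pct in summary.items():
    print(f"- {model}: {pct}")
```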