### Load and Inspect Data

We load the raw marketing dataset provided by the bank and inspect the first few rows to understand the structure and column types.


In [21]:
import pandas as pd
import plotly.express as px

In [22]:
df = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/bank.csv")
df.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [23]:
df.shape

(37069, 20)

### Check Data Types and Missing Values

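We check column types and the class balance of the target, then count missing entries. In the categorical columns, missing values are recorded as the literal string `unknown`, so we count those per column; they may need to be imputed or kept as their own category before encoding. The target is imbalanced: 4,208 of 37,069 rows (about 11%) are `yes`.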

In [24]:
df.info()

In [25]:
df['y'].value_counts()

y,count
no,32861
yes,4208


In [26]:
# Count literal 'unknown' entries in each column
for col in df.columns:
    print(f"{col}: {df[col].isin(['unknown']).sum()} unknowns")

age: 0 unknowns
job: 294 unknowns
marital: 69 unknowns
education: 1535 unknowns
default: 7725 unknowns
housing: 894 unknowns
loan: 894 unknowns
contact: 0 unknowns
month: 0 unknowns
day_of_week: 0 unknowns
campaign: 0 unknowns
pdays: 0 unknowns
previous: 0 unknowns
poutcome: 0 unknowns
emp.var.rate: 0 unknowns
cons.price.idx: 0 unknowns
cons.conf.idx: 0 unknowns
euribor3m: 0 unknowns
nr.employed: 0 unknowns
y: 0 unknowns
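
The loop above counts `unknown` as an ordinary string. One possible alternative, sketched here on a tiny made-up frame (the rows and the names `toy`, `toy_na`, and `missing_share` are illustrative assumptions, not part of the notebook), is to convert `unknown` to `NaN` so that pandas' missing-value tools apply directly.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values are made up, not drawn from bank.csv
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "unknown"],
    "default": ["no", "unknown", "no", "yes"],
    "age": [30, 41, 52, 38],
})

# Treat the literal string 'unknown' as a missing value
toy_na = toy.replace("unknown", np.nan)

# Share of missing values per column, largest first
missing_share = toy_na.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same two lines would run on `df` instead of `toy`. Whether to impute, drop, or keep `unknown` as its own category is a modeling choice this sketch does not settle.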


### Encode Categorical Variables

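Using `LabelEncoder`, we convert each string-valued categorical column into integer codes so the models can consume it. The codes follow the alphabetical order of the labels (for example, `no` becomes 0 and `yes` becomes 1 in `y`), so the numeric order carries no meaningful ranking.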


In [None]:
from sklearn.preprocessing import LabelEncoder

# Copy dataframe so original is preserved
df_encoded = df.copy()

# Encode all categorical columns
cat_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan',
            'contact', 'month', 'day_of_week', 'poutcome', 'y']

le_dict = {}

for col in cat_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    le_dict[col] = le  # Store encoder in case we need to inverse later


In [None]:
# Check the first few rows
print(df_encoded.head())

   age  job  marital  education  default  housing  loan  contact  month  \
0   56    3        1          0        0        0     0        1      6   
1   57    7        1          3        1        0     0        1      6   
2   37    7        1          3        0        2     0        1      6   
3   40    0        1          1        0        0     0        1      6   
4   56    7        1          3        0        0     2        1      6   

   day_of_week  campaign  pdays  previous  poutcome  emp.var.rate  \
0            1         1    999         0         1           1.1   
1            1         1    999         0         1           1.1   
2            1         1    999         0         1           1.1   
3            1         1    999         0         1           1.1   
4            1         1    999         0         1           1.1   

   cons.price.idx  cons.conf.idx  euribor3m  nr.employed  y  
0          93.994          -36.4      4.857       5191.0  0  
1         

In [None]:
# Check if 'y' is now 0 and 1
print("\nEncoded 'y' value counts:")
print(df_encoded['y'].value_counts())


Encoded 'y' value counts:
y
0    32861
1     4208
Name: count, dtype: int64


In [None]:
# Check encoded values for a few other columns
print("\nSample encoded values for 'job':")
print(df_encoded['job'].unique())


Sample encoded values for 'job':
[ 3  7  0  1  9  5 10  6 11  4  2  8]
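
The integer codes are hard to read on their own. A minimal sketch of how to recover the label-to-code mapping from a fitted encoder follows; the short `jobs` list is an illustrative assumption rather than the real column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real column has more categories
jobs = ["services", "admin.", "housemaid", "services", "unknown"]

le = LabelEncoder()
codes = le.fit_transform(jobs)

# classes_ is sorted, so each label's code is its position in that array
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
restored = list(le.inverse_transform(codes))
print(restored)
```

In the notebook, the same lookup should work on the stored encoders, for example `le_dict['job'].classes_`.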


### Feature Selection and Train/Test Split

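We select features for modeling based on stakeholder goals. The campaign-related columns (`month`, `day_of_week`, `campaign`, `pdays`, `previous`, `poutcome`) are left out as potential data leakage or redundant fields. We then split the dataset 80/20 into training and testing sets.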


In [None]:
# Select features to use for modeling (drop leakage + redundant fields)
features_to_use = [
    'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
    'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'
]

X = df_encoded[features_to_use]
y = df_encoded['y']

In [None]:
from sklearn.model_selection import train_test_split

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

Training data shape: (29655, 13)
Testing data shape: (7414, 13)
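
The split above is not stratified, so the share of positives can drift between the two sets (here the test set has 886 of 7,414 positives, about 12%, against about 11% overall). A hedged sketch of a stratified split on synthetic stand-in data follows; `X_demo` and `y_demo` are assumptions for illustration, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 11% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 110 + [0] * 890)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive rate, train:", y_tr.mean())
print("positive rate, test: ", y_te.mean())
```

Passing `stratify=y` to the notebook's own call would change which rows land in each set, so the metrics reported below would shift slightly.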


### Train Decision Tree Classifier

We train a Decision Tree Classifier and evaluate it using a confusion matrix and classification report. Since the dataset is imbalanced, we will pay close attention to precision and recall for the minority class (`y = yes`).


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Create and train the model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

In [None]:
# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Detailed performance metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[5967  561]
 [ 614  272]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.91      0.91      6528
           1       0.33      0.31      0.32       886

    accuracy                           0.84      7414
   macro avg       0.62      0.61      0.61      7414
weighted avg       0.84      0.84      0.84      7414



### Interpretation of Results

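At first glance the overall accuracy (84%) looks high, but it is below the 88% that a model always predicting `no` would score on this test set (6,528 of 7,414 rows are negative). Precision and recall for the positive class are low (about 33% and 31%, respectively): the tree finds only 272 of the 886 customers who subscribed to a term deposit, and roughly two of every three customers it flags do not subscribe.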
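
The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below uses the Decision Tree matrix printed above and adds a majority-class baseline for comparison; the variable names are illustrative.

```python
import numpy as np

# Decision Tree confusion matrix from the output above
# rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[5967, 561],
               [614, 272]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of predicted subscribers, share that were real
recall = tp / (tp + fn)           # of real subscribers, share that were found
accuracy = (tp + tn) / cm.sum()
baseline = (tn + fp) / cm.sum()   # accuracy of always predicting class 0

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
```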

We will explore methods to improve this in the next iteration, such as Random Forests or class balancing techniques.
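
As one example of a class balancing technique, scikit-learn's tree models accept `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below runs on synthetic data from `make_classification`, not on the bank dataset, and `max_depth=5` is an arbitrary assumption made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 11% positives); not the bank dataset
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.89, 0.11], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

results = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    # max_depth=5 is an arbitrary choice for this sketch
    model = DecisionTreeClassifier(
        max_depth=5, class_weight=weight, random_state=42
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[label] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
    }

for label, scores in results.items():
    print(label, scores)
```

Balanced weighting usually trades some precision for higher recall on the minority class, but the size and even the direction of the effect depend on the data, so it would need to be checked on the real features.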


--------------------------------------------------------------------------------

### Random Forest Classifier

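We now train a Random Forest model to compare against the baseline Decision Tree. A Random Forest averages the predictions of many decision trees, each trained on a bootstrap sample of the data, which typically reduces the overfitting of a single fully grown tree and improves generalization.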


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Train a Random Forest model with default hyperparameters
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on the held-out test set
rf_pred = rf_clf.predict(X_test)

# Evaluate
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, rf_pred))
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, rf_pred))

Random Forest Confusion Matrix:
[[6250  278]
 [ 610  276]]

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.96      0.93      6528
           1       0.50      0.31      0.38       886

    accuracy                           0.88      7414
   macro avg       0.70      0.63      0.66      7414
weighted avg       0.86      0.88      0.87      7414



### Interpretation of Random Forest Results

Switching to a Random Forest raised precision for the positive class from 33% to 50%. The gain comes almost entirely from fewer false positives (278, down from 561); true positives barely changed (276 versus 272), so recall stayed at about 31%. Overall accuracy rose to 88%, which only matches the majority-class baseline on this test set, so precision and recall remain the more informative metrics. Because the customers it flags are now more likely to subscribe, the Random Forest will serve as our base model moving forward.

--------------------------------------------------------------------------------