# **BankTermPredict Project - Jackie CW Vescio**

- Scope: **BankTermPredict** is a supervised learning project focused on predicting whether a client will subscribe to a term deposit based on marketing campaign data collected by a Portuguese banking institution.

- Purpose: The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable `y`).

# Bank Marketing Dataset – Variables Table

| Variable Name  | Role    | Type        | Demographic        | Description                                                                                                                        | Units   | Missing Values |
|----------------|---------|-------------|--------------------|------------------------------------------------------------------------------------------------------------------------------------|---------|----------------|
| age            | Feature | Integer     | Age                | Client’s age                                                                                                                       |         | no             |
| job            | Feature | Categorical | Occupation         | Type of job (`admin.`, `blue-collar`, `entrepreneur`, `housemaid`, `management`, `retired`, `self-employed`, `services`, `student`, `technician`, `unemployed`, `unknown`) |         | no             |
| marital        | Feature | Categorical | Marital Status     | Marital status (`divorced`, `married`, `single`, `unknown`; note: `divorced` means divorced or widowed)                            |         | no             |
| education      | Feature | Categorical | Education Level    | Education level (`basic.4y`, `basic.6y`, `basic.9y`, `high.school`, `illiterate`, `professional.course`, `university.degree`, `unknown`) |         | no             |
| default        | Feature | Binary      |                    | Has credit in default?                                                                                                             |         | no             |
| housing        | Feature | Binary      |                    | Has housing loan?                                                                                                                  |         | no             |
| loan           | Feature | Binary      |                    | Has personal loan?                                                                                                                 |         | no             |
| contact        | Feature | Categorical |                    | Contact communication type (`cellular`, `telephone`)                                                                               |         | yes            |
| day_of_week    | Feature | Categorical |                    | Last contact day of the week                                                                                                       |         | no             |
| month          | Feature | Categorical |                    | Last contact month of year (`jan`, `feb`, …, `dec`)                                                                                |         | no             |
| duration       | Feature | Integer     |                    | Last contact duration in seconds. **Important:** Strongly affects the target but is not known before a call is made, so it should be used only for benchmarking, not for a realistic predictive model. | seconds | no             |
| campaign       | Feature | Integer     |                    | Number of contacts performed during this campaign for this client (includes last contact)                                          |         | no             |
| pdays          | Feature | Integer     |                    | Number of days since the client was last contacted in a previous campaign (`999` means not previously contacted)                  | days    | yes            |
| previous       | Feature | Integer     |                    | Number of contacts performed before this campaign for this client                                                                  |         | no             |
| poutcome       | Feature | Categorical |                    | Outcome of the previous marketing campaign (`failure`, `nonexistent`, `success`)                                                   |         | yes            |
| emp.var.rate   | Feature | Numeric     | Economic Indicator | Employment variation rate (quarterly)                                                                                              |         | no             |
| cons.price.idx | Feature | Numeric     | Economic Indicator | Consumer Price Index (monthly)                                                                                                     |         | no             |
| cons.conf.idx  | Feature | Numeric     | Economic Indicator | Consumer Confidence Index (monthly)                                                                                                |         | no             |
| euribor3m      | Feature | Numeric     | Economic Indicator | Euribor 3-month rate (daily)                                                                                                       |         | no             |
| nr.employed    | Feature | Numeric     | Economic Indicator | Number of employees (quarterly)                                                                                                    |         | no             |
| y              | Target  | Binary      |                    | Has the client subscribed to a term deposit? (`yes` = 1, `no` = 0)                                                                 |         | no             |


**Note:**  
The original UCI Bank Marketing documentation describes 41 variables because it combines details from multiple versions of the dataset (`bank.csv`, `bank-full.csv`, and `bank-additional.csv`).  
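This project uses the *bank-additional-full.csv* file, which contains **21 variables** (20 features + 1 target) and 41,188 rows. Its 10% sample, *bank-additional.csv* (4,119 rows, same 21 columns), is the file the loading cell below points to, so both shapes appear in the outputs.  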
The remaining variables listed in the documentation are only present in the other dataset variants and are not available here.
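
The variables table lists `unknown` as an ordinary category for several columns, and in the bank-additional files `pdays` uses `999` as the code for "not previously contacted" (visible in the `dataset.head()` output in Part 1). A minimal sketch of how these sentinel values could be turned into real missing values follows. The helper name `mark_sentinels`, the extra `never_contacted` column, and the three-row sample frame are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame standing in for the real data (values are illustrative only).
sample = pd.DataFrame(
    {
        "job": ["admin.", "unknown", "services"],
        "housing": ["yes", "unknown", "no"],
        "pdays": [999, 3, 999],
    }
)


def mark_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with sentinel codes replaced by NaN (hypothetical helper)."""
    out = df.copy()
    # Keep the "never contacted" information before the 999 code is blanked out.
    out["never_contacted"] = out["pdays"].eq(999)
    # mask() sets entries to NaN where the condition is True (column becomes float).
    out["pdays"] = out["pdays"].mask(out["pdays"].eq(999))
    # 'unknown' is a category label in the raw file, not a true NaN.
    out = out.replace("unknown", np.nan)
    return out


cleaned = mark_sentinels(sample)
print(cleaned.isna().sum())
```

Whether `unknown` should become NaN or stay as its own category is a modeling choice; this sketch only makes the hidden missingness countable.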


## Part 1 - Data Preprocessing

### Importing the dataset

In [None]:
# Choose and define path (adjust if needed) for loading the dataset
from pathlib import Path

# If your notebook is in /notebooks, this is usually correct:
CSV = Path("../data/bank-additional/bank-additional.csv")

# If you’re running from the repo root instead, uncomment this:
# CSV = Path("data/bank-additional/bank-additional.csv")

assert CSV.exists(), f"Not found: {CSV.resolve()}"
CSV


WindowsPath('../data/bank-additional/bank-additional.csv')
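
The cell above switches between two relative paths by commenting one out. As an alternative, the lookup could try both locations in order. This is only a sketch: `find_dataset` and `CANDIDATES` are hypothetical names, and the two paths are the ones already shown in the cell.

```python
from pathlib import Path


def find_dataset(candidates, base="."):
    """Return the first candidate that exists relative to base (hypothetical helper)."""
    for candidate in candidates:
        path = Path(base) / candidate
        if path.exists():
            return path
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"Dataset not found; tried: {tried}")


CANDIDATES = [
    "../data/bank-additional/bank-additional.csv",  # notebook run from /notebooks
    "data/bank-additional/bank-additional.csv",     # notebook run from the repo root
]

try:
    CSV = find_dataset(CANDIDATES)
    print("Using", CSV)
except FileNotFoundError as err:
    # Report the problem instead of failing, so the cell is safe to run anywhere.
    print(err)
```

Raising `FileNotFoundError` with the list of attempted paths plays the same role as the `assert` in the original cell, but the check still runs when Python is started with `-O`, which strips assertions.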

In [40]:
# Load the bank-additional dataset and review its shape and info

import pandas as pd
pd.set_option("display.max_columns", None)

dataset = pd.read_csv(CSV, sep=";")

# Expect (41188, 21) for bank-additional-full.csv or (4119, 21) for the 10% sample bank-additional.csv
print("Shape:", dataset.shape)
print("\nInfo:")
dataset.info()  # info() prints directly and returns None, so it is not wrapped in print()


Shape: (41188, 21)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.

In [25]:
dataset.head()


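|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|-----|-----|---------|-----------|---------|---------|------|---------|-------|-------------|----------|----------|-------|----------|----------|--------------|----------------|---------------|-----------|-------------|---|
| 0 | 30 | blue-collar | married | basic.9y | no | yes | no | cellular | may | fri | 487 | 2 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.313 | 5099.1 | no |
| 1 | 39 | services | single | high.school | no | no | no | telephone | may | fri | 346 | 4 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.855 | 5191.0 | no |
| 2 | 25 | services | married | high.school | no | yes | no | telephone | jun | wed | 227 | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.962 | 5228.1 | no |
| 3 | 38 | services | married | basic.9y | no | unknown | unknown | telephone | jun | fri | 17 | 3 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.959 | 5228.1 | no |
| 4 | 47 | admin. | married | university.degree | no | yes | no | cellular | nov | mon | 58 | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.191 | 5195.8 | no |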
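
The variables table notes that `y` is a yes/no target and that `duration` should be used only for benchmarking. One possible way to separate features from the target with that caveat in mind is sketched below. The helper `split_features_target` and its `drop_duration` flag are hypothetical. The first two sample rows are trimmed copies of rows 0 and 1 above, and the third row is invented so that both labels appear.

```python
import pandas as pd

# Rows 0 and 1 from the head() output (a few columns only) plus one invented "yes" row.
sample = pd.DataFrame(
    {
        "age": [30, 39, 45],
        "duration": [487, 346, 600],
        "campaign": [2, 4, 1],
        "y": ["no", "no", "yes"],
    }
)


def split_features_target(df, drop_duration=True):
    """Split a frame into features X and a 0/1 target y (hypothetical helper)."""
    target = df["y"].map({"no": 0, "yes": 1})
    if target.isna().any():
        # map() yields NaN for any label other than 'yes'/'no'.
        raise ValueError("unexpected label in column 'y'")
    features = df.drop(columns=["y"])
    if drop_duration:
        # duration is only known after the call, so leave it out of realistic models.
        features = features.drop(columns=["duration"])
    return features, target.astype("int64")


X, y = split_features_target(sample)
print(list(X.columns))
print(y.tolist())
```

Passing `drop_duration=False` keeps the column for a benchmark run, which makes it easy to compare the two settings.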


In [31]:
# Show dataset shape: number of rows and columns

print("Shape:")
print(dataset.shape)

Shape:
(4119, 21)


In [35]:
# Show dataset info

print("Info:")
dataset.info()

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4119 non-null   int64  
 1   job             4119 non-null   object 
 2   marital         4119 non-null   object 
 3   education       4119 non-null   object 
 4   default         4119 non-null   object 
 5   housing         4119 non-null   object 
 6   loan            4119 non-null   object 
 7   contact         4119 non-null   object 
 8   month           4119 non-null   object 
 9   day_of_week     4119 non-null   object 
 10  duration        4119 non-null   int64  
 11  campaign        4119 non-null   int64  
 12  pdays           4119 non-null   int64  
 13  previous        4119 non-null   int64  
 14  poutcome        4119 non-null   object 
 15  emp.var.rate    4119 non-null   float64
 16  cons.price.idx  4119 non-null   float64
 17  cons.conf.idx   4119 non-nu