## Abstract

The data is related to **direct marketing campaigns** (phone calls) of a Portuguese banking institution.  
The classification goal is to **predict if a client will subscribe to a term deposit** (`y`).

---

## Data Set Information

The marketing campaigns were based on phone calls. Often, more than one contact was required to determine whether the product (bank term deposit) would be subscribed (`'yes'` or `'no'`).

---

## Attribute Information

### 📂 Bank Client Data
- **age** *(numeric)*: Client's age  
- **job** *(categorical)*: Job type  
  `['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']`
- **marital** *(categorical)*: Marital status  
  `['divorced', 'married', 'single', 'unknown']`
- **education** *(categorical)*: Education level  
  `['basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown']`
- **default** *(categorical)*: Has credit in default?  
  `['no', 'yes', 'unknown']`
- **housing** *(categorical)*: Has housing loan?  
  `['no', 'yes', 'unknown']`
- **loan** *(categorical)*: Has personal loan?  
  `['no', 'yes', 'unknown']`
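
Several of the attributes above use the label `'unknown'` rather than a true missing value, so `isna()` would not detect them. A minimal sketch of counting these labels per column is shown here; the helper name `count_unknown` and the toy rows are assumptions made for illustration, not part of the data set:

```python
import pandas as pd

def count_unknown(frame: pd.DataFrame, columns: list, label: str = "unknown") -> pd.Series:
    """Count how often `label` appears in each of the given columns."""
    return frame[columns].eq(label).sum()

# Toy rows for illustration only; not taken from the real data set.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "unknown", "services"],
    "default": ["no", "unknown", "unknown"],
})

unknown_counts = count_unknown(toy, ["job", "default"])
print(unknown_counts)
```

On the real data the same call could be made with the full list of categorical column names.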

---

### ☎️ Contact Information (from current campaign)
- **contact** *(categorical)*: Contact communication type  
  `['cellular', 'telephone']`
- **month** *(categorical)*: Last contact month  
  `['jan', 'feb', ..., 'dec']`
- **day_of_week** *(categorical)*: Last contact day  
  `['mon', 'tue', 'wed', 'thu', 'fri']`
- **duration** *(numeric)*: Duration of last contact in seconds  
  ⚠️ *Important:* Duration strongly affects target (`y`), but is **only known after the call**.  
  It should **not be used in predictive models** for real-time classification — include only for benchmarking.
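
One way to respect the warning about `duration` is to drop the column by default when building the feature matrix and to keep it only for an explicit benchmark run. The sketch below assumes a hypothetical helper `split_features_target` and uses toy rows in place of the real data:

```python
import pandas as pd

def split_features_target(frame: pd.DataFrame, include_duration: bool = False):
    """Split a frame into features X and target y.

    `duration` is dropped by default because it is only known after the call;
    pass include_duration=True for a benchmark model only.
    """
    to_drop = ["y"] if include_duration else ["y", "duration"]
    return frame.drop(columns=to_drop), frame["y"]

# Toy rows for illustration only; not taken from the real data set.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [120, 600, 0],
    "y": ["no", "yes", "no"],
})

X_realistic, y = split_features_target(toy)
X_benchmark, _ = split_features_target(toy, include_duration=True)
print(list(X_realistic.columns))
print(list(X_benchmark.columns))
```

Making the leaky column opt-in means a realistic model cannot pick it up by accident.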

---

### 🔁 Previous Campaign Attributes
- **campaign** *(numeric)*: Number of contacts during current campaign (including last)
- **pdays** *(numeric)*: Days since the client was last contacted in a previous campaign  
  `999` means "not previously contacted"
- **previous** *(numeric)*: Number of contacts with this client before the current campaign
- **poutcome** *(categorical)*: Outcome of the previous campaign  
  `['failure', 'nonexistent', 'success']`
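
Because `999` in `pdays` is a sentinel rather than a real number of days, it can distort means and distance-based models if left as-is. A minimal sketch of one possible treatment follows: keep the information as a boolean flag and blank out the sentinel. The column names `was_previously_contacted` and `pdays_clean` and the toy values are illustrative assumptions:

```python
import pandas as pd

PDAYS_SENTINEL = 999  # documented meaning: client was not previously contacted

# Toy values for illustration only; not taken from the real data set.
toy = pd.DataFrame({"pdays": [999, 6, 999, 3]})

# Boolean flag that preserves the "was contacted before" information.
toy["was_previously_contacted"] = toy["pdays"] != PDAYS_SENTINEL

# Replace the sentinel with NaN; the column is upcast to float to hold NaN.
toy["pdays_clean"] = toy["pdays"].mask(toy["pdays"] == PDAYS_SENTINEL)

print(toy)
```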

---

### 📉 Economic Context Attributes
- **emp.var.rate** *(numeric)*: Employment variation rate (quarterly)
- **cons.price.idx** *(numeric)*: Consumer price index (monthly)
- **cons.conf.idx** *(numeric)*: Consumer confidence index (monthly)
- **euribor3m** *(numeric)*: Euribor 3-month rate (daily)
- **nr.employed** *(numeric)*: Number of employees (quarterly)

---

## 🎯 Target Variable
- **y** *(binary)*: Has the client subscribed to a term deposit?  
  `['yes', 'no']`
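
Most modelling libraries expect a numeric target, so the `'yes'`/`'no'` labels are usually mapped to 1/0 first. A minimal sketch, using a toy series rather than the real column (the names `y_binary` and `positive_rate` are assumptions):

```python
import pandas as pd

# Toy labels for illustration only; not taken from the real data set.
toy_y = pd.Series(["no", "yes", "no", "no"], name="y")

# Map the string labels to integers: 1 = subscribed, 0 = did not subscribe.
y_binary = toy_y.map({"no": 0, "yes": 1})

# The mean of a 0/1 target is the share of positive cases.
positive_rate = y_binary.mean()
print(y_binary.tolist(), positive_rate)
```

Checking the positive rate early is useful because a strongly imbalanced target affects the choice of evaluation metric.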


In [17]:
# First, import pandas
import pandas as pd

In [18]:
# Read the CSV file (fields are separated by semicolons)
df = pd.read_csv('bank-additional-full.csv', sep=';')

In [19]:
# Let's do an initial check of the data with head(), info() and describe()
df.head()

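|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|-----|-----|---------|-----------|---------|---------|------|---------|-------|-------------|-----|----------|-------|----------|----------|--------------|----------------|---------------|-----------|-------------|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |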


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 ...

In [21]:
df.describe()

|       | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|-------|-----|----------|----------|-------|----------|--------------|----------------|---------------|-----------|-------------|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


## 📊 Summary Notes: Descriptive Statistics & Observations

---

### 1. `age`
- **Mean**: ~40 years  
- **Range**: 17 to 98  
- **Standard Deviation**: ~10  
- ✔️ Distribution is reasonable for an adult banking population.  
- 📌 Half of all clients are between 32 and 47 years old (the 25th and 75th percentiles).

---

### 2. `duration` *(call length in seconds)*
- **Mean**: ~258 seconds (~4 min 18 sec)  
- **Median**: 180 seconds  
- **Range**: 0 to 4918 seconds (~82 min)  
- 🔺 Right-skewed: the mean (~258 s) is well above the median (180 s) because a few very long calls pull the average up.  
- ⚠️ *Important:* This value is only known **after** the call → don't use for real-time prediction.
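
The right skew noted above can be checked by comparing the mean with the median, and a log transform is one common way to compress the long tail. The sketch below uses the five summary values of `duration` reported by `describe()` (min, quartiles, max) as stand-in data points, so its mean differs from the real one. `np.log1p` is used instead of `np.log` because the minimum duration is 0:

```python
import numpy as np
import pandas as pd

# Stand-in points: min, 25%, 50%, 75% and max of `duration` from describe().
toy_duration = pd.Series([0, 102, 180, 319, 4918], name="duration")

# A mean well above the median is a quick indicator of right skew.
is_right_skewed = toy_duration.mean() > toy_duration.median()

# log1p(x) = log(1 + x) is defined at x = 0, unlike log(x).
log_duration = np.log1p(toy_duration)

print(is_right_skewed)
print(log_duration.round(2).tolist())
```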


In [28]:
# Identify outliers with the IQR (interquartile range) method:
# a value is flagged if it lies more than 1.5 * IQR below Q1 or above Q3.
def identify_outliers_iqr(df, column):
    """Return (outlier rows, lower bound, upper bound) for one numeric column."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Apply the function to duration, campaign, and pdays
outliers_duration, dur_low, dur_high = identify_outliers_iqr(df, 'duration')
outliers_campaign, camp_low, camp_high = identify_outliers_iqr(df, 'campaign')
outliers_pdays, pdays_low, pdays_high = identify_outliers_iqr(df, 'pdays')

# Print summary. For duration and campaign the lower bound is negative,
# so only the upper bound can flag values.
print(f"Duration outliers: {len(outliers_duration)} (above {dur_high:.2f})")
print(f"Campaign outliers: {len(outliers_campaign)} (above {camp_high:.2f})")
print(f"Pdays outliers: {len(outliers_pdays)} (below {pdays_low:.2f} or above {pdays_high:.2f})")

Duration outliers: 2963 (above 644.50)
Campaign outliers: 2406 (above 6.00)
Pdays outliers: 1515 (below 999.00 or above 999.00)


## 📌 Outlier Summary Table

| Column      | Outlier Count | Issue                              | Recommendation                          |
|-------------|---------------|------------------------------------|-----------------------------------------|
| `duration`  | 2,963         | Right-skewed; long tail of call lengths above the IQR upper bound of 644.5 s | 🔹 Cap at 645<br>🔹 Log-transform<br>🔹 Create binary flag `is_long_call`<br>🔹 Use for benchmarking only, since the value is known only after the call |
| `campaign`  | 2,406         | Some clients were contacted far more often than the IQR upper bound of 6 (maximum: 56 contacts) | 🔹 Cap at 6<br>🔹 Create `high_contact` flag<br>🔹 Analyze impact on conversion |
| `pdays`     | 1,515 (≠999)  | 999 is a **placeholder**, not a real value. Because Q1 = Q3 = 999, the IQR is 0 and every real value is flagged, so these rows are previously contacted clients rather than true outliers | 🔹 Replace 999 with `NaN` or similar<br>🔹 Create `was_previously_contacted` flag |
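
The capping and flagging recommendations for `duration` and `campaign` can be sketched as follows. This is one possible approach under stated assumptions: the helper `cap_at_iqr_upper` is hypothetical, the toy values are made up, and the flag is computed before capping so that the information is not lost:

```python
import pandas as pd

def cap_at_iqr_upper(series: pd.Series, k: float = 1.5):
    """Clip values above Q3 + k * IQR; return (capped series, upper bound)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return series.clip(upper=upper), upper

# Toy values for illustration only; not taken from the real data set.
toy = pd.DataFrame({"campaign": [1, 1, 2, 3, 56]})

capped, upper_bound = cap_at_iqr_upper(toy["campaign"])

# Flag first (on the raw values), then store the capped column.
toy["high_contact"] = toy["campaign"] > upper_bound
toy["campaign_capped"] = capped

print(upper_bound)
print(toy)
```

Capping keeps every row in the data set, unlike dropping outliers, while the flag lets a model still distinguish the extreme cases.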
