# Bank Marketing Campaign Analysis

**Rows:** 45,211  
**Columns:** 17  

## Overview
This dataset contains bank marketing campaign data used to analyze factors influencing whether a customer subscribes to a term deposit.

## Features
- **Demographic:** age, job, marital, education  
- **Financial:** default, balance, housing, loan  
- **Campaign:** contact, day, month, duration, campaign, pdays, previous, poutcome  
- **Target:** y (subscription result)

## Goal
Find patterns that help improve customer targeting and campaign success.

In [1]:
import pandas as pd

df = pd.read_csv('/Users/a2681/Downloads/Raw Dataset eda.csv', sep=';')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


## Dataset Description

This dataset contains customer marketing campaign data from a bank. It includes demographic information, financial attributes, and campaign interaction details. The goal is to analyze patterns that influence whether a customer subscribes to a term deposit.

## Business Objective

The objective is to identify the key factors affecting customer subscription behavior so that marketing teams can optimize targeting strategies and improve the campaign success rate.

In [2]:
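# Basic inspection commands
# Note: a notebook cell renders only its last bare expression, so the results of
# df.head() and df.describe() are not shown below; df.info() prints directly.

df.head()
df.info()
df.describe()
df.nunique()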

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


age            77
job            12
marital         3
education       4
default         2
balance      7168
housing         2
loan            2
contact         3
day            31
month          12
duration     1573
campaign       48
pdays         559
previous       41
poutcome        4
y               2
dtype: int64
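
Because a notebook cell renders only its last bare expression, the per-column facts above can also be collected into a single table. The following is a minimal sketch, assuming a small stand-in frame `toy` with illustrative values in place of `df`; the names `toy` and `summary` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df (illustrative values only, not the full dataset).
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "unknown"],
    "balance": [2143, 29, 2],
})

# One row per column: dtype, non-null count, and number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "n_unique": toy.nunique(),
})

# print() is used so every result is shown, not only the last expression.
print(summary)
print(toy.describe())
```

Swapping `toy` for `df` should give the same kind of table for all 17 columns.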

In [3]:
#Identify Column Types

num_cols = df.select_dtypes(include=['int64','float64']).columns
cat_cols = df.select_dtypes(include=['object']).columns

print("Numerical Columns:", num_cols)
print("Categorical Columns:", cat_cols)

Numerical Columns: Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous'], dtype='object')
Categorical Columns: Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y'],
      dtype='object')


In [4]:
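# Missing value analysis
# Only the last expression (the percentages) is rendered below.

df.isnull().sum()                    # NaN count per column
(df.isnull().sum()/len(df))*100      # NaN percentage per column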

age          0.0
job          0.0
marital      0.0
education    0.0
default      0.0
balance      0.0
housing      0.0
loan         0.0
contact      0.0
day          0.0
month        0.0
duration     0.0
campaign     0.0
pdays        0.0
previous     0.0
poutcome     0.0
y            0.0
dtype: float64

In [5]:
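# Duplicate check and summary statistics
# Note: only printed output appears below; wrap these two calls in print() to see their results.
df.duplicated().sum()
df.describe()

# Count the "unknown" placeholder in each column
for col in df.columns:
    print(col, df[col].isin(['unknown']).sum())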

age 0
job 288
marital 0
education 1857
default 0
balance 0
housing 0
loan 0
contact 13020
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 36959
y 0
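
The raw counts above are easier to compare as shares of the total row count. A minimal sketch of that calculation follows, assuming a toy frame with made-up values in place of `df`; the names `toy`, `text_cols`, and `report` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician", "unknown"],
    "contact": ["unknown", "cellular", "unknown", "unknown"],
    "age": [58, 44, 33, 47],
})

# Compare only the non-numeric columns against the placeholder string.
text_cols = toy.select_dtypes(exclude="number").columns
unknown_count = toy[text_cols].eq("unknown").sum()
unknown_pct = (unknown_count / len(toy) * 100).round(1)

# Sort so the most affected columns come first.
report = pd.DataFrame({"unknown_count": unknown_count, "unknown_pct": unknown_pct})
report = report.sort_values("unknown_pct", ascending=False)
print(report)
```

Restricting the comparison to text columns avoids comparing integers with a string and keeps the report to the columns that can actually hold the placeholder.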


In [6]:
# Data Quality Issue Log
# Resolutions are planned for Day 2; none of them has been applied in this notebook yet.
dq_log = pd.DataFrame({
    "Issue": [
        "Unknown values present",
        "Potential outliers",
        "Possible duplicate rows",
        "Skewed numeric distribution"
    ],
    "Columns Affected": [
        "job, education, contact, poutcome",
        "balance, duration, campaign",
        "All columns",
        "balance"
    ],
    "Planned Resolution": [
        "Replace with NaN and impute",
        "Apply IQR method",
        "Remove duplicates",
        "Apply log transformation"
    ]
})

dq_log

Unnamed: 0,Issue,Columns Affected,Planned Resolution
0,Unknown values present,"job, education, contact, poutcome",Replace with NaN and impute
1,Potential outliers,"balance, duration, campaign",Apply IQR method
2,Possible duplicate rows,All columns,Remove duplicates
3,Skewed numeric distribution,balance,Apply log transformation
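
The log names the IQR method and a log transformation without showing them. The sketch below illustrates both under stated assumptions: the first five balances come from the `df.head()` output, while `50000` and `-50` are hypothetical values added to exercise the outlier and negative-value cases. The helper `iqr_bounds` and the signed `log1p` variant are illustrative choices, not the notebook's own method.

```python
import numpy as np
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# First five values are from df.head(); 50000 and -50 are hypothetical additions.
balance = pd.Series([2143, 29, 2, 1506, 1, 50000, -50], name="balance")

lower, upper = iqr_bounds(balance)
is_outlier = (balance < lower) | (balance > upper)
print(balance[is_outlier])

# A plain log is undefined for zero or negative values, so this sketch uses a
# signed log1p: sign(x) * log(1 + |x|). It compresses large magnitudes and keeps the sign.
signed_log = np.sign(balance) * np.log1p(balance.abs())
print(signed_log.round(3))
```

Flagging rows with a boolean mask, rather than dropping them immediately, leaves the decision to cap, remove, or keep extreme values for the cleaning step.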


## Conclusion: Day 1 Data Overview

The dataset was successfully loaded and profiled. Structural inspection shows 45,211 records and 17 columns (16 features plus the target `y`), of which 7 are numerical and 10 are categorical.

No missing (NaN) values were detected; however, several categorical columns contain the placeholder value "unknown", which should be treated either as missing or as a separate category during preprocessing.

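Key data quality issues identified include:
- High proportion of "unknown" values in `poutcome` (36,959 rows, about 81.7%)
- Substantial "unknown" values in `contact` (13,020 rows, about 28.8%), with smaller shares in `education` (1,857 rows, about 4.1%) and `job` (288 rows, about 0.6%)
- Potential outliers in numerical columns such as `balance` and `duration`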

These findings indicate that data cleaning and preprocessing are required before performing deeper analysis or modeling.
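
The two treatments of "unknown" mentioned above (missing value versus separate category) can be sketched as follows. This is a minimal illustration on a toy frame with made-up values; the names `toy`, `as_missing`, and `as_category` are hypothetical, and which option suits each column is left open.

```python
import pandas as pd

# Toy stand-in for two of the affected columns (illustrative values only).
toy = pd.DataFrame({
    "education": ["tertiary", "unknown", "secondary"],
    "poutcome": ["unknown", "unknown", "success"],
})

# Option 1: treat the placeholder as missing, so isna()/isnull() counts it.
# mask() replaces every cell where the condition is True with NaN.
as_missing = toy.mask(toy.eq("unknown"))
print(as_missing.isna().sum())

# Option 2: keep "unknown" as an explicit level of a categorical column.
as_category = toy.astype("category")
print(as_category["poutcome"].cat.categories.tolist())
```

Option 1 makes the gaps visible to imputation tools, while Option 2 may be preferable when the placeholder dominates a column, as it does in `poutcome`.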

---

### Next Step
On Day 2, data cleaning and preprocessing will be performed to handle unknown values, treat outliers, correct data types, and prepare the dataset for analysis.
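
As one possible shape for the data type correction step, the sketch below converts yes/no columns to a boolean type and a text column to a categorical type. It assumes that the two-valued columns (`default`, `housing`, `loan`, `y`) contain only "yes" and "no", which the `nunique` output suggests but does not prove; the toy frame and the names `binary_cols` and `yes_no` are hypothetical.

```python
import pandas as pd

# Toy stand-in for a few columns of df (illustrative values only).
toy = pd.DataFrame({
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "yes", "no"],
    "job": ["management", "technician", "unknown"],
    "y": ["no", "no", "yes"],
})

# Assumption: these columns hold only "yes" and "no".
binary_cols = ["housing", "loan", "y"]
yes_no = {"yes": True, "no": False}
for col in binary_cols:
    # Any value outside the mapping becomes missing; the nullable
    # "boolean" dtype can hold that without falling back to object.
    toy[col] = toy[col].map(yes_no).astype("boolean")

# Low-cardinality text columns can be stored as categoricals.
toy["job"] = toy["job"].astype("category")

print(toy.dtypes)
```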

In [7]:
df.to_csv("day1_output.csv", index=False)