# Bank Term Deposit Subscription Prediction Dataset

Subtitle: Predict whether a client will subscribe to a bank term deposit

Description:
This dataset contains information about clients of a Portuguese banking institution. The goal is to predict whether a client will subscribe to a bank term deposit (variable y). The data was obtained from a direct marketing campaign, and each entry corresponds to a single client.

Dataset Content:
The dataset contains 45,211 entries with 17 attributes. The attributes represent client information and campaign details, and they include both categorical and numerical data.

- age: Age of the client (numeric)
- job: Type of job (categorical: "admin.", "blue-collar", "entrepreneur", etc.)
- marital: Marital status (categorical: "married", "single", "divorced")
- education: Level of education (categorical: "primary", "secondary", "tertiary", "unknown")
- default: Has credit in default? (categorical: "yes", "no")
- balance: Average yearly balance in euros (numeric)
- housing: Has a housing loan? (categorical: "yes", "no")
- loan: Has a personal loan? (categorical: "yes", "no")
- contact: Type of communication contact (categorical: "unknown", "telephone", "cellular")
- day: Last contact day of the month (numeric, 1-31)
- month: Last contact month of the year (categorical: "jan", "feb", "mar", …, "dec")
- duration: Last contact duration in seconds (numeric)
- campaign: Number of contacts performed during this campaign (numeric)
- pdays: Number of days since the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted)
- previous: Number of contacts performed before this campaign (numeric)
- poutcome: Outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
- y: The target variable, whether the client subscribed to a term deposit (binary: "yes", "no")
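
The -1 sentinel in pdays is easy to mistake for a real day count. The sketch below shows one possible way to separate "never contacted" from the actual number of days. The frame `toy` and the column names `was_contacted_before` and `pdays_clean` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({"pdays": [-1, 92, -1, 180], "previous": [0, 2, 0, 1]})

# Flag clients who were contacted in an earlier campaign.
toy["was_contacted_before"] = toy["pdays"] != -1

# Replace the -1 sentinel with a missing value so it does not distort
# means, histograms, or correlations of the real day counts.
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != -1)

print(toy)
print("Mean days since last contact:", toy["pdays_clean"].mean())
```

Whether to keep the sentinel, drop it, or split it out like this depends on the model used later; tree-based models often cope with the raw -1, while linear models usually do not.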

## Access & Use Information
- This dataset was made available by the UCI Machine Learning Repository. The data is provided "as is" without any warranty.
- License: The dataset is available for public use under the Open Data Commons Public Domain Dedication and License (PDDL).

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Assumes train.csv sits in the current working directory; adjust the path otherwise.
df = pd.read_csv('train.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'train.csv'
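
The error above means pandas could not find `train.csv` relative to the notebook's working directory. A small guard like the following can make that failure easier to diagnose. `load_dataset` is a hypothetical helper written for this note, not something the notebook or pandas provides.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path="train.csv"):
    """Read the CSV at `path`, failing early with a descriptive message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the working directory so a wrong relative path is easy to spot.
        raise FileNotFoundError(
            f"{csv_path} not found (working directory: {Path.cwd()}). "
            "Pass the full path to the file instead."
        )
    return pd.read_csv(csv_path)
```

With the file in place, `df = load_dataset('train.csv')` would stand in for the plain `pd.read_csv` call.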

In [None]:
df.info()

In [None]:
print(df.isnull().sum())
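
`isnull()` only counts real missing values. Several categorical attributes (education, contact, poutcome) instead use the literal label "unknown", which it will not report. A rough sketch of how those could be counted, on an invented toy frame:

```python
import pandas as pd

# Toy data for illustration only; the real columns come from train.csv.
toy = pd.DataFrame({
    "education": ["primary", "unknown", "tertiary", "secondary"],
    "contact": ["unknown", "cellular", "unknown", "telephone"],
    "poutcome": ["unknown", "unknown", "failure", "unknown"],
})

# Comparing the whole frame to "unknown" gives a boolean frame;
# the column-wise mean of that is the share of "unknown" labels.
unknown_share = (toy == "unknown").mean().sort_values(ascending=False)
print(unknown_share)
```

On the real data the same expression could be applied to the object-dtype columns, for example `(df.select_dtypes(include='object') == 'unknown').mean()`.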

In [None]:
df.head()

In [None]:
# Distribution of categorical variables
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
for col in categorical_cols:
    plt.figure(figsize=(10, 4))
    sns.countplot(
        data=df,
        x=col,
        order=df[col].value_counts().index,
        palette='Set2',
        edgecolor='black'
    )
    plt.title(f'{col} Distribution', fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.show()

    # 🧮 Print Category Proportions
    print(f'\n📊 Proportion of Each Category in "{col}":\n')
    print(df[col].value_counts(normalize=True).round(3), '\n' + '-'*40)

In [None]:
# Distribution of numerical variables
numerical_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
for col in numerical_cols:
    plt.figure(figsize=(8,4))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f"Distribution of {col}")
    plt.show()

    # Print descriptive statistics
    print(f'\n📊 Descriptive Stats for {col}:\n')
    print(df[col].describe(), '\n' + '-'*40)

In [None]:
# Correlation heatmap for numerical features
plt.figure(figsize=(10,6))
# 'number' selects every integer and float column regardless of bit width
num_cols = df.select_dtypes(include='number').columns
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.xticks(rotation=45, ha='right')
plt.show()
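
The heatmap covers numeric columns only, so the string target `y` is left out. One way to include it, sketched here on invented toy values, is to encode it as 0/1 first. The column name `y_num` is an assumption made for this example.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "duration": [100, 200, 600, 400, 300],
    "y": ["no", "no", "yes", "yes", "no"],
})

# Encode the target as 1 for "yes" and 0 for "no".
toy["y_num"] = (toy["y"] == "yes").astype(int)

# Pearson correlation between a numeric feature and the encoded target.
corr = toy[["duration", "y_num"]].corr().loc["duration", "y_num"]
print(f"Correlation of duration with y: {corr:.2f}")
```

Pearson correlation only captures linear association, so a low value here does not rule out a useful non-linear relationship.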

In [None]:
# Distribution of target variable 'y'
plt.figure(figsize=(6,4))
df['y'].value_counts().plot(kind="bar", color=["skyblue","salmon"])
plt.title("Target Variable Distribution (Subscribed vs Not)")
plt.xlabel("Subscribed (y)")
plt.ylabel("Count")
plt.show()

print("Subscription Rate (%):")
print(df['y'].value_counts(normalize=True)*100)
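
If the classes turn out to be imbalanced, plain accuracy becomes a weak yardstick, because always predicting the majority class already scores well. A minimal sketch of that baseline, using an invented 90/10 split rather than the real rates:

```python
import pandas as pd

# Invented target column: nine "no" and one "yes".
toy_y = pd.Series(["no"] * 9 + ["yes"] * 1, name="y")

rates = toy_y.value_counts(normalize=True)
majority_class = rates.idxmax()    # most frequent label
baseline_accuracy = rates.max()    # accuracy of always predicting it

print(f"Always predicting '{majority_class}' gives {baseline_accuracy:.0%} accuracy")
```

Any model trained later would need to beat this number to be doing more than echoing the class frequencies.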

In [None]:
# Box plots of numerical features vs target variable 'y'
num_features = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

for col in num_features:
    plt.figure(figsize=(8,4))
    sns.boxplot(x="y", y=col, data=df)
    plt.title(f"{col} vs Subscription")
    plt.show()
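
Box plots are easier to compare alongside the numbers they summarise. The sketch below computes per-class medians with `groupby`, again on invented toy values:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "y": ["no", "no", "yes", "yes", "no"],
    "duration": [100, 200, 600, 400, 300],
    "campaign": [1, 4, 1, 2, 3],
})

# Median of each numeric feature within each target class;
# these are the centre lines the box plots draw.
summary = toy.groupby("y")[["duration", "campaign"]].median()
print(summary)
```

On the real frame, `df.groupby('y')[num_features].median()` would give the equivalent table.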

In [None]:
# Stacked bar plots for categorical features vs target variable 'y'
cat_features = ['job', 'marital', 'education', 'housing', 'loan']

for col in cat_features:
    ct = pd.crosstab(df[col], df['y'], normalize='index')
    ct.plot(kind="bar", stacked=True, figsize=(10,4), colormap="viridis")
    plt.title(f"Subscription Rate by {col}")
    plt.ylabel("Proportion")
    plt.show()
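
With `normalize='index'`, each row of the crosstab is divided by its own total, so every bar in the stacked plot sums to 1 and group sizes are hidden. A small illustration on invented data, printing the raw counts next to the rates:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "housing": ["yes", "yes", "yes", "no", "no"],
    "y": ["no", "no", "yes", "yes", "no"],
})

counts = pd.crosstab(toy["housing"], toy["y"])                    # raw counts
rates = pd.crosstab(toy["housing"], toy["y"], normalize="index")  # row shares

print(counts)
print(rates.round(3))
```

Checking `counts` alongside `rates` helps avoid over-reading a high subscription rate that rests on only a handful of clients.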

In [None]:
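# Special case: campaign is a discrete count, so group it into bins.
# pd.cut returns NaN for values beyond the last edge, so stretch the
# top edge to the observed maximum when it exceeds 50.
upper = max(50, df['campaign'].max())
campaign_bins = pd.cut(df['campaign'], [0, 1, 3, 5, 10, upper])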
ct = pd.crosstab(campaign_bins, df['y'], normalize='index')
ct.plot(kind="bar", stacked=True, figsize=(8,4), colormap="plasma")
plt.title("Subscription Rate by Campaign Contacts")
plt.ylabel("Proportion")
plt.show()
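
`pd.cut` builds right-closed intervals such as (0, 1] and (1, 3], and any value outside the outermost edges becomes NaN, which `pd.crosstab` then drops. The sketch below shows both behaviours on invented contact counts:

```python
import pandas as pd

# Invented contact counts; 63 lies beyond a fixed upper edge of 50.
contacts = pd.Series([1, 2, 3, 7, 50, 63])

# Fixed edges: the value 63 falls outside and becomes NaN.
closed = pd.cut(contacts, [0, 1, 3, 5, 10, 50])

# Upper edge stretched to the observed maximum: nothing is lost.
open_ended = pd.cut(contacts, [0, 1, 3, 5, 10, max(50, contacts.max())])

print(closed.isna().sum(), "value(s) dropped with fixed edges")
print(open_ended.isna().sum(), "value(s) dropped with stretched edge")
```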

In [None]:
plt.figure(figsize=(8,4))
sns.histplot(data=df, x="duration", hue="y", kde=True, bins=50, element="step")
plt.title("Call Duration Distribution by Subscription")
plt.show()

In [None]:
if "month" in df.columns:
    ct = pd.crosstab(df['month'], df['y'], normalize='index')
    ct.plot(kind="bar", stacked=True, figsize=(10,4), colormap="cividis")
    plt.title("Subscription Rate by Month")
    plt.ylabel("Proportion")
    plt.show()
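
A crosstab on month strings sorts its rows alphabetically ("apr", "aug", "dec", ...), which scrambles the calendar order on the x-axis. One possible fix, sketched on invented toy rows, is to reindex the table with an explicit month list. The name `month_order` is introduced here for illustration.

```python
import pandas as pd

month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# Toy data for illustration only.
toy = pd.DataFrame({
    "month": ["may", "jan", "aug", "may", "jan", "dec"],
    "y": ["no", "yes", "no", "yes", "no", "no"],
})

ct = pd.crosstab(toy["month"], toy["y"], normalize="index")

# Keep only the months that actually occur, in calendar order.
present = [m for m in month_order if m in ct.index]
ct = ct.reindex(present)
print(ct)
```

Calling `ct.plot(kind="bar", stacked=True)` on the reindexed table would then draw the bars from January through December.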