This notebook is designed for tabular classification tasks with pandas and scikit-learn.
It is a simple example of building a classification model on a tabular dataset with these two libraries, covering data preprocessing, model training, and evaluation.
The dataset used in this example is the Bank Marketing dataset from the UCI Machine Learning Repository. It contains information about a bank's marketing campaign and whether or not each customer subscribed to a term deposit.
The goal is to predict whether a customer will subscribe to a term deposit based on their demographic and behavioral features.
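
As a rough orientation, the three stages named above (preprocessing, training, evaluation) can be sketched on a tiny made-up table. The toy values, the variable names, and the choice of LogisticRegression are assumptions for illustration only; they are not taken from the bank dataset or from this notebook's later cells.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the bank data (all values are made up).
toy = pd.DataFrame({
    "age": [25, 40, 58, 33, 47, 29, 51, 36],
    "balance": [100, 2500, 2143, 2, 1506, 300, 4000, 50],
    "job": ["admin", "management", "management", "technician",
            "technician", "admin", "management", "technician"],
    "target": ["no", "yes", "yes", "no", "yes", "no", "yes", "no"],
})
X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

# Preprocessing: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
])

# Training: chain preprocessing and a classifier in a single pipeline.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
model.fit(X_tr, y_tr)

# Evaluation: accuracy on the held-out rows (only 2 rows here, so not meaningful).
accuracy = model.score(X_te, y_te)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

Wrapping the preprocessing and the classifier in one Pipeline means the scaler and encoder are fitted on the training rows only, which avoids leaking information from the test rows.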


In [1]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
#import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns





In [3]:
data=pd.read_csv('/home/tisinr/MEGA/Dev/models/classifier/dataset/bank.csv',header=0, sep=';')
# Display the first few rows of the dataset
print(data.head())

   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  


In [None]:
#Rename some columns
data.rename(columns={'marital':'marital_status','default':'credit_default','housing':'housing_loan','loan':'personal_loan','y':'target'}, inplace=True)
# Display the first few rows of the dataset
print(data.head())

   age           job marital_status  education credit_default  balance  \
0   58    management        married   tertiary             no     2143   
1   44    technician         single  secondary             no       29   
2   33  entrepreneur        married  secondary             no        2   
3   47   blue-collar        married    unknown             no     1506   
4   33       unknown         single    unknown             no        1   

  housing_loan personal_loan  contact  day month  duration  campaign  pdays  \
0          yes            no  unknown    5   may       261         1     -1   
1          yes            no  unknown    5   may       151         1     -1   
2          yes           yes  unknown    5   may        76         1     -1   
3          yes            no  unknown    5   may        92         1     -1   
4           no            no  unknown    5   may       198         1     -1   

   previous poutcome target  
0         0  unknown     no  
1         0  unknown

In [None]:
data.replace('unknown', pd.NA, inplace=True)

In [None]:
data.isnull().sum()
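
The effect of the two cells above can be checked on a small frame. This is a minimal sketch with made-up values; `toy`, `toy_clean`, and `missing_counts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up rows that use the string 'unknown' as a missing-value marker.
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician"],
    "education": ["tertiary", "unknown", "unknown"],
    "age": [58, 33, 44],
})

# Turn the marker into a real missing value so pandas can count it.
toy_clean = toy.replace("unknown", pd.NA)

# Number of missing entries per column.
missing_counts = toy_clean.isnull().sum()
print(missing_counts)
```

Before the replacement, `isnull()` would report zero missing values, because 'unknown' is an ordinary string as far as pandas is concerned.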

Exploratory Data Analysis with Pandas

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# Distribution plot of the target variable (column 'y' was renamed to 'target' above)
plt.figure(figsize=(8, 6))
data['target'].value_counts().plot(kind='bar')
plt.title('Distribution of Target Variable')
plt.xlabel('target')
plt.ylabel('Count')
plt.show()



In [None]:
data.dtypes

In [None]:
data['poutcome'].value_counts()

In [None]:
# Share of each job category, as a percentage of all rows
data['job'].value_counts()/len(data)*100


In [None]:
# Split features and label (the label column was renamed from 'y' to 'target' above)
X=data.drop(columns=['target'])
y=data['target']
print(X.shape)
print(y.shape)

In [None]:
data['target'].value_counts()

In [None]:
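from sklearn.impute import SimpleImputer

# Impute missing categorical values with the most frequent category in each column
categorical_cols = ['job', 'marital_status', 'education', 'credit_default', 'housing_loan', 'personal_loan', 'contact', 'month', 'poutcome']
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', SimpleImputer(strategy='most_frequent'), categorical_cols)
    ])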
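
One possible way to extend this kind of preprocessor is to chain imputation and one-hot encoding for the categorical columns inside a nested Pipeline. The sketch below uses a made-up frame with `np.nan` as the missing marker; the names `cat_steps` and `toy_preprocessor` and the `remainder="passthrough"` choice are assumptions for illustration, not the notebook's own configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one categorical column with a missing entry, one numeric column.
toy = pd.DataFrame({
    "job": pd.Series(["admin", np.nan, "admin", "technician"], dtype=object),
    "age": [30, 41, 52, 38],
})

# Impute the most frequent category first, then one-hot encode the result.
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    transformers=[("cat", cat_steps, ["job"])],
    remainder="passthrough",   # keep 'age' unchanged
    sparse_threshold=0,        # always return a dense array
)

encoded = toy_preprocessor.fit_transform(toy)
print(encoded)  # columns: job=admin, job=technician, age
```

The missing job in the second row is filled with 'admin', the most frequent value, before encoding, so that row ends up with a 1 in the 'admin' column.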

In [None]:
# Convert labels to integers (0, 1, ...)
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(y_encoded)  # Array of 0s and 1s: classes are sorted, so 'no' -> 0 and 'yes' -> 1
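
To see how LabelEncoder assigns integers, here is a minimal sketch on a made-up list of labels. `labels`, `toy_encoder`, `codes`, and `decoded` are illustrative names only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the same 'yes'/'no' style as the target column.
labels = ["no", "yes", "no", "no"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(toy_encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original strings.
decoded = toy_encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is useful later for turning integer predictions back into readable labels.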

In [None]:
# Fill missing values with a placeholder ('missing') before encoding
# (column names follow the rename applied earlier in the notebook)
categorical_cols = ['job', 'marital_status', 'education', 'credit_default', 'housing_loan', 'personal_loan', 'contact', 'month', 'poutcome']
data_filled = data[categorical_cols].fillna('missing')

# Convert to one-hot encoded format; a separate name keeps the label encoder from being overwritten
onehot_encoder = OneHotEncoder()
data_encoded = onehot_encoder.fit_transform(data_filled).toarray()


In [None]:
# Convert the encoded data back to a DataFrame
data_encoded_df = pd.DataFrame(data_encoded, columns=onehot_encoder.get_feature_names_out(data_filled.columns))
# Display the encoded DataFrame
data_encoded_df

In [None]:
data_encoded_df.head()

In [None]:
# Inspect the shape of the one-hot encoded array
data_encoded.shape

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,stratify=y,random_state=78)
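
The `stratify` argument keeps the class proportions the same in both splits. A minimal sketch on made-up labels (80 'no', 20 'yes'), with hypothetical variable names, shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80% 'no', 20% 'yes'.
y_toy = np.array(["no"] * 80 + ["yes"] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=78
)

# Share of the positive class in each split.
train_share = (y_tr == "yes").mean()
test_share = (y_te == "yes").mean()
print(len(y_tr), len(y_te), train_share, test_share)
```

Without `stratify`, a random split of an imbalanced target can leave the test set with noticeably more or fewer positive examples than the training set.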

In [None]:
y_test.shape

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.info()
