# MLOPS Project Group 1
**Team: Kushwanth Boina, Rohit Madke, Rutika Bankar**
## **Predicting Term Deposit Subscription in a Portuguese Bank**
### Direct Marketing Campaign Analysis

**Dataset Overview:**
- **Domain:** Business  
- **Task:** Classification  
- **Dataset Type:** Multivariate  
- **Number of Instances:** 45,211  
- **Number of Features:** 16  
- **Feature Types:** Categorical, Integer  

**Objective:**  
Predict whether a client will subscribe to a term deposit based on a bank's direct marketing campaign (phone calls).  

## Step 1: Data Loading and Initial Inspection

In this section, we load the dataset from a CSV file using pandas and inspect the first few rows. This initial step helps to verify that the data has been imported correctly and provides a quick look at the dataset's structure.

In [20]:
# Import the pandas library for data manipulation and analysis
import pandas as pd

# Read the CSV file 'Bank.csv' using a semicolon (;) as the separator.
# The 'engine="python"' parameter ensures that the Python parsing engine is used.
data = pd.read_csv('Bank.csv', sep=";", engine="python")

# Display the first 5 rows of the dataset to verify that it has loaded correctly.
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [9]:
data.to_csv("banking_data.csv", index=False)
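Beyond eyeballing the first rows, a few structural checks can confirm that the separator was parsed correctly and that no column came in with an unexpected type. The sketch below is a minimal illustration on an in-memory sample rather than the real `Bank.csv`; the `summarize` helper and the reduced column set are assumptions made for the example.

```python
import io

import pandas as pd

# Two illustrative rows in the same semicolon-separated layout as Bank.csv.
# The in-memory buffer and the reduced column set are assumptions so that the
# sketch runs without the real file.
sample_csv = io.StringIO(
    "age;job;marital;balance;y\n"
    "58;management;married;2143;no\n"
    "44;technician;single;29;no\n"
)
sample = pd.read_csv(sample_csv, sep=";")


def summarize(frame: pd.DataFrame) -> dict:
    """Collect basic structural facts about a freshly loaded DataFrame."""
    return {
        "n_rows": frame.shape[0],
        "n_columns": frame.shape[1],
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing_per_column": frame.isna().sum().to_dict(),
    }


summary = summarize(sample)
print(summary)
```

If the separator were wrong, `n_columns` would collapse to 1, which makes this a cheap guard to run right after loading.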

## Step 2: Dataset Schema and Storage

This section validates the dataset against the expected schema using Pandera and then enforces data types using PyArrow when storing the data in Parquet format. The process is split into two parts:

1. **Data Validation:**  
   The dataset is checked against a predefined schema to ensure data quality and consistency.

2. **Type Enforcement and Storage:**  
   The validated data is converted into a PyArrow Table using a specified schema and written to a Parquet file.
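To make the validation step concrete, the sketch below hand-rolls two of the rule types in plain pandas: a positivity check on a numeric column and an allowed-values check on categorical columns. It is not Pandera's API, only an illustration of what such rules test; `ALLOWED_VALUES` and `find_violations` are hypothetical names, and only a subset of the columns is covered.

```python
import pandas as pd

# Hypothetical subset of the rules: allowed values per categorical column.
ALLOWED_VALUES = {
    "marital": {"married", "divorced", "single"},
    "y": {"yes", "no"},
}


def find_violations(frame: pd.DataFrame) -> list:
    """Return a human-readable description of every rule the frame breaks."""
    problems = []

    # Rule 1: age must be present and strictly positive.
    bad_age = frame["age"].isna() | ~(frame["age"] > 0)
    if bad_age.any():
        problems.append(f"age: {int(bad_age.sum())} row(s) missing or not positive")

    # Rule 2: categorical columns may only contain their allowed values.
    for column, allowed in ALLOWED_VALUES.items():
        bad = ~frame[column].isin(allowed)
        if bad.any():
            unexpected = sorted(frame.loc[bad, column].unique())
            problems.append(f"{column}: unexpected values {unexpected}")

    return problems


clean = pd.DataFrame({"age": [58, 44], "marital": ["married", "single"], "y": ["no", "yes"]})
dirty = pd.DataFrame({"age": [58, -3], "marital": ["married", "widowed"], "y": ["no", "maybe"]})

print(find_violations(clean))
print(find_violations(dirty))
```

A schema library expresses the same rules declaratively and reports failures in a structured form, which is why the notebook uses one instead of hand-written checks.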


##### Code Block 1: Data Validation with Pandera

In [13]:
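import pandas as pd
# Only the named classes are needed; importing pandera as "pa" would clash with
# the "pa" alias used for pyarrow in Code Block 2.
from pandera import Column, DataFrameSchema, Check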

# Load your dataset from CSV
df = pd.read_csv('banking_data.csv')

# Define the Pandera schema with expected types and allowed values for categorical features
schema_df = DataFrameSchema({
    "age": Column(int, checks=Check(lambda s: s > 0, element_wise=True), nullable=False),
    "job": Column(str, checks=Check.isin([
        "admin.", "unknown", "unemployed", "management", "housemaid",
        "entrepreneur", "student", "blue-collar", "self-employed",
        "retired", "technician", "services"
    ])),
    "marital": Column(str, checks=Check.isin(["married", "divorced", "single"])),
    "education": Column(str, checks=Check.isin(["unknown", "secondary", "primary", "tertiary"])),
    "default": Column(str, checks=Check.isin(["yes", "no"])),
    "balance": Column(int, nullable=False),
    "housing": Column(str, checks=Check.isin(["yes", "no"])),
    "loan": Column(str, checks=Check.isin(["yes", "no"])),
    "contact": Column(str, checks=Check.isin(["unknown", "telephone", "cellular"])),
    "day": Column(int, nullable=False),
    "month": Column(str, checks=Check.isin([
        "jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec"
    ])),
    "duration": Column(int, nullable=False),
    "campaign": Column(int, nullable=False),
    "pdays": Column(int, nullable=False),
    "previous": Column(int, nullable=False),
    "poutcome": Column(str, checks=Check.isin(["unknown", "other", "failure", "success"])),
    "y": Column(str, checks=Check.isin(["yes", "no"]))
})

# Validate the DataFrame; this will raise an error if any column doesn't match the schema.
validated_df = schema_df.validate(df)

##### Code Block 2: Type Enforcement and Storage with PyArrow

In [14]:
import pyarrow as pa
import pyarrow.parquet as pq

# Define the PyArrow schema that mirrors your dataset structure
arrow_schema = pa.schema([
    ('age', pa.int64()),
    ('job', pa.string()),
    ('marital', pa.string()),
    ('education', pa.string()),
    ('default', pa.string()),
    ('balance', pa.int64()),  # integer, consistent with the Pandera schema in Code Block 1
    ('housing', pa.string()),
    ('loan', pa.string()),
    ('contact', pa.string()),
    ('day', pa.int64()),
    ('month', pa.string()),
    ('duration', pa.int64()),
    ('campaign', pa.int64()),
    ('pdays', pa.int64()),
    ('previous', pa.int64()),
    ('poutcome', pa.string()),
    ('y', pa.string())
])

# Convert the validated DataFrame to a PyArrow Table using the defined schema
table = pa.Table.from_pandas(validated_df, schema=arrow_schema)

# Write the PyArrow Table to a Parquet file, ensuring type enforcement
pq.write_table(table, 'banking_data.parquet')


## Step 3: Profiling the Dataset

In this section, we generate a comprehensive data profile report using the ydata-profiling library. The report provides insights into data distributions, missing values, correlations, and more. We display the report within the notebook and also save it as an HTML file for future reference.


In [22]:
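import pandas as pd
from ydata_profiling import ProfileReport

# Load the Parquet file written in Step 2 (this path assumes the file was moved to Datasets/Processed/)
df = pd.read_parquet('Datasets/Processed/banking_data.parquet')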

# Create a profile report of the DataFrame
profile = ProfileReport(df, title="Banking Data Profiling Report", explorative=True)

# Display the report in a Jupyter Notebook (if using Jupyter)
profile.to_notebook_iframe()



In [16]:
# Save the profile report to an HTML file
profile.to_file("banking_data_profile.html")
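
For a quick look without rendering the full report, a few of the same summaries (descriptive statistics, missing-value counts, target balance, correlations) can be approximated with plain pandas. The sketch below is not the ydata-profiling API; it is a lightweight stand-in on four rows copied from the preview in the first section, with a reduced column set assumed for brevity.

```python
import pandas as pd

# Four rows taken from the data preview; only a few columns are kept.
toy = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "balance": [2143, 29, 2, 1506],
    "job": ["management", "technician", "entrepreneur", "blue-collar"],
    "y": ["no", "no", "no", "no"],
})

# Descriptive statistics for the numeric columns (count, mean, std, quartiles).
numeric_summary = toy.describe()

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Share of each target class; a strongly skewed share signals class imbalance.
target_share = toy["y"].value_counts(normalize=True)

# Pairwise Pearson correlations between the numeric columns.
correlations = toy[["age", "balance"]].corr()

print(numeric_summary)
print(missing_counts)
print(target_share)
print(correlations)
```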



## Step 4: Train-Test-Production Split

In this section, we split the validated dataset into three subsets:
- **Training Set (60%)**: Used for model training.
- **Test Set (20%)**: Used for model evaluation during development.
- **Production Set (20%)**: Reserved for monitoring and potential deployment testing.

We ensure reproducibility by setting a random seed and save each subset as a Parquet file for future use.


In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the validated dataset from the Parquet file
df = pd.read_parquet('Datasets/Processed/banking_data.parquet')

# First split: Separate 60% for training and 40% for a temporary set (temp_df)
train_df, temp_df = train_test_split(df, test_size=0.4, random_state=42)

# Second split: Split the temporary set equally into test and production (each 20% of original data)
test_df, prod_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Save the resulting datasets to Parquet files
train_df.to_parquet('banking_data_train.parquet', index=False)
test_df.to_parquet('banking_data_test.parquet', index=False)
prod_df.to_parquet('banking_data_prod.parquet', index=False)

# Optional: Print shapes to verify the splits
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
print("Production set shape:", prod_df.shape)


Training set shape: (27126, 17)
Test set shape: (9042, 17)
Production set shape: (9043, 17)
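
The two-stage split can be wrapped in a small helper, which also makes the arithmetic easy to check: holding out 40% and then halving the holdout yields 20% each for test and production. The sketch below is a minimal version that splits row indices instead of the real DataFrame; `three_way_split` and its parameters are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(items, holdout_frac=0.4, seed=42):
    """Split items into train / test / production parts.

    The holdout share is divided equally between the test and production
    parts. (Hypothetical helper, not part of the notebook.)
    """
    train, temp = train_test_split(items, test_size=holdout_frac, random_state=seed)
    test, prod = train_test_split(temp, test_size=0.5, random_state=seed)
    return train, test, prod


# Stand-in for the 45,211 rows of the dataset: just the row positions.
row_ids = np.arange(45_211)
train_ids, test_ids, prod_ids = three_way_split(row_ids)

print(len(train_ids), len(test_ids), len(prod_ids))
```

The sizes come out as 27126, 9042 and 9043, matching the shapes printed above. The odd row goes to the production set because scikit-learn rounds the `test_size` share up.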


## Step 5: Data Version Control
##### Navigate to your project folder
cd "MLOPS Project"

##### Initialize a new Git repository
git init

##### Add dataset files and reports
- git add datasets/original/banking_data.csv
- git add datasets/processed/*.parquet
- git add datasets/reports/banking_data_profile.html

##### Optionally, add your notebooks and README
git add notebooks/

##### Commit with a meaningful message
git commit -m "Initial commit: Add original dataset, processed Parquet files, and profiling report"

##### Push changes to Repo
- git remote add origin https://github.com/kushwanthboina/MLOPS
- git push -u origin main



## Step 6: Building an ML Pipeline with Scikit-Learn

In this section, we build an end-to-end machine learning pipeline that:
- Loads the training and test datasets from GitHub.
- Separates features from the target variable.
- Defines separate preprocessing pipelines for numeric and categorical features.
- Combines these preprocessors using a ColumnTransformer.
- Integrates the preprocessor with a classifier (Logistic Regression in this example) into a single scikit-learn Pipeline.
- Trains the model and evaluates its performance on the test set.

In [29]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn import set_config

# Render estimators as diagrams when they are displayed in the notebook
set_config(display='diagram')

def load_datasets(train_url: str, test_url: str):
    """
    Loads the training and test datasets from the provided GitHub raw file URLs.
    
    Parameters:
        train_url (str): URL to the training dataset in Parquet format.
        test_url (str): URL to the test dataset in Parquet format.
        
    Returns:
        tuple: (X_train, y_train, X_test, y_test) where X contains features and y is the target.
    """
    # Load datasets from the given URLs using pandas
    df_train = pd.read_parquet(train_url)
    df_test = pd.read_parquet(test_url)
    
    # Separate features (X) and target variable (y) for both datasets
    X_train = df_train.drop(columns=['y'])
    y_train = df_train['y']
    X_test = df_test.drop(columns=['y'])
    y_test = df_test['y']
    
    return X_train, y_train, X_test, y_test

def build_preprocessor(numeric_features, categorical_features):
    """
    Constructs a data preprocessor for numeric and categorical features.
    
    Parameters:
        numeric_features (list): List of numeric feature names.
        categorical_features (list): List of categorical feature names.
        
    Returns:
        ColumnTransformer: A preprocessor that applies the appropriate transformations to each feature set.
    """
    # Pipeline for numeric features: impute missing values using median and scale the data.
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    # Pipeline for categorical features: impute missing values with a constant and apply one-hot encoding.
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Combine both pipelines into a single ColumnTransformer
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
    
    return preprocessor

def build_pipeline(preprocessor, classifier):
    """
    Builds an ML pipeline by combining a data preprocessor and a classifier.
    
    Parameters:
        preprocessor (ColumnTransformer): The preprocessor for feature transformation.
        classifier (sklearn classifier): The machine learning classifier to be used.
        
    Returns:
        Pipeline: A scikit-learn Pipeline that processes the data and then applies the classifier.
    """
    # Create a pipeline that first transforms the data and then fits a classifier.
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])
    return pipeline

# Define the URLs for the training and test datasets stored on GitHub.
# The URLs below point to the kushwanthboina/MLOPS repository; change them if your copy of the data lives elsewhere.
train_url = 'https://raw.githubusercontent.com/kushwanthboina/MLOPS/main/datasets/processed/banking_data_train.parquet'
test_url  = 'https://raw.githubusercontent.com/kushwanthboina/MLOPS/main/datasets/processed/banking_data_test.parquet'

# Load the datasets from GitHub.
X_train, y_train, X_test, y_test = load_datasets(train_url, test_url)

# Define lists of numeric and categorical features based on the dataset schema.
numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Build the preprocessor that transforms numeric and categorical features.
preprocessor = build_preprocessor(numeric_features, categorical_features)

# Create the ML pipeline by combining the preprocessor with a Logistic Regression classifier.
clf_pipeline = build_pipeline(preprocessor, LogisticRegression(max_iter=1000))
# Fit the pipeline on the training data.
clf_pipeline.fit(X_train, y_train)

# Make predictions on the test dataset.
y_pred = clf_pipeline.predict(X_test)

# Evaluate the model's performance using accuracy.
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)


Test Accuracy: 0.9000221190002212


In [30]:
clf_pipeline
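
The preprocessing choices in this pipeline matter most when incoming data differs from the training data. The toy sketch below rebuilds the same structure on six made-up rows (the `toy_*` names and values are assumptions for illustration) and then transforms a frame that contains a missing age and a job category never seen during fitting.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data: one numeric and one categorical feature.
toy_train = pd.DataFrame({
    "age": [25, 40, 58, 33, 47, 61],
    "job": ["student", "technician", "management", "technician", "student", "management"],
})
toy_target = ["no", "no", "yes", "no", "no", "yes"]

# Same structure as the notebook's preprocessor, reduced to two columns.
toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
toy_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", toy_numeric, ["age"]),
    ("cat", toy_categorical, ["job"]),
])
toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_train, toy_target)

# New rows: "retired" was never seen during fit, and one age is missing.
toy_new = pd.DataFrame({"age": [30.0, None], "job": ["retired", "student"]})

encoded = toy_pipeline.named_steps["preprocessor"].transform(toy_new)
if sparse.issparse(encoded):
    encoded = encoded.toarray()  # densify so the rows are easy to inspect

predictions = toy_pipeline.predict(toy_new)
print(encoded)
print(predictions)
```

The unseen category is encoded as all zeros across the one-hot columns instead of raising an error, and the missing age is filled with the training median before scaling.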

## Step 7: ML Experimentation and Tracking with MLflow

In this section, we run multiple ML experiments using different algorithms and hyperparameters. We:
1. Set up an MLflow experiment to group all runs under a common name.
2. Use K-Fold Cross-Validation to evaluate each model on the training set.
3. Evaluate each model on the test set.
4. Log parameters, metrics, and the trained model pipeline to MLflow for versioning and future reference.
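
Steps 2 and 3 of this list can be tried in isolation before MLflow is involved. The sketch below is a minimal version on synthetic data from `make_classification`, which is an assumption standing in for the banking splits; the `demo_*` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; the notebook uses the banking splits).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

demo_model = LogisticRegression(max_iter=1000)
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each fold trains on 4/5 of the development data and scores on the remaining 1/5.
demo_scores = cross_val_score(demo_model, X_dev, y_dev, cv=demo_kfold, scoring="accuracy")

# cross_val_score fits clones of the estimator, so the model itself is still
# unfitted here and must be trained before it is scored on the holdout set.
demo_model.fit(X_dev, y_dev)
holdout_accuracy = accuracy_score(y_holdout, demo_model.predict(X_holdout))

print(f"CV mean: {demo_scores.mean():.4f}, CV std: {demo_scores.std():.4f}")
print(f"Holdout accuracy: {holdout_accuracy:.4f}")
```

Reporting the fold standard deviation next to the mean gives a rough sense of how much of a difference between two models could be explained by the choice of folds alone.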

In [34]:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Suppress warnings to keep the run output readable
import warnings
warnings.filterwarnings("ignore")

# Define your preprocessor from the previous step (numeric and categorical pipelines combined)
# Assuming 'preprocessor', 'X_train', 'y_train', 'X_test', 'y_test' are already defined as per previous steps

# Define a dictionary of models (experiments)
models = {
    "LogisticRegression_default": LogisticRegression(max_iter=1000),
    "LogisticRegression_C0.5": LogisticRegression(max_iter=1000, C=0.5),  # Experiment with a different regularization strength
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM_linear": SVC(kernel='linear', probability=True, random_state=42)
}

# Set up MLflow experiment (all runs will be grouped under this experiment)
mlflow.set_experiment("Banking_ML_Experimentation")

# Define K-Fold cross-validation parameters
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Loop through each model experiment
for model_name, model in models.items():
    with mlflow.start_run(run_name=model_name):
        # Log model name as a parameter
        mlflow.log_param("model_name", model_name)
        
        # Create a pipeline with preprocessor and the current classifier
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', model)
        ])
        
        # Perform cross-validation on the training set
        cv_scores = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring='accuracy')
        mean_cv = cv_scores.mean()
        std_cv = cv_scores.std()
        
        # Log cross-validation metrics
        mlflow.log_metric("cv_mean_accuracy", mean_cv)
        mlflow.log_metric("cv_std_accuracy", std_cv)
        
        # Train the pipeline on the full training data
        pipeline.fit(X_train, y_train)
        
        # Evaluate on the test set
        y_pred = pipeline.predict(X_test)
        test_accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("test_accuracy", test_accuracy)
        
        # Log hyperparameters of the model if any (example for Logistic Regression)
        if "LogisticRegression" in model_name:
            mlflow.log_param("C", model.C)
        
        # Log the trained model pipeline to MLflow
        mlflow.sklearn.log_model(pipeline, "model")
        
        print(f"{model_name} -> CV Mean: {mean_cv:.4f}, CV Std: {std_cv:.4f}, Test Accuracy: {test_accuracy:.4f}")


2025/03/01 00:03:19 INFO mlflow.tracking.fluent: Experiment with name 'Banking_ML_Experimentation' does not exist. Creating a new experiment.


LogisticRegression_default -> CV Mean: 0.9025, CV Std: 0.0043, Test Accuracy: 0.9000
LogisticRegression_C0.5 -> CV Mean: 0.9024, CV Std: 0.0045, Test Accuracy: 0.9002
DecisionTree -> CV Mean: 0.8772, CV Std: 0.0048, Test Accuracy: 0.8776
RandomForest -> CV Mean: 0.9052, CV Std: 0.0044, Test Accuracy: 0.9051
SVM_linear -> CV Mean: 0.8922, CV Std: 0.0046, Test Accuracy: 0.8950


**Inference:**
- *RandomForest achieves both the highest cross-validation accuracy (0.9052) and the highest test accuracy (0.9051). The near-identical scores indicate that it generalizes well to unseen data, so it is the recommended model to deploy for this classification task.*
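
The size of the margin behind this recommendation can be checked directly from the logged numbers. The sketch below copies the metrics from the run output and flags every model whose cross-validation mean lies within one fold standard deviation of the best one; the one-standard-deviation threshold is a rule of thumb assumed here for illustration, not a formal significance test.

```python
# (cv_mean, cv_std, test_accuracy) copied from the run output in this section.
results = {
    "LogisticRegression_default": (0.9025, 0.0043, 0.9000),
    "LogisticRegression_C0.5": (0.9024, 0.0045, 0.9002),
    "DecisionTree": (0.8772, 0.0048, 0.8776),
    "RandomForest": (0.9052, 0.0044, 0.9051),
    "SVM_linear": (0.8922, 0.0046, 0.8950),
}

# The model with the highest cross-validation mean.
best_name = max(results, key=lambda name: results[name][0])
best_mean, best_std, _ = results[best_name]

# Models whose CV mean is within one standard deviation of the best model.
close_contenders = [
    name
    for name, (cv_mean, _, _) in results.items()
    if name != best_name and best_mean - cv_mean <= best_std
]

print("Best by CV mean:", best_name)
print("Within one CV std of the best:", close_contenders)
```

With these numbers RandomForest ranks first, and both Logistic Regression variants fall within one standard deviation of it, so the simpler linear model remains a reasonable fallback if training cost or interpretability matters.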