# Preface
In today’s data-driven world, predictive analytics has become a cornerstone of decision-making across industries. In the banking sector, understanding customer behavior is crucial for enhancing services, minimizing risks, and identifying potential opportunities. This project focuses on predicting whether a customer will make a transaction in the future, irrespective of the transaction amount.

The dataset provided comprises 202 columns, including anonymized features and a target variable indicating transaction behavior. Through this project, we aim to gain deeper insight into customer behavior while developing a robust predictive model that can be applied to real-world scenarios in the banking domain. This preface sets the stage for the systematic exploration and problem-solving approach that follows in this report.

# Problem Statement
- Prepare a complete data analysis report on the given data.

- Create a predictive model that helps the bank identify which customers will make a transaction in the future.


# Domain Analysis
- Customer transaction prediction estimates whether a customer will make a transaction in the future. Banks use it to identify potential customers.
   - The dataset consists of 202 columns.
   - The 1st column is ID_code, the 2nd is the target column, and the remaining 200 columns are anonymized features named var_0 to var_199.

__1. ID_code:__ Unique identifier for each row or record in the data.

__2. Target:__ 0 means the customer will not make a transaction and 1 means the customer will make a transaction.

# Importing required libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import Parallel, parallel_backend
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,classification_report,confusion_matrix,fbeta_score,roc_curve,auc
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


In [None]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

# Basic Checks

In [None]:
df=pd.read_csv('customer.csv')

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

The dataset consists of 200,000 rows and 202 columns.

In [None]:
df.describe()

- The dataset has no constant columns, as no feature has zero variance.
- Features are on different scales, indicating that normalization or standardization might be required before modeling.
- In some features there is a large gap between the 75th percentile and the maximum value, so outliers need to be handled.

In [None]:
df.info(verbose=True,show_counts=True)

- It has 202 columns
- There is 1 object column, 1 integer column, and 200 float columns.


In [None]:
df[df.duplicated()]

There are no duplicate rows.

In [None]:
df.isnull().sum()

There are no null values.

# Exploratory Data Analysis


- A complete feature-level analysis is not possible because the feature names are anonymized.
- We can still plot the distribution of each feature.

Checking the distribution of the first 100 features

In [None]:
# First 100 feature columns (column 0 is ID_code, column 1 is target)
df_100=df.iloc[:,2:102]

In [None]:
plt.figure(figsize=(20, 30))
# One histogram per feature on a 20 x 5 grid
for count, column in enumerate(df_100.columns, start=1):
    ax = plt.subplot(20, 5, count)
    sns.histplot(df_100[column], ax=ax)
plt.tight_layout()
plt.show()


Checking the distribution of the next 100 features

In [None]:
# Remaining 100 feature columns; starting at 102 so that no column is skipped
df_200=df.iloc[:,102:202]

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(20, 30))
# One histogram with a KDE overlay per feature on a 20 x 5 grid
for count, column in enumerate(df_200.columns, start=1):
    ax=plt.subplot(20, 5, count)
    sns.histplot(df_200[column],kde=True,ax=ax)
    ax.set_title(f'Histogram of {column}')
plt.tight_layout()
plt.show()


Most of the features follow a normal or near-normal distribution, so we do not need to apply a transformation technique.
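The visual impression from the histograms can be backed by a numeric check on skewness. This is a minimal sketch on synthetic data, not on the project dataset: the column names `var_a` and `var_b` and the cutoff of 0.5 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two features (hypothetical names)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "var_a": rng.normal(size=5000),       # roughly symmetric
    "var_b": rng.exponential(size=5000),  # strongly right-skewed
})

# Sample skewness per column; values near 0 suggest a symmetric distribution
skewness = demo.skew()

# Assumed rule of thumb: flag columns whose absolute skewness exceeds 0.5
needs_transform = skewness[skewness.abs() > 0.5].index.tolist()

print(skewness.round(2))
print("Columns that may need a transformation:", needs_transform)
```

Applied to the real features, an empty list would support the decision to skip transformations.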

# Data Preprocessing

## 1. Checking Null Values

In [None]:
df.isnull().sum()

There are no null values in the dataset.

## 2. Handling Outliers

In [None]:
plt.figure(figsize=(25, 60))
# Box plot for each of the 200 feature columns (skipping ID_code and target)
for i, column in enumerate(df.columns[2:], start=1):
    plt.subplot(40, 5, i)  # 40 x 5 grid holds all 200 features
    sns.boxplot(df[column])
    plt.title(column)
plt.tight_layout()
plt.show()


The box plots show that most of the features have outliers, so we need to cap them.
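Box plots conventionally mark points that lie more than 1.5 × IQR beyond the quartiles, so the same rule can be used to count outliers per column instead of reading 200 plots. The sketch below uses a tiny hypothetical column (`var_demo`) rather than the project data.

```python
import pandas as pd

# Hypothetical single-feature frame with one obvious extreme value
demo = pd.DataFrame({"var_demo": [1.0, 2.0, 3.0, 4.0, 5.0,
                                  6.0, 7.0, 8.0, 9.0, 100.0]})

# Quartiles and interquartile range per column
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# A value is flagged if it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outside = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
outlier_counts = outside.sum()

print(outlier_counts)
```

On the real data, the same expressions applied to the feature columns would give one count per feature.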

In [None]:
data2=df.copy()

### Winsorization

In [None]:
df['var_0'].describe()

In [None]:
numerical_columns=df.drop(['target','ID_code'],axis=1).columns

In [None]:
from scipy.stats.mstats import winsorize

In [None]:
for i in numerical_columns:
   data2[i]=winsorize(data2[i],limits=[0.01, 0.01])

In [None]:
data2.describe()

In [None]:
plt.figure(figsize=(25, 60))
# Box plot for each of the 200 winsorized feature columns (skipping ID_code and target)
for i, column in enumerate(data2.columns[2:], start=1):
    plt.subplot(40, 5, i)  # 40 x 5 grid holds all 200 features
    sns.boxplot(data2[column])
    plt.title(column)
plt.tight_layout()
plt.show()

Outliers have been capped using winsorization.
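To make the effect of `limits=[0.01, 0.01]` concrete, here is a small sketch on a synthetic array (not the project data). With 100 values, a 1% limit on each side replaces the single lowest and single highest value with their nearest neighbours in sorted order.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Synthetic values 1..100 with two injected extremes
values = np.arange(1, 101, dtype=float)
values[0] = -500.0
values[-1] = 900.0

# Cap the lowest 1% and the highest 1% of the values
capped = winsorize(values, limits=[0.01, 0.01])

print("before:", values.min(), values.max())
print("after: ", capped.min(), capped.max())
```

The array keeps its length; only the extreme values change, which is why winsorization is described as capping rather than removing outliers.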

In [None]:
for i in numerical_columns:
   df[i]=winsorize(df[i],limits=[0.01, 0.01])

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler 
stan_scaler=StandardScaler()

In [None]:
stan_scaled=stan_scaler.fit_transform(df.iloc[:,2:])

In [None]:
df.iloc[:,2:]=stan_scaled
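A common variant of this step fits the scaler on the training split only and then applies it to the test split, so that test-set statistics do not influence preprocessing. The following is a hedged sketch of that variant on synthetic data; the array shapes and split settings are assumptions for illustration and it is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and binary labels (stand-ins for the real data)
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(1000, 5))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and unit variance, while the test columns are only approximately standardized, which is expected.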

### Handling Imbalance

In [None]:
df['target'].value_counts().plot(kind='pie', autopct='%1.1f%%')

In [None]:
count=df['target'].value_counts()
count

In [None]:
sns.countplot(x='target', data=df, palette='viridis')

Observation
- The target column is heavily imbalanced.
- Target 1 accounts for about 10% of the rows and target 0 for about 90%.
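A short worked example shows why this imbalance matters for evaluation. Assuming a 90/10 split like the one observed, a classifier that always predicts 0 reaches 90% accuracy while finding none of the customers who transact. The labels below are synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same 90% / 10% class ratio
y_true = np.array([0] * 90 + [1] * 10)

# Trivial baseline: always predict "no transaction"
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # recall for class 1

print(f"accuracy = {baseline_accuracy:.2f}, recall = {baseline_recall:.2f}")
```

This is the motivation for balancing the classes and for reporting recall and AUC alongside accuracy.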

In [None]:
data2=df.copy()

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
df.drop(['ID_code'],axis=1,inplace=True)

In [None]:
x_df=df.iloc[:,1:]
y_df=df.iloc[:,0]

In [None]:
smote = SMOTE(random_state=42)
x_df,y_df=smote.fit_resample(x_df,y_df)

In [None]:
x_df

### Feature Selection

In [None]:
corr_matrix_1=x_df.corr()

In [None]:
plt.figure(figsize=(120,120))
sns.heatmap(corr_matrix_1,annot=True,cmap='Blues')

Observation:
- The heatmap is too dense to interpret, so we loop over the correlation matrix to find pairs of features that are highly correlated with each other.

In [None]:
threshold=0.90
high_corr_pairs = []
for i in range(len(corr_matrix_1.columns)):
    for j in range(i):
        if abs(corr_matrix_1.iloc[i,j])>threshold:
            feature_1=corr_matrix_1.columns[i]
            feature_2=corr_matrix_1.columns[j]
            corr_value=corr_matrix_1.iloc[i, j]
            high_corr_pairs.append([feature_1, feature_2, corr_value])
high_corr_df = pd.DataFrame(high_corr_pairs, columns=["Feature 1", "Feature 2", "Correlation"])

high_corr_df

Observation
- There are no highly correlated feature pairs in the dataset (no absolute correlation above 0.90).
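The pairwise search can also be written without explicit loops by masking one triangle of the correlation matrix. This is a sketch on synthetic data; the column names `f1`, `f2`, and `f3` are hypothetical, and `f2` is built as a near copy of `f1` so that one pair exceeds the 0.90 threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.05, size=1000),  # nearly a copy of f1
    "f3": rng.normal(size=1000),                     # independent of f1
})

# Absolute correlations; keep only the strict upper triangle so that
# each pair appears once and the diagonal is ignored
corr = demo.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the threshold, indexed by (feature 1, feature 2)
high_pairs = pairs[pairs > 0.90]
print(high_pairs)
```

On the project data, an empty result from the same expression would match the observation that no feature pair exceeds the threshold.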

In [None]:
x_df.head()

## Principal Component Analysis
- Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and statistics to reduce the number of features in a dataset while preserving as much variability (information) as possible.

In [None]:
from sklearn.decomposition import PCA
pca=PCA()

In [None]:
pca.fit_transform(x_df)

In [None]:
plt.figure(figsize=(5,5))
sns.set_style('darkgrid')
plt.plot(np.cumsum(pca.explained_variance_ratio_),c='black',marker='*')

In [None]:
value=np.cumsum(pca.explained_variance_ratio_)>=0.9
n_components=np.argmax(value)+1
n_components

Observation
- We use PCA with n_components = 178 because it captures 90% of the variance in the dataset.
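scikit-learn can also pick the number of components directly when `n_components` is given as a fraction between 0 and 1. The sketch below compares that shortcut with the cumulative-sum approach on synthetic data; the 500 × 20 array is an assumption and will not reproduce the value 178.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# Manual approach: first index where cumulative variance reaches 90%
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
manual = int(np.argmax(cumulative >= 0.90) + 1)

# Shortcut: a float n_components asks PCA to keep enough components
# to explain that share of the variance
auto = PCA(n_components=0.90, svd_solver="full").fit(X)

print(manual, auto.n_components_)
```

Both approaches should agree except in the edge case where the cumulative variance lands exactly on the threshold.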

In [None]:
pca_178=PCA(n_components=178)

In [None]:
x_df=pca_178.fit_transform(x_df)

In [None]:
x_df=pd.DataFrame(x_df,columns=['pca{}'.format(i)  for i in range(1,179)])

In [None]:
x_df

# Model Creation

In [None]:
x_train_df,x_test_df,y_train_df,y_test_df=train_test_split(x_df,y_df,test_size=0.20,random_state=42)

## Model Evaluation function

### For Displaying the algorithm results

In [None]:
def evaluate_model_performance_display(model, X_train, y_train, X_test, y_test):
  

    # --- Training Data Evaluation ---
    y_train_pred = model.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    
    # ROC Curve for Training Data
    if hasattr(model, "predict_proba"):  # For models that have predict_proba() method
        y_train_probs = model.predict_proba(X_train)[:, 1]
    else:
        y_train_probs = model.decision_function(X_train)  # For models like SVM
    
    fpr_train, tpr_train, thresholds_train = roc_curve(y_train, y_train_probs)
    roc_auc_train = auc(fpr_train, tpr_train)
    
    print(f"Model: {model.__class__.__name__}")
    print("\nTraining Data Evaluation:")
    print(f"Training Accuracy: {train_accuracy:.4f}")
    
    # Plot ROC Curve for Training Data
    plt.figure(figsize=(6, 4))
    plt.plot(fpr_train, tpr_train, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc_train:.2f})')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')
    plt.title('Receiver Operating Characteristic (ROC) Curve (Training Data)')
    plt.legend(loc='lower right')
    plt.grid()
    plt.tight_layout()  # Adjust layout to fit without scrolling
    plt.show()

    # --- Test Data Evaluation ---
    y_test_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_recall = classification_report(y_test, y_test_pred, output_dict=True)['1']['recall']
    test_f2_score = fbeta_score(y_test, y_test_pred, beta=2, average='macro')
    test_report = classification_report(y_test, y_test_pred)
    test_conf_matrix = confusion_matrix(y_test, y_test_pred)
    
    print("\nTest Data Evaluation:")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Recall (Test Data): {test_recall:.4f}")
    print(f"F2 Score: {test_f2_score:.4f}")
    print("Classification Report (Test Data):\n", test_report)
    
    # Plot Test Confusion Matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(test_conf_matrix, annot=True, fmt="d", cmap="Blues")
    plt.title(f"Confusion Matrix - {model.__class__.__name__} (Test Data)")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.tight_layout()  # Adjust layout to fit without scrolling
    plt.show()

    # ROC Curve for Test Data
    if hasattr(model, "predict_proba"):  # For models that have predict_proba() method
        y_test_probs = model.predict_proba(X_test)[:, 1]
    else:
        y_test_probs = model.decision_function(X_test)  # For models like SVM
    
    fpr_test, tpr_test, thresholds_test = roc_curve(y_test, y_test_probs)
    roc_auc_test = auc(fpr_test, tpr_test)
    
    # Plot ROC Curve for Test Data
    plt.figure(figsize=(6, 4))
    plt.plot(fpr_test, tpr_test, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc_test:.2f})')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')
    plt.title('Receiver Operating Characteristic (ROC) Curve (Test Data)')
    plt.legend(loc='lower right')
    plt.grid()
    plt.tight_layout()  # Adjust layout to fit without scrolling
    plt.show()

### For Creating the data frame

In [None]:
def evaluate_model_performance(model, x_train_df, y_train_df, x_test_df, y_test_df, show_plots=True):
    # Predictions for training data
    y_train_pred = model.predict(x_train_df)
    train_accuracy = accuracy_score(y_train_df, y_train_pred)

    # Probabilities or decision function for training data
    if hasattr(model, "predict_proba"):
        y_train_probs = model.predict_proba(x_train_df)[:, 1]
    else:
        y_train_probs = model.decision_function(x_train_df)

    # Compute ROC AUC for training data
    fpr_train, tpr_train, _ = roc_curve(y_train_df, y_train_probs)
    roc_auc_train = auc(fpr_train, tpr_train)

    # Predictions for testing data
    y_test_pred = model.predict(x_test_df)
    test_accuracy = accuracy_score(y_test_df, y_test_pred)

    # Probabilities or decision function for testing data
    if hasattr(model, "predict_proba"):
        y_test_probs = model.predict_proba(x_test_df)[:, 1]
    else:
        y_test_probs = model.decision_function(x_test_df)

    # Compute ROC AUC for testing data
    fpr_test, tpr_test, _ = roc_curve(y_test_df, y_test_probs)
    roc_auc_test = auc(fpr_test, tpr_test)

  
    # Return results as a dictionary
    return {
        "Train Accuracy": train_accuracy,
        "Test Accuracy": test_accuracy,
        "AUC (Train)": roc_auc_train,
        "AUC (Test)": roc_auc_test
    }

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression


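logistic = LogisticRegression()


# Only valid solver/penalty combinations are listed: liblinear does not support
# penalty=None, and elasticnet is omitted because it requires l1_ratio.
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'],
     'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 200, 500]},
    {'solver': ['saga'], 'penalty': ['l1', 'l2', None],
     'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 200, 500]},
]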
grid_log_df = GridSearchCV(estimator=logistic, param_grid=param_grid, scoring='accuracy', verbose=2, n_jobs=-1,cv=3)


In [None]:
with parallel_backend('multiprocessing'):
  grid_log_df.fit(x_train_df, y_train_df)

In [None]:
best_model_df = grid_log_df.best_estimator_
best_model_df

In [None]:
results=pd.DataFrame(grid_log_df.cv_results_)
results[results['rank_test_score']==1]

In [None]:
grid_log_df.best_score_

#### Model Performance

In [None]:
evaluate_model_performance_display(best_model_df,x_train_df,y_train_df,x_test_df, y_test_df)

## Random Forest

In [None]:
from joblib import parallel_backend

In [None]:
from sklearn.ensemble import RandomForestClassifier
Random=RandomForestClassifier()

In [None]:
parameters={
    'n_estimators':[10,50,100],
    'max_depth':[10, 20, 30],
    'min_samples_split':[2, 5, 10], 
}

In [None]:
random_df=GridSearchCV(estimator=Random,param_grid=parameters,cv=3,verbose=2,n_jobs=-1)

In [None]:
with parallel_backend('multiprocessing'):
  random_df.fit(x_train_df, y_train_df)

In [None]:
random_df.best_params_

In [None]:
results=pd.DataFrame(random_df.cv_results_)
results

In [None]:
random_df.best_score_

In [None]:
best_par_forest=random_df.best_estimator_
best_par_forest

#### Model Performance

In [None]:
evaluate_model_performance_display(best_par_forest,x_train_df,y_train_df,x_test_df, y_test_df)

## XGBoost

In [None]:
import xgboost as xgb

# Binary target, so log loss is the matching evaluation metric
xgb_model = xgb.XGBClassifier(eval_metric='logloss')


In [None]:
from xgboost import XGBClassifier


# Binary target, so log loss is the matching evaluation metric
xgb_model = XGBClassifier(eval_metric='logloss')
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 7],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'gamma': [0, 0.1],
    'min_child_weight': [3, 5]
}

# RandomizedSearchCV samples n_iter candidates (default 10) from the grid
grid_search = RandomizedSearchCV(estimator=xgb_model,
                           param_distributions=param_grid,
                           scoring='accuracy',
                           cv=3,
                           verbose=2,
                           n_jobs=-1)

In [None]:
with parallel_backend('multiprocessing'):
  grid_search.fit(x_train_df, y_train_df)

In [None]:
results=pd.DataFrame(grid_search.cv_results_)
results

In [None]:
best_param=grid_search.best_estimator_
best_param

#### Model Performance

In [None]:
evaluate_model_performance_display(best_param,x_train_df,y_train_df,x_test_df, y_test_df)

## LightGBM

In [None]:
import lightgbm as lgb
light_model=lgb.LGBMClassifier(verbose=-1)
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [ 0.1, 0.2],
    'max_depth': [ 5, 7],
    'num_leaves': [31, 50, 100],
    'subsample': [ 0.7, 0.8],
    'colsample_bytree': [0.7, 0.8]
}


In [None]:
lg1=GridSearchCV(estimator=light_model,param_grid=param_grid,n_jobs=-1,scoring='accuracy',cv=3)

In [None]:
with parallel_backend('multiprocessing'):
  lg1.fit(x_train_df, y_train_df)

In [None]:
res=pd.DataFrame(lg1.cv_results_)
res[res['rank_test_score']==1]

In [None]:
best_param_1=lg1.best_estimator_
best_param_1

#### Model Performance

In [None]:
evaluate_model_performance_display(best_param_1,x_train_df,y_train_df,x_test_df, y_test_df)

## Model Comparison

This function builds the model comparison DataFrame.

In [None]:
def create_model_performance(models, x_train_df, y_train_df, x_test_df, y_test_df):
    all_model_results = []


    for model, model_name in models:
       
        model_results = evaluate_model_performance(model, x_train_df, y_train_df, x_test_df, y_test_df, show_plots=False)

        train_accuracy = model_results["Train Accuracy"]
        test_accuracy = model_results["Test Accuracy"]
        train_auc = model_results["AUC (Train)"]
        test_auc = model_results["AUC (Test)"]
       
    
        all_model_results.append({
            "Model Name": model_name,
            "Train Accuracy": train_accuracy,
            "Test Accuracy": test_accuracy,
            "Train AUC": train_auc,
            "Test AUC": test_auc
        })
        

    model_performance_df = pd.DataFrame(all_model_results)
    return model_performance_df


In [None]:

models = [
    (best_model_df, "Logistic Regression"),
    (best_par_forest, "Random Forest"),
    (best_param, "XGBoost"),
    (best_param_1, "LightGBM")
]

model_performance_dataframe = create_model_performance(models, x_train_df, y_train_df, x_test_df, y_test_df)

In [None]:
model_performance_dataframe

# Project Submission: Machine Learning Model Evaluation for Customer Transaction Prediction

**Objective:** The goal of this project is to predict whether a customer will perform a transaction based on their features. 

**Models Evaluated:**
1.   Logistic Regression (LR)
2.   Random Forest Classifier (RFC)
3.   XGBoost Classifier
4.   Light Gradient Boosting Machine (LGBM)


**Analysis:**
*   Random Forest has the highest accuracy, but we did not choose it as the best model because of the large gap between its training and testing accuracy, which suggests overfitting.
*   Among the remaining models, ***Light Gradient Boosting Machine (LGBM) and XGBoost Classifier achieved the highest accuracy***.
*   Both models outperform Logistic Regression in handling the complexity of the dataset.
*   Logistic Regression had the lowest accuracy, indicating that it is less suited to this problem.

**Conclusion:**

Based on the accuracy metric:

*   Light Gradient Boosting Machine (LGBM) and XGBoost Classifier are comparably effective for customer transaction prediction.
*   Both models are recommended for deployment, considering their high accuracy and efficiency.


# Project Challenges:
**Challenges Faced and Techniques Used**


**1. Handling Outliers**

**Challenge:**

- Outliers in numerical features could distort the model and lead to overfitting or inaccurate predictions.

**Solution**:

- Winsorization was used to handle outliers.

- The extreme values were capped at the 1st and 99th percentiles (`limits=[0.01, 0.01]`), ensuring that outliers did not disproportionately affect the model’s performance.

**Reason for Technique**:

- Winsorization preserves the structure of the data while controlling for the impact of extreme values. It is particularly useful when outliers are genuine data points but need to be contained for modeling.

**2. High Dimensionality**

**Challenge:**

- The dataset contained a large number of features, leading to potential overfitting.

**Solution:**

- Principal Component Analysis (PCA) was applied to reduce dimensionality.

- PCA transformed the feature space into a lower-dimensional space, capturing the most variance while reducing redundant information.

**Reason for Technique:**

- PCA helps improve computational efficiency and reduces the risk of overfitting by retaining only the most important components.

**3. Finding Optimal Hyperparameters**

**Challenge:**

- Determining the best hyperparameters for the machine learning model to maximize predictive accuracy.

**Solution:**

- Both Grid Search with Cross-Validation (Grid Search CV) and RandomizedSearchCV were implemented.

- Grid Search CV exhaustively searched over specified parameter values while using cross-validation to evaluate performance.

- RandomizedSearchCV sampled a fixed number of parameter settings from the specified distributions, allowing faster exploration of hyperparameters.

**Reason for Technique:**

- Grid Search CV selects the best combination within the specified grid by systematically exploring every combination and validating each one on held-out folds of the data.
- RandomizedSearchCV complements this by providing a quicker search alternative, especially useful when the parameter space is large.
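
The cost difference between the two search strategies can be estimated before running anything. The sketch below counts the candidates in the XGBoost parameter grid from this report and compares an exhaustive search with the 10 candidates that `RandomizedSearchCV` samples when `n_iter` is left at its default; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# XGBoost search space as defined earlier in this report
xgb_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 7],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
    'gamma': [0, 0.1],
    'min_child_weight': [3, 5],
}

cv_folds = 3
n_grid_candidates = len(ParameterGrid(xgb_grid))  # every combination
n_random_candidates = 10                          # RandomizedSearchCV default n_iter

grid_fits = n_grid_candidates * cv_folds
random_fits = n_random_candidates * cv_folds

print(f"Grid search:       {n_grid_candidates} candidates, {grid_fits} fits")
print(f"Randomized search: {n_random_candidates} candidates, {random_fits} fits")
```

Seven parameters with two values each give 2^7 = 128 combinations, so the randomized search trains a small fraction of the models at the price of possibly missing the best combination.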