
In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [4]:
class BaseModel:
    def __init__(self):
        self.model = None
        self.scaler = StandardScaler()
        self.label_encoders = {}

    def load(self, train_filepath, test_filepath):
        self.train_data = pd.read_excel(train_filepath)
        self.test_data = pd.read_excel(test_filepath)
        print("Training and testing data loaded successfully.")

    def preprocess(self):
        def process_data(data):
            # Feature engineering for transaction_date
            data['transaction_date'] = pd.to_datetime(data['transaction_date'])
            data['transaction_year'] = data['transaction_date'].dt.year
            data['transaction_month'] = data['transaction_date'].dt.month

            # Drop unnecessary columns
            data = data.drop(['customer_id', 'transaction_date'], axis=1)

            # Encode categorical variables
            categorical_cols = ['sub_grade', 'term', 'home_ownership', 'purpose', 'application_type', 'verification_status']
            for col in categorical_cols:
                if col not in self.label_encoders:
                    le = LabelEncoder()
                    data[col] = le.fit_transform(data[col])
                    self.label_encoders[col] = le
                else:
                    data[col] = self.label_encoders[col].transform(data[col])

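            # Scale numerical features
            numerical_cols = ['cibil_score', 'total_no_of_acc', 'annual_inc', 'int_rate',
                              'loan_amnt', 'installment', 'account_bal', 'emp_length', 'transaction_year', 'transaction_month']
            # Fit the scaler on the first call (training data) only and reuse it
            # for the test data, so test statistics do not leak into preprocessing
            if not hasattr(self.scaler, 'mean_'):
                data[numerical_cols] = self.scaler.fit_transform(data[numerical_cols])
            else:
                data[numerical_cols] = self.scaler.transform(data[numerical_cols])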

            return data

        self.train_data = process_data(self.train_data)
        self.test_data = process_data(self.test_data)
        print("Data preprocessing completed.")

    def split_data(self):
        X_train = self.train_data.drop('loan_status', axis=1)
        y_train = self.train_data['loan_status']
        X_test = self.test_data.drop('loan_status', axis=1)
        y_test = self.test_data['loan_status']
        return X_train, X_test, y_train, y_test

    def train(self, X_train, y_train):
        raise NotImplementedError("Train method must be implemented by subclasses.")

    def test(self, X_test, y_test):
        y_pred = self.model.predict(X_test)
        report = classification_report(y_test, y_pred)
        cm = confusion_matrix(y_test, y_pred)
        print("Classification Report:\n", report)
        print("Confusion Matrix:\n", cm)

    def predict(self, X):
        return self.model.predict(X)

In [5]:
class LogisticRegressionModel(BaseModel):
    def __init__(self):
        super().__init__()
        self.model = LogisticRegression(max_iter=1000, class_weight='balanced')

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)
        print("Logistic Regression model trained successfully.")

class XGBoostModel(BaseModel):
    def __init__(self):
        super().__init__()
        self.model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', scale_pos_weight=1)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)
        print("XGBoost model trained successfully.")

In [6]:
# Example pipeline usage
if __name__ == "__main__":
    train_filepath = "/content/drive/MyDrive/DSW Assessment/train_data.xlsx"  # Replace with the actual training data file
    test_filepath = "/content/drive/MyDrive/DSW Assessment/test_data.xlsx"    # Replace with the actual testing data file

    # Logistic Regression pipeline
    lr_model = LogisticRegressionModel()
    lr_model.load(train_filepath, test_filepath)
    lr_model.preprocess()
    X_train, X_test, y_train, y_test = lr_model.split_data()
    lr_model.train(X_train, y_train)
    lr_model.test(X_test, y_test)

    # XGBoost pipeline
    xgb_model = XGBoostModel()
    xgb_model.load(train_filepath, test_filepath)
    xgb_model.preprocess()
    X_train, X_test, y_train, y_test = xgb_model.split_data()
    xgb_model.train(X_train, y_train)
    xgb_model.test(X_test, y_test)

Training and testing data loaded successfully.
Data preprocessing completed.
Logistic Regression model trained successfully.
Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.53      0.53      3055
           1       0.73      0.74      0.74      5400

    accuracy                           0.66      8455
   macro avg       0.63      0.63      0.63      8455
weighted avg       0.66      0.66      0.66      8455

Confusion Matrix:
 [[1610 1445]
 [1398 4002]]
Training and testing data loaded successfully.
Data preprocessing completed.


Parameters: { "use_label_encoder" } are not used.



XGBoost model trained successfully.
Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.17      0.27      3055
           1       0.67      0.95      0.79      5400

    accuracy                           0.67      8455
   macro avg       0.67      0.56      0.53      8455
weighted avg       0.67      0.67      0.60      8455

Confusion Matrix:
 [[ 507 2548]
 [ 244 5156]]


On the test-set metrics above, the **XGBoost model** appears to be the better choice for this problem, for the reasons below.

### Context of the Use Case
The goal is to predict loan repayment behavior, in particular to identify potential defaulters. This analysis treats defaulters as the positive class (`1`), and identifying them accurately is critical to minimizing financial risk.

### Evaluation Comparison
1. **Precision**:
   - Logistic Regression (Class `1`): **0.73**
   - XGBoost (Class `1`): **0.67**

   Logistic Regression has higher precision: a smaller share of its class `1` predictions are wrong, so it is less likely to classify a non-defaulter as a defaulter.

2. **Recall (Sensitivity)**:
   - Logistic Regression (Class `1`): **0.74**
   - XGBoost (Class `1`): **0.95**

   XGBoost recovers 95% of the actual class `1` cases, against 74% for Logistic Regression. Recall is the crucial metric for this use case: the higher it is, the fewer actual defaulters go undetected, which reduces the risk of approving loans to defaulters.

3. **F1-Score**:
   - Logistic Regression (Class `1`): **0.74**
   - XGBoost (Class `1`): **0.79**

   The F1-Score is the harmonic mean of precision and recall. XGBoost's higher value reflects its large recall advantage outweighing its lower precision on class `1`.

4. **Overall Accuracy**:
   - Logistic Regression: **0.66**
   - XGBoost: **0.67**

   The two models are effectively tied on accuracy. Because class `1` makes up 5,400 of the 8,455 test rows (about 64%), accuracy is less informative here than the per-class metrics.

5. **Confusion Matrix**:
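   - XGBoost identifies more defaulters (`5156/5400`) than Logistic Regression (`4002/5400`), which aligns with the goal of minimizing undetected defaulters. The cost is more false positives: XGBoost labels `2548/3055` class `0` cases as class `1`, against `1445/3055` for Logistic Regression (class `0` recall of 0.17 vs. 0.53).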
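
The figures in this comparison can be re-derived from the two confusion matrices printed in the test output. The snippet below is a minimal sketch of that arithmetic, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes); the helper name `class_metrics` and the `results` dictionary are illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrices copied from the test output (rows = actual, columns = predicted).
confusion = {
    "Logistic Regression": np.array([[1610, 1445], [1398, 4002]]),
    "XGBoost": np.array([[507, 2548], [244, 5156]]),
}

def class_metrics(cm, label):
    """Return (precision, recall, f1) for one class label (hypothetical helper)."""
    tp = cm[label, label]
    precision = tp / cm[:, label].sum()  # correct / everything predicted as this label
    recall = tp / cm[label, :].sum()     # correct / everything that actually is this label
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {}
for name, cm in confusion.items():
    accuracy = np.trace(cm) / cm.sum()   # diagonal = correctly classified rows
    results[name] = {"accuracy": accuracy}
    for label in (0, 1):
        results[name][label] = class_metrics(cm, label)
        p, r, f1 = results[name][label]
        print(f"{name} class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
    print(f"{name} accuracy={accuracy:.2f}")
```

Reading both classes side by side makes the trade-off explicit: the recall XGBoost gains on class `1` comes with a matching loss of recall on class `0`.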

### Final Decision
While Logistic Regression has better precision on class `1` (0.73 vs. 0.67), XGBoost's much higher recall (0.95 vs. 0.74) and F1-Score (0.79 vs. 0.74) make it the preferred choice for this use case, where missing a defaulter is treated as a more critical error than misclassifying a non-defaulter.
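
One way to make the "missing defaulters is more critical" argument concrete is to weight the two error types and compare total cost. The sketch below uses the error counts from the confusion matrices in the test output; the cost ratios and the names `total_cost` and `break_even` are assumptions introduced for illustration, not values from the problem statement.

```python
# Class-1 error counts read from the confusion matrices in the test output.
errors = {
    "Logistic Regression": {"missed": 1398, "false_alarm": 1445},
    "XGBoost": {"missed": 244, "false_alarm": 2548},
}

def total_cost(counts, cost_ratio):
    """Total cost in units of one false alarm.

    cost_ratio is a hypothetical cost of one missed class-1 case
    relative to one class-0 case wrongly flagged as class 1.
    """
    return cost_ratio * counts["missed"] + counts["false_alarm"]

lr, xgb = errors["Logistic Regression"], errors["XGBoost"]

# Cost ratio at which both models cost the same; above it XGBoost is cheaper.
break_even = (xgb["false_alarm"] - lr["false_alarm"]) / (lr["missed"] - xgb["missed"])
print(f"break-even cost ratio: {break_even:.2f}")

for ratio in (0.5, 1, 5):  # illustrative ratios, not taken from the source
    costs = {name: total_cost(counts, ratio) for name, counts in errors.items()}
    print(f"ratio={ratio}: {costs} -> cheaper: {min(costs, key=costs.get)}")
```

Under these assumptions XGBoost comes out cheaper whenever a missed class `1` case costs roughly as much as a false alarm or more, which is consistent with the decision above; if false alarms were clearly the more expensive error, Logistic Regression would be the cheaper model.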
