# DS5110 - Introduction to Data Management and Processing

- Instructor: Prof. Mohammed Toutiaee
- Semester: Fall 2023
- Class Timings: Thursday 14:00 - 17:30 Pacific Time
- Class Location: Northeastern University Silicon Valley, Room 303

This notebook contains the code for the top 5 models built by students for HW4. The assignment follows the format of a Kaggle competition, and the leaderboard has been announced in class. Please read on for the code and the results.

## Problem Statement
Financial institutions that lend to consumers rely on models to help decide on who to approve or decline for credit (for lending products such as credit cards, automobile loans, or home loans). In this project, your task is to develop models that review credit card applications to determine which ones should be approved. You are given historical data on response (binary default indicator) and 20 predictor variables from credit card accounts for a hypothetical bank XYZ, a regional bank in the Bay area. There are three datasets available: a [training](https://raw.githubusercontent.com/mh2t/DS5110/main/Homework/HW4-Train.csv) dataset with 20,000 accounts; a [validation](https://raw.githubusercontent.com/mh2t/DS5110/main/Homework/HW4-Validation.csv) dataset with 3,000 accounts, and a **hidden** test dataset with 5,000 accounts. Information about the variables is given in the [Appendix](https://github.com/mh2t/DS5110/blob/main/Homework/HW4-appx.pdf).

You are asked to do the following and also address specific questions below:

* **(10 points)** Do any necessary data pre-processing in preparation for modeling.
* **(20 points)** Develop and fit a logistic regression (LR) model, assess its performance, and interpret the results.
* **(20 points)** Develop an additional model based on a machine learning (ML) algorithm selected from one of the following: Random Forest, Gradient Boosting (XGBoost or another implementation), or Feedforward Neural Network; assess its performance, and make sure to explain why you chose this particular algorithm.
* **(10 points)** Compare the results from the ML algorithm with those from logistic regression model and discuss their advantages and disadvantages; select one of these models for credit approval; and describe the reasons for your selection.
* **(5 points)** Describe what performance metrics you chose to evaluate your proposed models and why.
* **(10 points)** Describe how you would use it to make decisions on future credit card applications.
* **(5 points)** Do customers who already have an account with the financial institution receive any favorable treatment in your model? Support your answer with appropriate analysis.
* **(20 points)** 2-page report.
* You can use any libraries for this homework.

## Deliverables

Please submit the following:

1. A report (doc file) that describes all important steps in your data analysis, model development, and comparison of the models, and that answers the specific questions and justifies your final model selection. The body of the report should be no more than 2 pages in length (font size 11 and spacing 1.2).
2. The code you used for the analysis should have brief but adequate annotations so that we can run it. The **IPYNB** format is mandatory. Clearly indicate the software packages and versions (if appropriate) that you used for the analysis.
3. You are allowed to review textbooks, published papers, websites, and other open literature in preparing for this homework. Note, however, that the material you submit in your report must be based on your own analysis and writing. If you relied on published scholarly work and open-source software for your analysis and findings (beyond what is generally known), you should provide references at the end of the report.

## Top Model Bonus

If the evaluation metric of your chosen model achieves the **highest** rank among all submissions, you will be awarded an additional **10 bonus points**. This bonus will be applied directly to your Homework 4 score. Note that your best model will be assessed on a hidden test set, ensuring a fair and unbiased evaluation.

## Metric

After the students submitted their models, the models were run on a hidden dataset and scored using the following metric:

$$ F_2 = \frac{5 \times Precision \times Recall}{4 \times Precision + Recall} $$
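
As a worked illustration of the metric, the sketch below computes $F_\beta$ from confusion-matrix counts and evaluates it at $\beta = 2$. The counts are invented and the function name `f_beta` is hypothetical; the point is only that swapping precision and recall changes the score, because F2 weights recall more heavily.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; beta > 1 weights recall above precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts: two classifiers with swapped precision and recall.
high_recall = f_beta(tp=80, fp=120, fn=20)     # precision 0.40, recall 0.80
high_precision = f_beta(tp=40, fp=10, fn=60)   # precision 0.80, recall 0.40

print(round(high_recall, 4), round(high_precision, 4))  # 0.6667 0.4444
```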

The cells below show the code for the top 5 models, ordered by rank.

In [None]:
# Global and common imports
from collections import Counter
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from scipy import stats  # import norm, zscore, chi2_contingency
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    make_scorer,
    roc_curve,
    fbeta_score,
    auc,
)
from sklearn.model_selection import (
    train_test_split,
    RepeatedStratifiedKFold,
    cross_val_score,
    cross_validate,
    RandomizedSearchCV,
)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb  # Model 4 refers to the package as `xgb`
from xgboost import DMatrix, XGBClassifier, plot_importance, train


warnings.filterwarnings("ignore", category=UserWarning)

## Model 1

- F2-Score: 0.5693
- Model Name: XGBoost
- Model Type: Ensemble
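
The cell below sets `scale_pos_weight=5` to up-weight the minority (default) class. The XGBoost documentation suggests the ratio of negative to positive examples as a typical starting value. A minimal sketch of that heuristic follows, using made-up labels rather than the HW4 data; the function name is hypothetical, and whether 5 matches the class ratio in the HW4 training set is not stated in this notebook.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples, a common starting value."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]


# Made-up labels: 90 non-defaults and 10 defaults.
labels = [0] * 90 + [1] * 10
weight = suggested_scale_pos_weight(labels)
print(weight)  # 9.0
```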

In [None]:
# Load the training data (the shared imports are in the first cell).

train_data = pd.read_csv("location of the train data")

columns_to_drop = ["Default_ind", "States"]


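def normal_data():
    """Return standardized training/validation features and their labels."""
    v_data = pd.read_csv(
        ""
    )
    v_data.dropna(inplace=True)
    # Engineered interaction feature
    v_data["combined"] = v_data["avg_card_debt"] * v_data["uti_card"]

    t_data = train_data.copy()
    # The scaler needs identical columns in both sets, so add the feature to the
    # training data too if the training file does not already contain it (assumption).
    if "combined" not in t_data.columns:
        t_data["combined"] = t_data["avg_card_debt"] * t_data["uti_card"]

    # Drop the target and States; States was found insignificant by a chi-square test
    X_train = t_data.drop(labels=columns_to_drop, axis=1)
    y_train = t_data.Default_ind
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)

    X_test = v_data.drop(labels=columns_to_drop, axis=1)
    # Align the column order with training before applying the fitted scaler
    X_test = X_test[X_train.columns]
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    y_test = v_data.Default_ind

    return X_train, X_test, y_train, y_test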


X_train, X_test, y_train, y_test = normal_data()

# Train an XGBoost classifier
xgb_model = XGBClassifier(
    objective="binary:logistic",
    random_state=111,
    scale_pos_weight=5,
    max_depth=2,
    eval_metric="logloss",
    enable_categorical=False,  # expects a bool; the scaled features are all numeric
)

xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)
y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
f2_score = fbeta_score(y_test, y_pred, beta=2)

print(f'F2 Score: {f2_score}')


# Display the results
# ROC AUC computed from the predicted probabilities
print("ROC AUC:")
print(roc_auc_score(y_test, y_pred_proba))
print("Classification Report:")
print(classification_rep)
cm_display = ConfusionMatrixDisplay(
    confusion_matrix=conf_matrix, display_labels=[False, True]
)

fig, ax = plt.subplots(figsize=(6, 4))
cm_display.plot(ax=ax)
plot_importance(xgb_model)
plt.show()

## Model 2

- F2-Score: 0.5671
- Model Name: GradientBoosting
- Model Type: Ensemble
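
The cell below label-encodes the categorical columns separately for the training and validation files, numbering values by order of first appearance. The codes therefore line up across the two files only if the categories happen to appear in the same order. One way to guarantee consistent codes, sketched here with hypothetical helper names and toy values, is to build the mapping once from the training data and reuse it:

```python
def fit_label_codes(values):
    """Map each distinct value to an integer, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return codes


def apply_label_codes(values, codes, unknown=-1):
    """Encode values with an existing mapping; unseen values get `unknown`."""
    return [codes.get(value, unknown) for value in values]


# Toy data: the validation file lists states in a different order
# and contains one state that never occurs in training.
train_states = ["GA", "FL", "GA", "NC"]
valid_states = ["NC", "GA", "TX"]

codes = fit_label_codes(train_states)
train_encoded = apply_label_codes(train_states, codes)
valid_encoded = apply_label_codes(valid_states, codes)

print(train_encoded)  # [0, 1, 0, 2]
print(valid_encoded)  # [2, 0, -1]
```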

In [None]:
df = pd.read_csv(
    ""
)

# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

numeric_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="most_frequent")

# Impute missing values in numeric columns with the mean
df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])

# Impute missing values in categorical columns with the most frequent value
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])

# Drop rows with any missing values
df_no_missing = df.dropna()

# Drop columns with any missing values
df_no_missing_cols = df.dropna(axis=1)

# Assuming df is your DataFrame with multiple categorical columns

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), numeric_cols),
        ("cat", OneHotEncoder(), categorical_cols),
    ]
)

# Apply the transformation
scaled_features = preprocessor.fit_transform(df)

# Create a new DataFrame with scaled features
columns = list(numeric_cols) + list(
    preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_cols)
)

df_scaled = pd.DataFrame(scaled_features, columns=columns)


data = df
a = data.dtypes
a = pd.DataFrame(a)
a = a.reset_index()
a = a[a[0] == "object"]
a


def classification(data, name):
    """Label-encode column `name` by order of first appearance and move it to the front."""
    # pd.factorize assigns 0, 1, 2, ... to values in the order they first appear
    codes, _ = pd.factorize(data[name])
    data = data.drop(columns=[name])
    data.insert(loc=0, column=name, value=codes)
    return data


s = list(a["index"])
for i in s:
    data = classification(data, i)
data



data1 = pd.get_dummies(data, dtype=int)
X_train = data1.drop(["Default_ind"], axis=1)
y_train = data1["Default_ind"]

# Undersample process
undersample = RandomUnderSampler(sampling_strategy=1 / 1)
X_train, y_train = undersample.fit_resample(X_train, y_train)
print("After undersampling: ", Counter(y_train))

df = pd.read_csv()
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Impute missing values in numeric columns with the mean
numeric_imputer = SimpleImputer(strategy="mean")
df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])

# Impute missing values in categorical columns with the most frequent value
categorical_imputer = SimpleImputer(strategy="most_frequent")
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])
# Drop rows with any missing values
df_no_missing = df.dropna()

# Drop columns with any missing values
df_no_missing_cols = df.dropna(axis=1)
# Assuming df is your DataFrame with multiple categorical columns
# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), numeric_cols),
        ("cat", OneHotEncoder(), categorical_cols),
    ]
)

# Apply the transformation
scaled_features = preprocessor.fit_transform(df)

# Create a new DataFrame with scaled features
columns = list(numeric_cols) + list(
    preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_cols)
)
df_scaled = pd.DataFrame(scaled_features, columns=columns)

print("\nScaled DataFrame:")
print(df_scaled)


data = df
a = data.dtypes
a = pd.DataFrame(a)
a = a.reset_index()
a = a[a[0] == "object"]
a


# `classification` is defined earlier in this cell and is reused for the validation data.


s = list(a["index"])
for i in s:
    data = classification(data, i)
data


data2 = pd.get_dummies(data, dtype=int)
data2

# Create a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)

# Train the model on the undersampled training data
gb_classifier.fit(X_train, y_train)

# Predict on the validation data
x_val = data2.drop(["Default_ind"], axis=1)
y_val = data2["Default_ind"]

resu = gb_classifier.predict(x_val)
# Use predicted probabilities for the ROC curve; hard labels give a single operating point
proba = gb_classifier.predict_proba(x_val)[:, 1]

# Calculate ROC and AUC
fpr, tpr, threshold = roc_curve(y_val, proba)
rocauc = auc(fpr, tpr)
print("AUC:", rocauc)
print("F2 Score:", fbeta_score(y_val, resu, beta=2))

## Model 3

- F2-Score: 0.5611
- Model Name: XGBoost
- Model Type: Ensemble
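
The cell below standardizes some columns and min-max scales others. A common convention is to estimate the scaling parameters on the training data only and reuse them for the validation data, so that both sets share one scale. A plain-Python sketch of that idea with toy numbers follows; the helper names are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Estimate mean and population standard deviation from training values."""
    mean = fmean(column)
    std = pstdev(column)
    return mean, (std if std > 0 else 1.0)  # guard against constant columns


def standardize(column, params):
    """Apply previously fitted parameters to any column."""
    mean, std = params
    return [(x - mean) / std for x in column]


# Toy values: parameters come from the training column only.
train_income = [30.0, 50.0, 70.0]
valid_income = [50.0, 90.0]

params = fit_standardizer(train_income)
train_scaled = standardize(train_income, params)
valid_scaled = standardize(valid_income, params)

print([round(v, 4) for v in valid_scaled])  # [0.0, 2.4495]
```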

In [None]:
# Model choice: XGBoost
# Hyperparameters are tuned with RandomizedSearchCV


training = pd.read_csv()
validation = pd.read_csv()

# Drop rows with missing values
training.dropna(inplace=True)
training

# Replace state abbreviations with numeric values
training["States"] = training["States"].replace(
    {
        "NC": 53740,
        "AL": 47466,
        "FL": 59521,
        "GA": 53194,
        "LA": 51571,
        "MS": 43654,
        "SC": 50341,
    }
)

# def remove_outliers(df):
#     Q1 = df.quantile(0.25)
#     Q3 = df.quantile(0.75)
#     IQR = Q3 - Q1
#     df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
#     return df_out
# training = remove_outliers(training)
# training
validation.dropna(inplace=True)

validation["States"] = validation["States"].replace(
    {
        "NC": 53740,
        "AL": 47466,
        "FL": 59521,
        "GA": 53194,
        "LA": 51571,
        "MS": 43654,
        "SC": 50341,
    }
)


scaler = StandardScaler()

# Standardize the columns that look approximately Gaussian (selected by position)
columns_to_standardize = [0, 2, 3, 4, 13, 14, 15, 16, 18]
training_standardized = scaler.fit_transform(training.iloc[:, columns_to_standardize])

# Write the standardized values back into the DataFrame
training.iloc[:, columns_to_standardize] = training_standardized

training

# Min-max normalize the remaining columns (selected by position)
min_max_scaler = MinMaxScaler()
columns_to_normalize = [1, 5, 6, 7, 8, 9, 10, 11, 12, 17, 19, 20]
training_normalized = min_max_scaler.fit_transform(
    training.iloc[:, columns_to_normalize]
)

# Write the normalized values back into the DataFrame
training.iloc[:, columns_to_normalize] = training_normalized

training

# Apply the scalers fitted on the training data to the validation data.
# Refitting them on the validation set would put the two sets on different scales.
validation_standardized = scaler.transform(validation.iloc[:, columns_to_standardize])
validation.iloc[:, columns_to_standardize] = validation_standardized

validation

validation_normalized = min_max_scaler.transform(
    validation.iloc[:, columns_to_normalize]
)
validation.iloc[:, columns_to_normalize] = validation_normalized

validation

X_train = training.drop("Default_ind", axis=1)
y_train = training["Default_ind"]

X_test = validation.drop("Default_ind", axis=1)
y_test = validation["Default_ind"]

# set up XGBoost
xgb_model = XGBClassifier(
    objective="binary:logistic", use_label_encoder=False, eval_metric="logloss"
)

# set up the parameter
param_dist = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "n_estimators": [100, 200, 300, 400, 500],
    "learning_rate": [0.01, 0.02, 0.05, 0.1, 0.2, 0.3],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "reg_lambda": [0, 0.5, 1, 1.5, 2, 3],
    "reg_alpha": [0, 0.1, 0.5, 1],
    "min_child_weight": [1, 2, 3, 4, 5, 6],
    "scale_pos_weight": [1, 2, 3, 4],
}

# set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=50,
    scoring="roc_auc",
    n_jobs=-1,
    cv=5,  # 5-fold cross-validation; more folds give a steadier estimate but cost more
    verbose=3,
    random_state=42,
)

# Run the randomized search
random_search.fit(X_train, y_train)

# Report the best hyperparameters and the cross-validated AUC
print(f"Best parameters found: {random_search.best_params_}")
print(f"Best AUC found: {random_search.best_score_}")

# Use the best model to predict on the validation set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# calculate AUC-ROC
auc_roc = roc_auc_score(y_test, y_pred_proba)
print(f"AUC-ROC: {auc_roc}")

# evaluate performance of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best model: {accuracy * 100.0}%")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# F2 score of the best model on the validation set
f2_score = fbeta_score(y_test, y_pred, beta=2)

print(f"F2 Score: {f2_score}")

feature_importance = pd.DataFrame(
    best_model.feature_importances_, index=X_train.columns, columns=["Importance"]
)
print(feature_importance)

## Model 4

- F2-Score: 0.5610
- Model Name: XGBoost
- Model Type: Ensemble
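
The `Model` class in the cell below converts predicted probabilities to labels with a fixed 0.5 cutoff. Because F2 weights recall more heavily, a different cutoff may score better. The following sketch scans candidate thresholds on toy validation data and keeps the one with the highest F2. The function names, data, and candidate grid are illustrative assumptions, not part of the student's submission.

```python
def f2_at_threshold(y_true, y_prob, threshold):
    """F2 score when every probability >= threshold is labeled positive."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    denom = 5 * tp + 4 * fn + fp  # F2 = 5TP / (5TP + 4FN + FP)
    return 5 * tp / denom if denom else 0.0


def best_f2_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F2 (first one on ties)."""
    return max(candidates, key=lambda t: f2_at_threshold(y_true, y_prob, t))


# Toy validation labels and predicted default probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.90]
candidates = [k / 10 for k in range(1, 10)]

best = best_f2_threshold(y_true, y_prob, candidates)
print(best, f2_at_threshold(y_true, y_prob, best))  # 0.4 0.9375
```

The threshold should be chosen on validation data and then held fixed, since tuning it on the test set would overstate performance.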

In [None]:
def simple_preprocessing(data, y="Default_ind"):
    if "States" in data.columns:
        data = pd.get_dummies(data, columns=["States"], drop_first=True)

    data = data.dropna(axis=0)

    features = data.drop(y, axis=1)
    target = data[y]

    return features, target


X_train, y_train = simple_preprocessing(pd.read_csv())
X_valid, y_valid = simple_preprocessing(pd.read_csv())

bmodel = LogisticRegression()
bmodel.fit(X_train, y_train)

y_pred = bmodel.predict(X_valid)

print("Baseline Model : LogisticRegression")
print("Accuracy:", accuracy_score(y_valid, y_pred))
print("Classification Report:\n", classification_report(y_valid, y_pred))
print("Confusion Matrix:")
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_valid, y_pred), display_labels=[0, 1]
)
disp.plot(cmap="Blues", values_format="d")

# Test baseline model here
x = ""
test = pd.read_csv(x)
X_test, y_test = simple_preprocessing(test)
y_pred = bmodel.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Score AUROC on continuous scores rather than hard class labels
if hasattr(bmodel, "decision_function"):
    y_score = bmodel.decision_function(X_test)
else:
    y_score = bmodel.predict_proba(X_test)[:, 1]

print("AUROC:", roc_auc_score(y_test, y_score))

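# Raw training data, loaded here because the test below needs the States column
train = pd.read_csv(
    ""
)

# Create a contingency table
contingency_table = pd.crosstab(train["States"], train["Default_ind"])

# Chi-square test of independence between two categorical variables
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)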

# Display the results
print(f"Chi-square statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies table:")
print(expected)


def correlation_test(df, thresh=0.8):
    """Return one column from each pair whose absolute correlation exceeds `thresh`."""
    columns = df.columns
    a = []
    for col1 in columns:
        for col2 in columns:
            if col1 < col2:  # visit each unordered pair once
                cor = df[col1].corr(df[col2])
                if abs(cor) > thresh and col1 not in a:
                    a.append(col1)
    return a


class Model:
    def __init__(
        self,
        model_class,
        trainset,
        testset,
        y="Default_ind",
        smote=False,
        thresh=0.6,
        scale=False,
    ):
        self.scaler = StandardScaler()
        self.model = model_class
        self.y = y
        self.scale = scale

        X_train, y_train = self.simple_preprocessing(trainset)
        self.droppable_columns = correlation_test(X_train, thresh)
        if smote:
            X_train, y_train = self.resample(X_train, y_train)

        self.train(X_train, y_train)
        self.test(testset)

    def resample(self, X, Y):
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample(X, Y)
        return X_resampled, y_resampled

    def simple_preprocessing(self, df):
        if self.scale:
            df.dropna(axis=0, inplace=True)

        if "States" in df.columns:
            df.drop("States", axis=1, inplace=True)

        X_train = df.drop(self.y, axis=1)
        y_train = df[self.y]
        return X_train, y_train

    def preprocess(self, X):
        X.drop(labels=self.droppable_columns, axis=1, inplace=True)
        return X

    def train(self, X_train, y_train):
        X_train = self.preprocess(X_train)
        if self.scale:
            X_train = self.scaler.fit_transform(X_train)
        if issubclass(self.model.__class__, xgb.XGBClassifier):
            dtrain = xgb.DMatrix(X_train, label=y_train)
            self.model = xgb.train(self.model.get_params(), dtrain, num_boost_round=132)
        else:
            self.model.fit(X_train, y_train.values.ravel())

    def test(self, testset):
        X_test, y_test = self.simple_preprocessing(testset)
        X_test = self.preprocess(X_test)
        if self.scale:
            X_test = self.scaler.transform(X_test)
        if issubclass(self.model.__class__, xgb.Booster):
            dtest = xgb.DMatrix(X_test)
            y_pred = self.model.predict(dtest)
        else:
            y_pred = self.model.predict(X_test)

        binary_predictions = [1 if p > 0.5 else 0 for p in y_pred]

        self.accuracy = accuracy_score(y_test, binary_predictions)
        self.report = classification_report(y_test, binary_predictions)
        self.c_matrix = confusion_matrix(y_test, binary_predictions)

        if issubclass(self.model.__class__, xgb.Booster):
            # For XGBoost, use the predicted probabilities directly
            self.fpr, self.tpr, _ = roc_curve(y_test, y_pred)
            self.auroc = roc_auc_score(y_test, y_pred)

            self.f2_score = fbeta_score(y_test, binary_predictions, beta=2)
        else:
            # For other models, use the decision function or predicted probabilities
            if hasattr(self.model, "decision_function"):
                y_score = self.model.decision_function(X_test)
            else:
                y_score = self.model.predict_proba(X_test)[:, 1]
            self.fpr, self.tpr, _ = roc_curve(y_test, y_score)
            self.auroc = roc_auc_score(y_test, y_score)

    def get_model(self):
        return self.model

    def all_stats(self):
        print(f"{self.model.__class__.__name__} Model")
        print("Accuracy:", self.accuracy)
        print("Classification Report:\n", self.report)
        print("AUROC:", self.auroc)
        print("Confusion Matrix:")

        # Create subplots
        fig, axs = plt.subplots(1, 2, figsize=(12, 5))

        # Plot AUROC
        axs[0].plot(
            self.fpr,
            self.tpr,
            color="darkorange",
            lw=2,
            label="ROC curve (area = {:.2f})".format(self.auroc),
        )
        axs[0].plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
        axs[0].set_xlabel("False Positive Rate")
        axs[0].set_ylabel("True Positive Rate")
        axs[0].set_title("Receiver Operating Characteristic (ROC) Curve")
        axs[0].legend(loc="lower right")

        # Plot Confusion Matrix
        if issubclass(self.model.__class__, xgb.Booster):
            classes = [0, 1]
            disp = ConfusionMatrixDisplay(
                confusion_matrix=self.c_matrix, display_labels=classes
            )
        else:
            disp = ConfusionMatrixDisplay(
                confusion_matrix=self.c_matrix, display_labels=self.model.classes_
            )
        disp.plot(ax=axs[1], cmap="Blues", values_format="d")
        axs[1].set_title("Confusion Matrix")

        plt.show()

    def res(self):
        if issubclass(self.model.__class__, xgb.Booster):
            print("F2 Score:", self.f2_score)

        print("Accuracy:", self.accuracy)
        print("AuROC", self.auroc)


train = pd.read_csv(
    ""
)
test = pd.read_csv(
    ""
)

xgb_model = Model(
    xgb.XGBClassifier(
        max_depth=3, eta=0.2, objective="binary:logistic", scale_pos_weight=5
    ),
    trainset=train,
    testset=test,
    y="Default_ind",
    thresh=0.9,
)
xgb_model.all_stats()

## Model 5

- F2-Score: 0.5579
- Model Name: XGBoost
- Model Type: Ensemble
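
The cell below fills missing values in `uti_card_50plus_pct` and `rep_income` with the training-set mean and applies the same fill value to the validation data. A plain-Python sketch of that pattern follows, with hypothetical helper names and toy incomes:

```python
import math


def fit_mean(column):
    """Mean of the observed (non-NaN) values in a training column."""
    observed = [x for x in column if not math.isnan(x)]
    return sum(observed) / len(observed)


def impute(column, fill_value):
    """Replace NaN entries with a previously fitted fill value."""
    return [fill_value if math.isnan(x) else x for x in column]


# Toy incomes with missing entries in both sets.
train_income = [40000.0, float("nan"), 60000.0]
valid_income = [float("nan"), 72000.0]

fill = fit_mean(train_income)              # learned from training data only
train_filled = impute(train_income, fill)
valid_filled = impute(valid_income, fill)  # validation reuses the training mean

print(fill, valid_filled)  # 50000.0 [50000.0, 72000.0]
```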

In [None]:
training_df = pd.read_csv()
validation_df = pd.read_csv()

imputer_uti = SimpleImputer(strategy="mean")
imputer_income = SimpleImputer(strategy="mean")

# Imputation for 'uti_card_50plus_pct'
training_df["uti_card_50plus_pct"] = imputer_uti.fit_transform(
    training_df[["uti_card_50plus_pct"]]
)
validation_df["uti_card_50plus_pct"] = imputer_uti.transform(
    validation_df[["uti_card_50plus_pct"]]
)

# Imputation for 'rep_income'
training_df["rep_income"] = imputer_income.fit_transform(training_df[["rep_income"]])
validation_df["rep_income"] = imputer_income.transform(validation_df[["rep_income"]])

X_train_original = training_df.drop(["Default_ind", "States"], axis=1)
y_train_original = training_df["Default_ind"]
X_validation_original = validation_df.drop(["Default_ind", "States"], axis=1)
y_validation_original = validation_df["Default_ind"]


scaler = StandardScaler()

# standardizing the data
X_train_original_scaled = scaler.fit_transform(X_train_original)
X_validation_original_scaled = scaler.transform(X_validation_original)

xgb_model = XGBClassifier(
    objective="binary:logistic",
    random_state=123,
    scale_pos_weight=5,
    max_depth=2,
    eval_metric="logloss",
    enable_categorical=False,  # expects a bool; the scaled features are all numeric
)
xgb_model.fit(X_train_original_scaled, y_train_original)

# Predicting on the validation set
y_pred_xgb = xgb_model.predict(X_validation_original_scaled)

# Assessing the model's performance
classification_rep_xgb = classification_report(y_validation_original, y_pred_xgb)
confusion_mat = confusion_matrix(y_validation_original, y_pred_xgb)
print("Classification Report for XG-Boost Model:\n")
print(classification_rep_xgb)

y_pred_proba_xgb = xgb_model.predict_proba(X_validation_original_scaled)[:, 1]


fpr_xg, tpr_xg, thresholds_xg = roc_curve(y_validation_original, y_pred_proba_xgb)
auc_xg = auc(fpr_xg, tpr_xg)

print("AUC: ", auc_xg)

# F2 score on the validation set
f2_score = fbeta_score(y_validation_original, y_pred_xgb, beta=2)

print(f"F2 Score: {f2_score}")

cm_display = ConfusionMatrixDisplay(
    confusion_matrix=confusion_mat, display_labels=[False, True]
)
fig, ax = plt.subplots(figsize=(6, 4))
cm_display.plot(ax=ax)
plt.show()