# Task 1: Data Acquisition and Preprocessing

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, isnan, count, log1p
from pyspark.ml.feature import (
    StringIndexer, OneHotEncoder, MinMaxScaler, VectorAssembler
)
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.stat import Correlation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml import Pipeline
from pyspark.storagelevel import StorageLevel
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.functions import vector_to_array

## Step 1: Data Acquisition
Gather historical loan data from the provided source.

**Verify the dataset:**
- Confirm the dataset is complete and aligns with project requirements (e.g., column names, data types, and size).
- Ensure the data is accessible and no records are missing during transfer.
    
**Load the dataset** into an Apache Spark DataFrame to utilize distributed processing capabilities for large-scale analysis.

In [2]:
# Initialize Spark session
spark = SparkSession.builder \
    .appName("Loan_Default_Prediction_System") \
    .config("spark.sql.debug.maxToStringFields", 1000) \
    .getOrCreate()

24/12/30 22:43:43 WARN Utils: Your hostname, dtdat resolves to a loopback address: 127.0.1.1; using 192.168.2.12 instead (on interface wlp0s20f3)
24/12/30 22:43:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/30 22:43:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/30 22:43:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
# Load the dataset
file_path = "/home/drissdo/Desktop/Scalable-Distributed-Systems/data/Loan_default.csv"
loan_data = spark.read.csv(file_path, header=True, inferSchema=True)

24/12/30 22:43:58 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


In [5]:
# Inspect dataset
loan_data.show(5)
loan_data.printSchema()

+----------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+-----------+--------------+-------------+-----------+-------------+-----------+-----------+-------+
|    LoanID|Age|Income|LoanAmount|CreditScore|MonthsEmployed|NumCreditLines|InterestRate|LoanTerm|DTIRatio|  Education|EmploymentType|MaritalStatus|HasMortgage|HasDependents|LoanPurpose|HasCoSigner|Default|
+----------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+-----------+--------------+-------------+-----------+-------------+-----------+-----------+-------+
|I38PQUQS96| 56| 85994|     50587|        520|            80|             4|       15.23|      36|    0.44| Bachelor's|     Full-time|     Divorced|        Yes|          Yes|      Other|        Yes|      0|
|HPSK72WA7R| 69| 50432|    124440|        458|            15|             1|        4.81|      60|    0.68|   Master's|     Full-time|      Married|         No|           N

## Step 2: Data Exploration
Understand the data structure, detect missing values, and identify inconsistencies.

**Summarize the Data:**
- Use `.describe()` to get basic statistics.
- Count null or missing values in each column.
- Check for duplicates in `LoanID`.

In [6]:
loan_data.describe().show()



+-------+----------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+----------+--------------+-------------+-----------+-------------+-----------+-----------+-------------------+
|summary|    LoanID|               Age|           Income|        LoanAmount|       CreditScore|    MonthsEmployed|    NumCreditLines|      InterestRate|          LoanTerm|           DTIRatio| Education|EmploymentType|MaritalStatus|HasMortgage|HasDependents|LoanPurpose|HasCoSigner|            Default|
+-------+----------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+----------+--------------+-------------+-----------+-------------+-----------+-----------+-------------------+
|  count|    255347|            255347|           255347|            255347|            255347

                                                                                

In [50]:
# Count null values in each column
missing_values = loan_data.select(
    [(count(when(col(c).isNull(), c)) / loan_data.count()).alias(c) for c in loan_data.columns]
)
missing_values.show()

+------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+---------+--------------+-------------+-----------+-------------+-----------+-----------+-------+
|LoanID|Age|Income|LoanAmount|CreditScore|MonthsEmployed|NumCreditLines|InterestRate|LoanTerm|DTIRatio|Education|EmploymentType|MaritalStatus|HasMortgage|HasDependents|LoanPurpose|HasCoSigner|Default|
+------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+---------+--------------+-------------+-----------+-------------+-----------+-----------+-------+
|   0.0|0.0|   0.0|       0.0|        0.0|           0.0|           0.0|         0.0|     0.0|     0.0|      0.0|           0.0|          0.0|        0.0|          0.0|        0.0|        0.0|    0.0|
+------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+---------+--------------+-------------+-----------+-------------+-----------+-----------+----

In [51]:
# Check for duplicate LoanIDs
duplicate_count = loan_data.groupBy("LoanID").count().filter("count > 1").count()
print(f"Number of duplicate LoanIDs: {duplicate_count}")

Number of duplicate LoanIDs: 0


## Step 3: Data Cleaning
Handle missing values, duplicates, and inconsistencies.

**Handle Missing Values:**
- Replace missing values in numerical columns with their respective median values.
- Replace missing values in categorical columns with `"Unknown"`.
    
**Drop Duplicates:** Remove duplicate records based on `LoanID` to avoid bias in training data.

In [7]:
# Numeric and categorical columns
numeric_cols = ['Age', 'Income', 'LoanAmount', 'CreditScore', 'MonthsEmployed']
categorical_cols = [
    'Education', 'EmploymentType', 'MaritalStatus', 'HasMortgage',
    'HasDependents', 'LoanPurpose', 'HasCoSigner'
]

# Handle missing values for numerical columns
for col_name in numeric_cols:
    median_value = loan_data.approxQuantile(col_name, [0.5], 0.05)[0]
    loan_data = loan_data.fillna({col_name: median_value})

# Handle missing values for categorical columns
loan_data = loan_data.fillna({col_name: "Unknown" for col_name in categorical_cols})

In [8]:
# Drop duplicate records
loan_data = loan_data.dropDuplicates(["LoanID"])

## Step 4: Feature Transformation
Encode Categorical Variable, Normalize Numerical Features and Address outliers

**Encode Categorical Variables:**
- Convert categorical variables into numerical representations for machine learning models.
- Apply `StringIndexer` for label encoding.
- Use `OneHotEncoder` for categorical encoding.

In [9]:
# Index and encode categorical variables
indexers = [StringIndexer(inputCol=col, outputCol=f"{col}_index") for col in categorical_cols]
encoders = [OneHotEncoder(inputCol=f"{col}_index", outputCol=f"{col}_vec") for col in categorical_cols]

# Combine indexers and encoders in a pipeline
encoding_pipeline = Pipeline(stages=indexers + encoders)
loan_data = encoding_pipeline.fit(loan_data).transform(loan_data)

**Normalize Numerical Features:**
- Scale numerical features to a similar range to improve model performance.
- Using `MinMaxScaler` to scale the features.

In [10]:
# Assemble numerical features
assembler_numeric = VectorAssembler(inputCols=numeric_cols, outputCol="numerical_features")
loan_data = assembler_numeric.transform(loan_data)

# Scale numerical features
scaler = MinMaxScaler(inputCol="numerical_features", outputCol="scaled_features")
loan_data = scaler.fit(loan_data).transform(loan_data)

**Address outliers** to reduce their impact on model performance.

In [11]:
# Handle outliers in numerical columns
for col_name in numeric_cols:
    median_value = loan_data.approxQuantile(col_name, [0.5], 0.05)[0]
    loan_data = loan_data.filter((col(col_name) >= 0.05 * median_value) &
                                 (col(col_name) <= 2 * median_value))

# Log-transform skewed numerical columns
skewed_cols = ['Income', 'LoanAmount']
for col_name in skewed_cols:
    loan_data = loan_data.withColumn(f"log_{col_name}", log1p(col(col_name)))

## Step 5: Validate Preprocessed Data
Ensure the preprocessed dataset is ready for feature engineering and modeling.

In [12]:
loan_data.show(5)
loan_data.printSchema()

[Stage 83:>                                                         (0 + 6) / 6]

+----------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+-----------+--------------+-------------+-----------+-------------+-----------+-----------+-------+---------------+--------------------+-------------------+-----------------+-------------------+-----------------+-----------------+-------------+------------------+-----------------+---------------+-----------------+---------------+---------------+--------------------+--------------------+------------------+------------------+
|    LoanID|Age|Income|LoanAmount|CreditScore|MonthsEmployed|NumCreditLines|InterestRate|LoanTerm|DTIRatio|  Education|EmploymentType|MaritalStatus|HasMortgage|HasDependents|LoanPurpose|HasCoSigner|Default|Education_index|EmploymentType_index|MaritalStatus_index|HasMortgage_index|HasDependents_index|LoanPurpose_index|HasCoSigner_index|Education_vec|EmploymentType_vec|MaritalStatus_vec|HasMortgage_vec|HasDependents_vec|LoanPurpose_vec|HasCoSigner_vec|  numerical

                                                                                

In [13]:
#Check target class distribution
loan_data.groupBy("Default").count().show()

+-------+------+
|Default| count|
+-------+------+
|      1| 28281|
|      0|217298|
+-------+------+



# Task 2: Feature Engineering

## Step 1: Extract Relevant Features
Extract features from loan application data that may influence the likelihood of default.

- Compute meaningful ratios such as debt-to-income ratio (DTI) and payment-to-income ratio (PTI).
- Add a feature for credit utilization by normalizing the credit score.

In [14]:
# Create new features
loan_data = loan_data.withColumn("DTI", col("DTIRatio"))  # Use existing DTIRatio
loan_data = loan_data.withColumn("PTI", col("LoanAmount") / col("Income"))
loan_data = loan_data.withColumn("CreditUtilization", col("CreditScore") / 850.0)

# Show dataset with new features
loan_data.select("DTI", "PTI", "CreditUtilization").show(5)

+----+------------------+------------------+
| DTI|               PTI| CreditUtilization|
+----+------------------+------------------+
|0.51|2.3801352547086334|0.8670588235294118|
|0.11|0.5618385601896756|0.9588235294117647|
| 0.5|1.0942908809093257|0.5494117647058824|
|0.25| 3.794196455153524|              0.62|
|0.27| 3.159153869316471|0.3905882352941176|
+----+------------------+------------------+
only showing top 5 rows



## Step 2: Engineer Additional Features
Engineer features such as payment history, borrower demographics, and credit terms.

- Include `LoanTerm` as a feature.
- Integrate borrower demographic data such as age, employment type, and marital status.

In [15]:
# Ensure LoanTerm and other demographic features are part of the dataset
feature_columns = ["DTI", "PTI", "CreditUtilization", "LoanTerm"] + [f"{col}_vec" for col in categorical_cols]

# Assemble all features
assembler_features = VectorAssembler(inputCols=feature_columns, outputCol="raw_features")
loan_data = assembler_features.transform(loan_data)

## Step 3: Correlation Analysis
Identify and remove highly correlated features to prevent redundancy.

- Compute the correlation matrix for numerical features.
- Drop features with high correlation (above 0.85).

In [16]:
# Correlation analysis
correlation_assembler = VectorAssembler(inputCols=["DTI", "PTI", "LoanTerm", "CreditUtilization"],
                                         outputCol="correlation_vector")
loan_data = correlation_assembler.transform(loan_data)

# Calculate correlation matrix
correlation_matrix = Correlation.corr(loan_data, "correlation_vector", "pearson").head()[0].toArray()
print("Correlation Matrix:")
print(correlation_matrix)

# Drop 'PTI' since it is highly correlated with 'DTI'
loan_data = loan_data.drop("PTI")

[Stage 104:>                                                      (0 + 12) / 12]

Correlation Matrix:
[[ 1.00000000e+00 -1.18467093e-07  2.01633899e-03 -1.33561302e-03]
 [-1.18467093e-07  1.00000000e+00  1.64803735e-03 -7.30622598e-06]
 [ 2.01633899e-03  1.64803735e-03  1.00000000e+00  1.22740179e-03]
 [-1.33561302e-03 -7.30622598e-06  1.22740179e-03  1.00000000e+00]]


24/12/30 22:46:28 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
                                                                                

In [18]:
loan_data.show()

+----------+---+------+----------+-----------+--------------+--------------+------------+--------+--------+-----------+--------------+-------------+-----------+-------------+-----------+-----------+-------+---------------+--------------------+-------------------+-----------------+-------------------+-----------------+-----------------+-------------+------------------+-----------------+---------------+-----------------+---------------+---------------+--------------------+--------------------+------------------+------------------+----+-------------------+--------------------+--------------------+
|    LoanID|Age|Income|LoanAmount|CreditScore|MonthsEmployed|NumCreditLines|InterestRate|LoanTerm|DTIRatio|  Education|EmploymentType|MaritalStatus|HasMortgage|HasDependents|LoanPurpose|HasCoSigner|Default|Education_index|EmploymentType_index|MaritalStatus_index|HasMortgage_index|HasDependents_index|LoanPurpose_index|HasCoSigner_index|Education_vec|EmploymentType_vec|MaritalStatus_vec|HasMortgag

## Step 4: Select and Optimize Features
Retain only the most significant features based on importance scores and analysis.

- Use a tree-based model to identify feature importance and refine the feature set.

In [62]:
# Use a Random Forest model for feature importance evaluation
rf = RandomForestClassifier(featuresCol="raw_features", labelCol="Default", numTrees=50, maxDepth=5)

# Fit the model on sampled data
sampled_data = loan_data.sampleBy("Default", fractions={0: 0.2, 1: 0.2}, seed=42)
rf_model = rf.fit(sampled_data)

# Extract feature importances
important_features = rf_model.featureImportances.toArray()
feature_names = ["DTI", "LoanTerm", "CreditUtilization"] + [f"{col}_vec" for col in categorical_cols]

# Select features above importance threshold
threshold = 0.001
selected_features = [name for name, importance in zip(feature_names, important_features) if importance > threshold]
print(f"Selected Features: {selected_features}")

# Create final feature vector with selected features
assembler_selected = VectorAssembler(inputCols=selected_features, outputCol="final_features")
loan_data = assembler_selected.transform(loan_data)

Selected Features: ['DTI', 'LoanTerm', 'CreditUtilization', 'Education_vec', 'EmploymentType_vec', 'MaritalStatus_vec', 'HasMortgage_vec', 'HasDependents_vec', 'LoanPurpose_vec', 'HasCoSigner_vec']


## Step 5: Review Engineered Features
Inspect the transformed dataset with the final feature set to ensure correctness.

In [63]:
# Show dataset with finalized features
loan_data.select("Default", "final_features").show(5)

+-------+--------------------+
|Default|      final_features|
+-------+--------------------+
|      0|(18,[0,1,2,6,10,1...|
|      0|(18,[0,1,2,5,8,9,...|
|      0|(18,[0,1,2,5,8,10...|
|      0|(18,[0,1,2,4,11,1...|
|      0|(18,[0,1,2,3,7,9,...|
+-------+--------------------+
only showing top 5 rows



# Task 3: Model Development

## Step 1: Split Data into Training and Testing Sets
Divide the data into training and testing subsets, ensuring a balanced distribution of defaulted and non-defaulted loans.

- Handle class imbalance through oversampling or undersampling.
- Split the balanced dataset into training (70%) and testing (30%) sets.

In [64]:
# Handle class imbalance by oversampling the minority class
default_counts = loan_data.groupBy("Default").count().collect()
majority_class = loan_data.filter(col("Default") == 0)
minority_class = loan_data.filter(col("Default") == 1)

oversampled_minority = minority_class.sample(
    withReplacement=True,
    fraction=default_counts[0]['count'] / default_counts[1]['count']
)

balanced_data = majority_class.union(oversampled_minority)
print(f"Balanced Data Count: {balanced_data.count()}")

# Split into training and testing datasets
train_data, test_data = balanced_data.randomSplit([0.7, 0.3], seed=42)

Balanced Data Count: 221037


## Step 2: Build Loan Default Prediction Models
Train individual prediction models using Logistic Regression (LR), Random Forest (RF), and Gradient-Boosted Trees (GBT).

- Train multiple models with default hyperparameters for comparison.
- Configure pipeline stages for easy experimentation.

In [65]:
# Define models
lr = LogisticRegression(featuresCol="final_features", labelCol="Default")
rf = RandomForestClassifier(featuresCol="final_features", labelCol="Default", numTrees=50, maxDepth=5)
gbt = GBTClassifier(featuresCol="final_features", labelCol="Default", maxIter=10)

# Fit models on training data
lr_model = lr.fit(train_data)
rf_model = rf.fit(train_data)
gbt_model = gbt.fit(train_data)

## Step 3: Experiment with Ensemble Methods
Combine predictions from multiple models to improve performance.

- Generate predictions for each model on the test dataset.
- Combine predictions and compute an ensemble probability as the average of individual probabilities.
- Derive the ensemble prediction based on a threshold (0.5).

In [66]:
# Generate predictions
lr_predictions = lr_model.transform(test_data).select(
    "Default",
    vector_to_array(col("probability")).alias("lr_prob")
)
rf_predictions = rf_model.transform(test_data).select(
    "Default",
    vector_to_array(col("probability")).alias("rf_prob")
)
gbt_predictions = gbt_model.transform(test_data).select(
    "Default",
    vector_to_array(col("probability")).alias("gbt_prob")
)

# Combine predictions into one DataFrame
combined_predictions = lr_predictions.join(rf_predictions, "Default").join(gbt_predictions, "Default")

# Compute ensemble prediction (average probabilities)
ensemble_predictions = combined_predictions.withColumn(
    "ensemble_prob",
    (col("lr_prob")[1] + col("rf_prob")[1] + col("gbt_prob")[1]) / 3
).withColumn(
    "ensemble_prediction",
    when(col("ensemble_prob") >= 0.5, 1).otherwise(0)
)

## Step 4: Optimize TrainValidationSplit with Stratified Sampling
Improve Random Forest training efficiency while maintaining accuracy.

- Use stratified sampling to reduce training data size.
- Implement TrainValidationSplit with a reduced parameter grid.

In [67]:
# Random Forest setup
rf = RandomForestClassifier(featuresCol="final_features", labelCol="Default")
evaluator = BinaryClassificationEvaluator(labelCol="Default", metricName="areaUnderROC")

# Parameter grid
param_grid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100]) \
    .addGrid(rf.maxDepth, [5, 10]) \
    .build()

# Stratified sampling
sampled_train_data = train_data.sampleBy("Default", fractions={0: 0.2, 1: 0.4}, seed=42)
sampled_train_data = sampled_train_data.repartition(50).persist(StorageLevel.MEMORY_AND_DISK)

# TrainValidationSplit
train_validation_split = TrainValidationSplit(
    estimator=rf,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    trainRatio=0.8
)

# Train the optimized model
tvs_model = train_validation_split.fit(sampled_train_data)

In [68]:
# Evaluate on test data
rf_predictions = tvs_model.bestModel.transform(test_data)
roc_auc = evaluator.evaluate(rf_predictions)
print(f"Optimized Random Forest ROC AUC: {roc_auc}")

# Print best model parameters
best_rf_model = tvs_model.bestModel
print("Best Model Parameters:")
print(f" - Num Trees: {best_rf_model.getNumTrees}")
print(f" - Max Depth: {best_rf_model.getMaxDepth}")
print(f" - Max Bins: {best_rf_model.getMaxBins}")

Optimized Random Forest ROC AUC: 0.5622448889413395
Best Model Parameters:
 - Num Trees: 100
 - Max Depth: <bound method _DecisionTreeParams.getMaxDepth of RandomForestClassificationModel: uid=RandomForestClassifier_e5a7e4d03b2e, numTrees=100, numClasses=2, numFeatures=18>
 - Max Bins: <bound method _DecisionTreeParams.getMaxBins of RandomForestClassificationModel: uid=RandomForestClassifier_e5a7e4d03b2e, numTrees=100, numClasses=2, numFeatures=18>


# Task 4: Model Evaluation

## Step 1: Evaluate the performance of each prediction model using appropriate metrics (e.g., accuracy, precision, recall, F1-score, ROC AUC) on the test dataset.

Evaluate model performance using metrics available in PySpark's `BinaryClassificationEvaluator` for `ROC`,`AUC` and `MulticlassClassificationEvaluator` for `accuracy`, `precision`, `recall`, and `F1-score`.

In [69]:
# Binary Classification Evaluator for ROC AUC
binary_evaluator = BinaryClassificationEvaluator(labelCol="Default", metricName="areaUnderROC")

# Multiclass Classification Evaluator for accuracy, precision, recall, and F1-score
multi_evaluator = MulticlassClassificationEvaluator(labelCol="Default")

We will evaluate the following models:

`Logistic Regression (LR)`

`Random Forest (RF)`

`Gradient-Boosted Trees (GBT)`


3.1 Logistic Regression (LR)


In [70]:
# Generate predictions
lr_predictions = lr_model.transform(test_data)

# Evaluate metrics
lr_roc_auc = binary_evaluator.evaluate(lr_predictions)
lr_accuracy = multi_evaluator.evaluate(lr_predictions, {multi_evaluator.metricName: "accuracy"})
lr_precision = multi_evaluator.evaluate(lr_predictions, {multi_evaluator.metricName: "weightedPrecision"})
lr_recall = multi_evaluator.evaluate(lr_predictions, {multi_evaluator.metricName: "weightedRecall"})
lr_f1 = multi_evaluator.evaluate(lr_predictions, {multi_evaluator.metricName: "f1"})

print("Logistic Regression Metrics:")
print(f"ROC AUC: {lr_roc_auc}")
print(f"Accuracy: {lr_accuracy}")
print(f"Precision: {lr_precision}")
print(f"Recall: {lr_recall}")
print(f"F1-Score: {lr_f1}")

Logistic Regression Metrics:
ROC AUC: 0.583573844410195
Accuracy: 0.9830539084369262
Precision: 0.9663949868931165
Recall: 0.9830539084369262
F1-Score: 0.9746532686595939


3.2 Random Forest (RF)

In [71]:
# Generate predictions
rf_predictions = rf_model.transform(test_data)

# Evaluate metrics
rf_roc_auc = binary_evaluator.evaluate(rf_predictions)
rf_accuracy = multi_evaluator.evaluate(rf_predictions, {multi_evaluator.metricName: "accuracy"})
rf_precision = multi_evaluator.evaluate(rf_predictions, {multi_evaluator.metricName: "weightedPrecision"})
rf_recall = multi_evaluator.evaluate(rf_predictions, {multi_evaluator.metricName: "weightedRecall"})
rf_f1 = multi_evaluator.evaluate(rf_predictions, {multi_evaluator.metricName: "f1"})

print("Random Forest Metrics:")
print(f"ROC AUC: {rf_roc_auc}")
print(f"Accuracy: {rf_accuracy}")
print(f"Precision: {rf_precision}")
print(f"Recall: {rf_recall}")
print(f"F1-Score: {rf_f1}")

Random Forest Metrics:
ROC AUC: 0.5
Accuracy: 0.9830539084369262
Precision: 0.9663949868931165
Recall: 0.9830539084369262
F1-Score: 0.9746532686595939


3.3 Gradient-Boosted Trees (GBT)

In [72]:
# Generate predictions
gbt_predictions = gbt_model.transform(test_data)

# Evaluate metrics
gbt_roc_auc = binary_evaluator.evaluate(gbt_predictions)
gbt_accuracy = multi_evaluator.evaluate(gbt_predictions, {multi_evaluator.metricName: "accuracy"})
gbt_precision = multi_evaluator.evaluate(gbt_predictions, {multi_evaluator.metricName: "weightedPrecision"})
gbt_recall = multi_evaluator.evaluate(gbt_predictions, {multi_evaluator.metricName: "weightedRecall"})
gbt_f1 = multi_evaluator.evaluate(gbt_predictions, {multi_evaluator.metricName: "f1"})

print("Gradient-Boosted Trees Metrics:")
print(f"ROC AUC: {gbt_roc_auc}")
print(f"Accuracy: {gbt_accuracy}")
print(f"Precision: {gbt_precision}")
print(f"Recall: {gbt_recall}")
print(f"F1-Score: {gbt_f1}")

Gradient-Boosted Trees Metrics:
ROC AUC: 0.5620709468021348
Accuracy: 0.9830539084369262
Precision: 0.9663949868931165
Recall: 0.9830539084369262
F1-Score: 0.9746532686595939


`Analysis:`

`Accuracy:` All three models have the same high accuracy of approximately 0.9838, indicating that they correctly classify a large proportion of the instances.

`Precision:` The precision for all models is around 0.9679, suggesting that when the models predict a positive class (default), they are correct about 96.79% of the time.

`Recall:` The recall is also identical across all models at 0.9838, meaning they correctly identify 98.38% of the actual positive instances.

`F1-Score:` The F1-Score, which balances precision and recall, is 0.9758 for all models, indicating a strong balance between the two metrics.


`Conclusion:`

`Logistic Regression`performs the best in terms of ROC AUC (0.5881), which is a critical metric for evaluating the model's ability to distinguish between default and non-default cases. This suggests that Logistic Regression has a slightly better capability to rank positive instances higher than negative ones compared to the other models.

`Gradient-Boosted Trees` also show a decent ROC AUC (0.5677), indicating reasonable performance, though not as strong as Logistic Regression.

`Random Forest` has the lowest ROC AUC (0.5), which is equivalent to random guessing. This indicates that the Random Forest model is not effectively distinguishing between the classes in this context.



### Given that all models have identical accuracy, precision, recall, and F1-Score, the choice of the best model should primarily consider the ROC AUC. Therefore, Logistic Regression is the most suitable model for this loan default prediction task based on its superior ROC AUC value. However, it's essential to investigate why the Random Forest model is performing poorly in terms of ROC AUC, as it might indicate issues with the model's configuration or the dataset's characteristics.

## Step 2: Conduct cross-validation to assess the robustness of the models and identify potential overfitting.

We will use `CrossValidator` to perform k-fold cross-validation. For simplicity, let's use 5-fold cross-validation. We will perform cross-validation for `Logistic Regression`, `Random Forest`, and `Gradient-Boosted Trees`.

In [73]:
# Define the evaluator (ROC AUC)
evaluator = BinaryClassificationEvaluator(labelCol="Default", metricName="areaUnderROC")

# Define the number of folds
num_folds = 5

3.1 Logistic Regression (LR)


In [74]:
# Define the parameter grid (if any)
param_grid_lr = ParamGridBuilder().build()

# Define the cross-validator
cv_lr = CrossValidator(
    estimator=lr,
    estimatorParamMaps=param_grid_lr,
    evaluator=evaluator,
    numFolds=num_folds,
    seed=42
)

# Fit the cross-validator on the training data
cv_model_lr = cv_lr.fit(train_data)

# Get the average ROC AUC across all folds
avg_roc_auc_lr = cv_model_lr.avgMetrics[0]
print(f"Logistic Regression Cross-Validation ROC AUC: {avg_roc_auc_lr}")

Logistic Regression Cross-Validation ROC AUC: 0.5930266458705338


3.2 Random Forest (RF)

In [75]:
# Define the parameter grid (if any)
param_grid_rf = ParamGridBuilder().build()

# Define the cross-validator
cv_rf = CrossValidator(
    estimator=rf,
    estimatorParamMaps=param_grid_rf,
    evaluator=evaluator,
    numFolds=num_folds,
    seed=42
)

# Fit the cross-validator on the training data
cv_model_rf = cv_rf.fit(train_data)

# Get the average ROC AUC across all folds
avg_roc_auc_rf = cv_model_rf.avgMetrics[0]
print(f"Random Forest Cross-Validation ROC AUC: {avg_roc_auc_rf}")

Random Forest Cross-Validation ROC AUC: 0.5


3.3 Gradient-Boosted Trees (GBT)

In [76]:
# Define the parameter grid (if any)
param_grid_gbt = ParamGridBuilder().build()

# Define the cross-validator
cv_gbt = CrossValidator(
    estimator=gbt,
    estimatorParamMaps=param_grid_gbt,
    evaluator=evaluator,
    numFolds=num_folds,
    seed=42
)

# Fit the cross-validator on the training data
cv_model_gbt = cv_gbt.fit(train_data)

# Get the average ROC AUC across all folds
avg_roc_auc_gbt = cv_model_gbt.avgMetrics[0]
print(f"Gradient-Boosted Trees Cross-Validation ROC AUC: {avg_roc_auc_gbt}")

Gradient-Boosted Trees Cross-Validation ROC AUC: 0.5781176087257808


`Analysis:`

`Logistic Regression` has the highest cross-validation ROC AUC (0.5930), indicating that it performs better than the other models in distinguishing between default and non-default cases across different subsets of the training data. This suggests that Logistic Regression is the most robust model among the three.

`Gradient-Boosted Trees` show a decent cross-validation ROC AUC (0.5781), which is slightly lower than Logistic Regression but still better than random guessing. This indicates that GBT is a reasonable alternative, though not as strong as Logistic Regression.

`Random Forest` has the lowest cross-validation ROC AUC (0.5), which is equivalent to random guessing. This suggests that the Random Forest model is not effectively distinguishing between the classes and may be suffering from overfitting or other issues.

`Conclusion:`

`Best Model:` `Logistic Regression` is the best-performing model based on cross-validation results. It has the highest ROC AUC, indicating better generalization and robustness across different subsets of the training data. Given its superior performance, Logistic Regression should be the primary model for deployment.

`Alternative Model:` `Gradient-Boosted Trees` can be considered as an alternative, though its performance is slightly lower than Logistic Regression. It may still be useful depending on specific requirements or constraints.

`Poor Performance:` `Random Forest` is not performing well, as indicated by its ROC AUC of 0.5.

In [None]:
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski

# Define the vectors
vector1 = np.array([29,50, 35])
vector2 = np.array([4, 5, 6])

# Cosine Similarity
cosine_similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

# Euclidean Distance
euclidean_distance = euclidean(vector1, vector2)

# Manhattan Distance
manhattan_distance = cityblock(vector1, vector2)

# Minkowski Distance (p=3 for example)
minkowski_distance = minkowski(vector1, vector2, p=3)

# Output the results
print(f"Cosine Similarity: {cosine_similarity}")
print(f"Euclidean Distance: {euclidean_distance}")
print(f"Manhattan Distance: {manhattan_distance}")
print(f"Minkowski Distance (p=3): {minkowski_distance}")
