In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


Objective:

The objective of Return Model 1 is to predict the expected return on individual Lending Club loans using borrower and loan attributes known at the time of issuance. This supports the construction of investment portfolios focused on maximizing return, as required by the coursework.

Return Metric Definition:

The chosen metric, custom_return_1, is calculated as (total_pymnt − loan_amnt) / loan_amnt, capturing the net gain or loss on a loan as a fraction of the amount lent.
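
As a worked illustration of the metric, the sketch below applies the same formula to two made-up loans. The helper name `net_return` and the dollar figures are assumptions for illustration only and do not come from the dataset.

```python
def net_return(total_pymnt: float, loan_amnt: float) -> float:
    """Net gain or loss as a fraction of the amount lent."""
    return (total_pymnt - loan_amnt) / loan_amnt


# Hypothetical loan repaid in full with interest: 10,000 lent, 11,500 received.
repaid = net_return(11_500.0, 10_000.0)

# Hypothetical loan that defaulted early: 10,000 lent, 4,000 recovered.
defaulted = net_return(4_000.0, 10_000.0)

print(f"repaid:    {repaid:+.2%}")     # +15.00%
print(f"defaulted: {defaulted:+.2%}")  # -60.00%
```

Note that the formula is not annualized, so two loans with the same total payments and principal receive the same value regardless of how long repayment took.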



In [4]:
import pandas as pd


path = "/content/drive/MyDrive/lending_club_dataset.pickle"


data = pd.read_pickle(path)
df = data[0]

# Return metric
df["custom_return_1"] = (df["total_pymnt"] - df["loan_amnt"]) / df["loan_amnt"]


In [5]:
from sklearn.model_selection import train_test_split

# Define features
numerical = ["loan_amnt", "funded_amnt", "installment", "int_rate", "annual_inc", "loan_length", "term_num"]
categorical = ["home_ownership", "grade", "emp_length"]
features = numerical + categorical
target = "custom_return_1"

# Split
df_sample = df.sample(n=50000, random_state=42)
X = df_sample[features]
y = df_sample[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Feature Selection:

Features were selected based on their economic relevance to loan performance and their availability at the loan’s origination. Numerical variables included loan_amnt, funded_amnt, installment, int_rate, annual_inc, loan_length, and term_num, while categorical features included home_ownership, grade, and emp_length.


In [6]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical)
])


Feature Processing:

To prepare the dataset for model training, we applied systematic preprocessing to both numerical and categorical features using a scikit-learn ColumnTransformer. All numerical variables (loan_amnt, funded_amnt, installment, int_rate, annual_inc, loan_length, and term_num) were standardized using StandardScaler to ensure zero mean and unit variance.

This step is particularly important for linear models like Lasso, which are sensitive to the scale of input features. Categorical variables (home_ownership, grade, emp_length) were transformed using OneHotEncoder, allowing models to handle non-numeric categories effectively. To ensure consistent and reproducible preprocessing, these transformations were embedded within a scikit-learn Pipeline, which combines preprocessing and model training into a single unified workflow.
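
To make the effect of these transformations concrete, the following minimal sketch runs the same kind of ColumnTransformer on a three-row toy frame. The frame `toy`, the name `toy_preprocessor`, and the values are assumptions for illustration; only two of the notebook's columns are used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for real loans (illustrative values only).
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, 15000.0],
    "grade": ["A", "B", "A"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["loan_amnt"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["grade"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)  # (3, 3): scaled loan_amnt, grade A, grade B

# A grade never seen during fitting is encoded as all zeros rather than raising.
unseen = pd.DataFrame({"loan_amnt": [10000.0], "grade": ["C"]})
unseen_row = toy_preprocessor.transform(unseen)
if hasattr(unseen_row, "toarray"):
    unseen_row = unseen_row.toarray()
print(unseen_row[0, 1:])  # one-hot columns for the unseen grade
```

With handle_unknown="ignore", categories that appear only in the test set do not break prediction, which matters when the encoder is fitted on the training split alone.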



In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import shuffle

models = {
    "Lasso": Lasso(alpha=0.01, max_iter=10000),
    "RandomForest": RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=50, max_depth=6, learning_rate=0.1, random_state=42)
}


results = {}
for name, model in models.items():
    pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model)
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    results[name] = {
        "MSE": mean_squared_error(y_test, preds),
        "R2": r2_score(y_test, preds)
    }

# Random Guessing Baseline
random_preds = shuffle(y_test.copy(), random_state=42)
results["RandomGuessing"] = {
    "MSE": mean_squared_error(y_test, random_preds),
    "R2": r2_score(y_test, random_preds)
}


Train/Test Split:

The dataset was split into an 80% training set and 20% testing set using train_test_split. This ensures that models are evaluated on unseen data to assess generalization performance, in line with standard evaluation methods discussed in lectures.
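
A quick sanity check of the split sizes can be sketched as follows, assuming the 50,000-loan sample used in this notebook. The placeholder array `rows` stands in for the real feature matrix and is not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder row indices standing in for the 50,000 sampled loans.
rows = np.arange(50_000)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))  # 40000 10000

# No row appears on both sides of the split.
overlap = np.intersect1d(train_rows, test_rows)
print(overlap.size)  # 0
```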

Model Selection:

Three models were trained to predict loan return: Lasso Regression, Random Forest Regressor, and XGBoost Regressor. Lasso provides a simple linear benchmark with embedded feature selection. Random Forest uses bagging to reduce variance, while XGBoost applies boosting for high predictive performance. Each model was wrapped in a scikit-learn Pipeline to streamline preprocessing and training, and all three were compared against a random-guessing baseline constructed by shuffling the test labels.
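
It may help to see why a shuffled-label baseline scores far below zero on R². Shuffling keeps the distribution of returns but breaks the pairing with the true values, so the expected squared error is roughly twice the variance of the target, which puts R² near -1. The sketch below checks this on synthetic data; the Gaussian "returns" and all variable names are assumptions for illustration, not Lending Club values.

```python
import random
from statistics import fmean

rng = random.Random(42)

# Synthetic "returns" (an assumption for illustration, not Lending Club data).
y_true = [rng.gauss(0.0, 0.3) for _ in range(10_000)]

# Random-guessing baseline: the same values in a shuffled order.
y_guess = y_true.copy()
rng.shuffle(y_guess)

mean_y = fmean(y_true)
variance = fmean((y - mean_y) ** 2 for y in y_true)
mse = fmean((y - g) ** 2 for y, g in zip(y_true, y_guess))
r2 = 1.0 - mse / variance

print(f"MSE / variance: {mse / variance:.2f}")  # expected to be close to 2
print(f"R2: {r2:.2f}")                          # expected to be close to -1
```

By comparison, always predicting the mean return gives an R² of 0, so any model with a positive R² beats both reference points.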


In [8]:
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    print(f"  MSE: {metrics['MSE']:.4f}")
    print(f"  R²:  {metrics['R2']:.4f}\n")




Model: Lasso
  MSE: 0.0642
  R²:  0.2221

Model: RandomForest
  MSE: 0.0601
  R²:  0.2723

Model: XGBoost
  MSE: 0.0596
  R²:  0.2783

Model: RandomGuessing
  MSE: 0.1676
  R²:  -1.0288


Sample Size:

To balance computational efficiency with predictive power, a random subset of 50,000 loans was drawn before the train/test split, giving 40,000 loans for training and 10,000 for testing. This sample size is intended to capture representative patterns in the data while remaining tractable within the Google Colab environment.
