<div class="alert alert-success">

The dataset consists of data about 1000 customers, encompassing 84 features extracted from their financial transactions and current financial status. The main aim is to utilize this dataset for credit risk assessment and forecasting potential defaults.

Included within are two target variables, one designed for classification and the other for regression analysis:

- **DEFAULT**: Binary target variable indicating if the customer has defaulted (1) or not (0)
- **CREDIT_SCORE**: Numerical target variable representing the customer's credit score (integer)

and these features:

- **INCOME**: Total income in the last 12 months
- **SAVINGS**: Total savings in the last 12 months
- **DEBT**: Total existing debt
- **R_SAVINGS_INCOME**: Ratio of savings to income
- **R_DEBT_INCOME**: Ratio of debt to income
- **R_DEBT_SAVINGS**: Ratio of debt to savings

For each transaction group (**GROCERIES**, **CLOTHING**, **HOUSING**, **EDUCATION**, **HEALTH**, **TRAVEL**, **ENTERTAINMENT**, **GAMBLING**, **UTILITIES**, **TAX**, **FINES**), the following features are provided, where [GROUP] stands for the group name:

- **T_[GROUP]_6**: Total expenditure in that group in the last 6 months
- **T_[GROUP]_12**: Total expenditure in that group in the last 12 months
- **R_[GROUP]**: Ratio of T_[GROUP]_6 to T_[GROUP]_12
- **R_[GROUP]_INCOME**: Ratio of T_[GROUP]_12 to INCOME
- **R_[GROUP]_SAVINGS**: Ratio of T_[GROUP]_12 to SAVINGS
- **R_[GROUP]_DEBT**: Ratio of T_[GROUP]_12 to DEBT

Categorical Features:

- **CAT_GAMBLING**: Gambling category (none, low, high)
- **CAT_DEBT**: 1 if the customer has debt; 0 otherwise
- **CAT_CREDIT_CARD**: 1 if the customer has a credit card; 0 otherwise
- **CAT_MORTGAGE**: 1 if the customer has a mortgage; 0 otherwise
- **CAT_SAVINGS_ACCOUNT**: 1 if the customer has a savings account; 0 otherwise
- **CAT_DEPENDENTS**: 1 if the customer has any dependents; 0 otherwise
- **CAT_LOCATION**: Location (San Francisco, Philadelphia, Los Angeles, etc.)
- **CAT_MARITAL_STATUS**: Marital status (Married, Widowed, Divorced or Single)
- **CAT_EDUCATION**: Level of Education (Postgraduate, College, High School or Graduate)

</div>
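
As a rough illustration of how the ratio features relate to the raw amounts, the sketch below recomputes a few of them on a hypothetical two-customer frame. The values are made up, and the assumption that each ratio is a plain division (with no special handling of zero denominators) is not confirmed by the dataset description.

```python
import pandas as pd

# Hypothetical toy values, not taken from the real dataset
toy = pd.DataFrame({
    "INCOME": [50_000, 80_000],
    "SAVINGS": [10_000, 40_000],
    "DEBT": [25_000, 20_000],
    "T_GROCERIES_6": [1_500, 2_000],
    "T_GROCERIES_12": [3_000, 5_000],
})

# Assumed definitions: each ratio is a simple quotient of two columns
toy["R_SAVINGS_INCOME"] = toy["SAVINGS"] / toy["INCOME"]
toy["R_DEBT_INCOME"] = toy["DEBT"] / toy["INCOME"]
toy["R_DEBT_SAVINGS"] = toy["DEBT"] / toy["SAVINGS"]
toy["R_GROCERIES"] = toy["T_GROCERIES_6"] / toy["T_GROCERIES_12"]
toy["R_GROCERIES_INCOME"] = toy["T_GROCERIES_12"] / toy["INCOME"]

# Show only the derived ratio columns
print(toy.filter(regex="^R_").round(3))
```

Because the ratios are functions of the raw columns, strong correlations between some of these features are to be expected.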


<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h2 style="color: #333;">Guidance through the Notebook</h2>
    <p>We start by reading the data and applying deliberately simple preprocessing, since the project focuses on models and grid search, and then make the train-test split.</p>
</div>

In [1]:
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import train_test_split

set_config(transform_output="pandas")

url = "https://raw.githubusercontent.com/jnin/information-systems/main/data/AI2_23_24_credit_score.csv"

# Create the dataframe and preview it without the identifier column.
# df itself keeps every column; CUST_ID is dropped during preprocessing.
df = pd.read_csv(url)
df.drop('CUST_ID', axis=1).head()

In [2]:
# Data preprocessing

# Dropping the CUST_ID column
df = df.drop('CUST_ID', axis=1)

# The correlation heatmap is left commented out because there are too many columns to read it comfortably.
# import seaborn as sns
# import matplotlib.pyplot as plt
# plt.figure(figsize=(10, 10))
# sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
# plt.show()
# print(df.corr(numeric_only=True))

# Collect every column whose absolute correlation with an earlier column exceeds 0.9
correlated_features = set()
correlation_matrix = df.corr(numeric_only=True)  # numeric_only skips the text-valued CAT_ columns
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.9:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

# Drop the correlated columns
df = df.drop(columns=list(correlated_features))

# Recompute the correlation matrix on the reduced dataframe
correlation_matrix = df.corr(numeric_only=True)


In [3]:
y = df['CREDIT_SCORE']
X = df.drop('CREDIT_SCORE', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=25)
# Column lists by dtype, used later by the numerical and categorical transformers
categorical_features = X_train.select_dtypes(include = ['object']).columns.tolist()
numerical_features = X_train.select_dtypes(include = ['int64', 'float64']).columns.tolist()

<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h2 style="color: #333;">Approach 1</h2>
    <p>The first approach consists of running a grid search over the two models and their hyperparameters only, with the preprocessing for numerical and categorical data held fixed. This is suboptimal, but we use it as a starting point.</p>
</div>
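
The pattern used in this approach (a pipeline whose last step is a placeholder that the grid search fills in) can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's actual search: `make_regression` stands in for the credit data, and `Ridge` replaces XGBoost so that the example only needs scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the credit data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=25)

# Fixed preprocessing; only the final 'regressor' step is searched
demo_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("regressor", None)])

# One dict per model family: the 'regressor' key swaps the estimator,
# and 'regressor__<name>' keys tune that estimator's hyperparameters
demo_grid = [
    {"regressor": [Ridge()], "regressor__alpha": [0.1, 1.0, 10.0]},
    {"regressor": [SVR()], "regressor__C": [1, 10]},
]

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3, scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

print(type(demo_search.best_estimator_.named_steps["regressor"]).__name__)
print(len(demo_search.cv_results_["params"]))  # 3 Ridge + 2 SVR candidates
```

Using a list of dicts rather than a single dict keeps each model's hyperparameters from being combined with the other model's.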


In [5]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


## Approach 1: Defining the pipeline and conducting gridsearch only for the regressor models

# Preprocessing for numerical data
numerical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical data
categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipe, numerical_features),
        ('cat', categorical_pipe, categorical_features)])

# Placeholder for the regression model
regression_model = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('regressor', None)])  # 'regressor' will be defined in the grid search

from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Parameter grid for grid search: only the regressors and their hyperparameters are searched
param_grid = [{
    'regressor': [XGBRegressor(random_state=25)],
    'regressor__n_estimators': [100, 200],
    'regressor__max_depth': [3, 4, 5],
    # Add more parameters specific to XGBRegressor
}, {
    'regressor': [SVR()],
    'regressor__C': [0.1, 1, 10],
    'regressor__epsilon': [0.01, 0.1, 0.5],
    # Add more parameters specific to SVR
}]

# We optimise for negative mean squared error as scikit-learn optimises for maximising the score.
grid_search = GridSearchCV(regression_model, param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-2)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best model: {best_model}")
print(f"Best parameters: {best_params}")
print(f"Best cross-validated score (negative MSE): {best_score}")

Best model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['INCOME', 'SAVINGS', 'DEBT',
                                                   'R_SAVINGS_INCOME',
                                                   'R_DEBT_INCOME',
                                                   'R_DEBT_SAVINGS',
                                                   'T_CLOTHING_12',
                                                   'R_CLOTHING',
                                                   'R_CLOTHING_INCOME',
                                                   'R_CLOTHING_SAVINGS',
    

<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h2 style="color: #333;">Approach 2</h2>
    <p>Approach 2 addresses the fixed preprocessing of Approach 1 by including the preprocessing steps in the grid search, so that better combinations of preprocessing and model can be found.</p>
</div>
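
Searching over preprocessing relies on scikit-learn's double-underscore naming of nested parameters. The sketch below shows one way to list the valid names with `get_params()`; the two column lists are placeholders chosen for illustration, not the notebook's real feature lists.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipe = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer(transformers=[
    ("num", num_pipe, ["INCOME"]),        # placeholder column list
    ("cat", cat_pipe, ["CAT_GAMBLING"]),  # placeholder column list
])
model = Pipeline(steps=[("preprocessor", pre), ("regressor", None)])

# Every key returned by get_params() is a valid key for a parameter grid
tunable = sorted(model.get_params().keys())
print([name for name in tunable if name.startswith("preprocessor__num__")])
```

A key such as `preprocessor__num__scaler` replaces a whole step, while `preprocessor__num__imputer__strategy` changes one argument of a step.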

In [6]:
## Approach 2: Defining the pipeline and conducting gridsearch for both the regressor and the preprocessing steps
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error

# Preprocessing for numerical data
numerical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()), # Strategy will be defined in grid search
    ('scaler', StandardScaler())]) # Scaler will be defined in grid search

# Preprocessing for categorical data
categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # Strategy can also be defined in grid search
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]) # Parameters will be defined in grid search

# Transformer for preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipe, numerical_features),
        ('cat', categorical_pipe, categorical_features)])

# Placeholder for the regression model
regression_model = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('regressor', None)])  # 'regressor' will be defined in the grid search

# Parameter grid for grid search, now including preprocessing steps
param_grid = [{
    'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'], # Define strategies for SimpleImputer
    'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()], # Define scalers
    'preprocessor__cat__onehot__drop': ['first', None], # Define drop parameter for OneHotEncoder
    'regressor': [XGBRegressor(random_state=25)], # Define the regressor
    'regressor__n_estimators': [100, 200, 300], # number of boosting rounds (trees)
    'regressor__max_depth': [3, 4, 5], # maximum depth of each tree
    'regressor__learning_rate': [0.01, 0.1, 0.2] # step-size shrinkage applied to each tree's contribution
}, {
    'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'], # Define strategies for SimpleImputer
    'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()], # Define scalers
    'preprocessor__cat__onehot__drop': ['first', None], # Define drop parameter for OneHotEncoder
    'regressor': [SVR()], # Define the regressor
    'regressor__C': [0.1, 1, 10],   # C is the regularization parameter
                                    # Smaller C means a more regularized model, larger C a less regularized one
    'regressor__epsilon': [0.01, 0.1], # Epsilon is the half-width of the tube within which errors are not penalized
    'regressor__kernel': ['linear', 'poly', 'rbf', 'sigmoid'] # Kernel type used by the algorithm
}]

# Initialize GridSearchCV
grid_search = GridSearchCV(regression_model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-2) # n_jobs=-2 uses all CPUs but one; needed because the search took about 30 minutes
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best model: {best_model}")
print(f"Best parameters: {best_params}")
print(f"Best cross-validated score (negative MSE): {best_score}")

Best model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('scaler',
                                                                   MinMaxScaler())]),
                                                  ['INCOME', 'SAVINGS', 'DEBT',
                                                   'R_SAVINGS_INCOME',
                                                   'R_DEBT_INCOME',
                                                   'R_DEBT_SAVINGS',
                                                   'T_CLOTHING_12',
                                                   'R_CLOTHING',
                                                   'R_CLOTHING_INCOME',
                                                   'R_C

<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h2 style="color: #333;">Approaches 1 &amp; 2: Model Optimization Summary</h2>
    <p>The refined grid search did not shift our preferred model from XGBoost to <strong>SVR</strong>: XGBoost remained the best-performing model. What changed is the preprocessing, which is now tuned as well: the best pipeline shown above uses <code>most_frequent</code> imputation and <code>MinMaxScaler</code> for numerical data, and <code>most_frequent</code> imputation with <code>OneHotEncoder</code> for categorical data.</p>
    <p>Among the SVR candidates, the best parameters were identified as <strong>C=1</strong> and <strong>ε=1</strong> with a linear kernel, a choice that worked well on this dataset.</p>
    <p>The best cross-validated score (negative MSE) improved from -858 to -787.</p>
</div>
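
Since `best_params_` only reports the overall winner, comparing model families requires looking at `cv_results_`. The following is a minimal sketch of one way to extract the best candidate per regressor type; it uses synthetic data and `Ridge` in place of XGBoost, both of which are assumptions made to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic stand-in data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=25)

pipe = Pipeline(steps=[("regressor", None)])
grid = [
    {"regressor": [Ridge()], "regressor__alpha": [0.1, 1.0]},
    {"regressor": [SVR()], "regressor__kernel": ["linear", "rbf"]},
]
search = GridSearchCV(pipe, grid, cv=3, scoring="neg_mean_squared_error").fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)
# Label each candidate by the class of its regressor
results["model"] = results["params"].apply(lambda p: type(p["regressor"]).__name__)
# Keep the row with the highest mean cross-validated score for each model family
best_rows = results.groupby("model")["mean_test_score"].idxmax()
best_per_model = results.loc[best_rows, ["model", "mean_test_score", "params"]]
print(best_per_model.to_string(index=False))
```

The same grouping applied to the notebook's `grid_search.cv_results_` would show the best SVR configuration alongside the best XGBoost one.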


<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h2 style="color: #333;">Approach 3</h2>
    <p>The third approach extends the previous grid search by <strong>introducing 2 more models</strong> (with their hyperparameters) into it. The grid search now covers XGBoost, SVR, RandomForest, and GradientBoosting.</p>
</div>


In [39]:
# Approach 3: More complex approach (exploratory analysis): Comparing 4 models: XGBoost, SVR, RandomForest, and GradientBoosting and their parameters
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from xgboost import XGBRegressor
from sklearn.svm import SVR


# Preprocessing for numerical data
numerical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),  # Strategy will be defined in grid search
    ('scaler', StandardScaler())])  # Scaler will be defined in grid search

# Preprocessing for categorical data
categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Strategy can also be defined in grid search
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])  # Parameters will be defined in grid search

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipe, numerical_features),
    ('cat', categorical_pipe, categorical_features)])

# Placeholder for the regression model
regression_model = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('regressor', None)])  # 'regressor' will be defined in the grid search

# Parameter grid for grid search, now including additional models
# Models already used in Approach 2 are not commented again here (see the comments in Approach 2).
param_grid = [
    {
        'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
        'preprocessor__cat__onehot__drop': ['first', None],
        'regressor': [XGBRegressor(random_state=25)], #random_state=25 for reproducibility
        'regressor__n_estimators': [100, 200, 400],
        'regressor__max_depth': [3, 4, 5],
        'regressor__learning_rate': [0.01, 0.1]
    },
    {
        'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
        'preprocessor__cat__onehot__drop': ['first', None],
        'regressor': [SVR()],
        'regressor__C': [0.1, 1, 10],
        'regressor__epsilon': [0.01, 0.1],
        'regressor__kernel': ['linear', 'poly', 'rbf']
    },
    {
        'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
        'preprocessor__cat__onehot__drop': ['first', None],
        'regressor': [RandomForestRegressor(random_state=25)],
        'regressor__n_estimators': [100, 200, 400],
        'regressor__max_depth': [3, 4, 5]
    },
    {
        'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
        'preprocessor__cat__onehot__drop': ['first', None],
        'regressor': [GradientBoostingRegressor(random_state=25)],
        'regressor__n_estimators': [100, 200, 400],
        'regressor__learning_rate': [0.01, 0.1],
        'regressor__max_depth': [3, 4, 5]
    }
]

# Initialize GridSearchCV
grid_search = GridSearchCV(regression_model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-2)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best model: {best_model}")
print(f"Best parameters: {best_params}")
print(f"Best cross-validated score (negative MSE): {best_score}")

Best model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('scaler',
                                                                   MinMaxScaler())]),
                                                  ['INCOME', 'SAVINGS', 'DEBT',
                                                   'R_SAVINGS_INCOME',
                                                   'R_DEBT_INCOME',
                                                   'R_DEBT_SAVINGS',
                                                   'T_CLOTHING_12',
                                                   'R_CLOTHING',
                                                   'R_CLOTHING_INCOME',
                                                   'R_C

<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h3 style="color: #333;">Approach 3 Gridsearch Expansion Reflections</h3>
    <p>Starting with SVR and XGBRegressor, we later included RandomForestRegressor and GradientBoostingRegressor in our model exploration. With these 2 new models in the search, GradientBoosting outperformed the other three models. The best cross-validated score is now <strong>-756</strong>.</p>
    <p><strong>Key Takeaway:</strong></p>
    <ul>
        <li>A grid search over more models helps us evaluate and compare different models and hyperparameters using cross-validation on the training set.</li>
    </ul>
</div>
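
The cost of widening the search can be estimated before running it. The sketch below counts the candidates in the Approach 3 grid with `ParameterGrid`; the value lists are copied from the grid above, while plain strings stand in for the estimator and scaler objects so that the example needs neither xgboost nor any fitting.

```python
from sklearn.model_selection import ParameterGrid

# Preprocessing options shared by all four model families
shared = {
    "preprocessor__num__imputer__strategy": ["mean", "median", "most_frequent"],
    "preprocessor__num__scaler": ["standard", "minmax"],  # names stand in for scaler objects
    "preprocessor__cat__onehot__drop": ["first", None],
}

# Model-specific options; the dict keys stand in for the estimator objects
per_model = {
    "XGBRegressor": {"regressor__n_estimators": [100, 200, 400],
                     "regressor__max_depth": [3, 4, 5],
                     "regressor__learning_rate": [0.01, 0.1]},
    "SVR": {"regressor__C": [0.1, 1, 10],
            "regressor__epsilon": [0.01, 0.1],
            "regressor__kernel": ["linear", "poly", "rbf"]},
    "RandomForestRegressor": {"regressor__n_estimators": [100, 200, 400],
                              "regressor__max_depth": [3, 4, 5]},
    "GradientBoostingRegressor": {"regressor__n_estimators": [100, 200, 400],
                                  "regressor__learning_rate": [0.01, 0.1],
                                  "regressor__max_depth": [3, 4, 5]},
}

counts = {name: len(ParameterGrid({**shared, "regressor": [name], **params}))
          for name, params in per_model.items()}
total = sum(counts.values())
print(counts)
print(f"{total} candidates, {total * 5} fits with cv=5")
```

This gives 756 candidates and 3780 cross-validation fits (plus one final refit), which explains why the search is slow and why `n_jobs` matters.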


<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h3 style="color: #333;">Approach 4: Fine-tuning our best model</h3>
    <p>Knowing that GradientBoosting is our best-performing model, we try to improve it by fine-tuning its "optimal" parameters, which we believe are still suboptimal. We run a grid search on GradientBoosting alone, testing values very close to the ones selected by the latest grid search.</p>
    <p>Because the previously selected configuration is also part of the new grid, the result can only match or improve on the previous GradientBoosting score; it cannot get worse.</p>
    <p>This way we can find better values for the number of trees, the maximum depth, and the learning rate.</p>
</div>
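
The "no worse" argument holds when the finer grid contains the earlier configuration and both searches use the same cross-validation splits and a fixed `random_state`. The sketch below checks this on synthetic data; the data, the grid values, and the explicit `KFold` object are all assumptions chosen for a quick illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=25)

# Reusing the same splitter guarantees identical folds in both searches
folds = KFold(n_splits=3, shuffle=True, random_state=25)
model = GradientBoostingRegressor(random_state=25)

coarse = {"n_estimators": [20], "max_depth": [3]}
fine = {"n_estimators": [15, 20, 25], "max_depth": [2, 3]}  # contains the coarse point

coarse_search = GridSearchCV(model, coarse, cv=folds, scoring="neg_mean_squared_error").fit(X_demo, y_demo)
fine_search = GridSearchCV(model, fine, cv=folds, scoring="neg_mean_squared_error").fit(X_demo, y_demo)

print(coarse_search.best_score_, fine_search.best_score_)
```

The guarantee applies to the cross-validated score only; it says nothing about whether the test-set error improves.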


In [43]:
# Approach 4 Gridsearch for GradientBoosting specifically 

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


# Preprocessing for numerical data
numerical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),  # Strategy will be defined in grid search
    ('scaler', StandardScaler())])  # Scaler will be defined in grid search

# Preprocessing for categorical data
categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Strategy can also be defined in grid search
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])  # Parameters will be defined in grid search

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipe, numerical_features),
    ('cat', categorical_pipe, categorical_features)])

# Placeholder for the regression model
regression_model = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('regressor', None)])  # 'regressor' will be defined in the grid search

# Parameter grid for grid search, restricted to GradientBoosting with finely spaced values
param_grid = [
    {
        'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
        'preprocessor__cat__onehot__drop': ['first', None],
        'regressor': [GradientBoostingRegressor(random_state=25)],
        'regressor__n_estimators': [80,90,100,110,120], # finer steps than in Approach 3
        'regressor__learning_rate': [0.09,0.1,0.11], # finer steps than in Approach 3
        'regressor__max_depth': [1,2,3,4] # also tries shallower trees than Approach 3
    },
]

# Initialize GridSearchCV
grid_search = GridSearchCV(regression_model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best model: {best_model}")
print(f"Best parameters: {best_params}")
print(f"Best cross-validated score (negative MSE): {best_score}")


# Predict on the test set with the best model found by the grid search.
y_pred = grid_search.best_estimator_.predict(X_test)

# Negate the test MSE so it is on the same scale as the grid-search score.
neg_mse_test = -mean_squared_error(y_test, y_pred)

# Finally, print the generalization score (negative MSE on the test set)
print(f"Generalization score: {neg_mse_test}")

Best model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   MinMaxScaler())]),
                                                  ['INCOME', 'SAVINGS', 'DEBT',
                                                   'R_SAVINGS_INCOME',
                                                   'R_DEBT_INCOME',
                                                   'R_DEBT_SAVINGS',
                                                   'T_CLOTHING_12',
                                                   'R_CLOTHING',
                                                   'R_CLOTHING_INCOME',
                                                   'R_CLOTHING_SAVINGS',
      

<div style="background-color: #e0f7fa; border-left: 5px solid #0097a7; padding: 10px; color: #005662;">
    <h3 style="color: #333;">Approach 4 conclusions: Streamlined Model Optimization Summary</h3>
    <p>After identifying the GradientBoosting regressor as our model of choice, we refined its hyperparameters with a focused grid search, exploring n values [80,90,100,110,120], max depth values [1,2,3,4] and different learning rates [0.09,0.1,0.11].</p>
    <p><strong>Results:</strong> This led to an optimized GradientBoosting model with <strong>n=110, maximum depth=3 and a learning rate of 0.09</strong>, which achieved a better negative MSE of <strong>-755</strong> in cross-validation, making it our best-performing model out of all the ones we tried.</p>
    <p>Only now that the model has been defined and selected can we compute the generalization score on the test set. As expected, it is worse than the cross-validation score, but it is still promising: the negative mean squared error on the test set is -924.</p>
    <p><strong>Takeaway:</strong> Fine-tuning around the previous optimum gave a small further gain in the cross-validated score (from -756 to -755). The test set was used only in this last step, once the model had been fully defined.</p>
</div>
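
Negative MSE values are hard to read in the units of the target. A simple conversion to root mean squared error puts the two scores reported above on the scale of credit-score points. This is a rough conversion: the square root of an averaged MSE is not exactly the average of the per-fold RMSEs.

```python
import math

# Negative MSE values reported in the text above
scores = {"cross-validation": -755, "test": -924}

rmse_by_split = {}
for name, neg_mse in scores.items():
    # Flip the sign to recover the MSE, then take the square root
    rmse_by_split[name] = math.sqrt(-neg_mse)
    print(f"{name}: RMSE of about {rmse_by_split[name]:.1f} credit-score points")
```

This corresponds to a typical error of roughly 27.5 points in cross-validation and 30.4 points on the test set.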
