# Credit Risk Analysis

## Summary

I started with an EDA to understand the structure of the training set, relying on the Pandas Profiling package to generate a report that made it easy to spot variables with a high rate of missing values, high cardinality, or high correlation. From this, I decided to drop 7 columns, along with the `ids`. I also decided to remove observations with a missing `default` value, instead of attempting to handle them with an unsupervised method. Dropping these observations and columns dealt with most of the missing values.

I one-hot encoded all categorical variables in the training set, ensuring that the test set had the same resulting features. After that, I split the training data into training and validation sets. Then, I handled the remaining missing values in the numerical variables. I used the mode for the features that were 0 for most observations or that had only a couple of possible values, and the median for the rest. The imputers were fit on the training set and then used to transform the variables in the validation and test sets.

I trained a baseline model that always picks the most frequent class for `default`, resulting in 0.84 accuracy. However, given the imbalance in the dataset, accuracy isn't a useful metric here. Thus, I opted for the weighted AUROC, which is robust to the class imbalance and scores predicted probabilities rather than hard class labels.

After performing some additional feature selection, I trained a LogisticRegression model, a DecisionTree, a RandomForest, and an XGBoost model. LogisticRegression and XGBoost showed the most promising results according to the AUROC metric.

I ran a grid search with 3-fold cross-validation to tune the XGBoost model. Then, with the best parameters, I trained a soft VotingClassifier that averages the probabilities predicted by the LogisticRegression and optimized XGBoost models, which obtained an AUROC score of 0.7766 against the validation set. Finally, I used this ensemble model to make predictions on the test set.

## Analysis

### Import Required Packages

In [1]:
import numpy as np
import pandas as pd

from pandas_profiling import ProfileReport
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

### Load the Data 

In [2]:
train_df = pd.read_csv("data/train_dataset.csv")
test_df = pd.read_csv("data/test_dataset.csv")

### EDA

Check the first five observations in the training set.

In [3]:
train_df.head()

Unnamed: 0,ids,default,score_1,score_2,score_3,score_4,score_5,score_6,risk_rate,amount_borrowed,...,state,zip,channel,job_name,real_state,ok_since,n_bankruptcies,n_defaulted_loans,n_accounts,n_issues
0,810e3277-619e-3154-7ba0-ebddfc5f7ea9,False,smzX0nxh5QlePvtVf6EAeg==,tHpS8e9F8d9zg3iOQM9tsA==,710.0,104.174961,0.661509,123.015325,0.43,20024.31,...,xsd3ZdsI3356I3xMxZeiqQ==,i036nmJ7rfxo+3EvCD7Jnw==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,n+xK9CfX0bCn77lClTWviw==,14.0,1.0,0.0,9.0,9.0
1,b4118fd5-77d5-4d80-3617-bacd7aaf1a88,False,DGCQep2AE5QRkNCshIAlFQ==,RO7MTL+j4PH2gNzbhNTq/A==,330.0,97.880798,0.531115,110.913484,0.23,10046.51,...,xsd3ZdsI3356I3xMxZeiqQ==,oyrt7nHjoQSc58vCxgJF/w==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,n+xK9CfX0bCn77lClTWviw==,75.0,0.0,0.0,3.0,
2,a75638f1-4662-4f4f-044a-d649b676d85d,False,8k8UDR4Yx0qasAjkGrUZLw==,wkeCdGeu5sEv4/fjwR0aDg==,360.0,97.908925,0.611086,104.620791,0.3,21228.25,...,/L8vvVesB5WyAv190Hw/rQ==,BMIK35trMYhh9yVrcGg/oQ==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,N5/CE7lSkAfB04hVFFwllw==,,0.0,0.0,5.0,
3,285ce334-3602-42b3-51cb-eebfcba48a09,False,4DLlLW62jReXaqbPaHp1vQ==,tQUTfUyeuGkhRotd+6WjVg==,120.0,100.434557,0.139784,120.134718,0.15,23032.33,...,GW2VZ3dN3OGHSjQ6JkfqQw==,coa2oOrpjxnQl4iyM7dTpQ==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,N5/CE7lSkAfB04hVFFwllw==,,0.0,0.0,5.0,
4,e643bf65-9288-92f2-df13-eed631fe237c,False,4DLlLW62jReXaqbPaHp1vQ==,7h8PTkrlTWUPP3yuyP4rUg==,330.0,103.774638,0.002856,104.320462,0.08,24026.29,...,sjJbkqJS7cXalHLBFA+EOQ==,xTrDMEf/Cnewxc1LO+pfbg==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,N5/CE7lSkAfB04hVFFwllw==,15.0,0.0,0.0,10.0,10.0


Check variable types and number of non-null observations.

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64592 entries, 0 to 64591
Data columns (total 27 columns):
ids                   64592 non-null object
default               59966 non-null object
score_1               63807 non-null object
score_2               63807 non-null object
score_3               63807 non-null float64
score_4               64592 non-null float64
score_5               64592 non-null float64
score_6               64592 non-null float64
risk_rate             63807 non-null float64
amount_borrowed       63807 non-null float64
borrowed_in_months    63807 non-null float64
credit_limit          44839 non-null float64
reason                63801 non-null object
income                63807 non-null float64
sign                  43825 non-null object
gender                57406 non-null object
facebook_profile      58185 non-null object
state                 63807 non-null object
zip                   63807 non-null object
channel               63807 non-null object
jo

Check the ratio of `default` classes and the number of missing values.

In [5]:
# Default vs. non-default 
counts = train_df.default.value_counts()
default = counts[True]
non_default = counts[False]
perc_default = (default / (default + non_default)) * 100
perc_non_default = (non_default/(default + non_default)) * 100
print("There were {} default loans ({:.3f}%) and {} non-default loans ({:.3f}%).".format(default, perc_default, non_default, perc_non_default))

There were 9510 default loans (15.859%) and 50456 non-default loans (84.141%).


In [6]:
print("Number of missing `default` values:",train_df["default"].isna().sum())

Number of missing `default` values: 4626


Drop observations with missing values for `default`.

In [7]:
train_df = train_df.dropna(subset = ["default"])

Generate a profile report for the training set, ignoring the missing values for `default`.

In [8]:
profile = ProfileReport(train_df, title = "Pandas Profiling Report - ignoring missing values for `default`", html = {"style": {"full_width": True}})
profile.to_notebook_iframe()

The report above shows a number of variables with many missing values (`credit_limit`, `n_issues`, `ok_since`), and others with high cardinality (`job_name`, `reason`, `zip`). In addition to its missing values, `n_issues` is highly correlated with `n_accounts`. All of this suggests these variables are good candidates to be dropped from the dataset, as they would either require a large number of columns to represent as one-hot encoded variables, or require imputation of missing values with no clear strategy.

Additionally, the `channel` variable has the same value for all observations, which means it has no impact on modeling.

Finally, the report also shows the distribution for the variables, which helps define the imputation strategies that are used later in the analysis.

Along with the interactive report above, the profile report can also be viewed in the `eda_report.html` file.

In [9]:
# Save an HTML report
profile.to_file(output_file = "eda_report.html")

### Preprocessing/Data Wrangling

Split the training data, separating input features from the target variable `default`, converting it to `int` in the process.

In [10]:
target = "default"

# Convert `default` values from boolean to integer
y_train = train_df.loc[:, target].astype("int")

# Keep every column except `default` as input features
X_train = train_df.loc[:, train_df.columns != target]

As identified in the EDA, some features have many missing values and/or high cardinality. Remove those from the training and test data, along with the `channel` feature, which is constant, and the `ids`.

In [11]:
dropped_columns = [
    "ids",
    "channel", # 1 unique value
    "credit_limit", # 31.3% missing values, low correlation with target variable
    "job_name", # 6.3% missing values, high cardinality
    "n_issues", # 26% missing values, high correlation with n_accounts
    "ok_since", # 58.5% missing values
    "reason", # high cardinality
    "zip", # high cardinality
]

X_train = X_train.drop(columns = dropped_columns)
X_test = test_df.drop(columns = dropped_columns)

One-hot encode the remaining non-numeric categorical variables. It's important to ensure that the features are encoded based only on the values seen in the training set.

In [12]:
def encode_categories(train_df, test_df, variables):
    """
    Encode categorical features as a one-hot numeric array.

    Keyword arguments:
    train_df -- the training dataset. Defines the set of encoded columns.
    test_df -- the test dataset.
    variables -- the features to be encoded.
    """
    # get_dummies keeps the untouched columns and replaces `variables` with indicator columns
    train_oh = pd.get_dummies(train_df, columns = variables, prefix = variables)
    test_oh = pd.get_dummies(test_df, columns = variables, prefix = variables)
    # Keep only the columns present in the training set; indicator columns
    # missing from the test set are added and filled with 0
    test_oh = test_oh.reindex(columns = train_oh.columns, fill_value = 0)

    return train_oh, test_oh

In [13]:
X_train, X_test = encode_categories(X_train, X_test, [
    "facebook_profile",
    "gender",
    "real_state",
    "score_1",
    "score_2",
    "sign",
    "state"
])  
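To illustrate what the alignment step is meant to guarantee, here is a dependency-free sketch. The helpers `fit_categories` and `one_hot` and the toy `sign` values are hypothetical and not part of the analysis. Categories are learned from the training rows only, so a value that appears only in the test rows is encoded as all zeros instead of creating a new column.

```python
def fit_categories(train_values):
    """Learn the sorted list of categories from training values only."""
    return sorted(set(train_values))


def one_hot(values, categories, prefix):
    """Encode values as rows of 0/1 indicators, one column per known category."""
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# Toy data (hypothetical): "virg" appears only in the test rows.
train_sign = ["leo", "gemi", "leo"]
test_sign = ["gemi", "virg"]

categories = fit_categories(train_sign)
train_cols, train_rows = one_hot(train_sign, categories, "sign")
test_cols, test_rows = one_hot(test_sign, categories, "sign")

print(train_cols)  # both sets share these columns
print(test_rows)   # the unseen "virg" row contains only zeros
```

Because the column list comes from the training data alone, a model fit on the training features can always be applied to the test features without a shape mismatch.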

Split the training set into a training set (80%) and a validation set (20%). This ensures the models are trained and then validated on unseen data, which provides an estimate of the expected performance against the test set.

In [14]:
def train_validation_split(X, y, validation_size = 0.2, random_state = 17):
    """
    Splits a dataframe of input features and its corresponding target labels into a training and a validation set.

    Keyword arguments:
    X -- Dataframe of feature observations.
    y -- Series of target labels.
    validation_size -- the fraction of observations to hold out for validation.
    random_state -- random seed.
    """
    X_train, X_validation, y_train, y_validation = train_test_split(X, 
                                                                    y, 
                                                                    test_size = validation_size, 
                                                                    shuffle = True,
                                                                    random_state = random_state)
    return X_train, X_validation, y_train, y_validation

In [15]:
X_train, X_validation, y_train, y_validation = train_validation_split(X_train, y_train)

Fit imputers for each numerical feature, with strategies chosen based on the EDA's findings. Each imputer is trained against the training set only, and then used to transform the variables in the validation and test sets. That is, medians and modes are calculated based on the observations on the training set.

In [16]:
def fit_imputers(X, variables_strategies):
    """
    Fits and returns imputers to use for filling missing values.

    Keyword arguments:
    X -- Dataframe of feature observations.
    variables_strategies -- A dictionary of variable names to imputing strategies. e.g { "income": "median" }
                            valid strategies are ["mean", "median", "most_frequent", "constant"]
    """
    imputers = {}
    for variable, strategy in variables_strategies.items():
        imputer = SimpleImputer(strategy = strategy, copy = False)
        if variable in X.columns:
            imputer.fit(X[[variable]])
            imputers[variable] = imputer
    
    return imputers

In [17]:
imputers = fit_imputers(X_train, 
                        variables_strategies = {
                            "amount_borrowed": "median",
                            "borrowed_in_months": "most_frequent", # 2 possible values
                            "income": "median",
                            "n_accounts": "median",
                            "n_bankruptcies": "most_frequent", # 92% zeroes
                            "n_defaulted_loans": "most_frequent", # 99.6% zeroes                                          
                            "risk_rate": "median",
                            "score_3": "median",
                            "score_4": "median",
                            "score_5": "median",
                            "score_6": "median"
                        })

In [18]:
def impute_missing_values(X, imputers):
    """
    Imputes missing values on variables of X with the provided imputers.

    Keyword arguments:
    X -- Dataframe of feature observations.
    imputers -- Dictionary of variable names and pre-fit imputers.
    """
    for variable, imputer in imputers.items():
        if variable in X.columns:
            # transform returns an (n, 1) array; flatten it before assigning to the column
            X[variable] = imputer.transform(X[[variable]]).ravel()
    return X

In [19]:
X_train = impute_missing_values(X_train, imputers)
X_validation = impute_missing_values(X_validation, imputers)
X_test = impute_missing_values(X_test, imputers)
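The fit-on-train, apply-everywhere pattern used above can be sketched without any libraries. This is a minimal illustration under assumptions: the function names and toy values are hypothetical, and `None` stands in for a missing value. The fill values come from the training columns only, so the validation data never influences them.

```python
from statistics import median, mode


def fit_fill_values(train_columns, strategies):
    """Compute one fill value per column from non-missing training values."""
    fill_values = {}
    for name, strategy in strategies.items():
        observed = [v for v in train_columns[name] if v is not None]
        fill_values[name] = median(observed) if strategy == "median" else mode(observed)
    return fill_values


def apply_fill_values(columns, fill_values):
    """Replace missing entries (None) with the values learned from training data."""
    return {
        name: [fill_values[name] if v is None else v for v in values]
        for name, values in columns.items()
    }


# Toy columns (hypothetical values); None marks a missing entry.
train = {"income": [30.0, 50.0, None, 90.0], "n_bankruptcies": [0, 0, 1, None]}
validation = {"income": [None, 1000.0], "n_bankruptcies": [None, 2]}

fill_values = fit_fill_values(
    train, {"income": "median", "n_bankruptcies": "most_frequent"}
)
validation_filled = apply_fill_values(validation, fill_values)
print(fill_values)
print(validation_filled)
```

The missing validation income is filled with the training median, even though the validation set's own values would suggest a very different number.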

### Baseline Model - Dummy Classifier

Train a `DummyClassifier` that always picks the most frequent class as the prediction. 

In [20]:
baseline_model = DummyClassifier(strategy = "most_frequent")
baseline_model.fit(X_train, y_train)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

In [21]:
accuracy_score(y_validation, baseline_model.predict(X_validation))

0.8449224612306153

Due to the imbalance in the dataset, always picking the most frequent class leads to 84% accuracy. This isn't a useful metric though, as this model can never predict a `default` occurrence.
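A small sketch makes the problem concrete. The labels below are toy data (84 non-default and 16 default observations, chosen to roughly mirror the class ratio reported earlier), and the helper functions are hypothetical. A predictor that always outputs the majority class reaches high accuracy while finding none of the defaults.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the predictions recover."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Toy labels: 84 non-default (0) and 16 default (1) observations.
y_true = [0] * 84 + [1] * 16
# The baseline always predicts the most frequent class.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(baseline_accuracy, baseline_recall)
```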

### Model Scoring

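Since the data is imbalanced (approximately 16% of the training observations belong to the default class), it's important to pick a scoring metric that can account for that. At the same time, the model should output class probabilities, so the metric should be computed from those probabilities. Thus, a good choice is the area under the ROC curve (AUROC). The `average` parameter is set to `weighted` below; note that this setting only matters for multiclass or multilabel targets, so for a binary target such as `default` it does not change the score. AUROC is a ranking metric and is not inflated by simply predicting the majority class.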

In [22]:
def report_score(model, X_train, X_validation, y_train, y_validation):
    """
    Prints a model's train and validation scores, computed using the Area Under the ROC curve (AUROC).

    Keyword arguments:
    model -- the model to score. Must support predict_proba
    X_train -- training data (features).
    X_validation -- validation data (features).
    y_train -- training targets
    y_validation -- validation targets
    """
    y_hat_train_proba = model.predict_proba(X_train)[:,1]
    y_hat_validation_proba = model.predict_proba(X_validation)[:,1]
    
    print("Train ROC AUC score:", roc_auc_score(y_train, y_hat_train_proba, average = "weighted"))
    print("Validation ROC AUC score:", roc_auc_score(y_validation, y_hat_validation_proba, average = "weighted"))

In [23]:
report_score(baseline_model, X_train, X_validation, y_train, y_validation)

Train ROC AUC score: 0.5
Validation ROC AUC score: 0.5


With a more suitable metric, it's now clear that the `DummyClassifier` performs very poorly, with an AUROC of 0.5, which is no better than random ranking.
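To see why a constant predictor scores exactly 0.5, AUROC can be written as the fraction of (default, non-default) pairs in which the default observation receives the higher score, with ties counting as half. The sketch below implements that pairwise definition directly; the function name and toy scores are hypothetical, and this quadratic-time version is for illustration only.

```python
def auroc(y_true, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels: four non-default (0) and two default (1) observations.
y_true = [0, 0, 0, 0, 1, 1]
constant_scores = [0.0] * 6                         # like the dummy baseline
informative_scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]

print(auroc(y_true, constant_scores))
print(auroc(y_true, informative_scores))
```

With constant scores every pair is a tie, so the result is 0.5 regardless of how imbalanced the classes are.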

### Feature Selection

Before continuing with model training and selection, it's a good idea to use `SelectKBest` to visualize feature importance.

In [24]:
k = X_train.shape[1]
best_features = SelectKBest(score_func = chi2, k = k)
fit = best_features.fit(X_train, y_train)
scores = pd.DataFrame(fit.scores_)
features = pd.DataFrame(X_train.columns)

featureScores = pd.concat([features, scores], axis = 1)
featureScores.columns = ["Feature", "Score"]
# Print features ordered by score
print(featureScores.nlargest(k, "Score"))

                                 Feature         Score
7                                 income  3.444246e+06
5                        amount_borrowed  1.049064e+06
6                     borrowed_in_months  3.688218e+03
0                                score_3  3.242821e+03
70                             sign_sagi  1.974388e+03
12                 facebook_profile_True  7.434388e+02
21      score_1_4DLlLW62jReXaqbPaHp1vQ==  6.836984e+02
72                             sign_taur  6.174519e+02
26      score_1_smzX0nxh5QlePvtVf6EAeg==  5.111632e+02
24      score_1_e4NYDor1NOw6XKGE60AWFw==  4.571898e+02
11                facebook_profile_False  4.241728e+02
66                             sign_gemi  3.217008e+02
22      score_1_8k8UDR4Yx0qasAjkGrUZLw==  2.841588e+02
67                              sign_leo  2.438944e+02
23      score_1_DGCQep2AE5QRkNCshIAlFQ==  2.158309e+02
68                             sign_libr  1.894019e+02
64                            sign_cance  1.843365e+02
73        

Inspecting the feature rank above, it's possible to notice:
- Many of the one-hot encoded features associated with the `state` variable appear among the lowest ranked features.
- The numerical features `score_4`, `score_5`, and `score_6` are all towards the bottom of the rank.

This suggests these can be dropped.
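When reading the ranking, it may help to recall how the chi-squared score is built. The sketch below is an assumption-laden illustration, not the library's code: it follows the formula I understand scikit-learn's `chi2` to use for non-negative features (per-class feature sums compared against the sums expected from class frequencies), and the function name and toy incomes are hypothetical. It shows that the statistic grows with the unit of the feature, which may partly explain why large-valued features such as `income` and `amount_borrowed` sit far above 0/1 indicator columns.

```python
def chi2_score(feature, y):
    """Chi-squared statistic between a non-negative feature and a binary target."""
    total = sum(feature)
    score = 0.0
    for cls in (0, 1):
        # Observed: sum of the feature over observations in this class
        observed = sum(x for x, t in zip(feature, y) if t == cls)
        # Expected: the feature total split according to class frequency
        expected = total * sum(1 for t in y if t == cls) / len(y)
        score += (observed - expected) ** 2 / expected
    return score


# Toy data (hypothetical): the same incomes in thousands and in single units.
y = [0, 0, 0, 1, 1, 1]
income_thousands = [50.0, 60.0, 70.0, 20.0, 30.0, 40.0]
income_units = [x * 1000 for x in income_thousands]

print(chi2_score(income_thousands, y))
print(chi2_score(income_units, y))
```

The two columns carry identical information, yet the score of the second is 1000 times larger, so scores are easier to compare between features measured on similar scales.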

In [25]:
drop_features = [
    "score_4", 
    "score_5", 
    "score_6"] + list(X_train.columns[X_train.columns.str.startswith("state_")])

X_train = X_train.drop(columns = drop_features)
X_validation = X_validation.drop(columns = drop_features)
X_test = X_test.drop(columns = drop_features)

### Model Selection

Since the problem at hand is a binary classification where the output should be class probabilities, the candidate models are those which can output probability predictions. At the same time, the models should be capable of handling the imbalance in the data. Of all the possible models, the ones considered were:

- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosted Trees (XGBoost)
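The imbalance handling used by these models can be sketched with simple arithmetic. This assumes scikit-learn's documented formula for `class_weight = "balanced"`, which is `n_samples / (n_classes * class_count)`, and the common negative-to-positive ratio for XGBoost's `scale_pos_weight`. The counts are those reported for the full training data in the EDA (50456 non-default, 9510 default), so the values differ slightly from those computed on the 80% training split.

```python
def balanced_class_weights(counts):
    """Weights inversely proportional to class frequency."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Class counts from the EDA: 0 = non-default, 1 = default.
counts = {0: 50456, 1: 9510}

weights = balanced_class_weights(counts)
# Ratio of negative to positive observations, as passed to XGBoost
scale_pos_weight = counts[0] / counts[1]

print(weights)
print(scale_pos_weight)
```

Both approaches make each default observation count several times more than a non-default one during training.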

#### Logistic Regression

For `LogisticRegression`, it's important to ensure the data is scaled. As with previous transformations, the scaler is fit on the training set, and the transformation is then applied to the validation and test sets.

For the subsequent models, scaling is not required.

In [26]:
# Scale the data for use with logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_validation_scaled = scaler.transform(X_validation)

In [27]:
lr_model = LogisticRegression(random_state = 17, 
                              solver = "lbfgs", 
                              C = 1, 
                              class_weight = "balanced")
lr_model.fit(X_train_scaled, y_train)

LogisticRegression(C=1, class_weight='balanced', dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=17, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [28]:
report_score(lr_model, X_train_scaled, X_validation_scaled, y_train, y_validation)

Train ROC AUC score: 0.771235988203459
Validation ROC AUC score: 0.7734978174186334


#### Decision Tree

In [29]:
dt_model = DecisionTreeClassifier(max_depth = 8,
                                  class_weight = "balanced",
                                  random_state = 17)
dt_model.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced', criterion='gini',
                       max_depth=8, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=17, splitter='best')

In [30]:
report_score(dt_model, X_train, X_validation, y_train, y_validation)

Train ROC AUC score: 0.7606874675204472
Validation ROC AUC score: 0.731697776674285


#### Random Forest

In [31]:
rf_model = RandomForestClassifier(n_estimators = 200,
                                  max_depth = 15,
                                  criterion = "gini",
                                  class_weight = "balanced",
                                  min_samples_leaf = 20,
                                  n_jobs = -1,
                                  random_state = 17)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=20, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=-1, oob_score=False, random_state=17, verbose=0,
                       warm_start=False)

In [32]:
report_score(rf_model, X_train, X_validation, y_train, y_validation)

Train ROC AUC score: 0.8066871780208538
Validation ROC AUC score: 0.7641061920798929


#### XGBoost

In [33]:
xgb_model = XGBClassifier(n_estimators = 300,
                          scale_pos_weight = y_train[y_train == 0].count() / y_train[y_train == 1].count(),
                          objective = "binary:logistic",
                          random_state = 17,
                          n_jobs = -1)

xgb_model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=17,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=5.270849673202615,
              seed=None, silent=None, subsample=1, verbosity=1)

In [34]:
report_score(xgb_model, X_train, X_validation, y_train, y_validation)

Train ROC AUC score: 0.7964735383431352
Validation ROC AUC score: 0.7759208594086552


Based on the scores reported above, the models with the best performances against the validation set were the Logistic Regression and XGBoost models. This suggests an ensemble that combines predictions from both models could perform even better.

### Tune XGBoost Hyperparameters

Before implementing the ensemble mentioned above, it makes sense to explore the hyperparameters for the XGBoost model in order to fine tune it for better performance.

In [35]:
xgb_params = {
    "max_depth": [3, 4, 5, 6],
    "gamma": [0, 0.1, 0.3, 0.5],
    "min_samples_leaf": [2, 5, 7, 10]
}

grid_search = GridSearchCV(estimator = xgb_model, 
                           param_grid = xgb_params, 
                           cv = 3,
                           scoring = "roc_auc_ovr_weighted",
                           n_jobs = -1)

grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

{'gamma': 0.1, 'max_depth': 3, 'min_samples_leaf': 2}


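The best parameters obtained at the time of writing were:

`{'gamma': 0.1, 'max_depth': 3, 'min_samples_leaf': 2}`

Note that `min_samples_leaf` is a scikit-learn tree parameter rather than an XGBoost one (XGBoost's closest analogue is `min_child_weight`), so this dimension of the grid most likely had no effect on the fitted models.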
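The cost of this search can be estimated by enumerating the grid as written in the cell above. The sketch only counts candidates and fits; it does not train anything. The extra final refit on the full training set that `GridSearchCV` performs by default is not included in the count.

```python
from itertools import product

# The grid from the cell above, copied as-is
param_grid = {
    "max_depth": [3, 4, 5, 6],
    "gamma": [0, 0.1, 0.3, 0.5],
    "min_samples_leaf": [2, 5, 7, 10],
}
n_folds = 3

# Every combination of one value per parameter is a candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "fits")
```

Since the number of fits grows multiplicatively with each added parameter, widening the grid quickly becomes expensive.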

As previously mentioned, a good model candidate is a soft voting ensemble that combines predictions from the previously trained Logistic Regression model, and the newly-optimized XGBoost model.

### Final Model - Voting Ensemble Classifier

Train a `VotingClassifier` combining the `XGBClassifier` with the parameters obtained in the grid search, and a pipeline composed of a scaler and the `LogisticRegression` model configured earlier. The ensemble refits its estimators on the training data, and its predictions are the average of the probabilities predicted by each model.

In [36]:
optimized_xgb_model = XGBClassifier(n_estimators = 300,
                                    max_depth = 3,
                                    scale_pos_weight = y_train[y_train == 0].count() / y_train[y_train == 1].count(),
                                    objective = "binary:logistic",
                                    gamma = 0.1,
                                    min_samples_leaf = 2,
                                    random_state = 17,
                                    n_jobs = -1)

voting_model = VotingClassifier(estimators = [("xgb", optimized_xgb_model), ("lr", make_pipeline(StandardScaler(), lr_model))],
                                voting = "soft",
                                n_jobs = -1)

voting_model.fit(X_train, y_train)

VotingClassifier(estimators=[('xgb',
                              XGBClassifier(base_score=0.5, booster='gbtree',
                                            colsample_bylevel=1,
                                            colsample_bynode=1,
                                            colsample_bytree=1, gamma=0.1,
                                            learning_rate=0.1, max_delta_step=0,
                                            max_depth=3, min_child_weight=1,
                                            min_samples_leaf=2, missing=None,
                                            n_estimators=300, n_jobs=-1,
                                            nthread=None,
                                            objective='binary:logistic',
                                            random_state=17, reg_alpha=0,
                                            reg_la...
                                                              with_std=True)),
                                  

In [37]:
report_score(voting_model, X_train, X_validation, y_train, y_validation)

Train ROC AUC score: 0.7859395039863737
Validation ROC AUC score: 0.7766497747389286


This final ensemble model has marginally better performance than either individual model. This will be the model used for the final predictions against the test set.
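Soft voting itself is simple to sketch. Assuming equal weights for both models (the default when no weights are given), the ensemble's probability for each observation is the mean of the individual probabilities. The function name and the probabilities below are hypothetical toy values, not outputs of the trained models.

```python
def soft_vote(probabilities_per_model):
    """Average positive-class probabilities across models, row by row."""
    n_models = len(probabilities_per_model)
    return [sum(row) / n_models for row in zip(*probabilities_per_model)]


# Toy positive-class probabilities for three observations
xgb_proba = [0.2, 0.7, 0.5]
lr_proba = [0.4, 0.9, 0.3]

ensemble_proba = soft_vote([xgb_proba, lr_proba])
print(ensemble_proba)
```

Averaging tends to help when the models make different kinds of errors, which fits the small gain observed here from combining a linear model with a tree-based one.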

### Predictions Against Test Set

In [38]:
predictions = voting_model.predict_proba(X_test)

In [39]:
y_hat_test = predictions[:,1]

In [40]:
test_predictions = pd.DataFrame(data = {
    "ids": test_df["ids"], 
    "predictions": y_hat_test
})

In [41]:
test_predictions.head()

Unnamed: 0,ids,predictions
0,e4366223-7aa2-0904-7a47-66479ae46b2a,0.314025
1,c6416108-c6d7-e6be-c4b5-923dd36c8ec4,0.766762
2,a90d3929-86ec-2414-89ba-543776b0e82b,0.618167
3,c5b96a7f-389a-28d0-242d-95db05e69da0,0.825871
4,1b461faa-926d-565d-b15d-0b452968ac81,0.641226


In [42]:
# Save model predictions to .csv file
test_predictions.to_csv("predictions.csv", index = False)

### Future Analysis

In the future, I'd like to explore the effect of balancing the dataset prior to fitting the models. I briefly experimented with SMOTE but did not use it in the final analysis: it did not improve the scores, it increased training times, and it seemed to increase overfitting.

I'd also like to tune the XGBoost classifier with a wider range of possible hyperparameter values.

Finally, I'd like to further explore feature engineering approaches. I discarded the `score_4`, `score_5`, and `score_6` features because they ranked very low in the feature importance analysis, but there may be an opportunity to combine them into something more useful. I would also like to explore whether some of the categorical variables I discarded due to high cardinality could be re-added and grouped into bins to improve the model predictions. Lastly, I want to investigate the relationship between the `sign` feature and the target. I initially discarded it, assuming it could not have any predictive power, but it ranks very high in feature importance, and including it significantly increased the models' scores.
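One possible way to bin a high-cardinality variable, offered only as a sketch of the idea and not something tried in this analysis, is to keep frequent categories and collapse the rest into a single label. The function name, the threshold, and the toy `reason` values are all hypothetical. In practice the frequencies would be computed on the training set only, in line with the other transformations.

```python
from collections import Counter


def group_rare_categories(values, min_count, other_label="other"):
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy values (hypothetical): "travel" and "boat" each appear only once.
reasons = ["car", "car", "home", "car", "travel", "home", "boat"]

grouped = group_rare_categories(reasons, min_count=2)
print(grouped)
```

After grouping, one-hot encoding produces one column per frequent category plus a single column for everything else.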
