# Loan Default Prediction

The loan default prediction problem is used here to demonstrate a general, structured approach to a machine learning prediction problem. To showcase the approach, a simple regression model (Ridge regression) is used to predict the loss for the defaulting loans. The solution is not at all competitive: many parts are only outlined and not worked out or implemented. The missing pieces and possible improvements are pointed out in the text below.

In general, the following steps should be taken (not all are present in this solution):
- Loading data. (No error checking is present.)
- Statistical analysis and visualisation. (The statistical analysis is reduced to a single `describe` call and no visualisations are done.)
- Data cleaning and dealing with missing values. (Missing values are simply filled with the mean of the feature column. Better solutions should be used, e.g. sklearn's Imputer or custom methods. In addition, different fill values are used for the training and submission data, because the imputer is fitted separately on each, which introduces an extra bias. A possible improvement is to use the same values in both cleaning steps.)
- Feature engineering, including selection and dimensionality reduction. (Neither feature engineering nor reduction is done. Sklearn's PCA was tested, but sadly the current implementation in sklearn has a bug which is being resolved in a PR at this very moment: https://github.com/scikit-learn/scikit-learn/pull/10359. Besides unsupervised PCA, other dimensionality reduction methods exist; model weights could be investigated, correlation matrices of the features could be examined, etc.)
- Splitting data into a training and a test data set. (The given split is chosen without further reasoning. The test data given by the challenge is the data needed for the submission. In a larger development project further splits should be considered, e.g. data which is only made available for certain milestones and for the final implementation. This allows for a more independent evaluation.)
- Setting a benchmark error. (This is done by predicting the mean loss and evaluating the error. It is not done in this solution, but it helps to quantify the improvement gained from more elaborate models.)
- Model selection. (No model selection is done; Ridge is chosen without further reasoning.)
- Training the regressor, predicting and measuring the error. (The challenge uses the mean absolute error.)
- Validation and statistical analysis of the results. (No analysis of the results is done. As a next step, it would be beneficial to check whether the training and test data, and hence the predicted values, share common statistical metrics and follow a similar distribution. This helps to spot biases introduced by the chosen methods.)
- Iterating over the process again and again to improve the solution. (In the challenge at hand, working on the features will improve the result; evaluating different models and their parameters could also bring insights and hence help improve the results. The challenge has a special characteristic which one could make use of: separating the problem into two stages, classification of whether the loan will default and regression on the defaulting loans.)
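
The two-stage idea from the last step can be sketched as follows. This is a minimal illustration and not part of the solution: the helper names `fit_two_stage` and `predict_two_stage`, the choice of `LogisticRegression` for the first stage, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge


def fit_two_stage(X, y):
    """Fit a default classifier and a loss regressor (hypothetical helper)."""
    defaulted = y > 0
    # stage 1: will the loan default at all?
    classifier = LogisticRegression(max_iter=1000).fit(X, defaulted)
    # stage 2: how large is the loss, trained on the defaulting loans only
    regressor = Ridge(alpha=1.0).fit(X[defaulted], y[defaulted])
    return classifier, regressor


def predict_two_stage(classifier, regressor, X):
    """Predict zero loss unless the classifier flags a default."""
    y_pred = np.zeros(X.shape[0])
    flagged = classifier.predict(X).astype(bool)
    if flagged.any():
        y_pred[flagged] = regressor.predict(X[flagged])
    return y_pred


# synthetic stand-in for the loan data: most losses are exactly zero
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 5))
y_demo = np.where(X_demo[:, 0] > 1.0, 10.0 + 5.0 * X_demo[:, 1], 0.0)
y_demo = np.clip(y_demo, 0.0, 100.0)

classifier_demo, regressor_demo = fit_two_stage(X_demo, y_demo)
y_two_stage = predict_two_stage(classifier_demo, regressor_demo, X_demo)
```

Whether this split actually helps on the challenge data would have to be measured against the single Ridge model with the same error metric.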

### Comments on Software Development

The code is written (apparently) as a Jupyter notebook and is hence useful for interactive, step-by-step execution. Plenty of code is copied between the training and test (submission) sections. If a predictor is to be implemented for a production environment, many more aspects have to be considered: standalone execution with proper error handling and tests in place, monitoring capabilities, separating code into helper functions to avoid code repetition, etc.
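
As one example of such a helper, the cleaning steps that appear in both the training and the submission section could be collected in a single function. The sketch below is only an illustration: the name `extract_features` is hypothetical, the default column labels `f1` and `f778` are taken from the notebook, and the small demo frame stands in for the real data.

```python
import numpy as np
import pandas as pd


def extract_features(df, first='f1', last='f778'):
    """Clean a raw dataframe and return the feature matrix (hypothetical helper)."""
    df = df.replace('NA', np.nan)
    features = df.loc[:, first:last]
    # convert every feature column to float, as the notebook cells do
    return features.astype(float).values


# tiny stand-in frame; the notebook would pass df_raw and df_test_raw instead
df_demo = pd.DataFrame({'id': [1, 2], 'f1': ['1.5', 'NA'], 'f778': [3, 4], 'loss': [0, 7]})
X_demo_features = extract_features(df_demo)
```

With such a function, both sections would call the same code, so a fix to the cleaning logic only has to be made once.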

### Training a Regressor Model

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import Imputer

from sklearn.model_selection import train_test_split

from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

In [2]:
# loading training data
df_raw = pd.read_csv('./train_v2.csv')

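
The loading step has no error checking. A minimal sketch of what such checks could look like is given below. The function name `load_csv` and the set of checks are assumptions for illustration; `na_values` and `low_memory` are regular `pd.read_csv` parameters, but whether they resolve the type issues of this particular file has not been verified here.

```python
import os

import pandas as pd


def load_csv(path, required_columns=('id',)):
    """Load a CSV file with basic error checking (hypothetical helper)."""
    if not os.path.isfile(path):
        raise FileNotFoundError('data file not found: {}'.format(path))
    # low_memory=False parses the file in one pass, so each column gets a
    # single inferred dtype instead of per-chunk guesses
    df = pd.read_csv(path, na_values=['NA'], low_memory=False)
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError('missing expected columns: {}'.format(missing))
    if df.empty:
        raise ValueError('no rows found in {}'.format(path))
    return df
```

In the notebook this could be called as `load_csv('./train_v2.csv', required_columns=('id', 'f1', 'f778', 'loss'))`.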


In [3]:
# work on a copy so the data need not be reloaded if the dataframe gets messed up during experiments
df = df_raw.copy()

In [4]:
df.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f770,f771,f772,f773,f774,f775,f776,f777,f778,loss
0,1,126,10,0.686842,1100,3,13699,7201.0,4949.0,126.75,...,5,2.14,-1.54,1.18,0.1833,0.7873,1,0,5,0
1,2,121,10,0.782776,1100,3,84645,240.0,1625.0,123.52,...,6,0.54,-0.24,0.13,0.1926,-0.6787,1,0,5,0
2,3,126,10,0.50008,1100,3,83607,1800.0,1527.0,127.76,...,13,2.89,-1.73,1.04,0.2521,0.7258,1,0,5,0
3,4,134,10,0.439874,1100,3,82642,7542.0,1730.0,132.94,...,4,1.29,-0.89,0.66,0.2498,0.7119,1,0,5,0
4,5,109,9,0.502749,2900,4,79124,89.0,491.0,122.72,...,26,6.11,-3.82,2.51,0.2282,-0.5399,0,0,5,0


In [5]:
# simple statistical analysis
df.describe()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f770,f771,f772,f773,f774,f775,f776,f777,f778,loss
count,105471.0,105471.0,105471.0,105471.0,105471.0,105471.0,105471.0,105289.0,105370.0,105471.0,...,105471.0,105471.0,105471.0,105471.0,104407.0,103946.0,105471.0,105471.0,105471.0,105471.0
mean,52736.0,134.603171,8.246883,0.499066,2678.488874,7.354533,47993.704317,2974.336018,2436.363718,134.555225,...,17.422543,5.800976,-4.246788,3.273059,0.233852,0.014797,0.310246,0.322847,175.951589,0.799585
std,30446.999458,14.725467,1.691535,0.288752,1401.010943,5.151112,35677.136048,2546.551085,2262.950221,13.824682,...,18.548936,6.508555,4.828265,3.766746,0.073578,1.039439,0.462597,0.467567,298.294043,4.32112
min,1.0,103.0,1.0,6e-06,1100.0,1.0,0.0,1.0,1.0,106.82,...,2.0,0.0,-43.16,0.0,0.0,-18.4396,0.0,0.0,2.0,0.0
25%,26368.5,124.0,8.0,0.24895,1500.0,4.0,11255.0,629.0,746.0,124.29,...,5.0,1.48,-5.7,0.74,0.1984,-0.704275,0.0,0.0,19.0,0.0
50%,52736.0,129.0,9.0,0.498267,2200.0,4.0,76530.0,2292.0,1786.0,128.46,...,11.0,3.57,-2.6,1.99,0.2518,0.3754,0.0,0.0,40.0,0.0
75%,79103.5,148.0,9.0,0.749494,3700.0,10.0,80135.0,4679.0,3411.0,149.08,...,23.0,7.7,-1.01,4.44,0.2836,0.7371,1.0,1.0,104.0,0.0
max,105471.0,176.0,11.0,0.999994,7900.0,17.0,88565.0,9968.0,11541.0,172.95,...,168.0,58.12,0.0,34.04,0.4737,11.092,1.0,1.0,1212.0,100.0
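
The `describe` output already hints at two things worth a closer look: some columns have a count below 105471 (for example `f7` with 105289), and the 75% quantile of `loss` is 0, so most loans have no loss at all. A small sketch of how both could be quantified is shown below; the function name `summarise` and the demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def summarise(df, target='loss'):
    """Return missing-value counts per column and the share of non-zero targets."""
    missing_per_column = df.isnull().sum()
    # keep only the columns that actually have gaps, worst first
    missing_per_column = missing_per_column[missing_per_column > 0].sort_values(ascending=False)
    share_defaulting = (df[target] > 0).mean()
    return missing_per_column, share_defaulting


df_demo = pd.DataFrame({
    'f1': [1.0, np.nan, 3.0, np.nan],
    'f2': [1.0, 2.0, np.nan, 4.0],
    'loss': [0, 0, 5, 0],
})
missing_demo, share_demo = summarise(df_demo)
```

In the notebook, `summarise(df)` would give the same two figures for the real training data.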


In [6]:
# before "cleaning" - are there nan values
df.isnull().values.any()

True

In [7]:
# make sure that no 'NA' strings are left
# this should be handled in the loading step, but it apparently isn't; should be checked separately
df = df.replace('NA', np.nan)

In [8]:
# helper function, gives true if field is a number
is_numeric = np.vectorize(lambda x: np.issubdtype(x, np.number))

In [9]:
# are all columns of the dataframe of numeric type
mask_is_numeric = is_numeric(df.dtypes)
np.all(mask_is_numeric)

False

In [10]:
# explicit conversion to float for all fields which are not yet numeric (fixing the loading issue)
df.loc[:, ~mask_is_numeric] = df.loc[:, ~mask_is_numeric].applymap(float)
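
An alternative to the element-wise conversion is `pd.to_numeric`, which converts a whole column at once and, with `errors='coerce'`, turns unparsable entries into NaN instead of raising. The sketch below uses a small demo frame as a stand-in; it is an illustration of the technique and has not been run against the challenge data.

```python
import pandas as pd

df_demo = pd.DataFrame({'f1': ['1.5', '2', 'NA'], 'f2': [1, 2, 3]})

# columns that pandas did not parse as numbers
non_numeric = df_demo.select_dtypes(exclude='number').columns

# errors='coerce' turns every value that cannot be parsed into NaN
df_demo[non_numeric] = df_demo[non_numeric].apply(pd.to_numeric, errors='coerce')
```

Coercing silently hides malformed values, so it would be sensible to count the NaN values before and after the conversion.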

In [11]:
# extract feature data into np arrays
X = df.loc[:,'f1':'f778'].values

In [12]:
# extract target from dataframe into np array
y = df.loc[:,'loss'].values

In [13]:
# fill nan values with mean of corresponding feature column (axis=0) using Imputer
imp = Imputer(strategy='mean', axis=0)
imp.fit(X)
X = imp.transform(X)
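
Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions the equivalent step would use `SimpleImputer` from `sklearn.impute`, roughly as sketched below on a small demo array (the array and variable names are placeholders).

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_demo = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])

# SimpleImputer always works column-wise, so there is no axis argument
imp_demo = SimpleImputer(strategy='mean')
X_demo_filled = imp_demo.fit_transform(X_demo)
```

Other strategies such as `'median'` or `'most_frequent'` can be selected through the same `strategy` parameter.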

In [14]:
# after cleaning - are there any nan values in the data (not target) - False is good.
np.any(np.isnan(X))

False

In [15]:
# split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [16]:
# train model and predict
clf = Ridge(alpha=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
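
Here `alpha=1.0` is simply the default value. A first step towards model selection could be a cross-validated search over `alpha`, scored with the mean absolute error used by the challenge. The sketch below runs on synthetic data; the grid values are arbitrary assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic regression problem as a stand-in for X_train, y_train
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring='neg_mean_absolute_error',  # sklearn maximises scores, hence the sign
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['alpha']
best_mae = -search.best_score_
```

The same pattern extends to other regressors by swapping the estimator and the parameter grid.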

In [17]:
# evaluate model performance with different metrics
print('mse', mean_squared_error(y_test, y_pred))
print('r2', r2_score(y_test, y_pred))
print('mae', mean_absolute_error(y_test, y_pred))

mse 19.3222841009
r2 -8.10136863603e-05
mae 1.45495885607
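
An r2 score this close to zero suggests that the model does about as well as predicting a constant. To put the numbers in context, a benchmark error can be computed with sklearn's `DummyRegressor`. The sketch below uses small synthetic arrays as placeholders; it also includes the median, because a constant median prediction minimises the mean absolute error on the data it is fitted on.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# synthetic stand-in for the target: mostly zeros with a few positive losses
y_train_demo = np.array([0.0] * 9 + [10.0])
y_test_demo = np.array([0.0] * 8 + [5.0, 15.0])
# the dummy models ignore the features, so placeholders are enough
X_train_demo = np.zeros((10, 1))
X_test_demo = np.zeros((10, 1))

baseline_mae = {}
for strategy in ('mean', 'median'):
    dummy = DummyRegressor(strategy=strategy).fit(X_train_demo, y_train_demo)
    baseline_mae[strategy] = mean_absolute_error(y_test_demo, dummy.predict(X_test_demo))
```

In the notebook, the same loop with `X_train`, `y_train`, `X_test` and `y_test` would give the benchmark values to compare the Ridge result against.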


### Predict for Submission

In [18]:
# loading test data to predict values for submission
df_test_raw = pd.read_csv('./test_v2.csv')



In [19]:
df_test = df_test_raw.copy()

In [20]:
# "cleaning" procedure is copied from above - see comments there
df_test = df_test.replace('NA', np.nan)
mask_is_numeric = is_numeric(df_test.dtypes)
df_test.loc[:, ~mask_is_numeric] = df_test.loc[:, ~mask_is_numeric].applymap(float)
X_sub = df_test.loc[:,'f1':'f778'].values
imp = Imputer(strategy='mean', axis=0)
imp.fit(X_sub)
X_sub = imp.transform(X_sub)
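
Because the imputer is fitted again on `X_sub`, the submission data is filled with different column means than the training data. One way to avoid this is to bundle imputation and model in a pipeline that is fitted once on the training data. The sketch below assumes scikit-learn 0.20 or newer (for `SimpleImputer`) and uses small demo arrays as placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

X_train_demo = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])
X_sub_demo = np.array([[np.nan, 100.0], [2.0, np.nan]])

# the imputer learns its column means from the training data only
model = make_pipeline(SimpleImputer(strategy='mean'), Ridge(alpha=1.0))
model.fit(X_train_demo, y_train_demo)

# predict() reuses the training means to fill gaps in the submission data
y_sub_demo = model.predict(X_sub_demo)
fill_values = model.named_steps['simpleimputer'].statistics_
```

The `fill_values` array holds the training means, which makes it easy to confirm that the submission data did not influence them.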

In [21]:
df_test['loss'] = clf.predict(X_sub)
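
Before writing the submission, a quick sanity check is to compare summary statistics of the predictions with those of the training target. The sketch below shows one way to do this; the function name `compare_distributions` and the demo arrays are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def compare_distributions(y_true, y_pred):
    """Put summary statistics of targets and predictions side by side."""
    return pd.DataFrame({
        'target': pd.Series(y_true).describe(),
        'prediction': pd.Series(y_pred).describe(),
    })


y_true_demo = np.array([0.0, 0.0, 0.0, 10.0])
y_pred_demo = np.array([2.0, 3.0, 2.0, 3.0])
comparison = compare_distributions(y_true_demo, y_pred_demo)
```

In the notebook this could be called as `compare_distributions(y_train, df_test['loss'])`. The training losses lie between 0 and 100, so predictions outside that range would be a sign that the linear model needs post-processing.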

In [22]:
df_test[['id', 'loss']].to_csv('submission.csv', index=False)