# Intro

This notebook was created for the risk modeling task of the Sber x Skoltech hackathon, held 29-31 October 2021. It trains a model that estimates the probability of first payment default. The notebook loads the train data, fits the model, loads the test set, and predicts the probability of first payment default for the test set. The results are 'submission.csv' with the predictions, the pickled fitted model 'model.pkl', and the pickled feature list 'features.pkl'. Authors: Oleg Nikolaev, Nikita Kuznetsov, Anton Nevskii

Contact mail: oleg.nikolaev@skoltech.ru

# Import libraries

In [None]:
import pickle
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")
import os

# Load and modify train data
We load only one part of the train data (part 2), because loading and concatenating both parts gave no visible improvement in the model

In [None]:
SEED = 42
data_p1_link = '/kaggle/input/risk-management-uiim/train_part1.pkl'
data_p2_link = '/kaggle/input/risk-management-uiim/train_part2.pkl'
data_test_link = '/kaggle/input/risk-management-uiim/test_data.pkl'
submission_link = '/kaggle/input/risk-management-uiim/submission.csv'

df = pd.read_pickle(data_p2_link)

We sort the data by time because we are going to use 'year' as a feature, and we split the train set into train and validation parts **without** shuffling, so the model is fitted on the older dates and evaluated on the latest ones. This lets us check how well it predicts a future trend for the coming years. (This follows a comment from one of the Sber experts that 'year' alone is not informative, since the global market moves in economic waves.)

In [None]:
df.sort_values(by=['REPORT_DT'],axis=0, ascending=True, inplace=True)

In [None]:
# Categorical features stored as strings
cat_features=['x_618','x_628','x_13']
# Dates
date=['REPORT_DT','x_9']
# Numerical features with few distinct values, so they can be treated as categorical
num_features_pse_cat=['x_0','x_2','x_3']
# Numerical features
num_features=['x_4','x_5','x_7','x_317','x_286','x_292','x_189','x_321','x_291','x_63','x_85','x_124','x_421','x_183','x_111','x_100','x_330']


# To add the second part of the train data
# (not really needed: it makes no big difference)

# data_k2=df[[*cat_features, *date, *num_features_pse_cat, *num_features, 'TARGET']]
# data_p1 = pd.read_pickle(data_p1_link)
# data_k1=data_p1[[*cat_features, *date, *num_features_pse_cat, *num_features, 'TARGET']]
# del data_p1
# df=pd.concat([data_k1,data_k2],axis=0)

df=df[[*cat_features, *date, *num_features_pse_cat, *num_features, 'TARGET']]

d1=pd.DatetimeIndex(df['REPORT_DT'].values.astype('<M8[M]'))
d0=pd.DatetimeIndex(df['x_9'].values.astype('<M8[M]'))
delta = d1 - d0
df['delta']=delta.days
df['year']=d1.year
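
The two date features built above can be checked on a toy frame. This is only a sketch: the column names mirror the notebook, but the dates are made up for illustration.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, the dates are invented.
toy = pd.DataFrame({
    "REPORT_DT": pd.to_datetime(["2020-03-15", "2021-01-31"]),
    "x_9": pd.to_datetime(["2019-03-02", "2020-11-20"]),
})

# Casting to month resolution truncates each date to the first day of its month.
d1 = pd.DatetimeIndex(toy["REPORT_DT"].values.astype("<M8[M]"))
d0 = pd.DatetimeIndex(toy["x_9"].values.astype("<M8[M]"))

toy["delta"] = list((d1 - d0).days)  # days between the two month starts
toy["year"] = list(d1.year)          # calendar year of the report date
print(toy[["delta", "year"]])
```

Because both dates are truncated to the month first, 'delta' ignores the day of the month: the first toy row gives 366 days (2019-03-01 to 2020-03-01), the second gives 61.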

Fill NaN values in each column with the median value of that column

In [None]:
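# Medians are taken over the numeric columns only; the string and date columns are skipped
df.fillna(df.median(numeric_only=True), inplace=True)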
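
A minimal sketch of the median fill on invented values, assuming the medians are computed over numeric columns only. It shows that each numeric column gets its own median and that string columns keep their missing values.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the notebook, the values are invented.
toy = pd.DataFrame({
    "x_4": [1.0, np.nan, 3.0, 10.0],
    "x_5": [np.nan, 2.0, 2.0, 4.0],
    "x_618": ["a", "b", None, "a"],
})

# One median per numeric column; the string column is not included.
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column with the value under its own name.
filled = toy.fillna(medians)
print(filled)
```

The missing value in 'x_618' is still missing after the fill, so string columns need separate handling if they contain gaps.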

Encode the three categorical features with integer labels to make them usable during training

In [None]:
cleanup_nums = {
                'x_13':      {"1": 0,
                             "19": 7,
                             "2": 2,
                             "3": 3,
                             "4": 4,
                             "5": 5,
                             "9": 6,
                             "None":8},
                 'x_618':    {"Приобретение": 0,
                             "Инвестирование": 1,
                             "Нецелевой кредит под залог недвижимости": 2,
                             "Рефинансирование": 3,
                             "Индивидуальное строительство": 4},
                 'x_628':    {"ЗП": 0,
                             "Сотрудники": 1,
                             "Улица": 2},    
                }

df.drop(['REPORT_DT','x_9'], axis=1,inplace=True)
df = df.replace(cleanup_nums)
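
A small sketch of how a nested dictionary drives `replace`, using the 'x_628' mapping from above on invented rows. The label 'unseen_label' is hypothetical and is there only to show what happens to a value the mapping does not cover.

```python
import pandas as pd

# Toy frame: 'unseen_label' is a hypothetical value missing from the mapping.
toy = pd.DataFrame({
    "x_628": ["ЗП", "Улица", "Сотрудники", "unseen_label"],
    "x_4": [1.0, 2.0, 3.0, 4.0],
})

# Outer key selects the column, inner dict maps old value -> new value.
mapping = {"x_628": {"ЗП": 0, "Сотрудники": 1, "Улица": 2}}
encoded = toy.replace(mapping)

# Values that are not in the mapping pass through unchanged (still strings).
unmapped = encoded["x_628"].map(lambda v: isinstance(v, str))
print(encoded)
print("rows left unencoded:", int(unmapped.sum()))
```

A check like `unmapped` can be worth running on the test set too, since a leftover string would make the model fail at prediction time.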

Split the training dataset into train and validation parts using the Sklearn function train_test_split. Because the rows are ordered in time, the split must not be random, so shuffle is False (random_state then has no effect)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('TARGET', axis=1), df['TARGET'], test_size=0.20, random_state=SEED, shuffle=False)
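
A toy check of the time-ordered split, on invented dates: after sorting, `shuffle=False` puts the last 20% of the rows in the validation part, so every training date is no later than any validation date.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data in random date order; the values are invented.
toy = pd.DataFrame({
    "REPORT_DT": pd.to_datetime(
        ["2021-05-01", "2019-01-01", "2020-07-01", "2018-03-01", "2021-09-01"]
    ),
    "TARGET": [0, 1, 0, 0, 1],
}).sort_values("REPORT_DT")

# Without shuffling, the split simply cuts the sorted frame near the end.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy[["REPORT_DT"]], toy["TARGET"], test_size=0.20, shuffle=False
)

print("latest train date:      ", X_tr["REPORT_DT"].max())
print("earliest validation date:", X_val["REPORT_DT"].min())
```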

Train Logistic Regression model from Sklearn library with fixed random state

In [None]:
clf =LogisticRegression(solver='newton-cg',random_state=SEED, penalty='l2')
clf.fit(X_train, y_train)

The cell below calculates the roc_auc metric for the train predictions and for the held-out part (X_test)

In [None]:
# To see the ROC AUC score

print('TRAIN SCORE', roc_auc_score(y_train,clf.predict_proba(X_train)[:,1]))
print('TEST SCORE', roc_auc_score(y_test,clf.predict_proba(X_test)[:,1]))
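
ROC AUC is a ranking metric, not accuracy: it is the share of (positive, negative) pairs in which the positive example gets the higher predicted probability. A minimal sketch on four invented predictions:

```python
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# Count the pairs where the positive is ranked above the negative.
positives = [p for p, y in zip(y_prob, y_true) if y == 1]
negatives = [p for p, y in zip(y_prob, y_true) if y == 0]
ranked_ok = sum(p > n for p in positives for n in negatives)
manual_auc = ranked_ok / (len(positives) * len(negatives))

auc = roc_auc_score(y_true, y_prob)
print(manual_auc, auc)
```

Three of the four pairs are ordered correctly, so both values are 0.75. The pair count above ignores ties, which is fine here because no two predictions are equal.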

The cell below shows the trained weights of the model, sorted by absolute value

In [None]:
# To see weights of the model, largest absolute value first

# One coefficient per feature, labeled with the feature name
coef = pd.Series(clf.coef_[0], index=X_train.columns, name='coef')
coef.abs().sort_values(ascending=False)
                    

Delete the previously used dataframe to free RAM

In [None]:
del df

# Load test dataset and submission template

In [None]:
data_test = pd.read_pickle(data_test_link)
submission = pd.read_csv(submission_link)

Apply the same transformations to the test dataset as to the train dataset: generate the date features, fill in the missing values, encode the categorical features

In [None]:
df=data_test[[*cat_features, *date, *num_features_pse_cat, *num_features]]
d1=pd.DatetimeIndex(df['REPORT_DT'].values.astype('<M8[M]'))
d0=pd.DatetimeIndex(df['x_9'].values.astype('<M8[M]'))
delta = d1 - d0
df['delta']=delta.days
df['year']=d1.year
df.fillna(df.median(numeric_only=True), inplace=True)
cleanup_nums = {
                'x_13':      {"1": 0,
                             "19": 7,
                             "2": 2,
                             "3": 3,
                             "4": 4,
                             "5": 5,
                             "9": 6,
                             "None":8},
                 'x_618':    {"Приобретение": 0,
                             "Инвестирование": 1,
                             "Нецелевой кредит под залог недвижимости": 2,
                             "Рефинансирование": 3,
                             "Индивидуальное строительство": 4},
                 'x_628':    {"ЗП": 0,
                             "Сотрудники": 1,
                             "Улица": 2},    
                }

df.drop(['REPORT_DT','x_9'], axis=1,inplace=True)
df = df.replace(cleanup_nums)

# Calculate probabilities for test dataset and form .csv file for submission

In [None]:
submission['Probability'] = clf.predict_proba(df)[:,1]
submission.to_csv('submission.csv',index=False)
FEATURES=df.columns

In [None]:
with open('model.pkl', 'wb') as files:
    pickle.dump(clf, files)
with open('features.pkl', 'wb') as files:
    pickle.dump(FEATURES, files)
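
The two pickles are meant to be used together: the feature list restores the column order the model was fitted on. A minimal sketch of the round trip, assuming a toy model and a temporary directory; the feature values and the `new_rows` frame are invented for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the fitted clf; the data is invented.
x = np.linspace(-2, 2, 20)
X = pd.DataFrame({"x_4": x, "x_5": x ** 2})
y = (X["x_4"] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

# Save the model and its feature order, as in the cell above.
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "model.pkl")
features_path = os.path.join(tmp_dir, "features.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)
with open(features_path, "wb") as f:
    pickle.dump(X.columns, f)

# Later: load both and score new rows whose columns arrive in another order.
with open(model_path, "rb") as f:
    loaded_clf = pickle.load(f)
with open(features_path, "rb") as f:
    loaded_features = pickle.load(f)

new_rows = pd.DataFrame({"x_5": [1.0, 4.0], "x_4": [1.0, -2.0]})
proba = loaded_clf.predict_proba(new_rows[loaded_features])[:, 1]
print(proba)
```

Indexing with `loaded_features` reorders the columns before prediction, so the model sees them in the order it was trained on. Pickled Sklearn models should only be loaded from a trusted source and with a compatible library version.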