# Alternative Data Credit Scoring Model

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

This credit scoring model is built on alternative data. Over 2.5 billion people have no bank account or credit history, so the traditional approach to calculating a credit score does not work for them. There is an alternative: collect data about these people that is not directly related to credit or finance, and feed it to a model to obtain a credit score. Such alternative data can include marital status, children, education, job, income, relatives, phone call history, SMS history, digital payment history, neighbors, home address, office address, assets, health, and more.


We use a subset of these features:

FEATURES:
1. **Age**
2. **Marital Status**: 1 - Married, 0 - Not Married
3. **Dependents**: Number of dependents (0, 1, 2, 3, 4, 5+)
4. **Total years of work experience**: Years of work experience (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10+)
5. **Working Status**: Unemployed - 0, Student - 1, Part-Time - 2, Full-Time - 3
6. **Highest education qualification**: High School - 0, Bachelor's - 1, Master's - 2, Doctorate - 3
7. **Area of study**: Life Sciences - 0, Mathematical Sciences - 1, Humanities - 2, Social Science - 3, Business - 4, Law - 5, Engineering & Technology - 6, Sport - 7
8. **Annual Income** (in thousands)

**Target**: 1 - Good User, 0 - Bad User
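
For readability, the integer codes listed above can be kept in lookup tables. The sketch below is one possible way to do that; the names `MARITAL_STATUS`, `WORK_STATUS`, `EDUCATION`, `AREA_OF_STUDY` and `decode` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical lookup tables mirroring the feature encodings listed above.
MARITAL_STATUS = {0: "Not Married", 1: "Married"}
WORK_STATUS = {0: "Unemployed", 1: "Student", 2: "Part-Time", 3: "Full-Time"}
EDUCATION = {0: "High School", 1: "Bachelor's", 2: "Master's", 3: "Doctorate"}
AREA_OF_STUDY = {
    0: "Life Sciences", 1: "Mathematical Sciences", 2: "Humanities",
    3: "Social Science", 4: "Business", 5: "Law",
    6: "Engineering & Technology", 7: "Sport",
}

def decode(row):
    """Turn a dict of integer codes into human-readable labels."""
    return {
        "marital_status": MARITAL_STATUS[row["marital_status"]],
        "work_status": WORK_STATUS[row["work_status"]],
        "education_quali": EDUCATION[row["education_quali"]],
        "area_study": AREA_OF_STUDY[row["area_study"]],
    }

example = decode({"marital_status": 1, "work_status": 3,
                  "education_quali": 0, "area_study": 7})
print(example)
```

Keeping the codes in one place like this makes it harder for the generated data and the documentation to drift apart.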

In [0]:
# Generating random "alternative data" for the credit scoring model (the target is random too, so the features carry no real signal)

rand_age_data = np.random.randint(21, 100 + 1, 1000)
rand_maritalstatus_data = np.random.randint(0, 1 + 1, 1000)
rand_dependents_data = np.random.randint(0, 20 + 1, 1000)
rand_workexp_data = np.random.randint(0, 50 + 1, 1000)
rand_working_status_data = np.random.randint(0, 3 + 1, 1000)
rand_edu_quali_data = np.random.randint(0, 3 + 1, 1000)
rand_area_study_data = np.random.randint(0, 7 + 1, 1000)
rand_annualinc_data = np.random.randint(0, 100 + 1, 1000) #In thousands

rand_target_data = np.random.randint(0, 1 + 1, 1000)

In [0]:
df_orig = pd.DataFrame({'age':rand_age_data,
                        'marital_status': rand_maritalstatus_data,
                        'dependents' : rand_dependents_data,
                        'work_exp': rand_workexp_data,
                        'work_status': rand_working_status_data,
                        'education_quali': rand_edu_quali_data, 
                        'area_study': rand_area_study_data,
                        'annual_income': rand_annualinc_data,
                        'target' : rand_target_data})
df_orig.head()

Unnamed: 0,age,marital_status,dependents,work_exp,work_status,education_quali,area_study,annual_income,target
0,80,1,0,26,0,2,1,54,1
1,66,1,1,37,3,0,7,70,0
2,92,1,16,34,0,2,6,54,1
3,79,0,1,28,0,1,7,42,0
4,100,1,17,35,3,2,5,40,1


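## Data Transformations

Continuous features are binned into ranges, while features that are already categorical are kept as they are, so that a Weight of Evidence can be computed per group.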

In [0]:
# Continuous values from the original dataframe "df_orig" are replaced with the ranges they lie in.
df_ranges = pd.DataFrame()

df_ranges['age_range'] = pd.cut(df_orig.age, [0, 25, 30, 35, 42, 51, 91, np.inf], right = False)
df_ranges['marital_status'] = df_orig.marital_status
df_ranges['dependents_range'] = pd.cut(df_orig.dependents, [0, 2, 6, np.inf], right = False)
df_ranges['work_exp_range'] = pd.cut(df_orig.work_exp, [0, 3, 6, 11, np.inf], right = False)
df_ranges['work_status'] = df_orig.work_status
df_ranges['education_quali'] = df_orig.education_quali
df_ranges['area_study'] = df_orig.area_study
df_ranges['annual_income_range'] = pd.cut(df_orig.annual_income, [0, 15, 20, 25, 30, 40, 50, np.inf], right = False)
df_ranges['target'] = df_orig.target

df_ranges.head()

Unnamed: 0,age_range,marital_status,dependents_range,work_exp_range,work_status,education_quali,area_study,annual_income_range,target
0,"[30.0, 35.0)",1,"[6.0, inf)","[6.0, 11.0)",2,0,1,"[50.0, inf)",1
1,"[51.0, 91.0)",0,"[6.0, inf)","[11.0, inf)",3,1,0,"[50.0, inf)",0
2,"[91.0, inf)",1,"[6.0, inf)","[11.0, inf)",2,0,7,"[50.0, inf)",1
3,"[91.0, inf)",1,"[6.0, inf)","[11.0, inf)",0,0,0,"[50.0, inf)",0
4,"[42.0, 51.0)",1,"[6.0, inf)","[11.0, inf)",3,2,6,"[40.0, 50.0)",0
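
With `right=False`, each bin is closed on the left and open on the right, so a boundary value such as 25 lands in `[25, 30)` rather than `[0, 25)`. The sketch below checks this on a few hand-picked ages; the `ages` series is made up for illustration, while the bin edges are the ones used for `age_range` above.

```python
import numpy as np
import pandas as pd

# Same age bin edges as in the cell above; right=False makes each bin [left, right).
age_bins = [0, 25, 30, 35, 42, 51, 91, np.inf]
ages = pd.Series([24, 25, 29, 30, 91])

binned = pd.cut(ages, age_bins, right=False)

# Show each age next to the interval it was assigned to.
for age, interval in zip(ages, binned):
    print(age, '->', interval)
```

Because the last edge is `np.inf`, every age of 91 or more falls into a single open-ended bin.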


### Weight Of Evidence (woe) calculation

![alt text](https://github.com/joshi98kishan/Alternative-Data-Credit-Scoring-Model/blob/master/images/woe-formula.PNG?raw=true)
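
The linked image is not reproduced here. Read off the code in this section (so treat it as a restatement under that assumption), the WoE of group $i$ is:

$$
\mathrm{WoE}_i = \ln\left(\frac{\%G_i}{\%B_i}\right),
\qquad
\%G_i = \frac{\#G_i}{\sum_j \#G_j} \times 100,
\qquad
\%B_i = \frac{\#B_i}{\sum_j \#B_j} \times 100
$$

Here $\#G_i$ and $\#B_i$ are the numbers of goods and bads in group $i$, matching the `#G`, `#B`, `%G` and `%B` columns built in the code.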

In [0]:
# "df_dict" is a dictionary of dataframes having woe values for every feature
df_dict = {}

# Iterating over names of columns of "df_ranges" dataframe
for name in df_ranges.columns:
  if name != 'target':
    
    # Getting the unique values of each column of "df_ranges" per iteration
    try:
      unique_ranges = list(df_ranges[name].unique().categories) # Binned (categorical) column: use its intervals
    except AttributeError:
      unique_ranges = list(df_ranges[name].unique()) # Plain integer-coded column

    len_df = len(unique_ranges)
    num_of_goods = 0
    num_of_bads = 0
    df = pd.DataFrame({('unique_' + name) : unique_ranges, # Column of unique values (groups)
                       '#B' : np.zeros(len_df),            # Number of bads in each group
                       '#G' : np.zeros(len_df),            # Number of goods in each group
                       '%B' : np.zeros(len_df),            # Percentage of bads with respect to all the group (#B / num_of_bads)
                       '%G' : np.zeros(len_df),            # Percentage of goods with respect to all the group (#G / num_of_goods)
                       '%G/%B' : np.zeros(len_df),         # Division of % of goods to % of bads of each group 
                       'woe' : np.zeros(len_df)})          # Natural log of the divided value in the previous column (Weight Of Evidence)

    # Counting the goods and bads in each group and storing them in the dataframe
    for i, interval in enumerate(df['unique_' + name]):
      value_counts = df_ranges[df_ranges[name] == interval].target.value_counts()
      df.iloc[i, 1] = value_counts.get(0, 0)  # .get avoids a KeyError when a group has no bads
      df.iloc[i, 2] = value_counts.get(1, 0)  # ... or no goods

    # Calculating the total number of bads and goods from the "#B" and "#G" columns
    num_of_bads = df['#B'].sum()
    num_of_goods = df['#G'].sum()

    #Calculating % of bads and goods, dividing them and taking a natural log
    for i in range(len(df)):
      df.iloc[i, 3] = (df.iloc[i, 1] / num_of_bads) * 100
      df.iloc[i, 4] = (df.iloc[i, 2] / num_of_goods) * 100
      df.iloc[i, 5] = df.iloc[i, 4] / df.iloc[i, 3]
      df.iloc[i, 6] = np.log(df.iloc[i, 5])

    # Saving the dataframe of each iteration
    df_dict['df_' + name] = df
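
The same per-group table can also be built without explicit row loops. The following is a minimal sketch using `pd.crosstab`, not the notebook's own implementation; the function name `woe_table` and the `toy` dataframe are hypothetical. A group with no goods or no bads yields an infinite WoE here, so such bins would need to be merged or smoothed before use.

```python
import numpy as np
import pandas as pd

def woe_table(feature, target):
    """Return per-group counts, percentage shares and WoE for one feature."""
    # Rows: groups of the feature, columns: 0 (bad) / 1 (good)
    counts = pd.crosstab(feature, target)
    counts = counts.reindex(columns=[0, 1], fill_value=0)

    table = pd.DataFrame({'#B': counts[0], '#G': counts[1]})
    table['%B'] = table['#B'] / table['#B'].sum() * 100
    table['%G'] = table['#G'] / table['#G'].sum() * 100
    table['woe'] = np.log(table['%G'] / table['%B'])
    return table

# Tiny hand-made example (not the notebook's data)
toy = pd.DataFrame({'marital_status': [1, 1, 1, 0, 0, 0, 0, 1],
                    'target':         [1, 1, 0, 0, 0, 1, 0, 1]})
toy_woe = woe_table(toy['marital_status'], toy['target'])
print(toy_woe)
```

In the toy data, group 1 holds three of the four goods and one of the four bads, so its WoE is ln(75 / 25) = ln(3), and group 0 gets the mirror value of -ln(3).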

**INFORMATION VALUE =**

![alt text](https://github.com/joshi98kishan/Alternative-Data-Credit-Scoring-Model/blob/master/images/iv_formula.PNG?raw=true)
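
The linked image is not reproduced here. Assuming it matches what `iv_of_feature` computes, the Information Value of a feature is the sum over its groups $i$:

$$
\mathrm{IV} = \sum_i \frac{\%G_i - \%B_i}{100} \cdot \mathrm{WoE}_i
$$

The division by 100 converts the percentage columns `%G` and `%B` back to fractions before they are weighted by the WoE of each group.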

In [0]:
#Calculating Information Value of each feature.

def iv_of_feature(df):
  iv = 0
  for i in range(len(df)):
    iv += ((df.iloc[i, 4] - df.iloc[i, 3]) / 100) * df.iloc[i, 6]

  return iv

Printing each dataframe in the dictionary along with its Information Value (IV).

In [0]:
df_dict['df_marital_status']

Unnamed: 0,unique_marital_status,#B,#G,%B,%G,%G/%B,woe
0,1,258.0,265.0,49.425287,55.439331,1.121679,0.114827
1,0,264.0,213.0,50.574713,44.560669,0.881086,-0.1266


In [0]:
iv_of_feature(df_dict['df_marital_status'])

0.014519534781711193
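
As a sanity check, the IV above can be recomputed by hand from the `#B` and `#G` counts in the `marital_status` table. This is only a worked arithmetic sketch; the variable names are illustrative.

```python
import math

# Counts copied from the marital_status table above: (bads, goods) per group
groups = {1: (258, 265), 0: (264, 213)}

total_bads = sum(bads for bads, goods in groups.values())    # 258 + 264 = 522
total_goods = sum(goods for bads, goods in groups.values())  # 265 + 213 = 478

iv = 0.0
for bads, goods in groups.values():
    share_bad = bads / total_bads
    share_good = goods / total_goods
    woe = math.log(share_good / share_bad)
    iv += (share_good - share_bad) * woe

print(iv)
```

The result agrees with the value returned by `iv_of_feature`, roughly 0.0145. A commonly cited rule of thumb treats an IV below about 0.02 as not useful for prediction, which is what one would expect from randomly generated data.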

In [0]:
df_dict['df_dependents_range']

Unnamed: 0,unique_dependents_range,#B,#G,%B,%G,%G/%B,woe
0,"[0.0, 2.0)",37.0,48.0,7.088123,10.041841,1.416714,0.34834
1,"[2.0, 6.0)",101.0,88.0,19.348659,18.410042,0.951489,-0.049727
2,"[6.0, inf)",384.0,342.0,73.563218,71.548117,0.972607,-0.027775


In [0]:
iv_of_feature(df_dict['df_dependents_range'])

0.011315419586711081

In [0]:
df_dict['df_work_exp_range']

Unnamed: 0,unique_work_exp_range,#B,#G,%B,%G,%G/%B,woe
0,"[0.0, 3.0)",22.0,24.0,4.214559,5.020921,1.191328,0.175068
1,"[3.0, 6.0)",36.0,36.0,6.896552,7.531381,1.09205,0.088057
2,"[6.0, 11.0)",63.0,51.0,12.068966,10.669456,0.884041,-0.123252
3,"[11.0, inf)",401.0,367.0,76.819923,76.778243,0.999457,-0.000543


In [0]:
iv_of_feature(df_dict['df_work_exp_range'])

0.0036958455643374753

In [0]:
df_dict['df_work_status']

Unnamed: 0,unique_work_status,#B,#G,%B,%G,%G/%B,woe
0,2,127.0,129.0,24.329502,26.987448,1.109248,0.103682
1,3,137.0,120.0,26.245211,25.104603,0.95654,-0.044432
2,0,128.0,115.0,24.521073,24.058577,0.981139,-0.019041
3,1,130.0,114.0,24.904215,23.849372,0.957644,-0.043279


In [0]:
iv_of_feature(df_dict['df_work_status'])

0.0038072064682793966

In [0]:
df_dict['df_education_quali']

Unnamed: 0,unique_education_quali,#B,#G,%B,%G,%G/%B,woe
0,0,136.0,128.0,26.05364,26.778243,1.027812,0.027432
1,1,123.0,100.0,23.563218,20.920502,0.887846,-0.118957
2,2,140.0,129.0,26.819923,26.987448,1.006246,0.006227
3,3,123.0,121.0,23.563218,25.313808,1.074293,0.071663


In [0]:
iv_of_feature(df_dict['df_education_quali'])

0.004607436027943981

In [0]:
df_dict['df_area_study']

Unnamed: 0,unique_area_study,#B,#G,%B,%G,%G/%B,woe
0,1,71.0,54.0,13.601533,11.297071,0.830573,-0.185639
1,0,73.0,60.0,13.984674,12.552301,0.897576,-0.108058
2,7,72.0,73.0,13.793103,15.271967,1.107218,0.10185
3,6,64.0,61.0,12.260536,12.761506,1.04086,0.040048
4,2,63.0,59.0,12.068966,12.343096,1.022714,0.02246
5,4,66.0,59.0,12.643678,12.343096,0.976227,-0.02406
6,3,56.0,58.0,10.727969,12.133891,1.131052,0.123148
7,5,57.0,54.0,10.91954,11.297071,1.034574,0.03399


In [0]:
iv_of_feature(df_dict['df_area_study'])

0.009526202315853436

In [0]:
df_dict['df_annual_income_range']

Unnamed: 0,unique_annual_income_range,#B,#G,%B,%G,%G/%B,woe
0,"[0.0, 15.0)",64.0,73.0,12.260536,15.271967,1.24562,0.219633
1,"[15.0, 20.0)",24.0,21.0,4.597701,4.393305,0.955544,-0.045475
2,"[20.0, 25.0)",28.0,24.0,5.363985,5.020921,0.936043,-0.066094
3,"[25.0, 30.0)",27.0,25.0,5.172414,5.230126,1.011158,0.011096
4,"[30.0, 40.0)",49.0,49.0,9.386973,10.251046,1.09205,0.088057
5,"[40.0, 50.0)",54.0,41.0,10.344828,8.577406,0.829149,-0.187355
6,"[50.0, inf)",276.0,245.0,52.873563,51.25523,0.969392,-0.031086


In [0]:
iv_of_feature(df_dict['df_annual_income_range'])

0.011515498915661407

In [0]:
# Function that checks whether a value lies in an interval

def contains(interval, value):
  """For left-closed, right-open intervals."""

  return interval.left <= value < interval.right
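
pandas `Interval` objects also support the `in` operator, which respects the interval's `closed` side, so the same membership check can be written without a helper. A minimal sketch, with an interval constructed by hand for illustration:

```python
import pandas as pd

# A left-closed, right-open interval like the ones pd.cut(..., right=False) produces
interval = pd.Interval(left=0, right=3, closed='left')

print(0 in interval)    # left edge is included
print(2.5 in interval)  # interior point
print(3 in interval)    # right edge is excluded
```

Using `in` has the advantage that it stays correct if the bins are later created with a different `closed` setting.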

In [0]:
# Function that returns the woe of a value

def generate_woe(column, value):
  key = 'df_' + column
  if key in df_dict:                 # Categorical feature: match the value exactly
    df = df_dict[key]

    for i, col_val in enumerate(df.iloc[:, 0]):
      if value == col_val:
        return df.iloc[i, 6]

  else:                              # Continuous feature: find the interval containing the value
    df = df_dict[key + '_range']

    for i, interval in enumerate(df.iloc[:, 0]):
      if contains(interval, value):
        return df.iloc[i, 6]

In [0]:
# Creating the final dataframe, replacing the original values in "df_orig" with their woe values

final_df = pd.DataFrame().reindex_like(df_orig)

for i, column in enumerate(df_orig.columns):
  if column != 'target':
    for j in range(len(df_orig)):
      final_df.iloc[j, i] = generate_woe(column, df_orig.iloc[j, i])

final_df.target = df_orig.target

final_df.head()

Unnamed: 0,age,marital_status,dependents,work_exp,work_status,education_quali,area_study,annual_income,target
0,-0.071008,0.114827,-0.027775,-0.123252,0.103682,0.027432,-0.185639,-0.031086,1
1,0.001405,-0.1266,-0.027775,-0.000543,-0.044432,-0.118957,-0.108058,-0.031086,0
2,0.117044,0.114827,-0.027775,-0.000543,0.103682,0.027432,0.10185,-0.031086,1
3,0.117044,0.114827,-0.027775,-0.000543,-0.019041,0.027432,-0.108058,-0.031086,0
4,-0.264164,0.114827,-0.027775,-0.000543,-0.044432,0.006227,0.040048,-0.187355,0
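
Looking up the WoE cell by cell works but scales poorly with the number of rows. For a binned feature, one alternative is to let `pd.cut` return integer bin positions (`labels=False`) and use them to index an array of WoE values. The sketch below assumes the `work_exp` bin edges used earlier and copies the WoE values from the `work_exp_range` table; the `work_exp` series itself is made up for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges used for work_exp, and the WoE of each bin copied from the table above
bins = [0, 3, 6, 11, np.inf]
woe_per_bin = np.array([0.175068, 0.088057, -0.123252, -0.000543])

# Hypothetical raw values to encode
work_exp = pd.Series([0, 2, 3, 10, 11, 40])

# labels=False returns the position of the bin each value falls into
bin_codes = pd.cut(work_exp, bins, right=False, labels=False)
woe_values = woe_per_bin[bin_codes.to_numpy()]
print(woe_values)
```

Values outside the bin edges would come back as missing bin codes, so this approach assumes every value lies within the range covered by `bins`.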


In [0]:
X = final_df.drop(columns='target')
y = final_df.target

In [0]:
#Splitting the data into 80:20 ratio of training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Model Training - Using Logistic Regression

Logistic regression is a standard, and perhaps the most common, choice at this stage of model development.

![alt text](https://github.com/joshi98kishan/Alternative-Data-Credit-Scoring-Model/blob/master/images/logistic-regression-formula.PNG?raw=true)

In [0]:
# Using Logistic Regression

model = LogisticRegression()
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
coefficients = model.coef_.reshape(-1, 1)
intercept = model.intercept_[0]

In [0]:
model.score(X_test, y_test)

0.47

## Scorecard Scaling
Scaling makes the scorecard easier for non-experts to understand.
It converts the output of a predictive classifier, here logistic regression, into a score.

The output of a binary logistic regression is a probability of class membership, so this probability is converted to a score.

![alt text](https://github.com/joshi98kishan/Alternative-Data-Credit-Scoring-Model/blob/master/images/score_formula.PNG?raw=true)

    ln(odds) is the log-odds output of the logistic regression, i.e. the linear combination of the features before the sigmoid is applied.


![alt text](https://github.com/joshi98kishan/Alternative-Data-Credit-Scoring-Model/blob/master/images/factor_formula.PNG?raw=true)

    The factor is set by the number of points, y, required for the odds to increase by a specified multiple, m.
    For example, it is common for the odds to double every 20 points, in which case the factor is:

    Factor = 20 / ln(2) = 28.85


![alt text](https://github.com/joshi98kishan/Alternative-Data-Credit-Scoring-Model/blob/master/images/offset_formula.PNG?raw=true)


    The offset fixes a base point: if odds of 30:1 should correspond to a score of 200 points, the offset is:
    
    Offset = 200 − (28.85 ∗ ln (30)) = 101.88
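
The two constants can be reproduced with a few lines of arithmetic. The sketch below uses the same assumptions as the text (20 points to double the odds, 200 points at odds of 30:1); the variable and function names are illustrative.

```python
import math

points_to_double_odds = 20   # points needed for the odds to double
target_score = 200           # score assigned to the reference odds
target_odds = 30             # reference odds of 30:1

factor = points_to_double_odds / math.log(2)
offset = target_score - factor * math.log(target_odds)

def score_from_ln_odds(ln_odds):
    """Convert log-odds to a scorecard score."""
    return offset + factor * ln_odds

print(round(factor, 2), round(offset, 2))
print(score_from_ln_odds(math.log(30)), score_from_ln_odds(math.log(60)))
```

With the unrounded factor the offset comes out near 101.86; the 101.88 in the text results from plugging in the rounded factor 28.85. Either way, odds of 30:1 map to about 200 points and odds of 60:1 to about 220.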


In [0]:
sample = X_test.iloc[0].values.reshape(1, -1)
ln_odds = np.dot(sample, coefficients) + intercept
ln_odds = ln_odds[0][0]

In [0]:
# The score for each set of features can be calculated:
score = 101.88 + 28.85 * ln_odds
score

100.12284799745635
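
The same calculation extends from one sample to many rows at once with a matrix product. The sketch below uses small made-up coefficients and feature rows in place of `model.coef_`, `model.intercept_` and `X_test`, so the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical stand-ins for model.coef_.reshape(-1, 1) and model.intercept_[0]
coefficients = np.array([[0.8], [-0.3], [0.5]])   # shape (n_features, 1)
intercept = 0.1

# Two hypothetical rows of WoE-encoded features
X = np.array([[0.11, -0.03, 0.10],
              [-0.13, 0.35, -0.19]])

ln_odds = (X @ coefficients).ravel() + intercept  # one log-odds value per row
scores = 101.88 + 28.85 * ln_odds                 # same constants as the cell above
print(scores)
```

For a fitted scikit-learn `LogisticRegression`, `model.decision_function(X_test)` should return the same log-odds without the manual dot product.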