# Group Project - Risk-Based Segmentation



## Introduction

Customer segmentation categorizes the portfolio by industry, location, revenue, account size, number of employees, and many other variables to reveal where risk and opportunity sit within it. Those patterns then provide measurable data points for more predictive credit risk management. Taking a portfolio approach to risk management gives credit professionals a clearer view of the accounts, so they can develop strategies that better serve the segments with the best opportunities. It also makes it possible to improve performance in all customer segments, even seemingly risky ones.

Customer segmentation analysis can lead to tangible improvements in credit risk management, such as stronger credit policies and better internal communication and cooperation across teams.


In [None]:
#!pip install --upgrade git+http://github.com/renero/dataset
#!pip install skrebate
#!pip install gplearn
#!pip install git+git://github.com/andirs/impyte.git

In [20]:
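import pandas as pd
import RiskDataframe as rdf  # module that provides the RiskDataframe class

# The source file is semicolon-delimited
dataframe = pd.read_csv("AUTO_LOANS_DATA.csv", sep=";")

# Wrap the raw data so the RiskDataframe methods used below are available
myrdf = rdf.RiskDataframe(dataframe)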




In [2]:
myrdf.shape

(900860, 14)

In [3]:
myrdf.columns

Index(['REPORTING_DATE', 'ACCOUNT_NUMBER', 'CUSTOMER_ID', 'PROGRAM_NAME',
       'LOAN_OPEN_DATE', 'EXPECTED_CLOSE_DATE', 'ORIGINAL_BOOKED_AMOUNT',
       'OUTSTANDING', 'BUCKET', 'SEX', 'CUSTOMER_OPEN_DATE', 'BIRTH_DATE',
       'PROFESSION', 'CAR_TYPE'],
      dtype='object')

In [4]:
myrdf.isna().sum()

REPORTING_DATE                0
ACCOUNT_NUMBER                0
CUSTOMER_ID                   0
PROGRAM_NAME                  0
LOAN_OPEN_DATE                0
EXPECTED_CLOSE_DATE           0
ORIGINAL_BOOKED_AMOUNT        0
OUTSTANDING                   0
BUCKET                        0
SEX                        4528
CUSTOMER_OPEN_DATE            0
BIRTH_DATE                 4533
PROFESSION                 5558
CAR_TYPE                  11518
dtype: int64

## Checking the data types

In [5]:
myrdf.dtypes

REPORTING_DATE             object
ACCOUNT_NUMBER              int64
CUSTOMER_ID                 int64
PROGRAM_NAME               object
LOAN_OPEN_DATE             object
EXPECTED_CLOSE_DATE        object
ORIGINAL_BOOKED_AMOUNT    float64
OUTSTANDING               float64
BUCKET                      int64
SEX                        object
CUSTOMER_OPEN_DATE         object
BIRTH_DATE                 object
PROFESSION                 object
CAR_TYPE                   object
dtype: object



## 1) Implement a method .missing_not_at_random() 


In [6]:
myrdf.missing_not_at_random(input_vars=[]) 


Missing Not At Random Repport (MNAR) - SEX, BIRTH_DATE, PROFESSION, CAR_TYPE variables seem Missing Not at Random, there for we recommend: 
 
 Thin File Segment Variables (all others variables free of MNAR issue): REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, BUCKET, CUSTOMER_OPEN_DATE 
 
 Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, BUCKET, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE
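
The implementation of .missing_not_at_random() is not shown in this notebook. As a rough illustration of the kind of check such a method might run, the sketch below flags a column when the target rate differs between rows where the column is missing and rows where it is present. The function name `missing_report`, the `threshold` parameter, and the toy data are all hypothetical. The heuristic only shows that missingness is related to the target; observed data alone cannot prove that values are missing not at random.

```python
import pandas as pd


def missing_report(df, target, threshold=0.05):
    """Hypothetical helper: flag columns whose missingness is related to the target."""
    flagged = []
    for col in df.columns:
        mask = df[col].isna()
        # Skip the target itself and columns that are fully present or fully missing
        if col == target or not mask.any() or mask.all():
            continue
        rate_missing = df.loc[mask, target].mean()
        rate_present = df.loc[~mask, target].mean()
        if abs(rate_missing - rate_present) > threshold:
            flagged.append(col)
    return {
        "mnar": flagged,
        # Thin file: every column that was not flagged
        "thin_file": [c for c in df.columns if c not in flagged],
        # Full file: every column
        "full_file": list(df.columns),
    }


# Toy data (made up): SEX is missing mostly where BUCKET == 1,
# CAR_TYPE is missing equally often in both classes.
toy = pd.DataFrame({
    "BUCKET": [0, 0, 0, 0, 1, 1, 1, 1],
    "SEX": ["M", "F", "M", "F", None, None, None, "M"],
    "CAR_TYPE": [None, "A", "B", "C", None, "A", "B", "C"],
    "OUTSTANDING": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
report = missing_report(toy, target="BUCKET")
print(report["mnar"])
```

In this toy frame only SEX is flagged, so the thin-file list keeps BUCKET, CAR_TYPE, and OUTSTANDING while the full-file list keeps every column.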


In [12]:
pivot_value = 'ACCOUNT_NUMBER'
target_value = 'BUCKET'  # target variable, reused by the training calls below
down_payment = 'PROGRAM_NAME'
income_status = 'PROFESSION'
birth_date = 'BIRTH_DATE'
# Date columns; the output below replaces them with *_DAYS_LAPSED columns
dates_todays = ['REPORTING_DATE', 'LOAN_OPEN_DATE', 'EXPECTED_CLOSE_DATE', 'CUSTOMER_OPEN_DATE']

myrdf.start(pivot_value, birth_date, target_value, down_payment, income_status, dates_todays)




Unnamed: 0,ACCOUNT_NUMBER,CUSTOMER_ID,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,BUCKET,SEX,BIRTH_DATE,PROFESSION,CAR_TYPE,DOWN_PAYMENT,TYPE,REPORTING_DATE_DAYS_LAPSED,LOAN_OPEN_DATE_DAYS_LAPSED,EXPECTED_CLOSE_DATE_DAYS_LAPSED,CUSTOMER_OPEN_DATE_DAYS_LAPSED
143,144,144,140500.0,0.00,0,M,39,ACTIVE,UNKNOWN,0.5,EMPLOYED,2046,2274,1557,3038
247,248,248,70000.0,0.00,1,F,37,ACTIVE,UNKNOWN,0.5,EMPLOYED,2046,3570,1739,3583
308,309,307,65500.0,0.00,0,M,44,ACTIVE,UNKNOWN,0.5,EMPLOYED,2046,3276,1465,3282
350,351,349,44500.0,0.00,1,M,40,ACTIVE,UNKNOWN,0.5,EMPLOYED,2046,2619,1162,2624
465,466,12,93000.0,0.00,0,UNKNOWN,UNKNOWN,ACTIVE,UNKNOWN,0.0,CORPORATE,2046,3480,2049,7613
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900855,36547,35528,90000.0,78956.52,0,M,27,ACTIVE,GELORY,0.5,EMPLOYED,740,1076,-726,1087
900856,39597,38396,92500.0,92826.06,0,F,44,ACTIVE,GELORY,0.5,EMPLOYED,740,741,-1092,754
900857,38016,36905,140250.0,114919.47,0,M,41,ACTIVE,NISSAN,0.0,CORPORATE,740,901,188,952
900858,38899,37739,105000.0,101714.25,0,M,35,ACTIVE,DFSK,0.0,CORPORATE,740,804,-1000,833
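
The `*_DAYS_LAPSED` columns in the output above look like day counts between each date and a reference date, with negative values for dates that lie after the reference. The code inside .start() is not shown, so the following is only a minimal sketch of how such columns could be derived with pandas. The helper name `add_days_lapsed`, the `reference_date` argument, and the ISO date format of the toy data are assumptions.

```python
import pandas as pd


def add_days_lapsed(df, date_cols, reference_date):
    """Hypothetical helper: replace each date column with days elapsed until reference_date."""
    out = df.copy()
    ref = pd.Timestamp(reference_date)
    for col in date_cols:
        # Unparseable dates become NaT instead of raising
        parsed = pd.to_datetime(out[col], errors="coerce")
        out[col + "_DAYS_LAPSED"] = (ref - parsed).dt.days
        out = out.drop(columns=col)
    return out


# Toy data (made up)
loans = pd.DataFrame({
    "ACCOUNT_NUMBER": [1, 2],
    "LOAN_OPEN_DATE": ["2020-01-01", "2020-01-31"],
    "EXPECTED_CLOSE_DATE": ["2020-03-01", "2020-01-11"],
})
out = add_days_lapsed(
    loans, ["LOAN_OPEN_DATE", "EXPECTED_CLOSE_DATE"], reference_date="2020-01-31"
)
print(out)
```

A close date that falls after the reference date yields a negative count, which is consistent with the negative EXPECTED_CLOSE_DATE_DAYS_LAPSED values in the output.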


## 2) Implement the segmentation method




In [13]:
seg_data_cat =['SEX','PROFESSION','CAR_TYPE','TYPE']

In [14]:
myrdf.set_train_cat(target_value,seg_data_cat)

(['The total accuracy using all variable and Logistic regression is: 0.8800505050505051',
  'Using: SEX GINI Full Model Seg1: 30.631642616434807%',
  'Using: SEX GINI Segmented Model Seg1: 30.631642616434807%',
  'Using: SEX GINI Full Model Seg2: 42.067307692307686%',
  'Using: SEX GINI Segmented Model Seg2:42.067307692307686%',
  'Using: PROFESSION GINI Full Model Seg1: 28.9557348985082%',
  'Using: PROFESSION GINI Segmented Model Seg1: 28.9557348985082%',
  'Using: PROFESSION GINI Full Model Seg2: -100.0%',
  'Using: PROFESSION GINI Segmented Model Seg2:-100.0%',
  'Using: CAR_TYPE GINI Full Model Seg1: 12.444444444444436%',
  'Using: CAR_TYPE GINI Segmented Model Seg1: 12.444444444444436%',
  'Using: CAR_TYPE GINI Full Model Seg2: 33.2194543297746%',
  'Using: CAR_TYPE GINI Segmented Model Seg2:33.2194543297746%',
  'Using: TYPE GINI Full Model Seg1: 28.350663893777007%',
  'Using: TYPE GINI Segmented Model Seg1: 28.350663893777007%',
  'Using: TYPE GINI Full Model Seg2: 18.75%',
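
The notebook does not show how the GINI figures are computed. A common convention in credit scoring is Gini = 2 * AUC - 1, where AUC is the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The sketch below implements that convention in plain Python; the function names `auc` and `gini` and the toy numbers are illustrative, not taken from RiskDataframe.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient under the 2 * AUC - 1 convention."""
    return 2 * auc(y_true, scores) - 1


# Toy labels and model scores (made up)
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(f"GINI: {100 * gini(y, scores)}%")  # prints "GINI: 50.0%"
```

Under this convention a Gini of -100% means every positive case is scored below every negative one, which tends to happen only in very small segments. To compare a full model with a segmented model, the same function would be applied to each model's scores on the rows of a given segment.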
  

In [15]:
seg_data_num = ['ORIGINAL_BOOKED_AMOUNT','OUTSTANDING','BIRTH_DATE','DOWN_PAYMENT','REPORTING_DATE_DAYS_LAPSED','LOAN_OPEN_DATE_DAYS_LAPSED','EXPECTED_CLOSE_DATE_DAYS_LAPSED','CUSTOMER_OPEN_DATE_DAYS_LAPSED']

In [16]:
myrdf.encod(seg_data_cat)

Unnamed: 0,ACCOUNT_NUMBER,BIRTH_DATE,BUCKET,CAR_TYPE_AUDI,CAR_TYPE_BAIC,CAR_TYPE_BMW,CAR_TYPE_BRILLIANCE,CAR_TYPE_BYD,CAR_TYPE_CARRY,CAR_TYPE_CHANA,...,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,PROFESSION_ACTIVE,PROFESSION_UNEMPLOYED,REPORTING_DATE_DAYS_LAPSED,SEX_F,SEX_M,SEX_UNKNOWN,TYPE_CORPORATE,TYPE_EMPLOYED
143,144.0,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,140500.0,0.00,1.0,0.0,2046.0,0.0,1.0,0.0,0.0,1.0
247,248.0,37,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,70000.0,0.00,1.0,0.0,2046.0,1.0,0.0,0.0,0.0,1.0
308,309.0,44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,65500.0,0.00,1.0,0.0,2046.0,0.0,1.0,0.0,0.0,1.0
350,351.0,40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,44500.0,0.00,1.0,0.0,2046.0,0.0,1.0,0.0,0.0,1.0
465,466.0,UNKNOWN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,93000.0,0.00,1.0,0.0,2046.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900855,36547.0,27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,90000.0,78956.52,1.0,0.0,740.0,0.0,1.0,0.0,0.0,1.0
900856,39597.0,44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,92500.0,92826.06,1.0,0.0,740.0,1.0,0.0,0.0,0.0,1.0
900857,38016.0,41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,140250.0,114919.47,1.0,0.0,740.0,0.0,1.0,0.0,1.0,0.0
900858,38899.0,35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,105000.0,101714.25,1.0,0.0,740.0,0.0,1.0,0.0,1.0,0.0
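
The encoded frame above has one 0/1 column per category value (for example SEX_F, SEX_M, SEX_UNKNOWN). The body of .encod() is not shown, but a similar one-hot layout can be produced with `pandas.get_dummies`, as in this minimal sketch on made-up data. The column order of the real method may differ.

```python
import pandas as pd

# Toy data (made up), reusing column names from the notebook
toy = pd.DataFrame({
    "BUCKET": [0, 1, 0],
    "SEX": ["M", "F", "UNKNOWN"],
    "TYPE": ["EMPLOYED", "EMPLOYED", "CORPORATE"],
})

# One float indicator column per category; the listed columns are replaced
encoded = pd.get_dummies(toy, columns=["SEX", "TYPE"], dtype=float)
print(list(encoded.columns))
```

Columns that are not listed in `columns` (here BUCKET) pass through unchanged, so numeric variables can stay in the same frame for the later training step.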


In [17]:
myrdf.set_train_num(seg_data_cat,target_value,seg_data_num)

(['The total accuracy using all variable and Logistic regression is: 0.8767471410419314',
  'Using: ORIGINAL_BOOKED_AMOUNT GINI Full Model Seg1: 25.88729016786573%',
  'Using: ORIGINAL_BOOKED_AMOUNT GINI Segmented Model Seg1: 25.88729016786573%',
  'Using: ORIGINAL_BOOKED_AMOUNT GINI Full Model Seg2: 28.155339805825253%',
  'Using: ORIGINAL_BOOKED_AMOUNT GINI Segmented Model Seg2: 28.155339805825253%',
  'Using: OUTSTANDING GINI Full Model Seg1: 35.18469540685884%',
  'Using: OUTSTANDING GINI Segmented Model Seg1: 35.18469540685884%',
  'Using: OUTSTANDING GINI Full Model Seg2: 21.547861507128307%',
  'Using: OUTSTANDING GINI Segmented Model Seg2: 21.547861507128307%',
  'Using: BIRTH_DATE GINI Full Model Seg1: 33.36083154594951%',
  'Using: BIRTH_DATE GINI Segmented Model Seg1: 33.36083154594951%',
  'Using: BIRTH_DATE GINI Full Model Seg2: 38.882543743432784%',
  'Using: BIRTH_DATE GINI Segmented Model Seg2: 38.882543743432784%',
  'Using: DOWN_PAYMENT GINI Full Model Seg1: 37.663043

---