# Group Project - Risk Based Segmentation 

This is document contains a description of the task and also a starter code. 
Implement your exercise by changing only this Jupyter Notebook and the class inside RiskDataFrame.py, deliver both files. 


## Introduction

Customer segmentation involves categorizing the portfolio by industry, location, revenue, account size, and number of employees and many other variables to reveal where risk and opportunity live within the portfolio. Those patterns can then provide key measurable data points for more predictive credit risk management. Taking a portfolio approach to risk management gives credit professionals a better fix on the accounts, in order to develop strategies for better serving segments that present the best opportunities. Not only that, you can work to maximize performance
in all customer segments, even seemingly risky segments.

Customer segmentation analysis can lead to several tangible improvements in credit risk management: stronger credit policies, and improved internal communication and cooperation across teams.

## Task scope
Your group is working in the retail risk modeling team and you are asked to build a class to perform risk-based segmentation and test it for car loans’ customers based on given historical data of customer behavior. The class must perform the segmentation from a risk management perspective.

## Class
We will use Nico's great initial code, which extends Pandas DataFrame in a magical way turning our own class into like-Pandas:

    #Initializing the inherited pd.DataFrame
    def __init__(self, *args, **kwargs):
        super().__init__(*args,**kwargs)
    
    @property
    def _constructor(self):
        def func_(*args,**kwargs):
            df = RiskDataframe(*args,**kwargs)
            return df
        return func_

### PERFORM YOUR OWN DATA CLEANNING & DATA PREPARATION HERE

Your objective in this part is simply to prepare the data to apply to the missing_not_at_random and find_segment_split
methods. Do not overcomplicate the data cleanning and data preparation. Keep it simple!


In [5]:
# Example of use
import pandas as pd
import RiskDataframe as rdf
from datetime import datetime
from datetime import timedelta


dataframe = pd.read_csv("AUTO_LOANS_DATA.csv", sep=";")
dataframe['BINARIZED_TARGET'] = dataframe['BUCKET'].apply(lambda x: 1 if x>0 else 0)

# PERFORM YOUR OWN DATA CLEANNING & DATA PREPARATION HERE - START!
# Use any data cleanning / data preparation from individual project / it won't be assessed here)!
dataframe = dataframe[dataframe['PROGRAM_NAME'].str.contains("Corporate")==False]    
dataframe.sort_values(by=['LOAN_OPEN_DATE'])
dataframe.drop_duplicates('CUSTOMER_ID', keep = 'last', inplace = True)
dataframe.drop('ACCOUNT_NUMBER', inplace=True, axis=1)
dataframe.drop('CUSTOMER_ID', inplace=True, axis=1)
dataframe.drop('BUCKET', inplace=True, axis=1)
# PERFORM YOUR OWN DATA CLEANNING HERE  - END!

myrdf = rdf.RiskDataframe(dataframe)
myrdf.shape

(38386, 12)

In [6]:
myrdf.columns

Index(['REPORTING_DATE', 'PROGRAM_NAME', 'LOAN_OPEN_DATE',
       'EXPECTED_CLOSE_DATE', 'ORIGINAL_BOOKED_AMOUNT', 'OUTSTANDING', 'SEX',
       'CUSTOMER_OPEN_DATE', 'BIRTH_DATE', 'PROFESSION', 'CAR_TYPE',
       'BINARIZED_TARGET'],
      dtype='object')

In [7]:
myrdf.dtypes

REPORTING_DATE             object
PROGRAM_NAME               object
LOAN_OPEN_DATE             object
EXPECTED_CLOSE_DATE        object
ORIGINAL_BOOKED_AMOUNT    float64
OUTSTANDING               float64
SEX                        object
CUSTOMER_OPEN_DATE         object
BIRTH_DATE                 object
PROFESSION                 object
CAR_TYPE                   object
BINARIZED_TARGET            int64
dtype: object

### Setting the types

Make sure you use it inside the methods if you need this information, if you already transformed the data update this part.

In [8]:
# Setting the types
argument_dict = {'REPORTING_DATE':'datetime64[ns]','LOAN_OPEN_DATE':'datetime64[ns]',
                 'EXPECTED_CLOSE_DATE':'datetime64[ns]','CUSTOMER_OPEN_DATE':'datetime64[ns]',
                 'BIRTH_DATE':'datetime64[ns]','PROGRAM_NAME':'category','SEX':'category',
                'PROFESSION':'category','CAR_TYPE':'category'}
myrdf.SetAttributes(argument_dict)
myrdf.dtypes

REPORTING_DATE            datetime64[ns]
PROGRAM_NAME                    category
LOAN_OPEN_DATE            datetime64[ns]
EXPECTED_CLOSE_DATE       datetime64[ns]
ORIGINAL_BOOKED_AMOUNT           float64
OUTSTANDING                      float64
SEX                             category
CUSTOMER_OPEN_DATE        datetime64[ns]
BIRTH_DATE                datetime64[ns]
PROFESSION                      category
CAR_TYPE                        category
BINARIZED_TARGET                   int64
dtype: object

#  Implement the following 2 methods to automate the Risk-based segmentation process:
* You can implement more methods if you think it is necessary.
* In computer science, when we are dividing the code it is important to think which code does what. For example, __data cleanning__ and __data preparation__, is it done by the Risk Based Segmentation Class or the class assumes that the data is clean and ready for modelling (all variables are numeric, and dummies are already provided)?
* Use: the input dataset should already be clean and ready for the trainning of a Logistic Regression with a binary target 0 and 1 class. 
* Scope: data cleanning and data preparation is out of the scope of the Class, but notice that .missing_not_at_random() requires the data to have missing values.

## 1) Implement a method .missing_not_at_random() 
To identify different potential segments sharing data (based on sharing missing values) - Expected result is a print:
Missing Not At Random Repport (MNAR) -  PROFESSION, SEX and BIRTH_DATE variables seem Missing Not at Random, there for we recommend:

&emsp;  Thin File Segment Variables (all others variables free of MNAR issue): REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, 
OUTSTANDING, CUSTOMER_OPEN_DATE, CAR_TYPE

&emsp; Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE


In [5]:
myrdf.missing_not_at_random(input_vars=[]) # If input_vars == [] use all varibles of the df, 
                                           # otherwise only use the variables informed in the input_vars list.

'To be implemented.'

---
OUTPUT SAMPLE:

__Missing Not At Random Repport__ -  REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID variables seem Missing Not at Random, there for we recommend:

&emsp;  Thin File Segment Variables: PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, 
ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, SEX, 
CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

&emsp; Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, 
OUTSTANDING, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

---

##	2) implement a method .find_segment_split(variable)
given one variable, implement a method to identify if the variable is a good segmentation splitter and if the variable is a good splitter,   
different segments of customers with different level of risk (the one explained in the second video)
* Scope: data cleanning and data preparation is out of the scope of the Class, note that .find_segment_split(VARIABLE) assumes the data is already clean free of missing values.
* Categorical: for the segmentation process of categorical variable, dummy transformation is not practical, it is recommended that categorical variables come pre-transformed into numerical by replacing the categories by the Probability of belonging to class 1.
* The following code only works for a single variable, implement a loop going over each variable of the dataset as a candidate for segmentation.
* The following method must implement two segmentation approaches, one for Categorical Nominal (order not relevant - variable must be automatically transformed) and others where order is important.



In [6]:
myrdf.find_segment_split(canditate='SEX', # If the candidate variable is Nominal, in the class transform 
                                          # into probability of class 1 before performing the segmentation algorithm.
                         input_vars=['ORIGINAL_BOOKED_AMOUNT', 'OUTSTANDING'],  # List of input variables ready for 
                                                                                # Logistic Regression, dates and 
                                                                                # text must be transformed before 
                                                                                # invoking the method.   
                         target='BINARIZED_TARGET') # The target variable must be 0 or 1

'To be implemented.'

# TO DO:

## Mandatory (part 1 - p1) -  Implement the following: (5 out of 10 points)

- __Project Name__: pick a name for your project (if it is taken at https://pypi.org/ please create small variations), I recommend you get inspiration by the following Pockemon names: some_pockemon_examples.zip

- __Project Managment (Github)__: Work in group using Github, invite professor manoelgad@gmail.com as a collaborator to your project from the very beggining.

- __Implementation__: Discuss in group and decide the implementation you need to do for each method (missing_not_at_random and find_segment_split), then do the implementation of missing_not_at_random and find_segment_split. Your implementation should work in any dataset (make the necessary assumption and inform the user if the assumption are not followed, for example: inform the dataset must be clean and types must be informed in case they are not). 

- __Video__: Create a video from 5 to 15 minutes explainning the whole library (including p1 and p2) and showing examples of how to use it. The video will not be assessed own its own and won't be assessed by colleagues. The video can be very simple, just the notebook/python class and someone explainning things. Upload the video to Youtube and include a link to the video in the website if your group decide to pick Publishing below.    


## Improvements (part 2 - p2) -  Implement 2 of the following list of tasks:  (5 out of 10 points)

- __Improving__: Make improvements to the code -  Reliable/Robust: Create a train-test split, train all models in train and test all models always in test; Robust: Research and apply a statistical test to decide when the accuracy diffrence in statiscally relevant. Small functions/methods: Break your implementation into small functions/methods; Fast: Optimize your code, use vectorization when possible. Use stratified random sampling to reduce dataset sizes and therefore speed up the segmentation process. Implement segmentation split using Tree algorithm.

- __Publishing__: Publish your code in GitHub -  Work in group using Github, invite professor manoelgad@gmail.com as a collaborator to your project from the very beggining. Create a python package and distribute your package using https://pypi.org/, by the end of the project one must be able to pip install your project and use it.
References: https://www.youtube.com/watch?v=GIF3LaRqgXo; and  https://github.com/judy2k/publishing_python_packages_talk

- __Testing__: Implement a Test class using unittest with an "comprehensive" set of tests using a series of datasets of your choice. Have a look at this: https://ains.co/blog/things-which-arent-magic-flask-part-1.html and https://www.youtube.com/watch?v=1Lfv5tUGsn8

- __Documentation__: create a documentation for your project and publish it at GitHub project (readme) and also a pythonanywhere.com website (simple HTML). The documentation must contain an about, a how to and also examples of how to use with one or more datasets. All used datasets should be provided within the project (make sure you don't share huge datasets, make it small before sharing your code).

- __Logging & Repporting__: Log all intermediate results and final results into a Sqlite database using SQLAlchemy, then produce the final result repport in HTML format using Bokeh.


# Evaluation criteria

All team members have the choice of focusing, by choising 1 or 2 of the tasks of part 2.
*   If 1 task of part 2 is choosen, grade will be: p1\*0.5 + p2x\*0.4 + (p2y\*0.1) (p2x is the grade in the choosen task and p2y the other) 
*   If 2 tasks of part 2 are choosen, grade will be: p1\*0.5 + (p2x\*0.25 + p2y\*0.25)
* If your group only implement 1 extra part, all members will be assessed using: p1\*0.5 + p2x\*0.4 + (p2y\*0.1)
* If your group implements more then 2 parts, please indicate the ones you want to be assessed upon.


What professor will look at when assessing the project:
*	Problem structuring - How did you structure the problem and the project?
*	What assumptions did you make? (Please mention them in the video)
*	How did you narrow the scope? (Please mention them in the video)
*	Technical Skills: How reliable (does it use your own class? does is it apply data quality controls?)), readable and flexible (can you apply your code to a new dataset?) was the code that you developed?
* Analytical Skills: How logically sound, complete and meaningful was the approach (machine learning, statistics, analytics, visualization…) that you applied?
*	Usefulness:	How useful would the results of your work for new datasets?



# Tools
You are allowed to use __Python only__ and any Python Library inside your Jupyter Notebook or inside your Class, always give preference to Pandas and Scikit Learn whenever you can.



# Deliverables
*	A zip file with all code and datasets used for the projet.


# Data description
We will provide you with historical data of car loans. The data contains monthly status for each loan for 3 years. In addition to some demographic information

# Notes:
*	This data is Loan level NOT Customer level, meaning that one customer can take more than one loan
*	The data is monthly starting from 2016-01-01 to 2019-09-01 so if the loan already started before Jan2016 you will find partial history for it.
*	We have multiple programs under the car loans product
*	Make sure you understand the difference between Buckets

# Research:
In order to implement the methods missing_not_at_random and find_segment_split, you are allowed to search for whichever information you need in the internet including but not limited to:
*	Code syntax
*	Business term (However you can ask me)

Start by looking into these 3 videos:
*	What is Risk-based segmentation? https://www.youtube.com/watch?v=2ZpLgUcucfQ 
*   This is a generic video on Segmentation, it is a good reference, but careful not all needs to be implemented and not all mentioned here is relevant for this project: https://www.youtube.com/watch?v=PLsUfDDytaE 


---

# APPENDIX: 

## Simple example of Risk Based Segmentation

*   Video explainning the code below: https://www.youtube.com/watch?v=kWtnlpGwh_o


In [7]:
df = dataframe#pd.read_csv("AUTO_LOANS_DATA.csv", sep=";")

In [None]:
argument_dict = {'REPORTING_DATE':'datetime64[ns]','LOAN_OPEN_DATE':'datetime64[ns]',
                 'EXPECTED_CLOSE_DATE':'datetime64[ns]','CUSTOMER_OPEN_DATE':'datetime64[ns]',
                 'BIRTH_DATE':'datetime64[ns]','PROGRAM_NAME':'category','BUCKET':'category','SEX':'category',
                'PROFESSION':'category','CAR_TYPE':'category'}
myrdf.SetAttributes(argument_dict)
myrdf.dtypes

### Random Sample trick to speed up the process...

In [8]:
from sklearn.model_selection import train_test_split
df = dataframe
df_random_sample, _ = train_test_split(dataframe, test_size=0.95)

In [9]:
df = df_random_sample

In [10]:
df

Unnamed: 0,REPORTING_DATE,PROGRAM_NAME,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,SEX,CUSTOMER_OPEN_DATE,BIRTH_DATE,PROFESSION,CAR_TYPE,BINARIZED_TARGET
893110,2019-08-29,Auto Loans 50% Down Payment - Employed,2019-08-26,2024-09-03,288750.0,290097.50,M,2019-07-30,1974-01-17,Company Owner,HYUNDAI,0
564178,2018-02-28,Auto Loans 50% Down Payment - Self Employed,2015-02-09,2020-02-03,30000.0,0.00,F,2010-08-09,1971-01-03,HOUSEWIFE,Saipa,0
888003,2019-08-29,Auto Loans 40% Down Payment - Employed,2016-09-26,2023-09-03,70200.0,50953.96,M,2016-09-06,1974-01-26,Manager,HYUNDAI,0
257509,2016-11-30,Auto Loans 30% Down Payment - Self Employed,2016-04-24,2023-05-03,129500.0,0.00,M,2016-04-19,1988-03-10,Company Owner,HYUNDAI,0
896560,2019-08-29,Auto Loans 30% Down Payment - Employed,2015-06-11,2020-08-03,72730.0,22819.58,F,2015-03-05,1978-10-04,Manager,SUZUKI,1
...,...,...,...,...,...,...,...,...,...,...,...,...
895386,2019-08-29,Pick Up and Small Trucks,2018-11-19,2020-12-03,170400.0,117036.06,M,2018-10-30,1962-04-10,Business Man / Trader,CHEVROLET,0
168544,2016-08-31,Auto Loans 50% Down Payment - Self Employed,2014-10-02,2016-09-03,43500.0,0.00,M,2014-09-11,1969-09-04,Company Owner,GELY,0
897636,2019-08-29,Auto Loans 40% Down Payment - Employed,2017-07-16,2020-06-03,92400.0,32455.49,M,2017-07-10,1988-03-01,Athletes,BYD,0
452927,2017-07-31,Auto Loans 50% Down Payment - Self Employed,2014-09-30,2017-10-03,49500.0,5248.05,F,2014-09-22,1972-08-02,HOUSEWIFE,HYUNDAI,0


In [11]:
df.head()

Unnamed: 0,REPORTING_DATE,PROGRAM_NAME,LOAN_OPEN_DATE,EXPECTED_CLOSE_DATE,ORIGINAL_BOOKED_AMOUNT,OUTSTANDING,SEX,CUSTOMER_OPEN_DATE,BIRTH_DATE,PROFESSION,CAR_TYPE,BINARIZED_TARGET
893110,2019-08-29,Auto Loans 50% Down Payment - Employed,2019-08-26,2024-09-03,288750.0,290097.5,M,2019-07-30,1974-01-17,Company Owner,HYUNDAI,0
564178,2018-02-28,Auto Loans 50% Down Payment - Self Employed,2015-02-09,2020-02-03,30000.0,0.0,F,2010-08-09,1971-01-03,HOUSEWIFE,Saipa,0
888003,2019-08-29,Auto Loans 40% Down Payment - Employed,2016-09-26,2023-09-03,70200.0,50953.96,M,2016-09-06,1974-01-26,Manager,HYUNDAI,0
257509,2016-11-30,Auto Loans 30% Down Payment - Self Employed,2016-04-24,2023-05-03,129500.0,0.0,M,2016-04-19,1988-03-10,Company Owner,HYUNDAI,0
896560,2019-08-29,Auto Loans 30% Down Payment - Employed,2015-06-11,2020-08-03,72730.0,22819.58,F,2015-03-05,1978-10-04,Manager,SUZUKI,1


### Dirty variable selection, feature transformation and data cleanning

In [13]:
#df = df.fillna(0)

In [14]:
def get_specific_columns(df, data_types, to_ignore = list(), ignore_target = False):
    columns = df.select_dtypes(include=data_types).columns
    if ignore_target:
        columns = filter(lambda x: x not in to_ignore, list(columns))
    return list(columns)

In [15]:
target = 'BINARIZED_TARGET'

In [16]:
all_numeric_variables = get_specific_columns(df, ["float64", "int64"], [target], ignore_target = True)

# LogisticRegression - Full Model - all variables
You sould use LogisticRegression in the modeling part to avoid any overfitting issues, and also split your data into train and test split.


In [17]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
splitter = train_test_split
"-----------------------"

df_train, df_test = splitter(df, test_size = 0.2, random_state = 42)

In [18]:
X_train = df_train[all_numeric_variables]
y_train = df_train[target]

In [19]:
X_test = df_test[all_numeric_variables]
y_test = df_test[target]

In [20]:
from sklearn.linear_model import LogisticRegression
method = LogisticRegression(random_state=0)
fitted_full_model = method.fit(X_train, y_train)
y_pred = fitted_full_model.predict(X_test)

In [21]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.90625

# GINI vs Accuracy - use GINI for this analysis!

GINI as well as accuracy is a 0 to 1 measure, 0 being very bad prediction and 1 being perfect separation.
For this project __you should use GINI__ as it looks the model in all predictions (all range of probabilities), accuracy gets the probability and using a cut-off and transform the probability into predicted class 0 for probabilities below 50% and predicted class 1 for above or equal to 50%. So using accuracy makes our analysis for segmentation very short sighted as the result of the analysis could change if one changes the cut-off to let say 40%, for this reason we will use the GINI coeficient which is independent of the cut-off having a better overview of the whole model predictions.

GINI is a simple calculation resulting from AUC. You will not find directly the Gini Coefficient as an attribute for the LogisticRegressor Class, but you can use the 2*AUC-1 formula to calculate it. 

If you want more details about GINI have a look into this video:
https://www.youtube.com/watch?v=MiBUBVUC8kE


Make sure you use .predict_proba (to predict probability) and then get the first column using [:,1] to get only the probability of being 1, instead of .predict which gives the 0 or 1 class. This proba is  what you need to pass as predictions_list below, to finally obtain the GINI:

In [22]:
y_pred_probadbility = fitted_full_model.predict_proba(X_test)[:,1]
#y_test is your actual 0 and 1 class and y_pred_probadbility is the predicted probability of belonging to class 1.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probadbility)
roc_auc = auc(fpr, tpr)
GINI = (2 * roc_auc) - 1
print(GINI)

0.09977650063856958


# Analysis Model 1 - Gender M and F

In [23]:
df['SEX'].value_counts()

M    1379
F     540
Name: SEX, dtype: int64

In [24]:
df_train_seg1 = df_train[df['SEX'] == "M"]
df_train_seg2 = df_train[df['SEX'] != "M"]
df_test_seg1 = df_test[df['SEX'] == "M"]
df_test_seg2 = df_test[df['SEX'] != "M"]

  """Entry point for launching an IPython kernel.
  
  This is separate from the ipykernel package so we can avoid doing imports until
  after removing the cwd from sys.path.


# Full Model vs Seg 1 on Seg 1

In [26]:
X_train_seg1 = df_train_seg1[all_numeric_variables]
y_train_seg1 = df_train_seg1[target]
X_test_seg1 = df_test_seg1[all_numeric_variables]
y_test_seg1 = df_test_seg1[target]
fitted_model_seg1 = method.fit(X_train_seg1, y_train_seg1)

def GINI(y_test, y_pred_probadbility):
    from sklearn.metrics import roc_curve, auc
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_probadbility)
    roc_auc = auc(fpr, tpr)
    GINI = (2 * roc_auc) - 1
    return(GINI)

y_pred_seg1_proba = fitted_model_seg1.predict_proba(X_test_seg1)[:,1]
y_pred_seg1_fullmodel_proba = fitted_full_model.predict_proba(X_test_seg1)[:,1]

print("Segment1: SEX in ('M') [GINI Full Model: {:.4f}% / GINI Segmented Model: {:.4f}%]".format(
    GINI(y_test_seg1, y_pred_seg1_proba)*100,
    GINI(y_test_seg1, y_pred_seg1_fullmodel_proba)*100
)) 

Segment1: SEX in ('M') [GINI Full Model: 0.0887% / GINI Segmented Model: 0.0887%]


# Full Model vs Seg 2 on Seg 2

In [27]:
X_train_seg2 = df_train_seg2[all_numeric_variables]
y_train_seg2 = df_train_seg2[target]
X_test_seg2 = df_test_seg2[all_numeric_variables]
y_test_seg2 = df_test_seg2[target]
fitted_model_seg2 = method.fit(X_train_seg2, y_train_seg2)

y_pred_seg2_proba = fitted_model_seg2.predict_proba(X_test_seg2)[:,1]
y_pred_seg2_fullmodel_proba = fitted_full_model.predict_proba(X_test_seg2)[:,1]

print("Segment1: SEX in ('F') [GINI Full Model: {:.4f}% / GINI Segmented Model: {:.4f}%]".format(
    GINI(y_test_seg2, y_pred_seg2_proba)*100,
    GINI(y_test_seg2, y_pred_seg2_fullmodel_proba)*100
))   

Segment1: SEX in ('F') [GINI Full Model: 57.8431% / GINI Segmented Model: 57.8431%]


# Execution Summary Repport

&emsp;
BUCKET is the target variable and was not analyzed separetly.

__Missing Not At Random Repport__ -  REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID variables seem Missing Not at Random, there for we recommend:

&emsp;  Thin File Segment Variables: PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, 
ORIGINAL_BOOKED_AMOUNT, OUTSTANDING, SEX, 
CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

&emsp; Full File Segment Variables: REPORTING_DATE, ACCOUNT_NUMBER, CUSTOMER_ID, PROGRAM_NAME, LOAN_OPEN_DATE, EXPECTED_CLOSE_DATE, ORIGINAL_BOOKED_AMOUNT, 
OUTSTANDING, SEX, CUSTOMER_OPEN_DATE, BIRTH_DATE, PROFESSION, CAR_TYPE

__Variable by Variable Risk Based Segmentation Analysis__:

&emsp; REPORTING_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; ACCOUNT_NUMBER Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CUSTOMER_ID Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; PROGRAM_NAME Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; LOAN_OPEN_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; EXPECTED_CLOSE_DATE Good for segmentation.  

&emsp; &emsp; Segment1: EXPECTED_CLOSE_DATE < '22/07/2021'  [GINI Full Model: 32.1234% / GINI Segmented Model: 33.4342%]

&emsp; &emsp;  Segment2: EXPECTED_CLOSE_DATE >= '22/07/2021' [GINI Full Model: 63.7523% / GINI Segmented Model: 68.8342%]

&emsp; ORIGINAL_BOOKED_AMOUNT Good for segmentation.  

&emsp; &emsp; Segment1: ORIGINAL_BOOKED_AMOUNT < 90000 [GINI Full Model: 32.3243% / GINI Segmented Model: 33.9833%]

&emsp; &emsp; Segment2: ORIGINAL_BOOKED_AMOUNT >= 90000 [GINI Full Model: 63.3449% / GINI Segmented Model: 68.9438%]

&emsp; OUTSTANDING Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; SEX Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CUSTOMER_OPEN_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CUSTOMER_OPEN_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; BIRTH_DATE Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; PROFESSION Not good for segmentation. Afer analysis, we did not find a good split using this variable.

&emsp; CAR_TYPE Good for segmentation.  

&emsp; &emsp; Segment1: CAR_TYPE in (BMW', 'BYD', 'CARRY', 'Changan', 'CHEVROLET', 'Gelory', 'GELY', 'HYUNDAI') [GINI Full Model: 35.3492% / GINI Segmented Model: 37.3943%]

&emsp; &emsp; Segment2: CAR_TYPE in ('Jack', 'KIA', 'MERCEDES', 'MITSUBISHI', 'NISSAN', 'RENAULT', 'SEAT', 'SKODA', 'SUZUKI') [GINI Full Model: 42.4324% / GINI Segmented Model: 49.4393%]