# Evaluation Metrics: GINI & KS 

<img src="img/bannerlogo.jpg" width=400 height=400 align="center">

**Explore Data Science Academy**

**Author:** [Marang Mutloatse](https://www.linkedin.com/in/marangmutloatse/)

**Email:** <marangmutloatse@gmail.com> 

**Last Revision**: 23 August 2020

**version**: 1.2

- **For the best viewing experience, one should [view this notebook in nbviewer].**
- **With no local installation required, you can also [execute the code in this notebook on Binder]**.

## Learning Objectives

[[ go back to the top ]](#Table-of-contents)
- buwbb
- kbfwi
 - iwbqfib
 - oqwifb

## Outline
[[ go back to the top ]](#Table-of-contents)
- fallen heros
- waving not 

## Introduction

[[ go back to the top ]](#Table-of-contents)


Where:
- **TP** = True Positive: the case is 1 and the model rates it as 1.
- **TN** = True Negative: the case is 0 and the model rates it as 0.
- **FN** = False Negative: the case is 1 but the model rates it as 0.
- **FP** = False Positive: the case is 0 but the model rates it as 1.


* Accuracy answers the question, *What is the proportion of correct predictions?*

\begin{equation*}
accuracy =
\frac{( TP + TN )} {Total ( TP + TN + FP + FN)}
\end{equation*}

* Sensitivity (recall) answers the question, *What proportion of real positives have been correctly predicted?*
\begin{equation*}
recall =
\frac{( TP )} {( TP + FN)}
\end{equation*}



* Precision (confidence) answers the question, *What proportion of my positive predictions is correct?*
\begin{equation*}
precision =
\frac{( TP )} {( TP + FP)}
\end{equation*}

Note that recall and precision are defined here as the proportion of real positives that are correctly predicted and the proportion of positive predictions that are correct, respectively.
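
The three ratios above can be checked by hand on a small example. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative, not taken from the fraud dataset) and counts the four cells directly with NumPy.

```python
import numpy as np

# Toy labels, invented for illustration: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])

# Count each cell of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Apply the three definitions
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(tp, tn, fp, fn)
print(accuracy, recall, precision)
```

With these labels there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, so accuracy is 0.8 while recall and precision are both 0.75.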

### F1-score

The F1-score is a classification metric that calculates a mean of precision and recall in a way that emphasizes the lowest value.

It is calculated as the harmonic mean of precision and recall, where an F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0.
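
Written with the same symbols as the precision and recall formulas, the harmonic mean of the two gives:

\begin{equation*}
F_1 =
2 \cdot \frac{precision \cdot recall} {precision + recall} =
\frac{2 \cdot TP} {2 \cdot TP + FP + FN}
\end{equation*}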


### Harmonic Average

The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals. Because of that, the result is pulled towards the smallest value and is not sensitive to extremely large values.
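
A short sketch of the difference between the two means, using Python's standard `statistics` module. The precision and recall values are invented for illustration.

```python
from statistics import harmonic_mean, mean

# Illustrative values: a model with high precision but very low recall
precision, recall = 0.9, 0.1

arithmetic = mean([precision, recall])
harmonic = harmonic_mean([precision, recall])

# The same harmonic mean written out from its definition
by_definition = 2 / (1 / precision + 1 / recall)

print("arithmetic mean:", arithmetic)
print("harmonic mean (F1):", harmonic)
```

The arithmetic mean is about 0.5, while the harmonic mean is about 0.18, much closer to the weaker of the two values.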



## Installation and Required Libraries

[[ go back to the top ]](#Table-of-contents)

This notebook is based on *Python 3.x* and uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

- [NumPy](https://numpy.org/install/)
- [pandas](https://pandas.pydata.org/pandas-docs/version/0.17.0/install.html)
- [scikit-learn](https://scikit-learn.org/stable/install.html)
- [matplotlib](https://matplotlib.org/3.3.1/users/installing.html)
- [warnings](https://docs.python.org/3/library/warnings.html) (part of the Python standard library, so there is nothing to install)

To ensure you have all of the packages needed to run the notebook, install them with `conda`:

    conda install numpy pandas scikit-learn matplotlib

`Conda` may prompt an update if the most recent version is not installed. Allow it to do so.

**Note:** Ideally, in an end-to-end machine learning project, one would create an [environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) with the packages and the correct versions for the specific model. To see more about MLOps and the machine learning pipeline, read [here](https://christophergs.com/machine%20learning/2019/03/17/how-to-deploy-machine-learning-models/).

## Import Libraries

[[ go back to the top ]](#Table-of-contents)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

## Data Check

[[ go back to the top ]](#Table-of-contents)

In [2]:
url = "https://raw.githubusercontent.com/maz2198/explore/master/data/creditcard_transaction_fraud_full.csv"
df = pd.read_csv(url)
df.index = df.iloc[:,0]
df = df.drop(df.columns[0], axis=1)

In [3]:
df.shape

(41647, 18)

In [5]:
df.head().T

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| merchant_category_code_cat | 9.0 | 22.0 | 22.0 | 9.0 | 15.0 |
| merchant_category_code_previoustransaction_cat | 22.0 | 22.0 | 22.0 | 0.0 | 9.0 |
| zipcode_cat | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 |
| zipcode_previoustransaction_cat | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 |
| transaction_value_cat | 6.0 | 7.0 | 7.0 | 4.0 | 4.0 |
| transaction_value_previoustransaction_cat | 6.0 | 7.0 | 7.0 | 1.0 | 4.0 |
| pos_entry | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| creditcard_limit_cat | 6.0 | 6.0 | 6.0 | 4.0 | 4.0 |
| brand_visa_mastercard_cat | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| type_of_creditcard_cat | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |


### Fraud Rate

In [6]:
print(df['TARGET'].sum() / df['TARGET'].count())

0.03743366869162244


## Resampling

[[ go back to the top ]](#Table-of-contents)

The dataset has been downsampled 100x (reducing non-fraud cases randomly).

__Oversampling__ is a technique where one replicates (oversamples) the minority class (fraud) in order to balance the different costs of false positives and false negatives. Oversampling the minority class is recommended for small to medium dataset sizes.

__Undersampling__ is a technique where one samples the majority class (non-fraud) at a reduced rate in order to balance the different costs of false positives and false negatives. Undersampling the majority class is recommended for large to very large dataset sizes.
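
Both techniques can be sketched with plain pandas. The frame below is synthetic (the `toy` DataFrame, its `x` column and the 1-in-10 sampling rate are assumptions made for illustration), and this is not necessarily how the notebook's dataset was downsampled.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 10,000 rows, of which the first 100 are marked as fraud
toy = pd.DataFrame({"x": rng.normal(size=10_000), "TARGET": 0})
toy.loc[:99, "TARGET"] = 1  # label-based slice, so rows 0..99 inclusive

fraud = toy[toy["TARGET"] == 1]
non_fraud = toy[toy["TARGET"] == 0]

# Undersampling: keep every fraud row and a random 10% of the non-fraud rows
under = pd.concat([fraud, non_fraud.sample(frac=0.1, random_state=0)])

# Oversampling: draw fraud rows with replacement until both classes are equal
over = pd.concat(
    [non_fraud, fraud.sample(n=len(non_fraud), replace=True, random_state=0)]
)

print("original fraud rate:", toy["TARGET"].mean())
print("undersampled fraud rate:", under["TARGET"].mean())
print("oversampled fraud rate:", over["TARGET"].mean())
```

Undersampling shrinks the data while raising the fraud rate, whereas oversampling grows the data by repeating fraud rows.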

### Original Fraud Rate

In [7]:
print(df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum()) )

0.0003887432521627116


In [8]:
print(10000*df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum()),"basis points" )

3.887432521627116 basis points
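
The calculation above undoes the 100x downsampling by scaling the non-fraud count back up before dividing. A minimal sketch of that logic follows. The helper name `original_fraud_rate` is hypothetical, and the count of 1,559 frauds is back-calculated from the shape and fraud rate printed earlier rather than read from the data.

```python
def original_fraud_rate(n_fraud, n_non_fraud_sampled, downsampling_factor):
    """Estimate the fraud rate before the non-fraud class was downsampled."""
    # Each retained non-fraud row stands for `downsampling_factor` original rows
    n_non_fraud_original = n_non_fraud_sampled * downsampling_factor
    return n_fraud / (n_non_fraud_original + n_fraud)


n_rows = 41647   # from df.shape
n_fraud = 1559   # assumption: implied by the printed fraud rate of about 3.74%

sampled_rate = n_fraud / n_rows
rate = original_fraud_rate(n_fraud, n_rows - n_fraud, 100)

print("fraud rate in the downsampled data:", sampled_rate)
print("estimated original fraud rate:", rate)
print("in basis points:", 10000 * rate)
```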


## Train-test-split

[[ go back to the top ]](#Table-of-contents)

In [9]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print("Dataset shape: {shape}".format(shape=df_train.shape))
print("Dataset shape: {shape}".format(shape=df_test.shape))

Dataset shape: (33317, 18)
Dataset shape: (8330, 18)


## Selecting the final variables and target

In [11]:
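def get_specific_columns(df, data_types, to_ignore=None, ignore_target=False):
    """Return the names of the columns in df whose dtype is in data_types.

    When ignore_target is True, the names listed in to_ignore are left out.
    """
    columns = list(df.select_dtypes(include=data_types).columns)
    if ignore_target and to_ignore:
        columns = [col for col in columns if col not in to_ignore]
    return columns

target = "TARGET"
# Every numeric column except the target is used as a model input
variables = get_specific_columns(df, ["float64", "int64"], [target], ignore_target=True)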


In [12]:
X_train= df_train[variables]
y_train = df_train[target]

X_test= df_test[variables]
y_test = df_test[target]

### Fitting a Logistic Regression model

In [13]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0)
fitted_model = clf.fit(X_train, y_train)

### Retrieving the Predicted class

In [15]:
pred_train = fitted_model.predict(X_train)
pred_test  = fitted_model.predict(X_test)

#### Calculating accuracy separately for the train and test samples

In [16]:
from sklearn.metrics import accuracy_score
print("Accuracy* Train: {0}".format(accuracy_score(y_train,pred_train)))
print("Accuracy* Test: {0}".format(accuracy_score(y_test,pred_test)))

Accuracy* Train: 0.9617312483116727
Accuracy* Test: 0.9623049219687875


However, the original fraud rate (before downsampling) is only:

In [17]:
print(df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum()) )

0.0003887432521627116


A model that predicts all cases to be non-fraud has an accuracy of:

In [19]:
1-df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum())

0.9996112567478372
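
The point can be reproduced on synthetic data. The sketch below assumes a fraud rate of roughly 4 in 10,000 (an illustrative figure close to the one above) and scores a "model" that labels every case as non-fraud. The labels are randomly generated, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)

# Synthetic labels: about 0.04% of 100,000 cases are fraud (1)
y_true = (rng.random(100_000) < 0.0004).astype(int)

# A "model" that predicts non-fraud (0) for every case
y_all_non_fraud = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_non_fraud)
rec = recall_score(y_true, y_all_non_fraud, zero_division=0)

print("accuracy:", acc)
print("recall:", rec)
```

Accuracy comes out above 99.9% even though not a single fraud case is caught, which is why accuracy alone says little about a fraud model.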

We will now use GINI and KS to see whether there is anything good about our model. For further explanation of `GINI` and `KS`, watch this short video on [Gini & KS Statistics in Credit Scoring](https://youtu.be/MiBUBVUC8kE).
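
As a compact summary of what the two functions further down compute, where $AUC$ is the area under the ROC curve and $F_{bad}(s)$ and $F_{good}(s)$ are the cumulative proportions of bad (fraud) and good (non-fraud) cases with a score of at most $s$:

\begin{equation*}
GINI =
2 \cdot AUC - 1
\end{equation*}

\begin{equation*}
KS =
\max_{s} \left| F_{bad}(s) - F_{good}(s) \right|
\end{equation*}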

### Retrieving the probability of being fraud

In [20]:
pred_train_proba = fitted_model.predict_proba(X_train)[:,1]
pred_test_proba  = fitted_model.predict_proba(X_test)[:,1]

`Scikit-learn`, `pandas` and `statsmodels` do not have a dedicated function to calculate GINI or KS, so below are two functions developed by [Manoel Gadi Fernando Alonso](https://www.linkedin.com/in/manoel-gadi-97821213/) and [Marang Mutloatse](https://github.com/maz2198).

In [22]:
from sklearn.metrics import roc_auc_score

def calculate_gini_score(y_true, y_score):
    """Return the GINI coefficient of a set of predictions.

    y_true is a binary variable (0 = good, 1 = bad) and y_score is the prediction
    for it, which can be continuous, integer or binary (continuous is better).
    """
    # GINI is a rescaling of the area under the ROC curve
    return 2 * roc_auc_score(y_true, y_score) - 1

def calculate_KS(y_true, y_score):
    """Return the KS statistic of a set of predictions.

    y_true is a binary variable (0 = good, 1 = bad) and y_score is the prediction
    for it, which can be continuous, integer or binary (continuous is better).
    """
    scores = pd.DataFrame({'target': np.asarray(y_true), 'score': np.asarray(y_score)})
    tot_bads = 1.0 * scores['target'].sum()
    tot_goods = 1.0 * (len(scores) - tot_bads)
    if tot_bads == 0 or tot_goods == 0:
        # KS is undefined when one of the two classes is missing
        return 0
    # Number of bads and total cases at each distinct score, in ascending score order
    grouped = scores.groupby('score')['target'].agg(['sum', 'count'])
    cum_perc_bads = (grouped['sum'] / tot_bads).cumsum()
    cum_perc_goods = ((grouped['count'] - grouped['sum']) / tot_goods).cumsum()
    # KS is the largest gap between the two cumulative distributions
    return (cum_perc_bads - cum_perc_goods).abs().max()

In [23]:
print("GINI Score TRAIN: {0}".format(calculate_gini_score(y_train, pred_train_proba)))
print("GINI Score TEST: {0}".format(calculate_gini_score(y_test, pred_test_proba)))

GINI Score TRAIN: 0.7339339044545414
GINI Score TEST: 0.7257933099007177


In [24]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, pred_train_proba>0.2)

array([[31246,   816],
       [  772,   483]], dtype=int64)
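
The matrix above can be read with the definitions from the Introduction. scikit-learn puts the true classes in rows and the predicted classes in columns, so the cells are `[[TN, FP], [FN, TP]]`. The sketch below simply re-types the printed counts and derives the metrics at the 0.2 threshold.

```python
import numpy as np

# Counts copied from the confusion matrix printed above (train sample, threshold 0.2)
cm = np.array([[31246, 816],
               [772, 483]])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / cm.sum()
f1 = 2 * tp / (2 * tp + fp + fn)

print("recall:", round(recall, 3))
print("precision:", round(precision, 3))
print("accuracy:", round(accuracy, 3))
print("F1:", round(f1, 3))
```

At this threshold the model catches roughly 38% of the fraud cases in the train sample, and roughly 37% of the cases it flags are actually fraud.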

### Calculating KS for train and test samples

[[ go back to the top ]](#Table-of-contents)

In [25]:
print("KS Score TRAIN: {0}".format(calculate_KS(y_train, pred_train_proba)))
print("KS Score TEST: {0}".format(calculate_KS(y_test, pred_test_proba)))

KS Score TRAIN: 0.585001967055448
KS Score TEST: 0.597569412566989


## Conclusion

[[ go back to the top ]](#Table-of-contents)

## Further Reading

[[ go back to the top ]](#Table-of-contents)

## Optional Exercises

[[ go back to the top ]](#Table-of-contents)

1. Knowing what we know about `GINI` and `KS`, what would be the appropriate metric (**f1**, **accuracy**, **mcc** etc.) to use when performing cross-validation? If possible, use several different classification models (Random Forest, Logit and Decision Tree) and [plot_comparisons](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). **Here is some code and a few examples on plotting the [roc-auc curve](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html).**

2. Is the use of `GINI` and `KS` effective in multiclass classification problems and, if so, how? **Note: Focus on a specific industry and use case, as this will make the question easier to answer.**