<img src="profile_manoelgadi.png" width=100 height=100 align="right">

Author: Prof. Manoel Gadi

Contact: mfalonso@faculty.ie.edu

Teaching Web: http://mfalonso.pythonanywhere.com

Last revision: 24/February/2020

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))



---

__Objective of today's session__ - Review for the exam.

* Get in touch with a real Credit Card Transaction Fraud Dataset (from Brazil). 
* Learn the concepts of oversampling and downsampling for unbalanced data

Learn, review and discuss, in general and specifically for unbalanced data:
* Gini
* Population Stability Index
* Weight of evidence and Information Value
* Correlation - Spearman Ranking
* Feature selection
* Overfitting

Grouping - discuss how to transform or when to drop variables according to the type of variable:
* primary key auto incremental (id)
* input binary - flag 0/1 variable
* input categorical nominal
* input categorical ordinal
* input numerical continuous (input float)
* dates
* future variables
* Target variable

---

# Credit Card Fraud



<img src="00_frauddetection.jpg"  width=500 height=500 align="center">


---

In [2]:
import pandas as pd

## Reading the data:

In [3]:
df = pd.read_csv("creditcard_transaction_fraud_full.csv")
df.index = df.iloc[:,0]
df = df.drop(df.columns[0], axis=1)

In [4]:
df.shape

(41647, 18)

In [5]:
df.head().T

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| merchant_category_code_cat | 9.0 | 22.0 | 22.0 | 9.0 | 15.0 |
| merchant_category_code_previoustransaction_cat | 22.0 | 22.0 | 22.0 | 0.0 | 9.0 |
| zipcode_cat | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 |
| zipcode_previoustransaction_cat | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 |
| transaction_value_cat | 6.0 | 7.0 | 7.0 | 4.0 | 4.0 |
| transaction_value_previoustransaction_cat | 6.0 | 7.0 | 7.0 | 1.0 | 4.0 |
| pos_entry | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| creditcard_limit_cat | 6.0 | 6.0 | 6.0 | 4.0 | 4.0 |
| brand_visa_mastercard_cat | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| type_of_creditcard_cat | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |


## All information has been grouped in categories:

### DISCUSSION 1) How to transform or when to drop variables based simply on the type of variable?

* primary key auto incremental (id) - __drop__
* input binary - flag 0/1 variable - __do nothing__
* input categorical nominal - __create dummies__ (if necessary to reduce categories: apply WoE transformation, followed by percentile/quantile grouping, followed by dummy creation). Quantile grouping aims to reduce the impact of outliers.
* input categorical ordinal - __do nothing__ (if necessary to reduce categories: __percentile/quantile grouping followed by dummy creation__)
* input numerical continuous (input float) - __do nothing__ (if necessary to reduce categories: __percentile/quantile grouping followed by dummy creation__)
* dates - never use as raw dates; transform into a difference of dates, then apply __percentile or quantile grouping__
* future variables - __drop__
* Target variable - __create the y with it, remove it from X__
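
The rules above can be sketched on a toy table. This is only an illustration under stated assumptions: every column name below (`id`, `brand`, `amount`, `opened`) and the reference date are hypothetical and do not come from the fraud dataset, and three quantile groups are used just to keep the example small.

```python
import pandas as pd

# Toy data; all column names here are hypothetical
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "brand": ["visa", "master", "visa", "amex", "master", "visa"],
    "amount": [12.5, 80.0, 3.2, 950.0, 41.0, 15.9],
    "opened": pd.to_datetime(["2019-01-10", "2018-06-01", "2019-11-20",
                              "2017-03-15", "2019-08-08", "2018-12-31"]),
    "TARGET": [0, 0, 1, 0, 1, 0],
})
reference_date = pd.Timestamp("2020-02-24")  # assumed reference date

# id: drop. Target: create y with it and remove it from X
y = toy["TARGET"]
X = toy.drop(columns=["id", "TARGET"])

# Dates: turn into a difference of dates (in days), then quantile grouping
days_open = (reference_date - X["opened"]).dt.days
X["days_open_grp"] = pd.qcut(days_open, q=3, labels=False)
X = X.drop(columns=["opened"])

# Continuous: quantile grouping (reduces the impact of the 950.0 outlier)
X["amount_grp"] = pd.qcut(X["amount"], q=3, labels=False)
X = X.drop(columns=["amount"])

# Nominal: create dummies
X = pd.get_dummies(X, columns=["brand"])
print(X)
```

After quantile grouping, the extreme `amount` of 950.0 simply lands in the top group, which is the outlier-damping effect mentioned in the list.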

### FRAUD RATE


In [6]:
print(df['TARGET'].sum() / df['TARGET'].count()) 

0.03743366869162244


## Resampling

The dataset has been downsampled 100x (reducing non-fraud cases randomly).

<img src="06_oversampling.JPG"  width=500 height=500 align="center">

__Oversampling__ is a technique where one replicates (oversamples) the minority class (fraud) in order to balance the different costs of false positives and false negatives. Oversampling the minority class is recommended for small/medium dataset sizes.

__Undersampling__ is a technique where one samples down the majority class (non-fraud) in order to balance the different costs of false positives and false negatives. Undersampling the majority class is recommended for big/huge dataset sizes.
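
As a minimal sketch of both ideas, the snippet below resamples a toy table with plain pandas. The table, the 10% undersampling fraction and the 5x oversampling factor are illustrative assumptions, not the procedure used to build this dataset. Resampling is normally applied to the training sample only, so that the test sample keeps a realistic class mix.

```python
import pandas as pd

# Toy data: 20 frauds among 1000 transactions (2% fraud rate)
toy = pd.DataFrame({"x": range(1000), "TARGET": [1] * 20 + [0] * 980})
fraud = toy[toy["TARGET"] == 1]
non_fraud = toy[toy["TARGET"] == 0]

# Undersampling: keep every fraud, keep a random 10% of the non-fraud cases
frac = 0.10
under = pd.concat([fraud, non_fraud.sample(frac=frac, random_state=42)])

# Oversampling: keep every non-fraud, replicate frauds with replacement (5x)
over = pd.concat([non_fraud,
                  fraud.sample(n=5 * len(fraud), replace=True, random_state=42)])

print("fraud rate original:    ", toy["TARGET"].mean())
print("fraud rate undersampled:", under["TARGET"].mean())
print("fraud rate oversampled: ", over["TARGET"].mean())

# Undoing the undersampling: scale the non-fraud count back up by 1/frac
n_fraud = under["TARGET"].sum()
n_non_fraud = len(under) - n_fraud
recovered_rate = n_fraud / (n_non_fraud / frac + n_fraud)
print("fraud rate recovered:   ", recovered_rate)
```

The last step mirrors how a rate measured on a downsampled file can be translated back to the original population.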



### Original Fraud Rate:

Because the non-fraud cases were downsampled 100x, the original rate is recovered by multiplying the non-fraud count by 100 in the denominator:

In [7]:
print(df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum()) )

0.0003887432521627116


In [8]:
print(10000*df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum()),"basis points" )

3.887432521627116 basis points


<img src="01_haystatck.JPG"  width=500 height=500 align="center">

<img src="02_fraud_prevention.JPG"  width=500 height=500 align="center">

<img src="03_type_of_fraud.JPG"  width=500 height=500 align="center">

### Accuracy is not a good measure to use here

<img src="04_recall.JPG"  width=500 height=500 align="center">

### It is important to find the fraud. We can use recall, KS, GINI or calculate the actual cost of fraud

<img src="05_cost_of_fraud.JPG"  width=500 height=500 align="center">

In this example we will use GINI.

### Train-test split

In [9]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
df_train, df_test = train_test_split(df, test_size = 0.2, random_state = 42)
print("Dataset shape: {shape}".format(shape = df_train.shape))
print("Dataset shape: {shape}".format(shape = df_test.shape))

Dataset shape: (33317, 18)
Dataset shape: (8330, 18)


### Selecting the final variables and target

In [10]:
def get_specific_columns(df, data_types, to_ignore = (), ignore_target = False):
    # Columns of the requested dtypes; names in to_ignore are dropped only when ignore_target is True
    columns = df.select_dtypes(include=data_types).columns
    if ignore_target:
        columns = filter(lambda x: x not in to_ignore, list(columns))
    return list(columns)

target = "TARGET"
variables = list(get_specific_columns(df, ["float64", "int64"], [target], ignore_target = True))


In [11]:
X_train= df_train[variables]
y_train = df_train[target]

X_test= df_test[variables]
y_test = df_test[target]

### Fitting a LogisticRegression

In [12]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0)
fitted_model = clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Retrieving the predicted class

In [13]:
pred_train = fitted_model.predict(X_train)
pred_test  = fitted_model.predict(X_test)

#### NOW CALCULATE ACCURACY SEPARATING train AND test SAMPLES

In [14]:
from sklearn.metrics import accuracy_score
print("Accuracy* Train: {0}".format(accuracy_score(y_train,pred_train)))
print("Accuracy* Test: {0}".format(accuracy_score(y_test,pred_test)))


Accuracy* Train: 0.9617312483116727
Accuracy* Test: 0.9623049219687875


*details on appendix

---

### However, as Fraud Rate is:

In [15]:
print(df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum()) )

0.0003887432521627116


### A model that predicts all cases to be non-fraud has an accuracy of:

In [16]:
1-df['TARGET'].sum() / ((df['TARGET'].count()-df['TARGET'].sum())*100+df['TARGET'].sum())

0.9996112567478372

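So, judged by accuracy alone, our model is really bad! Even on the downsampled data used here, a model that always predicts non-fraud would score about 96.3% (1 - 0.0374), essentially the same as our 96.2%.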

## Let's use GINI and KS to see if there is anything good about our model

Gini & KS Statistics in Credit Scoring https://youtu.be/MiBUBVUC8kE

# Things to think about during individual exercise:

1. accuracy, precision, recall and f1-score vs. GINI and KS2
1. Power vs. Robustness
1. Linear (OLS based) vs. Non-linear (Tree based)
1. Sample - train/test vs. cross-validation vs. Out-of-time
1. Scaling: No Scaling vs. Standard scaling vs. Min Max Scaling
1. Feature Selection: Bivariate, Feature Importance and Genetic Algorithm
1. Alternative Methods: Ensemble models

# Confusion Matrix - Accuracy, recall & precision

<img src = "09_matriz.confusion.jpg" width = 300 height = 300 align = "center">

Where:
* TP = True Positive - It is 1 and I rate it as 1.
* TN = True Negative - It is 0 and I rate it as 0.
* FN = False Negative - It is 1 and I rate it as 0.
* FP = False Positive - It is 0 and I rate it as 1.


* Accuracy answers the question: what is the proportion of correct predictions?

\begin{equation*}
accuracy =
\frac{TP + TN} {TP + TN + FP + FN}
\end{equation*}

* Sensitivity (recall) answers the question: what proportion of real positives has been correctly predicted?
\begin{equation*}
recall =
\frac{( TP )} {( TP + FN)}
\end{equation*}



* Precision (confidence) answers the question: what proportion of my positive predictions is correct?
\begin{equation*}
precision =
\frac{( TP )} {( TP + FP)}
\end{equation*}

Note that sensitivity and precision are defined here as proportions of real positives and of positive predictions, respectively.
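
As a small worked example, the three formulas can be applied to the train confusion matrix that this notebook prints further down for a cut-off of 0.2. The layout assumed here is scikit-learn's: rows are the real classes, columns the predicted ones, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Counts copied from the notebook's train confusion matrix at cut-off 0.2
tn, fp = 31246, 816
fn, tp = 772, 483

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of real frauds that were caught
precision = tp / (tp + fp)     # share of fraud alerts that were real frauds

print("accuracy:  {:.4f}".format(accuracy))
print("recall:    {:.4f}".format(recall))
print("precision: {:.4f}".format(precision))
```

Accuracy stays high because the non-fraud cases dominate the sum, while recall and precision show how many frauds are actually found and how reliable the alerts are.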

#### F1-score

The F1-score is a classification metric that combines precision and recall in a way that emphasizes the lower of the two values.

It is calculated as the harmonic mean of precision and recall; the F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

<img src="10_f1-score.png" width=400 height=400 align="center">

#### Harmonic Mean

The harmonic mean is defined as the inverse of the arithmetic mean of the inverses. Because of that, the result is pulled towards the smallest values and is not sensitive to extremely large values.
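
A quick numeric check makes the difference between the two means concrete. The values 0.47 and 0.11 are the rounded precision and recall of class 1 in the train classification report of this notebook; the rest is plain arithmetic.

```python
# Rounded precision and recall for class 1 (train), from the classification report
precision, recall = 0.47, 0.11

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 / (1 / precision + 1 / recall)     # inverse of the mean of the inverses
f1 = 2 * precision * recall / (precision + recall)   # usual way of writing the F1-score

print("arithmetic mean: {:.2f}".format(arithmetic_mean))
print("harmonic mean:   {:.2f}".format(harmonic_mean))
print("f1-score:        {:.2f}".format(f1))
```

The harmonic mean and the F1 formula give the same number, and that number sits much closer to the low recall than the arithmetic mean does.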

<img src = "11_armonic_mean.png" width = 400 height = 400 align = "center">


In [17]:
from sklearn.metrics import classification_report
print(classification_report(y_train,pred_train))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98     32062
           1       0.47      0.11      0.18      1255

    accuracy                           0.96     33317
   macro avg       0.72      0.55      0.58     33317
weighted avg       0.95      0.96      0.95     33317



---

In [18]:
pred_train_proba = fitted_model.predict_proba(X_train)[:,1]
pred_test_proba  = fitted_model.predict_proba(X_test)[:,1]

### Calculating GINI for train and test samples

In [19]:
from sklearn.metrics import roc_auc_score
def calculate_gini_score(a,b):
    """Return the GINI coefficient (2*AUC - 1) of a prediction.

    a: binary target, 0=good and 1=bad.
    b: prediction of a; it can be continuous, integer or binary (continuous is better).
    """
    gini = 2*roc_auc_score(a,b)-1
    return gini

In [20]:
print("GINI Score TRAIN: {0}".format(calculate_gini_score(y_train, pred_train_proba)))
print("GINI Score TEST: {0}".format(calculate_gini_score(y_test, pred_test_proba)))

GINI Score TRAIN: 0.7339094995478135
GINI Score TEST: 0.7257859325612812
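
To see what the GINI measures, recall that the AUC equals the probability that a randomly chosen bad case receives a higher score than a randomly chosen good case (ties count as one half). The sketch below computes that probability by comparing every bad with every good on toy data. The function name `gini_from_pairs` is hypothetical, and the pairwise comparison is only practical for small samples.

```python
import numpy as np

def gini_from_pairs(y_true, y_score):
    """Hypothetical helper: GINI = 2*AUC - 1, with AUC computed pair by pair."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    bad_scores = y_score[y_true == 1]
    good_scores = y_score[y_true == 0]
    # Difference between every bad score and every good score
    diff = bad_scores[:, None] - good_scores[None, :]
    # Share of pairs where the bad case is ranked above the good one
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    return 2 * auc - 1

y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]
print("GINI toy:", gini_from_pairs(y_toy, score_toy))
```

In the toy data three of the four bad/good pairs are ordered correctly, so the AUC is 0.75 and the GINI is 0.5. A model that ranks at random would get a GINI near 0 and a perfect ranking would get 1.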


In [21]:
from sklearn.metrics import confusion_matrix

In [22]:
confusion_matrix(y_train, pred_train_proba>0.2)

array([[31246,   816],
       [  772,   483]], dtype=int64)

<img src="08_gini.jpg"  width=500 height=500 align="center">

--- 

In [23]:
for numb in range(1,11):
    cutoff = numb/10.0
    cm = confusion_matrix(y_train, pred_train_proba>cutoff)
    total = cm.sum()
    print("----")
    # NOTE: both figures are divided by the total number of cases, so they are
    # shares of the whole sample, not the conventional rates
    # TPR = TP/(TP+FN) and FPR = FP/(FP+TN).
    print("True positive rate (cut-off {}%):".format(100*cutoff),cm[1,1]/total)
    print("1 - False Positive Rate (cut-off {}%):".format(100*cutoff),1 - cm[0,1]/total)

----
True positive rate (cut-off 10.0%): 0.0192394273193865
1 - False Positive Rate (cut-off 10.0%): 0.9537473361947354
----
True positive rate (cut-off 20.0%): 0.01449710358075457
1 - False Positive Rate (cut-off 20.0%): 0.9755079989194705
----
True positive rate (cut-off 30.0%): 0.010865324008764294
1 - False Positive Rate (cut-off 30.0%): 0.984632469910256
----
True positive rate (cut-off 40.0%): 0.0077738091664915805
1 - False Positive Rate (cut-off 40.0%): 0.9900051025002251
----
True positive rate (cut-off 50.0%): 0.004202059008914368
1 - False Positive Rate (cut-off 50.0%): 0.995197646846955
----
True positive rate (cut-off 60.0%): 0.0017408530179788095
1 - False Positive Rate (cut-off 60.0%): 0.998079058738782
----
True positive rate (cut-off 70.0%): 0.0007203529729567488
1 - False Positive Rate (cut-off 70.0%): 0.9991896029054237
----
True positive rate (cut-off 80.0%): 0.0001200588288261248
1 - False Positive Rate (cut-off 80.0%): 0.9997898970495542
----
True positive rate (c
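
The figures printed above are shares of the whole sample. As conventionally defined, the true positive rate is normalized by the real positives and the false positive rate by the real negatives. A minimal sketch of that calculation on toy data follows; the helper name `tpr_fpr` and the toy values are hypothetical.

```python
import numpy as np

def tpr_fpr(y_true, y_score, cutoff):
    """Hypothetical helper: conventional TPR and FPR at a given cut-off."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) > cutoff
    tp = np.sum((y_true == 1) & y_pred)
    fn = np.sum((y_true == 1) & ~y_pred)
    fp = np.sum((y_true == 0) & y_pred)
    tn = np.sum((y_true == 0) & ~y_pred)
    return tp / (tp + fn), fp / (fp + tn)

y_toy = [0, 0, 0, 0, 1, 1, 0, 1]
p_toy = [0.05, 0.10, 0.30, 0.15, 0.25, 0.60, 0.02, 0.80]
for cutoff in (0.1, 0.2, 0.5):
    tpr, fpr = tpr_fpr(y_toy, p_toy, cutoff)
    print("cut-off {:.0%}: TPR = {:.2f}, FPR = {:.2f}".format(cutoff, tpr, fpr))
```

With the train confusion matrix shown earlier for a cut-off of 0.2, the same definitions give a TPR of about 483/1255 = 0.385 and an FPR of about 816/32062 = 0.025. Raising the cut-off lowers both rates, which is the trade-off to weigh when choosing it.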

---

### Calculating KS for train and test samples

In [24]:
def calculate_ks(b,a):
    """Return the KS statistic of a prediction.

    b: binary target, 0=good and 1=bad.
    a: prediction of b; it can be continuous, integer or binary (continuous is better).
    """
    scored = pd.DataFrame({'target': list(b), 'score': list(a)})
    tot_bads = 1.0*scored['target'].sum()
    tot_goods = 1.0*(len(scored) - tot_bads)
    if tot_bads == 0 or tot_goods == 0:
        return 0  # KS is not defined when only one class is present
    # Bads and total cases at each distinct score, sorted by ascending score
    per_score = scored.groupby('score')['target'].agg(['sum', 'size'])
    cum_perc_bads = (per_score['sum'] / tot_bads).cumsum()
    cum_perc_goods = ((per_score['size'] - per_score['sum']) / tot_goods).cumsum()
    # KS is the largest gap between the two cumulative distributions
    return (cum_perc_bads - cum_perc_goods).abs().max()

In [25]:
print("KS Score TRAIN: {0}".format(calculate_ks(y_train, pred_train_proba)))
print("KS Score TEST: {0}".format(calculate_ks(y_test, pred_test_proba)))

KS Score TRAIN: 0.5850019670554478
KS Score TEST: 0.5976940076330244


### Understand the cut-off

---

### PSI

<img src="07_cost_of_fraud.JPG"  width=500 height=500 align="center">

$PSI = \sum\Big(\big(Actual \% - Expected \%\big) \times \ln\big(\dfrac{Actual \%}{Expected \%}\big)\Big)$

In [26]:
from profmanoelgadi_support_package import PSI
PSI.calculate_psi(X_train['merchant_category_code_cat'], X_test['merchant_category_code_cat'], 
                  buckettype='bins', number=10)

ModuleNotFoundError: No module named 'profmanoelgadi_support_package'
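
Since the support package is not available in this run, the formula above can still be sketched directly with numpy. This is not the package's `PSI.calculate_psi`; the helper name `psi_sketch`, the equal-width buckets taken from the expected sample and the small floor `eps` for empty buckets are all assumptions made for illustration.

```python
import numpy as np

def psi_sketch(expected, actual, n_bins=10, eps=1e-6):
    """Hypothetical helper: PSI between two samples using equal-width buckets."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bucket edges come from the expected sample; only the inner edges are used,
    # so values outside its range fall into the first or the last bucket
    inner_edges = np.linspace(expected.min(), expected.max(), n_bins + 1)[1:-1]
    exp_counts = np.bincount(np.searchsorted(inner_edges, expected, side="right"),
                             minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner_edges, actual, side="right"),
                             minlength=n_bins)
    # Floor the shares at eps so empty buckets do not produce log(0)
    exp_pct = np.clip(exp_counts / len(expected), eps, None)
    act_pct = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 5000)
psi_same = psi_sketch(base, base)
psi_shifted = psi_sketch(base, rng.normal(1, 1, 5000))
print("PSI, identical samples:", psi_same)
print("PSI, shifted sample:   ", psi_shifted)
```

Identical samples give a PSI of zero, and the value grows as the distribution of the second sample drifts away from the first. Applied to this notebook, the train column would play the role of the expected sample and the test column the actual one.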

#### Storing the PSI into a dataframe

In [None]:
df_stats=pd.DataFrame(X_train.columns,columns=['variable'])

In [None]:
PSI_list = []
for item in X_train.columns:
    PSI_list.append(PSI.calculate_psi(X_train[item], X_test[item], buckettype='bins', number=10))


In [None]:
df_stats['PSI']=PSI_list

In [None]:
df_stats

### Information Value*

*details on appendix

In [None]:
from profmanoelgadi_support_package import IV

In [None]:
final_iv, IV = IV.data_vars(X_train, y_train)

In [None]:
IV

In [None]:
IV_list = []
for item in X_train.columns:
    IV_list.append(float(IV[IV['VAR_NAME']==item]['IV']))


In [None]:
df_stats['IV']=IV_list

In [None]:
df_stats
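
For reference, Weight of Evidence and Information Value can be sketched with pandas alone. This is not the support package's `IV.data_vars`; the helper name `information_value_sketch`, the convention WoE = ln(% goods / % bads) and the 0.5 smoothing added to each count (to avoid dividing by zero in empty cells) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def information_value_sketch(feature, target, smooth=0.5):
    """Hypothetical helper: WoE per category and IV of one grouped variable."""
    table = pd.crosstab(pd.Series(list(feature), name="feature"),
                        pd.Series(np.asarray(target).astype(int), name="target"))
    table = table.reindex(columns=[0, 1], fill_value=0)
    goods = table[0] + smooth          # target 0 = good
    bads = table[1] + smooth           # target 1 = bad
    pct_goods = goods / goods.sum()
    pct_bads = bads / bads.sum()
    woe = np.log(pct_goods / pct_bads)
    iv = float(((pct_goods - pct_bads) * woe).sum())
    return woe, iv

# Toy variable: category 0 is mostly good, category 1 is mostly bad
feature = [0] * 100 + [1] * 100
target = [0] * 90 + [1] * 10 + [0] * 10 + [1] * 90
woe, iv = information_value_sketch(feature, target)
print(woe)
print("IV:", iv)
```

A variable whose categories all have the same mix of goods and bads gets a WoE of zero everywhere and an IV of zero, while a variable that separates the classes gets a large IV.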

### Spearman Correlation from Scipy

In [None]:
from scipy.stats import spearmanr
spearmanr(X_train['merchant_category_code_cat'],X_train['merchant_category_code_previoustransaction_cat'])

In [None]:
spearmanr(X_train['merchant_category_code_cat'],X_train['merchant_category_code_previoustransaction_cat'])[0]

In [None]:
for item in X_train.columns:
    Spearman_correlation_list = []
    for item2 in X_train.columns:
        Spearman_correlation_list.append(spearmanr(X_train[item], X_train[item2])[0])
    df_stats['corr_with_'+item]=Spearman_correlation_list

In [None]:
df_stats
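
As an alternative sketch, pandas can build the whole Spearman matrix in one call with `DataFrame.corr(method="spearman")`. The toy columns below are hypothetical; on this notebook's data the same call would be applied to `X_train`.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 4, 9, 16, 25, 36],   # increases with a, but not linearly
    "c": [6, 5, 4, 3, 2, 1],      # decreases as a increases
})

# Spearman works on ranks, so any monotonic relationship scores +1 or -1
spearman_matrix = toy.corr(method="spearman")
print(spearman_matrix)

# Equivalent view: Pearson correlation computed on the ranks
rank_pearson_matrix = toy.rank().corr()
```

Because only the ranking matters, `b` is perfectly correlated with `a` even though the relationship is not linear. Pairs of variables with a very high absolute correlation are candidates for keeping only one of the two.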

In [None]:
df_stats.sort_values(by=['PSI'], ascending=False)

In [None]:
df_stats.sort_values(by=['IV'], ascending=False)

# Final Discussions ...

## Feature selection

1. How can we use the PSI, IV and Correlation for Feature Selection?
1. What else can we use for feature selection?


## Overfitting

1. What is overfitting? 
1. How to identify it? 
1. How to reduce it? 


# Things to think about during individual exercise:

1. accuracy, precision, recall and f1-score vs. GINI and KS2
1. Power vs. Robustness
1. Linear (OLS based) vs. Non-linear (Tree based)
1. Sample - train/test vs. cross-validation vs. Out-of-time
1. Scaling: No Scaling vs. Standard scaling vs. Min Max Scaling
1. Feature Selection: Bivariate, Feature Importance and Genetic Algorithm
1. Alternative Methods: Ensemble models

