<h1>Exploratory Analysis and Model Evaluation</h1>

# Importing Libraries

In [18]:
import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image
import pandas as pd
import pandas_profiling
import numpy as np
from IPython.display import Markdown, display
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support as score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree
from sklearn.linear_model import LogisticRegression
import xgboost
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Importing the Dataset

In [2]:
df = pd.read_csv('dataset_test_ds.csv',sep=';')

# Auxiliary Functions

In [3]:
def printmd(string):
    display(Markdown(string))

In [4]:
def create_classifiers():
    """Return fresh, unfitted instances of the five classifiers compared below."""
    return [
        LogisticRegression(random_state=0, class_weight='balanced'),
        tree.DecisionTreeClassifier(random_state=0),
        RandomForestClassifier(random_state=0, n_estimators=10),
        svm.SVC(random_state=0, class_weight='balanced'),
        xgboost.XGBClassifier(random_state=0),
    ]

In [5]:
def run_models(X_train, X_test, y_train, y_test, classifiers):    
    for clf in classifiers:
        clf.fit(X_train, y_train)
        y_pred= clf.predict(X_test)

        printmd("**{}**".format(type(clf).__name__))

        acc = accuracy_score(y_test, y_pred)
        print("Accuracy {} \n".format( acc))

        cm = confusion_matrix(y_test, y_pred)
        print("Confusion Matrix:")
        print(cm)
        print("")

        precision, recall, fscore, support = score(y_test, y_pred)

        print('Precision \n  Class 0: {} - Class 1: {}\n'.format(precision[0],precision[1]))
        print('Recall \n  Class 0: {} - Class 1: {}\n'.format(recall[0],recall[1]))
        print('FScore \n  Class 0: {} - Class 1: {}\n'.format(fscore[0],fscore[1]))

        print("")

# Exploratory Analysis

## Descriptive Statistics

In [6]:
df.head()

Unnamed: 0,V1,V2,V3,TARGET,V4,V5,V6,V7,V8,V9,V10,Safra
0,0,8.1,9.99,0,1968,0,0,15.15,0,0,0,201901
1,0,4.4,35.0,0,1369,0,0,63.98,1,0,0,201910
2,0,0.7,52.99,0,1228,0,0,98.84,0,0,0,201906
3,0,63.3,810.0,0,0,0,1,9237.21,0,0,0,201910
4,0,4.1,17.5,0,0,0,1,27.7,1,0,0,201902


In [7]:
df.describe()

Unnamed: 0,V1,V2,V3,TARGET,V4,V5,V6,V7,V8,V9,V10,Safra
count,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0,11169.0
mean,0.106366,19.726368,531.046901,0.010744,1396.048438,0.18999,0.177903,4346.085975,0.397529,0.008506,0.030531,201906.522339
std,0.308319,25.438201,906.626021,0.1031,1736.590512,0.656058,0.382448,11542.51655,0.489409,0.091837,0.172051,3.447787
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,201901.0
25%,0.0,2.8,37.52,0.0,30.0,0.0,0.0,77.42,0.0,0.0,0.0,201904.0
50%,0.0,10.0,135.0,0.0,1321.0,0.0,0.0,414.07,0.0,0.0,0.0,201907.0
75%,0.0,25.2,520.0,0.0,1988.0,0.0,0.0,2799.06,1.0,0.0,0.0,201910.0
max,1.0,100.0,8540.0,1.0,15616.0,11.0,1.0,143268.55,1.0,1.0,1.0,201912.0


## Data Exploration

In [8]:
pandas_profiling.ProfileReport(df) 



OBS: **The data exploration IFrame is not displayed in the GitHub web interface**. 

To view the interactive report, there are two options: 

* Download the *html* version of the report and open it in your browser: https://drive.google.com/file/d/1VqIMwcjWWR7uNdg1hZIeAHXPcVx1T1rg/view?usp=sharing

or

* Download the Jupyter notebook and run it locally on your computer

## Key Observations

* The dataset is **heavily imbalanced** with respect to the 'TARGET' variable: only about 1% of the rows have TARGET = 1;
* The Pearson correlation map shows a high positive correlation between the variables V3 and V7, which indicates multicollinearity in the data. Training the models with both of them is therefore unlikely to be beneficial;
* The variable 'Safra' doesn't seem to be correlated with any other variable in the dataset. It clearly holds a date related to when each record was registered (values such as 201901), and could be used for time series analysis with RNNs (Recurrent Neural Networks), for instance. Since this kind of algorithm takes a long time to train and tune, it is not the focus of this experiment, so the variable 'Safra' will be discarded for now;
* In general the provided dataset seems to be of good quality: no missing values or extreme outliers were detected;
* The vast majority of the variables are right-skewed (most values are small, with a long tail of large ones), so models that are sensitive to the distribution of their inputs, such as linear models, may perform poorly;
* The variables have very different ranges. Some are Boolean, some vary between 0 and 10, and some range from 0 to the thousands. Depending on the model, feature scaling might be beneficial.
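
The observations above can also be checked numerically, without opening the full profiling report. The following is only a minimal sketch: the helper name `quick_checks` and the toy data are hypothetical stand-ins for the real `df`, chosen so the example runs on its own.

```python
import numpy as np
import pandas as pd


def quick_checks(frame, target="TARGET", pair=("V3", "V7")):
    """Return the positive-class rate, one Pearson correlation and per-column skewness."""
    positive_rate = frame[target].mean()              # share of rows with TARGET == 1
    pair_corr = frame[pair[0]].corr(frame[pair[1]])   # Pearson is the pandas default
    skewness = frame.drop(columns=[target]).skew()    # > 0 means a long right tail
    return positive_rate, pair_corr, skewness


# Toy data (hypothetical): V7 is built as a noisy multiple of V3.
rng = np.random.default_rng(0)
v3 = rng.exponential(scale=500.0, size=1000)
toy = pd.DataFrame({
    "V3": v3,
    "V7": 8.0 * v3 + rng.normal(scale=50.0, size=1000),
    "TARGET": (rng.random(1000) < 0.01).astype(int),
})

rate, corr, skew = quick_checks(toy)
print(f"positive rate: {rate:.3f}")
print(f"corr(V3, V7):  {corr:.3f}")
print(skew)
```

On the real data the same call would be `quick_checks(df)`, assuming the column names shown in `df.head()`.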

# Premises

In order to run this experiment, the following premises were considered:

* Independent Variables: ['V1','V2','V3','V4','V5','V6','V8','V9','V10'] (V7 and 'Safra' are left out, as discussed above);
* Dependent Variable: 'TARGET';
* Since the Dependent Variable is Boolean, this experiment will focus on binary classification models.

Obs: Normally I would ask for more information, or confirm these premises with the person or team responsible for generating the dataset.

# Splitting the Dataset

In [9]:
X = df[['V1','V2','V3','V4','V5','V6','V8','V9','V10']] # Note that the V7 and Safra variables were removed
y = df['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, shuffle=True)
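
With so few positive examples, a purely random split can leave the two parts with slightly different class ratios, and without a seed the split changes on every run. One possible variation, sketched here on hypothetical toy data, is to pass `stratify` and `random_state` to `train_test_split`. This is not the split used for the results in this notebook, so the numbers reported below would differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, 2% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    shuffle=True,
    stratify=y_toy,   # keep the class ratio the same in both parts
    random_state=0,   # fixed seed so the split can be repeated
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```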

# Models and Metrics Used

In this experiment the following models were trained and assessed:
* Logistic Regression
* Decision Tree
* Random Forest
* Support Vector Classifier (SVC)
* Gradient Boosted Trees (XGBoost)

Additionally, each model was evaluated using the following metrics:
* Accuracy
* Precision
* Recall
* FScore
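
All four metrics can be derived from the four cells of a binary confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The sketch below spells out the standard definitions for class 1; the function name `binary_metrics` and the counts passed to it are hypothetical.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy plus class-1 precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of the predicted 1s, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of the real 1s, how many were found
    fscore = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)        # harmonic mean of the two
    return accuracy, precision, recall, fscore


# Hypothetical counts: 90 true negatives, 5 false positives,
# 3 false negatives and 2 true positives.
acc, prec, rec, f1 = binary_metrics(tn=90, fp=5, fn=3, tp=2)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} fscore={f1:.3f}")
```

In this toy case the accuracy is high while the recall and precision for class 1 are low, which is the pattern to watch for on imbalanced data.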

# Training and Evaluating the Models

## Direct approach

In this part we trained and tested all the models with the provided data - no feature scaling or additional pre-processing was executed.

### Results

In [10]:
classifiers = create_classifiers()
run_models(X_train, X_test, y_train, y_test, classifiers)

**LogisticRegression**

Accuracy 0.8340793792897643 

Confusion Matrix:
[[2775  543]
 [  13   20]]

Precision 
  Class 0: 0.9953371592539455 - Class 1: 0.035523978685612786

Recall 
  Class 0: 0.8363471971066908 - Class 1: 0.6060606060606061

FScore 
  Class 0: 0.908942024238454 - Class 1: 0.06711409395973154




**DecisionTreeClassifier**

Accuracy 0.984482244106237 

Confusion Matrix:
[[3296   22]
 [  30    3]]

Precision 
  Class 0: 0.9909801563439568 - Class 1: 0.12

Recall 
  Class 0: 0.9933694996986137 - Class 1: 0.09090909090909091

FScore 
  Class 0: 0.9921733895243829 - Class 1: 0.10344827586206896




**RandomForestClassifier**

Accuracy 0.9898537749925396 

Confusion Matrix:
[[3316    2]
 [  32    1]]

Precision 
  Class 0: 0.9904420549581839 - Class 1: 0.3333333333333333

Recall 
  Class 0: 0.9993972272453285 - Class 1: 0.030303030303030304

FScore 
  Class 0: 0.9948994899489949 - Class 1: 0.05555555555555555




**SVC**

Accuracy 0.9898537749925396 

Confusion Matrix:
[[3314    4]
 [  30    3]]

Precision 
  Class 0: 0.9910287081339713 - Class 1: 0.42857142857142855

Recall 
  Class 0: 0.9987944544906571 - Class 1: 0.09090909090909091

FScore 
  Class 0: 0.9948964274992494 - Class 1: 0.15000000000000002




**XGBClassifier**

Accuracy 0.9904506117576842 

Confusion Matrix:
[[3318    0]
 [  32    1]]

Precision 
  Class 0: 0.9904477611940299 - Class 1: 1.0

Recall 
  Class 0: 1.0 - Class 1: 0.030303030303030304

FScore 
  Class 0: 0.9952009598080384 - Class 1: 0.05882352941176471




### Comments on the results

If we look only at the Accuracy, it seems that all the models performed phenomenally, except perhaps Logistic Regression, which scored just 83%. But we must look at the other metrics to understand how the models really performed, since Accuracy is only meaningful when the dataset is balanced across all classes, which is not the case here.

When we look at the Recall, for instance, we clearly see that the models performed poorly when trying to classify the TARGET = 1 cases. From this perspective, the Logistic Regression model was in fact the best performer, finding 20 of the 33 positive cases. On the other hand, the same Logistic Regression model wrongly predicted TARGET = 1 for 543 negative cases. That's why its precision score for class 1 was only about 3.6%.

In short, despite the great accuracy scores, none of the models performed well.
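
A quick way to see why Accuracy is misleading here is to compare it with a trivial baseline that always predicts TARGET = 0. The arithmetic below uses only the test-set class counts that can be read from the confusion matrices above; the variable names are illustrative.

```python
# Test-set class counts read from the confusion matrices above:
# every matrix has rows that sum to 3318 (TARGET = 0) and 33 (TARGET = 1).
n_negative, n_positive = 3318, 33

# A "model" that always predicts 0 gets every negative right and every positive wrong.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall_class_1 = 0 / n_positive

print(f"always-0 accuracy:           {baseline_accuracy:.4f}")
print(f"always-0 recall for class 1: {baseline_recall_class_1:.1f}")
```

The baseline reaches an accuracy of about 0.9902 without learning anything, so the accuracies of roughly 0.98 to 0.99 reported above carry very little information about the TARGET = 1 cases.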

## Feature Scaling

As mentioned before, the variables have very different ranges. This can hurt the performance of some models, especially the ones that rely on distances or similarities between examples, such as the SVC.

For that reason, it is a good idea to apply feature scaling to our data.

In [11]:
sc = StandardScaler()
X_train_scalled = sc.fit_transform(X_train)
X_test_scalled = sc.transform(X_test)

X_train_scalled = pd.DataFrame(X_train_scalled, columns = X_train.columns)
X_test_scalled = pd.DataFrame(X_test_scalled, columns = X_test.columns)
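
For reference, the standardization performed above amounts to subtracting each column's training mean and dividing by its training standard deviation, and then reusing those same statistics on the test set. The NumPy sketch below illustrates the idea on hypothetical toy matrices; it is not a replacement for `StandardScaler`.

```python
import numpy as np

# Toy matrices (hypothetical): two features on very different scales.
train = np.array([[0.0, 100.0], [1.0, 300.0], [0.0, 500.0], [1.0, 700.0]])
test = np.array([[1.0, 400.0]])

# The statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)    # population std (ddof=0), which StandardScaler also uses
std[std == 0] = 1.0        # guard against constant columns

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics, never refit on test

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the preprocessing step.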

### Results

In [12]:
classifiers = create_classifiers()
run_models(X_train_scalled, X_test_scalled, y_train, y_test, classifiers)

**LogisticRegression**

Accuracy 0.8349746344374813 

Confusion Matrix:
[[2778  540]
 [  13   20]]

Precision 
  Class 0: 0.9953421712647796 - Class 1: 0.03571428571428571

Recall 
  Class 0: 0.8372513562386981 - Class 1: 0.6060606060606061

FScore 
  Class 0: 0.909477819610411 - Class 1: 0.06745362563237774




**DecisionTreeClassifier**

Accuracy 0.9838854073410922 

Confusion Matrix:
[[3294   24]
 [  30    3]]

Precision 
  Class 0: 0.9909747292418772 - Class 1: 0.1111111111111111

Recall 
  Class 0: 0.9927667269439421 - Class 1: 0.09090909090909091

FScore 
  Class 0: 0.991869918699187 - Class 1: 0.09999999999999999




**RandomForestClassifier**

Accuracy 0.9898537749925396 

Confusion Matrix:
[[3316    2]
 [  32    1]]

Precision 
  Class 0: 0.9904420549581839 - Class 1: 0.3333333333333333

Recall 
  Class 0: 0.9993972272453285 - Class 1: 0.030303030303030304

FScore 
  Class 0: 0.9948994899489949 - Class 1: 0.05555555555555555




**SVC**

Accuracy 0.8713816771113101 

Confusion Matrix:
[[2906  412]
 [  19   14]]

Precision 
  Class 0: 0.9935042735042735 - Class 1: 0.03286384976525822

Recall 
  Class 0: 0.8758288125376733 - Class 1: 0.42424242424242425

FScore 
  Class 0: 0.9309626781995834 - Class 1: 0.06100217864923748




**XGBClassifier**

Accuracy 0.9904506117576842 

Confusion Matrix:
[[3318    0]
 [  32    1]]

Precision 
  Class 0: 0.9904477611940299 - Class 1: 1.0

Recall 
  Class 0: 1.0 - Class 1: 0.030303030303030304

FScore 
  Class 0: 0.9952009598080384 - Class 1: 0.05882352941176471




### Comments on the results

For the tree-based models, including XGBoost, essentially nothing changed, as expected, since these models are not sensitive to feature scaling. 

The Logistic Regression also had similar results, with no improvement.

The SVC model, however, showed a significant gain in its Recall score for class 1 (from 0.09 to 0.42), at the cost of a much lower precision (from 0.43 to 0.03).

## Feature Scaling + Undersampling

The fact that our dataset is heavily imbalanced can also hurt the models' performance. One way of countering that is to rebalance the training set by removing examples of the majority class. Here we keep every TARGET = 1 example and twice as many TARGET = 0 examples.

In [13]:
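# X_train_scalled already has a fresh 0..n-1 index, so y_train's index is
# reset (and dropped) to line the two up row by row.
train_df = pd.concat([X_train_scalled, y_train.reset_index(drop=True)],axis=1)
n_positives = train_df['TARGET'].value_counts()[1]

# Keep every positive and only the first 2 * n_positives negatives (a 2:1 ratio).
new_train_df = pd.concat([train_df[train_df['TARGET'] == 0].head(n_positives*2), train_df[train_df['TARGET'] == 1]])

X_train_r = new_train_df[['V1','V2','V3','V4','V5','V6','V8','V9','V10']]
y_train_r = new_train_df['TARGET']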
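
The cell above keeps the first negatives it finds, which works because the training split was already shuffled. A possible variation, shown here only as a sketch, draws the negatives at random with a fixed seed and exposes the negative-to-positive ratio as a parameter. The helper name `undersample` and the toy frame are hypothetical.

```python
import pandas as pd


def undersample(frame, target="TARGET", ratio=2, seed=0):
    """Keep all positives and a random sample of `ratio` negatives per positive."""
    positives = frame[frame[target] == 1]
    negatives = frame[frame[target] == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    sampled = negatives.sample(n=n_keep, random_state=seed)
    # Shuffle so the two classes are not stored as two contiguous blocks.
    return pd.concat([sampled, positives]).sample(frac=1, random_state=seed)


# Toy frame (hypothetical): 95 negatives and 5 positives.
toy = pd.DataFrame({"V1": range(100), "TARGET": [0] * 95 + [1] * 5})
balanced = undersample(toy)
print(balanced["TARGET"].value_counts())
```

Setting `ratio=1` would give a fully balanced training set, at the price of discarding even more negative examples.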

### Results

In [14]:
classifiers = create_classifiers()
run_models(X_train_r, X_test_scalled, y_train_r, y_test, classifiers)

**LogisticRegression**

Accuracy 0.7970754998507908 

Confusion Matrix:
[[2647  671]
 [   9   24]]

Precision 
  Class 0: 0.9966114457831325 - Class 1: 0.034532374100719423

Recall 
  Class 0: 0.7977697408077155 - Class 1: 0.7272727272727273

FScore 
  Class 0: 0.8861734181452962 - Class 1: 0.06593406593406594




**DecisionTreeClassifier**

Accuracy 0.8248284094300209 

Confusion Matrix:
[[2744  574]
 [  13   20]]

Precision 
  Class 0: 0.995284729778745 - Class 1: 0.03367003367003367

Recall 
  Class 0: 0.8270042194092827 - Class 1: 0.6060606060606061

FScore 
  Class 0: 0.9033744855967079 - Class 1: 0.06379585326953748




**RandomForestClassifier**

Accuracy 0.8707848403461653 

Confusion Matrix:
[[2899  419]
 [  14   19]]

Precision 
  Class 0: 0.9951939581187779 - Class 1: 0.04337899543378995

Recall 
  Class 0: 0.8737191078963231 - Class 1: 0.5757575757575758

FScore 
  Class 0: 0.9305087465896325 - Class 1: 0.08067940552016985




**SVC**

Accuracy 0.7958818263205013 

Confusion Matrix:
[[2644  674]
 [  10   23]]

Precision 
  Class 0: 0.9962321024868124 - Class 1: 0.03299856527977044

Recall 
  Class 0: 0.7968655816757083 - Class 1: 0.696969696969697

FScore 
  Class 0: 0.8854655056932351 - Class 1: 0.06301369863013699




**XGBClassifier**

Accuracy 0.8606386153387049 

Confusion Matrix:
[[2863  455]
 [  12   21]]

Precision 
  Class 0: 0.9958260869565217 - Class 1: 0.04411764705882353

Recall 
  Class 0: 0.8628691983122363 - Class 1: 0.6363636363636364

FScore 
  Class 0: 0.9245922816082675 - Class 1: 0.0825147347740668




### Comments on the results

With undersampling, the models behaved quite differently:

* The Logistic Regression model improved slightly in Recall (from 0.61 to 0.73), even with the much smaller training set
* The XGBoost, SVC, Decision Tree and Random Forest models all had a great improvement in their Recall scores, at the cost of precision. 

Clearly the balanced training set helped, but unfortunately it became so small that it stood in the way of better results.


## Feature Scaling + Oversampling

Like undersampling, oversampling balances the dataset, but it does so by creating new synthetic examples of the rare class instead of discarding examples of the majority class. One way of achieving this is the SMOTE algorithm. 

Reference: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html#imblearn-over-sampling-smote
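
The core idea of SMOTE is to place each synthetic example somewhere on the line segment between a minority example and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step with a single nearest neighbour; the function name `smote_like_sample` and the toy points are hypothetical, and the imbalanced-learn implementation differs in its details (for instance in how base rows and neighbours are chosen).

```python
import numpy as np


def smote_like_sample(minority, rng):
    """Create one synthetic point between a minority row and its nearest minority neighbour."""
    i = rng.integers(len(minority))
    base = minority[i]
    # Euclidean distance from the chosen row to every minority row.
    dist = np.linalg.norm(minority - base, axis=1)
    dist[i] = np.inf                       # exclude the row itself
    neighbour = minority[np.argmin(dist)]
    gap = rng.random()                     # uniform in [0, 1)
    return base + gap * (neighbour - base)


# Toy minority class (hypothetical): three points in two dimensions.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies between two existing minority points, the new examples stay inside the region already covered by the rare class.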

In [15]:
sm = SMOTE(sampling_strategy='auto', k_neighbors=1, random_state=0)
X_train_over, y_train_over = sm.fit_resample(X_train_scalled, y_train)

In [16]:
X_train_over = pd.DataFrame(X_train_over, columns = X_train.columns)
y_train_over = pd.DataFrame(y_train_over, columns = ["TARGET"]) 

### Results

In [17]:
classifiers = create_classifiers()
run_models(X_train_over, X_test_scalled, y_train_over, y_test, classifiers)

**LogisticRegression**

Accuracy 0.8293046851686063 

Confusion Matrix:
[[2759  559]
 [  13   20]]

Precision 
  Class 0: 0.9953102453102453 - Class 1: 0.03454231433506045

Recall 
  Class 0: 0.8315250150693189 - Class 1: 0.6060606060606061

FScore 
  Class 0: 0.9060755336617405 - Class 1: 0.065359477124183




**DecisionTreeClassifier**

Accuracy 0.9713518352730528 

Confusion Matrix:
[[3251   67]
 [  29    4]]

Precision 
  Class 0: 0.9911585365853659 - Class 1: 0.056338028169014086

Recall 
  Class 0: 0.9798071127185051 - Class 1: 0.12121212121212122

FScore 
  Class 0: 0.9854501364049713 - Class 1: 0.07692307692307693




**RandomForestClassifier**

Accuracy 0.9785138764547896 

Confusion Matrix:
[[3276   42]
 [  30    3]]

Precision 
  Class 0: 0.9909255898366606 - Class 1: 0.06666666666666667

Recall 
  Class 0: 0.9873417721518988 - Class 1: 0.09090909090909091

FScore 
  Class 0: 0.9891304347826088 - Class 1: 0.07692307692307691




**SVC**

Accuracy 0.8833184124142047 

Confusion Matrix:
[[2947  371]
 [  20   13]]

Precision 
  Class 0: 0.9932591843613077 - Class 1: 0.033854166666666664

Recall 
  Class 0: 0.8881856540084389 - Class 1: 0.3939393939393939

FScore 
  Class 0: 0.937788385043755 - Class 1: 0.06235011990407673




**XGBClassifier**

Accuracy 0.9086839749328559 

Confusion Matrix:
[[3029  289]
 [  17   16]]

Precision 
  Class 0: 0.994418910045962 - Class 1: 0.05245901639344262

Recall 
  Class 0: 0.9128993369499698 - Class 1: 0.48484848484848486

FScore 
  Class 0: 0.9519170333123821 - Class 1: 0.09467455621301775




### Comments on the results

Although the results were better than those obtained with the imbalanced training set, in general the best results were achieved with undersampling rather than oversampling. That doesn't mean that oversampling should be discarded: the SMOTE parameters can still be tuned in search of better results.

# Conclusion & Recommendations

Summarizing the results:

![title](results_.png)

Based on the Recall metric, the best way to approach this problem so far would be:

**Feature Scaling > Undersampling > Logistic Regression**
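
The ranking behind this recommendation can be reproduced from the confusion matrices reported above. The sketch below copies the number of correctly classified TARGET = 1 cases (bottom-right cell of each matrix) and sorts every approach and model pair by class-1 Recall; the dictionary layout and labels are just one possible way to organise the numbers.

```python
# True positives for TARGET = 1, out of 33 positives in the test set,
# read from the second row of each confusion matrix above.
POSITIVES = 33
true_positives = {
    "Direct": {
        "LogisticRegression": 20, "DecisionTree": 3, "RandomForest": 1, "SVC": 3, "XGBoost": 1},
    "Scaling": {
        "LogisticRegression": 20, "DecisionTree": 3, "RandomForest": 1, "SVC": 14, "XGBoost": 1},
    "Scaling + Undersampling": {
        "LogisticRegression": 24, "DecisionTree": 20, "RandomForest": 19, "SVC": 23, "XGBoost": 21},
    "Scaling + Oversampling": {
        "LogisticRegression": 20, "DecisionTree": 4, "RandomForest": 3, "SVC": 13, "XGBoost": 16},
}

# Build (recall, approach, model) tuples and sort from best to worst recall.
ranking = sorted(
    ((tp / POSITIVES, approach, model)
     for approach, models in true_positives.items()
     for model, tp in models.items()),
    reverse=True,
)

for recall, approach, model in ranking[:3]:
    print(f"{recall:.3f}  {approach:<24} {model}")
```

The top three entries all come from the undersampling experiment, with Logistic Regression first at a class-1 Recall of about 0.727.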

The biggest problem faced in this experiment was the low number of *TARGET = 1* examples in the dataset (120 examples, to be exact). Such a low number made it difficult for the models to classify the positive cases precisely.

The results suggest that the class imbalance can be worked around, but to do that we need more data, so that even the rarest value of the dependent variable "TARGET" is represented by a considerable number of examples.