# Credit Fraud Detection: EDA & Baseline Prediction
The objective of this analysis is to explore the Credit Fraud Detection dataset, collected from European cardholders in September 2013, and to develop a baseline for fraud detection using machine learning algorithms. The data has been anonymized and preprocessed, such that each record has 28 PCA-transformed fields, plus Time and Amount, together with a Class (0 for a regular transaction, 1 for fraud).
The following is the roadmap of the study:
* [Data Assessment](#section_DataAssessment)
* [Data Exploration](#section_DataExploration)
* [Predictive Models](#section_PredictiveModels)
 * [XGBoost](#section_XGBoost)
 * [XGBoost - Parameter Optimization](#section_XGBoost_ParameterOptimization)
* [Results](#section_Results)
* [Conclusion and Next Steps](#section_Conclusion)

<a id="section_DataAssessment"></a>
## **Data Assessment** 
Let's have a quick look at what the data can tell us, using basic but powerful analytics. First we load all the relevant libraries.

In [1]:
# importing relevant libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plots
from matplotlib.colors import LogNorm # logarithmic norm
plt.rcParams['figure.figsize'] = [10, 5] # set plot size
from sklearn.preprocessing import StandardScaler # needed to normalize data
from sklearn.model_selection import train_test_split # needed to split data into train and test
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc # relevant performance metrics
from sklearn.model_selection import GridSearchCV # needed to optimize hyperparameters 
from sklearn.cluster import Birch # needed for clustering 
import xgboost as xgb # gradient boosting library

The data was provided as a CSV file, so we can read it directly with pandas.

In [2]:
data = pd.read_csv('../input/creditcard.csv') # read the data
print('The dataset consists of %d rows and %d columns' % data.shape)
data.head() # A first look at the data

The mean value of features V1-V28 is 0, with the standard deviation decreasing from 1.959 to 0.33. The plot below shows the first nine features, with the mean of each marked in red.

In [3]:
lFEATURE = []
lMEAN = []
for ii in range(1,10):
    plt.scatter([ii for jj in range(len(data['V' + str(ii)]))],data['V' + str(ii)],alpha=0.5,color='blue')
    lFEATURE.append(ii)
    lMEAN.append(data['V' + str(ii)].mean())
plt.scatter(lFEATURE,lMEAN,marker = '+', color = 'red', s = 100)
plt.xlabel('Feature')
plt.show()

For a closer look at the distributions we can plot them one by one.

In [4]:
data[data.columns.values[1:5]].hist()
#data[data.columns.values[5:9]].hist()
plt.show()

In [5]:
data.isnull().sum()

There are no null values in any of the fields. So far we know that all the preprocessed features are centered, with a mean value of 0 and a standard deviation below 2. In the next section we will examine the data in more detail.
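
The mean and spread claims above can be checked in one table rather than read off a plot. The following is a minimal sketch, assuming the PCA columns are named `V1`, `V2`, and so on; the helper `summarize_pca_features` and the toy frame are hypothetical and use random values, not the real dataset.

```python
import numpy as np
import pandas as pd

def summarize_pca_features(df, prefix="V"):
    """Return one row per PCA column with its mean and standard deviation."""
    cols = [c for c in df.columns
            if c.startswith(prefix) and c[len(prefix):].isdigit()]
    return df[cols].agg(["mean", "std"]).T

# Toy stand-in for the real table (hypothetical values, not the dataset).
toy_rng = np.random.default_rng(0)
toy_frame = pd.DataFrame({
    "V1": toy_rng.normal(0.0, 2.0, 1000),
    "V2": toy_rng.normal(0.0, 0.5, 1000),
    "Amount": toy_rng.exponential(50.0, 1000),  # ignored: not a PCA column
})
toy_summary = summarize_pca_features(toy_frame)
print(toy_summary)
```

On the real table, `summarize_pca_features(data)` would list all 28 features, which makes the decreasing standard deviation easy to verify.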

<a id="section_DataExploration"></a>
## **Data Exploration** 
For privacy reasons, the 28 data features have been preprocessed and anonymized, so we only have access to the data after principal component analysis (PCA). However, we still have Time and Amount as descriptive fields.

In [6]:
fig, ax = plt.subplots(1,2)
ax[0].hist(data['Time']/(3600),bins=48)
ax[0].set_xlabel('Time[hours]')
ax[1].hist(data['Amount'],bins=100)
ax[1].set_xlabel('Amount[CUR]')
plt.show()

The Time distribution in hours presents two main peaks, roughly around hours 15 and 40 (about 24 h apart), which suggests a daily modulation of the traffic, assuming that the data was captured from a single time zone (**Assumption A**).
On the other hand, most of the Amount values are concentrated near small values. A transformation can provide more insight into the real distribution.

In [7]:
fig, ax = plt.subplots(1,2)
ax[0].hist(data['Time']/(3600),bins=48)
ax[0].set_xlabel('Time[hours]')
ax[1].hist(data['Amount'],bins=np.logspace(np.log10(0.001),np.log10(25000), 50))
ax[1].set_xlabel('Amount[CUR]')
ax[1].set_xscale("log")
plt.show()

Let's explore the Amount field. The histogram suggests that a logarithm is the natural transformation for the Amount. However, the logarithm is undefined for zero and negative values. Fortunately, there are no negative values and the number of records with zero amount is small, so we can select the positive values of Amount and handle the zero case with a separate model. The smallest non-zero value is 0.01, consistent with one cent of the currency (**Assumption B**).
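
To make the zero-amount issue concrete, here is a small sketch on made-up values. The helper name `split_by_amount` and the toy amounts are assumptions for illustration only. It separates the zero records and takes the base-10 logarithm of the rest.

```python
import numpy as np
import pandas as pd

def split_by_amount(df, col="Amount"):
    """Split records into strictly positive and zero amounts.

    The positive part gets a base-10 log column; the zero part is
    returned untouched so it can be handled separately.
    """
    positive = df[df[col] > 0].copy()
    zero = df[df[col] == 0].copy()
    positive["log10_" + col] = np.log10(positive[col])
    return positive, zero

# Toy amounts (hypothetical): one zero record and three positive ones.
toy_amounts = pd.DataFrame({"Amount": [0.0, 0.01, 1.0, 100.0]})
toy_positive, toy_zero = split_by_amount(toy_amounts)
print(toy_positive)
print(len(toy_zero), "record(s) with zero amount set aside")
```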

In [8]:
print('There are %d (%f) records with zero amount, and the minimum non-zero value is %f' 
      % (len(data[data['Amount']==0]),1.*len(data[data['Amount']==0])/len(data),data[data['Amount']>0]['Amount'].min()))

The above discussion motivates the first cut in our analysis: consider only records with a minimum Amount of 0.01 CUR.

In [9]:
data_clean = data[data['Amount']>0].copy()

### Imbalanced dataset
The data set is clearly imbalanced, in the sense that the number of negative examples (class 0) is overwhelmingly larger than the number of positive examples. This suggests that we need a way to even out the contributions from positive and negative examples in our predictors.

In [10]:
data_clean['Class'].value_counts()

Let's consider the behavior of the rate of positive (number of positive over total records) as a function of Time.

In [11]:
fig, ax1 = plt.subplots()
hours = data_clean['Time']/3600
time_bins = np.linspace(0, hours.max(), 49) # shared bin edges so both histograms line up
hist_time_count, hist_time_bin, hist_time_img = ax1.hist(hours, bins=time_bins, label='All Traffic')
hist_time_count_1, hist_time_bin_1, hist_time_img_1 = ax1.hist(hours[data_clean['Class']==1], bins=time_bins, label='Fraud')
pos_rate = hist_time_count_1/hist_time_count # fraud ratio per time bin
ax2 = ax1.twinx()
ax2.plot(hist_time_bin[:-1], pos_rate, marker='+',linestyle='None',color='red',label='Fraud Ratio')
ax1.set_xlabel('Time[hours]')
ax2.tick_params('y', colors='r')
plt.legend()
plt.show()
mean_pos_rate = np.mean(pos_rate)
max_pos_rate = np.max(pos_rate)

We can see that there is a higher positive rate during the times of lower traffic. 
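
The ratio of positives per time bin is only meaningful when both counts use the same bin edges. The sketch below is one way to compute it with `np.histogram`; the function name `fraud_rate_by_bin` and the toy values are hypothetical. Bins with no traffic are returned as NaN rather than dividing by zero.

```python
import numpy as np

def fraud_rate_by_bin(values, labels, bin_edges):
    """Fraction of class-1 records in each bin; NaN where a bin is empty."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    total, _ = np.histogram(values, bins=bin_edges)
    fraud, _ = np.histogram(values[labels == 1], bins=bin_edges)
    rate = np.full(len(total), np.nan)
    # Divide only where there is traffic; other bins keep their NaN.
    np.divide(fraud, total, out=rate, where=total > 0)
    return rate

# Toy example (hypothetical): six transactions spread over four hourly bins.
toy_hours = [0.5, 0.7, 1.2, 1.5, 1.9, 3.1]
toy_class = [1, 0, 0, 0, 1, 0]
toy_rate = fraud_rate_by_bin(toy_hours, toy_class, bin_edges=[0, 1, 2, 3, 4])
print(toy_rate)
```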

In [12]:
print('Mean value of pos rate is %f and the max value of pos rate is %f making it %f times higher' % (mean_pos_rate,max_pos_rate,max_pos_rate/mean_pos_rate))

Let's consider now the Amount and the rate of positive as a function of the Amount

In [13]:
fig, ax = plt.subplots(1,2)
ax[0].hist(data_clean['Amount'],bins=np.logspace(np.log10(0.001),np.log10(25000), 50))
ax[0].set_xlabel('Amount[CUR]')
ax[0].set_xscale("log")
ax[1].hist(data_clean[data_clean['Class']==1]['Amount'],bins=np.logspace(np.log10(0.001),np.log10(25000), 50))
ax[1].set_xlabel('Amount[CUR]')
ax[1].set_xscale("log")
plt.show()

In [14]:
fig, ax1 = plt.subplots()
hist_amount_count, hist_amount_bin, hist_amount_img = plt.hist(data_clean['Amount'],bins=np.logspace(np.log10(0.001),np.log10(25000), 50),label='All Traffic')
hist_amount_count_1, hist_amount_bin_1, hist_amount_img_1 = plt.hist(data_clean[data_clean['Class']==1]['Amount'], bins=np.logspace(np.log10(0.001),np.log10(25000), 50))
ax1.set_xscale("log")
ax2 = ax1.twinx()
plt.plot(hist_amount_bin_1[:-1],1.*hist_amount_count_1/hist_amount_count, marker='+',linestyle='None',color='red',label='Fraud Ratio')
ax1.set_xscale("log")
ax1.set_xlabel('Amount[CUR]')
ax2.tick_params('y', colors='r')
ax2.set_xscale("log")
plt.legend()
plt.show()

Actionable insights
* Special attention to low-traffic hours is needed to identify fraud records. If possible, add monitoring resources during down time
* The highest fraud rates are associated with:
 * minimum contributions (Amount = 0.01), possibly tests
 * small contributions (Amount between 0.1 and 1), possibly tests
 * medium contributions (Amount between 1 and 100)
 * large contributions (Amount between 100 and 2000)
* No significant fraud observed for Amounts larger than 2000
 


<a id="section_PredictiveModels"></a>
## **Predictive Models** 
Classification with an imbalanced data set needs a way to mitigate the imbalance, and we will try undersampling. First we define a few new features: the logarithm of the Amount, the normalized logarithm of the Amount and the normalized Time.

In [15]:
data_clean['logAmount'] = np.log10(data_clean['Amount'])

In [16]:
data_clean['normAmount'] = StandardScaler().fit_transform(data_clean['logAmount'].values.reshape(-1, 1))
data_clean['normTime'] = StandardScaler().fit_transform(data_clean['Time'].values.reshape(-1, 1))
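
As a quick check of what the normalization does, the sketch below compares `StandardScaler` with a manual z-score on a few made-up amounts (the values are hypothetical). `StandardScaler` subtracts the mean and divides by the population standard deviation, so the two results should agree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy amounts (hypothetical), log-transformed as in the cells above.
toy_amount = np.array([0.01, 1.0, 10.0, 100.0, 1000.0])
toy_log = np.log10(toy_amount).reshape(-1, 1)

toy_scaled = StandardScaler().fit_transform(toy_log)
# Manual z-score with the population standard deviation (ddof=0).
toy_manual = (toy_log - toy_log.mean()) / toy_log.std()

print(np.allclose(toy_scaled, toy_manual))
```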

We will now delete the fields that are not normalized

In [17]:
data_sel = data_clean.drop(['Time','Amount','logAmount'],axis=1)

We will set the ratio of positive to negative records as 1 but it can be changed for further explorations

In [18]:
ratio_pos_neg = 1

We now separate the data into 0-class and 1-class. Then we will randomly sample the 0-class set and select a set with the same size as the 1-class set. The new balanced 1-class and 0-class will make up the data_model

In [19]:
data_sel_0 = data_sel[data_sel['Class']==0]
data_sel_1 = data_sel[data_sel['Class']==1]

In [20]:
data_balanced = data_sel_0.sample(int(len(data_sel_1)*ratio_pos_neg))

In [21]:
data_model = pd.concat([data_balanced,data_sel_1]).sample(frac=1)
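
The sampling steps above can also be wrapped in a small function with a fixed seed, so that the balanced set is the same on every run. This is only a sketch: the name `undersample` and its `seed` parameter are assumptions, not part of the original cells.

```python
import pandas as pd

def undersample(df, label_col="Class", ratio=1.0, seed=0):
    """Keep all class-1 rows and a random sample of class-0 rows.

    `ratio` is the number of class-0 rows kept per class-1 row.
    """
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_neg = min(len(neg), int(len(pos) * ratio))
    neg_sample = neg.sample(n=n_neg, random_state=seed)
    # Shuffle so the classes are not stored in two contiguous blocks.
    return pd.concat([neg_sample, pos]).sample(frac=1, random_state=seed)

# Toy frame (hypothetical): 5 positives and 95 negatives.
toy_records = pd.DataFrame({"x": range(100), "Class": [1] * 5 + [0] * 95})
toy_balanced = undersample(toy_records, ratio=1.0, seed=42)
print(toy_balanced["Class"].value_counts())
```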

Now we select the predictor features X and the target y, and split the set into training and testing (2/3 and 1/3 respectively).

In [22]:
X = data_model.loc[:, data_model.columns != 'Class']
y = data_model.loc[:, data_model.columns == 'Class']

In [23]:
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

<a id="section_XGBoost"></a>
### XGBoost
[XGBoost](http://xgboost.readthedocs.io/en/latest/) is an implementation of [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) designed for fast training of boosted tree models.  We first use a baseline implementation with initial parameters n_estimators=200, max_depth=3 and learning_rate=0.01. We fit the training data and then use the testing data to evaluate the performance of the algorithm.
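
To give an intuition for what the three parameters control, here is a toy gradient boosting loop for squared loss, built from scikit-learn regression trees. It is a teaching sketch and not how XGBoost is implemented (XGBoost adds regularization and second-order gradient information, among other things). The function `fit_toy_boosting` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_toy_boosting(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees to the residuals of the running prediction."""
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    trees, losses = [], []
    for _ in range(n_estimators):
        residual = y - pred  # negative gradient of 0.5 * (y - pred)**2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Take a small step in the direction suggested by the new tree.
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
        losses.append(float(np.mean((y - pred) ** 2)))
    return base, trees, losses

# Synthetic regression problem (hypothetical data).
toy_rng = np.random.default_rng(0)
X_toy = toy_rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + toy_rng.normal(0, 0.1, 200)
toy_base, toy_trees, toy_losses = fit_toy_boosting(X_toy, y_toy)
print('training MSE after first tree: %f, after last tree: %f'
      % (toy_losses[0], toy_losses[-1]))
```

Each tree corrects a fraction (`learning_rate`) of the remaining error, so a small learning rate generally needs a larger `n_estimators`, while `max_depth` limits how complex each correction can be.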

In [24]:
model = xgb.XGBClassifier(max_depth=3, n_estimators=200, learning_rate=0.01)
model.fit(X_train,y_train)
y_pred_xgb = model.predict(X_test)
y_score_xgb = model.predict_proba(X_test)[:,1]
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_score_xgb)

In [31]:
print('XGBoost ROC AUC', auc(fpr_xgb, tpr_xgb))
print(classification_report(y_test, y_pred_xgb, target_names=['NoFraud','Fraud']))
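
For reference, the precision and recall shown in the classification report can be derived from the confusion matrix. The sketch below uses a made-up set of ten labels (hypothetical, not model output) and compares the manual formulas with scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 5 regular and 5 fraud records.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
toy_precision = tp / (tp + fp)  # of the records flagged as fraud, how many are fraud
toy_recall = tp / (tp + fn)     # of the fraud records, how many were flagged

print(toy_precision, precision_score(toy_true, toy_pred))
print(toy_recall, recall_score(toy_true, toy_pred))
```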

<a id="section_XGBoost_ParameterOptimization"></a>
### XGBoost - Parameter optimization
We can use a grid search optimization to find the set of parameters with the best performance.
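
It helps to know how much work a grid implies before launching it. As a rough sketch, scikit-learn's `ParameterGrid` can enumerate the combinations. The values below mirror the grid used in this section (three values each for `max_depth` and `min_child_weight`, with 5-fold cross-validation); the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

toy_grid = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
n_candidates = len(ParameterGrid(toy_grid))  # every combination of the two lists
n_folds = 5
# One fit per candidate and fold, plus the final refit of the best candidate
# on the whole training set (the GridSearchCV default, refit=True).
n_fits = n_candidates * n_folds + 1

print('%d candidates, %d model fits in total' % (n_candidates, n_fits))
```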

In [32]:
cv_params = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 100, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}
optimized_model = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                            cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1) 
optimized_model.fit(X_train,y_train)

In [33]:
best_optimized_model = optimized_model.best_estimator_

In [34]:
y_pred_xgb = best_optimized_model.predict(X_test)
y_score_xgb = optimized_model.predict_proba(X_test)[:,1]
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_score_xgb)

In [35]:
print('XGBoost ROC AUC', auc(fpr_xgb, tpr_xgb))
print(classification_report(y_test, y_pred_xgb, target_names=['NoFraud','Fraud']))

In [36]:
fig, ax = plt.subplots()
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_xgb, tpr_xgb, label='XGB')
ax.set_aspect('equal')
plt.legend()
plt.show()

The performance improves after scanning the parameter grid, with very high precision and accuracy. From the ROC curve we can see that when the False Positive Rate (FPR) is close to zero .

In [37]:
features = X.columns.values.tolist()

In [38]:
from sklearn.base import clone # used to refit a fresh copy without altering the tuned model

def getAUC_XGB(i_index):
    # AUC-ROC obtained when the feature at position i_index is left out
    slFIELDS = features[:i_index] + features[i_index+1:]
    X = data_model[slFIELDS]
    y = data_model['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    xgbc_best = clone(best_optimized_model) # same hyperparameters, unfitted copy
    xgbc_best.fit(X_train, y_train)
    y_score_xgb = xgbc_best.predict_proba(X_test)[:,1]
    fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_score_xgb)
    return auc(fpr_xgb, tpr_xgb)

In [39]:
getAUC_XGB(i_index=0)

In [40]:
lAUC = []
lIND = []
for ii in range(len(features)):
    lIND.append(features[ii])
    lAUC.append(getAUC_XGB(ii))

In [41]:
plt.scatter(range(len(lIND)),lAUC)
plt.xticks([ii for ii in range(len(features))], features, rotation='vertical')
plt.show()

We can see that excluding the field V14 brings the AUC-ROC down, suggesting that this feature carries more weight than the others in the model's predictions.

<a id="section_XGBoost_Clustering"></a>
### XGBoost - Parameter optimization & Clustering of 0-class records
In this section we will explore the clustering of the 0-class events into representative records for each segment. The idea is that certain 0-class events are similar and can be grouped. Instead of randomly undersampling records, we can use the mean value of each segment. For this particular approach, we will explore the [BIRCH clustering approach](https://en.wikipedia.org/wiki/BIRCH), which works very well with large data sets, setting the number of clusters equal to the number of 1-class records. This choice needs to be revisited and will be part of the next steps.

Due to computation constraints from the kernel we won't cluster the entire 0-class set, but only a randomly selected fraction.

In [42]:
data_Birch = data_sel_0.loc[:, data_sel_0.columns != 'Class'].sample(frac=0.1)

Next we define the number of clusters as a multiple of the number of 1-class events. In this case it will be equal.

In [43]:
ratio_pos_neg = 1
brc = Birch(n_clusters=len(data_sel_1)*ratio_pos_neg)

We now fit the selected 0-class set with the BIRCH algorithm and add a new variable Label to the data set that will be used to aggregate the events with the same label and produce the mean value of each variable as the representative event.
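
The idea of a representative record is easier to see on a small example. The sketch below clusters three synthetic two-dimensional blobs with `Birch` and replaces each cluster by the mean of its members. All names and values here are hypothetical and unrelated to the real features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import Birch

# Three well-separated blobs of 100 points each (synthetic data).
toy_rng = np.random.default_rng(0)
toy_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack([c + toy_rng.normal(0, 1.0, size=(100, 2))
                        for c in toy_centers])
toy_df = pd.DataFrame(toy_points, columns=["f1", "f2"])

# Assign each point to one of (at most) three clusters.
toy_labels = Birch(n_clusters=3).fit_predict(toy_df)

# One representative row per cluster: the mean of its members.
toy_reps = toy_df.groupby(toy_labels).mean()
toy_reps["Class"] = 0
print(toy_reps)
```

With 300 points reduced to a handful of rows, each row stands in for a whole group of similar records, which is the role the cluster means play in the cells that follow.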

In [44]:
brc.fit(data_Birch)

In [45]:
data_Birch['Label'] = brc.labels_

In [46]:
data_Birch_mean = data_Birch.groupby('Label').mean().reset_index().drop('Label',axis=1)
data_Birch_mean['Class'] = 0

In [47]:
data_model_Birch = pd.concat([data_Birch_mean,data_sel_1]).sample(frac=1)

Following a similar approach as above we will split the set into training and test, train the XGBoost algorithm and test the performance on the test set computing the relevant metrics.

In [48]:
X = data_model_Birch.loc[:, data_model_Birch.columns != 'Class']
y = data_model_Birch.loc[:, data_model_Birch.columns == 'Class']
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [49]:
model = xgb.XGBClassifier(max_depth=3, n_estimators=200, learning_rate=0.01)
model.fit(X_train,y_train)
y_pred_xgb = model.predict(X_test)
y_score_xgb = model.predict_proba(X_test)[:,1]
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_score_xgb)

In [50]:
print('XGBoost ROC AUC', auc(fpr_xgb, tpr_xgb))
print(classification_report(y_test, y_pred_xgb, target_names=['NoFraud','Fraud']))

In [51]:
cv_params = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 100, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic'}
optimized_model = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                            cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1) 
optimized_model.fit(X_train,y_train)

In [52]:
best_optimized_model = optimized_model.best_estimator_

In [53]:
y_pred_xgb = best_optimized_model.predict(X_test)
y_score_xgb = optimized_model.predict_proba(X_test)[:,1]
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_score_xgb)

In [54]:
print('XGBoost ROC AUC', auc(fpr_xgb, tpr_xgb))
print(classification_report(y_test, y_pred_xgb, target_names=['NoFraud','Fraud']))

We can see that the new procedure improves the performance of the predictor.

<a id="section_Results"></a>
## **Results** 

The predictive model based on XGBoost sets a very good baseline in terms of precision, accuracy and AUC-ROC. Implementing a grid search optimization improves the overall performance. The clustering of the 0-class is a procedure that aims to provide the most relevant and complete sample of the regular events based on similarity of records, and it further improves the performance of the predictor to a precision of 0.96 and a recall of 0.95, which is a very good performance.

<a id="section_Conclusion"></a>
## **Conclusions and Next Steps** 

The detection of fraud in credit transactions requires special attention to the imbalance problem: the overwhelming disproportion of negative to positive examples. This rare-event detection can be improved by undersampling the 0-class events. Using XGBoost, grid search optimization and representative records selected from clusters of 0-class events, the overall performance achieved was a precision of 0.96 and a recall of 0.95.

To further understand and interpret the model, a further study will consider the characteristics of each cluster, quantify the quality of the clustering methodology and compare it with other suitable clustering methodologies to improve the selection of representative events.

The special case of records with amount zero needs to be treated similarly. One option is to assign these records a very small non-zero placeholder amount and include them in the same procedure.