Questions 


### Transaction Table *
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAMT: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain, R_emaildomain
- M1 - M9

### Identity Table *
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. 
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:
- DeviceType
- DeviceInfo
- id_12 - id_38

I have decided to exclude all Vxxx features from the analysis as we do not know what they are or how they were created.
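
A minimal sketch of how the Vxxx columns could be dropped, assuming they are named `V` followed by digits (`V1`, `V2`, ...). The helper name `drop_v_columns` and the toy frame are hypothetical and are not part of the original notebook.

```python
import re

import pandas as pd


def drop_v_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns named 'V' followed by digits."""
    v_cols = [col for col in df.columns if re.fullmatch(r"V\d+", str(col))]
    return df.drop(columns=v_cols)


# Toy frame standing in for the merged transaction data (made-up values).
toy = pd.DataFrame({
    "TransactionID": [1, 2],
    "V1": [0.0, 1.0],
    "V257": [3.0, 4.0],
    "card1": [10, 20],
})
kept = drop_v_columns(toy)
print(kept.columns.tolist())
```

Matching on the full column name avoids accidentally dropping other columns that merely start with the letter V.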

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from time import time

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.model_selection import train_test_split

from sklearn import metrics

In [2]:
%%time
dataset_path = '../datasets/fraud_datasets/' # Local location
# dataset_path = '../input/ieee-fraud-detection/'

# sample_submission_df = pd.read_csv(f'{dataset_path}sample_submission.csv')

train_transaction_df = pd.read_csv(f'{dataset_path}train_transaction.csv')
test_transaction_df = pd.read_csv(f'{dataset_path}test_transaction.csv')

train_id_df = pd.read_csv(f'{dataset_path}train_identity.csv')
test_id_df = pd.read_csv(f'{dataset_path}test_identity.csv')

Wall time: 1min 54s


After importing the raw datasets, we merge the transaction and identity tables on TransactionID.

<b>(Note: throughout this notebook, it will be important to delete any dataframes that will not be used again as they take up space in memory and slow our machine down.)</b>
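
A left merge keeps every transaction row and leaves the identity columns as NaN wherever no identity record exists, which is why most identity columns in `train.head()` are empty. The sketch below illustrates this on toy frames with made-up values; `validate="one_to_one"` is an optional safeguard that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (made-up values).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [68.5, 29.0, 59.0],
})
identity = pd.DataFrame({"TransactionID": [3], "DeviceType": ["mobile"]})

# how="left" keeps all transactions; validate raises if a key is duplicated.
merged = transactions.merge(
    identity, on="TransactionID", how="left", validate="one_to_one"
)
print(merged)
print("Row count preserved:", len(merged) == len(transactions))
```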

In [3]:
%%time
train = train_transaction_df.merge(train_id_df, on='TransactionID', how='left')
test = test_transaction_df.merge(test_id_df, on='TransactionID', how='left')

# Renaming columns for better description
names = {
    'addr1': 'billing zipcode',
    'addr2': 'country codes',
    'P_emaildomain': 'Purchaser_emaildom',
    'R_emaildomain': 'Retailer_email.dom'
}

train.rename(columns=names, inplace=True)
test.rename(columns=names, inplace=True)

Wall time: 1min 13s


In [4]:
del train_transaction_df, train_id_df, test_transaction_df, test_id_df

In [5]:
print(f'Training data has {train.shape[0]} rows and {train.shape[1]} columns')
print(f'Test data has {test.shape[0]} rows and {test.shape[1]} columns')

Training data has 590540 rows and 434 columns
Test data has 506691 rows and 433 columns


In [6]:
train.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


In [None]:
test.head()

## EDA
Visually exploring the dataset is important for familiarizing ourselves with it and for better understanding the data before complex analysis is undertaken.

### Data
First, let's get an overview of the features and their data types.

<b>(Note: because the following script produces a large amount of output, the output is written to feature_overview.txt in the eda_output folder.)</b>

In [None]:
# This is for local environment
# eda_output_path = 'eda_output/'
# feat_over_file = 'feature_overview.txt'

# if os.path.exists(f'{eda_output_path}{feat_over_file}'):
#     os.remove(f'{eda_output_path}{feat_over_file}')

# with open(f'{eda_output_path}{feat_over_file}', 'a') as overview_file:
#     for col, values in train.items():
#         overview_file.write(f'{col}: {values.nunique()} ({values.dtypes})\n')
#         overview_file.write(str(values.unique()[:100]))
#         overview_file.write(
#             '\n\n###########################################################\n\n')

When we examine the feature_overview.txt file, we notice that there are many continuous features and some categorical features.

### Missing Values

In [None]:
train_contains_na = train.isna().any().sum()
test_contains_na = test.isna().any().sum()

In [None]:
print(f'{train_contains_na} out of {len(train.columns)} columns contain missing values in the train data')
print(f'{test_contains_na} out of {len(test.columns)} columns contain missing values in the test data')

In [None]:
# Calculating percentage of missing values in each column
train_missing_values = train.isna().mean().round(2)
test_missing_values = test.isna().mean().round(2)

# Keeping only columns that contain more than 5% missing values
train_missing_values_5 = train_missing_values[train_missing_values.values > 0.05]
test_missing_values_5 = test_missing_values[test_missing_values.values > 0.05]

In [None]:
print(f'{len(train_missing_values_5)} out of {len(train.columns)} columns in the train data contain more than 5% missing values')
print(f'{len(test_missing_values_5)} out of {len(test.columns)} columns in the test data contain more than 5% missing values')

As we can see there are a large number of missing values in the data.

In [None]:
# Keeping only columns that contain more than 50% missing values
train_missing_values_50 = train_missing_values[train_missing_values.values > 0.5]
test_missing_values_50 = test_missing_values[test_missing_values.values > 0.5]

In [None]:
print(f'{len(train_missing_values_50)} out of {len(train.columns)} columns in the train data contain more than 50% missing values')
print(f'{len(test_missing_values_50)} out of {len(test.columns)} columns in the test data contain more than 50% missing values')
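
The threshold counts above can be wrapped in one small helper. This is only a sketch: the function name `missing_share_summary` and the toy frame are hypothetical, and it compares the unrounded missing shares against each threshold.

```python
import numpy as np
import pandas as pd


def missing_share_summary(df, thresholds=(0.05, 0.5)):
    """Count the columns whose share of missing values exceeds each threshold."""
    share = df.isna().mean()  # fraction of NaN per column
    return {t: int((share > t).sum()) for t in thresholds}


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 0% missing
    "b": [1.0, np.nan, 3.0, 4.0],        # 25% missing
    "c": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
})
summary = missing_share_summary(toy)
print(summary)
```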

### Target Feature: isFraud

In [None]:
isFraud = 'isFraud'

In [None]:
# Plot Function
def PlotFunction(df, feature, title, xLabel, yLabel, vertical=False, percentLabels=False, size=(10, 7)):
    """Bar chart of the relative frequency of each value in df[feature] (NaN included)."""
    plt.style.use('ggplot')

    # Compute the normalized value counts once and reuse them below
    proportions = df[feature].value_counts(dropna=False, normalize=True)

    f = plt.figure(figsize=size)
    ax = f.add_subplot(1, 1, 1)
    ax.title.set_text(title)
    ax.set_ylabel(yLabel)
    ax.set_xlabel(xLabel)

    plot = ax.bar([str(i) for i in proportions.index], proportions, 0.40,
                  color=['cornflowerblue', 'darkorange', 'green', 'brown', 'black'])
    if vertical:
        plt.xticks(rotation=90)

    # Add percentage labels above each bar
    if percentLabels:
        percentages = (proportions * 100).round(3)
        # Iterate over the values positionally: the index holds category labels, not positions
        for rect, percentage in zip(plot, percentages):
            plt.text(rect.get_x() + rect.get_width()/2.0, rect.get_height(),
                     f'{percentage}%', ha='center', va='bottom')

In [None]:
print(f'These are the two types of values for fraud: {train[isFraud].unique()}')

In [None]:
print(f'isFraud contains {train[isFraud].isna().sum()} missing values')

In [None]:
### 'Fraud' Feature
PlotFunction(train, isFraud, 'isFraud percentages', 'Not fraud | Fraud', 'Proportion of transactions', percentLabels=True)

A visual examination of the graph reveals a strong class imbalance: fraudulent transactions make up only a small fraction of the data (about 3.5%).

To correct this imbalance, a resampling method such as SMOTE or random oversampling will be needed.
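
As an illustration of the simpler of the two options, here is a minimal sketch of random oversampling for a binary target using only NumPy. The function name `random_oversample` and the toy arrays are hypothetical; this is not SMOTE, which synthesizes new minority samples instead of duplicating existing ones.

```python
import numpy as np


def random_oversample(X, y, random_state=42):
    """Duplicate random minority-class rows until both classes are equally frequent."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()

    # Draw extra minority rows with replacement and append them to the data.
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Toy data: 8 non-fraud rows and 2 fraud rows.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Resampling is normally applied to the training portion only, so that any evaluation data keeps the original class distribution.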

### Exploring the Features

In [None]:
# 'TransactionAmt' feature
plt.figure(figsize=[12, 5])
plt.title('Transaction Amounts')
plt.xlabel('Amount')
plt.ylabel('Number of transactions')
_ = plt.hist(train['TransactionAmt'], bins=100)

There is a clear right skew to this data. I will try a log transformation to reduce this effect.

In [None]:
# 'TransactionAmt' feature
plt.figure(figsize=[12, 5])
plt.title('Log-Transformed Transaction Amounts')
plt.xlabel('log(Amount)')
plt.ylabel('Number of transactions')
_ = plt.hist(train['TransactionAmt'].apply(np.log), bins=100)
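
The effect of the log transformation on skew can also be checked numerically. The sketch below uses synthetic log-normal amounts rather than the real `TransactionAmt` column, and the `skewness` helper is a hypothetical name for the usual mean-of-cubed-z-scores estimate.

```python
import numpy as np


def skewness(values):
    """Sample skewness: mean of cubed z-scores (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=np.float64)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Synthetic, right-skewed amounts (not the real data).
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=4.0, sigma=1.0, size=10_000)

print("Skewness before log:", skewness(amounts))
print("Skewness after log: ", skewness(np.log(amounts)))
```

A skewness close to zero after the transformation is consistent with the roughly symmetric histogram.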

'TransactionAmt' now shows an approximately normal distribution. Next, we can separate the fraudulent transactions from the rest and compare whether their amounts are larger.

In [None]:
# 'TransactionAmt' feature
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 3))

# Not Fraud
ax1.title.set_text('Non-Fraudulent Transaction Amounts')
ax1.set_xlabel('LgAmount')
ax1.set_ylabel('Number of transactions')
_ = ax1.hist(train['TransactionAmt']
             [train['isFraud'] == 0].apply(np.log), bins=100)

# Fraud
ax2.title.set_text('Fraudulent Transaction Amounts')
ax2.set_xlabel('LgAmount')
ax2.set_ylabel('Number of transactions')
_ = ax2.hist(train['TransactionAmt']
             [train['isFraud'] == 1].apply(np.log), bins=100)

From this, we can see that the fraudulent transaction amounts seem to be higher on average.

In [None]:
### 'ProductCD' feature
PlotFunction(train, 'ProductCD', 'Product codes for each transaction', 'Product Codes', 'Count', percentLabels=True)

In [None]:
f = plt.figure(figsize=[15, 5])
ax = f.add_subplot(1, 1, 1)
ax.title.set_text('Percentage of products from fraudulent transactions')
ax.set_ylabel('Percent')
ax.set_xlabel('Products')

plot = ax.bar([str(i) for i in train['ProductCD'][train['isFraud'] == 1].value_counts(dropna=False, normalize=True).index],
              train['ProductCD'][train['isFraud'] == 1].value_counts(
                  dropna=False, normalize=True), 0.40,
              color=['cornflowerblue', 'darkorange', 'green', 'brown', 'black'])

# Add percentage labels above each bar
percentages = (train['ProductCD'][train['isFraud'] == 1].value_counts(
    dropna=False, normalize=True)*100).round(3)
# Iterate over the values positionally: the index holds product codes, not positions
for rect, percentage in zip(plot, percentages):
    plt.text(rect.get_x() + rect.get_width()/2.0, rect.get_height(),
             f'{percentage}%', ha='center', va='bottom')

There are two main products involved in fraudulent transactions, 'W' and 'C'.

In [None]:
# 'card1' feature
print('Number of unique values:', len(train['card1'].unique()))
train['card1'].value_counts(dropna=False, normalize=True).head()

In [None]:
### 'card2' feature
print('Number of unique values:', len(train['card2'].unique()))
train['card2'].value_counts(dropna=False, normalize=True).head()

'card1' and 'card2' have a large number of unique values with no clearly dominant category.

In [None]:
### 'card3' feature
print('Number of unique values:', len(train['card3'].unique()))
train['card3'].value_counts(dropna=False, normalize=True).head()

'card3' has 88% of its values being 150.

In [None]:
### 'card5' feature
print('Number of unique values:', len(train['card5'].unique()))
train['card5'].value_counts(dropna=False, normalize=True).head()

'card5' has 50% of its values being 226.

In [None]:
### 'card4' feature
PlotFunction(train, 'card4', 'Number of card types', 'Card types', 'Number of card occurrences', percentLabels=True)

print('Number of unique values:', len(train['card4'].unique()))
train['card4'].value_counts(dropna=False, normalize=True).head(10)

'visa' cards have the highest use in this dataset at 65%.

In [None]:
### 'card6' feature
PlotFunction(train, 'card6', 'Number of account types', 'Account types', 'Number of account type occurrences', percentLabels=True)

print('Number of unique values:', len(train['card6'].unique()))
train['card6'].value_counts(dropna=False, normalize=True).head(10)

'debit' account types have the highest use in this dataset at 75%.

In [None]:
### 'billing zipcode' feature
print('Number of unique values:', len(train['billing zipcode'].unique()))
train['billing zipcode'].value_counts(dropna=False, normalize=True).head(10)

In [None]:
### 'country codes' feature
print('Number of unique values:', len(train['country codes'].unique()))
train['country codes'].value_counts(dropna=False, normalize=True).head()

It seems that 88% of transactions come from <a href='https://en.wikipedia.org/wiki/List_of_UIC_country_codes'>France</a>, assuming the codes follow the UIC country code list. A further 11% of transactions have no country code; it is unlikely that all of these belong to a single country.

In [None]:
### 'Purchaser Email' feature
PlotFunction(train, 'Purchaser_emaildom', 'Number of Purchaser email types', 'Email types',
             'Number of email type occurrences', vertical=True, percentLabels=False)

print('Number of unique values:', len(train['Purchaser_emaildom'].unique()))
train['Purchaser_emaildom'].value_counts(dropna=False, normalize=True).head()

'gmail.com' accounts for 38% of the emails tied with purchaser transactions.

In [None]:
# 'Retailer Email' feature
PlotFunction(train, 'Retailer_email.dom', 'Number of Retailer email types', 'Email types',
             'Number of email type occurrences', vertical=True, percentLabels=False)

print('Number of unique values:', len(train['Retailer_email.dom'].unique()))
train['Retailer_email.dom'].value_counts(dropna=False, normalize=True).head()

About 77% of the retailer email domain values are missing (NaN); 'gmail.com' is the most common domain among the values that are present.

In [None]:
feature_names = ['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13','C14']

In [None]:
feature_names = ['D1','D2','D3','D4','D5','D6','D7','D8','D9','D10','D11','D12','D13','D14','D15']

In [None]:
feature_names = ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']

fig, axes = plt.subplots(3, 3, figsize=(17, 10))

j = 0
for row in axes:
    for ax in row:
        ax.set_title(feature_names[j])
        ax.set_ylabel('Count')
        ax.set_xlabel(feature_names[j])

        plot = ax.bar([str(i) for i in train[feature_names[j]].value_counts(dropna=False, normalize=True).index],
                      train[feature_names[j]].value_counts(
                          dropna=False, normalize=True), 0.40,
                      color=['cornflowerblue', 'darkorange', 'green', 'brown', 'black'])
        j += 1

plt.tight_layout()

In [None]:
feature_names = ['id_12', 'id_15', 'id_16', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38']


### Handling Missing Values
Due to the large number of NaN values our dataframes contain, it is critical that they are replaced with a meaningful placeholder.

In [7]:
y_train = train['isFraud'].copy()
X_train = train.drop('isFraud', axis=1)

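Features with a large share of missing values could be excluded, as the bias introduced by imputing them may become too large. The cell below uses a 20% threshold but is currently commented out, so no features are dropped.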

In [8]:
# train_missing_values = train_missing_values[train_missing_values.values > 0.20]
# test_missing_values = test_missing_values[test_missing_values.values > 0.20]

# test.drop(test_missing_values.index, axis=1, inplace=True)
# X_train.drop(train_missing_values.index, axis=1, inplace=True)

To prevent our model from crashing, it is important to replace missing values with some derived value. The replacement value is the mode of the feature for categorical features and the mean of the feature for numerical features.

In [9]:
%%time
for feature in X_train.columns:
    if X_train[feature].isna().any() or test[feature].isna().any():
        if X_train[feature].dtype == 'object':
            # mode() returns a Series, so take its first (most frequent) value
            X_train[feature] = X_train[feature].fillna(X_train[feature].mode().iloc[0])
            test[feature] = test[feature].fillna(test[feature].mode().iloc[0])
        else:
            X_train[feature] = X_train[feature].fillna(X_train[feature].mean())
            test[feature] = test[feature].fillna(test[feature].mean())

Wall time: 26.4 s
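
A minimal sketch of mode/mean imputation on toy frames. Two points are worth noting: `Series.mode()` returns a Series, so its first element is taken as the fill value, and this sketch fills the test frame with statistics computed on the training frame, which is an assumption of the example rather than something the notebook does. The helper name and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_with_train_statistics(train_df, test_df):
    """Fill NaN with the training mean (numeric) or training mode (other dtypes)."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in train_out.columns:
        if is_numeric_dtype(train_out[col]):
            fill_value = train_out[col].mean()
        else:
            # mode() ignores NaN and returns a Series; take the first entry
            fill_value = train_out[col].mode().iloc[0]
        train_out[col] = train_out[col].fillna(fill_value)
        test_out[col] = test_out[col].fillna(fill_value)
    return train_out, test_out


toy_train = pd.DataFrame({
    "card4": ["visa", "visa", None, "mastercard"],
    "amt": [10.0, np.nan, 30.0, 20.0],
})
toy_test = pd.DataFrame({"card4": [None, "discover"], "amt": [np.nan, 5.0]})

filled_train, filled_test = impute_with_train_statistics(toy_train, toy_test)
print(filled_train)
print(filled_test)
```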


As our model cannot accept categorical features as object data types, a label transformation must be performed.

In [10]:
%%time
for feature in X_train.columns:
    if X_train[feature].dtype == 'object' or test[feature].dtype == 'object':
        le = LabelEncoder()
        le.fit(list(X_train[feature].values) + list(test[feature].values))
        X_train[feature] = le.transform(list(X_train[feature].values))
        test[feature] = le.transform(list(test[feature].values))

Wall time: 1min 45s


#### SMOTE or Oversampling

### Feature Selection
As we are performing many different feature selection methods, it is important to store each set of features for testing.

First up, we will append all of our features before selection into our feature sets dictionary.

In [None]:
# feature_sets = {}
# feature_sets['all'] = X_train.columns.tolist()

#### Random Forest Feature Selection

In [None]:
# # Build classifier
# clf = RandomForestClassifier(n_estimators=250, max_depth=1, random_state=42)
# clf.fit(X_train, y_train)

# # Select features > 0
# importances = pd.DataFrame(X_train.columns, columns=['features'])
# importances['importance'] = clf.feature_importances_
# importances.sort_values(by='importance', axis=0, ascending=False, inplace=True)
# importances = importances[importances['importance'] > 0]

# # Plot best features
# plt.figure(figsize=[20,10])
# plt.title("Feature importances")
# plt.bar(range(len(importances)), importances['importance'])
# plt.xticks(range(len(importances)), importances['features'])
# plt.xlim([-1, len(importances)])
# plt.xticks(rotation=90)
# plt.show()

In [None]:
# feature_sets['random_forest'] = importances['features'].tolist()

#### Chi Squared Feature Selection

In [None]:
# # Scale features to avoid negative values crashing chi2
# scaler = MinMaxScaler()
# X_train_scaled = scaler.fit_transform(X_train)

# # Building the model
# bestfeatures = SelectKBest(score_func=chi2, k=2)
# fit = bestfeatures.fit(X_train_scaled, y_train.values)

# # Select features > 50
# feature_scores = pd.DataFrame(X_train.columns, columns=['features'])
# feature_scores['scores'] = fit.scores_
# feature_scores.sort_values(by='scores', axis=0, ascending=False, inplace=True)
# feature_score = feature_scores[feature_scores['scores']>50]

# # Plot best features
# plt.figure(figsize=[20,10])
# plt.title("Feature importances")
# plt.bar(range(len(feature_scores)), feature_scores['scores'])
# plt.xticks(range(len(feature_scores)), feature_scores['features'])
# plt.xlim([-1, len(feature_scores)])
# plt.xticks(rotation=90)
# plt.show()

In [None]:
# feature_sets['chi2'] = feature_score['features'].tolist()

#### Train/test split

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(
#     X_train, y_train, test_size=0.2, random_state=42)

In [74]:
# def calculate_decision_stump_efficient1(data, weight, label):

x2 = X_train[['TransactionDT','TransactionAmt']].values
weight2 = np.ones((x2.shape[0], x2.shape[1]), dtype=np.float64)/len(x2)
label2 = y_train.values
pos2 = np.where(label2 == 0)
label2[pos2] = -1
# calculate_decision_stump_efficient1(x2, weight2, label2)


Tp=np.float64(0) #T+ total sum of positive examples weights
Tn=np.float64(0) #T- total sum of negative examples weights
Sp=np.float64(0) #S+ sum of positive weights below the current threshold
Sn=np.float64(0) #S- sum of negative weights below the current threshold
error1=np.float64(0)
error2=np.float64(0)
min_error=np.float64(2.0) 
min_thresh=np.float64(0) 
direction=1

#     y = np.zeros(N, dtype=np.int64)

#get all positive weights    
temp = (label2 == 1)
temp = np.reshape(np.int64(temp),(-1,1))
Tp = np.sum((temp * weight2), axis=0)

# get all negative weights  
temp  = (label2 == -1)
temp = np.reshape(np.int64(temp),(-1,1))
Tn = np.sum((temp * weight2), axis=0)


Sn = np.sum(weight2[np.where(label2 == -1)[0]], axis=0)
Sp = np.sum(weight2[np.where(label2 == 1)[0]], axis=0)

#sort feature values
# sorted_labels = data[:, feature].argsort()
# sorted_vector =  data[sorted_labels]
# neg_weights = weight2[np.where(label2 == -1)[0]]
# pos_weights = weight2[np.where(label2 == 1)[0]]

# Sn = np.zeros((neg_weights.shape[0], neg_weights.shape[1]))
# Sp = np.zeros((pos_weights.shape[0], pos_weights.shape[1]))


# error1 = Sp + (Tn - Sn)
# print((Tn - Sn))
# print(Tn)

# i = -1
# for row in neg_weights:
#     Sn[i + 1] = Sn[i] + row
#     i += 1

# i = -1
# for x in pos_weights:
#     Sp[i + 1] = Sp[i] + row
#     i += 1


# error2 = Sn + (Tp - Sp) 

# print(error1)

# if(min_error > error1) :
#     min_error = error1
#     min_thresh = sorted_vector[i, feature]
#     direction = 1
# if(min_error > error2) :
#     min_error = error2
#     min_thresh = sorted_vector[i, feature]
#     direction = -1   
    
    
# length = len(sorted_vector)
# for i in range(length):

#     #RIGHT DIRECTION THRESHOLD
#     #error1 is the sum of positives up to that point + total negatives minus the sum of negatives so far
#     error1 = Sp + (Tn - Sn) 
#     if label[i] == -1 : 
#         Sn = Sn +  weight[i]
#     else :
#         Sp = Sp + weight[i]

#     #LEFT DIRECTION THRESHOLD
#     error2 = Sn + (Tp - Sp) 

#     if(min_error > error1) :
#         min_error = error1
#         min_thresh = sorted_vector[i, feature]
#         direction = 1
#     if(min_error > error2) :
#         min_error = error2
#         min_thresh = sorted_vector[i, feature]
#         direction = -1           

# return min_thresh, direction, min_error
       

array([0.96500999, 0.96500999])

In [17]:
x = X_train.values

In [18]:
sorted_labels = x[:, 0].argsort()
sorted_vector =  x[sorted_labels]

In [None]:
X_train.values.argsort().tolist()

In [28]:
i = np.array([1,2,3])
j = np.array([1,2,3])
i + j

array([2, 4, 6])

## Adaboost

#### Default Variables

In [None]:
def calculate_decision_stump_efficient(data, label, feature, weight, N):
    """Find the threshold on one feature with the lowest weighted error.

    direction = 1 predicts -1 for x < threshold and +1 otherwise;
    direction = -1 predicts the opposite.
    N is unused and kept only so that existing calls keep working.
    """
    Sp = np.float64(0)  # S+ sum of positive weights below the current threshold
    Sn = np.float64(0)  # S- sum of negative weights below the current threshold
    min_error = np.float64(2.0)
    min_thresh = np.float64(0)
    direction = 1

    # T+ and T-: total weight of the positive and negative examples
    Tp = np.sum(weight[label == 1])
    Tn = np.sum(weight[label == -1])

    # Sort the samples by the value of the chosen feature
    sorted_labels = data[:, feature].argsort()
    sorted_vector = data[sorted_labels]

    for i in range(len(sorted_vector)):
        # Both errors use the weights strictly below sample i, which matches
        # the x < thresh rule in classify_dataset_against_weak_classifier.

        # RIGHT DIRECTION THRESHOLD
        # error1 is the sum of positives below + the negatives at or above the threshold
        error1 = Sp + (Tn - Sn)

        # LEFT DIRECTION THRESHOLD
        # error2 is the sum of negatives below + the positives at or above the threshold
        error2 = Sn + (Tp - Sp)

        if min_error > error1:
            min_error = error1
            min_thresh = sorted_vector[i, feature]
            direction = 1
        if min_error > error2:
            min_error = error2
            min_thresh = sorted_vector[i, feature]
            direction = -1

        # Move sample i below the threshold for the next iteration
        if label[sorted_labels[i]] == -1:
            Sn += weight[sorted_labels[i]]
        else:
            Sp += weight[sorted_labels[i]]

    return min_thresh, direction, min_error
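
The same threshold search can be expressed with cumulative sums instead of a Python loop, which is usually much faster on large arrays. The sketch below is a hypothetical single-feature variant (`best_stump_vectorized` is not part of the notebook). It assumes labels in {-1, +1} and the convention that direction 1 predicts -1 for `x < threshold`, and it only considers the first occurrence of each distinct value as a candidate threshold.

```python
import numpy as np


def best_stump_vectorized(x, y, w):
    """Return (threshold, direction, weighted_error) for one feature vector x."""
    order = np.argsort(x, kind="stable")
    x_sorted, y_sorted, w_sorted = x[order], y[order], w[order]

    pos_w = np.where(y_sorted == 1, w_sorted, 0.0)
    neg_w = np.where(y_sorted == -1, w_sorted, 0.0)
    total_pos, total_neg = pos_w.sum(), neg_w.sum()

    # Exclusive cumulative sums: weight of the samples sorted before position i.
    pos_below = np.cumsum(pos_w) - pos_w
    neg_below = np.cumsum(neg_w) - neg_w

    err_dir_pos = pos_below + (total_neg - neg_below)  # direction = 1
    err_dir_neg = neg_below + (total_pos - pos_below)  # direction = -1

    # With tied values, only the first occurrence has a correct "below" sum.
    first = np.ones(len(x_sorted), dtype=bool)
    first[1:] = x_sorted[1:] != x_sorted[:-1]
    err_dir_pos = np.where(first, err_dir_pos, np.inf)
    err_dir_neg = np.where(first, err_dir_neg, np.inf)

    i_pos, i_neg = np.argmin(err_dir_pos), np.argmin(err_dir_neg)
    if err_dir_pos[i_pos] <= err_dir_neg[i_neg]:
        return x_sorted[i_pos], 1, err_dir_pos[i_pos]
    return x_sorted[i_neg], -1, err_dir_neg[i_neg]


# Toy feature that separates the classes perfectly at x >= 3.
x_toy = np.array([4.0, 1.0, 3.0, 2.0])
y_toy = np.array([1, -1, 1, -1])
w_toy = np.full(4, 0.25)
thresh, direction, error = best_stump_vectorized(x_toy, y_toy, w_toy)
print(thresh, direction, error)
```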
       

In [None]:
def classify_dataset_against_weak_classifier(x, thresh, direction):
    if direction == -1:
        return np.where(x < thresh, 1, -1)
    else:
        return np.where(x < thresh, -1, 1)

In [None]:
# X: a two-dimensional np.array of independent feature values.
# Y: a one-dimensional np.array of actual target values, each either -1 or 1.
# T: number of boosting rounds.
# features: a list of feature names.
# learn_rate: the learning rate used to calculate alpha[t].
def fit(X, Y, T, features, learn_rate):
    
    h=np.zeros([T,3], dtype=np.float64)
    alphas = np.zeros(T, dtype=np.float64)
    err = np.ones(T, dtype=np.float64) * np.inf
    weight=np.ones(len(X), dtype=np.float64)/len(X)
    N = len(X)
    
    for t in range(T):
        # Iterate through every feature in our training data.
        for feature in range(len(features)):
            weighted_error = np.float64(0)

            # Build our decision stump.
            threshold, sign, weighted_error = calculate_decision_stump_efficient(X, Y, feature, weight, N)

            # Select the best model and feature that produces the lowest model error
            if weighted_error < err[t]:
                err[t] = weighted_error
                h[t][0] = threshold
                h[t][1] = feature
                h[t][2] = sign

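        # Calculate the alpha value (classifier weight) from the weighted error,
        # clipping the error to avoid division by zero for a perfect stump
        clipped_err = np.clip(err[t], 1e-10, 1.0 - 1e-10)
        alphas[t] = learn_rate * np.log((1.0 - clipped_err) / clipped_err)
        
        # Make a prediction using the decision stump chosen in round t
        classification = classify_dataset_against_weak_classifier(X[:, int(h[t][1])], h[t][0], h[t][2] )

        # Increase the weights of misclassified samples and decrease the others.
        # Use the Y argument rather than the global `labels` array
        weight *= np.exp(-1.0*alphas[t]*classification*Y)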
        
        # Normalize the weights distribution
        weight = weight / np.sum(weight)
        
    return np.append(h,np.reshape(alphas,(-1,1)), axis=1)
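
To make the weight update concrete, here is a small worked example of a single boosting round on five samples. The helper `adaboost_round_update` is a hypothetical name, and it assumes labels in {-1, +1} and the same alpha formula as above, `learn_rate * log((1 - err) / err)`.

```python
import numpy as np


def adaboost_round_update(weights, y_true, y_pred, learn_rate=0.5):
    """Return (alpha, new_weights) for one boosting round with labels in {-1, +1}."""
    err = np.sum(weights[y_true != y_pred])
    err = np.clip(err, 1e-10, 1.0 - 1e-10)  # guard against division by zero
    alpha = learn_rate * np.log((1.0 - err) / err)
    # Correct samples are scaled by exp(-alpha), misclassified ones by exp(+alpha).
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_weights / new_weights.sum()


y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])  # only the last sample is misclassified
weights = np.full(5, 0.2)

alpha, new_weights = adaboost_round_update(weights, y_true, y_pred)
print("alpha:", alpha)
print("new weights:", new_weights)
```

With a weighted error of 0.2 and a learning rate of 0.5, alpha equals ln(2); after normalization the single misclassified sample carries half of the total weight, so the next stump is pushed to classify it correctly.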

In [None]:
def predict(data_arr, boost_classif):
    
    classification_arr = np.zeros((len(data_arr),1))
    
    for idx, thresh, feat, sign, alpha in boost_classif.itertuples():
        # Same rule as classify_dataset_against_weak_classifier (ties never yield 0)
        ht1 = np.where(data_arr[:, int(feat)] < thresh, -sign, sign)
        classification_arr += alpha*np.reshape(ht1,(-1,1))

    classification_arr = np.where(classification_arr >= 0,1,-1)
    
    return classification_arr

In [None]:
def sum_classifier_votes_for_each_sample(dataset, classifier_df):
    classification = np.float64(0)
    neg = np.float64(0)
    pos = np.float64(0)
    for idx, thresh, feat, sign, alpha in classifier_df.itertuples():

        # Same rule as classify_dataset_against_weak_classifier (ties never yield 0)
        ht1 = np.where(dataset[:, int(feat)] < thresh, -sign, sign)
        classification += alpha * ht1

        neg += np.where(ht1 < 0, ht1, 0) * alpha
        pos += np.where(ht1 >= 0, ht1, 0) * alpha

    return classification, pos, neg

In [None]:
def margin_calculation_for_training_samples(sign, pos, neg, tot_votes):
    if np.sign(sign) < 0:
        return np.abs(neg) / tot_votes, -1
    else:
        return pos / tot_votes, 1

In [None]:
def sign_of_margin(margin, classification, true_class_label):
    if classification != true_class_label:
        return -1 * margin
    else:
        return margin

In [None]:
def accuracy(statistic_df, data):
    # Fraction of samples whose predicted class matches the true class
    return np.sum(np.where(statistic_df['classification'] == statistic_df['true_class_label'],1,0)) / len(data)

In [None]:
def model_measures(y_test, y_pred):
    accuracy = metrics.accuracy_score(y_test, y_pred)
    print('Model accuracy: %f ' % accuracy)

    # precision tp / (tp + fp)
    precision = metrics.precision_score(y_test, y_pred)
    print('Precision: %f' % precision)

    # recall: tp / (tp + fn)
    recall = metrics.recall_score(y_test, y_pred)
    print('Recall: %f' % recall)
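
On a heavily imbalanced target, accuracy alone can look good even when no fraud is detected, which is why precision and recall are reported as well. The sketch below computes both by hand for labels in {-1, +1}; the helper name `precision_recall` and the toy labels are hypothetical.

```python
import numpy as np


def precision_recall(y_true, y_pred, positive=1):
    """Precision = tp / (tp + fp), recall = tp / (tp + fn) for the positive class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)


y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])
y_pred = np.array([1, -1, -1, 1, -1, -1, -1, -1])
precision, recall = precision_recall(y_true, y_pred)
print("precision:", precision, "recall:", recall)

# A classifier that never predicts fraud still reaches 62.5% accuracy here.
y_all_negative = -np.ones_like(y_true)
print("accuracy:", np.mean(y_all_negative == y_true))
print("precision, recall:", precision_recall(y_true, y_all_negative))
```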

In [None]:
%%time
# Selecting the features
features = X_train.columns.tolist()
# features = feature_sets['all']

# Creating our np.array of features
X_train_arr = X_train.values
test_arr = test.values
# del X_train, X_test

# Creating our np.array of target values, mapped to {-1, 1}
# (np.where builds a new array, so y_train itself is not modified)
labels = np.where(y_train.values == 1, 1, -1)
# del y_train

# Number of boosting rounds
T=30

# Alpha learning rate
learning_rate = 0.5

models_all = fit(X_train_arr, labels, T, features, learning_rate)

classifier_df_all = pd.DataFrame(models_all, columns=['threshold','feature','direction','alpha'])

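y_pred_all = predict(test_arr, classifier_df_all)

# The Kaggle test set has no labels and no hold-out split was made above,
# so the metrics below are computed on the training data only.
y_pred_train = predict(X_train_arr, classifier_df_all)
model_measures(labels, y_pred_train.ravel())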

In [None]:
# %%time
# # Selecting the features
# # features = anova_result['features']
# features = feature_sets['random_forest']

# # Creating our np.array of features
# X_train_arr_rf = X_train[features].values
# test_arr = test[features].values
# # del X_train, X_test

# # Creating our np.array of target values
# labels = y_train.values
# pos2 = np.where(labels == 0)
# labels[pos2] = -1
# # del y_train

# # Number of boosting rounds
# T=50

# # Alpha learning rate
# learning_rate = 0.5

# models_rf = fit(X_train_arr_rf, labels, T, features, learning_rate)

# classifier_df_rf = pd.DataFrame(models_rf, columns=['threshold','feature','direction','alpha'])

# y_pred_rf = predict(test_arr, classifier_df_rf)

# model_measures(y_test, y_pred_rf)

In [None]:
# %%time
# # Selecting the features
# # features = anova_result['features']
# features = feature_sets['chi2']

# # Creating our np.array of features
# X_train_arr_chi = X_train[features].values
# test_arr = test[features].values
# # del X_train, X_test

# # Creating our np.array of target values
# labels = y_train.values
# pos2 = np.where(labels == 0)
# labels[pos2] = -1
# # del y_train

# # Number of boosting rounds
# T=50

# # Alpha learning rate
# learning_rate = 0.5

# models_chi = fit(X_train_arr_chi, labels, T, features, learning_rate)

# classifier_df_chi = pd.DataFrame(models_chi, columns=['threshold','feature','direction','alpha'])

# y_pred_chi = predict(test_arr, classifier_df_chi)

# model_measures(y_test, y_pred_chi)

In [None]:
finished_df_all = pd.DataFrame(test['TransactionID'])
# finished_df_rf = pd.DataFrame(test['TransactionID'])
# finished_df_chi = pd.DataFrame(test['TransactionID'])

# predict() returns a column vector, so flatten it before assigning it to a column
finished_df_all['isFraud'] = np.where(y_pred_all.ravel() == -1,0,1)
# finished_df_rf['isFraud'] = np.where(y_pred_rf == -1,0,1)
# finished_df_chi['isFraud'] = np.where(y_pred_chi == -1,0,1)


finished_df_all.to_csv('predictions_all.csv', index=False)
# finished_df_rf.to_csv('predictions_rf.csv', index=False)
# finished_df_chi.to_csv('predictions_chi.csv', index=False)

In [None]:
# # Selecting the features
# # features = anova_result['features']
# features = X_train.columns[:2]

# # Creating our np.array of features
# X_train_arr = X_train[features].values
# test_arr = test[features].values
# # del X_train, X_test

# # Creating our np.array of target values
# labels = y_train.values
# pos1 = np.nonzero(labels == 1)
# pos2 = np.where(labels == 0)
# labels[pos2] = -1
# # del y_train

# # Number of boosting rounds
# T=50

# # Alpha learning rate
# learning_rate = 0.5

In [None]:
# len(X_train.columns)

In [None]:
# %%time
# models = fit(X_train_arr, labels, T, features, learning_rate)

In [None]:
# classifier_df = pd.DataFrame(models, columns=['threshold','feature','direction','alpha'])

In [None]:
# y_pred = predict(X_test, classifier_df)

In [None]:
# # Here we test how accurate our model is by using the test data: X_test_arr
# statistics_df = pd.DataFrame(X_train_arr)
# sum_alpha, pos_votes, neg_votes = sum_classifier_votes_for_each_sample(X_train_arr, classifier_df)
# statistics_df['sum_alpha'] = sum_alpha
# statistics_df['pos_votes'] = pos_votes
# statistics_df['neg_votes'] = neg_votes


# statistics_df['total_alpha_votes'] = np.sum(classifier_df.alpha)
# result = statistics_df[['sum_alpha','pos_votes','neg_votes','total_alpha_votes']].apply(lambda x: margin_calculation_for_training_samples(*x), axis=1)
# statistics_df['margin'] = result.apply(lambda x: x[0])
# statistics_df['classification'] = result.apply(lambda x: x[1])
# statistics_df['true_class_label'] = labels

# print(f'Model accuracy: {metrics.accuracy_score(y_train, statistics_df["classification"])}')

## Kaggle submissions

In [None]:
# y_pred = predict(test_arr, classifier_df)

# # result_path = 'results/'
# predictions_file = 'predictions.csv'

# finished_df = pd.DataFrame(test['TransactionID'])
# finished_df['isFraud'] = np.where(y_pred == -1,0,1)


# finished_df.to_csv(predictions_file, index=False)

## Calculating the margins and classifying the data

Now that we have our ensemble of weak classifiers and the alpha for each boosting round, we can measure how confident we are in the model's generalizability by looking at the margin of each training sample.
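
As a rough illustration of what the margin measures, the sketch below computes the normalized margin `y_i * sum_t(alpha_t * h_t(x_i)) / sum_t(|alpha_t|)` for a few toy samples. The function name `normalized_margins` and the toy arrays are hypothetical and are not the notebook's own `margin_calculation_for_training_samples` helper; labels and stump outputs are assumed to be in {-1, +1}.

```python
import numpy as np

def normalized_margins(stump_predictions, alphas, labels):
    """Return the normalized margin of each sample.

    stump_predictions: (T, n) array of weak-classifier outputs in {-1, +1}
    alphas:            (T,) array of classifier weights
    labels:            (n,) array of true labels in {-1, +1}
    """
    stump_predictions = np.asarray(stump_predictions, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    # Alpha-weighted vote of all weak classifiers for each sample
    weighted_vote = alphas @ stump_predictions
    # Positive margin: correctly classified; magnitude reflects confidence
    return labels * weighted_vote / np.sum(np.abs(alphas))

# Toy example (illustrative values only): 3 weak classifiers, 3 samples
stump_predictions = np.array([[1, -1, 1],
                              [1, 1, -1],
                              [-1, -1, 1]])
alphas = np.array([0.5, 0.3, 0.2])
labels = np.array([1, -1, 1])
margins = normalized_margins(stump_predictions, alphas, labels)
```

A margin close to 1 means nearly all of the alpha weight voted for the correct class, while a negative margin means the sample is misclassified by the ensemble.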

In [None]:
# def plot_margin(data, labels, T, features, learning_rate,):
    
#     sign_name = 'sign_of_margin_' + str(T) + 'T'

#     models = fit(data,labels,T,features,learning_rate)
    
#     classifier_df = pd.DataFrame(models, columns=['threshold','feature','direction','alpha'])


#     statistics_df = pd.DataFrame(data)
#     sum_alpha, pos_votes, neg_votes = sum_classifier_votes_for_each_sample(data, classifier_df)
#     statistics_df['sum_alpha'] = sum_alpha
#     statistics_df['pos_votes'] = pos_votes
#     statistics_df['neg_votes'] = neg_votes


#     statistics_df['total_alpha_votes'] = np.sum(classifier_df.alpha)
#     result = statistics_df[['sum_alpha','pos_votes','neg_votes','total_alpha_votes']].apply(lambda x: margin_calculation_for_training_samples(*x), axis=1)
#     statistics_df['margin'] = result.apply(lambda x: x[0])
#     statistics_df['classification'] = result.apply(lambda x: x[1])
#     statistics_df['true_class_label'] = labels


#     statistics_df[sign_name] = statistics_df[['margin', 'classification', 'true_class_label']].apply(lambda x: sign_of_margin(*x), axis=1)
    
#     sns.kdeplot(statistics_df[sign_name])

In [None]:
# # 30 Boosting rounds
# plot_margin(X_train_arr, labels, 30, features, learning_rate)

In [None]:
# 50 Boosting rounds
# plot_margin(X_train_arr, labels, 50, features, learning_rate)

In [None]:
# 100 Boosting rounds
# plot_margin(X_train_arr, labels, 100, features, learning_rate)

In [None]:
# 250 Boosting rounds
# plot_margin(X_train_arr, labels, 250, features, learning_rate)

In [None]:
# features = ['TransactionAmt', 'C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13','C14']


# T=30
# # h=np.zeros([T,3])

# alpha = np.zeros(T, dtype=np.float64)

# # err = np.ones(T, dtype=np.float64) * np.inf

# weight=np.ones(len(X_train[features]), dtype=np.float64)/len(X_train[features])


In [None]:
# # First implementation

# from sklearn.tree import DecisionTreeClassifier

# for t in range(T):
#     Tree_model = DecisionTreeClassifier(criterion="gini",max_depth=1)
#     model = Tree_model.fit(X_train[features],y_train,sample_weight=weight)
    
#     predictions = model.predict(X_train[features])
    
#     score = model.score(X_train[features],y_train)
#     model_error = 1 - score
    
#     misclassified = np.where(predictions != y_train, 1, 0)
    
# #     evaluation = np.where(predictions == y_train, 1, 0)
# #     accuracy = sum(evaluation/len(evaluation))

# #     misclassified = np.where(predictions != y_train, 1, 0)
# #     misclassifcation = sum(misclassified/len(misclassified))

# #     model_error = sum(weights*misclassified)/sum(weights)
    
    
#     alpha[t] = np.log((1-model_error)/model_error)
    
    
#     # Increase the weights of the misclassified samples, then renormalize
#     weight *= np.exp(alpha[t]*misclassified)
#     weight = weight / np.sum(weight)
    
    
#     print(f'The accuracy of model {t+1} is: {round(score*100, 3)}%')
#     print(f'The misclassification rate is: {round(model_error*100, 3)}%')
#     print('')

In [None]:
# from sklearn.tree import DecisionTreeClassifier

# features = anova_result['features'].tolist()


# T = 50
# learning_rate = 0.5

# weights = np.ones(len(X_train[features]),
#                   dtype=np.float64)/len(X_train[features])


# alphas = []
# models = []

# for t in range(T):
#     Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth=1)

#     model = Tree_model.fit(X_train[features], y_train, sample_weight=weights)

#     # Save your model
#     models.append(model)

#     predictions = model.predict(X_train[features])
#     accuracy = model.score(X_train[features], y_train)

#     # Separate correctly classified from incorrectly classified samples
# #     classified = np.array(predictions==y_train).astype(int)   # Not sure what this is used for
#     misclassified = np.where(predictions != y_train,1,0)

#     # Misclassification rate
#     misclassification_rate = sum(misclassified)/len(misclassified)

#     # Model error
#     model_error = np.sum(weights*misclassified)/np.sum(weights)

#     # Calculate the alpha values
#     alpha = np.log((1-model_error)/model_error)
#     # Save our alpha
#     alphas.append(alpha)

#     # Update the training weights for next decision stump
#     weights *= np.exp(alpha*misclassified)

#     # Normalize the weights (optional here: model_error already divides by np.sum(weights))
# #     weights = weights / np.sum(weights)

# #     Evaluation = pd.DataFrame(y_train.copy())
# #     Evaluation['weights'] = weights
# #     Evaluation['predictions'] = predictions
# #     Evaluation['classified'] = classified
# #     Evaluation['misclassified'] = misclassified

# #     print(f'The Accuracy of the {t+1}th model is : {(accuracy*100).round(3)}%')
# #     print(
# #         f'The misclassification rate is: {(misclassification_rate * 100).round(3)}%')
# #     print('')

In [None]:
# classification_sum = np.float64(0)

# accuracy = []
# total_predictions = []

# for alpha, model in zip(alphas, models):
#     model_predictions = model.predict(test[features])
# #     print(model.score(X_test[features], y_test))
#     total_predictions.append(model_predictions*alpha)
    
# total_predictions = np.sign(np.sum(np.array(total_predictions),axis=0))
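
One thing to watch in the voting step above: taking `np.sign` of the alpha-weighted sum only separates the classes when each stump predicts in {-1, +1}. If the stumps are trained on 0/1 labels, the sum is never negative when the alphas are positive. The following is a minimal sketch of the whole loop under that assumption, run on synthetic data rather than the fraud dataset. The names `X_demo` and `y_demo` are hypothetical, and scaling alpha by `learning_rate` is an assumption, since the loop above defines `learning_rate` without using it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data (not the fraud dataset); labels are in {-1, +1}
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1, -1)

T = 20                # number of boosting rounds
learning_rate = 0.5   # assumed to scale each alpha
weights = np.full(len(X_demo), 1.0 / len(X_demo))
alphas, models = [], []

for t in range(T):
    # Fit a decision stump on the currently weighted samples
    stump = DecisionTreeClassifier(criterion="entropy", max_depth=1)
    stump.fit(X_demo, y_demo, sample_weight=weights)

    misclassified = (stump.predict(X_demo) != y_demo).astype(np.float64)

    # Weighted error, clipped to avoid division by zero in the log
    model_error = np.sum(weights * misclassified) / np.sum(weights)
    model_error = np.clip(model_error, 1e-10, 1 - 1e-10)

    alpha = learning_rate * np.log((1 - model_error) / model_error)

    # Up-weight the misclassified samples, then renormalize
    weights *= np.exp(alpha * misclassified)
    weights /= np.sum(weights)

    alphas.append(alpha)
    models.append(stump)

# Alpha-weighted vote over all stumps; ties are assigned to +1
weighted_votes = sum(a * m.predict(X_demo) for a, m in zip(alphas, models))
ensemble_predictions = np.where(weighted_votes >= 0, 1, -1)
train_accuracy = np.mean(ensemble_predictions == y_demo)
```

For a Kaggle submission the {-1, +1} predictions would still need to be mapped back to 0/1, as done with `np.where(y_pred == -1,0,1)` earlier in the notebook.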

In [None]:
# number_of_base_learners = 50
# fig = plt.figure(figsize=(10,10))
# ax0 = fig.add_subplot(111)
# # for i in range(number_of_base_learners):
# #     model = Boosting(dataset,i,dataset)
# #     model.fit()
# #     model.predict()
# ax0.plot(range(len(accuracy)),accuracy,'-b')
# ax0.set_xlabel('# models used for Boosting ')
# ax0.set_ylabel('accuracy')
# print('With a number of ',number_of_base_learners,'base models we receive an accuracy of ',accuracy[-1]*100,'%')    
                 
# plt.show()       
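
The plotting cell above expects an `accuracy` list with one entry per number of base learners, but nothing in the preceding cells fills it. One possible way to build it, sketched here with NumPy only, is to accumulate the alpha-weighted votes round by round. The function name `staged_accuracy` and the toy arrays are hypothetical, and predictions and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def staged_accuracy(stump_predictions, alphas, labels):
    """Accuracy of the ensemble after 1, 2, ..., T weak classifiers.

    stump_predictions: (T, n) array of weak-classifier outputs in {-1, +1}
    alphas:            (T,) array of classifier weights
    labels:            (n,) array of true labels in {-1, +1}
    """
    stump_predictions = np.asarray(stump_predictions, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    labels = np.asarray(labels)
    # Row t holds the weighted vote using only the first t+1 classifiers
    cumulative_votes = np.cumsum(alphas[:, None] * stump_predictions, axis=0)
    staged_predictions = np.where(cumulative_votes >= 0, 1, -1)
    return (staged_predictions == labels).mean(axis=1)

# Toy example (illustrative values only): 3 weak classifiers, 4 samples
stump_predictions = np.array([[1, -1, -1, 1],
                              [1, 1, -1, -1],
                              [1, 1, 1, 1]])
alphas = np.array([0.4, 0.5, 0.3])
labels = np.array([1, 1, -1, 1])
accuracy = staged_accuracy(stump_predictions, alphas, labels)
```

The resulting array can be passed directly to `ax0.plot(range(len(accuracy)), accuracy, '-b')`.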