# Classification

This project asks you to perform various experiments with classification. The dataset we are using is a toy dataset for credit card fraud detection:

https://www.kaggle.com/datasets/shubhamjoshi2130of/abstract-data-set-for-credit-card-fraud-detection

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk

## Setup for the project

Here we load the dataset, and create the training and test datasets as numpy arrays.

In [2]:
df = pd.read_csv("creditcard.csv",  true_values="Y", false_values="N")
print(f"Number of rows {len(df.index)}")
print(f"The columns of the database {df.columns}")
df.value_counts("isFradulent")

Number of rows 3075
The columns of the database Index(['Merchant_id', 'Transaction date', 'Average Amount/transaction/day',
       'Transaction_amount', 'Is declined', 'Total Number of declines/day',
       'isForeignTransaction', 'isHighRiskCountry', 'Daily_chargeback_avg_amt',
       '6_month_avg_chbk_amt', '6-month_chbk_freq', 'isFradulent'],
      dtype='object')


isFradulent
False    2627
True      448
dtype: int64

In [3]:
xfields = [
    'Average Amount/transaction/day',
       'Transaction_amount', 'Is declined', 'Total Number of declines/day',
       'isForeignTransaction', 'isHighRiskCountry', 'Daily_chargeback_avg_amt',
       '6_month_avg_chbk_amt', '6-month_chbk_freq']

df_shuffled = df.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled["isFradulent"].to_numpy(dtype=np.float64)
# the training data is the first 2000 rows, after shuffled
training_data_x = x[:2000]
training_data_y = y[:2000]
# the test data is the remaining
test_data_x = x[2000:]
test_data_y = y[2000:]

In [4]:
print("Run this to help you with what number goes with what field:")
for i, x in enumerate(xfields):
    print(f"{i} = {x}")

Run this to help you with what number goes with what field:
0 = Average Amount/transaction/day
1 = Transaction_amount
2 = Is declined
3 = Total Number of declines/day
4 = isForeignTransaction
5 = isHighRiskCountry
6 = Daily_chargeback_avg_amt
7 = 6_month_avg_chbk_amt
8 = 6-month_chbk_freq


In [5]:
print(f'train_x.shape\n{training_data_x.shape}')
print(f'train_y.shape\n{training_data_y.shape}')
print(f'test_x.shape\n{test_data_x.shape}')
print(f'test_y.shape\n{test_data_y.shape}')

train_x.shape
(2000, 9)
train_y.shape
(2000,)
test_x.shape
(1075, 9)
test_y.shape
(1075,)


## P1: Create an accuracy metric (7 pts)
Create a simple accuracy metric function which for a pair of ground truth values $y$ and estimates $\hat{y}$ (both of them arrays) calculates the accuracy of the estimate $\hat{y}$. For instance, if you pass y = [1, 0, 1] and 
yhat = [1, 1, 0], the loss function should return 0.3

In [6]:
# Returns (number of correct predictions / total number of predictions).
# Each prediction value should match with the accurate data value.
# y is the accurate data yhat is the prediction.
def accuracy(y, yhat):
    
    accuracy = 0.0
    num_correct_pred = 0
    total_num_pred = len(y)
    trials = len(y)
    
    for i in range(trials):
        if yhat[i] == y[i]:
            num_correct_pred += 1

    accuracy = (num_correct_pred / total_num_pred)
    
    return accuracy

In [7]:
# test your function here
acc = accuracy([1, 0, 1], [1, 1, 0])
print(f"Accuracy is {acc}") # should print 0.33...
acc = accuracy([1, 0, 1, 0, 1, 1], [0, 0, 0, 0, 1, 1])
print(f"Accuracy is {acc}") # should print 0.66...

Accuracy is 0.3333333333333333
Accuracy is 0.6666666666666666


## P2: Implement a majority classifier (7 pts)
This classifier will always return the most likely value. Training the classifier means determining what is the most likely value (regardless vhat value you pass to it). For instance, if more than half of the transactions are fraudulent, then you just return fraudulent always. 

In [8]:
# X is a 2D array of new transaction data.
# Averages is a 2D array of the averages of columns for fraud and non-fraud data.
# row 0 contains averages of columns for fraud.
# row 1 contains averages of columns for non-fraud.
# Returns a string as either "Fraud" or "Non-fraud". This can be changed to bool values.
def classify_majority(x, averages):
    
    # When I state "row" here, I may be referring to transaction row, or x[row].
    # Every x[row] is a transaction's details, which we use for classification.
    
    # Get the number of rows and columns for the input parameters.
    x_rows = x.shape[0]
    x_cols = x.shape[1]
        
    # Contains the return classification array for every transaction.
    # 0 represents no fraud, 1 represents fraud.
    predictions = np.zeros(x_rows)

    # Contains the intermediate 0,1 array to calculate majority.
    classifications = np.zeros((x_rows, x_cols))

    for i in range(x_rows):
        for j in range(x_cols):
            
            # Find difference for every value in x row compared to average of F / NF.
            nonfraud_diff = abs(x[i][j] - averages[0][j]) # Nonfraud
            fraud_diff = abs(x[i][j] - averages[1][j]) # Fraud
            
            # X[i][j] value is closer to fraud average than nonfraud.
            if fraud_diff <= nonfraud_diff:
                classifications[i][j] = 1 # Classify as fraud
            else:
                classifications[i][j] = 0 # Classify as nonfraud
        
        # Classification array contains only 0's, 1's. Same shape as training_x data.
        # Find the majority values (whether F / NF) for every transaction (row).
        # If 30%+ of transaction's features are more similar to fraud, label transaction F.
        if ((np.sum(classifications[i]) / x_cols) >= 0.30):
            predictions[i] = 1 # Fraud
        else:
            predictions[i] = 0 # Nonfraud

    return predictions

# Train_x contains a 2D array that contains 8 features and a large number of rows of data.
# Train_y contains a 1D array that classifies the transaction as fraudulent or not.
# Given training data, return a 2D called averages that is described as follows:
# 1st row contains the averages of every column for the fraud data.
# 2nd row contains the averages of every column for the non-fraud data.
# Using the averages of the data determining whether they are fraud/nonfraud, we can
# determine the likelihood that the data is either fraud/nonfraud and classify it.
def train_majority(training_x, training_y):
    
    # Extract row and column data from training data.
    rows = training_x.shape[0] # rows
    cols = training_x.shape[1] # cols
    
    # Make a 2d array.
    # 1st row (0) stores averages of non-fraud, 2nd row (1) stores fraud averages.
    averages = np.zeros(shape=(2, cols))

    # Extract all fraud and not fraud value rows to mask.
    fraud_mask = np.where(training_y == 1.0)
    non_fraud_mask = np.where(training_y == 0.0)
    
    # Get fraud and non fraud rows.
    fraud = training_x[fraud_mask]
    nonfraud = training_x[non_fraud_mask]

    # print("average of first row (doesn't make sense)", np.mean(fraud[0]))
    # print("average of first column", np.mean(fraud[:,0]))
    
    # print("averages", averages.shape)
    # print("fraud", fraud.shape)
    # print("nonfraud", nonfraud.shape)
    
    # This gives an array of the first column for fraud.
    # print("fraud[:,0]", fraud[:,0])
    # This gives an array of the first row for fraud.
    # print("fraud[0]", fraud[0])
    
    # Get averages for each column in fraud.
    for i in range(averages.shape[0]): # Loop through rows
        for j in range(averages.shape[1]): # Loop through columns
                
            # Extract the mean from every column for nonfraud average.
            if i == 0:
                averages[i][j] = np.mean(nonfraud[:,j])
                
            # Extract the mean from every column for fraud average.
            if i == 1:
                averages[i][j] = np.mean(fraud[:,j])
                
    # print("averages of nonfraud", averages[0])
    # print("averages of fraud", averages[1])
    
    return averages

In [9]:
import time

# TODO: use the train_majority function to find the theta value for the training dataset

# Returns a 2D array of the averages for the train/test data fraud and non-fraud.
# train_averages[0] = nonfraud averages, train_averages[1] = fraud averages
train_start = time.time()
train_averages = train_majority(training_data_x, training_data_y)
train_end = time.time()
test_averages = train_majority(test_data_x, test_data_y)

# TODO: now use the theta value to create the test_data_yhat array which contains the classification for each test value 
label_start = time.time()
train_pred_using_train_avg = classify_majority(training_data_x, train_averages)
label_end = time.time()
test_pred_using_train_avg = classify_majority(test_data_x, train_averages)
train_pred_using_test_avg = classify_majority(training_data_x, test_averages)
test_pred_using_test_avg = classify_majority(test_data_x, test_averages)

# Zipping up the total counts of unique fraud / nonfraud
# Predictions to visualize the unique value and counts.
train_unique, train_counts = np.unique(train_pred, return_counts=True)
test_unique, test_counts = np.unique(test_pred, return_counts=True)
realTrain_unique, realTrain_counts = np.unique(training_data_y, return_counts=True)
realTest_unique, realTest_counts = np.unique(test_data_y, return_counts=True)

print("==================================")
print("0 = No fraud, 1 = Fraud")
print("==================================")
print("Training time: ", round(train_end - train_start, 4), "seconds")
print("Labeling time: ", round(label_end - label_start, 4), "seconds")
print("==================================")
print("averages of nonfraud averages using training\n", train_averages[0])
print("averages of fraud averages using training data\n", train_averages[1])
print("==================================")
print("Train predictions:\n", train_pred_using_train_avg[0:20])
print("Train real:\n", training_data_y[0:20])
print("Test predictions:\n", test_pred_using_train_avg[0:20])
print("Test real:\n", test_data_y[0:20])
print("==================================")
print("Train pred unique counts:\n", dict(zip(train_unique, train_counts)))
print("Train real unique counts:\n", dict(zip(realTrain_unique, realTrain_counts)))
print("Test pred unique counts:\n", dict(zip(test_unique, test_counts)))
print("Test real unique counts:\n", dict(zip(realTest_unique, realTest_counts)))
print("==================================")

# TODO: now calculate the accuracy of the classifier using the function implemented in P1, and print it out
print("Train Fraud averages:\n", train_averages[0])
print("Train Non-fraud averages:\n", train_averages[1])
print("Test Fraud averages:\n", test_averages[0])
print("Test Non-fraud averages:\n", test_averages[1])
print("==================================")
print(f'TRAIN ACCURACY using training averages:\t{accuracy(training_data_y, train_pred_using_train_avg)}')
print(f'TEST ACCURACY using training averages:\t{accuracy(test_data_y, test_pred_using_train_avg)}')
print(f'TRAIN ACCURACY using testing averages:\t{accuracy(training_data_y, train_pred_using_test_avg)}')
print(f'TEST ACCURACY using testing averages:\t{accuracy(test_data_y, test_pred_using_test_avg)}')
print("==================================")

NameError: name 'train_pred' is not defined

In [None]:
# Returns a list of the data values.
averages = train_majority(training_data_x, training_data_y)

print("averages of fraud\n", averages[0])
print("averages of nonfraud", averages[1])

# Data analysis and exploration for training data only

## Comparison of fraud / non-fraud feature averages
- Average Amount/transaction/day
    + Fraud: 536
    + Non-fraud: 509
- Transaction_amount
    + Fraud: 23,190
    + Non-fraud: 7,704
- Is declined
    + Fraud: 0.1127 (11.27%)
    + Non-fraud: 0.0023 (0.23%)
- Total Number of declines/day
    + Fraud: 3.8945
    + Non-fraud: 0.4556
- isForeignTransaction
    + Fraud: 0.6981 (69.81%)
    + Non-fraud: 0.1420 (14.20%)
- isHighRiskCountry
    + Fraud: 0.4581 (45.81%)
    + Non-fraud: 0.0005 (0.05%)
- Daily_chargeback_avg_amt
    + Fraud: 268
    + Non-fraud: 22
- 6_month_avg_chbk_amt
    + Fraud: 196
    + Non-fraud: 15
- 6-month_chbk_freq
    + Fraud: 2.2218
    + Non-fraud: 0.1008
    
## Conclusion from comparing fraud and non-fraud averages
Fraud averages were significantly higher for every category in comparison to non-fraud averages. We can use this info to create an algorithmic flagging system for fraud detection.

Assume we have a new transaction and need to determine whether the data is more similar to either fraud or non-fraud. We can do this by determining if new data has higher averages than fraud data.

To determine similarity it would be suitable to compare the value of the new transaction columns to the averages of fraud and non-fraud and then determine which value the new transaction column data is closer to.

Example psuedo-code algorithm:

```
def categorize(new_transaction, averages):

# each index contains either a 0 or 1. 0 indicates that new transaction cell was less than
# the average of fraud. 1 indicates that the new transaction cell was higher than the average.
# We will take the sum of all categories and divide by the number of cateogires to determine
# The percentage likelihood that the new transaction is either fraud or not fraud.
fraud_points = np.zeros(len(new_transaction))

for i in range(len(new_transaction.shape[0])): # Rows
    for j in range(len(new_transaction.shape[1]): # Columns
       if new_transaction[i] >= averages[i][j] # Fraud
           category_points[i] += 1
       
# This is the classification percentage for the new transaction.
category_percentage = np.sum(fraud_points) / len(fraud_points)

# if category_percentage is greater than 50%, we are notified of fraud.
if category_percentage > 0.50:
    category = 'Fraud'
else:
    category = 'Non-fraud'

return category
```

## Pseudocode explanation
Fraud points is an array of zeros of shape (new transaction rows, new transaction columns). For every row, we are comparing the data in every column to the fraud columns (fraud averages). If the new transaction row contains data that is higher than the fraud average, we add a point to symbolize that is a flag that could later be used to categorize if a new transaction is fraudulent or not.

- Example:
```
category_points = [0, 1, 0, 1, 0, 0] # size: 6
category_percentage = 2 fraud points / size 6
category_percentage = 0.33 # 33%
```
We can therefore determine that this transaction is not likely to be fraud.

## Improvements
- We are assuming that all categories in classifying fraud have equal weight, however the daily transaction amount for fraud is very similar to non-fraud.
- It may be useful to incorporate mode and median values of the fraud and non-fraud training data into the algorithm to get more metrics to compare the new transaction data to.
- I have created this with the assumption that new_transaction is a 1D array, not sure if the algorithm would change if it was a 2D array but probably.
- The only metric we are using to adding a new point to the category is whether the new transaction data averages are greater than the fraud averages, which is very reductionist. There may be legitimate transactions that have greater values for fraud which may be flagged.
- Category percentage value is hard coded to 50%. There may be a more optimal value to determine whether to be notified of fraud. It may be lower or higher than 50%.
- If fraud and non-fraud averages are very similar, this algorithm may be useless and produce too many false positives. This would mean that fraud and non-fraud transactions share too many similarities and one can not determine whether there is a difference in the averages of fraud data and non-fraud data.

## Major problem I ran into
When I made the initial algorithm up above I assumed that the function for classification would receive training data x and training data y, however we can not pass it y because that would be cheating. Therefore, I had to think up a new solution that is able to classify data into fraud or non-fraud using the averages array parameter passed in. Here's what I came up with:

```
Given:
x = [[n, n, n, ... n, n, n], [n, n, n, ... n, n, n], ... [n, n, n, ... n, n, n]]
averages = [Fraud = [n, n, n, ... n, n, n], Nonfraud = [n, n, n, ... n, n, n]]

See if x[row] is more similar to average row Fraud or average row Nonfraud.
Similarity is based on the absolute value of x[row] - (comparing array list).

# Initially all transactions are considered fraud until proven innocent.
# 0 represents non fraud, 1 represents fraud.
# 1D array the size of training x data columns to classify the transaction as F or NF.
is_row_fraud = np.ones(train_x_rows)

# Get the sum of both averages to use for comparison against the training x rows.
fraud_sum = np.sum(averages[0])
nonfraud_sum = np.sum(averages[1])

# Go through every row and every column for training data x.
for i in range(col):
    for j in range(row):
        # Get the sum for the row in the training data.
        x_row_sum = np.sum(x[j])
        
        # Get the absolute value difference in sums from the x row and fraud / nonfraud.
        fraud_diff = abs(x_row_sum - fraud_sum)
        nonfraud_diff = abs(x_row_sum - nonfraud_sum)
        
        # Compare the differences to see whether x is closer to fraud / nonfraud.
        # Categorize it by placing either a 1 for fraud, or a 0 for nonfraud in the
        # is_row_fraud tracker.
        if fraud_diff <= nonfraud_diff:
            is_row_fraud[j] = 0
```

- When doing a for loop to compare np.mean(x[:,i]) > averages[0][i], we are accidentally comparing the average of both fraud and non-fraud at the same time. I would have to do np.where(x[:,np.where(y == 1)]) > averages[0][i]

## After fixing major improvement log
Currently I am doing this:

```
for i in range(x_cols):
x_train_sum = np.sum(x[row])
fraud_sum = np.sum(average[fraud_row])
nonfraud_sum = np.sum(average[nonfraud_row])

if abs(x_train_sum - fraud_sum) <= abs(x_train_sum - nonfraud_sum)
    classification_predictions[row] = labelled_fraud
```

However there is a problem with this. We are assuming that the sum of the training data works as a whole to compare for every feature, but some features are only yes and no. Therefore, it may be more suitable to take the differences of each column and row, and then loop through the array of differences for fraud - x[row] and nonfraud - x[row] and a majority (50%+) of the differences are more similar to either fraud or nonfraud.

Something like:
```
# Contains the return classification array for every transaction.
# 0 represents no fraud, 1 represents fraud.
predictions = np.zeros(x_rows)

# Contains the intermediate 0,1 array to calculate majority.
classifications = np.zeros(x_rows, x_cols)


# Get the difference for ever F / NF value, compare every value in each index to check
# Which classification the x index is closer to. If closer to fraud averages, classify
# The index in that x index as a fraud similarity, if closer to NF, then NF.
# We will have a classification 2D index identical in shape to x as so:
# [[1, 0, 1, 1... 1, 0], ... [0, 1, 0, 1... 0, 1]].
# Where 1 = fraud similarity, 0 = nonfraud similarity.
# Now we loop through this array to find whether each row (transaction) contains more F/NF
# similarity. If we have array = [1, 0, 1, 1, 0, 1], then 4 fraud instances / size 6 = .66%
# So we can classify that transaction as fraud.
# We would put that classification in a predictions 1D array of size columns.
# predictions[i] = (np.sum(classifications[i]) / x_cols)
# predictions = [1, 0, 1, ... 0, 1, 0] transaction[0] = fraud, [1] = nonfraud, etc.
for i in range(x_rows):
    for j in range(x_cols):
    
        fraud_diff = abs(x[i][j] - averages[i][j])
        nonfraud_diff = abs(x[i][j] - averages[i][j])    
        
        if fraud_diff <= nonfraud_diff:
            classifications[i][j] = 1
        else:
            classifications[i][j] = 0
     
    if ((np.sum(classification[i]) / x_cols) >= 0.50):
        predictions[i] = 1 # Fraud
    else:
        predictions[i] = 0 # Nonfraud
        
return predictions
        
```
        
## Final judgement

I have finally finished the final product and it works perfectly, although it could always use improvements.

Here is a small sample of the data.
### 0 = No fraud, 1 = Fraud
- Train predictions:
    + [0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0.]
- Train real:
    + [0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]
- Test predictions:
    + [0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
- Test real:
    + [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
- Train pred unique counts:
    + {0.0: 1287, 1.0: 713}
- Train real unique counts:
    + {0.0: 1725, 1.0: 275}
- Test pred unique counts:
    + {0.0: 687, 1.0: 388}
- Test real unique counts:
    + {0.0: 902, 1.0: 173}
- Fraud averages:
    + [5.36797366e+02 2.31905750e+04 1.12727273e-01 3.89454545e+00
 6.98181818e-01 4.58181818e-01 2.68025455e+02 1.96155273e+02
 2.22181818e+00]
- Non-fraud averages:
    + [5.09289820e+02 7.70428074e+03 2.31884058e-03 4.55652174e-01
 1.42028986e-01 5.79710145e-04 2.22985507e+01 1.49634783e+01
 1.00869565e-01]
- **Accuracy:**
    + **0.8715**
- **Accuracy:**
    + **0.8623255813953489**

OUR ACCURACY ~87% FOR THE TRAINING AND TESTING DATA!
THAT IS INCREDIBLE FOR A TRAINING MODEL. MY MODEL IS THE BEST MODEL.

## Final edit:
I realized I was classifying using only the training data generate average values, but now I changed the code to also display the testing data generated values and found the optimal value of >30% for classifying a transaction as fraudulent. If 30% of the features are found to be similar to fraudulent transactions, we will mark that transaction of fradulent.

My final accuracy is the following:

- Train Accuracy using training averages:	0.9225
- Test Accuracy using training averages:	0.9079069767441861
- Train Accuracy using testing averages:	0.922
- Test Accuracy using testing averages:     0.9041860465116279

# **On average, my model has a ~91% accuracy in predicting a fraudulent transaction.**

## P3: Implement a hand engineered classifier (8 pts)

Engineer by hand a classifier function that predicts whether  a transaction is  fraudulent or not. Your function should have a $\theta$ parameter which allows you to tweak it. 
The problem requires you to design a function that performs this classification, tweak its parameters, and measure its accuracy for the best parametrization you found. You should aim for a function that, at minimum, performs better than the majority classifier. 

## Explanation for my handwritten function

We have the averages of the fraud and non-fraud columns.

We will continue to use the averages of the fraud and nonfraud columns, but place more emphasis here on the whether a transaction is in a foreign country, high risk country, and chargeback and checkbook balance. These values contribute to a significantly higher increase in risk for a fraudulent transaction. Let's break down the data:

## Comparison of fraud / non-fraud feature averages
- Average Amount/transaction/day
    + Fraud: 536
    + Non-fraud: 509
- Transaction_amount
    + Fraud: 23,190
    + Non-fraud: 7,704
- Is declined
    + Fraud: 0.1127 (11.27%)
    + Non-fraud: 0.0023 (0.23%)
- Total Number of declines/day
    + Fraud: 3.8945
    + Non-fraud: 0.4556
- isForeignTransaction
    + Fraud: 0.6981 (69.81%)
    + Non-fraud: 0.1420 (14.20%)
- isHighRiskCountry
    + Fraud: 0.4581 (45.81%)
    + Non-fraud: 0.0005 (0.05%)
- Daily_chargeback_avg_amt
    + Fraud: 268
    + Non-fraud: 22
- 6_month_avg_chbk_amt
    + Fraud: 196
    + Non-fraud: 15
- 6-month_chbk_freq
    + Fraud: 2.2218
    + Non-fraud: 0.1008

## Fraudulent transactions...
- have slightly higher transaction counts
- transact 3x the amount as non-fraud transactions
- 49x more the amount of declines per day
- 8.5x more the amount of declines
- 5x more likely to be a foreign transaction
- 916x more likely to be in a high risk country
- 12x more likely to have higher chargeback average
- 13x more likely to have a higher checkbook amount
- 22x more likely to have a higher checkbook frequency

## Conclusion
The most starting numbers that stick out are:
- 49x higher number of declines.
- 916x the amount of fraud cases are in a high risk country.

So our new handwritten function will take those two factors in a heavier weight than our automated identification system.

In [None]:
# TODO: implement here your hand-engineered classifier
# The example below is just a very bad example, but it gives you an idea of how you can reason about the classification problem.
# In your implementation, you should try to actually find some kind of clever algorithm. You can also use more complex parametrizations

# Higher weighting is attributed towards isDeclined, isHighRiskCountry, and checkbook_freq.
# Changed theta to averages. Same thing.
def classify_handwritten(x, averages):
    
    # Get the number of rows and columns for the input parameters.
    x_rows = x.shape[0]
    x_cols = x.shape[1]
        
    # Contains the return classification array for every transaction.
    # 0 represents no fraud, 1 represents fraud.
    predictions = np.zeros(x_rows)

    # Contains the intermediate 0,1 array to calculate majority.
    classifications = np.zeros((x_rows, x_cols))

    for i in range(x_rows):
        for j in range(x_cols):
            
            # Find difference for every value in x row compared to average of F / NF.
            nonfraud_diff = abs(x[i][j] - averages[0][j]) # Nonfraud
            fraud_diff = abs(x[i][j] - averages[1][j]) # Fraud
            
            # X[i][j] value is closer to fraud average than nonfraud.
            if fraud_diff <= nonfraud_diff:
                classifications[i][j] = 1 # Classify as fraud
            else:
                classifications[i][j] = 0 # Classify as nonfraud
        
        # Classification array contains only 0's, 1's. Same shape as training_x data.
        # Highest risk factors are isDeclined (2), isHighRiskCountry (5), and checkbook_freq (8).
        highest_risk_factors = ((classifications[i][2] + classifications[i][5] + classifications[i][8]) / 3)
        
        if (highest_risk_factors >= 0.3) or ((np.sum(classifications[i]) / classifications.shape[1]) >= 0.3):
            predictions[i] = 1 # Fraud
        else:
            predictions[i] = 0 # Nonfraud
        
    return predictions

In [None]:
# TODO: Now, run some experiments with your function. Experiment with different values of the parameter theta.  

import time

# Returns a 2D array of the averages for the train/test data fraud and non-fraud.
# train_averages[0] = nonfraud averages, train_averages[1] = fraud averages
train_start = time.time()
train_averages = train_majority(training_data_x, training_data_y)
train_end = time.time()
test_averages = train_majority(test_data_x, test_data_y)

# TODO: now use the theta value to create the test_data_yhat array which contains the classification for each test value 
label_start = time.time()
train_pred_using_train_avg = classify_handwritten(training_data_x, train_averages)
label_end = time.time()
test_pred_using_train_avg = classify_handwritten(test_data_x, train_averages)
train_pred_using_test_avg = classify_handwritten(training_data_x, test_averages)
test_pred_using_test_avg = classify_handwritten(test_data_x, test_averages)

# Zipping up the total counts of unique fraud / nonfraud
# Predictions to visualize the unique value and counts.
train_unique, train_counts = np.unique(train_pred, return_counts=True)
test_unique, test_counts = np.unique(test_pred, return_counts=True)
realTrain_unique, realTrain_counts = np.unique(training_data_y, return_counts=True)
realTest_unique, realTest_counts = np.unique(test_data_y, return_counts=True)

In [None]:
# TODO: calculate the accuracy of the classifier on the test data with the best
# theta found above and print it.
print("==================================")
print("0 = No fraud, 1 = Fraud")
print("==================================")
print("Training time: ", round(train_end - train_start, 4), "seconds")
print("Labeling time: ", round(label_end - label_start, 4), "seconds")
print("==================================")
print("averages of nonfraud averages using training\n", train_averages[0])
print("averages of fraud averages using training data\n", train_averages[1])
print("==================================")
print("Train predictions:\n", train_pred_using_train_avg[0:20])
print("Train real:\n", training_data_y[0:20])
print("Test predictions:\n", test_pred_using_train_avg[0:20])
print("Test real:\n", test_data_y[0:20])
print("==================================")
print("Train pred unique counts:\n", dict(zip(train_unique, train_counts)))
print("Train real unique counts:\n", dict(zip(realTrain_unique, realTrain_counts)))
print("Test pred unique counts:\n", dict(zip(test_unique, test_counts)))
print("Test real unique counts:\n", dict(zip(realTest_unique, realTest_counts)))
print("==================================")

# TODO: now calculate the accuracy of the classifier using the function implemented in P1, and print it out
print("Train Fraud averages:\n", train_averages[0])
print("Train Non-fraud averages:\n", train_averages[1])
print("Test Fraud averages:\n", test_averages[0])
print("Test Non-fraud averages:\n", test_averages[1])
print("==================================")
print(f'TRAIN ACCURACY using training averages:\t{accuracy(training_data_y, train_pred_using_train_avg)}')
print(f'TEST ACCURACY using training averages:\t{accuracy(test_data_y, test_pred_using_train_avg)}')
print(f'TRAIN ACCURACY using testing averages:\t{accuracy(training_data_y, train_pred_using_test_avg)}')
print(f'TEST ACCURACY using testing averages:\t{accuracy(test_data_y, test_pred_using_test_avg)}')
print("==================================")

# Conclusion
Average accuracy: ~93% Average time: ~0.0635 seconds
-----
This version works better than the first one. It seems that the features isDeclined, isHighRiskCountry, and checkbook frequency are great at predicting whether a transaction is fraud or not. Speed is the same. In order for the algorithm to be more efficient, I'd have to cut out the classification arrays and find some logarithmic or linear function to cut away the nested for loop. But not bad regardless. 0.0635 seconds for a data set with 2000 entries. So for 1,000,000 entries the time to train would be 31.75 seconds.

## P4: Implement a logistic regression classifier using sklearn (8 pts)
Implement a logistic regression function using the sklearn library. 
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [None]:
# TODO: implement the logistic regression here in a function

from sklearn.linear_model import LogisticRegression

def log_regr_withRegularization(train_x, train_y, test_x, test_y):

    # Logistic regression with Regularization.
    clf = LogisticRegression(max_iter = 800, random_state = 0)
    
    # Generate the model.
    clf.fit(train_x, train_y)
    
    # Various data extractions from the model.
    predictions = clf.predict(test_x)
    parameters = clf.get_params()
    probabilities = clf.predict_proba(train_x)
    score = clf.score(test_x, test_y)

    return predictions, parameters, score, probabilities

def log_regr_noRegularization(train_x, train_y, test_x, test_y):

    # No regularization.
    clf = LogisticRegression(max_iter = 800, penalty = 'none', random_state = 0)
    
    # Generat the model.
    clf.fit(train_x, train_y)
    
    # Various data extractions from the model.
    predictions = clf.predict(test_x)
    parameters = clf.get_params()
    score = clf.score(test_x, test_y)
    probabilities = clf.predict_proba(train_x)

    return predictions, parameters, score, probabilities

In [None]:
# TODO: now, run some experiments with it, and measure the accuracy with various parametrizations. In particular, you should run it with and without regularization. 
# In the last line, print the accuracy with the best parameters.

import time

regr_start = time.time()
TrainPred, TrainParam, TrainScore, TrainProb = log_regr_withRegularization(training_data_x, training_data_y, training_data_x, training_data_y)
regr_end = time.time()
TestPred, TestParam, TestScore, TestProb = log_regr_withRegularization(training_data_x, training_data_y, test_data_x, test_data_y)

no_regr_start = time.time()
noRegTrainPred, noRegTrainParam, noRegTrainScore, noRegTrainProb = log_regr_noRegularization(training_data_x, training_data_y, training_data_x, training_data_y)
no_regr_end = time.time()
noRegTestPred, noRegTestParam, noRegTestScore, noRegTestProb = log_regr_noRegularization(training_data_x, training_data_y, test_data_x, test_data_y)

print("===================")
print("No Regularization:")
print("===================")
print("Training Prediction:\n", noRegTrainPred)
print("Testing Prediction:\n", noRegTestPred)
print("Training Parameters:\n", noRegTrainParam)
print("Testing Parameters:\n", noRegTestParam)
print("Training Score:\n", noRegTrainScore)
print("Testing Score:\n", noRegTestScore)
print("Training Probabilities:\n", noRegTrainProb)
print("Testing Probabilities:\n", noRegTestProb)
print("Training Time:\n", no_regr_end - no_regr_start)

print("===================")
print("Regularization:")
print("===================")
print("Training Prediction:\n", TrainPred)
print("Testing Prediction:\n", TestPred)
print("Training Parameters:\n", TrainParam)
print("Testing Parameters:\n", TestParam)
print("Training Score:\n", TrainScore)
print("Testing Score:\n", TestScore)
print("Training Probabilities:\n", TrainProb)
print("Testing Probabilities:\n", TestProb)
print("Training Time:\n", regr_end - regr_start)

TODO: Describe in one paragraph your experiments and evaluation of the Logistic Regression classifier. Consider things such as accuracy, training time, ease of tweaking of the parameters. Compare it with the accuracy of the hand-engineered classifier.

# Conclusion

The classifier achieves a higher accuracy than my model. This one gets 99% accuracy while mine has 93%, however this model takes ~.16 seconds on average while mine takes ~.06. So mine is about 3 times faster with 6% less accuracy. If a fraud algorithm that is more efficient than accurate is needed, mine would work better for millions of incoming transactions per second. However, despite my version being more efficient I would always use the scikit learn one because it is highly optimized to work well under different sorts of data while my version works well for a niche data point. Furthermore, my version does also allow parameter tweaking. You can change the majority percentage needed for a data point to be considered fraud.

## P5 Bonus: Implement a random forest classifier using sklearn (5 pts)
Implement a random forest classifier using sklearn 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
# TODO: Implement the random forest classifier here

from sklearn.ensemble import RandomForestClassifier

def forest(train_x, train_y, test_x, test_y):

    # Generate the classifier.
    clf = RandomForestClassifier(random_state = 0)
    clf.fit(train_x, train_y)
    parameters = clf.get_params()
    predictions = clf.predict(test_x)
    score = clf.score(test_x, test_y)
    probabilities = clf.predict_proba(test_x)

    return predictions, parameters, score, probabilities

In [None]:
# TODO: Perform some experiments here with different parameters of the random forest classifier. In the last line, print the accuracy with the best parameters.

start = time.time()
trainPred, trainParam, trainScore, trainProb = forest(training_data_x, training_data_y, training_data_x, training_data_y)
end = time.time()
testPred, testParam, testScore, testProb = forest(training_data_x, training_data_y, test_data_x, test_data_y)

print("===================")
print("No Regularization:")
print("===================")
print("Training Prediction:\n", trainPred)
print("Testing Prediction:\n", testPred)
print("Training Parameters:\n", trainParam)
print("Testing Parameters:\n", testParam)
print("Training Score:\n", trainScore)
print("Testing Score:\n", testScore)
print("Training Probabilities:\n", trainProb)
print("Testing Probabilities:\n", testProb)
print("Training Time:\n", end - start)

TODO: Describe in one paragraph your experiments and evaluation of the random forest classifier. Consider things such as accuracy, training time, ease of tweaking of the parameters. 

## Conclusion

Both scores were great for classifying both groups. However, it took .32 seconds for only 2000 data points which is concerning because for 10M data points it would take 1600 seconds ~30 minutes. It wouldn't work to classify transactions in real time. Accuracy was high, however I believe the scikit learn logRegression works better to classify data with more efficiency.

## P6 Bonus: Implement an AdaBoost classifer using sklearn (5 pts)

Implement an AdaBoost classifier using sklearn https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [None]:
# TODO: Implement the adaboost classifier here

from sklearn.ensemble import AdaBoostClassifier

def ada(train_x, train_y, test_x, test_y):

    # Generate the classifier.
    clf = AdaBoostClassifier()
    clf.fit(train_x, train_y)
    parameters = clf.get_params()
    predictions = clf.predict(test_x)
    score = clf.score(test_x, test_y)
    probabilities = clf.predict_proba(test_x)

    return predictions, parameters, score, probabilities

In [None]:
# TODO: Perform some experiments here with different parametrizations of the adaboost classifier. In the last line, print the accuracy with the best parameters.

start = time.time()
trainPred, trainParam, trainScore, trainProb = ada(training_data_x, training_data_y, training_data_x, training_data_y)
end = time.time()
testPred, testParam, testScore, testProb = ada(training_data_x, training_data_y, test_data_x, test_data_y)

print("===================")
print("No Regularization:")
print("===================")
print("Training Prediction:\n", trainPred)
print("Testing Prediction:\n", testPred)
print("Training Parameters:\n", trainParam)
print("Testing Parameters:\n", testParam)
print("Training Score:\n", trainScore)
print("Testing Score:\n", testScore)
print("Training Probabilities:\n", trainProb)
print("Testing Probabilities:\n", testProb)
print("Training Time:\n", end - start)

TODO: Describe in one paragraph your experiments and evaluation of the AdaBoost classifier. Consider things such as accuracy, training time, ease oftweaking of the parameters.

## Conclusion

With a training time of 0.21 seconds and an accuracy of 99.5% for the train and 98% for the test, this algorithm would likely come in third place between the logistic regression and the random forest algorithm. Tweaking parameters is easy using the scikit API, they have a bunch of different cases for different statistical assumptions on their website.

In [None]:
# Came already in original project.

# SOLUTION
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
clf.fit(training_data_x, training_data_y)
yhat = clf.predict(test_data_x)
acc = accuracy(test_data_y, yhat)
print(f"Accuracy of the AdaBoost classifier {acc}")