# BUS 774: Prediction Modeling Exercise

C Kaligotla

03-MAR-2025

## Problem Description

Given the scale of daily credit card transactions, detecting Credit Card fraud is challenging and is an ideal use case for ML applications.

We have some data about Credit Card fraud from some unnamed institute.

Data for training is available here: [Training Data](https://www.dropbox.com/scl/fi/og0ld7ao5d63ofdnkugtd/card_transdata_train.csv?rlkey=ss3wxcp85kwfq4pzgr8enf2if&dl=0)

The data has the following Features (Variables):
*   *distancefromhome*, numeric - the distance from home where the transaction happened.
* *distancefromlast_transaction*, numeric - the distance from last transaction happened.
* *ratiotomedianpurchaseprice*, numeric - Ratio of purchased price transaction to median purchase price.
* *repeat_retailer*, binary - Is the transaction happened from same retailer.
* *used_chip* , binary - Is the transaction through chip (credit card).
* *used_pin_number*, binary - Is the transaction happened by using PIN number.
* *online_order*, binary - Is the transaction an online order.

Target Variable:
* *fraud*, binary - Is the transaction fraudulent.

**Objective: ***Build an ML algorithm to predict fraudulent transactions.***

Data for testing your model performance is available here: [Test Data](https://www.dropbox.com/scl/fi/uk5xgx94oztqok2ef6w9o/card_transdata_test.csv?rlkey=8j7v1lo11dfenibaxxlxxkrf7&dl=1)

I asked the following questions:
1. What is the best metric to use for evaluating your ML model and why?
2. Build and train your model. Use the testing data to evaluate the model. Print out the confusion matrix from your  prediction model on the test data and report chosen metric
3. Identify your "best" model
4. I have new implementation data sets in class post-submission. Calculate your "best" model performance on this new implementation data and report your metric.
 Links:
 * [OOS_1_Data](https://www.dropbox.com/scl/fi/25tjo6wvwszygrq6kd750/card_OOS1.csv?rlkey=s1yfcjckludqahjf4bzn2u5no&dl=1)
 * [OOS_2_Data](https://www.dropbox.com/scl/fi/ntfp8rja3b2eofqk0g0ii/card_OOS2.csv?rlkey=0zqir93n37na2ac2o4zagfpd1&dl=1)
 * [OOS_3_Data](https://www.dropbox.com/scl/fi/o3g5k9aoe5rj6yfovac54/card_OOS3.csv?rlkey=rqail2dcnnc6qv2mv7sbdsnjp&dl=1)




## Preamble

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LassoCV
from sklearn.metrics import make_scorer,confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import XGBClassifier


## Load Data and Explore

In [23]:
#train_url = 'https://www.dropbox.com/scl/fi/og0ld7ao5d63ofdnkugtd/card_transdata_train.csv?rlkey=ss3wxcp85kwfq4pzgr8enf2if&e=1&dl=1'
#test_url= 'https://www.dropbox.com/scl/fi/uk5xgx94oztqok2ef6w9o/card_transdata_test.csv?rlkey=8j7v1lo11dfenibaxxlxxkrf7&dl=1'
train_url = 'https://www.dropbox.com/scl/fi/x211gqhiufa0top7d7c8m/chimera_data_train.csv?rlkey=suz0x2x9frdhvlmz5m5on98hk&dl=1'
test_url= 'https://www.dropbox.com/scl/fi/ig42vp9mnq3axaosvikxt/chimera_data_test.csv?rlkey=a3llsve82mxxbq6uc54gffa55&dl=1'

df=pd.read_csv(train_url)
df.head()

Unnamed: 0,admin_support,age,boss_survey,boss_tenure,city_size,clock_in,core,education,gender,half_day_leaves,...,remote,salary,subordinates,team_size,tenure,tenure_unit,training,variable_pay,years_since_promotion,exit
0,2,35,0.655444,3,6.1,1,1,2,1,4,...,0,53.894035,0,9,3,3,3,11,3,0
1,0,33,0.533455,4,9.4,0,1,2,0,5,...,0,35.606964,0,6,1,1,3,1,3,0
2,0,32,0.486568,5,2.2,0,1,1,0,4,...,0,27.40036,0,10,2,2,3,1,4,0
3,0,40,0.477364,4,4.3,0,1,3,0,4,...,0,36.138199,0,8,1,1,3,0,4,0
4,2,47,0.60323,4,2.2,0,1,1,1,5,...,0,42.77858,1,9,1,1,2,11,4,1


In [24]:
print(df.shape)
df.info()

(14505, 26)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14505 entries, 0 to 14504
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   admin_support          14505 non-null  int64  
 1   age                    14505 non-null  int64  
 2   boss_survey            14505 non-null  float64
 3   boss_tenure            14505 non-null  int64  
 4   city_size              14505 non-null  float64
 5   clock_in               14505 non-null  int64  
 6   core                   14505 non-null  int64  
 7   education              14505 non-null  int64  
 8   gender                 14505 non-null  int64  
 9   half_day_leaves        14505 non-null  int64  
 10  high_potential         14505 non-null  int64  
 11  job_satisfaction       14505 non-null  float64
 12  kpi_performance        14505 non-null  float64
 13  local                  14505 non-null  int64  
 14  part_time              14505 non-null  int

Note: Everything here is numerical, so we don't have to convert anything.

If for instance, fraud was "Y" and "N" instead of 1 and 0, we'd use code like this:

```
y = data['fraud'].apply(lambda x: 1 if x == 'Y' else 0)  # Encode target
```

There's no missing data, so we don't have to worry about imputation.

In [25]:
df.describe()

Unnamed: 0,admin_support,age,boss_survey,boss_tenure,city_size,clock_in,core,education,gender,half_day_leaves,...,remote,salary,subordinates,team_size,tenure,tenure_unit,training,variable_pay,years_since_promotion,exit
count,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,...,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0,14505.0
mean,0.603792,37.880662,0.500008,3.619235,5.40728,0.350913,0.800896,1.376836,0.631093,4.601792,...,0.300241,38.980513,1.828197,7.121131,1.587384,1.392072,2.596001,7.026818,3.610203,0.13485
std,0.734088,4.736379,0.199857,1.050111,2.802392,0.477272,0.39934,0.658341,0.482525,1.231424,...,0.458379,6.938194,2.545286,2.017644,1.015106,0.883105,0.581865,5.887543,0.757937,0.341575
min,0.0,19.0,-0.277058,3.0,0.9,0.0,0.0,1.0,0.0,0.0,...,0.0,21.664703,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0
25%,0.0,35.0,0.365113,3.0,2.2,0.0,1.0,1.0,0.0,4.0,...,0.0,34.15358,0.0,6.0,1.0,1.0,2.0,2.0,3.0,0.0
50%,0.0,38.0,0.500478,3.0,4.3,0.0,1.0,1.0,1.0,5.0,...,0.0,38.191643,0.0,7.0,1.0,1.0,3.0,5.0,4.0,0.0
75%,1.0,41.0,0.63393,4.0,9.4,1.0,1.0,2.0,1.0,5.0,...,1.0,42.637604,4.0,8.0,2.0,2.0,3.0,12.0,4.0,0.0
max,2.0,56.0,1.276953,32.0,9.4,1.0,1.0,3.0,1.0,9.0,...,1.0,93.252998,8.0,15.0,32.0,32.0,4.0,20.0,6.0,1.0


### # Separate features and target

In [26]:
X = df.drop('exit', axis=1)
y = df['exit']
print(X.head())
print(X.shape)
print(y.head())
print(y.shape)

   admin_support  age  boss_survey  boss_tenure  city_size  clock_in  core  \
0              2   35     0.655444            3        6.1         1     1   
1              0   33     0.533455            4        9.4         0     1   
2              0   32     0.486568            5        2.2         0     1   
3              0   40     0.477364            4        4.3         0     1   
4              2   47     0.603230            4        2.2         0     1   

   education  gender  half_day_leaves  ...  rank  remote     salary  \
0          2       1                4  ...     3       0  53.894035   
1          2       0                5  ...     1       0  35.606964   
2          1       0                4  ...     1       0  27.400360   
3          3       0                4  ...     1       0  36.138199   
4          1       1                5  ...     4       0  42.778580   

   subordinates  team_size  tenure  tenure_unit  training  variable_pay  \
0             0          9   

Pre-Process the data - Transform (standardization/normalization) and convert to array)

See: https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe/

In [27]:
X = StandardScaler().fit_transform(X)
X[:5] # first 5 rows

array([[ 1.90202847, -0.60822011,  0.7777652 , -0.58970551,  0.24719724,
         1.36003871,  0.49859923,  0.94660075,  0.76456079, -0.4887131 ,
        -0.34502213, -0.2456297 ,  1.564719  ,  0.8152072 , -0.5165387 ,
         1.23492104, -0.6550298 ,  2.14955595, -0.71829271,  0.93125137,
         1.3916435 ,  1.82083046,  0.69434111,  0.67486868, -0.80511237],
       [-0.82253434, -1.03049814,  0.16736348,  0.36260783,  1.42480313,
        -0.73527319,  0.49859923,  0.94660075, -1.30794047,  0.32338261,
        -0.34502213,  1.31524848, -1.48035245, -1.22668201, -0.5165387 ,
        -0.77253494, -0.6550298 , -0.48624547, -0.71829271, -0.5556823 ,
        -0.57866289, -0.44398503,  0.69434111, -1.02369117, -0.80511237],
       [-0.82253434, -1.24163716, -0.06725067,  1.31492116, -1.14451889,
        -0.73527319,  0.49859923, -0.57242169, -1.30794047, -0.4887131 ,
        -0.34502213, -0.91567158,  0.40486988, -1.22668201, -0.5165387 ,
        -0.77253494, -0.6550298 , -1.66910184, -0

In [28]:
# convert to numpy array - since it's just 0 and 1, i'm not applying a transform
y= y.to_numpy()
y[:100] #first 100 obs

array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

### Test-train Split for internal test data (validation data) to choose our *best* model

In [29]:
# I'm creating a separate internal test / validation set to choose best model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(11604, 25)
(11604,)
(2901, 25)
(2901,)


## Build Models

Let's build the following models / classifiers to choose the **Best**
1. Logistic Regression
2. LASSO Logistic Regression
3. Ridge Logistic Regression
4. Decision Tree / CART
5. Random Forest
6. KNN / K-Nearest Neighbors

We'll use 5-fold CV to train/fit our models using

```
# Using StratifiedKFold to maintain the same proportion of classes in each fold
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

**First STEP - WHAT is the RIGHT METRIC to use? and why?**

Using f1 here as default.

Also showing you prompts to generate code


***General Workflow:***
1. Define classifier (model)
2. Define the StratifiedKFold cross-validator (need to do it once)
3. Train model using CV on training data, report evaluations across folds)
4. Fit model on entire train data and predict on internal test (validation) data
5. Report Confusion Matrix and F1 Score

In [30]:
f1_scores_training = {} # to store scores


### 1. Logistic Regression Model

In [31]:
# prompt: Build a logistic regression classifier on X_train and y_train and using  5-fold cv using kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42).

# Define the logistic regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Define the StratifiedKFold cross-validator
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(log_reg, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))


Cross-validation scores: [0.3862069  0.31353919 0.36744186 0.29256595 0.40816327]
Mean cross-validation score: 0.35358343239284384


In [32]:
# prompt: Use the best log_reg model on y_test and y_train. Report confusion matrix and f1 score

# Fit the best model (log_reg in this case) on the entire training data
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))

f1_scores_training['log_reg'] = f1_score(y_test, y_pred)

[[2462   54]
 [ 297   88]]
F1 Score: 0.33396584440227706


### 2. LASSO Logistic Regression


In [33]:
# prompt: Repeat the above but using LogisticRegression(penalty='l1', solver='saga', max_iter=1000, random_state=42)

from sklearn.metrics import confusion_matrix

# Define the LASSO logistic regression model
lasso_log_reg = LogisticRegression(penalty='l1', solver='saga', max_iter=1000, random_state=42)

# Define the StratifiedKFold cross-validator
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(lasso_log_reg, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))

# Fit the model on the entire training data
lasso_log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = lasso_log_reg.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))

f1_scores_training['lasso_log_reg'] = f1_score(y_test, y_pred)

Cross-validation scores: [0.37875289 0.31753555 0.3712297  0.29256595 0.40454545]
Mean cross-validation score: 0.35292590640465105
[[2463   53]
 [ 297   88]]
F1 Score: 0.33460076045627374


### 3. Ridge Logistic Regression


In [34]:
# prompt: repeat for Ridge Regression

# Define the Ridge logistic regression model
ridge_log_reg = RidgeClassifier(random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(ridge_log_reg, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))

# Fit the model on the entire training data
ridge_log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = ridge_log_reg.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))

f1_scores_training['ridge_log_reg'] = f1_score(y_test, y_pred)

Cross-validation scores: [0.06748466 0.05572755 0.07926829 0.06116208 0.06153846]
Mean cross-validation score: 0.06503621009766908
[[2512    4]
 [ 377    8]]
F1 Score: 0.04030226700251889


### 4. Decision Tree / CART


In [35]:
# prompt: Repeat code for DecisionTree/CART using DecisionTreeClassifier(criterion='entropy', max_depth = 10, random_state=42)

# Define the Decision Tree Classifier model
CART = DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(CART, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))

# Fit the model on the entire training data
CART.fit(X_train, y_train)

# Predict on the test set
y_pred = CART.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))

f1_scores_training['CART'] = f1_score(y_test, y_pred)

Cross-validation scores: [0.34730539 0.29535865 0.31101512 0.30078125 0.36978131]
Mean cross-validation score: 0.324848343985664
[[2415  101]
 [ 308   77]]
F1 Score: 0.27353463587921845


### 5. Random Forest

In [36]:
# prompt: Repeat using RandomForestClassifier(max_depth = 10, random_state=42)

# Define the Random Forest Classifier model
RF = RandomForestClassifier(max_depth=10, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(RF, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))

# Fit the model on the entire training data
RF.fit(X_train, y_train)

# Predict on the test set
y_pred = RF.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))

f1_scores_training['RandomForest'] = f1_score(y_test, y_pred)

Cross-validation scores: [0.29411765 0.31578947 0.3062201  0.28855721 0.37176471]
Mean cross-validation score: 0.31528982724990307
[[2474   42]
 [ 309   76]]
F1 Score: 0.30218687872763417


### 6. KNN / K-Nearest Neighbors


In [37]:
# prompt: Repeat for KNeighborsClassifier(n_neighbors=5)

# Define the KNN Classifier model
knn = KNeighborsClassifier(n_neighbors=5)

# Perform cross-validation
cv_scores = cross_val_score(knn, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))

# Fit the model on the entire training data
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))
f1_scores_training['knn'] = f1_score(y_test, y_pred)

Cross-validation scores: [0.14507772 0.13756614 0.16243655 0.13299233 0.14210526]
Mean cross-validation score: 0.14403559930407306
[[2455   61]
 [ 335   50]]
F1 Score: 0.20161290322580644


Adding in XGBOOST HERE

In [38]:
# Define the KNN Classifier model
xgb = XGBClassifier(n_estimators=100, use_label_encoder=False, eval_metric='logloss', random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(knn, X_train, y_train, cv=kf, scoring='f1')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("Mean cross-validation score:", np.mean(cv_scores))

# Fit the model on the entire training data
xgb.fit(X_train, y_train)

# Predict on the test set
y_pred = xgb.predict(X_test)

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the F1 score
print("F1 Score:", f1_score(y_test, y_pred))
f1_scores_training['xgb'] = f1_score(y_test, y_pred)

Cross-validation scores: [0.14507772 0.13756614 0.16243655 0.13299233 0.14210526]
Mean cross-validation score: 0.14403559930407306


Parameters: { "use_label_encoder" } are not used.



[[2409  107]
 [ 279  106]]
F1 Score: 0.35451505016722407


In [39]:
f1_scores_training

{'log_reg': 0.33396584440227706,
 'lasso_log_reg': 0.33460076045627374,
 'ridge_log_reg': 0.04030226700251889,
 'CART': 0.27353463587921845,
 'RandomForest': 0.30218687872763417,
 'knn': 0.20161290322580644,
 'xgb': 0.35451505016722407}

In [40]:
# prompt: What is the best model from the 6?

# Based on the provided code, the best model is the one with the highest F1 score
# on the internal test set.  You would need to manually compare the F1 scores
# printed for each model to determine the best one.  The code does not store
# these values in a way that allows for automated comparison.

# Example of how to compare manually:
# Look for the "F1 Score:" lines at the end of each model section.
# The model with the highest F1 score is considered the best in this scenario.

#To compare and find the best model:

best_model = max(f1_scores_training, key=f1_scores_training.get)
print(f"Best model based on F1-score is: {best_model} with F1-score {f1_scores_training[best_model]}")


Best model based on F1-score is: xgb with F1-score 0.35451505016722407


## Evaluate Models on Provided Test Data

### Transform TEST data as needed

In [41]:
df2 = pd.read_csv(test_url)

In [42]:
X_TEST = df2.drop('fraud', axis=1)
y_TEST = df2['fraud']
print(X_TEST.shape)
print(y_TEST.shape)

KeyError: "['fraud'] not found in axis"

In [None]:
X_TEST = StandardScaler().fit_transform(X_TEST)
y_TEST= y_TEST.to_numpy()

### EVALUATE MODELS ON the provided X_TEST and y_TEST -- models have not been trained on this!

In [None]:
# prompt: Evaluate all models above on X_TEST and y_TEST. Report f1 scores in a neat table.

# Create a list to store the model names and their corresponding f1 scores
f1_scores_TEST = []

# Evaluate each model on X_TEST and y_TEST and store the f1 scores
models = [log_reg, lasso_log_reg, ridge_log_reg, CART, RF, knn]
model_names = ['log_reg', 'lasso_log_reg', 'ridge_log_reg', 'CART', 'RandomForest', 'knn'] # Create a list of model names

for model in models:
    y_pred = model.predict(X_TEST)
    f1 = f1_score(y_TEST, y_pred)
    f1_scores_TEST.append(f1)

In [None]:
# Create a DataFrame to display the results in a table
results_df = pd.DataFrame({"Model": model_names, "F1 Score TEST DATA": f1_scores_TEST}) # Use model_names here

results_df['f1_scores_InternalTesting'] = results_df['Model'].map(f1_scores_training) # Map using 'Model' column
# Display the table
results_df


Unnamed: 0,Model,F1 Score TEST DATA,f1_scores_InternalTesting
0,log_reg,0.702917,0.722002
1,lasso_log_reg,0.703179,0.721987
2,ridge_log_reg,0.249364,0.268197
3,CART,0.946805,0.999929
4,RandomForest,0.953962,0.999858
5,knn,0.951592,0.991829


## Choose "Best Model" and Fine-Tune if needed

Assume RandomForest Wins.
We can fine-tune (play around hyperparameters to see if we can improve), or at least, train the model on full data

In [None]:
# prompt: combine X and X_TEST

# Combine X and X_TEST
combined_X = np.concatenate((X, X_TEST), axis=0)
combined_y = np.concatenate((y, y_TEST), axis=0)

print(combined_X.shape)
combined_y.shape


(499999, 7)


(499999,)

In [None]:
# Retrain the Random Forest model on the entire dataset
RF_final = RandomForestClassifier(max_depth=10, random_state=42)
RF_final.fit(X, y)


## Evaluate "Best Model" on provided Implementation/OOS DataSets

In [None]:
oos1_url = 'https://www.dropbox.com/scl/fi/25tjo6wvwszygrq6kd750/card_OOS1.csv?rlkey=s1yfcjckludqahjf4bzn2u5no&dl=1'
oos2_url = 'https://www.dropbox.com/scl/fi/ntfp8rja3b2eofqk0g0ii/card_OOS2.csv?rlkey=0zqir93n37na2ac2o4zagfpd1&dl=1'
oos3_url = 'https://www.dropbox.com/scl/fi/o3g5k9aoe5rj6yfovac54/card_OOS3.csv?rlkey=rqail2dcnnc6qv2mv7sbdsnjp&dl=1'

In [None]:
oos1 = pd.read_csv(oos1_url)
oos2 = pd.read_csv(oos2_url)
oos3 = pd.read_csv(oos3_url)

In [None]:
print(oos1.shape)
print(oos2.shape)
print(oos3.shape)

(500, 8)
(1000, 8)
(499, 8)


In [None]:
#Apply scalar transforms to all OOS like before

X_oos1 = oos1.drop('fraud', axis=1)
y_oos1 = oos1['fraud']
X_oos1 = StandardScaler().fit_transform(X_oos1)
y_oos1 = y_oos1.to_numpy()

X_oos2 = oos2.drop('fraud', axis=1)
y_oos2 = oos2['fraud']
X_oos2 = StandardScaler().fit_transform(X_oos2)
y_oos2 = y_oos2.to_numpy()

X_oos3 = oos3.drop('fraud', axis=1)
y_oos3 = oos3['fraud']
X_oos3 = StandardScaler().fit_transform(X_oos3)
y_oos3 = y_oos3.to_numpy()

### EVALUATION TIME!

In [None]:
# prompt: Run RF_final.fit on X_oos1 and y_oos1. Report confusion matrix and f1 score

y_pred_oos1 = RF_final.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score:", f1_score(y_oos1, y_pred_oos1))


[[235  15]
 [189  61]]
F1 Score: 0.37423312883435583


In [None]:
y_pred_oos2 = RF_final.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score:", f1_score(y_oos2, y_pred_oos2))

[[440  60]
 [375 125]]
F1 Score: 0.36496350364963503


In [None]:
y_pred_oos3 = RF_final.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score:", f1_score(y_oos3, y_pred_oos3))


[[341   0]
 [150   8]]
F1 Score: 0.0963855421686747


WTH!



---

Maybe another model was right. Maybe the RF without re-training on all data


---



In [None]:
y_pred_oos1 = RF.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score: OOS1", f1_score(y_oos1, y_pred_oos1))
y_pred_oos2 = RF.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score: OOS2", f1_score(y_oos2, y_pred_oos2))
y_pred_oos3 = RF.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score: OOS3", f1_score(y_oos3, y_pred_oos3))

[[235  15]
 [189  61]]
F1 Score: OOS1 0.37423312883435583
[[440  60]
 [375 125]]
F1 Score: OOS2 0.36496350364963503
[[341   0]
 [150   8]]
F1 Score: OOS3 0.0963855421686747


In [None]:
y_pred_oos1 = CART.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score: OOS1", f1_score(y_oos1, y_pred_oos1))
y_pred_oos2 = CART.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score: OOS2", f1_score(y_oos2, y_pred_oos2))
y_pred_oos3 = CART.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score: OOS3", f1_score(y_oos3, y_pred_oos3))

[[235  15]
 [189  61]]
F1 Score: OOS1 0.37423312883435583
[[440  60]
 [375 125]]
F1 Score: OOS2 0.36496350364963503
[[341   0]
 [150   8]]
F1 Score: OOS3 0.0963855421686747


In [None]:
y_pred_oos1 = knn.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score: OOS1", f1_score(y_oos1, y_pred_oos1))
y_pred_oos2 = knn.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score: OOS2", f1_score(y_oos2, y_pred_oos2))
y_pred_oos3 = knn.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score: OOS3", f1_score(y_oos3, y_pred_oos3))

[[235  15]
 [190  60]]
F1 Score: OOS1 0.36923076923076925
[[437  63]
 [377 123]]
F1 Score: OOS2 0.358600583090379
[[341   0]
 [149   9]]
F1 Score: OOS3 0.10778443113772455


In [None]:
y_pred_oos1 = ridge_log_reg.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score: OOS1", f1_score(y_oos1, y_pred_oos1))
y_pred_oos2 = ridge_log_reg.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score: OOS2", f1_score(y_oos2, y_pred_oos2))
y_pred_oos3 = ridge_log_reg.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score: OOS3", f1_score(y_oos3, y_pred_oos3))

[[250   0]
 [238  12]]
F1 Score: OOS1 0.0916030534351145
[[495   5]
 [480  20]]
F1 Score: OOS2 0.0761904761904762
[[337   4]
 [150   8]]
F1 Score: OOS3 0.09411764705882353


In [None]:
y_pred_oos1 = lasso_log_reg.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score: OOS1", f1_score(y_oos1, y_pred_oos1))
y_pred_oos2 = lasso_log_reg.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score: OOS2", f1_score(y_oos2, y_pred_oos2))
y_pred_oos3 = lasso_log_reg.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score: OOS3", f1_score(y_oos3, y_pred_oos3))

[[250   0]
 [189  61]]
F1 Score: OOS1 0.39228295819935693
[[496   4]
 [390 110]]
F1 Score: OOS2 0.3583061889250814
[[333   8]
 [150   8]]
F1 Score: OOS3 0.09195402298850575


In [None]:
y_pred_oos1 = log_reg.predict(X_oos1)
print(confusion_matrix(y_oos1, y_pred_oos1))
print("F1 Score: OOS1", f1_score(y_oos1, y_pred_oos1))
y_pred_oos2 = log_reg.predict(X_oos2)
print(confusion_matrix(y_oos2, y_pred_oos2))
print("F1 Score: OOS2", f1_score(y_oos2, y_pred_oos2))
y_pred_oos3 = log_reg.predict(X_oos3)
print(confusion_matrix(y_oos3, y_pred_oos3))
print("F1 Score: OOS3", f1_score(y_oos3, y_pred_oos3))

[[250   0]
 [189  61]]
F1 Score: OOS1 0.39228295819935693
[[496   4]
 [390 110]]
F1 Score: OOS2 0.3583061889250814
[[333   8]
 [150   8]]
F1 Score: OOS3 0.09195402298850575




---

***DO AT HOME: Now make a Neural Network on this data -- How does that perform and compare?***

---

