# Cleaning: K-Nearest Neighbors & Credit Card Fraud

## **1.** Imports & Settings

In [None]:
import pandas as pd
import numpy as np

# scalers used in section 4.2.
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer


pd.set_option('display.max_rows', 200)

---

## **2.** Load Data
Here we load the data from Google Drive, but the data can also be accessed from a relative path for use with Jupyter Notebooks/Lab.


### **2.1.** Load Data w/ Jupyter Notebooks/Lab
```
# run this code if using Jupyter Notebooks/Lab
df = pd.read_csv('./data/creditcard.csv')
```

### **2.2.** Access & Load Data w/ Google Drive

In [None]:
# run this cell if using Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/KNN_creditCardFraud/data/creditcard.csv'
df = pd.read_csv(path)
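
If the same notebook needs to run in both environments, the path could be chosen at runtime instead of by running one cell or the other. The following is a minimal sketch, not part of the original workflow: `in_colab` is a hypothetical helper, and the two paths are simply the ones used in sections 2.1 and 2.2.

```python
import importlib.util


def in_colab():
    """Return True when the google.colab package can be found (hypothetical helper)."""
    try:
        return importlib.util.find_spec('google.colab') is not None
    except ModuleNotFoundError:
        # raised when the parent 'google' package is not installed at all
        return False


# paths copied from sections 2.1 and 2.2
LOCAL_PATH = './data/creditcard.csv'
COLAB_PATH = '/content/drive/MyDrive/Colab Notebooks/KNN_creditCardFraud/data/creditcard.csv'

data_path = COLAB_PATH if in_colab() else LOCAL_PATH
print(f'Reading data from: {data_path}')
```

In Colab, the drive would still need to be mounted before calling `pd.read_csv(data_path)`.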

---

## **3.** Detail & View Data
Below we look into details that inform the cleaning, scaling, and subsetting that we may need to perform. This includes looking for unusual or null values and checking the distribution of our classifications.

### **3.1.** Head & Info
The Kaggle description of our dataset indicates that "Time" and "Amount" are the only two feature columns that did not undergo PCA transformation. Data is typically normalized as part of PCA. A quick look at `df.head()` and `df.info()` confirms that "Time" and "Amount" are on a different scale from our 28 V-n columns. It also reveals:
*   Our dataset contains 28 V-n features, plus 'Time', 'Amount', and 'Class'
*   There are 0 null values
*   There are two classes
    * 0 represents "Not Fraud"
    * 1 represents "Fraud"
* All features are floats and Class is int
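
The checks listed above can also be made programmatically rather than by reading the output. The sketch below runs them on a small made-up frame (`toy_df` and its values are hypothetical stand-ins, not rows from the real dataset); the same calls should apply to `df`.

```python
import pandas as pd

# hypothetical toy frame standing in for the real dataset (values are made up)
toy_df = pd.DataFrame({
    'Time': [0.0, 1.0, 2.0],
    'V1': [-1.36, 1.19, -1.36],
    'Amount': [149.62, 2.69, 378.66],
    'Class': [0, 0, 1],
})

# total number of missing values across all columns
total_nulls = int(toy_df.isnull().sum().sum())

# distinct class labels present
class_labels = sorted(toy_df['Class'].unique().tolist())

# dtype checks: every feature should be a float, Class should be an integer
feature_dtypes = toy_df.drop(columns='Class').dtypes
all_float = all(pd.api.types.is_float_dtype(dtype) for dtype in feature_dtypes)
class_is_int = pd.api.types.is_integer_dtype(toy_df['Class'])

print(f'Nulls: {total_nulls}, classes: {class_labels}')
print(f'All features float: {all_float}, Class is int: {class_is_int}')
```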

In [None]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Viewing `df.describe()`, we find that 'Time' and 'Amount' are on different scales from our V-n features. KNN classifies by distance, so features with larger ranges would dominate the distance calculation. Because 'Time' and 'Amount' may still be valuable, we plan to scale them alongside our other features.

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Time,284807.0,94813.86,47488.145955,0.0,54201.5,84692.0,139320.5,172792.0
V1,284807.0,3.91956e-15,1.958696,-56.40751,-0.920373,0.018109,1.315642,2.45493
V2,284807.0,5.688174e-16,1.651309,-72.715728,-0.59855,0.065486,0.803724,22.057729
V3,284807.0,-8.769071e-15,1.516255,-48.325589,-0.890365,0.179846,1.027196,9.382558
V4,284807.0,2.782312e-15,1.415869,-5.683171,-0.84864,-0.019847,0.743341,16.875344
V5,284807.0,-1.552563e-15,1.380247,-113.743307,-0.691597,-0.054336,0.611926,34.801666
V6,284807.0,2.010663e-15,1.332271,-26.160506,-0.768296,-0.274187,0.398565,73.301626
V7,284807.0,-1.694249e-15,1.237094,-43.557242,-0.554076,0.040103,0.570436,120.589494
V8,284807.0,-1.927028e-16,1.194353,-73.216718,-0.20863,0.022358,0.327346,20.007208
V9,284807.0,-3.137024e-15,1.098632,-13.434066,-0.643098,-0.051429,0.597139,15.594995
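
The scale gap in the summary above matters for KNN because a Euclidean distance is dominated by whichever feature has the largest range. The sketch below illustrates this with three made-up points (the values and the two-column layout are assumptions for illustration only) and a hand-written min-max rescaling.

```python
import numpy as np

# hypothetical transactions as [V1, Amount]; values are made up
points = np.array([
    [0.1, 10.0],    # a: reference point
    [2.5, 12.0],    # b: very different V1, similar Amount
    [0.2, 150.0],   # c: similar V1, very different Amount
])


def distances_from_first(rows):
    """Euclidean distance from the first row to each remaining row."""
    return np.linalg.norm(rows[1:] - rows[0], axis=1)


raw_ab, raw_ac = distances_from_first(points)

# min-max rescale each column to [0, 1]
col_min = points.min(axis=0)
col_range = points.max(axis=0) - col_min
scaled_points = (points - col_min) / col_range
scaled_ab, scaled_ac = distances_from_first(scaled_points)

print(f'raw:    a-b = {raw_ab:.3f}, a-c = {raw_ac:.3f}')
print(f'scaled: a-b = {scaled_ab:.3f}, a-c = {scaled_ac:.3f}')
```

On the raw values, the a-c distance is driven almost entirely by the Amount column; after rescaling, both columns contribute on a comparable footing.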


### **3.2.** Classification Distribution
We can plot a histogram to see the distribution of our two classifications (fraud & not-fraud), but because we are working with a binary classification, printing the percentages works just as well:

In [None]:
# quantify the balance of classifications
no_fraud_percent = round((df['Class'].value_counts()[0] / len(df)) * 100, 2)
fraud_percent = round((df['Class'].value_counts()[1] / len(df)) * 100, 2)

print(f'No Fraud: {no_fraud_percent} percent of dataset')
print(f'Fraud: {fraud_percent} percent of dataset')

No Fraud: 99.83 percent of dataset
Fraud: 0.17 percent of dataset


We find our classifications to be heavily skewed toward legitimate transactions, so we also plan to balance our data so that it contains equal numbers of fraud and not-fraud observations.
* 99.83% of the data is classified as "No Fraud"
* 0.17% of the data is classified as "Fraud"
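
The same percentages can be computed in one step with `value_counts(normalize=True)`. The sketch below uses a made-up label series (`labels` is hypothetical, built to mirror the imbalance reported above) rather than the real `df['Class']`.

```python
import pandas as pd

# hypothetical labels built to mirror the reported imbalance (not the real data)
labels = pd.Series([0] * 9983 + [1] * 17, name='Class')

# normalize=True returns proportions instead of raw counts
class_percent = (labels.value_counts(normalize=True) * 100).round(2)

print(f'No Fraud: {class_percent[0]} percent of dataset')
print(f'Fraud: {class_percent[1]} percent of dataset')
```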

## **4.** Adjust Dataset
We will adjust our dataset according to the findings above, particularly with respect to the unbalanced classifications and the scaling of our features.

### **4.1.** Create Subset
We create a subset of our original dataset, balanced according to "Class", so that we have an even number of fraud and not-fraud samples. Using our original DataFrame, with such an imbalance in favor of not-fraud observations, would potentially bias our model.
<br />
<br />
Balancing our data will also allow us to create an informative correlation matrix during our EDA. If we were to use the original DataFrame, the correlation matrix would be similarly influenced by the large imbalance between classes.

In [None]:
# save n_fraud to pass into our subset process below
n_fraud = df['Class'].value_counts()[1]

n_fraud

492

In [None]:
# set random_state for consistency in subsequent experiments
df_sample = df.sample(frac=1, random_state=4)

# sample an equal number of fraud and non-fraud observations
fraud_df = df_sample.loc[df_sample['Class'] == 1]
non_fraud_df = df_sample.loc[df_sample['Class'] == 0][:n_fraud]

distributed_df = pd.concat([fraud_df, non_fraud_df])

# shuffle the rows so the two classes are interleaved
sample_df = distributed_df.sample(frac=1, random_state=89)

# check distribution
no_fraud_percent = round((sample_df['Class'].value_counts()[0] / 
                          len(sample_df)) * 100, 2)
fraud_percent = round((sample_df['Class'].value_counts()[1] / 
                       len(sample_df)) * 100, 2)

print(f'No Fraud: {no_fraud_percent} percent of dataset')
print(f'Fraud: {fraud_percent} percent of dataset')
print(f'Length of sample_df: {len(sample_df)}')

No Fraud: 50.0 percent of dataset
Fraud: 50.0 percent of dataset
Length of sample_df: 984
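
A more compact way to draw a balanced subset is `DataFrameGroupBy.sample`, available in pandas 1.1 and later. This is an alternative sketch, not the method used above: it is shown on a made-up frame (`toy_df` is hypothetical), and it would not select the same rows as the cell above even with the same seeds.

```python
import pandas as pd

# hypothetical imbalanced frame: 4 minority rows, 16 majority rows
toy_df = pd.DataFrame({
    'Amount': [float(i) for i in range(20)],
    'Class': [1] * 4 + [0] * 16,
})

# size of the smaller class
n_minority = int(toy_df['Class'].value_counts().min())

balanced_df = (
    toy_df.groupby('Class')
          .sample(n=n_minority, random_state=4)   # draw n rows from each class
          .sample(frac=1, random_state=89)        # shuffle the combined rows
)

print(balanced_df['Class'].value_counts())
```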


### **4.2.** Scale Subset
Below, we will generate three different CSVs, each scaled by a different method. We will test each to see how different scaling methods influence the final results of our model.

In [None]:
# set scalers as a dict so that we can loop through them
scale_dict = {'minmax': MinMaxScaler(), 
              'robust': RobustScaler(), 
              'power': PowerTransformer()}

In [None]:
for name, scaler in scale_dict.items():
  # work on a copy so each scaler starts from the unscaled subset
  sub_scale_df = sample_df.copy()

  # fit and transform the first 30 columns (Time, V1-V28, Amount);
  # the final column, Class, is left unscaled
  sub_scale_df.iloc[:, :30] = scaler.fit_transform(sub_scale_df.iloc[:, :30])

  # export to csv
  sub_scale_df.to_csv(f'/content/drive/MyDrive/Colab Notebooks/KNN_creditCardFraud/data/{name}_sub.csv', 
                      index=False)
  
  print(f'\n{name}')
  display(sub_scale_df.head())



minmax


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
73857,0.322234,0.742574,0.560633,0.702336,0.554754,0.400733,0.661119,0.749845,0.715224,0.562833,0.5124,0.58827,0.439534,0.260067,0.37468,0.645734,0.367267,0.386827,0.386112,0.709957,0.383318,0.477174,0.522509,0.802729,0.466452,0.767464,0.355738,0.764213,0.233939,0.00039,1
143336,0.49694,0.725715,0.612967,0.618477,0.553058,0.355358,0.648647,0.762148,0.730909,0.501037,0.499468,0.610262,0.464827,0.555813,0.320038,0.557125,0.507616,0.414508,0.445114,0.513937,0.406442,0.475033,0.489068,0.784025,0.535805,0.740463,0.614076,0.829221,0.172504,0.118972,1
64460,0.29801,0.588966,0.698188,0.49566,0.500775,0.285489,0.590979,0.701421,0.798485,0.635521,0.580699,0.343657,0.599277,0.450025,0.533803,0.725231,0.595417,0.578712,0.52721,0.46592,0.445848,0.4748,0.457949,0.767575,0.630225,0.776265,0.304783,0.776577,0.239811,0.047035,1
81609,0.3438,0.85925,0.457836,0.798986,0.426463,0.404329,0.726348,0.86437,0.677748,0.723402,0.629763,0.384477,0.741939,0.525811,0.573378,0.474577,0.673314,0.648931,0.649272,0.480069,0.57345,0.480481,0.474875,0.719071,0.619864,0.717867,0.488502,0.703854,0.275796,0.653643,1
274002,0.966264,0.835847,0.581881,0.902661,0.186515,0.436507,0.757062,0.85314,0.675364,0.774546,0.724205,0.041204,0.905644,0.582243,0.837023,0.535736,0.901033,0.771746,0.743367,0.192608,0.336833,0.493353,0.499471,0.789366,0.796265,0.684157,0.132692,0.590431,0.204399,0.021168,0



robust


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
73857,-0.312063,-1.375053,0.153832,-1.008139,0.892221,-1.123255,-0.675448,-1.742241,2.248049,-1.194057,-1.573885,1.279547,-1.455012,-1.009173,-1.340414,0.259058,-1.845711,-2.209413,-1.871189,1.748964,-0.503324,1.139808,0.085615,1.406898,-0.738175,0.722492,0.00531,1.732434,0.441731,-0.194876,1
143336,0.029278,-1.517784,0.983791,-1.550091,0.886248,-2.063663,-0.859125,-1.563764,3.121222,-1.636918,-1.664303,1.362263,-1.359799,0.190388,-1.528123,-0.273813,-1.219845,-2.050865,-1.501257,0.465752,0.160721,1.007213,-0.436274,0.375674,-0.4456,0.469275,1.686984,3.064232,-1.535918,2.367795,1
64460,-0.359391,-2.675517,2.335314,-2.343816,0.702163,-3.511684,-1.708406,-2.444673,6.883374,-0.673136,-1.096379,0.359526,-0.853666,-0.238693,-0.793784,0.737131,-0.828311,-1.110362,-0.986522,0.151413,1.292376,0.992746,-0.921915,-0.531286,-0.047278,0.805032,-0.326391,1.985714,0.630739,0.813154,1
81609,-0.269927,-0.387259,-1.476427,-0.383523,0.440516,-1.048727,0.285197,-0.080938,0.161638,-0.043337,-0.753354,0.513054,-0.316622,0.068698,-0.657832,-0.770233,-0.480941,-0.708171,-0.221208,0.24404,4.956771,1.34461,-0.657774,-3.205468,-0.090987,0.257367,0.869546,0.495873,1.789126,13.922537,1
274002,0.946243,-0.585397,0.490798,0.286496,-0.404329,-0.381847,0.737525,-0.243841,0.028955,0.323182,-0.093074,-0.778045,0.299639,0.297589,0.247858,-0.402441,0.534537,-0.004724,0.36875,-1.637781,-1.838262,2.141879,-0.273925,0.670114,0.653182,-0.058772,-1.446633,-1.827798,-0.509193,0.254143,0



power


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
73857,-0.610685,-1.032163,-0.016067,-0.862476,0.965201,-0.459116,-0.586934,-0.990068,0.48819,-1.049057,-1.198817,1.35025,-1.275898,-1.421247,-1.308585,0.300484,-1.443354,-1.427022,-1.297661,1.694654,-0.39701,0.283628,0.084758,0.576353,-0.949153,0.85149,-0.053853,0.991505,0.278884,-1.278094,1
143336,0.032848,-1.094698,0.601466,-1.16551,0.959313,-0.896564,-0.756524,-0.915347,0.744231,-1.378263,-1.257474,1.418386,-1.219733,0.27472,-1.445058,-0.383359,-1.07362,-1.343443,-1.088664,0.477164,0.010963,0.24559,-0.410305,0.099705,-0.598291,0.540819,1.946907,2.094792,-0.918079,1.245622,1
64460,-0.707474,-1.511311,1.523454,-1.5232,0.772078,-1.50663,-1.545001,-1.249462,1.922443,-0.617264,-0.863159,0.447742,-0.871068,-0.334862,-0.844514,0.984192,-0.790811,-0.778271,-0.762248,0.15242,0.661657,0.241429,-0.851139,-0.28522,-0.068728,0.954627,-0.533606,1.189077,0.386018,0.818455,1
81609,-0.526358,-0.461862,-1.568309,-0.399818,0.482924,-0.422388,0.291949,0.021351,-0.079515,0.012233,-0.586322,0.62338,-0.338461,0.101389,-0.707227,-0.955036,-0.484487,-0.480264,-0.14904,0.249667,2.571523,0.342032,-0.613279,-1.287062,-0.130405,0.287976,1.044418,0.120303,1.020385,1.985211,1
274002,1.488746,-0.604199,0.241746,0.532422,-0.904958,-0.07362,0.698451,-0.134753,-0.112507,0.490807,0.106836,-2.104127,1.255171,0.427679,1.045245,-0.536246,1.642715,0.250617,0.580167,-2.058158,-1.273071,0.565899,-0.259134,0.231896,1.055175,-0.075636,-2.375879,-1.060275,-0.279214,0.438772,0


---

## **5.** Conclusion
 
We've taken an initial look at our data and created three new CSVs to work with when modeling:
1. A MinMax-scaled subset with balanced classifications
2. A Robust-scaled subset with balanced classifications
3. A PowerTransformer-scaled subset with balanced classifications

After adjusting the dataset to account for the imbalance in classifications, we are left with a relatively small amount of data to train a model on: 984 observations. However, the K-Nearest Neighbors model is well-suited to smaller datasets, because it stores the training data and computes distances to it at prediction time, which becomes costly as the dataset grows. During our exploration, we hope to find features strongly correlated with our classifications.
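
As a rough illustration of how the three scaling methods might later be compared, the sketch below cross-validates a KNN classifier on synthetic data. It makes several assumptions: the data comes from `make_classification` rather than the exported CSVs, `n_neighbors=5` is an arbitrary choice, and the scaler is refit inside each fold by a pipeline, which differs from scaling the whole subset up front as done in section 4.2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, RobustScaler

# synthetic stand-in for the balanced subset (not the real fraud data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scalers = {
    'minmax': MinMaxScaler(),
    'robust': RobustScaler(),
    'power': PowerTransformer(),
}

mean_accuracy = {}
for name, scaler in scalers.items():
    # the pipeline fits the scaler on each training fold only
    model = make_pipeline(scaler, KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f'{name}: mean accuracy {scores.mean():.3f}')
```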
<br />
<br />
View our exploratory data analysis in ./EDA-FS_CC-fraud.ipynb