# Challenge

Another approach to identifying fraudulent transactions is to look for outliers in the data. Standard deviation or quartiles are often used to detect outliers. Using this starter notebook, code two Python functions:

* One that uses standard deviation to identify anomalies for any cardholder.

* Another that uses interquartile range to identify anomalies for any cardholder.


## 1. Set Up Environment for Anomaly Detection
### 1.1. Import Libraries

In [1]:
# Initial imports
import pandas as pd
import numpy as np
import random
from sqlalchemy import create_engine


### 1.2. Connect to SQL Database

In [2]:
# Create a connection to the database
#engine = create_engine("postgresql://postgres:postgres@localhost:5432/fraud_detection")
engine = create_engine(r"mssql+pyodbc://MSI\SQLEXPRESS/ftb_SQL1?driver=SQL+Server+Native+Client+11.0")
#'mssql+pyodbc://server/database'
#C:\Program Files\Microsoft SQL Server\MSSQL15.SQLEXPRESS\MSSQL
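
Backslashes and spaces in a SQL Server instance or driver name are easy to get wrong inside a URL. One alternative accepted by SQLAlchemy's pyodbc dialect is to pass a complete ODBC connection string, URL-encoded, through the `odbc_connect` query parameter. The sketch below only builds that URL with the standard library. `Trusted_Connection=yes` (Windows authentication) is an assumption, since the URL above carries no credentials, and the names `odbc_str` and `engine_url` are illustrative.

```python
from urllib.parse import quote_plus

# ODBC connection string assembled from the values in the URL above.
# Trusted_Connection=yes is an assumption (Windows authentication).
odbc_str = (
    "DRIVER={SQL Server Native Client 11.0};"
    r"SERVER=MSI\SQLEXPRESS;"
    "DATABASE=ftb_SQL1;"
    "Trusted_Connection=yes;"
)

# URL-encode the whole string so the braces, spaces and backslash survive.
engine_url = "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)
print(engine_url)
```

The resulting string would then be passed to `create_engine(engine_url)` in place of the hand-written URL.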

### 1.3. Select Data from SQL Database

In [3]:
# Define a query that selects all rows from the credit card transactions view
query = "SELECT * FROM [ftb_SQL1].[dbo].[v__CreditCard_Transactions_FullDataset];"

# Load the query results into a DataFrame using the pandas read_sql() function
transactions_df = pd.read_sql(query, engine)

# Show the data of the new DataFrame
display(transactions_df.head())
display(transactions_df.describe())

| | Transaction_ID | Transaction_DateTime | Transaction_Date | Transaction_Month | Transaction_Time | Transaction_Time_00_07 | Transaction_Time_07_09 | Transaction_Time_09_24 | Transaction_Hour | Transaction_Amount | ... | Transaction_Amount_EvenDollars | Transaction_Amount_Over_1000 | Transaction_Credit_Card_ID | Credit_Card_Card | Credit_Card_CarddHolder_ID | Cardholder_Name | Transaction_Merchant_ID | Merchant_Name | Merchant_Merchant_Category_ID | Merchant_Category_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2018-04-30 18:50:48 | 2018-04-30 | 4 | 18:50:48 | 0 | 0 | 1 | 18 | 5.62 | ... | 0 | 0 | 9 | 3517111172421930 | 1 | Robert Johnson | 42 | Kennedy-Chen | 42 | bar |
| 1 | 2 | 2018-06-24 22:54:41 | 2018-06-24 | 6 | 22:54:41 | 0 | 0 | 1 | 22 | 4.96 | ... | 0 | 0 | 39 | 4866761290278198714 | 2 | Shane Shaffer | 61 | Richardson, Smith and Jordan | 61 | food truck |
| 2 | 3 | 2018-12-19 23:36:10 | 2018-12-19 | 12 | 23:36:10 | 0 | 0 | 1 | 23 | 6.51 | ... | 0 | 0 | 33 | 4711773125020499 | 13 | John Martin | 112 | Greer Inc | 112 | bar |
| 3 | 4 | 2018-05-23 04:27:45 | 2018-05-23 | 5 | 04:27:45 | 1 | 0 | 0 | 4 | 6.73 | ... | 0 | 0 | 20 | 4165305432349489280 | 10 | Matthew Gutierrez | 17 | Bauer-Cole | 17 | bar |
| 4 | 5 | 2018-02-27 09:20:29 | 2018-02-27 | 2 | 09:20:29 | 0 | 0 | 1 | 9 | 6.03 | ... | 0 | 0 | 18 | 4150721559116778 | 23 | Mark Lewis | 18 | Romero-Jordan | 18 | food truck |


| | Transaction_ID | Transaction_Month | Transaction_Time_00_07 | Transaction_Time_07_09 | Transaction_Time_09_24 | Transaction_Hour | Transaction_Amount | Transaction_Amount_Under_2 | Transaction_Amount_EvenDollars | Transaction_Amount_Over_1000 | Transaction_Credit_Card_ID | Credit_Card_CarddHolder_ID | Transaction_Merchant_ID | Merchant_Merchant_Category_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 | 3500.0 |
| mean | 1750.5 | 6.441429 | 0.293429 | 0.078571 | 0.628 | 11.565714 | 40.789129 | 0.1 | 0.040571 | 0.020286 | 27.284571 | 13.371714 | 75.370857 | 75.370857 |
| std | 1010.507298 | 3.450917 | 0.455398 | 0.269107 | 0.483407 | 6.936899 | 202.042922 | 0.300043 | 0.197323 | 0.140996 | 15.23112 | 6.882208 | 43.155086 | 43.155086 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.51 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 875.75 | 3.0 | 0.0 | 0.0 | 0.0 | 6.0 | 3.735 | 0.0 | 0.0 | 0.0 | 14.0 | 8.0 | 37.0 | 37.0 |
| 50% | 1750.5 | 6.0 | 0.0 | 0.0 | 1.0 | 12.0 | 10.27 | 0.0 | 0.0 | 0.0 | 28.0 | 13.0 | 76.0 | 76.0 |
| 75% | 2625.25 | 9.0 | 1.0 | 0.0 | 1.0 | 18.0 | 14.6475 | 0.0 | 0.0 | 0.0 | 41.0 | 19.0 | 112.0 | 112.0 |
| max | 3500.0 | 12.0 | 1.0 | 1.0 | 1.0 | 23.0 | 2249.0 | 1.0 | 1.0 | 1.0 | 53.0 | 25.0 | 150.0 | 150.0 |


### 1.4. Create Data Frames for Anomaly Detection in Python

In [5]:
# 1. Convert SQL data to data frame for outlier functions
test_df = transactions_df[['Transaction_ID','Credit_Card_Card','Transaction_Amount']].copy()
test_df.columns = ['id','card','amount']

# 2. Generate a list of unique credit cards
card_list = test_df['card'].unique()
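
Both outlier rules in the following sections are computed per card, so it can help to check first how many transactions each card has and how spread out its amounts are. A minimal sketch on a made-up frame with the same `id`, `card`, `amount` columns (the card labels and values are invented for illustration, not taken from the database):

```python
import pandas as pd

# Toy stand-in for test_df; the card labels and amounts are invented.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "card": ["A", "A", "A", "A", "B", "B"],
    "amount": [10.0, 12.0, 11.0, 500.0, 3.0, 4.0],
})

# One row per card: transaction count, mean, and sample standard deviation.
card_stats = toy_df.groupby("card")["amount"].agg(["count", "mean", "std"])
print(card_stats)
```

A card with only a handful of transactions gives a noisy standard deviation and noisy quartiles, so any rows flagged for it deserve extra scrutiny.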

## 2. Identifying Outliers Using Standard Deviation
### 2.1. Define Outlier Function that uses Standard Deviation

In [6]:
# Locate outliers using the standard deviation.
# Rule used here: a value is an outlier when it lies more than
# 2.5 standard deviations above or below the mean.
# Input:  transactions DataFrame with 3 columns named id, card, amount
# Return: rows of that DataFrame where |amount - mean(amount)| > 2.5 x std(amount)

def outliers_std(df):
    std_dev = df['amount'].std()                  # Sample standard deviation (pandas default, ddof=1)
    mean = df['amount'].mean()
    multiplier = 2.5                              # Determines number of Std Devs from mean
    lower_bound = mean - (multiplier * std_dev)
    upper_bound = mean + (multiplier * std_dev)
    out_df = df[(df['amount'] < lower_bound) | (df['amount'] > upper_bound)]

    return out_df
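
To see the rule at work without a database connection, the sketch below applies the same logic to a small invented data set: eleven small purchases and one very large one on a single card. The `multiplier` argument is added here for convenience and is not part of the function above.

```python
import pandas as pd

def outliers_std(df, multiplier=2.5):
    """Return rows whose amount lies more than `multiplier` sample std devs from the mean."""
    mean = df["amount"].mean()
    std_dev = df["amount"].std()
    lower = mean - multiplier * std_dev
    upper = mean + multiplier * std_dev
    return df[(df["amount"] < lower) | (df["amount"] > upper)]

# Invented data: eleven small purchases and one very large one.
amounts = [5.0, 6.0, 7.0, 5.5, 6.5, 4.5, 7.5, 6.0, 5.0, 6.5, 7.0, 900.0]
toy_df = pd.DataFrame({
    "id": list(range(1, len(amounts) + 1)),
    "card": ["A"] * len(amounts),
    "amount": amounts,
})

flagged = outliers_std(toy_df)
print(flagged)
```

Only the 900.0 transaction is flagged. Note that with the sample standard deviation, no value among n observations can lie more than (n - 1) / sqrt(n) standard deviations from the mean, so a card with 8 or fewer transactions can never produce an outlier at the 2.5 threshold.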

### 2.2. Test Std Dev Outlier Function on 3 Randomly Chosen Credit Cards

In [8]:
# Find anomalous transactions for 3 randomly chosen credit cards (the same card may be picked twice)
for n in range(3):
    card_selected = random.choice(card_list)
    card_df = test_df[test_df['card'] == card_selected]
    print(f'{n + 1}. Credit Card Number: {card_selected}')
    print('=========================================')
    outliers = outliers_std(card_df)
    if len(outliers) > 0:
        display(outliers)
    else:
        print(f'No Outliers present for credit card: {card_selected}')
    print('\n')

1. Credit Card Number: 30063281385429


| | id | card | amount |
|---|---|---|---|
| 2999 | 3000 | 30063281385429 | 21.61 |




2. Credit Card Number: 4319653513507


| | id | card | amount |
|---|---|---|---|
| 2581 | 2582 | 4319653513507 | 1813.0 |
| 2839 | 2840 | 4319653513507 | 1334.0 |




3. Credit Card Number: 30078299053512


| | id | card | amount |
|---|---|---|---|
| 1004 | 1005 | 30078299053512 | 1119.0 |
| 1333 | 1334 | 30078299053512 | 1159.0 |
| 1348 | 1349 | 30078299053512 | 1160.0 |
| 1548 | 1549 | 30078299053512 | 1053.0 |
| 1628 | 1629 | 30078299053512 | 1054.0 |
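
The loop above checks one card at a time. As a possible alternative, the same standard deviation rule can be evaluated for every card in one pass with `groupby(...).transform(...)`, which returns each card's mean and standard deviation aligned to the original rows. This is a sketch on invented data, not the notebook's method; `toy_df` and the other variable names are illustrative.

```python
import pandas as pd

# Invented data: card "A" has one extreme amount, card "B" has none.
toy_df = pd.DataFrame({
    "id": list(range(1, 25)),
    "card": ["A"] * 12 + ["B"] * 12,
    "amount": [5.0, 6.0, 7.0, 5.5, 6.5, 4.5, 7.5, 6.0, 5.0, 6.5, 7.0, 900.0,
               20.0, 22.0, 21.0, 19.0, 23.0, 20.5, 21.5, 22.5, 19.5, 20.0, 21.0, 22.0],
})

multiplier = 2.5
by_card = toy_df.groupby("card")["amount"]

# transform() returns per-card statistics aligned to the original rows.
card_mean = by_card.transform("mean")
card_std = by_card.transform("std")

# Same rule as outliers_std, evaluated for every card at once.
is_outlier = (toy_df["amount"] - card_mean).abs() > multiplier * card_std
all_outliers = toy_df[is_outlier]
print(all_outliers)
```

Each amount is compared against the statistics of its own card only, so the large purchase on card "A" does not affect what counts as unusual for card "B".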






## 3. Identifying Outliers Using Interquartile Range
### 3.1. Define Outlier Function that uses IQR

In [9]:
# Locate outliers using the interquartile range (IQR).
# Input:  transactions DataFrame with 3 columns named id, card, amount
# Return: rows where amount < Q1 - 1.5 x IQR or amount > Q3 + 1.5 x IQR
def outliers_IQR(df):
    q1 = df['amount'].quantile(0.25)
    q3 = df['amount'].quantile(0.75)
    iqr = q3 - q1
    multiplier = 1.5                              # Determines number of IQRs from q1/q3
    lower_bound = q1 - (multiplier * iqr)
    upper_bound = q3 + (multiplier * iqr)

    out_df = df[(df['amount'] < lower_bound) | (df['amount'] > upper_bound)]

    return out_df
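
The two rules can disagree when a card has several large transactions, because those values inflate the mean and standard deviation while leaving the quartiles almost unchanged. The sketch below illustrates this on invented data (twenty small purchases and five large ones). Both functions are restated so the block runs on its own; the `multiplier` arguments and the lower-case name `outliers_iqr` are illustrative choices.

```python
import pandas as pd

def outliers_std(df, multiplier=2.5):
    """Rows more than `multiplier` sample std devs from the mean."""
    mean, std_dev = df["amount"].mean(), df["amount"].std()
    lower, upper = mean - multiplier * std_dev, mean + multiplier * std_dev
    return df[(df["amount"] < lower) | (df["amount"] > upper)]

def outliers_iqr(df, multiplier=1.5):
    """Rows below Q1 - multiplier*IQR or above Q3 + multiplier*IQR."""
    q1, q3 = df["amount"].quantile(0.25), df["amount"].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return df[(df["amount"] < lower) | (df["amount"] > upper)]

# Invented data: twenty small purchases (2..21) and five large ones.
amounts = [float(x) for x in range(2, 22)] + [1000.0, 1100.0, 1200.0, 1300.0, 1400.0]
toy_df = pd.DataFrame({
    "id": list(range(1, 26)),
    "card": ["A"] * 25,
    "amount": amounts,
})

print("IQR rule:", len(outliers_iqr(toy_df)), "rows flagged")
print("Std rule:", len(outliers_std(toy_df)), "rows flagged")
```

On this data the IQR rule flags all five large purchases (Q1 = 8, Q3 = 20, so the upper fence is 38), while the standard deviation rule flags none, because the large values push the upper bound above 1400.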

### 3.2. Test IQR Outlier Function on 3 Randomly Chosen Credit Cards

In [11]:
# Find anomalous transactions for 3 randomly chosen credit cards (the same card may be picked twice)
for n in range(3):
    card_selected = random.choice(card_list)
    card_df = test_df[test_df['card'] == card_selected]
    print(f'{n + 1}. Credit Card Number: {card_selected}')
    print('=========================================')
    outliers = outliers_IQR(card_df)
    if len(outliers) > 0:
        display(outliers)
    else:
        print(f'No Outliers present for credit card: {card_selected}')
    print('\n')

1. Credit Card Number: 30181963913340


| | id | card | amount |
|---|---|---|---|
| 248 | 249 | 30181963913340 | 691.0 |
| 635 | 636 | 30181963913340 | 57.0 |
| 689 | 690 | 30181963913340 | 267.0 |
| 763 | 764 | 30181963913340 | 325.0 |
| 1394 | 1395 | 30181963913340 | 1179.0 |
| 1519 | 1520 | 30181963913340 | 1095.0 |
| 1619 | 1620 | 30181963913340 | 1009.0 |
| 2695 | 2696 | 30181963913340 | 1724.0 |
| 2788 | 2789 | 30181963913340 | 1534.0 |
| 3142 | 3143 | 30181963913340 | 1795.0 |




2. Credit Card Number: 180098539019105
No Outliers present for credit card: 180098539019105


3. Credit Card Number: 4741042733274
No Outliers present for credit card: 4741042733274


