# Challenge

Another approach to identifying fraudulent transactions is to look for outliers in the data. Standard deviation or quartiles are often used to detect outliers. Using this starter notebook, code two Python functions:

* One that uses standard deviation to identify anomalies for any cardholder.

* Another that uses interquartile range to identify anomalies for any cardholder.

## Identifying Outliers Using Standard Deviation

In [1]:
# Initial imports
import pandas as pd
import numpy as np
import random
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

In [2]:
# Load the PostgreSQL database server password as an environment variable
load_dotenv()

True

In [3]:
db_key = os.getenv("my_pass")
type(db_key)

str

In [4]:
# Create a connection to the database
engine = create_engine(f"postgresql://postgres:{db_key}@localhost:5432/fraud_detection")

In [5]:
# Write function that locates outliers using standard deviation
def find_anomalities_sd(card_holder_id: int = 1):
    
    # Query the database
    query = f"""
            SELECT t.date, t.amount, t.card
            FROM transaction as t 
            INNER JOIN credit_card AS cc ON cc.card = t.card
            INNER JOIN card_holder AS ch ON ch.id = cc.cardholder_id
            WHERE ch.id = {card_holder_id}  
            ORDER BY t.date
            """
    # Use pandas to create a df from query results
    df = pd.read_sql(query, engine)
    
    # Calculate the mean and standard deviation of the amount column
    amount_avg = df['amount'].mean()
    amount_std = df['amount'].std()
    
    # We will use 2 standard deviations for the purpose of our analysis
    lower = amount_avg - (amount_std * 2)
    higher = amount_avg + (amount_std * 2)
    
    # Use a list comprehension to retrieve transactions that are 2 std below/above the mean
    lower_transactions = [amount for amount in df['amount'] if amount < lower]
    higher_transactions = [amount for amount in df['amount'] if amount > higher]
    
    # Create a final list of results
    final_list = lower_transactions + higher_transactions
    
    # Print statistics
    print(f"Average amount: {round(amount_avg, 2)}")
    print(f"Standard deviation: {round(amount_std, 2)}")
    print(f"Lower cut off: {round(lower, 2)}")
    print(f"Upper cut off: {round(higher, 2)}")
    
    # If final_list is not empty
    if final_list: 
        # Modify the df to maintain only the records where amount is part of the final_list
        df = df[df['amount'].isin(final_list)]
        # return df
        return df
    else: 
        return "No signs of fraudelent transactions were found"
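
The same two-standard-deviation rule can also be expressed as a boolean mask on the `amount` column, which avoids building intermediate lists. The sketch below is a minimal illustration that runs without a database; the helper name `flag_sd_outliers`, the `n_std` parameter, and the `toy` amounts are hypothetical and not part of the notebook.

```python
import pandas as pd


def flag_sd_outliers(df: pd.DataFrame, column: str = "amount", n_std: float = 2.0) -> pd.DataFrame:
    """Return rows whose value lies more than n_std standard deviations from the mean."""
    mean = df[column].mean()
    std = df[column].std()  # pandas computes the sample standard deviation (ddof=1)
    # True where the distance from the mean exceeds the chosen number of deviations
    mask = (df[column] - mean).abs() > n_std * std
    return df[mask]


# Made-up amounts, for illustration only
toy = pd.DataFrame({"amount": [5.0, 7.5, 6.0, 8.0, 5.5, 7.0, 6.5, 9.0, 6.0, 500.0]})
flagged = flag_sd_outliers(toy)
print(flagged)
```

Returning an empty DataFrame when nothing is flagged, rather than a message string, would keep the return type consistent for callers; that is a design choice, not something the challenge requires.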

In [6]:
# Find anomalous transactions for 5 random card holders
# Create a list to hold unique id values
card_holder_id = []

# Create loop to generate random id numbers
for i in range(100):
    # Random id number between 1 and 25 (the upper bound of randint is exclusive)
    _id = np.random.randint(1, 26)
    
    # Append id number only if it doesn't exist in card_holder_id list.
    if _id not in card_holder_id: 
        card_holder_id.append(_id)
    
    # Once we have five id numbers, call find_anomalities_sd() for each one and break out of the main for loop
    if len(card_holder_id) == 5: 
        for x in card_holder_id: 
            print('*' * 60)
            print(f'LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID {x}')
            display(find_anomalities_sd(x))
            print()
        break

************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 6
Average amount: 115.31
Standard deviation: 391.29
Lower cut off: -667.26
Upper cut off: 897.89


index,date,amount,card
4,2018-01-08 02:34:32,1029.0,3581345943543942
23,2018-02-27 15:27:32,1145.0,3581345943543942
40,2018-04-21 19:41:51,2108.0,3581345943543942
67,2018-07-03 14:56:36,1398.0,3581345943543942
79,2018-07-24 22:42:00,1108.0,3581345943543942
81,2018-08-05 01:06:38,1379.0,3581345943543942
90,2018-09-02 06:17:00,2001.0,3581345943543942
92,2018-09-11 15:16:47,1856.0,3581345943543942
122,2018-11-27 17:20:29,1279.0,3581345943543942



************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 13
Average amount: 10.22
Standard deviation: 5.91
Lower cut off: -1.6
Upper cut off: 22.03


index,date,amount,card
179,2018-11-08 02:10:03,22.78,5135837688671496



************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 23
Average amount: 9.14
Standard deviation: 5.67
Lower cut off: -2.2
Upper cut off: 20.49


index,date,amount,card
92,2018-06-21 22:11:26,20.65,4150721559116778



************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 5
Average amount: 8.86
Standard deviation: 5.89
Lower cut off: -2.92
Upper cut off: 20.63


'No signs of fraudelent transactions were found'


************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 16
Average amount: 71.59
Standard deviation: 288.77
Lower cut off: -505.95
Upper cut off: 649.13


index,date,amount,card
10,2018-01-22 08:07:03,1131.0,5570600642865857
28,2018-02-17 01:27:19,1430.0,5570600642865857
43,2018-03-05 08:26:08,1617.0,5570600642865857
89,2018-05-29 02:55:08,1203.0,5570600642865857
99,2018-06-17 15:59:45,1103.0,5570600642865857
126,2018-07-26 23:02:51,1803.0,5570600642865857
195,2018-11-13 17:07:25,1911.0,5570600642865857
207,2018-12-03 02:38:52,1014.0,5570600642865857
216,2018-12-24 15:55:06,1634.0,5570600642865857
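
The loop in cell `In [6]` that collects unique random ids can be sketched more compactly with the standard library's `random.sample`, which draws without replacement. This assumes cardholder ids run from 1 to 25 inclusive, as the loop's comment suggests; the variable name `ids` is illustrative.

```python
import random

# Assumption: cardholder ids are the integers 1..25 (range's upper bound is exclusive)
ids = random.sample(range(1, 26), k=5)

# Five distinct ids, in random order
print(ids)
```

Each id could then be passed to `find_anomalities_sd()` in a plain `for` loop, with no retry counter or `break` needed.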





## Identifying Outliers Using Interquartile Range
[An interquartile range is a measure of where the bulk of the values lie](https://www.statisticshowto.com/probability-and-statistics/interquartile-range/). It is the difference between the third and first quartiles:

$IQR = Q_3 - Q_1$

Here we assume that any transaction more than $1.5 \times IQR$ below $Q_1$ or above $Q_3$ is a fraudulent transaction.
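
As a small worked example of this rule, the sketch below computes the quartiles, the IQR, and the two cut-offs for a handful of made-up amounts. The values and variable names are illustrative only; `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

# Made-up amounts, for illustration only
amounts = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# First and third quartiles
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Cut-offs: 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values outside the cut-offs
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")             # Q1=3.0, Q3=7.0, IQR=4.0
print(f"Cut-offs: {lower_fence} to {upper_fence}")  # Cut-offs: -3.0 to 13.0
print(outliers)                                    # [100.]
```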

In [7]:
# Write a function that locates outliers using the interquartile range (IQR)
def find_anomalities_iqr(card_holder_id: int = 1):
    
    # Query the database
    query = f"""
            SELECT t.date, t.amount, t.card
            FROM transaction as t 
            INNER JOIN credit_card AS cc ON cc.card = t.card
            INNER JOIN card_holder AS ch ON ch.id = cc.cardholder_id
            WHERE ch.id = {card_holder_id}  
            ORDER BY t.date
            """
    # Use pandas to create a df from query results
    df = pd.read_sql(query, engine)
    
    # Compute the first and third quartiles using np.percentile()
    Q_1 = np.percentile(df['amount'], 25)
    Q_3 = np.percentile(df['amount'], 75)
    
    iqr = Q_3 - Q_1
    
    # Outliers lie more than 1.5 * IQR below Q_1 or above Q_3
    outliers = iqr * 1.5
    
    # Use list comprehensions to retrieve transactions beyond those cut-offs
    lower_transactions = [amount for amount in df['amount'] if amount < (Q_1 - outliers)]
    higher_transactions = [amount for amount in df['amount'] if amount > (Q_3 + outliers)]
    
    # Create a final list of results
    final_list = lower_transactions + higher_transactions
    
    # Print statistics
    print(f"25th Percentile: {Q_1}")
    print(f"75th Percentile: {Q_3}")
    print(f"IQR: {iqr}")
    print(f"Lower cut off: {round(Q_1 - outliers, 2)}")
    print(f"Upper cut off: {round(Q_3 + outliers, 2)}")
                            
    
    # If final_list is not empty
    if final_list: 
        # Modify the df to maintain only the records where amount is part of the final_list
        df = df[df['amount'].isin(final_list)]
        # return df
        return df
    else: 
        return "No signs of fraudelent transactions were found"
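
Both functions build their SQL by interpolating `card_holder_id` into an f-string. A hedged alternative is to pass the id as a bound parameter through the `params` argument of `pd.read_sql`. The sketch below demonstrates the idea against an in-memory SQLite database standing in for the notebook's PostgreSQL server; the table contents are made up, and the `?` placeholder style is specific to the `sqlite3` driver.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the PostgreSQL database (illustration only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE card_holder (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE credit_card (card TEXT PRIMARY KEY, cardholder_id INTEGER);
    CREATE TABLE "transaction" (id INTEGER PRIMARY KEY, date TEXT, amount REAL, card TEXT);
    INSERT INTO card_holder VALUES (1, 'A'), (2, 'B');
    INSERT INTO credit_card VALUES ('1111', 1), ('2222', 2);
    INSERT INTO "transaction" VALUES
        (1, '2018-01-01 10:00:00', 12.5, '1111'),
        (2, '2018-01-02 11:00:00', 3.2, '2222'),
        (3, '2018-01-03 12:00:00', 950.0, '1111');
""")

# The id is supplied separately from the SQL text instead of being formatted into it
query = """
    SELECT t.date, t.amount, t.card
    FROM "transaction" AS t
    INNER JOIN credit_card AS cc ON cc.card = t.card
    INNER JOIN card_holder AS ch ON ch.id = cc.cardholder_id
    WHERE ch.id = ?
    ORDER BY t.date
"""
df = pd.read_sql(query, conn, params=(1,))
print(df)
```

With the SQLAlchemy engine used in this notebook the placeholder syntax would differ. One common pattern is to wrap the query in `sqlalchemy.text()` with a named parameter such as `:holder_id` and pass `params={"holder_id": card_holder_id}`, though this should be checked against the installed pandas and SQLAlchemy versions.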

In [9]:
# Find anomalous transactions for the 5 random card holders
# We will use the same card_holder_id list generated above for comparison
for x in card_holder_id: 
    print('*' * 60)
    print(f'LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID {x}')
    display(find_anomalities_iqr(x))
    print()
      

************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 6
25th Percentile: 4.137499999999999
75th Percentile: 15.510000000000002
IQR: 11.372500000000002
Lower cut off: -12.92
Upper cut off: 32.57


index,date,amount,card
4,2018-01-08 02:34:32,1029.0,3581345943543942
23,2018-02-27 15:27:32,1145.0,3581345943543942
26,2018-03-09 04:51:38,389.0,3581345943543942
40,2018-04-21 19:41:51,2108.0,3581345943543942
67,2018-07-03 14:56:36,1398.0,3581345943543942
79,2018-07-24 22:42:00,1108.0,3581345943543942
81,2018-08-05 01:06:38,1379.0,3581345943543942
90,2018-09-02 06:17:00,2001.0,3581345943543942
92,2018-09-11 15:16:47,1856.0,3581345943543942
122,2018-11-27 17:20:29,1279.0,3581345943543942



************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 13
25th Percentile: 4.51
75th Percentile: 15.31
IQR: 10.8
Lower cut off: -11.69
Upper cut off: 31.51


'No signs of fraudelent transactions were found'


************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 23
25th Percentile: 3.86
75th Percentile: 13.030000000000001
IQR: 9.170000000000002
Lower cut off: -9.9
Upper cut off: 26.79


'No signs of fraudelent transactions were found'


************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 5
25th Percentile: 3.4625
75th Percentile: 14.232500000000002
IQR: 10.770000000000001
Lower cut off: -12.69
Upper cut off: 30.39


'No signs of fraudelent transactions were found'


************************************************************
LOOKING FOR FRAUDELENT TRANSACTIONS FOR CARD HOLDER ID 16
25th Percentile: 4.47
75th Percentile: 16.25
IQR: 11.780000000000001
Lower cut off: -13.2
Upper cut off: 33.92


index,date,amount,card
7,2018-01-11 13:20:31,229.0,5570600642865857
10,2018-01-22 08:07:03,1131.0,5570600642865857
28,2018-02-17 01:27:19,1430.0,5570600642865857
43,2018-03-05 08:26:08,1617.0,5570600642865857
89,2018-05-29 02:55:08,1203.0,5570600642865857
99,2018-06-17 15:59:45,1103.0,5570600642865857
109,2018-07-04 17:28:06,89.0,5570600642865857
126,2018-07-26 23:02:51,1803.0,5570600642865857
172,2018-10-19 12:32:37,178.0,5570600642865857
175,2018-10-23 22:47:13,393.0,5570600642865857
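
For cardholders 6 and 16, the IQR rule flags amounts (for example 389.0 and 89.0) that the two-standard-deviation rule leaves alone. One plausible explanation is that a few very large amounts inflate the standard deviation and push its cut-offs outward, whereas quartiles are much less affected by extreme values. The sketch below illustrates this effect on made-up amounts; it is not a reproduction of the database results.

```python
import numpy as np
import pandas as pd

# Made-up amounts: mostly small purchases plus a few large ones
amounts = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 90.0, 1200.0, 1500.0])

# Two-standard-deviation rule: the large amounts widen the cut-offs themselves
mean, std = amounts.mean(), amounts.std()
sd_flagged = amounts[(amounts - mean).abs() > 2 * std]

# 1.5 * IQR rule: the quartiles barely move when extreme values are present
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
iqr_flagged = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]

print("SD rule: ", sd_flagged.tolist())   # [1500.0]
print("IQR rule:", iqr_flagged.tolist())  # [1200.0, 1500.0]
```

Neither rule is strictly more sensitive in every case: for cardholder 13 above, the standard deviation rule flags 22.78 while the IQR rule flags nothing.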



