# Defense Performance Data Creation 

The Defense Performance Data Creation script executes a series of SQL queries. The queries run every Friday at 5:00 a.m. as a scheduled Databricks job, so the data is ready for analysts to report on when they start work.

It is structured much like the Fraud Losses Data Creation script but reads from a different data location.

For copyright reasons, the code has been heavily modified and simplified so that it is vague and does not reveal corporate information. My hope is that it still communicates the logic and planned structure of Capital One's First Party Fraud Monthly Business Report repository.
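
As a rough illustration of the weekly schedule, the sketch below writes the "Friday at 5:00 a.m." rule as a Quartz cron expression, which is the format Databricks job schedules accept. The `timezone_id` value is an assumption based on the timezone set later in the notebook, and `describe_quartz` is a hypothetical helper used only to sanity-check the expression.

```python
# Sketch of a Databricks job schedule block for "every Friday at 05:00".
# Quartz cron field order: second minute hour day-of-month month day-of-week.
weekly_schedule = {
    "quartz_cron_expression": "0 0 5 ? * FRI",
    "timezone_id": "America/New_York",  # assumption: same zone the notebook uses
    "pause_status": "UNPAUSED",
}


def describe_quartz(expr):
    """Split a six-field Quartz expression into named parts (hypothetical helper)."""
    names = ["second", "minute", "hour", "day_of_month", "month", "day_of_week"]
    return dict(zip(names, expr.split()))


schedule_fields = describe_quartz(weekly_schedule["quartz_cron_expression"])
```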


## Script Outline

The script is organized as follows:

        1. Set-Up (imports, connections, creating variables)
        2. Writing SQL queries
        3. Running SQL queries
        4. Granting privileges on the newly created tables


## Set-Up Explanation

Before this script can run, several steps are needed to connect to the data and execute code:

        Running the credentials file
        Running utility scripts
        Installing the Capital One built package pptmaker
        Importing packages
        Creating useful variables

In [2]:
#Step 1, run credentials files to connect to Capital One's Data infrastructure
%run "Users/[EID]/creds"

#If you are cloning this repository you will have to change the above to specify your EID

ERROR:root:File `'Users/[EID]/creds.py'` not found.


In [None]:
#Step 2, run helpful utility scripts that predefine functions used throughout the script
%run "./Utilities/fraud_helper_fx"

In [None]:
%run "./Utilities/MBR_fx"

In [None]:
#Step 3, install Capital One internally created package that can create a .pptx file of graphs/tables
dbutils.library.installPyPi("pptmaker", repo='....')

In [2]:
#Step 4, import packages and create helpful variables

from pptmaker import pptMaker
import pyspark.sql.functions as F
from pyspark.sql import DataFrameStatFunctions as FS
from pyspark.sql.functions import *
from pyspark.sql.types import *
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import re
import json
import pytz
import os.path
from pytz import timezone

#name developers and recipients -- change this if you are cloning the repository

dev_email = ['joby.george@capitalone.com']
recipients = ['joby.george@capitalone.com']

#set timezone to US Eastern Time
tz = pytz.timezone('America/New_York')




In [None]:
#set up connection to snowflake so we can access productionized data
snowflake_source_name = "net.snowflake.spark.snowflake"
sfOptions = {
    "sfUrl":"...",
    "sfUser":username, #accessed from running creds file
    "sfPassword":password,#accessed from running creds file
    "sfDatabase":"...",
    "sfSchema":"USER_{}".format(username),
}

#handle to the Snowflake connector's JVM utilities, used later to run queries
Utils = spark._jvm.net.snowflake.spark.snowflake.Utils



## Writing SQL Queries

### Note: all code is highly simplified to avoid disclosing confidential information

Our goal is to build granular and aggregated data tables that contain every firing of our defenses and the number of those cases that turn out to be fraudulent.

To do this, we first create a table of the case firings of defenses in First Party Fraud.

After that, we aggregate the number of cases by month and by case outcome (Fraud / not Fraud).

Lastly, we compute hit rate, the fraud count divided by the number of cases, at the same aggregated level. These aggregated tables mimic the monthly reporting of the Monthly Business Report.
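
To make the aggregation logic concrete, here is a minimal pure-Python sketch of the same idea on a handful of made-up rows. It is not the production logic: the row values are invented, and for simplicity it counts cases in both the numerator and the denominator of the hit rate.

```python
from collections import defaultdict

# Toy rows: (case_id, case_month, defense, case_outcome). All values are made up.
rows = [
    ("c1", "2023-01", "FPF Model", "Fraud"),
    ("c2", "2023-01", "FPF Model", "not Fraud"),
    ("c3", "2023-01", "FPF Model", "Fraud"),
    ("c4", "2023-01", "risky email", "Pending"),
    ("c5", "2023-02", "risky email", "Fraud"),
]

all_cases = defaultdict(set)    # (month, defense) -> every case that fired
fraud_cases = defaultdict(set)  # (month, defense) -> cases confirmed as fraud

for case_id, month, defense, outcome in rows:
    key = (month, defense)
    all_cases[key].add(case_id)
    if outcome == "Fraud":
        fraud_cases[key].add(case_id)

# Hit rate per month and defense: confirmed-fraud cases over all cases.
hit_rate = {
    key: len(fraud_cases[key]) / len(cases) for key, cases in all_cases.items()
}
```

Grouping on the `(month, defense)` pair mirrors the `group by 1,2` used in the aggregated tables.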

In [None]:
#create a table of the First Party Fraud defenses
defense_table = '''
create or replace table lab_fpf.defense_base as (
    select
        case_id
        ,acct_id
        ,frd_dfns_id
        --map defense ids to a readable defense name
        ,case when frd_dfns_id in (1,2) then 'FPF Model'
        when frd_dfns_id in (3,4,5,6,7) then 'risky email'
        when frd_dfns_id in (8) then 'risky SSN'
        when frd_dfns_id in (9,20,11,45,21,15) then 'malicious young account'
        when frd_dfns_id in (100,101,105,215,213,107,143,214,341,78,765) then 'agent defenses'
        else cast(frd_dfns_id as varchar)
        end as defense
        ,fraud_case_resolution_code
        ,cast(fraud_case_creation_timestamp as DATE) as case_date
        ,date_trunc('month', case_date) as case_month
        --translate resolution codes into a case outcome
        ,case when fraud_case_resolution_code = '-1' then 'Pending'
        when fraud_case_resolution_code = '10' then 'Fraud'
        when fraud_case_resolution_code = '0' then 'not Fraud'
        end as case_outcome
        ,case when case_outcome = 'Fraud' then 1 else 0 end as fraud_ind
    from defense_table
    where defense in ('FPF Model', 'risky email', 'risky SSN', 'malicious young account', 'agent defenses')
    --keep the 24 full months before the current month
    and case_date between dateadd(month, -24, date_trunc('month', current_date)) and dateadd(day, -1, date_trunc('month', current_date))
        );'''
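
The date filter in the query above keeps a rolling window of the 24 full months before the current month. The sketch below reproduces that window in plain Python so the boundaries are easy to check; `reporting_window` is a hypothetical helper, not part of the original script.

```python
from datetime import date, timedelta


def reporting_window(today):
    """Return (start, end) covering the 24 full months before the current month."""
    first_of_month = today.replace(day=1)
    # Count months since year 0, then step back 24 of them.
    months = first_of_month.year * 12 + (first_of_month.month - 1) - 24
    start = date(months // 12, months % 12 + 1, 1)
    # The day before the first of this month is the last day of the previous month.
    end = first_of_month - timedelta(days=1)
    return start, end


window = reporting_window(date(2024, 3, 15))
```

For a run on 15 March 2024 the window spans 1 March 2022 through 29 February 2024, both ends inclusive, matching the SQL `between`.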



In [None]:
#look at aggregated case volume by month and defense
agg_defenses = '''create or replace table lab_fpf.case_size as (
    select 
        case_month
        ,defense
        ,count(distinct case_id) as case_size
    from lab_fpf.defense_base
    group by 1,2
    order by 1,2);'''

In [None]:
#look at aggregated hit rate
#step 1: count fraud accounts and fraud cases by month and defense
hit_rate_base = '''
create or replace table lab_fpf.df_hitrate_graph_base as (
    select 
        case_month
        ,defense
        ,count(distinct acct_id) as fraud_count
        ,count(distinct case_id) as fraud_case_size
    from lab_fpf.defense_base
    where fraud_ind = 1
    group by 1,2
    order by 1,2);'''

#step 2: join fraud counts to total case volume and compute hit rate
hit_rate_agg = '''
create or replace table lab_fpf.df_hitrate_graph as (
    select 
        a.case_month
        ,a.defense
        ,a.fraud_count
        ,b.case_size
        ,a.fraud_count/b.case_size as hit_rate
    from lab_fpf.df_hitrate_graph_base a
    left join lab_fpf.case_size b
    on a.case_month = b.case_month
    and a.defense = b.defense
   );'''

## Running the queries

To have Databricks execute the query strings above, we pass each one to Utils.runQuery together with the Snowflake connection options.

In [None]:
#run each query in order; later tables depend on earlier ones
query_list = [defense_table
             ,agg_defenses
             ,hit_rate_base
             ,hit_rate_agg
             ]
for query in query_list:
    Utils.runQuery(sfOptions, query)
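
Because the later tables are built from the earlier ones, it can help to stop at the first failure and report which query broke. The following is a minimal sketch of that idea, not the script's actual behavior: `run_queries` is a hypothetical helper, and the runner passed in here is a stand-in list so the example runs without a Snowflake connection.

```python
def run_queries(run_fn, named_queries):
    """Run each (name, sql) pair in order; stop at the first failure and name it."""
    completed = []
    for name, sql in named_queries:
        try:
            run_fn(sql)
        except Exception as exc:
            raise RuntimeError(
                f"query '{name}' failed after {completed} succeeded"
            ) from exc
        completed.append(name)
    return completed


# Stand-in runner for illustration: it only records the SQL it receives.
executed = []
finished = run_queries(
    executed.append,
    [("defense_base", "select 1"), ("case_size", "select 2")],
)
```

In the notebook, `run_fn` would be a small function that wraps the Utils.runQuery call.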

## Grant Privileges on the Tables

Similarly, we run one grant statement per table through Utils.runQuery, of the form 'grant select on <table> to all_users'.

In [None]:
#grant read access on every table created above
table_list = ['lab_fpf.defense_base'
              ,'lab_fpf.case_size'
              ,'lab_fpf.df_hitrate_graph_base'
              ,'lab_fpf.df_hitrate_graph'
             ]

for table in table_list:
    Utils.runQuery(sfOptions, 'grant select on ' + table + ' to all_users')
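
Since the grant statements are built by string concatenation, one optional safeguard is to check each table name before it goes into the SQL. The sketch below shows one way to do that; `build_grants` and the name pattern are assumptions for illustration, not part of the original script.

```python
import re

# Accept only schema.table names made of letters, digits and underscores.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_]*$")


def build_grants(tables, role="all_users"):
    """Return one 'grant select' statement per table, rejecting malformed names."""
    statements = []
    for table in tables:
        if not _TABLE_NAME.match(table):
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(f"grant select on {table} to {role}")
    return statements


grants = build_grants(["lab_fpf.defense_base", "lab_fpf.case_size"])
```

Each returned string can then be passed to the same query runner used for the table-creation queries.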