# Defense Performance

The Defense Performance script pulls from tables built in **Defense Performance** to create monthly views of fraud defense performance, split by the dimensions that matter most when evaluating that performance.


Due to copyright reasons, the code has been heavily modified and generalized so that it does not reveal corporate information. However, my hope is that the audience can still understand why the script is ordered the way it is.

The script is highly similar to the Fraud Losses Graphs script, given the overlap in processes. 

## Script Outline

The script is organized as follows:

        1. Set-Up (imports, connections, creating variables)
        2. Establishing Unit Test
        3. SQL to extract data
        4. Data formatting through Pandas
        5. Slide Creation
        6. Unit test execution


## Set-Up Explanation

To run this script successfully, several steps must be completed first in order to connect to the data and run the code. They are:

        Running the credentials file
        Running utility scripts
        Installing the Capital One built package pptmaker
        Importing packages
        Creating useful variables

In [2]:
#Step 1, run credentials files to connect to Capital One's Data infrastructure
%run "Users/[EID]/creds"

#If you are cloning this repository you will have to change the above to specify your EID

ERROR:root:File `'Users/[EID]/creds.py'` not found.


In [None]:
#Step 2, run helpful utility scripts that predefine functions used throughout the script
%run "./Utilities/fraud_helper_fx"

In [None]:
#Step 3, install Capital One internally created package that can create a .pptx file of graphs/tables
dbutils.library.installPyPI("pptmaker", repo='....')

In [2]:
#Step 4, import packages and create helpful variables

from pptmaker import pptMaker
import pyspark.sql.functions as F
from pyspark.sql import DataFrameStatFunctions as FS
from pyspark.sql.functions import *
from pyspark.sql.types import *
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import re
import json
import pytz
import os.path
from pytz import timezone

In [None]:
#set up connection to snowflake so we can access productionized data
snowflake_source_name = "net.snowflake.spark.snowflake"
sfOptions = {
    "sfUrl":"...",
    "sfUser":username, #accessed from running creds file
    "sfPassword":password,#accessed from running creds file
    "sfDatabase":"...",
    "sfSchema":"USER_{}".format(username),
}

#the JVM gateway is exposed on the SparkSession as the private attribute _jvm
Utils = spark._jvm.net.snowflake.spark.snowflake.Utils
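
The `sf_load_query` helper used later in this script comes from the utility script and its source is not shown here. As a rough sketch only, a helper like it could wrap the Snowflake Spark connector's reader as below. The function signature, including passing `spark` and the options dictionary explicitly, is an assumption made so the example is self-contained; the real helper is called with just the query string.

```python
def sf_load_query(spark, query, options, source_name="net.snowflake.spark.snowflake"):
    """Hypothetical sketch: run `query` in Snowflake and return a Spark DataFrame.

    `spark` is an active SparkSession and `options` is a dict shaped like
    sfOptions above. The real helper in the utility script may differ.
    """
    return (
        spark.read.format(source_name)   # use the Snowflake connector as the data source
        .options(**options)              # connection settings (URL, user, database, schema)
        .option("query", query)          # have Snowflake execute the SQL and return its result
        .load()
    )
```

Under that assumption, a call would look like `sf_load_query(spark, "select ...", sfOptions).toPandas()`.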

## Unit Test Explanation

As mentioned throughout the repository, the "unit test" concept lets analysts run a single script and receive only that script's graphs as output.

It's incredibly useful when debugging a single script in the process of building the entire MBR.

Since this script runs after the fraud_losses script in the trigger script, we only want to create a ppt object when the unit test parameter is set to _yes_.

To create the unit test parameter, we establish a widget in Databricks with the command below:

In [1]:
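#create the widget, then read back the value that was entered (widgets.text itself returns None)
dbutils.widgets.text("unit_test", 'type Y for unit test')
unit_test = dbutils.widgets.get("unit_test")

#unit_test_acpt_values is assumed to be defined by the utility script run in Step 2
if unit_test in unit_test_acpt_values:
    ppt = pptMaker.pptMaker()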

NameError: name 'dbutils' is not defined
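
The contents of `unit_test_acpt_values` are not shown in this notebook. The sketch below illustrates one way the membership check could be made tolerant of casing and stray whitespace. The accepted values and the `is_unit_test` helper are hypothetical and are not taken from the original utility script.

```python
# Hypothetical accepted values; the real list lives in the utility script.
UNIT_TEST_ACPT_VALUES = {"y", "yes"}

def is_unit_test(raw_value, accepted=UNIT_TEST_ACPT_VALUES):
    """Return True when the widget text should trigger a stand-alone run."""
    # Normalize the free-text widget input before comparing.
    return str(raw_value).strip().lower() in accepted

print(is_unit_test(" Y "))                   # widget filled in by the analyst
print(is_unit_test("type Y for unit test"))  # widget left at its default text
```

Leaving the widget at its default text fails the check, so a run launched from the trigger script keeps using the ppt object that already exists.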

## Data Extraction

To get the graphs in the format the team reviews each month, I run the following SQL queries in Databricks. The outputs are Spark DataFrames oriented in a wide manner. I convert each Spark DataFrame to a Pandas DataFrame and then begin data formatting using Pandas.

### Views of interest:

        Case Volumes by defense
        Hit Rate by defense
        
The actual script contains far more views than this, but has been simplified for educational purposes.

### Defense Case Volumes

In [None]:
#SQL to pull case volumes by defense
def_vol_chart_sql = """select case_month, defense, case_size from lab_fpf.case_size"""

#pull data from snowflake, converting Spark Dataframe to pandas
def_vol_chart_pdf = sf_load_query(def_vol_chart_sql).toPandas()

#convert case month to datetime
def_vol_chart_pdf['case_month'] = pd.to_datetime(def_vol_chart_pdf['case_month'])

#format date
def_vol_chart_pdf['Formatted_Case_Month'] = def_vol_chart_pdf['case_month'].apply(lambda x:x.strftime('%b-%y'))


In [None]:
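#create variables for the month over month table
#assumes `today` is a timezone-naive date/datetime defined earlier (e.g. in the utility script)
#parentheses matter: subtract a month first, then snap to the first day of that month
last_month = (pd.Timestamp(today).normalize() - relativedelta(months=1)).replace(day=1)
two_months_ago = last_month - relativedelta(months=1)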

#get data for relevant time needed for month over month performance percent difference table
table_df = def_vol_chart_pdf.loc[(def_vol_chart_pdf['case_month']==two_months_ago) | (def_vol_chart_pdf['case_month']==last_month)]
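
The month over month table itself is built later by `create_month_over_month_percents_fraud_defenses`, a helper from the utility script whose source is not shown. As an illustration only, the sketch below shows one way such a percent-difference table could be computed with Pandas, assuming exactly one row per defense per month. The function name, output column names, and sample numbers are made up for the example.

```python
import pandas as pd

def month_over_month_percents(table_df, metric, month_col="case_month", group_col="defense"):
    """Hypothetical sketch: percent change of `metric` between two months, per defense."""
    # One row per defense, one column per month, oldest month first.
    wide = table_df.pivot(index=group_col, columns=month_col, values=metric).sort_index(axis=1)
    if wide.shape[1] != 2:
        raise ValueError("expected exactly two months of data")
    prior, latest = wide.columns
    result = pd.DataFrame({"prior_month": wide[prior], "latest_month": wide[latest]})
    # Percent change relative to the earlier month.
    result["pct_change"] = (result["latest_month"] - result["prior_month"]) * 100 / result["prior_month"]
    return result.reset_index()

# Illustrative data only, not real case volumes.
example_df = pd.DataFrame({
    "case_month": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01", "2024-02-01"]),
    "defense": ["A", "A", "B", "B"],
    "case_size": [100, 120, 50, 40],
})
print(month_over_month_percents(example_df, "case_size"))
```

Using `pivot` rather than `pivot_table` makes duplicate defense and month pairs raise an error instead of being silently aggregated, which is safer for a ratio metric such as `hit_rate`.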

### Hit Rate by Defense

In [None]:
#this one is easy due to how we made the table!
hitrate_pdf = sf_load_query('''select case_month
                                    ,defense
                                    ,fraud_count
                                    ,case_size
                                    ,hit_rate
                                    from lab_fpf.df_hitrate_graph''').toPandas()

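#convert case month to datetime so it can be compared against last_month and two_months_ago
hitrate_pdf['case_month'] = pd.to_datetime(hitrate_pdf['case_month'])

##format month into text, using the column name the hit rate chart expects for its x-axis
hitrate_pdf['Formatted_Case_Month'] = hitrate_pdf['case_month'].apply(lambda x:x.strftime('%b-%y'))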


In [None]:
#get data for relevant time needed for month over month performance percent difference table
hit_rate_table = hitrate_pdf.loc[(hitrate_pdf['case_month']==two_months_ago) | (hitrate_pdf['case_month']==last_month)]
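
Both views filter to the same two-month window, so the date arithmetic can be checked in isolation. The sketch below assumes `case_month` holds first-of-month dates and uses made-up hit rates. The helper names `last_two_full_months` and `filter_to_months` are hypothetical and do not come from the original script.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

def last_two_full_months(today):
    """Return (two_months_ago, last_month) as first-of-month timestamps."""
    first_of_this_month = pd.Timestamp(today).normalize().replace(day=1)
    last_month = first_of_this_month - relativedelta(months=1)
    two_months_ago = first_of_this_month - relativedelta(months=2)
    return two_months_ago, last_month

def filter_to_months(pdf, months, month_col="case_month"):
    """Keep only the rows whose month is one of `months`."""
    return pdf.loc[pdf[month_col].isin(months)]

# Illustrative data only.
hitrate_example = pd.DataFrame({
    "case_month": pd.to_datetime(["2023-12-01", "2024-01-01", "2024-02-01", "2024-03-01"]),
    "defense": ["A", "A", "A", "A"],
    "hit_rate": [0.10, 0.12, 0.11, 0.13],
})

two_months_ago, last_month = last_two_full_months("2024-03-15")
window = filter_to_months(hitrate_example, [two_months_ago, last_month])
print(window)
```

`isin` expresses the same condition as the two equality checks joined with `|`, and it scales more cleanly if the window ever grows beyond two months.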

## Slide Creation

With our tables and chart data prepared, we can create our graphs. We first need to sort the date axis chronologically, and then we can create the slides.

In [None]:
#grab date values used in x-axis of graphs and sort them chronologically
date_format_list = def_vol_chart_pdf['Formatted_Case_Month'].unique().tolist()
date_format_list.sort(key= lambda date: datetime.strptime(date, '%b-%y'))
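
The `key` argument matters because labels in `%b-%y` form do not sort chronologically as plain strings. A small stand-alone illustration with made-up month labels, assuming the default English month abbreviations:

```python
from datetime import datetime

labels = ["Jan-24", "Dec-23", "Feb-24", "Nov-23"]

# Plain string sort: ordered by the first letter of the month abbreviation.
alphabetical = sorted(labels)

# Parse each label back into a date so the sort follows the calendar.
chronological = sorted(labels, key=lambda label: datetime.strptime(label, "%b-%y"))

print(alphabetical)   # ['Dec-23', 'Feb-24', 'Jan-24', 'Nov-23']
print(chronological)  # ['Nov-23', 'Dec-23', 'Jan-24', 'Feb-24']
```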

In [None]:
#create defense volume chart
def_vol_graph = ppt.createChart(def_vol_chart_pdf #pandas dataframe input
                               ,metric = 'case_size' #y value for graphs
                               ,x_axis = 'Formatted_Case_Month' #x-axis
                               ,splitter = 'defense' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Defense Volume'
                               ,chart_sub_name = 'By Defense by Case Month'
                               ,x_label = 'Case Month'
                               )

def_volume_table = ppt.createTable(create_month_over_month_percents_fraud_defenses(table_df, 'case_size'))

##ppt.createSlide takes in a list of charts, a list of tables, the slide header, and any footnote text
defense_volume_slide = ppt.createSlide(charts = [def_vol_graph], #list of charts to put in the graph
                                      tables = [def_volume_table],
                                      slide_name = 'Looking at defense activity',
                                    notes = 'Person of Contact: Joby George' )

In [None]:
#create hit rate chart
hit_rate_graph = ppt.createChart(hitrate_pdf #pandas dataframe input
                               ,metric = 'hit_rate' #y value for graphs
                               ,x_axis = 'Formatted_Case_Month' #x-axis
                               ,splitter = 'defense' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Hit Rate Percentage'
                               ,chart_sub_name = 'By Defense by Case Month'
                               ,x_label = 'Case Month'
                               )

hit_rate_table = ppt.createTable(create_month_over_month_percents_fraud_defenses(hit_rate_table, 'hit_rate'))

##ppt.createSlide takes in a list of charts, a list of tables, the slide header, and any footnote text
hit_rate_slide = ppt.createSlide(charts = [hit_rate_graph], #list of charts to put in the graph
                                      tables = [hit_rate_table],
                                      slide_name = 'Looking at defense performance',
                                    notes = 'Person of Contact: Joby George' )

## Unit Test Execution

To run this script as a stand-alone component from the trigger script, I wrote a simple if statement that checks the value of the variable unit_test, which was passed as a parameter in the trigger script.

If unit_test is one of the accepted values, the deck is sent out to recipients. If it is not, the slides remain part of the ppt object until they are sent out in the last command of the trigger script.

In [None]:
if unit_test in unit_test_acpt_values:
    #build the timestamped name once so the deck name and the email subject always match
    deck_name = 'FPF_Fraud_Defenses_MBR ' + datetime.now(timezone("America/New_York")).strftime('%Y_%m_%d_%H')
    ppt.createDeck(ppt_name = deck_name,
                   email_to = recipients,
                   email_from = dev_email,
                   email_subject = deck_name,
                   ppt_attach = True)