# Fraud Losses Graph

The Fraud Losses Graph script pulls from tables built in **Fraud Losses** to create monthly views of fraud losses, split by the dimensions that matter most when assessing performance.


For copyright reasons, the code has been heavily modified and generalized so that it is vague and does not reveal corporate information. However, my hope is that the audience can understand why the script is ordered the way it is.

## Script Outline

The script is organized as follows:

        1. Set-Up (imports, connections, creating variables)
        2. Establishing Unit Test
        3. SQL to extract data
        4. Data formatting through Pandas
        5. Slide Creation
        6. Unit test execution


## Set-Up Explanation

To run this script successfully, several steps must be completed first in order to connect to the data and run the code. They are:

        Running the credentials file
        Running utility scripts
        Installing the Capital One built package pptmaker
        Importing packages
        Creating useful variables

In [2]:
#Step 1, run credentials files to connect to Capital One's Data infrastructure
%run "Users/[EID]/creds"

#If you are cloning this repository you will have to change the above to specify your EID

ERROR:root:File `'Users/[EID]/creds.py'` not found.


In [None]:
#Step 2, run helpful utility scripts that predefine functions used throughout the script
%run "./Utilities/fraud_helper_fx"

In [None]:
#Step 3, install Capital One internally created package that can create a .pptx file of graphs/tables
dbutils.library.installPyPi("pptmaker", repo='....')

In [2]:
#Step 4, import packages and create helpful variables

from pptmaker import pptMaker
import pyspark.sql.functions as F
from pyspark.sql import DataFrameStatFunctions as FS
from pyspark.sql.functions import *
from pyspark.sql.types import *
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import re
import json
import pytz
import os.path
from pytz import timezone

In [None]:
#set up connection to snowflake so we can access productionized data
snowflake_source_name = "net.snowflake.spark.snowflake"
sfOptions = {
    "sfUrl":"...",
    "sfUser":username, #accessed from running creds file
    "sfPassword":password,#accessed from running creds file
    "sfDatabase":"...",
    "sfSchema":"USER_{}".format(username),
}

#the JVM gateway on a SparkSession is exposed as the private attribute _jvm
Utils = spark._jvm.net.snowflake.spark.snowflake.Utils
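
The later cells call `sf_load_query`, which lives in the utility script and is not shown here. The sketch below is one way such a helper might look, assuming it wraps the Snowflake Spark connector's usual read path. The signature, which takes the Spark session and the options dictionary explicitly, is an assumption made for illustration and is not the actual utility.

```python
def sf_load_query(spark, query, options, source_name="net.snowflake.spark.snowflake"):
    """Hypothetical helper: run a SQL query on Snowflake and return a Spark DataFrame."""
    return (
        spark.read.format(source_name)  # use the Snowflake connector as the data source
        .options(**options)             # connection settings such as sfUrl, sfUser, sfDatabase
        .option("query", query)         # push the SQL text down to Snowflake
        .load()
    )
```

In a notebook this would be called as `sf_load_query(spark, fraud_loss_chart_sql, sfOptions)`; the real helper appears to read the session and options from the notebook's global scope instead.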

## Unit Test Explanation

As mentioned throughout the repository, the "unit test" concept lets analysts run a single script and receive only that script's graphs as output.

It's incredibly useful when debugging a single script while building the entire MBR.

To create the unit test parameter, we establish a widget in Databricks with the command below:

In [None]:
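#dbutils.widgets.text only creates the widget; its current value is read with dbutils.widgets.get
dbutils.widgets.text("unit_test", 'type Y for unit test')
unit_test = dbutils.widgets.get("unit_test")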

# Data Extraction

To build the graphs in the format the team reviews each month, I run the following SQL queries in Databricks. Each query returns a Spark DataFrame in wide format. I convert the Spark DataFrame to a pandas DataFrame and then format the data with pandas.

## Views of interest:

        Losses by segment
        Losses as a ratio to accounts, by segment compared against enterprise wide performance for the same metric
        Losses by account age, by segment

The actual script contains far more views than this, but has been simplified for educational purposes.

### Losses by segment Data Prep

In [None]:
#SQL to pull losses from three core segments
fraud_loss_chart_sql = """select chargeoff_month, segment1_losses, segment2_losses, segment3_losses from lab_fpf.chart_loss"""

#pull data from snowflake
fraud_loss_spark_df = sf_load_query(fraud_loss_chart_sql)

#convert to Pandas dataframe
fraud_loss_pdf = fraud_loss_spark_df.toPandas()

In [None]:
#reorient wide data into long data for graphing purposes
fraud_loss_chart = pd.melt(fraud_loss_pdf, id_vars = ['chargeoff_month'], var_name = 'segment', value_name = 'Fraud_Losses')

#fix formatting of segment names to be consistently the first two letters of the segment:
fraud_loss_chart['segment'] = fraud_loss_chart['segment'].apply(lambda x: x[:2])

#format dates to be in the format Jan-21 for January 2021 rather than 2021-01-01
#the result is stored in a new column, which the slide creation code uses as the x-axis
fraud_loss_chart['Formatted_Chargeoff_Month'] = fraud_loss_chart['chargeoff_month'].apply(lambda x: x.strftime('%b-%y'))

#we now have fraud losses by month by segment ready to be graphed!
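
The reshaping steps above can be tried on toy data without a Snowflake connection. This is a minimal sketch: the values are made up, and only two segment columns are used to keep the output short.

```python
import pandas as pd

# Toy wide-format data: one row per month, one column per segment (values are made up)
wide_df = pd.DataFrame({
    "chargeoff_month": pd.to_datetime(["2021-01-01", "2021-02-01"]),
    "segment1_losses": [100.0, 120.0],
    "segment2_losses": [80.0, 90.0],
})

# Reshape to long format: one row per (month, segment) pair
long_df = pd.melt(
    wide_df,
    id_vars=["chargeoff_month"],
    var_name="segment",
    value_name="Fraud_Losses",
)

# Add a display label such as "Jan-21" while keeping the original datetime column
long_df["Formatted_Chargeoff_Month"] = long_df["chargeoff_month"].dt.strftime("%b-%y")

print(long_df)
```

Keeping the original datetime column next to the formatted label means the data can still be sorted chronologically later.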

### Fraud accounts over total accounts, by segment and overall 

It's very useful to understand the ratio of fraud accounts to the total number of accounts. A rising ratio tells us that fraud is actually escalating, rather than simply growing along with the portfolio. We can compute this by segment, and re-aggregate to include the ratio for the total Capital One portfolio.

In [None]:
#SQL to pull losses as a percentage of accounts, split by segment
accts_query = '''select chargeoff_accts, all_accts, chargeoff_accts/NULLIF(all_accts,0) as fraud_rate, a.segment, a.chargeoff_month 
from
lab_fpf.accts_losses_agg a
left join lab_fpf.accts_portfolio_agg b on a.segment = b.segment
and a.chargeoff_month = b.snap_date
where a.segment in ('segment1', 'segment2', 'segment3') order by 5;'''

#SQL to pull losses as a percentage of accounts not split by segment, we want to compare how the segments chart against
#the entire portfolio
total_by_acct_query = '''select sum(chargeoff_accts) as chargeoff_accts, sum(all_accts) as all_accts,
sum(chargeoff_accts)/sum(all_accts) as fraud_rate, chargeoff_month 
from 
lab_fpf.accts_losses_agg a
left join lab_fpf.accts_portfolio_agg b on a.segment = b.segment
and a.chargeoff_month = b.snap_date
group by 4 order by 4;'''



In [None]:
#pull from snowflake, convert to pandas
accts_pdf = sf_load_query(accts_query).toPandas()
total_accts_pdf = sf_load_query(total_by_acct_query).toPandas()


In [None]:
#for the enterprise-wide dataframe, create a segment column so all of the data can be shown in one graph

total_accts_pdf['segment'] = 'Overall'

#concatenate the dataframes
accts_final_pdf = pd.concat([accts_pdf, total_accts_pdf])

#format chargeoff month
accts_final_pdf['Formatted_Chargeoff_Month'] = accts_final_pdf['chargeoff_month'].apply(lambda x:x.strftime('%b-%y'))
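
The ratio logic can also be reproduced entirely in pandas, which is handy for checking the SQL output. This is a sketch on made-up numbers, not the production logic. It shows why the overall rate is computed from summed numerators and denominators instead of by averaging the segment rates.

```python
import numpy as np
import pandas as pd

# Toy per-segment counts (values are made up)
by_segment = pd.DataFrame({
    "chargeoff_month": pd.to_datetime(["2021-01-01"] * 2 + ["2021-02-01"] * 2),
    "segment": ["segment1", "segment2"] * 2,
    "chargeoff_accts": [10, 60, 20, 0],
    "all_accts": [1000, 3000, 1000, 0],
})

# Per-segment rate; a zero denominator becomes NaN, mirroring NULLIF(all_accts, 0) in the SQL
by_segment["fraud_rate"] = by_segment["chargeoff_accts"] / by_segment["all_accts"].replace(0, np.nan)

# Overall rate: sum the counts across segments first, then divide
overall = by_segment.groupby("chargeoff_month", as_index=False)[["chargeoff_accts", "all_accts"]].sum()
overall["fraud_rate"] = overall["chargeoff_accts"] / overall["all_accts"].replace(0, np.nan)
overall["segment"] = "Overall"

# Stack segment rows and overall rows so one chart can show all lines
combined = pd.concat([by_segment, overall], ignore_index=True)
print(combined)
```

For January the segment rates are 1% and 2%, but the overall rate is 70 / 4000 = 1.75%, because the larger segment carries more weight than a simple average would give it.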


### Losses by segment and account age bin

In our data creation scripts we created four tables:
        
        chart_loss_age1: losses from accounts < 1 year old
        chart_loss_age2: losses from accounts between 1-2 years old
        chart_loss_age4: losses from accounts between 2-4 years old
        chart_loss_age9: losses from accounts aged 4 years +
        
To get these views, we run simple queries on these tables:

In [None]:
#SQL to pull losses by age bin and segment
loss_age_1_query =  """select chargeoff_month, segment1_losses, segment2_losses, segment3_losses from lab_fpf.chart_loss_age1"""
loss_age_2_query =  """select chargeoff_month, segment1_losses, segment2_losses, segment3_losses from lab_fpf.chart_loss_age2"""
loss_age_4_query =  """select chargeoff_month, segment1_losses, segment2_losses, segment3_losses from lab_fpf.chart_loss_age4"""
loss_age_9_query =  """select chargeoff_month, segment1_losses, segment2_losses, segment3_losses from lab_fpf.chart_loss_age9"""

In [None]:
#pull data from snowflake, convert to pandas dataframe
loss_age_1 = sf_load_query(loss_age_1_query).toPandas()
loss_age_2 = sf_load_query(loss_age_2_query).toPandas()
loss_age_4 = sf_load_query(loss_age_4_query).toPandas()
loss_age_9 = sf_load_query(loss_age_9_query).toPandas()

In [None]:
#sort data by charge_off month
loss_age_1 = loss_age_1.sort_values(['chargeoff_month'])
loss_age_2 = loss_age_2.sort_values(['chargeoff_month'])
loss_age_4 = loss_age_4.sort_values(['chargeoff_month'])
loss_age_9 = loss_age_9.sort_values(['chargeoff_month'])

In [None]:
#melt into long data
loss_age_1 = pd.melt(loss_age_1, id_vars = ['chargeoff_month'], var_name = 'segment', value_name = 'Fraud_Losses') 
loss_age_2 = pd.melt(loss_age_2, id_vars = ['chargeoff_month'], var_name = 'segment', value_name = 'Fraud_Losses')
loss_age_4 = pd.melt(loss_age_4, id_vars = ['chargeoff_month'], var_name = 'segment', value_name = 'Fraud_Losses')
loss_age_9 = pd.melt(loss_age_9, id_vars = ['chargeoff_month'], var_name = 'segment', value_name = 'Fraud_Losses')

#format segment name to first two letters

loss_age_1['segment'] = loss_age_1['segment'].str[:2]
loss_age_2['segment'] = loss_age_2['segment'].str[:2]
loss_age_4['segment'] = loss_age_4['segment'].str[:2]
loss_age_9['segment'] = loss_age_9['segment'].str[:2]

#format date into a new column, which the charts use as the x-axis
loss_age_1['Formatted_Chargeoff_Month'] = loss_age_1['chargeoff_month'].apply(lambda x: x.strftime('%b-%y'))
loss_age_2['Formatted_Chargeoff_Month'] = loss_age_2['chargeoff_month'].apply(lambda x: x.strftime('%b-%y'))
loss_age_4['Formatted_Chargeoff_Month'] = loss_age_4['chargeoff_month'].apply(lambda x: x.strftime('%b-%y'))
loss_age_9['Formatted_Chargeoff_Month'] = loss_age_9['chargeoff_month'].apply(lambda x: x.strftime('%b-%y'))
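
Because the four age-bin dataframes go through identical steps, the formatting could be wrapped in one helper and applied in a loop. The sketch below assumes toy data; the helper name `to_long_format` and the `AA_losses`/`BB_losses` column names are hypothetical and do not come from the actual script.

```python
import pandas as pd

def to_long_format(wide_df):
    """Sort by month, melt to long format, shorten segment names, add a display month."""
    out = wide_df.sort_values("chargeoff_month")
    out = pd.melt(out, id_vars=["chargeoff_month"], var_name="segment", value_name="Fraud_Losses")
    out["segment"] = out["segment"].str[:2]
    out["Formatted_Chargeoff_Month"] = out["chargeoff_month"].dt.strftime("%b-%y")
    return out

# Toy stand-ins for the four age-bin tables (made-up values, deliberately unsorted months)
months = pd.to_datetime(["2021-02-01", "2021-01-01"])
age_bins = {
    name: pd.DataFrame({"chargeoff_month": months, "AA_losses": [2.0, 1.0], "BB_losses": [4.0, 3.0]})
    for name in ["loss_age_1", "loss_age_2", "loss_age_4", "loss_age_9"]
}

# Apply the same formatting to every age bin
long_bins = {name: to_long_format(df) for name, df in age_bins.items()}
print(long_bins["loss_age_1"])
```

A dictionary keyed by bin name keeps the four results addressable individually, so the later chart calls can still pick out each one.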

## Slide Creation

From here, we take the formatted pandas dataframes and run them through the pptMaker functions:

In [None]:
#create a powerpoint object that will generate slides in the following commands
ppt = pptMaker.pptMaker()

#specify the x axis order of string dates 
date_format_list = fraud_loss_chart['Formatted_Chargeoff_Month'].unique().tolist()
date_format_list.sort(key = lambda date: datetime.strptime(date, '%b-%y'))
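
The sort key above matters because labels such as "Jan-21" are strings, and a plain sort would order them alphabetically. A small stand-alone illustration with made-up labels:

```python
from datetime import datetime

labels = ["Feb-21", "Dec-20", "Jan-21", "Apr-21"]

# A plain sort is alphabetical, which puts "Apr-21" before "Dec-20"
alphabetical = sorted(labels)

# Parsing each label back into a datetime gives chronological order
chronological = sorted(labels, key=lambda label: datetime.strptime(label, "%b-%y"))

print(alphabetical)   # ['Apr-21', 'Dec-20', 'Feb-21', 'Jan-21']
print(chronological)  # ['Dec-20', 'Jan-21', 'Feb-21', 'Apr-21']
```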

### The output of the following code is attached in the repository

The image of losses by segments 1-3 in the "FPF" folder of the repository corresponds to the output of the following command.

In [None]:
#slide 1 losses by segment

Losses_by_seg_graph = ppt.createChart(fraud_loss_chart #pandas dataframe input
                               ,metric = 'Fraud_Losses' #y value for graphs
                               ,x_axis = 'Formatted_Chargeoff_Month' #x-axis
                               ,splitter = 'segment' #different colors
                               ,chart_type = 'Column_Stacked' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Losses'
                               ,chart_sub_name = 'By Segment by Charge-off Month'
                               ,number_format = '"\$#,,.0 M"' #y axis unit formatting
                               ,x_label = 'Chargeoff Month'
                               )

#the create_month_over_month_percents_fraud_losses function can be found in the utility folder in the MBR_FX script
#the input is the dataframe with the metric of interest, and the name of the metric to calculate the month-over-month 
#percent difference
Losses_by_seg_table = ppt.createTable(create_month_over_month_percents_fraud_losses(fraud_loss_chart, 'Fraud_Losses'))

##ppt.createSlide takes in a list of charts, a list of tables, the slide header, and any footnote text
slide1 = ppt.createSlide(charts = [Losses_by_seg_graph] 
                        , tables = [Losses_by_seg_table]
                        , slide_name = 'Looking at Fraud Losses split by segment'
                        ,notes = 'Person of Contact: Joby George' )
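
The month-over-month helper itself lives in the utility folder and is not reproduced here. As a rough idea of the calculation it is described as performing, the sketch below computes the percent change against the prior month within each segment. The function name `month_over_month_percent`, the `mom_pct` column, and the output layout are assumptions; the real helper may differ.

```python
import pandas as pd

def month_over_month_percent(long_df, metric):
    """Hypothetical stand-in: percent change in `metric` versus the prior month, per segment."""
    out = long_df.sort_values(["segment", "chargeoff_month"]).copy()
    # pct_change compares each row with the previous row inside the same segment
    out["mom_pct"] = out.groupby("segment")[metric].pct_change() * 100
    return out

# Toy long-format data (values are made up)
toy = pd.DataFrame({
    "chargeoff_month": pd.to_datetime(["2021-01-01", "2021-02-01"] * 2),
    "segment": ["AA", "AA", "BB", "BB"],
    "Fraud_Losses": [100.0, 110.0, 200.0, 150.0],
})

result = month_over_month_percent(toy, "Fraud_Losses")
print(result)
```

The first month of each segment has no prior month to compare against, so its `mom_pct` is NaN; segment AA rises by about 10% and segment BB falls by 25%.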


In [None]:
#slide 2 losses as a ratio of accounts, split by segment
Fraud_ratio_graph = ppt.createChart(accts_final_pdf #pandas dataframe input
                               ,metric = 'fraud_rate' #y value for graphs
                               ,x_axis = 'Formatted_Chargeoff_Month' #x-axis
                               ,splitter = 'segment' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Number of Fraud accounts in comparison to overall bookings'
                               ,chart_sub_name = 'By Segment by Chargeoff Month'
                               ,number_format = 'percent' #y axis unit formatting
                               ,x_label = 'Chargeoff Month'
                               )
#the create_month_over_month_percents_fraud_losses function can be found in the utility folder in the MBR_FX script
#the input is the dataframe with the metric of interest, and the name of the metric to calculate the month-over-month 
#percent difference
Fraud_ratio_table= ppt.createTable(create_month_over_month_percents_fraud_losses(accts_final_pdf, 'fraud_rate'))

##ppt.createSlide takes in a list of charts, a list of tables, the slide header, and any footnote text
slide2 = ppt.createSlide(charts = [Fraud_ratio_graph]
                        , tables = [Fraud_ratio_table]
                        , slide_name = 'Looking at Fraud Ratio split by segment'
                        ,notes = 'Person of Contact: Joby George' )


In [None]:
#four charts for losses by acct age:

young_fraud_losses = ppt.createChart(loss_age_1 #pandas dataframe input
                               ,metric = 'Fraud_Losses' #y value for graphs
                               ,x_axis = 'Formatted_Chargeoff_Month' #x-axis
                               ,splitter = 'segment' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Losses for accounts under one year old'
                               ,chart_sub_name = 'By Segment by Chargeoff Month'
                               ,number_format = '"\$#,,.0 M"' #y axis unit formatting
                               ,x_label = 'Chargeoff Month'
                               )

one_year_fraud_losses = ppt.createChart(loss_age_2 #pandas dataframe input
                               ,metric = 'Fraud_Losses' #y value for graphs
                               ,x_axis = 'Formatted_Chargeoff_Month' #x-axis
                               ,splitter = 'segment' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Losses for accounts between 1-2 years old'
                               ,chart_sub_name = 'By Segment by Chargeoff Month'
                               ,number_format = '"\$#,,.0 M"' #y axis unit formatting
                               ,x_label = 'Chargeoff Month'
                               )

middle_age_fraud_losses = ppt.createChart(loss_age_4 #pandas dataframe input
                               ,metric = 'Fraud_Losses' #y value for graphs
                               ,x_axis = 'Formatted_Chargeoff_Month' #x-axis
                               ,splitter = 'segment' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Losses for accounts between 2-4 years old'
                               ,chart_sub_name = 'By Segment by Chargeoff Month'
                               ,number_format = '"\$#,,.0 M"' #y axis unit formatting
                               ,x_label = 'Chargeoff Month'
                               )

elder_fraud_losses = ppt.createChart(loss_age_9 #pandas dataframe input
                               ,metric = 'Fraud_Losses' #y value for graphs
                               ,x_axis = 'Formatted_Chargeoff_Month' #x-axis
                               ,splitter = 'segment' #different colors
                               ,chart_type = 'Line' #graph type
                               ,x_sort = date_format_list #order of x axis
                               ,chart_name = 'Fraud Losses for accounts older than 4 years old'
                               ,chart_sub_name = 'By Segment by Chargeoff Month'
                               ,number_format = '"\$#,,.0 M"' #y axis unit formatting
                               ,x_label = 'Chargeoff Month'
                               )


In [None]:
#create a slide containing these four graphs
slide3 = ppt.createSlide(charts = [young_fraud_losses, one_year_fraud_losses, middle_age_fraud_losses, elder_fraud_losses]
                        , slide_name = 'Looking at Fraud losses by age bin, split by segment'
                        ,notes = 'Person of Contact: Joby George' )


## Unit Test Execution

To run this script as a stand-alone component, separate from the trigger script, I wrote a simple if statement that checks the value of the variable unit_test, which is passed as a parameter in the trigger script.

If unit_test is one of the accepted values, the deck is sent out to recipients. If it is not, the slides remain part of the ppt object until they are finally sent out in the last command of the trigger script.

In [None]:
if unit_test in unit_test_acpt_values:
    ppt.createDeck(ppt_name = 'FPF_Losses_MBR ' + str(datetime.now(timezone("America/New_York")).strftime('%Y_%m_%d_%H')),
    email_to = recipients,
    email_from = dev_email,
    email_subject = 'FPF_Losses_MBR ' + str(datetime.now(timezone("America/New_York")).strftime('%Y_%m_%d_%H')),
    ppt_attach = True)
else:
    pass
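
The gating logic can be isolated into small functions that are easy to test outside Databricks. This is a sketch under assumptions: the accepted values shown are hypothetical (the real `unit_test_acpt_values` is defined elsewhere in the repository), and the timestamp is passed in explicitly so the example does not depend on the current time or on pytz.

```python
from datetime import datetime

# Hypothetical accepted values; the real list is defined elsewhere in the repository
UNIT_TEST_ACCEPTED_VALUES = {"y", "yes"}

def is_unit_test(widget_value, accepted=UNIT_TEST_ACCEPTED_VALUES):
    """Return True when the widget text signals a stand-alone run."""
    return widget_value.strip().lower() in accepted

def deck_name(prefix, now):
    """Build a deck name with an hour-level timestamp, matching the '%Y_%m_%d_%H' format."""
    return f"{prefix} {now.strftime('%Y_%m_%d_%H')}"

print(is_unit_test(" Y "))                   # True
print(is_unit_test("type Y for unit test"))  # False: the widget's default text is not accepted
print(deck_name("FPF_Losses_MBR", datetime(2021, 3, 1, 9, 30)))  # FPF_Losses_MBR 2021_03_01_09
```

Normalizing case and whitespace before the membership check means a stray space or a lowercase "y" in the widget does not silently skip the stand-alone run.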