# Detecting credit card fraud using machine learning: Part I

In this example, we'll create a model to help us detect credit card fraud. This is a classic application of machine learning, and it illustrates the type of work that goes into creating a useful machine learning model. In Part II, we'll host the model as an API endpoint using Google Cloud. In Part III, we'll create an end-to-end ML pipeline for model training, experimentation, deployment, and monitoring, using [this repo](https://github.com/DataTalksClub/mlops-zoomcamp) as a running resource.


## Motivation

Fraud detection is a classic example of a problem solved using classification algorithms. It's a good case study in basic machine learning development for the following reasons:

1. Real-world applications: Credit card fraud detection is a real-world problem.
2. Feature engineering: To make credit card data useful, it has to be transformed and manipulated in various ways.
3. Imbalanced datasets: Credit card fraud is (thankfully) a relatively rare occurrence, so detecting fraud requires managing imbalanced datasets.
4. Interpretability: since this problem is within a non-technical domain (finance), working on this project in industry will likely require talking with non-ML people. These people will likely be very interested in not only a model that can predict fraud, but also what the model looks for when it detects fraud. Therefore, we want a model that is interpretable.


## Setup and loading data


For our data, we'll be using [this](https://www.kaggle.com/datasets/mishra5001/credit-card/data) dataset from Kaggle, which is a sample dataset for credit card fraud detection.


Let's load our data and import the packages we need.


In [33]:
import pandas as pd

pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 5)
#pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)

In [2]:
df = pd.read_csv("application_data.csv")

In [34]:
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Notes:

- Convert all `DAYS_X` fields to their year equivalents (as floats, rounded to nearest hundredth)
- Figure out scheme for managing NaNs (e.g., for `OWN_CAR_AGE`, NaN means they don't have a car, so can possibly keep the NaNs and have a new indicator column).
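The two notes above could be sketched roughly as follows. This is a minimal illustration on a toy frame that reuses a few values from the output above, not a final preprocessing step. The helper names (`days_to_years`, `add_missing_indicator`), the `YEARS_` and `_MISSING` naming, and the 365.25 days-per-year constant are all assumptions made for this example, as is the reading of negative `DAYS_X` values as "days before the application."

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few column names and values from the output above.
sample = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "DAYS_EMPLOYED": [-637, -1188, -225],
    "OWN_CAR_AGE": [np.nan, np.nan, 26.0],
})

DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included


def days_to_years(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every DAYS_* column is replaced by a YEARS_* column."""
    out = frame.copy()
    day_cols = [c for c in out.columns if c.startswith("DAYS_")]
    for col in day_cols:
        years_col = "YEARS_" + col[len("DAYS_"):]
        # Assumption: negative values count days before the application,
        # so we flip the sign to get a positive number of years.
        out[years_col] = (-out[col] / DAYS_PER_YEAR).round(2)
    return out.drop(columns=day_cols)


def add_missing_indicator(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy with a 0/1 column flagging where `col` is NaN."""
    out = frame.copy()
    out[col + "_MISSING"] = out[col].isna().astype(int)
    return out


converted = add_missing_indicator(days_to_years(sample), "OWN_CAR_AGE")
print(converted)
```

Keeping the NaNs in `OWN_CAR_AGE` and adding an indicator column preserves the "no car" signal instead of hiding it behind an imputed value. Whether that is the right choice depends on the model we end up using.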


## What type of data do we have?


Let's now take a quick look at our data, starting with a few rows where `TARGET` is 1.


In [35]:
df[df["TARGET"] == 1].head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
26,100031,1,Cash loans,F,N,Y,0,112500.0,979992.0,27076.5,702000.0,Unaccompanied,Working,Secondary / secondary special,Widow,House / apartment,0.018029,-18724,-2628,-6573.0,-1827,,1,1,0,1,0,0,Cooking staff,1.0,3,2,MONDAY,9,0,0,0,0,0,0,Business Entity Type 3,,0.548477,0.190706,0.0165,0.0089,0.9732,,,0.0,0.069,0.0417,,0.0265,,0.0094,,0.0,0.0168,0.0092,0.9732,,,0.0,0.069,0.0417,,0.0271,,0.0083,,0.0,0.0167,0.0089,0.9732,,,0.0,0.069,0.0417,,0.027,,0.0096,,0.0,,block of flats,0.0085,Wooden,Yes,10.0,1.0,10.0,0.0,-161.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0.0,2.0,2.0
40,100047,1,Cash loans,M,N,Y,0,202500.0,1193580.0,35028.0,855000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.025164,-17482,-1262,-1182.0,-1029,,1,1,0,1,0,0,Laborers,2.0,2,2,TUESDAY,9,0,0,0,0,0,0,Business Entity Type 3,,0.306841,0.320163,0.1309,0.125,0.996,0.9456,0.0822,0.16,0.1379,0.25,0.2917,0.0142,0.1059,0.1267,0.0039,0.0078,0.1334,0.1297,0.996,0.9477,0.083,0.1611,0.1379,0.25,0.2917,0.0145,0.1157,0.132,0.0039,0.0082,0.1322,0.125,0.996,0.9463,0.0827,0.16,0.1379,0.25,0.2917,0.0144,0.1077,0.129,0.0039,0.0079,org spec account,block of flats,0.1463,"Stone, brick",No,0.0,0.0,0.0,0.0,-1075.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,2.0,0.0,4.0
42,100049,1,Cash loans,F,N,N,0,135000.0,288873.0,16258.5,238500.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.007305,-13384,-3597,-45.0,-4409,,1,1,1,1,1,0,Sales staff,2.0,3,3,THURSDAY,11,0,0,0,0,0,0,Self-employed,0.468208,0.674203,0.399676,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,-1480.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
81,100096,1,Cash loans,F,N,Y,0,81000.0,252000.0,14593.5,252000.0,Unaccompanied,Pensioner,Secondary / secondary special,Married,House / apartment,0.028663,-24794,365243,-5391.0,-4199,,1,0,0,1,0,0,,2.0,2,2,THURSDAY,10,0,0,0,0,0,0,XNA,,0.023952,0.720944,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
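Since imbalance is one of the reasons this problem is interesting, it's worth checking early how rare `TARGET == 1` actually is. Here is a small sketch of that check. The `class_balance` helper is a name made up for this example, and the toy labels are invented for illustration; they are not the dataset's real proportions.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.Series:
    """Share of rows in each class, sorted by class label."""
    return target.value_counts(normalize=True).sort_index()


# Toy labels for illustration only; NOT the real class proportions.
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="TARGET")

balance = class_balance(toy_target)
print(balance)
```

On the real data, the equivalent call would be `class_balance(df["TARGET"])`.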


What features do we have available in our data? We can look at the `columns_description.csv` file in order to see what the features are.


In [16]:
column_descriptions = pd.read_csv("columns_description.csv")

This describes the features in our dataset. For our use case, we'll only look at the data in `application_data.csv`.


In [17]:
column_descriptions = column_descriptions[column_descriptions['Table'] == "application_data"][["Row", "Description", "Special"]]

Broadly speaking, we can divide the data about these clients into the following categories:

1. Client demographics
2. Loan details: how much did they borrow? When did they borrow?
3. Occupation: what job do they have?
4. Family: is the client married? Do they have children?
5. Client assets and finances
6. Housing details: do they live in an apartment or in a house?
7. Documents filled out by the client: did the client fill out certain documents? Our data doesn't provide any more information on what these documents are.
   - It's unlikely that these will help us predict loan default rates, and even if they did, that wouldn't be helpful unless we had more context.
8. The building where the client lives
9. Details about the client's local area: in what city does the client live?
10. The type of people who are around the client: how many people are in the client's social circle? (How this is defined isn't really specified.)
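One rough way to get a feel for these categories is to count columns by name prefix, since many related columns share one (e.g., `AMT_`, `DAYS_`, `FLAG_`). The sketch below does this on a handful of column names copied from the output above. The `prefix_counts` helper is hypothetical, and prefixes only approximate the categories listed here, so treat the result as a starting point rather than a definitive grouping.

```python
from collections import Counter


def prefix_counts(columns, sep="_"):
    """Count columns by the text before the first separator."""
    return Counter(col.split(sep, 1)[0] for col in columns)


# A handful of column names copied from the `df.head()` output above.
example_columns = [
    "AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY",
    "DAYS_BIRTH", "DAYS_EMPLOYED",
    "FLAG_DOCUMENT_2", "FLAG_DOCUMENT_3", "FLAG_OWN_CAR",
    "NAME_HOUSING_TYPE", "OCCUPATION_TYPE",
]

counts = prefix_counts(example_columns)
print(counts.most_common())
```

On the full dataset, `prefix_counts(df.columns)` would give the same kind of summary across all columns.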


I've observed that data scientists often dive straight into feature extraction, imputing missing data, computing correlations, and so on without really spending time to understand the features, and I think this is a crucial mistake. A variety of factors can affect the result of your work: algorithm choice (e.g., stepwise regression is a greedy algorithm), data collection details (e.g., errors in gathering data, quirks in how the data is represented, data quality issues upstream), and relationships between features.

The hardest part of a data scientist's job is, in my experience, not the coding or the math or the machine learning, but in fact the critical thinking needed to use these tools correctly. Data science solves business problems, so it helps to step back and have some understanding of the problem being solved.


Our output variable


## Exploratory data analysis through hypothesis generation

For data science projects, I believe it's most helpful to have a set of possible hypotheses to guide your exploration. A good data scientist thinks not only about creating good models, but also about the business use case. Having these hypotheses gives us meaningful starting points for our exploration and helps us learn more about what factors correlate with credit card default rates.

Put differently, before starting exploration, it's a good idea to think about what "makes sense" and what "should" correlate with our output variable, and to start our exploration from there. For example, what factors would logically relate to the probability that a person defaults on their loans?
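As a sketch of what checking one such hypothesis might look like, suppose we guess that the rate of `TARGET == 1` differs by income type. We can then compare the rate across groups while keeping an eye on group sizes. The `rate_by_group` helper is a name invented for this example, and the toy rows below are made up; they say nothing about the real dataset.

```python
import pandas as pd


def rate_by_group(frame, group_col, target_col="TARGET"):
    """Mean of a 0/1 target within each group, plus the group size."""
    summary = frame.groupby(group_col)[target_col].agg(rate="mean", n="size")
    return summary.sort_values("rate", ascending=False)


# Made-up rows for illustration; the category labels are taken from the output above.
toy = pd.DataFrame({
    "NAME_INCOME_TYPE": [
        "Working", "Working", "Working",
        "Pensioner", "Pensioner", "State servant",
    ],
    "TARGET": [1, 0, 1, 0, 1, 0],
})

summary = rate_by_group(toy, "NAME_INCOME_TYPE")
print(summary)
```

Reporting `n` next to the rate matters: a large difference in rate on a tiny group is weak evidence for a hypothesis.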


### A brief discussion on data assumptions in machine learning


#### Is past behavior predictive of future behavior?

One of the fundamental assumptions of machine learning is that past behaviors are predictive of future behaviors. This assumption is not always true, and there are plenty of real-world examples where using machine learning without domain knowledge leads to flawed takeaways. For example, AI algorithms are often used in the criminal justice system to predict the probability of future crime, but this use is not without its [criticism](https://www.technologyreview.com/2019/01/21/137783/algorithms-criminal-justice-ai/). Some argue that the algorithms predict future policing rather than future crime: they are trained on past crimes caught by police, and crimes are much more likely to be caught in areas of high policing, so the algorithms end up predicting which areas will be heavily policed in the future (i.e., ["predictive policing"](https://law.yale.edu/sites/default/files/area/center/mfia/document/infopack.pdf)).

Credit loans are no different, which is why, for example, sensitive demographic information like race and ethnicity are not encoded as they have been used []() (though it is surprising that gender is still encoded in the dataset), although some researchers argue that coding for blindness can (). These types of concerns fall more broadly into the domain of algorithmic bias and fairness, and are points to consider when understanding the

#### Current data is a result of decisions made by past ML algorithms

In the current dataset, the choice of clients to

(past algorithms likely curated the data, so we're seeing the results of filtering by previous data)

(also: AI models will just perpetuate any biases in who loans are given to - I need to find sources for this but I'm pretty sure that biased loan giving is well-known).

ML algorithms reflect the data that they are trained on. As a result, it is important to better understand how the training data came to be what it is.

#### We can't measure what we can't see

We have a dataset of information pertaining to clients who have been approved for loans, and we are asked to predict credit default rates. But, importantly, we do not have information about which clients **_did not_** get approved for loans. Presumably a different ML algorithm predicts which clients should and should not get approved for loans, and the results of that algorithm explicitly filter the input data into our own algorithm.
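To make the filtering effect concrete, here is a small simulation. Everything in it is hypothetical: the latent `risk` score, the way defaults are drawn from it, and the approval cutoff are assumptions chosen only to show the mechanism, not a description of how this dataset was actually produced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical latent risk score in [0, 1]; higher means more likely to default.
risk = rng.uniform(0.0, 1.0, size=n)

# Hypothetical outcome: each applicant defaults with probability 0.3 * risk.
defaulted = rng.uniform(0.0, 1.0, size=n) < 0.3 * risk

# Hypothetical upstream rule: only applicants with risk below 0.5 are approved.
approved = risk < 0.5

rate_all = defaulted.mean()
rate_approved = defaulted[approved].mean()

print(f"default rate, all applicants: {rate_all:.3f}")
print(f"default rate, approved only:  {rate_approved:.3f}")
```

In this setup, the approved-only default rate is lower than the rate across all applicants, and the approved group never contains anyone with `risk` above the cutoff. A model trained only on approved clients therefore sees neither the true base rate nor any examples from the riskiest part of the population.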

How this affects the data depends on how our model will be used. Our problem statement is to create a model that can predict whether or not someone will default on their loan. After building this, will our model be used to give closer scrutiny to the clients who are at risk? (ADD MORE)

#### Judging based on behaviors vs. characteristics

> Our law punishes people for what they do, not who they are. Dispensing punishment on the basis of an immutable characteristic flatly contravenes this guiding principle - **Supreme Court Chief Justice John Roberts, in a [2017 death penalty case](https://www.nytimes.com/2017/02/22/us/politics/duane-buck-texas-death-penalty-case-supreme-court.html)**

[something something argument]

This belief is at the root of why, for example, characteristics such as race, gender, and sexual orientation are considered protected classes - immutable characteristics shouldn't be used in cases [...], especially if they are used to negatively affect historically marginalized groups. In fact, many relationships in the data linking these protected classes to certain outcomes are due to problems such as a lack of equity, diversity, and inclusion towards marginalized groups. Examples include Amazon's [hiring algorithm](https://www.ml.cmu.edu/news/news-archive/2016-2020/2018/october/amazon-scraps-secret-artificial-intelligence-recruiting-engine-that-showed-biases-against-women.html) that favored male names over female names, [facial recognition systems](https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212) performing poorly on black faces (and especially on black women), and [AI-powered lending practices compounding past and present biases](https://www.brookings.edu/articles/credit-denial-in-the-age-of-ai/) in credit lending against [marginalized groups](https://www.brookings.edu/articles/reducing-bias-in-ai-based-financial-services/).


### What factors correlate to loan default rates?

A quick Google search says that the following factors can increase the odds of a person defaulting on their loans:

1. factor 1
2. factor 2


### What are some things that people who default on their loans generally have in common?

We can inspect our data...


### What kind of people default on their loans?

There are likely multiple different kinds of people who default on their loans, and we can figure out, through exploration, what characteristics these people have. Can we figure out the different types of people that default on their loans? We can create "personas" for the kinds of people that default on their loans, which helps us not only build a more useful model, but also better understand our problem space.

- **Author's Note**: This has the added benefit of giving us the option of splitting up our problem into subproblems and, for example, creating different models for each subgroup, as well as considering interactions between different factors (e.g., factor X may not be predictive of loan defaults unless factor Y is present).
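The interaction idea in the note above can be sketched with a two-way table of rates. In the made-up rows below, owning realty only makes a difference among clients without a car. The values are invented to show the pattern and say nothing about the real data. The column names are borrowed from the dataset purely for familiarity.

```python
import pandas as pd

# Made-up rows constructed so that realty ownership only matters without a car.
toy = pd.DataFrame({
    "FLAG_OWN_CAR":    ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    "FLAG_OWN_REALTY": ["Y", "Y", "N", "N", "Y", "Y", "N", "N"],
    "TARGET":          [0, 0, 0, 0, 0, 0, 1, 1],
})

# Rate of TARGET == 1 for each combination of the two factors.
rates = toy.pivot_table(
    index="FLAG_OWN_CAR",
    columns="FLAG_OWN_REALTY",
    values="TARGET",
    aggfunc="mean",
)
print(rates)
```

Reading across a row shows the effect of one factor while holding the other fixed. If the rows look very different from each other, the factors interact, and a persona or a per-subgroup model may capture that better than either factor alone.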


#### Person 1:


#### Person 2:


## Data preprocessing


## Dealing with data imbalances


## Model development


## Model evaluation and iteration


## Model interpretation


## Summary and next steps
