In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("p01.ipynb")

# Project 1 – Loan Applications

## DATA 2201, Fall 2024

### Checkpoint Due Date (Questions 1, 2, and 4): Tuesday, October 22nd
### Due Date: Monday, October 28th

## Instructions

---

### Working on the Project

* **For the Checkpoint, which is required, you only need to turn in a `p01.ipynb` containing solutions for Questions 1, 2, and 4!**
    - The "Project 1 Checkpoint" autograder on Gradescope does not thoroughly check your code – it only runs the public tests on Questions 1, 2, and 4 to make sure that you have completed them. There are no hidden tests for the checkpoint, and you will see your score upon submission. 
    - When you submit the final version of the project, however, we will use hidden tests to check your answers more thoroughly.
    - Note that this means you will ultimately have to submit the project twice – once to the "Project 1 Checkpoint" autograder (Questions 1, 2, and 4 only), and once to the "Project 1" autograder (once you're fully done).

- **Do not change the function names!** Your functions are how your assignment is graded, and they are identified by name.
- You are encouraged to write your own additional helper functions to solve the project, as long as they also end up in `p01.ipynb`.

### Warning!

Many questions in the project intentionally build off of each other and the final result matters! In fact, you can "get a question correct," but only receive partial credit for it because a previous answer was wrong.

For any questions that involve rounding numbers, please be aware that `np.round()` and `round()` behave differently. For this assignment, use `round()` when needed, as it is the function we use in the tests.
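
As a rough illustration of that difference, the sketch below compares the two functions on values chosen purely for demonstration (they are not from the dataset). The built-in `round()` returns a plain Python number, while `np.round()` returns a NumPy value and works elementwise on arrays; the two can also disagree in the last digit because they round in slightly different ways.

```python
import numpy as np

# Illustrative value only; it does not come from `loans`.
x = 2.675

builtin_result = round(x, 2)   # built-in: rounds the stored binary value of x
numpy_result = np.round(x, 2)  # NumPy: scales by 100, rounds, then divides by 100

print(builtin_result)          # 2.67, because 2.675 is stored as slightly less than 2.675
print(numpy_result)            # may differ from the built-in result in the last digit

# The return types differ too.
print(type(round(2.5)))        # <class 'int'>
print(type(np.round(2.5)))     # a NumPy floating-point type

# np.round works elementwise on arrays; round() does not accept a list.
print(np.round(np.array([1.234, 5.678]), 1))
```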



In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)

from IPython.display import display


def create_kde_plotly(df, group_col, group1, group2, vals_col, title=''):
    fig = ff.create_distplot(
        hist_data=[df.loc[df[group_col] == group1, vals_col], df.loc[df[group_col] == group2, vals_col]],
        group_labels=[group1, group2],
        show_rug=False, show_hist=False
    )
    return fig.update_layout(title=title)

# Imports for Plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pd.options.plotting.backend = 'plotly'

# DATA2201 preferred styles
pio.templates["data2201"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=False,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+data2201"

import plotly.figure_factory as ff

### Plotting 

As you have seen in class, there are many libraries for doing plotting. You have seen `matplotlib`, `seaborn`, and `plotly` all used in class and various assignments. 

For this project you will focus on using `matplotlib` and `seaborn`, although you will see `plotly` being used in some of the example plots shown. 

## About the Assignment

[LendingClub](https://www.lendingclub.com/) is a platform that allows individuals to borrow money – that is, take on **loans**. They've made available a massive dataset with information on millions of loans they've processed. The entire dataset is over 300 MB, so we won't work with it here – instead, we'll work with a sample from it.

Run the cell below to load in this sample from the file `data/loans.csv`.

In [None]:
loans_path = Path('data') / 'loans.csv'
loans = pd.read_csv(loans_path)
loans.head()

Each row of the dataset corresponds to a different loan that LendingClub approved and paid out. Some of the key columns are:
- `'loan_amnt' (float)`: the amount of the loan, or how much the borrower borrowed.
- `'issue_d' (str)`: the date on which the loan was issued.
- `'term' (str)`: the length of the loan, that is, the amount of time the borrower has to pay the loan back.
- `'int_rate' (float)`: the interest rate the borrower will pay on their loan amount.

First, it's worth exploring the distribution of loan amounts. You'll see that the largest possible loan given out through LendingClub was \\$40,000.

In [None]:
loans['loan_amnt'].describe()

Note that when a borrower is approved for a loan, they are presented with multiple offers with different loan terms and interest rates. **Larger interest rates make the loan more expensive for the borrower – as a borrower, you want a lower interest rate!** You'll note that even for the same loan amount, different borrowers were approved for different terms and interest rates. Take a look below:

In [None]:
loans.loc[loans['loan_amnt'] == 3600, ['loan_amnt', 'term', 'int_rate']]

So, why do different borrowers receive different terms and interest rates, despite asking for the same amount of money? The factors that influence loan offers are complex, but it's [known](https://www.bankofamerica.com/smallbusiness/resources/post/factors-that-impact-loan-decisions-and-how-to-increase-your-approval-odds/) that some of the key factors are employment status, annual income, and credit score, among other things.

In this project, we will **explore how various borrower characteristics are related to one another**, in an attempt to better understand the complexity behind loans. It's important to remember, though, that this dataset only contains information about actually approved loans, **not** all loan applications.

---

<a name='outline'></a>

### Navigating the Project

Click on the links below to navigate to different parts of the project. 


- [Part 1: Understanding Lender Decision-Making](#part1)
    - [Question 1 (Checkpoint Question)](#Question-1)
    - [Question 2 (Checkpoint Question)](#Question-2)
    - [Question 3](#Question-3)
    - [Question 4 (Checkpoint Question)](#Question-4)
- [Part 2: Calculating Disposable Incomes](#part2)
    - [Question 5](#Question-5)
    - [Question 6](#Question-6)
    - [Question 7](#Question-7)
    - [Question 8](#Question-8)
    - [Question 9](#Question-9)
    


<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

<a name='part1'></a>

## Part 1: Understanding Lender Decision-Making

([return to the outline](#outline))

As mentioned above, the dataset we have only has information about approved loans – we don't have information about individuals who applied for loans and weren't approved. That means that we can't directly study what distinguishes successful loan applications from unsuccessful ones.

We'll start by understanding the **quantitative risks** that lenders assess when deciding what terms and interest rates to give to borrowers. Lenders typically charge higher interest rates to borrowers they perceive as high-risk. This practice serves several purposes: it offsets potential losses from unpaid debts, discourages excessively risky lending, and ensures that the higher returns from these loans can cover defaults. 

The first quantitative metric we'll look at is **debt-to-income (DTI) ratio**. Understanding the impact of such a metric on interest rates helps us in evaluating how lenders quantify risk.

<!-- The realm of lender decision-making is complex and multifaceted, influenced by a blend of traditional and non-traditional metrics. Our motivation is to delve into this domain and comprehend financial indicators, starting with: 
 -->
 
<!-- **Quantitative Risks:** Traditional metrics like **Debt-to-Income (DTI)** ratio and annual income are quantifiable and offer a concrete basis for assessing a borrower's financial health. 
 -->
<br/>

$$
\text{DTI} = \left( \frac{\text{Total Monthly Debt Payments}}{\text{Gross Monthly Income}} \right) \cdot 100 
$$

<br/>

Note that you don't need to calculate DTIs – they are already provided for us in the `'dti'` column of `loans`.


A low DTI indicates that a borrower is less likely to face financial strain from taking on additional debt, making them a lower risk to lenders. Conversely, a high DTI may signal financial overextension, suggesting a higher risk for default – that is, a higher risk that the borrower won't pay back the loan. 
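
To make the formula concrete, here is a minimal sketch for a hypothetical borrower. The dollar amounts are made up for illustration; in the project itself, DTIs are read directly from the `'dti'` column.

```python
# Hypothetical borrower; these numbers are invented for illustration.
monthly_debt_payments = 1_500.0  # total monthly debt payments, in dollars
gross_monthly_income = 6_000.0   # gross monthly income, in dollars

# DTI is the ratio of the two, expressed as a percentage.
dti = (monthly_debt_payments / gross_monthly_income) * 100
print(dti)  # 25.0
```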

When a lender perceives a higher risk of default, they often charge a higher interest rate. Let's see if this correlation is present in our dataset by looking at a scatter plot of interest rate (`'int_rate'`) vs. DTI (`'dti'`).

In [None]:
# Note: If the plot below doesn't appear, uncomment and run the following line.
# It will make all plotly plots render in a new browser tab.
# This is more inconvenient, but should bypass any rendering issues.

# pio.renderers.default = 'browser'

In [None]:
sample_set = loans.sample(200, random_state=1)

fig = px.scatter(sample_set, x='dti', y='int_rate', trendline='ols',
                 labels={'dti': 'Debt-to-Income Ratio', 
                         'int_rate': 'Interest Rate (%)'},
                 trendline_color_override='orange',
                 title='Interest Rate vs. Debt-to-Income Ratio')

fig.show()

In [None]:
# The same plot can be created in "seaborn"
sns.regplot(data=sample_set, x='dti', y='int_rate')
plt.title("Interest Rate vs. Debt-to-Income Ratio")
plt.xlabel("Debt-to-Income Ratio")
plt.ylabel("Interest Rate (%)");

<br>

--- 

### Question 1 (Checkpoint Question)

<a name='Question-1'></a>

([return to the outline](#outline))

We'll work with DTIs shortly, but first, we need to clean the dataset. Complete the implementation of the function `clean_loans`, which takes in a DataFrame like `loans` and returns a new DataFrame where:

- The `'issue_d'` column contains `pd.Timestamp` objects rather than strings.
- The `'term'` column contains ints rather than strings.
- The `'emp_title'` column is cleaned such that:
    - All employment titles are lowercase.
    - Leading and trailing whitespaces are removed.
    - The `'rn'` title is replaced with `'registered nurse'`. Note that there are other titles that include `'rn'` as part of a larger string, like `'clinical rn'` or `'attorney'`; for simplicity, **don't** replace `'rn'` in these titles with `'registered nurse'`. Instead, only replace titles that are exactly `'rn'` with `'registered nurse'`. (This means that you shouldn't use `.str.replace` to do your replacement here!)  
- There is a new column, `'term_end'`, which contains the date (as a `pd.Timestamp` object) on which each loan is fully paid.
    - ***Hint***: Use `pd.DateOffset`.
    - ***Hint***: Consider using `apply` with a lambda function. 

If you do the cleaning correctly, the three most common employment titles and frequencies in `loans` should be:

```py
teacher                 415
registered nurse        319
nurse                   112
```
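
Two of the points above can be illustrated on toy data. The sketch below uses made-up titles and a made-up issue date (none of it comes from `loans`): it shows why substring replacement is the wrong tool for the `'rn'` requirement, and how `pd.DateOffset` shifts a `pd.Timestamp` by calendar months.

```python
import pandas as pd

# Hypothetical titles, for illustration only.
titles = ['rn', 'clinical rn', 'attorney']

# Substring replacement rewrites every 'rn', even inside other words.
substring_replaced = [t.replace('rn', 'registered nurse') for t in titles]
print(substring_replaced)
# ['registered nurse', 'clinical registered nurse', 'attoregistered nurseey']

# pd.DateOffset shifts a Timestamp by calendar months, not a fixed number of days.
start = pd.Timestamp('2015-01-15')      # made-up issue date
end = start + pd.DateOffset(months=36)  # made-up 36-month term
print(end)                              # 2018-01-15 00:00:00
```

Only the first replaced title is the intended result, which is why the replacement should apply only to titles that are exactly `'rn'`.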

In [None]:
def clean_loans(loans): 
    ...
    

Run the cell below to call `clean_loans` on `loans`. Make sure to run this cell before moving forward, otherwise the tests won't work correctly.

In [None]:
loans = pd.read_csv(loans_path)
loans = clean_loans(loans)
loans.head()

In [None]:
grader.check("q1")

Now that we've cleaned `loans`, we can easily do things like plot the number of loans per year in `loans`:

In [None]:
(
    loans['issue_d'].dt.year
    .value_counts()
    .sort_index()
    .plot(kind='line', 
          labels={'index': 'Year', 'value': 'Frequency'},
          title='Number of Loans Granted Per Year<br>(In Sample of Loans)')
    .update_layout(showlegend=False)
)

<br>

--- 

### Question 2 (Checkpoint Question)

<a name='Question-2'></a>

([return to the outline](#outline))

As mentioned at the start of Part 1, lenders give higher interest rates to borrowers they perceive as high-risk. In this question, we'll measure the **correlation** between interest rate (`'int_rate'`) and various other quantitative features: debt-to-income ratio (`'dti'`), annual income (`'annual_inc'`), credit score (`'fico_range_low'`), as well as loan length (`'term'`).

We've discussed debt-to-income ratios already, and annual incomes and loan lengths are easy to interpret, but credit scores might be new to you. The general idea behind credit scores is simple: the higher a borrower's credit score is, the more "trustworthy" they appear to lenders. FICO, short for the Fair Isaac Corporation, is a private organization that computes credit scores for lenders. FICO credit scores range from 300 (very poor) to 850 (excellent).

The DataFrame we have access to has two columns involving FICO scores: `'fico_range_low'` and `'fico_range_high'`. For almost all rows in `loans`, the value for `'fico_range_high'` is just 4 points higher than `'fico_range_low'`, so both columns essentially contain the same information:


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>fico_range_low</th>
      <th>fico_range_high</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>700.0</td>
      <td>704.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>680.0</td>
      <td>684.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>705.0</td>
      <td>709.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>670.0</td>
      <td>674.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>735.0</td>
      <td>739.0</td>
    </tr>
  </tbody>
</table>
<center><i>The first few rows of <code>loans</code>.</i></center>

**So, for simplicity, "credit scores" and "FICO scores" will always refer to the values in the `'fico_range_low'` column for the rest of the project. Don't use `'fico_range_high'` for anything!** 

<br>

Back to the task at hand. To help us measure the correlation between interest rates and other quantitative features, complete the implementation of the function `correlations`, which takes in:
- a DataFrame, `df`, and
- a list, `pairs`, of **tuples**, each of which contains the names of two columns in `df`.

`correlations` should return a **Series** that has the same length as `pairs`, containing the correlation between each specified pair of columns. The values in the index of the Series should be strings of the form `'r_col1_col2'`, where `'col1'` and `'col2'` are the names of two columns, in the same order they appear in the input tuple.

Example behavior is given below.

```py
>>> correlations(loans, [
    ('dti', 'int_rate'),
    ('annual_inc', 'mths_since_last_delinq')
])

r_dti_int_rate                         ???
r_annual_inc_mths_since_last_delinq    ???
dtype: float64
```

Of course, your Series will have values that are numbers, not ???.

The correlation you compute should be the Pearson correlation; we will talk more about correlation in a few weeks. Pearson's correlation coefficient, $r$, is a measure of **linear** association. It is based on standard units and ranges from -1 to 1. When $r = 1$, the scatter plot of the two variables is a perfect straight line sloping up. When $r = -1$, the scatter plot is a perfect straight line sloping down. When $r = 0$, the two variables are said to be uncorrelated, meaning they have no linear association. You won't have to compute it manually, though – there are a variety of built-in `pandas` and `numpy` methods which compute it for you directly.
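
For intuition only, the definition in terms of standard units can be sketched in plain Python on toy data. The helper names `standard_units` and `pearson_r` are hypothetical and are not part of the project; in your solution, the built-in methods mentioned above are the better choice.

```python
from math import sqrt

def standard_units(values):
    """Convert values to standard units using the population standard deviation."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    """Mean of the products of xs and ys, each measured in standard units."""
    su_x, su_y = standard_units(xs), standard_units(ys)
    return sum(a * b for a, b in zip(su_x, su_y)) / len(xs)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])                  # perfect line sloping up
r_down = pearson_r(x, [10, 8, 6, 4, 2])                # perfect line sloping down
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # clear pattern, but not linear

print(r_up)     # approximately 1
print(r_down)   # approximately -1
print(r_curve)  # approximately 0
```

The last case is a reminder that $r$ near 0 rules out a linear association, not every kind of association.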

***Note***: Make sure to test your function on DataFrames other than `loans`!

In [None]:
def correlations(df, pairs):
    ...
    

Run the cell below to call `correlations` on `loans`, to find the correlations between debt-to-income ratio and interest rate, annual income and interest rate, FICO credit score and interest rate, and loan term and interest rate. Make sure to run this cell before moving forward, otherwise the tests won't work correctly.

In [None]:
q2_correlations = correlations(loans, [
    ('dti', 'int_rate'),
    ('annual_inc', 'int_rate'),
    ('fico_range_low', 'int_rate'),
    ('term', 'int_rate')
])
q2_correlations

In [None]:
grader.check("q2")

Run the cell below to draw a bar chart of the four correlations you computed, stored in `q2_correlations`.

In [None]:
(
    q2_correlations
    .plot(kind='barh', 
          title='Correlation Between Interest Rate and Various Quantitative Features',
          labels={'index': 'Pair of Columns', 'value': 'Correlation'}
         )
    .update_layout(showlegend=False)
)

You should notice that of the four features analyzed, credit scores are most strongly correlated with interest rates, though term lengths also seem to be quite correlated. Let's first take a look at a scatter plot of interest rate vs. credit score.

In [None]:
px.scatter(loans, x='fico_range_low', y='int_rate',
           labels={'fico_range_low': 'Credit Score', 
                   'int_rate': 'Interest Rate (%)'},
           title='Interest Rate vs. Credit Score')

There's a lot of overplotting here, meaning that many points are being plotted on top of one another. It does indeed seem that as credit scores increase, interest rates tend to decrease on average, but perhaps there's a better way to visualize this information.


One idea is to place credit scores into categories by **binning** them. According to [Experian](https://www.experian.com/blogs/ask-experian/credit-education/score-basics/what-is-a-good-credit-score/#s1), one of the three major credit bureaus in the US, FICO credit scores are described qualitatively as follows:

| Score | Category |
|---|---|
| 580 - 669 | Fair |
| 670 - 739 | Good |
| 740 - 799 | Very Good |
| 800 - 850 | Excellent |

There is actually also a bin below fair, named "poor" with a range of 300-579, but since `loans` doesn't have any poor credit scores, we'll exclude them from our exploration here. Note that while the `dtype` of `'fico_range_low'` is `float`, credit scores are actually integers.

Once we place credit scores into bins, we can visualize the distribution of interest rates separately for each credit score bin. Here, that would allow us to draw four separate distributions of interest rates – one for the fair group, one for the good group, one for the very good group, and one for the excellent group. Each of those four distributions is a **numerical distribution**, which we have several tools for visualizing; the most common tool you've seen is the histogram, but others exist, like the boxplot and violin plot. Let's explore this idea further!

<!-- BEGIN QUESTION -->

<br> 

---

### Question 3

([return to the outline](#outline))

You will create a **boxplot describing the distribution of interest rates, separately for each of the four credit score bins mentioned above, and separately for the two loan lengths**. Here's an example of the plot you'll need to create:

<center><img src="imgs/q3_sample.png" width=60%></center>

To create your figure, you'll use seaborn's `boxplot` or `catplot` function and provide several arguments (the example above used `catplot`).

Before creating the figure, though, you'll need to place credit scores into bins. There's a `pandas` function that will be helpful here. **Make sure the bins match those in our example plot exactly – inclusive of the left endpoint and exclusive of the right endpoint.** You'll need to create a new column `fico_range_low_cat` that stores this binned information. You can assume that nobody has an exact credit score of 850. 
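
To clarify what "inclusive of the left endpoint and exclusive of the right endpoint" means, here is a small standard-library sketch. It is not the `pandas` function hinted at above, and `score_category` is a hypothetical helper; it only demonstrates which bin a score on a boundary falls into.

```python
from bisect import bisect_right

# Bin edges and labels taken from the Experian table; each bin is [left, right).
edges = [580, 670, 740, 800]
labels = ['Fair', 'Good', 'Very Good', 'Excellent']

def score_category(score):
    """Return the label of the left-inclusive, right-exclusive bin containing score."""
    if score < 580 or score >= 850:
        raise ValueError('score is outside the bins considered here')
    return labels[bisect_right(edges, score) - 1]

boundary_scores = [669, 670, 739, 740, 800]
categories = [score_category(s) for s in boundary_scores]
print(categories)
# ['Fair', 'Good', 'Good', 'Very Good', 'Excellent']
```

Note that 670 lands in "Good" rather than "Fair", because 670 is the left (inclusive) endpoint of the "Good" bin.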

Here are some additional requirements for you to get full credit on your boxplot:
- Make sure your axis labels, legend labels, and title are similar to ours.
- You **must** change the colors of the two terms from the default colors to something else. We chose purple and gold. To do this, you'll need to manually specify what color you want for the `36` group and the `60` group; it is fine to hard-code these two term lengths when creating your plot.
 

In [None]:
# create new column of data 'fico_range_low_cat'
...


# create a boxplot using seaborn
...
    

<!-- END QUESTION -->

If you created your boxplot correctly, you should have seen a few things:
- As borrowers' credit scores increase, both the median and variance in interest rates tend to decrease.
- Across the spectrum of credit scores, 60 month loans tend to have higher interest rates than 36 month loans.

You might wonder why longer loans have higher interest rates. From [The Motley Fool](https://www.fool.com/the-ascent/personal-loans/longer-repayment-terms-personal-loans-pros-cons/#:~:text=A%20longer%20term%20is%20riskier,charge%20a%20higher%20interest%20rate.):

> A longer term is riskier for the lender because there's more of a chance interest rates will change dramatically during that time. There's also more of a chance something will go wrong and you won't pay the loan back. Because it's a riskier loan to make, lenders charge a higher interest rate.

Good to know!

Now that we've investigated the role of some of the quantitative factors behind interest rates, let's look at some of the more subjective factors. Take a look at the following personal statement, for example:

> i recently proposed to my girlfriend of almost 8 yrs now and everything was going well untill our pug (ody) the middle of my three dogs started limping around and stumbling all the time. well come to find out he has a tumor on his spine. not very good news for us as our dogs are pretty much our children. so the reason for my loan request is the money i spent on the engagement ring was most of my savings and then i had to take out paydayloans loans for the mylogram bill, wich is similar to an MRI. $2,700 along with meds, visits etc.

You suspect that, perhaps, loans that included personal statements in their applications were given higher interest rates than loans that didn't include personal statements in their applications. This is true in `loans`:

In [None]:
display(loans.assign(has_ps=loans['desc'].notna()).groupby('has_ps')['int_rate'].describe())
create_kde_plotly(
    loans.assign(has_ps=loans['desc'].notna()),
    'has_ps',
    True,
    False,
    'int_rate',
    title='Distributions of Interest Rates<br>Based on Inclusion of Personal Statement'
)

But remember, `loans` is just a sample from a much larger population of loan applications. Is this observed difference statistically significant? Let's perform a permutation test.

<br>

---


### Question 4 (Checkpoint Question)

<a name='Question-4'></a> 

([return to the outline](#outline))

#### `ps_test`

Complete the implementation of the function `ps_test`, which takes in two arguments – a DataFrame like `loans` and a number `N` of repetitions – and returns the p-value for the following permutation test:

- **Null Hypothesis**: Interest rates given to applications with personal statements are the same, on average, as the interest rates given to applications without personal statements.
- **Alternative Hypothesis**: Interest rates given to applications with personal statements are larger on average than interest rates given to applications without personal statements.

As your test statistic, use the **difference in group means (with statement mean minus without statement mean)**.
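
As a reminder of the general shape of a one-sided permutation test with this statistic, here is a pure-Python sketch on made-up numbers. The function `permutation_p_value` and the toy data are hypothetical; your `ps_test` works on a DataFrame and will look different.

```python
import random

def diff_in_means(values, labels):
    """Mean of values labeled True minus mean of values labeled False."""
    with_label = [v for v, lab in zip(values, labels) if lab]
    without_label = [v for v, lab in zip(values, labels) if not lab]
    return sum(with_label) / len(with_label) - sum(without_label) / len(without_label)

def permutation_p_value(values, labels, n_reps, seed=0):
    """Fraction of shuffles whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = diff_in_means(values, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_reps):
        rng.shuffle(shuffled)  # break any link between labels and values
        if diff_in_means(values, shuffled) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_reps

# Made-up interest rates: the first four "have" a statement, the last four do not.
toy_rates = [14.2, 13.1, 15.0, 12.8, 9.9, 10.4, 11.0, 9.5]
toy_has_statement = [True] * 4 + [False] * 4
toy_p = permutation_p_value(toy_rates, toy_has_statement, n_reps=2000)
print(toy_p)  # a small value, since the two toy groups are clearly separated
```

Because the alternative hypothesis is one-sided ("larger on average"), only shuffled statistics that are greater than or equal to the observed statistic count toward the p-value.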

<br>


#### `missingness_mechanism`

While not stated above explicitly, we can interpret our permutation test as one that assesses whether the missingness of a personal statement is dependent on interest rate.

Run your `ps_test` function with 5000 repetitions. Given the p-value you saw, what do you believe is the most likely missingness mechanism of the personal statement column, assuming we've narrowed down the missingness mechanism to these two options?

1. Missing completely at random (MCAR).
1. Missing at random (MAR) dependent on interest rate.

Complete the implementation of the function `missingness_mechanism`, which takes in no arguments and returns either 1 or 2, corresponding to your answer to the question above.

<br>

#### `argument_for_nmar`

In the function above, we had you use your permutation test to decide between MCAR and MAR as the likely missingness mechanism for personal statements. But, you could make an argument that personal statements are not missing at random (NMAR), too.

Complete the implementation of the function `argument_for_nmar`, which takes in no arguments and returns a **string** with a one-sentence justification as to why personal statements may be not missing at random.

***Note***: We may manually grade your answer to `argument_for_nmar` in your final submission – passing the autograder tests for `argument_for_nmar` is not necessarily sufficient!

In [None]:
def ps_test(df, N):
    ...
    

In [None]:
def missingness_mechanism():
    return ...
    

In [None]:
def argument_for_nmar():
    return ...
    

In [None]:
np.random.seed(42)
loans = loans.assign(has_ps=loans['desc'].notna())
q4_pval = ps_test(loans, 5000)
q4_miss = missingness_mechanism()
q4_arg = argument_for_nmar()

In [None]:
grader.check("q4")

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

<a name='part2'></a>


## Part 2: Calculating Disposable Incomes

([return to the outline](#outline))

In Part 1, we focused on understanding how the interest rates LendingClub chose to give to borrowers depended on different aspects of a borrower's application. In Part 2, we'll focus on understanding borrowers' **disposable incomes**, or net incomes:

$$
\text{Disposable Income} = \text{Gross Income} - \text{Federal Income Tax} - \text{State Income Tax} 
$$

This is a minor simplification, because [some cities](https://www.stlouis-mo.gov/government/departments/comptroller/initiatives/us-cities-that-levy-earnings-taxes.cfm) also collect city-specific income taxes, but we'll ignore those here. Note that the `'annual_inc'` column in `loans` contains each borrower's gross income – that is, their income before taxes are removed.

Understanding a borrower's disposable income enables lenders to evaluate a borrower's ability to meet loan obligations, ensuring that the risk of failing to pay back the loan (or **defaulting**) is minimized. In other words, knowing a borrower's disposable income allows lenders to structure loan payments in a way that aligns with the borrower's cash flow, reducing the likelihood of missed payments.

The United States, like many countries, uses a progressive tax bracket system. This means that as your earnings increase, the percentage of your earnings you owe in tax also increases. In addition, the US tax system uses marginal tax brackets – what this means is that US taxpayers pay different tax percentages on different "chunks" of their earnings.

Here's how Part 2 is structured:
- In Question 5, you'll define a general purpose function that takes in a gross income and an arbitrary tax bracket and returns the amount of tax owed.
- In Questions 6 and 7, you'll clean a DataFrame that contains tax brackets for different states so that the brackets are in a format that you can use with your function from Question 5.
- And finally, in Question 8, you'll compute the amount of disposable income each borrower has.

Let's get started!

<br>

---

### Question 5

<a name='Question-5'></a>

([return to the outline](#outline))

Let's use the following example to illustrate how tax brackets work. You may have actually seen the same example before, but for reasons you're about to see, your implementation will be a bit more complicated here.

| Tax Rate | Tax Bracket |
| --- | --- |
| 10% | [\\$0, \\$11,000] |
| 12% | (\\$11,000, \\$44,725] |
| 22% | (\\$44,725, \\$95,375] |
| 24% | (\\$95,375, \\$182,100] |
| 32% | (\\$182,100, \\$231,251] |
| 35% | (\\$231,251, \\$578,125] |
| 37% | Over \\$578,125        |


If someone has a gross, or **taxable**, income of \\$75,000, we say they are in the 22% tax bracket. However, such an individual doesn't owe 22% of \\$75,000 in taxes. Instead, they owe:
- 10% of \\$11,000, **plus**
- 12% of \\$33,725 (which is \\$44,725 - \\$11,000), **plus**
- 22% of \\$30,275 (which is \\$75,000 - \\$44,725).

More concretely, their tax owed is
$$0.1 \cdot \$11{,}000 + 0.12 \cdot \$33{,}725 + 0.22 \cdot \$30{,}275 = \$11{,}807.50.$$
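
The arithmetic can be checked with a few lines of Python. This sketch hard-codes the three chunks from the example above and, for contrast, computes what a flat 22% on the whole income would be; it is not a general implementation of the function you'll write below.

```python
# Each rate applies only to its own chunk of the $75,000 income.
chunks = [
    (0.10, 11_000),           # first $11,000
    (0.12, 44_725 - 11_000),  # next $33,725
    (0.22, 75_000 - 44_725),  # remaining $30,275
]
marginal_tax = sum(rate * amount for rate, amount in chunks)

# A common misconception: applying the top bracket's rate to the entire income.
flat_tax = 0.22 * 75_000

print(round(marginal_tax, 2))  # 11807.5
print(round(flat_tax, 2))      # 16500.0
```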


<br>

For the purposes of this question, we will express tax brackets – like the one in the table above – using a **list of tuples**. For instance, we can express the tax brackets above as follows:

```py
[(0.1, 0), 
 (0.12, 11000), 
 (0.22, 44725), 
 (0.24, 95375), 
 (0.32, 182100),
 (0.35, 231251),
 (0.37, 578125)
]
```

Each tuple is structured as `(tax_rate, bracket_lower_limit)`. For example, `(0.1, 0)` indicates a 10% tax rate for income above \\$0.

Before implementing anything, make sure you deeply understand:
1. How we found that a gross income of \\$75,000 owes \\$11,807.50 in taxes using the brackets above.
1. How we've represented the brackets from the table above as a list of tuples.

<br>

Now, complete the implementation of the function `tax_owed`, which takes in a float, `income`, and a list of tuples, `brackets`, and returns the amount of tax owed on a gross income of `income`, using the provided `brackets` (formatted as in the example above).

Example behavior is given below.

```py
>>> tax_owed(75000, 
[(0.1, 0), 
 (0.12, 11000), 
 (0.22, 44725), 
 (0.24, 95375), 
 (0.32, 182100),
 (0.35, 231251),
 (0.37, 578125)
])
11807.5
```

**Make sure to test your function with brackets other than the one above, and verify that it works correctly by replicating the calculations by hand!**

In [None]:
def tax_owed(income, brackets):
    ...
    

In [None]:
example_brackets = [
 (0.1, 0), 
 (0.12, 11000), 
 (0.22, 44725), 
 (0.24, 95375), 
 (0.32, 182100),
 (0.35, 231251),
 (0.37, 578125)]
example_owed = tax_owed(75000, example_brackets)
example_owed

In [None]:
grader.check("q5")

Now that we have a general-purpose function that can take in a gross income and a list of tax brackets and return the tax owed, we want to use this function to compute both the **state** and **federal** taxes each loan applicant owed.

To start this process, we'll load in a DataFrame that contains the tax brackets for each state in 2023 ([source](https://taxfoundation.org/data/all/state/state-income-tax-rates-2023/)). Not all of the loan applications were submitted in 2023 – in fact, they were all submitted between 2008 and 2018 – but brackets don't change very much from year to year, so for simplicity we'll use these brackets throughout.

Run the cell below to define a DataFrame named `state_taxes_raw` with the relevant information.

In [None]:
state_taxes_raw_path = Path('data') / 'state_taxes_raw.csv'
state_taxes_raw = pd.read_csv(state_taxes_raw_path)
state_taxes_raw.head()

As you can see, the state of the DataFrame is a bit hard to parse. The information we need is in there, but you'll need to clean it up so that it's in the right format.

**Before proceeding, you may want to open `data/state_taxes_raw.csv` in your text editor to see how it's formatted!**

<br>

---

### Question 6

<a name='Question-6'></a>

([return to the outline](#outline))

Complete the implementation of the function `clean_state_taxes`, which takes in a DataFrame like `state_taxes_raw` and returns a cleaned version of it. The first few rows of `clean_state_taxes(state_taxes_raw)` should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>State</th>
      <th>Rate</th>
      <th>Lower Limit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Ala.</td>
      <td>0.02</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Ala.</td>
      <td>0.04</td>
      <td>500</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ala.</td>
      <td>0.05</td>
      <td>3000</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Alaska</td>
      <td>0.00</td>
      <td>0</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Ariz.</td>
      <td>0.02</td>
      <td>0</td>
    </tr>
  </tbody>
</table>



Here's what you need to do to clean `state_taxes_raw`:

- Each state is separated by a row full of null values – drop these.
- Modify the `'State'` column so that each rate and bracket limit has its corresponding state name filled in.
    - Note that values like `'(a, b, c)'` are meaningless and shouldn't appear in the final DataFrame.
    - There are many ways you can go about this, but perhaps the easiest way is to replace all strings that are currently in the `'State'` column that aren't state names with null values, then *fill* in all null values – many of which were already there – with the state name that appears most recently above it. There's a way to do this natively in `pandas`; **don't use a `for`-loop**.
- Modify the `'Rate'` column so that all values are stored as floats that represent **proportions**.
    - Round the proportions to two decimal places to avoid any potential floating point precision errors.
    - Make sure to account for states that don't have any state income tax (like Alaska).
- Modify the `'Lower Limit'` column so that all values are stored as integers.
    - States with no state income tax have null `'Lower Limit'` values; make sure to set these `'Lower Limit'`s to 0.
    
This is quite an involved problem. Try to organize your work as best as you can, for instance, by using many helper functions and the `pipe` DataFrame method. **Nothing about this question requires the use of a `for`-loop, so don't use one!**

***HINT***: pandas `apply` and `lambda` functions may be useful!
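
As a rough illustration of two of the pandas tools mentioned above (forward-filling null values and chaining helpers with `pipe`), here is a minimal sketch on a made-up DataFrame. The column names, values, and helper functions (`blank_out_placeholders`, `fill_down`) are hypothetical and do not reflect the actual structure of `state_taxes_raw`; the sketch only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a label appears once, followed by nulls or junk.
toy = pd.DataFrame({
    'Group': ['A', np.nan, '(x)', 'B', np.nan],
    'Value': [1, 2, 3, 4, 5],
})

def blank_out_placeholders(df):
    # Replace labels that start with '(' with null values.
    is_junk = df['Group'].str.startswith('(', na=False)
    return df.assign(Group=df['Group'].mask(is_junk))

def fill_down(df):
    # Fill each null with the most recent non-null value above it.
    return df.assign(Group=df['Group'].ffill())

# pipe passes the DataFrame through each helper in turn, with no loop.
result = toy.pipe(blank_out_placeholders).pipe(fill_down)
print(result)
```

Splitting the work into small helpers like this keeps each cleaning step easy to check on its own before chaining them together.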

_Type your answer here, replacing this text._

In [None]:
def clean_state_taxes(state_taxes_raw):
    ...
    

In [None]:
state_taxes_raw = pd.read_csv(state_taxes_raw_path)
state_taxes = clean_state_taxes(state_taxes_raw)
state_taxes.head()

In [None]:
grader.check("q6")

Moving forward, remember to refer to `state_taxes`, **not** `state_taxes_raw`.

<br>

---

### Question 7

<a name='Question-7'></a>

([return to the outline](#outline))

While the information in `state_taxes` is now a bit more interpretable, the brackets in it are not quite compatible with our `tax_owed` function from Question 5. Here, we'll work on reformatting the brackets in `state_taxes` to be in a more useful format, and combining the resulting brackets with the other information we have (including, most importantly, incomes) for each borrower.

#### `state_brackets`

Complete the implementation of the function `state_brackets`, which takes in a cleaned DataFrame like `state_taxes` and returns a new DataFrame, indexed by `'State'`, with a single column, `'bracket_list'`, that contains the tax brackets for each state as **lists of tuples** in the form `(tax_rate, bracket_lower_limit)`.

The first few rows of `state_brackets(state_taxes)` should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>bracket_list</th>
    </tr>
    <tr>
      <th>State</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Ala.</th>
      <td>[(0.02, 0), (0.04, 500), (0.05, 3000)]</td>
    </tr>
    <tr>
      <th>Alaska</th>
      <td>[(0.0, 0)]</td>
    </tr>
    <tr>
      <th>Ariz.</th>
      <td>[(0.02, 0)]</td>
    </tr>
    <tr>
      <th>Ark.</th>
      <td>[(0.02, 0), (0.04, 4300), (0.05, 8500)]</td>
    </tr>
  </tbody>
</table>

Again, don't use a `for`-loop! In addition, don't forget that you can apply a function to every **row** of a DataFrame by using the DataFrame `apply` method with `axis=1`. `lambda` functions are your friend!

***HINT***: explore using the `zip` function.
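
To make the `zip` hint a little more concrete, here is a small sketch on unrelated toy data. The DataFrame, its column names, and the `'pair_list'` column are all made up for illustration; this is one possible pattern under those assumptions, not necessarily the intended solution.

```python
import pandas as pd

# Toy data (hypothetical columns, unrelated to state_taxes).
toy = pd.DataFrame({
    'Team': ['X', 'X', 'Y'],
    'Score': [0.5, 0.7, 0.9],
    'Round': [1, 2, 1],
})

# zip pairs up the two columns position by position, giving one tuple per row.
pairs = list(zip(toy['Score'], toy['Round']))

# Store the tuple for each row, then collect the tuples into one list per group.
per_team = (
    toy
    .assign(pair=pairs)
    .groupby('Team')['pair']
    .agg(list)
    .to_frame('pair_list')
)
print(per_team)
```

The result is indexed by `'Team'` and has a single column whose values are lists of tuples.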

In [None]:
def state_brackets(state_taxes):
    ...
    

In [None]:
state_taxes = clean_state_taxes(pd.read_csv(state_taxes_raw_path))
state_taxes_new = state_brackets(state_taxes)

In [None]:
grader.check("q7")

<br>

---

### Question 8 

#### `combine_loans_and_state_taxes`

Complete the implementation of the function `combine_loans_and_state_taxes`. It should take in two DataFrames:
- One like `loans`, and
- One like `state_taxes` (that is, one that is returned by `clean_state_taxes` from Question 6).

`combine_loans_and_state_taxes` should return a new DataFrame that has all of the same rows and columns as `loans`, except:
- There is an additional column, `'bracket_list'`, corresponding to the tax bracket list in each state.
- The name of the column containing states is `'State'` (which is not what it currently is in `loans`), with state names formatted as two-letter abbreviations (e.g. `'CA'`).

On the surface, this may seem straightforward: just call `state_brackets` on `state_taxes` and merge the resulting DataFrame with `loans`. But it's not that easy – state names are formatted differently in `loans` (which uses two-letter abbreviations) and in `state_taxes` (which uses non-standard shortened names). To help, we've provided you with a JSON file, `data/state_mapping.json`, that we load in as a dictionary at the top of `combine_loans_and_state_taxes`. Take a peek at `data/state_mapping.json` in your text editor to see how it's structured!

***Hint***: Look at how the information in `state_mapping` is stored. What type is this variable?  
How might you *map* the state names to their two-letter codes?  
When merging, keep both columns representing state information.
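
As a sketch of the general idea of translating labels with a dictionary before merging, consider the toy example below. The dictionary `code_for`, both DataFrames, and all column names are hypothetical; in particular, the direction of the keys and values in `data/state_mapping.json` may differ, so check the file before relying on this pattern.

```python
import pandas as pd

# Hypothetical mapping from shortened names to two-letter codes.
code_for = {'Calif.': 'CA', 'N.Y.': 'NY'}

# Toy lookup table keyed by shortened names.
rates = pd.DataFrame({'Name': ['Calif.', 'N.Y.'], 'info': ['a', 'b']})

# Toy main table keyed by two-letter codes.
people = pd.DataFrame({'code': ['NY', 'CA', 'NY'], 'amount': [10, 20, 30]})

# Series.map looks each value up in the dictionary.
rates = rates.assign(code=rates['Name'].map(code_for))

# A left merge keeps every row of `people`, in its original order.
merged = people.merge(rates, on='code', how='left')
print(merged)
```

A left merge is used here so that no rows of the main table are dropped, even if a code has no match in the lookup table.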

_Type your answer here, replacing this text._

In [None]:
def combine_loans_and_state_taxes(loans, state_taxes):
    # Start by loading in the JSON file.
    # state_mapping is a dictionary; use it!
    import json
    state_mapping_path = Path('data') / 'state_mapping.json'
    with open(state_mapping_path, 'r') as f:
        state_mapping = json.load(f)
        
    # Now it's your turn:
    ...
    

In [None]:
loans = clean_loans(pd.read_csv(loans_path))
state_taxes = clean_state_taxes(pd.read_csv(state_taxes_raw_path))
loans_with_state_taxes = combine_loans_and_state_taxes(loans, state_taxes)
loans_with_state_taxes.head()

In [None]:
grader.check("q8")

<br>

--- 

### Question 9

<a name='Question-9'></a>

([return to the outline](#outline))

We now have all of the information we need to compute each borrower's disposable income:

$$
\text{Disposable Income} = \text{Gross Income} - \text{Federal Income Tax} - \text{State Income Tax} 
$$

Complete the implementation of the function `find_disposable_income`, which takes in a DataFrame like `loans_with_state_taxes` and returns a copy of the input DataFrame with three additional columns:

- `'federal_tax_owed'`, which contains the amount this individual owes in federal income taxes. To calculate federal income taxes, use the `FEDERAL_BRACKETS` list provided to you in the definition of `find_disposable_income`.
- `'state_tax_owed'`, which contains the amount this individual owes in state income taxes. To calculate state income taxes, use the values in the `'bracket_list'` column, which depend on the borrower's state of residence.
- `'disposable_income'`, which contains this individual's disposable income, which is their income after subtracting federal and state taxes from their gross income.

Note that both federal taxes and state taxes are calculated based on an individual's gross income; this means you'll need to use values from the `'annual_inc'` column both when calculating federal taxes and state taxes.
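
Since the state calculation needs two values from each row (an income and a list), it may help to see how a two-argument function can be applied row by row. The sketch below uses a made-up function, `weighted_total`, and made-up columns; it is only a stand-in to show the `apply` mechanics and is not `tax_owed`.

```python
import pandas as pd

def weighted_total(amount, weights):
    # Hypothetical stand-in for a function of a number and a list.
    return amount * sum(weights)

# Toy data: one numeric column and one column whose values are lists.
toy = pd.DataFrame({
    'amount': [100, 250],
    'weights': [[1, 2], [3]],
})

# axis=1 passes each row to the lambda, which pulls out both columns.
totals = toy.apply(
    lambda row: weighted_total(row['amount'], row['weights']),
    axis=1,
)

# assign returns a copy with the new column; `toy` itself is unchanged.
with_totals = toy.assign(total=totals)
print(with_totals)
```

When the second argument is the same for every row (as with a single fixed list), applying the function to just the one column is enough.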

_Type your answer here, replacing this text._

In [None]:
def find_disposable_income(loans_with_state_taxes):
    FEDERAL_BRACKETS = [
     (0.1, 0), 
     (0.12, 11000), 
     (0.22, 44725), 
     (0.24, 95375), 
     (0.32, 182100),
     (0.35, 231251),
     (0.37, 578125)
    ]
    ...
    

In [None]:
with_disposable_income = find_disposable_income(loans_with_state_taxes)
with_disposable_income.head()

In [None]:
grader.check("q9")

Nice work! To wrap up this section, we'll show you one of the many visualizations you can create using the calculations you just did.

All you need to do here is read through the cell below, try and understand the code, run the cell, and look at the resulting visualization. What trends do you notice?

In [None]:
effective_tax_per_state = (
    with_disposable_income
    .groupby('State')
    .apply(lambda df: df['state_tax_owed'].sum() / df['annual_inc'].sum())
)

median_income = (
    with_disposable_income
    .groupby('State')
    ['annual_inc']
    .median()
)

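# Note: hover_name, size, and labels are plotly-style arguments, so this cell
# assumes the pandas plotting backend is plotly (pd.options.plotting.backend = 'plotly').
(
    pd.DataFrame()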
    .assign(
        effective_tax_per_state=effective_tax_per_state, 
        median_income=median_income,
        count=with_disposable_income.value_counts('State')
    )
    .reset_index()
    .plot(
        kind='scatter',
        x='median_income',
        y='effective_tax_per_state',
        hover_name='State',
        size='count',
        title='Effective State Tax Rate vs. Median Income Per State',
        labels={
            'median_income': 'Median Income',
            'effective_tax_per_state': 'Effective State Tax Rate<br>sum(state tax) / sum(annual income)',
            'count': 'Number of Borrowers'
        }
    )
)

## Congratulations, you've finished Project 1!

Submit your responses to Q1, 2, and 4 by the Checkpoint deadline (only the autograder questions will be tested). 

Your responses to all questions are due by the later deadline.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()