<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Lab: Exploring Loan Rejections Data

<br><br>

In this lab we'll look at data about **loan applications** from Lending Club, a company that facilitates peer-to-peer loans.

_"Since 2007, we’ve been bringing borrowers and investors together, transforming the way people access credit. Over the last 10 years, we've helped millions of people take control of their debt, grow their small businesses, and invest for the future."_

[Lending Club](https://www.lendingclub.com/company/about-us)

[Data source](https://www.lendingclub.com/info/download-data.action)

#### 1. Load the `pandas` library

In [1]:
import pandas as pd

#### 2. Read in the dataset and take a look at it

In [2]:
df = pd.read_csv("../data/rejections.csv.gz")
df.head()

Unnamed: 0,Amount Requested,Application Date,Loan Title,Risk_Score,Debt-To-Income Ratio,State,Employment Length
0,1000.0,2007-05-26,Wedding Covered but No Honeymoon,693.0,10%,NM,4 years
1,1000.0,2007-05-26,Consolidating Debt,703.0,10%,MA,< 1 year
2,11000.0,2007-05-27,Want to consolidate my debt,715.0,10%,MD,1 year
3,6000.0,2007-05-27,waksman,698.0,38.64%,MA,< 1 year
4,1500.0,2007-05-27,mdrigo,509.0,9.43%,MD,< 1 year


#### 3. How many rows are there?

In [3]:
len(df)

755491

#### 4. What is each column's data type?

In [4]:
df.dtypes

Amount Requested        float64
Application Date         object
Loan Title               object
Risk_Score              float64
Debt-To-Income Ratio     object
State                    object
Employment Length        object
dtype: object
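
Two of the `object` columns hold values that are not really free text: `Application Date` is a date, and `Debt-To-Income Ratio` is a percentage stored as a string such as `10%`. The lab does not ask for a conversion, but if one were needed it might look like the sketch below. It runs on three rows copied from the `head()` output above (the `sample` frame is only for illustration), not on the full file.

```python
import pandas as pd

# A few rows copied from the head() output, for illustration only
sample = pd.DataFrame({
    "Application Date": ["2007-05-26", "2007-05-27", "2007-05-27"],
    "Debt-To-Income Ratio": ["10%", "38.64%", "9.43%"],
})

# Parse the date strings into proper datetime values
sample["Application Date"] = pd.to_datetime(sample["Application Date"])

# Strip the trailing "%" and convert what is left to a float
sample["Debt-To-Income Ratio"] = (
    sample["Debt-To-Income Ratio"].str.rstrip("%").astype(float)
)

print(sample.dtypes)
```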

#### 5. Investigate which columns have missing values

In [5]:
df.isnull().sum()

Amount Requested            0
Application Date            0
Loan Title                 14
Risk_Score              23929
Debt-To-Income Ratio        0
State                      21
Employment Length        8130
dtype: int64

#### 6. What are the smallest and largest loan amounts?

In [6]:
print(df["Amount Requested"].min(), df["Amount Requested"].max())

0.0 1400000.0


#### 7. Find the rows where the requested amount is equal to the minimum and maximum values to investigate these further. Is the amount a valid value in those cases?

Decide what to do with those rows - are they reasonable data? If you don't think so, drop the relevant rows.

In [7]:
df[df["Amount Requested"] == 0]

Unnamed: 0,Amount Requested,Application Date,Loan Title,Risk_Score,Debt-To-Income Ratio,State,Employment Length
531884,0.0,2012-06-07,,677.0,32.28%,RI,< 1 year
594623,0.0,2012-08-15,,685.0,44.04%,NC,< 1 year


In [8]:
df[df["Amount Requested"] == df["Amount Requested"].max()]

Unnamed: 0,Amount Requested,Application Date,Loan Title,Risk_Score,Debt-To-Income Ratio,State,Employment Length
157454,1400000.0,2010-09-10,car,641.0,47.55%,SC,2 years


Conclusion: the 0 loan amounts look like bad data, so we drop those rows. The maximum stays, because it's possible someone really did ask for a 1.4 million loan for a car...

In [9]:
df = df[df["Amount Requested"] > 0]

#### 8. Calculate the average loan amount by state

Check the values in the column first: are there any missing values that need to be dealt with?

In [10]:
len(df[df["State"].isnull()])

21

Yes, there are some nulls, so drop those rows.

In [11]:
df = df.dropna(subset=["State"])

In [12]:
df.groupby("State")["Amount Requested"].mean()

State
AK    13713.117207
AL    11917.882667
AR    11783.122340
AZ    12820.066855
CA    13688.188489
CO    13563.546496
CT    12759.522782
DC    11495.268342
DE    12213.730901
FL    12488.469416
GA    12363.471699
HI    14525.504831
IA     8463.715278
ID     8127.898551
IL    13226.302346
IN     8762.786596
KS    13065.799085
KY    12141.954501
LA    12403.720849
MA    12455.818943
MD    12102.264194
ME     8394.907407
MI    12641.243900
MN    12939.081538
MO    12363.794451
MS     8010.117967
MT    13555.610490
NC    12589.666025
ND    10059.210526
NE     8603.960000
NH    13461.364525
NJ    13935.483039
NM    12971.771379
NV    12698.887622
NY    13244.061086
OH    12159.725084
OK    12443.821066
OR    13210.777879
PA    12776.552922
RI    12017.457795
SC    12247.359840
SD    13313.379630
TN     7631.570156
TX    13165.323352
UT    13458.504136
VA    12951.084647
VT    12963.318207
WA    13681.032498
WI    12357.816119
WV    12232.870257
WY    14239.629975
Name: Amount Requested, dtype: float64

#### 9. What are the different values in "employment length" and how many applications are there in each category?

In [13]:
df["Employment Length"].value_counts().sort_index()

1 year        25444
10+ years     38324
2 years       26115
3 years       20417
4 years       15929
5 years       14617
6 years       11158
7 years        8090
8 years        7452
9 years        5721
< 1 year     574071
Name: Employment Length, dtype: int64

We can use `sort_index()` to order by category rather than by count. It's not perfect because the categories are strings, so "10+ years" lands between "1 year" and "2 years" and "< 1 year" comes last, but we get the gist!
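
If a true numeric ordering is needed, one option is to pull the number of years out of each label and sort on that. The sketch below uses a small made-up Series of counts rather than the lab data, and `years_from_label` is a hypothetical helper introduced only for illustration. It assumes "< 1 year" should count as 0 and "10+ years" as 10.

```python
import re

import pandas as pd

# Toy counts indexed by employment-length label (made-up numbers, not the lab data)
counts = pd.Series(
    [5, 8, 3, 4, 9],
    index=["1 year", "10+ years", "2 years", "9 years", "< 1 year"],
)


def years_from_label(label):
    """Hypothetical helper: map an employment-length label to a sortable number."""
    if label.startswith("<"):
        return 0  # "< 1 year" sorts before everything else
    return int(re.search(r"\d+", label).group())


# Reorder the labels by their numeric value, then reindex the counts
ordered_labels = sorted(counts.index, key=years_from_label)
ordered_counts = counts.loc[ordered_labels]

print(ordered_counts)
```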

#### 10. You've been asked to estimate what % of loan applications are related to debt consolidation

- drop rows that have no loan title
- convert each string in the "loan title" column to be fully uppercase
- count how many rows have the word "DEBT" in the loan title
- work out what this is as a % of your total dataset

In [14]:
df = df.dropna(subset=["Loan Title"])

df["loan_title_upper"] = df["Loan Title"].str.upper()

debt_count = len(df[df["loan_title_upper"].str.contains("DEBT")])

print(debt_count)

print(100*(debt_count / len(df)))

265240
35.109920365977636
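
Because `str.contains` returns a boolean Series, the same percentage can also be read off as the mean of that Series, since `True` counts as 1 and `False` as 0. A minimal sketch on made-up titles, not the lab data:

```python
import pandas as pd

# Made-up loan titles for illustration only
titles = pd.Series(["Debt consolidation", "car", "pay off debt", "wedding"])

# Uppercase first so the match ignores case, then flag the matching rows
has_debt = titles.str.upper().str.contains("DEBT")

debt_count = int(has_debt.sum())   # number of True values
debt_pct = 100 * has_debt.mean()   # share of True values, as a percentage

print(debt_count, debt_pct)
```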


#### 11. BONUS: Expand on the above by trying multiple keywords at once

- identify a set of (uppercase) keywords to search for in the loan titles and store them in a Python list
- as above, make sure your loan titles are uppercase
- search for rows containing any of your keywords

You could do this with the multiple filter syntax:

```python
df[(<condition 1>) | (<condition 2>) ...]
```

but that means you can't store your keywords in a list and amend them without amending the above code.

Instead, one approach is to create a new column to track whether a row matches any of the keywords. Then loop through your list of keywords and, for each row that matches that particular keyword, set the value of this column to `True`. So the idea is:

- Say you're looking for words "DEBT" and "CONSOLIDATE"
- First, identify all rows that contain the word "DEBT" and update a column, say "matches_keyword", to `True`.
- Then do the same for "CONSOLIDATE". Most of the ones that have this word will already have the "matches_keyword" set to `True`, but the idea is to also catch the ones that don't
- You can do this for any number of keywords
- When you're done, you can use the "matches_keyword" column to tell you how many rows match **at least one** of your keywords

In [15]:
keywords = ["DEBT", "CONSOLIDATE", "CONSOLIDATING"]

# Start with every row unmatched, then flag rows that contain each keyword
df["matches_keyword"] = False

for word in keywords:
    df.loc[df["loan_title_upper"].str.contains(word), "matches_keyword"] = True

total_count = len(df[df["matches_keyword"]])

print(total_count)
print(100*(total_count / len(df)))

266978
35.33998009149441


Another way to get the % from this column is to use `value_counts`:

In [16]:
df["matches_keyword"].value_counts() / len(df)

False    0.6466
True     0.3534
Name: matches_keyword, dtype: float64
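
The loop is not the only option. Since `str.contains` treats its pattern as a regular expression by default, the keywords can be joined with `|` into a single pattern. The sketch below shows this alternative on made-up titles. It is not the approach used in the cell above, and the `titles` Series is only for illustration. `re.escape` guards against keywords that happen to contain regex metacharacters.

```python
import re

import pandas as pd

keywords = ["DEBT", "CONSOLIDATE", "CONSOLIDATING"]

# Made-up loan titles for illustration only
titles = pd.Series(["Consolidating cards", "car", "Debt payoff", "consolidate debt"])

# Build one regex that matches any keyword: "DEBT|CONSOLIDATE|CONSOLIDATING"
pattern = "|".join(re.escape(word) for word in keywords)

# True for every title that contains at least one keyword
matches_keyword = titles.str.upper().str.contains(pattern)

print(matches_keyword.sum())
# normalize=True returns proportions directly instead of raw counts
print(matches_keyword.value_counts(normalize=True))
```

The keyword list can still be amended in one place, and the titles are scanned once rather than once per keyword.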