### Obtaining a data set

I've put a zipped CSV file from Lending Tree containing mortgage data on OneDrive. You should all have access. You can find it here: https://glgit-my.sharepoint.com/:u:/g/personal/jobelenus_glgroup_com/EUoBwtK-k89KopT44j4DjqsB_N2IPj36kuZUmY7SpDgTwg?e=fgEzC0

I've added a column reference here: https://github.com/jobelenus/python-data-analysis-crash-course/blob/master/01-Pandas/reference.md.



### Analyzing the data

*Note: Skip line 1! Line 2 is the header row, so the data itself starts on line 3.*

1. Try and group the dataset by "grade" (A, B, C, D, E, F, G).
2. Then check whether the highest interest rate in A is lower than the lowest in B, and so on for each grade (i.e. compare each grade with the one below it).

In [1]:
import pandas as pd

def perc_2_float(s):
    if s:
        return float(s.strip('%'))/100
    else:
        return None


df = pd.read_csv('/Users/jobelenus/Downloads/LoanStats3d.csv', skiprows=1, converters={'int_rate': perc_2_float})
print("DONE")

DONE
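
To make the `skiprows` and `converters` arguments easier to see in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The sample text, the `SAMPLE_CSV` name, and the `perc_to_float` helper are all made up for illustration; only the layout (a junk first line, then the header, then data) mirrors the note above.

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: a junk line, then the header, then data.
SAMPLE_CSV = """This first line is not data
grade,int_rate,term
A, 6.49%, 36 months
B, 11.99%, 60 months
"""

def perc_to_float(s):
    """Convert a string like ' 6.49%' to 0.0649; empty strings become None."""
    s = s.strip()
    return float(s.rstrip('%')) / 100 if s else None

sample = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    skiprows=1,                              # drop line 1, so line 2 becomes the header
    converters={'int_rate': perc_to_float},  # called once per cell with the raw string
)
print(sample)
```

Note that `read_csv` does not trim whitespace by default, so the `term` values keep their leading space.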


In [2]:
import numpy as np

all_grades = df.grade.unique()

grades = df[['grade', 'int_rate']].copy()

print(grades.loc[grades.grade == 'C'].head())

max_int_rate = {}
min_int_rate = {}
for g in all_grades:
    rows = grades.loc[grades.grade == g]
    max_int_rate[g] = rows.int_rate.max()
    min_int_rate[g] = rows.int_rate.min()

print("MAX: ", max_int_rate)
print("MIN: ", min_int_rate)


# A plain if/else avoids a bare except, which would also hide unrelated errors (e.g. a KeyError).
if max_int_rate['A'] < min_int_rate['B']:
    print("OK!")
else:
    print("FAIL, A:{} > B:{}".format(max_int_rate['A'], min_int_rate['B']))


  grade  int_rate
0     C    0.1485
3     C    0.1399
4     C    0.1199
5     C    0.1344
6     C    0.1288
MAX:  {'C': 0.1499, 'F': 0.2606, 'B': 0.1199, 'E': 0.21989999999999998, 'A': 0.0819, 'D': 0.18489999999999998, 'G': 0.2899, nan: nan}
MIN:  {'C': 0.06, 'F': 0.06, 'B': 0.06, 'E': 0.06, 'A': 0.053200000000000004, 'D': 0.06, 'G': 0.258, nan: nan}
FAIL, A:0.0819 > B:0.06
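
The cell above only checks A against B. As a rough sketch of how the same check could cover every pair of adjacent grades, the per-grade min and max can be computed in one `groupby` pass and then compared pairwise. The `toy` frame and its rates are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical rates chosen for illustration; not taken from the real file.
toy = pd.DataFrame({
    'grade':    ['A', 'A', 'B', 'B', 'C', 'C'],
    'int_rate': [0.05, 0.08, 0.06, 0.12, 0.13, 0.15],
})

# One pass: min and max interest rate per grade, ordered alphabetically by grade.
stats = toy.groupby('grade')['int_rate'].agg(['min', 'max']).sort_index()

# Compare each grade's max with the min of the next grade down.
overlaps = {}
for better, worse in zip(stats.index[:-1], stats.index[1:]):
    overlaps[(better, worse)] = stats.loc[better, 'max'] >= stats.loc[worse, 'min']
    print(better, worse, 'overlap' if overlaps[(better, worse)] else 'OK')
```

By default `groupby` drops rows whose key is NaN, so the `nan: nan` entries seen in the dictionaries above would not appear in `stats`.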


1. Try and group the data by loan status and term, to determine whether more short-term mortgages are fully paid off than long-term ones.

In [3]:
next_group = df[['loan_status', 'term']]

next_group = next_group.groupby(['loan_status', 'term']).size().reset_index()
next_group.columns = ['loan_status', 'term', 'numcount']
print(next_group)

short_paid = next_group.loc[next_group.term == ' 36 months']  # there is an extra space here, that is evil!
short_paid = short_paid.loc[short_paid.loan_status == 'Fully Paid']
long_paid = next_group.loc[next_group.loan_status == 'Fully Paid']
long_paid = long_paid.loc[long_paid.term == ' 60 months']

print(short_paid)
print(long_paid)
print("More 36mos loans are paid off than 60mos loans: ", short_paid.numcount.values[0] > long_paid.numcount.values[0])

           loan_status        term  numcount
0          Charged Off   36 months     41939
1          Charged Off   60 months     32823
2              Current   36 months       141
3              Current   60 months     45722
4              Default   36 months        31
5              Default   60 months        92
6           Fully Paid   36 months    240752
7           Fully Paid   60 months     56820
8      In Grace Period   36 months        25
9      In Grace Period   60 months       608
10   Late (16-30 days)   36 months        27
11   Late (16-30 days)   60 months       387
12  Late (31-120 days)   36 months       258
13  Late (31-120 days)   60 months      1470
  loan_status        term  numcount
6  Fully Paid   36 months    240752
  loan_status        term  numcount
7  Fully Paid   60 months     56820
More 36mos loans are paid off than 60mos loans:  True
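
Raw counts can mislead here, because the table has roughly twice as many 36-month loans as 60-month loans overall. A possible refinement, sketched below on invented rows, is to strip the stray space from `term` and compare the share of each term that is fully paid, using `pd.crosstab` with `normalize='index'`. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration; note the leading space in term, as in the real file.
toy = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Fully Paid', 'Charged Off',
                    'Fully Paid', 'Current', 'Charged Off'],
    'term':        [' 36 months', ' 36 months', ' 36 months',
                    ' 60 months', ' 60 months', ' 60 months'],
})

# Remove the leading space once, so later filters can use '36 months' directly.
toy['term'] = toy['term'].str.strip()

# Rows are terms, columns are statuses; normalize='index' turns counts into per-term shares.
shares = pd.crosstab(toy['term'], toy['loan_status'], normalize='index')
print(shares['Fully Paid'])
```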


1. Try and select all the A grade mortgages, and add a new column that calculates the total amount of dollars the buyer will owe (based on loan_amnt and int_rate).
2. Then add another column that tells you how many years it would take to pay it off if they put their entire annual income toward it each year.

In [8]:
# Work on a copy so adding columns does not trigger SettingWithCopyWarning.
grade_a = df.loc[df.grade == 'A', ['loan_amnt', 'int_rate', 'annual_inc']].copy()

# int_rate is already a fraction (0.0649 == 6.49%), so no extra factor of 100.
# Assumes a single period of simple interest: principal plus interest.
grade_a['total_amt'] = grade_a['loan_amnt'] * (1 + grade_a['int_rate'])
grade_a['num_years'] = grade_a['total_amt'] / grade_a['annual_inc']
grade_a.head()

Unnamed: 0,loan_amnt,int_rate,annual_inc,total_amt,num_years
11,10000.0,0.0649,85000.0,10649.0,0.125282
14,28000.0,0.0649,92000.0,29817.2,0.3241
18,9600.0,0.0749,60000.0,10319.04,0.171984
21,25000.0,0.0749,109000.0,26872.5,0.246537
30,6000.0,0.0791,105000.0,6474.6,0.061663
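
How much a borrower owes in total really depends on how the loan is repaid. As one hedged alternative, if the loan is paid back in equal monthly installments (standard amortization), the total repaid can be estimated as sketched below. The `total_repaid` helper is hypothetical, and the 36-month term is an assumption; the principal, rate, and income come from the first row shown above.

```python
def total_repaid(principal, annual_rate, months):
    """Total paid over the loan's life with equal monthly payments (standard amortization)."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # No interest: the borrower simply repays the principal.
        return float(principal)
    payment = principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)
    return payment * months

# Values from the first grade A row above; the 36-month term is an assumption.
total = total_repaid(10000.0, 0.0649, 36)
print(round(total, 2))
print(total / 85000.0)  # years of full annual income needed to cover it
```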
