### Obtaining a data set

I've put a zipped CSV file from Lending Tree containing mortgage data on OneDrive. You should all have access. You can find it here: https://glgit-my.sharepoint.com/:u:/g/personal/jobelenus_glgroup_com/EUoBwtK-k89KopT44j4DjqsB_N2IPj36kuZUmY7SpDgTwg?e=fgEzC0

I've added a column reference here: https://github.com/jobelenus/python-data-analysis-crash-course/blob/master/01-Pandas/reference.md



### Analyzing the data

*Note: Skip Line 1! Line 2 is the header, so skip that too!*
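
The note above maps onto the `header` argument of `pd.read_csv`: `header=1` treats the second line of the file as the column names and ignores the line before it. A minimal sketch, assuming a made-up four-line sample in place of the real file (the junk first line and the values are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: a junk first line, then the header.
sample = io.StringIO(
    "Notes offered by Prospectus\n"
    "grade,int_rate,loan_amnt\n"
    "A,6.49%,10000\n"
    "B,11.99%,5000\n"
)

# header=1 uses the second line as column names and discards the line before it.
df = pd.read_csv(sample, header=1)
print(df.columns.tolist())
print(len(df))
```

Only the two data rows end up in the frame; neither the junk line nor the header line is counted as data.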

1. Try to group the dataset by "grade" (A, B, C, D, E, F, G).
2. Then, for each grade, see if its highest interest rate is greater than the lowest interest rate of the grade above it (e.g. is the highest rate in B greater than the lowest in A?).

1. Try to group the data by loan status and term, to determine whether more short-term mortgages are fully paid off than long-term ones.

1. Try to select all the A grade mortgages, and add a new column that calculates the total number of dollars the buyer will owe (loan_amnt * int_rate).
2. Then add another column that tells you how many years it would take to pay the loan off if the buyer put their entire annual income toward it each year.
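
Because `int_rate` holds strings such as `6.49%`, comparisons and arithmetic are safer on numeric values. The sketch below uses made-up rows (not the real data) to show how string ordering can disagree with numeric ordering, and one way to convert before comparing grades; the `rate` column and `summary` frame are names invented for this example.

```python
import pandas as pd

# Made-up rows; the real file has many more rows and columns.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B"],
    "int_rate": ["5.32%", "8.19%", "6.00%", "11.99%"],
})

# Strings are compared character by character, so the string maximum
# here is "8.19%" rather than "11.99%".
print(toy["int_rate"].max())

# Strip spaces and the percent sign, then convert to floats.
toy["rate"] = pd.to_numeric(toy["int_rate"].str.strip(" %"))

# Highest and lowest rate per grade, then compare each grade's highest
# with the lowest of the grade above it (shift() moves values down one row).
summary = toy.groupby("grade")["rate"].agg(["max", "min"])
summary["comparison"] = summary["max"] > summary["min"].shift()
print(summary)
```

The first grade has nothing above it, so its shifted value is missing and the comparison comes out as `False`.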

In [9]:
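import pandas as pd

# header=1: ignore line 1 of the file and use line 2 as the column names
dataFromCSV = pd.read_csv('data.csv', delimiter=',', header=1)
# print(len(dataFromCSV.columns.tolist()))

# Keep only the columns the exercises need
df = pd.DataFrame(dataFromCSV, columns=['grade', 'int_rate', 'loan_status', 'loan_amnt', 'term', 'annual_inc'])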

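# Exercise 1: highest and lowest interest rate per grade.
# int_rate holds strings such as '8.19%', so max()/min() compare them as text.
groupedHighest = df.groupby(['grade'])['int_rate'].max()
groupedLowest = df.groupby(['grade'])['int_rate'].min()

groupedHighLow = pd.merge(groupedHighest, groupedLowest, on='grade', how='outer')
groupedHighLow.columns = ['highest_interest', 'lowest_interest']

# Convert to numbers before comparing each grade's highest rate with the
# lowest rate of the grade above it (shift() moves the column down one row).
highestRate = pd.to_numeric(groupedHighLow['highest_interest'].str.strip(' %'))
lowestRate = pd.to_numeric(groupedHighLow['lowest_interest'].str.strip(' %'))
groupedHighLow['comparison'] = highestRate > lowestRate.shift()
print(groupedHighLow)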
print('-------------')


# Exercise 2: count fully paid loans for each term
fullyPaidRecords = df.loc[df['loan_status'] == 'Fully Paid']
groupedLoanAndTerm = fullyPaidRecords[['term', 'loan_status']]
groupedLoanAndTermSize = groupedLoanAndTerm.groupby(['term','loan_status']).size()
print(groupedLoanAndTermSize)
print('-------------')

# Exercise 3: grade A loans only; copy() so the new columns don't touch df
gradeARecords = df.loc[df['grade'] == 'A'].copy()
gradeARecords['interest_owed'] = gradeARecords['loan_amnt'] * pd.to_numeric(gradeARecords['int_rate'].str.strip(' %')).div(100)


final = gradeARecords.copy()
final['years_to_pay_back_loan'] = (gradeARecords['loan_amnt'] + gradeARecords['interest_owed']) / gradeARecords['annual_inc']
final

      highest_interest lowest_interest  comparison
grade                                             
A                8.19%           5.32%       False
B               11.99%           6.00%        True
C               14.99%           6.00%        True
D               18.49%           6.00%        True
E               21.99%           6.00%        True
F               26.06%           6.00%        True
G               28.99%          25.80%        True
-------------
term        loan_status
 36 months  Fully Paid     240752
 60 months  Fully Paid      56820
dtype: int64
-------------


|    | grade | int_rate | loan_status | loan_amnt | term      | annual_inc | interest_owed | years_to_pay_back_loan |
|----|-------|----------|-------------|-----------|-----------|------------|---------------|------------------------|
| 11 | A     | 6.49%    | Fully Paid  | 10000.0   | 36 months | 85000.0    | 649.0000      | 0.125282               |
| 14 | A     | 6.49%    | Fully Paid  | 28000.0   | 36 months | 92000.0    | 1817.2000     | 0.324100               |
| 18 | A     | 7.49%    | Fully Paid  | 9600.0    | 36 months | 60000.0    | 719.0400      | 0.171984               |
| 21 | A     | 7.49%    | Fully Paid  | 25000.0   | 36 months | 109000.0   | 1872.5000     | 0.246537               |
| 30 | A     | 7.91%    | Fully Paid  | 6000.0    | 36 months | 105000.0   | 474.6000      | 0.061663               |
| 31 | A     | 5.32%    | Fully Paid  | 15000.0   | 36 months | 80000.0    | 798.0000      | 0.197475               |
| 35 | A     | 6.49%    | Fully Paid  | 11000.0   | 36 months | 85000.0    | 713.9000      | 0.137811               |
| 39 | A     | 5.32%    | Fully Paid  | 12000.0   | 36 months | 53750.0    | 638.4000      | 0.235133               |
| 42 | A     | 7.49%    | Fully Paid  | 18000.0   | 36 months | 75000.0    | 1348.2000     | 0.257976               |
| 61 | A     | 5.32%    | Fully Paid  | 25000.0   | 36 months | 150000.0   | 1330.0000     | 0.175533               |
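
The term counts printed earlier include only fully paid loans, so on their own they don't show what share of each term was paid off. One possible way to get that share is `pd.crosstab` with `normalize="index"`. This is a sketch on made-up rows, and the `Charged Off` status is assumed here purely for illustration:

```python
import pandas as pd

# Made-up rows; statuses other than "Fully Paid" are assumptions.
toy = pd.DataFrame({
    "term": [" 36 months", " 36 months", " 36 months", " 60 months", " 60 months"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Fully Paid", "Charged Off"],
})

# normalize="index" turns each row of counts into fractions that sum to 1,
# so terms with very different loan counts can be compared directly.
shares = pd.crosstab(toy["term"], toy["loan_status"], normalize="index")
print(shares["Fully Paid"])
```

Comparing shares rather than raw counts avoids favouring whichever term simply has more loans in the file.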
