# Loan Data

## Summary
<p>The dataset contains 500 entries, presumably drawn from a larger dataset and split into predefined even numbers per loan status (Paidoff, Collection, CollectionPaidOff). The extract covers one week in September 2016. The data was extracted as a snapshot at a later date, when the collection process had only been completed up to a certain stage.</p>
<p>A complete random sample of the loan data and information about the data extraction date and method would have been desirable.</p>

#### Change log:
17/04/2017 Initial setup<br>
20/04/2017 more stats <br>
21/04/2017 data cleansing and initial plots <br>
23/04/2017 some more univariate charts <br>
27/04/2017 improve chart, more on collection process, basic correlation analysis<br>
 
---> lost interest in this data set, due to its selection. 
---> https://www.kaggle.com/huseinzol05/d/zhijinzhai/loandata/multi-cluster-education-gender seems to get to the same conclusion 

In [None]:
import numpy as np
import pandas as pd 
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-pastel')
import seaborn as sns
import datetime as dt

In [None]:
df = pd.read_csv("../input/Loan payments data.csv")
for c in ['effective_date','paid_off_time','due_date']:
    df[c]=pd.to_datetime(df[c])
# Loans without a past_due_days value are treated as 0 days past due.
df['past_due_days'] = df['past_due_days'].fillna(0)

The data set consists of 500 entries. Loan_ID can work as a primary key. 

In [None]:
df.info()

## Attribute analysis

### Loan ID
**from the documentation** "Loan_id A unique loan number assigned to each loan customers". <br>
Loan_ID is unique. 


In [None]:
df.Loan_ID.nunique()
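
A distinct count only confirms uniqueness when it is compared with the row count. A minimal sketch of a stricter key check follows; the helper name `is_primary_key` and the toy frame are hypothetical and only stand in for the real `df`.

```python
import pandas as pd

def is_primary_key(frame: pd.DataFrame, column: str) -> bool:
    """Return True if `column` has no missing values and no duplicates."""
    col = frame[column]
    return bool(col.notna().all() and col.is_unique)

# Toy rows for illustration only, not taken from the loan file.
toy = pd.DataFrame({"Loan_ID": ["a1", "a2", "a3"],
                    "Principal": [1000, 1000, 800]})

print(is_primary_key(toy, "Loan_ID"))
print(is_primary_key(toy, "Principal"))
```

On the real data the call would be `is_primary_key(df, "Loan_ID")`.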

### Loan_status
**from the documentation:** "Loan_status Whether a loan is paid off, in collection, new customer yet to payoff, or paid off after the collection efforts"

**Questions:** I cannot find the new customers. Am I missing a file?

In [None]:
print (f"Unique values:", df.loan_status.unique()[:])
g=pd.DataFrame(df.groupby('loan_status')['Loan_ID'].count())
g

### Principal
**from the documentation:** "Principal Basic principal loan amount at the origination"

**Findings:**  only 6 values occur

In [None]:
print (f"Unique values:", df.Principal.unique()[:])
g=pd.DataFrame(df.groupby('Principal')['Loan_ID'].count())
g

### Terms
**from the documentation:** "Can be weekly (7 days), biweekly, and monthly payoff schedule" <br>
**Findings:** only the values 7, 15 and 30 occur. This leaves two choices for precise maturity modelling: either use the day counts from the data set, or follow the description and use more accurate calendar functions. 
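
The two modelling choices can be compared on a single date. This is a rough sketch: the mapping from `terms` values to calendar offsets (`calendar_offsets`) is my assumption about how the documentation wording would translate, not something the data set states.

```python
import pandas as pd

effective = pd.Timestamp("2016-09-08")  # first origination date in the extract

# Option 1: fixed day counts exactly as stored in the `terms` column.
fixed_days = {7: pd.Timedelta(days=7),
              15: pd.Timedelta(days=15),
              30: pd.Timedelta(days=30)}

# Option 2: calendar offsets following the documentation wording (assumed mapping).
calendar_offsets = {7: pd.DateOffset(weeks=1),
                    15: pd.DateOffset(weeks=2),
                    30: pd.DateOffset(months=1)}

for term in (7, 15, 30):
    by_days = (effective + fixed_days[term]).date()
    by_calendar = (effective + calendar_offsets[term]).date()
    print(term, by_days, by_calendar)
```

For this date the two options differ only for the bi-weekly term, where 15 days and two weeks are one day apart. The monthly term would diverge for origination months that do not have 30 days.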

In [None]:
l=list(g.index)
l

In [None]:
print (f"Unique values:", df.terms.unique()[:])
mapTerms={7:'weekly', 15:'bi-weekly', 30: 'monthly'}
g=pd.DataFrame(df.groupby('terms')['Loan_ID'].count())
g

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
pos = np.arange(len(df.terms.unique()))
ax.pie(g.Loan_ID.values, labels=[mapTerms[l] for l in list(g.index)])
ax.set_title('Loans (count) by contract term');

### Effective_date

**from the documentation:** "Effective_date When the loan got originated and took effects" <br>
**Findings:** The data extract seems to be a subset from one week in September 2016, 8 Sep to 14 Sep.
Most of the loans were originated on a Sunday or Monday. 
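
As a quick cross-check of the weekday statement, the seven dates of the extract week can be mapped to weekday names. This sketch only uses the date range given above, not the loan file itself.

```python
import pandas as pd

# The seven origination dates named in the findings (8 to 14 Sep 2016).
dates = pd.Series(pd.date_range("2016-09-08", "2016-09-14", freq="D"))
weekdays = dates.dt.day_name()

for d, w in zip(dates, weekdays):
    print(d.strftime("%d %b %Y"), w)
```

The week starts on a Thursday, so Sunday and Monday fall on 11 and 12 Sep.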

In [None]:
print (f"Unique values:", df['effective_date'].unique()[:])
g=pd.DataFrame(df.groupby('effective_date')['Loan_ID'].count())
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() is the replacement.
g.loc[:,'Weekday']=pd.Series(g.index, index=g.index).dt.day_name()
g.loc[:,'strDate']=pd.Series(g.index, index=g.index).dt.date
#g

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
pos = np.arange(len(df.effective_date.unique()))
ax.bar(pos, height=g.Loan_ID.values, color=['b','b','g','g','b','b','b'])
ax.set_xticks(pos)
ax.set_xticklabels([w+"\n, "+ d.strftime('%d %b %Y') for w, d in zip(g.Weekday.values,g.strDate.values)],
                   rotation=90)
ax.set_title('Loans (count) by origination date');

### Due_date
**from the documentation:** "Due_date Since it’s one-time payoff schedule, each loan has one single due date" <br>
**Findings:** For 42 entries the due date could not be derived as Effective_Date plus Terms minus one day. 

In [None]:
print (f"Unique values:", df['due_date'].unique()[:])
g=pd.DataFrame(df.groupby('due_date')['Loan_ID'].count())
#g

In [None]:
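# Convert the integer day counts to timedeltas (astype('timedelta64[D]') fails on recent pandas).
df['TDterms']=pd.to_timedelta(df['terms'], unit='D')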
df['recalc_due_date']=df['effective_date']+df['TDterms']+dt.timedelta(days=-1)
mm=df[['effective_date', 'due_date', 'terms','recalc_due_date']][df['due_date']!=df['recalc_due_date']]
print('Mismatches in recalculated due_date: '+ str(len(mm)))
mm
#df[['effective_date', 'due_date', 'terms','recalc_due_date']].head()

### Paidoff_time

**from the documentation:** "Paidoff_time The actual time a customer pays off the loan" <br>
**Findings:** returned as timestamp.  <br>
**ToDo:** Stratify 

In [None]:
print (f"Unique values:", df.paid_off_time.nunique())
#g=pd.DataFrame(df.groupby('paid_off_time')['Loan_ID'].count())
#g

### Pastdue_days
**from the documentation:** "Pastdue_days How many days a loan has been past due" <br>
**Findings:** If a loan is not paid off by the due date, it goes to Collection; if it is then paid off, its status becomes Collection_paidoff. The minimum value for Collection is 28 days, the maximum is 76.
One can state that all loans overdue by less than 28 days have been collected; however, it is not possible to determine the final outcome of the collection process. The maximum value for Collection_paidoff is 56. Beyond that we see two spikes caused by the fixed terms of the loans. Most likely the data set was taken as a snapshot shortly afterwards, which prevents data analysis over a longer time horizon.
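
One way to probe the snapshot hypothesis is to add `past_due_days` to `due_date` for loans that are still in collection. This is a sketch under the assumption that, for open loans, `past_due_days` is counted up to the extraction date; the helper name `estimate_snapshot_dates` and the toy rows are hypothetical.

```python
import pandas as pd

def estimate_snapshot_dates(frame: pd.DataFrame) -> pd.Series:
    """Count the implied extraction dates (due_date + past_due_days) of open loans."""
    open_loans = frame[frame["loan_status"] == "COLLECTION"]
    implied = open_loans["due_date"] + pd.to_timedelta(
        open_loans["past_due_days"], unit="D")
    return implied.value_counts().sort_index()

# Toy rows for illustration only; the real `df` would be passed instead.
toy = pd.DataFrame({
    "loan_status": ["COLLECTION", "COLLECTION", "PAIDOFF"],
    "due_date": pd.to_datetime(["2016-10-07", "2016-09-22", "2016-10-07"]),
    "past_due_days": [30.0, 45.0, 0.0],
})

print(estimate_snapshot_dates(toy))
```

If the real data collapses to a single implied date, or to a very narrow range, that would support the reading of a one-off snapshot.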

In [None]:
print (f"Unique values:", df.past_due_days.unique()[:])
g=pd.DataFrame(df.groupby(['loan_status', 'past_due_days'])['Loan_ID'].count())
g=g.unstack('loan_status')
g=g.fillna(0)
#g

In [None]:
print(f'Collection min days: ',df[df.loan_status=='COLLECTION'].past_due_days.min())
print(f'Collection max days: ',df[df.loan_status=='COLLECTION'].past_due_days.max())
print(f'Paidoff min days: ',df[df.loan_status=='COLLECTION_PAIDOFF'].past_due_days.min())
print(f'Paidoff max days: ',df[df.loan_status=='COLLECTION_PAIDOFF'].past_due_days.max())

In [None]:
fig, ax = plt.subplots(figsize=(13.5,5))
pos = g.loc[:,'Loan_ID'].index
# Stacked bars: open collections at the bottom, collections paid off on top.
ax.bar(x=pos,
       height=g.loc[:,'Loan_ID']['COLLECTION'].values,
       label='Collection')
ax.bar(x=pos,
       bottom=g.loc[:,'Loan_ID']['COLLECTION'].values,
       height=g.loc[:,'Loan_ID']['COLLECTION_PAIDOFF'].values,
       label='Paid off')
ax.set_xticks([5*x for x in range(int(max(pos)/5.0)+1)])
ax.legend()
ax.set_title('Loans (count) by days past due');

### Age

**from the documentation:** "Age, [...] A customer’s basic demographic information" <br>
**Findings:** The age range is 18 to 51. That's a little bit short on the upper end. 

In [None]:
print (f"Unique values:", df.age.unique()[:])
g=pd.DataFrame(df.groupby('age')['Loan_ID'].count())
#g

In [None]:
fig, ax = plt.subplots(figsize=(13.5,5))
pos = g.index
ax.bar(x=pos,
       height=g.loc[:,'Loan_ID'].values,
       label='Age')
ax.set_xticks([5*x for x in range(int(max(pos)/5.0)+1)])
ax.legend()
ax.set_title('Borrower Age (count)');

### Education

**from the documentation:** "[...] education, gender A customer’s basic demographic information"

In [None]:
print (f"Unique values:", df.education.unique()[:])
g=pd.DataFrame(df.groupby('education')['Loan_ID'].count())
#g

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
ax.pie(g.Loan_ID.values, labels=list(g.index))
ax.set_title('Education (count of borrowers)');

### Gender

**from the documentation:** "[...] gender A customer’s basic demographic information"

In [None]:
print (f"Unique values:", df.Gender.unique()[:])
g=pd.DataFrame(df.groupby('Gender')['Loan_ID'].count())
#g

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
ax.pie(g.Loan_ID.values, labels=list(g.index))
ax.set_title('Gender (count of borrowers)');

## Correlation analysis

We calculate the correlation between the demographic attributes of the borrowers and find them surprisingly uncorrelated. This might again raise the question of how the data set was selected.

In [None]:
mapGender0 = {'male':0, 'female': 1}
df['Gender0']=df['Gender'].map(mapGender0)
mapEducation0 = {'High School or Below': 0, 'Bechalor': 1, 'college' : 2 , 'Master or Above':3}
df['Education0']=df['education'].map(mapEducation0)

In [None]:
corr = df[['age','Gender0','Education0']].corr()
fig, ax = plt.subplots(figsize = (6, 5))
# Annotated heatmap of the pairwise correlations.
heat = sns.heatmap(
        corr, 
        cmap = plt.cm.coolwarm,
        square=True, 
        cbar_kws={'shrink': .9}, 
        ax=ax, 
        annot = True, 
        annot_kws={'fontsize': 12})
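
The default `corr()` is Pearson, which treats the education codes as evenly spaced numbers. A rank-based variant might be a useful cross-check. The sketch below runs on made-up encoded values, not the loan data, and only illustrates the call.

```python
import pandas as pd

# Toy encoded demographics for illustration only (not the loan data).
toy = pd.DataFrame({
    "age": [22, 35, 41, 29, 50],
    "Gender0": [0, 1, 0, 0, 1],
    "Education0": [0, 2, 3, 1, 2],
})

# Spearman works on ranks, so it only assumes the codes are ordered,
# not that the steps between them are equal.
rank_corr = toy.corr(method="spearman")
print(rank_corr.round(2))
```

On the real data this would be `df[['age','Gender0','Education0']].corr(method='spearman')`. Gender is a nominal attribute, so its coefficient should be read with care under either method.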