# German Loan Data

## Instructions

### Banking: Loan Approval Case
In this use case, each entry in the dataset represents a person who takes a credit loan from a bank. The learning task is to classify each person as either a good or bad credit risk according to the set of attributes.

You can find the data `german_credit_data.csv` saved under the [data](../data) folder:<br>
NOTE: At this point, **DO NOT check the reference website**
- Look into the data using appropriate functions and extract its fields.
- For each dataset, describe what the data is about and which fields are saved.
    - Which columns contain continuous variables and which columns contain categorical variables?

You need to answer the questions and perform the task below:
- What are the mean age, mean credit amount, and mean duration?
- What are the three major purposes of loans?
- Who are the majority of loan takers, males or females?

Note:
- You are NOT ALLOWED to import other libraries or packages
- You can write your own functions
- Your answers should be readable, with appropriate comments
- You can refer to the [markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) if you are not familiar with Markdown

### Reference
This dataset was sourced from Kaggle: https://www.kaggle.com/uciml/german-credit

The original source is: https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

## Import libraries 

In [1]:
# Usual libraries are imported here
import os
import yaml
import dask.dataframe as dd
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Please perform your tasks below and answer the questions

In [2]:
retval = os.getcwd()
print("current working directory is %s" % retval)

current working directory is /Users/isla/Desktop


In [3]:
os.chdir('/Users/isla/Desktop')

In [4]:
loan_data = pd.read_csv('german_credit_data.csv')

In [5]:
loan_data.head(5)

,Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,0,67,male,2,own,,little,1169,6,radio/TV
1,1,22,female,2,own,little,moderate,5951,48,radio/TV
2,2,49,male,1,own,little,,2096,12,education
3,3,45,male,2,free,little,little,7882,42,furniture/equipment
4,4,53,male,2,free,little,little,4870,24,car
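The `Unnamed: 0` column in the output above duplicates the row index, which usually means the CSV was saved with its index included. The sketch below shows one way to absorb that column when reading the file. It uses a small in-memory CSV (`csv_text`, an illustrative stand-in, not the real file) so that it runs on its own.

```python
import io
import pandas as pd

# Illustrative stand-in for a CSV that was saved with its index column
csv_text = """,Age,Sex,Job
0,67,male,2
1,22,female,2
2,49,male,1
"""

# Default read: the unnamed first column becomes 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Applied to the loan data, this would be `pd.read_csv('german_credit_data.csv', index_col=0)`, assuming the first column really is just a saved index.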


In [6]:
# Get the columns that contain categorical variables
loan_data.select_dtypes(include=['category', object]).columns

Index(['Sex', 'Housing', 'Saving accounts', 'Checking account', 'Purpose'], dtype='object')

In [7]:
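# Get the numeric columns (candidates for continuous variables).
# Note: 'Unnamed: 0' is a row index and 'Job' is a coded category, so numeric does not always mean continuous.
loan_data.select_dtypes(include='number').columns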

Index(['Unnamed: 0', 'Age', 'Job', 'Credit amount', 'Duration'], dtype='object')
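A numeric dtype alone does not make a variable continuous: `Job` is stored as integers but takes only a few distinct codes. One rough way to separate the two is to count distinct values per numeric column. The sketch below does this on a toy frame built from the first five rows shown earlier; `split_by_cardinality` and its `max_levels` threshold are hypothetical names chosen for illustration, and the threshold would need tuning on the full data.

```python
import pandas as pd

# Toy frame mirroring the first five rows of the loan data (not the full dataset)
sample = pd.DataFrame({
    'Age': [67, 22, 49, 45, 53],
    'Sex': ['male', 'female', 'male', 'male', 'male'],
    'Job': [2, 2, 1, 2, 2],
    'Credit amount': [1169, 5951, 2096, 7882, 4870],
    'Duration': [6, 48, 12, 42, 24],
})

def split_by_cardinality(frame, max_levels=3):
    """Hypothetical helper: numeric columns with at most max_levels
    distinct values are treated as coded categories."""
    n_unique = frame.select_dtypes(include='number').nunique()
    coded = n_unique[n_unique <= max_levels].index.tolist()
    continuous = n_unique[n_unique > max_levels].index.tolist()
    return continuous, coded

continuous_cols, coded_cols = split_by_cardinality(sample)
print(continuous_cols)
print(coded_cols)
```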

In [8]:
# What are the mean age, mean credit amount, and mean duration?
df = loan_data.describe()
u_age = df.loc['mean', 'Age']
u_creditAmt = df.loc['mean', 'Credit amount']
u_dur = df.loc['mean', 'Duration']
print(u_age, u_creditAmt, u_dur)

35.546 3271.258 20.903
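The same means can also be read without `describe()`, by selecting the three columns and calling `.mean()` directly. A minimal sketch follows, again on a toy frame made from the first five rows shown earlier, so its numbers differ from the full-dataset result above.

```python
import pandas as pd

# Toy frame mirroring the first five rows of the loan data (not the full dataset)
sample = pd.DataFrame({
    'Age': [67, 22, 49, 45, 53],
    'Credit amount': [1169, 5951, 2096, 7882, 4870],
    'Duration': [6, 48, 12, 42, 24],
})

# Column-wise means of just the fields the question asks about
means = sample[['Age', 'Credit amount', 'Duration']].mean()
print(means)
```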


In [9]:
# What are the three major purposes of loans?
df_cnt = loan_data['Purpose'].value_counts()
df_cnt

car                    337
radio/TV               280
furniture/equipment    181
business                97
education               59
repairs                 22
vacation/others         12
domestic appliances     12
Name: Purpose, dtype: int64

In [10]:
top_purposes = df_cnt.index[0:3].tolist()
print('The three major loan purposes are %s' % top_purposes)

The three major loan purposes are ['car', 'radio/TV', 'furniture/equipment']


In [11]:
# Who are the majority of loan takers, males or females?
# Count loan takers for each (Age, Sex) pair
data = loan_data.groupby(['Age', 'Sex']).size().reset_index(name='cnt')
data.head()

,Age,Sex,cnt
0,19,female,2
1,20,female,8
2,20,male,6
3,21,female,6
4,21,male,8


In [12]:
gender = loan_data['Sex'].value_counts()
gender
# Males are the majority of loan takers (690 of 1000). Next, find which age group takes the most loans.

male      690
female    310
Name: Sex, dtype: int64
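Shares are often easier to compare than raw counts. The sketch below rebuilds a `Sex` column from the counts printed above (690 male, 310 female) and uses `value_counts(normalize=True)` to get proportions. The reconstructed series is a stand-in for `loan_data['Sex']`, not the original data.

```python
import pandas as pd

# Stand-in for loan_data['Sex'], rebuilt from the counts shown above
sex = pd.Series(['male'] * 690 + ['female'] * 310, name='Sex')

# normalize=True returns each value's share of all rows instead of its count
share = sex.value_counts(normalize=True)
print(share)
```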

In [13]:
# Build a per-row gender label array to use as a grouping key
gender = np.where(loan_data['Sex'].str.contains('female'), 'female', 'male')
gender[:5]


array(['male', 'female', 'male', 'male', 'male'], dtype='<U6')

In [14]:
agg = loan_data.groupby(['Age',gender])
agg

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x31e7f9be0>

In [15]:
agg_counts=agg.size().unstack().fillna(0)
agg_counts[:10]

Age,female,male
19,2.0,0.0
20,8.0,6.0
21,6.0,8.0
22,13.0,14.0
23,32.0,16.0
24,23.0,21.0
25,21.0,20.0
26,20.0,30.0
27,14.0,37.0
28,15.0,28.0


In [20]:
# Unpack the DataFrame above into plain lists
age = agg_counts.index.tolist()
fe = agg_counts['female'].tolist()
male = agg_counts['male'].tolist()

# Group the data into three generations:
# youth (under 40), the middle-aged (40 to 64) and the old (65 and above)
def divide_by_age(age_list, fe_list, male_list):
    f1 = f2 = f3 = 0
    m1 = m2 = m3 = 0
    # Use the function's own arguments rather than the global `age` list
    for a, f, m in zip(age_list, fe_list, male_list):
        if a < 40:
            f1 += f
            m1 += m
        elif a < 65:
            f2 += f
            m2 += m
        else:
            f3 += f
            m3 += m
    # Merge the totals into one DataFrame
    df = pd.DataFrame([[f1, m1], [f2, m2], [f3, m3]],
                      columns=['Female', 'Male'],
                      index=['Youth', 'the Middle-aged', 'the Old'])
    return df

In [22]:
df=divide_by_age(age,fe,male)
df

,Female,Male
Youth,241.0,460.0
the Middle-aged,63.0,213.0
the Old,6.0,17.0
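The same age grouping can be expressed without an explicit loop by binning ages with `pd.cut` and tabulating with `pd.crosstab`. This is an alternative sketch, not the approach used above. It runs on a toy frame built from the first five rows of the data, and the bin edges are chosen to match the loop's thresholds (under 40, 40 to 64, 65 and above).

```python
import pandas as pd

# Toy frame mirroring the first five rows of the loan data (not the full dataset)
sample = pd.DataFrame({
    'Age': [67, 22, 49, 45, 53],
    'Sex': ['male', 'female', 'male', 'male', 'male'],
})

# right=False makes the bins left-closed: [0, 40), [40, 65), [65, inf)
bins = [0, 40, 65, float('inf')]
labels = ['Youth', 'the Middle-aged', 'the Old']
sample['Age group'] = pd.cut(sample['Age'], bins=bins, labels=labels, right=False)

# Count rows for each (age group, sex) combination
table = pd.crosstab(sample['Age group'], sample['Sex'])
print(table)
```

On the full data, replacing `sample` with `loan_data` should give the same counts as `divide_by_age`, as integers rather than floats.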


In [None]:
# The largest group of loan takers is young males (under 40 years old), with 460 people.
# Young females come second, with 241.
# Middle-aged males (40 to 64) come third, with 213.
# Overall, both the youth group and the male group account for the majority of loan takers.