# Loan Prediction Dataset

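Of all industries, insurance is among those that rely most heavily on data analytics. The Loan Prediction dataset gives a clear sense of the challenges insurers face, the strategies they choose in response, and the effects of those choices. Like the Titanic dataset, it is a classification problem. The dataset has 13 columns and 615 rows.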

Typical problem: predict whether a loan application will be approved.

[Get Data](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/)

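**Problem statement**

> About the company

> * Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility.

> Problem

> * The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form. These details include gender, marital status, education, number of dependents, income, loan amount, credit history, and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan amount, so that it can target those customers specifically. A partial dataset is provided here.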

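**Data**

| Variable | Description |
| --- | --- |
|Loan_ID|Unique loan ID|
|Gender|Male / Female|
|Married|Applicant is married (Yes / No)|
|Dependents|Number of dependents|
|Education|Applicant education (Graduate / Not Graduate)|
|Self_Employed|Self-employed (Yes / No)|
|ApplicantIncome|Applicant income|
|CoapplicantIncome|Co-applicant income|
|LoanAmount|Loan amount, in thousands|
|Loan_Amount_Term|Loan term, in months|
|Credit_History|Credit history meets guidelines|
|Property_Area|Urban / Semiurban / Rural|
|Loan_Status|Loan approved (Y / N)|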


In [2]:
%ls -l

total 72
-rw-rw-r-- 1 miaopei miaopei  3683 Mar 15 18:20 LoanPrediction.ipynb
-rwxr--r-- 1 miaopei miaopei    21 Mar 15 18:15 Sample_Submission_ZAuTl8O_FK3zQHh.csv*
-rwxr--r-- 1 miaopei miaopei 21957 Mar 15 18:15 test_Y3wMUE5_7gLdaTN.csv*
-rwxr--r-- 1 miaopei miaopei 38013 Mar 15 18:15 train_u6lujuX_CVtuZ9i.csv*


In [4]:
import numpy as np
import pandas as pd

data = pd.read_csv('train_u6lujuX_CVtuZ9i.csv', index_col='Loan_ID')

data.head()

| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
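To try the snippets in these notes without downloading the competition files, the five rows printed above can be rebuilt as a small DataFrame. This is only a sketch: `SAMPLE_CSV` and `sample` are names introduced here for illustration, and the real `train_u6lujuX_CVtuZ9i.csv` contains many more rows.

```python
import io

import pandas as pd

# Hypothetical inline copy of the five rows shown by data.head();
# the empty LoanAmount field in the first row is a missing value.
SAMPLE_CSV = """Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
"""

# Same call as for the real file: Loan_ID becomes the row index.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col="Loan_ID")

print(sample.shape)   # (rows, columns); the index is not counted as a column
print(sample.dtypes)  # LoanAmount is float64, and its empty field becomes NaN
```

Because `Loan_ID` is used as the index, the frame reports 12 data columns rather than the 13 listed in the variable table.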


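## 1. Boolean Indexing

What do you do when you need to filter the values of one column based on conditions on other columns? Suppose, for example, that we want a list of all women who are not graduates and who were granted a loan. Boolean indexing handles this directly: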

In [13]:
data.loc[(data['Gender']=='Female') & (data['Education']=='Not Graduate') & (data['Loan_Status']=='Y'),
         ['Gender', 'Education', 'Loan_Status']]

| Loan_ID | Gender | Education | Loan_Status |
| --- | --- | --- | --- |
| LP001155 | Female | Not Graduate | Y |
| LP001669 | Female | Not Graduate | Y |
| LP001692 | Female | Not Graduate | Y |
| LP001908 | Female | Not Graduate | Y |
| LP002300 | Female | Not Graduate | Y |
| LP002314 | Female | Not Graduate | Y |
| LP002407 | Female | Not Graduate | Y |
| LP002489 | Female | Not Graduate | Y |
| LP002502 | Female | Not Graduate | Y |
| LP002534 | Female | Not Graduate | Y |
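The same filter can be built step by step by giving the mask a name, which tends to be easier to read and debug. The sketch below runs on a tiny hand-made frame (the `toy` rows and their `A1`..`A4` IDs are invented for illustration, not taken from the dataset) and also shows `DataFrame.query` as an alternative spelling.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names match the dataset.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=pd.Index(["A1", "A2", "A3", "A4"], name="Loan_ID"),
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (
    (toy["Gender"] == "Female")
    & (toy["Education"] == "Not Graduate")
    & (toy["Loan_Status"] == "Y")
)

# Rows where the mask is True, restricted to the three columns of interest.
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)

# Equivalent filter written as a query string.
same = toy.query(
    "Gender == 'Female' and Education == 'Not Graduate' and Loan_Status == 'Y'"
)
print(same.index.tolist())
```

Note that Python's `and` / `or` keywords do not work on Series outside of `query`; the element-wise operators `&`, `|`, and `~` are needed there.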


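## 2. The Apply Function

`apply` is one of the most commonly used functions for manipulating data and creating new variables. It passes each row or each column of a DataFrame to a given function and returns the results. The function passed to `apply` can be built-in or user-defined. In the example below, it is used to count the missing values in each column and in each row.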

In [16]:
# Create a new function that counts the missing values in a Series:
def num_missing(x):
    return x.isnull().sum()

# Apply per column (axis=0 passes each column to the function):
print("Missing values per column:")
print(data.apply(num_missing, axis=0))

print()

# Apply per row (axis=1 passes each row to the function):
print("Missing values per row:")
print(data.apply(num_missing, axis=1).head())

Missing values per column:
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

Missing values per row:
Loan_ID
LP001002    1
LP001003    0
LP001005    0
LP001006    0
LP001008    0
dtype: int64
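For this particular task `apply` is not strictly required: `isnull()` returns a boolean frame, and summing it along either axis gives the same counts. Here is a minimal sketch on an invented three-row frame (the `toy` values are made up for illustration), comparing the two approaches.

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks a missing entry.
toy = pd.DataFrame(
    {
        "Gender": ["Male", np.nan, "Female"],
        "LoanAmount": [np.nan, 128.0, 66.0],
        "Credit_History": [1.0, np.nan, np.nan],
    },
    index=pd.Index(["A1", "A2", "A3"], name="Loan_ID"),
)


def num_missing(x):
    """Count missing values in a Series (one column or one row)."""
    return x.isnull().sum()


# apply-based counts, mirroring the cell above.
per_column_apply = toy.apply(num_missing, axis=0)
per_row_apply = toy.apply(num_missing, axis=1)

# Vectorized counts: True counts as 1 when summed.
per_column = toy.isnull().sum()      # axis=0 is the default
per_row = toy.isnull().sum(axis=1)

print(per_column)
print(per_row)
```

`apply` remains the more general tool when the per-row or per-column logic cannot be expressed with built-in vectorized methods.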

