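Project background:
Today's rental industry often gives rise to financial products. On platforms such as Mogu Zufang (蘑菇租房) and Beike Zhaofang (贝壳找房), tenants find a home and then pay rent, so rental companies partner with finance companies to offer different payment methods. In this arrangement the tenant does not pay the rent directly to the landlord; the finance company pays it first. For the rental company this is comparable to taking out a bank loan: it later repays principal and interest to the finance company. When paying rent, the tenant can choose a payment method, for example paying the rent themselves or having a third-party finance company advance part of it.
Dataset: customer data for the "C" orders (C单) of a housing rental company.
Project requirement:
Assess whether a tenant's credit loan will become overdue, in order to decide whether to grant that tenant the rent loan.
Project approach:
Build a machine learning model that classifies customers.
The user features selected here are the Zhima (Sesame) Credit score, the loan principal, and the loan interest.
Algorithms: logistic regression and random forest.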

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split  # split the data into training and test sets

In [2]:
from sklearn.ensemble import RandomForestClassifier  # random forest classifier

In [3]:
from sklearn.model_selection import GridSearchCV  # hyperparameter tuning via grid search

In [4]:
from sklearn.linear_model import LogisticRegression  # logistic regression

In [5]:
data=pd.read_csv(r"D:\note\FengXianYuCe\data.csv")

In [6]:
data.head().T

                        0           1           2           3           4
id                 322583      386885      366671      318605      195278
overduestatus           0           0           0           0           0
loanid              73967       95399       88661       72641       31517
renterid         33793799    33051527    34211951    31373835    32746433
landlordid        3132391     3132115     3142398     3139356     3125326
cityid                268         127         315         340         268
capitalamount        1608      586.66      586.66      487.33         950
payfeeamount        49.31        17.8        17.8       14.95       29.14
deadline       2018/10/22  2018/10/18   2018/10/8  2018/10/20   2018/9/23
periodstage             2           1           1           2           3


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27713 entries, 0 to 27712
Data columns (total 37 columns):
id                      27713 non-null int64
overduestatus           27713 non-null int64
loanid                  27713 non-null int64
renterid                27713 non-null int64
landlordid              27713 non-null int64
cityid                  27713 non-null int64
capitalamount           27713 non-null float64
payfeeamount            27713 non-null float64
deadline                27713 non-null object
periodstage             27713 non-null int64
lendamount              27713 non-null float64
remainamount            27713 non-null float64
lenddate                27713 non-null object
subsid                  27709 non-null float64
salecontractid          27709 non-null float64
is_reservation          27713 non-null int64
is_book                 27713 non-null int64
reg_channel             27709 non-null float64
is_changeroom           27709 non-null float64
is_renew          

In [8]:
X=data[['capitalamount','payfeeamount','zhimascore']]

In [9]:
X.head()

   capitalamount  payfeeamount  zhimascore
0         1608.0         49.31         693
1         586.66          17.8         662
2         586.66          17.8         671
3         487.33         14.95         698
4          950.0         29.14         688
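The data.info() output shows that a few columns (for example subsid and reg_channel) have 27709 non-null values out of 27713 rows, and the listing is cut off before zhimascore, so it is not visible here whether the three selected features are complete. A minimal sketch of how one might check, using a small made-up frame (the missing entry and the label values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the NaN and the labels are invented.
demo = pd.DataFrame({
    "capitalamount": [1608.0, 586.66, 586.66, 487.33],
    "payfeeamount": [49.31, 17.8, 17.8, 14.95],
    "zhimascore": [693, 662, np.nan, 698],
    "overduestatus": [0, 0, 0, 1],
})
features = ["capitalamount", "payfeeamount", "zhimascore"]

# Count missing values in each selected feature.
missing_per_column = demo[features].isna().sum()
print(missing_per_column)

# Drop incomplete rows once, then derive X and Y from the same frame
# so features and labels stay aligned.
clean = demo.dropna(subset=features)
X_demo = clean[features]
Y_demo = clean["overduestatus"]
print(len(demo), len(clean))
```

Dropping rows is only one option; imputing a value would be another, and which is appropriate depends on how many rows are affected.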


In [10]:
Y=data['overduestatus']
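Before modeling, it can help to look at how the target is distributed, because overdue loans are usually a minority and the share of the majority class is the accuracy that a "predict nobody is overdue" rule would reach. The notebook does not show this distribution, so the sketch below uses a made-up label series (8% overdue is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for data['overduestatus']: 92 on-time loans, 8 overdue.
y_demo = pd.Series([0] * 92 + [1] * 8, name="overduestatus")

# Share of each class, sorted from most to least frequent.
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# Accuracy of always predicting the most frequent class.
majority_baseline = class_share.max()
print(majority_baseline)
```

On the real data the same two lines would be applied to Y; any model accuracy is then best read relative to that baseline.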

In [11]:
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
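The split above is purely random. When one class is rare, passing the labels to the stratify parameter of train_test_split keeps the class proportions the same in both parts. This is an optional variation, not what the notebook does; the data below is synthetic and the 8% overdue rate is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, 8% positives (assumed).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 80 + [0] * 920)

# stratify=y_demo preserves the overdue rate in both train and test sets.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```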

In [12]:
rd=RandomForestClassifier(n_estimators=10)  # random forest model

In [13]:
rd.fit(x_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [14]:
rd.score(x_test,y_test)

0.90907778900274205
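score() on a classifier returns plain accuracy. For an overdue-prediction task, accuracy alone can hide whether any overdue loans are caught at all, so it may be worth comparing against a majority-class baseline and looking at the confusion matrix and recall. The sketch below runs on synthetic imbalanced data from make_classification (about 10% positives, an assumption), not on the rental dataset, so its numbers say nothing about the model above:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data with 3 features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(x_tr, y_tr)
baseline_acc = baseline.score(x_te, y_te)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_tr, y_tr)
forest_acc = forest.score(x_te, y_te)
y_pred = forest.predict(x_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
# Fraction of truly positive (overdue) samples that the model finds.
overdue_recall = recall_score(y_te, y_pred)

print(baseline_acc, forest_acc)
print(cm)
print(overdue_recall)
```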

In [15]:
rd2=RandomForestClassifier()

In [16]:
param={'max_depth':[5,8,15,25,30],'n_estimators':[120,200,300,500,800]}

In [17]:
gs=GridSearchCV(rd2,param_grid=param,cv=5)  # cv=5: 5-fold cross-validation

In [18]:
gs.fit(x_train,y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [5, 8, 15, 25, 30], 'n_estimators': [120, 200, 300, 500, 800]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [19]:
gs.best_params_

{'max_depth': 8, 'n_estimators': 300}

In [20]:
gs.score(x_test,y_test)

0.92336556501659695
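The GridSearchCV repr above shows scoring=None, which means candidates are ranked by the estimator's default score (accuracy for classifiers). If ranking by a metric that is less sensitive to class imbalance is preferred, the scoring parameter accepts names such as "roc_auc". A small sketch on synthetic data follows; the grid is deliberately reduced so it runs quickly, and none of the values are tuned for the rental dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives, assumed).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Reduced illustrative grid; the notebook's grid is larger.
small_grid = {"max_depth": [3, 5], "n_estimators": [20, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,               # 3-fold cross-validation to keep the demo fast
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
)
search.fit(x_tr, y_tr)

# With scoring set, score() reports ROC AUC on the held-out data.
test_auc = search.score(x_te, y_te)
print(search.best_params_, test_auc)
```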

In [21]:
# Logistic regression
lg=LogisticRegression()

In [22]:
lg.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
lg.score(x_test,y_test)

0.91730408428344634
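The three features live on quite different scales (credit scores in the hundreds, interest amounts in the tens), and regularized logistic regression is sensitive to feature scale. One common option, not used in the notebook, is to standardize inside a pipeline so that the scaler is fitted on the training data only. A sketch on synthetic data whose columns are stretched to loosely mimic such scales (the multipliers and offsets are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 3 features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
# Stretch and shift the columns so they sit on very different scales.
X_demo = X_demo * np.array([500.0, 15.0, 30.0]) + np.array([1000.0, 30.0, 680.0])

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# StandardScaler is fitted on the training split only, then applied to test data.
scaled_lg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lg.fit(x_tr, y_tr)

scaled_acc = scaled_lg.score(x_te, y_te)
print(scaled_acc)
```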

In [24]:
print(gs.predict(x_test))  # predicted labels

[0 0 0 ..., 0 0 0]
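predict() returns hard class labels, and the entries visible in the printed array are all 0. For a lending decision it may be more useful to look at the predicted probability of the overdue class and choose a cut-off explicitly. The sketch below does this on synthetic data; the 0.3 threshold is a hypothetical value chosen for illustration, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives, assumed).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(
    n_estimators=50, max_depth=8, random_state=0
).fit(x_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1 (overdue).
overdue_proba = forest.predict_proba(x_te)[:, 1]

threshold = 0.3  # hypothetical cut-off, lower than the implicit 0.5
flagged = (overdue_proba >= threshold).astype(int)
default_pred = forest.predict(x_te)

# A lower threshold can only flag the same or more samples as overdue.
print(default_pred.sum(), flagged.sum())
print(classification_report(y_te, flagged, zero_division=0))
```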
