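- Key point 1: Handling missing values is a recurring problem in data preprocessing. For an overview of the available approaches, see: https://blog.csdn.net/weixin_44481878/article/details/88987662
- Key point 2: `dropna()` is the pandas function most often used to delete missing data. For a detailed introduction, see: https://zhuanlan.zhihu.com/p/108114844
- Key point 3: For how to use the pandas `apply` function, see: https://www.jianshu.com/p/4fdd6eee1b06?from=singlemessage
- Key point 4: For date and time conversion in pandas, see the `pandas.to_datetime` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
- Key point 5: sklearn offers several ways to split a dataset, for example the `KFold` class for k-fold cross-validation and the `LeaveOneOut` class for leave-one-out validation. For more splitting methods, see: https://www.cnblogs.com/cmybky/p/11772655.html
- Key point 6: The model uses six features, including "User_id". Are all six actually meaningful for predicting the label? Can we construct more effective features and add them to the model to improve its predictions? Think it over and explore!
- For decision trees in sklearn, see the decision tree section of the official sklearn documentation: https://scikit-learn.org/stable/modules/tree.html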
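
To make key point 5 concrete, here is a minimal sketch of `KFold` and `LeaveOneOut` on a toy array. The array `X_demo` and the variable names are illustrative assumptions, not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 6 samples with 2 features each (illustrative only).
X_demo = np.arange(12).reshape(6, 2)

# 3-fold cross-validation: every sample lands in a test fold exactly once.
kf = KFold(n_splits=3, shuffle=True, random_state=2021)
kfold_test_sizes = [len(test_idx) for _, test_idx in kf.split(X_demo)]

# Leave-one-out: one split per sample, each with a single test sample.
loo = LeaveOneOut()
loo_n_splits = loo.get_n_splits(X_demo)

print(kfold_test_sizes, loo_n_splits)
```

With 6 samples, each of the 3 folds holds out 2 samples, while leave-one-out produces 6 splits.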

In [1]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [2]:
data = pd.read_csv('./data/ccf_offline_stage1_train.csv')
data.info()
data 

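- User_id: user ID
- Merchant_id: merchant ID
- Coupon_id: coupon ID; null means the purchase was made without a coupon, in which case the Discount_rate and Date_received fields are meaningless
- Discount_rate: discount; a value x in [0,1] is a discount rate; x:y means y off for a purchase of at least x. Amounts are in yuan
- Distance: the distance from the place the user most often visits to the merchant's nearest store is x*500 meters (for a chain, the nearest branch is used), with x in [0,10]; null means no information, 0 means less than 500 meters, and 10 means more than 5 km
- Date_received: date the coupon was received
- Date: purchase date. (Date - Date_received <= 15) means the coupon was received and used within 15 days, i.e. a positive sample, y = 1; (Date - Date_received > 15) means the coupon was not used within 15 days, i.e. a negative sample, y = 0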
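
As one possible extra feature (in the spirit of key point 6), the two Discount_rate formats described above can be mapped onto a single numeric scale. This is only a sketch: the helper name `discount_to_rate` is hypothetical, and treating "x:y" as the rate `1 - y/x` is an assumption about how to compare the two formats, not something this notebook does.

```python
def discount_to_rate(value: str) -> float:
    """Convert a Discount_rate string to an equivalent rate in [0, 1].

    'x:y' (y off for a purchase of at least x) becomes 1 - y/x;
    a plain rate such as '0.95' is returned as a float.
    """
    if ':' in value:
        threshold, reduction = value.split(':')
        return 1.0 - float(reduction) / float(threshold)
    return float(value)


print(discount_to_rate('100:10'))
print(discount_to_rate('0.95'))
```

Under this assumption "100:10" and "0.9" end up with the same value, so a tree can split on discount depth rather than only on discount type.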

In [4]:
data.dropna(inplace=True, how='any')
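
The sketch below shows how the `how` and `subset` arguments of `dropna` change which rows survive. The toy DataFrame is made up for illustration. Note that with `how='any'`, a row with a missing `Date` is dropped as well, so in this notebook coupons that were received but never used are presumably removed before labeling.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two columns (illustrative only).
toy = pd.DataFrame({
    'Coupon_id': [1.0, np.nan, 3.0, np.nan],
    'Date': [20160101.0, 20160102.0, np.nan, np.nan],
})

rows_any = len(toy.dropna(how='any'))                # drop rows with at least one NaN
rows_all = len(toy.dropna(how='all'))                # drop rows where every value is NaN
rows_subset = len(toy.dropna(subset=['Coupon_id']))  # only check Coupon_id

print(rows_any, rows_all, rows_subset)
```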

In [5]:
# "x:y" means y off for a purchase of at least x; encode strings of this form as 1
# A value in [0,1] is a discount rate; encode strings of this form as 0
def getDiscountType(row):
    if ':' in row:
        return 1
    else:
        return 0


In [44]:
data['Discount_rate'] = data['Discount_rate'].apply(getDiscountType)

In [45]:
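def label(row):
    # Time elapsed between receiving the coupon and using it
    td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
    # Used within 15 days: positive sample (1); otherwise negative sample (0)
    if td <= pd.Timedelta(15, 'D'):
        return 1
    else:
        return 0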

data['label'] = data.apply(label, axis=1)
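
Row-wise `apply` parses two dates per row, which can be slow on a large table. A vectorized variant is sketched below on a toy frame. It assumes the date columns are stored as floats such as `20160101.0` (typical when the column contained missing values on load); the helper `to_dates` is hypothetical.

```python
import pandas as pd

# Toy frame mimicking the two date columns (illustrative only).
toy = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160520.0],
    'Date': [20160110.0, 20160120.0, 20160604.0],
})


def to_dates(col: pd.Series) -> pd.Series:
    # Cast 20160101.0 -> 20160101 -> '20160101' before parsing.
    return pd.to_datetime(col.astype(int).astype(str), format='%Y%m%d')


# Whole-column subtraction yields a Series of Timedelta values.
gap = to_dates(toy['Date']) - to_dates(toy['Date_received'])
toy['label'] = (gap <= pd.Timedelta(days=15)).astype(int)

print(toy['label'].tolist())
```

The gaps here are 9, 19, and 15 days, so the second row is the only negative sample.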

In [46]:
# Count the positive and negative samples
print(data['label'].value_counts())

1    6658
0    1186
Name: label, dtype: int64


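Tips: When the ratio of positive to negative samples is heavily skewed (for example 1:10), the dataset is called imbalanced. An imbalanced dataset can distort the model, so it usually needs to be treated first. Common approaches include oversampling, undersampling, and synthetic data generation; for details, see: https://blog.csdn.net/a8039974/article/details/83684841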
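
As a minimal sketch of oversampling, the minority class can be resampled with replacement using `sklearn.utils.resample` until it matches the majority class. The toy frame and variable names are illustrative assumptions. In practice this would be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 10 positives, 2 negatives (illustrative only).
toy = pd.DataFrame({'feature': range(12), 'label': [1] * 10 + [0] * 2})
majority = toy[toy['label'] == 1]
minority = toy[toy['label'] == 0]

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=2021)
balanced = pd.concat([majority, minority_up])
counts = balanced['label'].value_counts().to_dict()

print(counts)
```

A lighter alternative that leaves the data untouched is to pass `class_weight='balanced'` to `DecisionTreeClassifier`.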

In [47]:
# Split the dataset into training and test sets
X_data = data.drop(columns=['label'])
y_data = data['label']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=2021)
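
With imbalanced labels, a plain random split can give the training and test sets slightly different class ratios. The sketch below uses the `stratify` argument of `train_test_split` to keep the ratio the same in both parts. The toy arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 80 positive and 20 negative (illustrative only).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 80 + [0] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2021, stratify=y_demo)
test_pos = int(y_te.sum())

print(len(y_te), test_pos)
```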

In [48]:
# Check the size and class distribution of the training set
print('y_train sample distribute', y_train.value_counts())
# Check the size and class distribution of the test set
print('y_test sample distribute', y_test.value_counts())

y_train sample distribute 1    5354
0     921
Name: label, dtype: int64
y_test sample distribute 1    1304
0     265
Name: label, dtype: int64


In [53]:
# Build the model
model = DecisionTreeClassifier(max_depth=5, random_state=2021)
# Train the model
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=2021,
            splitter='best')

In [54]:
# Predict on the test set
y_pred = model.predict(X_test)

In [55]:
accuracy_score(y_test, y_pred)

0.9305289993626513
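
Accuracy on an imbalanced test set is easier to judge against a trivial baseline. The sketch below rebuilds a label vector with the test-set class counts printed earlier (1304 positives, 265 negatives) and scores a hypothetical predictor that always answers 1. Only the counts come from this notebook; the arrays themselves are synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels with the same class counts as y_test above.
y_true = np.array([1] * 1304 + [0] * 265)
# Baseline: always predict the majority class (1).
y_all_pos = np.ones_like(y_true)

baseline_acc = accuracy_score(y_true, y_all_pos)  # 1304 / 1569, about 0.831
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_all_pos)

print(baseline_acc)
print(cm)
```

The decision tree's 0.93 should therefore be read against a baseline of roughly 0.83, not against 0.5.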

In [56]:
# Change the split criterion used by the model to entropy
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2021)
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=2021,
            splitter='best')
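
To see what the two criteria measure, the sketch below computes Gini impurity and entropy from class counts, using the training-set distribution printed earlier (5354 positives, 921 negatives) as the root node. The helper functions `gini` and `entropy` are hypothetical illustrations of the standard formulas, not sklearn's internal code.

```python
import numpy as np


def gini(counts):
    # Gini impurity: 1 - sum(p_k^2)
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)


def entropy(counts):
    # Shannon entropy in bits: -sum(p_k * log2(p_k))
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))


train_counts = [5354, 921]
root_gini = gini(train_counts)
root_entropy = entropy(train_counts)
balanced_gini = gini([50, 50])        # 0.5, the maximum for two classes
balanced_entropy = entropy([50, 50])  # 1.0, the maximum for two classes

print(root_gini, root_entropy)
```

Both measures are 0 for a pure node and largest for a 50/50 node, which is why the two criteria often choose similar splits.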

In [57]:
y_pred = model.predict(X_test)

In [1]:
accuracy_score(y_test, y_pred)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.neighbors import KNeighborsClassifier

model = DecisionTreeClassifier(max_depth=9, random_state=33)
model2 = LogisticRegression()
model3 = RandomForestClassifier(max_depth=9, random_state=33)
model4 = GaussianNB()
model5 = CategoricalNB()
model6 = KNeighborsClassifier()

model.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)
model4.fit(X_train, y_train)
model5.fit(X_train, y_train)
model6.fit(X_train, y_train)
print('DecisionTreeClassifier train score:', model.score(X_train, y_train))
print('LogisticRegression train score:', model2.score(X_train, y_train))
print('RandomForestClassifier train score:', model3.score(X_train, y_train))
print('GaussianNB train score:', model4.score(X_train, y_train))
print('CategoricalNB train score:', model5.score(X_train, y_train))
print('KNeighborsClassifier train score:', model6.score(X_train, y_train))

print('DecisionTreeClassifier test score:', model.score(X_test, y_test))
print('LogisticRegression test score:', model2.score(X_test, y_test))
print('RandomForestClassifier test score:', model3.score(X_test, y_test))
print('GaussianNB test score:', model4.score(X_test, y_test))
print('CategoricalNB test score:', model5.score(X_test, y_test))
print('KNeighborsClassifier test score:', model6.score(X_test, y_test))
