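# Machine Learning Basics - Predicting Coupon Usage with a Decision Tree

To stay close to real-world applications, this exercise centers on processing a real dataset. Given users' actual online and offline consumption behavior between January 1, 2016 and June 30, 2016, the goal is to predict whether a user will use a coupon within 15 days of receiving it in July 2016.

> Note: To protect the privacy of users and merchants, all data has been anonymized, and biased sampling and necessary filtering have been applied.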


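## Dataset Description

`ccf_offline_stage1_train.csv` - training data

Field | Description
:-|-
User_id | User ID
Merchant_id | Merchant ID
Coupon_id | Coupon ID: null means a purchase made without a coupon, in which case Discount_rate and Date_received are meaningless
Discount_rate | Discount rate: a value $x \in [0,1]$ is a discount rate; `x:y` means y off a purchase of at least x, with amounts in yuan
Distance | The distance from the place the user frequents to the merchant's nearest store is x*500 meters (for a chain, the nearest branch is used), $x \in [0,10]$; null means no information, 0 means under 500 meters, 10 means over 5 km
Date_received | Date the coupon was received
Date | Purchase date: if Date = null & Coupon_id != null, the coupon was received but not used (a negative sample); if Date != null & Coupon_id = null, this is the date of an ordinary purchase; if Date != null & Coupon_id != null, this is the date the coupon was used (a positive sample)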
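
The three cases listed for the `Date` field can be checked mechanically. The following is a minimal sketch on made-up rows; the helper name `record_kind` and the category names are hypothetical and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) covering the three cases described for the Date field.
toy = pd.DataFrame({
    "Coupon_id": [np.nan, 11002.0, 8591.0],
    "Date_received": [np.nan, 20160528.0, 20160217.0],
    "Date": [20160217.0, np.nan, 20160220.0],
})

def record_kind(row):
    """Hypothetical helper: name the case a row falls into."""
    has_coupon = pd.notna(row["Coupon_id"])
    has_purchase = pd.notna(row["Date"])
    if has_coupon and has_purchase:
        return "coupon_used"        # candidate positive sample
    if has_coupon:
        return "coupon_unused"      # negative sample
    if has_purchase:
        return "ordinary_purchase"  # no coupon involved
    return "unknown"

toy["kind"] = toy.apply(record_kind, axis=1)
print(toy["kind"].tolist())
```

Applied to the real table, the same helper would show how many rows fall into each case before any filtering is done.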


In [1]:
import pandas as pd
import numpy as np

In [2]:
path = '/Volumes/Library/SynologyDrive/data/AI_Cheats/'
train_data = pd.read_csv(path + 'ccf_offline_stage1_train.csv')

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   User_id        int64  
 1   Merchant_id    int64  
 2   Coupon_id      float64
 3   Discount_rate  object 
 4   Distance       float64
 5   Date_received  float64
 6   Date           float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB


In [4]:
train_data.head()

   User_id  Merchant_id  Coupon_id Discount_rate  Distance  Date_received        Date
0  1439408         2632        NaN           NaN       0.0            NaN  20160217.0
1  1439408         4663    11002.0        150:20       1.0     20160528.0         NaN
2  1439408         2632     8591.0          20:1       0.0     20160217.0         NaN
3  1439408         2632     1078.0          20:1       0.0     20160319.0         NaN
4  1439408         2632     8591.0          20:1       0.0     20160613.0         NaN


In [5]:
# Data preprocessing: drop every row that contains a missing value
print(train_data.shape)
data = train_data.dropna(how='any')
print(data.shape)

(1754884, 7)
(67165, 7)
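
Because `dropna(how='any')` removes every row with any missing field, rows where a coupon was received but never used (`Date` is null) are dropped as well. The sketch below, on made-up rows, compares that with a more lenient filter; the `subset` choice is an assumption for illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) with different missing fields.
toy = pd.DataFrame({
    "Coupon_id": [np.nan, 11002.0, 8591.0, 1078.0],
    "Distance": [0.0, 1.0, np.nan, 0.0],
    "Date_received": [np.nan, 20160528.0, 20160217.0, 20160319.0],
    "Date": [20160217.0, np.nan, 20160220.0, 20160325.0],
})

# Missing values per column, useful before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Strict filter: a row survives only if every field is present.
strict = toy.dropna(how="any")

# Lenient filter (assumption): only require that a coupon was received,
# so received-but-unused coupons remain available as negative samples.
lenient = toy.dropna(subset=["Coupon_id", "Date_received"])

print(len(strict), len(lenient))
```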


In [6]:
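'''
Discount_rate has dtype object, which here means the column holds strings.
Strings cannot be fed to the model, so the column must be converted to a numeric type.

[0, 1] is a discount rate
x:y means y off a purchase of at least x
'''
print('Unique Discount_rate values:\n', data['Discount_rate'].unique())



Unique Discount_rate values: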
 ['20:1' '20:5' '30:5' '50:10' '10:5' '50:20' '100:10' '30:10' '50:5'
 '30:1' '100:30' '0.8' '200:30' '100:20' '10:1' '200:20' '0.95' '5:1'
 '100:5' '100:50' '50:1' '20:10' '150:10' '0.9' '200:50' '150:20' '150:50'
 '200:5' '300:30' '100:1' '200:10' '150:30' '0.85' '0.6' '0.5' '300:20'
 '200:100' '300:50' '150:5' '300:10' '0.75' '0.7' '30:20' '50:30']
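
Both notations can also be mapped onto one numeric scale. The sketch below treats `x:y` as an effective rate of `1 - y/x`; this mapping and the helper name `discount_to_rate` are assumptions for illustration, not something the dataset description prescribes.

```python
def discount_to_rate(value):
    """Hypothetical helper: map a Discount_rate string to a rate in [0, 1]."""
    text = str(value)
    if ":" in text:
        # "x:y" means y off a purchase of at least x, so the buyer pays (x - y) / x.
        threshold, reduction = text.split(":")
        return 1.0 - float(reduction) / float(threshold)
    # Already a rate such as "0.8".
    return float(text)

for raw in ["20:1", "200:100", "0.8"]:
    print(raw, discount_to_rate(raw))
```

With a pandas column this could be applied as `data['Discount_rate'].apply(discount_to_rate)`.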


In [7]:
# Convert Discount_rate into a numeric feature

# Discount type

def getDiscountType(row):
    if ':' in row:
        # "x:y" (y off a purchase of at least x) is encoded as 1
        return 1
    else:
        # a rate in [0, 1] is encoded as 0
        return 0
    
data['Discount_rate'] = data['Discount_rate'].apply(getDiscountType)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Discount_rate'] = data['Discount_rate'].apply(getDiscountType)
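
The warning above is raised because `data` was derived from `train_data`, so pandas cannot tell whether the assignment targets a copy. One common way to avoid it is to take an explicit copy after filtering. The sketch below uses made-up rows, and the column name `Discount_type` is hypothetical.

```python
import pandas as pd

# Toy frame (made-up values); the last row has a missing Discount_rate.
raw = pd.DataFrame({
    "Discount_rate": ["20:1", "0.8", None],
    "Distance": [0.0, 1.0, 2.0],
})

# An explicit copy makes `clean` independent of `raw`,
# so adding a column to it does not trigger the warning.
clean = raw.dropna(how="any").copy()

# 1 for the "x:y" form, 0 for a plain rate.
clean["Discount_type"] = clean["Discount_rate"].str.contains(":").astype(int)
print(clean)
```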


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67165 entries, 6 to 1754880
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   User_id        67165 non-null  int64  
 1   Merchant_id    67165 non-null  int64  
 2   Coupon_id      67165 non-null  float64
 3   Discount_rate  67165 non-null  int64  
 4   Distance       67165 non-null  float64
 5   Date_received  67165 non-null  float64
 6   Date           67165 non-null  float64
dtypes: float64(4), int64(3)
memory usage: 4.1 MB


In [9]:
# Import the model, the accuracy metric, and the dataset-splitting helper
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

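### Add a label column to the dataset

- The label marks which samples are positive (y = 1) and which are negative (y = 0).
- Prediction target: whether a user redeems a coupon within 15 days of receiving it.
- (Date - Date_received <= 15) means the coupon was received and used within 15 days: a positive sample, y = 1.
- (Date - Date_received > 15) means the coupon was received but not used within 15 days: a negative sample, y = 0.

Pandas tutorial: https://mp.weixin.qq.com/s/E3-RbVe3LIRKrKU1OcSQ4w

Pandas datetime reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html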
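
The 15-day rule can also be computed for a whole column at once instead of row by row. The sketch below uses made-up dates and assumes neither column has missing values; the helper name `to_date` is hypothetical.

```python
import pandas as pd

# Toy rows (made-up values) stored as floats, like the real columns.
toy = pd.DataFrame({
    "Date_received": [20160217.0, 20160319.0, 20160601.0],
    "Date": [20160220.0, 20160420.0, 20160616.0],
})

def to_date(column):
    """Hypothetical helper: 20160217.0 -> '20160217' -> Timestamp.

    Assumes the column has no missing values.
    """
    return pd.to_datetime(column.astype("int64").astype(str), format="%Y%m%d")

# Days between receiving the coupon and using it.
gap = to_date(toy["Date"]) - to_date(toy["Date_received"])

# 1 if used within 15 days (inclusive), otherwise 0.
toy["label"] = (gap <= pd.Timedelta(days=15)).astype(int)
print(toy)
```

The three toy gaps are 3, 32, and 15 days, so only the middle row falls outside the window.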

In [10]:
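def label(row):
    # Both columns are numeric after loading, so test for missing values with
    # pd.notna (a comparison against the string 'null' is always True).
    if pd.notna(row['Date']) and pd.notna(row['Date_received']):
        # Values look like 20160217.0; convert to '20160217' before parsing.
        used = pd.to_datetime(str(int(row['Date'])), format='%Y%m%d')
        received = pd.to_datetime(str(int(row['Date_received'])), format='%Y%m%d')
        if used - received <= pd.Timedelta(15, 'D'):
            return 1
    return 0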

data['label'] = data.apply(label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['label'] = data.apply(label, axis=1)


In [11]:
# Count the positive and negative samples
print(data['label'].value_counts())

label
1    57060
0    10105
Name: count, dtype: int64
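
The counts above are imbalanced, which matters when accuracy is the metric: a model that always predicts the majority class already scores the majority share. A small arithmetic sketch using the two counts printed above:

```python
# Label counts taken from the value_counts() output above.
counts = {1: 57060, 0: 10105}

total = sum(counts.values())

# Accuracy of a trivial model that always predicts the most frequent label.
majority_share = max(counts.values()) / total

print(total, round(majority_share, 4))
```

Any classifier trained on `label` would need to beat this share to be useful.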


In [12]:
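# Split the dataset: 80% training, 20% test
# Note: data.iloc[:, 0] is the User_id column, so y here holds user IDs rather than
# the label column, and X (data.iloc[:, 1:]) still contains Date and label.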
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 1:],
                                                    data.iloc[:, 0],
                                                    test_size=0.2,
                                                    random_state=3)
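
For the stated goal of predicting coupon usage, the target would be the `label` column, and columns that reveal the answer (such as `Date`) would be kept out of the features. The following is a minimal sketch on synthetic data; the feature list and the use of `stratify` are assumptions, not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned frame (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "User_id": rng.integers(1, 50, size=n),
    "Merchant_id": rng.integers(1, 20, size=n),
    "Discount_rate": rng.integers(0, 2, size=n),
    "Distance": rng.integers(0, 11, size=n).astype(float),
    "label": rng.integers(0, 2, size=n),
})

# Assumed feature set: excludes the target itself and anything derived from Date.
feature_cols = ["Merchant_id", "Discount_rate", "Distance"]
X = toy[feature_cols]
y = toy["label"]

# stratify=y keeps the positive/negative ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3, stratify=y
)
print(X_train.shape, X_test.shape)
print(y_train.value_counts())
```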

In [13]:
# Check the size and class distribution of the training samples
y_train.value_counts()

User_id
2751537    96
6641735    86
6929894    80
501441     59
2839484    56
           ..
5679946     1
2208346     1
1273834     1
3395905     1
4461556     1
Name: count, Length: 34984, dtype: int64

In [14]:
# Check the size and class distribution of the test samples
y_test.value_counts()

User_id
6641735    27
2751537    22
2839484    15
2507268    14
501441     14
           ..
4284393     1
1242446     1
5408744     1
2577930     1
89464       1
Name: count, Length: 11405, dtype: int64

In [15]:
# Initialize the decision tree classifier with a maximum depth of 6
model = DecisionTreeClassifier(max_depth = 6, random_state = 1)

In [16]:
# Train the model
model.fit(X_train, y_train)

  y_type = type_of_target(y, input_name="y")


Parameter | Value
:-|-
criterion | 'gini'
splitter | 'best'
max_depth | 6
min_samples_split | 2
min_samples_leaf | 1
min_weight_fraction_leaf | 0.0
max_features | None
random_state | 1
max_leaf_nodes | None
min_impurity_decrease | 0.0


In [17]:
# Predict with the model
y_pred = model.predict(X_test)

In [18]:
# Evaluate the model
accuracy_score(y_test, y_pred)

  type_true = type_of_target(y_true, input_name="y_true")


0.011315417256011316

In [19]:
# Switch the split criterion to entropy (this model also uses max_depth=2)
model = DecisionTreeClassifier(criterion='entropy', random_state=1, max_depth=2)
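
The two criteria differ only in how node impurity is measured: Gini impurity is 1 minus the sum of squared class shares, and entropy is the negative sum of each share times its base-2 logarithm. The sketch below computes both from class counts; the function names are hypothetical and this is not scikit-learn's internal code.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the number of samples in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# A perfectly mixed two-class node is the worst case for both measures.
print(gini([1, 1]), entropy([1, 1]))

# The label counts from this notebook, treated as a single node.
print(gini([57060, 10105]), entropy([57060, 10105]))
```

Both measures are 0 for a pure node and largest for an evenly mixed one, so they often lead to similar splits.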

In [20]:
# Train the model
model.fit(X_train, y_train)

  y_type = type_of_target(y, input_name="y")


Parameter | Value
:-|-
criterion | 'entropy'
splitter | 'best'
max_depth | 2
min_samples_split | 2
min_samples_leaf | 1
min_weight_fraction_leaf | 0.0
max_features | None
random_state | 1
max_leaf_nodes | None
min_impurity_decrease | 0.0


In [21]:
# Predict with the model
y_pred = model.predict(X_test)

In [22]:
print(y_pred)

[4931155 4931155  501441 ... 4931155 2751537 2507268]


In [23]:
# Evaluate the model
accuracy_score(y_test, y_pred)

  type_true = type_of_target(y_true, input_name="y_true")


0.0040943944018462
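
The predictions printed above are user IDs, which is why the accuracy is close to zero: the model was fit with `User_id` as its target. As a point of comparison, here is a minimal end-to-end sketch with a binary `label` target on synthetic data. The data-generating rule, the feature names, and the depth are all assumptions, so the scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (made up): users close to the store redeem more often.
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    "Discount_type": rng.integers(0, 2, size=n),
    "Distance": rng.integers(0, 11, size=n).astype(float),
})
clean_label = (toy["Distance"] <= 3).astype(int)
flip = rng.random(n) < 0.1  # flip 10% of the labels as noise
toy["label"] = np.where(flip, 1 - clean_label, clean_label)

X_train, X_test, y_train, y_test = train_test_split(
    toy[["Discount_type", "Distance"]], toy["label"],
    test_size=0.2, random_state=3, stratify=toy["label"],
)

# Fit one tree per criterion and record its test accuracy.
scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=1)
    tree.fit(X_train, y_train)
    scores[criterion] = accuracy_score(y_test, tree.predict(X_test))

# Accuracy of always guessing the most frequent class in the test set.
baseline = y_test.value_counts(normalize=True).max()
print(scores, baseline)
```

Comparing each score with `baseline` shows whether the tree learned anything beyond the class ratio.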

Beyond this exercise, you are encouraged to explore the data on your own. Dataset:

Link: https://pan.baidu.com/s/1fNC6Ltsc5b8DrkI1vdtYog?pwd=ur1f Access code: ur1f
