# Data Mining in Practice: How Store Promotions Affect Customer Sentiment

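## Abstract


## Background

Through a classmate's introduction, my initial task was to help a student from outside the computer science school (referred to below as Student A) set up an environment, run the project code, and untangle its logic. After taking over the project, however, I found that most of the key business code was missing. Fortunately the data had already been labeled, so I wrote new code based on the existing data format and Student A's needs, organized the data, and carried out the analysis. This article applies what was taught in class and records part of the experiments.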

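## Task Description

### Requirement 1
Use the existing labeled data and several models to label the unlabeled data along multiple dimensions.

This requirement resembles semi-supervised learning, but with only a single train-then-label round. It is simpler than full semi-supervised learning, but it probably exploits the smoothness assumption less well, because only the human-labeled samples are used for training and the model-labeled samples are never fed back.

The data can be extracted from the raw data (described in detail below). It consists of user reviews of specific products together with scores along three dimensions (function, appearance, service), each taking the value 1 (positive), 0 (neutral), or -1 (negative). The reviews come from the raw data; the per-dimension scores were labeled by hand and cross-checked by Student A and others.

At Student A's request, the plan is to use the labeled data in two steps: assign sentences to dimensions with BERT plus an MLP classifier, then score sentiment (treated as a classification task) with naive Bayes (NB), support vector machine (SVM), and random forest (RF) on bag-of-words features. In terms of model choice, the task could in theory be treated as a single 9-class problem (3 dimensions × 3 scores) and solved in one step with BERT+MLP; alternatively, the bag-of-words model could be dropped and BERT used alone for feature extraction. The current design was chosen mostly for reasons other than the algorithms themselves.

### Requirement 2

Extract other useful information from the existing data, for example the overall (dimension-free) score for comparison with the per-dimension scores, and split and aggregate the data by activity time and product ID for further analysis.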

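## Exploratory Data Analysis

The raw data has two parts: product reviews and activity times. The product reviews are JSON strings embedded in a CSV file, and the activity times are stored as xlsx. Their structure is shown below: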

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
allcomment_path = "./data/comment3.csv"
comment_sheet = pd.read_csv(allcomment_path)
comment_sheet.head(2)

Unnamed: 0,product_name,productId,afterCount,goodCount,generalCount,poorCount,comment_num,product_prices,hotComments,good_comment,middle_comment,poor_comment,product_comments
0,NEObella韩国进口彩色隐形眼镜美瞳半年抛1片装14.0mm小直径 向日葵425度,100013000000,0,0,0,0,1,"{""J_100013000000"": {""p"": ""-1.00"", ""op"": ""406.3...",[],[],[],[],{}
1,箭牌卫浴（ARROW）洗手盆卫生间洗脸盆柜组合洗漱台组合简约面盆实木浴室柜组合,24236949315,73,2000,10,74,3000,"{""J_24236949315"": {""p"": ""-1.00"", ""op"": ""5699.0...","[{""款式新颖"": 5}, {""风格简约"": 4}, {""吸力强劲"": 2}, {""显得高端...","[{""nickname"": ""j***Y"", ""score"": 5, ""content"": ...","[{""nickname"": ""d***g"", ""score"": 3, ""content"": ...","[{""nickname"": ""龙***痔"", ""score"": 1, ""content"": ...","{""款式新颖"": [{""nickname"": ""y***z"", ""score"": 5, ""c..."


In [3]:

activity_path = "./data/activity.xlsx"
activity_sheet = pd.read_excel(activity_path)
activity_sheet.head(2)

Unnamed: 0,品牌名称,sku_id,活动时间,商品名称
0,福库,49816790708,2020-01-02 14:58:53,福库（CUCKOO）电饭煲 IH电磁加热 多功能 新款高压电饭锅CRP-HR0899SR 3.8L
1,福库,50112986733,2020-01-02 14:59:19,【福库旗舰店】福库电饭煲 原装进口IH加热 多功能新款高压电饭锅 3.8升


Summary statistics of the data:

In [4]:
comment_sheet['comment_num'].describe()

count    6880.000000
mean     1000.670058
std      1895.697049
min         0.000000
25%         4.000000
50%       100.000000
75%      1000.000000
max      9000.000000
Name: comment_num, dtype: float64

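From this output we have 6,880 products with about 1,000 reviews each on average. The distribution is heavily skewed, though: the median is only 100. These reviews need to be extracted.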

In [5]:
activity_sheet['活动时间'].describe()


count                   28889
unique                   9005
top       2020-08-31 17:11:56
freq                      129
first     2020-01-02 14:58:53
last      2020-11-26 18:59:04
Name: 活动时间, dtype: object

In [6]:
len(activity_sheet['sku_id'].unique()) # check that product IDs are not repeated

28889

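The activity times range from `2020-01-02 14:58:53` to `2020-11-26 18:59:04`. The `dtype: object` printed above is the dtype of the summary returned by `describe()`, not necessarily of the column itself; the `first` and `last` rows suggest that pandas had already parsed the column as datetimes when reading the Excel file. (I initially read it as an object column and only came back to tidy up this write-up after the experiments; had it really held strings, converting it to a date type up front would have simplified later steps.) There are 28,889 rows, one per product ID, but only 9,005 distinct timestamps: many products share exactly the same activity time, which can be explained as products of one store joining the same promotion.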
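
As a minimal sketch of the up-front date conversion mentioned above: check whether a column is already a datetime type and convert it with `pd.to_datetime` if it is not. The tiny frame is made up for illustration; only the column name `活动时间` comes from the data.

```python
import pandas as pd

# Toy stand-in for activity_sheet; the values are illustrative only.
toy_activity = pd.DataFrame(
    {"活动时间": ["2020-01-02 14:58:53", "2020-11-26 18:59:04"]}
)

# Convert once, up front, if the column is not already a datetime type.
if not pd.api.types.is_datetime64_any_dtype(toy_activity["活动时间"]):
    toy_activity["活动时间"] = pd.to_datetime(toy_activity["活动时间"])

# Datetime columns support arithmetic such as "90 days before the activity".
window_start = toy_activity["活动时间"] - pd.Timedelta(days=90)
print(toy_activity["活动时间"].dtype)
print(window_start.iloc[0])
```

With a real datetime column, range filters and before/after comparisons no longer depend on string formatting.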

Note that no product ID is repeated; my guess is that only the most recent activity time is recorded for each product.

For the task at hand the data types of the other columns matter little, so I skip checking their dtypes.

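## Data Preparation

This task has no standalone data cleaning or transformation stage, but parts of both are implemented along the way, for example filtering and tagging reviews within a few months before and after the activity time while preparing the data.

Unlike data cleaning, whose purpose is to correct erroneous samples, the main work here is to remove redundant data from the raw data and to convert it into a usable form.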

In [7]:
# Merge the review table with the activity table
comment_activity_sheet = pd.merge(
    comment_sheet,
    activity_sheet[["sku_id", "活动时间"]],
    left_on="productId",
    right_on="sku_id",
)
comment_activity_sheet.head(2)

Unnamed: 0,product_name,productId,afterCount,goodCount,generalCount,poorCount,comment_num,product_prices,hotComments,good_comment,middle_comment,poor_comment,product_comments,sku_id,活动时间
0,箭牌卫浴（ARROW）洗手盆卫生间洗脸盆柜组合洗漱台组合简约面盆实木浴室柜组合,24236949315,73,2000,10,74,3000,"{""J_24236949315"": {""p"": ""-1.00"", ""op"": ""5699.0...","[{""款式新颖"": 5}, {""风格简约"": 4}, {""吸力强劲"": 2}, {""显得高端...","[{""nickname"": ""j***Y"", ""score"": 5, ""content"": ...","[{""nickname"": ""d***g"", ""score"": 3, ""content"": ...","[{""nickname"": ""龙***痔"", ""score"": 1, ""content"": ...","{""款式新颖"": [{""nickname"": ""y***z"", ""score"": 5, ""c...",24236949315,2020-08-05 18:34:07
1,香奈儿（Chanel）口红唇膏可可小姐水亮/丝绒系列口红,62383970544,5,800,21,20,4000,"{""J_62383970544"": {""p"": ""-1.00"", ""op"": ""158.00""}}","[{""提亮肤色"": 1}, {""精美漂亮"": 1}, {""不油不干"": 1}]","[{""nickname"": ""jd_136638hgrq"", ""score"": 5, ""co...","[{""nickname"": ""f***0"", ""score"": 3, ""content"": ...","[{""nickname"": ""十***天"", ""score"": 1, ""content"": ...","{""提亮肤色"": [{""nickname"": ""小***蔓"", ""score"": 5, ""c...",62383970544,2020-07-09 22:37:19
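
`pd.merge` performs an inner join by default, so products without a matching `sku_id` silently drop out of the merged table. The following is a hedged sketch of how one might count what is lost, using made-up toy frames (only the column names come from the data):

```python
import pandas as pd

toy_comments = pd.DataFrame({"productId": [1, 2, 3], "comment_num": [10, 0, 5]})
toy_activity = pd.DataFrame(
    {"sku_id": [2, 3, 4], "活动时间": ["2020-03-01", "2020-04-01", "2020-05-01"]}
)

# An outer join with indicator=True adds a `_merge` column that records
# whether each row came from the left table, the right table, or both.
checked = pd.merge(
    toy_comments,
    toy_activity,
    left_on="productId",
    right_on="sku_id",
    how="outer",
    indicator=True,
)
match_counts = checked["_merge"].value_counts()
print(match_counts)

# Rows tagged "both" are exactly the ones the default inner join keeps.
inner = pd.merge(toy_comments, toy_activity, left_on="productId", right_on="sku_id")
print(len(inner))
```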


In [8]:
# Keep rows whose activity time falls between February and August
after_february = comment_activity_sheet[
    comment_activity_sheet["活动时间"]
    > "2020-02-01 00:00:00"
]

before_august = after_february[
    after_february["活动时间"] < "2020-08-01 00:00:00"
]


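Reviews are stored in the product-comment, good-review, neutral-review, and bad-review fields, and have to be extracted from each separately: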

In [9]:
# Select product comments from the 90 days before the activity time
from main_tools import get_comments

product_comment=before_august[['活动时间','product_comments']].apply(get_comments,axis=1)


In [10]:
# Select good, neutral, and bad reviews from the 90 days before the activity time
from main_tools import get_comments_

# TODO: the three calls below could be merged
poor_comment=before_august[['活动时间','poor_comment']].apply(get_comments_,axis=1)
middle_comment=before_august[['活动时间','middle_comment']].apply(get_comments_,axis=1)
good_comment=before_august[['活动时间','good_comment']].apply(get_comments_,axis=1)

In [11]:
# Combine the reviews obtained in the steps above
comment=poor_comment+middle_comment+good_comment+product_comment
comment.values[0][:4]

[('2020-04-29 19:44:36', 1, '产品质感：感觉是假的，和正品的不一样'),
 ('2020-04-18 21:52:42', 1, '差的要命'),
 ('2020-06-27 12:20:38', 1, '不是正品'),
 ('2020-06-27 12:20:38', 1, '不是正品')]

In [12]:
# Deduplicate
from main_tools import deduplicate
deduplicate_comment=comment.map(deduplicate)

137  to  136
78  to  69
30  to  30
119  to  119
56  to  51
0  to  0
26  to  24
39  to  28
20  to  20
8  to  2
10  to  5
63  to  63
10  to  9
29  to  2
210  to  203
192  to  179
90  to  9
135  to  103
12  to  12
9  to  8
9  to  8
0  to  0
238  to  236
121  to  104
34  to  29
22  to  22
0  to  0
21  to  11
7  to  7
2  to  2
185  to  174
206  to  169
67  to  62
41  to  39
201  to  2
10  to  1
13  to  12
1  to  1
11  to  11
1  to  1
12  to  3
0  to  0
20  to  18
437  to  399
22  to  19
0  to  0
0  to  0
30  to  30
0  to  0
0  to  0
29  to  26
11  to  9
34  to  23
5  to  5
9  to  9
4  to  4
11  to  11
61  to  44
3  to  3
0  to  0
0  to  0
240  to  26
1  to  1
0  to  0
241  to  232
30  to  27
79  to  79
538  to  520
232  to  197
189  to  187
119  to  119
29  to  27
440  to  343
308  to  301
349  to  340
390  to  385
124  to  123
75  to  70
143  to  135
224  to  200
599  to  551
72  to  69
1  to  1
107  to  100
160  to  156
148  to  96
3  to  3
3  to  3
52  to  51
127  to  116
3  to  3
... (output truncated)
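
The implementation of `deduplicate` in `main_tools` is not shown. Judging from the `n  to  m` log lines and the repeated tuple in the sample output earlier, it presumably removes duplicate `(time, score, content)` tuples. A minimal sketch under that assumption (the name `deduplicate_sketch` is hypothetical, and the real function may use a different criterion):

```python
def deduplicate_sketch(comments):
    """Drop exact duplicate (time, score, content) tuples, keeping first occurrences in order."""
    seen = set()
    unique = []
    for item in comments:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    # Mimic the "before  to  after" log lines seen in the notebook output.
    print(len(comments), " to ", len(unique))
    return unique


sample = [
    ("2020-04-18 21:52:42", 1, "差的要命"),
    ("2020-06-27 12:20:38", 1, "不是正品"),
    ("2020-06-27 12:20:38", 1, "不是正品"),
]
deduped = deduplicate_sketch(sample)
```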

In [13]:
from dateutil.parser import parse
# Work on an explicit copy so the column assignment does not trigger SettingWithCopyWarning
before_august = before_august.copy()
before_august["评论明细"] = deduplicate_comment

# Drop the columns that are no longer needed.
# The TODO above was left unfixed out of laziness; the commented-out line below is an earlier, needlessly roundabout version.
# thin_before_august = pd.merge(before_august[['productId','评论明细']], activity_sheet[['sku_id', '活动时间']], left_on='productId', right_on='sku_id')
thin_before_august = before_august[['productId','评论明细','活动时间']].copy()
thin_before_august.head(2)


Unnamed: 0,productId,评论明细,活动时间
1,62383970544,"[(2020-07-29 14:17:21, 5, 滋润效果：很好), (2020-06-0...",2020-07-09 22:37:19
2,57328615327,"[(2020-09-06 19:56:05, 3, 颜色太深了), (2020-08-16 ...",2020-07-09 22:37:19


In [14]:

# Split each product's reviews into those posted before and after the activity time
thin_before_august = thin_before_august.copy()  # explicit copy avoids SettingWithCopyWarning
thin_before_august['活动前评论'] = thin_before_august.apply(lambda x: [c for c in x['评论明细'] if parse(c[0]) < x['活动时间']], axis=1)
thin_before_august['活动后评论'] = thin_before_august.apply(lambda x: [c for c in x['评论明细'] if parse(c[0]) > x['活动时间']], axis=1)

thin_before_august.head(2)


Unnamed: 0,productId,评论明细,活动时间,活动前评论,活动后评论
1,62383970544,"[(2020-07-29 14:17:21, 5, 滋润效果：很好), (2020-06-0...",2020-07-09 22:37:19,"[(2020-06-02 06:31:19, 5, 很好看咯), (2020-05-04 1...","[(2020-07-29 14:17:21, 5, 滋润效果：很好), (2020-07-2..."
2,57328615327,"[(2020-09-06 19:56:05, 3, 颜色太深了), (2020-08-16 ...",2020-07-09 22:37:19,"[(2020-05-22 23:07:15, 5, 产品质感：好), (2020-06-03...","[(2020-09-06 19:56:05, 3, 颜色太深了), (2020-08-16 ..."


In [15]:
# Flatten the data into one row per review
data4train=[]
for value in thin_before_august.values:
    for __comment in value[3]:
        data4train.append([value[0], value[2], __comment[0], __comment[2], '活动前', __comment[1]])
    for __comment in value[4]:
        data4train.append([value[0], value[2], __comment[0], __comment[2], '活动后', __comment[1]])


data4train_sheet = pd.DataFrame(data4train, columns=['productId', '活动时间', '评论时间', '评论内容', '类型', '评分'])
data4train_sheet.head(2)

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,评分
0,62383970544,2020-07-09 22:37:19,2020-06-02 06:31:19,很好看咯,活动前,5
1,62383970544,2020-07-09 22:37:19,2020-05-04 10:53:37,产品颜色：很显白,活动前,5


In [16]:
import datetime
# Save the data with the original scores
now_time = datetime.datetime.now().strftime('%Y-%m-%d')
data4train_sheet.to_excel(f'data/score_comment_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])

In [17]:
# Save the data to be labeled (the manually labeled data also comes from here)
data4label=data4train_sheet[['productId', '活动时间', '评论时间', '评论内容', '类型']].copy()
data4label['美学维度'] = ''
data4label['功能维度'] = ''
data4label['服务维度'] = ''
data4label.to_excel(f'data/comment_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])


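Later, Student A asked for machine labeling of the full year's data. The principle and the output format are the same, so the steps above are wrapped in a function and the walkthrough is omitted.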

In [18]:
from main_data_prepare import prepare_data
all_comment=prepare_data(comment_activity_sheet)
all_comment.head(2)

618  to  598
292  to  288
196  to  177
39  to  38
138  to  138
67  to  57
222  to  221
378  to  365
49  to  37
24  to  24
24  to  6
20  to  9
222  to  205
38  to  35
446  to  35
288  to  17
30  to  6
220  to  213
2232  to  1919
570  to  473
36  to  32
736  to  691
544  to  24
105  to  14
145  to  112
211  to  209
16  to  15
41  to  35
314  to  23
32  to  23
343  to  337
277  to  247
274  to  231
417  to  411
246  to  242
73  to  60
923  to  900
207  to  201
195  to  182
210  to  172
235  to  225
127  to  105
2009  to  21
115  to  21
24  to  20
12  to  11
33  to  33
194  to  189
217  to  36
13  to  7
70  to  62
1387  to  1308
40  to  37
84  to  80
317  to  23
4  to  3
43  to  11
36  to  6
27  to  23
209  to  152
14  to  10
65  to  59
65  to  15
130  to  111
2  to  2
0  to  0
89  to  83
0  to  0
0  to  0
273  to  262
148  to  137
132  to  113
89  to  76
1163  to  1116
86  to  86
885  to  880
113  to  91
21  to  20
76  to  57
48  to  39
1  to  1
3  to  3
1061  to  51
45  to  45
0  to  0
... (output truncated)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  thin_comment['活动前评论'] = thin_comment.apply(lambda x: [c for c in x['评论明细'] if parse(c[0]) < x['活动时间']], axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  thin_comment['活动后评论'] = thin_comment.apply(lambda x: [c for c in x['评论明细'] if parse(c[0]) > x['活动时间']], axis=1)


Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,评分
0,24236949315,2020-08-05 18:34:07,2018-11-30 16:48:23,客服特别好，东西都挺齐全的，特别棒，整体都非常好看，很满意。,活动前,5
1,24236949315,2020-08-05 18:34:07,2018-05-29 13:47:24,第一次来这家店非常满意，宝贝收到物流快，特别在乎细节和挑剔的，我也觉得质量非常好，放心,活动前,5


In [19]:
pd.to_datetime(all_comment['评论时间']).describe()


count                 1258364
unique                 947681
top       2021-01-13 13:12:20
freq                       43
first     2012-11-10 14:52:12
last      2021-03-03 15:37:34
Name: 评论时间, dtype: object

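The data set has 1,258,364 rows, which exceeds the 1,048,576-row limit of an xlsx worksheet, so it cannot be saved to Excel directly. Some data has to be filtered out to bring the sample count below one million.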
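
An xlsx worksheet holds at most 1,048,576 rows, header included. The notebook handles this by filtering on the review year; an alternative, sketched below with a hypothetical helper `split_for_excel`, is to split a large frame into chunks that each fit on one sheet.

```python
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # per-worksheet row limit of the xlsx format


def split_for_excel(frame, max_rows=EXCEL_MAX_ROWS - 1):
    """Yield consecutive slices of `frame`, leaving one row per sheet for the header."""
    for start in range(0, len(frame), max_rows):
        yield frame.iloc[start:start + max_rows]


# Toy demonstration with a tiny limit so the behavior is visible.
toy = pd.DataFrame({"评论内容": [f"comment {i}" for i in range(7)]})
chunks = list(split_for_excel(toy, max_rows=3))
print([len(chunk) for chunk in chunks])
```

Each chunk could then be written to its own sheet or file; saving as CSV instead avoids the limit altogether.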

In [20]:
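# Keep reviews from 2019 and 2020. A single combined mask avoids the warning raised by chained boolean indexing.
# 评论时间 holds 'YYYY-MM-DD HH:MM:SS' strings, so string comparison follows chronological order.
in_range = (all_comment['评论时间'] > '2019-01-01 00:00:00') & (all_comment['评论时间'] < '2021-01-01 00:00:00')
comment19_20 = all_comment[in_range]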


In [21]:
comment19_20.to_excel(f'data/all_score_comment_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])
# all_comment.to_csv(f'data/all_score_comment_{now_time}.txt', sep='\t')

In [22]:

# Copy the selected columns so the new label columns are added to an independent frame
data4label19_20=comment19_20[['productId', '活动时间', '评论时间', '评论内容', '类型']].copy()
data4label19_20['美学维度'] = ''
data4label19_20['功能维度'] = ''
data4label19_20['服务维度'] = ''
data4label19_20.to_excel(f'data/all_comment_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])

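The data labeled by Student A and colleagues follows the format of `data4label` shown above, where each score is 1 (positive), 0 (neutral), or -1 (negative).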

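## Feature Engineering and Model Training

After the processing in the previous section, the data is easy to work with. In this section we extract features with a unigram bag of words, a bigram bag of words (unigrams plus bigrams), and BERT, and train the models.



### Bag-of-Words Features

Extracting bag-of-words features requires segmenting each sentence into words and removing function words (stop words). The remaining words seen in the training set form the vocabulary, and each sentence is represented by its vector of word counts over that vocabulary.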
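
A small sketch of the bag-of-words step on already-segmented text, using scikit-learn's `CountVectorizer`. One detail worth knowing is that its default `token_pattern` only keeps tokens of two or more characters, so single-character Chinese words such as "高" are dropped unless the pattern is changed. The sentences below are toy examples, and the relaxed pattern is an illustration rather than what the notebook uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Space-separated tokens, as produced by word segmentation plus stop-word removal.
docs = ["颜值 高", "款式 好看", "外观 好看"]

# The default pattern r"(?u)\b\w\w+\b" ignores one-character tokens.
default_bow = CountVectorizer().fit(docs)
# A relaxed pattern keeps them.
relaxed_bow = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

print(sorted(default_bow.vocabulary_))
print(sorted(relaxed_bow.vocabulary_))

# Each sentence becomes a row of word counts over the learned vocabulary.
row = relaxed_bow.transform(["外观 好看 好看"]).toarray()[0]
print(row)
```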

In [23]:
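# Load the labeled data
train_data_path='./data/for_modeling.xlsx'
train_data_sheets=pd.read_excel(train_data_path,sheet_name=None,index_col=0)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two sheets the same way
train_data_sheet=pd.concat([train_data_sheets['Sheet1'],train_data_sheets['Sheet2']])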
train_data_sheet.head(2)

Unnamed: 0_level_0,活动时间,评论时间,评论内容,类型,总体情感得分,美学维度,情感得分,编号,功能维度,情感得分.1,编号.1,服务维度,情感得分.2,编号.2
productId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
64986275944,2020-07-31 18:56:24,2020-06-01 00:18:05,颜值很高，穿着很舒服，重量轻，很轻便,活动前,1,颜值很高,1.0,1.0,穿着很舒服，重量轻，很轻便,1.0,1.0,,,
64986275944,2020-07-31 18:56:24,2020-09-09 08:51:21,鞋样子很好看，号码也很足，穿着非常轻很舒服，快递也很快，在京东买东西就是放心,活动后,1,鞋样子很好看,1.0,2.0,号码也很足，穿着非常轻很舒服,1.0,2.0,快递也很快，在京东买东西就是放心,1.0,1.0


In [24]:
from tools import *
# Split the data by dimension (a data-transformation step)
appearance, function, service = split_dimentions(
    train_data_sheet
)
appearance.head(2)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.635 seconds.
Prefix dict has been built successfully.
W0307 08:26:42.855365 28183 init.cc:157] AVX is available, Please re-compile on local machine
Paddle enabled successfully......


Unnamed: 0_level_0,美学维度,情感得分,编号,评论内容
productId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
64986275944,颜值很高,1.0,1.0,颜值很高，穿着很舒服，重量轻，很轻便
64986275944,鞋样子很好看,1.0,2.0,鞋样子很好看，号码也很足，穿着非常轻很舒服，快递也很快，在京东买东西就是放心


In [25]:
# Load the stop-word list
stopwords_list=get_stopwords_list('cn_stopwords.txt')

In [26]:
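# Build the training data (too long to show here, so the details are wrapped in a helper)
# Pipeline: extract the text from the table, segment it with jieba, then remove stop words.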
appearance_x_y=generate_train_data(appearance,stopwords_list)
function_x_y=generate_train_data(function,stopwords_list)
service_x_y=generate_train_data(service,stopwords_list)
appearance_x_y[0][0],appearance_x_y[1][0]

('颜值 高', 1.0)

In [27]:
# Split into training and test sets for model evaluation
from sklearn.model_selection import train_test_split
test_size=0.1
appearance_train_x,appearance_test_x,appearance_train_y,appearance_test_y=train_test_split(*appearance_x_y,test_size=test_size)
function_train_x,function_test_x,function_train_y,function_test_y=train_test_split(*function_x_y,test_size=test_size)
service_train_x,service_test_x,service_train_y,service_test_y=train_test_split(*service_x_y,test_size=test_size)
appearance_train_x[0:10],appearance_train_y[0:10]

(['外观 款式 简洁 大方 放在 桌上 立马 觉得 书桌 档次 读书 环境 立马 营造',
  '外形 好看 喜欢',
  '产品 外观 不咋地',
  '灯罩 白 有点 灰色',
  '款式 简单 大方',
  '满意 外观 漂亮',
  '款式 好看',
  '外观 好看',
  '家里 装饰 搭配 颜色 简单 大气',
  '做工 精致 实物 图片 漂亮'],
 [1.0, 1.0, -1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

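The segmentation results are not great. In one run, for instance, a sample was reduced to the single token "颜色" ("color") while carrying a positive label, so the key information had clearly been lost. (The split is random, so that sample is not among those printed above.) Stop-word removal also strips degree words: the first sample, "颜值很高", becomes "颜值 高". The pipeline could be fine-tuned, but the gain would be small compared with simply using BERT.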
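
Because `train_test_split` shuffles at random, the samples printed above change from run to run. A sketch of a reproducible, class-balanced split on toy data; `random_state=42` and `stratify` are illustrative choices, not settings used in the notebook:

```python
from sklearn.model_selection import train_test_split

toy_x = [f"sample {i}" for i in range(20)]
toy_y = [1.0] * 10 + [0.0] * 5 + [-1.0] * 5

# random_state fixes the shuffle; stratify keeps label proportions similar in both parts.
split_a = train_test_split(toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y)

train_x, test_x, train_y, test_y = split_a
print(test_x, test_y)
```

With a fixed seed, any example quoted in the text stays visible in the output on every rerun.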

In [28]:

from sklearn.feature_extraction.text import CountVectorizer
appearance_vector1 = CountVectorizer() # unigram bag of words
function_vector1 = CountVectorizer() # unigram bag of words
service_vector1 = CountVectorizer() # unigram bag of words
# ngram_range is documented as a tuple; (1, 2) means unigrams plus bigrams
appearance_vector2 = CountVectorizer(ngram_range=(1,2)) # bigram bag of words
function_vector2 = CountVectorizer(ngram_range=(1,2)) # bigram bag of words
service_vector2 = CountVectorizer(ngram_range=(1,2)) # bigram bag of words

In [29]:
# Vectorize (unigram bag of words; yields a sparse sentence-by-term count matrix)
appearance_train_n_x_1=appearance_vector1.fit_transform(appearance_train_x)
function_train_n_x_1=function_vector1.fit_transform(function_train_x)
service_train_n_x_1=service_vector1.fit_transform(service_train_x)

appearance_test_n_x_1=appearance_vector1.transform(appearance_test_x)
function_test_n_x_1=function_vector1.transform(function_test_x)
service_test_n_x_1=service_vector1.transform(service_test_x)


In [30]:
# Vectorize (bigram bag of words; yields a sparse sentence-by-term count matrix)
appearance_train_n_x_2=appearance_vector2.fit_transform(appearance_train_x)
function_train_n_x_2=function_vector2.fit_transform(function_train_x)
service_train_n_x_2=service_vector2.fit_transform(service_train_x)

appearance_test_n_x_2=appearance_vector2.transform(appearance_test_x)
function_test_n_x_2=function_vector2.transform(function_test_x)
service_test_n_x_2=service_vector2.transform(service_test_x)


In [31]:
# Train the models (unigram bag of words)
from sklearn.naive_bayes import MultinomialNB
appearance_classifier_1=MultinomialNB(alpha=0.1).fit(appearance_train_n_x_1,appearance_train_y)
function_classifier_1=MultinomialNB(alpha=0.01).fit(function_train_n_x_1,function_train_y)
service_classifier_1=MultinomialNB(alpha=0.01).fit(service_train_n_x_1,service_train_y)

In [32]:
# Train the models (bigram bag of words)
appearance_classifier_2=MultinomialNB(alpha=0.1).fit(appearance_train_n_x_2,appearance_train_y)
function_classifier_2=MultinomialNB(alpha=0.01).fit(function_train_n_x_2,function_train_y)
service_classifier_2=MultinomialNB(alpha=0.01).fit(service_train_n_x_2,service_train_y)

In [33]:
# Predict on the training and test sets
appearance_train_pre_1=appearance_classifier_1.predict(appearance_train_n_x_1)
function_train_pre_1=function_classifier_1.predict(function_train_n_x_1)
service_train_pre_1=service_classifier_1.predict(service_train_n_x_1)

appearance_test_pre_1=appearance_classifier_1.predict(appearance_test_n_x_1)
function_test_pre_1=function_classifier_1.predict(function_test_n_x_1)
service_test_pre_1=service_classifier_1.predict(service_test_n_x_1)

appearance_train_pre_2=appearance_classifier_2.predict(appearance_train_n_x_2)
function_train_pre_2=function_classifier_2.predict(function_train_n_x_2)
service_train_pre_2=service_classifier_2.predict(service_train_n_x_2)

appearance_test_pre_2=appearance_classifier_2.predict(appearance_test_n_x_2)
function_test_pre_2=function_classifier_2.predict(function_test_n_x_2)
service_test_pre_2=service_classifier_2.predict(service_test_n_x_2)

In [34]:
from sklearn import metrics
appearance_train_pre=appearance_train_pre_1
function_train_pre=function_train_pre_1
service_train_pre=service_train_pre_1
print('unigram appearance train accuracy:',metrics.accuracy_score(appearance_train_y,appearance_train_pre))
print('unigram function train accuracy:',metrics.accuracy_score(function_train_y,function_train_pre))
print('unigram service train accuracy:',metrics.accuracy_score(service_train_y,service_train_pre))
print('unigram appearance train precision, recall, F1:',metrics.precision_recall_fscore_support(appearance_train_y,appearance_train_pre,average='micro'))
print('unigram function train precision, recall, F1:',metrics.precision_recall_fscore_support(function_train_y,function_train_pre,average='micro'))
print('unigram service train precision, recall, F1:',metrics.precision_recall_fscore_support(service_train_y,service_train_pre,average='micro'))
appearance_train_pre=appearance_train_pre_2
function_train_pre=function_train_pre_2
service_train_pre=service_train_pre_2
print('bigram appearance train accuracy:',metrics.accuracy_score(appearance_train_y,appearance_train_pre))
print('bigram function train accuracy:',metrics.accuracy_score(function_train_y,function_train_pre))
print('bigram service train accuracy:',metrics.accuracy_score(service_train_y,service_train_pre))
print('bigram appearance train precision, recall, F1:',metrics.precision_recall_fscore_support(appearance_train_y,appearance_train_pre,average='micro'))
print('bigram function train precision, recall, F1:',metrics.precision_recall_fscore_support(function_train_y,function_train_pre,average='micro'))
print('bigram service train precision, recall, F1:',metrics.precision_recall_fscore_support(service_train_y,service_train_pre,average='micro'))

unigram appearance train accuracy: 0.9663865546218487
unigram function train accuracy: 0.9535714285714286
unigram service train accuracy: 0.9526671675432006
unigram appearance train precision, recall, F1: (0.9663865546218487, 0.9663865546218487, 0.9663865546218487, None)
unigram function train precision, recall, F1: (0.9535714285714286, 0.9535714285714286, 0.9535714285714286, None)
unigram service train precision, recall, F1: (0.9526671675432006, 0.9526671675432006, 0.9526671675432006, None)
bigram appearance train accuracy: 0.9710550887021475
bigram function train accuracy: 0.975
bigram service train accuracy: 0.97145003756574
bigram appearance train precision, recall, F1: (0.9710550887021475, 0.9710550887021475, 0.9710550887021475, None)
bigram function train precision, recall, F1: (0.975, 0.975, 0.975, None)
bigram service train precision, recall, F1: (0.97145003756574, 0.97145003756574, 0.97145003756574, None)


In [35]:
appearance_test_pre=appearance_test_pre_1
function_test_pre=function_test_pre_1
service_test_pre=service_test_pre_1
print('unigram appearance test accuracy:',metrics.accuracy_score(appearance_test_y,appearance_test_pre))
print('unigram function test accuracy:',metrics.accuracy_score(function_test_y,function_test_pre))
print('unigram service test accuracy:',metrics.accuracy_score(service_test_y,service_test_pre))
print('unigram appearance test precision, recall, F1:',metrics.precision_recall_fscore_support(appearance_test_y,appearance_test_pre,average='micro'))
print('unigram function test precision, recall, F1:',metrics.precision_recall_fscore_support(function_test_y,function_test_pre,average='micro'))
print('unigram service test precision, recall, F1:',metrics.precision_recall_fscore_support(service_test_y,service_test_pre,average='micro'))
appearance_test_pre=appearance_test_pre_2
function_test_pre=function_test_pre_2
service_test_pre=service_test_pre_2
print('bigram appearance test accuracy:',metrics.accuracy_score(appearance_test_y,appearance_test_pre))
print('bigram function test accuracy:',metrics.accuracy_score(function_test_y,function_test_pre))
print('bigram service test accuracy:',metrics.accuracy_score(service_test_y,service_test_pre))
print('bigram appearance test precision, recall, F1:',metrics.precision_recall_fscore_support(appearance_test_y,appearance_test_pre,average='micro'))
print('bigram function test precision, recall, F1:',metrics.precision_recall_fscore_support(function_test_y,function_test_pre,average='micro'))
print('bigram service test precision, recall, F1:',metrics.precision_recall_fscore_support(service_test_y,service_test_pre,average='micro'))

unigram appearance test accuracy: 0.95
unigram function test accuracy: 0.8974358974358975
unigram service test accuracy: 0.8918918918918919
unigram appearance test precision, recall, F1: (0.95, 0.95, 0.9500000000000001, None)
unigram function test precision, recall, F1: (0.8974358974358975, 0.8974358974358975, 0.8974358974358975, None)
unigram service test precision, recall, F1: (0.8918918918918919, 0.8918918918918919, 0.8918918918918919, None)
bigram appearance test accuracy: 0.9333333333333333
bigram function test accuracy: 0.8942307692307693
bigram service test accuracy: 0.8986486486486487
bigram appearance test precision, recall, F1: (0.9333333333333333, 0.9333333333333333, 0.9333333333333333, None)
bigram function test precision, recall, F1: (0.8942307692307693, 0.8942307692307693, 0.8942307692307693, None)
bigram service test precision, recall, F1: (0.8986486486486487, 0.8986486486486487, 0.8986486486486488, None)


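On both the training and test sets every metric is suspiciously high; perhaps the models really do perform well. Keep in mind that with `average='micro'` on a single-label multiclass task, precision, recall, and F1 all equal accuracy, so these rows carry no information beyond accuracy and say nothing about per-class performance. Note also that the bigram model adds complexity without a clear gain: on the test set it is slightly better on service (0.899 vs 0.892) but slightly worse on appearance and function. In the end we keep only the unigram models.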
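
With `average='micro'`, precision, recall, and F1 on a single-label multiclass problem all reduce to accuracy, which is why the numbers within each row above are identical. The sketch below uses made-up labels to show how macro averaging can tell a different story when classes are imbalanced:

```python
from sklearn import metrics

# Toy, imbalanced labels: mostly positive, and the rare negative class is never predicted.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, -1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

accuracy = metrics.accuracy_score(y_true, y_pred)
micro = metrics.precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
macro = metrics.precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print("accuracy:", accuracy)
print("micro P/R/F1:", micro[:3])
print("macro P/R/F1:", macro[:3])
```

Reporting macro-averaged scores or a per-class breakdown alongside accuracy would show whether the neutral and negative classes are actually being learned.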

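### BERT Embeddings

Last comes the dimension-assignment model, which assigns each sentence of a review to one of the dimensions for analysis. We took Google's pretrained model and trained an MLP layer on top of BERT for classification; the code is in BertPractice/BertPractice.py. It was adapted from an open-source project, as detailed in the project README. The data preprocessing is omitted here; see dataset_prepare.ipynb.

For dimension assignment we trained for 3 epochs. After each epoch the model was evaluated on the validation set (which took no part in training, so it also serves as a test set). The checkpoint file names below record the loss, accuracy, and F1 score, in that order: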
```
13:34 epoch0_0.12525090887337118_0.9603448275862069_0.9611486486486487.full
13:35 epoch1_0.11325939101594928_0.9637931034482758_0.964527027027027.full
13:36 epoch2_0.14327293808408384_0.9603448275862069_0.9611486486486487.full
```

The BERT-based sentence classifier reaches an accuracy and F1 score of at least 0.96 in every epoch, which is a good result.
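
Since each checkpoint name encodes epoch, loss, accuracy, and F1, the best epoch can be picked programmatically. A small sketch that parses the three names listed above (the helper `parse_checkpoint_name` is hypothetical, not part of the project code):

```python
def parse_checkpoint_name(name):
    """Split 'epoch{n}_{loss}_{accuracy}_{f1}.full' into its parts."""
    stem = name[: -len(".full")] if name.endswith(".full") else name
    epoch, loss, accuracy, f1 = stem.split("_")
    return {
        "epoch": int(epoch.replace("epoch", "")),
        "loss": float(loss),
        "accuracy": float(accuracy),
        "f1": float(f1),
    }


names = [
    "epoch0_0.12525090887337118_0.9603448275862069_0.9611486486486487.full",
    "epoch1_0.11325939101594928_0.9637931034482758_0.964527027027027.full",
    "epoch2_0.14327293808408384_0.9603448275862069_0.9611486486486487.full",
]
checkpoints = [parse_checkpoint_name(n) for n in names]
best = max(checkpoints, key=lambda c: c["f1"])
lowest_loss = min(checkpoints, key=lambda c: c["loss"])
print(best["epoch"], lowest_loss["epoch"])
```

By both validation loss and F1, epoch 1 is the strongest of the three checkpoints, although the differences are small.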

## Using the Models

Label all the data with the trained models.

In [36]:
import torch
# Load the dimension-assignment model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model=torch.load('module/bert/epoch2_0.14327293808408384_0.9603448275862069_0.9611486486486487.full', map_location=device)
model.eval()  # inference mode: disables dropout
# A bare torch.no_grad() call has no effect outside a with-block; disable gradients globally instead
torch.set_grad_enabled(False)
from transformers import BertTokenizer
pretrained_model_name='bert-base-chinese'
# Encoding options (max_length, truncation, padding, return_tensors) belong to the tokenizer call, not to from_pretrained
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)


In [None]:
# Try it out
token=tokenizer(['颜色不好看','特别特别的不好','外观丑','不好用','不实用','服务态度差'],
                max_length=100,
                add_special_tokens=True,
                truncation=True,
                padding=True,return_tensors="pt")
token.to(device)
output=model(
    **token
    )
output[0].argmax(dim=1)

tensor([0, 1, 0, 1, 1, 2], device='cuda:0')

In [None]:
class_dict=[
    lambda x:f"apprence:{x}\n",
    lambda x:f"function:{x}\n",
    lambda x:f"service:{x}\n",
    ]
for x,y in zip(['颜色不好看','特别特别的不好','外观丑','不好用','不实用','服务态度差'],output[0].argmax(dim=1).cpu()):
    print(class_dict[y](x))

apprence:颜色不好看

function:特别特别的不好

apprence:外观丑

function:不好用

function:不实用

service:服务态度差



The results look good on these improvised test cases. Next comes the sentence-splitting scheme.

In [None]:
# Define the sentence-splitting pattern
import re
partten = re.compile(
    "".join(["(.+?[", "?", "!", "？", "！", "。", "…", "\n", " ", "]|.+?$)"])
)
partten.findall("产品质感：轻盈不油腻\n产品颜色：很正。持久效果：一般 还行?")

['产品质感：轻盈不油腻\n', '产品颜色：很正。', '持久效果：一般 ', '还行?']

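This design is fairly compact and efficient, which saves some face, though it is not guaranteed to cover every case. Next, let's see how it does on the raw data.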

In [None]:
comments=all_comment.sample(6)[['评论内容']].values
for paragraph in comments:
    sentences=partten.findall(paragraph[0])
    token=tokenizer(
        sentences,
        max_length=100,
        add_special_tokens=True,
        truncation=True,
        padding=True,return_tensors="pt"
    )
    token.to(device)
    output=model(
        **token
        )
    for sentence,kind in zip(sentences,output[0].argmax(dim=1).cpu()):
        print(class_dict[kind](sentence))

service:跑了两天来评价，包裹性很好缓震支撑不错有比较强的&ldquo;踩屎感&rdquo;，货是北京发百丽公司发货，产地越南。

function:应该是正品，这个价值了。

service:送的东西呢？

function:一样都没有

service:物流很快哦！

apprence:口红颜色很好看，上嘴感觉整个人都有气质了，口红是正品哦！

function:持久滋润，上嘴不会干，感觉不挑皮肤，涂上就是小仙女

service:当时看了好多家才敲定的这款，收到货非常的满意，外包装是皮的，坐垫是可以拆洗的，颜色很正，特别的好看，商家和安装师傅都特别好

function:是喜欢的颜色，质感细腻，上色效果均匀

service:这个马桶非常好用，收到货后装修马上迫不及待地让师傅安装好试用了。

function:冲力的，去污力强，污物不挂壁。

service:物流的发货速度很快，买了不久就收到货了。

function:马桶质量很好。



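Running the cell above several times shows that the model almost never misclassifies sentences a human can tell apart, but it does misassign odd sentences. A confidence threshold could keep the model from classifying such sentences, and the sentence splitter could be improved as well (for example by dropping very short fragments).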
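
A sketch of the two ideas above: reject predictions whose softmax confidence falls below a threshold, and skip fragments too short to classify. The threshold `0.8`, the minimum length `3`, and the helper names are illustrative assumptions, and plain Python lists stand in for the model's logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_class(logits, threshold=0.8):
    """Return the argmax class, or None when the top probability is below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


def keep_fragment(sentence, min_chars=3):
    """Keep only fragments with at least min_chars characters after stripping whitespace."""
    return len(sentence.strip()) >= min_chars


print(confident_class([4.0, 0.5, 0.2]))  # one class clearly dominates
print(confident_class([1.0, 0.9, 0.8]))  # nearly uniform scores
print([s for s in ["物流很快哦！", "嗯 ", "马桶质量很好。"] if keep_fragment(s)])
```

With the PyTorch model, applying `torch.softmax(output[0], dim=1)` to the logits would give the corresponding probabilities; the right threshold would have to be tuned on labeled data.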

In [None]:
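# Labeling function (assign sentences to dimensions, then score them)
from pandas import DataFrame
def categorize_score(frame:DataFrame,scorers):
    # Helper columns: map each bag-of-words vector back to its tokens (originally written for the bigram model) so the labeled output is easy to inspect.
    frame['词袋向量逆——美学']=''  # clumsy, but the impact is negligible
    frame['词袋向量逆——功能']=''  # clumsy, but the impact is negligible
    frame['词袋向量逆——服务']=''  # clumsy, but the impact is negligible
    # for i, paragraph in enumerate(frame[["评论内容"]].values):
    # The loop below is clumsy: it introduces unnecessary index operations
    for i, paragraph in zip(frame.index,frame[["评论内容"]].values): # iterate by index label so sampled frames work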
        if type(paragraph[0]) != str:
            continue
        sentences = partten.findall(paragraph[0])
        if len(sentences) == 0:
            continue
        token = tokenizer(
            sentences,
            max_length=100,
            add_special_tokens=True,
            truncation=True,
            padding=True,
            return_tensors="pt",
        )
        token.to(device)  # too few sentences per call; could be batched, but the code is hard to change
        output = model(**token)
        categorized_sentences = [[], [], []]
        for sentence, kind in zip(sentences, output[0].argmax(dim=1).cpu()):
            categorized_sentences[kind].append(sentence)
        for colnmn, (class_list, scorer) in enumerate(zip(categorized_sentences, scorers)):
            if len(class_list) > 0:
                classifier, vector = scorer
                vector_ = vector.transform(
                    [seg_sentence("".join(class_list), stopwords_list)]
                )
                score = classifier.predict(vector_)
                # clumsy: relies on fixed column positions
                frame.loc[i,frame.columns[5+colnmn]]=score  # fill in the score
                frame.loc[i,frame.columns[8+colnmn]]=','.join(vector.inverse_transform(vector_)[0])  # fill in the inverse-transformed tokens
    

In [None]:
# Cleaned-up version of the clumsy function above
def categorize_score_X(
    frame: DataFrame,
    scorers,
    partten: re.Pattern = partten,
    tokenizer: BertTokenizer = tokenizer,
):
    def categorize_score_a_row(row: pd.Series):
        scores_sentences=['']*6
        paragraph = row["评论内容"]
        if type(paragraph) != str:
            return scores_sentences
        sentences = partten.findall(paragraph)
        if len(sentences) == 0:
            return scores_sentences
        token = tokenizer(
            sentences,
            max_length=100,
            add_special_tokens=True,
            truncation=True,
            padding=True,
            return_tensors="pt",
        )
        token.to(device)
        output = model(**token)
        categorized_sentences = [[], [], []]
        for sentence, kind in zip(sentences, output[0].argmax(dim=1).cpu()):
            categorized_sentences[kind].append(sentence)
        for colnmn, (class_list, scorer) in enumerate(
            zip(categorized_sentences, scorers)
        ):
            if len(class_list) > 0:
                classifier, vector = scorer
                vector_ = vector.transform(
                    [seg_sentence("".join(class_list), stopwords_list)]
                )
                score = classifier.predict(vector_)
                scores_sentences[colnmn] = score
                scores_sentences[3 + colnmn] = ",".join(
                    vector.inverse_transform(vector_)[0]
                )
        return scores_sentences
    frame[['美学维度','功能维度','服务维度',
        '词袋向量逆——美学',
        '词袋向量逆——功能',
        '词袋向量逆——服务']]=frame.apply(categorize_score_a_row, axis=1,result_type='expand')
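
Both functions call the model once per review, so only a handful of sentences reach the GPU at a time. One possible remedy, sketched here with hypothetical helper names and a stand-in classifier, is to flatten all sentences, classify them in fixed-size batches, and regroup the predictions per review.

```python
def classify_in_batches(sentence_lists, classify_batch, batch_size=64):
    """Classify sentences grouped per review while calling classify_batch on larger batches.

    sentence_lists: list of lists of sentences, one inner list per review.
    classify_batch: callable taking a list of sentences and returning one label per sentence.
    """
    flat = [s for sentences in sentence_lists for s in sentences]
    labels = []
    for start in range(0, len(flat), batch_size):
        labels.extend(classify_batch(flat[start:start + batch_size]))
    # Regroup the flat labels so they line up with the original reviews.
    grouped, position = [], 0
    for sentences in sentence_lists:
        grouped.append(labels[position:position + len(sentences)])
        position += len(sentences)
    return grouped


# Stand-in classifier for the demo: label = sentence length modulo 3.
calls = []


def fake_classifier(batch):
    calls.append(len(batch))
    return [len(s) % 3 for s in batch]


reviews = [["物流很快", "颜色好看"], [], ["不好用", "外观丑", "服务态度差"]]
result = classify_in_batches(reviews, fake_classifier, batch_size=4)
print(result, calls)
```

In the notebook's setting, `classify_batch` would wrap the tokenizer call and `model(**token)`, returning the argmax over the logits for each sentence.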


In [None]:

sample_comment=data4label19_20.sample(5).copy(deep=True)
sample_comment

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度
242310,64812946043,2020-05-11 17:12:53,2019-07-14 01:35:26,鞋子一般话，应该算是正品中的残次品吧！鞋面有鼓包，后跟有皱纹。。,活动前,,,
1077676,70291329569,2020-11-19 18:03:04,2020-10-31 11:47:39,宝贝很好上色没有异味老人家很喜欢，自己就可以染发了染完头发年轻了好几岁,活动前,,,
599216,66270565075,2020-05-31 20:21:01,2020-05-04 22:42:59,经常在这买东西，目前还没有失望过，棒棒哒！经常在这买东西，每次都很满意，足不出户就能买到自己...,活动前,,,
741699,44645506794,2020-08-31 17:11:25,2020-06-06 17:26:23,尺码大小：码数标准，平常穿多大就买多大\n透气性能：因为是高帮皮质，透气性就那样吧?\n舒适...,活动前,,,
1244635,100008650208,2020-10-15 11:01:33,2019-06-04 00:22:32,活动价格变动太快，好心酸，不该拆封的,活动前,,,


In [None]:
# Start labeling
bayes_scorer = [
    [appearance_classifier_1,appearance_vector1],
    [function_classifier_1,function_vector1],
    [service_classifier_1,service_vector1]
]
categorize_score(sample_comment,bayes_scorer)
sample_comment

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
242310,64812946043,2020-05-11 17:12:53,2019-07-14 01:35:26,鞋子一般话，应该算是正品中的残次品吧！鞋面有鼓包，后跟有皱纹。。,活动前,,-1.0,,,"应该,正品,残次品,算是,鞋子",
1077676,70291329569,2020-11-19 18:03:04,2020-10-31 11:47:39,宝贝很好上色没有异味老人家很喜欢，自己就可以染发了染完头发年轻了好几岁,活动前,,-1.0,,,"上色,喜欢,头发,宝贝,异味,染发,没有",
599216,66270565075,2020-05-31 20:21:01,2020-05-04 22:42:59,经常在这买东西，目前还没有失望过，棒棒哒！经常在这买东西，每次都很满意，足不出户就能买到自己...,活动前,,,1.0,,,"东西,买到,失望,想要,方便,棒棒,没有,满意,真是太"
741699,44645506794,2020-08-31 17:11:25,2020-06-06 17:26:23,尺码大小：码数标准，平常穿多大就买多大\n透气性能：因为是高帮皮质，透气性就那样吧?\n舒适...,活动前,1.0,1.0,,"好看,特色,舒服","不会,大小,尺码,感觉,标准,皮质,程度,舒适,舒适度",
1244635,100008650208,2020-10-15 11:01:33,2019-06-04 00:22:32,活动价格变动太快，好心酸，不该拆封的,活动前,,,-1.0,,,"不该,价格,拆封,活动"


In [None]:

sample_comment=data4label19_20.sample(5).copy(deep=True)
bayes_scorer = [
    [appearance_classifier_1,appearance_vector1],
    [function_classifier_1,function_vector1],
    [service_classifier_1,service_vector1]
]
categorize_score_X(sample_comment,bayes_scorer)

In [None]:
sample_comment

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
913836,36784650822,2020-02-19 00:17:33,2020-06-11 11:11:58,收到了宝贝，真心喜欢。刚买的破壁机是红颜色的，所以，电饭煲也选了同色。因为选了红色多花了钱，...,活动后,[1.0],,[1.0],"满意,红色,红颜色",,"喜欢,宝贝,客服,小哥,开心,快递,收到,服务,沟通,物流,耐心,购物"
1055764,43235992028,2020-07-13 19:31:41,2019-04-27 17:14:05,物流很快！颜值很好 管身除了logo就是纯色 很ins简约风 颜色也很好看 显色度不是很高适...,活动前,[1.0],[1.0],[1.0],"好看,显白,显色,气色,适合,颜值,颜色","不同,好评,打底,效果,滋润,觉得",物流
744791,50914517605,2020-03-19 09:12:00,2020-02-01 01:21:53,根本没有收到快递 一开始因我在异地没法回来就没注意物流，上线一看快递已签收，我问过父母并没有...,活动前,,,[-1.0],,,"主动,回来,快递,打电话,拉走,收到,根本,没收,没有,没法,注意,物流,知道,签收,联系,..."
523346,5853363,2020-07-29 22:37:56,2020-11-11 00:04:16,比较普通的外观，但也能适合更多场所。灯比较亮，15平一点问题也没有。\n安装起来也比较简单，...,活动后,[0.0],[1.0],,"外观,普通,适合","安装,打孔,搞定,比较简单,没有,起来,轻松,问题,需要",
487784,40837378523,2020-07-28 16:58:56,2019-08-21 15:18:29,床质量很好，做工精细，软硬适中，没有什么刺鼻的味道，性价比很高，很不错,活动前,,[1.0],,,"不错,做工,刺鼻,味道,性价比,没有,精细,质量,软硬,适中",


The table above shows the labeling results for a few sample reviews. Next, the full dataset is labeled and saved.

In [None]:
nb_labled=data4label19_20.copy(deep=True)
categorize_score(nb_labled,bayes_scorer)
nb_labled.sample(5)

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
66012,1973008586,2020-05-20 11:13:49,2019-11-02 10:52:09,不知道是不是正品，怎么都感觉有点不合适，送的包装看起来很伪劣，而且盒子都不是小羊皮的&hel...,活动前,,,-1.0,,,"hellip,包装,合适,感觉,是不是,有点,根本,正品,盒子,看起来,知道"
419609,5977632,2020-10-15 11:02:24,2020-03-07 23:39:44,搓泥太严重了。不是很好，也不会回购,活动前,,-1.0,1.0,,严重,"不会,回购"
367868,65362595922,2020-07-13 18:52:10,2020-09-05 03:14:20,颜色很好看，上色也很精神，不错,活动后,1.0,,,"不错,好看,颜色",,
203645,37373465676,2020-05-18 16:08:29,2020-09-10 09:49:10,买了N多次了，用着很方便。,活动后,,1.0,,,方便,
948909,47378458610,2020-07-23 20:38:28,2019-11-12 13:46:06,口红挺不错的，涂起来很舒适，不易掉色，很喜欢这个颜色,活动前,,1.0,,,"不易,口红,喜欢,挺不错,掉色,舒适,起来,颜色",


In [None]:
nb_labled.to_excel(f'data/labeled_NB_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])

Next, we build the RF and SVM classifiers, then label the data and save the results.

In [None]:
from sklearn.ensemble import RandomForestClassifier
RF_appearance_classifier_1=RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=5)
RF_function_classifier_1=RandomForestClassifier(n_estimators=30, criterion='gini', max_depth=28)
RF_service_classifier_1=RandomForestClassifier(n_estimators=40, criterion='gini', max_depth=32)
RF_appearance_classifier_1.fit(appearance_train_n_x_1,appearance_train_y)
RF_function_classifier_1.fit(function_train_n_x_1,function_train_y)
RF_service_classifier_1.fit(service_train_n_x_1,service_train_y)
RF_scorer = [
    [RF_appearance_classifier_1,appearance_vector1],
    [RF_function_classifier_1,function_vector1],
    [RF_service_classifier_1,service_vector1]
]

In [None]:
RF_appearance_train_pre_1=RF_appearance_classifier_1.predict(appearance_train_n_x_1)
RF_function_train_pre_1=RF_function_classifier_1.predict(function_train_n_x_1)
RF_service_train_pre_1=RF_service_classifier_1.predict(service_train_n_x_1)

RF_appearance_test_pre_1=RF_appearance_classifier_1.predict(appearance_test_n_x_1)
RF_function_test_pre_1=RF_function_classifier_1.predict(function_test_n_x_1)
RF_service_test_pre_1=RF_service_classifier_1.predict(service_test_n_x_1)

appearance_train_pre=RF_appearance_train_pre_1
function_train_pre=RF_function_train_pre_1
service_train_pre=RF_service_train_pre_1
print('一元词袋appearance 训练集准确率：',metrics.accuracy_score(appearance_train_y,appearance_train_pre))
print('一元词袋function 训练集准确率：',metrics.accuracy_score(function_train_y,function_train_pre))
print('一元词袋service 训练集准确率：',metrics.accuracy_score(service_train_y,service_train_pre))
print('一元词袋appearance 训练集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(appearance_train_y,appearance_train_pre,average='micro'))
print('一元词袋function 训练集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(function_train_y,function_train_pre,average='micro'))
print('一元词袋service 训练集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(service_train_y,service_train_pre,average='micro'))
appearance_test_pre=RF_appearance_test_pre_1
function_test_pre=RF_function_test_pre_1
service_test_pre=RF_service_test_pre_1
print('一元词袋appearance 测试集准确率：',metrics.accuracy_score(appearance_test_y,appearance_test_pre))
print('一元词袋function 测试集准确率：',metrics.accuracy_score(function_test_y,function_test_pre))
print('一元词袋service 测试集准确率：',metrics.accuracy_score(service_test_y,service_test_pre))
print('一元词袋appearance 测试集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(appearance_test_y,appearance_test_pre,average='micro'))
print('一元词袋function 测试集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(function_test_y,function_test_pre,average='micro'))
print('一元词袋service 测试集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(service_test_y,service_test_pre,average='micro'))

一元词袋appearance 训练集准确率： 0.911297852474323
一元词袋function 训练集准确率： 0.9264285714285714
一元词袋service 训练集准确率： 0.9226145755071374
一元词袋appearance 训练集精确度，召回率，f1值： (0.911297852474323, 0.911297852474323, 0.911297852474323, None)
一元词袋function 训练集精确度，召回率，f1值： (0.9264285714285714, 0.9264285714285714, 0.9264285714285714, None)
一元词袋service 训练集精确度，召回率，f1值： (0.9226145755071374, 0.9226145755071374, 0.9226145755071374, None)
一元词袋appearance 测试集准确率： 0.925
一元词袋function 测试集准确率： 0.9134615384615384
一元词袋service 测试集准确率： 0.8918918918918919
一元词袋appearance 测试集精确度，召回率，f1值： (0.925, 0.925, 0.925, None)
一元词袋function 测试集精确度，召回率，f1值： (0.9134615384615384, 0.9134615384615384, 0.9134615384615384, None)
一元词袋service 测试集精确度，召回率，f1值： (0.8918918918918919, 0.8918918918918919, 0.8918918918918919, None)


### Hyperparameter tuning

#### First attempt
```
RF_appearance_classifier_1=RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=5)
RF_function_classifier_1=RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=5)
RF_service_classifier_1=RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=5)

一元词袋appearance 训练集准确率： 0.9150326797385621
一元词袋function 训练集准确率： 0.6632142857142858
一元词袋service 训练集准确率： 0.6942148760330579
一元词袋appearance 训练集精确度，召回率，f1值： (0.9150326797385621, 0.9150326797385621, 0.9150326797385621, None)
一元词袋function 训练集精确度，召回率，f1值： (0.6632142857142858, 0.6632142857142858, 0.6632142857142858, None)
一元词袋service 训练集精确度，召回率，f1值： (0.6942148760330579, 0.6942148760330579, 0.6942148760330579, None)
一元词袋appearance 测试集准确率： 0.925
一元词袋function 测试集准确率： 0.6410256410256411
一元词袋service 测试集准确率： 0.6486486486486487
一元词袋appearance 测试集精确度，召回率，f1值： (0.925, 0.925, 0.925, None)
一元词袋function 测试集精确度，召回率，f1值： (0.6410256410256411, 0.6410256410256411, 0.6410256410256411, None)
一元词袋service 测试集精确度，召回率，f1值： (0.6486486486486487, 0.6486486486486487, 0.6486486486486487, None)
```

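Conclusion: appearance works on the first try, while function and service underfit. Increase model complexity (tree depth), and increase the number of trees at the same time to guard against overfitting (bagging).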


#### Second attempt
```
RF_appearance_classifier_1=RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=5)
RF_function_classifier_1=RandomForestClassifier(n_estimators=20, criterion='gini', max_depth=16)
RF_service_classifier_1=RandomForestClassifier(n_estimators=20, criterion='gini', max_depth=16)

一元词袋appearance 训练集准确率： 0.9150326797385621
一元词袋function 训练集准确率： 0.9042857142857142
一元词袋service 训练集准确率： 0.8399699474079639
一元词袋appearance 训练集精确度，召回率，f1值： (0.9150326797385621, 0.9150326797385621, 0.9150326797385621, None)
一元词袋function 训练集精确度，召回率，f1值： (0.9042857142857142, 0.9042857142857142, 0.9042857142857142, None)
一元词袋service 训练集精确度，召回率，f1值： (0.8399699474079639, 0.8399699474079639, 0.8399699474079639, None)
一元词袋appearance 测试集准确率： 0.925
一元词袋function 测试集准确率： 0.8942307692307693
一元词袋service 测试集准确率： 0.8040540540540541
一元词袋appearance 测试集精确度，召回率，f1值： (0.925, 0.925, 0.925, None)
一元词袋function 测试集精确度，召回率，f1值： (0.8942307692307693, 0.8942307692307693, 0.8942307692307693, None)
一元词袋service 测试集精确度，召回率，f1值： (0.8040540540540541, 0.8040540540540541, 0.804054054054054, None)
```

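Conclusion: a lucky guess, but service accuracy can still improve by about 10 points.

See the original cell above for the final tuned parameters.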
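The manual search over `n_estimators` and `max_depth` above could also be automated. The following is only a minimal sketch using scikit-learn's `GridSearchCV`, not what the notebook actually ran: the synthetic data from `make_classification` stands in for the bag-of-words features, and the grid values are simply the ones tried by hand above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the bag-of-words features (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Candidate values taken from the manual attempts above.
param_grid = {
    "n_estimators": [10, 20, 40],
    "max_depth": [5, 16, 32],
}

# 3-fold cross-validated search; accuracy matches the metric printed in the notebook.
search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation selects parameters without touching the held-out test set, which matters here because the test sets are small.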

In [None]:
sample_comment=data4label19_20.sample(5).copy(deep=True)
categorize_score_X(sample_comment,RF_scorer)
sample_comment

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
983508,68480830322,2020-05-15 16:09:16,2020-07-30 17:04:09,吸力大小：大小适宜，打开电源稳稳的吸附在玻璃上。关掉电源很容易取下。\n智能程度：总体来说还...,活动后,,[1.0],[-1.0],,"不错,位置,关掉,吸力,噪音,地方,声音,大小,容易,工作,总体,情况,打开,接受,放在,效...","不会,保障,咨询,安全,客服,放心,机器,电源,能力,遇见"
160374,1555771170,2020-08-03 10:55:56,2020-06-18 23:48:23,评分检测：38万+\n运行速度：开机速度30秒左右\n游戏效果：联盟运行流程\n外形外观：看...,活动前,[1.0],[1.0],[-1.0],"不错,外形,外观,看着","不错,价位,值得,升级,感觉,效果,检测,流程,特色,现象,还好,速度","安装,容易"
863246,70252064568,2020-09-22 10:16:39,2020-09-20 05:54:44,经过亲身体验，这家店信誉相当不错，产品质量更像钻石一般。,活动前,,[1.0],,,"不错,产品质量,体验",
571974,34627492430,2020-11-19 18:03:37,2019-02-18 10:17:16,刚买完就降价，生气,活动前,,,[-1.0],,,"生气,降价"
126757,37853323857,2020-05-29 23:02:32,2020-11-11 11:22:31,造型挺可爱的 质量不错,活动后,[1.0],[1.0],,造型,"不错,质量",


In [None]:
RF_labled=data4label19_20.copy(deep=True)
categorize_score_X(RF_labled,RF_scorer)
RF_labled.sample(5)

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
147962,4215119,2020-10-15 11:02:55,2020-02-23 10:28:46,包装也太一般了 看了一下换电池 还得拧螺丝 试了一下 太难了,活动前,,[-1.0],[1.0],,电池,"包装,螺丝"
553920,48705871864,2020-08-12 18:18:46,2020-12-22 20:30:51,很好，不错，速度快,活动后,,,[1.0],,,"不错,速度"
640153,1959907863,2020-07-13 10:36:39,2019-09-30 15:57:22,产品外观：很美\n保湿效果：好\n持久效果：容易掉但是谁让它保湿呢，谁让它便宜呢\n轻薄程度...,活动前,[1.0],[-1.0],,"产品,外观,效果,显色,有点,肤色,超级","便宜,保湿,容易,持久,效果,气味,程度,膏体,轻薄",
964281,62698509842,2020-11-19 18:03:26,2020-04-24 20:13:35,感觉不错， 给妈妈买的，够用一年了，刚收到货，快递很好，送到家门，用用看看，掉不掉色,活动前,,[1.0],,,"不错,够用,妈妈,快递,感觉,掉色,收到,看看,送到",
1002576,37240105375,2020-07-20 18:58:29,2019-03-23 16:28:38,还可以的。就是送的袋子质量比较差,活动前,,[1.0],,,质量,


In [None]:
RF_labled.to_excel(f'data/labeled_RF_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])

In [None]:
from sklearn.svm import SVC
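# No tuning here either; let's see how the defaults do first.
# Note: SVC handles multiclass with a one-vs-one scheme internally (the default
# decision_function_shape='ovr' only changes the shape of the decision function).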
svm_appearance_classifier_1=SVC()
svm_function_classifier_1=SVC()
svm_service_classifier_1=SVC()
svm_appearance_classifier_1.fit(appearance_train_n_x_1,appearance_train_y)
svm_function_classifier_1.fit(function_train_n_x_1,function_train_y)
svm_service_classifier_1.fit(service_train_n_x_1,service_train_y)
svm_scorer = [
    [svm_appearance_classifier_1,appearance_vector1],
    [svm_function_classifier_1,function_vector1],
    [svm_service_classifier_1,service_vector1]
]

In [None]:
svm_appearance_train_pre_1=svm_appearance_classifier_1.predict(appearance_train_n_x_1)
svm_function_train_pre_1=svm_function_classifier_1.predict(function_train_n_x_1)
svm_service_train_pre_1=svm_service_classifier_1.predict(service_train_n_x_1)

svm_appearance_test_pre_1=svm_appearance_classifier_1.predict(appearance_test_n_x_1)
svm_function_test_pre_1=svm_function_classifier_1.predict(function_test_n_x_1)
svm_service_test_pre_1=svm_service_classifier_1.predict(service_test_n_x_1)

appearance_train_pre=svm_appearance_train_pre_1
function_train_pre=svm_function_train_pre_1
service_train_pre=svm_service_train_pre_1
print('一元词袋appearance 训练集准确率：',metrics.accuracy_score(appearance_train_y,appearance_train_pre))
print('一元词袋function 训练集准确率：',metrics.accuracy_score(function_train_y,function_train_pre))
print('一元词袋service 训练集准确率：',metrics.accuracy_score(service_train_y,service_train_pre))
print('一元词袋appearance 训练集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(appearance_train_y,appearance_train_pre,average='micro'))
print('一元词袋function 训练集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(function_train_y,function_train_pre,average='micro'))
print('一元词袋service 训练集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(service_train_y,service_train_pre,average='micro'))
appearance_test_pre=svm_appearance_test_pre_1
function_test_pre=svm_function_test_pre_1
service_test_pre=svm_service_test_pre_1
print('一元词袋appearance 测试集准确率：',metrics.accuracy_score(appearance_test_y,appearance_test_pre))
print('一元词袋function 测试集准确率：',metrics.accuracy_score(function_test_y,function_test_pre))
print('一元词袋service 测试集准确率：',metrics.accuracy_score(service_test_y,service_test_pre))
print('一元词袋appearance 测试集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(appearance_test_y,appearance_test_pre,average='micro'))
print('一元词袋function 测试集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(function_test_y,function_test_pre,average='micro'))
print('一元词袋service 测试集精确度，召回率，f1值：',metrics.precision_recall_fscore_support(service_test_y,service_test_pre,average='micro'))

一元词袋appearance 训练集准确率： 0.9589169000933707
一元词袋function 训练集准确率： 0.945
一元词袋service 训练集准确率： 0.9135987978963186
一元词袋appearance 训练集精确度，召回率，f1值： (0.9589169000933707, 0.9589169000933707, 0.9589169000933707, None)
一元词袋function 训练集精确度，召回率，f1值： (0.945, 0.945, 0.945, None)
一元词袋service 训练集精确度，召回率，f1值： (0.9135987978963186, 0.9135987978963186, 0.9135987978963186, None)
一元词袋appearance 测试集准确率： 0.925
一元词袋function 测试集准确率： 0.9262820512820513
一元词袋service 测试集准确率： 0.8918918918918919
一元词袋appearance 测试集精确度，召回率，f1值： (0.925, 0.925, 0.925, None)
一元词袋function 测试集精确度，召回率，f1值： (0.9262820512820513, 0.9262820512820513, 0.9262820512820513, None)
一元词袋service 测试集精确度，召回率，f1值： (0.8918918918918919, 0.8918918918918919, 0.8918918918918919, None)


Hmm... no tuning needed?!
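The appearance test accuracy is 0.925 for both the random forest and the SVM. Identical scores from different models can be a sign that each one mostly predicts the most frequent class, but the notebook does not show the label distribution, so this is only a hypothesis. One way to check is to compare against a majority-class baseline. The sketch below uses `DummyClassifier` with made-up labels; the 37/40 split is hypothetical, chosen only because 37/40 = 0.925.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels (NOT the real data): mostly positive reviews.
y_train = np.array([1] * 90 + [0] * 6 + [-1] * 4)
y_test = np.array([1] * 37 + [0] * 2 + [-1] * 1)

# The baseline ignores features, so placeholder arrays are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print("majority-class baseline accuracy:", baseline_acc)
```

If a real classifier does not clearly beat this baseline on the actual labels, its accuracy says little about what it has learned.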

In [None]:
sample_comment=data4label19_20.sample(5).copy(deep=True)
categorize_score_X(sample_comment,svm_scorer)
sample_comment

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
434242,31034484969,2020-11-19 18:00:04,2019-08-03 22:35:07,上色不错，买给家里老人的，比理发店便宜，不用出门一级在家搞定,活动前,,[1.0],,,"不用,不错,便宜,出门,家里,搞定,理发店,老人",
94952,187715,2020-08-24 22:07:04,2019-06-27 20:56:42,物有所值，比较牢固。还不错。,活动前,,[1.0],,,"不错,牢固,物有所值",
1093487,100007406761,2020-10-15 11:02:54,2020-06-17 13:16:28,很快很好很满意。。。,活动前,,[-1.0],[1.0],,,满意
402827,5177847,2020-07-01 22:54:28,2020-11-01 13:36:18,商品是好商品，价格还可以，安装师傅不是人，一根进水管收55，图省事马桶盖进水口接口处，配套的...,活动后,,,[-1.0],,,"不想,价格,商品,安装,师傅,省事,配套,马桶盖"
829833,70243154996,2020-07-29 10:55:07,2020-11-12 10:37:52,颜色根本就不是胡萝卜色，和图片一点都不一样,活动后,[-1.0],,,"图片,根本,颜色",,


In [None]:
svm_labled=data4label19_20.copy(deep=True)
categorize_score_X(svm_labled,svm_scorer)
svm_labled.sample(5)

Unnamed: 0,productId,活动时间,评论时间,评论内容,类型,美学维度,功能维度,服务维度,词袋向量逆——美学,词袋向量逆——功能,词袋向量逆——服务
1065584,67310662198,2020-08-11 11:32:19,2020-08-17 11:28:44,此用户未填写评价内容,活动后,,,[-1.0],,,"用户,评价"
42400,65446406726,2020-08-17 19:08:23,2020-05-16 22:08:52,520给女朋友的他应该会喜欢,活动前,[1.0],,,"喜欢,应该",,
503046,38390280822,2020-07-28 15:34:06,2020-07-16 10:53:00,行挺舒服的物流也很快，包装也比较精细,活动前,,,[1.0],,,"包装,挺舒服,物流,精细"
405459,46142631518,2020-03-31 13:58:20,2019-10-30 14:41:25,很好用很好用很好用！！！,活动前,,[1.0],,,好用,
882976,4806709,2020-10-15 11:03:12,2020-05-14 19:46:25,买的时候花了389，没过五天就299了，因为有赠品无法保价，而且389买的赠品还没有299买...,活动前,,[-1.0],[-1.0],,"不合理,东西,有点,欧莱雅","京东,会员,保价,失望,心里,感觉,无法,没有,赠品"


In [None]:
svm_labled.to_excel(f'data/labeled_SVM_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])

That completes the task; the remaining work can be done in Excel.

In [None]:
# Wrap-up: label the data4label set with RF and SVM and save the results
RF_labeled_timespan3=data4label.copy(deep=True)
categorize_score_X(RF_labeled_timespan3,RF_scorer)
RF_labeled_timespan3.to_excel(f'data/labeled_timespan3_RF_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])
svm_labeled_timespan3=data4label.copy(deep=True)
categorize_score_X(svm_labeled_timespan3,svm_scorer)
svm_labeled_timespan3.sample(5)
svm_labeled_timespan3.to_excel(f'data/labeled_timespan3_SVM_{now_time}.xlsx', engine='xlsxwriter', index=False, freeze_panes=[1,0])

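## Reflections and summary

Shortcomings:

1. Lack of software engineering experience. I did not pin down Student A's real requirements at the start, nor press them to get the code from the other developers (supposedly lost, but more likely it never existed). As a result, requirements were added several times and code had to be rewritten, which lowered development efficiency and increased the cleanup work later. My unfamiliarity with NLP was also a factor; otherwise I would have spotted the problem when I received the first piece of misleading code. A good lesson, but beyond quickly getting up to speed on the background techniques for a task, there is little concrete takeaway; I just need to develop a nose for it.
2. Lack of programming experience. Variable names are inconsistent, function encapsulation is not elegant enough, and I found plenty of facepalm-worthy code while tidying up. As with interface definitions in backend development, this is a pain point for me. I should study how open-source packages design their interfaces and distill the ideas behind them. I have also heard that some Java open-source projects have especially elegant interface definitions (I think 沐神 mentioned this); next time I come across one I should note it down and study it.
3. Not fluent with data analysis tools, so coding was slow. Besides taking on projects, working through the code in textbooks might be a more efficient way to learn.
4. Lack of data analysis experience; the boundaries between stages of work were unclear. This is much like an unclear frontend/backend split in full-stack development, except that split can be decided by runtime efficiency and data security, whereas here there is no need to draw the lines too sharply.
5. Distracted while coding, so productivity was low. I need to improve my focus at work. I remember reading some books on flow that may be worth revisiting.
6. The models above were barely tuned. I need to learn to use tuning tools such as ray-tune proficiently.
7. The metrics are suspiciously high, yet the labels look mediocre on manual review, so the data may have problems. In future I could look at the metrics for each class separately and also check for class imbalance.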
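On the last point: the notebook calls `precision_recall_fscore_support` with `average='micro'`. For single-label multiclass data, micro-averaging makes precision, recall and F1 all equal to accuracy, which is why every printed tuple repeats the accuracy value. A minimal sketch of how per-class and macro-averaged metrics could be inspected instead, on toy labels that are an assumption and not the project's data:

```python
from collections import Counter

import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

# Toy imbalanced 3-class labels (assumption): the model is right on most
# positive samples and misses every neutral and negative one.
y_true = np.array([1] * 8 + [0] * 1 + [-1] * 1)
y_pred = np.array([1] * 7 + [0] * 1 + [1] * 2)

# Step 1: check class balance before trusting any single score.
print("class counts:", Counter(y_true.tolist()))

# Step 2: compare micro-averaged scores (identical to accuracy here)
# with macro-averaged scores, which weight each class equally.
acc = accuracy_score(y_true, y_pred)
micro = precision_recall_fscore_support(y_true, y_pred, average="micro",
                                        zero_division=0)
macro = precision_recall_fscore_support(y_true, y_pred, average="macro",
                                        zero_division=0)
print("accuracy:", acc)
print("micro P/R/F1:", micro[:3])
print("macro P/R/F1:", macro[:3])

# Step 3: per-class precision, recall, F1 and support.
print(classification_report(y_true, y_pred, labels=[-1, 0, 1], zero_division=0))
```

A large gap between the micro and macro scores suggests that minority classes are being handled poorly even when accuracy looks high.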