# Predicting Amazon Review Quality with Ensemble Learning

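## 1. Case Overview

With the rise of e-commerce platforms and the continuing impact of the pandemic, online shopping plays an increasingly important role in daily life. When choosing products online, reviews are one of the things shoppers pay most attention to. However, review quality on e-commerce sites varies widely, and some reviews are even paid-for praise or malicious negative reviews, which seriously harms the shopping experience. Predicting review quality has therefore become a growing concern for e-commerce platforms: if review quality can be assessed automatically, the predictions can be used to avoid displaying low-quality reviews. In this case study we use ensemble learning to predict the quality of reviews from a real Amazon setting.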

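## 2. Experiment Description

This case study implements two ensemble learning algorithms (Bagging and AdaBoost.M1), each with two required base classifiers, SVM and decision tree. Four sets of results therefore need to be compared, with [AUC](https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics) as the evaluation metric:

* Bagging + SVM
* Bagging + decision tree
* AdaBoost.M1 + SVM
* AdaBoost.M1 + decision tree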
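
As a reminder of what the AUC metric measures, here is a minimal sketch that computes it directly as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as one half. The function name `auc_from_scores` and the toy data are illustrative assumptions; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy example (hypothetical labels and scores).
y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = auc_from_scores(y_demo, s_demo)
print(auc_demo)
```

The pairwise comparison is quadratic in the number of samples, so it is only meant for building intuition on small inputs.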

## 3. Data Overview

In [1]:
import pandas as pd 
train_df = pd.read_csv('./data/train.csv', sep='\t')

In [2]:
train_df

Unnamed: 0,reviewerID,asin,reviewText,overall,votes_up,votes_all,label
0,7885,3901,"First off, allow me to correct a common mistak...",5.0,6,7,0
1,52087,47978,I am really troubled by this Story and Enterta...,3.0,99,134,0
2,5701,3667,A near-perfect film version of a downright glo...,4.0,14,14,1
3,47191,40892,Keep your expectations low. Really really low...,1.0,4,7,0
4,40957,15367,"""they dont make em like this no more...""well.....",5.0,3,6,0
...,...,...,...,...,...,...,...
57034,58315,29374,"If you like beautifully shot, well acted films...",2.0,12,21,0
57035,23328,45548,This is a great set of films Wayne did Fox and...,5.0,15,18,0
57036,27203,42453,It's what's known as a comedy of manners. It's...,3.0,4,5,0
57037,33992,44891,Ellen can do no wrong as far a creating wonder...,5.0,4,5,0


In [3]:
print(train_df.info())
print(train_df.describe())
print(train_df.label.value_counts())
print(train_df.overall.value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57039 entries, 0 to 57038
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   reviewerID  57039 non-null  int64  
 1   asin        57039 non-null  int64  
 2   reviewText  57039 non-null  object 
 3   overall     57039 non-null  float64
 4   votes_up    57039 non-null  int64  
 5   votes_all   57039 non-null  int64  
 6   label       57039 non-null  int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 3.0+ MB
None
          reviewerID          asin       overall      votes_up     votes_all  \
count   57039.000000  57039.000000  57039.000000  57039.000000  57039.000000   
mean    33359.761865  19973.170866      3.535178     12.387594     18.475850   
std     30016.804127  14104.410152      1.529742     45.130499     50.149683   
min        50.000000      0.000000      1.000000      0.000000      5.000000   
25%      9235.000000   8218.000000      2.000000      4.

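The data comes from the Amazon e-commerce platform and contains more than 50,000 reviews left by users after purchasing products. The columns are:

* reviewerID: user ID
* asin: product ID
* reviewText: review text in English
* overall: the user's rating of the product (1-5)
* votes_up: number of votes marking the review as helpful (training set only)
* votes_all: total number of votes the review received (training set only)
* label: review quality label, 1 for high quality and 0 for low quality (training set only)

The quality label is derived from other users' votes on the review: reviews with votes_up/votes_all ≥ 0.9 are treated as high quality. In addition, the test set contains an extra column, Id, which identifies each test example.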
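
The labeling rule above can be sketched as follows. The helper name `quality_label` is an assumption for illustration; the vote counts are the first five rows of the preview shown earlier.

```python
import numpy as np

def quality_label(votes_up, votes_all, threshold=0.9):
    """Return 1 where votes_up / votes_all >= threshold, else 0."""
    votes_up = np.asarray(votes_up, dtype=float)
    votes_all = np.asarray(votes_all, dtype=float)
    return (votes_up / votes_all >= threshold).astype(int)

# votes_up and votes_all for rows 0-4 of the training preview.
labels_demo = quality_label([6, 99, 14, 4, 3], [7, 134, 14, 7, 6])
print(labels_demo)
```

The result agrees with the `label` column of those five rows. Because `votes_all` has a minimum of 5 in the training set, the division is safe there.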

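## 4. Competition Submission Format

Course page: https://aistudio.baidu.com/aistudio/education/dashboard

The submission file must give, for every review in the test set, the predicted probability that it is high quality. Each line contains an Id (matching the test set) and the predicted probability Predicted (a float between 0 and 1), separated by a comma. An example submission looks like this: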

```
Id,Predicted
0,0.9
1,0.45
2,0.78
...
```
Name the file `result.csv`.
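
A minimal sketch of producing a file in this format with the standard `csv` module is shown below. The helper `write_submission` is a hypothetical name, and the example writes to an in-memory buffer so that it can be run without touching the disk.

```python
import csv
import io

def write_submission(ids, probs, fileobj):
    """Write an Id,Predicted table in the submission format."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    for i, p in zip(ids, probs):
        writer.writerow([i, p])

buf = io.StringIO()
write_submission([0, 1, 2], [0.9, 0.45, 0.78], buf)
submission_text = buf.getvalue()
print(submission_text)
```

To create the actual file, the same function could be called with `open('result.csv', 'w', newline='')` instead of the buffer.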

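**Note: besides submitting to the competition, you also need to submit your code and report (without the data) on XuetangX (学堂在线), as in previous assignments.**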

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
import random
from sklearn.model_selection import train_test_split

In [5]:
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText'])
print(train_x)
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_x.shape)
print(train_y.shape)
test_df = pd.read_csv('./data/test.csv', sep='\t')
print(test_df)
test_x = test_df.drop(columns=['Id','reviewText'])
print(test_x)

       reviewerID   asin  overall
0            7885   3901      5.0
1           52087  47978      3.0
2            5701   3667      4.0
3           47191  40892      1.0
4           40957  15367      5.0
...           ...    ...      ...
57034       58315  29374      2.0
57035       23328  45548      5.0
57036       27203  42453      3.0
57037       33992  44891      5.0
57038       27478  19198      2.0

[57039 rows x 3 columns]
(57039, 3)
(57039,)
          Id  reviewerID   asin  \
0          0       82947  37386   
1          1       10154  23543   
2          2        5789   5724   
3          3        9198   5909   
4          4       33252  21214   
...      ...         ...    ...   
11203  11203       18250  35309   
11204  11204        3200   2130   
11205  11205       37366  41971   
11206  11206        1781  33089   
11207  11207       26372  35457   

                                              reviewText  overall  
0      I REALLY wanted this series but I am in SHOCK ... 

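### Single decision tree
Without limiting the tree depth or the minimum leaf size, the decision tree reaches 100% accuracy on the training set.  
This indicates that the tree has grown fully.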

In [6]:
DT = DecisionTreeClassifier(random_state = 0)
DT.fit(train_x, train_y)
pred_train_y = DT.predict(train_x)
#test_y = DT.predict(test_x)
print(pred_train_y.shape)
print("the accuracy of train result is %.6f" %(DT.score(train_x, train_y)))
print("the accuracy of train result by acc_score is %.6f" %(accuracy_score(train_y, pred_train_y)))
#print(len(test_y))
print(pred_train_y)

(57039,)
the accuracy of train result is 1.000000
the accuracy of train result by acc_score is 1.000000
[0 0 1 ... 0 0 1]


#### Checking class imbalance

In [7]:
print(train_df.label.value_counts())
print(len(train_df.label[train_df.label == 0])/len(train_df))

0    44137
1    12902
Name: label, dtype: int64
0.7738038885674714


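The data is heavily imbalanced: predicting 0 for every sample already gives about 77% accuracy.  
The results below show that, because of the severe imbalance, the classifier simply assigns everything to the majority label.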
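
The 77% figure is simply the majority-class share. A small sketch using the class counts printed above makes the baseline explicit; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 44137, 1: 12902}
total = sum(counts.values())

# Accuracy of a model that always predicts the majority class.
majority_acc = max(counts.values()) / total

# Per-class recall of that constant predictor: 1.0 on class 0, 0.0 on class 1.
recall = {0: 1.0, 1: 0.0}
balanced_acc = sum(recall.values()) / len(recall)

print(total, majority_acc, balanced_acc)
```

Any accuracy close to the majority share should therefore be read with caution, which is one reason the task is evaluated with AUC.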

In [8]:
trai_x, veri_x, trai_y, veri_y = train_test_split(train_x, train_y, test_size=0.2, random_state=0)

DT = DecisionTreeClassifier(random_state = 4, min_samples_leaf = 10, max_depth = 3)
DT.fit(trai_x, trai_y)
# Note: the "train" accuracy below is computed on the full data set (train_x), not only on trai_x
pred_train_y = DT.predict(train_x)
pred_veri_y = DT.predict(veri_x)
prob_y = DT.predict_proba(veri_x)
#test_y = DT.predict(test_x)
print(pred_train_y.shape)
print("the accuracy of train result is %.6f" %(DT.score(train_x, train_y)))
print("the accuracy of train result by acc_score is %.6f" %(accuracy_score(train_y, pred_train_y)))
print("the accuracy of verify result is %.6f" %(accuracy_score(veri_y, pred_veri_y)))
print("the AUC of verify is %.6f" %(roc_auc_score(veri_y, prob_y[:,1])))
print(pred_train_y)
print(pred_train_y[1000:1030])

(57039,)
the accuracy of train result is 0.773804
the accuracy of train result by acc_score is 0.773804
the accuracy of verify result is 0.776736
the AUC of verify is 0.722470
[0 0 0 ... 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


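#### Balancing the data
Try resampling the data to balance the labels. (I later found that the library has a parameter that balances the data automatically.)  
After balancing, accuracy drops, but the classifier now actually discriminates: its predictions on the training set are no longer all 0.  
Even so, I found no way to improve accuracy further.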
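
The automatic option mentioned above is presumably `class_weight='balanced'`, which is what later cells pass to the decision tree. scikit-learn documents that heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces the formula with NumPy for the class counts of this data set; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights following the 'balanced' heuristic."""
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Same class counts as the training set: 44137 zeros and 12902 ones.
y_demo = np.array([0] * 44137 + [1] * 12902)
w = balanced_class_weights(y_demo)
print(w)
```

With these weights both classes contribute the same total weight, so no rows have to be discarded as in manual undersampling.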

In [9]:
train_df0 = train_df[train_df.label == 0].sample(n=12902)
df = pd.concat([train_df0, train_df[train_df.label==1]])  # DataFrame.append was removed in pandas 2.0
df = df.sample(frac=1)
print(df.label.value_counts())
trai_x = df.iloc[:, :-3].drop(columns=['reviewText']).values
trai_y = df.label.values
print(trai_y.shape)
test_df = pd.read_csv('./data/test.csv', sep='\t')

trai_x, veri_x, trai_y, veri_y = train_test_split(trai_x, trai_y, test_size=0.2, random_state=0)

DT = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 30, max_depth = 5)
DT.fit(trai_x, trai_y)
pred_train_y = DT.predict(trai_x)
pred_veri_y = DT.predict(veri_x)
prob_y = DT.predict_proba(veri_x)
print(pred_train_y.shape)
print("The accuracy of train result is %.6f" %(DT.score(trai_x, trai_y)))
print("The accuracy of train result by acc_score is %.6f" %(accuracy_score(trai_y, pred_train_y)))
print("The accuracy of verify result is %.6f" %(accuracy_score(veri_y, pred_veri_y)))
print("The AUC of verify is %.6f" %(roc_auc_score(veri_y, prob_y[:,1])))
#print(len(test_y))
#print(test_y)
print(pred_train_y[1000:1030])


0    12902
1    12902
Name: label, dtype: int64
(25804,)
(20643,)
The accuracy of train result is 0.701691
The accuracy of train result by acc_score is 0.701691
The accuracy of verify result is 0.698508
The AUC of verify is 0.723416
[0 1 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 0 0]


Same as the previous cell, except that the manual balancing is done differently, with  
np.random.choice
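
Note that `np.random.choice` samples with replacement by default, so the cell below can pick the same majority row more than once. A hedged alternative that draws each majority row at most once is sketched here; the function `undersample_majority` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    minority, majority = (idx1, idx0) if len(idx1) <= len(idx0) else (idx0, idx1)
    # replace=False guarantees that no majority row is drawn twice.
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 7 + [1] * 3)
Xb, yb = undersample_majority(X_demo, y_demo)
print(Xb.shape, yb)
```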

In [10]:
tra_x = train_df.drop(columns=['reviewText','votes_all','votes_up'])
print(tra_x)
#print(train_x)
train1 = tra_x[tra_x.label==1].values
train0 = tra_x[tra_x.label==0].values
train0i = np.random.choice(len(train0), len(train1))
train0 = train0[train0i]
train = np.append(train0,train1,axis=0)
np.random.shuffle(train)
#tra_x = train[:,2].reshape(-1,1)
tra_x = train[:,:-1]
#tra_x = train
tra_y = train[:,-1]
print(tra_x.shape)
print(train1.shape, train0.shape)
DT = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 30, max_depth = 5)
DT.fit(tra_x, tra_y)
pred_train_y = DT.predict(tra_x)
#test_y = DT.predict(test_x[:,2])
print(pred_train_y.shape)
print("the accuracy of train result is %.6f" %(DT.score(tra_x, tra_y)))
print("the accuracy of train result by acc_score is %.6f" %(accuracy_score(tra_y, pred_train_y)))
#print(len(test_y))
#print(test_y)

       reviewerID   asin  overall  label
0            7885   3901      5.0      0
1           52087  47978      3.0      0
2            5701   3667      4.0      1
3           47191  40892      1.0      0
4           40957  15367      5.0      0
...           ...    ...      ...    ...
57034       58315  29374      2.0      0
57035       23328  45548      5.0      0
57036       27203  42453      3.0      0
57037       33992  44891      5.0      0
57038       27478  19198      2.0      1

[57039 rows x 4 columns]
(25804, 3)
(12902, 4) (12902, 4)
(25804,)
the accuracy of train result is 0.699194
the accuracy of train result by acc_score is 0.699194


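#### Feature engineering
Construct features by replacing the product and user IDs with the frequency of each product ID and user ID.  
The effect is again mediocre.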
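
One way to keep this encoding reproducible is to compute the counts once from the original training IDs and reuse the same table for validation and test rows. The sketch below does this with `collections.Counter`; the helper names and the choice of 0 for unseen IDs are assumptions, not part of the notebook.

```python
from collections import Counter

def fit_frequency(values):
    """Count how often each ID occurs in the training data."""
    return Counter(values)

def apply_frequency(values, counts):
    """Replace each ID by its training count; unseen IDs get 0."""
    return [counts.get(v, 0) for v in values]

train_ids = [3901, 47978, 3901, 3667]      # hypothetical product IDs
asin_counts = fit_frequency(train_ids)
train_encoded = apply_frequency(train_ids, asin_counts)
test_encoded = apply_frequency([3901, 99999], asin_counts)
print(train_encoded, test_encoded)
```

Keeping the count table separate from the data frame also avoids encoding a column twice when a cell is rerun.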

In [11]:
asin = dict(train_df['asin'].value_counts())
asin_count = train_df['asin'].apply(lambda x: asin[x])
reviewid = dict(train_df['reviewerID'].value_counts())
reviewid_count = train_df['reviewerID'].apply(lambda x: reviewid[x])
train_df['asin'] = asin_count
train_df['reviewerID'] = reviewid_count
train_df

Unnamed: 0,reviewerID,asin,reviewText,overall,votes_up,votes_all,label
0,33,5,"First off, allow me to correct a common mistak...",5.0,6,7,0
1,6,4,I am really troubled by this Story and Enterta...,3.0,99,134,0
2,29,1,A near-perfect film version of a downright glo...,4.0,14,14,1
3,6,4,Keep your expectations low. Really really low...,1.0,4,7,0
4,6,8,"""they dont make em like this no more...""well.....",5.0,3,6,0
...,...,...,...,...,...,...,...
57034,15,21,"If you like beautifully shot, well acted films...",2.0,12,21,0
57035,27,1,This is a great set of films Wayne did Fox and...,5.0,15,18,0
57036,10,3,It's what's known as a comedy of manners. It's...,3.0,4,5,0
57037,8,1,Ellen can do no wrong as far a creating wonder...,5.0,4,5,0


In [12]:
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText'])
print(train_x)
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_x.shape)
print(train_y.shape)

       reviewerID  asin  overall
0              33     5      5.0
1               6     4      3.0
2              29     1      4.0
3               6     4      1.0
4               6     8      5.0
...           ...   ...      ...
57034          15    21      2.0
57035          27     1      5.0
57036          10     3      3.0
57037           8     1      5.0
57038           9     1      2.0

[57039 rows x 3 columns]
(57039, 3)
(57039,)


In [13]:
trai_x, veri_x, trai_y, veri_y = train_test_split(train_x, train_y, test_size=0.2, random_state=0)

DT = DecisionTreeClassifier(random_state = 4,class_weight = 'balanced', min_samples_leaf = 10, max_depth = 3)
DT.fit(trai_x, trai_y)
pred_train_y = DT.predict(train_x)
pred_veri_y = DT.predict(veri_x)
prob_y = DT.predict_proba(veri_x)
#test_y = DT.predict(test_x)
print(pred_train_y.shape)
print("The accuracy of train result is %.6f" %(DT.score(train_x, train_y)))
print("The accuracy of train result by acc_score is %.6f" %(accuracy_score(train_y, pred_train_y)))
print("The accuracy of verify result is %.6f" %(accuracy_score(veri_y, pred_veri_y)))
print("The AUC of verify is %.6f" %(roc_auc_score(veri_y, prob_y[:,1])))
#print(pred_train_y)
#print(pred_train_y[1000:1030])

(57039,)
The accuracy of train result is 0.693771
The accuracy of train result by acc_score is 0.693771
The accuracy of verify result is 0.700824
The AUC of verify is 0.786613


#### Trying to analyze reviewText  
Besides the decision tree, naive Bayes and SVM are also tried.  
The results are again mediocre.
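
To make the shape of the feature matrix in the next cells easier to interpret, here is a small pure-Python sketch of a bag-of-words count matrix. It mimics the default behavior of `CountVectorizer` (lowercasing, tokens of at least two word characters), but it is a simplified illustration, not the library implementation, and `bag_of_words` is a hypothetical name.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words with at least two characters

def bag_of_words(docs):
    """Return the sorted vocabulary and one row of token counts per document."""
    tokenized = [TOKEN.findall(d.lower()) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for t, c in Counter(toks).items():
            row[index[t]] = c
        rows.append(row)
    return vocab, rows

vocab, rows = bag_of_words(["Really really low", "A low bar"])
print(vocab, rows)
```

Each distinct token becomes one column, which is why the review corpus produces a matrix with more than 130,000 columns.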

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB

train_x = train_df.reviewText.values
train_y = train_df.label.values
train_x, verify_x, train_y, verify_y = train_test_split(train_x, train_y, test_size=0.2, random_state=0)
vectorizer = CountVectorizer()
train_x = vectorizer.fit_transform(train_x)
verify_x = vectorizer.transform(verify_x)
train_x.shape, verify_x.shape

((45631, 136519), (11408, 136519))

In [15]:
DT = DecisionTreeClassifier(class_weight='balanced', min_samples_leaf = 10, max_depth = 1)
DT.fit(train_x, train_y)
pred_train_y = DT.predict(train_x)
pred_verify_y = DT.predict(verify_x)
prob_y = DT.predict_proba(verify_x)
print(pred_train_y.shape)
print("The accuracy of train result is %.6f" %(DT.score(train_x, train_y)))
print("The accuracy of train result by acc_score is %.6f" %(accuracy_score(train_y, pred_train_y)))
print("The accuracy of verify result is %.6f" %(accuracy_score(verify_y, pred_verify_y)))
print("The AUC of verify is %.6f" %(roc_auc_score(verify_y, prob_y[:,1])))
print(pred_train_y)

(45631,)
The accuracy of train result is 0.693717
The accuracy of train result by acc_score is 0.693717
The accuracy of verify result is 0.684257
The AUC of verify is 0.582870
[0 1 0 ... 0 0 0]


In [16]:
BNB = BernoulliNB()
BNB.fit(train_x,train_y)
pred_train_y = BNB.predict(train_x)
pred_verify_y = BNB.predict(verify_x)
prob_y = BNB.predict_proba(verify_x)
verify_acc = accuracy_score(pred_verify_y, verify_y)
train_acc = accuracy_score(pred_train_y, train_y)
print("Accuracy train by BernoulliNB is : %.6f" %(train_acc))
print("Accuracy verify by BernoulliNB is : %.6f" %(verify_acc))
print("The AUC by BernoulliNB of verify is %.6f" %(roc_auc_score(verify_y, prob_y[:,1])))

Accuracy train by BernoulliNB is : 0.747540
Accuracy verify by BernoulliNB is : 0.724492
The AUC by BernoulliNB of verify is 0.706235


In [None]:
SVM = SVC(probability=True)
SVM.fit(train_x, train_y)
pred_train_y = SVM.predict(train_x)
pred_y = SVM.predict(verify_x)
prob_y = SVM.predict_proba(verify_x)
print("Accuracy train by svm is %.6f " %(accuracy_score(pred_train_y, train_y)))
print("Accuracy verify by svm is %.6f" %(accuracy_score(pred_y, verify_y))) 
print("AUC of verify by svm is %.6f" %(roc_auc_score(verify_y, prob_y[:,1])))



### SVM only
Using an SVM on its own.


In [17]:
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText'])
print(train_x.columns)
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_x.shape)
print(train_y.shape)
test_df = pd.read_csv('./data/test.csv', sep='\t')


SVM = SVC()
SVM.fit(train_x, train_y)
pred_train_y = SVM.predict(train_x)
#
#test_y = SVM.predict(test_x)
print("the accuracy of train result by svm is %.6f" %(accuracy_score(pred_train_y, train_y)))
#print(len(test_y))
#print(test_y)

Index(['reviewerID', 'asin', 'overall'], dtype='object')
(57039, 3)
(57039,)




the accuracy of train result by svm is 0.780010


In [19]:
import collections

train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_y.shape)
test_df = pd.read_csv('./data/test.csv', sep='\t')
#train_x
test_x = test_df.drop(columns=['Id','reviewText'])

train_x, verify_x, train_y, verify_y = train_test_split(train_x, train_y, test_size = 0.2, random_state = 0)
train_x.shape, verify_x.shape, train_y.shape, verify_y.shape

(57039,)


((45631, 3), (11408, 3), (45631,), (11408,))

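### Bagging implementation  
* Data balancing  
During exploration I found that imbalanced training labels push the base classifiers toward the majority class, so the Bagging implementation can balance the label ratio during sampling when the original labels are imbalanced. (The two settings are compared experimentally below.)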

In [20]:
class Bagging(object):   
    def __init__(self, sample_num=None, sample_size=None, algo='DT', class_weight=None, max_depth=None, min_samples_leaf=None, C=1.0, gamma='auto', kernel='rbf', balance=1):
        self.algo = algo
        self.sample_num = sample_num
        self.sample_size = sample_size
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.C = C  # svc parameter
        self.gamma = gamma   # svc parameter
        self.kernel = kernel   # svc parameter
        self.p_num = 0  # population number
        self.classes_num = {}  # number of samples per class, e.g. for binary labels 0/1 and sample_size=100 this may be {0: 30, 1: 70}
        self.algo_model = [] # store the batch of algo model.
        self.p0 = []   # rows of the population with label 0
        self.p1 = []   # rows of the population with label 1
        self.p = []
        self.balance = balance
        self.acc = []
        self.class_weight = class_weight
        
    def b_sampling_class(self):  # sample with data balance, make sure use the minimum data volume
        less = len(self.p0) if len(self.p0) < len(self.p1) else len(self.p1)
        if (self.sample_size/2)/less > 1:  # the per-class sample size exceeds the size of the smaller class
            s0i = np.random.choice(len(self.p0), less)
            s1i = np.random.choice(len(self.p1), less)
        else:
            s0i = np.random.choice(len(self.p0), int(self.sample_size/2))
            s1i = np.random.choice(len(self.p1), int(self.sample_size/2))
        s0 = self.p0[s0i]
        s1 = self.p1[s1i]
        s = np.append(s0, s1, axis=0)
        np.random.shuffle(s)
        #print("in balance sampling, s shape", end=':')
        #print(s.shape)
        #print(collections.Counter(s[:,-1]))
        return s

    def sampling_class(self):
        si = np.random.choice(self.p_num, self.sample_size)
        s = self.p[si]
        #print("in non-balance sampling, s shape", end=' :')
        #print(s.shape)
        #print(collections.Counter(s[:,-1]))
        return s
    
    def sampling(self):
        if self.balance == 1:
            return self.b_sampling_class()
        else:
            return self.sampling_class()
                
    def model_gen(self, clf, S):
        self.algo_model.append(clf.fit(S[:,:-1], S[:,-1]))
        prd = clf.predict(S[:,:-1])
        self.acc.append(accuracy_score(prd, S[:,-1]))

    def sampling_algo(self):
        for s in range(self.sample_num):
            if self.algo == 'DT':
                clf = DecisionTreeClassifier(class_weight=self.class_weight, max_depth=self.max_depth, min_samples_leaf = self.min_samples_leaf)
                S = self.sampling()
                self.model_gen(clf, S)
            else:
                clf = SVC(class_weight=self.class_weight, probability=True, C=self.C, kernel=self.kernel, gamma=self.gamma)
                S = self.sampling()
                self.model_gen(clf, S)
        print("Accuracy mean of trainning data is : %.6f" %(np.mean(self.acc)))
            
    def fit(self, X, Y):
        self.p_num = len(X)
        self.p =  np.append(X, Y.reshape(-1,1), axis=1)  # Population, append by row, left + right
        self.p0 = self.p[self.p[:,-1] == 0]
        self.p1 = self.p[self.p[:,-1] == 1]
        self.sampling_algo()
            
    def major_num(self, batch_pred):
        pred = []
        for d in batch_pred:
            pred.append(0) if np.sum(d==0) >= np.sum(d==1) else pred.append(1)
        return np.array(pred)
            
    def predict(self, X):
        batch_pred = []
        batch_prob = []
        for m in self.algo_model:
            batch_pred.append(m.predict(X))
            batch_prob.append(m.predict_proba(X)[:,1])
        batch_pred = np.array(batch_pred).T
        batch_prob = np.array(batch_prob).T
        pred = self.major_num(batch_pred)
        print(batch_prob)
        print(batch_prob.shape)
        prob = np.sum(batch_prob, axis =  1)/self.sample_num
        print(prob)
        return prob, pred
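
The `predict` method above combines the base models in two ways: a majority vote for the hard label and an average of the positive-class probabilities for the score used in the AUC. A vectorized sketch of the same two rules is given below; `combine_votes` and `combine_probs` are hypothetical names, and ties go to class 0 as in `major_num`.

```python
import numpy as np

def combine_votes(batch_pred):
    """Majority vote over columns; batch_pred has shape (n_samples, n_models)."""
    ones = batch_pred.sum(axis=1)
    # Class 1 wins only with a strict majority, so ties fall to class 0.
    return (ones > batch_pred.shape[1] / 2).astype(int)

def combine_probs(batch_prob):
    """Average the positive-class probabilities of all models."""
    return batch_prob.mean(axis=1)

batch_pred = np.array([[0, 1, 1, 1],
                       [0, 0, 1, 1],
                       [0, 0, 0, 1]])
batch_prob = np.array([[0.2, 0.4],
                       [0.6, 0.8]])
votes = combine_votes(batch_pred)
probs = combine_probs(batch_prob)
print(votes, probs)
```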

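#### bagging + DT + balanced data
The results below show that, with balanced data, the individual classifiers produce different results.  
We can adjust the number of samples drawn (sample_num) and the sample size (sample_size) to observe how the results change.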

In [21]:
bag = Bagging(algo='DT', sample_num=1000, sample_size=1000, max_depth=3, min_samples_leaf=10,balance=1)
print(train_x.shape, train_y.shape)
print(verify_x.shape)
bag.fit(train_x, train_y)
prob_y, pred_y = bag.predict(verify_x)
print(len(bag.algo_model))
#pred_y = np.array(pred_y)
#print(b_pred_y)
#print(b_pred_y.shape)
#print(b_pred_y[0:20,0:20])
#print(pred_y.shape)
#print(pred_y[0:50])
print("The validation set accuracy is :%.6f" %(accuracy_score(pred_y, verify_y)))
print('The AUC is %.5f' %(roc_auc_score(verify_y, prob_y)))
print(prob_y.shape)
print(pred_y[0:30])

(45631, 3) (45631,)
(11408, 3)
Accuracy mean of trainning data is : 0.729312
[[0.07042254 0.1875     0.14358974 ... 0.43225806 0.09473684 0.        ]
 [0.51851852 0.43478261 0.61111111 ... 0.72294372 0.48571429 0.27941176]
 [0.39759036 0.61090909 0.44565217 ... 0.43225806 0.4375     0.37068966]
 ...
 [0.16666667 0.06470588 0.14358974 ... 0.0952381  0.48571429 0.27941176]
 [0.07042254 0.1875     0.14358974 ... 0.43225806 0.09473684 0.08988764]
 [0.07042254 0.06470588 0.02777778 ... 0.0106383  0.09473684 0.06451613]]
(11408, 1000)
[0.12483438 0.46904604 0.46763098 ... 0.20279603 0.1550008  0.04498955]
1000
The validation set accuracy is :0.686536
The AUC is 0.79749
(11408,)
[0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1]


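#### bagging + DT + unbalanced data
Without balancing, the data is so skewed that the mean accuracy is about 78%. The classifiers find that crudely assigning almost all samples to the majority label gives the best result (repeated experiments show it is not literally all of them: a very small number of samples receive a single vote for 1). Such a model loses its ability to generalize.  
The AUC, however, barely changes, which shows that AUC is insensitive to class imbalance.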
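
One way to see why the two metrics behave differently: accuracy depends on where the scores fall relative to a fixed threshold, while AUC depends only on how the samples are ranked. The sketch below uses hypothetical scores and a hypothetical shrink factor of 0.4 to mimic a model trained on unbalanced data that outputs systematically lower probabilities.

```python
import numpy as np

# Hypothetical positive-class scores from a model trained on balanced samples.
scores_balanced = np.array([0.1, 0.6, 0.55, 0.3, 0.2, 0.05])
# The same ranking with systematically lower probabilities.
scores_shrunk = 0.4 * scores_balanced

# The ordering of the samples is identical, so rank-based metrics do not change.
same_ranking = np.array_equal(np.argsort(scores_balanced), np.argsort(scores_shrunk))

# Hard predictions at a fixed 0.5 threshold differ completely.
pred_balanced = (scores_balanced >= 0.5).astype(int)
pred_shrunk = (scores_shrunk >= 0.5).astype(int)
print(same_ranking, pred_balanced, pred_shrunk)
```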

In [22]:
bag = Bagging(algo='DT', sample_num=1000, sample_size=1000, max_depth=3, min_samples_leaf=10,balance=0)
print(train_x.shape, train_y.shape)
bag.fit(train_x, train_y)
prob_y, pred_y = bag.predict(verify_x)
print(len(bag.algo_model))
print("The validation data accuracy is :%.6f" %(accuracy_score(pred_y, verify_y)))
print('The AUC is %.5f' %(roc_auc_score(verify_y, prob_y)))
print(pred_y[0:30])

(45631, 3) (45631,)
Accuracy mean of trainning data is : 0.777966
[[0.0877193  0.03065134 0.         ... 0.07446809 0.01895735 0.00621118]
 [0.26666667 0.27419355 0.18181818 ... 0.27272727 0.1875     0.15625   ]
 [0.2578125  0.06666667 0.1572327  ... 0.215      0.21794872 0.21264368]
 ...
 [0.02164502 0.06818182 0.07086614 ... 0.01515152 0.08759124 0.04458599]
 [0.0877193  0.03065134 0.         ... 0.07446809 0.01895735 0.00621118]
 [0.02164502 0.03065134 0.         ... 0.01515152 0.01895735 0.00621118]]
(11408, 1000)
[0.03860659 0.20486737 0.20463209 ... 0.06571579 0.0484632  0.01508678]
1000
The validation data accuracy is :0.776736
The AUC is 0.79741
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


#### bagging + svm + balanced data
Because SVM training is slow, the number of samples drawn and the sample size are both kept fairly small.

In [23]:
bag = Bagging(algo='svc', sample_num=100, sample_size=1000, C=0.8, kernel='rbf', balance=1)
print(train_x.shape, train_y.shape)
bag.fit(train_x, train_y)
prob_y, pred_y = bag.predict(verify_x)
print(len(bag.algo_model))
print("The validation data accuracy is :%.6f" %(accuracy_score(pred_y, verify_y)))
print("The AUC is %.6f" %(roc_auc_score(verify_y, prob_y)))
print(pred_y[0:30])

(45631, 3) (45631,)
Accuracy mean of trainning data is : 0.800640
[[0.31144696 0.30822152 0.33719356 ... 0.18057857 0.24780517 0.25142775]
 [0.46500364 0.62662928 0.45081123 ... 0.32184763 0.4197097  0.19753491]
 [0.42967928 0.30756566 0.22708025 ... 0.20864039 0.28507324 0.739645  ]
 ...
 [0.5        0.17344801 0.22289104 ... 0.24824761 0.29722095 0.35507473]
 [0.28068014 0.27091348 0.2154418  ... 0.16989891 0.34907726 0.34077927]
 [0.34879029 0.28525567 0.32919831 ... 0.31621707 0.31779101 0.30071139]]
(11408, 100)
[0.28244934 0.48800376 0.39454834 ... 0.25786014 0.28376654 0.30563281]
100
The validation data accuracy is :0.707223
The AUC is 0.784454
[0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1]


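### bagging + DT + feature modify
Using the feature-engineered data (occurrence counts of product and user IDs) together with class balancing, the AUC improves by 9%. Note that `train_df` was already converted to counts in cell 11, so this cell encodes the counts a second time, as the `value_counts` output below shows.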

In [24]:
print(train_df.asin.value_counts())
asin = dict(train_df['asin'].value_counts())
asin_count = train_df['asin'].apply(lambda x: asin[x])
reviewid = dict(train_df['reviewerID'].value_counts())
reviewid_count = train_df['reviewerID'].apply(lambda x: reviewid[x])
train_df['asin'] = asin_count
train_df['reviewerID'] = reviewid_count
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText'])
print(train_x)
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_x.shape)
print(train_y.shape)
trai_x, veri_x, trai_y, veri_y = train_test_split(train_x, train_y, test_size=0.2, random_state=0)

bag = Bagging(algo='DT', class_weight='balanced', sample_num=1000, sample_size=1000, max_depth=3, min_samples_leaf=10,balance=0)
print(trai_x.shape, trai_y.shape)
bag.fit(trai_x, trai_y)
prob_y, pred_y = bag.predict(veri_x)
print(len(bag.algo_model))
#pred_y = np.array(pred_y)
#print(b_pred_y)
#print(b_pred_y.shape)
#print(b_pred_y[0:20,0:20])
#print(pred_y.shape)
#print(pred_y[0:50])
print("The validation data accuracy is :%.6f" %(accuracy_score(pred_y, veri_y)))
print("The AUC is %.6f" %(roc_auc_score(veri_y, prob_y)))
print(pred_y[0:30])

1     12143
2      9186
3      6198
4      4384
5      3385
      ...  
48       48
45       45
38       38
36       36
35       35
Name: asin, Length: 65, dtype: int64
       reviewerID   asin  overall
0             759   3385      5.0
1            5484   4384      3.0
2             928  12143      4.0
3            5484   4384      1.0
4            5484   1744      5.0
...           ...    ...      ...
57034        1695    483      2.0
57035         945  12143      5.0
57036        3040   6198      3.0
57037        3816  12143      5.0
57038        3060  12143      2.0

[57039 rows x 3 columns]
(57039, 3)
(57039,)
(45631, 3) (45631,)
Accuracy mean of trainning data is : 0.675112
[[0.06202869 0.11586902 0.0470819  ... 0.21531632 0.         0.05616046]
 [0.52944283 0.11586902 0.18955043 ... 0.55747126 0.35755258 0.40388578]
 [0.22263642 0.60554003 0.5129771  ... 0.48305599 0.4148066  0.42994242]
 ...
 [0.2432662  0.04887675 0.18955043 ... 0.01781451 0.17492984 0.11785929]
 [0.06202869 0

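### bagging + svm + feature modify
SVM may be a strong and relatively stable classifier, so it does not work well in bagging. In the run below every validation sample receives the same averaged probability, which yields an AUC of exactly 0.5.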
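
A possible contributing factor, offered here as an assumption rather than something the notebook verifies, is feature scale. With `gamma='auto'` the RBF kernel uses gamma = 1/n_features, and the count features are in the hundreds or thousands, so the kernel value between any two distinct rows can underflow to zero. The NumPy sketch below illustrates this on three rows taken from the feature table above; `rbf_kernel_matrix` is a hypothetical helper, and standardization is shown only as one common remedy.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) for all row pairs of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

# Three rows (reviewerID count, asin count, overall) from the printed table.
X_raw = np.array([[759.0, 3385.0, 5.0],
                  [5484.0, 4384.0, 3.0],
                  [928.0, 12143.0, 4.0]])
gamma = 1.0 / X_raw.shape[1]  # what gamma='auto' means in SVC

K_raw = rbf_kernel_matrix(X_raw, X_raw, gamma)

# Standardize each column to zero mean and unit variance before the kernel.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
K_std = rbf_kernel_matrix(X_std, X_std, gamma)
print(K_raw)
print(K_std)
```

On the raw features the kernel matrix is numerically the identity, so no row looks similar to any other; after standardization the off-diagonal similarities are clearly nonzero. Whether scaling would rescue this particular experiment is not tested here.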

In [23]:
bag = Bagging(algo='svc', sample_num=10, sample_size=1000, C=0.8, kernel='rbf', balance=1)
print(train_x.shape, train_y.shape)
bag.fit(train_x, train_y)
prob_y, pred_y = bag.predict(verify_x)
print(len(bag.algo_model))
#pred_y = np.array(pred_y)
#print(b_pred_y)
#print(b_pred_y.shape)
#print(b_pred_y[0:20,0:20])
#print(pred_y.shape)
#print(pred_y[0:50])
print("The validation data accuracy is :%.6f" %(accuracy_score(pred_y, verify_y)))
print("The AUC is %.6f" %(roc_auc_score(verify_y, prob_y)))
print(pred_y[0:30])

(57039, 3) (57039,)
Accuracy mean of trainning data is : 0.848400
[[0.4213499  0.42498012 0.43384138 ... 0.41863479 0.42884702 0.4178336 ]
 [0.4213499  0.42498012 0.43384138 ... 0.41863479 0.42884702 0.4178336 ]
 [0.4213499  0.42498012 0.43384138 ... 0.41863479 0.42884702 0.4178336 ]
 ...
 [0.4213499  0.42498012 0.43384138 ... 0.41863479 0.42884702 0.4178336 ]
 [0.4213499  0.42498012 0.43384138 ... 0.41863479 0.42884702 0.4178336 ]
 [0.4213499  0.42498012 0.43384138 ... 0.41863479 0.42884702 0.4178336 ]]
(11408, 10)
[0.4234656 0.4234656 0.4234656 ... 0.4234656 0.4234656 0.4234656]
10
The validation data accuracy is :0.776736
The AUC is 0.500000
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


### AdaBoost.M1

In [25]:
import collections

train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_y.shape)
test_df = pd.read_csv('./data/test.csv', sep='\t')
#train_x
test_x = test_df.drop(columns=['Id','reviewText'])

train_x, verify_x, train_y, verify_y = train_test_split(train_x, train_y, test_size = 0.2, random_state = 0)
train_x.shape, verify_x.shape, train_y.shape, verify_y.shape

(57039,)


((45631, 3), (11408, 3), (45631,), (11408,))

In [26]:
class Adaboost(object):
    def __init__(self, iters, algo='DT', class_weight=None, max_depth=None, min_samples_leaf=None, C=1.0, gamma='auto', kernel='rbf'):  # C=1.0 is SVC's default; C=None is rejected by SVC
        self.iters = iters
        self.algo = algo
        self.class_weight = class_weight
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.C = C
        self.kernel = kernel
        self.gamma = gamma
        self.algo_model = {}
        self.sample_weight = []
        self.W = []
        self.contrast = []  # predict right or error index
        self.result = []
        self.rw = []  # result weight, log(1/beta)
       
    def model_gen(self, P):
        # NOTE: self.class_weight is stored in __init__ but not forwarded to the base estimator here.
        if self.algo == 'DT':
            clf = DecisionTreeClassifier(max_depth=self.max_depth, min_samples_leaf = self.min_samples_leaf)
        else:
            clf = SVC(probability=True, C=self.C, kernel=self.kernel, gamma=self.gamma)
        clf.fit(P[:,:-1], P[:,-1], sample_weight=self.W)
        pred_y = clf.predict(P[:,:-1])
        self.contrast = np.array([pred_y == P[:,-1]]).T  # column vector: True where the prediction is correct
        F_i = np.argwhere(self.contrast == False)[:,0]  # indices of misclassified samples
        epsilon = self.W[F_i].sum()  # weighted error rate
        beta = epsilon/(1-epsilon)
        self.algo_model[beta] = clf  # already fitted above; keyed by beta, so equal betas overwrite each other
        return epsilon, beta
        
    def weight_update(self, epsilon, beta):
        T_i = np.argwhere(self.contrast == True)[:,0]   # true index
        self.W[T_i] = self.W[T_i] * beta
        self.W = self.W/self.W.sum()
                
    def classifier_gen(self, P):
        for m in range(self.iters):
            epsilon, beta = self.model_gen(P)
            if epsilon > 0.5: 
                continue
            self.weight_update(epsilon, beta)
        
    def fit(self, X, Y):
        self.W = np.full(len(X), 1/len(X))  # uniform initial sample weights
        P = np.append(X, Y.reshape(-1,1), axis = 1)
        print(P.shape)
        self.classifier_gen(P)
    
    def predict(self, X):
        result, rw = [], []  # local lists, so predict() can be called more than once
        for beta, model in self.algo_model.items():
            result.append(model.predict_proba(X))
            rw.append(np.log2(1/beta))  # classifier weight log(1/beta)
        self.result = np.array(result)
        self.rw = np.array(rw)
        zero = self.result[:,:,0]
        one = self.result[:,:,1]
        zerow = zero * self.rw.reshape(-1,1)
        onew = one * self.rw.reshape(-1,1)
        zerow = np.sum(zerow, axis = 0)/self.iters
        onew = np.sum(onew, axis = 0)/self.iters
        no = zerow+onew
        zerow = zerow/no  # normalize so the two class scores sum to 1
        onew = onew/no
        # 0.475 is the decision threshold used in the original experiment
        pred = (onew >= 0.475).astype(int)
        return pred, onew
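
To make the weight update in `model_gen` and `weight_update` concrete, here is a worked single round of AdaBoost.M1 on five samples. The function `adaboost_m1_round` is a hypothetical helper that follows the same formulas as the class above: epsilon is the weighted error, beta = epsilon / (1 - epsilon), correctly classified samples are multiplied by beta, and the weights are renormalized.

```python
import numpy as np

def adaboost_m1_round(weights, correct):
    """One AdaBoost.M1 weight update; `correct` marks correctly classified samples."""
    epsilon = weights[~correct].sum()           # weighted error rate
    beta = epsilon / (1.0 - epsilon)
    new_w = np.where(correct, weights * beta, weights)
    new_w = new_w / new_w.sum()                 # renormalize to a distribution
    return epsilon, beta, new_w

w0 = np.full(5, 0.2)                            # uniform initial weights
correct = np.array([True, True, True, True, False])
eps, beta, w1 = adaboost_m1_round(w0, correct)
vote_weight = np.log(1.0 / beta)                # weight of this classifier in the final vote
print(eps, beta, w1, vote_weight)
```

After the update the single misclassified sample carries half of the total weight, which forces the next base classifier to focus on it. The class above uses a base-2 logarithm for the vote weight; because the class scores are normalized afterwards, the choice of base does not affect the result.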

### adaboost.m1 + Decision Tree

In [27]:
ada = Adaboost(algo='DT', iters=1000, class_weight='balanced', max_depth=1, min_samples_leaf=10)
#ada = Adaboost(algo='DT', iters=10, max_depth=3, min_samples_leaf=10)
print(train_x.shape, train_y.shape)
ada.fit(train_x, train_y)
pred_y, prob_y = ada.predict(verify_x)
print(pred_y[0:30])
print("Accuracy is %.6f" %(accuracy_score(pred_y, verify_y)))
print(prob_y)
print("auc is %.6f " %(roc_auc_score(verify_y, prob_y)))

(45631, 3) (45631,)
(45631, 4)
[0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1]
Accuracy is 0.734309
[0.37632321 0.40385517 0.47192363 ... 0.38023906 0.38731852 0.34582602]
auc is 0.793306 


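### adaboost.m1 + DT + feature modify
In bagging, feature engineering clearly improved the evaluation metric, so it is tried with AdaBoost as well. Note that the training cell below fits on the full `train_x` (57039 rows) rather than on the `trai_x` split.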

In [28]:
print(train_df.asin.value_counts())
asin = dict(train_df['asin'].value_counts())
asin_count = train_df['asin'].apply(lambda x: asin[x])
reviewid = dict(train_df['reviewerID'].value_counts())
reviewid_count = train_df['reviewerID'].apply(lambda x: reviewid[x])
train_df['asin'] = asin_count
train_df['reviewerID'] = reviewid_count
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText'])
print(train_x)
train_x = train_df.iloc[:, :-3].drop(columns=['reviewText']).values
train_y = train_df.label.values
print(train_x.shape)
print(train_y.shape)
trai_x, veri_x, trai_y, veri_y = train_test_split(train_x, train_y, test_size=0.2, random_state=0)

12143    12143
9186      9186
6198      6198
4384      4384
3385      3385
         ...  
48          48
45          45
38          38
36          36
35          35
Name: asin, Length: 64, dtype: int64
       reviewerID   asin  overall
0             759   3385      5.0
1            5484   4384      3.0
2             928  12143      4.0
3            5484   4384      1.0
4            5484   1744      5.0
...           ...    ...      ...
57034        1695    483      2.0
57035         945  12143      5.0
57036        3040   6198      3.0
57037        3816  12143      5.0
57038        3060  12143      2.0

[57039 rows x 3 columns]
(57039, 3)
(57039,)


In [29]:
ada = Adaboost(algo='DT', iters=1000, class_weight='balanced', max_depth=1, min_samples_leaf=10)
print(train_x.shape, train_y.shape)
ada.fit(train_x, train_y)
pred_y, prob_y = ada.predict(verify_x)
print(pred_y[0:30])
print("Accuracy is %.6f" %(accuracy_score(pred_y, verify_y)))
print(prob_y)
print("auc is %.6f " %(roc_auc_score(verify_y, prob_y)))

(57039, 3) (57039,)
(57039, 4)
[0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1]
Accuracy is 0.734309
[0.374124   0.40238259 0.47202788 ... 0.37807378 0.38518329 0.34242225]
auc is 0.793571 


### Comparing against the library implementations  
bagging + DT

In [30]:
from sklearn.ensemble import BaggingClassifier
tree = DecisionTreeClassifier(class_weight='balanced', max_depth=1, min_samples_leaf = 10)
clf = BaggingClassifier(tree, n_estimators=1000, max_samples=1.0, bootstrap=True, bootstrap_features=False, random_state=0)  # base estimator passed positionally; its keyword name differs across scikit-learn versions
clf.fit(train_x, train_y)
pred_y = clf.predict(verify_x)
prob_y = clf.predict_proba(verify_x)
print(pred_y[0:30])
print("accuracy is %.6f" %(accuracy_score(pred_y, verify_y)))
print('auc is %.6f' %(roc_auc_score(verify_y, prob_y[:,1])))

[0 0 1 0 1 0 1 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 0 1 1]
accuracy is 0.595810
auc is 0.701907


In [31]:
from sklearn.ensemble import AdaBoostClassifier
tree = DecisionTreeClassifier(class_weight='balanced', max_depth = 1, min_samples_leaf = 10)
clf = AdaBoostClassifier(tree, n_estimators=1000, learning_rate=0.9)  # base estimator passed positionally; its keyword name differs across scikit-learn versions
clf.fit(train_x, train_y)
pred_y = clf.predict(verify_x)
prob_y = clf.predict_proba(verify_x)
print(pred_y[0:30])
print("accuracy is %.6f" %(accuracy_score(pred_y, verify_y)))
auc = roc_auc_score(verify_y, prob_y[:,1])
print(auc)

[0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1]
accuracy is 0.689078
0.8006870673345394
