* Kaggle sentiment-analysis competition: Bag of Words Meets Bags of Popcorn
* Competition page: https://www.kaggle.com/c/word2vec-nlp-tutorial#description

# Data and Feature Processing

## I. Data Preprocessing

### 1. Load Libraries

In [1]:
import pandas as pd
import numpy as np

### 2. Read and Inspect the Data

In [2]:
root_dir = "data"
# Load the datasets
train = pd.read_csv('%s/%s' % (root_dir, 'labeledTrainData.tsv'), header=0, delimiter="\t", quoting=1)
unlabel_train = pd.read_csv('%s/%s' % (root_dir, 'unlabeledTrainData.tsv'), header=0, delimiter="\t", quoting=3)   
test = pd.read_csv('%s/%s' % (root_dir, 'testData.tsv'), header=0, delimiter="\t", quoting=1)

print(train.shape)
print(train.columns.values)
print(unlabel_train.shape)
print(unlabel_train.columns.values)
print(test.shape)
print(test.columns.values)

print(train.head(3))
print(unlabel_train.head(3))
print(test.head(3))

(25000, 3)
['id' 'sentiment' 'review']
(50000, 2)
['id' 'review']
(25000, 2)
['id' 'review']
       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
          id                                             review
0   "9999_0"  "Watching Time Chasers, it obvious that it was...
1  "45057_0"  "I saw this film about 20 years ago and rememb...
2  "15561_0"  "Minor Spoilers<br /><br />In New York, Joan B...
         id                                             review
0  12311_10  Naturally in a film who's main themes are of m...
1    8348_2  This movie is a disaster within a disaster fil...
2    5828_4  All in all, this is a movie for kids. We saw i...


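The raw data shows that:
* 1. labeledTrainData is used for model training; unlabeledTrainData is used to learn Word2vec features; testData is used to produce the predictions for submission.
* 2. The text was scraped from the web and still contains HTML markup.

### 3. Strip HTML Tags and Digits, Lowercase Everything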

In [3]:
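import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call.
# Requires nltk and its "stopwords" corpus (nltk.download("stopwords")).
english_stopwords = set(stopwords.words("english"))

def review_to_wordlist(review):
    '''
    Convert an IMDB review into a list of words.
    '''
    # 1. Strip the HTML tags and keep the text content
    review_text = BeautifulSoup(review, "html.parser").get_text()

    # 2. Replace every non-letter character (digits, punctuation) with a space
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Lowercase everything and split into a list of words
    words_list = review_text.lower().split()

    # 4. Remove stop words
    words = [word for word in words_list if word not in english_stopwords]

    return words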

# Preprocess the data
label = train['sentiment']
train_data = []
for i in range(len(train['review'])):
    train_data.append(' '.join(review_to_wordlist(train['review'][i])))
    
unlable_data = []    
for i in range(len(unlabel_train['review'])):
    unlable_data.append(' '.join(review_to_wordlist(unlabel_train['review'][i])))   
    
test_data = []
for i in range(len(test['review'])):
    test_data.append(' '.join(review_to_wordlist(test['review'][i])))

# Preview the data
print(train_data[0], '\n')
print(unlable_data[0], '\n')
print(test_data[0])

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working

* Check the dataset sizes

In [4]:
print(len(train_data),len(unlable_data),len(test_data))

25000 50000 25000


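## II. Feature Engineering

Text must be converted into vectors before modeling. Common ways to vectorize text include:

* 1. Word counts
* 2. TF-IDF vectors
* 3. Word2vec vectors

### 1. Count Vectors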

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(
    max_features=4000,   # keep only the 4000 most frequent terms
    analyzer='word',     # build features from word n-grams
    ngram_range=(1,2),   # unigrams and bigrams
    stop_words = 'english')

# Combine the training and test sets so both share one vocabulary
data_all = train_data + test_data
len_train = len(train_data)

count_vec.fit(data_all)
data_all = count_vec.transform(data_all)

# Split back into the training and test parts
count_train_x = data_all[:len_train]
count_test_x = data_all[len_train:]

print('Count vectorization done.')

print("train: \n", np.shape(count_train_x[0]))
print("test: \n", np.shape(count_test_x[0]))

Count vectorization done.
train: 
 (1, 4000)
test: 
 (1, 4000)


In [6]:
count_train_x.shape

(25000, 4000)

In [7]:
count_test_x.shape

(25000, 4000)

* Count features
* count_train_x
* count_test_x
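
To make the count features concrete, here is a minimal pure-Python sketch of counting unigrams and bigrams, the idea behind `ngram_range=(1,2)` and `max_features`. The helper names `ngram_counts` and `vectorize` and the toy documents are illustrative assumptions; this is not scikit-learn's implementation, which also handles tokenization patterns, stop words and sparse storage.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams in a whitespace-tokenized string (hypothetical helper)."""
    tokens = text.split()
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def vectorize(docs, max_features=None):
    """Keep the most frequent n-grams as the vocabulary and return one count row per document."""
    per_doc = [ngram_counts(d) for d in docs]
    total = Counter()
    for c in per_doc:
        total.update(c)
    # Keep the max_features most frequent terms, then order them alphabetically
    vocab = sorted(t for t, _ in total.most_common(max_features))
    rows = [[c[t] for t in vocab] for c in per_doc]
    return vocab, rows

toy_docs = ["great movie great cast", "boring movie"]
vocab, rows = vectorize(toy_docs)
print(vocab)
print(rows)
```

Each row has one column per vocabulary term, which is why both `count_train_x` and `count_test_x` end up with 4000 columns once `max_features=4000` is applied.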

### 2. TF-IDF Vectors

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
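"""
min_df: minimum document frequency; with min_df=2, terms that appear in fewer than 2 documents are ignored
max_features: None by default; if an int, terms are ranked by corpus-wide term frequency and only the top max_features are kept as the vocabulary
strip_accents: 'ascii' or 'unicode'; removes accents from the raw documents during preprocessing
analyzer: whether features are built from word n-grams ('word') or character n-grams ('char', 'char_wb')
token_pattern: regular expression that defines a token; only used when analyzer == 'word'. The default pattern selects tokens of 2 or more alphanumeric characters; punctuation is treated as a token separator and never becomes a token
ngram_range: lower and upper bound on the n-gram length
use_idf: True by default, giving tf*idf weights; if False, inverse-document-frequency reweighting is disabled and only tf is used (rows are still L2-normalized by default, unlike the raw counts from CountVectorizer)
smooth_idf: True by default, idf = ln((n_docs + 1) / (doc_freq + 1)) + 1; if False, idf = ln(n_docs / doc_freq) + 1
sublinear_tf: False by default; if True, tf is replaced with 1 + log(tf)
stop_words: 'english' uses the built-in English stop-word list; a list supplies custom stop words; None disables stop-word removal, in which case max_df in [0.7, 1.0) can filter corpus-specific stop words automatically by document frequency
"""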
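tfidf = TFIDF(min_df=2,
           max_features=4000,       # keep only the 4000 most frequent terms
           strip_accents='unicode',
           analyzer='word',
           token_pattern=r'\w{1,}',
           ngram_range=(1,2),       # unigrams and bigrams
           use_idf=True,
           smooth_idf=True,
           sublinear_tf=True,
           stop_words = 'english')  # drop English stop words

# Combine the training and test sets so TF-IDF is fit on one shared vocabulary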
data_all = train_data + test_data
len_train = len(train_data)

tfidf.fit(data_all)
data_all = tfidf.transform(data_all)
# Split back into the training and test parts
tfidf_train_x = data_all[:len_train]
tfidf_test_x = data_all[len_train:]
print('TF-IDF vectorization done.')

print("train: \n", np.shape(tfidf_train_x[0]))
print("test: \n", np.shape(tfidf_test_x[0]))

TF-IDF vectorization done.
train: 
 (1, 4000)
test: 
 (1, 4000)


In [9]:
tfidf_train_x.shape

(25000, 4000)

In [10]:
tfidf_test_x.shape

(25000, 4000)
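
The `smooth_idf` and `sublinear_tf` formulas quoted in the parameter notes of this section can be checked by hand. The sketch below is a plain-Python illustration with hypothetical function names; it covers only the weighting formulas, not the row normalization that `TfidfVectorizer` applies afterwards by default.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf with smooth_idf=True: ln((n + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1

def plain_idf(n_docs, doc_freq):
    """idf with smooth_idf=False: ln(n / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

def sublinear_tf(tf):
    """Term frequency with sublinear_tf=True: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

n_docs = 4
# A term that occurs in every document gets the minimum idf of 1
print(smooth_idf(n_docs, 4))
# A term that occurs in a single document is weighted more heavily
print(smooth_idf(n_docs, 1))
# Weight of a rare term that appears 10 times in one document
print(sublinear_tf(10) * smooth_idf(n_docs, 1))
```

The sublinear scaling damps repeated words: ten occurrences count for roughly 3.3 times one occurrence rather than ten times.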

### 3. Word2vec Vectors

* gensim.models.word2vec.Word2Vec expects each sentence as a list of string tokens, so the data must be tokenized first

#### 3.1 Preparing the Input

In [38]:
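# Tokenize the training data
train_words = []
for i in train_data:
    train_words.append(i.split())
    
# Tokenize the unlabeled data (used only to train the word vectors)
unlable_words = []
for i in unlable_data:
    unlable_words.append(i.split())

# Tokenize the test data
test_words = []
for i in test_data:
    test_words.append(i.split())

# Merge everything into one corpus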
all_words = train_words + unlable_words + test_words

len(all_words)

100000

#### 3.2 Preview the Data

In [39]:
# Preview the data
print(all_words[0])

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

#### 3.3 Train and Save the Word2vec Model

In [13]:
from gensim.models.word2vec import Word2Vec
import os

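# Word-vector training parameters
size = 100      # Word vector dimensionality
min_count = 3   # Minimum word count
num_workers = 4 # Number of threads to run in parallel
window = 10     # Context window size
model_name = '{}size_{}min_count_{}window.model'.format(size, min_count, window)

# Note: this is the gensim 3.x API; gensim 4.0 renamed the `size` argument to `vector_size`
wv_model = Word2Vec(all_words, workers=num_workers, size=size, min_count = min_count,window = window)
# L2-normalize the vectors in place once training is finished; the model cannot be trained further afterwards
wv_model.init_sims(replace=True)

wv_model.save(model_name)  # save the model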



#### 3.4 Load the Word2vec Model

In [36]:
from gensim.models.word2vec import Word2Vec
wv_model = Word2Vec.load("100size_3min_count_10window.model")

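#### 3.5 Building Word2vec Features

The approach here is a little unusual:
* The word vectors of each review are summed and averaged, and the result is used as the feature vector for the machine-learning models. Simple as it is, it works reasonably well.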

In [40]:
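def to_review_vector(review):
    # Collect the vectors of the words that are in the Word2vec vocabulary
    vectors = [wv_model.wv[word] for word in review if word in wv_model.wv]
    if not vectors:
        # No known word: fall back to an all-zero vector
        return pd.Series(np.zeros(100))
    # Average over the words (the original took the mean of a 1x100 array, which is just the sum)
    return pd.Series(np.mean(vectors, axis=0))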

train_data_features = []

for i in train_words:
    train_data_features.append(to_review_vector(i))

test_data_features = []
for i in test_words:
    test_data_features.append(to_review_vector(i))
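
The averaging idea described in this section can be sketched without gensim. The toy two-dimensional vectors and the helper name `average_vector` below are assumptions made for illustration; real Word2vec vectors here have 100 dimensions.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that are in the vocabulary (hypothetical helper)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No in-vocabulary token: return a zero vector of the right size
        return [0.0] * dim
    # zip(*known) walks the vectors column by column
    return [sum(col) / len(known) for col in zip(*known)]

toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0], "bad": [-1.0, 0.0]}
print(average_vector(["good", "film", "unseen"], toy_vectors, dim=2))
```

Dividing by the number of known words rather than the review length keeps out-of-vocabulary words from shrinking the feature vector.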



# Machine Learning Methods

## I. Modeling

### 1. TF-IDF + Naive Bayes + Cross-Validation

In [25]:
# Train the naive Bayes classifier
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.naive_bayes import MultinomialNB as MNB

model_MNB = MNB() # (alpha=1.0, class_prior=None, fit_prior=True)
# Fit on the full training set so the model can be used for prediction below
model_MNB.fit(tfidf_train_x, label)

# Run 5-fold cross-validation once and reuse the scores
scores = cross_val_score(model_MNB, tfidf_train_x, label, cv=5, scoring='roc_auc')
print("Multinomial naive Bayes 5-fold CV scores: ", scores)
print("Multinomial naive Bayes 5-fold CV mean score: ", np.mean(scores))


test_predicted = np.array(model_MNB.predict(tfidf_test_x))

print('Saving results...')

submission_df = pd.DataFrame({'id': test['id'].values, 'sentiment': test_predicted})
print(submission_df.head())
submission_df.to_csv('submission_mnb_tfidf.csv', index = False)

print('Done.')

Multinomial naive Bayes 5-fold CV scores:  [0.92969136 0.93232992 0.93167488 0.93575024 0.92634304]
Multinomial naive Bayes 5-fold CV mean score:  0.9311578880000001
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.


![image.png](attachment:image.png)

![image.png](attachment:image.png)
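
The cross-validation in this notebook is scored with `scoring='roc_auc'`. As a rough illustration of what that number means, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels are assumptions; scikit-learn computes the same quantity more efficiently.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

Because AUC depends only on how the scores rank the examples, a value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering.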

### 2. TF-IDF + Logistic Regression + Grid-Search Cross-Validation

In [26]:
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import GridSearchCV

# Parameter grid for the search
grid_values = {'C': [0.1, 1, 10]}  
# Use roc_auc as the scoring metric
"""
penalty: 'l1' or 'l2', the norm used in the penalty term.
"""
# dual=True is only supported by the liblinear solver, so name it explicitly
model_LR = GridSearchCV(LR(penalty='l2', dual=True, solver='liblinear', random_state=0), grid_values, scoring='roc_auc', cv=5)
model_LR.fit(tfidf_train_x, label)

# Print the results
print("Best parameters:")
print( model_LR.best_params_)

print("Best score:")
print(model_LR.best_score_)


# grid_scores_ was removed in scikit-learn 0.20, so read the scores from cv_results_
print("Grid search parameters and scores:")
cv_res = model_LR.cv_results_
for mean, std, params in zip(cv_res['mean_test_score'], cv_res['std_test_score'], cv_res['params']):
    print("mean: %.5f, std: %.5f, params: %s" % (mean, std, params))

print("Grid search results:")
print(model_LR.cv_results_)

# Refit a plain model with the best C found above (C=1, which is also the default)
model_LR = LR(penalty='l2', dual=True, solver='liblinear', random_state=0)
model_LR.fit(tfidf_train_x, label)

test_predicted = np.array(model_LR.predict(tfidf_test_x))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_lr_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Best parameters:
{'C': 1}
Best score:
0.9507756479999999
Grid search parameters and scores:
mean: 0.93949, std: 0.00324, params: {'C': 0.1}
mean: 0.95078, std: 0.00276, params: {'C': 1}
mean: 0.94398, std: 0.00275, params: {'C': 10}
Grid search results:
{'mean_fit_time': array([0.12460012, 0.17419996, 0.4506001 ]), 'std_fit_time': array([0.01323034, 0.02166458, 0.01783922]), 'mean_score_time': array([0.0026    , 0.00319996, 0.00260005]), 'std_score_time': array([0.00048982, 0.00040011, 0.00048996]), 'param_C': masked_array(data=[0.1, 1, 10],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'C': 0.1}, {'C': 1}, {'C': 10}], 'split0_test_score': array([0.93795456, 0.95020352, 0.9436768 ]), 'split1_test_score': array([0.94196192, 0.9542512 , 0.94876688]), 'split2_test_score': array([0.93759728, 0.9471152 , 0.9402704 ]), 'split3_test_score': array([0.9444336 , 0.95359056, 0.94413712]), 'split4_test_score': array([0.93548752, 0.94871776, 0.94303536]), 'mean_test_score': array([0.939486



Done.


![image.png](attachment:image.png)

![image.png](attachment:image.png)

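### 3. TF-IDF + SVM + Grid-Search Cross-Validation

* Training an SVM is very time-consuming, especially under grid search.
* During tuning, param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [0.001, 0.01, 0.1, 1, 10, 100]} with 5-fold cross-validation ran for nearly 48 hours straight.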
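
A quick count shows why that search is so slow. The sketch below enumerates the grid quoted above with `itertools.product`; the extra fit assumes `GridSearchCV` is left at its default `refit=True`, which retrains the best candidate on the full training set.

```python
from itertools import product

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}
cv_folds = 5

# Every combination of C and gamma is one candidate model
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

# Each candidate is trained once per fold, plus one final refit of the best candidate
n_fits = len(candidates) * cv_folds + 1
print(len(candidates), n_fits)
```

With 36 candidates and 5 folds, the kernel SVM is trained 181 times, each time on tens of thousands of reviews, which is consistent with a run time measured in days.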

In [19]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

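'''
A linear SVM only needs the regularization parameter C to be tuned.
An RBF-kernel SVM needs both gamma and C to be tuned.
'''
param_grid = {'C': [10],'gamma': [1]}

model_SVM = GridSearchCV(SVC(), param_grid, scoring='roc_auc', cv=5)
model_SVM.fit(tfidf_train_x, label)

# Print the results
print("Best parameters:")
print( model_SVM.best_params_)

print("Best score:")
print(model_SVM.best_score_)

# grid_scores_ was removed in scikit-learn 0.20, so read the scores from cv_results_
print("Grid search parameters and scores:")
cv_res = model_SVM.cv_results_
for mean, std, params in zip(cv_res['mean_test_score'], cv_res['std_test_score'], cv_res['params']):
    print("mean: %.5f, std: %.5f, params: %s" % (mean, std, params))


Best parameters:
{'C': 10, 'gamma': 1}
Best score:
0.9517343039999999
Grid search parameters and scores:
mean: 0.95173, std: 0.00264, params: {'C': 10, 'gamma': 1}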


In [20]:
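import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
# Persist the fitted grid search to disk, then load it back
joblib.dump(model_SVM, "model_SVM")
model_SVM = joblib.load("model_SVM")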

In [21]:
from sklearn.svm import SVC
# Note: gamma is ignored by the linear kernel; it only affects the 'rbf', 'poly' and 'sigmoid' kernels
svm = SVC(kernel='linear',C=10,gamma = 1)
svm.fit(tfidf_train_x, label)

test_predicted = np.array(svm.predict(tfidf_test_x))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_svm_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          0
3    7186_2          0
4   12128_7          1
Done.


![image.png](attachment:image.png)

### 4. TF-IDF + MLP (Multilayer Perceptron)

In [15]:
# The None dimension in the model summary is a placeholder for the batch size

import keras
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split

# Hold out 30% of the labeled data for validation
mlp_train_x, mlp_test_x, mlp_train_y, mlp_test_y = train_test_split(tfidf_train_x, label, test_size=0.3, random_state=123)

model_MLP = Sequential()

# One hidden layer of 10 ReLU units on top of the 4000 TF-IDF features
model_MLP.add(Dense(10, input_shape=(4000,), activation='relu'))

model_MLP.add(Dropout(0.2))
model_MLP.add(Dense(2, activation='softmax'))

model_MLP.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_MLP.summary()

# One-hot encode the labels to match the 2-unit softmax output
keras_y_train = np.array(keras.utils.to_categorical(mlp_train_y, 2))
keras_y_test = np.array(keras.utils.to_categorical(mlp_test_y, 2))

# Save only the best model
filepath="model_MLP/weights.best.hdf5"

# Save the weights whenever validation accuracy improves on the previous best
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True,mode='max')

callbacks_list = [checkpoint]

model_MLP.fit(mlp_train_x, keras_y_train, validation_data=(mlp_test_x, keras_y_test), epochs=500, batch_size=5000,callbacks=callbacks_list)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 10)                40010     
_________________________________________________________________
dropout_5 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 22        
Total params: 40,032
Trainable params: 40,032
Non-trainable params: 0
_________________________________________________________________
Train on 17500 samples, validate on 7500 samples
Epoch 1/500

Epoch 00001: val_acc improved from -inf to 0.67640, saving model to model_MLP/weights.best.hdf5
Epoch 2/500

Epoch 00002: val_acc improved from 0.67640 to 0.77120, saving model to model_MLP/weights.best.hdf5
Epoch 3/500

Epoch 00003: val_acc improved from 0.77120 to 0.80707, saving model to model_MLP/weights.best.hdf5
...
Epoch 00068: val_acc improved from 0.87400 to 0.87427, saving model to model_MLP/weights.best.hdf5
...
Epoch 00073: val_acc improved from 0.87427 to 0.87440, saving model to model_MLP/weights.best.hdf5
...
Epoch 00075: val_acc improved from 0.87440 to 0.87440, saving model to model_MLP/weights.best.hdf5
...
Epoch 00077: val_acc improved from 0.87440 to 0.87453, saving model to model_MLP/weights.best.hdf5
...
Epoch 00110: val_acc did not improve from 0.87467
...
Epoch 500/500

Epoch 00500: val_acc did not improve from 0.87467


<keras.callbacks.History at 0x47b50828>
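
The parameter counts in the model summary can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout adds no parameters. The helper name `dense_params` below is an illustrative assumption.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the hidden layer and the softmax output layer
layers = [(4000, 10), (10, 2)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
print(counts, sum(counts))
```

The result matches the 40,010 and 22 parameters reported per layer and the 40,032 total, almost all of which sit in the first layer because of the 4000-dimensional TF-IDF input.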

In [16]:
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential
from keras.utils import plot_model

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
import gensim
from gensim.models.word2vec import Word2Vec
from keras.callbacks import ModelCheckpoint


model_MLP = Sequential()

model_MLP.add(Dense(10, input_shape=(4000,), activation='relu'))#

model_MLP.add(Dropout(0.2))
model_MLP.add(Dense(2, activation='softmax'))

model_MLP.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_MLP.summary()

# load weights
model_MLP.load_weights("model_MLP/weights.best.hdf5")




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_11 (Dense)             (None, 10)                40010     
_________________________________________________________________
dropout_6 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 22        
Total params: 40,032
Trainable params: 40,032
Non-trainable params: 0
_________________________________________________________________


In [17]:
# predict_classes exists only in older Keras releases; newer ones use np.argmax(model_MLP.predict(x), axis=1)
test_predicted = np.array(model_MLP.predict_classes(tfidf_test_x))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_mlp_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          0
3    7186_2          1
4   12128_7          1
Done.


### 5. TF-IDF + Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

model_DT = DecisionTreeClassifier()
model_DT.fit(tfidf_train_x, label)

scores = cross_val_score(model_DT, tfidf_train_x, label, cv=5, scoring='roc_auc')

print("Decision tree 5-fold CV scores: \n", scores)
print("Decision tree 5-fold CV mean score: \n", np.mean(scores))

test_predicted = np.array(model_DT.predict(tfidf_test_x))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_dt_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Decision tree 5-fold CV scores: 
 [0.7128 0.705  0.7102 0.7062 0.7096]
Decision tree 5-fold CV mean score: 
 0.7087600000000001
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.


### 6. TF-IDF + XGBoost

In [21]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import numpy as np
import pandas as pd

# min_samples_leaf is a scikit-learn tree parameter, not an XGBoost one, so it is dropped here;
# the closest XGBoost analogue is min_child_weight
model_XGB = XGBClassifier(n_estimators=150, max_depth=6)
"""
Passing a plain Python list raises
AttributeError: 'list' object has no attribute 'shape'
so convert lists to np.array first (list => np.array).
"""
model_XGB.fit(tfidf_train_x, label)

scores = cross_val_score(model_XGB, tfidf_train_x, label, cv=5, scoring='roc_auc')

print("XGB 5-fold CV scores: ", scores)
print("XGB 5-fold CV mean score: ", np.mean(scores))

test_predicted = np.array(model_XGB.predict(tfidf_test_x))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_xgb_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

XGB 5-fold CV scores:  [0.91607008 0.91845504 0.91138704 0.922242   0.9142976 ]
XGB 5-fold CV mean score:  0.916490352
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.




### 7. TF-IDF + GBDT

In [44]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

model_GBDT = GradientBoostingClassifier()
model_GBDT.fit(tfidf_train_x, label)

scores = cross_val_score(model_GBDT, tfidf_train_x, label, cv=5, scoring='roc_auc')

print("GBDT 5-fold CV scores: \n", scores)
print("GBDT 5-fold CV mean score: \n", np.mean(scores))

test_predicted = np.array(model_GBDT.predict(tfidf_test_x))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_gbdt_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

GBDT 5-fold CV scores: 
 [0.88863632 0.88981456 0.88274104 0.89644848 0.88723104]
GBDT 5-fold CV mean score: 
 0.888974288
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.


### 8. TF-IDF + Random Forest

In [23]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model_RF = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
model_RF.fit(tfidf_train_x, label)

scores = cross_val_score(model_RF, tfidf_train_x, label, cv=5, scoring='roc_auc')

print("Random forest 5-fold CV scores: ", scores)
print("Random forest 5-fold CV mean score: ", np.mean(scores))

test_predicted = np.array(model_RF.predict(tfidf_test_x))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_rfc_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Random forest 5-fold CV scores:  [0.89770248 0.90205984 0.89408512 0.9035528  0.89733304]
Random forest 5-fold CV mean score:  0.898946656
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.


### 9. TF-IDF+Voting

In [27]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

model_VOT = VotingClassifier(estimators=[('lr', model_LR), ('xgb', model_XGB), ('rf', model_RF)],voting='hard')
model_VOT.fit(tfidf_train_x, np.array(label))

# scoring=None falls back to the estimator's default scorer (accuracy),
# so these scores are not directly comparable with the roc_auc scores of the other models
scores = cross_val_score(model_VOT, tfidf_train_x,label, cv=5, scoring=None,n_jobs = -1)

print("VotingClassifier 5-fold cross-validation scores: \n", scores)
print("VotingClassifier 5-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(model_VOT.predict(tfidf_test_x))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_vot_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

VotingClassifier 5-fold cross-validation scores: 
 [0.8394 0.8568 0.8388 0.854  0.8464]
VotingClassifier 5-fold cross-validation mean score: 
 0.84708

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.
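
Hard voting only yields class labels, which is why the cell above is scored with accuracy rather than `roc_auc`. A possible alternative, sketched here on synthetic data, is soft voting: it averages the members' predicted probabilities, so the ensemble exposes `predict_proba` and can be scored with `roc_auc` like the single models. The data and the choice of two members are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the TF-IDF features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

soft_vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
)

auc_scores = cross_val_score(soft_vote, X_demo, y_demo, cv=5, scoring="roc_auc")
print("Soft voting 5-fold AUC:", auc_scores)
print("Mean AUC:", auc_scores.mean())
```

Soft voting requires every member to implement `predict_proba`, which holds for the logistic regression, XGBoost and random forest models used in this notebook.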




### 10. TF-IDF+Stacking

In [35]:
# Individual models used in the ensemble
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
#from sklearn.cross_validation import StratifiedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix, roc_auc_score, roc_curve, auc

# Rename the training data so it matches the variable names used by the stacking code
X = tfidf_train_x
y = np.array(label)

X_test_features = tfidf_test_x

# dual=True is only supported by the liblinear solver
stacking_LR = LR(penalty='l2', dual=True, solver='liblinear', random_state=0)

# Note: min_samples_leaf is a scikit-learn tree parameter that XGBoost does not use
stacking_xgb = XGBClassifier(n_estimators=150, min_samples_leaf=3, max_depth=6)

stacking_rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)


clfs = [stacking_LR,stacking_xgb,stacking_rf]  # first-level models

# Create the folds
n_folds = 5
skf = StratifiedKFold(n_splits=n_folds,shuffle=True,random_state=1)  # K-fold splitter

# Zero matrices that will hold the first-level predictions
dataset_blend_train = np.zeros((X.shape[0], len(clfs)))  # rows: training samples, columns: models

dataset_blend_test = np.zeros((X_test_features.shape[0], len(clfs)))  # rows: test samples, columns: models
dataset_blend_test_j = np.zeros((X_test_features.shape[0], n_folds))  # rows: test samples, columns: folds

# Build the first-level models
for j, clf in enumerate(clfs):  # loop over the classifiers
    i = 0
    for train_index, test_index in skf.split(X, y):  # loop over the folds
        X_1_train, y_1_train, X_1_test, y_1_test = X[train_index], y[train_index], X[test_index], y[test_index]

        clf.fit(X_1_train, y_1_train)  # train model j on the other k-1 folds

        y_submission = clf.predict_proba(X_1_test)[:, 1]  # model j predicts the held-out fold; keep the probability of class 1

        dataset_blend_train[test_index, j] = y_submission  # store the out-of-fold predictions of model j

        dataset_blend_test_j[:, i] = clf.predict_proba(X_test_features)[:, 1]  # predict the test set with this fold's model

        i = i + 1  # next fold

    # For the test set, the mean of the k fold models' predictions becomes the new feature
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)  # average over the folds for classifier j

# Build the second-level model

C = [0.01,0.1,1,10]

for i in C:
    stacking_model_lr = LR(C=i, max_iter=100)
    print(i)
    aucs = []
    for train_index, test_index in skf.split(dataset_blend_train, y):  # loop over the folds
        X_2_train, y_2_train, X_2_test, y_2_test = dataset_blend_train[train_index], y[train_index], dataset_blend_train[test_index], y[test_index]
        stacking_model_lr.fit(X_2_train, y_2_train)
        test_predict_proba = stacking_model_lr.predict_proba(X_2_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_2_test, test_predict_proba, pos_label=1)
        print("stacking auc",auc(fpr, tpr))
        aucs.append(auc(fpr, tpr))
    print(np.average(aucs))

# Final second-level model, refit on all first-level predictions
stacking_model_lr = LR(C=10, max_iter=100)

stacking_model_lr.fit(dataset_blend_train, y)

test_predict = stacking_model_lr.predict(dataset_blend_test)

test_predicted = np.array(test_predict)

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_stacking_tfidf.csv',columns = ['id','sentiment'], index = False)
print('Done.')

0.01
stacking auc 0.9520484800000001
stacking auc 0.94503472
stacking auc 0.94511664
stacking auc 0.9553464
stacking auc 0.94973888
0.949457024
0.1
stacking auc 0.95298912
stacking auc 0.9471843200000001
stacking auc 0.9471433599999999
stacking auc 0.9565649600000001
stacking auc 0.9505606400000002
0.95088848
1
stacking auc 0.9529368000000001
stacking auc 0.9477648
stacking auc 0.94768992
stacking auc 0.9566342399999999
stacking auc 0.95043744
0.9510926399999999
10
stacking auc 0.9528793600000001
stacking auc 0.9478297600000001
stacking auc 0.94773648
stacking auc 0.9566390399999998
stacking auc 0.9503939199999999
0.9510957120000001
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.
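
The hand-written two-level procedure can be approximated with scikit-learn's `StackingClassifier` (available from scikit-learn 0.22). The sketch below is not the notebook's method: it uses synthetic data and only two base models, and `StackingClassifier` refits each base model on the full training set for test-time predictions instead of averaging the fold models as the loop above does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the TF-IDF features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)),
    ],
    final_estimator=LogisticRegression(C=10, max_iter=100),          # second-level model
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),    # folds for out-of-fold features
    stack_method="predict_proba",                                    # feed probabilities to level two
)

stack_auc = cross_val_score(stack, X_demo, y_demo, cv=3, scoring="roc_auc")
print("Stacking AUC per fold:", stack_auc)
print("Mean AUC:", stack_auc.mean())
```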


## II. Word2vec + Machine Learning Models

### 1.Word2vec+LR

In [41]:
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import GridSearchCV
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

# dual=True is only supported by the liblinear solver
wv_model_LR = LR(penalty='l2', dual=True, solver='liblinear', random_state=0)
wv_model_LR.fit(train_data_features, label)

scores = cross_val_score(wv_model_LR, train_data_features, label, cv=10, scoring='roc_auc')

print("LR classifier 10-fold cross-validation scores: \n", scores)
print("LR classifier 10-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_model_LR.predict(test_data_features))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_lr_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

LR classifier 10-fold cross-validation scores: 
 [0.9412576  0.93986304 0.94739904 0.93963392 0.93683136 0.93972416
 0.94117184 0.94543232 0.93660928 0.94397184]
LR classifier 10-fold cross-validation mean score: 
 0.9411894399999999
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          0
4   12128_7          1
Done.



### 2.Word2vec+GNB

In [42]:
from sklearn.naive_bayes import GaussianNB as GNB
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

wv_gnb_model = GNB()
wv_gnb_model.fit(train_data_features, label)


scores = cross_val_score(wv_gnb_model, train_data_features, label, cv=10, scoring='roc_auc')
print("\nGaussian naive Bayes classifier 10-fold cross-validation scores: \n", scores)
print("\nGaussian naive Bayes classifier 10-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_gnb_model.predict(test_data_features))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_gnb_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')


Gaussian naive Bayes classifier 10-fold cross-validation scores: 
 [0.80043648 0.78343712 0.78856768 0.79913536 0.7964928  0.78470464
 0.79811424 0.80671424 0.7928432  0.80504768]

Gaussian naive Bayes classifier 10-fold cross-validation mean score: 
 0.7955493440000001
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          0
3    7186_2          0
4   12128_7          0
Done.



### 3.Word2vec+Knn

In [49]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

wv_knn_model = KNeighborsClassifier(n_neighbors=5)
wv_knn_model.fit(train_data_features, label)

scores = cross_val_score(wv_knn_model, train_data_features, label, cv=10, scoring='roc_auc')

print("\nkNN 10-fold cross-validation scores: \n", scores)
print("\nkNN 10-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_knn_model.predict(test_data_features))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_knn_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')


kNN 10-fold cross-validation scores: 
 [0.89709856 0.89423168 0.90665152 0.8890768  0.89263712 0.89179264
 0.88568    0.90213216 0.89707968 0.88941376]

kNN 10-fold cross-validation mean score: 
 0.894579392

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          0
3    7186_2          0
4   12128_7          1
Done.


### 4.Word2vec+SVM

In [60]:
from sklearn.svm import SVC

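# A linear SVM only requires tuning the regularization parameter C.
# An RBF-kernel SVM requires tuning both gamma and C.

# gamma is ignored when kernel='linear'
wv_svm_model = SVC(kernel='linear',C=10,gamma = 1)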
wv_svm_model.fit(train_data_features, label)


Best parameters:


AttributeError: 'SVC' object has no attribute 'best_params_'
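
`best_params_` is an attribute of a fitted `GridSearchCV` object rather than of `SVC` itself, so reading it requires wrapping the classifier in a search. The sketch below shows one way to tune `C` for a linear kernel and `C` plus `gamma` for an RBF kernel. It runs on a small synthetic dataset because SVM training on the full review set is slow; the grid values are arbitrary examples, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the averaged word-vector features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},                           # linear: tune C only
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},     # RBF: tune C and gamma
]

search = GridSearchCV(SVC(), param_grid, cv=3, scoring="roc_auc")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV AUC:", search.best_score_)

# The refitted best model is available for prediction
demo_pred = search.best_estimator_.predict(X_demo[:5])
```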

In [61]:
test_predicted = np.array(wv_svm_model.predict(test_data_features))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_svm_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          0
4   12128_7          1
Done.


### 5.Word2vec+DT

In [50]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

wv_tree_model = DecisionTreeClassifier()
wv_tree_model.fit(train_data_features, label)

scores = cross_val_score(wv_tree_model, train_data_features, label, cv=5, scoring='roc_auc')

print("Decision tree 5-fold cross-validation scores: \n", scores)
print("Decision tree 5-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_tree_model.predict(test_data_features))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_dtc_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')


Decision tree 5-fold cross-validation scores: 
 [0.7498 0.7506 0.7572 0.7648 0.736 ]
Decision tree 5-fold cross-validation mean score: 
 0.75168

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          0
3    7186_2          0
4   12128_7          1
Done.


### 6.Word2vec+xgboost

In [43]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import numpy as np
import pandas as pd

# Note: min_samples_leaf is a scikit-learn tree parameter that XGBoost does not use
wv_xgb_model = XGBClassifier(n_estimators=150, min_samples_leaf=3, max_depth=6)
# Passing a plain Python list raises
#   AttributeError: 'list' object has no attribute 'shape'
# so the features are wrapped in a DataFrame (np.array would also work)
wv_xgb_model.fit(pd.DataFrame(train_data_features), label)

scores = cross_val_score(wv_xgb_model, pd.DataFrame(train_data_features), label, cv=5, scoring='roc_auc')

print("XGB 5-fold cross-validation scores: \n", scores)
print("XGB 5-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_xgb_model.predict(pd.DataFrame(test_data_features)))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_xgb_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

XGB 5-fold cross-validation scores: 
 [0.94458    0.94526208 0.94390896 0.9463752  0.94129232]
XGB 5-fold cross-validation mean score: 
 0.944283712
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          0
4   12128_7          1
Done.




### 7.Word2vec+RF

In [45]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

wv_rf_model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
wv_rf_model.fit(train_data_features, label)

scores = cross_val_score(wv_rf_model, train_data_features, label, cv=5, scoring='roc_auc')

print("\nRandom forest 5-fold cross-validation scores: \n", scores)
print("\nRandom forest 5-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_rf_model.predict(test_data_features))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_rf_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')


Random forest 5-fold cross-validation scores: 
 [0.9182096  0.91855744 0.91677872 0.92213184 0.91022928]

Random forest 5-fold cross-validation mean score: 
 0.9171813759999999
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          0
3    7186_2          0
4   12128_7          1
Done.


### 8.Word2vec+GBDT

In [46]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

wv_gbdt_model = GradientBoostingClassifier()
wv_gbdt_model.fit(train_data_features, label)

scores = cross_val_score(wv_gbdt_model, train_data_features, label, cv=5, scoring='roc_auc')

print("GBDT 5-fold cross-validation scores: \n", scores)
print("GBDT 5-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_gbdt_model.predict(test_data_features))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_gbdt_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

GBDT 5-fold cross-validation scores: 
 [0.9328144  0.93499104 0.93023024 0.93560096 0.92776608]
GBDT 5-fold cross-validation mean score: 
 0.9322805439999999
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          0
4   12128_7          1
Done.


### 9.Word2vec+Adaboost

* Training the AdaBoost model took too long, so this cell was never run to completion.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# AdaBoost over shallow decision trees (depth 2)
wv_ab_model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=600,
    learning_rate=1)

wv_ab_model.fit(train_data_features, label)

scores = cross_val_score(wv_ab_model, train_data_features, label, cv=3, scoring='roc_auc')

print("AdaBoost 3-fold cross-validation scores: \n", scores)
print("AdaBoost 3-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_ab_model.predict(test_data_features))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_adaboost_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

### 10.Word2vec+Voting

In [53]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

wv_vot_model = VotingClassifier(estimators=[('lr', wv_model_LR), ('xgb', wv_xgb_model), ('gbdt', wv_gbdt_model),('rf', wv_rf_model)],voting='hard')
wv_vot_model.fit(pd.DataFrame(train_data_features), np.array(label))

# scoring=None falls back to the estimator's default scorer (accuracy),
# so these scores are not directly comparable with the roc_auc scores of the other models
scores = cross_val_score(wv_vot_model, pd.DataFrame(train_data_features),label, cv=5, scoring=None,n_jobs = -1)

print("VotingClassifier 5-fold cross-validation scores: \n", scores)
print("VotingClassifier 5-fold cross-validation mean score: \n", np.mean(scores))

test_predicted = np.array(wv_vot_model.predict(pd.DataFrame(test_data_features)))
print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_vot_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

VotingClassifier 5-fold cross-validation scores: 
 [0.8656 0.8702 0.8644 0.8674 0.8572]
VotingClassifier 5-fold cross-validation mean score: 
 0.86496

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          0
4   12128_7          1
Done.




### 11.Word2vec+Stacking

In [54]:
# K-fold data splitting
from sklearn.model_selection import StratifiedKFold
import numpy as np
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [5, 6],[7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])  # each class needs at least n_splits (here 3) samples
skf = StratifiedKFold(n_splits=3,shuffle=True,random_state=1)

for train_index, test_index in skf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)

# After the loop, train_index and test_index hold the indices of the last split
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

print(X_train)
print(X_test)

print(y_train)
print(y_test)

TRAIN: [1 2 4 5] TEST: [0 3]
TRAIN: [0 1 3 4] TEST: [2 5]
TRAIN: [0 2 3 5] TEST: [1 4]
[[1 2]
 [1 2]
 [3 4]
 [7 8]]
[[3 4]
 [5 6]]
[0 0 1 1]
[0 1]


In [57]:
# Individual models used in the ensemble
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
#from sklearn.cross_validation import StratifiedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix, roc_auc_score, roc_curve, auc

# Rename the training data so it matches the variable names used by the stacking code
X = np.array(train_data_features)
y = np.array(label)

X_test_features = np.array(test_data_features)

# dual=True is only supported by the liblinear solver
stacking_LR = LR(penalty='l2', dual=True, solver='liblinear', random_state=0)

# Note: min_samples_leaf is a scikit-learn tree parameter that XGBoost does not use
stacking_xgb = XGBClassifier(n_estimators=150, min_samples_leaf=3, max_depth=6)

stacking_rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)


clfs = [stacking_LR,stacking_xgb,stacking_rf]  # first-level models

# Create the folds
n_folds = 5
skf = StratifiedKFold(n_splits=n_folds,shuffle=True,random_state=1)  # K-fold splitter

# Zero matrices that will hold the first-level predictions
dataset_blend_train = np.zeros((X.shape[0], len(clfs)))  # rows: training samples, columns: models

dataset_blend_test = np.zeros((X_test_features.shape[0], len(clfs)))  # rows: test samples, columns: models
dataset_blend_test_j = np.zeros((X_test_features.shape[0], n_folds))  # rows: test samples, columns: folds

# Build the first-level models
for j, clf in enumerate(clfs):  # loop over the classifiers
    i = 0
    for train_index, test_index in skf.split(X, y):  # loop over the folds
        X_1_train, y_1_train, X_1_test, y_1_test = X[train_index], y[train_index], X[test_index], y[test_index]

        clf.fit(X_1_train, y_1_train)  # train model j on the other k-1 folds

        y_submission = clf.predict_proba(X_1_test)[:, 1]  # model j predicts the held-out fold; keep the probability of class 1

        dataset_blend_train[test_index, j] = y_submission  # store the out-of-fold predictions of model j

        dataset_blend_test_j[:, i] = clf.predict_proba(X_test_features)[:, 1]  # predict the test set with this fold's model

        i = i + 1  # next fold

    # For the test set, the mean of the k fold models' predictions becomes the new feature
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)  # average over the folds for classifier j

In [58]:
# Build the second-level model

C = [0.01,0.1,1,10]

for i in C:
    stacking_model_lr = LR(C=i, max_iter=100)
    print(i)
    aucs = []
    for train_index, test_index in skf.split(dataset_blend_train, y):  # loop over the folds
        X_2_train, y_2_train, X_2_test, y_2_test = dataset_blend_train[train_index], y[train_index], dataset_blend_train[test_index], y[test_index]
        stacking_model_lr.fit(X_2_train, y_2_train)
        test_predict_proba = stacking_model_lr.predict_proba(X_2_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_2_test, test_predict_proba, pos_label=1)
        print("stacking auc",auc(fpr, tpr))
        aucs.append(auc(fpr, tpr))
    print(np.average(aucs))


0.01
stacking auc 0.9506499199999999
stacking auc 0.94645408
stacking auc 0.9462804800000002
stacking auc 0.9520553599999999
stacking auc 0.9493892799999999
0.9489658240000001
0.1
stacking auc 0.9507424000000001
stacking auc 0.94657616
stacking auc 0.9463376
stacking auc 0.95219488
stacking auc 0.9494544
0.949061088
1
stacking auc 0.95079408
stacking auc 0.9466462400000001
stacking auc 0.9463564799999999
stacking auc 0.9522491200000001
stacking auc 0.94948288
0.9491057599999999
10
stacking auc 0.9508049599999999
stacking auc 0.94665712
stacking auc 0.94635328
stacking auc 0.9522579200000001
stacking auc 0.94948464
0.949111584


In [59]:
# Stacking prediction
stacking_model_lr = LR(C=10, max_iter=100)

stacking_model_lr.fit(dataset_blend_train, y)

test_predict = stacking_model_lr.predict(dataset_blend_test)

test_predicted = np.array(test_predict)

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_stacking_wv.csv',columns = ['id','sentiment'], index = False)
print('Done.')

Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          0
4   12128_7          1
Done.
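
The out-of-fold matrix `dataset_blend_train` built by the nested loops can also be produced with `cross_val_predict(..., method="predict_proba")`, which returns, for every training sample, the prediction made by the model that did not see that sample. The following is a minimal sketch on synthetic data with two base models; it covers only the training-side meta-features, not the fold-averaged test predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the averaged word-vector features (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0),
]
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# One column of out-of-fold positive-class probabilities per base model
oof = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=skf_demo, method="predict_proba")[:, 1]
    for m in base_models
])
print(oof.shape)
```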


# Deep Learning Methods

## I. Deep Learning Models

### 1.LSTM

In [12]:
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
from keras.callbacks import ModelCheckpoint

MAX_SEQUENCE_LENGTH = 100 # maximum length of each review, in tokens
EMBEDDING_DIM = 100       # dimensionality of the word-vector space

all_data = train_data+test_data
# Tokenizer vectorizes text, i.e. turns each text into a sequence of word indices
# (positions in the vocabulary, counted from 1)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_data)
sequences = tokenizer.texts_to_sequences(all_data)

# Vocabulary size (word_index maps word -> index)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
#print(word_index)

# Truncate long sequences and pad short ones to MAX_SEQUENCE_LENGTH to build the data matrix
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)

x_train,x_test = data[:len(train_data)],data[len(train_data):]

# One-hot encode the labels
labels = to_categorical(np.asarray(label))
print('Shape of label tensor:', labels.shape)


model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # first argument: dimensionality of the LSTM output vector
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])

VALIDATION_SPLIT = 0.16 # fraction of the labeled data used for validation
TEST_SPLIT = 0.2 # fraction of the labeled data held out for testing

p1 = int(len(x_train)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(x_train)*(1-TEST_SPLIT))

train_x = x_train[:p1]
train_y = labels[:p1]
val_x = x_train[p1:p2]
val_y = labels[p1:p2]
test_x = x_train[p2:]
test_y = labels[p2:]

print ('train docs: '+str(len(train_x)))
print ('val docs: '+str(len(val_x)))
print ('test docs: '+str(len(test_x)))

filepath="lstm_model/lstm_weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True,mode='max')
callbacks_list = [checkpoint]

# Fit the model
#model.fit(X, Y, validation_split=0.33, nb_epoch=150, batch_size=10,callbacks=callbacks_list, verbose=0)

model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=12, batch_size=5000,callbacks=callbacks_list)

#model.save('word_vector_cnn.h5')
print (model.evaluate(test_x, test_y))

test_predicted = np.array(model.predict_classes(x_test))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_lstm.csv',columns = ['id','sentiment'], index = False)
print('Done.')


Found 101245 unique tokens.
Shape of data tensor: (50000, 100)
Shape of label tensor: (25000, 2)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 100, 100)          10124600  
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dropout_6 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 202       
Total params: 10,205,202
Trainable params: 10,205,202
Non-trainable params: 0
_________________________________________________________________
train docs: 15999
val docs: 4001
test docs: 5000
Train on 15999 samples, validate on 4001 samples
Epoch 1/12

Epoch 00001: val_acc improved from -inf to 0.61510, sav
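
The split sizes printed above (15999 / 4001 / 5000) are off by one from the intended 16000 / 4000 / 5000. The cause is floating-point rounding: `1 - 0.16 - 0.2` evaluates to slightly less than 0.64, and `int()` truncates the product. The sketch below reproduces the effect and shows one possible fix, rounding before the conversion; the variable names mirror the notebook, and the `_float` and `_round` suffixes are added here for illustration.

```python
n_docs = 25000          # number of labeled reviews
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2

# As in the notebook: int() truncates the floating-point product
p1_float = int(n_docs * (1 - VALIDATION_SPLIT - TEST_SPLIT))
p2_float = int(n_docs * (1 - TEST_SPLIT))

# Rounding first avoids losing a document to representation error
p1_round = int(round(n_docs * (1 - VALIDATION_SPLIT - TEST_SPLIT)))
p2_round = int(round(n_docs * (1 - TEST_SPLIT)))

# (train, validation, test) sizes for each variant
print(p1_float, p2_float - p1_float, n_docs - p2_float)
print(p1_round, p2_round - p1_round, n_docs - p2_round)
```

The difference of a single document has no practical effect on training; it only explains the odd-looking counts in the output.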

### 2.CNN

In [5]:
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Sequential
from keras.utils import plot_model

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
import numpy as np


MAX_SEQUENCE_LENGTH = 100 # maximum length of each review, in tokens
EMBEDDING_DIM = 100 # dimensionality of the word-vector space

# Combine the training and test sets
all_data = train_data+test_data

# Tokenizer vectorizes text, i.e. turns each text into a sequence of word indices
# (positions in the vocabulary, counted from 1)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_data)
sequences = tokenizer.texts_to_sequences(all_data)

# Vocabulary size (word_index maps word -> index)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
#print(word_index)

# Truncate long sequences and pad short ones to MAX_SEQUENCE_LENGTH to build the data matrix
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)

x_train,x_test = data[:len(train_data)],data[len(train_data):]

# One-hot encode the labels
labels = to_categorical(np.asarray(label))
print('Shape of label tensor:', labels.shape)

VALIDATION_SPLIT = 0.16 # fraction of the labeled data used for validation
TEST_SPLIT = 0.2 # fraction of the labeled data held out for testing

p1 = int(len(x_train)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(x_train)*(1-TEST_SPLIT))

train_x = x_train[:p1]
train_y = labels[:p1]
val_x = x_train[p1:p2]
val_y = labels[p1:p2]
test_x = x_train[p2:]
test_y = labels[p2:]

print ('train docs: '+str(len(train_x)))
print ('val docs: '+str(len(val_x)))
print ('test docs: '+str(len(test_x)))

model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(Dropout(0.2))
model.add(Conv1D(250, 3, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()

#plot_model(model, to_file='model.png',show_shapes=True)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])


filepath="cnn_weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True,mode='max')
callbacks_list = [checkpoint]

# Fit the model
#model.fit(X, Y, validation_split=0.33, nb_epoch=150, batch_size=10,callbacks=callbacks_list, verbose=0)


model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=12, batch_size=5000,callbacks=callbacks_list)

#model.save('word_vector_cnn.h5')
print (model.evaluate(test_x, test_y))

test_predicted = np.array(model.predict_classes(x_test))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_cnn.csv',columns = ['id','sentiment'], index = False)
print('Done.')




Found 101245 unique tokens.
Shape of data tensor: (50000, 100)
Shape of label tensor: (25000, 2)
train docs: 15999
val docs: 4001
test docs: 5000
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          10124600  
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 100)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 98, 250)           75250     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 250)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               800100 




Epoch 00001: val_acc improved from -inf to 0.49888, saving model to weights-improvement-01-0.50.hdf5
Epoch 2/6

Epoch 00002: val_acc did not improve from 0.49888
Epoch 3/6

Epoch 00003: val_acc did not improve from 0.49888
Epoch 4/6

Epoch 00004: val_acc improved from 0.49888 to 0.49938, saving model to weights-improvement-04-0.50.hdf5
Epoch 5/6

Epoch 00005: val_acc improved from 0.49938 to 0.71832, saving model to weights-improvement-05-0.72.hdf5
Epoch 6/6

Epoch 00006: val_acc improved from 0.71832 to 0.74706, saving model to weights-improvement-06-0.75.hdf5
[0.5218041631698609, 0.7492]
Saving results...
         id  sentiment
0  12311_10          1
1    8348_2          0
2    5828_4          1
3    7186_2          1
4   12128_7          1
Done.
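
`Sequential.predict_classes`, used in the deep learning cells, has been removed from recent tf.keras releases. For a two-unit softmax output, the same labels can be obtained by taking the argmax over the probabilities returned by `model.predict`. The sketch below uses a hand-written probability array in place of a real model's output, so the array values are purely illustrative.

```python
import numpy as np

# Stand-in for model.predict(x_test): one row of softmax probabilities per review (assumption)
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.45, 0.55],
])

pred_labels = probs.argmax(axis=1)  # class index with the highest probability
pos_proba = probs[:, 1]             # probability of the positive class, if scores are preferred

print(pred_labels)
print(pos_proba)
```

In the notebook this would correspond to `np.argmax(model.predict(x_test), axis=1)` in place of `model.predict_classes(x_test)`.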


## II. Word2vec + Deep Learning Models

### 1.LSTM+Word2Vec

In [8]:
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential
from keras.utils import plot_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
import gensim
from gensim.models.word2vec import Word2Vec
from keras.callbacks import ModelCheckpoint


all_data = train_data+test_data
# Tokenizer vectorizes text, i.e. turns each text into a sequence of word indices
# (positions in the vocabulary, counted from 1)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_data)
sequences = tokenizer.texts_to_sequences(all_data)

# Vocabulary size (word_index maps word -> index)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
#print(word_index)

# Truncate long sequences and pad short ones to MAX_SEQUENCE_LENGTH to build the data matrix
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)

x_train,x_test = data[:len(train_data)],data[len(train_data):]

# One-hot encode the labels
labels = to_categorical(np.asarray(label))
print('Shape of label tensor:', labels.shape)

wv_model = Word2Vec.load("100size_3min_count_10window.model")

# Rows for words missing from the Word2Vec vocabulary stay all-zero
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in wv_model.wv:  # look words up through the KeyedVectors stored in .wv
        embedding_matrix[i] = np.asarray(wv_model.wv[word],dtype='float32')

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # freeze the embedding layer

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # first argument: dimensionality of the LSTM output vector
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])


VALIDATION_SPLIT = 0.16 # fraction of the labeled data used for validation
TEST_SPLIT = 0.2 # fraction of the labeled data held out for testing

p1 = int(len(x_train)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(x_train)*(1-TEST_SPLIT))

train_x = x_train[:p1]
train_y = labels[:p1]
val_x = x_train[p1:p2]
val_y = labels[p1:p2]
test_x = x_train[p2:]
test_y = labels[p2:]

print ('train docs: '+str(len(train_x)))
print ('val docs: '+str(len(val_x)))
print ('test docs: '+str(len(test_x)))

filepath="lstm_word2vec_model/lstm_word2vec_weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
# To keep only the single best model, use a fixed file name instead:
#filepath="weights.best.hdf5"

checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True,mode='max')  # save the weights whenever validation accuracy improves
callbacks_list = [checkpoint]


model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=40, batch_size=5000,callbacks=callbacks_list)

# Fit the model
#model.fit(X, Y, validation_split=0.33, nb_epoch=150, batch_size=10,callbacks=callbacks_list, verbose=0)


#model.save('word_vector_cnn.h5')
print (model.evaluate(test_x, test_y))


test_predicted = np.array(model.predict_classes(x_test))

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_lstm_word2vec.csv',columns = ['id','sentiment'], index = False)
print('Done.')


Found 101245 unique tokens.
Shape of data tensor: (50000, 100)
Shape of label tensor: (25000, 2)




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 100)          10124600  
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 202       
Total params: 10,205,202
Trainable params: 80,602
Non-trainable params: 10,124,600
_________________________________________________________________
train docs: 15999
val docs: 4001
test docs: 5000
Train on 15999 samples, validate on 4001 samples
Epoch 1/40

Epoch 00001: val_acc improved from -inf to 0.70957, saving model to lstm_word2vec_weights-improvement-01-0.71.hdf5
Epoch 2/40

Epoch 00002: val_acc
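
The output above reports a 15999/4001/5000 split rather than 16000/4000/5000. The sketch below reproduces the index arithmetic for the 25,000 labeled reviews and shows that the off-by-one comes from `int()` truncating a floating-point product that lands just below 16000. The second half is an alternative and not part of the original notebook: it sizes the validation and test sets with `round()` first, and the names `n_docs`, `n_val`, `n_test`, `q1` and `q2` are introduced here only for illustration.

```python
# Reproduce the split indices for the 25,000 labeled reviews.
n_docs = 25000
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2

# Original arithmetic: 1 - 0.16 - 0.2 evaluates to slightly less than 0.64
# in binary floating point, so int() truncates 15999.99... down to 15999.
p1 = int(n_docs * (1 - VALIDATION_SPLIT - TEST_SPLIT))
p2 = int(n_docs * (1 - TEST_SPLIT))
print(p1, p2 - p1, n_docs - p2)   # 15999 4001 5000

# Alternative (an assumption, not the notebook's method): round the
# validation and test sizes first, then give the remainder to training.
n_val = int(round(n_docs * VALIDATION_SPLIT))
n_test = int(round(n_docs * TEST_SPLIT))
q1 = n_docs - n_val - n_test      # end of the training slice
q2 = n_docs - n_test              # end of the validation slice
print(q1, q2 - q1, n_docs - q2)   # 16000 4000 5000
```

A one-document difference has no practical effect on training here; the point is only that the printed counts are a rounding artifact and not a data problem.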

#### Load the model (evaluate or continue training from the saved weights)

In [10]:
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential
from keras.utils import plot_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
import gensim
from gensim.models.word2vec import Word2Vec
from keras.callbacks import ModelCheckpoint

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # first argument: dimensionality of the LSTM output vector
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# load weights
model.load_weights("lstm_word2vec_weights-improvement-36-0.87.hdf5")

#model.save('word_vector_cnn.h5')
print (model.evaluate(test_x, test_y))

[0.31665992724895475, 0.8666]
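
The weights file loaded above is named by hand. As a minimal sketch, assuming the checkpoint names follow the `...weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5` template used with `ModelCheckpoint`, the file with the highest validation accuracy can be picked by parsing the names. The helper `best_checkpoint` and the `notes.txt` entry are hypothetical; the two `.hdf5` names are the ones that appear in this notebook.

```python
import re

# Matches names produced by the template
# "...weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5".
CKPT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the name with the highest val_acc (ties go to the later epoch), or None."""
    scored = []
    for name in filenames:
        match = CKPT_RE.search(name)
        if match:
            epoch = int(match.group(1))
            val_acc = float(match.group(2))
            scored.append((val_acc, epoch, name))
    return max(scored)[2] if scored else None


# Example listing; in practice this could come from os.listdir(...).
files = [
    "lstm_word2vec_weights-improvement-01-0.71.hdf5",
    "lstm_word2vec_weights-improvement-36-0.87.hdf5",
    "notes.txt",
]
print(best_checkpoint(files))
```

Because the checkpoint callback was created with `save_best_only=True`, the file from the latest epoch is also the best one, so this helper mostly guards against picking a stale name by hand. Its result could then be passed to `model.load_weights(...)`.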


### 2.CNN+Word2Vec
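
The layer shapes and parameter counts printed by `model.summary()` for the CNN in this section can be checked by hand. The sketch below redoes that arithmetic from the hyperparameters used in the cell (sequence length 100, embedding dimension 100, 250 filters of width 3, pool size 3) and the vocabulary size of 101245 reported in the output. The variable names are chosen here for illustration only.

```python
# Hand-check of the CNN layer shapes and parameter counts.
seq_len, emb_dim = 100, 100        # MAX_SEQUENCE_LENGTH, EMBEDDING_DIM
filters, kernel, pool = 250, 3, 3  # Conv1D(250, 3), MaxPooling1D(3)
vocab_rows = 101245 + 1            # len(word_index) + 1 (index 0 is reserved for padding)
n_classes = 2

embedding_params = vocab_rows * emb_dim             # frozen, so non-trainable
conv_len = seq_len - kernel + 1                     # 'valid' padding, stride 1
conv_params = kernel * emb_dim * filters + filters  # weights + biases
pool_len = conv_len // pool                         # non-overlapping windows of 3
flat_dim = pool_len * filters                       # Flatten output size
dense1_params = flat_dim * emb_dim + emb_dim        # Dense(EMBEDDING_DIM)
dense2_params = emb_dim * n_classes + n_classes     # Dense(2) softmax

print(embedding_params)         # 10124600
print(conv_len, conv_params)    # 98 75250
print(pool_len, flat_dim)       # 32 8000
print(dense1_params)            # 800100
print(dense2_params)            # 202
```

Most of the trainable weights sit in the first dense layer (800,100), which is a direct consequence of flattening a 32 x 250 feature map.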

In [11]:
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Sequential
from keras.utils import plot_model

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
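from gensim.models.word2vec import Word2Vec
from keras.callbacks import ModelCheckpoint  # used below; otherwise available only from an earlier cell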

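MAX_SEQUENCE_LENGTH = 100  # maximum length of each review, in tokens
EMBEDDING_DIM = 100        # dimensionality of the word vectors

VALIDATION_SPLIT = 0.16  # fraction of labeled data used for validation
TEST_SPLIT = 0.2  # fraction of labeled data held out for testing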

# Combine the training and test sets so both share one vocabulary
all_data = train_data+test_data

# Tokenizer vectorizes text, or converts text into sequences (lists of word indices in the vocabulary, starting from 1)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_data)
sequences = tokenizer.texts_to_sequences(all_data)

# Vocabulary size (word_index maps each word to its index)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
#print(word_index)

# Pad or truncate every index sequence to MAX_SEQUENCE_LENGTH to build the data matrix
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)

x_train,x_test = data[:len(train_data)],data[len(train_data):]

# One-hot encode the labels
labels = to_categorical(np.asarray(label))
print('Shape of label tensor:', labels.shape)


p1 = int(len(x_train)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(x_train)*(1-TEST_SPLIT))

train_x = x_train[:p1]
train_y = labels[:p1]
val_x = x_train[p1:p2]
val_y = labels[p1:p2]
test_x = x_train[p2:]
test_y = labels[p2:]

print ('train docs: '+str(len(train_x)))
print ('val docs: '+str(len(val_x)))
print ('test docs: '+str(len(test_x)))


wv_model = Word2Vec.load("100size_3min_count_10window.model")

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items(): 
    if word in wv_model.wv:  # look words up through the KeyedVectors object (required in gensim 4+)
        embedding_matrix[i] = np.asarray(wv_model.wv[word], dtype='float32')
        
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # freeze the embedding layer


model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(250, 3, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()

#plot_model(model, to_file='model.png',show_shapes=True)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])


filepath="cnn_model/cnn_word2vec_weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True,mode='max')
callbacks_list = [checkpoint]

# Fit the model
#model.fit(X, Y, validation_split=0.33, nb_epoch=150, batch_size=10,callbacks=callbacks_list, verbose=0)


model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=20, batch_size=5000,callbacks=callbacks_list)

#model.save('word_vector_cnn.h5')
print (model.evaluate(test_x, test_y))

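# Take the arg-max over the softmax outputs (predict_classes is unavailable in newer Keras releases)
test_predicted = np.argmax(model.predict(x_test), axis=1)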

print('Saving results...')
submission_df = pd.DataFrame(data ={'id': test['id'], 'sentiment': test_predicted})
print(submission_df.head(5))
submission_df.to_csv('submission_cnn_word2vec.csv',columns = ['id','sentiment'], index = False)
print('Done.')


Found 101245 unique tokens.
Shape of data tensor: (50000, 100)
Shape of label tensor: (25000, 2)
train docs: 15999
val docs: 4001
test docs: 5000




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 100)          10124600  
_________________________________________________________________
dropout_5 (Dropout)          (None, 100, 100)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 98, 250)           75250     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 250)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 100)               800100    
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 202       
Total para