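# Sentence-level sentiment classification, trained on Douban reviews
## To switch to another corpus, only the contents of df in the first cell need to change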

In [1]:
import numpy as np
np.random.seed(1337)  # for reproducibility

import pandas as pd
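# Read the corpus as two columns: the raw label and the space-segmented review text
df=pd.read_csv('data/DoubanZH.txt',names=['label','content'])
# Binarize the label: a raw value of 50 becomes 1, anything else becomes 0
df['label']=df.label.apply(lambda x:1 if x==50 else 0)
# Drop rows whose review text is missing
df=df[df.content.notnull()]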
df.head()

|   | label | content |
|---|-------|---------|
| 0 | 1 | 智取 威虎山 之 寻找 梁家辉 |
| 1 | 1 | 燃爆 了 ！ ！ ！ |
| 2 | 1 | 硬到 骨子里 |
| 3 | 0 | 红色 电影 新 马甲 。 |
| 4 | 0 | 看 完 影评 我 觉得 我 是 一个 人 不能 更 糟心 的 片子 |
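
As a quick sanity check of the preprocessing above, here is a minimal sketch that applies the same two steps (binarize the label, drop rows without text) to a tiny in-memory sample. The three sample rows are made up for illustration and are not taken from `DoubanZH.txt`.

```python
import io
import pandas as pd

# Hypothetical sample in the same "label,content" layout; the last row has no text.
raw = io.StringIO("50,great movie\n20,boring plot\n50,\n")
sample = pd.read_csv(raw, names=['label', 'content'])

# A raw label of 50 becomes 1, anything else becomes 0.
sample['label'] = (sample['label'] == 50).astype(int)

# Rows whose content is missing are parsed as NaN and dropped here.
sample = sample[sample['content'].notnull()]

labels = sample['label'].tolist()
print(labels)
```

The vectorized comparison is equivalent to the `apply` with a lambda used in the cell above; either form works.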


## Prepare the corpus: convert text to ids

In [33]:
# Build the vocabulary over the whole corpus; id 0 is reserved for the padding token
vocabs={r'\s':0}
vocab_list=[r'\s']
for content in df.content:
    for word in content.split():
        if word not in vocabs:
            vocabs[word]=len(vocabs)
            vocab_list.append(word)
print('The vocabulary contains %d words'%len(vocabs))

# Convert the words of each sentence to ids
data=[]
for content in df.content:
    sentence=[]
    for word in content.split():
        sentence.append(vocabs[word])
    data.append(sentence)
    
print('The longest sentence has %d words'%np.max([len(sentence) for sentence in data]))

# Pad or truncate every sentence to 80 ids and collect the labels
from keras.preprocessing import sequence

xs=sequence.pad_sequences(data,maxlen=80)
ys=df.label.values

The vocabulary contains 124374 words
The longest sentence has 140 words
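
Since the longest sentence has 140 words but `maxlen=80`, some sentences get cut. The sketch below mimics, in plain Python, what the cell above does: build a token-to-id map with id 0 reserved for padding, then left-pad short sequences and keep only the last `maxlen` ids of long ones, which is the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`). The helper names `build_vocab` and `pad_pre`, the `<pad>` token, and the sample sentences are hypothetical.

```python
def build_vocab(sentences, pad_token='<pad>'):
    """Map each distinct token to an integer id; id 0 is the padding token."""
    vocab = {pad_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            # The default is evaluated before insertion, so new words get the next free id.
            vocab.setdefault(word, len(vocab))
    return vocab


def pad_pre(ids, maxlen, value=0):
    """Keep the last `maxlen` ids, then left-pad with `value` up to `maxlen`."""
    ids = ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids


sentences = ['a b c', 'b d', 'a b c d e f']  # hypothetical pre-segmented text
vocab = build_vocab(sentences)
encoded = [[vocab[w] for w in s.split()] for s in sentences]
padded = [pad_pre(ids, maxlen=4) for ids in encoded]

for row in padded:
    print(row)
```

Note that with this default the beginning of a long review is dropped, not the end; passing `truncating='post'` to `pad_sequences` would keep the first 80 words instead.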


## Build the CNN model

In [41]:
from keras.models import Model
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding,Input
from keras.layers import Convolution1D, GlobalMaxPooling1D


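# Set parameters:
max_features = len(vocabs)  # vocabulary size, including the padding token
maxlen = 80  # must match the length used in pad_sequences and the Input shape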
batch_size = 256
embedding_dims = 100
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 10

In [42]:
input=Input(shape=(80,),dtype='int32')

x=Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,
                    dropout=0.2,)(input)


x=Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1)(x)
x=GlobalMaxPooling1D()(x)
# We add a vanilla hidden layer:
x=Dense(hidden_dims)(x)
x=Dropout(0.2)(x)
x=Activation('relu')(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
x=Dense(1)(x)
output=Activation('sigmoid')(x)

model=Model(input=[input],output=output)

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(xs, ys,
          batch_size=batch_size,
          nb_epoch=nb_epoch,validation_split=0.2,verbose=1)


Train on 219205 samples, validate on 54802 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f03e695c6a0>
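
To make the shapes in the model above concrete, here is a minimal NumPy sketch of what the convolution and pooling layers compute for a single sentence: a `valid` convolution with a window of 3 tokens over 80 positions yields 80 - 3 + 1 = 78 positions, and global max pooling reduces them to one value per filter. The random arrays stand in for the embedded sentence and the learned filters; this illustrates the arithmetic only and is not Keras' implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
seq_len, embedding_dims, nb_filter, filter_length = 80, 100, 250, 3

embedded = rng.randn(seq_len, embedding_dims)                  # stand-in for one embedded sentence
kernels = rng.randn(filter_length, embedding_dims, nb_filter)  # stand-in for learned filters
bias = np.zeros(nb_filter)

# 'valid' mode: slide a window of filter_length tokens, without padding the edges.
windows = np.stack([embedded[i:i + filter_length]
                    for i in range(seq_len - filter_length + 1)])

# Each filter is a dot product over (window position, embedding dim), followed by ReLU.
conv = np.maximum(0.0, np.tensordot(windows, kernels, axes=([1, 2], [0, 1])) + bias)

# Global max pooling keeps the strongest response of every filter over the sentence.
pooled = conv.max(axis=0)

print(conv.shape, pooled.shape)
```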

In [14]:
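import gensim
# Pretrained word2vec model saved in gensim's native format
w2v=gensim.models.Word2Vec.load('../WordEmbedding/data/news_tensite_xml.dat.jiebaresult.w2v')
# Pretrained fastText vectors stored in the word2vec text format
fastText=gensim.models.Word2Vec.load_word2vec_format('../WordEmbedding/data/news_tensite_xml.dat.jiebaresult.fasttext.vec')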

In [36]:
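# One row per vocabulary entry, in id order: use the pretrained vector when the word
# is known to w2v, otherwise fall back to small random values
weights=np.array([w2v[word] if word in w2v else 
                  np.random.uniform(low=-0.05,high=0.05,size=(embedding_dims,)) 
                 for word in vocab_list])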
print(weights.shape)

(124374, 100)
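
The cell above does not report how many vocabulary entries actually received a pretrained vector. A minimal sketch of the same lookup with a coverage count follows; a plain dict stands in for the gensim model, and the toy vocabulary, vector values, and 4-dimensional size are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(1337)
embedding_dims = 4

# Hypothetical stand-in for the pretrained model: token -> vector.
pretrained = {'movie': np.ones(embedding_dims), 'good': np.full(embedding_dims, 0.5)}
vocab_list = ['<pad>', 'movie', 'good', 'unseen-word']

rows, hits = [], 0
for word in vocab_list:
    if word in pretrained:
        rows.append(pretrained[word])
        hits += 1
    else:
        # Out-of-vocabulary words get small random values, as in the cell above.
        rows.append(rng.uniform(low=-0.05, high=0.05, size=embedding_dims))

weights = np.array(rows)              # row i is the vector for the word with id i
coverage = hits / len(vocab_list)

print(weights.shape, coverage)
```

A low coverage would mean that most rows are random, in which case the pretrained initialization can be expected to matter less.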


In [43]:
input=Input(shape=(80,),dtype='int32')

x=Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,weights=[weights],
                    dropout=0.2,)(input)


x=Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1)(x)
x=GlobalMaxPooling1D()(x)
# We add a vanilla hidden layer:
x=Dense(hidden_dims)(x)
x=Dropout(0.2)(x)
x=Activation('relu')(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
x=Dense(1)(x)
output=Activation('sigmoid')(x)

model=Model(input=[input],output=output)

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(xs, ys,
          batch_size=batch_size,
          nb_epoch=20,validation_split=0.2,verbose=1)


Train on 219205 samples, validate on 54802 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f03e6469208>
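
A note on `validation_split=0.2`: Keras takes the validation samples from the end of `xs` and `ys`, before any shuffling, so the split depends on the row order of the corpus file. The sketch below reproduces the sample counts from the training logs, assuming the split point is computed as `int(n * (1 - validation_split))`; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples, validation_split):
    """Return (train, validation) sizes when the last fraction is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at


# Total number of samples, taken from the two counts in the training log.
n_total = 219205 + 54802
n_train, n_val = split_sizes(n_total, 0.2)

print(n_train, n_val)
```

If the rows of the corpus are ordered in some way (for example by film or by label), shuffling `xs` and `ys` with the same permutation before calling `fit` gives a more representative validation set.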