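# Sentence-level sentiment classification, using Douban reviews as the training corpus
## To switch to a different corpus, only the contents of `df` in the first cell need to change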

In [27]:
import numpy as np
np.random.seed(1337)  # for reproducibility

import pandas as pd
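# Load the corpus: one review per row, with a raw label and space-tokenized text
df=pd.read_csv('data/DoubanZH.txt',names=['label','content'])
# Binarize the label: a raw value of 50 is positive (1), anything else is negative (0)
df['label']=df.label.apply(lambda x:1 if x==50 else 0)
# Drop rows whose content is missing
df=df[df.content.notnull()]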
df.head()

Unnamed: 0,label,content
0,1,智取 威虎山 之 寻找 梁家辉
1,1,燃爆 了 ！ ！ ！
2,1,硬到 骨子里
3,0,红色 电影 新 马甲 。
4,0,看 完 影评 我 觉得 我 是 一个 人 不能 更 糟心 的 片子
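
As a rough illustration of what a replacement corpus has to look like, the sketch below turns `(raw_label, text)` pairs into records with the same `label` and `content` fields that `df` carries. The helper `to_rows`, the `positive_value` parameter, and the raw label values `20` and `40` are hypothetical and only serve the example. The text is assumed to be whitespace-tokenized already, as it is in the Douban file.

```python
def to_rows(samples, positive_value=50):
    """Map (raw_label, tokenized_text) pairs to dicts with `label` and `content` keys.

    Hypothetical helper: mirrors the binarization and the missing-content
    filter of the loading cell, but on plain Python data.
    """
    rows = []
    for raw_label, text in samples:
        if text is None:
            continue  # same effect as dropping rows with null content
        rows.append({'label': 1 if raw_label == positive_value else 0,
                     'content': text})
    return rows


# Toy input; the raw labels 20 and 40 are made up for illustration
samples = [(50, '燃爆 了 ！ ！ ！'), (20, '红色 电影 新 马甲 。'), (40, None)]
rows = to_rows(samples)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give a frame with the two columns the rest of the notebook relies on.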


## Preprocess the corpus: convert text to ids

In [28]:
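# Build the vocabulary over the whole corpus; id 0 is reserved for padding
vocabs={r'\s':0}
for content in df.content:
    for word in content.split():
        # Assign an id only on first sight, so ids stay unique and contiguous
        if word not in vocabs:
            vocabs[word]=len(vocabs)
print('Vocabulary contains %d words'%len(vocabs))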

# Convert each tokenized sentence into a list of word ids
data=[]
for content in df.content:
    sentence=[]
    for word in content.split():
        sentence.append(vocabs[word])
    data.append(sentence)

print('Longest sentence has %d words'%np.max([len(sentence) for sentence in data]))

# Pad or truncate every sentence to 80 ids and collect the labels
from keras.preprocessing import sequence

xs=sequence.pad_sequences(data,maxlen=80)
ys=df.label.values

Vocabulary contains 124374 words
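
To make the preprocessing steps concrete, here is a minimal pure-Python sketch that traces two of the sample reviews through vocabulary building, id encoding, and padding. The helpers `build_vocab`, `encode`, and `pad_pre` are illustrative names, not part of the notebook. `pad_pre` is only an approximation of the default behaviour of `sequence.pad_sequences` (zeros added on the left, long sequences cut from the left), assuming `maxlen >= 1`.

```python
def build_vocab(sentences, pad_token=r'\s'):
    """Give every distinct whitespace-separated token a unique id; 0 is the padding id."""
    vocab = {pad_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:  # only new words receive a fresh id
                vocab[word] = len(vocab)
    return vocab


def encode(sentences, vocab):
    """Replace each token with its id."""
    return [[vocab[word] for word in sentence.split()] for sentence in sentences]


def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` ids of long ones."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy = ['燃爆 了 ！ ！ ！', '硬到 骨子里']
vocab = build_vocab(toy)
ids = encode(toy, vocab)
xs_toy = pad_pre(ids, maxlen=4)

print(ids)     # [[1, 2, 3, 3, 3], [4, 5]]
print(xs_toy)  # [[2, 3, 3, 3], [0, 0, 4, 5]]
```

The repeated `！` maps to the same id each time, and the five-token review loses its first token once it is cut to four ids.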


## Build the CNN model

In [38]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D

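## Configuration parameters
max_features = len(vocabs)  # vocabulary size, including the padding id 0
maxlen = 80                 # sentence length after padding or truncation
batch_size = 32
embedding_dims = 50         # size of each word vector
nb_filter = 250             # number of convolution filters
filter_length = 3           # each filter spans 3 consecutive words
hidden_dims = 250           # units in the hidden dense layer
nb_epoch = 2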

In [None]:
model = Sequential()

model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen,
                    dropout=0.2))

model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))

model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(xs, ys,
          batch_size=batch_size,
          nb_epoch=nb_epoch,validation_split=0.2)
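
As a rough sanity check on the model above, the layer sizes can be worked out by hand. The sketch below assumes the usual weight shapes: an embedding matrix of vocabulary size by embedding dimension, a convolution kernel of `filter_length` x `embedding_dims` x `nb_filter` plus one bias per filter, and dense layers with biases. The vocabulary size is taken from the printed output earlier in the notebook; the names `conv_steps`, `params`, and `total` are illustrative.

```python
vocab_size = 124374   # vocabulary size printed by the preprocessing cell
maxlen = 80
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250

# A 'valid' convolution with stride 1 produces one output per full window
conv_steps = maxlen - filter_length + 1

params = {
    'embedding': vocab_size * embedding_dims,
    'conv1d': filter_length * embedding_dims * nb_filter + nb_filter,
    'dense_hidden': nb_filter * hidden_dims + hidden_dims,
    'dense_output': hidden_dims * 1 + 1,
}
total = sum(params.values())

print(conv_steps)  # 78
for name, count in params.items():
    print(name, count)
print(total)       # 6319451
```

Under these assumptions the embedding table holds 6,218,700 of the 6,319,451 weights, roughly 98%, so the vocabulary size is what mostly determines the model's size.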
