https://spaces.ac.cn/archives/3942

Core entity recognition based on transfer learning and a bidirectional LSTM

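Transfer learning shows up in the following steps:
1. Train Word2Vec on the training and test corpora together, so that the word vectors themselves capture the semantics of the test corpus;
2. Train the model on the training corpus;
3. Use that model to predict on the test corpus, then train a new model on those predictions together with the training corpus;
4. Predict with the new model; the results improve somewhat;
5. Compare the two rounds of predictions. If both rounds agree on a sample, that prediction is very likely correct, so use these "very likely correct" test results to train the model;
6. Predict with the updated model;
7. If you like, keep repeating steps 4, 5 and 6.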
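
Steps 5 and 6 amount to a self-training filter: only the test samples whose predicted entities are identical in two consecutive rounds are kept as pseudo-labels. A minimal sketch in plain Python follows; the function name `agreed_rows` and the toy predictions are illustrative assumptions, not part of the original notebook.

```python
def agreed_rows(prev_preds, new_preds):
    """Return the indices of samples whose predictions agree across two rounds."""
    return [i for i, (a, b) in enumerate(zip(prev_preds, new_preds)) if a == b]

# Toy predictions from two rounds (invented for illustration).
round_1 = [['佐藤健'], ['左派'], ['城市']]
round_2 = [['佐藤健'], ['政治'], ['城市']]

keep = agreed_rows(round_1, round_2)
print(keep)  # [0, 2]
```

Only rows 0 and 2 would be added to the training data for the next round; row 1 is dropped because the two rounds disagree.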

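The bidirectional LSTM approach:
1. Segment the text into words;
2. Cast the task as a 5-tag sequence labelling problem (0: not a core entity, 1: single-word core entity, 2: first word of a multi-word core entity, 3: middle of a multi-word core entity, 4: last word of a multi-word core entity);
3. Feed the sentence through a bidirectional LSTM, which directly outputs a predicted tag sequence;
4. Use the Viterbi algorithm to obtain the final tagging;
5. A regular LSTM has the drawback that later words weigh more than earlier ones, hence the bidirectional LSTM.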
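
The role of the Viterbi step can be seen on a toy sentence. The sketch below is a Python 3 restatement under stated assumptions: the transition table copies the allowed transitions and rough probabilities of the `zy` dictionary defined later in this notebook, while the per-word probabilities and the helper `logp` are invented for illustration.

```python
import math

# Allowed tag transitions with rough probabilities (same entries as `zy`).
TRANS = {'00': 0.15, '01': 0.15, '02': 0.7, '10': 1.0,
         '23': 0.5, '24': 0.5, '33': 0.5, '34': 0.5, '40': 1.0}
LOG_TRANS = {k: math.log(v) for k, v in TRANS.items()}

def viterbi(nodes):
    """nodes: one {tag: log-probability} dict per word. Returns the best tag string."""
    paths = dict(nodes[0])
    for node in nodes[1:]:
        new_paths = {}
        for tag, score in node.items():
            # Extend every surviving path that may legally be followed by `tag`.
            candidates = {p + tag: s + score + LOG_TRANS[p[-1] + tag]
                          for p, s in paths.items() if p[-1] + tag in LOG_TRANS}
            if candidates:
                best = max(candidates, key=candidates.get)
                new_paths[best] = candidates[best]
        paths = new_paths
    return max(paths, key=paths.get)

def logp(probs):
    """Hypothetical helper: turn a list of 5 probabilities into a node dict."""
    return {str(t): math.log(p) for t, p in enumerate(probs)}

# Toy softmax outputs for three words (invented numbers).
nodes = [logp([0.90, 0.04, 0.03, 0.02, 0.01]),
         logp([0.10, 0.10, 0.70, 0.05, 0.05]),
         logp([0.50, 0.03, 0.03, 0.04, 0.40])]

print(viterbi(nodes))  # 024
```

Taking the most probable tag at each position independently would give `020`, which is not a valid sequence because a multi-word entity opened by tag 2 is never closed. The transition constraints make the decoder choose `024` instead.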


In [3]:
import numpy as np
import pandas as pd
import jieba
from tqdm import tqdm
import re

In [4]:
d = pd.read_json('/home/ian/code/github/data/data.json') # training data, already preprocessed into standard JSON
d.index = range(len(d)) # reset the index; this only tidies the display
word_size = 128 # word-vector dimension
maxlen = 80 # sentence truncation length

In [25]:
d.shape

(12445, 4)

In [5]:
d.head()

Unnamed: 0,content,core_entity
0,﻿佐藤健（Satoh Takeru），1989年3月21日出生于日本埼玉县埼玉市，日本演员。,[佐藤健]
1,在近现代政治中，左派是指社会中维护社会中下层利益，支持改变旧的不合理社会秩序，创造更为平等的...,[左派]
2,《罪恶城市》和《Original Gangstaz》一样是一款黑帮主题的角色扮演类游戏，虽...,"[罪恶城市, Original Gangstaz]"
3,朱明峰山的风景很美丽的，树木茂盛，古树参天，空气清新，奇花异草遍布山间，山顶的山脉耸立，是很...,[朱明峰山]
4,《雪》是张艺军演唱的一首歌曲。,[雪]


In [6]:
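'''
Custom word segmentation. The main rules:
1. Runs of English letters and digits are not segmented; they are returned as-is;
2. Text inside book-title marks 《》 is not segmented;
3. Text inside Chinese double quotes is not segmented if it is at most ten characters long;
4. Characters outside the allowed set are all replaced with spaces;
5. Segmentation uses jieba with new-word discovery (HMM) turned off.
'''
# Matches runs of digits, letters, spaces and '.', or text inside 《》, or text inside Chinese double quotes
not_cuts = re.compile(r'([\da-zA-Z \.]+)|《(.*?)》|“(.{1,10})”')
# Matches any character that is NOT a Chinese character, digit, letter, 《》, ASCII or full-width parenthesis, Chinese double quote, · or '.'
re_replace = re.compile(r'[^\u4e00-\u9fa50-9a-zA-Z《》\(\)（）“”·\.]')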

In [10]:
re_replace.sub(' ','---你。。。好')

'   你   好'

In [11]:
for i in not_cuts.finditer('佐藤健（Satoh Takeru），1989年3月21日出生于日本埼玉县埼玉市，日本演员。'):
    print(i)

<_sre.SRE_Match object; span=(4, 16), match='Satoh Takeru'>
<_sre.SRE_Match object; span=(18, 22), match='1989'>
<_sre.SRE_Match object; span=(23, 24), match='3'>
<_sre.SRE_Match object; span=(25, 27), match='21'>


In [12]:
for i in not_cuts.finditer('《雪》是张艺军演唱的一首歌曲。'):
    print(i)

<_sre.SRE_Match object; span=(0, 3), match='《雪》'>


In [13]:
def mycut(s):
    result = []
    j = 0
    s = re_replace.sub(' ', s) # replace every character outside the allowed set with a space
    for i in not_cuts.finditer(s):
        result.extend(jieba.lcut(s[j:i.start()], HMM=False))
        if s[i.start()] in [u'《', u'“']:
            result.extend([s[i.start()], s[i.start()+1:i.end()-1], s[i.end()-1]])
        else:
            result.append(s[i.start():i.end()])
        j = i.end()
    result.extend(jieba.lcut(s[j:], HMM=False))
    return result

In [14]:
d['words'] = d['content'].apply(mycut) #分词
d.head()

Unnamed: 0,content,core_entity,words
0,﻿佐藤健（Satoh Takeru），1989年3月21日出生于日本埼玉县埼玉市，日本演员。,[佐藤健],"[ , 佐藤, 健, （, Satoh Takeru, ）, 1989, 年, 3, 月..."
1,在近现代政治中，左派是指社会中维护社会中下层利益，支持改变旧的不合理社会秩序，创造更为平等的...,[左派],"[在, 近现代, 政治, 中, , 左派, 是, 指, 社会, 中, 维护, 社会, 中下..."
2,《罪恶城市》和《Original Gangstaz》一样是一款黑帮主题的角色扮演类游戏，虽...,"[罪恶城市, Original Gangstaz]","[《, 罪恶城市, 》, 和, 《, Original Gangstaz, 》, 一样, ..."
3,朱明峰山的风景很美丽的，树木茂盛，古树参天，空气清新，奇花异草遍布山间，山顶的山脉耸立，是很...,[朱明峰山],"[朱, 明, 峰山, 的, 风景, 很, 美丽, 的, , 树木, 茂盛, , 古树, ..."
4,《雪》是张艺军演唱的一首歌曲。,[雪],"[《, 雪, 》, 是, 张, 艺, 军, 演唱, 的, 一首, 歌曲, ]"


In [17]:
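def label(k): # convert the annotated entities of row k into a tag sequence
    s = d['words'][k]
    r = ['0']*len(s)
    for i in range(len(s)):
        for j in d['core_entity'][k]:
            if s[i] in j: # a word is marked when it is a substring of some core entity
                r[i] = '1'
                break
    s = ''.join(r)
    r = [0]*len(s)
    for i in re.finditer('1+', s):
        if i.end() - i.start() > 1:
            r[i.start()] = 2 # 2: first word of a multi-word core entity
            r[i.end()-1] = 4 # 4: last word of a multi-word core entity
            for j in range(i.start()+1, i.end()-1):
                r[j] = 3 # 3: middle of a multi-word core entity
        else:
            r[i.start()] = 1 # 1: single-word core entity
    return r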

In [16]:
label(0)

[0, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
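
The inverse direction, from a tag sequence back to entity strings, is what the prediction step needs later on. A minimal sketch, assuming well-formed tag sequences; the function name `tags_to_entities` is hypothetical, and the pattern `23*4|1` is a slightly stricter variant of the `2.*?4|1` pattern used further down in this notebook.

```python
import re

def tags_to_entities(words, tags):
    """Join the words covered by each single tag 1 or each 2, 3..., 4 run."""
    path = ''.join(str(t) for t in tags)
    return [''.join(words[m.start():m.end()])
            for m in re.finditer(r'23*4|1', path)]

# First six words and tags of row 0 shown above.
words = [' ', '佐藤', '健', '（', 'Satoh Takeru', '）']
tags = [0, 2, 4, 0, 0, 0]
print(tags_to_entities(words, tags))  # ['佐藤健']
```

Because each tag is a single digit, positions in the joined tag string line up one-to-one with positions in the word list, so match offsets can be used directly as word indices.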

In [20]:
d['label'] = list(map(label, tqdm(iter(d.index)))) # generate the tags



In [21]:
d.head()

Unnamed: 0,content,core_entity,words,label
0,﻿佐藤健（Satoh Takeru），1989年3月21日出生于日本埼玉县埼玉市，日本演员。,[佐藤健],"[ , 佐藤, 健, （, Satoh Takeru, ）, 1989, 年, 3, 月...","[0, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,在近现代政治中，左派是指社会中维护社会中下层利益，支持改变旧的不合理社会秩序，创造更为平等的...,[左派],"[在, 近现代, 政治, 中, , 左派, 是, 指, 社会, 中, 维护, 社会, 中下...","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,《罪恶城市》和《Original Gangstaz》一样是一款黑帮主题的角色扮演类游戏，虽...,"[罪恶城市, Original Gangstaz]","[《, 罪恶城市, 》, 和, 《, Original Gangstaz, 》, 一样, ...","[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,朱明峰山的风景很美丽的，树木茂盛，古树参天，空气清新，奇花异草遍布山间，山顶的山脉耸立，是很...,[朱明峰山],"[朱, 明, 峰山, 的, 风景, 很, 美丽, 的, , 树木, 茂盛, , 古树, ...","[2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,《雪》是张艺军演唱的一首歌曲。,[雪],"[《, 雪, 》, 是, 张, 艺, 军, 演唱, 的, 一首, 歌曲, ]","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


In [23]:
# shuffle the data randomly
idx = list(range(len(d)))
d.index = idx
np.random.shuffle(idx)#Modify a sequence in-place by shuffling its contents.
d = d.loc[idx]
d.index = range(len(d))

In [26]:
'''
Train Word2Vec with gensim:
1. Train on the training and test corpora jointly;
2. In testing, skip-gram worked somewhat better.
'''
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [30]:
%%time
word2vec = gensim.models.Word2Vec(d['words'], 
                                  min_count=1, 
                                  size=word_size, 
                                  workers=20,
                                  iter=20,
                                  window=8,
                                  negative=8,
                                  sg=1)
word2vec.save('word2vec_words_final.model')
word2vec.init_sims(replace=True) # normalize in advance so that the word vectors are unaffected by scale

2018-08-08 16:54:50,061 : INFO : collecting all words and their counts
2018-08-08 16:54:50,064 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-08-08 16:54:50,149 : INFO : PROGRESS: at sentence #10000, processed 312348 words, keeping 37036 word types
2018-08-08 16:54:50,164 : INFO : collected 42029 word types from a corpus of 390502 raw words and 12445 sentences
2018-08-08 16:54:50,166 : INFO : Loading a fresh vocabulary
2018-08-08 16:54:50,278 : INFO : effective_min_count=1 retains 42029 unique words (100% of original 42029, drops 0)
2018-08-08 16:54:50,279 : INFO : effective_min_count=1 leaves 390502 word corpus (100% of original 390502, drops 0)
2018-08-08 16:54:50,358 : INFO : deleting the raw counts dictionary of 42029 items
2018-08-08 16:54:50,360 : INFO : sample=0.001 downsamples 23 most-common words
2018-08-08 16:54:50,361 : INFO : downsampling leaves estimated 303299 word corpus (77.7% of prior 390502)
...
2018-08-08 16:54:53,208 : INFO : EPOCH - 4 : training on 390502 raw words (303228 effective words) took 0.6s, 518552 effective words/s
...
2018-08-08 16:54:55,608 : INFO : EPOCH - 8 : training on 390502 raw words (303112 effective words) took 0.6s, 550907 effective words/s
...
2018-08-08 16:54:58,093 : INFO : EPOCH - 12 : training on 390502 raw words (303431 effective words) took 0.6s, 518632 effective words/s
...

CPU times: user 2min 40s, sys: 573 ms, total: 2min 41s
Wall time: 13.9 s


In [33]:
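print('Starting the first training run......')

'''
Train the model with the latest version of Keras, with GPU acceleration (mine is a GTX 960).
At the time of writing, Bidirectional is only available in the GitHub version.
'''
from keras.layers import Dense, LSTM, Lambda, TimeDistributed, Input, Masking, Bidirectional
from keras.models import Model
from keras.utils import np_utils
from keras.regularizers import l1 # L1 regularization makes the output sparser

Starting the first training run......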


In [50]:
sequence = Input(shape=(maxlen, word_size))
mask = Masking(mask_value=0.)(sequence)
blstm = Bidirectional(LSTM(64, return_sequences=True), merge_mode='sum')(mask)
blstm = Bidirectional(LSTM(32, return_sequences=True), merge_mode='sum')(blstm)
output = TimeDistributed(Dense(5, activation='softmax', activity_regularizer=l1(0.01)))(blstm)
model = Model(inputs=sequence, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [51]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 80, 128)           0         
_________________________________________________________________
masking_3 (Masking)          (None, 80, 128)           0         
_________________________________________________________________
bidirectional_5 (Bidirection (None, 80, 64)            98816     
_________________________________________________________________
bidirectional_6 (Bidirection (None, 80, 32)            24832     
_________________________________________________________________
time_distributed_2 (TimeDist (None, 80, 5)             165       
Total params: 123,813
Trainable params: 123,813
Non-trainable params: 0
_________________________________________________________________
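
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias, and a bidirectional wrapper doubles that. The short calculation below is a sanity check; the helper names are illustrative, not Keras API.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and one bias vector.
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # One weight per input plus one bias, for every output unit.
    return units * (input_dim + 1)

first = 2 * lstm_params(128, 64)   # two directions over 128-dim word vectors
second = 2 * lstm_params(64, 32)   # merge_mode='sum' keeps the first layer's width at 64
head = dense_params(32, 5)         # softmax over the 5 tags

print(first, second, head, first + second + head)  # 98816 24832 165 123813
```

With `merge_mode='sum'` the forward and backward outputs are added rather than concatenated, which is why the second layer sees 64 input features instead of 128.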


In [37]:
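'''
gen_matrix turns a segmented word list into a training sample (a matrix of word vectors)
gen_target turns a tag sequence into a one-hot target
Sequences longer than maxlen are truncated; shorter ones are padded with zeros
'''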
gen_matrix = lambda z: np.vstack((word2vec[z[:maxlen]], np.zeros((maxlen-len(z[:maxlen]), word_size))))
gen_target = lambda z: np_utils.to_categorical(np.array(z[:maxlen] + [0]*(maxlen-len(z[:maxlen]))), 5)
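
The truncate-then-pad logic is easier to follow on plain lists. The sketch below mirrors what `gen_target` does, using a small `MAXLEN` so the result is readable; `pad_tags` and `one_hot` are hypothetical stand-ins for the NumPy and Keras calls above.

```python
MAXLEN = 5   # the notebook uses maxlen = 80; 5 keeps the example short
N_TAGS = 5

def pad_tags(tags, maxlen=MAXLEN):
    """Truncate to maxlen, then right-pad with tag 0."""
    clipped = tags[:maxlen]
    return clipped + [0] * (maxlen - len(clipped))

def one_hot(tags, n_tags=N_TAGS):
    """Encode each tag as a list with a single 1 at the tag's index."""
    return [[1 if t == k else 0 for k in range(n_tags)] for t in tags]

print(pad_tags([0, 2, 4]))              # [0, 2, 4, 0, 0]
print(pad_tags([0, 2, 3, 3, 3, 4, 0]))  # [0, 2, 3, 3, 3]
print(one_hot(pad_tags([0, 2, 4]))[1])  # [0, 0, 1, 0, 0]
```

Note that a long sentence can be cut in the middle of an entity, as in the second call, and that padded positions receive tag 0, the same tag as ordinary non-entity words.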


In [48]:
# to save memory, train through a generator
def data_generator(data, targets, batch_size): 
    idx = np.arange(len(data))
    np.random.shuffle(idx)
    batches = [idx[range(batch_size*i, min(len(data), batch_size*(i+1)))] for i in range(len(data)//batch_size+1)]
    while True:
        for i in batches:
            # in Python 3 map() is lazy, so materialize it before building the arrays;
            # np.array(map(...)) yields a 0-d object array and fails with
            # "TypeError: len() of unsized object"
            xx = np.array(list(map(gen_matrix, data[i])))
            yy = np.array(list(map(gen_target, targets[i])))
            yield (xx, yy)

batch_size = 1024

In [52]:
%%time
# Keras 2 counts batches (steps_per_epoch), not samples, per epoch
steps_per_epoch = len(d)//batch_size + 1
history = model.fit_generator(data_generator(d['words'], d['label'], batch_size), steps_per_epoch=steps_per_epoch, epochs=200)
model.save_weights('words_seq2seq_final_1.model')

In [None]:
# output the prediction results (raw, not yet post-processed)
def predict_data(data, batch_size):
    batches = [range(batch_size*i, min(len(data), batch_size*(i+1))) for i in range(len(data)//batch_size+1)]
    p = model.predict(np.array(list(map(gen_matrix, data[batches[0]]))), verbose=1)
    for i in batches[1:]:
        print(min(i), 'done.')
        p = np.vstack((p, model.predict(np.array(list(map(gen_matrix, data[i]))), verbose=1)))
    return p

d['predict'] = list(predict_data(d['words'], batch_size))
# dd is the test-set DataFrame; the source never shows how it is loaded,
# presumably it is prepared like d, with a 'words' column
dd['predict'] = list(predict_data(dd['words'], batch_size))

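'''
Dynamic programming:
1. zy is the transition matrix, stored as log-probabilities; the probabilities are rough estimates, and in practice their exact values matter little.
2. viterbi is the dynamic programming algorithm.
'''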
zy = {'00':0.15, 
      '01':0.15, 
      '02':0.7, 
      '10':1.0, 
      '23':0.5, 
      '24':0.5,
      '33':0.5,
      '34':0.5, 
      '40':1.0
     }

zy = {i:np.log(zy[i]) for i in zy.keys()}

def viterbi(nodes):
    paths = nodes[0]
    for l in range(1,len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1]+i in zy.keys():
                    nows[j+i]= paths_[j]+nodes[l][i]+zy[j[-1]+i]
            k = max(nows, key=nows.get) # best path ending in tag i (dict views cannot be indexed in Python 3)
            paths[k] = nows[k]
    return max(paths, key=paths.get)

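'''
Post-process the predictions into the format required for submission.
The process consists of dynamic programming followed by result extraction.
'''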

def predict(i):
    nodes = [dict(zip(['0','1','2','3','4'], k)) for k in np.log(dd['predict'][i][:len(dd['words'][i])])]
    r = viterbi(nodes)
    result = []
    words = dd['words'][i]
    for j in re.finditer('2.*?4|1', r):
        result.append((''.join(words[j.start():j.end()]), np.mean([nodes[k][r[k]] for k in range(j.start(),j.end())])))
    if result:
        result = pd.DataFrame(result)
        return [result[0][result[1].argmax()]]
    else:
        return result

dd['core_entity'] = list(map(predict, tqdm(iter(dd.index), desc='First prediction')))

'''
Export in the JSON format required for submission
'''
gen = lambda i:'[{"content": "'+dd.iloc[i]['content']+'", "core_entity": ["'+''.join(dd.iloc[i]['core_entity'])+'"]}]'
ssss = map(gen, tqdm(range(len(dd))))
result='\n'.join(ssss)
import codecs
f=codecs.open('result1.txt', 'w', encoding='utf-8')
f.write(result)
f.close()
import os
os.system('rm result1.zip')
os.system('zip result1.zip result1.txt')
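
Building each line by string concatenation produces invalid JSON whenever `content` contains a double quote or a backslash. A hedged alternative is to let the standard `json` module do the escaping. The sketch below assumes one single-element list per line, as in the `gen` lambda above; the function name `to_submission_line` and the sample sentence are illustrative. Unlike `gen`, it writes the entity list as-is instead of joining it into one string.

```python
import json

def to_submission_line(content, entities):
    """Serialize one record; ensure_ascii=False keeps Chinese text readable."""
    record = [{"content": content, "core_entity": entities}]
    return json.dumps(record, ensure_ascii=False)

# A sentence containing ASCII double quotes, which plain concatenation would not escape.
line = to_submission_line('他说："《雪》很好听"', ['雪'])
print(line)
```

The resulting line round-trips through `json.loads`, which is a quick way to validate a submission file before uploading it.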

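print('Starting the first round of transfer learning......')

'''
Begin transfer learning.
'''

def label(k): # convert the predicted entities of test row k into a tag sequence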
    s = dd['words'][k]
    r = ['0']*len(s)
    for i in range(len(s)):
        for j in dd['core_entity'][k]:
            if s[i] in j:
                r[i] = '1'
                break
    s = ''.join(r)
    r = [0]*len(s)
    for i in re.finditer('1+', s):
        if i.end() - i.start() > 1:
            r[i.start()] = 2
            r[i.end()-1] = 4
            for j in range(i.start()+1, i.end()-1):
                r[j] = 3
        else:
            r[i.start()] = 1
    return r

dd['label'] = list(map(label, tqdm(iter(dd.index)))) # generate the tags

'''
Train the model on the test set and the training set together,
with a sample weight of 1 for the test set and 10 for the training set
'''
w = np.array([1]*len(dd) + [10]*len(d))
def data_generator(data, targets, batch_size): 
    idx = np.arange(len(data))
    np.random.shuffle(idx)
    batches = [idx[range(batch_size*i, min(len(data), batch_size*(i+1)))] for i in range(len(data)//batch_size+1)]
    while True:
        for i in batches:
            xx = np.array(list(map(gen_matrix, data[i])))
            yy = np.array(list(map(gen_target, targets[i])))
            yield (xx, yy, w[i])

# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
history = model.fit_generator(data_generator(
                                    pd.concat([dd['words'], d['words']], ignore_index=True), 
                                    pd.concat([dd['label'], d['label']], ignore_index=True), 
                                    batch_size), 
                              steps_per_epoch=(len(dd)+len(d))//batch_size+1, 
                              epochs=20)

model.save_weights('words_seq2seq_final_2.model')
d['predict'] = list(predict_data(d['words'], batch_size))
dd['predict'] = list(predict_data(dd['words'], batch_size))
dd['core_entity_2'] = list(map(predict, tqdm(iter(dd.index), desc='Prediction after transfer learning round 1')))

'''
Export in the JSON format required for submission
'''
gen = lambda i:'[{"content": "'+dd.iloc[i]['content']+'", "core_entity": ["'+''.join(dd.iloc[i]['core_entity_2'])+'"]}]'
ssss = map(gen, tqdm(range(len(dd))))
result='\n'.join(ssss)
import codecs
f=codecs.open('result2.txt', 'w', encoding='utf-8')
f.write(result)
f.close()
import os
os.system('rm result2.zip')
os.system('zip result2.zip result2.txt')

print('Starting the second round of transfer learning......')

'''
Begin transfer learning, round 2.
'''

ddd = dd[dd['core_entity'] == dd['core_entity_2']].copy()

'''
Train the model on the test set and the training set together,
with a sample weight of 1 for the test set and 5 for the training set
'''
w = np.array([1]*len(ddd) + [5]*len(d))
def data_generator(data, targets, batch_size): 
    idx = np.arange(len(data))
    np.random.shuffle(idx)
    batches = [idx[range(batch_size*i, min(len(data), batch_size*(i+1)))] for i in range(len(data)//batch_size+1)]
    while True:
        for i in batches:
            xx = np.array(list(map(gen_matrix, data[i])))
            yy = np.array(list(map(gen_target, targets[i])))
            yield (xx, yy, w[i])

# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
history = model.fit_generator(data_generator(
                                    pd.concat([ddd['words'], d['words']], ignore_index=True), 
                                    pd.concat([ddd['label'], d['label']], ignore_index=True), 
                                    batch_size), 
                              steps_per_epoch=(len(ddd)+len(d))//batch_size+1, 
                              epochs=20)

model.save_weights('words_seq2seq_final_3.model')
d['predict'] = list(predict_data(d['words'], batch_size))
dd['predict'] = list(predict_data(dd['words'], batch_size))
dd['core_entity_3'] = list(map(predict, tqdm(iter(dd.index), desc='Prediction after transfer learning round 2')))

'''
Export in the JSON format required for submission
'''
gen = lambda i:'[{"content": "'+dd.iloc[i]['content']+'", "core_entity": ["'+''.join(dd.iloc[i]['core_entity_3'])+'"]}]'
ssss = map(gen, tqdm(range(len(dd))))
result='\n'.join(ssss)
import codecs
f=codecs.open('result3.txt', 'w', encoding='utf-8')
f.write(result)
f.close()
import os
os.system('rm result3.zip')
os.system('zip result3.zip result3.txt')

print('Starting the third round of transfer learning......')

'''
Begin transfer learning, round 3.
'''

ddd = dd[dd['core_entity'] == dd['core_entity_2']].copy()
ddd = ddd[ddd['core_entity_3'] == ddd['core_entity_2']].copy()

'''
Train the model on the test set and the training set together,
with a sample weight of 1 for the test set and 1 for the training set
'''
w = np.array([1]*len(ddd) + [1]*len(d))
def data_generator(data, targets, batch_size): 
    idx = np.arange(len(data))
    np.random.shuffle(idx)
    batches = [idx[range(batch_size*i, min(len(data), batch_size*(i+1)))] for i in range(len(data)//batch_size+1)]
    while True:
        for i in batches:
            xx = np.array(list(map(gen_matrix, data[i])))
            yy = np.array(list(map(gen_target, targets[i])))
            yield (xx, yy, w[i])

# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
history = model.fit_generator(data_generator(
                                    pd.concat([ddd['words'], d['words']], ignore_index=True), 
                                    pd.concat([ddd['label'], d['label']], ignore_index=True), 
                                    batch_size), 
                              steps_per_epoch=(len(ddd)+len(d))//batch_size+1, 
                              epochs=20)

model.save_weights('words_seq2seq_final_4.model')
d['predict'] = list(predict_data(d['words'], batch_size))
dd['predict'] = list(predict_data(dd['words'], batch_size))
dd['core_entity_4'] = list(map(predict, tqdm(iter(dd.index), desc='Prediction after transfer learning round 3')))

'''
Export in the JSON format required for submission
'''
gen = lambda i:'[{"content": "'+dd.iloc[i]['content']+'", "core_entity": ["'+''.join(dd.iloc[i]['core_entity_4'])+'"]}]'
ssss = map(gen, tqdm(range(len(dd))))
result='\n'.join(ssss)
import codecs
f=codecs.open('result4.txt', 'w', encoding='utf-8')
f.write(result)
f.close()
import os
os.system('rm result4.zip')
os.system('zip result4.zip result4.txt')

print('Starting the fourth round of transfer learning......')

'''
Begin transfer learning, round 4.
'''

ddd = dd[dd['core_entity'] == dd['core_entity_2']].copy()
ddd = ddd[ddd['core_entity_3'] == ddd['core_entity_2']].copy()
ddd = ddd[ddd['core_entity_4'] == ddd['core_entity_2']].copy()

'''
Train the model on the test set and the training set together,
with a sample weight of 1 for the test set and 1 for the training set
'''
w = np.array([1]*len(ddd) + [1]*len(d))
def data_generator(data, targets, batch_size): 
    idx = np.arange(len(data))
    np.random.shuffle(idx)
    batches = [idx[range(batch_size*i, min(len(data), batch_size*(i+1)))] for i in range(len(data)//batch_size+1)]
    while True:
        for i in batches:
            xx = np.array(list(map(gen_matrix, data[i])))
            yy = np.array(list(map(gen_target, targets[i])))
            yield (xx, yy, w[i])

# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
history = model.fit_generator(data_generator(
                                    pd.concat([ddd['words'], d['words']], ignore_index=True), 
                                    pd.concat([ddd['label'], d['label']], ignore_index=True), 
                                    batch_size), 
                              steps_per_epoch=(len(ddd)+len(d))//batch_size+1, 
                              epochs=20)

model.save_weights('words_seq2seq_final_5.model')
d['predict'] = list(predict_data(d['words'], batch_size))
dd['predict'] = list(predict_data(dd['words'], batch_size))
dd['core_entity_5'] = list(map(predict, tqdm(iter(dd.index), desc='Prediction after transfer learning round 4')))

'''
Export in the JSON format required for submission
'''
gen = lambda i:'[{"content": "'+dd.iloc[i]['content']+'", "core_entity": ["'+''.join(dd.iloc[i]['core_entity_5'])+'"]}]'
ssss = map(gen, tqdm(range(len(dd))))
result='\n'.join(ssss)
import codecs
f=codecs.open('result5.txt', 'w', encoding='utf-8')
f.write(result)
f.close()
import os
os.system('rm result5.zip')
os.system('zip result5.zip result5.txt')