# Project 1: 利用信息抽取技术搭建知识库

本项目的目的是结合命名实体识别、依存语法分析、实体消歧、实体统一对网站开放语料抓取的数据建立小型知识图谱。

## Part1：开发句法结构分析工具

### 1.1 开发工具
使用CYK算法，根据所提供的：非终结符集合、终结符集合、规则集，对以下句子计算句法结构。

“the boy saw the dog with a telescope"



非终结符集合：N={S, NP, VP, PP, DT, Vi, Vt, NN, IN}

终结符集合：{sleeps, saw, boy, girl, dog, telescope, the, with, in}

规则集: R={
- (1) S-->NP VP 1.0
- (2) VP-->VI 0.3
- (3) VP-->Vt NP 0.4
- (4) VP-->VP PP 0.3
- (5) NP-->DT NN 0.8
- (6) NP-->NP PP 0.2
- (7) PP-->IN NP 1.0
- (8) Vi-->sleeps 1.0
- (9) Vt-->saw 1.0
- (10) NN-->boy 0.1
- (11) NN-->girl 0.1
- (12) NN-->telescope 0.3
- (13) NN-->dog 0.5
- (14) DT-->the 0.5
- (15) DT-->a 0.5
- (16) IN-->with 0.6
- (17) IN-->in 0.4


}

In [5]:
# 分数（15）
class my_CYK(object):
    def __init__(self, non_ternimal, terminal, rules_prob, start_prob):
        self.non_terminal = non_ternimal
        self.terminal = terminal
        self.rules_prob = rules_prob
        self.start_symbol = start_prob


    def parse_sentence(self, sentence):
        sents = sentence.split()
        best_path = [[{} for _ in range(len(sents))] for _ in range(len(sents))]

        # initialization
        for i in range(len(sents)):
            for x in self.non_terminal:
                best_path[i][i][x] = {}
                if (sents[i],) in self.rules_prob[x].keys():
                    best_path[i][i][x]['prob'] = self.rules_prob[x][(sents[i],)]
                    best_path[i][i][x]['path'] = {'split':None, 'rule': sents[i]}
                else:
                    best_path[i][i][x]['prob'] = 0
                    best_path[i][i][x]['path'] = {'split':None, 'rule': None}

        # CKY recursive
        for l in range(1, len(sents)):
            for i in range(len(sents)-l):
                j = i + l
                for x in self.non_terminal:
                    tmp_best_x = {'prob':0, 'path':None}
                    for key, value in self.rules_prob[x].items():
                        if key[0] not in self.non_terminal: 
                            break
                        for s in range(i, j):
                            tmp_prob = value * best_path[i][s][key[0]]['prob'] * best_path[s+1][j][key[1]]['prob']
                            if tmp_prob > tmp_best_x['prob']:
                                tmp_best_x['prob'] = tmp_prob
                                tmp_best_x['path'] = {'split': s, 'rule': key}
                    best_path[i][j][x] = tmp_best_x
        self.best_path = best_path

        # parse result
        self._parse_result(0, len(sents)-1, self.start_symbol)
        print("prob = ", self.best_path[0][len(sents)-1][self.start_symbol]['prob'])


    def _parse_result(self, left_idx, right_idx, root, ind=0):
        node = self.best_path[left_idx][right_idx][root]
        if node['path']['split'] is not None:
            print('\t'*ind, (root, self.rules_prob[root].get(node['path']['rule'])))
            self._parse_result(left_idx, node['path']['split'], node['path']['rule'][0], ind+1)
            self._parse_result(node['path']['split']+1, right_idx, node['path']['rule'][1], ind+1)
        else:
            print('\t'*ind, (root, self.rules_prob[root].get((node['path']['rule'],))) )
            print('--->', node['path']['rule'])



def main(sentence):
    non_terminal = {'S', 'NP', 'VP', 'PP', 'DT', 'Vi', 'Vt', 'NN', 'IN'}
    start_symbol = 'S'
    terminal = {'sleeps', 'saw', 'boy', 'girl', 'dog', 'telescope', 'the', 'with', 'in'}
    rules_prob = {'S': {('NP', 'VP'): 1.0},
                  'VP': {('Vt', 'NP'): 0.8, ('VP', 'PP'): 0.2},
                  'NP': {('DT', 'NN'): 0.8, ('NP', 'PP'): 0.2},
                  'PP': {('IN', 'NP'): 1.0},
                  'Vi': {('sleeps',): 1.0},
                  'Vt': {('saw',): 1.0},
                  'NN': {('boy',): 0.1, ('girl',): 0.1,('telescope',): 0.3,('dog',): 0.5},
                  'DT': {('the',): 1.0},
                  'IN': {('with',): 0.6, ('in',): 0.4},
                }
    cyk = my_CYK(non_terminal, terminal, rules_prob, start_symbol)
    cyk.parse_sentence(sentence)





In [6]:
# TODO: 对该测试用例进行测试
# "the boy saw the dog with the telescope"

if __name__ == "__main__":
    sentence = "the boy saw the dog with the telescope"
    main(sentence)

 ('S', 1.0)
	 ('NP', 0.8)
		 ('DT', 1.0)
---> the
		 ('NN', 0.1)
---> boy
	 ('VP', 0.2)
		 ('VP', 0.8)
			 ('Vt', 1.0)
---> saw
			 ('NP', 0.8)
				 ('DT', 1.0)
---> the
				 ('NN', 0.5)
---> dog
		 ('PP', 1.0)
			 ('IN', 0.6)
---> with
			 ('NP', 0.8)
				 ('DT', 1.0)
---> the
				 ('NN', 0.3)
---> telescope
prob =  0.0007372800000000003


### 1.2 计算算法复杂度
计算上一节开发的算法所对应的时间复杂度和空间复杂度。

In [None]:
# 分数（3）
# 上面所写的算法的时间复杂度和空间复杂度分别是多少？
# TODO
时间复杂度=O(), 空间复杂度=O()

## Part2 基于Bootstrapping，抽取企业股权交易关系，并建立知识库

### 2.1 练习实体消歧
将句中识别的实体与知识库中实体进行匹配，解决实体歧义问题。
可利用上下文本相似度进行识别。

在data/entity_disambiguation目录中，entity_list.csv是50个实体，valid_data.csv是需要消歧的语句（待添加）。

答案提交在submit目录中，命名为entity_disambiguation_submit.csv。格式为：第一列是需要消歧的语句序号，第二列为多个“实体起始位坐标-实体结束位坐标：实体序号”以“|”分隔的字符串。

*成绩以实体识别准确率以及召回率综合的F值评分


In [61]:
# code
# 将识别出的实体与知识库中实体进行匹配，解决识别出一个实体对应知识库中多个实体的问题。

# 将entity_list.csv中已知实体的名称导入分词词典

import jieba
import pandas as pd

entity_data = pd.read_csv('../data/entity_disambiguation/entity_list.csv', encoding = 'gb18030')
entity_dict = {}

for i in range(len(entity_data)):
    line = entity_data.iloc[i, :]
    for word in line.entity_name.split('|'):
        jieba.add_word(word)
        if word in entity_dict:
            entity_dict[word].append(line.entity_id)
        else:
            entity_dict[word] = [line.entity_id]

# 对每句句子识别并匹配实体     

valid_data = pd.read_csv('../data/entity_disambiguation/valid_data.csv', encoding = 'gb18030')

result_data = []
for i in range(len(valid_data)):
    line = valid_data.iloc[i, :]
    ret =[]  # 存储实体的坐标和序号
    loc = 0
    window = 10  # 观察上下文的窗口大小
    sentence = jieba.lcut(line.sentence)
    ret = []
    for idx, word in enumerate(sentence):
        if word in entity_dict:
            max_similar = 0
            max_entity_id = 0
            context = sentence[max(0, idx-window):min(len(sentence)-1, idx+window)]
            for ids in entity_dict[word]:
                similar = len(set(context)&set(jieba.lcut(entity_data[entity_data.entity_id.isin([ids])].reset_index().desc[0])))
                if max_similar>similar:
                    max_similar = similar
                    max_entity_id = ids
            ret.append(str(loc)+'-'+str(loc+len(word))+':'+str(ids))
        loc+=len(word)
    result_data.append([i, '|'.join(ret)])
    

In [64]:
pd.DataFrame(result_data).to_csv('../submit/entity_disambiguation_submit.csv', index=False)
result_data


[[0, '3-6:1008|109-112:1008|187-190:1008'],
 [1, '18-21:1008'],
 [2, '23-26:1008|40-43:1008'],
 [3, '7-10:1008'],
 [4, '2-5:1008|14-17:1008'],
 [5, '28-30:1003|34-36:1003|41-43:1003'],
 [6, '4-8:1001|25-27:1003|34-36:1003|100-102:1003'],
 [7, '0-2:1003|6-10:1001|19-21:1003|34-36:1003|45-47:1003'],
 [8, '8-10:1003|22-24:1003|34-36:1003|37-39:1003|46-48:1003'],
 [9, '14-16:1003'],
 [10, '0-2:1005|39-44:1005'],
 [11, '7-11:1005|20-22:1005'],
 [12, '4-6:1005|29-31:1005|62-64:1005'],
 [13, '26-28:1005'],
 [14, '0-2:1005|24-26:1005'],
 [15, '10-12:1005|28-30:1005'],
 [16, '6-8:1005|20-22:1005'],
 [17, '8-12:1011|26-30:1011'],
 [18, '9-13:1011|28-30:1013'],
 [19, '0-2:1013|18-20:1013'],
 [20, '6-8:1013'],
 [21, '0-2:1013|26-28:1013|41-43:1013'],
 [22, '0-2:1013|20-22:1013'],
 [23, '0-2:1013'],
 [24, '0-2:1013|32-34:1013'],
 [25, '0-3:1016'],
 [26, '2-5:1016|11-14:1016|18-21:1016'],
 [27, '20-23:1016'],
 [28, '11-14:1016']]

### 2.2 实体识别

借助开源工具，对实体进行识别。

将每句句子中实体识别出，存入实体词典，并用特殊符号替换语句。

In [64]:
# code
# 首先尝试利用开源工具分出实体

import fool
import pandas as pd
from copy import copy


sample_data = pd.read_csv('../data/info_extract/samples_test.csv', encoding = 'utf-8', header=0)
sample_data['ner'] = None
ner_id = 1001
ner_dict = {}  # 存储所有实体
ner_dict_reverse = {}  # 存储所有实体
for i in range(len(sample_data)):
    sentence = copy(sample_data.iloc[i, 1])
    words, ners = fool.analysis(sentence)
    ners[0].sort(key=lambda x:x[0], reverse=True)
    print(ners)
    for start, end, ner_type, ner_name in ners[0]:
        if ner_name not in ner_dict:
            ner_dict[ner_name] = ner_id
            ner_dict_reverse[ner_id] = ner_name
            ner_id+=1
        sentence = sentence[:start] + ' ner_' + str(ner_dict[ner_name]) + '_ ' + sentence[end-1:]
    sample_data.iloc[i, 2] = sentence


[[(28, 33, 'company', '国泰君安'), (0, 13, 'company', '深圳能源集团股份有限公司')]]
[[(36, 49, 'company', '远大产业控股股份有限公司'), (0, 13, 'company', '远大产业控股股份有限公司')]]
[[(104, 109, 'company', '河北银行'), (88, 99, 'company', '河北银行股份有限公司'), (61, 74, 'company', '南京栖霞建设集团有限公司'), (34, 47, 'company', '南京栖霞建设股份有限公司')]]
[[(189, 196, 'time', '2015年度'), (185, 190, 'company', '歌礼制药'), (160, 165, 'company', '康桥资本'), (136, 150, 'company', '天士力（香港）药业有限公司'), (88, 114, 'company', 'CBCInvestmentSevenLimited'), (81, 86, 'company', '康桥资本'), (44, 58, 'company', '天士力（香港）药业有限公司'), (19, 33, 'company', '天士力医药集团股份有限公司'), (2, 16, 'company', '天士力制药集团股份有限公司')]]
[[(44, 59, 'company', '江苏康缘美域生物医药有限公司'), (21, 37, 'company', '连云港康缘美域保健食品有限公司'), (6, 19, 'company', '江苏康缘药业股份有限公司'), (0, 6, 'time', '2016年')]]
[[(74, 85, 'company', '康缘国际实业有限公司'), (60, 73, 'company', '江苏康缘药业股份有限公司'), (39, 52, 'company', '江苏康缘集团有限责任公司'), (21, 32, 'company', '康缘国际实业有限公司'), (6, 19, 'company', '江苏康缘药业股份有限公司'), (0, 6, 'time', '2015年')]]
[[(27, 33, 'company', '天津大西洋'), (1

In [65]:
sample_data

Unnamed: 0,id,sentence,ner
0,1,深圳能源集团股份有限公司拟按现有2.03%的持股比例参与国泰君安本次可转换公司债的配售，参与...,ner_1002_ 拟按现有2.03%的持股比例参与 ner_1001_ 本次可转换公司债...
1,2,远大产业控股股份有限公司于报告期实施的发行股份购买资产的交易对方中金波为远大产业控股股份有限...,ner_1003_ 于报告期实施的发行股份购买资产的交易对方中金波为 ner_1003_ ...
2,3,一、根据公司第六届董事会第七次会议审议并通过的公司重大资产重组方案，南京栖霞建设股份有限公司...,一、根据公司第六届董事会第七次会议审议并通过的公司重大资产重组方案， ner_1007_ 拟...
3,4,一、天士力制药集团股份有限公司（简称“天士力医药集团股份有限公司”、“公司”）拟向子公司天士...,一、 ner_1014_ （简称“ ner_1013_ ”、“公司”）拟向子公司 ner_1...
4,5,2016年，江苏康缘药业股份有限公司将持有连云港康缘美域保健食品有限公司的股权全部转让给江苏...,ner_1018_ ， ner_1017_ 将持有 ner_1016_ 的股权全部转让给 ...
5,6,2015年，江苏康缘药业股份有限公司将持有康缘国际实业有限公司的股权全部转让给江苏康缘集团有...,ner_1021_ ， ner_1017_ 将持有 ner_1019_ 的股权全部转让给 ...
6,7,本年度上海大西洋收购天津大西洋焊接材料有限公司将其所持天津大西洋销售44%股权，支付对价8，...,ner_1025_ ner_1024_ 收购 ner_1023_ 将其所持 ner_10...


### 2.3 实体统一
对同一实体具有多个名称的情况进行统一，将多种称谓统一到一个实体上，并体现在实体的属性中（可以给实体建立“别称”属性）

公司名称有其特点，例如后缀可以省略、上市公司的地名可以省略等等。在data/dict目录中提供了几个词典，可供实体统一使用。
- company_suffix.txt是公司的通用后缀词典
- company_business_scope.txt是公司经营范围常用词典
- co_Province_Dim.txt是省份词典
- co_City_Dim.txt是城市词典
- stopwords.txt是可供参考的停用词

In [9]:
# code

import jieba
import jieba.posseg as pseg
import re
import datetime



#功能：从输入的“公司名”中提取主体(列表形式)
def main_extract(input_str,stop_word,d_4_delete,d_city_province):
    input_str = replace(input_str,d_city_province)
    #开始分词
    seg = pseg.cut(input_str)
    seg_lst = []
    for w in seg:
        elmt = w.word
        if elmt not in d_4_delete:
            seg_lst.append(elmt)
    seg_lst = remove_word(seg_lst,stop_word)
    seg_lst = city_prov_ahead(seg_lst,d_city_province)
    return seg_lst

    

#功能：将list中地名提前
def city_prov_ahead(seg_lst,d_city_province):
    city_prov_lst = []
    for seg in seg_lst:
        if seg in d_city_province:
            city_prov_lst.append(seg)
            seg_lst.remove(seg)
    city_prov_lst.sort()
    return city_prov_lst+seg_lst
        
    

    
#功能：去除停用词
def remove_word(seg,sw):
    ret = []
    for i in range(len(seg)):
        if seg[i] not in sw:
            ret.append(seg[i])
    return ret


#功能：替换com，dep的内容
def replace(com,d_city_province):
    #————————公司、部门
    #替换
    #'*'
    com = re.sub(r'(\*)*(\#)*(\-)*(\—)*(\~)*(\.)*(\/)*(\?)*(\!)*(\？)*(\")*','',com)
    #'、'
    com = re.sub(r'(\、)*','',com)
    #'+'
    com = re.sub(r'(\+)*','',com)
    #','
    com = re.sub(r'(\,)+',' ',com)
    #'，'
    com = re.sub(r'(\，)+',' ',com)
    #':'
    com = re.sub(r'(\:)*','',com)
    #[]【】都删除
    com = re.sub(r'\【.*?\】','',com)
    com = re.sub(r'\[.*?\]','',com)
    #数字在结尾替换为‘’
    com = re.sub(r'\d*$',"",com)
    #'&nbsp;'或‘&lt;’替换为‘’
    com = re.sub(r'(&gt;)*(&nbsp;)*(&lt;)*',"",com)
    #地名
    com = re.sub(r'\(',"（",com)
    com = re.sub(r'\)',"）",com)
    pat = re.search(r'\（.+?\）',com)
    while pat:
        v = pat.group()[3:-3]
        start = pat.span()[0]
        end = pat.span()[1]
        if v not in d_city_province:
            com = com[:start]+com[end:]
        else:
            com = com[:start]+com[start+3:end-3]+com[end:]
        pat = re.search(r'\（.+?\）',com)
    #()（）
    com = re.sub(r'(\()*(\))*(\（)*(\）)*','',com)
    #全数字
    com = re.sub(r'^(\d)+$',"",com)
    return com



#初始加载步骤
#输出“用来删除的字典”和“stop word”
def my_initial():
    fr1 = open(r"../data/dict/co_City_Dim.txt", encoding='utf-8')
    fr2 = open(r"../data/dict/co_Province_Dim.txt", encoding='utf-8')
    fr3 = open(r"../data/dict/company_business_scope.txt", encoding='utf-8')
    fr4 = open(r"../data/dict/company_suffix.txt", encoding='utf-8')
    #城市名
    lines1 = fr1.readlines()
    d_4_delete = []
    d_city_province = [re.sub(r'(\r|\n)*','',line) for line in lines1]
    #省份名
    lines2 = fr2.readlines()
    l2_tmp = [re.sub(r'(\r|\n)*','',line) for line in lines2]
    d_city_province.extend(l2_tmp)
    #公司后缀
    lines3 = fr3.readlines()
    l3_tmp = [re.sub(r'(\r|\n)*','',line) for line in lines3]
    lines4 = fr4.readlines()
    l4_tmp = [re.sub(r'(\r|\n)*','',line) for line in lines4]
    d_4_delete.extend(l4_tmp)
    #get stop_word
    fr = open(r'../data/dict/stopwords.txt', encoding='utf-8')   
    stop_word = fr.readlines()
    stop_word_after = [re.sub(r'(\r|\n)*','',stop_word[i]) for i in range(len(stop_word))]
    stop_word_after[-1] = stop_word[-1]
    stop_word = stop_word_after
    return d_4_delete,stop_word,d_city_province


In [18]:
d_4_delete,stop_word,d_city_province = my_initial()
company_name = "河北银行股份有限公司"
lst = main_extract(company_name,stop_word,d_4_delete,d_city_province)
company_name = ''.join(lst)  # 对公司名提取主体部分，将包含相同主体部分的公司统一为一个实体
print(company_name)

河北银行


In [None]:
# 在语句中统一实体

import fool
import pandas as pd
from copy import copy


sample_data = pd.read_csv('../data/info_extract/samples_test.csv', encoding = 'utf-8', header=0)
sample_data['ner'] = None
ner_id = 1001
ner_dict_new = {}  # 存储所有实体
ner_dict_reverse_new = {}  # 存储所有实体
for i in range(len(sample_data)):
    sentence = copy(sample_data.iloc[i, 1])
    words, ners = fool.analysis(sentence)
    ners[0].sort(key=lambda x:x[0], reverse=True)
    print(ners)
    for start, end, ner_type, ner_name in ners[0]:
        company_main_name = ''.join(main_extract(ner_name,stop_word,d_4_delete,d_city_province))  # 提取公司主体名称
        if company_main_name not in ner_dict:
            ner_dict[company_main_name] = ner_id
            ner_id+=1
        sentence = sentence[:start] + ' ner_' + str(ner_dict[company_main_name]) + '_ ' + sentence[end-1:]
    sample_data.iloc[i, 2] = sentence

    

### 2.4 关系抽取
借助句法分析工具，和实体识别的结果，以及正则表达式，设定模版抽取关系，并存储进图数据库。

本次要求抽取股权交易关系，关系为有向边，由投资方指向被投资方。

模板建立可以使用“正则表达式”、“实体间距离”、“实体上下文”、“依存句法”等。

答案提交在submit目录中，命名为info_extract_submit.csv和info_extract_entity.csv。
- info_extract_entity.csv格式为：第一列是实体编号，第二列是实体名（多个实体名用“|”分隔）
- info_extract_submit.csv格式为：第一列是关系发起方实体编号，第二列为关系接收方实体编号。

*成绩以抽取的关系准确率以及召回率综合的F值评分

#### 建立种子模板

In [73]:
# code

# 最后提交文件为识别出的整个投资图谱，以及图谱中结点列表与属性。

# 建立模板
import re

rep1 = re.compile(r'(ner_\d\d\d\d)_\s+收购\s+(ner_\d\d\d\d)_')  # 例子模板

relation_list = []  # 存储已经提取的关系
for i in range(len(sample_data)):
    sentence = copy(sample_data.iloc[i, 2])
    for v in rep1.findall(sentence+sentence):
        relation_list.append(v)

In [75]:
relation_list

[('ner_1024', 'ner_1023'), ('ner_1024', 'ner_1023')]

#### 利用bootstrapping搜索

In [None]:
# code



### 2.5 存储进图数据库

本次作业我们使用neo4j作为图数据库，neo4j需要java环境，请先配置好环境。

In [76]:

from py2neo import Node, Relationship, Graph

graph = Graph(
    "http://localhost:7474", 
    username="neo4j", 
    password="person"
)

for v in relation_list:
    a = Node('Company', name=v[0])
    b = Node('Company', name=v[1])
    r = Relationship(a, 'INVEST', b)
    s = a | b | r
    graph.create(s)

In [79]:
r

(ner_1024)-[:INVEST {}]->(ner_1023)