# Building a Knowledge Graph Using Information Retrieval
This project combines Named Entity Recognition, dependency syntactic parsing, Entity Disambiguation, Entity Resolution, NoSQL(graph database, Noe4j, cypher language), etc. to build 
a knowledge graph. Our goal is to use NLP tools to analyze many news data crawling from the website about list companies, find the latent relationships among different entities and build a knowledge graph, which will guide our future decisions. Here, we mainly focus on the equity transaction relationships. The organization of the project can be summarized as follows:
1. entity resolution,
2. named entity recognition
3. relation extraction
4. build a classifier to recognize the relationships in the test dataset
5. create a graph database using NoSQL to store the relationships
6. entity disambiguation


## 1. Entity Resolution¶
We need to unify different names under the same entity and extract the main contents. For example, we could omit the suffix, locations, and business scopes.
Data sets are provided as follows:
1. company_suffix.txt: common suffixes of companies
2. company_business_scope.txt: standard business scopes of companies
3. co_Province_Dim.txt: a dictionary of province
4. co_City_Dim.txt: a dictionary of city
5. stopwords.txt: stopwords

In [1]:
#!pip install jieba
import jieba
import jieba.posseg as pseg
import re
import datetime
from collections import defaultdict
dict_entity_name_unify = defaultdict(lambda:"")

In [2]:
def main_extract(input_str,stop_word,d_4_delete,d_city_province):
    """
    retrive entity from input name, and build dict
    {"entity name": |"full name1"|"full name2"}
    """
    seg = pseg.cut(input_str)
    seg_lst = remove_word(seg,stop_word,d_4_delete)
    seg_lst = city_prov_ahead(seg_lst,d_city_province)
    result = ''.join(seg_lst)
    if result != input_str:
        dict_entity_name_unify[result] = dict_entity_name_unify[result] + "|" + input_str     
    return result

In [3]:
def city_prov_ahead(seg,d_city_province):
    """
    move the location description to the front part of the strings
    """
    city_prov_lst = []
    for word in seg:
        if word in d_city_province:
            city_prov_lst.append(word)
    seg_lst = [word for word in seg if word not in city_prov_lst]
    return city_prov_lst+seg_lst

In [4]:
def remove_word(seg,stop_word,d_4_delete):
    """
    remove stop words and words in predefinited deleting set
    """
    filter_stop_word = [word for word, flag in seg if word not in stop_word]
    seg_lst = [word for word in filter_stop_word if word not in d_4_delete]
    return seg_lst

In [5]:
def my_initial():
    """
    load text data
    """
    fr1 = open(r"../data/dict/co_City_Dim.txt", encoding='utf-8')
    fr2 = open(r"../data/dict/co_Province_Dim.txt", encoding='utf-8')
    fr3 = open(r"../data/dict/company_business_scope.txt", encoding='utf-8')
    fr4 = open(r"../data/dict/company_suffix.txt", encoding='utf-8')
    
    lines1 = fr1.readlines()
    d_4_delete = []
    d_city_province = [re.sub(r'(\r|\n)*','',line) for line in lines1]

    lines2 = fr2.readlines()
    l2_tmp = [re.sub(r'(\r|\n)*','',line) for line in lines2]
    d_city_province.extend(l2_tmp)

    lines4 = fr4.readlines()
    l4_tmp = [re.sub(r'(\r|\n)*','',line) for line in lines4]
    d_4_delete.extend(l4_tmp)

    fr = open(r'../data/dict/stopwords.txt', encoding='utf-8')   
    stop_word = fr.readlines()
    stop_word_after = [re.sub(r'(\r|\n)*','',stop_word[i]) for i in range(len(stop_word))]
    return d_4_delete,stop_word_after,d_city_province

In [6]:
#test
d_4_delete,stop_word,d_city_province = my_initial()
input_str = "河北银行股份有限公司"
lst = main_extract(input_str,stop_word,d_4_delete,d_city_province)
company_name = ''.join(lst)  
print(company_name)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.174 seconds.
Prefix dict has been built successfully.


河北银行


## 2. Named Entity Recognition
Here I use an open-source tool named FoolNLTK to do named entity recognition.
FoolNLTK is built based bi-lstm CRF framework.
data is from data/train_data.csv and data/test_data.csv
which is crawled from the news of the Listed companies.

In [7]:
import pandas as pd
from tqdm import tqdm, trange

train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)
train_data.head()

Unnamed: 0,id,sentence,tag,member1,member2
0,6461,与本公司关系:受同一公司控制 2，杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市...,0,0,0
1,2111,三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无...,0,0,0
2,9603,2016年协鑫集成科技股份有限公司向瑞峰（张家港）光伏科技有限公司支付设备款人民币4，515...,1,协鑫集成科技股份有限公司,瑞峰（张家港）光伏科技有限公司
3,3456,证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限...,0,0,0
4,8844,本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数...,1,广发证券股份有限公司,辽宁成大股份有限公司


The training data is labelled by hand. Tag = 1 means that there exits two entities in the sentence.

In [8]:
test_data = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)
test_data.head()

Unnamed: 0,id,sentence
0,9259,2015年1月26日，多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》
1,9136,2、2016年2月5日，深圳市新纶科技股份有限公司与侯毅先
2,220,2015年10月26日，山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》
3,9041,2、2015年12月31日，印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...
4,10041,一、金发科技拟与熊海涛女士签订《股份转让协议》，协议约定：以每股1.0509元的收购价格，收...


Firstly, I need to identify the entities in each sentence, and then replace these entities with special marks.

In [9]:
import fool
import pandas as pd
from copy import copy
test_data['ner']=None
ner_id = 1001
ner_dict_new = defaultdict(lambda:0)  
ner_dict_reverse_new = defaultdict(lambda:"")

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


In [10]:
test_data


Unnamed: 0,id,sentence,ner
0,9259,2015年1月26日，多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》,
1,9136,2、2016年2月5日，深圳市新纶科技股份有限公司与侯毅先,
2,220,2015年10月26日，山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》,
3,9041,2、2015年12月31日，印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...,
4,10041,一、金发科技拟与熊海涛女士签订《股份转让协议》，协议约定：以每股1.0509元的收购价格，收...,
...,...,...,...
414,9503,近日，该子公司已完成工商注册登记手续，并领取了南京市工商行政管理局颁发的<企业法人营业执照>...,
415,4689,(二)本次交易构成关联交易正元投资拟认购金额不低于 13 亿元且不低于本次配套融资总额的 2...,
416,1772,证券代码:600225 证券简称:天津松江 公告编号:临 2015-118 天津松江股份有限...,
417,9021,2015年3月31日，湖南天润数字娱乐文化传媒股份有限公司与广东恒润互兴资产管理有限公司签署...,


In [11]:
words, ners = fool.analysis(test_data.iloc[0,1])







In [12]:
words[0]

[('2015年', 't'),
 ('1月', 't'),
 ('26日', 't'),
 ('，', 'wd'),
 ('多', 'a'),
 ('氟', 'n'),
 ('多', 'a'),
 ('化工', 'n'),
 ('股份', 'n'),
 ('有限公司', 'n'),
 ('与', 'p'),
 ('李云峰', 'nr'),
 ('先生', 'n'),
 ('签署', 'v'),
 ('了', 'y'),
 ('《', 'wkz'),
 ('附', 'v'),
 ('条件', 'n'),
 ('生效', 'vi'),
 ('的', 'ude'),
 ('股份', 'n'),
 ('认购', 'v'),
 ('合同', 'n'),
 ('》', 'wky')]

In [19]:
ners[0]

[(0, 10, 'time', '2015年1月26日'),
 (11, 22, 'company', '多氟多化工股份有限公司'),
 (23, 26, 'person', '李云峰')]

In [13]:
for i in trange(len(test_data)):
    sentence = copy(test_data.iloc[i, 1])
    words, ners = fool.analysis(sentence)
    for start, end, ner_type, ner_name in ners[0]:
        if ner_type == 'company' or ner_type == 'person':
            company_main_name = main_extract(ner_name, stop_word, d_4_delete, d_city_province)
#             company_main_name = ''.join(lst) 
            if company_main_name not in ner_dict_new:
                ner_id += 1
                ner_dict_new[company_main_name] = ner_id
            sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end - 1:]
    test_data.iloc[i, -1] = sentence
X_test = test_data[['ner']]

100%|██████████| 419/419 [00:34<00:00, 12.29it/s]


In [14]:
test_data.head()


Unnamed: 0,id,sentence,ner
0,9259,2015年1月26日，多氟多化工股份有限公司与李云峰先生签署了《附条件生效的股份认购合同》,2015年1月26日， ner_1002_ 司 ner_1003_ 云峰先生签署了《附条件生...
1,9136,2、2016年2月5日，深圳市新纶科技股份有限公司与侯毅先,2、2016年2月5日， ner_1004_ 司与侯 ner_1005_ 先
2,220,2015年10月26日，山东华鹏玻璃股份有限公司与张德华先生签署了附条件生效条件的《股份认购合同》,2015年10月26日， ner_1006_ 司与 ner_1007_ 华先生签署了附条件生...
3,9041,2、2015年12月31日，印纪娱乐传媒股份有限公司与肖文革签订了《印纪娱乐传媒股份有限公司...,2、2015年12月31日， ner_1008_ 司与 ner_1009_ 革签订了《印纪娱...
4,10041,一、金发科技拟与熊海涛女士签订《股份转让协议》，协议约定：以每股1.0509元的收购价格，收...,一、 ner_1 ner_1011_ 0_ 技拟与熊海涛女士签订《股份转让协议》，协议约定：...


In [15]:
ner_dict_new

defaultdict(<function __main__.<lambda>()>,
            {'氟化工': 1002,
             '李云峰': 1003,
             '深圳市新纶科技股份': 1004,
             '侯毅': 1005,
             '山东华鹏玻璃': 1006,
             '张德华': 1007,
             '印纪娱乐传媒': 1008,
             '肖文革': 1009,
             '金发科技': 1010,
             '熊海涛': 1011,
             '上海新朋实业': 1012,
             '宋琳': 1013,
             '王友林': 1014,
             '康力电梯': 1015,
             '彭聪': 1016,
             '神州易桥100股权': 1017,
             '百川能源': 1018,
             '曹飞': 1019,
             '珠海欧比特控制工程': 1020,
             '颜军': 1021,
             '成都云图': 1022,
             '宋睿': 1023,
             '上海岩石企业': 1024,
             '柯塞威': 1025,
             '广州阳普医疗科技股份': 1026,
             '邓冠华': 1027,
             '孙锋峰': 1028,
             '金固': 1029,
             '游族网络': 1030,
             '林奇': 1031,
             '厦门金达威集团股份': 1032,
             '江斌': 1033,
             '东方日升新能源': 1034,
             '林海峰': 1035,
             '神州数码集团股份': 1036

In [16]:
# deal with training data
train_data = pd.read_csv('../data/info_extract/train_data.csv', encoding = 'gb2312', header=0)
train_data['ner'] = None

for i in trange(len(train_data)):
    # if the sentence is not labelled, recognize the named entity and replaced with ner_index
    if train_data.iloc[i, :]['member1'] == '0' and train_data.iloc[i, :]['member2'] == '0':
        sentence = copy(train_data.iloc[i, 1])
        words, ners = fool.analysis(sentence)
        ners[0].sort(key=lambda x: x[0], reverse=True)
        for start, end, ner_type, ner_name in ners[0]:
            if ner_type == 'company' or ner_type == 'person':
                company_main_name = main_extract(ner_name, stop_word, d_4_delete, d_city_province)

                if company_main_name not in ner_dict_new:
                    ner_id += 1
                    ner_dict_new[company_main_name] = ner_id

                sentence = sentence[:start] + ' ner_' + str(ner_dict_new[company_main_name]) + '_ ' + sentence[end - 1:]
        train_data.iloc[i, -1] = sentence
    else:
        # for labelled sentence , replace the entity with ner_index
        sentence = copy(train_data.iloc[i, :]['sentence'])
        for company_main_name in [train_data.iloc[i, :]['member1'], train_data.iloc[i, :]['member2']]:

            company_main_name_new = main_extract(company_main_name, stop_word, d_4_delete, d_city_province)

            if company_main_name_new not in ner_dict_new:
                ner_id += 1
                ner_dict_new[company_main_name_new] = ner_id

            sentence = re.sub(company_main_name, ' ner_%s_ ' % (str(ner_dict_new[company_main_name_new])), sentence)
        train_data.iloc[i, -1] = sentence

ner_dict_reverse_new = {id:name for name, id in ner_dict_new.items()}
        
y = train_data.loc[:, ['tag']]
train_num = len(train_data)
X_train = train_data[['ner']]

X = pd.concat([X_train, X_test])

100%|██████████| 850/850 [01:32<00:00,  9.23it/s]


In [17]:
train_data.head()

Unnamed: 0,id,sentence,tag,member1,member2,ner
0,6461,与本公司关系:受同一公司控制 2，杭州富生电器有限公司企业类型: 有限公司注册地址: 富阳市...,0,0,0,与本公司关系:受同一公司控制 2， ner_1647_ 司企业类型: 有限公司注册地址: 富...
1,2111,三、关联交易标的基本情况 1、交易标的基本情况 公司名称:红豆集团财务有限公司 公司地址:无...,0,0,0,三、关联交易标的基本情况 1、交易标的基本情况 公司名称: ner_1649_ 司地址:无锡...
2,9603,2016年协鑫集成科技股份有限公司向瑞峰（张家港）光伏科技有限公司支付设备款人民币4，515...,1,协鑫集成科技股份有限公司,瑞峰（张家港）光伏科技有限公司,2016年 ner_1650_ 向 ner_1651_ 支付设备款人民币4，515，770.00元
3,3456,证券代码:600777 证券简称:新潮实业 公告编号:2015-091 烟台新潮实业股份有限...,0,0,0,证券代码:600777 证券简称: ner_1201_ 公告编号:2015-091 ne...
4,8844,本集团及广发证券股份有限公司持有辽宁成大股份有限公司股票的本期变动系买卖一揽子沪深300指数...,1,广发证券股份有限公司,辽宁成大股份有限公司,本集团及 ner_1090_ 持有 ner_1652_ 股票的本期变动系买卖一揽子沪深300...


In [18]:
X

Unnamed: 0,ner
0,与本公司关系:受同一公司控制 2， ner_1647_ 司企业类型: 有限公司注册地址: 富...
1,三、关联交易标的基本情况 1、交易标的基本情况 公司名称: ner_1649_ 司地址:无锡...
2,2016年 ner_1650_ 向 ner_1651_ 支付设备款人民币4，515，770.00元
3,证券代码:600777 证券简称: ner_1201_ 公告编号:2015-091 ne...
4,本集团及 ner_1090_ 持有 ner_1652_ 股票的本期变动系买卖一揽子沪深300...
...,...
414,近日，该子公司已完成工商注册登记手续，并领取了南京市工商行政管理局颁发的<企业法人营业执照>...
415,(二)本次交易构成关联交易正元投资拟认购金额不低于 13 亿元且不低于本次配套融资总额的 2...
416,证券代码:600225 证券简称: ner_1643_ 公告编号:临 20 ner_129...
417,2015年3月31日， ner_1644_ 司与广东恒润互兴 ner_1645_ 件生效的《...


## 3. Relation Extraction

After the NER part, we need to build a graph database to store the dug relationships. Based on our data, the relationship describes the equity transactions between two entities. We could use the undirectional edge to depict this kind of relationship. 

firstly, remove the stop words and change them to the tf-idf vectors

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from pyltp import Segmentor

In [20]:
# customer dict
with open('../data/user_dict.txt', 'w', encoding='utf-8') as fw:
    for v in ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']:
        fw.write( v + '号企业 ni\n')
fw.close()

In [21]:
import os
import re
LTP_DATA_DIR = './ltp_data_v3.4/ltp_data_v3.4.0'
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')
segmentor = Segmentor(cws_model_path, lexicon_path = '../data/user_dict.txt')  

fr = open(r'../data/dict/stopwords.txt', encoding='utf-8')   
stop_word = fr.readlines()
stop_word = [re.sub(r'(\r|\n)*','',stop_word[i]) for i in range(len(stop_word))]

"""
build rules to filter the ner column
1. remove stop words
2. remove entity
3. remove special marks and numbers
"""
f1 = lambda x: re.sub(r'ner\_\d\d\d\d\_','',x)
f2 = lambda x: re.compile("[^\u4e00-\u9fa5]").sub('', x)
# remain chinese characters, numbers, letters  
f3 = lambda x: ' '.join([word for word in segmentor.segment(x) if word not in stop_word])
f4 = lambda x: re.compile("[^\u4e00-\u9fa5^a-z^A-Z^0-9]").sub('', x)

corpus=X['ner'].map(f1).map(f2).map(f3).tolist()

corpus_parse=X['ner'].map(f4).tolist()
segmentor.release()

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()  
X_tfidf = vectorizer.fit_transform(corpus).toarray() 
print(X_tfidf.shape)


(1269, 3897)


Secondly, we retrive the syntactic features
1. absolute distance between entities
2. syntactic distance between entities
3. distance between key words and entities
4. dependency syntactic relationships

In [25]:
from pyltp import Parser
from pyltp import Segmentor
from pyltp import Postagger
import networkx as nx
import re
import matplotlib.pyplot as plt
from graphviz import Digraph
import numpy as np
import os

In [26]:
LTP_DATA_DIR = './ltp_data_v3.4/ltp_data_v3.4.0'
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')
segmentor = Segmentor(cws_model_path, lexicon_path = '../data/user_dict.txt') 

pos_model_path = os.path.join(LTP_DATA_DIR, 'pos.model')
postagger = Postagger(pos_model_path)

parse_model_path = os.path.join(LTP_DATA_DIR, 'parser.model')
parser = Parser(parse_model_path)


SEN_TAGS = ["SBV","VOB","IOB","FOB","DBL","ATT","ADV","CMP","COO","POB","LAD","RAD","IS","HED"]
key_words = ["收购", "竞拍", "转让", "扩张", "并购", "注资", "整合", "并入", "竞购", "竞买", "支付", "收购价", "收购价格", "承购", "购得", "购进",
             "购入", "买进", "买入", "赎买", "购销", "议购", "函购", "函售", "抛售", "售卖", "销售", "转售"]

In [27]:
def shortest_path(arcs_ret, words, source, target, isGraph = False):
    """
    calculate the shorest dependency parsing distance
    """
    G = nx.Graph()
    # add node
    for i in list(arcs_ret.index):
        G.add_node(i)
    
    #add edge
    for i in range(len(arcs_ret)):
        head = arcs_ret.iloc[i, -2]
        index = i + 1
        G.add_edge(index, head)

    if isGraph:
        nx.draw(G, with_labels=True)
        plt.savefig("undirected_graph_2.png")
        plt.close()

    try:
        source_index = words.index(source) + 1 
        target_index = words.index(target) + 1 
        distance = nx.shortest_path_length(G, source=source_index, target=target_index)
        return distance
    except:
        return -1

In [30]:
def name_entity_modification(s):
    """
    original entity name is ner_1003_ 
    we need to change it to another name to help tagging and parsing
    """
    tmp_ner_dict = defaultdict(lambda:"")
    num_lst = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十']


    # 将公司代码替换为特殊称谓，保证分词词性正确
    for i, ner in enumerate(list(set(re.findall(r'(ner\d\d\d\d)', s)))):
        try:
            tmp_ner_dict[num_lst[i] + '号企业'] = ner
        except IndexError:
            # TODO：定义错误情况的输出
            # TODO ...
            num_lst.append(str(i))
            tmp_ner_dict[num_lst[i] + '号企业'] = ner

        s = s.replace(ner, num_lst[i] + '号企业')
    return s, tmp_ner_dict

In [31]:
def seg_tag_par(s):
    """
    do segment, postag and parse
    """
    words = segmentor.segment(s)
    tags = postagger.postag(words)
    arcs = parser.parse(words, tags)
    arcs_lst = list(map(list, zip(*[[arc[0], arc[1]] for arc in arcs])))
    parse_result = pd.DataFrame([[a, b, c, d] for a, b, c, d in zip(list(words), list(tags), arcs_lst[0], arcs_lst[1])],
                                    index=range(1, len(words) + 1))
    return words, tags, arcs, arcs_lst, parse_result

In [32]:
def find_sen_tag(l_w, parse_result,  str_enti):
    """
    find the sentence tag type of entity1 and entity2
    -1 means we do not support this type
    """
    tag_type = -1 

    entity_index = l_w.index(str_enti)
    entity_sentence_type = parse_result.iloc[entity_index, -1]
    if entity_sentence_type in SEN_TAGS:
        tag_type = SEN_TAGS.index(entity_sentence_type)
    return tag_type

In [33]:
def find_keyword(words):
    """
    find the type of keyword in words
    """
    k_w = None
    for w in words:
        if w in key_words:
            k_w = w
            break
    return k_w
    

In [34]:
def parse(s, isGraph = False):
    """
    do dependency syntactic parsing and return the relative five feature vectors
    """
    s, tmp_ner_dict = name_entity_modification(s)
    words, tags, arcs, arcs_lst, parse_result = seg_tag_par(s)
    
    result = []
    rely_id = [arc[0] for arc in arcs]  
    relation = [arc[1] for arc in arcs]  
    heads = ['Root' if id == 0 else words[id - 1] for id in rely_id] 
    company_list = list(tmp_ner_dict.keys())
    str_enti_1 = "一号企业"
    str_enti_2 = "二号企业"
    l_w = list(words)
    is_two_company = str_enti_1 in l_w and str_enti_2 in l_w
    
    # add sen_tag_type features
    tag_type_1 = -1
    tag_type_2 = -1
    if is_two_company:
        tag_type_1 = find_sen_tag(l_w, parse_result, str_enti_1)
        tag_type_2 = find_sen_tag(l_w, parse_result, str_enti_2)
    result.append(tag_type_1) 
    result.append(tag_type_2)    
        
    # add syntactic distance between two entities
    distance_syntactic = 0
    if is_two_company:
        distance_syntactic = shortest_path(parse_result, list(words), str_enti_1, str_enti_2, isGraph=False)
    result.append(distance_syntactic)
    
    # add absolute distance between entities
    distance_entity = 0
    if is_two_company:
        distance_entity = np.abs(l_w.index(str_enti_1) - l_w.index(str_enti_2))
    result.append(distance_entity)
    
    # add distance
    k_w = find_keyword(words)
    dis_key_e_1 = -1
    dis_key_e_2 = -1

    if k_w != None and is_two_company:
        k_w = str(k_w)
        l_w = list(words)
        dis_key_e_1 = np.abs(l_w.index(str_enti_1) - l_w.index(k_w))
        dis_key_e_2 = np.abs(l_w.index(str_enti_2) - l_w.index(k_w))
    result.append(dis_key_e_1)
    result.append(dis_key_e_2)
    return result

In [35]:
def get_feature(s):
    """
    input corpus and return features(tf_idf and dependency syntactic parsing features)
    """
    sen_feature = []
    len_s = len(s)
    for i in trange(len_s):
        f_e = parse(s[i], isGraph = False)
        sen_feature.append(f_e)

    sen_feature = np.array(sen_feature)

    features = np.concatenate((X_tfidf,  sen_feature), axis= 1)

    return features

In [39]:
# generate features of 
f_v_s_path = "../data/feature_vector.npy"
is_exist_f_v = os.path.exists(f_v_s_path)
features = []
if not is_exist_f_v:
    features = get_feature(corpus_parse)
    np.save(f_v_s_path, features)
else:
    features = np.load(f_v_s_path)

features_train = features[:len(train_data), :]
segmentor.release()
postagger.release()
parser.release()
print(features_train)


100%|██████████| 1269/1269 [00:09<00:00, 131.68it/s]

[[ 0.  0.  0. ... 18. -1. -1.]
 [ 0.  0.  0. ... 16. -1. -1.]
 [ 0.  0.  0. ...  2.  1.  3.]
 ...
 [ 0.  0.  0. ... 10. 15.  5.]
 [ 0.  0.  0. ...  2. -1. -1.]
 [ 0.  0.  0. ... 10. -1. -1.]]





## 4. Build classifier to obtain the labels of the test set
Using the collected feature vectors to train a classifier and do parameters searching.


In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report
from sklearn.naive_bayes import BernoulliNB

seed = 9000

y = train_data.loc[:, ['tag']]
y = np.array(y.values)
y = y.reshape(-1)
Xtrain, Xtest, ytrain, ytest = train_test_split(features_train,  y, test_size = 0.2, random_state = seed)

def logistic_class(Xtrain, Xtest, ytrain, ytest):
    cross_validator = KFold(n_splits=10, shuffle=True, random_state = seed)

    lr = LogisticRegression(penalty = "l1", solver='liblinear')

    params = {"C":[0.1,1.0,10.0,15.0,20.0,30.0,40.0,50.0]}

    grid = GridSearchCV(estimator=lr, param_grid = params, cv=cross_validator)
    grid.fit(Xtrain, ytrain)
    print("best parameters：",grid.best_params_)
    model = grid.best_estimator_
    y_pred = model.predict(Xtest)

    y_test = [str(value) for value in ytest]
    y_pred = [str(value) for value in y_pred]

#     train_score = model.score(Xtrain, ytrain)
#     print("train_score", train_score)
#     test_score = model.score(Xtest, ytest)
#     print("test_score", test_score)


    proba_value = model.predict_proba(Xtest)
    p = proba_value[:,1]
    print("Logistic=========== ROC-AUC score: %.3f" % roc_auc_score(y_test, p))

    report = classification_report(y_pred=y_pred,y_true=y_test)
    print(report)


    return model



In [49]:
s_model = logistic_class(Xtrain, Xtest, ytrain, ytest)

features_test = features[len(train_data):, :]
y_pred_test = s_model.predict(features_test)

l_X_test_ner = X_test.values.tolist()

entity_dict = {}
relation_list = []

for i, label in enumerate(y_pred_test):
    if label == 1:
        cur_ner_content = str(l_X_test_ner[i])

        ner_list = list(set(re.findall(r'(ner\_\d\d\d\d\_)',cur_ner_content)))
        if len(ner_list) == 2:
            # print(ner_list)
            r_e_l = []
            for i, ner in enumerate(ner_list):
                split_list = str.split(ner, "_")
                if len(split_list) == 3:
                    ner_id = int(split_list[1])

                    if ner_id in ner_dict_reverse_new:
                        if ner_id not in entity_dict:

                            company_main_name = ner_dict_reverse_new[ner_id]

                            if company_main_name in dict_entity_name_unify:
                                entity_dict[ner_id] = company_main_name + dict_entity_name_unify[company_main_name]
                            else:
                                entity_dict[ner_id] = company_main_name

                        r_e_l.append(ner_id)
            if len(r_e_l) == 2:
                relation_list.append(r_e_l)


entity_list = [[item[0], item[1]] for item in entity_dict.items()]
pd_enti = pd.DataFrame(np.array(entity_list), columns=['实体编号','实体名'])


pd_re = pd.DataFrame(np.array(relation_list), columns=['实体1','实体2'])



best parameters： {'C': 20.0}
train_score 1.0
test_score 0.9235294117647059
              precision    recall  f1-score   support

           0       0.93      0.98      0.96       145
           1       0.83      0.60      0.70        25

    accuracy                           0.92       170
   macro avg       0.88      0.79      0.83       170
weighted avg       0.92      0.92      0.92       170



In [50]:
pd_enti

Unnamed: 0,实体编号,实体名
0,1002,氟化工|多氟多化工股份有限公司|多氟多化工股份有限公司
1,1003,李云峰
2,1006,山东华鹏玻璃|山东华鹏玻璃股份有限公司|山东华鹏玻璃股份有限公司|山东华鹏玻璃股份有限公司|...
3,1007,张德华
4,1009,肖文革
...,...,...
107,1590,赵笃学
108,1602,美康生物科技股份|美康生物科技股份有限公司|美康生物科技股份有限公司
109,1603,邹炳德
110,1644,湖南天润数字娱乐文化传媒|湖南天润数字娱乐文化传媒股份有限公司|湖南天润数字娱乐文化传媒股份...


In [51]:
pd_re

Unnamed: 0,实体1,实体2
0,1002,1003
1,1006,1007
2,1009,1008
3,1012,1013
4,1018,1019
...,...,...
57,1580,1579
58,1582,1581
59,1590,1346
60,1602,1603


In [94]:
entity_list

[[1002, '氟化工|多氟多化工股份有限公司|多氟多化工股份有限公司'],
 [1003, '李云峰'],
 [1006, '山东华鹏玻璃|山东华鹏玻璃股份有限公司|山东华鹏玻璃股份有限公司|山东华鹏玻璃股份有限公司|山东华鹏玻璃股份有限公司'],
 [1007, '张德华'],
 [1009, '肖文革'],
 [1008, '印纪娱乐传媒|印纪娱乐传媒股份有限公司'],
 [1012, '上海新朋实业|上海新朋实业股份有限公司'],
 [1013, '宋琳'],
 [1018, '百川能源|百川能源股份有限公司|百川能源股份有限公司'],
 [1019, '曹飞'],
 [1021, '颜军'],
 [1020, '珠海欧比特控制工程|珠海欧比特控制工程股份有限公司'],
 [1023, '宋睿'],
 [1022, '成都云图|成都云图控股股份有限公司'],
 [1026, '广州阳普医疗科技股份|广州阳普医疗科技股份有限公司|广州阳普医疗科技股份有限公司'],
 [1027, '邓冠华'],
 [1032, '厦门金达威集团股份|厦门金达威集团股份有限公司'],
 [1033, '江斌'],
 [1015, '康力电梯|康力电梯股份有限公司|康力电梯股份有限公司'],
 [1014, '王友林'],
 [1034, '东方日升新能源|东方日升新能源股份有限公司'],
 [1035, '林海峰'],
 [1036, '神州数码集团股份|神州数码集团股份有限公司'],
 [1038, '云科服务'],
 [1040, '吴宏亮'],
 [1039, '浙江唐德影视|浙江唐德影视股份有限公司'],
 [1060, '周福海'],
 [1059, '江苏亚太轻合金科技股份|江苏亚太轻合金科技股份有限公司'],
 [1067, '金浦钛业|金浦钛业股份有限公司|金浦钛业股份有限公司|金浦钛业股份有限公司'],
 [1068, '郭金东'],
 [1094, '刘梦龙'],
 [1093, '深圳市易尚展示|深圳市易尚展示股份有限公司|深圳市易尚展示股份有限公司|深圳市易尚展示股份有限公司'],
 [1133, '康美保险'],
 [1134, '康美健康保险|康美健康保险股份有限公司'],
 [1137, '天津金岸重工|天津金岸重工有限公司|天津金岸重工有限公司

In [95]:
relation_list

[[1002, 1003],
 [1006, 1007],
 [1009, 1008],
 [1012, 1013],
 [1018, 1019],
 [1021, 1020],
 [1023, 1022],
 [1026, 1027],
 [1002, 1003],
 [1032, 1033],
 [1015, 1014],
 [1034, 1035],
 [1036, 1038],
 [1040, 1039],
 [1026, 1027],
 [1060, 1059],
 [1067, 1068],
 [1067, 1068],
 [1094, 1093],
 [1133, 1134],
 [1137, 1136],
 [1141, 1140],
 [1062, 1061],
 [1191, 1192],
 [1223, 1224],
 [1236, 1237],
 [1251, 1252],
 [1255, 1256],
 [1277, 1136],
 [1278, 1279],
 [1284, 1285],
 [1094, 1093],
 [1326, 1325],
 [1329, 1330],
 [1331, 1139],
 [1346, 1347],
 [1349, 1348],
 [1358, 1359],
 [1388, 1387],
 [1391, 1392],
 [1396, 1397],
 [1429, 1430],
 [1439, 1440],
 [1207, 1445],
 [1460, 1090],
 [1461, 1249],
 [1477, 1476],
 [1492, 1493],
 [1496, 1497],
 [1504, 1503],
 [1512, 1511],
 [1523, 1524],
 [1529, 1528],
 [1537, 1536],
 [1539, 1540],
 [1563, 1564],
 [1477, 1476],
 [1580, 1579],
 [1582, 1581],
 [1590, 1346],
 [1602, 1603],
 [1644, 1645]]

## 5. Building a graph graph to represent the dug relationships between different entities.
Here, we use cypher language and NoSQL(noe4j graph database) to do insert/delete/query operations.


In [59]:
# insert data into graph
from py2neo import Node, Relationship, Graph
graph = Graph(
    "http://localhost:7474",
    username="project3",
    password = "password"
)
graph.delete_all()

for v in relation_list:
    a = Node('Company', name=str(v[0]))
    b = Node('Company', name=str(v[1]))

    #undirectional edge
    r = Relationship(a, 'INVEST', b)
    s = a | b | r
    graph.create(s)
    r = Relationship(b, 'INVEST', a)
    s = a | b | r
    graph.create(s)

In [68]:
def query_3_layers_relationship():
    """
    check whether it exits three-layer relationships or not a invest b, b invest c ---> a invest c
    """
    import random

    result_2 = []
    result_3 = []
    for value in entity_list:
        ner_id = value[0]
        str_sql_3 = "match data=(na:Company{{name:'{0}'}})-[:INVEST]->(nb:Company)-[:INVEST]->(nc:Company) where na.name <> nc.name return data".format(str(ner_id))
        result_3 = graph.run(str_sql_3).data()
        if len(result_3) > 0:
            break

    if len(result_3) > 0:
        print("step1")
        print(result_3)
    else:
        print("step2")
        random_index = random.randint(0, len(entity_list) - 1)
        random_ner_id = entity_list[random_index][0]
        str_sql_2 = "match data=(na:Company{{name:'{0}'}})-[*2]->(nb:Company) return data".format(str(random_ner_id))
        result_2 = graph.run(str_sql_2).data()
        print(result_2)

In [69]:
check_3_layers_relationship()

step2
[{'data': Path(Node('Company', name='1039'), INVEST(Node('Company', name='1039'), Node('Company', name='1040')), INVEST(Node('Company', name='1040'), Node('Company', name='1039')))}]


## 6. Entity DISAMBIGUATION
After representing the entities' relationships in the graph database, we would like to learn more about the found entities. But how could we distinguish the entities appearing in our dataset from the ones we found on the search engine. baike.baidu.com, like Wikipedia, provides polysemantic sections, which stores existing similar terms. Our goal is to compare the cosine similarity of the tf-idf vectors between the entities in the polysemantic list and the given datasets.


In [91]:
# train tf_idf
test_data_dis = pd.read_csv('../data/info_extract/test_data.csv', encoding = 'gb2312', header=0)
list_person_content = {}
window = 5

LTP_DATA_DIR = './ltp_data_v3.4/ltp_data_v3.4.0'
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')
segmentor = Segmentor(cws_model_path, lexicon_path = '../data/user_dict.txt') 

f = lambda x: ' '.join([word for word in segmentor.segment(x)])
f2 = lambda x: re.compile("[^\u4e00-\u9fa5]").sub('', x)
corpus_dis = test_data['sentence'].map(f).map(f2).tolist()
vectorizer = TfidfVectorizer()  
X_tfidf = vectorizer.fit_transform(corpus_dis).toarray()  

# get entity
for i in range(25):
    sentence = corpus_dis[i]
    len_sen = len(sentence)
    words, ners = fool.analysis(sentence)
    for start, end, ner_type, ner_name in ners[0]:
        if ner_type == 'person':

            start_index = max(0, start - window)
            end_index = min(len_sen - 1, end - 1 + window)
            left_str = sentence[start_index:start]
            right_str = sentence[end - 1:end_index]

            left_str = ' '.join([word for word in segmentor.segment(left_str)])
            right_str = ' '.join([word for word in segmentor.segment(right_str)])
            new_str = left_str + " " +right_str


            content_vec = vectorizer.transform([new_str])

            ner_id = ner_dict_new[ner_name]
            if ner_id not in list_person_content:
                list_person_content[ner_id] = content_vec

In [92]:
list_person_content

{1003: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1005: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1007: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1009: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1011: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1013: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1014: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1016: <1x412 sparse matrix of type '<class 'numpy.float64'>'
 	with 0 stored elements in Compressed Sparse Row format>,
 1019: <1x412 sparse matrix of t

In [96]:
def get_para_vector(para_elems):
    str_res = ""
    for p_e in para_elems:
        petext = re.sub(r'(\r|\n)*', '', p_e.text)
        petext = re.compile("[^\u4e00-\u9fa5]").sub('', petext)
        str_res += petext
    str_res = ' '.join([word for word in jieba.cut(str_res)])
    content_vec = vectorizer.transform([str_res])
    content_vec = content_vec.toarray()[0]
    return content_vec

In [97]:
from requests_html import HTMLSession
from requests_html import HTML
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
import jieba


list_company_names = [company for value in entity_list for company in str.split(value[1], "|")]


list_person_url = []
url_prefix = "https://baike.baidu.com/item/"
url_error = "https://baike.baidu.com/error.html"

l_p_items = list(list_person_content.items())
len_items = len(l_p_items)


for index in trange(len_items):
    value = l_p_items[index]

    person_id = value[0]
    vector_entity = csr_matrix(value[1])

    person_name = ner_dict_reverse_new[person_id]

    session = HTMLSession()
    url = url_prefix + person_name
    response = session.get(url)

    url_list = []
    if response.url != url_error:
        para_elems = response.html.find('.para')
        content_vec = get_para_vector(para_elems)
        url_list.append([response.url, content_vec])

        banks = response.html.find('.polysemantList-wrapper')

        if len(banks) > 0:
            banks_child = banks[0]
            persion_links = list(banks_child.absolute_links)
            for link in persion_links:
                r_link = session.get(link)

                if r_link.url == url_error:
                    continue

                para_elems = r_link.html.find('.para')
                content_vec = get_para_vector(para_elems)
                url_list.append([r_link.url, content_vec])

        vectorizer_list = [item[1] for item in url_list]
        vectorizer_list = csr_matrix(vectorizer_list)
        result = list(cosine_similarity(value[1], vectorizer_list)[0])
        max_index = result.index(max(result))
        list_person_url.append([person_id, person_name, url_list[max_index][0]])

print(list_person_url)

100%|██████████| 21/21 [11:25<00:00, 32.66s/it]

[[1003, '李云峰', 'https://baike.baidu.com/item/%E6%9D%8E%E4%BA%91%E5%B3%B0/13011276'], [1005, '侯毅', 'https://baike.baidu.com/item/%E4%BE%AF%E6%AF%85/3417673'], [1007, '张德华', 'https://baike.baidu.com/item/%E5%BC%A0%E5%BE%B7%E5%8D%8E/12640205'], [1009, '肖文革', 'https://baike.baidu.com/item/%E8%82%96%E6%96%87%E9%9D%A9/12761874'], [1011, '熊海涛', 'https://baike.baidu.com/item/%E7%86%8A%E6%B5%B7%E6%B6%9B/10849366'], [1013, '宋琳', 'https://baike.baidu.com/item/%E5%AE%8B%E7%90%B3/3967064'], [1014, '王友林', 'https://baike.baidu.com/item/%E7%8E%8B%E5%8F%8B%E6%9E%97/71412'], [1016, '彭聪', 'https://baike.baidu.com/item/%E5%BD%AD%E8%81%AA/19890127'], [1019, '曹飞', 'https://baike.baidu.com/item/%E6%9B%B9%E9%A3%9E/10396036'], [1021, '颜军', 'https://baike.baidu.com/item/%E9%A2%9C%E5%86%9B/3476040'], [1023, '宋睿', 'https://baike.baidu.com/item/%E5%AE%8B%E7%9D%BF/2629451'], [1027, '邓冠华', 'https://baike.baidu.com/item/%E9%82%93%E5%86%A0%E5%8D%8E'], [1028, '孙锋峰', 'https://baike.baidu.com/item/%E5%AD%99%E9%94%8B%E5%B


