## Demo4
In this notebook, I address the output structure of the algorithm. The result should be a pair: either degree and school, or position and company. This notebook focuses on the latter.
For example, given `广州xxx有限公司担任ceo`, I expect a result like `<广州xxx有限公司, ceo>`.

Idea:
+ Link entities separately: run the linking on each group of n-grams (1-grams, then 2-grams, and so on), so that the order of the text can be used.
+ Eliminate duplicate entities after linking.
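
The second idea (eliminating duplicate entities after linking) is not implemented in the cells below, so here is a minimal sketch of one way it could look. The helper name `dedup_entities` and the sample pairs are hypothetical. I assume the linker returns `(name, similarity)` pairs in text order, as `linkCompanyAndPosition` does.

```python
def dedup_entities(entities):
    """Keep one (name, similarity) pair per entity name.

    First-seen order is preserved and the highest similarity wins.
    """
    best = {}
    for name, sim in entities:
        # dicts keep insertion order, so updating a value does not move the key
        if name not in best or sim > best[name]:
            best[name] = sim
    return list(best.items())


# Hypothetical linker output, in the order the entities appeared in the text.
linked = [("公司A", 0.99), ("公司B", 1.0), ("公司A", 1.0)]
print(dedup_entities(linked))
```

Keeping the first-seen position matters here because the later pairing of company and position relies on the order of the text.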

In [6]:
import jieba
import re
import chardet
from gensim.models import KeyedVectors
import numpy as np
import pandas as pd
import math
import datetime
import prettytable as pt

## Load File

In [7]:
def loadCompany():
    path1 = 'data/company_list_ch.csv'
    company_ch_df = pd.read_csv(path1, header=None, delimiter=",", skiprows=2, names=['rank', 'name', 'Location', 'Income'])
    company_ch_df = pd.DataFrame(company_ch_df, columns=['name'])
    path2 = 'data/member-data.csv'
    company_df = pd.read_csv(path2, header=None, delimiter=",", skiprows=2, names=['name', 'No.', 'Resume', 'Position'])
    company_df = pd.DataFrame(company_df, columns=['name'])
    company_ch_df = pd.concat([company_ch_df, company_df], axis=0, ignore_index=True)
    
    print(f'before dedup, company size: {company_ch_df.shape[0]}')
    company_ch_df = company_ch_df.drop_duplicates(subset=['name'], keep='first')
    print(f'after dedup, company size: {company_ch_df.shape[0]}')
    # print(type(company_ch_df))
    # print(company_ch_df.tail())
    return company_ch_df

In [39]:
def loadPosition():
    path5 = 'data/position.csv'
    position_df1 = pd.read_csv(path5, header=None, delimiter=",", skiprows=1, names=['name'])
    position_df1 = pd.DataFrame(position_df1)
    # print(df1.head())

    path6 = 'data/member-data.csv'
    position_df2 = pd.read_csv(path6, header=None, delimiter=",", skiprows=1, names=['Company', 'No.', 'Resume', 'name'])
    position_df2 = pd.DataFrame(position_df2, columns=['name'])
    # print(df2.head())

    member_position_list = []
    # process position in member-data
    for index, row in position_df2.iterrows():
        position = row['name']
        if isinstance(position, float) or position == " " or position.isalpha():
            continue
        if "&" in position:
            temp1 = position.split('&')
            member_position_list += temp1
        elif " " in position:
            temp2 = position.split()
            member_position_list += temp2
        else:
            member_position_list.append(position)

    position_df3 = pd.DataFrame(member_position_list, columns=['name'])
    # df3 = df3.drop_duplicates(subset=['position'], keep='first')
    position_df1 = pd.concat([position_df1, position_df3], axis=0, ignore_index=True)
    print(f'before dedup, position size: {position_df1.shape[0]}')
    position_df1 = position_df1.drop_duplicates(subset=['name'], keep='first')
    print(f'after dedup, position size: {position_df1.shape[0]}')
    # print(df1.head())
    return position_df1

In [40]:
def loadMember():
    path = 'data/member-data.csv'
    member_df = pd.read_csv(path, header=None, delimiter=",", skiprows=1, names=['Company', 'No.', 'Resume', 'Position'])
    member_df = pd.DataFrame(member_df, columns=['Resume'])
    print(member_df.head())
    return member_df

## Preprocess Text

In [133]:
def removeStopWords(seglist):
    # load the stopword list; a set gives constant-time membership tests
    with open('data/stopwords_cn.txt', 'r', encoding='utf-8', errors='ignore') as fstop:
        stopwords = {w.strip() for w in fstop}
    stopwords.add(' ')

    # map English brand names to their Chinese names;
    # keys are lowercase because preprocess() lowercases the text before segmenting
    translation = {'omnigo': '酷刻', 'aibee': '爱笔', 'ilife': '爱乐福', 'oracleen': '爱芽'}

    segListSanitized = []

    for word in seglist:
        word = translation.get(word, word)
        if word not in stopwords:
            segListSanitized.append(word)
    return segListSanitized

In [42]:
def preprocess(text):
    # remove punctuations
    text = re.sub(r"[\s+\.\!\/_,$%^*()?;；:【】+\"\']+|[+——！，;:。？、~@#￥%……&*（）]+", " ", text)
    text = text.lower()
    # segment the text into words
    words = jieba.cut(text, cut_all=False)
    seglist = list(words)
    # remove stopwords
    segListSanitized = removeStopWords(seglist)
    print(f'Before sanitize, len: {len(seglist)}. After sanitize, len: {len(segListSanitized)}')

    return segListSanitized

## N-gram Algorithm

In [43]:
def getNgrams(wordList, n):
    '''
    Generate all n-grams of wordList for a single n, in text order
    '''
    output = []
    for i in range (len(wordList) - n + 1):
        n_gram_temp = "".join(wordList[i:i+n])
        output.append(n_gram_temp)
    return output

In [44]:
def generateNgrams(wordList, n):
    '''
    Generate the set of all 1- to N-grams (text order is not preserved)
    '''
    result = set()
    for i in range(n):
        temp = getNgrams(wordList, i+1)
        # getNgrams returns a list, so convert it before the set union
        result = result | set(temp)
    
    return result

In [45]:
def generateNgramsV2(wordList, n):
    '''
    Generate 1- to N-grams as a list of lists (one list per n),
    with duplicates removed and first-occurrence order preserved
    '''
    result = []
    for i in range(n):
        temp_list = getNgrams(wordList, i+1)
        temp = list(set(temp_list))
        temp.sort(key=temp_list.index)
        result.append(temp)
        
    return result
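
`generateNgramsV2` removes duplicates with `set` and then restores text order with `sort(key=temp_list.index)`, which rescans the list for every element. As a sketch of an alternative, `dict.fromkeys` keeps the first occurrence of each n-gram in a single pass. The function name `ordered_ngrams` and the sample tokens are hypothetical, and I have not benchmarked this on the resume data.

```python
def ordered_ngrams(words, max_n):
    """Return one list per n in 1..max_n.

    Each list holds the n-grams of `words` in text order, duplicates removed.
    """
    result = []
    for n in range(1, max_n + 1):
        grams = ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        # dict.fromkeys drops repeats while keeping first-seen order
        result.append(list(dict.fromkeys(grams)))
    return result


tokens = ["tcl", "集团", "董事长", "tcl"]  # hypothetical segmented text
print(ordered_ngrams(tokens, 2))
```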

## Word Embedding

In [46]:
model = KeyedVectors.load('./test_50.bin')

In [87]:
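def calculate_cosine_similarity(a, b):
    # work on flat 1-D arrays
    vector_a = np.asarray(a).ravel()
    vector_b = np.asarray(b).ravel()
    num = float(np.dot(vector_a, vector_b))
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    
    # a zero vector (e.g. no word of the term is in the vocabulary) has no direction
    if denom == 0:
        return 0.0
    
    cos = num / denom
    # rescale the cosine from [-1, 1] to [0, 1]
    sim = 0.5 + 0.5 * cos
    return sim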

In [88]:
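def generateEmbeddings(name):
    words = jieba.cut(name, cut_all=False)
    word_list = list(words)
    v = np.zeros((200))
    count = 0
    for word in word_list:
        if word in model.vocab:
            v += model[word]
            count += 1
    
    # average over the in-vocabulary words; v stays all-zero if none was found
    if count > 0:
        v /= count
    return v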

In [89]:
def calculate_IDF(df):
    company_num = 0
    m = dict()
    for index, row in df.iterrows():
        name = row['name']
        # skip missing names (NaN is a float) before any string processing
        if isinstance(name, float):
            continue
        name = re.sub(r"[\s+\.\!\/_,$%^*()?;；:【】+\"\']+|[+——！，;:。？、~@#￥%……&*（）]+", " ", name)
        name = name.lower()
        if name == " ":
            continue
        company_num += 1
        words = jieba.cut(name, cut_all=False)
        word_list = list(words)
        for word in word_list:
            if word in m.keys():
                m[word] +=1
            else:
                m[word] = 1
    
    idf = dict()
    
    for (k, v) in m.items():
        idf[k] = math.log(((1+company_num) / v), 10)
    
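    # Min-max normalize to [0, 1]
    values = idf.values()
    max_value = max(values)
    min_value = min(values)
    denom = max_value - min_value
    for k in idf:
        # assumption: if every word has the same IDF, use a uniform weight of 1.0
        idf[k] = (idf[k] - min_value) / denom if denom != 0 else 1.0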

    return idf
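
To make the effect of the min-max step concrete, here is a small self-contained sketch of the same weighting, `log10((1 + N) / count)` followed by min-max normalization, on names that are already segmented. The function name `normalized_idf`, the toy names, and the uniform fallback when all values are equal are my assumptions for illustration.

```python
import math
from collections import Counter


def normalized_idf(tokenized_names):
    """Compute log10((1 + N) / count) per token, then min-max scale to [0, 1]."""
    n_docs = len(tokenized_names)
    # like calculate_IDF, this counts every occurrence of a token
    counts = Counter(tok for name in tokenized_names for tok in name)
    idf = {tok: math.log10((1 + n_docs) / c) for tok, c in counts.items()}
    lo, hi = min(idf.values()), max(idf.values())
    span = hi - lo
    if span == 0:
        # assumption: uniform weight when every token is equally frequent
        return {tok: 1.0 for tok in idf}
    return {tok: (v - lo) / span for tok, v in idf.items()}


# Hypothetical, already segmented company names.
names = [
    ["广州", "酷刻", "科技", "有限公司"],
    ["爱国者", "电子", "科技", "有限公司"],
    ["tcl", "集团", "股份", "有限公司"],
]
weights = normalized_idf(names)
print(weights["有限公司"], weights["科技"], weights["酷刻"])
```

In this toy example the most frequent token (`有限公司`) gets weight 0 and the rarest tokens get weight 1, so generic suffixes contribute nothing to the IDF-weighted embedding while distinctive words dominate it.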

In [90]:
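def generateEmbeddingsWithIDF(name, idf):
    words = jieba.cut(name, cut_all=False)
    word_list = list(words)
    v = np.zeros((200))
    count = 0
    for word in word_list:
        if word in model.vocab:
            # weight each word vector by its normalized IDF
            v += model[word] * idf[word]
            count += 1
    
    # average over the in-vocabulary words; v stays all-zero if none was found
    if count > 0:
        v /= count
    return v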

In [91]:
def preprocess_entity_list(df, model):
    '''
    df: dataframe with a 'name' column
    model: word embedding model
    '''
    
    df['embeddings'] = ''
    for index, row in df.iterrows():
        # df.loc[index, 'embeddings'] = z
        name = row['name']
        if isinstance(name, float):
            continue
        name = name.lower()
        if name in model.vocab:
            vec = model[name]
        else:
            vec = generateEmbeddings(name)
        # DataFrame.set_value was removed in pandas 1.0; .at is the label-based replacement
        df.at[index, 'embeddings'] = vec

    # print(df.head())
    return df

In [92]:
def preprocess_entity_list_withIDF(df, model, idf):
    '''
    df: dataframe with a 'name' column
    model: word embedding model
    idf: IDF for each word
    '''
    
    df['embeddings_idf'] = ''
    for index, row in df.iterrows():
        name = row['name']
        # skip missing names (NaN is a float) before any string processing
        if isinstance(name, float):
            continue
        name = re.sub(r"[\s+\.\!\/_,$%^*()?;；:【】+\"\']+|[+——！，;:。？、~@#￥%……&*（）]+", " ", name)
        name = name.lower()
        if name in model.vocab:
            vec = model[name]
        else:
            vec = generateEmbeddingsWithIDF(name, idf)
        # DataFrame.set_value was removed in pandas 1.0; .at is the label-based replacement
        df.at[index, 'embeddings_idf'] = vec

    # print(df.head())
    return df

## Entity Link

In [152]:
def linkCompanyAndPosition(output, model, company_df, position_df, company_threshold1, company_threshold2, position_threshold):
    company_entity = []
    position_entity = []
    for index, li in enumerate(output):
        print(f'process {index}-Gram')
        for term in li:
            if len(term) <= 1:
                continue
            if term in model.vocab:
                term_vec = model[term]

                # Link Company
                company_candidate = dict()
                for index, row in company_df.iterrows():
                    name = row['name']
                    if isinstance(name, float):
                        continue
                    name_vec = row['embeddings_idf']
                    sim = calculate_cosine_similarity(term_vec, name_vec)
                    if (sim > company_threshold1):
                        company_candidate[name] = sim
                if len(company_candidate) != 0:
                    company_candidate = sorted(company_candidate.items(), key=lambda item:item[1], reverse=True)
                    print(f'company entity found: {term}->{company_candidate[0][0]}, sim = {company_candidate[0][1]}')
                    company_entity.append(company_candidate[0])
                    
                # Link Position
                position_candidate = dict()
                for index, row in position_df.iterrows():
                    name = row['name']
                    name_vec = row['embeddings']
                    sim = calculate_cosine_similarity(term_vec, name_vec)
                    if (sim > position_threshold):
                        position_candidate[name] = sim
                if len(position_candidate) != 0:
                    position_candidate = sorted(position_candidate.items(), key=lambda item:item[1], reverse=True)
                    print(f'position entity found: {term}->{position_candidate[0][0]}, sim = {position_candidate[0][1]}')
                    position_entity.append(position_candidate[0])
            else:
                term_vec = generateEmbeddings(term)
                for index, row in company_df.iterrows():
                    name = row['name']
                    if isinstance(name, float):
                        continue
                    name_vec = row['embeddings']
                    sim = calculate_cosine_similarity(term_vec, name_vec)
                    if (sim == 0.0 or sim > company_threshold2):
                        is_match = exact_match(name, term)
                        if is_match:
                            print(f'company entity found by exact match: {term}->{name}')
                            company_entity.append([name, 1.0])
                            break
                
    return company_entity, position_entity
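
The linker above walks every entity row with `iterrows()` for every term, which is likely the main reason each resume takes minutes in the batch run further down. As a sketch only, the same rescaled cosine score can be computed against all entities at once if the embeddings are stacked into one matrix. The function name `top_match` and its arguments are hypothetical, and I have not timed this on the real data.

```python
import numpy as np


def top_match(term_vec, names, matrix, threshold):
    """Return the best (name, similarity) above `threshold`, or None.

    matrix has shape (n_entities, dim); row i is the embedding of names[i].
    """
    if len(names) == 0:
        return None
    term_vec = np.asarray(term_vec, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(term_vec)
    sims = np.zeros(len(names))
    nonzero = norms > 0  # zero vectors keep similarity 0.0
    # same rescaling as calculate_cosine_similarity: 0.5 + 0.5 * cos
    sims[nonzero] = 0.5 + 0.5 * (matrix[nonzero] @ term_vec) / norms[nonzero]
    best = int(np.argmax(sims))
    if sims[best] > threshold:
        return names[best], float(sims[best])
    return None


# Hypothetical 2-dimensional embeddings for two entities.
print(top_match([1.0, 0.0], ["公司A", "公司B"], [[1.0, 0.0], [0.0, 1.0]], 0.98))
```

The matrix could be built once per entity table, for example with `np.vstack` over the rows that have a valid embedding, and reused for every term.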

In [153]:
def exact_match(entity, term):
    # print(f'exact_match, entity: {entity}, term: {term}')
    keyword = extractKeyword(term)
    is_match = False
    for key in keyword:
        idx = entity.find(key)
        if idx != -1:
            is_match = True
            # print(f'match {key}')
    return is_match

In [154]:
def extractKeyword(term):
    '''
    Extract keyword in a term, whose embedding can not be found in vocabulary
    Return a list
    '''
    words = jieba.cut(term, cut_all=False)
    word_list = list(words)
    if len(word_list) > 1:
        word_list = getNgrams(word_list, 2)
    # keep only the pieces that have no embedding in the vocabulary
    keyword = []
    for word in word_list:
        if word not in model.vocab:
            # print(f'keyword found: {word}')
            keyword.append(word)
    return keyword

## Main Function

In [134]:
company_df = loadCompany()
position_df = loadPosition()
member_df = loadMember()

before dedup, company size: 11716
after dedup, company size: 3571
before dedup, position size: 5983
after dedup, position size: 199
                                              Resume
0  __团队成员#1__先生是公司创始人,也是中国最有影响力的商界领袖之一。1982年,__团队...
1  __团队成员#2__先生,现任TCL集团股份有限公司执行董事、总裁(COO)。1963年4月...
2  __团队成员#3__女士:1972年7月出生,中山大学法学博士,高级经济师。1993年6月至...
3  __团队成员#4__先生,1965年7月出生,东方电气集团党组副书记、副总经理,兼任东方电气...
4  __团队成员#5__女士,现任TCL多媒体集团有限公司非执行独立董事、A8新媒体集团非执行独...


In [135]:
company_idf = calculate_IDF(company_df)
company_df = preprocess_entity_list(company_df, model)
company_df = preprocess_entity_list_withIDF(company_df, model, company_idf)
# position_idf = calculate_IDF(position_df)
position_df = preprocess_entity_list(position_df, model)



In [136]:
def print_table_company_position(company_entity, position_entity):
    company_list = []
    company_sim_list = []
    position_list = []
    position_sim_list = []
        
    min_len = min(len(company_entity), len(position_entity))
    
    for i, s in enumerate(company_entity):
        if i == min_len:
            break
        company_list.append(s[0])
        company_sim_list.append(s[1])

    for i, d in enumerate(position_entity):
        if i == min_len:
            break
        position_list.append(d[0])
        position_sim_list.append(d[1])

    tb = pt.PrettyTable()
    tb.add_column("Company", company_list)
    tb.add_column("Company_Similarity", company_sim_list)
    tb.add_column("Position", position_list)
    tb.add_column("Position_Similarity", position_sim_list)
    print(tb)

In [137]:
text = '__团队成员#1__先生是公司创始人,也是中国最有影响力的商界领袖之一。1982年,__团队成员#1__先生于华南理工大学毕业,进入TCL的前身-TTK家庭电器有限公司。1985年,他担任新成立的TCL通讯设备公司总经理,创立了TCL品牌。2003年,__团队成员#1__担任TCL集团股份有限公司董事长兼CEO,随后TCL集团整体上市。在他的领导下,2004年TCL一举收购了法国汤姆逊全球彩电业务与阿尔卡特全球手机业务。目前TCL集团已经成为拥有6万名员工,业务遍及全球80多个国家和地区。2013年,TCL集团营业总收入超过855亿元,液晶电视全球销量1766万台,实际产量全球第三,品牌销售全球第三;TCL手机全球销量5520万台,行业排名全球第五。2012年__团队成员#1__被新华网评为“最具社会责任感企业家”;2011年荣获《中国企业家》“最具影响力的25位企业领袖”终身成就奖;2009年被评为“CCTV中国经济年度人物十年商业领袖”;2008年获改革开放30年经济人物称号;2004年被评为Fortune杂志“亚洲年度经济人物”、TIME杂志和CNN全球最具影响力的25名商界人士,同年法国总统希拉克向__团队成员#1__先生颁发了法国国家荣誉勋章。__团队成员#1__是中共第十六大代表,第十届、第十一届、第十二届全国人大代表。__团队成员#1__担任的社会职务包括:中国电子视像行业协会会长;中国国际商会副会长;全国工商联执行委员、广东省工商联(总商会)副主席。'

In [138]:
segListSanitized = preprocess(text)
output = generateNgramsV2(segListSanitized, 3)
company_entity, position_entity = linkCompanyAndPosition(output, model, company_df, position_df, 0.98, 0.9, 0.98)
print_table_company_position(company_entity, position_entity)
print()

Before sanitize, len: 325. After sanitize, len: 239
process 0-Gram
position entity found: 创始人->创始人, sim = 1.000000029802326
company entity found: 中国->尚科宁家（中国）科技有限公司, sim = 0.9905265283315191
position entity found: 总经理->总经理, sim = 1.0000000596046448
position entity found: 董事长->董事长, sim = 1.000000029802326
position entity found: ceo->CEO, sim = 1.0
process 1-Gram
position entity found: 公司创始人->公司创始人, sim = 1.0000000298023295
process 2-Gram
company entity found: tcl集团股份有限公司->TCL集团股份有限公司, sim = 1.0
+------------------------------+--------------------+----------+---------------------+
|           Company            | Company_Similarity | Position | Position_Similarity |
+------------------------------+--------------------+----------+---------------------+
| 尚科宁家（中国）科技有限公司 | 0.9905265283315191 |  创始人  |  1.000000029802326  |
|     TCL集团股份有限公司      |        1.0         |  总经理  |  1.0000000596046448 |
+------------------------------+--------------------+----------+---------------------+



In [155]:
text2 = '__团队成员#13__先生,现任深圳市华星光电技术有限公司高级副总裁.1955 年9 月生,硕士,韩国籍.1973 年至1981 年,韩国延世大学材料工程本科毕业;1991年至1995 年,韩国延世大学材料工程研究生毕业,获硕士学位;2003 年至2006年,McGill University Business 专业MBA 毕业,获硕士学位.1981 年至1999年,历任LG 半导体有限公司制程工程师、存储器制程发展部部长、高级技术中心(ATC)主管、C3 工厂厂长、执行总监;2000 年至2009 年,历任LG PHILIPS液晶显示IT 业务总部执行副总裁、LG PHILIPS 液晶显示生产技术中心总部执行副总裁;2009 年至2010 年,任日本FUHRMEISTER 电子高级顾问;2010 年3 月至今,任深圳市华星光电技术有限公司高级副总裁、总裁、首席执行官等职。'

In [156]:
segListSanitized = preprocess(text2)
output = generateNgramsV2(segListSanitized, 3)
company_entity, position_entity = linkCompanyAndPosition(output, model, company_df, position_df, 0.98, 0.9, 0.98)
print_table_company_position(company_entity, position_entity)
print()

Before sanitize, len: 188. After sanitize, len: 124
process 0-Gram
position entity found: 副总裁->副总裁, sim = 1.000000029802326
position entity found: 部长->部长, sim = 1.0000000596046448
position entity found: 总裁->总裁, sim = 1.000000029802326
process 1-Gram
position entity found: 高级副总裁->高级副总裁, sim = 1.0000000596046448
company entity found by exact match: 半导体有限公司->山东华芯半导体有限公司
position entity found: 执行副总裁->执行副总裁, sim = 1.0000000596046448
position entity found: 首席执行官->首席执行官, sim = 1.0
process 2-Gram
company entity found by exact match: lg半导体有限公司->山东华芯半导体有限公司
company entity found by exact match: 半导体有限公司制程->山东华芯半导体有限公司
+------------------------+--------------------+----------+---------------------+
|        Company         | Company_Similarity | Position | Position_Similarity |
+------------------------+--------------------+----------+---------------------+
| 山东华芯半导体有限公司 |        1.0         |  副总裁  |  1.000000029802326  |
| 山东华芯半导体有限公司 |        1.0         |   部长   |  1.0000000596046448 |
| 山东华芯半导

In [141]:
text3 = '__团队成员#6__先生,1980年7月生,硕士研究生学历。2002年,福州大学经济学本科毕业;2006年,云南大学法律硕士研究生毕业。2006年8月至2014年2月,任职国泰君安证券股份有限公司,历任国泰君安证券香港公司财务顾问部高级经理、总经理,深圳总部机构客户部总监,从事香港与中国资本市场的投资银行业务。2014年3月加入TCL集团股份有限公司,任公司董事会办公室主任;2014年4月起任公司董事会秘书;2014年12月起任公司执委会成员;2015年4月起任TCL集团控股子公司全球播有限公司董事;2015年5月起任TCL通讯科技控股有限公司(02618.HK)非执行董事。'

In [142]:
segListSanitized = preprocess(text3)
output = generateNgramsV2(segListSanitized, 3)
company_entity, position_entity = linkCompanyAndPosition(output, model, company_df, position_df, 0.98, 0.9, 0.98)
print_table_company_position(company_entity, position_entity)
print()

Before sanitize, len: 139. After sanitize, len: 111
process 0-Gram
position entity found: 经理->经理, sim = 1.0000000596046448
position entity found: 总经理->总经理, sim = 1.0000000596046448
company entity found: 深圳->深圳坂云智行有限公司, sim = 1.0000000159812819
company entity found: 中国->尚科宁家（中国）科技有限公司, sim = 0.9905265283315191
position entity found: 董事->董事, sim = 1.0
company entity found: 科技->多玛凯拔科技有限公司, sim = 1.0000000094492707
process 1-Gram
position entity found: 高级经理->高级经理, sim = 1.0
position entity found: 董事会秘书->董事会秘书, sim = 1.0000000596046448
company entity found by exact match: 科技控股->盛世乐居（武汉）科技控股有限公司
position entity found: 执行董事->执行董事, sim = 1.000000029802326
process 2-Gram
company entity found: tcl集团股份有限公司->TCL集团股份有限公司, sim = 1.0
position entity found: 董事会办公室主任->董事会办公室主任, sim = 1.0
company entity found by exact match: 科技控股有限公司->盛世乐居（武汉）科技控股有限公司
position entity found: 非执行董事->非执行董事, sim = 1.0000000298023295
+----------------------------------+--------------------+------------+---------------------+

In [143]:
text4 = '__团队成员#1__，北京爱国者新能源科技发展有限公司 CEO。'

In [147]:
segListSanitized = preprocess(text4)
output = generateNgramsV2(segListSanitized, 3)
company_entity, position_entity = linkCompanyAndPosition(output, model, company_df, position_df, 0.95, 0.9, 0.98)
print_table_company_position(company_entity, position_entity)
print()

Before sanitize, len: 16. After sanitize, len: 10
process 0-Gram
company entity found: 北京->宝希（北京）科技有限公司, sim = 0.9760900424845506
company entity found: 爱国者->爱国者电子科技有限公司, sim = 0.9713396702727501
company entity found: 新能源->上海烯美新能源科技有限公司, sim = 0.9637615946079356
company entity found: 科技->多玛凯拔科技有限公司, sim = 1.0000000094492707
company entity found: 发展->深圳市迈迪加科技发展有限公司, sim = 0.9604227663245726
position entity found: ceo->CEO, sim = 1.0
process 1-Gram
process 2-Gram
+--------------------------+--------------------+----------+---------------------+
|         Company          | Company_Similarity | Position | Position_Similarity |
+--------------------------+--------------------+----------+---------------------+
| 宝希（北京）科技有限公司 | 0.9760900424845506 |   CEO    |         1.0         |
+--------------------------+--------------------+----------+---------------------+



In [148]:
text5 = '__团队成员#1__，Omnigo机器人CEO。毕业于华中科技大学，原uArm创始团队核心成员，uArm机械臂主创设计师。'

In [150]:
segListSanitized = preprocess(text5)
output = generateNgramsV2(segListSanitized, 3)
company_entity, position_entity = linkCompanyAndPosition(output, model, company_df, position_df, 0.98, 0.9, 0.98)
print_table_company_position(company_entity, position_entity)
print()

Before sanitize, len: 27. After sanitize, len: 18
process 0-Gram
company entity found by exact match: 酷刻->广州酷刻科技有限公司
position entity found: ceo->CEO, sim = 1.0
process 1-Gram
process 2-Gram
+----------------------+--------------------+----------+---------------------+
|       Company        | Company_Similarity | Position | Position_Similarity |
+----------------------+--------------------+----------+---------------------+
| 广州酷刻科技有限公司 |        1.0         |   CEO    |         1.0         |
+----------------------+--------------------+----------+---------------------+



In [151]:
for index, row in member_df.iterrows():
    if index == 40:
        break
    print(f'Handle No.{index} text')
    start = datetime.datetime.now()
    text = row['Resume']
    segListSanitized = preprocess(text)
    output = generateNgramsV2(segListSanitized, 3)
    company_entity, position_entity = linkCompanyAndPosition(output, model, company_df, position_df, 0.98, 0.9, 0.98)
    print_table_company_position(company_entity, position_entity)
    print()
    end = datetime.datetime.now()
    print(f'cost time: {end - start} sec')
    print()
    

Handle No.0 text
Before sanitize, len: 325. After sanitize, len: 239
process 0-Gram
position entity found: 创始人->创始人, sim = 1.000000029802326
company entity found: 中国->尚科宁家（中国）科技有限公司, sim = 0.9905265283315191
position entity found: 总经理->总经理, sim = 1.0000000596046448
position entity found: 董事长->董事长, sim = 1.000000029802326
position entity found: ceo->CEO, sim = 1.0
process 1-Gram
position entity found: 公司创始人->公司创始人, sim = 1.0000000298023295
process 2-Gram
company entity found: tcl集团股份有限公司->TCL集团股份有限公司, sim = 1.0
+------------------------------+--------------------+----------+---------------------+
|           Company            | Company_Similarity | Position | Position_Similarity |
+------------------------------+--------------------+----------+---------------------+
| 尚科宁家（中国）科技有限公司 | 0.9905265283315191 |  创始人  |  1.000000029802326  |
|     TCL集团股份有限公司      |        1.0         |  总经理  |  1.0000000596046448 |
+------------------------------+--------------------+----------+-------------

company entity found by exact match: 广新控股集团有限公司->杭州大王椰控股集团有限公司
company entity found by exact match: 控股集团有限公司总经理->苏宁控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->国美控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->浙江吉利控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->厦门国贸控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->雪松控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->潍柴控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->青山控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->中南控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->盛虹控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->浙江荣盛控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->四川长虹电子控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->云南省建设投资控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->云南省投资控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理->日照钢铁控股集团有限公司

company entity found by exact match: 控股集团有限公司总经理-

+----------------------------+--------------------+------------+---------------------+
|          Company           | Company_Similarity |  Position  | Position_Similarity |
+----------------------------+--------------------+------------+---------------------+
|    四川创燚科技有限公司    | 0.9913867313860034 |  副总经理  |  1.0000000596046448 |
| 北京科力通电气股份有限公司 |        1.0         |    董事    |         1.0         |
|    国文电气股份有限公司    |        1.0         |   总经理   |  1.0000000596046448 |
|  河北旭辉电气股份有限公司  |        1.0         | 高级工程师 |  1.0000000596046448 |
+----------------------------+--------------------+------------+---------------------+

cost time: 0:02:35.103060 sec

Handle No.4 text
Before sanitize, len: 126. After sanitize, len: 94
process 0-Gram
position entity found: 董事->董事, sim = 1.0
position entity found: 总经理->总经理, sim = 1.0000000596046448
position entity found: 副总裁->副总裁, sim = 1.000000029802326
position entity found: 总裁->总裁, sim = 1.000000029802326
position entity found: ceo->CEO, sim 

position entity found: 董事长->董事长, sim = 1.000000029802326
process 1-Gram
process 2-Gram
company entity found: tcl集团股份有限公司->TCL集团股份有限公司, sim = 1.0
+---------------------+--------------------+----------+---------------------+
|       Company       | Company_Similarity | Position | Position_Similarity |
+---------------------+--------------------+----------+---------------------+
| TCL集团股份有限公司 |        1.0         |  副总裁  |  1.000000029802326  |
+---------------------+--------------------+----------+---------------------+

cost time: 0:01:56.807906 sec

Handle No.9 text
Before sanitize, len: 115. After sanitize, len: 88
process 0-Gram
position entity found: 总裁->总裁, sim = 1.000000029802326
position entity found: 董事->董事, sim = 1.0
position entity found: 总经理->总经理, sim = 1.0000000596046448
position entity found: 副总经理->副总经理, sim = 1.0000000596046448
company entity found: 科技->多玛凯拔科技有限公司, sim = 1.0000000094492707
position entity found: 副总裁->副总裁, sim = 1.000000029802326
process 1-Gram
position ent

company entity found by exact match: 科技有限责任公司总裁->湖南银通科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->深圳市本牛科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->深圳市通用互联科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->深圳市优乐佳科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->深圳市右转智能科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->四川维优科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->芜湖艾尔达科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->武汉小弦科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->西安中科比奇创新科技有限责任公司
company entity found by exact match: 科技有限责任公司总裁->云南格维科技有限责任公司
+------------------------------+--------------------+------------+---------------------+
|           Company            | Company_Similarity |  Position  | Position_Similarity |
+------------------------------+--------------------+------------+---------------------+
|     多玛凯拔科技有限公司     | 1.0000000094492707 |   副总裁   |  1.000000029802326  |
| 尚科宁家（中国）科技有限公司 | 0.99

company entity found by exact match: 欢网科技有限责任->深圳市本牛科技有限责任公司
company entity found by exact match: 欢网科技有限责任->深圳市通用互联科技有限责任公司
company entity found by exact match: 欢网科技有限责任->深圳市优乐佳科技有限责任公司
company entity found by exact match: 欢网科技有限责任->深圳市右转智能科技有限责任公司
company entity found by exact match: 欢网科技有限责任->四川维优科技有限责任公司
company entity found by exact match: 欢网科技有限责任->芜湖艾尔达科技有限责任公司
company entity found by exact match: 欢网科技有限责任->武汉小弦科技有限责任公司
company entity found by exact match: 欢网科技有限责任->云南格维科技有限责任公司
+----------------------------------+--------------------+------------+---------------------+
|             Company              | Company_Similarity |  Position  | Position_Similarity |
+----------------------------------+--------------------+------------+---------------------+
|     惠州市耐利普科技有限公司     | 0.9951196531996298 |   副总裁   |  1.000000029802326  |
|       深圳坂云智行有限公司       | 1.0000000159812819 |    经理    |  1.0000000596046448 |
|       多玛凯拔科技有限公司       | 1.0000000094492707 |    部长    |  1.0000000596

position entity found: 董事->董事, sim = 1.0
position entity found: 副董事长->副董事长, sim = 1.0
company entity found: 中国->尚科宁家（中国）科技有限公司, sim = 0.9905265283315191
position entity found: 董事长->董事长, sim = 1.000000029802326
position entity found: 合伙人->合伙人, sim = 1.0
position entity found: 监事->监事, sim = 1.0
process 1-Gram
position entity found: 独立董事->独立董事, sim = 1.0000000596046448
position entity found: 执行董事->执行董事, sim = 1.000000029802326
process 2-Gram
position entity found: 非执行董事->非执行董事, sim = 1.0000000298023295
+------------------------------+--------------------+----------+---------------------+
|           Company            | Company_Similarity | Position | Position_Similarity |
+------------------------------+--------------------+----------+---------------------+
| 尚科宁家（中国）科技有限公司 | 0.9905265283315191 |   董事   |         1.0         |
+------------------------------+--------------------+----------+---------------------+

cost time: 0:02:11.768088 sec

Handle No.22 text
Before sanitize, len: 90. 

process 2-Gram
company entity found: tcl集团股份有限公司->TCL集团股份有限公司, sim = 1.0
company entity found by exact match: lg半导体有限公司->山东华芯半导体有限公司
company entity found by exact match: lg半导体有限公司->深圳市海思半导体有限公司
company entity found by exact match: 半导体有限公司任职->山东华芯半导体有限公司
company entity found by exact match: 半导体有限公司任职->深圳市海思半导体有限公司
+--------------------------+--------------------+------------+---------------------+
|         Company          | Company_Similarity |  Position  | Position_Similarity |
+--------------------------+--------------------+------------+---------------------+
|  山东华芯半导体有限公司  |        1.0         |   副总裁   |  1.000000029802326  |
| 深圳市海思半导体有限公司 |        1.0         |    董事    |         1.0         |
|   TCL集团股份有限公司    |        1.0         |    总裁    |  1.000000029802326  |
|  山东华芯半导体有限公司  |        1.0         | 首席执行官 |         1.0         |
| 深圳市海思半导体有限公司 |        1.0         | 高级副总裁 |  1.0000000596046448 |
+--------------------------+--------------------+------------+--------------

process 1-Gram
process 2-Gram
+------------------------------+--------------------+----------+---------------------+
|           Company            | Company_Similarity | Position | Position_Similarity |
+------------------------------+--------------------+----------+---------------------+
|     深圳坂云智行有限公司     | 1.0000000159812819 |  董事长  |  1.000000029802326  |
| 尚科宁家（中国）科技有限公司 | 0.9905265283315191 |   CEO    |         1.0         |
+------------------------------+--------------------+----------+---------------------+

cost time: 0:03:56.605363 sec

Handle No.36 text
Before sanitize, len: 14. After sanitize, len: 8
process 0-Gram
position entity found: 创始人->创始人, sim = 1.000000029802326
position entity found: 董事长->董事长, sim = 1.000000029802326
process 1-Gram
process 2-Gram
+---------+--------------------+----------+---------------------+
| Company | Company_Similarity | Position | Position_Similarity |
+---------+--------------------+----------+---------------------+
+---------+--------