## Demo3
In this notebook, I address the output structure of the algorithm. The result should pair a degree with a school, or a position with a company.
For example, given `xxx本科是中山大学，在清华大学获得硕士学位` ("xxx did an undergraduate degree at Sun Yat-sen University and received a master's degree from Tsinghua University"), I expect a result like `<中山大学, 本科>`, `<清华大学, 硕士>`.

Idea:
+ Link entities separately: run the linking on each n-gram level in turn, so that the order of the text can be used.
+ Eliminate duplicate entities after linking.
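
The second idea, removing duplicates after linking, does not appear as a separate step in the cells shown here. Below is a minimal sketch of one way to do it, assuming each linked entity is a `(name, similarity)` tuple like those returned by `linkSchoolAndDegree`. The helper name `dedupe_entities` is hypothetical and not part of the notebook.

```python
def dedupe_entities(entities):
    """Keep one entry per entity name, in first-seen order, with its best score."""
    best = {}  # dicts preserve insertion order (Python 3.7+)
    for name, sim in entities:
        # Updating an existing key keeps its original position in the dict
        if name not in best or sim > best[name]:
            best[name] = sim
    return list(best.items())


linked = [('延世大学', 1.0), ('麦吉尔大学', 0.916), ('延世大学', 0.903)]
print(dedupe_entities(linked))
```

Keeping first-seen order matters here because the idea above relies on text order to pair schools with degrees.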

In [1]:
import jieba
import re
import chardet
from gensim.models import KeyedVectors
import numpy as np
import pandas as pd
import math
import datetime
import prettytable as pt

## Load File

In [64]:
def loadSchool():
    path1 = 'data/chinese_university_list.csv'
    school_df = pd.read_csv(path1, header=None, delimiter=",", skiprows=4, names=["rank", "name", "code", "department", "city", "level", "notes"])
    school_df = pd.DataFrame(school_df, columns=['name'])
    print(school_df.shape[0])
    
    path2 = 'data/all_university_list.csv'
    school_global_df = pd.read_csv(path2, header=None, delimiter=',', skiprows=2, names=['Name_en', 'name', 'rank', 'score', 'location'])
    school_global_df = pd.DataFrame(school_global_df, columns=['name'])
    print(school_global_df.head())
    
    school_df = pd.concat([school_df, school_global_df], axis=0, ignore_index=True)
    print(school_df.shape[0])
    return school_df

In [3]:
def loadDegree():
    degree = {'name': ['本科', '硕士' ,'研究生', '博士']}
    degree_df = pd.DataFrame(degree)
    print(degree_df)
    return degree_df

In [4]:
def loadMember():
    path = 'data/member-data.csv'
    member_df = pd.read_csv(path, header=None, delimiter=",", skiprows=1, names=['Company', 'No.', 'Resume', 'Position'])
    member_df = pd.DataFrame(member_df, columns=['Resume'])
    print(member_df.head())
    return member_df

## Preprocess Text

In [5]:
def removeStopWords(seglist):
    # Load the stopword list into a set for fast membership tests
    with open('data/stopwords_cn.txt', 'r', encoding='utf-8', errors='ignore') as fstop:
        stopwords = {w.strip() for w in fstop}
    stopwords.add(' ')

    # Keep only the tokens that are not stopwords, preserving their order
    return [word for word in seglist if word not in stopwords]

In [6]:
def preprocess(text):
    # remove punctuations
    text = re.sub(r"[\s+\.\!\/_,$%^*()?;；:【】+\"\']+|[+——！，;:。？、~@#￥%……&*（）]+", " ", text)
    text = text.lower()
    # segment the text into words
    words = jieba.cut(text, cut_all=False)
    seglist = list(words)
    # remove stopwords
    segListSanitized = removeStopWords(seglist)
    print(f'Before sanitize, len: {len(seglist)}. After sanitize, len: {len(segListSanitized)}')

    return segListSanitized

## N-gram Algorithm

In [7]:
def getNgrams(wordList, n):
    '''
    Generate the n-grams of a single size n
    '''
    output = []
    for i in range(len(wordList) - n + 1):
        n_gram_temp = "".join(wordList[i:i+n])
        output.append(n_gram_temp)
    return output

In [8]:
def generateNgrams(wordList, n):
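    '''
    Generate all n-grams of sizes 1 through n as a set (text order is lost)
    '''
    result = set()
    for i in range(n):
        temp = getNgrams(wordList, i+1)
        # getNgrams returns a list, so convert it before taking the union
        result = result | set(temp)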
    
    return result

In [9]:
def generateNgramsV2(wordList, n):
    '''
    Generate n-grams of sizes 1 through n, one list per size,
    de-duplicated while keeping first-occurrence order
    '''
    result = []
    for i in range(n):
        temp_list = getNgrams(wordList, i+1)
        temp = list(set(temp_list))
        temp.sort(key=temp_list.index)
        result.append(temp)
        
    return result
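
To show why the list-based version is useful for the ordering idea, here is a small self-contained sketch. It re-implements the n-gram step under the hypothetical names `get_ngrams` and `ordered_unique`, and uses `dict.fromkeys` for order-preserving de-duplication as an alternative to sorting by `list.index`. The token list is made up for illustration.

```python
def get_ngrams(words, n):
    """Join each window of n consecutive tokens into one string."""
    return ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ordered_unique(items):
    """Drop repeats while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


words = ['中山大学', '本科', '清华大学', '硕士', '本科']
# One list per n-gram size: index 0 holds 1-grams, index 1 holds 2-grams
levels = [ordered_unique(get_ngrams(words, n)) for n in range(1, 3)]
for level in levels:
    print(level)
```

Each level keeps the order in which terms appear in the text, which a plain `set` would not guarantee.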

## Word Embedding

In [10]:
model = KeyedVectors.load('./test_50.bin')

In [11]:
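def calculate_cosine_similarity(a, b):
    # np.mat is deprecated (and removed in NumPy 2.0); use plain 1-D arrays
    vector_a = np.asarray(a).ravel()
    vector_b = np.asarray(b).ravel()
    num = float(np.dot(vector_a, vector_b))
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    # rescale the cosine from [-1, 1] to [0, 1]
    sim = 0.5 + 0.5 * cos
    return sim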
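
The similarity used here is the cosine rescaled from [-1, 1] to [0, 1] via `0.5 + 0.5 * cos`. As a quick worked check, the sketch below evaluates that formula on three toy vector pairs. The function name `rescaled_cosine` is hypothetical and the vectors are made up.

```python
import numpy as np


def rescaled_cosine(a, b):
    """Cosine similarity mapped linearly from [-1, 1] to [0, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 0.5 + 0.5 * cos


same = rescaled_cosine([1, 0], [2, 0])        # same direction, cos = 1
orthogonal = rescaled_cosine([1, 0], [0, 3])  # orthogonal, cos = 0
opposite = rescaled_cosine([1, 0], [-1, 0])   # opposite, cos = -1
print(same, orthogonal, opposite)
```

One consequence of the rescaling: a threshold of 0.9 on this score corresponds to a raw cosine of 0.8, since `0.5 + 0.5 * 0.8 = 0.9`.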

In [12]:
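def generateEmbeddings(name):
    # Average the vectors of the in-vocabulary tokens of `name`
    words = jieba.cut(name, cut_all=False)
    word_list = list(words)
    v = np.zeros(model.vector_size)
    count = 0
    for word in word_list:
        if word in model.vocab:
            v += model[word]
            count += 1

    # Divide by the number of tokens found, not by the vector dimension
    if count > 0:
        v /= count
    return v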
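
A name that is not in the vocabulary as a whole is represented by combining the vectors of its tokens. The sketch below illustrates mean pooling over the in-vocabulary tokens with a toy lookup table. The name `average_embedding`, the 3-dimensional vectors, and the token list are all assumptions for illustration and do not come from the real model.

```python
import numpy as np

# Toy stand-in for the word-embedding model (hypothetical 3-d vectors)
toy_vectors = {
    '韩国': np.array([1.0, 0.0, 0.0]),
    '大学': np.array([0.0, 1.0, 0.0]),
}


def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of tokens found in `vectors`; zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)


# '延世' is out of vocabulary in the toy table, so only two tokens contribute
vec = average_embedding(['韩国', '延世', '大学'], toy_vectors, 3)
print(vec)
```

Note that a name whose only known token is a generic word such as `大学` ends up with exactly that word's vector, and an all-zero result has no defined cosine similarity, so such rows may need to be skipped before linking.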

In [13]:
def preprocess_entity_list(df, model):
    '''
    df: dataframe with a 'name' column
    model: word embedding model
    '''
    
    df['embeddings'] = ''
    for index, row in df.iterrows():
        # df.loc[index, 'embeddings'] = z
        name = row['name']
        if isinstance(name, float):
            continue
        name = name.lower()
        if name in model.vocab:
            vec = model[name]
        else:
            vec = generateEmbeddings(name)
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
        df.at[index, 'embeddings'] = vec

    # print(df.head())
    return df

## Entity Link

In [31]:
def linkSchoolAndDegree(output, model, school_df, degree_df, school_threshold, degree_threshold):
    school_entity = []
    degree_entity = []
    for index, li in enumerate(output):
        # index is 0-based, so "0-Gram" in the log is the 1-gram level
        print(f'process {index}-Gram')
        for term in li:
            if len(term) <= 1:
                continue
            if term in model.vocab:
                term_vec = model[term]
                school_candidate = dict()

                # Link School
                for index, row in school_df.iterrows():
                    name = row['name']
                    if isinstance(name, float):
                        continue
                    name_vec = row['embeddings']
                    sim = calculate_cosine_similarity(term_vec, name_vec)
                    if term == '韩国延世大学' and name == '韩国延世大学':
                        print(f'xxxxx sim: {sim}')
                    if (sim > school_threshold):
                        school_candidate[name] = sim
                if len(school_candidate) != 0:
                    school_candidate = sorted(school_candidate.items(), key=lambda item:item[1], reverse=True)
                    print(f'university entity found: {term}->{school_candidate[0][0]}, sim = {school_candidate[0][1]}')
                    school_entity.append(school_candidate[0])
                    
                # Link Degree
                degree_candidate = dict()
                for index, row in degree_df.iterrows():
                    name = row['name']
                    name_vec = row['embeddings']
                    sim = calculate_cosine_similarity(term_vec, name_vec)
                    if (sim > degree_threshold):
                        degree_candidate[name] = sim
                if len(degree_candidate) != 0:
                    degree_candidate = sorted(degree_candidate.items(), key=lambda item:item[1], reverse=True)
                    print(f'degree entity found: {term}->{degree_candidate[0][0]}, sim = {degree_candidate[0][1]}')
                    degree_entity.append(degree_candidate[0])
                    
    
    return school_entity, degree_entity
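
The function above compares each term against every entity row with `iterrows`, which is likely a large part of the per-text running time. A possible speed-up, sketched below under the assumption that the entity embeddings have been stacked into one 2-D NumPy array with a parallel list of names, computes all similarities for a term in a single matrix product. The helper `best_match` and the toy vectors are hypothetical.

```python
import numpy as np


def best_match(term_vec, names, matrix, threshold):
    """Return (name, sim) for the most similar row above threshold, else None."""
    term_vec = np.asarray(term_vec, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(term_vec)
    sims = np.full(len(names), -1.0)  # rows with a zero norm never match
    valid = norms > 0
    # Rescaled cosine, 0.5 + 0.5 * cos, for all valid rows at once
    sims[valid] = 0.5 + 0.5 * (matrix[valid] @ term_vec) / norms[valid]
    best = int(np.argmax(sims))
    if sims[best] > threshold:
        return names[best], float(sims[best])
    return None


names = ['华南理工大学', '重庆大学', '福州大学']
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # made-up 2-d embeddings
result = best_match([1.0, 0.1], names, matrix, 0.9)
print(result)
```

This returns only the top candidate, which matches how the loop version keeps `school_candidate[0]` after sorting.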

## Main Function

In [67]:
school_df = loadSchool()
degree_df = loadDegree()
member_df = loadMember()

2718
     name
0  麻省理工学院
1   斯坦福大学
2    哈佛大学
3    牛津大学
4  加州理工学院
3669
  name
0   本科
1   硕士
2  研究生
3   博士
                                              Resume
0  __团队成员#1__先生是公司创始人,也是中国最有影响力的商界领袖之一。1982年,__团队...
1  __团队成员#2__先生,现任TCL集团股份有限公司执行董事、总裁(COO)。1963年4月...
2  __团队成员#3__女士:1972年7月出生,中山大学法学博士,高级经济师。1993年6月至...
3  __团队成员#4__先生,1965年7月出生,东方电气集团党组副书记、副总经理,兼任东方电气...
4  __团队成员#5__女士,现任TCL多媒体集团有限公司非执行独立董事、A8新媒体集团非执行独...


In [71]:
school_df = preprocess_entity_list(school_df, model)
degree_df = preprocess_entity_list(degree_df, model)



In [79]:
def print_table_school_degree(school_entity, degree_entity):
    school_list = []
    school_sim_list = []
    degree_list = []
    degree_sim_list = []

    if len(degree_entity) == 0:
        # No degree found: fall back to a default of 本科 (bachelor's) with similarity 1.0
        degree_entity.append(('本科', 1.0))
        
    min_len = min(len(school_entity), len(degree_entity))
    
    for i, s in enumerate(school_entity):
        if i == min_len:
            break
        school_list.append(s[0])
        school_sim_list.append(s[1])

    for i, d in enumerate(degree_entity):
        if i == min_len:
            break
        degree_list.append(d[0])
        degree_sim_list.append(d[1])

    tb = pt.PrettyTable()
    tb.add_column("School", school_list)
    tb.add_column("School_Similarity", school_sim_list)
    tb.add_column("Degree", degree_list)
    tb.add_column("Degree_Similarity", degree_sim_list)
    print(tb)

In [80]:
text = '__团队成员#1__先生是公司创始人,也是中国最有影响力的商界领袖之一。1982年,__团队成员#1__先生于华南理工大学毕业,进入TCL的前身-TTK家庭电器有限公司。1985年,他担任新成立的TCL通讯设备公司总经理,创立了TCL品牌。2003年,__团队成员#1__担任TCL集团股份有限公司董事长兼CEO,随后TCL集团整体上市。在他的领导下,2004年TCL一举收购了法国汤姆逊全球彩电业务与阿尔卡特全球手机业务。目前TCL集团已经成为拥有6万名员工,业务遍及全球80多个国家和地区。2013年,TCL集团营业总收入超过855亿元,液晶电视全球销量1766万台,实际产量全球第三,品牌销售全球第三;TCL手机全球销量5520万台,行业排名全球第五。2012年__团队成员#1__被新华网评为“最具社会责任感企业家”;2011年荣获《中国企业家》“最具影响力的25位企业领袖”终身成就奖;2009年被评为“CCTV中国经济年度人物十年商业领袖”;2008年获改革开放30年经济人物称号;2004年被评为Fortune杂志“亚洲年度经济人物”、TIME杂志和CNN全球最具影响力的25名商界人士,同年法国总统希拉克向__团队成员#1__先生颁发了法国国家荣誉勋章。__团队成员#1__是中共第十六大代表,第十届、第十一届、第十二届全国人大代表。__团队成员#1__担任的社会职务包括:中国电子视像行业协会会长;中国国际商会副会长;全国工商联执行委员、广东省工商联(总商会)副主席。'

In [81]:
segListSanitized = preprocess(text)
output = generateNgramsV2(segListSanitized, 3)
school_entity, degree_entity = linkSchoolAndDegree(output, model, school_df, degree_df, 0.91, 0.91)
print_table_school_degree(school_entity, degree_entity)
print()

Before sanitize, len: 325. After sanitize, len: 239
process 0-Gram
university entity found: 华南理工大学->华南理工大学, sim = 1.000000029802326
process 1-Gram
process 2-Gram
+--------------+-------------------+--------+-------------------+
|    School    | School_Similarity | Degree | Degree_Similarity |
+--------------+-------------------+--------+-------------------+
| 华南理工大学 | 1.000000029802326 |  本科  |        1.0        |
+--------------+-------------------+--------+-------------------+



In [82]:
text2 = '__团队成员#13__先生,现任深圳市华星光电技术有限公司高级副总裁.1955 年9 月生,硕士,韩国籍.1973 年至1981 年,韩国延世大学材料工程本科毕业;1991年至1995 年,韩国延世大学材料工程研究生毕业,获硕士学位;2003 年至2006年,McGill University Business 专业MBA 毕业,获硕士学位.1981 年至1999年,历任LG 半导体有限公司制程工程师、存储器制程发展部部长、高级技术中心(ATC)主管、C3 工厂厂长、执行总监;2000 年至2009 年,历任LG PHILIPS液晶显示IT 业务总部执行副总裁、LG PHILIPS 液晶显示生产技术中心总部执行副总裁;2009 年至2010 年,任日本FUHRMEISTER 电子高级顾问;2010 年3 月至今,任深圳市华星光电技术有限公司高级副总裁、总裁、首席执行官等职。'

In [83]:
segListSanitized = preprocess(text2)
output = generateNgramsV2(segListSanitized, 3)
school_entity, degree_entity = linkSchoolAndDegree(output, model, school_df, degree_df, 0.9, 0.90)
print_table_school_degree(school_entity, degree_entity)
print()

Before sanitize, len: 188. After sanitize, len: 124
process 0-Gram
degree entity found: 硕士->硕士, sim = 1.0
university entity found: 延世大学->延世大学, sim = 1.000000029802326
degree entity found: 本科毕业->本科, sim = 0.9349876046180725
degree entity found: 研究生->研究生, sim = 1.0
degree entity found: 硕士学位->硕士, sim = 0.9077423512935638
university entity found: mcgill->麦吉尔大学, sim = 0.9162431259445103
process 1-Gram
university entity found: 韩国延世大学->延世大学, sim = 0.9029911640871959
degree entity found: 研究生毕业->研究生, sim = 0.9151657819747925
process 2-Gram
+------------+--------------------+--------+--------------------+
|   School   | School_Similarity  | Degree | Degree_Similarity  |
+------------+--------------------+--------+--------------------+
|  延世大学  | 1.000000029802326  |  硕士  |        1.0         |
| 麦吉尔大学 | 0.9162431259445103 |  本科  | 0.9349876046180725 |
|  延世大学  | 0.9029911640871959 | 研究生 |        1.0         |
+------------+--------------------+--------+--------------------+



In [78]:
print(school_entity)
print(degree_entity)

[('延世大学', 1.000000029802326), ('麦吉尔大学', 0.9162431259445103), ('延世大学', 0.9029911640871959)]
[('硕士', 1.0), ('本科', 0.9349876046180725), ('研究生', 1.0), ('硕士', 0.9077423512935638), ('研究生', 0.9151657819747925)]


In [84]:
text3 = '__团队成员#6__先生,1980年7月生,硕士研究生学历。2002年,福州大学经济学本科毕业;2006年,云南大学法律硕士研究生毕业。2006年8月至2014年2月,任职国泰君安证券股份有限公司,历任国泰君安证券香港公司财务顾问部高级经理、总经理,深圳总部机构客户部总监,从事香港与中国资本市场的投资银行业务。2014年3月加入TCL集团股份有限公司,任公司董事会办公室主任;2014年4月起任公司董事会秘书;2014年12月起任公司执委会成员;2015年4月起任TCL集团控股子公司全球播有限公司董事;2015年5月起任TCL通讯科技控股有限公司(02618.HK)非执行董事。'

In [85]:
segListSanitized = preprocess(text3)
output = generateNgramsV2(segListSanitized, 3)
linkSchoolAndDegree(output, model, school_df, degree_df, 0.9, 0.90)
print()

Before sanitize, len: 139. After sanitize, len: 111
process 0-Gram
degree entity found: 硕士->硕士, sim = 1.0
degree entity found: 研究生->研究生, sim = 1.0
university entity found: 福州大学->福州大学, sim = 1.0000000596046448
degree entity found: 本科毕业->本科, sim = 0.9349876046180725
university entity found: 云南大学->云南大学, sim = 1.0000000596046448
process 1-Gram
degree entity found: 硕士研究生->硕士, sim = 0.9260654151439667
degree entity found: 研究生毕业->研究生, sim = 0.9151657819747925
process 2-Gram



In [86]:
text4 = '__团队成员#4__先生,1965年7月出生,东方电气集团党组副书记、副总经理,兼任东方电气集团党校校长、东方电气集团总部直属党委书记,东方电气股份有限公司董事。大学本科毕业于上海交通大学船舶动力机械专业并获工学学士学位,研究生毕业于重庆大学热能工程专业并获工学硕士学位,博士研究生毕业于西南财经大学国际贸易学专业并获经济学博士学位。1989年1月加入东方电气集团,先后任东方电气集团成套设计处技干,四川东方电力设备联合公司火电部技干、经理助理、副经理,副总经理,总经理等职,2000年6月至2007年2月任东方电气集团副总经理,2007年2月至2008年9月任国家核电技术总公司副总经理、党组成员,2008年9月至2017年5月任东方电气集团副总经理、党组成员,2017年5月任东方电气集团党组副书记、副总经理至今。拥有正高级工程师职称。'

In [87]:
segListSanitized = preprocess(text4)
output = generateNgramsV2(segListSanitized, 3)
school_entity, degree_entity = linkSchoolAndDegree(output, model, school_df, degree_df, 0.91, 0.92)
print_table_school_degree(school_entity, degree_entity)

Before sanitize, len: 187. After sanitize, len: 149
process 0-Gram
university entity found: 大学->贾达普大学, sim = 1.0000000084665304
degree entity found: 本科毕业->本科, sim = 0.9349876046180725
university entity found: 上海交通大学->上海交通大学, sim = 1.0
degree entity found: 研究生->研究生, sim = 1.0
university entity found: 重庆大学->重庆大学, sim = 1.0000000596046448
degree entity found: 博士->博士, sim = 1.0
university entity found: 西南财经大学->西南财经大学, sim = 1.0000000596046448
process 1-Gram
process 2-Gram
+--------------+--------------------+--------+--------------------+
|    School    | School_Similarity  | Degree | Degree_Similarity  |
+--------------+--------------------+--------+--------------------+
|  贾达普大学  | 1.0000000084665304 |  本科  | 0.9349876046180725 |
| 上海交通大学 |        1.0         | 研究生 |        1.0         |
|   重庆大学   | 1.0000000596046448 |  博士  |        1.0         |
+--------------+--------------------+--------+--------------------+


In [88]:
for index, row in member_df.iterrows():
    if index == 40:
        break
    print(f'Handle No.{index} text')
    start = datetime.datetime.now()
    text = row['Resume']
    segListSanitized = preprocess(text)
    output = generateNgramsV2(segListSanitized, 3)
    school_entity, degree_entity = linkSchoolAndDegree(output, model, school_df, degree_df, 0.9, 0.91)
    print_table_school_degree(school_entity, degree_entity)
    end = datetime.datetime.now()
    print(f'cost time: {end - start} sec')
    print()
    

Handle No.0 text
Before sanitize, len: 325. After sanitize, len: 239
process 0-Gram
university entity found: 华南理工大学->华南理工大学, sim = 1.000000029802326
process 1-Gram
process 2-Gram
+--------------+-------------------+--------+-------------------+
|    School    | School_Similarity | Degree | Degree_Similarity |
+--------------+-------------------+--------+-------------------+
| 华南理工大学 | 1.000000029802326 |  本科  |        1.0        |
+--------------+-------------------+--------+-------------------+
cost time: 0:01:50.517446 sec

Handle No.1 text
Before sanitize, len: 330. After sanitize, len: 264
process 0-Gram
degree entity found: 博士->博士, sim = 1.0
university entity found: 西安交通大学->西安交通大学, sim = 1.0000000596046448
university entity found: 财经学院->宁波财经学院, sim = 0.9081512997142207
process 1-Gram
degree entity found: 博士毕业->硕士, sim = 0.9167363941669464
process 2-Gram
+--------------+--------------------+--------+--------------------+
|    School    | School_Similarity  | Degree | Degree_Similar

degree entity found: 研究生毕业->研究生, sim = 0.9151657819747925
process 2-Gram
+------------+--------------------+--------+--------------------+
|   School   | School_Similarity  | Degree | Degree_Similarity  |
+------------+--------------------+--------+--------------------+
|  延世大学  | 1.000000029802326  |  硕士  |        1.0         |
| 麦吉尔大学 | 0.9162431259445103 |  本科  | 0.9349876046180725 |
|  延世大学  | 0.9029911640871959 | 研究生 |        1.0         |
+------------+--------------------+--------+--------------------+
cost time: 0:00:53.001720 sec

Handle No.13 text
Before sanitize, len: 467. After sanitize, len: 372
process 0-Gram
degree entity found: 本科->本科, sim = 1.0
university entity found: 华南理工大学->华南理工大学, sim = 1.000000029802326
degree entity found: 本科毕业->本科, sim = 0.9349876046180725
university entity found: 武汉->武汉光谷职业学院, sim = 0.9132345485049689
process 1-Gram
process 2-Gram
+------------------+--------------------+--------+--------------------+
|      School      | School_Similarity  | D

degree entity found: 研究生->研究生, sim = 1.0
university entity found: 学院->马尼帕尔高等教育学院, sim = 0.9385821583210973
degree entity found: 教授->博士, sim = 0.9146097251540259
university entity found: 湖南->湖南工商大学, sim = 0.9128547738195922
university entity found: 农业大学->茂物农业大学, sim = 0.9050349574859177
university entity found: 商学院->郑州商学院, sim = 0.9071498487162017
university entity found: 大学->贾达普大学, sim = 1.0000000084665304
process 1-Gram
degree entity found: 硕士研究生->硕士, sim = 0.9260654151439667
university entity found: 惠州学院->惠州学院, sim = 1.0
university entity found: 衡阳师范学院->衡阳师范学院, sim = 1.0000000596046448
university entity found: 湖南农业大学->湖南农业大学, sim = 1.000000029802326
university entity found: 格林威治大学->格林威治大学, sim = 1.0
process 2-Gram
+----------------------+--------------------+--------+--------------------+
|        School        | School_Similarity  | Degree | Degree_Similarity  |
+----------------------+--------------------+--------+--------------------+
| 马尼帕尔高等教育学院 | 0.9385821583210973 |  硕士  |    

process 1-Gram
process 2-Gram
+--------+-------------------+--------+-------------------+
| School | School_Similarity | Degree | Degree_Similarity |
+--------+-------------------+--------+-------------------+
+--------+-------------------+--------+-------------------+
cost time: 0:00:03.734245 sec

Handle No.38 text
Before sanitize, len: 10. After sanitize, len: 5
process 0-Gram
process 1-Gram
process 2-Gram
+--------+-------------------+--------+-------------------+
| School | School_Similarity | Degree | Degree_Similarity |
+--------+-------------------+--------+-------------------+
+--------+-------------------+--------+-------------------+
cost time: 0:00:03.045282 sec

Handle No.39 text
Before sanitize, len: 11. After sanitize, len: 7
process 0-Gram
process 1-Gram
process 2-Gram
+--------+-------------------+--------+-------------------+
| School | School_Similarity | Degree | Degree_Similarity |
+--------+-------------------+--------+-------------------+
+--------+--------------