In [1]:
import pandas as pd
import csv
import json

rpath = '../data/raw/'
wpath = '../data/tyc/'

The companies crawled so far (`comp_dict`), and a partial institution -> company mapping (`org_comp`):

In [2]:
comp_dict = {}
org_comp = {}

for line in open(rpath + 'company_total.json', encoding='utf-8'):
    line = json.loads(line)
    comp_id, comp_name = line['id'], line['名称']
    comp_dict[comp_id] = comp_name
    
    org = line.get('投资机构')
    # the '投资机构' (investing institution) field is missing or is an empty list
    if not org:
        continue
    org_comp[org[0]] = comp_id
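
As a rough illustration of what the loop above does, here is a minimal sketch on in-memory data. The three records are hypothetical and only mimic the one-JSON-object-per-line layout assumed for `company_total.json`; the names `sample`, `names`, and `org_to_comp` are made up for this example.

```python
import io
import json

# Hypothetical records, one JSON object per line (not real crawl data).
sample = io.StringIO(
    '{"id": "company/1", "名称": "A Capital", "投资机构": ["org/9"]}\n'
    '{"id": "company/2", "名称": "B Tech", "投资机构": []}\n'
    '{"id": "company/3", "名称": "C Labs"}\n'
)

names, org_to_comp = {}, {}
for raw in sample:
    rec = json.loads(raw)
    names[rec["id"]] = rec["名称"]
    orgs = rec.get("投资机构")
    if not orgs:  # covers both a missing key (None) and an empty list
        continue
    org_to_comp[orgs[0]] = rec["id"]  # only the first listed institution is used

print(names)
print(org_to_comp)
```

Every company ends up in `names`, but only the first record contributes to `org_to_comp`. If two companies list the same institution first, the later line overwrites the earlier one.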

Extend the institution -> company mapping:

In [3]:
org_manager = {}

for line in open(rpath + 'organize_all.json', encoding='utf-8'):
    line = json.loads(line)
    org_id, man = line['id'], line.get('基金管理人')
    if man:
        org_manager[org_id] = [c['企业名称ID'] for c in man]

In [4]:
org_comp_total = org_comp.copy()
for k, v in org_manager.items():
    # skip companies that are not crawled
    if v[0] not in comp_dict:
        continue
    org_comp_total[k] = v[0]
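
The precedence rules of this merge are easy to miss, so here is a small sketch with toy data. All ids are hypothetical, and the `*_demo` names exist only in this example.

```python
# Toy data; every id here is hypothetical.
crawled = {"company/1", "company/2"}
org_comp_demo = {"org/a": "company/1"}
org_manager_demo = {
    "org/a": ["company/2"],                 # first manager crawled: overrides org/a
    "org/b": ["company/404", "company/1"],  # first manager not crawled: skipped
    "org/c": ["company/1"],                 # new institution: added
}

merged = dict(org_comp_demo)
for org_id, managers in org_manager_demo.items():
    if managers[0] not in crawled:
        continue
    merged[org_id] = managers[0]

print(merged)
```

Two things stand out. A fund-manager entry replaces an existing `org_comp` entry for the same institution. Only the first manager is inspected, so `org/b` is dropped even though its second manager was crawled.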

In [5]:
len(comp_dict), len(org_comp), len(org_comp_total)

(92524, 2441, 5176)

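All investment events, covering both business-registration tracking (`工商追踪`) and publicly disclosed investments (`公开投资事件`); undisclosed investments carry no date: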

In [6]:
touzi_all = {}        

for line in open(rpath + 'organize_all.json', encoding='utf-8'):
    line = json.loads(line)
    org_id, gszz, gktz = line['id'], line.get('工商追踪'), line.get('公开投资事件')
    if org_id not in org_comp_total:
        # print(org_id)
        continue

    if gszz:
        for rec in gszz:
            content = rec['内容']
            if content['CompanyID'] not in comp_dict:
                continue
            if rec['投资日期'] < '1950-01-01':
                continue
            if '对外投资' in content['文本']:
                touzi_all[(org_comp_total[org_id], content['CompanyID'])] = rec['投资日期']

    if gktz:
        for rec in gktz:
            if 'CompanyID' not in rec:
                continue
            if rec['CompanyID'] not in comp_dict:
                continue
            if rec['投资时间'] < '1950-01-01':
                continue
            touzi_all[(org_comp_total[org_id], rec['CompanyID'])] = rec['投资时间']
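
Because `touzi_all` is keyed by the (investor, investee) pair, repeated investments between the same two companies collapse into a single entry holding whichever date was written last. The sketch below uses hypothetical records to show this, next to one possible alternative that keeps the earliest date. It assumes dates are ISO `YYYY-MM-DD` strings, which is what the `< '1950-01-01'` comparison relies on.

```python
# Hypothetical (investor, investee, date) records.
records = [
    ("company/1", "company/2", "2015-03-01"),
    ("company/1", "company/2", "2018-07-15"),  # same pair, later round
    ("company/1", "company/3", "1900-01-01"),  # implausibly early date, filtered out
]

last_seen, earliest = {}, {}
for src, dst, date in records:
    # For ISO dates, lexicographic string order matches chronological order.
    if date < "1950-01-01":
        continue
    key = (src, dst)
    last_seen[key] = date                          # same behaviour as the loop above
    earliest[key] = min(date, earliest.get(key, date))  # alternative: keep first round

print(last_seen)
print(earliest)
```

Which of the two dates is the right label depends on the downstream task; the notebook itself keeps the last one written.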

More than 80,000 valid investment records in total, and every company involved has been crawled:

In [7]:
len(touzi_all)

88334

In [8]:
src_cids = [sid for (sid, _) in touzi_all.keys()]
dst_cids = [did for (_, did) in touzi_all.keys()]
len(set(src_cids)), len(set(dst_cids)), len(set(src_cids + dst_cids))

(4902, 45158, 49593)

In [9]:
# ensure all src_cids and dst_cids are in comp_dict
assert len(set(comp_dict.keys()) & set(src_cids + dst_cids)) == len(set(src_cids + dst_cids))

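Build the company list in this order: pure investors, companies that are both investors and investees, pure investees.

Assign an index to each company: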

In [10]:
len(set(src_cids) - set(dst_cids)), len(set(src_cids) & set(dst_cids)), len(set(dst_cids) - set(src_cids))

(4435, 467, 44691)

In [11]:
src_list = list(set(src_cids) - set(dst_cids))
common_list = list(set(src_cids) & set(dst_cids))
dst_list = list(set(dst_cids) - set(src_cids))

src_list.sort()
common_list.sort()
dst_list.sort()

comp_list = src_list + common_list + dst_list
comp_ind = {v: k for k, v in enumerate(comp_list)}
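
One consequence of this ordering is that investors and investees each occupy a contiguous index range. A minimal sketch on a hypothetical four-node graph (the single-letter ids are placeholders):

```python
# Hypothetical edges: (investor, investee).
edges = [("a", "b"), ("b", "c"), ("d", "c")]

src = {s for s, _ in edges}
dst = {d for _, d in edges}

# Same three-block layout as above: pure investors, overlap, pure investees.
ordered = sorted(src - dst) + sorted(src & dst) + sorted(dst - src)
index = {cid: i for i, cid in enumerate(ordered)}

n_pure_src = len(src - dst)  # investees start at this index
n_src = len(src)             # investors end just before this index

print(ordered)
print(index)
```

Here investors get indices `0 .. n_src - 1` and investees get `n_pure_src .. len(ordered) - 1`. With the counts printed above (4435, 467, 44691), that would put investors at 0 to 4901 and investees at 4435 to 49592.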

In [12]:
assert len(comp_ind) == len(set(src_cids + dst_cids))

Save the list of all crawled companies:

In [13]:
import pickle

with open(wpath + 'comp_dict.pkl', 'wb') as f:
    pickle.dump(comp_dict, f)

Save the correspondence between company index (implicit in the row order), URL path (`cid`), and name (`cname`):

In [14]:
df = pd.DataFrame({ 'cid': comp_list, 'cname': [comp_dict[c] for c in comp_list] })
df.to_csv(wpath + 'comps.csv', index=False)

Read the saved file back to check it:

In [15]:
pd.read_csv(wpath + 'comps.csv')

Unnamed: 0,cid,cname
0,company/1001373759,浙江国贸东方投资管理有限公司
1,company/10055837,亿阳信通股份有限公司
2,company/1007260421,武汉光谷生物城华岭基金管理有限公司
3,company/10082609,北京华谊嘉信整合营销顾问集团股份有限公司
4,company/1012358787,南京高科新创投资有限公司
...,...,...
49588,company/997749750,上海衡益特陶新材料有限公司
49589,company/99780228,北京北比信息技术有限公司
49590,company/99871402,北京红龙文化传媒有限公司
49591,company/998961890,福建省蓝深环保技术股份有限公司


Export the investment graph to use as labels:

In [16]:
with open(wpath + 'comp_touzi_comp.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['src_ind', 'src_cid', 'src_cname', 'dst_ind', 'dst_cid', 'dst_cname', 'date'])
    for (src_cid, dst_cid), date in touzi_all.items():
        writer.writerow([comp_ind[src_cid], src_cid, comp_dict[src_cid], comp_ind[dst_cid], dst_cid, comp_dict[dst_cid], date])
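
For reference, a sketch of how the exported edge list might be read back into an adjacency list keyed by the integer indices. The two rows are hypothetical and only follow the column layout written above. The real file would be opened from disk instead of `io.StringIO`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the same column layout as comp_touzi_comp.csv.
text = (
    "src_ind,src_cid,src_cname,dst_ind,dst_cid,dst_cname,date\n"
    "0,company/1,A Capital,2,company/3,C Labs,2016-05-20\n"
    "0,company/1,A Capital,1,company/2,B Tech,2018-01-09\n"
)

# investor index -> list of (investee index, date)
adj = defaultdict(list)
for row in csv.DictReader(io.StringIO(text)):
    adj[int(row["src_ind"])].append((int(row["dst_ind"]), row["date"]))

print(dict(adj))
```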