# Vector Embeddings

In this notebook, we construct a vector for each document so that each filing can be evaluated for 'closeness' to the subsequent filing.

In [1]:
import numpy as np
import pandas as pd
import os
from bs4 import BeautifulSoup
import urllib
import pickle

import jieba
import string
from string import punctuation
from collections import Counter

import re

# Corpus of Stop-Words

We will use a list of stopwords used by the popular search engine Baidu. The list can be found here: http://www.baiduguide.com/baidu-stopwords/

The list was copied and pasted into a .txt file and is cleaned locally in this notebook:

In [26]:
# Open the copy/pasted text file
with open('../data/stopwords/stopwords.txt', encoding='GB2312') as f:
    text = f.readlines()

In [27]:
# Remove newlines
text = [x.replace('\n','') for x in text]

# Remove spaces
text = [x.replace(' ','') for x in text]

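# Remove lines made up only of letters (e.g. English headings)
# Note: str.isalpha() is also True for Chinese characters, so a line holding
# a single Chinese word with no comma is dropped here as well
text = [x for x in text if not x.isalpha()]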

# Split lines on commas
text = [x.split(',') for x in text]

In [28]:
# Put it all together into a single list
stopwords = []
for element in text:
    stopwords += element

In [29]:
# Remove empty strings (such as the one left at the end of the list)
stopwords = [word for word in stopwords if word]
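
The stopword list is not applied anywhere yet. A minimal sketch of how it could be used once token counts are built, assuming the tokens are plain strings; the toy `stopwords` and `tokens` values below are hypothetical stand-ins, not data from the filings:

```python
from collections import Counter

# Hypothetical stand-ins: in the notebook, `stopwords` is the list built above
# and the tokens come from a tokenized filing
stopwords = ['的', '和', '及']
tokens = ['公司', '的', '董事会', '和', '公司']

# A set gives constant-time membership tests, which matters for long documents
stopword_set = set(stopwords)

# Count only the tokens that are not stopwords
filtered_freqs = Counter(t for t in tokens if t not in stopword_set)
print(filtered_freqs)
```

Filtering before counting keeps the stopwords out of the token universe altogether, which also shortens every vector built from it.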

# Token Frequency Vector

The first embedding we will use is a simple token frequency vector.

## Single-Document Example

First, we find a way to get the token counts of a single document. Then we can prepare a universe of tokens that contains every unique token across all documents, which will allow us to construct our vectors.
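
To make the idea concrete, here is a small sketch with two toy documents (hypothetical data, already tokenized). It shows how counting tokens and fixing a shared token universe turns each document into a vector of equal length:

```python
from collections import Counter

# Two toy "documents", already tokenized (hypothetical data)
doc_a = ['公司', '董事会', '公司']
doc_b = ['公司', '监事会']

freqs_a, freqs_b = Counter(doc_a), Counter(doc_b)

# The token universe: every unique token seen in any document, in a fixed order
universe = sorted(set(freqs_a) | set(freqs_b))

# Each document becomes a vector of counts over the same universe;
# a Counter returns 0 for tokens the document does not contain
vec_a = [freqs_a[tok] for tok in universe]
vec_b = [freqs_b[tok] for tok in universe]

print(dict(zip(universe, vec_a)))
print(dict(zip(universe, vec_b)))
```

The order of `universe` does not matter as long as the same order is used for every document.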

In [2]:
# Take a sample document
directory = '../../China Fin Report Text/tokenized/2016/'
filenames = os.listdir(directory)

# Open a sample filing
with open(directory+filenames[0]) as f:
    text = f.read()

In [66]:
# What the original tokenized text looks like
print(text[:500])

融钰 集团股份 有限公司   2016   年 年度报告 全文   

  

融钰 集团股份 有限公司   

2016   年 年度报告 全文   

2017 - 031   

  

  

  

  

  

2017   年   4   月   

  

1   

 融钰 集团股份 有限公司   2016   年 年度报告 全文   

第一节     重要 提示 、 目录 和 释义   

本 公司 董事会 、 监事会 及 董事 、 监事 、 高级 管理人员 保证 年度报告 内容 的 

真实 、 准确 、 完整 ， 不 存在 虚假 记载 、 误导性 陈述 或 重大 遗漏 ， 并 承担 个别 和 

连带 的 法律责任 。   

公司 负责人 尹 宏伟 、 主管 会计工作 负责人 邓强 及 会计 机构 负责人 ( 会计 主管 

人员 ) 刘丹 声明 ： 保证 年度报告 中 财务报告 的 真实 、 准确 、 完整 。   

所有 董事 均 已 出席 了 审议 本 报告 的 董事会 会议 。   

公司 不 存在 对 生产 经营 、 财务状况 和 持续 盈


In [72]:
# Remove any non-Chinese characters, including numbers and punctuation; keep spaces
# (newlines are removed too; '|' is left out of the class because it would be kept as a literal)
# Maybe we need to keep alphabet too? #ToDo
chars_only_text = re.sub('[^\u4E00-\u9FFF ]', '', text)
chars_only_text[:500]

'融钰 集团股份 有限公司      年 年度报告 全文     融钰 集团股份 有限公司      年 年度报告 全文                     年      月         融钰 集团股份 有限公司      年 年度报告 全文   第一节     重要 提示  目录 和 释义   本 公司 董事会  监事会 及 董事  监事  高级 管理人员 保证 年度报告 内容 的 真实  准确  完整  不 存在 虚假 记载  误导性 陈述 或 重大 遗漏  并 承担 个别 和 连带 的 法律责任    公司 负责人 尹 宏伟  主管 会计工作 负责人 邓强 及 会计 机构 负责人  会计 主管 人员  刘丹 声明  保证 年度报告 中 财务报告 的 真实  准确  完整    所有 董事 均 已 出席 了 审议 本 报告 的 董事会 会议    公司 不 存在 对 生产 经营  财务状况 和 持续 盈利 能力 有 严重 不利 影响 的 重大 风险 因素        公司 计划 不 派 发现 金红利  不 送 红 股  不 以 公积金 转增 股本        融钰 集团股'
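
Two details of a character class like this one are easy to miss, so here is a sketch on a toy string (hypothetical, not taken from a filing). Inside `[...]` a `|` is a literal character rather than alternation, and because only the space is kept, newlines are stripped and tokens on adjacent lines can merge unless the line ends with a space (as the lines in the sample output above appear to):

```python
import re

# Toy string (hypothetical) containing a pipe, digits, punctuation and a newline
sample_text = '公司 | 2016 年\n年度报告 。'

# '|' listed inside [...] is kept as a literal character
with_pipe = re.sub('[^\u4E00-\u9FFF| ]', '', sample_text)

# Without '|' in the class, the pipe is stripped like other punctuation
without_pipe = re.sub('[^\u4E00-\u9FFF ]', '', sample_text)

# Keeping all whitespace (\s) preserves the token boundary at the newline
keep_whitespace = re.sub(r'[^\u4E00-\u9FFF\s]', '', sample_text)

print(with_pipe.split())
print(without_pipe.split())
print(keep_whitespace.split())
```

Whether the third variant is preferable depends on whether the tokenized files always end their lines with a space; that is an assumption worth checking against the data.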

In [86]:
# Split on spaces/newlines
tokens = chars_only_text.split()
tokens[:20]

['融钰',
 '集团股份',
 '有限公司',
 '年',
 '年度报告',
 '全文',
 '融钰',
 '集团股份',
 '有限公司',
 '年',
 '年度报告',
 '全文',
 '年',
 '月',
 '融钰',
 '集团股份',
 '有限公司',
 '年',
 '年度报告',
 '全文']

In [77]:
# Count how often each token appears
token_freqs = Counter(tokens)

In [87]:
token_freqs

Counter({'融钰': 272,
         '集团股份': 208,
         '有限公司': 423,
         '年': 586,
         '年度报告': 190,
         '全文': 185,
         '月': 256,
         '第一节': 2,
         '重要': 69,
         '提示': 4,
         '目录': 2,
         '和': 384,
         '释义': 5,
         '本': 154,
         '公司': 989,
         '董事会': 75,
         '监事会': 14,
         '及': 313,
         '董事': 98,
         '监事': 38,
         '高级': 43,
         '管理人员': 36,
         '保证': 11,
         '内容': 18,
         '的': 2476,
         '真实': 5,
         '准确': 4,
         '完整': 10,
         '不': 256,
         '存在': 87,
         '虚假': 1,
         '记载': 1,
         '误导性': 1,
         '陈述': 1,
         '或': 246,
         '重大': 114,
         '遗漏': 1,
         '并': 128,
         '承担': 14,
         '个别': 4,
         '连带': 1,
         '法律责任': 1,
         '负责人': 11,
         '尹': 19,
         '宏伟': 19,
         '主管': 7,
         '会计工作': 4,
         '邓强': 9,
         '会计': 90,
         '机构': 36,
         '人员': 23,
         '刘丹': 11,
     

This is exactly what we are looking for. Next, we build a frequency dictionary for every document and, along the way, collect every unique token across all documents into a token universe from which the vector embeddings can be constructed. Keep in mind that we still need to remove stop words.

In [136]:
directory = '../../China Fin Report Text/tokenized/'

all_tokens = Counter()
freq_dict = {}

# Iterate through each year
for year in range(2006, 2017):

    # Create a dictionary entry for each year
    freq_dict[year] = {}

    # Get list of document .txt files
    filenames = os.listdir(directory+str(year))

    # Iterate through each document
    for file in filenames:

        # Open the document
        with open(directory+str(year)+'/'+file) as f:
            text = f.read()

        # Same as before
        chars_only_text = re.sub('[^\u4E00-\u9FFF ]', '', text)
        tokens = chars_only_text.split()
        token_freqs = Counter(tokens)

        # Remove .txt from file to make indexing easier
        freq_dict[year][file[:-4]] = token_freqs

        # Add any new unique tokens to all_tokens (the in-place union avoids
        # rebuilding the whole Counter for every document)
        all_tokens |= token_freqs
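
A note on the union used to collect tokens: for `Counter` objects, `|` keeps the maximum count per token rather than the sum. Only the keys matter for the token universe, so this is harmless here. The sketch below uses two hypothetical toy counters to show the difference, plus a plain-set alternative for collecting the vocabulary:

```python
from collections import Counter

# Hypothetical toy counts for two documents
doc1 = Counter({'公司': 3, '董事会': 1})
doc2 = Counter({'公司': 5, '监事会': 2})

# Union keeps the maximum count per token, not the sum
union = doc1 | doc2

# Addition gives corpus-wide totals instead
total = doc1 + doc2

# If only the vocabulary is needed, a plain set avoids carrying counts at all
vocabulary = set()
for doc in (doc1, doc2):
    vocabulary.update(doc)

print(union)
print(total)
print(sorted(vocabulary))
```

If corpus-wide frequencies are ever needed (for example to drop very rare tokens), `+` or `Counter.update` would be the operation to use instead of `|`.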

In [114]:
import pickle
with open('../../freq_dict.pickle', 'wb') as handle:
    pickle.dump(freq_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [139]:
with open('alltokens.pickle', 'wb') as handle:
    pickle.dump(list(all_tokens.keys()), handle, protocol=pickle.HIGHEST_PROTOCOL)

In [161]:
with open('alltokens.pickle', 'rb') as handle:
    alltokens = pickle.load(handle)

In [163]:
# Create a sample vector (a Counter returns 0 for tokens it has not seen):
sample_vec = [token_freqs[key] for key in alltokens]
len(sample_vec) == len(alltokens)

True

The approach above takes up a lot of local space (approx. 1GB for this file). Let's try another way: instead of storing everything, group the filings by company so that each filing can be compared directly with the one before it:

In [6]:
directory = '../../China Fin Report Text/tokenized/'
files_df = pd.DataFrame()

for year in list(range(2006,2017)):

    files = os.listdir(directory+str(year))

    files_df = pd.concat([files_df, pd.DataFrame([[year]*len(files), files]).T], axis=0)

In [8]:
files_df.columns = ['year', 'file']
files_df

      year           file
0     2006  002073.SZ.txt
1     2006  000408.SZ.txt
2     2006  600820.SH.txt
3     2006  000516.SZ.txt
4     2006  000416.SZ.txt
...    ...            ...
3113  2016  300034.SZ.txt
3114  2016  300383.SZ.txt
3115  2016  300134.SZ.txt
3116  2016  603658.SH.txt


In [9]:
company_filings = {}

for company in list(set(files_df['file'])):
    company_filings[company] = files_df[files_df['file'] == company]

In [10]:
company_filings['002762.SZ.txt']

      year           file
2227  2015  002762.SZ.txt
2475  2016  002762.SZ.txt


In [11]:
def norm(vec):
    return np.sqrt(vec.dot(vec))

In [196]:
# Pick one company to use as a worked example (position 200 is arbitrary)
company = list(company_filings.keys())[200]

# Single-Company Cosine Similarity

In [12]:
sample = company_filings[company]
sample

     year           file
239  2010  002521.SZ.txt
261  2011  002521.SZ.txt
277  2012  002521.SZ.txt
207  2013  002521.SZ.txt
296  2014  002521.SZ.txt
315  2015  002521.SZ.txt
349  2016  002521.SZ.txt


In [13]:
scores = [np.nan]

for i in range(1, len(sample)):

    # Read the previous filing and the current filing
    prev_row, curr_row = sample.iloc[i-1], sample.iloc[i]

    with open(directory+str(prev_row['year'])+'/'+prev_row['file']) as f:
        file1 = f.read()

    with open(directory+str(curr_row['year'])+'/'+curr_row['file']) as f:
        file2 = f.read()

    # File 1 token counts (Chinese characters only, as before)
    chars_only_text = re.sub('[^\u4E00-\u9FFF ]', '', file1)
    token_freqs1 = Counter(chars_only_text.split())

    # File 2 token counts
    chars_only_text = re.sub('[^\u4E00-\u9FFF ]', '', file2)
    token_freqs2 = Counter(chars_only_text.split())

    # Shared vocabulary: every token that appears in either filing
    token_union = (token_freqs1 | token_freqs2).keys()

    # A Counter returns 0 for missing tokens, so no membership test is needed
    vec1 = np.array([token_freqs1[key] for key in token_union])
    vec2 = np.array([token_freqs2[key] for key in token_union])

    # Calculate Cosine Similarity between the two
    cosine_similarity = np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

    # Save to list
    scores.append(cosine_similarity)
    
# Add to sample df (assign returns a new DataFrame, which avoids pandas'
# SettingWithCopyWarning on the slice taken from files_df)
sample = sample.assign(Cosine_Similarity=scores)


In [14]:
scores

[nan,
 0.9808492800253158,
 0.9430258437302796,
 0.9776089691636185,
 0.9851754625278325,
 0.990760715959739,
 0.9916775575148704]

In [15]:
sample

     year           file  Cosine_Similarity
239  2010  002521.SZ.txt                NaN
261  2011  002521.SZ.txt           0.980849
277  2012  002521.SZ.txt           0.943026
207  2013  002521.SZ.txt           0.977609
296  2014  002521.SZ.txt           0.985175
315  2015  002521.SZ.txt           0.990761
349  2016  002521.SZ.txt           0.991678
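
The loop above repeats the same tokenize-and-count steps for both files, and every filing except the first and last is read and tokenized twice. One way to factor this out is sketched below. The helper names `token_counts`, `cosine` and `consecutive_similarities` are hypothetical and not part of the notebook, the helpers take already-loaded text rather than file paths, and the cosine is computed directly from the counts instead of from NumPy vectors (tokens missing from one document contribute zero to the dot product, so the result is the same).

```python
import math
import re
from collections import Counter

def token_counts(text):
    """Count space-separated tokens made of Chinese characters (hypothetical helper)."""
    return Counter(re.sub('[^\u4E00-\u9FFF ]', '', text).split())

def cosine(counts1, counts2):
    """Cosine similarity of two token-count mappings; NaN if either is empty."""
    # Only tokens present in both documents contribute to the dot product
    dot = sum(counts1[t] * counts2[t] for t in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return float('nan')
    return dot / (norm1 * norm2)

def consecutive_similarities(texts):
    """Similarity of each text to the one before it; the first entry is NaN."""
    counts = [token_counts(t) for t in texts]  # each text is tokenized only once
    scores = [float('nan')]
    for prev, curr in zip(counts, counts[1:]):
        scores.append(cosine(prev, curr))
    return scores

# Toy usage with hypothetical, already tokenized texts in chronological order
toy_texts = ['公司 董事会 公司', '公司 监事会', '公司 监事会']
toy_scores = consecutive_similarities(toy_texts)
print(toy_scores)
```

With helpers like these, the per-company loop would only need to read each company's files in year order and pass the texts to `consecutive_similarities`.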


# Extend to All Companies

In [16]:
# Keep track of progress
company_num = 1

for company in company_filings.keys():
    if company_num % 100 == 0:
        print(company_num)
    
    # Create a new sample df for each company
    sample = company_filings[company]
    
    scores = [np.nan]

    for i in range(1, len(sample)):

        # Read the previous filing and the current filing
        prev_row, curr_row = sample.iloc[i-1], sample.iloc[i]

        with open(directory+str(prev_row['year'])+'/'+prev_row['file']) as f:
            file1 = f.read()

        with open(directory+str(curr_row['year'])+'/'+curr_row['file']) as f:
            file2 = f.read()

        # File 1 token counts (Chinese characters only, as before)
        chars_only_text = re.sub('[^\u4E00-\u9FFF ]', '', file1)
        token_freqs1 = Counter(chars_only_text.split())

        # File 2 token counts
        chars_only_text = re.sub('[^\u4E00-\u9FFF ]', '', file2)
        token_freqs2 = Counter(chars_only_text.split())

        # Shared vocabulary: every token that appears in either filing
        token_union = (token_freqs1 | token_freqs2).keys()

        # A Counter returns 0 for missing tokens, so no membership test is needed
        vec1 = np.array([token_freqs1[key] for key in token_union])
        vec2 = np.array([token_freqs2[key] for key in token_union])

        # Calculate Cosine Similarity between the two
        cosine_similarity = np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

        # Save to list
        scores.append(cosine_similarity)

    # Add to company_filings dict (assign returns a new DataFrame, which avoids
    # pandas' SettingWithCopyWarning on the slices taken from files_df)
    company_filings[company] = sample.assign(Cosine_Similarity=scores)

    company_num += 1


100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100


In [17]:
with open('company_filings.pickle', 'wb') as handle:
    pickle.dump(company_filings, handle, protocol=pickle.HIGHEST_PROTOCOL)