## Data preparation:

Load and check the A_share_list.json file to determine the company name data format.

In [1]:
import pandas as pd
import json

with open('./data/A_share_list.json', 'r', encoding='utf-8') as file:
    a_share_data = json.load(file)

a_share_df = pd.DataFrame(a_share_data)

print(a_share_df)




      name          fullname       code location        time
0     邵阳液压      邵阳维克液压股份有限公司     301079  深圳证券交易所  2021-10-19
1      同益中  北京同益中新材料科技股份有限公司     688722  上海证券交易所  2021-10-19
2     华瓷股份      湖南华联瓷业股份有限公司     001216  深圳证券交易所  2021-10-19
3      鸿富瀚    深圳市鸿富瀚科技股份有限公司     301086  深圳证券交易所  2021-10-20
4     高铁电气    中铁高铁电气装备股份有限公司     688285  上海证券交易所  2021-10-20
...    ...               ...        ...      ...         ...
4649  众信旅游      众信旅游集团股份有限公司  002707.SZ  深圳证券交易所  2014-01-23
4650  北新建材      北新集团建材股份有限公司  000786.SZ  深圳证券交易所  1997-06-06
4651  乾景园林      北京乾景园林股份有限公司  603778.SH  上海证券交易所  2015-12-31
4652  能科股份        能科科技股份有限公司  603859.SH  上海证券交易所  2016-10-21
4653  中成股份       中成进出口股份有限公司  000151.SZ  深圳证券交易所  2000-09-06

[4654 rows x 5 columns]


Load News data news.xlsx to understand the structure and content of news data.

In [2]:
import pandas as pd

news_data = pd.read_excel('./data/News.xlsx')
# 确定DataFrame的行数

# news_data = news_data.iloc[:10000]


print(news_data.head())


   NewsID                       Title  \
0       1          建设银行原董事长张恩照一审被判15年   
1       2                农行信用卡中心搬到上海滩   
2       3           外运发展：价值型蓝筹股补涨要求强烈   
3       4           胜利股份：稳步走强形成标准上升通道   
4       5  [港股快讯]恒指收市报18960点 成交467亿港元   

                                         NewsContent NewsSource  
0  　　本报记者 田雨 李京华    　　中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣...      中国证券报  
1  　　中国农业银行信用卡中心由北京搬到上海了！  　　农行行长杨明生日前在信用卡中心揭牌仪式上...       人民日报  
2  　　在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段...      杭州新希望  
3  　　胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿...       源达投资  
4  　　全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点...        全景网  


In [3]:
news_data = news_data.iloc[:10000]


In [4]:
rows, columns = news_data.shape
print(f'rows: {rows}')


rows: 10000


In [5]:
news_data.isnull().sum()

NewsID          0
Title           0
NewsContent     2
NewsSource     32
dtype: int64

## Data cleaning：

I started using it when I was doing vectorization, but I didn't use it when I was using ner

In [14]:
# Make sure all news content is of string type
news_data['NewsContent'] = news_data['NewsContent'].astype(str)
news_data['NewsContent'] = news_data['NewsContent'].astype(str).str.strip()
news_data2=news_data

In [None]:
import re
# Define a function to remove name prefixes (all non-Chinese characters starting with * and S)
def remove_all_non_chinese_prefix(text):
    # Remove the beginning '*', 'S', and 'T' until you encounter a Chinese character
    return re.sub(r"^[*ST]+", "", text)

a_share_df['name'] = a_share_df['name'].apply(remove_all_non_chinese_prefix)
a_share_df['fullname'] = a_share_df['fullname'].apply(remove_all_non_chinese_prefix)

a_share_df[['name', 'fullname']]


In [None]:
# filter out all names and fullnames starting with 'S' or '*'

special_start_regex = re.compile(r"^[S*]")

special_names_df = a_share_df[a_share_df['name'].str.match(special_start_regex)]

special_names_df[['name', 'fullname']]  



In [9]:
import torch


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 4060 Ti


## vector similarity based method

Text vectorization:
Vectorization of news data and names of listed companies

Similarity calculation:
Calculate the similarity between the news text vector and the name vector of each listed company. This is usually done by calculating the cosine similarity between vectors.

Filter news:
Based on the similarity score, a threshold is set to determine which news stories mention public companies.
If the similarity score between a news item and the name of any listed company is higher than the threshold, the news is retained; Otherwise, treat it as noise and remove it.

Bert

In [7]:

from transformers import BertTokenizer, BertModel
import torch
from torch.nn import DataParallel
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese').to(device)
model = DataParallel(model)
print("BERT is ready！")

bin D:\Anaconda3\envs\pytorch\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
BERT is ready！


In [8]:
import time
from tqdm import tqdm

#  News get BERT embedded vector
def get_bert_embedding(text):

    encoded_input = tokenizer.encode_plus(
        text,
        add_special_tokens=True,  
        max_length=512,  
        padding='max_length',  
        truncation=True,  
        return_attention_mask=True,  
        return_tensors='pt' 
    )
    encoded_input = {key: value.to(device) for key, value in encoded_input.items()}

    with torch.no_grad():
        model_output = model(**encoded_input)
    
    return model_output.last_hidden_state[:, 0, :].detach().cpu().numpy()

# Get the embed vector for each news item separately
news_embeddings_bert = []
start_time = time.time()  
for text in tqdm(news_data['NewsContent'], desc="Processing News"):
    embedding = get_bert_embedding(text)
    news_embeddings_bert.append(embedding)
end_time = time.time()  
print(f"new data cost time：{end_time - start_time:.2f}s。")

# Get an embeddings vector for each listed company name
a_share_embeddings_bert = []
start_time = time.time()  
for name in tqdm(a_share_df['name'], desc="Processing A-share Companies"):
    embedding = get_bert_embedding(name)
    a_share_embeddings_bert.append(embedding)
end_time = time.time()  
print(f"company data cost time：{end_time - start_time:.2f}s。")


Processing News: 100%|██████████| 10000/10000 [02:30<00:00, 66.55it/s]


new data cost time：150.27s。


Processing A-share Companies: 100%|██████████| 4654/4654 [01:04<00:00, 72.59it/s]

company data cost time：64.12s。





TF-IDF

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Gets the TF-IDF embedding vector
def get_tfidf_embedding(texts):
    vectorizer = TfidfVectorizer(max_features=512)  
    tfidf_matrix = vectorizer.fit_transform(texts)
    return tfidf_matrix.toarray(), vectorizer


vectorizer = TfidfVectorizer(max_features=512)

# Vectorized news data using TF-IDF
news_texts = news_data['NewsContent']  
vectorizer.fit(news_texts)

news_embeddings_tf = []
for text in news_texts:
    embedding = vectorizer.transform([text]).toarray()
    news_embeddings_tf.append(embedding)

# The TF-IDF embedding vector is obtained for each listed company name
a_share_embeddings_tf = []
for name in a_share_df['name']:
    embedding = vectorizer.transform([name]).toarray()
    a_share_embeddings_tf.append(embedding)



In [10]:
import jieba
from gensim.models import KeyedVectors
import numpy as np
from tqdm import tqdm


# Use jieba for word segmentation
news_texts = [' '.join(jieba.cut(content)) for content in tqdm(news_data['NewsContent'], desc="news")]
company_names = [' '.join(jieba.cut(name)) for name in tqdm(a_share_df['name'], desc="company")]

# Load the pre-trained Word2Vec model
model_path = './model/sgns.financial.char.bz2'
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

#Use the Word2Vec model to get the average embedding vector of text
def get_w2v_embedding(text, model):
    words = text.split()
    words = [word for word in words if word in model.key_to_index]
    if len(words) == 0:
        return np.zeros(model.vector_size).tolist()
    return model[words].mean(axis=0).tolist()


news_embeddings_w2v = []
for text in tqdm(news_texts, desc="news"):
    embedding = get_w2v_embedding(text, w2v_model)
    news_embeddings_w2v.append(embedding)


a_share_embeddings_w2v = []
for name in tqdm(company_names, desc="company"):
    embedding = get_w2v_embedding(name, w2v_model)
    a_share_embeddings_w2v.append(embedding)



news:   0%|          | 0/10000 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.310 seconds.
Prefix dict has been built successfully.
news: 100%|██████████| 10000/10000 [00:08<00:00, 1165.55it/s]
company: 100%|██████████| 4654/4654 [00:00<00:00, 62053.69it/s]
news: 100%|██████████| 10000/10000 [00:02<00:00, 4060.07it/s]
company: 100%|██████████| 4654/4654 [00:00<00:00, 64184.67it/s]


## Similarity calculation & Evaluation

cosine_similarities

Bert + cos

In [34]:
from sklearn.metrics.pairwise import cosine_similarity
news_embeddings =news_embeddings_bert
a_share_embeddings =a_share_embeddings_bert
news_embeddings = [emb.squeeze() for emb in news_embeddings]  # 将每个新闻嵌入向量压缩成一维
a_share_embeddings = [emb.squeeze() for emb in a_share_embeddings]  # 将每个公司名称嵌入向量压缩成一维

def rescale_cosine_similarities(cosine_similarities):

    min_sim = np.min(cosine_similarities)
    max_sim = np.max(cosine_similarities)
    return (cosine_similarities - min_sim) / (max_sim - min_sim)

def filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold=0.5):

    cosine_similarities = cosine_similarity(news_embeddings, a_share_embeddings)

    rescaled_similarities = rescale_cosine_similarities(cosine_similarities)

    filtered_companies = []
    for similarities in rescaled_similarities:
        company_indices = [idx for idx, sim in enumerate(similarities) if sim >= threshold]
        company_names = ', '.join(a_share_df['name'][company_indices])
        filtered_companies.append(company_names)

    filtered_news_count = len([comp for comp in filtered_companies if comp != ''])
    total_news_count = len(news_embeddings)
    filter_rate = filtered_news_count / total_news_count if total_news_count > 0 else 0

    return filtered_companies, filter_rate

filtered_companies1, filter_rate1 = filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold=0.68)
print(filter_rate1)

news_data_cos_bert=news_data2
news_data_cos_bert['Explicit_Company'] = filtered_companies1
news_data_cos_bert = news_data_cos_bert[[ 'NewsContent', 'Explicit_Company']]
news_data_cos_bert.head(5)

0.5269


Unnamed: 0,NewsContent,Explicit_Company
0,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣判，...,
1,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上表示...,
2,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段权重...,
3,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿元左...,
4,全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点，曾...,


In [41]:
news_data_cos_bert.head(1155)

Unnamed: 0,NewsContent,Explicit_Company
0,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣判，...,
1,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上表示...,
2,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段权重...,
3,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿元左...,
4,全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点，曾...,
...,...,...
1150,本报讯 先锋股份(600246)称，近日，公司董事会接到控股股东北京万通星河实业有限公司通知...,"亚康股份, 卓锦股份, 科汇股份, 财达证券, 致远新能, 晓鸣股份, 恒帅股份, 海天股份..."
1151,本报讯 S乐电(600644)称，公司于近日接到乐山市物价局有关批复文件，居民生活用电实行阶...,
1152,纽约证券交易所日前向美国证券交易监督委员会（SEC）递交申请，请求立即取消美国及非美国公司从...,
1153,本报讯 雪莱特（002076）今日披露称，该公司于2006年12月4日与丹阳市光点汽车灯具...,"开勒股份, 卓锦股份, 金沃股份, 迈拓股份, 晓鸣股份, 迎丰股份, 中辰股份, 中伟股份..."


The filter rate is based on the threshold, but no matter how I adjust the threshold, the company name checked out is not correct

tf-idf + cos

In [52]:
news_embeddings =news_embeddings_tf
a_share_embeddings =a_share_embeddings_tf
news_embeddings = [emb.squeeze() for emb in news_embeddings]  
a_share_embeddings = [emb.squeeze() for emb in a_share_embeddings]  
import numpy as np

def rescale_cosine_similarities(cosine_similarities):

    min_sim = np.min(cosine_similarities)
    max_sim = np.max(cosine_similarities)
    return (cosine_similarities - min_sim) / (max_sim - min_sim)

def filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold):

    cosine_similarities = cosine_similarity(news_embeddings, a_share_embeddings)

    rescaled_similarities = rescale_cosine_similarities(cosine_similarities)

    filtered_companies = []
    for similarities in rescaled_similarities:
        company_indices = [idx for idx, sim in enumerate(similarities) if sim >= threshold]
        company_names = ', '.join(a_share_df['name'][company_indices])
        filtered_companies.append(company_names)

    filtered_news_count = len([comp for comp in filtered_companies if comp != ''])
    total_news_count = len(news_embeddings)
    filter_rate = filtered_news_count / total_news_count if total_news_count > 0 else 0

    return filtered_companies, filter_rate

filtered_companies2, filter_rate2 = filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold=0.000000001)
print(filter_rate2)

news_data_cos_tf=news_data2
news_data_cos_tf['Explicit_Company'] = filtered_companies2

news_data_cos_tf=news_data_cos_tf[[ 'NewsContent', 'Explicit_Company']]


0.0708


In [53]:
news_data_cos_tf.head(1115)

Unnamed: 0,NewsContent,Explicit_Company
0,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣判，...,
1,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上表示...,
2,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段权重...,
3,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿元左...,
4,全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点，曾...,
...,...,...
1110,本周，惠誉评级将中国工商银行的个体评级从“D/E”上调至“D”。该次评级上调是在工行完成21...,
1111,天威保变(600550.SH) 今日公告称，其为宝硕股份（600155.SH）提供的7000...,
1112,沪指在站上2100点之后，昨日步伐稍显蹒跚。 昨日，上证指数在2167.36点高开之后...,"招商银行, 工商银行"
1113,匡志勇 白云机场 （600004.SH）今日宣布将用定向增发方式以20.3亿元价格...,


Although the correct rate of the company name is much higher than bert, the filter rate is too low, so this algorithm is also abandoned

w2v + cos

In [70]:
news_embeddings =news_embeddings_w2v
a_share_embeddings =a_share_embeddings_w2v


def rescale_cosine_similarities(cosine_similarities):

    min_sim = np.min(cosine_similarities)
    max_sim = np.max(cosine_similarities)
    return (cosine_similarities - min_sim) / (max_sim - min_sim)

def filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold):

    cosine_similarities = cosine_similarity(news_embeddings, a_share_embeddings)

    rescaled_similarities = rescale_cosine_similarities(cosine_similarities)


    filtered_companies = []
    for similarities in rescaled_similarities:
        company_indices = [idx for idx, sim in enumerate(similarities) if sim >= threshold]
        company_names = ', '.join(a_share_df['name'][company_indices])
        filtered_companies.append(company_names)


    filtered_news_count = len([comp for comp in filtered_companies if comp != ''])
    total_news_count = len(news_embeddings)
    filter_rate = filtered_news_count / total_news_count if total_news_count > 0 else 0

    return filtered_companies, filter_rate


filtered_companies3, filter_rate3 = filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold=0.81)
print(filter_rate3)

news_data_cos_w2v=news_data2
news_data_cos_w2v['Explicit_Company'] = filtered_companies3

news_data_cos_w2v=news_data_cos_w2v[[ 'NewsContent', 'Explicit_Company']]
news_data_cos_w2v.head(5)

0.4822


Unnamed: 0,NewsContent,Explicit_Company
0,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣判，...,
1,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上表示...,
2,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段权重...,"正和生态, 呈和科技, 同有科技"
3,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿元左...,"正和生态, 呈和科技, 中体产业"
4,全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点，曾...,


w2v is also not available

jaccard_similarity + bert

In [72]:
from sklearn.metrics import pairwise_distances

def calculate_jaccard_similarity(embeddings1, embeddings2):

    embeddings1_array = np.array(embeddings1)
    embeddings2_array = np.array(embeddings2)

    embeddings1_binarized = np.where(embeddings1_array > 0, 1, 0)
    embeddings2_binarized = np.where(embeddings2_array > 0, 1, 0)

    jaccard_distances = pairwise_distances(embeddings1_binarized, embeddings2_binarized, metric='jaccard')
    jaccard_similarities = 1 - jaccard_distances

    return jaccard_similarities


# jaccard_similarity

def filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold):

    jaccard_similarities = calculate_jaccard_similarity(news_embeddings, a_share_embeddings)


    filtered_companies = []
    for similarities in jaccard_similarities:
        company_indices = [idx for idx, sim in enumerate(similarities) if sim >= threshold]
        company_names = ', '.join(a_share_df['name'][company_indices])
        filtered_companies.append(company_names)

    # 计算过滤比率
    filtered_news_count = len([comp for comp in filtered_companies if comp != ''])
    total_news_count = len(news_embeddings)
    filter_rate = filtered_news_count / total_news_count if total_news_count > 0 else 0

    return filtered_companies, filter_rate

filtered_companies4, filter_rate4 = filter_news_by_rescaled_similarity(news_embeddings, a_share_embeddings, a_share_df, threshold=0.635)
print(filter_rate4)

news_data_jard_bert=news_data2
news_data_jard_bert['Explicit_Company'] = filtered_companies4

news_data_jard_bert=news_data_jard_bert[[ 'NewsContent', 'Explicit_Company']]
news_data_jard_bert.head(5)



0.4813


Unnamed: 0,NewsContent,Explicit_Company
0,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣判，...,
1,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上表示...,
2,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段权重...,呈和科技
3,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿元左...,和达科技
4,全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点，曾...,


According to the above results, we can see that the method based on text vectorization and similarity calculation is not very good, so we choose another method

## Method based on Named Entity Recognition（Selected）

1. Use NER to detect text
2. extract entities(org)
3. using

just use bert-base-chinese

In [20]:
news_data = news_data.iloc[:30]

In [21]:
from transformers import BertTokenizerFast, BertForTokenClassification, pipeline
import torch

# Load the pre-trained Chinese BERT model and tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = BertForTokenClassification.from_pretrained('bert-base-chinese')

# Creating NER Pipelines
nlp = pipeline("ner", model=model, tokenizer=tokenizer, device=0) 

def get_entities(text):
    ner_results = nlp(text)
    # extract entities and their categories 
    entities = [(entity['word'], entity['entity']) for entity in ner_results if entity['entity'] != 'LABEL_0']
    return entities
# entities = news_data['NewsContent'].apply(lambda x: get_entities(x))



for index, entities_list in enumerate(news_data['NewsContent'].apply(lambda x: get_entities(x))):

    entities_str = ', '.join([f"({entity[0]}, {entity[1]})" for entity in entities_list])
    print(f"news {index} entities: {entities_str}")




Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


news 0 entities: (本, LABEL_1), (报, LABEL_1), (记, LABEL_1), (者, LABEL_1), (田, LABEL_1), (雨, LABEL_1), (李, LABEL_1), (京, LABEL_1), (华, LABEL_1), (中, LABEL_1), (国, LABEL_1), (银, LABEL_1), (行, LABEL_1), (股, LABEL_1), (份, LABEL_1), (有, LABEL_1), (限, LABEL_1), (公, LABEL_1), (董, LABEL_1), (事, LABEL_1), (张, LABEL_1), (恩, LABEL_1), (受, LABEL_1), (贿, LABEL_1), (３, LABEL_1), (日, LABEL_1), (一, LABEL_1), (，, LABEL_1), (北, LABEL_1), (京, LABEL_1), (市, LABEL_1), (第, LABEL_1), (一, LABEL_1), (中, LABEL_1), (级, LABEL_1), (人, LABEL_1), (民, LABEL_1), (法, LABEL_1), (院, LABEL_1), (以, LABEL_1), (受, LABEL_1), (贿, LABEL_1), (张, LABEL_1), (恩, LABEL_1), (照, LABEL_1), (有, LABEL_1), (期, LABEL_1), (刑, LABEL_1), (１５, LABEL_1), (年, LABEL_1), (。, LABEL_1), (法, LABEL_1), (院, LABEL_1), (经, LABEL_1), (开, LABEL_1), (，, LABEL_1), (２０, LABEL_1), (##００, LABEL_1), (年, LABEL_1), (至, LABEL_1), (２０, LABEL_1), (##０, LABEL_1), (##４, LABEL_1), (年, LABEL_1), (期, LABEL_1), (间, LABEL_1), (，, LABEL_1), (被, LABEL_1), (告, LABEL_1), (人, LAB

It can be seen that the effect of using bert-based-chinese to directly name entity recognition is not good, and the company entity cannot be correctly extracted, so we need to find another model

In [22]:


import torch
from ltp import LTP

ltp = LTP("LTP/base")  
if torch.cuda.is_available():
    ltp.to("cuda")

output = ltp.pipeline(["中国农业银行信用卡中心由北京搬到上海了！农行行长杨明生日前在信用卡中心揭牌仪式上表示，此举标志着农行开始了信用卡业务的新里程，其信用卡中心进入一个崭新的发展阶段。杨明生表示，银行卡业务是农行的一项优质资产。目前，农行的发卡量已经超过２．４亿张，农行金穗贷记卡的各项指标持续保持快速增长势头。他说，信用卡中心迁沪是看中了上海作为国际金融中心的区位优势，业已形成的银行卡产业环境及高素质的人才市场。    　　农行信用卡中心２００３年创建于北京，以金穗贷记卡为主营业务，具有全行信用卡产品研发中心、业务处理中心和客户服务中心的职能。据悉，农行信用卡中心已完成了所有搬迁工作并进入正常运转。"], tasks=["cws", "pos", "ner"])
print(output.ner)  



[[('Ni', '中国农业'), ('Ns', '北京'), ('Ns', '上海'), ('Nh', '杨明生'), ('Nh', '杨明生'), ('Ns', '上海'), ('Ns', '北京'), ('Ni', '中心'), ('Ni', '农行')]]


In [23]:
index = 2
combined_text = news_data["Title"][index] + " " + news_data["NewsContent"][index]
output = ltp.pipeline(combined_text, tasks=["cws", "pos", "ner"])
# output = ltp.pipeline(news_data["NewsContent"][8], tasks=["cws", "pos", "ner"])
print(output.ner)

[('Ns', '德国'), ('Ns', '中国')]


In [24]:
from Levenshtein import distance as levenshtein_distance
from ltp import LTP

# 初始化 LTP 模型
ltp = LTP("LTP/base")
if torch.cuda.is_available():
    ltp.to("cuda")
# Iterate over the DataFrame and add name and fullname to the vocabulary
for _, row in a_share_df.iterrows():
    ltp.add_word(row['name'], freq=15)
    ltp.add_word(row['fullname'], freq=5)
company_names = list(set(a_share_df['name'].tolist() + a_share_df['fullname'].tolist()))

def print_explicit_companies(news_data, company_names, threshold=2):
    for index, row in news_data.iterrows():
        content = row['NewsContent'].strip()
        output = ltp.pipeline([content], tasks=["cws", "pos", "ner"])  
        ner_entities = [entity[1] for entity in output.ner[0] if entity[0] == 'Ni']  

        explicit_companies = []

        for entity in ner_entities:
            if entity in company_names:
                explicit_companies.append(entity)
            elif len(entity) >= 3:
                for company in company_names:
                    if entity in company or company in entity:
                        levenshtein_dist = levenshtein_distance(entity, company)
                        if levenshtein_dist <= threshold:
                            explicit_companies.append(company)

        if explicit_companies:
            print(f"news {index}: Companies that are explicitly mentioned - {', '.join(set(explicit_companies))}")
        else:
            print(f"news {index}: No companies were mentioned")

print_explicit_companies(news_data, company_names)






news 0: No companies were mentioned
news 1: Companies that are explicitly mentioned - 农业银行
news 2: No companies were mentioned
news 3: Companies that are explicitly mentioned - 生物
news 4: No companies were mentioned
news 5: No companies were mentioned
news 6: No companies were mentioned
news 7: Companies that are explicitly mentioned - 南风, 盐湖
news 8: No companies were mentioned
news 9: Companies that are explicitly mentioned - 招商局能源运输股份有限公司, 招商轮船
news 10: No companies were mentioned
news 11: No companies were mentioned
news 12: No companies were mentioned
news 13: No companies were mentioned
news 14: No companies were mentioned
news 15: No companies were mentioned
news 16: No companies were mentioned
news 17: No companies were mentioned
news 18: No companies were mentioned
news 19: No companies were mentioned
news 20: No companies were mentioned
news 21: No companies were mentioned
news 22: Companies that are explicitly mentioned - 中国石化, 中国联通
news 23: No companies were mentioned
news 2

### **RaNER**(Selected)

In [25]:
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

ner_pipeline = pipeline(Tasks.named_entity_recognition, 'damo/nlp_raner_named-entity-recognition_chinese-base-news')
import re

def find_sentence_boundary(text, max_length=500):
    boundary_punctuations = r"[。！？!?.]"
    for i in range(max_length, 0, -1):
        if re.match(boundary_punctuations, text[i]):
            return i + 1
    return -1

def split_text(text, max_length=500):
    segments = []
    while len(text) > max_length:
        boundary = find_sentence_boundary(text, max_length)
        if boundary == -1:
            # 如果没有找到句子边界，使用max_length作为分割点
            boundary = max_length
        segments.append(text[:boundary])
        text = text[boundary:]
    segments.append(text)
    return segments


def process_text(text):
    # The text is divided and NER processed
    segments = split_text(text)
    result = []
    for segment in segments:
        if segment.strip():
            ner_result = ner_pipeline(segment)
            if 'output' in ner_result:
                result.extend(ner_result['output'])
            else:
                pass
    return result



# {'output': [{'type': 'PER', 'start': 0, 'end': 2, 'span': '国正'}]}

2023-12-01 11:59:35,144 - modelscope - INFO - PyTorch version 2.1.1+cu118 Found.
2023-12-01 11:59:35,146 - modelscope - INFO - Loading ast index from C:\Users\Administrator\.cache\modelscope\ast_indexer
2023-12-01 11:59:35,359 - modelscope - INFO - Loading done! Current index file version is 1.9.5, with md5 9d857e0795de1892e211da812c513b75 and a total number of 945 components indexed
2023-12-01 11:59:38,996 - modelscope - INFO - initiate model from C:\Users\Administrator\.cache\modelscope\hub\damo\nlp_raner_named-entity-recognition_chinese-base-news
2023-12-01 11:59:38,996 - modelscope - INFO - initiate model from location C:\Users\Administrator\.cache\modelscope\hub\damo\nlp_raner_named-entity-recognition_chinese-base-news.
2023-12-01 11:59:39,001 - modelscope - INFO - initialize model from C:\Users\Administrator\.cache\modelscope\hub\damo\nlp_raner_named-entity-recognition_chinese-base-news
2023-12-01 11:59:39,603 - modelscope - INFO - head has no _keys_to_ignore_on_load_missing
2023

In [26]:
ner_pipeline.device

device(type='cuda', index=0)

In [27]:
from Levenshtein import distance as levenshtein_distance

company_names = list(set(a_share_df['name'].tolist() + a_share_df['fullname'].tolist()))

# duplicate removal
company_names = list(set(company_names))

threshold = 15

# Walk through each column of news content
for content in news_data['NewsContent']:
    ner_results = process_text(content)
    # Extract the organization name from the NER output (contents of 'type': 'ORG')
    ner_entities = [entity['span'] for entity in ner_results if entity['type'] == 'ORG']

    # Collects entities that match directly
    direct_matches = set()
    for entity in ner_entities:
        if entity in company_names:
            direct_matches.add(entity)

    # If there is a direct match, print the match result
    if direct_matches:
        for match in direct_matches:
            print(f"match: {match}")
    else:
        # If there is no direct match, the Levenshtein distance is calculated
        for entity in ner_entities:
            if len(entity) >= 3:
                for company in company_names:
                    if entity in company or company in entity: 
                        levenshtein_dist = levenshtein_distance(entity, company)
                        if levenshtein_dist <= threshold:
                            print(f"entity '{entity}' similar to company '{company}' ，Levenshtein {levenshtein_dist}")
                            break

    print("end")  




match: 中国建设银行股份有限公司
end
entity '中国农业银行信用卡中心' similar to company '农业银行' ，Levenshtein 7
end
match: 外运发展
match: 中国国航
end
entity '胜利股份（000407）公司子公司' similar to company '胜利股份' ，Levenshtein 13
end
end
end
end
match: 冠农股份
end
match: 博汇纸业
end
match: 中金公司
match: 招商局能源运输股份有限公司
match: 招商轮船
end
entity '杭汽轮' similar to company '杭汽轮B' ，Levenshtein 1
end
match: 山东黄金
match: 中国银行
match: 冀东水泥
match: 工商银行
match: 航天机电
match: 三友化工
match: 中金黄金
match: 保利地产
match: 天山股份
match: 祁连山
match: 中信证券
match: 大秦铁路
match: 鞍钢股份
match: 武钢股份
match: 新兴铸管
end
end
end
end
end
end
end
end
end
end
end
match: 中国石化
match: 迈瑞医疗
match: 中国联通
end
end
end
end
match: 中工国际
end
match: 中国电信
end
end
end


It is found that the use of two words le distance effect is not very good, so add some commonly used listed company abbreviations

In [6]:
company_abbreviations ={'中国银行': '中行', '贵州茅台': '茅台', '中国石化': '中石化', '中国石油': '中石油', '中国平安': '平安', '中国人寿': '中人寿', '中国太保': '太保', '中国联通': '联通', '中国铁建': '铁建', '中国建筑': '中建', '中国国航': '国航', '中国中车': '中车', '中国国旅': '国旅', '中国神华': '神华', '中国核电': '核电', '中国黄金': '黄金', '中国铝业': '铝业', '万科A': '万科A', '深振业A': '深振业A', '神州高铁': '神州高铁', '中国宝安': '中国宝安', '深物业A': '深物业A', '南玻A': '南玻A', '沙河股份': '沙河股份', '深康佳A': '深康佳A', '深中华A': '深中华A', '深粮控股': '深粮控股', '深华发A': '深华发A', '深科技': '深科技', '特力A': '特力A', '深圳能源': '深圳能源', '深深房A': '深深房A', '富奥股份': '富奥股份', '大悦城': '大悦城', '深桑达A': '深桑达A', '神州数码': '神州数码', '中国天楹': '中国天楹', '华联控股': '华联控股', '深南电A': '深南电A', '比亚迪': '比亚迪', '海康威视': '海康', '宁德时代': '宁德', '大华股份': '大华', '恒瑞医药': '恒瑞', '药明康德': '药明', '伊利股份': '伊利', '洋河股份': '洋河', '海天味业': '海天', '泸州老窖': '泸州', '顺丰控股': '顺丰', '中兴通讯': '中兴', '海尔智家': '海尔', '长城汽车': '长城', '广汽集团': '广汽', '东风汽车': '东风', '上汽集团': '上汽', '长江电力': '长电', '华夏银行': '华夏', '兴业银行': '兴业', '浦发银行': '浦发'}


In [29]:
from Levenshtein import distance as levenshtein_distance

company_names = list(set(a_share_df['name'].tolist() + a_share_df['fullname'].tolist()))


# Add these short names from the existing company_names list
company_names.extend(list(company_abbreviations.values()))

# duplicate removal
company_names = list(set(company_names))

threshold = 1500

# Walk through each column of news content
for content in news_data['NewsContent']:
    ner_results = process_text(content)
    # Extract the organization name from the NER output (contents of 'type': 'ORG')
    ner_entities = [entity['span'] for entity in ner_results if entity['type'] == 'ORG']

    # Collects entities that match directly
    direct_matches = set()
    for entity in ner_entities:
        if entity in company_names:
            direct_matches.add(entity)

    # If there is a direct match, print the match result
    if direct_matches:
        for match in direct_matches:
            print(f"match: {match}")
    else:
        # If there is no direct match, the Levenshtein distance is calculated
        for entity in ner_entities:
            if len(entity) >= 3:
                for company in company_names:
                    if entity in company or company in entity: 
                        levenshtein_dist = levenshtein_distance(entity, company)
                        if levenshtein_dist <= threshold:
                            print(f"entity '{entity}' similar to company '{company}' ，Levenshtein {levenshtein_dist}")
                            break

    print("end")  






match: 中国建设银行股份有限公司
end
entity '中国农业银行信用卡中心' similar to company '农业银行' ，Levenshtein 7
end
match: 外运发展
match: 中国国航
end
entity '胜利股份（000407）公司子公司' similar to company '胜利股份' ，Levenshtein 13
end
match: 中行
end
end
end
match: 冠农股份
end
match: 博汇纸业
end
match: 中金公司
match: 招商局能源运输股份有限公司
match: 招商轮船
end
entity '杭汽轮' similar to company '杭汽轮B' ，Levenshtein 1
end
match: 山东黄金
match: 中国银行
match: 冀东水泥
match: 工商银行
match: 航天机电
match: 三友化工
match: 中金黄金
match: 保利地产
match: 天山股份
match: 祁连山
match: 中信证券
match: 大秦铁路
match: 鞍钢股份
match: 武钢股份
match: 新兴铸管
end
end
end
end
end
end
end
end
end
end
end
match: 中国石化
match: 迈瑞医疗
match: 中国联通
end
end
end
end
match: 中工国际
end
match: 中国电信
end
end
end


In [30]:
import pandas as pd
from Levenshtein import distance as levenshtein_distance
from tqdm import tqdm
import warnings


company_names = list(set(a_share_df['name'].tolist() + a_share_df['fullname'].tolist()))

company_names.extend(list(company_abbreviations.values()))

company_names = list(set(company_names))

threshold = 100
warnings.filterwarnings("ignore", category=FutureWarning)


deleted_news_ids = []
news_data['Explicit_Company'] = ''


for index, row in tqdm(news_data.iterrows(), total=news_data.shape[0], desc="Processing"):
    content = row['NewsContent']
    ner_results = process_text(content) 

    ner_entities = [entity['span'] for entity in ner_results if entity['type'] == 'ORG']
    explicit_companies = []

    for entity in ner_entities:
        if entity in company_names:
            # If the entity is short, it is converted to the full name
            full_name = [key for key, value in company_abbreviations.items() if value == entity]
            explicit_companies.extend(full_name if full_name else [entity])
        elif len(entity) >= 3:
            for company in company_names:
                if entity in company or company in entity:
                    levenshtein_dist = levenshtein_distance(entity, company)
                    if levenshtein_dist <= threshold:
                        # If the entity is similar to the company name, convert to the full name
                        full_name = [key for key, value in company_abbreviations.items() if value == company]
                        explicit_companies.extend(full_name if full_name else [company])

    if explicit_companies:
        #Add explicitly mentioned company names to the new column, separated by commas 
        news_data.at[index, 'Explicit_Company'] = ', '.join(set(explicit_companies))
    else:
        deleted_news_ids.append(row['NewsID'])
        news_data.drop(index, inplace=True)

news_data = news_data[['NewsID', 'NewsContent', 'Explicit_Company']]




# news_data.to_csv('Task1.csv', index=False)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  news_data['Explicit_Company'] = ''
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  news_data.drop(index, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  news_data.drop(index, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  news

In [31]:
# news_data.to_csv('remaining_news_data_1.csv', index=False)

Merge folders that are run separately

In [32]:
# import pandas as pd
# 
# combined_df = pd.DataFrame()
# 
# for i in range(1, 11):
#     filename = f'./data_new/remaining_news_data_{i}.csv'
# 
#     temp_df = pd.read_csv(filename)
# 
#     combined_df = pd.concat([combined_df, temp_df])
# 
# df=combined_df


or To make sure it works, use the entire news

In [5]:
df=news_data

After the check, it was found that duplicate entries were matched because the abbreviation was added. Therefore, the search was performed again to remove fullname and abbreviation and retain name

In [2]:
import json


file_path = './data/A_share_list.json'
with open(file_path, 'r', encoding='utf-8') as file:
    json_data = json.load(file)

name_fullname_map = {company['fullname']: company['name'] for company in json_data}
name_set = set(name_fullname_map.values())



def process_explicit_company(row):
    companies = set(row['Explicit_Company'].split(', '))
    valid_companies = set()
    for company in companies:
        if company in name_set:
            valid_companies.add(company)
        elif company in name_fullname_map:
            valid_companies.add(name_fullname_map[company])

    return ', '.join(valid_companies) if valid_companies else None


df['Explicit_Company'] = df.apply(process_explicit_company, axis=1)
df.dropna(subset=['Explicit_Company'], inplace=True)

df.to_excel('./Task1Q1.xlsx', index=False)
df

Unnamed: 0,NewsID,NewsContent,Explicit_Company,label
0,1,本报记者 田雨 李京华 中国建设银行股份有限公司原董事长张恩照受贿案３日一审宣判，...,建设银行,0
1,2,中国农业银行信用卡中心由北京搬到上海了！ 农行行长杨明生日前在信用卡中心揭牌仪式上表示...,农业银行,1
2,3,在新基金快速发行以及申购资金回流的情况下，市场总体上呈现资金流动性过剩格局，考虑到现阶段权重...,"外运发展, 中国国航",1
3,4,胜利股份（000407）公司子公司填海造地2800亩，以青岛的地价估算，静态价值在10亿元左...,胜利股份,1
4,5,全景网11月30日讯 外围股市造好，带动港股今早造好，恒指高开后反覆上升，最高升252点，曾...,中国银行,1
...,...,...,...,...
489884,1037031,每经AI快讯，有投资者在投资者互动平台提问：请问公司目前有没有电解槽产能，规划情况能否详细介...,亿华通,0
489885,1037032,依米康（SZ 300249，收盘价：10.38元）发布公告称，2023年10月12日，依米康...,"依米康, 中泰证券",1
489886,1037033,天风证券10月13日发布研报称，给予中核科技（000777.SZ，最新价：13.03元）买入...,"天风证券, 中核科技",1
489887,1037034,有投资者提问：抗癌药CPT获批后，公司是否应该按照股权协议继续收购沙东股权，适应症为MM的C...,海特生物,1
