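## Introduction to FinBERT
FinBERT is a BERT language model pretrained on an English financial corpus of 4.9 billion words. The corpus draws on three sources:

- Corporate reports (10-K and 10-Q): 2.5 billion words
- Earnings call transcripts: 1.3 billion words
- Analyst reports: 1.1 billion words

The FinBERT developers fine-tuned the pretrained model on several financial NLP tasks and report that it outperforms traditional machine learning models, deep learning models, and fine-tuned BERT models on each of them. All fine-tuned FinBERT models are publicly hosted on Huggingface 🤗. The currently supported tasks are **sentiment analysis, ESG classification, and forward-looking statement (FLS) classification**.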


<br>


### FinBERT models

Specifically, FinBERT provides the following models:

- [FinBERT-Pretrained](https://huggingface.co/yiyanghkust/finbert-pretrain): the FinBERT model pretrained on large-scale financial text.
- [FinBERT-Sentiment](https://huggingface.co/yiyanghkust/finbert-tone): for sentiment classification.
- [FinBERT-ESG](https://huggingface.co/yiyanghkust/finbert-esg): for ESG classification.
- [FinBERT-FLS](https://huggingface.co/yiyanghkust/finbert-fls): for forward-looking statement (FLS) classification.
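
The three fine-tuned models are loaded in the same way and differ only in their model ID and number of labels, so a small lookup table can be convenient. The following is a minimal sketch, not part of the FinBERT project: the names `MODEL_IDS`, `NUM_LABELS`, and `load_classifier` are hypothetical, while the model IDs and label counts are the ones used in this tutorial.

```python
# Model IDs taken from the Hugging Face links in the list above.
MODEL_IDS = {
    "pretrain": "yiyanghkust/finbert-pretrain",
    "sentiment": "yiyanghkust/finbert-tone",
    "esg": "yiyanghkust/finbert-esg",
    "fls": "yiyanghkust/finbert-fls",
}

# Number of output classes per fine-tuned task, as used in this tutorial.
NUM_LABELS = {"sentiment": 3, "esg": 4, "fls": 3}


def load_classifier(task):
    """Hypothetical helper: build a text-classification pipeline for a task.

    transformers is imported lazily so the lookup tables can be used
    without the library installed. The first call downloads the model.
    """
    from transformers import BertTokenizer, BertForSequenceClassification, pipeline

    name = MODEL_IDS[task]
    model = BertForSequenceClassification.from_pretrained(name, num_labels=NUM_LABELS[task])
    tokenizer = BertTokenizer.from_pretrained(name)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

With this helper, `load_classifier("esg")` would return a pipeline equivalent to the one built by hand in the ESG section.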

<br>


### Environment setup

```
pip install transformers==4.18.0
```

The transformers version used in this walkthrough can be checked with:

```
import transformers
transformers.__version__
```

Run

```
4.18.0
```
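
If the results below need to be reproduced exactly, it may help to confirm the installed version before downloading any model. This is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

EXPECTED = "4.18.0"  # the version pinned in the install command above


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
elif found != EXPECTED:
    print(f"transformers {found} found; this walkthrough used {EXPECTED}")
else:
    print(f"transformers {found} matches the walkthrough")
```

A different version will usually still work, but scores may differ slightly from the ones shown here.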

<br><br>


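## 1. Sentiment analysis

The sentiment of financial text can shape the views and opinions of managers, information intermediaries, and investors, so analyzing it is valuable. FinBERT-Sentiment is a FinBERT model fine-tuned on 10,000 manually annotated sentences from analyst reports on S&P 500 companies.

> Fine-tuning is a deep learning technique: starting from an existing pretrained language model, training continues on a small amount of text from a new domain or task, producing a model adapted to that setting.

- **Input**: financial text.
- **Output**: Positive, Neutral or Negative.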
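
Each prediction consists of one label and a score. Assuming the pipeline's default post-processing for single-label classification (a softmax over the model's raw outputs, keeping the top class), the relationship can be sketched as follows. The label order and logit values here are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order and logits, for illustration only.
labels = ["Neutral", "Positive", "Negative"]
logits = [0.5, 4.0, -2.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Same shape as one entry of the pipeline output: top label and its probability.
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

Under this reading, a score close to 1.0 means the other two classes received almost no probability mass.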



```
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

# The first run downloads the FinBERT model, so it takes a while
senti_finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
senti_tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
senti_nlp = pipeline("text-classification", model=senti_finbert, tokenizer=senti_tokenizer)
```


```
# Text to analyze
senti_results = senti_nlp(['growth is strong and we have plenty of liquidity.',
                           'there is a shortage of capital, and we need extra financing.',
                           'formulation patents might protect Vasotec to a limited extent.'])
senti_results
```

Run

```
[{'label': 'Positive', 'score': 1.0},
 {'label': 'Negative', 'score': 0.9952379465103149},
 {'label': 'Neutral', 'score': 0.9979718327522278}]
```
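
For downstream analysis it is often handy to collapse each prediction into a single signed number. The sketch below works on the output shown above; the sign mapping `SIGN` and the helper `signed_score` are hypothetical conventions, not something FinBERT defines.

```python
# Output copied from the sentiment example above.
senti_results = [
    {'label': 'Positive', 'score': 1.0},
    {'label': 'Negative', 'score': 0.9952379465103149},
    {'label': 'Neutral', 'score': 0.9979718327522278},
]

# Hypothetical convention: positive = +1, neutral = 0, negative = -1.
SIGN = {'Positive': 1.0, 'Neutral': 0.0, 'Negative': -1.0}


def signed_score(pred):
    """Map one prediction to a value in [-1, 1], weighted by its score."""
    return SIGN[pred['label']] * pred['score']


signed = [signed_score(p) for p in senti_results]
mean_tone = sum(signed) / len(signed)  # average tone across the sentences
print(signed, mean_tone)
```

Averaging the signed scores over all sentences of a document gives one rough tone measure per document; other aggregations are equally defensible.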


<br><br>

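## 2. ESG classification
ESG analysis helps investors assess a company's long-term sustainability and identify related risks. FinBERT-ESG is a FinBERT model fine-tuned on 2,000 manually annotated sentences from companies' ESG reports and annual reports.

- **Input**: financial text.
- **Output**: Environmental, Social, Governance or None.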

```
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

esg_finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-esg', num_labels=4)
esg_tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-esg')
esg_nlp = pipeline("text-classification", model=esg_finbert, tokenizer=esg_tokenizer)
```


```
esg_results = esg_nlp(['Managing and working to mitigate the impact our operations have on the environment is a core element of our business.',
                       'Rhonda has been volunteering for several years for a variety of charitable community programs.',
                       "Cabot's annual statements are audited annually by an independent registered public accounting firm.",
                       'As of December 31, 2012, the 2011 Term Loan had a principal balance of $492.5 million.'])

esg_results
```

Run

```
[{'label': 'Environmental', 'score': 0.9805498719215393},
 {'label': 'Social', 'score': 0.9906041026115417},
 {'label': 'Governance', 'score': 0.6738430857658386},
 {'label': 'None', 'score': 0.9960240125656128}]
```
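
The Governance prediction above has a noticeably lower score (about 0.67) than the others. One way to handle such cases is to accept a label only above a confidence threshold. This is a minimal sketch on the output shown above; the threshold value and the `confident_label` helper are hypothetical choices, not a recommendation from the FinBERT authors.

```python
# Output copied from the ESG example above.
esg_results = [
    {'label': 'Environmental', 'score': 0.9805498719215393},
    {'label': 'Social', 'score': 0.9906041026115417},
    {'label': 'Governance', 'score': 0.6738430857658386},
    {'label': 'None', 'score': 0.9960240125656128},
]

THRESHOLD = 0.8  # hypothetical cut-off; tune it on your own labeled data


def confident_label(pred, threshold=THRESHOLD, fallback='Uncertain'):
    """Keep the predicted label only if its score reaches the threshold."""
    return pred['label'] if pred['score'] >= threshold else fallback


labels = [confident_label(p) for p in esg_results]
print(labels)
# ['Environmental', 'Social', 'Uncertain', 'None']
```

Sentences marked `Uncertain` can then be reviewed by hand or excluded from aggregate statistics.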

<br><br>

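## 3. FLS classification

**Forward-looking statements (FLS)** tell investors what managers believe and expect about the company's future events or results. Identifying forward-looking statements in corporate reports can help investors with financial analysis. FinBERT-FLS is a FinBERT model fine-tuned on 3,500 manually annotated sentences from the Management Discussion and Analysis sections of annual reports of Russell 3000 companies.

- **Input**: financial text.
- **Output**: Specific FLS, Non-specific FLS, or Not FLS.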

```
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

fls_finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-fls', num_labels=3)
fls_tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-fls')

fls_nlp = pipeline("text-classification", model=fls_finbert, tokenizer=fls_tokenizer)
```


```
fls_results = fls_nlp(['we expect the age of our fleet to enhance availability and reliability due to reduced downtime for repairs.',
                       'on an equivalent unit of production basis, general and administrative expenses declined 24 percent from 1994 to $.67 per boe.',
                       'we will continue to assess the need for a valuation allowance against deferred tax assets considering all available evidence obtained in future reporting periods.'])

fls_results
```

Run

```
[{'label': 'Specific FLS', 'score': 0.7727874517440796},
 {'label': 'Not FLS', 'score': 0.9905241131782532},
 {'label': 'Non-specific FLS', 'score': 0.975904107093811}]
```
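
A common next step is to summarize how much of a text is forward-looking. The sketch below counts labels in the output shown above and computes the share of sentences classified as either kind of FLS; treating both FLS classes as "forward-looking" is an assumption made here for illustration.

```python
from collections import Counter

# Output copied from the FLS example above.
fls_results = [
    {'label': 'Specific FLS', 'score': 0.7727874517440796},
    {'label': 'Not FLS', 'score': 0.9905241131782532},
    {'label': 'Non-specific FLS', 'score': 0.975904107093811},
]

# Count how many sentences fall into each class.
counts = Counter(p['label'] for p in fls_results)

# Assumption: both 'Specific FLS' and 'Non-specific FLS' count as forward-looking.
forward_looking = sum(n for label, n in counts.items() if label != 'Not FLS')
share = forward_looking / len(fls_results)

print(counts)
print(f"forward-looking share: {share:.2f}")
```

Applied to every sentence of an MD&A section, the same calculation would give a per-document measure of how forward-looking the disclosure is.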

<br><br>

## Documentation and citation

- GitHub repository: https://github.com/yya518/FinBERT

<br>


```
@misc{yang2020finbert,
    title={FinBERT: A Pretrained Language Model for Financial Communications},
    author={Yi Yang and Mark Christopher Siy UY and Allen Huang},
    year={2020},
    eprint={2006.08097},
    archivePrefix={arXiv},
    }
```
