<a href="https://colab.research.google.com/github/pei0217/fin_hw6_week10/blob/main/fin_hw6_week10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [3]:
from datasets import load_dataset


# Load the "sentences_allagree" configuration of the Financial PhraseBank dataset
dataset = load_dataset("takala/financial_phrasebank", "sentences_allagree", split="train")

# Preprocess: pull out the sentences and their labels
texts = [sample['sentence'] for sample in dataset]
labels = [sample['label'] for sample in dataset]  # Labels: 0 = Negative, 1 = Neutral, 2 = Positive

from transformers import BertTokenizer

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode the texts into BERT input format
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="pt")

import torch

class FinancialDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

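    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them
        # directly; wrapping them in torch.tensor() again triggers a UserWarning.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item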

dataset = FinancialDataset(encodings, labels)

from transformers import BertForSequenceClassification

# Initialize the pretrained BERT model (3-class classification)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer

# Data loader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Run the model and data on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

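# Train the model
model.train()
for epoch in range(3):  # 3 training epochs
    for batch in dataloader:
        # Move the batch to the selected device
        batch = {key: val.to(device) for key, val in batch.items()}

        # Forward pass (the model computes the loss itself because "labels" is in the batch)
        outputs = model(**batch)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Note: this is the loss of the last batch only, not the epoch average
    print(f"Epoch {epoch + 1} completed with loss: {loss.item()}")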

model.eval()
test_texts = [
    "The company's profit has increased significantly this quarter.",
    "The increase in costs negatively affected the revenue.",
    "The company's performance remained stable."
]
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors="pt").to(device)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**test_encodings)
preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()

# Map the predicted class indices to label names
label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
predicted_labels = [label_map[pred] for pred in preds]
print(predicted_labels)



The repository for takala/financial_phrasebank contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/takala/financial_phrasebank.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 completed with loss: 0.013371452689170837
Epoch 2 completed with loss: 0.07115218788385391
Epoch 3 completed with loss: 0.0007682127179577947
['Positive', 'Negative', 'Positive']
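
The printed labels come from `torch.argmax`, which hides how confident the model is in each prediction. Below is a minimal, dependency-free sketch of the softmax step that turns one row of logits into class probabilities. The helper names `softmax` and `describe` and the example logits are hypothetical and are not taken from the run above. In the notebook itself the equivalent call would be `torch.softmax(outputs.logits, dim=1)`.

```python
import math

# Same label mapping as in the notebook.
LABEL_MAP = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert one row of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label, probability) for the most likely class of one row."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


# Hypothetical logits for illustration only, NOT produced by the model above.
example_logits = [
    [-2.0, -1.0, 3.0],  # one class clearly dominates
    [0.2, 0.1, 0.3],    # nearly flat: the argmax label is a weak call
]

for row in example_logits:
    label, confidence = describe(row)
    print(f"{label}: {confidence:.2f}")
```

Both hypothetical rows have the same argmax label, but the second is close to a three-way tie. Looking at the probabilities in this way may help when judging borderline outputs.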

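
The per-epoch loss printed above is the loss of the final batch only, which may explain why it rises from epoch 1 to epoch 2. A running mean over all batches is usually a steadier signal. The sketch below is one way to track it. The `RunningMean` class and the sample values are hypothetical and are not part of the original notebook.

```python
class RunningMean:
    """Accumulate per-batch values and report their weighted mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # Weight by the batch size n so a short final batch counts less.
        self.total += value * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


# Hypothetical (loss, batch_size) pairs, NOT taken from the training run above.
batches = [(0.9, 16), (0.5, 16), (0.1, 8)]

meter = RunningMean()
for loss_value, batch_size in batches:
    meter.update(loss_value, batch_size)

print(f"mean loss: {meter.mean:.4f}")
```

In the training loop, one could create a fresh meter at the start of each epoch, call `meter.update(loss.item(), batch["labels"].size(0))` after each step, and print `meter.mean` in place of the last-batch loss.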