<a href="https://colab.research.google.com/github/nana881023/Financial_Big_Data_Analysis/blob/main/Unit10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- 1. Set up the environment
- Make sure Hugging Face Transformers and Datasets are installed:



In [1]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

- 2. Load the Financial PhraseBank dataset
- Use the datasets library to load and inspect the dataset:

In [2]:
from datasets import load_dataset

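# Select the configuration and load the dataset
# (passing trust_remote_code=True skips the interactive prompt about the repository's custom code)
dataset = load_dataset("takala/financial_phrasebank", "sentences_allagree")

# Inspect the split names
print(dataset)  # shows the available splits (e.g. train)

# Access the first record of the train split
print(dataset['train'][0])  # {'sentence': "Text", 'label': 0 (Negative) / 1 (Neutral) / 2 (Positive)}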


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for takala/financial_phrasebank contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/takala/financial_phrasebank.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})
{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}


In [3]:
# Check which splits the dataset provides
print(f"Available splits: {dataset.keys()}")

# Check the size of the train split
print(f"Train split size: {len(dataset['train'])}")

# Inspect the record format in the train split
print(dataset['train'][0])


Available splits: dict_keys(['train'])
Train split size: 2264
{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}
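Before training, it can help to check how the three labels are distributed, since a skewed distribution is one possible reason a briefly trained classifier favours a single class. The sketch below is only an illustration: `label_distribution` and `toy_labels` are hypothetical names introduced here, and the toy values are not taken from the dataset.

```python
from collections import Counter

# Label ids as documented in the loading cell: 0 = Negative, 1 = Neutral, 2 = Positive.
LABEL_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}


def label_distribution(labels):
    """Return {label name: (count, share of total)} for a sequence of label ids."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return {
        LABEL_NAMES[label_id]: (counts[label_id], counts[label_id] / total)
        for label_id in sorted(LABEL_NAMES)
    }


# Hypothetical toy labels so the snippet runs on its own.
# In the notebook, dataset["train"]["label"] could be passed instead.
toy_labels = [1, 1, 2, 0, 1, 2, 1, 1]
for name, (count, share) in label_distribution(toy_labels).items():
    print(f"{name:<8} {count:>3} ({share:.1%})")
```
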


- 3. Preprocess the data
- Convert the text into a format the model can process:

In [4]:
from transformers import AutoTokenizer

# Use the BERT base (uncased) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a batch of sentences, truncating or padding each one to 128 tokens
def preprocess_data(example):
    return tokenizer(example['sentence'], truncation=True, padding='max_length', max_length=128)

# Apply the tokenizer to the whole dataset
encoded_dataset = dataset.map(preprocess_data, batched=True)
encoded_dataset = encoded_dataset.rename_column("label", "labels")  # use the column name the model expects
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


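To make the effect of `truncation=True, padding='max_length'` concrete, here is a minimal sketch that mimics it on plain lists of ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the ids are made up, `pad_id=0` is an assumption, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it and build the attention mask."""
    ids = list(token_ids)[:max_length]      # truncation: drop ids beyond max_length
    n_pad = max_length - len(ids)           # number of padding positions needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# Hypothetical token ids, not real BERT vocabulary ids.
short_example = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_example = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_example)
print(long_example)
```

Every sequence ends up with the same length, which is what lets the dataset be converted to fixed-shape tensors with `set_format(type="torch", ...)`.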

- 4. Build the model and the trainer
- Fine-tune a pretrained BERT model:

In [5]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the BERT model with a 3-class output layer (0 = Negative, 1 = Neutral, 2 = Positive)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load the Financial PhraseBank dataset and keep only part of the samples
dataset = load_dataset("takala/financial_phrasebank", "sentences_allagree")
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(500))  # first 500 shuffled rows
# Note: the same seed gives the same order, so these 100 rows are also in the training subset
small_eval_dataset = dataset["train"].shuffle(seed=42).select(range(100))  # first 100 shuffled rows

# Tokenize the datasets
def preprocess_function(examples):
    return tokenizer(examples["sentence"], truncation=True, padding=True, max_length=128)

train_dataset = small_train_dataset.map(preprocess_function, batched=True)
eval_dataset = small_eval_dataset.map(preprocess_function, batched=True)

# Training arguments: fewer epochs and larger batches to shorten training
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="no",  # skip checkpoints to save time
    learning_rate=2e-5,
    per_device_train_batch_size=32,  # larger batch size
    per_device_eval_batch_size=32,
    num_train_epochs=1,  # reduced to a single epoch
    logging_steps=10,
    report_to="none"
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)

# Start training
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch  Training Loss  Validation Loss
1      0.9969         0.878504


TrainOutput(global_step=16, training_loss=0.9428015947341919, metrics={'train_runtime': 396.6617, 'train_samples_per_second': 1.261, 'train_steps_per_second': 0.04, 'total_flos': 20555735760000.0, 'train_loss': 0.9428015947341919, 'epoch': 1.0})
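In the training cell, both subsets come from the same shuffle with `seed=42`, so the 100 evaluation rows are the first 100 of the 500 training rows and the validation loss is measured on data the model has already seen. A minimal sketch of a disjoint split follows; `disjoint_split_indices` is a hypothetical helper and uses Python's `random` module, so the row order will differ from the one `Dataset.shuffle` produces.

```python
import random


def disjoint_split_indices(n_rows, n_train, n_eval, seed=42):
    """Shuffle row indices once, then cut two non-overlapping slices."""
    if n_train + n_eval > n_rows:
        raise ValueError("n_train + n_eval exceeds the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    return indices[:n_train], indices[n_train:n_train + n_eval]


# 2264 is the train split size reported earlier in the notebook.
train_idx, eval_idx = disjoint_split_indices(n_rows=2264, n_train=500, n_eval=100)
print(len(train_idx), len(eval_idx))
print("overlap:", len(set(train_idx) & set(eval_idx)))
```

In the notebook, the two index lists could be passed to `dataset["train"].select(...)`. An equivalent approach that stays within the notebook's own calls is to shuffle once and select `range(500)` for training and `range(500, 600)` for evaluation.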

- 5. Model inference
- Run sentiment analysis on new financial text:

In [6]:
import torch

# Test sentences
test_texts = [
    "The company's profit has increased significantly this quarter.",
    "The increase in costs negatively affected the revenue.",
    "The company's performance remained stable."
]

# Tokenize the test sentences and move the tensors to the model's device
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors="pt").to(model.device)

# Run inference without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(**test_encodings)
preds = outputs.logits.argmax(dim=-1).cpu().numpy()

# Convert the predicted label ids to text labels
label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
predicted_labels = [label_map[int(pred)] for pred in preds]
print(predicted_labels)


['Neutral', 'Neutral', 'Neutral']
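All three sentences receive the same label, and `argmax` alone hides how close the other classes were. Converting logits to probabilities with a softmax makes that visible. The sketch below uses made-up logits (`hypothetical_logits` is not model output); in the notebook the equivalent call would be `torch.softmax(outputs.logits, dim=-1)`.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["Negative", "Neutral", "Positive"]

# Hypothetical logits for one sentence, chosen only for illustration.
hypothetical_logits = [-0.2, 0.4, 0.1]
for name, prob in zip(LABELS, softmax(hypothetical_logits)):
    print(f"{name:<8} {prob:.3f}")
```
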


- 6. Visualization and model evaluation
- The model can be evaluated with scikit-learn, which computes metrics such as accuracy and precision:

In [7]:
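from sklearn.metrics import classification_report

# Assume reference labels y_true and model predictions y_pred
y_true = [2, 0, 1]  # reference labels
y_pred = preds      # model predictions

# labels= fixes the class order; zero_division=0 reports 0.00 for classes that were never predicted
print(classification_report(y_true, y_pred, labels=[0, 1, 2],
                            target_names=["Negative", "Neutral", "Positive"], zero_division=0))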


              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         1
     Neutral       0.33      1.00      0.50         1
    Positive       0.00      0.00      0.00         1

    accuracy                           0.33         3
   macro avg       0.11      0.33      0.17         3
weighted avg       0.11      0.33      0.17         3



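The figures in the report can be reproduced by hand, which shows why Neutral has a precision of 0.33 and the other two classes score 0.00. The sketch below recomputes per-class precision and recall from the notebook's three reference labels and its all-Neutral predictions. `per_class_metrics` is a hypothetical helper, not a scikit-learn function, and it returns 0.0 wherever a denominator is zero.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall)} computed directly from the definitions."""
    report = {}
    for label in labels:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)  # times the model chose this class
        actual = sum(1 for t in y_true if t == label)     # times the class truly occurs
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


names = {0: "Negative", 1: "Neutral", 2: "Positive"}
y_true = [2, 0, 1]
y_pred = [1, 1, 1]  # the predictions shown in the inference cell: all Neutral

for label, (precision, recall) in per_class_metrics(y_true, y_pred, [0, 1, 2]).items():
    print(f"{names[label]:<8} precision={precision:.2f} recall={recall:.2f}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}")
```

With only three hand-labelled sentences, these numbers say little about the model. A held-out portion of the dataset would give a more meaningful report.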
