# Running ROUGE Evaluation for Three Models

This notebook feeds each prompt from the dataset to three models (for example, base and fine-tuned variants), computes a ROUGE score for each sample using the project's `rouge_calculator.py`, and saves the aggregated statistics.

Note: Please modify the sections marked as **TODO** according to your environment (model checkpoint paths, cache directory, whether to use CUDA, etc.).
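
As a rough illustration of what "a ROUGE score per sample, then aggregated statistics" means, here is a minimal sketch of a simplified ROUGE-1 F1 (unigram overlap) with whitespace tokenization. It is not the implementation in `rouge_calculator.py`, and the `rouge-score` package installed below tokenizes differently, so the numbers will not match exactly. The function names and the sample pairs are hypothetical.

```python
from collections import Counter
from statistics import mean


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate (simplified ROUGE-1)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def summarize(scores: list[float]) -> dict:
    """Aggregate per-sample scores into simple summary statistics."""
    return {"count": len(scores), "mean": mean(scores), "min": min(scores), "max": max(scores)}


# Hypothetical (reference, generated) pairs, for illustration only.
pairs = [
    ("the stock rose after strong earnings", "the stock rose after earnings"),
    ("analysts expect a decline next week", "shares may fall next week"),
]
per_sample = [rouge1_f1(ref, gen) for ref, gen in pairs]
summary = summarize(per_sample)
print(per_sample, summary)
```

The notebook itself delegates both steps to `rouge_calculator.evaluate`, which writes the per-sample and summary files to the configured output directory.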

## Dependencies (if needed)

If you are running this in Colab or a fresh environment, install the required dependencies first.

The cell below runs the recommended installation commands for Colab (including commonly used acceleration and evaluation libraries) and mounts Google Drive:


In [None]:
!pip install -U pip setuptools wheel
!pip install -U transformers datasets peft bitsandbytes rouge-score tqdm accelerate einops finnhub-python

from google.colab import drive
drive.mount('/content/drive')

print('Installation complete')

Collecting pip
  Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)
Collecting setuptools
  Downloading setuptools-82.0.0-py3-none-any.whl.metadata (6.6 kB)
Downloading pip-26.0.1-py3-none-any.whl (1.8 MB)
Downloading setuptools-82.0.0-py3-none-any.whl (1.0 MB)
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 75.2.0
    Uninstalling setuptools-75.2.0:
      Successfully uninstalled setuptools-75.2.0
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavio

Collecting transformers
  Downloading transformers-5.2.0-py3-none-any.whl.metadata (32 kB)
Collecting datasets
  Downloading datasets-4.5.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting finnhub-python
  Downloading finnhub_python-2.4.27-py3-none-any.whl.metadata (9.2 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-23.0.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.1 kB)
Downloading transformers-5.2.0-py3-none-any.whl (10.4 MB)
Downloading datasets-4.5.0-py3-none-any.whl (515 kB)
Downloading bits

In [None]:
import os
import sys
import json
from types import SimpleNamespace
from pprint import pprint
from google.colab import userdata
PROJECT_ROOT = '/content/juankim834'  # TODO: change to your project path
# Unpack the project archive from Drive into PROJECT_ROOT
!unzip /content/drive/MyDrive/juankim834.zip -d {PROJECT_ROOT}
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_FOR_FINGPT')
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)




Archive:  /content/drive/MyDrive/juankim834.zip
   creating: /content/juankim834/finetuned_models/
   creating: /content/juankim834/finetuned_models/deepseek-ai/
   creating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707/
  inflating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707/adapter_config.json  
  inflating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707/adapter_model.safetensors  
  inflating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707/chat_template.jinja  
   creating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707/checkpoint-39/
  inflating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707/checkpoint-39/adapter_config.json  
  inflating: /content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B

In [None]:
import os
os.chdir(PROJECT_ROOT)
!ls

finetuned_models  rouge_calculator.py	       utils
llm_test	  run_three_models_eval.ipynb


In [None]:
from llm_test import model_runner

## TODO: Configure the paths for the three models (replace the placeholders below with the actual paths in your Colab/Drive environment)
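
Before launching a long evaluation, it can help to check that every configured path exists. The sketch below is one hypothetical way to do that; `find_config_problems` is not part of the project. It assumes each fine-tuned directory is a PEFT adapter folder containing `adapter_config.json`, as the unzip output above suggests.

```python
import os


def find_config_problems(models: list[dict]) -> list[str]:
    """Return human-readable problems found in a list of model config dicts."""
    problems = []
    required_keys = ("label", "model_name", "fine_tuned_path", "cache_dir", "use_cuda")
    for m in models:
        label = m.get("label", "<unlabeled>")
        missing = [k for k in required_keys if k not in m]
        if missing:
            problems.append(f"{label}: missing keys {missing}")
        path = m.get("fine_tuned_path")
        if path is None:
            continue  # base model only, no adapter directory to check
        if not os.path.isdir(path):
            problems.append(f"{label}: fine_tuned_path does not exist: {path}")
        elif not os.path.isfile(os.path.join(path, "adapter_config.json")):
            # Assumption: the directory is a PEFT adapter checkpoint.
            problems.append(f"{label}: no adapter_config.json in {path}")
    return problems


# Hypothetical config entry pointing at a path that does not exist.
example = [{
    "label": "demo",
    "model_name": "org/model",
    "fine_tuned_path": "/nonexistent/checkpoint",
    "cache_dir": "/tmp/cache",
    "use_cuda": False,
}]
for problem in find_config_problems(example):
    print("WARNING:", problem)
```

Calling `find_config_problems(models)` after the configuration cell below would list any entries whose TODO paths still need updating.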

In [None]:
# TODO: Replace the paths below with your own
models = [
    # {
    #     'label': 'model_llama3-base',
    #     'model_name': 'meta-llama/Llama-3.1-8B-Instruct',  # base model name (HuggingFace repo name)
    #     'fine_tuned_path': None,
    #     'cache_dir': '/content/juankim834/pretrained_models',  # TODO: change to your cache directory
    #     'use_cuda': True,
    # },
    {
        'label': 'model_finetuned_qwen',
        'model_name': 'Qwen/Qwen2.5-7B-Instruct',
        'fine_tuned_path': '/content/juankim834/finetuned_models/Qwen/Qwen2.5-7B-Instruct_202602201649/checkpoint-39',  # TODO: change to the path of your fine-tuned weights
        'cache_dir': '/content/juankim834/pretrained_models',
        'use_cuda': True,
    },
    {
        'label': 'model_finetuned_llama3',
        'model_name': 'meta-llama/Meta-Llama-3.1-8B-Instruct',
        'fine_tuned_path': '/content/juankim834/finetuned_models/meta-llama/Meta-Llama-3-8B-Instruct_202602201606/checkpoint-50',  # TODO: change to the path of your fine-tuned weights
        'cache_dir': '/content/juankim834/pretrained_models',
        'use_cuda': True,
    },
    {
        'label': 'model_finetuned_deepseek',
        'model_name': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
        'fine_tuned_path': '/content/juankim834/finetuned_models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B_202602210707',  # TODO: change to the path of your fine-tuned weights
        'cache_dir': '/content/juankim834/pretrained_models',
        'use_cuda': True,
    },
]

# Evaluation hyperparameters
common_args = {
    'dataset': 'dow30-202305-202405',
    'from_remote': True,
    'split': 'test',
    'start_date': None,
    'end_date': None,
    'symbol': None,
    'max_samples': 0,
    'max_length': 2048,
    'temperature': 0.7,
    'output_dir': '/content/drive/MyDrive/rouge_results',
    'device': 'cuda:0',
    'batch_size': 12
}

print('Model configuration complete. Please double-check the settings above.')
# Read the API tokens from Colab secrets
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_FOR_FINGPT')
from huggingface_hub import login
login(token=os.environ["HF_TOKEN"])  # Implicit login sometimes fails, so log in explicitly

Model configuration example is ready. Please check and update the TODO paths.


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [None]:
# # Smoke test: confirm that each model can generate text
# import gc
# import torch
# import rouge_calculator
# # Load one sample from the dataset
# ds_list = rouge_calculator.load_dataset(common_args['dataset'], from_remote=common_args['from_remote'])
# ds = ds_list[0].get(common_args['split'])
# # Filtering Samples
# try:
#     filtered = ds.filter(lambda x: common_args['start_date'] <= x['period'] <= common_args['end_date'] and x['symbol'] == common_args['symbol'])
# except Exception:
#     filtered = ds
# if len(filtered) == 0:
#     print('No filtered samples found; using first sample from split')
#     item = ds[0]
# else:
#     item = filtered[0]
# prompt = item['prompt']
# ground_truth = item['answer']
# print('=== Ground truth preview ===')
# print(ground_truth[:500])

# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# print('Using device:', device)

# for m in models:
#     print('--- Testing model:', m['label'], '---')
#     try:
#         runner = model_runner(m['model_name'], cache_dir=m['cache_dir'], device=device, fine_tuned_path=m['fine_tuned_path'])
#         gen = runner.generate_from_one_prompt(prompt, max_length=512, temperature=0.7)
#         print('Generated (truncated 100 chars):')
#         print(gen[:100])
#     except Exception as e:
#         print('Error while testing', m['label'], e)
#     gc.collect()
#     if torch.cuda.is_available():
#         torch.cuda.empty_cache()
# gc.collect()
# if torch.cuda.is_available():
#     torch.cuda.empty_cache()


In [None]:
import gc
import torch
import rouge_calculator
from types import SimpleNamespace
results = {}
for m in models:
    print('Running eval for:', m['label'])
    args_ns = SimpleNamespace(**common_args)
    args_ns.model_name = m['model_name']
    args_ns.fine_tuned_path = m['fine_tuned_path']
    args_ns.cache_dir = m['cache_dir']
    args_ns.use_cuda = m['use_cuda']
    args_ns.output_dir = os.path.join(common_args['output_dir'], m['label'])
    os.makedirs(args_ns.output_dir, exist_ok=True)

    per_path, summary_path = rouge_calculator.evaluate(args_ns)
    print('Finished:', m['label'])
    results[m['label']] = { 'per_sample': per_path, 'summary': summary_path }
    gc.collect()  # Free RAM and CUDA memory before loading the next model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

with open(os.path.join(common_args['output_dir'], 'runs_index.json'), 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print('All runs finished. Index saved to', os.path.join(common_args['output_dir'], 'runs_index.json'))

Running eval for: model_finetuned_qwen


README.md:   0%|          | 0.00/581 [00:00<?, ?B/s]

data/train-00000-of-00001-7c4c80aa07272d(…):   0%|          | 0.00/3.57M [00:00<?, ?B/s]

(…)-00000-of-00001-28531804b005ddc6.parquet:   0%|          | 0.00/925k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1230 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/300 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1230 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/300 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/243 [00:00<?, ?B/s]

Generating and scoring:   0%|          | 0/25 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Generating and scoring: 100%|██████████| 25/25 [07:51<00:00, 18.86s/it]


Saved per-sample results to: /content/drive/MyDrive/rouge_results/model_finetuned_qwen/per_sample_rouge.json
Saved summary stats to: /content/drive/MyDrive/rouge_results/model_finetuned_qwen/summary_rouge.json
Memory usage after cleaning: 9.12 MB
Finished: model_finetuned_qwen
Running eval for: model_finetuned_llama3


Saving the dataset (0/1 shards):   0%|          | 0/1230 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/300 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

Generating and scoring:   0%|          | 0/25 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Generating and scoring:   4%|▍         | 1/25 [00:34<13:42, 34.28s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Generating and scoring:   8%|▊         | 2/25 [02:06<26:12, 68.37s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Generating and scoring:  12%|█▏        | 3/25 [03:52<31:26, 85.74s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Generating and scoring:  16%|█▌        | 4/25 [04:20<21:56, 62.67s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Generating and scoring:  20%|██        | 5/25 [06:16<27:19, 81.95s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Genera

Saved per-sample results to: /content/drive/MyDrive/rouge_results/model_finetuned_llama3/per_sample_rouge.json
Saved summary stats to: /content/drive/MyDrive/rouge_results/model_finetuned_llama3/summary_rouge.json
Memory usage after cleaning: 9.12 MB
Finished: model_finetuned_llama3
Running eval for: model_finetuned_deepseek


Saving the dataset (0/1 shards):   0%|          | 0/1230 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/300 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

Generating and scoring:   0%|          | 0/25 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating and scoring:   4%|▍         | 1/25 [00:32<12:58, 32.45s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating and scoring:   8%|▊         | 2/25 [01:18<15:34, 40.62s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating and scoring:  12%|█▏        | 3/25 [01:51<13:31, 36.89s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating and scoring:  16%|█▌        | 4/25 [02:20<11:50, 33.82s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating and scoring:  20%|██        | 5/25 [02:57<11:40, 35.02s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating and scoring:  24%|██▍       | 6/25 [03:30<10:54, 34.43s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generat

Saved per-sample results to: /content/drive/MyDrive/rouge_results/model_finetuned_deepseek/per_sample_rouge.json
Saved summary stats to: /content/drive/MyDrive/rouge_results/model_finetuned_deepseek/summary_rouge.json
Memory usage after cleaning: 9.12 MB
Finished: model_finetuned_deepseek
All runs finished. Index saved to /content/drive/MyDrive/rouge_results/runs_index.json


In [None]:
# Release the Colab runtime to save compute units.
from google.colab import runtime
runtime.unassign()

## Quick Start for Colab (CLI Examples)
Below is an example of the shell commands used to execute the complete evaluation pipeline in a Google Colab environment:

In [None]:
# Runs the evaluation in Colab with only a few commands; running the notebook cells above works just as well
# from google.colab import drive
# drive.mount('/content/drive')

# %cd /content/drive/MyDrive/juankim834

# !pip install -U transformers datasets peft bitsandbytes rouge-score tqdm accelerate

# !python rouge_calculator.py --dataset dow30-202305-202405 --from_remote \
#    --split test --start_date 2024-01-07 --end_date 2024-01-14 --symbol GS \
#    --model_name 'Qwen/Qwen2.5-7B-Instruct' --fine_tuned_path '/content/drive/MyDrive/finetuned_models/Qwen/checkpoint-39' \
#    --use_cuda --output_dir '/content/drive/MyDrive/rouge_results/model_finetuned_qwen'

In [None]:
# Load and display the index of finished runs
index_path = '/content/drive/MyDrive/rouge_results/runs_index.json'
if os.path.exists(index_path):
    with open(index_path, 'r', encoding='utf-8') as f:
        runs = json.load(f)
    pprint(runs)
else:
    print('Index file not found at', index_path)
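
To look at the scores themselves rather than just the file paths, the index can be followed to each summary file. The sketch below is a hypothetical helper, not part of the project. It relies only on the index layout written by the evaluation loop above (`{label: {'per_sample': ..., 'summary': ...}}`) and makes no assumption about the keys inside `summary_rouge.json`.

```python
import json
import os


def load_summaries(index_path: str) -> dict:
    """Map each model label to the parsed contents of its summary JSON file."""
    with open(index_path, "r", encoding="utf-8") as f:
        runs = json.load(f)
    summaries = {}
    for label, paths in runs.items():
        summary_path = paths.get("summary")
        if summary_path and os.path.exists(summary_path):
            with open(summary_path, "r", encoding="utf-8") as f:
                summaries[label] = json.load(f)
        else:
            summaries[label] = None  # summary file missing or not recorded
    return summaries


index_path = "/content/drive/MyDrive/rouge_results/runs_index.json"
if os.path.exists(index_path):
    for label, summary in load_summaries(index_path).items():
        print(label, summary)
```

A label mapped to `None` means its summary file was not found, which usually indicates that Drive is not mounted or that the run did not finish.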