# Paper Reproduction Launcher (Smoke Run)

This Colab-friendly notebook drives the existing CLI scripts in `summarization/` to run very small LED and Llama jobs. It uses the `train_last_100.json` / `valid_last_100.json` splits so we can exercise the full code paths quickly before launching the long runs described in the paper.
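For reference, a `*_last_100.json` subset like the ones named above could be produced along these lines. This is a minimal sketch, not the repository's own tooling: it assumes the split files hold one JSON object per line (JSON Lines), and `write_last_n` is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_last_n(src, dst, n=100):
    """Copy the last n non-empty JSON Lines records from src to dst.

    Assumption: src stores one JSON object per line.
    Returns the number of records written.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    lines = [ln for ln in Path(src).read_text(encoding="utf-8").splitlines() if ln.strip()]
    tail = lines[-n:]
    for ln in tail:
        json.loads(ln)  # fail early if a record is not valid JSON
    Path(dst).write_text("\n".join(tail) + "\n", encoding="utf-8")
    return len(tail)
```

If the source file has fewer than `n` records, all of them are copied.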

In [1]:
# @title Sync the local repository from Google Drive (no git clone)
from pathlib import Path
import os
import importlib.util

COLAB = importlib.util.find_spec("google.colab") is not None
if COLAB:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive', force_remount=True)
    REPO_IN_DRIVE = Path('/content/drive/Othercomputers/My Mac/patient_summaries_with_llms/data/')  # @param {type:"string"}
    if not REPO_IN_DRIVE.exists():
        raise FileNotFoundError(f"Upload/sync the local repo to {REPO_IN_DRIVE} first.")
    TARGET_DIR = Path('/content/patient_summaries_with_llms')
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    os.system(f"rsync --progress -a --delete '{REPO_IN_DRIVE}/summarization' '{TARGET_DIR}/'")
    os.system(f"rsync --progress -a --delete '{REPO_IN_DRIVE}/data' '{TARGET_DIR}/'")
    os.system(f"rsync --progress -a --delete '{REPO_IN_DRIVE}/requirements.txt' '{TARGET_DIR}/'")
    os.system(f"rsync --progress -a --delete '{REPO_IN_DRIVE}/requirements-llama.txt' '{TARGET_DIR}/'")
    %cd /content/patient_summaries_with_llms
else:
    print("Running outside Colab; using the current working directory.")
    TARGET_DIR = Path.cwd()

Mounted at /content/drive
/content/patient_summaries_with_llms
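The `os.system` calls in the sync cell discard rsync's exit status, so a failed copy would go unnoticed. A minimal sketch of a stricter variant, assuming `rsync` is available on the PATH; `rsync_cmd` and `run_checked` are hypothetical helper names, not part of the repository:

```python
import subprocess
import sys


def rsync_cmd(source, target_dir):
    """Build the argument list for mirroring source into target_dir."""
    # Passing a list (no shell) avoids quoting problems with paths such as "My Mac".
    return ["rsync", "-a", "--delete", str(source), f"{target_dir}/"]


def run_checked(cmd):
    """Run cmd and raise subprocess.CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)


# Example (not executed here):
# run_checked(rsync_cmd(REPO_IN_DRIVE / "summarization", TARGET_DIR))
```

With `check=True`, the cell stops at the first failed copy instead of continuing with a partially synced tree.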


In [1]:
# @title Install base dependencies shared by LED + evaluation
%pip install -q -r requirements.txt

ERROR: Could not find a version that satisfies the requirement torch==2.3.1+cu121 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0, 2.9.0, 2.9.1)
ERROR: No matching distribution found for torch==2.3.1+cu121
Note: you may need to restart the kernel to use updated packages.
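The error above indicates that `requirements.txt` pins `torch==2.3.1+cu121`, while the index that pip queried only lists the plain `2.3.1`. Pins carrying a local version label such as `+cu121` are typically served from PyTorch's own wheel index (for example via `--extra-index-url https://download.pytorch.org/whl/cu121`), so that is a likely cause, though not confirmed here. The sketch below only detects such pins in a requirements file; `find_local_version_pins` is a hypothetical helper and the sample lines are illustrative.

```python
import re

# Matches "name==version+label", e.g. "torch==2.3.1+cu121".
_LOCAL_PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#+]+)\+([^\s;#]+)")


def find_local_version_pins(lines):
    """Return (package, version, local_label) for each pin that has a '+label' suffix."""
    pins = []
    for line in lines:
        match = _LOCAL_PIN.match(line)
        if match:
            pins.append(match.groups())
    return pins


# Illustrative input, not the actual contents of requirements.txt.
sample = ["torch==2.3.1+cu121", "transformers==4.38.2", "# a comment"]
print(find_local_version_pins(sample))
```

Any pin reported this way needs either an extra index that hosts the build or a relaxed pin without the local label.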


In [2]:
# @title Install additional dependencies
%pip install bert bert_score sacremoses

Note: you may need to restart the kernel to use updated packages.


In [None]:
# @title Install dependencies used by both scripts
# %pip install -q transformers==4.39.3 datasets==2.18.0 accelerate==0.28.0 evaluate rouge-score sacrebleu sentencepiece trl==0.7.10 peft==0.10.0 wandb

Found existing installation: peft 0.10.0
Uninstalling peft-0.10.0:
  Successfully uninstalled peft-0.10.0
Note: you may need to restart the kernel to use updated packages.


## LED smoke run (`summarization/run_summarization.py`)

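Intended to train and evaluate `allenai/led-base-16384` on the last 100 training examples for one epoch, to check that tokenization, data loading, and Trainer hooks work. As it stands, the cell below is commented out and is configured for prediction only (`--do_predict`), loading the checkpoint in `data/led_4000_600_chars/` and reading `test_4000_600_chars_last_100.json`.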

In [4]:
# # @title LED-base on train_last_100 / valid_last_100
# import os
# from pathlib import Path

# os.environ["WANDB_MODE"] = "offline"
# DATA_DIR = Path("data/ann-pt-summ/1.0.1/mimic-iv-note-ext-di-bhc/dataset")

# assert DATA_DIR.exists(), f"Missing dataset folder: {DATA_DIR}"

#     # --do_train --do_eval --do_predict \
# !python summarization/run_summarization.py \
#     --model_name_or_path data/led_4000_600_chars/ \
#     --test_file {DATA_DIR / 'test_4000_600_chars_last_100.json'} \
#     --text_column text \
#     --summary_column summary \
#     --output_dir results/led_full_run_predict \
#     --do_predict \
#     --predict_with_generate \
#     --per_device_eval_batch_size 1 \
#     --max_source_length 4096 \
#     --max_target_length 350 \
#     --report_to none

## Llama LoRA smoke run (`summarization/fine_tune_llama.py`)

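Runs a tiny LoRA job on the same dataset subset (100 examples). This assumes you already have access to `meta-llama/Llama-2-7b-hf` and a GPU runtime; set `HF_TOKEN` in the environment if needed. Note that the launch cell further down passes `--evaluation` with `--evaluation_model_path`, so as configured it appears to evaluate a saved checkpoint rather than train.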
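The prerequisites named above can be checked before launching anything expensive. This is a minimal sketch under stated assumptions: `preflight` is a hypothetical helper, and the presence of `nvidia-smi` on the PATH is used only as a rough proxy for a GPU runtime.

```python
import os
import shutil
from pathlib import Path


def preflight(data_dir, env=None):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; gated models may fail to download")
    # Assumption: a GPU runtime exposes the nvidia-smi binary.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; this may not be a GPU runtime")
    if not Path(data_dir).is_dir():
        problems.append(f"missing dataset folder: {data_dir}")
    return problems
```

Calling `preflight("data/ann-pt-summ/1.0.1/mimic-iv-note-ext-di-bhc/dataset")` and printing the result gives a quick summary before the long-running cell is started.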

In [5]:
# @title (Optional) Install Llama-specific dependencies
%pip uninstall -y transformers
%pip install -r requirements-llama.txt

Found existing installation: transformers 4.38.2
Uninstalling transformers-4.38.2:
  Successfully uninstalled transformers-4.38.2
Collecting transformers==4.38.2 (from -r requirements-llama.txt (line 3))
  Using cached transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Using cached transformers-4.38.2-py3-none-any.whl (8.5 MB)
Installing collected packages: transformers
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 5.1.2 requires transformers<5.0.0,>=4.41.0, but you have transformers 4.38.2 which is incompatible.
Successfully installed transformers-4.38.2


In [6]:
# @title Configure HF token and output directories
import os
from pathlib import Path

HF_TOKEN = ""  # @param {"type": "string"}
if HF_TOKEN:
    os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["WANDB_MODE"] = "offline"

DATA_DIR = Path("data/ann-pt-summ/1.0.1/mimic-iv-note-ext-di-bhc/dataset")
assert DATA_DIR.exists(), f"Missing dataset folder: {DATA_DIR}"

In [11]:
%pip uninstall peft
%pip install peft==0.17.0

Found existing installation: peft 0.10.0
Uninstalling peft-0.10.0:
  Would remove:
    /usr/local/lib/python3.12/dist-packages/peft-0.10.0.dist-info/*
    /usr/local/lib/python3.12/dist-packages/peft/*
Proceed (Y/n)? y
  Successfully uninstalled peft-0.10.0
Collecting peft==0.17.0
  Using cached peft-0.17.0-py3-none-any.whl.metadata (14 kB)
Using cached peft-0.17.0-py3-none-any.whl (503 kB)
Installing collected packages: peft
Successfully installed peft-0.17.0


In [12]:
# @title Llama 2 7B LoRA fine-tuning on 100-example subset
!python summarization/fine_tune_llama.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path {DATA_DIR} \
    --output_path results/llama_full_run_predict \
    --evaluation \
    --evaluation_model_path data/llama_4000_600_chars/best_val_loss \
    --num_test_examples 100

Traceback (most recent call last):
  File "/content/patient_summaries_with_llms/summarization/fine_tune_llama.py", line 10, in <module>
    from peft import (
  File "/usr/local/lib/python3.12/dist-packages/peft/__init__.py", line 17, in <module>
    from .auto import (
  File "/usr/local/lib/python3.12/dist-packages/peft/auto.py", line 31, in <module>
    from .config import PeftConfig
  File "/usr/local/lib/python3.12/dist-packages/peft/config.py", line 24, in <module>
    from .utils import CONFIG_NAME, PeftType, TaskType
  File "/usr/local/lib/python3.12/dist-packages/peft/utils/__init__.py", line 16, in <module>
    from .loftq_utils import replace_lora_weights_loftq
  File "/usr/local/lib/python3.12/dist-packages/peft/utils/loftq_utils.py", line 25, in <module>
    from accelerate.utils.memory import clear_device_cache
ImportError: cannot import name 'clear_device_cache' from 'accelerate.utils.memory' (/usr/local/lib/python3.12/dist-packages/accelerate/utils/memory.py)


In [8]:
%pip show accelerate

Name: accelerate
Version: 0.28.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: peft, trl


In [13]:
%pip show peft

Name: peft
Version: 0.17.0
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: benjamin@huggingface.co
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, huggingface_hub, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by: 


In [23]:
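import accelerate  # imported first so its version prints even if peft fails to import
print("accelerate", accelerate.__version__, accelerate.__file__)
import peft  # raises ImportError while accelerate lacks clear_device_cache
# print("peft", peft.__version__, peft.__file__)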


ImportError: cannot import name 'clear_device_cache' from 'accelerate.utils.memory' (/usr/local/lib/python3.12/dist-packages/accelerate/utils/memory.py)
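Taken together, the outputs above point to a version mismatch: peft 0.17.0 imports `clear_device_cache` from `accelerate.utils.memory`, and the installed accelerate 0.28.0 does not provide it. Two plausible remedies are reinstalling peft 0.10.0 (the version that was present alongside accelerate 0.28.0 before the upgrade) or upgrading accelerate; which accelerate release first ships the function is not established here. The sketch below checks for the symbol without importing peft; `installed_version` and `module_exposes` are hypothetical helper names.

```python
import importlib
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def module_exposes(module_name, attr):
    """Return True if module_name imports cleanly and defines attr."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Any import-time failure is treated as "not available".
        return False
    return hasattr(module, attr)


for pkg in ("accelerate", "peft", "transformers"):
    print(pkg, installed_version(pkg))
print(
    "clear_device_cache available:",
    module_exposes("accelerate.utils.memory", "clear_device_cache"),
)
```

If the last line prints `False`, importing peft 0.17.0 in that environment should be expected to fail in the same way as the traceback above.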