**1. Setup**
- Import all pipeline components and configure logging once for the notebook.

In [1]:
from logging_utils import configure_logging
from arxiv_fetcher import ArxivFetcher
from paper_summarizer import PaperSummarizer
from reward_model_trainer import RewardModelTrainer
from evaluation_runner import EvaluationRunner

log = configure_logging()
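
The `logging_utils` module is not shown in this notebook. As a minimal sketch, assuming it installs a single stdout handler on the root logger, the following stand-in would produce lines in the same `timestamp - LEVEL - logger - message` layout seen in the outputs below. The format strings and the `"notebook"` logger name are assumptions inferred from the log output, not the actual implementation.

```python
import logging
import sys

# Hypothetical stand-in for logging_utils.configure_logging (the real module is not shown).
LOG_FORMAT = "%(asctime)s - %(levelname)-8s - %(name)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach one stdout handler to the root logger and return a notebook logger."""
    root = logging.getLogger()
    root.setLevel(level)
    # Drop existing handlers so re-running the cell does not duplicate every line.
    for old_handler in list(root.handlers):
        root.removeHandler(old_handler)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    root.addHandler(handler)
    # "notebook" is an assumed logger name for messages emitted from the cells.
    return logging.getLogger("notebook")


log = configure_logging()
log.info("Logging configured")
```

Configuring the root logger, rather than one logger per module, is what would let third-party loggers such as `arxiv` and `accelerate` appear in the same format.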

**2. Fetch PDFs**
- ArxivFetcher downloads recent arXiv papers as PDFs into RAW_FOLDER, where they are stored locally for later summarization.

In [2]:
try:
    ArxivFetcher().run()
except Exception as e:
    log.error(f"ArXiv fetch failed: {e}", exc_info=True)

2025-09-28 19:21:21 - INFO     - arxiv_fetcher - Output folder ready: /home/jovyan/MLE_in_Gen_AI-Course/class8/homework/data/raw
2025-09-28 19:21:21 - INFO     - arxiv_fetcher - Fetching arXiv papers...
2025-09-28 19:21:21 - INFO     - arxiv - Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=machine+learning&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100


2025-09-28 19:21:22 - INFO     - arxiv - Got first page: 100 of 435760 total results
2025-09-28 19:21:25 - INFO     - arxiv_fetcher - Stored 10 PDFs in /home/jovyan/MLE_in_Gen_AI-Course/class8/homework/data/raw
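
The request URL in the log above shows which query parameters were sent to the arXiv API. As a small sketch using only the standard library, the same URL can be rebuilt from those parameters. `build_query_url` is a hypothetical helper for illustration; `ArxivFetcher` itself appears to go through the `arxiv` package, and its internals are not shown here.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_query_url(query: str, start: int = 0, max_results: int = 100) -> str:
    """Rebuild the arXiv API request URL from the parameters seen in the log."""
    params = {
        "search_query": query,
        "id_list": "",
        "sortBy": "submittedDate",   # newest submissions first
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,  # page size requested from the API
    }
    # urlencode encodes the space in the query as "+", as in the logged URL.
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("machine learning")
print(url)
```

The API page holds 100 results while only 10 PDFs were stored, so the fetcher presumably caps the number of downloads separately from the page size. That cap is not visible in this notebook.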


**3. Summarize PDFs into reward dataset**
- PaperSummarizer extracts text from each PDF (falling back to the abstract if needed), generates a "chosen" and a "rejected" summary per paper, and writes the pairs to reward_data.jsonl for training.

In [3]:
try:
    PaperSummarizer().run()
except Exception as e:
    log.error(f"Summarization failed: {e}", exc_info=True)

2025-09-28 19:21:51 - INFO     - accelerate.utils.modeling - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


2025-09-28 19:22:07 - INFO     - paper_summarizer - Processing 10 PDF(s)...
2025-09-28 19:22:07 - INFO     - paper_summarizer - Paper 1/10: 2509.21320v1.pdf
2025-09-28 19:24:23 - INFO     - paper_summarizer - Paper 2/10: 2509.21319v1.pdf
2025-09-28 19:25:54 - INFO     - paper_summarizer - Paper 3/10: 2509.21278v1.pdf
2025-09-28 19:27:42 - INFO     - paper_summarizer - Paper 4/10: 2509.21286v1.pdf
2025-09-28 19:29:14 - INFO     - paper_summarizer - Paper 5/10: 2509.21281v1.pdf
2025-09-28 19:30:42 - INFO     - paper_summarizer - Paper 6/10: 2509.21309v1.pdf
2025-09-28 19:32:02 - INFO     - paper_summarizer - Paper 7/10: 2509.21282v1.pdf
2025-09-28 19:33:31 - INFO     - paper_summarizer - Paper 8/10: 2509.21296v1.pdf
2025-09-28 19:35:01 - INFO     - paper_summarizer - Paper 9/10: 2509.21302v1.pdf
2025-09-28 19:36:29 - INFO     - paper_summarizer - Paper 10/10: 2509.21293v1.pdf
2025-09-28 19:38:07 - INFO     - paper_summarizer - Saved reward dataset to /home/jovyan/MLE_in_Gen_AI-Course/cla
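
The trainer log in the next section reports that each dataset entry has the keys `chosen` and `rejected`. A minimal sketch of that JSONL layout, one JSON object per line, might look like the following. The helper names and the summary strings are placeholders, and the file is written to a temporary directory rather than the real output path.

```python
import json
import tempfile
from pathlib import Path


def write_pairs(path: Path, pairs) -> None:
    """Write one {"chosen": ..., "rejected": ...} object per line."""
    with path.open("w", encoding="utf-8") as f:
        for chosen, rejected in pairs:
            f.write(json.dumps({"chosen": chosen, "rejected": rejected}) + "\n")


def read_pairs(path: Path) -> list:
    """Read the JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder text, not actual summarizer output.
pairs = [("A faithful, specific summary.", "A vague or off-topic summary.")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reward_data.jsonl"
    write_pairs(path, pairs)
    records = read_pairs(path)

print(list(records[0]))
```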

**4. Train reward model**
- RewardModelTrainer loads the dataset, trains a reward model, and saves it to REWARD_OUTPUT_DIR.
- The cell then extracts the classifier head weights and prints their type, shape, and values.
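
The notebook does not show which loss `RewardModelTrainer` optimizes. A common objective for chosen/rejected pairs is the pairwise (Bradley-Terry style) loss, `-log(sigmoid(r_chosen - r_rejected))`, which is small when the chosen summary scores higher than the rejected one. The sketch below is an illustration of that objective under this assumption, written in plain Python so it runs without a deep learning framework. `pairwise_loss` is a hypothetical name.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal rewards give log(2); a larger margin in favor of "chosen" lowers the loss.
print(pairwise_loss(0.0, 0.0))
print(pairwise_loss(2.0, 0.0))
print(pairwise_loss(0.0, 2.0))
```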

In [4]:
try:
    trainer = RewardModelTrainer()
    trainer.run()
    weights = trainer.extract_weights()
    print(type(weights))
    print(weights.shape)
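    # Note: weights has shape [1, 768], so weights[:5] slices rows and prints the
    # whole tensor; weights[0, :5] would show only the first five values.
    print(weights[:5])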
except Exception as e:
    log.error(f"Reward model training failed: {e}", exc_info=True)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2025-09-28 19:40:23 - INFO     - reward_model_trainer - Loading and preprocessing dataset...


2025-09-28 19:40:23 - INFO     - reward_model_trainer - Loaded dataset with 10 entries. Keys: ['chosen', 'rejected']


2025-09-28 19:40:26 - INFO     - reward_model_trainer - Starting reward model training...


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



2025-09-28 19:40:48 - INFO     - reward_model_trainer - Training complete. Model + config saved to /home/jovyan/MLE_in_Gen_AI-Course/class8/homework/reward_model
2025-09-28 19:40:48 - INFO     - reward_model_trainer - Extracting weights from /home/jovyan/MLE_in_Gen_AI-Course/class8/homework/reward_model/model.safetensors
2025-09-28 19:40:48 - INFO     - reward_model_trainer - Classifier head weights shape: torch.Size([1, 768])
<class 'torch.Tensor'>
torch.Size([1, 768])
tensor([[ 3.0603e-02,  2.3429e-02, -4.6433e-03, -1.8017e-02,  1.8536e-02,
         -1.5608e-02, -8.6062e-03,  2.1255e-02, -3.3708e-03,  3.7745e-02,
         -1.5757e-02,  5.3781e-02,  6.6413e-03, -6.3821e-03, -3.0791e-03,
          2.7429e-02, -2.0517e-02, -3.7056e-03, -2.2385e-02,  2.0431e-02,
         -2.7640e-02,  2.4315e-02, -2.8198e-02, -2.0454e-02,  5.9544e-03,
         -5.4145e-03,  2.0578e-02,  2.2125e-03, -2.1326e-03, -4.8510e-02,
         -2.4436e-02, -1.1499e-02,  6.9951e-03, -1.1106e-03, -4.0756e-03,
       

**5. Evaluate**
- EvaluationRunner computes ROUGE and BERTScore between generated and reference summaries, and scores the generated summaries with the trained reward model.
- Replace the example summaries with real outputs for a full evaluation.

In [5]:
try:
    generated_summaries = [
        "This is a generated summary of the paper.",
        "Another candidate summary with different wording.",
    ]
    reference_summaries = [
        "This is the gold reference summary of the paper.",
        "Another gold reference summary for comparison.",
    ]

    runner = EvaluationRunner()

    rouge, bertscore_raw, bertscore_avg = runner.compute_metrics(
        generated_summaries, reference_summaries
    )
    reward_scores = runner.score_with_reward_model(generated_summaries)

    print("=== Evaluation Results ===")
    print("ROUGE (aggregate):")
    for k, v in rouge.items():
        print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")

    print("\nBERTScore (averaged):")
    for k, v in bertscore_avg.items():
        print(f"  {k}: {v:.4f}")

    print("\nBERTScore (first 3 raw values per metric):")
    for k, v in bertscore_raw.items():
        if isinstance(v, list):
            print(f"  {k}: {v[:3]}")

    print("\nReward model scores:")
    print(reward_scores)

except Exception as e:
    log.error(f"Evaluation failed: {e}", exc_info=True)

2025-09-28 19:41:23 - INFO     - absl - Using default tokenizer.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2025-09-28 19:41:24 - INFO     - evaluation_runner - ROUGE: {'rouge1': np.float64(0.5196078431372549), 'rouge2': np.float64(0.26666666666666666), 'rougeL': np.float64(0.5196078431372549), 'rougeLsum': np.float64(0.5196078431372549)}
2025-09-28 19:41:24 - INFO     - evaluation_runner - BERTScore avg: {'precision': 0.924, 'recall': 0.9141, 'f1': 0.919}
2025-09-28 19:41:24 - INFO     - evaluation_runner - Reward model scores: [0.05041991174221039, 0.04892019182443619]
=== Evaluation Results ===
ROUGE (aggregate):
  rouge1: 0.5196
  rouge2: 0.2667
  rougeL: 0.5196
  rougeLsum: 0.5196

BERTScore (averaged):
  precision: 0.9240
  recall: 0.9141
  f1: 0.9190

BERTScore (first 3 raw values per metric):
  precision: [0.953811764717102, 0.894155740737915]
  recall: [0.9362310171127319, 0.8918960690498352]
  f1: [0.9449396133422852, 0.8930245041847229]

Reward model scores:
[0.05041991174221039, 0.04892019182443619]
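
To make the ROUGE numbers above concrete, here is a simplified ROUGE-N F1 sketch in plain Python: lowercase alphanumeric tokenization, clipped n-gram overlap, and a plain mean over the pairs. It is not the implementation used by `EvaluationRunner` (which is not shown), and it omits options such as stemming, but on the two example pairs from the cell above it reproduces the logged `rouge1` and `rouge2` values. All function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list, n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of clipped n-gram overlap between a candidate and a reference."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # Counter "&" clips to the minimum count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge_n(candidates: list, references: list, n: int = 1) -> float:
    """Average the per-pair F1 scores."""
    scores = [rouge_n_f1(c, r, n) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores)


generated = [
    "This is a generated summary of the paper.",
    "Another candidate summary with different wording.",
]
references = [
    "This is the gold reference summary of the paper.",
    "Another gold reference summary for comparison.",
]

print(f"rouge1: {mean_rouge_n(generated, references, 1):.4f}")  # rouge1: 0.5196
print(f"rouge2: {mean_rouge_n(generated, references, 2):.4f}")  # rouge2: 0.2667
```

The first pair shares 6 unigrams (F1 = 12/17) and the second shares 2 (F1 = 1/3), so the mean is 53/102 ≈ 0.5196. These scores reflect word overlap between placeholder sentences only and say nothing about summary quality.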
