## Purpose of the Evaluation
The goal is to generate numeral-aware headlines for news articles. The evaluation focuses on two main aspects:
1. How well the headlines summarise the article and capture its key numerical information
2. The correctness of this numerical information (numerical reasoning capability)
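
As a rough illustration of the second aspect, the sketch below checks whether an expected numeral appears in a generated headline. It is a minimal sketch, not the logic used by `numhg_eval.py`: the helper names (`extract_numbers`, `contains_number`, `numeral_accuracy`) are hypothetical, and it assumes numerals are written as digits, so spelled-out numbers such as "five" are not matched.

```python
import re

# Matches integers and decimals written with digits, allowing thousands
# separators (e.g. "1,200" or "3.5"). Spelled-out numbers are not matched.
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> list[str]:
    """Return the numerals found in text, with thousands separators removed."""
    return [match.replace(",", "") for match in NUMBER_PATTERN.findall(text)]


def contains_number(headline: str, expected: str) -> bool:
    """Check whether the expected numeral appears in the headline."""
    return expected.replace(",", "") in extract_numbers(headline)


def numeral_accuracy(headlines: list[str], expected_numbers: list[str]) -> float:
    """Fraction of headlines that contain their expected numeral."""
    pairs = list(zip(headlines, expected_numbers))
    if not pairs:
        return 0.0
    hits = sum(contains_number(headline, number) for headline, number in pairs)
    return hits / len(pairs)
```

A call such as `numeral_accuracy(predictions, gold_numbers)` returns a value between 0 and 1. String matching of this kind only shows that the right digits are present; it does not show that the model reasoned correctly to reach them.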

## The Evaluation Process
The notebook sets up an evaluation pipeline for three different model configurations:
1. **Base Model**: Original Llama 3.1-8B-Instruct model without fine-tuning
2. **Fine-Tuned Model**: Llama 3.1 model fine-tuned on the NumHG dataset
3. **Chain-of-Thought Fine-Tuned Model**: The fine-tuned model with additional prompting that guides the model through a reasoning process (chain of thought)

## Evaluation Metrics
The NumHG evaluation script (`numhg_eval.py`) implements several metrics mentioned in the project PDF:
* **Accuracy**: Assessing the correctness of numerical information
* **ROUGE**: Evaluating text overlap with reference headlines
* **BERTScore**: Measuring semantic similarity using contextual embeddings
* **MoverScore**: Computing semantic similarity with Earth Mover's Distance over contextual embeddings
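
To make the idea of "text overlap" concrete, here is a simplified ROUGE-1 F1 computation based on unigram counts. It is only a sketch for intuition: the function name `rouge1_f1` is hypothetical, tokenisation is plain whitespace splitting, and the evaluation script may rely on a dedicated ROUGE package with different tokenisation and stemming, so the numbers will not necessarily match.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and reference."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())

    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((candidate_counts & reference_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("3 dead in crash", "crash leaves 3 dead")` shares three of four unigrams on each side, so precision, recall and F1 are all 0.75. Word order is ignored, which is why ROUGE is reported alongside embedding-based metrics here.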

## Dataset
The evaluation uses the NumHG dataset (Numerical Headline Generation), which contains:
* Target headlines (ground truth)
* News articles
* Ground truth numerical values
* Number type information for each headline
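
Before running the evaluation, it can help to confirm that the prediction file lines up with the reference files. The sketch below assumes the files are line-aligned, with one example per line; that layout is an assumption here, not something this notebook states. The helper names `count_lines` and `check_alignment` are hypothetical.

```python
def count_lines(path: str) -> int:
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_alignment(paths: list[str]) -> tuple[dict[str, int], bool]:
    """Return per-file line counts and whether all files have the same count.

    Assumes one example per line in every file (an assumption about the
    dataset layout, not a documented guarantee).
    """
    counts = {path: count_lines(path) for path in paths}
    aligned = len(set(counts.values())) <= 1
    return counts, aligned
```

Calling `check_alignment` with the target, number ground truth, number type and prediction paths gives a quick way to spot a truncated or padded prediction file before metrics are computed.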

## Flow of the Evaluation Code
1. The code first sets up the environment by cloning repositories and installing dependencies
2. It prepares the necessary evaluation metrics (MoverScore, ROUGE, etc.)
3. It then runs the evaluation script three times, once for each model variant, comparing their generated headlines against the ground truth
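
The third step can also be written as a loop rather than three separate cells. The following sketch builds the same command line for each prediction file, using the paths and flags that appear in this notebook; the names `PREDICTION_FILES`, `build_eval_command` and `run_all` are hypothetical, and the code assumes it is run from inside the `NumHG` directory.

```python
import subprocess

# Prediction files used in this notebook, keyed by a descriptive label.
PREDICTION_FILES = {
    "base": "../BASEpreds-head.txt",
    "fine-tuned": "../FTpreds-head.txt",
    "fine-tuned + CoT": "../FTCOTpreds-head.txt",
}


def build_eval_command(pre_path: str, fold_dir: str = "Dataset/fold-1") -> list[str]:
    """Build the numhg_eval.py command line for one prediction file."""
    return [
        "python",
        "numhg_eval.py",
        f"--tgt_path={fold_dir}/target.txt",
        f"--pre_path={pre_path}",
        f"--num_gt_path={fold_dir}/number_gt.txt",
        f"--num_type_path={fold_dir}/number_type.txt",
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for name, pre_path in PREDICTION_FILES.items():
        command = build_eval_command(pre_path)
        print(f"[{name}]", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(command, check=True)
```

`run_all()` only prints the three commands; `run_all(dry_run=False)` executes them in sequence. Keeping the reference paths in one place makes it harder to evaluate one model against a different fold by accident.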

In [None]:
# Clone the NumHG repository (Numerical Headline Generation)
# This repository contains the dataset and evaluation code for the task
!git clone https://github.com/ArrowHuang/NumHG.git

In [None]:
# Install required dependencies for the evaluation
# The repository requirements include all necessary packages for evaluation metrics
!pip install -r NumHG/requirements.txt

In [None]:
# Clone the MoverScore repository
# MoverScore is an evaluation metric that measures semantic similarity between texts
# It's one of the evaluation metrics mentioned in the project outline
!git clone https://github.com/AIPHES/emnlp19-moverscore.git
%cd emnlp19-moverscore/
!python setup.py install

In [None]:
# Update TensorFlow and Keras to the latest versions for compatibility
!pip install --upgrade tensorflow keras

In [None]:
# Install the Transformers library for working with pre-trained models
# This is needed for several evaluation metrics including BERTScore and MoverScore
!pip install --upgrade transformers

In [None]:
# Install pytorch-pretrained-bert for MoverScore functionality
!pip install pytorch-pretrained-bert

# Disabled: manual MoverScore setup (model chosen via MOVERSCORE_MODEL, then imports)
# import os
# os.environ['MOVERSCORE_MODEL'] = "albert-base-v2"
# from moverscore_v2 import word_mover_score, get_idf_dict

In [None]:
# Navigate back to the NumHG directory for evaluation
%cd ../NumHG/

In [None]:
# Download the NLTK punkt_tab tokenizer data for text tokenization
# This is needed for proper text preprocessing during evaluation
import nltk
nltk.download('punkt_tab')

In [None]:
# Run the evaluation script on the base model predictions
# This evaluates the original Llama 3.1 model without fine-tuning
# Parameters:
# - tgt_path: Ground truth headlines
# - pre_path: Model predictions from base model
# - num_gt_path: Ground truth for numerical values
# - num_type_path: Types of numbers in each headline
!python numhg_eval.py \
--tgt_path=Dataset/fold-1/target.txt \
--pre_path=../BASEpreds-head.txt \
--num_gt_path=Dataset/fold-1/number_gt.txt \
--num_type_path=Dataset/fold-1/number_type.txt

In [None]:
# Run the evaluation script on the fine-tuned model predictions
# This evaluates the Llama 3.1 model after basic fine-tuning
# Using the same ground truth paths as above but different predictions
!python numhg_eval.py \
--tgt_path=Dataset/fold-1/target.txt \
--pre_path=../FTpreds-head.txt \
--num_gt_path=Dataset/fold-1/number_gt.txt \
--num_type_path=Dataset/fold-1/number_type.txt

In [None]:
# Run the evaluation script on the chain-of-thought fine-tuned model predictions
# This evaluates the fine-tuned Llama 3.1 model used with chain-of-thought prompting,
# which gives the model more detailed instructions during generation
!python numhg_eval.py \
--tgt_path=Dataset/fold-1/target.txt \
--pre_path=../FTCOTpreds-head.txt \
--num_gt_path=Dataset/fold-1/number_gt.txt \
--num_type_path=Dataset/fold-1/number_type.txt