`seqeval` is a Python framework for sequence labeling evaluation. 

`seqeval` can evaluate the performance of chunking tasks such as `named-entity recognition`, `part-of-speech tagging`, `semantic role labeling` and so on.

In [7]:
#!pip install seqeval

In [2]:
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score

In [3]:
y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'],      ['B-PER', 'I-PER', 'O']]

In [4]:
y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

In [5]:
accuracy_score(y_true, y_pred)

0.8

In [6]:
f1_score(y_true, y_pred)

0.5
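
The gap between the two scores comes from what is being counted: `accuracy_score` compares tags token by token, while `f1_score` compares whole entities, so a MISC span predicted one token too early counts as a wrong entity. The following is a minimal pure-Python sketch of that idea, not seqeval's actual implementation. The helper names `extract_entities`, `token_accuracy`, and `entity_f1` are hypothetical, and the sketch assumes plain BIO (IOB2) tags.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) entity spans in one BIO-tagged sentence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # trailing 'O' closes any open entity
        prefix, _, label = tag.partition('-')
        # An open entity ends unless this tag is an 'I-' of the same type.
        if start is not None and not (prefix == 'I' and label == ent_type):
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        # 'B-' (or a stray 'I-' with nothing open) starts a new entity.
        if start is None and prefix in ('B', 'I'):
            start, ent_type = i, label
    return entities


def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag equals the true tag."""
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


def entity_f1(y_true, y_pred):
    """Micro-averaged F1 over exact entity matches (same type, start, and end)."""
    true_set = {(i, *e) for i, seq in enumerate(y_true) for e in extract_entities(seq)}
    pred_set = {(i, *e) for i, seq in enumerate(y_pred) for e in extract_entities(seq)}
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

print(token_accuracy(y_true, y_pred))  # 0.8 (8 of 10 tags agree)
print(entity_f1(y_true, y_pred))       # 0.5 (only the PER entity matches exactly)
```

Two of the ten tags differ, but they belong to the same MISC span, so one of the two entities is wrong and precision, recall, and F1 all drop to 0.5.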

#### SeqEval: A Python package for evaluating Seq2Seq models

- Sequence Evaluate (SeqEval) is a Python package that computes metrics for evaluating Seq2Seq models on tasks such as machine translation, dialogue response generation, and text summarization. Despite the similar name, it is a different package from the `seqeval` sequence-labeling library above. Many packages already compute these metrics, but SeqEval puts them all in one place and lets you compute them in two lines of code.

In [8]:
!pip install sequence-evaluate

Collecting sequence-evaluate
  Downloading sequence_evaluate-0.0.3-py3-none-any.whl.metadata (3.6 kB)
Downloading sequence_evaluate-0.0.3-py3-none-any.whl (7.2 kB)
Installing collected packages: sequence-evaluate
Successfully installed sequence-evaluate-0.0.3


In [9]:
!pip install rouge


#### rouge_scorer

In [11]:
#pip install rouge-score

In [12]:
from rouge_score import rouge_scorer

In [13]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

`RougeScorer.score` takes two strings: the target (the reference, i.e. ground-truth text) and the prediction (the candidate generated by the model).

In [14]:
# Define the candidate summary and the reference summary (both plain strings)
candidate_summary   = "The quick brown fox jumps over the lazy dog."
reference_summaries = "The dog is lazy. The fox is quick."

In [15]:
# Re-initialize the scorer without stemming (use_stemmer defaults to False)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'])

In [18]:
# Calculate ROUGE scores
# scorer.score(target, prediction)
scores = scorer.score(prediction=candidate_summary, target=reference_summaries)

In [19]:
scores

{'rouge1': Score(precision=0.6666666666666666, recall=0.75, fmeasure=0.7058823529411765),
 'rougeL': Score(precision=0.2222222222222222, recall=0.25, fmeasure=0.23529411764705882)}
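
To see where these numbers come from, here is a minimal pure-Python sketch of ROUGE-1 and ROUGE-L on the same pair of strings. It is an illustration rather than the `rouge_score` implementation: the helper names are hypothetical, and the tokenizer (lowercase, keep alphanumeric runs, no stemming) is an assumption meant to mirror the library's default behavior.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def precision_recall_f(matches, n_pred, n_ref):
    """Turn a match count into (precision, recall, F1)."""
    p = matches / n_pred if n_pred else 0.0
    r = matches / n_ref if n_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_1(target, prediction):
    """Clipped unigram overlap between target and prediction."""
    ref, pred = Counter(tokenize(target)), Counter(tokenize(prediction))
    overlap = sum((ref & pred).values())  # each word counts at most min(ref, pred) times
    return precision_recall_f(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(target, prediction):
    """ROUGE-L from the longest common subsequence of the two token lists."""
    ref, pred = tokenize(target), tokenize(prediction)
    return precision_recall_f(lcs_length(ref, pred), len(pred), len(ref))


candidate = "The quick brown fox jumps over the lazy dog."
reference = "The dog is lazy. The fox is quick."

print(rouge_1(reference, candidate))  # precision 6/9, recall 6/8
print(rouge_l(reference, candidate))  # precision 2/9, recall 2/8
```

Six of the candidate's nine words also occur in the eight-word reference, but the word order differs so much that the longest common subsequence has only two tokens, which is why `rougeL` is far lower than `rouge1`.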

Another example, this time also computing ROUGE-2 and ROUGE-Lsum:

In [20]:
# Initialize the scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])

In [21]:
# Compute the ROUGE scores: the first argument is the target (reference), the second is the prediction (candidate).
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')

In [22]:
scores

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765),
 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666),
 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471),
 'rougeLsum': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}
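
ROUGE-2 applies the same clipped-overlap counting to bigrams instead of single words. The sketch below lists the bigrams the two sentences share and recovers the `rouge2` precision and recall shown above. The `ngrams` helper is hypothetical, and the tokenizer (lowercased alphanumeric runs, no stemming) is a simplifying assumption rather than the library's code.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Count the n-grams of a text after lowercasing and keeping alphanumeric runs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(*(tokens[i:] for i in range(n))))


target = 'The quick brown fox jumps over the lazy dog'
prediction = 'The quick brown dog jumps on the log.'

ref_bigrams = ngrams(target, 2)       # 8 bigrams
pred_bigrams = ngrams(prediction, 2)  # 7 bigrams
shared = ref_bigrams & pred_bigrams   # clipped overlap

precision = sum(shared.values()) / sum(pred_bigrams.values())
recall = sum(shared.values()) / sum(ref_bigrams.values())

print(sorted(shared))                         # [('quick', 'brown'), ('the', 'quick')]
print(round(precision, 4), round(recall, 4))  # 0.2857 0.25
```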

#### py-rouge

In [31]:
pip install py-rouge

Collecting py-rouge
  Downloading py_rouge-1.1-py3-none-any.whl (56 kB)
     ---------------------------------------- 56.8/56.8 kB 1.5 MB/s eta 0:00:00
Installing collected packages: py-rouge
Successfully installed py-rouge-1.1
Note: you may need to restart the kernel to use updated packages.


In [32]:
import rouge

In [33]:
def prepare_results(m, p, r, f):
    return '\t{}:\t{}: {:5.2f}\t{}: {:5.2f}\t{}: {:5.2f}'.format(m, 'P', 100.0 * p, 'R', 100.0 * r, 'F1', 100.0 * f)

In [35]:
for aggregator in ['Avg', 'Best', 'Individual']:
    print('Evaluation with {}'.format(aggregator))
    apply_avg  = aggregator == 'Avg'
    apply_best = aggregator == 'Best'

    evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'],
                           max_n=4,
                           limit_length=True,
                           length_limit=100,
                           length_limit_type='words',
                           apply_avg=apply_avg,
                           apply_best=apply_best,
                           alpha=0.5, # Default F1_score
                           weight_factor=1.2,
                           stemming=True)
    
    hypothesis_1 = "King Norodom Sihanouk has declined requests to chair a summit of Cambodia 's top political leaders , saying the meeting would not bring any progress in deadlocked negotiations to form a government .\nGovernment and opposition parties have asked King Norodom Sihanouk to host a summit meeting after a series of post-election negotiations between the two opposition groups and Hun Sen 's party to form a new government failed .\nHun Sen 's ruling party narrowly won a majority in elections in July , but the opposition _ claiming widespread intimidation and fraud _ has denied Hun Sen the two-thirds vote in parliament required to approve the next government .\n"
    references_1 = ["Prospects were dim for resolution of the political crisis in Cambodia in October 1998.\nPrime Minister Hun Sen insisted that talks take place in Cambodia while opposition leaders Ranariddh and Sam Rainsy, fearing arrest at home, wanted them abroad.\nKing Sihanouk declined to chair talks in either place.\nA U.S. House resolution criticized Hun Sen's regime while the opposition tried to cut off his access to loans.\nBut in November the King announced a coalition government with Hun Sen heading the executive and Ranariddh leading the parliament.\nLeft out, Sam Rainsy sought the King's assurance of Hun Sen's promise of safety and freedom for all politicians.",
                    "Cambodian prime minister Hun Sen rejects demands of 2 opposition parties for talks in Beijing after failing to win a 2/3 majority in recent elections.\nSihanouk refuses to host talks in Beijing.\nOpposition parties ask the Asian Development Bank to stop loans to Hun Sen's government.\nCCP defends Hun Sen to the US Senate.\nFUNCINPEC refuses to share the presidency.\nHun Sen and Ranariddh eventually form a coalition at summit convened by Sihanouk.\nHun Sen remains prime minister, Ranariddh is president of the national assembly, and a new senate will be formed.\nOpposition leader Rainsy left out.\nHe seeks strong assurance of safety should he return to Cambodia.\n",
                    ]
    
    hypothesis_2 = "China 's government said Thursday that two prominent dissidents arrested this week are suspected of endangering national security _ the clearest sign yet Chinese leaders plan to quash a would-be opposition party .\nOne leader of a suppressed new political party will be tried on Dec. 17 on a charge of colluding with foreign enemies of China '' to incite the subversion of state power , '' according to court documents given to his wife on Monday .\nWith attorneys locked up , harassed or plain scared , two prominent dissidents will defend themselves against charges of subversion Thursday in China 's highest-profile dissident trials in two years .\n"
    references_2 = "Hurricane Mitch, category 5 hurricane, brought widespread death and destruction to Central American.\nEspecially hard hit was Honduras where an estimated 6,076 people lost their lives.\nThe hurricane, which lingered off the coast of Honduras for 3 days before moving off, flooded large areas, destroying crops and property.\nThe U.S. and European Union were joined by Pope John Paul II in a call for money and workers to help the stricken area.\nPresident Clinton sent Tipper Gore, wife of Vice President Gore to the area to deliver much needed supplies to the area, demonstrating U.S. commitment to the recovery of the region.\n"

    all_hypothesis = [hypothesis_1, hypothesis_2]
    all_references = [references_1, references_2]
    
    scores = evaluator.get_scores(all_hypothesis, all_references)

    for metric, results in sorted(scores.items(), key=lambda x: x[0]):
        if not apply_avg and not apply_best:  # results is a list because each summary is scored against each of its references
            for hypothesis_id, results_per_ref in enumerate(results):
                nb_references = len(results_per_ref['p'])
                for reference_id in range(nb_references):
                    print('\tHypothesis #{} & Reference #{}: '.format(hypothesis_id, reference_id))
                    print('\t' + prepare_results(metric,results_per_ref['p'][reference_id], results_per_ref['r'][reference_id], results_per_ref['f'][reference_id]))
            print()
        else:
            print(prepare_results(metric, results['p'], results['r'], results['f']))
    print()
    
    

Evaluation with Avg
	rouge-1:	P: 28.62	R: 26.46	F1: 27.49
	rouge-2:	P:  4.21	R:  3.92	F1:  4.06
	rouge-3:	P:  0.80	R:  0.74	F1:  0.77
	rouge-4:	P:  0.00	R:  0.00	F1:  0.00
	rouge-l:	P: 30.52	R: 28.57	F1: 29.51
	rouge-w:	P: 15.85	R:  8.28	F1: 10.87

Evaluation with Best
	rouge-1:	P: 30.44	R: 28.36	F1: 29.37
	rouge-2:	P:  4.74	R:  4.46	F1:  4.59
	rouge-3:	P:  1.06	R:  0.98	F1:  1.02
	rouge-4:	P:  0.00	R:  0.00	F1:  0.00
	rouge-l:	P: 31.54	R: 29.71	F1: 30.60
	rouge-w:	P: 16.42	R:  8.82	F1: 11.47

Evaluation with Individual
	Hypothesis #0 & Reference #0: 
		rouge-1:	P: 38.54	R: 35.58	F1: 37.00
	Hypothesis #0 & Reference #1: 
		rouge-1:	P: 45.83	R: 43.14	F1: 44.44
	Hypothesis #1 & Reference #0: 
		rouge-1:	P: 15.05	R: 13.59	F1: 14.29

	Hypothesis #0 & Reference #0: 
		rouge-2:	P:  7.37	R:  6.80	F1:  7.07
	Hypothesis #0 & Reference #1: 
		rouge-2:	P:  9.47	R:  8.91	F1:  9.18
	Hypothesis #1 & Reference #0: 
		rouge-2:	P:  0.00	R:  0.00	F1:  0.00

	Hypothesis #0 & Reference #0: 
		rouge-3:	P: 
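
The "Best" figures for rouge-1 appear consistent with keeping, for each hypothesis, only its best-scoring reference and then averaging over hypotheses. The sketch below checks that reading against the printed "Individual" percentages. It is an interpretation of the output, not py-rouge's code: the `best_aggregate` helper is hypothetical, and choosing the best reference by F1 is an assumption (for this data, choosing by recall picks the same reference).

```python
# ROUGE-1 percentages copied from the "Individual" output above,
# as (precision, recall, f1) per reference, grouped by hypothesis.
individual_rouge1 = [
    [(38.54, 35.58, 37.00), (45.83, 43.14, 44.44)],  # hypothesis 0, two references
    [(15.05, 13.59, 14.29)],                         # hypothesis 1, one reference
]


def best_aggregate(per_hypothesis):
    """Keep each hypothesis's highest-F1 reference, then average over hypotheses."""
    best = [max(refs, key=lambda score: score[2]) for refs in per_hypothesis]
    n = len(best)
    return tuple(sum(score[k] for score in best) / n for k in range(3))


p, r, f = best_aggregate(individual_rouge1)
print(f"P: {p:.2f}  R: {r:.2f}  F1: {f:.2f}")
```

Because the inputs are already rounded to two decimals, the result agrees with the printed "Best" row (P 30.44, R 28.36, F1 29.37) only to within rounding.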

#### Using OpenAI

In [1]:
import openai
from openai import OpenAI
import json

from rouge_score import rouge_scorer

In [2]:
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    # api_key = api_key
)

In [3]:
original_text = """ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. 
It is a set of metrics used to evaluate the quality of machine-generated text. 
ROUGE primarily compares the overlap of n-grams, sequences, or words between a generated text and a reference text. 
Common ROUGE variants include ROUGE-1, ROUGE-2, and ROUGE-L. 
ROUGE-1 measures unigram overlap, while ROUGE-2 measures bigram overlap. 
ROUGE-L captures the longest common subsequence between the texts. 
Higher ROUGE scores indicate better quality, with scores ranging from 0 to 1. 
ROUGE is widely used in tasks like summarization, machine translation, and text generation. 
Precision, recall, and F1-score are calculated for each ROUGE variant. 
Despite its popularity, ROUGE focuses on word overlap and may not capture semantic quality."""

In [4]:
# Reference text (manually written summary)
reference_text = """ROUGE is a metric for evaluating text quality based on overlap with reference texts. 
Variants like ROUGE-1, ROUGE-2, and ROUGE-L measure unigram, bigram, and longest common subsequence overlaps. 
It’s commonly used in summarization and machine translation tasks."""

In [5]:
response = client.chat.completions.create(
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": f"Summarize this text:\n{original_text}"}
                ],
                model="gpt-4o-mini",
                #max_tokens = 10
)

In [6]:
# Extract the generated text
generated_text = response.choices[0].message.content

In [7]:
# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [8]:
# Compute ROUGE scores between reference text and generated text
scores = scorer.score(reference_text, generated_text)

In [9]:
# Display ROUGE scores
print("\nROUGE Scores (Reference vs Generated):")
for metric, score in scores.items():
    print(f"{metric}: Precision={score.precision:.4f}, Recall={score.recall:.4f}, F1={score.fmeasure:.4f}")


ROUGE Scores (Reference vs Generated):
rouge1: Precision=0.3608, Recall=0.8537, F1=0.5072
rouge2: Precision=0.1562, Recall=0.3750, F1=0.2206
rougeL: Precision=0.2784, Recall=0.6585, F1=0.3913
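
Each F1 value above is the harmonic mean of the corresponding precision and recall, so high recall cannot compensate for low precision. Here precision is well below recall because the generated summary is considerably longer than the reference. A small sketch that recomputes F1 from the printed values; the `f_measure` helper is hypothetical:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the output above, already rounded to 4 decimals
printed = {
    "rouge1": (0.3608, 0.8537),
    "rouge2": (0.1562, 0.3750),
    "rougeL": (0.2784, 0.6585),
}

for metric, (p, r) in printed.items():
    print(f"{metric}: F1={f_measure(p, r):.4f}")
```

Since the inputs are rounded, the recomputed F1 can differ from the printed one in the last digit.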
