
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Evaluating Large Language Models (LLMs)
This notebook demonstrates methods for evaluating LLMs.  We focus on the task of summarization and cover accuracy, ROUGE-N, and perplexity.

### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives
1. Know how to compute ROUGE-N and other metrics.
2. Gain an intuitive understanding of ROUGE-N.
3. Test various models and model sizes on the same data, and compare their results.
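
To build intuition for the first two objectives, here is a minimal pure-Python sketch of ROUGE-N: the clipped count of n-grams shared by a reference summary and a candidate summary, reported as precision, recall, and F1. The helper names `ngrams` and `rouge_n` are illustrative, and the lowercase whitespace tokenization is a simplifying assumption. The `rouge_score` package installed below tokenizes differently, so its scores may not match this sketch exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall, and F1 on lowercase whitespace tokens (an assumption)."""
    ref_counts = ngrams(reference.lower().split(), n)
    cand_counts = ngrams(candidate.lower().split(), n)
    # Clipped overlap: an n-gram counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy strings for illustration only.
reference = "no tsunami warning was issued"
candidate = "no tsunami warning was issued today"
print(rouge_n(reference, candidate, n=1))
print(rouge_n(reference, candidate, n=2))
```

In this toy case the candidate contains every reference n-gram, so recall is 1.0 for both ROUGE-1 and ROUGE-2, while the extra word lowers precision.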

## Classroom Setup

In [0]:
%pip install rouge_score==0.1.2

Python interpreter will be restarted.
Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py): started
  Building wheel for rouge-score (setup.py): finished with status 'done'
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=e0e8b269b9bcac0e832127be5b39db54b8f619d8c52b206228d7fede16f88ed6
  Stored in directory: /root/.cache/pip/wheels/9b/3d/39/09558097d3119ca0a4d462df68f22c6f3c1b345ac63a09b86e
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2
Python interpreter will be restarted.


In [0]:
%pip install datasets evaluate

Python interpreter will be restarted.
Collecting datasets
  Using cached datasets-2.13.0-py3-none-any.whl (485 kB)
Collecting evaluate
  Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting aiohttp
  Using cached aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting pyarrow>=8.0.0
  Using cached pyarrow-12.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.0 MB)
Collecting multiprocess
  Using cached multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting xxhash
  Using cached xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Collecting frozenlist>=1.1.1
  Using cached frozenlist-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)
Collecting yarl<2.0,>=1.0
  Using cached yarl-1.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (269 kB)

In [0]:
# %run /Includes/Classroom-Setup

## How can we evaluate summarization?

Suppose you are developing a smartphone news app that must display automatically generated summaries of breaking news articles.  How can you tell whether the generated summaries are good?

![](https://drive.google.com/uc?export=view&id=1V6cMD1LgivCb850JDhva1DO9EWVH8rJ7)

## Dataset

We will use a subset of the `cnn_dailymail` dataset from See et al., 2017, downloadable from the Hugging Face `datasets` hub: https://huggingface.co/datasets/cnn_dailymail

This dataset provides news articles paired with summaries (in the "highlights" column).  Let's load the data and take a look at some examples.

In [0]:
import torch
from datasets import load_dataset

full_dataset = load_dataset("cnn_dailymail", version="3.0.0") 

# Use a small sample of the data during this lab, for speed.
sample_size = 100
sample = (
    full_dataset["train"]
    .filter(lambda r: "CNN" in r["article"][:25])
    .shuffle(seed=42)
    .select(range(sample_size))
)
sample

Downloading builder script:   0%|          | 0.00/8.33k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/9.88k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/15.1k [00:00<?, ?B/s]

Downloading and preparing dataset cnn_dailymail/default to /root/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de...


Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Filter:   0%|          | 0/287113 [00:00<?, ? examples/s]

Out[2]: Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 100
})
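
The chained `filter`, `shuffle`, and `select` calls above can be read as three plain-Python steps: keep rows whose first 25 characters mention "CNN", shuffle them with a fixed seed, and take the first `sample_size` rows. The sketch below mimics that logic on a toy list, without using the `datasets` library. The first three article openings come from the sample displayed in this notebook; the last row is hypothetical. Python's `random` module does not reproduce the ordering that `datasets` produces for the same seed, so only the logic carries over, not the exact rows selected.

```python
import random

# Toy rows: three real article openings from this notebook's sample,
# plus one hypothetical non-CNN row for contrast.
rows = [
    {"article": "(CNN) -- A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon"},
    {"article": "Warsaw, Poland (CNN) -- European football's governing body opened disciplinary proceedings"},
    {"article": "Centennial, Colorado (CNN) -- McKayla Hicks sat in the courtroom Monday"},
    {"article": "By a hypothetical staff reporter. This row stands in for a non-CNN article."},
]


def is_cnn(row):
    """Same test as the notebook's lambda: 'CNN' must appear in the first 25 characters."""
    return "CNN" in row["article"][:25]


# Step 1: filter. The last row mentions "CNN" only after character 25, so it is dropped.
filtered = [row for row in rows if is_cnn(row)]

# Step 2: shuffle a copy with a fixed seed so the result is repeatable.
shuffled = filtered[:]
random.Random(42).shuffle(shuffled)

# Step 3: select the first `sample_size` rows.
sample_size = 2
toy_sample = shuffled[:sample_size]

print(len(filtered), len(toy_sample))
```

Checking only the first 25 characters matches the dateline at the start of an article (for example "(CNN)" or "Warsaw, Poland (CNN)") and ignores mentions of CNN deeper in the text.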

In [0]:
display(sample.to_pandas())

article,highlights,id
"(CNN) -- A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon, according to the U.S. Geological Survey. The quake was centered about 200 miles north-northeast of Port Moresby and had a depth of 28 miles. No tsunami warning was issued, according to the Tsunami Warning Center. Papua New Guinea is on the so-called Ring of Fire, an arc of fault lines circling the Pacific Basin that is prone to frequent earthquakes and volcanic eruptions.",Papua New Guinea is on the so-called Ring of Fire . It's on an arc of fault lines that is prone to frequent earthquakes . No tsunami warning was issued .,8093dba7bc2260c26f18939826909ef27549c758
"(CNN) -- Pakistan took big steps towards leveling the two-Test cricket series against Australia after bowling out their opponents for just 88 runs in Britain on Wednesday. The Australians, who won the first match by 150 runs, failed to reach three figures in an innings for the first time since 1984 as captain Ricky Ponting paid for his decision to bat first in bowler-friendly conditions after winning the toss. Pakistan, led by new captain Salman Butt after Shahid Afridi ended his brief return to the Test arena following that defeat at Lord's, then reached 148-3 before rain cut short the opening day in Leeds. Teenage paceman Mohammad Aamer and fellow opening bowler Mohammad Asif took three wickets each as Australia crashed to the nation's lowest first-innings score since 1956 -- when dismissed for 80, also by Pakistan. ""There was moisture under the wicket so I think it was a shocking decision, especially for Australia,"" Pakistan bowler Umar Gul told reporters. ""I don't know what the captain and coach's decision was going to be but if we won the toss we were ready to bowl first. This morning it was swinging a lot and the ball was seaming."" Tim Paine's leading knock of 17 was Australia's fourth-lowest in a Test innings in which all 11 players have batted. The No. 7 was last man out in a fighting 47-ball stay, scoring one more run than all-rounder Marcus North. Gul also claimed two wickets while Umar Amin dismissed North and was also involved in the run-out of tailender Ben Hilfenhaus. The Pakistanis went into a 13-match losing run against Australia that goes back to 1995, but struck an early blow when the 18-year-old Aamer trapped opener Simon Katich leg before wicket for 13. Afridi quits Tests after 150-run defeat at Lord's . That opened the floodgates as Asif sent Shane Watson (5) and Ponting (6) back to the pavilion at Headingley in similar fashion, and Gul bowled Michael Clarke (3). 
From 29-4, Australia slumped further to 73-6 at lunch as Mike Hussey (5) also went lbw to Gul and North was caught behind off the medium pace of Amin. Aamer then took wickets with the first two balls of the middle session, bowling Steven Smith and Mitchell Johnson, but Hilfenhaus avoided becoming the hat-trick victim. Australia's bowlers could not extract the same amount of swing from the overcast conditions, and Butt put on 80 for the first wicket with Imran Farhat before being bowled by Hilfenhaus. Farhat also fell short of a half-century as he was trapped lbw by Watson for 45 with the score at 133, and the all-rounder then dispatched Azar Ali (30) seven runs later. Amin (1) and Umar Akmal (8) were unbeaten when bad light and then rain ended a day on which 13 wickets fell for just 236 runs. The series is being played in England due to security issues in Pakistan.","Australia collapse to 88 all out on opening day of second Test against Pakistan in Leeds . Pakistan, seeking to level two-match series, reached 148-3 when bad light stopped play . Australia captain Ricky Ponting surprisingly chose to bat first in overcast conditions . His team failed to reach three figures in a test for the first time since 1984 .",67d626156f971d0bf55e5f2a48e1ed965eb622a6
"(CNN) -- Federal prosecutors are pushing to force the Arizona man accused of fatally gunning down six people and wounding 13 others, including U.S. Rep. Gabrielle Giffords, to submit a handwriting sample -- a request that he, thus far, has refused. A motion was filed Monday, out of the office of U.S. Attorney Dennis Burke in Arizona, asking the court to compel Jared Lee Loughner to write out something so authorities can view his writing style. The government wants the sample to compare with handwritten notes found in Loughner's residence that include mentions of Giffords ""as well as references to guns and bullets,"" according to a court document. It says he has resisted such requests to date, ""arguing that the court lacks authority"" to force him to provide a sample. ""There being no other avenue to obtain the defendant's handwriting exemplar, the government now seeks an order to compel,"" prosecutors wrote in the motion. Last Thursday, a federal grand jury returned a new indictment against Loughner in which he is charged on 49 counts -- including murder and attempted murder -- related to the shooting outside a Tucson supermarket in January. The 22-year-old Tucson man was indicted on three counts of attempted murder, including one alleging that he tried to kill Giffords with a Glock semiautomatic handgun during the event she was hosting for constituents. After being shot through the head, the congresswoman is now undergoing rehabilitation at a medical facility in Houston, Texas. The new indictment -- which supersedes an earlier one, which had fewer charges -- adds murder charges connected to the deaths of John M. Roll, a federal district judge, and Gabriel M. Zimmerman, a staff member for Giffords. Loughner also faces charges in the deaths of Dorothy J. Morris, Phyllis C. Schneck, Dorwan C. Stoddard, and a child, referred to in the indictment as C-T G. Nine-year-old Christina-Taylor Green was among those killed in the shooting. 
Autopsy reports released Monday showed that Zimmerman, Schneck and Stoddard suffered fatal head wounds, while the three others were shot in the chest. Loughner could face a death sentence if convicted, Burke said last week, although prosecutors have not said yet whether they will seek the death penalty. Loughner, who is being held by authorities in Arizona, is expected to be arraigned on the new charges Wednesday in Tucson, the district attorney's office said.",Jared Loughner is refusing the government's request for a writing sample . Authorities want it to compare with notes found in his home after the shooting . Loughner faces 49 charges related to a mass shooting outside a Tucson market .,0d02fb8f0d406db956b128a5c1cc7bf3f13860a6
"Centennial, Colorado (CNN) -- McKayla Hicks sat in the courtroom Monday and stared at the man accused of a shooting spree that left her with a bullet in her jaw, killed a dozen people, and traumatized a community. ""I saw him and he's just this freak lookin' dude with some orange hair,"" the 17-year-old high school student said, calling suspect James Holmes ""pathetic looking."" ""You could feel in the room all the anger that everyone had for him,"" she said. ""Everyone in America hates this guy."" She wants him to get the death penalty. ""He tried to kill people,"" Hicks said. ""So I think he needs to be killed."" Hicks was in the theater next door to the one in which a gunman opened fire at a midnight showing of the new Batman movie on Friday in Aurora. Why did suspect avoid social media? One of the bullets -- she thinks it was early in the shooting spree -- pierced the wall and entered the theater she was in, slashing through her mouth and becoming lodged in her jaw. It will stay there forever, in the lower left part of her jaw. Doctors tell her that removing it would cause too much nerve damage. ""I like the idea. I think it's cool that I have a bullet in my chin,"" she said with a smile, calling it ""a souvenir."" She considers herself lucky to be alive, and not injured worse. Lori Schafer helped save her that night, and was in court with her Monday. ""It made it hit home,"" she said. ""It wasn't just a picture on the TV anymore."" ""I just want him to be where he's never going to be able to hurt someone again."" Opinion: Mass murder and powerful guns . Hicks said that when she was shot, she didn't know what had happened. She started bleeding, but ""thought someone had thrown something at me. And when I got up to leave the theater everyone else was still sitting down. So I was really upset, like, 'Who's personally doing this to me?' 
"" ""Then we saw the victims from Theater Nine -- and I realized it wasn't just me."" Schafer, who will begin college in the fall, helped her out of the building, pulled a fire alarm, called 911, and helped get Hicks seated outside on a curb. It was only then that she touched her face and realized there was a bullet in her. ""It changed me as a person,"" she said, adding, ""I just don't take people for granted -- especially the ones that save your life."" ""Being in a real life-or-death situation,"" she said, helped her realize that ""the little things don't matter. It shows you what really does matter.""","Shooting victim McKayla Hicks went to hearing for accused killer James Holmes . She said she could feel ""all the anger that everyone had for"" Holmes . The incident has changed her, said Hicks . A bullet lodged in Hicks' jaw -- doctors said it is safer to leave it there .",39aee887c6d34bd311c826142b14037e6f2639ee
"(CNN) -- Double-amputee sprinter Oscar Pistorius will compete at the Olympic Games after he was named in both the individual 400m and the South Africa 4x400m relay squad for London 2012. The four-time Paralympic Games gold medalist won a silver medal as part of South Africa's 4x400m relay at the World Championships in Daegu last year, although he was left out of the line-up for the final. He also looked set to be excluded from the individual event in London after failing to run the Olympic 'A' standard qualification mark twice in international competition. But the South African selectors relaxed their qualification rules Wednesday and named him in both events, much to his delight. ""Today is truly one of the proudest days of my life. To have been selected to represent Team South Africa at the London 2012 Olympic Games in the individual 400m and the 4x400m relay is a real honor and I am so pleased that years of hard work, determination and sacrifice have all come together,"" he told his official website. ""I have a phenomenal team behind me who have helped get me here and will now put everything we can into the final few weeks of preparations before the Olympic Games where I am aiming to race well, post good times and maybe even a personal best time on the biggest stage of them all."" Pistorius, whose legs were amputated below the knee when he was 11 months old due to a bone defect, runs on special carbon fiber blades from which his nickname ""The Blade Runner"" derives. Saudi female Olympians: Historic breakthrough or false dawn? He will become the first Paralympian to compete in track and field at the able bodied Olympics. The Johannesburg-born athlete is joined in the South Africa track and field team by Caster Semanya. The 800m world champion was the subject of a gender test by the International Association of Athletics Federations following her victory in Berlin at the world championships three years ago, but has since been cleared to compete. 
In other selection news, Team Great Britain have announced cyclist David Millar will be part of its road race team. The Scot was handed a two-year doping ban in 2004, which prevented him competing for Britain in previous Games. But earlier this year, the British Olympic Association was forced by the Court of Arbitration for Sport (CAS) to overturn its policy of not selecting athletes who have been found guilty of doping. CAS's verdict opened the door for 100m sprinter Dwain Chambers to compete at his home Games. Human to Hero: Blade Runner's 2012 Olympic ambition .",Oscar Pistorius to become first double-amputee Olympian in London . The 25-year-old has been selected in the individual 400 and 4x400m relay . Pistorius had both of his legs amputated when he was 11 months old . He won a silver medal at last year's World Championships in the 4 x 400m relay .,cc83ecdf08f0b598c3b97b3e2819c7e0ae7ca4f2
"(CNN) -- A grand jury has indicted Texas Gov. Rick Perry, a potential 2016 presidential candidate, saying he abused his power by trying to pressure a district attorney to resign. The two felony counts against Perry, a Republican, stem from his threat to veto funding for a statewide public integrity unit run by Travis County District Attorney Rosemary Lehmberg unless she stepped down, the special prosecutor in the case, Michael McCrum, said. Perry attorney David L. Botsford called the indictment a ""political abuse of the court system."" He said the action ""violated the separation of powers"" and ""sets a dangerous precedent by allowing a grand jury to punish the exercise of a lawful and constitutional authority afforded to the Texas governor."" CNN affiliate KVUE reported that Perry will have to report to the Travis County Jail in the capital of Austin to be booked, fingerprinted and have his photo made for a mugshot. Perry can continue to serve as governor while under indictment, KVUE reported. His attorneys could seek to have the charges thrown out, a motion that would delay the case, at the very least. The grand jury in Travis County indicted the governor on charges of coercion of a public servant and abuse of his official capacity. Read the indictment (PDF) The charges have serious political implications, both in Texas and beyond. Perry is entering his final few months in office after a historic 14-year run in Austin. The Republican running to replace Perry is state Attorney General Greg Abbott, who will have to answer questions about the legal drama. Abbott is facing off against Democratic star Wendy Davis, whose campaign is already making hay of Friday's news. Perry's presidential prospects could be damaged. It's an open secret he's laying groundwork for a second presidential campaign after his disastrous 2012 effort. The governor has positioned himself as an early conservative alternative to New Jersey Gov. 
Chris Christie, another GOP presidential contender. Perry is scheduled to visit the early primary states of New Hampshire and South Carolina in the coming weeks to meet with Republican activists and legislators. According to McCrum, the indictment alleges that the circumstances around Perry's veto threat amounted to a misuse of state money earmarked by the Legislature to fund the public integrity unit in Travis County run by Lehmberg. The second charge alleges that he improperly used the veto threat to get her to resign following her arrest on a drunk driving charge. She stayed in office. ""I'm ready to go forward (in) my task as district attorney. In this case, the grand jury has spoken and I'm going forward to carry out the duties that have been bestowed upon me,"" McCrum said. ""I feel confident about the charges that have been filed,"" he added. Mary Anne Wiley, general counsel for Perry's office, said the ""veto in question was made in accordance"" with the authority ""afforded to every governor"" under the state's constitution. ""We will continue to aggressively defend the governor's lawful and constitutional action, and believe we will ultimately prevail,"" Wiley said in a statement. Political opponents of Perry, who unsuccessfully sought the Republican presidential nomination in 2012, urged him to resign. ""Governor Rick Perry has brought dishonor to his office, his family and the state of Texas,"" the Texas Democratic Party said in a statement. ""We call on Governor Perry to immediately step down from office. Texans deserve real leadership and this is unbecoming of our governor."" Opinion: The case against Perry . CNN's Steve Brusk, Dana Davidsen, Leslie Bentz and Peter Hamby contributed to this report.","NEW: Perry lawyer calls indictments ""political abuse of the court system"" Indictment by country grand jury in Texas stems from effort to remove local prosecutor . Perry allegedly threatened to veto funding for a program run by the DA in Austin . 
Indictment could have political implications .",51fb6465303595cb201b427ca04b594b182a9722
"(CNN)An Argentine prosecutor said Friday there is enough evidence to continue an investigation into whether President Cristina Fernandez de Kirchner hid Iran's alleged role in a deadly 1994 bombing, a probe that paused after a different prosecutor died mysteriously last month. Federal Prosecutor Gerardo Pollicita filed a 61-page report essentially endorsing what prosecutor Alberto Nisman claimed before he died in January: that evidence shows Fernandez and other top officials tried to cover up Iran's alleged involvement in the bombing that killed 85 people at a Jewish center in Buenos Aires. Pollicita sent his report to a judge, who is expected to examine it next week after he returns from vacation. The judge can determine whether the case can proceed to trial. Nisman, a special prosecutor investigating the 1994 bombing of the Argentine Israelite Mutual Association, alleged last month that Iran was behind the attack, and that Fernandez, to help sweeten a trade deal, covered up Tehran's involvement. That trade deal, Nisman alleged, involved cash-strapped Argentina receiving Iranian oil in exchange for meat and grain. Nisman made the allegation in a nearly 300-page report. But on January 18, one day before he was to testify before lawmakers about his allegations, he was found dead in his apartment with a gunshot wound to the head. Nisman filled with fear before his death, friend says . In his trash can, investigators say, was a 6-month-old draft warrant for Fernandez's arrest. Fernandez initially called it a suicide, but a test found no gunpowder residue on Nisman's hands. Fernandez said a few days later that she didn't believe Nisman killed himself, alleging that some members of the country's intelligence agency fed Nisman false information about a cover-up and were responsible for his death. The President and the other accused government officials deny any cover-up in the bombing. 
Anibal Fernandez, the President's general secretary, said the allegations against Fernandez are ""a clear maneuver to destabilize democracy"" in the South American country. He also said the investigation lacks ""judicial value"" or ""importance."" About 10 years ago, Nisman was appointed as special prosecutor to investigate the bombing by then-President Nestor Kirchner, Fernandez's late husband. Conspiracy theories aplenty in wake of prosecutor's death . CNN's Rafael Romo, Elwyn Lopez and Ben Brumfield contributed to this report.","Prosecutor to judge: Enough evidence for investigation of President to continue . Different prosecutor, before he died, alleged President hid Iran's alleged involvement in bombing . President Cristina Fernandez de Kirchner, other officials deny cover-up .",f4d3394791035a0571f1841d5d21661fdb39d74f
"Warsaw, Poland (CNN) -- European football's governing body opened disciplinary proceedings Saturday against Croatia over what it said was racist behavior by its supporters during a Euro 2012 match against Italy. UEFA said it was acting over ""the setting-off and throwing of fireworks, and the improper conduct of supporters,"" including racist chants and the displaying of racist symbols. The UEFA Control and Disciplinary Body will deal with the case on Tuesday, it said. Croatia drew 1-1 in the match against Italy in Poznan on Thursday, with Mario Mandzukic scoring for the Croatians and Andrea Pirlo claiming a goal for Italy. UEFA president Michel Platini urged fans who are attending decisive matches on Saturday night to ""conduct themselves with dignity and respect."" Russia plays Greece in Warsaw, while the Czech Republic plays Poland in Wroclaw. ""Of course, there is rivalry and passion, and all teams want to win -- but we must remember that the results on the pitch are what really matter,"" Platini said in a statement. ""EURO 2012 is a celebration of football and I invite the fans, the vast majority of whom have conducted themselves in an exemplary manner so far, to continue to do so for the remainder of the tournament."" The issue of racism has threatened to mar the soccer tournament, which is being co-hosted by Poland and Ukraine. Members of the Dutch squad claimed to hear monkey noises during an open training session in Krakow, Poland, before the tournament started, though the Dutch FA opted not to lodge an official complaint with UEFA. In addition, family members of two black English players chose not to travel to the competition for fear of being subjected to racism. UEFA has already taken the step of writing a letter to the mayors of each host city asking for a zero-tolerance approach to racist abuse. UEFA's disciplinary body has also been busy since the tournament started just over a week ago. 
Russia has already been fined and handed a suspended points deduction for improper conduct against the Czech Republic, while also still awaiting the verdict of a UEFA investigation into alleged racist chanting during the match. The German Football Federation was fined $12,500 after its fans threw paper onto the pitch during a meeting with Portugal, with the Iberian team also ordered to pay $6,250 for delaying the start of the second half. A Danish player, Nicklas Bendtner, may also face sanctions next week after lowering his shorts during a game Wednesday to reveal that he was wearing underwear displaying the name of an Irish bookmaker named Paddy Power. Under the laws of the games as outlined by FIFA, the global body which governs soccer, undershorts must be the same color as the shorts worn by the player. His shorts were red but the Paddy Power undershirts were green. CNN's Claudia Dominguez and Tom McGowan contributed to this report.",NEW: UEFA president Michel Platini urges fans to behave at decisive matches Saturday . UEFA says there were racist chants from Croatian fans during a match against Italy . The issue of racism threatens to mar the Euro 2012 soccer tournament . A disciplinary panel will consider the cause against Croatia on Tuesday .,76ba8e9110a66a1b1293abe34ef4fab254371af8
"(CNN) -- Two issues -- security and immigration -- often get too much attention when it comes to talking about the U.S.-Mexico relationship, U.S. President Barack Obama said Thursday. Now, Obama said, it's time to forge deeper economic connections to create more jobs and more trade on both sides of the border. ""That's the focus of my visit,"" he told reporters after meeting with Mexican President Enrique Peña Nieto in the country's capital. But even as Obama and Peña Nieto pushed to shift the tone more toward trade and economics, security issues loomed large over Thursday's meeting. Peña Nieto said his government remains committed to fighting organized crime, but that the United States and Mexico must ""cooperate on the basis of mutual respect, to be more efficient in our security strategy that we are implementing in Mexico."" Obama stressed that the countries will continue to cooperate closely on security, but he didn't specify how. ""I agreed to continue our close cooperation on security, even as that nature of that close cooperation will evolve,"" he said. It's up to the Mexican people, Obama said, ""to determine their security structures and how it engages with other nations, including the United States."" In the meantime, he said, the United States remains committed to reducing the demand for drugs north of the border, and the southward flow of illegal guns and cash that help fuel violence. ""I think it's natural that a new administration here in Mexico is looking carefully at how it's going to approach what is obviously a serious problem,"" Obama said, ""and we are very much looking forward to cooperating in any ways that we can to battle organized crime."" High-profile cartel takedowns were a hallmark of former President Felipe Calderon's tenure. Peña Nieto has vowed to take a different approach, focusing more on education problems and social inequality that he says fuel drug violence. 
The details of his policies are still coming into focus, and analysts say his government has deliberately tried to shift drug violence out of the spotlight. Before Obama's arrival, a spate of news reports this week on both sides of the border detailed changes in how Mexico cooperates with the United States. Under the new rules, all U.S. requests for collaboration with Mexican agencies will flow through a single office, Interior Minister Miguel Angel Osorio Chong told Mexico's state-run Notimex news agency. It is a drastic change from recent years, when U.S. agents enjoyed widespread access to their Mexican counterparts. Critics have expressed concerns that Peña Nieto's government will turn a blind eye to cartels or negotiate with them -- something he repeatedly denied on the campaign trail last year. On Tuesday -- two days before Obama's arrival -- his government arrested the father-in-law of Joaquin ""El Chapo"" Guzman, head of Mexico's Sinaloa cartel and one of the country's most-wanted drug lords. Speaking to reporters after his meeting with Obama on Thursday, Peña Nieto emphasized the importance of reducing violence, and also the importance of Mexico's relationship with the United States extending beyond the drug war. ""We don't want to make this relationship targeted on one single issue,"" he said. ""We want to place particular emphasis on the potential in the economic relationship between Mexico and the United States."" To achieve that goal, Peña Nieto said, the presidents agreed to create a new high-level group to discuss economic and trade relations between the two nations. The group, which will include Cabinet ministers from both countries and U.S. Vice President Joe Biden, will have its first meeting this fall, Peña Nieto said. Imports and exports between the United States and Mexico totaled nearly $500 billion last year, and before Obama's arrival officials on both sides of the border said economic relations would be a focal point during the U.S. 
president's visit. ""When the economy in Mexico has grown, and people have opportunity, a lot of our problems are solved, or we have the resources to solve them,"" Obama said Thursday. The emphasis on the economy Thursday was a significant shift, said Jason Marczak, director of policy at the Americas Society and Council of the Americas. ""The conversations between Mexico and the United States are changing,"" he told CNN en Español. Obama is scheduled to deliver a speech at the National Anthropology Museum in Mexico City on Friday morning. In the afternoon, he will travel to Costa Rica, where he will meet President Laura Chinchilla and other regional leaders. CNN's Mariano Castillo and CNN en Español's Juan Carlos Lopez and Mario Gonzalez contributed to this report.",A new high-level group to discuss economic cooperation will convene in the fall . Obama says ties between the U.S. and Mexico go beyond security and immigration . Mexico's president says his administration is committed to fighting organized crime . The U.S. president will travel to Costa Rica on Friday to meet with Central American leaders .,fbca9bf96c440bbfab59de6bd5f6d06ed609ed99
"(CNN) -- More than 100 police officers and others were searching Friday in a southeastern Louisiana parish for a murder suspect who escaped from jail with three other inmates, a law enforcement official said. Timothy Murray, 29, who is charged with murder, remains at large, authorities in Louisiana say. Searchers are still focusing inside St. Tammany Parish, on the northern shore of Lake Pontchartrain, 30 miles north of New Orleans, said Capt. George Bonnett of the St. Tammany Parish Sheriff's Office. At large is Timothy Murray, 29, who is charged with murder, Bonnett said. Authorities believe Murray may have been injured during the escape, but Bonnett wouldn't elaborate. The inmates escaped about 9 p.m. Thursday from the St. Tammany Parish Jail in Covington, Bonnett said. As many as 250 sheriff's deputies, Covington police officers, Louisiana State police and corrections officials were involved in the search overnight, using dogs, two helicopters and thermal-imaging equipment loaned from Livingston Parish, Bonnett said. The other three men were found about 1:30 a.m. Friday in a wooded area about a mile from the jail, he said. Three of the inmates were awaiting trial; one already had been convicted, Bonnett said. The captured inmates were Gary Slaydon, 27; Eric Buras, 30, and Jason Gainey, 27. Slaydon is charged with attempted murder. Buras is a murder suspect and Gainey has been convicted of murder, Bonnett said. He said the escape was not discovered until a resident and Covington police reported seeing what appeared to be inmates in jail uniforms walking down a street. About the time those calls came in, jailers were doing a routine head count and found the four men missing, Bonnett said. He said the means of escape was under investigation, but it has been determined that their escape wasn't due to human error. He repeated what St. 
Tammany Parish Sheriff Jack Strain said early Friday: ""Four inmates were able to defeat the structure of the maximum security area of our jail."" Deputies have canvassed neighborhoods, going door to door to warn residents that an inmate is still at large.","Four inmates escape from jail in St. Tammany Parish, Louisiana . Three found in area near jail north of New Orleans, official says . Man charged with murder remains at large, official says . Deputies canvassing neighborhoods in hunt for escapee .",a91b42eb3bfaa9dd1d6fe5e07d595f0acdbf29bc


In [0]:
example_article = sample["article"][0]
example_summary = sample["highlights"][0]
print(f"Article:\n{example_article}\n")
print(f"Summary:\n{example_summary}")

Article:

Summary:
Papua New Guinea is on the so-called Ring of Fire .
It's on an arc of fault lines that is prone to frequent earthquakes .


## Summarization

In [0]:
import pandas as pd
import torch
import gc
from transformers import AutoTokenizer, T5ForConditionalGeneration

In [0]:
def batch_generator(data: list, batch_size: int):
    """
    Creates batches of size `batch_size` from a list.
    """
    s = 0
    e = s + batch_size
    while s < len(data):
        yield data[s:e]
        s = e
        e = min(s + batch_size, len(data))


def summarize_with_t5(
    model_checkpoint: str, articles: list, batch_size: int = 8
) -> list:
    """
    Compute summaries using a T5 model.
    This is similar to a `pipeline` for a T5 model but does tokenization manually.

    :param model_checkpoint: Name for a model checkpoint in Hugging Face, such as "t5-small" or "t5-base"
    :param articles: List of strings, where each string represents one article.
    :param batch_size: Number of articles to tokenize and summarize per batch
    :return: List of strings, where each string represents one article's generated summary
    """
    if torch.cuda.is_available():
        device = "cuda:0"
    else:
        device = "cpu"

    model = T5ForConditionalGeneration.from_pretrained(
        model_checkpoint).to(device)
    tokenizer = AutoTokenizer.from_pretrained(
        model_checkpoint, model_max_length=1024)

    def perform_inference(batch: list) -> list:
        inputs = tokenizer(
            batch, max_length=1024, return_tensors="pt", padding=True, truncation=True
        )

        summary_ids = model.generate(
            inputs.input_ids.to(device),
            attention_mask=inputs.attention_mask.to(device),
            num_beams=2,
            min_length=0,
            max_length=40,
        )
        return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

    res = []

    summary_articles = list(map(lambda article: "summarize: " + article, articles))
    for batch in batch_generator(summary_articles, batch_size=batch_size):
        res += perform_inference(batch)

        torch.cuda.empty_cache()
        gc.collect()

    # clean up
    del tokenizer
    del model
    torch.cuda.empty_cache()
    gc.collect()
    return res
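
The batching used by `summarize_with_t5` can be checked in isolation. The sketch below is an equivalent standalone variant of `batch_generator`; the name `batch_generator_sketch` is hypothetical and chosen only to avoid shadowing the notebook's helper. It shows that the final batch is simply shorter when the list length is not a multiple of `batch_size`.

```python
from typing import Iterator, List


def batch_generator_sketch(data: List, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of `data` with at most `batch_size` items each."""
    for start in range(0, len(data), batch_size):
        # Slicing past the end of a list is safe, so the last batch may be shorter.
        yield data[start:start + batch_size]


batches = list(batch_generator_sketch(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```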

In [0]:
t5_small_summaries = summarize_with_t5("t5-small", sample["article"])

In [0]:
reference_summaries = sample["highlights"]

In [0]:
display(
    pd.DataFrame.from_dict(
        {
            "generated": t5_small_summaries,
            "reference": reference_summaries,
        }
    )
)

generated,reference
"a magnitude 6.7 earthquake rattles Papua new Guinea early Friday afternoon. no tsunami warning was issued, according to the Tsunami Warning Center. papu",Papua New Guinea is on the so-called Ring of Fire . It's on an arc of fault lines that is prone to frequent earthquakes . No tsunami warning was issued .
the two-Test cricket series is being played in England due to security issues. the series is being played in England due to security issues in Pakistan. the series is being played in,"Australia collapse to 88 all out on opening day of second Test against Pakistan in Leeds . Pakistan, seeking to level two-match series, reached 148-3 when bad light stopped play . Australia captain Ricky Ponting surprisingly chose to bat first in overcast conditions . His team failed to reach three figures in a test for the first time since 1984 ."
federal prosecutors want jared Lee Loughner to submit a handwriting sample. prosecutors want the sample to compare with handwritten notes found in his residence. the federal grand,Jared Loughner is refusing the government's request for a writing sample . Authorities want it to compare with notes found in his home after the shooting . Loughner faces 49 charges related to a mass shooting outside a Tucson market .
"new: ""he tried to kill people,"" a 17-year-old student says. new: ""you could feel in the room all the anger that everyone had for him,"" she","Shooting victim McKayla Hicks went to hearing for accused killer James Holmes . She said she could feel ""all the anger that everyone had for"" Holmes . The incident has changed her, said Hicks . A bullet lodged in Hicks' jaw -- doctors said it is safer to leave it there ."
double-amputee sprinter Oscar Pistorius will compete at the 2012 able bodied Olympics. the four-time paralympic gold medalist won a silver,Oscar Pistorius to become first double-amputee Olympian in London . The 25-year-old has been selected in the individual 400 and 4x400m relay . Pistorius had both of his legs amputated when he was 11 months old . He won a silver medal at last year's World Championships in the 4 x 400m relay .
new: a grand jury indicted the governor on charges of coercion of a public servant and abuse of his capacity. new: a grand jury indicted the,"NEW: Perry lawyer calls indictments ""political abuse of the court system"" Indictment by country grand jury in Texas stems from effort to remove local prosecutor . Perry allegedly threatened to veto funding for a program run by the DA in Austin . Indictment could have political implications ."
prosecutor says there is enough evidence to continue an investigation. prosecutor Alberto Nisman alleged that he covered up the 1994 bombing. he was found dead in,"Prosecutor to judge: Enough evidence for investigation of President to continue . Different prosecutor, before he died, alleged President hid Iran's alleged involvement in bombing . President Cristina Fernandez de Kirchner, other officials deny cover-up ."
"UEFA says it is acting over ""the setting-off and throwing of fireworks"" UEFA president calls fans to ""conduct themselves with dignity and respect"" UEFA has already",NEW: UEFA president Michel Platini urges fans to behave at decisive matches Saturday . UEFA says there were racist chants from Croatian fans during a match against Italy . The issue of racism threatens to mar the Euro 2012 soccer tournament . A disciplinary panel will consider the cause against Croatia on Tuesday .
new: a new high-level group will discuss economic and trade relations between the two nations. new: the presidents will meet with the president on friday. new: the,A new high-level group to discuss economic cooperation will convene in the fall . Obama says ties between the U.S. and Mexico go beyond security and immigration . Mexico's president says his administration is committed to fighting organized crime . The U.S. president will travel to Costa Rica on Friday to meet with Central American leaders .
"police, others searching for murder suspect in southeastern Louisiana parish. Timothy Murray, 29, charged with murder, remains at large, authorities say. inmates escaped about 9 p","Four inmates escape from jail in St. Tammany Parish, Louisiana . Three found in area near jail north of New Orleans, official says . Man charged with murder remains at large, official says . Deputies canvassing neighborhoods in hunt for escapee ."


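You may see some warning messages when running the cells above.  Note that `summarize_with_t5` tokenizes and calls `generate` manually instead of using a `pipeline`: pipelines are handy, but they provide less control over the tokenizer and model.  We will dive deeper later.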

But first, let's see how our summarization pipeline does!  We'll compute 0/1 accuracy, a classic ML evaluation metric.

In [0]:
accuracy = 0.0
for i in range(len(reference_summaries)):
    generated_summary = t5_small_summaries[i]
    if generated_summary == reference_summaries[i]:
        accuracy += 1.0
accuracy = accuracy / len(reference_summaries)

print(f"Achieved accuracy {accuracy}!")

Achieved accuracy 0.0!


Accuracy zero?!?  We can see that the (very generic) metric of 0/1 accuracy is not useful for summarization.  Thinking about this more, small variations in wording may not matter much, and many different summaries may be equally valid.  So how can we evaluate summarization?

## ROUGE

Now that we can generate summaries---and we know 0/1 accuracy is useless here---let's look at how we can compute a meaningful metric designed to evaluate summarization: ROUGE.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of evaluation metrics for comparing summaries, introduced by Lin (2004).  See https://en.wikipedia.org/wiki/ROUGE_(metric) for more info.  Here, we use the Hugging Face `evaluate` library as a wrapper around the `rouge_score` package.  This package provides 4 scores:

* `rouge1`: ROUGE computed over unigrams (single words or tokens)
* `rouge2`: ROUGE computed over bigrams (pairs of consecutive words or tokens)
* `rougeL`: ROUGE based on the longest common subsequence shared by the summaries being compared
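* `rougeLsum`: like `rougeL`, but at "summary level," i.e., newlines are treated as sentence boundaries and the longest common subsequence is computed sentence by sentence and then combined (`rougeL` itself ignores newlines)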
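
To build intuition for `rouge1` and `rouge2`, here is a minimal from-scratch sketch of ROUGE-N. It is not the `rouge_score` implementation: the tokenizer is a simplifying assumption (lowercase alphanumeric runs), there is no stemming, and the function names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    # Assumption: lowercase and keep alphanumeric runs only (no stemming).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens: list, n: int) -> Counter:
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_sketch(prediction: str, reference: str, n: int = 1) -> dict:
    pred = ngram_counts(tokenize(prediction), n)
    ref = ngram_counts(tokenize(reference), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "Large language models beat world record"
unigram_scores = rouge_n_sketch("Large language", reference, n=1)
bigram_scores = rouge_n_sketch("Large language", reference, n=2)
print(unigram_scores)
print(bigram_scores)
```

Precision is the share of predicted n-grams that appear in the reference, recall is the share of reference n-grams that were predicted, and the F-measure combines the two.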

In [0]:
import evaluate
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

rouge_score = evaluate.load("rouge")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

You can call the `rouge_score` evaluator directly, but we provide a convenience function below to handle the expected input format.

In [0]:
def compute_rouge_score(generated: list, reference: list) -> dict:
    """
    Compute ROUGE scores on a batch of articles.

    This is a convenience function wrapping Hugging Face `rouge_score`,
    which expects sentences to be separated by newlines.

    :param generated: Summaries (list of strings) produced by the model
    :param reference: Ground-truth summaries (list of strings) for comparison
    """
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )

In [0]:
# ROUGE scores for our batch of articles
compute_rouge_score(t5_small_summaries, reference_summaries)

Out[16]: {'rouge1': 0.30949498883996074,
 'rouge2': 0.10631721148565595,
 'rougeL': 0.2216480761444404,
 'rougeLsum': 0.2818479073623581}

## Understanding ROUGE scores

In [0]:
# Sanity check: What if our predictions match the references exactly?
compute_rouge_score(reference_summaries, reference_summaries)

Out[17]: {'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

In [0]:
# And what if we fail to predict anything?
compute_rouge_score(
    generated=["" for _ in range(len(reference_summaries))],
    reference=reference_summaries,
)

Out[18]: {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}

Stemming predictions and references can help to ignore minor differences.

We will use `rouge_score.compute()` directly for these hand-constructed examples.

In [0]:
rouge_score.compute(
    predictions=["Large language models beat world record"],
    references=["Large language models beating world records"],
    use_stemmer=False,
)

Out[19]: {'rouge1': 0.6666666666666666,
 'rouge2': 0.4000000000000001,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}

In [0]:
rouge_score.compute(
    predictions=["Large language models beat world record"],
    references=["Large language models beating world records"],
    use_stemmer=True,
)

Out[20]: {'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
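
The two results above (with and without stemming) can be reproduced by hand. The sketch below counts overlapping unigrams and bigrams with and without stemming. The `TOY_STEMS` lookup is a hypothetical stand-in for a real stemmer and only covers the two inflected words in this example; the helper names are likewise illustrative.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real stemmer: it only knows the two inflected
# words that appear in this example.
TOY_STEMS = {"beating": "beat", "records": "record"}


def tokens(text: str, stem: bool) -> list:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [TOY_STEMS.get(w, w) for w in words] if stem else words


def ngram_f1(pred: list, ref: list, n: int) -> float:
    # zip over n shifted copies of the token list yields all n-grams.
    pred_ngrams = Counter(zip(*(pred[i:] for i in range(n))))
    ref_ngrams = Counter(zip(*(ref[i:] for i in range(n))))
    overlap = sum((pred_ngrams & ref_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)


prediction = "Large language models beat world record"
reference = "Large language models beating world records"

results = {}
for stem in (False, True):
    p, r = tokens(prediction, stem), tokens(reference, stem)
    results[stem] = (ngram_f1(p, r, 1), ngram_f1(p, r, 2))
    print(f"stem={stem}: rouge1={results[stem][0]:.3f}, rouge2={results[stem][1]:.3f}")
```

Without stemming, only 4 of 6 unigrams and 2 of 5 bigrams match; once "beating" and "records" are reduced to their stems, every n-gram matches.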

Let's look at how the ROUGE score behaves in various situations.

In [0]:
# What if we predict exactly 1 word correctly?
rouge_score.compute(
    predictions=["Large language models beat world record"],
    references=["Large"],
    use_stemmer=True,
)

Out[21]: {'rouge1': 0.2857142857142857,
 'rouge2': 0.0,
 'rougeL': 0.2857142857142857,
 'rougeLsum': 0.2857142857142857}

In [0]:
# The ROUGE F-measure reported here is symmetric with respect to predictions and references (precision and recall swap).
rouge_score.compute(
    predictions=["Large"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

Out[22]: {'rouge1': 0.2857142857142857,
 'rouge2': 0.0,
 'rougeL': 0.2857142857142857,
 'rougeLsum': 0.2857142857142857}
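
The identical scores in the last two cells come from the F-measure: swapping prediction and reference swaps precision and recall, and their harmonic mean is unchanged. A minimal unigram-only sketch, assuming whitespace tokenization and non-empty inputs (the helper name `unigram_prf` is made up for illustration):

```python
from collections import Counter


def unigram_prf(prediction: str, reference: str) -> tuple:
    # Assumes both strings contain at least one token.
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


long_text = "Large language models beat world record"
short_text = "Large"

forward = unigram_prf(prediction=long_text, reference=short_text)
swapped = unigram_prf(prediction=short_text, reference=long_text)
print("forward (P, R, F1):", forward)
print("swapped (P, R, F1):", swapped)
```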

In [0]:
# What about 2 words?  Note how 'rouge1' and 'rouge2' compare with the case when we predict exactly 1 word correctly.
rouge_score.compute(
    predictions=["Large language"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

Out[23]: {'rouge1': 0.5, 'rouge2': 0.33333333333333337, 'rougeL': 0.5, 'rougeLsum': 0.5}

In [0]:
# Every word is correct here, but two chunks are swapped.  Note how rouge1 ignores word order, while rouge2 and rougeL penalize it.
rouge_score.compute(
    predictions=["Models beat large language world record"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

Out[24]: {'rouge1': 1.0,
 'rouge2': 0.6,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}
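
The `rougeL` value above follows from the longest common subsequence (LCS) of the two token sequences. The sketch below uses the classic dynamic-programming LCS on whitespace tokens; it is an illustration under that tokenization assumption, not the `rouge_score` code, and `lcs_length` is a hypothetical helper name.

```python
def lcs_length(a: list, b: list) -> int:
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


pred_tokens = "models beat large language world record".split()
ref_tokens = "large language models beat world record".split()

lcs = lcs_length(pred_tokens, ref_tokens)
precision = lcs / len(pred_tokens)
recall = lcs / len(ref_tokens)
rouge_l_f1 = 2 * precision * recall / (precision + recall)
print(lcs, rouge_l_f1)
```

Only one of the swapped chunks ("models beat" or "large language") can be part of a common subsequence together with "world record", so the LCS has 4 of the 6 tokens.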

## Compare small and large models

We've been working with the `t5-small` model so far.  Let's compare several models with different sizes and architectures in terms of their ROUGE scores and some example generated summaries.

In [0]:
def compute_rouge_per_row(
    generated_summaries: list, reference_summaries: list
) -> pd.DataFrame:
    """
    Generates a DataFrame with per-example ROUGE scores next to the generated and reference summaries.
    """
    generated_with_newlines = [
        "\n".join(sent_tokenize(s.strip())) for s in generated_summaries
    ]
    reference_with_newlines = [
        "\n".join(sent_tokenize(s.strip())) for s in reference_summaries
    ]
    scores = rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
        use_aggregator=False,
    )
    scores["generated"] = generated_summaries
    scores["reference"] = reference_summaries
    return pd.DataFrame.from_dict(scores)

### T5-small

The [T5](https://huggingface.co/docs/transformers/model_doc/t5) [[paper]](https://arxiv.org/pdf/1910.10683.pdf) family of models are text-to-text transformers that have been trained on a multi-task mixture of unsupervised and supervised tasks. They are well suited for tasks such as summarization, translation, text classification, question answering, and more.

The t5-small version of the T5 models has 60 million parameters.

In [0]:
# We computed t5_small_summaries above already.
compute_rouge_score(t5_small_summaries, reference_summaries)

Out[26]: {'rouge1': 0.30949498883996074,
 'rouge2': 0.10631721148565595,
 'rougeL': 0.2216480761444404,
 'rougeLsum': 0.2818479073623581}

In [0]:
t5_small_results = compute_rouge_per_row(
    generated_summaries=t5_small_summaries, reference_summaries=reference_summaries
)
display(t5_small_results)

rouge1,rouge2,rougeL,rougeLsum,generated,reference
0.4074074074074074,0.2307692307692307,0.2962962962962963,0.4074074074074074,"a magnitude 6.7 earthquake rattles Papua new Guinea early Friday afternoon. no tsunami warning was issued, according to the Tsunami Warning Center. papu",Papua New Guinea is on the so-called Ring of Fire . It's on an arc of fault lines that is prone to frequent earthquakes . No tsunami warning was issued .
0.2391304347826086,0.0,0.1956521739130435,0.217391304347826,the two-Test cricket series is being played in England due to security issues. the series is being played in England due to security issues in Pakistan. the series is being played in,"Australia collapse to 88 all out on opening day of second Test against Pakistan in Leeds . Pakistan, seeking to level two-match series, reached 148-3 when bad light stopped play . Australia captain Ricky Ponting surprisingly chose to bat first in overcast conditions . His team failed to reach three figures in a test for the first time since 1984 ."
0.4545454545454546,0.15625,0.3939393939393939,0.4545454545454546,federal prosecutors want jared Lee Loughner to submit a handwriting sample. prosecutors want the sample to compare with handwritten notes found in his residence. the federal grand,Jared Loughner is refusing the government's request for a writing sample . Authorities want it to compare with notes found in his home after the shooting . Loughner faces 49 charges related to a mass shooting outside a Tucson market .
0.3733333333333333,0.1917808219178082,0.2666666666666666,0.3466666666666667,"new: ""he tried to kill people,"" a 17-year-old student says. new: ""you could feel in the room all the anger that everyone had for him,"" she","Shooting victim McKayla Hicks went to hearing for accused killer James Holmes . She said she could feel ""all the anger that everyone had for"" Holmes . The incident has changed her, said Hicks . A bullet lodged in Hicks' jaw -- doctors said it is safer to leave it there ."
0.2631578947368421,0.1081081081081081,0.1842105263157895,0.2105263157894736,double-amputee sprinter Oscar Pistorius will compete at the 2012 able bodied Olympics. the four-time paralympic gold medalist won a silver,Oscar Pistorius to become first double-amputee Olympian in London . The 25-year-old has been selected in the individual 400 and 4x400m relay . Pistorius had both of his legs amputated when he was 11 months old . He won a silver medal at last year's World Championships in the 4 x 400m relay .
0.2816901408450704,0.0579710144927536,0.1971830985915492,0.2816901408450704,new: a grand jury indicted the governor on charges of coercion of a public servant and abuse of his capacity. new: a grand jury indicted the,"NEW: Perry lawyer calls indictments ""political abuse of the court system"" Indictment by country grand jury in Texas stems from effort to remove local prosecutor . Perry allegedly threatened to veto funding for a program run by the DA in Austin . Indictment could have political implications ."
0.4262295081967213,0.1016949152542372,0.2950819672131147,0.360655737704918,prosecutor says there is enough evidence to continue an investigation. prosecutor Alberto Nisman alleged that he covered up the 1994 bombing. he was found dead in,"Prosecutor to judge: Enough evidence for investigation of President to continue . Different prosecutor, before he died, alleged President hid Iran's alleged involvement in bombing . President Cristina Fernandez de Kirchner, other officials deny cover-up ."
0.2077922077922077,0.08,0.1298701298701298,0.2077922077922077,"UEFA says it is acting over ""the setting-off and throwing of fireworks"" UEFA president calls fans to ""conduct themselves with dignity and respect"" UEFA has already",NEW: UEFA president Michel Platini urges fans to behave at decisive matches Saturday . UEFA says there were racist chants from Croatian fans during a match against Italy . The issue of racism threatens to mar the Euro 2012 soccer tournament . A disciplinary panel will consider the cause against Croatia on Tuesday .
0.4705882352941177,0.216867469879518,0.3294117647058823,0.4,new: a new high-level group will discuss economic and trade relations between the two nations. new: the presidents will meet with the president on friday. new: the,A new high-level group to discuss economic cooperation will convene in the fall . Obama says ties between the U.S. and Mexico go beyond security and immigration . Mexico's president says his administration is committed to fighting organized crime . The U.S. president will travel to Costa Rica on Friday to meet with Central American leaders .
0.40625,0.1935483870967742,0.28125,0.34375,"police, others searching for murder suspect in southeastern Louisiana parish. Timothy Murray, 29, charged with murder, remains at large, authorities say. inmates escaped about 9 p","Four inmates escape from jail in St. Tammany Parish, Louisiana . Three found in area near jail north of New Orleans, official says . Man charged with murder remains at large, official says . Deputies canvassing neighborhoods in hunt for escapee ."


### T5-base

The [T5-base](https://huggingface.co/t5-base) model has 220 million parameters.

In [0]:
t5_base_summaries = summarize_with_t5(
    model_checkpoint="t5-base", articles=sample["article"]
)
compute_rouge_score(t5_base_summaries, reference_summaries)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Out[28]: {'rouge1': 0.3266942185400705,
 'rouge2': 0.13383992335021888,
 'rougeL': 0.23764253293287807,
 'rougeLsum': 0.3035933487661705}

In [0]:
t5_base_results = compute_rouge_per_row(
    generated_summaries=t5_base_summaries, reference_summaries=reference_summaries
)
display(t5_base_results)

rouge1,rouge2,rougeL,rougeLsum,generated,reference
0.4528301886792453,0.3529411764705882,0.2641509433962263,0.4528301886792453,the quake was centered about 200 miles north-northeast of Port Moresby. no tsunami warning was issued. Papua New Guinea is on the,Papua New Guinea is on the so-called Ring of Fire . It's on an arc of fault lines that is prone to frequent earthquakes . No tsunami warning was issued .
0.3414634146341463,0.125,0.1951219512195122,0.2926829268292683,Pakistan reach 148-3 on opening day of two-Test series against australia. teenagers mohammad aamer and mohammad asif take three wickets each,"Australia collapse to 88 all out on opening day of second Test against Pakistan in Leeds . Pakistan, seeking to level two-match series, reached 148-3 when bad light stopped play . Australia captain Ricky Ponting surprisingly chose to bat first in overcast conditions . His team failed to reach three figures in a test for the first time since 1984 ."
0.3478260869565217,0.1791044776119402,0.2608695652173913,0.3478260869565217,federal prosecutors want a handwriting sample from a man accused of killing six people. they want it to compare with notes that include references to guns and bullets. the government,Jared Loughner is refusing the government's request for a writing sample . Authorities want it to compare with notes found in his home after the shooting . Loughner faces 49 charges related to a mass shooting outside a Tucson market .
0.0833333333333333,0.0,0.0833333333333333,0.0833333333333333,"""he's just this freak lookin' dude with some orange hair,"" teen says. ""i think he needs to be killed,"" she says. she says she","Shooting victim McKayla Hicks went to hearing for accused killer James Holmes . She said she could feel ""all the anger that everyone had for"" Holmes . The incident has changed her, said Hicks . A bullet lodged in Hicks' jaw -- doctors said it is safer to leave it there ."
0.3783783783783783,0.1666666666666666,0.2162162162162162,0.3243243243243243,double-amputee sprinter Oscar Pistorius will compete at the 2012 olympics. he was named in the individual 400m and 4x400m,Oscar Pistorius to become first double-amputee Olympian in London . The 25-year-old has been selected in the individual 400 and 4x400m relay . Pistorius had both of his legs amputated when he was 11 months old . He won a silver medal at last year's World Championships in the 4 x 400m relay .
0.3478260869565218,0.1791044776119403,0.3188405797101449,0.3478260869565218,"new: governor's attorney calls indictment a ""political abuse of the court system"" new: he can continue to serve as governor while under indictment.","NEW: Perry lawyer calls indictments ""political abuse of the court system"" Indictment by country grand jury in Texas stems from effort to remove local prosecutor . Perry allegedly threatened to veto funding for a program run by the DA in Austin . Indictment could have political implications ."
0.4262295081967213,0.1016949152542372,0.3278688524590163,0.3934426229508197,a prosecutor says there is enough evidence to continue an investigation into a 1994 bombing. prosecutor Alberto Nisman alleged last month that Fernandez covered up Iran,"Prosecutor to judge: Enough evidence for investigation of President to continue . Different prosecutor, before he died, alleged President hid Iran's alleged involvement in bombing . President Cristina Fernandez de Kirchner, other officials deny cover-up ."
0.2307692307692307,0.0263157894736842,0.1282051282051282,0.2307692307692307,"UEFA opens disciplinary proceedings against Croatia over racist behavior. it says it was acting over ""the setting-off and throwing of fireworks"" and ""the improper conduct of supporters""",NEW: UEFA president Michel Platini urges fans to behave at decisive matches Saturday . UEFA says there were racist chants from Croatian fans during a match against Italy . The issue of racism threatens to mar the Euro 2012 soccer tournament . A disciplinary panel will consider the cause against Croatia on Tuesday .
0.2409638554216867,0.0246913580246913,0.144578313253012,0.216867469879518,"""that's the focus of my visit,"" he says after meeting with pea Nieto. security issues loom large over the meeting. he says the countries will",A new high-level group to discuss economic cooperation will convene in the fall . Obama says ties between the U.S. and Mexico go beyond security and immigration . Mexico's president says his administration is committed to fighting organized crime . The U.S. president will travel to Costa Rica on Friday to meet with Central American leaders .
0.2456140350877192,0.1454545454545454,0.2105263157894736,0.2456140350877192,"three inmates escaped from jail in covington, southeastern la., about 9 p.m. on tuesday. one of the inmates","Four inmates escape from jail in St. Tammany Parish, Louisiana . Three found in area near jail north of New Orleans, official says . Man charged with murder remains at large, official says . Deputies canvassing neighborhoods in hunt for escapee ."


### GPT-2

The [GPT-2](https://huggingface.co/gpt2) model is a generative text model that was trained in a self-supervised fashion. Its strength is completing the text of a given prompt.  It has 124 million parameters.

In [0]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def summarize_with_gpt2(
    model_checkpoint: str, articles: list, batch_size: int = 8
) -> list:
    """
    Convenience function for summarization with GPT2 to handle these complications:
    - Append "TL;DR" to the end of the input to get GPT2 to generate a summary.
    https://huggingface.co/course/chapter7/5?fw=pt
    - Truncate input to handle long articles.
    - GPT2 uses a max token length of 1024.  We use a shorter 512 limit here.

    :param model_checkpoint: reference to checkpointed model
    :param articles: list of strings
    :param batch_size: number of articles to process per batch
    :return: generated summaries, with the input and "TL;DR" removed
    """
    if torch.cuda.is_available():
        device = "cuda:0"
    else:
        device = "cpu"

    tokenizer = GPT2Tokenizer.from_pretrained(
        model_checkpoint, padding_side="left")
    tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
    model = GPT2LMHeadModel.from_pretrained(
        model_checkpoint,
        pad_token_id=tokenizer.eos_token_id
    ).to(device)

    def perform_inference(batch: list) -> list:
        tmp_inputs = tokenizer(
            batch, max_length=500, return_tensors="pt", padding=True, truncation=True
        )
        tmp_inputs_decoded = tokenizer.batch_decode(
            tmp_inputs.input_ids, skip_special_tokens=True
        )
        inputs = tokenizer(
            [article + " TL;DR:" for article in tmp_inputs_decoded],
            max_length=512,
            return_tensors="pt",
            padding=True,
            truncation=True,
        )
        summary_ids = model.generate(
            inputs.input_ids.to(device),
            attention_mask=inputs.attention_mask.to(device),
            num_beams=2,
            min_length=0,
            max_length=512 + 32,
        )
        return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

    decoded_summaries = []
    for batch in batch_generator(articles, batch_size=batch_size):
        decoded_summaries += perform_inference(batch)

        # batch clean up
        torch.cuda.empty_cache()
        gc.collect()

    # post-process decoded summaries
    summaries = [
        summary[summary.find("TL;DR:") + len("TL;DR: ") :]
        for summary in decoded_summaries
    ]

    # cleanup
    del tokenizer
    del model
    torch.cuda.empty_cache()
    gc.collect()

    return summaries
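
The post-processing step in `summarize_with_gpt2` slices off everything up to and including the `"TL;DR: "` marker. The sketch below isolates that step and shows one edge case: if the marker were ever missing, `str.find` returns -1 and the slice would silently drop the first six characters. The `extract_summary` helper is a hypothetical, more defensive variant, not part of the notebook.

```python
MARKER = "TL;DR:"


def extract_summary(decoded: str, marker: str = MARKER) -> str:
    # str.partition returns (before, marker, after); `found` is "" if the marker is absent.
    _, found, after = decoded.partition(marker)
    # Fall back to the full text rather than cutting at an arbitrary offset.
    return after.strip() if found else decoded.strip()


with_marker = "Some article text. TL;DR: Short summary"
without_marker = "no marker here"

print(extract_summary(with_marker))
print(extract_summary(without_marker))

# The slicing expression from the notebook, applied to text without the marker:
sliced = without_marker[without_marker.find("TL;DR:") + len("TL;DR: "):]
print(repr(sliced))
```

Because the function always appends the marker to a truncated prompt, this fallback should rarely trigger in practice; it mainly guards against future changes to the prompt or the truncation limits.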

In [0]:
gpt2_summaries = summarize_with_gpt2(
    model_checkpoint="gpt2", articles=sample["article"]
)
compute_rouge_score(gpt2_summaries, reference_summaries)

Out[38]: {'rouge1': 0.1887105436713102,
 'rouge2': 0.043525041567497966,
 'rougeL': 0.14172060517205282,
 'rougeLsum': 0.17324809101824595}

In [0]:
gpt2_results = compute_rouge_per_row(
    generated_summaries=gpt2_summaries, reference_summaries=reference_summaries
)
display(gpt2_results)

rouge1,rouge2,rougeL,rougeLsum,generated,reference
0.2950819672131147,0.0677966101694915,0.1967213114754098,0.2295081967213114,"A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon, according to the U.S. Geological Survey. The quake was centered about 200 miles north-northeast of Port Mores",Papua New Guinea is on the so-called Ring of Fire . It's on an arc of fault lines that is prone to frequent earthquakes . No tsunami warning was issued .
0.108695652173913,0.0222222222222222,0.108695652173913,0.0869565217391304,"Australia lost the toss. ""I think it was a big mistake,"" Gul said. ""I think it was a big mistake for the team. I think it was a big mistake for the team.","Australia collapse to 88 all out on opening day of second Test against Pakistan in Leeds . Pakistan, seeking to level two-match series, reached 148-3 when bad light stopped play . Australia captain Ricky Ponting surprisingly chose to bat first in overcast conditions . His team failed to reach three figures in a test for the first time since 1984 ."
0.5135135135135135,0.1388888888888888,0.3243243243243243,0.4324324324324324,The government wants to force Jared Lee Loughner to write out something so authorities can view his writing style. The government wants the sample to compare with handwritten notes found in Loughner's residence that,Jared Loughner is refusing the government's request for a writing sample . Authorities want it to compare with notes found in his home after the shooting . Loughner faces 49 charges related to a mass shooting outside a Tucson market .
0.144578313253012,0.0246913580246913,0.1204819277108433,0.1204819277108433,"I'm not going to be able to do anything about it,"" she said. ""I'm not going to be able to do anything about it. I'm not going to be able to do anything","Shooting victim McKayla Hicks went to hearing for accused killer James Holmes . She said she could feel ""all the anger that everyone had for"" Holmes . The incident has changed her, said Hicks . A bullet lodged in Hicks' jaw -- doctors said it is safer to leave it there ."
0.2444444444444444,0.0454545454545454,0.2,0.2444444444444444,"""The CAS has determined that the athlete's performance in the event of a disqualification is not in the public interest, and that the athlete's performance in the event of a disqualification is not in",Oscar Pistorius to become first double-amputee Olympian in London . The 25-year-old has been selected in the individual 400 and 4x400m relay . Pistorius had both of his legs amputated when he was 11 months old . He won a silver medal at last year's World Championships in the 4 x 400m relay .
0.1,0.0,0.1,0.1,"I'm not going to resign,"" Lehmberg said in a statement to CNN. ""I'm not going to resign because I'm not going to be a good district attorney. I'm not going to","NEW: Perry lawyer calls indictments ""political abuse of the court system"" Indictment by country grand jury in Texas stems from effort to remove local prosecutor . Perry allegedly threatened to veto funding for a program run by the DA in Austin . Indictment could have political implications ."
0.0930232558139534,0.0,0.0465116279069767,0.0930232558139534,Fernandez's death was a suicide. Read More,"Prosecutor to judge: Enough evidence for investigation of President to continue . Different prosecutor, before he died, alleged President hid Iran's alleged involvement in bombing . President Cristina Fernandez de Kirchner, other officials deny cover-up ."
0.2857142857142856,0.1219512195121951,0.1904761904761905,0.2857142857142856,"""I'm not a racist, I'm a good player."" The Czech Republic has also been fined $1,000 for racist chanting during a match against Italy on Saturday. The Czech Republic has also",NEW: UEFA president Michel Platini urges fans to behave at decisive matches Saturday . UEFA says there were racist chants from Croatian fans during a match against Italy . The issue of racism threatens to mar the Euro 2012 soccer tournament . A disciplinary panel will consider the cause against Croatia on Tuesday .
0.2921348314606741,0.0919540229885057,0.2022471910112359,0.2921348314606741,"The U.S. has been working with Mexico on drug trafficking since the 1980s, and the U.S. has been working with Mexico on drug trafficking since the 1990s. The new rules",A new high-level group to discuss economic cooperation will convene in the fall . Obama says ties between the U.S. and Mexico go beyond security and immigration . Mexico's president says his administration is committed to fighting organized crime . The U.S. president will travel to Costa Rica on Friday to meet with Central American leaders .
0.1095890410958904,0.0,0.0821917808219178,0.1095890410958904,The escape was not discovered until a resident and Covington police reported seeing what appeared to be inmates in jail uniforms walking down a street. The escape was not discovered until a resident and Covington,"Four inmates escape from jail in St. Tammany Parish, Louisiana . Three found in area near jail north of New Orleans, official says . Man charged with murder remains at large, official says . Deputies canvassing neighborhoods in hunt for escapee ."
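
Several of the GPT-2 rows above repeat one sentence or drift away from the article, and their scores are correspondingly low. To build intuition for why, here is a simplified sketch of a ROUGE-1 F1 score based on clipped unigram overlap. It assumes lowercased whitespace tokenization with no stemming, so it is not the `rouge_score` implementation and will not reproduce the numbers in the table. The function name and the example sentences are made up for illustration.

```python
from collections import Counter


def simple_rouge1_f1(generated: str, reference: str) -> float:
    """Unigram-overlap F1 using lowercased whitespace tokens (no stemming)."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
close_summary = "the cat lay on the mat"
looping_summary = "i think it was a mistake i think it was a mistake"

print(simple_rouge1_f1(close_summary, reference))
print(simple_rouge1_f1(looping_summary, reference))
```

Because the overlap is clipped, repeating a phrase adds tokens to the generated text without adding matches, which lowers precision.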


### Comparing all models

We use two helper functions to compare the models above: `compare_models` compares their average ROUGE scores (quantitative), and `compare_models_summaries` places their generated summaries side by side (qualitative).

In [0]:
def compare_models(models_results: dict) -> pd.DataFrame:
    """
    :param models_results: dict of "model name" string mapped to pd.DataFrame of results computed by `compute_rouge_per_row`
    :return: pd.DataFrame with 1 row per model, with columns: model, rouge1, rouge2, rougeL, rougeLsum
    where metrics are averages over input results for each model
    """
    metric_columns = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    agg_results = []
    for model_name, model_results in models_results.items():
        # Average each ROUGE metric over all rows for this model
        mean_metrics = model_results[metric_columns].mean(axis=0)
        agg_results.append([model_name, *mean_metrics])
    return pd.DataFrame(agg_results, columns=["model", *metric_columns])

In [0]:
display(
    compare_models(
        {
            "t5-small": t5_small_results,
            "t5-base": t5_base_results,
            "gpt2": gpt2_results,
        }
    )
)

model,rouge1,rouge2,rougeL,rougeLsum
t5-small,0.3096417280107126,0.1061325387792787,0.221508680832857,0.2821843556987063
t5-base,0.3271029923105547,0.1332878537336556,0.2372643025011569,0.303475912290666
gpt2,0.1885960545646232,0.0437285645984154,0.1415833622279245,0.1728018445721218
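
One way to read the table above is to rank the models on each metric and express the gap between two models as a relative gain. The sketch below does this in plain Python. The scores are copied from the table and rounded to four decimals, and `rank_models` and `relative_gain` are hypothetical helper names, not part of this notebook.

```python
# Average scores copied from the comparison table (rounded to 4 decimals)
scores = {
    "t5-small": {"rouge1": 0.3096, "rouge2": 0.1061, "rougeL": 0.2215, "rougeLsum": 0.2822},
    "t5-base": {"rouge1": 0.3271, "rouge2": 0.1333, "rougeL": 0.2373, "rougeLsum": 0.3035},
    "gpt2": {"rouge1": 0.1886, "rouge2": 0.0437, "rougeL": 0.1416, "rougeLsum": 0.1728},
}


def rank_models(scores: dict, metric: str) -> list:
    """Return model names sorted from highest to lowest score on `metric`."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


def relative_gain(scores: dict, model: str, baseline: str, metric: str) -> float:
    """Relative improvement of `model` over `baseline` on `metric`, as a fraction."""
    return scores[model][metric] / scores[baseline][metric] - 1.0


for metric in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
    print(metric, rank_models(scores, metric))

# Gap between the two T5 sizes on ROUGE-1, printed as a percentage
print(f"{relative_gain(scores, 't5-base', 't5-small', 'rouge1'):.1%}")
```

On these numbers the ordering is the same for every metric: `t5-base` first, then `t5-small`, then `gpt2`.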


In [0]:
def compare_models_summaries(models_summaries: dict) -> pd.DataFrame:
    """
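    :param models_summaries: dict of "model name" string mapped to pd.DataFrame of results computed by `compute_rouge_per_row`
    :return: pd.DataFrame with 1 column per model, holding that model's generated summaries, with rows aligned by index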
    """
    comparison_df = None
    for model_name in models_summaries:
        summaries_df = models_summaries[model_name]
        if comparison_df is None:
            comparison_df = summaries_df[["generated"]].rename(
                {"generated": model_name}, axis=1
            )
        else:
            comparison_df = comparison_df.join(
                summaries_df[["generated"]].rename({"generated": model_name}, axis=1)
            )
    return comparison_df

In [0]:
# In the output table below, scroll to the right to see all models.
display(
    compare_models_summaries(
        {
            "t5_small": t5_small_results,
            "t5_base": t5_base_results,
            "gpt2": gpt2_results,
        }
    )
)

t5_small,t5_base,gpt2
"a magnitude 6.7 earthquake rattles Papua new Guinea early Friday afternoon. no tsunami warning was issued, according to the Tsunami Warning Center. papu",the quake was centered about 200 miles north-northeast of Port Moresby. no tsunami warning was issued. Papua New Guinea is on the,"A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon, according to the U.S. Geological Survey. The quake was centered about 200 miles north-northeast of Port Mores"
the two-Test cricket series is being played in England due to security issues. the series is being played in England due to security issues in Pakistan. the series is being played in,Pakistan reach 148-3 on opening day of two-Test series against australia. teenagers mohammad aamer and mohammad asif take three wickets each,"Australia lost the toss. ""I think it was a big mistake,"" Gul said. ""I think it was a big mistake for the team. I think it was a big mistake for the team."
federal prosecutors want jared Lee Loughner to submit a handwriting sample. prosecutors want the sample to compare with handwritten notes found in his residence. the federal grand,federal prosecutors want a handwriting sample from a man accused of killing six people. they want it to compare with notes that include references to guns and bullets. the government,The government wants to force Jared Lee Loughner to write out something so authorities can view his writing style. The government wants the sample to compare with handwritten notes found in Loughner's residence that
"new: ""he tried to kill people,"" a 17-year-old student says. new: ""you could feel in the room all the anger that everyone had for him,"" she","""he's just this freak lookin' dude with some orange hair,"" teen says. ""i think he needs to be killed,"" she says. she says she","I'm not going to be able to do anything about it,"" she said. ""I'm not going to be able to do anything about it. I'm not going to be able to do anything"
double-amputee sprinter Oscar Pistorius will compete at the 2012 able bodied Olympics. the four-time paralympic gold medalist won a silver,double-amputee sprinter Oscar Pistorius will compete at the 2012 olympics. he was named in the individual 400m and 4x400m,"""The CAS has determined that the athlete's performance in the event of a disqualification is not in the public interest, and that the athlete's performance in the event of a disqualification is not in"
new: a grand jury indicted the governor on charges of coercion of a public servant and abuse of his capacity. new: a grand jury indicted the,"new: governor's attorney calls indictment a ""political abuse of the court system"" new: he can continue to serve as governor while under indictment.","I'm not going to resign,"" Lehmberg said in a statement to CNN. ""I'm not going to resign because I'm not going to be a good district attorney. I'm not going to"
prosecutor says there is enough evidence to continue an investigation. prosecutor Alberto Nisman alleged that he covered up the 1994 bombing. he was found dead in,a prosecutor says there is enough evidence to continue an investigation into a 1994 bombing. prosecutor Alberto Nisman alleged last month that Fernandez covered up Iran,Fernandez's death was a suicide. Read More
"UEFA says it is acting over ""the setting-off and throwing of fireworks"" UEFA president calls fans to ""conduct themselves with dignity and respect"" UEFA has already","UEFA opens disciplinary proceedings against Croatia over racist behavior. it says it was acting over ""the setting-off and throwing of fireworks"" and ""the improper conduct of supporters""","""I'm not a racist, I'm a good player."" The Czech Republic has also been fined $1,000 for racist chanting during a match against Italy on Saturday. The Czech Republic has also"
new: a new high-level group will discuss economic and trade relations between the two nations. new: the presidents will meet with the president on friday. new: the,"""that's the focus of my visit,"" he says after meeting with pea Nieto. security issues loom large over the meeting. he says the countries will","The U.S. has been working with Mexico on drug trafficking since the 1980s, and the U.S. has been working with Mexico on drug trafficking since the 1990s. The new rules"
"police, others searching for murder suspect in southeastern Louisiana parish. Timothy Murray, 29, charged with murder, remains at large, authorities say. inmates escaped about 9 p","three inmates escaped from jail in covington, southeastern la., about 9 p.m. on tuesday. one of the inmates",The escape was not discovered until a resident and Covington police reported seeing what appeared to be inmates in jail uniforms walking down a street. The escape was not discovered until a resident and Covington


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>