# NE Recognition with State-of-the-art models

We have used different SOTA models and evaluated how they do Named Entity Recognition.

* Can they be used as verificators/labellers for the base NER?
* How do they perform in comparison to the baseline NER?

A text with 10K characters is used for experiments.
The text was carefully manually labelled for 3 NE types: `persons`, `locations`, and `organizations`.

These SOTA models were evaluated together with the specialized BERT-based model, `dslim/bert-base-NER`:
- Baseline Hugging Face `dslim/bert-base-NER` model
- OpenAI ChatGPT 4o mini
- OpenAI ChatGPT 4o
- Claude Sonet 3.5
- Google Gemini 1.5 Pro
- Google Gemini 1.5 Flash
- Google Gemma 2 27B


See the Results and Conclusions sections.

## Data

### Test data

I use a political article from [CNN](https://www.cnn.com/2022/08/01/politics/kyrsten-sinema-democrats-big-week-for-biden-presidency/index.html) for the evaluation.

The text has several ambiguities discussed below. They show the real-life difficulties in the NER task.

In [11]:

s = """"articleBody":"Democrats this week have the chance to validate their monopoly on political power in Washington, create a legacy of true significance for President Joe Biden and even boost their hopes in daunting midterm elections in three months. But first they must push a back-from-the-dead climate and health care initiative through the Senate using their tiny majority, notably by locking in the crucial vote of moderate Arizona Sen. Kyrsten Sinema, who has yet to sign off on a bill that may not save Democrats in November but may at least give them a big new win to run on. Tensions are, meanwhile, soaring between the parties, especially over a Republican blockade of a bill that would fund health care for veterans sickened by exposure to burn pits during service in America’s foreign wars. The move opened up the GOP to accusations of cruelty – and, for once, Republican leader Mitch McConnell has looked outmaneuvered. The opening on climate did not exist a week ago. But the stunning deal between Senate Majority Leader Chuck Schumer and holdout moderate West Virginia Sen. Joe Manchin created possibilities most of the party thought were gone. Democrats, who have suffered as Biden’s legislative agenda repeatedly stalled, will be desperate to pass the bill this week before a Senate recess slows its momentum. But there remains at least one huge question mark – the vote of Sinema, whose support is just as critical as Manchin’s in the 50-50 Senate. Like Manchin, she has opposed dismantling the Senate filibuster to pass other Democratic priority bills. She did help remodel Biden’s larger Build Back Better bill before Manchin blocked it last year. But now there are questions over whether she will back tax changes affecting private equity investors in the Manchin-Schumer compromise. As the 50th Democrat needed to pass the measure with Vice President Kamala Harris’ tie-breaking vote, Sinema has huge leverage to seek changes that threaten the bill’s fragile foundation, and she has so far avoided giving her verdict on the deal. Manchin suggested on CNN’s “State of the Union” on Sunday that he hadn’t spoken to Sinema about the package since he agreed on it with Schumer. But he paid tribute to his Arizona colleague and her previous work on reducing prescription drugs prices – a goal that is included in the new draft law.  “When she looks at the bill and sees the whole spectrum of what we’re doing and all of the energy we’re bringing in – all of the reduction of prices and fighting inflation by bringing prices down, by having more energy – hopefully, she will be positive about it,” Manchin said. “But she will make her decision. And I respect that.” Manchin wields his power Manchin, blanketing Sunday TV talk shows, demonstrated his power at the fulcrum of a closely divided Senate as he put his spin on the legislation – always with an eye on voters back home in a deeply red state. Once again, Manchin has succeeded in putting his state, one of the poorest and smallest in the nation, at the center of Washington policy making.  He has also used his power to champion centrism at a time when both parties seem to be moving toward their most radical base supporters. After repeatedly infuriating Democrats by thwarting Biden’s agenda, he’s now disappointing Republicans who had hoped he would maintain his opposition. On Sunday, Manchin insisted his package would lower inflation, expand domestic energy production, ensure certain corporations pay their fair tax share, and would benefit Americans by cutting prescription drugs costs for Medicare patients. The measure would also spend nearly $370 billion on fighting climate change and developing a new green energy economy, reviving efforts that had seemed doomed just weeks ago by opposition from the coal-state senator. If the bill does pass the Senate and later makes it through the House, it would instantly transform Biden into the President who made the greatest commitment to cutting greenhouse gas emissions and would enshrine his global leadership of the effort to stave off the most disastrous future effects of climate change. It comes as extreme weather events – from drought in the American West to flooding in Kentucky that has killed at least 28 people – are ravaging the US.  The climate funding is not the only key Democratic priority in the bill. The Manchin-Schumer bill, now rebranded as the “Inflation Reduction Act,” includes extended Affordable Care Act subsidies that would also cement another key reform wrought by Democratic power in the 21st century – Obamacare. These twin achievements could go some way to changing perceptions of the Biden presidency – which, despite some successes, including a $1.9 trillion Covid-19 rescue package and a rare bipartisan infrastructure law – has seen key agenda items like voting rights and police reform founder in the Senate.  While the passage of the bill could come too late to save Democrats from the painful punch of high inflation in midterm elections in November, it might juice turnout of progressives demoralized by the failure to do more with the party’s thin control of Washington power. Taken together with the mobilization of liberals following the conservative Supreme Court’s overturning of the constitutional right to an abortion, and majority public support for gun restrictions in the wake of a string of mass shootings, Democrats would at least have a platform to run on in November if they can succeed in weaving a coherent narrative on their achievements.  While Republican strategists believe that the House is already heading toward them, according to new CNN reporting over the weekend, a late spike in Democratic enthusiasm could spur the hopes of party leaders who believe the Senate is not a lost cause – especially against a clutch of candidates in ex-President Donald Trump’s image who could scare off suburban voters.  GOP mobilizes to prevent Democratic win  Manchin explained on Sunday that he understood the invective hurled his way by many Democrats, and Vermont independent Sen. Bernie Sanders, after he derailed the previous “Build Back Better” plan over his belief that it would fuel already soaring inflation. He said that he hoped the new measure would pass by the end of this week, when the Senate is due to break for an August recess.  The timetable remains a high wire act – just one case of Covid-19 among Democratic senators, for example, could fracture the party’s majority since all Republicans are expected to be against it. There have been several recent positive tests among senators that have sent them into isolation, including Manchin.  In defending his deal with Schumer, the West Virginia senator said that “in normal times,” Republicans would support the bill, since it would pay down the deficit, accelerate permitting for oil and gas drilling and increase energy production – all of which the GOP has previously been on record supporting. But GOP senators are mobilizing to try to prevent passage of the bill, which would represent a victory for Biden and the Democrats before the midterms. “It really looks to me like Joe Manchin has been taken to the cleaners,” Pennsylvania Sen. Pat Toomey told Jake Tapper on “State of the Union.” “Look, this bill, the corporate tax increase, is going to slow down growth, probably exacerbate a recession that we’re probably already in,” said Toomey, who’s retiring. He argued that prescription drug price controls would slow development of life-saving medicines and that the bill would subsidize “wealthy people buying Teslas.” Republican Sen. Bill Cassidy of Louisiana said on ABC News’ “This Week” that another multi-billion dollar spending bill could inject “an incredible amount of uncertainty” into the economy just as it entered a recession.  Debate is raging in Washington on that last point following the release of an official report last week showing a second straight quarter of negative growth. The White House insists that given strong job growth, the economy is not in a classic contraction. In practical terms, however, the inside-the-Beltway semantics make little difference to Americans confronted by grocery bills that are far more expensive than a year ago, even if the prices at the pump have eased somewhat in recent weeks. Republicans accused of ‘cruelty’ over veterans’ health care The battle over the climate and health care bill will take place in parallel this week with a fierce controversy over the GOP blockage of a bill that would provide health care to veterans exposed to toxic fumes from burn pits, which were used to incinerate waste at military installations during the Iraq and Afghanistan wars. Activists, including comedian Jon Stewart, have accused the GOP of “cruelty” after some senators who voted for a previous version of the bill voted not to advance this one. Republicans, meanwhile, accuse Democrats of inserting new spending and complain that their amendments were not included. Veterans Affairs Secretary Denis McDonough said on “State of the Union” that a Toomey amendment would put a “year-on-year” cap on what the department can spend on veterans exposed to burn pits and would lead to “rationing of care.”  Biden, in a FaceTime call from isolation after he registered another positive Covid-19 test on Saturday, promised protesters at the Capitol that he’d fight for the legislation “as long as I have a breath in me.” Toomey told Tapper, however, that he had long raised opposition to the measure since he wanted funding for burn pit care included in year-on-year appropriations rather than in the mandatory spending column. He said the current legislation would allow Democrats to divert $400 billion to other purposes. And he denied claims that Republicans are holding up the bill to prevent Democrats from scoring another win, following the closing of the Manchin-Schumer deal, as “absurd and dishonest.”  However, the sight of Republicans voting against veterans’ health care – whatever the intricate details of the case – threatens to further an impression that the party is becoming more extreme. And it also takes the focus off the key issues that are most likely to sway the midterm elections in the GOP’s favor, including inflation, gasoline prices and Biden’s low approval rating.",
"""

### Prompt

I am using this one-shot prompt.

In [12]:
task = """Extract named entities from the below text. Extract only Persons, Locations, and Organizations.
Person examples: 'Jon Doe', 'Jack', 'Osama bin Laden'. 
Location examples: 'Paris', 'Canada', 'Azhar University', 'Texas Department Of Public Safety', 'Detroit’s 12th Street'.
Organization examples: 'US Centers for Disease Control and Prevention', 'CNN', 'Tesla', 'MIT', 'Conviction Review Unit', 'the Seekers'.
Extract only the longest name, i.e. do not extract 'Osama' and 'bin Laden' but only 'Osama bin Laden' for the 'Osama bin Laden'
If the same entity was previously extracted, do not extract it. So, the list of entities should not have duplicates.
Extracted entities are placed in the JSON file with this structure:
{
    'persons': 'Jon Doe;Jack;Osama bin Laden',
    'organizations': 'Paris;Canada;Azhar University;Texas Department Of Public Safety;Detroit’s 12th Street',
    'locations': 'US Centers for Disease Control and Prevention;CNN;Tesla;MIT;Conviction Review Unit;the Seekers'
}
All entities in one category are placed in one string with character ';' as a separator between entities. If no entities were found, place an empty string, like ''.
"""

In [13]:
prompt_template = f"{task}/nTEXT:/n{s}"

In [35]:
# # of characters; # ~ of words; # ~ of tokens

len(s), len(s.split()), len(s) // 4

(10289, 1681, 2572)

In [36]:
len(prompt_template), len(prompt_template.split()), len(prompt_template) // 4

(11459, 1843, 2864)

### Labels

I am manually labelling this text. NE are deduplicated. See the `name_normalization` notebook for the name normalization/deduplication.

In [27]:
labels = {
    "persons": "Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Donald Trump;Bernie Sanders;Pat Toomey;Jake Tapper;Bill Cassidy;Denis McDonough;Jon Stewart",
    "organizations": "Senate;CNN;GOP;Supreme Court;Democrats;Republicans;ABC News;Veterans Affairs;White House",
    "locations": "Washington;Arizona;America;American West;West Virginia;Vermont;Pennsylvania;Arizona;Washington;Kentucky;Louisiana;Iraq;Afghanistan;Capitol;US"
}

## Experiments

#### Evaluation Code

In [53]:
dfs = [] # for compound results

In [54]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

# Convert the string list into a set of values and deduplicate values
def to_set(string):
    return set(string.lower().split(';'))

def evaluate_row(labels, res):
    # Convert input into sets for comparison
    true_labels = {k: to_set(v) for k, v in labels.items()}
    pred_results = {k: to_set(v) for k, v in res.items()}
    
    # Create DataFrame with desired structure
    data = {
        "Labels": [],
        "Results": [],
        "TP": [],
        "TN": [],  # Placeholder, not calculated in this specific example
        "FP": [],
        "FN": [],
        "precision": [],
        "recall": [],
        "F1": [],
        "TP_values": [],
        "FP_values": [],
        "FN_values": []
    }
    
    # Calculate metrics for each category (persons, organizations, locations)
    for category in true_labels.keys():
        labels_set = true_labels[category]
        results_set = pred_results[category]
    
        # True positives (TP): Correctly predicted
        TP_values = labels_set.intersection(results_set)
        # False positives (FP): Predicted but not true
        FP_values = results_set.difference(labels_set)
        # False negatives (FN): True but not predicted
        FN_values = labels_set.difference(results_set)
    
        # Calculate precision, recall, and F1 score
        TP = len(TP_values)
        FP = len(FP_values)
        FN = len(FN_values)
        precision = TP / (TP + FP) if (TP + FP) > 0 else 0
        recall = TP / (TP + FN) if (TP + FN) > 0 else 0
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
        # Add results to the DataFrame
        data["Labels"].append(";".join(labels_set))
        data["Results"].append(";".join(results_set))
        data["TP"].append(TP)
        data["TN"].append(0)  # Not computed in this example, so setting it as 0
        data["FP"].append(FP)
        data["FN"].append(FN)
        data["precision"].append(precision)
        data["recall"].append(recall)
        data["F1"].append(f1)
        data["TP_values"].append(";".join(TP_values))
        data["FP_values"].append(";".join(FP_values))
        data["FN_values"].append(";".join(FN_values))
    
    df = pd.DataFrame(data)
    return df


In [39]:
# verify:

results = {
    "persons": "Joe Biden;##iden;Manchin;Kamal;K;##yr;Joe Manchin;Harris;Sin;B;##sten Sinema;Chuck Schum;Schum;Sc;Mitch McConnell;Sinema;Man",
    "organizations": "GOP;Senate;CNN",
    "locations": "West Virginia;America;Washington;Arizona"
}

df = evaluate_row(labels, results)

In [40]:
df.head()

Unnamed: 0,Labels,Results,TP,TN,FP,FN,precision,recall,F1,TP_values,FP_values,FN_values
0,jon stewart;donald trump;joe biden;mitch mccon...,harris;b;kamal;##sten sinema;chuck schum;joe b...,3,0,14,10,0.176471,0.230769,0.2,joe biden;mitch mcconnell;joe manchin,harris;b;chuck schum;kamal;##sten sinema;schum...,jon stewart;donald trump;bill cassidy;kyrsten ...
1,gop;republicans;white house;abc news;cnn;veter...,gop;cnn;senate,3,0,0,6,1.0,0.333333,0.5,gop;cnn;senate,,republicans;white house;abc news;veterans affa...
2,vermont;washington;america;iraq;louisiana;afgh...,america;west virginia;washington;arizona,4,0,0,9,1.0,0.307692,0.470588,america;west virginia;washington;arizona,,vermont;afghanistan;capitol;iraq;louisiana;pen...


### Baseline Hugging Face `dslim/bert-base-NER` model


In [55]:
res_baseline_hf = {
    "persons": "Joe Biden;##iden;Manchin;Kamal;K;##yr;Joe Manchin;Harris;Sin;B;##sten Sinema;Chuck Schum;Schum;Sc;Mitch McConnell;Sinema;Man",
    "organizations": "GOP;Senate;CNN",
    "locations": "West Virginia;America;Washington;Arizona"
}

In [56]:
df = evaluate_row(labels, res_baseline_hf)

In [57]:
df['model'] = "Hugging Face dslim/bert-base-NER"
dfs.append(df.copy())
len(dfs)

1

### OpenAI ChatGPI 4o mini

In [58]:
res_chatgpi_4o_mini = {
    "persons": "Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Bernie Sanders;Pat Toomey;Denis McDonough;Jon Stewart",
    "organizations": "CNN;GOP;Affordable Care Act;Obamacare;Democrats;Republicans;Veterans Affairs;White House",
    "locations": "Washington;Arizona;West Virginia;Kentucky;Louisiana;Iraq;Afghanistan;Capitol;US"
}


In [59]:
df = evaluate_row(labels, res_chatgpi_4o_mini)

In [60]:
df['model'] = "OpenAI ChatGPI 4o mini"
dfs.append(df.copy())
len(dfs)

2

### OpenAI ChatGPT 4o

In [61]:
res_chatgpi_4o = {
    "persons": "Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Bernie Sanders;Jon Stewart;Pat Toomey;Bill Cassidy;Denis McDonough",
    "organizations": "Democrats;Republicans;Senate Majority;CNN;GOP;Supreme Court;Veterans Affairs;US;Affordable Care Act;Senate;House",
    "locations": "Arizona;West Virginia;Washington;Kentucky;Vermont;Pennsylvania;Louisiana;Iraq;Afghanistan"
}


In [62]:
df = evaluate_row(labels, res_chatgpi_4o)

In [63]:
df['model'] = "OpenAI ChatGPT 4o"
dfs.append(df.copy())
len(dfs)

3

### Claude Sonet 3.5

In [64]:
res_claude_sonnet_35 = {
    "persons": "President Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Bernie Sanders;Pat Toomey;Bill Cassidy;Jon Stewart;Denis McDonough",
    "organizations": "Democrats;Republican;GOP;Senate;US Centers for Disease Control and Prevention;CNN;ABC News",
    "locations": "Washington;Arizona;America;West Virginia;United States;Capitol"
}

In [65]:
df = evaluate_row(labels, res_claude_sonnet_35)

In [66]:
df['model'] = "Claude Sonet 3.5"
dfs.append(df.copy())
len(dfs)

4

### Google Gemini 1.5 Pro

In [67]:
res_google_gemini = {
'persons': 'Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Manchin;Bernie Sanders;Jon Stewart;Denis McDonough;Pat Toomey;Jake Tapper;Bill Cassidy;Donald Trump',
'organizations': 'Democrats;Washington;Senate;GOP;CNN;State of the Union;House;Affordable Care Act;Supreme Court;Republican;White House;ABC News;This Week;Veterans Affairs;Capitol',
'locations': 'Washington;Arizona;America;West Virginia;Kentucky;CNN;Sunday TV talk shows;United States;American West;Louisiana;Pennsylvania;Iraq;Afghanistan;Capitol'
}

In [68]:
df = evaluate_row(labels, res_google_gemini)

In [69]:
df['model'] = "Google Gemini 1.5 Pro"
dfs.append(df.copy())
len(dfs)

5

### Google Gemini 1.5 Flash

In [70]:
res_google_gemini_flash = {
    "persons": "Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Bernie Sanders;Jon Stewart;Denis McDonough;Donald Trump;Pat Toomey;Bill Cassidy;Jake Tapper",
    "organizations": "Democrats;Senate;GOP;Republicans;CNN;State of the Union;Build Back Better;Affordable Care Act;Obamacare;Supreme Court;House;CNN reporting;Veterans Affairs;ABC News;This Week;White House",
    "locations": "Washington;Arizona;West Virginia;US;America;American West;Kentucky;Iraq;Afghanistan;Capitol"
}

In [71]:
df = evaluate_row(labels, res_google_gemini_flash)

In [72]:
df['model'] = "Google Gemini 1.5 Flash"
dfs.append(df.copy())
len(dfs)

6

### Googel Gemma 2 27B

In [73]:
res_google_gemma_2_27b = {
 "persons": "Joe Biden;Kyrsten Sinema;Mitch McConnell;Chuck Schumer;Joe Manchin;Kamala Harris;Bernie Sanders;Pat Toomey;Bill Cassidy;Jon Stewart;Denis McDonough;Donald Trump",
 "organizations": "US Centers for Disease Control and Prevention;CNN;Tesla;MIT;Conviction Review Unit;the Seekers;Texas Department Of Public Safety;Detroit’s 12th Street;Azhar University;America’s;Veterans Affairs",
 "locations": "Paris;Canada;West Virginia;Kentucky;Iraq;Afghanistan"
}

In [74]:
df = evaluate_row(labels, res_google_gemma_2_27b)

In [75]:
df['model'] = "Googel Gemma 2 27B"
dfs.append(df.copy())
len(dfs)

7

## All experiments

In [76]:
result_df = pd.concat(dfs, ignore_index=True)
excel_path = '../data/external/experiments_results.xlsx'
result_df.to_excel(excel_path, index=False)

print(f"Excel file saved to {excel_path}")

Excel file saved to ../data/external/experiments_results.xlsx


![image.png](attachment:3955a8cf-74a2-4d00-9b08-f60d0e5c5df7.png)

## Experiment Results, Conclusions

### Results

The experiment was conducted on a text example of about 10,000 characters.

* As anticipated, the state-of-the-art (SOTA) LLM models produced significantly better results than the specialized open-source NER model (`dslim/bert-base-NER`).
* This demonstrates that SOTA LLM models are highly effective in the `Named Entity Recognition` (`NER`) task.
* The top-performing model was `Google Gemini 1.5 Flash`, which is considerably more affordable than the most expensive models in the experiment (`OpenAI ChatGPT 4o`).
* Another noteworthy model, `Google Gemma 2 27B`, is an open-source LLM that achieved an F1 score of 0.96 for person entities, substantially outperforming the baseline. However, its performance on organizations and locations was weak, aligning with the baseline model.
* Named entity classification can often be ambiguous. For instance:
  - Is "Democrats" an accurate name for the political party?
  - Is "Capitol" considered a location?
  - Is "American West" a geographic location?
  - Should "US" be treated as shorthand for "USA," or is that entirely context-dependent and case-dependent?
  - Should models normalize "US" to "The USA"?
  - Is "Brooklyn neighborhood" a correct reference to a location?
- `Manual labeling` is time-consuming and not an easy task.
 

### Conclusions

- State-of-the-art (SOTA) LLM models can serve as a solid foundation for Named Entity (NE) projects.
- Even if their direct use for name entity recognition is not free, they can still be used to verify the results produced by free, open-source models. This verification process requires significantly smaller input and output prompts, making it far more cost-effective.