# Named Entity Recognition

Named Entity Recognition (NER) is a sub-task of Natural Language Processing (NLP) that identifies named entities in unstructured text and assigns each one to a predefined category such as person, organization, location, date, or time.

Named entities are words or phrases that refer to a particular real-world object or concept, such as an individual person, a company, or a place. Examples of named entities include:

* Person: John Smith, Jane Doe
* Organization: Google, Apple
* Location: New York, London
* Date: January 1, 2022
* Time: 3:00 PM
* Event: World Cup, Olympics

The goal of NER is to identify and classify these named entities automatically, which is useful in applications such as:

* Information retrieval: NER can help improve the accuracy of search results by identifying the specific entities mentioned in a query and matching them against entities found in the documents.
* Text summarization: NER can help identify the most important entities mentioned in a text and summarize the text around those entities.
* Sentiment analysis: NER can identify the entities that an opinion is about, so that sentiment can be attributed to specific entities rather than to the text as a whole.
* Question answering: NER can help identify the entities mentioned in a question and provide more accurate answers.
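
As a rough illustration of the information retrieval use case, the sketch below builds an inverted index from entities to documents. The document ids and the entity lists in `doc_entities` are hypothetical and hand-written; in practice they would come from an NER model. The function names `build_entity_index` and `search` are likewise made up for this example.

```python
from collections import defaultdict

# Hypothetical NER output: document id -> list of (entity text, label).
doc_entities = {
    "doc1": [("Google", "ORG"), ("New York", "LOC")],
    "doc2": [("Apple", "ORG"), ("London", "LOC")],
    "doc3": [("Google", "ORG"), ("London", "LOC")],
}

def build_entity_index(doc_entities):
    """Map each (lowercased entity text, label) pair to the documents that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for text, label in entities:
            index[(text.lower(), label)].add(doc_id)
    return index

def search(index, query_entities):
    """Return the sorted ids of documents that mention every entity in the query."""
    doc_sets = [index.get((text.lower(), label), set()) for text, label in query_entities]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))

index = build_entity_index(doc_entities)
print(search(index, [("Google", "ORG")]))                     # ['doc1', 'doc3']
print(search(index, [("Google", "ORG"), ("London", "LOC")]))  # ['doc3']
```

Keying the index on both text and label keeps "Apple" the organization separate from any other entity type with the same surface form.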

There are several techniques used in NER, including:

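* Rule-based approaches: These approaches use hand-crafted rules, such as regular expressions and dictionaries of known names (gazetteers), to identify named entities.
* Machine learning approaches: These approaches train statistical or neural models, such as conditional random fields or transformer-based token classifiers like BERT, on annotated text so that the model learns to label each token.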
* Hybrid approaches: These approaches combine rule-based and machine learning approaches to improve accuracy.
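
The rule-based and hybrid approaches can be sketched in a few lines. The example below is only a minimal illustration: the regular expressions, the tiny `GAZETTEER`, and the function names `rule_based_ner` and `merge_hybrid` are assumptions made for this sketch, and the `model_entities` list stands in for the output of a trained model. The entity dictionaries reuse the `word`, `entity_group`, `start`, and `end` keys that appear in the pipeline output later in this notebook.

```python
import re

# Illustrative rules only; a real rule set would be far larger.
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
RULES = [
    ("DATE", re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2} ?(?:AM|PM)\b")),
]
# Hypothetical gazetteer of known names.
GAZETTEER = {"Google": "ORG", "Apple": "ORG", "New York": "LOC", "London": "LOC"}

def _entity(match, label):
    return {"word": match.group(), "entity_group": label,
            "start": match.start(), "end": match.end()}

def rule_based_ner(text):
    """Find entities with regex rules and gazetteer lookups, sorted by position."""
    entities = []
    for label, pattern in RULES:
        entities.extend(_entity(m, label) for m in pattern.finditer(text))
    for name, label in GAZETTEER.items():
        for m in re.finditer(rf"\b{re.escape(name)}\b", text):
            entities.append(_entity(m, label))
    return sorted(entities, key=lambda e: e["start"])

def merge_hybrid(rule_entities, model_entities):
    """Hybrid sketch: keep every rule match, add model entities that do not overlap one."""
    merged = list(rule_entities)
    for ent in model_entities:
        overlaps = any(ent["start"] < r["end"] and r["start"] < ent["end"]
                       for r in rule_entities)
        if not overlaps:
            merged.append(ent)
    return sorted(merged, key=lambda e: e["start"])

text = "Jane Doe joined Google in London on January 1, 2022."
rule_entities = rule_based_ner(text)

# Stand-in for the output of a machine learning model (hand-written here).
model_entities = [
    {"word": "Jane Doe", "entity_group": "PER", "start": 0, "end": 8},
    {"word": "Google", "entity_group": "ORG", "start": 16, "end": 22},
]

for ent in merge_hybrid(rule_entities, model_entities):
    print(ent["word"], ent["entity_group"])
```

Giving the rules priority over the model is one possible design choice; a system could just as well prefer the model and fall back to rules only for categories the model does not cover.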

NER is often broken down into, or paired with, the following tasks:

* Entity recognition: Identifying named entities in text data.
* Entity disambiguation: Resolving ambiguity in entity names (e.g., "John" could refer to multiple people).
* Entity classification: Classifying named entities into predefined categories.
* Entity linking: Linking named entities to a knowledge base or database.
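
Entity disambiguation and linking can be illustrated with a toy knowledge base. In the sketch below, the `KB` dictionary, its identifiers (such as `kb:paris_france`), and the context-word heuristic are all made up for illustration; production systems typically rely on much richer candidate generation and ranking.

```python
import re

# Hypothetical knowledge base: lowercased mention -> candidate entries.
KB = {
    "paris": [
        {"id": "kb:paris_france", "type": "LOC", "context": {"france", "eiffel", "seine"}},
        {"id": "kb:paris_texas", "type": "LOC", "context": {"texas", "usa"}},
    ],
    "apple": [
        {"id": "kb:apple_inc", "type": "ORG", "context": {"iphone", "store", "company"}},
    ],
}

def link_entity(mention, label, sentence):
    """Return the id of the best-matching KB entry, or None if there is no candidate.

    Candidates must share the entity's label; ties in context overlap go to
    the first candidate listed.
    """
    candidates = [c for c in KB.get(mention.lower(), []) if c["type"] == label]
    if not candidates:
        return None
    words = set(re.findall(r"\w+", sentence.lower()))
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["id"]

print(link_entity("Paris", "LOC", "The Eiffel Tower in Paris, France, was completed on March 31, 1889."))
# kb:paris_france
print(link_entity("Paris", "LOC", "Paris, Texas is a city in the USA."))
# kb:paris_texas
```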

NER is a challenging task because natural language is ambiguous and entity names vary widely: the same string can name a person, a company, or a place depending on context. Despite this, it underpins the applications listed above.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load pre-trained model and tokenizer.
# dslim/bert-base-NER is fine-tuned on CoNLL-2003 and predicts PER, ORG, LOC and MISC,
# so dates and times are not tagged by this model.
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline. aggregation_strategy="simple" groups adjacent tokens that share
# a label, but it can still return word-piece fragments such as "##iff".
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

def perform_ner(text):
    # Perform NER
    entities = ner_pipeline(text)

    # Process and print results
    print(f"Text: {text}\n")
    print("Entities found:")
    for entity in entities:
        print(f"  - {entity['word']} ({entity['entity_group']})")
        print(f"    Score: {entity['score']:.4f}")
        print(f"    Start: {entity['start']}, End: {entity['end']}")
        print()

    # Visualize entities in text.
    # A whitespace-delimited word is bracketed only if it lies entirely inside one
    # entity span, so words with attached punctuation (e.g. "Inc.") stay unmarked.
    words = text.split()
    entity_positions = [(entity['start'], entity['end'], entity['entity_group']) for entity in entities]

    visualized_text = ""
    current_pos = 0
    for word in words:
        # Locate the word's character offsets, searching from the end of the previous word
        word_start = text.index(word, current_pos)
        word_end = word_start + len(word)
        current_pos = word_end

        entity_type = next((e[2] for e in entity_positions if e[0] <= word_start and e[1] >= word_end), None)
        if entity_type:
            visualized_text += f"[{word}]({entity_type}) "
        else:
            visualized_text += word + " "

    print("Visualized text with entities:")
    print(visualized_text.strip())

# Example texts
texts = [
    "Apple Inc. is planning to open a new store in New York City next month.",
    "The Eiffel Tower in Paris, France, was completed on March 31, 1889.",
    "Albert Einstein developed the theory of relativity while working at the Swiss Patent Office in Bern."
]

# Perform NER on each text
for text in texts:
    perform_ner(text)
    print("\n" + "="*50 + "\n")
```


Text: Apple Inc. is planning to open a new store in New York City next month.

Entities found:
  - Apple Inc (ORG)
    Score: 0.9995
    Start: 0, End: 9

  - New York City (LOC)
    Score: 0.9995
    Start: 46, End: 59

Visualized text with entities:
[Apple](ORG) Inc. is planning to open a new store in [New](LOC) [York](LOC) [City](LOC) next month.
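
In the visualization above, "Inc." is left unmarked because the word (characters 6 to 10, including the period) extends past the end of the `Apple Inc` entity span (0 to 9). One possible alternative, sketched below, is to insert the markers directly at the character offsets reported by the pipeline instead of matching whole words. The helper name `annotate` is hypothetical, and the entity list is copied from the output above rather than produced by running the model.

```python
def annotate(text, entities):
    """Wrap each entity span in [..](LABEL) using its character offsets."""
    parts = []
    pos = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < pos:
            continue  # skip spans that overlap one already emitted
        parts.append(text[pos:ent["start"]])
        parts.append(f"[{text[ent['start']:ent['end']]}]({ent['entity_group']})")
        pos = ent["end"]
    parts.append(text[pos:])
    return "".join(parts)

text = "Apple Inc. is planning to open a new store in New York City next month."
# Offsets and labels taken from the pipeline output shown above.
entities = [
    {"entity_group": "ORG", "start": 0, "end": 9},
    {"entity_group": "LOC", "start": 46, "end": 59},
]
print(annotate(text, entities))
# [Apple Inc](ORG). is planning to open a new store in [New York City](LOC) next month.
```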


Text: The Eiffel Tower in Paris, France, was completed on March 31, 1889.

Entities found:
  - E (LOC)
    Score: 0.9936
    Start: 4, End: 5

  - ##iff (LOC)
    Score: 0.7460
    Start: 5, End: 8

  - ##el Tower (LOC)
    Score: 0.9524
    Start: 8, End: 16

  - Paris (LOC)
    Score: 0.9996
    Start: 20, End: 25

  - France (LOC)
    Score: 0.9997
    Start: 27, End: 33

Visualized text with entities:
The Eiffel [Tower](LOC) in Paris, France, was completed on March 31, 1889.
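
Here the pipeline returned "Eiffel Tower" as three word-piece fragments (`E`, `##iff`, `##el Tower`), which is also why "Eiffel" is not bracketed. The `transformers` pipeline offers word-level aggregation strategies (`"first"`, `"average"`, `"max"`) that may avoid such splits. As a model-independent alternative, the sketch below merges adjacent fragments in post-processing. The helper name `merge_adjacent` is hypothetical, taking the minimum score of the merged pieces is an assumption chosen to be conservative, and the entity list is copied from the output above.

```python
def merge_adjacent(text, entities):
    """Merge entities that share a label and whose character spans touch."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if prev and prev["entity_group"] == ent["entity_group"] and prev["end"] == ent["start"]:
            prev["end"] = ent["end"]
            # Rebuild the surface form from the text so "##" markers disappear.
            prev["word"] = text[prev["start"]:prev["end"]]
            # Assumption: the merged score is the weakest fragment's score.
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))
    return merged

text = "The Eiffel Tower in Paris, France, was completed on March 31, 1889."
# Fragments as reported in the output above.
entities = [
    {"word": "E", "entity_group": "LOC", "score": 0.9936, "start": 4, "end": 5},
    {"word": "##iff", "entity_group": "LOC", "score": 0.7460, "start": 5, "end": 8},
    {"word": "##el Tower", "entity_group": "LOC", "score": 0.9524, "start": 8, "end": 16},
    {"word": "Paris", "entity_group": "LOC", "score": 0.9996, "start": 20, "end": 25},
    {"word": "France", "entity_group": "LOC", "score": 0.9997, "start": 27, "end": 33},
]

for ent in merge_adjacent(text, entities):
    print(ent["word"], ent["entity_group"], ent["start"], ent["end"])
```

"Paris" and "France" stay separate because their spans do not touch: the comma and space between them leave a gap.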


Text: Albert Einstein developed the theory of relativity while working at the Swiss Patent Office in Bern.

Entities found:
  - Albert Einstein (PER)
    Score: