<a href="https://colab.research.google.com/github/mohammadreza-mohammadi94/Transformers-Hub/blob/main/NER_CONLL2003_Bert_Base_NER/ner_dslim_bert_base_ner_conll_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Libraries

In [6]:
!pip install -q --upgrade huggingface_hub fsspec evaluate datasets

# Import Libraries

In [7]:
from transformers import pipeline
from datasets import load_dataset

# Load Model and Test NER

In [8]:
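# Token-level NER pipeline (no aggregation): predictions are reported per WordPiece token
ner = pipeline("ner", model="dslim/bert-base-NER")
dataset = load_dataset("conll2003", split="test[:5]")  # first five test sentences

for item in dataset:
    # CoNLL-2003 stores pre-split tokens; join them back into a plain string
    text = " ".join(item["tokens"])
    entities = ner(text)
    print(f"Text: {text[:100]}...")
    print("Entities: ", [(e["word"], e["entity"], e["score"]) for e in entities])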

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Text: SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT ....
Entities:  [('J', 'B-MISC', np.float32(0.48014238)), ('##AP', 'I-LOC', np.float32(0.29028258)), ('L', 'B-PER', np.float32(0.43156973)), ('##UC', 'I-LOC', np.float32(0.42212436)), ('CH', 'B-ORG', np.float32(0.64400214)), ('##IN', 'I-LOC', np.float32(0.512721)), ('##A', 'I-ORG', np.float32(0.5850009))]
Text: Nadim Ladki...
Entities:  [('Na', 'B-PER', np.float32(0.99730563)), ('##di', 'B-PER', np.float32(0.8018445)), ('##m', 'B-PER', np.float32(0.6068873)), ('La', 'I-PER', np.float32(0.99857855)), ('##ki', 'I-PER', np.float32(0.7398141))]
Text: AL-AIN , United Arab Emirates 1996-12-06...
Entities:  [('AL', 'B-LOC', np.float32(0.9976654)), ('-', 'I-LOC', np.float32(0.9957873)), ('AI', 'I-LOC', np.float32(0.9619766)), ('##N', 'I-LOC', np.float32(0.9851366)), ('United', 'B-LOC', np.float32(0.9994467)), ('Arab', 'I-LOC', np.float32(0.9993622)), ('Emirates', 'I-LOC', np.float32(0.99942786))]
Text: Japan began the defence of the
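
The per-token output contains WordPiece fragments such as `##di` and `##m`, which are hard to read. A minimal sketch of joining fragments back into words by hand is given here, assuming each prediction is a dict with `word`, `entity` and `score` keys as printed in the output. The helper name `merge_wordpieces` is hypothetical, and labelling each word by its first piece and averaging the scores are illustrative choices, not the library's behaviour.

```python
def merge_wordpieces(token_entities):
    """Join '##' continuation pieces onto the preceding piece.

    The merged word keeps the label of its first piece and the mean score.
    """
    merged = []
    for ent in token_entities:
        word = ent["word"]
        if word.startswith("##") and merged:
            # Continuation piece: append it to the previous word
            merged[-1]["word"] += word[2:]
            merged[-1]["scores"].append(float(ent["score"]))
        else:
            merged.append(
                {"word": word, "entity": ent["entity"], "scores": [float(ent["score"])]}
            )
    return [
        (m["word"], m["entity"], sum(m["scores"]) / len(m["scores"])) for m in merged
    ]


# Predictions for "Nadim Ladki", with scores rounded from the output above
sample = [
    {"word": "Na", "entity": "B-PER", "score": 0.9973},
    {"word": "##di", "entity": "B-PER", "score": 0.8018},
    {"word": "##m", "entity": "B-PER", "score": 0.6069},
    {"word": "La", "entity": "I-PER", "score": 0.9986},
    {"word": "##ki", "entity": "I-PER", "score": 0.7398},
]

merged_sample = merge_wordpieces(sample)
for word, label, score in merged_sample:
    print(f"{word}: {label} ({score:.4f})")
```

Note that the second word comes out as `Laki` rather than `Ladki`: pieces that the model did not tag as an entity are absent from the pipeline output, so a merge based only on the `##` prefix can be lossy.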

# Develop NER Model

In [15]:
#-----------------#
# Libraries       #
#-----------------#
from transformers import pipeline
from datasets import load_dataset
import logging

#-----------------#
# Logging         #
#-----------------#
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')


#-----------------#
# Configuration   #
#-----------------#
MODEL_NAME = "dslim/bert-base-NER"
DATASET_NAME = "conll2003"
DATASET_SPLIT = "test[:5]"

## Helper Functions

In [24]:
def format_entities(entities):
    """
    Formats the raw entity output from the pipeline for better readability.

    Args:
        entities (list): A list of dictionaries, where each dictionary
                         represents an entity found by the NER model.

    Returns:
        list: A list of tuples, each containing (word, entity_label, score),
              where score is a string formatted to four decimal places.
    """
    formatted = []
    for entity in entities:
        # Read fields defensively; fall back to placeholders if a key is missing
        word = entity.get("word", "N/A")
        # With an aggregation strategy set, the label is under "entity_group", not "entity"
        entity_label = entity.get("entity_group", "N/A")
        score = entity.get("score", 0.0)
        formatted.append((word, entity_label, f"{score:.4f}"))  # score as a 4-decimal string
    return formatted
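
Because `format_entities` reads only the `entity_group` key, it reports `N/A` for every label when the pipeline runs without an aggregation strategy. The sketch below shows one possible variant that falls back to the `entity` key, so the same helper handles both output shapes. The name `format_entities_any` is hypothetical, and the sample dicts reuse values from the outputs in this notebook.

```python
def format_entities_any(entities):
    """Format pipeline output whether or not aggregation was used."""
    formatted = []
    for entity in entities:
        word = entity.get("word", "N/A")
        # Aggregated output uses "entity_group"; token-level output uses "entity"
        label = entity.get("entity_group", entity.get("entity", "N/A"))
        score = float(entity.get("score", 0.0))
        formatted.append((word, label, f"{score:.4f}"))
    return formatted


aggregated = [{"word": "United Arab Emirates", "entity_group": "LOC", "score": 0.9994467}]
token_level = [{"word": "AL", "entity": "B-LOC", "score": 0.9976654}]

aggregated_rows = format_entities_any(aggregated)
token_rows = format_entities_any(token_level)
print(aggregated_rows)
print(token_rows)
```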

In [25]:
def process_text_for_ner(text, ner_pipeline):
    """
    Processes a single text string using the NER pipeline.

    Args:
        text (str): The input text to analyze.
        ner_pipeline (transformers.Pipeline): The initialized NER pipeline.

    Returns:
        list: A list of formatted entities found in the text.
    """
    if not text.strip(): # Handle empty strings
        logging.warning("Received empty text for NER processing.")
        return []
    try:
        raw_entities = ner_pipeline(text)
        return format_entities(raw_entities)
    except Exception as e:
        logging.error(f"Error during NER processing for text: '{text[:50]}...': {e}")
        return []

In [26]:
def main():
    """
    Main function to load the NER model, dataset, and perform entity recognition.
    """
    logging.info(f"Loading NER model: {MODEL_NAME}")
    try:
        # aggregation_strategy="simple" merges adjacent tokens that share an entity type
        # into one entity, but a word can still be split when its sub-word pieces
        # (e.g. "##AP") receive different tags. "first", "average" and "max" resolve
        # the tag per word; "none" returns one entry per token.
        ner_pipeline = pipeline("ner", model=MODEL_NAME, aggregation_strategy="simple")
    except Exception as e:
        logging.error(f"Failed to load NER model: {e}")
        return

    logging.info(f"Loading dataset: {DATASET_NAME}, split: {DATASET_SPLIT}")
    try:
        dataset = load_dataset(DATASET_NAME, split=DATASET_SPLIT)
    except Exception as e:
        logging.error(f"Failed to load dataset: {e}")
        return

    logging.info("Starting NER processing on the dataset...")
    for i, item in enumerate(dataset):
        # The CoNLL2003 dataset items have 'tokens' and 'ner_tags'
        # 'tokens' is a list of words.
        text = " ".join(item["tokens"])

        # It's also insightful to see the *ground truth* labels if available
        ground_truth_tags = item.get("ner_tags", []) # ner_tags are numerical in conll2003
        # To make ground truth human-readable, you'd need the dataset's feature info:
        # feature_info = dataset.features["ner_tags"].feature
        # ground_truth_labels = [feature_info.int2str(tag) for tag in ground_truth_tags]

        print(f"\n--- Sample {i+1} ---")
        print(f"Original Text (first 100 chars): {text[:100]}...")
        # print(f"Ground Truth NER Tags (numerical): {ground_truth_tags}")
        # print(f"Ground Truth NER Labels: {ground_truth_labels}") # If you convert them

        predicted_entities = process_text_for_ner(text, ner_pipeline)

        if predicted_entities:
            print("Predicted Entities:")
            for word, entity_label, score in predicted_entities:
                print(f"  - Word: \"{word}\", Type: {entity_label}, Confidence: {score}")
        else:
            print("  No entities predicted or an error occurred.")

    logging.info("NER processing finished.")
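
The commented-out lines in `main()` hint at converting the numerical `ner_tags` to readable labels. A minimal offline sketch of that mapping follows. It assumes the label order published for the Hugging Face `conll2003` dataset; in a live session the list is better read from `dataset.features["ner_tags"].feature.names`. The helper `tags_to_labels` is hypothetical, and the tag ids are illustrative values, not read from the dataset.

```python
# Assumption: label order of the Hugging Face `conll2003` dataset.
# Prefer dataset.features["ner_tags"].feature.names when the dataset is loaded.
CONLL2003_LABELS = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def tags_to_labels(tag_ids, label_names=CONLL2003_LABELS):
    """Map integer tag ids to their string labels."""
    return [label_names[tag_id] for tag_id in tag_ids]


tokens = ["AL-AIN", ",", "United", "Arab", "Emirates", "1996-12-06"]
example_tag_ids = [5, 0, 5, 6, 6, 0]  # illustrative ids, not read from the dataset

labels = tags_to_labels(example_tag_ids)
for token, label in zip(tokens, labels):
    print(f"{token:<12} {label}")
```

Printing predictions next to these labels makes it easier to spot where the model and the annotation disagree.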

In [27]:
main()

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu



--- Sample 1 ---
Original Text (first 100 chars): SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT ....
Predicted Entities:
  - Word: "J", Type: MISC, Confidence: 0.4801
  - Word: "##AP", Type: LOC, Confidence: 0.2903
  - Word: "L", Type: PER, Confidence: 0.4316
  - Word: "##UC", Type: LOC, Confidence: 0.4221
  - Word: "CH", Type: ORG, Confidence: 0.6440
  - Word: "##IN", Type: LOC, Confidence: 0.5127
  - Word: "##A", Type: ORG, Confidence: 0.5850

--- Sample 2 ---
Original Text (first 100 chars): Nadim Ladki...
Predicted Entities:
  - Word: "Na", Type: PER, Confidence: 0.9973
  - Word: "##di", Type: PER, Confidence: 0.8018
  - Word: "##m La", Type: PER, Confidence: 0.8027
  - Word: "##ki", Type: PER, Confidence: 0.7398

--- Sample 3 ---
Original Text (first 100 chars): AL-AIN , United Arab Emirates 1996-12-06...
Predicted Entities:
  - Word: "AL - AIN", Type: LOC, Confidence: 0.9851
  - Word: "United Arab Emirates", Type: LOC, Confidence: 0.9994
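
Entries such as `##m La` in Sample 2 look odd but are consistent with how grouping by tag behaves. The sketch below is a simplified approximation of that behaviour, not the library's code: a group is extended only while the entity type stays the same and the next token is not tagged `B-`, the word pieces are joined with spaces, and the scores are averaged. The function names are hypothetical. The `##d` piece and its score are assumptions, since untagged pieces do not appear in the printed output.

```python
def split_tag(label):
    """Split 'B-PER' into ('B', 'PER'); a bare label such as 'O' becomes ('I', 'O')."""
    if label.startswith(("B-", "I-")):
        return label[0], label[2:]
    return "I", label


def group_simple(token_predictions):
    """Group adjacent tokens of the same entity type, starting a new group on 'B-'."""
    groups = []
    current = []
    for pred in token_predictions:
        if current:
            prefix, tag = split_tag(pred["entity"])
            _, last_tag = split_tag(current[-1]["entity"])
            if tag == last_tag and prefix != "B":
                current.append(pred)
                continue
            groups.append(current)
        current = [pred]
    if current:
        groups.append(current)

    results = []
    for group in groups:
        _, tag = split_tag(group[0]["entity"])
        if tag == "O":
            continue  # drop non-entity groups
        word = " ".join(p["word"] for p in group).replace(" ##", "")
        score = sum(p["score"] for p in group) / len(group)
        results.append({"entity_group": tag, "word": word, "score": score})
    return results


# Token-level predictions for "Nadim Ladki" from the first cell's output.
# Assumption: the piece "##d" was tagged "O"; its score is a placeholder.
sample = [
    {"word": "Na", "entity": "B-PER", "score": 0.99730563},
    {"word": "##di", "entity": "B-PER", "score": 0.8018445},
    {"word": "##m", "entity": "B-PER", "score": 0.6068873},
    {"word": "La", "entity": "I-PER", "score": 0.99857855},
    {"word": "##d", "entity": "O", "score": 0.5},
    {"word": "##ki", "entity": "I-PER", "score": 0.7398141},
]

grouped = group_simple(sample)
for g in grouped:
    print(f'{g["word"]!r}: {g["entity_group"]} ({g["score"]:.4f})')
```

Under these assumptions the sketch yields the same four groups as Sample 2, which suggests that the fragmentation comes from the per-piece tags (three consecutive `B-PER` predictions) rather than from the grouping step.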

--- Sample 4 ---
Original Text