Named Entity Recognition
  Information Extraction
  Detect and classify the named entities in unstructured data

In [48]:
#load spacy
import spacy

In [49]:
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

In [50]:
# Process whole documents
doc = nlp("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
for ent in doc.ents:
    # ent.label_ is the readable label string; ent.label is its integer ID
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Sebastian Thrun | PERSON | People, including fictional
Google | ORG | Companies, agencies, institutions, etc.
2007 | DATE | Absolute or relative dates or periods
American | NORP | Nationalities or religious or political groups
Thrun | PERSON | People, including fictional
Recode | ORG | Companies, agencies, institutions, etc.
earlier this week | DATE | Absolute or relative dates or periods


In [51]:
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']


In [52]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
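
Once the entities are available as (text, label) pairs, it is often handy to group them by label. The sketch below does this with plain Python on the pairs printed above; the variable names `pairs` and `by_label` are illustrative, and in the notebook the list could instead be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
pairs = [
    ("Sebastian Thrun", "PERSON"),
    ("Google", "ORG"),
    ("2007", "DATE"),
    ("American", "NORP"),
    ("Thrun", "PERSON"),
    ("Recode", "ORG"),
    ("earlier this week", "DATE"),
]

# Collect the distinct entity texts under each label, keeping first-seen order.
by_label = defaultdict(list)
for text, label in pairs:
    if text not in by_label[label]:
        by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```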


Visualization

In [53]:
from spacy import displacy

In [54]:
# render the entities inline in the Jupyter notebook
displacy.render(doc, style="ent", jupyter=True)

Convert data to .spacy format

Custom Train NER Pipeline
Data Preparation

In [55]:
train = [
          ("An average-sized strawberry has about 200 seeds on its outer surface and are quite edible.",{"entities":[(17,27,"Fruit")]}),
          ("The outer skin of Guava is bitter tasting and thick, dark green for raw fruits and as the fruit ripens, the bitterness subsides. ",{"entities":[(18,23,"Fruit")]}),
          ("Grapes are one of the most widely grown types of fruits in the world, chiefly for the making of different wines. ",{"entities":[(0,6,"Fruit")]}),
          ("Watermelon is composed of 92 percent water and significant amounts of Vitamins and antioxidants. ",{"entities":[(0,10,"Fruit")]}),
          ("Papaya fruits are usually cylindrical in shape and the size can go beyond 20 inches. ",{"entities":[(0,6,"Fruit")]}),
          ("Mango, the King of the fruits is a drupe fruit that grows in tropical regions. ",{"entities":[(0,5,"Fruit")]}),
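          ("undefined",{"entities":[(0,6,"Fruit")]}),  # placeholder text with no fruit name: this span cannot be aligned to a token and is skipped during conversion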
          ("Oranges are great source of vitamin C",{"entities":[(0,7,"Fruit")]}),
          ("An apple a day keeps the doctor away. ",{"entities":[(3,8,"Fruit")]})
        ]
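
Character offsets like these are easy to get wrong, so a quick check before conversion can help. The following is a minimal sketch in plain Python, not part of spaCy: the helper name `find_misaligned` is hypothetical, and the "cuts through a word" test is only a rough stand-in for real tokenizer boundaries. It is shown on two rows taken from `train`.

```python
# Two rows in the same (text, {"entities": [...]}) shape as `train`.
sample = [
    ("Oranges are great source of vitamin C", {"entities": [(0, 7, "Fruit")]}),
    ("undefined", {"entities": [(0, 6, "Fruit")]}),
]

def find_misaligned(examples):
    """Return (text, start, end, label) for spans that are out of range
    or that start or end in the middle of a word."""
    problems = []
    for text, annot in examples:
        for start, end, label in annot["entities"]:
            in_range = 0 <= start < end <= len(text)
            # A boundary "cuts" a word when alphanumeric characters sit
            # directly on both sides of it.
            cuts_left = (in_range and start > 0
                         and text[start - 1].isalnum() and text[start].isalnum())
            cuts_right = (in_range and end < len(text)
                          and text[end - 1].isalnum() and text[end].isalnum())
            if not in_range or cuts_left or cuts_right:
                problems.append((text, start, end, label))
    return problems

problems = find_misaligned(sample)
for text, start, end, label in problems:
    print(f"{label} span ({start}, {end}) = {text[start:end]!r} does not align in {text!r}")
```

Running the same helper over the full `train` list would show which rows need their offsets or text corrected before they reach `doc.char_span`.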

In [56]:
from tqdm import tqdm
from spacy.tokens import DocBin

db = DocBin() # create a DocBin object

for text, annot in tqdm(train): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

100%|██████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 2566.20it/s]

Skipping entity
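
Only a single train.spacy file is written here, and the training command in this notebook passes that same file as the dev set. A common alternative is to hold out part of the annotated examples for evaluation. The sketch below shows one way to split a list of examples with the standard library; the helper name `split_examples`, the 20% fraction and the fixed seed are assumptions for illustration, not settings from this notebook.

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and return (train_part, dev_part)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example for evaluation.
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Stand-in for the annotated list; any list of examples works.
examples = [f"example {i}" for i in range(9)]
train_part, dev_part = split_examples(examples)
print(len(train_part), len(dev_part))
```

Each part could then go through the same DocBin loop and be saved to its own file, for example train.spacy and dev.spacy. With only nine sentences the held-out set is tiny, so the scores would still be a rough indication at best.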





In [57]:
# The recommended way to train a spaCy pipeline is the `spacy train` command on the command line.
# It needs only a single config.cfg file that holds all settings and hyperparameters.
# Settings can optionally be overridden on the command line, and a Python file can be
# loaded to register custom functions and architectures.
# A starter config with the recommended settings can be generated with the
# `spacy init config` command.

Create base config file

In [58]:
#pip install pytokenizations

In [59]:
#! pip install -U spacy -q
! pip install -U spacy



In [45]:
! python -m spacy info

spaCy version    3.4.3                         
Location         C:\Users\LZ575NE\Anaconda3\lib\site-packages\spacy
Platform         Windows-10-10.0.19044-SP0     
Python version   3.9.12                        
Pipelines        en_core_web_lg (3.4.1), en_core_web_sm (3.4.1)



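The command below generates a complete config file (config.cfg) for an NER pipeline. The commented-out fill-config line is the alternative that fills in a partial base_config.cfg.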

In [60]:
#python -m spacy init fill-config ./base_config.cfg ./config.cfg
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

[!] To generate a more effective transformer-based config (GPU-only), install
the spacy-transformers package and re-run this command. The config generated now
does not use transformers.
[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Train the Model

In [61]:
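# Note: --paths.dev points at the training file, so the scores below are measured on the training data
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy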

[+] Created output directory: output
[i] Saving to output directory: output
[i] Using CPU
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001

[2022-11-23 07:42:34,667] [INFO] Set up nlp object from config
[2022-11-23 07:42:34,673] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-11-23 07:42:34,675] [INFO] Created vocabulary
[2022-11-23 07:42:34,676] [INFO] Finished initializing nlp object
[2022-11-23 07:42:34,740] [INFO] Initialized pipeline components: ['tok2vec', 'ner']



E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     32.50   11.76    6.25  100.00    0.12
156     200          3.56    544.95  100.00  100.00  100.00    1.00
356     400          0.00      0.00  100.00  100.00  100.00    1.00
556     600          0.00      0.00  100.00  100.00  100.00    1.00
756     800          0.00      0.00  100.00  100.00  100.00    1.00
956    1000          0.00      0.00  100.00  100.00  100.00    1.00
1156    1200          0.00      0.00  100.00  100.00  100.00    1.00
1356    1400          0.00      0.00  100.00  100.00  100.00    1.00
1556    1600          0.00      0.00  100.00  100.00  100.00    1.00
1756    1800          0.00      0.00  100.00  100.00  100.00    1.00
[+] Saved pipeline to output directory
output\model-last
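
The ENTS_P, ENTS_R and ENTS_F columns in the log are entity-level precision, recall and F-score. The F-score is the harmonic mean of the other two, which can be checked against the first row of the table. The function name `f_score` below is only illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# First row of the training log: ENTS_P = 6.25, ENTS_R = 100.00
first_row_f = round(f_score(6.25, 100.0), 2)
print(first_row_f)  # 11.76, matching ENTS_F in that row
```

A recall of 100 with a precision of 6.25 means every annotated entity was found at step 0, but most predicted spans were wrong, which is why the F-score starts so low.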


Load Trained Model

In [62]:
#loading the best model
nlp1 = spacy.load(r"./output/model-best")



In [64]:
# testing with input sample text

doc = nlp1("Strawberry is a luscious, red fruit grown on plants belonging to the Rose or Rosaceae family.")
doc.ents

(Strawberry, Rosaceae)

In [65]:
colors = {'Fruit': "#85C1E9"}
options = {"ents": ['Fruit'], "colors": colors}

In [66]:
# to display in Jupyter
displacy.render(doc, style="ent", jupyter=True, options=options)