In [2]:
# Check if spacy is install, and what models are available
!python -m spacy info

[1m

spaCy version    3.7.5                         
Location         C:\Users\Haingo\AppData\Roaming\Python\Python311\site-packages\spacy
Platform         Windows-10-10.0.22631-SP0     
Python version   3.11.7                        
Pipelines        en_core_web_sm (3.7.1), en_core_web_md (3.7.1)



Spacy is a popular open-source library used for natural language processing (NLP) tasks. It provides efficient and accurate tokenization, part-of-speech tagging, named entity recognition, and other NLP functionalities.

DocBin is a class provided by SpaCy that allows for efficient serialization and deserialization of SpaCy Doc objects. It is used to store and load large collections of documents in a binary format, which can be useful for training and processing large datasets.

tqdm is a Python library that provides a progress bar for iterating over iterable objects. It is used to visualize the progress of tasks such as iterating over training or validation data, making it easier to track the progress and estimate the remaining time for completion.

In [26]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

In [27]:
nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

In [28]:
# Load the training data
import json
f = open('./all_train_data.json')
TRAIN_DATA = json.load(f)

In [29]:
# Load the validation data
import json
f = open('./all_validation_data.json')
VALIDATION_DATA = json.load(f)

In [5]:
# Converting the training data JSON file into .spacy (docbin) objects
for text, annot in tqdm(TRAIN_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    # Loop through the entities in each annotation
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db.add(doc)

db.to_disk("./training_data.spacy") # save the docbin object

100%|██████████| 44/44 [00:00<00:00, 2782.33it/s]


In [6]:
# Converting the validation data into .spacy (docbin) objects
for text, annot in tqdm(VALIDATION_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db.add(doc)

db.to_disk("./validation_data.spacy") # save the docbin object

100%|██████████| 10/10 [00:00<00:00, 1761.05it/s]


In [7]:
# Extracting config file using spacy config widget
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

[38;5;3m⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.[0m
[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [8]:
# Training
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./validation_data.spacy

[38;5;4mℹ Saving to output directory: .[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     54.50    0.00    0.00    0.00    0.00
 50     200          7.05   1016.43   91.75   90.82   92.71    0.92
114     400          0.00      0.00   93.19   93.68   92.71    0.93
182     600         37.73     10.64   95.38   93.94   96.88    0.95
282     800         53.30      8.96   92.31   90.91   93.75    0.92
382    1000          0.00      0.00   93.81   92.86   94.79    0.94
492    1200         58.79     15.18   95.34   94.85   95.83    0.95
692    1400         17.88      3.16   92.78   91.84   93.75    0.93
892    1600         17.94      3.78   93.33   91.92   94.79    0.93
1092    1800          0.30      0.07   92.86   91.0

The previous lines of code were used to create the model. It is saved under the folder model-best.

The next line of codes are required in the chatbot implementation.

In [9]:
# Load the customer model saved in the model-best folder
nlp_ner = spacy.load("./model-best")

In [24]:
# Add user input here
doc = nlp_ner('''In the morning I eat oatmeal and clothes.
              For lunch I usually eat pasta, pizza and an elephant from my favorite italian restaurant.
              For dinner, I eat greek sandwiches with my collegues during the week days, and occasionally burgers during the weekend.
              During snack time, I usually have fruits, energy bars, and popcorn.''')

In [25]:
spacy.displacy.render(doc, style="ent", jupyter=True)

In this example, the model correcly predicts:
- oatmeal
- pasta
- pizza
- sandwiches
- burgers
- fruits
- energy bars
- popcorn 
as food 

But it also, incorrectly predicts:
- clothes
- greek
- occasionally (burgers)

In [33]:
# Return the food items from the sentence
food_ner = []

for ent in doc.ents:
    if ent.label_ == 'FOOD':
        food_ner.append(ent.text)

print(food_ner)

['oatmeal', 'clothes', 'pasta', 'pizza', 'greek', 'sandwiches', 'occasionally burgers', 'fruits', 'energy bars', 'popcorn']
