In [15]:
# Check if spacy is installed, and what models are available
!python -m spacy info

[1m

spaCy version    3.7.5                         
Location         C:\Users\Haingo\AppData\Roaming\Python\Python311\site-packages\spacy
Platform         Windows-10-10.0.22631-SP0     
Python version   3.11.7                        
Pipelines        en_core_web_sm (3.7.1), en_core_web_md (3.7.1)



Spacy is a popular open-source library used for natural language processing (NLP) tasks. It provides efficient and accurate tokenization, part-of-speech tagging, named entity recognition, and other NLP functionalities.

DocBin is a class provided by SpaCy that allows for efficient serialization and deserialization of SpaCy Doc objects. It is used to store and load large collections of documents in a binary format, which can be useful for training and processing large datasets.

tqdm is a Python library that provides a progress bar for iterating over iterable objects. It is used to visualize the progress of tasks such as iterating over training or validation data, making it easier to track the progress and estimate the remaining time for completion.

In [1]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

In [2]:
nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

In [18]:
# Load the training data
import json
f = open('./all_train_data.json')
TRAIN_DATA = json.load(f)

In [19]:
# Load the validation data
import json
f = open('./all_validation_data.json')
VALIDATION_DATA = json.load(f)

In [20]:
# Converting the training data JSON file into .spacy (docbin) objects
for text, annot in tqdm(TRAIN_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    # Loop through the entities in each annotation
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db.add(doc)

db.to_disk("./training_data.spacy") # save the docbin object

100%|██████████| 383/383 [00:00<00:00, 5558.06it/s]


In [21]:
# Converting the validation data into .spacy (docbin) objects
for text, annot in tqdm(VALIDATION_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db.add(doc)

db.to_disk("./validation_data.spacy") # save the docbin object

100%|██████████| 120/120 [00:00<00:00, 5077.34it/s]


In [22]:
# Extracting config file using spacy config widget
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

[38;5;3m⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.[0m
[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [23]:
# Training
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./validation_data.spacy

[38;5;4mℹ Saving to output directory: .[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     49.21    0.00    0.00    0.00    0.00
  4     200         83.46   1436.14   92.18   92.28   92.09    0.92
 11     400         49.88    120.97   94.97   94.21   95.74    0.95
 18     600         24.25     29.71   94.27   93.43   95.13    0.94
 27     800          0.02      0.01   94.98   94.04   95.94    0.95
 39    1000         12.08      4.86   94.95   94.57   95.33    0.95
 52    1200         22.30      6.29   94.60   93.29   95.94    0.95
 69    1400          7.33      7.97   95.36   94.79   95.94    0.95
 90    1600        188.19     58.27   93.89   92.69   95.13    0.94
115    1800        154.51     19.69   94.26   93.60

The previous lines of code were used to create the model. It is saved under the folder model-best.

The next line of codes are required in the chatbot implementation.

In [3]:
# Load the customer model saved in the model-best folder
nlp_ner = spacy.load("./model-best")

In [25]:
# Add user input here
doc = nlp_ner('''In the morning I eat oatmeal and clothes.
              For lunch I usually eat pasta, pizza and an elephant from my favorite italian restaurant.
              For dinner, I eat greek sandwiches with my collegues during the week days, and occasionally burgers during the weekend.
              During snack time, I usually have fruits, energy bars, and popcorn.''')

In [26]:
spacy.displacy.render(doc, style="ent", jupyter=True)

In the previous model, the correctly predicted foods were:
- oatmeal
- pasta
- pizza
- greek sandwiches
- burgers
- fruits
- energy bars
- popcorn 
as food 

But it also, incorrectly predicts:
- clothes
- occasionally (burgers)

In the current mode, the correctly predicted foods are:
- oatmeal
-pasta
- pizza
- greek sandwich
- burgers
- bars

But it also incorrctly predicts:
- colleges during & during
- week days
- weekend
- fruits
- energy [bars]
- popcorn


In [27]:
# Return the food items from the sentence
food_ner = []

for ent in doc.ents:
    if ent.label_ == 'FOOD':
        food_ner.append(ent.text)

print(food_ner)

['oatmeal', 'pasta', 'pizza', 'elephant', 'favorite italian', 'greek sandwiches', 'collegues during', 'burgers during', 'weekend', 'bars']


In [11]:
# Add user input here
doc1 = nlp_ner('''I am allergic to milk. I am pregnant and I want some ice cream.
              I used to eat fish and bacon before following a vegetarian diet.''')

In [12]:
spacy.displacy.render(doc1, style="ent", jupyter=True)

In this piece of sentence, the model correctly predicts:
- Allergic to milk (as a special need)
- Pregnant (as a special need)
- [some] ice cream (as food)
- fish (as food)
- bacon (as food)

But incorrectly predicts:
- Vegetarian (as food instead of preference)

In [13]:
# Add user input here
doc2 = nlp_ner('''I am on Keto diet. I eat a lot of meat, eggs, and cheese.
               I am a body builder..''')

In [14]:
spacy.displacy.render(doc2, style="ent", jupyter=True)

The model correctly classifies: 
- meat (as food)
- eggs (as food)
- cheese (as food)
- body builder (as preference)  as defined by the training data

But it incorrectly classifies 
- Keto diet (as food instead of a preference)