**Importing the Required Libraries and Downloading SpaCy Models**
Here, we import the spacy (**version 3**) library and load a model named **en_core_web_sm** using spacy.load("en_core_web_sm"). The loaded model is assigned to a variable named nlp. The config-based training commands used later in this notebook (spacy init fill-config and spacy train) require version 3. The model must be downloaded before it can be loaded, which was done with the following steps:
1. Open the Anaconda Prompt with administrator privileges
2. Switch to the correct Anaconda environment (rdkit-env)
3. Download the model with python -m spacy download en_core_web_sm

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Creating a SpaCy Doc Object**
We create a Doc object by passing a text string to the loaded SpaCy model (nlp). This text is processed by the NLP pipeline, and the resulting document is assigned to the variable doc.

In [2]:
doc = nlp("University of Peradeniya is the most beautiful university in Sri Lanka")

**Accessing and Visualizing the Entities in the Document**
The ents attribute of the Doc object contains the named entities recognized in the document. The following code imports the displacy module from SpaCy and renders the named entities in the document using the "ent" (entity) style.

In [3]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
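
Besides being rendered, the entities can also be read programmatically. The following is a minimal sketch of the attributes each entity span exposes (text, start_char, end_char, label_). So that it runs without a downloaded model, it assumes a blank English pipeline and sets two entities by hand; the labels ORG and GPE are illustrative choices, not the actual output of en_core_web_sm.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline (tokenizer only) stands in for the loaded model
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("University of Peradeniya is the most beautiful university in Sri Lanka")

# Hand-set entities over token ranges; these labels are illustrative only
sample_doc.ents = [
    Span(sample_doc, 0, 3, label="ORG"),    # "University of Peradeniya"
    Span(sample_doc, 9, 11, label="GPE"),   # "Sri Lanka"
]

# Each entity is a Span carrying its text, character offsets, and label
entity_rows = [(ent.text, ent.start_char, ent.end_char, ent.label_)
               for ent in sample_doc.ents]
for row in entity_rows:
    print(row)
```

The character offsets and label shown here have the same (start, end, label) shape as the training annotations prepared later in the notebook.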

**Loading a Training Dataset in JSON Format**
Here, we import the json library and load training data from a JSON file named "Corona2.json" using the json.load() function. The loaded data is stored in the data variable.

In [4]:
import json
with open('Corona2.json', 'r') as f:
    data = json.load(f)

**Analyzing the Format of the JSON File**
Here, we inspect the structure of the JSON file with the pprint (pretty-print) module. A depth of 2 is a good starting point for seeing the overall structure of the file.

In [5]:
from pprint import pprint
pprint(data, depth=2, compact=True)
pprint(data['examples'][0], depth=1, compact=True)

{'examples': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...},
              {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...},
              {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...},
              {...}, {...}, {...}, {...}]}
{'annotations': [...],
 'classifications': [],
 'content': 'While bismuth compounds (Pepto-Bismol) decreased the number of '
            "bowel movements in those with travelers' diarrhea, they do not "
            'decrease the length of illness.[91] Anti-motility agents like '
            'loperamide are also effective at reducing the number of stools '
            'but not the duration of disease.[8] These agents should be used '
            'only if bloody diarrhea is not present.[92]\n'
            '\n'
            'Diosmectite, a natural aluminomagnesium silicate clay, is '
            'effective in alleviating symptoms of acute diarrhea in '
            'children,[93] and also has some effects in chronic funct
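
Once the overall shape is known, a quick tally of the annotation labels shows which entity types the data contains. This is a minimal sketch, assuming each annotation carries a tag_name key as used in the conversion step that follows; the two-example sample_data is made up for illustration and stands in for the loaded data.

```python
from collections import Counter

# Hypothetical stand-in with the same shape as the loaded JSON
sample_data = {
    "examples": [
        {"content": "Loperamide eases diarrhea.",
         "annotations": [
             {"start": 0, "end": 10, "tag_name": "Medicine"},
             {"start": 17, "end": 25, "tag_name": "MedicalCondition"},
         ]},
        {"content": "Diosmectite is a clay.",
         "annotations": [{"start": 0, "end": 11, "tag_name": "Medicine"}]},
    ]
}

# Count how often each (upper-cased) label occurs across all examples
label_counts = Counter(
    annotation["tag_name"].upper()
    for example in sample_data["examples"]
    for annotation in example["annotations"]
)
print(label_counts)
```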

**Converting Data Into a Format Suitable for SpaCy**
This code prepares the training data in the required format for SpaCy. It iterates over the examples in the loaded data, extracts the text and annotations (start position, end position, and label) for each example, and appends them to a list called training_data.

In [6]:
training_data = []
for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
pprint(training_data[0])

{'entities': [(360, 371, 'MEDICINE'),
              (383, 408, 'MEDICINE'),
              (104, 112, 'MEDICALCONDITION'),
              (679, 689, 'MEDICINE'),
              (6, 23, 'MEDICINE'),
              (25, 37, 'MEDICINE'),
              (461, 470, 'MEDICALCONDITION'),
              (577, 589, 'MEDICINE'),
              (853, 865, 'MEDICALCONDITION'),
              (188, 198, 'MEDICINE'),
              (754, 762, 'MEDICALCONDITION'),
              (870, 880, 'MEDICALCONDITION'),
              (823, 833, 'MEDICINE'),
              (852, 853, 'MEDICALCONDITION'),
              (461, 469, 'MEDICALCONDITION'),
              (535, 543, 'MEDICALCONDITION'),
              (692, 704, 'MEDICINE'),
              (563, 571, 'MEDICALCONDITION')],
 'text': 'While bismuth compounds (Pepto-Bismol) decreased the number of bowel '
         "movements in those with travelers' diarrhea, they do not decrease "
         'the length of illness.[91] Anti-motility agents like loperamide are '
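
Because the offsets are plain character positions, slicing the text with them is a quick sanity check that each annotation covers what its label claims. This is a minimal sketch using the opening sentence and three of the entities printed above; check_example is a name introduced here for illustration.

```python
# Opening sentence and three entities copied from the first training example
check_example = {
    "text": "While bismuth compounds (Pepto-Bismol) decreased the number of bowel "
            "movements in those with travelers' diarrhea, they do not decrease "
            "the length of illness.",
    "entities": [
        (6, 23, "MEDICINE"),
        (25, 37, "MEDICINE"),
        (104, 112, "MEDICALCONDITION"),
    ],
}

# End offsets are exclusive, so text[start:end] is exactly the annotated span
covered = [(check_example["text"][start:end], label)
           for start, end, label in check_example["entities"]]
for surface, label in covered:
    print(f"{label:18}{surface!r}")
```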
         

**Converting Training Data to SpaCy DocBin Format**
In this section, we convert the training data to SpaCy's DocBin format, an efficient binary format for storing Doc objects. We initialize a blank English pipeline using spacy.blank("en"), create an empty DocBin, and iterate over each training example. For each example, we create a Doc object using nlp.make_doc(text) and build a span for each entity with doc.char_span; annotations whose offsets cannot be aligned to token boundaries yield None and are skipped. We then remove overlapping entities with filter_spans, assign the result to the ents attribute of the Doc object, and add the Doc to the DocBin. Finally, we save the DocBin to a file named "train.spacy".
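
The overlap filtering matters because doc.ents cannot hold overlapping spans, and the annotations printed earlier do overlap, for example (461, 470) and (461, 469). The sketch below illustrates the idea on plain character offsets: prefer longer spans and, among equal lengths, earlier ones, keeping a span only if it does not collide with one already kept. It is a simplified stand-in written for this explanation, not SpaCy's implementation; filter_spans itself works on Span objects and token indices, and keep_longest is a hypothetical name.

```python
def keep_longest(entities):
    """Drop overlapping (start, end, label) tuples, preferring longer spans."""
    # Longest first; ties broken by earlier start
    ranked = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ranked:
        # Keep the span only if it lies entirely before or after every kept span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Offsets taken from the first training example
annotations = [
    (461, 470, "MEDICALCONDITION"),
    (461, 469, "MEDICALCONDITION"),
    (852, 853, "MEDICALCONDITION"),
    (853, 865, "MEDICALCONDITION"),
]
print(keep_longest(annotations))
```

Here (461, 469) is dropped in favor of the longer (461, 470), while (852, 853) and (853, 865) both survive because end offsets are exclusive and the two ranges only touch.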

In [7]:
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

# Blank English pipeline: only its tokenizer is needed to build the Docs
nlp = spacy.blank("en")
doc_bin = DocBin()


for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|██████████| 31/31 [00:00<00:00, 419.01it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
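
Each "Skipping entity" line corresponds to an annotation for which char_span returned None. With alignment_mode="contract", offsets that cut into a token are pulled inward to the nearest token boundaries, and None results when no whole token remains inside the range. The following is a small sketch on a made-up sentence; the text and offsets are illustrative and not taken from the dataset.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_doc = demo_nlp.make_doc("loperamide reduces diarrhea")

# Offsets that match token boundaries exactly
aligned = demo_doc.char_span(0, 10, label="MEDICINE", alignment_mode="contract")
# The end offset cuts into "reduces"; the span contracts to the whole token before it
trimmed = demo_doc.char_span(0, 12, label="MEDICINE", alignment_mode="contract")
# Offsets strictly inside one token leave no whole token, so the result is None
inside = demo_doc.char_span(2, 8, label="MEDICINE", alignment_mode="contract")

for name, span in [("aligned", aligned), ("trimmed", trimmed), ("inside", inside)]:
    print(name, "->", None if span is None else span.text)
```

Printing the offending offsets and text[start:end] in place of the fixed message would make it easier to see which annotations were dropped.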





**Initializing the Training Configuration**
This command takes a base configuration file named "base_config.cfg" and fills in all remaining settings with their default values. The filled configuration is saved as "config.cfg", which can be further customized for training the NER model. The base config file was generated using the widget at https://spacy.io/usage/training. Make sure to select **ner** when generating the base configuration file. The generated file may contain version numbers that do not match your SpaCy installation and need to be corrected.

In [8]:
!python -m spacy init fill-config base_config.cfg config.cfg

[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
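
Before training, it can help to confirm that the config really lists ner in its pipeline. SpaCy's config files use an INI-like syntax, so Python's standard configparser can read simple values from them. This is a minimal sketch, assuming an [nlp] section that resembles the fragment below; the fragment is illustrative, and a real config from the widget contains many more sections.

```python
import configparser
import json

# Illustrative fragment; in practice, call parser.read("config.cfg") instead
config_text = """
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
"""

# interpolation=None returns raw values without configparser's own % interpolation
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(config_text)

# These values are JSON-compatible, so json.loads recovers the list
pipeline = json.loads(parser["nlp"]["pipeline"])
print(pipeline, "ner" in pipeline)
```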


**Training the NER Model**
This command trains the NER model using the configuration file ("config.cfg") and the training data in "train.spacy"; the same file is also passed as the development set via --paths.dev. The trained pipelines are saved in the current directory as model-best and model-last.

In [9]:
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

[i] Using CPU
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    150.79    0.00    0.00    0.00    0.00
  7     200        758.68   3352.17   76.56   81.28   72.36    0.77
 14     400        281.45    704.34   96.73   97.13   96.34    0.97
 22     600        202.63    288.23   96.77   96.00   97.56    0.97
 30     800        326.15    252.71   97.76   97.96   97.56    0.98
 40    1000        197.85    193.20   97.97   97.97   97.97    0.98
 52    1200        268.27    181.26   98.57   99.18   97.97    0.99
 65    1400        915.85    233.16   98.37   98.37   98.37    0.98
 82    1600        300.12    214.70   98.17   98.37   97.97    0.98
103    1800        134.77    194.41   98.38   97.98   98.78    0.98
128    2000        163.05    220.61   98.78   98.78   98.78    0.99
160 

Set up nlp object from config
Pipeline: ['tok2vec', 'ner']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: ['tok2vec', 'ner']
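
Because the run above evaluates on the same file it trains on, the reported scores describe data the model has already seen. The following is a minimal sketch of holding out part of the examples before building the DocBin files, assuming a list shaped like training_data; the 80/20 ratio, the seed, and the name split_examples are illustrative choices.

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a dev portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Placeholder records standing in for the 31 converted training examples
placeholder_data = [{"text": f"example {i}", "entities": []} for i in range(31)]
train_examples, dev_examples = split_examples(placeholder_data)
print(len(train_examples), len(dev_examples))
```

Each part would then be written to its own file (for instance train.spacy and dev.spacy) with the same DocBin loop shown earlier, and --paths.dev pointed at the held-out file.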


**Loading the Trained NER Model and Visualizing Entities**
In this part, we load the trained NER model using spacy.load("model-best"). Then, we process a sample text using the loaded model and store the result in the doc variable. Finally, we define colors for entity types, specify options, and use displacy.render to visualize the entities with their corresponding colors in the Jupyter Notebook.

In [14]:
nlp_ner = spacy.load("model-best")
doc = nlp_ner("While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.")
colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION": "#a6e22d"}
options = {"colors": colors}
spacy.displacy.render(doc, style="ent", options=options, jupyter=True)