# Training a Custom NER Model with spaCy

This notebook contains our code for training custom named-entity recognition (NER) models. We trained one NER model with two custom tags: "SELECT" and "CONSTRAINT". "SELECT" words are attributes that need to be selected. "CONSTRAINT" words are values that limit the number of records to be selected. For instance, in "WHERE player = Paul Seiler", "Paul Seiler" is the constraint. We also trained a second NER model that supports aggregate function tagging, that is, in addition to "CONSTRAINT" it supports the following custom tags: "COUNT SELECT", "AVG SELECT", "SUM SELECT", "MIN SELECT", "MAX SELECT", and "SELECT". This model therefore identifies which select attributes are associated with an aggregate function. The outputs in this notebook reflect our training process for the aggregate NER model. The notebook requires functions from the sqlExtract class. It also assumes spaCy is already installed, since spaCy comes pre-installed in the Google Colab environment.

  Note: change file paths wherever appropriate to avoid errors.

  References:


*   [Custom NER with spaCy v3 Tutorial by 1littlecoder](https://www.youtube.com/watch?v=p_7hJvl7P2A)

*   [Training Pipelines and Models - spaCy Documentation](https://spacy.io/usage/training)


# Setup

Mount Google Drive so that this notebook can access it.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Verify that the notebook is connected to a premium Colab GPU - either a V100 or an A100.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Dec 14 00:54:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Verify that the installed version of spaCy is 3 or later.

In [None]:
!python -m spacy info

spaCy version    3.6.1                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.58+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.6.0)        



Import the sqlExtract module, which extracts values and attributes from a given SQL query.

In [None]:
%load_ext autoreload
%autoreload 2

from sqlExtract import *

# Create Custom NER Annotations

Automate the process of annotating words in the Natural Language Queries (NLQ) that are present in the train and validation split of the WikiSQL dataset.

Import necessary libraries and load the WikiSQL dataset.

In [None]:
!pip install datasets

Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [None]:
from tqdm import tqdm
from datasets import load_dataset
import json
from operator import itemgetter

In [None]:
dataset = load_dataset('wikisql')

Downloading builder script:   0%|          | 0.00/6.57k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.76k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.80k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/26.2M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/15878 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8421 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/56355 [00:00<?, ? examples/s]

The function below checks for overlapping entities, which occur when a word or part of a word is tagged more than once. Consider the example below:

*   NLQ:

    What is the report for races where Will Power had both pole position and fastest lap?
    
*   Expected SQL Query:

    SELECT Report FROM table WHERE Pole position = Will Power AND Fastest lap = Will Power


"Will Power" appears in two WHERE conditions but should be tagged as a value only once.

In [None]:
def is_overlap(s2, e2, iterable):
  """Check for overlapping Entities."""
  for elt in iterable:
    s1 = elt[0]
    e1 = elt[1]
    if s2 <= e1 and s1 <= e2:
      return True
  return False
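
As a quick check of the behavior described above, the sketch below repeats the overlap test on the "Will Power" example. The function body is copied so that the cell runs on its own; the loop that looks up the value twice is only an illustration of what happens when two WHERE conditions share a value, not the notebook's annotation code.

```python
def is_overlap(s2, e2, iterable):
    """Same check as the notebook's is_overlap, repeated so this cell is self-contained."""
    for elt in iterable:
        s1, e1 = elt[0], elt[1]
        if s2 <= e1 and s1 <= e2:
            return True
    return False

nlq = "What is the report for races where Will Power had both pole position and fastest lap?"
value = "Will Power"

entity_list = []
# Both WHERE conditions carry the same value, so the lookup runs twice.
for _ in range(2):
    start = nlq.lower().find(value.lower())
    end = start + len(value)
    if not is_overlap(start, end, entity_list):
        entity_list.append([start, end, "CONSTRAINT"])

print(entity_list)  # only one entry survives the overlap check

# Because the comparison uses <=, a span that merely touches an existing one
# (starting exactly where the other ends) is also treated as overlapping.
touching = is_overlap(45, 50, entity_list)
print(touching)
```

Note that the end offsets are exclusive, so the `<=` comparison is slightly stricter than a pure overlap test: two entities that sit directly next to each other with no character in between would also be rejected.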

The functions below check whether the extracted select attributes and values are explicitly present in the NLQ. Select attributes are not always present in the NLQ, but it may seem odd for an NLQ not to contain a value explicitly. This happens because of formatting issues in the WikiSQL dataset. The following is an example:

*   NLQ:

  If the lorentz factor γ dt/dτ = e/mc 2 is √5 ≅ 2.236, what is the proper velocity w dx/dτ in units of c
*   Expected SQL Query:

  SELECT Proper velocity w dx/dτ in units of c FROM table WHERE Lorentz factor γ dt/dτ = E/mc 2 = √5 ≅ 2.236

The value extracted from the SQL query contains an '=' sign, whereas the NLQ uses 'is' in its place, so the value cannot be found verbatim in the NLQ. Such queries are filtered out of the training and validation data.

In [None]:
def is_select_not_present_in_nlq(selectAttr, question):
  # assume all the select attributes are explicitly present in the nlq
  for attr in selectAttr:
    idx = question.lower().find(attr.lower())
    if attr != '*' and idx == -1:
        return True
  return False

In [None]:
def value_not_present_in_nlq(whereAttrValue, question):
  for attr, value in whereAttrValue.items():
      if question.lower().find(value.lower()) == -1:
        return True
  return False
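
The sketch below applies the value check to the Lorentz factor example. The `where_attr_vals` dictionaries are hypothetical: they assume the WHERE clause is split at the first '=' sign, and the actual output of sqlExtract may differ.

```python
def value_not_present_in_nlq(whereAttrValue, question):
    """Same check as in the notebook, repeated so this cell is self-contained."""
    for attr, value in whereAttrValue.items():
        if question.lower().find(value.lower()) == -1:
            return True
    return False

nlq = ("If the lorentz factor γ dt/dτ = e/mc 2 is √5 ≅ 2.236, "
       "what is the proper velocity w dx/dτ in units of c")

# Hypothetical parse of the WHERE clause (assumption, not real sqlExtract output).
where_attr_vals = {"Lorentz factor γ dt/dτ": "E/mc 2 = √5 ≅ 2.236"}

# The NLQ says "e/mc 2 is √5", not "e/mc 2 = √5", so the value is not found.
skip_row = value_not_present_in_nlq(where_attr_vals, nlq)
print(skip_row)

# A well-formed row for comparison: the value appears verbatim in the NLQ.
ok_nlq = "What is the format for South Australia?"
ok_vals = {"State": "South Australia"}  # hypothetical column name
keep_row = not value_not_present_in_nlq(ok_vals, ok_nlq)
print(keep_row)
```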

The create_annotations function automatically tags words in the NLQ as "CONSTRAINT", "SELECT", or one of the aggregate "SELECT" variants, based on the select attributes and where values extracted from the NLQ's expected SQL query in the WikiSQL train or validation split. It skips queries whose select attributes or values are not explicitly present in the NLQ.

The count parameter limits the number of annotated rows: if count is set to 5, the function stops after annotating 5 rows that pass the filters. If count is None, the whole split is processed.

In [None]:
def create_annotations(split='train', count=None):
  assert split in {'train', 'validation'}
  tags = ["CONSTRAINT", "COUNT SELECT", "AVG SELECT", "SUM SELECT", "MIN SELECT", "MAX SELECT", "SELECT"]
  jsonObject = {"classes": tags,"annotations":[]}
  if count is None:
    count = len(dataset[split])
  i = 0
  for row in tqdm(dataset[split]):
      nlq = row['question']
      sqlQuery = row['sql']['human_readable']
      columns = row['table']['header']
      entityList = []
      sqlExtractor = sqlExtract(sqlQuery)
      selectAttr = sqlExtractor.fetch_select_attr()
      correctWhereAttr, whereAttrVals = sqlExtractor.fetch_where_attr()
      if is_select_not_present_in_nlq(selectAttr, nlq):
        continue
      if value_not_present_in_nlq(whereAttrVals, nlq):
        continue
      for val in whereAttrVals.values():
          idx = nlq.lower().find(val.lower())
          if idx == -1:
            print(nlq)
            print(sqlQuery)
            print(val)
          # assert that the value exists in the NLQ
          assert idx != -1
          if not is_overlap(idx, idx + len(val), entityList):
              entityList.append([idx, idx + len(val), jsonObject["classes"][0]])
      for attr, aggr in selectAttr.items():
          idx = nlq.lower().find(attr.lower())
          if idx == -1:
            print(nlq)
            print(sqlQuery)
            print(attr)
          # assert that the attribute exists in the NLQ
          assert idx != -1
          end = idx + len(attr)
          if not is_overlap(idx, end, entityList):
            if aggr == 'COUNT':
              entityList.append([idx, end, jsonObject["classes"][1]])
            elif aggr == 'AVG':
              entityList.append([idx, end, jsonObject["classes"][2]])
            elif aggr == 'SUM':
              entityList.append([idx, end, jsonObject["classes"][3]])
            elif aggr == 'MIN':
              entityList.append([idx, end, jsonObject["classes"][4]])
            elif aggr == 'MAX':
              entityList.append([idx, end, jsonObject["classes"][5]])
            else:
              entityList.append([idx, end, jsonObject["classes"][6]])
      sorted_list = sorted(entityList, key=itemgetter(0, 1))
      temp = [nlq, {"entities": sorted_list}]
      jsonObject["annotations"].append(temp)
      i += 1
      if i == count:
        break
  return jsonObject
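
To make the annotation format concrete, here is a minimal sketch that builds one record by hand. `find_span`, `tag_for`, and `AGGR_TO_TAG` are hypothetical helpers introduced only for this illustration, and the attribute name "Format" is an assumed column name. The dictionary lookup is an alternative way of writing the if/elif chain above, not the code the notebook runs.

```python
# Hypothetical mapping from aggregate function to tag; anything else falls back to "SELECT".
AGGR_TO_TAG = {
    "COUNT": "COUNT SELECT",
    "AVG": "AVG SELECT",
    "SUM": "SUM SELECT",
    "MIN": "MIN SELECT",
    "MAX": "MAX SELECT",
}

def tag_for(aggr):
    """Return the tag for an aggregate function name, defaulting to plain SELECT."""
    return AGGR_TO_TAG.get(aggr, "SELECT")

def find_span(nlq, phrase, label):
    """Case-insensitive first match as [start, end, label]; end is exclusive."""
    start = nlq.lower().find(phrase.lower())
    if start == -1:
        return None
    return [start, start + len(phrase), label]

nlq = "What is the format for South Australia?"
entities = [
    find_span(nlq, "South Australia", "CONSTRAINT"),
    find_span(nlq, "Format", tag_for("")),  # no aggregate function on this attribute
]
# Records are stored as [text, {"entities": [[start, end, label], ...]}], sorted by offset.
record = [nlq, {"entities": sorted(entities)}]
print(record)
```

The resulting offsets match the corresponding entry in the training data shown later in this notebook.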

The function below saves the annotated data to a JSON file.

In [None]:
def save_annotated_data(filePath, jsonObject):
    # ensure_ascii=False keeps non-ASCII characters (e.g. 'τ') readable in the file
    with open(filePath, "w", encoding="utf-8") as f:
        f.write(json.dumps(jsonObject, ensure_ascii=False))
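
The following sketch shows a save-and-reload round trip on a tiny sample. It uses a variant of the function that pins the file encoding to UTF-8, which is an assumption on our part, and writes to a temporary directory instead of Google Drive. The sample record is made up for illustration.

```python
import json
import os
import tempfile

def save_annotated_data(filePath, jsonObject):
    """Variant of the notebook's function with an explicit UTF-8 encoding (assumption)."""
    with open(filePath, "w", encoding="utf-8") as f:
        f.write(json.dumps(jsonObject, ensure_ascii=False))

# Illustrative sample containing a non-ASCII character.
sample = {
    "classes": ["CONSTRAINT", "SELECT"],
    "annotations": [
        ["what is the proper velocity w dx/dτ", {"entities": [[12, 35, "SELECT"]]}],
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_annotations.json")
    save_annotated_data(path, sample)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    reloaded = json.loads(raw_text)

# With ensure_ascii=False the character is stored as-is rather than as an escape sequence.
stored_verbatim = "τ" in raw_text
print(reloaded == sample, stored_verbatim)
```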

In the cell below, the format of the automatically annotated data is tested by comparing it with manually annotated data for the first 17 rows. The data was manually annotated with [this NER annotator](https://tecoholic.github.io/ner-annotator/) and saved in the mannual_annotations.json file. The annotator added a carriage return character ('\r') at the end of every string; we removed these by hand to ensure a fair comparison.

In [None]:
autoAnnotations = create_annotations('train', 17)
mannualAnnotationsFilePath = '/content/drive/MyDrive/train_spacy/mannual_annotations.json'
with open(mannualAnnotationsFilePath) as f1:
    assert json.load(f1) == autoAnnotations
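
We removed the carriage returns by hand, but the same cleanup could be scripted. The sketch below is one possible way to do it; `strip_carriage_returns` is a hypothetical helper and was not part of our workflow. Because only trailing characters are dropped, the entity offsets stay valid.

```python
def strip_carriage_returns(json_object):
    """Hypothetical cleanup: drop trailing carriage returns from each annotated text."""
    cleaned = [[text.rstrip("\r"), ents] for text, ents in json_object["annotations"]]
    return {"classes": json_object["classes"], "annotations": cleaned}

# Illustrative sample in the annotator's export format, with a trailing '\r'.
manual = {
    "classes": ["CONSTRAINT", "SELECT"],
    "annotations": [
        ["What is the format for South Australia?\r",
         {"entities": [[12, 18, "SELECT"], [23, 38, "CONSTRAINT"]]}],
    ],
}

cleaned_manual = strip_carriage_returns(manual)
print(repr(cleaned_manual["annotations"][0][0]))
```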

Call the functions defined above to automatically annotate the training and validation splits of the WikiSQL dataset.

In [None]:
trainingDataFilePath = '/content/drive/MyDrive/train_spacy/training_data.json'
validationDataFilePath = '/content/drive/MyDrive/train_spacy/validation_data.json'
trainJsonObject = create_annotations('train')
print('\nNumber of NLQs annotated from the train split:', len(trainJsonObject["annotations"]), '\n')
save_annotated_data(trainingDataFilePath, trainJsonObject)
validationJsonObject = create_annotations('validation')
print('\nNumber of NLQs annotated from the validation split:', len(validationJsonObject["annotations"]), '\n')
save_annotated_data(validationDataFilePath, validationJsonObject)
print("Completed annotations!")

100%|██████████| 56355/56355 [00:20<00:00, 2719.08it/s]



Number of NLQs annotated from the train split: 44570 



100%|██████████| 8421/8421 [00:02<00:00, 3271.46it/s]



Number of NLQs annotated from the validation split: 6713 

Completed annotations!
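
Since the aggregate tags are only assigned when the SQL query uses an aggregate function, it can be useful to see how often each tag occurs. The sketch below counts labels in an annotation object; `count_labels` is a hypothetical helper, and the two sample records are taken from the data displayed later in this notebook rather than from the full splits.

```python
from collections import Counter

def count_labels(json_object):
    """Count how often each tag occurs across all annotated questions."""
    counts = Counter()
    for _text, annot in json_object["annotations"]:
        for _start, _end, label in annot["entities"]:
            counts[label] += 1
    return counts

sample = {
    "classes": ["CONSTRAINT", "COUNT SELECT", "SELECT"],
    "annotations": [
        ["What is the format for South Australia?",
         {"entities": [[12, 18, "SELECT"], [23, 38, "CONSTRAINT"]]}],
        ["What are the total number of positions on the Toronto team in 2006-07?",
         {"entities": [[29, 37, "COUNT SELECT"], [62, 69, "CONSTRAINT"]]}],
    ],
}

label_counts = count_labels(sample)
print(label_counts.most_common())
```

Passing `trainJsonObject` or `validationJsonObject` instead of `sample` would give the distribution for a full split.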


The saved training and validation annotations are loaded again so that they can be converted into spaCy's binary .spacy format.

In [None]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en") # create a blank English pipeline
db = DocBin() # create a DocBin object

In [None]:
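import json

# Context managers close the files once the data has been loaded
with open(trainingDataFilePath, encoding="utf-8") as f:
    TRAIN_DATA = json.load(f)
with open(validationDataFilePath, encoding="utf-8") as f1:
    VALIDATION_DATA = json.load(f1)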

Verify the training and validation data by displaying them.

In [None]:
TRAIN_DATA

{'classes': ['CONSTRAINT',
  'COUNT SELECT',
  'AVG SELECT',
  'SUM SELECT',
  'MIN SELECT',
  'MAX SELECT',
  'SELECT'],
 'annotations': [['Tell me what the notes are for South Australia ',
   {'entities': [[17, 22, 'SELECT'], [31, 46, 'CONSTRAINT']]}],
  ['What is the current series where the new series began in June 2011?',
   {'entities': [[12, 26, 'SELECT'], [37, 66, 'CONSTRAINT']]}],
  ['What is the format for South Australia?',
   {'entities': [[12, 18, 'SELECT'], [23, 38, 'CONSTRAINT']]}],
  ['what is the fuel propulsion where the fleet series (quantity) is 310-329 (20)?',
   {'entities': [[12, 27, 'SELECT'], [65, 77, 'CONSTRAINT']]}],
  ['who is the manufacturer for the order year 1998?',
   {'entities': [[11, 23, 'SELECT'], [43, 47, 'CONSTRAINT']]}],
  ['what is the powertrain (engine/transmission) when the order year is 2000?',
   {'entities': [[12, 44, 'SELECT'], [68, 72, 'CONSTRAINT']]}],
  ['What if the description of a ch-47d chinook?',
   {'entities': [[12, 23, 'SELECT'

In [None]:
VALIDATION_DATA

{'classes': ['CONSTRAINT',
  'COUNT SELECT',
  'AVG SELECT',
  'SUM SELECT',
  'MIN SELECT',
  'MAX SELECT',
  'SELECT'],
 'annotations': [['What position does the player who played for butler cc (ks) play?',
   {'entities': [[5, 13, 'SELECT'], [45, 59, 'CONSTRAINT']]}],
  ['Who is the player that wears number 42?',
   {'entities': [[11, 17, 'SELECT'], [36, 38, 'CONSTRAINT']]}],
  ['What player played guard for toronto in 1996-97?',
   {'entities': [[5, 11, 'SELECT'],
     [19, 24, 'CONSTRAINT'],
     [40, 47, 'CONSTRAINT']]}],
  ['Who are all of the players on the Westchester High School club team?',
   {'entities': [[19, 25, 'SELECT'], [34, 57, 'CONSTRAINT']]}],
  ['What school/club team is Amir Johnson on?',
   {'entities': [[5, 21, 'SELECT'], [25, 37, 'CONSTRAINT']]}],
  ['What are the total number of positions on the Toronto team in 2006-07?',
   {'entities': [[29, 37, 'COUNT SELECT'], [62, 69, 'CONSTRAINT']]}],
  ['What are the nationality of the players on the Fresno State schoo

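The cells below convert the training and validation data into DocBin objects and save them as .spacy files. Entities whose character offsets cannot be aligned to token boundaries are skipped.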

In [None]:
for text, annot in tqdm(TRAIN_DATA['annotations']):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("/content/drive/MyDrive/train_spacy/training_data.spacy") # save the docbin object
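
The `alignment_mode="contract"` argument controls what happens when character offsets do not fall on token boundaries: the span shrinks to the tokens that lie completely inside it, and `char_span` returns None if no token does. The pure-Python sketch below imitates that rule so it can run without spaCy. It assumes whitespace tokenization, which is simpler than spaCy's real tokenizer, and `token_offsets` and `contract_span` are hypothetical helpers, not spaCy functions.

```python
import re

def token_offsets(text):
    """Whitespace tokens as (start, end) character offsets (a simplification of spaCy's tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def contract_span(text, start, end):
    """Return the character span of all tokens fully inside [start, end), or None if there are none."""
    inside = [(s, e) for s, e in token_offsets(text) if s >= start and e <= end]
    if not inside:
        return None
    return inside[0][0], inside[-1][1]

text = "Tell me what the notes are for South Australia"

exact = contract_span(text, 17, 22)    # offsets already on token boundaries
shrunk = contract_span(text, 17, 24)   # "notes a" contracts to "notes"
dropped = contract_span(text, 18, 21)  # strictly inside "notes": no full token

print(exact, shrunk, dropped)
```

The last case corresponds to the "Skipping entity" message in the conversion cell above.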

In [None]:
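db = DocBin() # start from an empty DocBin so the validation file does not also contain the training docs
for text, annot in tqdm(VALIDATION_DATA['annotations']):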
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("/content/drive/MyDrive/train_spacy/validation_data.spacy") # save the docbin object

The spacy-transformers library is installed to train the transformer-based RoBERTa model for custom NER.

In [None]:
!pip install spacy-transformers

Installing collected packages: spacy-alignments, spacy-transformers
Successfully installed spacy-alignments-0.9.1 spacy-transformers-1.3.3


The base_config.cfg file contains important settings for the model, such as the initial learning rate, batch size, and optimizer. Many of these values are defaults recommended in the spaCy documentation. We created the base_config.cfg file, optimized for accuracy, by following the [spaCy training documentation](https://spacy.io/usage/training). The command below fills in the remaining defaults and writes the full config.cfg.

In [None]:
!python -m spacy init fill-config '/content/drive/MyDrive/train_spacy/base_config.cfg' '/content/drive/MyDrive/train_spacy/config.cfg'

✔ Auto-filled config with all values
✔ Saved config
/content/drive/MyDrive/train_spacy/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Below are the logs from training our custom NER model. Training stopped automatically partway through because our Colab environment ran out of memory. However, we trained for as many epochs as possible to minimize the loss.

In [None]:
! python -m spacy train '/content/drive/MyDrive/train_spacy/config.cfg' --output '/content/drive/MyDrive/train_spacy' --paths.train '/content/drive/MyDrive/train_spacy/training_data.spacy' --paths.dev '/content/drive/MyDrive/train_spacy/validation_data.spacy' --gpu-id 0

ℹ Saving to output directory: /content/drive/MyDrive/train_spacy
ℹ Using GPU: 0
config.json: 100% 481/481 [00:00<00:00, 2.07MB/s]
vocab.json: 100% 899k/899k [00:00<00:00, 15.7MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 19.2MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 27.8MB/s]
model.safetensors: 100% 499M/499M [00:03<00:00, 153MB/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Initialized pipeline
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0         598.41    503.67    0.10    0.07    0.19    0.

Test the model after training.

In [None]:
import spacy_transformers
nlp_ner = spacy.load("/content/drive/MyDrive/train_spacy/model-best")

In [None]:
doc = nlp_ner('''What is the maximum capacity of the Otkrytie Arena stadium?''') # input sample text

In [None]:
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Colab

The cells below can be used to download the model from Google Drive for external use.

In [None]:
!zip -r /content/file.zip '/content/drive/MyDrive/train_spacy/model-best'

  adding: content/drive/MyDrive/train_spacy/model-best/ (stored 0%)
  adding: content/drive/MyDrive/train_spacy/model-best/tokenizer (deflated 81%)
  adding: content/drive/MyDrive/train_spacy/model-best/meta.json (deflated 56%)
  adding: content/drive/MyDrive/train_spacy/model-best/config.cfg (deflated 61%)
  adding: content/drive/MyDrive/train_spacy/model-best/transformer/ (stored 0%)
  adding: content/drive/MyDrive/train_spacy/model-best/transformer/cfg (stored 0%)
  adding: content/drive/MyDrive/train_spacy/model-best/transformer/model (deflated 11%)
  adding: content/drive/MyDrive/train_spacy/model-best/ner/ (stored 0%)
  adding: content/drive/MyDrive/train_spacy/model-best/ner/model (deflated 8%)
  adding: content/drive/MyDrive/train_spacy/model-best/ner/moves (deflated 48%)
  adding: content/drive/MyDrive/train_spacy/model-best/ner/cfg (deflated 33%)
  adding: content/drive/MyDrive/train_spacy/model-best/vocab/ (stored 0%)
  adding: content/drive/MyDrive/train_spacy/model-best/vo

In [None]:
from google.colab import files
files.download("/content/file.zip")