This notebook was programmed by Tristram Dacayan locally on Google Colaboratory.

## Section I.  Installing Dependencies

To start, we must first download and install the Spacy libraries as this is the main library that we will be using to train and use the Named Entity Recognition Tagger. 

In [1]:
!pip install -U spacy -q
!pip install -U spacy-transformers -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.7/193.7 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m60.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m43.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m200.1/200.1 kB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m55.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
!python -m spacy info

2023-04-21 21:26:52.928364: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-21 21:26:55.248201: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-04-21 21:26:55.248744: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-04-

To save our progress, we mount our google drive to the notebook so we can use the model after we train it.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Section II. Generating the dataset

We first need to initialize a blank pipeline that we will use for our NER task. We will also initialize a DocBin object to use for our annotations. We use a DocBin in particular because it is easier to use for debugging in case an error occurs.

In [4]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import json

nlp = spacy.blank("en")
db = DocBin()

In [5]:
f = open("train_data_updated_4801.json")
TRAIN_DATA = json.load(f)

f = open("valid_data_updated_4801.json")
VALID_DATA = json.load(f)

The for loop here is used to gather all the annotations for each entry in the annotation file. The annotations are in the format (starting_position, ending_position, label) which is then appended to an ents variable and added to the DocBin object. If, however, the annotation overlaps or conflicts with another annotation, the entry is then skipped due to how the Spacy library works.

In [13]:
skipped = []
for text, annot in tqdm(TRAIN_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    
    # Manually Fix this later (CURRENT SKIPPED #: 21/167)
    try:
      doc.ents = ents 
      db.add(doc)
    except ValueError:
      skipped.append(annot)

db.to_disk("./training_data.spacy") # save the docbin object

100%|██████████| 79/79 [00:00<00:00, 5216.80it/s]


In [None]:
# Run to find skipped entries
print(skipped)

The same process is then done on the validation set.

In [14]:
skipped_valid = []
for text, annot in tqdm(VALID_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    
    # Manually Fix this later (CURRENT SKIPPED #: 1/16)
    try:
      doc.ents = ents 
      db.add(doc)
    except ValueError:
      skipped_valid.append(annot)

db.to_disk("./valid_data.spacy") # save the docbin object

100%|██████████| 30/30 [00:00<00:00, 2451.18it/s]


In [15]:
# Run to find skipped entries
print(skipped_valid)

[{'entities': [[0, 4, 'BRAND'], [5, 21, 'MODEL'], [28, 46, 'ISSUE'], [51, 80, 'ISSUE'], [67, 80, 'COMPONENT']]}, {'entities': [[0, 6, 'BRAND'], [55, 60, 'BRAND'], [7, 21, 'MODEL'], [39, 73, 'ISSUE'], [85, 101, 'ISSUE'], [61, 73, 'COMPONENT']]}, {'entities': [[0, 9, 'BRAND'], [77, 84, 'BRAND'], [10, 26, 'MODEL'], [43, 72, 'ISSUE'], [59, 72, 'COMPONENT'], [85, 97, 'COMPONENT']]}]


This section debugs and checks whether or not the spacy files that were generated in the previous step are valid by counting how many entities are in the files.

In [19]:
from spacy.lang.en import English

# Check Data Validity
nlp = English()

doc_bin = DocBin().from_disk("./training_data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
entities = 0
for doc in docs:
    entities += len(doc.ents)
print(f"TRAIN docs: {len(docs)} with {entities} entities")

doc_bin = DocBin().from_disk("./valid_data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
entities = 0
for doc in docs:
    entities += len(doc.ents)
print(f"DEV docs: {len(docs)} with {entities} entities")

TRAIN docs: 315 with 1474 entities
DEV docs: 342 with 1666 entities


In [20]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
locale.getpreferredencoding()

'UTF-8'

## Section III. Training
Using the previous spacy training files that we generated, we can use spacy's integrated training command in the CLI to train our ouwn custom NER model. First, we must create the config file that will be used for the training.

This config file specifies that our pipeline will be in English, used for the NER task, and will be optimized for efficiency over accuracy. However, we can change the optimization metric if the accuracy is too low.

In [21]:
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --gpu --force

[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: GPU
- Transformer: roberta-base
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


This line trains the model (simple as that). We chose to run it with the default amount of epochs.

In [22]:
!python -m spacy train config.cfg --output /content/drive/MyDrive/notebooks/NLP/output --paths.train ./training_data.spacy --paths.dev ./valid_data.spacy --gpu-id 0

[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/notebooks/NLP/output[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2023-04-21 21:31:57,827] [INFO] Set up nlp object from config
[2023-04-21 21:31:57,838] [INFO] Pipeline: ['transformer', 'ner']
[2023-04-21 21:31:57,841] [INFO] Created vocabulary
[2023-04-21 21:31:57,842] [INFO] Finished initializing nlp object
Downloading (…)lve/main/config.json: 100% 481/481 [00:00<00:00, 71.8kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 3.57MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 5.15MB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 4.46MB/s]
Downloading pytorch_model.bin: 100% 501M/501M [00:04<00:00, 100MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expect

# Section IV. Run model inference/ Model Testing


In [23]:
# Add path to the model
nlp_ner = spacy.load("/content/drive/MyDrive/notebooks/NLP/output/model-best")

In [26]:
# Test 1
doc = nlp_ner('''My MacBook Pro is overheating and crashing while running Adobe Premiere Pro with its Intel Core i9 processor. 
The Samsung Galaxy S20 is experiencing frequent app crashes and freezing when using Snapchat with its Qualcomm Snapdragon processor. 
My Dell XPS is running slow and struggling to handle multiple programs at once with its 8GB RAM. 
The Google Pixel 5 is experiencing connectivity issues and dropping calls frequently with its Qualcomm Snapdragon 765G chipset. 
The Lenovo ThinkPad is producing a loud fan noise and heating up when running heavy software with its Intel Core i7 processor.''')

In [28]:
# Test 2
doc = nlp_ner(''' The Asus ROG laptop is experiencing black screens and crashes when playing Cyberpunk 2077 with its NVIDIA RTX 3080 graphics card."
"My iPhone 12 Pro Max is draining its battery quickly and heating up while using the camera with its A14 Bionic chip."
"The LG OLED TV is displaying flickering and color distortion when playing high-resolution content with its HDMI 2.1 port."
"My Windows desktop is freezing and showing error messages when running virtual machines with its AMD Ryzen processor."
"The Sony WH-1000XM4 headphones are producing crackling and popping sounds when connected to Bluetooth devices with its LDAC codec.''')

In [29]:
# Visualiser function
spacy.displacy.render(doc, style="ent", jupyter=True)