In [1]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans
from spacy import displacy
from spacy.tokens import DocBin
from spacy.tokens import Span
tqdm.pandas()
spacy.__version__ 

'3.2.4'

Our first goal is to build a model that recognizes the terms in our patents. For that, we will use a Named Entity Recognition (NER) model from `spacy`.

## Read and preprocess data

In [2]:
# if you've already unzipped the file
patent_data=open('G06K.txt').read().strip()

# split into patents texts | 1 entry = 1 patent
patent_texts = patent_data.split('\n\n')

# split each patent into lines
patent_lines = patent_data.split('\n')

### Our terms database: Manyterms

In [3]:
# here are the potential terms
mwes = open('manyterms.lower.txt').read().lower().strip().split('\n')
print(mwes[44444:44456])
print(len(mwes),'mwes')

['antonio superchi', 'antonio tarver', 'antonio torres jurado', 'antonio valdes', 'antonio valdes y fernandez bazan', 'antonio valdez', 'antonio valdés y bazán', 'antonio valdés y fernández bazán', 'antonio valente', 'antonio vitali', 'antonio vivaldi', 'antonio xavier machado e cerveira']
743274 mwes


### We extract the terms from our patents using manyterms

In [4]:
# lowercase=True lowercases the patent text so that it matches the lowercased manyterms vocabulary.
# Note that this also folds abbreviations such as API or CAT to lowercase.
cvectorizer = CountVectorizer(ngram_range=(1, 4), stop_words="english", vocabulary=mwes, lowercase=True)
X=cvectorizer.fit_transform(patent_texts)

# Show the 25 most frequent terms
# (scikit-learn >= 1.2 removed get_feature_names(); use get_feature_names_out() there)
termdf_cv = pd.DataFrame(np.sum(X, axis=0), columns=cvectorizer.get_feature_names()).T.sort_values(by = 0, ascending = False)
termdf_cv.head(25)



| term | count |
|---|---|
| electronic device | 16280 |
| image processing | 12224 |
| control unit | 9263 |
| mobile terminal | 9165 |
| information processing | 7732 |
| neural network | 6734 |
| user interface | 6177 |
| computer readable | 6103 |
| fingerprint sensor | 5980 |
| display device | 5666 |
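To make the counting step concrete, here is a minimal pure-Python sketch of what counting word n-grams against a fixed vocabulary amounts to. It is not scikit-learn's implementation: the function name `count_vocabulary_terms` and the two sample sentences are made up for illustration, and stop-word removal is left out for brevity.

```python
import re
from collections import Counter

def count_vocabulary_terms(texts, vocabulary, max_n=4):
    """Count how often each vocabulary term occurs as a word n-gram (n <= max_n)."""
    vocab = set(vocabulary)
    counts = Counter()
    for text in texts:
        # Lowercase, then keep tokens of two or more word characters
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                if ngram in vocab:
                    counts[ngram] += 1
    return counts

# Made-up sample texts and a tiny vocabulary, for illustration only
texts = [
    "The electronic device includes a fingerprint sensor.",
    "A neural network runs on the electronic device.",
]
vocabulary = ["electronic device", "fingerprint sensor", "neural network", "user interface"]
counts = count_vocabulary_terms(texts, vocabulary)
print(counts.most_common())
```

Terms that never occur, such as `user interface` here, simply get a count of zero, which is why the full vocabulary can be passed in without filtering it first.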


## 🪄 SpaCy NER

Let's start by looking at what the default NER does. Here is part of one patent rendered with the entities found by the default `en_core_web_lg` pipeline.

In [5]:
nlp = spacy.load("en_core_web_lg")
doc = nlp(patent_texts[0][18000:20000])  # a 2,000-character excerpt of the first patent
displacy.render(doc, style="ent", jupyter = True)

We want a model that recognizes the terms that appear in our patents. For that, we need to create a dataset, using `manyterms` as the terms database.

### Create DataSet

We need to create a proper dataset, in a format compatible with spaCy 3, to train a NER model.

In [6]:
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(text) for text in termdf_cv.index]
matcher.add("Tech", patterns)

In [7]:
train_lines, test_lines = train_test_split(patent_lines, test_size=0.3, random_state=42)

We use `PhraseMatcher` to find occurrences of the manyterms terms in each line.  
Each match becomes a `Span` labeled `TECH`, and the annotated docs are saved in the binary `.spacy` format.

In [10]:
def create_dataset(text, n_lines, filename, offset=0):
    LABEL = "TECH"
    doc_bin = DocBin()  # container that serializes the annotated docs

    for training_example in tqdm(text[offset:offset + n_lines]):
        doc = nlp.make_doc(training_example)  # tokenize only, no pipeline components

        # Label every PhraseMatcher hit as a TECH span (start/end are token offsets)
        ents = [Span(doc, start, end, label=LABEL) for match_id, start, end in matcher(doc)]

        # doc.ents cannot hold overlapping spans, so keep the longest of each overlapping group
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)

    doc_bin.to_disk(filename)
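The overlap filtering matters because a term list like manyterms often contains both a term and a longer term that includes it, and `doc.ents` rejects overlapping spans. Here is a minimal pure-Python sketch of the "longest span wins" idea, working on `(start, end)` token offsets instead of real `Span` objects. The function name `filter_overlapping` and the sample offsets are hypothetical, and this approximates the behavior of `filter_spans` rather than reproducing its implementation.

```python
def filter_overlapping(spans):
    """Keep the longest spans first, dropping any span that overlaps one already kept.

    `spans` is a list of (start, end) token offsets with an exclusive end.
    """
    # Longest first; ties go to the span that starts earlier
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, used_tokens = [], set()
    for start, end in ordered:
        if not any(i in used_tokens for i in range(start, end)):
            kept.append((start, end))
            used_tokens.update(range(start, end))
    # Return the survivors in document order
    return sorted(kept)

# Made-up matches: (1, 4) covers (1, 2) and overlaps (0, 2)
matches = [(0, 2), (1, 2), (1, 4), (5, 6)]
filtered = filter_overlapping(matches)
print(filtered)
```

One consequence of this choice is that a shorter term is never labeled on its own when it sits inside a longer matched term.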

In [11]:
create_dataset(train_lines, 40_000, "training_data.spacy")
create_dataset(test_lines, 12_000, "valid_data.spacy")


Now that our datasets are created, we can train a spaCy NER model.

### Train the model

Download __base_config.cfg__ for your system from https://spacy.io/usage/training#quickstart

In [12]:
# Run to generate full training config
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Run the training. The best and the last models are stored in __./spacy_output__  

In [None]:
!python -m spacy train config.cfg --output ./spacy_output --paths.train ./training_data.spacy --paths.dev ./valid_data.spacy --gpu-id 0

### Test the model

In [13]:
nlp = spacy.load("spacy_output/model-best")

doc = nlp("Wi-Fi Direct (registered trademark, which will be hereinafter referred to as WFD) \
           corresponding to a technology for directly performing a communication based on a \
           wireless LAN between communication devices without intermediation of an access \
           point (hereinafter referred to as AP) is standardized in Wi-Fi Alliance serving \
           as a wireless LAN industry group.")

colors = {"TECH": "#F67DE3"}
options = {"colors": colors} 

spacy.displacy.render(doc, style="ent", options=options, jupyter=True)



In our patents, many terms are abbreviated.
For example, `Wi-Fi Direct` is referred to as `WFD`, and `P2P Group Owner` as `GO`.

Our original model does not recognize these abbreviations, so a large share of the terms is missed.
We will therefore fine-tune the model using Prodigy.
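To get a rough idea of how common such abbreviations are, one option is to look for the phrase that introduces them. The sketch below is only an assumption about how this could be checked: the regular expression covers just the "hereinafter referred to as X" wording seen in the examples here, the name `find_abbreviations` is hypothetical, and the sample text is condensed from the two example passages in this notebook.

```python
import re
from collections import Counter

# Hypothetical pattern for illustration; real patents use many other phrasings
ABBREVIATION_PATTERN = re.compile(
    r"hereinafter,?\s+referred\s+to\s+as\s+([A-Z][A-Za-z0-9-]*)"
)

def find_abbreviations(text):
    """Count the abbreviations introduced by 'hereinafter referred to as ...'."""
    return Counter(ABBREVIATION_PATTERN.findall(text))

sample = (
    "Wi-Fi Direct (registered trademark, which will be hereinafter referred to as WFD) "
    "uses an access point (hereinafter referred to as AP). A role of the device that "
    "operates as the AP will be referred to as P2P Group Owner "
    "(hereinafter, referred to as GO)."
)
abbreviations = find_abbreviations(sample)
print(sorted(abbreviations))
```

Running the same function over `patent_texts` would give a list of candidate abbreviations to watch for while annotating.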

## Prodigy

We create a training dataset for fine-tuning our NER model by correcting the model's predictions in Prodigy.

In [None]:
!prodigy ner.correct fine_tune_g06k2 spacy_output/model-best G06K.txt --loader txt --label TECH

Now let's fine-tune our model! The dataset name passed to `--ner` must be the one that was annotated with `ner.correct`.

In [None]:
!prodigy train ./prodigy_output/ --ner fine_tune_g06k --base-model spacy_output/model-best --gpu-id 0

Our refined model should now recognize abbreviations better.

In [14]:
nlp = spacy.load("prodigy_output/model-best")

doc = nlp("According to the WFD, the communication is performed when one of the \
           communication devices that directly perform the wireless LAN communication \
           operates as the AP. According to the WFD, a role of the device that operates \
           as the AP will be referred to as P2P Group Owner (hereinafter, referred to as GO). \
           On the other hand, a role of the device that participates in a network generated by \
           the GO will be referred to as P2P Client (hereinafter, referred to as CL). \
           According to the WFD, a communication parameter necessary for participating in \
           the network generated by the GO is shared between the devices by transmitting the \
           communication parameter from the GO to the CL, and thereafter, the wireless \
           communication according to the WFD is executed on the basis of the shared communication parameter.")

colors = {"TECH": "#F67DE3"}
options = {"colors": colors} 

spacy.displacy.render(doc, style="ent", options=options, jupyter=True)

This fine-tuned model handles abbreviations better than the first one. We will use it for our Hearst pattern recognition.
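A comparison like this is easier to trust when it is measured against a small hand-annotated sample. Here is a minimal sketch of exact-match precision, recall and F1 over entity spans. The function name `span_prf` is hypothetical and the character offsets are made up for illustration; they are not results from these models. spaCy's own `spacy evaluate` command reports comparable scores for a `.spacy` evaluation file.

```python
def span_prf(predicted, gold):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up character offsets, for illustration only
gold = {(17, 20, "TECH"), (75, 87, "TECH"), (140, 142, "TECH"), (200, 215, "TECH")}
base_model = {(75, 87, "TECH"), (200, 215, "TECH")}
fine_tuned = {(17, 20, "TECH"), (75, 87, "TECH"), (200, 215, "TECH"), (300, 305, "TECH")}

for name, predicted in [("base", base_model), ("fine-tuned", fine_tuned)]:
    precision, recall, f1 = span_prf(predicted, gold)
    print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In practice the predicted spans would come from `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]` for each model.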