## Relation Classification

This example uses few-shot classification to avoid the hassle of training a text classifier inside spaCy.

See https://github.com/Pandora-Intelligence/classy-classification for details, especially on other pre-trained models that could work as well.


In [23]:
import spacy

from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    # Treat the whole input as one sentence: only the first token starts a sentence
    for i, token in enumerate(doc):
        # Explicitly set sentence start to False for all other tokens, to tell
        # the parser to leave those tokens alone
        token.is_sent_start = (i == 0)
    return doc

# Importing classy_classification makes the 'text_categorizer' component used below available
import classy_classification


## Training data

Training data is a dictionary in which each key is a class label and each value is a list of all the examples for that class. In the example below, we read the training data from a tab-separated file with one class label and one sentence per line.


In [17]:
# the format of the training data 
# data = {'director' : [  'Who directed Tenet?', 
#                      'Who directed Bad Times at the El Royale?', 
#                      'Inception was directed by whom?' ],
#'deathdate' : ['When did John Hamilton die?', 
#               'In which year did Alan Rickman die?']
#                      }
   
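import csv
from collections import defaultdict

# Maps each class label to the list of its example sentences
trainingdata = defaultdict(list)

# labeled.csv is tab-separated: one label and one sentence per line
with open('labeled.csv', newline='') as file:
    csvFile = csv.reader(file, delimiter='\t')
    for label, sentence in csvFile:
        trainingdata[label].append(sentence)

# Print the number of examples per class
for label, examples in trainingdata.items():
    print('{}\t{}'.format(label, len(examples)))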
        
#for example in trainingdata['composer'] :
#    print (example)

award	24
birthdate	28
birthplace	10
composer	21
director	88
released	44
other	100


## Adding the categorizer to the pipeline

In spaCy, you can add components to the annotation pipeline. Note that in this particular case, adding the component does the training as well, i.e. it does few-shot training on top of a pre-trained language model. With 'model': 'spacy', the embeddings appear to come from the loaded spaCy pipeline itself (here en_core_web_lg, as the classySpacyInternal class in the cell output suggests) rather than from a separate Hugging Face model.

The categorizer predicts a label for a given input sentence. The label predictions can be found in the extension attribute \_.cats, a dictionary that maps each label to its score.



In [24]:
nlp = spacy.load('en_core_web_lg')

nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser


<function __main__.custom_sentencizer(doc)>

In [25]:
nlp.add_pipe('text_categorizer',config = {'data': trainingdata, 'model': 'spacy'})

Fitting 2 folds for each of 6 candidates, totalling 12 fits


<classy_classification.classifiers.spacy_internal.classySpacyInternal at 0x7fd522048bb0>

In [14]:
spacy.info('en_core_web_lg')

{'lang': 'en',
 'name': 'core_web_lg',
 'version': '3.2.0',
 'description': 'English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.',
 'author': 'Explosion',
 'email': 'contact@explosion.ai',
 'url': 'https://explosion.ai',
 'license': 'MIT',
 'spacy_version': '>=3.2.0,<3.3.0',
 'spacy_git_version': 'bb26550e2',
 'vectors': {'width': 300,
  'vectors': 684830,
  'keys': 684830,
  'name': 'en_vectors'},
 'labels': {'tok2vec': [],
  'tagger': ['$',
   "''",
   ',',
   '-LRB-',
   '-RRB-',
   '.',
   ':',
   'ADD',
   'AFX',
   'CC',
   'CD',
   'DT',
   'EX',
   'FW',
   'HYPH',
   'IN',
   'JJ',
   'JJR',
   'JJS',
   'LS',
   'MD',
   'NFP',
   'NN',
   'NNP',
   'NNPS',
   'NNS',
   'PDT',
   'POS',
   'PRP',
   'PRP$',
   'RB',
   'RBR',
   'RBS',
   'RP',
   'SYM',
   'TO',
   'UH',
   'VB',
   'VBD',
   'VBG',
   'VBN',
   'VBP',
   'VBZ',
   'WDT',
   'WP',
   'WP$',
   'WRB',
   'XX',
   '``'],
  'parser': ['ROOT',
   'ac

In [20]:
question = nlp('Who directed Jaws?')
print(question._.cats)

{'award': 0.0010720687215706717, 'birthdate': 0.0021955275692360767, 'birthplace': 0.003109594411016382, 'composer': 0.0006949782125362799, 'director': 0.9918597789813286, 'other': 0.0004746037143408171, 'released': 0.000593448389970855}


In [21]:
question = nlp('What is the date of death of John Wayne?')
print(question._.cats)

{'award': 0.006506048950995459, 'birthdate': 0.028593738016062194, 'birthplace': 0.008716237929607943, 'composer': 0.007102167012836441, 'director': 0.01329791333298159, 'other': 0.7137977230923639, 'released': 0.22198617166515272}


## Note

To improve classification performance, add more examples and categories. I am not sure exactly which pre-trained models can be used; the GitHub page mentions a few more, but it is unclear what it would take to use one of the transformer-based sentence encoders.

For simple questions, finding the relation and a named entity is enough to formulate a SPARQL query.
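
As a rough illustration of that last step, the sketch below fills a query template from a predicted relation and an entity string. Everything specific in it is an assumption rather than part of this notebook: it assumes Wikidata as the target knowledge base, the mapping `RELATION_TO_PROPERTY` and the function `build_query` are hypothetical names, and matching the entity by its exact English label is a simplification that will miss or confuse many real entities.

```python
# Hypothetical mapping from this notebook's class labels to Wikidata property IDs.
# Assumption: the knowledge base is Wikidata; adjust for any other endpoint.
RELATION_TO_PROPERTY = {
    "director": "P57",     # director
    "composer": "P86",     # composer
    "birthdate": "P569",   # date of birth
    "birthplace": "P19",   # place of birth
    "award": "P166",       # award received
    "released": "P577",    # publication date
}


def build_query(relation, entity_label):
    """Return a SPARQL query string, or None if the relation has no mapping."""
    prop = RELATION_TO_PROPERTY.get(relation)
    if prop is None:
        # e.g. the 'other' class: no query can be formulated
        return None
    # Escape backslashes and double quotes so the label is a valid string literal
    safe = entity_label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?answer ?answerLabel WHERE {\n"
        f'  ?item rdfs:label "{safe}"@en .\n'
        f"  ?item wdt:{prop} ?answer .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


print(build_query("director", "Jaws"))
```

In practice the entity string would come from a named entity recognizer or an entity linker, and the relation from the classifier's top-scoring label.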

The biggest challenge for the classifier is that it will make a prediction even for relations that are not in the training data. Is it possible to use a threshold, so that predictions below the threshold can be ignored? Alternatively, you could add an 'other' class consisting of various questions for which no relation can be readily provided or that have other issues. You could also add manual patterns (for complex questions or for other relations) that take precedence over the automatic classification approach, so that automatic relation classification serves only as a fall-back.
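
One way the threshold idea could work is to apply it to the score dictionary after classification. The sketch below is a minimal version of that: `pick_label` is a hypothetical helper, not part of classy-classification, and the threshold of 0.5 is an arbitrary starting point that would need tuning on held-out questions. The example scores are rounded from outputs shown in this notebook.

```python
def pick_label(cats, threshold=0.5, fallback=None):
    """Return the top-scoring label, or `fallback` if its score is below `threshold`.

    `cats` is a dict of label -> score, shaped like doc._.cats.
    """
    if not cats:
        return fallback
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else fallback


# Scores rounded from the 'Who directed Jaws?' output
confident = {"award": 0.001, "birthdate": 0.002, "birthplace": 0.003,
             "composer": 0.001, "director": 0.992, "other": 0.000, "released": 0.001}

# Scores rounded from the 'Who wrote the music for Once Upon a Time in the West?' output
uncertain = {"other": 0.498, "composer": 0.237, "director": 0.131,
             "released": 0.079, "birthdate": 0.027, "award": 0.021, "birthplace": 0.007}

print(pick_label(confident))   # director
print(pick_label(uncertain))   # None
```

With a pipeline loaded as in this notebook, the call would be `pick_label(nlp(question)._.cats)`. A prediction that falls back to `None` could then be handed to manual patterns or rejected.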

### Other 

Adding an 'other' category makes the classification more robust. After adding 100 'other' questions and testing on 2021 test questions, the results are mostly correct. The mistakes are cases where 'other' has been assigned incorrectly.
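
To quantify an observation like this, predictions can be compared against gold labels. The sketch below assumes such gold labels exist for the test questions, which this notebook does not show; `evaluate` is a hypothetical helper and the label lists are made up for illustration only.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, errors), where errors counts each (gold, predicted) mismatch."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    correct = len(gold) - sum(errors.values())
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, errors


# Made-up labels, for illustration only
gold = ["director", "composer", "released", "other"]
predicted = ["director", "other", "released", "other"]

accuracy, errors = evaluate(gold, predicted)
print(accuracy)                # 0.75
print(errors.most_common())    # [(('composer', 'other'), 1)]
```

Sorting the error pairs by frequency shows directly how often 'other' is predicted in place of a real relation.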



In [26]:

with open('test_all1.csv') as questions:
    for line in questions:
        # Each line holds an id and a question, separated by a tab
        # (question_id avoids shadowing the built-in id())
        (question_id, question) = line.strip().split('\t')
        doc = nlp(question)
        for sent in doc.sents:
            print(sent.text)
        # sort predictions by score, highest scoring first
        categories = sorted(doc._.cats.items(), key=lambda x: x[1], reverse=True)
        print(categories)

Who directed Tenet?
[('director', 0.9941102385945935), ('birthplace', 0.0026385505470632336), ('award', 0.0011384164331722552), ('birthdate', 0.0009146537380075174), ('composer', 0.0006383975163335762), ('other', 0.0002974174094209785), ('released', 0.00026232576140891226)]
Who directed Bad Times at the El Royale?
[('director', 0.7922966416674118), ('other', 0.061186156817180516), ('released', 0.046022719983288427), ('award', 0.039062846561057005), ('birthdate', 0.036642203648794856), ('composer', 0.014597481895847825), ('birthplace', 0.010191949426419442)]
Who wrote the music for Once Upon a Time in the West?
[('other', 0.4984421603599385), ('composer', 0.23667994203525236), ('director', 0.13058888916222777), ('released', 0.07896414943664815), ('birthdate', 0.027359843509150057), ('award', 0.020591924427171074), ('birthplace', 0.007373091069611951)]
Which streaming service is the distributor of Tiger King?
[('other', 0.9697259149840318), ('released', 0.015018830999793684), ('composer'