This notebook follows [this article](https://towardsdatascience.com/extract-knowledge-from-text-end-to-end-information-extraction-pipeline-with-spacy-and-neo4j-502b2b1e075).

In [11]:
import hashlib
import re
import spacy
import crosslingual_coreference
from pprint import pprint
import requests
from spacy.tokens import Doc
from typing import List
# make the factory work
from rel_pipe import make_relation_extractor, score_relations
# make the config work
from rel_model import create_relation_model, create_classification_layer, create_instances 

DEVICE = 0  # GPU index; set to -1 to run on CPU

ModuleNotFoundError: No module named 'rel_pipe'

In [2]:
# Add the coreference resolution model
# Prerequisite: python -m spacy download en_core_web_sm
coref = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
coref.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": DEVICE})

[nltk_data] Downloading package omw-1.4 to /home/jianyang/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Some weights of the model checkpoint at nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large were not used when initializing XLMRobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large and are newly

<crosslingual_coreference.CrossLingualPredictorSpacy.CrossLingualPredictorSpacy at 0x7f8bf87989d0>

Coreference resolution replaces each pronoun with the entity it refers to. It does not work well on first-person text: in the output below, "I" is wrongly resolved to "Jack Dorsey".

In [3]:
input_text = "Christian Drosten works in Germany. He likes to work for Google. Jack Dorsey works at Twitter. He works at Twitter. Also, I am Tom and has a pet called Doge. It is of great importance that he eats his food on time."
pprint(coref(input_text)._.resolved_text)

('Christian Drosten works in Germany. Christian Drosten likes to work for '
 'Google. Jack Dorsey works at Twitter. Jack Dorsey works at Twitter. Also, '
 'Jack Dorsey am Tom and has a pet called Doge. It is of great importance that '
 "a pet called Doge eats a pet called Doge's food on time.")


In [4]:
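# Return the Wikidata entity id of the top search hit for `item`
def call_wiki_api(item):
  try:
    url = "https://www.wikidata.org/w/api.php"
    # Passing params lets requests URL-encode the search term
    params = {"action": "wbsearchentities", "search": item, "language": "en", "format": "json"}
    data = requests.get(url, params=params, timeout=10).json()
    print(data)
    # Return the first id (could be improved with disambiguation in the future)
    return data['search'][0]['id']
  except (requests.RequestException, ValueError, KeyError, IndexError):
    # Network failure, invalid JSON, or no search hit
    return 'id-less'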

In [5]:
call_wiki_api('Christian Drosten')

{'searchinfo': {'search': 'Christian Drosten'}, 'search': [{'id': 'Q1079331', 'title': 'Q1079331', 'pageid': 1027519, 'concepturi': 'http://www.wikidata.org/entity/Q1079331', 'repository': 'wikidata', 'url': '//www.wikidata.org/wiki/Q1079331', 'display': {'label': {'value': 'Christian Drosten', 'language': 'en'}, 'description': {'value': 'German virologist and university teacher', 'language': 'en'}}, 'label': 'Christian Drosten', 'description': 'German virologist and university teacher', 'match': {'type': 'label', 'language': 'en', 'text': 'Christian Drosten'}}], 'success': 1}


'Q1079331'
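
The lookup above takes the first search hit without checking it. As a minimal sketch of a slightly more defensive version, the helper below extracts the id from an already-decoded response and falls back to `'id-less'` when there are no hits. The name `first_entity_id` is hypothetical and not part of the article; the function only inspects the response dict, so it can be tried without a network call.

```python
from typing import Any, Dict


def first_entity_id(data: Dict[str, Any], default: str = "id-less") -> str:
    """Return the id of the first wbsearchentities hit, or `default` if there is none."""
    hits = data.get("search") or []
    if not hits:
        return default
    return hits[0].get("id", default)


# Trimmed copy of the response printed above
sample = {
    "searchinfo": {"search": "Christian Drosten"},
    "search": [{"id": "Q1079331", "label": "Christian Drosten"}],
    "success": 1,
}

print(first_entity_id(sample))          # Q1079331
print(first_entity_id({"search": []}))  # id-less
```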

In [6]:
def set_annotations(self, doc: Doc, triplets: List[dict]):
    # Written as a method (note `self`): it relies on self.get_wiki_id and on a `rel` Doc extension
    for triplet in triplets:

        # Remove self-loops (relationships that start and end at the same entity)
        if triplet['head'] == triplet['tail']:
            continue

        # Search for the entities in the text; re.escape treats the names as literal strings
        head_span = re.search(re.escape(triplet["head"]), doc.text)
        tail_span = re.search(re.escape(triplet["tail"]), doc.text)

        # Skip the relation if either the head or the tail entity is missing from the text
        # (the REBEL model sometimes hallucinates entities)
        if not head_span or not tail_span:
            continue

        # Hash head, tail, and type so that each distinct triplet is stored only once
        index = hashlib.sha1("".join([triplet['head'], triplet['tail'], triplet['type']]).encode('utf-8')).hexdigest()
        if index not in doc._.rel:
            # Get wiki ids and store results
            doc._.rel[index] = {
                "relation": triplet["type"],
                "head_span": {'text': triplet['head'], 'id': self.get_wiki_id(triplet['head'])},
                "tail_span": {'text': triplet['tail'], 'id': self.get_wiki_id(triplet['tail'])},
            }
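
The filtering and deduplication logic in `set_annotations` can be tried out without spaCy or REBEL. The sketch below is a standalone approximation, not the article's code: `triplet_key` and `filter_triplets` are hypothetical helpers, entity names are matched literally via `re.escape`, and the Wikidata lookup is left out.

```python
import hashlib
import re
from typing import Dict, List


def triplet_key(triplet: Dict[str, str]) -> str:
    """SHA-1 of head + tail + type, used as a deduplication key."""
    joined = "".join([triplet["head"], triplet["tail"], triplet["type"]])
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def filter_triplets(text: str, triplets: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Keep triplets whose head and tail both occur in `text`, dropping self-loops and duplicates."""
    kept: Dict[str, Dict[str, str]] = {}
    for triplet in triplets:
        if triplet["head"] == triplet["tail"]:
            continue  # self-loop
        # re.escape makes names such as "C++" match literally
        if not re.search(re.escape(triplet["head"]), text):
            continue
        if not re.search(re.escape(triplet["tail"]), text):
            continue
        # setdefault keeps the first occurrence of each distinct triplet
        kept.setdefault(triplet_key(triplet), triplet)
    return kept


text = "Christian Drosten works in Germany."
triplets = [
    {"head": "Christian Drosten", "tail": "Germany", "type": "country of citizenship"},
    {"head": "Christian Drosten", "tail": "Germany", "type": "country of citizenship"},  # duplicate
    {"head": "Germany", "tail": "Germany", "type": "country"},  # self-loop
    {"head": "Christian Drosten", "tail": "Google", "type": "employer"},  # tail not in text
]

result = filter_triplets(text, triplets)
print(len(result))  # 1
```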

In [8]:
# Define the relation extraction model

rel_ext = spacy.load('en_core_web_sm', disable=['ner', 'lemmatizer', 'attribute_ruler', 'tagger'])
rel_ext.add_pipe("rebel", config={
    'device': DEVICE,  # GPU index; -1 to run on CPU
    'model_name': 'Babelscape/rebel-large',  # defaults to 'Babelscape/rebel-large' if omitted
})

ValueError: [E002] Can't find factory for 'rebel' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, xx_coref, en.lemmatizer
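
The E002 error means that no factory named `rebel` had been registered when `add_pipe` was called. As a rough sketch of what registration looks like, the placeholder below registers that name with `@Language.factory` and creates the `rel` extension. It is an assumption-laden stand-in, not the article's component: the class body is hypothetical, it loads no model, and it extracts nothing.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.factory(
    "rebel",
    default_config={"device": -1, "model_name": "Babelscape/rebel-large"},
)
class RebelComponent:
    """Placeholder component: registers the name 'rebel' but extracts nothing."""

    def __init__(self, nlp: Language, name: str, device: int, model_name: str):
        self.device = device
        self.model_name = model_name
        # Custom attribute that will hold the extracted relations
        if not Doc.has_extension("rel"):
            Doc.set_extension("rel", default={})

    def __call__(self, doc: Doc) -> Doc:
        # A real component would run the REBEL model here and fill doc._.rel
        doc._.rel = {}
        return doc


nlp = spacy.blank("en")  # blank pipeline, so no model download is needed
nlp.add_pipe("rebel", config={"device": -1})
doc = nlp("Christian Drosten works in Germany.")
print(nlp.pipe_names)  # ['rebel']
print(doc._.rel)       # {}
```

Once a class like this is defined (or imported) before the `add_pipe("rebel", ...)` call, spaCy can resolve the factory name.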

In [None]:
input_text = "Christian Drosten works in Germany. He likes to work for Google."
coref_text = coref(input_text)._.resolved_text
doc = rel_ext(coref_text)

for rel_id, rel_dict in doc._.rel.items():
    print(f"{rel_id}: {rel_dict}")

# Expected relations (hash keys omitted):
# {'relation': 'country of citizenship', 'head_span': {'text': 'Christian Drosten', 'id': 'Q1079331'}, 'tail_span': {'text': 'Germany', 'id': 'Q183'}}
# {'relation': 'employer', 'head_span': {'text': 'Christian Drosten', 'id': 'Q1079331'}, 'tail_span': {'text': 'Google', 'id': 'Q95'}}

NameError: name 'rel_ext' is not defined

Once the knowledge graph is up, LangChain may already cover what we need: its Neo4j integration appears to have knowledge graph capabilities built in, so we can likely just use `Neo4jVector.from_existing_graph`.

Try [this Neo4j guide](https://neo4j.com/developer-blog/knowledge-graph-rag-application/).
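
Before any of that, the extracted relations have to be written to Neo4j. One possible way, sketched under assumptions, is to flatten each `doc._.rel` value into parameters for a Cypher `MERGE` query. The `Entity` label, the `RELATION` relationship type, and the helper name `relation_to_params` are all hypothetical choices for illustration; nothing here comes from the article or the guide.

```python
from typing import Any, Dict

# Hypothetical schema: one :Entity node per distinct name, one :RELATION edge per triplet
MERGE_QUERY = (
    "MERGE (h:Entity {name: $head_text}) SET h.wikidata_id = $head_id "
    "MERGE (t:Entity {name: $tail_text}) SET t.wikidata_id = $tail_id "
    "MERGE (h)-[:RELATION {type: $relation}]->(t)"
)


def relation_to_params(rel: Dict[str, Any]) -> Dict[str, str]:
    """Flatten one doc._.rel value into query parameters for MERGE_QUERY."""
    return {
        "head_text": rel["head_span"]["text"],
        "head_id": rel["head_span"]["id"],
        "tail_text": rel["tail_span"]["text"],
        "tail_id": rel["tail_span"]["id"],
        "relation": rel["relation"],
    }


# One of the expected relations shown earlier
rel = {
    "relation": "country of citizenship",
    "head_span": {"text": "Christian Drosten", "id": "Q1079331"},
    "tail_span": {"text": "Germany", "id": "Q183"},
}

params = relation_to_params(rel)
print(params["head_text"], "-[" + params["relation"] + "]->", params["tail_text"])
# Christian Drosten -[country of citizenship]-> Germany
```

With the official `neo4j` Python driver, each parameter dict could then be passed along with the query to `session.run(MERGE_QUERY, params)`. That step is left out here because it needs a running database.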