# Objective

We want to create a minified knowledge graph that can be generated quickly and economically from existing data.

We will do so via the following process, an iterative method of constructing knowledge graphs from unstructured, unlabeled text data.

* Define an empty set of unique entities $E$.
* Define an empty set of unique properties $P$.
* Define an empty knowledge graph $G$ as a set of 3-tuples $(e_1, p, e_2)$ s.t. $e_1, e_2\in E$ and $p\in P$.
* Preprocess the input text data. We only applied minimal preprocessing, with the aim of retaining as much information as possible from the original data. In fact, we augment the existing text data by providing simple *coreference resolution*, in the hope of restoring information that would otherwise be lost by sentence-level chunking.
    * For example: "Barack Obama was the 44th President of the United States. *He* was nominated as the Democratic Party's candidate in 2008." $\to$ "... *Barack Obama* was nominated..." 
* Chunk the preprocessed text dataset into a size appropriate for processing within an LLM's context window; we found 1 sentence to be optimal.
* For each data chunk $d$, do the following:
  * Pass $E$, $P$, and $d$ to the LLM. Prompt it to return $C$, a set of 3-tuples $(e_1, p, e_2)$ s.t. $e_1, e_2\in E$ and $p\in P$. The instructions should indicate that, when creating each element $c$ of $C$, if an entity or property referenced in $c$ is not present in $E$ or $P$, it should define a new entity/property and indicate that it is new.
  * For each claim $c\in C$, if either of $e_1, e_2\notin E$ or $p\notin P$, add the relevant element to its set. Then, append $c$ to the locally-constructed knowledge graph $G$.

In this way, we iteratively build a knowledge graph using only the entity and property types that are relevant to the data in our input documents.

In [1]:
%pip install mwxml

Collecting mwxml
  Using cached mwxml-0.3.6-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting jsonschema>=2.5.1 (from mwxml)
  Using cached jsonschema-4.23.0-py3-none-any.whl.metadata (7.9 kB)
Collecting mwcli>=0.0.2 (from mwxml)
  Using cached mwcli-0.0.3-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting mwtypes>=0.4.0 (from mwxml)
  Using cached mwtypes-0.4.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting para>=0.0.1 (from mwxml)
  Using cached para-0.0.8-py3-none-any.whl.metadata (2.0 kB)
Collecting attrs>=22.2.0 (from jsonschema>=2.5.1->mwxml)
  Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting jsonschema-specifications>=2023.03.6 (from jsonschema>=2.5.1->mwxml)
  Using cached jsonschema_specifications-2024.10.1-py3-none-any.whl.metadata (3.0 kB)
Collecting referencing>=0.28.4 (from jsonschema>=2.5.1->mwxml)
  Using cached referencing-0.36.2-py3-none-any.whl.metadata (2.8 kB)
Collecting rpds-py>=0.7.1 (from jsonschema>=2.5.1->mwxml)
  Using cached rpds_py-0.

In [12]:
import sys

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
  from google.colab import userdata
  openai_token = userdata.get("OPENAI_API_KEY")
else:
  import os
  import dotenv
  dotenv.load_dotenv()
  openai_token = os.environ.get("OPENAI_API_KEY")

assert openai_token is not None, "Must set the OPENAI_API_KEY environment variable"

In [1]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("biu-nlp/lingmess-coref")
model = AutoModel.from_pretrained("biu-nlp/lingmess-coref")



In [3]:
with open("proof-of-concept/unique_wiki_urls.txt", "r") as f:
    wiki_urls = f.read().split("\n")

wiki_urls = sorted([
    url.replace("_", " ").replace("-COLON-", ":")
    for url in wiki_urls
    if url
])
len(wiki_urls)

1460

In [4]:
import requests
import json
from tqdm.auto import tqdm

BASE_URL = "https://en.wikipedia.org/w/api.php"

summaries = {}
for slug in tqdm(wiki_urls):
    try:
        data = requests.get(
            BASE_URL,
            params={
                "action": "query",
                "format": "json",
                "titles": slug,
                "prop": "extracts",
                "exintro": True,
                "explaintext": True
            }
        ).json()
        summary = next(iter(data["query"]["pages"].values()))["extract"]
        summaries[slug] = summary
    except Exception as e:
        print(slug, e)
        continue

with open("all_summaries.json", "w") as f:
    json.dump(summaries, f)

  0%|          | 0/1460 [00:00<?, ?it/s]

In [76]:
import re

# Lightly preprocess summaries. In particular, we naively replace all
#  third-person personal pronouns (he/she/they/them/it) with the
#  title of the article.

with open("all_summaries.json", "r") as f:
    summaries = json.load(f)

summaries = {
    title: re.sub(r"\s+([hH]e|[sS]he|[tT]hey|[tT]hem|[iI]t)[\s\.,!?]+", f" {title} ", summary).replace("...", ".")
    for title, summary in summaries.items()
}

In [119]:
import openai
import time
from dataclasses import dataclass
import json
from typing import Optional
from pydantic import BaseModel, TypeAdapter

class SemanticTriple(BaseModel):
  entityA: str
  relationship: str
  entityB: str
    
  def __hash__(self):
      return hash((self.entityA, self.relationship, self.entityB))
      

def create_extract_triples_fn(entities, relations):
    return {
        "name": "extract_triples",
        "description": f"""
        Extract all semantic triples (entityA, relationship, entityB) from a sentence.
        Attempt to do so using only the entities and relationships provided to you, to the
        best of your ability.
                
        Return a JSON object with a single field `triples`, an array of objects:
          {{ "new": <true|false>, "entityA": <ENTITY_ID>, "relationship": <RELATION_ID>, "entityB": <ENTITY_ID> }}
        
        If there are no triples, return `{{"triples":[]}}`.
        """,
        "parameters": {
            "type": "object",
            "properties": {
                "triples": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "entityA": {
                                "type": "string",
                                "enum": entities,
                                "description": "ID of the first entity"
                            },
                            "relationship": {
                                "type": "string",
                                "enum": relations,
                                "description": "ID of the relationship"
                            },
                            "entityB": {
                                "type": "string",
                                "enum": entities,
                                "description": "ID of the second entity"
                            },
                        },
                        "required": ["entityA", "relationship", "entityB"],
                    },
                },
            },
            "required": ["triples"],
        },
    }


@dataclass
class SemanticTripleExtractor:
    client: openai.OpenAI
    GPT_MODEL = "gpt-4o"
    ERROR_RETRY_SLEEP = 0.001

    def get_semantic_triples(self, text: str, entities: list[str], relationships: list[str]):
        system_prompt = """
        You are a semantic role and entity extractor.

        Given an input text (which may contain multiple sentences), identify every (entityA, relationship, entityB) tuple,
        **even if it's factually incorrect**. For each triple you identify, first analyze whether it can be adequately
        described using the provided entities and relationships. If it cannot, **you may create a new one** to fit the data.

        Some sentences may contain multiple triples, and the semantic triples that are explicitly stated in the sentence
        may not be the only implications of the sentence. For example, the sentence "John graduated college" also implies
        the sentence "John holds a degree". Within reason, attempt to capture all explicit and implicit semantic triples.

        Always output exactly valid JSON with a single key "triples" consisting of a list of semantic triples:
        {
        "triples": [
            { "entityA": "<ENTITY_ID>", "relationship": "<REL_ID>", "entityB": "<ENTITY_ID>" },
            …
        ]
        }
        If there are none, return `{ "triples": [] }`.
        All relationships should be formatted using camelCase, and all entities should use PascalCase.
        ---
        Below is an example of proper processing.
        Sentence: "Princess Diana was a British royal."
        Output: {
            ["entityA": "PrincessDiana", "relationship": "countryOfOrigin", "entityB": "GreatBritain"],
            ["entityB": "PrincessDiana", "relationship": "instanceOf", "entityB": "Royal"]
        }
        ---
        Think step by step before giving your output.
        """
        return self._request_with_retry(system_prompt, text, entities, relationships)

    def _request_with_retry(self, system_prompt: str, text: str, entities: list[str], relationships: list[str]):
        n_retries = 0
        while True:
            try:
                response = (
                    self.client.beta.chat.completions.parse(
                        model=self.GPT_MODEL,
                        temperature=1,
                        messages=[
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": text},
                        ],
                        functions=[create_extract_triples_fn(entities, relationships)],
                        function_call={"name": "extract_triples"},
                    )
                    .choices[0]
                    .message.function_call
                )
                break

            except openai.RateLimitError as err:
                n_retries += 1
                print(err)
                print("Exceeded rate limit")
                print(f"Sleeping before retry (done {n_retries} time(s))")
                time.sleep(self.ERROR_RETRY_SLEEP)

            except Exception as err:
                n_retries += 1
                print(f"Unexpected error ({err})")
                print(f"Sleeping before retry (done {n_retries} time(s))")
                time.sleep(self.ERROR_RETRY_SLEEP)

        if response is None:
            raise ValueError("Got null response")
        
        data = json.loads(response.arguments)["triples"]
        adapter = TypeAdapter(list[SemanticTriple])

        return adapter.validate_python(data)

In [120]:
client = openai.OpenAI(api_key=openai_token)
semantic_extractor = SemanticTripleExtractor(client)

In [124]:
semantic_extractor.get_semantic_triples("Val Kilmer played Batman in Batman Forever.", [], [])

[SemanticTriple(entityA='ValKilmer', relationship='playedCharacter', entityB='Batman'),
 SemanticTriple(entityA='Batman', relationship='appearsIn', entityB='BatmanForever'),
 SemanticTriple(entityA='ValKilmer', relationship='actedIn', entityB='BatmanForever')]

In [125]:
all_chunks = []
for title, summary in summaries.items():
    all_chunks += [
        f"(Topic: {title}) {sentence}"
        for sentence in re.split(r"[\.!?]", summary)
    ]
len(all_chunks)

17167

In [126]:
import random

all_entities = []
all_properties = []
knowledge_graph = []

chunk_sample = random.sample(all_chunks, 1000)

for chunk in tqdm(chunk_sample):
    triples = semantic_extractor.get_semantic_triples(chunk, all_entities, all_properties)
    for triple in triples:
        if triple.entityA not in all_entities:
            all_entities.append(triple.entityA)
        if triple.entityB not in all_entities:
            all_entities.append(triple.entityB)
        if triple.relationship not in all_properties:
            all_properties.append(triple.relationship)
    knowledge_graph += [t.model_dump() for t in triples]

with open("extracted-kg.json", "w") as f:
    json.dump(knowledge_graph, f)

  0%|          | 0/1000 [00:00<?, ?it/s]

Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-ffDfcz898RNowZnqJdIXjun5 on tokens per min (TPM): Limit 450000, Used 450000, Requested 457. Please try again in 60ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Exceeded rate limit
Sleeping before retry (done 1 time(s))
Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-ffDfcz898RNowZnqJdIXjun5 on tokens per min (TPM): Limit 450000, Used 450000, Requested 457. Please try again in 60ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Exceeded rate limit
Sleeping before retry (done 2 time(s))
Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-ffDfcz898RNowZnqJdIXjun5 on tokens per min (TPM): Limit 450000, Used 449622, Requested 457. Please tr

KeyboardInterrupt: 

In [127]:
with open("extracted-kg.json", "w") as f:
    json.dump(knowledge_graph, f)