# Knowledge Graph

To build the knowledge graph (KG) we use spaCy with its large English model (`en_core_web_lg`), then visualise the resulting network with pyvis. The text data is just over 2.5M tokens, so we set the limit to 3M. Note that the code below applies this limit in characters, both when loading the file and when setting spaCy's `nlp.max_length`.

In [8]:
import spacy
from tqdm import tqdm
import re
import networkx as nx
from pyvis.network import Network

# Load the large English model
nlp = spacy.load("en_core_web_lg")

# Initialize an empty graph
graph = nx.Graph()

limit = 3000000

Then we need to create functions that:

1) Let us control how much data is loaded (for testing)
2) Clean the data to remove page numbers, special characters and excess whitespace

In [9]:

# Function to load text file
def load_text(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()

# Function to load a limited amount of text from the file
def load_limited_text(file_path, limit=50000):
    """Load a maximum of 'limit' characters from the text file."""
    text = ""
    with open(file_path, "r", encoding="utf-8") as file:
        while len(text) < limit:
            chunk = file.read(limit - len(text))  # Read only the needed remaining characters
            if not chunk:  # If no more content to read, break the loop
                break
            text += chunk
    return text

# Function to clean text (remove HTML, special characters, and page numbers)
def clean_text(text):
    # Remove HTML entities
    text = re.sub(r'&#[0-9]+;', '', text)

    # Remove § and section numbers (e.g., §3)
    text = re.sub(r'§\d+', '', text)

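    # Remove page numbers (e.g., "531", "532"); note that this pattern removes
    # every 1 to 4 digit number, so years such as "1918" are stripped as well
    text = re.sub(r'\b\d{1,4}\b', '', text)  # Match numbers with 1 to 4 digits

    # Collapse newlines and runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)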

    return text
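
To see what `clean_text` does in practice, the sketch below runs the same four substitutions on a made-up sample string (the sample is an assumption, not taken from the corpus). It shows that the `\b\d{1,4}\b` pattern removes years as well as page numbers, which is why dates are missing from the printed output further down. The second function, `strip_page_number_lines`, is a hypothetical alternative that only removes numbers standing alone on a line; it is one possible refinement, not part of the original pipeline.

```python
import re

def clean_text(text):
    # Same substitutions as the notebook's clean_text
    text = re.sub(r'&#[0-9]+;', '', text)    # numeric HTML entities
    text = re.sub(r'§\d+', '', text)         # section markers such as §3
    text = re.sub(r'\b\d{1,4}\b', '', text)  # every 1-4 digit number
    return re.sub(r'\s+', ' ', text)         # collapse whitespace

def strip_page_number_lines(text):
    # Hypothetical variant: drop a number only when it is alone on its line
    text = re.sub(r'^[ \t]*\d{1,4}[ \t]*$', '', text, flags=re.MULTILINE)
    return re.sub(r'\s+', ' ', text)

# Made-up sample with a year range, a page number, a section marker and an entity
sample = "It was first written in 1918-1919.\n\n531\n§3 The Great War &#8212; a crisis"

cleaned = clean_text(sample)
kept_years = strip_page_number_lines(sample)
print(cleaned)
print(kept_years)
```

Whether losing years matters depends on the use case: the graph only keeps `PERSON`, `ORG`, `GPE` and `LOC` entities, so dates never become nodes, but stripped numbers may still change how spaCy parses a sentence.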


### Code Explanation:

#### Function Definition:
- The function `add_to_graph(graph, doc)` takes two inputs:
  - `graph`: The knowledge graph (could be a network graph like in NetworkX).
  - `doc`: A document (likely processed by an NLP library like spaCy).

#### Loop Through Sentences:
- `for sent in doc.sents`: Loops through each sentence in the document.

#### Extract Entities:
- `entities_in_sentence = [ent for ent in sent.ents if ent.label_ in valid_entity_types]`: Extracts the **named entities** in each sentence whose label is `PERSON`, `ORG`, `GPE` or `LOC` (people, organizations and locations) and stores them in the list `entities_in_sentence`.

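#### Find the Root Verb (Relationship):
- `root_verb = None`: Initializes the `root_verb` variable.
- The loop `for token in sent` iterates over each token (word) in the sentence to find the **root verb** (main action/relationship in the sentence).
- `token.dep_ == "ROOT"` checks whether the token has the syntactic dependency `ROOT`, which marks the head of the sentence. This is usually the main verb, although the parser can assign `ROOT` to a non-verb in sentences without one.
- `root_verb = token.lemma_` stores the **lemma** (base form) of that token as `root_verb`, which will be used as the **relationship** between entities.
- The same loop also records the text of the last preposition in the sentence (`token.dep_ == "prep"`) in `prep_phrase` and collects adverbial and adjectival modifiers (`advmod`, `amod`) in `modifiers`. These are appended to the root verb to form the relationship string.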

#### Create Edges Between Entities:
- If a root verb is found (`if root_verb`) and there are at least two entities in the sentence (`len(entities_in_sentence) > 1`), the code creates edges (relationships) between those entities.
- It loops through each pair of entities (`ent1` and `ent2`) in the sentence once. If the entities are different (`if ent1 != ent2`), it adds an **edge** to the graph using `graph.add_edge`, unless an edge between the two already exists (`graph.has_edge`).
  - **Edge**: Connects `ent1.text` and `ent2.text` (the names of the entities), lowercased so that the same name always maps to the same node.
  - **Relationship**: The edge's `relationship` attribute holds the relationship string built from the `root_verb` (the verb from the sentence).


In [10]:
def add_to_graph(graph, doc):
    """
    Add entities to the graph and create edges based on the root verb and prepositional phrases in each sentence.
    """
    # Entity types to include in the graph
    valid_entity_types = {"PERSON", "ORG", "GPE", "LOC"}  # People, Organizations, Locations

    for sent in doc.sents:
        # Extract valid entities from the sentence
        entities_in_sentence = [ent for ent in sent.ents if ent.label_ in valid_entity_types]

        # Find the root verb, plus a preposition and modifiers to enrich the relationship
        root_verb = None
        prep_phrase = None  # Text of the last preposition found in the sentence
        modifiers = []  # All adverbial/adjectival modifiers (advmod, amod) in the sentence

        for token in sent:
            if token.dep_ == "ROOT":
                root_verb = token.lemma_
            if token.dep_ == "prep":
                prep_phrase = token.text
            if token.dep_ in {"advmod", "amod"}:
                modifiers.append(token.text)

        # Combine root verb, modifiers, and preposition to form a more meaningful relationship
        if root_verb:
            # Build the relationship string, skipping parts that were not found
            # (otherwise a missing preposition would appear as the literal "None")
            parts = [root_verb, *modifiers, prep_phrase]
            relationship = " ".join(part for part in parts if part)

            # Create edges only if there is more than one entity
            if len(entities_in_sentence) > 1:
                for i, ent1 in enumerate(entities_in_sentence):
                    for ent2 in entities_in_sentence[i+1:]:
                        if ent1 != ent2:
                            # Lowercase entities for consistent graph entries
                            entity1 = ent1.text.lower()
                            entity2 = ent2.text.lower()

                            # Avoid adding duplicate edges
                            if not graph.has_edge(entity1, entity2):
                                graph.add_edge(entity1, entity2, relationship=relationship)
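
The pairing logic can be traced without loading spaCy. The sketch below is a simplified stand-in, not the notebook's function: entity names are plain strings, a dictionary keyed by `frozenset` plays the role of the undirected NetworkX graph, and the helper names `build_relationship` and `pair_entities` are hypothetical. It illustrates two properties of the approach: every entity pair in a sentence receives the same relationship label, and the first label recorded for a pair is never overwritten.

```python
from itertools import combinations

def build_relationship(root_verb, modifiers, prep_phrase):
    # Join only the parts that were actually found in the sentence
    parts = [root_verb, *modifiers, prep_phrase]
    return " ".join(part for part in parts if part)

def pair_entities(entity_texts, relationship, edges):
    # Visit each unordered pair once, mirroring enumerate + slice in add_to_graph
    for first, second in combinations(entity_texts, 2):
        first, second = first.lower(), second.lower()
        key = frozenset((first, second))  # undirected: (a, b) equals (b, a)
        if first != second and key not in edges:
            edges[key] = relationship
    return edges

edges = {}

# Made-up sentence 1: three entities, root "travel", one modifier, one preposition
rel = build_relationship("travel", ["quickly"], "to")
pair_entities(["Caesar", "Rome", "Gaul"], rel, edges)

# Made-up sentence 2 mentions an existing pair again; the earlier label is kept
pair_entities(["Rome", "Caesar"], "conquer", edges)

for pair, label in edges.items():
    print(sorted(pair), "->", label)
```

Because the label comes from the sentence root and not from the words linking two specific entities, it is best read as a hint about the context in which the entities co-occur.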


We then load our corpus, clean the text and check the output.

In [11]:
# Load up to `limit` characters from the large text file and clean it
text = load_limited_text("../data/raw/origin-of-the-world.txt", limit=limit)
text = clean_text(text)

print(text[0:1000])

# Increase spaCy's maximum document length limit
nlp.max_length = len(text) + 1000  # Add some buffer to the current text length


THE STORY AND AIM OF THE OUTLINE OF HISTORY T HE Outline of History was first written in -. It was published in illustrated parts, and it was carefully revised and printed again as a book in . It was again revised very severely and rearranged for a reprint in (January) ; it was reissued in a revised and much more amply illustrated edition in , and again in came a quite fresh edition, recast, rewritten in many places, and with much added new matter. This has now been further revised. There were many reasons to move a writer to attempt a World History in . It was the last, the weariest, most disillusioned year of the Great War. Everywhere there were unwonted privations ; everywhere there was mourning. The tale of the dead and mutilated had mounted to many millions. Men felt they had come to a crisis in the world’s affairs. They were too weary and heart-sick to consider complicated possibilities. They were not sure whether they were facing a disaster to civilization or the inauguration of

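When we are happy with the output we can process the document. We use spaCy's `nlp.pipe`, which can process texts in parallel when `n_process` is greater than 1. Because the work is distributed per text, passing a single text as we do here keeps it on one worker.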

In [12]:
# Process the cleaned text as a single document (up to `limit` characters)
docs = nlp.pipe([text], batch_size=1, n_process=16)  # Parser stays enabled: it provides sentences and the ROOT dependency

# Process the document and build the graph
# (only one document is passed, so the bar stops at 1/1000; total=1 would match the input)
for doc in tqdm(docs, total=1000, desc="Building Knowledge Graph"):
    add_to_graph(graph, doc)


Building Knowledge Graph:   0%|                                              | 1/1000 [18:20<305:30:29, 1100.93s/it]
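
The progress bar shows a single item taking over 18 minutes, which is what one would expect when the whole corpus is one document. A possible alternative, sketched below under assumptions, is to split the cleaned text into smaller pieces first so that `nlp.pipe` has several texts to distribute. The helper `chunk_text` and the chunk size are hypothetical and not part of the original notebook. It breaks at spaces, so a sentence can be cut at a chunk boundary and a few entity pairs may be lost; splitting at sentence or paragraph boundaries would be safer.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into pieces of at most max_chars, breaking at a space where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last space inside the window so words are not cut in half
            split_at = text.rfind(" ", start, end)
            if split_at > start:
                end = split_at
        chunks.append(text[start:end].strip())
        start = end
    # Drop any empty pieces left over after stripping
    return [chunk for chunk in chunks if chunk]

# Small made-up example so the behaviour is easy to inspect
demo_text = "alpha beta gamma delta epsilon"
demo_chunks = chunk_text(demo_text, max_chars=12)
print(demo_chunks)
```

With the notebook's variables, the chunks could then be passed as `nlp.pipe(chunk_text(text), n_process=...)` and the progress bar given `total=len(chunks)`, so that it reflects the real number of documents.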


In [13]:
nx.write_gml(graph, f"../data/result/knowledge_graph_2.0_limit={limit}.gml")