## Knowledge Graph (KG) Construction from Unstructured Text

> Slides link: https://docs.google.com/presentation/d/1ta1Gw004GIBky70Bm1-fYULBIKBBb02hKFg1l0k81vY/edit#slide=id.p

**Useful Related Links:**
- https://www.boundaryml.com/blog/structured-output-from-llms
- https://huggingface.co/spaces/urchade/gliner_multiv2.1
- https://huggingface.co/urchade/gliner_large-v2
- https://github.com/hitz-zentroa/GoLLIE

In [17]:
# !pip install gliner

In [1]:
from gliner import GLiNER

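def merge_entities(entities):
    """Merge adjacent entities that share a label into a single span.

    Note: reads the module-level `text` variable, so `text` must hold the
    string that `entities` was predicted from.
    """
    if not entities:
        return []
    merged = []
    current = entities[0]
    for next_entity in entities[1:]:
        # Adjacent: the next span starts at, or one character after, the current end.
        same_label = next_entity['label'] == current['label']
        adjacent = next_entity['start'] in (current['end'], current['end'] + 1)
        if same_label and adjacent:
            current['text'] = text[current['start']: next_entity['end']].strip()
            current['end'] = next_entity['end']
        else:
            merged.append(current)
            current = next_entity
    # Append the last entity
    merged.append(current)
    return merged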


# model = GLiNER.from_pretrained("numind/NuNerZero")
model = GLiNER.from_pretrained("numind/NuZero_token")




In [3]:
# NuZero requires labels to be lower-cased!
labels = ["location","date","person","event", "company", "organization", "position"]
labels = [l.lower() for l in labels]

text = """Fiat has completed its buyout of Chrysler, making the U.S. business a wholly-owned subsidiary of the Italian
carmaker as it gears up to use their combined resources to turn around its loss-making operations in
Europe. The company announced on January 1 that it had struck a $4.35 billion deal - cheaper than analysts
had expected - to gain full control of Chrysler, ending more than a year of tense talks that had obstructed Chief Executive Sergio Marchionne's efforts to create the
world's seventh-largest auto maker."""

entities = model.predict_entities(text, labels, threshold=0.4)

entities = merge_entities(entities)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
    

Fiat => organization
Chrysler => company
U.S. => location
Italian => location
Europe => location
January 1 => date
Chrysler => company
Chief Executive => position
Sergio Marchionne => person
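
The merge step joins neighbouring spans that carry the same label, which is useful when a token-level model splits one mention into several pieces. The following is a minimal, self-contained sketch of that idea on hand-written spans. It assumes entity dicts with `start`, `end`, `label` and `text` keys as in the cell above; `merge_adjacent` and the sample spans are hypothetical, and the source text is passed explicitly instead of being read from a global.

```python
def merge_adjacent(entities, source_text):
    """Merge neighbouring spans that share a label (hypothetical helper)."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        # Adjacent: the span starts at, or one character after, the previous end.
        if prev and ent["label"] == prev["label"] and ent["start"] - prev["end"] in (0, 1):
            prev["end"] = ent["end"]
            prev["text"] = source_text[prev["start"]:prev["end"]].strip()
        else:
            merged.append(dict(ent))  # copy so the input list is not mutated
    return merged


sample_text = "Sergio Marchionne runs Fiat"
# Hand-written spans for illustration, not real model output.
spans = [
    {"start": 0, "end": 6, "label": "person", "text": "Sergio"},
    {"start": 7, "end": 17, "label": "person", "text": "Marchionne"},
    {"start": 23, "end": 27, "label": "organization", "text": "Fiat"},
]
result = merge_adjacent(spans, sample_text)
print([(e["text"], e["label"]) for e in result])
```

Passing the text as an argument makes the helper safe to reuse when `text` is later rebound, as happens in the chunk loop further down.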


In [150]:
# !pip install wikipedia

In [4]:
import wikipedia
from tqdm import tqdm

In [5]:
page = wikipedia.page(title="Tom Hanks", auto_suggest=False)
page.content[:1000]

"Thomas Jeffrey Hanks (born July 9, 1956) is an American actor and filmmaker. Known for both his comedic and dramatic roles, he is one of the most popular and recognizable film stars worldwide, and is regarded as an American cultural icon. Hanks's films have grossed more than $4.9 billion in North America and more than $9.96 billion worldwide, making him the fourth-highest-grossing actor in North America. He has received numerous honors including the AFI Life Achievement Award in 2002, the Kennedy Center Honor in 2014, the Presidential Medal of Freedom and the French Legion of Honor both in 2016, as well as the Golden Globe Cecil B. DeMille Award in 2020.\nHanks made his breakthrough with leading roles in a series of comedy films that received positive media attention, such as Splash (1984), The Money Pit (1986), Big (1988) and A League of Their Own (1992). He won two consecutive Academy Awards for Best Actor for starring as a gay lawyer suffering from AIDS in Philadelphia (1993) and t

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [7]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\n\n", "\n"]
)

chunks = text_splitter.split_text(page.content)
len(chunks)

97

In [8]:
len(chunks[0])

662
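
The first chunk has 662 characters even though `chunk_size=200`. A likely explanation is the separator list: only `"\n\n"` and `"\n"` are allowed as split points, so a paragraph without a line break cannot be cut any further. The sketch below is a simplified stand-in written for illustration, not LangChain's actual algorithm (it skips the merging of small pieces and the overlap); `split_on_separators` is a hypothetical name.

```python
def split_on_separators(text, separators, chunk_size):
    """Recursively split on the given separators only (simplified illustration)."""
    if len(text) <= chunk_size or not separators:
        return [text]  # small enough, or no separator left to try
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        if part:
            pieces.extend(split_on_separators(part, rest, chunk_size))
    return pieces


paragraph = "word " * 100  # 500 characters with no newline inside
doc = "short intro\n\n" + paragraph + "\nclosing line"
chunks_demo = split_on_separators(doc, ["\n\n", "\n"], chunk_size=200)
print([len(c) for c in chunks_demo])
```

The 500-character paragraph comes back whole. Adding finer separators such as `" "` to the list would let long paragraphs be split as well.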

In [52]:
labels = ["award", "location", "organization", "person", "movie"]

In [53]:
chunks_entities = []
entity_list = []
duplicates = set()
for text in tqdm(chunks):
    entities = model.predict_entities(text, labels, threshold=0.7)
    entities = merge_entities(entities)
    chunk_entities = set()
    for entity in entities:
        # print(entity["text"], "=>", entity["label"])
        chunk_entities.add(entity["text"])
        if entity["text"] in duplicates:
            continue
        duplicates.add(entity["text"])
        entity_list.append((entity["text"], "=>", entity["label"]))

    chunks_entities.append(list(chunk_entities))

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:31<00:00,  3.11it/s]


In [22]:
chunks_entities[:2]

[['Thomas Jeffrey Hanks', 'North America', 'July', ','], ['Ron', 'Hanks']]

In [11]:
chunks[9]

'\nHaving grown up in the Bay Area, Hanks says that some of his first movie memories were seeing movies in the Alameda Theatre in Alameda, California. Hanks studied theater at Chabot College in Hayward, California, and transferred to California State University, Sacramento after two years. During a 2001 interview with sportscaster Bob Costas, Hanks was asked whether he would rather have an Oscar or a Heisman Trophy. He replied that he would rather win a Heisman by playing halfback for the California Golden Bears. He told New York magazine in 1986, "Acting classes looked like the best place for a guy who liked to make a lot of noise and be rather flamboyant. I spent a lot of time going to plays. I wouldn\'t take dates with me. I\'d just drive to a theater, buy myself a ticket, sit in the seat and read the program, and then get into the play completely. I spent a lot of time like that, seeing Brecht, Tennessee Williams, Ibsen, and all that."'

In [29]:
entity_list[:4]

[('Thomas Jeffrey Hanks', '=>', 'person'),
 ('July', '=>', 'date'),
 (',', '=>', 'date'),
 ('North America', '=>', 'location')]
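
The sample above contains a punctuation-only fragment (`','`) that is of no use as a graph node. One possible post-filter is sketched below; it is an assumption added for illustration and not part of the original pipeline. `is_plausible_entity` and its `min_chars` threshold are hypothetical.

```python
import string


def is_plausible_entity(name, min_chars=2):
    """Heuristic filter: reject strings that are empty or too short
    once surrounding punctuation and whitespace are removed."""
    stripped = name.strip(string.punctuation + string.whitespace)
    return len(stripped) >= min_chars and any(ch.isalnum() for ch in stripped)


raw = [("Thomas Jeffrey Hanks", "person"), (",", "date"), ("North America", "location")]
kept = [(name, label) for name, label in raw if is_plausible_entity(name)]
print(kept)
```

A filter like this does not catch partial mentions such as `'July'`; those would need a label-specific rule or a higher threshold.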

In [67]:
locs = []
orgs = []
persons = []
awards = []
movies = []
for name, _, label in entity_list:
    name = name.lower()
    if label == 'person':
        persons.append(name)
    elif label == 'organization':
        orgs.append(name)
    elif label == 'location':
        locs.append(name)
    elif label == 'award':
        awards.append(name)
    elif label == 'movie':
        movies.append(name)

In [68]:
len(movies)

88

In [69]:
locs

['north america',
 'philadelphia',
 'broadway',
 'concord',
 'california',
 'red bluff',
 'oakland',
 'bay area',
 'alameda',
 'hayward',
 'cleveland',
 'ohio',
 'new york city',
 'los angeles',
 'us',
 'hollywood',
 'wall street',
 'moon',
 'france',
 'u.s.',
 'texas',
 'soviet union',
 'neighborhood',
 'studio 8h',
 'queensland',
 'australia',
 'new orleans',
 'greece',
 'ketchum',
 'idaho',
 'las vegas',
 'schöneck',
 'hesse',
 'germany',
 'united states',
 'kentucky',
 'mati',
 'athens',
 'white house',
 'new york',
 'rock and roll hall of fame',
 'pittsburgh',
 'worldwide',
 'asteroid 12818 tomhanks',
 'world',
 'london',
 'secaucus',
 'new jersey',
 'boston',
 'edina',
 'minnesota']
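
The per-type lists are searched by membership later, when nodes are coloured and sized. An alternative, sketched here under the assumption that `entity_list` holds `(text, "=>", label)` tuples as shown earlier, is a single dictionary from lower-cased name to label. `build_label_index` is a hypothetical helper; the first label seen for a name wins, mirroring the de-duplication in the extraction loop.

```python
def build_label_index(entity_triples):
    """Map lower-cased entity text to its label; first occurrence wins."""
    index = {}
    for name, _arrow, label in entity_triples:
        index.setdefault(name.lower(), label)
    return index


sample = [
    ("Thomas Jeffrey Hanks", "=>", "person"),
    ("North America", "=>", "location"),
    ("north america", "=>", "organization"),  # later duplicate is ignored
]
label_index = build_label_index(sample)
print(label_index.get("north america"), label_index.get("unknown", "untyped"))
```

A dictionary lookup is constant time and returns the type directly, so one lookup can replace the chain of `in` checks.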

In [190]:
# !pip install SPARQLWrapper

In [18]:
from SPARQLWrapper import SPARQLWrapper, JSON

# Define the SPARQL endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

# Define the SPARQL query
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?subject ?predicate ?object
WHERE {
  { 
    ?subject ?predicate ?object .
    ?subject rdfs:label "Tom Hanks"@en .
  }
  UNION
  {
    ?subject ?predicate ?object .
    ?subject rdfs:label "Killing Lincoln"@en .
  }
}
LIMIT 100
"""

# Set the query
sparql.setQuery(query)

# Set the return format to JSON
sparql.setReturnFormat(JSON)

# Execute the query and convert the result to a Python dictionary
results = sparql.query().convert()

# Process and print the results
for result in results["results"]["bindings"]:
    print(f"{result['subject']['value']} {result['predicate']['value']} {result['object']['value']}")


http://dbpedia.org/resource/Category:Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2004/02/skos/core#Concept
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2002/07/owl#Thing
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://xmlns.com/foaf/0.1/Person
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/Person
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#NaturalPerson
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.wikidata.org/entity/Q19088
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.wikidata.org/entity/Q215627
http://dbpedia.org/resource/Tom_Hanks http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www

In [31]:
query = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?relation ?object
WHERE {
  {
    dbr:Tom_Hanks ?relation ?object .
    ?object rdfs:label "Killing Lincoln"@en .
  }
  UNION
  {
    dbr:Killing_Lincoln ?relation ?object .
    ?object rdfs:label "Tom Hanks"@en .
  }
}

"""


In [32]:
# Set the query
sparql.setQuery(query)

# Set the return format to JSON
sparql.setReturnFormat(JSON)

# Execute the query and convert the result to a Python dictionary
results = sparql.query().convert()

# Process and print the results
for result in results["results"]["bindings"]:
    print(f"{result['relation']['value']} {result['object']['value']}")

http://dbpedia.org/ontology/wikiPageWikiLink http://dbpedia.org/resource/Tom_Hanks
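
The query above checks for a direct link in both directions between two resources. The sketch below builds the same query text for an arbitrary pair without touching the network. `relation_query` is a hypothetical helper, and it assumes a DBpedia resource name is the English label with spaces replaced by underscores. That holds for the two examples here but not in general (labels with parentheses or quotes would need escaping or an explicit lookup).

```python
def relation_query(label_a, label_b):
    """Build a SPARQL query for direct links between two DBpedia resources."""

    def to_resource(label):
        # Assumption: resource name is the label with spaces as underscores.
        return "dbr:" + label.replace(" ", "_")

    def block(subject_label, object_label):
        return (
            "  {\n"
            f"    {to_resource(subject_label)} ?relation ?object .\n"
            f'    ?object rdfs:label "{object_label}"@en .\n'
            "  }"
        )

    return (
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\n"
        "SELECT ?relation ?object\nWHERE {\n"
        + block(label_a, label_b)
        + "\n  UNION\n"
        + block(label_b, label_a)
        + "\n}"
    )


query_text = relation_query("Tom Hanks", "Killing Lincoln")
print(query_text)
```

The returned string can be passed to `sparql.setQuery(...)` in the same way as the hand-written query.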


In [19]:
from litellm import completion
from typing import List
import json

In [20]:
def format_entities(ent_list:List[str]) -> str:
    return "\n\n".join([e for e in ent_list])

# print(format_entities(chunks_entities[9]))

In [34]:
system_message = """Extract all the relationships between the following entities ONLY based on the given context. 
Return a list of JSON objects. For example:

<Examples>
    [{{"subject": "John", "relationship": "lives in", "object": "US"}},
    {{"subject": "Eiffel Tower", "relationship": "is located in", "object": "Paris"}},
    {{"subject": "Hayao Miyazaki", "relationship": "is", "object": "Japanese animator"}}]
</Examples>

- ONLY return triples and nothing else. None of 'subject', 'relationship' and 'object' can be empty.

Entities: \n\n{entities}

"""
# example for a particular set of entities
i = 3

ents = format_entities(chunks_entities[i])
text = chunks[i]

user_message = "Context: {text}\n\nTriples:"
response = completion(
  # api_key=OPENAI_API_KEY,
  # model="ollama/adrienbrault/nous-hermes2pro-llama3-8b:q8_0",
  # model="ollama/llama3",
  model="gpt-3.5-turbo",
  messages=[{"content": system_message.format(entities=ents),"role": "system"}, {"content": user_message.format(text=text),"role": "user"}],
  max_tokens=1000,
  format="json"  # Ollama-style JSON option; OpenAI models use `response_format` instead
)

triples = json.loads(response.choices[0].message.content)
triples

[{'subject': 'Hanks', 'relationship': 'launched', 'object': 'Playtone'},
 {'subject': 'Playtone',
  'relationship': 'has',
  'object': 'exclusive television development deal with HBO'},
 {'subject': 'Hanks',
  'relationship': 'won',
  'object': 'seven Primetime Emmy Awards'},
 {'subject': 'Hanks',
  'relationship': 'won',
  'object': 'Emmy Awards for his work as a producer of various limited series and television movies'},
 {'subject': 'Hanks',
  'relationship': 'won',
  'object': 'Emmy Awards for From the Earth to the Moon, Band of Brothers, John Adams, The Pacific, Game Change, and Olive Kitteridge'},
 {'subject': 'Hanks',
  'relationship': 'made',
  'object': "Broadway debut in Nora Ephron's Lucky Guy"},
 {'subject': 'Hanks',
  'relationship': 'earned',
  'object': 'Tony Award for Best Actor in a Play nomination'}]

In [366]:
import time

errors = []
all_triples = []
for i in tqdm(range(len(chunks_entities))):
    response = None  # reset so a failed call cannot reuse the previous reply
    try:
        ents = format_entities(chunks_entities[i])
        text = chunks[i]

        user_message = "Context: {text}\n\nTriples:"
        response = completion(
            # api_key=OPENAI_API_KEY,
            # model="ollama/adrienbrault/nous-hermes2pro-llama3-8b:q8_0",
            # model="ollama/llama3",
            model="gpt-3.5-turbo",
            messages=[{"content": system_message.format(entities=ents),"role": "system"}, {"content": user_message.format(text=text),"role": "user"}],
            max_tokens=1000,
            format="json"
        )
        triples = json.loads(response.choices[0].message.content)
        all_triples.append(triples)
        time.sleep(3)
    except Exception as e:
        print(f"Error for chunk {i}, {e}")
        # Keep the raw reply for inspection when one was received.
        errors.append(response.choices[0].message.content if response is not None else None)
        all_triples.append([])

Error for chunk 7, Expecting value: line 1 column 1 (char 0)
Error for chunk 10, Expecting value: line 1 column 1 (char 0)
Error for chunk 11, Expecting value: line 1 column 1 (char 0)
Error for chunk 18, Expecting value: line 1 column 1 (char 0)
Error for chunk 19, Expecting value: line 1 column 1 (char 0)
Error for chunk 30, Expecting value: line 1 column 1 (char 0)
Error for chunk 42, Expecting value: line 1 column 1 (char 0)
Error for chunk 51, Expecting value: line 1 column 1 (char 0)
Error for chunk 58, Expecting value: line 1 column 1 (char 0)
Error for chunk 79, Expecting value: line 1 column 1 (char 0)
Error for chunk 88, Expecting value: line 1 column 1 (char 0)
Error for chunk 93, Expecting value: line 1 column 1 (char 0)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [08:52<00:00,  5.49s/it]
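
`Expecting value: line 1 column 1 (char 0)` is the message `json.loads` raises when the string does not start with a JSON value, for example when the reply is empty or begins with prose. The sketch below shows a more tolerant parser. It is an assumption added for illustration; `parse_triples` is a hypothetical helper and the sample reply is made up.

```python
import json

REQUIRED_KEYS = ("subject", "relationship", "object")


def parse_triples(raw):
    """Best-effort extraction of a list of triples from an LLM reply."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        return []  # no JSON array found
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only dicts in which all three fields are present and non-empty.
    return [
        t for t in data
        if isinstance(t, dict) and all(t.get(k) for k in REQUIRED_KEYS)
    ]


reply = (
    "Here are the triples:\n"
    '[{"subject": "Hanks", "relationship": "launched", "object": "Playtone"}, '
    '{"subject": "William Dodd", "relationship": "is an American diplomat"}]'
)
print(parse_triples(reply))
print(parse_triples(""))
```

Dropping incomplete triples at parse time also avoids the missing-key error that otherwise surfaces when the graph is built.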


In [372]:
# output_file = "triples.json"
# json_data = json.dumps(all_triples, indent=4)
# with open(output_file, "w") as file:
#     file.write(json_data)

In [42]:
input_file = "triples.json"
with open(input_file, "r") as file:
    all_triples = json.load(file)

all_triples[0]

[{'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has received',
  'object': 'AFI Life Achievement Award'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has received',
  'object': 'Kennedy Center Honor'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has received',
  'object': 'Presidential Medal of Freedom'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has received',
  'object': 'French Legion of Honor'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has received',
  'object': 'Golden Globe Cecil B. DeMille Award'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has grossed',
  'object': '$4.9 billion in North America'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'has grossed',
  'object': '$9.96 billion worldwide'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'is',
  'object': 'American actor'},
 {'subject': 'Thomas Jeffrey Hanks',
  'relationship': 'is',
  'object': 'filmmaker'},
 {'subject': 'North Ame

In [1]:
# !pip install pyvis

In [79]:
TYPE_TO_COLOR = {
    "person": "#6495ED",
    "location": "#3CB371",
    "award": "#F4A460",
    "organization": "#CD5C5C",
    "movie": "#6A5ACD"
}

TYPE_TO_SIZE = {
    "person": 50,
    "location": 30,
    "award": 20,
    "organization": 10,
    "movie": 40
}


def get_type(n: str):
    """Return the entity type of `n`, or None if it was not extracted."""
    name = n.lower()
    # Checked in a fixed order: the first list that contains the name wins.
    for type_name, names in (("person", persons), ("location", locs), ("award", awards),
                             ("organization", orgs), ("movie", movies)):
        if name in names:
            return type_name
    return None


def get_color(n: str) -> str:
    # Nodes with no known type are drawn in red.
    return TYPE_TO_COLOR.get(get_type(n), "red")


def get_size(n: str) -> int:
    # Nodes with no known type get the smallest size.
    return TYPE_TO_SIZE.get(get_type(n), 10)

In [80]:
from pyvis.network import Network
import networkx as nx

G = nx.Graph()

for items in all_triples:
    for item in items:
        try:
            node_1 = item["subject"]
            node_2 = item["object"]
            G.add_node(node_1, title=node_1, color=get_color(node_1), size=get_size(node_1), label=node_1)
            G.add_node(node_2, title=node_2, color=get_color(node_2), size=get_size(node_2), label=node_2)
            G.add_edge(node_1, node_2, title=item["relationship"], weight=4)
        except (KeyError, TypeError):
            # Skip malformed triples, e.g. one with no "object" field.
            print(f"Error in item: {item}")

Error in item: {'subject': 'William Dodd', 'relationship': 'is an American diplomat'}


In [81]:
nt = Network(height="750px", width="100%")
# nt =  Network(height="750px", width="100%", bgcolor="#222222", font_color="white")

nt.from_nx(G)
# nt.toggle_physics(True)
nt.force_atlas_2based(central_gravity=0.015, gravity=-31)
nt.show("graph.html", notebook=False)
# Generate the HTML
# html = nt.generate_html()

# # Write the HTML to a file
# with open("graph.html", "w") as file:
#     file.write(html)

# Display the graph in a Jupyter Notebook
from IPython.display import IFrame
IFrame("graph.html", width=1000, height=800)

graph.html


![image](kg_graph.png)