Using the REBEL model, instead of an LLM, to extract entities and relationships.

Follow along: https://colab.research.google.com/drive/1G6pcR0pXvSkdMQlAK_P-IrYgo-_staxd?usp=sharing#scrollTo=XX_GxhusPZR8

Install Nebula Graph: `curl -fsSL nebula-up.siwei.io/install.sh | bash`

Also follow along: https://colab.research.google.com/drive/1tLjOg2ZQuIClfuWrAC2LdiZHCov8oUbs#scrollTo=kkHpLzEuYo_9

In [1]:
import os

import openai
import pandas as pd
from dotenv import load_dotenv
from transformers import pipeline
from llama_index.core import (
    Document,
    KnowledgeGraphIndex,
    StorageContext,
)
from llama_index.core.settings import Settings
from llama_index.core.text_splitter import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.graph_stores.nebula import NebulaGraphStore

load_dotenv()

openai.api_key = os.environ["OPENAI_API_KEY"]

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs so everything runs on CPU
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous CUDA calls, easier to debug

Set up NebulaGraph:

Run the following nGQL:
```
CREATE SPACE llamaindex(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
USE llamaindex;
CREATE TAG entity(name string);
CREATE EDGE relationship(relationship string);
CREATE TAG INDEX entity_index ON entity(name(256));
```
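
The same statements can be rendered from Python so the space name and the 256-character VID/index length stay in sync. This is only a sketch: `nebula_schema_statements` is a hypothetical helper, and the `IF NOT EXISTS` clauses are an addition that makes the setup safe to re-run. Nebula applies schema changes asynchronously, so a short pause between `CREATE SPACE` and `USE` may be needed.

```python
def nebula_schema_statements(space: str = "llamaindex", vid_len: int = 256) -> list:
    """Render the nGQL setup statements for a given space name and VID length.

    Hypothetical helper, not part of any library. It only builds strings;
    run them in the Nebula console or through the %ngql magic.
    """
    return [
        f"CREATE SPACE IF NOT EXISTS {space}"
        f"(vid_type=FIXED_STRING({vid_len}), partition_num=1, replica_factor=1);",
        f"USE {space};",
        "CREATE TAG IF NOT EXISTS entity(name string);",
        "CREATE EDGE IF NOT EXISTS relationship(relationship string);",
        f"CREATE TAG INDEX IF NOT EXISTS entity_index ON entity(name({vid_len}));",
    ]


for statement in nebula_schema_statements():
    print(statement)
```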

In [60]:
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], [
    "relationship"
]  # defaults; can be omitted when creating from an empty KG
tags = ["entity"]  # default; can be omitted when creating from an empty KG

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

Load review documents:

In [61]:
def load_reviews(
    file: str = "../../data/clean/cleaned_reviews.csv",
    start_index: int = 0,
    limit: int = 10,
):
    df = pd.read_csv(file)[start_index:start_index + limit]
    print(f"Loaded reviews: {df.shape}")
    df["Content"] = df.apply(lambda row: f"{row['Review Title']}\n{row['Review Content']}", axis=1)

    documents = [
        Document(
            text=row["Content"],
            metadata={
                "Airline": row["Airline"],
                "Type of Traveller": row["Type of Traveller"],
                "Route": row["Route"],
                "Class": row["Class"],
                "Month Flown": row["Month Flown"],
                "Review Date": row["Review Date"],
            }
        )
        for _, row in df.iterrows()
    ]
    return documents
    # splitter = SentenceSplitter(
    #    chunk_size=200,
    #    chunk_overlap=0,
    #    paragraph_separator="\n\n"
    #)
    # 
    #nodes = splitter.get_nodes_from_documents(documents)
    # return nodes

In [93]:
reviews = load_reviews(limit=10_000_000)
len([doc for doc in reviews if "American" in doc.metadata["Airline"]])

Loaded reviews: (28950, 15)


[Document(id_='c1947939-44c2-4028-998d-958fb002a6a6', embedding=None, metadata={'Airline': 'American Airlines', 'Type of Traveller': 'Solo Leisure', 'Route': 'Dallas Ft Worth to Austin', 'Class': 'Economy Class', 'Month Flown': '2024-04-01', 'Review Date': '2024-04-21'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Never fly with them\nI'm writing this for my wife. She has been stranded by American for 18 hours now, and counting. Two cancellations, and then delays. The plane she's waiting on now was late leaving Austin, and is delayed about 3 hours so far. We have to decide if she's just going to a hotel, and I have to drive to Dallas to pick her up. We are about to give up on the idea of American getting her out of DFW at all. We haven't even begun to talk to them about a refund, but I can imagine how that's going to go. Terrible company. Never fly with them. They have tortured her. She's been awake at the airport for 18 hours now, and she is

Define the triplet extraction function:

In [84]:
# Function to parse the generated text and extract the triplets
# Rebel outputs a specific format. This code is mostly copied from the model card!

triplet_extractor = pipeline(
    "text2text-generation",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large",
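    device=-1,  # -1 runs on CPU; use device=0 (or "cuda:0") for the first GPU
    # temperature and top_p only take effect when sampling is enabled (do_sample=True)
    temperature=0.3,
    top_p=0.2,
    # do_sample=True,
)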


def extract_triplets(input_text):
    text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(input_text, return_tensors=True, return_text=False)[0]["generated_token_ids"]])[0]

    triplets = []
    # REBEL emits "<triplet> head <subj> tail <obj> relation"; collect each part in turn.
    # Every triplet is stored as a (head, relation, tail) tuple so the list has one shape.
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    # NOTE: this filter also requires the relation label to appear verbatim in the input.
    # REBEL's labels (e.g. "owned by") rarely do, so the last triplet is usually dropped.
    if subject != '' and subject in input_text and relation != '' and relation in input_text and object_ != '' and object_ in input_text:
        triplets.append((subject.strip(), relation.strip(), object_.strip()))

    return triplets



In [85]:
print(reviews[2].text)

Crew was super nice
Flight was great! Crew was super nice, Chief Stewardess & FSS showered me with many snacks! And one of the FSS gave me a bag tag, offered to take my picture, and had a little talk. She was so kind and fun. The food was delicious. Booked BTC (The burger & Steak), burger was amazing, chefs kiss. What The steak was a little over cooked but that’s okay, the sides were delicious. The seat was clean, as well as the bathroom. IFE had many movies, shows, music to choose from. Seat could transform to a full flat bed, inflight wifi could be used for instagram, WhatsApp, etc. Crew tried to keep me entertained at every moment, offering me such things as snacks, toys etc. Love Singapore Airlines. The world class service is wonderful. Check in experience with LHR was good but not amazing, girl who checked me in looks rather annoyed. Lounge was great, gave me a free lion teddy.


In [86]:
triplet_extractor.tokenizer.batch_decode([triplet_extractor(reviews[2].text, return_tensors=True, return_text=False)[0]["generated_token_ids"]])



['<s><triplet> IFE <subj> Singapore Airlines <obj> owned by</s>']

In [87]:
extract_triplets(reviews[2].text)

[]
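
Why the empty list: the decoded output above does contain one triplet, but the final check in `extract_triplets` also requires the relation label ("owned by") to appear in the review text, which it does not. A minimal sketch of a looser variant, runnable without the model: `parse_rebel_output` and `keep_grounded` are hypothetical names, and checking only head and tail against the source text is an assumption about what the filter was meant to do.

```python
def parse_rebel_output(text):
    """Parse REBEL's linearized output into (head, relation, tail) tuples."""
    for special in ("<s>", "<pad>", "</s>"):
        text = text.replace(special, "")

    triplets = []
    subject, relation, object_ = "", "", ""
    current = None
    for token in text.split():
        if token == "<triplet>":
            # A new head starts; flush the triplet collected so far.
            if relation:
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
            current, subject, relation, object_ = "head", "", "", ""
        elif token == "<subj>":
            # Same head, new tail; flush the previous (head, relation, tail).
            if relation:
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
            current, relation, object_ = "tail", "", ""
        elif token == "<obj>":
            current, relation = "relation", ""
        elif current == "head":
            subject += " " + token
        elif current == "tail":
            object_ += " " + token
        elif current == "relation":
            relation += " " + token
    if subject and relation and object_:
        triplets.append((subject.strip(), relation.strip(), object_.strip()))
    return triplets


def keep_grounded(triplets, source_text):
    """Keep triplets whose head and tail both occur verbatim in the source text."""
    return [(h, r, t) for h, r, t in triplets if h in source_text and t in source_text]


decoded = "<s><triplet> IFE <subj> Singapore Airlines <obj> owned by</s>"
print(parse_rebel_output(decoded))  # [('IFE', 'owned by', 'Singapore Airlines')]
```

Whether a triplet like this one is worth keeping is a separate question; the sketch only shows where it gets dropped.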

In [76]:
input_sentence = "Gràcia is a district of the city of Barcelona, Spain. It comprises the neighborhoods of Vila de Gràcia, Vallcarca i els Penitents, El Coll, La Salut and Camp d'en Grassot i Gràcia Nova. Gràcia is bordered by the districts of Eixample to the south, Sarrià-Sant Gervasi to the west and Horta-Guinardó to the east. A vibrant and diverse enclave of Catalan life, Gràcia was an independent municipality for centuries before being formally annexed by Barcelona in 1897 as a part of the city's expansion."
extract_triplets(input_sentence)

[]

In [80]:
import spacy
import spacy_components

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("rebel", after="senter", config={
    'device':0, # GPU index; use -1 to run on CPU
    'model_name':'Babelscape/rebel-large'} # Model to use; defaults to 'Babelscape/rebel-large' if not given
    )
input_sentence = "Gràcia is a district of the city of Barcelona, Spain. It comprises the neighborhoods of Vila de Gràcia, Vallcarca i els Penitents, El Coll, La Salut and Camp d'en Grassot i Gràcia Nova. Gràcia is bordered by the districts of Eixample to the south, Sarrià-Sant Gervasi to the west and Horta-Guinardó to the east. A vibrant and diverse enclave of Catalan life, Gràcia was an independent municipality for centuries before being formally annexed by Barcelona in 1897 as a part of the city's expansion."
                 
doc = nlp(input_sentence)
doc_list = nlp.pipe([input_sentence])
for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

  sent_lengths_max = sent_lengths.max().item() + 1


(0, 8): {'relation': 'located in the administrative territorial entity', 'head_span': Gràcia, 'tail_span': Barcelona}
(0, 10): {'relation': 'country', 'head_span': Gràcia, 'tail_span': Spain}
(8, 0): {'relation': 'contains administrative territorial entity', 'head_span': Barcelona, 'tail_span': Gràcia}
(8, 10): {'relation': 'country', 'head_span': Barcelona, 'tail_span': Spain}
(17, 0): {'relation': 'located in the administrative territorial entity', 'head_span': Vila de Gràcia, 'tail_span': Gràcia}
(21, 0): {'relation': 'located in the administrative territorial entity', 'head_span': Vallcarca i els Penitents, 'tail_span': Gràcia}
(26, 0): {'relation': 'located in the administrative territorial entity', 'head_span': El Coll, 'tail_span': Gràcia}
(29, 0): {'relation': 'located in the administrative territorial entity', 'head_span': La Salut, 'tail_span': Gràcia}
(0, 46): {'relation': 'shares border with', 'head_span': Gràcia, 'tail_span': Eixample}
(0, 51): {'relation': 'shares border 

In [88]:
doc = nlp(reviews[2].text)
doc_list = nlp.pipe([reviews[2].text])
for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

(0, 5): {'relation': 'part of', 'head_span': Crew, 'tail_span': Flight}
(5, 0): {'relation': 'has part', 'head_span': Flight, 'tail_span': Crew}
(63, 65): {'relation': 'has part', 'head_span': burger, 'tail_span': Steak}
(65, 63): {'relation': 'part of', 'head_span': Steak, 'tail_span': burger}
(89, 65): {'relation': 'has part', 'head_span': sides, 'tail_span': Steak}
(94, 102): {'relation': 'part of', 'head_span': seat, 'tail_span': bathroom}
(102, 94): {'relation': 'has part', 'head_span': bathroom, 'tail_span': seat}
(107, 109): {'relation': 'has part', 'head_span': movies, 'tail_span': shows}
(109, 107): {'relation': 'part of', 'head_span': shows, 'tail_span': movies}
(131, 126): {'relation': 'uses', 'head_span': instagram, 'tail_span': wifi}
(133, 126): {'relation': 'uses', 'head_span': WhatsApp, 'tail_span': wifi}
(22, 142): {'relation': 'use', 'head_span': snacks, 'tail_span': entertained}
(157, 158): {'relation': 'facet of', 'head_span': Love Singapore Airlines, 'tail_span': Si
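
To feed these relations into the graph, the dict printed above has to become plain `(head, relation, tail)` tuples. A minimal sketch, assuming `doc._.rel` has the shape shown in the output (keys are token-index pairs, values hold `relation`, `head_span` and `tail_span`); `rel_to_triplets` is a hypothetical helper, and plain strings stand in for spaCy spans in the example.

```python
def rel_to_triplets(rel):
    """Convert {(head_idx, tail_idx): {...}} into de-duplicated string tuples."""
    seen = set()
    triplets = []
    for _key, info in sorted(rel.items()):
        # str() turns a spaCy Span into its text; plain strings pass through unchanged.
        triplet = (str(info["head_span"]), info["relation"], str(info["tail_span"]))
        if triplet not in seen:
            seen.add(triplet)
            triplets.append(triplet)
    return triplets


sample = {
    (5, 0): {"relation": "has part", "head_span": "Flight", "tail_span": "Crew"},
    (0, 5): {"relation": "part of", "head_span": "Crew", "tail_span": "Flight"},
}
print(rel_to_triplets(sample))
```

With a real document this would be called as `rel_to_triplets(doc._.rel)`.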

In [83]:
reviews = load_reviews()

for review in reviews:
    print(f"Review: {review}")
    triplets = extract_triplets(review.text)
    print(f"Triplets: {triplets}")
    print("--")

Loaded reviews: (10, 15)
Review: Doc ID: 0c5f635e-f9a5-4104-8e64-4f1ef481dc60
Text: Food was below par Overall disappointing from Singapore
Airlines. Late and disorganized boarding. The A350-900 is a tired old
thing. Does anyone seriously use the coat hanger button on the seat
back in front of you? This aircraft has no air outlets above seats and
made the trip stuffy. The entertainment system was very old and dated
movies, yes ...
Triplets: []
--
Review: Doc ID: d8e0b688-d012-4295-8fac-b2793fb74a12
Text: recent pricing and service changes I usually fly Singapore
Airlines internationally, since 10 years I am a Krisflyer Gold member.
The recent pricing and service changes, show the results of bailed out
airlines believing ethical standards don't apply to them anymore.
Unbelievably toxic SIA service for business customers. Bear in mind if
you fly f...
Triplets: []
--
Review: Doc ID: f9c95404-7944-46a4-9436-a98b69a02bf7
Text: Crew was super nice Flight was great! Crew was super nice, Chief

Load Graph Index Query Engine:

In [18]:
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.chunk_size = 256
# Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
# Settings.num_output = 512
# Settings.context_window = 3900

documents = load_reviews()
# index = KnowledgeGraphIndex.from_documents(documents, kg_triplet_extract_fn=extract_triplets)
index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
    # kg_triplet_extract_fn=extract_triplets,
)

Loaded reviews: (10, 15)
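
The commented-out `kg_triplet_extract_fn=extract_triplets` argument is where a REBEL extractor would plug in. As far as I can tell, `max_triplets_per_chunk` only constrains the LLM prompt, so a custom extractor may need its own cap. A minimal sketch of a wrapper that de-duplicates and caps the triplets; `make_extract_fn` is a hypothetical helper and the stub extractor stands in for the REBEL pipeline.

```python
def make_extract_fn(extractor, max_triplets=10):
    """Wrap an extractor so it returns at most max_triplets unique tuples."""

    def extract(text):
        seen = set()
        unique = []
        for head, relation, tail in extractor(text):
            # Compare case-insensitively so "Crew" and "crew" count as one entity.
            key = (head.lower(), relation.lower(), tail.lower())
            if key in seen:
                continue
            seen.add(key)
            unique.append((head, relation, tail))
            if len(unique) >= max_triplets:
                break
        return unique

    return extract


def stub_extractor(text):
    # Stand-in for the REBEL pipeline: returns fixed triplets with a duplicate.
    return [
        ("Crew", "part of", "Flight"),
        ("crew", "part of", "flight"),
        ("Flight", "has part", "Crew"),
    ]


extract_fn = make_extract_fn(stub_extractor, max_triplets=10)
print(extract_fn("Crew was super nice"))
```

The wrapped function could then be passed as `kg_triplet_extract_fn=extract_fn` when building the index.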


Check that it was loaded into Nebula:

In [7]:
%load_ext ngql
%ngql --address 127.0.0.1 --port 9669 --user root --password nebula

Connection Pool Created


Unnamed: 0,Name
0,llamaindex


In [89]:
%ngql USE llamaindex;
%ngql MATCH ()-[e]->() RETURN e LIMIT 100;

Unnamed: 0,e
0,"(""2024-04-21"")-[:relationship@8324040960810900..."
1,"(""Chief stewardess"")-[:relationship@-717721805..."
2,"(""2024-04-01"")-[:relationship@-707808986102225..."
3,"(""2024-04-01"")-[:relationship@2576048185196685..."
4,"(""2024-04-01"")-[:relationship@6481582437030897..."
...,...
95,"(""Singapore airlines"")-[:relationship@40560834..."
96,"(""Singapore airlines"")-[:relationship@40560834..."
97,"(""Singapore airlines"")-[:relationship@40560834..."
98,"(""Singapore airlines"")-[:relationship@44683897..."


In [90]:
documents[0].text

'Food was below par\nOverall disappointing from Singapore Airlines. Late and disorganized boarding. The A350-900 is a tired old thing. Does anyone seriously use the coat hanger button on the seat back in front of you? This aircraft has no air outlets above seats and made the trip stuffy. The entertainment system was very old and dated movies, yes even the so called “recent releases”. TV selections were abysmal.  This was the first international flight I watched nothing at all on. Food was below par and drinks service patchy. I used to think Singapore Airlines was a great airline but after this flight I beg to differ. One good point is the staff on board professional and friendly.'

In [91]:
for r in reviews:
    print(r.metadata)
    print(r.text)

{'Airline': 'Singapore Airlines', 'Type of Traveller': 'Couple Leisure', 'Route': 'Melbourne to Singapore', 'Class': 'Economy Class', 'Month Flown': '2024-04-01', 'Review Date': '2024-04-21'}
Food was below par
Overall disappointing from Singapore Airlines. Late and disorganized boarding. The A350-900 is a tired old thing. Does anyone seriously use the coat hanger button on the seat back in front of you? This aircraft has no air outlets above seats and made the trip stuffy. The entertainment system was very old and dated movies, yes even the so called “recent releases”. TV selections were abysmal.  This was the first international flight I watched nothing at all on. Food was below par and drinks service patchy. I used to think Singapore Airlines was a great airline but after this flight I beg to differ. One good point is the staff on board professional and friendly.
{'Airline': 'Singapore Airlines', 'Type of Traveller': 'Business', 'Route': 'Zurich to Singapore', 'Class': 'Premium Econ

In [92]:
for r in index.as_retriever().retrieve("What does Singapore Airlines offer?"):
    print(r.text)

Due to a cancellation by another airline, I needed to modify the departure on my full fare, refundable ticket. I completed the customer assistance form on their website to request the rebooking, and tried repeatedly over an 8 hour period to reach their customer service number in the US. It was impossible to get past the recording and reach an agent over the phone. It was also not possible to make this change myself online. Upon arriving at Changi, I discovered that instead of changing my outbound flight as requested, Singapore Airlines cancelled my entire ticket. When I tried to rebook the flight, there was not a single open seat for the return over a two day period (Easter weekend). Agents at Changi were apologetic and tried to help, but with no seats available, there was little they could do. I have never before been unable to reach an airline representative, and this has certainly not been the case when I have tried multiple channels over an 8 hour time period.
That was it. SQ could

In [22]:
response = index.as_query_engine().query("Tell me about Singapore Airlines")
print(response)

Singapore Airlines, founded in 1982, is known for its high service standards and quality entertainment offerings. It provides various services such as customer assistance forms, access to Star Alliance Gold status, and amenities like inflight WiFi. The airline has faced criticism for its customer service in handling flight changes and cancellations, as well as for its change policies that include additional fees for simple modifications. Despite these issues, Singapore Airlines is recognized for its cabin service quality, although cost-saving measures like discontinuing amenity kits in certain classes have been noted.


This answer doesn't seem to come from the KG...

In [None]:
from llama_index.core.query_engine import KnowledgeGraphQueryEngine


nl2kg_query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    llm=Settings.llm,  # reuse the LLM configured in Settings above
)