# BYO Knowledge Graph

This notebook shows how to use open-source APIs to create knowledge graph and key phrase metadata for [LangChain](https://github.com/hwchase17/langchain) [Document](https://github.com/hwchase17/langchain/blob/1ff7c958b0a84b08c84eebba958b5b3fb0e6e409/langchain/schema.py#L269) objects.

More details of the techniques can be found below:

- [**EntityExtractor**](../../../../slangchain/nlp/ner/entity_extractor.py): Uses [spaCy](https://spacy.io/) to extract [named entities](https://machinelearningknowledge.ai/named-entity-recognition-ner-in-spacy-library/) in text.

- [**KnowledgeGraph**](../../../../slangchain/nlp/ner/knowledge_graph.py): Inspired by [knowledge graph generation](https://medium.com/nlplanet/building-a-knowledge-base-from-texts-a-full-practical-example-8dbbffb912fa) using HuggingFace's [Babelscape REBEL model](https://huggingface.co/Babelscape/rebel-large).

- [**KeyPhraseExtractor**](../../../../slangchain/nlp/ner/phrase_extractor.py): Uses HuggingFace [ml6team's key phrase extractor model](https://huggingface.co/ml6team/keyphrase-extraction-distilbert-inspec) to extract important key phrases from the text.

If you haven't already done so, follow the instructions for setting up the environment [here](../../../../README.md).

## Setup Parameters

In [1]:
import os
import ssl

# Disables HTTPS certificate verification for this process; only do this in local experiments.
ssl._create_default_https_context = ssl._create_unverified_context

chunk_size = 256
chunk_overlap = 30
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # never commit real keys
os.environ["PINECONE_API_KEY"] = "<your-pinecone-api-key>"
os.environ["PINECONE_ENV"] = "asia-southeast1-gcp"

## Sample Data

Let's first load some sample data

In [2]:
from langchain.document_loaders.wikipedia import WikipediaLoader

loader = WikipediaLoader(query="LeBron James")
documents = loader.load()

Next, split the documents into chunks of `chunk_size` tokens, with `chunk_overlap` tokens shared between neighbouring chunks:

In [3]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
  chunk_size = chunk_size,
  chunk_overlap = chunk_overlap
)
split_documents = text_splitter.split_documents(documents)
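
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over a plain list of tokens. It is an illustration only, not LangChain's implementation: `sliding_window_chunks` is a hypothetical helper, and integers stand in for real tokenizer output.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence[int], chunk_size: int, chunk_overlap: int
) -> List[List[int]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks


# Hypothetical input: 600 integer "tokens" with this notebook's settings.
demo_chunks = sliding_window_chunks(list(range(600)), chunk_size=256, chunk_overlap=30)
print([len(chunk) for chunk in demo_chunks])
```

With these settings each window starts 226 tokens after the previous one, so the last 30 tokens of one chunk reappear at the start of the next.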

## Content Tagging and Knowledge Graph

Let's now instantiate the classes required to create the tags and knowledge graphs. Downloading the models from Hugging Face may take a while the first time, but you only have to do it once.


In [4]:
from slangchain.nlp.ner.entity_extractor import EntityExtractor
from slangchain.nlp.ner.phrase_extractor import KeyPhraseExtractor
from slangchain.nlp.ner.knowledge_graph import KnowledgeGraph

kg_kwargs = {"max_length": chunk_size + 100}

entity_extractor = EntityExtractor()
knowledge_graph = KnowledgeGraph(**kg_kwargs)
key_phrase_extractor = KeyPhraseExtractor()

NOTE: Redirects are currently not supported in Windows or MacOs.
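
For orientation, here is a minimal sketch of how text generated by REBEL can be turned into subject/relation/object dictionaries like those used later in this notebook. It assumes the decoded output follows the linearized format documented on the Babelscape/rebel-large model card (`<triplet> subject <subj> object <obj> relation`). `parse_rebel_output` is a hypothetical helper and may differ from what `KnowledgeGraph` does internally; the sample string is made up for illustration.

```python
from typing import Dict, List


def parse_rebel_output(decoded: str) -> List[Dict[str, str]]:
    """Parse REBEL's linearized text into subject/relation/object dictionaries."""
    # Drop sequence-level special tokens before splitting on whitespace.
    for special in ("<s>", "</s>", "<pad>"):
        decoded = decoded.replace(special, " ")

    triples: List[Dict[str, str]] = []
    subject: List[str] = []
    obj: List[str] = []
    relation: List[str] = []
    field = None  # the list that currently receives plain tokens

    def emit() -> None:
        # Only complete triples are kept.
        if subject and obj and relation:
            triples.append({
                "subject": " ".join(subject),
                "relation": " ".join(relation),
                "object": " ".join(obj),
            })

    for token in decoded.split():
        if token == "<triplet>":  # a new subject starts
            emit()
            subject, obj, relation = [], [], []
            field = subject
        elif token == "<subj>":  # subject ended, an object starts
            emit()
            obj, relation = [], []
            field = obj
        elif token == "<obj>":  # object ended, the relation starts
            relation = []
            field = relation
        elif field is not None:
            field.append(token)
    emit()
    return triples


# Made-up sample in the assumed format.
sample = (
    "<s><triplet> LeBron James <subj> Los Angeles Lakers <obj> member of sports team "
    "<subj> basketball <obj> sport</s>"
)
for triple in parse_rebel_output(sample):
    print(triple)
```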


Create the tags and knowledge graph per Document chunk.

The document tags combine the outputs of [**EntityExtractor**](../../../../slangchain/nlp/ner/entity_extractor.py) (persons, organisations, and locations) with the key phrases from [**KeyPhraseExtractor**](../../../../slangchain/nlp/ner/phrase_extractor.py).

Unlike [OpenSearch](https://opensearch.org/), Pinecone does not allow searches over a list of dictionary objects. As a workaround, we map the knowledge graph's subjects, relations, and objects to three separate lists of strings. We're open to suggestions on how to structure the entity relationships better.

In [11]:
for document in split_documents:
  # Run entity extraction on the chunk; results are exposed as attributes on the extractor.
  entity_extractor.inference(document.page_content)
  persons = entity_extractor.persons
  organisations = entity_extractor.organisations
  locations = entity_extractor.locations

  key_phrases = key_phrase_extractor.inference(document.page_content)
  entity_relations = knowledge_graph.inference(document.page_content)

  # Merge entities and key phrases into one de-duplicated tag list.
  tags = list(set(persons + organisations + locations + key_phrases))
  # Store each triple component as its own de-duplicated list of strings.
  document.metadata.update({
    "tags": tags,
    "subjects": list({i["subject"] for i in entity_relations}),
    "relations": list({i["relation"] for i in entity_relations}),
    "objects": list({i["object"] for i in entity_relations}),
  })
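
Flattening the triples into three independent lists loses the link between a given subject, relation, and object. As one possible response to the request for suggestions, the sketch below keeps the three lists and additionally encodes each triple as a single delimited string, which still fits in a list of strings. This is an assumption-laden illustration: `flatten_triples`, `encode_triples`, and the `" | "` separator are hypothetical and not part of slangchain, and the sample triples are made up.

```python
from typing import Dict, List

Triple = Dict[str, str]


def flatten_triples(triples: List[Triple]) -> Dict[str, List[str]]:
    """Mirror the workaround above: one de-duplicated list per triple component."""
    return {
        "subjects": sorted({t["subject"] for t in triples}),
        "relations": sorted({t["relation"] for t in triples}),
        "objects": sorted({t["object"] for t in triples}),
    }


def encode_triples(triples: List[Triple], sep: str = " | ") -> List[str]:
    """Encode each triple as 'subject | relation | object' to keep the association."""
    return sorted({sep.join((t["subject"], t["relation"], t["object"])) for t in triples})


# Made-up triples in the shape used by the loop above.
example_triples = [
    {"subject": "LeBron James", "relation": "member of sports team", "object": "Los Angeles Lakers"},
    {"subject": "LeBron James", "relation": "sport", "object": "basketball"},
]
print(flatten_triples(example_triples))
print(encode_triples(example_triples))
```

The encoded strings only support exact matches on the whole triple, so they complement the three separate lists rather than replace them.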

As the printout below shows, the tags and knowledge graph metadata have been added to the documents.

In [13]:
print(f'Content: {split_documents[0].page_content}\n')
print(f'Tags: {split_documents[0].metadata["tags"]}\n')
print(f'Subjects: {split_documents[0].metadata["subjects"]}\n')
print(f'Relations: {split_documents[0].metadata["relations"]}\n')
print(f'Objects: {split_documents[0].metadata["objects"]}\n')

Content: LeBron Raymone James Sr. (; born December 30, 1984) is an American professional basketball player for the Los Angeles Lakers in the National Basketball Association (NBA).  Nicknamed "King James", he is considered to be one of the greatest basketball players in history and is often compared to Michael Jordan in debates over the greatest basketball player of all time.  James is the all-time leading scorer in NBA history and ranks fourth in career assists. He has won four NBA championships (two with the Miami Heat, one each with the Lakers and Cleveland Cavaliers), and has competed in 10 NBA Finals. He has four Most Valuable Player (MVP) Awards, four Finals MVP Awards, and two Olympic gold medals. He has been named an All-Star 19 times, selected to the All-NBA Team 19 times (including 13 First Team selections) and the All-Defensive Team six times, and was a runner-up for the NBA Defensive Player of the Year Award twice in his career.James grew up playing basketball for St. Vincen

In [14]:
split_documents[0].metadata

{'title': 'LeBron James',
 'summary': 'LeBron Raymone James Sr. (; born December 30, 1984) is an American professional basketball player for the Los Angeles Lakers in the National Basketball Association (NBA).  Nicknamed "King James", he is considered to be one of the greatest basketball players in history and is often compared to Michael Jordan in debates over the greatest basketball player of all time.  James is the all-time leading scorer in NBA history and ranks fourth in career assists. He has won four NBA championships (two with the Miami Heat, one each with the Lakers and Cleveland Cavaliers), and has competed in 10 NBA Finals. He has four Most Valuable Player (MVP) Awards, four Finals MVP Awards, and two Olympic gold medals. He has been named an All-Star 19 times, selected to the All-NBA Team 19 times (including 13 First Team selections) and the All-Defensive Team six times, and was a runner-up for the NBA Defensive Player of the Year Award twice in his career.James grew up pla

## Creating our self-querying retriever
Before we can build the retriever, we need a Pinecone index to store the document vectors. Using Pinecone requires an API key.
Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).


In [15]:
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone


document_content_description = "Content of Wikipedia page"
llm = OpenAI(temperature=0)
embedding = OpenAIEmbeddings()

pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),  # find at app.pinecone.io
    environment=os.getenv("PINECONE_ENV")  # next to api key in console
)

index_name = "kg-slangchain-demo"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
vectordb = Pinecone.from_documents(split_documents, embedding, index_name=index_name)
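
Each chunk's metadata is uploaded together with its vector. Pinecone metadata values are limited to strings, numbers, booleans, and lists of strings, which is why the knowledge graph was flattened earlier. The sketch below is a small pre-upload check under that assumption; `unsupported_metadata_keys` is a hypothetical helper, not part of Pinecone or LangChain, and the sample metadata is made up.

```python
from typing import Any, Dict, List


def unsupported_metadata_keys(metadata: Dict[str, Any]) -> List[str]:
    """Return the keys whose values are not a string, number, boolean, or list of strings."""
    bad_keys = []
    for key, value in metadata.items():
        if isinstance(value, (str, bool, int, float)):
            continue  # scalar types are accepted
        if isinstance(value, list) and all(isinstance(item, str) for item in value):
            continue  # lists are accepted only when every item is a string
        bad_keys.append(key)
    return bad_keys


# Made-up metadata: the list of dicts is the shape the workaround avoids.
example_metadata = {
    "title": "LeBron James",
    "tags": ["NBA", "Lakers"],
    "triples": [{"subject": "LeBron James", "relation": "sport", "object": "basketball"}],
}
print(unsupported_metadata_keys(example_metadata))
```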


Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.

In [16]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains import RetrievalQA

metadata_field_info=[
    AttributeInfo(
        name="title",
        description="The Wikipedia page title", 
        type="string", 
    ),
    AttributeInfo(
        name="summary",
        description="The Wikipedia page summary", 
        type="string", 
    ),
    AttributeInfo(
        name="source",
        description="The Wikipedia page source url", 
        type="string", 
    ),
    AttributeInfo(
        name="tags",
        description="List of delimited Key word phrases from the content",
        type="string or list[string]"
    ),
    AttributeInfo(
        name="subjects",
        description="List of Subject Knowledge graph entities from the content",
        type="list[string]"
    ),
    AttributeInfo(
        name="relations",
        description="List of Knowledge graph relations from the content",
        type="list[string]"
    ),
    AttributeInfo(
        name="objects",
        description="List of Object Knowledge graph entities from the content",
        type="list[string]"
    )
]

retriever = SelfQueryRetriever.from_llm(
    llm, vectordb, document_content_description, metadata_field_info, verbose=True
)
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True
)

## Testing it Out

### Query on Knowledge Graph Objects

An example query based on Knowledge Graph objects

In [93]:
qa_chain("Lebron James achievements where objects is regular season")

query='Lebron James achievements' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='objects', value='regular season') limit=None


{'query': 'Lebron James achievements where objects is regular season',
 'result': ' Lebron James has the most points (44,724), most consecutive games scoring (1,635), 2nd most minutes played (63,392), 2nd most field goals made (16,327), 2nd most field goals attempted (32,459), 3rd most free throws made (9,610), 3rd most free throws attempted (13,069), 4th most 3-point field goals attempted (7,515), 5th most assists (12,008), 5th most games played (1,636), and 6th most 3-pointers made (2,944) in his regular season career.',
 'source_documents': [Document(page_content='11,035)\nmost games played (266)\nmost consecutive games scoring (266) (James has scored in every game he has played)\n2nd most assists (1,919)\n2nd most defensive rebounds (1,990)\n2nd most 3-point field goals made (432)\n2nd most triple doubles (28)\n3rd most Finals appearances (10)\n6th most points per game (28.7)\n6th most rebounds (2,391)\n10th most blocks (252)\n\n\n=== Career – regular season and playoffs combined =

### Query on Knowledge Graph Subjects

An example query based on Knowledge Graph subjects

In [82]:
qa_chain("Lebron James playing age where subjects is Amateur Athletic Union")

query='Lebron James playing age' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='subjects', value='Amateur Athletic Union') limit=None


{'query': 'Lebron James playing age where subjects is Amateur Athletic Union',
 'result': ' 9',
 'source_documents': [Document(page_content=' his birth.:\u200a22\u200a His father, Anthony McClelland, has an extensive criminal record and was not involved in his life. When James was growing up, life was often a struggle for the family, as they moved from apartment to apartment in the seedier neighborhoods of Akron while Gloria struggled to find steady work. Realizing that her son would be better off in a more stable family environment, Gloria allowed him to move in with the family of Frank Walker, a local youth football coach who introduced James to basketball when he was nine years old.:\u200a23\u200aJames began playing organized basketball in the fifth grade. He later played Amateur Athletic Union (AAU) basketball for the Northeast Ohio Shooting Stars. The team enjoyed success on a local and national level, led by ', metadata={'objects': ['basketball'], 'relations': ['sport'], 'source'

### Query on Knowledge Graph Relations

An example query based on Knowledge Graph relations

In [84]:
qa_chain("Lebron James draft year where relations is point in time")

query='Lebron James draft year' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='relations', value='point in time') limit=None


{'query': 'Lebron James draft year where relations is point in time',
 'result': ' 2003',
 'source_documents': [Document(page_content='The Decision was a television special on ESPN in which National Basketball Association (NBA) player LeBron James announced that he would be signing with the Miami Heat instead of returning to his hometown team, the Cleveland Cavaliers. It was broadcast live on July 8, 2010. James was an unrestricted free agent after playing seven seasons in Cleveland, where he was a two-time NBA Most Valuable Player and a six-time All-Star. He grew up in nearby Akron, Ohio, where he received national attention as a high school basketball star.\n\n\n== Background ==\nJames was born and raised in Akron, Ohio, where he received national attention as a high school basketball star at St. Vincent–St. Mary High School. He was drafted out of high school by his hometown Cleveland Cavaliers with the first overall pick of the 2003 NBA draft. He played the first seven seasons of hi

### Query on Knowledge Graph Key Combinations

An example query based on Knowledge Graph key combinations (objects, subjects and relations)

In [91]:
qa_chain("Lebron James achievements where relations is point in time and subjects is Finals and objects is 2016")

query='Lebron James achievements' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='relations', value='point in time'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='subjects', value='Finals'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='objects', value='2016')]) limit=None


{'query': 'Lebron James achievements where relations is point in time and subjects is Finals and objects is 2016',
 'result': " In 2016, Lebron James led the Cavaliers to victory over the Golden State Warriors in the NBA Finals, delivering the team's first championship and ending the Cleveland sports curse.",
 'source_documents': [Document(page_content=" was heavily touted by the national media as a future NBA superstar. A prep-to-pro, he was selected by the Cleveland Cavaliers with the first overall pick of the 2003 NBA draft. Named the 2004 NBA Rookie of the Year, he soon established himself as one of the league's premier players, leading the Cavaliers to their first NBA Finals appearance in 2007 and winning the NBA MVP award in 2009 and 2010. After failing to win a championship with Cleveland, James left in 2010 as a free agent to join the Miami Heat; this was announced in a nationally televised special titled The Decision and is among the most controversial free agency moves in spo
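
The structured queries printed above are what the self-query retriever derives from the natural-language question. The sketch below shows one plausible way such a filter could map onto Pinecone's metadata filter syntax (`$eq`, `$and`). The `Comparison` and `Operation` dataclasses are simplified stand-ins written for this example, not LangChain's own classes, and `to_pinecone_filter` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class Comparison:
    """Stand-in for a single condition such as objects == '2016'."""
    comparator: str  # e.g. "eq"
    attribute: str
    value: Any


@dataclass
class Operation:
    """Stand-in for a logical combination of conditions."""
    operator: str  # e.g. "and"
    arguments: List[Union["Comparison", "Operation"]]


def to_pinecone_filter(node: Union[Comparison, Operation]) -> Dict[str, Any]:
    """Recursively convert the stand-in filter tree into a Pinecone-style filter dict."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    return {f"${node.operator}": [to_pinecone_filter(arg) for arg in node.arguments]}


# The combined filter from the query above, rebuilt with the stand-in classes.
combined = Operation("and", [
    Comparison("eq", "relations", "point in time"),
    Comparison("eq", "subjects", "Finals"),
    Comparison("eq", "objects", "2016"),
])
print(to_pinecone_filter(combined))
```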

### Query on Tags

An example query based on Tags

In [87]:
qa_chain("Lebron James losses where tags is finals")

query='Lebron James losses' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='tags', value='finals') limit=None


{'query': 'Lebron James losses where tags is finals',
 'result': ' Lebron James did not lose any of the NBA Finals he played in with the Miami Heat; the team won back-to-back championships in 2012 and 2013.',
 'source_documents': [Document(page_content=' This was the first NBA Finals matchup between the two teams, and the first time that Finals participants had both missed the playoffs in the previous season. James had previously played with Miami under Heat head coach Erik Spoelstra, winning back-to-back NBA titles in 2012 and 2013 in four consecutive Finals appearances from 2011 to 2014, while Heat president Pat Riley was head coach of the "Showtime"-era Lakers from 1981 to 1990, leading them to four NBA titles in seven Finals appearances. For the first time in six seasons, the Golden State Warriors were not in the Finals.The Finals were originally scheduled for June, but the season was suspended in mid-March due to the COVID-19 pandemic. The NBA and its players later approved a plan