# Loose Entity Extraction

## Setup

In [1]:
%pip install python-dotenv pypdf

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl.metadata (7.4 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: python-dotenv, pypdf
Successfully installed pypdf-4.2.0 python-dotenv-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv
import os
load_dotenv('exp.env', override=True)

# Neo4j
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

# OpenAI
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

## Experimentation

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("attention-is-all-you-need.pdf")
pages = loader.load_and_split()

In [3]:
pages[1]

Document(page_content='1 Introduction\nRecurrent neural networks, long short-term memory [ 13] and gated recurrent [ 7] neural networks\nin particular, have been firmly established as state of the art approaches in sequence modeling and\ntransduction problems such as language modeling and machine translation [ 35,2,5]. Numerous\nefforts have since continued to push the boundaries of recurrent language models and encoder-decoder\narchitectures [38, 24, 15].\nRecurrent models typically factor computation along the symbol positions of the input and output\nsequences. Aligning the positions to steps in computation time, they generate a sequence of hidden\nstates ht, as a function of the previous hidden state ht−1and the input for position t. This inherently\nsequential nature precludes parallelization within training examples, which becomes critical at longer\nsequence lengths, as memory constraints limit batching across examples. Recent work has achieved\nsignificant improvements in compu

In [4]:
from langchain_core.prompts import ChatPromptTemplate

extract_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert researcher that needs to extract relevant facts from the below paper excerpt "
            "for use in a source of truth knowledge base. "
            "The knowledge base will be used to help non-experts answer critical questions about the research materials. "
            "The facts should be extracted in the form of a knowledge triple: subject, predicate, object. "
            "Each subject, predicate, and object should only be 1 to a few words. Avoid sentences and conjunctions; "
            "instead you can create more facts. "
            "Avoid objects containing a preposition or adverb; instead create additional facts. "
            "Given that this is a source of truth, please ensure the facts come only from the provided text. "
            "Do not create fictitious data or impute missing values. "
            "Only include facts that are clear, non-ambiguous, and relevant to the research."
        ),
        ("human", "{text}"),
    ]
)
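
The prompt asks for each part of a triple to be "1 to a few words", but nothing in the notebook checks that the model complied. A minimal sketch of such a check follows; the `is_concise` helper and the `MAX_WORDS` limit of 4 are assumptions made for illustration, not part of the original pipeline.

```python
# Hypothetical limit: the prompt only says "1 to a few words".
MAX_WORDS = 4


def is_concise(subject: str, predicate: str, obj: str, max_words: int = MAX_WORDS) -> bool:
    """Return True when every part of the triple has between 1 and max_words words."""
    return all(1 <= len(part.split()) <= max_words for part in (subject, predicate, obj))


triples = [
    ("Transformer", "relying on", "attention mechanism"),
    ("Recent work", "achieved", "significant improvements in computational efficiency"),
]
for triple in triples:
    print(is_concise(*triple), triple)
```

A check like this could be used to flag long objects for review rather than to drop them silently.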

In [5]:
import datetime
from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field

class Fact(BaseModel):
    """A useful fact"""
    subject: str = Field(description="Subject: Entity being described")
    predicate: str = Field(description="Predicate: The property or action of the subject that is being described")
    object: str = Field(description="Object: The value of the property or action being described")

class Facts(BaseModel):
    """A series of useful facts"""
    facts: List[Fact]


In [6]:
from langchain_openai import ChatOpenAI

llm_extractor=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

In [7]:
chain_extractor = extract_prompt | llm_extractor.with_structured_output(Facts)

In [8]:
facts = chain_extractor.invoke(pages[1].page_content)

In [9]:
def stringify_facts(f: Facts) -> set:
    """Render each fact as '(subject) - [predicate] -> (object)'; the set drops duplicates."""
    res = set()
    for fact in f.facts:
        res.add(f'({fact.subject}) - [{fact.predicate}] -> ({fact.object})')
    return res

for s in stringify_facts(facts):
    print(s)

(Transformer) - [relying on] -> (attention mechanism)
(ByteNet) - [uses] -> (convolutional neural networks)
(Recurrent neural networks) - [used in] -> (sequence modeling)
(Recent work) - [achieved] -> (significant improvements in computational efficiency)
(Transformer) - [does not use] -> (sequence-aligned RNNs)
(Recurrent models) - [preclude] -> (parallelization within training examples)
(Transformer) - [first transduction model relying on] -> (self-attention)
(Attention mechanisms) - [integral part of] -> (sequence modeling)
(Transformer) - [model architecture] -> (eschewing recurrence)
(Transformer) - [can reach] -> (new state of the art in translation quality)
(Transformer) - [allows for] -> (more parallelization)
(Recurrent models) - [generate] -> (sequence of hidden states)
(Recent work) - [improved] -> (model performance)
(Recurrent neural networks) - [used in] -> (transduction problems)
(Gated recurrent neural networks) - [established as] -> (state of the art approaches)
(Recur
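
Because `stringify_facts` returns a set, repeated triples collapse to a single string, whereas `len(facts.facts)` counts every repetition. The sketch below illustrates the difference with stand-in dataclasses (`DemoFact`, `DemoFacts`, and `stringify_demo` are hypothetical names used so the example runs without LangChain).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class DemoFact:
    """Stand-in for the notebook's Fact model."""
    subject: str
    predicate: str
    object: str


@dataclass
class DemoFacts:
    """Stand-in for the notebook's Facts model."""
    facts: List[DemoFact]


def stringify_demo(f: DemoFacts) -> Set[str]:
    # Same string format as stringify_facts in the notebook
    return {f'({x.subject}) - [{x.predicate}] -> ({x.object})' for x in f.facts}


repeated = DemoFacts(
    facts=[DemoFact("Transformer", "achieves", "better BLEU scores")] * 3
    + [DemoFact("Transformer", "allows for", "more parallelization")]
)
raw_count = len(repeated.facts)
unique_count = len(stringify_demo(repeated))
print(f"raw: {raw_count}, unique: {unique_count}")
```

Comparing the two counts is one way to notice when the model has repeated itself.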

In [10]:
relevance_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert researcher that needs to validate the relevance and clarity of extracted facts from a research paper "
            "for use in a source of truth knowledge base. "
            "The knowledge base will be used to help non-experts answer critical questions about the research materials. "
            "To be relevant, a fact should make sense, be non-ambiguous, and be relevant to the research on its own. "
            "For example, the fact (Recent work) - [improving] -> (model performance) is not relevant since the 'work' and 'model' are ambiguous; "
            "however, the fact (Transformer) - [rely on] -> (attention mechanism) is relevant since "
            "'Transformer' and 'attention mechanism' are colloquially understood to be specific model types and components in the research. "
            "In your response only return relevant facts according to these criteria. "
            "Do not create fictitious data or impute missing values. "
            "Do not alter any of the facts. "
            "Only include facts from the ones provided."
        ),
        ("human", "{text}"),
    ]
)

In [11]:
chain_filterer = (relevance_prompt | llm_extractor.with_structured_output(Facts)).with_types(input_type=Facts, output_type=Facts)

In [12]:
filtered_facts = chain_filterer.invoke(facts)

In [13]:
print(f'Extracted Facts: {len(facts.facts)}')
print(f'Relevant Facts: {len(filtered_facts.facts)}')

Extracted Facts: 29
Relevant Facts: 29


In [14]:
facts_set = stringify_facts(facts)
filtered_facts_set = stringify_facts(filtered_facts)
print('Dropped facts:')
for fact in facts_set.difference(filtered_facts_set):
    print(fact)

Dropped facts:
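
The difference above only looks in one direction: facts that were extracted but did not survive filtering. Since the relevance prompt also tells the model not to alter or invent facts, the reverse difference can serve as a sanity check. The sketch below uses made-up fact strings and hypothetical variable names to show both directions.

```python
# Illustrative strings only; in the notebook these sets come from stringify_facts.
original = {
    "(Transformer) - [relying on] -> (attention mechanism)",
    "(Recent work) - [improved] -> (model performance)",
    "(ByteNet) - [uses] -> (convolutional neural networks)",
}
filtered = {
    "(Transformer) - [relying on] -> (attention mechanism)",
    "(ByteNet) - [use] -> (convolutional neural networks)",  # reworded by the filter
}

dropped = original - filtered      # extracted, but missing after filtering
unexpected = filtered - original   # present after filtering, but never extracted

print("Dropped:", sorted(dropped))
print("Altered or invented:", sorted(unexpected))
```

Note that a reworded fact shows up in both sets, once under its original wording and once under the new one.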


## Create Classes and Functions for Pipeline

In [15]:
class Chunk(BaseModel):
    """A chunk of a source from which facts were extracted"""
    id:str
    sourceId:str
    seqId:int
    text:str


class Source(BaseModel):
    """A source of facts"""
    id:str
    name:str
    type:str
    url:str

class CitedFact(BaseModel):
    """A useful fact with a source chunk id"""
    subject: str
    predicate: str
    object: str
    chunkId: str
    def __init__(self, fact: Fact, chunk: Chunk):
        # Lowercase each part so that case variants resolve to the same Entity id
        super().__init__(subject=fact.subject.lower(),
                         predicate=fact.predicate.lower(),
                         object=fact.object.lower(),
                         chunkId=chunk.id)
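
`CitedFact` lowercases each part of the triple, and the pipeline later merges `Entity` nodes on that string, so "Transformer" and "transformer" end up as one node. Variants that differ only in whitespace would still produce separate nodes. A slightly stricter normalizer might look like the following; `normalize_entity` is a hypothetical helper, not something the notebook defines.

```python
import re


def normalize_entity(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


for raw in ["Transformer", "  Attention   Mechanism\n", "self-attention"]:
    print(repr(raw), "->", repr(normalize_entity(raw)))
```

This does nothing for genuine synonyms or plural forms, which would need a separate resolution step.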


In [29]:
from uuid import uuid4
from typing import Tuple
from langchain_core.documents import Document

def insert_sources(source:Source, chunks: List[Chunk]):
    #source node
    graph.query('''
    MERGE(s:Source {id:$id})
    SET s.name = $name, s.type=$type, s.url=$url
    ''', params=source.dict())

    #chunk nodes and rels to source
    graph.query('''
    UNWIND $chunks AS chunk
    MATCH(s:Source {id:chunk.sourceId})
    MERGE(c:Chunk {id:chunk.id})
    SET c.seqId=chunk.seqId, c.text=chunk.text
    MERGE (c)-[:PART_OF]->(s)
    ''', params={'chunks': [chunk.dict() for chunk in chunks]})

merge_entity_query = """
    UNWIND $entities AS entity
    MATCH(c:Chunk {id:entity.chunkId})
    MERGE(e:Entity {id:entity.id})
    MERGE (c)-[:HAS_ENTITY]->(e)
"""

def insert_facts(facts: List[CitedFact]):
    #insert subjects
    subjects = [{'id':fact.subject, 'chunkId':fact.chunkId} for fact in facts]
    graph.query(merge_entity_query, params={'entities': subjects})

    #insert objects
    objects = [{'id':fact.object, 'chunkId':fact.chunkId} for fact in facts]
    graph.query(merge_entity_query, params={'entities': objects})

    #insert predicates
    graph.query("""
    UNWIND $facts AS fact
    MATCH(s:Entity {id:fact.subject})
    MATCH(o:Entity {id:fact.object})
    MERGE (s)-[r:RELATES_TO {id:fact.predicate}]->(o)
    ON CREATE SET r.chunkId = fact.chunkId
    """, params={'facts': [fact.dict() for fact in facts]})


def create_sources(docs: List[Document], name: str, doc_type: str, url: str) -> Tuple[Source, List[Chunk]]:
    source = Source(id=str(uuid4()), name=name, type=doc_type, url=url)
    chunks = [
        Chunk(id=str(uuid4()), sourceId=source.id, seqId=i, text=doc.page_content)
        for i, doc in enumerate(docs)
    ]
    return source, chunks


def extract_facts(chunk: Chunk) -> List[CitedFact]:
    unfiltered_facts = chain_extractor.invoke(chunk.text)
    facts = chain_filterer.invoke(unfiltered_facts)

    print(f'\tExtracted Facts: {len(unfiltered_facts.facts)}')
    print(f'\tRelevant Facts: {len(facts.facts)}')
    facts_set = stringify_facts(facts)
    unfiltered_facts_set = stringify_facts(unfiltered_facts)
    print('\tDropped facts:')
    for dropped_fact in unfiltered_facts_set.difference(facts_set):
        print(f'\t\t * {dropped_fact}')
    return [CitedFact(fact, chunk) for fact in facts.facts]
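
`insert_facts` passes the full list of facts to each `UNWIND` query in one call. For larger sources it may be preferable to send the rows in fixed-size batches. A minimal sketch of a batching helper follows; the `batched` name and the batch size are assumptions, and the commented usage shows how it might wrap the existing queries.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Possible use inside insert_facts (batch size of 500 is an arbitrary choice):
# for batch in batched(subjects, 500):
#     graph.query(merge_entity_query, params={'entities': batch})

print(list(batched(list(range(5)), 2)))
```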

## Run Pipeline

In [30]:
from langchain_community.graphs.neo4j_graph import Neo4jGraph

graph = Neo4jGraph(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)
graph.query('CREATE CONSTRAINT entityId IF NOT EXISTS FOR (n:Entity) REQUIRE (n.id) IS UNIQUE;')
graph.query('CREATE CONSTRAINT chunkId IF NOT EXISTS FOR (n:Chunk) REQUIRE (n.id) IS UNIQUE;')
graph.query('CREATE CONSTRAINT sourceId IF NOT EXISTS FOR (n:Source) REQUIRE (n.id) IS UNIQUE;')


[]

In [33]:
%%time

from tqdm import tqdm

print("formatting and ingesting sources")
source, chunks = create_sources(pages, "Paper: Attention Is All You Need", "pdf", "https://arxiv.org/pdf/1706.03762")
insert_sources(source, chunks)

cited_facts = []
print("extracting facts")
for chunk in tqdm(chunks):
    try:
        cited_facts.extend(extract_facts(chunk))
    except Exception as e:
        # Report the failure and keep going; this chunk's facts are not ingested
        print(e)
print("ingesting facts")
insert_facts(cited_facts)
print("pipeline completed")

formatting and ingesting sources
extracting facts


  6%|▋         | 1/16 [00:13<03:29, 13.93s/it]

	Extracted Facts: 12
	Relevant Facts: 12
	Dropped facts:


 12%|█▎        | 2/16 [00:39<04:53, 20.99s/it]

	Extracted Facts: 33
	Relevant Facts: 33
	Dropped facts:


 19%|█▉        | 3/16 [00:44<02:55, 13.51s/it]

	Extracted Facts: 5
	Relevant Facts: 5
	Dropped facts:


 25%|██▌       | 4/16 [00:56<02:34, 12.86s/it]

	Extracted Facts: 16
	Relevant Facts: 16
	Dropped facts:


 31%|███▏      | 5/16 [01:10<02:28, 13.49s/it]

	Extracted Facts: 16
	Relevant Facts: 16
	Dropped facts:


 38%|███▊      | 6/16 [01:33<02:44, 16.42s/it]

	Extracted Facts: 24
	Relevant Facts: 24
	Dropped facts:


 44%|████▍     | 7/16 [01:43<02:10, 14.46s/it]

	Extracted Facts: 12
	Relevant Facts: 12
	Dropped facts:


 50%|█████     | 8/16 [02:14<02:37, 19.72s/it]

	Extracted Facts: 38
	Relevant Facts: 28
	Dropped facts:
		 * (self-attention) - [increases] -> (maximum path length)
		 * (investigators) - [plan to] -> (investigate approach further)
		 * (self-attention) - [could be restricted to] -> (neighborhood of size r)
		 * (convolutional layer) - [requires] -> (stack of O(n/k) layers)
		 * (attention distributions) - [are inspected from] -> (models)
		 * (representation) - [is smaller than] -> (dimensionality)
		 * (convolutional layer) - [requires] -> (stack of O(logk(n)) layers)
		 * (self-attention) - [could yield] -> (more interpretable models)
		 * (convolutional layer) - [has] -> (kernel width k)
		 * (section) - [describes] -> (training regime)


 56%|█████▋    | 9/16 [03:29<04:19, 37.10s/it]

Function Facts arguments:

{"facts":[{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Transformer","predicate":"achieves","object":"better BLEU scores"},{"subject":"Tra

 62%|██████▎   | 10/16 [03:42<02:57, 29.52s/it]

	Extracted Facts: 12
	Relevant Facts: 12
	Dropped facts:


 69%|██████▉   | 11/16 [03:50<01:54, 22.94s/it]

	Extracted Facts: 9
	Relevant Facts: 9
	Dropped facts:


 75%|███████▌  | 12/16 [03:59<01:15, 18.82s/it]

	Extracted Facts: 10
	Relevant Facts: 10
	Dropped facts:


 81%|████████▏ | 13/16 [04:19<00:57, 19.10s/it]

	Extracted Facts: 16
	Relevant Facts: 16
	Dropped facts:


 88%|████████▊ | 14/16 [04:24<00:29, 14.74s/it]

	Extracted Facts: 7
	Relevant Facts: 5
	Dropped facts:
		 * (American governments) - [making] -> (registration or voting process more difficult)
		 * (American governments) - [passed] -> (laws)


 94%|█████████▍| 15/16 [04:30<00:12, 12.06s/it]

	Extracted Facts: 7
	Relevant Facts: 7
	Dropped facts:


100%|██████████| 16/16 [04:35<00:00, 17.20s/it]

	Extracted Facts: 7
	Relevant Facts: 7
	Dropped facts:
ingesting facts





pipeline completed
CPU times: user 298 ms, sys: 35.2 ms, total: 333 ms
Wall time: 4min 36s
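
In the run above, one chunk raised an exception (the output shows the model repeating the same fact many times), and the loop only printed the error, so that chunk contributed no facts to the graph. A small sketch of a loop that records which items failed and retries them is shown below. `run_with_retries` and `fake_extract` are hypothetical names; in the notebook the callable would be `extract_facts`. With `temperature=0`, a plain retry may well reproduce the same failure, so the main benefit here is the record of failed chunks.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def run_with_retries(
    items: Sequence[str],
    fn: Callable[[str], List[str]],
    max_attempts: int = 2,
) -> Tuple[List[str], Dict[int, Exception]]:
    """Apply fn to each item, retrying on error; return results and failures by index."""
    results: List[str] = []
    failures: Dict[int, Exception] = {}
    for index, item in enumerate(items):
        for _ in range(max_attempts):
            try:
                results.extend(fn(item))
                failures.pop(index, None)  # a later attempt succeeded
                break
            except Exception as exc:
                failures[index] = exc
    return results, failures


def fake_extract(text: str) -> List[str]:
    # Stand-in for extract_facts: fails deterministically on one input
    if text == "bad":
        raise ValueError("could not parse model output")
    return [text.upper()]


results, failures = run_with_retries(["a", "bad", "b"], fake_extract)
print(results)
print({index: str(exc) for index, exc in failures.items()})
```

The indices in `failures` could then be used to re-run those chunks with a different prompt or a smaller chunk size.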
