In [15]:
%pip install pdfplumber python-dotenv openai neo4j

Collecting neo4j
  Downloading neo4j-5.27.0-py3-none-any.whl.metadata (5.9 kB)
Collecting pytz (from neo4j)
  Downloading pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading neo4j-5.27.0-py3-none-any.whl (301 kB)
Downloading pytz-2024.2-py2.py3-none-any.whl (508 kB)
Installing collected packages: pytz, neo4j
Successfully installed neo4j-5.27.0 pytz-2024.2

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python3.10 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pdfplumber

text = ""
# Open the PDF file
with pdfplumber.open('/Users/joshuawedekind/Documents/einstein.pdf') as pdf:
    for page in pdf.pages:
        # Fall back to "" in case a page (e.g. image-only) yields no text
        text += page.extract_text() or ""

print(len(text))
print(text[0:500])


44759
Einstein’s Patents and Inventions
Asis Kumar Chaudhuri
Variable Energy Cyclotron Centre
1‐AF Bidhan Nagar, Kolkata‐700 064
Abstract: Times magazine selected Albert Einstein, the German born Jewish Scientist as the person of the 20th
century. Undoubtedly, 20th century was the age of science and Einstein’s contributions in unravelling mysteries
of nature was unparalleled. However, few are aware that Einstein was also a great inventor. He and his
collaborators had patented a wide variety of inventi


In [4]:
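# Define the function to chunk text
def chunk_text(text, chunk_size, overlap, split_on_whitespace_only=True):
    """Split text into chunks of roughly chunk_size characters.

    Consecutive chunks overlap: each one starts about `overlap` characters
    before the previous boundary. With split_on_whitespace_only=True,
    boundaries snap to the nearest space so words are not cut in half.
    """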
    chunks = []
    index = 0

    while index < len(text):
        if split_on_whitespace_only:
            prev_whitespace = 0
            left_index = index - overlap
            while left_index >= 0:
                if text[left_index] == " ":
                    prev_whitespace = left_index
                    break
                left_index -= 1
            next_whitespace = text.find(" ", index + chunk_size)
            if next_whitespace == -1:
                next_whitespace = len(text)
            chunk = text[prev_whitespace:next_whitespace].strip()
            chunks.append(chunk)
            index = next_whitespace + 1
        else:
            start = max(0, index - overlap + 1)
            end = min(index + chunk_size + overlap, len(text))
            chunk = text[start:end].strip()
            chunks.append(chunk)
            index += chunk_size

    return chunks

# Call the function and get chunks back
chunks = chunk_text(text, 500, 40)

# Print the length of the chunks list
print(len(chunks))  # 89 chunks in total

89
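
A quick way to see the overlap at work is to compare the end of one chunk with the start of the next. The helper below is a minimal sketch; `shared_overlap` is a hypothetical name that is not part of the notebook, and the toy strings stand in for two consecutive chunks.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Toy strings standing in for two consecutive chunks
first = "the quick brown fox"
second = "brown fox jumps"
print(repr(shared_overlap(first, second)))
```

In the notebook, `shared_overlap(chunks[0], chunks[1])` should return the text that both chunks carry, provided stripping did not remove whitespace at the boundary.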


In [16]:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
open_ai_client = OpenAI(api_key=OPENAI_API_KEY)

In [14]:
# Define the function to embed chunks
def embed(texts):
    response = open_ai_client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",
    )
    # One embedding vector per input text, in the same order
    return [item.embedding for item in response.data]

# Call the function and get embeddings back
embeddings = embed(chunks)

# Print the length of the embeddings list
print(len(embeddings)) # 89, matching the number of chunks
# Print the length of the first embedding
print(len(embeddings[0]))  # 1536 dimensions

89
1536
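
Sending all 89 chunks in one request works here, but an embeddings endpoint limits how much a single request may carry, so a longer document may need batching (check the current API documentation for the exact limits). The sketch below is an assumption-laden illustration: `batched`, `embed_in_batches` and the batch size of 64 are hypothetical, and a stand-in embedder replaces the real `embed` so it runs without an API key.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed `texts` batch by batch and return one flat list of vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    """Stand-in embedder: maps each text to a one-element vector (its length)."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

With the notebook's own function this would be called as `embed_in_batches(chunks, embed)`.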


In [17]:
from neo4j import GraphDatabase

# Neo4j connection details
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
uri = "bolt://localhost:7687"
user = "neo4j"
password = NEO4J_PASSWORD
driver = GraphDatabase.driver(uri, auth=(user, password))

In [18]:
driver.execute_query(f"CALL db.index.vector.createNodeIndex('pdf', 'Chunk', 'embedding', {len(embeddings[0])}, 'cosine')")


EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x10601d1e0>, keys=[])
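
Newer Neo4j 5.x releases document a `CREATE VECTOR INDEX` command as the successor to the `db.index.vector.createNodeIndex` procedure used above; which one applies depends on the server version, so treat the following as a sketch to check against your installation. `vector_index_statement` is a hypothetical helper that only builds the Cypher text, and `IF NOT EXISTS` makes re-running the cell harmless.

```python
def vector_index_statement(name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement as a Cypher string."""
    # Index, label and property names cannot be passed as query parameters,
    # so refuse anything that is not a plain identifier.
    for identifier in (name, label, prop):
        if not identifier.isidentifier():
            raise ValueError(f"unsafe identifier: {identifier!r}")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {int(dimensions)},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


print(vector_index_statement("pdf", "Chunk", "embedding", 1536))
```

In the notebook this would run as `driver.execute_query(vector_index_statement("pdf", "Chunk", "embedding", len(embeddings[0])))`.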

In [19]:
cypher_query = '''
// range() includes both endpoints, so stop at size - 1 to avoid an empty extra node
WITH $chunks AS chunks, range(0, size($chunks) - 1) AS index
UNWIND index AS i
WITH i, chunks[i] AS chunk, $embeddings[i] AS embedding
MERGE (c:Chunk {index: i})
SET c.text = chunk, c.embedding = embedding
'''

driver.execute_query(cypher_query, chunks=chunks, embeddings=embeddings)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x117ba9750>, keys=[])
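
Cypher's `range()` includes both endpoints, which makes index arithmetic over two parallel lists easy to get wrong by one. One alternative, sketched here, is to pair each chunk with its embedding in Python and send a single list of rows. `build_rows` and `ROWS_QUERY` are hypothetical names, and the sample data is made up.

```python
def build_rows(chunks, embeddings):
    """Pair each chunk with its embedding and position as one parameter row."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {"index": i, "text": chunk, "embedding": embedding}
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ]


# Each row carries everything one Chunk node needs, so no indexing in Cypher
ROWS_QUERY = """
UNWIND $rows AS row
MERGE (c:Chunk {index: row.index})
SET c.text = row.text, c.embedding = row.embedding
"""

rows = build_rows(["first chunk", "second chunk"], [[0.1, 0.2], [0.3, 0.4]])
print(rows[1]["index"], rows[1]["text"])
```

The notebook call would then be `driver.execute_query(ROWS_QUERY, rows=build_rows(chunks, embeddings))`.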

In [20]:
records, _, _ = driver.execute_query(
    "MATCH (c:Chunk) WHERE c.index = 0 RETURN c.embedding, c.text"
)

print(records[0]["c.text"][0:30])
print(records[0]["c.embedding"][0:3])

Einstein’s Patents and Inventi
[0.023738375, -0.022444434, -0.014644509]


In [22]:
question = "At what time was Einstein really interested in experimental works?"
question_embedding = embed([question])[0]


In [28]:
query = '''
CALL db.index.vector.queryNodes('pdf', 3, $question_embedding)
YIELD node AS hits, score
RETURN hits.text AS text, score, hits.index AS index
'''

similar_records, _, _ = driver.execute_query(query, question_embedding=question_embedding)

In [29]:
for record in similar_records:
    print(record["text"])
    print(record["score"], record["index"])
    print("======")

CH‐Switzerland
Considering Einstein’s upbringing, his interest in inventions and patents was not unusual.
Being a manufacturer’s son, Einstein grew upon in an environment of machines and instruments.
When his father’s company obtained the contract to illuminate Munich city during beer festival, he
was actively engaged in execution of the contract. In his ETH days Einstein was genuinely interested
in experimental works. He wrote to his friend, “most of the time I worked in the physical laboratory,
fascinated by the direct contact with observation.” Einstein's
0.8111112117767334 42
Einstein
left his job at the Patent office and joined the University of Zurich on October 15, 1909. Thereafter, he
continued to rise in ladder. In 1911, he moved to Prague University as a full professor, a year later, he
was appointed as full professor at ETH, Zurich, his alma‐mater. In 1914, he was appointed Director of
the Kaiser Wilhelm Institute for Physics (1914–1932) and a professor at the Humboldt Unive
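
The `score` values come from the index's cosine similarity function. The sketch below shows the underlying calculation in plain Python. The mapping onto a 0 to 1 scale via `(1 + cos) / 2` is how I understand Neo4j to report cosine scores; treat it as an assumption to verify against the documentation for your version. Both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1] (assumed convention)."""
    return (1 + cosine_similarity(a, b)) / 2


# Orthogonal toy vectors: raw cosine 0, normalized score in the middle of the scale
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(normalized_score([1.0, 0.0], [0.0, 1.0]))
```

Under that assumption, `normalized_score(question_embedding, embeddings[42])` should land close to the score printed for chunk 42.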

In [31]:
system_message = "You're an Einstein expert, but can only use the provided documents to respond to the questions."

user_message = f"""
Use the following documents to answer the question that will follow:
{[doc["text"] for doc in similar_records]}

---

The question to answer using information only from the above documents:
{question}
"""
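
The f-string above interpolates a Python list, so the model sees the list's repr: brackets, quotes and escaped `\n` sequences. That works, but a cleaner prompt is easy to build. The following is a minimal sketch; `format_documents` and the `[Document N]` labels are illustrative choices, not part of the original notebook.

```python
def format_documents(texts):
    """Join retrieved passages into numbered, clearly separated blocks."""
    return "\n\n".join(
        f"[Document {number}]\n{text.strip()}"
        for number, text in enumerate(texts, start=1)
    )


print(format_documents(["First passage.\n", "Second passage."]))
```

Inside the prompt, `{format_documents([doc["text"] for doc in similar_records])}` would replace the bare list.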

In [32]:
print("Question:", question)

stream = open_ai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")


Question: At what time was Einstein really interested in experimental works?
During his days at ETH, Einstein was genuinely interested in experimental works.

In [33]:
driver.execute_query("CREATE FULLTEXT INDEX PdfChunkFulltext FOR (c:Chunk) ON EACH [c.text]")

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x106034700>, keys=[])
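
Once the full-text index exists, it can be searched with the `db.index.fulltext.queryNodes` procedure, which takes a Lucene query string. Characters such as `?`, `:` or `(` have special meaning in that syntax, so free-form questions are safer when escaped first. The sketch below makes that assumption explicit; `escape_lucene` and `FULLTEXT_QUERY` are hypothetical names, and the limit of 3 simply mirrors the vector search.

```python
import re

# Characters and operators with special meaning in Lucene query syntax
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(term):
    """Prefix every Lucene special character or operator with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


# Keyword search over Chunk.text through the index created above
FULLTEXT_QUERY = """
CALL db.index.fulltext.queryNodes('PdfChunkFulltext', $search_text)
YIELD node, score
RETURN node.text AS text, score, node.index AS index
LIMIT 3
"""

print(escape_lucene("Einstein's patents (1930)?"))
```

With the driver from earlier cells this would run as `driver.execute_query(FULLTEXT_QUERY, search_text=escape_lucene(question))`. Full-text scores are on a different scale from the vector scores, so the two are not directly comparable.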