<a href="https://colab.research.google.com/github/luisgdelafuente/gnai/blob/main/Embeds_Romeo%26Juliet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Testing another approach for a vector search engine.

Chat with Document(s) using OpenAI ChatGPT API and Text Embedding
How to chat with any documents, PDFs, and books using OpenAI ChatGPT API and Text Embedding

Source: https://blog.devgenius.io/chat-with-document-s-using-openai-chatgpt-api-and-text-embedding-6a0ce3dc8bc8


We will be using three tools in this tutorial:

* OpenAI GPT-3, specifically the new ChatGPT API (gpt-3.5-turbo). I chose it not because it is better than other models, but because it is cheaper ($0.002 / 1K tokens) and good enough for this use case.
* Chroma, the AI-native open-source embedding database (i.e., vector search engine). Chroma is easy to use in conjunction with LangChain; otherwise, it’s kind of unusable. If you want to deploy these types of applications in production, I recommend Elasticsearch, not because it is better than its competitors, but because it has wide adoption, has been around for years, and few organizations like to add a new technology to their stack.
* LangChain, a library that helps developers build applications on top of Large Language Models (LLMs) by letting them integrate these models with other sources of computation or knowledge.
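
To make the role of the vector search engine concrete, here is a minimal, dependency-free sketch of the idea: embed every chunk, embed the question, and return the chunks whose vectors are closest to the question's. The `toy_embed` function is a hypothetical stand-in (plain word counts), not how OpenAI embeddings or Chroma work internally; only the ranking-by-cosine-similarity idea carries over.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "Romeo climbs the orchard wall to see Juliet.",
    "The Prince banishes Romeo from Verona.",
    "Friar Lawrence gives Juliet a sleeping potion.",
]
top = search("Who banishes Romeo?", chunks, k=1)
print(top[0])
```

A real embedding model captures meaning rather than exact word overlap, which is why the tutorial relies on OpenAI embeddings instead of something like this toy.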

In [None]:
%%writefile requirements.txt
# Python libraries to install
openai
chromadb
langchain
tiktoken

Writing requirements.txt


In [None]:
%pip install -r requirements.txt

In [None]:
# Import libraries

import os
import platform

import openai
import chromadb
import langchain

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import TokenTextSplitter
from langchain.llms import OpenAI
from langchain.chains import ChatVectorDBChain
from langchain.document_loaders import GutenbergLoader

print('Python: ', platform.python_version())

Python:  3.10.12


In [None]:
# Mount Google Drive on Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# OpenAI API key: prompt for it at runtime instead of hardcoding it in the notebook
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

Configure Chroma

Chroma's backend uses two of my favorite technologies, DuckDB and Apache Parquet, but by default it keeps the database in memory. That is fine for this tutorial, but I want to give you the option of storing the database file on disk so you can reuse it without paying to embed the book every single time.

In [None]:
persist_directory="/content/drive/My Drive/Colab Notebooks/chroma/romeo"
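
One way to benefit from the persisted database is to rebuild it only when nothing has been saved yet. The sketch below shows that decision with placeholder callables so it runs offline; `has_persisted_index` and `load_or_build` are hypothetical helper names, not part of Chroma or LangChain.

```python
import os
import tempfile


def has_persisted_index(directory):
    """True if the directory exists and contains at least one entry."""
    return os.path.isdir(directory) and len(os.listdir(directory)) > 0


def load_or_build(directory, load_fn, build_fn):
    """Call load_fn if an index was already saved to directory, else build_fn."""
    if has_persisted_index(directory):
        return load_fn(directory)
    return build_fn(directory)


# Demo with placeholder callables; in the notebook, build_fn would wrap the
# embedding step and load_fn would reopen the saved collection.
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, lambda d: "loaded", lambda d: "built")
    open(os.path.join(tmp, "index.bin"), "w").close()  # simulate a saved index
    second = load_or_build(tmp, lambda d: "loaded", lambda d: "built")

print(first, second)
```

In the LangChain version this notebook uses, reopening a persisted collection is usually written as `Chroma(persist_directory=persist_directory, embedding_function=embeddings)`; treat that as an assumption to verify against the installed version.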

Data

We will be using the data from Project Gutenberg’s “Romeo and Juliet by William Shakespeare”, which consists of 55,985 tokens. This makes it a nicely sized dataset.

https://www.gutenberg.org/cache/epub/1513/pg1513-images.html

Convert Document to Embedding

Convert the document, i.e., the book, to vector embeddings and store them in a vector search engine, i.e., a vector database.

In [None]:
def get_gutenberg(url):
    loader = GutenbergLoader(url)
    data = loader.load()
    return data

In [None]:
romeoandjuliet_data = get_gutenberg('https://www.gutenberg.org/cache/epub/1513/pg1513.txt')

text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
romeoandjuliet_doc = text_splitter.split_documents(romeoandjuliet_data)

embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(romeoandjuliet_doc, embeddings, persist_directory=persist_directory)
vectordb.persist()

The first step is self-explanatory: ‘GutenbergLoader’ (from ‘langchain.document_loaders’) loads the book from Project Gutenberg.


The second step is more involved. To obtain an embedding, we send a text string to OpenAI’s embeddings API endpoint along with an embedding model ID, e.g., text-embedding-ada-002, and the response contains the embedding. However, the book consists of 55,985 tokens, far more than the model accepts in a single request (8,191 tokens for text-embedding-ada-002), so we use ‘TokenTextSplitter’ (from ‘langchain.text_splitter’) to split the book into manageable 1,000-token chunks, each of which is embedded separately. The original article illustrates this step with a sample embedding response from OpenAI, which is not reproduced here.

The third step is pretty straightforward: we store the embedding in Chroma, our vector search engine, and persist it on a file system.
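
The splitting in the second step can be sketched in a few lines. This is a simplified illustration of fixed-size token windows, not the TokenTextSplitter source; `split_tokens` is a hypothetical helper, and the token IDs are stand-ins since only the count of 55,985 matters here.

```python
def split_tokens(tokens, chunk_size=1000, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [tokens[start:start + chunk_size] for start in range(0, len(tokens), step)]


# Stand-in token IDs: a real tokenizer would produce these from the book text.
book_tokens = list(range(55_985))
chunks = split_tokens(book_tokens, chunk_size=1000, chunk_overlap=0)

print(len(chunks))      # 56 chunks to embed
print(len(chunks[-1]))  # 985 tokens in the final, shorter chunk
```

With no overlap, a sentence that straddles a chunk boundary is cut in two; a small nonzero `chunk_overlap` trades a few extra embedded tokens for less lost context.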

Configure LangChain QA

To configure LangChain QA with Chroma, use the OpenAI gpt-3.5-turbo model (model_name="gpt-3.5-turbo") and set return_source_documents=True so that the response also includes the intermediate result from the vector search engine, i.e., the chunks that Chroma retrieved.

In [None]:
romeoandjuliet_qa = ChatVectorDBChain.from_llm(OpenAI(temperature=0, model_name="gpt-3.5-turbo"), vectordb, return_source_documents=True)
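
Conceptually, a chain like this retrieves the chunks most similar to the question and hands them to the model as context. The sketch below shows that assembly step plus a rough cost estimate. The prompt wording is a hypothetical template, not LangChain's actual prompt, and the figure of four 1,000-token chunks per question is an assumption for illustration.

```python
def build_prompt(question, context_chunks):
    """Assemble a question-answering prompt from retrieved chunks (hypothetical template)."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def estimate_cost_usd(num_tokens, usd_per_1k_tokens=0.002):
    """Rough cost at the $0.002 / 1K tokens price quoted earlier for gpt-3.5-turbo."""
    return num_tokens / 1000 * usd_per_1k_tokens


retrieved = [
    "ROMEO. Yet banished? Hang up philosophy.",
    "FRIAR LAWRENCE. Arise; one knocks. Good Romeo, hide thyself.",
]
prompt = build_prompt("Who is banished?", retrieved)
print(prompt)

# Assumption: four 1,000-token chunks of context are sent with each question.
print(estimate_cost_usd(4 * 1000))
```

Because the model only sees the retrieved chunks, the quality of an answer depends heavily on whether the right passages were retrieved, which is why returning the source documents is useful for debugging.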



Questions & Answers with “Romeo and Juliet” Book

Question #1: Have Romeo and Juliet spent the night together?

In [None]:
query = "Have Romeo and Juliet spent the night together? Provide a verbose answer, referencing passages from the book."
chat_history = []  # no prior turns; the chain expects a list of (question, answer) tuples
result = romeoandjuliet_qa({"question": query, "chat_history": chat_history})
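
The questions in this notebook are asked independently, with an empty history each time. For follow-up questions, the history would need to carry earlier turns. Below is a minimal sketch of that bookkeeping, assuming the chain accepts chat_history as a list of (question, answer) tuples; `ask` is a hypothetical helper and `fake_chain` is a stand-in for romeoandjuliet_qa so the example runs without an API key.

```python
def ask(chain, question, chat_history):
    """Ask one question and record the (question, answer) pair in chat_history."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result


def fake_chain(inputs):
    """Hypothetical stand-in for romeoandjuliet_qa so the sketch runs offline."""
    turn = len(inputs["chat_history"]) + 1
    return {"answer": f"answer {turn}", "source_documents": []}


history = []
ask(fake_chain, "Who is Rosaline?", history)
ask(fake_chain, "How does Romeo describe her?", history)
print(history)
```

With the real chain, the second question could then refer to "her" and still be resolved against the earlier turn.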

In [None]:
result["source_documents"]

[Document(page_content='\n\n\n\r\n\n\nROMEO.\r\n\n\nYet banished? Hang up philosophy.\r\n\n\nUnless philosophy can make a Juliet,\r\n\n\nDisplant a town, reverse a Prince’s doom,\r\n\n\nIt helps not, it prevails not, talk no more.\r\n\n\n\r\n\n\nFRIAR LAWRENCE.\r\n\n\nO, then I see that mad men have no ears.\r\n\n\n\r\n\n\nROMEO.\r\n\n\nHow should they, when that wise men have no eyes?\r\n\n\n\r\n\n\nFRIAR LAWRENCE.\r\n\n\nLet me dispute with thee of thy estate.\r\n\n\n\r\n\n\nROMEO.\r\n\n\nThou canst not speak of that thou dost not feel.\r\n\n\nWert thou as young as I, Juliet thy love,\r\n\n\nAn hour but married, Tybalt murdered,\r\n\n\nDoting like me, and like me banished,\r\n\n\nThen mightst thou speak, then mightst thou tear thy hair,\r\n\n\nAnd fall upon the ground as I do now,\r\n\n\nTaking the measure of an unmade grave.\r\n\n\n\r\n\n\n [_Knocking within._]\r\n\n\n\r\n\n\nFRIAR LAWRENCE.\r\n\n\nArise; one knocks. Good Romeo, hide thyself.\r\n\n\n\r\n\n\nROMEO.\r\n\n\nNot I, unle
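
The raw page_content is hard to read because of the repeated carriage returns and newlines. A small helper can tidy a retrieved passage before printing it; `tidy_passage` is a hypothetical name, and the sample string is copied from the start of the output above.

```python
import re


def tidy_passage(page_content, max_lines=None):
    """Collapse runs of carriage returns and newlines into single line breaks."""
    lines = [line.strip() for line in re.split(r"[\r\n]+", page_content)]
    lines = [line for line in lines if line]  # drop empty lines
    if max_lines is not None:
        lines = lines[:max_lines]
    return "\n".join(lines)


sample = (
    "\n\n\n\r\n\n\nROMEO.\r\n\n\nYet banished? Hang up philosophy."
    "\r\n\n\nUnless philosophy can make a Juliet,\r\n\n\n"
)
print(tidy_passage(sample))
```

In the notebook this could be applied as `for doc in result["source_documents"]: print(tidy_passage(doc.page_content))`.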

In [None]:
result["answer"]

"It is unclear from the given context whether Romeo and Juliet have spent the night together. However, Romeo and Juliet have expressed their love for each other and have made plans to be together. In Act II, Scene II, Romeo climbs over the Capulet's garden wall to be with Juliet and they exchange vows of love. Later, in Act III, Scene V, Juliet is waiting for the Nurse to return with news from Romeo and expresses her desire to be with him. While there is no explicit mention of them spending the night together, their intense love and desire for each other suggests that they may have consummated their relationship."

In [None]:
query = "Who is Rosaline? Provide a verbose answer, referencing passages from the book."
result = romeoandjuliet_qa({"question": query, "chat_history": chat_history})

In [None]:
result["source_documents"]

[Document(page_content=';\r\n\n\nCounty Anselmo and his beauteous sisters;\r\n\n\nThe lady widow of Utruvio;\r\n\n\nSignior Placentio and his lovely nieces;\r\n\n\nMercutio and his brother Valentine;\r\n\n\nMine uncle Capulet, his wife, and daughters;\r\n\n\nMy fair niece Rosaline and Livia;\r\n\n\nSignior Valentio and his cousin Tybalt;\r\n\n\nLucio and the lively Helena. _\r\n\n\n\r\n\n\n\r\n\n\nA fair assembly. [_Gives back the paper_] Whither should they come?\r\n\n\n\r\n\n\nSERVANT.\r\n\n\nUp.\r\n\n\n\r\n\n\nROMEO.\r\n\n\nWhither to supper?\r\n\n\n\r\n\n\nSERVANT.\r\n\n\nTo our house.\r\n\n\n\r\n\n\nROMEO.\r\n\n\nWhose house?\r\n\n\n\r\n\n\nSERVANT.\r\n\n\nMy master’s.\r\n\n\n\r\n\n\nROMEO.\r\n\n\nIndeed I should have ask’d you that before.\r\n\n\n\r\n\n\nSERVANT.\r\n\n\nNow I’ll tell you without asking. My master is the great rich Capulet,\r\n\n\nand if you be not of the house of Montagues, I pray come and crush a\r\n\n\ncup of wine. Rest you merry.\r\n\n\n\r\n\n\n [_Exit._]\r\n\

In [None]:
result["answer"]

"Rosaline is a woman whom Romeo is infatuated with at the beginning of the play. She is mentioned by Benvolio when he suggests that Romeo attend the Capulet's feast to compare her beauty with other women. Romeo initially claims that Rosaline is the only woman he will ever love, but later falls in love with Juliet. When Mercutio tries to conjure Romeo to come out of hiding, he uses Rosaline's name in his invocation. Romeo's love for Rosaline is described as unrequited and superficial, as he is more in love with the idea of being in love than with Rosaline herself."

In [None]:
query = "What is Romeo and Juliet about? please summarize the core message in a clear paragraph."
result = romeoandjuliet_qa({"question": query, "chat_history": chat_history})
result["answer"]

'Romeo and Juliet is a tragic play by William Shakespeare about two young lovers from feuding families in Verona, Italy. Despite the obstacles and opposition from their families, Romeo and Juliet fall deeply in love and secretly marry. However, their happiness is short-lived as a series of misunderstandings and tragic events lead to their untimely deaths. The play explores themes of love, fate, and the destructive power of feuds and hatred.'