<a href="https://colab.research.google.com/github/mdeniz20/NLP-0/blob/main/RAG_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installations

In [1]:
!pip install langchain langchain-community chromadb cohere
!pip install langchain_cohere

Collecting langchain
  Downloading langchain-0.2.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.2.10-py3-none-any.whl.metadata (2.7 kB)
Collecting chromadb
  Downloading chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting cohere
  Downloading cohere-5.6.2-py3-none-any.whl.metadata (3.3 kB)
Collecting langchain-core<0.3.0,>=0.2.23 (from langchain)
  Downloading langchain_core-0.2.23-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.93-py3-none-any.whl.metadata (13 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17

# Imports

In [8]:
import os
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_cohere import CohereEmbeddings
import requests
from google.colab import userdata

# Environment Variables

In [3]:
os.environ["COHERE_API_KEY"] = userdata.get("COHERE_API_KEY")
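The cell above only works inside Google Colab, because `userdata` comes from `google.colab`. As a minimal sketch, assuming you may also want to run the notebook elsewhere, a helper could prefer an already-set environment variable and fall back to Colab secrets. The name `resolve_api_key` is hypothetical and not part of any library.

```python
import os


def resolve_api_key(name: str = "COHERE_API_KEY") -> str:
    """Return the key from the environment, falling back to Colab secrets.

    Hypothetical helper for illustration; not part of LangChain or Cohere.
    """
    key = os.environ.get(name)
    if key:
        return key
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError as exc:
        raise RuntimeError(
            f"{name} is not set and Colab secrets are unavailable"
        ) from exc
    key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is empty in Colab secrets")
    return key
```

With this in place, the assignment could read `os.environ["COHERE_API_KEY"] = resolve_api_key()`.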

# Setting the Environment

In [24]:
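# Remove any vector store left over from a previous run.
# os.system returns the command's exit status (0 on success); -f avoids an error if ./db is missing.
os.system("rm -rf ./db")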

0

In [25]:
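# Download the plain-text edition of the Odyssey from Project Gutenberg
response = requests.get("https://www.gutenberg.org/cache/epub/1727/pg1727.txt")
response.raise_for_status()  # Fail early if the download did not succeed

directory_path = "./data"

os.makedirs(directory_path, exist_ok=True)
print(f"Directory '{directory_path}' created successfully.")

# Save the book so it can be loaded with TextLoader later
file_path = os.path.join(directory_path, "odyssey.txt")
with open(file_path, "w", encoding="utf-8") as file:
    file.write(response.text)
print("The book installed successfully!")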

if not os.path.exists("./db"):
    os.makedirs("./db")
    print("Directory './db' created successfully.")


Directory './data' created successfully.
The book installed successfully!
Directory './db' created successfully.
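The cell above downloads the book again on every run. A minimal sketch of one way to skip the download when the file is already on disk is shown below. The names `ensure_text_file` and `fetch_odyssey` are hypothetical, and the 30-second timeout is an assumed value.

```python
import os
from typing import Callable


def ensure_text_file(path: str, fetch: Callable[[], str]) -> bool:
    """Write fetch() to path unless the file already exists.

    Returns True if the file was written, False if it was already there.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(fetch())
    return True


def fetch_odyssey() -> str:
    """Download the book text; only called when the file is missing."""
    import requests  # imported lazily so ensure_text_file has no hard dependency

    url = "https://www.gutenberg.org/cache/epub/1727/pg1727.txt"
    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.text
```

Usage would then be `ensure_text_file("./data/odyssey.txt", fetch_odyssey)`, which touches the network only on the first run.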


# RAG Initialize Vector Store

In [33]:
current_dir = os.getcwd()
file_path = os.path.join(current_dir, "data", "odyssey.txt")
persist_directory = os.path.join(current_dir, "db", "chroma_db_new0")

# Create the embedding client; no text is embedded until documents or queries are passed in
print("\nInitializing embedding model")
embeddings = CohereEmbeddings(
    model="embed-english-v3.0",
)
print("Embedding model initialized")

# Build the vector store only if it has not been persisted yet
if not os.path.exists(persist_directory):
    print("Vector store not found, building it..")
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File {file_path} not found")

    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()

    # chunk_overlap is the number of characters shared between consecutive chunks;
    # 0 means the chunks do not overlap
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    print("The number of chunks:", len(docs))
    print("Here is a sample chunk:\n", docs[0].page_content)

    print("Creating vector store")
    # Embed every chunk and persist the vectors together with the chunk text
    db = Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)
    print("Vector store initialized")

else:
    print("Vector store exists")

# Load the persisted vector store
db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    # "k": 3 returns at most the 3 most relevant chunks
    # "score_threshold": 0.3 drops chunks whose relevance score is below 0.3
    search_kwargs={"k": 3, "score_threshold": 0.3},
)



Initializing embedding model
Embedding model initialized
Vector store exists
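To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `CharacterTextSplitter` works internally (that class splits on a separator and merges the pieces up to `chunk_size`); the function `chunk_text` is a hypothetical illustration of the sliding-window idea only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. Illustration only;
    this is not the LangChain implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(chunk_text("abcdefghij", 4))     # ['abcd', 'efgh', 'ij']
print(chunk_text("abcdefghij", 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as in the cell above, each character lands in exactly one chunk. A non-zero overlap repeats text across chunk boundaries, which can help when a relevant sentence straddles two chunks.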

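The retriever settings `k` and `score_threshold` can be pictured with a small, library-free sketch: score every stored vector against the query, drop anything below the threshold, and keep at most `k` of the rest. This assumes plain cosine similarity as the score. The relevance score LangChain computes for Chroma is derived from the store's distance metric and may not be identical, so treat `retrieve` below as a hypothetical illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, k=3, score_threshold=0.3):
    """Return up to k (index, score) pairs with score >= score_threshold,
    ordered from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Toy 2-D "embeddings": only documents 0 and 2 point roughly the same way as the query
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print([i for i, _ in retrieve([1.0, 0.0], docs)])  # [0, 2]
```

Because the threshold is applied before the top-`k` cut, a query can return fewer than `k` documents, or none at all if nothing scores at least 0.3.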

# Running Queries

In [35]:
query = "Who is Odysseus' wife?"
relevant_docs = retriever.invoke(query)

print("These are the most relevant documents to your query:")
for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:\n{doc.page_content}\n")
    # TextLoader records the originating file path under the "source" metadata key
    if doc.metadata:
        print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")

These are the most relevant documents to your query:
Document 1:
Now all the rest, as many as fled from sheer destruction, were at
    home, and had escaped both war and sea, but Odysseus only, craving
    for his wife and for his homeward path, the lady nymph Calypso
    held, that fair goddess, in her hollow caves, longing to have him
    for her lord. But when now the year had come in the courses of the
    seasons, wherein the gods had ordained that he should return home
    to Ithaca, not even there was he quit of labours, not even among
    his own; but all the gods had pity on him save Poseidon, who raged
    continually against godlike Odysseus, till he came to his own
    country. Howbeit Poseidon had now departed for the distant
    Ethiopians, the Ethiopians that are sundered in twain, the
    uttermost of men, abiding some where Hyperion sinks and some where
    he rises. There he looked to receive his hecatomb of bulls and
    rams, there he made merry sitting at the feast