# The Zephryns knowledge database

Imagine an evil genius whose goal is to explore the galaxy and save endangered alien species. Not so evil after all, except that he enjoys watching his failing employees suffer and beg in vain for mercy. You have just been hired at the zegma-IV station, which documents the Zephryn species. You now have to learn everything about this species. Otherwise, your boss will not be eager to give you your daily oxygen.

All of the company's confidential knowledge is stored as Markdown files. We have built an AI to help you. It uses retrieval-augmented generation (RAG) to handle the ever-growing knowledge about the studied species and to keep that knowledge confidential.

Good luck!
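
To make the RAG idea concrete before wiring up the real libraries, here is a minimal, dependency-free sketch. It is only an illustration under loud assumptions: `bag_of_words`, `cosine`, `retrieve` and `build_prompt` are hypothetical helpers, and word counts stand in for a real embedding model. The notebook itself uses Ollama embeddings and a Chroma vector store instead.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Only the retrieved chunks are placed in the prompt sent to the LLM."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up sample chunks, paraphrasing the kind of content stored in ./documents
toy_chunks = [
    "Skydancers live in large, communal nests.",
    "Stormcallers are solitary creatures found high in the mountains.",
    "Whispersingers live in secluded forests.",
]

print(build_prompt("Where do Skydancers live?", toy_chunks, k=1))
```

The point of the design is that the model only ever sees the few chunks selected for the question, which is what lets the knowledge base grow without growing the prompt.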

## Load the documents

In [1]:
from dotenv import load_dotenv
load_dotenv()


from langchain_community.document_loaders import DirectoryLoader
# from langchain_community.document_loaders import UnstructuredMarkdownLoader

# # Load markdown files from the .documents directory
# loader = DirectoryLoader(
#     './documents',
#     glob="**/*.md",
#     loader_cls=UnstructuredMarkdownLoader,
#     loader_kwargs={"mode": "elements"}
#     )

from langchain_community.document_loaders import TextLoader
loader = DirectoryLoader(
    './documents',
    glob="**/*.md",
    loader_cls=TextLoader
    )

docs = loader.load()

## Create the chunks

In [2]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pprint import pprint

headers_to_split_on = [
    ("#", "Header"),
    ("##", "Header 1"),
    ("###", "Header 2"),
]

# Markdown splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on = headers_to_split_on,
    strip_headers = True
)

chunks = []
for doc in docs:
    chunks.extend(markdown_splitter.split_text(doc.page_content))

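# The first separators are regular expressions, so they are written as raw
# strings and need is_separator_regex=True (otherwise they are escaped and
# matched literally, and never found). The lookaheads split just before a
# list item ("- ...") without consuming the bullet marker.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    separators=[r'\n\n(?=\s*-\s)', r'\n(?=\s*-\s)', '\n\n', '\n', r'(?<=\. )', ' ', ''],
    is_separator_regex=True,
)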

splitted_chunks = splitter.split_documents(chunks)
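
The header metadata produced above (`Header`, `Header 1`, `Header 2`) is reused in later cells, so it may help to see the idea in isolation. The following is a simplified pure-Python illustration of header-based splitting, not how `MarkdownHeaderTextSplitter` is actually implemented. `split_by_headers` is a hypothetical helper and the sample text is made up.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Split markdown into sections, attaching the enclosing headers as metadata.

    headers: (prefix, metadata_key) pairs, e.g. ("#", "Header").
    """
    sections: list[dict] = []
    current_meta: dict[str, str] = {}
    buffer: list[str] = []

    def flush() -> None:
        # Store the text collected so far together with a copy of the metadata
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"metadata": dict(current_meta), "content": body})
        buffer.clear()

    for line in markdown.splitlines():
        for prefix, key in headers:
            if line.startswith(prefix + " "):
                flush()
                # A new header invalidates headers of the same or deeper level
                for other_prefix, other_key in headers:
                    if len(other_prefix) >= len(prefix):
                        current_meta.pop(other_key, None)
                current_meta[key] = line[len(prefix):].strip()
                break
        else:
            # Not a header line: keep it as section content
            buffer.append(line)
    flush()
    return sections


sample = "# Title\nIntro.\n## Section A\nText A.\n## Section B\nText B."
for section in split_by_headers(sample, [("#", "Header"), ("##", "Header 1"), ("###", "Header 2")]):
    print(section)
```

Each section keeps the titles of the headers it sits under, which is what later allows chunks to be regrouped by their top-level `Header`.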

## Embeddings

In [3]:
!ollama pull llama3.2

In [4]:
import shutil
from langchain_community.vectorstores import Chroma  # `langchain.vectorstores` is a deprecated import path
from langchain_community.vectorstores.utils import filter_complex_metadata


persist_directory = './db/chroma'

#from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # Default to text-embedding-ada-002

from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model="llama3.2",
)


# If the directory exists, first delete it
try:
    shutil.rmtree(persist_directory)
except FileNotFoundError:
    # Nothing to delete on the first run
    pass
except PermissionError:
    # If db variable exists, delete the collection
    if 'db' in globals() and isinstance(db, Chroma): # type: ignore
        db.delete_collection() # type: ignore
    else:
        raise




filtered_chunks = filter_complex_metadata(splitted_chunks)

for item in filtered_chunks:
    item.page_content = (
        item.metadata.get('Header', '') + '\n' + 
        item.metadata.get('Header 1', '') + '\n' + 
        item.metadata.get('Header 2', '') + '\n' + 
        item.page_content)
    
# Create vector store and save the db
db = Chroma.from_documents(
    filtered_chunks, 
    embeddings,
    persist_directory=persist_directory
)

In [4]:
results = db.similarity_search_with_relevance_scores(
    "List all the subspecies of the Zephryn.",
    k=10#, score_threshold=0.5
)
pprint(results)

# results = db.similarity_search_by_vector_with_relevance_scores(
#     embeddings.embed_query("List all the subspecies of the Zephryn."),
#     k=10#, score_threshold=0.5
# )
# pprint(results)


# results = db.max_marginal_relevance_search(
#     "List all the subspecies of Zephryns.",
#     k=5#, score_threshold=0.5
# )
# pprint(results)

[(Document(metadata={'Header': 'The Zephryn: A Celestial Animal', 'Header 1': 'Subspecies of the Zephryn'}, page_content='The Zephryn: A Celestial Animal\nSubspecies of the Zephryn\n\n- Skydancers: Known for their aerial acrobatics and mesmerizing flight patterns, Skydancers are highly social creatures that live in large, communal nests. They are often depicted in art and literature as symbols of freedom and grace.\n- Stormcallers: Masters of the wind and weather, Stormcallers can manipulate storms and summon lightning. They are solitary creatures, often found high in the mountains, where they commune with the elements.\n- Whispersingers: Gentle and peaceful, Whispersingers are known for their enchanting songs, which can soothe the wildest beasts and calm the most troubled souls. They often live in secluded forests, where they spend their days singing and tending to their gardens.\n- Windriders: Agile and swift, Windriders are skilled hunters, capable of capturing prey on the wing. The

## Chain for a QA

This chain is only used as a step in the development process, to evaluate our vectors.

In [100]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain.prompts import PromptTemplate

# retriever = db.as_retriever(
#     search_type="similarity",
#     search_kwargs={'k': 10}
# )

# retriever = db.as_retriever(
#     search_type="similarity_score_threshold",
#     search_kwargs={'k': 7, 'score_threshold': 0.4}
# )

retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 10, 'lambda_mult': 0.6}
)





# from langchain_openai import OpenAI
# llm = OpenAI()
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.2")



#prompt = hub.pull("rlm/rag-prompt")

prompt = PromptTemplate.from_template(
    """Use the following pieces of context to answer the question at the end. If you don't know the answer or if you don't have enough information in the context, just say that you don't know, don't try to make up an answer.
Be very concise and to the point. Don't write more than 2 sentences. Do not start your answer with things like "based on the context" or "I think".

Context:
---------
{context}
---------
Question: {question}
Helpful Answer:""" 
)


def format_docs(docs):
    return "\n\n".join(
        doc.page_content for doc in docs
    )

qa_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["docs"])))
    | prompt
    | llm
    | StrOutputParser()
)

# get the source documents
qa_chain_with_source = RunnableParallel(
    {"docs": retriever, "question": RunnablePassthrough()}
).assign(answer=qa_chain)

def invoke(query):
    # `query` rather than `input`, to avoid shadowing the built-in
    return qa_chain_with_source.invoke(query)




question = "List all the subspecies of the Zephryn."
# question = "What does the Skydancer eat?"
# question = "What are the physical differences between the Windrider and the Skydancer?"

answer = invoke(question)
print(answer['answer'])
print("--------------------------")
pprint(answer)


Skydancers, Stormcallers, Whispersingers, Windriders, and Dreamweavers.
--------------------------
{'answer': 'Skydancers, Stormcallers, Whispersingers, Windriders, and '
           'Dreamweavers.',
 'docs': [Document(metadata={'Header': 'The Zephryn: A Celestial Animal', 'Header 1': 'Subspecies of the Zephryn'}, page_content='The Zephryn: A Celestial Animal\nSubspecies of the Zephryn\n\n- Skydancers: Known for their aerial acrobatics and mesmerizing flight patterns, Skydancers are highly social creatures that live in large, communal nests. They are often depicted in art and literature as symbols of freedom and grace.\n- Stormcallers: Masters of the wind and weather, Stormcallers can manipulate storms and summon lightning. They are solitary creatures, often found high in the mountains, where they commune with the elements.\n- Whispersingers: Gentle and peaceful, Whispersingers are known for their enchanting songs, which can soothe the wildest beasts and calm the most troubled souls. Th
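
The retriever above uses `search_type="mmr"` with `lambda_mult=0.6`. As a rough sketch of what maximal marginal relevance does, assuming plain cosine similarity on small hand-written vectors, the selection loop can be written as follows. `cosine_sim` and `mmr` are hypothetical helpers written for this illustration and are not the vector store's own code.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int,
        lambda_mult: float = 0.6) -> list[int]:
    """Return the indices of k documents, balancing relevance and diversity.

    lambda_mult = 1.0 ranks by relevance only; lower values penalize
    documents that resemble the ones already selected.
    """
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        relevance = cosine_sim(query_vec, doc_vecs[i])
        redundancy = max(
            (cosine_sim(doc_vecs[i], doc_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # doc 1 is a near-duplicate of doc 0

print(mmr(query, docs_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, docs_vecs, k=2, lambda_mult=0.3))  # diversity favored
```

With relevance only, the near-duplicate is picked second. With a low `lambda_mult`, the more distinct document wins the second slot, which is useful when several chunks of the knowledge base repeat the same header text.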

## Chain for study buddy

### Generate the questions from the whole knowledge database

Concatenate some knowledge elements to generate the questions from. Here we simply rebuild the original documents by grouping chunks under their top-level header, but for larger documents we could group chunks by similarity instead.

In [32]:
from langchain_ollama import OllamaLLM
from collections import defaultdict
llm = OllamaLLM(model="llama3.2")

docs = db.get(include=['metadatas'])

print('items count in db: ', len(docs['ids']))

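# Re-build the documents: group the chunk ids by their top-level header.
# NOTE: the name `dict` shadows the built-in; it is kept because later cells use it.
dict = defaultdict(list)
for i, metadata in enumerate(docs['metadatas']):
    header = metadata.get('Header')
    chunk_id = docs['ids'][i]
    dict[header].append(chunk_id)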

for item in dict:
    print(item)
    print(len(dict[item]))

    

items count in db:  40
Dreamweavers
7
Skydancers
7
Stormcallers
6
Whispersingers
6
Windriders
7
The Zephryn: A Celestial Animal
7
defaultdict(<class 'list'>,
            {'Dreamweavers': ['bef38af0-1a69-45bf-8aaf-60eb5f737e1e',
                              '7eb1633c-9a90-48d9-93b4-f7a7748aa42f',
                              '9827130a-5fb8-451b-875a-99155153a163',
                              'b7cdef0a-f5f9-4293-af1e-97e4039ab35c',
                              '224b3c2c-4beb-48db-9c01-f6f3771de34c',
                              '467a4391-04d7-4112-9c7f-5f6e016779e9',
                              'f462d7df-f060-4323-9c3a-d35a03f42318'],
             'Skydancers': ['3f5d5756-6811-4b8f-a277-cdef9e5fa411',
                            'b321641c-927e-4234-817d-4a036a8f35af',
                            '75e61e53-3510-41b8-b80e-08d68e1e1287',
                            '978bf2d8-5552-43c0-83e9-588f02563053',
                            'b6f5ef4c-478d-4863-990f-5fc7bc6ea0af',
           

Build the chain that generates the questions for each knowledge document.

In [279]:
import json
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class ByIdsRetriever(BaseRetriever):
    _ids = []

    def set_ids(self, ids):
        self._ids = ids

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""

        if not self._ids:
            # `raise("...")` would fail with a TypeError: raise a real exception
            raise ValueError("No ids set")

        matching_documents = []
        for id in self._ids:
            item = db.get(id)

            if(item is None or len(item.get('documents', [])) == 0):
                continue

            document = Document(
                page_content=item['documents'][0],
                metadata=item['metadatas'][0],
                id=id
            )

            matching_documents.append(document)

        return matching_documents
    

retriever_by_ids = ByIdsRetriever()


from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain.prompts import PromptTemplate


prompt_questions = PromptTemplate.from_template(
    """You are a study buddy. You ask the user questions that will help him to learn the concepts within the given knowledge context. 
Use the following pieces of context to generate 10 questions to ask to the user. Do not use any other information than the context.

Be very concise and to the point. 
Each question expects a simple answer. 
Do not combine multiple questions in one. 
Do not start your answer with things like "based on the context" or "I think". 
Do not generate any answer to the questions.

Avoid question with a yes/no answer. The question should not include any clue to their answer. 
For example "Do Dreamweavers live in communal nests?" or "Can Dreamweavers be found in places of great spiritual significance?" are bad questions because it gives the answer in the question itself. Instead, you should ask "Where do Dreamweavers live?" and "What subspecies of Zephryn can be found in places of great spiritual significance?".

Consider each question as a separate question.

Each question is expected to be one separate line.

Context:
---------
{context}
---------
Questions: """
)

def format_docs(docs):
    return "\n\n".join(
        doc.page_content for doc in docs
    )

chain_questions = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["docs"])))
    | prompt_questions
    | llm
    | StrOutputParser()
)

# get the source documents
chain_questions_with_source = RunnableParallel(
    {"docs": retriever_by_ids, "question": RunnablePassthrough()}
).assign(answer=chain_questions)


questions_all = []
for key, ids in dict.items():
    retriever_by_ids.set_ids(ids)
    llm_output = chain_questions_with_source.invoke('')
    questions_str = llm_output['answer']
    # split the output into individual questions by trimming and ignoring empty strings
    questions = [q.strip() for q in questions_str.split('\n') if q.strip()]
    questions_all.extend(questions)


# db_questions = Chroma.from_documents(
#     [Document(page_content=question) for question in questions_all],
#     embeddings,
#     collection_name="questions",
#     persist_directory=persist_directory
# )



# Store the questions in a JSON file, because... why not?
import os
os.makedirs('questions', exist_ok=True)  # make sure the target directory exists
with open('questions/questions.json', 'w') as f:
    json.dump(questions_all, f, indent=4)


# TODO: ask the system its own questions, for fun
# TODO: ask the user the questions, then evaluate their answer
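
Splitting the LLM output on newlines keeps whatever the model wrote on each line, which can include numbering or an introductory sentence. A possible cleanup step is sketched below. `clean_questions` is a hypothetical helper, the sample output is made up, and the rule "a question ends with `?`" is an assumption that may not hold for every model.

```python
import re


def clean_questions(raw_output: str) -> list[str]:
    """Keep only lines that look like questions, without list markers."""
    questions = []
    for line in raw_output.splitlines():
        # Remove leading list markers such as "1.", "2)", "-" or "*"
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        # Assumption: a usable question ends with a question mark
        if text.endswith("?"):
            questions.append(text)
    return questions


sample_output = (
    "Here are 10 questions:\n"
    "1. Where do Dreamweavers live?\n"
    "\n"
    "2) What do Skydancers symbolize?\n"
    "- Which subspecies can summon lightning?"
)
print(clean_questions(sample_output))
```

Filtering at this stage keeps stray lines out of `questions.json`, so the random selection further down never picks a non-question.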

Select a question. For now the choice is random, but it could be made smarter.

In [280]:
import random

questions_all = []
with open('questions/questions.json', 'r') as f:
    questions_all = json.load(f)

random_question = random.choice(questions_all)
print(random_question)


What subspecies of Zephryn are associated with places of great spiritual significance?
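
One way the selection could be made less naive is to favor questions the user has previously answered incorrectly. This is only a sketch of that idea: `pick_question` and the `miss_counts` bookkeeping are hypothetical and do not exist elsewhere in this notebook.

```python
import random


def pick_question(questions: list[str], miss_counts: dict[str, int], rng=random) -> str:
    """Pick a question at random, weighted by how often it was missed.

    Every question keeps a base weight of 1 so that it can still be drawn.
    """
    weights = [1 + miss_counts.get(q, 0) for q in questions]
    return rng.choices(questions, weights=weights, k=1)[0]


sample_questions = [
    "Where do Dreamweavers live?",
    "What do Skydancers symbolize?",
    "Which subspecies can summon lightning?",
]
# Hypothetical history: the first question was missed three times
history = {"Where do Dreamweavers live?": 3}

print(pick_question(sample_questions, history))
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which helps when testing the study buddy.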


Ask the system its own question, for fun. It uses the previous QA chain.

In [281]:


answer = invoke(random_question)
print(answer['answer'])



Dreamweavers. They are drawn to places of great beauty and spiritual significance, often appearing as wisps of light or shimmering apparitions in these locations.


Evaluate a user's answer

In [292]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain.prompts import PromptTemplate


prompt_answer_evaluation = PromptTemplate.from_template(
    """You are a study buddy. You ask the user questions that will help him to learn the concepts within the given knowledge context. 
You asked the following question to the user. Your evaluation should be only based on the context and the question.
You evaluate the response accuracy, relevance, and completeness.
Do not use any other information than the context.

Be very concise and to the point. Provide a useful feedback.
Do not say things like "based on the context" or "I think" or "According to the text". Do not mention a "context".
Never give the correct response to the user, instead suggest parts of the context to review.

If the answer to the question is accurate, relevant and complete, your feedback is just "Correct!" without any comment. Do not include anything else.
If the answer to the question is not accurate, relevant and complete, sum up your evaluation in two sentences. The first sentence is a funny short, overall evaluation of the answer. The second sentence is optional and is a suggestion for content to review. Do not include any clue to the correct answer.
Never mention the score in the feedback.

You talk to the user and give him feedback on his answer. You don't talk about him as "the user" but as "you".


Context:
---------
{context}
---------
Questions: 
{question}:
Answer:
{user_answer}
Feedback:"""
)

def format_docs(docs):
    return "\n\n".join(
        doc.page_content for doc in docs
    )

chain_answer_evaluation = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["docs"])))
    | prompt_answer_evaluation
    | llm
    | StrOutputParser()
)

# get the source documents
chain_answer_evaluation_with_source = RunnableParallel(
    {
     "docs": (lambda x: x["question"] + '\n' + x['user_answer']) | retriever, # feed the retriever with the original question and the user's answer
     "question": (lambda x: x["question"]),
     "user_answer": (lambda x: x["user_answer"])
     }
).assign(answer=chain_answer_evaluation)

user_answer = "Elephant"
evaluation_output = chain_answer_evaluation_with_source.invoke({"question": random_question, "user_answer": user_answer})

#pprint(evaluation_output)
print(evaluation_output['answer'])



You are thinking about physical characteristics.

Review Dreamweavers' Social Structure and The Zephryn: A Celestial Animal for more information.
