In [2]:
#metadata section
#using this article: https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MetadataExtraction_LLMSurvey/
#currently not sure if the order of these operations matters
import nest_asyncio
import os
import openai
from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama
from llama_index.core.schema import MetadataMode
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.ingestion import IngestionPipeline


from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

nest_asyncio.apply()
#key is set by the environment not written into the code
#openai.api_key = os.environ["OPENAI_API_KEY"]
#also not sure how to determine these values
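# `model` must be a model-name string; passing the Ollama class here is a bug.
# To run locally instead (assumption: an Ollama server is running with "openchat" pulled):
# llm = Ollama(model="openchat", temperature=0.1, request_timeout=120.0)
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)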


node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)

extractors_2 = [
    SummaryExtractor(summaries=["prev", "self", "next"], llm=llm),
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]


# load in blog
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])

print(docs[0].get_content())

orig_nodes = node_parser.get_nodes_from_documents(docs)

# take 8 nodes (indices 20-27) for testing
nodes = orig_nodes[20:28]

print(nodes[3].get_content(metadata_mode="all"))


# process nodes with metadata extractors
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_2])

nodes_1 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)

print(nodes_1[3].get_content(metadata_mode="all"))

# [eugeneyan](/)

  * [Start Here](/start-here/ "Start Here")
  * [Writing](/writing/ "Writing")
  * [Speaking](/speaking/ "Speaking")
  * [Prototyping](/prototyping/ "Prototyping")
  * [About](/about/ "About")

# Patterns for Building LLM-based Systems & Products

[ [llm](/tag/llm/) [engineering](/tag/engineering/)
[production](/tag/production/) [🔥](/tag/🔥/) ]  · 66 min read

> Discussions on [HackerNews](https://news.ycombinator.com/item?id=36965993),
> [Twitter](https://twitter.com/eugeneyan/status/1686531758701899776), and
> [LinkedIn](https://www.linkedin.com/posts/eugeneyan_patterns-for-building-
> llm-based-systems-activity-7092300473981927424-_wVo)

“There is a large class of problems that are easy to imagine and build demos
for, but extremely hard to make products out of. For example, self-driving:
It’s easy to demo a car self-driving around a block, but making it into a
product takes a decade.” -
[Karpathy](https://twitter.com/eugeneyan/status/1672692174704766976)

This write



because evals were often conducted with untested, incorrect
ROUGE implementations.

![Dimensions of model evaluations with ROUGE](/assets/rogue-scores.jpg)

Dimensions of model evaluations with ROUGE
([source](https://aclanthology.org/2023.acl-long.107/))

And even with recent benchmarks such as MMLU, **the same model can get
significantly different scores based on the eval implementation**.
[Huggingface compared the original MMLU
implementation](https://huggingface.co/blog/evaluating-mmlu-leaderboard) with
the HELM and EleutherAI implementations and found that the same example could
have different prompts across various providers.

![Different prompts for the same question across MMLU
implementations](/assets/mmlu-prompt.jpg)

Different prompts for the same question across MMLU implementations
([source](https://huggingface.co/blog/evaluating-mmlu-leaderboard))

Furthermore, the evaluation approach differed across all three benchmarks:

  * Original MMLU: Compares predicted probabiliti

Parsing nodes: 100%|██████████| 8/8 [00:00<00:00, 236.58it/s]
  0%|          | 0/12 [00:00<?, ?it/s]Retrying llama_index.llms.openai.base.OpenAI._achat in 0.12312208284654824 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.20338523956181231 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.8972601354115479 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.47388709150857244 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.5063392637176669 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.5497409233003199 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.8401711115214758 seconds as it raise

KeyboardInterrupt: 

Retrying llama_index.llms.openai.base.OpenAI._achat in 10.048864075678926 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 3.3659859825282084 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 9.023487660410197 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 14.437106378913485 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.6130717173493434 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.24315709302457789 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.043060165102468 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.092479
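
On the question of how to pick `chunk_size` and `chunk_overlap` in the cell above: a minimal sketch of a sliding token window, assuming plain list slicing. This is not how `TokenTextSplitter` is implemented internally; `chunk_tokens` and the fake token list are hypothetical and only show how the two values trade off chunk count against repeated text.

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=128):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Fake "tokens" stand in for real tokenizer output in this sketch.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=256, chunk_overlap=128)
print(len(chunks), [len(c) for c in chunks])
```

With an overlap of half the chunk size, every token (except at the edges) lands in two chunks, so each extractor LLM call sees roughly twice as much repeated text as it would with no overlap. A smaller overlap means fewer chunks and fewer LLM calls.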

In [None]:
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")

In [None]:
'''from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/openchat-3.5-0106-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=False, revision='main')
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512, temperature=0)
hf = HuggingFacePipeline(pipeline=pipe)
The approach above is very slow. Ollama works far better and is easier to test.
'''

In [None]:
'''
To run this, install Ollama first:
https://ollama.com/download
Start Ollama after installation (it runs in the background, so no window will appear).
In the terminal of this venv, run "ollama pull openchat" to download this model.
Don't use the default "pip install torch" unless you want to watch paint dry. Use the GPU-specific install instead.
'''

local_llm = "openchat"
#https://ollama.com/library?sort=popular

In [None]:
import torch

torch.cuda.empty_cache()

In [None]:
from langchain_community.document_loaders import UnstructuredFileLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant

loader = DirectoryLoader('C:/Users/Adel/Desktop/AWS_FULL_DOCS/', glob="**/*.md", show_progress=True, loader_cls=UnstructuredFileLoader, recursive=True, use_multithreading=True)

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

In [None]:
#langchain has a progress bar for every other embedding class but not for fastembed (⩺_⩹)
qdrant_store = Qdrant.from_documents(
    location=":memory:",  # in-memory store used only for testing; not suitable for production but fine for a demo
    collection_name="AWS_TEST",
    documents=texts,
    embedding=embeddings
)
retriever = qdrant_store.as_retriever()
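
For intuition about what the retriever does with a question, here is a rough sketch of top-k retrieval over an in-memory store. It assumes cosine similarity as the distance measure and uses hand-made 3-dimensional vectors instead of real FastEmbed embeddings; `cosine`, `top_k`, and `toy_store` are hypothetical names, not part of Qdrant or LangChain.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to query_vec.

    store is a list of (text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors; a real store would hold embedding-model output for each chunk.
toy_store = [
    ("load balancer limits", [0.9, 0.1, 0.0]),
    ("alexa for business overview", [0.1, 0.9, 0.1]),
    ("s3 bucket policies", [0.0, 0.2, 0.9]),
]
results = top_k([0.2, 0.8, 0.0], toy_store, k=2)
print(results)
```

The retriever always returns its top k chunks even when none of them is actually relevant, which is why the workflow later in this notebook grades each retrieved document before generating.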

In [None]:
from typing import Any, Dict, TypedDict

class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        keys: A dictionary where each key is a string.
    """

    # typing.Any is the type; lowercase `any` is the builtin function
    keys: Dict[str, Any]
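
A small sketch of what this state shape implies for the nodes defined later. It assumes that a node's return value replaces the top-level `keys` entry wholesale (my understanding of the default behaviour when no reducer is attached to the field); `apply_update` and `grade_documents_stub` are hypothetical stand-ins, not LangGraph functions.

```python
from typing import Any, Dict, TypedDict


class GraphState(TypedDict):
    keys: Dict[str, Any]


def apply_update(state: GraphState, update: Dict[str, Any]) -> GraphState:
    """Assumed merge rule: top-level entries in update overwrite those in state."""
    return {**state, **update}


def grade_documents_stub(state: GraphState) -> Dict[str, Any]:
    """Stand-in for a node: keeps one document and passes the question along."""
    keys = state["keys"]
    return {"keys": {"documents": keys["documents"][:1], "question": keys["question"]}}


state: GraphState = {
    "keys": {"question": "q", "documents": ["a", "b"], "generation": "old answer"}
}
state = apply_update(state, grade_documents_stub(state))
print(state["keys"])
```

Under that assumption, anything a node does not copy into its returned `keys` dict is gone for the next node, which is why every node in this notebook re-emits `question` and `documents` explicitly.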

In [None]:
from langchain import hub
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser

### Nodes ###

#TODO: Use multiple models for different tasks. For example, use a model that is good at generating questions for the transform_query node. Will reduce inference time and improve performance
#https://ollama.com/library?sort=popular



def retrieve(state):
    """
    Retrieve documents

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, documents, that contains retrieved documents
    """
    print("---RETRIEVE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = retriever.get_relevant_documents(question)
    return {"keys": {"documents": documents, "question": question}}


def generate(state):
    """
    Generate answer

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    print("---GENERATE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    
    # Prompt
    prompt = hub.pull("rlm/rag-prompt")

    # LLM
    llm = ChatOllama(model=local_llm, temperature=0)

    # Post-processing
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # Chain
    rag_chain = prompt | llm | StrOutputParser()

    # Run (pass the documents as one formatted string instead of a list of Document objects)
    generation = rag_chain.invoke({"context": format_docs(documents), "question": question})
    return {
        "keys": {"documents": documents, "question": question, "generation": generation}
    }

def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates documents key with relevant documents
    """

    print("---CHECK RELEVANCE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]

    # LLM
    llm = ChatOllama(model=local_llm, format="json", temperature=0)

    # Prompt
    prompt = PromptTemplate(
        template="""You are a grader assessing relevance of a retrieved document to a user question. \n 
        Here is the retrieved document: \n\n {context} \n\n
        Here is the user question: {question} \n
        If the document contains keywords related to the user question, grade it as relevant. \n
        It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
        Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question. \n
        Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.""",
        input_variables=["question","context"],
    )

    # Chain
    chain = prompt | llm | JsonOutputParser()

    # Score
    filtered_docs = []
    for d in documents:
        score = chain.invoke(
            {
                "question": question,
                "context": d.page_content,
            }
        )
        grade = score["score"]
        if grade == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            continue

    return {"keys": {"documents": filtered_docs, "question": question}}

def transform_query(state):
    """
    Transform the query to produce a better question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates question key with a re-phrased question
    """

    print("---TRANSFORM QUERY---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]

    # LLM
    llm = ChatOllama(model=local_llm, temperature=0)
    
    # Create a prompt template with format instructions and the query
    prompt = PromptTemplate(
        template="""You are generating a question that is well optimized for retrieval. \n 
        Look at the input and try to reason about the underlying semantic intent / meaning. \n 
        Here is the initial question:
        \n ------- \n
        {question} 
        \n ------- \n
        Formulate an improved question:""",
        input_variables=["question"],
    )

    # Chain
    chain = prompt | llm | StrOutputParser()
    better_question = chain.invoke({"question": question})

    return {"keys": {"documents": documents, "question": better_question}}

def prepare_for_final_grade(state):
    """
    Passthrough state for final grade.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): The current graph state
    """

    print("---FINAL GRADE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    generation = state_dict["generation"]

    return {
        "keys": {"documents": documents, "question": question, "generation": generation}
    }


### Edges ###

def decide_to_generate(state):
    """
    Determines whether to generate an answer, or re-generate a question.

    Args:
        state (dict): The current state of the agent, including all keys.

    Returns:
        str: Next node to call
    """

    print("---DECIDE TO GENERATE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    filtered_documents = state_dict["documents"]

    if not filtered_documents:
        # All documents were filtered out by grade_documents
        # We will re-generate a new query
        print("---DECISION: TRANSFORM QUERY---")
        return "transform_query"
    else:
        # We have relevant documents, so generate answer
        print("---DECISION: GENERATE---")
        return "generate"


def grade_generation_v_documents(state):
    """
    Determines whether the generation is grounded in the document.

    Args:
        state (dict): The current state of the agent, including all keys.

    Returns:
        str: Binary decision
    """
    print("---GRADE GENERATION vs DOCUMENTS---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    generation = state_dict["generation"]

    # LLM
    llm = ChatOllama(model=local_llm, format="json", temperature=0)

    # Prompt
    prompt = PromptTemplate(
        template="""You are a grader assessing whether an answer is grounded in / supported by a set of facts. \n 
        Here are the facts:
        \n ------- \n
        {documents} 
        \n ------- \n
        Here is the answer: {generation}
        Give a binary score, 'yes' or 'no', to indicate whether the answer is grounded in / supported by the set of facts. \n
        Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.""",
        input_variables=["generation", "documents"],
    )

    # Chain
    chain = prompt | llm | JsonOutputParser()
    score = chain.invoke({"generation": generation, "documents": documents})
    grade = score["score"]

    if grade == "yes":
        print("---DECISION: SUPPORTED, MOVE TO FINAL GRADE---")
        return "supported"
    else:
        print("---DECISION: NOT SUPPORTED, GENERATE AGAIN---")
        return "not supported"

def grade_generation_v_question(state):
    """
    Determines whether the generation addresses the question.

    Args:
        state (dict): The current state of the agent, including all keys.

    Returns:
        str: Binary decision
    """

    print("---GRADE GENERATION vs QUESTION---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    generation = state_dict["generation"]

    llm = ChatOllama(model=local_llm, format="json", temperature=0)

    # Prompt
    prompt = PromptTemplate(
        template="""You are a grader assessing whether an answer is useful to resolve a question. \n 
        Here is the answer:
        \n ------- \n
        {generation} 
        \n ------- \n
        Here is the question: {question}
        Give a binary score 'yes' or 'no' to indicate whether the answer is useful to resolve a question. \n
        Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.""",
        input_variables=["generation", "question"],
    )

    # Chain
    chain = prompt | llm | JsonOutputParser()
    score = chain.invoke({"generation": generation, "question": question})
    grade = score["score"]

    if grade == "yes":
        print("---DECISION: USEFUL---")
        return "useful"
    else:
        print("---DECISION: NOT USEFUL---")
        return "not useful"
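
The three graders above index `score["score"]` directly, so a reply with a different key or a capitalised "Yes" either raises `KeyError` or is silently treated as "no". A defensive parser could look like the sketch below. `parse_binary_score` is a hypothetical helper, and falling back to "no" is a design assumption (it leans towards re-retrieving rather than trusting an unreadable grade).

```python
import json


def parse_binary_score(raw, default="no"):
    """Return 'yes' or 'no' from a grader reply; fall back to default if unusable."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)  # accept raw JSON text as well as parsed dicts
        except json.JSONDecodeError:
            return default
    if not isinstance(raw, dict):
        return default
    score = str(raw.get("score", default)).strip().lower()
    return score if score in ("yes", "no") else default


print(parse_binary_score({"score": "Yes "}))
print(parse_binary_score('{"score": "no"}'))
print(parse_binary_score({"grade": "yes"}))
```

In the graders this would replace `grade = score["score"]` with `grade = parse_binary_score(score)`.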

In [None]:
import pprint

from langgraph.graph import END, StateGraph

workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("grade_documents", grade_documents)  # grade documents
workflow.add_node("generate", generate)  # generate
workflow.add_node("transform_query", transform_query)  # transform_query
workflow.add_node("prepare_for_final_grade", prepare_for_final_grade)  # passthrough

# Build graph
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "transform_query": "transform_query",
        "generate": "generate",
    },
)
workflow.add_edge("transform_query", "retrieve")
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents,
    {
        "supported": "prepare_for_final_grade",
        "not supported": "generate",
    },
)
workflow.add_conditional_edges(
    "prepare_for_final_grade",
    grade_generation_v_question,
    {
        "useful": END,
        "not useful": "transform_query",
    },
)

# Compile
app = workflow.compile()
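
To make the control flow of the graph above easier to reason about, here is a plain-Python sketch of the same loop with an explicit step cap. It does not use LangGraph: `run_workflow` and all the callables passed to it are hypothetical stubs, and `max_steps=25` simply mirrors the 25-call cutoff mentioned in the notes at the end of this notebook.

```python
def run_workflow(question, retrieve, grade, generate, is_supported, is_useful,
                 rewrite, max_steps=25):
    """Walk the retrieve -> grade -> generate loop, stopping after max_steps node calls."""
    node, docs, answer, steps = "retrieve", [], None, 0
    while steps < max_steps:
        steps += 1
        if node == "retrieve":
            docs, node = retrieve(question), "grade_documents"
        elif node == "grade_documents":
            docs = [d for d in docs if grade(question, d)]
            node = "generate" if docs else "transform_query"
        elif node == "transform_query":
            question, node = rewrite(question), "retrieve"
        elif node == "generate":
            answer = generate(question, docs)
            # Unsupported answers are regenerated, as in the graph above.
            node = "final_grade" if is_supported(answer, docs) else "generate"
        elif node == "final_grade":
            if is_useful(answer, question):
                return {"answer": answer, "steps": steps, "stopped_early": False}
            node = "transform_query"
    return {"answer": answer, "steps": steps, "stopped_early": True}


# Stub components; real ones would call the retriever and the LLM graders.
stubs = dict(
    retrieve=lambda q: ["Alexa for Business is a managed service."],
    generate=lambda q, docs: docs[0],
    is_supported=lambda a, docs: a in docs,
    is_useful=lambda a, q: True,
    rewrite=lambda q: q + " (rephrased)",
)
ok = run_workflow("what is alexa", grade=lambda q, d: "alexa" in d.lower(), **stubs)
stuck = run_workflow("what is alexa", grade=lambda q, d: False, **stubs)
print(ok["steps"], stuck["steps"], stuck["stopped_early"])
```

The `stuck` run shows the failure mode: when no document ever passes grading, the loop cycles through retrieve, grade, and rewrite until the cap is hit. In LangGraph itself the cap can be changed by passing `config={"recursion_limit": ...}` to `app.stream` or `app.invoke`; a per-question retry counter stored in the state would be another option.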

In [None]:
inputs = {"keys": {"question": "What is Alexa for Business?"}}
for output in app.stream(inputs):
    for key, value in output.items():
        # Node
        pprint.pprint(f"Node '{key}':")
        # Optional: print full state at each node
        pprint.pprint(value["keys"], indent=2, width=80, depth=None)
    pprint.pprint("\n---\n")

# Final generation
pprint.pprint(value['keys']['generation'])

In [None]:
'''
Notes:
The QA csv sheet provided has some errors in it (e.g. for the question "What is the maximum number of security groups associated with a load balancer?" the correct answer is 5, but the expected answer in the sheet is 50).
The llamaindex/langchain huggingfacellm wrapper makes inference almost 7x slower than using a direct pipeline.
Ollama is a godsend, but we should run it on linux/wsl for faster inference.
Llamaindex has some nice features that langchain doesn't have (at a glance), so we should use both of them in production instead of only one. Both of their docs are unclear on lots of things or outdated.
I haven't tested performance between chroma and qdrant, but at a glance qdrant seems to be faster.
The code above is from langchain's example (would've saved a month's worth of work if we had pivoted earlier). Wasted an entire day trying to use it with huggingface but settled on ollama, like their docs.
Performance depends heavily on the question asked. If the question is too vague, the model takes a long time to respond because it makes multiple calls to rewrite the question and grade the response to prevent hallucinations.
Current benchmark is 25-45s on average; 2 min is the slowest I've seen. Needs some more tuning. Will see if using a different model for each task helps.
Generating an answer on its own, without going through the workflow, takes a couple of seconds.

(Modular workflow graph - needs some further tuning)
Step 1: User asks a question
Step 2: Retrieve relevant documents
Step 3: Grade document relevance
Step 4: If no document is relevant, rewrite the question and go back to step 2 (problem: if no documents are ever relevant we get stuck in an "infinite" loop, but it usually stops after 25 calls, roughly 2 mins)
Step 5: Generate answer
Step 6: Grade answer. If the answer is hallucinated/doesn't match the documents, go back to step 5 (same problem as step 4 if the context doesn't exist)
Step 7: If the answer is useful, end. If not, rewrite the question and go back to step 2

'''

#TODO: Use linux/wsl with vLLM for faster inference. From some reading it should be almost 2x faster than Windows (unverified)
#TODO: Use llamaindex for some tasks and langchain for others. Both have their own strengths and weaknesses. Both have horrible docs
#TODO: Host qdrant on a server
#TODO: Figure out metadata for documents. Useful for filtering out irrelevant/old answers - should speed things up & reduce hallucinations
#TODO: Add preprocessing & postprocessing to the workflow. Useful for filtering out irrelevant/old answers - should speed things up & reduce hallucinations
#TODO: Test CRAG performance against current self-rag implementation
#TODO: Add chat history for multiple-shot reasoning
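
Before tuning the 25-45s benchmark from the notes, it may help to see which node the time goes to. A minimal sketch, assuming each node is a function that takes the state and returns an update: `timed`, `node_timings`, and `fake_retrieve` are hypothetical names, and the sleep stands in for real retrieval or LLM latency.

```python
import time
from collections import defaultdict
from functools import wraps

# Wall-clock samples per node name, collected across one or more runs.
node_timings = defaultdict(list)


def timed(name):
    """Decorator that records how long each call to a node function takes."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(state):
            start = time.perf_counter()
            try:
                return fn(state)
            finally:
                node_timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("retrieve")
def fake_retrieve(state):
    time.sleep(0.01)  # placeholder for the real retrieval call
    return {"keys": {**state["keys"], "documents": ["doc"]}}


out = fake_retrieve({"keys": {"question": "q"}})
for name, samples in node_timings.items():
    print(f"{name}: {len(samples)} call(s), {sum(samples):.3f}s total")
```

With the real graph, the wrapper could be applied at registration time, for example `workflow.add_node("retrieve", timed("retrieve")(retrieve))`, so the node functions themselves stay unchanged.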
