# Local LLM Example w/ RAG on OnLogic K800

This notebook shows an example pipeline for locally running an LLM with RAG integration, using Ollama and LangChain. 

Ollama is a user-friendly, free, and open-source tool for running LLMs locally. For this example, we pull two models from Ollama:
* Llama3.1:8b-instruct-fp16 - This is the general-purpose model that the pipeline is built on. It works well on its own, but adding RAG gives more control over responses and keeps the model grounded in reliable data sources.
* nomic-embed-text - This is an Ollama-supported text embedding model (137M parameters) that lets the pipeline pull relevant context from the context database based on its similarity with the prompt.

To get started with Ollama, visit their website and download the app for your device.
After installing, run:

`ollama pull llama3.1:8b-instruct-fp16`

`ollama pull nomic-embed-text`

to download and prepare the models for local execution. Ollama has a wide catalog of LLMs of different sizes and specialties, so browse to find one that matches your use case and hardware.

Getting a model running is as easy as running

`ollama run llama3.1:8b`

which lets you chat with the LLM through the command line. Note that `llama3.1:8b` is a separate tag from the fp16 build pulled above; `ollama run` downloads a tag first if it is not already present. Run `ollama ps` to see the models currently loaded in memory and `ollama list` to see your locally downloaded models.
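The same inventory that `ollama list` prints can also be read from Python. The sketch below assumes the Ollama server is listening on its default local address (`http://127.0.0.1:11434`, the address that appears in the request logs of this notebook) and queries its `/api/tags` endpoint. `list_local_models` is a hypothetical helper name, not part of any library.

```python
import json
from urllib.request import urlopen

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://127.0.0.1:11434"


def list_local_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return the names of locally downloaded models, or None if the server is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers malformed JSON.
        return None
    return [entry["name"] for entry in payload.get("models", [])]


models = list_local_models()
if models is None:
    print("Ollama server not reachable; open the app or run `ollama serve`.")
else:
    print("Local models:", models)
```

Returning `None` instead of raising keeps the check usable as a quick guard at the top of a notebook.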

LangChain is the second tool used in this example; it defines the overall RAG pipeline. We first create an embeddings database with LangChain's Chroma integration to store the embeddings for each context document, and then define a chain that collects the most relevant documents and passes them to the Llama model, which in turn responds to the question. This is a very simple example of the chains that LangChain supports; they can be much more complex and customizable, allowing for things like web search and model self-correction.
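To make the retrieval step concrete, here is a minimal, library-free sketch of similarity-based lookup. It assumes cosine similarity as the measure (the vector store used later may rank by a different distance metric) and uses hand-made three-dimensional vectors in place of real embeddings. The names `cosine_similarity`, `top_k`, and `store` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries in store most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
store = [
    ("DisplayPort section", [0.9, 0.1, 0.0]),
    ("CE compliance section", [0.1, 0.9, 0.1]),
    ("Mounting section", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded DisplayPort question

print(top_k(query_vec, store, k=1))
```

The retrieved text is what gets pasted into the prompt as context, so the quality of the answer depends directly on this ranking.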


All of the LangChain libraries used here can be installed with:

`pip install langchain langchain_community`

`pip install -qU langchain_chroma`

`pip install -qU langchain_ollama`

`pip install -qU langchain_unstructured`

This notebook was run locally on an OnLogic K804 with one 20GB Nvidia A4000 GPU. 

This example walks through the creation of a simple chatbot that gives technical support for a couple of OnLogic PCs. The chatbot needs context from which to build its answers, so we pull a collection of product manuals from onlogic.com and let LangChain and Ollama convert them into embeddings, which lets the RAG pipeline capture some of the semantic meaning of each document. We use LangChain's Unstructured integration to parse the documents and separate them into sections. This requires some additional installation, starting with

`pip install -qU "unstructured[pdf]"`

When parsing PDFs locally, the unstructured library also requires two additional dependencies, Tesseract and Poppler.

You can also parse PDFs through the free Unstructured API, which requires an API key.
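Before parsing locally, it can help to confirm that both tools are actually on the `PATH`. The sketch below assumes the usual executable names: `tesseract` for Tesseract, and `pdftoppm` and `pdfinfo` for Poppler's command-line utilities. `find_tools` is a hypothetical helper, and installation steps differ by platform.

```python
import shutil

# Assumption: standard executable names for Tesseract OCR and Poppler utilities.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "pdfinfo")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```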

In [12]:
# Parse pdfs, creating a Document object for each section within each pdf
# this can take a while, depending on the number and size of the documents. 
from langchain_unstructured import UnstructuredLoader
from langchain_community.vectorstores.utils import filter_complex_metadata
from glob import glob

docs = []

pdf_filepaths = glob('product_manuals/*.pdf')

for file in pdf_filepaths:
    loader = UnstructuredLoader(
        file_path=file,
        strategy="fast",
        chunking_strategy='by_title'
    )

    for doc in loader.lazy_load():
        docs.append(doc)
    
# Drop metadata values the vector store cannot hold (e.g. lists or dicts)
docs = filter_complex_metadata(docs)

print(f'Total number of documents: {len(docs)}')
docs[0]

Total number of documents: 290


Document(metadata={'source': 'product_manuals\\OnLogic-FR201-Product-Manual.pdf', 'file_directory': 'product_manuals', 'filename': 'OnLogic-FR201-Product-Manual.pdf', 'last_modified': '2024-12-04T12:09:29', 'page_number': 1, 'orig_elements': 'eJzlmMtu3DYUhl+F0DpUeb9k17Rxm6JpjNhZ2YFB8WILmJEGksaXBn33HlEyMLGnwHShRdXNAD/1c0Tx4yHP4dW3Im7iNjbDTR2Kt6iwXlAracKJmoCFFA7bECimleZGGcpSlMUbVGzj4IIbHPT5Vvi27ULduCH2WW/cU7sfbu5ifXs3QIsRDLrMrQ91GO6gUVoFjbu2boax19WVZiV5g6jUpf36Bj1LwyfJmSj1EZ3t0FD0T/0Qt...

In [13]:
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

# Use a local embedding model to store the embeddings of all context documents in a vector database
local_embeddings = OllamaEmbeddings(model="nomic-embed-text")

vectorstore = Chroma.from_documents(docs, local_embeddings)

# formatting function used by the Llama model to "read" the context documents.
def format_docs(docs):
    return "\n\n".join(f'context source: {doc.metadata["filename"]}\nContent:\n{doc.page_content}' for doc in docs)

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
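To see what the model will actually receive as context, `format_docs` can be exercised on its own. In the sketch below, `SimpleNamespace` objects stand in for LangChain `Document` objects (only the `metadata` and `page_content` attributes are needed), and the filenames and text are made up for illustration.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Same formatting as the notebook cell: one labelled block per document."""
    return "\n\n".join(
        f'context source: {doc.metadata["filename"]}\nContent:\n{doc.page_content}'
        for doc in docs
    )


# Hypothetical stand-ins for retrieved Document objects.
sample_docs = [
    SimpleNamespace(metadata={"filename": "manual-a.pdf"}, page_content="First chunk."),
    SimpleNamespace(metadata={"filename": "manual-b.pdf"}, page_content="Second chunk."),
]

formatted = format_docs(sample_docs)
print(formatted)
```

Labelling each chunk with its source filename gives the model a way to tell apart passages that come from different manuals.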


In [14]:
# Testing the vector store
query = "What are the displayport details for the k800 series?"

# similarity_search returns the closest matches, most similar first
docs = vectorstore.similarity_search(query)
# The most semantically similar context document for the query above:
docs[0].page_content

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


'2.9 - DisplayPort\n\nThere are two full-size DisplayPorts on the Karbon 800 series. Both ports support DP 1.4 at 4K 60Hz\n\nand support MST (Multi Stream Topology). An MST hub can be used to support up to four\n\nindependent displays. Please refer to Intel documentation for additional Alder Lake-S display output\n\nspecifications: https://ark.intel.com/\n\n22'

In [15]:
from langchain_ollama import ChatOllama

# Load Llama3.1 model

model = ChatOllama(
    model='llama3.1:8b'
)


In [16]:
# Run LLM with context retrieval
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Define the chain

RAG_TEMPLATE = """
You are an assistant for tech support question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

<context>
{context}
</context>

Answer the following question:

{question}"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

retriever = vectorstore.as_retriever()

# The chain finds the most relevant context documents using the vectorstore retriever,
# formats them, and inserts them into the template along with the user question.
# The model then executes on the entire prompt and returns a response object.

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
)
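The data flow of this chain can be mirrored with plain functions, which may make the pipe syntax easier to read. The sketch below is not LangChain code: `run_chain`, `stub_retriever`, and `echo_model` are hypothetical stand-ins, and the template is a shortened version of `RAG_TEMPLATE`.

```python
TEMPLATE = "<context>\n{context}\n</context>\n\nAnswer the following question:\n\n{question}"


def run_chain(question, retriever, format_docs, template, model):
    """Plain-Python mirror of: {context, question} | prompt | model."""
    # The dict step sends the same question down two branches.
    inputs = {
        "context": format_docs(retriever(question)),  # retriever | format_docs
        "question": question,                         # RunnablePassthrough()
    }
    prompt = template.format(**inputs)  # rag_prompt
    return model(prompt)                # model


def stub_retriever(question):
    # Stand-in for the vector store retriever; always returns one fixed chunk.
    return ["Both ports support DP 1.4."]


def echo_model(prompt):
    # Stand-in for the LLM; returns the prompt so the assembled text is visible.
    return prompt


result = run_chain(
    "What are the DisplayPort details?",
    retriever=stub_retriever,
    format_docs="\n\n".join,
    template=TEMPLATE,
    model=echo_model,
)
print(result)
```

Swapping `echo_model` for a real model call is the only change needed to turn this into a working, if minimal, RAG loop.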

In [21]:
question = "What are the details for the K800 series of computers' CE compliance?"
response = chain.invoke(question)

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


In [18]:
print(response.content)

The K800 computer system was evaluated for medical, IT equipment, automotive, maritime, and railway EMC standards as a class A device. It complies with the relevant IT equipment directives for the CE mark. Modification of the system may void the certifications. Testing includes EN 55032, EN 55035, EN 60601-1, EN 62368-1, EN 60950-1, and others.


In [22]:
# Total run duration in seconds
print(response.response_metadata['total_duration'] /  1e9)

3.7736804


In [20]:
response

AIMessage(content='The K800 computer system was evaluated for medical, IT equipment, automotive, maritime, and railway EMC standards as a class A device. It complies with the relevant IT equipment directives for the CE mark. Modification of the system may void the certifications. Testing includes EN 55032, EN 55035, EN 60601-1, EN 62368-1, EN 60950-1, and others.', additional_kwargs={}, response_metadata={'model': 'llama3.1:8b', 'created_at': '2024-12-05T21:48:27.7233821Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 1829200800, 'load_duration': 11295300, 'prompt_eval_count': 656, 'prompt_eval_duration': 323000000, 'eval_count': 87, 'eval_duration': 1493000000}, id='run-ae65f6d7-8075-41b5-8d16-e8043cfa1d03-0', usage_metadata={'input_tokens': 656, 'output_tokens': 87, 'total_tokens': 743})
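The timing fields in `response_metadata` also allow a rough throughput estimate. The sketch below assumes the duration fields are reported in nanoseconds, consistent with the division by `1e9` used earlier, and copies its numbers from the `AIMessage` shown above. `throughput` is a hypothetical helper name.

```python
NS_PER_S = 1e9  # duration fields are assumed to be in nanoseconds


def throughput(metadata):
    """Tokens per second for prompt processing and for generation, plus total seconds."""
    return {
        "prompt_tokens_per_s": metadata["prompt_eval_count"]
        / (metadata["prompt_eval_duration"] / NS_PER_S),
        "output_tokens_per_s": metadata["eval_count"]
        / (metadata["eval_duration"] / NS_PER_S),
        "total_s": metadata["total_duration"] / NS_PER_S,
    }


# Values copied from the response_metadata printed above.
metadata = {
    "total_duration": 1829200800,
    "prompt_eval_count": 656,
    "prompt_eval_duration": 323000000,
    "eval_count": 87,
    "eval_duration": 1493000000,
}

stats = throughput(metadata)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

For this particular response, generation works out to roughly 58 tokens per second, while prompt processing is far faster because the 656 input tokens are evaluated in a single batch.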