### Chat with your unstructured Text Files with Llama3 and Ollama

Some code inspired by Sascha Retter (https://blog.retter.jetzt/)

##### Chat with local Llama3 Model via Ollama in KNIME Analytics Platform — Also extract Logs into structured JSON Files
https://medium.com/p/aca61e4a690a

##### Ask Questions from your CSV with an Open Source LLM, LangChain & a Vector DB
https://www.tetranyde.com/blog/langchain-vectordb

##### Document Loaders in LangChain
https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123

##### Unleashing Conversational Power: A Guide to Building Dynamic Chat Applications with LangChain, Qdrant, and Ollama (or OpenAI’s GPT-3.5 Turbo)
https://medium.com/@ingridwickstevens/langchain-chat-with-your-data-qdrant-ollama-openai-913020ec504b


In [37]:
import os

import pandas as pd

# Document Loaders in LangChain
# https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123
from langchain_community.document_loaders import UnstructuredFileLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Vector store (Chroma) and its metadata helper
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import utils as chromautils

# Embeddings computed locally with a sentence-transformers model
from langchain_community.embeddings import HuggingFaceEmbeddings

# from langchain.llms import Ollama
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

from langchain.chains import RetrievalQA

embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2" # sentence-transformers model used to embed the text
model = "llama3:instruct" # the model must already be available locally, for example pulled with 'ollama pull llama3:instruct'

In [11]:
# Proxy configuration (optional)
# proxy = "http://proxy.my-company.com:8080"  # Uncomment and replace with your proxy server and port
proxy = ""  # Empty string: no proxy is used
os.environ['http_proxy'] = proxy
os.environ['https_proxy'] = proxy
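
Setting the proxy variables to an empty string is not always the same as leaving them unset, since some HTTP clients interpret an empty value differently. A minimal sketch of a helper that either sets or removes the variables; `configure_proxy` is a hypothetical name and not part of any library:

```python
import os

def configure_proxy(proxy_url):
    """Set the proxy environment variables, or remove them if proxy_url is empty.

    Hypothetical helper for illustration only.
    """
    for name in ("http_proxy", "https_proxy"):
        if proxy_url:
            os.environ[name] = proxy_url
        else:
            # Remove the variable instead of leaving an empty string behind
            os.environ.pop(name, None)

configure_proxy("")  # no proxy: both variables are removed if present
```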

In [52]:
# Example questions; only the last assignment is actually asked
question = "What does Hamlet say to his mother? Can you give the source?"

question = "Who is the first person Hamlet kills? Can you give the source?"

In [54]:
# Define the directory containing your text files. Note: for files with a .csv extension, other document loaders may be a better fit
text_files_directory = "../documents/shakespeare/"

In [55]:
def list_text_files(directory):
    """List all TXT files in the given directory."""
    # List all entries in the directory
    files = os.listdir(directory)
    # Keep only the files ending in '.txt' and return them with their full path
    txt_files = [os.path.join(directory, file) for file in files if file.endswith('.txt')]
    return txt_files

# List the TXT files in the configured directory
txt_files = list_text_files(text_files_directory)
print(txt_files)


["../documents/shakespeare/A Midsummer Night's Dream.txt", '../documents/shakespeare/Hamlet, Prince of Denmark.txt', '../documents/shakespeare/King Lear.txt', '../documents/shakespeare/Macbeth.txt', '../documents/shakespeare/Sonnets by William Shakespeare.txt']
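
`os.listdir` returns entries in arbitrary order, so the file order (and therefore the order of documents in the vector store) can differ between machines. A minimal alternative sketch using `pathlib` that returns a sorted list; `list_text_files_sorted` is a hypothetical name chosen for this example:

```python
from pathlib import Path

def list_text_files_sorted(directory):
    """Return the sorted paths (as strings) of all .txt files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).glob("*.txt")
        if path.is_file()  # skip directories that happen to end in .txt
    )
```

It can be called the same way as `list_text_files`, for example `list_text_files_sorted(text_files_directory)`.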


In [56]:
# https://github.com/langchain-ai/langchain/issues/8556#issuecomment-1806835287

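# Load the content of the text files (embedding happens later, when the vector store is built)
def load_and_embed_files(file_paths):
    """Load each file as Unstructured elements and return them as a list of LangChain documents."""
    documents = []
    for file_path in file_paths:
        loader = UnstructuredFileLoader(file_path, mode="elements")
        documents.extend(loader.load())
    # Chroma only accepts simple metadata values (str, int, float, bool); filter once, after loading
    return chromautils.filter_complex_metadata(documents)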

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)
# embedding_model = SentenceTransformerEmbeddings(model_name=embedding_model_name)



In [57]:
# Load the text files (they are embedded when the vector store is created)
documents = load_and_embed_files(txt_files)
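
With `mode="elements"` each file is already broken into small elements, which is why `RecursiveCharacterTextSplitter` is imported but not needed here. If whole files were loaded instead, they would have to be split into chunks before embedding. The sketch below shows the basic idea of fixed-size chunks with overlap in plain Python; it is a simplified stand-in and not LangChain's splitting algorithm, and `chunk_text` and its parameters are hypothetical:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears in full in one of the chunks.
    """
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```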

In [40]:
type(documents)

list

In [41]:
# Define the path to store the Chroma vector store (in SQLite format)
v_path_vector_store = '../data/vectorstore/shakespeare'

In [42]:
# Create the vector store from the loaded documents; each one is embedded and persisted to disk
vectorstore = Chroma.from_documents(
    documents=documents, 
    embedding=embedding_model, 
    persist_directory=v_path_vector_store
)
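
Conceptually, a vector store keeps one embedding vector per document and answers a query by returning the documents whose vectors are closest to the query vector. The toy sketch below illustrates this with cosine similarity in plain Python. It is an assumption-laden illustration only: Chroma uses its own index and its configured distance metric may differ, and `cosine_similarity` and `top_k` are hypothetical names:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=2):
    """Return the texts of the k stored (text, vector) pairs most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```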

#### Use the stored Vector store

In [44]:
# load vectorstore from disk
chroma_db = Chroma(persist_directory=v_path_vector_store, embedding_function=embedding_model)

In [45]:
type(chroma_db)

langchain_community.vectorstores.chroma.Chroma

In [46]:
# Define the LLM. The StreamingStdOutCallbackHandler prints tokens as they are generated; remove it (and set verbose=False) if you only want the final result
llm = Ollama(model=model,
            verbose=True,
            callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))

print(f"Loaded LLM model {llm.model}")

Loaded LLM model llama3:instruct


In [53]:
# Initialize the RetrievalQA chain with the vector store retriever
retriever = chroma_db.as_retriever(search_kwargs={"k": 2})  # k = number of document chunks to retrieve per query
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
)

# Use the 'invoke' method to handle the query
result = qa_chain.invoke({"query": question})

I'm happy to help!

Based on the context provided, it seems that the question is about the character Hamlet from Shakespeare's play "Hamlet". If I understand correctly, the question asks what the first person Hamlet kills.

From my knowledge of the play, I can tell you that Hamlet's first kill is Polonius. This occurs in Act 3, Scene 3, when Hamlet mistakes Polonius for a snake and stabs him through the arras (curtain).

Source: Shakespeare, W. (1603). Hamlet.

Please let me know if I'm correct or if you'd like me to clarify anything!
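
The answer above names only a generic source. `RetrievalQA.from_chain_type` accepts `return_source_documents=True`, in which case the result dictionary also contains a `source_documents` list with the retrieved chunks. The helper below is a hypothetical sketch for listing them. It assumes each document exposes `page_content` and a `metadata` dictionary with a `"source"` key (expected from the file loader, but treat that as an assumption). `SimpleNamespace` stands in for real documents here, and the passage text is a placeholder:

```python
from types import SimpleNamespace

def format_sources(result, max_chars=80):
    """Return one 'source: snippet' line per retrieved document in a RetrievalQA result."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        # Collapse whitespace and shorten the passage for display
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{source}: {snippet}")
    return lines

# Stand-in result with placeholder content, shaped like the chain's output
example_result = {
    "query": "placeholder question",
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(
            page_content="placeholder   passage\ntext",
            metadata={"source": "../documents/shakespeare/Hamlet, Prince of Denmark.txt"},
        )
    ],
}

for line in format_sources(example_result):
    print(line)
```

With the real chain this would be used as `format_sources(qa_chain.invoke({"query": question}))`, provided the chain was created with `return_source_documents=True`.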

#### Use the model

In [None]:
# llm_model = Ollama(model=model, verbose=False)  # Disable verbose for batch processing

In [None]:
# Define the instruction and question prompts (fill in your own text)
v_instruct = """Instructions:
"""

v_prompt = """Question:
"""

# Combine the instruction and prompt
combined_prompt = v_instruct + "\n" + v_prompt

# Print the instruction and question prompt
# print(v_instruct)
# print(v_prompt)
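
The two prompt templates above are left empty for you to fill in. A minimal sketch of a helper that assembles the same "Instructions / Question" layout and rejects an empty question; `build_prompt` is a hypothetical name and the layout simply mirrors the headers used in this cell:

```python
def build_prompt(instructions, question):
    """Combine optional instructions and a required question into one prompt string."""
    if not question.strip():
        raise ValueError("question must not be empty")
    parts = []
    if instructions.strip():
        parts.append("Instructions:\n" + instructions.strip())
    parts.append("Question:\n" + question.strip())
    # Separate the sections with a blank line
    return "\n\n".join(parts)
```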


In [None]:
# Use the LLM to process the combined prompt
# response = llm_model(combined_prompt)
response = llm.invoke(combined_prompt)  # invoke() replaces the deprecated direct call llm(...)

In [None]:
# Print the response
print(response)