### Chat with your unstructured logs using Llama3 and Ollama

Some code inspired by Sascha Retter (https://blog.retter.jetzt/)

##### Chat with local Llama3 Model via Ollama in KNIME Analytics Platform — Also extract Logs into structured JSON Files
https://medium.com/p/aca61e4a690a

##### Ask Questions from your CSV with an Open Source LLM, LangChain & a Vector DB
https://www.tetranyde.com/blog/langchain-vectordb

##### Document Loaders in LangChain
https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123

##### Unleashing Conversational Power: A Guide to Building Dynamic Chat Applications with LangChain, Qdrant, and Ollama (or OpenAI’s GPT-3.5 Turbo)
https://medium.com/@ingridwickstevens/langchain-chat-with-your-data-qdrant-ollama-openai-913020ec504b


In [8]:
import os

import pandas as pd

# Document Loaders in LangChain
# https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123
from langchain_community.document_loaders import UnstructuredFileLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

# from langchain.vectorstores import Chroma
from langchain_community.vectorstores import Chroma

# from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
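# Optional alternatives to HuggingFaceEmbeddings, imported from langchain_community like the classes above
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.embeddings import SentenceTransformerEmbeddings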

# from langchain.llms import Ollama
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

from langchain.chains import RetrievalQA

embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2" # the standard embedding model for
model = "llama3:instruct" # the model must already be available locally, for example pulled with 'ollama run llama3:instruct'

In [34]:
# Proxy configuration (only needed behind a proxy server)
# proxy = "http://proxy.my-company.com:8080"  # Replace with your proxy server and port
proxy = ""  # empty string: no proxy is used
os.environ['http_proxy'] = proxy
os.environ['https_proxy'] = proxy

In [2]:
question = "What would be the best set of JSON columns to extract data from these Logfiles in a systematic way? Can you write a prompt?"

In [20]:
# Define the directory containing your log files. Note: for files with a .csv ending, another document loader (e.g. CSVLoader) may be a better fit
log_files_directory = "../documents/logs/"

In [21]:
# List all log files in the directory
log_files = [os.path.join(log_files_directory, f) for f in os.listdir(log_files_directory) if os.path.isfile(os.path.join(log_files_directory, f))]

In [22]:
print(log_files)

['../documents/logs/logfile_01.log', '../documents/logs/logfile_02.log', '../documents/logs/logfile_03.log', '../documents/logs/logfile_04.log']
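The listing above picks up every regular file in the directory. If the folder also holds other files, a filtered and sorted listing may be more predictable. The following is a minimal sketch using only the standard library; the helper name `list_log_files` and its `suffix` parameter are illustrative and not part of the original notebook.

```python
from pathlib import Path


def list_log_files(directory, suffix=".log"):
    """Return the sorted paths of regular files in `directory` ending in `suffix`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )
```

Calling `list_log_files(log_files_directory)` could then stand in for the list comprehension in the cell above.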


In [23]:
# Load the content of the log files (the embedding itself happens later, when the vector store is created)
def load_and_embed_files(file_paths):
    documents = []
    for file_path in file_paths:
        loader = UnstructuredFileLoader(file_path)
        documents.extend(loader.load())
    return documents

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)
# embedding_model = SentenceTransformerEmbeddings(model_name=embedding_model_name)

In [None]:
# Load the log files into a list of documents
documents = load_and_embed_files(log_files)

In [36]:
type(documents)

list

In [26]:
# Define the path to store the Chroma vector store (in SQLite format)
v_path_vector_store = '../data/vectorstore/chroma_vector_store_logs'

In [28]:
# create the vector store from the documents / logs you provided
vectorstore = Chroma.from_documents(
    documents=documents, 
    embedding=embedding_model, 
    persist_directory=v_path_vector_store
)
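The cell above embeds each log file as it was loaded. The notebook imports `RecursiveCharacterTextSplitter` but does not use it, so long log files end up as very large documents. As a rough illustration of what splitting before embedding does, here is a plain-Python sketch that cuts text into overlapping blocks of lines. It is not LangChain's splitter; `chunk_lines`, `max_lines` and `overlap` are hypothetical names chosen for this example.

```python
def chunk_lines(text, max_lines=50, overlap=5):
    """Split `text` into chunks of at most `max_lines` lines.

    Consecutive chunks share `overlap` lines, so some context around
    each chunk boundary is present in both neighbouring chunks.
    """
    if not 0 <= overlap < max_lines:
        raise ValueError("overlap must be >= 0 and smaller than max_lines")
    lines = text.splitlines()
    step = max_lines - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break  # this chunk already reaches the end of the text
    return chunks
```

Line-based chunks keep each log entry intact, which a purely character-based split would not guarantee.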

#### Use the stored Vector store

In [29]:
# load vectorstore from disk
chroma_db = Chroma(persist_directory=v_path_vector_store, embedding_function=embedding_model)

In [30]:
type(chroma_db)

langchain_community.vectorstores.chroma.Chroma

In [31]:
# Define the LLM. The streaming callback prints tokens to stdout as they arrive; omit it and set verbose=False if you only want the returned result
llm = Ollama(model=model,
            verbose=True,
            callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))

print(f"Loaded LLM model {llm.model}")

Loaded LLM model llama3:instruct


In [35]:
# Initialize the RetrievalQA chain with the vector store retriever
retriever = chroma_db.as_retriever(search_kwargs={"k": 2})  # k = number of documents to retrieve per query
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
)

# Use the 'invoke' method to handle the query
result = qa_chain.invoke({"query": question})

What an interesting log file!

After analyzing the logs, I've identified some relevant patterns and extracted potential JSON column names that could facilitate efficient data extraction. Here's a suggested set of columns:

**Event Timestamp**: `timestamp` (UTC format)

**Session ID**: `session_id` (e.g., `9.67.116.99:1047:6`)

**Source IP**: `source_ip` (e.g., `9.67.116.98`)

**Destination IP**: `destination_ip` (e.g., `9.67.116.99`)

**Event Type**: `event_type` (e.g., `PATHDELTA`, `RESVDELTA`, etc.)

**Hop Count**: `hop_count` (extracted from `RSVP_HOP` logs)

**Interface ID**: `interface_id` (extracted from `rsvp_event_mapSession` logs)

**Filter Installation**: `filter_installed` (Boolean value indicating whether a filter was installed or not)

**QoS Request**: `qos_request` (contains information about the Quality of Service request, such as source IP, destination IP, protocol, and reservation details)

Other potential columns:

* `style`: The type of RSVP object (`WF` in this case

#### Use the model

In [37]:
llm_model = Ollama(model=model, verbose=False)  # Disable verbose for batch processing
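For batch processing, the same instruction can be sent once per log chunk and the answers collected in a list. The sketch below assumes this workflow; the function `extract_in_batches` and its parameter `ask_llm` are hypothetical. `ask_llm` stands for any callable that takes a prompt string and returns text, for example `llm_model.invoke`.

```python
def extract_in_batches(chunks, instruction, ask_llm):
    """Send every chunk together with the same instruction and collect the answers.

    `ask_llm` is any callable that maps a prompt string to the model's
    text answer (an assumption of this sketch, e.g. `llm_model.invoke`).
    """
    responses = []
    for chunk in chunks:
        # Same layout as the single-prompt example: instruction first, then the log text.
        prompt = f"{instruction}\nHere is the Log file:\n{chunk}"
        responses.append(ask_llm(prompt))
    return responses
```

Passing the model call in as a parameter keeps the loop easy to try out with a stub function before running it against the real model.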

In [45]:
# Define the instruction and log file prompts
v_instruct = """Prompt:
**Extract Log Data**

Using the provided log files, extract data into the following JSON columns:

1. **timestamp**: Extract the timestamp from each log entry in UTC format.
2. **session_id**: Identify and extract the session ID from each log entry.
3. **source_ip**: Extract the source IP address from each log entry.
4. **destination_ip**: Extract the destination IP address from each log entry.
5. **event_type**: Categorize each log event into a specific type (e.g., `PATHDELTA`, `RESVDELTA`).
6. **hop_count**: Count the number of hops in each RSVP object and extract it as a separate column.
7. **interface_id**: Extract the interface ID from each log entry related to `rsvp_event_mapSession`.
8. **filter_installed**: Indicate whether a filter was installed or not for each relevant log entry.
9. **qos_request**: Extract information about QoS requests, including source IP, destination IP, protocol, and reservation details.

Extract all data and always use this exact structure.
"""

v_prompt = """Here is the Log file:
03/22 08:52:51 INFO   :..........rpapi_Reg_UnregFlow: ReadBuffer:  Entering
 
03/22 08:52:52 INFO   :..........rpapi_Reg_UnregFlow: ReadBuffer:  Exiting
 
03/22 08:52:52 INFO   :..........rpapi_Reg_UnregFlow: RSVPPutActionName:  Result = 0
 
03/22 08:52:52 INFO   :..........rpapi_Reg_UnregFlow: RSVPPutActionName:  Exiting
 
03/22 08:52:52 INFO   :..........rpapi_Reg_UnregFlow: flow[sess=9.67.116.99:1047:6, 
source=9.67.116.98:8000] registered with CLCat2
03/22 08:52:52 EVENT  :..........qosmgr_response: RESVRESP from qosmgr, reason=0, qoshandle=8b671d0
03/22 08:52:52 INFO   :..........qosmgr_response: src-9.67.116.98:8000 dst-9.67.116.99:1047 proto-6
03/22 08:52:52 TRACE  :...........traffic_reader: tc response msg=1, status=1
03/22 08:52:52 INFO   :...........traffic_reader: Reservation req successful[session=9.67.116.99:1047:6,
source=9.67.116.98:8000, qoshd=8b671d0]
20 
03/22 08:52:52 TRACE  :........api_action_sender: constructing a RESV
03/22 08:52:52 TRACE  :........flow_timer_stop: stopped T1
03/22 08:52:52 TRACE  :........flow_timer_stop: Stop T4
03/22 08:52:52 TRACE  :........flow_timer_start: started T1
03/22 08:52:52 TRACE  :........flow_timer_start: Start T4
21 
03/22 08:52:52 TRACE  :.......rsvp_flow_stateMachine: entering state RESVED
22 
03/22 08:53:07 EVENT  :..mailslot_sitter: process received signal SIGALRM
03/22 08:53:07 TRACE  :.....event_timerT1_expire: T1 expired
03/22 08:53:07 INFO   :......router_forward_getOI: Ioctl to query route entry successful
03/22 08:53:07 TRACE  :......router_forward_getOI:         source address:   9.67.116.98
03/22 08:53:07 TRACE  :......router_forward_getOI:         out inf:   9.67.116.98
03/22 08:53:07 TRACE  :......router_forward_getOI:         gateway:   0.0.0.0
03/22 08:53:07 TRACE  :......router_forward_getOI:         route handle:   7f5251c8
03/22 08:53:07 INFO   :......rsvp_flow_stateMachine: state RESVED, event T1OUT
03/22 08:53:07 TRACE  :.......rsvp_action_nHop: constructing a PATH
03/22 08:53:07 TRACE  :.......flow_timer_start: started T1
03/22 08:53:07 TRACE  :......rsvp_flow_stateMachine: reentering state RESVED
03/22 08:53:07 TRACE  :.......mailslot_send: sending to (9.67.116.99:0)
"""

# Combine the instruction and prompt
combined_prompt = v_instruct + "\n" + v_prompt

# Print the instruction and log file prompt
# print(v_instruct)
# print(v_prompt)
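Several of the requested fields sit at fixed positions in each log line, so a regular expression can serve as a deterministic baseline to compare the LLM's answer against. The sketch below is an assumption about the line layout seen in the sample (`MM/DD HH:MM:SS LEVEL :....function: message`); `LINE_RE` and `parse_log` are illustrative names. It also assumes that lines without a timestamp continue the previous entry and that lines holding only a number are stray markers.

```python
import re

# Layout assumed from the sample: "03/22 08:52:52 INFO   :.....function: message"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+)\s*:(?P<dots>\.*)(?P<function>[^:\s]+): ?(?P<message>.*)$"
)


def parse_log(text):
    """Parse log text into a list of dicts, one per timestamped entry."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.isdigit():
            continue  # skip blank lines and bare numbers (assumed stray markers)
        match = LINE_RE.match(line)
        if match:
            entry = match.groupdict()
            # The run of leading dots appears to encode the call depth.
            entry["depth"] = len(entry.pop("dots"))
            entry["message"] = entry["message"].strip()
            entries.append(entry)
        elif entries:
            # No timestamp: treat the line as a continuation of the previous entry.
            entries[-1]["message"] += " " + line
    return entries
```

Running `parse_log(v_prompt)` would also pick up nothing for the introductory "Here is the Log file:" line, since it comes before any timestamped entry.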


In [46]:
# Use the LLM to process the combined prompt
response = llm_model.invoke(combined_prompt)  # 'invoke' replaces the deprecated direct call llm_model(...)

In [47]:
# Print the response
print(response)

Here is the extracted data in JSON format:

```
[
  {
    "timestamp": "2023-03-22T08:52:51Z",
    "session_id": null,
    "source_ip": "9.67.116.98",
    "destination_ip": "9.67.116.99",
    "event_type": "INFO",
    "hop_count": null,
    "interface_id": null,
    "filter_installed": null,
    "qos_request": {
      "source_ip": "9.67.116.98",
      "destination_ip": "9.67.116.99",
      "protocol": 6
    }
  },
  {
    "timestamp": "2023-03-22T08:52:52Z",
    "session_id": "9.67.116.99:1047:6",
    "source_ip": "9.67.116.98",
    "destination_ip": "9.67.116.99",
    "event_type": "INFO",
    "hop_count": null,
    "interface_id": null,
    "filter_installed": null,
    "qos_request": {
      "source_ip": "9.67.116.98",
      "destination_ip": "9.67.116.99",
      "protocol": 6
    }
  },
  {
    "timestamp": "2023-03-22T08:52:52Z",
    "session_id": null,
    "source_ip": null,
    "destination_ip": null,
    "event_type": "EVENT",
    "hop_count": null,
    "interface_id": null,
  

In [48]:
v_output_file = "../data/llm_response.txt"

In [49]:
# Save the response to a text file
with open(v_output_file, 'w', encoding='utf-8') as file:
    file.write(response)
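The response is saved as raw text, and the answer printed above stops in the middle of a JSON object. Before using such output as structured data, it may help to check whether it parses at all. The sketch below assumes the model wraps its JSON in a fenced block, as it did above; `extract_json` and `FENCE` are hypothetical names for this example.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick marker the model puts around its JSON


def extract_json(response):
    """Return the parsed JSON from an LLM answer, or None if it cannot be parsed."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    # Without a complete fenced block, fall back to parsing the whole answer.
    candidate = match.group(1) if match else response
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

A `None` result signals that the answer was incomplete or not valid JSON, so the chunk can be retried or sent again in smaller pieces.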