# IForest anomaly detection on logs

A method that lets the LLM analyze only the most anomalous logs, as selected by an anomaly detection algorithm. The anomaly detection runs on the embedded logs in a 1024-dimensional space.

This notebook is based on an [article](https://qdrant.tech/articles/detecting-coffee-anomalies/) about anomaly detection in a Qdrant vector store. We implement anomaly detection on the embedded vectors to find the most abnormal logs. The end goal is to store the vectors in Qdrant and use the same vector store both for a classic RAG setup and for anomaly detection. However, Qdrant restricts you to its own anomaly detection approach, so in this example we do not store the embeddings in the vector store. Instead, we keep the vectors in a large dataframe during the anomaly detection.

Every log gets an anomaly score, which makes it easy to set a threshold and evaluate all logs above it. One major problem with this method is that it takes a long time to run. The LLM analyzes every log that you decide is anomalous enough, and depending on how effective the anomaly detection algorithm is, that can be a lot of logs. All logs also have to be embedded, which takes a long time with the current embedding models.
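
The thresholding idea can be sketched with NumPy alone. The scores here are synthetic stand-ins for the per-log anomaly scores, `select_above_threshold` is a hypothetical helper (not part of pyod or pythresh), and the z-score cutoff of 3 is an assumption chosen for illustration.

```python
import numpy as np


def select_above_threshold(scores, z_cutoff=3.0):
    """Return indices of scores whose z-score exceeds z_cutoff, most anomalous first.

    Hypothetical helper for illustration; the cutoff value is an assumption.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores are identical, so nothing stands out
        return np.array([], dtype=int)
    z = (scores - scores.mean()) / std
    idx = np.flatnonzero(z > z_cutoff)
    # Sort the selected indices by descending score
    return idx[np.argsort(scores[idx])[::-1]]


# Synthetic scores: 1000 "normal" logs plus two clear outliers at the end
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 0.1, size=1000), [2.0, 3.0]])
selected = select_above_threshold(scores)
print(f"{len(selected)} of {len(scores)} logs would be sent to the LLM")
```

Only the logs at the indices in `selected` would need to be passed to the LLM, which is what keeps the expensive analysis step small.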

In [None]:
### Older dependency versions that are known to work ###
!pip install -q pyod==2.0.1 tensorflow==2.17.0 pythresh==0.3.6

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyod.models.iforest import IForest
from pythresh.thresholds.zscore import ZSCORE
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
import time

The first step is to read your file of logs or fetch the logs from the portal. To get all logs from the portal, see the [data cleaning notebook](data_cleaning.ipynb).

In [None]:
# Read in your pickle file with logs
logs = pd.read_pickle('data/nova_small_eval_set')

# Convert each log row into a single space-separated string
log_list = logs.values.tolist()
log_list_strings = [' '.join(map(str, sublist)) for sublist in log_list]

In this step we embed all logs. This takes a long time if you have a large set of logs.
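
Because embedding is the slow step, it can help to embed in batches and report progress, so that a long run is easier to monitor. The sketch below assumes a hypothetical `embed_fn` that maps a list of strings to a list of vectors. `fake_embed` is only a stand-in so that the example runs without an Ollama server.

```python
import time


def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts batch by batch, printing progress after each batch.

    embed_fn is a hypothetical callable: list[str] -> list[list[float]].
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    start = time.time()
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings.extend(embed_fn(batch))
        print(f"Embedded {len(embeddings)}/{len(texts)} logs "
              f"in {time.time() - start:.1f} seconds")
    return embeddings


def fake_embed(batch):
    """Stand-in embedder that returns a 2-dimensional vector per text."""
    return [[float(len(text)), float(text.count(' '))] for text in batch]


demo_logs = [f"log entry number {n}" for n in range(150)]
demo_vectors = embed_in_batches(demo_logs, fake_embed, batch_size=64)
```

In the notebook itself, `embed_fn` could be a thin wrapper around the embedding model created in the next cell.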

In [None]:
# LLM settings
# Choose your preferred LLM by setting the `model` variable below; it currently defaults to llama3.1:70b

model70b='llama3:70b'
model8b='llama3:8b'
modelphi3mini='phi3:mini'
modelphi3medium='phi3:medium'
model3p1_8 = "llama3.1:8b" 
model3p1_70 = "llama3.1:70b"
modelgemma9 = "gemma2:9b"
modelgemma27 = "gemma2:27b"
modelqwen7="qwen2:7b"
modelqwen = "qwen2"

# Set your preferred model here 👇🏾
model = model3p1_70

# Create the embedding model and the LLM
embed_model = OllamaEmbedding(model_name='mxbai-embed-large', base_url='http://10.129.20.4:9090')
llm = Ollama(model=model, base_url='http://10.129.20.4:9090', request_timeout=360)

# Embed all logs and time the run
start = time.time()
# get_text_embedding_batch is the public batch method of the embedding model
embed_list = embed_model.get_text_embedding_batch(log_list_strings)
df_embed = pd.DataFrame(embed_list)
print(f"Embedding time {time.time() - start:.1f} seconds")

Now we set up the anomaly detection method. In this example we use [IForest](https://pyod.readthedocs.io/en/latest/_modules/pyod/models/iforest.html) from pyod, but it can easily be replaced with another algorithm if preferred. We also plot the result of the anomaly detection, with each point colored by its anomaly score. Keep in mind that this plot may not represent the clusters well, because it only shows 2 of the 1024 dimensions. Finally, we extract the log with the highest anomaly score.
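
One way to make such a plot more representative is to project the embeddings onto their first two principal components, instead of using raw dimensions 0 and 1. This is a suggestion rather than what the notebook does. The sketch uses random data in place of `df_embed`, and `project_to_2d` is a hypothetical helper.

```python
import numpy as np


def project_to_2d(vectors):
    """Project row vectors onto their first two principal components (PCA via SVD)."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T


# Random stand-in for the embedded logs: 200 vectors with 1024 dimensions
rng = np.random.default_rng(123)
fake_embeddings = rng.normal(size=(200, 1024))
points_2d = project_to_2d(fake_embeddings)
print(points_2d.shape)
```

The two columns of `points_2d` could then replace `df_embed.iloc[:,0]` and `df_embed.iloc[:,1]` in the scatter plot, keeping `c = y_scores` for the coloring.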

In [None]:
#### Anomaly detection algorithm
clf_name = 'IForest'
clf = IForest(contamination = 0.001, random_state=123) #contamination should probably be as low as possible
clf.fit(df_embed)
print("Original threshold:", clf.threshold_)
print("max score:", max(clf.decision_scores_))

thres = ZSCORE()
y_scores = clf.decision_scores_  # raw outlier scores (higher score = more likely to be anomaly)
# binary labels (0: inliers, 1: outliers)
y_pred = thres.eval(y_scores)

# Plot the first 2 dimensions (of 1024 in total, so this is not a good visual representation)
figures, ax1 = plt.subplots(1,1)
ax1.scatter(df_embed.iloc[:,0], df_embed.iloc[:,1], c = y_scores)
ax1.set_title('Found outliers by score')

#Find the "largest" anomaly (highest score)
value_to_find = max(y_scores)

# Use numpy.where() to find the position
position = np.where(y_scores == value_to_find)

# Print the position
print("Position of the value:", position[0])

#most abnormal log
anom = logs.iloc[position[0],:] #It should be request id '18ea01d4-10c8-4280-9419-41e8e0b2550d'
anom

Now we use the LLM with a premade prompt to analyze the single log according to the instructions in the prompt. The LLM does not compare the log with other logs in your system; it relies only on what it learned about logs during pretraining. The prompt can be tuned to better analyze the log shown above.

In [None]:
# Simple setup of an LLM explaining the log with the highest score, without context from other logs

# May be changed for your specific use case
prompt = f"""You are a log analysis expert that will analyse a specific log. Based on this log: {anom.to_string()}.
Can you interpret the log and explain why there could be some potential problems with it.
This log differs from the other logs in a dataset, but I would like you to explain what it actually says in a more extensive analysis.
You don't need to explain all variables that are normal. Instead only bring up the parts that seem to be potential issues.
Keep the answer short and the explanations clear."""

response = llm.complete(prompt)
print(response)

### Extra information
#### Time series
This is a starting point for several use cases. One interesting future project would be to apply this approach to more preprocessed data. It could then find more interesting anomalies within, for example, a single UUID, which could give more relevant results. It could also be applied to different types of time series data, where IForest finds abnormal patterns over time. For example, if a request fails and keeps retrying until it finally fails, IForest could potentially find these abnormal retries. [Here](https://medium.com/@pw33392/discover-unusual-patterns-in-time-series-data-with-unsupervised-anomaly-detection-and-isolation-78db408caaed) is an example that shows how IForest finds changes within a pattern, which could be applied to our embedded logs if they are preprocessed for that specific purpose.
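
As a rough sketch of the preprocessing this would need, the logs could be grouped per request ID and turned into simple per-request features, such as the number of retries, before running the anomaly detection. The column names (`request_id`, `timestamp`, `message`) and the log contents below are invented for illustration and do not come from the dataset used in this notebook.

```python
import pandas as pd

# Invented example logs; column names and messages are assumptions
demo = pd.DataFrame({
    "request_id": ["a", "a", "a", "a", "b", "b", "c"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:05",
        "2024-01-01 10:00:10", "2024-01-01 10:00:15",
        "2024-01-01 10:01:00", "2024-01-01 10:01:02",
        "2024-01-01 10:02:00",
    ]),
    "message": ["start", "retry", "retry", "failed", "start", "done", "start"],
})

# Flag retry lines up front so they can be summed per request
demo["is_retry"] = demo["message"].eq("retry")

# One row of features per request
features = demo.groupby("request_id").agg(
    n_logs=("message", "size"),
    n_retries=("is_retry", "sum"),
    duration_s=("timestamp", lambda t: (t.max() - t.min()).total_seconds()),
)
print(features)
```

A feature table like this could then be passed to `IForest` in place of, or alongside, the embeddings.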

#### Embedding model (mxbai-embed-large)
One major limitation at the moment is the embedding model we use. It is trained to capture semantic meaning in natural language, while we want to capture the semantic meaning of one or more logs. To get a more accurate algorithm, we therefore need an embedding model that is specifically trained on and used for our logs. It is important that the embedding model understands the language and sentence structure of a log, which differs from natural language. With a model that handles logs better, we might be able to find abnormalities that go beyond differences in textual pattern.

#### Preprocessing
In this notebook we have not done any deeper analysis of the embedded vectors before the anomaly detection. An important factor for a successful IForest is the quality of the data. We assume the quality is good enough, but in a larger implementation some preprocessing steps may be necessary. Examples of steps that could help the anomaly detection are normalizing the data or restricting the data to a specific UUID.
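
A minimal sketch of the normalization step, assuming that L2 normalization of each embedding vector is the kind of normalization meant here (standardizing each dimension would be another option). The data is random and `l2_normalize` is a hypothetical helper.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row vector to unit length; eps guards against division by zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


# Random stand-in for a few 1024-dimensional embeddings
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 1024)) * 10.0
unit = l2_normalize(raw)
print(np.linalg.norm(unit, axis=1))
```

After this step every vector has the same length, so the anomaly detection would respond to the direction of an embedding rather than its magnitude.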