In [1]:
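# Swap the standard-library sqlite3 module for pysqlite3 before chromadb is imported,
# so that ChromaDB picks up the newer SQLite build bundled with pysqlite3.
__import__('pysqlite3')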
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

from langchain_ollama import OllamaEmbeddings, OllamaLLM
import chromadb
import os
import pickle

In [2]:
llm_model = "llama3.1:8b"

In [3]:
# Initialize the ChromaDB client with persistent storage in the current directory
chroma_client = chromadb.PersistentClient(path=os.path.join(os.getcwd(), "chroma_db"))

In [4]:
# Define a custom embedding function for ChromaDB using Ollama
class ChromaDBEmbeddingFunction:
    """
    Custom embedding function for ChromaDB using embeddings from Ollama.
    """
    def __init__(self, langchain_embeddings):
        self.langchain_embeddings = langchain_embeddings

    def __call__(self, input):
        # Ensure the input is in a list format for processing
        if isinstance(input, str):
            input = [input]
        return self.langchain_embeddings.embed_documents(input)
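
A wrapper like this can be exercised without a running Ollama server by swapping in a stand-in embeddings object. The following is a minimal sketch under stated assumptions: `StubEmbeddings` and `check_embedding_function` are hypothetical helper names, not part of LangChain or ChromaDB, and the hash-based vectors are deterministic but carry no semantic meaning.

```python
import hashlib


class StubEmbeddings:
    """Hypothetical stand-in exposing embed_documents() like a LangChain embeddings object."""

    def __init__(self, dim=8):
        self.dim = dim

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes to floats in [0, 1]; deterministic, not semantic.
            vectors.append([byte / 255 for byte in digest[: self.dim]])
        return vectors


def check_embedding_function(embed_fn, dim):
    """Check that embed_fn accepts a str or a list and returns one vector per text."""
    single = embed_fn("hello")
    batch = embed_fn(["hello", "world"])
    assert len(single) == 1 and len(batch) == 2
    assert all(len(vec) == dim for vec in single + batch)
    # The same text should embed identically in both call styles.
    assert single[0] == batch[0]
    return True
```

In the notebook, this could be called as `check_embedding_function(ChromaDBEmbeddingFunction(StubEmbeddings(dim=8)), dim=8)` before pointing the wrapper at the real Ollama embeddings.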

In [5]:
# Initialize the embedding function with Ollama embeddings.
# Note: the generation model (llm_model) is reused here as the embedding model.
embedding = ChromaDBEmbeddingFunction(
    OllamaEmbeddings(
        model=llm_model,
        base_url="http://localhost:11434"  # Adjust the base URL as per your Ollama server configuration
    )
)

In [6]:
# Define a collection for the RAG workflow
collection_name = "rag_collection_demo_1"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    metadata={"description": "A collection for RAG with Ollama - Demo1"},
    embedding_function=embedding  # Use the custom embedding function
)

In [7]:
# Function to add documents to the ChromaDB collection
def add_documents_to_collection(documents, ids):
    """
    Add documents to the ChromaDB collection.
    
    Args:
        documents (list of str): The documents to add.
        ids (list of str): Unique IDs for the documents.
    """
    collection.add(
        documents=documents,
        ids=ids
    )

In [3]:
with open('./dataset/cwe_explanations_for_rag.pkl','rb') as f:
    docs = pickle.load(f)

In [7]:
print(docs['CWE-1004'])

CWE-1004 is Sensitive Cookie Without 'HttpOnly' Flag
Description: The product uses a cookie to store sensitive information, but the cookie is not marked with the HttpOnly flag.
The HttpOnly flag directs compatible browsers to prevent client-side script from accessing cookies. Including the HttpOnly flag in the Set-Cookie HTTP response header helps mitigate the risk associated with Cross-Site Scripting (XSS) where an attacker's script code might attempt to read the contents of a cookie and exfiltrate information obtained. When set, browsers that support the flag will not reveal the contents of the cookie to a third party via client-side script executed via XSS.
        
Common Consequences:
Confidentiality: Read Application Data - If the HttpOnly flag is not set, then sensitive information stored in the cookie may be exposed to unintended parties.
Integrity: Gain Privileges or Assume Identity - If the cookie in question is an authentication cookie, then not setting the HttpOnly flag may

In [9]:
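# Documents only need to be added once, or whenever an update is required.
# This line is included for demonstration purposes; re-running it with IDs that
# already exist in the collection logs "Insert of existing embedding ID" warnings: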
add_documents_to_collection(list(docs.values()), list(docs.keys()))

Insert of existing embedding ID: CWE-1004
Insert of existing embedding ID: CWE-1007
Insert of existing embedding ID: CWE-102
Insert of existing embedding ID: CWE-1021
Insert of existing embedding ID: CWE-1022
Insert of existing embedding ID: CWE-1023
Insert of existing embedding ID: CWE-1024
Insert of existing embedding ID: CWE-1025
Insert of existing embedding ID: CWE-103
Insert of existing embedding ID: CWE-1037
Insert of existing embedding ID: CWE-1038
Insert of existing embedding ID: CWE-1039
Insert of existing embedding ID: CWE-104
Insert of existing embedding ID: CWE-1041
Insert of existing embedding ID: CWE-1042
Insert of existing embedding ID: CWE-1043
Insert of existing embedding ID: CWE-1044
Insert of existing embedding ID: CWE-1045
Insert of existing embedding ID: CWE-1046
Insert of existing embedding ID: CWE-1047
Insert of existing embedding ID: CWE-1048
Insert of existing embedding ID: CWE-1049
Insert of existing embedding ID: CWE-105
Insert of existing embedding ID: CWE-1
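
The warnings above appear when IDs that are already stored are added again. One way to avoid them is to look up which IDs exist and add only the rest. This is a minimal sketch, assuming `collection` behaves like a ChromaDB collection whose `get(ids=...)` returns a dict with an `"ids"` list; `add_missing_documents` is a hypothetical helper name, not part of ChromaDB.

```python
def add_missing_documents(collection, docs):
    """Add only the documents whose IDs are not yet stored; return the new IDs.

    Assumptions: `collection.get(ids=...)` returns a dict with an "ids" list,
    `collection.add(documents=..., ids=...)` stores new entries, and `docs`
    maps ID -> document text.
    """
    existing = set(collection.get(ids=list(docs.keys()))["ids"])
    new_ids = [doc_id for doc_id in docs if doc_id not in existing]
    if new_ids:
        collection.add(documents=[docs[doc_id] for doc_id in new_ids], ids=new_ids)
    return new_ids
```

In the notebook this would be called as `add_missing_documents(collection, docs)`. If the intent is instead to overwrite stored documents, ChromaDB collections also provide `upsert`, at the cost of re-embedding everything passed in.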

In [10]:
# Function to query the ChromaDB collection
def query_chromadb(query_text, n_results=3):
    """
    Query the ChromaDB collection for relevant documents.
    
    Args:
        query_text (str): The input query.
        n_results (int): The number of top results to return.
    
    Returns:
        tuple: (documents, metadatas). Each is a list containing one inner
        list of results per query text.
    """
    results = collection.query(
        query_texts=[query_text],
        n_results=n_results
    )
    return results["documents"], results["metadatas"]

# Function to interact with the Ollama LLM
def query_ollama(prompt):
    """
    Send a query to Ollama and retrieve the response.
    
    Args:
        prompt (str): The input prompt for Ollama.
    
    Returns:
        str: The response from Ollama.
    """
    llm = OllamaLLM(model=llm_model, temperature=0, num_predict=20000)
    return llm.invoke(prompt)

# RAG pipeline: Combine ChromaDB and Ollama for Retrieval-Augmented Generation
def rag_pipeline(query_text):
    """
    Perform Retrieval-Augmented Generation (RAG) by combining ChromaDB and Ollama.
    
    Args:
        query_text (str): The input query.
    
    Returns:
        str: The generated response from Ollama augmented with retrieved context.
    """
    # Step 1: Retrieve relevant documents from ChromaDB
    retrieved_docs, metadata = query_chromadb(query_text)
    
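    # Combine all retrieved documents into a single context string.
    # collection.query returns one inner list of documents per query text,
    # so the matches for our single query are in retrieved_docs[0].
    if retrieved_docs and retrieved_docs[0]:
        context = "\n\n".join(retrieved_docs[0])  # Join all documents with double newlines
    else:
        context = "No relevant documents found."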

    # Step 2: Send the query along with the context to Ollama
    augmented_prompt = f"Context: {context}\n\nQuestion: {query_text}\nAnswer:"
    print("######## Augmented Prompt ########")
    print(augmented_prompt)

    response = query_ollama(augmented_prompt)
    return response
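
`collection.query` returns `documents` as a list with one inner list of matches per query text, so building the context means flattening that nesting. The sketch below isolates that step so it can be checked without ChromaDB or Ollama; `build_context` and its `max_docs` parameter are hypothetical names introduced for illustration.

```python
def build_context(documents, max_docs=None):
    """Flatten ChromaDB-style nested query results into one context string.

    `documents` is expected to look like results["documents"]: a list with
    one inner list of matching texts per query text.
    """
    # Walk every per-query list and collect the texts in order.
    flat = [doc for per_query in (documents or []) for doc in (per_query or [])]
    if max_docs is not None:
        flat = flat[:max_docs]
    # Fall back to a fixed message when nothing was retrieved.
    return "\n\n".join(flat) if flat else "No relevant documents found."
```

For a single query with three matches, `build_context([["a", "b", "c"]])` joins all three texts with blank lines between them, and an empty result such as `[[]]` yields the fallback message.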

In [11]:
# Define a query to test the RAG pipeline
query = """ 
Classify the CWE class reading points below
1. Unvalidated user input: In several test methods (e.g., `testCssImageNoBaseHref`, `testCssImageWithBaseHref`, etc.), the `$value` parameter is not validated or sanitized before being used to set a property on the `Style` object. This could lead to potential security issues if an attacker were able to inject malicious input.
2. Lack of error handling: In some test methods (e.g., `testOpacity`, `testZIndex`, etc.), the code assumes that the `set_prop()` method will always succeed, but it does not handle cases where the property is invalid or cannot be set. This could lead to unexpected behavior or errors if an attacker were able to inject malicious input.
3. Potential for infinite loops: In the `testWordBreakBreakWord` method, the code sets two properties on the `Style` object (`overflow_wrap` and `word_break`) with values that are not validated or sanitized. If an attacker were able to inject malicious input, it could potentially lead to an infinite loop.
4. Potential for security vulnerabilities in third-party libraries: The Dompdf library itself may have security vulnerabilities that are not addressed by the test code.
""" # Change the query as needed
response = rag_pipeline(query)
print("######## Response from LLM ########\n", response)

######## Augmented Prompt ########
Context: CWE-89
Name: Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')
Description: The product constructs all or part of an SQL command using externally-influenced input from an upstream component, but it does not neutralize or incorrectly neutralizes special elements that could modify the intended SQL command when it is sent to a downstream component. Without sufficient removal or quoting of SQL syntax in user-controllable inputs, the generated SQL query can cause those inputs to be interpreted as SQL instead of ordinary user data.

        
Common Consequences:
Confidentiality: Execute Unauthorized Code or Commands - Adversaries could execute system commands, typically by changing the SQL statement to redirect output to a file that can then be executed.
Confidentiality: Read Application Data - Since SQL databases generally hold sensitive data, loss of confidentiality is a frequent problem with SQL injection vulne