<h2>Policy Evaluation & Gap Identification</h2>
<p>Jay Yong</p>

<p>Import the necessary libraries and packages</p>

In [33]:
from llama_index.llms.openai import OpenAI
from llama_index.core import SimpleDirectoryReader
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.schema import Document
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
import openai
import PyPDF2
import os


<p>Set your OpenAI API key</p>

In [None]:
openai.api_key = ""
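
<p>Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported beforehand in an environment variable; the variable name <code>OPENAI_API_KEY</code> and the helper <code>read_api_key</code> are assumptions and not part of the original notebook:</p>

```python
import os

def read_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage (commented out so the sketch runs without a key):
# openai.api_key = read_api_key()
```

<p>Failing early with a clear error is a design choice here; it avoids a less obvious authentication failure on the first API call.</p>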

<p>Experimenting with policy understanding using LlamaIndex</p>

In [22]:
llm = OpenAI(api_key=openai.api_key)
#llm = OpenAI(api_key=openai.api_key, model="gpt-4o-mini")

# Read a PDF file and return the text of all pages, one page per line break
def extract_pdf_text(cis_pdf_path):
    with open(cis_pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Fall back to an empty string if a page yields no text (e.g. scanned images)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

# Function to create a query engine
def create_query_engine(file_path, index_name):
    # Extract text from the PDF
    policy_text = extract_pdf_text(file_path)

    # Create a document from the extracted text
    doc = Document(text=policy_text)

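    # Build a vector index from the document (from_documents splits it into chunks).
    # Building the index only needs the embedding model; the LLM is used at query time.
    index = VectorStoreIndex.from_documents([doc])

    # Persist the index; persist() writes its files into the given directory
    index.storage_context.persist(persist_dir=index_name)

    # Create a query engine that generates answers with the configured LLM
    return index.as_query_engine(llm=llm)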

In [11]:
# Querying the index
query = "As an IT auditor, review the document and list down the controls on user account."
cis_query_engine = create_query_engine("CIS_Controls_v8.1_Account.pdf", "cis_index")
response = cis_query_engine.query(query)

# Print the response from the query
print(f"Answer: {response}")

Answer: The controls on user account management outlined in the document are as follows:
1. Establish and Maintain an Inventory of Accounts
2. Use Unique Passwords
3. Disable Dormant Accounts
4. Restrict Administrator Privileges to Dedicated Administrator Accounts
5. Establish and Maintain an Inventory of Service Accounts
6. Centralize Account Management


In [23]:
# Testing with dummy policy
query = "As an IT auditor, review the document and list down the policy on user account password."
pan_query_engine = create_query_engine("Pan User Account Policy.pdf", "pan_index")
response = pan_query_engine.query(query)

# Print the response from the query
print(f"Answer: {response}")

Answer: The policy on user account password is that passwords should be 12 characters long, consisting of alphanumeric and special characters.
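
<p>The rule quoted in this answer can also be expressed as a deterministic check. A minimal sketch, assuming "12 characters long" means at least 12 characters and "alphanumeric and special characters" means at least one letter, one digit, and one non-alphanumeric character. Both readings and the function name <code>meets_pan_password_rule</code> are assumptions, since the policy text itself is not shown in this notebook:</p>

```python
def meets_pan_password_rule(password, min_length=12):
    """Check a password against an assumed reading of the Pan password rule."""
    has_letter = any(ch.isalpha() for ch in password)
    has_digit = any(ch.isdigit() for ch in password)
    # Anything that is neither a letter nor a digit counts as "special" here,
    # which includes spaces
    has_special = any(not ch.isalnum() for ch in password)
    return len(password) >= min_length and has_letter and has_digit and has_special

print(meets_pan_password_rule("Tr1cky-Passw0rd!"))  # True
print(meets_pan_password_rule("short1!"))           # False
```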


In [None]:
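# Route queries to either the CIS controls index or the Pan company policy index.
# LLMSingleSelector picks one tool per query, so each answer is drawn from a single index.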
query_engine_tools = [
    QueryEngineTool(
        query_engine = cis_query_engine,
        metadata = ToolMetadata(
            name = "CIS controls",
            description = "CIS controls",
        ),
    ),
    QueryEngineTool(
        query_engine = pan_query_engine,
        metadata = ToolMetadata(
            name = "Pan company policy",
            description = "Pan company policy",
        ),
    )
]

gap_query_engine = RouterQueryEngine(
    selector = LLMSingleSelector.from_defaults(),
    query_engine_tools = query_engine_tools,
    llm=llm,
)

In [None]:
# Identify the policy gaps
query = "As an IT auditor, compare the Pan company policy against the CIS controls and create a list of policy gaps identified in Pan company policy."
response = gap_query_engine.query(query)
print(f"Answer: {response}")

Answer: The Pan company policy lacks clear processes for tracking and managing authorization to credentials for user accounts, disabling and removing dormant accounts, enforcing unique password usage with specific requirements, restricting administrator privileges to dedicated accounts, maintaining inventory of all accounts with necessary details, centralizing account management, automatically logging out users after inactivity, training users to lock screens, implementing Multi-Factor Authentication (MFA) and Single Sign-On (SSO), conducting regular audits for active accounts and service accounts, and reviewing and validating service account purposes on a recurring schedule.
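
<p>An LLM-generated gap list is easier to trust when it can be cross-checked against a fixed checklist. A minimal sketch, assuming the six control names returned by the CIS query earlier in this notebook serve as the checklist; the helper <code>missing_controls</code> and the <code>covered_by_policy</code> list are hypothetical and would be filled in after a manual read of the policy:</p>

```python
# Control names copied from the CIS query answer earlier in this notebook
CIS_ACCOUNT_CONTROLS = [
    "Establish and Maintain an Inventory of Accounts",
    "Use Unique Passwords",
    "Disable Dormant Accounts",
    "Restrict Administrator Privileges to Dedicated Administrator Accounts",
    "Establish and Maintain an Inventory of Service Accounts",
    "Centralize Account Management",
]

def missing_controls(required, covered):
    """Return the required controls not present in covered, keeping the original order."""
    covered_set = set(covered)
    return [control for control in required if control not in covered_set]

# Placeholder: list the controls a reviewer judges the policy to address
covered_by_policy = []

for control in missing_controls(CIS_ACCOUNT_CONTROLS, covered_by_policy):
    print(f"Gap: {control}")
```

<p>With an empty <code>covered_by_policy</code> list, all six controls are reported as gaps, which a reviewer can then compare against the model's answer.</p>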


In [25]:
# Make recommendation
query = "As an IT auditor, recommend how to improve Pan company policy such that the risk of brute force attacks on user accounts is minimised."
response = gap_query_engine.query(query)
print(f"Answer: {response}")

Answer: Recommend implementing a policy that enforces the use of unique and strong passwords for all user accounts. Additionally, enable Multi-Factor Authentication (MFA) for all accounts to add an extra layer of security. Regularly audit and disable dormant accounts to reduce the attack surface for potential brute force attacks. Educate users on the importance of password security, encourage the use of password manager applications, and ensure automatic logout after a period of inactivity. These measures will help minimize the risk of brute force attacks on user accounts.


<p>Log analysis of a brute force attack</p>

In [27]:
# Reader catalogue: https://llamahub.ai/l/readers/llama-index-readers-file?from=all
# Only CSVReader is used below; the same package also provides readers such as
# PDFReader, DocxReader, MarkdownReader, and PandasCSVReader.
from llama_index.readers.file import CSVReader

# Load the files under ./logs, parsing .csv files with CSVReader
parser = CSVReader()
file_extractor = {".csv": parser}  # Map other extensions to readers as needed
documents = SimpleDirectoryReader(
    "./logs", file_extractor=file_extractor
).load_data()
log_index = VectorStoreIndex.from_documents(documents)

# Optionally persist the index to disk:
# log_index.storage_context.persist(persist_dir="./csv_index")

# Pass the LLM to the query engine, where the answers are generated
log_query_engine = log_index.as_query_engine(llm=llm)


2025-10-19 22:58:59,464 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
query = "As a cyber incident investigator, determine whether this is a brute force attack."
response = log_query_engine.query(query)
print(f"Answer: {response}")

2025-10-19 22:59:04,482 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-19 22:59:05,307 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Answer: This appears to be a brute force attack based on the repeated failed password attempts from the same source IP address for the same username within a short period of time, leading to a disconnection due to too many authentication failures.
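
<p>The pattern the answer describes (repeated failed password attempts from one source IP for one username within a short period) can be corroborated with a deterministic count. A minimal sketch, assuming a CSV layout with <code>timestamp</code>, <code>source_ip</code>, <code>username</code>, and <code>event</code> columns. The column names, the sample rows, the threshold of 5 failures, and the 60-second window are all made up for illustration; the real files under <code>./logs</code> may look different:</p>

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up sample rows; the real files under ./logs may use a different schema
SAMPLE_LOG = """timestamp,source_ip,username,event
2025-01-01 10:00:00,203.0.113.5,alice,failed_password
2025-01-01 10:00:03,203.0.113.5,alice,failed_password
2025-01-01 10:00:06,203.0.113.5,alice,failed_password
2025-01-01 10:00:09,203.0.113.5,alice,failed_password
2025-01-01 10:00:12,203.0.113.5,alice,failed_password
2025-01-01 10:05:00,198.51.100.7,bob,failed_password
2025-01-01 10:06:00,198.51.100.7,bob,accepted_password
"""

def find_failure_bursts(csv_text, threshold=5, window=timedelta(seconds=60)):
    """Return (source_ip, username) pairs with >= threshold failures inside window."""
    failures = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["event"] == "failed_password":
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            failures[(row["source_ip"], row["username"])].append(ts)

    flagged = []
    for key, times in failures.items():
        times.sort()
        # Check every run of `threshold` consecutive failures in the sorted timestamps
        for start in range(len(times) - threshold + 1):
            if times[start + threshold - 1] - times[start] <= window:
                flagged.append(key)
                break
    return flagged

print(find_failure_bursts(SAMPLE_LOG))  # [('203.0.113.5', 'alice')]
```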


In [29]:
# Add the brute force logs as a third tool alongside the CIS controls and Pan company policy.
# The single selector still routes each query to only one of the three tools.
val_query_engine_tools = [
    QueryEngineTool(
        query_engine = cis_query_engine,
        metadata = ToolMetadata(
            name = "CIS controls",
            description = "CIS controls",
        ),
    ),
    QueryEngineTool(
        query_engine = pan_query_engine,
        metadata = ToolMetadata(
            name = "Pan company policy",
            description = "Pan company policy",
        ),
    ),
    QueryEngineTool(
        query_engine = log_query_engine,
        metadata = ToolMetadata(
            name = "brute force logs",
            description = "brute force logs",
        ),
    )
]

validation_query_engine = RouterQueryEngine(
    selector = LLMSingleSelector.from_defaults(),
    query_engine_tools = val_query_engine_tools,
    llm=llm,
)

In [31]:
query = "As an IT auditor, examine the brute force logs and determine what are the policy gaps in Pan company policy that allowed the brute force attack to occur. List down the policy gaps with the supporting evidence in Pan company policy."
response = validation_query_engine.query(query)
print(f"Answer: {response}")

2025-10-19 23:05:08,290 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-19 23:05:08,301 - INFO - Selecting query engine 1: Pan company policy is directly related to examining policy gaps and determining what allowed the brute force attack to occur..
2025-10-19 23:05:08,721 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-19 23:05:10,480 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Answer: The policy gaps in Pan company policy that allowed the brute force attack to occur are as follows:

1. Lack of account lockout duration: The policy does not specify a time duration for the lockout after 12 consecutive failed attempts, potentially allowing attackers to repeatedly attempt to gain access without a significant delay.
2. Absence of multi-factor authentication: The policy does not mention the implementation of multi-factor authentication, which could have added an extra layer of security to prevent unauthorized access.
3. Limited password complexity requirements: While the policy mandates passwords to be 12 characters long and include alphanumeric and special characters, it does not specify requirements such as prohibiting common passwords or enforcing a minimum number of character types, which could have made the passwords more secure against brute force attacks.
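
<p>The first gap, a lockout threshold without a lockout duration, can be illustrated with a small tracker. A minimal sketch, assuming the 12-failure threshold mentioned in the answer; the 15-minute duration and the class name <code>LockoutTracker</code> are hypothetical, since the policy specifies no duration:</p>

```python
from datetime import datetime, timedelta

class LockoutTracker:
    """Lock an account for a fixed duration after max_failures consecutive failures."""

    def __init__(self, max_failures=12, lockout=timedelta(minutes=15)):
        self.max_failures = max_failures
        self.lockout = lockout  # hypothetical value; the policy gives none
        self.failures = 0
        self.locked_until = None

    def is_locked(self, now):
        return self.locked_until is not None and now < self.locked_until

    def record_attempt(self, success, now):
        """Record a login attempt and return True only if the login is accepted."""
        if self.is_locked(now):
            return False  # rejected regardless of the password
        if success:
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.max_failures:
            # Start the lockout and reset the counter for the next cycle
            self.locked_until = now + self.lockout
            self.failures = 0
        return False

tracker = LockoutTracker()
start = datetime(2025, 1, 1, 10, 0, 0)
for second in range(12):
    tracker.record_attempt(False, start + timedelta(seconds=second))
print(tracker.is_locked(start + timedelta(seconds=12)))  # True
```

<p>With a duration in place, an attacker gets at most 12 guesses per lockout period under these assumptions, which bounds the guess rate regardless of how fast requests arrive.</p>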


<h1>Completed our CI project?</h1>