# Cyber Developer Day 2024
## Introduction

TODO


### Environment Setup

The following code blocks set up the environment variables and imports used in the rest of the notebook.

In [1]:
%load_ext autoreload
%aimport -logging
%autoreload 2

# Ensure that the morpheus directory is in the python path. This may not need to be run depending on the environment setup
import sys
import os

if ("MORPHEUS_ROOT" not in os.environ):
    os.environ["MORPHEUS_ROOT"] = os.path.abspath("../../..")

llm_dir = os.path.abspath(os.path.join(os.getenv("MORPHEUS_ROOT", "../../.."), "examples", "cyber_dev_day"))

if (llm_dir not in sys.path):
    sys.path.append(llm_dir)

Ensure the necessary environment variables are set. As a last resort, try to load them from a `.env` file.


In [2]:
# Ensure that the current environment is set up with API keys
required_env_vars = [
    "MORPHEUS_ROOT",
    "NGC_API_KEY",
    "NVIDIA_API_KEY",
]

if (not all([var in os.environ for var in required_env_vars])):

    # Try loading an .env file if it exists
    from dotenv import load_dotenv

    load_dotenv()

    # Check again and report only the variables that are still missing
    missing_env_vars = [var for var in required_env_vars if var not in os.environ]
    if (missing_env_vars):
        raise ValueError(f"Please set the following environment variables: {missing_env_vars}")


In [3]:
# Global imports
import os
import sys
import cudf
import pandas as pd

Configure logging to allow Morpheus messages to appear in the notebook.

In [4]:
# Configure logging
import logging
import cyber_dev_day

# Create a logger for this module. Use the cyber_dev_day module name because the notebook will just be __main__
logger = logging.getLogger(cyber_dev_day.__name__)

# Configure the root logger log level
logger.root.setLevel(logging.INFO)

In [5]:
# Test the logger out. You should see the log message in the output
logger.info("Successfully configured logging!")

Successfully configured logging!


## Part 1 - Intro to Interacting with LLMs

This section covers how to integrate LLMs into code, using GUI or Python-based examples.

In [6]:
from nemollm.api import NemoLLM
from pprint import pprint

# If you do not specify an api_key or an org_id, it will use environment variables
conn = NemoLLM()

response = conn.generate(
  prompt="Message Mason saying that we should compete in next week's football tournament, and tell him that the winner will get $100.",
  model="gpt-43b-002",
  customization_id="f4cd7435-def4-41e5-835b-9febb679ce01",
  stop=[],
  tokens_to_generate=150,
  temperature=1.0,
  top_k=1,
  top_p=0.0,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0,
)
pprint(response)

Request succeeded with HTTP Status Code 200 OK Full response: {"text":" Hey Mason,  I think we should compete in the next week's football tournament. I think it would be fun and the winner will get $100. I think you are a good player and I am sure you will win. I will be waiting for your response.  From,  John.","cumlogprobs":-46.16297,"prompt_labels":[{"class_name":"nontoxic","score":0.99038994}],"completion_labels":[{"class_name":"nontoxic","score":0.9922271}]}

{'completion_labels': [{'class_name': 'nontoxic', 'score': 0.9922271}],
 'cumlogprobs': -46.16297,
 'prompt_labels': [{'class_name': 'nontoxic', 'score': 0.99038994}],
 'text': " Hey Mason,  I think we should compete in the next week's football "
         'tournament. I think it would be fun and the winner will get $100. I '
         'think you are a good player and I am sure you will win. I will be '
         'waiting for your response.  From,  John.'}


Try some for yourself! Here are a few ideas to try out:

1. Try out different prompts and see how the model responds
   1. Are there prompts that work better than others?
2. Try out different models and see how they compare
   1. Are there models that work better than others?
   2. You can use any one of these models: `["gpt-8b-000", "gpt-20b", "gpt-43b-002"]`
3. Try out different temperature values and see how they compare
   1. Are there temperature values that work better than others?
   2. What happens when you set the temperature to 0? What about 1?
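
To build intuition for the temperature experiment in item 3, the following is a minimal sketch of how temperature reshapes next-token probabilities. It is an illustration only, not the NeMo service's implementation: the function name `softmax_with_temperature` and the toy scores are assumptions made for this example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, sharpened or flattened by `temperature`."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]

    scaled = [score / temperature for score in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.5]

for temp in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across the alternatives. Note that the request above also sets `top_k=1`, which limits sampling to the single most likely token, so you will probably need to raise `top_k` before a temperature change has a visible effect.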

## Part 2 - Prototyping

Building a Pipeline with LLMs to Analyze CVEs

### Overview

Background of RAG, data sources, building vector databases
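
As a rough illustration of the retrieval step behind RAG, the sketch below embeds a few documents, finds the one closest to a question, and places it in a prompt. Everything here is a toy stand-in: `toy_embed` is a hypothetical bag-of-words replacement for a real embedding model, and the in-memory list plays the role of the vector database. The pipeline later in this notebook uses a sentence-transformers model and FAISS instead.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(documents, key=lambda doc: cosine_similarity(query_vec, toy_embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Place the retrieved context ahead of the question, as a RAG prompt does."""
    background = "\n".join(f"- {c}" for c in contexts)
    return (f"Given the following background information:\n{background}\n\n"
            f"Please answer the following question:\n{query}")


documents = [
    "the logging module configures the root logger",
    "the pipeline reads dataframes from an in memory source",
    "the vector database stores embeddings of source code",
]

question = "where are embeddings of source code stored"
contexts = retrieve(question, documents, k=1)
print(build_prompt(question, contexts))
```

The LLM only sees what retrieval returns, so the quality of the answer depends on whether the relevant text was indexed and ranked highly.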

### Building the Vector Database

In [7]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from cyber_dev_day.embeddings import create_code_embedding

# Create the embedding object that will be used to generate the embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs={"device":"cuda"}, show_progress=True)

# Create a vector database of the code using the supplied embedding function. The returned value will be a
# FaissVectorDatabase object.
# NOTE: This may take a few minutes to run.
faiss_vdb = create_code_embedding(code_dir=os.getenv("MORPHEUS_ROOT"), embedding=embeddings, include_notebooks=False, exclude=[
    ".cache/**/*.py",
    "build*/**/*.py"
])

# Save the vector database to disk
code_faiss_dir = os.path.join(os.getenv("MORPHEUS_ROOT"), ".tmp", "morpheus_code_faiss")

faiss_vdb.save_local(code_faiss_dir)


Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
Generating embedding for source code in /workspace
Total 1028 source code documents in 623 files.
Creating embeddings...


Batches:   0%|          | 0/166 [00:00<?, ?it/s]

Loading faiss with AVX2 support.
Successfully loaded faiss with AVX2 support.
Creating embeddings... Complete


### Running a RAG Pipeline with Morpheus

How do we utilize the Vector Database to answer questions?

#### Morpheus Overview

Quick overview of what Morpheus is and how to use it.

In [8]:
from morpheus.config import Config, PipelineModes

# Create the pipeline config
pipeline_config = Config()
pipeline_config.mode = PipelineModes.OTHER

#### Building a Morpheus RAG Pipeline

Below, we will build a pipeline that uses Morpheus to answer questions about the code in the repository that we created a vector database for. This works by using the `LLMEngine` in Morpheus with a `RAGNode`.

In [9]:
from textwrap import dedent
import cudf
from cyber_dev_day.config import EngineConfig, LLMModelConfig, NVFoundationLLMModelConfig
from morpheus._lib.llm import LLMEngine
from morpheus.llm.nodes.extracter_node import ExtracterNode
from morpheus.llm.nodes.rag_node import RAGNode
from morpheus.llm.services.llm_service import LLMService
from morpheus.llm.task_handlers.simple_task_handler import SimpleTaskHandler
from morpheus.messages import ControlMessage

from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.service.vdb.faiss_vdb_service import FaissVectorDBService
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.utils.concat_df import concat_dataframes

def _build_rag_llm_engine(model_config: LLMModelConfig):

    engine = LLMEngine()

    engine.add_node("extracter", node=ExtracterNode())

    prompt = dedent("""
    You are a helpful assistant. Given the following background information:
    {% for c in contexts -%}
    Source File: {{ c.metadata.source }}
    Source File Language: {{ c.metadata.language }}
    Source Content:
    ```
    {{ c.page_content }}
    ```
    {% endfor %}

    Please answer the following question:
    {{ query }}
    """).strip("\n")

    vector_service = FaissVectorDBService(code_faiss_dir, embeddings=embeddings)

    vdb_resource = vector_service.load_resource()

    llm_service = LLMService.create(model_config.service.type, **model_config.service.model_dump(exclude={"type"}))

    llm_client = llm_service.get_client(**model_config.model_dump(exclude={"service"}))

    # Async wrapper around embeddings
    async def calc_embeddings(texts: list[str]) -> list[list[float]]:
        return embeddings.embed_documents(texts)

    engine.add_node("rag",
                    inputs=["/extracter"],
                    node=RAGNode(prompt=prompt,
                                 vdb_service=vdb_resource,
                                 embedding=calc_embeddings,
                                 llm_client=llm_client))

    engine.add_task_handler(inputs=["/rag"], handler=SimpleTaskHandler())

    return engine

async def run_rag_pipeline(p_config: Config, model_config: LLMModelConfig, question: str):
    source_dfs = [
        cudf.DataFrame({"questions": [question]}),
    ]

    completion_task = {"task_type": "completion", "task_dict": {"input_keys": ["questions"], }}

    pipe = LinearPipeline(p_config)

    pipe.set_source(InMemorySourceStage(p_config, dataframes=source_dfs))

    pipe.add_stage(
        DeserializeStage(p_config, message_type=ControlMessage, task_type="llm_engine", task_payload=completion_task))

    pipe.add_stage(
        LLMEngineStage(p_config,
                        engine=_build_rag_llm_engine(model_config)))

    sink = pipe.add_stage(InMemorySinkStage(p_config))

    await pipe.run_async()

    messages = sink.get_messages()
    responses = concat_dataframes(messages)

    # The responses are quite long, so disable the truncation that pandas and cudf normally
    # apply to the output
    pd.set_option('display.max_colwidth', None)
    print("Response:\n%s" % (responses['response'].iloc[0], ))




In [10]:
model_config = NVFoundationLLMModelConfig.model_validate({
    "service": {
        "type": "nvfoundation", "api_key": None
    },
    "model_name": "mixtral_8x7b",
    "temperature": 0.0
})

# Run the Pipeline
await run_rag_pipeline(pipeline_config, model_config, "Does the code repo import the `pyarrow_hotfix` package from the `morpheus` root package?")


====Pipeline Pre-build====
====Pre-Building Segment: linear_segment_0====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Pipeline Started====
====Building Segment: linear_segment_0====
Added source: <from-mem-0; InMemorySourceStage(dataframes=[                                           questions
0  Does the code repo import the `pyarrow_hotfix`...], repeat=1)>
  └─> morpheus.MessageMeta
Added stage: <deserialize-1; DeserializeStage(ensure_sliceable_index=True, message_type=<class 'morpheus._lib.messages.ControlMessage'>, task_type=llm_engine, task_payload={'task_type': 'completion', 'task_dict': {'input_keys': ['questions']}})>
  └─ morpheus.MessageMeta -> morpheus.ControlMessage
Added stage: <llm-engine-2; LLMEngineStage(engine=<morpheus._lib.llm.LLMEngine object at 0x7f3bdc111d70>)>
  └─ morpheus.C

I20240311 03:39:36.869108  6834 asyncio_runnable.hpp:246] AsyncioRunnable::run() > Creating new event loop
I20240311 03:39:36.869587  6834 asyncio_runnable.hpp:259] AsyncioRunnable::run() > Calling run_until_complete() on main_task()


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

====Pipeline Complete====
Response:
Yes, the `pyarrow_hotfix` package is imported in the `/workspace/morpheus/__init__.py` file, which is the root package of the Morpheus library. The import statement `import pyarrow_hotfix` can be found near the end of the file.


### RAG Limitations

Using the pipeline we built, we can now ask questions about the code in the repository, and the LLM will be able to use the vector database to answer them. However, what happens if we need to ask questions whose answers are not in the vector database? For example, what if we needed to ask questions about the dependencies that the code uses? Would the LLM be able to answer these questions? Let's try it out by re-running our RAG pipeline with a more complex question:



In [11]:
# Run the Pipeline
await run_rag_pipeline(pipeline_config, model_config, "Does the code repo use `langchain` functions which are deprecated?")

====Pipeline Pre-build====
====Pre-Building Segment: linear_segment_0====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Building Segment: linear_segment_0====
====Pipeline Started====
Added source: <from-mem-4; InMemorySourceStage(dataframes=[                                                            questions
0  Does the code repo use `langchain` functions which are deprecated?], repeat=1)>
  └─> morpheus.MessageMeta
Added stage: <deserialize-5; DeserializeStage(ensure_sliceable_index=True, message_type=<class 'morpheus._lib.messages.ControlMessage'>, task_type=llm_engine, task_payload={'task_type': 'completion', 'task_dict': {'input_keys': ['questions']}})>
  └─ morpheus.MessageMeta -> morpheus.ControlMessage
Added stage: <llm-engine-6; LLMEngineStage(engine=<morpheus._lib.llm.LLMEngine object at

I20240311 03:39:37.368492  6841 asyncio_runnable.hpp:246] AsyncioRunnable::run() > Creating new event loop
I20240311 03:39:37.375123  6841 asyncio_runnable.hpp:259] AsyncioRunnable::run() > Calling run_until_complete() on main_task()


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

====Pipeline Complete====
Response:
Based on the provided source files, I cannot determine if the code repo uses `langchain` functions that are deprecated. The `langchain_fixture` function in the `/workspace/tests/examples/llm/common/conftest.py` file only checks if `langchain` is installed and imports it if it is. It does not use any specific functions or methods from the `langchain` package.

To determine if `langchain` functions are being used and if they are deprecated, you would need to examine the other source files in the code repo where `langchain` is imported and used. You can check the documentation for the version of `langchain` being used to see if any of the functions or methods being used are deprecated.


It's likely that the model was not able to determine the answer to this question because it would need additional information. Depending on the model used, you might see output similar to:
```
Based on the provided source code, I cannot determine if the `langchain` functions used in the code are deprecated or not.

To determine if `langchain` functions are deprecated, you would need to check the documentation for the specific functions being used in the code or consult the maintainers of the `langchain` package.
```

How would we go about solving this problem?

#### Answering Complex Questions with RAG + LLM Agents

Intro to LLM Agents and how our checklist + agents approach can help solve multi-step problems.
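
To make the agent idea concrete, here is a minimal sketch of the Thought/Action/Observation loop that an agent runs. It is not the implementation used by this notebook's pipeline: `scripted_llm` is a hypothetical stand-in for a real model call, and `TOY_SBOM` and `sbom_package_checker` are invented for this example. In a checklist-based approach, each checklist item would presumably be handed to a loop like this one.

```python
from typing import Callable

# Hypothetical tool data: packages listed in a toy software bill of materials
TOY_SBOM = {"openssl", "python", "numpy"}


def sbom_package_checker(package: str) -> str:
    """Report whether a package appears in the toy SBOM."""
    return str(package.strip().lower() in TOY_SBOM)


TOOLS: dict[str, Callable[[str], str]] = {"SBOM Package Checker": sbom_package_checker}


def scripted_llm(transcript: str) -> str:
    """Stand-in for a real LLM call: picks the next step from what it has already observed."""
    if "Observation:" not in transcript:
        return "Action: SBOM Package Checker\nAction Input: openssl"
    return "Final Answer: openssl is present in the container, so the CVE needs further review."


def run_agent(question: str, llm: Callable[[str], str], max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()

        # Parse the requested tool call and feed its result back to the model
        fields = dict(line.split(": ", 1) for line in reply.splitlines())
        observation = TOOLS[fields["Action"]](fields["Action Input"])
        transcript += f"Observation: {observation}\n"

    return "Stopped: step limit reached"


print(run_agent("Is the container affected by a CVE in openssl?", scripted_llm))
```

The step limit matters in practice: a model that keeps requesting tools without reaching a final answer would otherwise loop indefinitely.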

### Running the CVE Pipeline with Morpheus

#### The Engine Config

In [12]:
from cyber_dev_day.config import EngineConfig

# Create the engine configuration
engine_config = EngineConfig.model_validate({
    'checklist': {
        'model': {
            'service': {
                'type': 'nemo', 'api_key': None, 'org_id': None
            },
            'model_name': 'gpt-43b-002',
            'customization_id': None,
            'temperature': 0.0,
            'tokens_to_generate': 300
        }
    },
    'agent': {
        'model': {
            'service': {
                'type': 'nvfoundation', 'api_key': None
            }, 'model_name': 'mixtral_8x7b', 'temperature': 0.0
        },
        'sbom': {
            'data_file': ''
        },
        'code_repo': {
            'faiss_dir': code_faiss_dir,
            'embedding_model_name': "sentence-transformers/all-mpnet-base-v2"
        }
    }
})

In [13]:
# Print the current configuration object
print(engine_config.model_dump_json(indent=2))

{
  "checklist": {
    "model": {
      "service": {
        "type": "nemo",
        "api_key": null,
        "org_id": null
      },
      "model_name": "gpt-43b-002",
      "customization_id": null,
      "temperature": 0.0,
      "tokens_to_generate": 300
    }
  },
  "agent": {
    "model": {
      "service": {
        "type": "nvfoundation",
        "api_key": null
      },
      "model_name": "mixtral_8x7b",
      "temperature": 0.0
    },
    "sbom": {
      "data_file": ""
    },
    "code_repo": {
      "faiss_dir": "/workspace/.tmp/morpheus_code_faiss",
      "embedding_model_name": "sentence-transformers/all-mpnet-base-v2"
    }
  }
}


#### The Pipeline Function

In [14]:
import time
import cudf
from cyber_dev_day.pipeline_utils import build_cve_llm_engine
from morpheus.messages import ControlMessage
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.utils.concat_df import concat_dataframes

async def run_cve_pipeline(p_config: Config, e_config: EngineConfig, input_cves: list[str]):
    source_dfs = [
        cudf.DataFrame({
            "cve_info": input_cves
        })
    ]

    completion_task = {"task_type": "completion", "task_dict": {"input_keys": ["cve_info"], }}

    pipe = LinearPipeline(p_config)

    pipe.set_source(InMemorySourceStage(p_config, dataframes=source_dfs))

    pipe.add_stage(DeserializeStage(p_config, message_type=ControlMessage, task_type="llm_engine", task_payload=completion_task))

    pipe.add_stage(LLMEngineStage(p_config, engine=build_cve_llm_engine(e_config)))

    sink = pipe.add_stage(InMemorySinkStage(p_config))

    start_time = time.time()

    await pipe.run_async()

    messages = sink.get_messages()
    responses = concat_dataframes(messages)

    logger.info("Pipeline complete in %.2f seconds", time.time() - start_time)

    print("Pipeline complete. Received %s responses:\n%s" % (len(messages), responses['response']))


In [15]:
# Now run the pipeline with a specified CVE description
await run_cve_pipeline(pipeline_config, engine_config, [
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
])

Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
====Pipeline Pre-build====
====Pre-Building Segment: linear_segment_0====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Pipeline Started====
====Building Segment: linear_segment_0====
Added source: <from-mem-8; InMemorySourceStage(dataframes=[                                                                                                                                                                       cve_info
0  An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops.], repeat=1)>
  └─> morpheus.MessageMeta
Added stage: <deserialize-9; DeserializeStage(ensure_sliceable_index=True, message_type=<class 'mo

  warn_deprecated(
I20240311 03:39:42.666868  6850 asyncio_runnable.hpp:246] AsyncioRunnable::run() > Creating new event loop
I20240311 03:39:42.680075  6850 asyncio_runnable.hpp:259] AsyncioRunnable::run() > Calling run_until_complete() on main_task()
  warn_deprecated(




> Entering new AgentExecutor chain...
I need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel
Observation: False
Thought: The SBOM Package Checker returned False, which means the Linux kernel is not present in the Docker container's software bill of materials. Therefore, I cannot determine if the container's Linux kernel version is affected by the issue. However, I can still check if the vulnerable file is present in the container.

Action: Docker Container Code QA System
Action Input: drivers/media/dvb-core/dvbdev.c
Observation: The string "drivers/media/dvb-core/dvbdev.c" appears to be a part of a CVE (Common Vulnerabilities and Exposures) information, specifically a use-after-free vulnerability found in the Linux kernel through version 6.0.9. This issue is related to t

### Hitting the Limits of the LLMs

This section will go over some of the areas where LLMs may fail.

In [16]:
# Example to cause the LLM to fail (TBD)

## Part 3 - Beyond Prototyping

This section will focus on refining the prototype to fix any gaps, improve the accuracy, and create a production-ready pipeline.

### Improving the Model

In [17]:
# Code for improving the model
engine_config_custom_model = engine_config.model_copy(deep=True)

# Set the customization ID
# engine_config_custom_model.checklist.model.customization_id = "<CUSTOMIZATION_ID>"

# Run the pipeline
await run_cve_pipeline(pipeline_config, engine_config_custom_model, [
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
])

Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
====Pipeline Pre-build====
====Pre-Building Segment: linear_segment_0====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Pipeline Started====
====Building Segment: linear_segment_0====
Added source: <from-mem-12; InMemorySourceStage(dataframes=[                                                                                                                                                                       cve_info
0  An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops.], repeat=1)>
  └─> morpheus.MessageMeta
Added stage: <deserialize-13; DeserializeStage(ensure_sliceable_index=True, message_type=<class '

I20240311 03:39:46.376272  6858 asyncio_runnable.hpp:246] AsyncioRunnable::run() > Creating new event loop
I20240311 03:39:46.377018  6858 asyncio_runnable.hpp:259] AsyncioRunnable::run() > Calling run_until_complete() on main_task()


[32;1m[1;3mI need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel[0m
Observation: [36;1m[1;3mFalse[0m
Thought:[32;1m[1;3mThe SBOM Package Checker returned False, which means the Linux kernel is not present in the Docker container's software bill of materials. Therefore, I cannot determine if the container's Linux kernel version is affected by the issue. However, I can still check if the vulnerable file is present in the container.

Action: Docker Container Code QA System
Action Input: drivers/media/dvb-core/dvbdev.c[0m
Observation: [33;1m[1;3mThe string "drivers/media/dvb-core/dvbdev.c" appears to be a part of a CVE (Common Vulnerabilities and Exposures) information, specifically a use-after-free vulnerability found in the Linux kernel through version 6.0.9. This issue is related to the `dvb_register_device` function that dynamical

### Batching Multiple Requests

This section will show how to run multiple requests simultaneously to improve throughput.
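
Before running the pipeline on a batch, the following sketch shows why concurrency helps when each request spends most of its time waiting on a remote service. It is a simplified illustration, not how Morpheus schedules work internally: `fake_llm_request` is a hypothetical stand-in that only sleeps. The sketch is written as a standalone script. Inside this notebook, where an event loop is already running, you would use `await run_batch(...)` instead of `asyncio.run(...)`.

```python
import asyncio
import time


async def fake_llm_request(cve_info: str, latency_s: float = 0.1) -> str:
    """Hypothetical stand-in for a network-bound LLM call."""
    await asyncio.sleep(latency_s)
    return f"checked: {cve_info}"


async def run_batch(cve_infos: list[str]) -> list[str]:
    # gather() schedules every request at once and preserves input order in the results
    return await asyncio.gather(*(fake_llm_request(info) for info in cve_infos))


start = time.perf_counter()
results = asyncio.run(run_batch([f"CVE description {i}" for i in range(5)]))
elapsed = time.perf_counter() - start

print(results)
print(f"5 requests finished in {elapsed:.2f}s")
```

Because the five waits overlap, the batch should finish in roughly the time of a single request rather than five times that.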

In [18]:
# Code for batching multiple requests
await run_cve_pipeline(pipeline_config, engine_config_custom_model, [
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
] * 5)

Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
====Pipeline Pre-build====
====Pre-Building Segment: linear_segment_0====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Pipeline Started====
====Building Segment: linear_segment_0====
Added source: <from-mem-16; InMemorySourceStage(dataframes=[                                                                                                                                                                       cve_info
0  An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops.
1  An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_regist

I20240311 03:39:50.082346  6867 asyncio_runnable.hpp:246] AsyncioRunnable::run() > Creating new event loop
I20240311 03:39:50.083042  6867 asyncio_runnable.hpp:259] AsyncioRunnable::run() > Calling run_until_complete() on main_task()


[32;1m[1;3mI need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel[0m
Observation: [36;1m[1;3mFalse[0m
Thought:[32;1m[1;3mI need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel[0m[32;1m[1;3mI need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel[0m[32;1m[1;3mI need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel[0m[32;1m[1;3mThe SBOM Package Checker returned False, which means the Linux kernel is n

### Creating a Microservice

This section will show how to convert the development pipeline into a microservice that can handle requests.

In [19]:
# Code for creating a microservice
import time
import cudf
from cyber_dev_day.pipeline_utils import build_cve_llm_engine
from morpheus.messages import ControlMessage
from morpheus.messages import MessageMeta
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.pipeline.stage_decorator import stage
from morpheus.stages.input.http_server_source_stage import HttpServerSourceStage
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.utils.concat_df import concat_dataframes
from morpheus.utils.http_utils import HTTPMethod

async def run_cve_pipeline_microservice(p_config: Config, e_config: EngineConfig):

    completion_task = {"task_type": "completion", "task_dict": {"input_keys": ["cve_info"], }}

    pipe = LinearPipeline(p_config)

    # expected payload is:
    # [{"cve_info": <string>},
    #  {"cve_info": <string>},]
    pipe.set_source(
        HttpServerSourceStage(p_config,
                                bind_address="0.0.0.0",
                                port=26302,
                                endpoint="/scan",
                                method=HTTPMethod.POST))

    @stage
    def print_payload(payload: MessageMeta) -> MessageMeta:
        serialized_str = payload.df.to_json(orient='records', lines=True)

        logger.info("======= Got Request =======\n%s\n===========================", serialized_str)

        return payload

    pipe.add_stage(print_payload(config=p_config))

    pipe.add_stage(DeserializeStage(p_config, message_type=ControlMessage, task_type="llm_engine", task_payload=completion_task))

    pipe.add_stage(LLMEngineStage(p_config, engine=build_cve_llm_engine(e_config)))

    sink = pipe.add_stage(InMemorySinkStage(p_config))

    start_time = time.time()

    await pipe.run_async()

    messages = sink.get_messages()
    responses = concat_dataframes(messages)

    logger.info("Pipeline complete")

    print("Pipeline complete. Received %s responses:\n%s" % (len(messages), responses['response']))

In [20]:
await run_cve_pipeline_microservice(pipeline_config, engine_config_custom_model)

Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
====Pipeline Pre-build====
====Pre-Building Segment: linear_segment_0====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Building Segment: linear_segment_0====
====Pipeline Started====
Added source: <from-http-20; HttpServerSourceStage(bind_address=0.0.0.0, port=26302, endpoint=/scan, method=HTTPMethod.POST, accept_status=201, sleep_time=0.1, queue_timeout=5, max_queue_size=None, num_server_threads=None, max_payload_size=10, request_timeout_secs=30, lines=False, stop_after=0)>
  └─> morpheus.MessageMeta
Added stage: <print_payload-21; WrappedFunctionStage(name=print_payload, on_data_fn=functools.partial(<function run_cve_pipeline_microservice.<locals>.print_payload at 0x7f398809b0a0>), accept_type=<class 'morpheus.messages.m

I20240311 03:39:54.645272  6881 asyncio_runnable.hpp:246] AsyncioRunnable::run() > Creating new event loop
I20240311 03:39:54.646185  6881 asyncio_runnable.hpp:259] AsyncioRunnable::run() > Calling run_until_complete() on main_task()


{"cve_info":"An issue was discovered in the Linux kernel through 6.0.9. drivers\/media\/dvb-core\/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."}



  serialized_str = payload.df.to_json(orient='records', lines=True)




> Entering new AgentExecutor chain...
I need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel
Observation: False
Thought: The SBOM Package Checker returned False, which means the Linux kernel is not present in the Docker container's software bill of materials. Therefore, I cannot determine if the container's Linux kernel version is affected by the issue. However, I can still check if the vulnerable file is present in the container.

Action: Docker Container Code QA System
Action Input: drivers/media/dvb-core/dvbdev.c
Observation: The string "drivers/media/dvb-core/dvbdev.c" appears to be a part of a CVE (Common Vulnerabilities and Exposures) information, specifically a use-after-free vulnerability found in the Linux kernel through version 6.0.9. This issue is related to t

  serialized_str = payload.df.to_json(orient='records', lines=True)




> Entering new AgentExecutor chain...
I need to check if the Docker container's Linux kernel version is affected by this issue and if the vulnerable file is present in the container.
Action: SBOM Package Checker
Action Input: Linux kernel
Observation: False
Thought: The SBOM Package Checker returned False, which means the Linux kernel is not present in the Docker container's software bill of materials. Therefore, I cannot determine if the container's Linux kernel version is affected by the issue. However, I can still check if the vulnerable file is present in the container.

Action: Docker Container Code QA System
Action Input: drivers/media/dvb-core/dvbdev.c
Observation: The string "drivers/media/dvb-core/dvbdev.c" appears to be a part of a CVE (Common Vulnerabilities and Exposures) information, specifically a use-after-free vulnerability found in the Linux kernel through version 6.0.9. This issue is related to t

##### Triggering the Microservice

To trigger the microservice, we will send it an HTTP request with `curl`. Because the notebook cannot run other commands while the microservice is running, we need to open a new terminal to send the request. To do that, follow the steps below:

1. In Jupyter Lab, press Ctrl + Shift + L (Shift + ⌘ + L on Mac) to open a new Launcher tab
2. In the Launcher tab, click on the Terminal icon to open a new terminal
3. In the terminal, run the following command to send a request to the microservice:
```bash
curl --request POST \
  --url http://localhost:26302/scan \
  --header 'Content-Type: application/json' \
  --data '[{
      "cve_info" : "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
   }]'
```
4. Once the request is sent, the microservice will process the request and return the results in the terminal
   1. To see the results, switch back to the Notebook tab. You should see that the microservice received your request and started processing it.
   ```
   I20240308 16:00:56.422039 3010283 http_server.cpp:129] Received request: POST : /scan
   ```
   2. It helps to have the terminal and the notebook side by side so you can see the results in the terminal as they come in. To do this, click on the terminal tab and drag it to the right side of the screen. You should then be able to see the terminal and the notebook side by side similar to the image below:
   ![Terminal and Notebook Side by Side](./images/side_by_side.png)
5. To stop the microservice, interrupt the kernel by pressing the stop button in the toolbar
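
If you prefer Python to `curl`, the request in step 3 can be sketched with the standard library as shown below. The helper names `build_scan_request` and `send_scan_request` are assumptions made for this example, not part of Morpheus. Run it from a separate Python process, such as a script started in the terminal, because the notebook kernel is busy while the microservice is running.

```python
import json
import urllib.request


def build_scan_request(cve_descriptions: list[str],
                       url: str = "http://localhost:26302/scan") -> urllib.request.Request:
    """Build a POST request whose body is a JSON list of {"cve_info": ...} records."""
    payload = [{"cve_info": description} for description in cve_descriptions]
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_scan_request(request: urllib.request.Request, timeout_s: float = 30.0) -> int:
    """Send the request and return the HTTP status code. Requires the microservice to be running."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


request = build_scan_request([
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c "
    "has a use-after-free, related to dvb_register_device dynamically allocating fops."
])
print(request.get_method(), request.full_url)

# Uncomment while the microservice is running:
# print(send_scan_request(request))
```

Building the request separately from sending it makes it easy to inspect the payload before anything goes over the network.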