
<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

# AI: Retrieval Augmented Generation (RAG)

This notebook shows how to build a semantic search engine using **RAG**. 

The goal is to build a pipeline that takes a plain-language query and returns the documents most relevant to answering it.

#### Data: 
[https://www.kaggle.com/datasets/manavkhambhayata/cve-2024-database-exploits-cvss-os](https://www.kaggle.com/datasets/manavkhambhayata/cve-2024-database-exploits-cvss-os)

The data comes from the National Vulnerability Database (NVD), a government-managed repository of cybersecurity vulnerabilities. It provides detailed information on security issues, including severity scores and affected systems.

The dataset was extracted using the NVD API and processed with Python. It includes vulnerabilities published between January 1, 2024, and January 15, 2024, with key details such as CVE ID, description, CVSS score, attack vector, and affected operating systems.

#### Libraries:

Many of the applications you build with LangChain will chain together multiple steps and multiple LLM calls. As these applications grow more complex, it becomes crucial to be able to inspect exactly what is going on inside your chain or agent. LangSmith is built for this. To use it, you will need to sign up for an account and generate an API key. [LangSmith](https://smith.langchain.com/?_gl=1*1c7os5z*_ga*MTk2ODA2OTQ4Ny4xNzUzODY4NjM2*_ga_47WX3HKKY2*czE3NTM4Njc4ODEkbzEkZzEkdDE3NTM4Njc4OTYkajQ1JGwwJGgw)

- [langchain](https://www.langchain.com/langchain)
- [getpass](https://docs.python.org/3/library/getpass.html): prompts the user for a password without echoing it.
- `<a model server for embedding models>`: depending on which platform you use to pull your embedding model (e.g. OpenAI, Azure), you will need its LangChain integration library. This example uses Hugging Face.
- [langchain-huggingface](https://anaconda.org/conda-forge/langchain-huggingface)

In [2]:
# Load Libraries - Make sure to run this cell!
import getpass
import os
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

import warnings
warnings.filterwarnings("ignore")

## Load data
LangChain has a variety of modules that help you load documents in different formats; see the [document loaders](https://python.langchain.com/docs/how_to/#document-loaders) how-to guides.

In [3]:
DATA_HOME = '../data/'

filename_cves = 'nvd_vulnerabilities_with_os.csv'

In [4]:
loader = CSVLoader(
    file_path=DATA_HOME + filename_cves,
    source_column='CVE ID',
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
    },
    encoding="UTF-8",
)

docs_raw = loader.load()
for record in docs_raw[:2]:
    print(record)
    print('----------')

RuntimeError: Error loading ../data/nvd_vulnerabilities_with_os.csv
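
To make the loader's output less of a black box, here is a minimal standard-library sketch of the transformation `CSVLoader` appears to apply, judging by the records printed later in this notebook: each row becomes one text block of `column: value` lines, and the `source_column` value and row index are kept as metadata. The function name `rows_to_docs` and the sample CSV text are hypothetical, and this is not LangChain's implementation.

```python
import csv
import io

def rows_to_docs(csv_text, source_column):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    docs = []
    for i, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text block.
        content = "\n".join(f"{col}: {val}" for col, val in row.items())
        docs.append({
            "page_content": content,
            "metadata": {"source": row[source_column], "row": i},
        })
    return docs

# Hypothetical two-row sample, not taken from the real dataset.
sample = (
    'CVE ID,Description,CVSS Score\n'
    'CVE-0000-0001,"Example overflow, remote",8.8\n'
    'CVE-0000-0002,Example injection,7.4\n'
)
docs = rows_to_docs(sample, source_column="CVE ID")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Because the whole row is flattened into `page_content`, every column (ID, description, score, attack vector, OS) ends up in the text that gets embedded.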

## Create Embeddings

You can generate the template code you need based on the model server you want to use. Go to the LangChain site and select your preferred server from the dropdown menu. Options include OpenAI, Azure, Google, AWS, and others. This notebook uses Hugging Face. [https://python.langchain.com/docs/tutorials/retrievers/#embeddings](https://python.langchain.com/docs/tutorials/retrievers/#embeddings)

You have some decisions to make at this point: 

**Decision 1**: Select the model server. (If you are working in an air-gapped environment, this is where you would point to a downloaded model instead.)

**Decision 2**: Select the specific model that will generate your embeddings. Because we are building a semantic search engine, you want a model that is good at that task, such as a sentence-transformer. This choice can make or break your results, so once the pipeline works, try a few different models and compare how the results change. Things to consider:

 - size (larger might be better, but larger models are also slower and need more memory)
 - tokenizer (do you need multilingual capabilities? If so, this is where you make sure the model has seen your languages)
 - origin (use a model from a reputable source; be wary of models fine-tuned by unknown authors on hubs like Hugging Face, and prefer ones from established publishers such as Microsoft or Facebook)
 - content for training (do you need to know what the model was trained on? If so, you will want an open-source model)
 - input length (models can only take in a fixed number of tokens; older, smaller models accept fewer tokens, which limits your ability to embed long texts)
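
One common way to work within the input-length limit is to split long text into overlapping windows and embed each window separately. The sketch below is a rough illustration only: it counts whitespace-separated words as a stand-in for real tokens (an actual tokenizer will count differently), and the names `chunk_words`, `max_tokens`, and `overlap` are hypothetical.

```python
def chunk_words(text, max_tokens=8, overlap=2):
    """Split text into windows of at most max_tokens words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears intact in at least one window.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical 20-word text: w0 w1 ... w19
text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(text, max_tokens=8, overlap=2)
for chunk in chunks:
    print(chunk)
```

The CVE records in this dataset are short, so the notebook embeds each one whole; chunking matters mostly when individual documents are longer than the model's limit.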


In [6]:
# you will need to grab your API token for the server you choose
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass()

In [5]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"  # alternative to try: all-MiniLM-L6-v2
)

Try out your new embedding model on a few of our documents. The **embed_query** method is typically used to embed a single piece of text, such as an incoming query, which makes it handy here for seeing what the model produces for one document.

In [6]:
vector_1 = embeddings.embed_query(docs_raw[0].page_content)
vector_2 = embeddings.embed_query(docs_raw[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 768

[-0.033642400056123734, -0.029917238280177116, 0.004915493540465832, -0.00423580314964056, 0.00263518700376153, -0.0005402038223110139, 0.01388082280755043, 0.027193713933229446, 0.007828943431377411, -0.024739526212215424]


## Storage
Now that we know how to embed our documents, we need to store these embeddings in a database. There are many options, but the most efficient way to store embeddings is in a **vector database**. These are optimized to store and retrieve this kind of data, so use one when you can. You can also use more traditional stores such as MongoDB or Elasticsearch, with specific field types for storing dense vectors. This is useful if you need to store a lot of metadata or want to keep the option of keyword search in addition to semantic search (this is common).
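
Reduced to its essentials, a vector store keeps each embedding next to its text and ranks the stored items against a query vector. The sketch below is a hypothetical toy (`ToyVectorStore` is not a LangChain class): it normalizes vectors when they are added, so the cosine similarity at query time is a plain dot product, and it accepts an optional keyword filter to illustrate combining keyword and semantic search. The two-dimensional vectors are made up for illustration; real embeddings from the model above have 768 dimensions.

```python
import math

class ToyVectorStore:
    """Hypothetical minimal store: unit-length vectors kept next to their text."""

    def __init__(self):
        self.records = []  # list of (unit_vector, text) pairs

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vec]

    def add(self, vector, text):
        # Normalizing once at insert time keeps query-time scoring cheap.
        self.records.append((self._normalize(vector), text))

    def search(self, query_vector, k=2, keyword=None):
        """Rank by dot product of unit vectors; optionally keep only keyword matches."""
        q = self._normalize(query_vector)
        candidates = [
            (vec, text) for vec, text in self.records
            if keyword is None or keyword.lower() in text.lower()
        ]
        scored = [(sum(a * b for a, b in zip(q, vec)), text) for vec, text in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = ToyVectorStore()
store.add([1.0, 0.0], "heap overflow in browser")
store.add([0.9, 0.1], "use after free in browser")
store.add([0.0, 1.0], "shell injection on device")

semantic = store.search([1.0, 0.0], k=2)                  # semantic ranking only
hybrid = store.search([1.0, 0.0], k=2, keyword="shell")   # keyword filter, then ranking
print([text for _, text in semantic])
print([text for _, text in hybrid])
```

A linear scan like this is fine for a few thousand records; dedicated vector databases add indexing so that search stays fast as the collection grows.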

In [7]:
vector_store = InMemoryVectorStore(embeddings)

NameError: name 'embeddings' is not defined

In [8]:
# add_documents embeds each document and returns the IDs of the stored entries
doc_ids = vector_store.add_documents(documents=docs_raw)

## Semantic Search
We will use similarity search (which uses cosine similarity) to find the documents most similar to our query, then look at the 5 most relevant.

In [16]:
results = vector_store.similarity_search(
    "Chrome vulnerabilities to heap corruption in May.",
    k=5,  # the default is k=4, so ask for 5 explicitly
)
for doc in results:
    print(doc)
    print('-----')


page_content='CVE ID: CVE-2024-0223
Description: Heap buffer overflow in ANGLE in Google Chrome prior to 120.0.6099.199 allowed a remote attacker to potentially exploit heap corruption via a crafted HTML page. (Chromium security severity: High)
CVSS Score: 8.8
Attack Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H
Affected OS: Fedoraproject Fedora 39, Fedoraproject Fedora 38' metadata={'source': 'CVE-2024-0223', 'row': 223}
-----
page_content='CVE ID: CVE-2024-0225
Description: Use after free in WebGPU in Google Chrome prior to 120.0.6099.199 allowed a remote attacker to potentially exploit heap corruption via a crafted HTML page. (Chromium security severity: High)
CVSS Score: 8.8
Attack Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H
Affected OS: Fedoraproject Fedora 39, Fedoraproject Fedora 38' metadata={'source': 'CVE-2024-0225', 'row': 225}
-----
page_content='CVE ID: CVE-2024-0222
Description: Use after free in ANGLE in Google Chrome prior to 120.0.6099.199 allowed a rem

## Semantic Search 2
Return documents based on similarity to an embedded query. Here we embed the query ourselves (using the same embedding model as the documents), then compare the query's vector to the vectors in the store and return the ones closest to it. There are multiple ways to measure the *distance* between two vectors, but the most popular for this task is cosine similarity.
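
For reference, the cosine similarity between a query vector $q$ and a document vector $d$ (symbols chosen here for illustration) is their dot product divided by the product of their lengths:

$$
\operatorname{sim}(q, d) = \cos\theta = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}
$$

As a small worked case, take $q = (1, 0)$ and $d = (1, 1)$. The dot product is $1$ and the lengths are $1$ and $\sqrt{2}$, so the similarity is $1/\sqrt{2} \approx 0.707$. A value of $1$ means the vectors point in the same direction and $0$ means they are orthogonal, so the "closest" documents are the ones with the highest cosine similarity.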

In [23]:
embedding = embeddings.embed_query("Which phone os is the most vulnerable to shell injection?")

# the default is k=4, so ask for 5 explicitly
results = vector_store.similarity_search_by_vector(embedding, k=5)
for doc in results:
    print(doc)
    print('-----')


page_content='CVE ID: CVE-2023-42136
Description: PAX Android based POS devices with PayDroid_8.1.0_Sagittarius_V11.1.50_20230614 or earlier can allow the execution of arbitrary commands with system account privilege by shell injection starting with a specific word.




The attacker must have shell access to the device in order to exploit this vulnerability.
CVSS Score: 7.8
Attack Vector: CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H
Affected OS: Paxtechnology Paydroid' metadata={'source': 'CVE-2023-42136', 'row': 1284}
-----
page_content='CVE ID: CVE-2023-47560
Description: An OS command injection vulnerability has been reported to affect QuMagie. If exploited, the vulnerability could allow authenticated users to execute commands via a network.

We have already fixed the vulnerability in the following version:
QuMagie 2.2.1 and later
CVSS Score: 7.4
Attack Vector: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:C/C:L/I:L/A:L
Affected OS: N/A' metadata={'source': 'CVE-2023-47560', 'row': 341}
-----
page

## Generated Answers
This retrieval step is usually part 1 of a RAG chatbot (and many other products). It gets us the source documents relevant to a user's query. We then take the top X documents and give them to a generative model, which generates an answer to the question based on those documents.
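
The hand-off from retrieval to generation is mostly string assembly: the top documents are pasted into a prompt alongside the question. Below is a minimal sketch of that step, assuming the retrieved documents are available as plain strings (in this notebook that would be each result's `page_content`). The function name `build_prompt`, the prompt wording, and the sample texts are hypothetical, and the call to an actual generative model is left out.

```python
def build_prompt(question, docs, max_docs=3):
    """Assemble a grounded prompt from the top retrieved documents."""
    # Number each source so the model (and the reader) can refer back to it.
    context = "\n\n".join(
        f"[{i + 1}] {doc}" for i, doc in enumerate(docs[:max_docs])
    )
    return (
        "Answer the question using only the numbered sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved texts standing in for doc.page_content values.
retrieved = [
    "Example record A: overflow in a rendering component.",
    "Example record B: use after free in a graphics component.",
    "Example record C: unrelated injection issue.",
]
prompt = build_prompt("Which components are affected?", retrieved, max_docs=2)
print(prompt)
```

Capping the number of documents with `max_docs` keeps the prompt within the generative model's input limit and keeps weaker matches out of the context.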