
<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

# AI: Retrieval Augmented Generation (RAG)

This notebook shows how to build a semantic search engine using **RAG**. 

The task is to build a pipeline that takes a plain-language query and finds the documents most relevant to answering it.

#### Data: 
[https://www.kaggle.com/datasets/manavkhambhayata/cve-2024-database-exploits-cvss-os](https://www.kaggle.com/datasets/manavkhambhayata/cve-2024-database-exploits-cvss-os)

The data comes from the National Vulnerability Database (NVD), a government-managed repository of cybersecurity vulnerabilities. It provides detailed information on security issues, including severity scores and affected systems.

The dataset was extracted using the NVD API and processed with Python. It includes vulnerabilities published between January 1, 2024, and January 15, 2024, with key details such as CVE ID, description, CVSS score, attack vector, and affected operating systems.

#### Libraries:

Many of the applications you build with LangChain will contain multiple steps, each with one or more LLM calls. As these applications get more complex, it becomes crucial to be able to inspect exactly what is going on inside your chain or agent. The best way to do this is with [LangSmith](https://smith.langchain.com/?_gl=1*1c7os5z*_ga*MTk2ODA2OTQ4Ny4xNzUzODY4NjM2*_ga_47WX3HKKY2*czE3NTM4Njc4ODEkbzEkZzEkdDE3NTM4Njc4OTYkajQ1JGwwJGgw). You will need to sign up for an account and generate an API key that you will enter below.

- [langchain](https://www.langchain.com/langchain)
- [getpass](https://docs.python.org/3/library/getpass.html): prompts the user for a password without echoing it.
- The LangChain integration package for your embedding model server. Which one you need depends on the platform you pull your embeddings from (e.g. OpenAI, Azure). This example uses Hugging Face:
- [langchain-huggingface](https://anaconda.org/conda-forge/langchain-huggingface)

In [None]:
# Load Libraries - Make sure to run this cell!
import getpass
import os
from langchain_community.document_loaders.csv_loader import CSVLoader
import warnings
warnings.filterwarnings("ignore")

## Load data
LangChain has a variety of [document loaders](https://python.langchain.com/docs/how_to/#document-loaders) that help you load documents in different formats.

In [None]:
DATA_HOME = '../data/'

filename_cves = 'nvd_vulnerabilities_with_os.csv'

In [None]:
loader = CSVLoader(
    file_path=DATA_HOME + filename_cves, 
    source_column='CVE ID',
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
    },
    encoding="UTF-8",
)

docs_raw = loader.load()
for record in docs_raw[:2]:
    print(record)
    print('----------')
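
To get a feel for what a CSV loader hands back, here is a minimal standard-library sketch that turns each row into one text blob plus a small metadata dict. It only mirrors the general shape of the records printed above and is not LangChain's actual implementation. The sample rows, the fake CVE IDs, and every column name other than `CVE ID` are assumptions made up for illustration, and `rows_to_documents` is a hypothetical helper.

```python
import csv
import io

# Hypothetical two-row sample; only the "CVE ID" column name comes from the
# notebook. The other columns and all values are made up for illustration.
SAMPLE_CSV = '''CVE ID,Description,CVSS Score
CVE-0000-0001,"Example buffer overflow, remotely exploitable",9.8
CVE-0000-0002,Example information disclosure,5.3
'''


def rows_to_documents(csv_text, source_column):
    """Turn each CSV row into a (page_content, metadata) pair."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text blob.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Keep track of where the text came from so results can be traced back.
        metadata = {"source": row[source_column], "row": row_number}
        documents.append((page_content, metadata))
    return documents


documents = rows_to_documents(SAMPLE_CSV, source_column="CVE ID")
for page_content, metadata in documents:
    print(metadata)
    print(page_content)
    print("----------")
```

The point to notice is that the whole row becomes the text that gets embedded later, while the source column is kept on the side as metadata.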

## Create Embeddings

You can generate the templated code you need for whichever model server you want to use. Go to the LangChain site and select your preferred server from the dropdown menu. There are many options: OpenAI, Azure, Google, AWS, Anthropic, and others. This example uses Hugging Face. [https://python.langchain.com/docs/tutorials/retrievers/#embeddings](https://python.langchain.com/docs/tutorials/retrievers/#embeddings)

You have some decisions to make at this point: 

**Decision 1**: Select model server (or this is where you could access a downloaded model if working in an airgapped environment)

**Decision 2**: Select the specific model to use to generate your embeddings. Since we are doing a semantic search engine, you will want something that is good at that task like a sentence-transformer. This is where you can end up with something that is great for your task, or something that is terrible. Once you have your pipeline, you will want to try out some different models to see how the results change. Things to consider:

 - size (larger might be better)
 - tokenizer (do you need multi-language capabilities? If so, this is where you would ensure the model has seen your languages)
 - origin (use something that comes from a reputable source; be wary of models that have been fine-tuned by unknown authors on hubs like Hugging Face, and prefer ones from established publishers such as Microsoft or Facebook)
 - content for training (do you need to know what the model was trained on? If so, you'll want an open-source model)
 - input length (models can only take in a specific number of tokens; older, smaller models accept fewer tokens, which limits your ability to embed long texts)

## Exercise 1: import your embeddings model
Import your embeddings model from your chosen server (remember you likely need to authenticate)

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

# You will need an API token for the server you choose.
# `os` and `getpass` were imported in the first cell.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Hugging Face API token: ")

In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"  # alternative: all-MiniLM-L6-v2
)

## Exercise: Embed a single document
Try out your new embedding model on one of our documents. The **embed_query** method is typically used to embed a single piece of text, like an incoming query, which makes it a handy way to see what the model does with one document.
1. What does this embedding look like? Print a few items from it.
2. How long is this embedding? This is an important aspect of encoding documents.
3. What does this 'length' refer to? Words? Characters? Sentences?

In [None]:
embedded_doc = embeddings.embed_query(
    # your code
)

print(embedded_doc[:10])

## Storage
Now that we know how to embed our documents, we need to store these embeddings in a database. There are many options for doing this, but the most efficient way to store embeddings is in a **vector database**. These have been optimized to store and retrieve this kind of data, so when you can use one, you should. You can also use more traditional stores like MongoDB or Elasticsearch, which have specific field types for storing a dense vector. This is useful if you need to store a lot of metadata or want to keep the option of doing a keyword search in addition to the semantic search (this is common).

## Exercise: Set up a vector store
- You can make this more challenging by choosing something like MongoDB or Pinecone if you want to play with these integrations.
- Or you can simply use the code below to store the embeddings in memory (this would not be an option once we get past a few thousand items, but it will work for this exercise).
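
If it helps to picture what an in-memory store does, here is a toy sketch in plain Python. `TinyVectorStore` and `normalize` are hypothetical names invented for this illustration, the 3-dimensional vectors are made up (real embeddings have hundreds of dimensions), and this is not how LangChain's `InMemoryVectorStore` is implemented. It assumes no stored vector is all zeros.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class TinyVectorStore:
    """Hypothetical toy store: keeps (unit_vector, text) pairs in a Python list."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        # Normalize once at insert time so search only needs dot products.
        self._items.append((normalize(vector), text))

    def search(self, query_vector, k=5):
        """Return the k (score, text) pairs most similar to the query."""
        query = normalize(query_vector)
        # Linear scan: the query is compared against every stored vector.
        scored = [
            (sum(q * v for q, v in zip(query, vector)), text)
            for vector, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "doc about SQL injection")
store.add([0.0, 1.0, 0.0], "doc about buffer overflows")
store.add([0.9, 0.1, 0.0], "doc about database exploits")

top = store.search([1.0, 0.0, 0.0], k=2)
for score, text in top:
    print(round(score, 3), text)
```

The linear scan in `search` is the part that stops scaling: every query touches every stored vector, which is the work dedicated vector databases are built to avoid.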

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(
    # your code
)

In [None]:
# add_documents returns the IDs of the stored documents
in_memory = vector_store.add_documents(
    # your code
)

## Semantic Search
## Exercise: find the top 5 documents for a query and print them 

1. Use *similarity* search (it takes your query as a plain string and embeds it for you) to find the documents that are the most similar to our query.
2. Then print the 5 most relevant.

In [None]:
results = vector_store.similarity_search(
   # YOUR Query here
)

# print your results


## Semantic Search 2

## Exercise: Embed query and return similar docs

Return documents based on similarity to an embedded query. 
1. Embed the query (**using the same embedding model as the docs**).
2. Compare our query's vector to the database vectors and get the closest ones (yes, there's a function for this). There are multiple ways to measure how close two vectors are, but the most popular for this task is cosine similarity, where a higher score means a closer match.
3. Print the top 5 documents
4. Try out a different similarity calculation and see if the results change
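
To see why the choice of measure in steps 2 and 4 can matter, here is a small standard-library sketch comparing cosine similarity with Euclidean distance. The 2-dimensional vectors and their labels are made up for illustration and do not come from a real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


query = [1.0, 0.0]
docs = {
    "same direction, much longer": [10.0, 1.0],
    "nearby, different direction": [0.5, 0.5],
}

# Cosine similarity: sort descending, because a higher score is a better match.
by_cosine = sorted(
    docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True
)
# Euclidean distance: sort ascending, because a smaller distance is a better match.
by_euclidean = sorted(docs, key=lambda name: math.dist(query, docs[name]))

print("cosine ranking:   ", by_cosine)
print("euclidean ranking:", by_euclidean)
```

Cosine similarity only looks at direction, so it prefers the long vector pointing the same way as the query, while Euclidean distance prefers the vector that is physically closer. When all vectors are scaled to unit length the two rankings coincide, so whether you see a difference depends on whether your embeddings are normalized.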

In [None]:
embedding = embeddings.embed_query(
    # your query
)

results = vector_store.similarity_search_by_vector(
    # your code
)


# print top 5 docs


# Generated Answers
This retrieval pipeline is usually part 1 of a chatbot (and many other products). It gets us the relevant source documents for a user's query; we then take the top X docs and give them to a generative model to generate an answer to our question **based** on these documents.
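
As a rough sketch of that hand-off, the snippet below packs retrieved document texts into a single prompt string that could be sent to a generative model. `build_prompt`, the prompt wording, and the sample documents are all assumptions for illustration. The notebook does not prescribe a prompt format, and no model is called here.

```python
def build_prompt(question, documents, max_docs=5):
    """Assemble a grounded prompt from a question and retrieved document texts."""
    # Number each document so the model (and the reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(documents[:max_docs], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up stand-ins for the page content of the top search results.
retrieved = [
    "CVE ID: CVE-0000-0001\nDescription: Example buffer overflow",
    "CVE ID: CVE-0000-0002\nDescription: Example information disclosure",
]
prompt = build_prompt("Which vulnerabilities involve memory corruption?", retrieved)
print(prompt)
```

Capping the number of documents with `max_docs` keeps the prompt within the generative model's input limit, which is the same input-length concern that applied to the embedding model.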