# Module 8: 1 - Indexing Data into a Vector Database and RAG with Langchain
----------------------------------------------------------------------------
In this lesson, we demonstrate how to index data into a vector database and use LangChain's Q&A chain to implement the Retrieval Augmented Generation (RAG) flow for knowledge-enhanced LLMs. We begin by downloading ATT&CK data (Enterprise, Mobile, and ICS) in STIX format with the attackcti Python library. Next, we extract all groups and the techniques they use, and render them as markdown files that simulate an intel repository where threat intelligence analysts record notes about tracked threat actors. After loading these markdown files and splitting them into token-bounded chunks, we generate embeddings and index them in a FAISS vector store. We then run a similarity search to find relevant documents. Finally, we set the vector database as a retriever and combine it with a question-answering chain to form a LangChain RAG chain that automates retrieval and context addition, allowing the LLM to provide informed answers.

## Objectives
* Understand the process of indexing data into a vector database.
* Learn how to set the vector database as a retriever and initialize a retriever chain for RAG.
* Simulate an Intel repository using markdown files.
* Generate embeddings and perform similarity searches to find relevant data.
* Automate context addition to user prompts for improved LLM responses.

## What this session covers:
* Downloading and organizing ATT&CK Enterprise data using the attackcti Python library.
* Extracting and converting group and technique data into markdown files.
* Simulating an Intel repository with markdown files.
* Loading, tokenizing, and embedding data in a FAISS database.
* Performing similarity searches to find relevant documents.
* Setting the vector database as a retriever.
* Initializing and utilizing a retriever chain for RAG with Langchain.
* Interacting with the LLM for enhanced, context-rich answers.

## Install Libraries

In [1]:
!pip install openai
!pip install langchain
!pip install langchain_openai
!pip install -qU langchain_community
!pip install faiss-cpu
!pip install pydantic
!pip install attackcti
!pip install unstructured
!pip install markdown
!pip install tiktoken
!pip install langchain_huggingface
!pip install jinja2
!pip install python-dotenv

## Define Initial Variables

In [2]:
import os

# Define a few variables (all paths are relative to the notebook's working directory)
current_directory = os.path.dirname("__file__")
data_directory = os.path.join(current_directory, "data")
documents_directory = os.path.join(data_directory, "documents")
templates_directory = os.path.join(current_directory, "templates")
group_template = os.path.join(templates_directory, "group.md")
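
The cell above passes the string literal `"__file__"` to `os.path.dirname`, not the `__file__` variable (which is undefined in a notebook). The literal has no directory part, so the result is an empty string. A quick check of how the variables resolve, reusing the names from the cell above:

```python
import os

# dirname of the literal string "__file__" has no directory component
current_directory = os.path.dirname("__file__")
data_directory = os.path.join(current_directory, "data")
documents_directory = os.path.join(data_directory, "documents")

print(repr(current_directory))  # empty string
print(documents_directory)      # "data/documents" on POSIX, "data\documents" on Windows
```

If absolute paths are preferred, `os.getcwd()` is one possible substitute for `current_directory`.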

## Download ATT&CK STIX Data

In [3]:
from attackcti.utils.downloader import STIXDownloader

stix20_downloader = STIXDownloader(download_dir="./data/attack", stix_version="2.0")

stix20_downloader.download_all_domains(release="15.1")

Downloaded enterprise-attack.json to data/attack/v15.1
Downloaded mobile-attack.json to data/attack/v15.1
Downloaded ics-attack.json to data/attack/v15.1


{'enterprise': 'data/attack/v15.1/enterprise-attack.json',
 'mobile': 'data/attack/v15.1/mobile-attack.json',
 'ics': 'data/attack/v15.1/ics-attack.json'}

In [4]:
stix20_downloader.downloaded_file_paths

{'enterprise': 'data/attack/v15.1/enterprise-attack.json',
 'mobile': 'data/attack/v15.1/mobile-attack.json',
 'ics': 'data/attack/v15.1/ics-attack.json'}

## Initialize ATT&CK Python Client

In [5]:
from attackcti import attack_client

lift = attack_client(local_paths=stix20_downloader.downloaded_file_paths)

## Get Techniques Used by ATT&CK Groups
Get the technique STIX objects used by all groups across all ATT&CK matrices.

In [6]:
techniques_used_by_groups = lift.get_techniques_used_by_all_groups()
techniques_used_by_groups[0]

{'type': 'intrusion-set',
 'id': 'intrusion-set--01e28736-2ffc-455b-9880-ed4d1407ae07',
 'created_by_ref': 'identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5',
 'created': '2021-01-06T17:46:35.134Z',
 'modified': '2024-04-17T22:10:56.266Z',
 'name': 'Indrik Spider',
 'description': '[Indrik Spider](https://attack.mitre.org/groups/G0119) is a Russia-based cybercriminal group that has been active since at least 2014. [Indrik Spider](https://attack.mitre.org/groups/G0119) initially started with the [Dridex](https://attack.mitre.org/software/S0384) banking Trojan, and then by 2017 they began running ransomware operations using [BitPaymer](https://attack.mitre.org/software/S0570), [WastedLocker](https://attack.mitre.org/software/S0612), and Hades ransomware. Following U.S. sanctions and an indictment in 2019, [Indrik Spider](https://attack.mitre.org/groups/G0119) changed their tactics and diversified their toolset.(Citation: Crowdstrike Indrik November 2018)(Citation: Crowdstrike EvilCorp Marc
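
Each record returned above combines a group (an `intrusion-set` object) with one technique it uses. In STIX, that link is stored in a `relationship` object whose `relationship_type` is `uses`. The following is a rough sketch of such a join over a tiny hand-written bundle. The object IDs and names are made up for illustration, and this is not attackcti's implementation.

```python
# Hypothetical miniature STIX bundle; IDs and names are illustrative only
bundle = {
    "type": "bundle",
    "objects": [
        {"type": "intrusion-set", "id": "intrusion-set--example-1", "name": "Example Group"},
        {"type": "attack-pattern", "id": "attack-pattern--example-1", "name": "Example Technique"},
        {"type": "attack-pattern", "id": "attack-pattern--example-2", "name": "Unused Technique"},
        {
            "type": "relationship",
            "id": "relationship--example-1",
            "relationship_type": "uses",
            "source_ref": "intrusion-set--example-1",
            "target_ref": "attack-pattern--example-1",
            "description": "Example Group has used Example Technique.",
        },
    ],
}

# Index every object by ID so relationship references can be resolved
objects_by_id = {obj["id"]: obj for obj in bundle["objects"]}

rows = []
for obj in bundle["objects"]:
    if obj["type"] != "relationship" or obj["relationship_type"] != "uses":
        continue
    source = objects_by_id.get(obj["source_ref"])
    target = objects_by_id.get(obj["target_ref"])
    # Keep only group -> technique links
    if source and target and source["type"] == "intrusion-set" and target["type"] == "attack-pattern":
        rows.append(
            {
                "name": source["name"],
                "technique": target["name"],
                "relationship_description": obj["description"],
            }
        )

print(rows)
```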

## RAG BEGINS!!

## 01. Get Documents

### Create ATT&CK Groups Markdown Files

In [7]:
import copy
from jinja2 import Template

# Create Group docs
all_groups = dict()
for technique in techniques_used_by_groups:
    if technique["id"] not in all_groups:
        group = dict()
        group["group_name"] = technique["name"]
        group["group_id"] = technique["external_references"][0]["external_id"]
        group["created"] = technique["created"]
        group["modified"] = technique["modified"]
        group["description"] = technique["description"]
        group["aliases"] = technique["aliases"]
        if "x_mitre_contributors" in technique:
            group["contributors"] = technique["x_mitre_contributors"]
        group["techniques"] = []
        all_groups[technique["id"]] = group
    technique_used = dict()
    technique_used["matrix"] = technique["technique_matrix"]
    technique_used["domain"] = technique["x_mitre_domains"]
    technique_used["platform"] = technique["platform"]
    technique_used["tactics"] = technique["tactic"]
    technique_used["technique_id"] = technique["technique_id"]
    technique_used["technique_name"] = technique["technique"]
    technique_used["use"] = technique["relationship_description"]
    if "data_sources" in technique:
        technique_used["data_sources"] = technique["data_sources"]
    all_groups[technique["id"]]["techniques"].append(technique_used)

if not os.path.exists(documents_directory):
    print("[+] Creating knowledge directory..")
    os.makedirs(documents_directory)

print("[+] Creating markdown files for each group..")
# Read the Jinja2 template once and close the file handle afterwards
with open(group_template, encoding="utf-8") as template_file:
    markdown_template = Template(template_file.read())
for group in all_groups.values():
    print("  [>>] Creating markdown file for {}..".format(group["group_name"]))
    group_for_render = copy.deepcopy(group)
    markdown = markdown_template.render(
        metadata=group_for_render,
        group_name=group["group_name"],
        group_id=group["group_id"],
    )
    file_name = group["group_name"].replace(" ", "_")
    # Write one markdown file per group
    with open(f"{documents_directory}/{file_name}.md", encoding="utf-8", mode="w") as md_file:
        md_file.write(markdown)

[+] Creating markdown files for each group..
  [>>] Creating markdown file for Indrik Spider..
  [>>] Creating markdown file for LuminousMoth..
  [>>] Creating markdown file for Wizard Spider..
  [>>] Creating markdown file for Elderwood..
  [>>] Creating markdown file for FIN7..
  [>>] Creating markdown file for WIRTE..
  [>>] Creating markdown file for Dragonfly..
  [>>] Creating markdown file for OilRig..
  [>>] Creating markdown file for Equation..
  [>>] Creating markdown file for Fox Kitten..
  [>>] Creating markdown file for Lazarus Group..
  [>>] Creating markdown file for Aquatic Panda..
  [>>] Creating markdown file for TA505..
  [>>] Creating markdown file for Inception..
  [>>] Creating markdown file for admin@338..
  [>>] Creating markdown file for BlackTech..
  [>>] Creating markdown file for Malteiro..
  [>>] Creating markdown file for Earth Lusca..
  [>>] Creating markdown file for Turla..
  [>>] Creating markdown file for Suckfly..
  [>>] Creating markdown file for Te
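
The first loop in the cell above collapses a flat list of group/technique records into one entry per group, each holding a list of techniques. A minimal sketch of the same grouping pattern, using `dict.setdefault` and a few hypothetical, trimmed-down records (the real records carry many more fields):

```python
# Hypothetical flat records: one row per (group, technique) pair
records = [
    {"id": "intrusion-set--a", "name": "Group A", "technique_id": "T0001", "technique": "Technique One"},
    {"id": "intrusion-set--a", "name": "Group A", "technique_id": "T0002", "technique": "Technique Two"},
    {"id": "intrusion-set--b", "name": "Group B", "technique_id": "T0001", "technique": "Technique One"},
]

groups = {}
for record in records:
    # Create the group entry the first time its ID is seen, then reuse it
    group = groups.setdefault(record["id"], {"group_name": record["name"], "techniques": []})
    group["techniques"].append(
        {"technique_id": record["technique_id"], "technique_name": record["technique"]}
    )

for group in groups.values():
    print(group["group_name"], [t["technique_id"] for t in group["techniques"]])
```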

## 02. Index Source Knowledge

### 2.1 Load Documents

In [8]:
import glob
from langchain_community.document_loaders import UnstructuredMarkdownLoader

In [9]:
# variables
group_files = glob.glob(os.path.join(documents_directory, "*.md"))

# Loading Markdown files
md_docs = []
print("[+] Loading Group markdown files..")
for group in group_files:
    print(f" [*] Loading {os.path.basename(group)}")
    loader = UnstructuredMarkdownLoader(group)
    md_docs.extend(loader.load())

print(f"[+] Number of .md documents processed: {len(md_docs)}")

[+] Loading Group markdown files..
 [*] Loading Deep_Panda.md
 [*] Loading Aoqin_Dragon.md
 [*] Loading Poseidon_Group.md
 [*] Loading Whitefly.md
 [*] Loading Magic_Hound.md
 [*] Loading APT-C-23.md
 [*] Loading Earth_Lusca.md
 [*] Loading Windigo.md
 [*] Loading Cleaver.md
 [*] Loading FIN7.md
 [*] Loading APT12.md
 [*] Loading FIN13.md
 [*] Loading Transparent_Tribe.md
 [*] Loading APT32.md
 [*] Loading Confucius.md
 [*] Loading Tropic_Trooper.md
 [*] Loading UNC788.md
 [*] Loading BlackOasis.md
 [*] Loading The_White_Company.md
 [*] Loading BlackTech.md
 [*] Loading Axiom.md
 [*] Loading Chimera.md
 [*] Loading OilRig.md
 [*] Loading APT16.md
 [*] Loading IndigoZebra.md
 [*] Loading Leviathan.md
 [*] Loading APT33.md
 [*] Loading Windshift.md
 [*] Loading Sowbug.md
 [*] Loading Tonto_Team.md
 [*] Loading DarkHydrus.md
 [*] Loading APT-C-36.md
 [*] Loading PittyTiger.md
 [*] Loading APT17.md
 [*] Loading Strider.md
 [*] Loading APT37.md
 [*] Loading Ke3chang.md
 [*] Loading Gorgon_G

Check the page content of the first loaded document.

In [10]:
print(md_docs[0].page_content)

Deep Panda - G0009

Created: 2017-05-31T21:31:49.412Z

Modified: 2022-07-20T20:10:29.593Z

Contributors: Andrew Smith, @jakx_

Aliases

Deep Panda,Shell Crew,WebMasters,KungFu Kittens,PinkPanther,Black Vine

Description

Deep Panda is a suspected Chinese threat group known to target many industries, including government, defense, financial, and telecommunications. (Citation: Alperovitch 2014) The intrusion into healthcare company Anthem has been attributed to Deep Panda. (Citation: ThreatConnect Anthem) This group is also known as Shell Crew, WebMasters, KungFu Kittens, and PinkPanther. (Citation: RSA Shell Crew) Deep Panda also appears to be known as Black Vine based on the attribution of both group names to the Anthem intrusion. (Citation: Symantec Black Vine) Some analysts track Deep Panda and APT19 as the same group, but it is unclear from open source information if the groups are the same. (Citation: ICIT China's Espionage Jul 2016)

Techniques Used

Matrix Domain Platform Techniq

### 2.2 Split Documents

Use a LangChain text splitter.

In [11]:
# Recursively split by character
# This text splitter is the recommended one for generic text.
# It is parameterized by a list of characters. It tries to split on them in
# order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].
# This has the effect of trying to keep all paragraphs (and then sentences, and then words)
# together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

from langchain_text_splitters import RecursiveCharacterTextSplitter
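
To make the idea in the comment above concrete, here is a simplified, character-based sketch of recursive splitting: try the coarsest separator first, recurse with finer separators only for pieces that are still too long, then merge neighbouring pieces back together while they fit. The function name `recursive_split` is hypothetical, overlap is left out, and this is an illustration of the idea rather than LangChain's implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(separator):
        # Parts that are still too long are split with the finer separators
        pieces.extend(recursive_split(part, chunk_size, remaining))

    # Merge adjacent pieces back together as long as they fit in one chunk
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
for chunk in recursive_split(sample, chunk_size=12):
    print(repr(chunk))
```

In this sample, the middle paragraph is too long for one chunk, so only that paragraph is broken further on spaces; the other paragraphs stay whole.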

In [12]:
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")


def tiktoken_len(text):
    tokens = tokenizer.encode(
        text, disallowed_special=()  # To disable this check for all special tokens
    )
    return len(tokens)

In [13]:
# Chunking Text
print("[+] Initializing RecursiveCharacterTextSplitter..")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum chunk length, measured in tokens by length_function
    chunk_overlap=200,  # number of tokens shared between consecutive chunks
    add_start_index=True,  # store each chunk's starting character index in metadata["start_index"]
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],
)

[+] Initializing RecursiveCharacterTextSplitter..
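
The effect of `chunk_overlap` is easiest to see on a plain sliding window. The sketch below is a simplified illustration over a list of token IDs, assuming fixed-size windows; the real splitter also respects separators, so its chunks vary in length. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Return fixed-size windows where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not tokens:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the sequence
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


windows = sliding_chunks(list(range(10)), chunk_size=4, chunk_overlap=2)
for window in windows:
    print(window)
```

Overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of storing some tokens twice.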


In [14]:
print("[+] Splitting documents in chunks..")
chunks = text_splitter.split_documents(md_docs)

print(f"[+] Number of documents: {len(md_docs)}")
print(f"[+] Number of chunks: {len(chunks)}")

[+] Splitting documents in chunks..
[+] Number of documents: 148
[+] Number of chunks: 398


In [15]:
print(chunks[0])
print(chunks[1])
print(chunks[2])

page_content='Deep Panda - G0009

Created: 2017-05-31T21:31:49.412Z

Modified: 2022-07-20T20:10:29.593Z

Contributors: Andrew Smith, @jakx_

Aliases

Deep Panda,Shell Crew,WebMasters,KungFu Kittens,PinkPanther,Black Vine

Description

Deep Panda is a suspected Chinese threat group known to target many industries, including government, defense, financial, and telecommunications. (Citation: Alperovitch 2014) The intrusion into healthcare company Anthem has been attributed to Deep Panda. (Citation: ThreatConnect Anthem) This group is also known as Shell Crew, WebMasters, KungFu Kittens, and PinkPanther. (Citation: RSA Shell Crew) Deep Panda also appears to be known as Black Vine based on the attribution of both group names to the Anthem intrusion. (Citation: Symantec Black Vine) Some analysts track Deep Panda and APT19 as the same group, but it is unclear from open source information if the groups are the same. (Citation: ICIT China's Espionage Jul 2016)

Techniques Used

Matrix Domain Pl

### 2.3 Embed Documents

In [16]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS  # , DistanceStrategy

Define the embeddings function.

In [17]:
# If you want to define the OpenAI embeddings function
# from langchain_openai import OpenAIEmbeddings

# If you want to define an open-source embedding function
embeddings_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Equivalent to SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")



Create the database, or load it from disk if it already exists.

In [18]:
# https://python.langchain.com/v0.2/docs/integrations/vectorstores/faiss/
from langchain_community.vectorstores.faiss import DistanceStrategy

db_dir = "data/faiss/faiss_index"

# Check if database directory exists
if os.path.exists(db_dir):
    # Load database from disk
    db = FAISS.load_local(
        folder_path=db_dir,
        embeddings=embeddings_function,
        distance_strategy=DistanceStrategy.COSINE,
        # The index is loaded with pickle; only enable this for index files you created yourself
        allow_dangerous_deserialization=True,
    )
else:
    # With OpenAI Embeddings
    # db = FAISS.from_documents(chunks, OpenAIEmbeddings())

    # Create a new database
    db = FAISS.from_documents(
        chunks, embedding=embeddings_function, distance_strategy=DistanceStrategy.COSINE
    )
    # Save database to disk
    db.save_local(db_dir)

Ask a question directly to the database.

In [19]:
# query it
query = "What threat actor sends text messages over social media to their targets?"
relevant_docs = db.similarity_search(query)

In [20]:
relevant_docs

[Document(metadata={'source': 'data\\documents\\Magic_Hound.md', 'start_index': 20568}, page_content='as well as messaging services (such as WhatsApp) to spearphish victims.(Citation: SecureWorks Mia Ash July 2017)(Citation: Microsoft Phosphorus Mar 2019)(Citation: ClearSky Kittens Back 3 August 2020) [\'enterprise-attack\'] enterprise-attack Linux,macOS,Windows T1560.001 Archive via Utility Magic Hound has used gzip to archive dumped LSASS process memory and RAR to stage and compress local folders.(Citation: FireEye APT35 2018)(Citation: DFIR Report APT35 ProxyShell March 2022)(Citation: DFIR Phosphorus November 2021) [\'enterprise-attack\'] enterprise-attack PRE T1585.001 Social Media Accounts Magic Hound has created fake LinkedIn and other social media accounts to contact targets and convince them--through messages and voice communications--to open malicious links.(Citation: ClearSky Kittens Back 3 August 2020) [\'enterprise-attack\'] enterprise-attack macOS,Windows,Linux T1564.003 

In [21]:
# print results
print(relevant_docs[0].page_content)

as well as messaging services (such as WhatsApp) to spearphish victims.(Citation: SecureWorks Mia Ash July 2017)(Citation: Microsoft Phosphorus Mar 2019)(Citation: ClearSky Kittens Back 3 August 2020) ['enterprise-attack'] enterprise-attack Linux,macOS,Windows T1560.001 Archive via Utility Magic Hound has used gzip to archive dumped LSASS process memory and RAR to stage and compress local folders.(Citation: FireEye APT35 2018)(Citation: DFIR Report APT35 ProxyShell March 2022)(Citation: DFIR Phosphorus November 2021) ['enterprise-attack'] enterprise-attack PRE T1585.001 Social Media Accounts Magic Hound has created fake LinkedIn and other social media accounts to contact targets and convince them--through messages and voice communications--to open malicious links.(Citation: ClearSky Kittens Back 3 August 2020) ['enterprise-attack'] enterprise-attack macOS,Windows,Linux T1564.003 Hidden Window Magic Hound malware has a function to determine whether the C2 server wishes to execute the ne
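
Conceptually, a similarity search embeds the query, scores it against every stored chunk, and returns the top `k` matches (`similarity_search` defaults to `k=4`). The toy sketch below shows that ranking step using word-count vectors and cosine similarity in place of a real embedding model and FAISS index. The documents and the `embed`, `cosine`, and `toy_similarity_search` helpers are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_similarity_search(query, documents, k=4):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]


# Hypothetical note snippets
documents = [
    "Group A sends messages over social media to targets",
    "Group B uses gzip to archive files",
    "Group C encrypts files for ransom",
]
results = toy_similarity_search("who sends messages over social media", documents, k=2)
print(results[0])
```

A real embedding model maps text to dense vectors that capture meaning rather than exact word overlap, which is why the actual query above can match a chunk that never uses the phrase "text messages".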

## 03. Enable Retriever

### Set Database as a Retriever

In [22]:
retriever = db.as_retriever()

## 04. Initialize LLM Client

### Initialize OpenAI Client

In [None]:
#!pip install -U python-dotenv

In [23]:
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
import os

load_dotenv()

openai_api_key = os.environ.get("OPENAI_API_KEY")


llm = ChatOpenAI(
    model="gpt-3.5-turbo-0125", openai_api_key=openai_api_key, temperature=0
)
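
`os.environ.get` returns `None` when the variable is missing, so a forgotten `.env` entry may only surface later as a less obvious error. One optional safeguard is to fail early with a clear message. The helper name `require_env` and the variable `EXAMPLE_API_KEY` below are hypothetical and used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Hypothetical variable set here only so the demonstration runs
os.environ["EXAMPLE_API_KEY"] = "sk-example"
print(require_env("EXAMPLE_API_KEY"))
```

In the cell above, the same check could be applied to `OPENAI_API_KEY` after `load_dotenv()`.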

### Incorporate the Retriever into a Question-Answering chain

In [24]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

system_prompt = (
    "You are a Threat Intelligence assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
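
"Stuffing" means that all retrieved chunks are concatenated and placed into the `{context}` slot of the prompt before a single LLM call. A minimal sketch of that formatting step follows, assuming chunks are joined with a blank line between them. The function `stuff_documents` and the sample strings are hypothetical, and the real chain works with `Document` objects and prompt templates instead of plain strings.

```python
def stuff_documents(system_template, documents, question, separator="\n\n"):
    """Fill the {context} slot with all documents and pair it with the user's question."""
    context = separator.join(documents)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


system_template = "Use the following context to answer the question.\n\n{context}"
documents = ["Chunk one about Group A.", "Chunk two about Group B."]
messages = stuff_documents(system_template, documents, "Which groups are mentioned?")

for role, content in messages:
    print(f"[{role}] {content}")
```

Because every retrieved chunk goes into one prompt, the number of chunks returned by the retriever multiplied by the chunk size has to fit within the model's context window.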

### Initialize RAG Chain

In [25]:
from langchain.chains import create_retrieval_chain

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

## 05. Initialize Conversation with Context / Relevant Documents

### Run Q&A

In [26]:
query = "What threat actor sends text messages over social media to their targets?"

In [27]:
response = rag_chain.invoke({"input": query})
response["answer"]

'Magic Hound is the threat actor that sends text messages over social media to their targets. They have created fake LinkedIn and other social media accounts to contact targets and convince them to open malicious links through messages and voice communications. (Citation: ClearSky Kittens Back 3 August 2020)'
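
The response is a dictionary: besides `answer`, it carries the original `input` and the retrieved `context` documents, which is useful for checking what the answer was based on. The sketch below mimics that data flow with stand-in components, assuming a keyword-matching retriever and a generator that simply echoes the first retrieved note. All names and notes are hypothetical, and no LLM is called.

```python
def make_rag_chain(retrieve, generate):
    """Compose a retriever and an answer generator into a single callable."""

    def invoke(inputs):
        question = inputs["input"]
        context = retrieve(question)          # step 1: fetch relevant notes
        answer = generate(question, context)  # step 2: answer using those notes
        return {"input": question, "context": context, "answer": answer}

    return invoke


# Hypothetical knowledge base
NOTES = [
    "Group A contacts targets over social media",
    "Group B archives data with gzip",
]


def keyword_retriever(question):
    """Stand-in retriever: keep notes that share at least one word with the question."""
    words = set(question.lower().split())
    return [note for note in NOTES if words & set(note.lower().split())]


def echo_generator(question, context):
    """Stand-in for the LLM: return the first retrieved note, if any."""
    return context[0] if context else "I don't know."


toy_chain = make_rag_chain(keyword_retriever, echo_generator)
toy_response = toy_chain({"input": "who relies on social media"})
print(toy_response["answer"])
print(toy_response["context"])
```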