# Module 8: 3 - Optimizing RAG - Knowledge Graph
----------------------------------------------------------------------------
In this lesson, we explore [GraphRAG](https://microsoft.github.io/graphrag/), a structured, hierarchical approach to Retrieval-Augmented Generation (RAG) from Microsoft Research. Unlike simple semantic search over plain-text snippets, GraphRAG builds a knowledge graph from raw text, organizes the graph into a community hierarchy, generates a summary for each community, and uses these structures when answering RAG queries.
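
The stages described above can be pictured with a toy sketch. This is not GraphRAG's code: the triples are hand-written stand-ins for what an LLM would extract, and plain connected components replace the hierarchical community detection that GraphRAG performs. All names here (`toy_triples`, `build_graph`, `find_communities`) are hypothetical.

```python
from collections import defaultdict

# Hand-written (source, relation, target) triples standing in for LLM extraction.
toy_triples = [
    ("SANDWORM TEAM", "attributed to", "GRU UNIT 74455"),
    ("COBALT GROUP", "may be linked to", "CARBANAK GROUP"),
    ("CARBANAK GROUP", "uses", "CARBANAK MALWARE"),
]


def build_graph(triples):
    """Build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def find_communities(graph):
    """Group nodes into connected components (a simplification of community detection)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


toy_graph = build_graph(toy_triples)
toy_communities = find_communities(toy_graph)
for index, members in enumerate(toy_communities):
    print(f"Community {index}: {', '.join(members)}")
```

In the real pipeline each community would then be summarized by an LLM, and those summaries are what the search steps later in this notebook draw on.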

## Objectives
* Understand GraphRAG.
* Implement GraphRAG to refine the retrieval process.
* Use knowledge graphs to improve retrieval accuracy.
* Test and validate GraphRAG using enriched ATT&CK groups data.

## What this session covers:
* Indexing groups data using GraphRAG
* Exploring the different components of GraphRAG
* Executing global and local search prompts
* Demonstrating and testing optimization scenarios with ATT&CK groups data

## Install Libraries

In [1]:
!pip install graphrag
!pip install pandas
!pip install requests
!pip install pyarrow
# os and subprocess are part of the Python standard library, so they are not installed with pip.

Collecting graphrag
  Downloading graphrag-0.3.6-py3-none-any.whl.metadata (6.0 kB)
Collecting aiofiles<25.0.0,>=24.1.0 (from graphrag)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting aiolimiter<2.0.0,>=1.1.0 (from graphrag)
  Downloading aiolimiter-1.1.0-py3-none-any.whl.metadata (4.5 kB)
Collecting azure-identity<2.0.0,>=1.17.1 (from graphrag)
  Downloading azure_identity-1.19.0-py3-none-any.whl.metadata (80 kB)
Collecting azure-search-documents<12.0.0,>=11.4.0 (from graphrag)
  Downloading azure_search_documents-11.5.2-py3-none-any.whl.metadata (23 kB)
Collecting azure-storage-blob<13.0.0,>=12.22.0 (from graphrag)
  Downloading azure_storage_blob-12.23.1-py3-none-any.whl.metadata (26 kB)
Collecting datashaper<0.0.50,>=0.0.49 (from graphrag)
  Downloading datashaper-0.0.49-py3-none-any.whl.metadata (3.7 kB)
Collecting devtools<0.13.0,>=0.12.2 (from graphrag)
  Downloading devtools-0.12.2-py3-none-any.whl.metadata (4.8 kB)
Collecting environs<12.0.0,>=11.0.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
crewai-tools 0.13.2 requires beautifulsoup4>=4.12.3, which is not installed.
embedchain 0.1.123 requires beautifulsoup4<5.0.0,>=4.12.2, which is not installed.
embedchain 0.1.123 requires pypdf<6.0.0,>=5.0.0, which is not installed.

## Define Paths

In [2]:
import os

# Resolve paths relative to the notebook's working directory.
# __file__ is not defined in a notebook, and os.path.dirname("__file__") only returns "" for that literal string.
current_directory = os.getcwd()
data_directory = os.path.join(current_directory, "data")
env_file_path = os.path.join(data_directory, ".env")

## Creating .env File

In [3]:
import os

# Ensure the data directory exists
if not os.path.exists(data_directory):
    print("[+] Creating data directory..")
    os.makedirs(data_directory)

# Create the .env file with a placeholder; replace <api key> with your real key
with open(env_file_path, "w") as file:
    file.write("GRAPHRAG_API_KEY=<api key>")
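
Before moving on, it can help to confirm what the `.env` file actually contains. The following is a minimal sketch of a `KEY=VALUE` parser using only the standard library; `read_env_file` is a hypothetical helper, not part of GraphRAG, and the demonstration writes to a temporary file so the real `.env` is left untouched.

```python
import os
import tempfile


def read_env_file(path):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


# Demonstration against a throwaway file with the same placeholder content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, ".env")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("GRAPHRAG_API_KEY=<api key>\n")
    demo_values = read_env_file(demo_path)

key_is_set = "GRAPHRAG_API_KEY" in demo_values
print("GRAPHRAG_API_KEY present:", key_is_set)
```

In the notebook itself, `read_env_file(env_file_path)` would show whether the key line is present without having to open the file by hand.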

## Set Up GraphRAG Workspace Variables

In [5]:
# Creates settings.yaml and .env, plus the prompts and cache directories, under ./data.
# If the project was already initialized, the command raises a ValueError, as in the output below.

!python -m graphrag.index --init --root ./data

Initializing project at ./data
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\index\__main__.py", line 104, in <module>
    index_cli(
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\index\cli.py", line 126, in index_cli
    _initialize_project_at(root_dir, progress_reporter)
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\index\cli.py", line 199, in _initialize_project_at
    raise ValueError(msg)
ValueError: Project already initialized at data


## Update .env and settings.yaml files

Update .env:

```
GRAPHRAG_API_KEY=<API_KEY>
```

Update settings.yaml:

```
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "documents" # changed from "input"
  file_encoding: utf-8
  file_pattern: ".*\\.md$" # changed from ".*\\.txt$"
```
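
To see which files the new `file_pattern` would select, the pattern can be tried against a few names. This is only a sketch: how GraphRAG applies the pattern internally is not shown here, and the file list is hypothetical apart from `FIN7.md` and `Sandworm_Team.md`, which appear later in this notebook.

```python
import re

# The double-quoted YAML string ".*\\.md$" unescapes to the regular expression .*\.md$
file_pattern = re.compile(r".*\.md$")

# Hypothetical directory listing used only for illustration.
candidate_files = ["FIN7.md", "Sandworm_Team.md", "notes.txt", "README.md.bak"]

matched_files = [name for name in candidate_files if file_pattern.match(name)]
print(matched_files)
```

Only names ending in `.md` are kept, so text files and backup copies in the same folder would be ignored.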

## Indexing Documents with GraphRAG

In [6]:
import subprocess
import sys

# The 'output' directory is created after the documents are indexed
if not os.path.exists(os.path.join(data_directory, "output")):
    print("[+] Indexing documents..")
    # sys.executable runs the indexer with the same interpreter as this notebook
    subprocess.run(
        [sys.executable, "-m", "graphrag.index", "--root", "./data"],
        check=True,
    )  # takes around 4 minutes to complete (9 .md files)
else:
    print("Documents were already indexed")

Documents were already indexed


## Exploring Artifacts

In [7]:
import pandas as pd
import os

pd.set_option("max_colwidth", 100)

# Define the artifacts directory.
# The timestamped folder name is specific to this run; adjust it to match your own output.
artifacts_directory = os.path.join(data_directory, "output/20240801-133147/artifacts")

communities = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_communities.parquet")
)
community_reports = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_community_reports.parquet")
)
entities = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_entities.parquet")
)
nodes = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_nodes.parquet")
)
relationships = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_relationships.parquet")
)
text_units = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_text_units.parquet")
)
documents = pd.read_parquet(
    path=os.path.join(artifacts_directory, "create_final_documents.parquet")
)
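
Because the run folder above is hard-coded, a small helper can pick the most recent run instead. This is a sketch that assumes the `output/<timestamp>/artifacts` layout seen in this notebook; `latest_artifacts_dir` is a hypothetical name, and the second timestamp in the demonstration is made up.

```python
import os
import tempfile


def latest_artifacts_dir(output_root):
    """Return the artifacts folder of the most recent run under output_root."""
    runs = sorted(
        name
        for name in os.listdir(output_root)
        if os.path.isdir(os.path.join(output_root, name, "artifacts"))
    )
    if not runs:
        raise FileNotFoundError(f"No indexing runs found under {output_root}")
    # Timestamp names such as 20240801-133147 sort chronologically as strings.
    return os.path.join(output_root, runs[-1], "artifacts")


# Demonstration with a temporary directory layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    for run_name in ("20240801-133147", "20240805-150530"):
        os.makedirs(os.path.join(tmp_dir, run_name, "artifacts"))
    newest = latest_artifacts_dir(tmp_dir)
    newest_run = os.path.basename(os.path.dirname(newest))

print(newest_run)
```

In the notebook this would be called as `latest_artifacts_dir(os.path.join(data_directory, "output"))`.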

### Documents

In [8]:
documents.head(n=2)

| | id | text_unit_ids | raw_content | title |
|---|---|---|---|---|
| 0 | afd6025b421d49a061e3a6acf74daedd | [8112029b261923f118b8ba9196d067c8, 9cba25ded5f4750aeafc8a92bcc4f160, ec2e57ce7860a4c68a1210df495... | `# FIN7 - G0046\n\n**Created**: 2017-05-31T21:32:09.460Z\n\n**Modified**: 2024-04-17T22:09:41.004...` | FIN7.md |
| 1 | 2362aa068c9293d37a7f821c903e629c | [bdbda95d8a567d3c9ee1d96ffbd1e2c0, 5e839788284eecb63814ac126969a91f, 51659aba82df667a55ca598ef7d... | `# Sandworm Team - G0034\n\n**Created**: 2017-05-31T21:32:04.588Z\n\n**Modified**: 2024-04-06T19:...` | Sandworm_Team.md |


### Chunks

In [9]:
text_units.head(n=2)

| | id | text | n_tokens | document_ids | entity_ids | relationship_ids |
|---|---|---|---|---|---|---|
| 0 | bdbda95d8a567d3c9ee1d96ffbd1e2c0 | `# Sandworm Team - G0034\n\n**Created**: 2017-05-31T21:32:04.588Z\n\n**Modified**: 2024-04-06T19:...` | 1200 | [2362aa068c9293d37a7f821c903e629c] | [b45241d70f0e43fca764df95b2b81f77, 4119fd06010c494caa07f439b333f4c5, d3835bf3dda84ead99deadbeac5... | [36be44627ece444284f9e759b8cd25c6, a64b4b17b07a44e4b1ac33580d811936, 423b72bbd56f4caa98f3328202c... |
| 1 | 5e839788284eecb63814ac126969a91f | `HTML tags for the communication traffic between the C2 server.(Citation: ESET Telebots Dec 2016...` | 1200 | [2362aa068c9293d37a7f821c903e629c] | [b45241d70f0e43fca764df95b2b81f77, 96aad7cb4b7d40e9b7e13b94a67af206, c9632a35146940c2a86167c7726... | [bdddcb17ba6c408599dd395ce64f960a, bc70fee2061541148833d19e86f225b3, 0fc15cc3b44c4142a770feb4c03... |


### Entities

In [10]:
entities.sample(n=2)

| | id | name | type | description | human_readable_id | graph_embedding | text_unit_ids | description_embedding |
|---|---|---|---|---|---|---|---|---|
| 1 | 4119fd06010c494caa07f439b333f4c5 | "RUSSIA'S GENERAL STAFF MAIN INTELLIGENCE DIRECTORATE (GRU) MAIN CENTER FOR SPECIAL TECHNOLOGIES... | "ORGANIZATION" | "Russia's GRU GTsST, military unit 74455, is the organization behind Sandworm Team, involved in ... | 1 | | [bdbda95d8a567d3c9ee1d96ffbd1e2c0] | [-0.03776426985859871, 0.02152342163026333, 0.05562674254179001, 0.02057747170329094, -0.0243981... |
| 115 | e683130322ac47708a852a5e51abb7c5 | "COBALT GROUP" | "ORGANIZATION" | "Cobalt Group is a cybercriminal group that may be linked to Carbanak and has used Carbanak malw... | 294 | | [b45050169dbd3a4df2093131b2c6a8ef] | [-0.007689749822020531, -0.027905846014618874, 0.04049476236104965, 0.059866175055503845, 0.0257... |


In [11]:
entities["type"].unique()

array(['"ORGANIZATION"', '"GEO"', '"EVENT"', '', '"ACTIVITY"', '"TARGET"',
       '"CONCEPT"', '"OBJECT"', '"DOCUMENT"', '"TECHNOLOGY"',
       '"INDUSTRY"', '"INFORMATION"', '"TECHNIQUE"', '"FILE"', '"TACTIC"',
       '"METHOD"', '"PERSON"', '"SOFTWARE"', '"LOCATION"', '"SECTOR"',
       '"API"', '"TOOL"', '"MALWARE"', '"ENTITY"', '"CATEGORY"',
       '"ORGANIZATION TYPE"'], dtype=object)
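
The type values above carry literal double quotes, and one of them is an empty string. A small normalization step makes them easier to count or filter. The sketch below uses a short illustrative sample rather than the real column, and `normalize_type` is a hypothetical helper.

```python
from collections import Counter

# A few raw values in the same shape as entities["type"] above (illustrative sample).
raw_types = ['"ORGANIZATION"', '"GEO"', '', '"ORGANIZATION"', '"MALWARE"']


def normalize_type(value):
    """Strip the literal double quotes and label empty types explicitly."""
    cleaned = value.strip().strip('"')
    return cleaned if cleaned else "UNKNOWN"


type_counts = Counter(normalize_type(value) for value in raw_types)
print(type_counts.most_common())
```

On the real DataFrame the same function could be applied with `entities["type"].map(normalize_type).value_counts()`.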

### Relationships

In [12]:
relationships.head(n=2)

| | source | target | weight | description | text_unit_ids | id | human_readable_id | source_degree | target_degree | rank |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "SANDWORM TEAM" | "RUSSIA'S GENERAL STAFF MAIN INTELLIGENCE DIRECTORATE (GRU) MAIN CENTER FOR SPECIAL TECHNOLOGIES... | 1.0 | "Sandworm Team is attributed to and operates under the GRU GTsST military unit 74455." | [bdbda95d8a567d3c9ee1d96ffbd1e2c0] | 36be44627ece444284f9e759b8cd25c6 | 0 | 42 | 1 | 43 |
| 1 | "SANDWORM TEAM" | "GRU UNIT 74455" | 2.0 | Sandworm Team, directly associated with GRU Unit 74455, conducts cyber operations under its aegi... | [bdbda95d8a567d3c9ee1d96ffbd1e2c0, e8027f448bcfaa176bfb6c261f96b671] | a64b4b17b07a44e4b1ac33580d811936 | 1 | 42 | 1 | 43 |
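
In the two rows shown above, `rank` equals `source_degree + target_degree` (42 + 1 = 43). The sketch below recomputes degrees and a rank from a toy edge list under that assumption; whether GraphRAG defines `rank` this way for every row is not confirmed here, and the edge list is made up apart from the Sandworm Team and GRU Unit 74455 names.

```python
from collections import Counter

# Toy edge list; node names other than the first pair are placeholders.
toy_edges = [
    ("SANDWORM TEAM", "GRU UNIT 74455"),
    ("SANDWORM TEAM", "NODE A"),
    ("SANDWORM TEAM", "NODE B"),
]

# Degree = number of edges touching a node.
degree = Counter()
for source, target in toy_edges:
    degree[source] += 1
    degree[target] += 1

# Assumed rank: sum of the degrees of both endpoints.
ranked_edges = [
    {"source": source, "target": target, "rank": degree[source] + degree[target]}
    for source, target in toy_edges
]
for row in ranked_edges:
    print(row)
```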


### Communities

In [13]:
communities.head(n=2)

| | id | title | level | raw_community | relationship_ids | text_unit_ids |
|---|---|---|---|---|---|---|
| 0 | 0 | Community 0 | 0 | 0 | [36be44627ece444284f9e759b8cd25c6, a64b4b17b07a44e4b1ac33580d811936, 423b72bbd56f4caa98f3328202c... | [2dc4bd2cb621b71f05523705af6b571f,396c22047983755b945f5235333819d3,51659aba82df667a55ca598ef7d76... |
| 1 | 4 | Community 4 | 0 | 4 | [30a251bc3d04430d82b5a1a98c7b8c75, 93e1d19f9bfa4c6b8962d56d10ea9483, 8046335ba70b434aa3188392a74... | [202bcc7099b468087583d88e9ddb57de,396c22047983755b945f5235333819d3,4e30c37f12c9daf946a2010e949a6... |


In [14]:
community_reports.head(n=2)

| | community | full_content | level | rank | title | rank_explanation | summary | findings | full_content_json | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | `# Carbanak Group and Carbanak Malware\n\nThe community focuses on the Carbanak Group, known for ...` | 1 | 8.5 | Carbanak Group and Carbanak Malware | The high impact severity rating reflects the significant threat posed by the Carbanak Group's ac... | The community focuses on the Carbanak Group, known for its use of Carbanak malware, highlighting... | [{'explanation': 'The Carbanak Group is a prominent entity within the cybercrime community, know... | `{\n "title": "Carbanak Group and Carbanak Malware",\n "summary": "The community focuses on...` | 1bd35e5c-af30-4e33-b898-a8d58ef37083 |
| 1 | 11 | `# Turla Cyber Espionage Activities and Infrastructure Targets\n\nThis report delves into the sop...` | 1 | 8.5 | Turla Cyber Espionage Activities and Infrastructure Targets | The high impact severity rating reflects Turla's sophisticated capabilities and strategic target... | This report delves into the sophisticated cyber espionage activities of the Turla group, focusin... | [{'explanation': 'Turla, a group with ties to Russia's FSB, has been active since at least 2004,... | `{\n "title": "Turla Cyber Espionage Activities and Infrastructure Targets",\n "summary": "...` | 4b8eb4db-bf82-4576-920a-ebf15d72fb95 |


### Reading Parquet Files

In [15]:
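# Another way to read a Parquet file, using pyarrow directly
# (pq is the conventional alias; pa is usually reserved for pyarrow itself)
import pyarrow.parquet as pq

table = pq.read_table(
    source=os.path.join(artifacts_directory, "create_final_entities.parquet")
)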

table

pyarrow.Table
id: string
name: string
type: string
description: string
human_readable_id: int64
graph_embedding: null
text_unit_ids: list<element: string>
  child 0, element: string
description_embedding: list<element: double>
  child 0, element: double
__index_level_0__: int64
----
id: [["b45241d70f0e43fca764df95b2b81f77","4119fd06010c494caa07f439b333f4c5","d3835bf3dda84ead99deadbeac5d0d7d","077d2820ae1845bcbb1803379a3d1eae","3671ea0dd4e84c1a9b02c5ab2c8f4bac",...,"67f10971666240ea930f3b875aabdc1a","8b95083939ad4771b57a97c2d5805f36","3c4062de44d64870a3cc5913d5769244","24652fab20d84381b112b8491de2887e","d4602d4a27b34358baa86814a3836d68"]]
name: [[""SANDWORM TEAM"",""RUSSIA'S GENERAL STAFF MAIN INTELLIGENCE DIRECTORATE (GRU) MAIN CENTER FOR SPECIAL TECHNOLOGIES (GTSST)"",""GRU UNIT 74455"",""GRU UNIT 26165"",""UKRAINE"",...,""REMOTE SYSTEM DISCOVERY"",""SOFTWARE PACKING"",""LOCAL ACCOUNT"",""SYSTEM OWNER/USER DISCOVERY"",""INDICATOR REMOVAL FROM TOOLS""]]
type: [[""ORGANIZATION"",""ORGAN

## Global Search: Understanding the Knowledge Corpus
Global search is designed "for reasoning about holistic questions about the corpus by leveraging the community summaries."

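Conceptually, global search works as a map-reduce over community summaries: each summary is checked against the question, and the useful partial answers are combined. The toy sketch below shows only that shape. Keyword overlap stands in for the LLM calls, the summaries are truncated prefixes of the two community reports shown earlier, and `map_step` and `reduce_step` are hypothetical names rather than GraphRAG functions.

```python
import re

# Truncated prefixes of two community report summaries shown earlier.
community_summaries = {
    "Carbanak Group and Carbanak Malware":
        "The community focuses on the Carbanak Group, known for its use of Carbanak malware",
    "Turla Cyber Espionage Activities and Infrastructure Targets":
        "This report delves into the sophisticated cyber espionage activities of the Turla group",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def map_step(question, title, summary):
    """Score one community summary against the question (stand-in for an LLM call)."""
    overlap = tokenize(question) & tokenize(summary)
    return {"title": title, "score": len(overlap)}


def reduce_step(partial_answers, top_k=1):
    """Keep the highest-scoring partial answers and drop those with no support."""
    supported = [answer for answer in partial_answers if answer["score"] > 0]
    supported.sort(key=lambda answer: answer["score"], reverse=True)
    return supported[:top_k]


question = "Which report covers cyber espionage activities?"
partials = [
    map_step(question, title, summary)
    for title, summary in community_summaries.items()
]
best = reduce_step(partials)
print(best)
```

When no summary supports the question, the reduce step has nothing to combine, which is one way a global query can end without a usable answer.
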
In [16]:
!python -m graphrag.query --root ./data --method global "What are the top 3 most common techniques used by adversary groups?"



creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response:
I am sorry but I am unable to answer this question given the provided data.

Exception in _map_response_single_batch
Traceback (most recent call last):
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\query\structured_search\global_search\search.py", line 214, in _map_response_single_batch
    search_response = await self.llm.agenerate(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\query\llm\oai\chat_openai.py", line 142, in agenerate
    async for attempt in retryer:
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\tenacity\asyncio\__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\jcohe\AppData\Local\Programs\Python\Python311\Lib\site-packages\tenacity\asyncio\__init__.py", line 153, in iter
    result = await action(r
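
The log above reports `REDACTED,len=9`, and the placeholder `<api key>` written to `.env` earlier is also nine characters long, so one plausible cause of the failed query is that the placeholder was never replaced with a real key. A small pre-flight check can catch this before any query is sent. The helper name `looks_like_placeholder` is hypothetical, and the check is a heuristic rather than a validation of the key itself.

```python
def looks_like_placeholder(api_key):
    """Return True when the key is empty or still an angle-bracket placeholder."""
    key = api_key.strip()
    return not key or (key.startswith("<") and key.endswith(">"))


# Length of the placeholder written to .env earlier in this notebook.
placeholder_length = len("<api key>")
print(placeholder_length, looks_like_placeholder("<api key>"))
```

Running the check on the value read from `.env` before calling `graphrag.query` avoids waiting for retries against the API with an unusable key.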

## Local Search: Exploring Specific Entities and their Context

Local search is designed "for reasoning about specific entities by fanning-out to their neighbors and associated concepts."

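The fan-out idea can be sketched on a toy relationship list: start from one entity and collect its direct neighbors along with the relationship descriptions. This is a simplification under stated assumptions. The real local search also uses embeddings to choose the starting entities, the rows below are only loosely based on relationships shown earlier (the `UKRAINE` description is a placeholder), and `build_neighbors` and `local_context` are hypothetical names.

```python
from collections import defaultdict

# Toy relationship rows: (source, target, description).
toy_relationships = [
    ("SANDWORM TEAM", "GRU UNIT 74455", "Sandworm Team is directly associated with GRU Unit 74455"),
    ("SANDWORM TEAM", "UKRAINE", "placeholder description for illustration"),
    ("COBALT GROUP", "CARBANAK", "Cobalt Group may be linked to Carbanak"),
]


def build_neighbors(rows):
    """Index every relationship under both of its endpoints."""
    neighbors = defaultdict(list)
    for source, target, description in rows:
        neighbors[source].append((target, description))
        neighbors[target].append((source, description))
    return neighbors


def local_context(entity, neighbors):
    """Return the entity's direct neighbors and descriptions, sorted by name."""
    return sorted(neighbors.get(entity, []))


neighbors = build_neighbors(toy_relationships)
context = local_context("SANDWORM TEAM", neighbors)
for name, description in context:
    print(f"{name}: {description}")
```

The collected neighbors and descriptions play the role of the context that would be handed to the LLM along with the question.
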
In [14]:
!python3 -m graphrag.query --root ./data --method local "What are the top 3 most common techniques used by adversary groups?"



INFO: Reading settings from data/settings.yaml
[0m[38;5;8m[[0m2024-08-05T15:05:30Z [0m[33mWARN [0m lance::dataset[0m[38;5;8m][0m No existing dataset at /Users/panda/Documents/CyberChasquis/BTA_4Days/module-8/lancedb/description_embedding.lance, it will be created
creating llm client with {'api_key': 'REDACTED,len=132', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=132', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'reque

### A Local Question

In [2]:
!python -m graphrag.query --root ./data --method local "What do we know about adversary group Wizard Spider? What are its main relationships?"




INFO: Vector Store Args: {}
creating llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_re
[2024-11-03T22:19:57Z WARN  lance::dataset] No existing dataset at C:\Users\jcohe\Code\VisualStudioProjects\BTA_4Days-main\module-8\data\output\20240801-133147\artifacts\lancedb\entity_description_embeddings.lance, it will be created


## References

- GraphRAG documentation: https://microsoft.github.io/graphrag/

- Properties of the STIX intrusion set object: https://github.com/oasis-open/cti-python-stix2/blob/50fd81fd6ba4f26824a864319305bc298e89bb45/stix2/v21/sdo.py#L314

- Reading Parquet files using Pandas: https://www.geeksforgeeks.org/read-a-parquet-file-using-pandas/

- Pyarrow - read_table(): https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html