# Actionable Intelligence: Open Source Intelligence, LLM and Knowledge Graph for CSOs	

A talk by [miaoski](https://miaoski.github.io/) and [GD](mailto:gd@cscs.asia) at RightsCon 2025.

*Civil Society Cyber Shield (CSCS) / Open Culture Foundation*

<img src="img/cscs-logo.png" width="20%" /> <img src="img/OCF-logo.png" width="20%" />

# Why?

CSOs might not have the resources to collect, monitor, and analyze cyber threat intelligence (CTI) relevant to their daily operations.

# Goals

- To collect, summarize, visualize and, most importantly, prioritize actionable intelligence.
- To improve the (cyber)security of organizations and individuals.

# Define the Priority Intelligence Requirements (PIR)

The Cyber Underground General Intelligence Requirements Handbook ([CU-GIRH](https://intel471.com/resources/cyber-underground-handbook)) from Intel 471 is a very handy reference. It is compatible with [OASIS STIX 2.1](https://docs.oasis-open.org/cti/stix/v2.1/os/stix-v2.1-os.html).

<img src="img/CU-GIRH.png" width="30%" />

The handbook provides a thorough list of General Intelligence Requirements (GIRs), from which a CSO can pick the items that are most important or concerning to it.

<img src="img/CU-GIRH-1.png" width="50%" />
<img src="img/CU-GIRH-2.png" width="50%" />

You are encouraged to download the handbook and attend their workshop (if available), as their **Intelligence Planning Workbook** is not publicly available. The workbook looks like this:

![Sample Collection Plan](img/Intel-Planning-Workbook.png)

# Set Google alerts

After defining what concerns you most, use a Google Account, preferably the automation account in your G-Suite for NGO plan, to subscribe to Google Alerts. For example, Tibetan and Uyghur CSOs might want to use these keywords:
- Cybersecurity Tibet
- Cybersecurity Uyghur
- Spyware Uyghur
- Trend Micro Uyghur
- Unit42 Tibet
- etc.

It is recommended to select "All Language" / "Any Region".

<img src="img/Cybersecurity Tibet.png" width="50%" /><img src="img/Spyware Alert.png" width="50%" />

# Receive alerts via email
Sooner or later, alerts start arriving in your mailbox. For example:

<img src="img/AP.png" width="50%" />
<img src="img/The Record.png" width="50%" />

Sometimes you have to click a few links to download the original threat intelligence report.

<img src="img/VOA.png" width="50%" />
<img src="img/VOA Spyware as a service.png" width="50%" />
<img src="img/Download the PDF.png" width="50%" />
<img src="img/TRB4 PDF.png" width="50%" />

# Vendors' reports are very useful

Trend Micro, Unit 42 (Palo Alto Networks), Cisco Talos, Google Threat Intelligence, Recorded Future and others are all industry leaders. Their threat analysis reports are of good quality.


**ADD CTI SEARCH ENGINES**

<img src="img/Trend Micro - Moonshine EK.png" width="50%" />
<img src="img/Google Blog.png" width="50%" />

## Paste the URLs to the Notebook

Just paste the URLs from the incoming alerts here.

⚠️ Some websites don't welcome automation scripts. We found that most cybersecurity vendors are willing to share, though.
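
Links copied straight from an alert email are often wrapped in a Google redirect rather than pointing at the article itself. The following is a minimal sketch for unwrapping them before pasting, assuming the wrapper looks like `https://www.google.com/url?...&url=<target>&...`. The helper name `unwrap_alert_link` and the fallback parameter `q` are assumptions, not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

GOOGLE_HOSTS = ('www.google.com', 'google.com')

def unwrap_alert_link(link: str) -> str:
    """Return the target URL if `link` is a Google redirect, else `link` unchanged."""
    parsed = urlparse(link)
    if parsed.netloc in GOOGLE_HOSTS and parsed.path == '/url':
        params = parse_qs(parsed.query)
        # Assumption: the target is carried in `url` (or `q` in some variants).
        for key in ('url', 'q'):
            if key in params:
                return params[key][0]
    return link

wrapped = ('https://www.google.com/url?rct=j&sa=t'
           '&url=https://therecord.media/china-linked-tibetan-group-hacked-sites'
           '&ct=ga')
print(unwrap_alert_link(wrapped))
```

Links that are not recognized as redirects are returned untouched, so the helper is safe to apply to every pasted line.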

In [3]:
import os
import requests
HTML_SOURCE_PATH = './source'

In [18]:
# Download the URL
URLS = """\
https://cloud.google.com/blog/topics/threat-intelligence/pro-prc-haienergy-us-news
https://edition.cnn.com/2021/03/24/tech/uyghurs-hacking/index.html
https://www.trendmicro.com/en_us/research/24/l/earth-minotaur.html
https://therecord.media/china-linked-tibetan-group-hacked-sites
https://www.lookout.com/threat-intelligence/article/badbazaar-surveillanceware-apt15
https://cloud.google.com/blog/topics/threat-intelligence/pro-prc-haienergy-us-news
"""

def do_scrape(url):
    """Equivalent to
    wget --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36' -x -l 1 $1
    """
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
        'Referer': 'https://www.google.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    source_real_path = os.path.realpath(HTML_SOURCE_PATH)
    uri = url.split('://')[1].split('?')[0].split('#')[0].split('&')[0]
    if os.path.exists(os.path.join(source_real_path, uri)):
        print(f'⏭ Already downloaded: {url}')
        return
    r = requests.get(url, headers=header)
    if r.status_code != 200:
        print(f'❌ Failed to fetch {url}, status code {r.status_code}')
        return
    content = r.text

    # Note that we used r.url here to reflect the redirection chain
    uri = r.url.split('://')[1].split('?')[0].split('#')[0].split('&')[0]
    html_fn = os.path.join(source_real_path, uri)
    if html_fn[-1] == '/':
        html_fn += 'index.html'
    assert os.path.realpath(html_fn)[:len(source_real_path)] == source_real_path     
    # Create the target directory tree if it does not exist yet
    os.makedirs(os.path.dirname(html_fn), exist_ok=True)
    with open(html_fn, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f'✅ Saved as {html_fn}')

# Download 'em!
for url in URLS.split('\n'):
    url = url.strip()
    if url:
        do_scrape(url)

⏭ Already downloaded: https://cloud.google.com/blog/topics/threat-intelligence/pro-prc-haienergy-us-news
⏭ Already downloaded: https://edition.cnn.com/2021/03/24/tech/uyghurs-hacking/index.html
⏭ Already downloaded: https://www.trendmicro.com/en_us/research/24/l/earth-minotaur.html
⏭ Already downloaded: https://therecord.media/china-linked-tibetan-group-hacked-sites
⏭ Already downloaded: https://www.lookout.com/threat-intelligence/article/badbazaar-surveillanceware-apt15
⏭ Already downloaded: https://cloud.google.com/blog/topics/threat-intelligence/pro-prc-haienergy-us-news
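
The URL-to-file mapping inside `do_scrape` can be pulled out into a pure function, which makes the path-traversal guard easier to reason about and to test. This is a sketch, not the notebook's code: `url_to_local_path` is a hypothetical name, and it uses `os.path.commonpath` instead of a string-prefix comparison, because a prefix check would also accept a sibling directory such as `./source-other`.

```python
import os

def url_to_local_path(url: str, root: str) -> str:
    """Map a URL to a file path under `root`, refusing paths that escape it."""
    root = os.path.realpath(root)
    # Drop the scheme, query string and fragment, as do_scrape does.
    uri = url.split('://', 1)[1].split('?')[0].split('#')[0].split('&')[0]
    path = os.path.join(root, uri)
    if path.endswith('/'):
        path += 'index.html'
    path = os.path.realpath(path)
    # The resolved path must still live inside the source directory.
    if os.path.commonpath([root, path]) != root:
        raise ValueError(f'{url} resolves outside {root}')
    return path

print(url_to_local_path(
    'https://www.trendmicro.com/en_us/research/24/l/earth-minotaur.html', './source'))
```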


In [19]:
# Convert the HTML to TXT
from html2text import HTML2Text
from glob import glob
import re

html2text = HTML2Text()
html2text.ignore_links = True
html2text.ignore_images = True
html2text.body_width = 0
source_real_path = os.path.realpath(HTML_SOURCE_PATH)

for fn in glob(source_real_path + '/**', recursive=True):
    if not os.path.isfile(fn):
        continue
    if not fn.endswith('.txt'):
        txt_fn = fn.replace('.html', '.txt')
        if not txt_fn.endswith('.txt'):
            txt_fn += '.txt'
        if os.path.exists(txt_fn):
            print(f'⏭ Already converted: {txt_fn}')
            continue
        print(f'📝 Converting {fn} → {txt_fn}')
        with open(fn, encoding='utf-8') as f:
            text = html2text.handle(f.read())
            text = re.sub("\n[ \t]+", "\n", text)
            text = re.sub("\n\n+", "\n\n", text)
        with open(txt_fn, 'w', encoding='utf-8') as f:
            f.write(text)

⏭ Already converted: /home/phil/miaoski/rightscon/source/edition.cnn.com/2021/03/24/tech/uyghurs-hacking/index.txt
⏭ Already converted: /home/phil/miaoski/rightscon/source/www.trendmicro.com/en_us/research/24/l/earth-minotaur.txt
⏭ Already converted: /home/phil/miaoski/rightscon/source/therecord.media/china-linked-tibetan-group-hacked-sites.txt
⏭ Already converted: /home/phil/miaoski/rightscon/source/www.lookout.com/threat-intelligence/article/badbazaar-surveillanceware-apt15.txt
⏭ Already converted: /home/phil/miaoski/rightscon/source/cloud.google.com/blog/topics/threat-intelligence/pro-prc-haienergy-us-news.txt


# Extract the intelligence with OpenAI GPT-4o

You can also use a local LLM, if you have enough computing power to run Llama 3.3 70B or a better model. We recommend a model that supports tool calling.

Here we provide a predefined subset of STIX 2.1 objects. We are focusing on high-level intelligence, such as actors, malware and victimology. Indicators of compromise (IoC) will be discussed later.

In [2]:
import os
import dotenv
import instructor
from pydantic import BaseModel, Field, field_validator
from typing import Optional, Literal
from datetime import date
import openai

dotenv.load_dotenv()

# model = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY_GD'))

model = openai.AzureOpenAI(
    api_version='2024-08-01-preview',
    azure_endpoint=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'))

# For a local LLM, uncomment the following lines.
# model = openai.OpenAI(
#         base_url="http://192.168.2.3:11434/v1",
#         api_key="ollama")

class Entity(BaseModel):
    name: str


class Victim(Entity):
    """
    Victim that is attacked or targeted by an intrusion set, a campaign and/or a malware in the said event.
    Cite the name as it is mentioned in the threat report.
    """
    victim_type: Literal['company', 'NGO/CSO/NPO', 'community', 'diaspora', 'government', 'cloud service provider', 'ISP/Telco', 'sector', 'other']
    name: str = Field(..., description="Specific name, such as SolarWinds, EFF, Uyghur Diaspora, Ukrainian-speaking community in Berlin, Japanese government, AWS, Viasat, Education Sector in Canada, etc.")
    # citation: str = Field(..., description="Cite one sentence in the report saying the victim is targeted/attacked")


class IntrusionSet(Entity):
    """
    An Intrusion Set is a grouped set of adversarial behaviors and resources with common properties that is believed to be orchestrated by a single organization.
    For example, APT41, Void Typhoon, Pawn Storm.
    """
    name: str
    # citation: str = Field(..., description="Cite one sentence mentioning this intrusion set")    

class Campaign(Entity):
    """
    A Campaign is a grouping of adversarial behaviors that describes a set of malicious activities or attacks (sometimes called waves) that occur over a period of time against a specific set of targets.
    For example, Operation Dream Job, Operation SolarWind. It almost always begins with "Operation ..."
    """
    name: str
    # citation: str = Field(..., description="Cite one sentence mentioning this campaign")    

class Malware(Entity):
    """
    Malware is malicious code. It can be malware proper, an exploit kit, a hacking tool, a webshell, etc.
    For example, COBEACON, Angler Exploit Kit, China Chopper.
    """
    name: str
    # citation: str = Field(..., description="Cite one sentence describing this malware")

class TargetedCountry(Entity):
    """
    A country, region, or city targeted by the intrusion set, campaign, and/or malware in this event.
    It has to be *targeted*, and not only *happened in*.
    If a campaign targets US Energy sector, then US is a targeted country.
    If a campaign targets Amazon (a company in the US), then US is *not* a targeted country.
    """
    name: str
    # citation: str = Field(..., description="Cite one sentence that supports the country/region/city is targeted.")

class Sector(BaseModel):
    """
An industry sector is defined in STIX 2.1 as one of the following:
- agriculture
- aerospace                (NASA, space shuttle, aerospace, satellite launching, rocket; NOT aviation)
- automotive
- chemical
- commercial
- communications
- construction
- defense                  (such as military, defense contractors, army, navy, military academy)
- education
- energy
- entertainment
- financial-services
- government               (such as ministry, president office, provincial/prefectural/state government, CAA, FAA, City Hall)
    - emergency-services     (such as police, EMT, ambulances, law enforcement)
    - government-public-services  (sanitation, sewers, services and not government itself)
- healthcare
- hospitality-leisure
- infrastructure
    - dams
    - nuclear                (nuclear energy, nuclear power plant, nuclear fuel provider, nuclear reprocess plant, etc.)
    - water
- insurance
- manufacturing
- mining
- non-profit               (such as activists, civil society, civilian organization)
- pharmaceuticals
- retail
- technology
- telecommunications       (such as telco, ISP, hosting service providers)
- transportation           (such as civil aviation, airlines, maritime, railroad transportation)
- utilities

In addition to the sectors above, we have added
- dissident
- diplomatic               (embassies, diplomatic entities, diplomatic personnel, trade office, etc.)
- information-technology   (such as IT companies, cloud service providers, cloud platform, antivirus companies, software company, Internet company (but not ISP))
- media                    (such as media outlet, news, TV station, public broadcast, journalists)
- political                (such as politicians, political parties, political not-for-profit corporation, etc.)
    """
    name: Literal[
        'agriculture', 'aerospace', 'automotive', 'chemical', 'commercial', 'communications',
        'construction', 'defense', 'education', 'energy', 'entertainment', 'financial-services',
        'government', 'emergency-services', 'government-public-services', 'healthcare', 'hospitality-leisure',
        'infrastructure', 'dams', 'nuclear', 'water', 'insurance', 'manufacturing', 'mining', 'non-profit',
        'pharmaceuticals', 'retail', 'technology', 'telecommunications', 'transportation', 'utilities',
        'dissident', 'diplomatic', 'information-technology', 'media', 'political']
    # citation: str = Field(..., description="Cite one sentence that supports that the sector is targeted.")

    
class CTI_Report_in_STIX_21(BaseModel):
    """
    Extract cyberthreat intelligence (CTI) report to structured data in STIX 2.1 spec.
    """
    # summary: str = Field(..., description="One line summary of the main event in the CTI report")
    intrustion_sets: Optional[list[IntrusionSet]] = Field(None, description="List of intrusion set(s) in this event")
    campaigns: Optional[list[Campaign]] = Field(None, description="List of campaign(s) of this event.")
    malware: Optional[list[Malware]] = Field(None, description="List of malware, exploit kits, hacking tools used in this event")
    vulnarabilities: Optional[list[str]] = Field(None, description="List of CVE-20xx-xxxxx, GHSA, ZDI, MS-xx-xxx numbers or named vulnerabilities being exploited in this event. Vulnerability can also have names, such as HeartBleed, Spectre.")
    victim_orgs: Optional[list[Victim]] = Field(None)
    targeted_countries: Optional[list[TargetedCountry]] = Field(None)
    targeted_sectors: Optional[list[Sector]] = Field(None, description="List of industrial sectors being targeted in this event")
    exploited_software: Optional[list[str]] = Field(None, description="List of software being targeted/exploited in this event, for example, Windows 10, Windows 11, TOMCAT, Apache Struts, Microsoft Office, Adobe Flash")
    exploited_hardware: Optional[list[str]] = Field(None, description="List of hardware being targeted/exploited in this event. Specify the impacted model and versions, if possible. For example, Siemens S7-1200 PLC, Asus RT-AC86U version before 3.0.0.4.386_51915.")


# Patch the OpenAI client
client = instructor.from_openai(model, mode=instructor.Mode.TOOLS_STRICT)

# Extract structured data from natural language
# cti_report = client.chat.completions.create(
#     # model='zac/phi4-tools',            # phi4-tools works, if you have limited VRAM
#     model="gpt-4o-2024-11-20",
#     response_model=CTI_Report_in_STIX_21,
#     messages=[{"role": "user", "content": report}],
# )
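
The Pydantic models above only carry names, which is enough for the graph later on. If the results ever need to be handed to a STIX-aware tool, each extracted name would have to be wrapped in a STIX 2.1 domain object with the common properties `type`, `spec_version`, `id`, `created` and `modified`. The following is a rough sketch under that assumption. `to_stix_sdo` is a hypothetical helper, and it only covers `intrusion-set` and `campaign`, since other object types (for example `malware`) need additional required properties.

```python
import uuid
from datetime import datetime, timezone

def to_stix_sdo(sdo_type: str, name: str) -> dict:
    """Wrap an extracted name in a minimal STIX 2.1 domain object (sketch)."""
    if sdo_type not in ('intrusion-set', 'campaign'):
        raise ValueError(f'unsupported type in this sketch: {sdo_type}')
    now = datetime.now(timezone.utc)
    # STIX timestamps are RFC 3339 in UTC; here with millisecond precision.
    ts = now.strftime('%Y-%m-%dT%H:%M:%S.') + f'{now.microsecond // 1000:03d}Z'
    return {
        'type': sdo_type,
        'spec_version': '2.1',
        'id': f'{sdo_type}--{uuid.uuid4()}',
        'created': ts,
        'modified': ts,
        'name': name,
    }

print(to_stix_sdo('intrusion-set', 'APT41'))
```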


In [5]:
# Store LLM inferred intelligence in a database, so we don't have to spend money twice!

import sqlite3
from glob import glob

db = sqlite3.connect('cti.db')
c = db.cursor()

c.execute('CREATE TABLE IF NOT EXISTS reports (filename TEXT NOT NULL PRIMARY KEY, json JSON)')
db.commit()

reports = {}

for fn in glob('./source/**/*.txt', recursive=True):
    c.execute('SELECT json FROM reports WHERE filename=?', (fn,))
    if row := c.fetchone():
        print('Loading', fn)
        j = row[0]
        # model_validate_json is a classmethod, so no instance is needed
        reports[fn] = CTI_Report_in_STIX_21.model_validate_json(j)
    else:
        print('Parsing', fn)
        # Read the report text; the API expects a string, not a file object
        with open(fn, encoding='utf-8') as f:
            report = f.read()
        cti_report = client.chat.completions.create(
            model="gpt-4o-2024-11-20",
            response_model=CTI_Report_in_STIX_21,
            messages=[{"role": "user", "content": report}])
        reports[fn] = cti_report
        c.execute('INSERT INTO reports VALUES (?, ?)', (fn, cti_report.model_dump_json()))
        db.commit()

print('Done.')

Parsing ./source/edition.cnn.com/2021/03/24/tech/uyghurs-hacking/index.txt
Parsing ./source/www.trendmicro.com/en_us/research/24/l/earth-minotaur.txt
Parsing ./source/therecord.media/china-linked-tibetan-group-hacked-sites.txt
Parsing ./source/www.lookout.com/threat-intelligence/article/badbazaar-surveillanceware-apt15.txt
Parsing ./source/cloud.google.com/blog/topics/threat-intelligence/pro-prc-haienergy-us-news.txt
Done.
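
The caching pattern in the cell above (look up by filename, call the LLM only on a miss) can be tried in isolation without spending any API credits. This is a minimal sketch using an in-memory database. `cached_extract` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for the LLM call.

```python
import json
import sqlite3

def cached_extract(db, filename, extract):
    """Return the cached result for `filename`, calling `extract` only on a miss."""
    row = db.execute('SELECT json FROM reports WHERE filename=?', (filename,)).fetchone()
    if row:
        return json.loads(row[0])
    result = extract(filename)
    db.execute('INSERT INTO reports VALUES (?, ?)', (filename, json.dumps(result)))
    db.commit()
    return result

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE IF NOT EXISTS reports (filename TEXT NOT NULL PRIMARY KEY, json JSON)')

calls = []

def fake_extract(filename):
    # Hypothetical stand-in for the (paid) LLM extraction.
    calls.append(filename)
    return {'malware': ['ExampleMalware']}

first = cached_extract(db, 'a.txt', fake_extract)
second = cached_extract(db, 'a.txt', fake_extract)   # served from the cache
print(first == second, len(calls))
```

Because the filename is the only cache key, a report that is re-downloaded with different content will not be parsed again unless its row is deleted first.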


## Ontology
We limit the relationships between STIX objects (that is, the edges between nodes) to:
- Intrusion set *uses* a malware
- Intrusion set *targets* a victim organization
- Intrusion set *targets* a country
- Intrusion set *targets* an industry sector
- Campaign *uses* a malware
- Campaign *targets* a victim organization
- Campaign *targets* a country
- Campaign *targets* an industry sector
- Malware *targets* a victim organization
- Malware *exploits* a vulnerability
- Malware *targets* a software
- Malware *targets* a hardware

There are more combinations in the real world, but we chose this subset to focus on what we care about for now.
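
One way to keep the ontology explicit is to write the allowed relationships down as data and check candidate edges against them. This is only a sketch: the type labels (`victim`, `country`, `sector`, and so on) are shorthand chosen for the example, and `is_allowed` is a hypothetical helper that the notebook itself does not use.

```python
# (subject type, predicate, object type), mirroring the list above
ALLOWED_RELATIONSHIPS = {
    ('intrusion-set', 'uses', 'malware'),
    ('intrusion-set', 'targets', 'victim'),
    ('intrusion-set', 'targets', 'country'),
    ('intrusion-set', 'targets', 'sector'),
    ('campaign', 'uses', 'malware'),
    ('campaign', 'targets', 'victim'),
    ('campaign', 'targets', 'country'),
    ('campaign', 'targets', 'sector'),
    ('malware', 'targets', 'victim'),
    ('malware', 'exploits', 'vulnerability'),
    ('malware', 'targets', 'software'),
    ('malware', 'targets', 'hardware'),
}

def is_allowed(subject_type: str, predicate: str, object_type: str) -> bool:
    """True if the triple is part of the chosen ontology subset."""
    return (subject_type, predicate, object_type) in ALLOWED_RELATIONSHIPS

print(is_allowed('malware', 'exploits', 'vulnerability'))
print(is_allowed('victim', 'uses', 'malware'))
```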

In [6]:
import networkx as nx
G = nx.Graph()

for fn, cti in reports.items():
    for e in cti.intrustion_sets or []:
        G.add_node(e.name, color='red')
    for e in cti.campaigns or []:
        G.add_node(e.name, color='yellow')
    for e in cti.malware or []:
        G.add_node(e.name, color='orange')
    for e in cti.victim_orgs or []:
        G.add_node(e.name, color='green')
    for e in cti.targeted_countries or []:
        G.add_node(e.name, color='violet')
    for e in cti.targeted_sectors or []:
        G.add_node(e.name, color='blue')
    G.add_nodes_from(list(cti.vulnarabilities or []), color='magenta')
    G.add_nodes_from(list(cti.exploited_software or []), color='cyan')
    G.add_nodes_from(list(cti.exploited_hardware or []), color='purple')
    
    for s in cti.intrustion_sets or []:
        for o in cti.malware or []:
            G.add_edge(s.name, o.name, predicate='uses', color='red')
        for o in cti.victim_orgs or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
        for o in cti.targeted_countries or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
        for o in cti.targeted_sectors or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
    
    for s in cti.campaigns or []:
        for o in cti.malware or []:
            G.add_edge(s.name, o.name, predicate='uses', color='red')
        for o in cti.victim_orgs or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
        for o in cti.targeted_countries or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
        for o in cti.targeted_sectors or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
    
    for s in cti.malware or []:
        for o in cti.victim_orgs or []:
            G.add_edge(s.name, o.name, predicate='targets', color='blue')
        for o in cti.vulnarabilities or []:
            G.add_edge(s.name, o, predicate='exploits', color='green')        
        for o in cti.exploited_software or []:
            G.add_edge(s.name, o, predicate='targets', color='blue')        
        for o in cti.exploited_hardware or []:
            G.add_edge(s.name, o, predicate='targets', color='blue')        
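
Once the edges exist, a typical question for a CSO is "who is linked to my sector or community?". The sketch below answers it on a plain list of `(subject, predicate, object)` tuples so that it runs without `networkx`. All names in the sample data are placeholders, and `who_targets` and `count_targets` are hypothetical helpers. Note that the cell above pairs every entity with every other entity mentioned in the same report, so an edge means "reported together" rather than a confirmed relationship.

```python
from collections import defaultdict

# Hypothetical sample edges: (subject, predicate, object)
edges = [
    ('Actor A', 'uses', 'Malware X'),
    ('Actor A', 'targets', 'non-profit'),
    ('Actor B', 'targets', 'non-profit'),
    ('Actor B', 'targets', 'government'),
    ('Malware X', 'exploits', 'Vulnerability V'),
]

def who_targets(edges, target):
    """Sorted list of subjects with a 'targets' edge to `target`."""
    return sorted({s for s, p, o in edges if p == 'targets' and o == target})

def count_targets(edges):
    """How many 'targets' edges point at each object; a crude priority signal."""
    counts = defaultdict(int)
    for _, predicate, obj in edges:
        if predicate == 'targets':
            counts[obj] += 1
    return dict(counts)

print(who_targets(edges, 'non-profit'))
print(count_targets(edges))
```

With the notebook's graph, the same tuples could be read from `G.edges(data=True)`, although an undirected `nx.Graph` does not record which end is the subject.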

# Visualization

We use pyvis to visualize the graph and make it interactive.

In [8]:
import pyvis
g = pyvis.network.Network(notebook=True, filter_menu=True, cdn_resources='in_line')    # You don't need cdn_resources='*' in Firefox
g.from_nx(G)
# g.toggle_physics(False)
g.show('cti.html')

cti.html


# IOC

- ORKL.eu
- ioc[.]one
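
The second entry above is written in "defanged" form, so that it is not turned into a clickable link. Indicators in threat reports are commonly defanged the same way, and they need to be converted back before they can be searched or matched. This is a minimal sketch that handles only a few common conventions (`[.]`, `(.)`, `[dot]`, `hxxp`). `refang` and `defang` are hypothetical helper names, and the patterns are not exhaustive.

```python
import re

def refang(ioc: str) -> str:
    """Turn a defanged indicator back into its usable form."""
    ioc = re.sub(r'\[\.\]|\(\.\)|\[dot\]', '.', ioc, flags=re.IGNORECASE)
    ioc = re.sub(r'^hxxp', 'http', ioc, flags=re.IGNORECASE)
    return ioc

def defang(ioc: str) -> str:
    """Make an indicator non-clickable before sharing it."""
    if ioc.startswith('http'):
        ioc = 'hxxp' + ioc[4:]
    return ioc.replace('.', '[.]')

print(refang('ioc[.]one'))
print(defang('https://ioc.one'))
```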

# Conclusion
- Share your intelligence with partner CSOs and friends!
- More data → Less noise → More precise intelligence

We have open-sourced this notebook on GitHub. Let's improve it as a community!

# Dependencies

The notebook uses the following PyPI packages (plus `requests`, which the download cell imports):
- html2text==2024.2.26
- openai==1.60.1
- python-dotenv==1.0.1
- instructor==1.7.2
- pydantic==2.10.6
- networkx==3.4.2
- pyvis==0.3.2

You may want to use newer versions of the dependencies. Uncomment the code cell at the end of the notebook to install them. Remember to restart the Python kernel to reload the modules.


# License
[CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) Open Culture Foundation, except screenshots from respective copyright holders.

In [None]:
# !pip install requests html2text openai python-dotenv instructor pydantic networkx pyvis