<div style="text-align: center;">
<img src="./images/logo.png" width="400"/>
</div>

> **Objective:**  
> Transform unstructured data from synthetic support posts into a structured **Neo4j graph database**, enabling the discovery of relationships between user-reported **symptoms**, their **underlying root causes**, and **associated solutions**.


In [1]:
import pandas as pd
from load_data import json_to_dataframes, TEXT_COL
from preprocess import preprocess_text
from keyword_extraction import extract_keywords


%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aleynakara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aleynakara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aleynakara/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [2]:
file_path = "/Users/aleynakara/Documents/sympTome/support_posts.json"

### 📥 `load_data`: from messy JSON → tidy DataFrame

1. **File read & sanity check**  
   `read_json_dataset(path)`  
   * Raises `ValueError` if the file is missing, malformed, or the top-level object isn’t a list.

2. **ID hygiene (optional)**  
   * `normalize_id()` strips alpha-prefixes (`"post42"` → `42`).  
   * `validate_ids()` rewrites duplicate / missing IDs to fresh sequential strings (`"000"`, `"001"`, …).

3. **Flattening logic** – inside `json_to_dataframes(path)`  
   | `source` flag | Row content | `comment_id` |
   |---------------|-------------|--------------|
   | `0` (`TITLE_CLS`) | post **title** | `NaN` |
   | `1` (`DESCRIPTION_CLS`) | post **description** | `NaN` |
   | `2` (`COMMENT_CLS`) | every **comment** | original comment id |

   Every post therefore yields one title row, one description row, and one row per comment, all sharing the same `post_id`.

4. **Return value**  
   One concatenated DataFrame with columns  
   ```text
   post_id | comment_id | user | text | source
   ```
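
The flattening rule in the table above can be sketched in plain Python. This is a minimal illustration, not the code in `load_data`: the JSON field names (`id`, `title`, `description`, `comments`, `user`, `text`) and the sample record are assumptions, since the real schema is not shown in this notebook. Only the `source` flags and the output columns come from the description above.

```python
# Source flags as described in the flattening table.
TITLE_CLS, DESCRIPTION_CLS, COMMENT_CLS = 0, 1, 2


def flatten_posts(posts):
    """Turn post dicts into flat rows: one per title, description and comment."""
    rows = []
    for post in posts:
        # Title and description rows carry no comment id.
        shared = {"post_id": post["id"], "comment_id": None, "user": post["user"]}
        rows.append({**shared, "text": post["title"], "source": TITLE_CLS})
        rows.append({**shared, "text": post["description"], "source": DESCRIPTION_CLS})
        for comment in post.get("comments", []):
            rows.append({
                "post_id": post["id"],
                "comment_id": comment["id"],
                "user": comment["user"],
                "text": comment["text"],
                "source": COMMENT_CLS,
            })
    return rows


# Toy record with assumed field names and placeholder text.
sample = [{
    "id": "post001",
    "user": "user_a",
    "title": "example title",
    "description": "example description",
    "comments": [{"id": "c001", "user": "user_b", "text": "example comment"}],
}]
rows = flatten_posts(sample)
print([(r["source"], r["comment_id"]) for r in rows])
# [(0, None), (1, None), (2, 'c001')]
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with the five columns listed above.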


In [3]:
df = json_to_dataframes(file_path)

In [4]:
df

Unnamed: 0,post_id,user,text,source,comment_id
0,post001,TrainTravelerMax,Struggling to log into the train service website,0,
1,post001,TrainTravelerMax,I tried accessing the train login page today b...,1,
2,post002,HomeOfficeSally,Wi-Fi signal drops frequently when I move arou...,0,
3,post002,HomeOfficeSally,Whenever I go to the basement or the far end o...,1,
4,post003,GamerGuy89,Ping is fine but I can't load certain websites...,0,
...,...,...,...,...,...
323,post105,TechStudentMaya,Flushing DNS and resetting fixed the connectiv...,2,c156
324,post106,NetworkTech,Packet loss can be driver-related; try disabli...,2,c157
325,post106,StreamerAlex,Disabling that feature stopped the packet loss...,2,c158
326,post107,SupportGuy,Check if the update disabled the hosted networ...,2,c159


### 🔧 Pre-processing pipeline at a glance

| Stage | What it does |
|-------|--------------|
| **1. Security-aware NLP core** | Loads **SecureSpacy**:<br/>• swaps in its custom tokenizer<br/>• inserts Trend-Micro’s EntityRuler (10 threats)<br/>This means IPs, hashes, CVEs stay intact for later steps. |
| **2. Text sanitation helpers** | Plain regex / emoji utils:<br/>`remove_html_tag`, `replace_emojis`, `remove_urls`, `remove_punctuation_preserving_entities` |
| **3. Entity “bubble-wrap”** | `detect_and_preserve_entities` replaces each entity with a placeholder (`__IP_ADDRESS_0__`) so downstream cleaners never split it. After cleaning, `restore_entities` puts originals back. Also collects **keywords** (entity strings) & **labels** for later analytics. |
| **4. Optional clean-ups** | Controlled by flags in `preprocess_text()`:<br/>`spelling_correction` → *TextBlob*<br/>`stopword_removal` → NLTK list<br/>`do_stemming` / `do_lemmatizing` |
| **5. Row-wise pipeline** | `preprocess_text(df, "text", …)`<br/>→ copies the original to `text_orig` (if `suffix`) <br/>→ applies stages 2–4 <br/>→ returns cleaned `text`, plus new **keywords** & **labels** columns. |
| **6. Sentence explode (optional)** | `split_into_sentences(df, "text")` tokenises each text into N sentences and **explodes** them into N new rows—handy for sentence-level classification/embedding. |
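
The "bubble-wrap" idea from stage 3 can be illustrated with a small stand-alone sketch. It is not the implementation of `detect_and_preserve_entities` / `restore_entities`: the helper names below are hypothetical, and a simplified IPv4 regex stands in for the SecureSpacy entities the real pipeline uses.

```python
import re
import string

# Simplified IPv4 pattern, for illustration only.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def protect_entities(text):
    """Swap each IPv4 match for a numbered placeholder and remember the original."""
    originals = []

    def _swap(match):
        originals.append(match.group(0))
        return f"__IP_ADDRESS_{len(originals) - 1}__"

    return IPV4.sub(_swap, text), originals


def strip_punctuation(text):
    """Drop punctuation except underscores, which the placeholders rely on."""
    drop = string.punctuation.replace("_", "")
    return text.translate(str.maketrans("", "", drop))


def restore(text, originals):
    """Put the original entity strings back in place of their placeholders."""
    for index, original in enumerate(originals):
        text = text.replace(f"__IP_ADDRESS_{index}__", original)
    return text


raw = "Ping 8.8.8.8; still down!"
masked, found = protect_entities(raw)
cleaned = restore(strip_punctuation(masked), found)
print(masked)   # Ping __IP_ADDRESS_0__; still down!
print(cleaned)  # Ping 8.8.8.8 still down
```

Without the placeholder step, the punctuation cleaner would turn `8.8.8.8` into `8888`. Any cleaner that runs between masking and restoring must leave the placeholder untouched, so lowercasing has to happen before masking or be made placeholder-aware.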

In [5]:
do_stemming, do_lemmatizing = False, True
df = preprocess_text(
    df, TEXT_COL, do_stemming=do_stemming, do_lemmatizing=do_lemmatizing
)

In [6]:
df

Unnamed: 0,post_id,user,text,source,comment_id,text_orig,keywords,labels
0,post001,TrainTravelerMax,struggle log train service website,0,,Struggling to log into the train service website,{},{}
1,post001,TrainTravelerMax,try access train login page today keep spin wi...,1,,I tried accessing the train login page today b...,"{vpn, today}","{date, protocol}"
2,post002,HomeOfficeSally,wifi signal drop frequently move around house,0,,Wi-Fi signal drops frequently when I move arou...,{},{}
3,post002,HomeOfficeSally,whenever go basement far end live room wifi ke...,1,,Whenever I go to the basement or the far end o...,{},{}
4,post003,GamerGuy89,ping fine cant load certain websites online game,0,,Ping is fine but I can't load certain websites...,{ping},{tool}
...,...,...,...,...,...,...,...,...
323,post105,TechStudentMaya,flush dns reset fix connectivity issue,2,c156,Flushing DNS and resetting fixed the connectiv...,{dns},{protocol}
324,post106,NetworkTech,packet loss driverrelated try disable advance ...,2,c157,Packet loss can be driver-related; try disabli...,{},{}
325,post106,StreamerAlex,disable feature stop packet loss thank,2,c158,Disabling that feature stopped the packet loss...,{},{}
326,post107,SupportGuy,check update disable host network feature driver,2,c159,Check if the update disabled the hosted networ...,{},{}


### Why SecureSpacy?

| Feature | SecureSpacy | spaCy |
|---------|-------------|-------|
| Keeps security artefacts intact (IPs, hashes, CVEs…) | ✅ | ❌ |
| Ships EntityRuler for 10 security types | ✅ | ❌ |
| Simple drop-in (`nlp.tokenizer = custom_tokenizer(nlp)`) | ✅ | — |

In [7]:
import spacy
from securespacy.tokenizer import custom_tokenizer
from securespacy.patterns import add_entity_ruler_pipeline

txt = "Ping 8.8.8.8; corp-portal[.]com still down."
s_nlp = spacy.load("en_core_web_sm")
s_nlp.tokenizer = custom_tokenizer(s_nlp)
add_entity_ruler_pipeline(s_nlp)

<spacy.lang.en.English at 0x106bddde0>

In [8]:
print([t.text for t in s_nlp(txt)])
print([(e.label_, e.text) for e in s_nlp(txt).ents])

['Ping', '8.8.8.8', ';', 'corp-portal[.]com', 'still', 'down', '.']
[('TOOL', 'Ping'), ('IP', '8.8.8.8'), ('DOMAIN', 'corp-portal[.]com')]


In [9]:
d_nlp = spacy.load("en_core_web_sm")
print([t.text for t in d_nlp(txt)])
print([(ent.text, ent.label_) for ent in d_nlp(txt).ents])

['Ping', '8.8.8.8', ';', 'corp', '-', 'portal[.]com', 'still', 'down', '.']
[('Ping 8.8.8.8', 'PERSON')]


### 10 entities are **not** enough! — why we bolt on extra patterns 🚀

SecureSpacy’s built-in EntityRuler covers the classic ten artefacts (IP, URL, DOMAIN, HASH, CVE, FILE_PATH, EMAIL, REGKEY, PROCESS, VENDOR).  
Great for malware write-ups, **but my support posts talk about a *lot* more networking stuff**.  
So I extend the ruler with ~60 rule-based patterns (below) to catch the jargon the model ignores.

| **New label** | **Example match** | **Why it matters** |
|---------------|------------------|--------------------|
| `IP_ADDRESS` (IPv4 & IPv6 regex) | `2001:4860:4860::8888` | IPv6 is invisible to the stock pattern |
| `PROTOCOL` | `TCP`, `dns`, `https` | Needed for policy checks: “is **HTTPS** open on 443?” |
| `DEVICE` | `firewall`, `access point` | Lets the graph link tickets to assets |
| `ERROR_MESSAGE` | `connection timed out` | Groups incidents by failure motif |
| `NETWORK_CONCEPT` | `default gateway`, `NAT`, `SSID` | Higher-level hints for root-cause analysis |
| `CONFIGURATION_SETTING` | `MTU`, `TTL`, `WPA2` | Surfaces the knobs mentioned in fixes |

Why rules instead of retraining?

1. deterministic and fast (regex/keyword)

2. zero additional data labelling

3. keeps annotation consistent across 10k+ tickets

4. easy to tweak when new jargon appears
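
To make the "deterministic and easy to tweak" argument concrete, here is a tiny rule-based tagger written with the standard library only. It is a stand-in for the EntityRuler patterns, not the notebook's actual ~60 rules: the four rules below are illustrative and reuse example matches from the table above.

```python
import re

# Illustrative rules only; extending coverage means adding one more tuple.
RULES = [
    ("IP_ADDRESS", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PROTOCOL", re.compile(r"\b(?:tcp|dns|https)\b", re.IGNORECASE)),
    ("DEVICE", re.compile(r"\b(?:firewall|access point)\b", re.IGNORECASE)),
    ("ERROR_MESSAGE", re.compile(r"\bconnection timed out\b", re.IGNORECASE)),
]


def tag(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, value) for _, label, value in sorted(hits)]


print(tag("DNS lookup via the firewall failed: connection timed out"))
# [('PROTOCOL', 'DNS'), ('DEVICE', 'firewall'), ('ERROR_MESSAGE', 'connection timed out')]
```

The same input always yields the same labels, and no labelled training data is involved.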

In [10]:
df

Unnamed: 0,post_id,user,text,source,comment_id,text_orig,keywords,labels
0,post001,TrainTravelerMax,struggle log train service website,0,,Struggling to log into the train service website,{},{}
1,post001,TrainTravelerMax,try access train login page today keep spin wi...,1,,I tried accessing the train login page today b...,"{vpn, today}","{date, protocol}"
2,post002,HomeOfficeSally,wifi signal drop frequently move around house,0,,Wi-Fi signal drops frequently when I move arou...,{},{}
3,post002,HomeOfficeSally,whenever go basement far end live room wifi ke...,1,,Whenever I go to the basement or the far end o...,{},{}
4,post003,GamerGuy89,ping fine cant load certain websites online game,0,,Ping is fine but I can't load certain websites...,{ping},{tool}
...,...,...,...,...,...,...,...,...
323,post105,TechStudentMaya,flush dns reset fix connectivity issue,2,c156,Flushing DNS and resetting fixed the connectiv...,{dns},{protocol}
324,post106,NetworkTech,packet loss driverrelated try disable advance ...,2,c157,Packet loss can be driver-related; try disabli...,{},{}
325,post106,StreamerAlex,disable feature stop packet loss thank,2,c158,Disabling that feature stopped the packet loss...,{},{}
326,post107,SupportGuy,check update disable host network feature driver,2,c159,Check if the update disabled the hosted networ...,{},{}


In [None]:
keyword_method = "keybert"
df = extract_keywords(df, TEXT_COL, method=keyword_method)

In [None]:
df.drop(columns=["text_clean"], inplace=True)
# df.to_csv(f"./output/{keyword_method}_keywords.csv", index=False)

In [None]:
df

Unnamed: 0,post_id,user,text,source,comment_id,text_orig,keywords,labels
0,post001,TrainTravelerMax,struggle log train train service website,0,,Struggling to log into the train service website,{},{}
1,post001,TrainTravelerMax,load strangely browse access train login websi...,1,,I tried accessing the train login page today b...,"{today, vpn}","{protocol, date}"
2,post002,HomeOfficeSally,signal drop frequently,0,,Wi-Fi signal drops frequently when I move arou...,{},{}
3,post002,HomeOfficeSally,disconnect extremely slow wifi disconnect extr...,1,,Whenever I go to the basement or the far end o...,{},{}
4,post003,GamerGuy89,ping fine load websites online game load certa...,0,,Ping is fine but I can't load certain websites...,{ping},{tool}
...,...,...,...,...,...,...,...,...
323,post105,TechStudentMaya,flush dns reset reset fix connectivity,2,c156,Flushing DNS and resetting fixed the connectiv...,{dns},{protocol}
324,post106,NetworkTech,packet loss driverrelated large send offload,2,c157,Packet loss can be driver-related; try disabli...,{},{}
325,post106,StreamerAlex,feature stop packet packet loss thank,2,c158,Disabling that feature stopped the packet loss...,{},{}
326,post107,SupportGuy,check update disable disable host network netw...,2,c159,Check if the update disabled the hosted networ...,{},{}


### 🔑 Keyword extraction (BERT-based default, TextRank fallback)

| Step | Key parts | Why |
|------|-----------|-----|
| **1. Model bootstrap** | `kw_model = KeyBERT(SentenceTransformer("all-MiniLM-L6-v2"))` | Uses a lightweight BERT sentence-embedder so keywords come from semantic similarity, not just TF-IDF. |
| **2. `extract_keywords_keybert(text)`** | • Top-N phrases `kw_model.extract_keywords(...)` (1–3-grams)<br>• `fuzzywuzzy.process.dedupe` drops near-duplicates (≥ 70% match)<br>• keep first **6** terms → `"cpu overload router"` | Returns only the most distinctive, non-redundant nuggets. |
| **3. DataFrame integration** | `extract_keywords(df, ["text"], method="keybert")`<br>• stores original text in `"text_clean"` <br>• replaces `text` with space-joined keywords | Keeps the DataFrame lean: downstream models ingest only the distilled keyword string. |

### Why keyword extraction?

* **Trim the fat** – cut greetings & filler.  
* **Speed** – shorter text → faster embeddings/search.  
* **Focus** – keep only high-signal phrases (symptoms, devices, configs).  

KeyBERT/TextRank distil each post to ≈ 6 sharp terms.
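
The "drop near-duplicates, keep the first six" step can be sketched without any third-party dependency. This is an approximation under stated assumptions: `difflib.SequenceMatcher` stands in for the fuzzywuzzy matcher, the function name is hypothetical, and the sketch keeps the first phrase of each near-duplicate group, which may differ from how `fuzzywuzzy.process.dedupe` picks a representative.

```python
from difflib import SequenceMatcher


def dedupe_keywords(candidates, threshold=0.7, limit=6):
    """Keep candidates in ranked order, skipping any too similar to one already kept."""
    kept = []
    for phrase in candidates:
        is_new = all(
            SequenceMatcher(None, phrase, seen).ratio() < threshold for seen in kept
        )
        if is_new:
            kept.append(phrase)
        if len(kept) == limit:
            break
    return kept


# Toy ranked candidates, best first.
ranked = ["packet loss", "packet losses", "large send offload", "driver", "drivers"]
print(" ".join(dedupe_keywords(ranked)))
# packet loss large send offload driver
```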

In [None]:
keyword_method = "keybert"
df = pd.read_csv(f"./output/{keyword_method}_keywords.csv")

In [None]:
import os

# Set the STANFORDTOOLSDIR environment variable
os.environ["STANFORDTOOLSDIR"] = os.path.expanduser("~/stanford-ner-2015-12-09")

# Set the CLASSPATH to the Stanford NER jar
os.environ["CLASSPATH"] = os.path.join(
    os.environ["STANFORDTOOLSDIR"], "stanford-ner.jar"
)

# Set the STANFORD_MODELS to the classifiers directory
os.environ["STANFORD_MODELS"] = os.path.join(
    os.environ["STANFORDTOOLSDIR"], "classifiers"
)

### CRF — Conditional Random Field (used here for NER)

| What | Why it matters |
|------|----------------|
| **Model type** | Sequence-labeler that scores the *whole* tag chain, keeps BIO tags consistent. |
| **Features** | Hand-crafted cues (word shape, suffix, prev-tag, etc.) feed into the CRF instead of raw embeddings. |
| **Training / decoding** | Learns $P(\text{labels} \mid \text{tokens})$; finds the best path with Viterbi, no GPU required. |


*(Good on small data & resource-light but needs manual features and sees mostly local context.)*
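
The Viterbi decoding step mentioned in the table can be shown in a few lines. This sketch covers decoding only, not CRF training, and it is not Stanford NER's code: the emission and transition scores are made-up toy numbers chosen so that the transition penalty changes the result.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, tag): score}; missing pairs score 0.
    """
    # best[i][tag] = (score of the best path ending in `tag` at token i, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for step in emissions[1:]:
        column = {}
        for tag in tags:
            prev = max(
                tags, key=lambda p: best[-1][p][0] + transitions.get((p, tag), 0.0)
            )
            score = best[-1][prev][0] + transitions.get((prev, tag), 0.0) + step[tag]
            column[tag] = (score, prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


tags = ["O", "B", "I"]
transitions = {("O", "I"): -10.0}  # an I tag directly after O breaks the BIO scheme
emissions = [
    {"O": 1.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 0.5, "I": 1.0},
    {"O": 0.0, "B": 0.0, "I": 1.0},
]
print(viterbi(emissions, transitions, tags))
# ['O', 'B', 'I']
```

Picking the best tag per token independently would give `O, I, I`; scoring the whole chain instead selects the consistent `O, B, I`.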


In [None]:
from ner import extract_entities_from_dataframes

ner_method = "stanford"
df = extract_entities_from_dataframes(df, method=ner_method)

Some weights of the model checkpoint at jackaduma/SecBERT were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# df.to_csv(f"./output/{ner_method}_entities.csv", index=False)

In [None]:
ner_method = "stanford"
df = pd.read_csv(f"./output/{ner_method}_entities.csv")

In [None]:
df

Unnamed: 0,post_id,comment_id,source,labels,text_orig,text,keywords
0,post001,,0,set(),Struggling to log into the train service website,struggle log train train service website,"{'train', 'website', 'service', 'struggle', 'l..."
1,post001,,1,"{'protocol', 'date'}",I tried accessing the train login page today b...,load strangely browse access train login websi...,"{'train', 'access', 'fine', 'login', 'websites..."
2,post002,,0,set(),Wi-Fi signal drops frequently when I move arou...,signal drop frequently,"{'signal', 'drop', 'frequently'}"
3,post002,,1,set(),Whenever I go to the basement or the far end o...,disconnect extremely slow wifi disconnect extr...,"{'slow', 'wifi', 'disconnect', 'extremely', 'r..."
4,post003,,0,{'tool'},Ping is fine but I can't load certain websites...,ping fine load websites online game load certa...,"{'fine', 'websites', 'online', 'game', 'certai..."
...,...,...,...,...,...,...,...
323,post105,c156,2,{'protocol'},Flushing DNS and resetting fixed the connectiv...,flush dns reset reset fix connectivity,"{'connectivity', 'reset', 'dns', 'fix', 'flush'}"
324,post106,c157,2,set(),Packet loss can be driver-related; try disabli...,packet loss driverrelated large send offload,"{'packet', 'offload', 'loss', 'large', 'driver..."
325,post106,c158,2,set(),Disabling that feature stopped the packet loss...,feature stop packet packet loss thank,"{'packet', 'feature', 'stop', 'thank', 'loss'}"
326,post107,c159,2,set(),Check if the update disabled the hosted networ...,check update disable disable host network netw...,"{'network', 'feature', 'disable', 'update', 'd..."


In [None]:
from text_classification import classify_text

### OpenAI splitter — why it’s handy

* **Graph-ready nodes** – GPT tags each chunk as **symptom (0) / cause (1) / solution (2)** → direct import to Neo4j.
* **Auto-segment** – breaks mixed sentences (“Wi-Fi drops; rebooting fixes it”) into separate rows, no regex pain.
* **Context-aware** – prompt feeds in our keywords & NER labels → sharper, domain-specific decisions.
* **Zero retraining** – tweak the prompt, not a model; great for fast iteration.
* **Clean output** – `classify_text(df, classification_method)` explodes the DataFrame into one row per segment, with new `segment` and `node_type` columns, ready for the next step.
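
Because the model is asked for a JSON array of objects, the reply has to be parsed defensively: a chat model can answer in prose instead. The sketch below shows one way to do that. It is an assumption about the reply shape, not the code in `text_classification`: the keys `segment` and `node_type` are borrowed from the output column names, and `parse_segments` is a hypothetical helper.

```python
import json

NODE_TYPES = {0: "symptom", 1: "cause", 2: "solution"}


def parse_segments(raw_reply):
    """Parse a model reply into (segment, node_type) pairs; return [] if unusable."""
    try:
        items = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []  # the model answered in prose instead of JSON
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        segment = str(item.get("segment", "")).strip()
        node_type = item.get("node_type")
        if segment and isinstance(node_type, int) and node_type in NODE_TYPES:
            pairs.append((segment, node_type))
    return pairs


good = '[{"segment": "Wi-Fi drops", "node_type": 0}, {"segment": "rebooting fixes it", "node_type": 2}]'
print(parse_segments(good))
# [('Wi-Fi drops', 0), ('rebooting fixes it', 2)]
print(parse_segments("Without additional context it is not possible to answer."))
# []
```

Each returned pair would become one row after the explode, so an empty list simply means the input row contributes no graph nodes.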


In [None]:
classification_method = "openai"
df = classify_text(df, classification_method)

Classifying and splitting rows:   0%|          | 0/328 [00:00<?, ?it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Classifying and splitting rows:   1%|          | 2/328 [00:04<12:25,  2.29s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

The keywords and labels provided also do not provide additional context as they are just individual characters and not meaningful words or phrases. 

In order to provide a meaningful classification, more context or a more structured text snippet would be needed. For example, a text snippet like "The webpage loads strangely when I try to access it through the VPN" would provide a clear symptom that can be classified. 

Without additional context or information, it's not possible to provide the requested JSON array of objects.'
Classifying and splitting rows:   1%|          | 3/328 [00:11<23:10,  4.28s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/

In [None]:
# df.to_csv(f"./output/{classification_method}_classified.csv", index=False)

In [None]:
classification_method = "openai"
df = pd.read_csv(f"./output/{classification_method}_classified.csv")

### 🏷️  `generate_node_names()` — why we need it

* **Merge near-duplicates**  
  *Embeds* each `segment + labels + keywords`, clusters with **HDBSCAN** → one “Wi-Fi timeout” node instead of 200 synonymous sentences.

* **Readable graph labels**  
  Picks the most central items in every cluster (or the lone outliers) and asks GPT to coin a **short, human name**.

* **Query-friendly format**  
  A second GPT pass converts those names to **snake_case** (`wifi_timeout`, `dns_misconfig`) – perfect for Cypher, dashboards, or APIs.

* **Parameter knobs**  
  - Swap embedding backend (`openai`, etc.)  
  - Tune cluster granularity (`min_cluster_size`)  
  - Plug in another LLM or offline namer later.

Result: the returned DataFrame gains a `node_name` column that is deduplicated, concise, and machine-friendly, ready to be loaded as graph nodes.
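
As an example of the "offline namer" knob, the snake_case pass could be replaced by a deterministic helper. This is a hypothetical alternative, not what `generate_node_names()` does (the notebook uses a second GPT pass), and it cannot shorten or rephrase names the way an LLM can.

```python
import re


def to_snake_case(name):
    """Lowercase a human-readable name and join its alphanumeric words with underscores."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "_".join(words)


print(to_snake_case("DNS misconfig"))   # dns_misconfig
print(to_snake_case("Wi-Fi Timeout"))   # wi_fi_timeout
```

Note the second result: a purely mechanical rule splits on the hyphen and yields `wi_fi_timeout`, whereas the GPT pass can produce `wifi_timeout`.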


In [21]:
from node_name_generator import generate_node_names


df = generate_node_names(
    df,
    embedding_type="openai",  # or "bert", "tfidf", etc.
    output_path="./output/node_names.csv",  # optional path for saving embeddings
    clustering_method="hdbscan",
    clustering_params={"min_cluster_size": 5},
)

In [None]:
# df.to_csv("./output/node_names.csv", index=False)

In [10]:
import pandas as pd

df = pd.read_csv("./output/node_names.csv")

In [11]:
df

Unnamed: 0,node_type,post_id,comment_id,source,text_orig,node_name
0,0,post002,,0,Wi-Fi signal drops frequently when I move arou...,wifi_connectivity_troubleshooting_kit
1,0,post002,,1,Whenever I go to the basement or the far end o...,wifi_connectivity_troubleshooting_kit
2,0,post002,,1,Whenever I go to the basement or the far end o...,wifi_connectivity_troubleshooting_kit
3,1,post002,,1,Whenever I go to the basement or the far end o...,network_driver_update_cluster
4,0,post003,,0,Ping is fine but I can't load certain websites...,wifi_connectivity_troubleshooting_kit
...,...,...,...,...,...,...
419,1,post106,c157,2,Packet loss can be driver-related; try disabli...,network_driver_update_cluster
420,2,post106,c157,2,Packet loss can be driver-related; try disabli...,large_packet_send_offload
421,0,post107,c160,2,Enabling hosted network in driver properties f...,wifi_connectivity_troubleshooting_kit
422,2,post107,c160,2,Enabling hosted network in driver properties f...,network_hotspot_properties_fixer


## Knowledge Graph Construction

###  Directed, Labelled Graph (Boring 👎🏻)
A graph can be formalised as the tuple $G = (V, E, s, t, \ell_V, \ell_E)$:

$$
\begin{aligned}
V & && \text{finite set of vertices (nodes)}, \\[4pt]
E &\subseteq V \times V && \text{finite set of edges}, \\[4pt]
s &: E \to V && \text{source map (gives the start vertex of each edge)}, \\[4pt]
t &: E \to V && \text{target map (gives the end vertex of each edge)}, \\[4pt]
\ell_V &: V \to \Sigma_V && \text{vertex-labelling function}, \\[4pt]
\ell_E &: E \to \Sigma_E && \text{edge-labelling function}.
\end{aligned}
$$
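
For readers who prefer code to notation, the same tuple can be mirrored by a small in-memory structure. This is only an illustration of the definition, with a hypothetical class name; it has nothing to do with how Neo4j stores the graph. The node names and relationship type in the example are taken from the schema tables in the next section.

```python
from dataclasses import dataclass, field


@dataclass
class LabelledGraph:
    """In-memory mirror of G = (V, E, s, t, l_V, l_E)."""

    vertex_labels: dict = field(default_factory=dict)  # l_V: vertex -> label (keys are V)
    source: dict = field(default_factory=dict)         # s: edge -> start vertex
    target: dict = field(default_factory=dict)         # t: edge -> end vertex
    edge_labels: dict = field(default_factory=dict)    # l_E: edge -> label (keys are E)

    def add_vertex(self, vertex, label):
        self.vertex_labels[vertex] = label

    def add_edge(self, edge, start, end, label):
        # Both endpoints must already be in V.
        if start not in self.vertex_labels or end not in self.vertex_labels:
            raise KeyError("both endpoints must already be vertices")
        self.source[edge] = start
        self.target[edge] = end
        self.edge_labels[edge] = label


g = LabelledGraph()
g.add_vertex("no_website_access", "Symptom")
g.add_vertex("dns_misconfiguration", "Cause")
g.add_edge("e1", "no_website_access", "dns_misconfiguration", "HAS_ASSOCIATED_CAUSE")
print(g.source["e1"], "->", g.target["e1"], "as", g.edge_labels["e1"])
```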


### Our Graph (Lovely 💖🤌🏻🤌🏻🤌🏻)
<!-- ✨ Pretty-printed tables: baby-pink nodes, pastel-green edges -->
<style>
  /* NODE table — light baby-pink */
  .node-table {
    border: 2px solid #ffb6c1;       /* baby-pink border */
    border-collapse: collapse;
    width: 100%;
  }
  .node-table th, .node-table td {
    border: 1px solid #ffb6c1;
    padding: 8px;
    text-align: left;
  }
  .node-table th {
    background: rgba(255, 182, 193, 0.35); /* baby-pink header with mild opacity */
  }

  /* EDGE table — pastel-green */
  .edge-table {
    border: 2px solid #a7f5a7;       /* pastel-green border */
    border-collapse: collapse;
    width: 100%;
    margin-top: 24px;
  }
  .edge-table th, .edge-table td {
    border: 1px solid #a7f5a7;
    padding: 8px;
    text-align: left;
  }
  .edge-table th {
    background: rgba(167, 245, 167, 0.35); /* pastel-green header with mild opacity */
  }
</style>

<table class="node-table">
  <thead>
    <tr>
      <th><strong>Node Label</strong></th>
      <th><strong>Description</strong></th>
      <th><strong>Example value for <code>name</code></strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>Post</code></td>
      <td>An individual support request / forum post.</td>
      <td>N/A</td>
    </tr>
    <tr>
      <td><code>Symptom</code></td>
      <td>A specific problem characteristic reported.</td>
      <td><code>no_website_access</code></td>
    </tr>
    <tr>
      <td><code>Cause</code></td>
      <td>A potential root cause of one or more symptoms.</td>
      <td><code>dns_misconfiguration</code></td>
    </tr>
    <tr>
      <td><code>Solution</code></td>
      <td>A suggested fix or troubleshooting step.</td>
      <td><code>flush_dns_cache</code></td>
    </tr>
  </tbody>
</table>

<!-- EDGE TABLE -->
<table class="edge-table">
  <thead>
    <tr>
      <th><strong>Relationship Type</strong></th>
      <th><strong>Start Node</strong></th>
      <th><strong>End Node</strong></th>
      <th><strong>Description</strong></th>
      <th><strong>Key Properties&nbsp;(type)</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>REPORTS_SYMPTOM</code></td>
      <td>Post</td>
      <td>Symptom</td>
      <td>Symptom described in the post.</td>
      <td><code>context: string</code> (optional snippet)</td>
    </tr>
    <tr>
      <td><code>SUGGESTS_CAUSE</code></td>
      <td>Post</td>
      <td>Cause</td>
      <td>Cause suggested in a reply.</td>
      <td><code>replySource: string</code>, <code>strength: float</code> (optional)</td>
    </tr>
    <tr>
      <td><code>SUGGESTS_SOLUTION</code></td>
      <td>Post</td>
      <td>Solution</td>
      <td>Solution suggested in a reply.</td>
      <td><code>replySource: string</code>, <code>strength: float</code> (optional)</td>
    </tr>
    <tr>
      <td><code>HAS_ASSOCIATED_CAUSE</code></td>
      <td>Symptom</td>
      <td>Cause</td>
      <td>(Derived) Symptom–cause pair frequently linked.</td>
      <td><code>frequency: integer</code>, <code>confidence: float</code> (optional)</td>
    </tr>
    <tr>
      <td><code>SYMPTOM_COCCURS</code></td>
      <td>Symptom</td>
      <td>Symptom</td>
      <td>(Derived) Two symptoms often co-occur.</td>
      <td><code>frequency: integer</code>, <code>lift: float</code> (optional)</td>
    </tr>
    <tr>
      <td><code>CAUSE_ADDRESSED_BY</code></td>
      <td>Cause</td>
      <td>Solution</td>
      <td>(Derived) Solution known to fix the cause.</td>
      <td><code>frequency: integer</code></td>
    </tr>
  </tbody>
</table>


### Cypher Query: `symptom_query`

📋 **The Query**

```cypher
MERGE (s:Symptom {name: $props.name})
ON CREATE SET s += $props, s.created_at = timestamp()
ON MATCH  SET s += $props, s.updated_at = timestamp()
```

> **Purpose:** Ensure a unique `Symptom` node by its `name`, create it if missing, and update properties/timestamps appropriately.

🎯 **Why Use This Pattern?**

* **Idempotency:** Running the same query repeatedly won’t create duplicate nodes.
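
The create-or-update behaviour of `MERGE … ON CREATE … ON MATCH` can be mimicked with a plain dictionary, which makes the idempotency claim easy to check without a database. This is a toy model of the semantics, not driver code: `merge_symptom` and the `severity` property are hypothetical.

```python
import time


def merge_symptom(store, props):
    """Mimic MERGE (s:Symptom {name}) with ON CREATE / ON MATCH on a dict keyed by name."""
    name = props["name"]
    now = int(time.time() * 1000)  # Cypher's timestamp() also returns milliseconds
    if name not in store:
        store[name] = {**props, "created_at": now}  # ON CREATE branch
    else:
        store[name].update(props)                   # ON MATCH branch: s += $props
        store[name]["updated_at"] = now
    return store[name]


symptoms = {}
merge_symptom(symptoms, {"name": "no_website_access"})
merge_symptom(symptoms, {"name": "no_website_access", "severity": "high"})
print(len(symptoms))  # 1
```

Running the merge twice leaves a single entry: the second call only adds the new property and an `updated_at` stamp.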


### 🌐 Relationship Query: `rel_query`

```cypher
MATCH (postNode:Post {postId: $postId})
MATCH (symptomNode:Symptom {name: $node_name})
MERGE (postNode)-[describesRel:DESCRIBES_SYMPTOM]->(symptomNode)
SET describesRel = $props
```

> **Purpose:** Link a `Post` to a `Symptom` via a `DESCRIBES_SYMPTOM` relationship, ensuring it exists and updating its properties.

---

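### 🎯 Why Use the Co-occurrence Pattern?

The points below concern the symptom co-occurrence query, which is not reproduced in this notebook.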

* **Co-occurrence Analysis:** Quickly identify frequently co-mentioned symptoms.
* **Efficient Counting:** By incrementing `weight`, you avoid expensive aggregation queries over large datasets.
* **Avoid Duplicate Relationships:** The `WHERE id(s1) < id(s2)` convention and `MERGE` ensure one relationship per unordered pair.
* **Dynamic Updates:** Each new post automatically adjusts relationship weights, keeping metrics up to date.
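
The pair-counting idea behind these bullets can be sketched in plain Python. It is a conceptual stand-in for the Cypher query, with a hypothetical function name and toy data: sorting the symptom names plays the role of the `id(s1) < id(s2)` convention, so each unordered pair is counted once.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(symptoms_by_post):
    """Count how many posts mention each unordered pair of symptoms."""
    weights = Counter()
    for symptoms in symptoms_by_post.values():
        # Sorting the de-duplicated names yields each pair once, as (a, b) with a < b.
        for pair in combinations(sorted(set(symptoms)), 2):
            weights[pair] += 1
    return weights


# Toy mapping of post id -> symptom node names.
posts = {
    "p1": ["slow_wifi", "packet_loss"],
    "p2": ["packet_loss", "slow_wifi", "slow_wifi"],
    "p3": ["packet_loss"],
}
print(dict(cooccurrence_weights(posts)))
# {('packet_loss', 'slow_wifi'): 2}
```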

---



<div style="text-align: center;">
<img src="./images/er_diagram.jpg" width="300"/>
</div>

In [None]:
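import os

import pandas as pd
from graph_generator import (
    Neo4jUploader,
)

NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
# Prefer an environment variable over a hard-coded credential.
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "cheesecake")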

uploader = Neo4jUploader(NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD)
uploader.clear_database_interactive()
uploader.create_constraints()
uploader.populate_graph_from_dataframe(df)
uploader.create_symptom_cooccurrence_relationships()
# uploader.close()

INFO:root:Successfully connected to Neo4j database.
INFO:root:Database cleared successfully.
INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE CONSTRAINT IF NOT EXISTS FOR (e:Post) REQUIRE (e.id) IS UNIQUE` has no effect.} {description: `CONSTRAINT constraint_49c43cbc FOR (e:Post) REQUIRE (e.id) IS UNIQUE` already exists.} {position: None} for query: 'CREATE CONSTRAINT IF NOT EXISTS FOR (p:Post) REQUIRE p.id IS UNIQUE'
INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE CONSTRAINT IF NOT EXISTS FOR (e:Symptom) REQUIRE (e.name) IS UNIQUE` has no effect.} {description: `CONSTRAINT constraint_baf64ff0 FOR (e:Symptom) REQUIRE (e.name) IS UNIQUE` already exists.} {position: None} for query: 'CREATE CONSTRAINT IF

In [4]:
inference_json_file_path = "./data/small.json"

In [5]:
do_lemmatizing = True
do_stemming = False
keyword_method = "keybert"

In [6]:
from load_data import json_to_dataframes, TEXT_COL
from preprocess import preprocess_text
from keyword_extraction import extract_keywords

2025-06-04 16:59:55,855 - INFO - PyTorch version 2.5.1 available.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
2025-06-04 16:59:57,376 - INFO - Use pytorch device_name: mps
2025-06-04 16:59:57,377 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


In [7]:
inference_df = json_to_dataframes(inference_json_file_path)
inference_df = preprocess_text(
    inference_df, TEXT_COL, do_stemming=do_stemming, do_lemmatizing=do_lemmatizing
)
inference_df = extract_keywords(inference_df, TEXT_COL, method=keyword_method)

In [8]:
inference_df

Unnamed: 0,post_id,user,text,source,comment_id,text_orig,keywords,labels,text_clean
0,post001,TrainTravelerMax,struggle log train train service website,0,,Struggling to log into the train service website,{},{},struggle log train service website
1,post001,TrainTravelerMax,load strangely browse access train login websi...,1,,I tried accessing the train login page today b...,"{vpn, today}","{date, protocol}",try access train login page today keep spin wi...
2,post001,ITSupportAnna,vpns cause issue try turn check issue specific...,2,c001,VPNs often cause issues with specific sites. T...,{},{},vpns often cause issue specific sit try turn c...
3,post001,NetworkNils,cache cookies cause browser cache cookies,2,c002,Sometimes browser cache or cookies cause loadi...,{},{},sometimes browser cache cookies cause load iss...


In [9]:
inference_df.drop(columns=["text_clean"], inplace=True)

In [10]:
inference_df.to_csv("./output/inference_df.csv", index=False)

In [1]:
import pandas as pd

keyword_method = "keybert"
inference_df = pd.read_csv("./output/inference_df.csv")

In [2]:
inference_df

Unnamed: 0,post_id,user,text,source,comment_id,text_orig,keywords,labels
0,post001,TrainTravelerMax,struggle log train train service website,0,,Struggling to log into the train service website,set(),set()
1,post001,TrainTravelerMax,load strangely browse access train login websi...,1,,I tried accessing the train login page today b...,"{'vpn', 'today'}","{'date', 'protocol'}"
2,post001,ITSupportAnna,vpns cause issue try turn check issue specific...,2,c001,VPNs often cause issues with specific sites. T...,set(),set()
3,post001,NetworkNils,cache cookies cause browser cache cookies,2,c002,Sometimes browser cache or cookies cause loadi...,set(),set()


In [3]:
import os

# Set the STANFORDTOOLSDIR environment variable
os.environ["STANFORDTOOLSDIR"] = os.path.expanduser("~/stanford-ner-2015-12-09")
# Set the CLASSPATH to the Stanford NER jar
os.environ["CLASSPATH"] = os.path.join(
    os.environ["STANFORDTOOLSDIR"], "stanford-ner.jar"
)
# Set the STANFORD_MODELS to the classifiers directory
os.environ["STANFORD_MODELS"] = os.path.join(
    os.environ["STANFORDTOOLSDIR"], "classifiers"
)

In [4]:
from ner import extract_entities_from_dataframes

ner_method = "stanford"
inference_df = extract_entities_from_dataframes(inference_df, method=ner_method)

Some weights of the model checkpoint at jackaduma/SecBERT were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly i

In [5]:
inference_df

Unnamed: 0,post_id,comment_id,source,labels,text_orig,text,keywords
0,post001,,0,set(),Struggling to log into the train service website,struggle log train train service website,"{website, service, struggle, train, log}"
1,post001,,1,"{'date', 'protocol'}",I tried accessing the train login page today b...,load strangely browse access train login websi...,"{fine, train, browse, access, websites, vpn, l..."
2,post001,c001,2,set(),VPNs often cause issues with specific sites. T...,vpns cause issue try turn check issue specific...,"{sit, vpns, turn, cause, try, check, issue, sp..."
3,post001,c002,2,set(),Sometimes browser cache or cookies cause loadi...,cache cookies cause browser cache cookies,"{cause, cache, cookies, browser}"


In [6]:
from text_classification import classify_text

classification_method = "openai"
inference_df = classify_text(inference_df, classification_method)

Classifying and splitting rows:   0%|          | 0/4 [00:00<?, ?it/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Classifying and splitting rows:  50%|█████     | 2/4 [00:04<00:04,  2.08s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Classifying and splitting rows:  75%|███████▌  | 3/4 [00:07<00:02,  2.72s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Classifying and splitting rows: 100%|██████████| 4/4 [00:12<00:00,  3.55s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Classifying and splitting rows: 100%|██████████| 4/4 [00:15<00:00,  3.97s/it]


In [7]:
os.makedirs("./output", exist_ok=True)  # make sure the target directory exists
inference_df.to_csv("./output/inference_df.csv", index=False)

In [8]:
# Reload from disk; set-valued columns (labels, keywords) come back as plain strings.
inference_df = pd.read_csv("./output/inference_df.csv")
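
After the CSV round trip, the `labels` and `keywords` columns no longer hold Python sets but their string representations (for example `"{'date', 'protocol'}"` or `"set()"`). If later steps need real sets again, a small parser can restore them. This is a sketch under that assumption. `parse_set_column` is a hypothetical helper, not something the project modules are known to provide.

```python
import ast


def parse_set_column(value):
    """Turn a CSV cell such as "{'vpn', 'today'}" or "set()" back into a set."""
    if isinstance(value, (set, frozenset)):
        return set(value)  # already a set, nothing to parse
    # NaN (a float) and empty cells are treated as an empty set.
    if not isinstance(value, str) or value.strip() in ("", "set()"):
        return set()
    # literal_eval only evaluates Python literals, so it is safer than eval.
    return set(ast.literal_eval(value))


print(parse_set_column("{'date', 'protocol'}"))
print(parse_set_column("set()"))
```

With pandas, the helper could be applied per column, for instance `inference_df["keywords"].apply(parse_set_column)`.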

In [13]:
inference_df

Unnamed: 0,segment,node_type,post_id,comment_id,source,labels,text_orig,keywords
0,struggle log train,0,post001,,0,set(),Struggling to log into the train service website,"{'website', 'service', 'struggle', 'train', 'l..."
1,train service,1,post001,,0,set(),Struggling to log into the train service website,"{'website', 'service', 'struggle', 'train', 'l..."
2,service website,2,post001,,0,set(),Struggling to log into the train service website,"{'website', 'service', 'struggle', 'train', 'l..."
3,load strangely browse access train login websi...,0,post001,,1,"{'date', 'protocol'}",I tried accessing the train login page today b...,"{'fine', 'train', 'browse', 'access', 'website..."
4,vpn,1,post001,,1,"{'date', 'protocol'}",I tried accessing the train login page today b...,"{'fine', 'train', 'browse', 'access', 'website..."
5,vpns cause issue,0,post001,c001,2,set(),VPNs often cause issues with specific sites. T...,"{'sit', 'vpns', 'turn', 'cause', 'try', 'check..."
6,try turn check issue,2,post001,c001,2,set(),VPNs often cause issues with specific sites. T...,"{'sit', 'vpns', 'turn', 'cause', 'try', 'check..."
7,specific sit,1,post001,c001,2,set(),VPNs often cause issues with specific sites. T...,"{'sit', 'vpns', 'turn', 'cause', 'try', 'check..."
8,cache cookies cause,0,post001,c002,2,set(),Sometimes browser cache or cookies cause loadi...,"{'cause', 'cache', 'cookies', 'browser'}"
9,browser cache cookies,1,post001,c002,2,set(),Sometimes browser cache or cookies cause loadi...,"{'cause', 'cache', 'cookies', 'browser'}"


In [14]:
from node_name_generator import generate_node_names


inference_df = generate_node_names(
    inference_df,
    embedding_type="openai",  # or "bert", "tfidf", etc.
    output_path="./output/inference_node_names.csv",  # optional path for saving embeddings
    clustering_method="hdbscan",
    clustering_params={"min_cluster_size": 2},
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:

In [15]:
inference_df

Unnamed: 0,node_type,post_id,comment_id,source,text_orig,node_name
0,0,post001,,0,Struggling to log into the train service website,struggle_train_log_service
1,1,post001,,0,Struggling to log into the train service website,train_logistics
2,2,post001,,0,Struggling to log into the train service website,service_logix
3,0,post001,,1,I tried accessing the train login page today b...,secure_web_access_protocol
4,1,post001,,1,I tried accessing the train login page today b...,secure_access_suite
5,0,post001,c001,2,VPNs often cause issues with specific sites. T...,vpn_troubleshooting_guide
6,2,post001,c001,2,VPNs often cause issues with specific sites. T...,troubleshoot_task_manager
7,1,post001,c001,2,VPNs often cause issues with specific sites. T...,solve_sit
8,0,post001,c002,2,Sometimes browser cache or cookies cause loadi...,cache_cookie_cause_resolver
9,1,post001,c002,2,Sometimes browser cache or cookies cause loadi...,cache_cookie_setter
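
Judging from the inference log further down, rows with `node_type` 0 are treated as symptoms and rows with `node_type` 1 as known causes. That `node_type` 2 stands for solutions is an assumption here and is not confirmed by the code shown. The sketch below groups node names by that assumed mapping, using plain dictionaries so that it runs without pandas. `NODE_TYPE_LABELS` and `group_node_names` are hypothetical names.

```python
from collections import defaultdict

# Assumed mapping, inferred from the inference log; not confirmed by the project code.
NODE_TYPE_LABELS = {0: "symptom", 1: "cause", 2: "solution"}


def group_node_names(rows):
    """Group node names by their (assumed) node-type label, keeping row order."""
    grouped = defaultdict(list)
    for row in rows:
        label = NODE_TYPE_LABELS.get(row["node_type"], "unknown")
        grouped[label].append(row["node_name"])
    return dict(grouped)


# A few rows copied from the table above.
rows = [
    {"node_type": 0, "node_name": "struggle_train_log_service"},
    {"node_type": 1, "node_name": "train_logistics"},
    {"node_type": 2, "node_name": "service_logix"},
    {"node_type": 0, "node_name": "secure_web_access_protocol"},
]
print(group_node_names(rows))
```

For the full frame, the rows could come from `inference_df.to_dict("records")`.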


In [None]:
NEO4J_PASSWORD = "cheesecake"  # Replace with your actual Neo4j password
NEO4J_URI = "bolt://localhost:7687"  # Or your AuraDB URI
NEO4J_USER = "neo4j"

from graph_generator import Neo4jUploader

# The cells below rely on this uploader instance.
uploader = Neo4jUploader(NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD)

2025-06-04 16:50:19,757 - INFO - Successfully connected to Neo4j database.


In [16]:
number_of_solutions_to_recommend = 2

recommended_solutions = uploader.infer_solutions_from_dataframe(
    inference_df, limit=number_of_solutions_to_recommend
)

INFO:root:Inferring solutions for symptoms: ['struggle_train_log_service', 'secure_web_access_protocol', 'vpn_troubleshooting_guide', 'cache_cookie_cause_resolver'] and known causes: ['train_logistics', 'secure_access_suite', 'solve_sit', 'cache_cookie_setter']
INFO:root:Scores from symptoms path: {'vpn_troubleshooting_guide': 1.0}
INFO:root:Scores from direct causes path (weighted): {}
INFO:root:
Combined & Ranked Recommended Treatments (Top 2):
INFO:root:- Treatment: vpn_troubleshooting_guide, Combined Score: 1.0000


In [17]:
recommended_solutions

['vpn_troubleshooting_guide']
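
The log above reports two score dictionaries, one from the symptoms path and one from the direct causes path, followed by a combined ranking cut to the requested limit. How `infer_solutions_from_dataframe` merges them is not shown in this notebook. The following is only one plausible sketch: a weighted sum, where `combine_scores` and `cause_weight` are hypothetical names introduced for illustration.

```python
def combine_scores(symptom_scores, cause_scores, cause_weight=1.0, limit=2):
    """Merge two score dicts by weighted sum and return the top `limit` entries."""
    combined = dict(symptom_scores)
    for name, score in cause_scores.items():
        combined[name] = combined.get(name, 0.0) + cause_weight * score
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Inputs taken from the log output above.
print(combine_scores({"vpn_troubleshooting_guide": 1.0}, {}))
# → [('vpn_troubleshooting_guide', 1.0)]
```

With an empty causes dictionary, as in this run, the ranking reduces to the symptom-path scores, which matches the single recommendation returned here.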

In [16]:
uploader.close()

2025-06-04 16:51:27,623 - INFO - Neo4j connection closed.
