
Commit 0b6ea8e

Update.
1 parent 4335f40 commit 0b6ea8e

File tree

1 file changed: +68 −24 lines changed

memgraph-graphRAG/graphRAG.ipynb

Lines changed: 68 additions & 24 deletions
@@ -5,7 +5,10 @@
   "metadata": {},
   "source": [
    "# GraphRAG in Memgraph\n",
-   "In this example, we are going to build GraphRAG by using the Memgraph ecosystem and OpenAI"
+   "\n",
+   "In this tutorial, we are going to build GraphRAG using the Memgraph ecosystem and OpenAI. This example is based on a portion of a fixed dataset that is enriched with unstructured data to create a knowledge graph. \n",
+   "\n",
+   "To search for relevant data, we will use vector search on node embeddings to find semantically relevant nodes; after that, the structured data around them is pulled out of the graph and passed to the LLM to answer the question. "
   ]
  },
  {
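The intro added above describes the whole pipeline: embed the question, run a vector search to find a pivot node, pull the structured data around it, and hand that context to the LLM. A minimal sketch of that flow; the four stage names (`embed`, `find_pivot_node`, `fetch_subgraph`, `ask_llm`) are illustrative placeholders, not names from the notebook:

```python
from typing import Callable, Sequence

# Hypothetical outline of the GraphRAG flow described above; the stages are
# passed in as callables because this sketch predates the concrete notebook code.
def graph_rag_answer(
    question: str,
    embed: Callable[[str], Sequence[float]],
    find_pivot_node: Callable[[Sequence[float]], dict],
    fetch_subgraph: Callable[[dict], list],
    ask_llm: Callable[[str, list], str],
) -> str:
    question_embedding = embed(question)         # encode the question text
    pivot = find_pivot_node(question_embedding)  # vector search for pivot node
    context = fetch_subgraph(pivot)              # structured data around pivot
    return ask_llm(question, context)            # LLM answers from graph context
```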
@@ -14,50 +17,61 @@
   "source": [
    "## Prerequisites\n",
    "\n",
+   "To follow this tutorial, you will need Docker, Python, and an OpenAI API key; with a few small tweaks, you can also make this work with a local Ollama instance. \n",
+   "\n",
    "First we need to start Memgraph that has the vector search, we can do this by running the following command: \n",
    "\n",
    "TODO: Updated the command when the vector search is available in the official Memgraph docker image\n",
    "```bash\n",
    "docker run -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage:exp-vector-1 --log-level=TRACE --also-log-to-stderr --telemetry-enabled=False --experimental-vector-indexes='tag__Entity__embedding__{\"dimension\":384,\"limit\":3000}'\n",
    "```\n",
    "\n",
+   "\n",
    "You can do this outside of this notebook. \n",
-   "After that make sure you have few packages installed on your system. You can install them using pip3"
+   "\n",
+   "Once Memgraph is running in the background, make sure to load the initial dataset: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-  "metadata": {},
+  "metadata": {
+   "vscode": {
+    "languageId": "markdown"
+   }
+  },
   "outputs": [],
   "source": [
-   "\n",
-   "pip install neo4j # for driver and connection to Memgraph\n",
-   "pip install sentence-transformers # for sentence embeddings\n",
-   "pip install openai # for access to LLM\n",
-   "pip install dotenv # for environment variables\n"
+   "```bash\n",
+   "cat ./data/memgraph-export-got.cypherl | docker run -i memgraph/mgconsole --host=localhost\n",
+   "``` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "After the install of prerequisites, make sure to insert the base dataset into the Memgraph you can do it via following command: "
+   "After the dataset is ingested, install a few Python packages that are needed to run the demo: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-  "metadata": {
-   "vscode": {
-    "languageId": "markdown"
-   }
-  },
+  "metadata": {},
   "outputs": [],
   "source": [
-   "```bash\n",
-   "cat ./data/memgraph-export-final.cypherl | docker run -i memgraph/mgconsole --host=localhost\n",
-   "``` "
+   "\n",
+   "pip install neo4j # for driver and connection to Memgraph\n",
+   "pip install sentence-transformers # for calculating sentence embeddings\n",
+   "pip install openai # for access to LLM\n",
+   "pip install python-dotenv # for environment variables\n"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "After the prerequisites are installed, make sure the base dataset is inserted into Memgraph; you can do that with the `mgconsole` command above. "
   ]
  },
  {
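With the packages installed and the dataset loaded, it is worth checking that the notebook can reach Memgraph. A minimal sketch, assuming the `bolt://localhost:7687` address from the `-p 7687:7687` mapping above and a default, auth-less Memgraph:

```python
# Minimal connectivity check (a sketch, not notebook code): the bolt URL matches
# the docker port mapping above, and the empty credentials assume Memgraph is
# running without authentication.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

with driver.session() as session:
    node_count = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
    print(f"Connected to Memgraph, {node_count} nodes loaded")

driver.close()
```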
@@ -66,11 +80,17 @@
   "source": [
    "## Enrich knowledge graph with the embeddings \n",
    "\n",
-   "Since in GraphRAG you are not writing actual queries, rather you are asking the questions about your domain knowledge graph in plain English, somehow you need to get to the relevant parts of your knowledge graph. \n",
+   "Since in GraphRAG you are not writing actual Cypher queries, but rather asking questions about your domain knowledge graph in plain English, you somehow need to get to the relevant parts of your knowledge graph. \n",
+   "\n",
+   "To do this, you need to encode the semantic meaning into the graph so that you are able to locate the semantically similar parts of it. \n",
+   "\n",
+   "There are a few approaches you can take here: embedding the node labels and properties, embedding the triplets related to a node, or embedding specific paths through a node. As you put more data into an embedding, you will need a vector with more dimensions, which is costly in both memory and performance. \n",
+   "\n",
+   "On the other hand, you will be able to locate the semantically similar parts of the graph with higher accuracy. This means that even for longer questions, your semantic search will find the right part of the graph. \n",
    "\n",
-   "To do this you need to encode the semantic meaning into the knowledge of the graph so you are able to retrive it based on you question. \n",
+   "If semantic search misses the relevant part of the graph, the LLM will not be able to answer the question correctly. \n",
    "\n",
-   "Here is the function that calculates embeddings based on the node labels an properties: \n"
+   "To provide a baseline example, here is a function that calculates embeddings based on the node labels and properties: \n"
   ]
  },
  {
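The function referenced here is largely unchanged by this commit, so the diff below shows only its tail. As a hedged sketch of what label-and-property embedding can look like (the notebook's actual implementation may differ), using the 384-dimensional `all-MiniLM-L6-v2` model to match the index dimension configured in the docker command above:

```python
# Sketch: build a text per node from its labels and properties, encode it with
# sentence-transformers, and store the vector back on the node. The model choice
# is an assumption that matches the 384-dimension vector index configured above.
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

def compute_node_embeddings() -> None:
    with driver.session() as session:
        records = list(session.run(
            "MATCH (n) RETURN id(n) AS id, labels(n) AS labels, properties(n) AS props"
        ))
        for record in records:
            # Leave out any previously stored embedding so it is not re-embedded.
            props = {k: v for k, v in record["props"].items() if k != "embedding"}
            text = " ".join(record["labels"]) + " " + \
                   " ".join(f"{k}: {v}" for k, v in props.items())
            session.run(
                "MATCH (n) WHERE id(n) = $id SET n.embedding = $embedding",
                id=record["id"], embedding=model.encode(text).tolist(),
            )
```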
@@ -106,17 +126,30 @@
    "    session.run(\"MATCH (n) SET n:Entity\")"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "If we have a node `:Character {name:\"Viserys Targaryen\"}` in the graph, the encoded embedding will cover the label `:Character` and the property `name: Viserys Targaryen`.\n",
+   "\n",
+   "Asking the question `Who is Viserys Targaryen?` will result in a very similar embedding, and you will be able to locate that node in the graph. On the other hand, `To whom was Viserys Targaryen loyal in season 1 of Game of Thrones?` is a much longer question, and there is a chance it won't find the `Viserys Targaryen` node in the graph. \n",
+   "\n",
+   "Embedding a triplet on the node will yield a better result in this case. "
+  ]
+ },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding the relevant part of the graph\n",
    "\n",
+   "TODO: configure and set vector search index based on new release\n",
+   "\n",
    "Once the embeddings are calculated in your graph, you can perform a search on top of that embeddings, for that you need a Vector search. \n",
    "\n",
    "Memgraph supports vector search from version 2.22. \n",
    "\n",
-   "The goal is to find the most similar node that resembles your question and extract the relevant knowledge from there. "
+   "The goal is to find the node most similar to your question and extract the relevant knowledge from there. The function below takes the question embedding and compares it to the embeddings stored on the nodes."
   ]
  },
  {
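A hedged sketch of that lookup; the `vector_search.search` procedure call and the `tag` index name are assumptions based on the experimental flag in the docker command above, so verify the exact signature and YIELD columns against the Memgraph release you run:

```python
# Sketch of finding the pivot node via vector search. The procedure call is an
# assumption based on the experimental index named 'tag' configured at startup;
# check the exact signature against your Memgraph version.
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

def find_pivot_node(question: str):
    question_embedding = model.encode(question).tolist()
    with driver.session() as session:
        record = session.run(
            "CALL vector_search.search('tag', 1, $embedding) "
            "YIELD node, distance RETURN node, distance",
            embedding=question_embedding,
        ).single()
        return record["node"] if record else None
```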
@@ -153,13 +186,22 @@
    "    return nodes_data[0] if nodes_data else None"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "Based on the distance between the question embedding and the node embeddings, we get the most similar node. This gives us a pivot point from which we can pull the relevant data. If we were searching for information about `Viserys Targaryen`, we would want to pull the data around that node, so it becomes the pivot node. "
+  ]
+ },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the relevant data\n",
    "\n",
-   "Once you have the pivot node that is connected to the knowledge you need, you can fetch the relevant data around that node: \n"
+   "Once we have the pivot node, we can start pulling the relevant structured data around that node. The most straightforward approach is to perform multiple hops from the pivot node. \n",
+   "\n",
+   "Here is a function that fetches the data around the pivot node, up to a number of `hops` away from it. \n"
   ]
  },
  {
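The fetching function itself sits outside this diff; here is a sketch of the multi-hop expansion it describes, with the name-based pivot match being an illustrative assumption (the notebook's function may identify the pivot differently):

```python
# Sketch: collect nodes up to `hops` relationships away from the pivot node,
# stripping the bulky embedding property before the data goes to the LLM.
def fetch_subgraph(session, pivot_name: str, hops: int = 2) -> list:
    # Variable-length bounds cannot be parameterized in Cypher,
    # so the hop count is inlined into the query string.
    query = (
        f"MATCH path = (p {{name: $name}})-[*1..{hops}]-(m) "
        "RETURN nodes(path) AS path_nodes"
    )
    seen, rows = set(), []
    for record in session.run(query, name=pivot_name):
        for node in record["path_nodes"]:
            props = dict(node)
            props.pop("embedding", None)  # keep the LLM context small
            key = tuple(sorted((k, str(v)) for k, v in props.items()))
            if key not in seen:
                seen.add(key)
                rows.append(props)
    return rows
```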
@@ -210,11 +252,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "To avoid overload the LLM with the non-relevant data, we are dropping the embedding property out of the nodes. \n",
+   "TODO: Insert a picture showing this. \n",
+   "\n",
+   "To avoid overloading the LLM's limited context with non-relevant data, we drop the embedding property from the nodes, since the embeddings carry a lot of data that is not particularly relevant to the LLM. \n",
    "\n",
    "## Helper functions \n",
    "\n",
-   "We also need different prompts to provide the context to the LLM what is expected to generate"
+   "For the LLM to understand what it needs to do, we need specific prompts. The `RAG_prompt` describes how the LLM should answer the question. The `question_prompt` is an optimization made for calculating question embeddings, where only the key pieces of information are extracted from the question to get a better embedding. For example, if you ask `Who is Viserys Targaryen?`, only `Viserys Targaryen` will be extracted from the question. In the end, the LLM will get the full question back in the `RAG_prompt`."
   ]
  },
  {
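The prompt names `RAG_prompt` and `question_prompt` come from the notebook, but their wording is not part of this diff; an illustrative pair sketched from the description above, assumed rather than quoted:

```python
# Illustrative prompt templates; the names match the notebook, but the wording
# here is an assumption based on the description above.
RAG_prompt = (
    "You are a helpful assistant. Answer the question using only the graph "
    "data provided below.\n"
    "Graph data: {context}\n"
    "Question: {question}"
)

question_prompt = (
    "Extract only the key entities and terms from the question below, so they "
    "can be embedded for vector search. Return just the extracted terms.\n"
    "Question: {question}"
)
```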
