Add entity metadata extractor (#7163)
logan-markewich committed Aug 7, 2023
1 parent 0b91476 commit d6cf818
Showing 7 changed files with 461 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,10 @@
- add router module docs (#7171)
- add retriever router (#7166)

### New Features
- Added an `EntityExtractor` for metadata extraction (#7163)
- Added a `RouterRetriever` for routing queries to specific retrievers (#7166)

### Bug Fixes / Nits
- Fix for issue where having multiple concurrent streamed responses from `OpenAIAgent` would result in interleaving of tokens across each response stream. (#7164)
- fix llms callbacks issue (args[0] error) (#7165)
@@ -6,6 +6,7 @@ Our metadata extractor modules include the following "feature extractors":
- `SummaryExtractor` - automatically extracts a summary over a set of Nodes
- `QuestionsAnsweredExtractor` - extracts a set of questions that each Node can answer
- `TitleExtractor` - extracts a title over the context of each Node
- `EntityExtractor` - extracts entities (e.g. names of places, people, and things) mentioned in the content of each Node

You can use these feature extractors within our overall `MetadataExtractor` class. Then you can plug the `MetadataExtractor` into our node parser:

@@ -40,4 +41,5 @@ caption: Metadata Extraction Guides
maxdepth: 1
---
/examples/metadata_extraction/MetadataExtractionSEC.ipynb
/examples/metadata_extraction/EntityExtractionClimate.ipynb
```
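Conceptually, each feature extractor maps a list of nodes to one metadata dict per node, and the `MetadataExtractor` merges those dicts into each node's metadata. A toy, self-contained sketch of that pattern (the class and method names below are illustrative, not the real llama_index API):

```python
# Toy sketch of the feature-extractor pattern. All names here are
# illustrative stand-ins, not the actual llama_index implementation.
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    metadata: dict = field(default_factory=dict)


class ToyEntityExtractor:
    """Attach a naive 'entities' key (capitalized words) to each node."""

    def extract(self, nodes):
        # Return one metadata dict per node, in order.
        return [
            {"entities": sorted({w for w in n.text.split() if w.istitle()})}
            for n in nodes
        ]


class ToyMetadataExtractor:
    def __init__(self, extractors):
        self.extractors = extractors

    def process_nodes(self, nodes):
        # Merge every extractor's per-node dict into node.metadata.
        for extractor in self.extractors:
            for node, meta in zip(nodes, extractor.extract(nodes)):
                node.metadata.update(meta)
        return nodes


nodes = [Node("Parmesan and Boyd study ocean warming")]
ToyMetadataExtractor([ToyEntityExtractor()]).process_nodes(nodes)
print(nodes[0].metadata)  # {'entities': ['Boyd', 'Parmesan']}
```

The real extractors call an LLM or a local NER model instead of the naive capitalization heuristic used here, but the plumbing is the same: extractors produce per-node dicts, and the wrapper merges them.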
2 changes: 2 additions & 0 deletions docs/core_modules/data_modules/index/metadata_extraction.md
@@ -21,6 +21,7 @@ from llama_index.node_parser.extractors import (
QuestionsAnsweredExtractor,
TitleExtractor,
KeywordExtractor,
EntityExtractor,
)

metadata_extractor = MetadataExtractor(
@@ -29,6 +30,7 @@ metadata_extractor = MetadataExtractor(
QuestionsAnsweredExtractor(questions=3),
SummaryExtractor(summaries=["prev", "self"]),
KeywordExtractor(keywords=10),
EntityExtractor(prediction_threshold=0.5),
],
)
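The `prediction_threshold` argument controls how confident the NER model must be before an entity is kept. A minimal sketch of that score cut-off, using made-up prediction dicts rather than the span-marker library's real output format:

```python
# Hypothetical NER output: each predicted span carries a confidence score.
# The dict keys below are illustrative assumptions, not the span-marker
# library's exact schema.
predictions = [
    {"span": "Fox-Kemper", "score": 0.97},
    {"span": "AMOC", "score": 0.41},
    {"span": "Gulev", "score": 0.88},
]


def filter_entities(predictions, prediction_threshold=0.5):
    """Keep only spans whose confidence meets the threshold."""
    return sorted(
        p["span"] for p in predictions if p["score"] >= prediction_threshold
    )


print(filter_entities(predictions))                            # ['Fox-Kemper', 'Gulev']
print(filter_entities(predictions, prediction_threshold=0.9))  # ['Fox-Kemper']
```

Raising the threshold trades recall for precision: fewer spurious names end up in node metadata, at the cost of dropping borderline mentions.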

343 changes: 343 additions & 0 deletions docs/examples/metadata_extraction/EntityExtractionClimate.ipynb
@@ -0,0 +1,343 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Entity Metadata Extraction\n",
"\n",
"In this demo, we use the new `EntityExtractor` to extract entities from each node and store them in its metadata. The default model is `tomaarsen/span-marker-xlm-roberta-base-multinerd`, which is downloaded and run locally from [HuggingFace](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-base-multinerd).\n",
"\n",
"For more information on metadata extraction in LlamaIndex, see our [documentation](https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/documents_and_nodes/usage_metadata_extractor.html)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Needed to run the entity extractor\n",
"# !pip install span_marker\n",
"\n",
"import os\n",
"import openai\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n",
"openai.api_key = os.getenv(\"OPENAI_API_KEY\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup the Extractor and Parser"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/loganmarkewich/llama_index/llama-index/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"/Users/loganmarkewich/llama_index/llama-index/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n",
" warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"'NoneType' object has no attribute 'cadam32bit_grad_fp32'\n"
]
}
],
"source": [
"from llama_index.node_parser.extractors.metadata_extractors import (\n",
" MetadataExtractor,\n",
" EntityExtractor,\n",
")\n",
"from llama_index.node_parser.simple import SimpleNodeParser\n",
"\n",
"entity_extractor = EntityExtractor(\n",
" prediction_threshold=0.5,\n",
" label_entities=False,  # set to True to also store entity labels in the metadata (labels can be erroneous)\n",
" device=\"cpu\", # set to \"cuda\" if you have a GPU\n",
")\n",
"\n",
"metadata_extractor = MetadataExtractor(extractors=[entity_extractor])\n",
"\n",
"node_parser = SimpleNodeParser.from_defaults(metadata_extractor=metadata_extractor)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data\n",
"\n",
"Here, we will download Chapter 3 (Oceans and Coastal Ecosystems, 172 pages) of the IPCC AR6 WGII climate report."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 20.7M 100 20.7M 0 0 22.1M 0 --:--:-- --:--:-- --:--:-- 22.1M\n"
]
}
],
"source": [
"!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, load the documents."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from llama_index import SimpleDirectoryReader\n",
"\n",
"documents = SimpleDirectoryReader(\n",
" input_files=[\"./IPCC_AR6_WGII_Chapter03.pdf\"]\n",
").load_data()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting Metadata\n",
"\n",
"Now, this is a pretty long document. Since we are running on a CPU, we will only process a random subset of the documents for now. Feel free to run it on all documents on your own though!"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
}
],
"source": [
"import random\n",
"\n",
"random.seed(42)\n",
"# comment out the sampling below to run on all documents\n",
"# (100 documents take about 5 minutes on CPU)\n",
"documents = random.sample(documents, 100)\n",
"\n",
"nodes = node_parser.get_nodes_from_documents(documents)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine the outputs"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'page_label': '387', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf'}\n",
"{'page_label': '410', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf', 'entities': {'Parmesan', 'Boyd', 'Riebesell', 'Gattuso'}}\n",
"{'page_label': '391', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf', 'entities': {'Gulev', 'Fox-Kemper'}}\n",
"{'page_label': '430', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf', 'entities': {'Kessouri', 'van der Sleen', 'Brodeur', 'Siedlecki', 'Fiechter', 'Ramajo', 'Carozza'}}\n",
"{'page_label': '388', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf'}\n"
]
}
],
"source": [
"samples = random.sample(nodes, 5)\n",
"for node in samples:\n",
" print(node.metadata)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Try a Query!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from llama_index import ServiceContext, VectorStoreIndex\n",
"from llama_index.llms import OpenAI\n",
"\n",
"service_context = ServiceContext.from_defaults(\n",
" llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.2)\n",
")\n",
"\n",
"index = VectorStoreIndex(nodes, service_context=service_context)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"According to the provided context information, Fox-Kemper is mentioned in relation to the observed and projected trends of ocean warming and marine heatwaves. It is stated that Fox-Kemper et al. (2021) reported that ocean warming has increased on average by 0.88°C from 1850-1900 to 2011-2020. Additionally, it is mentioned that Fox-Kemper et al. (2021) projected that ocean warming will continue throughout the 21st century, with the rate of global ocean warming becoming scenario-dependent from the mid-21st century. Fox-Kemper is also cited as a source for the information on the increasing frequency, intensity, and duration of marine heatwaves over the 20th and early 21st centuries, as well as the projected increase in frequency of marine heatwaves in the future.\n"
]
}
],
"source": [
"query_engine = index.as_query_engine()\n",
"response = query_engine.query(\"What is said by Fox-Kemper?\")\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Contrast without metadata\n",
"\n",
"Here, we rebuild the index, but without the extracted entity metadata."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'page_label': '542', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf'}\n"
]
}
],
"source": [
"for node in nodes:\n",
" node.metadata.pop(\"entities\", None)\n",
"\n",
"print(nodes[0].metadata)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from llama_index import ServiceContext, VectorStoreIndex\n",
"from llama_index.llms import OpenAI\n",
"\n",
"service_context = ServiceContext.from_defaults(\n",
" llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.2)\n",
")\n",
"\n",
"index = VectorStoreIndex(nodes, service_context=service_context)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"According to the provided context information, Fox-Kemper is mentioned in relation to the decline of the AMOC (Atlantic Meridional Overturning Circulation) over the 21st century. The statement mentions that there is high confidence in the decline of the AMOC, but low confidence for quantitative projections.\n"
]
}
],
"source": [
"query_engine = index.as_query_engine()\n",
"response = query_engine.query(\"What is said by Fox-Kemper?\")\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the metadata-enriched index retrieves more relevant information for this query: with entity metadata, the nodes discussing Fox-Kemper's ocean-warming findings are surfaced, whereas the plain index returns a less specific passage."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llama-index",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
