feat(node_parser): MetadataExtractor - Feature Augmentation via node parser post-processing (#6764)

Co-authored-by: jon-chuang <jon-chuang@users.noreply.github.com>
Co-authored-by: Jerry Liu <jerryjliu98@gmail.com>
run-llama · Jul 8, 2023 · 693b253
1 parent 6c6c8ef
commit 693b253
Showing 10 changed files with 1,161 additions and 3 deletions.
diff --git a/docs/examples/metadata_extraction/MetadataExtractionSEC.ipynb b/docs/examples/metadata_extraction/MetadataExtractionSEC.ipynb
diff --git a/docs/how_to/customization/custom_documents.md b/docs/how_to/customization/custom_documents.md
@@ -131,3 +131,17 @@ document = Document(
 print("The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))
 print("The Embedding model sees this: \n", document.get_content(metadata_mode=MetadataMode.EMBED))
 ```
+
+
+## Advanced - Automatic Metadata Extraction
+
+We have initial examples of using LLMs themselves to perform metadata extraction.
+
+Take a look here! 
+
+```{toctree}
+---
+maxdepth: 1
+---
+/examples/metadata_extraction/MetadataExtractionSEC.ipynb
+```
diff --git a/docs/how_to/index/metadata_extraction.md b/docs/how_to/index/metadata_extraction.md
@@ -0,0 +1,71 @@
+# Metadata Extraction
+
+
+## Introduction
+In many cases, especially with long documents, a chunk of text may lack the context necessary to disambiguate the chunk from other similar chunks of text. 
+
+To address this, we use LLMs to extract contextual information about the document, which helps the retrieval and language models tell similar-looking passages apart.
+
+We show this in an [example notebook](https://github.com/jerryjliu/llama_index/blob/main/docs/examples/metadata_extraction/MetadataExtractionSEC.ipynb) and demonstrate its effectiveness in processing long documents.
+
+## Usage
+
+First, we define a metadata extractor that takes in a list of feature extractors that will be processed in sequence.
+
+We then feed this to the node parser, which will add the additional metadata to each node.
+```python
+from llama_index.node_parser import SimpleNodeParser
+from llama_index.node_parser.extractors import (
+    MetadataExtractor,
+    SummaryExtractor,
+    QuestionsAnsweredExtractor,
+    TitleExtractor,
+    KeywordExtractor,
+)
+
+metadata_extractor = MetadataExtractor(
+    extractors=[
+        TitleExtractor(nodes=5),
+        QuestionsAnsweredExtractor(questions=3),
+        SummaryExtractor(summaries=["prev", "self"]),
+        KeywordExtractor(keywords=10),
+    ],
+)
+
+node_parser = SimpleNodeParser(
+    metadata_extractor=metadata_extractor,
+)
+```
+
+Here is a sample of extracted metadata:
+
+```
+{'page_label': '2',
+ 'file_name': '10k-132.pdf',
+ 'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
+ 'questions_this_excerpt_can_answer': '\n\n1. How many countries does Uber Technologies, Inc. operate in?\n2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
+ 'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
+ 'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
+ 'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B'}
+```
+
+## Custom Extractors
+
+If the provided extractors do not fit your needs, you can also define a custom extractor like so:
+```python
+from typing import Dict, List
+
+from llama_index.node_parser.extractors import MetadataFeatureExtractor
+
+class CustomExtractor(MetadataFeatureExtractor):
+    def extract(self, nodes) -> List[Dict]:
+        metadata_list = [
+            {
+                "custom": node.metadata["document_title"]
+                + "\n"
+                + node.metadata["excerpt_keywords"]
+            }
+            for node in nodes
+        ]
+        return metadata_list
+```
+
+In a more advanced example, it can also make use of an `llm_predictor` to extract features from the node content and the existing metadata. Refer to the [source code of the provided metadata extractors](https://github.com/jerryjliu/llama_index/blob/main/llama_index/node_parser/extractors/metadata_extractors.py) for more details.
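
Review note on the docs above: the `CustomExtractor` example reads `document_title` and `excerpt_keywords`, which suggests each extractor can see metadata written by the extractors that ran before it. The following is a minimal, library-free sketch of that "processed in sequence" behaviour. `Node`, `FeatureExtractor`, `WordCountExtractor`, `CombinedExtractor`, and `run_extractors` are hypothetical stand-ins invented for illustration, and the merge-by-`update` step is an assumption, not the actual `MetadataExtractor` implementation.

```python
from typing import Dict, List


class Node:
    """Hypothetical stand-in for a node: text plus a metadata dict."""

    def __init__(self, text: str, metadata: Dict[str, str]) -> None:
        self.text = text
        self.metadata = metadata


class FeatureExtractor:
    """Hypothetical base class with the extract(nodes) -> List[Dict] shape."""

    def extract(self, nodes: List[Node]) -> List[Dict[str, str]]:
        raise NotImplementedError


class WordCountExtractor(FeatureExtractor):
    """Derives one feature per node from the node text."""

    def extract(self, nodes: List[Node]) -> List[Dict[str, str]]:
        return [{"word_count": str(len(node.text.split()))} for node in nodes]


class CombinedExtractor(FeatureExtractor):
    """Reads metadata written by an earlier extractor, so order matters."""

    def extract(self, nodes: List[Node]) -> List[Dict[str, str]]:
        return [
            {"custom": node.metadata["file_name"] + "\n" + node.metadata["word_count"]}
            for node in nodes
        ]


def run_extractors(nodes: List[Node], extractors: List[FeatureExtractor]) -> List[Node]:
    """Run each extractor in order and merge its output into node metadata.

    Assumption: results are merged with dict.update, one dict per node.
    """
    for extractor in extractors:
        metadata_list = extractor.extract(nodes)
        for node, extra in zip(nodes, metadata_list):
            node.metadata.update(extra)
    return nodes


nodes = [
    Node("Uber operates in 69 countries", {"file_name": "10k-132.pdf"}),
    Node("Gross bookings grew", {"file_name": "10k-132.pdf"}),
]
run_extractors(nodes, [WordCountExtractor(), CombinedExtractor()])
```

Swapping the two extractors in the list would raise a `KeyError` in this sketch, because `CombinedExtractor` would run before `word_count` exists. If the real pipeline behaves the same way, a custom extractor that depends on `document_title` or `excerpt_keywords` needs to be listed after `TitleExtractor` and `KeywordExtractor`.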
diff --git a/docs/how_to/index/usage_pattern.md b/docs/how_to/index/usage_pattern.md
@@ -81,5 +81,6 @@ Read more about how to deal with data sources that change over time with `Index`
 ---
 maxdepth: 1
 ---
+metadata_extraction.md
 document_management.md
 ```
diff --git a/llama_index/node_parser/__init__.py b/llama_index/node_parser/__init__.py
@@ -1,6 +1,10 @@
 """Node parsers."""
 
-from llama_index.node_parser.simple import SimpleNodeParser
 from llama_index.node_parser.interface import NodeParser
+from llama_index.node_parser.simple import SimpleNodeParser
+
 
-__all__ = ["SimpleNodeParser", "NodeParser"]
+__all__ = [
+    "SimpleNodeParser",
+    "NodeParser",
+]
diff --git a/llama_index/node_parser/extractors/__init__.py b/llama_index/node_parser/extractors/__init__.py
@@ -0,0 +1,18 @@
+from llama_index.node_parser.extractors.metadata_extractors import (
+    MetadataExtractor,
+    SummaryExtractor,
+    QuestionsAnsweredExtractor,
+    TitleExtractor,
+    KeywordExtractor,
+    MetadataFeatureExtractor,
+)
+
+__all__ = [
+    "MetadataExtractor",
+    "MetadataExtractorBase",
+    "SummaryExtractor",
+    "QuestionsAnsweredExtractor",
+    "TitleExtractor",
+    "KeywordExtractor",
+    "MetadataFeatureExtractor",
+]
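
Review note on the hunk above: `__all__` lists `"MetadataExtractorBase"`, but that name is not among the six names imported at the top of this `__init__.py`. Explicit imports are unaffected, but a wildcard import of a module in that state fails. The sketch below reproduces the situation with a throwaway module; `demo_extractors` is a hypothetical name used only for the demonstration.

```python
import sys
import types

# Hypothetical module that lists a name in __all__ without defining it.
demo = types.ModuleType("demo_extractors")
exec(
    "class MetadataExtractor: pass\n"
    "__all__ = ['MetadataExtractor', 'MetadataExtractorBase']\n",
    demo.__dict__,
)
sys.modules["demo_extractors"] = demo

# A wildcard import resolves every entry in __all__ and fails on the missing one.
namespace = {}
try:
    exec("from demo_extractors import *", namespace)
    star_import_error = None
except AttributeError as exc:
    star_import_error = exc

# Explicit imports of names that do exist still work.
from demo_extractors import MetadataExtractor

print(type(star_import_error).__name__, star_import_error)
```

One possible fix is to drop the stray entry from `__all__`, or to import the missing name if it exists in `metadata_extractors.py`. That file is not shown in this diff, so which fix applies cannot be determined from here.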