Skip to content

Commit

Permalink
feat(node_parser): MetadataExtractor - Feature Augmentation via nod…
Browse files Browse the repository at this point in the history
…e parser post-processing (#6764)



Co-authored-by: jon-chuang <jon-chuang@users.noreply.github.com>
Co-authored-by: Jerry Liu <jerryjliu98@gmail.com>
  • Loading branch information
3 people committed Jul 8, 2023
1 parent 6c6c8ef commit 693b253
Show file tree
Hide file tree
Showing 10 changed files with 1,161 additions and 3 deletions.
655 changes: 655 additions & 0 deletions docs/examples/metadata_extraction/MetadataExtractionSEC.ipynb

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions docs/how_to/customization/custom_documents.md
Original file line number Diff line number Diff line change
Expand Up @@ -131,3 +131,17 @@ document = Document(
print("The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))
print("The Embedding model sees this: \n", document.get_content(metadata_mode=MetadataMode.EMBED))
```


## Advanced - Automatic Metadata Extraction

We have initial examples of using LLMs themselves to perform metadata extraction.

Take a look here!

```{toctree}
---
maxdepth: 1
---
/examples/metadata_extraction/MetadataExtractionSEC.ipynb
```
71 changes: 71 additions & 0 deletions docs/how_to/index/metadata_extraction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Metadata Extraction


## Introduction
In many cases, especially with long documents, a chunk of text may lack the context necessary to disambiguate the chunk from other similar chunks of text.

To combat this, we use LLMs to extract certain contextual information relevant to the document to better help the retrieval and language models disambiguate similar-looking passages.

We show this in an [example notebook](https://github.com/jerryjliu/llama_index/blob/main/examples/metadata_extraction/MetadataExtractionSEC.ipynb) and demonstrate its effectiveness in processing long documents.

## Usage

First, we define a metadata extractor that takes in a list of feature extractors that will be processed in sequence.

We then feed this to the node parser, which will add the additional metadata to each node.
```python
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
MetadataExtractor,
SummaryExtractor,
QuestionsAnsweredExtractor,
TitleExtractor,
KeywordExtractor,
)

metadata_extractor = MetadataExtractor(
extractors=[
TitleExtractor(nodes=5),
QuestionsAnsweredExtractor(questions=3),
SummaryExtractor(summaries=["prev", "self"]),
KeywordExtractor(keywords=10),
],
)

node_parser = SimpleNodeParser(
metadata_extractor=metadata_extractor,
)
```

Here is an sample of extracted metadata:

```
{'page_label': '2',
'file_name': '10k-132.pdf',
'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
'questions_this_excerpt_can_answer': '\n\n1. How many countries does Uber Technologies, Inc. operate in?\n2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B'}
```

## Custom Extractors

If the provided extractors do not fit your needs, you can also define a custom extractor like so:
```python
from llama_index.node_parser.extractors import MetadataFeatureExtractor

class CustomExtractor(MetadataFeatureExtractor):
def extract(self, nodes) -> List[Dict]:
metadata_list = [
{
"custom": node.metadata["document_title"]
+ "\n"
+ node.metadata["excerpt_keywords"]
}
for node in nodes
]
return metadata_list
```

In a more advanced example, it can also make use of an `llm_predictor` to extract features from the node content and the existing metadata. Refer to the [source code of the provided metadata extractors](https://github.com/jerryjliu/llama_index/blob/main/llama_index/node_parser/extractors/metadata_extractors.py) for more details.
1 change: 1 addition & 0 deletions docs/how_to/index/usage_pattern.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,5 +81,6 @@ Read more about how to deal with data sources that change over time with `Index`
---
maxdepth: 1
---
metadata_extraction.md
document_management.md
```
8 changes: 6 additions & 2 deletions llama_index/node_parser/__init__.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
"""Node parsers."""

from llama_index.node_parser.simple import SimpleNodeParser
from llama_index.node_parser.interface import NodeParser
from llama_index.node_parser.simple import SimpleNodeParser


__all__ = ["SimpleNodeParser", "NodeParser"]
__all__ = [
"SimpleNodeParser",
"NodeParser",
]
18 changes: 18 additions & 0 deletions llama_index/node_parser/extractors/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
from llama_index.node_parser.extractors.metadata_extractors import (
MetadataExtractor,
SummaryExtractor,
QuestionsAnsweredExtractor,
TitleExtractor,
KeywordExtractor,
MetadataFeatureExtractor,
)

__all__ = [
"MetadataExtractor",
"MetadataExtractorBase",
"SummaryExtractor",
"QuestionsAnsweredExtractor",
"TitleExtractor",
"KeywordExtractor",
"MetadataFeatureExtractor",
]
Loading

0 comments on commit 693b253

Please sign in to comment.