# Information Extraction with Haystack and NuExtract

In this example, we will automate information extraction from textual data using language models.

The goal is to create an application that extracts specific information from a given text or URL, following a user-defined structure. We will use:
* [**Haystack**](https://haystack.deepset.ai/?utm_campaign=developer-relations&utm_source=hf-cookbook) - a customizable orchestration framework for building LLM applications. We will use Haystack to build the information extraction pipeline.
* [`NuExtract`](https://huggingface.co/numind/NuExtract) - a small language model, specifically fine-tuned for structured data extraction.

## Setup

In [1]:
!pip install -qU haystack-ai trafilatura transformers pyvis flash-attn


## Components

Haystack has two main concepts:
* **Components** are building blocks that perform a single task: file conversion, text generation, embedding creation...
* **Pipelines** allow us to define the flow of data through our LLM application, by combining Components in a directed graph, which may contain cycles.

The following subsections describe the components used in our information extraction application.

### `LinkContentFetcher` and `HTMLToDocument`: extract text from web pages

In our example, we will extract data from startup funding announcements found on the web.

To download web pages and extract text, we use two components:
* `LinkContentFetcher` fetches the content of some URLs and returns a list of content streams.
* `HTMLToDocument` converts HTML sources into textual `Documents`.

In [3]:
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument

fetcher = LinkContentFetcher()

streams = fetcher.run(urls=['https://example.com'])['streams']

converter = HTMLToDocument()
docs = converter.run(sources=streams)

docs

{'documents': [Document(id=4575ca84877be5dcfbc3a4cc8b8bbbb1cd2bf305e81a676ed4dc221c7b99018c, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com'})]}
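To get a feel for what the conversion step involves, here is a minimal, standard-library sketch of HTML-to-text extraction. It is not how `HTMLToDocument` works internally: `TextExtractor` and `html_to_text` are hypothetical names introduced only for illustration, and the sketch ignores boilerplate removal (menus, footers, ads), which a real converter handles far more carefully.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x=1;</script></body></html>"
)
print(html_to_text(sample_html))
```

In the rest of this example we keep using `HTMLToDocument`, which also attaches metadata such as the source URL to each `Document`.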

### `HuggingFaceLocalGenerator`: load and try the model

We use the `HuggingFaceLocalGenerator`, a text generation component that loads a model hosted on Hugging Face using the Transformers library. Haystack also supports `HuggingFaceAPIGenerator` (compatible with Hugging Face APIs and TGI).

We will load `NuExtract`, a model fine-tuned from `microsoft/Phi-3-mini-4k-instruct` to perform structured data extraction from text.

In [None]:
from haystack.components.generators import HuggingFaceLocalGenerator
import torch

generator = HuggingFaceLocalGenerator(
    model='numind/NuExtract',
    huggingface_pipeline_kwargs={
        'model_kwargs': {'torch_dtype': torch.bfloat16}
    }
)

# Actually load the model (`warm_up` is invoked automatically when the generator runs as part of a pipeline)
generator.warm_up()

The model expects a specific prompt structure, which can be inferred from the model card.

In [None]:
prompt = """<|input|>\n### Template:
{
    "Car": {
        "Name": "",
        "Manufacturer": "",
        "Designers": [],
        "Number of units produced": ""
    }
}
### Text:
The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).

The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.

In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
<|output|>
"""

In [None]:
result = generator.run(prompt=prompt)
print(result)

### `PromptBuilder`: dynamically create prompts

The `PromptBuilder` is initialized with a Jinja2 prompt template and renders it by filling in parameters passed through keyword arguments. Our prompt template reproduces the structure shown in the model card.

In [None]:
from haystack.components.builders import PromptBuilder
from haystack import Document

prompt_template = """<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text:
{{documents[0].content}}
<|output|>
"""

prompt_builder = PromptBuilder(template=prompt_template)

In [None]:
example_document = Document(content='The Fiat Panda is a city car...')

example_schema = {
    'Car': {
        'Name': "",
        'Manufacturer': "",
        'Designers': [],
        'Number of units produced': "",
    }
}

prompt = prompt_builder.run(
    schema=example_schema,
    documents=[example_document]
)

print(prompt)
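To make the rendered layout explicit, here is a plain-Python sketch that assembles a prompt with the same sections as the template: the schema, optional examples, the text, and the `<|output|>` marker. The helper name `build_nuextract_prompt` is hypothetical and is not part of Haystack or of the model's tooling. It only mirrors the structure used in this example, and its whitespace may differ slightly from what the Jinja2 template produces.

```python
import json


def build_nuextract_prompt(schema, text, examples=()):
    """Assemble a prompt in the layout used above (hypothetical helper)."""
    parts = ["<|input|>", "### Template:", json.dumps(schema, indent=4)]
    # Each optional example is a filled-in copy of the schema.
    for example in examples:
        parts += ["### Example:", json.dumps(example, indent=4)]
    # The trailing empty string makes the prompt end with a newline.
    parts += ["### Text:", text, "<|output|>", ""]
    return "\n".join(parts)


demo_schema = {"Car": {"Name": "", "Designers": []}}
demo_prompt = build_nuextract_prompt(demo_schema, "The Fiat Panda is a city car...")
print(demo_prompt)
```

In the pipeline we rely on `PromptBuilder` instead, because it receives the documents directly from the upstream component.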

### `OutputAdapter`

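The generator returns its reply as a string, but we would like a dictionary for each source document. To perform this conversion inside a pipeline, we can use the `OutputAdapter`, which transforms a component's output with a Jinja2 template and optional custom filters.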

In [None]:
import json
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(
    template="""{{ replies[0] | replace("'", '"') | json_loads}}""",
    output_type=dict,
    custom_filters={'json_loads': json.loads}
)

print(adapter.run(**result))
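Replacing every single quote with a double quote is a simple trick, but it breaks as soon as a value contains an apostrophe (for instance `Pomigliano d'Arco`). The following sketch shows one possible, more defensive parser: it first tries strict JSON, then falls back to a Python-style literal, and returns an empty dictionary if both fail. The function name `parse_model_reply` is hypothetical. It could presumably be registered through `custom_filters` in the same way `json.loads` is above, but that wiring is an assumption and is not tested here.

```python
import ast
import json


def parse_model_reply(reply):
    """Return a dict parsed from a model reply, or {} if parsing fails."""
    text = reply.strip()
    try:
        # First attempt: strict JSON, the format we expect from the model.
        parsed = json.loads(text)
    except json.JSONDecodeError:
        try:
            # Fallback: a Python-style literal (single-quoted keys and strings).
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return {}
    # Anything that is not a dictionary (a list, a bare string) is rejected.
    return parsed if isinstance(parsed, dict) else {}


print(parse_model_reply('{"Car": {"Name": "Fiat Panda"}}'))
print(parse_model_reply("{'Plant': \"Pomigliano d'Arco\"}"))
print(parse_model_reply("not a dictionary at all"))
```

Returning `{}` on failure keeps downstream code simple, at the cost of hiding malformed replies. Logging the raw reply before discarding it would be a reasonable addition.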

## Information extraction pipeline

### Build the pipeline

Now we can create our pipeline by adding and connecting the individual components.

In [None]:
from haystack import Pipeline

ie_pipe = Pipeline()

ie_pipe.add_component('fetcher', fetcher)
ie_pipe.add_component('converter', converter)
ie_pipe.add_component('prompt_builder', prompt_builder)
ie_pipe.add_component('generator', generator)
ie_pipe.add_component('adapter', adapter)

ie_pipe.connect('fetcher', 'converter')
ie_pipe.connect('converter', 'prompt_builder')
ie_pipe.connect('prompt_builder', 'generator')
ie_pipe.connect('generator', 'adapter')

In [None]:
ie_pipe.show()

### Define the sources and the extraction schema

We will select a list of URLs related to recent startup funding announcements. We will also define a schema for the structured information we aim to extract.

In [None]:
urls = [
    "https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
    "https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
    "https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
    "https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
    "https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
    "https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
    "https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
    "https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
    "https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
    "https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
    "https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
    "https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html",
]

In [None]:
schema = {
    "Funding": {
        "New funding": "",
        "Investors": [],
    },
    "Company": {
        "Name": "",
        "Activity": "",
        "Country": "",
        "Total valuation": "",
        "Total funding": ""
    }
}

### Run the pipeline

We run the pipeline once per URL, passing each component the inputs it needs: the URL to the `fetcher` and the schema to the `prompt_builder`.

In [None]:
from tqdm import tqdm

extracted_data = []
for url in tqdm(urls):
    result = ie_pipe.run(
        {
            'fetcher': {'urls': [url]},
            'prompt_builder': {'schema': schema}
        }
    )

    extracted_data.append(result['adapter']['output'])
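The loop above stops at the first URL that fails, whether the page is unreachable or the model's reply cannot be parsed. A minimal sketch of a more tolerant loop is shown below. `extract_all` is a hypothetical helper, and `fake_run` is a stand-in for the real pipeline call, used only to demonstrate the control flow without fetching anything.

```python
def extract_all(urls, run_one):
    """Apply run_one to each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(run_one(url))
        except Exception as exc:  # deliberately broad: any failure skips this URL
            failures.append((url, repr(exc)))
    return results, failures


def fake_run(url):
    """Stand-in for the pipeline call; fails on URLs containing 'broken'."""
    if "broken" in url:
        raise ValueError("could not fetch page")
    return {"Company": {"Name": url.rsplit("/", 1)[-1]}}


demo_results, demo_failures = extract_all(
    ["https://example.com/acme", "https://example.com/broken"], fake_run
)
print(len(demo_results), "succeeded,", len(demo_failures), "failed")
```

With the real pipeline, `run_one` would wrap the `ie_pipe.run(...)` call from the cell above and return `result['adapter']['output']`.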

In [None]:
extracted_data[:2]
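When inspecting the results, it helps to know which schema fields the model left empty or omitted. The sketch below compares a record against the schema and lists the missing fields. `missing_fields` is a hypothetical helper, and the record uses made-up placeholder values rather than real extraction output.

```python
def missing_fields(schema, record, prefix=""):
    """List schema keys that are absent from record or left empty."""
    missing = []
    for key, template_value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        value = record.get(key) if isinstance(record, dict) else None
        if isinstance(template_value, dict):
            # Recurse into nested sections such as "Funding" or "Company".
            missing.extend(missing_fields(template_value, value or {}, path))
        elif value in ("", [], None):
            missing.append(path)
    return missing


demo_schema = {
    "Funding": {"New funding": "", "Investors": []},
    "Company": {"Name": "", "Country": ""},
}
# Placeholder record: "Investors" is empty and "Country" is absent.
demo_record = {
    "Funding": {"New funding": "100 million", "Investors": []},
    "Company": {"Name": "ExampleCo"},
}
print(missing_fields(demo_schema, demo_record))
```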

## Data exploration and visualization

### Dataframe

In [None]:
def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k

        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ', '.join(v)))
        else:
            items.append((new_key, v))

    return dict(items)
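As a quick illustration of the naming scheme, the snippet below applies `flatten_dict` to a made-up record (placeholder values, not real extraction output). The function is repeated only so that the snippet runs on its own. Nested keys are joined with ` - `, and lists are collapsed into comma-separated strings.

```python
def flatten_dict(d, parent_key=""):
    # Same function as above, repeated so the snippet is self-contained.
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ', '.join(v)))
        else:
            items.append((new_key, v))
    return dict(items)


demo_record = {
    "Funding": {"New funding": "100 million", "Investors": ["Investor A", "Investor B"]},
    "Company": {"Name": "ExampleCo"},
}
flat = flatten_dict(demo_record)
for column, value in flat.items():
    print(f"{column}: {value}")
```

This is why the company name ends up in a column called `Company - Name`.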

In [None]:
import pandas as pd

df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by='Company - Name')

df

### Build a simple graph

To understand the relationships between companies and investors, we will construct a graph and visualize it. We will build the graph with `networkx`, a library for creating and manipulating networks/graphs.

Our graph will have companies and investors as nodes. We will connect investors to companies if they are mentioned in the same document.

In [None]:
import networkx as nx

# Create a new graph
G = nx.Graph()

# Add nodes and edges
for el in extracted_data:
    company_name = el['Company']['Name']
    G.add_node(
        company_name,
        label=company_name,
        title='Company'
    )

    investors = el['Funding']['Investors']
    for investor in investors:
        G.add_node(
            investor,
            label=investor,
            title='Investor',
            color='red'
        )

        G.add_edge(company_name, investor)
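One question such a graph helps answer is which investors back more than one company. The sketch below computes this directly from records shaped like `extracted_data`, using only the standard library. `companies_by_investor` is a hypothetical helper, and the records contain placeholder names rather than extracted results.

```python
from collections import defaultdict


def companies_by_investor(records):
    """Map each investor to the sorted list of companies it is linked to."""
    mapping = defaultdict(set)
    for record in records:
        company = record["Company"]["Name"]
        for investor in record["Funding"]["Investors"]:
            mapping[investor].add(company)
    return {inv: sorted(companies) for inv, companies in mapping.items()}


demo_records = [
    {"Company": {"Name": "Company A"}, "Funding": {"Investors": ["Fund X", "Fund Y"]}},
    {"Company": {"Name": "Company B"}, "Funding": {"Investors": ["Fund X"]}},
]
# Keep only the investors that appear alongside more than one company.
shared = {
    inv: companies
    for inv, companies in companies_by_investor(demo_records).items()
    if len(companies) > 1
}
print(shared)
```

On the `networkx` graph, the same information corresponds to investor nodes whose degree is greater than one.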

Next, we use Pyvis, a library for interactive visualization of networks/graphs, to render the graph.

In [None]:
from pyvis.network import Network
from IPython.display import display, HTML

net = Network(notebook=True, cdn_resources='in_line')
net.from_nx(G)

net.show('simple_graph.html')
display(HTML('simple_graph.html'))