### Using LangExtract to extract entities and their relations

[LangExtract](https://github.com/google/langextract/) is a Python library that uses LLMs to extract structured information from unstructured text based on user-defined instructions: a prompt and a few examples illustrating the kind of information to be extracted.

Like the information extraction (IE) approach we explored in week 7, this one is driven by the user's instructions (a prompt). However, whereas the earlier approach relied on an annotated data schema, this one relies on one or more examples of the data to be extracted.

We will explore LangExtract's information extraction approach on an example of extracting information about tech companies, their software products and their key employees, from the text of a news article.

In [1]:
import textwrap
import langextract as lx
import requests
from bs4 import BeautifulSoup

from collections import defaultdict
from pathlib import Path

Step 1: Define a concise prompt that states what information should be extracted

In [2]:
prompt = textwrap.dedent("""
    You are highly experienced in extracting named entities and their relations from text, especially from news articles and similar kinds of textual content.

    Your task is to extract people, companies, and software products mentioned in the text given below.

    Provide meaningful attributes for each entity you identify, to establish connections between entities of different types. For example, to establish a connection between a person and a company or a company and a software product.

    Important: Use exact text from the input for extraction text. Do not paraphrase.
    Extract entities in order of their appearance with no overlapping text spans.
""")

Step 2: Provide a high-quality example to guide the model

In [3]:
examples = [
    lx.data.ExampleData(
        text=(
            "Llion Jones, a co-founder of Sakana AI, has recently presented one of the company's key products: an AI-based tool called AI Scientist."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="Llion Jones",
                attributes={"works_for": "Sakana AI", "role": "co-founder"},
            ),
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="Sakana AI",
                attributes={"co-founder": "Llion Jones", "product":"AI Scientist"},
            ),
            lx.data.Extraction(
                extraction_class="software_product",
                extraction_text="AI Scientist",
                attributes={"developed_by": "Sakana AI"},
            ),
        ],
    )
]
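Since the prompt asks for exact text spans, listed in order of appearance and without overlaps, it helps if the example itself follows those rules. The following is a minimal sketch of such a sanity check; `check_example` is a hypothetical helper (not part of LangExtract) that works on plain strings, so the extraction texts are passed in as a simple list.

```python
def check_example(text, extraction_texts):
    """Return a list of problems; an empty list means the example is consistent."""
    problems = []
    cursor = 0  # position in `text` just past the previously matched extraction
    for span in extraction_texts:
        pos = text.find(span, cursor)
        if pos == -1:
            if span in text:
                # the span exists, but only before the cursor
                problems.append(f"out of order or overlapping: {span!r}")
            else:
                problems.append(f"not found verbatim: {span!r}")
            continue
        cursor = pos + len(span)
    return problems


example_text = (
    "Llion Jones, a co-founder of Sakana AI, has recently presented one of "
    "the company's key products: an AI-based tool called AI Scientist."
)

ok_report = check_example(example_text, ["Llion Jones", "Sakana AI", "AI Scientist"])
bad_report = check_example(example_text, ["Sakana AI", "Llion Jones", "Sakana A.I."])
print(ok_report)
print(bad_report)
```

With the `examples` list defined above, the same check could be run on `examples[0].text` and `[e.extraction_text for e in examples[0].extractions]`.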

Step 3: Prepare text to be used for IE

We will try out LangExtract's information extraction approach on a [TechCrunch article](https://techcrunch.com/2025/08/03/inside-openais-quest-to-make-ai-do-anything-for-you/) about LLM development inside OpenAI.
To that end, we need two additional Python libraries:
* requests - for pulling the content of the article from the given URL
* beautifulsoup - for extracting the main text of the article

In [4]:
content = []
url = "https://techcrunch.com/2025/08/03/inside-openais-quest-to-make-ai-do-anything-for-you/"

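try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(r.text, 'html.parser')
    # locate the first <div> whose first CSS class is 'entry-content'
    content_elem = soup.find(lambda elem: (elem.name=='div') and elem.has_attr('class') and (elem['class'][0] == 'entry-content'))
    # collect every non-empty <p> that follows that element in document order
    for p in content_elem.find_all_next(name='p'):
        if p.text:
            content.append(p.text)
except Exception as e:
    print(e)

input_text = "\n".join(content)

# store the content in a local file so that we do not need to pull it from the web each time
data_dir = Path.cwd() / 'data'
data_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
with open(data_dir / 'tech_crunch_article.txt', 'w', encoding='utf-8') as f:
    f.write(input_text)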

# input_text[:200]
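To actually benefit from the local copy, later runs can read the file instead of downloading the article again. Below is a minimal sketch of that idea; `load_or_fetch` and `fake_fetch` are hypothetical names introduced for illustration, and the fetching step is passed in as a function so that the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return the cached text if the file exists; otherwise call fetch() and cache its result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    text = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text


calls = []  # records how many times the (stand-in) download happens


def fake_fetch():
    # stand-in for the requests + BeautifulSoup code above
    calls.append(1)
    return "First paragraph.\nSecond paragraph."


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data" / "article.txt"
    first = load_or_fetch(cache, fake_fetch)   # downloads and writes the cache
    second = load_or_fetch(cache, fake_fetch)  # served from the cache

print(first == second, len(calls))
```

In the notebook, the scraping code from the cell above would take the place of `fake_fetch`.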

Step 4: Run the extraction on the target text

In [5]:
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_url="http://localhost:11434", # Automatically selects Ollama provider
    model_id="gemma3:4b",
    fence_output=False,
    use_schema_constraints=False,
    debug=False,
    # extraction_passes=2 # Number of sequential extraction attempts to improve recall by finding additional entities
)

Note: in the call to `extract` above, we set `fence_output=False` and `use_schema_constraints=False` because LangExtract does not (yet) implement schema constraints for models other than Gemini. That is, only for Gemini models can one request output that is fully aligned with the given schema.

For more details about the `extract` function, see its [source code](https://github.com/google/langextract/blob/main/langextract/extraction.py).

We will now explore the results.

The results are stored in an instance of [`AnnotatedDocument` class](https://github.com/google/langextract/blob/main/langextract/core/data.py#L184), and the extracted data is in its `extractions` list attribute:

In [7]:
len(result.extractions)

72

In [9]:
print(result.extractions[0])

Extraction(extraction_class='person', extraction_text='Hunter Lightman', char_interval=CharInterval(start_pos=14, end_pos=29), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=1, group_index=0, description=None, attributes={'works_for': 'OpenAI', 'role': 'researcher'})
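The `char_interval` ties each extraction back to the source text. The printed interval (14 to 29 for the 15-character name) suggests that `end_pos` is exclusive, so slicing the input with it should reproduce the extracted text. The sketch below illustrates that check with a hypothetical stand-in class (`Span`) and a made-up sentence, rather than the real `Extraction` objects and article text.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an extraction: its text plus character offsets."""
    text: str
    start: int
    end: int  # assumed exclusive, as the printed interval above suggests


def misaligned(source, spans):
    """Return the spans whose offsets do not reproduce their text."""
    return [s for s in spans if source[s.start:s.end] != s.text]


source = "Researcher Hunter Lightman works at OpenAI."  # made-up sentence

name = "Hunter Lightman"
start = source.find(name)
good = Span(name, start, start + len(name))
bad = Span("OpenAI", 0, 6)  # deliberately wrong offsets

problems = misaligned(source, [good, bad])
print(problems)
```

On the real result, the analogous comparison would be between `input_text[e.char_interval.start_pos:e.char_interval.end_pos]` and `e.extraction_text`, skipping extractions without a `char_interval`. Extractions whose `alignment_status` is not an exact match may legitimately differ.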


In [10]:
people = defaultdict(list)
companies = defaultdict(list)
products = defaultdict(list)
other = defaultdict(list)

for extr_item in result.extractions:

    ext_type = extr_item.extraction_class
    ext_lbl = extr_item.extraction_text

    if ext_type == "person":
        people[ext_lbl.lower()].append(extr_item)
    elif ext_type == "company":
        companies[ext_lbl.lower()].append(extr_item)
    elif ext_type == 'software_product':
        products[ext_lbl.lower()].append(extr_item)
    else:
        other[ext_lbl.lower()].append(extr_item)

print(f"Number of unique persons: {len(people)}")
print(f"Number of unique companies: {len(companies)}")
print(f"Number of unique software products: {len(products)}")
print(f"Number of other unique entities: {len(other)}")

Number of unique persons: 13
Number of unique companies: 8
Number of unique software products: 20
Number of other unique entities: 0


In [None]:
# `other` can be empty (as it is in the run above), so avoid indexing into an empty list
next(iter(other.items()), None)

In [11]:
def print_extracted_entities(entity_type:str, entities:dict):
    print(f"Extracted entities of type {entity_type.upper()}:")
    for entity, entity_occurrences in entities.items():
        for eo in entity_occurrences:
            position_info = ""
            if eo.char_interval:
                start, end = eo.char_interval.start_pos, eo.char_interval.end_pos
                position_info = f" (pos: {start}-{end})"
            print(f"• {eo.extraction_text} {position_info}: {eo.attributes if eo.attributes else 'no attr.'}")

In [12]:
print_extracted_entities("person", people)

Extracted entities of type PERSON:
• Hunter Lightman  (pos: 14-29): {'works_for': 'OpenAI', 'role': 'researcher'}
• Sam Altman  (pos: 1520-1530): {'title': 'CEO', 'works_for': 'OpenAI'}
• Mark Zuckerberg  (pos: 2008-2023): {'recruited': 'Shengjia Zhao'}
• Shengjia Zhao  (pos: 2187-2200): {'works_for': 'Meta', 'title': 'chief scientist', 'works_at': 'Meta Superintelligence Labs'}
• Andrej Karpathy  (pos: 2792-2807): {'works_for': 'OpenAI', 'role': 'first employee'}
• Ahmed El-Kishky  (pos: 3783-3798): {'works_for': 'OpenAI', 'role': 'researcher'}
• Lightman  (pos: 4296-4304): {'works_for': 'OpenAI', 'role': 'researcher'}
• Lightman  (pos: 4743-4751): {'works_for': 'OpenAI'}
• Lightman  (pos: 5752-5760): {'works_for': 'OpenAI'}
• Lightman  (pos: 6993-7001): {'approach': 'focusing on the model’s results'}
• Lightman  (pos: 9281-9289): {'quote': 'Like many problems in machine learning, it’s a data problem” \x90”said Lightman when asked about the limitations of agents on subjective tasks.” 
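In the output above, "Hunter Lightman" and "Lightman" are treated as separate entities because grouping is done on the exact (lowercased) text. One simple heuristic, sketched below, is to map a short mention to a longer name that ends with it, but only when there is exactly one such name. `merge_short_mentions` is a hypothetical helper, not a LangExtract feature, and this is not a substitute for proper coreference resolution.

```python
def merge_short_mentions(mentions):
    """Map each mention to a canonical name.

    A mention is mapped to a longer name only if exactly one longer name
    ends with it as a separate word; otherwise it is kept as is.
    """
    unique = set(mentions)
    canonical = {}
    for mention in unique:
        suffix = " " + mention.lower()
        candidates = [n for n in unique if n != mention and n.lower().endswith(suffix)]
        canonical[mention] = candidates[0] if len(candidates) == 1 else mention
    return canonical


mentions = ["Hunter Lightman", "Lightman", "Sam Altman", "Lightman"]
mapping = merge_short_mentions(mentions)
for mention in sorted(mapping):
    print(f"{mention} -> {mapping[mention]}")
```

The resulting mapping could then be used as the dictionary key when grouping extractions, in place of `ext_lbl.lower()`.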

In [13]:
print_extracted_entities("company", companies)

Extracted entities of type COMPANY:
• OpenAI  (pos: 37-43): {'employee': 'Hunter Lightman'}
• OpenAI  (pos: 889-895): no attr.
• OpenAI  (pos: 2921-2927): {'employee': 'Andrej Karpathy'}
• OpenAI  (pos: 4071-4077): {'product': 'o1'}
• OpenAI  (pos: 4628-4634): {'leader': 'Daniel Selsam', 'leader_title': 'researcher'}
• OpenAI  (pos: 5506-5512): {'research_focus': 'AGI'}
• OpenAI  (pos: 7449-7455): {'researchers': 'Nathan Lambert'}
• OpenAI  (pos: 8426-8432): {'agent': 'Codex agent', 'agent_purpose': 'help software engineers offload simple coding tasks'}
• OpenAI  (pos: 8737-8743): {'agent': 'ChatGPT agent', 'agent_purpose': 'struggle with many of the complex, subjective tasks people want to automate'}
• OpenAI  (pos: 9529-9535): {'employee': 'Noam Brown', 'product': 'o1'}
• OpenAI  (pos: 10363-10369): {'model': 'GPT-5'}
• OpenAI  (pos: 11245-11251): {'industry': 'AI'}
• Meta  (pos: 2240-2244): {'employee': 'Shengjia Zhao', 'unit': 'superintelligence-focused unit'}
• Meta  (pos: 11492-1

In [14]:
print_extracted_entities("software product", products)

Extracted entities of type SOFTWARE PRODUCT:
• ChatGPT  (pos: 102-109): {'developed_by': 'OpenAI'}
• ChatGPT  (pos: 1220-1227): {'developed_by': 'OpenAI', 'status': 'research preview turned viral consumer business'}
• ChatGPT  (pos: 3198-3205): {'developed_by': 'OpenAI', 'derived_from': 'GPT series'}
• ChatGPT  (pos: 6607-6614): {'UX_features': ['thinking', 'reasoning']}
• ChatGPT  (pos: 11007-11014): {'developed_by': 'OpenAI'}
• MathGen  (pos: 290-297): {'developed_by': 'OpenAI', 'related_to': 'AI reasoning models'}
• o1  (pos: 1843-1845): {'developed_by': 'OpenAI', 'release_date': 'fall of 2024'}
• o1  (pos: 4067-4069): {'developed_by': 'OpenAI'}
• o1  (pos: 5796-5798): {'developed_by': 'OpenAI'}
• o1  (pos: 6603-6605): {'developed_by': 'ChatGPT'}
• o1  (pos: 9583-9585): {'developed_by': 'OpenAI'}
• OpenAI’s reasoning models and agents  (pos: 2281-2317): {'training_technique': 'reinforcement learning (RL)'}
• reinforcement learning (RL)  (pos: 2377-2403): {'used_by': 'OpenAI’s reason

The results can be stored in the JSONL (JSON Lines) format, a text-based format for structured data records in which each line is a valid JSON object. For more info, see [https://jsonltools.com/what-is-jsonl](https://jsonltools.com/what-is-jsonl).
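To make the format concrete, here is a minimal round trip using only the standard library. The record keys are illustrative and are not meant to mirror the exact layout of the file that LangExtract writes.

```python
import io
import json

# illustrative records; the keys are assumptions, not LangExtract's file layout
records = [
    {"extraction_class": "person", "extraction_text": "Hunter Lightman"},
    {"extraction_class": "company", "extraction_text": "OpenAI"},
]

# write: one JSON object per line
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# read: parse each non-empty line independently
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(len(loaded), loaded == records)
```

Because every line is parsed on its own, a JSONL file can be processed record by record without loading the whole file into memory.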

In [15]:
results_fname = "tech_companies_IE_results"

lx.io.save_annotated_documents([result],
                               output_name=f"{results_fname}.jsonl",
                               output_dir=Path.cwd() / 'ie_results')

[94m[1mLangExtract[0m: Saving to [92mtech_companies_IE_results.jsonl[0m: 1 docs [00:00, 334.66 docs/s]

[92m✓[0m Saved [1m1[0m documents to [92mtech_companies_IE_results.jsonl[0m





The JSONL file can be used to generate an HTML document with an interactive visualization of the results:

In [16]:
results_json = f"{Path.cwd()}/ie_results/{results_fname}.jsonl"

html_content = lx.visualize(results_json)

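with open(f"{Path.cwd()}/ie_results/{results_fname}.html", "w", encoding="utf-8") as f:
    # in a notebook, visualize() returns an HTML display object; elsewhere it may return a plain string
    f.write(html_content.data if hasattr(html_content, "data") else html_content)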

[94m[1mLangExtract[0m: Loading [92mtech_companies_IE_results.jsonl[0m: 100%|██████████| 32.7k/32.7k [00:00<00:00, 16.7MB/s]

[92m✓[0m Loaded [1m1[0m documents from [92mtech_companies_IE_results.jsonl[0m





Next, we identify relationships between entities, as they offer a solid basis for creating a knowledge graph:

In [None]:
relations = defaultdict(list)

for extr_item in result.extractions:

    ext_lbl = extr_item.extraction_text
    if extr_item.attributes:
        for ext_attr in extr_item.attributes.items():
            attr_rel, attr_val = ext_attr
            relations[attr_rel].append((ext_lbl, attr_val))


for rel in relations.keys():
    print(rel)
    for pair in relations[rel]:
        print(pair)
    print("-------------------")

In [None]:
# list the distinct kinds of relations
relations.keys()

Some of these relations can, and arguably should, be merged (e.g., 'works_for' and 'works_at', or 'employee' and 'employs') to better capture the semantics of the relations and to set the ground for knowledge graph creation. [This YouTube video](https://www.youtube.com/watch?v=dPL2vRDunMw) shows how that can be done.
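As a rough illustration of such merging (not the method shown in the video), the sketch below renames synonymous relations through an alias table and flips inverse relations so that all pairs point the same way. The `ALIASES` and `INVERSES` tables and the `normalize_relations` function are assumptions made for this example; in practice they would be filled in after inspecting `relations.keys()`.

```python
from collections import defaultdict

# hypothetical tables, to be adapted to the relation names actually extracted
ALIASES = {"works_at": "works_for", "employs": "employee"}
INVERSES = {"employee": "works_for"}  # (company, person) becomes (person, company)


def normalize_relations(relations):
    """Merge synonymous relations and flip inverse ones into a single direction."""
    merged = defaultdict(set)
    for rel, pairs in relations.items():
        rel = ALIASES.get(rel, rel)
        for subj, obj in pairs:
            # attribute values may be lists, so treat each item separately
            values = obj if isinstance(obj, list) else [obj]
            for value in values:
                if rel in INVERSES:
                    merged[INVERSES[rel]].add((value, subj))
                else:
                    merged[rel].add((subj, value))
    return dict(merged)


sample = {
    "works_for": [("Shengjia Zhao", "Meta")],
    "works_at": [("Shengjia Zhao", "Meta Superintelligence Labs")],
    "employee": [("OpenAI", "Hunter Lightman")],
    "UX_features": [("ChatGPT", ["thinking", "reasoning"])],
}

normalized = normalize_relations(sample)
for rel in sorted(normalized):
    print(rel, sorted(normalized[rel]))
```

Each `(subject, relation, object)` triple obtained this way can serve as an edge in a knowledge graph.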

#### Explore LangExtract further

To get a better understanding of LangExtract's features and how it can be used for information extraction, it is recommended to explore:
* the examples available in LangExtract's GitHub repo, especially the [Medication extraction example](https://github.com/google/langextract/blob/main/docs/examples/medication_examples.md)
* the examples presented in [DataCamp's LangExtract tutorial](https://www.datacamp.com/tutorial/langextract)

#### Compare and contrast LangExtract with alternative approaches to information extraction

A useful comparison is given in the aforementioned [DataCamp tutorial](https://www.datacamp.com/tutorial/langextract), specifically in the table towards the end of the article.