# Extract Information From Text With LLMs

Information extraction is a key part of computational text analysis. 📚💻

Specifically, information extraction means pulling structured information out of text. Often, you want to extract specific types of information such as named entities or their relationships. Named entities can be persons, locations, organizations, dates, etc.

Information extraction can be very useful for your research by helping you process and make sense of text data.

There are many methods for information extraction. This workshop provides a broad overview of different approaches, their pros/cons and use cases, and offers some specific advice on when and how to use large language models (LLMs) for information extraction.

We'll start with approaches other than LLMs to show where LLMs fit and to get a better understanding of their pros and cons. 🦾

## Import libraries

In [None]:
import pandas as pd # to use dataframes
import re # to use regular expressions
import spacy # for NLP, here named entity recognition
from openai import OpenAI # to use OpenAI API
import requests # here, to get txt files from GitHub
# To use OpenAI API's key:
# https://drlee.io/how-to-use-secrets-in-google-colab-for-api-key-protection-a-guide-for-openai-huggingface-and-c1ec9e1277e0
from google.colab import userdata
import os

## Read data

This workshop uses abstracts about "environmental sustainability" collected from [OpenAlex](https://openalex.org/). (If you're curious, [this is the script used for data collection](https://github.com/emiliolehoucq/trainings/blob/main/data/openalex_data_collection.ipynb).) 🌱

Let's first read the data.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/emiliolehoucq/trainings/refs/heads/main/data/open_alex_data.csv')

## Explore data

Let's take a look at the data.

In [None]:
df.shape

In [None]:
df.head(2)

In [None]:
df.dtypes

In [None]:
# Set seed used throughout the notebook
SEED = 123

# Iterate over a sample of the data
for index, row in df.sample(10, random_state = SEED).iterrows():
  # This is for readability
  print("==============================================================")
  # Print some info for exploration
  print(f"Title: {row['title']}")
  print(f"Length abstract: {len(row['abstract'])}")
  print(f"Abstract:\n{row['abstract']}")
  print("==============================================================\n")

## Overview of approaches

This section provides an overview of different approaches to information extraction, their pros/cons, and their use cases. 👨🏽‍🏫

*While the approaches are discussed separately, in practice it can be useful to combine them.*

As previously mentioned, we'll leave LLMs for the end of this section to provide better context first. We'll start with keyword search, then regular expressions, then pre-trained entity recognition models, and finally LLMs.

### Keyword search 🔍

Definition: searching for a predefined term in a text.

Pros: can be fast and easy to implement and you know exactly what your code is doing.

Cons: can be prone to false positives/negatives and lacks a more sophisticated understanding of language.

Use cases: particularly useful when there is a well-defined set number of terms that you're looking for (e.g., company names). You can also use keyword search to filter documents.

In [None]:
# Set context window to be used later
CONTEXT_WINDOW = 75

for index, row in df.iterrows():
    abstract = row['abstract']
    title = row['title']

    search_term = "climate change"
    # Lowercase abstract
    lower_abstract = abstract.lower()

    start_index = 0  # Keep track of where to search from

    while True:
        # Find the next occurrence of search term
        # find() returns the index of the first occurrence of a substring in a string
        # https://www.geeksforgeeks.org/python-string-find/
        match_index = lower_abstract.find(search_term, start_index)

        # If search term not found
        if match_index == -1:
            break  # Exit loop if no more matches are found

        print("===============================================================")
        print(f"Title: {title}")

        # Get the match
        match = lower_abstract[match_index:match_index + len(search_term)]
        print(f"Match: {match}")

        # Get the characters around the match
        start = max(0, match_index - CONTEXT_WINDOW)  # Ensure start index is not negative
        end = min(len(abstract), match_index + len(search_term) + CONTEXT_WINDOW)  # Ensure end index is within bounds
        context = abstract[start:end]

        print(f"Context for match: {context}")
        print("==============================================================\n")

        # Move start_index forward to continue searching after the current match
        start_index = match_index + len(search_term)

Above I was just printing the output to show you. In your research, you'll probably want to save the output. Here's one way to do it:

In [None]:
# Initialize a list to store the results
results = []

for index, row in df.iterrows():
    abstract = row['abstract']
    title = row['title']

    search_term = "climate change"
    lower_abstract = abstract.lower()

    start_index = 0

    while True:
        match_index = lower_abstract.find(search_term, start_index)

        if match_index == -1:
            break

        match = lower_abstract[match_index:match_index + len(search_term)]

        start = max(0, match_index - CONTEXT_WINDOW)
        end = min(len(abstract), match_index + len(search_term) + CONTEXT_WINDOW)
        context = abstract[start:end]

        # Append result to the list
        results.append({
            "title": title,
            "abstract": abstract,
            "match": match,
            "context": context
        })

        start_index = match_index + len(search_term)

print("This cell ran successfully!")

In [None]:
# Take a look at the first two elements of the list
results[:2]

In [None]:
# Convert list to DataFrame
output_df = pd.DataFrame(results)

# Take a look at the first two rows of the dataframe
output_df.head(2)

In [None]:
# Save to CSV
output_df.to_csv("matches.csv", index=False)
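
As mentioned in the use cases, keyword search can also filter documents, for example before applying a slower or more expensive approach. Here is a minimal sketch of that idea. It uses a tiny made-up dataframe (`toy_df` is hypothetical, not the workshop data) and pandas' `str.contains`:

```python
import pandas as pd

# Hypothetical toy data standing in for the workshop dataframe
toy_df = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": [
        "Climate change threatens coastal cities.",
        "We study soil microbes in farmland.",
        None,  # a missing abstract
    ],
})

search_term = "climate change"

# case=False ignores capitalization
# regex=False treats the search term as plain text, not as a pattern
# na=False counts missing abstracts as non-matches
mask = toy_df["abstract"].str.contains(search_term, case=False, regex=False, na=False)

# Keep only the rows whose abstract mentions the search term
filtered_df = toy_df[mask]

print(filtered_df["title"].tolist())
```

With the workshop data, you would apply the same mask to `df["abstract"]` instead of `toy_df["abstract"]`.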

**Exercise**

Using the approach above, search for "water".

### Regular expressions 🧩

Definition: using patterns to match information in a text.

Pros: similarly to keyword search, you understand exactly what your code is doing. Regular expressions are more flexible than keyword search.

Cons: they can be difficult to write for complex patterns and lack a more sophisticated understanding of language.

Use cases: particularly useful when the information follows a well-defined format (e.g., dates, phone numbers). Regular expressions can also be useful to clean text data.

In [None]:
CONTEXT_WINDOW = 75

# Pattern to search for
# \d{1,3} - Matches between 1 and 3 digits
# \s? - Matches an optional whitespace character
# (%|percent|percentage) - Matches either the % symbol, the word percent, or the word percentage
# \s - Matches a mandatory space
# (reduction|increase|decrease|growth|drop|improvement|change) - Match one of the specified change-related keywords
pattern = r"\d{1,3}\s?(%|percent|percentage)\s(reduction|increase|decrease|growth|drop|improvement|change)"

for index, row in df.iterrows():
  abstract = row['abstract']
  # Find all occurrences of pattern and extract surrounding context
  # re.finditer() searches for all matches of a pattern in a string and returns them as an iterator
  # https://www.geeksforgeeks.org/re-finditer-in-python/
  matches = re.finditer(pattern, abstract, re.IGNORECASE)
  # Iterate over matches
  for each_match in matches:
    print("===============================================================")
    print(f"Title: {row['title']}")
    print(each_match.group())
    # Get the characters around the match
    start = max(0, each_match.start() - CONTEXT_WINDOW)  # Ensure start index is not negative
    end = min(len(abstract), each_match.end() + CONTEXT_WINDOW)  # Ensure end index is within bounds
    context = abstract[start:end]
    print(f"Context for match: {context}")
    print("==============================================================\n")
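
As mentioned in the use cases, regular expressions can also help clean text data. Here is a minimal sketch of what that could look like. The function name `clean_abstract` and the specific cleanup rules are assumptions for illustration (they target the kind of residue that can show up in abstracts, such as HTML entities and trailing copyright notices). Adapt them to what you actually see in your data.

```python
import html
import re

def clean_abstract(text):
    """Hypothetical helper: light cleanup of an abstract string."""
    # Convert HTML entities such as &amp; back to plain characters
    text = html.unescape(text)
    # Drop a trailing copyright notice, if any (from "Copyright ©" to the end)
    text = re.sub(r"\s*Copyright\s*©.*$", "", text, flags=re.IGNORECASE | re.DOTALL)
    # Collapse runs of whitespace (including line breaks) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Made-up example string
raw = "Institutions  matter for governance.\nCopyright © 2014 John Wiley &amp; Sons, Ltd"
print(clean_abstract(raw))
```

Cleaning rules like these are easy to get wrong, so it's worth printing a sample of before/after pairs to check that nothing important is removed.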

**Exercise**

Using the approach above, search for years.

### Pre-trained entity recognition models 🏋🏽‍♂️

Definition: machine learning models that are trained to identify and classify entities such as persons, locations, organizations, years, etc.

Pros: they have a more sophisticated understanding of language and are easy to use off the shelf.

Cons: they can misclassify entities and have biases. They may not be pre-trained for the entity you need. You don't necessarily understand exactly what the model is doing.

Use cases: particularly useful for common entities if there is not a well-defined set list of terms or a clear pattern.

In [None]:
# Load the "en_core_web_sm" pre-trained language model (that does tokenization, part-of-speech tagging, named entity recognition)
nlp = spacy.load("en_core_web_sm")

Note 1: spaCy is not the only place where you can get pre-trained entity recognition models. For example, you can also look for models in [Hugging Face](https://huggingface.co/models).

Note 2: of course, you can always train your own model or fine-tune an existing one.

In [None]:
test_abstract = df["abstract"][272]
test_abstract

In [None]:
# Create a Doc object, which contains the processed text (including tokens, linguistic annotations, and recognized entities)
doc = nlp(test_abstract)

In [None]:
# Visually highlight named entities in text
spacy.displacy.render(doc, style="ent", jupyter=True)

ü§Øü§Øü§Øü§Øü§Ø

While that visual representation is fun, in the context of your research you're more likely to use code like this:

In [None]:
# Iterate over the named entities
for ent in doc.ents:
  print("===============================================================")
  print(f"Text of the entity: {ent.text}")
  print(f"Label of the entity: {ent.label_}")
  print(f"Start character of the entity: {ent.start_char}")
  print(f"End character of the entity: {ent.end_char}")
  print("==============================================================\n")

In [None]:
for index, row in df.sample(20, random_state=SEED).iterrows():
  abstract = row['abstract']
  for ent in nlp(abstract).ents:
    if ent.label_ == "ORG":
      print("===============================================================")
      print(f"Title of the paper: {row['title']}")
      print(f"Text of the organization: {ent.text}")
      start = max(0, ent.start_char - CONTEXT_WINDOW)  # Ensure start index is not negative
      end = min(len(abstract), ent.end_char + CONTEXT_WINDOW)  # Ensure end index is within bounds
      context = abstract[start:end]
      print(f"Context for match: {context}")
      print("==============================================================\n")

**Exercise**

Using the approach above, search for persons.

### LLMs 🤖

Definition: large machine learning models trained for various natural language processing tasks.

Pros: they have a pretty sophisticated understanding of language and can be very flexible.

Cons: you don't necessarily understand what the model is doing, the output can vary across model calls, they have biases, they don't necessarily follow your instructions, and they can hallucinate. They can also be computationally intensive.

Use cases: particularly useful for complex tasks for which there is not a well-defined set list of terms, a clear pattern, or a pre-trained model.

----

Note 1: for the purposes of this workshop, we'll mostly use [HuggingChat](https://huggingface.co/chat/) with `meta-llama/Llama-3.3-70B-Instruct`. HuggingChat has the advantage that we can set the system prompt. Also, you can try different models with the same prompt.

Note 2: A system prompt refers to the instructions that you give to the LLM specifying how it should behave across all interactions. In contrast, a user prompt is a specific instruction that you give to the LLM for a particular interaction.

Let's use this abstract as an example:

In [None]:
test_abstract = df["abstract"][954]
test_abstract

We're going to extract the main point of the article from the abstract. We're not asking the LLM to infer what the main point is. We're asking the LLM to "understand" the abstract, find where the main point is, and extract a direct citation for us.

Asking the LLM for the main point is a good example of what an LLM can do that other approaches can't do or can't do as well. To extract the main point, the LLM needs to understand language in a more sophisticated way to know where the main point is. It's not always as easy as "The argument/main finding of this article is ..." or "This article demonstrates that ...".

Let's try it out! 🎉

Here's the system prompt:

```
YOUR ROLE

You are a diligent, careful, and detailed-oriented assistant for information detection and extraction.

You value accuracy: when the user asks you to extract certain information from a given text, you will adhere to what is directly mentioned in the text and the extraction criteria.

You value conciseness: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt.

STEPS TO FOLLOW

First, read the given text carefully.

Second, review the extraction criteria and think about what the answer is based on the given text.

Third, explain your reasoning.

Finally, provide your answer.

FORMAT FOR YOUR ANSWER

Reasoning: <explanation of your reasoning>

My answer is: <concise answer strictly following the instructions provided in the extraction prompt>
```

And here's the user prompt:

```
Below I will provide the abstract of an article.

Based on the text of the abstract, give me a direct citation with the main point of the article (if available).

Instead of summarizing or infering something from the article, I want you to give me a direct citation (if available).

If you are not sure, respond "UNSURE".

Here is the abstract:
```

**Exercise**

Using the approach above, extract the method(s) used in the article.

#### OpenAI API

For this workshop, we're mostly using HuggingChat. However, in your research you'll want to query models through an API (a way for software programs to communicate with each other).

Let's see an example using the [OpenAI API](https://platform.openai.com/docs/overview). (You can see the [billing here](https://platform.openai.com/settings/organization/billing/overview), [get API keys here](https://platform.openai.com/api-keys), and [check your usage here](https://platform.openai.com/settings/organization/usage).)

Don't worry if you're not familiar with APIs. While this workshop cannot get into the details, they're not complicated to use and [we're happy to help you](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f). Below is just an example.

⚠️ **Keep in mind that you cannot use OpenAI's API with private data. Make sure to check [Northwestern University's guidance on the use of generative AI](https://www.it.northwestern.edu/about/policies/guidance-on-the-use-of-generative-ai.html).** Feel free to [submit a consult request](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f) if you have any questions about this in your research.

OpenAI's API is not the only option. With private data, you can consider using [Microsoft Azure OpenAI Services](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=2546). On your local machine, you can use [Ollama](https://ollama.com/), including [with Python](https://github.com/ollama/ollama-python). You can also use Ollama on Quest (and apparently [on Google Colab](https://adasci.org/a-practitioners-guide-to-running-ollama-models-in-colab-collama/#:~:text=Unlock%20the%20power%20of%20AI,Run%20advanced%20language%20models%20effortlessly.) to try it out). Again, [we're happy to help you](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f)!

In [None]:
# https://drlee.io/how-to-use-secrets-in-google-colab-for-api-key-protection-a-guide-for-openai-huggingface-and-c1ec9e1277e0

# Set API key as environment variable
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [None]:
# Create OpenAI client
client = OpenAI()

In [None]:
# Define function to query model
def get_model_response(prompt_system, prompt_user):
  """
  Function to get a response from a model.

  Inputs:
  - prompt_system (str): system prompt
  - prompt_user (str): user prompt

  Outputs:
  - response (str): response from the model

  https://platform.openai.com/docs/quickstart?language=python
  """
  # Get response from model
  response = client.chat.completions.create(
      model="gpt-4o-mini", # https://openai.com/api/pricing/
      messages=[{"role": "system", "content": prompt_system}, {"role": "user", "content": prompt_user}]
  )

  # Return response
  return response.choices[0].message.content

Let's get the prompts that we have already used:

In [None]:
prompt_system = requests.get("https://raw.githubusercontent.com/nuitrcs/info_extraction_llm_workshop/refs/heads/main/prompt_system.txt").text
prompt_system

In [None]:
prompt_main_point = requests.get("https://raw.githubusercontent.com/nuitrcs/info_extraction_llm_workshop/refs/heads/main/prompt_main_point.txt").text
prompt_main_point

Let's query the model:

In [None]:
# The code below is commented out so that we don't keep sending requests. Uncomment it to run!

# for index, row in df.sample(5, random_state=SEED).iterrows():
#   # Put the prompt and the abstract together
#   prompt_user = prompt_main_point + '\n' + row['abstract']
#   print("===============================================================")
#   print("===============================================================")
#   print(f"Title of the paper: {row['title']}")
#   print("---------------------------------------------------------------")
#   print(f"Prompt and abstract:\n\n\n{prompt_user}")
#   print("---------------------------------------------------------------")
#   # Use the function created above to get response from the model
#   print(f"Response from the model:\n\n\n{get_model_response(prompt_system, prompt_user)}")
#   print("==============================================================")
#   print("==============================================================\n\n")

**Exercise**

This section discussed various approaches to information extraction: keyword search, regular expressions, pre-trained entity recognition models, and LLMs. Below are several examples of information extraction tasks in research. For each scenario, think about what approach you'd use and why. 🧠

a) You are studying historical documents and need to extract references to specific historical figures, even when they are referred to indirectly (e.g., "the first president" referring to George Washington).

b) You need to extract standardized legal case citations (e.g., *Marbury v. Madison, 5 U.S. 137 (1803)*) from legal documents.

c) You are analyzing political speeches and want to identify mentions of specific policies (e.g., “Medicare for All” or “Green New Deal”).

d) You need to extract company names and stock ticker symbols from financial news articles.

e) You are reviewing electronic health records to extract patient symptoms, even when they are described in different ways (e.g., “stomach pain,” “abdominal discomfort,” “ache in the belly”).

f) You are studying sentiment in classic novels and need to identify passages that express themes of love and betrayal.

g) You are extracting inflation rates and GDP figures from economic reports.

h) You need to extract mentions of environmental disasters (e.g., “oil spill,” “wildfire”) from news reports.

i) You are scanning news articles to extract quotes from politicians.

## Best practices when using LLMs for information extraction

So far, this workshop has contextualized LLMs as one approach for information extraction among others and showed some examples of how you can use LLMs. This section discusses some best practices when using LLMs for information extraction. 🤓

As with other aspects of LLMs, these are preliminary ideas: there is still some uncertainty and things are evolving quickly. 💨

- **Think about the right approach**
  - *Try simpler approaches* first, or at least consider them.
  - *Consider combining the LLM approach with simpler approaches* as relevant.
- **Model selection**
  - *Closed vs. open source*. Closed-source models can be easier to use and can perform better, but neither is guaranteed. Open-source models are "free" to use and are better for reproducibility.
  - *Model size*. Depending on the size of the text that you need to feed into the LLM and the complexity of your task, you may be able to use a smaller vs. larger model. Smaller models are faster to run, can fit on your local machine or on Quest, and have less of an environmental impact. Larger models can have better performance, particularly without fine-tuning.
  - *Experiment with a couple of models* and see how they perform in your specific task, including accuracy, biases, instruction following, etc.
- **Prompts**
  - *Follow prompt engineering guidelines* such as [these ones](https://github.com/nuitrcs/CoDEx-LLM-Workshop/blob/main/prompt_engineering_cheat_sheet.pdf). There are many resources about prompt engineering online (e.g., [this free course](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/)). Model developers also provide some guidelines specific to their models (e.g., [OpenAI](https://platform.openai.com/docs/guides/prompt-engineering) and [Meta](https://www.llama.com/docs/how-to-guides/prompting/)). For information extraction, these are some tips that I have found helpful:
    - In describing the role of the model, emphasize the importance of accuracy, details, and conciseness.
    - Tell the model to read, review the extraction criteria, explain the reasoning, and give an answer.
    - Ask the model to format the answer in a way that works for you, potentially asking for the reasoning in addition to the answer.
    - Start with simple prompts and tweak them as you see fit to solve specific problems.
    - Tell the model the types and range of answers that you want.
    - Give the model the option to respond "unsure"/"missing".
    - Potentially give the model specific extraction criteria.
    - Potentially give the model examples of the behavior you want.
  - *Keep a record* of the prompts you tried, how they performed, and why you changed them.
  - *Organize prompts* in a way that serves your project over time.
- **Querying the model**
  - *Check the response*, specifically check that the model's output is in the correct format. If you asked the model for a direct citation from the text, you could also check that the response is actually in the text.
  - *Query the model several times*, record all responses, and select the response that is given at least a certain proportion of times.
  - *Document* the model and version/date that you used.
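
The "check the response" idea above could be sketched as follows. This is only an illustration: it assumes the model's response follows the `Reasoning: ... My answer is: ...` format requested in the system prompt, and the function names (`parse_answer`, `normalize`, `is_quote_in_text`) are hypothetical. Lowercasing and collapsing whitespace before comparing is a design choice, so that trivial differences don't count as mismatches.

```python
import re

def parse_answer(response):
    """Return the text after 'My answer is:', or None if the format is off."""
    match = re.search(r"My answer is:\s*(.+)", response, flags=re.DOTALL)
    if match is None:
        return None
    # Strip surrounding whitespace and quotation marks
    return match.group(1).strip().strip('"“”')

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def is_quote_in_text(answer, source_text):
    """Check that the extracted quotation literally appears in the source text."""
    return normalize(answer) in normalize(source_text)

# Made-up abstract and model response for illustration
abstract = "We study firms. Our findings show that regulation matters."
response = (
    "Reasoning: The second sentence states the finding.\n\n"
    'My answer is: "Our findings show that regulation matters."'
)

answer = parse_answer(response)
if answer is None:
    print("The response is not in the expected format")
elif answer == "UNSURE":
    print("The model was unsure")
else:
    print(f"Quotation found in abstract: {is_quote_in_text(answer, abstract)}")
```

Responses that fail either check can be flagged for manual review or sent back to the model.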

----

[Here is an example of some of these ideas](https://github.com/nuitrcs/info_extraction_llm_workshop/tree/main/example). 💡
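
One of the ideas above, querying the model several times and keeping an answer only when enough of the calls agree, could be sketched like this. The function name `select_by_agreement` and the default threshold of 0.5 are assumptions for illustration, and the list of responses is made up. In practice, the responses would come from repeated calls to the model with the same prompts, and you may want to normalize them (e.g., strip whitespace) before counting.

```python
from collections import Counter

def select_by_agreement(responses, min_proportion=0.5):
    """Return the most common response if it reaches min_proportion, else None."""
    if not responses:
        return None
    # most_common(1) returns a list with the top (response, count) pair
    answer, count = Counter(responses).most_common(1)[0]
    if count / len(responses) >= min_proportion:
        return answer
    return None

# Hypothetical answers from five calls with the same prompt
responses = ["2014", "2014", "2014", "UNSURE", "2015"]

print(select_by_agreement(responses))        # 3 out of 5 agree, above 0.5
print(select_by_agreement(responses, 0.8))   # 3 out of 5 is below 0.8
```

When no answer reaches the threshold, the function returns `None`, which you can treat as a case for manual review.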

----

**Exercise**

Using `meta-llama/Llama-3.3-70B-Instruct` on HuggingChat, run the following examples. What do you notice?

1)

Start with this system prompt:

```
You are a helpful assistant.
```

Try these prompts:

a)

```
tell me the importance of this paper This study aims to show that the effectiveness of corporate governance in improving firms’ environmental sustainability depends on national institutional context. Using a sample 210 firms from 14 countries North America and Europe, our findings regulatory pressures discourage independent directors separate board chairs promote whereas normative have opposite effect for these two mechanisms. We also found positive moderating relation cognitive directors. make unique contribution literature by combining factors explain sustainability. Although there is growing consensus institutions matter governance, has been little research how may moderate relationship between mechanisms Copyright © 2014 John Wiley &amp; Sons, Ltd ERP Environment
```

b)

```
tell me the contribution of this paper This study aims to show that the effectiveness of corporate governance in improving firms’ environmental sustainability depends on national institutional context. Using a sample 210 firms from 14 countries North America and Europe, our findings regulatory pressures discourage independent directors separate board chairs promote whereas normative have opposite effect for these two mechanisms. We also found positive moderating relation cognitive directors. make unique contribution literature by combining factors explain sustainability. Although there is growing consensus institutions matter governance, has been little research how may moderate relationship between mechanisms Copyright © 2014 John Wiley &amp; Sons, Ltd ERP Environment
```

c)

```
tell me the contribution of this paper to the literature This study aims to show that the effectiveness of corporate governance in improving firms’ environmental sustainability depends on national institutional context. Using a sample 210 firms from 14 countries North America and Europe, our findings regulatory pressures discourage independent directors separate board chairs promote whereas normative have opposite effect for these two mechanisms. We also found positive moderating relation cognitive directors. make unique contribution literature by combining factors explain sustainability. Although there is growing consensus institutions matter governance, has been little research how may moderate relationship between mechanisms Copyright © 2014 John Wiley &amp; Sons, Ltd ERP Environment
```

d)

```
tell me the main contribution of this paper to the literature This study aims to show that the effectiveness of corporate governance in improving firms’ environmental sustainability depends on national institutional context. Using a sample 210 firms from 14 countries North America and Europe, our findings regulatory pressures discourage independent directors separate board chairs promote whereas normative have opposite effect for these two mechanisms. We also found positive moderating relation cognitive directors. make unique contribution literature by combining factors explain sustainability. Although there is growing consensus institutions matter governance, has been little research how may moderate relationship between mechanisms Copyright © 2014 John Wiley &amp; Sons, Ltd ERP Environment
```

e)

```
Below I will provide the abstract of an article.

Based on the text of the abstract, tell me the main contribution of the article to the literature.

If you are not sure, respond "UNSURE".

Here is the abstract:

This study aims to show that the effectiveness of corporate governance in improving firms’ environmental sustainability depends on national institutional context. Using a sample 210 firms from 14 countries North America and Europe, our findings regulatory pressures discourage independent directors separate board chairs promote whereas normative have opposite effect for these two mechanisms. We also found positive moderating relation cognitive directors. make unique contribution literature by combining factors explain sustainability. Although there is growing consensus institutions matter governance, has been little research how may moderate relationship between mechanisms Copyright © 2014 John Wiley &amp; Sons, Ltd ERP Environment
```

2)

Change to this system prompt and try the last prompt again:

```
YOUR ROLE

You are a diligent, careful, and detailed-oriented assistant for information detection and extraction.

You value accuracy: when the user asks you to extract certain information from a given text, you will adhere to what is directly mentioned in the text and the extraction criteria.

You value conciseness: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt.

STEPS TO FOLLOW

First, read the given text carefully.

Second, review the extraction criteria and think about what the answer is based on the given text.

Third, explain your reasoning.

Finally, provide your answer.

FORMAT FOR YOUR ANSWER

Reasoning: <explanation of your reasoning>

My answer is: <concise answer strictly following the instructions provided in the extraction prompt>
```

## Bonus exercise

Use the `get_clean_wikipedia_text` function below to get the text of a Wikipedia article that's of interest to you. Think about some information that you could extract from the text and try different approaches. Consider their pros and cons, try implementing them, and see what you get. ▶️

In [None]:
pip install wikipedia-api

In [None]:
import wikipediaapi

def get_clean_wikipedia_text(url):
    """
    Fetches and returns the clean text of a Wikipedia page from the provided URL.

    Args:
        url (str): The full URL of the Wikipedia page.

    Returns:
        str: The clean text content of the Wikipedia page, or an error message
             if the page doesn't exist.

    Note:
        A custom user-agent string is required to comply with Wikipedia's
        User-Agent policy. Make sure to replace the user-agent with your own
        application details.

    Example:
        url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
        print(get_clean_wikipedia_text(url))

    https://chatgpt.com/share/679bf411-23fc-8004-84cc-4e36746b9455
    """
    # Initialize the Wikipedia API with a custom user agent and English language
    wiki = wikipediaapi.Wikipedia(
        language='en',
        user_agent='YourAppName/1.0 (https://yourwebsite.com; contact@youremail.com)'
    )

    # Extract the page title from the URL
    page_title = url.split('/')[-1]

    # Fetch the page
    page = wiki.page(page_title)

    # Check if the page exists
    if not page.exists():
        return "Page not found."

    # Get the text from the page
    clean_text = page.text

    return clean_text

In [None]:
url = "https://en.wikipedia.org/wiki/Sustainability"
print(get_clean_wikipedia_text(url))
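
The function above takes the page title to be the last piece of the URL. If the URL you paste includes a section anchor (`#...`), a query string (`?...`), or percent-encoded characters, that last piece may not match the page title exactly. Here is a sketch of a more defensive way to get the title, using only Python's standard library. The helper name `title_from_wikipedia_url` is hypothetical, and whether `wikipediaapi` needs this extra step for your URL is an assumption worth testing.

```python
from urllib.parse import urlparse, unquote

def title_from_wikipedia_url(url):
    """Hypothetical helper: get the page title from a Wikipedia URL."""
    # Keep only the path, which drops any ?query or #fragment
    path = urlparse(url).path
    # Take the last path segment and decode percent-encoded characters
    return unquote(path.rstrip("/").split("/")[-1])

print(title_from_wikipedia_url("https://en.wikipedia.org/wiki/Sustainability#History"))
print(title_from_wikipedia_url("https://en.wikipedia.org/wiki/S%C3%A3o_Paulo"))
```

To try it, you could replace `url.split('/')[-1]` in `get_clean_wikipedia_text` with a call to this helper.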

## Conclusion

This workshop provided a broad overview of different approaches to information extraction, their pros/cons and use cases, and offered some specific advice on how to use large language models (LLMs) for information extraction.

There is much more to learn! You can take a look at our past workshops on [Artificial Intelligence for Research](https://github.com/nuitrcs/artificial_intelligence_for_research). You can also stay updated on future workshops [on our website](https://www.it.northwestern.edu/departments/it-services-support/research/research-events.html) or by subscribing to [our listserv](https://listserv.it.northwestern.edu/scripts/wa.exe?SUBED1=NUIT-research&A=1).

If you're thinking of using LLMs for your research or are having trouble using one, [we're here to help you](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f)! 🙂

## Answers to the exercises

**Exercise**

In [None]:
# Set context window to be used later
CONTEXT_WINDOW = 75

for index, row in df.iterrows():
    abstract = row['abstract']
    title = row['title']

    search_term = "water"
    # Lowercase abstract
    lower_abstract = abstract.lower()

    start_index = 0  # Keep track of where to search from

    while True:
        # Find the next occurrence of search term
        # find() returns the index of the first occurrence of a substring in a string
        # https://www.geeksforgeeks.org/python-string-find/
        match_index = lower_abstract.find(search_term, start_index)

        # If search term not found
        if match_index == -1:
            break  # Exit loop if no more matches are found

        print("===============================================================")
        print(f"Title: {title}")

        # Get the match
        match = lower_abstract[match_index:match_index + len(search_term)]
        print(f"Match: {match}")

        # Get the characters around the match
        start = max(0, match_index - CONTEXT_WINDOW)  # Ensure start index is not negative
        end = min(len(abstract), match_index + len(search_term) + CONTEXT_WINDOW)  # Ensure end index is within bounds
        context = abstract[start:end]

        print(f"Context for match: {context}")
        print("==============================================================\n")

        # Move start_index forward to continue searching after the current match
        start_index = match_index + len(search_term)

**Exercise**

In [None]:
# Pattern to search for
# (?<=\s) - A positive lookbehind to ensure there is a whitespace character before the 4 digits
# \d{4} - Matches exactly 4 digits
# (?=\s) - A positive lookahead to ensure there is a whitespace character after the 4 digits
pattern = r"(?<=\s)\d{4}(?=\s)"

for index, row in df.iterrows():
  abstract = row['abstract']
  # Find all occurrences of pattern and extract surrounding context
  # re.finditer() searches for all matches of a pattern in a string and returns them as an iterator
  # https://www.geeksforgeeks.org/re-finditer-in-python/
  # Note: for keyword search we didn't necessarily need to use re.finditer()
  matches = re.finditer(pattern, abstract)  # No re.IGNORECASE needed: the pattern only contains digits and whitespace
  # Iterate over matches
  for each_match in matches:
    print("===============================================================")
    print(f"Title: {row['title']}")
    print(each_match.group())
    # Get the characters around the match
    start = max(0, each_match.start() - CONTEXT_WINDOW)  # Ensure start index is not negative
    end = min(len(abstract), each_match.end() + CONTEXT_WINDOW)  # Ensure end index is within bounds
    context = abstract[start:end]
    print(f"Context for match: {context}")
    print("==============================================================\n")
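
The pattern in this answer only matches four-digit numbers with whitespace on both sides, so a year followed by a comma or a closing parenthesis is skipped, while any other four-digit number surrounded by spaces is still matched. One possible variant, offered as a sketch and not as the workshop's answer, uses word boundaries and restricts matches to 1800-2099. That range is an assumption you would adjust to your corpus.

```python
import re

# Original answer pattern: 4 digits with whitespace on both sides
whitespace_pattern = r"(?<=\s)\d{4}(?=\s)"
# Variant (an assumption, not from the workshop): word boundaries plus a 1800-2099 range
year_pattern = r"\b(?:1[89]\d{2}|20\d{2})\b"

# Made-up sentence for illustration only
sample = "Between 2015 and 2020, emissions fell (see 2019) at 3000 m altitude."

whitespace_matches = re.findall(whitespace_pattern, sample)
year_matches = re.findall(year_pattern, sample)

print(whitespace_matches)  # ['2015', '3000']
print(year_matches)        # ['2015', '2020', '2019']
```

Neither pattern knows what a year is: both will still pick up numbers such as sample sizes or monetary amounts that happen to look like one, so checking the surrounding context remains important.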

**Exercise**

In [None]:
for index, row in df.sample(40, random_state=SEED).iterrows():
  abstract = row['abstract']
  for ent in nlp(abstract).ents:
    if ent.label_ == "PERSON":
      print("===============================================================")
      print(f"Title of the paper: {row['title']}")
      print(f"Text of the person: {ent.text}")
      start = max(0, ent.start_char - CONTEXT_WINDOW)  # Ensure start index is not negative
      end = min(len(abstract), ent.end_char + CONTEXT_WINDOW)  # Ensure end index is within bounds
      context = abstract[start:end]
      print(f"Context for match: {context}")
      print("==============================================================\n")

**Exercise**

```
Below I will provide the abstract of an article.

Based on the text of the abstract, tell me the method(s) used in the article.

If you are not sure, respond "UNSURE".

Here is the abstract:
```
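
To run this prompt over many abstracts, it can help to separate building the request from interpreting the reply. The sketch below makes no API call. It only shows one way to attach an abstract to the prompt and to treat an "UNSURE" reply as a missing value. The helper names `build_messages` and `parse_methods` are hypothetical, and the single-`user`-message layout is an assumption about how you would send the prompt to a chat-style LLM API.

```python
PROMPT = """Below I will provide the abstract of an article.

Based on the text of the abstract, tell me the method(s) used in the article.

If you are not sure, respond "UNSURE".

Here is the abstract:
"""

def build_messages(abstract):
    """Return a chat-style message list with the abstract appended to the prompt."""
    return [{"role": "user", "content": PROMPT + "\n" + abstract.strip()}]

def parse_methods(response_text):
    """Return None when the model answered UNSURE, otherwise the stripped answer."""
    answer = response_text.strip()
    # Tolerate quotes or a trailing period around the keyword
    if answer.strip('"\'.').upper() == "UNSURE":
        return None
    return answer

# Made-up abstract for illustration only
messages = build_messages("We conducted 40 semi-structured interviews with farmers.")
print(messages[0]["content"])
print(parse_methods('"Unsure."'))
print(parse_methods("Semi-structured interviews"))
```

Returning `None` for unsure replies makes it easy to count them afterwards and to review those abstracts by hand.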

**Exercise**

Below are several examples of information extraction tasks in research. For each scenario, think about what approach you'd use and why.

a) You are studying historical documents and need to extract references to specific historical figures, even when they are referred to indirectly (e.g., "the first president" referring to George Washington). -- **Possibly a combination of keyword search and LLM.**

b) You need to extract standardized legal case citations (e.g., *Marbury v. Madison, 5 U.S. 137 (1803)*) from legal documents. -- **Regular expressions.**

c) You are analyzing political speeches and want to identify mentions of specific policies (e.g., “Medicare for All” or “Green New Deal”). -- **Possibly a combination of keyword search and LLM.**

d) You need to extract company names and stock ticker symbols from financial news articles. -- **Pre-trained entity recognition models.**

e) You are reviewing electronic health records to extract patient symptoms, even when they are described in different ways (e.g., “stomach pain,” “abdominal discomfort,” “ache in the belly”). -- **Possibly a combination of keyword search and a pre-trained entity recognition model and/or an LLM.**

f) You are studying sentiment in classic novels and need to identify passages that express themes of love and betrayal. -- **Possibly a combination of keyword search and LLM.**

g) You are extracting inflation rates and GDP figures from economic reports. -- **Regular expressions.**

h) You need to extract mentions of environmental disasters (e.g., “oil spill,” “wildfire”) from news reports. -- **Possibly a combination of keyword search and LLM.**

i) You are scanning news articles to extract quotes from politicians. -- **Depending on format, regular expressions and/or LLM.**
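
As an illustration of scenario (b), here is a minimal sketch of a regular expression for citations shaped like *Marbury v. Madison, 5 U.S. 137 (1803)*. Everything in it is an assumption made for the example: the party-name rule (capitalized words, optionally joined by "of", "the", or "and"), the short list of reporter abbreviations, and the sample sentences. Real legal text has many more citation formats.

```python
import re

# One party name: capitalized words, optionally joined by "of", "the", or "and"
PARTY = r"[A-Z][A-Za-z.'-]*(?: (?:of |the |and )?[A-Z][A-Za-z.'-]*)*"
# A few reporter abbreviations; a real project would need a longer list
REPORTER = r"(?:U\.S\.|S\. Ct\.|F\.\dd)"

CITATION_PATTERN = re.compile(
    rf"(?P<plaintiff>{PARTY}) v\. (?P<defendant>{PARTY}), "
    rf"(?P<volume>\d+) (?P<reporter>{REPORTER}) (?P<page>\d+) "
    r"\((?P<year>\d{4})\)"
)

text = (
    "Judicial review was established in Marbury v. Madison, 5 U.S. 137 (1803). "
    "Segregated schooling was struck down in Brown v. Board of Education, "
    "347 U.S. 483 (1954)."
)

# Each match becomes a dictionary keyed by the named groups
citations = [m.groupdict() for m in CITATION_PATTERN.finditer(text)]
for citation in citations:
    print(citation)
```

Because party names are matched as runs of capitalized words, a sentence that begins "In Marbury v. Madison" would yield "In Marbury" as the plaintiff. Limitations like this are typical of regular expressions: they are precise on well-standardized formats and brittle on everything else.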