You find yourself staring at a dataset with tens or hundreds of thousands of rows. Maybe you want to get up-to-date FOIA contact details for all government departments in your country, or to find out which political donors have links to the fossil fuels industry. What do you do?

Large Language Models (LLMs) like those powering ChatGPT can help journalists automate simple research and classification tasks that would take an unreasonably long time to do otherwise.

In this session, we'll outline how to use LLMs, search engines and web scraping to help us identify links between Donald Trump and his donors. You can [download the notebook](https://github.com/nicucalcea/ddj-wiki/blob/main/ai/python-classification-rag.ipynb) and run it yourself, or you can [run it in the cloud using Google Colab](https://colab.research.google.com/github/nicucalcea/ddj-wiki/blob/main/ai/python-classification-rag.ipynb). Both links are also in the sidebar to the right, or at the bottom of the page on mobile.

## Install and load libraries

First, we'll need to install some libraries to help us call different LLMs and retrieve search results.

In [None]:
!uv pip install pandas
!uv pip install pydantic-ai
!uv pip install duckduckgo-search
!uv pip install requests trafilatura

Then, we'll import those libraries.

In [40]:
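# data
import json
import pandas as pd

# scraping
from urllib import request
import ssl
import requests
from trafilatura import extract
from duckduckgo_search import DDGS

# AI
from openai import OpenAI
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.common_tools.duckduckgo import duckduckgo_search_tool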

Finally, we need to define some of the API keys we'll be using. You can get your own [OpenAI API key](https://platform.openai.com/account/api-keys) or one from another provider.

In [None]:
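import os

# PydanticAI and the OpenAI client both read the key from this environment variable
os.environ["OPENAI_API_KEY"] = ""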

## Prep data

We'll work with some data from the US Federal Election Commission (FEC) on [donations to Donald Trump's inaugural committee](https://docquery.fec.gov/cgi-bin/forms/C00894162/1889684/f132). Because a downloadable version isn't provided, we'll scrape the data directly from the website.

In [None]:
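from io import BytesIO

# Certificate verification is skipped here; avoid this outside a quick prototype
response = request.urlopen("https://docquery.fec.gov/cgi-bin/forms/C00894162/1889684/f132",
                           context=ssl._create_unverified_context())
html = response.read()

# Wrap the raw bytes in a file-like object and take the first table on the page
fec_raw = pd.read_html(BytesIO(html))[0]
fec_raw.head()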

                  Name                                            Address  \
0  A10 ASSOCIATES, LLC            214 COMMERCIAL ST #202 MALDEN, MA 02148   
1        DANIEL ABBATE  340 ROYAL POINCIANA WAY STE 317 PALM BEACH, FL...   
2  ABBOTT LABORATORIES         100 ABBOTT PARK ROAD ABBOTT PARK, IL 60064   
3          CAROL ADAMS            6125 LUTHER LN STE 245 DALLAS, TX 75225   
4         HAYDEN ADAMS               524 BROADWAY FL 6 NEW YORK, NY 10013   

  Date Donation Received Donation Amount Donor's Aggregate Donations To Date  
0             01/10/2025       $50000.00                           $50000.00  
1             01/01/2025       $25000.00                           $25000.00  
2             12/24/2024      $500000.00                          $500000.00  
3             01/13/2025      $250000.00                          $250000.00  
4             01/09/2025      $245727.56                          $245727.56  


We need to fix some formatting issues: removing dollar signs, converting amounts to numbers and parsing dates into a standard format.

In [None]:
fec = fec_raw.copy()
fec = fec.iloc[:-1] # the last row is a summary

fec['Date Donation Received'] = pd.to_datetime(
    fec['Date Donation Received'], 
    format='%m/%d/%Y'  # month/day/year, e.g. "12/31/2025"
)

fec['Donation Amount'] = fec['Donation Amount'].str.replace('$', '', regex=False)
fec['Donation Amount'] = pd.to_numeric(fec['Donation Amount'])

fec = fec.drop(columns=["Donor's Aggregate Donations To Date"])

Some donors donated multiple times, so we'll want to group them together and only classify them once.

In [36]:
fec_total = fec.groupby('Name')['Donation Amount'].sum().reset_index()
fec_total = fec_total.sort_values(by='Donation Amount', ascending=False)

We should be ready to go. Let's see what the data looks like.

In [37]:
fec_total

| | Name | Donation Amount |
|---|---|---|
| 618 | PILGRIM'S PRIDE CORPORATION | 5000000.00 |
| 652 | RIPPLE LABS, INC. | 4889345.33 |
| 824 | WARREN A STEPHENS | 4000000.00 |
| 667 | ROBINHOOD MARKETS, INC. | 2000000.00 |
| 360 | JARED ISAACMAN | 2000000.00 |
| ... | ... | ... |
| 365 | JEFF STIBEL | 1000.00 |
| 144 | CHRISTOPHER SHEERON | 1000.00 |
| 192 | DAVID RATLIFF | 516.53 |
| 438 | KURT FOULDS | 500.00 |


## Searching and scraping

First, we need a programmatic way to search the internet.

We could try simply scraping search results from Google like we did with the FEC data, but Google doesn't like that and will put various roadblocks in your way, such as CAPTCHAs, rate limits or even blocking your IP address. Instead, they want you to use their [Custom Search API](https://developers.google.com/custom-search/v1/overview), which is paid and limited to 10k queries per day.

Other search engines like Brave provide generous free tiers and more reasonable pricing, and there are some search engines that specifically cater to AI applications, like [Tavily](https://tavily.com/) and [Perplexity](https://docs.perplexity.ai/home).

We'll use the DuckDuckGo API since it's free (but rate-limited), doesn't require signing up for an API key, and comes with PydanticAI.

We also want to give our AI the ability to visit each of those search results and extract the text from there. We'll use a library called [Trafilatura](https://github.com/adbar/trafilatura) since it extracts the main text without headers, footers and other irrelevant bits that we probably don't need for the classification.

In [41]:
def extract_text(urls):
    results = []

    for url in urls:
        print(f"Scraping {url}...")
        try:
            response = requests.get(url, timeout=30, verify=False)  # Note: verify=False is not recommended for production use
            response.raise_for_status()  # Raises an HTTPError for bad responses
            extracted_text = extract(response.text, output_format="markdown")
            results.append((url, extracted_text))
        except requests.RequestException as e:
            print(f"Error scraping {url}: {str(e)}")
            results.append((url, f"Error: {str(e)}"))

    return results

## Agents

In an earlier iteration of this project, we searched the internet for each one of those donors using the exact name as listed in the data, extracted the first 10 results, scraped them, and then fed them to an LLM.

Since then, AI agents have become a lot more prominent. Agents are systems where LLMs are allowed to choose their own steps, use tools appropriate for different tasks and make decisions about when to stop.

The code required to set up an agent from scratch is fairly simple, but there are several frameworks which come with batteries included. We'll use [PydanticAI](https://ai.pydantic.dev), but most of them are similar.
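To make the "from scratch" idea concrete, here is a minimal sketch of an agent loop. The `fake_llm` function and the `search_web` tool are hypothetical stand-ins, not part of any library; a real version would call an LLM API and a search engine at those points.

```python
from typing import Callable

# Hypothetical tool: in practice this would call a search API or a scraper.
def search_web(query: str) -> str:
    return f"(search results for: {query})"

TOOLS: dict[str, Callable[[str], str]] = {"search_web": search_web}

def fake_llm(history: list[dict]) -> dict:
    """Stand-in for a real LLM call: search once, then answer."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"action": "search_web", "input": history[0]["content"]}
    return {"action": "finish", "input": f"Answer based on {tool_results[-1]['content']}"}

def run_agent(question: str, llm: Callable[[list[dict]], dict], max_steps: int = 5) -> str:
    """Let the model pick a tool each turn until it finishes or runs out of steps."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = llm(history)
        if decision["action"] == "finish":
            return decision["input"]
        tool = TOOLS[decision["action"]]
        history.append({"role": "tool", "content": tool(decision["input"])})
    return "Stopped: step limit reached"

answer = run_agent("Who is Warren A Stephens?", fake_llm)
print(answer)
```

The three ingredients from the definition above are all visible: the model chooses the next step, tools are looked up by name, and the loop ends either when the model says it is finished or when the step budget runs out. Frameworks add the parts left out here, such as message formatting, retries and typed outputs.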

In [42]:
agent = Agent(
    'openai:gpt-4.1-mini',
    tools=[duckduckgo_search_tool()],
    system_prompt='Answer the question to the best of your abilities, using DuckDuckGo search results.',
)

result = agent.run_sync(
    'What is the connection between WARREN A STEPHENS and Donald Trump?'
)
print(result.output)

If the API key is missing, this call fails with `OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable`.
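Once a single question works, the same call can be repeated for every donor. The following is only a sketch: `build_question` and `ask_about_donors` are hypothetical helper names, and `ask` is any function that takes a question and returns text, so the example can run with a stand-in instead of a real agent.

```python
from typing import Callable

def build_question(donor: str) -> str:
    # Mirrors the wording of the single-donor example
    return f"What is the connection between {donor} and Donald Trump?"

def ask_about_donors(donors: list[str], ask: Callable[[str], str]) -> dict[str, str]:
    """Ask one question per donor, recording errors instead of stopping."""
    answers = {}
    for donor in donors:
        try:
            answers[donor] = ask(build_question(donor))
        except Exception as e:  # e.g. rate limits or network errors
            answers[donor] = f"Error: {e}"
    return answers

# Stand-in for the agent so the sketch runs without an API key
demo = ask_about_donors(
    ["WARREN A STEPHENS", "JARED ISAACMAN"],
    ask=lambda question: f"(answer to: {question})",
)
print(demo)
```

With the agent defined earlier, `ask` could be `lambda q: agent.run_sync(q).output`, and the donor list could come from `fec_total['Name'].head(5).tolist()` to test on a handful of names before running everything.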

## Hallucinations and RAG

Large Language Models are prone to [hallucinations](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)). If you ask an LLM a question it doesn't know the answer to, [it will confidently make up a plausible-sounding answer](https://ddj.nicu.md/ai/) that is completely wrong. This is particularly the case with lesser-known organisations that wouldn't feature prominently in the training data.

Let's ask ChatGPT if "Clean Resource Innovation Network" is a fossil fuel organisation or not.

In [23]:
org = "Clean Resource Innovation Network"

In [7]:
def make_request_openai_simple(prompt_system: str, prompt_user: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Send a plain chat request and return the text of the reply."""
    client = OpenAI(
        # api_key = ''
        )
    # No structured output is needed here, so a regular chat completion is enough
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": prompt_user}
        ],
        **kwargs
    )
    return response.choices[0].message.content

In [8]:
make_request_openai_simple(prompt_system="You are an AI whose job it is to help researchers identify fossil fuel organisations",
                           prompt_user=f"Is {org} a fossil fuel organisation? Respond with YES or NO")

'NO'

One way to minimise (but not completely eliminate) hallucinations is Retrieval-Augmented Generation (RAG). Very simply, this means providing the AI with some additional factual context for the question you're asking.

One example comes from The San Francisco Chronicle, who launched [a chatbot that answers questions about Kamala Harris](https://www.sfchronicle.com/projects/2024/kamala-harris-election-questions/).

In our case, we'll provide the AI with relevant search results related to our organisation so that it knows who we're asking about.

We'll use DuckDuckGo because it has a free API. For better results, you can use the Google API or a [SERP API](https://developers.oxylabs.io/scraper-apis/serp-scraper-api/google/search).

Let's search for the organisation and extract the first 5 results.

In [9]:
search = f'"{org}" oil gas coal'
print(search)

results = DDGS().text(search, max_results=5)
results

"Clean Resource Innovation Network" oil gas coal


[{'title': 'Clean Resource Innovation Network',
  'href': 'https://www.cleanresourceinnovation.com/',
  'body': 'The Clean Resource Innovation Network (CRIN) is a pan-Canadian network founded to enable cleaner energy development by commercializing and adopting technologies for the oil and gas industry. We bring together diverse expertise from industry, entrepreneurs, investors, academia, governments, and many others to enable solutions that improve the ...'},
 {'title': '17 new technologies funded by CRIN competition to address economic ...',
  'href': 'https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/',
  'body': "March 9, 2022 CALGARY, Alberta - Clean Resource Innovation Network (CRIN) Today CRIN is announcing funding of over $44 million CAD for 17 projects identified through its Reducing Environmental Footprint oil and gas technology competition. This brings the total investment through three c

## Scrape search results

Now, we want to extract the text from each of those URLs. We'll use [Trafilatura](https://github.com/adbar/trafilatura), a library that will help us extract the main text without headers, footers and other irrelevant text.

In [11]:
# Run the function
texts = extract_text([result['href'] for result in results if 'href' in result])
texts = [(url, text) for url, text in texts if text is not None] # remove empty scrapes

Scraping https://www.cleanresourceinnovation.com/...
Scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/...
Error scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/
Scraping https://www.globenewswire.com/news-release/2023/11/03/2773454/0/en/CRIN-Funds-an-Additional-Nineteen-Projects-through-the-Oil-Gas-Technology-Competitions.html...
Scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/...
Error scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emiss

In [12]:
# Paste text together
prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()
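Note that `extract_text` stores failed requests as strings starting with `Error:`, so those entries would otherwise end up in the prompt. A slightly more defensive way to assemble the documents might look like the sketch below. The `build_documents` name and the `max_chars_per_doc` cap are assumptions added for illustration, not part of the original notebook.

```python
from typing import Optional

def build_documents(texts: list[tuple[str, Optional[str]]], max_chars_per_doc: int = 5000) -> str:
    """Join scraped pages into one prompt section, skipping failed or empty scrapes."""
    sections = []
    for url, text in texts:
        # Skip pages where nothing was extracted or the request failed
        if not text or text.startswith("Error:"):
            continue
        # Cap each page so one very long document cannot crowd out the rest
        sections.append(f"URL: {url}\n{text[:max_chars_per_doc]}")
    return "\n\n".join(sections).strip()

sample = [
    ("https://example.com/a", "First page text"),
    ("https://example.com/b", "Error: 403 Client Error"),
    ("https://example.com/c", None),
]
docs = build_documents(sample)
print(docs)
```

Only the first sample page survives, because the other two carry no usable text. The character cap is a blunt tool; counting tokens would be more precise but needs a tokenizer for the chosen model.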

In [13]:
prompt_system = 'You will be provided with a collection of documents collected from web search results. Your task is to determine whether an organization is a fossil fuel company or lobbying group or not.'

prompt_instructions= f'''
## Instructions

You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:
- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

## Send request to LLM

There are [various LLMs available](https://ddj.nicu.md/ai/llm-comparison.html), each with different capabilities and costs.

For our task, there are a few things we need to consider:

- **Performance**: Is the model intelligent enough to understand the task?
- **Cost**: If you are running tens of thousands of requests, the cost can add up quickly. Models like Claude 3 Opus quickly become unaffordable.
- **Rate limits**: Some platforms impose limits on how many times you can call the API in a given time period (minute, hour, day) and how big the requests can be.
- **Other features**: Some models offer additional features like better support for various languages, prompt caching, or structured outputs.
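Rate limits in particular are easy to plan for. One common approach is to retry a failed call after a wait that doubles each time (exponential backoff). The sketch below uses only the standard library; `with_backoff` is a hypothetical helper, and it catches every exception for simplicity, whereas real code would catch only the provider's rate-limit error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(call: Callable[[], T], max_attempts: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call`, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")

# Demo: a function that fails twice before succeeding
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

delays: list[float] = []
result = with_backoff(flaky, sleep=delays.append)  # record waits instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; in a real run the default `time.sleep` is used. Any zero-argument function can be wrapped this way, for example a `lambda` around a search or LLM request.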

We'll use OpenAI's gpt-4o-mini for this classification. One advantage of this particular model is its support for [Structured Outputs](https://openai.com/index/introducing-structured-outputs-in-the-api/). This means you can force the response to follow a certain set of rules.

Let's define what we want the output to be.

In [15]:
class Classification(BaseModel):
    fossil_fuel_link: bool = Field(description = "Is this a fossil fuel organization?")
    explanation: str = Field(description = "A brief explanation of your decision, in English")
    source: str = Field(description = "A link to the SINGLE most relevant source that supports your classification")

Now, let's make the request to OpenAI. First, we define a function like we did before, with a few tweaks.

In [16]:
def make_request_openai(prompt_system: str, prompt_instructions: str, prompt_documents: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Make a request to OpenAI models that support structured outputs."""
    client = OpenAI(
        # api_key = ''
        )
    response = client.beta.chat.completions.parse(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": f'{prompt_documents}\n\n{prompt_instructions}'}
        ],
        response_format=Classification
    )
    return response.choices[0].message.content

Now, let's run the function on our example.

In [17]:
openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)
print(openai_response)

{"fossil_fuel_link":true,"explanation":"The Clean Resource Innovation Network (CRIN) is focused on enabling cleaner energy development specifically for the oil and gas industry. It supports projects that aim to improve the environmental performance of this sector, which indicates a direct involvement with fossil fuels. The organization is dedicated to commercializing technologies that benefit the oil and gas industry, which aligns with the characteristics of a fossil fuel organization.","source":"https://www.cleanresourceinnovation.com/"}
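The response is a JSON string, so it needs to be parsed before use. The sketch below does this with the standard library only; `ParsedClassification`, `parse_classification` and `cites_retrieved_source` are hypothetical names introduced here. With Pydantic available, `Classification.model_validate_json(raw)` should achieve the same parsing step.

```python
import json
from dataclasses import dataclass

@dataclass
class ParsedClassification:
    fossil_fuel_link: bool
    explanation: str
    source: str

def parse_classification(raw: str) -> ParsedClassification:
    """Turn the JSON string returned by the model into a typed object."""
    data = json.loads(raw)
    parsed = ParsedClassification(**data)  # TypeError on missing or unexpected keys
    if not isinstance(parsed.fossil_fuel_link, bool):
        raise ValueError("fossil_fuel_link must be true or false")
    return parsed

def cites_retrieved_source(parsed: ParsedClassification, retrieved_urls: list[str]) -> bool:
    """Check that the cited source is one of the pages we actually scraped."""
    return parsed.source in retrieved_urls

raw = ('{"fossil_fuel_link": true, '
       '"explanation": "Works with the oil and gas industry.", '
       '"source": "https://www.cleanresourceinnovation.com/"}')
parsed = parse_classification(raw)
print(parsed.fossil_fuel_link, cites_retrieved_source(parsed, ["https://www.cleanresourceinnovation.com/"]))
```

The source check is a cheap guard: a link that was never among the scraped URLs is a hint that the model may have invented it, which makes it a good candidate for manual review.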


## Scale

We've seen an example of how to classify one organisation. The advantage of using AI is that we can scale this to (hundreds of) thousands of operations.

Let's define a few functions to help us with this.

In [25]:
def classify_org(org: str):
    search = f'"{org}" oil gas coal'

    results = DDGS().text(search, max_results=5)

    texts = extract_text([result['href'] for result in results if 'href' in result])
    texts = [(url, text) for url, text in texts if text is not None] # remove empty scrapes

    prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()
    prompt_instructions= f'''
## Instructions

You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:
- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

    openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)

    return openai_response

In [26]:
def apply_classify_org(df):
    df['classification'] = df.apply(lambda row: classify_org(org = row['organization']), axis=1)
    df['classification'] = df['classification'].apply(json.loads)
    df = pd.concat([df.drop(['classification'], axis=1), df['classification'].apply(pd.Series)], axis=1)

    return df

Now let's run this on our sample of organisations.

In [27]:
cop_orgs_classified = apply_classify_org(cop_orgs)

Scraping https://climatereality.ph/reenergizeph/...
Scraping https://climatereality.ph/2021/08/21/climate-reality-ph-geop-a-potent-weapon-against-unreliable-coal-sourced-power/...
Scraping https://climatereality.ph/2023/09/28/climate-reality-ph-builds-momentum-for-geop-implementation-in-mindanao/...
Scraping https://mirror.pia.gov.ph/news/2021/08/22/climate-reality-geop-a-potent-weapon-vscoal-sourced-power...
Scraping https://www.facebook.com/climaterealityphilippines/posts/green-energy-option-program-a-potent-weapon-against-unreliable-coal-sourced-powe/4243220149098727/...


In [28]:
cop_orgs_classified

| | organization | fossil_fuel_link | explanation | source |
|---|---|---|---|---|
| 37190 | IUCN Regional Office for West Asia | False | The IUCN Regional Office for West Asia is prim... | https://www.iucn.org/regions/west-asia |
| 33297 | The Climate Reality Project Philippines | False | The Climate Reality Project Philippines focuse... | https://climatereality.ph/reenergizeph/ |
| 18860 | Seychelles Meteorological Authorithy | False | The Seychelles Meteorological Authority is a g... | https://www.seychelles.gov.sc/Departments/mete... |
| 7675 | Ministry of Local Government, Lands, Regional ... | False | The Ministry of Local Government, Lands, Regio... | https://www.example.com/ministry-local-government |
| 23338 | Chairperson s Secretariat | False | The 'Chairperson's Secretariat' does not appea... | https://www.example.com/chairpersons-secretariat |


In [29]:
cop_orgs_classified.to_csv('data/cop_orgs_classified.csv')

## What next?

There are lots of things we can improve about this process. Here are some ideas:

- **Play around with the prompt**. 
- **Change search engine**. DuckDuckGo is free and good for a prototype. However, their API isn't meant to be used this way and will often deny requests. It also doesn't return the best results. I recommend switching to Google.
- **Try other models**. If you find that gpt-4o-mini is insufficient, you can use the smarter gpt-4o.
- **Validate the output**. If you use other models without Structured Output support, you can use Guardrails to [validate their output](https://ddj.nicu.md/ai/python-validation.html). It also lets you validate other things, like the language of the output.
- **Cache things**. Don't start over if something goes wrong, save the search results, scrapes and LLM outputs and continue where you left off.
- **Multithreading**. You can use Python's [multithreading](https://docs.python.org/3/library/threading.html) to run multiple classifications in parallel, significantly speeding up the process.
- **Verify**. LLMs are still dumb and shouldn't be trusted. Manually verify the classifications if you're going to publish the results!
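The caching and multithreading ideas can be combined. The sketch below is one possible shape, using only the standard library: `classify_all` is a hypothetical helper, `classify` stands for any function that takes an organisation name and returns a result (such as `classify_org`), and the JSON cache file layout is an assumption.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

def classify_all(orgs: list[str], classify: Callable[[str], str],
                 cache_path: Path, max_workers: int = 4) -> dict[str, str]:
    """Classify organisations in parallel, skipping those already cached on disk."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    todo = [org for org in dict.fromkeys(orgs) if org not in cache]  # de-duplicated

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip pairs each org with its own result
        for org, result in zip(todo, pool.map(classify, todo)):
            cache[org] = result
            cache_path.write_text(json.dumps(cache))  # save progress after each result

    return {org: cache[org] for org in orgs}

# Demo with a stand-in classifier that records which names it was called with
calls: list[str] = []

def fake_classify(org: str) -> str:
    calls.append(org)
    return org.upper()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache.json"
    first = classify_all(["a", "b"], fake_classify, path)
    second = classify_all(["a", "b", "c"], fake_classify, path)

print(second, sorted(calls))
```

On the second run only the new name reaches the classifier, because the first two are read back from the cache file. If one call raises, the results saved so far stay on disk and a rerun picks up from there. Keep `max_workers` low when the search engine or LLM provider enforces rate limits.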