<a target="_blank" href="https://colab.research.google.com/github/nicucalcea/ddj-wiki/blob/main/ai/python-classification-rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://github.com/nicucalcea/ddj-wiki/blob/main/ai/python-classification-rag.ipynb">
  <img src="https://badgen.net/badge/icon/github?icon=github&label=View%20code" alt="View on GitHub"/>
</a>

Without looking it up, would you know if the "Clean Resource Innovation Network" is an oil and gas lobbying organisation? Lobbyists often hide behind unintuitive or misleading affiliations that obscure their origins. The work of uncovering their identities can be time-consuming and challenging.

This workshop will outline an approach to using web scraping and Large Language Models (LLMs), like those powering ChatGPT, to systematically identify organisations that are affiliated with the fossil fuel industry. These techniques could also be adapted to other climate projects, such as identifying climate misinformation.

The tutorial will:

- Describe some of the challenges of web scraping at scale and the tools that can address them
- Describe how to design and test a successful LLM prompt
- Show how to identify the right LLM for your project
- Show how to test your approach and validate the results

## Install and load libraries

In [None]:
%pip install duckduckgo_search
# %pip install --upgrade duckduckgo-search
%pip install trafilatura
%pip install openai
%pip install pydantic
%pip install pandas

In [1]:
import requests
import pandas as pd
from duckduckgo_search import DDGS
from trafilatura import extract
from enum import Enum
from pydantic import BaseModel, Field
from openai import OpenAI
import json

## Prep data

The [UNFCCC website](https://unfccc.int/documents/634503) published an Excel sheet of COP28 participants. Let's download it to our local project.

In [4]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0'
}

url = "https://unfccc.int/sites/default/files/resource/PLOP%20COP28_on-site.xlsx"

response = requests.get(url, headers=headers)

cop_file = '../data/plop28.xlsx'

if response.status_code == 200:
    with open(cop_file, "wb") as file:  # reuse the path defined above
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully.


The participants are spread across multiple sheets. Let's read in the data and bind the sheets together.

In [5]:
# Create an empty list to store the dataframes from each sheet
cop_participans = []

# Read the Excel file
xls = pd.ExcelFile(cop_file)

# Iterate through the sheets and append the dataframes to the list
for sheet_name in xls.sheet_names:
    df = xls.parse(sheet_name)  # reuse the workbook we already opened
    cop_participans.append(df)

# Concatenate the dataframes into a single dataframe
cop_participans = pd.concat(cop_participans, ignore_index=True)

In [21]:
cop_participans.head()

Unnamed: 0,nominator,name,func_title,department,organization,relation
0,Albania,H.E. Mr. Edi Rama,Prime Minister,Prime Minister Office,Prime Minister Office,Choose not to disclose
1,Albania,H.E. Ms. Mirela Furxhi,Minister of Tourism and Environment,Ministry of Tourism and Environment,Ministry of Tourism and Environment,Choose not to disclose
2,Albania,H.E. Ms. Belinda Balluku,Deputy Prime Minister and Minister of Infrastr...,Ministry of Infrastructure and Energy,Ministry of Infrastructure and Energy,Choose not to disclose
3,Albania,Ms. Lindita Rama,Spouse of the Prime Minister,Not applicable,Not applicable,Choose not to disclose
4,Albania,H.E. Mr. Ridi Kurtezi,Ambassador of the Republic of Albania to the UAE,Albanian Embassy in United Arab Emirates,Albanian Embassy in United Arab Emirates,Choose not to disclose


This list contains all participants registered to attend the 2023 United Nations Climate Change Conference or Conference of the Parties (COP28).

We're not interested in the individuals, just the organisations they represent. Let's extract those.

In [22]:
cop_orgs = cop_participans[['nominator', 'organization']].drop_duplicates().sample(20)
cop_orgs

Unnamed: 0,nominator,organization
54467,World Bank Group,Velliv
79373,Himal Media,www.ukaalo.com
68015,The Association of Commonwealth Universities,South Eastern University of Sri Lanka
44364,Sweden,The Church of Sweden
44553,Thailand,"OKLIN (THAILAND) CO., LTD."


## Search

Large Language Models are prone to [hallucinations](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)).

If you ask an LLM a question it doesn't know the answer to, it may confidently make up a plausible-sounding answer that is completely wrong.

One way to help with that issue is to provide the AI with some additional context for the question you're asking.

For our case, we'll provide it with relevant search results related to our organisation so that it knows who we're asking about.

We'll use DuckDuckGo because the `duckduckgo_search` library lets us query it for free, without an API key. For better results, you can use the Google API or a [SERP API](https://developers.oxylabs.io/scraper-apis/serp-scraper-api/google/search).
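Free search endpoints can reject requests when they are called repeatedly, so it may help to wrap the search call in a small retry helper. The sketch below is one way to do that; `with_retries` and `flaky_search` are hypothetical names used for illustration, and the delays are arbitrary.

```python
import time

def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, waiting longer after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # broad for the sketch; narrow it to the errors your search library raises
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise last_error

# Demonstration with a stand-in function that fails twice before succeeding
calls = {"count": 0}

def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return ["result"]

print(with_retries(flaky_search, sleep=lambda seconds: None))
```

With the real search, the call would look something like `with_retries(lambda: DDGS().text(search, max_results=5))`.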

Let's search for the organisation and nominator and extract the first 5 results.

In [8]:
org = cop_orgs['organization'].iloc[0]
nominator = cop_orgs['nominator'].iloc[0]

search = f'("{org}" {nominator}) OR ("{org}") OR ({org})'
search += ' oil OR gas OR coal'
print(search)

results = DDGS().text(search, max_results=5)
results

("PRESIDENCIA DE LA REPUBLICA" Colombia) OR ("PRESIDENCIA DE LA REPUBLICA") OR (PRESIDENCIA DE LA REPUBLICA) oil OR gas OR coal


[{'title': 'presidencia.gov.co',
  'href': 'http://presidencia.gov.co/',
  'body': 'Presidencia es el sitio web oficial de la Presidencia de Colombia. Aquí podrá conocer las acciones y propuestas del gobierno, así como los canales de participación y comunicación ciudadana.'},
 {'title': 'Presidencia de la República - Colombia - YouTube',
  'href': 'https://www.youtube.com/@infopresidencia',
  'body': 'Share your videos with friends, family, and the world'},
 {'title': "Colombian electoral authorities investigating president's 2022 campaign ...",
  'href': 'https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign',
  'body': 'BOGOTA, Colombia — Electoral authorities in Colombia on Tuesday ruled in favor of investigating financial misconduct allegations against the 2022 campaign that got President Gustavo Petro elected.'},
 {'title': 'Presidente de Colombia - Wikipedia, la enciclopedia libre',
  'href'

## Scrape search results

Now, we want to extract the text from each of those URLs. We'll use [Trafilatura](https://github.com/adbar/trafilatura), which will help us extract the main text without headers, footers and other irrelevant text.

In [23]:
# Define function to scrape data
def extract_text(urls):
    results = []

    for url in urls:
        print(f"Scraping {url}...")
        try:
            response = requests.get(url, timeout=30, verify=False)  # Note: verify=False is not recommended for production use
            response.raise_for_status()  # Raises an HTTPError for bad responses
            extracted_text = extract(response.text, output_format="markdown")
            results.append((url, extracted_text))
        except requests.RequestException as e:
            print(f"Error scraping {url}: {str(e)}")
            # Store None so failed scrapes are filtered out later instead of being sent to the LLM
            results.append((url, None))

    return results

In [10]:
# Run the function
texts = extract_text([result['href'] for result in results if 'href' in result])
texts = [(url, text) for url, text in texts if text is not None] # remove empty scrapes

Scraping http://presidencia.gov.co/...
Scraping https://www.youtube.com/@infopresidencia...
Scraping https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign...
Error scraping https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign: 403 Client Error: Forbidden for url: https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign
Scraping https://es.wikipedia.org/wiki/Presidente_de_Colombia...
Scraping https://www.healthline.com/health/digestive-health/get-rid-of-gas-pains-and-bloating...


In [11]:
# Paste text together
stitched_text = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()
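Some pages are very long, and pasting them in full can make the request slow, expensive, or too large for the model. One option is to cap how much of each page we keep. This is a minimal sketch, not part of the original workflow: `stitch_with_budget` is a hypothetical helper and the 8,000-character limit is an arbitrary assumption you would tune.

```python
def stitch_with_budget(texts, max_chars_per_page=8000):
    """Join (url, text) pairs, keeping at most max_chars_per_page characters of each page."""
    parts = []
    for url, text in texts:
        if not text:
            continue  # skip pages where nothing was extracted
        parts.append(f"URL: {url}\n{text[:max_chars_per_page]}")
    return "\n\n".join(parts).strip()

# Made-up example data: one long page and one empty scrape
example_texts = [
    ("https://example.com/a", "x" * 20000),
    ("https://example.com/b", None),
]
example_stitched = stitch_with_budget(example_texts)
print(len(example_stitched))
```

Cutting by characters is crude; it keeps the start of each page, which is usually where an organisation describes itself.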

In [14]:
prompt_system = f'''
## Instructions

You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:
- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

A fossil fuel organization is NOT:
- NGOs campaigning AGAINST fossil fuels (e.g., Greenpeace).
- Investors, pension funds, universities, or banks that invest in fossil fuel companies.
- Consulting companies that work with a wide range of clients which may include fossil fuel companies.
- Governments that benefit from fossil fuel production.
- Government departments facilitating, regulating or taxing fossil fuel companies.
- Media organisations that report on the activities of fossil fuel companies.

Analyze the following text extracted from an internet search for "{org}" to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

## Send request to LLM

We'll use OpenAI's gpt-4o-mini for this classification.

One advantage of this particular model is its support for [Structured Outputs](https://openai.com/index/introducing-structured-outputs-in-the-api/). This means you can force the response to follow a certain set of rules.

Let's define what we want the output to be.

In [15]:
# Define OpenAI Structured Outputs rules
class OrgType(str, Enum):
    company = "company"
    ngo = "NGO/non-profit"
    government = "government"
    media = "media"
    other = "other"

class Classification(BaseModel):
    fossil_fuel_link: bool = Field(description = "Is this a fossil fuel organization?")
    org_type: OrgType
    explanation: str = Field(description = "A brief explanation of your decision, in English")
    source: str = Field(description = "A link to the SINGLE most relevant source that supports your classification")

Now, let's make the request to OpenAI. First, we define a function.

In [16]:
def make_request_openai(prompt_system: str, prompt_user: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Make a request to OpenAI models that support structured outputs."""
    # If api_key is not passed, the client reads it from the OPENAI_API_KEY environment variable
    client = OpenAI(
        # api_key = ''
        )
    response = client.beta.chat.completions.parse(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": prompt_user}
        ],
        response_format=Classification
    )
    return response.choices[0].message.content

Now, let's run the function on our example.

In [17]:
openai_response = make_request_openai(prompt_system, stitched_text)
print(openai_response)

{"fossil_fuel_link":false,"org_type":"government","explanation":"The 'PRESIDENCIA DE LA REPUBLICA' refers to the office of the President of Colombia, which is a government entity. It does not operate as a fossil fuel organization, as it does not engage in exploration, extraction, or any business activities related to fossil fuels. Instead, it is responsible for the administration of the country and its policies, which may include energy policy but does not directly involve fossil fuel operations.","source":"https://es.wikipedia.org/wiki/Presidente_de_Colombia"}
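Note that the function returns a JSON string rather than a Python object. To use the individual fields, we can parse it with the `json` module. The snippet below uses a shortened, illustrative response in the same shape as the output above, so the explanation text is made up.

```python
import json

# Shortened, illustrative response in the same shape as the real output
sample_response = (
    '{"fossil_fuel_link": false, "org_type": "government", '
    '"explanation": "Government office.", '
    '"source": "https://es.wikipedia.org/wiki/Presidente_de_Colombia"}'
)

classification = json.loads(sample_response)

if classification["fossil_fuel_link"]:
    print("Flag for review:", classification["source"])
else:
    print("No fossil fuel link found:", classification["org_type"])
```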


## Scale up

We can wrap the steps above into a few functions and run them over multiple organisations.

In [18]:
def extract_text(urls):
    results = []

    for url in urls:
        print(f"Scraping {url}...")
        try:
            response = requests.get(url, timeout=30, verify=False)  # Note: verify=False is not recommended for production use
            response.raise_for_status()  # Raises an HTTPError for bad responses
            extracted_text = extract(response.text, output_format="markdown")
            results.append((url, extracted_text))
        except requests.RequestException as e:
            print(f"Error scraping {url}: {str(e)}")
            # Store None so failed scrapes are filtered out later instead of being sent to the LLM
            results.append((url, None))

    return results

# Define OpenAI Structured Outputs rules
class OrgType(str, Enum):
    company = "company"
    ngo = "NGO/non-profit"
    government = "government"
    media = "media"
    other = "other"

class Classification(BaseModel):
    fossil_fuel_link: bool = Field(description = "Is this a fossil fuel organization?")
    org_type: OrgType
    explanation: str = Field(description = "A brief explanation of your decision, in English")
    source: str = Field(description = "A link to the SINGLE most relevant source that supports your classification")

def make_request_openai(org: str, prompt_user: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Make a request to OpenAI models that support structured outputs."""

    prompt_system = f'''
## Instructions

You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:
- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

A fossil fuel organization is NOT:
- NGOs campaigning AGAINST fossil fuels (e.g., Greenpeace).
- Investors, pension funds, universities, or banks that invest in fossil fuel companies.
- Consulting companies that work with a wide range of clients which may include fossil fuel companies.
- Governments that benefit from fossil fuel production.
- Government departments facilitating, regulating or taxing fossil fuel companies.
- Media organisations that report on the activities of fossil fuel companies.

Analyze the following text extracted from an internet search for "{org}" to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

    # If api_key is not passed, the client reads it from the OPENAI_API_KEY environment variable
    client = OpenAI(
        # api_key = ''
        )
    response = client.beta.chat.completions.parse(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": prompt_user}
        ],
        response_format=Classification
    )
    return response.choices[0].message.content

def classify_org(org: str, nominator: str):
    search = f'("{org}" {nominator}) OR ("{org}") OR ({org})'
    search += ' oil OR gas OR coal'

    results = DDGS().text(search, max_results=5)

    texts = extract_text([result['href'] for result in results if 'href' in result])
    texts = [(url, text) for url, text in texts if text is not None] # remove empty scrapes

    stitched_text = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()

    openai_response = make_request_openai(org, stitched_text)

    return openai_response

In [19]:
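def apply_classify_org(df):
    df = df.copy()  # work on a copy so the caller's dataframe is left untouched
    # Classify each row, then parse the JSON string returned by the model into a dict
    df['classification'] = df.apply(lambda row: classify_org(row['organization'], row['nominator']), axis=1)
    df['classification'] = df['classification'].apply(json.loads)
    # Expand the dict keys (fossil_fuel_link, org_type, ...) into separate columns
    df = pd.concat([df.drop(['classification'], axis=1), df['classification'].apply(pd.Series)], axis=1)

    return df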

Now let's run this on our sample of organisations.

In [20]:
cop_orgs_classified = apply_classify_org(cop_orgs)

Scraping http://presidencia.gov.co/...
Scraping https://www.youtube.com/@infopresidencia...
Scraping http://web.presidencia.gov.co/...
Scraping https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign...
Error scraping https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign: 403 Client Error: Forbidden for url: https://www.latimes.com/world-nation/story/2024-10-08/colombian-electoral-authorities-open-investigation-against-president-petros-2022-campaign
Scraping https://es.wikipedia.org/wiki/Presidente_de_Colombia...
Scraping https://www.ridetocop28.com/...
Scraping https://www.reuters.com/world/us/biden-pledges-end-gas-powered-federal-vehicle-purchases-by-2035-2021-12-08/...
Error scraping https://www.reuters.com/world/us/biden-pledges-end-gas-powered-federal-vehicle-purchases-by-2035-2021-12-08/: 401 Client Error

In [27]:
cop_orgs_classified

Unnamed: 0,nominator,organization,fossil_fuel_link,org_type,explanation,source
30229,Colombia,PRESIDENCIA DE LA REPUBLICA,False,government,The 'PRESIDENCIA DE LA REPUBLICA' refers to th...,https://es.wikipedia.org/wiki/Presidente_de_Co...
74167,Host Country Badges - COP 28,Mazi Mobility,False,other,The text does not provide any information abou...,
61263,Fresh Energy,Fresh Energy,False,NGO/non-profit,Fresh Energy is focused on promoting clean ene...,https://www.freshenergy.org/
79561,L'Orient-Le Jour newspaper,L'Orient-Le Jour newspaper,False,media,L'Orient-Le Jour is a daily newspaper that cov...,https://en.wikipedia.org/wiki/L'Orient-Le_Jour
73802,Host Country Badges - COP 28,TAMALE,False,other,The text discusses the history and cultural si...,https://www.tastingtable.com/1529368/history-o...


In [26]:
cop_orgs_classified.to_csv('../data/cop_orgs_classified.csv')
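One way to test results like these is to classify a sample by hand and compare it with what the model returned. The sketch below assumes you have pairs of booleans (your label, the model's label); the labels shown are hypothetical and `score_predictions` is an illustrative helper, not part of any library.

```python
def score_predictions(pairs):
    """Compare (manual_label, model_label) booleans and summarise agreement."""
    tp = sum(1 for manual, model in pairs if manual and model)
    fp = sum(1 for manual, model in pairs if not manual and model)
    fn = sum(1 for manual, model in pairs if manual and not model)
    tn = sum(1 for manual, model in pairs if not manual and not model)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the organisations the model flagged, how many were right?
        "precision": tp / (tp + fp) if tp + fp else None,
        # Of the true fossil fuel organisations, how many did the model catch?
        "recall": tp / (tp + fn) if tp + fn else None,
    }

# Hypothetical labels: (what a human decided, what the model returned)
checked = [(True, True), (True, False), (False, False), (False, False), (False, True)]
scores = score_predictions(checked)
print(scores)
```

For this project, recall is probably the number to watch: a missed lobbyist is a bigger problem than a false alarm that you will catch during manual review.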

## What to improve?

There are lots of things we can improve about this process. Here are some ideas:

- DuckDuckGo is free and easy to use. However, its API isn't meant to be used this way and will often deny requests. It also doesn't return the best results. I recommend switching to Google.
- Try other models! If you find that gpt-4o-mini is insufficient, you can use the smarter gpt-4o.
- If you use other models without Structured Output support, you can use Guardrails to [validate their output](https://ddj.nicu.md/ai/python-validation.html).
- Cache things! Don't start over if something goes wrong, save the search results, scrapes and LLM outputs and continue where you left off.
- You can use Python's [multithreading](https://docs.python.org/3/library/threading.html) to run multiple classifications in parallel, significantly speeding up the process.
- LLMs are still dumb and shouldn't be trusted. Manually verify the classifications if you're going to publish the results.
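
As a rough sketch of the caching and parallelism ideas above, the code below saves every result to a JSON file and only classifies organisations that aren't in it yet, using a thread pool from the standard library. `classify_all`, `load_cache` and `fake_classify` are hypothetical names; `fake_classify` stands in for the real classification function so the example runs without network access.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def load_cache(path):
    """Return previously saved results, or an empty dict if there are none."""
    path = Path(path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def classify_all(orgs, classify, cache_path, max_workers=4):
    """Classify only organisations missing from the cache, several at a time."""
    cache = load_cache(cache_path)
    todo = [org for org in orgs if org not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs classify in worker threads and yields results in input order
        for org, result in zip(todo, pool.map(classify, todo)):
            cache[org] = result
            # Save after every result so a crash loses as little work as possible
            Path(cache_path).write_text(json.dumps(cache), encoding="utf-8")
    return cache

def fake_classify(org):
    """Hypothetical stand-in for the real search, scrape and LLM steps."""
    return {"fossil_fuel_link": "oil" in org.lower()}

cache_file = Path(tempfile.mkdtemp()) / "classifications.json"
results = classify_all(["Example Oil Ltd", "Example University"], fake_classify, cache_file)
print(results)
```

This version keys the cache on the organisation name only. Since `classify_org` also takes the nominator, you would likely want a key that combines both.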