## Final Project: Fuzzy matching of organization names

In [None]:
project_id = "kiaraerica"
dataset = "us_climate_fin"
region = "us-central1"
connection_id = "vertex-connection"
embedding_model = "text-embedding-005"
gemini_model = "gemini-2.5-flash-preview-04-17"

### Part 1: Setup

##### Create the BigQuery dataset

In [None]:
from google.cloud import bigquery

bq_client = bigquery.Client()

dataset_id = bigquery.Dataset(f"{project_id}.{dataset}")
dataset_id.location = region
resp = bq_client.create_dataset(dataset_id, exists_ok=True)
print("Created dataset {}.{}".format(bq_client.project, resp.dataset_id))

Created dataset kiaraerica.us_climate_fin


##### Create a connection resource and register the embedding model (`text-embedding-005`)

In [None]:
!bq mk --connection --location=$region --project_id=$project_id \
    --connection_type=CLOUD_RESOURCE $connection_id

BigQuery error in mk operation: Already Exists: Connection projects/3758592633/locations/us-central1/connections/vertex-connection


In [None]:
!bq show --connection 3758592633.us-central1.vertex-connection

Connection 3758592633.us-central1.vertex-connection

                    name                     friendlyName   description    Last modified         type        hasCredential                                           properties                                           
 ------------------------------------------ -------------- ------------- ----------------- ---------------- --------------- --------------------------------------------------------------------------------------------- 
  3758592633.us-central1.vertex-connection                                20 Apr 22:11:01   CLOUD_RESOURCE   False           {"serviceAccountId": "bqcx-3758592633-7h5b@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}  



In [None]:
!gcloud projects add-iam-policy-binding $project_id --member='serviceAccount:bqcx-3758592633-7h5b@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
  --role='roles/aiplatform.user' --no-user-output-enabled

##### Replace the dataset, project, and connection as appropriate before running the next cell.

In [None]:
%%bigquery
CREATE OR REPLACE MODEL us_climate_fin.embedding_model
REMOTE WITH CONNECTION `projects/kiaraerica/locations/us-central1/connections/vertex-connection`
OPTIONS (endpoint = 'text-embedding-005')

### Part 2: Sample the input data, `us_climate_stg.facility_ghg_emissions`

In [None]:
%%bigquery
create or replace table us_climate_fin.ghg_facilities_no_org as
WITH ranked as (
  select
      facility_id,
      facility_name,
      city,
      state,
      naics_code,
      industry_sector1,
      industry_sector2,
      industry_sector3,
      max_rated_heat_input_capacity,
      carbon_dioxide_emissions,
      methane_emissions,
      nitrous_oxide_emissions,
      biogenic_co2_emissions,
      _data_source,
      _load_time,
      row_number() over (partition by facility_id order by _load_time desc) as rk
  from us_climate_stg.facility_ghg_emissions
  where facility_name is not null
)
select *
from ranked
where rk = 1
order by facility_name

In [None]:
import json
import time
import pandas as pd
import pandas_gbq
from google.cloud import bigquery
import vertexai
from vertexai.generative_models import GenerativeModel
from google.api_core.exceptions import GoogleAPIError

vertexai.init(project=project_id, location=region)
model = GenerativeModel(model_name=gemini_model)
bq_client = bigquery.Client(project=project_id)

sql_create_checkpoint = """
CREATE TABLE IF NOT EXISTS us_climate_fin.ghg_facilities_llm_org_checkpoint (
    facility_id STRING,
    organization_name STRING
);
"""
bq_client.query(sql_create_checkpoint).result()
print("Checkpoint table is ready.")

sql_ghg = """
WITH all_facilities AS (
    SELECT
        CAST(facility_id AS STRING) AS facility_id,
        facility_name,
        city,
        state,
        naics_code,
        industry_sector1,
        industry_sector2,
        industry_sector3
    FROM us_climate_fin.ghg_facilities_no_org
    ORDER BY facility_id
),
unprocessed_facilities AS (
    SELECT a.*
    FROM all_facilities a
    LEFT JOIN us_climate_fin.ghg_facilities_llm_org_checkpoint c
    ON a.facility_id = c.facility_id
    WHERE c.facility_id IS NULL
)
SELECT * FROM unprocessed_facilities
"""
df_ghg = bq_client.query(sql_ghg).to_dataframe()
print(f"Found {len(df_ghg)} unprocessed facilities.")

prompt_ghg = """Given a facility from the GHG emissions dataset:
facility_id, facility_name, city, state, naics_code, industry_sector1,2,3.

Identify the organization that owns or operates this facility, or return null if unknown.
Return EXACTLY one JSON line:
{
  "facility_id": <string>,
  "organization_name": <string or null>
}
No extra text or explanation.
"""

def find_org_for_ghg(row):
    row_dict = row.to_dict()

    row_text = (
        f"facility_id={row_dict.get('facility_id', 'Unknown')}, "
        f"facility_name='{row_dict.get('facility_name', 'Unknown')}', "
        f"city='{row_dict.get('city', 'Unknown')}', "
        f"state='{row_dict.get('state', 'Unknown')}', "
        f"naics_code='{row_dict.get('naics_code', 'Unknown')}', "
        f"sector1='{row_dict.get('industry_sector1', 'Unknown')}', "
        f"sector2='{row_dict.get('industry_sector2', 'Unknown')}', "
        f"sector3='{row_dict.get('industry_sector3', 'Unknown')}'"
    )

    combined_prompt = row_text + "\n" + prompt_ghg

    try:
        resp = model.generate_content(combined_prompt)
        raw_text = resp.text.replace("```json", "").replace("```", "").strip()
        parsed = json.loads(raw_text)

        if "facility_id" not in parsed:
            parsed["facility_id"] = row_dict.get("facility_id", None)
        return parsed

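    except GoogleAPIError:
        # Re-raise API errors (e.g. quota exhaustion) so the batch loop below can back off and retry.
        raise
    except Exception as e: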
        print(f"Error processing facility_id {row_dict.get('facility_id', None)}: {e}")
        return {
            "facility_id": row_dict.get("facility_id", None),
            "organization_name": None
        }

batch_size = 50
min_batch_size = 5
sleep_time = 5
max_retries = 5
ghg_checkpoint_table = "us_climate_fin.ghg_facilities_llm_org_checkpoint"

def process_batch(start, size):
    """Label one slice of df_ghg and append the results to the checkpoint table."""
    batch_df = df_ghg.iloc[start:start + size]
    batch_results = [find_org_for_ghg(row) for _, row in batch_df.iterrows()]
    pandas_gbq.to_gbq(
        pd.DataFrame(batch_results),
        ghg_checkpoint_table,
        project_id=project_id,
        if_exists="append"
    )
    return len(batch_df)

i = 0
while i < len(df_ghg):
    done = None
    # One initial attempt plus up to max_retries retries with linear backoff.
    for attempt in range(max_retries + 1):
        try:
            done = process_batch(i, batch_size)
            break
        except GoogleAPIError as e:
            print(f"API error encountered: {e}.")
            if attempt < max_retries:
                print(f"Retrying (attempt {attempt + 1}/{max_retries})...")
                time.sleep(sleep_time * (attempt + 1))

    if done is None:
        # Every attempt failed: stop at the minimum batch size, otherwise halve and try again.
        if batch_size == min_batch_size:
            print("Minimum batch size reached. Exiting.")
            break
        print("Max retries reached. Reducing batch size.")
        batch_size = max(batch_size // 2, min_batch_size)
        continue

    i += done
    print(f"Processed {i}/{len(df_ghg)} records.")
    time.sleep(sleep_time)

# Build the final table from the checkpoint table rather than from this run's
# in-memory results, so a resumed run (or one with nothing left to process)
# does not replace earlier results with a partial or empty DataFrame.
sql_final = f"""
CREATE OR REPLACE TABLE us_climate_fin.ghg_facilities_llm_org AS
SELECT facility_id, organization_name
FROM {ghg_checkpoint_table}
"""
bq_client.query(sql_final).result()

print("LLM results for GHG saved to ghg_facilities_llm_org.")

Checkpoint table is ready.
Found 0 unprocessed facilities.
LLM results for GHG saved to ghg_facilities_llm_org.
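
`find_org_for_ghg` strips Markdown code fences from the model reply before calling `json.loads`. The sketch below isolates that step as a standalone helper. The name `parse_llm_json` and the choice to return `None` on unparseable text are assumptions made for illustration; they are not part of the notebook.

```python
import json
import re

def parse_llm_json(raw_text):
    """Parse JSON from an LLM reply, tolerating Markdown code fences.

    Hypothetical helper: returns None instead of raising when the
    cleaned text is not valid JSON.
    """
    # Remove opening fences (with or without a "json" tag) and closing fences.
    cleaned = re.sub(r"`{3}(?:json)?", "", raw_text).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

# Build a fenced reply without writing a literal fence inside this example.
fence = "`" * 3
reply = fence + 'json\n{"facility_id": "1001", "organization_name": null}\n' + fence

print(parse_llm_json(reply))
print(parse_llm_json("Sorry, I cannot answer that."))
```

Returning `None` lets the caller decide whether to retry the row or record it as unknown, instead of mixing parse failures with API failures in a single `except` block.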


### Part 3: Create the embeddings

In [None]:
%%bigquery

CREATE OR REPLACE TABLE us_climate_fin.ghg_org_names_raw AS
SELECT DISTINCT organization_name
FROM us_climate_fin.ghg_facilities_llm_org_checkpoint
WHERE organization_name IS NOT NULL AND TRIM(organization_name) != ''

In [None]:
%%bigquery

CREATE OR REPLACE TABLE us_climate_fin.ghg_org_embeddings AS (

WITH org_content AS (
  SELECT
    organization_name,
    organization_name AS content
  FROM
    us_climate_fin.ghg_org_names_raw
)

SELECT
  organization_name,
  content,
  ml_generate_embedding_result AS embedding
FROM
  ML.GENERATE_EMBEDDING(
    MODEL us_climate_fin.embedding_model,
    (
      SELECT organization_name, content
      FROM org_content
      WHERE content IS NOT NULL
    ),
    STRUCT('CLUSTERING' AS task_type)
  )
)

In [None]:
%%bigquery
SELECT *
FROM us_climate_fin.ghg_org_embeddings
LIMIT 10

Unnamed: 0,organization_name,content,embedding
0,Deseret Power Electric Cooperative,Deseret Power Electric Cooperative,"[-0.05413160100579262, 0.035167302936315536, -..."
1,Greenleaf Energy Unit 1 LLC,Greenleaf Energy Unit 1 LLC,"[-0.10558508336544037, 0.007354965899139643, -..."
2,JSW Steel,JSW Steel,"[0.004947439301759005, 0.017695395275950432, -..."
3,"KAAPA Ethanol, LLC","KAAPA Ethanol, LLC","[-0.04476097226142883, 0.04654958099126816, -0..."
4,Algonquin Power & Utilities Corp.,Algonquin Power & Utilities Corp.,"[-0.0699002742767334, 0.013775769621133804, -0..."
5,"DAKOTA ETHANOL, LLC","DAKOTA ETHANOL, LLC","[-0.03983362019062042, 0.01606786996126175, -0..."
6,"The Andersons, Inc.","The Andersons, Inc.","[-0.058520857244729996, -0.023989615961909294,..."
7,Consolidated Natural Gas Company,Consolidated Natural Gas Company,"[-0.09865348041057587, -0.0037647339049726725,..."
8,"Niagara Generation, LLC","Niagara Generation, LLC","[-0.07392047345638275, -0.006972586736083031, ..."
9,"NutraSweet Holdings, Inc.","NutraSweet Holdings, Inc.","[-0.05890120193362236, -0.034589651972055435, ..."


### Part 4: Find the nearest neighbors based on cosine distance
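
As a reminder of what the `distance` column measures, here is a minimal sketch of cosine distance, assuming the usual definition of one minus cosine similarity. The three-dimensional vectors are made up for illustration; the real embeddings are much longer.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, not real embeddings).
a = [1.0, 0.0, 0.0]
b = [1.0, 0.0, 0.0]
c = [0.0, 1.0, 0.0]

print(cosine_distance(a, b))  # same direction
print(cosine_distance(a, c))  # orthogonal
```

Vectors pointing in the same direction have a distance near 0, which is why near-identical organization names appear at the top of the nearest-neighbor table.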

In [None]:
%%bigquery

CREATE OR REPLACE TABLE us_climate_fin.ghg_org_nearest_neighbors AS

SELECT
  query.organization_name AS organization_name,
  base.organization_name AS nearest_neighbor,
  distance
FROM
  VECTOR_SEARCH(
    TABLE us_climate_fin.ghg_org_embeddings,
    'embedding',
    TABLE us_climate_fin.ghg_org_embeddings,
    'embedding',
    TOP_K => 2,
    DISTANCE_TYPE => 'COSINE'
  )
WHERE
  query.organization_name != base.organization_name
ORDER BY
  distance

In [None]:
%%bigquery

SELECT *
FROM us_climate_fin.ghg_org_nearest_neighbors
ORDER BY distance

Unnamed: 0,organization_name,nearest_neighbor,distance
0,"EPL Oil and Gas, Inc.","EPL Oil & Gas, Inc.",0.001325
1,"EPL Oil & Gas, Inc.","EPL Oil and Gas, Inc.",0.001325
2,Analog Devices Inc.,"Analog Devices, Inc.",0.001820
3,"Analog Devices, Inc.",Analog Devices Inc.,0.001820
4,"K.P. Kauffman Company, Inc.","K. P. Kauffman Company, Inc.",0.001905
...,...,...,...
3845,CenTrio,Central Bi-Products,0.444565
3846,Didion,DuPont,0.455011
3847,Order of Saint Benedict,OCI Beaumont,0.457049
3848,Architect of the Capitol,General Services Administration,0.481078


#### This was a real duplicate.
The two names differ only in an abbreviation ("CO." vs. "COMPANY") and refer to the same entity.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("ONEOK FIELD SERVICES CO., LLC", "ONEOK FIELD SERVICES COMPANY, LLC")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,"ONEOK FIELD SERVICES CO., LLC","ONEOK FIELD SERVICES COMPANY, LLC",0.004529
1,"ONEOK FIELD SERVICES COMPANY, LLC","ONEOK FIELD SERVICES CO., LLC",0.004529


#### This was a real duplicate.
"Inc." and "Incorporated" are stylistic variations.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("Leggett & Platt, Incorporated", "Leggett & Platt, Inc.")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,"Leggett & Platt, Incorporated","Leggett & Platt, Inc.",0.00605
1,"Leggett & Platt, Inc.","Leggett & Platt, Incorporated",0.00605


#### This was a real duplicate.
One name drops "Incorporated", but both refer to the same company.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("Texas Instruments Incorporated", "Texas Instruments")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,Texas Instruments Incorporated,Texas Instruments,0.019594
1,Texas Instruments,Texas Instruments Incorporated,0.019594


#### This was a real duplicate.
The same county name with the word order reversed.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("Jackson County", "County of Jackson")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,Jackson County,County of Jackson,0.073705
1,County of Jackson,Jackson County,0.073705


#### This was a real duplicate.
The names differ by a minor typo or variation ("BCC" vs. "BC") and by legal suffix ("LLC" vs. "Inc."), but in context they refer to the same entity.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("BCC Operating, LLC", "BC Operating, Inc.")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,"BCC Operating, LLC","BC Operating, Inc.",0.078012
1,"BC Operating, Inc.","BCC Operating, LLC",0.078012


#### This was not a duplicate.
They are two different universities in Mississippi.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("University of Mississippi", "Mississippi State University")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,University of Mississippi,Mississippi State University,0.079382
1,Mississippi State University,University of Mississippi,0.079382


#### This was not a duplicate.
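These are two different universities. Only one row is returned because the second name is stored in upper case (`UNIVERSITY OF NEW MEXICO`), so the case-sensitive `in` filter does not match it.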

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("New Mexico State University", "University of New Mexico")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,New Mexico State University,UNIVERSITY OF NEW MEXICO,0.087637


#### This was not a duplicate.
These are two different counties.

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where organization_name in ("Prince George's County", "Prince William County")

Unnamed: 0,organization_name,nearest_neighbor,distance
0,Prince George's County,Prince William County,0.092238
1,Prince William County,Prince George's County,0.092238


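#### A reasonable cutoff is distance <= 0.075, which keeps mostly high-confidence duplicates.
Among the pairs checked above, this cutoff keeps every confirmed duplicate up to Jackson County (0.073705) and excludes all three non-duplicates (0.079382 and above), but it also drops the BCC/BC Operating pair (0.078012).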

In [None]:
%%bigquery
select *
from us_climate_fin.ghg_org_nearest_neighbors
where distance <= 0.075
order by distance

Unnamed: 0,organization_name,nearest_neighbor,distance
0,"EPL Oil and Gas, Inc.","EPL Oil & Gas, Inc.",0.001325
1,"EPL Oil & Gas, Inc.","EPL Oil and Gas, Inc.",0.001325
2,Analog Devices Inc.,"Analog Devices, Inc.",0.001820
3,"Analog Devices, Inc.",Analog Devices Inc.,0.001820
4,"K.P. Kauffman Company, Inc.","K. P. Kauffman Company, Inc.",0.001905
...,...,...,...
1187,PACIFICORP,PacifiCorp,0.074906
1188,PacifiCorp,PACIFICORP,0.074906
1189,LG&E and KU Services Company,LG&E and KU Energy LLC,0.074951
1190,LG&E and KU Energy LLC,LG&E and KU Services Company,0.074951
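
One way to sanity-check a cutoff is to count how the hand-labeled pairs from the spot checks fall on either side of it. The sketch below does that; the distances and labels are copied from the query outputs in this section, while the helper name `evaluate_cutoff` and the short pair labels are made up for illustration.

```python
# (short label, distance, is_duplicate) taken from the spot checks in this section.
labeled_pairs = [
    ("ONEOK FIELD SERVICES", 0.004529, True),
    ("Leggett & Platt", 0.00605, True),
    ("Texas Instruments", 0.019594, True),
    ("Jackson County", 0.073705, True),
    ("BCC / BC Operating", 0.078012, True),
    ("Mississippi universities", 0.079382, False),
    ("New Mexico universities", 0.087637, False),
    ("Prince George's / Prince William", 0.092238, False),
]

def evaluate_cutoff(pairs, cutoff):
    """Count kept duplicates, missed duplicates, and false positives at a cutoff."""
    kept = sum(1 for _, d, dup in pairs if d <= cutoff and dup)
    missed = sum(1 for _, d, dup in pairs if d > cutoff and dup)
    false_pos = sum(1 for _, d, dup in pairs if d <= cutoff and not dup)
    return {"kept": kept, "missed": missed, "false_positives": false_pos}

print(evaluate_cutoff(labeled_pairs, 0.075))
# {'kept': 4, 'missed': 1, 'false_positives': 0}
```

Raising the cutoff to 0.08 would recover the missed pair but also admit the first non-duplicate (0.079382). Eight labeled pairs are far too few to tune a threshold rigorously, so treat this as a quick check rather than a validation.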


### Part 5: Assign unique cluster ids to the pairs of nearest neighbors which fall within our distance threshold (<= 0.075)

In [None]:
import pandas as pd
import pandas_gbq
from google.cloud import bigquery

input_table = "us_climate_fin.ghg_org_nearest_neighbors"
output_table = "us_climate_fin.ghg_org_clusters"

base_query = f"""
    SELECT organization_name, nearest_neighbor
    FROM `{input_table}`
    WHERE distance <= 0.075
"""

bq_client = bigquery.Client()
rows = bq_client.query(base_query).result()

cluster_id = 0
output_clusters = []
unique_ids = set()

for row in rows:
    id1 = row["organization_name"]
    id2 = row["nearest_neighbor"]

    if id1 not in unique_ids and id2 not in unique_ids:
        cluster_id += 1
        output_clusters.append((id1, cluster_id))
        output_clusters.append((id2, cluster_id))
        unique_ids.add(id1)
        unique_ids.add(id2)
        print(f"Assigned {id1} and {id2} to cluster {cluster_id}")

df = pd.DataFrame(output_clusters, columns=["organization_name", "cluster_id"])
pandas_gbq.to_gbq(df, output_table, project_id=project_id, if_exists="replace")

Assigned EPL Oil and Gas, Inc. and EPL Oil & Gas, Inc. to cluster 1
Assigned Analog Devices Inc. and Analog Devices, Inc. to cluster 2
Assigned K.P. Kauffman Company, Inc. and K. P. Kauffman Company, Inc. to cluster 3
Assigned Neenah Paper Inc. and Neenah Paper, Inc. to cluster 4
Assigned PPG Industries, Inc and PPG Industries, Inc. to cluster 5
Assigned Oklahoma Gas & Electric Co. and Oklahoma Gas and Electric Co. to cluster 6
Assigned Chevron USA, Inc. and Chevron USA Inc. to cluster 7
Assigned Louisville Gas & Electric Company and Louisville Gas and Electric Company to cluster 8
Assigned Cargill, Inc and Cargill, Inc. to cluster 9
Assigned Dayton Power & Light and Dayton Power and Light to cluster 10
Assigned Oklahoma Gas and Electric Company and Oklahoma Gas & Electric Company to cluster 11
Assigned Pixelle Specialty Solutions LLC and Pixelle Specialty Solutions, LLC to cluster 12
Assigned Transcontinental Gas Pipe Line Co., LLC and Transcontinental Gas Pipe Line Company, LLC to cl


In [None]:
%%bigquery
select * from us_climate_fin.ghg_org_clusters order by cluster_id

Unnamed: 0,organization_name,cluster_id
0,"EPL Oil and Gas, Inc.",1
1,"EPL Oil & Gas, Inc.",1
2,Analog Devices Inc.,2
3,"Analog Devices, Inc.",2
4,"K.P. Kauffman Company, Inc.",3
...,...,...
1015,Western Washington University,508
1016,PACIFICORP,509
1017,PacifiCorp,509
1018,LG&E and KU Services Company,510
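
The loop in Part 5 assigns a cluster only when neither name has been seen before, so every cluster holds exactly one pair and a chain such as A–B, B–C leaves C unassigned. A possible alternative, sketched below under that observation, merges pairs transitively with a union-find (disjoint-set) structure. The function name `cluster_pairs` and the `Org A/B/C` names are hypothetical.

```python
def cluster_pairs(pairs):
    """Group names into connected components using union-find (disjoint sets)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the components in order of first appearance.
    cluster_ids = {}
    assignments = {}
    for name in parent:
        root = find(name)
        cluster_ids.setdefault(root, len(cluster_ids) + 1)
        assignments[name] = cluster_ids[root]
    return assignments

pairs = [
    ("EPL Oil and Gas, Inc.", "EPL Oil & Gas, Inc."),
    ("Org A", "Org B"),  # made-up names forming a chain A-B-C
    ("Org B", "Org C"),
]
print(cluster_pairs(pairs))
```

Transitive merging is a trade-off: it captures three or more spellings of one name, but a single false-positive pair can chain unrelated organizations into one cluster, and the Part 6 loop would then need to handle clusters with more than two rows.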


### Part 6: Ask the LLM to determine which organization name to keep

In [None]:
%%bigquery
create or replace table us_climate_fin.org_filtered_practice as
  select * from us_climate_fin.ghg_org_clusters where cluster_id in (1, 393)

We start with a small practice table for the LLM, because some clusters, such as `cluster_id` 393, do not refer to the same entity.

In [None]:
%%bigquery
select * from us_climate_fin.org_filtered_practice

Unnamed: 0,organization_name,cluster_id
0,"EPL Oil & Gas, Inc.",1
1,"EPL Oil and Gas, Inc.",1
2,Utah State University,393
3,University of Utah,393


Check the LLM on the small example.

In [None]:
import json, pandas, pandas_gbq
from google.cloud import bigquery
import vertexai
from vertexai.generative_models import GenerativeModel

prompt = """Please verify each organization name.
If they refer to the same entity, return the most accurate organization name in json using the schema {"organization_name" : string}.
For example, {"organization_name": "EPL Oil & Gas, Inc."}
If they refer to different entities, return both organization names in json using the schema
[{"organization_name": "<name1>"}, {"organization_name": "<name2>"}]
For example, [{"organization_name": "University of Utah"}, {"organization_name": "Utah State University"}]
Do not include an explanation with your answer.
"""

sql = """select organization_name, cluster_id
from us_climate_fin.org_filtered_practice
order by cluster_id
"""

def do_inference(input_str):

    print("enter do_inference()")
    print("input_str:", input_str)

    vertexai.init(project=project_id, location=region)
    model = GenerativeModel(gemini_model)
    resp = model.generate_content([input_str, prompt])

    resp_text = resp.text.replace("```json", "").replace("```", "").replace("\n", "")
    print("resp_text:", resp_text)

    results = json.loads(resp_text)

    return results

bq_client = bigquery.Client()
rows = bq_client.query_and_wait(sql)

org_name_pairs_list = []
cluster_id = ""
prev_cluster_id = ""
combined_results = []

for row in rows:

    cluster_id = row['cluster_id']

    row_dict = {}
    row_dict["organization_name"] = row['organization_name']

    org_name_pairs_list.append(json.dumps(row_dict))

    if cluster_id == prev_cluster_id:
        org_name_pairs_str = ",".join(org_name_pairs_list)
        results = do_inference(org_name_pairs_str)
        if isinstance(results, list):
          combined_results.extend(results)
        else:
          combined_results.append(results)
        print(results)
        org_name_pairs_list.clear()

    prev_cluster_id = cluster_id

# collect the results in a DataFrame (this practice run does not write to BQ)
print("combined_results:", combined_results)
df = pandas.DataFrame(combined_results)

enter do_inference()
input_str: {"organization_name": "EPL Oil & Gas, Inc."},{"organization_name": "EPL Oil and Gas, Inc."}
resp_text: {"organization_name": "EPL Oil & Gas, Inc."}
{'organization_name': 'EPL Oil & Gas, Inc.'}
enter do_inference()
input_str: {"organization_name": "Utah State University"},{"organization_name": "University of Utah"}
resp_text: [{"organization_name": "Utah State University"}, {"organization_name": "University of Utah"}]
[{'organization_name': 'Utah State University'}, {'organization_name': 'University of Utah'}]
combined_results: [{'organization_name': 'EPL Oil & Gas, Inc.'}, {'organization_name': 'Utah State University'}, {'organization_name': 'University of Utah'}]
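
The loop above calls `do_inference` when the current row's `cluster_id` equals the previous row's, which works because every cluster contains exactly two rows. A sketch of an alternative that builds the same input string for clusters of any size uses `itertools.groupby`. The rows here are plain dicts standing in for BigQuery rows, and `build_cluster_inputs` is a hypothetical helper name.

```python
import json
from itertools import groupby

def build_cluster_inputs(rows):
    """Yield (cluster_id, input_str) per cluster; rows must be sorted by cluster_id."""
    for cluster_id, group in groupby(rows, key=lambda r: r["cluster_id"]):
        names = [json.dumps({"organization_name": r["organization_name"]}) for r in group]
        yield cluster_id, ",".join(names)

# Stand-in rows mirroring the practice table.
rows = [
    {"organization_name": "EPL Oil & Gas, Inc.", "cluster_id": 1},
    {"organization_name": "EPL Oil and Gas, Inc.", "cluster_id": 1},
    {"organization_name": "Utah State University", "cluster_id": 393},
    {"organization_name": "University of Utah", "cluster_id": 393},
]

for cluster_id, input_str in build_cluster_inputs(rows):
    print(cluster_id, input_str)
```

Each `input_str` could then be passed to `do_inference` in place of `org_name_pairs_str`, which removes the need to track `prev_cluster_id`.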


Now run the LLM over the whole cluster table.

In [None]:
import json, pandas, pandas_gbq
from google.cloud import bigquery
import vertexai
from vertexai.generative_models import GenerativeModel

prompt = """Please verify each organization name.
If they are under the same parent organization, return the closest to the parent organization name in json using the schema {"organization_name" : string}.
For example, {"organization_name": "EPL Oil & Gas, Inc."}
If they refer to different parent organizations, return both organization names in json using the schema
[{"organization_name": "<name1>"}, {"organization_name": "<name2>"}]
For example, [{"organization_name": "University of Utah"}, {"organization_name": "Utah State University"}]
Do not include an explanation with your answer.
"""

sql_create_checkpoint = """
create table if not exists us_climate_fin.cluster_checkpoint (
    cluster_id integer
)
"""
bq_client = bigquery.Client()
bq_client.query(sql_create_checkpoint).result()
print("Checkpoint table is ready.")

sql = """select organization_name, cluster_id
from us_climate_fin.ghg_org_clusters
where cluster_id not in (
  select cluster_id from us_climate_fin.cluster_checkpoint
)
order by cluster_id
"""

def do_inference(input_str):

    print("enter do_inference()")
    print("input_str:", input_str)

    vertexai.init(project=project_id, location=region)
    model = GenerativeModel(gemini_model)
    resp = model.generate_content([input_str, prompt])

    resp_text = resp.text.replace("```json", "").replace("```", "").replace("\n", "")
    print("resp_text:", resp_text)

    results = json.loads(resp_text)

    if isinstance(results, dict):
      return [results]

    return results

rows = bq_client.query_and_wait(sql)

org_name_pairs_list = []
cluster_id = ""
prev_cluster_id = ""
combined_results = []
pending_cluster_ids = []  # clusters whose results are buffered but not yet written
results_table = "us_climate_fin.ghg_org_name_filtered"
checkpoint_table = "us_climate_fin.cluster_checkpoint"

def flush_results():
    # Write buffered results first and checkpoint their clusters afterwards,
    # so an interrupted run never marks a cluster as done before its results are saved.
    if combined_results:
        df = pandas.DataFrame(combined_results)
        pandas_gbq.to_gbq(df, results_table, project_id=project_id, if_exists="append")
        combined_results.clear()
    if pending_cluster_ids:
        ids = ", ".join(str(c) for c in pending_cluster_ids)
        checkpoint_query = f"""
        insert into `{checkpoint_table}` (cluster_id)
        select id from unnest([{ids}]) as id
        """
        bq_client.query(checkpoint_query).result()
        pending_cluster_ids.clear()

for row in rows:

    cluster_id = row['cluster_id']

    row_dict = {}
    row_dict["organization_name"] = row['organization_name']

    org_name_pairs_list.append(json.dumps(row_dict))

    if cluster_id == prev_cluster_id:
        org_name_pairs_str = ",".join(org_name_pairs_list)
        results = do_inference(org_name_pairs_str)

        for r in results:
            r["cluster_id"] = cluster_id

        combined_results.extend(results)
        pending_cluster_ids.append(cluster_id)
        print(results)
        org_name_pairs_list.clear()

        if len(combined_results) >= 50:
            flush_results()

    prev_cluster_id = cluster_id

# write any remaining results and checkpoints
flush_results()


Checkpoint table is ready.
enter do_inference()
input_str: {"organization_name": "EPL Oil and Gas, Inc."},{"organization_name": "EPL Oil & Gas, Inc."}
resp_text: {"organization_name": "EPL Oil & Gas, Inc."}
[{'organization_name': 'EPL Oil & Gas, Inc.', 'cluster_id': 1}]
enter do_inference()
input_str: {"organization_name": "Analog Devices Inc."},{"organization_name": "Analog Devices, Inc."}
resp_text: {"organization_name": "Analog Devices, Inc."}
[{'organization_name': 'Analog Devices, Inc.', 'cluster_id': 2}]
enter do_inference()
input_str: {"organization_name": "K.P. Kauffman Company, Inc."},{"organization_name": "K. P. Kauffman Company, Inc."}
resp_text: {"organization_name": "K.P. Kauffman Company, Inc."}
[{'organization_name': 'K.P. Kauffman Company, Inc.', 'cluster_id': 3}]
enter do_inference()
input_str: {"organization_name": "Neenah Paper Inc."},{"organization_name": "Neenah Paper, Inc."}
resp_text: {"organization_name": "Neenah Paper, Inc."}
[{'organization_name': 'Neenah Pape


enter do_inference()
input_str: {"organization_name": "Kinder Morgan Texas Pipeline, LLC"},{"organization_name": "Kinder Morgan Texas Pipeline LLC"}
resp_text: {"organization_name": "Kinder Morgan Texas Pipeline LLC"}
[{'organization_name': 'Kinder Morgan Texas Pipeline LLC', 'cluster_id': 51}]
enter do_inference()
input_str: {"organization_name": "Hi-Crush Inc"},{"organization_name": "Hi-Crush Inc."}
resp_text: {"organization_name": "Hi-Crush Inc."}
[{'organization_name': 'Hi-Crush Inc.', 'cluster_id': 52}]
enter do_inference()
input_str: {"organization_name": "OFS GPRP LLC"},{"organization_name": "OFS GPRP, LLC"}
resp_text: {"organization_name": "OFS GPRP, LLC"}
[{'organization_name': 'OFS GPRP, LLC', 'cluster_id': 53}]
enter do_inference()
input_str: {"organization_name": "Kinder Morgan Tejas Pipeline, LLC"},{"organization_name": "Kinder Morgan Tejas Pipeline LLC"}
resp_text: {"organization_name": "Kinder Morgan Tejas Pipeline LLC"}
[{'organization_name': 'Kinder Morgan Tejas Pipeli


enter do_inference()
input_str: {"organization_name": "Occidental Petroleum"},{"organization_name": "Occidental Petroleum Corporation"}
resp_text: {"organization_name": "Occidental Petroleum Corporation"}
[{'organization_name': 'Occidental Petroleum Corporation', 'cluster_id': 101}]
enter do_inference()
input_str: {"organization_name": "COX OIL OFFSHORE, LLC"},{"organization_name": "COX OIL OFFSHORE, L.L.C."}
resp_text: {"organization_name": "COX OIL OFFSHORE, LLC"}
[{'organization_name': 'COX OIL OFFSHORE, LLC', 'cluster_id': 102}]
enter do_inference()
input_str: {"organization_name": "PANHANDLE EASTERN PIPE LINE COMPANY"},{"organization_name": "PANHANDLE EASTERN PIPE LINE CO"}
resp_text: {"organization_name": "PANHANDLE EASTERN PIPE LINE COMPANY"}
[{'organization_name': 'PANHANDLE EASTERN PIPE LINE COMPANY', 'cluster_id': 103}]
enter do_inference()
input_str: {"organization_name": "Alcoa Corporation"},{"organization_name": "Alcoa Inc."}
resp_text: {"organization_name": "Alcoa Corpora


enter do_inference()
input_str: {"organization_name": "American Zinc Recycling Corp."},{"organization_name": "American Zinc Recycling Corporation"}
resp_text: {"organization_name": "American Zinc Recycling Corporation"}
[{'organization_name': 'American Zinc Recycling Corporation', 'cluster_id': 151}]
enter do_inference()
input_str: {"organization_name": "Energy Transfer Operating, LP"},{"organization_name": "Energy Transfer Operating, L.P."}
resp_text: {"organization_name": "Energy Transfer Operating, L.P."}
[{'organization_name': 'Energy Transfer Operating, L.P.', 'cluster_id': 152}]
enter do_inference()
input_str: {"organization_name": "U.S. Navy"},{"organization_name": "United States Navy"}
resp_text: {"organization_name": "United States Navy"}
[{'organization_name': 'United States Navy', 'cluster_id': 153}]
enter do_inference()
input_str: {"organization_name": "Rain CII CARBON LLC"},{"organization_name": "RAIN CII CARBON LLC"}
resp_text: {"organization_name": "Rain CII CARBON LLC"}


enter do_inference()
input_str: {"organization_name": "Mohawk Industries, Inc."},{"organization_name": "Mohawk Industries"}
resp_text: {"organization_name": "Mohawk Industries, Inc."}
[{'organization_name': 'Mohawk Industries, Inc.', 'cluster_id': 200}]
enter do_inference()
input_str: {"organization_name": "Crusoe Energy Systems, Inc."},{"organization_name": "Crusoe Energy Systems LLC"}
resp_text: {"organization_name": "Crusoe Energy Systems LLC"}
[{'organization_name': 'Crusoe Energy Systems LLC', 'cluster_id': 201}]
enter do_inference()
input_str: {"organization_name": "TYSON FARMS, INC."},{"organization_name": "TYSON FARMS INC"}
resp_text: {"organization_name": "TYSON FARMS, INC."}
[{'organization_name': 'TYSON FARMS, INC.', 'cluster_id': 202}]
enter do_inference()
input_str: {"organization_name": "Superior Industries International, Inc."},{"organization_name": "Superior Industries International"}
resp_text: {"organization_name": "Superior Industries International, Inc."}
[{'organiz


enter do_inference()
input_str: {"organization_name": "Casella Waste Systems"},{"organization_name": "Casella Waste Systems, Inc."}
resp_text: {"organization_name": "Casella Waste Systems, Inc."}
[{'organization_name': 'Casella Waste Systems, Inc.', 'cluster_id': 250}]
enter do_inference()
input_str: {"organization_name": "Holcim"},{"organization_name": "Holcim Group"}
resp_text: {"organization_name": "Holcim Group"}
[{'organization_name': 'Holcim Group', 'cluster_id': 251}]
enter do_inference()
input_str: {"organization_name": "INVISTA S.A.R.L."},{"organization_name": "INVISTA S.\u00e0 r.l."}
resp_text: {"organization_name": "INVISTA S.à r.l."}
[{'organization_name': 'INVISTA S.à r.l.', 'cluster_id': 252}]
enter do_inference()
input_str: {"organization_name": "Alliant Energy Corporation"},{"organization_name": "Alliant Energy"}
resp_text: {"organization_name": "Alliant Energy Corporation"}
[{'organization_name': 'Alliant Energy Corporation', 'cluster_id': 253}]
enter do_inference()
in

100%|██████████| 1/1 [00:00<00:00, 9279.43it/s]


enter do_inference()
input_str: {"organization_name": "CITGO Petroleum Corporation"},{"organization_name": "Citgo Petroleum Corporation"}
resp_text: {"organization_name": "CITGO Petroleum Corporation"}
[{'organization_name': 'CITGO Petroleum Corporation', 'cluster_id': 300}]
enter do_inference()
input_str: {"organization_name": "Legacy Reserves Operating GP, LLC"},{"organization_name": "Legacy Reserves Operating LP"}
resp_text: {"organization_name": "Legacy Reserves Operating LP"}
[{'organization_name': 'Legacy Reserves Operating LP', 'cluster_id': 301}]
enter do_inference()
input_str: {"organization_name": "Phillips 66"},{"organization_name": "Phillips 66 Company"}
resp_text: {"organization_name": "Phillips 66 Company"}
[{'organization_name': 'Phillips 66 Company', 'cluster_id': 302}]
enter do_inference()
input_str: {"organization_name": "Knauf Group"},{"organization_name": "Knauf"}
resp_text: {"organization_name": "Knauf Group"}
[{'organization_name': 'Knauf Group', 'cluster_id': 303

100%|██████████| 1/1 [00:00<00:00, 8507.72it/s]


enter do_inference()
input_str: {"organization_name": "Kaiser Aluminum Corporation"},{"organization_name": "Kaiser Aluminum"}
resp_text: {"organization_name": "Kaiser Aluminum Corporation"}
[{'organization_name': 'Kaiser Aluminum Corporation', 'cluster_id': 350}]
enter do_inference()
input_str: {"organization_name": "Perdue Farms Inc."},{"organization_name": "Perdue Farms"}
resp_text: {"organization_name": "Perdue Farms Inc."}
[{'organization_name': 'Perdue Farms Inc.', 'cluster_id': 351}]
enter do_inference()
input_str: {"organization_name": "Martin Marietta Materials"},{"organization_name": "Martin Marietta Materials, Inc."}
resp_text: {"organization_name": "Martin Marietta Materials, Inc."}
[{'organization_name': 'Martin Marietta Materials, Inc.', 'cluster_id': 352}]
enter do_inference()
input_str: {"organization_name": "Total Petrochemicals & Refining USA, Inc."},{"organization_name": "TOTAL PETROCHEMICALS & REFINING USA, INC."}
resp_text: {"organization_name": "Total Petrochemical

100%|██████████| 1/1 [00:00<00:00, 12483.05it/s]


enter do_inference()
input_str: {"organization_name": "Trunkline Gas Company, LLC"},{"organization_name": "Trunkline Gas Company"}
resp_text: {"organization_name": "Trunkline Gas Company, LLC"}
[{'organization_name': 'Trunkline Gas Company, LLC', 'cluster_id': 398}]
enter do_inference()
input_str: {"organization_name": "ARCHER DANIELS MIDLAND"},{"organization_name": "ARCHER DANIELS MIDLAND COMPANY"}
resp_text: {"organization_name": "ARCHER DANIELS MIDLAND COMPANY"}
[{'organization_name': 'ARCHER DANIELS MIDLAND COMPANY', 'cluster_id': 399}]
enter do_inference()
input_str: {"organization_name": "Arizona Public Service (APS)"},{"organization_name": "Arizona Public Service"}
resp_text: {"organization_name": "Arizona Public Service"}
[{'organization_name': 'Arizona Public Service', 'cluster_id': 400}]
enter do_inference()
input_str: {"organization_name": "Eastern Gas Transmission and Storage, Inc."},{"organization_name": "Eastern Gas Transmission and Storage"}
resp_text: {"organization_nam

100%|██████████| 1/1 [00:00<00:00, 7384.34it/s]


enter do_inference()
input_str: {"organization_name": "Michelin Group"},{"organization_name": "Michelin"}
resp_text: {"organization_name": "Michelin Group"}
[{'organization_name': 'Michelin Group', 'cluster_id': 446}]
enter do_inference()
input_str: {"organization_name": "Black Hills Power and Light Company"},{"organization_name": "Black Hills Power, Inc."}
resp_text: {"organization_name": "Black Hills Power, Inc."}
[{'organization_name': 'Black Hills Power, Inc.', 'cluster_id': 447}]
enter do_inference()
input_str: {"organization_name": "RAYONIER PERFORMANCE FIBERS LLC"},{"organization_name": "Rayonier Performance Fibers LLC"}
resp_text: {"organization_name": "Rayonier Performance Fibers LLC"}
[{'organization_name': 'Rayonier Performance Fibers LLC', 'cluster_id': 448}]
enter do_inference()
input_str: {"organization_name": "SunCoke Energy, Inc."},{"organization_name": "SunCoke Energy"}
resp_text: {"organization_name": "SunCoke Energy, Inc."}
[{'organization_name': 'SunCoke Energy, Inc

100%|██████████| 1/1 [00:00<00:00, 6069.90it/s]


enter do_inference()
input_str: {"organization_name": "Wisconsin Electric Power Company"},{"organization_name": "Wisconsin Power and Light Company"}
resp_text: {"organization_name": "WEC Energy Group"}
[{'organization_name': 'WEC Energy Group', 'cluster_id': 496}]
enter do_inference()
input_str: {"organization_name": "Evergy, Inc."},{"organization_name": "Evergy"}
resp_text: {"organization_name": "Evergy, Inc."}
[{'organization_name': 'Evergy, Inc.', 'cluster_id': 497}]
enter do_inference()
input_str: {"organization_name": "Northwest Pipeline GP"},{"organization_name": "NORTHWEST PIPELINE GP"}
resp_text: {"organization_name": "Northwest Pipeline GP"}
[{'organization_name': 'Northwest Pipeline GP', 'cluster_id': 498}]
enter do_inference()
input_str: {"organization_name": "Grede LLC"},{"organization_name": "Grede Holdings LLC"}
resp_text: {"organization_name": "Grede Holdings LLC"}
[{'organization_name': 'Grede Holdings LLC', 'cluster_id': 499}]
enter do_inference()
input_str: {"organiza

100%|██████████| 1/1 [00:00<00:00, 754.51it/s]


In [None]:
%%bigquery
alter table `us_climate_fin.ghg_org_name_filtered`
drop column cluster_id


In [None]:
%%bigquery
select * from us_climate_fin.ghg_org_name_filtered


Unnamed: 0,organization_name
0,Kaiser Aluminum Corporation
1,Perdue Farms Inc.
2,"Martin Marietta Materials, Inc."
3,"Total Petrochemicals & Refining USA, Inc."
4,JR SIMPLOT COMPANY
...,...
511,"Waste Management, Inc."
512,Western Sugar Cooperative
513,"Williams Field Services Company, LLC"
514,XTO ENERGY INC


Compute the list of organization names to discard

In [None]:
%%bigquery
create or replace table us_climate_fin.ghg_org_name_discard as
    select organization_name from us_climate_fin.ghg_org_clusters
    except distinct
    select organization_name from us_climate_fin.ghg_org_name_filtered


In [None]:
%%bigquery
select * from us_climate_fin.ghg_org_name_discard


Unnamed: 0,organization_name
0,E & J Gallo Winery
1,COLUMBIA GULF TRANSMISSION CO.
2,"Starwood Energy Group Global, L.L.C."
3,TRANSCONTINENTAL GAS PIPE LINE CO LLC
4,WISE ALLOYS LLC
...,...
509,"Formosa Plastics Corporation, U.S.A."
510,Duke Energy
511,Solvay
512,PepsiCo
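
As a rough illustration of what the `except distinct` query computes, the sketch below mirrors it with Python sets. The two lists are hypothetical in-memory stand-ins for `ghg_org_clusters` and `ghg_org_name_filtered`, filled with a few names taken from the inference log above; the helper name `names_to_discard` is an assumption, not part of the notebook.

```python
# Hypothetical stand-ins for the two BigQuery tables (a few rows only).
clusters = [
    {"organization_name": "Holcim", "cluster_id": 251},
    {"organization_name": "Holcim Group", "cluster_id": 251},
    {"organization_name": "Knauf Group", "cluster_id": 303},
    {"organization_name": "Knauf", "cluster_id": 303},
]
filtered = [
    {"organization_name": "Holcim Group"},
    {"organization_name": "Knauf Group"},
]


def names_to_discard(cluster_rows, kept_rows):
    """Return cluster member names that were not kept (set difference)."""
    kept = {row["organization_name"] for row in kept_rows}
    members = {row["organization_name"] for row in cluster_rows}
    # Like EXCEPT DISTINCT: unique names in `members` that are absent from `kept`.
    return sorted(members - kept)


discard = names_to_discard(clusters, filtered)
print(discard)
```

Every name that appears in a cluster but was not chosen as the representative ends up in the discard list.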


### Part 7: Construct the final GHG organization table

In [None]:
%%bigquery
create or replace table us_climate_fin.ghg_organization_final as
    select * from us_climate_fin.ghg_org_names_raw
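    -- The IS NOT NULL guard keeps NOT IN from filtering out every row if the discard table ever contains a NULL.
    where organization_name not in (
        select organization_name from us_climate_fin.ghg_org_name_discard
        where organization_name is not null)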


Check for exact duplicates. Note that this query only catches identical strings, so it is not a reliable test for true duplicates in this data: our duplicates usually differ slightly in casing, punctuation, or suffixes.

In [None]:
%%bigquery
select organization_name, count(*) as count
from us_climate_fin.ghg_organization_final
group by organization_name
having count(*) > 1


Unnamed: 0,organization_name,count
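
Because the exact-match query above cannot see near-duplicates, a looser check may be useful as a sanity test. The sketch below groups names by a normalized key (case-folded, punctuation removed, trailing legal suffixes dropped). The suffix list and the helper names `normalize` and `near_duplicates` are assumptions made for illustration; this is a heuristic, not the method used in the notebook.

```python
import re
from collections import defaultdict

# Assumed, deliberately short list of trailing legal suffixes to ignore.
LEGAL_SUFFIXES = {"inc", "llc", "lp", "co", "corp", "corporation", "company"}


def normalize(name: str) -> str:
    """Case-fold, replace punctuation with spaces, and drop trailing suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.casefold()).split()
    # Keep at least one token so a name is never reduced to an empty key.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def near_duplicates(names):
    """Map each normalized key to the distinct raw names that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


sample = ["TYSON FARMS, INC.", "TYSON FARMS INC", "Evergy, Inc.", "Phillips 66"]
dupes = near_duplicates(sample)
print(dupes)
```

Run against the names in `ghg_organization_final`, any non-empty result would point at variants that survived the clustering step.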


### Part 8: Conclusion

We built an embedding-based approach to normalize GHG organization names in our dataset. The pipeline first generates an embedding for each name and computes the cosine distance between pairs of names. We kept the pairs with a cosine distance ≤ 0.075 and grouped them into clusters, each assumed to represent a single entity.
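
The pairing and grouping step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the toy three-dimensional vectors are invented, the real embeddings come from the embedding model, and merging pairs with a union-find structure is one possible way to form clusters rather than a description of how the notebook does it.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_names(embeddings, threshold=0.075):
    """Group names whose pairwise distance is <= threshold (union-find)."""
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Walk up to the root, halving the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if cosine_distance(embeddings[a], embeddings[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Singletons have no duplicate candidates, so only multi-name groups are returned.
    return [sorted(members) for members in clusters.values() if len(members) > 1]


# Invented toy vectors, for illustration only.
toy = {
    "Holcim": [1.0, 0.0, 0.1],
    "Holcim Group": [0.98, 0.05, 0.12],
    "Knauf": [0.0, 1.0, 0.0],
}
print(cluster_names(toy))
```

Note that union-find merges transitively: if A is close to B and B is close to C, all three land in one cluster even when A and C are farther apart than the threshold.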

To resolve which organization name to retain from each cluster, we used a large language model (LLM) to evaluate the names and return the most representative one. This method allowed us to capture duplicates that traditional string-matching techniques would have missed, especially when variations involved abbreviations, suffixes, or inconsistent formatting. The final output was a cleaned list of distinct organization names suitable for downstream analysis.
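
One thing the inference log shows is that the model does not always return one of the names it was given: for the pair "Wisconsin Electric Power Company" / "Wisconsin Power and Light Company" the response was "WEC Energy Group". If such a response is stored as-is, neither input name would match it later. The sketch below is one possible guard; the function name `pick_representative` and the choice to return `None` for manual review are assumptions, not part of the notebook.

```python
import json


def pick_representative(members, resp_text):
    """Return the model's choice only if it is one of the cluster members.

    Returns None when the response is unparsable or names something outside
    the cluster, so that the cluster can be reviewed by hand.
    """
    try:
        chosen = json.loads(resp_text).get("organization_name")
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Malformed JSON, a non-object payload, or a missing response.
        return None
    return chosen if chosen in members else None


accepted = pick_representative(
    ["Holcim", "Holcim Group"],
    '{"organization_name": "Holcim Group"}',
)
rejected = pick_representative(
    ["Wisconsin Electric Power Company", "Wisconsin Power and Light Company"],
    '{"organization_name": "WEC Energy Group"}',
)
print(accepted, rejected)
```

A rejected cluster is also a hint that the two names may not be the same entity at all, which is the false-positive case the distance threshold can produce.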

Previously, we used an LLM to normalize organization names directly, inferring the canonical form from the raw inputs. That approach was effective in many cases, especially for subtle naming variations. In this project, we explored a complementary method: using text embeddings and cosine distance to pair organization names that are likely duplicates, so that similar names are grouped before the LLM is invoked. However, to achieve reasonable recall we had to use a relatively large distance threshold, which introduced some false positives while still missing a few known duplicates.

Moving forward, we plan to link these normalized organization names back to their corresponding facility IDs and integrate them into emissions datasets.
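
The threshold trade-off noted in the conclusion could be examined with a small labeled sample. The sketch below computes precision and recall of the "distance ≤ threshold" rule at a few candidate thresholds. All pair names, distances, and labels are invented placeholders, and `precision_recall` is a hypothetical helper; real values would have to come from the pair table and a hand-labeled set of known duplicates.

```python
def precision_recall(pair_distances, known_duplicates, threshold):
    """Score the rule 'distance <= threshold means duplicate' against labels."""
    predicted = {pair for pair, dist in pair_distances.items() if dist <= threshold}
    true_pos = len(predicted & known_duplicates)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(known_duplicates) if known_duplicates else 1.0
    return precision, recall


# Invented placeholder pairs and distances, for illustration only.
pair_distances = {
    ("org_a", "org_a_variant"): 0.03,
    ("org_b", "org_b_variant"): 0.06,
    ("org_c", "unrelated_org"): 0.07,   # close in embedding space, different entity
    ("org_d", "org_d_variant"): 0.09,   # true duplicate beyond 0.075
}
known_duplicates = {
    ("org_a", "org_a_variant"),
    ("org_b", "org_b_variant"),
    ("org_d", "org_d_variant"),
}

for threshold in (0.05, 0.075, 0.10):
    precision, recall = precision_recall(pair_distances, known_duplicates, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy sample, raising the threshold recovers more true duplicates but also admits the unrelated pair, which mirrors the behavior described above.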