# Who Owns the Large Buildings in Seattle?

## Problem

The GHGE dataset does not include buildings' owners. Scraping the eRealProperty website for building owners has two limitations:

1. The data quality is poor and many buildings don't have an owner listed.
1. Many corporations with multiple properties set up a separate LLC for each building. There is no straightforward way to trace a child corporation to its parent corporation, which obscures each company's portfolio size.

We will use the Corporations and Charities Filings System from the Secretary of State to figure this out. The basic process is:
    
1. Start with a company name.
1. Find that company's official name in CCFS.
1. Collect the names of that company's principals/governors.
1. Collect all the businesses with those same governors.
1. Have a human review which companies are connected, based on the number of overlapping governors, ID number, address, name, etc.
1. Profit?
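
As a rough illustration of the review step, the sketch below counts shared governors between every pair of companies. The company and governor names are made up, and `governor_overlap` is a hypothetical helper, not part of the notebook's `util` module.

```python
from itertools import combinations

def governor_overlap(companies):
    """Return {(company_a, company_b): shared_governors} for pairs sharing at least one governor."""
    overlaps = {}
    # Sort by company name so each pair is reported in a stable order
    for (name_a, govs_a), (name_b, govs_b) in combinations(sorted(companies.items()), 2):
        shared = set(govs_a) & set(govs_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Made-up companies and governors, for illustration only
companies = {
    "EXAMPLE TOWER LLC": {"PAT EXAMPLE", "SAM SAMPLE"},
    "EXAMPLE PLAZA LLC": {"PAT EXAMPLE", "SAM SAMPLE", "LEE PLACEHOLDER"},
    "UNRELATED HOLDINGS LLC": {"ALEX OTHER"},
}
overlaps = governor_overlap(companies)
for pair, shared in overlaps.items():
    print(pair, len(shared))
```

A pair with several shared governors is a stronger candidate for human review than a pair sharing only one, since a single shared name may just be a common name.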

The CCFS does not have a public API. The endpoints we use appear in the utility functions defined in each step.

Our most up-to-date list of building owners, with owner names normalized (e.g., "City of Seattle" and "Seattle City" are both normalized to "City of Seattle"), can be found [here](https://github.com/linnealovespie/BPS/tree/dig_into_owners/experiments/worst_offenders#:~:text=updated_owners_2_15_23.csv).
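
One way such a normalization could work is a small alias table, sketched below. `OWNER_ALIASES` and `normalize_owner` are hypothetical names used for illustration; the actual normalization behind the linked CSV may differ.

```python
# Hypothetical alias table: variant spelling (uppercased) -> canonical owner name
OWNER_ALIASES = {
    "SEATTLE CITY": "City of Seattle",
    "SEATTLE CITY OF": "City of Seattle",
    "CITY OF SEATTLE": "City of Seattle",
}

def normalize_owner(raw_name):
    """Map known variants to one canonical name; leave unknown names unchanged apart from trimming."""
    key = " ".join(raw_name.upper().split())  # uppercase and collapse repeated whitespace
    return OWNER_ALIASES.get(key, raw_name.strip())

print(normalize_owner("Seattle  City "))
print(normalize_owner(" GATEWAY KING LLC "))
```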

In [2]:
import pandas as pd
import numpy as np
import requests
from fuzzywuzzy import fuzz
import json
import os
import re
import geopandas as gp
import util



## Step 0: Isolate to D1 Landlords
1. Get all buildings in D1 (the code below filters on the `DOWNTOWN` neighborhood).
2. Look up their tax parcel IDs.
3. Map each tax parcel ID to a landlord.
4. Aggregate landlords in D1.
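
The mapping and aggregation steps can be sketched without pandas as follows. This is a minimal illustration with made-up parcel IDs and emissions; `aggregate_landlords` is a hypothetical helper, and the notebook cells below do the same thing with `DataFrame.map` and `groupby`.

```python
from collections import defaultdict

def aggregate_landlords(buildings, parcel_owners):
    """Sum emissions and count buildings per landlord, skipping buildings with no known owner."""
    totals = defaultdict(lambda: {"NumBuildings": 0, "TotalGHGEmissions": 0.0})
    for building in buildings:
        landlord = parcel_owners.get(building["TaxParcelIdentificationNumber"], "")
        if not landlord:
            continue  # parcel has no owner on file
        totals[landlord]["NumBuildings"] += 1
        totals[landlord]["TotalGHGEmissions"] += building["TotalGHGEmissions"]
    return dict(totals)

# Made-up parcel IDs, owners and emissions, for illustration only
parcel_owners = {"111": "EXAMPLE ONE LLC", "222": "EXAMPLE ONE LLC", "333": "EXAMPLE TWO LLC"}
buildings = [
    {"TaxParcelIdentificationNumber": "111", "TotalGHGEmissions": 10.0},
    {"TaxParcelIdentificationNumber": "222", "TotalGHGEmissions": 5.5},
    {"TaxParcelIdentificationNumber": "333", "TotalGHGEmissions": 2.0},
    {"TaxParcelIdentificationNumber": "999", "TotalGHGEmissions": 7.0},  # no owner on file
]
landlords = aggregate_landlords(buildings, parcel_owners)
print(landlords)
```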

In [3]:
df_districts = gp.read_file("../data/Council_Districts.geojson")
df = pd.read_csv('../data/2020_Building_Energy_Benchmarking.csv')
df = gp.GeoDataFrame(df, geometry=gp.points_from_xy(df.Longitude, df.Latitude))
util.clean_districts(df, df_districts)

Building WATERWORKS OFFICE & MARINA 2353/ 4088803975 doesn't have a district POINT (-122.33895 47.63575) 
	 Found district 7 for WATERWORKS OFFICE & MARINA
Building NAUTICAL LANDING 2381/ 4088804350 doesn't have a district POINT (-122.34219 47.64306) 
	 Found district 7 for NAUTICAL LANDING
Building UNION HARBOR CONDOMINIUM 2540/ 8807200000 doesn't have a district POINT (-122.33003 47.6401) 
	 Found district 4 for UNION HARBOR CONDOMINIUM
Building THE PIER AT LESCHI 2997/ 6780900000 doesn't have a district POINT (-122.28563 47.59926) 
	 Found district 3 for THE PIER AT LESCHI
Building THE LAKESHORE 3046/ 1180001715 doesn't have a district POINT EMPTY 
Building EDUCARE 3218/ 2895800030 doesn't have a district POINT EMPTY 


In [195]:
df1=df.loc[df['Neighborhood']=="DOWNTOWN"]#df.loc[((df["Neighborhood"]=="DOWNTOWN") | (df["Neighborhood"]=="LAKE UNION"))]

In [196]:
df1.head()

Unnamed: 0,OSEBuildingID,DataYear,BuildingName,BuildingType,TaxParcelIdentificationNumber,Address,City,State,ZipCode,Latitude,...,Electricity(kWh),SteamUse(kBtu),NaturalGas(therms),ComplianceStatus,ComplianceIssue,Electricity(kBtu),NaturalGas(kBtu),TotalGHGEmissions,GHGEmissionsIntensity,geometry
0,1,2020,MAYFLOWER PARK HOTEL,NonResidential,659000030,405 OLIVE WAY,SEATTLE,WA,98101.0,47.6122,...,801392,1457837,6326,Compliant,No Issue,2734351.0,632586.0,169.1,1.9,POINT (-122.33799 47.61220)
1,2,2020,PARAMOUNT HOTEL,NonResidential,659000220,724 PINE ST,SEATTLE,WA,98101.0,47.61317,...,568667,0,16614,Compliant,No Issue,1940292.0,1661402.0,98.6,1.1,POINT (-122.33393 47.61317)
2,3,2020,WESTIN HOTEL (Parent Building),NonResidential,659000475,1900 5TH AVE,SEATTLE,WA,98101.0,47.61367,...,7478716,10359896,8955,Compliant,No Issue,25517379.0,895500.0,1043.2,1.4,POINT (-122.33822 47.61367)
3,5,2020,HOTEL MAX,NonResidential,659000640,620 STEWART ST,SEATTLE,WA,98101.0,47.61412,...,345231,917724,8871,Compliant,No Issue,1177927.0,887059.0,129.6,2.1,POINT (-122.33664 47.61412)
4,8,2020,WARWICK SEATTLE HOTEL,NonResidential,659000970,401 LENORA ST,SEATTLE,WA,98121.0,47.61375,...,1102452,0,46034,Compliant,No Issue,3761566.0,4603411.0,264.5,2.3,POINT (-122.34047 47.61375)


In [197]:
parcels = pd.read_csv("../data/final_parcels.csv")
# Build a lookup from tax parcel ID to owner name
d = pd.Series(parcels.Owner.values, index=parcels.TaxParcelIdentificationNumber).to_dict()
# df1 is a slice of df, so take an explicit copy before adding a column (avoids SettingWithCopyWarning)
df1 = df1.copy()
# Map tax ids to landlord name; parcels without a known owner get an empty string
df1['Landlord'] = df1['TaxParcelIdentificationNumber'].map(lambda parcel_id: d.get(parcel_id, ""))


In [198]:
landlords=df1.groupby("Landlord")[['TotalGHGEmissions','PropertyGFATotal']]
landlords.size()
df1.value_counts(subset=["Landlord"])

Landlord                     
                                 239
SEATTLE CITY OF                    6
NOT FOUND                          4
GATEWAY KING LLC                   3
HUDSON MERRILL PLACE L L C         2
                                ... 
FANA DOYLE LLC                     1
FG92 LENORA LLC+CWS LENORA LP      1
FIFTH & PINE LLC                   1
FIRST & LENORA MIX USE L L C       1
YUAN PAUL J L+SOPHIA Y             1
Length: 236, dtype: int64

In [199]:
df1_landlords = landlords.sum()
df1_landlords["NumUnits"] = landlords.size()
df1_landlords=df1_landlords.drop("")

In [200]:
df1_landlords = df1_landlords.reset_index()

In [201]:
df1_landlords.to_csv("../data/downtown_landlords.csv")

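## Step 1

Find each landlord's official name in CCFS. Owner names in the parcel data often differ slightly from the names registered in CCFS, so both are standardized before they are compared. For example:

| Owner name in parcel data | Name in CCFS |
| --- | --- |
| CRE WINSTON LLC | CRE WINSTON, LLC |
| TESHOME FAMILY LLC | TESHOME FAMILY LLC |
| TRIAD PIER 70 L L C | TRIAD PIER 70 L.L.C. |
| BOREN DEVELOPMENT LLC | BOREN DEVELOPMENT, LIMITED LIABILITY COMPANY |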

In [40]:
result = {
    "BusinessName": "TRIAD PIER 70 L L C", 
    "SearchEntityName": "TRIAD PIER 70 L.L.C."
}

# Raw string so that \s reaches the regex engine unchanged
p = re.compile(r"L[\s.]?L[\s,.]?[PC]", flags=re.IGNORECASE)
p.match("LLC")

result['BusinessName']=result["BusinessName"].replace(",", "")
result['BusinessName']= re.sub(p, "LLC", result['BusinessName'])
result['BusinessName']=result["BusinessName"].replace("LIMITED LIABILITY COMPANY", "LLC") 
result['BusinessName']=result["BusinessName"].replace("LIMITED PARTNERSHIP", "LLC") 
# TODO: keep original business name and add a column that's standardizedname


# Do the same for the search term, so that we have more exact matches
result["SearchEntityName"]=result["SearchEntityName"].replace(",", "").replace(".", "") 
# result["SearchEntityName"]=result["SearchEntityName"].replace("LLP", "LLC") 
result['SearchEntityName']=re.sub(p, "LLC", result['SearchEntityName'])
result["SearchEntityName"]=result["SearchEntityName"].replace("LIMITED LIABILITY COMPANY", "LLC") 
result["SearchEntityName"]=result["SearchEntityName"].replace("LIMITED PARTNERSHIP", "LLC") 
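
The same standardization can be wrapped in one reusable function, sketched below. This is a variant rather than the notebook's exact logic: it adds word boundaries (`\b`) so the pattern cannot fire in the middle of a word, swallows one trailing period, and uppercases the input. Like the cell above, it maps LLP and "LIMITED PARTNERSHIP" to "LLC" as well. `standardize_name` and `SUFFIX_PATTERN` are hypothetical names.

```python
import re

# LLC/LLP written with optional spaces, dots or commas between the letters,
# plus one optional trailing dot (e.g. "L L C", "L.L.C.", "LLP")
SUFFIX_PATTERN = re.compile(r"\bL[\s.]?L[\s,.]?[PC]\b\.?", flags=re.IGNORECASE)

def standardize_name(name):
    """Return an uppercase name with commas removed and the entity suffix collapsed to 'LLC'."""
    name = name.upper().replace(",", "")
    name = SUFFIX_PATTERN.sub("LLC", name)
    name = name.replace("LIMITED LIABILITY COMPANY", "LLC")
    name = name.replace("LIMITED PARTNERSHIP", "LLC")
    return " ".join(name.split())  # collapse repeated whitespace

pairs = [
    ("CRE WINSTON LLC", "CRE WINSTON, LLC"),
    ("TRIAD PIER 70 L L C", "TRIAD PIER 70 L.L.C."),
    ("BOREN DEVELOPMENT LLC", "BOREN DEVELOPMENT, LIMITED LIABILITY COMPANY"),
]
for parcel_name, ccfs_name in pairs:
    print(standardize_name(parcel_name) == standardize_name(ccfs_name))
```

Keeping the original name in a separate column, as the TODO above suggests, would let reviewers see both the raw and the standardized spelling.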

In [237]:
# Utils for finding principals

search_for_business_url = 'https://cfda.sos.wa.gov/api/BusinessSearch/GetBusinessSearchList'

def get_business_search_payload(business_name):
    return {
        'Type': 'BusinessName',
        'SearchType': 'BusinessName',
        'SearchEntityName': business_name,
        'SortType': 'ASC',
        'SortBy': 'Entity Name',
        'SearchValue': business_name,
        'SearchCriteria': 'Contains',
        'IsSearch': 'true',
        'PageID': 1,
        'PageCount': 25,
    }

def get_business_search_results(business_name):
    r = requests.post(search_for_business_url, get_business_search_payload(business_name))
    return json.loads(r.text)

# Given one JSON element `result` in the list of search results, standardize
# the business name and address to collapse results into one
def standardize_result(search_term, result):
    # Don't care about the result if it doesn't have an "active" status
    if result["Status"] != "ACTIVE":
        return None

    # LLC, LLP, L L C, L.L.C., L.L.P.
    # Limited Partnership, Limited Liability Company
    # Comma before any of the above
    # Just map all the results to be standardized to this name, then drop duplicates based on name?
    p = re.compile(r"L[\s.]?L[\s,.]?[PC]", flags=re.IGNORECASE)

    result['BusinessName'] = result["BusinessName"].replace(",", "")
    result['BusinessName'] = re.sub(p, "LLC", result['BusinessName'])
    result['BusinessName'] = result["BusinessName"].replace("LIMITED LIABILITY COMPANY", "LLC")
    result['BusinessName'] = result["BusinessName"].replace("LIMITED PARTNERSHIP", "LLC")

    # Do the same for the search term, so that we have more exact matches.
    # Each step builds on the previous one instead of restarting from the raw search term.
    standardized_term = search_term.replace(",", "")
    standardized_term = re.sub(p, "LLC", standardized_term)
    standardized_term = standardized_term.replace("LIMITED PARTNERSHIP", "LLC")
    standardized_term = standardized_term.replace("LIMITED LIABILITY COMPANY", "LLC")
    result["SearchTerm"] = standardized_term

    # Strip addresses of all commas (str.replace returns a new string, so assign it back)
    result['Address'] = result['Address'].replace(",", "")
    return result

def extract_search_results(search_term, search_req_response):
    # TODO: add all the columns, or at least filing status 
    # TODO: collapse all listings with L.L.P or L.L.C or LLC LLP
    # res_list = [standardize_result(res) for res in search_req_response]
    res_list = [[search_term, res['BusinessName'], res['UBINumber'], res['BusinessID'], res['PrincipalOffice']['PrincipalStreetAddress']['FullAddress'], res["BusinessStatus"]] for res in search_req_response]
    res_df = pd.DataFrame(res_list, columns=['SearchTerm', 'BusinessName', 'UBINumber', 'BusinessId', 'Address', "Status"])
    # print(res_df)
    # res_df = res_df[res_df['Status']`=="Active"]#res_df.drop(res_df[res_df["Status"]=="Terminated"].index)
    # TODO: If there's an exact match, keep only that business 
    # Basically keep a list of exact matches, and build a list of potential matches that we give to human verifiers
    exact_match = res_df.index[res_df['BusinessName'] == search_term].tolist()
    if exact_match:
        # print(res_df)
        # print(exact_match)
        res_df = pd.concat([res_df.iloc[[exact_match[0]],:], res_df.drop(exact_match[0], axis=0)], axis=0)
    return res_df
    

# Mark a row as a potential match if its UBI number or address is duplicated.
# df.duplicated only checks whether that address already appears in the dataframe, NOT whether the
# search term and the result share an address. Could add the search term to the subset for the duplicated call.
def determine_search_matches(search_results_df):
    search_results_df['address_match'] = search_results_df.duplicated(subset=['Address'], keep=False) 
    search_results_df['ubi_match'] = search_results_df.duplicated(subset=['UBINumber'], keep=False)
    search_results_df['id_match'] = search_results_df.duplicated(subset=['BusinessId'], keep=False)

def get_business_details(business_id):
    url = 'https://cfda.sos.wa.gov/api/BusinessSearch/BusinessInformation?businessID={business_id}'.format(business_id=business_id)
    r = requests.get(url)
    return json.loads(r.text)

In [243]:
def get_potential_company_name_matches(owner_name):
    all_search_results = get_business_search_results(owner_name)
    extracted_results = extract_search_results(owner_name, all_search_results)
    determine_search_matches(extracted_results)
    return extracted_results
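
The TODO in `extract_search_results` mentions keeping exact matches and building a separate list of potential matches for human verifiers. A minimal sketch of that split is below. It uses `difflib` from the standard library as a stand-in for `fuzzywuzzy`; the `split_matches` name and the 0.8 threshold are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher

def split_matches(search_term, candidate_names, threshold=0.8):
    """Return (exact, potential): identical names, and similar names worth a human look."""
    exact, potential = [], []
    for name in candidate_names:
        if name == search_term:
            exact.append(name)
        elif SequenceMatcher(None, search_term, name).ratio() >= threshold:
            potential.append(name)
    return exact, potential

candidates = [
    "XERAD LLC",
    "XERAD I, LLC",
    "THE TAX ERADICATOR BOOKKEEPING & TAX SERVICES, LLC",
]
exact, potential = split_matches("XERAD LLC", candidates)
print(exact)
print(potential)
```

Because the CCFS search uses a `Contains` criterion, unrelated names that merely contain the search string can come back; a similarity score gives reviewers a way to sort those to the bottom.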

In [239]:
def get_company_list_name_matches(owner_list):
    matches = pd.DataFrame([], columns = ['SearchTerm', 'BusinessName', 'UBINumber', 'BusinessId', 'Address', 'IsMatch'])
    
    for owner in owner_list:
        matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
    
    return matches
    

In [244]:
buildings_and_landlords_df = df1_landlords#pd.read_csv('../../experiments/worst_offenders/landlords_with_total_energy_use_2_16_23.csv')

In [245]:
owner_search_list = buildings_and_landlords_df['Landlord'].unique()

owner_search_list[:5]

array(['101 PINE STREET LLC', '101 STEWART STREET LLC ',
       '1201 TAB OWNER LLC', '1225 MAR LLC ',
       '1301 6TH AVENUE (WA) OWNER LLC'], dtype=object)

In [246]:
get_potential_company_name_matches('4TH AVENUE BLDG LLC')

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,4TH AVENUE BLDG LLC,4TH AVENUE BLDG LLC,602 253 701,95495,,Administratively Dissolved,False,False,False
1,4TH AVENUE BLDG LLC,4TH AVENUE BLDG LLC,604 065 660,95496,"11221 PACIFIC HWY SW, LAKEWOOD, WA, 98499-5170...",Active,False,False,False


In [247]:
# Trying this with our first 200 

owner_search_chunk_1 = get_company_list_name_matches(owner_search_list)

owner_search_chunk_1.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match,IsMatch
0,YESSIX LLC,"YESSIX, LLC",603 446 762,1082614,"2900 NE BLAKELEY ST STE B, SEATTLE, WA, 98105,...",Active,False,False,False,
1,XERAD LLC,"THE TAX ERADICATOR BOOKKEEPING & TAX SERVICES,...",603 427 461,986537,"10003 201ST AVENUE PL E, BONNEY LAKE, WA, 9839...",Active,False,False,False,
2,XERAD LLC,"XERAD I, LLC",602 498 747,526872,"3021 4TH AVE, SEATTLE, WA, 98121, UNITED STATES",Administratively Dissolved,False,False,False,
3,XERAD LLC,"XERAD II, LLC",602 498 750,526873,"3425 67TH SE, MERCER ISLAND, WA, 98040, UNITED...",Active,False,False,False,
4,WRI 2200 WESTLAKE LP,WRI 2200 WESTLAKE LP,603 587 485,528804,"500 N BROADWAY, SUITE 201, JERICHO, NY, 11753,...",Active,False,False,False,


In [248]:
owner_search_chunk_1.to_csv('downtown_owners.csv')

In [79]:
all_building_owner_chunks = pd.concat([
    owner_search_chunk_1,
    owner_search_chunk_2,
    owner_search_chunk_3,
    owner_search_chunk_4,
    owner_search_chunk_5,
    owner_search_chunk_6,
    owner_search_chunk_7,
    owner_search_chunk_8,
    owner_search_chunk_9,
    owner_search_chunk_10,
    owner_search_chunk_11
])

In [86]:
all_building_owner_chunks.to_csv('../chunks.csv')

## (Optional) Step 1b

You can ask annotators to check the possible matches found by the scraping script.

Proposed categories (hat tip Alice): Match, possible match (will be reviewed by another annotator), not a match

Then you can filter out the "not a match" companies and use the resulting list as input for Step 2.
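
A minimal sketch of that filtering step follows, assuming the annotators' labels are stored in a column called `Annotation` (a hypothetical name; the notebook does not define one). Both "match" and "possible match" rows are kept.

```python
# Labels that should survive into Step 2 (compared case-insensitively)
KEEP_LABELS = {"match", "possible match"}

def drop_non_matches(rows, label_column="Annotation"):
    """Keep only the rows whose annotation is a match or a possible match."""
    return [row for row in rows if row[label_column].strip().lower() in KEEP_LABELS]

# Made-up annotated rows, for illustration only
rows = [
    {"BusinessName": "EXAMPLE A LLC", "BusinessId": 1, "Annotation": "Match"},
    {"BusinessName": "EXAMPLE B LLC", "BusinessId": 2, "Annotation": "Not a match"},
    {"BusinessName": "EXAMPLE C LLC", "BusinessId": 3, "Annotation": "Possible match"},
]
step_2_input = drop_non_matches(rows)
print([row["BusinessId"] for row in step_2_input])
```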

## Step 2

Fetch the principals for each company found in Step 1.

In [122]:
def get_business_details(business_id):
    url = 'https://cfda.sos.wa.gov/api/BusinessSearch/BusinessInformation?businessID={business_id}'.format(business_id=business_id)
    r = requests.get(url)
    return json.loads(r.text)

def extract_principals(business_res, business_id):
    agent = business_res['Agent']['EntityName']
    rows = [[
        # name of company?
        business_id,
        agent,
        'Entity' if principal['TypeID'] == 'E' else 'Individual',
        principal['PrincipalID'],
        principal['Name'] if principal['TypeID'] == 'E' else principal['FirstName'] + ' ' + principal['LastName']
    ] for principal in business_res['PrincipalsList']]
    return pd.DataFrame(rows, columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])

def get_companies_principals(business_names_df):
    '''
    Takes a DF of companies with BusinessId and returns a DF of each company's principals, 
    with one row for each principal.
    '''
    principals = pd.DataFrame([], columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])
    for business in business_names_df['BusinessId']:
        business_res = get_business_details(business)
        principals = pd.concat([extract_principals(business_res, business), principals], ignore_index=True)
    
    merged_principals = pd.merge(business_names_df, principals, on='BusinessId', how='left')
    
    return merged_principals

In [123]:
#create a table with principals and the business id, then left join on existing table to get business info + principal

principals_search_chunk_1 = get_companies_principals(owner_search_chunk_1)

In [124]:
principals_search_chunk_1.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,address_match,ubi_match,id_match,IsMatch,Agent,EntityType,PrincipalID,PrincipalName
0,ASHFORD SEATTLE WATERFRONT L P,ASHFORD SEATTLE WATERFRONT GP LLC,604 268 117,1193505,"14185 DALLAS PKWY STE 1100, DALLAS, TX, 75254-...",True,False,False,,CORPORATION SERVICE COMPANY,Entity,2826137,ASHFORD CHICAGO SENIOR MEZZ LLC
1,ASHFORD SEATTLE WATERFRONT L P,ASHFORD SEATTLE WATERFRONT GP LLC,604 268 117,1193505,"14185 DALLAS PKWY STE 1100, DALLAS, TX, 75254-...",True,False,False,,CORPORATION SERVICE COMPANY,Individual,3132878,J. ROBINSON HAYS III
2,ASHFORD SEATTLE WATERFRONT L P,ASHFORD SEATTLE WATERFRONT GP LLC,604 268 117,1193505,"14185 DALLAS PKWY STE 1100, DALLAS, TX, 75254-...",True,False,False,,CORPORATION SERVICE COMPANY,Individual,3132879,ROBERT G. HAIMAN
3,ASHFORD SEATTLE WATERFRONT L P,ASHFORD SEATTLE WATERFRONT LEASING LLC,604 417 033,1268461,"14185 DALLAS PKWY STE 1100, DALLAS, TX, 75254-...",True,False,False,,CORPORATION SERVICE COMPANY,Individual,2820892,ROBERT G HAIMAN
4,ASHFORD SEATTLE WATERFRONT L P,ASHFORD SEATTLE WATERFRONT LEASING LLC,604 417 033,1268461,"14185 DALLAS PKWY STE 1100, DALLAS, TX, 75254-...",True,False,False,,CORPORATION SERVICE COMPANY,Individual,3139500,J. ROBINSON HAYS III


In [128]:
# keep getting an error if creating a csv in this directory, so using a redundant file path
principals_search_chunk_1.to_csv('../building_owners/principals/principals_search_chunk_1.csv')

In [129]:
principals_search_chunk_2 = get_companies_principals(owner_search_chunk_2)
principals_search_chunk_2.to_csv('../building_owners/principals/principals_search_chunk_2.csv')

In [130]:
principals_search_chunk_3 = get_companies_principals(owner_search_chunk_3)
principals_search_chunk_3.to_csv('../building_owners/principals/principals_search_chunk_3.csv')

In [131]:
principals_search_chunk_4 = get_companies_principals(owner_search_chunk_4)
principals_search_chunk_4.to_csv('../building_owners/principals/principals_search_chunk_4.csv')

In [132]:
principals_search_chunk_5 = get_companies_principals(owner_search_chunk_5)
principals_search_chunk_5.to_csv('../building_owners/principals/principals_search_chunk_5.csv')

In [133]:
principals_search_chunk_6 = get_companies_principals(owner_search_chunk_6)
principals_search_chunk_6.to_csv('../building_owners/principals/principals_search_chunk_6.csv')

In [134]:
principals_search_chunk_7 = get_companies_principals(owner_search_chunk_7)
principals_search_chunk_7.to_csv('../building_owners/principals/principals_search_chunk_7.csv')

In [135]:
principals_search_chunk_8 = get_companies_principals(owner_search_chunk_8)
principals_search_chunk_8.to_csv('../building_owners/principals/principals_search_chunk_8.csv')

In [136]:
principals_search_chunk_9 = get_companies_principals(owner_search_chunk_9)
principals_search_chunk_9.to_csv('../building_owners/principals/principals_search_chunk_9.csv')

In [137]:
principals_search_chunk_10 = get_companies_principals(owner_search_chunk_10)
principals_search_chunk_10.to_csv('../building_owners/principals/principals_search_chunk_10.csv')

In [138]:
principals_search_chunk_11 = get_companies_principals(owner_search_chunk_11)
principals_search_chunk_11.to_csv('../building_owners/principals/principals_search_chunk_11.csv')
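
The repeated cells above could be collapsed into a loop over fixed-size chunks. The helper below is a sketch: `chunked` is a hypothetical name, and the chunk size of 200 is an assumption taken from the "first 200" comment earlier in the notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Small demonstration with plain numbers
demo = list(chunked(list(range(5)), 2))
print(demo)

# Hypothetical use with the notebook's own functions (not run here):
# for i, owners in enumerate(chunked(list(owner_search_list), 200), start=1):
#     matches = get_company_list_name_matches(owners)
#     principals = get_companies_principals(matches)
#     principals.to_csv(f'../building_owners/principals/principals_search_chunk_{i}.csv')
```

Writing each chunk to its own CSV keeps partial results if a request fails midway.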

Now we need to find the companies these people are principals for. In outline:

1. Search for the person using the advanced search.
1. Go through all the search results, hit the API for each listed business, and look at the principals listed. If one of them matches the person we're looking for, download the relevant info. Otherwise, skip it.

(NB: the search results don't show the principals' names in either the UI or the API response, so you have to look at the full business listing.)


## Step 3

Find every company associated with the governors found in Step 2.

This is slightly convoluted because of the API. The process is:

a) Search for the principal's name using the advanced search API. 

b) Get a paginated list of results. This will *not* include the principal's name because... reasons?

c) For each result, send another request to the business information endpoint to fetch the business details.

d) If the company's principals include the original principal we were looking for, save the business' information.
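
The flow in (a) through (d) can be sketched offline by passing the two API calls in as functions. Everything here is illustrative: `businesses_for_governor`, `normalize_person` and the fake data are hypothetical, and the sketch assumes the advanced search returns the same `BusinessID` and `BusinessName` keys as the basic search. Names are compared after dropping periods, since the same person can appear as "ROBERT G. HAIMAN" and "ROBERT G HAIMAN".

```python
def principal_name(principal):
    """Entities carry a single Name; individuals carry FirstName and LastName."""
    if principal["TypeID"] == "E":
        return principal["Name"]
    return principal["FirstName"] + " " + principal["LastName"]

def normalize_person(name):
    """Uppercase, drop periods and collapse whitespace before comparing names."""
    return " ".join(name.upper().replace(".", "").split())

def businesses_for_governor(governor_name, search_page, fetch_details, page_size=100):
    target = normalize_person(governor_name)
    matches = []
    page_num = 1
    while True:
        results = search_page(governor_name, page_num)      # steps (a) and (b)
        for res in results:
            details = fetch_details(res["BusinessID"])       # step (c)
            names = {normalize_person(principal_name(p)) for p in details["PrincipalsList"]}
            if target in names:                              # step (d)
                matches.append(res["BusinessName"])
        if len(results) < page_size:                         # a short page is the last page
            break
        page_num += 1
    return matches

# Fake data standing in for the two endpoints, so the sketch runs without network access
FAKE_PAGES = {
    1: [{"BusinessID": 1, "BusinessName": "EXAMPLE A LLC"},
        {"BusinessID": 2, "BusinessName": "EXAMPLE B LLC"}],
    2: [{"BusinessID": 3, "BusinessName": "EXAMPLE C LLC"}],
}
FAKE_DETAILS = {
    1: {"PrincipalsList": [{"TypeID": "I", "FirstName": "PAT", "LastName": "EXAMPLE"}]},
    2: {"PrincipalsList": [{"TypeID": "E", "Name": "EXAMPLE HOLDINGS LLC"}]},
    3: {"PrincipalsList": [{"TypeID": "I", "FirstName": "Pat", "LastName": "Example"}]},
}
found = businesses_for_governor(
    "Pat Example",
    search_page=lambda name, page: FAKE_PAGES.get(page, []),
    fetch_details=lambda business_id: FAKE_DETAILS[business_id],
    page_size=2,
)
print(found)
```

Injecting the two callables also makes it easy to add caching or rate limiting around the real requests later.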

In [141]:
def get_governor_payload(governor_name, page_num):
    return "Type=Principal&BusinessStatusID=0&SearchEntityName=&SearchType=&BusinessTypeID=0&AgentName=&PrincipalName={governor_name}&StartDateOfIncorporation=&EndDateOfIncorporation=&ExpirationDate=&IsSearch=true&IsShowAdvanceSearch=true&&&AgentAddress%5BIsAddressSame%5D=false&AgentAddress%5BIsValidAddress%5D=false&AgentAddress%5BisUserNonCommercialRegisteredAgent%5D=false&AgentAddress%5BIsInvalidState%5D=false&AgentAddress%5BbaseEntity%5D%5BFilerID%5D=0&AgentAddress%5BbaseEntity%5D%5BUserID%5D=0&AgentAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&AgentAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&AgentAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&AgentAddress%5BID%5D=0&&&&AgentAddress%5BState%5D=WA&&AgentAddress%5BCountry%5D=USA&&&&&&&&PrincipalAddress%5BIsAddressSame%5D=false&PrincipalAddress%5BIsValidAddress%5D=false&PrincipalAddress%5BisUserNonCommercialRegisteredAgent%5D=false&PrincipalAddress%5BIsInvalidState%5D=false&PrincipalAddress%5BbaseEntity%5D%5BFilerID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BUserID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&PrincipalAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&PrincipalAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&PrincipalAddress%5BID%5D=0&&&&PrincipalAddress%5BState%5D=&&PrincipalAddress%5BCountry%5D=USA&&&&&&IsHostHomeSearch=&IsPublicBenefitNonProfitSearch=&IsCharitableNonProfitSearch=&IsGrossRevenueNonProfitSearch=&IsHasMembersSearch=&IsHasFEINSearch=&NonProfit%5BIsNonProfitEnabled%5D=false&NonProfit%5BchkSearchByIsHostHome%5D=false&NonProfit%5BchkSearchByIsPublicBenefitNonProfit%5D=false&NonProfit%5BchkSearchByIsCharitableNonProfit%5D=false&NonProfit%5BchkSearchByIsGrossRevenueNonProfit%5D=false&NonProfit%5BchkSearchByIsHasMembers%5D=false&NonProfit%5BchkSearchByIsHasFEIN%5D=false&NonProfit%5BFEINNoSearch%5D=&NonProfit%5BchkIsHostHome%5D%5Bnone%5D=false&NonProfit%5BchkIsHostHome%5D%5Byes%5D=false&NonProfit%5BchkIsHostHome%5D%5Bno%5D=false&NonProfit%5BchkIsPublicBenefitNonProfit%5D%5Bnone%5D=false&NonProfit%5BchkIsPublicBenefitNonProfit%5D%5Byes%5D=false&NonProfit%5BchkIsPublicBenefitNonProfit%5D%5Bno%5D=false&NonProfit%5BchkIsCharitableNonProfit%5D%5Bnone%5D=false&NonProfit%5BchkIsCharitableNonProfit%5D%5Byes%5D=false&NonProfit%5BchkIsCharitableNonProfit%5D%5Bno%5D=false&NonProfit%5BchkIsGrossRevenueNonProfit%5D%5Bnone%5D=false&NonProfit%5BchkIsGrossRevenueNonProfit%5D%5Byes%5D=false&NonProfit%5BchkIsGrossRevenueNonProfit%5D%5Bno%5D=false&NonProfit%5BchkIsGrossRevenueNonProfit%5D%5Bover500k%5D=false&NonProfit%5BchkIsGrossRevenueNonProfit%5D%5Bunder500k%5D=false&NonProfit%5BchkIsHasMembers%5D%5Bnone%5D=false&NonProfit%5BchkIsHasMembers%5D%5Byes%5D=false&NonProfit%5BchkIsHasMembers%5D%5Bno%5D=false&NonProfit%5BchkIsHasFEIN%5D%5Byes%5D=false&NonProfit%5BchkIsHasFEIN%5D%5Bno%5D=false&PageID={page_num}&PageCount=100".format(governor_name=governor_name, page_num=page_num)

governor_headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    # 'Content-Length' is left out on purpose: requests computes it from the payload,
    # and a hard-coded value is wrong whenever the governor's name changes the body length
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'DNT': '1',
    'Host': 'cfda.sos.wa.gov',
    'Origin': 'https://ccfs.sos.wa.gov',
    'Referer': 'https://ccfs.sos.wa.gov/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': "macOS"
}

def get_governor_results_json(governor_name, page_num):
    r = requests.post('https://cfda.sos.wa.gov/api/BusinessSearch/GetAdvanceBusinessSearchList', data=get_governor_payload(governor_name, page_num), headers=governor_headers)
    return json.loads(r.text)

def get_all_governor_search_results(governor_name):
    page_num = 1
    res_length = 100
    search_results = []

    # Keep requesting pages until one comes back with fewer than 100 results
    while res_length == 100:
        res = get_governor_results_json(governor_name, page_num)
        search_results.extend(res)  # flatten all pages into one list of results
        page_num += 1
        res_length = len(res)

    return search_results

def get_governors_from_all_results_pages(governor_name):
    search_results = get_all_governor_search_results(governor_name)
    # The basic search response uses the key 'BusinessID'; assumed to be the same here
    business_ids = [res['BusinessID'] for res in search_results]

    # should include company name, see below
    principals = pd.DataFrame([], columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])

    for business_id in business_ids:
        business_json = get_business_details(business_id)
        principals_df = extract_principals(business_json, business_id) # we should add the company name--write a variant of extract_principals
        # keep this business only if the governor we searched for is among its principals
        if (principals_df['PrincipalName'] == governor_name).any():
            principals = pd.concat([principals_df, principals], ignore_index=True)

    return principals

In [None]:
NB: I have not tested these functions. You will have to:

- add the business name to the results returned by `get_governors_from_all_results_pages`
- test the methods to make sure they work

You should be able to run `get_governors_from_all_results_pages` for each row of the CSV generated in Step 2 to get a full list of principals.

Then you can have humans review the outcomes.

If you want to find out what a particular person is involved in, you can run this process starting from Step 3.