# Who Owns the Large Buildings in Seattle?

## Problem

The GHGE dataset does not include buildings' owners. Scraping the eRealProperty website for building owners has two limitations:

1. The data quality is poor and many buildings don't have an owner listed.
1. Many corporations with multiple properties set up a separate LLC for each building. There is no straightforward way to trace a child corporation to its parent corporation. This obscures the portfolio size of each company.

We will use the Corporations and Charities Filings System from the Secretary of State to figure this out. The basic process is:
    
1. Start with a company name.
1. Find that company's official name in CCFS.
1. Collect the names of that company's principals (governors).
1. Collect all the businesses with those same governors.
1. Human review to check which companies are connected, based on the number of overlapping governors, ID number, address, name, etc.
1. Profit?
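
To support the human-review step, the amount of governor overlap between two companies can be turned into a number. This is a minimal sketch, assuming each company's governors are available as a list of names; the function `governor_overlap` and the sample names are hypothetical and not part of the notebook's pipeline.

```python
def governor_overlap(governors_a, governors_b):
    """Jaccard similarity between two governor lists (ignores case and padding)."""
    a = {name.strip().upper() for name in governors_a if name.strip()}
    b = {name.strip().upper() for name in governors_b if name.strip()}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical governor lists for two LLCs.
llc_one = ["Jane Doe", "John Roe"]
llc_two = ["JOHN ROE", "Jane Doe ", "Alex Poe"]
print(governor_overlap(llc_one, llc_two))
```

A score near 1 suggests the same people run both companies; a reviewer would still check ID number, address, and name before calling it a match.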

The CCFS does not have a public API. The endpoints used here are defined in the utility functions for each step.

Our most up-to-date list of building owners, with owner names normalized (e.g., "City of Seattle" and "Seattle City" are both normalized to "City of Seattle"), can be found [here](https://github.com/linnealovespie/BPS/tree/dig_into_owners/experiments/worst_offenders#:~:text=updated_owners_2_15_23.csv).
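
Owner-name normalization of this kind can be sketched with a small alias table. The table `OWNER_ALIASES` and the function `normalize_owner` are hypothetical illustrations, not the code that produced the CSV; the sketch uppercases names because the owner values in the data appear in uppercase.

```python
import re

# Hypothetical alias table: every known variant maps to one canonical owner name.
OWNER_ALIASES = {
    "SEATTLE CITY": "CITY OF SEATTLE",
    "CITY OF SEATTLE": "CITY OF SEATTLE",
}


def normalize_owner(raw_name):
    """Uppercase, collapse whitespace, then apply the alias table."""
    cleaned = re.sub(r"\s+", " ", raw_name).strip().upper()
    return OWNER_ALIASES.get(cleaned, cleaned)


print(normalize_owner("Seattle  City"))
print(normalize_owner("city of seattle"))
```

Names without an alias entry pass through unchanged apart from casing and whitespace.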

In [2]:
# This is a locally defined package.
# If you have trouble with this import, go to the root directory and run `pip install .`

from utils import geo

In [3]:
import pandas as pd
import numpy as np
import requests
import json
import os
import re
import geopandas as gp
import urllib.parse

## Step 0: Isolate to downtown Landlords
1. Get all buildings in downtown neighborhood
2. Look up their tax parcel ID
3. Map that tax parcel ID to a landlord 
4. Aggregate landlords in downtown
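
The four steps above can be sketched on toy data before running them on the real files. All values below are made up; only the column names (`TaxParcelIdentificationNumber`, `Neighborhood`, `TotalGHGEmissions`, `Owner`) follow the notebook, and `NumBuildings` is a name chosen for this sketch.

```python
import pandas as pd

# Toy stand-ins for the benchmarking data and the owner lookup (values are made up).
buildings = pd.DataFrame({
    "TaxParcelIdentificationNumber": ["A1", "A2", "A3", "B1"],
    "Neighborhood": ["DOWNTOWN", "DOWNTOWN", "DOWNTOWN", "BALLARD"],
    "TotalGHGEmissions": [10.0, 20.0, 5.0, 7.0],
})
owners = pd.DataFrame({
    "TaxParcelIdentificationNumber": ["A1", "A2", "B1"],
    "Owner": ["ACME LLC", "ACME LLC", "OTHER LLC"],
})

# 1. Keep downtown buildings; .copy() gives an independent frame we can add columns to.
downtown = buildings[buildings["Neighborhood"] == "DOWNTOWN"].copy()

# 2-3. Map parcel ID -> landlord; parcels without a known owner get "".
parcel_to_owner = dict(zip(owners["TaxParcelIdentificationNumber"], owners["Owner"]))
downtown["Landlord"] = (
    downtown["TaxParcelIdentificationNumber"].map(parcel_to_owner).fillna("")
)

# 4. Aggregate emissions and building counts per landlord.
summary = downtown.groupby("Landlord").agg(
    TotalGHGEmissions=("TotalGHGEmissions", "sum"),
    NumBuildings=("TaxParcelIdentificationNumber", "size"),
)
print(summary)
```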

In [4]:
df_districts = gp.read_file("../../../data/Council_Districts.geojson")
df = pd.read_csv('../../../data/2020_Building_Energy_Benchmarking.csv')
df = gp.GeoDataFrame(df, geometry=gp.points_from_xy(df.Longitude, df.Latitude))
geo.clean_districts(df, df_districts)

FileNotFoundError: [Errno 2] No such file or directory: '../../../data/2020_Building_Energy_Benchmarking.csv'

In [8]:
df1=df.loc[df['Neighborhood']=="DOWNTOWN"]

In [11]:
# this CSV uses consolidated owner names
building_owners = pd.read_csv('../../../experiments/worst_offenders/updated_owners_2_15_23.csv')
# Map tax ids to landlord name
d = pd.Series(building_owners.Owner.values, index=building_owners.TaxParcelIdentificationNumber).to_dict()
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the df1 slice
df1 = df1.assign(Landlord=df1['TaxParcelIdentificationNumber'].map(lambda parcel_id: d.get(parcel_id, "")))


In [12]:
# use this
unique_d1_landlords = df1['Landlord'].unique()

In [20]:
# skip

landlords=df1.groupby("Landlord")[['TotalGHGEmissions','PropertyGFATotal']]
landlords.size()
df1.value_counts(subset=["Landlord"])

Landlord                                                     
                                                                 239
UNDEFINED                                                        109
CITY OF SEATTLE                                                   10
STATE OF WASHINGTON                                                5
NOT FOUND                                                          4
                                                                ... 
FAIRFIELD WASHINGTON TERRACE LP                                    1
FAN-SHENG TEMPLE OF THE AMERICAN SEATTLE BUDDHIST ASSOCIATION      1
FANA DOYLE LLC                                                     1
FG92 LENORA LLC+CWS LENORA LP                                      1
YUAN PAUL J L+SOPHIA Y                                             1
Length: 230, dtype: int64

In [21]:
# skip
df1_landlords = landlords.sum()
df1_landlords["NumUnits"] = landlords.size()
df1_landlords=df1_landlords.drop("")

In [23]:
#skip
df1_landlords.to_csv("../../../data/downtown_landlords.csv")

| Owner name in our data | Official name in CCFS |
| --- | --- |
| CRE WINSTON LLC | CRE WINSTON, LLC |
| TESHOME FAMILY LLC | TESHOME FAMILY LLC |
| TRIAD PIER 70 L L C | TRIAD PIER 70 L.L.C. |
| BOREN DEVELOPMENT LLC | BOREN DEVELOPMENT, LIMITED LIABILITY COMPANY |
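
The name pairs above differ mostly in punctuation and in how the entity type is written, so comparing a normalized key catches matches that a strict string comparison misses. This is a rough sketch, not part of the notebook's pipeline: `normalize_business_name` is a hypothetical helper, and its list of entity types is illustrative rather than exhaustive.

```python
import re

# Spelled-out entity types mapped to short forms. Illustrative, not exhaustive.
_LONG_FORMS = [
    (r"\bLIMITED LIABILITY COMPANY\b", "LLC"),
    (r"\bLIMITED LIABILITY PARTNERSHIP\b", "LLP"),
    (r"\bLIMITED PARTNERSHIP\b", "LP"),
]


def normalize_business_name(name):
    """Return a comparison key: uppercase, no periods or commas, short suffix."""
    key = name.upper()
    key = re.sub(r"[.,]", "", key)                  # "L.L.C." -> "LLC"
    key = re.sub(r"\bL L ([CP])\b", r"LL\1", key)   # "L L C" -> "LLC"
    for pattern, short in _LONG_FORMS:
        key = re.sub(pattern, short, key)
    return re.sub(r"\s+", " ", key).strip()


pairs = [
    ("CRE WINSTON LLC", "CRE WINSTON, LLC"),
    ("TESHOME FAMILY LLC", "TESHOME FAMILY LLC"),
    ("TRIAD PIER 70 L L C", "TRIAD PIER 70 L.L.C."),
    ("BOREN DEVELOPMENT LLC", "BOREN DEVELOPMENT, LIMITED LIABILITY COMPANY"),
]
for ours, official in pairs:
    print(ours, "|", normalize_business_name(ours) == normalize_business_name(official))
```

The key is only for comparison; the original names should still be stored so reviewers see what CCFS actually lists.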

In [24]:
# Utils for finding principals

search_for_business_url = 'https://cfda.sos.wa.gov/api/BusinessSearch/GetBusinessSearchList'

def get_business_search_payload(business_name, page_count, page_num):
    return {
        'Type': 'BusinessName',
        'SearchEntityName': business_name,
        'SearchType': 'BusinessName',
        'SortType': 'ASC',
        'SortBy': 'Entity Name',
        'SearchValue': business_name,
        'SearchCriteria': 'Contains',
        'IsSearch': 'true',
        'PageID': page_num,
        'PageCount': page_count,
    }


def get_business_search_results(business_name, page_num):
    r = requests.post(search_for_business_url, get_business_search_payload(business_name, 100, page_num))
    return json.loads(r.text)


def extract_search_results(search_term, search_req_response):
    # TODO: add all the columns, or at least filing status 
    # TODO: collapse all listings with L.L.P or L.L.C or LLC LLP
    # res_list = [standardize_result(res) for res in search_req_response]
    res_list = [[search_term, res['BusinessName'], res['UBINumber'], res['BusinessID'], res['PrincipalOffice']['PrincipalStreetAddress']['FullAddress'], res["BusinessStatus"]] for res in search_req_response]
    res_df = pd.DataFrame(res_list, columns=['SearchTerm', 'BusinessName', 'UBINumber', 'BusinessId', 'Address', "Status"])
    # print(res_df)
    # res_df = res_df[res_df['Status'] == "Active"]  # or: res_df.drop(res_df[res_df["Status"] == "Terminated"].index)
    # TODO: If there's an exact match, keep only that business 
    # Basically keep a list of exact matches, and build a list of potential matches that we give to human verifiers
    exact_match = res_df.index[res_df['BusinessName'] == search_term].tolist()
    if exact_match:
        # print(res_df)
        # print(exact_match)
        res_df = pd.concat([res_df.iloc[[exact_match[0]],:], res_df.drop(exact_match[0], axis=0)], axis=0)
    return res_df
    

# Mark a row as a potential match when its UBI number or address is duplicated.
# Note: df.duplicated only checks whether that address appears elsewhere in the dataframe,
# NOT whether the search term and the result share an address.
# Could add the search term to `subset` in the duplicated call.
def determine_search_matches(search_results_df):
    search_results_df['address_match'] = search_results_df.duplicated(subset=['Address'], keep=False) 
    search_results_df['ubi_match'] = search_results_df.duplicated(subset=['UBINumber'], keep=False)
    search_results_df['id_match'] = search_results_df.duplicated(subset=['BusinessId'], keep=False)

def get_business_details(business_id):
    url = 'https://cfda.sos.wa.gov/api/BusinessSearch/BusinessInformation?businessID={business_id}'.format(business_id=business_id)
    r = requests.get(url)
    return json.loads(r.text)

def get_empty_df():
    return pd.DataFrame([], columns = ['SearchTerm', 'BusinessName', 'UBINumber', 'BusinessId', 'Address', 'Status', 'address_match', 'ubi_match', 'id_match'])

In [25]:
def get_all_company_name_match_search_results(owner_name):
    n = 1
    res_length = 100
    search_results = []
    
    while res_length == 100:
        res = get_business_search_results(owner_name, n)
        search_results += (res)
        n += 1
        res_length = len(res)
    
    return search_results

In [26]:
def get_potential_company_name_matches(owner_name):
    all_search_results = get_all_company_name_match_search_results(owner_name)
    extracted_results = extract_search_results(owner_name, all_search_results)
    determine_search_matches(extracted_results)
    return extracted_results

### Filter search results

Separate your search results into Alice's three categories:

- exact match
- potential matches (where no exact match was found)
- additional matches (extra matches if there was an exact match *and* additional matches)
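
As a toy illustration of the three categories, the sketch below uses plain dictionaries instead of DataFrames. `categorize` is a hypothetical helper written for this example, and the sample rows reuse business names that appear in this notebook's outputs.

```python
def categorize(results):
    """Split one search term's results into (exact, potential, additional)."""
    exact = [r for r in results if r["SearchTerm"] == r["BusinessName"]]
    others = [r for r in results if r["SearchTerm"] != r["BusinessName"]]
    if exact:
        # An exact hit exists, so everything else is an "additional" match.
        return exact, [], others
    # No exact hit: every result is only a "potential" match.
    return [], others, []


xerad = [
    {"SearchTerm": "XERAD LLC", "BusinessName": "XERAD I, LLC"},
    {"SearchTerm": "XERAD LLC", "BusinessName": "XERAD II, LLC"},
]
target = [
    {"SearchTerm": "TARGET CORPORATION", "BusinessName": "TARGET CORPORATION"},
    {"SearchTerm": "TARGET CORPORATION", "BusinessName": "ACTION TARGET INC."},
]

for rows in (xerad, target):
    exact, potential, additional = categorize(rows)
    print(len(exact), len(potential), len(additional))
```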

In [27]:
def separate_search_results(results):
    exact_matches = get_empty_df()
    potential_matches = get_empty_df()
    additional_matches = get_empty_df()
    
    exact_match = results[results['SearchTerm'] == results['BusinessName']]
    if len(exact_match) > 0:
        exact_matches = pd.concat([exact_matches, exact_match], ignore_index=True)
        additional_matches = pd.concat([additional_matches, results[results['SearchTerm'] != results['BusinessName']]], ignore_index=True)
    else:
        potential_matches = pd.concat([potential_matches, results], ignore_index=True)
    
    return exact_matches, potential_matches, additional_matches

In [28]:
def get_company_list_name_matches(owner_list):
    exact_matches = get_empty_df()
    potential_matches = get_empty_df()
    additional_matches = get_empty_df()
    
    for owner in owner_list:
        matches = get_potential_company_name_matches(owner)
        temp_exact, temp_potential, temp_add = separate_search_results(matches)
        exact_matches = pd.concat([temp_exact, exact_matches], ignore_index=True)
        potential_matches = pd.concat([temp_potential, potential_matches], ignore_index=True)
        additional_matches = pd.concat([temp_add, additional_matches], ignore_index=True)
    
    return exact_matches, potential_matches, additional_matches

In [29]:
buildings_and_landlords_df = df1_landlords#pd.read_csv('../../experiments/worst_offenders/landlords_with_total_energy_use_2_16_23.csv')

In [30]:
#owner_search_list = buildings_and_landlords_df['Landlord'].unique()

# hacking this
owner_search_list = unique_d1_landlords
owner_search_list = list(owner_search_list)
owner_search_list.remove('NOT FOUND')

owner_search_list[:5]

['',
 '1301 6TH AVENUE (WA) OWNER LLC',
 'FIFTH & PINE LLC',
 'ARRENDELL AMY+ARRENDELL REVOCABLE LIVING TRUST AMY +ET AL',
 'GIBRALTAR TOWER LLC']

In [186]:
# list.remove() mutates in place and returns None, so don't assign its result.
owner_search_list = [owner for owner in owner_search_list if owner != 'NOT FOUND']

In [31]:
get_potential_company_name_matches('4TH AVENUE BLDG LLC')

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,4TH AVENUE BLDG LLC,4TH AVENUE BLDG LLC,604 065 660,95496,"11221 PACIFIC HWY SW, LAKEWOOD, WA, 98499-5170...",Active,False,False,False
1,4TH AVENUE BLDG LLC,4TH AVENUE BLDG LLC,602 253 701,95495,,Administratively Dissolved,False,False,False


In [32]:
target = get_potential_company_name_matches('TARGET CORPORATION')
target.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
77,TARGET CORPORATION,TARGET CORPORATION,601 007 793,294338,"1000 NICOLLET MALL, MINNEAPOLIS, MN, 55403-254...",Active,True,False,False
0,TARGET CORPORATION,"A-TARGET, INC.",602 963 157,435176,"16454 N 91ST ST UNIT 101, SCOTTSDALE, AZ, 8526...",Terminated,False,False,False
1,TARGET CORPORATION,ACTION TARGET ACQUISITION CORP.,602 831 478,448747,"485 WEST PUTNAM AVENUE, GREENWICH, CT, 06830, ...",Terminated,False,False,False
2,TARGET CORPORATION,ACTION TARGET INC.,602 420 120,17961,"3411 S MOUNTAIN VISTA PRWY, PROVO, UT, 84606, ...",Terminated,False,False,False
3,TARGET CORPORATION,ACTION TARGET INC.,603 126 062,482951,"3411 S MOUNTAIN VISTA PKWY, PROVO, UT, 84606, ...",Active,False,False,False


## Matches for Step 1

In [203]:
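# The cell that builds the *_1 DataFrames is not shown; presumably it ran
# get_company_list_name_matches on the first 200 owners (owner_search_list[:200]).
exact_matches_1.to_csv('exact_matches_1.csv')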
potential_matches_1.to_csv('potential_matches_1.csv')
additional_matches_1.to_csv('additional_matches_1.csv')

In [204]:
exact_matches_2, potential_matches_2, additional_matches_2 = get_company_list_name_matches(owner_search_list[200:])

In [205]:
exact_matches_2.to_csv('exact_matches_2.csv')
potential_matches_2.to_csv('potential_matches_2.csv')
additional_matches_2.to_csv('additional_matches_2.csv')

In [247]:
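# Trying this with our first 200 owners.
# get_company_list_name_matches returns three DataFrames (exact, potential, additional),
# so combine them into one before previewing.
owner_search_chunk_1 = pd.concat(get_company_list_name_matches(owner_search_list[:200]), ignore_index=True)

owner_search_chunk_1.head()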

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match,IsMatch
0,YESSIX LLC,"YESSIX, LLC",603 446 762,1082614,"2900 NE BLAKELEY ST STE B, SEATTLE, WA, 98105,...",Active,False,False,False,
1,XERAD LLC,"THE TAX ERADICATOR BOOKKEEPING & TAX SERVICES,...",603 427 461,986537,"10003 201ST AVENUE PL E, BONNEY LAKE, WA, 9839...",Active,False,False,False,
2,XERAD LLC,"XERAD I, LLC",602 498 747,526872,"3021 4TH AVE, SEATTLE, WA, 98121, UNITED STATES",Administratively Dissolved,False,False,False,
3,XERAD LLC,"XERAD II, LLC",602 498 750,526873,"3425 67TH SE, MERCER ISLAND, WA, 98040, UNITED...",Active,False,False,False,
4,WRI 2200 WESTLAKE LP,WRI 2200 WESTLAKE LP,603 587 485,528804,"500 N BROADWAY, SUITE 201, JERICHO, NY, 11753,...",Active,False,False,False,


## (Optional) Step 1b

You can ask annotators to check the possible matches found by the scraping script. They can enter 1 (match) or 0 (no match) in an "isMatch" column.
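
Once the annotations come back, only the confirmed rows need to move on to the next step. This is a minimal sketch with made-up company names and labels; it assumes the annotated file keeps the "isMatch" column name and that blank cells mean "not reviewed yet".

```python
import io

import pandas as pd

# Hypothetical annotated CSV; names, IDs, and labels are invented for illustration.
annotated_csv = io.StringIO(
    "SearchTerm,BusinessName,BusinessId,isMatch\n"
    "EXAMPLE LLC,EXAMPLE ONE LLC,1,0\n"
    "EXAMPLE LLC,EXAMPLE TWO LLC,2,1\n"
    "EXAMPLE LLC,EXAMPLE THREE LLC,3,\n"
)
annotated = pd.read_csv(annotated_csv)

# Blank cells are read as NaN, so compare against 1 explicitly.
confirmed = annotated[annotated["isMatch"] == 1]
unreviewed = annotated[annotated["isMatch"].isna()]

print(confirmed["BusinessName"].tolist())
print(unreviewed["BusinessName"].tolist())
```

Listing the unreviewed rows separately makes it harder to silently drop companies that no one has looked at yet.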

## Step 2

Fetch the principals for each company found in Step 1.

In [4]:
def get_business_details(business_id):
    url = 'https://cfda.sos.wa.gov/api/BusinessSearch/BusinessInformation?businessID={business_id}'.format(business_id=business_id)
    r = requests.get(url)
    return json.loads(r.text)

def extract_principals(business_res, business_id):
    agent = business_res['Agent']['EntityName']
    rows = [[
        # name of company?
        business_id,
        agent,
        'Entity' if principal['TypeID'] == 'E' else 'Individual',
        principal['PrincipalID'],
         principal['Name'] if principal['TypeID'] == 'E' else principal['FirstName'] + ' ' + principal['LastName']
    ] for principal in business_res['PrincipalsList']]
    return pd.DataFrame(rows, columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])

def get_companies_principals(business_names_df):
    '''
    Takes a DF of companies with BusinessId and returns a DF of each company's principals, 
    with one row for each principal.
    '''
    principals = pd.DataFrame([], columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])
    for business in business_names_df['BusinessId']:
        business_res = get_business_details(business)
        principals = pd.concat([extract_principals(business_res, business), principals], ignore_index=True)
    
    merged_principals = pd.merge(business_names_df, principals, on='BusinessId', how='left')
    
    return merged_principals

Start with the first exact matches results for proof of concept.

In [5]:
exact_matches_1 = pd.read_csv("exact_matches_1.csv")
exact_matches_1_principals = get_companies_principals(exact_matches_1)

exact_matches_2 = pd.read_csv("exact_matches_2.csv")
exact_matches_2_principals = get_companies_principals(exact_matches_2)

In [10]:
potential_matches_1 = pd.read_csv("potential_matches_1.csv")
potential_matches_1 = potential_matches_1[potential_matches_1['isMatch']==1]
potential_matches_1_principals = get_companies_principals(potential_matches_1)

potential_matches_2 = pd.read_csv("potential_matches_2.csv")
# Filter batch 2 by its own annotations (the column is assumed to be named 'isMatch', as in batch 1)
potential_matches_2 = potential_matches_2[potential_matches_2['isMatch']==1]
potential_matches_2_principals = get_companies_principals(potential_matches_2)


You will need to run the potential matches and the 2nd batch, too.

In [11]:
all_matches = pd.concat([exact_matches_1, 
                        exact_matches_2,
                        potential_matches_1,
                        potential_matches_2]) 
all_matches_principals = pd.concat([exact_matches_1_principals, 
                                    exact_matches_2_principals, 
                                    potential_matches_1_principals,
                                    potential_matches_2_principals])

In [12]:
all_matches.to_csv("all_matches.csv", index=False)

In [13]:
# Output has one row per company per governor,
# so one company can have multiple rows if it has multiple governors.
all_matches_principals.to_csv('all_matches_principals.csv', index=False)

## Step 3

Find every company associated with the governors found in Step 2.

This is slightly convoluted because of the API. The process is:

a) Search for the principal's name using the advanced search API. 

b) Get a paginated list of results. This will *not* include the principal's name because... reasons?

c) For each result, send another request to the business information endpoint to fetch the business details.

d) If the company's principals include the original principal we were looking for, save the business' information.
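
The control flow of steps a) through d) can be sketched without touching the network by passing the two HTTP calls in as functions. Everything here is a simplified assumption: `companies_for_principal`, the fake data, and the response shapes are stand-ins, and the real CCFS responses carry many more fields.

```python
def companies_for_principal(principal_name, search_page, fetch_details, page_size=100):
    """Steps a-d with the two HTTP calls injected as functions.

    search_page(name, page_num) -> list of {"BusinessID", "BusinessName"} dicts
    fetch_details(business_id)  -> {"PrincipalsList": [principal names]}
    """
    matches = []
    page_num = 1
    while True:
        page = search_page(principal_name, page_num)          # (a) and (b)
        for hit in page:
            details = fetch_details(hit["BusinessID"])        # (c)
            if principal_name in details["PrincipalsList"]:   # (d)
                matches.append(hit["BusinessName"])
        if len(page) < page_size:  # a short page means it was the last one
            break
        page_num += 1
    return matches


# In-memory fakes so the sketch runs offline. The fake search returns every
# business, mimicking results that do not say which principal matched.
FAKE_INDEX = [
    {"BusinessID": 1, "BusinessName": "ALPHA LLC"},
    {"BusinessID": 2, "BusinessName": "BETA LLC"},
    {"BusinessID": 3, "BusinessName": "GAMMA LLC"},
]
FAKE_DETAILS = {
    1: {"PrincipalsList": ["JANE DOE"]},
    2: {"PrincipalsList": ["JOHN ROE"]},
    3: {"PrincipalsList": ["JANE DOE", "JOHN ROE"]},
}


def fake_search(name, page_num, page_size=2):
    start = (page_num - 1) * page_size
    return FAKE_INDEX[start:start + page_size]


result = companies_for_principal("JANE DOE", fake_search, FAKE_DETAILS.get, page_size=2)
print(result)
```

Injecting the fetchers also makes it easy to add caching or rate limiting later, since each business ID may be requested many times across principals.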

In [14]:
principal_url = 'https://cfda.sos.wa.gov/api/BusinessSearch/GetAdvanceBusinessSearchList'

principal_headers = {
    'Accept-Language': 'en-US,en;q=0.8,es-AR;q=0.5,es;q=0.3',
    'Referer': 'https://ccfs.sos.wa.gov/',
    'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', # this might be an issue
    'Origin': 'https://ccfs.sos.wa.gov'
}

def get_principal_data(principal_name, page_num):
        principal_name = urllib.parse.quote(principal_name)
        return 'Type=Principal&BusinessStatusID=0&SearchEntityName=&SearchType=&BusinessTypeID=0&AgentName=&PrincipalName={principal_name}&StartDateOfIncorporation=&EndDateOfIncorporation=&ExpirationDate=&IsSearch=true&IsShowAdvanceSearch=true&&&AgentAddress%5BIsAddressSame%5D=false&AgentAddress%5BIsValidAddress%5D=false&AgentAddress%5BisUserNonCommercialRegisteredAgent%5D=false&AgentAddress%5BIsInvalidState%5D=false&AgentAddress%5BbaseEntity%5D%5BFilerID%5D=0&AgentAddress%5BbaseEntity%5D%5BUserID%5D=0&AgentAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&AgentAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&AgentAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&AgentAddress%5BID%5D=0&&&&AgentAddress%5BState%5D=WA&&AgentAddress%5BCountry%5D=USA&&&&&&&&PrincipalAddress%5BIsAddressSame%5D=false&PrincipalAddress%5BIsValidAddress%5D=false&PrincipalAddress%5BisUserNonCommercialRegisteredAgent%5D=false&PrincipalAddress%5BIsInvalidState%5D=false&PrincipalAddress%5BbaseEntity%5D%5BFilerID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BUserID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&PrincipalAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&PrincipalAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&PrincipalAddress%5BID%5D=0&&&&PrincipalAddress%5BState%5D=&&PrincipalAddress%5BCountry%5D=USA&&&&&&PageID={page_num}&PageCount=100'.format(principal_name=principal_name, page_num=page_num)

def get_principal_response(principal_name, page_num):
    data = get_principal_data(principal_name, page_num)
    r = requests.post(principal_url, data=data, headers=principal_headers)
    return json.loads(r.text)

In [None]:
get_principal_response('SMITH', 1)

In [15]:
def extract_principals_business(business_res, business_id, business_name):
    """
        Given a json of the business search result business_res and the business' id, 
        Create a dataframe of all the principals returned in business_res.
    """
    agent = business_res['Agent']['EntityName']
    rows = [[
        business_res['UBINumber'],
        business_id,
        business_name,
        agent,
        'Entity' if principal['TypeID'] == 'E' else 'Individual',
        principal['PrincipalID'],
        principal['Name'] if principal['TypeID'] == 'E' else principal['FirstName'] + ' ' + principal['LastName'],
        business_res['PrincipalOffice']['PrincipalStreetAddress']['FullAddress'],
        business_res["BusinessStatus"]
    ] for principal in business_res['PrincipalsList']]
    return pd.DataFrame(rows, columns=['UBINumber', 'BusinessId', 'BusinessName', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName', "Address", "Status"])

def get_all_principal_search_results(principal_name):
    n = 1
    res_length = 100
    search_results = []
    
    while res_length == 100:
        res = get_principal_response(principal_name, n)
        search_results += res
        n += 1
        res_length = len(res)
    
    return search_results

def get_principals_from_all_results_pages(principal_name):
    """
        given a principal_name, gets all the results across all potential response pages
        TODO: filter to only include business results that we care about, ie. that are 
        in the list of exact_matches. 
    """
    # should include company name, see below
    principals = pd.DataFrame([], columns=['UBINumber', 'BusinessId', 'BusinessName', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])

    try:
        search_results = get_all_principal_search_results(principal_name)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f"couldn't get results for {principal_name}")
        return principals
    # print(search_results)
    business_ids = [res['BusinessID'] for res in search_results]
    business_names = [res['BusinessName'] for res in search_results]

    for business_id, name in zip(business_ids, business_names):
        business_json = get_business_details(business_id)
        principals_df = extract_principals_business(business_json, business_id, name)

        # Keep the business only if the principal we searched for is actually listed on it.
        if (principals_df['PrincipalName'] == principal_name).any():
            principals = pd.concat([principals_df, principals], ignore_index=True)
    # Restrict to businesses we care about; relies on the notebook-level all_matches DataFrame.
    principals = principals[principals['BusinessId'].isin(all_matches['BusinessId'].values)]
    return principals

In [75]:
principal_responses = get_principals_from_all_results_pages('UNICO PROPERTIES LLC')

In [76]:
len(principal_responses)

6

### Formatting 

We need to format the results from get_principals_from_all_results_pages into a CSV with the following format:

- SearchTerm: Original owner name from the 2020 building emissions dataset
- BusinessName: The business name in the CCFS database that we have matched to the SearchTerm
- PotentialRelatedCompany: A company that may be related to the company in the BusinessName field. If this field is the same as BusinessName, the row represents the "parent" or "hub" company that we are trying to match companies to
- UBINumber: ID number
- BusinessId: ID number
- Address: Address of the PotentialRelatedCompany
- Status: If the company is active/closed, etc.
- Principals: A comma-separated, alphabetized list of the PotentialRelatedCompany's principals
- isMatch: Your best guess about whether or not the PotentialRelatedCompany is connected to the BusinessName
- notes: any useful notes about the company or explaining the isMatch value

This is also available in the [sample output spreadsheet](https://github.com/linnealovespie/BPS/blob/bug-fix/data/building_owners/sample_step_3_output.xlsx).
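
The Principals column described above can be built by grouping one-row-per-principal data by company. This is a small sketch on made-up rows, not the notebook's implementation; it also deduplicates names, which is a choice made for this example.

```python
import pandas as pd

# Made-up rows in the one-principal-per-row shape.
principal_rows = pd.DataFrame({
    "BusinessName": ["ALPHA LLC", "ALPHA LLC", "BETA LLC"],
    "UBINumber": ["000 000 001", "000 000 001", "000 000 002"],
    "PrincipalName": ["JOHN ROE", "JANE DOE", "JANE DOE"],
})

# One row per company, with an alphabetized, comma-separated principal list.
formatted = (
    principal_rows
    .groupby(["BusinessName", "UBINumber"])["PrincipalName"]
    .agg(lambda names: ", ".join(sorted(set(names))))
    .reset_index(name="Principals")
)

# Empty columns for the human reviewers to fill in.
formatted["isMatch"] = ""
formatted["notes"] = ""
print(formatted)
```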

In [112]:
all_matches_principals = pd.read_csv("all_matches_principals.csv", index_col=[0]).reset_index()

In [None]:
# Principal_match_list: SearchTerm, BusinessName, UBINumber, BusinessId, Address, Status, address_match, ubi_match, id_match, Agent, EntityType, PrincipalID, PrincipalName

In [16]:
def get_principal_companies(principal_match_list):
    columns=['SearchTerm', 'BusinessName', 'PotentialRelatedCompany', 'UBINumber', 'BusinessId', 'Address', 'Status', 'Agent', 'Principals', 'isMatch', 'notes']
    results = pd.DataFrame([], columns=columns)  # the second positional argument is `index`, so pass columns by keyword
    for idx, row in principal_match_list.iterrows():
        possible_matching_companies_df = get_principals_from_all_results_pages(row['PrincipalName'])
        grouped = possible_matching_companies_df.groupby("BusinessName")
        for name, group in grouped:
            principals_list = group["PrincipalName"].tolist()
            principals_list.sort()
            poss_company = group.iloc[0]

            # # row['BusinessName']: the name we mapped to from SearchTerm
            # # poss_company['BusinessName]: PotentialRelatedCompany
            new_row = pd.Series(data=[row['SearchTerm'], 
                            row["BusinessName"], 
                            poss_company["BusinessName"],
                            poss_company['UBINumber'],
                            poss_company["BusinessId"],
                            poss_company["Address"],
                            poss_company["Status"],
                            poss_company["Agent"],
                            principals_list,
                            "", # isMatch
                            ""  # Notes
                            ], 
                        index = columns)
            results = pd.concat([new_row.to_frame().T, results], ignore_index=True).drop_duplicates(subset="UBINumber").dropna()
            # results = results[results['BusinessId'].isin(all_matches['BusinessId'])]
        if(idx % 25 == 0): 
            print(f"Processing row {idx} of principal_match_list, results is {len(results)}")
            results.to_csv("companies_and_potential_matches.csv")
    return results

In [18]:
companies_and_potential_matches = get_principal_companies(all_matches_principals)
companies_and_potential_matches.head()

Processing row 0 of principal_match_list, results is 1


In [100]:
companies_and_potential_matches.to_csv("companies_and_potential_matches.csv")

In [38]:
companies_and_potential_matches = pd.read_csv("companies_and_potential_matches.csv", index_col=0)

In [43]:
companies_and_potential_matches=companies_and_potential_matches[companies_and_potential_matches['BusinessId'].isin(all_matches['BusinessId'])]

In [44]:
len(companies_and_potential_matches)

155