# Who Owns the Large Buildings in Seattle?

## Problem

The GHGE dataset does not include buildings' owners. Scraping the eRealProperty website for building owners has two limitations:

1. The data quality is poor and many buildings don't have an owner listed.
1. Many corporations with multiple properties set up a separate LLC for each building. There is no straightforward way to trace a child corporation to its parent coroporation. This obfuscates the portfolio size of each company.

We will use the Corporations and Charities Filings System from the Secretary of State to figure this out. The basic process is:
    
1. Start with a company name.
1. Find that company's official name in CCFS.
1. Collect the principals/governors names from that company.
1. Collect all the businesses with those same governors. 
1. Human review to check which companies are connected based on number of overlapping governor, ID number, address, name, etc.
1. Profit?

The CCFS does not have a public API. API endpoints are in the utility methods found in each step. 

Our most up-to-date list of buildings owners, with owner names normalized (e.g., "City of Seattle" and "Seattle City" are both normalized to "City of Seattle"), can be found [here](https://github.com/linnealovespie/BPS/tree/dig_into_owners/experiments/worst_offenders#:~:text=updated_owners_2_15_23.csv). 

In [3]:
import pandas as pd
import numpy as np
import requests
from fuzzywuzzy import fuzz
import json
import os
import re
import geopandas as gp
import util



## Step 0: Isolate to D1 Landlords
1. Get all buildings in d1
2. Look up their tax parcel ID
3. Map that tax parcel ID to a landlord 
4. Aggregate landlords in d1

In [73]:
df_districts = gp.read_file("../data/Council_Districts.geojson")
df = pd.read_csv('../data/2020_Building_Energy_Benchmarking.csv')
df = gp.GeoDataFrame(df, geometry=gp.points_from_xy(df.Longitude, df.Latitude))
util.clean_districts(df, df_districts)

Building UNION HARBOR CONDOMINIUM 454/ 8807200000 doesn't have a district POINT (-122.33003 47.6401) 
	 Found district 4 for UNION HARBOR CONDOMINIUM
Building WATERWORKS OFFICE & MARINA 1494/ 4088803975 doesn't have a district POINT (-122.33895 47.63575) 
	 Found district 7 for WATERWORKS OFFICE & MARINA
Building NAUTICAL LANDING 1742/ 4088804350 doesn't have a district POINT (-122.34219 47.64306) 
	 Found district 7 for NAUTICAL LANDING
Building THE PIER AT LESCHI 3453/ 6780900000 doesn't have a district POINT (-122.28563 47.59926) 
	 Found district 3 for THE PIER AT LESCHI
Building EDUCARE 3496/ 2895800030 doesn't have a district POINT EMPTY 
Building THE LAKESHORE 3506/ 1180001715 doesn't have a district POINT EMPTY 


In [74]:
df1=df.loc[df['Neighborhood']=="DOWNTOWN"]#df.loc[((df["Neighborhood"]=="DOWNTOWN") | (df["Neighborhood"]=="LAKE UNION"))]

In [75]:
df1.head()

Unnamed: 0,OSEBuildingID,TaxParcelIdentificationNumber,DataYear,BuildingType,BuildingName,Owner,CouncilDistrictCode,Neighborhood,Units,YearBuilt,...,SecondLargestPropertyUseTypeGFA,ThirdLargestPropertyUseType,ThirdLargestPropertyUseTypeGFA,Outlier,ComplianceIssue,ComplianceStatus,Comments,DefaultData,LegislationPropertyType,geometry
2,50160,659000775,2020,NonResidential,AMAZON DOPPLER BUILDING,ACORN DEVELOPMENT LLC,7.0,DOWNTOWN,,2016,...,448625.0,Convention Center,59672.0,,No Issue,Compliant,,,Office,POINT (-122.33835 47.61523)
4,50192,660001605,2020,NonResidential,MIDTOWN 21 (AMAZON),MIDTOWN21 LLC,7.0,DOWNTOWN,,2016,...,110813.0,,,,No Issue,Compliant,,,Office,POINT (-122.33304 47.61632)
13,50304,660002125,2020,Multifamily HR (10+),KINECTS TOWER,1823 MINOR WPT LLC +1823 MINOR MM LLC,7.0,DOWNTOWN,,1970,...,117664.0,,,,No Issue,Compliant,,,Multifamily Housing,POINT (-122.33151 47.61711)
31,50172,524780-0100,2020,Multifamily MR (5-9),80 MAIN APARTMENTS,NOT FOUND,7.0,DOWNTOWN,,2016,...,1136.0,,,,No Issue,Compliant,,,Multifamily Housing,POINT (-122.33508 47.60024)
40,50322,695000225,2020,Multifamily MR (5-9),MINNIE FLATS,101 DENNY LLC,7.0,DOWNTOWN,,2016,...,,,,,No Issue,Compliant,,,Multifamily Housing,POINT (-122.35480 47.61860)


In [76]:
#parcels = pd.read_csv("../data/final_parcels.csv")
#parcels.head()
#d = pd.Series(parcels.Owner.values, index=parcels.TaxParcelIdentificationNumber).to_dict()

# this CSV uses consolidated owner names
building_owners = pd.read_csv('../experiments/worst_offenders/updated_owners_2_15_23.csv')
# Map tax ids to landlord name
d = pd.Series(building_owners.Owner.values, index=building_owners.TaxParcelIdentificationNumber).to_dict()
df1['Landlord'] = df1['TaxParcelIdentificationNumber'].map(lambda row: d.get(row, ""))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)


In [85]:
# use this
unique_d1_landlords = df1['Landlord'].unique()

In [84]:
# skip

landlords=df1.groupby("Landlord")[['TotalGHGEmissions','PropertyGFATotal']]
landlords.size()
df1.value_counts(subset=["Landlord"])

KeyError: "Columns not found: 'TotalGHGEmissions'"

In [199]:
# skip
df1_landlords = landlords.sum()
df1_landlords["NumUnits"] = landlords.size()
df1_landlords=df1_landlords.drop("")

In [200]:
#skip
df1_landlords = df1_landlords.reset_index()

In [201]:
#skip
df1_landlords.to_csv("../data/downtown_landlords.csv")

CRE WINSTON LLC         | 	CRE WINSTON, LLC
TESHOME FAMILY LLC      | 	TESHOME FAMILY LLC
TRIAD PIER 70 L L C	    |   TRIAD PIER 70 L.L.C.
BOREN DEVELOPMENT LLC   | 	BOREN DEVELOPMENT, LIMITED LIABILITY COMPANY

In [88]:
result = {
    "BusinessName": "TRIAD PIER 70 L L C", 
    "SearchEntityName": "TRIAD PIER 70 L.L.C."
}

p = re.compile("L[\s.]?L[\s,.]?[PC]" ,flags=re.IGNORECASE)  
p.match("LLC")

result['BusinessName']=result["BusinessName"].replace(",", "")
result['BusinessName']= re.sub(p, "LLC", result['BusinessName'])
result['BusinessName']=result["BusinessName"].replace("LIMITED LIABILITY COMPANY", "LLC") 
result['BusinessName']=result["BusinessName"].replace("LIMITED PARTNERSHIP", "LLC") 
# TODO: keep original business name and add a column that's standardizedname


# Do the same for the search term, so that we have more exact matches
result["SearchEntityName"]=result["SearchEntityName"].replace(",", "").replace(".", "") 
# result["SearchEntityName"]=result["SearchEntityName"].replace("LLP", "LLC") 
result['SearchEntityName']=re.sub(p, "LLC", result['SearchEntityName'])
result["SearchEntityName"]=result["SearchEntityName"].replace("LIMITED LIABILITY COMPANY", "LLC") 
result["SearchEntityName"]=result["SearchEntityName"].replace("LIMITED PARTNERSHIP", "LLC") 

In [171]:
# Utils for finding principals

search_for_business_url = 'https://cfda.sos.wa.gov/api/BusinessSearch/GetBusinessSearchList'

def get_business_search_payload(business_name, page_count, page_num):
    return {
        'Type': 'BusinessName',
        'SearchEntityName': business_name,
        'SearchType': 'BusinessName',
        'SortType': 'ASC',
        'SortBy': 'Entity Name',
        'SearchValue': business_name,
        'SearchCriteria': 'Contains',
        'IsSearch': 'true',
        'PageID': page_num,
        'PageCount': page_count,
    }


def get_business_search_results(business_name, page_num):
    r = requests.post(search_for_business_url, get_business_search_payload(business_name, 100, page_num))
    return json.loads(r.text)

# given one JSON element `result` in the list of search results, standardize
# the business name and address to collapse results into one 
def standardize_result(search_term, result):
    # Don't care about the result if it doesn't have an "active" status
    if(result["Status"] != "ACTIVE"): return

    # LLC, LLP, L L C, L.L.C., L.L.C. L.L.P., L.L.P
    # Limited Partnership, Limited liability company
    # Comma before any of the above
    # Just map all the results to be standardized to this name, then drop duplicates based on name? 
    p = re.compile("L[\s.]?L[\s,.]?[PC]" ,flags=re.IGNORECASE)

    result['BusinessName']=result["BusinessName"].replace(",", "")
    result['BusinessName']= re.sub(p, "LLC", result['BusinessName'])
    result['BusinessName']=result["BusinessName"].replace("LIMITED LIABILITY COMPANY", "LLC") 
    result['BusinessName']=result["BusinessName"].replace("LIMITED PARTNERSHIP", "LLC") 

    # Do the same for the search term, so that we have more exact matches
    # TODO: need to add this as an element of the dict
    result["SearchTerm"]=search_term.replace(",", "")
    result["SearchTerm"]=re.sub(p, "LLC", search_term)
    result["SearchTerm"]=search_term.replace("LIMITED PARTNERSHIP", "LLC") 
    result["SearchTerm"]=search_term.replace("LIMITED LIABILITY COMPANY", "LLC") 
    
    # Strip addressses of all commas
    result['Address'].replace(",", "")
    return result

def extract_search_results(search_term, search_req_response):
    # TODO: add all the columns, or at least filing status 
    # TODO: collapse all listings with L.L.P or L.L.C or LLC LLP
    # res_list = [standardize_result(res) for res in search_req_response]
    res_list = [[search_term, res['BusinessName'], res['UBINumber'], res['BusinessID'], res['PrincipalOffice']['PrincipalStreetAddress']['FullAddress'], res["BusinessStatus"]] for res in search_req_response]
    res_df = pd.DataFrame(res_list, columns=['SearchTerm', 'BusinessName', 'UBINumber', 'BusinessId', 'Address', "Status"])
    # print(res_df)
    # res_df = res_df[res_df['Status']`=="Active"]#res_df.drop(res_df[res_df["Status"]=="Terminated"].index)
    # TODO: If there's an exact match, keep only that business 
    # Basically keep a list of exact matches, and build a list of potential matches that we give to human verifiers
    exact_match = res_df.index[res_df['BusinessName'] == search_term].tolist()
    if exact_match:
        # print(res_df)
        # print(exact_match)
        res_df = pd.concat([res_df.iloc[[exact_match[0]],:], res_df.drop(exact_match[0], axis=0)], axis=0)
    return res_df
    

# Mark row as potential match: UBI number is a duplicate, or Address is the same
# df.duplicated just sees if that address is already in the dataframe, NOT that the serach term
# and result have the same address. Could add search terms as a subset for duplicated call
def determine_search_matches(search_results_df):
    search_results_df['address_match'] = search_results_df.duplicated(subset=['Address'], keep=False) 
    search_results_df['ubi_match'] = search_results_df.duplicated(subset=['UBINumber'], keep=False)
    search_results_df['id_match'] = search_results_df.duplicated(subset=['BusinessId'], keep=False)

def get_business_details(business_id):
    url = 'https://cfda.sos.wa.gov/api/BusinessSearch/BusinessInformation?businessID={business_id}'.format(business_id=business_id)
    r = requests.get(url)
    return json.loads(r.text)

def get_empty_df():
    return pd.DataFrame([], columns = ['SearchTerm', 'BusinessName', 'UBINumber', 'BusinessId', 'Address', 'Status', 'address_match', 'ubi_match', 'id_match'])

In [111]:
def get_all_company_name_match_search_results(owner_name):
    n = 1
    res_length = 100
    search_results = []
    
    while res_length == 100:
        res = get_business_search_results(owner_name, n)
        search_results += (res)
        n += 1
        res_length = len(res)
    
    return search_results

In [130]:
def get_potential_company_name_matches(owner_name):
    all_search_results = get_all_company_name_match_search_results(owner_name)
    extracted_results = extract_search_results(owner_name, all_search_results)
    determine_search_matches(extracted_results)
    return extracted_results

### Filter search results

Separate your search results into Alice's three categories:

- exact match
- potential matches (where no exact match was found)
- additional matches (extra matches if there was an exact match *and* additional matches)

In [187]:
def separate_search_results(results):
    exact_matches = get_empty_df()
    exact_matches.columns
    potential_matches = get_empty_df()
    additional_matches = get_empty_df()
    
    exact_match = results[results['SearchTerm'] == results['BusinessName']]
    if len(exact_match) > 0:
        exact_matches = pd.concat([exact_matches, exact_match], ignore_index=True)
        additional_matches = pd.concat([additional_matches, results[results['SearchTerm'] != results['BusinessName']]], ignore_index=True)
    else:
        potential_matches = pd.concat([potential_matches, results], ignore_index=True)
    
    return exact_matches, potential_matches, additional_matches

In [168]:
def get_company_list_name_matches(owner_list):
    exact_matches = get_empty_df()
    potential_matches = get_empty_df()
    additional_matches = get_empty_df()
    
    for owner in owner_list:
        matches = get_potential_company_name_matches(owner)
        temp_exact, temp_potential, temp_add = separate_search_results(matches)
        exact_matches = pd.concat([temp_exact, exact_matches], ignore_index=True)
        potential_matches = pd.concat([temp_potential, potential_matches], ignore_index=True)
        additional_matches = pd.concat([temp_add, additional_matches], ignore_index=True)
    
    return exact_matches, potential_matches, additional_matches

In [244]:
buildings_and_landlords_df = df1_landlords#pd.read_csv('../../experiments/worst_offenders/landlords_with_total_energy_use_2_16_23.csv')

In [195]:
#owner_search_list = buildings_and_landlords_df['Landlord'].unique()

# hacking this
owner_search_list = unique_d1_landlords
owner_search_list = list(owner_search_list)
owner_search_list.remove('NOT FOUND')

owner_search_list[:5]

['ACORN DEVELOPMENT LLC',
 'MIDTOWN21 LLC',
 '1823 MINOR WPT LLC +1823 MINOR MM LLC',
 '101 DENNY LLC',
 'ASPEN FLOWER LLC+MAYFLOWER HOTEL OWNER LLC']

In [186]:
owner_search_list = list(owner_search_list).remove('NOT FOUND')

In [131]:
get_potential_company_name_matches('4TH AVENUE BLDG LLC')

[{"IsAvailable":false,"PrincipalOffice":{"PrincipalID":0,"SequenceNo":0,"FirstName":null,"LastName":null,"FullName":null,"Title":null,"Name":null,"MiddleName":null,"PhoneNumber":null,"EmailAddress":null,"TypeID":null,"PrincipalBaseType":null,"PrincipalMailingAddress":{"Attention":null,"NotificationAttention":null,"CorrespondenceEmailAddress":null,"ConsolidationCorrespondenceEmailAddress":null,"ZipExtension":null,"AddressEntityType":null,"IsAddressSame":false,"isUserNonCommercialRegisteredAgent":false,"baseEntity":{"FilerID":0,"UserID":0,"CreatedBy":0,"IPAddress":null,"ModifiedBy":0,"ModifiedIPAddress":null},"IsInvalidState":false,"IsAgentInWA":null,"isRAStreetAddressValid":false,"IsAddressReturnedMail":false,"FullAddress":"GIG HARBOR, WA, USA, 4329, 98335","ID":0,"StreetAddress1":"GIG HARBOR","StreetAddress2":"WA","City":"USA","State":"","OtherState":"","Country":"98335","Zip5":"","Zip4":"","PostalCode":"4329","County":null,"CountyName":null,"CountryName":"98335"},"PrincipalStreetAddre

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,4TH AVENUE BLDG LLC,4TH AVENUE BLDG LLC,604 065 660,95496,"11221 PACIFIC HWY SW, LAKEWOOD, WA, 98499-5170...",Active,False,False,False
1,4TH AVENUE BLDG LLC,4TH AVENUE BLDG LLC,602 253 701,95495,,Administratively Dissolved,False,False,False


In [133]:
target = get_potential_company_name_matches('TARGET CORPORATION')
target.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
77,TARGET CORPORATION,TARGET CORPORATION,601 007 793,294338,"1000 NICOLLET MALL, MINNEAPOLIS, MN, 55403-254...",Active,True,False,False
0,TARGET CORPORATION,"A-TARGET, INC.",602 963 157,435176,"16454 N 91ST ST UNIT 101, SCOTTSDALE, AZ, 8526...",Terminated,False,False,False
1,TARGET CORPORATION,ACTION TARGET ACQUISITION CORP.,602 831 478,448747,"485 WEST PUTNAM AVENUE, GREENWICH, CT, 06830, ...",Terminated,False,False,False
2,TARGET CORPORATION,ACTION TARGET INC.,602 420 120,17961,"3411 S MOUNTAIN VISTA PRWY, PROVO, UT, 84606, ...",Terminated,False,False,False
3,TARGET CORPORATION,ACTION TARGET INC.,603 126 062,482951,"3411 S MOUNTAIN VISTA PKWY, PROVO, UT, 84606, ...",Active,False,False,False


In [134]:
target[target['SearchTerm'] == target['BusinessName']]

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
77,TARGET CORPORATION,TARGET CORPORATION,601 007 793,294338,"1000 NICOLLET MALL, MINNEAPOLIS, MN, 55403-254...",Active,True,False,False


In [189]:
owner_search_list

In [196]:
test_exact, test_potential, test_add = get_company_list_name_matches(owner_search_list[:5])
    

In [197]:
test_add

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,101 DENNY LLC,"NWCH 101 DENNY, L.P.",603 591 921,357697,"PARKLAND HALL, 3889 MAPLE AVENUE, SUITE 200, D...",Withdrawn,False,False,False
1,ACORN DEVELOPMENT LLC,ACORN DEVELOPMENT CORP.,601 430 169,476734,,Terminated,True,False,False
2,ACORN DEVELOPMENT LLC,ACORN HOUSING DEVELOPMENT LIMITED PARTNERSHIP,601 946 235,14601,,Inactive,True,False,False


In [198]:
test_exact

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,101 DENNY LLC,101 DENNY LLC,603 370 372,3148,"506 2ND AVE, SUITE 1020, SEATTLE, WA, 98104-23...",Active,False,False,False
1,MIDTOWN21 LLC,MIDTOWN21 LLC,604 129 091,318280,"925 4TH AVE STE 3900, SEATTLE, WA, 98104-1113,...",Active,False,False,False
2,ACORN DEVELOPMENT LLC,ACORN DEVELOPMENT LLC,603 183 512,1093666,"410 TERRY AVENUE NORTH, SEATTLE, WA, 98103, U...",Active,False,False,False


In [199]:
len(owner_search_list)

397

In [200]:
exact_matches_1, potential_matches_1, additional_matches_1 = get_company_list_name_matches(owner_search_list[:200])

In [201]:
exact_matches_1.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,KAR STANDARD LLC,KAR STANDARD LLC,604 145 518,763141,"1 FEDERAL ST FL 17, BOSTON, MA, 02110-2003, UN...",Active,False,False,False
1,300 FIFTH AVENUE LLC,300 FIFTH AVENUE LLC,604 670 512,1399595,"1000 2ND AVE STE 1800, SEATTLE, WA, 98104-3619...",Active,False,False,False
2,MSI - 1ST & KING LLC,MSI - 1ST & KING LLC,602 739 680,880019,"316 OCCIDENTAL AVE S, STE 300, SEATTLE, WA, 98...",Active,False,False,False
3,BRICKMAN PACIFIC LLC,BRICKMAN PACIFIC LLC,603 445 367,74763,"C/O BRICKMAN, ONCE GREENWICH OFFICE PARK, BUIL...",Active,False,False,False
4,BPP 800 FIFTH PROPERTY OWNER LLC,BPP 800 FIFTH PROPERTY OWNER LLC,604 371 982,1248153,"233 S WACKER DR, SUITE 4700, CHICAGO, IL, 6060...",Active,False,False,False


In [202]:
potential_matches_1.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match
0,DLLC,"EXCEEDLLC, LLC",602 906 178,680608,"123 FIFTH AVENUE, NEW YORK, NY, 10003, UNITED ...",Terminated,False,False,False
1,DLLC,"EXCEEDLLC, LLC",603 253 693,669957,"250 WEST 57TH STREET, NEW YORK, NY, 10107, UNI...",Administratively Dissolved,False,False,False
2,DLLC,KDDLLC3 LIMITED LIABILITY COMPANY,605 166 608,1650596,"4416 ROLLING WATER DR, PFLUGERVILLE, TX, 78660...",Active,False,False,False
3,DLLC,"NASU FOODS, L. L. C.",602 010 413,884150,,Administratively Dissolved,False,False,False
4,HOWARD BUILDING SEATTLE LLC,"HOWARD BUILDING SEATTLE, LLC",603 468 238,154242,"614 1ST AVE, SUITE 400, SEATTLE, WA, 98104-225...",Active,False,False,False


## Matches for Step 1

In [203]:
exact_matches_1.to_csv('exact_matches_1.csv')
potential_matches_1.to_csv('potential_matches_1.csv')
additional_matches_1.to_csv('additional_matches_1.csv')

In [204]:
exact_matches_2, potential_matches_2, additional_matches_2 = get_company_list_name_matches(owner_search_list[200:])

In [205]:
exact_matches_2.to_csv('exact_matches_2.csv')
potential_matches_2.to_csv('potential_matches_2.csv')
additional_matches_2.to_csv('additional_matches_2.csv')

In [247]:
# Trying this with our first 200 

owner_search_chunk_1 = get_company_list_name_matches(owner_search_list)

owner_search_chunk_1.head()

  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_matches(owner), matches], ignore_index=True)
  matches = pd.concat([get_potential_company_name_

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match,IsMatch
0,YESSIX LLC,"YESSIX, LLC",603 446 762,1082614,"2900 NE BLAKELEY ST STE B, SEATTLE, WA, 98105,...",Active,False,False,False,
1,XERAD LLC,"THE TAX ERADICATOR BOOKKEEPING & TAX SERVICES,...",603 427 461,986537,"10003 201ST AVENUE PL E, BONNEY LAKE, WA, 9839...",Active,False,False,False,
2,XERAD LLC,"XERAD I, LLC",602 498 747,526872,"3021 4TH AVE, SEATTLE, WA, 98121, UNITED STATES",Administratively Dissolved,False,False,False,
3,XERAD LLC,"XERAD II, LLC",602 498 750,526873,"3425 67TH SE, MERCER ISLAND, WA, 98040, UNITED...",Active,False,False,False,
4,WRI 2200 WESTLAKE LP,WRI 2200 WESTLAKE LP,603 587 485,528804,"500 N BROADWAY, SUITE 201, JERICHO, NY, 11753,...",Active,False,False,False,


## (Optional) Step 1b

You can ask annotaters to check the possible matches found by the scraping script. They can put 1 and 0 into a "isMatch" column.

## Step 2

Fetch the principals for each company found in Step 1.

In [25]:
def get_business_details(business_id):
    url = 'https://cfda.sos.wa.gov/api/BusinessSearch/BusinessInformation?businessID={business_id}'.format(business_id=business_id)
    r = requests.get(url)
    return json.loads(r.text)

def extract_principals(business_res, business_id):
    agent = business_res['Agent']['EntityName']
    rows = [[
        # name of company?
        business_id,
        agent,
        'Entity' if principal['TypeID'] == 'E' else 'Individual',
        principal['PrincipalID'],
         principal['Name'] if principal['TypeID'] == 'E' else principal['FirstName'] + ' ' + principal['LastName']
    ] for principal in business_res['PrincipalsList']]
    return pd.DataFrame(rows, columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])

def get_companies_principals(business_names_df):
    '''
    Takes a DF of companies with BusinessId and returns a DF of each company's principals, 
    with one row for each principal.
    '''
    principals = pd.DataFrame([], columns=['BusinessId', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])
    for business in business_names_df['BusinessId']:
        business_res = get_business_details(business)
        principals = pd.concat([extract_principals(business_res, business), principals], ignore_index=True)
    
    merged_principals = pd.merge(business_names_df, principals, on='BusinessId', how='left')
    
    return merged_principals

Start with the first exact matches results for proof of concept.

In [209]:
exact_matches_1_principals = get_companies_principals(exact_matches_1)

You will need to run the potential matches and the 2nd batch, too.

In [210]:
exact_matches_1_principals.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match,Agent,EntityType,PrincipalID,PrincipalName
0,KAR STANDARD LLC,KAR STANDARD LLC,604 145 518,763141,"1 FEDERAL ST FL 17, BOSTON, MA, 02110-2003, UN...",Active,False,False,False,CORPORATION SERVICE COMPANY,Entity,1570323,KAONOULU RANCH LLLP
1,300 FIFTH AVENUE LLC,300 FIFTH AVENUE LLC,604 670 512,1399595,"1000 2ND AVE STE 1800, SEATTLE, WA, 98104-3619...",Active,False,False,False,,Individual,3315049,JOHN M. GREELEY
2,300 FIFTH AVENUE LLC,300 FIFTH AVENUE LLC,604 670 512,1399595,"1000 2ND AVE STE 1800, SEATTLE, WA, 98104-3619...",Active,False,False,False,,Individual,3349317,MARTIN SELIG
3,MSI - 1ST & KING LLC,MSI - 1ST & KING LLC,602 739 680,880019,"316 OCCIDENTAL AVE S, STE 300, SEATTLE, WA, 98...",Active,False,False,False,FIKSO KRETSCHMER SMITH DIXON ORMSETH PS,Individual,2114613,"H MARTIN SMITH, III"
4,BRICKMAN PACIFIC LLC,BRICKMAN PACIFIC LLC,603 445 367,74763,"C/O BRICKMAN, ONCE GREENWICH OFFICE PARK, BUIL...",Active,False,False,False,C T CORPORATION SYSTEM,Entity,3617104,BRICKMAN FUND VI REIT INC.


In [211]:
exact_matches_1_principals.to_csv('exact_matches_1_principals.csv')

## Step 3

Find every company associated with the governors found in Step 2.

This is slightly convoluted because of the API. The process is:

a) Search for the principal's name using the advanced search API. 

b) Get a paginated list of results. This will *not* include the principal's name because... reasons?

c) For each result, send another request to the business information endpoint to fetch the business details.

d) If the company's principals include the original principal we were looking for, save the business' information.

In [1]:
import urllib.parse

In [4]:
principal_url = 'https://cfda.sos.wa.gov/api/BusinessSearch/GetAdvanceBusinessSearchList'

principal_headers = {
    'Accept-Language': 'en-US,en;q=0.8,es-AR;q=0.5,es;q=0.3',
    'Referer': 'https://ccfs.sos.wa.gov/',
    'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', # this might be an issue
    'Origin': 'https://ccfs.sos.wa.gov'
}

def get_principal_data(principal_name, page_num):
        principal_name = urllib.parse.quote(principal_name)
        return 'Type=Principal&BusinessStatusID=0&SearchEntityName=&SearchType=&BusinessTypeID=0&AgentName=&PrincipalName={principal_name}&StartDateOfIncorporation=&EndDateOfIncorporation=&ExpirationDate=&IsSearch=true&IsShowAdvanceSearch=true&&&AgentAddress%5BIsAddressSame%5D=false&AgentAddress%5BIsValidAddress%5D=false&AgentAddress%5BisUserNonCommercialRegisteredAgent%5D=false&AgentAddress%5BIsInvalidState%5D=false&AgentAddress%5BbaseEntity%5D%5BFilerID%5D=0&AgentAddress%5BbaseEntity%5D%5BUserID%5D=0&AgentAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&AgentAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&AgentAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&AgentAddress%5BID%5D=0&&&&AgentAddress%5BState%5D=WA&&AgentAddress%5BCountry%5D=USA&&&&&&&&PrincipalAddress%5BIsAddressSame%5D=false&PrincipalAddress%5BIsValidAddress%5D=false&PrincipalAddress%5BisUserNonCommercialRegisteredAgent%5D=false&PrincipalAddress%5BIsInvalidState%5D=false&PrincipalAddress%5BbaseEntity%5D%5BFilerID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BUserID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&PrincipalAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&PrincipalAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&PrincipalAddress%5BID%5D=0&&&&PrincipalAddress%5BState%5D=&&PrincipalAddress%5BCountry%5D=USA&&&&&&PageID={page_num}&PageCount=100'.format(principal_name=principal_name, page_num=page_num)

def get_principal_response(principal_name, page_num):
    data = get_principal_data(principal_name, page_num)
    r = requests.post(principal_url, data=data, headers=principal_headers)
    return json.loads(r.text)

In [13]:
get_principal_response('SMITH', 1)

In [181]:
def extract_principals_business(business_res, business_id, business_name):
    """
        Given a json of the business search result business_res and the business' id, 
        Create a dataframe of all the principals returned in business_res.
    """
    agent = business_res['Agent']['EntityName']
    # res['PrincipalOffice']['PrincipalStreetAddress']['FullAddress'], res["BusinessStatus"]
    rows = [[
        # name of company?
        business_id,
        business_name,
        agent,
        'Entity' if principal['TypeID'] == 'E' else 'Individual',
        principal['PrincipalID'],
        principal['Name'] if principal['TypeID'] == 'E' else principal['FirstName'] + ' ' + principal['LastName'],
        business_res['PrincipalOffice']['PrincipalStreetAddress']['FullAddress'],
        business_res["BusinessStatus"]
    ] for principal in business_res['PrincipalsList']]
    # TODO: add address and sth
    return pd.DataFrame(rows, columns=['BusinessId', 'BusinessName', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName', "Address", "Status"])

def get_all_principal_search_results(principal_name):
    n = 1
    res_length = 100
    search_results = []
    
    while res_length == 100:
        res = get_principal_response(principal_name, n)
        search_results += res
        n += 1
        res_length = len(res)
    
    return search_results

def get_principals_from_all_results_pages(principal_name):
    """
        given a principal_name, gets all the results across all potential response pages
    """
    search_results = get_all_principal_search_results(principal_name)
    # print(search_results)
    business_ids = [res['BusinessID'] for res in search_results]
    business_names = [res['BusinessName'] for res in search_results]
    
    # should include company name, see below
    principals = pd.DataFrame([], columns=['BusinessId', 'BusinessName', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'])
    
    for id, name in zip(business_ids, business_names):
        business_json = get_business_details(id)
        # print(name)
        principals_df = extract_principals_business(business_json, id, name) # we should add the company name--write a variant of extract_principals
        # principals_df["BusinessName"] = business_json.get("BusinessName")
        # check if the principal is included
        # print(principals_df.head())
        if len(principals_df[principals_df['PrincipalName'] == principal_name]) > 0:
            principals = pd.concat([extract_principals_business(business_json, id,name), principals], ignore_index=True)
    
    return principals

This returns no results. Why? We encode the string in the data so that can't be the problem.

HAHAH, joke is on us, this is an agent and so we get no results.

In [18]:
test_principal = 'JOHN M. GREELEY'

get_all_principal_search_results(test_principal)

In [40]:
martin = get_principal_response('MARTIN SELIG', 1)

In [87]:
principal_responses = get_all_principal_search_results('STANFORD HOSPITALITY INC')

In [182]:
test_principals_results = get_principals_from_all_results_pages('STANFORD HOSPITALITY INC')

In [183]:
test_principals_results.head()

Unnamed: 0,BusinessId,BusinessName,Agent,EntityType,PrincipalID,PrincipalName,Address,Status
0,1001050,SEATTLE DOWNTOWN HOTEL & RESIDENCES LLC,C T CORPORATION SYSTEM,Entity,259920,STANFORD HOSPITALITY INC,"433 CALIFORNIA ST, STE 700, SAN FRANCISCO, CA,...",Active
1,1001050,SEATTLE DOWNTOWN HOTEL & RESIDENCES LLC,C T CORPORATION SYSTEM,Entity,606221,ELLIOTT DEVELOPMENT LLC,"433 CALIFORNIA ST, STE 700, SAN FRANCISCO, CA,...",Active
2,1180707,BELLEVUE TOD LLC,C T CORPORATION SYSTEM,Entity,2787875,STANFORD HOSPITALITY INC,"433 CALIFORNIA ST, STE 700, SAN FRANCISCO, CA,...",Active


### Formatting 

We need to format the results from get_principals_from_all_results_pages into a CSV with the following format:

- SearchTerm: Original owner name from the 2020 building emissions dataset
- BusinessName: The business name in the CCFS database that we have matched to the SearchTerm
- PotentialRelatedCompany: A company that may be related to the company in the BusinessName field. If this field is the same as BusinessName, the row represents the "parent" or "hub" company that we are trying to match companies to
- UBINumber: ID number
- BusinessId: ID number
- Address: Address of the PotentialRelatedCompany
- Status: If the company is active/closed, etc.
- Principals: A comma separated, alphabetized list of the PotenitalRelatedCompany's principals 
- isMatch: Your best guess about whether or not the PotentialRelatedCompany is connected to the BusinessName
- notes: any useful notes about the company or explaining the isMatch value

This is also available in the [sample output spreadsheet](https://github.com/linnealovespie/BPS/blob/bug-fix/data/building_owners/sample_step_3_output.xlsx).

In [127]:
exact_matches_1_principals = pd.read_csv("exact_matches_1_principals.csv", index_col=0)
exact_matches_1_principals.head()

Unnamed: 0,SearchTerm,BusinessName,UBINumber,BusinessId,Address,Status,address_match,ubi_match,id_match,Agent,EntityType,PrincipalID,PrincipalName
0,KAR STANDARD LLC,KAR STANDARD LLC,604 145 518,763141,"1 FEDERAL ST FL 17, BOSTON, MA, 02110-2003, UN...",Active,False,False,False,CORPORATION SERVICE COMPANY,Entity,1570323,KAONOULU RANCH LLLP
1,300 FIFTH AVENUE LLC,300 FIFTH AVENUE LLC,604 670 512,1399595,"1000 2ND AVE STE 1800, SEATTLE, WA, 98104-3619...",Active,False,False,False,,Individual,3315049,JOHN M. GREELEY
2,300 FIFTH AVENUE LLC,300 FIFTH AVENUE LLC,604 670 512,1399595,"1000 2ND AVE STE 1800, SEATTLE, WA, 98104-3619...",Active,False,False,False,,Individual,3349317,MARTIN SELIG
3,MSI - 1ST & KING LLC,MSI - 1ST & KING LLC,602 739 680,880019,"316 OCCIDENTAL AVE S, STE 300, SEATTLE, WA, 98...",Active,False,False,False,FIKSO KRETSCHMER SMITH DIXON ORMSETH PS,Individual,2114613,"H MARTIN SMITH, III"
4,BRICKMAN PACIFIC LLC,BRICKMAN PACIFIC LLC,603 445 367,74763,"C/O BRICKMAN, ONCE GREENWICH OFFICE PARK, BUIL...",Active,False,False,False,C T CORPORATION SYSTEM,Entity,3617104,BRICKMAN FUND VI REIT INC.


In [None]:
# Principal_match_list: SearchTerm, BusinessName, UBINumber, BusinessId, Address, Status, address_match, ubi_match, id_match, Agent, EntityType, PrincipalID, PrincipalName

In [247]:
# fill me in and run me below!

def get_principal_companies(principal_match_list):
    columns=['SearchTerm', 'BusinessName', 'PotentialRelatedCompany', 'UBINumber', 'Address', 'Status', 'Principals', 'isMatch', 'notes']
    results = pd.DataFrame([], columns)
    for idx, row in principal_match_list.iterrows():
        # print(row)
        # possible_matching_companies_df: 'BusinessId', 'BusinessName', 'Agent', 'EntityType', 'PrincipalID', 'PrincipalName'
        possible_matching_companies_df = get_principals_from_all_results_pages(row['PrincipalName'])
        # print(possible_matching_companies_df)
        grouped = possible_matching_companies_df.groupby("BusinessName")
        # print(group.groups)
        for name, group in grouped:
            # print(name)
            principals_list = group["PrincipalName"].tolist() #TODO: Alpahbetize!
            principals_list.sort()
            poss_company = group.iloc[0]
            # print(principals_list)

            # # row['BusinessName']: the name we mapped to from SearchTerm
            # # poss_company['BusinessName]: PotentialRelatedCompany
            new_row = pd.Series(data=[row['SearchTerm'], 
                            row["BusinessName"], 
                            poss_company["BusinessName"],
                            poss_company["BusinessId"],
                            poss_company["Address"],
                            poss_company["Status"],
                            principals_list,
                            "", # isMatch
                            ""  # Notes
                            ], 
                        index = columns)
            # print(new_row)
            results = pd.concat([new_row.to_frame().T, results], ignore_index=True).dropna()
            #if(name == 'MARTIN SMITH DEVELOPMENT CORPORATION') : return results
        if(idx % 25 == 0): 
            print(f"Processing row {idx} of principal_match_list, results is {len(results)}")
            # print(results.tail()) # periodically check in

    return results

In [248]:
exact_matches_1_final = get_principal_companies(exact_matches_1_principals)
exact_matches_1_final.head()

Processing row 0 of principal_match_list, results is 1
Processing row 25 of principal_match_list, results is 1825


KeyboardInterrupt: 

In [249]:
exact_matches_1_final.to_csv("exact_matches_1_final.csv")

NameError: name 'exact_matches_1_final' is not defined