# Grouping Companies for non-downtown buildings

In [1]:
# Be sure to run `pip install .` in the root directory to access the utils modules
from utils import owners
from utils import geo
from utils import parcel_owners

In [2]:
import pandas as pd
import numpy as np
import requests
import json
import os
import re
import geopandas as gp
import urllib.parse

## Step 0: Clean up Seattle OSE Data
Seattle's [Office of Sustainability and Environment](https://www.seattle.gov/environment) has released emissions data for buildings in the city from 2015 to 2021. Owners of buildings larger than 20,000 sq. ft. are required to self-report annually to the City of Seattle. Features of the data include:
- building name
- address
- tax parcel identification number
- council district the building is located within
- building type
- several emissions metrics

More metadata about the energy benchmarking datasets can be found [here](https://data.seattle.gov/Community/2021-Building-Energy-Benchmarking/bfsh-nrm6).

During initial exploratory analysis, we noticed that not all of the council district codes were correct. To rectify this, we obtained the official council district boundaries from the [Seattle GeoData Portal](https://data-seattlecitygis.opendata.arcgis.com/datasets/seattle-city-council-districts-for-council-members-serving-through-2023/explore) as a GeoJSON file, [`Council_Districts.geojson`](../../../data/Council_Districts.geojson). We convert each address in the OSE building efficiency dataset to a (latitude, longitude) point and confirm which council district it belongs to.

To learn more, refer to the `geo` module in `utils`, which provides the `clean_districts` function used in the next cell.
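The district check described above is a point-in-polygon test. The sketch below illustrates the idea with a plain ray-casting routine and two made-up rectangular districts. It is not the implementation in `utils`; the function names `point_in_ring` and `find_district` and the toy boundaries are assumptions for illustration only, and real district polygons (with holes or multiple parts) would be better handled by a geometry library.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)
Ring = List[Point]           # polygon outline as a list of vertices


def point_in_ring(point: Point, ring: Ring) -> bool:
    """Ray casting: toggle 'inside' each time a rightward ray crosses an edge."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Only edges that straddle the ray's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_district(point: Point, districts: Dict[str, Ring]) -> Optional[str]:
    """Return the first district whose outline contains the point, else None."""
    for name, ring in districts.items():
        if point_in_ring(point, ring):
            return name
    return None


# Hypothetical rectangular "districts", not the real Seattle boundaries.
toy_districts = {
    "A": [(-122.40, 47.60), (-122.30, 47.60), (-122.30, 47.70), (-122.40, 47.70)],
    "B": [(-122.30, 47.60), (-122.20, 47.60), (-122.20, 47.70), (-122.30, 47.70)],
}

print(find_district((-122.34, 47.64), toy_districts))  # falls in rectangle A
print(find_district((-122.00, 47.00), toy_districts))  # outside both rectangles
```

A building with no coordinates cannot be tested this way, so it would remain without a district until its location is filled in.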


In [3]:
# Clean the OSE dataset 
df_districts = gp.read_file("../../../data/Council_Districts.geojson")
df = pd.read_csv('../../../data/2020_Building_Energy_Benchmarking.csv')
df = gp.GeoDataFrame(df, geometry=gp.points_from_xy(df.Longitude, df.Latitude))
geo.clean_districts(df, df_districts)

# Get all the buildings outside the downtown neighborhood
# Note: neighborhood bounds differ slightly from council district bounds
df_filtered = df.loc[df['Neighborhood'] != "DOWNTOWN"]

Building WATERWORKS OFFICE & MARINA 2353/ 4088803975 doesn't have a district POINT (-122.33895 47.63575) 
	 Found district 7 for WATERWORKS OFFICE & MARINA
Building NAUTICAL LANDING 2381/ 4088804350 doesn't have a district POINT (-122.34219 47.64306) 
	 Found district 7 for NAUTICAL LANDING
Building UNION HARBOR CONDOMINIUM 2540/ 8807200000 doesn't have a district POINT (-122.33003 47.6401) 
	 Found district 4 for UNION HARBOR CONDOMINIUM
Building THE PIER AT LESCHI 2997/ 6780900000 doesn't have a district POINT (-122.28563 47.59926) 
	 Found district 3 for THE PIER AT LESCHI
Building THE LAKESHORE 3046/ 1180001715 doesn't have a district POINT EMPTY 
Building EDUCARE 3218/ 2895800030 doesn't have a district POINT EMPTY 


In [4]:
df_filtered.head()

Unnamed: 0,OSEBuildingID,DataYear,BuildingName,BuildingType,TaxParcelIdentificationNumber,Address,City,State,ZipCode,Latitude,...,Electricity(kWh),SteamUse(kBtu),NaturalGas(therms),ComplianceStatus,ComplianceIssue,Electricity(kBtu),NaturalGas(kBtu),TotalGHGEmissions,GHGEmissionsIntensity,geometry
22,28,2020,GRAHAM HILL ELEMENTARY SCHOOL (SPS-DISTRICT),SPS-District K-12,1102000138,5101 S GRAHAM ST,SEATTLE,WA,98118.0,47.54576,...,180482,0,6241,Compliant,No Issue,615806.0,624133.0,36.4,0.6,POINT (-122.26853 47.54576)
23,29,2020,WATERTOWN HOTEL,NonResidential,1142000755,4242 ROOSEVELT WAY NE,SEATTLE,WA,98105.0,47.65958,...,558500,0,7336,Compliant,No Issue,1905602.0,733634.0,49.1,0.8,POINT (-122.31738 47.65958)
25,32,2020,HOMEWOOD SUITES -SEATTLE - Hilton,NonResidential,660001832,1011 PIKE ST,SEATTLE,WA,98101.0,47.61301,...,1383664,0,27690,Compliant,No Issue,4721062.0,2769020.0,172.2,1.3,POINT (-122.32929 47.61301)
27,34,2020,MEANY MIDDLE SCHOOL (SPS-DISTRICT),SPS-District K-12,688000090,300 20TH AVE E,SEATTLE,WA,98112.0,47.62266,...,361212,0,31268,Compliant,No Issue,1232456.0,3126792.0,172.6,1.4,POINT (-122.30547 47.62266)
29,36,2020,JANE ADDAMS MIDDLE (SPS-DISTRICT),SPS-District K-12,752000170,11031 34TH AVE NE,SEATTLE,WA,98125.0,47.70994,...,328674,0,44380,Compliant,No Issue,1121434.0,4438026.0,241.7,1.5,POINT (-122.29301 47.70994)


## Step 1: Getting Parcel Owners from eRealProperty

The first step in finding a building's owner is to find the owner of the building parcel. This is listed in [King County's eRealProperty database](https://blue.kingcounty.com/Assessor/eRealProperty/default.aspx). 

This script takes a list of buildings' Tax Parcel Identification Numbers and writes a CSV listing the current owners according to eRealProperty. Optionally, you can also produce a JSON file with the number, types, and square footage of the different units in each building.

To use:

1. Create an instance of the `ParcelLookupHelper` class, passing the file path where you want to save your results.
2. Run the `scrape_parcel_owners` method. Params:
    - `tax_parcel_id_numbers` (list): a list of the tax parcel IDs you want to look up
    - `file_name` (str): the file name to save the results
    - `get_unit_details` (bool): whether or not to create a JSON file of the types of units in each building. Defaults to `False`. 

Sample use:

```
scraper = ParcelLookupHelper('building_owners')
scraper.scrape_parcel_owners([659000030, 659000220], 'building_owners_grp_1', True)
```

Two important notes: 

- This is a web scraping script, so it is highly dependent on the HTML structure of the Property Detail pages. Test it to make sure you're getting the correct data in case the HTML structure has changed since this script was written.
- eRealProperty restricts you to 1,000 calls a day from a given IP address. You can divide your buildings into chunks of 1,000 and run them over several days, or divide them between team members. Using different AWS or Google Cloud instances is possible but probably a bit rude. You could also use a VPN for similar results. Either way, you'll have to manually split the work into chunks of at most 1,000 calls.
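
The chunking mentioned above can be sketched as follows. This is a minimal example, not part of the `utils` package: `chunk_parcel_ids` is a hypothetical helper, the parcel IDs are placeholders, and dropping duplicate IDs before chunking is an assumption made here to avoid spending calls on the same parcel twice.

```python
from typing import Iterator, List, Sequence

DAILY_LIMIT = 1000  # per-IP daily call limit described above


def chunk_parcel_ids(parcel_ids: Sequence[int], size: int = DAILY_LIMIT) -> Iterator[List[int]]:
    """Yield consecutive batches of at most `size` unique parcel IDs."""
    unique_ids = list(dict.fromkeys(parcel_ids))  # drop duplicates, keep order
    for start in range(0, len(unique_ids), size):
        yield unique_ids[start:start + size]


# Placeholder IDs for illustration only.
example_ids = list(range(659000000, 659002500))
batches = list(chunk_parcel_ids(example_ids))

for day, batch in enumerate(batches, start=1):
    # Each batch could be passed to scrape_parcel_owners on a separate day,
    # with a distinct file name per batch.
    print(f"Day {day}: {len(batch)} parcels")
```

Writing each batch to its own file makes it easy to resume after the daily limit is reached and to combine the results afterwards.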


In [12]:
# Warning: This will likely take a long time to run!
scraper = parcel_owners.ParcelLookupHelper(os.getcwd())
scraper.scrape_parcel_owners(df_filtered['TaxParcelIdentificationNumber'][:10], 'building_owners_downtown')

Writing parcel owners to CSV: c:\Users\linne\Documents\BPS\experiments\landlords\parent_company_search/building_owners_downtown.csv


In [6]:
building_owners = pd.read_csv("building_owners_downtown.csv", index_col=0)
building_owners.head()

Unnamed: 0,TaxParcelIdentificationNumber,Owner
0,7733600135,TMUD GSL LLC
1,1991200090,HH SEATTLE LLC
2,659000775,ACORN DEVELOPMENT LLC
3,4083306985,BRE-BMR 34TH LLC
4,660001605,MIDTOWN21 LLC


In [7]:
# Map tax ids to landlord name
d = pd.Series(building_owners.Owner.values, index=building_owners.TaxParcelIdentificationNumber).to_dict()
df_filtered['Landlord'] = df_filtered['TaxParcelIdentificationNumber'].map(lambda row: d.get(row, ""))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)


## Step 2: Getting Company Principals from the Washington State Corporations and Charities Filing System (CCFS)

### Step 2.1: Map Landlord Names to Businesses in the CCFS Database

In [8]:
unique_not_downtown_landlords = df_filtered['Landlord'].unique()
unique_not_downtown_landlords = pd.DataFrame(unique_not_downtown_landlords, columns=['owner_name'])
unique_not_downtown_landlords = unique_not_downtown_landlords[~unique_not_downtown_landlords['owner_name'].isin(['NOT FOUND', 'UNDEFINED'])]
unique_not_downtown_landlords.to_csv('unique_not_downtown_landlords.csv')
owner_search_list = list(unique_not_downtown_landlords['owner_name'])

In [None]:
lookup_helper = owners.LookupCompaniesHelper(os.getcwd())
lookup_helper.get_company_matches_and_export(owner_search_list[:10], 1) # Processing only 1 batch of data

### Step 2.2: Get all companies and their principals
Now that we've mapped company names to their entries in the CCFS database, we create a list of all the companies and all the principals registered to each company. This combined list of matches and principals is used in Step 2.3 to group companies by shared principals.

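For potential matches, open each `potential_matches_{i}.csv` file and manually add an `is_match` column, setting it to `1` for rows that are true matches. Only rows with `is_match == 1` are kept in the cell that builds `all_matches`.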

In [10]:
group_helper = owners.GroupCompaniesHelper(os.getcwd(), "companies_and_principals.csv")

In [47]:
all_matches = pd.DataFrame([])
num_batches = 1  # Set this to the number of batches you processed in Step 2.1

# Batch files are numbered from 1, so the range has to include num_batches
for i in range(1, num_batches + 1):
    print(f"Getting principals for exact_matches_{i}")
    exact_matches = pd.read_csv(f"exact_matches_{i}.csv", index_col=0)
    exact_matches_principals = group_helper.get_companies_principals(exact_matches)
    all_matches = pd.concat([all_matches, exact_matches_principals], ignore_index=True)

for i in range(1, num_batches + 1):
    print(f"Getting principals for potential_matches_{i}")
    potential_matches = pd.read_csv(f"potential_matches_{i}.csv", index_col=0)
    # Keep only the rows that were manually confirmed as matches
    potential_matches = potential_matches[potential_matches['is_match'] == 1]
    potential_matches_principals = group_helper.get_companies_principals(potential_matches)
    potential_matches_principals = potential_matches_principals.drop(columns=["is_match"])
    all_matches = pd.concat([all_matches, potential_matches_principals], ignore_index=True)


### Step 2.3: Group Companies by shared principals
Now that we have all of the matched companies in the CCFS database and all of the principals registered to each company, we can group the results by shared principals.

In [40]:
companies_and_matches = group_helper.group_companies_by_principals(all_matches)

Saving to c:\Users\linne\Documents\BPS\experiments\landlords\parent_company_searchcompanies_and_principals.csv
Processing row 0 of principal_match_list, results is 1
Processing row 0 of principal_match_list, results is 7
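
Grouping by shared principals can be thought of as finding connected components: two companies belong to the same group if they share a principal, directly or through a chain of other companies. The sketch below shows one way to do this with a union-find structure. It is not the implementation of `group_companies_by_principals`, whose internals are not shown here; the function name, the `(company, principal)` input shape, and the example names are all made up for illustration, and principals are matched by exact string equality.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def group_by_shared_principals(rows: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group companies linked, directly or transitively, by a shared principal."""
    parent: Dict[str, str] = {}

    def find(company: str) -> str:
        # Register unseen companies as their own root, then walk up to the root.
        parent.setdefault(company, company)
        while parent[company] != company:
            parent[company] = parent[parent[company]]  # path halving
            company = parent[company]
        return company

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    first_company_for: Dict[str, str] = {}
    for company, principal in rows:
        find(company)  # make sure companies with unique principals are kept
        if principal in first_company_for:
            union(company, first_company_for[principal])
        else:
            first_company_for[principal] = company

    groups: Dict[str, Set[str]] = defaultdict(set)
    for company in parent:
        groups[find(company)].add(company)
    return list(groups.values())


# Made-up (company, principal) pairs for illustration only.
example_rows = [
    ("ALPHA LLC", "J. DOE"),
    ("BETA LLC", "J. DOE"),
    ("BETA LLC", "R. ROE"),
    ("GAMMA LLC", "R. ROE"),
    ("DELTA LLC", "M. MOE"),
]
groups = group_by_shared_principals(example_rows)
for group in groups:
    print(sorted(group))
```

In this example ALPHA LLC and GAMMA LLC end up in the same group even though they share no principal directly, because BETA LLC links them. Whether that transitive behavior is desirable depends on how aggressively parent companies should be merged.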


## Step 3: Finding the Principals' Employer