## Match Reference USA School Data to School Building Data

Reference USA provides details on businesses, which include schools. The Reference USA data can be used to help validate other data sources and identify education organizations that may be outside of the NCES data.

Based on help from:

https://osmnx.readthedocs.io/en/stable/osmnx.html

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html

The goal is to accurately assign business information to buildings.


## Description of Program
- program:    IN-CORE_2bv1_MatchRefUSASchoolBuilding
- task:       Match RefUSA point data to nearest building location
- Version:    2021-06-17
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, N. (2021) "Obtain, Clean, and Explore Labor Market Allocation Methods".
Archived on Github and ICPSR.

In [1]:
%matplotlib inline

import pandas as pd
import geopandas as gpd
import numpy as np  # group by aggregation
import folium as fm # folium has more dynamic maps - but requires internet connection

In [2]:
# Display versions being used - important information for replication
import sys
print("Python Version     ", sys.version)
print("geopandas version: ", gpd.__version__)
print("pandas version:    ", pd.__version__)
print("numpy version:     ", np.__version__)
print("folium version:    ", fm.__version__)

Python Version      3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:37:01) [MSC v.1916 64 bit (AMD64)]
geopandas version:  0.9.0
pandas version:     1.2.4
numpy version:      1.20.2
folium version:     0.12.1


In [3]:
import os # For saving output to path
# Store Program Name for output files to have the same name
programname = "IN-CORE_2cv1_MatchRefUSASchoolBuilding_2021-06-19"
# Make directory to save output
if not os.path.exists(programname):
    os.mkdir(programname)

# Setup access to IN-CORE
https://incore.ncsa.illinois.edu/

In [4]:
from pyincore import IncoreClient, Dataset, FragilityService, MappingSet, DataService
from pyincore_viz.geoutil import GeoUtil as viz

In [5]:
#client = IncoreClient()
# IN-CORE caches files on the local machine; it might be necessary to clear the cache
#client.clear_cache()

In [6]:
# create data_service object for loading files
#data_service = DataService(client)

### IN-CORE addons
This program uses code that is being developed as potential add-ons to pyincore. These functions are in a folder called pyincore_addons, located in the same directory as this notebook.
The add-on functions are organized to mirror the folder structure of https://github.com/IN-CORE/pyincore

Each add on function attempts to follow the structure of existing pyincore functions and includes some help information.

In [7]:
# open, read, and execute python program with reusable commands
import pyincore_addons.geoutil_20210618 as add2incore

# since the geoutil is under construction it might need to be reloaded
from importlib import reload 
add2incore = reload(add2incore)

# Print list of add on functions
from inspect import getmembers, isfunction
print(getmembers(add2incore,isfunction))

[('df2gdf_WKTgeometry', <function df2gdf_WKTgeometry at 0x000001F470080EE8>), ('nearest_pt_search', <function nearest_pt_search at 0x000001F470080A68>)]


In [8]:
# example help details
help(add2incore.nearest_pt_search)

Help on function nearest_pt_search in module pyincore_addons.geoutil_20210618:

nearest_pt_search(gdf_a: geopandas.geodataframe.GeoDataFrame, gdf_b: geopandas.geodataframe.GeoDataFrame, uniqueid_a: str, uniqueid_b: str, k=1, dist_cutoff=99999)
    Given two sets of points add unique id from locations a to locations b
    Inspired by: https://towardsdatascience.com/using-scikit-learns-binary-trees-to-efficiently-find-latitude-and-longitude-neighbors-909979bd929b
    
    This function is used to itdentify buildings associated with businesses, schools, hospitals.
    The locations of businesses might be geocoded by address and may not overlap
    the actual structure. This function helps resolve this issue.
    
    Tested Python Enviroment:
        Python Version      3.7.10
        geopandas version:  0.9.0
        pandas version:     1.2.4
        scipy version:     1.6.3
        numpy version:      1.20.2
    
    Args:
        gdf_a: Geodataframe with list of locations with unique i

## Read in Building Data


In [9]:
sourceprogram = "IN-CORE_1gv1_Lumberton_SchoolBuildingData_2021-06-17"
filename = sourceprogram+"/"+sourceprogram+"_schoolbuildings.csv"
building_df = pd.read_csv(filename)
building_df.head(2)

Unnamed: 0.1,Unnamed: 0,guid,strctid,ffe_elev,archetype,parid,struct_typ,no_stories,a_stories,b_stories,...,str_typ2,occ_typ2,appr_bldg,appr_land,appr_tot,year_built,lhsm_elev,g_elev,age_group,geometry
0,2270,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,ST66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,43.88056,10,3715519218,,1,,,...,,,,,,1978.0,,,2.0,POINT (-79.12246588033371 34.58986850192746)
1,2271,31d34dad-4211-40d9-b4e3-38677b5ee72f,ST31d34dad-4211-40d9-b4e3-38677b5ee72f,43.86102,10,3715519217,,1,,,...,,,,,,1978.0,,,2.0,POINT (-79.12254426706757 34.59002228274374)


In [10]:
# Convert dataframe to gdf
building_gdf = add2incore.df2gdf_WKTgeometry(df = building_df, projection = "epsg:4326",reproject="epsg:26917")
building_gdf.head(2)

Unnamed: 0.1,Unnamed: 0,guid,strctid,ffe_elev,archetype,parid,struct_typ,no_stories,a_stories,b_stories,...,str_typ2,occ_typ2,appr_bldg,appr_land,appr_tot,year_built,lhsm_elev,g_elev,age_group,geometry
0,2270,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,ST66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,43.88056,10,3715519218,,1,,,...,,,,,,1978.0,,,2.0,POINT (672189.465 3829164.868)
1,2271,31d34dad-4211-40d9-b4e3-38677b5ee72f,ST31d34dad-4211-40d9-b4e3-38677b5ee72f,43.86102,10,3715519217,,1,,,...,,,,,,1978.0,,,2.0,POINT (672181.958 3829181.790)
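
The `geometry` column is read from the CSV as text in well-known text (WKT) form, so it has to be parsed before it can be used as geometry. The internals of `df2gdf_WKTgeometry` are not shown in this notebook. As a dependency-free illustration of the parsing step only (no reprojection), a hypothetical helper `parse_wkt_point` might look like the sketch below. In practice a library parser such as `shapely.wkt.loads` handles this.

```python
import re

# Hypothetical helper for illustration; not part of pyincore or pyincore_addons.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def parse_wkt_point(wkt):
    """Return (x, y) floats from a WKT string such as 'POINT (-79.12 34.59)'."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"Not a 2D WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))

# WKT value taken from the first building row (x = longitude, y = latitude)
lon, lat = parse_wkt_point("POINT (-79.12246588033371 34.58986850192746)")
print(lon, lat)
```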


In [11]:
building_gdf.columns

Index(['Unnamed: 0', 'guid', 'strctid', 'ffe_elev', 'archetype', 'parid',
       'struct_typ', 'no_stories', 'a_stories', 'b_stories', 'bsmt_type',
       'sq_foot', 'gsq_foot', 'occ_type', 'occ_detail', 'major_occ',
       'broad_occ', 'repl_cst', 'str_cst', 'nstra_cst', 'nstrd_cst', 'dgn_lvl',
       'cont_val', 'efacility', 'dwell_unit', 'str_typ2', 'occ_typ2',
       'appr_bldg', 'appr_land', 'appr_tot', 'year_built', 'lhsm_elev',
       'g_elev', 'age_group', 'geometry'],
      dtype='object')

## Read in Reference USA Data

In [12]:
sourceprogram = "IN-CORE_1bv2_Lumberton_CleanReferenceUSA_2021-05-13"
filename = sourceprogram+"/"+sourceprogram+"_EPSG4326.csv"
refusa_df = pd.read_csv(filename)
refusa_df.head()

Unnamed: 0.1,Unnamed: 0,IUSA Number,BLOCKID10,STATEFP10,COUNTYFP10,TRACTCE10,PUMGEOID10,PUMNAME10,PLCGEOID10,PLCNAME10,...,Longitude,Firm or Individual,Record Type,Corporate Employee Size Actual,Corporate Sales Volume Actual,Years In Database,Year Established,Home Business,geometry,NAICS2D
0,0,70-869-1014,371559600000000.0,37.0,155.0,961301.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.007328,2,Verified,0,$0,3,,No,POINT (-79.007328 34.658815),99
1,1,41-147-4682,371559600000000.0,37.0,155.0,960900.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.00869,1,Verified,0,$0,7,,Yes,POINT (-79.00869 34.630325),99
2,2,42-516-9180,371559600000000.0,37.0,155.0,960801.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.015588,2,Verified,0,$0,6,,No,POINT (-79.01558799999999 34.617695),99
3,3,71-122-6414,371559600000000.0,37.0,155.0,960801.0,3705100.0,Robeson County (West)--Lumberton City PUMA,,,...,-79.087891,2,Verified,0,$0,3,,No,POINT (-79.087891 34.610195),99
4,4,71-812-0078,371559600000000.0,37.0,155.0,961302.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-78.983589,2,Verified,0,$0,3,,No,POINT (-78.98358899999999 34.643772),99


In [13]:
refusa_df.columns

Index(['Unnamed: 0', 'IUSA Number', 'BLOCKID10', 'STATEFP10', 'COUNTYFP10',
       'TRACTCE10', 'PUMGEOID10', 'PUMNAME10', 'PLCGEOID10', 'PLCNAME10',
       'Version Year', 'Company Name', 'Executive First Name',
       'Executive Last Name', 'Executive Title', 'Executive Gender', 'Address',
       'City', 'State', 'ZIP Code', 'ZIP Four', 'County',
       'Phone Number Combined', 'Primary SIC Code', 'Primary SIC Description',
       'SIC Code 1', 'Unnamed: 17', 'SIC Code 1 Description',
       'SIC Code 1 Ad Size', 'SIC Code 1 Year Appeared', 'SIC Code 2',
       'SIC Code 2 Description', 'SIC Code 3', 'SIC Code 3 Description',
       'SIC Code 4', 'SIC Code 4 Description', 'SIC Code 5',
       'SIC Code 5 Description', 'SIC Code 6', 'SIC Code 6 Description',
       'SIC Code 7', 'SIC Code 7 Description', 'SIC Code 8',
       'SIC Code 8 Description', 'SIC Code 9', 'SIC Code 9 Description',
       'SIC Code 10', 'SIC Code 10 Description', 'Primary NAICS',
       'Primary NAICS Descript

## Select NAICS 61 - Educational Services

https://www.bls.gov/oes/current/naics3_611000.htm 

Industries within NAICS 611000 - Educational Services
- NAICS 611100 - Elementary and Secondary Schools
- NAICS 611200 - Junior Colleges
- NAICS 611300 - Colleges, Universities, and Professional Schools
- NAICS 611400 - Business Schools and Computer and Management Training
- NAICS 611500 - Technical and Trade Schools
- NAICS 611600 - Other Schools and Instruction
- NAICS 611700 - Educational Support Services
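
The `NAICS2D` column used for the selection appears to hold the two-digit sector prefix of the six-digit NAICS code; that is an assumption, since the column is created in an earlier program. A minimal sketch of the relationship, using a hypothetical helper `naics_sector`:

```python
def naics_sector(code):
    """Return the two-digit NAICS sector for a six-digit NAICS code."""
    digits = str(int(code))
    if len(digits) != 6:
        raise ValueError(f"Expected a six-digit NAICS code, got {code!r}")
    return int(digits[:2])

# Six-digit codes in the Educational Services sector all map to sector 61
for code in (611110, 611310, 611410, 611610, 611620):
    print(code, naics_sector(code))
```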


In [14]:
refusa_df['NAICS2D'].describe()

count    2547.000000
mean       59.420102
std        17.172025
min        11.000000
25%        50.000000
50%        62.000000
75%        72.000000
max        99.000000
Name: NAICS2D, dtype: float64

In [15]:
naics61_df = refusa_df.loc[refusa_df['NAICS2D']==61].copy()
naics61_df['NAICS2D'].describe()

count    38.0
mean     61.0
std       0.0
min      61.0
25%      61.0
50%      61.0
75%      61.0
max      61.0
Name: NAICS2D, dtype: float64

In [16]:
naics61_df.groupby(['Primary NAICS','Primary NAICS Description']).aggregate({'Location Employee Size Actual':np.sum,
                                                                            'IUSA Number':'count'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Location Employee Size Actual,IUSA Number
Primary NAICS,Primary NAICS Description,Unnamed: 2_level_1,Unnamed: 3_level_1
611110,Elementary & Secondary Schools,4672,32
611310,"Colleges, Universities & Professional Schools",225,3
611410,Business & Secretarial Schools,2,1
611610,Fine Art Schools,1,1
611620,Sports & Recreation Instruction,5,1


In [17]:
naics61_gdf = add2incore.df2gdf_WKTgeometry(df = naics61_df, projection = "epsg:4326",reproject="epsg:26917")
naics61_gdf.head()

Unnamed: 0.1,Unnamed: 0,IUSA Number,BLOCKID10,STATEFP10,COUNTYFP10,TRACTCE10,PUMGEOID10,PUMNAME10,PLCGEOID10,PLCNAME10,...,Longitude,Firm or Individual,Record Type,Corporate Employee Size Actual,Corporate Sales Volume Actual,Years In Database,Year Established,Home Business,geometry,NAICS2D
569,569,53-111-5749,371559600000000.0,37.0,155.0,961301.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-78.983332,2,Verified,0,$0,17,,No,POINT (684796.364 3837152.587),61
571,571,71-380-0212,371559600000000.0,37.0,155.0,961301.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.002682,2,Verified,0,$0,3,,No,POINT (683000.401 3838255.900),61
572,572,68-953-6704,371559600000000.0,37.0,155.0,960801.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.071538,2,Verified,0,$0,10,,No,POINT (676817.804 3831495.112),61
573,573,12-620-8164,371559600000000.0,37.0,155.0,960701.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.006439,2,Verified,0,$0,23,1965.0,No,POINT (682640.378 3839044.556),61
574,574,53-111-3496,371559600000000.0,37.0,155.0,960801.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.053016,2,Verified,0,$0,19,,No,POINT (678487.365 3833025.180),61


## Set up data for nearest neighbor search

This needs to be done in the reverse order from the business matching. For businesses, each business location is assigned to one building, and one building can hold multiple businesses.
In this case, we want one school ID to be assigned to multiple nearby buildings.
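
To make the direction of the search concrete: each building looks up its nearest school, so one school ID can end up on several buildings. The sketch below is a brute-force stand-in that uses only the standard library. It is not the `nearest_pt_search` implementation, whose internals are not shown in this notebook, and the function name, coordinates, and IDs are made up for illustration. Coordinates are assumed to be in a projected CRS with metre units.

```python
import math

def assign_nearest(buildings, schools, dist_cutoff):
    """For each building, return (school_id, distance) of the closest school
    within dist_cutoff, or (None, None) when no school is close enough."""
    matches = {}
    for bldg_id, (bx, by) in buildings.items():
        best_id, best_dist = None, None
        for school_id, (sx, sy) in schools.items():
            dist = math.hypot(bx - sx, by - sy)
            if dist <= dist_cutoff and (best_dist is None or dist < best_dist):
                best_id, best_dist = school_id, dist
        matches[bldg_id] = (best_id, best_dist)
    return matches

# Made-up projected coordinates (metres)
schools = {"school-A": (0.0, 0.0), "school-B": (1000.0, 0.0)}
buildings = {
    "bldg-1": (30.0, 40.0),     # 50 m from school-A
    "bldg-2": (-60.0, 80.0),    # 100 m from school-A
    "bldg-3": (1000.0, 120.0),  # 120 m from school-B
    "bldg-4": (500.0, 0.0),     # 500 m from both schools, beyond the cutoff
}

for bldg_id, (school_id, dist) in assign_nearest(buildings, schools, 250).items():
    print(bldg_id, school_id, dist)
```

A brute-force loop compares every building with every school, which is fine for a few dozen schools. A tree-based index becomes worthwhile as the number of points grows.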

## Run nearest neighbor algorithm

In [18]:
help(add2incore.nearest_pt_search)

Help on function nearest_pt_search in module pyincore_addons.geoutil_20210618:

nearest_pt_search(gdf_a: geopandas.geodataframe.GeoDataFrame, gdf_b: geopandas.geodataframe.GeoDataFrame, uniqueid_a: str, uniqueid_b: str, k=1, dist_cutoff=99999)
    Given two sets of points add unique id from locations a to locations b
    Inspired by: https://towardsdatascience.com/using-scikit-learns-binary-trees-to-efficiently-find-latitude-and-longitude-neighbors-909979bd929b
    
    This function is used to itdentify buildings associated with businesses, schools, hospitals.
    The locations of businesses might be geocoded by address and may not overlap
    the actual structure. This function helps resolve this issue.
    
    Tested Python Enviroment:
        Python Version      3.7.10
        geopandas version:  0.9.0
        pandas version:     1.2.4
        scipy version:     1.6.3
        numpy version:      1.20.2
    
    Args:
        gdf_a: Geodataframe with list of locations with unique i

In [19]:
buiding_refusa_gdf = add2incore.nearest_pt_search(gdf_a = naics61_gdf,
                       gdf_b = building_gdf,
                       uniqueid_a = 'IUSA Number',
                       uniqueid_b = 'guid',
                       k = 3,
                       dist_cutoff = 250)

In [20]:
buiding_refusa_gdf.head()

Unnamed: 0,guid,geometry_x,LON_x,LAT_x,neighbor,distance,distoutlier,location a index,index,IUSA Number,geometry_y,LON_y,LAT_y
0,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,POINT (672189.465 3829164.868),672189.465374,3829165.0,1,70.868936,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0
1,31d34dad-4211-40d9-b4e3-38677b5ee72f,POINT (672181.958 3829181.790),672181.958294,3829182.0,1,53.617977,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0
2,81370d16-d258-4dba-9405-f264534550c0,POINT (672172.253 3829203.772),672172.25308,3829204.0,1,33.340389,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0
3,f1d92274-c480-4658-b5af-723436a08af9,POINT (672112.146 3829264.495),672112.145685,3829264.0,1,55.733101,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0
4,942f44ec-e3a4-4e86-b5ee-79bb9a502d6e,POINT (672108.372 3829220.729),672108.371874,3829221.0,1,33.581385,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0


In [21]:
buiding_refusa_gdf.guid.describe()

count                                       94
unique                                      79
top       6091f122-497a-4725-bc3e-b0ea56c44987
freq                                         2
Name: guid, dtype: object
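
The count (94) exceeds the number of unique `guid` values (79), so some buildings are matched to more than one RefUSA record. A small sketch for listing such buildings, using made-up IDs and a hypothetical helper `repeated_ids`:

```python
from collections import Counter

def repeated_ids(ids):
    """Return {id: count} for ids that occur more than once."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}

# Made-up (guid, IUSA Number) pairs for illustration
pairs = [("g1", "s1"), ("g2", "s1"), ("g2", "s2"), ("g3", "s2")]
print(repeated_ids(guid for guid, _ in pairs))
```

With pandas, `DataFrame.duplicated(subset='guid', keep=False)` would flag the same rows directly on the matched table.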

In [22]:
buiding_refusa_gdf.crs

<Projected CRS: EPSG:26917>
Name: NAD83 / UTM zone 17N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: North America - between 84°W and 78°W - onshore and offshore. Canada - Nunavut; Ontario; Quebec. United States (USA) - Florida; Georgia; Kentucky; Maryland; Michigan; New York; North Carolina; Ohio; Pennsylvania; South Carolina; Tennessee; Virginia; West Virginia.
- bounds: (-84.0, 23.81, -78.0, 84.0)
Coordinate Operation:
- name: UTM zone 17N
- method: Transverse Mercator
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

In [23]:
buiding_refusa_gdf[['guid','IUSA Number']]

Unnamed: 0,guid,IUSA Number
0,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,36-706-7980
1,31d34dad-4211-40d9-b4e3-38677b5ee72f,36-706-7980
2,81370d16-d258-4dba-9405-f264534550c0,36-706-7980
3,f1d92274-c480-4658-b5af-723436a08af9,36-706-7980
4,942f44ec-e3a4-4e86-b5ee-79bb9a502d6e,36-706-7980
...,...,...
83,0ee5dc80-461e-48a6-aa98-f8c26ac50a08,48-818-7220
84,7fbb7876-9e6a-43c4-8b91-e5583f6ef612,48-818-7220
85,94dac1cb-d603-4151-85fa-d4458b8d43c7,48-818-7220
86,d6141c94-45fe-4a58-b659-884ff34da7ae,48-818-7220


In [24]:
# Save Work at this point as CSV
savefile = sys.path[0]+"/"+programname+"/"+programname+".csv"
buiding_refusa_gdf.to_csv(savefile)
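
The save path follows the pattern `<programname>/<programname>.csv`. As a sketch of the same pattern built with `os.path.join`, assuming the output folder sits relative to a chosen base directory (the helper name `output_path` and its `base_dir` default are illustrative assumptions, not part of the original program):

```python
import os

def output_path(programname, extension=".csv", base_dir="."):
    """Build <base_dir>/<programname>/<programname><extension>."""
    return os.path.join(base_dir, programname, programname + extension)

print(output_path("IN-CORE_2cv1_MatchRefUSASchoolBuilding_2021-06-19"))
```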

## Explore results
Check whether the method is working as expected.

In [25]:
buiding_refusa_gdf[['neighbor','guid']].fillna('none').groupby(
    ['neighbor']).count()

Unnamed: 0_level_0,guid
neighbor,Unnamed: 1_level_1
1,79
2,15


In [26]:
companyname_df = pd.merge(buiding_refusa_gdf, naics61_df[['IUSA Number','Company Name','Location Employee Size Actual']], 
                        left_on='IUSA Number', right_on='IUSA Number', how='left')
companyname_df.head()

Unnamed: 0,guid,geometry_x,LON_x,LAT_x,neighbor,distance,distoutlier,location a index,index,IUSA Number,geometry_y,LON_y,LAT_y,Company Name,Location Employee Size Actual
0,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,POINT (672189.465 3829164.868),672189.465374,3829165.0,1,70.868936,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20
1,31d34dad-4211-40d9-b4e3-38677b5ee72f,POINT (672181.958 3829181.790),672181.958294,3829182.0,1,53.617977,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20
2,81370d16-d258-4dba-9405-f264534550c0,POINT (672172.253 3829203.772),672172.25308,3829204.0,1,33.340389,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20
3,f1d92274-c480-4658-b5af-723436a08af9,POINT (672112.146 3829264.495),672112.145685,3829264.0,1,55.733101,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20
4,942f44ec-e3a4-4e86-b5ee-79bb9a502d6e,POINT (672108.372 3829220.729),672108.371874,3829221.0,1,33.581385,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20


In [27]:
companyname_df['geometry_x']

0     POINT (672189.465 3829164.868)
1     POINT (672181.958 3829181.790)
2     POINT (672172.253 3829203.772)
3     POINT (672112.146 3829264.495)
4     POINT (672108.372 3829220.729)
                   ...              
89    POINT (682987.108 3842920.002)
90    POINT (682955.266 3842938.094)
91    POINT (682854.461 3842833.972)
92    POINT (682849.757 3842904.186)
93    POINT (683006.101 3842839.303)
Name: geometry_x, Length: 94, dtype: geometry

In [28]:
## Convert Dataframe to Geodataframe
building_gdf = companyname_df.set_geometry(companyname_df['geometry_x'])
building_gdf.head()

Unnamed: 0,guid,geometry_x,LON_x,LAT_x,neighbor,distance,distoutlier,location a index,index,IUSA Number,geometry_y,LON_y,LAT_y,Company Name,Location Employee Size Actual,geometry
0,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,POINT (672189.465 3829164.868),672189.465374,3829165.0,1,70.868936,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (672189.465 3829164.868)
1,31d34dad-4211-40d9-b4e3-38677b5ee72f,POINT (672181.958 3829181.790),672181.958294,3829182.0,1,53.617977,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (672181.958 3829181.790)
2,81370d16-d258-4dba-9405-f264534550c0,POINT (672172.253 3829203.772),672172.25308,3829204.0,1,33.340389,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (672172.253 3829203.772)
3,f1d92274-c480-4658-b5af-723436a08af9,POINT (672112.146 3829264.495),672112.145685,3829264.0,1,55.733101,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (672112.146 3829264.495)
4,942f44ec-e3a4-4e86-b5ee-79bb9a502d6e,POINT (672108.372 3829220.729),672108.371874,3829221.0,1,33.581385,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (672108.372 3829220.729)


In [29]:
building_gdf.crs

In [30]:
from pyproj import CRS
building_gdf.crs = CRS("epsg:26917")
building_gdf = building_gdf.to_crs("epsg:4326")
building_gdf.head()

Unnamed: 0,guid,geometry_x,LON_x,LAT_x,neighbor,distance,distoutlier,location a index,index,IUSA Number,geometry_y,LON_y,LAT_y,Company Name,Location Employee Size Actual,geometry
0,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,POINT (672189.465 3829164.868),672189.465374,3829165.0,1,70.868936,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12247 34.58987)
1,31d34dad-4211-40d9-b4e3-38677b5ee72f,POINT (672181.958 3829181.790),672181.958294,3829182.0,1,53.617977,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12254 34.59002)
2,81370d16-d258-4dba-9405-f264534550c0,POINT (672172.253 3829203.772),672172.25308,3829204.0,1,33.340389,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12265 34.59022)
3,f1d92274-c480-4658-b5af-723436a08af9,POINT (672112.146 3829264.495),672112.145685,3829264.0,1,55.733101,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12329 34.59078)
4,942f44ec-e3a4-4e86-b5ee-79bb9a502d6e,POINT (672108.372 3829220.729),672108.371874,3829221.0,1,33.581385,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12334 34.59039)


In [31]:
naics61_gdf.crs

<Projected CRS: EPSG:26917>
Name: NAD83 / UTM zone 17N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: North America - between 84°W and 78°W - onshore and offshore. Canada - Nunavut; Ontario; Quebec. United States (USA) - Florida; Georgia; Kentucky; Maryland; Michigan; New York; North Carolina; Ohio; Pennsylvania; South Carolina; Tennessee; Virginia; West Virginia.
- bounds: (-84.0, 23.81, -78.0, 84.0)
Coordinate Operation:
- name: UTM zone 17N
- method: Transverse Mercator
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

In [32]:
naics61_gdf = naics61_gdf.to_crs("epsg:4326")
naics61_gdf.head()

Unnamed: 0.1,Unnamed: 0,IUSA Number,BLOCKID10,STATEFP10,COUNTYFP10,TRACTCE10,PUMGEOID10,PUMNAME10,PLCGEOID10,PLCNAME10,...,Longitude,Firm or Individual,Record Type,Corporate Employee Size Actual,Corporate Sales Volume Actual,Years In Database,Year Established,Home Business,geometry,NAICS2D
569,569,53-111-5749,371559600000000.0,37.0,155.0,961301.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-78.983332,2,Verified,0,$0,17,,No,POINT (-78.98333 34.65966),61
571,571,71-380-0212,371559600000000.0,37.0,155.0,961301.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.002682,2,Verified,0,$0,3,,No,POINT (-79.00268 34.66993),61
572,572,68-953-6704,371559600000000.0,37.0,155.0,960801.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.071538,2,Verified,0,$0,10,,No,POINT (-79.07154 34.61008),61
573,573,12-620-8164,371559600000000.0,37.0,155.0,960701.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.006439,2,Verified,0,$0,23,1965.0,No,POINT (-79.00644 34.67710),61
574,574,53-111-3496,371559600000000.0,37.0,155.0,960801.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,-79.053016,2,Verified,0,$0,19,,No,POINT (-79.05302 34.62358),61


In [37]:
from folium import plugins # Add minimap and search plugin functions to maps
from folium.map import FeatureGroup, Marker  # explicit imports instead of a wildcard

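def folium_marker_map(gdf,k,popuplabel,gdf2,layer2name,popuplabel2):
    """
    Build a folium map with one marker layer per neighbor rank plus a reference layer.

    Args:
        gdf: GeoDataFrame of matched points with a 'neighbor' column (epsg:4326)
        k: number of neighbor ranks to plot (at most 3, one color per rank)
        popuplabel: column name(s) in gdf shown in each marker popup
        gdf2: GeoDataFrame of reference points (epsg:4326)
        layer2name: layer name for gdf2
        popuplabel2: column name in gdf2 shown in each marker popup
    """
    
    # Both GeoDataFrames are assumed to be in epsg:4326 (no check is performed)
    
    # Find the bounds of the matched points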
    minx = gdf.bounds.minx.min()
    miny = gdf.bounds.miny.min()
    maxx = gdf.bounds.maxx.max()
    maxy = gdf.bounds.maxy.max()

    map = fm.Map(location=[(miny+maxy)/2,(minx+maxx)/2], zoom_start=16)

    # add marker one by one on the map
    colorlist = ['red','green','blue']
    for i in range(0,k):
        layername='Neighbor '+str(i+1)
        feature_group = FeatureGroup(name=layername)
        locations = gdf.loc[gdf['neighbor'] == i+1]
        for idx, row in locations.iterrows():
            # Get lat and lon of points
            lon = row['geometry'].x
            lat = row['geometry'].y

            # Get NAME information
            label = row[popuplabel]
            # Add marker to the map
            feature_group.add_child(Marker([lat, lon], 
                                        popup=label,
                                        icon=fm.Icon(color=colorlist[i], icon="school")))
        map.add_child(feature_group)
    
    feature_group = FeatureGroup(name=layer2name)
    for idx, row in gdf2.iterrows():
        # Get lat and lon of points
        lon = row['geometry'].x
        lat = row['geometry'].y

        # Get NAME information
        label = row[popuplabel2]
        # Add marker to the map
        feature_group.add_child(Marker([lat, lon], 
                                    popup=label,
                                    icon=fm.Icon(color='gray', icon="school")))
    map.add_child(feature_group)
    fm.LayerControl(collapsed=False, autoZIndex=False).add_to(map)

    # Add minimap
    plugins.MiniMap().add_to(map)

    # How should the map be bound - look for the southwest and northeast corners of the data
    sw_corner = [gdf.bounds.miny.min(),gdf.bounds.minx.min()]
    ne_corner = [gdf.bounds.maxy.max(),gdf.bounds.maxx.max()]
    map.fit_bounds([sw_corner, ne_corner])

    return map

explore_map = folium_marker_map(building_gdf,3,['Company Name','distance'],naics61_gdf,'RefUSA 61','Company Name')
explore_map.save(f'{programname}/{programname}.html')

explore_map

In [34]:
median = building_gdf['distance'].quantile(0.50) 
median + building_gdf['distance'].std()*3 

242.20945919321753
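
The value above is the median distance plus three standard deviations, which can serve as a rough cutoff for flagging unusually distant matches. A standard-library sketch of the same arithmetic on made-up distances follows; `statistics.stdev` is the sample standard deviation, matching the pandas default (`ddof=1`). The helper name `distance_threshold` is hypothetical.

```python
import statistics

def distance_threshold(distances, n_std=3):
    """Median distance plus n_std sample standard deviations."""
    return statistics.median(distances) + n_std * statistics.stdev(distances)

# Made-up distances in metres
distances = [10.0, 20.0, 30.0, 40.0, 200.0]
threshold = distance_threshold(distances)
outliers = [d for d in distances if d > threshold]
print(round(threshold, 2), outliers)
```

Because the standard deviation is itself inflated by extreme values, a single large distance can raise the threshold enough to hide itself, so the cutoff is best treated as a starting point for manual review.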

In [35]:
varlist = ['IUSA Number','Company Name','Address','geometry','neighbor','distance','di']
building_gdf.loc[building_gdf['Company Name'].str.contains("Career")]

Unnamed: 0,guid,geometry_x,LON_x,LAT_x,neighbor,distance,distoutlier,location a index,index,IUSA Number,geometry_y,LON_y,LAT_y,Company Name,Location Employee Size Actual,geometry
0,66b1392e-c7b0-4bd8-a092-a7ff0ea6c15a,POINT (672189.465 3829164.868),672189.465374,3829165.0,1,70.868936,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12247 34.58987)
1,31d34dad-4211-40d9-b4e3-38677b5ee72f,POINT (672181.958 3829181.790),672181.958294,3829182.0,1,53.617977,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12254 34.59002)
2,81370d16-d258-4dba-9405-f264534550c0,POINT (672172.253 3829203.772),672172.25308,3829204.0,1,33.340389,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12265 34.59022)
3,f1d92274-c480-4658-b5af-723436a08af9,POINT (672112.146 3829264.495),672112.145685,3829264.0,1,55.733101,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12329 34.59078)
4,942f44ec-e3a4-4e86-b5ee-79bb9a502d6e,POINT (672108.372 3829220.729),672108.371874,3829221.0,1,33.581385,False,13.0,583,36-706-7980,POINT (672141.777 3829217.292),672141.776891,3829217.0,Robeson County Career Ctr,20,POINT (-79.12334 34.59039)


In [36]:
varlist = ['IUSA Number','Company Name','Address','geometry','Latitude','Longitude','Primary NAICS','Primary NAICS Description']
naics61_gdf[varlist].loc[naics61_gdf['Company Name'].str.contains("Career")]

Unnamed: 0,IUSA Number,Company Name,Address,geometry,Latitude,Longitude,Primary NAICS,Primary NAICS Description
583,36-706-7980,Robeson County Career Ctr,1339 Hilly Branch Rd,POINT (-79.12297 34.59035),34.590349,-79.122975,611110,Elementary & Secondary Schools
