## Match OpenStreetMap, NCES, and Reference USA School Data

This notebook attempts to match the point data from three sources to make one file with details on all schools in a county.


## Description of Program
- program:    IN-CORE_2av1_MatchSchoolPoints
- task:       Match OpenStreetMap, NCES, and Reference USA point data in one file - use OpenStreetMap points as main geocode
- Version:    2021-06-21
-             2021-06-22 - Add Day Care Centers (NAICS 624410)
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, N. (2021) "Obtain, Clean, and Explore Labor Market Allocation Methods".
Archived on Github and ICPSR.

In [1]:
%matplotlib inline

import pandas as pd
import geopandas as gpd
import numpy as np  # group by aggregation
import folium as fm # folium has more dynamic maps - but requires internet connection

In [2]:
# Display versions being used - important information for replication
import sys
print("Python Version     ", sys.version)
print("geopandas version: ", gpd.__version__)
print("pandas version:    ", pd.__version__)
print("numpy version:     ", np.__version__)
print("folium version:    ", fm.__version__)

Python Version      3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:37:01) [MSC v.1916 64 bit (AMD64)]
geopandas version:  0.9.0
pandas version:     1.2.4
numpy version:      1.20.2
folium version:     0.12.1


In [3]:
import os # For saving output to path
# Store Program Name for output files to have the same name
programname = "IN-CORE_2av1_MatchSchoolPoints_2021-06-22"
# Make directory to save output
if not os.path.exists(programname):
    os.mkdir(programname)

# Setup access to IN-CORE
https://incore.ncsa.illinois.edu/

In [4]:
from pyincore import IncoreClient, Dataset, FragilityService, MappingSet, DataService
from pyincore_viz.geoutil import GeoUtil as viz

In [5]:
#client = IncoreClient()
# IN-CORE caches files on the local machine; it might be necessary to clear the cache
#client.clear_cache()

In [6]:
# create data_service object for loading files
#data_service = DataService(client)

### IN-CORE addons
This program uses code that is being developed as potential add-ons to pyincore. These functions are in a folder called pyincore_addons, located in the same directory as this notebook.
The add-on functions are organized to mirror the folder structure of https://github.com/IN-CORE/pyincore

Each add-on function attempts to follow the structure of existing pyincore functions and includes some help information.

In [7]:
# open, read, and execute python program with reusable commands
import pyincore_addons.geoutil_20210618 as add2incore

# since the geoutil is under construction it might need to be reloaded
from importlib import reload 
add2incore = reload(add2incore)

# Print list of add on functions
from inspect import getmembers, isfunction
print(getmembers(add2incore,isfunction))

[('df2gdf_WKTgeometry', <function df2gdf_WKTgeometry at 0x0000015A3DDF0EE8>), ('nearest_pt_search', <function nearest_pt_search at 0x0000015A3DDF0708>)]
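
The raw `getmembers` output above includes memory addresses, which makes it hard to scan. A minimal sketch of a more readable listing follows. It prints each function name with the first line of its docstring. The helper name `list_functions` is hypothetical, and the standard-library `textwrap` module is used only as a stand-in so the example runs without `pyincore_addons`; in the notebook the same call could be made with `add2incore`.

```python
from inspect import getmembers, isfunction, getdoc
import textwrap  # stand-in module; swap in add2incore inside the notebook


def list_functions(module):
    """Return (name, first docstring line) for each function found in a module."""
    rows = []
    for name, func in getmembers(module, isfunction):
        doc = getdoc(func) or ""
        first_line = doc.splitlines()[0] if doc else "(no docstring)"
        rows.append((name, first_line))
    return rows


function_rows = list_functions(textwrap)
for name, summary in function_rows:
    print(f"{name}: {summary}")
```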


## Read in OSM Point Data


In [8]:
sourceprogram = "IN-CORE_1dv2_Lumberton_CleanOpenStreeMap_2021-06-21"
filename = sourceprogram+"/"+sourceprogram+"_EPSG4326.csv"
osm_df = pd.read_csv(filename)

# Convert dataframe to gdf
osm_df['geometry_osm'] = osm_df['geometry'] # save original geometry
osm_gdf = add2incore.df2gdf_WKTgeometry(df = osm_df, projection = "epsg:4326",reproject="epsg:26917")
osm_gdf.head(2)

Unnamed: 0.1,Unnamed: 0,osmid,element_type,amenity,ele,gnis:county_id,gnis:created,gnis:feature_id,gnis:state_id,name,...,addr:housenumber,addr:postcode,addr:state,addr:street,note,source,nodes,old_name,phone,geometry_osm
0,0,357767730,node,school,43.0,155.0,06/17/1980,980556.0,37.0,Barker Tenmile School,...,,,,,,,,,,POINT (-78.9522511 34.7118314)
1,1,357771489,node,school,59.0,155.0,06/17/1980,984036.0,37.0,Dean School,...,,,,,,,,,,POINT (-79.3447649 34.7318283)


In [9]:
# Convert unique id to object - makes it easier to compare counts and unique values
osm_gdf['osmid'] = osm_gdf['osmid'].astype(str)
osm_gdf['osmid'].describe()

count            77
unique           77
top       925917089
freq              1
Name: osmid, dtype: object

## Read in NCES School Data

In [10]:
sourcefolder = '../SourceData/nces.ed.gov/WorkNPR/'
sourceprogram = "NCES_2bv1_AddTeacherCount_2021-06-15"
filename = sourcefolder+"/"+sourceprogram+"/"+sourceprogram+".csv"
nces_df = pd.read_csv(filename)

# Convert dataframe to gdf
nces_df['geometry_nces'] = nces_df['geometry'] # save original geometry
nces_gdf = add2incore.df2gdf_WKTgeometry(df = nces_df, projection = "epsg:4326",reproject="epsg:26917")
nces_gdf.head()

Unnamed: 0.2,Unnamed: 0,ncesid,HRTOTLT,ppin,p410,FTE,Unnamed: 0.1,name,addr,city,...,zip,cnty15,geometry,level,schtype,lat,lon,schyr,numstaff,geometry_nces
0,0,370004002349,,,,7.99,0,CIS Academy,818 West 3rd Street,Pembroke,...,28372,37155,POINT (664581.949 3839584.931),99,1,34.685038,-79.203357,2015-2016,7.99,POINT (-79.20335664043833 34.68503759480223)
1,1,370034603302,,,,15.0,1,Southeastern Academy,12251 NC HWY 41 North,Lumberton,...,28358,37155,POINT (694854.567 3836475.290),99,1,34.651697,-78.873789,2015-2016,15.0,POINT (-78.87378865362859 34.65169717880272)
2,2,370225003249,,,,36.25,2,Sandy Grove Middle,300 Chason Road,Lumber Bridge,...,28357,37155,POINT (676730.741 3863273.806),99,1,34.89651,-79.065819,2015-2016,36.25,POINT (-79.06581931618486 34.89650979378273)
3,3,370393001569,,,,23.84,3,Deep Branch Elementary,4045 Deep Branch Road,Lumberton,...,28360,37155,POINT (669947.382 3833617.725),99,1,34.630377,-79.14601,2015-2016,23.84,POINT (-79.14600999186194 34.63037683072379)
4,4,370393001570,,,,22.98,4,Fairgrove Middle,1953 Fairgrove Sch Road,Fairmont,...,28340,37155,POINT (667683.210 3818368.148),99,1,34.493298,-79.173707,2015-2016,22.98,POINT (-79.17370687961406 34.49329831006692)


In [11]:
nces_gdf['ncesid'].describe()

count               55
unique              55
top       370393001583
freq                 1
Name: ncesid, dtype: object

## Read in Reference USA School Data

In [12]:
sourceprogram = "IN-CORE_1bv2_Lumberton_CleanReferenceUSA_2021-05-13"
filename = sourceprogram+"/"+sourceprogram+"_EPSG4326.csv"
refusa_df = pd.read_csv(filename)
refusa_df[['IUSA Number','Company Name','Primary NAICS','NAICS2D']].head(2)

Unnamed: 0,IUSA Number,Company Name,Primary NAICS,NAICS2D
0,70-869-1014,Bluewave,999990,99
1,41-147-4682,"Britt, Evander M",999990,99


## Select NAICS 61 - Education Services 

https://www.bls.gov/oes/current/naics3_611000.htm 

Industries within NAICS 611000 - Educational Services
- NAICS 611100 - Elementary and Secondary Schools
- NAICS 611200 - Junior Colleges
- NAICS 611300 - Colleges, Universities, and Professional Schools
- NAICS 611400 - Business Schools and Computer and Management Training
- NAICS 611500 - Technical and Trade Schools
- NAICS 611600 - Other Schools and Instruction
- NAICS 611700 - Educational Support Services


## Select NAICS 624110 and 624400 - Child and Youth Services and Child Day Care Services

https://www.bls.gov/oes/current/naics2_62.htm

- NAICS 624400 - Child Day Care Services
- NAICS 624110 - Child & Youth Services (Boy Scouts and Girl Scouts, counseling services)


In [13]:
naics61_df = refusa_df.loc[(refusa_df['NAICS2D']==61) |
                          (refusa_df['Primary NAICS']==624110) |
                          (refusa_df['Primary NAICS']==624400) ].copy()
naics61_df['NAICS2D'].describe()

count    43.000000
mean     61.116279
std       0.324353
min      61.000000
25%      61.000000
50%      61.000000
75%      61.000000
max      62.000000
Name: NAICS2D, dtype: float64

In [14]:
naics61_df.groupby(['Primary NAICS','Primary NAICS Description']).aggregate({'Location Employee Size Actual':np.sum,
                                                                            'IUSA Number':'count'})

Primary NAICS,Primary NAICS Description,Location Employee Size Actual,IUSA Number
611110,Elementary & Secondary Schools,4672,32
611310,"Colleges, Universities & Professional Schools",225,3
611410,Business & Secretarial Schools,2,1
611610,Fine Art Schools,1,1
611620,Sports & Recreation Instruction,5,1
624110,Child & Youth Services,46,5
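
The table above contains no codes beginning with 6244, even though the filter includes `624400`. One possible reason, stated here as an assumption about this data rather than a confirmed fact, is that Reference USA records carry six-digit codes such as `624410`, which an exact comparison against `624400` would not select. A prefix match is one way to mix 2-, 4-, and 6-digit groupings. The records below are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical records for illustration; codes follow the six-digit pattern seen above
sample_df = pd.DataFrame({
    "Company Name": ["School A", "Youth Club B", "Day Care C", "Hardware D"],
    "Primary NAICS": [611110, 624110, 624410, 444130],
})

# Compare on string prefixes so groupings of different lengths can be combined.
# str.match anchors at the start of each string, so this is a prefix test.
naics_str = sample_df["Primary NAICS"].astype(str)
prefix_pattern = r"(?:61|624110|6244)"  # education, child & youth services, child day care
selected_df = sample_df.loc[naics_str.str.match(prefix_pattern)].copy()

print(selected_df)
```

Whether day care centers should be selected this way depends on how the cleaned Reference USA file encodes them, which is worth checking against the source data before changing the filter.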


In [15]:
naics61_df['geometry_refusa'] = naics61_df['geometry'] # save original geometry
naics61_gdf = add2incore.df2gdf_WKTgeometry(df = naics61_df, projection = "epsg:4326",reproject="epsg:26917")
naics61_gdf.head(2)

Unnamed: 0.1,Unnamed: 0,IUSA Number,BLOCKID10,STATEFP10,COUNTYFP10,TRACTCE10,PUMGEOID10,PUMNAME10,PLCGEOID10,PLCNAME10,...,Firm or Individual,Record Type,Corporate Employee Size Actual,Corporate Sales Volume Actual,Years In Database,Year Established,Home Business,geometry,NAICS2D,geometry_refusa
183,183,58-743-7179,371559600000000.0,37.0,155.0,961000.0,3705100.0,Robeson County (West)--Lumberton City PUMA,3739700.0,Lumberton,...,2,Verified,0,$0,17,,No,POINT (682707.086 3832418.157),62,POINT (-79.007141 34.617371)
545,545,40-032-6554,371559600000000.0,37.0,155.0,960701.0,3705100.0,Robeson County (West)--Lumberton City PUMA,,,...,2,Verified,0,$0,9,1984.0,No,POINT (683071.208 3843450.509),62,POINT (-79.000784 34.716734)


In [16]:
naics61_gdf['IUSA Number'].describe()

count              43
unique             43
top       41-700-4231
freq                1
Name: IUSA Number, dtype: object

## Set up data for nearest neighbor search

This search runs in the reverse order of the business matching. For businesses, each business location is assigned to one building, and one building can hold multiple businesses.
In this case, one school ID should be assigned to multiple nearby buildings.

## Run nearest neighbor algorithm - Add NCESID to OSMID

In [17]:
help(add2incore.nearest_pt_search)

Help on function nearest_pt_search in module pyincore_addons.geoutil_20210618:

nearest_pt_search(gdf_a: geopandas.geodataframe.GeoDataFrame, gdf_b: geopandas.geodataframe.GeoDataFrame, uniqueid_a: str, uniqueid_b: str, k=1, dist_cutoff=99999)
    Given two sets of points add unique id from locations a to locations b
    Inspired by: https://towardsdatascience.com/using-scikit-learns-binary-trees-to-efficiently-find-latitude-and-longitude-neighbors-909979bd929b
    
    This function is used to itdentify buildings associated with businesses, schools, hospitals.
    The locations of businesses might be geocoded by address and may not overlap
    the actual structure. This function helps resolve this issue.
    
    Tested Python Enviroment:
        Python Version      3.7.10
        geopandas version:  0.9.0
        pandas version:     1.2.4
        scipy version:     1.6.3
        numpy version:      1.20.2
    
    Args:
        gdf_a: Geodataframe with list of locations with unique i

In [18]:
osm_nces_gdf = add2incore.nearest_pt_search(gdf_a = nces_gdf,
                                               gdf_b = osm_gdf,
                                               uniqueid_a = 'ncesid',
                                               uniqueid_b = 'osmid',
                                               k = 4,
                                               dist_cutoff = 250)
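
The source of `nearest_pt_search` is not shown in this notebook, so the following is only a brute-force sketch of what a k-nearest search with a distance cutoff could look like. The function name `nearest_ids` and its return shape are hypothetical, and the add-on may rank or filter differently (for example with a tree index). The coordinates are copied from rows displayed in this notebook and are in EPSG:26917, so distances are in meters.

```python
import math


def nearest_ids(points_a, points_b, k=1, dist_cutoff=99999):
    """For each point in b, list up to k nearest points in a within dist_cutoff.

    points_a, points_b: dicts mapping a unique id to projected (x, y) in meters.
    Returns rows of (id_b, id_a, neighbor_rank, distance).
    """
    rows = []
    for id_b, (xb, yb) in points_b.items():
        # Brute-force distances from this b point to every a point, nearest first
        dists = sorted(
            (math.hypot(xa - xb, ya - yb), id_a)
            for id_a, (xa, ya) in points_a.items()
        )
        for rank, (dist, id_a) in enumerate(dists[:k], start=1):
            if dist <= dist_cutoff:
                rows.append((id_b, id_a, rank, dist))
    return rows


# NCES school points (locations a) and OSM points (locations b)
nces_pts = {
    "370393002051": (651535.266, 3844653.320),
    "370393001570": (667683.210, 3818368.148),
}
osm_pts = {
    "357771489": (651541.469, 3844552.257),
    "357772472": (667689.267, 3818328.693),
}

matches = nearest_ids(nces_pts, osm_pts, k=4, dist_cutoff=250)
for id_b, id_a, rank, dist in matches:
    print(id_b, id_a, rank, f"{dist:.1f}")
```

With `k = 4` each OSM point can receive up to four school ids, but the 250 m cutoff drops candidates that are too far away to plausibly belong to the same campus.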

In [19]:
osm_nces_gdf.head()

Unnamed: 0,osmid,geometry_x,LON_x,LAT_x,neighbor,distance,distoutlier,location a index,index,ncesid,geometry_y,LON_y,LAT_y
1,357771489,POINT (651541.469 3844552.257),651541.468511,3844552.0,1,101.253934,False,22.0,22,370393002051,POINT (651535.266 3844653.320),651535.265569,3844653.0
2,357772472,POINT (667689.267 3818328.693),667689.266644,3818329.0,1,39.916393,False,4.0,4,370393001570,POINT (667683.210 3818368.148),667683.209998,3818368.0
3,357774000,POINT (670363.910 3825527.231),670363.910479,3825527.0,1,34.573736,False,5.0,5,370393001571,POINT (670395.980 3825540.149),670395.980267,3825540.0
4,357774969,POINT (672130.965 3829227.353),672130.964789,3829227.0,1,91.597936,False,20.0,20,370393002049,POINT (672082.102 3829304.830),672082.102055,3829305.0
6,357775906,POINT (682862.835 3833039.391),682862.834897,3833039.0,1,200.734311,False,54.0,54,199476,POINT (682950.014 3832858.576),682950.013974,3832859.0
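
The `distance` column is in meters because both layers were reprojected to EPSG:26917 before the search. As a rough cross-check, the sketch below computes a great-circle distance from the original longitude and latitude of the first matched pair (osmid 357771489 and ncesid 370393002051). It assumes a spherical Earth with a 6,371 km radius, so it should land close to, but not exactly on, the projected distance of about 101.25 m. The helper `haversine_m` is illustrative and not part of pyincore.

```python
import math


def haversine_m(lon1, lat1, lon2, lat2, radius_m=6371000.0):
    """Great-circle distance in meters between two lon/lat points (spherical Earth)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


# Lon/lat for osmid 357771489 and ncesid 370393002051, as listed in this notebook
osm_lon, osm_lat = -79.3447649, 34.7318283
nces_lon, nces_lat = -79.34481445940865, 34.73274021259314

approx_m = haversine_m(osm_lon, osm_lat, nces_lon, nces_lat)
print(f"great-circle distance: {approx_m:.1f} m")
```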


In [20]:
osm_nces_gdf[['neighbor','osmid']].fillna('none').groupby(
    ['neighbor']).count()

neighbor,osmid
1,52
2,10
3,7


In [21]:
osm_nces_gdf[['osmid','ncesid']].describe()

Unnamed: 0,osmid,ncesid
count,69,69
unique,52,36
top,357814481,199281
freq,3,18


## Merge in original data to make full list of points with descriptions

In [22]:
osm_nces_gdf['distancept1'] = osm_nces_gdf['distance']
nces_gdf['name_nces'] = nces_gdf['name']
osm_nces_gdf_pt1 = pd.merge(osm_nces_gdf[['osmid','ncesid','distancept1']],
                            nces_gdf[['ncesid','name_nces','level','schtype','numstaff','geometry_nces']], 
                        left_on='ncesid', right_on='ncesid', how='outer')

In [23]:
osm_nces_gdf_pt1.head(2)

Unnamed: 0,osmid,ncesid,distancept1,name_nces,level,schtype,numstaff,geometry_nces
0,357771489,370393002051,101.253934,R B Dean Elementary,99,1,16.2,POINT (-79.34481445940865 34.73274021259314)
1,357772472,370393001570,39.916393,Fairgrove Middle,99,1,22.98,POINT (-79.17370687961406 34.49329831006692)


In [24]:
osm_nces_gdf_pt1[['osmid','ncesid']].describe()

Unnamed: 0,osmid,ncesid
count,69,88
unique,52,55
top,357814472,199281
freq,3,18
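
In the summary above, the `osmid` count stays at 69 while the `ncesid` count rises to 88, because the outer merge keeps schools that had no OSM point within the cutoff. A small sketch with hypothetical ids shows how the `indicator` option of `pd.merge` can label which rows matched and which came from only one side.

```python
import pandas as pd

# Hypothetical ids for illustration: two OSM points matched to one school,
# plus one school with no OSM point inside the cutoff
matches_df = pd.DataFrame({"osmid": ["o1", "o2"], "ncesid": ["n1", "n1"]})
schools_df = pd.DataFrame({"ncesid": ["n1", "n2"],
                           "name_nces": ["School One", "School Two"]})

merged_df = pd.merge(matches_df, schools_df, on="ncesid", how="outer", indicator=True)

# 'both' = matched rows, 'right_only' = schools without an OSM match
merge_counts = merged_df["_merge"].astype(str).value_counts().to_dict()
print(merged_df)
print(merge_counts)
```

The `_merge` column can be dropped once the counts have been checked.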


In [25]:
osm_gdf['name_osm'] = osm_gdf['name']
osm_nces_gdf_pt2 = pd.merge(osm_nces_gdf_pt1, osm_gdf[['osmid','name_osm','ele','old_name','geometry_osm']], 
                        left_on='osmid', right_on='osmid', how='outer')

In [26]:
osm_nces_gdf_pt2.head(2)

Unnamed: 0,osmid,ncesid,distancept1,name_nces,level,schtype,numstaff,geometry_nces,name_osm,ele,old_name,geometry_osm
0,357771489,370393002051,101.253934,R B Dean Elementary,99.0,1.0,16.2,POINT (-79.34481445940865 34.73274021259314),Dean School,59.0,,POINT (-79.3447649 34.7318283)
1,357772472,370393001570,39.916393,Fairgrove Middle,99.0,1.0,22.98,POINT (-79.17370687961406 34.49329831006692),Fairgrove School,38.0,,POINT (-79.1736487 34.4929417)


In [27]:
osm_nces_gdf_pt2[['osmid','ncesid']].describe()

Unnamed: 0,osmid,ncesid
count,94,88
unique,77,55
top,357799374,199281
freq,3,18


## Create new geometry that combines OSM and NCES point locations
The combined data includes point locations from OSM and point locations from NCES. Rows without an OSM point need a new geometry column that falls back to the NCES location.

In [28]:
osm_nces_gdf_pt2[['osmid','ncesid','name_osm','name_nces']].head()

Unnamed: 0,osmid,ncesid,name_osm,name_nces
0,357771489,370393002051,Dean School,R B Dean Elementary
1,357772472,370393001570,Fairgrove School,Fairgrove Middle
2,357774000,370393001571,Green Grove School,Green Grove Elementary
3,357774969,370393002049,Hilly Branch School,Robeson Co Career Ctr
4,357775906,199476,Joe P Moore School,Robeson Community College


In [29]:
osm_nces_gdf_pt2['ncesid'].loc[osm_nces_gdf_pt2['ncesid'].isna()].describe()

count       0
unique      0
top       NaN
freq      NaN
Name: ncesid, dtype: object

In [30]:
osm_nces_gdf_pt2.loc[osm_nces_gdf_pt2['osmid'].isna()]

Unnamed: 0,osmid,ncesid,distancept1,name_nces,level,schtype,numstaff,geometry_nces,name_osm,ele,old_name,geometry_osm
69,,370034603302,,Southeastern Academy,99.0,1.0,15.0,POINT (-78.87378865362859 34.65169717880272),,,,
70,,370225003249,,Sandy Grove Middle,99.0,1.0,36.25,POINT (-79.06581931618486 34.89650979378273),,,,
71,,370393001573,,Long Branch Elementary,99.0,1.0,24.53,POINT (-78.95646719569689 34.53306742472606),,,,
72,,370393001576,,Oxendine Elementary,99.0,1.0,19.18,POINT (-79.26317964381715 34.81010651845104),,,,
73,,370393001578,,Pembroke Elementary,99.0,1.0,38.4,POINT (-79.19435654937362 34.67814062401999),,,,
74,,370393001581,,Piney Grove Elementary,99.0,1.0,37.4,POINT (-79.03834029249344 34.68615865716654),,,,
75,,370393001583,,Prospect Elementary,99.0,1.0,52.94,POINT (-79.29500979640778 34.73313631448021),,,,
76,,370393001589,,Union Chapel Elementary,99.0,1.0,28.88,POINT (-79.13411016815866 34.71279651378435),,,,
77,,370393001590,,Union Elementary,99.0,1.0,23.29,POINT (-79.25019992731022 34.62667677925854),,,,
78,,370393002102,,Purnell Swett High,99.0,1.0,95.68,POINT (-79.24578700781115 34.69647553738752),,,,


In [31]:
osm_nces_gdf_pt2['geometry_pt2'] = osm_nces_gdf_pt2['geometry_osm']
osm_nces_gdf_pt2['geometry_pt2'] = osm_nces_gdf_pt2['geometry_pt2'].fillna(osm_nces_gdf_pt2['geometry_nces'])
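
The two lines above prefer the OSM location and fall back to the NCES location. The sketch below repeats that pattern on hypothetical WKT strings and adds a `geometry_source` column, which is an illustrative addition and not part of this notebook, to record which source supplied each point.

```python
import pandas as pd

# Hypothetical WKT strings; None marks a source that has no point for that row
points_df = pd.DataFrame({
    "geometry_osm":  ["POINT (-79.34 34.73)", None, None],
    "geometry_nces": ["POINT (-79.35 34.74)", "POINT (-78.87 34.65)", None],
})

# Prefer the OSM location and fall back to NCES where OSM is missing
points_df["geometry_pt2"] = points_df["geometry_osm"].fillna(points_df["geometry_nces"])

# Record which source supplied each point (OSM is assigned last so it wins)
points_df["geometry_source"] = "none"
points_df.loc[points_df["geometry_nces"].notna(), "geometry_source"] = "nces"
points_df.loc[points_df["geometry_osm"].notna(), "geometry_source"] = "osm"

print(points_df[["geometry_pt2", "geometry_source"]])
```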

In [32]:
osm_nces_gdf_pt2.loc[osm_nces_gdf_pt2['geometry_pt2'].isna()]

Unnamed: 0,osmid,ncesid,distancept1,name_nces,level,schtype,numstaff,geometry_nces,name_osm,ele,old_name,geometry_osm,geometry_pt2


In [33]:
osm_nces_gdf_pt2[['geometry_pt2','geometry_osm','geometry_nces']]

Unnamed: 0,geometry_pt2,geometry_osm,geometry_nces
0,POINT (-79.3447649 34.7318283),POINT (-79.3447649 34.7318283),POINT (-79.34481445940865 34.73274021259314)
1,POINT (-79.1736487 34.4929417),POINT (-79.1736487 34.4929417),POINT (-79.17370687961406 34.49329831006692)
2,POINT (-79.1430917 34.557386),POINT (-79.1430917 34.557386),POINT (-79.14273972087641 34.55749711706876)
3,POINT (-79.1230908 34.5904415),POINT (-79.1230908 34.5904415),POINT (-79.12360764715368 34.59114800040614)
4,POINT (-79.00530910000001 34.6229421),POINT (-79.00530910000001 34.6229421),POINT (-79.00439765883218 34.62129695803053)
...,...,...,...
108,POINT (-79.205 34.687),POINT (-79.205 34.687),
109,POINT (-79.2008 34.6904),POINT (-79.2008 34.6904),
110,POINT (-79.2 34.6851),POINT (-79.2 34.6851),
111,POINT (-79.1996 34.6921),POINT (-79.1996 34.6921),


### Add new Unique ID for Pt2 Data

In [34]:
osm_nces_gdf_pt2['uniqueidpt2'] = osm_nces_gdf_pt2['osmid'].fillna('missingosmid') + \
                                  osm_nces_gdf_pt2['ncesid'].fillna('missingncesid')
osm_nces_gdf_pt2[['uniqueidpt2','osmid','ncesid']].head()

Unnamed: 0,uniqueidpt2,osmid,ncesid
0,357771489370393002051,357771489,370393002051
1,357772472370393001570,357772472,370393001570
2,357774000370393001571,357774000,370393001571
3,357774969370393002049,357774969,370393002049
4,357775906199476,357775906,199476
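
Concatenating two id strings without a separator works when the results happen to be unique, as they are in this notebook. In general, though, different id pairs can produce the same string. The sketch below uses hypothetical short ids chosen to collide, and shows that a separator such as `_` keeps them distinct. The separator is a suggestion and not something the notebook does.

```python
import pandas as pd

# Hypothetical ids chosen to collide when concatenated without a separator
ids_df = pd.DataFrame({
    "osmid":  ["12", "1", None],
    "ncesid": ["3", "23", "370034603302"],
})

osm_part = ids_df["osmid"].fillna("missingosmid")
nces_part = ids_df["ncesid"].fillna("missingncesid")

ids_df["id_plain"] = osm_part + nces_part        # "12"+"3" and "1"+"23" both give "123"
ids_df["id_sep"] = osm_part + "_" + nces_part    # "12_3" and "1_23" stay distinct

print(ids_df)
```

Both id columns must hold strings for the `+` operator to concatenate rather than add, which is why `osmid` was converted to a string earlier.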


In [35]:
osm_nces_gdf_pt2[['uniqueidpt2','osmid','ncesid']].describe()

Unnamed: 0,uniqueidpt2,osmid,ncesid
count,113,94,88
unique,113,77,55
top,357783754missingncesid,357799374,199281
freq,1,3,18


In [36]:
help(add2incore.df2gdf_WKTgeometry)

Help on function df2gdf_WKTgeometry in module pyincore_addons.geoutil_20210618:

df2gdf_WKTgeometry(df: pandas.core.frame.DataFrame, projection='epsg:4326', reproject='epsg:4326', geometryvar='geometry')
    Function to convert dataframe with WKT Geometry to Geodata Frame
    
    Tested Python Enviroment:
        Python Version      3.7.10
        geopandas version:  0.9.0
        pandas version:     1.2.4
        shapely version:    1.7.1
    Args:
        :param df: dataframe with Well Known Text (WKT) geometry
        :param projection: String with Coordinate Reference System - default is epsg:4326
        :help projection: https://spatialreference.org/ref/epsg/wgs-84/
            Use UTM for measuring distances and area in meters
            Common Universal Transverse Mercator (UTM) for North America
            UTM zone 10N = West Coast     = epsg:26910
            UTM zone 17N = North Carolina = epsg:26917
            UTM zone 19N = Maine          = epsg:26919
            https

In [37]:
osm_nces_gdf_pt2_gdf = add2incore.df2gdf_WKTgeometry(df = osm_nces_gdf_pt2, 
                                                     projection = "epsg:4326",
                                                     reproject="epsg:26917",
                                                     geometryvar = "geometry_pt2")
osm_nces_gdf_pt2_gdf.head(2)

Unnamed: 0,osmid,ncesid,distancept1,name_nces,level,schtype,numstaff,geometry_nces,name_osm,ele,old_name,geometry_osm,geometry_pt2,uniqueidpt2,geometry
0,357771489,370393002051,101.253934,R B Dean Elementary,99.0,1.0,16.2,POINT (-79.34481445940865 34.73274021259314),Dean School,59.0,,POINT (-79.3447649 34.7318283),POINT (-79.3447649 34.7318283),357771489370393002051,POINT (651541.469 3844552.257)
1,357772472,370393001570,39.916393,Fairgrove Middle,99.0,1.0,22.98,POINT (-79.17370687961406 34.49329831006692),Fairgrove School,38.0,,POINT (-79.1736487 34.4929417),POINT (-79.1736487 34.4929417),357772472370393001570,POINT (667689.267 3818328.693)


## Run nearest neighbor algorithm - Add RefUSA ID to OSMID

In [38]:
osm_refUSA_nces_gdf = add2incore.nearest_pt_search(gdf_a = naics61_gdf,
                                               gdf_b = osm_nces_gdf_pt2_gdf,
                                               uniqueid_a = 'IUSA Number',
                                               uniqueid_b = 'uniqueidpt2',
                                               k = 4,
                                               dist_cutoff = 250)

In [39]:
osm_refUSA_nces_gdf[['neighbor','uniqueidpt2']].fillna('none').groupby(
    ['neighbor']).count()

neighbor,uniqueidpt2
1,22
2,5
3,1


## Merge in original data to make full list of points with descriptions

In [40]:
osm_refUSA_nces_gdf['distancept2'] = osm_refUSA_nces_gdf['distance']
naics61_gdf['name_refusa'] = naics61_gdf['Company Name']
osm_refUSA_nces_gdf_pt1 = pd.merge(osm_refUSA_nces_gdf[['uniqueidpt2','IUSA Number','distancept2','neighbor']],
                            naics61_gdf[['IUSA Number','name_refusa','Location Employee Size Actual',
                                            'Primary NAICS','Primary NAICS Description','NAICS2D','geometry_refusa']], 
                        left_on='IUSA Number', right_on='IUSA Number', how='outer')
osm_refUSA_nces_gdf_pt1.head()

Unnamed: 0,uniqueidpt2,IUSA Number,distancept2,neighbor,name_refusa,Location Employee Size Actual,Primary NAICS,Primary NAICS Description,NAICS2D,geometry_refusa
0,357774969370393002049,36-706-7980,14.769493,1.0,Robeson County Career Ctr,20,611110,Elementary & Secondary Schools,61,POINT (-79.122975 34.590349)
1,357775906199476,88-314-8140,179.800901,1.0,Southeastern Family Violence,12,624110,Child & Youth Services,62,POINT (-79.007262 34.623086)
2,357777106370393001572,48-818-7204,80.179997,1.0,Littlefield Middle School,91,611110,Elementary & Secondary Schools,61,POINT (-78.91708199999999 34.644767)
3,357777526370393001574,48-818-7220,53.946759,1.0,Magnolia School,82,611110,Elementary & Secondary Schools,61,POINT (-79.00396000000001 34.711962)
4,357784148370393002242,71-781-1356,40.230162,1.0,Public Schools-Robeson County,14,611110,Elementary & Secondary Schools,61,POINT (-78.995773 34.622437)


In [41]:
osm_refUSA_nces_gdf_pt1[['uniqueidpt2','IUSA Number']].describe()

Unnamed: 0,uniqueidpt2,IUSA Number
count,28,44
unique,22,43
top,missingosmid3703930,45-029-5779
freq,3,2


In [42]:
osm_refUSA_nces_gdf_pt2 = pd.merge(osm_nces_gdf_pt2_gdf,
                            osm_refUSA_nces_gdf_pt1,
                        left_on='uniqueidpt2', right_on='uniqueidpt2', how='outer')
osm_refUSA_nces_gdf_pt2.head()

Unnamed: 0,osmid,ncesid,distancept1,name_nces,level,schtype,numstaff,geometry_nces,name_osm,ele,...,geometry,IUSA Number,distancept2,neighbor,name_refusa,Location Employee Size Actual,Primary NAICS,Primary NAICS Description,NAICS2D,geometry_refusa
0,357771489,370393002051,101.253934,R B Dean Elementary,99.0,1.0,16.2,POINT (-79.34481445940865 34.73274021259314),Dean School,59.0,...,POINT (651541.469 3844552.257),,,,,,,,,
1,357772472,370393001570,39.916393,Fairgrove Middle,99.0,1.0,22.98,POINT (-79.17370687961406 34.49329831006692),Fairgrove School,38.0,...,POINT (667689.267 3818328.693),,,,,,,,,
2,357774000,370393001571,34.573736,Green Grove Elementary,99.0,1.0,15.8,POINT (-79.14273972087641 34.55749711706876),Green Grove School,42.0,...,POINT (670363.910 3825527.231),,,,,,,,,
3,357774969,370393002049,91.597936,Robeson Co Career Ctr,99.0,1.0,17.75,POINT (-79.12360764715368 34.59114800040614),Hilly Branch School,43.0,...,POINT (672130.965 3829227.353),36-706-7980,14.769493,1.0,Robeson County Career Ctr,20.0,611110.0,Elementary & Secondary Schools,61.0,POINT (-79.122975 34.590349)
4,357775906,199476,200.734311,Robeson Community College,5.0,5.0,393.0,POINT (-79.00439765883218 34.62129695803053),Joe P Moore School,42.0,...,POINT (682862.835 3833039.391),88-314-8140,179.800901,1.0,Southeastern Family Violence,12.0,624110.0,Child & Youth Services,62.0,POINT (-79.007262 34.623086)


In [43]:
osm_refUSA_nces_gdf_pt2[['uniqueidpt2','osmid','ncesid','IUSA Number']].describe()

Unnamed: 0,uniqueidpt2,osmid,ncesid,IUSA Number
count,119,98,93,44
unique,113,77,55,43
top,missingosmid3703930,357799374,199281,45-029-5779
freq,3,3,18,2


## Add Geometry

In [44]:
osm_refUSA_nces_gdf_pt2.columns

Index(['osmid', 'ncesid', 'distancept1', 'name_nces', 'level', 'schtype',
       'numstaff', 'geometry_nces', 'name_osm', 'ele', 'old_name',
       'geometry_osm', 'geometry_pt2', 'uniqueidpt2', 'geometry',
       'IUSA Number', 'distancept2', 'neighbor', 'name_refusa',
       'Location Employee Size Actual', 'Primary NAICS',
       'Primary NAICS Description', 'NAICS2D', 'geometry_refusa'],
      dtype='object')

In [45]:
osm_refUSA_nces_gdf_pt2[['geometry','geometry_pt2','geometry_refusa']].head()

Unnamed: 0,geometry,geometry_pt2,geometry_refusa
0,POINT (651541.469 3844552.257),POINT (-79.3447649 34.7318283),
1,POINT (667689.267 3818328.693),POINT (-79.1736487 34.4929417),
2,POINT (670363.910 3825527.231),POINT (-79.1430917 34.557386),
3,POINT (672130.965 3829227.353),POINT (-79.1230908 34.5904415),POINT (-79.122975 34.590349)
4,POINT (682862.835 3833039.391),POINT (-79.00530910000001 34.6229421),POINT (-79.007262 34.623086)


In [46]:
# Fill in missing geometries from RefUSA
osm_refUSA_nces_gdf_pt2['geometry_pt3'] = osm_refUSA_nces_gdf_pt2['geometry_pt2']
osm_refUSA_nces_gdf_pt2['geometry_pt3'] = osm_refUSA_nces_gdf_pt2['geometry_pt3'].fillna(osm_refUSA_nces_gdf_pt2['geometry_refusa'])
#osm_refUSA_nces_gdf_pt2['geometry_pt3'].loc[osm_refUSA_nces_gdf_pt2['geometry_pt3'].isna()]

In [47]:
osm_refUSA_nces_gdf_pt2['uniqueidpt3'] = osm_refUSA_nces_gdf_pt2['osmid'].fillna('missingosmid') + \
                                  osm_refUSA_nces_gdf_pt2['ncesid'].fillna('missingncesid') + \
                                  osm_refUSA_nces_gdf_pt2['IUSA Number'].fillna('missingrefusaid')

osm_refUSA_nces_gdf_pt2[['uniqueidpt3','osmid','ncesid','IUSA Number']].head()

Unnamed: 0,uniqueidpt3,osmid,ncesid,IUSA Number
0,357771489370393002051missingrefusaid,357771489,370393002051,
1,357772472370393001570missingrefusaid,357772472,370393001570,
2,357774000370393001571missingrefusaid,357774000,370393001571,
3,35777496937039300204936-706-7980,357774969,370393002049,36-706-7980
4,35777590619947688-314-8140,357775906,199476,88-314-8140


In [48]:
osm_refUSA_nces_gdf_pt2[['uniqueidpt3','osmid','ncesid','IUSA Number']].describe()

Unnamed: 0,uniqueidpt3,osmid,ncesid,IUSA Number
count,135,98,93,44
unique,135,77,55,43
top,missingosmidmissingncesid40-031-7243,357799374,199281,45-029-5779
freq,1,3,18,2


In [50]:
# Save Work at this point as CSV
savefile = os.path.join(programname, programname+".csv")  # same relative folder created by os.mkdir above
osm_refUSA_nces_gdf_pt2.to_csv(savefile)
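
When a later notebook reads this CSV, pandas infers the id columns as numbers by default, and a column with missing ids becomes floating point. The sketch below uses an in-memory buffer in place of the saved file and a hypothetical two-row extract. It shows how passing `dtype` to `pd.read_csv` keeps the ids as strings so they can be compared and concatenated as in this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row extract with the id columns used in this notebook
out_df = pd.DataFrame({
    "osmid": ["357771489", None],
    "ncesid": ["370393002051", "370034603302"],
})

buffer = io.StringIO()  # stands in for the CSV file on disk
out_df.to_csv(buffer, index=False)

buffer.seek(0)
default_df = pd.read_csv(buffer)  # ids inferred as numbers
buffer.seek(0)
string_df = pd.read_csv(buffer, dtype={"osmid": str, "ncesid": str})

print(default_df.dtypes)
print(string_df.dtypes)
```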