# School Locations Processing
We have three files with school location information, and each has slightly different contents. We need to compare them and settle on a final list of geo-locatable schools.

In [None]:
import geopandas as gpd
import pandas as pd

## School Point Locations
Data source: https://data.cityofnewyork.us/Education/School-Point-Locations/jfju-ynrr/about_data

Last updated: November 26, 2024

Annoyingly, the data dictionary on the above linked page doesn't match the data itself, so we're left to guess on the meaning of some of these fields. Also, the description on the above linked page says this data contains Address, Principal, and Principal contact info, but that isn't in here.

In [None]:
school_points_gdf = gpd.read_file('DOE/School Locations/School Point Locations/SchoolPoints_APS_2024_08_28/SchoolPoints_APS_2024_08_28.shp')
school_points_gdf.rename(columns={'Location_C': 'Location Code', 'Name': 'Location Name'}, inplace=True)
school_points_gdf

### LCGMS
Last updated: November 26, 2024

This data has richer fields (grades, address, open date, principal contact info, etc.). However, the geocoded and non-geocoded files do not contain the same set of records. It is not yet clear whether the two files differ in any other way; that still needs to be checked.

#### Non-geocoded LCGMS data
Source: https://www.nycenet.edu/PublicApps/LCGMS.aspx

In [None]:
lcgms_df = pd.read_excel('DOE/School Locations/LCGMS/LCGMS_SchoolData_20250806_0112.xlsx', engine='openpyxl')
lcgms_df

#### Geocoded LCGMS data
Source: https://data.cityofnewyork.us/Education/NYC-DOE-Public-School-Location-Information/3bkj-34v2/about_data

In [None]:
lcgms_geocoded_df = pd.read_csv('DOE/School Locations/LCGMS/LCGMS_SchoolData_additional_geocoded_fields_added_.csv', encoding='latin-1')
lcgms_geocoded_df
# lcgms_geocoded_gdf = gpd.GeoDataFrame(lcgms_geocoded_df, geometry=gpd.GeoSeries.from_xy(lcgms_geocoded_df['lon'], lcgms_geocoded_df['lat']), crs=4326)

### Discrepancies

#### TODO: resolve record count discrepancies between LCGMS datasets

In [None]:
# Show records that are in the non-geocoded data but NOT in the geocoded data
lcgms_df.merge(lcgms_geocoded_df[['Location Code', 'Location Name']], on='Location Code', how='left', suffixes=('_lcgms', '_geocoded'), indicator=True).query('_merge == "left_only"').drop(columns='_merge')
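The left merge above only shows one direction of the mismatch. Below is a minimal sketch of a two-way comparison using an outer merge with `indicator=True`. The frames `left` and `right` are toy stand-ins for `lcgms_df` and `lcgms_geocoded_df`, and the codes and names in them are made up for illustration, not taken from the LCGMS files.

```python
import pandas as pd

# Toy stand-ins for lcgms_df and lcgms_geocoded_df (hypothetical values).
left = pd.DataFrame({
    'Location Code': ['A001', 'A002', 'A003'],
    'Location Name': ['One', 'Two', 'Three'],
})
right = pd.DataFrame({
    'Location Code': ['A002', 'A003', 'A004'],
    'Location Name': ['Two', 'Three', 'Four'],
})

# An outer merge keeps every code from both sides; _merge records where each came from.
compared = left.merge(
    right,
    on='Location Code',
    how='outer',
    suffixes=('_lcgms', '_geocoded'),
    indicator=True,
)

only_left = compared.query('_merge == "left_only"')    # in non-geocoded only
only_right = compared.query('_merge == "right_only"')  # in geocoded only
counts = compared['_merge'].value_counts()             # rows per category
```

Looking at `counts` first gives a quick sense of how large the discrepancy is in each direction before inspecting individual rows.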

#### TODO: resolve record count discrepancies between LCGMS data and School Points Data

In [None]:
# Show records that are NOT in lcgms data but are in school points data
school_points_gdf.merge(lcgms_df[['Location Code', 'Location Name']], on='Location Code', how='left', suffixes=('_school', '_lcgms'), indicator=True).query('_merge == "left_only"')
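One possible cause of spurious mismatches is formatting differences in the join key, such as stray whitespace or inconsistent case. This is an assumption and has not been verified against these files. A minimal sketch of normalizing a key column before merging, where `normalize_codes` is a hypothetical helper and the sample values are invented:

```python
import pandas as pd

def normalize_codes(codes: pd.Series) -> pd.Series:
    """Trim surrounding whitespace and upper-case a key column (hypothetical helper)."""
    return codes.astype('string').str.strip().str.upper()

# Invented sample values with the kinds of differences that would break an exact match.
raw = pd.Series([' k001', 'K002 ', 'k003'])
clean = normalize_codes(raw)
```

If this turns out to matter, the same helper could be applied to the `Location Code` column of each frame before any of the merges above.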

In [None]:
# TODO: are any of the schools missing from the LCGMS data present in the school points data, so that they could be mapped by pulling their points from there?
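One way to approach this TODO is sketched below, assuming "missing" means codes that appear in the non-geocoded LCGMS file but not in the geocoded one. The frames `lcgms`, `geocoded`, and `points` are toy stand-ins for `lcgms_df`, `lcgms_geocoded_df`, and `school_points_gdf`. All values are hypothetical, and the `lon`/`lat` columns stand in for whatever geometry the real frames carry.

```python
import pandas as pd

# Toy stand-ins with hypothetical values.
lcgms = pd.DataFrame({'Location Code': ['A001', 'A002', 'A003']})
geocoded = pd.DataFrame({'Location Code': ['A001'], 'lon': [-73.9], 'lat': [40.7]})
points = pd.DataFrame({
    'Location Code': ['A002', 'A004'],
    'lon': [-73.8, -73.7],
    'lat': [40.6, 40.5],
})

# Codes listed in LCGMS that have no geocoded record.
missing_codes = set(lcgms['Location Code']) - set(geocoded['Location Code'])

# Of those, the ones that the school points data can supply a location for.
recoverable = points[points['Location Code'].isin(missing_codes)]

# Codes that neither source can locate.
still_missing = missing_codes - set(points['Location Code'])
```

Anything left in `still_missing` would need another source, such as geocoding the address fields directly.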