# School Locations Processing
We have 3 different files with school location information, and each file has slightly different contents. We need to compare them and settle on our final/true list of geo-locatable schools.

In [None]:
import geopandas as gpd
import pandas as pd

## Load Raw Data

### School Point Locations
Data source: https://data.cityofnewyork.us/Education/School-Point-Locations/jfju-ynrr/about_data

Last updated: November 26, 2024

Annoyingly, the data dictionary on the above linked page doesn't match the data itself, so we're left to guess on the meaning of some of these fields. Also, the description on the above linked page says this data contains Address, Principal, and Principal contact info, but that isn't in here.

In [None]:
school_points_gdf = gpd.read_file('../data/raw_data/DOE/School Locations/School Point Locations/SchoolPoints_APS_2024_08_28/SchoolPoints_APS_2024_08_28.shp')
school_points_gdf.rename(columns={'Location_C': 'Location Code', 'Name': 'Location Name'}, inplace=True)
school_points_gdf

### LCGMS
Last updated: November 26, 2024

This data has more robust fields in it related to grades, address, open date, principal contact info, etc. But there is a discrepancy in the records included in the geocoded vs. non-geocoded files. Not sure yet if there are any other discrepancies between these two files but need to figure that out.

#### Non-geocoded LCGMS data
Source: https://www.nycenet.edu/PublicApps/LCGMS.aspx

In [None]:
lcgms_df = pd.read_excel('../data/raw_data/DOE/School Locations/LCGMS/LCGMS_SchoolData_20250806_0112.xlsx', dtype=str, engine='openpyxl')
lcgms_df

#### Geocoded LCGMS data
Source: https://data.cityofnewyork.us/Education/NYC-DOE-Public-School-Location-Information/3bkj-34v2/about_data

In [None]:
lcgms_geocoded_df = pd.read_csv('../data/raw_data/DOE/School Locations/LCGMS/LCGMS_SchoolData_additional_geocoded_fields_added_.csv', encoding='latin-1')
lcgms_geocoded_df
# lcgms_geocoded_gdf = gpd.GeoDataFrame(lcgms_geocoded_df, geometry=gpd.GeoSeries.from_xy(lcgms_geocoded_df['lon'], lcgms_geocoded_df['lat']), crs=4326)

## Investigate Discrepancies

In [None]:
# TODO: summarize the discrepancies between these datasets for our non-technical folks so they can run them down, e.g. by emailing DOE officials. We wouldn't want to suggest fake schools to the Z campaign or anything weird like that.
# Maybe they can also just do a manual fact-check on the records that are different between datasets?

### School Points vs. LCGMS non-geocoded

The non-geocoded LCGMS data seems to be the more "official" source compared to School Points and geocoded LCGMS. It is the spreadsheet you get by following links from the official LCGMS page [here](https://infohub.nyced.org/in-our-schools/operations/lcgms), whereas the geocoded version is sort of a sneaky extra dataset on NYC Open Data and School Points is a poorly documented Shapefile on NYC Open Data. So, I think it makes sense to trust/prefer the non-geocoded LCGMS data over the geocoded data and attempt to map the non-geocoded LCGMS data somehow. The easiest way to do that would be to attach it to the school points layer, but first we need to figure out how doable that is (i.e. what discrepancies show up in that potential join).
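To size up a potential join before committing to it, it can help to count how many keys fall on each side. The following is a minimal sketch of that idea using `merge(..., indicator=True)`; `key_overlap` is a hypothetical helper and the Location Codes are made up for illustration.

```python
import pandas as pd


def key_overlap(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Count distinct keys found only in `left`, only in `right`, or in both."""
    merged = left[[key]].drop_duplicates().merge(
        right[[key]].drop_duplicates(),
        on=key,
        how='outer',
        indicator=True,
    )
    # `_merge` is categorical, so all three labels appear even when a count is 0
    return merged['_merge'].value_counts().to_dict()


# Made-up Location Codes, purely for illustration
points = pd.DataFrame({'Location Code': ['K001', 'K002', 'M010']})
lcgms = pd.DataFrame({'Location Code': ['K002', 'M010', 'X999', 'Q123']})

overlap = key_overlap(points, lcgms, 'Location Code')
print(overlap)
```

Running the same helper on each pair of the three datasets would give a compact table of discrepancies to hand to non-technical reviewers.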

In [None]:
# Show columns that are in both school_points_gdf and lcgms_df
school_points_gdf.columns.intersection(lcgms_df.columns)

In [None]:
# Show instances where Location Name is different between school_points_gdf and lcgms_df
school_points_gdf[['Location Code', 'Location Name']].merge(
    lcgms_df[['Location Code', 'Location Name']],
    on='Location Code',
    how='inner',
    suffixes=('_school', '_lcgms')
).query('`Location Name_school` != `Location Name_lcgms`')

#### How many LCGMS records are NOT in School Points? (i.e. `set(LCGMS).difference(set(School Points))`)

In [None]:
# Show records that are NOT in school points but ARE in LCGMS
lcgms_records_missing_from_school_points = school_points_gdf[['Location Code', 'Location Name']].merge(
    lcgms_df[['Location Code', 'Location Name']],
    on='Location Code',
    how='outer',
    suffixes=('_school_points', '_lcgms'),
    indicator=True
).query('_merge == "right_only"')
print("Total records in LCGMS that are NOT in School Points:", len(lcgms_records_missing_from_school_points))
lcgms_records_missing_from_school_points

Check if we can find the LCGMS records that are missing from School Points in geocoded LCGMS instead

In [None]:
# Welp, only 1 of the LCGMS records missing from School Points is actually in the geocoded LCGMS data, and the lat/lon for that record is missing in geocoded LCGMS as well. So we won't have a use for geocoded LCGMS.
# TODO: have someone on our team research these 17 schools we can't map
lcgms_records_missing_from_school_points.drop(columns=['_merge']).merge(
    lcgms_geocoded_df[['Location Code', 'Location Name', 'Latitude', 'Longitude']],
    on='Location Code',
    how='left',
    suffixes=('_missing_from_school_points', '_geocoded'),
    indicator=True
).query('_merge == "both"')


#### How many School Points records are NOT in LCGMS? (i.e. `set(School Points).difference(set(LCGMS))`)

In [None]:
# Show records that ARE in School Points but NOT in (non-geocoded) LCGMS
school_points_records_missing_from_lcgms = school_points_gdf[['Location Code', 'Location Name']].merge(
    lcgms_df[['Location Code', 'Location Name']],
    on='Location Code',
    how='outer',
    suffixes=('_school_points', '_lcgms'),
    indicator=True
).query('_merge == "left_only"')
print("Total records in School Points that are NOT in LCGMS:", len(school_points_records_missing_from_lcgms))
school_points_records_missing_from_lcgms

Check if we can find the School Points records that are missing from LCGMS in geocoded LCGMS instead

In [None]:
# phewf - no records from geocoded LCGMS that would have to be added to school points even after joining non-geocoded LCGMS onto school points.
school_points_records_missing_from_lcgms.drop(columns=['_merge']).merge(
    lcgms_geocoded_df[['Location Code', 'Location Name']],
    on='Location Code',
    how='left',
    suffixes=('_missing_from_lcgms', '_geocoded'),
    indicator=True
).query('_merge == "both"')

#### Are there any records in geocoded LCGMS that are NOT in the joined result of non-geocoded LCGMS + School Points?

Thankfully, no.

In [None]:
# Show records that are in geocoded LCGMS but NOT in the joined result of non-geocoded LCGMS + School Points
school_points_gdf.merge(
    lcgms_df[['Location Code', 'Location Name']],
    on='Location Code',
    how='outer',
).merge(
    lcgms_geocoded_df[['Location Code', 'Location Name']],
    on='Location Code',
    how='outer',  # needs 'outer' (or 'right'): a left join never yields "right_only" rows
    suffixes=('', '_geocoded'),
    indicator=True
).query('_merge == "right_only"')
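A note on reading the indicator column: which `_merge` labels can show up depends on the `how` argument, so an empty `right_only` result only means something for a `right` or `outer` merge. A small sketch with made-up frames (the codes and the `merge_labels` helper are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'Location Code': ['K001', 'K002']})
right = pd.DataFrame({'Location Code': ['K002', 'X999']})  # made-up codes


def merge_labels(how: str) -> set:
    """Return the set of `_merge` labels that a given join type produces."""
    merged = left.merge(right, on='Location Code', how=how, indicator=True)
    return set(merged['_merge'].astype(str))


left_labels = merge_labels('left')
outer_labels = merge_labels('outer')
print('left :', sorted(left_labels))
print('outer:', sorted(outer_labels))
```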

### LCGMS non-geocoded vs. LCGMS geocoded

The TL;DR here is that the delta between these two datasets is really befuddling, especially since it seems the geocoded one should be exactly the same as the non-geocoded one apart from the additional lat/lon fields. I don't know the rhyme or reason for these discrepancies, so I think we should just prefer/trust the more official-looking one wherever possible, which is the non-geocoded LCGMS.

Would love for the civil servants who made these datasets to explain why they are different.

In [None]:
# NOTE from LCGMS data dict: "LOCATION CODE: a unique identifier that can include schools, administrative offices, learning communities, etc.  When the Learning_Community_Name = ‘School’, the Location_Code is a combination of the borough code and the school number."

In [None]:
# Show fields that are in lcgms_geocoded_df but NOT in lcgms_df
lcgms_geocoded_df.columns.difference(lcgms_df.columns)

In [None]:
# Show fields that are in lcgms_df but NOT in lcgms_geocoded_df
lcgms_df.columns.difference(lcgms_geocoded_df.columns)

In [None]:
# Compare data in shared columns between lcgms_df and lcgms_geocoded_df
shared_columns = set(lcgms_df.columns).intersection(set(lcgms_geocoded_df.columns))
print(f"Shared columns: {sorted(shared_columns)}")

# Merge the dataframes on Location Code to compare shared columns
comparison_df = lcgms_df.merge(
    lcgms_geocoded_df, 
    on='Location Code', 
    how='inner', 
    suffixes=('_non_geocoded', '_geocoded')
)

# Check for differences in each shared column (excluding Location Code which is the join key)
shared_data_columns = [col for col in shared_columns if col != 'Location Code']
differences_summary = {}

for col in shared_data_columns:
    col_non_geo = f"{col}_non_geocoded"
    col_geo = f"{col}_geocoded"
    
    # Count records where values differ (handling NaN values)
    different_mask = (
        (comparison_df[col_non_geo].fillna('') != comparison_df[col_geo].fillna('')) |
        (comparison_df[col_non_geo].isna() != comparison_df[col_geo].isna())
    )
    
    num_differences = different_mask.sum()
    differences_summary[col] = num_differences
    
    if num_differences > 0:
        print(f"\n{col}: {num_differences} differences found")
        # Show first few examples of differences
        # diff_examples = comparison_df[different_mask][['Location Code', col_non_geo, col_geo]].head()
        # print(diff_examples)

# print(f"\nSummary of differences:")
# for col, count in differences_summary.items():
#     print(f"{col}: {count} differences")
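One caveat about the comparison above: `lcgms_df` is loaded with `dtype=str`, while `lcgms_geocoded_df` is loaded with pandas' inferred dtypes, so a code-like column may be flagged as different purely because one side is a string and the other a number. Below is a minimal sketch of normalizing both sides to text before comparing. `to_text` and `normalize_for_compare` are hypothetical helpers and the sample values are made up.

```python
import pandas as pd


def to_text(value) -> str:
    """Render a single cell as comparable text; missing values become ''."""
    if pd.isna(value):
        return ''
    # Integers read alongside NaN come back as floats (e.g. 10001.0)
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value).strip()


def normalize_for_compare(series: pd.Series) -> pd.Series:
    """Apply `to_text` to every cell of a column."""
    return series.map(to_text)


# Made-up values: strings on one side, inferred dtypes on the other
non_geocoded = pd.Series(['10001', None, ' K123 ', '7'])
geocoded = pd.Series([10001.0, float('nan'), 'K123', 8.0], dtype=object)

mismatches = normalize_for_compare(non_geocoded) != normalize_for_compare(geocoded)
print('Real differences after normalizing:', int(mismatches.sum()))
```

Any differences that survive this kind of normalization are more likely to be genuine content differences rather than loading artifacts.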

In [None]:
# Show records where Location Code matches but Location Name does not match
lcgms_df[['Location Code', 'Location Name']].merge(
    lcgms_geocoded_df[['Location Code', 'Location Name']], 
    on='Location Code', 
    how='inner', 
    suffixes=('_non-geocoded', '_geocoded')
).query('`Location Name_non-geocoded` != `Location Name_geocoded`')

## Join Data

Show that there's no difference in joining by Location Code vs. ATS System Code

In [None]:
ats_count = len(school_points_gdf.merge(lcgms_df, left_on='ATS', right_on='ATS System Code', how='inner'))
loc_count = len(school_points_gdf.merge(lcgms_df, on='Location Code', how='inner'))
assert ats_count == loc_count, f"ATS join: {ats_count} records, Location Code join: {loc_count} records"
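Equal row counts are a good sign, but on their own they don't show that both joins pair up the same records, or that neither key has duplicates. A slightly stricter check can be sketched with the `validate` argument of `merge` plus a comparison of the matched pairs. The column names are borrowed from above, and the values are made up.

```python
import pandas as pd
from pandas.errors import MergeError

# Made-up records, purely for illustration
points = pd.DataFrame({'Location Code': ['K001', 'K002'], 'ATS': ['01K001', '01K002']})
lcgms = pd.DataFrame({
    'Location Code': ['K001', 'K002'],
    'ATS System Code': ['01K001', '01K002'],
})

# validate='one_to_one' raises MergeError if either side repeats a key
by_ats = points.merge(
    lcgms, left_on='ATS', right_on='ATS System Code', how='inner', validate='one_to_one'
)
# 'Location Code' exists on both sides, so pandas suffixes it with _x / _y
same_pairs = bool((by_ats['Location Code_x'] == by_ats['Location Code_y']).all())

# A duplicated key on the right side is caught instead of silently adding rows
lcgms_with_dupe = pd.concat([lcgms, lcgms.iloc[[0]]], ignore_index=True)
try:
    points.merge(lcgms_with_dupe, on='Location Code', validate='one_to_one')
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(same_pairs, duplicate_detected)
```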

### Outer Join LCGMS with School Points

In [None]:
# For our final school points data (at least for now), join LCGMS onto school points
school_points_with_lcgms = school_points_gdf.merge(
    lcgms_df,
    on='Location Code',
    how='outer',
    suffixes=('_school_points', '_lcgms'),
    indicator=True
)

# Keep a column that indicates whether the record is missing from LCGMS or not so we can select just LCGMS records if we find out that the school points data is outdated or inaccurate compared to LCGMS.
school_points_with_lcgms.rename(columns={'_merge': 'in_LCGMS'}, inplace=True)
school_points_with_lcgms['in_LCGMS'] = school_points_with_lcgms['in_LCGMS'].isin(['both', 'right_only'])

school_points_with_lcgms

# Clean Data

## Cleaning up specific fields

In [None]:
school_points_with_lcgms['Open Date'] = pd.to_datetime(school_points_with_lcgms['Open Date'], format='%b %d %Y', errors='coerce')

### Coalesce `Location Name` from School Points and LCGMS

In [None]:
school_points_with_lcgms['Location Name'] = school_points_with_lcgms['Location Name_lcgms'].fillna(school_points_with_lcgms['Location Name_school_points'])
school_points_with_lcgms.drop(columns=['Location Name_lcgms', 'Location Name_school_points'], inplace=True)

### Coalesce `ATS System Code` from School Points and LCGMS

In [None]:
school_points_with_lcgms['ATS'] = school_points_with_lcgms['ATS'].fillna(school_points_with_lcgms['ATS System Code'])

### Building Code

In [None]:
# NOTE: not sure why we have disagreements between Building Code and Building_C. Seems likely that one is more up-to-date than the other, and likely LCGMS is more up-to-date, but haven't verified.
school_points_with_lcgms[
    (school_points_with_lcgms['Building_C'] != school_points_with_lcgms['Building Code'])
    & school_points_with_lcgms['Building Code'].notna()
    & school_points_with_lcgms['Building_C'].notna()
][['Location Code', 'Location Name', 'Building Code', 'Building_C', 'in_LCGMS']]

In [None]:
# For now, we're going to keep the Building Code from LCGMS unless NaN
school_points_with_lcgms['Building Code'] = school_points_with_lcgms['Building Code'
                                                                     ].fillna(school_points_with_lcgms['Building_C'])

### Address Fields

In [None]:
school_points_with_lcgms.rename(columns={'State Code': 'State'}, inplace=True)

In [None]:
address_fields = ['Primary Address', 'City', 'State', 'Zip']
school_points_with_lcgms.loc[:, address_fields] = school_points_with_lcgms[address_fields].apply(lambda x: x.str.strip().str.upper())

### Drop Unnecessary Columns

In [None]:
# TODO: need to go through all the LCGMS columns and figure out which ones we can drop
# Drop unnecessary columns
cols_to_drop = [
    'ATS System Code', # Duplicate of ATS column
    'Geographic', # This is from School Points and I have no idea what it means.
    'Building_C',  # This is the Building Code from School Point, which is duplicate
    'Status Description', # These are all either "Open" or NaN, so not useful
]

# All the "HighSchool" columns are null
cols_to_drop += [x for x in school_points_with_lcgms.columns if 'HighSchool' in x]
school_points_with_lcgms.drop(columns=cols_to_drop, inplace=True)

Reorder columns for readability

In [None]:
core_cols = ['Location Name', 'Managed By Name', 'Location Code', 'Building Code', 'ATS', 'Primary Address', 'City', 'State', 'Zip', 'Borough Block Lot', 'Census Tract', 'Community District', 'Council District']
school_points_with_lcgms = school_points_with_lcgms[core_cols + [col for col in school_points_with_lcgms.columns if col not in core_cols]]

## Identify DOE and Charter Schools that weren't in LCGMS

In [None]:
# For the points that don't join to LCGMS, filter out ones that have "charter" in the name and add the "Managed By Name" as "Charter" for non-LCGMS records
additional_charter_mask = (
    school_points_with_lcgms['Managed By Name'].isna() 
    & 
    (school_points_with_lcgms['Location Name'].fillna('').str.contains('charter', case=False))
)
school_points_with_lcgms.loc[additional_charter_mask, 'Managed By Name'] = 'Charter'

In [None]:
# For the points that don't join to LCGMS, set "Managed By Name" as "DOE" if meets a few DOE-related regex patterns
additional_doe_mask = (
    school_points_with_lcgms['Managed By Name'].isna() 
    & 
    # ALC="Alternative Learning Center", "YABC"="Young Adult Borough Center", District 79 is for alternative schools
    school_points_with_lcgms['Location Name'].fillna('').str.contains(r'[PMH]\.S\.|ALC|YABC|District 79|D79', regex=True)
)
school_points_with_lcgms.loc[additional_doe_mask, 'Managed By Name'] = 'DOE'
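Because the pattern above is case-sensitive and unanchored, short tokens like `ALC` can also match inside longer uppercase words. The sketch below compares the pattern as written against a variant with `\b` word boundaries. The variant is only a suggestion to consider, not something the data has been checked against, and the school names are made up.

```python
import pandas as pd

# Made-up names, purely for illustration
names = pd.Series([
    'P.S. 001 Example School',
    'FALCON ACADEMY',
    'Brooklyn ALC',
    'Example YABC',
    'Some Other Program',
])

original = r'[PMH]\.S\.|ALC|YABC|District 79|D79'
# Hypothetical variant: same tokens, but each must stand alone as a word
bounded = r'[PMH]\.S\.|\bALC\b|\bYABC\b|\bDistrict 79\b|\bD79\b'

original_hits = names.str.contains(original, regex=True)
bounded_hits = names.str.contains(bounded, regex=True)

print(pd.DataFrame({'name': names, 'original': original_hits, 'bounded': bounded_hits}))
```

In this toy example only 'FALCON ACADEMY' changes classification, which is the kind of false positive worth looking for in the manual review.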


In [None]:
# TODO: have someone manually look over these remaining schools with no DOE/Charter category and categorize them
print("Schools without DOE/Charter category:", len(school_points_with_lcgms[school_points_with_lcgms['Managed By Name'].isna()]))

# Duplicate geometries

Nearly HALF of the school points are duplicated locations. This means we don't end up seeing half of the points on the map because they're on top of each other.

A lot of these appear to be different schools that share the same address, even if the building they're in is different.

In [None]:
# Almost HALF the data is duplicate geometries. (almost no difference when dealing with lat/lon -- two more duplicates when looking at lat/lon)
print('Count of records with duplicate geometry:', school_points_with_lcgms['geometry'].duplicated(keep=False).sum())
print('Proportion of records with duplicate geometry:', school_points_with_lcgms['geometry'].duplicated(keep=False).sum() / len(school_points_with_lcgms))
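For reading these numbers: `duplicated(keep=False)` flags every record in a repeated group, including the first occurrence, so the count is "records that share a location with at least one other record" rather than "extra copies". A toy illustration, with letters standing in for geometries:

```python
import pandas as pd

# Letters stand in for point geometries
locations = pd.Series(['A', 'A', 'B', 'C', 'C', 'C'])

# Every member of a repeated group (what the cell above counts)
records_sharing_location = int(locations.duplicated(keep=False).sum())
# Only the repeats beyond each first occurrence
extra_copies = int(locations.duplicated().sum())
# Number of distinct locations that are shared
shared_locations = int(locations[locations.duplicated(keep=False)].nunique())

print(records_sharing_location, extra_copies, shared_locations)
```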

In [None]:
# Slightly less when looking at address but still nearly half duplicated
print('Count of records with duplicate address:', school_points_with_lcgms['Primary Address'].duplicated(keep=False).sum())
print('Proportion of records with duplicate address:', school_points_with_lcgms['Primary Address'].duplicated(keep=False).sum() / len(school_points_with_lcgms))

In [None]:
# Find where geometry is duplicated but address within duplicate group is different
school_points_with_lcgms.groupby('geometry').filter(lambda g: g['Primary Address'].nunique() > 1).sort_values(['geometry', 'Primary Address'])

In [None]:
# Ok at least with these 3 examples, they all appear to just be different schools that 
# share an address:
#    - CSI High School for International Studies
#    - Gaynor McCown Expeditionary Learning School
#    - Marsh Avenue School for Expeditionary Learning
school_points_with_lcgms[
    (school_points_with_lcgms.duplicated(subset=['geometry'], keep=False))
    & school_points_with_lcgms['geometry'].notna()
    ].sort_values(by='geometry')

# Geocoding

For now, we're geocoding all addresses that fall into either of the following scenarios:
- Records from LCGMS that didn't join onto School Points and thus have no geometry or lat/lon
- Records that have a duplicate geometry


It's potentially useful to geocode all the data -- both addresses and school names -- to compare the results to what's in school points and ensure accuracy. But for now we're not going down that route.

In [None]:
# NOTE: don't mess with how the `full_address` column is created. Geocoding idempotency relies on it being formatted exactly the same as it is here to prevent unnecessary Google Maps API calls.
# Create a full_address column for geocoding
school_points_with_lcgms.loc[school_points_with_lcgms['Primary Address'].notnull(), 'full_address'] =(
        school_points_with_lcgms['Primary Address'].fillna('') + ', ' +
        school_points_with_lcgms['City'].fillna('') + ', ' +
        school_points_with_lcgms['State'].fillna('') + ' ' +
        school_points_with_lcgms['Zip'].fillna('')
    ).str.strip(', ').replace(r', $', '', regex=True)

In [None]:
# Show how many addresses we have and what the duplicate breakdown is
missing_addresses = school_points_with_lcgms['Primary Address'].isna().sum()
non_null_addresses = len(school_points_with_lcgms) - missing_addresses
unique_addresses = school_points_with_lcgms['Primary Address'].nunique()

# Let's get the value counts to understand the distribution
address_counts = school_points_with_lcgms['Primary Address'].value_counts()

# How many addresses appear exactly once?
addresses_appearing_once = (address_counts == 1).sum()
print(f"Addresses that appear exactly once: {addresses_appearing_once}")

# How many addresses appear more than once?
addresses_appearing_multiple = (address_counts > 1).sum() 
print(f"Addresses that appear multiple times: {addresses_appearing_multiple}")

# Total records with those duplicate addresses
records_with_duplicate_addresses = address_counts[address_counts > 1].sum()
print(f"Total records that have duplicate addresses: {records_with_duplicate_addresses}")

print(f"Check: {addresses_appearing_once} + {addresses_appearing_multiple} should equal {unique_addresses}")
print(f"Check: {addresses_appearing_once} + {records_with_duplicate_addresses} should equal {non_null_addresses}")

In [None]:
# NOTE: for now, we're going to geocode all addresses associated with a duplicate geometry or no geometry at all.
# Addresses with no geometry at all
no_geometry_addresses = school_points_with_lcgms[school_points_with_lcgms['geometry'].isna()]['full_address'].dropna().unique().tolist()
# Addresses associated with a duplicate geometry
dupe_geometry_addresses = school_points_with_lcgms[
    (school_points_with_lcgms['geometry'].duplicated(keep=False))
]['full_address'].drop_duplicates().dropna().tolist()

addresses_to_geocode = list(set(no_geometry_addresses + dupe_geometry_addresses))

In [None]:
import os
import json
import googlemaps
import time
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

cache_file = '../data/google_maps_geocode_cache.json'
if os.path.exists(cache_file):
    with open(cache_file, 'r', encoding='utf-8') as f:
        cached_geocodes = json.load(f)
else:
    cached_geocodes = {}

# Get API key from environment variable
api_key = os.getenv('GOOGLE_MAPS_API_KEY')
if not api_key:
    raise ValueError("GOOGLE_MAPS_API_KEY not found in environment variables. Please check your .env file.")

gmaps = googlemaps.Client(key=api_key)


def save_geocode_cache():
    """Write the in-memory cache back to disk."""
    with open(cache_file, 'w', encoding='utf-8') as f:
        json.dump(cached_geocodes, f, indent=2, ensure_ascii=False)


new_geocodes = 0
for i, address in enumerate(addresses_to_geocode):
    if address in cached_geocodes:
        #print(f"{address} found in cache.")
        continue
    print(f"Geocoding {i+1}/{len(addresses_to_geocode)}: {address}")
    result = gmaps.geocode(address)
    if not result:
        # gmaps.geocode returns an empty list when nothing matches
        print(f"No geocode result for: {address}")
    # Cache misses as {} so they aren't re-requested; the parsing cell below turns {} into null lat/lng
    cached_geocodes[address] = result[0] if result else {}
    new_geocodes += 1

    # Save cache periodically (every 10 new geocodes)
    if new_geocodes % 10 == 0:
        save_geocode_cache()

    # Current rate limit is 3K per minute, which we should be well under
    # time.sleep(.1)

# Always save at the end, even if the last address in the list was already cached
if new_geocodes:
    save_geocode_cache()
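The caching behaviour is easier to reason about (and to test without spending API calls) when the geocoder is passed in as a function. Below is a minimal sketch of that pattern; `geocode_with_cache` and `fake_geocoder` are hypothetical names, the address is made up, and the fake geocoder just returns a fixed dict shaped like the fields parsed in the next cell.

```python
import json
import os
import tempfile


def geocode_with_cache(addresses, cache_path, geocode_fn):
    """Look up each address once, persisting results to a JSON cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, 'r', encoding='utf-8') as f:
            cache = json.load(f)

    new_lookups = 0
    for address in addresses:
        if address in cache:
            continue  # already geocoded on a previous run
        cache[address] = geocode_fn(address)
        new_lookups += 1

    # Only rewrite the file when something actually changed
    if new_lookups:
        with open(cache_path, 'w', encoding='utf-8') as f:
            json.dump(cache, f, indent=2, ensure_ascii=False)
    return cache


calls = []


def fake_geocoder(address):
    """Stand-in for a real geocoder; records each call and returns fixed data."""
    calls.append(address)
    return {'geometry': {'location': {'lat': 0.0, 'lng': 0.0}}}


cache_path = os.path.join(tempfile.mkdtemp(), 'geocode_cache.json')
addresses = ['1 EXAMPLE ST, BROOKLYN, NY 11201']  # made-up address
first_run = geocode_with_cache(addresses, cache_path, fake_geocoder)
second_run = geocode_with_cache(addresses, cache_path, fake_geocoder)
print('Geocoder calls across both runs:', len(calls))
```

The second run reads everything from the cache file, which is why the `full_address` format has to stay stable between runs.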

In [None]:
geocodes_data = []
for full_address, geocode_result in cached_geocodes.items():
    row = {
        'full_address': full_address,
        'google_lat': geocode_result.get('geometry', {}).get('location', {}).get('lat', None),
        'google_lng': geocode_result.get('geometry', {}).get('location', {}).get('lng', None),
        'google_location_type': geocode_result.get('geometry', {}).get('location_type', None)
    }
    geocodes_data.append(row)
geocodes_df = pd.DataFrame(geocodes_data)

In [None]:
print('Count of records with duplicate geometry BEFORE GEOCODING:', school_points_with_lcgms['geometry'].duplicated(keep=False).sum())
print('Proportion of records with duplicate geometry BEFORE GEOCODING:', school_points_with_lcgms['geometry'].duplicated(keep=False).sum() / len(school_points_with_lcgms))

In [None]:
school_points_with_lcgms = school_points_with_lcgms.merge(geocodes_df, on='full_address', how='left')
school_points_with_lcgms['lat'] = school_points_with_lcgms['google_lat'].fillna(school_points_with_lcgms['Latitude'])
school_points_with_lcgms['lng'] = school_points_with_lcgms['google_lng'].fillna(school_points_with_lcgms['Longitude'])
school_points_with_lcgms.set_geometry(gpd.points_from_xy(school_points_with_lcgms['lng'], school_points_with_lcgms['lat']), crs='EPSG:4326', inplace=True)

In [None]:
print('Count of records with duplicate geometry AFTER GEOCODING:', school_points_with_lcgms['geometry'].duplicated(keep=False).sum())
print('Proportion of records with duplicate geometry AFTER GEOCODING:', school_points_with_lcgms['geometry'].duplicated(keep=False).sum() / len(school_points_with_lcgms))

# Post-Geocoding Cleaning

## Remove Unnecessary Geocoding Columns

In [None]:
geocode_cols_to_drop = ['Latitude', 'Longitude', 'google_lat', 'google_lng']
school_points_with_lcgms.drop(columns=geocode_cols_to_drop, inplace=True)

## Remove non-NYC points

We have 1 Long Island point that shows up after geocoding

In [None]:
school_points_with_lcgms = school_points_with_lcgms.clip(gpd.read_file('../data/raw_data/nybb_25c/nybb.shp').to_crs(school_points_with_lcgms.crs))


# Sanity Check: Visualize Data with Geopandas

In [None]:
school_points_with_lcgms.explore(tiles='CartoDB positron',
                popup=['Location Name', 'Community District', 'Council District',
                       'Principal Name', 'Principal Title', 'Principal Phone Number'],
                tooltip=['Location Name'],  # Show on hover
                legend=True,
                style_kwds={'fillOpacity': 0.7, 'weight': 1}
)

# Export Data

For now, we're manually uploading these files to Google Drive to avoid having to deal with Google Drive API keys or S3.

Shapefile field names are limited to 10 characters, so we must shorten the column names before exporting.
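Since the shortened names are written by hand, a quick check that none exceeds 10 characters and that no two columns collapse to the same short name can catch mistakes before export. A minimal sketch follows; `check_short_names` is a hypothetical helper and the example mapping is deliberately broken to show both failure modes.

```python
from collections import Counter


def check_short_names(col_map: dict, max_len: int = 10) -> dict:
    """Report short names that are too long or used for more than one column."""
    short_names = list(col_map.values())
    too_long = [name for name in short_names if len(name) > max_len]
    duplicated = [name for name, count in Counter(short_names).items() if count > 1]
    return {'too_long': too_long, 'duplicated': duplicated}


# Deliberately broken example mapping
example_map = {
    'Location Name': 'Loc_Name',
    'Principal Name': 'Princ_Name',
    'Principal Phone Number': 'Princ_Name',           # clashes with the entry above
    'Administrative District Name': 'AdminDistName',  # 13 characters
}

report = check_short_names(example_map)
print(report)
```

Calling the same helper on the real mapping should return two empty lists.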

In [None]:
# shorten fields to 10 characters or less for shapefile export
short_col_map = {'Location Name': 'Loc_Name',
    'Managed By Name': 'Managed_By',
    'Location Code': 'Loc_Code',
    'Building Code': 'Bldg_Code',
    'ATS': 'ATS',
    'Primary Address': 'Address',
    'City': 'City',
    'State': 'State',
    'Zip': 'Zip',
    'Borough Block Lot': 'BBL',
    'Census Tract': 'C_Tract',
    'Community District': 'Comm_Dist',
    'Council District': 'Council_Di',
    'geometry': 'geometry',
    'BEDS Number': 'BEDS_Num',
    'Location Type Description': 'Loc_Type_D',
    'Location Category Description': 'Loc_Cat_D',
    'Grades': 'Grades',
    'Grades Final': 'Grades_Fin',
    'Open Date': 'Open_Date',
    'NTA': 'NTA',
    'NTA_Name': 'NTA_Name',
    'Principal Name': 'Princ_Name',
    'Principal Title': 'Princ_Titl',
    'Principal Phone Number': 'Princ_Phon',
    'Fax Number': 'Fax_Num',
    'Geographical District Code': 'GeoDisCode',
    'Administrative District Code': 'AdDistCode',
    'Administrative District Location Code': 'AdDistLocC',
    'Administrative District Name': 'AdDistName',
    'Community School Sup Name': 'ComScSupNa',
    'BCO Location Code': 'BCOLocCode',
    'in_LCGMS': 'in_LCGMS',
    'full_address': 'full_addr',
    'google_location_type': 'g_loc_type',
    'lat': 'lat',
    'lng': 'lng'
}

import zipfile
import os
# Save shapefile first
shp_path = '../data/processed_data/school_points_with_lcgms.shp'
school_points_with_lcgms.rename(columns=short_col_map).to_file(
    shp_path,
    driver='ESRI Shapefile'
)

# Create zip file with all shapefile components
zip_path = '../data/processed_data/school_points_with_lcgms.zip'
base_name = '../data/processed_data/school_points_with_lcgms'

# Shapefile extensions to include
extensions = ['.shp', '.shx', '.dbf', '.prj', '.cpg']

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for ext in extensions:
        file_path = base_name + ext
        if os.path.exists(file_path):
            # Add file to zip with just the filename (no path)
            zipf.write(file_path, os.path.basename(file_path))
            print(f"Added {os.path.basename(file_path)} to zip")

print(f"Shapefile saved as zip: {zip_path}")

Export to GeoJSON for easier use when loading data into other Python scripts

In [None]:
school_points_with_lcgms.to_file(
    '../data/processed_data/school_points_with_lcgms.geojson',
    driver='GeoJSON'
)