## Load Enrollment Capacity and Utilization Data


Questions:
- Do we use building-level or school-level enrollment and capacity?
- What percent-utilization threshold indicates burden?
    - The policy memo's language is "whether it meets the capacity of enrollment demands", which I interpret as utilization of 100% or less. That still leaves several possible cutoffs for flagging burden.
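
To make the threshold question concrete, here is a minimal sketch of flagging burden at a configurable cutoff. It runs on toy data, and the column names `Enrollment` and `Capacity` as well as the helper `flag_burden` are assumptions for illustration, not names confirmed in the SCA file.

```python
import pandas as pd

# Toy data; 'Enrollment' and 'Capacity' are hypothetical column names
toy = pd.DataFrame({
    'Bldg ID': ['B1', 'B2', 'B3'],
    'Enrollment': [450, 600, 300],
    'Capacity': [500, 600, 250],
})

def flag_burden(df, threshold=100.0):
    """Return a copy with percent utilization and an over-threshold flag.

    Hypothetical helper: a building counts as burdened when its
    utilization is strictly above `threshold` percent.
    """
    out = df.copy()
    out['pct_utilization'] = 100 * out['Enrollment'] / out['Capacity']
    out['over_capacity'] = out['pct_utilization'] > threshold
    return out

flagged = flag_burden(toy)                        # memo reading: <= 100% meets demand
flagged_strict = flag_burden(toy, threshold=90.0)  # a stricter alternative cutoff
print(flagged)
```

Keeping the cutoff as a parameter makes it easy to compare how many buildings each candidate threshold would flag before settling on one.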

In [None]:
import pandas as pd

In [None]:
capacity_utilization_df = pd.read_csv('../data/raw_data/SCA/Capacity and Utilization/Enrollment_Capacity_And_Utilization_Reports_20250915.csv')
print('total records:', len(capacity_utilization_df))
print('unique buildings:', capacity_utilization_df['Bldg ID'].nunique())
print('unique organizations:', capacity_utilization_df['Organization Name'].nunique())

In [None]:
# Fix data types
# Convert 'Data As Of' to datetime
capacity_utilization_df['Data As Of'] = pd.to_datetime(capacity_utilization_df['Data As Of'], format='%m/%d/%Y')

In [None]:
# Deduplicate by taking most recent record for each organization/building combination
capacity_utilization_df = capacity_utilization_df.sort_values('Data As Of').drop_duplicates(subset=['Org ID', 'Bldg ID'], keep='last')
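
As a quick sanity check of the deduplication pattern above, the following sketch applies the same `sort_values` / `drop_duplicates(keep='last')` combination to toy data. The IDs, dates, and the `value` column are made up for illustration.

```python
import pandas as pd

# Toy data: org A1 appears twice in building K1 with different report dates
toy = pd.DataFrame({
    'Org ID': ['A1', 'A1', 'B2'],
    'Bldg ID': ['K1', 'K1', 'K2'],
    'Data As Of': pd.to_datetime(
        ['06/30/2023', '06/30/2024', '06/30/2024'], format='%m/%d/%Y'
    ),
    'value': [1, 2, 3],  # hypothetical payload column
})

# Sort ascending by date, then keep the last (most recent) row per pair
latest = (
    toy.sort_values('Data As Of')
       .drop_duplicates(subset=['Org ID', 'Bldg ID'], keep='last')
)
print(latest)
```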

In [None]:
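pd.set_option('display.max_rows', 20)
# Find building names that resemble "P.S." but use other punctuation or spacing:
# they match the loose pattern (any character where each dot is) but not the literal "P.S."
capacity_utilization_df[
    ~(capacity_utilization_df['Bldg Name'].str.contains(r'P\.S\.', regex=True, na=False))
    &
    (capacity_utilization_df['Bldg Name'].str.contains(r'P.S.', regex=True, na=False))
]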

In [None]:
# TODO: need to get 5K records down to 1.5K unique buildings

# For these 837 buildings, there is an org name that matches the building name
# ASSUMPTION: when org name==building name, this is the school whose capacity we care about.
capacity_utilization_df[capacity_utilization_df['Bldg Name']==capacity_utilization_df['Organization Name']]['Bldg ID'].nunique()

In [None]:
# TODO: find buildings that don't have a matching organization name
# For each group, check if name of group appears in Organization Name column

capacity_utilization_df.groupby('Bldg Name').apply(lambda x: (x['Organization Name']==x.name).any())
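
One possible way to turn the per-building check into a list of buildings with no matching organization name is sketched below on toy data. The building and organization names are invented, and `unmatched_buildings` is a hypothetical variable name.

```python
import pandas as pd

# Toy data: BLDG A hosts an org with the same name, BLDG B does not
toy = pd.DataFrame({
    'Bldg Name': ['BLDG A', 'BLDG A', 'BLDG B'],
    'Organization Name': ['BLDG A', 'ORG X', 'ORG Y'],
})

# Row-level flag: does this row's organization name equal its building name?
name_match = toy['Bldg Name'].eq(toy['Organization Name'])

# Building-level flag: does any organization in the building match?
has_match = name_match.groupby(toy['Bldg Name']).any()

# Buildings where no organization shares the building's name
unmatched_buildings = has_match[~has_match].index.tolist()
print(unmatched_buildings)
```

Comparing the two columns once and then grouping the boolean result avoids a Python-level lambda per group, which may matter on the full file.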

In [None]:
capacity_utilization_df[capacity_utilization_df['Bldg Name']=='51ST AVENUE ACADEMY - Q']

In [None]:
capacity_utilization_df[capacity_utilization_df['Bldg Name']=='1368 FULTON STREET']

## Join Schools Data

Load Schools Data

In [None]:
import geopandas as gpd
schools = gpd.read_file('../data/processed_data/school_points_with_lcgms.gpkg')

In [None]:
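# Merge on Org ID rather than Bldg ID; some schools still go unmatched.
# capacity_utilization_df is sorted by 'Data As Of' ascending, so keep='last' retains the most recent record per Org ID
schools_capacity_merged = schools[['Location Code', 'Location Name']].merge(
    capacity_utilization_df[['Org ID', 'Organization Name', 'Data As Of']].drop_duplicates(subset='Org ID', keep='last'),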
    left_on='Location Code', 
    right_on='Org ID', 
    how='outer', 
    indicator=True
)
schools_capacity_merged['_merge'].value_counts()

In [None]:
schools_capacity_merged[
    (schools_capacity_merged['_merge']=='right_only')
    &
    (schools_capacity_merged['Organization Name'].str.contains(r'[PIM]\.S\.', regex=True))
    ].sort_values("Data As Of", ascending=False)

In [None]:
# TODO: looks like there are 228 records from LCGMS with no matching Org ID. For these, try joining on Building ID instead
pd.set_option('display.max_rows', 20)
schools_capacity_merged[
    (schools_capacity_merged['_merge']=='left_only')
    # &
    # (schools_capacity_merged['Organization Name'].str.contains(r'[PIM]\.S\.', regex=True))
    ].sort_values("Data As Of", ascending=False)

In [None]:
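# Retry the unmatched LCGMS records, joining Location Code to Bldg ID instead of Org ID.
# Keep only the school columns so the empty capacity columns don't collide in the merge.
left_only = schools_capacity_merged[schools_capacity_merged['_merge']=='left_only']
left_only[['Location Code', 'Location Name']].merge(
    capacity_utilization_df[['Bldg ID', 'Org ID', 'Organization Name', 'Data As Of']],
    left_on='Location Code',
    right_on='Bldg ID',
    how='left',
    indicator='new_merge'
)[lambda x: x['new_merge']!='left_only']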
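
The two-stage join (Org ID first, then Bldg ID for the leftovers) could be combined into one table along these lines. This is a sketch on toy data under the assumption that `Location Code` can equal either ID; the `match_key` column and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the schools and capacity tables
schools_toy = pd.DataFrame({'Location Code': ['A1', 'B2', 'C3']})
capacity_toy = pd.DataFrame({
    'Org ID': ['A1', 'X9'],
    'Bldg ID': ['A1', 'B2'],
    'Organization Name': ['SCHOOL A', 'SCHOOL X'],
})

# Stage 1: match Location Code to Org ID
by_org = schools_toy.merge(
    capacity_toy, left_on='Location Code', right_on='Org ID',
    how='left', indicator=True,
)
matched = (
    by_org[by_org['_merge'] == 'both']
    .drop(columns='_merge')
    .assign(match_key='Org ID')
)

# Stage 2: retry the unmatched schools against Bldg ID
leftover = by_org.loc[by_org['_merge'] == 'left_only', ['Location Code']]
by_bldg = leftover.merge(
    capacity_toy, left_on='Location Code', right_on='Bldg ID', how='left',
)
by_bldg['match_key'] = by_bldg['Bldg ID'].notna().map(
    {True: 'Bldg ID', False: 'unmatched'}
)

# Stack both stages, recording which key produced each match
combined = pd.concat([matched, by_bldg], ignore_index=True)
print(combined[['Location Code', 'Organization Name', 'match_key']])
```

Recording the key used for each match makes it straightforward to audit the Bldg ID fallbacks separately, since a building can host several organizations.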