# Load A/C & Ventilation Data

We scraped the DOE room assessment page (https://www.nycenet.edu/roomassessment?code=M001) for the downloadable Excel files covering both A/C and ventilation data, then consolidated the downloads into one CSV per dataset.
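
The consolidation step can be sketched roughly as below. This is a minimal illustration, not the script we actually ran: the `consolidate` helper and the two tiny frames are hypothetical, and the way the `index in original file` column gets filled is an assumption about how that column (dropped in the loading cell) was produced.

```python
import pandas as pd

def consolidate(frames):
    """Stack per-file tables into one frame, recording each row's position in its source."""
    parts = []
    for frame in frames:
        part = frame.reset_index(drop=True)
        # Assumption: 'index in original file' is the row position within each download.
        part.insert(0, 'index in original file', range(len(part)))
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Hypothetical stand-ins for two downloaded workbooks. In practice each frame
# would come from pd.read_excel(path) on one downloaded file.
a = pd.DataFrame({'BuildingCode': ['M001', 'M001'], 'Room': ['101', '102']})
b = pd.DataFrame({'BuildingCode': ['M002'], 'Room': ['201']})

combined = consolidate([a, b])
# combined.to_csv('ac_dataset.csv', index=False)  # write the single CSV
```

Keeping the per-file position makes it possible to trace a row back to its download, at the cost of having to drop that column before de-duplicating.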

Interestingly, the IBO "barriers to schools" report uses a different A/C dataset, taken from a report that DOE prepared for City Council. It is on [this page](https://council.nyc.gov/budget/fy2021/) for the 2021 City Council budget, under the "Terms and Conditions" section, linked as "Department of Education – Air Conditioning Report (1 of 2) (XLSX)". That data only exists for FY2021 and prior. In theory the data we pulled above is more recent, but the DOE report to City Council was packaged better (it had already computed square footage with A/C by building code) and included extra statistics such as the estimated cost of the remaining A/C buildout.

In [1]:
import pandas as pd

ac_df = pd.read_csv('../data/raw_data/DOE/Ventilation/ac_dataset.csv')
ventilation_df = pd.read_csv('../data/raw_data/DOE/Ventilation/ra_dataset.csv')
# Drop unneeded column
ac_df = ac_df.drop(columns=['index in original file'])
ventilation_df = ventilation_df.drop(columns=['index in original file'])
# Drop duplicates after removing the index column
ac_df = ac_df.drop_duplicates()
ventilation_df = ventilation_df.drop_duplicates()
# Clean up join keys
ac_df['BuildingCode'] = ac_df['BuildingCode'].str.strip().str.upper()
ac_df['Room'] = ac_df['Room'].str.strip().str.upper()
ventilation_df['BuildingCode'] = ventilation_df['BuildingCode'].str.strip().str.upper()
ventilation_df['Room'] = ventilation_df['Room'].str.strip().str.upper()

In [17]:
ac_df['Primary Usage'].value_counts()

Primary Usage
REGULAR CLASSROOM    55738
CAFETERIA             1659
GYM                   1503
AUDITORIUM            1108
MULTIPURPOSE           989
LIBRARY                828
Name: count, dtype: int64

In [19]:
ac_df['Room Status'].value_counts()

Room Status
Operational           56095
Repair in Progress     2959
No A/C                 2771
Name: count, dtype: int64

In [23]:
ac_df

Unnamed: 0,BuildingCode,Room,Primary Usage,Room Status
0,X150,102,REGULAR CLASSROOM,Operational
4,X150,103,REGULAR CLASSROOM,Operational
8,X150,104,REGULAR CLASSROOM,Operational
12,X150,105,REGULAR CLASSROOM,Operational
16,X150,107,REGULAR CLASSROOM,Operational
...,...,...,...,...
247304,R020,305,REGULAR CLASSROOM,Operational
247308,R020,306,REGULAR CLASSROOM,Operational
247312,R020,AUD,AUDITORIUM,Operational
247316,R020,CAFE1,CAFETERIA,No A/C


In [None]:
# ac_pivot = 
ac_df[ac_df['Primary Usage']=='REGULAR CLASSROOM'].pivot_table(index='BuildingCode', columns='Room Status', values='Room', aggfunc='count', fill_value=0)

Room Status,No A/C,Operational,Repair in Progress
BuildingCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
K001,0,36,8
K002,0,64,0
K003,4,30,4
K005,1,42,0
K006,0,38,0
...,...,...,...
X896,0,5,0
X905,0,14,0
X970,0,51,0
X972,0,38,0


In [None]:
ac_df.pivot_table(index=['BuildingCode', 'Primary Usage'], columns='Room Status', values='Room', aggfunc='count', fill_value=0)

Unnamed: 0_level_0,Room Status,No A/C,Operational,Repair in Progress
BuildingCode,Primary Usage,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
K001,AUDITORIUM,0,1,0
K001,CAFETERIA,1,0,0
K001,GYM,1,0,0
K001,REGULAR CLASSROOM,0,36,8
K002,AUDITORIUM,0,3,0
...,...,...,...,...
X972,MULTIPURPOSE,0,4,0
X972,REGULAR CLASSROOM,0,38,0
X973,CAFETERIA,0,1,0
X973,MULTIPURPOSE,0,1,0


In [28]:
ac_df.pivot_table(index=['BuildingCode'], columns=['Room Status', 'Primary Usage'], values='Room', aggfunc='count', fill_value=0)

Room Status,No A/C,No A/C,No A/C,No A/C,No A/C,No A/C,Operational,Operational,Operational,Operational,Operational,Operational,Repair in Progress,Repair in Progress,Repair in Progress,Repair in Progress,Repair in Progress,Repair in Progress
Primary Usage,AUDITORIUM,CAFETERIA,GYM,LIBRARY,MULTIPURPOSE,REGULAR CLASSROOM,AUDITORIUM,CAFETERIA,GYM,LIBRARY,MULTIPURPOSE,REGULAR CLASSROOM,AUDITORIUM,CAFETERIA,GYM,LIBRARY,MULTIPURPOSE,REGULAR CLASSROOM
BuildingCode,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
K001,0,1,1,0,0,0,1,0,0,0,0,36,0,0,0,0,0,8
K002,0,0,0,0,0,0,3,0,0,0,0,64,0,0,0,0,0,0
K003,0,2,1,1,2,4,1,0,0,0,0,30,0,0,0,0,0,4
K005,0,2,0,0,0,1,0,0,0,2,1,42,1,0,0,0,0,0
K006,0,0,0,0,0,0,2,2,1,1,0,38,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X896,0,0,0,0,0,0,0,0,0,0,2,5,0,0,0,0,0,0
X905,0,0,0,0,0,0,0,0,0,0,0,14,0,0,0,0,0,0
X970,0,0,0,0,0,0,0,2,4,1,0,51,0,0,0,0,0,0
X972,0,0,0,0,0,0,0,0,0,0,4,38,0,0,0,0,0,0


In [2]:
import geopandas as gpd
schools = gpd.read_file('../data/processed_data/school_points_with_lcgms.shp')

In [3]:
# We're missing ~200 building codes after scraping, most of which look like non-standard codes
print("building codes missing from AC data:", len(schools[~schools['Bldg_Code'].isin(ac_df['BuildingCode'].drop_duplicates())]['Bldg_Code'].drop_duplicates()))  # .to_csv('missing_ac_bldg_codes.csv', index=False)
print("building codes missing from ventilation data:", len(schools[~schools['Bldg_Code'].isin(ventilation_df['BuildingCode'].drop_duplicates())]['Bldg_Code'].drop_duplicates()))  # .to_csv('missing_ventilation_bldg_codes.csv', index=False)

building codes missing from AC data: 185
building codes missing from ventilation data: 198


In [22]:
print('unique buildings in schools data:', len(schools['Bldg_Code'].drop_duplicates()))
print('unique buildings from schools data found in ac data:', schools['Bldg_Code'].drop_duplicates().isin(ac_df['BuildingCode'].drop_duplicates()).sum())

unique buildings in schools data: 1427
unique buildings from schools data found in ac data: 1242
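
The coverage checks above all follow the same pattern, so a small helper could cut the repetition. This is only a sketch: `building_coverage` is a hypothetical name and the codes in the example are made up. Note that it drops missing codes before counting, which the cells above do not do.

```python
import pandas as pd

def building_coverage(reference, candidate):
    """Return (total, found, missing) counts of unique reference codes present in candidate."""
    # Design choice: ignore missing codes so NaN is never counted as a building.
    ref = pd.Series(reference).dropna().drop_duplicates()
    found = ref.isin(pd.Series(candidate).dropna().unique())
    return len(ref), int(found.sum()), int((~found).sum())

# Made-up codes for illustration only.
total, found, missing = building_coverage(
    ['K001', 'K001', 'K002', 'X150'],
    ['K001', 'X150', 'R020'],
)
print('unique buildings:', total, '| found:', found, '| missing:', missing)
```

With the real data this would be called as `building_coverage(schools['Bldg_Code'], ac_df['BuildingCode'])`.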


# Load Building Space Data

Downloaded from Open Data NYC [here](https://data.cityofnewyork.us/Education/DOE-Building-Space-Usage/wavz-fkw8/about_data).

I explored joining our A/C data onto the [building space data](https://data.cityofnewyork.us/Education/DOE-Building-Space-Usage/wavz-fkw8/about_data) from Open Data NYC to get the percentage of square footage covered by A/C, but that is proving messy because the joins don't match cleanly. Instead, I am going to use the proportion of regular classrooms with A/C. That follows the methodology from DOE's reports on A/C to City Council, which focused on instructional rooms rather than gyms, auditoriums, cafeterias, libraries, etc.

I think the main metric we'll put in the dashboard is the proportion of classrooms with A/C, but I want to add the same proportion for each of the other room types so that we can see when a building is truly at 100% and when it's not. It also matters if the gym doesn't have A/C.
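
A minimal sketch of that metric, using a tiny made-up frame with the same columns as `ac_df`. The helper name `ac_share_by_usage` is hypothetical. It also assumes that only `Operational` rooms count as having A/C; whether `Repair in Progress` should count too is still an open choice, so it is a parameter here.

```python
import pandas as pd

def ac_share_by_usage(ac_df, with_ac=('Operational',)):
    """Share of rooms with A/C per building and room type (NaN = building has no such room)."""
    # Assumption: a room "has A/C" only if its status is listed in `with_ac`.
    has_ac = ac_df['Room Status'].isin(with_ac)
    return (
        ac_df.assign(has_ac=has_ac)
        .groupby(['BuildingCode', 'Primary Usage'])['has_ac']
        .mean()
        .unstack('Primary Usage')
    )

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'BuildingCode': ['K001'] * 5 + ['K002'] * 2,
    'Room': ['101', '102', '103', '104', 'GYM', '201', '202'],
    'Primary Usage': ['REGULAR CLASSROOM'] * 4 + ['GYM'] + ['REGULAR CLASSROOM'] * 2,
    'Room Status': ['Operational', 'Operational', 'Operational',
                    'Repair in Progress', 'No A/C', 'Operational', 'Operational'],
})

share = ac_share_by_usage(sample)
classroom_share = share['REGULAR CLASSROOM']  # candidate main dashboard metric
```

Leaving missing room types as NaN rather than 0 keeps "this building has no gym" distinct from "the gym has no A/C".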

In [4]:
# TODO: join this with the dataset from DOE that has per-classroom square footage so we can get percent area with a/c
building_space_df = pd.read_csv('../data/raw_data/DOE/Building Space Usage/DOE_Building_Space_Usage_20251027.csv')
building_space_df.columns = building_space_df.columns.str.strip()
building_space_df['Data As Of'] = pd.to_datetime(building_space_df['Data As Of'])
# Clean up join keys
building_space_df['Org Code'] = building_space_df['Org Code'].str.strip().str.upper()
building_space_df['Bldg ID'] = building_space_df['Bldg ID'].str.strip().str.upper()
building_space_df['Room Number'] = building_space_df['Room Number'].str.strip().str.upper()
# Keep only the most recent record for each org code + building + room number
building_space_df = building_space_df.sort_values('Data As Of', ascending=False).drop_duplicates(subset=['Org Code', 'Bldg ID', 'Room Number'])

In [5]:
print('unique buildings in schools data:', len(schools['Bldg_Code'].drop_duplicates()))
print('unique buildings from schools data found in building space data:', schools['Bldg_Code'].drop_duplicates().isin(building_space_df['Bldg ID'].drop_duplicates()).sum())
print('unique buildings from schools data found in ac data:', schools['Bldg_Code'].drop_duplicates().isin(ac_df['BuildingCode'].drop_duplicates()).sum())

unique buildings in schools data: 1427
unique buildings from schools data found in building space data: 1260
unique buildings from schools data found in ac data: 1242


In [7]:
print('pct unique buildings from ac data found in schools data:', ac_df['BuildingCode'].drop_duplicates().isin(schools['Bldg_Code'].drop_duplicates()).sum() / len(ac_df['BuildingCode'].drop_duplicates()))
print('pct unique buildings from buildings space data found in schools data:', building_space_df['Bldg ID'].drop_duplicates().isin(schools['Bldg_Code'].drop_duplicates()).sum() / len(building_space_df['Bldg ID'].drop_duplicates()))

pct unique buildings from ac data found in schools data: 1.0
pct unique buildings from buildings space data found in schools data: 0.6976744186046512


In [8]:
merged = building_space_df.merge(ac_df, left_on=['Bldg ID', 'Room Number'], right_on=['BuildingCode', 'Room'], how='outer', indicator=True)
merged['_merge'].value_counts()

_merge
left_only     80220
both          69507
right_only     1514
Name: count, dtype: int64

In [None]:
# How many buildings with unmatched A/C rows (right_only) are in the schools data?
merged[merged["_merge"]=='right_only']['BuildingCode'].drop_duplicates().isin(schools['Bldg_Code'].drop_duplicates()).sum()

np.int64(500)
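
One possible next step is to check whether the unmatched A/C rows cluster in a few buildings or are spread evenly. The sketch below does that on made-up frames with the same join columns. The rows are invented (including the zero-padded room `0202`, which is only a guess at one kind of mismatch), and `unmatched_rate` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only.
space = pd.DataFrame({
    'Bldg ID': ['K001', 'K001', 'K002', 'K002'],
    'Room Number': ['101', '102', '201', '202'],
})
ac = pd.DataFrame({
    'BuildingCode': ['K001', 'K001', 'K002', 'K002', 'K002'],
    'Room': ['101', '102', '201', '0202', '203'],
})

merged = space.merge(
    ac, left_on=['Bldg ID', 'Room Number'], right_on=['BuildingCode', 'Room'],
    how='outer', indicator=True,
)

# Keep only rows that came from the A/C side, then compute the share of
# those rows that found no partner in the building space data.
ac_side = merged[merged['_merge'] != 'left_only']
unmatched_rate = (
    (ac_side['_merge'] == 'right_only')
    .groupby(ac_side['BuildingCode'])
    .mean()
    .sort_values(ascending=False)
)
print(unmatched_rate)
```

Buildings near the top of `unmatched_rate` would be the first places to look for room-number formatting differences between the two sources.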