# Load A/C & Ventilation Data

We scraped the per-building room assessment pages (e.g. https://www.nycenet.edu/roomassessment?code=M001) for downloadable Excel files covering both A/C and ventilation data. Then we consolidated those downloaded Excel files into a single CSV per dataset.
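
For reference, here is a minimal sketch of what the consolidation step could look like. It is not the script we actually ran: the `consolidate` helper and the toy frames are hypothetical, and the `index in original file` column name is simply borrowed from the column dropped in the first cell below. Real inputs would be loaded with `pd.read_excel`.

```python
import pandas as pd


def consolidate(frames):
    """Stack per-building tables into one DataFrame.

    `frames` is a list of DataFrames, one per downloaded file. Each row keeps
    its position in the source file in an 'index in original file' column.
    """
    tagged = [
        df.rename_axis("index in original file").reset_index()
        for df in frames
    ]
    return pd.concat(tagged, ignore_index=True)


# Toy stand-ins for two downloaded files (hypothetical values)
k001 = pd.DataFrame({"BuildingCode": ["K001", "K001"], "Room": ["101", "102"]})
m001 = pd.DataFrame({"BuildingCode": ["M001"], "Room": ["201"]})

combined = consolidate([k001, m001])
print(combined)
```

Carrying the per-file index along is what makes the later `drop_duplicates()` meaningful only after that column is dropped.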

Interestingly, the IBO "barriers to schools" report uses a different A/C dataset, one that comes from a report DOE made for City Council. Find it on [this page](https://council.nyc.gov/budget/fy2021/) for the 2021 City Council budget under the "Terms and Conditions" section in the hyperlink for "Department of Education – Air Conditioning Report (1 of 2) (XLSX)". This data only exists for FY2021 and prior. Theoretically, the data we pulled above is more recent, but the catch is that the DOE A/C report to City Council was packaged a little better (i.e. it had already done the math of sqft with A/C by building code) and had extra stats like the estimated cost of the remaining A/C buildout.

I explored joining our A/C data onto the [buildings space data](https://data.cityofnewyork.us/Education/DOE-Building-Space-Usage/wavz-fkw8/about_data) from Open Data NYC to get the percentage of square footage covered by A/C, but that is proving messy because the joins don't match up cleanly. So instead I'm just going to use the proportion of regular classrooms with A/C. That follows the methodology from DOE's reports on A/C to City Council, which focused on instructional rooms rather than gyms, auditoriums, cafeterias, libraries, etc.
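
To quantify how badly a room-level join like that misses, one option is an outer merge with `indicator=True`. This is only a sketch on toy tables: the `sqft` column and the mismatched room labels are made up for illustration, and the real column names in the space usage data may differ.

```python
import pandas as pd

# Hypothetical toy tables; real column names and room labels may differ
ac_rooms = pd.DataFrame({
    "BuildingCode": ["K001", "K001", "M001"],
    "Room": ["101", "102", "201"],
})
space_rooms = pd.DataFrame({
    "BuildingCode": ["K001", "K001", "M001"],
    "Room": ["101", "0102", "201A"],
    "sqft": [700, 650, 800],
})

# The `_merge` column records whether each row matched on both sides
joined = ac_rooms.merge(
    space_rooms, on=["BuildingCode", "Room"], how="outer", indicator=True
)
match_counts = joined["_merge"].value_counts()
print(match_counts)
```

Rows tagged `left_only` or `right_only` are the rooms that failed to match, which gives a rough sense of whether cleaning the keys is worth the effort.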

I think the main metric we'll put in the dashboard is proportion of classrooms with A/C, but I want to add in to that data the proportion of all the other room types so that we can see when a building is truly 100% and when it's not. Also it matters if the gym doesn't have A/C.

In [1]:
import pandas as pd

ac_df = pd.read_csv('../data/raw_data/DOE/Ventilation/ac_dataset.csv')
ventilation_df = pd.read_csv('../data/raw_data/DOE/Ventilation/ra_dataset.csv')
# Drop unneeded column
ac_df = ac_df.drop(columns=['index in original file'])
ventilation_df = ventilation_df.drop(columns=['index in original file'])
# Drop duplicates after removing the index column
ac_df = ac_df.drop_duplicates()
ventilation_df = ventilation_df.drop_duplicates()
# Clean up join keys
ac_df['BuildingCode'] = ac_df['BuildingCode'].str.strip().str.upper()
ac_df['Room'] = ac_df['Room'].str.strip().str.upper()
ventilation_df['BuildingCode'] = ventilation_df['BuildingCode'].str.strip().str.upper()
ventilation_df['Room'] = ventilation_df['Room'].str.strip().str.upper()

# Process A/C Data

In [2]:
ac_df['Primary Usage'].value_counts()

Primary Usage
REGULAR CLASSROOM    55738
CAFETERIA             1659
GYM                   1503
AUDITORIUM            1108
MULTIPURPOSE           989
LIBRARY                828
Name: count, dtype: int64

In [3]:
ac_df['Room Status'].value_counts()

Room Status
Operational           56095
Repair in Progress     2959
No A/C                 2771
Name: count, dtype: int64

We're going to use "Operational" as the primary metric of whether A/C is present. My understanding is that the data is current as of when we scraped it: 10/26/2025. So the "Repair in Progress" status is a snapshot of recent conditions rather than a permanent attribute.

If we need a more permanent metric, we could use "No A/C" instead, so that repairs don't muddy things. The catch is that "No A/C" makes it look like you could take 1 - n to get the affirmative (i.e. x% *do* have A/C), but that's not true because of the repairs category: rooms under repair are neither "No A/C" nor "Operational". People may find the "Operational" framing confusing, but it's a way to work around the repair category.
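
A tiny worked example of why 1 - n doesn't work, using a made-up building with four rooms (the statuses match the real labels, the rows are hypothetical):

```python
import pandas as pd

# Hypothetical building: 2 operational rooms, 1 under repair, 1 with no A/C
toy = pd.DataFrame({
    "BuildingCode": ["K001"] * 4,
    "Room Status": ["Operational", "Operational", "Repair in Progress", "No A/C"],
})

# Share of rooms in each status, per building (each row sums to 1)
shares = pd.crosstab(toy["BuildingCode"], toy["Room Status"], normalize="index")

operational = shares.loc["K001", "Operational"]
no_ac = shares.loc["K001", "No A/C"]
in_repair = shares.loc["K001", "Repair in Progress"]

print(f"Operational: {operational:.2f}")
print(f"1 - No A/C:  {1 - no_ac:.2f}")
print(f"In repair:   {in_repair:.2f}")
```

The gap between the first two numbers is exactly the "Repair in Progress" share.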

In [4]:
# Less conservative treatment of A/C: Calculate percentage of "Operational" A/C for each BuildingCode and Primary Usage combination
# This will include all buildings, even those with 0% "Operational"
ac_operational_summary = ac_df.groupby(['BuildingCode', 'Primary Usage']).apply(
    lambda x: (x['Room Status'] == 'Operational').sum() / len(x)
).unstack(fill_value=0)
# Rename columns to include "_Op_AC" suffix
ac_operational_summary.columns = [f"{col}_Op_AC".replace(' ', '_') for col in ac_operational_summary.columns]
# Reset index to get BuildingCode as a regular column
ac_operational_summary = ac_operational_summary.reset_index()
# Shorten column names for shapefile 10 char limit
ac_operational_summary.columns = ac_operational_summary.columns.str.replace(
    'AUDITORIUM', 'AUD').str.replace(
        'CAFETERIA', 'CAF').str.replace(
            'REGULAR_CLASSROOM', 'CLS').str.replace(
                'GYMNASIUM', 'GYM').str.replace(
                    'LIBRARY', 'LIB').str.replace(
                        'MULTIPURPOSE', 'MPR')

# Filter for buildings with <100% classrooms with operational A/C
ac_operational_summary[ac_operational_summary['CLS_Op_AC'] < 1.0]




Unnamed: 0,BuildingCode,AUD_Op_AC,CAF_Op_AC,GYM_Op_AC,LIB_Op_AC,MPR_Op_AC,CLS_Op_AC
0,K001,1.0,0.00,0.0,0.0,0.0,0.818182
2,K003,1.0,0.00,0.0,0.0,0.0,0.789474
3,K005,0.0,0.00,0.0,1.0,1.0,0.976744
6,K008,0.0,1.00,1.0,1.0,1.0,0.928571
7,K009,1.0,0.00,0.0,0.0,0.0,0.892857
...,...,...,...,...,...,...,...
1219,X660,1.0,1.00,0.0,1.0,0.0,0.936508
1222,X779,0.0,1.00,0.0,1.0,0.0,0.937500
1226,X819,1.0,0.00,0.0,0.0,0.0,0.888889
1227,X826,0.0,0.50,1.0,1.0,0.0,0.933333
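
One thing to keep in mind when reading this table: `unstack(fill_value=0)` writes 0.0 for any room type a building has no rows for, so 0.0 can mean either "none operational" or "no such room listed". Below is a possible alternative, sketched on toy rows, that leaves those cells as `NaN` and also avoids `groupby(...).apply` (the source of the warning pandas prints for this cell). It is a sketch, not a drop-in replacement for what we export.

```python
import pandas as pd

# Hypothetical rows: K002 has no GYM listed at all
toy = pd.DataFrame({
    "BuildingCode": ["K001", "K001", "K001", "K002"],
    "Primary Usage": ["REGULAR CLASSROOM", "REGULAR CLASSROOM", "GYM", "REGULAR CLASSROOM"],
    "Room Status": ["Operational", "No A/C", "No A/C", "Operational"],
})

# The mean of a boolean Series is the share of True values
is_operational = toy["Room Status"].eq("Operational")
summary = (
    is_operational
    .groupby([toy["BuildingCode"], toy["Primary Usage"]])
    .mean()
    .unstack()  # no fill_value: absent room types stay NaN
)
print(summary)
```

Here K001's gym shows 0.0 (it exists and has no operational A/C) while K002's gym shows `NaN` (no gym row at all).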


In [5]:
# Most conservative treatment of A/C: Calculate percentage of "No A/C" for each BuildingCode and Primary Usage combination
# This will include all buildings, even those with 0% "No A/C"
no_ac_summary = ac_df.groupby(['BuildingCode', 'Primary Usage']).apply(
    lambda x: (x['Room Status'] == 'No A/C').sum() / len(x)
).unstack(fill_value=0)
# Rename columns to include "_No_AC" suffix
no_ac_summary.columns = [f"{col}_No_AC".replace(' ', '_') for col in no_ac_summary.columns]
# Reset index to get BuildingCode as a regular column
no_ac_summary = no_ac_summary.reset_index()
# Shorten column names for shapefile 10 char limit
no_ac_summary.columns = no_ac_summary.columns.str.replace(
    'AUDITORIUM', 'AUD').str.replace(
        'CAFETERIA', 'CAF').str.replace(
            'REGULAR_CLASSROOM', 'CLS').str.replace(
                'GYMNASIUM', 'GYM').str.replace(
                    'LIBRARY', 'LIB').str.replace(
                        'MULTIPURPOSE', 'MPR')

# Filter for buildings with >0% classrooms with No A/C
no_ac_summary[no_ac_summary['CLS_No_AC'] > 0]



Unnamed: 0,BuildingCode,AUD_No_AC,CAF_No_AC,GYM_No_AC,LIB_No_AC,MPR_No_AC,CLS_No_AC
2,K003,0.0,1.000000,1.0,1.0,1.0,0.105263
3,K005,0.0,1.000000,0.0,0.0,0.0,0.023256
12,K014,0.0,0.000000,1.0,0.0,0.0,0.015385
17,K019,0.0,0.000000,1.0,0.0,0.0,0.025641
21,K023,0.0,0.000000,1.0,0.0,0.0,0.025641
...,...,...,...,...,...,...,...
1206,X450,1.0,0.666667,1.0,0.0,0.0,0.052632
1216,X600,0.0,0.000000,0.0,0.0,0.0,0.015385
1217,X650,1.0,1.000000,1.0,0.0,0.0,0.180000
1226,X819,0.0,0.000000,1.0,0.0,1.0,0.037037


In [6]:
# Show top 20 buildings with highest percentage of classrooms with No A/C
no_ac_summary.sort_values('CLS_No_AC', ascending=False).head(20)

Unnamed: 0,BuildingCode,AUD_No_AC,CAF_No_AC,GYM_No_AC,LIB_No_AC,MPR_No_AC,CLS_No_AC
914,Q883,1.0,0.0,0.0,0.0,1.0,1.0
521,M223,0.0,0.0,0.0,0.0,0.0,0.705882
715,Q136,1.0,0.0,1.0,0.0,0.0,0.540541
190,K231,0.0,0.0,0.0,0.0,0.0,0.461538
423,M052,0.0,0.0,1.0,0.0,0.0,0.235294
1055,X067,0.0,1.0,1.0,0.0,1.0,0.218182
498,M173,1.0,1.0,0.0,0.0,0.25,0.209302
201,K242,0.0,1.0,0.0,0.0,0.0,0.194444
1217,X650,1.0,1.0,1.0,0.0,0.0,0.18
104,K135,1.0,1.0,0.0,0.0,0.0,0.166667


# Export A/C Data

Exporting both approaches to defining A/C status: "no A/C at all" and "no *working* A/C at the time of data scraping".

In [7]:
no_ac_summary.to_csv('../data/processed_data/no_ac_summary.csv', index=False)
ac_operational_summary.to_csv('../data/processed_data/ac_operational_summary.csv', index=False)
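
Since the column names are shortened for the shapefile 10-character limit, a quick check before export can catch any that slipped through. The helpers below (`shorten_columns`, `too_long`) are hypothetical names for illustration; the chained `.str.replace` calls above do the same renaming.

```python
import pandas as pd


def shorten_columns(columns, abbreviations):
    """Apply literal (non-regex) replacements to every column name."""
    out = pd.Index(columns)
    for long_name, short_name in abbreviations.items():
        out = out.str.replace(long_name, short_name, regex=False)
    return out


def too_long(columns, limit=10):
    """Return the column names longer than `limit` characters."""
    return [c for c in columns if len(c) > limit]


cols = shorten_columns(
    ["BuildingCode", "REGULAR_CLASSROOM_No_AC", "GYM_No_AC"],
    {"REGULAR_CLASSROOM": "CLS"},
)
offenders = too_long(cols)
print(list(cols))
print(offenders)
```

In this toy run the only name flagged is `BuildingCode` itself (12 characters). Whether that matters depends on how the CSV gets joined to the shapefile downstream.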

# Process Ventilation Data

According to the webpages we scraped this data from (e.g. [here](https://www.nycenet.edu/roomassessment?code=K170)):
>Ventilation = any way for fresh air to enter a room including operational windows, mechanical ventilation, or a combination of both.


It's a mystery to me what the `AtLeast` column is in this data. You'd think it would be something like "there is at least one ventilation method". But the checks below don't bear that out.

For now, I'm going to chalk this up to a mistake in the data and DIY the boolean field for whether ventilation is present, based on the definition provided above.

### This data is kind of confusing

We can see that the `AtLeast` column is never "Yes" when none of the ventilation methods exist, which makes sense.

In [8]:
# Show that `AtLeast` column is never "Yes" when none of the ventilation methods exist
assert ventilation_df[
    (ventilation_df['AtLeast']=='Yes') & 
    (ventilation_df['Windows']=='No') & 
    (ventilation_df['Exhaust Fan']=="Doesn't Exist") & 
    (ventilation_df['Supply Fan']=="Doesn't Exist")
    ].empty

But the converse is perplexing: there are rows where `AtLeast` is "No" even though windows and/or fully functioning fans are present. That shouldn't happen if `AtLeast` corresponds to the above definition.

In [9]:
# Case 1: AtLeast is "No" even when Windows AND fully functioning fans are present (i.e. both supply and exhaust are operational)
ventilation_df[
        (ventilation_df['AtLeast']=='No') & 
        (ventilation_df['Windows']=='Yes') &
        (ventilation_df['Exhaust Fan']=="Operational") &
        (ventilation_df['Supply Fan']=="Operational")
    ]

Unnamed: 0,BuildingCode,Room,Primary Usage,Windows,AtLeast,Supply Fan,Exhaust Fan
658,K170,571,Gymnasium,Yes,No,Operational,Operational
679,K170,575,Staff Office,Yes,No,Operational,Operational
10654,R435,E207A,Staff Office,Yes,No,Operational,Operational
10661,R435,E207B,Staff Office,Yes,No,Operational,Operational
10703,R435,E217A,Staff Office,Yes,No,Operational,Operational
...,...,...,...,...,...,...,...
934948,K564,227,Staff Office,Yes,No,Operational,Operational
935081,K564,319,Staff Office,Yes,No,Operational,Operational
939127,K022,111,Gymnasium,Yes,No,Operational,Operational
940520,Q143,N106,Kitchen,Yes,No,Operational,Operational


In [10]:
# Case 2: AtLeast is "No" when either Windows or fully functioning fans are present (i.e. both supply and exhaust are operational)
ventilation_df[
    (
        (ventilation_df['AtLeast']=='No') & 
        (
            (ventilation_df['Windows']=='Yes') |
            (
                (ventilation_df['Exhaust Fan']=="Operational") &
                (ventilation_df['Supply Fan']=="Operational")
            )
        )
    )
    ]

Unnamed: 0,BuildingCode,Room,Primary Usage,Windows,AtLeast,Supply Fan,Exhaust Fan
84,K170,112,Bathroom,Yes,No,Doesn't Exist,Operational
539,K170,476,Staff Office,No,No,Operational,Operational
658,K170,571,Gymnasium,Yes,No,Operational,Operational
672,K170,571B,Bathroom,Yes,No,Doesn't Exist,Operational
679,K170,575,Staff Office,Yes,No,Operational,Operational
...,...,...,...,...,...,...,...
940912,Q143,N405,Bathroom,No,No,Operational,Operational
940940,Q143,N409,Bathroom,No,No,Operational,Operational
941360,K615,127,Staff Office,No,No,Operational,Operational
941437,K615,152,Bathroom,No,No,Operational,Operational


### Screw it, we're going to just work around the weirdness

So, we're going to DIY the boolean column for "no ventilation", using the above definition from the NYC Schools website: a room has "no ventilation" when it has neither a window nor a supply/exhaust fan.
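
As a sanity check on that approach, here is a sketch of deriving a ventilation flag from the website's definition and cross-tabulating it against `AtLeast`. The rows are made up, `has_ventilation` is a hypothetical column name, and counting "mechanical ventilation" as both fans being "Operational" is my assumption (it mirrors the Case 1 and Case 2 cells above).

```python
import pandas as pd

# Hypothetical rows using the real column names and status labels
toy = pd.DataFrame({
    "Windows": ["Yes", "No", "No", "No"],
    "Supply Fan": ["Doesn't Exist", "Operational", "Doesn't Exist", "Operational"],
    "Exhaust Fan": ["Doesn't Exist", "Operational", "Doesn't Exist", "Not Operational"],
    "AtLeast": ["Yes", "No", "No", "No"],
})

has_window = toy["Windows"].eq("Yes")
# Assumption: mechanical ventilation counts only if both fans are operational
has_working_fans = (
    toy["Supply Fan"].eq("Operational") & toy["Exhaust Fan"].eq("Operational")
)
toy["has_ventilation"] = has_window | has_working_fans

# Rows where AtLeast is "No" but the derived flag is True are the puzzling ones
comparison = pd.crosstab(toy["AtLeast"], toy["has_ventilation"])
print(comparison)
```

Run on the full `ventilation_df`, a table like this would show at a glance how often `AtLeast` disagrees with the stated definition.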

In [11]:
sorted(ventilation_df['Exhaust Fan'].unique())

["Doesn't Exist", 'Not Operational', 'Operational', 'Partially Operational']

In [12]:
sorted(ventilation_df['Supply Fan'].unique())

['Cannot Access',
 "Doesn't Exist",
 'Not Operational',
 'Operational',
 'Partially Operational']

If we were to use the most methodologically simple and conservative option (i.e. treat "no ventilation" as no existing infrastructure for ventilation at all), there are NO classrooms in this data that meet that definition.

In [13]:
# Most conservative definition of "no ventilation": no windows AND no existing supply/exhaust fans
ventilation_df[
        (ventilation_df['Primary Usage'].str.upper().str.contains('CLASSROOM')) &
        (ventilation_df['Windows']=='No') &
        (ventilation_df['Exhaust Fan']=="Doesn't Exist") &
        (ventilation_df['Supply Fan']=="Doesn't Exist")
    ]['BuildingCode'].nunique()

0

Instead, we're going to focus on the existence of BOTH a supply and an exhaust fan in a classroom. The rationale is that, with wildfire smoke increasingly polluting our air, students need fully functioning mechanical ventilation in every classroom, not just windows (which are actually a hazard on smoky days). With GHS renovations, mechanical ventilation will also be necessary so that HEPA filters can be used to purify air.

In [14]:
# Show how many school buildings have at least one classroom that's missing either a supply or exhaust fan
ventilation_df[
        (ventilation_df['Primary Usage'].str.upper().str.contains('CLASSROOM')) &
        (
            (ventilation_df['Exhaust Fan']=="Doesn't Exist") |
            (ventilation_df['Supply Fan']=="Doesn't Exist")
        )
    ]['BuildingCode'].nunique()

993

In [15]:
# Create flag for rooms (of any type) missing at least one element of mechanical ventilation
ventilation_df['missing_mech_ventilation'] = (
        (
            (ventilation_df['Exhaust Fan']=="Doesn't Exist") |
            (ventilation_df['Supply Fan']=="Doesn't Exist")
        )
)

In [16]:
ventilation_df[ventilation_df['missing_mech_ventilation']]['Primary Usage'].value_counts()

Primary Usage
Student Classroom                                                      36834
Bathroom                                                               21902
Staff Office                                                           20307
Closet/Storage Room                                                     5775
Building Support Space                                                  2531
Student Staff Space (library, shop, science lab, media center etc.)     1329
Locker Room                                                              527
Cafeteria                                                                308
Multi-Purpose Room                                                       268
Kitchen                                                                  184
Gymnasium                                                                147
Inaccessible                                                             102
Auditorium                                                    

In [17]:
no_vent_summary = ventilation_df.groupby(['BuildingCode', 'Primary Usage']).apply(
    lambda x: (x['missing_mech_ventilation']).sum() / len(x)
).unstack(fill_value=0)
# Rename columns to include "_No_VT" suffix
no_vent_summary.columns = [f"{col.replace(' ', '_').upper()}_No_VT" for col in no_vent_summary.columns]
# Reset index to get BuildingCode as a regular column
no_vent_summary = no_vent_summary.reset_index()
# Shorten column names for shapefile 10 char limit
no_vent_summary.columns = no_vent_summary.columns.str.replace(
    'AUDITORIUM', 'AUD').str.replace(
        'CAFETERIA', 'CAF').str.replace(
            'STUDENT_CLASSROOM', 'CLS').str.replace(
                'GYMNASIUM', 'GYM').str.replace(
                    'BATHROOM', 'BTH').str.replace(
                        'MULTI-PURPOSE_ROOM', 'MPR').str.replace(
                            'STAFF_OFFICE', 'OFF').str.replace(
                                'KITCHEN', 'KIT').str.replace(
                                    'STUDENT_STAFF_SPACE_(LIBRARY,_SHOP,_SCIENCE_LAB,_MEDIA_CENTER_ETC.)', 'LAB', regex=False)



In [18]:
# Drop columns we won't use
no_vent_summary = no_vent_summary.drop(columns=[
    'BUILDING_SUPPORT_SPACE_No_VT',
    'CLOSET/STORAGE_ROOM_No_VT',
    'INACCESSIBLE_No_VT',
    'LOCKER_ROOM_No_VT'
])

In [19]:
# Filter for buildings with >0% classrooms missing mechanical ventilation
no_vent_summary[no_vent_summary['CLS_No_VT'] > 0]

Unnamed: 0,BuildingCode,AUD_No_VT,BTH_No_VT,CAF_No_VT,GYM_No_VT,KIT_No_VT,MPR_No_VT,OFF_No_VT,CLS_No_VT,LAB_No_VT
0,K001,0.0,1.000000,0.0,0.0,0.0,0.0,0.937500,0.936170,1.000000
2,K003,0.0,0.913043,0.0,0.0,0.0,0.0,0.724138,0.031250,0.333333
3,K005,0.0,1.000000,0.0,0.0,0.0,1.0,0.958333,0.937500,1.000000
4,K006,0.0,1.000000,0.0,0.0,0.0,0.0,0.833333,0.208333,0.000000
6,K008,0.0,0.500000,0.0,0.0,0.0,0.0,0.727273,0.781250,0.000000
...,...,...,...,...,...,...,...,...,...,...
1208,X701,0.0,1.000000,0.0,0.0,0.0,0.0,1.000000,1.000000,0.000000
1210,X779,0.0,1.000000,0.0,0.0,0.0,1.0,1.000000,1.000000,1.000000
1211,X781,0.0,0.692308,0.0,0.0,0.0,0.0,1.000000,0.941176,0.000000
1213,X819,0.0,1.000000,0.0,1.0,0.0,0.0,0.933333,0.950000,1.000000


# Export Ventilation Data

In [20]:
no_vent_summary.to_csv('../data/processed_data/missing_ventilation_summary.csv', index=False)