# NYC DOE Accessibility Dataset
This website details the school accessibility data: https://www.schools.nyc.gov/school-life/space-and-facilities/building-accessibility


Data downloaded 12/19/2025. There are several hidden sheets in the downloaded data that you can't see in Excel but can access programmatically. This notebook explores those different sheets and finds that the following two sheets are most important: 'Current Accessible School List' and 'RAW Data'.

The "Current Accessible School List" sheet appears to be updated every week and has more accurate and timely data than the "RAW Data" sheet. However, it contains only schools that are at least partially accessible and omits every "No accessibility" school.

The 'RAW Data' sheet appears to contain the full list of schools that have ever been assessed, including schools that have since closed. In this notebook we coalesce the "current" and "raw" sheets (preferring data from the "current" sheet when available) to arrive at a final accessibility status for each school.

# Loading and Cleaning

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
from pathlib import Path
import warnings
from openpyxl import load_workbook
warnings.filterwarnings('ignore')

In [2]:
data_dir = Path("../data/raw_data/DOE/Building Accessibility Profile")
output_dir = Path("../data/processed_data")
output_dir.mkdir(exist_ok=True, parents=True)

schools = gpd.read_file(output_dir / "school_points_with_lcgms.geojson")
wb = load_workbook(data_dir / "Current_Building_Accessibility_Profile_List.xlsm", read_only=True)

In [3]:
ws_curr_data = wb["Current Accessible School List"] #current list
ws_raw_data = wb["RAW Data"] #backup list

curr_data = ws_curr_data.values
raw_data = ws_raw_data.values

next(curr_data)
next(curr_data)
curr_cols = next(curr_data) #index 0, 1, and 2 rows are all header data
curr_df = pd.DataFrame(curr_data, columns=curr_cols)

raw_cols = next(raw_data) #index 0 and 1 rows are both header data
raw_df = pd.DataFrame(raw_data, columns=raw_cols)

wb.close()
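The header-skipping pattern used above can be sketched on toy data. This is a minimal illustration only: `fake_rows` is a hypothetical stand-in for `ws.values` (which yields one tuple per row), and the number of title rows is an assumption for this toy sheet, not a fact about the workbook.

```python
from itertools import islice

# Hypothetical stand-in for `ws.values`: an iterator of row tuples.
fake_rows = iter([
    ("Report title", None, None),                      # title row (skipped)
    ("Last updated", None, None),                      # title row (skipped)
    ("Building Code", "Location Code", "BAP Rating"),  # real header row
    ("X001", "L001", 6),
    ("X002", "L002", 1),
])

n_title_rows = 2  # assumption for this toy sheet
# Skip the title rows and take the next row as the header.
header = next(islice(fake_rows, n_title_rows, None))
# Whatever is left in the iterator is data.
records = [dict(zip(header, row)) for row in fake_rows]
```

The same `header` and remaining rows could be passed to `pd.DataFrame(rows, columns=header)`, as the cell above does.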

## Basic Processing

In [4]:
import re

# extracting BAP rating from a given HYPERLINK formula
def extract_rating(hyperlink_str):
    if pd.isna(hyperlink_str):
        return None
    
    # Extract the display text from HYPERLINK("url", "X out of 10")
    # Pattern: find "X out of 10" where X is a number
    match = re.search(r'"(\d+(?:\.\d+)?)\s+out\s+of\s+10"', str(hyperlink_str))
    
    if match:
        return float(match.group(1))
    return None

# Create new BAP Rating column from the HYPERLINK column
curr_df['BAP Rating'] = curr_df.iloc[:, 11].apply(extract_rating)

curr_df = curr_df.drop(curr_df.columns[[0, 11]], axis=1) #dropping columns 0 (col titled None and filled with None values) and 11 (hyperlink dupe of url col)
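To sanity-check the regular expression, here is a small standard-library sketch of the same extraction. The formula strings and the `example.com` URL are made up for illustration, and `is None` replaces `pd.isna` so the snippet runs without pandas.

```python
import re

def extract_rating(hyperlink_str):
    """Return the numeric rating from a HYPERLINK formula string, or None."""
    if hyperlink_str is None:
        return None
    # The pattern expects a double quote directly before the number.
    match = re.search(r'"(\d+(?:\.\d+)?)\s+out\s+of\s+10"', str(hyperlink_str))
    return float(match.group(1)) if match else None

# Hypothetical formula strings; the URL is a placeholder.
samples = [
    '=HYPERLINK("https://example.com/bap.pdf", "6 out of 10")',
    '=HYPERLINK("https://example.com/bap.pdf", "7.5 out of 10")',
    '=HYPERLINK("https://example.com/bap.pdf", "Not rated")',
    None,
]
ratings = [extract_rating(s) for s in samples]
```

Display text that does not follow the `"X out of 10"` shape yields `None` rather than raising, so unexpected formats surface later as missing ratings.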

In [5]:
# Cleaning
raw_df = raw_df.replace('N/A', np.nan)
raw_df['BAP Rating'] = pd.to_numeric(raw_df['BAP Rating'])

curr_df['BAP Rating'] = pd.to_numeric(curr_df['BAP Rating'])

schools['Location Code'] = schools['Location Code'].astype(str).str.strip()
schools['Building Code'] = schools['Building Code'].astype(str).str.strip()
curr_df['Location Code'] = curr_df['Location Code'].astype(str).str.strip()
curr_df['Building Code'] = curr_df['Building Code'].astype(str).str.strip()
raw_df['Location Code'] = raw_df['Location Code'].astype(str).str.strip()
raw_df['Building Code'] = raw_df['Building Code'].astype(str).str.strip()

curr_df.replace('None', np.nan, inplace=True)
raw_df.replace('None', np.nan, inplace=True)

# Drop the two rows in curr_df with "None" in every field and null building code
curr_df = curr_df[~(curr_df['Building Code'].isna())]
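One thing to watch in the cleaning step: depending on the pandas version, `astype(str)` may turn missing values into the literal strings `'None'` or `'nan'`. The sketch below, on made-up codes, normalizes both spellings back to missing. Handling `'nan'` is an addition here that the cell above does not do.

```python
import numpy as np
import pandas as pd

# Toy column with surrounding whitespace and two kinds of missing value.
toy = pd.DataFrame({"Building Code": [" X001 ", None, np.nan, "X002"]}, dtype=object)

as_text = toy["Building Code"].astype(str).str.strip()
# Map stringified missing values back to NaN (a no-op if they were preserved).
cleaned = as_text.replace(["None", "nan"], np.nan)

# Keep only rows with a usable building code.
kept = toy[cleaned.notna()]
```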

# Data Quality Investigations

## Which Sheets to Use?

In [6]:
xls = pd.ExcelFile(data_dir / "Current_Building_Accessibility_Profile_List.xlsm")
sheet_names = xls.sheet_names
print(sheet_names)

['March_2023 (2)', 'All Active Schools', 'Current Accessible School List', 'Mechanics', 'RAW Data', 'BAPs', 'BAP MASTER', 'BAP Rating Definitions', 'BAPs Under Maintenance', 'Filters - HIDE']


NOTE: this file has the following sheets that are hidden from view when the file is opened in the Excel app:
- 'March_2023 (2)'
- 'All Active Schools'
    - This seems to contain only the key for the borough abbreviations, nothing else.
- 'Current Accessible School List'
    - This sheet only has buildings that are "accessible" or "partially accessible"
- 'Mechanics'
    - This sheet has just a single data point: the date the Excel workbook was last updated
- 'RAW Data'
- 'BAPs'
- 'BAP MASTER'
- 'BAP Rating Definitions'
- 'BAPs Under Maintenance'
- 'Filters - HIDE'

The potentially useful ones (in order of usefulness) look like:
- 'Current Accessible School List'
- 'RAW Data'
- 'BAPs'
- 'BAP MASTER'

(the others are mostly NaNs)
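Which sheets are hidden can also be checked programmatically. The sketch below builds a throwaway in-memory workbook (the sheet names are made up) and reads openpyxl's `sheet_state` attribute. It assumes worksheets loaded from the real file expose the same attribute.

```python
from openpyxl import Workbook

# In-memory demo workbook; it starts with one visible default sheet.
wb_demo = Workbook()
wb_demo.active.title = "Visible Sheet"
hidden_ws = wb_demo.create_sheet("Hidden Sheet")
hidden_ws.sheet_state = "hidden"  # other values: "visible", "veryHidden"

# Map each sheet name to its visibility state.
states = {ws.title: ws.sheet_state for ws in wb_demo.worksheets}
hidden = [name for name, state in states.items() if state != "visible"]
```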

In [7]:
# Annoyingly, the BAPs sheet appears to be the result of a join, so there are two different columns for most things (e.g. "Building Code" and "Building Code.1")
baps_sheet_df = pd.read_excel(data_dir / "Current_Building_Accessibility_Profile_List.xlsm", sheet_name="BAPs")
bap_master_sheet_df = pd.read_excel(data_dir / "Current_Building_Accessibility_Profile_List.xlsm", sheet_name="BAP MASTER")
# This sheet contains just one data point: the last refresh date
last_refresh_dt = pd.to_datetime(pd.read_excel(data_dir / "Current_Building_Accessibility_Profile_List.xlsm", sheet_name="Mechanics", header=3)['Last Refresh']).iloc[0]

### Explore overlaps in different sheets from source data

In [8]:
# Figure out overlapping building codes between these different sheets in the source data
print('pct of curr_df in `BAPS` sheet', curr_df['Building Code'].isin(baps_sheet_df['Building Code']).sum() / len(curr_df))
print('pct of curr_df in raw_df', curr_df['Building Code'].isin(raw_df['Building Code']).sum() / len(curr_df))
print('pct of curr_df in bap_master_sheet_df', curr_df['Building Code'].isin(bap_master_sheet_df['Building Code']).sum() / len(curr_df))
print()
print('pct of raw_df in `BAPS` sheet', raw_df['Building Code'].isin(baps_sheet_df['Building Code']).sum() / len(raw_df))
print('pct of raw_df in curr_df', raw_df['Building Code'].isin(curr_df['Building Code']).sum() / len(raw_df))
print('pct of raw_df in bap_master_sheet_df', raw_df['Building Code'].isin(bap_master_sheet_df['Building Code']).sum() / len(raw_df))
print()
print('pct of `BAPS` sheet in curr_df', baps_sheet_df['Building Code'].isin(curr_df['Building Code']).sum() / len(baps_sheet_df))
print('pct of `BAPS` sheet in raw_df', baps_sheet_df['Building Code'].isin(raw_df['Building Code']).sum() / len(baps_sheet_df))
print('pct of `BAPS` sheet in bap_master_sheet_df', baps_sheet_df['Building Code'].isin(bap_master_sheet_df['Building Code']).sum() / len(baps_sheet_df))
print()
print('pct of bap_master_sheet_df in `BAPS` sheet', bap_master_sheet_df['Building Code'].isin(baps_sheet_df['Building Code']).sum() / len(bap_master_sheet_df))
print('pct of bap_master_sheet_df in curr_df', bap_master_sheet_df['Building Code'].isin(curr_df['Building Code']).sum() / len(bap_master_sheet_df))
print('pct of bap_master_sheet_df in raw_df', bap_master_sheet_df['Building Code'].isin(raw_df['Building Code']).sum() / len(bap_master_sheet_df))

pct of curr_df in `BAPS` sheet 0.825588066551922
pct of curr_df in raw_df 0.9827882960413081
pct of curr_df in bap_master_sheet_df 0.8387837062535858

pct of raw_df in `BAPS` sheet 0.6663793103448276
pct of raw_df in curr_df 0.7262931034482759
pct of raw_df in bap_master_sheet_df 0.6827586206896552

pct of `BAPS` sheet in curr_df 0.8610206297502715
pct of `BAPS` sheet in raw_df 0.9554831704668838
pct of `BAPS` sheet in bap_master_sheet_df 0.9695982627578719

pct of bap_master_sheet_df in `BAPS` sheet 0.9695982627578719
pct of bap_master_sheet_df in curr_df 0.8816503800217155
pct of bap_master_sheet_df in raw_df 0.9858849077090119
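The pairwise checks above all follow one pattern, which a small helper can express. This is a sketch on made-up codes: `coverage` and `toy_sheets` are hypothetical names. As in the cell above, the result is a fraction of rows (duplicates included), not a percentage of unique codes.

```python
import pandas as pd

def coverage(left: pd.Series, right: pd.Series) -> float:
    """Fraction of rows in `left` whose value also appears in `right`."""
    return left.isin(right).mean()

# Made-up building codes standing in for two sheets.
toy_sheets = {
    "a": pd.Series(["X001", "X002", "X003", "X003"]),
    "b": pd.Series(["X002", "X003"]),
}

# Rows: the sheet being checked. Columns: the sheet it is checked against.
table = pd.DataFrame(
    {col: {row: coverage(toy_sheets[row], toy_sheets[col]) for row in toy_sheets}
     for col in toy_sheets}
)
```

Reading `table.loc["a", "b"]` gives the share of rows in sheet `a` whose code also appears in sheet `b`.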


In [9]:
print('pct buildings from schools in curr_df', schools['Building Code'].drop_duplicates().isin(curr_df['Building Code']).sum() / schools['Building Code'].nunique())
print('pct buildings from schools in raw_df', schools['Building Code'].drop_duplicates().isin(raw_df['Building Code']).sum() / schools['Building Code'].nunique())
print('pct buildings from schools in baps_sheet_df', schools['Building Code'].drop_duplicates().isin(baps_sheet_df['Building Code']).sum() / schools['Building Code'].nunique())
print('pct buildings from schools in bap_master_sheet_df', schools['Building Code'].drop_duplicates().isin(bap_master_sheet_df['Building Code']).sum() / schools['Building Code'].nunique())

pct buildings from schools in curr_df 0.6533613445378151
pct buildings from schools in raw_df 0.8690476190476191
pct buildings from schools in baps_sheet_df 0.5686274509803921
pct buildings from schools in bap_master_sheet_df 0.584733893557423


The checks below show that raw_df, curr_df, and bap_master_sheet_df are all needed to cover every building code in the source data that matches a building code in master schools.

In [10]:
# building code from master schools in raw_df but not in curr_df
schools[
    schools['Building Code'].isin(raw_df['Building Code'])
    &
    ~schools['Building Code'].isin(curr_df['Building Code'])
]['Building Code'].nunique()

319

In [11]:
# building code from master schools in curr_df but not in raw_df
schools[
    schools['Building Code'].isin(curr_df['Building Code'])
    &
    ~schools['Building Code'].isin(raw_df['Building Code'])
]['Building Code'].nunique()

11

In [12]:
# building code from master schools in bap_master_sheet_df but not in raw_df nor curr_df
schools[
    schools['Building Code'].isin(bap_master_sheet_df['Building Code'])
    &
    ~schools['Building Code'].isin(raw_df['Building Code'])
    &
    ~schools['Building Code'].isin(curr_df['Building Code'])
]['Building Code'].nunique()

1

In [13]:
# building code from master schools in BAPs but not in raw_df nor curr_df nor bap_master_sheet_df
schools[
    schools['Building Code'].isin(baps_sheet_df['Building Code'])
    &
    ~schools['Building Code'].isin(raw_df['Building Code'])
    &
    ~schools['Building Code'].isin(curr_df['Building Code'])
    &
    ~schools['Building Code'].isin(bap_master_sheet_df['Building Code'])
]['Building Code'].nunique()

0
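The sequence of checks above asks how many target building codes each sheet adds beyond the sheets already considered. A minimal sketch of that idea with plain sets follows. The function name and toy codes are hypothetical, and the order of `toy_sources` determines which sheet gets credit.

```python
def incremental_gain(target, sources):
    """For each source in order, count target codes it adds beyond earlier sources."""
    covered = set()
    gains = {}
    for name, codes in sources.items():
        new = (set(codes) & set(target)) - covered
        gains[name] = len(new)
        covered |= new
    return gains

# Made-up codes: the target list and three sheets in priority order.
toy_target = {"X001", "X002", "X003", "X004", "X005"}
toy_sources = {
    "curr": {"X001", "X002"},
    "raw": {"X002", "X003", "X999"},
    "bap_master": {"X003", "X004"},
}
gains = incremental_gain(toy_target, toy_sources)
```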

## Handling Duplicates

In [14]:
# Assert that all BAP Ratings associated with each distinct building code are the same
assert (curr_df.groupby('Building Code')['BAP Rating'].nunique() == 1).all()

In [15]:
# Assert that all (non-null) BAP Ratings associated with each distinct building code are the same (using `<=` bc nulls will have nunique of 0)
assert (raw_df.groupby('Building Code')['BAP Rating'].nunique() <= 1).all()
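The difference between `== 1` and `<= 1` comes from how `nunique` treats nulls: it ignores them by default, so a building with only missing ratings has a count of 0. A toy illustration with made-up codes and ratings:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Building Code": ["X001", "X001", "X002", "X003", "X003"],
    "BAP Rating":    [6.0,    6.0,    np.nan, 4.0,    np.nan],
})

# NaN is not counted, so X002 gets 0 and X003 gets 1.
n_ratings = toy.groupby("Building Code")["BAP Rating"].nunique()
consistent = bool((n_ratings <= 1).all())
unrated = n_ratings[n_ratings == 0].index.tolist()
```

By the same logic, the stricter `== 1` assertion on `curr_df` also implies that every building code in that sheet has at least one non-null rating.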

In [16]:
# NOTE: there are disagreements between curr_df and raw_df on BAP Rating for the same building codes
# Manually checked K013, K071, and M180, and confirmed that curr_df matches what's on the web now, so we should use that when we have it rather than raw_df
curr_vs_raw_check = curr_df[['Building Code', 'BAP Rating']].merge(raw_df[['Building Code', 'BAP Rating']], on='Building Code', suffixes=('_curr', '_raw'), how='outer', indicator=True)

curr_vs_raw_check[
    (curr_vs_raw_check['_merge'] == 'both') 
    &
    (curr_vs_raw_check['BAP Rating_curr'].notnull())
    &
    (curr_vs_raw_check['BAP Rating_raw'].notnull())
    &
    (curr_vs_raw_check['BAP Rating_curr'] != curr_vs_raw_check['BAP Rating_raw'])
].drop_duplicates('Building Code')

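| | Building Code | BAP Rating_curr | BAP Rating_raw | _merge |
|---|---|---|---|---|
| 26 | K013 | 6.0 | 1.0 | both |
| 155 | K071 | 1.0 | 8.0 | both |
| 259 | K138 | 4.0 | 1.0 | both |
| 281 | K152 | 6.0 | 8.0 | both |
| 319 | K175 | 9.0 | 5.0 | both |
| 419 | K223 | 7.0 | 8.0 | both |
| 482 | K246 | 9.0 | 8.0 | both |
| 582 | K281 | 2.0 | 1.0 | both |
| 594 | K286 | 8.0 | 6.0 | both |
| 598 | K289 | 7.0 | 8.0 | both |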


# Join BAP to Schools

In [17]:
schools_join_keys = schools[['Location Code', 'Building Code']]

In [18]:
# Drop duplicates from curr_df and raw_df on building code (since we verified above that all ratings are the same per building code)
bap_join_cols = ['Building Code', 'Accessibility Description', 'BAP Rating']
bap_curr = curr_df[bap_join_cols].drop_duplicates(subset=['Building Code'])
bap_raw = raw_df[bap_join_cols].drop_duplicates(subset=['Building Code'])

In [19]:
# Merge both curr and raw onto schools
final_df = schools_join_keys.merge(bap_curr, on='Building Code', how='left', suffixes=('', '_curr'))
final_df = final_df.merge(bap_raw, on='Building Code', how='left', suffixes=('_curr', '_raw'))

In [20]:
# Coalesce: use curr, fallback to raw
final_df['BAP Rating'] = final_df['BAP Rating_curr'].fillna(final_df['BAP Rating_raw'])
final_df['Accessibility Description'] = final_df['Accessibility Description_curr'].fillna(final_df['Accessibility Description_raw'])

# Clean up temp columns
final_df = final_df.drop(['BAP Rating_curr', 'BAP Rating_raw', 'Accessibility Description_curr', 'Accessibility Description_raw'], axis=1)
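The coalesce step can be illustrated on toy values. The `source` column below is a hypothetical addition that is not part of the pipeline above. It records which sheet supplied each value, which may help when auditing disagreements between the two sheets.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BAP Rating_curr": [6.0, np.nan, np.nan],
    "BAP Rating_raw":  [1.0, 8.0,    np.nan],
})

# Prefer the current sheet and fall back to the raw sheet.
toy["BAP Rating"] = toy["BAP Rating_curr"].fillna(toy["BAP Rating_raw"])

# Hypothetical provenance column: where did the final value come from?
toy["source"] = np.select(
    [toy["BAP Rating_curr"].notna().to_numpy(), toy["BAP Rating_raw"].notna().to_numpy()],
    ["curr", "raw"],
    default="none",
)
```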

Compare this join to the baseline we had from joining data exported from the IBO report.

In [21]:
# NOTE: for this to be worth it, we need to beat the baseline of 1202 unique building codes joined (out of 1427 total unique building codes in master schools)
master_schools = gpd.read_file('../data/processed_data/master_schools.geojson')
print('Unique school buildings with accessibility joined from IBO report:', master_schools[
        master_schools['Accessible'].notnull()
        |
        master_schools['bap_rating'].notnull()
    ]['Bldg_Code'].nunique()
)

Unique school buildings with accessibility joined from IBO report: 1202


In [22]:
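# Drumroll please...
# NOTE: this counts rows of final_df (one per row of `schools`), not unique building codes, so it is not directly comparable to the 1202 baseline (1772 exceeds the 1427 unique building codes in master schools)
print('School rows with accessibility joined from BAP data:', final_df['Accessibility Description'].notna().sum())

School rows with accessibility joined from BAP data: 1772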
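Several schools can share one building, so a count of joined rows and a count of joined unique building codes can differ. Only the latter is comparable to the IBO baseline, which uses `nunique()`. A sketch on made-up rows (the description values are placeholders) shows the two counts side by side:

```python
import pandas as pd

# Toy stand-in for final_df: two schools share building X001.
toy_final = pd.DataFrame({
    "Location Code": ["L1", "L2", "L3", "L4"],
    "Building Code": ["X001", "X001", "X002", "X003"],
    "Accessibility Description": [
        "Fully Accessible", "Fully Accessible", None, "Partially Accessible",
    ],
})

has_desc = toy_final["Accessibility Description"].notna()
n_rows = int(has_desc.sum())                                      # joined school rows
n_buildings = toy_final.loc[has_desc, "Building Code"].nunique()  # joined unique buildings
```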


# Export Data

In [23]:
output_file = output_dir / "bap_with_school_codes.csv"
final_df.to_csv(output_file, index=False)