I wanted to build a scraper that would let me easily pull down the NYPD's quarterly data on drone flights. First I had to import the libraries needed to scrape the website. The scraping itself was fairly straightforward because the URLs follow a simple pattern: base_url + "{year}-q{quarter}.xlsx", which downloads the Excel file directly.
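
The URL pattern described above can be sketched as a small generator. This is only an illustration: `build_quarterly_urls` is a hypothetical helper name, and the base URL is the one used further down in this notebook.

```python
# Base URL taken from the scraping loop later in this notebook.
BASE_URL = "https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-"


def build_quarterly_urls(first_year, last_year, base_url=BASE_URL):
    """Return (year, quarter, url) tuples for every quarter in the inclusive year range.

    Hypothetical helper: it only builds strings and does not check that a file exists.
    """
    return [
        (year, quarter, f"{base_url}{year}-q{quarter}.xlsx")
        for year in range(first_year, last_year + 1)
        for quarter in range(1, 5)
    ]


urls = build_quarterly_urls(2019, 2020)
print(urls[0][2])
```

Building the list up front keeps the URL logic separate from the download loop, which makes it easier to spot a quarter whose file name does not follow the pattern.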

In [None]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=bdd2cad9bf4ce296c126853798a85c21067b9e01bd30afbdd96ce207ce9baa6a
  Stored in directory: /root/.cache/pip/wheels/01/46/3b/e29ffbe4ebe614ff224bad40fc6a5773a67a163251585a13a9
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
## import libraries
import requests  # Makes HTTP requests to fetch web pages from URLs
from bs4 import BeautifulSoup  # Parses HTML content into navigable Python objects for web scraping
import pandas as pd  # Creates and manipulates DataFrames for organizing scraped data into tables
import time  # Adds delays between requests to avoid overwhelming the server
from random import uniform  # Generates random time intervals to make scraping delays less predictable
import wget  # Downloads a file from a URL into the local working directory

In [None]:
## create headers to be sneaky

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

The NYPD data seemed clean at first glance... but it turned out I needed to create a function that would let me skip rows in a "smart" way, because not every Excel file was formatted exactly the same way.

The initial code I wrote by hand worked perfectly up until Q4 2020, when the rows and columns get messed up. That's when I turned to AI to help me write the functions -- I had it pivot on "Brooklyn" because that value stays the same throughout. The number of rows/variables stays the same, but the number of columns changes as new types of drone flights are introduced over time.

## here's how I tested it out on my problem sheet

def get_num_rows_to_skip(raw_df):
    """
    Find the row containing 'Brooklyn' as the header row.
    Skip up to that row.
    """
    for i, row in raw_df.iterrows():
        # Normalize all cells: convert to string and strip whitespace (NaN becomes the string "nan")
        cleaned = row.astype(str).str.strip().fillna("")
        if cleaned.str.contains(r"\bBrooklyn\b", case=False, regex=True).any():
            return i
    return 0  # fallback


def get_column_range_to_use(raw_df):
    """
    Determine column range based on whether 'Other Agency' column exists.
    If column I is missing or fully empty, use A:H; else use A, D:J
    """
    header_row_idx = get_num_rows_to_skip(raw_df)
    header = raw_df.iloc[header_row_idx]

    # Count actual non-null columns
    num_cols = header.count()

    # Case 1: If sheet only has 8 columns → use A:H
    if num_cols <= 8:
        return "A:H"

    # Case 2: If column I (index 8) exists but is fully empty → use A:H
    if header.iloc[8] in [None, "", "nan"] or pd.isna(header.iloc[8]):
        return "A:H"

    # Otherwise sheet has valid 'Other Agency' column → use extended range
    return "A, D:J"

# download and inspect excel file
url = "https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-2020-q4.xlsx"
my_data = wget.download(url)

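# read the raw sheet with no header so the helper functions can inspect every row
raw_df = pd.read_excel(my_data, header=None)

# smart function
num_rows_to_skip = get_num_rows_to_skip(raw_df)
col_range_to_use = get_column_range_to_use(raw_df)
df = pd.read_excel(my_data, skiprows=num_rows_to_skip, usecols=col_range_to_use)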

# Now you can rename the first column
df.rename(columns={df.columns[0]: "Category"}, inplace=True)

df.tail()
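
To see the "pivot on Brooklyn" idea without downloading anything, here is a minimal sketch on a made-up sheet. The helper name `find_header_row` and the toy data are assumptions for illustration. It uses a plain substring match rather than the word-boundary regex above, and it returns `None` when the anchor is missing instead of falling back to row 0.

```python
import pandas as pd


def find_header_row(raw_df, anchor="Brooklyn"):
    """Return the position of the first row with `anchor` in any cell, or None if absent."""
    for position, (_, row) in enumerate(raw_df.iterrows()):
        # Ignore empty cells, then do a case-insensitive substring check on the rest.
        if any(anchor.lower() in str(cell).lower() for cell in row.dropna()):
            return position
    return None


# Toy sheet: a title row and a blank row sit above the real header.
toy = pd.DataFrame([
    ["NYPD UAS Operations", None, None],
    [None, None, None],
    ["Category", "Brooklyn", "Bronx"],
    ["Search and rescue operations", 0, 1],
])

header_row = find_header_row(toy)

# Promote the detected row to column labels and keep only the rows below it.
clean = toy.iloc[header_row + 1:].reset_index(drop=True)
clean.columns = toy.iloc[header_row].tolist()
print(clean)
```

Returning `None` is a design choice for this sketch: it lets the caller notice a sheet with no "Brooklyn" header instead of silently treating the first row as the header.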

Once I knew the functions worked, I integrated them into the code I had built by hand earlier.

In [None]:
all_dfs = []  ## hold all dfs (dataframes) that will be created for each page
base_url = "https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-" ## base url of the nypd site which holds the files

## smart row and column function

def get_num_rows_to_skip(raw_df):
    """
    Find the row containing 'Brooklyn' as the header row.
    Skip up to that row.
    """
    for i, row in raw_df.iterrows():
        # Normalize all cells: convert to string and strip whitespace (NaN becomes the string "nan")
        cleaned = row.astype(str).str.strip().fillna("")
        if cleaned.str.contains(r"\bBrooklyn\b", case=False, regex=True).any():
            return i
    return 0  # fallback


def get_column_range_to_use(raw_df):
    """
    Determine column range based on whether 'Other Agency' column exists.
    If column I is missing or fully empty, use A:H; else use A, D:J
    """
    header_row_idx = get_num_rows_to_skip(raw_df)
    header = raw_df.iloc[header_row_idx]

    # Count actual non-null columns
    num_cols = header.count()

    # Case 1: If sheet only has 8 columns → use A:H
    if num_cols <= 8:
        return "A:H"

    # Case 2: If column I (index 8) exists but is fully empty → use A:H
    if header.iloc[8] in [None, "", "nan"] or pd.isna(header.iloc[8]):
        return "A:H"

    # Otherwise sheet has valid 'Other Agency' column → use extended range
    return "A, D:J"

## main scraping loop

for year in range(2019, 2026):
    for quarter in range(1, 5):
        try:  ## attempt to request the page
            url = f"{base_url}{year}-q{quarter}.xlsx"
            current_page_data = wget.download(url)

            ## use smart row and column function
            raw_df = pd.read_excel(current_page_data, header=None)
            num_rows_to_skip = get_num_rows_to_skip(raw_df)
            col_range_to_use = get_column_range_to_use(raw_df)

            ## read data
            df = pd.read_excel(current_page_data, skiprows=num_rows_to_skip, usecols=col_range_to_use)

            # rename first column
            df.rename(columns={df.columns[0]: "Category"}, inplace=True)

            # add metadata (year/quarter)
            df["Year"] = year
            df["Quarter"] = f"Q{quarter}"

            # Add this DataFrame to the list
            all_dfs.append(df)

            ## pause between page requests to avoid overwhelming the server (random delay between 30–45 seconds)
            second_to_snooze = uniform(30,45)
            print(f"Created DF from page {url} and snoozing for {second_to_snooze} seconds before next page")
            time.sleep(second_to_snooze)  ## actually wait the random time before continuing

        except Exception as e:  ## if something broke somewhere in here, report which quarter and why
            print(f"Problem with {year}-q{quarter}: {e}")

print(f"Done scraping all available quarters of drone data")  ## confirm completion once all quarters are processed

Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-2019-q1.xlsx and snoozing for 33.16301675206698 seconds before next page
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-2019-q2.xlsx and snoozing for 37.016604957486464 seconds before next page
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-2019-q3.xlsx and snoozing for 36.31792353431926 seconds before next page
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-2019-q4.xlsx and snoozing for 36.07870246041845 seconds before next page
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas-operations-2020-q1.xlsx and snoozing for 42.94183447003387 seconds before next page
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_pl

Since I started scraping this data ~three months ago, the agency has changed the way it names the files, so my naming convention no longer works.

There is no standard naming convention between 2025 Q2 and Q3, so I will load them individually -- I will look at how they name Q4 to determine how to proceed with the scraping project.
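
One way to handle the irregular names without a second loop is a small override table that falls back to the old pattern. This is a sketch, not what the cell below does: `FILENAME_OVERRIDES` and `url_for_quarter` are hypothetical names, and the two override file names are the ones used in the cell below.

```python
BASE_DIR = "https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/"

# File names that break the old pattern, keyed by (year, quarter).
# Only the two irregular files named in this notebook are listed here.
FILENAME_OVERRIDES = {
    (2025, 2): "uas_2q_report.xlsx",
    (2025, 3): "uas_3q_report_final.xlsx",
}


def url_for_quarter(year, quarter):
    """Return the download URL, preferring a known override over the default pattern."""
    default_name = f"uas-operations-{year}-q{quarter}.xlsx"
    filename = FILENAME_OVERRIDES.get((year, quarter), default_name)
    return BASE_DIR + filename


print(url_for_quarter(2020, 4))
print(url_for_quarter(2025, 3))
```

When a new quarter appears under yet another name, only the dictionary needs a new entry.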

In [None]:
new_dfs = []  ## hold all dfs (dataframes) that will be created for each page
base_url = "https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas" ## base url of the nypd site which holds the files

## smart row and column function

def get_num_rows_to_skip(raw_df):
    """
    Find the row containing 'Brooklyn' as the header row.
    Skip up to that row.
    """
    for i, row in raw_df.iterrows():
        # Normalize all cells: convert to string, strip whitespace, drop empty cells
        cleaned = row.astype(str).str.strip().fillna("")
        if cleaned.str.contains(r"\bBrooklyn\b", case=False, regex=True).any():
            return i
    return 0  # fallback


def get_column_range_to_use(raw_df):
    """
    Determine column range based on whether 'Other Agency' column exists.
    If column I is missing or fully empty, use A:H; else use A, D:J
    """
    header_row_idx = get_num_rows_to_skip(raw_df)
    header = raw_df.iloc[header_row_idx]

    # Count actual non-null columns
    num_cols = header.count()

    # Case 1: If sheet only has 8 columns → use A:H
    if num_cols <= 8:
        return "A:H"

    # Case 2: If column I (index 8) exists but is fully empty → use A:H
    if header.iloc[8] in [None, "", "nan"] or pd.isna(header.iloc[8]):
        return "A:H"

    # Otherwise sheet has valid 'Other Agency' column → use extended range
    return "A, D:J"

## main scraping loop for new wonky files

# Define the specific files to scrape with their metadata
# You will need to manually determine the correct Year and Quarter for these files
new_files_to_scrape = [
    {"filename_suffix": "_2q_report.xlsx", "year": 2025, "quarter": 2}, # Example for uas_2q_report.xlsx
    {"filename_suffix": "_3q_report_final.xlsx", "year": 2025, "quarter": 3} # Example for uas_3q_report_final.xlsx
]

for file_info in new_files_to_scrape:
    filename_suffix = file_info["filename_suffix"]
    year_metadata = file_info["year"]
    quarter_metadata = file_info["quarter"]

    try:  ## attempt to request the page
        url = f"{base_url}{filename_suffix}"
        print(f"Attempting to download: {url}") # Added for debugging
        current_page_data = wget.download(url)

        ## use smart row and column function
        raw_df = pd.read_excel(current_page_data, header=None)
        num_rows_to_skip = get_num_rows_to_skip(raw_df)
        col_range_to_use = get_column_range_to_use(raw_df)

        # Force inclusion of column I (index 8) for 2025 Q3 file if it's supposed to be 'Outside NYC*'
        if filename_suffix == "_3q_report_final.xlsx":
            col_range_to_use = "A,D:J" # This assumes column I is the one that becomes 'Outside NYC*'

        ## read data
        df = pd.read_excel(current_page_data, skiprows=num_rows_to_skip, usecols=col_range_to_use)

        # rename first column
        df.rename(columns={df.columns[0]: "Category"}, inplace=True)

        # Add specific renaming for 'Outside NYC*' if applicable for Q3 2025
        if filename_suffix == "_3q_report_final.xlsx":
            if 'Other Agency' in df.columns:
                df.rename(columns={'Other Agency': 'Outside NYC*'}, inplace=True)

        # add metadata (year/quarter) using the explicit values
        df["Year"] = year_metadata
        df["Quarter"] = f"Q{quarter_metadata}"

        # Add this DataFrame to the list
        new_dfs.append(df)

        ## pause between page requests to avoid overwhelming the server (random delay between 30–45 seconds)
        second_to_snooze = uniform(30,45)
        print(f"Created DF from page {url} and snoozing for {second_to_snooze} seconds before next page")
        time.sleep(second_to_snooze)  ## actually wait the random time before continuing

    except Exception as e:  ## if something broke somewhere in here
        print(f"Problem with {filename_suffix}: {e}")

print(f"Done scraping new wonky available quarters of drone data")  ## confirm completion once all quarters are processed

Attempting to download: https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas_2q_report.xlsx
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas_2q_report.xlsx and snoozing for 42.70900653557747 seconds before next page
Attempting to download: https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas_3q_report_final.xlsx
Created DF from page https://www.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/uas/uas_3q_report_final.xlsx and snoozing for 38.804866030028265 seconds before next page
Done scraping new wonky available quarters of drone data


In [None]:
## put them all together

df = pd.concat(all_dfs + new_dfs, ignore_index = True)

# Consolidate 'Other Agency' and 'Outside NYC*' if both exist
if 'Other Agency' in df.columns and 'Outside NYC*' in df.columns:
    # Coalesce values: fill NaNs in 'Outside NYC*' with values from 'Other Agency'
    df['Outside NYC*'] = df['Outside NYC*'].fillna(df['Other Agency'])
    df.drop(columns=['Other Agency'], inplace=True)
elif 'Other Agency' in df.columns and 'Outside NYC*' not in df.columns:
    # If only 'Other Agency' exists, rename it to 'Outside NYC*'
    df.rename(columns={'Other Agency': 'Outside NYC*'}, inplace=True)

df

Unnamed: 0,Category,Unnamed: 1,Unnamed: 2,Brooklyn,Bronx,Queens,Manhattan,Staten Island,Year,Quarter,Total,Other Agency\nAssist,Outside NYC*
0,Search and rescue operations,,,0.0,0.0,0.0,0.0,0.0,2019,Q1,,,
1,Collision / Crime Scene Documentation,,,1.0,1.0,4.0,0.0,0.0,2019,Q1,,,
2,Evidence searches at large or inaccessible scenes,,,0.0,1.0,0.0,0.0,0.0,2019,Q1,,,
3,Hazardous material incidents,,,0.0,0.0,0.0,0.0,0.0,2019,Q1,,,
4,Monitoring vehicular traffic and pedestrian co...,,,0.0,0.0,0.0,2.0,0.0,2019,Q1,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,Drone as a First Responder \n(DFR),,,1322.0,1502.0,182.0,19.0,74.0,2025,Q3,3099.0,,1.0
319,Warrant,,,23.0,8.0,15.0,2.0,0.0,2025,Q3,48.0,,0.0
320,TOTAL,,,1551.0,1653.0,371.0,195.0,149.0,2025,Q3,3923.0,,5.0
321,"* This category was previously reported as ""Ou...",,,,,,,,2025,Q3,,,
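
The combined table above still shows a column labeled `Other Agency\nAssist` next to `Outside NYC*`, so the exact-match check for 'Other Agency' never fires. A minimal sketch of one possible fix is to normalize whitespace in the column labels before coalescing. The helper names are hypothetical, and treating the two columns as the same category is the same assumption the cell above already makes; the truncated footnote does not confirm it.

```python
import pandas as pd


def normalize_column_names(frame):
    """Collapse newlines and repeated whitespace in column labels into single spaces."""
    return frame.rename(columns=lambda name: " ".join(str(name).split()))


def coalesce_columns(frame, keep, fold_in):
    """Fill gaps in `keep` with values from `fold_in`, then drop `fold_in`."""
    if keep in frame.columns and fold_in in frame.columns:
        frame = frame.assign(**{keep: frame[keep].fillna(frame[fold_in])})
        return frame.drop(columns=[fold_in])
    if fold_in in frame.columns:
        # Only the old label exists, so simply rename it.
        return frame.rename(columns={fold_in: keep})
    return frame


# Toy frame imitating the two differently named columns (values are made up).
toy = pd.DataFrame({
    "Category": ["TOTAL", "TOTAL"],
    "Other Agency\nAssist": [2.0, None],
    "Outside NYC*": [None, 5.0],
})

merged = coalesce_columns(
    normalize_column_names(toy),
    keep="Outside NYC*",
    fold_in="Other Agency Assist",
)
print(merged)
```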


In [None]:
## how many DFR flights have occurred since the policy was passed

df[df["Category"] == "Drone as a First Responder \n(DFR)"][["Brooklyn","Bronx","Queens","Manhattan","Staten Island", "Outside NYC*"]].sum()

Unnamed: 0,0
Brooklyn,5592.0
Bronx,4678.0
Queens,344.0
Manhattan,377.0
Staten Island,139.0
Outside NYC*,1.0


In [None]:
## how many total drone flights across the boros all time

df[df["Category"] == "TOTAL"][["Brooklyn","Bronx","Queens","Manhattan","Staten Island","Outside NYC*"]].sum()

Unnamed: 0,0
Brooklyn,6612.0
Bronx,5426.0
Queens,1771.0
Manhattan,1429.0
Staten Island,550.0
Outside NYC*,5.0


In [54]:
## look at DFR flights as a percentage of total flights by boro

dfr_flights = df[df["Category"] == "Drone as a First Responder \n(DFR)"][["Brooklyn","Bronx","Queens","Manhattan","Staten Island", "Outside NYC*"]].sum()
total_flights = df[df["Category"] == "TOTAL"][["Brooklyn","Bronx","Queens","Manhattan","Staten Island","Outside NYC*"]].sum()

percentage_dfr_of_total = (dfr_flights / total_flights) * 100

print("Percentage of DFR flights out of total flights by borough:")
display(percentage_dfr_of_total)

Percentage of DFR flights out of total flights by borough:


Unnamed: 0,0
Brooklyn,84.573503
Bronx,86.214523
Queens,19.424054
Manhattan,26.382085
Staten Island,25.272727
Outside NYC*,20.0


In [55]:
# Calculate overall totals
overall_dfr_total = dfr_flights.sum()
overall_total_flights = total_flights.sum()

# Calculate overall percentage
overall_percentage_dfr_of_total = (overall_dfr_total / overall_total_flights) * 100

print(f"\nOverall DFR Flights: {overall_dfr_total}")
print(f"Overall Total Known Flights: {overall_total_flights}")
print(f"Overall Percentage of DFR Flights: {overall_percentage_dfr_of_total:.2f}%")


Overall DFR Flights: 11131.0
Overall Total Known Flights: 15793.0
Overall Percentage of DFR Flights: 70.48%


In [57]:
## some of this dataframe is a little weird but it's still workable -- ideally we'd clean it up a bit more
df.info()

DataFrame Information (data types, non-null values):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Category             323 non-null    object 
 1   Unnamed: 1           0 non-null      float64
 2   Unnamed: 2           0 non-null      float64
 3   Brooklyn             321 non-null    float64
 4   Bronx                321 non-null    float64
 5   Queens               321 non-null    float64
 6   Manhattan            321 non-null    float64
 7   Staten Island        321 non-null    float64
 8   Year                 323 non-null    int64  
 9   Quarter              323 non-null    object 
 10  Total                48 non-null     float64
 11  Other Agency
Assist  24 non-null     float64
 12  Outside NYC*         24 non-null     float64
dtypes: float64(10), int64(1), object(2)
memory usage: 32.9+ KB
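
The `info()` output shows two fully empty `Unnamed` columns and two rows with no borough counts (the table earlier suggests at least one is a footnote). A minimal sketch of the extra cleanup mentioned above, run on made-up data; `tidy_flights` is a hypothetical helper name.

```python
import pandas as pd

BOROUGHS = ["Brooklyn", "Bronx", "Queens", "Manhattan", "Staten Island"]


def tidy_flights(frame, value_columns=BOROUGHS):
    """Drop rows with no counts in `value_columns`, then drop columns with no data at all."""
    tidy = frame.dropna(axis="index", how="all", subset=value_columns)
    tidy = tidy.dropna(axis="columns", how="all")
    return tidy.reset_index(drop=True)


# Toy frame with one empty column and one footnote row (values are illustrative).
toy = pd.DataFrame({
    "Category": ["Warrant", "TOTAL", "* footnote text"],
    "Unnamed: 1": [None, None, None],
    "Brooklyn": [23.0, 1551.0, None],
    "Bronx": [8.0, 1653.0, None],
    "Year": [2025, 2025, 2025],
})

tidy = tidy_flights(toy, value_columns=["Brooklyn", "Bronx"])
print(tidy)
```

Rows are dropped before columns so that a count column is never removed before it is used as a filter.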


In [60]:
## lets also make a cleaner dataset just for DFR
## filter for drone as first responder flights, which began in Q3 2024

dfr_flights_df = df[df['Category'] == 'Drone as a First Responder \n(DFR)']
dfr_flights_df.head()

Unnamed: 0,Category,Unnamed: 1,Unnamed: 2,Brooklyn,Bronx,Queens,Manhattan,Staten Island,Year,Quarter,Total,Other Agency\nAssist,Outside NYC*
270,Drone as a First Responder \n(DFR),,,1253.0,420.0,0.0,42.0,0.0,2024,Q3,,,
282,Drone as a First Responder \n(DFR),,,585.0,441.0,2.0,66.0,2.0,2024,Q4,,,
294,Drone as a First Responder \n(DFR),,,1215.0,988.0,0.0,202.0,3.0,2025,Q1,,,
306,Drone as a First Responder \n(DFR),,,1217.0,1327.0,160.0,48.0,60.0,2025,Q2,,,
318,Drone as a First Responder \n(DFR),,,1322.0,1502.0,182.0,19.0,74.0,2025,Q3,3099.0,,1.0


In [68]:
## filter for only the relevant columns: boros, year, and quarter

selected_columns = ['Year', 'Quarter', 'Brooklyn', 'Bronx', 'Queens', 'Manhattan', 'Staten Island', 'Outside NYC*']
dfr_flights_distribution = dfr_flights_df[selected_columns]
dfr_flights_distribution.tail()

Unnamed: 0,Year,Quarter,Brooklyn,Bronx,Queens,Manhattan,Staten Island,Outside NYC*
270,2024,Q3,1253.0,420.0,0.0,42.0,0.0,
282,2024,Q4,585.0,441.0,2.0,66.0,2.0,
294,2025,Q1,1215.0,988.0,0.0,202.0,3.0,
306,2025,Q2,1217.0,1327.0,160.0,48.0,60.0,
318,2025,Q3,1322.0,1502.0,182.0,19.0,74.0,1.0
