# Price Index of Operating Costs (PIOC)
**Developed by**: Itzamna Huerta, for the Association for Neighborhood and Housing Development (ANHD)  
**Created**: Feb 2025  
**Last Updated**: N/A  


#### Overview
The Price Index of Operating Costs (PIOC) is an annual report published by the New York City Rent Guidelines Board (RGB), tracking changes in the costs of maintaining rent-stabilized housing. This project automates the extraction and analysis of PIOC data to support advocacy efforts, policy analysis, and preservation campaigns.

By systematically collecting and analyzing cost trends, this project provides insight into the financial pressures faced by property owners of rent-stabilized units, informing decisions on affordability policies and tenant protections.

#### Objectives
<u>Extract Key Variables</u>: Identify and capture percentage changes for major cost categories:
- Rent Stabilized Apartments increased
- Real estate taxes
- Insurance costs
- Administrative costs
- Maintenance
- Utilities
- Labor costs
- Fuel
- Natural gas & fuel oil heat
- Costs in Pre-1974 buildings
- Buildings that contain rent stabilized apartments

<u>Standardize Data</u>: Clean and structure extracted values to ensure consistency across multiple years.

<u>Analyze Cost Trends</u>: Compile data into a structured format for visualization and policy analysis.


#### Methodology
1. Text Extraction & Preprocessing: Extract relevant text from PIOC PDF reports, clean and standardize formatting.
2. Pattern Matching: Use regex-based searches to locate and extract percentage changes associated with key cost categories.
3. Multi-Year Data Compilation: Iterate through multiple reports, associate extracted values with corresponding years, and consolidate into a structured dataset.
4. Data Transformation & Visualization: Convert data into a numerical format, reshape for analysis, and prepare for visualization.

In [1]:
# Import libraries
import pandas as pd
import pdfplumber 
import re
import os

<hr>


## Phase 1: Analyzing Text Structure and Developing Extraction Methodology

The first phase focuses on understanding the structure of PIOC reports and developing an approach to extract percentage values for key cost categories. Since PIOC data is presented in PDF format with varying layouts, this phase involves:


1. Text Cleaning & Preprocessing: Removing unnecessary line breaks and formatting inconsistencies to improve data extraction accuracy.
2. Keyword Identification: Defining key cost categories and their variations to ensure comprehensive data capture.
3. Pattern Recognition & Matching: Using regex-based searches to locate percentage values associated with each category.
4. Validation & Debugging: Testing extraction logic on a single PDF before scaling to multiple years.
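
As an illustration of the keyword-variation step, the sketch below maps each category to several phrasings and builds one regex per category. The variant lists and the helper names (`CATEGORY_VARIANTS`, `build_category_pattern`, `extract_with_variants`) are assumptions made for this example, not part of the notebook. One motivation: a keyword written with a space, such as "Natural gas", will not match the hyphenated form "natural-gas" unless that form is listed as a variant.

```python
import re

# Hypothetical mapping: category -> phrasings that might appear in a report.
# The variant lists are illustrative assumptions, not taken from the reports.
CATEGORY_VARIANTS = {
    "Labor costs": ["Labor costs", "Labor component"],
    "Natural gas": ["Natural gas", "Natural-gas"],
    "Utilities": ["Utilities"],
}

def build_category_pattern(variants):
    """Build a regex matching any variant, then the nearest following percentage."""
    # Longest variants first so a longer phrase is tried before a shorter one.
    ordered = sorted(variants, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    return re.compile(rf"(?:{alternatives}).*?(\d{{1,3}}(?:\.\d+)?%)", re.IGNORECASE)

def extract_with_variants(text, category_variants):
    """Return {category: percentage string, or None if no variant matched}."""
    results = {}
    for category, variants in category_variants.items():
        match = build_category_pattern(variants).search(text)
        results[category] = match.group(1) if match else None
    return results

sample = (
    "The Labor Costs component increased by 4.3%. "
    "Overall costs in natural-gas heated buildings increased 3.8%."
)
print(extract_with_variants(sample, CATEGORY_VARIANTS))
```

Keeping the variants in one dictionary means new phrasings found in older reports can be added without touching the matching logic.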


In [2]:
# Test the approach on a single PDF before writing a function that iterates through all PDF files in the folder
pdf = pdfplumber.open('./pdfs/2024-PIOC.pdf')

In [3]:
# Define keywords (with expected variations in phrasing)
keywords = {
    "Rent Stabilized Apartments increased": None,
    "Real estate taxes": None,
    "Insurance costs": None,  # Fixed capitalization
    "Administrative costs": None,
    "Maintenance": None,
    "Utilities": None,
    "Labor costs": None,
    "Fuel": None,
    "Natural gas": None,  # Adjusted to lowercase for better matching
    "Costs in Pre-1974 buildings": None,
    "Buildings that contain rent stabilized apartments": None
}

page = pdf.pages[2]  # Access the third page (index 2)
text = page.extract_text()  # Extract text from the page
# print(text) # Uncomment to see the original text from the page

# Preprocess text: Remove extra line breaks to keep sentences intact
clean_text = re.sub(r'\n+', ' ', text)
print("Preprocessed text:\n\n",clean_text)


Preprocessed text:

 New York City Rent Guidelines Board 2024 Price Index of Operating Costs 04 Introduction What’s New 04 Overview R The Price Index of Operating Costs (PIOC) for buildings that contain rent stabilized apartments increased 3.9% this year. PIOC Components - 05 Apartments R Real estate taxes rose by 3.2% primarily due to a rise in the tax rate for Class 2 properties. PIOC by Building 08 R Insurance costs rose by the greatest proportion in this Type year’s PIOC, 21.7%. R The Administrative costs component rose 4.6%. 08 Hotel PIOC R The Maintenance component increased by 3.5%. 09 Loft PIOC R The Utilities component increased by 1.3%. R The Labor Costs component increased by 4.3%, due to 09 The Core PIOC increases in wages for both union and non-union labor. R The Fuel component was the only component to PIOC Projections decrease, falling by 7.1%. 10 for 2025 R Overall costs in natural-gas heated buildings increased 3.8%, while overall costs in fuel-oil heated buildings Co

In [4]:
# Iterate through keywords and find percentages
for key in keywords.keys():
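    # Regex pattern: the keyword, then the shortest run of text up to the next percentage.
    # Note: `.*?` is not limited to one sentence, so a match can cross sentence boundaries.
    pattern = rf"({re.escape(key)}.*?)(\d{{1,3}}(?:\.\d+)?%)"  # Group 2 captures the percentage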

    match = re.search(pattern, clean_text, re.IGNORECASE)  # Case-insensitive search

    if match:
        # Store the percentage for the keyword
        keywords[key] = match.group(2)

    # Debugging: Print what’s being matched
    # print(f"Checking: {key} → Match: {match.groups() if match else 'None'}")

# Print final output
print("Final Output:\n\n", keywords)

Final Output:

 {'Rent Stabilized Apartments increased': '3.9%', 'Real estate taxes': '3.2%', 'Insurance costs': '21.7%', 'Administrative costs': '4.6%', 'Maintenance': '3.5%', 'Utilities': '1.3%', 'Labor costs': '4.3%', 'Fuel': '7.1%', 'Natural gas': '4.9%', 'Costs in Pre-1974 buildings': '3.6%', 'Buildings that contain rent stabilized apartments': '3.9%'}
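
The output above stores the Fuel change as `7.1%` even though the page text describes it as a decrease ("falling by 7.1%"), because the pattern captures only the number. The sketch below shows one possible way to keep the direction of change and to stop the match at the end of the sentence. The list of decrease words and the function name `extract_signed_percentage` are assumptions for illustration. Treating `.`, `!` and `?` as sentence ends is a simplification: a decimal number or abbreviation between the keyword and the percentage would also stop the match.

```python
import re

# Words assumed to signal a decline; extend this after reviewing the reports.
DECREASE_WORDS = re.compile(
    r"\b(?:decreas\w*|declin\w*|fell|fall\w*|drop\w*)\b", re.IGNORECASE
)

def extract_signed_percentage(text, keyword):
    """Return the first percentage after `keyword` within the same sentence,
    as a float that is negative when the wording in between signals a decrease.
    Returns None when no percentage is found."""
    # [^.!?]*? keeps the match from running past the end of the sentence.
    pattern = rf"{re.escape(keyword)}([^.!?]*?)(\d{{1,3}}(?:\.\d+)?)%"
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(2))
    if DECREASE_WORDS.search(match.group(1)):
        value = -value
    return value

sample = (
    "The Fuel component was the only component to decrease, falling by 7.1%. "
    "Insurance costs rose by the greatest proportion in this year's PIOC, 21.7%. "
    "The Maintenance component is discussed later. Real estate taxes rose by 3.2%."
)
for keyword in ["Fuel", "Insurance costs", "Maintenance"]:
    print(keyword, extract_signed_percentage(sample, keyword))
```

In this sample, "Maintenance" returns `None` rather than borrowing the 3.2% that belongs to the next sentence.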


## Phase 2: Extracting and Consolidating PIOC Data Across Multiple Reports

This phase focuses on automating the extraction of PIOC percentage data from multiple annual reports and compiling the results into a structured dataset for analysis. The key steps include:

- Batch Processing PDFs: Iterating through multiple PIOC reports to extract percentage values for each cost category.
- Standardizing Data Structure: Associating extracted values with the correct year and ensuring consistency across reports.
- Transforming Data for Analysis: Converting extracted data into a structured DataFrame, enabling easy manipulation and visualization.
- Compiling Historical Trends: Merging data from all reports into a single dataset to track year-over-year cost changes.

By the end of this phase, a comprehensive dataset is created, allowing for deeper analysis of operating cost trends in NYC rent-stabilized buildings.
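
One way to associate each report with its year is to read the year from the file name rather than from the file's position in a list. This is a minimal sketch, assuming file names contain a four-digit year as in `./pdfs/2024-PIOC.pdf`; `year_from_filename` is a hypothetical helper, not part of the notebook.

```python
import os
import re

def year_from_filename(pdf_path):
    """Return the four-digit year found in the file name, e.g. 2024 for '2024-PIOC.pdf'."""
    name = os.path.basename(pdf_path)
    # A standalone 19xx or 20xx number, not part of a longer digit run.
    match = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", name)
    if match is None:
        raise ValueError(f"No four-digit year found in file name: {name}")
    return int(match.group(0))

for path in ["./pdfs/2024-PIOC.pdf", "./pdfs/2020-PIOC.pdf"]:
    print(path, year_from_filename(path))
```

With this approach the reports can be listed in any order, and a file with an unexpected name fails loudly instead of being assigned the wrong year.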









In [5]:

def extract_pioc_from_pdf(pdf_path, keywords, year):
    '''
    Extracts PIOC percentages associated with keyword variables.

    Args:
    - pdf_path (str): The path of the PDF file.
    - keywords (dict): A dictionary with keywords as keys and None as initial values.
      It is updated in place, so pass a copy to keep the original unchanged.
    - year (int): The year associated with the PDF file.

    Returns:
    - dict: The keywords with their extracted percentage strings (None where no
      match was found), plus a 'Year' entry.
    '''
    try:
        # The context manager closes the PDF even if extraction fails
        with pdfplumber.open(pdf_path) as pdf:
            # Iterate through the pages, starting at page 3 (index 2)
            for page in pdf.pages[2:]:
                text = page.extract_text()
                if not text:  # Skip pages without extractable text
                    continue

                # Preprocess text: replace line breaks with spaces
                text = text.replace("\n", " ")

                for key in keywords:
                    if keywords[key] is None:  # Only search for keywords that haven't been found yet
                        # Keyword, then the shortest run of text up to the next percentage
                        # (`.*?` is not limited to one sentence)
                        pattern = rf"({re.escape(key)}.*?)(\d{{1,3}}(?:\.\d+)?%)"
                        match = re.search(pattern, text, re.IGNORECASE)  # Case-insensitive search
                        if match:
                            # Store the percentage for the keyword
                            keywords[key] = match.group(2)

                # Stop iterating if all keywords have been found
                if all(value is not None for value in keywords.values()):
                    break

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

    # Add the year, even after an error, so the row stays identifiable
    keywords['Year'] = year
    return keywords



In [6]:
# Define the list of PDF file paths, newest first (the year is derived from list position below)
pdf_paths = [
    './pdfs/2024-PIOC.pdf',
    './pdfs/2023-PIOC.pdf',
    './pdfs/2022-PIOC.pdf',
    './pdfs/2021-PIOC.pdf',
    './pdfs/2020-PIOC.pdf'
]

# Define keywords (with expected variations in phrasing)
keywords = {
    "Rent Stabilized Apartments increased": None,
    "Real estate taxes": None,
    "Insurance costs": None,
    "Administrative costs": None,
    "Maintenance": None,
    "Utilities": None,
    "Labor costs": None,
    "Fuel": None,
    "Natural gas": None,
    "Costs in Pre-1974 buildings": None,
    "Buildings that contain rent stabilized apartments": None
}

# Collect one row (dict) per report, then build the DataFrame once at the end
rows = []

# Iterate through each PDF file and extract the percentages
for i, pdf_path in enumerate(pdf_paths):
    year = 2024 - i  # Relies on pdf_paths being ordered from 2024 downward
    print(f"Processing: {pdf_path} for year {year}")
    
    # Pass a copy of the keyword dictionary so values from one report don't carry over to the next
    rows.append(extract_pioc_from_pdf(pdf_path, keywords.copy(), year))

# Convert the extracted rows to the final DataFrame
final_df = pd.DataFrame(rows)

Processing: ./pdfs/2024-PIOC.pdf for year 2024
Processing: ./pdfs/2023-PIOC.pdf for year 2023
Processing: ./pdfs/2022-PIOC.pdf for year 2022
Processing: ./pdfs/2021-PIOC.pdf for year 2021
Processing: ./pdfs/2020-PIOC.pdf for year 2020


In [7]:
# Reorder columns to place 'Year' first
final_df = final_df[['Year'] + [col for col in final_df.columns if col != 'Year']]

# Print the final DataFrame
final_df

| | Year | Rent Stabilized Apartments increased | Real estate taxes | Insurance costs | Administrative costs | Maintenance | Utilities | Labor costs | Fuel | Natural gas | Costs in Pre-1974 buildings | Buildings that contain rent stabilized apartments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 3.9% | 3.2% | 21.7% | 4.6% | 3.5% | 1.3% | 4.3% | 7.1% | 4.9% | 3.6% | 3.9% |
| 1 | 2023 | 8.1% | 7.7% | 19.9% | 2.9% | 9.4% | 8.8% | 2.9% | 19.9% | 7.2% | | 8.1% |
| 2 | 2022 | 4.2% | 3.7% | 19.6% | 3.7% | 9.2% | 5.8% | 4.1% | 4.3% | 3.0% | | 4.2% |
| 3 | 2021 | 3.0% | 3.9% | 18.8% | 0.7% | 3.1% | 2.1% | 2.8% | 1.6% | 3.5% | | 3.0% |
| 4 | 2020 | 3.7% | 5.9% | 16.5% | 3.5% | 4.8% | 1.6% | 3.2% | 3.7% | 5.1% | | 3.7% |
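
The table above has empty cells for "Costs in Pre-1974 buildings" in 2020 through 2023, meaning the keyword was not matched in those reports. A small check like the following could list such gaps before analysis. It is a sketch that assumes each report's result is a dict with a `Year` key and `None` for unmatched keywords, as returned by the extraction function; `find_missing_categories` is a hypothetical helper.

```python
def find_missing_categories(rows):
    """Return {year: [categories with no extracted value]} for rows that have gaps."""
    missing = {}
    for row in rows:
        gaps = [key for key, value in row.items() if key != "Year" and value is None]
        if gaps:
            missing[row["Year"]] = gaps
    return missing

# Two example rows using a subset of the columns from the table above.
example_rows = [
    {"Year": 2024, "Fuel": "7.1%", "Costs in Pre-1974 buildings": "3.6%"},
    {"Year": 2023, "Fuel": "19.9%", "Costs in Pre-1974 buildings": None},
]
print(find_missing_categories(example_rows))
```

A non-empty result points to reports where a keyword's phrasing may differ and the keyword list needs another variant.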


In [8]:
# Convert percentages to numeric values
for col in final_df.columns:
    if col != "Year":
        final_df[col] = final_df[col].str.rstrip('%').astype(float)

# Reshape DataFrame to long format for easier plotting
final_df_melted = final_df.melt(id_vars=["Year"], var_name="Cost Category", value_name="Percentage Change")

In [13]:
# Display the transformed DataFrame
final_df_melted.head()

| | Year | Cost Category | Percentage Change |
|---|---|---|---|
| 0 | 2024 | Rent Stabilized Apartments increased | 3.9 |
| 1 | 2023 | Rent Stabilized Apartments increased | 8.1 |
| 2 | 2022 | Rent Stabilized Apartments increased | 4.2 |
| 3 | 2021 | Rent Stabilized Apartments increased | 3.0 |
| 4 | 2020 | Rent Stabilized Apartments increased | 3.7 |


In [11]:
# Export to CSV
# final_df_melted.to_csv("./data/pioc_2024-2020.csv", index=False)