# Price Index of Operating Costs
**Developed by**: Itzamna Huerta, for the Association for Neighborhood and Housing Development (ANHD)  
**Created**: Feb 2025  
**Last Updated**: N/A  


#### Overview:
The Price Index of Operating Costs (PIOC) is one of the data sources that support the Preservation Campaign materials. 
Variables needed from the PDF reports:
- Rent Stabilized Apartments increased
- Real estate taxes
- Insurance Costs
- Administrative Costs
- Maintenance 
- Utilities
- Labor Costs
- Fuel 
- Natural Gas & Fuel Oil Heat
- Costs in Pre-1974 Buildings
- Buildings that contain rent stabilized apartments




In [28]:
# Import libraries
import pandas as pd
import pdfplumber 
import re
import os

<hr>


## Phase 1: Analyzing Text Structure and Developing Methodology to Extract PIOC Percentages

This phase works out how to extract the percentage value for each PIOC category from a single PDF. The page text is cleaned by collapsing line breaks, then each keyword is searched for and the first percentage that follows it is captured and stored under that category for later use.

In [29]:
# First, try a single PDF file before writing a function that iterates through all PDF files in the folder
pdf = pdfplumber.open('./pdfs/2024-PIOC.pdf')

In [46]:
# Define keywords (with expected variations in phrasing)
keywords = {
    "Rent Stabilized Apartments increased": None,
    "Real estate taxes": None,
    "Insurance costs": None,  # Fixed capitalization
    "Administrative costs": None,
    "Maintenance": None,
    "Utilities": None,
    "Labor costs": None,
    "Fuel": None,
    "Natural gas": None,  # Adjusted to lowercase for better matching
    "Costs in Pre-1974 buildings": None,
    "Buildings that contain rent stabilized apartments": None
}

page = pdf.pages[2]  # Index 2 is the third page of the PDF
text = page.extract_text()  # Extract text from the page
# print(text) # Uncomment to see the original text from the page

# Preprocess text: Remove extra line breaks to keep sentences intact
clean_text = re.sub(r'\n+', ' ', text)
print("Preprocessed text:\n\n", clean_text)


Preprocessed text:

 New York City Rent Guidelines Board 2024 Price Index of Operating Costs 04 Introduction What’s New 04 Overview R The Price Index of Operating Costs (PIOC) for buildings that contain rent stabilized apartments increased 3.9% this year. PIOC Components - 05 Apartments R Real estate taxes rose by 3.2% primarily due to a rise in the tax rate for Class 2 properties. PIOC by Building 08 R Insurance costs rose by the greatest proportion in this Type year’s PIOC, 21.7%. R The Administrative costs component rose 4.6%. 08 Hotel PIOC R The Maintenance component increased by 3.5%. 09 Loft PIOC R The Utilities component increased by 1.3%. R The Labor Costs component increased by 4.3%, due to 09 The Core PIOC increases in wages for both union and non-union labor. R The Fuel component was the only component to PIOC Projections decrease, falling by 7.1%. 10 for 2025 R Overall costs in natural-gas heated buildings increased 3.8%, while overall costs in fuel-oil heated buildings Co

In [49]:
# Iterate through keywords and find percentages
for key in keywords.keys():
    # Regex pattern: the keyword, then a lazy match up to the first percentage that follows it
    # (the lazy match is not limited to one sentence, so it can cross sentence boundaries)
    pattern = rf"({re.escape(key)}.*?)(\d{{1,3}}(?:\.\d+)?%)"  # Captures the percentage

    match = re.search(pattern, clean_text, re.IGNORECASE)  # Case-insensitive search

    if match:
        # Store the percentage for the keyword
        keywords[key] = match.group(2)

    # Debugging: Print what’s being matched
    # print(f"Checking: {key} → Match: {match.groups() if match else 'None'}")

# Print final output
print("Final Output:\n\n", keywords)

Final Output:

 {'Rent Stabilized Apartments increased': '3.9%', 'Real estate taxes': '3.2%', 'Insurance costs': '21.7%', 'Administrative costs': '4.6%', 'Maintenance': '3.5%', 'Utilities': '1.3%', 'Labor costs': '4.3%', 'Fuel': '7.1%', 'Natural gas': '4.9%', 'Costs in Pre-1974 buildings': '3.6%', 'Buildings that contain rent stabilized apartments': '3.9%'}
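Two details of the output above are worth checking against the excerpt. The report writes "natural-gas" with a hyphen, which the literal keyword "Natural gas" cannot match, and the Fuel decrease is stored as an unsigned `7.1%`. The following is a minimal sketch of a stricter matcher, not the method used in this notebook. The helper name `find_change`, the list of decrease verbs, and the cleaned sample sentences are assumptions made for illustration.

```python
import re

# Sample sentences adapted from the 2024 excerpt above (sidebar fragments removed by hand)
sample = (
    "The Fuel component was the only component to decrease, falling by 7.1%. "
    "Overall costs in natural-gas heated buildings increased 3.8%."
)

def find_change(keyword, text):
    """Return the signed percentage that follows `keyword` within one sentence, or None."""
    # Let the gaps between keyword words match spaces or hyphens ("natural gas" / "natural-gas")
    flexible = r"[\s-]+".join(re.escape(word) for word in keyword.split())
    # [^.]*? stops the lazy match at the first period, so it stays inside one sentence
    pattern = rf"{flexible}([^.]*?)(\d{{1,3}}(?:\.\d+)?)%"
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(2))
    # Hypothetical heuristic: these verbs between keyword and number signal a decrease
    if re.search(r"\b(decreas\w*|fell|falling|declin\w*)\b", match.group(1), re.IGNORECASE):
        value = -value
    return value

print(find_change("Fuel", sample))         # -7.1
print(find_change("Natural gas", sample))  # 3.8
```

The period-based sentence boundary is a simplification: an abbreviation such as "No." between the keyword and the number would end the match early, so results should still be spot-checked against the PDF.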


## Phase 2: Iterate Through Multiple PDFs and Concatenate the Reports


This phase concatenates the percentage data from each year's report and tags every row with the year of its source PDF. The process involves the following steps:


- Iterate through each PDF: For each PDF file, extract the relevant percentages for the keywords.
- Add a Year Column: After extracting the data for a given PDF, add a column indicating the year (e.g., "2024").
- Convert the Dictionary to DataFrame: After processing each PDF, convert the dictionary of keyword-percentage pairs into a DataFrame.
- Concatenate DataFrames: After processing all the PDFs, combine the individual DataFrames into one final DataFrame.
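For the year step, the year can also be read from the file name instead of being inferred from the order of a list. The sketch below assumes every report follows the `YYYY-PIOC.pdf` naming used in `./pdfs/`. The helper names `year_from_filename` and `list_pioc_pdfs` are hypothetical, and `./pdfs/notes.pdf` is an invented path that only shows how a non-matching file is handled.

```python
import re
from pathlib import Path

def year_from_filename(pdf_path):
    """Return the four-digit year that starts a name like '2024-PIOC.pdf', or None."""
    match = re.match(r"(\d{4})-PIOC\.pdf$", Path(pdf_path).name, re.IGNORECASE)
    return int(match.group(1)) if match else None

def list_pioc_pdfs(folder):
    """Map year -> path for every matching PDF in `folder`, newest year first."""
    found = {}
    for path in Path(folder).glob("*.pdf"):
        year = year_from_filename(path)
        if year is not None:
            found[year] = str(path)
    return dict(sorted(found.items(), reverse=True))

# './pdfs/notes.pdf' is a made-up name used to show the None case
paths = ['./pdfs/2024-PIOC.pdf', './pdfs/2022-PIOC.pdf', './pdfs/notes.pdf']
years = [year_from_filename(p) for p in paths]
print(years)
```

With this approach a missing or reordered report does not shift the year assigned to the other files.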

In [60]:
import pdfplumber
import pandas as pd
import re

def extract_pioc_from_pdf(pdf_path, keywords, year):
    '''
    Extracts PIOC percentages associated with keyword variables.

    Args:
    - pdf_path (str): The path of the PDF file.
    - keywords (dict): A dictionary with keywords as keys and None as initial values.
      It is updated in place, so pass a copy to keep the template unchanged.
    - year (int): The year associated with the PDF file.

    Returns:
    - dict: The same dictionary with the matched percentages and a 'Year' entry.
    '''
    try:
        # The context manager closes the PDF even if extraction fails
        with pdfplumber.open(pdf_path) as pdf:
            # Iterate through each page, starting at page 3 (index 2)
            for page in pdf.pages[2:]:
                text = page.extract_text()
                if not text:  # Skip pages without extractable text
                    continue

                # Preprocess text: replace line breaks with spaces
                text = text.replace("\n", " ")

                for key in keywords:
                    if keywords[key] is None:  # Only search for keywords that haven't been found yet
                        # Keyword, then a lazy match up to the first percentage that follows it
                        pattern = rf"({re.escape(key)}.*?)(\d{{1,3}}(?:\.\d+)?%)"  # Captures the percentage

                        match = re.search(pattern, text, re.IGNORECASE)  # Case-insensitive search

                        if match:
                            # Store the percentage for the keyword
                            keywords[key] = match.group(2)

                # Stop iterating if all keywords have been found
                if all(value is not None for value in keywords.values()):
                    break

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

    # Record the report year, even when extraction failed partway
    keywords['Year'] = year
    return keywords



In [61]:
# Define a list of PDF file paths and years
pdf_paths = [
    './pdfs/2024-PIOC.pdf',
    './pdfs/2023-PIOC.pdf',
    './pdfs/2022-PIOC.pdf',
    './pdfs/2021-PIOC.pdf'
]

# Define keywords (with expected variations in phrasing)
keywords = {
    "Rent Stabilized Apartments increased": None,
    "Real estate taxes": None,
    "Insurance costs": None,
    "Administrative costs": None,
    "Maintenance": None,
    "Utilities": None,
    "Labor costs": None,
    "Fuel": None,
    "Natural gas": None,
    "Costs in Pre-1974 buildings": None,
    "Buildings that contain rent stabilized apartments": None
}

# Create an empty DataFrame to hold the final result
final_df = pd.DataFrame()

# Iterate through each PDF file and extract the percentages
for i, pdf_path in enumerate(pdf_paths):
    year = 2024 - i  # Assumes pdf_paths is ordered newest first, starting with 2024, with no gaps
    print(f"Processing: {pdf_path} for year {year}")
    
    # Copy the keyword dictionary to avoid overwriting data
    keywords_data = extract_pioc_from_pdf(pdf_path, keywords.copy(), year)
    
    # Convert the extracted data to a DataFrame and append to the final DataFrame
    df = pd.DataFrame([keywords_data])
    final_df = pd.concat([final_df, df], ignore_index=True)

# Print the final concatenated DataFrame
print(final_df)


Processing: ./pdfs/2024-PIOC.pdf for year 2024
Processing: ./pdfs/2023-PIOC.pdf for year 2023
Processing: ./pdfs/2022-PIOC.pdf for year 2022
Processing: ./pdfs/2021-PIOC.pdf for year 2021
  Rent Stabilized Apartments increased Real estate taxes Insurance costs  \
0                                 3.9%              3.2%           21.7%   
1                                 8.1%              7.7%           19.9%   
2                                 4.2%              3.7%           19.6%   
3                                 3.0%              3.9%           18.8%   

  Administrative costs Maintenance Utilities Labor costs   Fuel Natural gas  \
0                 4.6%        3.5%      1.3%        4.3%   7.1%        4.9%   
1                 2.9%        9.4%      8.8%        2.9%  19.9%        7.2%   
2                 3.7%        9.2%      5.8%        4.1%   4.3%        3.0%   
3                 0.7%        3.1%      2.1%        2.8%   1.6%        3.5%   

  Costs in Pre-1974 buildings  \
0 

In [62]:
final_df

| | Rent Stabilized Apartments increased | Real estate taxes | Insurance costs | Administrative costs | Maintenance | Utilities | Labor costs | Fuel | Natural gas | Costs in Pre-1974 buildings | Buildings that contain rent stabilized apartments | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.9% | 3.2% | 21.7% | 4.6% | 3.5% | 1.3% | 4.3% | 7.1% | 4.9% | 3.6% | 3.9% | 2024 |
| 1 | 8.1% | 7.7% | 19.9% | 2.9% | 9.4% | 8.8% | 2.9% | 19.9% | 7.2% | | 8.1% | 2023 |
| 2 | 4.2% | 3.7% | 19.6% | 3.7% | 9.2% | 5.8% | 4.1% | 4.3% | 3.0% | | 4.2% | 2022 |
| 3 | 3.0% | 3.9% | 18.8% | 0.7% | 3.1% | 2.1% | 2.8% | 1.6% | 3.5% | | 3.0% | 2021 |
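The values in `final_df` are strings such as `'3.9%'`, so they need converting before any arithmetic or plotting. The sketch below shows one way to do that on plain dictionaries, and it also lists which keywords found no match in each year. It assumes the empty cells in the output correspond to `None`; the helper name `percent_to_float` is hypothetical, and only two of the value columns are copied here to keep the example short.

```python
def percent_to_float(value):
    """Convert a string such as '3.9%' to 3.9; pass None through unchanged."""
    if value is None:
        return None
    return float(value.strip().rstrip("%"))

# Three rows abridged from the output above (only two value columns are kept)
rows = [
    {"Real estate taxes": "3.2%", "Costs in Pre-1974 buildings": "3.6%", "Year": 2024},
    {"Real estate taxes": "7.7%", "Costs in Pre-1974 buildings": None, "Year": 2023},
    {"Real estate taxes": "3.7%", "Costs in Pre-1974 buildings": None, "Year": 2022},
]

# Convert every value except the year
numeric_rows = [
    {key: (value if key == "Year" else percent_to_float(value)) for key, value in row.items()}
    for row in rows
]

# Keywords with no match, grouped by year, so gaps are easy to review
missing = {
    row["Year"]: [key for key, value in row.items() if value is None]
    for row in numeric_rows
}

print(numeric_rows[0])
print(missing)
```

The converted rows can be passed to `pd.DataFrame(numeric_rows)` in the same way the notebook builds its per-year DataFrames.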


<hr>

## Phase 2: Iterating Through Pages to Find PIOC Percentages

Overview:
In this phase, we expand the search to every page from page 3 onward. Each keyword is searched only until its first match, and the page scan stops once every keyword has a percentage, which avoids scanning the remaining pages.

In [45]:
for page_num in range(2, len(pdf.pages)):
    # Access the page and extract text
    page = pdf.pages[page_num]
    text = page.extract_text()
    if not text:  # Skip pages without extractable text
        continue

    # Preprocess text: Remove extra line breaks to keep sentences intact
    clean_text = re.sub(r'\n+', ' ', text)

    # Iterate through keywords and find percentages
    for key in keywords.keys():
        if keywords[key] is None:  # Only search for keywords that haven't been found yet
            # Keyword, then a lazy match up to the first percentage that follows it
            pattern = rf"({re.escape(key)}.*?)(\d{{1,3}}(?:\.\d+)?%)"  # Captures the percentage

            match = re.search(pattern, clean_text, re.IGNORECASE)  # Case-insensitive search

            if match:
                # Store the percentage for the keyword
                keywords[key] = match.group(2)

    # Stop scanning pages once every keyword has a value
    # (this check sits at the page-loop level so that break exits the page loop)
    if all(value is not None for value in keywords.values()):
        break
            
print("Final Output:\n\n", keywords)
    
    

Final Output:

 {'Rent Stabilized Apartments increased': '3.9%', 'Real estate taxes': '3.2%', 'Insurance costs': '21.7%', 'Administrative costs': '4.6%', 'Maintenance': '3.5%', 'Utilities': '1.3%', 'Labor costs': '4.3%', 'Fuel': '7.1%', 'Natural gas & fuel oil heat': None, 'Costs in Pre-1974 buildings': '3.6%', 'Buildings that contain rent stabilized apartments': '3.9%'}
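The early-stop behaviour can be checked without opening a PDF by running the same search over plain strings. The following sketch assumes each string stands in for one page's extracted text. The function name `search_pages` is hypothetical, the first sample sentence is adapted from the 2024 excerpt, and the other two page strings are invented for the demonstration.

```python
import re

def search_pages(page_texts, keys):
    """Scan page strings in order; return (results, number of pages scanned)."""
    results = {key: None for key in keys}
    pages_scanned = 0
    for text in page_texts:
        pages_scanned += 1
        # Collapse line breaks; treat a page with no text as an empty string
        clean_text = re.sub(r"\n+", " ", text or "")
        for key in keys:
            if results[key] is None:  # Skip keywords that already have a value
                pattern = rf"{re.escape(key)}.*?(\d{{1,3}}(?:\.\d+)?%)"
                match = re.search(pattern, clean_text, re.IGNORECASE)
                if match:
                    results[key] = match.group(1)
        # Checked once per page, so break leaves the page loop
        if all(value is not None for value in results.values()):
            break
    return results, pages_scanned

pages = [
    "Real estate taxes rose by 3.2% primarily due to a rise in the tax rate.",
    "The Utilities component\nincreased by 1.3%.",
    "This page is never reached.",
]
results, scanned = search_pages(pages, ["Real estate taxes", "Utilities"])
print(results, scanned)
```

Here both keywords are resolved by the second page, so the third page is not scanned.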
