## Policy Definition for Louisville

For this analysis, we're going to apply a **full land value tax (LVT) shift** to the **Jefferson County Public Schools (JCPS) tax**. This is the heftiest property tax in Jefferson County, and thus has the highest impact potential from a shift. It also applies to all non-exempt properties in Jefferson County, and its millage rate is capped in state law but not in the constitution. We will ensure that the existing **homestead exemption** continues to apply (but will experiment with how it is applied). And, to help navigate politics and ensure we craft a tax that fits with the constitution, we will also vary how agricultural land is treated.

We are also going to present **six different scenarios**:
1. The new land value tax applies to all parcels, and exemptions apply to improvements first
2. The new land value tax applies to all parcels besides those with current_use of Agricultural with Dwelling and Vacant Agricultural Land (with the latter two groups of parcels maintaining the old uniform property tax), and exemptions apply to improvements first
3. The new land value tax applies to all parcels besides Agricultural with Dwelling, and exemptions apply to improvements first
4. The new land value tax applies to all parcels, and exemptions apply fully to land
5. The new land value tax applies to all parcels besides those with current_use of Agricultural with Dwelling and Vacant Agricultural Land (with the latter two groups of parcels maintaining the old uniform property tax), and exemptions apply fully to land (outside the uniform parcels)
6. The new land value tax applies to all parcels besides Agricultural with Dwelling, and exemptions apply fully to land (outside the uniform parcels)

These scenarios' resulting columns will be prefixed with sce1_, sce2_, sce3_, etc., respectively.

To filter these parcels, we will create six new dataframes, each representing one of the scenarios. Holding the taxes of the agricultural parcels steady will be handled by a new uniform_parcel column that will be added to the dataframes that require it. This column tells the function to count the property's current tax revenue toward the total used to calculate the millage rate, but to simply set its new_tax equal to its current_tax.

To determine how a parcel's exemption will apply, the apply_exemption_to_land flag will be set on the function, determining its behavior with exemption application.

Once the analysis has been performed on all six dataframes, they will be recombined into a master dataframe.
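As a rough sketch of the recombination step, each scenario's result columns can be given their `sceN_` prefix and concatenated onto the base dataframe. The toy frames, the `parcel_id` column, and the choice of `new_tax` as the only result column are assumptions for illustration. The sketch also assumes every scenario dataframe keeps the same index as the base dataframe.

```python
import pandas as pd

# Toy stand-ins for the base dataframe and two scenario results (values are made up)
base = pd.DataFrame({"parcel_id": [1, 2], "current_tax": [1000.0, 200.0]})
scenario_results = {
    1: pd.DataFrame({"new_tax": [700.0, 900.0]}),
    2: pd.DataFrame({"new_tax": [720.0, 880.0]}),
}

# Columns to carry over from each scenario (assumed; extend as needed)
result_cols = ["new_tax"]

# Prefix each scenario's columns with sce1_, sce2_, ... and align on the shared index
prefixed = [
    res[result_cols].add_prefix(f"sce{n}_") for n, res in scenario_results.items()
]
master = pd.concat([base] + prefixed, axis=1)
print(master.columns.tolist())
```

Because the scenario dataframes are copies of the same source, an index-aligned concat avoids a key-based merge. If any step reorders or filters rows, joining on a parcel identifier would be safer.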

calculate_current_tax() will return a copy of its input dataframe with one new column:
1. current_tax

model_lvt_shift() will return a copy of its input dataframe with four new columns:
1. taxable_value
2. new_tax
3. new_tax_change
4. new_tax_pct

calculate_current_tax() will be run first to attach current_tax to the source dataframe, and then the six derivative dataframes for each scenario will be split off. Each will have model_lvt_shift() run on it, and then they will be recombined and uploaded to the database.
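To make the change columns concrete, here is a minimal sketch of how `new_tax_change` and `new_tax_pct` could be derived. It assumes they follow the same arithmetic as the commented-out `TAX_CHANGE` lines later in this notebook; the actual `model_lvt_shift()` implementation may differ, and the toy values are made up.

```python
import pandas as pd

# Toy parcels; values are made up for illustration
toy = pd.DataFrame({
    "current_tax": [1000.0, 200.0, 0.0],
    "new_tax": [700.0, 900.0, 0.0],
})

# Dollar change: positive means the parcel pays more under the new tax
toy["new_tax_change"] = toy["new_tax"] - toy["current_tax"]

# Percent change; parcels with no current tax get NaN instead of a divide-by-zero
nonzero_current = toy["current_tax"].where(toy["current_tax"] != 0)
toy["new_tax_pct"] = toy["new_tax_change"] / nonzero_current * 100
print(toy)
```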


In [1]:
import sys
import pandas as pd
import geopandas as gpd
import numpy as np
import os
sys.path.append('..')  # Add parent directory to path
from dotenv import load_dotenv
from cloud_utils import get_feature_data, get_feature_data_with_geometry
from lvt_utils import model_split_rate_tax, model_full_building_abatement, calculate_current_tax, model_full_building_abatement_on_subsection, model_lvt_shift
from census_utils import get_census_data, get_census_blockgroups_shapefile, get_census_data_with_boundaries, match_to_census_blockgroups
from sqlalchemy import create_engine
from geoalchemy2 import Geometry

TABLE_NAME = "parcels"

load_dotenv()

True

## Step 1: Getting the Data

The first step in modeling an LVT shift is obtaining property tax data. Most counties make this information publicly accessible through open data portals or GIS systems.

For St. Joseph County (which includes South Bend), we can access parcel data through their ArcGIS services. The base URL below provides access to various property datasets including:

- **Parcel_Civic**: Main parcel dataset with tax information, property types, and assessed values
- **parcel_boundaries**: Geographic boundaries for spatial analysis

### Key Data Elements We Need:
- **Full Market Value (FMV)**: Total assessed property value
- **Land Value**: Assessed value of land only  
- **Improvement Value**: Assessed value of buildings/structures
- **Exemption amounts**: Various tax exemptions applied
- **Property characteristics**: Type, location, tax district

Let's fetch the main parcel dataset:


In [2]:
# Fetch the main parcel dataset with tax info
gdf = gpd.read_parquet("data/louisville/universe.parquet")

## Step 2: Recreating Current Property Tax Revenue

Before we can model an LVT shift, we need to accurately recreate the current property tax system. This validation step ensures our dataset correctly reflects the real-world tax landscape.

### Key Components:
- **Millage Rate**: \$7.35 per \$1,000 in assessed value (from 2024 JCPS budget)
- **Exemptions**: Various exemptions that reduce taxable value
- **Exempt Properties**: Fully exempt properties (marked in `is_fully_exempt`)

### The Process:
1. Calculate total exemptions from all exemption amount fields
2. Identify fully exempt properties  
3. Calculate taxable value (land + improvements - exemptions)
4. Apply millage rate to get current tax liability
5. Verify total revenue matches published budget expectations (~$700 million)

Steps 1-3 are currently handled in another script.

This step is crucial - if we can't accurately recreate current taxes, our LVT projections won't be reliable.
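The process above can be sketched on toy data as follows. `sketch_current_tax` is a hypothetical stand-in, not the `calculate_current_tax()` function from `lvt_utils`, and the parcel values are made up. Only the column names and the 7.35 millage rate come from this notebook.

```python
import pandas as pd

def sketch_current_tax(df, value_col, exemption_col, exempt_flag_col, millage_rate):
    """Hypothetical stand-in for calculate_current_tax(); not the lvt_utils code."""
    out = df.copy()
    # Taxable value is assessed value minus exemptions, floored at zero
    taxable = (out[value_col] - out[exemption_col]).clip(lower=0)
    # Fully exempt parcels pay nothing
    taxable = taxable.where(~out[exempt_flag_col].astype(bool), 0)
    # Millage is dollars of tax per $1,000 of taxable value
    out["current_tax"] = taxable * millage_rate / 1000
    return out["current_tax"].sum(), out

toy = pd.DataFrame({
    "assessor_total_value": [200_000, 150_000, 500_000],
    "value_exempt": [0, 40_000, 0],
    "is_fully_exempt": [0, 0, 1],
})
revenue, toy = sketch_current_tax(
    toy, "assessor_total_value", "value_exempt", "is_fully_exempt", 7.35
)
print(f"Toy revenue: ${revenue:,.2f}")
```

Comparing the summed result against the published budget figure is the validation step: a large gap usually points to a missing exemption field or a misflagged exempt parcel.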


In [3]:

millage_rate = 7.35
gdf['millage_rate'] = millage_rate

# 1. Calculate current tax
current_revenue, second_revenue, gdf = calculate_current_tax(
    df=gdf, 
    tax_value_col='assessor_total_value',
    millage_rate_col='millage_rate',
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt'
)


print(f"Total number of properties: {len(gdf):,}")
print(f"Current annual revenue with ${millage_rate}/1000 millage rate: ${current_revenue:,.2f}")
print(f"Total land value: ${gdf['assessor_land_value'].sum():,.2f}")
print(f"Total improvement value: ${gdf['assessor_improved_value'].sum():,.2f}")



Total current tax revenue: $692,629,764.45
Total number of properties: 290,783
Current annual revenue with $7.35/1000 millage rate: $692,629,764.45
Total land value: $22,785,781,523.00
Total improvement value: $78,522,651,665.00


## Step 3: Setup Scenario Dataframes

In [4]:
# Since dataframes 4-6 are copies of 1-3, let's create 1-3 first and then make our copies
scenario_0_gdf = gdf.copy()
scenario_1_gdf = gdf.copy()
scenario_2_gdf = gdf.copy()
scenario_3_gdf = gdf.copy()

In [5]:
# Create uniform property column on the dataframes that need it, with the scenario's specified uniform rows set to 1
scenario_2_gdf["uniform_parcel_col"] = np.where((scenario_2_gdf["current_use"] == 'Agricultural With Dwelling') | (scenario_2_gdf["current_use"] == 'Agricultural Vacant Land'), 1, 0)
scenario_3_gdf["uniform_parcel_col"] = np.where((scenario_3_gdf["current_use"] == 'Agricultural With Dwelling'), 1, 0)

In [7]:
scenario_4_gdf = scenario_1_gdf.copy()
scenario_5_gdf = scenario_2_gdf.copy()
scenario_6_gdf = scenario_3_gdf.copy()

## Step 4: Modeling the Land Value Tax

Now for the exciting part - modeling the LVT shift! We'll trial six revenue-neutral policies that shift the full JCPS property tax to land, but test out different outcomes from specific exemptions.

### The Ideal LVT Formula

Under our ideal proposed system: 
- **Land** takes on the full weight of the property tax
- **Total revenue** remains the same as current system

The formula to solve for the land millage rate is:
```
Current Revenue = Land Millage × Total Taxable Land
```
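Rearranged, the revenue-neutral land millage is current revenue divided by total taxable land, scaled to dollars per \$1,000. A small worked example with made-up numbers:

```python
import pandas as pd

# Made-up revenue target and taxable land values for three parcels
current_revenue = 7_350.0
taxable_land = pd.Series([100_000, 50_000, 50_000])

# Solve Current Revenue = Land Millage x Total Taxable Land for the millage
land_millage = current_revenue / taxable_land.sum() * 1000

# Applying that millage back to each parcel recovers the revenue target
new_tax = taxable_land * land_millage / 1000
print(f"Land millage: {land_millage:.4f}")
print(f"New revenue: ${new_tax.sum():,.2f}")
```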

### Agricultural Carve Outs

We have two different potential agricultural carve out scenarios, but they are handled by the same formula:

```
Current Revenue = (New Land Millage × Total Taxable Non-Uniform Land) + (Current Uniform Millage × Total Taxable Uniform Property)
```
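In practice the second term is just the current tax already paid by the uniform parcels, so the land millage is solved against the remaining revenue. The sketch below uses made-up parcels and a `taxable_land` column assumed for illustration; it is not the `model_lvt_shift()` implementation.

```python
import pandas as pd

toy = pd.DataFrame({
    "taxable_land": [100_000, 50_000, 80_000],
    "current_tax": [2_000.0, 1_500.0, 500.0],
    "uniform_parcel_col": [0, 0, 1],  # 1 = keeps the old uniform property tax
})

current_revenue = toy["current_tax"].sum()
is_uniform = toy["uniform_parcel_col"] == 1

# Revenue the uniform parcels keep paying under the old system
uniform_revenue = toy.loc[is_uniform, "current_tax"].sum()

# The remaining revenue must be raised from non-uniform land only
non_uniform_land = toy.loc[~is_uniform, "taxable_land"].sum()
land_millage = (current_revenue - uniform_revenue) / non_uniform_land * 1000

# Non-uniform parcels pay the land tax; uniform parcels keep their current tax
land_tax = toy["taxable_land"] * land_millage / 1000
toy["new_tax"] = land_tax.where(~is_uniform, toy["current_tax"])
print(toy)
```

Every carve-out shrinks the land base that carries the shifted revenue, which is why the carve-out scenarios in this notebook report slightly higher millage rates than their no-carve-out counterparts.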

### Handling Exemptions

The ideal way to handle the constitutionally mandated homestead exemption is to:
1. Apply the exemption to building value first
2. If the exemption exceeds building value, apply remainder to land value
3. Calculate the tax using one of the above formulas depending on the scenario

This method ensures properties don't over-benefit from exemptions and maintains the intent of existing tax policy.

But if politics determines that the burden of an LVT would still be too much for the individuals the homestead exemption is supposed to protect, we could also apply the exemption directly to the land value of all non-uniform properties. This really dampens the impact of an LVT, and will likely require a modification in the dollar amount of the exemption to ensure it's not a blanket handout to those 65+.
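The two exemption treatments can be compared side by side. This is a sketch of the logic described above, not the code behind the `apply_exemption_to_land` flag in `lvt_utils`. The helper name `taxable_land` and the parcel values are made up for illustration.

```python
import pandas as pd

def taxable_land(df, apply_exemption_to_land):
    """Hypothetical helper: taxable land value under either exemption treatment."""
    if apply_exemption_to_land:
        # The whole exemption comes off the land value
        remaining = df["value_exempt"]
    else:
        # The exemption absorbs improvement value first; only the leftover hits land
        remaining = (df["value_exempt"] - df["assessor_improved_value"]).clip(lower=0)
    return (df["assessor_land_value"] - remaining).clip(lower=0)

toy = pd.DataFrame({
    "assessor_land_value": [30_000, 30_000],
    "assessor_improved_value": [100_000, 20_000],
    "value_exempt": [40_000, 40_000],
})

improvements_first = taxable_land(toy, apply_exemption_to_land=False)
land_only = taxable_land(toy, apply_exemption_to_land=True)
print(improvements_first.tolist(), land_only.tolist())
```

Applying exemptions to land removes more value from the tax base, so the revenue-neutral millage has to rise. That is consistent with the results below, where Scenario 4 reports a millage of 36.5382 against 33.0469 for Scenario 1.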

#### Scenario 0

In [8]:
# Scenario 0, base building abatement for comparison
millage_rate, new_revenue, scenario_0_gdf = model_full_building_abatement(
    df=scenario_0_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    abatement_percentage=1.0,  # Full shift
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt'
)
# # Calculate tax changes manually since they're not being added by the function
# scenario_0_gdf['NEW_TAX'] = (scenario_0_gdf['assessor_land_value'] * millage_rate/1000)
# scenario_0_gdf['TAX_CHANGE'] = scenario_0_gdf['new_tax'] - scenario_0_gdf['current_tax']
# scenario_0_gdf['TAX_CHANGE_PCT'] = (scenario_0_gdf['TAX_CHANGE'] / scenario_0_gdf['current_tax']) * 100

scenario_0_gdf.to_parquet("data/louisville/scenario_0.parquet")

Building abatement model (100.0% abatement)
Millage rate: 36.5382
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $-0.00 (-0.0000%)

Building Abatement (100.0%) Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling            $-102,281,429          230437           $-444             $-446              -31.7%                -32.5%      $440,225,631  $337,944,202                     13.6%                     72.3%               -23.2%
               Res Vacant Land              $15,881,014           14256          $1,114              $245              392.7%                397.1%        $4,083,666   $19,964,680                     99.1%                      0.4%               388.9%
       Ex

#### Scenario 1

In [9]:
# Scenario 1: Full LVT shift, no uniform parcels for ag, exemptions apply to improvements first
millage_rate, new_revenue, scenario_1_gdf = model_lvt_shift(
    df=scenario_1_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    apply_exemption_to_land=False,
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt'
)

# # Calculate tax changes manually since they're not being added by the function
# df['NEW_TAX'] = (df['assessor_land_value'] * millage_rate/1000)
# df['TAX_CHANGE'] = df['new_tax'] - df['current_tax']
# df['TAX_CHANGE_PCT'] = (df['TAX_CHANGE'] / df['current_tax']) * 100

scenario_1_gdf.to_parquet("data/louisville/scenario_1.parquet")

Total number of parcels: 290783
Millage rate: 33.0469
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $-0.00 (-0.0000%)

Full LVT Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling             $-69,084,587          230437           $-300             $-338              -18.1%                -24.4%      $440,225,631  $371,141,044                     16.7%                     67.0%               -15.7%
               Res Vacant Land              $13,964,245           14256            $980              $216              345.7%                349.6%        $4,083,666   $18,047,911                     99.1%                      0.4%               342.0%
       Exempt Metro Government          

#### Scenario 2

In [10]:
# Scenario 2: Full LVT shift, uniform parcels for vacant and improved ag, exemptions apply to improvements first
millage_rate, new_revenue, scenario_2_gdf = model_lvt_shift(
    df=scenario_2_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    apply_exemption_to_land=False,
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt',
    uniform_parcel_col='uniform_parcel_col',
)

scenario_2_gdf.to_parquet("data/louisville/scenario_2.parquet")

Total number of parcels: 290783
Millage rate: 33.5799
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $0.00 (0.0000%)

Full LVT Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling             $-63,098,544          230437           $-274             $-320              -16.7%                -23.2%      $440,225,631  $377,127,087                     17.7%                     65.4%               -14.3%
               Res Vacant Land              $14,255,335           14256          $1,000              $220              352.9%                356.9%        $4,083,666   $18,339,001                     99.1%                      0.4%               349.1%
       Exempt Metro Government            

#### Scenario 3

In [11]:
# Scenario 3: Full LVT shift, uniform parcels for improved ag, exemptions apply to improvements first
millage_rate, new_revenue, scenario_3_gdf = model_lvt_shift(
    df=scenario_3_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    apply_exemption_to_land=False,
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt',
    uniform_parcel_col='uniform_parcel_col'
)

scenario_3_gdf.to_parquet("data/louisville/scenario_3.parquet")

Total number of parcels: 290783
Millage rate: 33.3171
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $0.00 (0.0000%)

Full LVT Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling             $-66,049,579          230437           $-287             $-329              -17.4%                -23.8%      $440,225,631  $374,176,052                     17.2%                     66.1%               -15.0%
               Res Vacant Land              $14,111,831           14256            $990              $218              349.3%                353.3%        $4,083,666   $18,195,498                     99.1%                      0.4%               345.6%
       Exempt Metro Government            

#### Scenario 4

In [12]:
# Scenario 4: Full LVT shift, no uniform parcels for ag, exemptions apply only to land
millage_rate, new_revenue, scenario_4_gdf = model_lvt_shift(
    df=scenario_4_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    apply_exemption_to_land=True,
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt'
)

scenario_4_gdf.to_parquet("data/louisville/scenario_4.parquet")

Total number of parcels: 290783
Millage rate: 36.5382
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $-0.00 (-0.0000%)

Full LVT Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling            $-102,281,429          230437           $-444             $-446              -31.7%                -32.5%      $440,225,631  $337,944,202                     13.6%                     72.3%               -23.2%
               Res Vacant Land              $15,881,014           14256          $1,114              $245              392.7%                397.1%        $4,083,666   $19,964,680                     99.1%                      0.4%               388.9%
       Exempt Metro Government          

#### Scenario 5

In [13]:
# Scenario 5: Full LVT shift, uniform parcels for vacant and improved ag, exemptions apply only to land
millage_rate, new_revenue, scenario_5_gdf = model_lvt_shift(
    df=scenario_5_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    apply_exemption_to_land=True,
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt',
    uniform_parcel_col='uniform_parcel_col'
)

scenario_5_gdf.to_parquet("data/louisville/scenario_5.parquet")

Total number of parcels: 290783
Millage rate: 37.2007
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $0.00 (0.0000%)

Full LVT Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling             $-96,153,727          230437           $-417             $-426              -30.5%                -31.3%      $440,225,631  $344,071,904                     14.6%                     70.9%               -21.8%
               Res Vacant Land              $16,243,019           14256          $1,139              $251              401.6%                406.1%        $4,083,666   $20,326,685                     99.1%                      0.4%               397.8%
       Exempt Metro Government            

#### Scenario 6

In [14]:
# Scenario 6: Full LVT shift, uniform parcels for improved ag, exemptions apply only to land
millage_rate, new_revenue, scenario_6_gdf = model_lvt_shift(
    df=scenario_6_gdf,
    land_value_col='assessor_land_value',
    improvement_value_col='assessor_improved_value',
    current_revenue=current_revenue,
    apply_exemption_to_land=True,
    exemption_col='value_exempt',
    exemption_flag_col='is_fully_exempt',
    uniform_parcel_col='uniform_parcel_col'
)

scenario_6_gdf.to_parquet("data/louisville/scenario_6.parquet")

Total number of parcels: 290783
Millage rate: 36.8698
Total tax revenue: $692,629,764.45
Target revenue: $692,629,764.45
Revenue difference: $0.00 (0.0000%)

Full LVT Tax Change by Property Category
                   current_use total_tax_change_dollars  property_count mean_tax_change median_tax_change mean_tax_change_pct median_tax_change_pct total_current_tax total_new_tax pct_increase_gt_threshold pct_decrease_gt_threshold total_tax_change_pct
         Res 1 Family Dwelling             $-99,214,460          230437           $-431             $-436              -31.1%                -31.9%      $440,225,631  $341,011,171                     14.1%                     71.6%               -22.5%
               Res Vacant Land              $16,062,201           14256          $1,127              $248              397.1%                401.6%        $4,083,666   $20,145,867                     99.1%                      0.4%               393.3%
       Exempt Metro Government            

## Step 5: Upload to Remote Database

In [None]:
# Combine into a single dataframe with the relevant five attributes being appended, each with the prefix indicating their scenario
master_gdf = 

In [None]:
print("Uploading to PostGIS (this may take a while)...")

db_conn = os.getenv("DB_CONN")

engine = create_engine(db_conn)  # Use the DB_CONN connection string from .env rather than hardcoded credentials

print("Connected to server")

master_gdf.to_postgis(
    name=TABLE_NAME,
    con=engine,
    if_exists="replace",  # Use 'append' if adding more data later
    index=False,
    chunksize=10000,
    dtype={
        # We explicitly tell PostGIS: "This column holds 4326 Geometries"
        # We use 'GEOMETRY' to support mixed types (Polygons + MultiPolygons)
        master_gdf.geometry.name: Geometry(geometry_type='GEOMETRY', srid=4326)
    }
)

print("Success! Data uploaded.")

## Step 5: Understanding Property Types and Impacts

With our LVT scenarios calculated, we can now analyze which property types are most affected. Understanding the distribution of tax impacts across different property categories is crucial for policy makers and stakeholders.

### Property Type Analysis

We'll examine how the tax burden shifts across:
- **Residential properties** (single-family, multi-family, condos)
- **Commercial properties** (retail, office, industrial)  
- **Vacant land** (often sees largest increases under LVT)
- **Exempt properties** (government, religious, charitable)

### Key Metrics to Track:
- **Count**: Number of properties in each category
- **Median tax change**: Typical impact (less affected by outliers)
- **Average percentage change**: Overall magnitude of impact
- **Percentage with increases**: How many properties see tax increases

This analysis helps identify which sectors benefit from the LVT shift (typically developed properties) and which see increased burden (typically land-intensive properties with low improvement ratios).


In [None]:
# For each column, show top 10 most common values and their counts
columns_to_analyze = ['neighborhood', 'market_area', 'property_type', 'zoning', 'current_use', 
                     'building_year_built']

for col in columns_to_analyze:
    print(f"\nTop 10 values for {col}:")
    value_counts = df[col].value_counts().head(10)
    print(value_counts)
    print(f"Total unique values: {df[col].nunique()}")
    print("-" * 50)

# Let's also look at some basic statistics about these groups
print("\nMedian tax changes by various groupings:")

for col in ['neighborhood', 'market_area', 'property_type', 'zoning', 'current_use']:
    print(f"\nMedian tax change by {col}:")
    median_changes = df.groupby(col)['TAX_CHANGE'].agg([
        'count',
        'median',
        lambda x: (x > 0).mean() * 100  # Percentage with increase
    ]).round(2)
    median_changes.columns = ['Count', 'Median Change ($)', '% With Increase']
    print(median_changes.sort_values('Count', ascending=False).head(30))

In [None]:
# Improved property type categorization for South Bend data

def categorize_property_type(prop_type):
    # Defensive: handle missing/nulls
    if not isinstance(prop_type, str):
        return "Other"

    # Normalize for matching
    p = prop_type.strip().upper()

    # Single Family
    if p.startswith("1 FAMILY DWELL - PLATTED LOT"):
        return "Single Family"
    elif p.startswith("1 FAMILY DWELL - UNPLATTED (0 TO 9.99 ACRES)"):
        return "Single Family - Unplatted Small Acreage"
    elif p.startswith("1 FAMILY DWELL - UNPLATTED (10 TO 19.99 ACRES)"):
        return "Single Family - Unplatted Medium Acreage"
    elif p.startswith("1 FAMILY DWELL - UNPLATTED"):
        return "Single Family - Unplatted Large Acreage"

    # Multi-Family
    if "2 FAMILY" in p or "3 FAMILY" in p or "4 TO 19 FAMILY" in p:
        return "Small Multi-Family (2-19 units)"
    if "20 TO 39 FAMILY" in p or "40 OR MORE FAMILY" in p:
        return "Large Multi-Family (20+ units)"

    # Condos
    if "CONDOMINIUM" in p:
        return "Condominiums"

    # Mobile/Manufactured
    if "MOBILE" in p or "MANUFACTURED" in p:
        return "Mobile/Manufactured Homes"

    # Retail
    if any(x in p for x in [
        "RETAIL", "SHOP", "STORE", "MARKET", "DEPARTMENT", "SHOPPING CENTER",
        "SUPERMARKET", "DISCOUNT AND JUNIOR DEPARTMENT STORE", "COMMUNITY SHOPPING CENTER",
        "NEIGHBORHOOD SHOPPING CENTER", "REGIONAL SHOPPING CENTER", "FULL LINE DEPARTMENT STORE"
    ]):
        return "Retail Commercial"

    # Office
    if any(x in p for x in [
        "OFFICE", "MEDICAL CLINIC", "BANK", "SAVING & LOANS", "DRIVE-UP/WALK-UP ONLY BANK",
        "OFFICE BLDG", "INDUSTRIAL OFFICE"
    ]):
        return "Office Commercial"

    # Food/Hospitality
    if any(x in p for x in [
        "RESTAURANT", "BAR", "HOTEL", "MOTEL", "FOOD", "CAFETERIA", "CONVENIENCE MARKET"
    ]):
        return "Food/Hospitality"

    # Industrial
    if any(x in p for x in [
        "INDUSTRIAL", "MANUFACTURING", "WAREHOUSE", "ASSEMBLY", "FACTORY", "FOUNDRIES",
        "TRUCK TERMINAL", "COMMERCIAL MINI-WAREHOUSE", "COMMERCIAL TRUCK TERMINAL",
        "MEDIUM MANUFACTURING", "LIGHT MANUFACTURING", "RESEARCH & DEVELOPMENT FACILITY"
    ]):
        return "Industrial"

    # Vacant Land
    if p.startswith("VACANT") or "VACANT LAND" in p:
        return "Vacant Land"

    # Parking
    if "PARKING" in p:
        return "Parking"

    # Government
    if any(x in p for x in [
        "EXEMPT, MUNICIPALITY", "EXEMPT, COUNTY", "EXEMPT, STATE OF INDIANA",
        "EXEMPT, UNITED STATES OF AMERICA", "EXEMPT, BOARD OF EDUCATION",
        "EXEMPT, TOWNSHIP", "EXEMPT PROPERTY OWNED BY A MUNICIPAL HOUSING AUTHORITY",
        "EXEMPT PROPERTY OWNED BY A PUBLIC LIBRARY", "EXEMPT, PARK DISTRICT"
    ]):
        return "Government"

    # Religious
    if any(x in p for x in [
        "EXEMPT, RELIGIOUS ORGANIZATION", "EXEMPT, CHURCH", "EXEMPT, CHAPEL",
        "EXEMPT, MOSQUE", "EXEMPT, SYNAGOGUE", "EXEMPT, TEMPLE",
        "EXEMPT, CHURCH, CHAPEL, MOSQUE, SYNAGOGUE, TABERNACLE, OR TEMPLE",
        "EXEMPT, CEMETERY ORGANIZATION"
    ]):
        return "Religious"

    # Charitable
    if "EXEMPT, CHARITABLE ORGANIZATION" in p:
        return "Charitable"

    # Agricultural
    if any(x in p for x in [
        "FARM", "AGRICULTURAL", "GRAIN", "LIVESTOCK", "DAIRY", "NURSERY", "POULTRY", "GREENHOUSES"
    ]):
        return "Agricultural"

    # Special: Nursing, Medical, Funeral, etc.
    if any(x in p for x in [
        "NURSING HOME", "PRIVATE HOSPITAL", "FUNERAL HOME"
    ]):
        return "Health/Institutional"

    # Utility/Infrastructure
    if any(x in p for x in [
        "UTILITY", "RAILROAD", "TELEPHONE", "TELEGRAPH", "CABLE COMPANY", "POWER", "LIGHT, HEAT"
    ]):
        return "Utility/Infrastructure"

    # If still not matched, try broad fallback
    if "EXEMPT" in p:
        return "Other Exempt"
    if "COMMERCIAL" in p:
        return "Other Commercial"
    if "RESIDENTIAL" in p:
        return "Other Residential"

    return "Other"

# Apply the function to the DataFrame
df['PROPERTY_CATEGORY'] = df['PROPTYPE'].apply(categorize_property_type)

# Print unique property categories
print("Unique PROPERTY_CATEGORY values:")
print(df['PROPERTY_CATEGORY'].unique())

# Print unique PROPTYPE values
print("\nUnique PROPTYPE values:")
print(df['PROPTYPE'].unique())


### Creating Detailed Property Categories

To better understand impacts, we'll create a detailed property categorization system that groups similar property types together. This makes the analysis more meaningful and interpretable.

The `categorize_property_type` function categorizes properties into groups like:
- **Single Family** (with subcategories by lot size)
- **Multi-Family** (small vs. large)
- **Commercial** (by type: retail, office, industrial)
- **Exempt** (by type: government, religious, charitable)

This categorization helps us understand not just that "residential" properties are affected, but specifically which types of residential properties see the biggest changes.


In [None]:
# Create a summary DataFrame grouped by PROPERTY_CATEGORY
proptype_analysis = df.groupby('PROPERTY_CATEGORY').agg({
    'TAX_CHANGE_PCT': 'mean',  # Average percentage change
    'TAX_CHANGE': 'median',    # Median dollar change
    'PARCELID': 'count'        # Count of properties
}).round(2)

# Add percentage that increase
proptype_increases = df.groupby('PROPERTY_CATEGORY').agg({
    'TAX_CHANGE': lambda x: (x > 0).mean() * 100  # Percentage with increase
}).round(2)

proptype_analysis['Percent_Increased'] = proptype_increases['TAX_CHANGE']

# Rename columns for clarity
proptype_analysis.columns = [
    'Avg_Pct_Change',
    'Median_Dollar_Change',
    'Property_Count',
    'Pct_Properties_Increased'
]

# Sort by count of properties (descending)
proptype_analysis = proptype_analysis.sort_values('Property_Count', ascending=False)

# Print results
print("Analysis by Property Type:\n")
print("Note: All monetary values in dollars, percentages shown as %\n")
print(proptype_analysis.to_string())

# Print some summary statistics
print("\nOverall Summary:")
print(f"Total properties analyzed: {proptype_analysis['Property_Count'].sum():,}")
print(f"Overall median dollar change: ${df['TAX_CHANGE'].median():,.2f}")
print(f"Overall average percent change: {df['TAX_CHANGE_PCT'].mean():.2f}%")
print(f"Overall percent of properties with increase: {(df['TAX_CHANGE'] > 0).mean()*100:.2f}%")

### Summary of Tax Impacts by Property Category

Now we can see the clear patterns of how different property types are affected by the LVT shift. This table will show us:

- **Which property types benefit** (negative changes = tax decreases)
- **Which property types pay more** (positive changes = tax increases)  
- **How concentrated the impacts are** (median vs. average differences)
- **What percentage of each type sees increases**

Generally, we expect:
- **Developed properties** (houses, commercial buildings) to see tax **decreases**
- **Vacant land** to see the **largest increases** 
- **Properties with high improvement-to-land ratios** to benefit most


In [None]:
boundary_gdf = get_feature_data_with_geometry('parcel_boundaries', base_url=base_url)


## Step 6: Adding Geographic Context

To make our analysis spatially-aware, we need to add geographic boundaries to our parcel data. This enables us to:

- **Create maps** showing tax changes across the city
- **Analyze patterns by neighborhood** or district  
- **Combine with demographic data** for equity analysis
- **Present results visually** to stakeholders

We'll fetch the parcel boundary data from the same ArcGIS service that contains the geometric information for each property.


In [None]:

print(len(df))

# Merge with our tax analysis data
merged_gdf = df.merge(
    boundary_gdf,
    on='PARCELID',
    how='inner'
)

print(f"\nMerged data has {len(merged_gdf)} parcels")
merged_gdf_2 = merged_gdf.copy()

### Merging Tax Analysis with Geographic Data

Here we combine our tax analysis results with the geographic boundaries. This creates a spatially-enabled dataset that allows us to:

1. **Map tax changes** across South Bend
2. **Identify spatial patterns** in tax impacts
3. **Prepare for demographic analysis** by having geographic context

The merge should give us the same number of records as our original analysis, now with geographic coordinates for each parcel.
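One way to check that expectation is to merge with `how="left"` and pandas' `validate` and `indicator` options, which surface duplicated keys and unmatched parcels instead of silently dropping rows. The toy frames below are made up, and the `geometry` values are plain strings standing in for real shapes.

```python
import pandas as pd

# Toy tax results and boundary records; parcel C has no boundary, D has no tax row
tax_df = pd.DataFrame({"PARCELID": ["A", "B", "C"], "TAX_CHANGE": [-120.0, 45.0, 300.0]})
boundaries = pd.DataFrame({
    "PARCELID": ["A", "B", "D"],
    "geometry": ["poly_a", "poly_b", "poly_d"],
})

# validate raises if PARCELID is duplicated on either side;
# indicator adds a _merge column recording where each row came from
checked = tax_df.merge(
    boundaries, on="PARCELID", how="left", validate="one_to_one", indicator=True
)

unmatched = checked.loc[checked["_merge"] == "left_only", "PARCELID"].tolist()
print(f"{len(checked)} parcels, {len(unmatched)} without a boundary: {unmatched}")
```

An inner merge would have returned two rows here with no hint that parcel C went missing, so a left merge plus the indicator column makes the record-count check explicit.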


In [None]:
# Get census data for Jefferson County, Kentucky (FIPS code: 21111)
census_data, census_boundaries = get_census_data_with_boundaries(
    fips_code='21111',  # Kentucky (21) + Jefferson County (111)
    year=2022,
    api_key='e22bb2247fdb11c140fc8aa531e952a1e11539b7'  # Replace with your actual Census API key
)

# Set CRS for census boundaries before merging
census_boundaries = census_boundaries.set_crs(epsg=4326)  # Assuming WGS84 coordinate system
boundary_gdf = boundary_gdf.set_crs(epsg=4326)  # Set same CRS for boundary data

# Merge census data with our parcel boundaries
merged_gdf = match_to_census_blockgroups(
    gdf=boundary_gdf,
    census_gdf=census_boundaries,
    join_type="left"
)

# Merge the census data back onto the original dataframe
df = df.merge(
    merged_gdf,
    left_on='PARCELID',
    right_on='PARCELID',
    how='left'
)

print(f"Number of census block groups: {len(census_boundaries)}")
print(f"Number of census data rows: {len(census_data)}")
print(f"Number of parcels after census merge: {len(df)}")

## Step 7: Demographic and Equity Analysis

One of the most important aspects of LVT analysis is understanding the **equity implications**: how does the tax shift affect different income levels and demographic groups?

### Adding Census Data

We'll match each property to its Census Block Group and pull demographic data including:
- **Median household income** 
- **Racial/ethnic composition**
- **Population characteristics**

### Why This Matters

Policy makers need to understand:
- Does the LVT shift disproportionately burden low-income neighborhoods?
- Are there racial equity implications?  
- Does the policy align with broader equity goals?

**Note**: You'll need a Census API key for this section. Get one free at: https://api.census.gov/data/key_signup.html
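
Matching parcels to block groups relies on a consistent identifier. Assuming the block group key is the standard 12-digit Census GEOID (state, county, tract, block group), a small helper like the one below can sanity-check that matched records fall inside Jefferson County, Kentucky (state 21, county 111). The function name and the sample GEOID are hypothetical.

```python
def split_block_group_geoid(geoid: str) -> dict:
    """Split a 12-digit Census block group GEOID into its components."""
    geoid = str(geoid).strip()
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"Expected a 12-digit block group GEOID, got {geoid!r}")
    return {
        "state": geoid[:2],         # 2-digit state FIPS
        "county": geoid[2:5],       # 3-digit county FIPS
        "tract": geoid[5:11],       # 6-digit census tract
        "block_group": geoid[11],   # 1-digit block group
    }

# Hypothetical GEOID: the tract and block group digits are made up.
parts = split_block_group_geoid("211110101001")
in_jefferson = parts["state"] + parts["county"] == "21111"
print(parts, in_jefferson)
```

Keeping GEOIDs as strings rather than integers avoids losing leading zeros in states whose FIPS code starts with 0.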


In [None]:
print("DataFrame columns:")
print(df.columns.tolist())


### Exploring the Enhanced Dataset

With census data merged in, our dataset now contains both property tax information and demographic context. Let's explore what variables we now have available for analysis.

This enhanced dataset allows us to examine relationships between:
- Property characteristics and demographics
- Tax impacts and neighborhood income levels
- Geographic patterns in tax burden shifts
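
When two tables that share non-key columns are merged, pandas adds `_x` and `_y` suffixes, which can make the column list hard to read. The sketch below, built on two made-up frames, lists suffixed pairs and checks whether each pair carries identical values; the frame contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames sharing a non-key column, so the merge adds _x/_y suffixes.
left = pd.DataFrame({
    "PARCELID": ["A1", "A2"],
    "PROPTYPE": ["Res", "Com"],
    "new_tax": [900.0, 4100.0],
})
right = pd.DataFrame({
    "PARCELID": ["A1", "A2"],
    "PROPTYPE": ["Res", "Com"],
    "median_income": [52000, 61000],
})
merged = left.merge(right, on="PARCELID", how="left")

# Group suffixed columns by their base name.
suffixed = {}
for col in merged.columns:
    for suffix in ("_x", "_y"):
        if col.endswith(suffix):
            suffixed.setdefault(col[: -len(suffix)], []).append(col)

# Report each pair and whether the two copies agree.
for base, cols in suffixed.items():
    identical = len(cols) == 2 and merged[cols[0]].equals(merged[cols[1]])
    print(f"{base}: {cols} (identical: {identical})")
```

Pairs that turn out identical can be collapsed into one column; pairs that differ deserve a closer look before either copy is used in the analysis.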


In [None]:
# Display all columns with maximum width
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
display(df.head())


### Viewing the Complete Dataset

Let's examine our enhanced dataset with all the variables we've created and merged. This gives us a comprehensive view of each property with:

- **Property characteristics** (type, value, location)
- **Current tax calculations** 
- **New LVT calculations**
- **Tax change impacts**
- **Demographic context** (income, race/ethnicity)

This rich dataset forms the foundation for sophisticated equity and impact analysis.
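
When summarizing by block group, it matters whether the percentage change is computed from group totals or averaged across parcels. The sketch below contrasts the two on made-up data; the GEOIDs, tax amounts, and lowercase column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical parcels in two block groups; GEOIDs and taxes are made up.
parcels = pd.DataFrame({
    "std_geoid": ["211110001001"] * 3 + ["211110002001"] * 2,
    "current_tax": [100.0, 2000.0, 1900.0, 1500.0, 1500.0],
    "new_tax": [400.0, 1800.0, 1800.0, 1350.0, 1650.0],
})
parcels["tax_change_pct"] = (
    100 * (parcels["new_tax"] - parcels["current_tax"]) / parcels["current_tax"]
)

by_group = parcels.groupby("std_geoid").agg(
    total_current_tax=("current_tax", "sum"),
    total_new_tax=("new_tax", "sum"),
    mean_of_parcel_pct=("tax_change_pct", "mean"),
)

# Percent change in the group's total bill, computed from the sums.
by_group["aggregate_pct"] = (
    100
    * (by_group["total_new_tax"] - by_group["total_current_tax"])
    / by_group["total_current_tax"]
)
print(by_group.round(2))
```

In the first block group the total bill is unchanged, yet the mean of parcel-level percentages is large because one parcel with a small bill quadruples. The aggregate figure describes the neighborhood's overall burden, while the parcel-level mean or median describes the typical owner's experience.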


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def filter_data(df):
    """Drop rows with non-positive median income, then build a non-vacant subset of the result."""
    df_filtered = df[df['median_income'] > 0].copy()
    # Build the non-vacant subset from the filtered data so both subsets exclude invalid incomes
    non_vacant_df = df_filtered[df_filtered['PROPERTY_CATEGORY'] != 'Vacant Land'].copy()
    return df_filtered, non_vacant_df

def calculate_block_group_summary(df):
    """Calculate summary statistics for census block groups"""
    summary = df.groupby('std_geoid').agg(
        median_income=('median_income', 'first'),
        minority_pct=('minority_pct', 'first'),
        black_pct=('black_pct', 'first'),
        total_current_tax=('current_tax', 'sum'),
        total_new_tax=('new_tax', 'sum'),
        mean_tax_change=('TAX_CHANGE', 'mean'),
        median_tax_change=('TAX_CHANGE', 'median'),
        median_tax_change_pct=('TAX_CHANGE_PCT', 'median'),
        parcel_count=('TAX_CHANGE', 'count'),
        has_vacant_land=('PROPERTY_CATEGORY', lambda x: 'Vacant Land' in x.values)
    ).reset_index()
    
    summary['mean_tax_change_pct'] = ((summary['total_new_tax'] - summary['total_current_tax']) / 
                                    summary['total_current_tax']) * 100
    return summary

def create_scatter_plot(data, x_col, y_col, ax, title, xlabel, ylabel):
    """Create a scatter plot with trend line"""
    sns.scatterplot(
        data=data,
        x=x_col,
        y=y_col,
        size='parcel_count',
        sizes=(20, 200),
        alpha=0.7,
        ax=ax
    )
    
    ax.axhline(y=0, color='r', linestyle='--')
    
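    # Fit the trend line only on rows where both x and y are finite,
    # keeping the pairs aligned and sorted by x for a clean line
    valid = (data[[x_col, y_col]]
             .replace([np.inf, -np.inf], np.nan)
             .dropna()
             .sort_values(x_col))
    
    if len(valid) > 1:
        z = np.polyfit(valid[x_col], valid[y_col], 1)
        p = np.poly1d(z)
        ax.plot(valid[x_col], p(valid[x_col]), "r--")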
    
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)

def plot_comparison(data1, data2, x_col, y_col, title_prefix, xlabel):
    """Create side-by-side comparison plots"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))
    
    create_scatter_plot(data1, x_col, y_col, ax1, 
                       f'{title_prefix} - All Properties', xlabel, 'Mean Tax Change (%)')
    create_scatter_plot(data2, x_col, y_col, ax2,
                       f'{title_prefix} - Excluding Vacant Land', xlabel, 'Mean Tax Change (%)')
    
    plt.tight_layout()
    plt.show()

def calculate_correlations(data1, data2):
    """Calculate correlations between variables"""
    correlations = {}
    for df, suffix in [(data1, 'all'), (data2, 'non_vacant')]:
        correlations[f'income_mean_{suffix}'] = df[['median_income', 'mean_tax_change_pct']].corr().iloc[0, 1]
        correlations[f'income_median_{suffix}'] = df[['median_income', 'median_tax_change_pct']].corr().iloc[0, 1]
        correlations[f'minority_mean_{suffix}'] = df[['minority_pct', 'mean_tax_change_pct']].corr().iloc[0, 1]
        correlations[f'black_mean_{suffix}'] = df[['black_pct', 'mean_tax_change_pct']].corr().iloc[0, 1]
    return correlations

def create_quintile_summary(df, group_col, value_col):
    """Create summary statistics by quintiles"""
    df[f'{group_col}_quintile'] = pd.qcut(df[group_col], 5, 
                                         labels=["Q1 (Lowest)", "Q2", "Q3", "Q4", "Q5 (Highest)"])
    
    summary = df.groupby(f'{group_col}_quintile').agg(
        count=('TAX_CHANGE', 'count'),
        mean_tax_change=('TAX_CHANGE', 'mean'),
        median_tax_change=('TAX_CHANGE', 'median'),
        mean_value=(value_col, 'mean')
    ).reset_index()
    
    return summary

# Main execution
gdf_filtered, non_vacant_gdf = filter_data(df)
print(f"Number of rows in gdf_filtered: {len(gdf_filtered)}")
print(f"Number of rows in non_vacant_gdf: {len(non_vacant_gdf)}")

# Calculate block group summaries
census_block_groups = calculate_block_group_summary(gdf_filtered)
non_vacant_block_summary = calculate_block_group_summary(non_vacant_gdf)

# Create comparison plots
plot_comparison(census_block_groups, non_vacant_block_summary, 
               'median_income', 'mean_tax_change_pct', 
               'Mean Tax Change vs. Median Income', 
               'Median Income by Census Block Group ($)')

plot_comparison(census_block_groups, non_vacant_block_summary,
               'minority_pct', 'mean_tax_change_pct',
               'Mean Tax Change vs. Minority Percentage',
               'Minority Population Percentage by Census Block Group')

plot_comparison(census_block_groups, non_vacant_block_summary,
               'black_pct', 'mean_tax_change_pct',
               'Mean Tax Change vs. Black Percentage',
               'Black Population Percentage by Census Block Group')

# Calculate and print correlations
correlations = calculate_correlations(census_block_groups, non_vacant_block_summary)
for key, value in correlations.items():
    print(f"Correlation {key}: {value:.4f}")

# Create and display quintile summaries
income_quintile_summary = create_quintile_summary(gdf_filtered, 'median_income', 'median_income')
non_vacant_income_quintile_summary = create_quintile_summary(non_vacant_gdf, 'median_income', 'median_income')
minority_quintile_summary = create_quintile_summary(gdf_filtered, 'minority_pct', 'minority_pct')
non_vacant_minority_quintile_summary = create_quintile_summary(non_vacant_gdf, 'minority_pct', 'minority_pct')

print("\nTax impact by income quintile (all properties):")
display(income_quintile_summary)
print("\nTax impact by income quintile (excluding vacant land):")
display(non_vacant_income_quintile_summary)
print("\nTax impact by minority percentage quintile (all properties):")
display(minority_quintile_summary)
print("\nTax impact by minority percentage quintile (excluding vacant land):")
display(non_vacant_minority_quintile_summary)


In [None]:
print("Columns in df:", df.columns.tolist())


In [None]:
# Print unique property categories (try common column names)
category_cols = ['property_category', 'PROPERTY_CATEGORY', 'category', 'CATEGORY', 'land_use', 'LAND_USE', 'class', 'CLASS', 'propclass', 'PROPCLASS', 'PARCELSTAT', 'PARCELSTAT_x', 'PARCELSTAT_y']
cat_col = None
for col in category_cols:
    if col in df.columns:
        cat_col = col
        break
if cat_col is not None:
    unique_categories = df[cat_col].dropna().unique()
    print(f"Unique property categories in '{cat_col}':")
    for cat in unique_categories:
        print(f"  {cat}")
else:
    print("No property category column found in DataFrame columns:", df.columns.tolist())


In [None]:
import numpy as np
import geopandas as gpd
import pandas as pd

# --- Inspect columns for debugging ---
print("Columns in df:", df.columns.tolist())

# Helper to find column with fallback to _x, _y, or other variants
def get_colname(df, base):
    # Try base, then _x, then _y, then case-insensitive match, then substring match
    if base in df.columns:
        return base
    for suffix in ["_x", "_y"]:
        if f"{base}{suffix}" in df.columns:
            return f"{base}{suffix}"
    # Try case-insensitive match
    for col in df.columns:
        if col.lower() == base.lower():
            return col
    # Try substring match (e.g. "land_value" in "LAND_VALUE")
    for col in df.columns:
        if base.lower() in col.lower():
            return col
    return None

# Defensive: check for 'exemption_flag' or variant
exemption_col = get_colname(df, 'exemption_flag')
if exemption_col is None:
    raise KeyError("No 'exemption_flag' column found in DataFrame columns: " + str(df.columns.tolist()))

# Exclude exempt properties (handle missing or all-NA values gracefully)
exemption_mask = df[exemption_col].fillna(False)
if exemption_mask.dtype != bool:
    exemption_mask = exemption_mask.astype(bool)
df_non_exempt = df[~exemption_mask].copy()

# Calculate area in square feet (geometry assumed in meters)
if "geometry" not in df_non_exempt.columns:
    raise AttributeError("GeoDataFrame is missing a 'geometry' column.")
if not isinstance(df_non_exempt, gpd.GeoDataFrame):
    # Try to convert to GeoDataFrame if possible
    try:
        df_non_exempt = gpd.GeoDataFrame(df_non_exempt, geometry="geometry", crs=getattr(df, "crs", None))
    except Exception as e:
        raise TypeError("df_non_exempt is not a GeoDataFrame and could not be converted: " + str(e))

# Ensure geometry is in a projected CRS (meters) for area calculation
if df_non_exempt.crs is None:
    # Assumption: geometry with no CRS is in WGS84, matching the set_crs call used earlier
    df_non_exempt = df_non_exempt.set_crs(epsg=4326)
if not df_non_exempt.crs.is_projected:
    # Reproject to UTM zone 16N (EPSG:26916), whose units are meters
    projected_crs = "EPSG:26916"
    df_non_exempt = df_non_exempt.to_crs(projected_crs)

df_non_exempt['area_sqft'] = df_non_exempt.geometry.area * 10.7639
df_non_exempt['area_sqft'] = df_non_exempt['area_sqft'].replace(0, np.nan)

# --- DEBUG: Print which columns are found for value fields ---
for base in ['REALIMPROV', 'REALLANDVA', 'TLLDIMPROV']:
    col = get_colname(df_non_exempt, base)
    print(f"Column for {base}: {col}")

# Compute per square foot values, only for columns that exist (try _x/_y/other fallback)
# Use 'REALIMPROV' instead of 'improvement_value', 'REALLANDVA' instead of 'land_value', and 'TLLDIMPROV' as full market value
per_sqft_bases = ['new_tax', 'tax_change', 'current_tax', 'REALIMPROV', 'REALLANDVA', 'TLLDIMPROV']
for base in per_sqft_bases:
    col = get_colname(df_non_exempt, base)
    if col is not None:
        per_sqft_col = f"{base}_per_sqft"
        df_non_exempt[per_sqft_col] = df_non_exempt[col] / df_non_exempt['area_sqft']
    else:
        print(f"WARNING: No column found for {base} (tried base/_x/_y/substring)")

# Refined property category assignment
def property_category_refined(row):
    # Helper to get value with _x/_y/substring/case-insensitive fallback
    def get_val(row, base, default=None):
        if base in row:
            return row[base]
        for suffix in ["_x", "_y"]:
            if f"{base}{suffix}" in row:
                return row[f"{base}{suffix}"]
        # Try case-insensitive match
        for col in row.index:
            if col.lower() == base.lower():
                return row[col]
        # Try substring match
        for col in row.index:
            if base.lower() in col.lower():
                return row[col]
        return default

    # Check for PROPERTY_CATEGORY field (with fallback)
    prop_cat = get_val(row, 'PROPERTY_CATEGORY', None)
    if isinstance(prop_cat, str):
        prop_cat_lower = prop_cat.strip().lower()
        if prop_cat_lower == 'vacant land':
            return 'Vacant'
        elif prop_cat_lower == 'parking':
            return 'Parking Lot'

    # Otherwise, use logic as before
    if get_val(row, 'vacant', False):
        return 'Vacant'
    elif get_val(row, 'parking_lot', False):
        return 'Parking Lot'
    else:
        improvement_value = get_val(row, 'REALIMPROV', np.nan)
        land_value = get_val(row, 'REALLANDVA', np.nan)
        if (
            pd.notnull(improvement_value) and 
            pd.notnull(land_value) and 
            land_value > 0 and 
            improvement_value < 0.25 * land_value
        ):
            return 'Underdeveloped'
        else:
            return ''

df_non_exempt['property_category_refined'] = df_non_exempt.apply(property_category_refined, axis=1)

# Exclude condos based on PROPTYPE containing 'condo' (case-insensitive, with fallback)
proptype_col = get_colname(df_non_exempt, 'PROPTYPE')
num_condos = 0
if proptype_col is not None:
    condo_mask = df_non_exempt[proptype_col].astype(str).str.lower().str.contains('condo', na=False)
    num_condos = condo_mask.sum()
    if num_condos > 0:
        print(f"Excluding {num_condos} entries where PROPTYPE includes 'condo'")
    df_non_exempt = df_non_exempt[~condo_mask].copy()
else:
    print("WARNING: No PROPTYPE column found for condo exclusion.")

# Select columns to save, using _x/_y/substring fallback if needed
# Use 'REALIMPROV', 'REALLANDVA', and 'TLLDIMPROV' (as full market value) and their per_sqft columns
# Also save PROPTYPE if available
cols_to_save_bases = [
    'geometry', 'exemption_flag', 'property_category_refined',
    'current_tax', 'current_tax_per_sqft',
    'REALIMPROV', 'REALIMPROV_per_sqft',
    'REALLANDVA', 'REALLANDVA_per_sqft',
    'TLLDIMPROV', 'TLLDIMPROV_per_sqft',
    'PROPERTY_CATEGORY'
]
cols_to_save = []
for base in cols_to_save_bases:
    col = get_colname(df_non_exempt, base)
    if col is not None:
        cols_to_save.append(col)
    else:
        # Only warn for value fields, not geometry/category
        if base not in ['geometry', 'property_category_refined']:
            print(f"WARNING: Output column {base} not found in DataFrame.")

# Ensure output is in WGS84 (EPSG:4326) before saving
if df_non_exempt.crs is None or df_non_exempt.crs.to_epsg() != 4326:
    df_non_exempt = df_non_exempt.to_crs("EPSG:4326")
    print("Converted to EPSG:4326")

# Save as GeoParquet
import os
downloads_path = os.path.expanduser("~/Downloads/louisville_tax_per_sqft.parquet")
output_gdf = df_non_exempt[cols_to_save]
output_gdf.to_parquet(downloads_path, index=False)
print("Saved columns:", output_gdf.columns.tolist())
print("Property category counts:")
print(output_gdf['property_category_refined'].value_counts(dropna=False))
# Print PROPTYPE counts if present
proptype_col_out = get_colname(output_gdf, 'PROPTYPE')
if proptype_col_out is not None:
    print("PROPTYPE counts:")
    print(output_gdf[proptype_col_out].value_counts(dropna=False))
else:
    print("No PROPTYPE column found in output for value counts.")
# Print unique property categories
print("Unique property categories in output:", output_gdf['PROPERTY_CATEGORY'].unique())
