# Data Filtering and Cleaning Notebook

This notebook filters the enriched sales data down to recent, arm's-length, single-family residential sales and trims price outliers, producing a clean dataset for housing market analysis.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Define file paths
INTERIM_DIR = '../data/interim/'
PROCESSED_DIR = '../data/processed/'
ENRICHED_FILE = os.path.join(INTERIM_DIR, 'sales_data_enriched.csv')
CLEAN_FILE = os.path.join(PROCESSED_DIR, 'final_cleaned_data.csv')

## Setup and Configuration

In [2]:
# Load enriched data with proper data types
dtype_spec = {
    'sale_price': 'object',
    'class': 'str',
    'pin': 'str',
    'pin10': 'str'
}
df = pd.read_csv(ENRICHED_FILE, dtype=dtype_spec)
print(f"Initial dataset shape: {df.shape}")
df.shape


Initial dataset shape: (1785902, 26)


(1785902, 26)

## Load Enriched Data

In [3]:
# Convert sale_price to numeric, removing '$' and ','
df['sale_price'] = df['sale_price'].astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False)
df['sale_price'] = pd.to_numeric(df['sale_price'], errors='coerce')

## Clean Price Data
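To make the price conversion concrete, here is a minimal sketch on a made-up series. The sample strings are illustrative assumptions, not values from the sales file. It suggests that entries which cannot be parsed end up as `NaN` under `errors='coerce'`, which is why a null check on `sale_price` is still needed afterwards.

```python
import pandas as pd

# Hypothetical raw price strings; not drawn from the actual dataset
raw = pd.Series(['$250,000', '1,200,500', '85000', 'N/A', None])

# Strip the currency symbol and thousands separators as literal characters
cleaned = (
    raw.astype(str)
       .str.replace('$', '', regex=False)
       .str.replace(',', '', regex=False)
)

# Anything that still is not a number becomes NaN instead of raising
prices = pd.to_numeric(cleaned, errors='coerce')
print(prices)
print(f"Unparseable entries: {prices.isna().sum()}")
```

Passing `regex=False` matters for `'$'`, which would otherwise be read as an end-of-string anchor rather than a literal dollar sign.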

In [4]:
# Remove rows with missing price, coordinates, or distance
df.dropna(subset=['sale_price', 'lon', 'lat', 'min_distance_meters'], inplace=True)
print(f"After removing nulls: {df.shape}")

After removing nulls: (1785902, 26)


## Remove Missing Critical Values

In [5]:
# Keep only single-family residential homes
RESIDENTIAL_CLASSES = ['202', '203', '204', '205', '206', '207', '208', '209']
df['class'] = df['class'].astype(str).str[:3]
df_filtered = df[df['class'].isin(RESIDENTIAL_CLASSES)].copy()
print(f"After class filter: {df_filtered.shape}")
df_filtered.shape

After class filter: (542787, 26)


(542787, 26)

## Filter by Property Class

Keep only single-family residential properties (classes 202-209).
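The class filter can be sketched on a toy frame as follows. The class codes here are invented for illustration. Because the codes are truncated to three characters before matching, a longer code such as `'2030'` would be kept on the strength of its first three characters, which may or may not be the intended behavior for the real data.

```python
import pandas as pd

# Hypothetical class codes; illustrative only
toy = pd.DataFrame({'class': ['203', '2030', '211', '299', '205']})

# Classes 202 through 209, stored as strings to match the column dtype
RESIDENTIAL_CLASSES = [str(c) for c in range(202, 210)]

# Truncate to the first three characters, then keep matching rows
toy['class'] = toy['class'].astype(str).str[:3]
kept = toy[toy['class'].isin(RESIDENTIAL_CLASSES)].copy()
print(kept['class'].tolist())
```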

In [6]:
# Keep only arm's-length market transactions
df_filtered = df_filtered[
    (df_filtered['is_multisale'] == False) &
    (df_filtered['sale_filter_less_than_10k'] == False) &
    (df_filtered['sale_filter_deed_type'] == False)
]
print(f"After transaction filter: {df_filtered.shape}")
df_filtered.shape

After transaction filter: (452038, 26)


(452038, 26)

## Filter Transaction Types

Exclude non-market transactions (multi-parcel, low values, non-standard deeds).
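One way to check how many rows each exclusion flag accounts for is sketched below on invented data. It uses `any(axis=1)` as a compact alternative to chaining three comparisons, and it assumes the flag columns are true booleans with no missing values; if the real columns contain `NaN` or strings, the two forms may not agree.

```python
import pandas as pd

# Hypothetical flags: one clean sale and one row failing each filter
toy = pd.DataFrame({
    'is_multisale': [False, True, False, False],
    'sale_filter_less_than_10k': [False, False, True, False],
    'sale_filter_deed_type': [False, False, False, True],
})
flag_cols = ['is_multisale', 'sale_filter_less_than_10k', 'sale_filter_deed_type']

# Number of rows raised by each flag
print(toy[flag_cols].sum())

# Keep rows where none of the flags is set
keep_mask = ~toy[flag_cols].any(axis=1)
kept = toy[keep_mask]
print(f"Rows kept: {len(kept)} of {len(toy)}")
```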

In [7]:
# Keep only recent sales (2018+)
df_filtered['sale_date'] = pd.to_datetime(df_filtered['sale_date'], errors='coerce')
RECENT_YEAR = 2018
df_filtered = df_filtered[df_filtered['sale_date'].dt.year >= RECENT_YEAR]
print(f"After year filter: {df_filtered.shape}")
df_filtered.shape

After year filter: (126829, 26)



(126829, 26)

## Filter by Time Period

Keep sales from 2018 onwards for contemporary market analysis.
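A small sketch of the date filter on made-up dates is shown below. It illustrates a side effect worth knowing about: with `errors='coerce'`, a date that cannot be parsed becomes `NaT`, and `NaT` never satisfies the year comparison, so such rows are dropped without any message.

```python
import pandas as pd

# Hypothetical sale dates, including one unparseable entry
toy = pd.DataFrame({
    'sale_date': ['2017-12-31', '2018-01-01', '2021-06-15', 'not a date']
})
toy['sale_date'] = pd.to_datetime(toy['sale_date'], errors='coerce')

RECENT_YEAR = 2018
recent = toy[toy['sale_date'].dt.year >= RECENT_YEAR]

print(f"Unparseable dates: {toy['sale_date'].isna().sum()}")
print(recent)
```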

In [8]:
# Remove extreme price outliers (bottom 1% and top 1%)
lower_bound = df_filtered['sale_price'].quantile(0.01)
upper_bound = df_filtered['sale_price'].quantile(0.99)
df_final = df_filtered[
    (df_filtered['sale_price'] >= lower_bound) & 
    (df_filtered['sale_price'] <= upper_bound)
].copy()
print(f"After outlier removal: {df_final.shape}")
print(f"Price range: ${lower_bound:,.0f} - ${upper_bound:,.0f}")
df_final.describe()

After outlier removal: (124438, 26)
Price range: $25,000 - $1,550,000


| | year | township_code | nbhd | sale_date | sale_price | num_parcels_sale | row_id | lon | lat | min_distance_meters |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 124438.0 | 124438.0 | 124438.0 | 124438 | 124438.0 | 124438.0 | 124438.0 | 124438.0 | 124438.0 | 124438.0 |
| mean | 2021.493957 | 47.396237 | 47529.896993 | 2021-12-27 18:32:09.172760832 | 288716.6 | 1.0 | 34278620.0 | -87.728514 | 41.783491 | 7466.317174 |
| min | 2018.0 | 10.0 | 10011.0 | 2018-01-01 00:00:00 | 25000.0 | 1.0 | 7087464.0 | -88.027405 | 41.469928 | 29.502009 |
| 25% | 2020.0 | 27.0 | 27020.0 | 2020-04-28 00:00:00 | 155000.0 | 1.0 | 7339800.0 | -87.794284 | 41.68647 | 2077.085943 |
| 50% | 2021.0 | 39.0 | 39080.0 | 2021-12-07 00:00:00 | 247000.0 | 1.0 | 7601428.0 | -87.728186 | 41.778261 | 4642.148526 |
| 75% | 2023.0 | 72.0 | 72030.0 | 2023-09-22 00:00:00 | 351000.0 | 1.0 | 96516260.0 | -87.664095 | 41.90776 | 10304.70215 |
| max | 2025.0 | 77.0 | 77170.0 | 2025-10-22 00:00:00 | 1550000.0 | 1.0 | 98484840.0 | -87.524891 | 42.065343 | 34227.066903 |
| std | 2.107735 | 23.875413 | 23925.807227 | | 211635.1 | 0.0 | 41133920.0 | 0.092649 | 0.144482 | 7363.257819 |


## Remove Price Outliers

Trim top and bottom 1% of prices to remove data errors and unusual transactions.
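The percentile trim can be illustrated on a synthetic price series. The numbers below are invented, and the sketch assumes pandas' default linear interpolation for `quantile`. Both bounds are inclusive, so a sale priced exactly at a bound is kept.

```python
import pandas as pd

# Synthetic prices: 1,000 to 100,000 in steps of 1,000 (100 values)
prices = pd.Series(range(1, 101), dtype='float64') * 1000

# Cutoffs at the 1st and 99th percentiles
lower = prices.quantile(0.01)
upper = prices.quantile(0.99)

# Inclusive comparison on both sides
trimmed = prices[(prices >= lower) & (prices <= upper)]
print(f"Kept {len(trimmed)} of {len(prices)} values")
```

Because the cutoffs are computed from the data being trimmed, they shift whenever an upstream filter changes, so the reported price range is specific to this run.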

In [9]:
# Save cleaned dataset to CSV, creating the output directory if it is missing
os.makedirs(PROCESSED_DIR, exist_ok=True)
df_final.to_csv(CLEAN_FILE, index=False)
print(f"Cleaned data saved to {CLEAN_FILE}")
print(f"Final dataset: {len(df_final):,} records with {len(df_final.columns)} columns")

Cleaned data saved to ../data/processed/final_cleaned_data.csv
Final dataset: 124,438 records with 26 columns


## Cleaning Summary

**Filtering Applied:**
- Property class: Single-family residential only (202-209)
- Transaction type: Arm's-length market sales only
- Time period: 2018 onwards (no upper bound is applied; the latest sale in this extract is dated 2025-10-22)
- Price: 1st-99th percentile (removing outliers)
- Completeness: Non-missing sale price, lon/lat, and distance required (this check dropped no rows)

**Result:** Analytical dataset of 124,438 records and 26 columns, saved to `../data/processed/final_cleaned_data.csv`
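To see where rows were lost, the shapes printed by each cell can be tabulated. The sketch below hard-codes the row counts reported in the outputs of this notebook; the step labels are informal names chosen here, not identifiers from the code.

```python
import pandas as pd

# Row counts copied from the printed shapes in this notebook
steps = pd.DataFrame({
    'step': ['loaded', 'nulls removed', 'class filter',
             'transaction filter', 'year filter', 'outlier removal'],
    'rows': [1785902, 1785902, 542787, 452038, 126829, 124438],
})

# Rows removed by each step relative to the previous one
steps['dropped'] = steps['rows'].shift(1).sub(steps['rows']).fillna(0).astype(int)

# Share of the initial dataset still present after each step
steps['pct_of_initial'] = (steps['rows'] / steps['rows'].iloc[0] * 100).round(2)

print(steps.to_string(index=False))
```

Read this way, the class and year filters account for most of the reduction, while the price trim removes comparatively few rows.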