# 01_query – Collecting New Jersey home sale data from Redfin

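This notebook collects residential home sale data from Redfin using the
[homeharvest](https://github.com/ZacharyHampton/HomeHarvest) library. Note that
the `property_url` values in the sample output further down point to
realtor.com, so the records may in fact be served from Realtor.com.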

The focus is on properties in New Jersey over a roughly eleven-year window
(2015-01-01 to 2025-11-24), to provide a sufficiently large and up-to-date
dataset for modeling home sale prices.

The notebook:

1. Defines the geographic and temporal query parameters for New Jersey.
2. Splits the full date range into smaller time windows to make scraping more
   robust.
3. Uses `homeharvest` to download Redfin property records for each time window.
4. Concatenates and lightly de-duplicates the results.
5. Performs basic sanity checks on key numeric fields.
6. Saves the raw dataset as a single CSV file under `data/raw/` for downstream
   exploration, cleaning, and modeling.


> For future users: uncomment and run the cell below to install the packages required for this project.

In [1]:
#!pip install tensorflow
#!pip install -U homeharvest

## Imports and configuration

In this section, I load the required Python packages, add the project root to
`sys.path` so the notebook can import code from `src/`, and define the query
configuration:

- Location: New Jersey  
- Listing type: sold properties  
- Date range: 2015-01-01 to 2025-11-24  
- Property types: single-family and multi-family homes  
- Exclude foreclosure listings  
- Restrict to MLS listings only  

The configuration can be edited here if I later decide to broaden or narrow
the cohort (e.g., include condos or adjust the date range).


In [17]:
import os
import sys

import pandas as pd

# Add the project root to sys.path so that the source scripts in ../src can be imported
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)
from src.query import generate_date_chunks, scrape_sales, deduplicate_properties

> Future users: feel free to change the configuration below to suit your interests.

In [11]:
# ----------------------
# Configuration
# ----------------------
LOCATION = "New Jersey"
LISTING_TYPE = "sold"

DATE_FROM = "2015-01-01"
DATE_TO   = "2025-11-24"
CHUNK_DAYS = 90

PROPERTY_TYPES = [
    "single_family",
    "multi_family",
    # add 'condo', 'townhouse', etc. if desired
]

FORECLOSURE = False
MLS_ONLY = True

RAW_DATA_DIR = "../data/raw"
os.makedirs(RAW_DATA_DIR, exist_ok=True)

RAW_OUTPUT_PATH = os.path.join(
    RAW_DATA_DIR,
    f"redfin_nj_sold_{DATE_FROM}_to_{DATE_TO}.csv",
)

## Generate date chunks

Rationale for splitting the date range into chunks:

Scraping several years of data in a single request is brittle and more likely
to miss records. To make the process more robust, I split the full date range
into 90-day chunks using a helper function from `src.query`.

Each chunk is then scraped separately, and the results are combined at the end.
This also makes it easier to see which time windows, if any, return no data.
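
The implementation of `generate_date_chunks` lives in `src.query` and is not shown in this notebook. As a minimal sketch of how such a helper could work, the hypothetical function below (the name `generate_date_chunks_sketch` is an assumption, not the project's code) walks forward from the start date in inclusive windows of at most `chunk_days` days. With the configuration above it yields 45 windows, the first being `('2015-01-01', '2015-03-31')`, which agrees with the output of the next cell.

```python
from datetime import date, timedelta


def generate_date_chunks_sketch(date_from: str, date_to: str, chunk_days: int) -> list[tuple[str, str]]:
    """Split [date_from, date_to] into consecutive, non-overlapping windows.

    Hypothetical stand-in for src.query.generate_date_chunks. Each window
    covers at most `chunk_days` calendar days, with both endpoints inclusive.
    """
    if chunk_days < 1:
        raise ValueError("chunk_days must be at least 1")
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    chunks = []
    while start <= end:
        # Inclusive window: the last day is start + (chunk_days - 1),
        # clipped so the final window never runs past date_to.
        chunk_end = min(start + timedelta(days=chunk_days - 1), end)
        chunks.append((start.isoformat(), chunk_end.isoformat()))
        start = chunk_end + timedelta(days=1)
    return chunks


chunks = generate_date_chunks_sketch("2015-01-01", "2025-11-24", 90)
print(len(chunks), chunks[0], chunks[-1])
```

Using inclusive, non-overlapping windows means no sale date is requested twice, so any duplicates seen later come from the data source rather than from the chunking.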


In [12]:
date_chunks = generate_date_chunks(DATE_FROM, DATE_TO, CHUNK_DAYS)
print(f"Number of date chunks: {len(date_chunks)}")
print("Preview of the first five chunks: ")
date_chunks[:5]

Number of date chunks: 45
Preview of the first five chunks: 


[('2015-01-01', '2015-03-31'),
 ('2015-04-01', '2015-06-29'),
 ('2015-06-30', '2015-09-27'),
 ('2015-09-28', '2015-12-26'),
 ('2015-12-27', '2016-03-25')]

## Downloading property records from Redfin

Using the generated date chunks, I call `scrape_sales` (a wrapper around
`homeharvest.scrape_property`) to retrieve property records for New Jersey.

For each chunk, I log the date window and the number of properties returned.
At the end, all chunk-level DataFrames are concatenated into a single
`properties_raw` DataFrame.
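
The concatenation step can be sketched as follows. This is an illustrative helper (the name `combine_chunks` is an assumption, not part of `src.query`) that drops empty chunk results before calling `pd.concat`, so windows that return no rows, such as the early-2015 ones, do not affect the combined frame.

```python
import pandas as pd


def combine_chunks(frames):
    """Concatenate per-chunk DataFrames into one frame with a fresh index.

    Hypothetical helper: empty or missing chunk results are skipped so that
    they cannot influence the dtypes of the combined frame.
    """
    non_empty = [f for f in frames if f is not None and not f.empty]
    if not non_empty:
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True)


# Toy frames standing in for chunk-level results (not real scrape output).
frames = [
    pd.DataFrame(),                      # a window that returned nothing
    pd.DataFrame({"property_id": [1, 2]}),
    pd.DataFrame({"property_id": [3]}),
]
combined = combine_chunks(frames)
print(len(combined))
```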


## Handling transient API errors

The underlying `homeharvest` call occasionally fails with JSON parsing errors
or other transient network/API issues (for example, if the server returns an
HTML error page instead of JSON). To avoid aborting the entire scraping
process, the `scrape_sales` function includes a simple retry mechanism:

- Each date chunk is attempted up to `max_retries` times.
- If all attempts fail, the date range and the last error message are recorded
  in a `failed_chunks` list and the notebook proceeds to the next chunk.

At the end of the scrape, the notebook prints any failed date ranges so they
are easy to inspect or re-run separately if needed.
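
The retry behaviour described above can be sketched without any network access. The function below is a simplified illustration, not the actual body of `scrape_sales`: the name `fetch_chunks_with_retry` and the injected `fetch` callable are assumptions made so the logic can be run and tested on its own. In the real wrapper, `fetch` would correspond to the `homeharvest` call for one date window.

```python
import time
from typing import Callable, Iterable


def fetch_chunks_with_retry(
    fetch: Callable[[str, str], list],
    date_chunks: Iterable[tuple[str, str]],
    max_retries: int = 3,
    retry_sleep_seconds: float = 2.0,
):
    """Call fetch(start, end) once per chunk, retrying on any exception.

    Returns (results, failed_chunks), where failed_chunks holds
    (start, end, last_error_message) for chunks that never succeeded.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    results, failed_chunks = [], []
    for start, end in date_chunks:
        last_error = None
        for attempt in range(1, max_retries + 1):
            try:
                results.append(fetch(start, end))
                last_error = None
                break
            except Exception as exc:  # broad on purpose: failures are varied and transient
                last_error = exc
                if attempt < max_retries:
                    time.sleep(retry_sleep_seconds)
        if last_error is not None:
            # Record the failure and move on instead of aborting the whole run.
            failed_chunks.append((start, end, str(last_error)))
    return results, failed_chunks


# Toy fetcher: the second window always fails, mimicking a JSON parsing error.
calls = {"n": 0}


def flaky_fetch(start, end):
    calls["n"] += 1
    if start == "2015-04-01":
        raise ValueError("Expecting value: line 1 column 1 (char 0)")
    return [f"record for {start}"]


results, failed = fetch_chunks_with_retry(
    flaky_fetch,
    [("2015-01-01", "2015-03-31"), ("2015-04-01", "2015-06-29")],
    retry_sleep_seconds=0.0,
)
print(results, failed)
```

A fixed sleep between attempts keeps the sketch simple; an exponential backoff would be a reasonable alternative if the source throttles repeated requests.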


In [13]:
properties_raw, failed_chunks = scrape_sales(
    location=LOCATION,
    listing_type=LISTING_TYPE,
    date_chunks=date_chunks,
    property_types=PROPERTY_TYPES,
    foreclosure=FORECLOSURE,
    mls_only=MLS_ONLY,
    max_retries=3,
    retry_sleep_seconds=2.0,
)

print(f"\nTotal number of properties (before de-duplication): {len(properties_raw)}")

if failed_chunks:
    print("\nThe following date ranges failed and were skipped:")
    for start_str, end_str, err in failed_chunks:
        print(f"  {start_str} to {end_str}: {err}")

properties_raw.head()


[1/45] Scraping New Jersey from 2015-01-01 to 2015-03-31...
  Retrieved 0 properties
[2/45] Scraping New Jersey from 2015-04-01 to 2015-06-29...
  Retrieved 0 properties
[3/45] Scraping New Jersey from 2015-06-30 to 2015-09-27...
  Retrieved 0 properties
[4/45] Scraping New Jersey from 2015-09-28 to 2015-12-26...
  Retrieved 1720 properties
[5/45] Scraping New Jersey from 2015-12-27 to 2016-03-25...
  Retrieved 2975 properties
[6/45] Scraping New Jersey from 2016-03-26 to 2016-06-23...
  Retrieved 3804 properties
[7/45] Scraping New Jersey from 2016-06-24 to 2016-09-21...
  Retrieved 2953 properties
[8/45] Scraping New Jersey from 2016-09-22 to 2016-12-20...
  Retrieved 3122 properties
[9/45] Scraping New Jersey from 2016-12-21 to 2017-03-20...
  Retrieved 3108 properties
[10/45] Scraping New Jersey from 2017-03-21 to 2017-06-18...
  Retrieved 3809 properties
[11/45] Scraping New Jersey from 2017-06-19 to 2017-09-16...
  Retrieved 3280 properties
[12/45] Scraping New Jersey from 2017-0

Unnamed: 0,property_url,property_id,listing_id,permalink,mls,mls_id,status,mls_status,text,style,...,builder_id,builder_name,office_id,office_mls_set,office_name,office_email,office_phones,nearby_schools,primary_photo,alt_photos
0,https://www.realtor.com/realestateandhomes-det...,6768574319,605279890,20-Pine-Glen-Dr_Jobstown_NJ_08041_M67685-74319,PHPA,6623830,SOLD,S,,SINGLE_FAMILY,...,,,1252001,O-PHPA-EVRYHMRT,Everyhome Realtors,paul.heck@everyhome.com,"[{'number': '(215) 699-5555', 'type': 'Office'...",,,
1,https://www.realtor.com/realestateandhomes-det...,5693347416,603967502,28-Jake-Dr_Upper-Freehold_NJ_08501_M56933-47416,MONJ,21525341,SOLD,Sold,This elegant brick front colonial features 4 B...,SINGLE_FAMILY,...,,,2989657,O-MONJ-1693,Keller Williams Realty West Monmouth,klrw407@kw.com,"[{'number': '7325369010', 'type': 'Office', 'p...",Upper Freehold Regional School District,,
2,https://www.realtor.com/realestateandhomes-det...,5263273403,603969258,28-Jake-Dr_Cream-Ridge_NJ_08514_M52632-73403,PHPA,6600967,SOLD,S,,SINGLE_FAMILY,...,,,1350986,O-PHPA-KWRLTYWM,KELLER WILLIAMS REALTY - West Monmouth,jimdaykw@gmail.com,"[{'number': '7325369010', 'type': 'Office', 'p...",Upper Freehold Regional School District,https://p.rdcpix.com/v01/l57244e45-m0od-w480_h...,https://p.rdcpix.com/v01/l57244e45-m0od-w480_h...
3,https://www.realtor.com/realestateandhomes-det...,5286455447,610063262,1879-Old-Cuthbert-Rd-Ste-38_Cherry-Hill_NJ_080...,PHPA,6614636,SOLD,S,,SINGLE_FAMILY,...,,,124924,O-PHPA-GALL20,All-Ways Agency,77panj@gmail.com,"[{'number': '6097142000', 'type': 'Mobile', 'p...",Cherry Hill Township School District,,
4,https://www.realtor.com/realestateandhomes-det...,5153179323,640820846,334-Perkintown-Rd_Pedricktown_NJ_08067_M51531-...,PHPA,1001798217,SOLD,,"Simple, Inviting, Serene...This Picturesque Ho...",SINGLE_FAMILY,...,,,3006871,O-PHPA-61044,BHHS Fox & Roach Mullica Hill,ClientServices@foxroach.com,"[{'number': '8563436000', 'type': 'Office', 'p...",Oldmans Township School District,https://p.rdcpix.com/v01/ld75c4645-m0od-w480_h...,https://p.rdcpix.com/v01/ld75c4645-m0od-w480_h...


## De-duplication and basic sanity checks

Redfin listings can occasionally appear more than once (for example, if they
were relisted). To reduce redundancy, I use `deduplicate_properties` from
`src.query`, which prefers a unique identifier column (e.g., `property_id`,
`listing_id`, or `mls_id`) if present, and otherwise falls back to a
combination of address-related fields.
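
A minimal sketch of this preference order is shown below. It is not the code in `src.query`: the function name `deduplicate_sketch` and the fallback address column names are assumptions for illustration. It uses whichever identifier columns are present together as the duplicate key, and only falls back to address fields when none of them exist.

```python
import pandas as pd

ID_COLUMNS = ["property_id", "listing_id", "mls_id"]
# Assumed names for the address fallback; the real helper may use others.
ADDRESS_COLUMNS = ["street", "city", "state", "zip_code"]


def deduplicate_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, keyed on ID columns if present, else on address."""
    id_cols = [c for c in ID_COLUMNS if c in df.columns]
    subset = id_cols or [c for c in ADDRESS_COLUMNS if c in df.columns]
    if not subset:
        # Nothing usable as a key: return the data unchanged.
        return df.copy()
    return df.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)


# Toy data: the first two rows share all three identifiers.
toy = pd.DataFrame(
    {
        "property_id": [1, 1, 2],
        "listing_id": [10, 10, 20],
        "mls_id": ["A", "A", "B"],
        "list_price": [100, 100, 200],
    }
)
deduped = deduplicate_sketch(toy)
print(len(deduped))
```

Keying on all three identifiers jointly is conservative: a property that was relisted under a new `listing_id` is kept as a separate row.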

In [14]:
properties_raw = deduplicate_properties(properties_raw)
print(f"Total number of properties after de-duplication: {len(properties_raw)}")

Using ['property_id', 'listing_id', 'mls_id'] for de-duplication.
Total number of properties after de-duplication: 188822


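After de-duplication, I run a few quick checks on key numeric columns
(e.g., price, list_price, beds, baths, square footage) to detect obviously
problematic values or heavy missingness. Columns that are not present in the
DataFrame are skipped silently, which is why the output below reports only
`list_price`, `beds`, and `sqft`. Detailed cleaning and filtering will be
handled in the exploration notebook.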

In [15]:
print("\nBasic sanity checks:")
for col in ["price", "list_price", "beds", "baths", "sqft"]:
    if col in properties_raw.columns:
        print(
            f"{col}: min={properties_raw[col].min()}, "
            f"max={properties_raw[col].max()}, "
            f"missing={properties_raw[col].isna().sum()}"
        )


Basic sanity checks:
list_price: min=1, max=25000000, missing=142
beds: min=0, max=76, missing=1350
sqft: min=0, max=653400, missing=70855


## Saving the raw dataset

Finally, I save the de-duplicated raw dataset as a CSV file under `data/raw/`:

- File: `data/raw/redfin_nj_sold_2015-01-01_to_2025-11-24.csv`

This file serves as the starting point for the next steps of the project,
where I will perform more detailed exploratory data analysis, define a modeling
cohort, and construct features for predicting home sale prices.

In [16]:
properties_raw.to_csv(RAW_OUTPUT_PATH, index=False)
print(f"Saved raw data to: {RAW_OUTPUT_PATH}")

Saved raw data to: ../data/raw/redfin_nj_sold_2015-01-01_to_2025-11-24.csv


## Summary of coverage, dataset size, and initial quality checks

This notebook queried Redfin (via `homeharvest`) for sold residential
properties in New Jersey over the period **2015-01-01 to 2025-11-24** and
saved the de-duplicated raw data to `data/raw/`.

During the scraping step, the date range was split into 90-day chunks.
The first three chunks in early 2015 returned **no properties**:

- 2015-01-01 to 2015-03-31: 0 properties  
- 2015-04-01 to 2015-06-29: 0 properties  
- 2015-06-30 to 2015-09-27: 0 properties  

The **first non-empty chunk** was:

- 2015-09-28 to 2015-12-26: 1,720 properties  

The chunks that follow in the scrape log all returned non-zero numbers of
properties. In practice, this means that although the requested date window
starts on **2015-01-01**, the **effective data coverage** begins around
**late 2015**, most likely reflecting the historical coverage of the
underlying data source.
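
The start of effective coverage can be read off the per-chunk counts programmatically. The short sketch below uses the five counts reported at the top of the scrape log; the variable names are illustrative only.

```python
# Per-chunk counts copied from the first five entries of the scrape log.
chunk_counts = [
    (("2015-01-01", "2015-03-31"), 0),
    (("2015-04-01", "2015-06-29"), 0),
    (("2015-06-30", "2015-09-27"), 0),
    (("2015-09-28", "2015-12-26"), 1720),
    (("2015-12-27", "2016-03-25"), 2975),
]

# Windows with no records, and the first window that returned any.
empty_windows = [window for window, n in chunk_counts if n == 0]
first_non_empty = next((window for window, n in chunk_counts if n > 0), None)

print("Empty windows:", empty_windows)
print("Effective coverage starts in window:", first_non_empty)
```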

The dataset size at this stage is:

- Total records before de-duplication: **190,884**
- Total records after de-duplication (using `property_id`, `listing_id`,
  and `mls_id`): **188,822**

Basic sanity checks on key numeric variables yielded:

- **List price (`list_price`)**
  - Minimum observed value: **1**
  - Maximum observed value: **25,000,000**
  - Missing values: **142**
  - Interpretation: a list price of 1 is clearly invalid and should be treated
    as an error or excluded. The upper tail may include legitimate luxury
    properties but will be inspected more carefully using quantiles.

- **Number of bedrooms (`beds`)**
  - Minimum observed value: **0**
  - Maximum observed value: **76**
  - Missing values: **1,350**
  - Interpretation: 0 bedrooms is consistent with studio units, but a count of
    76 bedrooms almost certainly does not describe a standard residential
    property and likely reflects multi-unit buildings or data errors. Extreme
    bedroom counts will be capped or removed during cleaning.

- **Square footage (`sqft`)**
  - Minimum observed value: **0**
  - Maximum observed value: **653,400**
  - Missing values: **70,855**
  - Interpretation: 0 square feet is not a valid living area, and very large
    values such as 653,400 square feet are likely to be non-residential,
    aggregates, or erroneous entries. Because square footage is expected to be
    a strong predictor of sale price, extreme and missing values will require
    explicit handling (removal, capping, and/or imputation) in the exploration
    and cleaning steps.
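
As a rough sketch of the quantile-based inspection mentioned above, the helper below summarizes missingness, zero-or-negative values, and a few quantiles per column. The function name `summarize_numeric`, the chosen quantiles, and the toy data are all assumptions for illustration; the numbers it prints are not results from the scraped dataset.

```python
import pandas as pd


def summarize_numeric(df: pd.DataFrame, columns, quantiles=(0.01, 0.5, 0.99)) -> pd.DataFrame:
    """Per-column counts of missing and non-positive values, plus quantiles."""
    rows = {}
    for col in columns:
        if col not in df.columns:
            continue  # mirror the notebook's check: absent columns are skipped
        values = pd.to_numeric(df[col], errors="coerce")
        row = {
            "missing": int(values.isna().sum()),
            "zero_or_negative": int((values <= 0).sum()),
        }
        for q in quantiles:
            row[f"q{q:g}"] = values.quantile(q)
        rows[col] = row
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data only, chosen to include a list price of 1 and a 0 sqft entry.
toy = pd.DataFrame(
    {
        "list_price": [1, 250000, 400000, None],
        "sqft": [0, 1500, None, 2200],
    }
)
summary = summarize_numeric(toy, ["list_price", "sqft", "baths"])
print(summary)
```

Quantiles are less sensitive than the minimum and maximum to a handful of erroneous entries, which makes them a more useful basis for choosing capping thresholds.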

Overall, the query step has:

- Assembled a large New Jersey home-sales dataset with effective coverage from
  **late 2015** through **2025-11-24**, and  
- Revealed expected data-quality issues (outliers, implausible values, and
  missingness) in several key features.

The next notebook will focus on more detailed exploratory data analysis,
definition of a clean modeling cohort, and systematic treatment of outliers
and missing values.
