## Project Setup

In this tutorial, we'll scrape Zillow using Python with two community packages:

- [httpx](https://pypi.org/project/httpx/) - HTTP client library to get Zillow data in either HTML or JSON.
- [parsel](https://pypi.org/project/parsel/) - HTML parsing library to parse our web scraped HTML files.

Optionally, we'll also use [loguru](https://pypi.org/project/loguru/), a logging library that will allow us to track our Zillow data scraper.  
These packages can be installed using the following pip command, along with `h2`, which httpx needs for HTTP/2 support:

In [None]:
!pip install httpx parsel loguru h2

## How to Scrape Zillow Property Pages?

To start, let's explore scraping Zillow data from property pages. First, let's locate the data in the HTML of a given Zillow page, like [this one](https://www.zillow.com/b/1625-e-13th-st-brooklyn-ny-5YGKWY/).

To scrape this page's data, we could parse every detail using XPath or CSS selectors. However, there is a better approach: hidden web data. To find this data, follow the steps below:

- Open the [browser developer tools](https://scrapfly.io/blog/browser-developer-tools-in-web-scraping/) by pressing the `F12` key.
- Search for the selector `//script[@id='__NEXT_DATA__']`.

After following the above steps, you will find the property dataset hidden in the `<script>` tag matched by the above XPath selector:

![capture of page source of Zillow's property page](https://scrapfly.io/blog/content/images/how-to-scrape-zillow_page-source-prop.svg)

We can see that the property data is available as a JSON object in a script tag.

This is the same real estate data that is shown on the page, captured before it gets rendered into the HTML. It is commonly known as hidden web data.
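
As a minimal sketch of the hidden web data idea, the snippet below pulls the JSON out of a `__NEXT_DATA__` script tag using only Python's standard library. The sample HTML and the `ScriptExtractor` and `extract_hidden_json` names are made up for illustration; the real Zillow payload is much larger and its structure may differ.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class ScriptExtractor(HTMLParser):
    """Collect the text of the first <script> tag that has the given id."""

    def __init__(self, script_id: str):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        is_target = tag == "script" and dict(attrs).get("id") == self.script_id
        if is_target and self.content is None:
            self._inside = True
            self.content = ""

    def handle_data(self, data):
        if self._inside:
            self.content += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_hidden_json(html: str, script_id: str = "__NEXT_DATA__") -> Optional[dict]:
    """Return the decoded JSON of the script tag, or None if it is missing."""
    parser = ScriptExtractor(script_id)
    parser.feed(html)
    parser.close()
    return json.loads(parser.content) if parser.content else None


# Made-up page for illustration, not real Zillow markup
SAMPLE_HTML = """
<html><body>
<div id="price">$450,000</div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"price": 450000, "bedrooms": 2}}}
</script>
</body></html>
"""

hidden = extract_hidden_json(SAMPLE_HTML)
print(hidden["props"]["pageProps"])
```

The same dataset that would otherwise need several selectors comes back as one dictionary, which is why this approach tends to be less brittle than parsing the rendered HTML.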

Let's power our Zillow data scraper with requesting and parsing logic for property pages:

In [2]:
import asyncio
from typing import List
import httpx
import h2
import json
from parsel import Selector

client = httpx.AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent being blocked
    headers={
        # each header is set once; a repeated dict key would silently override the earlier value
        "accept-language": "en-US,en;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
    },
)

async def scrape_properties(urls: List[str]):
    """scrape zillow property pages for property data"""
    to_scrape = [client.get(url) for url in urls]
    results = []
    for response in asyncio.as_completed(to_scrape):
        response = await response
        assert response.status_code == 200, "request has been blocked"
        selector = Selector(response.text)
        data = selector.css("script#__NEXT_DATA__::text").get()
        if data:
            # Option 1: some properties are located in NEXT DATA cache
            data = json.loads(data)
            property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
            property_data = property_data[list(property_data)[0]]['property']
        else:
            # Option 2: other times it's in Apollo cache
            data = selector.css("script#hdpApolloPreloadedData::text").get()
            data = json.loads(json.loads(data)["apiCache"])
            property_data = next(
                v["property"] for k, v in data.items() if "ForSale" in k
            )
        results.append(property_data)
    return results

In [None]:
async def run():
    data = await scrape_properties(
            ["https://www.zillow.com/homedetails/1625-E-13th-St-APT-3K-Brooklyn-NY-11229/245001606_zpid/"]
        )
    print(json.dumps(data, indent=2))

# Execute the run function in an async context
await run()
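
The `gdpClientCache` value used above is a JSON string nested inside the already decoded JSON document, so it needs a second `json.loads` pass. The sketch below isolates that step and trims the result to a few fields. The sample payload is made up, and the field names `price`, `bedrooms`, and `bathrooms` are assumptions about what a property record contains; the helper names `parse_property_cache` and `summarize` are hypothetical.

```python
import json


def parse_property_cache(next_data: dict) -> dict:
    """Decode the nested cache string and return the first property record."""
    raw_cache = next_data["props"]["pageProps"]["componentProps"]["gdpClientCache"]
    cache = json.loads(raw_cache)  # second decoding pass: the cache is a JSON string
    first_entry = next(iter(cache.values()))
    return first_entry["property"]


def summarize(property_data: dict, fields=("zpid", "price", "bedrooms", "bathrooms")) -> dict:
    """Keep only selected fields; missing ones become None."""
    return {field: property_data.get(field) for field in fields}


# Made-up payload that mimics the double encoding, not real Zillow data
sample_next_data = {
    "props": {
        "pageProps": {
            "componentProps": {
                "gdpClientCache": json.dumps(
                    {
                        "ForSaleQuery-1": {
                            "property": {
                                "zpid": 1,
                                "price": 450000,
                                "bedrooms": 2,
                                "description": "made-up listing",
                            }
                        }
                    }
                )
            }
        }
    }
}

summary = summarize(parse_property_cache(sample_next_data))
print(summary)
```

Using `.get()` for each field keeps the summary step from failing when a listing lacks one of the assumed keys.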


## How to Find Zillow Properties

Our previous code for scraping Zillow can extract data from a property page. In this section, we'll explore finding real estate listings using Zillow's search bar. Here is how the search system works under the hood:

Inspecting Zillow's search functionality with Chrome Dev tools (accessed via F12 key)

Above, we can see that submitting a search query sends a background request to Zillow's search API. The request payload includes the map coordinates along with many other details. However, only a few query parameters are actually required:

```json
{
  "searchQueryState":{
    "pagination":{},
    "usersSearchTerm":"New Haven, CT",
    "mapBounds":{
      "west":-73.03037621240235,
      "east":-72.82781578759766,
      "south":41.23043771298298,
      "north":41.36611033618769
    }
  },
  "wants": {
    "cat1":["mapResults"]
  },
  "requestId": 2
}
```

The Zillow search API is really powerful and allows us to find listings in _any_ map area defined by two location points comprised of 4 direction values: north, west, south, and east:

![illustration of drawing areas on maps using only two points](https://scrapfly.io/blog/content/images/how-to-scrape-zillow_two-points.svg)

With these 4 values, we can draw a rectangular search area anywhere on the map.
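
To make the four values concrete, here is a rough sketch that derives `mapBounds` from a center point and a radius. The helper name `bounds_around`, the 111.32 km per degree constant, and the New Haven center coordinates are approximations chosen for illustration; the flat conversion is only reasonable for small areas away from the poles.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def bounds_around(lat: float, lon: float, radius_km: float) -> dict:
    """Return west/east/south/north values for a box centered on (lat, lon)."""
    lat_delta = radius_km / KM_PER_DEGREE_LAT
    # a degree of longitude shrinks with the cosine of the latitude
    lon_delta = radius_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return {
        "west": lon - lon_delta,
        "east": lon + lon_delta,
        "south": lat - lat_delta,
        "north": lat + lat_delta,
    }


# approximate center of New Haven, CT
map_bounds = bounds_around(41.3083, -72.9279, radius_km=5)
print(map_bounds)
```

The resulting dictionary has the same keys as the `mapBounds` object in the search payload, so it could be dropped into `searchQueryState` directly.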

Let's replicate the logic for finding properties by location in our Zillow scraping code using the latitude and longitude values:

In [None]:
import json
import httpx
import time

# we should use browser-like request headers to prevent being instantly blocked
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "Content-Type": "application/json",  # Added Content-Type header
}

url = "https://www.zillow.com/async-create-search-page-state"
body = {
    "searchQueryState": {
        "pagination": {},
        "usersSearchTerm": "New Haven, CT",
        # map coordinates that indicate New Haven city's area
        "mapBounds": {
            "west": -73.03037621240235,
            "east": -72.82781578759766,
            "south": 41.23043771298298,
            "north": 41.36611033618769,
        },
    },
    "wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
    "requestId": 2,
}

max_retries = 3
for attempt in range(max_retries):
    # json= serializes the body; passing a string through data= is deprecated in httpx
    response = httpx.put(url, headers=BASE_HEADERS, json=body)
    if response.status_code == 200:
        break
    elif attempt < max_retries - 1:
        time.sleep(2 ** attempt)  # Exponential backoff
    else:
        raise RuntimeError(f"Request has been blocked. Status code: {response.status_code}, Response: {response.text}")

data = response.json()
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")
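
The retry loop in the cell above can also be pulled into a reusable helper. The following is a sketch under a few assumptions: `request_with_retries` and `FakeResponse` are hypothetical names, `send` is any zero-argument callable returning an object with a `status_code` attribute, and the `sleep` parameter exists only so the backoff can be exercised without waiting.

```python
import time
from typing import Callable


class FakeResponse:
    """Stand-in for an HTTP response, used only to demonstrate the helper."""

    def __init__(self, status_code: int):
        self.status_code = status_code


def request_with_retries(
    send: Callable[[], FakeResponse],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send() until it returns status 200, backing off exponentially."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        response = send()
        if response.status_code == 200:
            return response
        if attempt < max_retries - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(
        f"request failed after {max_retries} attempts, last status: {response.status_code}"
    )


# simulate two blocked responses followed by a success
statuses = iter([403, 403, 200])
delays = []
response = request_with_retries(
    lambda: FakeResponse(next(statuses)),
    sleep=delays.append,  # record the delays instead of sleeping
)
print(response.status_code, delays)
```

With httpx, `send` could be something like `lambda: httpx.put(url, headers=BASE_HEADERS, json=body)`, reusing the names from the cell above.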

## How to Scrape Zillow Search Pages?

To scrape Zillow search, we need the geographical location details, which can be challenging to get. Therefore, we'll extract the location's geographical details from an easier user interface: search pages. To illustrate this, go to any search URL on Zillow, like [zillow.com/homes/New-Haven,-CT\_rb/](https://www.zillow.com/homes/New-Haven,-CT_rb/). You will find the geographical details hidden in the HTML:

![capture of page source of Zillow's search pager](https://scrapfly.io/blog/content/images/how-to-scrape-zillow_page-source-search.svg)

We can see the query and geo data of this search hidden in the page source.

The geographical details are stored in the `__NEXT_DATA__` script tag. Let's use them to scrape Zillow data from search pages:

In [89]:
import random
import json
import httpx
from loguru import logger as log
from parsel import Selector

BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "Content-Type": "application/json",
}

def _search(query: str, session: httpx.Client, filters: dict = None, categories=("cat1", "cat2")):
    """base search function which is used by sale and rent search functions"""
    html_response = session.get(f"https://www.zillow.com/homes/{query}_rb/")
    assert html_response.status_code != 403, "request is blocked"
    selector = Selector(html_response.text)
    # find query data in script tags
    script_data = json.loads(selector.xpath("//script[@id='__NEXT_DATA__']/text()").get())
    query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
    if filters:
        query_data["filterState"] = filters

    # scrape search API
    url = "https://www.zillow.com/async-create-search-page-state"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": random.randint(2, 10),
        }
        api_response = session.put(url, headers={"content-type": "application/json"}, json=full_query)
        data = api_response.json()
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found

def search_sale(query: str, session: httpx.Client):
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return _search(query=query, session=session)

def search_rent(query: str, session: httpx.Client):
    """search properties that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": True},
        "isAllHomes": {"value": False},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        # enable rentals and disable agent for-sale listings to match this function's purpose
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
        "price": {"max": 1000000},
        "lot": {"min": 5000},
        "beds": {"min": 1},
        "baths": {"min": 1}
    }
    return _search(query=query, session=session, filters=filters, categories=["cat1"])

def run():
    limits = httpx.Limits(max_connections=5)
    with httpx.Client(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = search_rent("Eureka,CA", session)
        with open('properties.geojson', 'w') as geojson_file:
            geojson_data = {
                "type": "FeatureCollection",
                "features": [
                    {
                        "type": "Feature",
                        "geometry": {
                            "type": "Point",
                            "coordinates": [result["latLong"]["longitude"], result["latLong"]["latitude"]],
                        },
                        "properties": result,
                    }
                    for result in data
                ],
            }
            json.dump(geojson_data, geojson_file, indent=2)
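
The `_search` function warns when a query exceeds the 500 result limit but does not work around it. One possible approach, sketched here and not taken from Zillow's documentation, is to split the map bounds into quadrants until every tile is under the limit. The names `split_bounds` and `tiles_under_limit` are hypothetical, and `count_results` stands in for a call that would ask the search API for a tile's total.

```python
from typing import Callable, List


def split_bounds(bounds: dict) -> List[dict]:
    """Split a west/east/south/north box into four equal quadrants."""
    mid_lon = (bounds["west"] + bounds["east"]) / 2
    mid_lat = (bounds["south"] + bounds["north"]) / 2
    return [
        {"west": bounds["west"], "east": mid_lon, "south": mid_lat, "north": bounds["north"]},
        {"west": mid_lon, "east": bounds["east"], "south": mid_lat, "north": bounds["north"]},
        {"west": bounds["west"], "east": mid_lon, "south": bounds["south"], "north": mid_lat},
        {"west": mid_lon, "east": bounds["east"], "south": bounds["south"], "north": mid_lat},
    ]


def tiles_under_limit(
    bounds: dict,
    count_results: Callable[[dict], int],
    limit: int = 500,
    max_depth: int = 6,
) -> List[dict]:
    """Recursively split bounds until each tile reports at most `limit` results."""
    if max_depth == 0 or count_results(bounds) <= limit:
        return [bounds]
    tiles = []
    for quadrant in split_bounds(bounds):
        tiles.extend(tiles_under_limit(quadrant, count_results, limit, max_depth - 1))
    return tiles


# New Haven bounds from the search payload shown earlier
root = {
    "west": -73.03037621240235,
    "east": -72.82781578759766,
    "south": 41.23043771298298,
    "north": 41.36611033618769,
}


def area(b: dict) -> float:
    return (b["east"] - b["west"]) * (b["north"] - b["south"])


def fake_count(b: dict) -> int:
    """Pretend 1600 listings are spread evenly over the root area."""
    return round(1600 * area(b) / area(root))


tiles = tiles_under_limit(root, fake_count)
print(f"{len(tiles)} tiles needed")
```

Each tile could then be sent as its own `mapBounds` query. Listings on tile borders may appear in more than one response, so deduplicating by `zpid` afterwards would likely be needed.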



## Convert to GeoJSON

In [None]:
import json
import httpx
import random
import geopandas as gpd
from geojson import Feature, Point, FeatureCollection
from loguru import logger as log

def convert_to_geojson(data):
    features = []
    for item in data:
        if "latLong" in item and "latitude" in item["latLong"] and "longitude" in item["latLong"]:
            coordinates = (item["latLong"]["longitude"], item["latLong"]["latitude"])
            feature = Feature(
                geometry=Point(coordinates),
                properties={
                    "zpid": item.get("zpid"),
                    "statusType": item.get("statusType"),
                    "statusText": item.get("statusText"),
                    "price": item.get("price"),
                    "beds": item.get("beds"),
                    "baths": item.get("baths"),
                    "area": item.get("area"),
                    "address": item.get("address"),
                    "city": item.get("hdpData", {}).get("homeInfo", {}).get("city"),
                    "state": item.get("hdpData", {}).get("homeInfo", {}).get("state"),
                    "zipcode": item.get("hdpData", {}).get("homeInfo", {}).get("zipcode"),
                    "homeType": item.get("hdpData", {}).get("homeInfo", {}).get("homeType"),
                }
            )
            features.append(feature)

    return FeatureCollection(features)

def search_by_bbox(bbox, session, filters=None):
    url = "https://www.zillow.com/async-create-search-page-state"
    query_data = {
        "pagination": {},
        "mapBounds": {
            "west": bbox[0],
            "east": bbox[2],
            "south": bbox[1],
            "north": bbox[3],
        },
    }

    if filters:
        query_data["filterState"] = filters

    full_query = {
        "searchQueryState": query_data,
        "wants": {"cat1": ["mapResults"]},
        "requestId": random.randint(2, 10),
    }

    response = session.put(url, headers=BASE_HEADERS, json=full_query)
    data = response.json()
    return data["cat1"]["searchResults"]["mapResults"]

def run_geojson_search(input_geojson_file, output_geojson_file, filters=None):
    geodf = gpd.read_file(input_geojson_file)
    bbox = geodf.total_bounds

    limits = httpx.Limits(max_connections=5)
    with httpx.Client(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        results = search_by_bbox(bbox, session, filters)

        geojson_data = convert_to_geojson(results)

        with open(output_geojson_file, 'w') as geojson_file:
            json.dump(geojson_data, geojson_file, indent=2)

        log.info(f"GeoJSON file '{output_geojson_file}' created with {len(geojson_data['features'])} features.")

# Example usage
if __name__ == "__main__":
    input_geojson_file = "/Users/maples/GitHub/Zillow-Scrape/grt_buffer_bbox_wgs84.geojson"
    output_geojson_file = "output_properties03.geojson"
    # no filter state is defined in this cell, so search without filters
    run_geojson_search(input_geojson_file, output_geojson_file, filters=None)
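
For readers who prefer not to install geopandas just to read a bounding box, the sketch below computes the same `(west, south, east, north)` tuple with the standard library. It assumes a plain GeoJSON FeatureCollection in longitude/latitude order and does not handle `GeometryCollection` geometries; the `iter_positions` and `total_bounds` helpers and the sample data are made up for illustration.

```python
def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def total_bounds(feature_collection: dict) -> tuple:
    """Return (west, south, east, north) covering every feature."""
    lons, lats = [], []
    for feature in feature_collection["features"]:
        geometry = feature.get("geometry") or {}
        for lon, lat in iter_positions(geometry.get("coordinates", [])):
            lons.append(lon)
            lats.append(lat)
    if not lons:
        raise ValueError("no coordinates found")
    return min(lons), min(lats), max(lons), max(lats)


# made-up input: one polygon and one point
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [[-73.0, 41.2], [-72.8, 41.2], [-72.8, 41.4], [-73.0, 41.4], [-73.0, 41.2]]
                ],
            },
        },
        {
            "type": "Feature",
            "properties": {},
            "geometry": {"type": "Point", "coordinates": [-72.9, 41.3]},
        },
    ],
}

west, south, east, north = total_bounds(sample)
print(west, south, east, north)
```

The tuple follows the same order as `geodf.total_bounds`, so it maps onto the `bbox[0]` through `bbox[3]` indexing used in `search_by_bbox`.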