islajr/property-scraper

🤖 Property Scraper


Introduction

Property Scraper is an autonomous, lightweight Python data pipeline that extracts real property listing data from public Nigerian property portals, cleans and geocodes it, and persists it to a PostgreSQL instance on Supabase on a weekly schedule. At the end of every run it sends a Telegram message summarising what happened.


What it does

Property Scraper has two run modes: Discovery Runs and Health Checks.

Discovery Runs find new property listings and store them in the database.

Health Checks re-visit known listings to detect price changes and removals, and log the result.

Together the two modes define the lifecycle of a listing: listings enter the pipeline through a Discovery Run and exit through a Health Check.

Discovery Runs

Each run proceeds through seven stages:

| Stage | Description |
| --- | --- |
| 1. Snapshot | Loads all ACTIVE listings from the DB into memory |
| 2. Scrape | Fetches listings from each of the Nigerian portals listed below |
| 3. Normalise | Converts raw strings ("₦45M", "3 Beds") into typed Python values |
| 4. Geocode | Attaches lat/lng coordinates via neighbourhood name lookup |
| 5. Upsert | Inserts new listings, updates existing ones, and emits history events |
| 6. Log | Writes one row per portal to scrape_runs |
| 7. Notify | Sends a Telegram summary with counts and status per portal |

The three portals scraped are: PropertyPro.ng, PrivateProperty.ng, and NigeriaPropertyCentre.ng


Health Check Runs

Each health check goes through the following stages:

| Stage | Description |
| --- | --- |
| 1. Snapshot | Loads all ACTIVE listings from the DB into memory |
| 2. Check | Re-fetches each listing's original URL to confirm it still exists and to detect price changes |
| 3. Log | Records a PRICE_CHANGE event for each changed price and a REMOVED event for each listing that has disappeared |
| 4. Notify | Sends a Telegram summary with the run stats and results |
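
The Check and Log stages above can be sketched as a small decision function. This is an illustration only, not the code in health_checker.py: the CheckResult type and the classify() name are hypothetical, and only the event names PRICE_CHANGE and REMOVED come from this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    """Outcome of re-fetching one listing URL (hypothetical shape)."""
    found: bool                       # False when the listing page no longer exists
    price_kobo: Optional[int] = None  # current price, if it could be parsed


def classify(stored_price_kobo: int, result: CheckResult) -> Optional[str]:
    """Return the history event to record, or None when nothing changed."""
    if not result.found:
        return "REMOVED"
    if result.price_kobo is not None and result.price_kobo != stored_price_kobo:
        return "PRICE_CHANGE"
    return None
```

A listing whose page is missing maps to REMOVED, one whose price differs from the stored value maps to PRICE_CHANGE, and everything else produces no event.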

Project structure

property-scraper/
├── config.py                   # All configuration: env vars, constants
├── conftest.py                 # Root pytest path fix
├── pytest.ini                  # pytest settings
├── run.sh                      # Local weekly run script
├── .env                        # Local credentials (never commit)
│
├── scraper/
│   ├── models.py               # RawListing and NormalisedListing dataclasses
│   ├── orchestrator.py         # Main entry point: wires all stages together
│   ├── normaliser.py           # String → typed value conversion
│   ├── geocoder.py             # Neighbourhood → lat/lng (Nominatim + cache)
│   ├── db_writer.py            # All database reads and writes
│   ├── notifier.py             # Telegram notification
│   ├── health_checker.py       # Health Check logic
│   └── parsers/
│       ├── base_parser.py      # Shared HTTP + pagination infrastructure
│       ├── propertypro.py
│       ├── privateproperty.py
│       └── nigeriapropertycentre.py
│
├── schema/
│   ├── 001_raw_data_schema.sql         # DB tables and indexes, run once
│   └── 002_add_health_check_at.sql     # Migration to track health checks
│
└── tests/
    ├── conftest.py
    ├── fixtures/               # Saved HTML from each portal
    ├── test_parsers.py
    ├── test_normaliser.py
    ├── test_geocoder.py
    ├── test_db_writer.py
    └── test_pipeline.py

Requirements

  • Python 3.10–3.12 (3.13 is not supported by default: psycopg2-binary releases before 2.9.10 have no wheel for it)
  • A PostgreSQL database (for example, a Supabase project) with the schema applied
  • Optional: a Telegram bot token and chat ID for run notifications

Setup

1. Clone and create a virtual environment

git clone <repo-url>
cd property-scraper
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Create a .env file in the project root

DATABASE_URL=postgresql://postgres.[project-id]:[password]@...
TELEGRAM_BOT_TOKEN=1234567890:ABCDEF...   # optional
TELEGRAM_CHAT_ID=123456789                # optional

3. Apply the database schema

psql "$DATABASE_URL" -f schema/001_raw_data_schema.sql

4. Update the Nominatim user-agent

Open scraper/geocoder.py and set NOMINATIM_AGENT to a real application name and contact email. Nominatim blocks generic or empty user-agents.


Running

Discovery mode

./run.sh

run.sh handles the full lifecycle: checks .env, resolves the Python version, creates or rebuilds .venv if needed, installs dependencies, applies the schema (idempotent), and runs the pipeline. On failure it prints the last 30 lines of scraper.log.

You can also run the pipeline directly:

python3 -m scraper.orchestrator

Health Checks

./run.sh --health-check

With the --health-check flag, run.sh pulls all active listings from the database and re-checks each one. If a listing's price has changed, a PRICE_CHANGE event is appended to the listing_history table; if the listing is no longer present, a REMOVED event is appended instead.


Configuration

All configuration is in config.py. The key constants:

| Constant | Default | Purpose |
| --- | --- | --- |
| REQUEST_DELAY_MIN / MAX | 2.0 / 5.0 s | Random delay between listing fetches |
| MAX_RETRIES | 3 | Retries per request (exponential backoff) |
| RETRY_BACKOFF_BASE | 2.0 | Backoff base in seconds (2s, 4s, 8s) |
| PAGINATION_STOP_AFTER_KNOWN | 5 | Stop paginating after 5 consecutive known listings |
| MISSED_RUN_REMOVAL_THRESHOLD | 3 | Runs absent before a listing is marked REMOVED |
| SUSPECTED_SOLD_MIN_DAYS | 30 | Minimum days active to flag a removal as a suspected sale |
| UPSERT_BATCH_SIZE | 200 | DB write batch size |
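
To make the timing constants concrete, here is a minimal sketch of how the retry and delay values could translate into wait times. It assumes the delay for attempt n is RETRY_BACKOFF_BASE ** n, which reproduces the 2s, 4s, 8s sequence in the table; the helper names backoff_delays() and polite_delay() and the expanded name REQUEST_DELAY_MAX are illustrative and may not match base_parser.py.

```python
import random

# Defaults copied from the configuration table.
REQUEST_DELAY_MIN = 2.0
REQUEST_DELAY_MAX = 5.0
MAX_RETRIES = 3
RETRY_BACKOFF_BASE = 2.0


def backoff_delays(max_retries: int = MAX_RETRIES,
                   base: float = RETRY_BACKOFF_BASE) -> list:
    """Seconds to wait before retry 1, 2, ... (assumed formula: base ** attempt)."""
    return [base ** attempt for attempt in range(1, max_retries + 1)]


def polite_delay() -> float:
    """Random pause between listing fetches, within the configured bounds."""
    return random.uniform(REQUEST_DELAY_MIN, REQUEST_DELAY_MAX)
```

With the defaults, backoff_delays() yields 2, 4, and 8 seconds, so a request that fails every retry waits 14 seconds in total before giving up.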

CANONICAL_NEIGHBOURHOODS is a hardcoded list of major neighbourhood names across Lagos, Abuja, and Port Harcourt. The normaliser uses it for fuzzy address matching.
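
As a rough illustration of fuzzy address matching, the sketch below uses difflib from the standard library. Whether the normaliser actually uses difflib is an assumption, as are the match_neighbourhood() name, the 0.8 cutoff, and the five-name subset standing in for the real list in config.py.

```python
import difflib
from typing import Optional

# Illustrative subset only; the real list lives in config.py.
CANONICAL_NEIGHBOURHOODS = ["Lekki", "Ikoyi", "Victoria Island", "Maitama", "Wuse"]


def match_neighbourhood(address: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the canonical name closest to any comma-separated part of the address."""
    lookup = {name.lower(): name for name in CANONICAL_NEIGHBOURHOODS}
    for part in address.split(","):
        candidate = part.strip().lower()
        hits = difflib.get_close_matches(candidate, list(lookup), n=1, cutoff=cutoff)
        if hits:
            return lookup[hits[0]]
    return None  # no part of the address is close enough to a known name
```

This tolerates small misspellings such as "Ikoyii" or "Victoria Iland". A high cutoff keeps unrelated address parts from matching, at the cost of missing longer variants of a name.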


Data model

There are exactly two data objects. Everything flows through them.

RawListing: produced by a parser. Every field is a string or None. No interpretation, no typing. A direct mirror of what was in the HTML.

NormalisedListing: produced by the normaliser from a RawListing. Every field is typed and ready for the database.
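
A sketch of what the two objects might look like as dataclasses. Only source, external_id, price_raw, and price_parse_failed are names that appear elsewhere in this README; every other field is a hypothetical placeholder, and the real definitions are in scraper/models.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawListing:
    """Untyped mirror of the HTML: every field is a string or None."""
    source: str
    external_id: str
    url: str                            # hypothetical field
    price_raw: Optional[str] = None
    bedrooms_raw: Optional[str] = None  # hypothetical field
    address_raw: Optional[str] = None   # hypothetical field


@dataclass
class NormalisedListing:
    """Typed values ready for the database."""
    source: str
    external_id: str
    url: str                             # hypothetical field
    price_kobo: Optional[int] = None     # hypothetical field name
    price_parse_failed: bool = False
    bedrooms: Optional[int] = None       # hypothetical field
    neighbourhood: Optional[str] = None  # hypothetical field
    lat: Optional[float] = None          # hypothetical field
    lng: Optional[float] = None          # hypothetical field
```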

Key normalisation rules:

  • All prices are stored as kobo (integer). ₦45,000,000 = 4_500_000_000. Never floats, never naira.
  • All floor areas are stored in square metres. Sqft inputs are converted automatically.
  • price_parse_failed = True when a price string exists but cannot be interpreted.
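
The rules above can be sketched as two small helpers. This is not the code in normaliser.py: the function names, the accepted K/M/B suffixes, and the regular expression are assumptions chosen to cover the examples in this README.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

KOBO_PER_NAIRA = 100
SQM_PER_SQFT = 0.09290304  # exact: 0.3048 m per foot, squared
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_PRICE_RE = re.compile(r"^\s*₦?\s*([\d,]*\.?\d+)\s*([KMB]?)\s*$", re.IGNORECASE)


def parse_price_kobo(raw: Optional[str]) -> Tuple[Optional[int], bool]:
    """Return (price_kobo, price_parse_failed) for a raw price string."""
    if raw is None or not raw.strip():
        return None, False  # no price string at all is not a parse failure
    match = _PRICE_RE.match(raw)
    if not match:
        return None, True   # a string exists but cannot be interpreted
    number = Decimal(match.group(1).replace(",", ""))
    multiplier = _MULTIPLIERS[match.group(2).upper()]
    # Decimal arithmetic avoids float rounding before the final integer.
    return int(number * multiplier * KOBO_PER_NAIRA), False


def sqft_to_sqm(sqft: float) -> float:
    """Convert a floor area from square feet to square metres."""
    return sqft * SQM_PER_SQFT
```

Both "₦45M" and "₦45,000,000" come out as 4_500_000_000 kobo, while a string such as "Price on request" sets the failure flag without producing a price.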

Database schema

All tables live in the raw_data schema in the database.

| Table | Purpose |
| --- | --- |
| raw_data.scraped_listings | One row per listing, holding its current state. UNIQUE(source, external_id). |
| raw_data.listing_history | One row per event: LISTED, PRICE_CHANGE, REMOVED. |
| raw_data.geocode_cache | Persistent (neighbourhood, city) → (lat, lng) cache. |
| raw_data.scrape_runs | Operational log: one row per portal per run. |

Listings have a listing_status of ACTIVE or REMOVED. A listing is marked suspected_sold when it disappears after 30+ days active with at least one downward price change in its history. This is a proxy signal for transaction history, not a confirmed sale.
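
The suspected_sold rule can be expressed as a short predicate. The sketch below is an assumption about shape, not the logic in db_writer.py: the function name and arguments are hypothetical, and the price history is taken to be a chronological list of prices in kobo.

```python
from datetime import date
from typing import Sequence

SUSPECTED_SOLD_MIN_DAYS = 30  # default from config.py


def is_suspected_sold(first_seen: date, removed_on: date,
                      price_history_kobo: Sequence[int]) -> bool:
    """True when a removed listing was active long enough and had a price drop."""
    days_active = (removed_on - first_seen).days
    had_price_drop = any(
        later < earlier
        for earlier, later in zip(price_history_kobo, price_history_kobo[1:])
    )
    return days_active >= SUSPECTED_SOLD_MIN_DAYS and had_price_drop
```

A listing that was active for 45 days and dropped its price once is flagged; one that only raised its price, or that vanished after 10 days, is not.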


Testing

All tests are fully offline. No live network calls, no real database connections.

pytest                                                     # all tests
pytest tests/test_normaliser.py                            # fastest: pure logic, no fixtures
pytest tests/test_parsers.py                               # parser selectors against fixture HTML
pytest tests/test_geocoder.py                              # cache and mocked Nominatim
pytest tests/test_db_writer.py                             # upsert logic and suspected_sold
pytest tests/test_pipeline.py                              # full parser → normalise → geocode chain

# Single class
pytest tests/test_parsers.py::TestPropertyProParser -v

# Single test
pytest tests/test_parsers.py::TestPropertyProParser::test_price_raw -v

Run pytest from the project root, not from inside tests/.


Maintenance

When a portal changes its HTML

This happens regularly. The symptom is 0 listings or ❌ in the Telegram notification.

  1. Run ./run.sh to confirm which portal is failing
  2. Open a live listing in your browser, open DevTools → Inspector
  3. Find the element wrapping the broken field
  4. Update the relevant selector constant at the top of scraper/parsers/<portal>.py
  5. Save the page source to tests/fixtures/<portal>_listing.html
  6. Run pytest tests/test_parsers.py::Test<Portal>Parser -v to confirm

Adding a new portal

  1. Create scraper/parsers/newportal.py: subclass BaseParser and implement source, base_url, search_url, get_listing_urls(), parse_listing(), next_page_url()
  2. Add NewPortalParser(active_listings) to the parsers list in orchestrator.py
  3. Save a fixture HTML page to tests/fixtures/newportal_listing.html
  4. Add a TestNewPortalParser class to test_parsers.py
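
The steps above might look roughly like the skeleton below. The BaseParser here is a stand-in defined only so the example runs on its own: the real class in scraper/parsers/base_parser.py also handles HTTP and pagination, and its method signatures may differ. The example.com-style URL, the regular expressions, and the dict returned by parse_listing() are all placeholders.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class BaseParser(ABC):
    """Stand-in for the project's BaseParser (assumed signatures)."""
    source: str
    base_url: str
    search_url: str

    def __init__(self, active_listings: Optional[Dict[str, dict]] = None):
        self.active_listings = active_listings or {}

    @abstractmethod
    def get_listing_urls(self, html: str) -> List[str]: ...

    @abstractmethod
    def parse_listing(self, html: str, url: str) -> Dict[str, Optional[str]]: ...

    @abstractmethod
    def next_page_url(self, html: str) -> Optional[str]: ...


class NewPortalParser(BaseParser):
    source = "newportal"
    base_url = "https://newportal.example"  # placeholder domain
    search_url = base_url + "/for-sale"

    # Selector constants kept at the top of the class so they are easy to update.
    LISTING_LINK_RE = re.compile(r'href="(/listing/[^"]+)"')
    NEXT_LINK_RE = re.compile(r'<a[^>]*rel="next"[^>]*href="([^"]+)"')
    PRICE_RE = re.compile(r'<span class="price">([^<]*)</span>')

    def get_listing_urls(self, html: str) -> List[str]:
        return [self.base_url + path for path in self.LISTING_LINK_RE.findall(html)]

    def parse_listing(self, html: str, url: str) -> Dict[str, Optional[str]]:
        match = self.PRICE_RE.search(html)
        price_raw = match.group(1).strip() if match else None
        return {"source": self.source, "url": url, "price_raw": price_raw}

    def next_page_url(self, html: str) -> Optional[str]:
        match = self.NEXT_LINK_RE.search(html)
        return self.base_url + match.group(1) if match else None
```

Because every method takes HTML as a string, the class can be exercised against a saved fixture page without any network access, which is what the parser tests rely on.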

Updating the neighbourhood list

Edit CANONICAL_NEIGHBOURHOODS in config.py, adding canonical area names as needed.


Common errors

Portal returns 0 listings

Most likely cause is selector drift (portal changed its HTML). Less common causes: 403 from bot detection (run from a residential IP), Playwright timeout on Jiji (increase SELECTOR_TIMEOUT).

ModuleNotFoundError: No module named 'scraper'

Run pytest from the project root. Verify the root conftest.py exists.

psycopg2-binary build failure

You are likely on Python 3.13. Use Python 3.12 (pyenv install 3.12.9 && pyenv local 3.12.9), delete .venv, and re-run. Alternatively upgrade to psycopg2-binary==2.9.10 which ships 3.13 wheels.

Geocoder always returns geocoded=False

Check that NOMINATIM_AGENT in geocoder.py contains a real application name and contact email. Also note: the first run after a fresh install makes one API call per unique neighbourhood (~90 seconds for 80 neighbourhoods). This is normal: subsequent runs hit the cache and make zero API calls.
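
The cache behaviour described here can be sketched as a cache-first lookup. In this illustration a plain dict stands in for the raw_data.geocode_cache table and the lookup callable stands in for the Nominatim request; the CachedGeocoder class and its api_calls counter are hypothetical names, not the interface of geocoder.py.

```python
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[float, float]
Key = Tuple[str, str]


class CachedGeocoder:
    """Cache-first geocoding: the remote lookup runs once per unique key."""

    def __init__(self, lookup: Callable[[str, str], Optional[Coords]],
                 cache: Optional[Dict[Key, Coords]] = None):
        self._lookup = lookup  # stand-in for the Nominatim call
        self._cache = {} if cache is None else cache
        self.api_calls = 0

    def geocode(self, neighbourhood: str, city: str) -> Optional[Coords]:
        key = (neighbourhood.strip().lower(), city.strip().lower())
        if key in self._cache:
            return self._cache[key]  # cache hit: no API call
        self.api_calls += 1
        coords = self._lookup(neighbourhood, city)
        if coords is not None:
            self._cache[key] = coords  # only successful lookups are cached
        return coords
```

On a first run every unique (neighbourhood, city) pair costs one lookup; repeated listings in the same area, and all later runs that share the cache, cost none.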

About

Intel-gathering Infrastructure for major Nigerian Property Listing Portals
