Property Scraper is an autonomous, lightweight, Python-based data pipeline that extracts real property listing data from public Nigerian property portals, cleans and geocodes it, and persists it to a PostgreSQL instance on Supabase on a weekly schedule. At the end of every run it sends a Telegram message summarising what happened.
Property Scraper has two run modes: Discovery Runs and Health Checks.
Discovery Runs find new property listings and store them in the database.
Health Checks revisit known listings to detect price changes and removals, and log the result.
Together the two modes define the lifecycle of a listing: listings enter the pipeline through a Discovery Run and exit through a Health Check.
Each Discovery Run proceeds through seven stages:
| Stage | Description |
|---|---|
| 1. Snapshot | Loads all ACTIVE listings from the DB into memory |
| 2. Scrape | Fetches listings from all three supported portals |
| 3. Normalise | Converts raw strings ("₦45M", "3 Beds") into typed Python values |
| 4. Geocode | Attaches lat/lng coordinates via neighbourhood name lookup |
| 5. Upsert | Inserts new listings, updates existing ones, and emits history events |
| 6. Log | Writes one row per portal to scrape_runs |
| 7. Notify | Sends a Telegram summary with counts and status per portal |
The three portals scraped are PropertyPro.ng, PrivateProperty.ng, and NigeriaPropertyCentre.ng.
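The Snapshot and Upsert stages amount to diffing freshly scraped listings against the in-memory snapshot. A minimal sketch, assuming listings are keyed by (source, external_id) and compared on price alone; the function name and dictionary shapes are illustrative and are not taken from db_writer.py:

```python
from typing import Dict, List, Tuple

ListingKey = Tuple[str, str]  # (source, external_id), assumed key shape


def diff_against_snapshot(
    snapshot: Dict[ListingKey, int],
    scraped: Dict[ListingKey, int],
) -> List[Tuple[ListingKey, str]]:
    """Return (key, event_type) pairs for listings that are new or repriced.

    Both arguments map a listing key to its price in kobo.
    """
    events: List[Tuple[ListingKey, str]] = []
    for key, price_kobo in scraped.items():
        if key not in snapshot:
            events.append((key, "LISTED"))        # never seen before
        elif snapshot[key] != price_kobo:
            events.append((key, "PRICE_CHANGE"))  # known listing, new price
    return events
```

Listings that are unchanged produce no event, which keeps the history table limited to actual state transitions.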
Each health check goes through the following stages:
| Stage | Description |
|---|---|
| 1. Snapshot | Loads all ACTIVE listings from the DB into memory |
| 2. Check | Re-fetches each loaded listing from its original URL to confirm that it still exists and to detect any price change |
| 3. Log | Records a PRICE_CHANGE event for each changed price and a REMOVED event for each listing that no longer exists |
| 4. Notify | Sends a Telegram summary with the run stats and results |
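The Check and Log stages can be sketched as a pure decision function that compares the stored price with what the listing URL currently shows. The names used here (classify_health_check, page_exists, fetched_price_kobo) are hypothetical and do not come from health_checker.py:

```python
from typing import Optional


def classify_health_check(
    stored_price_kobo: Optional[int],
    page_exists: bool,
    fetched_price_kobo: Optional[int],
) -> Optional[str]:
    """Return the history event to record for one listing, or None if unchanged."""
    if not page_exists:
        return "REMOVED"
    # Only report a change when a price was actually read from the page.
    if fetched_price_kobo is not None and fetched_price_kobo != stored_price_kobo:
        return "PRICE_CHANGE"
    return None
```

Treating an unreadable price as "no change" rather than a price change is a design assumption of this sketch; it avoids logging spurious events when a selector breaks.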
property-scraper/
├── config.py                  # All configuration: env vars, constants
├── conftest.py                # Root pytest path fix
├── pytest.ini                 # pytest settings
├── run.sh                     # Local weekly run script
├── .env                       # Local credentials (never commit)
│
├── scraper/
│   ├── models.py              # RawListing and NormalisedListing dataclasses
│   ├── orchestrator.py        # Main entry point: wires all stages together
│   ├── normaliser.py          # String → typed value conversion
│   ├── geocoder.py            # Neighbourhood → lat/lng (Nominatim + cache)
│   ├── db_writer.py           # All database reads and writes
│   ├── notifier.py            # Telegram notification
│   ├── health_checker.py      # Health Check logic
│   └── parsers/
│       ├── base_parser.py     # Shared HTTP + pagination infrastructure
│       ├── propertypro.py
│       ├── privateproperty.py
│       └── nigeriapropertycentre.py
│
├── schema/
│   ├── 001_raw_data_schema.sql        # DB tables and indexes (run once)
│   └── 002_add_health_check_at.sql    # Migration to track health checks
│
└── tests/
    ├── conftest.py
    ├── fixtures/              # Saved HTML from each portal
    ├── test_parsers.py
    ├── test_normaliser.py
    ├── test_geocoder.py
    ├── test_db_writer.py
    └── test_pipeline.py
- Python 3.10–3.12 (3.13 is not supported: psycopg2-binary has no wheel for it yet)
- A database with the schema applied
- Optional: a Telegram bot token and chat ID for run notifications
1. Clone and create a virtual environment
git clone <repo-url>
cd property-scraper
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2. Create a .env file in the project root
DATABASE_URL=postgresql://postgres.[project-id]:[password]@...
TELEGRAM_BOT_TOKEN=1234567890:ABCDEF... # optional
TELEGRAM_CHAT_ID=123456789 # optional
3. Apply the database schema
psql $DATABASE_URL -f schema/001_raw_data_schema.sql
4. Update the Nominatim user-agent
Open scraper/geocoder.py and set NOMINATIM_AGENT to a real application name and contact email. Nominatim blocks generic or empty user-agents.
./run.sh
run.sh handles the full lifecycle: it checks .env, resolves the Python version, creates or rebuilds .venv if needed, installs dependencies, applies the schema (idempotent), and runs the pipeline. On failure it prints the last 30 lines of scraper.log.
You can also run the pipeline directly:
python3 -m scraper.orchestrator
./run.sh --health-check
With the --health-check flag, run.sh loads all active listings from the database and checks each one. If a listing's price has changed, a PRICE_CHANGE event is appended to the listing_history table; if the listing is no longer present, a REMOVED event is appended instead.
All configuration is in config.py. The key constants:
| Constant | Default | Purpose |
|---|---|---|
| REQUEST_DELAY_MIN / MAX | 2.0 / 5.0 s | Random delay between listing fetches |
| MAX_RETRIES | 3 | Retries per request (exponential backoff) |
| RETRY_BACKOFF_BASE | 2.0 | Backoff base in seconds (2s, 4s, 8s) |
| PAGINATION_STOP_AFTER_KNOWN | 5 | Stop paginating after 5 consecutive known listings |
| MISSED_RUN_REMOVAL_THRESHOLD | 3 | Runs absent before a listing is marked REMOVED |
| SUSPECTED_SOLD_MIN_DAYS | 30 | Minimum days active to flag a removal as a suspected sale |
| UPSERT_BATCH_SIZE | 200 | DB write batch size |
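To make the request-pacing constants concrete, here is a minimal sketch of how they could combine. The constant values are the defaults from the table; the helper functions (backoff_delays, polite_delay, should_stop_paginating) are hypothetical names, not the ones used in base_parser.py:

```python
import random
from typing import Iterable, List

REQUEST_DELAY_MIN = 2.0
REQUEST_DELAY_MAX = 5.0
MAX_RETRIES = 3
RETRY_BACKOFF_BASE = 2.0
PAGINATION_STOP_AFTER_KNOWN = 5


def backoff_delays(max_retries: int = MAX_RETRIES,
                   base: float = RETRY_BACKOFF_BASE) -> List[float]:
    """Seconds to wait before retry 1, 2, 3, ...: base ** attempt."""
    return [base ** attempt for attempt in range(1, max_retries + 1)]


def polite_delay() -> float:
    """Random pause between listing fetches, within the configured bounds."""
    return random.uniform(REQUEST_DELAY_MIN, REQUEST_DELAY_MAX)


def should_stop_paginating(known_flags: Iterable[bool],
                           threshold: int = PAGINATION_STOP_AFTER_KNOWN) -> bool:
    """True once `threshold` consecutive listings were already known."""
    streak = 0
    for is_known in known_flags:
        streak = streak + 1 if is_known else 0  # any new listing resets the streak
        if streak >= threshold:
            return True
    return False
```

With the defaults, backoff_delays() yields the 2s, 4s, 8s sequence described in the table.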
CANONICAL_NEIGHBOURHOODS is a hardcoded list of major neighbourhood names across Lagos, Abuja, and Port Harcourt. The normaliser uses it for fuzzy address matching.
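One way such fuzzy matching could work is an exact substring check followed by a similarity search over each comma-separated part of the address. This sketch uses difflib from the standard library, which is an assumption: the source does not say which matching technique the normaliser uses, and the five-name list below is a shortened illustrative sample.

```python
import difflib
from typing import Optional

# Shortened sample; the real list lives in config.py.
CANONICAL_NEIGHBOURHOODS = ["Lekki", "Ikoyi", "Victoria Island", "Maitama", "Wuse"]


def match_neighbourhood(address: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a free-text address to a canonical neighbourhood name, or None."""
    lowered = address.lower()

    # 1. Cheap path: the canonical name appears verbatim in the address.
    for name in CANONICAL_NEIGHBOURHOODS:
        if name.lower() in lowered:
            return name

    # 2. Fuzzy path: compare each comma-separated part against the list.
    by_lower = {name.lower(): name for name in CANONICAL_NEIGHBOURHOODS}
    for part in lowered.split(","):
        hits = difflib.get_close_matches(part.strip(), list(by_lower), n=1, cutoff=cutoff)
        if hits:
            return by_lower[hits[0]]
    return None
```

The cutoff of 0.8 is an arbitrary starting point; a lower value catches more misspellings at the cost of more false matches.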
There are exactly two data objects. Everything flows through them.
RawListing: produced by a parser. Every field is a string or None. No interpretation, no typing. A direct mirror of what was in the HTML.
NormalisedListing: produced by the normaliser from a RawListing. Every field is typed and ready for the database.
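The split between the two objects might look like the following. Only source, external_id, price_raw, and price_parse_failed are names that appear elsewhere in this README; the remaining fields are illustrative guesses, and the real dataclasses in models.py will have more of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawListing:
    """Untyped mirror of the HTML: every scraped field is a string or None."""
    source: str
    external_id: str
    url: str
    price_raw: Optional[str] = None      # e.g. "₦45M"
    bedrooms_raw: Optional[str] = None   # e.g. "3 Beds"
    address_raw: Optional[str] = None


@dataclass
class NormalisedListing:
    """Typed, database-ready form of a RawListing."""
    source: str
    external_id: str
    url: str
    price_kobo: Optional[int] = None     # integer kobo, never floats
    bedrooms: Optional[int] = None
    neighbourhood: Optional[str] = None
    price_parse_failed: bool = False
```

Keeping the raw object free of interpretation means a parser bug and a normaliser bug can be diagnosed separately.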
Key normalisation rules:
- All prices are stored as kobo (integer). ₦45,000,000 = 4_500_000_000. Never floats, never naira.
- All floor areas are stored in square metres. Sqft inputs are converted automatically.
- price_parse_failed = True when a price string exists but cannot be interpreted.
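The three rules above can be sketched as follows. This is not the code in normaliser.py: the function names, the regular expression, and the set of recognised suffixes are assumptions made for illustration. Decimal is used so that the naira amount never passes through a float.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

SQFT_TO_SQM = 0.09290304  # exact, since 1 ft = 0.3048 m

_MULTIPLIERS = {
    "k": 1_000, "thousand": 1_000,
    "m": 1_000_000, "million": 1_000_000,
    "b": 1_000_000_000, "billion": 1_000_000_000,
}
_PRICE_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|k|m|b)?\b",
    re.IGNORECASE,
)


def parse_price_kobo(raw: Optional[str]) -> Tuple[Optional[int], bool]:
    """Return (price_kobo, price_parse_failed) for a raw price string."""
    if raw is None or not raw.strip():
        return None, False            # no price string at all: not a failure
    match = _PRICE_RE.search(raw)
    if match is None:
        return None, True             # a string exists but holds no number
    naira = Decimal(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    naira *= _MULTIPLIERS.get(suffix, 1)
    return int(naira * 100), False    # 1 naira = 100 kobo


def to_square_metres(value: float, unit: str) -> float:
    """Convert a floor area to square metres, rounded to 2 decimal places."""
    if unit.strip().lower() in {"sqft", "sq ft", "ft2"}:
        value *= SQFT_TO_SQM
    return round(value, 2)
```

For example, parse_price_kobo("₦45M") and parse_price_kobo("₦45,000,000") both return (4_500_000_000, False), while parse_price_kobo("Price on request") returns (None, True).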
All tables live in the raw_data schema in the database.
| Table | Purpose |
|---|---|
| raw_data.scraped_listings | One row per listing, holding its current state. UNIQUE(source, external_id). |
| raw_data.listing_history | One row per event: LISTED, PRICE_CHANGE, REMOVED. |
| raw_data.geocode_cache | Persistent (neighbourhood, city) → (lat, lng) cache. |
| raw_data.scrape_runs | Operational log, one row per portal per run. |
Listings have a listing_status of ACTIVE or REMOVED. A listing is marked suspected_sold when it disappears after 30+ days active with at least one downward price change in its history. This is a proxy signal for transaction history, not a confirmed sale.
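The suspected_sold rule reduces to two conditions that must both hold. A minimal sketch, assuming the price history is available as a chronological list of kobo values; the function name and argument shapes are hypothetical and not taken from db_writer.py:

```python
from datetime import date
from typing import Sequence

SUSPECTED_SOLD_MIN_DAYS = 30


def is_suspected_sold(first_seen: date, removed_on: date,
                      price_history_kobo: Sequence[int]) -> bool:
    """True if the listing was active long enough and its price dropped at least once."""
    days_active = (removed_on - first_seen).days
    # Compare each price with the one that followed it.
    had_price_drop = any(
        later < earlier
        for earlier, later in zip(price_history_kobo, price_history_kobo[1:])
    )
    return days_active >= SUSPECTED_SOLD_MIN_DAYS and had_price_drop
```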
All tests are fully offline. No live network calls, no real database connections.
pytest # all tests
pytest tests/test_normaliser.py # fastest β pure logic, no fixtures
pytest tests/test_parsers.py # parser selectors against fixture HTML
pytest tests/test_geocoder.py # cache and mocked Nominatim
pytest tests/test_db_writer.py # upsert logic and suspected_sold
pytest tests/test_pipeline.py # full parser → normalise → geocode chain
# Single class
pytest tests/test_parsers.py::TestPropertyProParser -v
# Single test
pytest tests/test_parsers.py::TestPropertyProParser::test_price_raw -v
Run pytest from the project root, not from inside tests/.
Portals change their HTML regularly, which breaks parser selectors. The symptom is 0 listings or a failure marker for that portal in the Telegram notification.
- Run ./run.sh to confirm which portal is failing
- Open a live listing in your browser, then open DevTools → Inspector
- Find the element wrapping the broken field
- Update the relevant selector constant at the top of scraper/parsers/<portal>.py
- Save the page source to tests/fixtures/<portal>_listing.html
- Run pytest tests/test_parsers.py::Test<Portal>Parser -v to confirm
- Create scraper/parsers/newportal.py: subclass BaseParser and implement source, base_url, search_url, get_listing_urls(), parse_listing(), and next_page_url()
- Add NewPortalParser(active_listings) to the parsers list in orchestrator.py
- Save a fixture HTML page to tests/fixtures/newportal_listing.html
- Add a TestNewPortalParser class to test_parsers.py
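A skeleton for the first step might look like this. The BaseParser below is a stand-in written for this example: the real class in base_parser.py may use different method signatures and return types. The domain, URL paths, and regex selectors are placeholders, and regex-based HTML parsing is used only to keep the sketch dependency-free.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class BaseParser(ABC):
    """Stand-in for scraper/parsers/base_parser.py (signatures are assumed)."""
    source: str
    base_url: str
    search_url: str

    def __init__(self, active_listings):
        self.active_listings = active_listings

    @abstractmethod
    def get_listing_urls(self, search_html: str) -> List[str]: ...

    @abstractmethod
    def parse_listing(self, url: str, html: str) -> Dict[str, Optional[str]]: ...

    @abstractmethod
    def next_page_url(self, search_html: str) -> Optional[str]: ...


class NewPortalParser(BaseParser):
    source = "newportal"
    base_url = "https://newportal.example"      # placeholder domain
    search_url = base_url + "/for-sale?page=1"  # placeholder path

    # Selector constants kept at the top so they are easy to update.
    LISTING_LINK_RE = re.compile(r'href="(/listing/[^"]+)"')
    NEXT_LINK_RE = re.compile(r'<a[^>]*rel="next"[^>]*href="([^"]+)"')
    PRICE_RE = re.compile(r'<span class="price">([^<]*)</span>')

    def get_listing_urls(self, search_html: str) -> List[str]:
        return [self.base_url + path
                for path in self.LISTING_LINK_RE.findall(search_html)]

    def parse_listing(self, url: str, html: str) -> Dict[str, Optional[str]]:
        price = self.PRICE_RE.search(html)
        return {
            "source": self.source,
            "external_id": url.rstrip("/").rsplit("/", 1)[-1],
            "url": url,
            "price_raw": price.group(1).strip() if price else None,
        }

    def next_page_url(self, search_html: str) -> Optional[str]:
        link = self.NEXT_LINK_RE.search(search_html)
        return self.base_url + link.group(1) if link else None
```

Returning None from next_page_url() on the last page gives the pagination loop a clear stopping condition.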
Edit CANONICAL_NEIGHBOURHOODS in config.py, adding new canonical areas as needed.
Portal returns 0 listings
The most likely cause is selector drift (the portal changed its HTML). Less common causes are a 403 from bot detection (run from a residential IP) or a Playwright timeout on Jiji (increase SELECTOR_TIMEOUT).
ModuleNotFoundError: No module named 'scraper'
Run pytest from the project root. Verify the root conftest.py exists.
psycopg2-binary build failure
You are likely on Python 3.13. Use Python 3.12 (pyenv install 3.12.9 && pyenv local 3.12.9), delete .venv, and re-run. Alternatively upgrade to psycopg2-binary==2.9.10 which ships 3.13 wheels.
Geocoder always returns geocoded=False
Check that NOMINATIM_AGENT in geocoder.py contains a real application name and contact email. Also note: the first run after a fresh install makes one API call per unique neighbourhood (~90 seconds for 80 neighbourhoods). This is normal: subsequent runs hit the cache and make zero API calls.
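The cache behaviour described above can be illustrated with a small in-memory sketch. The real pipeline persists its cache in raw_data.geocode_cache and calls Nominatim; here the cache is a plain dict and the lookup is an injected callable, so the class name, the api_calls counter, and the key normalisation are all assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[float, float]  # (lat, lng)


class CachedGeocoder:
    """Calls `lookup` at most once per (neighbourhood, city) pair."""

    def __init__(self, lookup: Callable[[str, str], Optional[Coords]]):
        self._lookup = lookup
        self._cache: Dict[Tuple[str, str], Optional[Coords]] = {}
        self.api_calls = 0

    def geocode(self, neighbourhood: str, city: str) -> Optional[Coords]:
        # Normalise the key so "Lekki" and " lekki " share one cache entry.
        key = (neighbourhood.strip().lower(), city.strip().lower())
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._lookup(neighbourhood, city)
        return self._cache[key]
```

Failed lookups (None) are cached as well in this sketch, so an unknown neighbourhood is not retried on every listing within a run.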