## Downloading the Data (Refactored)

This notebook demonstrates the use of the `DownloadUtils` class for downloading and processing data from various sources.

The `DownloadUtils` class provides a clean, object-oriented interface for:
- Creating data directory structures
- Downloading files from URLs
- Scraping time series data from web tables
- Saving processed data to CSV files
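
As a rough illustration of the first bullet, the sketch below creates one `landing/<source>/` folder per data source. It is not the actual `DownloadUtils` code: `make_landing_dirs` and `SOURCES` are hypothetical names, and the folder names are taken from the output paths printed later in this notebook.

```python
from pathlib import Path

# Hypothetical source list, copied from the landing paths shown in the cell outputs.
SOURCES = [
    "unemployment_rate",
    "interest_rates",
    "price_data",
    "economic_activity",
    "population",
    "investment",
]


def make_landing_dirs(base_data_dir, sources=SOURCES):
    """Create <base_data_dir>/landing/<source>/ for each source and return the paths."""
    landing = Path(base_data_dir) / "landing"
    created = []
    for name in sources:
        path = landing / name
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_landing_dirs("../data/")` twice is harmless because existing folders are left untouched.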


In [15]:
%load_ext autoreload
%autoreload 2

import sys
from pathlib import Path

# Add project root to Python path
# Get the current notebook's directory and go up to project root
current_dir = Path().resolve()
if current_dir.name == 'notebooks':
    project_root = current_dir.parent
elif current_dir.name == 'project2':
    project_root = current_dir
else:
    # If we're in the parent directory, look for project2
    project_root = current_dir / 'project2'

sys.path.insert(0, str(project_root))
print(f"Project root: {project_root}")

from utils.download import DownloadUtils
from utils.preprocess import PreprocessUtils

# Initialize the downloader; paths are relative to the notebook's working directory
downloader = DownloadUtils(base_data_dir="../data/")


Project root: /Users/jackshee/University/MAST30034 Applied Data Science/project2
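
The cell above infers the project root from the name of the current directory. One possible alternative, shown here only as a sketch, is to walk upward until a folder containing `utils/` is found. `find_project_root` is a hypothetical helper, and unlike the cell above it only searches upward, so it does not cover the case where the notebook is started from the parent of `project2`.

```python
from pathlib import Path


def find_project_root(start, marker="utils"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' at or above {start}")
```

Usage would look like `project_root = find_project_root(Path.cwd())` followed by the same `sys.path.insert` call.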


### Option 2: Download Data Sources Individually

You can also download specific data sources individually for more control over the process.


#### Download Rent Time Series Data from DFFH


In [4]:
# Download the latest moving annual rent file (March 2025) with verbose output
downloader.download_latest_rent_data(verbose=True)


Downloading latest moving annual rent file: moving_annual_median_weekly_rent_by_suburb
URL: https://www.dffh.vic.gov.au/moving-annual-rent-suburb-march-quarter-2025-excel
⚠️  File already exists: moving_annual_median_weekly_rent_by_suburb.xlsx (1.08 MB)
✅ Successfully downloaded latest moving annual rent file!
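
The log above reports that the file already exists, which suggests the helper checks the destination before fetching. A minimal sketch of that skip-if-present pattern follows; `download_if_missing` is a hypothetical function and the real `download_latest_rent_data` may behave differently.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest, fetch=urlretrieve):
    """Download `url` to `dest` unless the file is already on disk.

    Returns True if a download happened, False if it was skipped.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))
    return True
```

Returning a boolean lets the caller print a different message for a skipped file than for a fresh download.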


#### Download Public Transport Data


In [5]:
# Download public transport stops and lines data with verbose output
downloader.download_public_transport_data(verbose=True)


Downloading from: https://opendata.transport.vic.gov.au/dataset/6d36dfd9-8693-4552-8a03-05eb29a391fd/resource/afa7b823-0c8b-47a1-bc40-ada565f684c7/download/public_transport_stops.geojson
✅ Downloaded public_transport_stops.geojson (7.33 MB) in 1.91s
Downloading from: https://opendata.transport.vic.gov.au/dataset/6d36dfd9-8693-4552-8a03-05eb29a391fd/resource/52e5173e-b5d5-4b65-9b98-89f225fc529c/download/public_transport_lines.geojson
✅ Downloaded public_transport_lines.geojson (362.41 MB) in 122.91s


#### Download School Locations Data


In [6]:
# Download school locations for 2023, 2024, and 2025 with verbose output
downloader.download_school_locations(verbose=True)


Downloading from: https://www.education.vic.gov.au/Documents/about/research/datavic/dv346-schoollocations2023.csv
✅ Downloaded school_locations_2023.csv (0.40 MB) in 0.57s
Downloading from: https://www.education.vic.gov.au/Documents/about/research/datavic/dv378_DataVic-SchoolLocations-2024.csv
✅ Downloaded school_locations_2024.csv (0.48 MB) in 0.51s
Downloading from: https://www.education.vic.gov.au/Documents/about/research/datavic/dv402-SchoolLocations2025.csv
✅ Downloaded school_locations_2025.csv (0.51 MB) in 0.53s


#### Scrape Macroeconomic Time Series Data


In [7]:
# Scrape unemployment rate data with verbose output
downloader.scrape_unemployment_data(verbose=True)


=== SCRAPING UNEMPLOYMENT RATE DATA ===
Scraping unemployment_rate data from: https://djsir-data.github.io/djprecodash/tables/djsir_labour_market
✅ Scraped unemployment_rate data (191 records) in 2.77s
Successfully scraped and saved unemployment_rate data to ../data/landing/unemployment_rate/quarterly_unemployment_rate.csv
Data contains 191 quarterly records
Date range: 1978-03-01 00:00:00 to 2025-09-01 00:00:00

First 5 records:
        date  year  quarter  Unemployment rate (%)
0 1978-03-01  1978        1               5.842250
1 1978-06-01  1978        2               5.552600
2 1978-09-01  1978        3               5.550233
3 1978-12-01  1978        4               5.575700
4 1979-03-01  1979        1               5.600167

Last 5 records:
          date  year  quarter  Unemployment rate (%)
186 2024-09-01  2024        3               4.473833
187 2024-12-01  2024        4               4.378333
188 2025-03-01  2025        1               4.544433
189 2025-06-01  2025        2  

'../data/landing/unemployment_rate/quarterly_unemployment_rate.csv'
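
In the output above, each quarter is dated to the first day of its final month (March, June, September, December), and the `year` and `quarter` columns follow from that date. A small sketch of the mapping in plain Python; `quarter_fields` is a hypothetical helper, not part of `DownloadUtils`.

```python
from datetime import date


def quarter_fields(d):
    """Derive the year and quarter (1-4) columns from a date."""
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
    return {"date": d, "year": d.year, "quarter": (d.month - 1) // 3 + 1}


row = quarter_fields(date(1978, 12, 1))
print(row["year"], row["quarter"])  # 1978 4
```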

In [8]:
# Scrape interest rates data with verbose output
downloader.scrape_interest_rates_data(verbose=True)


=== SCRAPING INTEREST RATES DATA ===
Scraping interest_rates data from: https://djsir-data.github.io/djprecodash/tables/djsir_interest_rates
✅ Scraped interest_rates data (181 records) in 1.11s
Successfully scraped and saved interest_rates data to ../data/landing/interest_rates/quarterly_interest_rates.csv
Data contains 181 quarterly records
Date range: 1990-03-01 00:00:00 to 2035-09-01 00:00:00

First 5 records:
        date  year  quarter  Mortgage rates (%)  Savings rates (%)  \
0 1990-03-01  1990        1                 NaN                NaN   
1 1990-06-01  1990        2                 NaN                NaN   
2 1990-09-01  1990        3                 NaN                NaN   
3 1990-12-01  1990        4                 NaN                NaN   
4 1991-03-01  1991        1                 NaN                NaN   

   Cash rate (%)  
0      16.666667  
1      15.000000  
2      14.333333  
3      12.666667  
4      12.000000  

Last 5 records:
          date  year  quarter  

'../data/landing/interest_rates/quarterly_interest_rates.csv'

In [9]:
# Scrape price data with verbose output
downloader.scrape_price_data(verbose=True)


=== SCRAPING PRICE DATA ===
Scraping price_data data from: https://djsir-data.github.io/djprecodash/tables/djsir_prices
✅ Scraped price_data data (304 records) in 0.69s
Successfully scraped and saved price_data data to ../data/landing/price_data/quarterly_price_data.csv
Data contains 304 quarterly records
Date range: 1949-09-01 00:00:00 to 2025-06-01 00:00:00

First 5 records:
        date  year  quarter  CPI (%/y)  WPI (%/y)  PPI, Final Demand (%/y)
0 1949-09-01  1949        3        7.9        NaN                      NaN
1 1949-12-01  1949        4       10.5        NaN                      NaN
2 1950-03-01  1950        1       10.3        NaN                      NaN
3 1950-06-01  1950        2       10.0        NaN                      NaN
4 1950-09-01  1950        3        9.8        NaN                      NaN

Last 5 records:
          date  year  quarter  CPI (%/y)  WPI (%/y)  PPI, Final Demand (%/y)
299 2024-06-01  2024        2        3.7        3.3                      4.8

'../data/landing/price_data/quarterly_price_data.csv'

In [10]:
# Scrape economic activity data with verbose output
downloader.scrape_economic_activity_data(verbose=True)


=== SCRAPING ECONOMIC ACTIVITY DATA ===
Scraping economic_activity data from: https://djsir-data.github.io/djprecodash/tables/djsir_economic_activity
✅ Scraped economic_activity data (156 records) in 0.65s
Successfully scraped and saved economic_activity data to ../data/landing/economic_activity/quarterly_economic_activity.csv
Data contains 156 quarterly records
Date range: 1986-09-01 00:00:00 to 2025-06-01 00:00:00

First 5 records:
        date  year  quarter  SFD (%/y)  GSP quarterly components (%/y)
0 1986-09-01  1986        3     2.8931                             NaN
1 1986-12-01  1986        4     2.5560                             NaN
2 1987-03-01  1987        1     1.1749                             NaN
3 1987-06-01  1987        2     1.7426                             NaN
4 1987-09-01  1987        3     2.3537                             NaN

Last 5 records:
          date  year  quarter  SFD (%/y)  GSP quarterly components (%/y)
151 2024-06-01  2024        2     1.4268      

'../data/landing/economic_activity/quarterly_economic_activity.csv'

In [11]:
# Scrape population dynamics data with verbose output
downloader.scrape_population_data(verbose=True)


=== SCRAPING POPULATION DATA ===
Scraping population data from: https://djsir-data.github.io/djprecodash/tables/djsir_contribution_to_population_growth
✅ Scraped population data (172 records) in 0.52s
Successfully scraped and saved population_dynamics data to ../data/landing/population/quarterly_population_dynamics.csv
Data contains 172 quarterly records
Date range: 1982-06-01 00:00:00 to 2025-03-01 00:00:00

First 5 records:
        date  year  quarter  Population (%/y)  Natural increase (pp/y)  \
0 1982-06-01  1982        2            1.1643                   0.7689   
1 1982-09-01  1982        3            1.1472                   0.7278   
2 1982-12-01  1982        4            1.1160                   0.7354   
3 1983-03-01  1983        1            1.1160                   0.7228   
4 1983-06-01  1983        2            1.0727                   0.7358   

   Net overseas migration (pp/y)  Net interstate migration (pp/y)  
0                         0.7891                         

'../data/landing/population/quarterly_population_dynamics.csv'

In [12]:
# Scrape investment data with verbose output
downloader.scrape_investment_data(verbose=True)


=== SCRAPING INVESTMENT DATA ===
Scraping investment data from: https://djsir-data.github.io/djprecodash/tables/djsir_contribution_to_growth
✅ Scraped investment data (156 records) in 0.60s
Successfully scraped and saved investment data to ../data/landing/investment/quarterly_investment.csv
Data contains 156 quarterly records
Date range: 1986-09-01 00:00:00 to 2025-06-01 00:00:00

First 5 records:
        date  year  quarter  State final demand (%/y)  \
0 1986-09-01  1986        3                    2.8931   
1 1986-12-01  1986        4                    2.5560   
2 1987-03-01  1987        1                    1.1749   
3 1987-06-01  1987        2                    1.7426   
4 1987-09-01  1987        3                    2.3537   

   Household consumption (pp/y)  Dwelling investment (pp/y)  \
0                        2.3367                     -1.6844   
1                        1.3148                     -0.9639   
2                        2.2443                     -0.6434   
3   

'../data/landing/investment/quarterly_investment.csv'

#### Download Population Census Data from ABS


In [16]:
# Download Excel files for SAL codes 20001-22944 from the ABS
# Note: this takes a very long time and generates a lot of output
no_data_list = downloader.download_population_census_data(verbose=True)


=== DOWNLOADING POPULATION CENSUS DATA ===
❌ Error for SAL 20001: HTTP Error 404: Not Found
❌ Error for SAL 20006: HTTP Error 504: Gateway Time-out
❌ Error for SAL 20012: HTTP Error 504: Gateway Time-out
❌ Error for SAL 20016: HTTP Error 504: Gateway Time-out
❌ Error for SAL 20055: HTTP Error 404: Not Found
❌ Error for SAL 20059: HTTP Error 404: Not Found
❌ Error for SAL 20060: HTTP Error 404: Not Found
✅ Downloaded 50 files...
❌ Error for SAL 20083: HTTP Error 404: Not Found
❌ Error for SAL 20084: HTTP Error 404: Not Found
❌ Error for SAL 20088: HTTP Error 404: Not Found
❌ Error for SAL 20098: HTTP Error 404: Not Found
✅ Downloaded 100 files...
❌ Error for SAL 20155: HTTP Error 404: Not Found
❌ Error for SAL 20158: HTTP Error 404: Not Found
❌ Error for SAL 20167: HTTP Error 404: Not Found
✅ Downloaded 150 files...
❌ Error for SAL 20194: HTTP Error 404: Not Found
❌ Error for SAL 20199: HTTP Error 404: Not Found
✅ Downloaded 200 files...
❌ Error for SAL 20234: HTTP Error 404: Not Found
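
The log above mixes two kinds of failure: 404 responses, which plausibly mean no file exists for that SAL code, and 504 gateway timeouts, which may succeed on a retry. The sketch below separates the two. It is an assumption about how such a loop could be written, not the implementation behind `download_population_census_data`; `download_all` and its `fetch` callback are hypothetical.

```python
import time
from urllib.error import HTTPError


def download_all(codes, fetch, retries=2, backoff=1.0):
    """Call fetch(code) for each code; return (no_data, failed) lists.

    404 responses are recorded in `no_data` without retrying.
    Other HTTP errors (such as 504) are retried up to `retries` times,
    then recorded in `failed`.
    """
    no_data, failed = [], []
    for code in codes:
        for attempt in range(retries + 1):
            try:
                fetch(code)
                break  # success, move on to the next code
            except HTTPError as err:
                if err.code == 404:
                    no_data.append(code)
                    break
                if attempt == retries:
                    failed.append(code)
                else:
                    # Wait a little longer after each failed attempt.
                    time.sleep(backoff * (attempt + 1))
    return no_data, failed
```

Keeping the timed-out codes in a separate list makes it possible to rerun only those codes later instead of the full range.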


#### Domain Rental Listings Scraping

The project includes a Scrapy-based scraper for Domain.com.au rental listings. It can collect both current live listings and historical listings from Wayback Machine archives.


##### Live Listings Scraping

To scrape current rental listings from Domain.com.au:

```bash
# Navigate to the scraper directory
cd scraping/domain_scraper

# Test the scraper first (recommended)
python test_spider.py

# Run full live listings scraping
python run_spider.py
```

**Output**: `data/landing/domain/live/rental_listings_2025_09.csv`

**Features**:
- Scrapes all Victorian suburbs (3,186+ suburbs)
- Extracts detailed property information including features, amenities, and market insights
- Uses conservative settings to avoid being blocked
- Collects property images using Selenium


##### Wayback Historical Scraping

To scrape historical rental listings from 2022 Q3 to 2025 Q2:

```bash
# Navigate to the scraper directory
cd scraping/domain_scraper

# Run wayback historical scraping
python run_wayback_spider.py
```

**Output**: `data/landing/domain/wayback/rental_listings_YYYY_MM.csv` (multiple files)

**Features**:
- Scrapes quarterly snapshots from 2022 Q3 to 2025 Q2
- Uses Wayback Machine archives to access historical data
- Processes suburbs with available timestamps for each quarter
- Consolidates results into quarterly CSV files

**Note**: This process can take several hours as it processes thousands of suburbs across multiple quarters.
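
To make the quarterly range concrete, the sketch below enumerates the quarters from 2022 Q3 to 2025 Q2 and builds a query URL for the Wayback Machine availability endpoint. The helper names are hypothetical, the choice of the first day of each quarter's final month as the target timestamp is an assumption, and the project's own spider may select snapshots differently.

```python
from urllib.parse import urlencode


def quarters(start, end):
    """Yield (year, quarter) pairs from `start` to `end`, inclusive."""
    year, q = start
    while (year, q) <= end:
        yield year, q
        q += 1
        if q > 4:
            year, q = year + 1, 1


def wayback_timestamp(year, quarter):
    """Target timestamp (YYYYMMDD): first day of the quarter's final month (assumed)."""
    return f"{year}{quarter * 3:02d}01"


def availability_url(page_url, timestamp):
    """Build a Wayback Machine availability query for the snapshot closest to `timestamp`."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"https://archive.org/wayback/available?{query}"


targets = list(quarters((2022, 3), (2025, 2)))
print(len(targets))  # 12
```

Twelve quarters multiplied by thousands of suburbs is why the full run takes hours.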


##### Scraper Configuration

The scraper is configured with:
- **Concurrent requests**: 4 (conservative to avoid blocking)
- **Delay between requests**: 1 second (with randomization)
- **Auto-throttling**: Enabled, so request delays adapt to server response times
- **Expected duration**: Several hours for full scraping
- **Memory management**: Processes suburbs sequentially

For detailed configuration and troubleshooting, see `scraping/domain_scraper/SCRAPING_INSTRUCTIONS.md`.
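
For orientation, the configuration bullets above correspond to standard Scrapy settings roughly as follows. This is a sketch of what a `settings.py` fragment might contain, not a copy of the project's actual file; the values are taken from the list above.

```python
# Sketch of Scrapy settings matching the bullets above (assumed, not the project's file).

# At most 4 requests in flight at once.
CONCURRENT_REQUESTS = 4

# Base delay of 1 second between requests to the same site.
DOWNLOAD_DELAY = 1

# Scrapy waits a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
RANDOMIZE_DOWNLOAD_DELAY = True

# AutoThrottle adjusts the delay dynamically based on observed response latency.
AUTOTHROTTLE_ENABLED = True
```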
