This notebook performs basic preprocessing of the listings scraped from the Wayback Machine. It also sets up the intermediate files required for calling the Geocode API of OpenRouteService. The results of running this notebook are saved to `data/processed/domain/wayback_listings.csv`.

# Data Preprocessing and Imputation

In [28]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [29]:
import pandas as pd
import glob
import os
from pathlib import Path
from utils.preprocess import PreprocessUtils
from utils.geo import GeoUtils

# Initialize the preprocessor
preprocessor = PreprocessUtils()

# Initialize the geo utils
geo_utils = GeoUtils()


OpenRouteService client initialized successfully.


In [30]:
# Define the path to the domain folder
domain_path = "../data/raw/domain"

# Get all CSV files except rental_listings_2025_09.csv
csv_files = glob.glob(os.path.join(domain_path, "rental_listings_*.csv"))
csv_files = [f for f in csv_files if "rental_listings_2025_09.csv" not in f]

print(f"Found {len(csv_files)} files to process:")
for f in sorted(csv_files):
    print(f"  - {os.path.basename(f)}")


Found 14 files to process:
  - rental_listings_2022_03.csv
  - rental_listings_2022_06.csv
  - rental_listings_2022_09.csv
  - rental_listings_2022_12.csv
  - rental_listings_2023_03.csv
  - rental_listings_2023_06.csv
  - rental_listings_2023_09.csv
  - rental_listings_2023_12.csv
  - rental_listings_2024_03.csv
  - rental_listings_2024_06.csv
  - rental_listings_2024_09.csv
  - rental_listings_2024_12.csv
  - rental_listings_2025_03.csv
  - rental_listings_2025_06.csv


Here we assume that each Wayback listing has an `updated_dated` equal to its `scraped_date`. Because we scraped one randomly chosen day with an existing snapshot for each suburb in each quarter going back to 2022 Q1, we effectively gain additional rental price data for listings that were last updated in the past.

In [31]:
# Read all CSV files and add year and quarter columns
dataframes = []

# Map the quarter-end month in the filename to its quarter
month_to_quarter = {
    '03': 1,
    '06': 2,
    '09': 3,
    '12': 4
}

for csv_file in sorted(csv_files):
    # Extract the filename (with extension) from the path
    filename = os.path.basename(csv_file)
    # Parse filename: rental_listings_YYYY_MM.csv
    parts = filename.replace('.csv', '').split('_')
    year = parts[2]
    month = parts[3]

    # Look up the quarter; months outside the mapping are flagged as 'Unknown'
    quarter = month_to_quarter.get(month, 'Unknown')
    
    # Read the CSV file
    df = pd.read_csv(csv_file)
    
    # Add year and quarter columns
    df['year'] = int(year)
    df['quarter'] = quarter
    
    dataframes.append(df)
    print(f"Loaded {filename}: {len(df)} rows, Year={year}, Quarter={quarter}")

print(f"\nTotal dataframes loaded: {len(dataframes)}")


Loaded rental_listings_2022_03.csv: 20 rows, Year=2022, Quarter=1
Loaded rental_listings_2022_06.csv: 3047 rows, Year=2022, Quarter=2
Loaded rental_listings_2022_09.csv: 710 rows, Year=2022, Quarter=3
Loaded rental_listings_2022_12.csv: 23 rows, Year=2022, Quarter=4
Loaded rental_listings_2023_03.csv: 123 rows, Year=2023, Quarter=1
Loaded rental_listings_2023_06.csv: 98 rows, Year=2023, Quarter=2
Loaded rental_listings_2023_09.csv: 20 rows, Year=2023, Quarter=3
Loaded rental_listings_2023_12.csv: 179 rows, Year=2023, Quarter=4
Loaded rental_listings_2024_03.csv: 1372 rows, Year=2024, Quarter=1
Loaded rental_listings_2024_06.csv: 1786 rows, Year=2024, Quarter=2
Loaded rental_listings_2024_09.csv: 1460 rows, Year=2024, Quarter=3
Loaded rental_listings_2024_12.csv: 1562 rows, Year=2024, Quarter=4
Loaded rental_listings_2025_03.csv: 3340 rows, Year=2025, Quarter=1
Loaded rental_listings_2025_06.csv: 3382 rows, Year=2025, Quarter=2

Total dataframes loaded: 14
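
The month-to-quarter lookup above only covers the four quarter-end months that appear in the file names. As a minimal sketch, the quarter can also be computed arithmetically so that any valid month resolves to a quarter and malformed names fail loudly. `parse_year_quarter` is a hypothetical helper for illustration and is not part of `utils`.

```python
import re

def parse_year_quarter(filename: str) -> tuple[int, int]:
    """Parse 'rental_listings_YYYY_MM.csv' into (year, quarter)."""
    match = re.fullmatch(r"rental_listings_(\d{4})_(\d{2})\.csv", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"Invalid month in file name: {filename}")
    # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
    quarter = (month - 1) // 3 + 1
    return year, quarter

print(parse_year_quarter("rental_listings_2024_09.csv"))
```

Raising an error instead of returning `'Unknown'` keeps the `quarter` column purely numeric.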


In [32]:
# Stack all dataframes together
df = pd.concat(dataframes, ignore_index=True)

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17122 entries, 0 to 17121
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   property_id        17122 non-null  int64  
 1   url                17122 non-null  object 
 2   rental_price       17122 non-null  object 
 3   bedrooms           16994 non-null  float64
 4   bathrooms          17067 non-null  float64
 5   car_spaces         15120 non-null  float64
 6   property_type      17122 non-null  object 
 7   land_area          17122 non-null  float64
 8   property_features  17122 non-null  object 
 9   suburb             17122 non-null  object 
 10  postcode           17122 non-null  int64  
 11  scraped_date       17122 non-null  object 
 12  wayback_url        17122 non-null  object 
 13  wayback_time       17122 non-null  int64  
 14  year               17122 non-null  int64  
 15  quarter            17122 non-null  int64  
dtypes: float64(4), int64(5

In [33]:
# Map the suburbs to standardized names (used by DFFH dataset)
df['suburb'] = preprocessor.map_suburb(df['suburb'])

# Remove suburbs that have 10 or fewer listings
suburb_counts = df['suburb'].value_counts()
valid_suburbs = suburb_counts[suburb_counts > 10].index
df = df[df['suburb'].isin(valid_suburbs)]

df['suburb'].nunique()

338

In [34]:
# drop land_area column
df = df.drop(columns=['land_area'])

# convert bedrooms, bathrooms, car_spaces to Int64
df['bedrooms'] = df['bedrooms'].astype('Int64')
df['bathrooms'] = df['bathrooms'].astype('Int64')
df['car_spaces'] = df['car_spaces'].astype('Int64')


In [35]:
# impute bedrooms, bathrooms, car_spaces with preprocessor
# handle missing values by imputing by mode grouped by property_type
print("Before imputation:")
print(df[['bedrooms', 'bathrooms', 'car_spaces']].isnull().sum())

# Impute bedrooms using property_type mode
df['bedrooms'] = preprocessor.impute_by_property_type_mode(df, 'bedrooms')

# Impute bathrooms using property_type mode
df['bathrooms'] = preprocessor.impute_by_property_type_mode(df, 'bathrooms')

# Impute car_spaces using property_type mode
df['car_spaces'] = preprocessor.impute_by_property_type_mode(df, 'car_spaces')

print("\nAfter imputation:")
print(df[['bedrooms', 'bathrooms', 'car_spaces']].isnull().sum())

Before imputation:
bedrooms       108
bathrooms       42
car_spaces    1767
dtype: int64
Property type: Apartment / Unit / Flat, bedrooms imputed with 2
Property type: Townhouse, bedrooms imputed with 3
Property type: House, bedrooms imputed with 3
Property type: Studio, bedrooms imputed with 1
Property type: Duplex, bedrooms imputed with 2
Property type: Vacant land, bedrooms imputed with 4
Property type: Farm, bedrooms imputed with <NA>
Property type: Apartment / Unit / Flat, bathrooms imputed with 1
Property type: Townhouse, bathrooms imputed with 2
Property type: House, bathrooms imputed with 2
Property type: Studio, bathrooms imputed with 1
Property type: Vacant land, bathrooms imputed with 2
Property type: Farm, bathrooms imputed with <NA>
Property type: Apartment / Unit / Flat, car_spaces imputed with 1
Property type: Townhouse, car_spaces imputed with 2
Property type: House, car_spaces imputed with 2
Property type: Studio, car_spaces imputed with 1
Property type: New Apartments
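
The implementation of `impute_by_property_type_mode` is not shown in this notebook. A minimal sketch of group-wise mode imputation, assuming it fills each missing value with the most frequent value among listings of the same `property_type`, might look like the following. `impute_by_group_mode` and the toy rows are hypothetical. A group with no observed values has no mode and stays missing, which would be consistent with the `<NA>` reported for `Farm` above.

```python
import pandas as pd

def impute_by_group_mode(df, column, group_col="property_type"):
    """Fill missing values in `column` with the mode of each `group_col` group."""
    def fill_with_mode(values):
        modes = values.mode(dropna=True)
        if modes.empty:
            # No observed values in this group: nothing to impute with
            return values
        # mode() returns sorted values, so ties resolve to the smallest one
        return values.fillna(modes.iloc[0])

    return df.groupby(group_col)[column].transform(fill_with_mode)

# Made-up rows for illustration only
toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "Studio", "Studio", "Farm"],
    "bedrooms": pd.array([3, 3, None, 1, None, None], dtype="Int64"),
})
toy["bedrooms"] = impute_by_group_mode(toy, "bedrooms")
print(toy)
```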

In [36]:
# Extract weekly rent from rental_price column
df['rental_price'] = preprocessor.extract_rental_price(df['rental_price'])

# Check how many rows have unknown frequencies (NaN in rental_price after extraction)
unknown_count = df['rental_price'].isna().sum()
print(f"Rows with unknown price frequencies: {unknown_count}")

# Drop rows with unknown frequencies
df = df[df['rental_price'].notna()]

Rows with unknown price frequencies: 351
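
The implementation of `extract_rental_price` is not shown here. As a rough sketch of the idea, assuming the raw strings carry a dollar amount plus a frequency keyword, the conversion to weekly rent could look like this. The keyword patterns, the example strings, and the name `to_weekly_rent` are all assumptions made for illustration.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical patterns; the real listing strings may use other wording
AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")
WEEKLY = re.compile(r"per\s*week|weekly|\bpw\b|p/w", re.IGNORECASE)
MONTHLY = re.compile(r"per\s*month|monthly|\bpcm\b|\bpm\b", re.IGNORECASE)

def to_weekly_rent(text):
    """Return the weekly rent as a float, or NaN if amount or frequency is unknown."""
    if not isinstance(text, str):
        return np.nan
    amount = AMOUNT.search(text)
    if amount is None:
        return np.nan
    value = float(amount.group(1).replace(",", ""))
    if WEEKLY.search(text):
        return value
    if MONTHLY.search(text):
        # 12 months spread over 52 weeks
        return round(value * 12 / 52, 2)
    return np.nan  # amount found, but frequency unknown

prices = pd.Series(["$450 per week", "$1,950 pcm", "Contact agent", "$500"])
weekly = prices.map(to_weekly_rent)
print(weekly)
```

Rows that come back as NaN correspond to the "unknown price frequencies" that the cell above counts and drops.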


In [37]:
# extract address from url since we don't have the address in the summary listing data
df['address'] = df['url'].apply(geo_utils.extract_address_from_url)

# drop url and the other columns that are no longer needed
df = df.drop(columns=['url', 'property_features', 'postcode', 'scraped_date', 'wayback_url', 'wayback_time'])

df

       property_id  rental_price  bedrooms  bathrooms  car_spaces  property_type            suburb     year  quarter  address
0      15295193     240.0         1         1          1           Apartment / Unit / Flat  footscray  2022  1        51 Gordon Street, Footscray, VIC 3011
1      15624467     385.0         2         1          1           Townhouse                footscray  2022  1        Cirque Drive, Footscray, VIC 3011
2      15726089     470.0         2         1          1           House                    footscray  2022  1        10 Stanlake Street, Footscray, VIC 3011
3      15754757     410.0         2         1          1           Apartment / Unit / Flat  footscray  2022  1        42 Whitehall Street, Footscray, VIC 3011
4      15802091     650.0         3         2          2           House                    footscray  2022  1        2 Saltriver Place, Footscray, VIC 3011
...
17113  13001612     70.0          1         1          3           House                    trafalgar  2025  2        1c Dodemaides, Trafalgar, VIC 3824
17115  17530473     340.0         1         1          2           Studio                   trafalgar  2025  2        26a Waterloo Road, Trafalgar, VIC 3824
17116  17597533     500.0         3         1          2           House                    trafalgar  2025  2        Princes Way, Trafalgar, VIC 3824
17120  17618453     550.0         3         2          2           House                    trafalgar  2025  2        Cross Street, Trafalgar, VIC 3824


In [38]:
# check for property_id duplicates
df['property_id'].duplicated().sum()

# sort by year and quarter, descending, so the most recent listing comes first
df = df.sort_values(by=['year', 'quarter'], ascending=False)

# drop duplicates, keeping the first (most recent) occurrence of each property_id
df = df.drop_duplicates(subset=['property_id'], keep='first')

df.shape




(14044, 10)
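
To illustrate why the sort has to come before `drop_duplicates`, here is a small sketch on made-up rows (the IDs and prices are invented for illustration). After sorting by `year` and `quarter` in descending order, `keep='first'` retains the most recent snapshot of each property.

```python
import pandas as pd

# Made-up listings: property 101 appears in three different quarters
toy = pd.DataFrame({
    "property_id": [101, 101, 202, 101],
    "rental_price": [400.0, 450.0, 600.0, 430.0],
    "year": [2023, 2025, 2024, 2024],
    "quarter": [4, 1, 2, 3],
})

latest = (
    toy.sort_values(by=["year", "quarter"], ascending=False)
       .drop_duplicates(subset=["property_id"], keep="first")
)
print(latest)
```

Without the sort, `keep='first'` would keep whichever row happened to be loaded first, which here would be the oldest file.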

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14044 entries, 13740 to 19
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   property_id    14044 non-null  int64  
 1   rental_price   14044 non-null  float64
 2   bedrooms       14043 non-null  Int64  
 3   bathrooms      14043 non-null  Int64  
 4   car_spaces     14043 non-null  Int64  
 5   property_type  14044 non-null  object 
 6   suburb         14044 non-null  object 
 7   year           14044 non-null  int64  
 8   quarter        14044 non-null  int64  
 9   address        14044 non-null  object 
dtypes: Int64(3), float64(1), int64(3), object(3)
memory usage: 1.2+ MB


In [40]:
# drop remaining nulls
df = df.dropna()

df.info()

df.shape


<class 'pandas.core.frame.DataFrame'>
Index: 14043 entries, 13740 to 19
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   property_id    14043 non-null  int64  
 1   rental_price   14043 non-null  float64
 2   bedrooms       14043 non-null  Int64  
 3   bathrooms      14043 non-null  Int64  
 4   car_spaces     14043 non-null  Int64  
 5   property_type  14043 non-null  object 
 6   suburb         14043 non-null  object 
 7   year           14043 non-null  int64  
 8   quarter        14043 non-null  int64  
 9   address        14043 non-null  object 
dtypes: Int64(3), float64(1), int64(3), object(3)
memory usage: 1.2+ MB


(14043, 10)

In [41]:
# save to data/processed/domain/wayback_listings.csv
df.to_csv("../data/processed/domain/wayback_listings.csv", index=False)

# Fetching Coordinates / Geocoding the Addresses of Wayback Listings

We need to geocode each address to obtain its latitude and longitude. We first save the listings to `data/raw/missing_coordinates` so that we know which listings still need to be geocoded. We use the OpenRouteService Geocode API, which has a request limit of 1000 per key, so the listings are split into batches of at most 1000 rows.

In [102]:
# Split the dataframe into batches and save to output directory
output_dir = "../data/raw/missing_coordinates"
batch_size = 1000

batch_files = preprocessor.split_into_batches(df[['property_id', 'address']], batch_size, output_dir)
print(f"\nCreated {len(batch_files)} batch files")


Saved batch_0001.csv: 1000 rows
Saved batch_0002.csv: 1000 rows
Saved batch_0003.csv: 1000 rows
Saved batch_0004.csv: 1000 rows
Saved batch_0005.csv: 1000 rows
Saved batch_0006.csv: 1000 rows
Saved batch_0007.csv: 1000 rows
Saved batch_0008.csv: 1000 rows
Saved batch_0009.csv: 1000 rows
Saved batch_0010.csv: 1000 rows
Saved batch_0011.csv: 1000 rows
Saved batch_0012.csv: 1000 rows
Saved batch_0013.csv: 1000 rows
Saved batch_0014.csv: 1000 rows
Saved batch_0015.csv: 44 rows

Total batches created: 15
Output directory: ../data/raw/missing_coordinates

Created 15 batch files
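
The implementation of `split_into_batches` is not shown in this notebook. A minimal sketch, assuming it writes consecutive slices of at most `batch_size` rows to numbered CSV files, could look like this. `split_into_batches_sketch` is a hypothetical name, and the `batch_0001.csv` naming simply mirrors the output above.

```python
import tempfile
from pathlib import Path

import pandas as pd

def split_into_batches_sketch(df, batch_size, output_dir):
    """Write df to output_dir in CSV batches of at most batch_size rows."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for batch_no, start in enumerate(range(0, len(df), batch_size), start=1):
        path = out / f"batch_{batch_no:04d}.csv"
        df.iloc[start:start + batch_size].to_csv(path, index=False)
        paths.append(path)
    return paths

# Made-up input: 2500 placeholder rows, written to a temporary directory
toy = pd.DataFrame({"property_id": range(2500), "address": ["placeholder"] * 2500})
with tempfile.TemporaryDirectory() as tmp:
    batch_paths = split_into_batches_sketch(toy, 1000, tmp)
    batch_sizes = [len(pd.read_csv(p)) for p in batch_paths]
print(batch_sizes)
```

Keeping each file at or below the per-key request limit means one batch can be geocoded with one API key.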


After setting up the input files, run `./run_geocode.sh` from the project root directory. This should create 15 files in `data/processed/coordinates`.

# Combining Live Listings with Wayback Listings

In [42]:
# read in live listings from data/processed/domain/live_listings.csv
live_listings = pd.read_csv("../data/processed/domain/live_listings.csv")

# read in wayback listings from data/processed/domain/wayback_listings.csv
wayback_listings = pd.read_csv("../data/processed/domain/wayback_listings.csv")

# keep only the columns shared by both sources: property_id, suburb, property_type, bedrooms, year, quarter
live_listings = live_listings[['property_id', 'suburb', 'property_type', 'bedrooms', 'year', 'quarter']]
wayback_listings = wayback_listings[['property_id', 'suburb', 'property_type', 'bedrooms', 'year', 'quarter']]

# concat live_listings and wayback_listings 
df = pd.concat([live_listings, wayback_listings])

# convert property_id to int
df['property_id'] = df['property_id'].astype(int)



In [43]:
# sort by year, quarter descending
df = df.sort_values(by=['year', 'quarter'], ascending=False)

# drop duplicates but keep first occurrence
df = df.drop_duplicates(subset=['property_id'], keep='first')

# save to data/processed/domain/cleaned_listings.csv
df.to_csv("../data/processed/domain/cleaned_listings.csv", index=False)



# Stratified Sampling

Randomly shuffle the dataframe, then draw a stratified sample of roughly 50%, stratified by `property_type`, `suburb`, and `bedrooms`. This reduces computational cost while maintaining a representative distribution of property characteristics.
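
Sampling 50% within each stratum does not guarantee an overall ratio of exactly 50%. Strata with a single row are kept whole when the sampling code guards on `len(x) > 1`, and odd-sized strata cannot be split exactly in half. The sketch below uses made-up strata sizes to show the effect of the singleton guard. It uses an explicit loop over the groups, which is intended to behave like a `groupby(...).apply(...)` with the same lambda.

```python
import pandas as pd

# Made-up strata: two even-sized groups and two singletons
toy = pd.DataFrame({
    "strata": ["A"] * 4 + ["B"] + ["C"] + ["D"] * 10,
    "value": range(16),
})

parts = []
for _, group in toy.groupby("strata"):
    # Sample half of each stratum, but keep single-row strata intact
    if len(group) > 1:
        parts.append(group.sample(frac=0.5, random_state=42))
    else:
        parts.append(group)
sampled = pd.concat(parts)

print(sampled["strata"].value_counts().sort_index())
print(f"Overall ratio: {len(sampled) / len(toy):.1%}")
```

Here 9 of the 16 rows survive, so the more single-row strata a dataset has, the further the overall ratio drifts above 50%.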


In [46]:
# Set random seed for reproducibility
import numpy as np
np.random.seed(42)

# First, shuffle the dataframe randomly
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Create stratification groups based on property_type, suburb, and bedrooms
df_shuffled['strata'] = (
    df_shuffled['property_type'].astype(str) + '_' + 
    df_shuffled['suburb'].astype(str) + '_' + 
    df_shuffled['bedrooms'].astype(str)
)

# Perform stratified sampling to get 50% of the data
df_sampled = df_shuffled.groupby('strata', group_keys=False).apply(
    lambda x: x.sample(frac=0.5, random_state=42) if len(x) > 1 else x
)

# Drop the temporary strata column
df_sampled = df_sampled.drop(columns=['strata'])

# Reset index
df_sampled = df_sampled.reset_index(drop=True)

print(f"Original size: {len(df_shuffled):,} rows")
print(f"Sampled size: {len(df_sampled):,} rows")
print(f"Sampling ratio: {len(df_sampled) / len(df_shuffled):.1%}")

# Verify distribution is maintained
print("\n--- Property Type Distribution ---")
print("Original:")
print(df_shuffled['property_type'].value_counts(normalize=True).head())
print("\nSampled:")
print(df_sampled['property_type'].value_counts(normalize=True).head())

print("\n--- Bedrooms Distribution ---")
print("Original:")
print(df_shuffled['bedrooms'].value_counts(normalize=True).sort_index())
print("\nSampled:")
print(df_sampled['bedrooms'].value_counts(normalize=True).sort_index())


Original size: 26,460 rows
Sampled size: 14,065 rows
Sampling ratio: 53.2%

--- Property Type Distribution ---
Original:
property_type
House                      0.275813
house                      0.237642
apartment / unit / flat    0.181519
Apartment / Unit / Flat    0.178193
Townhouse                  0.061224
Name: proportion, dtype: float64

Sampled:
property_type
House                      0.271809
house                      0.233914
apartment / unit / flat    0.177177
Apartment / Unit / Flat    0.175258
Townhouse                  0.065482
Name: proportion, dtype: float64

--- Bedrooms Distribution ---
Original:
bedrooms
1     0.133787
2     0.287415
3     0.338473
4     0.217347
5     0.019803
6     0.002041
7     0.000529
8     0.000340
9     0.000227
50    0.000038
Name: proportion, dtype: float64

Sampled:
bedrooms
1     0.135869
2     0.287096
3     0.329968
4     0.215997
5     0.025382
6     0.003626
7     0.000995
8     0.000569
9     0.000427
50    0.000071
Name: proport



In [47]:
# Save the sampled dataframe
df_sampled.to_csv("../data/processed/domain/cleaned_listings_sampled.csv", index=False)
print("\n✓ Saved stratified sampled data to: data/processed/domain/cleaned_listings_sampled.csv")



✓ Saved stratified sampled data to: data/processed/domain/cleaned_listings_sampled.csv


In [48]:
# Display summary statistics
print("Final sampled dataset:")
print(f"Shape: {df_sampled.shape}")
print(f"\nUnique suburbs: {df_sampled['suburb'].nunique()}")
print(f"Unique property types: {df_sampled['property_type'].nunique()}")
print(f"\nTop 10 suburbs by count:")
print(df_sampled['suburb'].value_counts().head(10))


Final sampled dataset:
Shape: (14065, 6)

Unique suburbs: 371
Unique property types: 24

Top 10 suburbs by count:
suburb
melbourne                         631
tarneit                           199
werribee-hoppers crossing         195
southbank                         191
south yarra                       165
shepparton                        163
north melbourne-west melbourne    163
truganina                         155
st kilda                          136
clayton                           132
Name: count, dtype: int64
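
The property-type distribution printed earlier lists `House` and `house`, and `Apartment / Unit / Flat` and `apartment / unit / flat`, as separate categories. This suggests that the live and wayback sources may use different casing. If that is the cause, normalizing `property_type` before building the strata would merge them. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows mixing the two casings seen in the output
toy = pd.DataFrame({
    "property_type": [
        "House",
        "house",
        " Townhouse",
        "Apartment / Unit / Flat",
        "apartment / unit / flat",
    ]
})

# Strip surrounding whitespace and lowercase so equal types compare equal
toy["property_type"] = toy["property_type"].str.strip().str.lower()
print(toy["property_type"].value_counts())
```

Applying the same normalization to both sources before concatenating them would also lower the number of unique property types reported above.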
