This notebook preprocesses the rental listings scraped from the live version of the Domain website. These listings carry extra features because each one has an individual property page.

In [127]:
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [128]:
import pandas as pd
import re
from shapely.geometry import Point
from utils.preprocess import PreprocessUtils

# Initialize the preprocessor
preprocessor = PreprocessUtils()

pd.set_option("display.max_rows", None)  # Show all rows, default is 10
pd.set_option("display.max_columns", None)  # Show all columns, default is 20

Only "renter-friendly" property types are permitted; every other type is dropped.

The following will be mapped to **House**
- House
- Townhouse
- Villa
- New House & Land
- Semi-Detached
- Terrace
- Duplex

The following will be mapped to **Flat**
- Apartment / Unit / Flat
- Studio
- New Apartments / Off the Plan
- Penthouse

The following will be dropped
- Vacant Land
- Carspace
- Block of Units
- Acreage / Semi-rural
- Rural
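The mapping listed above can be sketched as a simple lookup. This is only an illustration: the real logic lives in `PreprocessUtils.map_property_type`, whose implementation is not shown in this notebook. The names `HOUSE_TYPES`, `FLAT_TYPES` and `map_property_type_sketch` are hypothetical, and the `"unknown"` fallback label is an assumption based on the label that appears in the output further down.

```python
# Hypothetical lookup sets built from the lists above (lower-cased).
HOUSE_TYPES = {
    "house", "townhouse", "villa", "new house & land",
    "semi-detached", "terrace", "duplex",
}
FLAT_TYPES = {
    "apartment / unit / flat", "studio",
    "new apartments / off the plan", "penthouse",
}


def map_property_type_sketch(property_type):
    """Map one raw property type string to 'house', 'flat' or 'unknown'."""
    cleaned = property_type.strip().lower()
    if cleaned in HOUSE_TYPES:
        return "house"
    if cleaned in FLAT_TYPES:
        return "flat"
    # Assumption: anything not listed gets a fallback label.
    return "unknown"


labels = [map_property_type_sketch(t) for t in ["House", "Studio", "Vacant Land"]]
print(labels)
```

With pandas, a function like this could be applied through `df['property_type'].map(map_property_type_sketch)`, provided the column contains only strings.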

In [129]:
df = pd.read_csv("../data/raw/domain/rental_listings_2025_09.csv")

# convert column names to lowercase with snake case
df.columns = df.columns.str.lower().str.replace(' ', '_')

print("Before filtering: data shape is ", df.shape)
# remove rows where property_id is null
df = df[df['property_id'].notna()]

# remove rows where property_features is null
df = df[df['property_features'].notna()]

# drop the bathrooms, bedrooms, car_spaces, land_area columns (they are re-derived from property_features below)
df = df.drop(columns=['bathrooms', 'bedrooms', 'car_spaces', 'land_area'])

# remove rows where description is null
df = df[df['description'].notna()]

# drop rows whose property_type is not permitted
permitted_types = ["house", "new house & land", "townhouse", "villa", "semi-detached", "terrace", "duplex",
                   "apartment / unit / flat", "studio", "new apartments / off the plan", "penthouse",
                   ]
df['property_type'] = df['property_type'].str.lower()
df = df[df['property_type'].isin(permitted_types)]

print("After filtering: data shape is ", df.shape)

Before filtering: data shape is  (14146, 47)
After filtering: data shape is  (14077, 43)


In [130]:
df["house_flat_other"] = preprocessor.map_property_type(df['property_type'])

df["house_flat_other"].value_counts()

house_flat_other
house      8725
unknown    5115
flat        237
Name: count, dtype: int64

In [131]:
# Map the suburbs to standardized names (used by DFFH dataset)
df['suburb'] = preprocessor.map_suburb(df['suburb'])

# Keep only the suburbs that have more than 10 listings
suburb_counts = df['suburb'].value_counts()
valid_suburbs = suburb_counts[suburb_counts > 10].index
df = df[df['suburb'].isin(valid_suburbs)]

df['suburb'].nunique()

245

The live Domain listings take `property_features` from each property page, so its format differs slightly from that of the listings scraped from Wayback-archived summary pages. Here we parse `property_features` back into the `bedrooms`, `bathrooms` and `car_spaces` features.
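As a small illustration of the format, the land-area part of a feature string can be handled with one regular expression. This is a sketch, not the parser used in this notebook (that function follows in the next cell). The helper name `parse_land_area_sketch` and the sample string are hypothetical; the `', ,'` separator, the `−` placeholder and the `m²`/`ha` units are taken from the parser's docstring.

```python
import re


def parse_land_area_sketch(text):
    """Return a land area in m² from text such as '5,030m²' or '12.51ha', else None."""
    match = re.search(r"([\d,]*\.?\d+)\s*(m²|ha)", text)
    if match is None:
        return None  # covers the '−' placeholder and plain room counts
    value = float(match.group(1).replace(",", ""))
    # 1 ha = 10,000 m²; round() avoids truncating float results such as 125099.99...
    square_metres = value * 10_000 if match.group(2) == "ha" else value
    return int(round(square_metres))


# Split one made-up feature string on ', ,' and strip the trailing commas.
parts = [p.strip().rstrip(",") for p in "4, ,2, ,2, ,5,030m²,".split(", ,")]
land_area = parse_land_area_sketch(parts[3])
```

Using `round()` before `int()` is a deliberate choice in this sketch: multiplying a decimal hectare value by 10,000 can land just below a whole number in floating point, and plain `int()` would then truncate it.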

In [132]:
def parse_property_features(feature_string):
    """
    Parse property_features column to extract bedrooms, bathrooms, car_spaces, and land_area.
    
    Format: 'bedrooms, ,bathrooms, ,car_spaces,' or 'bedrooms, ,bathrooms, ,car_spaces, ,XXXm²,'
            or 'bedrooms, ,bathrooms, ,car_spaces, ,X.XXha,'
    Missing values are represented by '−'
    Land area can be in m² or ha (hectares are converted to m²: 1 ha = 10,000 m²)
    
    Returns: pd.Series of four values (bedrooms, bathrooms, car_spaces, land_area); each is an int, or None when missing
    """    
    # Split by ', ,'
    parts = feature_string.split(', ,')
    
    # Initialize values
    bedrooms = None
    bathrooms = None
    car_spaces = None
    land_area = None
    
    # Extract bedrooms (index 0)
    if len(parts) > 0:
        val = parts[0].strip().rstrip(',')
        # Check if this is just a land area value (like '12.51ha,')
        if 'ha' in val or 'm²' in val:
            # This entire string is just land area, extract it
            if 'ha' in val:
                land_area_str = val.replace('ha', '').replace(',', '').strip()
                if land_area_str and land_area_str != '−':
                    land_area = int(float(land_area_str) * 10000)  # Convert ha to m²
            elif 'm²' in val:
                land_area_str = val.replace('m²', '').replace(',', '').strip()
                if land_area_str and land_area_str != '−':
                    land_area = int(land_area_str)
        else:
            bedrooms = None if val == '−' or val == '' else int(val)
    
    # Extract bathrooms (index 1)
    if len(parts) > 1:
        val = parts[1].strip().rstrip(',')
        bathrooms = None if val == '−' or val == '' else int(val)
    
    # Extract car_spaces (index 2)
    if len(parts) > 2:
        val = parts[2].strip().rstrip(',')
        car_spaces = None if val == '−' or val == '' else int(val)
    
    # Extract land_area (index 3, if present and not already extracted)
    if land_area is None and len(parts) > 3:
        val = parts[3].strip().rstrip(',')
        if 'ha' in val:
            # Remove 'ha' and extract number (handle commas and decimals like '12.51ha')
            land_area_str = val.replace('ha', '').replace(',', '').strip()
            if land_area_str and land_area_str != '−':
                land_area = int(float(land_area_str) * 10000)  # Convert ha to m², then to int
        elif 'm²' in val:
            # Remove 'm²' and extract number (handle commas in numbers like '5,030m²')
            land_area_str = val.replace('m²', '').replace(',', '').strip()
            if land_area_str and land_area_str != '−':
                land_area = int(land_area_str)
    
    return pd.Series([bedrooms, bathrooms, car_spaces, land_area])

# Apply the function to create four new columns
df[['bedrooms', 'bathrooms', 'car_spaces', 'land_area']] = df['property_features'].apply(parse_property_features)

# Convert to nullable integer type (Int64) to preserve NaNs while using integer type
df['bedrooms'] = df['bedrooms'].astype('Int64')
df['bathrooms'] = df['bathrooms'].astype('Int64')
df['car_spaces'] = df['car_spaces'].astype('Int64')
df['land_area'] = df['land_area'].astype('Int64')

# drop land_area since mostly missing and difficult to fill
df = df.drop(columns=['land_area'])

In [133]:
# handle missing values by imputing by mode grouped by property_type
print("Before imputation:")
print(df[['bedrooms', 'bathrooms', 'car_spaces']].isnull().sum())

# Impute bedrooms using property_type mode
df['bedrooms'] = preprocessor.impute_by_property_type_mode(df, 'bedrooms')

# Impute bathrooms using property_type mode
df['bathrooms'] = preprocessor.impute_by_property_type_mode(df, 'bathrooms')

# Impute car_spaces using property_type mode
df['car_spaces'] = preprocessor.impute_by_property_type_mode(df, 'car_spaces')

print("\nAfter imputation:")
print(df[['bedrooms', 'bathrooms', 'car_spaces']].isnull().sum())

Before imputation:
bedrooms       144
bathrooms       58
car_spaces    2037
dtype: int64
Property type: apartment / unit / flat, bedrooms imputed with 2
Property type: house, bedrooms imputed with 4
Property type: studio, bedrooms imputed with 1
Property type: apartment / unit / flat, bathrooms imputed with 1
Property type: house, bathrooms imputed with 2
Property type: townhouse, car_spaces imputed with 2
Property type: apartment / unit / flat, car_spaces imputed with 1
Property type: house, car_spaces imputed with 2
Property type: studio, car_spaces imputed with 1
Property type: villa, car_spaces imputed with 1
Property type: terrace, car_spaces imputed with 2
Property type: new apartments / off the plan, car_spaces imputed with 1
Property type: semi-detached, car_spaces imputed with 1

After imputation:
bedrooms      0
bathrooms     0
car_spaces    0
dtype: int64
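The imputation step can be sketched with a grouped transform. The implementation of `preprocessor.impute_by_property_type_mode` is not shown here, so the function below (`impute_by_group_mode_sketch`, a hypothetical name) is only one plausible way to fill missing values with the mode of each property type. It assumes the grouping column is called `property_type` and leaves a group untouched when it has no observed values at all.

```python
import pandas as pd


def impute_by_group_mode_sketch(frame, column, group_col="property_type"):
    """Fill missing values in `column` with the most frequent value of each group."""

    def fill_with_mode(values):
        modes = values.mode()  # NaN is ignored by default
        if modes.empty:
            return values  # the group has no observed value to impute from
        return values.fillna(modes.iloc[0])

    return frame.groupby(group_col)[column].transform(fill_with_mode)


# Toy data: one missing bedroom count per property type.
toy = pd.DataFrame({
    "property_type": ["house", "house", "house", "studio", "studio"],
    "bedrooms": [4, 4, None, 1, None],
})
toy["bedrooms"] = impute_by_group_mode_sketch(toy, "bedrooms")
```

When several values tie for the mode, `Series.mode()` returns them sorted, so this sketch picks the smallest one.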


In [134]:
# impute appointment_only
df['appointment_only'] = df['appointment_only'].fillna(df['appointment_only'].mode()[0])




In [135]:
# Convert date columns to datetime, using mixed format to handle inconsistent datetime formats
df["updated_date"] = pd.to_datetime(df["updated_date"], format="mixed")
df["first_listed_date"] = pd.to_datetime(df["first_listed_date"], format="mixed")
df["last_sold_date"] = pd.to_datetime(df["last_sold_date"], format="mixed")

In [136]:
# convert updated_date to year and quarter
df['year'] = df['updated_date'].dt.year
df['quarter'] = df['updated_date'].dt.quarter


In [137]:
# Extract weekly rent from rental_price column
df['rental_price'] = preprocessor.extract_rental_price(df['rental_price'])

# Check how many rows have unknown frequencies (NaN in rental_price)
unknown_count = df['rental_price'].isna().sum()
print(f"Rows with unknown price frequencies: {unknown_count}")

# Drop rows with unknown frequencies
df = df[df['rental_price'].notna()]

Rows with unknown price frequencies: 86


In [138]:
# only keep relevant columns
relevant_columns = [
    'property_id','rental_price', 
    'suburb', 'postcode', 'property_type', 'year', 'quarter',
    'bedrooms', 'bathrooms', 'car_spaces',
    'age_0_to_19', 'age_20_to_39', 'age_40_to_59', 'age_60_plus',
    'agency_name', 'appointment_only', 'avg_days_on_market',
    'description', 'family_percentage',
    'first_listed_date',
    'latitude', 'longitude', 'listing_status', 'long_term_resident', 
    'median_rent_price', 'median_sold_price', 'number_sold',
    'renter_percentage', 'single_percentage'
]

df = df[relevant_columns]

print(df.shape)

(12650, 29)


In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12650 entries, 0 to 14143
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   property_id         12650 non-null  float64       
 1   rental_price        12650 non-null  float64       
 2   suburb              12650 non-null  object        
 3   postcode            12650 non-null  int64         
 4   property_type       12650 non-null  object        
 5   year                12650 non-null  int32         
 6   quarter             12650 non-null  int32         
 7   bedrooms            12650 non-null  Int64         
 8   bathrooms           12650 non-null  Int64         
 9   car_spaces          12650 non-null  Int64         
 10  age_0_to_19         12650 non-null  float64       
 11  age_20_to_39        12650 non-null  float64       
 12  age_40_to_59        12650 non-null  float64       
 13  age_60_plus         12650 non-null  float64       


In [140]:
# remove remaining rows where any of the columns is null
df = df.dropna()

df.info()

df.shape


<class 'pandas.core.frame.DataFrame'>
Index: 12649 entries, 0 to 14143
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   property_id         12649 non-null  float64       
 1   rental_price        12649 non-null  float64       
 2   suburb              12649 non-null  object        
 3   postcode            12649 non-null  int64         
 4   property_type       12649 non-null  object        
 5   year                12649 non-null  int32         
 6   quarter             12649 non-null  int32         
 7   bedrooms            12649 non-null  Int64         
 8   bathrooms           12649 non-null  Int64         
 9   car_spaces          12649 non-null  Int64         
 10  age_0_to_19         12649 non-null  float64       
 11  age_20_to_39        12649 non-null  float64       
 12  age_40_to_59        12649 non-null  float64       
 13  age_60_plus         12649 non-null  float64       


(12649, 29)

In [141]:
# save df to processed/domain/live_listings.csv
df.to_csv("../data/processed/domain/live_listings.csv", index=False)

In [142]:
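df = df[['property_id', 'longitude', 'latitude', 'suburb']].copy()

# convert latitude and longitude to a Point
# note: the Point is built as (latitude, longitude), so its WKT reads "POINT (lat lon)",
# whereas the isochrone polygons merged in later are in (longitude, latitude) order
df['coordinates'] = df.apply(lambda row: Point(row['latitude'], row['longitude']), axis=1)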



In [None]:
# Split the dataframe into batches and save to output directory

batch_size = 500
for output_dir in ["../data/raw/missing_routes", "../data/raw/missing_poi", "../data/raw/missing_isochrones/driving", "../data/raw/missing_isochrones/walking"]:
    batch_files = preprocessor.split_into_batches(df[['property_id', 'coordinates']], batch_size, output_dir)
    print(f"\nCreated {len(batch_files)} batch files")

Now that the input files are set up, we can call the `./run_` scripts that make the API calls.
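For reference, the batching step above can be sketched as follows. The implementation of `preprocessor.split_into_batches` is not shown in this notebook, so the function name `split_into_batches_sketch`, the `batch_0001.csv` file naming and the CSV output format are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_into_batches_sketch(frame, batch_size, output_dir):
    """Write `frame` to CSV files of at most `batch_size` rows and return their paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for batch_number, start in enumerate(range(0, len(frame), batch_size), start=1):
        path = out / f"batch_{batch_number:04d}.csv"  # hypothetical naming scheme
        frame.iloc[start:start + batch_size].to_csv(path, index=False)
        paths.append(path)
    return paths


# Toy data: 1,200 rows split into batches of 500 inside a temporary directory.
toy = pd.DataFrame({"property_id": range(1200), "coordinates": ["POINT (0 0)"] * 1200})
with tempfile.TemporaryDirectory() as tmp_dir:
    batch_paths = split_into_batches_sketch(toy, 500, tmp_dir)
    batch_sizes = [len(pd.read_csv(p)) for p in batch_paths]
```

Fixed-size batches keep each API run small enough to retry on its own if a request fails partway through.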

In [143]:
# read in data/processed/isochrones/driving/isochrones_driving_combined.csv 
isochrones_driving = pd.read_csv("../data/processed/isochrones/driving/isochrones_driving_combined.csv")

# read in data/processed/isochrones/walking/isochrones_walking_combined.csv
isochrones_walking = pd.read_csv("../data/processed/isochrones/walking/isochrones_walking_combined.csv")

# drop coordinates column
isochrones_driving = isochrones_driving.drop(columns=['coordinates'])
isochrones_walking = isochrones_walking.drop(columns=['coordinates'])


In [144]:
# merge isochrones_driving into df on property_id
merged_df = pd.merge(df, isochrones_driving, on='property_id', how='left')

# rename the columns 5min, 10min, 15min to driving_5min, driving_10min, driving_15min
merged_df = merged_df.rename(columns={'5min': 'driving_5min', '10min': 'driving_10min', '15min': 'driving_15min'})

# merge isochrones_walking into merged_df on property_id
merged_df = pd.merge(merged_df, isochrones_walking, on='property_id', how='left')

# rename the columns 5min, 10min, 15min to walking_5min, walking_10min, walking_15min
merged_df = merged_df.rename(columns={'5min': 'walking_5min', '10min': 'walking_10min', '15min': 'walking_15min'})

merged_df.head()

Unnamed: 0,property_id,longitude,latitude,suburb,coordinates,driving_5min,driving_10min,driving_15min,walking_5min,walking_10min,walking_15min
0,17732837.0,144.996157,-37.796893,collingwood-abbotsford,POINT (-37.796893 144.9961565),"POLYGON ((144.979355 -37.798607, 144.981794 -3...","POLYGON ((144.959626 -37.793, 144.962138 -37.8...","POLYGON ((144.936435 -37.785708, 144.936258 -3...","POLYGON ((144.992561 -37.795622, 144.992555 -3...","POLYGON ((144.987942 -37.794449, 144.988173 -3...","POLYGON ((144.983394 -37.794141, 144.983946 -3..."
1,17744154.0,145.007683,-37.811065,collingwood-abbotsford,POINT (-37.8110653 145.0076834),"POLYGON ((144.980748 -37.80875, 144.981582 -37...","POLYGON ((144.958496 -37.803735, 144.95771 -37...","POLYGON ((144.925411 -37.825948, 144.926093 -3...","POLYGON ((145.002942 -37.810891, 145.003013 -3...","POLYGON ((144.998207 -37.810222, 144.998236 -3...","POLYGON ((144.993536 -37.809922, 144.993569 -3..."
2,17750349.0,145.001906,-37.80211,collingwood-abbotsford,POINT (-37.80210950000001 145.0019064),"POLYGON ((144.981316 -37.798493, 144.982206 -3...","POLYGON ((144.962479 -37.799088, 144.966284 -3...","POLYGON ((144.938168 -37.789279, 144.939602 -3...","POLYGON ((144.997121 -37.801603, 144.997121 -3...","POLYGON ((144.992658 -37.801436, 144.993016 -3...","POLYGON ((144.989197 -37.800257, 144.989158 -3..."
3,17739910.0,144.999856,-37.809205,collingwood-abbotsford,POINT (-37.8092053 144.999856),"POLYGON ((144.979348 -37.808679, 144.97928 -37...","POLYGON ((144.956625 -37.806511, 144.957063 -3...","POLYGON ((144.939835 -37.78925, 144.940454 -37...","POLYGON ((144.996383 -37.809856, 144.996346 -3...","POLYGON ((144.991645 -37.809719, 144.991735 -3...","POLYGON ((144.987048 -37.808997, 144.987011 -3..."
4,17751219.0,144.99394,-37.808042,collingwood-abbotsford,POINT (-37.8080424 144.9939399),"POLYGON ((144.97256 -37.807922, 144.973631 -37...","POLYGON ((144.951863 -37.805174, 144.95199 -37...","POLYGON ((144.936452 -37.785721, 144.936286 -3...","POLYGON ((144.990064 -37.807245, 144.991239 -3...","POLYGON ((144.985368 -37.806814, 144.986504 -3...","POLYGON ((144.980519 -37.806187, 144.9818 -37...."
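A left join adds rows whenever a `property_id` appears more than once in the right-hand table, so it can be worth guarding the merge. The sketch below is an optional variant, not what the cell above does: it prefixes the isochrone columns before merging, drops repeated keys, and asks pandas to validate the join. The function name `add_isochrones_sketch` and the keep-first de-duplication rule are assumptions.

```python
import pandas as pd


def add_isochrones_sketch(listings, isochrones, mode):
    """Left-join isochrone columns onto listings, prefixed with the travel mode."""
    renamed = isochrones.rename(
        columns={c: f"{mode}_{c}" for c in isochrones.columns if c != "property_id"}
    )
    # Assumption: when a property_id repeats, keeping the first row is acceptable.
    renamed = renamed.drop_duplicates(subset="property_id")
    # validate raises MergeError if the right-hand keys are still not unique.
    return listings.merge(renamed, on="property_id", how="left", validate="many_to_one")


# Toy data: id 1 is duplicated in the isochrone table and id 3 is missing from it.
listings = pd.DataFrame({"property_id": [1, 2, 3]})
driving = pd.DataFrame({
    "property_id": [1, 1, 2],
    "5min": ["A", "A2", "B"],
    "10min": ["C", "C2", "D"],
    "15min": ["E", "E2", "F"],
})
merged = add_isochrones_sketch(listings, driving, "driving")
```

Prefixing before the merge also removes the need for a rename after each join, since the driving and walking columns never share a name.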


In [146]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15005 entries, 0 to 15004
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   property_id           15005 non-null  float64
 1   longitude             15005 non-null  float64
 2   latitude              15005 non-null  float64
 3   suburb                15005 non-null  object 
 4   coordinates           15005 non-null  object 
 5   driving_5min          13863 non-null  object 
 6   driving_10min         13863 non-null  object 
 7   driving_15min         13863 non-null  object 
 8   walking_5min          9695 non-null   object 
 9   walking_10min         9695 non-null   object 
 10  walking_15min         9695 non-null   object 
 11  driving_5min_imputed  15005 non-null  object 
dtypes: float64(3), object(9)
memory usage: 1.4+ MB


In [None]:
# get the rows where the driving_5min is null
null_driving_5min = merged_df[merged_df['driving_5min'].isnull()]

# get the rows where the walking_5min is null
null_walking_5min = merged_df[merged_df['walking_5min'].isnull()]

# use preprocessor to split into batches
preprocessor.split_into_batches(null_driving_5min[['property_id', 'coordinates']], 500, "../data/raw/missing_isochrones/driving")
preprocessor.split_into_batches(null_walking_5min[['property_id', 'coordinates']], 500, "../data/raw/missing_isochrones/walking")

In [None]:
merged_df

In [None]:
# drop the original driving and walking isochrone columns
merged_df = merged_df.drop(columns=['driving_5min', 'driving_10min', 'driving_15min', 'walking_5min', 'walking_10min', 'walking_15min'])

# rename the *_imputed columns back to the original column names
merged_df = merged_df.rename(columns={'driving_5min_imputed': 'driving_5min', 'driving_10min_imputed': 'driving_10min', 'driving_15min_imputed': 'driving_15min', 'walking_5min_imputed': 'walking_5min', 'walking_10min_imputed': 'walking_10min', 'walking_15min_imputed': 'walking_15min'})

# save property_id and the driving isochrone columns to data/curated/rent_features/cleaned_isochrones_driving.csv
merged_df[['property_id', 'driving_5min', 'driving_10min', 'driving_15min']].to_csv("../data/curated/rent_features/cleaned_isochrones_driving.csv", index=False)


In [None]:
# Display the updated dataframe with imputed columns
merged_df.head()
