This notebook preprocesses the listings scraped from the live version of the Domain website. Because each listing's individual property page was scraped as well, the data includes extra features.

In [159]:
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [160]:
import pandas as pd
import re
from utils.preprocess import PreprocessUtils

# Initialize the preprocessor
preprocessor = PreprocessUtils()

pd.set_option("display.max_rows", None)  # Show all rows, default is 60
pd.set_option("display.max_columns", None)  # Show all columns, default is 20

Only "renter-friendly" property types are permitted.

The following will be mapped to **House**
- House
- Townhouse
- Villa
- New House & Land
- Semi-Detached
- Terrace
- Duplex

The following will be mapped to **Flat**
- Apartment / Unit / Flat
- Studio
- New Apartments / Off the Plan
- Penthouse

The following will be dropped
- Vacant Land
- Carspace
- Block of Units
- Acreage / Semi-rural
- Rural
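The grouping above can be sketched as a simple lookup. This is only an illustration: `HOUSE_TYPES`, `FLAT_TYPES` and `to_house_or_flat` are hypothetical names, and the actual logic lives in `PreprocessUtils.map_property_type`, which may behave differently.

```python
# Hypothetical grouping of Domain property types into coarse categories.
# These names are illustrative only; they are not taken from PreprocessUtils.
HOUSE_TYPES = {
    "house", "townhouse", "villa", "new house & land",
    "semi-detached", "terrace", "duplex",
}
FLAT_TYPES = {
    "apartment / unit / flat", "studio",
    "new apartments / off the plan", "penthouse",
}


def to_house_or_flat(property_type):
    """Return 'house', 'flat', or None (row should be dropped)."""
    if not isinstance(property_type, str):
        return None
    key = property_type.strip().lower()
    if key in HOUSE_TYPES:
        return "house"
    if key in FLAT_TYPES:
        return "flat"
    return None  # e.g. vacant land, carspace, rural


for raw in ["Townhouse", "Apartment / Unit / Flat", "Vacant Land"]:
    print(raw, "->", to_house_or_flat(raw))
```

On a DataFrame, a function like this could be applied with `df['property_type'].map(to_house_or_flat)`.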

In [161]:
df = pd.read_csv("../data/raw/domain/rental_listings_2025_09.csv")

print("Before filtering: data shape is ", df.shape)
# remove rows where property_id is null
df = df[df['property_id'].notna()]

# remove rows where property_features is null
df = df[df['property_features'].notna()]

# drop the bathrooms, bedrooms, car_spaces, land_area columns
# (they are re-derived from property_features further down)
df = df.drop(columns=['bathrooms', 'bedrooms', 'car_spaces', 'land_area'])

# remove rows where description is null
df = df[df['description'].notna()]

# drop rows whose property_type is not permitted
permitted_types = ["house", "new house & land", "townhouse", "villa", "semi-detached", "terrace", "duplex",
                   "apartment / unit / flat", "studio", "new apartments / off the plan", "penthouse",
                   ]
df['property_type'] = df['property_type'].str.lower()
df = df[df['property_type'].isin(permitted_types)]

print("After filtering: data shape is ", df.shape)

Before filtering: data shape is  (14146, 47)
After filtering: data shape is  (14077, 43)


In [162]:
df["house_flat_other"] = preprocessor.map_property_type(df['property_type'])

df["house_flat_other"].value_counts()

house_flat_other
house      8725
unknown    5115
flat        237
Name: count, dtype: int64

In [163]:
# Map the suburbs to standardized names (used by DFFH dataset)
df['suburb'] = preprocessor.map_suburb(df['suburb'])

# Keep only the suburbs that have more than 10 listings
suburb_counts = df['suburb'].value_counts()
valid_suburbs = suburb_counts[suburb_counts > 10].index
df = df[df['suburb'].isin(valid_suburbs)]

df['suburb'].nunique()

245

For the live Domain website, `property_features` was scraped from each individual property page, so its format differs slightly from that of the listings scraped from the Wayback-archived summary pages. Here we convert `property_features` back into the `bedrooms`, `bathrooms` and `car_spaces` features.
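As a quick illustration of the raw layout, the sketch below splits a few strings on the `', ,'` separator. The sample strings are made up to follow the documented format (they are not rows from the dataset), and `split_features` is a hypothetical helper used only here.

```python
# Made-up strings that follow the documented layout; '−' marks a missing value.
samples = ["3, ,2, ,1,", "4, ,2, ,2, ,650m²,", "−, ,1, ,−,"]


def split_features(feature_string):
    """Split a raw property_features string into its cleaned fields."""
    return [part.strip().rstrip(",") for part in feature_string.split(", ,")]


for raw in samples:
    print(repr(raw), "->", split_features(raw))
```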

In [164]:
def parse_property_features(feature_string):
    """
    Parse property_features column to extract bedrooms, bathrooms, car_spaces, and land_area.
    
    Format: 'bedrooms, ,bathrooms, ,car_spaces,' or 'bedrooms, ,bathrooms, ,car_spaces, ,XXXm²,'
            or 'bedrooms, ,bathrooms, ,car_spaces, ,X.XXha,'
    Missing values are represented by '−'
    Land area can be in m² or ha (hectares are converted to m²: 1 ha = 10,000 m²)
    
    Returns: pd.Series with four values (bedrooms, bathrooms, car_spaces, land_area),
             each an int, or None when the value is missing
    """    
    # Split by ', ,'
    parts = feature_string.split(', ,')
    
    # Initialize values
    bedrooms = None
    bathrooms = None
    car_spaces = None
    land_area = None
    
    # Extract bedrooms (index 0)
    if len(parts) > 0:
        val = parts[0].strip().rstrip(',')
        # Check if this is just a land area value (like '12.51ha,')
        if 'ha' in val or 'm²' in val:
            # This entire string is just land area, extract it
            if 'ha' in val:
                land_area_str = val.replace('ha', '').replace(',', '').strip()
                if land_area_str and land_area_str != '−':
                    land_area = int(float(land_area_str) * 10000)  # Convert ha to m²
            elif 'm²' in val:
                land_area_str = val.replace('m²', '').replace(',', '').strip()
                if land_area_str and land_area_str != '−':
                    land_area = int(land_area_str)
        else:
            bedrooms = None if val == '−' or val == '' else int(val)
    
    # Extract bathrooms (index 1)
    if len(parts) > 1:
        val = parts[1].strip().rstrip(',')
        bathrooms = None if val == '−' or val == '' else int(val)
    
    # Extract car_spaces (index 2)
    if len(parts) > 2:
        val = parts[2].strip().rstrip(',')
        car_spaces = None if val == '−' or val == '' else int(val)
    
    # Extract land_area (index 3, if present and not already extracted)
    if land_area is None and len(parts) > 3:
        val = parts[3].strip().rstrip(',')
        if 'ha' in val:
            # Remove 'ha' and extract number (handle commas and decimals like '12.51ha')
            land_area_str = val.replace('ha', '').replace(',', '').strip()
            if land_area_str and land_area_str != '−':
                land_area = int(float(land_area_str) * 10000)  # Convert ha to m², then to int
        elif 'm²' in val:
            # Remove 'm²' and extract number (handle commas in numbers like '5,030m²')
            land_area_str = val.replace('m²', '').replace(',', '').strip()
            if land_area_str and land_area_str != '−':
                land_area = int(land_area_str)
    
    return pd.Series([bedrooms, bathrooms, car_spaces, land_area])

# Apply the function to create four new columns
df[['bedrooms', 'bathrooms', 'car_spaces', 'land_area']] = df['property_features'].apply(parse_property_features)

# Convert to nullable integer type (Int64) to preserve NaNs while using integer type
df['bedrooms'] = df['bedrooms'].astype('Int64')
df['bathrooms'] = df['bathrooms'].astype('Int64')
df['car_spaces'] = df['car_spaces'].astype('Int64')
df['land_area'] = df['land_area'].astype('Int64')

# drop land_area since mostly missing and difficult to fill
df = df.drop(columns=['land_area'])

In [165]:
# handle missing values by imputing by mode grouped by property_type
print("Before imputation:")
print(df[['bedrooms', 'bathrooms', 'car_spaces']].isnull().sum())

# Impute bedrooms using property_type mode
df['bedrooms'] = preprocessor.impute_by_property_type_mode(df, 'bedrooms')

# Impute bathrooms using property_type mode
df['bathrooms'] = preprocessor.impute_by_property_type_mode(df, 'bathrooms')

# Impute car_spaces using property_type mode
df['car_spaces'] = preprocessor.impute_by_property_type_mode(df, 'car_spaces')

print("\nAfter imputation:")
print(df[['bedrooms', 'bathrooms', 'car_spaces']].isnull().sum())

Before imputation:
bedrooms       144
bathrooms       58
car_spaces    2037
dtype: int64
Property type: apartment / unit / flat, bedrooms imputed with 2
Property type: house, bedrooms imputed with 4
Property type: studio, bedrooms imputed with 1
Property type: apartment / unit / flat, bathrooms imputed with 1
Property type: house, bathrooms imputed with 2
Property type: townhouse, car_spaces imputed with 2
Property type: apartment / unit / flat, car_spaces imputed with 1
Property type: house, car_spaces imputed with 2
Property type: studio, car_spaces imputed with 1
Property type: villa, car_spaces imputed with 1
Property type: terrace, car_spaces imputed with 2
Property type: new apartments / off the plan, car_spaces imputed with 1
Property type: semi-detached, car_spaces imputed with 1

After imputation:
bedrooms      0
bathrooms     0
car_spaces    0
dtype: int64
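For reference, mode imputation grouped by property type can be sketched with `groupby(...).transform(...)`. This is a minimal sketch under assumptions: `impute_by_group_mode` is a hypothetical stand-in, not the actual `PreprocessUtils.impute_by_property_type_mode`, and the toy data is invented.

```python
import pandas as pd


def impute_by_group_mode(frame, column, group_col="property_type"):
    """Fill missing values in `column` with the mode of its `group_col` group."""

    def fill_group(values):
        modes = values.mode()  # mode() ignores missing values by default
        if modes.empty:
            return values  # group has no observed values; leave it untouched
        return values.fillna(modes.iloc[0])

    return frame.groupby(group_col)[column].transform(fill_group)


# Invented toy data, not taken from the Domain listings
toy = pd.DataFrame({
    "property_type": ["house", "house", "house", "studio", "studio"],
    "bedrooms": pd.array([4, 4, None, 1, None], dtype="Int64"),
})
toy["bedrooms"] = impute_by_group_mode(toy, "bedrooms")
print(toy)
```

When a group has several modes, this sketch picks the smallest one, since `Series.mode()` returns its results sorted.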


In [166]:
# impute appointment_only
df['appointment_only'] = df['appointment_only'].fillna(df['appointment_only'].mode()[0])



In [167]:
# Convert date columns to datetime, using mixed format to handle inconsistent datetime formats
df["updated_date"] = pd.to_datetime(df["updated_date"], format="mixed")
df["first_listed_date"] = pd.to_datetime(df["first_listed_date"], format="mixed")
df["last_sold_date"] = pd.to_datetime(df["last_sold_date"], format="mixed")

In [168]:
# convert updated_date to year and quarter
df['year'] = df['updated_date'].dt.year
df['quarter'] = df['updated_date'].dt.quarter


In [169]:
# Extract weekly rent from rental_price column
df['rental_price'] = preprocessor.extract_rental_price(df['rental_price'])

# Check how many rows have unknown frequencies (NaN in rental_price)
unknown_count = df['rental_price'].isna().sum()
print(f"Rows with unknown price frequencies: {unknown_count}")

# Drop rows with unknown frequencies
df = df[df['rental_price'].notna()]

Rows with unknown price frequencies: 86
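The implementation of `extract_rental_price` is not shown in this notebook. A rough sketch of the idea, under stated assumptions, might look like the following: the frequency keywords, the monthly-to-weekly conversion (`* 12 / 52`) and the choice to treat a bare amount as unknown are all assumptions, and `weekly_rent` is a hypothetical name.

```python
import math
import re

# Assumed frequency keywords; the real helper may recognise others.
WEEKLY = re.compile(r"(?<![a-z])(?:pw|p/w|per week|weekly|week|wk)(?![a-z])", re.IGNORECASE)
MONTHLY = re.compile(r"(?<![a-z])(?:pcm|p/m|per month|monthly|month)(?![a-z])", re.IGNORECASE)
AMOUNT = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def weekly_rent(text):
    """Return the weekly rent as a float, or NaN if it cannot be determined."""
    if not isinstance(text, str):
        return math.nan
    match = AMOUNT.search(text)
    if match is None:
        return math.nan  # no dollar amount, e.g. "Contact agent"
    amount = float(match.group(1).replace(",", ""))
    if WEEKLY.search(text):
        return amount
    if MONTHLY.search(text):
        return amount * 12 / 52  # convert a monthly figure to a weekly one
    return math.nan  # amount present but frequency unknown (assumption)


for raw in ["$550 per week", "$650pw", "$2,600 per month", "Contact agent"]:
    print(repr(raw), "->", weekly_rent(raw))
```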


In [170]:
# only keep relevant columns
relevant_columns = [
    'property_id','rental_price', 
    'suburb', 'postcode', 'property_type', 'year', 'quarter',
    'bedrooms', 'bathrooms', 'car_spaces',
    'age_0_to_19', 'age_20_to_39', 'age_40_to_59', 'age_60_plus',
    'agency_name', 'appointment_only', 'avg_days_on_market',
    'description', 'family_percentage',
    'first_listed_date',
    'latitude', 'longitude', 'listing_status', 'long_term_resident', 
    'median_rent_price', 'median_sold_price', 'number_sold',
    'renter_percentage', 'single_percentage'
]

df = df[relevant_columns]

print(df.shape)

(12650, 29)


In [171]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12650 entries, 0 to 14143
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   property_id         12650 non-null  float64       
 1   rental_price        12650 non-null  float64       
 2   suburb              12650 non-null  object        
 3   postcode            12650 non-null  int64         
 4   property_type       12650 non-null  object        
 5   year                12650 non-null  int32         
 6   quarter             12650 non-null  int32         
 7   bedrooms            12650 non-null  Int64         
 8   bathrooms           12650 non-null  Int64         
 9   car_spaces          12650 non-null  Int64         
 10  age_0_to_19         12650 non-null  float64       
 11  age_20_to_39        12650 non-null  float64       
 12  age_40_to_59        12650 non-null  float64       
 13  age_60_plus         12650 non-null  float64       


In [172]:
# remove remaining rows where any of the columns is null
df = df.dropna()

df.info()

df.shape


<class 'pandas.core.frame.DataFrame'>
Index: 12649 entries, 0 to 14143
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   property_id         12649 non-null  float64       
 1   rental_price        12649 non-null  float64       
 2   suburb              12649 non-null  object        
 3   postcode            12649 non-null  int64         
 4   property_type       12649 non-null  object        
 5   year                12649 non-null  int32         
 6   quarter             12649 non-null  int32         
 7   bedrooms            12649 non-null  Int64         
 8   bathrooms           12649 non-null  Int64         
 9   car_spaces          12649 non-null  Int64         
 10  age_0_to_19         12649 non-null  float64       
 11  age_20_to_39        12649 non-null  float64       
 12  age_40_to_59        12649 non-null  float64       
 13  age_60_plus         12649 non-null  float64       


(12649, 29)

In [173]:
# save df to processed/domain/live_listings.csv
df.to_csv("../data/processed/domain/live_listings.csv", index=False)