# Summary Notebook


## Dataset Overview and Data Pipeline


### 1. Data Sources and Collection (`0_data_download.ipynb`)

The project collects data from multiple authoritative sources:

**Primary Rental Data:**
- **Domain.com.au**: Live rental listings (14,146 properties) and historical wayback data (14,043 properties) spanning 2022-2025
- **DFFH Moving Annual Rent**: Official Victorian government quarterly median rent data by suburb and property type (84,729 records)

**Economic Indicators:**
- **Unemployment Rate**: Quarterly data from 1978-2025 (191 records)
- **Interest Rates**: Mortgage, savings, and cash rates from 1990-2035 (181 records)  
- **Price Data**: CPI, WPI, and PPI quarterly data from 1949-2025 (304 records)
- **Economic Activity**: State Final Demand and GSP components from 1986-2025 (156 records)
- **Population Dynamics**: Natural increase, migration patterns from 1982-2025 (172 records)
- **Investment Data**: Household consumption, dwelling investment, business investment, government spending from 1986-2025 (156 records)

**Geographic and Infrastructure Data:**
- **Public Transport**: PTV stops and lines from Open Data Victoria (7.33 MB + 362.41 MB)
- **School Locations**: 2023-2025 school data from Education Victoria (2,363 unique schools)
- **Census Data**: Population demographics from ABS (attempted download of SAL codes 20001-22944)

### 2. Data Preprocessing (`1_preprocess_domain.ipynb`)

**Live Listings Processing:**
- **Input**: 14,146 raw listings with 47 columns
- **Processing**: Property feature parsing, bedroom/bathroom/car space imputation, data cleaning
- **Output**: 12,649 processed listings across 245 suburbs
- **Property Types**: House (7,577), Unknown (4,860), Flat (213)

**Wayback Listings Processing:**
- **Input**: 14 wayback files (2022Q1-2025Q2) with varying record counts
- **Processing**: Address extraction from URLs, property feature imputation, geocoding via OpenRouteService API
- **Output**: 13,301 geocoded listings across 338 suburbs (93% geocoding success rate)

**Data Combination and Cleaning:**
- **Combined Dataset**: 25,950 records → 24,517 after deduplication
- **Outlier Removal**: Removed the top and bottom 1% of rental prices, leaving a $240-$1,650 range
- **Final Shape**: 24,027 properties across 368 suburbs and 24 property types
- **Stratified Sampling**: 12,823 records (53.4%) for API feature engineering
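
The deduplication and percentile trimming steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the field names `property_id`, `listed` and `rent`, and the toy records, are assumptions made for the example.

```python
import numpy as np

def dedupe_latest(listings):
    """Keep only the most recently listed record for each property_id."""
    latest = {}
    for row in listings:
        key = row["property_id"]
        # ISO date strings (YYYY-MM-DD) sort chronologically.
        if key not in latest or row["listed"] > latest[key]["listed"]:
            latest[key] = row
    return list(latest.values())

def trim_rent_outliers(listings, lower_pct=1.0, upper_pct=99.0):
    """Drop rows whose rent lies outside the given percentile band."""
    rents = np.array([row["rent"] for row in listings], dtype=float)
    low, high = np.percentile(rents, [lower_pct, upper_pct])
    return [row for row in listings if low <= row["rent"] <= high]

# Hypothetical records: property "a" appears twice.
listings = [
    {"property_id": "a", "listed": "2024-01-10", "rent": 520},
    {"property_id": "a", "listed": "2025-02-01", "rent": 560},
    {"property_id": "b", "listed": "2023-07-15", "rent": 480},
]
unique = dedupe_latest(listings)
```

Trimming by percentile rather than by a fixed dollar threshold keeps the rule independent of the overall rent level, at the cost of always discarding about 2% of records.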

### 3. Spatial Feature Engineering (`1_feature_engineer.ipynb`)

**API-Based Feature Generation:**
The project leverages multiple APIs to create location-based features:

**OpenRouteService APIs:**
- **Isochrones**: Driving and walking accessibility polygons (5, 10, 15-minute travel times)
- **Routes**: Minimum distance and duration to closest PTV stations
- **Geocoding**: Address-to-coordinate conversion for wayback listings

**OpenStreetMap POI Features:**
- **Amenity Counts**: 41 POI categories including cafes, pharmacies, childcare, fuel stations, etc.
- **Distance Features**: Minimum distances to nearest amenities
- **Imputation**: Missing values filled using nearest neighbor spatial interpolation
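
As a rough sketch of nearest neighbor spatial imputation, the function below fills each missing value with the value of the closest property that has one. The function name and the use of plain Euclidean distance on coordinates are assumptions for illustration; the notebook may use a different distance or more than one neighbor.

```python
import numpy as np

def impute_from_nearest(coords, values):
    """Fill NaNs in `values` with the value of the closest observed row."""
    coords = np.asarray(coords, dtype=float)
    filled = np.asarray(values, dtype=float).copy()
    known = ~np.isnan(filled)
    if not known.any():
        raise ValueError("no observed values to impute from")
    for i in np.flatnonzero(~known):
        # Distance from the row with a gap to every row with an observed value.
        dist = np.linalg.norm(coords[known] - coords[i], axis=1)
        filled[i] = filled[known][np.argmin(dist)]
    return filled

# Hypothetical coordinates and a distance-to-amenity feature with one gap.
coords = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
distance_to_cafe = [10.0, np.nan, 30.0]
imputed = impute_from_nearest(coords, distance_to_cafe)
```

Euclidean distance on latitude and longitude is only approximate; over a metropolitan area it is usually adequate for choosing a nearest neighbor, but a projected coordinate system would be more accurate.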

**School Quality Features:**
- **School Rankings**: Integration of school performance data
- **Accessibility Scores**: Best school within each isochrone polygon
- **Distance Metrics**: Travel time and distance to highest-rated schools

**Final Feature-Rich Dataset:**
- **Shape**: 24,027 properties × 133 features
- **Spatial Features**: 30 school-related features, 41 POI features, 6 isochrone features, 3 PTV route features
- **Storage**: Parquet format with 10 partitions for efficient processing

### 4. Panel Data Construction (`2_preprocess_forecast.ipynb`)

**Time Series Panel Creation:**
- **Temporal Range**: 2013 Q2 to 2025 Q1 (48 quarters)
- **Spatial Units**: 849 Victorian suburbs
- **Property Types**: 6 categories (1-3 bedroom flats, 2-4 bedroom houses)
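- **Panel Shape**: 41,918 observations in an unbalanced panel indexed by suburb (849), property type (6), and quarter (48); a fully balanced panel would have 849 × 6 × 48 = 244,512 cells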

**Economic Feature Integration:**
- **LASSO Selection**: 12 out of 18 economic indicators selected for modeling
- **Temporal Alignment**: Quarterly economic data matched to rental price periods
- **Missing Data Handling**: GSP data gaps reduced observations from 87,330 to 41,918

### 5. Data Quality and Challenges

**Data Quality Issues:**
- **Geocoding Success**: 93% success rate for wayback listings
- **Missing Values**: Systematic imputation using spatial nearest neighbor methods
- **COVID-19 Impact**: Rental market volatility during 2020-2021 period
- **API Limitations**: Rate limits requiring batch processing and multiple API keys

**Data Validation:**
- **Outlier Detection**: Statistical outlier removal based on rental price distributions
- **Deduplication**: Property ID-based deduplication preserving most recent listings
- **Spatial Validation**: Coordinate validation and geometry consistency checks

## Big Question 2: What are the top 10 Suburbs with the highest predicted growth rate?

### Selected Model
A Spatial Autoregressive (SAR) model was chosen for this task because it can capture spatial relationships between suburbs and use lagged values to produce a time series forecast. SAR models are also easy to interpret.

### Overview
Panel data spanning 2013-2025 across 849 Victorian suburbs and 6 property types was used. Temporal and spatial lag features were constructed to predict median rental prices.
OLS and WLS models were trained and validated; WLS with the COVID period downweighted performed better, so it was used to forecast median rental prices 20 quarters ahead.


### Methodology and Approach

#### 1. Data Structure and Preparation
- **Panel Data**: 41,346 observations across 849 suburb and property type combinations and 43 time periods
- **Target Variable**: Median rental price by suburb, property type, and time period
- **Temporal Range**: 2013 Q2 to 2025 Q1 (training: 2013-2023, testing: 2024-2025)

#### 2. Economic Feature Selection Process
**LASSO Feature Selection:**
- **Initial Feature Pool**: 18 economic indicators including CPI, WPI, PPI, unemployment rate, GSP, household consumption, dwelling investment, government spending, population demographics, migration patterns, and interest rates
- **Selection Method**: LASSO regression with time series cross-validation (3-fold)
- **Standardisation**: Features standardised using StandardScaler for fair comparison
- **Selected Features**: 12 out of 18 features retained based on non-zero coefficients
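
To illustrate how LASSO retains only features with non-zero coefficients, here is a from-scratch coordinate descent sketch on synthetic data. It is not the notebook's implementation: the penalty `alpha` is fixed by hand rather than chosen by time series cross-validation, and the six synthetic features stand in for the economic indicators.

```python
import numpy as np

def standardise(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            partial = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial / n
            # Soft-thresholding sets weak features exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_scale[j]
    return w

# Synthetic data: only features 0 and 1 drive the target.
rng = np.random.default_rng(0)
X = standardise(rng.normal(size=(200, 6)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

w = lasso_coordinate_descent(X, y - y.mean(), alpha=0.1)
selected = [j for j in range(X.shape[1]) if w[j] != 0.0]
```

Standardising first matters because the L1 penalty is applied to raw coefficient sizes, so features on larger scales would otherwise be penalised less.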

#### 3. Feature Engineering
**Temporal Features:**
- **Rent Lags**: 1-4 quarter lagged values of median rent (primary predictors)
- **Time Weights**: Downweighted COVID-19 period (2020-2021) observations to 0.1 weight
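
A minimal sketch of the lag and weight construction for a single suburb and property type series is shown below. The helper names and the `"2020Q1"` quarter label format are assumptions for illustration, and the series is assumed to be sorted with no missing quarters.

```python
def add_rent_lags(series, n_lags=4):
    """Build lag_1..lag_n features from a list of (quarter, rent) pairs.

    Rows without a full lag history are dropped.
    """
    rows = []
    for t in range(n_lags, len(series)):
        quarter, rent = series[t]
        lags = {f"lag_{k}": series[t - k][1] for k in range(1, n_lags + 1)}
        rows.append({"quarter": quarter, "rent": rent, **lags})
    return rows

def covid_weight(quarter, low_weight=0.1):
    """Return the reduced weight for 2020-2021 quarters and 1.0 otherwise."""
    year = int(quarter[:4])
    return low_weight if year in (2020, 2021) else 1.0

# Hypothetical median rents for one suburb and property type.
series = [
    ("2019Q1", 400), ("2019Q2", 405), ("2019Q3", 410),
    ("2019Q4", 415), ("2020Q1", 420), ("2020Q2", 410),
]
rows = add_rent_lags(series)
weights = [covid_weight(row["quarter"]) for row in rows]
```

Dropping rows without a complete lag history is one reason the usable sample is smaller than the raw panel.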

**Spatial Features:**
- **Spatial Lag**: Weighted average of neighboring suburbs' rents at same time period
- **Connectivity Matrix**: Row-stochastic spatial weights matrix (96% sparse due to limited connectivity)
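
The spatial lag can be sketched as a matrix product between a row-standardised weights matrix and the rent vector. The four-suburb adjacency matrix and rents below are made up for illustration; the fourth suburb has no neighbors, which is the kind of case that produces a spatial lag of zero.

```python
import numpy as np

def row_standardise(adjacency):
    """Divide each row by its sum so that neighbour weights sum to 1."""
    A = np.asarray(adjacency, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    # Suburbs with no neighbours keep an all-zero row instead of dividing by 0.
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

# Hypothetical adjacency: suburb 0 borders suburbs 1 and 2; suburb 3 is isolated.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
rents = np.array([500.0, 450.0, 550.0, 600.0])

W = row_standardise(adjacency)
spatial_lag = W @ rents  # average rent of each suburb's neighbours
sparsity = 1.0 - np.count_nonzero(W) / W.size
```

Strictly, a row-stochastic matrix has every row summing to 1; isolated suburbs break that property, so they are left with zero rows here and would need separate handling in a log-transformed model.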

**Other Features:**
- **Property Type Dummies**: One-hot encoded property types

#### 4. Model Implementation Approaches

**Approach 1: Linear Scale Models**
1. **OLS (Ordinary Least Squares)**: Baseline model assuming constant variance
2. **WLS (Weighted Least Squares)**: Accounts for heteroscedasticity by downweighting observations from unreliable periods
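
The difference between the two estimators can be sketched with a closed-form fit. This is an illustrative implementation using NumPy least squares, not the notebook's code; the toy data contain a single shocked observation standing in for a COVID-period quarter.

```python
import numpy as np

def fit_wls(X, y, weights):
    """Minimise sum_i w_i * (y_i - x_i @ beta)^2.

    Scaling each row by sqrt(w_i) turns the problem into ordinary least squares.
    """
    sw = np.sqrt(np.asarray(weights, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Toy data: y = 10 + 2x, with one shocked observation at index 5.
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 10.0 + 2.0 * x
y[5] += 20.0

ols_beta = fit_wls(X, y, np.ones(10))  # equal weights reduce to OLS
weights = np.ones(10)
weights[5] = 0.1                       # downweight the shocked observation
wls_beta = fit_wls(X, y, weights)
```

With the shocked point weighted at 0.1, the fitted intercept and slope stay closer to the underlying 10 and 2 than the equally weighted fit does.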

**Approach 2: Log-Transformed Models**
- **Motivation**: Q-Q plot analysis revealed log-normal distribution of rent prices
- **Transformation**: Applied log transformation to median rent and all lag variables
- **Benefits**: Satisfies normality assumptions, handles multiplicative effects, stabilizes variance
- **Models**: Log-transformed OLS and WLS with back-transformation to original scale
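
A minimal sketch of a log-transformed fit with back-transformation, assuming a single lag feature and made-up rents that grow by exactly 2% on the previous quarter:

```python
import numpy as np

# Hypothetical data: rent is exactly 2% above the previous quarter's rent.
rent_lag = np.array([400.0, 450.0, 500.0, 550.0, 600.0])
rent = 1.02 * rent_lag

# Fit log(rent) = b0 + b1 * log(rent_lag) by ordinary least squares.
X = np.column_stack([np.ones_like(rent_lag), np.log(rent_lag)])
beta, *_ = np.linalg.lstsq(X, np.log(rent), rcond=None)

def predict_rent(lag_value):
    """Predict on the log scale, then exponentiate back to dollars."""
    log_pred = beta[0] + beta[1] * np.log(lag_value)
    return float(np.exp(log_pred))

next_rent = predict_rent(700.0)
```

On the log scale a constant percentage change becomes a constant additive shift, which is why the multiplicative 2% growth is recovered by a linear fit. Note that exponentiating a log-scale prediction gives the conditional median rather than the mean when errors are log-normal.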

**Validation Strategy:**
- **Temporal Split**: Train on 2013-2023, test on 2024-2025 data (89.5%/10.5% split)
- **Out-of-Sample Testing**: Performance on the held-out 2024-2025 data measures how well the model generalises
- **Performance Metrics**: R², RMSE, MAE for both in-sample and out-of-sample evaluation
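
The temporal split and the three metrics can be sketched as below. The `year` field and the example values are assumptions for illustration.

```python
import numpy as np

def temporal_split(rows, first_test_year=2024):
    """Split rows by year so that the test set is strictly later than training."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test

def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = float((resid ** 2).sum())
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt((resid ** 2).mean())),
        "mae": float(np.abs(resid).mean()),
    }

# Hypothetical observed and predicted median rents for three test rows.
metrics = regression_metrics([500.0, 520.0, 540.0], [510.0, 515.0, 545.0])
```

Splitting by time rather than at random avoids training on quarters that come after the ones being predicted.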

#### 5. Forecasting

**WLS for Forecasting**
- Training results indicated that WLS outperformed OLS on both the training and validation sets

**Forecasting Economic Features**
- A vector autoregressive (VAR) model was implemented to forecast the LASSO-selected economic features 20 quarters ahead.
- VAR was used because it models multiple time series jointly, which was necessary to forecast all economic features together.
- The economic forecast also downweighted the COVID period to reduce the influence of COVID-era trends on future forecasts.
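
As a rough sketch of the idea, a first-order VAR can be fitted by least squares and iterated forward. The lag order of 1, the function names and the two simulated series are assumptions for illustration; the source does not state the lag order used. The optional `weights` argument shows one way observation weights could enter the fit.

```python
import numpy as np

def fit_var1(Y, weights=None):
    """Fit Y_t = c + A @ Y_{t-1} + e_t by (optionally weighted) least squares.

    Y has shape (T, k); weights, if given, has one entry per transition (T - 1).
    """
    Y = np.asarray(Y, dtype=float)
    X = np.column_stack([np.ones(len(Y) - 1), Y[:-1]])
    target = Y[1:]
    if weights is not None:
        sw = np.sqrt(np.asarray(weights, dtype=float))[:, None]
        X, target = X * sw, target * sw
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[0], coef[1:].T  # intercept c (k,), coefficient matrix A (k, k)

def forecast_var1(c, A, last, steps):
    """Iterate the fitted recursion forward `steps` periods."""
    state = np.asarray(last, dtype=float)
    path = []
    for _ in range(steps):
        state = c + A @ state
        path.append(state)
    return np.array(path)

# Simulated noise-free history for two indicators over 24 quarters.
c_true = np.array([1.0, 0.5])
A_true = np.array([[0.5, 0.1], [0.0, 0.8]])
history = [np.zeros(2)]
for _ in range(23):
    history.append(c_true + A_true @ history[-1])
history = np.array(history)

c_hat, A_hat = fit_var1(history)
forecast = forecast_var1(c_hat, A_hat, history[-1], steps=20)
```

Because each forecast step feeds on the previous one, errors compound over the 20-quarter horizon, which is why the quality of the economic forecasts matters for the rent forecasts.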

#### 6. Growth Rate

**Calculation of Growth Rate**
- Growth rate (%) = (Rent at 2030Q2 - Rent at 2025Q2) / Rent at 2025Q2 × 100, i.e. forecast growth over the 20-quarter horizon
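
The calculation and the top-10 ranking can be sketched as follows. The suburb names and rents are placeholders, not results from the model.

```python
def growth_rate_pct(start_rent, end_rent):
    """Percentage change from start_rent to end_rent."""
    if start_rent <= 0:
        raise ValueError("start_rent must be positive")
    return (end_rent - start_rent) / start_rent * 100.0

# Placeholder forecasts: (suburb, property type) -> (rent 2025Q2, rent 2030Q2).
forecasts = {
    ("Suburb A", "2br flat"): (500.0, 600.0),
    ("Suburb B", "3br house"): (650.0, 715.0),
    ("Suburb C", "1br flat"): (400.0, 500.0),
}

growth = {key: growth_rate_pct(start, end) for key, (start, end) in forecasts.items()}
top_10 = sorted(growth.items(), key=lambda item: item[1], reverse=True)[:10]
```

The subtraction must be grouped before the division; without parentheses the expression would divide 2025Q2 by itself first.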

#### 7. Visualisation of Results
**Growth Rates**
- Histograms were produced of the top 10 growth rates by suburb and property type, alongside the top 10 suburbs by growth rate averaged across property types







### Issues and Challenges Faced

#### 1. Spatial Connectivity Limitations
**Problem**: The spatial connectivity matrix was 96% sparse, severely limiting spatial autoregressive effects.
- **Impact**: Spatial lag features had minimal predictive power (coefficient ~0.08)

#### 2. Data Quality During COVID-19
**Problem**: Rental market volatility and data collection disruptions during 2020-2021.
- **Solution**: Implemented observation weighting (0.1 weight for 2020-2021 data)
- **Effectiveness**: WLS model showed improved generalisation over OLS

#### 3. Temporal Data Gaps
**Problem**: Missing observations for some suburb-property combinations in certain time periods.
- **Impact**: Reduced effective sample size from 41,346 to ~37,000 clean observations
- **Handling**: Removed observations with missing lag features

#### 4. Feature Selection and Dimensionality
**Problem**: High-dimensional feature space
- **Challenge**: 18 economic features with varying degrees of correlation and predictive power
- **Solution**: LASSO regression with time series cross-validation to select most relevant features

#### 5. Distribution Assumptions and Model Fit
**Problem**: Rent prices follow log-normal distribution
- **Challenge**: Heteroscedasticity and non-normal residuals in linear models
- **Solution**: Log transformation of target variable and lag features

#### 6. Log Transformation Challenges
**Problem**: Handling zero values in log-transforms
- **Challenge**: Spatial lag values could be zero, causing log(0) issues
- **Solution**: Substitute a small positive constant when the spatial lag value is 0, so that the log is defined
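
One way to apply such a buffer is sketched below. The function name and the value `eps=1e-6` are assumptions; the source does not state the constant used.

```python
import numpy as np

def safe_log(values, eps=1e-6):
    """Take the natural log, replacing exact zeros with a small constant first."""
    values = np.asarray(values, dtype=float)
    return np.log(np.where(values == 0.0, eps, values))

# Hypothetical spatial lags: the first suburb has no neighbours.
spatial_lag = np.array([0.0, 500.0, 520.0])
log_spatial_lag = safe_log(spatial_lag)
```

A very small constant maps zeros to a large negative value (about -13.8 for `1e-6`), far from typical log rents near 6, so these rows can act as high-leverage points. A separate indicator for suburbs without neighbors is a possible alternative.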


### Limitations and Assumptions

#### 1. Model Assumptions
**Linear Relationships**: Assumes linear relationships between features and rental prices (or log-rental prices)


#### 2. LASSO Feature Selection Limitations
**Linear Feature Selection**: LASSO assumes linear relationships for feature selection
- May miss non-linear interactions between economic variables


#### 3. Spatial Modeling Limitations
**Sparse Connectivity Matrix**: 96% sparsity severely limits spatial effects
- The matrix was constructed from suburb boundaries
- This likely underestimates spatial dependencies that arise from more than just proximity


#### 4. Data Limitations
**Feature-to-Observation Ratio**: Large number of features relative to the number of observations
- The model is likely to overfit the training data, and some combinations have insufficient data to forecast
- Fewer forecasts are produced, i.e. the model cannot be fitted for every suburb and property type combination


**Economic Feature Availability**: Some economic indicators had missing values
- GSP data had significant missing values, reducing sample size
- Reduced from 87,330 to 41,918 observations after removing missing GSP values

#### 5. Forecasting Assumptions
**Stationarity**: Assumes stable statistical properties of rent time series
- Market regimes may change (e.g., post-COVID housing dynamics)
- Forecast accuracy may degrade during structural market changes

**Exogenous Variables**: Economic forecasts assumed accurate and available
- Using forecast economic variables as inputs assumes that future economic conditions are known
- If actual trends deviate, errors in the economic forecasts propagate into the rent forecasts

**Log-Normality Persistence**: Assumes log-normal distribution continues into future
- Distribution may change due to market structural shifts


### Recommendations and Future Improvements

#### 1. Spatial Modeling Enhancements
**Rich Connectivity Matrix**: Develop more sophisticated spatial relationships
- **Distance-based weights**: Use geographic distance or travel time
- **Economic similarity**: Weight by demographic/economic characteristics  
- **Transport connectivity**: Incorporate public transport accessibility
- **Expected Impact**: Significant improvement in spatial autoregressive effects


#### 2. Model Architecture Improvements
**Non-linear Models**: Explore machine learning approaches
- **Random Forest/XGBoost**: Capture non-linear feature interactions
- **Neural Networks**: Model complex temporal-spatial patterns
- **Ensemble Methods**: Combine multiple modeling approaches

#### 3. Dataset Expansion
- **More Observations**: Expand dataset to increase number of observations for each combination





## Big Question 3: What are the most liveable and affordable suburbs according to your chosen metrics?


### Affordability
#### Approach
For this project we calculated affordability as the ratio of median rent to median household income for each suburb. This was based on the 2021 census data packet.
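
A minimal sketch of the ratio, assuming rent and household income are both expressed over the same period (weekly here). The suburb names and figures are placeholders, not census values.

```python
def rent_to_income_ratio(median_rent, median_household_income):
    """Share of household income spent on rent; lower means more affordable."""
    if median_household_income <= 0:
        raise ValueError("median_household_income must be positive")
    return median_rent / median_household_income

# Placeholder values: suburb -> (median weekly rent, median weekly household income).
suburbs = {
    "Suburb A": (450.0, 1800.0),
    "Suburb B": (600.0, 2000.0),
    "Suburb C": (380.0, 1900.0),
}

ratios = {name: rent_to_income_ratio(rent, income) for name, (rent, income) in suburbs.items()}
most_affordable = sorted(ratios, key=ratios.get)
```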

#### Limitations
The 2021 census data packet is now several years old, so the calculated ratios may not reflect the current real estate market.



### Liveability
#### Approach
The composite liveability index is built from demographic, housing, transport, and amenities datasets in four steps:
- Clean and standardize the inputs.
- Engineer region-level indicators (affordability ratios, transit access scores, green-space coverage).
- Normalize and weight the metrics against user-defined priorities.
- Visualize the results through comparative maps, ranked tables, and trend plots to highlight strengths, gaps, and actionable improvement targets.
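
The normalise-and-weight step of a composite index can be sketched as below. Min-max scaling, the indicator names and the weights are all assumptions chosen for illustration; other normalisations (for example z-scores) would fit the same structure.

```python
import numpy as np

def min_max(column):
    """Rescale a column to the range [0, 1]; constant columns map to 0."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    if span == 0:
        return np.zeros_like(column)
    return (column - column.min()) / span

def liveability_index(indicators, weights, lower_is_better=()):
    """Weighted average of min-max scaled indicators, one score per suburb."""
    total_weight = sum(weights.values())
    score = 0.0
    for name, raw in indicators.items():
        scaled = min_max(raw)
        if name in lower_is_better:
            scaled = 1.0 - scaled  # flip so that higher always means better
        score = score + weights[name] / total_weight * scaled
    return score

# Hypothetical indicators for three suburbs and user-defined priorities.
indicators = {
    "transit_access": [2.0, 8.0, 5.0],
    "green_space_pct": [10.0, 30.0, 20.0],
    "rent_to_income": [0.35, 0.25, 0.30],
}
weights = {"transit_access": 0.4, "green_space_pct": 0.3, "rent_to_income": 0.3}

scores = liveability_index(indicators, weights, lower_is_better={"rent_to_income"})
```

Dividing by the total weight means the user-defined priorities do not need to sum to 1, and flipping lower-is-better indicators keeps the final score on a single "higher is more liveable" scale.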