# Moving Annual Rent Analysis

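This notebook processes the moving annual rent Excel files to extract median weekly rent by suburb, property type, quarter, and year. It then prepares the RBA economic time series and the census data, and merges all three sources into a single panel dataset.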

## Data Structure
- Each Excel file contains multiple sheets for different property types
- Property types: 1 bedroom flat, 2 bedroom flat, 3 bedroom flat, 2 bedroom house, 3 bedroom house, 4 bedroom house
- Each sheet has quarterly data (Mar, Jun, Sep, Dec) for multiple years
- Data includes both Count and Median columns for each quarter/year combination
- We extract only the Median columns and reshape the data into long format: one row per suburb, property type, quarter, and year
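
As a rough illustration of the reshaping described above, the sketch below melts a toy wide table into the long format. The column labels (`Mar 2000 Median` and so on), the values, and the `MONTH_TO_QUARTER` mapping are assumptions made for this example only; the real sheet layout is handled by `PreprocessUtils.process_moving_annual_rent_files` and may differ.

```python
import pandas as pd

# Hypothetical wide layout: one row per suburb, a Count and a Median column per quarter.
# Column labels and values are illustrative, not taken from the real Excel file.
wide = pd.DataFrame({
    "suburb": ["armadale", "elwood"],
    "Mar 2000 Count": [120, 95],
    "Mar 2000 Median": [150.0, 160.0],
    "Jun 2000 Count": [118, 90],
    "Jun 2000 Median": [155.0, 165.0],
})

# Assumed mapping from the quarter-ending month to a quarter number
MONTH_TO_QUARTER = {"Mar": 1, "Jun": 2, "Sep": 3, "Dec": 4}

# Keep only the Median columns and stack them into rows
median_cols = [c for c in wide.columns if c.endswith("Median")]
long_df = wide.melt(
    id_vars="suburb",
    value_vars=median_cols,
    var_name="period",
    value_name="median_rent",
)

# Split "Mar 2000 Median" into month and year parts
parts = long_df["period"].str.split(" ", expand=True)
long_df["quarter"] = parts[0].map(MONTH_TO_QUARTER)
long_df["year"] = parts[1].astype(int)
long_df["property_type"] = "1 bedroom flat"  # would come from the sheet name

long_df = long_df[["suburb", "property_type", "quarter", "year", "median_rent"]]
print(long_df)
```

The result has the same five columns as the processed dataset: `suburb`, `property_type`, `quarter`, `year`, and `median_rent`.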


In [32]:
%load_ext autoreload
%autoreload 2

import sys
from pathlib import Path

# Add project root to Python path
# Get the current notebook's directory and go up to project root
current_dir = Path().resolve()
if current_dir.name == 'notebooks':
    project_root = current_dir.parent
elif current_dir.name == 'project2':
    project_root = current_dir
else:
    # If we're in the parent directory, look for project2
    project_root = current_dir / 'project2'

sys.path.insert(0, str(project_root))
print(f"Project root: {project_root}")


import pandas as pd
import numpy as np
from utils.preprocess import PreprocessUtils
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

# Initialize PreprocessUtils
preprocessor = PreprocessUtils()


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Project root: /Users/jackshee/University/MAST30034 Applied Data Science/project2


## 1. Preprocess the DFFH Moving Annual Rent Time Series Data


In [49]:
# Set the data directory
data_dir = '../data/landing/moving_annual_rent'

# Process all Excel files
print("Starting data processing...")
df = preprocessor.process_moving_annual_rent_files(data_dir)

print(f"\nTotal records processed: {len(df)}")
print(f"Data shape: {df.shape}")
print(f"Columns: {list(df.columns)}")


Starting data processing...
Found 1 Excel files to process:
  - moving_annual_median_weekly_rent_by_suburb.xlsx
Processing 1 bedroom flat from moving_annual_median_weekly_rent_by_suburb.xlsx
Processing 2 bedroom flat from moving_annual_median_weekly_rent_by_suburb.xlsx
Processing 3 bedroom flat from moving_annual_median_weekly_rent_by_suburb.xlsx
Processing 2 bedroom house from moving_annual_median_weekly_rent_by_suburb.xlsx
Processing 3 bedroom house from moving_annual_median_weekly_rent_by_suburb.xlsx
Processing 4 bedroom house from moving_annual_median_weekly_rent_by_suburb.xlsx
Successfully processed moving_annual_median_weekly_rent_by_suburb.xlsx: 84729 records

Total records processed: 84729
Data shape: (84729, 5)
Columns: ['suburb', 'property_type', 'quarter', 'year', 'median_rent']


### Examine the Processed Data


In [50]:
# Display basic information about the dataset
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head(10))

print(f"\nData types:")
print(df.dtypes)

print(f"\nUnique property types: {df['property_type'].unique()}")
print(f"Unique quarters: {sorted(df['quarter'].unique())}")
print(f"Year range: {df['year'].min()} - {df['year'].max()}")
print(f"Number of unique suburbs: {df['suburb'].nunique()}")


Dataset Overview:
Shape: (84729, 5)

First few rows:
                                  suburb   property_type  quarter  year  \
0  albert park-middle park-west st kilda  1 bedroom flat        1  2000   
1                               armadale  1 bedroom flat        1  2000   
2                          carlton north  1 bedroom flat        1  2000   
3                      carlton-parkville  1 bedroom flat        1  2000   
4                        cbd-st kilda rd  1 bedroom flat        1  2000   
5                 collingwood-abbotsford  1 bedroom flat        1  2000   
6                         east melbourne  1 bedroom flat        1  2000   
7                          east st kilda  1 bedroom flat        1  2000   
8                                 elwood  1 bedroom flat        1  2000   
9                                fitzroy  1 bedroom flat        1  2000   

   median_rent  
0        165.0  
1        150.0  
2        150.0  
3        165.0  
4        250.0  
5        135.0  
6 

In [51]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Check data quality
print(f"\nData quality checks:")
print(f"Records with negative median rent: {(df['median_rent'] < 0).sum()}")
print(f"Records with zero median rent: {(df['median_rent'] == 0).sum()}")
print(f"Records with very high median rent (>$5000): {(df['median_rent'] > 5000).sum()}")


Missing values:
suburb           0
property_type    0
quarter          0
year             0
median_rent      0
dtype: int64

Data quality checks:
Records with negative median rent: 0
Records with zero median rent: 0
Records with very high median rent (>$5000): 0


### Create Pivot Table for Analysis


In [56]:
# Create a pivot table with one column per quarter/year combination.
# Build a combined quarter_year key (e.g. '1_2000') to use as the pivot columns.
# Note: quarter and year are converted to strings in place, and quarter_year is
# added to df, so it also appears in the long-format CSV saved later.
df['quarter'] = df['quarter'].astype(str)
df['year'] = df['year'].astype(str)
df['quarter_year'] = df['quarter'] + '_' + df['year']

# Create pivot table
pivot_df = df.pivot_table(
    index=['suburb', 'property_type'],
    columns='quarter_year',
    values='median_rent',
    aggfunc='first'  # Keep the first value if a suburb/property type/period is duplicated
).reset_index()

print(f"Pivot table shape: {pivot_df.shape}")
print(f"\nFirst few rows:")
print(pivot_df.head())

print(f"\nColumn names (first 20):")
print(list(pivot_df.columns)[:20])


Pivot table shape: (866, 103)

First few rows:
quarter_year                                 suburb    property_type  1_2000  \
0             albert park-middle park-west st kilda   1 bedroom flat   165.0   
1             albert park-middle park-west st kilda   2 bedroom flat   250.0   
2             albert park-middle park-west st kilda  2 bedroom house   300.0   
3             albert park-middle park-west st kilda   3 bedroom flat   350.0   
4             albert park-middle park-west st kilda  3 bedroom house   390.0   

quarter_year  1_2001  1_2002  1_2003  1_2004  1_2005  1_2006  1_2007  ...  \
0              180.0   195.0   220.0   200.0   220.0   220.0   220.0  ...   
1              260.0   290.0   300.0   288.0   310.0   320.0   320.0  ...   
2              335.0   350.0   340.0   350.0   350.0   375.0   390.0  ...   
3              425.0   440.0   435.0   428.0   420.0   400.0   410.0  ...   
4              384.0   450.0   450.0   425.0   455.0   460.0   528.0  ...   

quarter_y
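
The `quarter_year` labels put the quarter first, so the pivot columns sort as `1_2000`, `1_2001`, and so on rather than chronologically. A possible alternative, sketched here on toy data with illustrative values, is a year-first label such as `2000Q1` (the same form as the `quarter_id` built for the economic data in section 2), which sorts in time order. The `period` column name is an assumption for this example.

```python
import pandas as pd

# Toy long-format data; rent values are illustrative only
toy = pd.DataFrame({
    "suburb": ["armadale"] * 4,
    "property_type": ["1 bedroom flat"] * 4,
    "quarter": [1, 2, 1, 2],
    "year": [2000, 2000, 2001, 2001],
    "median_rent": [150.0, 152.0, 160.0, 162.0],
})

# Year-first label: string sorting now matches chronological order
toy["period"] = toy["year"].astype(str) + "Q" + toy["quarter"].astype(str)

pivot = toy.pivot_table(
    index=["suburb", "property_type"],
    columns="period",
    values="median_rent",
    aggfunc="first",
).reset_index()

period_cols = [c for c in pivot.columns if c not in ("suburb", "property_type")]
print(period_cols)
```

Building the label in a new column also leaves `quarter` and `year` as integers in the long-format data.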

### Save Processed Data


In [57]:
# Save the long format data
output_file_long = '../data/processed/moving_rent/moving_annual_rent_long.csv'
os.makedirs(os.path.dirname(output_file_long), exist_ok=True)
df.to_csv(output_file_long, index=False)
print(f"Long format data saved to: {output_file_long}")

# Save the pivot table format
output_file_pivot = '../data/processed/moving_rent/moving_annual_rent_pivot.csv'
pivot_df.to_csv(output_file_pivot, index=False)
print(f"Pivot table data saved to: {output_file_pivot}")

print(f"\nData processing complete!")
print(f"Total records: {len(df)}")
print(f"Unique suburbs: {df['suburb'].nunique()}")
print(f"Property types: {len(df['property_type'].unique())}")
print(f"Time period: {df['year'].min()}-{df['year'].max()}")


Long format data saved to: ../data/processed/moving_rent/moving_annual_rent_long.csv
Pivot table data saved to: ../data/processed/moving_rent/moving_annual_rent_pivot.csv

Data processing complete!
Total records: 84729
Unique suburbs: 146
Property types: 6
Time period: 2000-2025


## 2. Preprocess Economic Time Series Data from RBA

In [40]:
# Set up paths
data_path = Path("../data/landing")

# Read all datasets
print("Reading datasets...")

# Price data
price_data = pd.read_csv(data_path / "price_data" / "quarterly_price_data.csv")
print(f"Price data shape: {price_data.shape}")

# Unemployment rate
unemployment_data = pd.read_csv(data_path / "unemployment_rate" / "quarterly_unemployment_rate.csv")
print(f"Unemployment data shape: {unemployment_data.shape}")

# Economic activity
economic_data = pd.read_csv(data_path / "economic_activity" / "quarterly_economic_activity.csv")
print(f"Economic activity data shape: {economic_data.shape}")

# Investment
investment_data = pd.read_csv(data_path / "investment" / "quarterly_investment.csv")
print(f"Investment data shape: {investment_data.shape}")

# Population
population_data = pd.read_csv(data_path / "population" / "quarterly_population_dynamics.csv")
print(f"Population data shape: {population_data.shape}")

# Interest rates
interest_data = pd.read_csv(data_path / "interest_rates" / "quarterly_interest_rates.csv")
print(f"Interest rates data shape: {interest_data.shape}")

print("\nDataset columns:")
print("Price data:", price_data.columns.tolist())
print("Unemployment data:", unemployment_data.columns.tolist())
print("Economic activity data:", economic_data.columns.tolist())
print("Investment data:", investment_data.columns.tolist())
print("Population data:", population_data.columns.tolist())
print("Interest rates data:", interest_data.columns.tolist())

Reading datasets...
Price data shape: (304, 6)
Unemployment data shape: (191, 4)
Economic activity data shape: (156, 5)
Investment data shape: (156, 8)
Population data shape: (172, 7)
Interest rates data shape: (181, 6)

Dataset columns:
Price data: ['date', 'year', 'quarter', 'CPI (%/y)', 'WPI (%/y)', 'PPI, Final Demand (%/y)']
Unemployment data: ['date', 'year', 'quarter', 'Unemployment rate (%)']
Economic activity data: ['date', 'year', 'quarter', 'SFD (%/y)', 'GSP quarterly components (%/y)']
Investment data: ['date', 'year', 'quarter', 'State final demand (%/y)', 'Household consumption (pp/y)', 'Dwelling investment (pp/y)', 'Business investment (pp/y)', 'Government spending (pp/y)']
Population data: ['date', 'year', 'quarter', 'Population (%/y)', 'Natural increase (pp/y)', 'Net overseas migration (pp/y)', 'Net interstate migration (pp/y)']
Interest rates data: ['date', 'year', 'quarter', 'Mortgage rates (%)', 'Savings rates (%)', 'Cash rate (%)']


In [41]:
# Merge all datasets on year and quarter
print("Merging datasets...")

# Start with price data as the base
merged_data = price_data.copy()

# Merge unemployment data
merged_data = merged_data.merge(
    unemployment_data[['year', 'quarter', 'Unemployment rate (%)']], 
    on=['year', 'quarter'], 
    how='outer'
)

# Merge economic activity data
merged_data = merged_data.merge(
    economic_data[['year', 'quarter', 'SFD (%/y)', 'GSP quarterly components (%/y)']], 
    on=['year', 'quarter'], 
    how='outer'
)

# Merge investment data
investment_cols = ['year', 'quarter', 'State final demand (%/y)', 'Household consumption (pp/y)', 
                  'Dwelling investment (pp/y)', 'Business investment (pp/y)', 'Government spending (pp/y)']
merged_data = merged_data.merge(
    investment_data[investment_cols], 
    on=['year', 'quarter'], 
    how='outer'
)

# Merge population data
population_cols = ['year', 'quarter', 'Population (%/y)', 'Natural increase (pp/y)', 
                  'Net overseas migration (pp/y)', 'Net interstate migration (pp/y)']
merged_data = merged_data.merge(
    population_data[population_cols], 
    on=['year', 'quarter'], 
    how='outer'
)

# Merge interest rates data
interest_cols = ['year', 'quarter', 'Mortgage rates (%)', 'Savings rates (%)', 'Cash rate (%)']
merged_data = merged_data.merge(
    interest_data[interest_cols], 
    on=['year', 'quarter'], 
    how='outer'
)

print(f"Merged dataset shape: {merged_data.shape}")
print(f"Merged dataset columns: {merged_data.columns.tolist()}")

# Sort by year and quarter
merged_data = merged_data.sort_values(['year', 'quarter']).reset_index(drop=True)

# Create a quarter identifier for plotting
merged_data['quarter_id'] = merged_data['year'].astype(str) + 'Q' + merged_data['quarter'].astype(str)

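# Note: min() and max() are taken per column, so this pairs the earliest year with the
# lowest quarter number overall (and the latest year with the highest quarter number),
# not with the quarters of the first and last rows. The data actually starts at 1949 Q3.
print(f"\nDate range: {merged_data['year'].min()}-{merged_data['quarter'].min()} to {merged_data['year'].max()}-{merged_data['quarter'].max()}")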
print(f"Total quarters: {len(merged_data)}")

# Display first few rows
print("\nFirst 5 rows of merged data:")
print(merged_data.head())


Merging datasets...
Merged dataset shape: (343, 21)
Merged dataset columns: ['date', 'year', 'quarter', 'CPI (%/y)', 'WPI (%/y)', 'PPI, Final Demand (%/y)', 'Unemployment rate (%)', 'SFD (%/y)', 'GSP quarterly components (%/y)', 'State final demand (%/y)', 'Household consumption (pp/y)', 'Dwelling investment (pp/y)', 'Business investment (pp/y)', 'Government spending (pp/y)', 'Population (%/y)', 'Natural increase (pp/y)', 'Net overseas migration (pp/y)', 'Net interstate migration (pp/y)', 'Mortgage rates (%)', 'Savings rates (%)', 'Cash rate (%)']

Date range: 1949-1 to 2035-4
Total quarters: 343

First 5 rows of merged data:
         date  year  quarter  CPI (%/y)  WPI (%/y)  PPI, Final Demand (%/y)  \
0  1949-09-01  1949        3        7.9        NaN                      NaN   
1  1949-12-01  1949        4       10.5        NaN                      NaN   
2  1950-03-01  1950        1       10.3        NaN                      NaN   
3  1950-06-01  1950        2       10.0        NaN
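
Because `min()` and `max()` are computed per column, pairing `year.min()` with `quarter.min()` reports `1949-1` even though the first row above is 1949 Q3. A minimal sketch of one way to report the range from the first and last rows of the sorted frame, using a toy frame rather than the real data:

```python
import pandas as pd

# Toy frame with the same key columns as merged_data
toy = pd.DataFrame({"year": [1950, 1949, 1949, 1950], "quarter": [1, 3, 4, 2]})
toy = toy.sort_values(["year", "quarter"]).reset_index(drop=True)

# After sorting, the first and last rows hold the true start and end periods
first, last = toy.iloc[0], toy.iloc[-1]
start_label = f"{first['year']}-Q{first['quarter']}"
end_label = f"{last['year']}-Q{last['quarter']}"

# Column-wise min for comparison: mixes the earliest year with the lowest quarter
naive_start = f"{toy['year'].min()}-Q{toy['quarter'].min()}"

print(f"Range: {start_label} to {end_label} (column-wise start would be {naive_start})")
```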

In [42]:
# Save the merged dataset
# Create the economic folder if it doesn't exist
os.makedirs("../data/processed/economic", exist_ok=True)

output_path = "../data/processed/economic/economic_time_series.csv"
merged_data.to_csv(output_path, index=False)
print(f"\nMerged dataset saved to: {output_path}")

# Display data quality information
print("\nData Quality Information:")
print("="*40)
print(f"Total rows: {len(merged_data)}")
print(f"Total columns: {len(merged_data.columns)}")

# Check for missing values
missing_data = merged_data.isnull().sum()
print(f"\nMissing values by column:")
for col in missing_data[missing_data > 0].index:
    print(f"  {col}: {missing_data[col]} ({missing_data[col]/len(merged_data)*100:.1f}%)")

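# Check date coverage (year and quarter min/max are computed independently, so the
# quarters shown are the lowest and highest quarter numbers overall, not those of
# the first and last rows)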
print(f"\nDate coverage:")
print(f"  Start: {merged_data['year'].min()}-Q{merged_data['quarter'].min()}")
print(f"  End: {merged_data['year'].max()}-Q{merged_data['quarter'].max()}")
print(f"  Total quarters: {len(merged_data)}")

# Show sample of the merged data
print(f"\nSample of merged data (first 3 rows):")
print(merged_data[['year', 'quarter', 'quarter_id', 'CPI (%/y)', 'Unemployment rate (%)', 'Cash rate (%)']].head(3))



Merged dataset saved to: ../data/processed/economic/economic_time_series.csv

Data Quality Information:
Total rows: 343
Total columns: 22

Missing values by column:
  date: 39 (11.4%)
  CPI (%/y): 39 (11.4%)
  WPI (%/y): 235 (68.5%)
  PPI, Final Demand (%/y): 239 (69.7%)
  Unemployment rate (%): 152 (44.3%)
  SFD (%/y): 187 (54.5%)
  GSP quarterly components (%/y): 294 (85.7%)
  State final demand (%/y): 187 (54.5%)
  Household consumption (pp/y): 187 (54.5%)
  Dwelling investment (pp/y): 187 (54.5%)
  Business investment (pp/y): 187 (54.5%)
  Government spending (pp/y): 187 (54.5%)
  Population (%/y): 171 (49.9%)
  Natural increase (pp/y): 171 (49.9%)
  Net overseas migration (pp/y): 171 (49.9%)
  Net interstate migration (pp/y): 171 (49.9%)
  Mortgage rates (%): 257 (74.9%)
  Savings rates (%): 256 (74.6%)
  Cash rate (%): 199 (58.0%)

Date coverage:
  Start: 1949-Q1
  End: 2035-Q4
  Total quarters: 343

Sample of merged data (first 3 rows):
   year  quarter quarter_id  CPI (%/y)  U
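
The `date` column comes only from the price data, so quarters that appear only in the other datasets have no date (39 rows in the report above). Assuming the convention visible in the first rows of the merged data (Q3 maps to `09-01`, Q4 to `12-01`, Q1 to `03-01`) holds throughout, the date could be rebuilt from `year` and `quarter`. The sketch below does this on a toy frame:

```python
import pandas as pd

# Toy frame with only the key columns
toy = pd.DataFrame({"year": [1949, 1949, 1950], "quarter": [3, 4, 1]})

# Assumed convention: the date is the first day of the quarter's last month
toy["date"] = pd.to_datetime(
    pd.DataFrame({
        "year": toy["year"],
        "month": toy["quarter"] * 3,
        "day": 1,
    })
)

print(toy)
```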

## 3. Preprocess Census Data

In [21]:
# Warning: this cell takes a long time to run. The results are saved in the data/processed/census folder.
preprocessor.process_census_data_workflow(
    base_data_dir="../data/"
)

Starting complete census data processing workflow...
=== PROCESSING CENSUS DATA TO CSV ===
Found 2711 Excel files to process
Processed 100 files...
Processed 200 files...
Processed 300 files...
Processed 400 files...
Processed 500 files...
Processed 600 files...
Processed 700 files...
Processed 800 files...
Processed 900 files...
Processed 1000 files...
Processed 1100 files...
Processed 1200 files...
Processed 1300 files...
Processed 1400 files...
Processed 1500 files...
Processed 1600 files...
Processed 1700 files...
Processed 1800 files...
Processed 1900 files...
Processed 2000 files...
Processed 2100 files...
Processed 2200 files...
Processed 2300 files...
Processed 2400 files...
Processed 2500 files...
Processed 2600 files...
Processed 2700 files...
Successfully processed 2711 census files
=== MERGING CENSUS CSV FILES ===
⚠️  No files found for pattern: *_median_stats.csv
⚠️  No files found for pattern: *_population_breakdown.csv
⚠️  No files found for pattern: *_personal_income.cs

## 4. Merge All Datasets to Create Panel Data

In [47]:
economic_data = pd.read_csv("../data/processed/economic/economic_time_series.csv")
economic_data.head()

,date,year,quarter,CPI (%/y),WPI (%/y),"PPI, Final Demand (%/y)",Unemployment rate (%),SFD (%/y),GSP quarterly components (%/y),State final demand (%/y),...,Business investment (pp/y),Government spending (pp/y),Population (%/y),Natural increase (pp/y),Net overseas migration (pp/y),Net interstate migration (pp/y),Mortgage rates (%),Savings rates (%),Cash rate (%),quarter_id
0,1949-09-01,1949,3,7.9,,,,,,,...,,,,,,,,,,1949Q3
1,1949-12-01,1949,4,10.5,,,,,,,...,,,,,,,,,,1949Q4
2,1950-03-01,1950,1,10.3,,,,,,,...,,,,,,,,,,1950Q1
3,1950-06-01,1950,2,10.0,,,,,,,...,,,,,,,,,,1950Q2
4,1950-09-01,1950,3,9.8,,,,,,,...,,,,,,,,,,1950Q3


In [59]:
rental_data = pd.read_csv("../data/processed/moving_rent/moving_annual_rent_long.csv")
rental_data.head()

,suburb,property_type,quarter,year,median_rent,quarter_year
0,albert park-middle park-west st kilda,1 bedroom flat,1,2000,165.0,1_2000
1,armadale,1 bedroom flat,1,2000,150.0,1_2000
2,carlton north,1 bedroom flat,1,2000,150.0,1_2000
3,carlton-parkville,1 bedroom flat,1,2000,165.0,1_2000
4,cbd-st kilda rd,1 bedroom flat,1,2000,250.0,1_2000


In [61]:
# Merge rental_data and economic_data on year and quarter.
# The outer join also keeps quarters that have economic data but no rent records.
merged_data = pd.merge(rental_data, economic_data, on=['year', 'quarter'], how='outer')
merged_data.shape


(84971, 26)

In [66]:
# Drop rows where suburb is null (quarters with economic data but no rent records, introduced by the outer join)
merged_data = merged_data[merged_data['suburb'].notna()]
merged_data.shape


(84729, 26)
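
An outer join followed by dropping rows with a null `suburb` keeps the same rows as a left join from the rent data, provided `suburb` is never null in the rent data (which the missing-value check in section 1 indicates). A small sketch on toy frames, with made-up values and a hypothetical `cash_rate` column, shows the equivalence:

```python
import pandas as pd

# Toy rent and economic frames; values are illustrative only
rent = pd.DataFrame({
    "suburb": ["armadale", "elwood"],
    "year": [2000, 2000],
    "quarter": [1, 1],
    "median_rent": [150.0, 160.0],
})
econ = pd.DataFrame({
    "year": [1999, 2000],
    "quarter": [4, 1],
    "cash_rate": [5.0, 5.5],
})

# Approach used above: outer join, then drop economic-only rows
outer = rent.merge(econ, on=["year", "quarter"], how="outer")
outer_filtered = outer[outer["suburb"].notna()].reset_index(drop=True)

# Alternative: left join keeps every rent row and nothing else
left = rent.merge(econ, on=["year", "quarter"], how="left")

print(len(outer), len(outer_filtered), len(left))
```

The left join avoids creating and then discarding the economic-only rows, at the cost of no longer showing how many quarters fall outside the rent data.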

In [67]:
# Helper function to merge on suburb by translating census suburb names to the DFFH definition of suburb area

def merge_on_suburb(column, data, destination):
    """
    Merge `data` into `destination` on suburb when the suburb names are in different forms.

    For combined (hyphenated) DFFH suburbs, the value of `column` is the average
    over the component suburbs, so this only works for numeric data.
    """
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()

    # Remove bracketed extra info from the incoming suburb names and lower-case them
    data["suburb"] = data["suburb"].str.replace(r"\s*\([^)]*\)", "", regex=True)
    data["suburb"] = data["suburb"].str.lower()

    # Extra adjustments needed to match the destination's specific formatting
    data["suburb"] = data["suburb"].replace({
        "brunswick west": "west brunswick",
        "brunswick east": "east brunswick",
        "st kilda east": "east st kilda",
        "st kilda west": "west st kilda",
        "hawthorn east": "east hawthorn",
        "east bendigo": "bendigo east",
        "mount martha": "mt martha",
        "mount eliza": "mt eliza",
        # The two spellings below presumably mirror the destination data
        "wangaratta": "wanagaratta",
        "newcomb": "newcombe",
    })

    # First create rows for combined suburbs
    for suburb in destination["suburb"].unique():
        # A hyphenated suburb is averaged over its component suburbs
        if "-" in suburb:
            to_avg = suburb.split("-")
            # Only average when every component suburb is present in the data
            if set(to_avg).issubset(set(data["suburb"])):
                subset = data[data["suburb"].isin(to_avg)]
                average_value = subset[column].mean()

                new_row = pd.DataFrame({
                    "suburb": [suburb],
                    column: [average_value]
                })

                # Append the combined suburb to the incoming data
                data = pd.concat([data, new_row], ignore_index=True)

    # Inner merge on suburb: destination suburbs without a match are dropped
    merged = pd.merge(destination, data, on="suburb", how="inner")

    # Optionally save a CSV for checking
    # merged.to_csv(f"../data/curated/check_{column}.csv", index=False)
    return merged
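
The averaging step for combined suburbs can be seen in isolation in the sketch below. It uses a toy census table with made-up values and a hypothetical list of DFFH areas; it mirrors the logic of `merge_on_suburb` but is not a replacement for it.

```python
import pandas as pd

# Toy census data; values are illustrative only
census = pd.DataFrame({
    "suburb": ["carlton", "parkville", "elwood"],
    "population_size": [100.0, 300.0, 500.0],
})

# Hypothetical DFFH areas: one combinable, one plain, one with missing components
dffh_areas = ["carlton-parkville", "elwood", "collingwood-abbotsford"]

known = set(census["suburb"])
rows = []
for area in dffh_areas:
    parts = area.split("-")
    # Only hyphenated areas whose components are all present can be averaged
    if len(parts) > 1 and set(parts).issubset(known):
        mean_value = census.loc[census["suburb"].isin(parts), "population_size"].mean()
        rows.append({"suburb": area, "population_size": mean_value})

combined = pd.concat([census, pd.DataFrame(rows)], ignore_index=True)
print(combined)
```

Here `carlton-parkville` gets the mean of its two components, while `collingwood-abbotsford` is skipped because neither component is in the toy census table. The helper averages every metric the same way; for a count such as population, a sum over the component suburbs may be closer to the combined area's value, so the averaged figure is best read as a typical per-suburb value.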
        

In [69]:
# Load in census data
data_path = "../data/processed/census/population_breakdown.csv"
census_pop = pd.read_csv(data_path)

# Drop individual age years (keep rows whose age group contains "-", "years", or "Total")
mask = census_pop["Age group"].str.contains(r"-|years|Total", case=False, na=False)
pop_filtered = census_pop[mask].reset_index(drop=True)
pop_filtered["Suburb"] = pop_filtered['Suburb'].str.lower()
# rename the suburb column to match destination df
pop_filtered = pop_filtered.rename(columns={"Suburb": "suburb"})

# Only care about suburb totals for now
subset = pop_filtered[pop_filtered["Age group"] == "Total"]
subset = subset.rename(columns={"Persons": "population_size"}).drop(columns=["Age group"])
panel_data_1 = merge_on_suburb("population_size", subset, merged_data)
# for age_cat in pop_filtered["Age group"].unique():
#     subset = pop_filtered[pop_filtered["Age group"] == age_cat]
#     subset = subset.rename(columns={"Persons": age_cat}).drop(columns=["Age group"])
#     merge_on_suburb(age_cat, subset, moving_annual)


# Use median stats to add median personal income
data_path = "../data/processed/census/median_stats.csv"
census_medians = pd.read_csv(data_path)
# Extract median personal income
subset = census_medians[census_medians["Statistic"] == "Median total personal income ($/weekly)"]
subset = subset.rename(
    columns={"Value": "median_personal_income", "Suburb": "suburb"}
).drop(columns=["Statistic"])
panel_data_2 = merge_on_suburb("median_personal_income", subset, panel_data_1)


# Extract median age
subset_age = census_medians[census_medians["Statistic"] == "Median age of persons"]
subset_age = subset_age.rename(
    columns={"Value": "median_age", "Suburb": "suburb"}
).drop(columns=["Statistic"])
panel_data_3 = merge_on_suburb("median_age", subset_age, panel_data_2)

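# Save the updated panel data (create the output folder if it doesn't exist)
panel_output = "../data/curated/rent_growth/panel_data.csv"
os.makedirs(os.path.dirname(panel_output), exist_ok=True)
panel_data_3.to_csv(panel_output, index=False)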


In [73]:
panel_data_3['suburb'].nunique()

143

In [74]:
panel_data_3.shape

(87330, 29)
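
The panel has more rows (87330) than the rent data it was built from (84729), even though the inner joins reduced the suburbs from 146 to 143. An inner join can only add rows when a key is repeated on the other side, so at least one census table likely contains the same suburb name more than once. The cause has not been verified here; one possibility is that stripping bracketed qualifiers collapses distinct localities into one name. The sketch below, on toy frames with a hypothetical repeated suburb, shows how to list repeated keys and how the `validate` argument of `merge` can guard against them:

```python
import pandas as pd
from pandas.errors import MergeError

# Toy panel and census frames; "ascot" is deliberately repeated in the census data
panel = pd.DataFrame({"suburb": ["ascot", "ascot", "elwood"], "year": [2000, 2001, 2000]})
census = pd.DataFrame({
    "suburb": ["ascot", "ascot", "elwood"],
    "population_size": [10.0, 20.0, 30.0],
})

# List suburb names that occur more than once on the census side
dup_keys = census.loc[census["suburb"].duplicated(keep=False), "suburb"].unique().tolist()

# Without a check, each panel row for "ascot" is matched twice
unchecked = panel.merge(census, on="suburb", how="inner")

# validate="many_to_one" requires unique keys on the right and raises otherwise
try:
    panel.merge(census, on="suburb", how="inner", validate="many_to_one")
    raised = False
except MergeError:
    raised = True

print(dup_keys, len(panel), len(unchecked), raised)
```

If duplicates are found, they would need to be resolved (for example by aggregating per suburb) before the merge; which aggregation is appropriate depends on the metric.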