## Downloading the Data (Refactored)

This notebook demonstrates the use of the `DownloadUtils` class for downloading and processing data from various sources.

The `DownloadUtils` class provides a clean, object-oriented interface for:
- Creating data directory structures
- Downloading files from URLs
- Scraping time series data from web tables
- Saving processed data to CSV files


In [None]:
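import sys

# Relative imports (`from ..utils ...`) fail in a notebook, which has no parent
# package. Assumes `utils/` sits one level above this notebook; adjust if not.
sys.path.insert(0, "..")
from utils.download import DownloadUtils
from utils.preprocess import PreprocessUtils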

# Initialize the utilities
downloader = DownloadUtils(base_data_dir="../data/")
preprocessor = PreprocessUtils()


### Option 1: Download All Data at Once

The simplest approach is to download every data source with the `download_all_data()` method. This includes the census files, so expect a very long run time.


In [None]:
# Download all data sources with verbose output
downloader.download_all_data(verbose=True)


### Option 2: Download Data Sources Individually

You can also download specific data sources individually for more control over the process.


#### Download Rent Data


In [None]:
# Download moving annual rent data with verbose output
downloader.download_rent_data(verbose=True)

# Download the latest moving annual rent file (March 2025) with verbose output
downloader.download_latest_rent_data(verbose=True)


#### Download Public Transport Data


In [None]:
# Download public transport stops and lines data with verbose output
downloader.download_public_transport_data(verbose=True)


#### Download School Locations Data


In [None]:
# Download school locations for 2023, 2024, and 2025 with verbose output
downloader.download_school_locations(verbose=True)


#### Download Open Space Data


In [None]:
# Download open space data with verbose output
downloader.download_open_space_data(verbose=True)


#### Scrape Time Series Data


In [None]:
# Scrape unemployment rate data with verbose output
downloader.scrape_unemployment_data(verbose=True)


In [None]:
# Scrape interest rates data with verbose output
downloader.scrape_interest_rates_data(verbose=True)


In [None]:
# Scrape price data with verbose output
downloader.scrape_price_data(verbose=True)


In [None]:
# Scrape economic activity data with verbose output
downloader.scrape_economic_activity_data(verbose=True)


In [None]:
# Scrape population dynamics data with verbose output
downloader.scrape_population_data(verbose=True)


In [None]:
# Scrape investment data with verbose output
downloader.scrape_investment_data(verbose=True)


### Option 3: Custom Downloads

You can also use the general methods for custom downloads or scraping.


In [None]:
# Example: Custom file download
# downloader.download_file(
#     url="https://example.com/data.csv",
#     output_path="../data/landing/custom/custom_data",
#     file_type="csv"
# )

# Example: Custom time series scraping
# custom_data = downloader.scrape_time_series_data(
#     url="https://example.com/time-series-table",
#     data_name="custom_series",
#     value_columns=[1, 2],  # Specify which columns to extract
#     aggregate_method='mean'
# )
# downloader.save_time_series_data(
#     custom_data,
#     "custom_series",
#     "../data/landing/custom"
# )
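
For readers without access to the `DownloadUtils` source, the sketch below shows one way a generic file download could work. It is not the class's actual implementation: `download_to` is a hypothetical name, and the assumption that `file_type` is appended to `output_path` as an extension is inferred only from the example arguments above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, output_path: str, file_type: str) -> Path:
    """Hypothetical helper: save `url` to `<output_path>.<file_type>`."""
    # Assumption: the extension is appended to the given path.
    target = Path(f"{output_path}.{file_type}")
    # Create the parent directory tree if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_to("https://example.com/data.csv", "../data/landing/custom/custom_data", "csv")` would write `custom_data.csv` under `../data/landing/custom/`, creating the directory if needed.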


#### Download Population Census Data


In [None]:
# Download and process complete census data (SAL codes 20001-22944)
# This demonstrates proper separation of concerns:
# 1. DownloadUtils handles downloading Excel files
# 2. PreprocessUtils handles processing and merging data
# Note: This will take a very long time and generate a lot of output

# Step 1: Download census Excel files (DownloadUtils responsibility)
# no_data_list = downloader.download_population_census_data(verbose=True)

# Step 2: Process and merge census data (PreprocessUtils responsibility)
# preprocessor.process_census_data_workflow(no_data_list, base_data_dir="../data/")
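
The `no_data_list` returned by the download step presumably records the SAL codes for which no file could be retrieved, so that the processing step can skip them. The sketch below illustrates that pattern under this assumption. `download_range` and the injected `fetch` callable are hypothetical names, not part of `DownloadUtils`.

```python
from typing import Callable, Iterable, List


def download_range(codes: Iterable[int], fetch: Callable[[int], bytes]) -> List[int]:
    """Try every code and return the codes for which no data could be fetched."""
    no_data = []
    for code in codes:
        try:
            payload = fetch(code)
        except OSError:
            # Covers network and file errors; urllib.error.URLError subclasses OSError.
            no_data.append(code)
            continue
        if not payload:
            # An empty response is also treated as "no data".
            no_data.append(code)
    return no_data
```

Passing the fetch function as an argument keeps the loop independent of the network layer, which makes the failure handling easy to exercise with a stub.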


#### Download Population Census Data (Step by Step)

You can also run the census data download process step by step for more control:


In [None]:
# Step 1: Download census Excel files (DownloadUtils responsibility)
# This downloads Excel files for SAL codes 20001-22944 from ABS
# Note: This will take a very long time and generate a lot of output
# no_data_list = downloader.download_population_census_data(verbose=True)


In [None]:
# Step 2: Process Excel files to CSV format (PreprocessUtils responsibility)
# This converts the downloaded Excel files into structured CSV files
# preprocessor.process_all_census_data(no_data_list, base_data_dir="../data/")


In [None]:
# Step 3: Merge individual CSV files into consolidated datasets (PreprocessUtils responsibility)
# This creates 7 large CSV files with all suburb data combined
# preprocessor.merge_census_csv_files(base_data_dir="../data/")
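
As a rough illustration of the merge step, the sketch below concatenates per-suburb CSV files that share a header into one consolidated file. It uses only the standard library and is not the `PreprocessUtils` implementation. `merge_csv_files` is a hypothetical name, and the assumption that every input file has an identical header row is mine.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_csv_files(inputs: Iterable[Path], output: Path) -> int:
    """Concatenate CSV files that share a header; return the data rows written."""
    rows_written = 0
    header = None
    with open(output, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        # Sort the inputs so the merged file has a stable row order.
        for path in sorted(inputs):
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # The first non-empty file defines the output header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```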


### Option 4: Download All Data Including Census

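The `download_all_data()` method covers every data source, including census data. The calls below are commented out because the census download is slow, and the downloaded census files still need to be processed with `PreprocessUtils`: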


In [None]:
# Download all data sources including census data with verbose output
# Note: This will take a very long time due to the large number of census files
# downloader.download_all_data(verbose=True)

# For census data, use proper separation of concerns:
# no_data_list = downloader.download_population_census_data(verbose=True)
# preprocessor.process_census_data_workflow(no_data_list, base_data_dir="../data/")


### Census Data Output

The census workflow (download, then processing and merging) creates the following seven consolidated CSV files in the `../data/landing/` directory:

1. **median_stats.csv** - Median age, income, and other demographic statistics
2. **population_breakdown.csv** - Age group population breakdowns
3. **personal_income.csv** - Personal income distribution by age groups
4. **household_income.csv** - Household income distribution
5. **dwelling_structure.csv** - Types of dwellings and housing structures
6. **job_type.csv** - Employment types by age groups
7. **education_level.csv** - Education levels by age groups

Each file contains data for all Victorian suburbs (SAL codes 20001-22944) with suburb names included.
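
A quick way to confirm that the workflow finished is to check that all seven files exist. The sketch below does this with the file names listed above. `missing_census_outputs` is a hypothetical helper and not part of either utility class.

```python
from pathlib import Path
from typing import List

# File names as listed in this notebook.
EXPECTED_CENSUS_FILES = [
    "median_stats.csv",
    "population_breakdown.csv",
    "personal_income.csv",
    "household_income.csv",
    "dwelling_structure.csv",
    "job_type.csv",
    "education_level.csv",
]


def missing_census_outputs(landing_dir: str) -> List[str]:
    """Return the expected census files that are not present in `landing_dir`."""
    landing = Path(landing_dir)
    return [name for name in EXPECTED_CENSUS_FILES if not (landing / name).is_file()]
```

An empty result from `missing_census_outputs("../data/landing/")` means every consolidated file is in place.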


### Separation of Concerns

The refactored system now properly separates responsibilities:

**DownloadUtils Class:**
- ✅ Downloads files from URLs
- ✅ Scrapes time series data from web tables
- ✅ Downloads census Excel files from ABS
- ❌ Does NOT process or transform data

**PreprocessUtils Class:**
- ✅ Processes Excel files to CSV format
- ✅ Merges individual CSV files into consolidated datasets
- ✅ Handles data transformation and cleaning
- ✅ Manages suburb name extraction and mapping
- ❌ Does NOT download files from external sources


In [None]:
# Example: Proper separation of concerns for census data
# This demonstrates how to use both classes together

# 1. Download census Excel files (DownloadUtils responsibility)
# no_data_list = downloader.download_population_census_data(verbose=True)

# 2. Process the downloaded files (PreprocessUtils responsibility)
# preprocessor.process_census_data_workflow(no_data_list, base_data_dir="../data/")


### Verbose Output Features

The `DownloadUtils` class now includes comprehensive verbose output with timing information:

**Download Methods:**
- Shows download URLs, file sizes, and download times
- Displays progress indicators for multiple files
- Provides detailed error messages with timing

**Scraping Methods:**
- Shows scraping URLs and processing times
- Displays number of records processed
- Provides detailed error messages with timing

**Overall Timing:**
- Tracks total time for complete workflows
- Shows breakdown by category (static files vs. scraping)
- Provides performance metrics for optimization
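
Timing output of this kind can be produced with a small context manager. The sketch below shows the general technique, not how `DownloadUtils` does it internally. The name `timed` and the message format are assumptions for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, verbose: bool = True):
    """Measure the wrapped block and print the elapsed time when `verbose` is true."""
    start = time.perf_counter()
    result = {"seconds": None}
    try:
        # The caller can read result["seconds"] after the block exits.
        yield result
    finally:
        # Runs even if the block raises, so failures are timed as well.
        result["seconds"] = time.perf_counter() - start
        if verbose:
            print(f"{label}: {result['seconds']:.2f}s")
```

Wrapping each download in `with timed("rent data"):` gives per-step timings, and wrapping the whole workflow in an outer `timed` block gives the total.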


In [None]:
# Example: Download a single file with verbose output
# This shows detailed timing and file information
downloader.download_file(
    url="https://opendata.arcgis.com/datasets/da1c06e3ab6948fcb56de4bb3c722449_0.csv",
    output_path="../data/landing/test/test_file",
    file_type="csv",
    verbose=True
)


### Benefits of the Enhanced DownloadUtils Class

The refactored `DownloadUtils` class now provides:

1. **Comprehensive Data Coverage**: All original data sources plus census data
2. **Flexible Processing**: Choose between complete workflow or step-by-step processing
3. **Error Handling**: Robust error handling for failed downloads
4. **Progress Tracking**: Progress indicators for long-running operations
5. **Data Validation**: Automatic detection of missing data sources
6. **Consolidated Output**: Merged datasets ready for analysis
7. **Extensible Design**: Easy to add new data sources or processing steps
