## Running the full ETL pipeline

Follow these steps to run data-quality checks on the raw data and then transform the cleaned output.


1. **Prerequisite** – create and activate a Python virtual environment, then install the project dependencies:
   ```bash
   python -m venv .venv
   .venv\Scripts\activate        # Windows
   # source .venv/bin/activate   # macOS/Linux
   pip install -r requirements.txt
   ```


2. **Data availability** – raw CSV files should reside in `../Raw` relative to the `src` folder; these files come from the HDB resale dataset.  The schema defined in `resale_flat_schema.raw_resale_flat_schema` treats most columns as strings.
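As a rough illustration of loading every raw file with string-typed columns, the sketch below uses only the standard library rather than the project's own `combine_datasets`. The helper name `load_raw_rows`, the temporary folder, and the two tiny files are hypothetical stand-ins for `../Raw`, and the padding of missing columns assumes that not every file shares the same header.

```python
import csv
import tempfile
from pathlib import Path


def load_raw_rows(raw_dir):
    """Read every CSV in raw_dir, keeping each value as a string."""
    columns, rows = [], []
    for path in sorted(Path(raw_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of column names in first-seen order.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    # Fill columns a file does not provide with None so all rows line up.
    return columns, [{name: row.get(name) for name in columns} for row in rows]


# Demo with two made-up files standing in for ../Raw (reduced column set).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.csv").write_text(
    "month,town,resale_price\n1990-01,ANG MO KIO,9000\n", encoding="utf-8"
)
(demo_dir / "b.csv").write_text(
    "month,town,remaining_lease,resale_price\n2026-02,YISHUN,61 years 06 months,875000\n",
    encoding="utf-8",
)
columns, rows = load_raw_rows(demo_dir)
print(columns)
print(rows[0])
```

Keeping everything as text at this stage means malformed values survive loading and can be reported by the validation step instead of failing the read.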





3. **Profile raw data** – execute the first code cell below to load all raw files and generate an HTML profiling report:
   *Assumptions:* the dataset contains columns listed in `config.json` and months span from 1990 onward. The profiler helps discover actual values for categorical columns.


```python
from data_quality_check import data_profiling_run
data_profiling_run(reprofile=False)
```

```text
2026-02-16 10:11:12,916 [INFO] (data_profiling_run) Loaded raw CSV files from f:\Projects\HousingETL\Raw

month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
str,str,str,str,str,str,str,str,str,str,str
"""1990-01""","""ANG MO KIO""","""1 ROOM""","""309""","""ANG MO KIO AVE 1""","""10 TO 12""","""31""","""IMPROVED""","""1977""","""9000""",
"""1990-01""","""ANG MO KIO""","""1 ROOM""","""309""","""ANG MO KIO AVE 1""","""04 TO 06""","""31""","""IMPROVED""","""1977""","""6000""",
"""1990-01""","""ANG MO KIO""","""1 ROOM""","""309""","""ANG MO KIO AVE 1""","""10 TO 12""","""31""","""IMPROVED""","""1977""","""8000""",
"""1990-01""","""ANG MO KIO""","""1 ROOM""","""309""","""ANG MO KIO AVE 1""","""07 TO 09""","""31""","""IMPROVED""","""1977""","""6000""",
"""1990-01""","""ANG MO KIO""","""3 ROOM""","""216""","""ANG MO KIO AVE 1""","""04 TO 06""","""73""","""NEW GENERATION""","""1976""","""47200""",
…,…,…,…,…,…,…,…,…,…,…
"""2026-02""","""YISHUN""","""EXECUTIVE""","""352""","""YISHUN RING RD""","""01 TO 03""","""146.00""","""Maisonette""","""1988""","""61 years 06 months""","""875000"""
"""2026-02""","""YISHUN""","""EXECUTIVE""","""792""","""YISHUN RING RD""","""01 TO 03""","""142.00""","""Apartment""","""1987""","""60 years 07 months""","""833000"""
"""2026-01""","""YISHUN""","""EXECUTIVE""","""643""","""YISHUN ST 61""","""10 TO 12""","""142.00""","""Apartment""","""1987""","""60 years 09 months""","""825000"""
"""2026-01""","""YISHUN""","""EXECUTIVE""","""643""","""YISHUN ST 61""","""04 TO 06""","""146.00""","""Maisonette""","""1987""","""60 years 08 months""","""788000"""
```


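Run `data_profiling_run` first to identify the categorical values actually used for `town`, `flat_type`, and `flat_model`.


Findings from the profiling report:
1. The expected categorical values will be taken from the most common values that appear in the data.
2. Categorical values are written in mixed case (the sample above shows both `IMPROVED` and `Maisonette`), which splits the counts for a category. Validation therefore includes an extra step that converts them all to upper case.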
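The effect of mixed case on category counts can be illustrated with a few values. This is a minimal sketch using `collections.Counter`; the list is a made-up sample (only some of the spellings appear in the output above), not the real profiling result.

```python
from collections import Counter

# Illustrative sample only; "Improved" and "MAISONETTE" are made up for the demo.
flat_models = [
    "IMPROVED", "Improved", "NEW GENERATION",
    "Maisonette", "MAISONETTE", "Apartment",
]

raw_counts = Counter(flat_models)
normalized_counts = Counter(value.strip().upper() for value in flat_models)

print(len(raw_counts))         # distinct spellings before normalizing
print(len(normalized_counts))  # distinct categories after upper-casing
print(normalized_counts.most_common())
```

Upper-casing before counting merges spellings that differ only in case, so the most common values reflect real categories rather than formatting differences.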




4. **Update rules** – inspect `data_quality_rules.json` and adjust `expected_values` lists for `flat_type`, `flat_model`, etc.  Rules are applied after upper‑casing to normalize case variations.
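The following sketch shows one way such rules could be applied after upper-casing. The JSON layout (a column name mapped to an `expected_values` list) and the helper `split_by_rules` are assumptions for illustration; the actual structure of `data_quality_rules.json` and the project's filtering code may differ.

```python
import json

# Hypothetical rules layout; the real data_quality_rules.json may differ.
rules_text = """
{
  "flat_type": {"expected_values": ["1 ROOM", "3 ROOM", "EXECUTIVE"]},
  "flat_model": {"expected_values": ["IMPROVED", "NEW GENERATION", "MAISONETTE", "APARTMENT"]}
}
"""
rules = json.loads(rules_text)


def split_by_rules(rows, rules):
    """Upper-case the ruled columns, then split rows by expected values."""
    qualified, unqualified = [], []
    for row in rows:
        normalized = {**row, **{col: row[col].upper() for col in rules}}
        ok = all(
            normalized[col] in spec["expected_values"]
            for col, spec in rules.items()
        )
        (qualified if ok else unqualified).append(normalized)
    return qualified, unqualified


rows = [
    {"flat_type": "Executive", "flat_model": "Maisonette"},
    {"flat_type": "3 ROOM", "flat_model": "UNKNOWN MODEL"},  # made-up bad value
]
qualified, unqualified = split_by_rules(rows, rules)
print(len(qualified), len(unqualified))
```

Because normalization happens before the membership test, the `expected_values` lists only need the upper-case spelling of each category.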

5. **Run validation** – call `data_validation` from `data_quality_check` to clean the data:

```python
from data_quality_check import data_validation, combine_datasets
from pathlib import Path
raw = combine_datasets(Path('../Raw'))
qualified, unqualified = data_validation(raw)

print(qualified.head(n=50))
print(unqualified.head(n=50))
```

```text
2026-02-16 10:11:13,380 [INFO] (filter_null_values) Successfully converted and filtered out null values: 817822 rows with null values found
2026-02-16 10:11:13,383 [INFO] (filter_month_range) Filtered rows by month range: kept 153701 of 153701
2026-02-16 10:11:13,390 [INFO] (data_validation) Successfully converted all categorical values to upper case
2026-02-16 10:11:13,410 [INFO] (data_validation) Successfully filtered out categorical values based on rules: 132653 qualified, 21048 not qualified
2026-02-16 10:11:13,490 [INFO] (data_validation) Validation complete: 130013 qualified, 844068 not qualified


shape: (50, 13)
┌────────────┬────────────┬───────────┬───────┬───┬────────────┬───────────┬───────────┬───────────┐
│ month      ┆ town       ┆ flat_type ┆ block ┆ … ┆ remaining_ ┆ resale_pr ┆ created_d ┆ composite │
│ ---        ┆ ---        ┆ ---       ┆ ---   ┆   ┆ lease      ┆ ice       ┆ atetime   ┆ _key      │
│ date       ┆ str        ┆ str       ┆ str   ┆   ┆ ---        ┆ ---       ┆ ---       ┆ ---       │
│            ┆            ┆           ┆       ┆   ┆ str        ┆ i64       ┆ datetime[ ┆ str       │
│            ┆            ┆           ┆       ┆   ┆            ┆           ┆ μs]       ┆           │
╞════════════╪════════════╪═══════════╪═══════╪═══╪════════════╪═══════════╪═══════════╪═══════════╡
│ 2016-07-01 ┆ TOA PAYOH  ┆ 3 ROOM    ┆ 121   ┆ … ┆ 68 years 5 ┆ 390000    ┆ 2026-02-1 ┆ 2016-07-0 │
│            ┆            ┆           ┆       ┆   ┆ months     ┆           ┆ 6 10:11:1 ┆ 1TOA      │
│            ┆            ┆           ┆       ┆   ┆            ┆           
```


   *Behavior:*
   - Rows with missing key fields are removed.
   - Only months between Jan 2016 and Jan 2019 are kept (per `filter_month_range`).
   - Categorical values are upper‑cased and filtered using the rules.
   - Numeric casts and lease calculations are performed.
   - Rows are checked for duplicates on the composite key and split into qualified and failed sets.
   - Cleaned rows are written to `../Data/Cleaned.csv`, failed ones to `../Data/Failed.csv`.
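Two of the behaviors above, the month-range filter and the duplicate split, can be sketched in plain Python. This is not the project's code: the columns that make up the composite key, the inclusive bounds, and the choice to keep the first row of each duplicate group are all assumptions, and the sample rows are illustrative.

```python
from datetime import date

# Jan 2016 to Jan 2019; treating both bounds as inclusive is an assumption.
START, END = date(2016, 1, 1), date(2019, 1, 1)

# Hypothetical choice of columns for the composite key.
KEY_COLUMNS = ("month", "town", "flat_type", "block", "storey_range")


def month_in_range(month_text):
    """True when a 'YYYY-MM' string falls inside the configured window."""
    year, month = (int(part) for part in month_text.split("-"))
    return START <= date(year, month, 1) <= END


def split_duplicates(rows):
    """Keep the first row per composite key; send later repeats to the failed set."""
    seen, qualified, failed = set(), [], []
    for row in rows:
        key = "".join(row[name] for name in KEY_COLUMNS)
        (failed if key in seen else qualified).append({**row, "composite_key": key})
        seen.add(key)
    return qualified, failed


rows = [
    {"month": "2016-07", "town": "TOA PAYOH", "flat_type": "3 ROOM", "block": "121", "storey_range": "04 TO 06"},
    {"month": "2016-07", "town": "TOA PAYOH", "flat_type": "3 ROOM", "block": "121", "storey_range": "04 TO 06"},
    {"month": "1990-01", "town": "ANG MO KIO", "flat_type": "1 ROOM", "block": "309", "storey_range": "10 TO 12"},
]
in_window = [row for row in rows if month_in_range(row["month"])]
qualified, failed = split_duplicates(in_window)
print(len(in_window), len(qualified), len(failed))
```

The 1990 row is dropped by the window, and the repeated 2016 row lands in the failed set under the keep-first assumption.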

6. **Transform cleaned data** – once you have `Cleaned.csv`, run `transform_cleaned_data` from `data_transformation`. The function:
   - reads `Cleaned.csv` using the cleaned schema,
   - generates `block_num` (the numeric part of `block`, zero‑padded to 3 digits),
   - computes the average resale price by month and flat type,
   - joins the average back to every row,
   - builds a `resale_identifier` with the format
     `S{block_num}{last2(avg_price)}{month}{town[1:]}`,
   - detects duplicates, exports the failed records, and writes the cleaned results to `../Data/Transformed.csv`.
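The identifier logic described above can be sketched without Polars. The text does not spell out how each placeholder is rendered, so the readings here are assumptions: `last2(avg_price)` is taken as the last two digits of the rounded average, `{month}` as the `YYYY-MM` string with the hyphen removed, and `town[1:]` as a literal Python slice. The helper names and sample rows are hypothetical.

```python
import re
from collections import defaultdict


def block_num(block):
    """Numeric part of the block, zero-padded to at least 3 digits."""
    return re.sub(r"\D", "", block).zfill(3)


def average_prices(rows):
    """Average resale_price for each (month, flat_type) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["month"], row["flat_type"])].append(row["resale_price"])
    return {key: sum(prices) / len(prices) for key, prices in groups.items()}


def resale_identifier(row, avg_price):
    last2 = str(round(avg_price))[-2:]     # assumed meaning of last2(avg_price)
    month = row["month"].replace("-", "")  # assumed rendering of {month}
    return f"S{block_num(row['block'])}{last2}{month}{row['town'][1:]}"


rows = [
    {"month": "2016-07", "town": "TOA PAYOH", "flat_type": "3 ROOM", "block": "121", "resale_price": 390000},
    {"month": "2016-07", "town": "YISHUN", "flat_type": "3 ROOM", "block": "12A", "resale_price": 400010},
]
averages = average_prices(rows)
# Join the group average back to each row before building the identifier.
identifiers = [
    resale_identifier(row, averages[(row["month"], row["flat_type"])])
    for row in rows
]
print(identifiers)
```

Both sample rows share one `(month, flat_type)` group, so they receive the same average and differ only in the block and town parts of the identifier.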

```python
from data_transformation import transform_cleaned_data
transformed = transform_cleaned_data()

print(transformed.head(n=50))
```

```text
2026-02-16 10:11:13,672 [INFO] (transform_cleaned_data) Loading configuration file
2026-02-16 10:11:13,673 [INFO] (transform_cleaned_data) Cleaned data path resolved: f:\Projects\HousingETL\src\..\Data\Cleaned.csv
2026-02-16 10:11:13,674 [INFO] (transform_cleaned_data) Reading cleaned CSV into DataFrame
2026-02-16 10:11:13,745 [INFO] (transform_cleaned_data) Loaded 130013 rows
2026-02-16 10:11:13,746 [INFO] (get_resale_identifier) Creating block_num column
2026-02-16 10:11:13,765 [INFO] (get_resale_identifier) Aggregating average resale_price by month and flat_type
2026-02-16 10:11:13,818 [INFO] (get_resale_identifier) Aggregation produced 460 groups
2026-02-16 10:11:13,818 [INFO] (get_resale_identifier) Joining averages back to original DataFrame
2026-02-16 10:11:13,852 [INFO] (get_resale_identifier) Joined DataFrame has 130013 rows
2026-02-16 10:11:13,853 [INFO] (get_resale_identifier) Generating resale_identifier column
2026-02-16 10:11:13,868 [INFO] (transform_cleaned_data) Sorting

SchemaError: provided schema does not match number of columns in file (11 != 13 in file)
```
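The `SchemaError` reports a schema with 11 columns against a file with 13. The validation output earlier shows 13 columns, including `created_datetime` and `composite_key`, so one plausible cause is a schema that omits those two columns. The sketch below compares a CSV header with a list of schema column names using the standard library. The helper `compare_header` is hypothetical, and the header line is built in memory as a stand-in rather than read from the real file.

```python
import csv
import io

# The 11 raw column names shown in the profiling output.
SCHEMA_COLUMNS = [
    "month", "town", "flat_type", "block", "street_name", "storey_range",
    "floor_area_sqm", "flat_model", "lease_commence_date",
    "remaining_lease", "resale_price",
]


def compare_header(csv_text, schema_columns):
    """Return the header plus the names present on only one side."""
    header = next(csv.reader(io.StringIO(csv_text)))
    extra_in_file = [name for name in header if name not in schema_columns]
    missing_in_file = [name for name in schema_columns if name not in header]
    return header, extra_in_file, missing_in_file


# Assumed header line for illustration; not read from disk.
sample = ",".join(SCHEMA_COLUMNS + ["created_datetime", "composite_key"]) + "\n"
header, extra_in_file, missing_in_file = compare_header(sample, SCHEMA_COLUMNS)
print(len(SCHEMA_COLUMNS), len(header))
print(extra_in_file)
```

Running a check like this against the real header of a written file before passing a schema to the reader would show which column names need to be added or dropped.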

7. **Review outputs** – inspect the CSV files under `Data` and use the profiling report or additional analysis as needed.

### Notes & assumptions

* The `month` field is assumed to be parseable as a date; rows outside the specified range are removed early.
* The transformation uses Polars for performance and adds logging at each key step; logs appear on the console and in `housing_etl.log` if `LogsFolderName` is configured.
* Adjust `data_quality_rules.json` and `config.json` as the dataset evolves.
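The log lines shown in this walkthrough follow a `timestamp [LEVEL] (function) message` pattern. A minimal sketch of a logger that writes that pattern to the console and, optionally, to `housing_etl.log` is given below. The helper `build_logger`, its `logs_folder` parameter (standing in for the `LogsFolderName` setting), and the demo function are assumptions, not the project's actual logging setup.

```python
import logging
import tempfile
from pathlib import Path

# Matches lines such as: 2026-02-16 10:11:13,672 [INFO] (func_name) message
LOG_FORMAT = "%(asctime)s [%(levelname)s] (%(funcName)s) %(message)s"


def build_logger(name, logs_folder=None):
    """Console logger; also writes housing_etl.log when a folder is given."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter(LOG_FORMAT)

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if logs_folder:
        folder = Path(logs_folder)
        folder.mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(folder / "housing_etl.log", encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


# Demo: a temporary folder stands in for the configured logs folder.
logs_dir = tempfile.mkdtemp()
logger = build_logger("housing_etl_demo", logs_dir)


def transform_step():
    logger.info("Loaded %d rows", 3)


transform_step()
log_text = (Path(logs_dir) / "housing_etl.log").read_text(encoding="utf-8")
print(log_text)
```

Using `%(funcName)s` in the format string is what puts the calling function's name in parentheses, so each pipeline step identifies itself without extra arguments.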
