## Running the full ETL pipeline

Below are the concise steps a data analyst should follow to perform data-quality checks and then transform the cleaned data.  


1. **Prerequisite** – create and activate a Python virtual environment, then install the project dependencies:
   ```bash
   python -m venv .venv
   .venv\Scripts\activate          # Windows
   # source .venv/bin/activate     # macOS/Linux
   pip install -r requirements.txt
   ```


2. **Data availability** – raw CSV files should reside in `../Raw` relative to the `src` folder; these files come from the HDB resale dataset.  The schema defined in `resale_flat_schema.raw_resale_flat_schema` treats most columns as strings.
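
As an illustration of reading raw files with every value kept as a string, the sketch below combines all CSV files in a folder using only the standard library. The helper name `load_raw_rows` and the demo column names are hypothetical; the project's own `combine_datasets` and `raw_resale_flat_schema` may behave differently.

```python
import csv
import tempfile
from pathlib import Path


def load_raw_rows(raw_dir):
    """Read every *.csv file in raw_dir; csv.DictReader keeps all values as str."""
    rows = []
    for csv_path in sorted(raw_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo with two throwaway files (column names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / "a.csv").write_text(
        "month,town,resale_price\n2016-01,BEDOK,300000\n", encoding="utf-8"
    )
    (raw_dir / "b.csv").write_text(
        "month,town,resale_price\n2016-02,Bedok,310000\n", encoding="utf-8"
    )
    demo_rows = load_raw_rows(raw_dir)

print(len(demo_rows), type(demo_rows[0]["resale_price"]).__name__)
```

Keeping everything as text at this stage means that casting problems surface later, during validation, rather than while the files are being read.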





3. **Profile raw data** – execute the first code cell below to load all raw files and generate an HTML profiling report:
   *Assumptions:* the dataset contains columns listed in `config.json` and months span from 1990 onward. The profiler helps discover actual values for categorical columns.


```python
from data_quality_check import data_profiling_run

# Load all raw files and generate the HTML profiling report
data_profiling_run(reprofile=False)
```

Run `data_profiling_run` first to identify the actual categorical values used for Town, Flat Type, and Flat Model.

Findings from the profiling:

- The expected categorical values are taken to be the most common values that appear.
- Categorical values are split between upper and lower case, so an additional step converts them all to upper case.
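
The case split described here can be illustrated with a small standard-library sketch. The sample values are made up for illustration and are not taken from the profiling report; `collections.Counter` simply stands in for the profiler's frequency table.

```python
from collections import Counter

# Made-up sample of a categorical column with mixed casing.
flat_models = ["Improved", "IMPROVED", "New Generation", "NEW GENERATION", "Improved"]

# Counting the raw values splits each category across its case variants.
raw_counts = Counter(flat_models)

# Upper-casing first merges the variants into a single category each.
normalized_counts = Counter(value.upper() for value in flat_models)

print(raw_counts.most_common())
print(normalized_counts.most_common())
```

After normalization, the most common values give a cleaner starting point for the `expected_values` lists.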




4. **Update rules** – inspect `data_quality_rules.json` and adjust `expected_values` lists for `flat_type`, `flat_model`, etc.  Rules are applied after upper‑casing to normalize case variations.

5. **Run validation** – call `data_validation` from `data_quality_check` to clean the data:

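```python
from pathlib import Path

from data_quality_check import combine_datasets, data_validation

# Combine the raw CSV files, then split them into qualified and unqualified rows
raw = combine_datasets(Path('../Raw'))
qualified, unqualified = data_validation(raw)

# Preview the first 50 rows of each result
print(qualified.head(n=50))
print(unqualified.head(n=50))
```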


   *Behavior:*
   - Rows with missing key fields are removed.
   - Only months between Jan 2016 and Jan 2019 are kept (per `filter_month_range`).
   - Categorical values are upper‑cased and filtered using the rules.
   - Numeric casts and lease calculations are performed.
   - Duplicate records based on composite key are split into qualified and failed sets.
   - Cleaned rows are written to `../Data/Cleaned.csv`, failed ones to `../Data/Failed.csv`.
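
The filtering and duplicate handling listed above can be sketched in plain Python. This is not the project's Polars implementation. The rule shape, the composite key columns, the `YYYY-MM` month format, the inclusive range bounds, and the "first occurrence is qualified" policy are all assumptions made for illustration. Numeric casts, lease calculations, and file output are left out.

```python
from datetime import date

# Hypothetical rule shape; the real data_quality_rules.json may be organised differently.
RULES = {"flat_type": {"expected_values": ["3 ROOM", "4 ROOM"]}}
# Assumed composite key; the project's actual key columns are not documented here.
KEY_COLUMNS = ("month", "town", "block", "flat_type")
# Jan 2016 to Jan 2019, with both bounds assumed to be inclusive.
START, END = date(2016, 1, 1), date(2019, 1, 1)


def parse_month(value):
    """Parse a 'YYYY-MM' string (assumed format) into the first day of that month."""
    year, month = value.split("-")
    return date(int(year), int(month), 1)


def validate(rows):
    """Return (qualified, failed) lists of row dicts."""
    qualified, failed, seen_keys = [], [], set()
    for row in rows:
        # Drop rows with a missing key field.
        if any(not row.get(column) for column in KEY_COLUMNS):
            continue
        # Drop rows whose month falls outside the range.
        if not START <= parse_month(row["month"]) <= END:
            continue
        # Upper-case the categorical columns, then check them against the rules.
        clean = dict(row)
        for column in RULES:
            clean[column] = clean[column].upper()
        if any(clean[col] not in rule["expected_values"] for col, rule in RULES.items()):
            continue
        # The first row seen for a key is qualified; later duplicates fail (assumption).
        key = tuple(clean[column] for column in KEY_COLUMNS)
        if key in seen_keys:
            failed.append(clean)
        else:
            seen_keys.add(key)
            qualified.append(clean)
    return qualified, failed


sample_rows = [
    {"month": "2016-03", "town": "BEDOK", "block": "12", "flat_type": "3 room"},
    {"month": "2016-03", "town": "BEDOK", "block": "12", "flat_type": "3 ROOM"},  # duplicate
    {"month": "2015-12", "town": "BEDOK", "block": "12", "flat_type": "3 ROOM"},  # out of range
    {"month": "2017-05", "town": "", "block": "7", "flat_type": "4 ROOM"},  # missing key field
    {"month": "2018-01", "town": "YISHUN", "block": "7", "flat_type": "PENTHOUSE"},  # not expected
]
qualified_rows, failed_rows = validate(sample_rows)
print(len(qualified_rows), len(failed_rows))
```

Upper-casing before the duplicate check matters in this sketch: the first two sample rows only collide on the composite key once `3 room` has become `3 ROOM`.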

6. **Transform cleaned data** – once you have `Cleaned.csv`, run `transform_cleaned_data` from `data_transformation`. This function:
   - reads `Cleaned.csv` using the cleaned schema,
   - generates `block_num` (3‑digit, zero‑padded numeric part of `block`),
   - computes the average resale price by month and flat type,
   - joins the average back to every row,
   - builds a `resale_identifier` with the format
     `S{block_num}{last2(avg_price)}{month}{town[1:]}`,
   - detects duplicates and exports the failed records; cleaned results go to `../Data/Transformed.csv`.

```python
from data_transformation import transform_cleaned_data

# Transform Cleaned.csv and preview the first 50 rows of the result
transformed = transform_cleaned_data()
print(transformed.head(n=50))
```
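
To make the identifier format from step 6 concrete, here is a plain-Python sketch of one possible reading of it. It is not the project's Polars code, and several details are assumptions: `last2(avg_price)` is taken as the last two digits of the integer part of the average, `month` is inserted as the raw `YYYY-MM` string, and `town[1:]` is read as a Python slice that drops the first character. The helper name `add_identifiers` is hypothetical.

```python
import re
from collections import defaultdict


def block_num(block):
    """Numeric part of the block, zero-padded to 3 digits (e.g. '12A' -> '012')."""
    digits = "".join(re.findall(r"\d", block))
    return digits.zfill(3)


def add_identifiers(rows):
    """Attach block_num, the group average, and resale_identifier to each row."""
    # Collect prices per (month, flat_type) group and average them.
    prices = defaultdict(list)
    for row in rows:
        prices[(row["month"], row["flat_type"])].append(row["resale_price"])
    averages = {key: sum(values) / len(values) for key, values in prices.items()}

    result = []
    for row in rows:
        avg = averages[(row["month"], row["flat_type"])]
        # Assumption: last two digits of the integer part of the average price.
        last2 = f"{int(avg) % 100:02d}"
        number = block_num(row["block"])
        identifier = f"S{number}{last2}{row['month']}{row['town'][1:]}"
        result.append(
            {**row, "block_num": number, "avg_resale_price": avg, "resale_identifier": identifier}
        )
    return result


demo = add_identifiers(
    [
        {"month": "2016-03", "flat_type": "3 ROOM", "block": "12A", "town": "BEDOK", "resale_price": 300000},
        {"month": "2016-03", "flat_type": "3 ROOM", "block": "345", "town": "YISHUN", "resale_price": 300066},
    ]
)
for row in demo:
    print(row["resale_identifier"])
```

Both demo rows share the same month and flat type, so they share one average (300033) and therefore the same two-digit price component.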

7. **Review outputs** – inspect the CSV files under `Data` and use the profiling report or additional analysis as needed.

### Notes & assumptions

* The `month` field is assumed to be parseable as a date; rows outside the specified range are removed early.
* The transformation uses Polars for performance and adds logging at each key step; logs appear on the console and in `housing_etl.log` if `LogsFolderName` is configured.
* Adjust `data_quality_rules.json` and `config.json` as the dataset evolves.
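
The logging behavior noted above (console always, file only when a logs folder is configured) could be set up along these lines with the standard `logging` module. The function name `configure_logging`, the logger name, and the message format are hypothetical; only the `housing_etl.log` file name and the `LogsFolderName` setting come from this document.

```python
import logging
from pathlib import Path
from typing import Optional


def configure_logging(logs_folder: Optional[str] = None) -> logging.Logger:
    """Always log to the console; also log to housing_etl.log when a folder is given."""
    logger = logging.getLogger("housing_etl")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if logs_folder:
        folder = Path(logs_folder)
        folder.mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(folder / "housing_etl.log", encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


# logs_folder would be read from the `LogsFolderName` setting; None means console only.
logger = configure_logging(logs_folder=None)
logger.info("Validation started")
```

Passing a folder path instead of `None` adds the file handler, so the same messages also land in `housing_etl.log`.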


## System Design
![System design diagram](System_Design.png)

### Network considerations:

1. AWS Athena will be given access to the S3 storage inside the VPC.
2. Tableau will be given IAM credentials that are valid for Athena.

## Mermaid code to generate the system diagram

```mermaid
flowchart LR

  %% Public Internet (only for source upload)
  subgraph INTERNET [Public Internet]
    SOURCE@{ img: "https://api.iconify.design/mdi/earth.svg", label: "Data Gov Site", pos: "b", w: 60, h: 60, constraint: "on"}
    TABLEAU_A@{ img: "https://api.iconify.design/logos/tableau-icon.svg", label: "Tableau", pos: "b", w: 60, h: 60, constraint: "on"}

  end

  

  %% Enterprise VPC (Singapore Region)
  subgraph VPC [Enterprise VPC in Singapore]
    ATHENA__A@{ img: "https://api.iconify.design/logos/aws-athena.svg", label: "Athena", pos: "b", w: 60, h: 60, constraint: "on"}
    LAMBDA_A@{ img: "https://api.iconify.design/logos/aws-lambda.svg", label: "AWS Lambda", pos: "b", w: 60, h: 60, constraint: "on"}
    %% Availability Zone A
    subgraph AZ_A [Availability Zone A]
      subgraph PRIVATE_SUBNET_A [Private Subnet A]
        S3_GW_A@{ img: "https://api.iconify.design/logos/aws-s3.svg", label: "S3 Primary", pos: "b", w: 60, h: 60, constraint: "on"}
      end
    end

    %% Availability Zone B
    subgraph AZ_B [Availability Zone B]
      subgraph PRIVATE_SUBNET_B [Private Subnet B]
        S3_GW_B@{ img: "https://api.iconify.design/logos/aws-s3.svg", label: "S3 Secondary", pos: "b", w: 60, h: 60, constraint: "on"}
      end
    end


  end

  %% Data Flow
  S3_GW_A -- "5. Replication" --> S3_GW_B
  LAMBDA_A -- "1. Trigger file upload" --> SOURCE
  SOURCE -- "2. Multipart file Download into" --> S3_GW_A
  LAMBDA_A -- "3. Runs ETL Pipeline" --> S3_GW_A
  ATHENA__A -- "4. Query via Proxy" --> S3_GW_A
  TABLEAU_A -- "Will access via PrivateLink" --> ATHENA__A

  %% Styling
  classDef vpc fill:none,color:#0a0,stroke:#0a0,stroke-dasharray: 5 5, stroke-width: 2px
  class VPC vpc
  class INTERNET,PRIVATE_SUBNET_A,PRIVATE_SUBNET_B,AZ_A,AZ_B,LB_LAYER group
```