In [None]:
#docker exec -it namenode bash

Raw Layer

Purpose: Store the original data exactly as scraped from sources.
If something goes wrong downstream (bad transformations, column mismatch), you can reprocess from raw without re-scraping.

In [None]:

#hdfs dfs -mkdir -p /datalake/bronze


Bronze Layer (Raw Data)

Purpose: After the raw data is saved, standardize and clean it.

Tasks:

Fix inconsistent column names.

Convert types (string → integer/float/date).

Remove obviously corrupt rows.

Why:

This separates cleaning concerns from business transformations.

Downstream layers can trust the bronze data.
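The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual cleaning code: the helper names `normalize_column_name`, `to_float`, and `clean_rows`, as well as the sample column names, are assumptions made for the example.

```python
import re


def normalize_column_name(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def to_float(value):
    """Parse a numeric string such as '1,250,000'; return None if it is not numeric."""
    if value is None:
        return None
    try:
        return float(str(value).replace(",", "").strip())
    except ValueError:
        return None


def clean_rows(rows, numeric_columns):
    """Rename columns, cast numeric columns, and drop rows whose numeric fields fail to parse."""
    cleaned = []
    for row in rows:
        renamed = {normalize_column_name(k): v for k, v in row.items()}
        for col in numeric_columns:
            renamed[col] = to_float(renamed.get(col))
        if all(renamed[col] is not None for col in numeric_columns):
            cleaned.append(renamed)
    return cleaned


# Hypothetical sample rows; the real scraped columns may differ.
sample = [
    {"Price (EGP)": "1,250,000", " Area m2 ": "120"},
    {"Price (EGP)": "call agent", " Area m2 ": "95"},
]
result = clean_rows(sample, numeric_columns=["price_egp", "area_m2"])
print(result)
```

On a Spark DataFrame, the same renaming could be applied with `df.toDF(*[normalize_column_name(c) for c in df.columns])`.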

In [None]:
#hdfs dfs -mkdir -p /datalake/silver

Silver Layer (Transforming)

Purpose: Combine multiple sources and apply business logic.

Tasks:

Merge Dubizzle, Bayut, FazWaz data into a unified schema.

Calculate derived columns (price per m², property age, etc.).

Filter for relevant data.

Why:

Silver data is ready for analytics or machine learning.

Avoids mixing cleaning and business transformations in the same step.
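As a rough sketch of the tasks above, the following plain-Python example projects records from different sources onto one schema and derives price per m². The per-source field names in `SOURCE_FIELD_MAPS` and the helpers `to_unified` and `add_price_per_m2` are hypothetical; the real scraped columns are not shown in this notebook and may differ.

```python
# Hypothetical per-source field names; the real scraped columns may differ.
SOURCE_FIELD_MAPS = {
    "dubizzle": {"price": "price", "area": "area_m2", "city": "location"},
    "bayut": {"price": "listing_price", "area": "size_sqm", "city": "city"},
    "fazwaz": {"price": "price_egp", "area": "area", "city": "city_name"},
}


def to_unified(record, source):
    """Project one source record onto the unified schema and tag its origin."""
    field_map = SOURCE_FIELD_MAPS[source]
    unified = {target: record.get(src) for target, src in field_map.items()}
    unified["source"] = source
    return unified


def add_price_per_m2(record):
    """Return a copy with price_per_m2 added; None when price is missing or area is missing/zero."""
    price, area = record.get("price"), record.get("area")
    enriched = dict(record)
    enriched["price_per_m2"] = price / area if price is not None and area else None
    return enriched


# Hypothetical, already-cleaned record from one source.
bayut_row = {"listing_price": 2400000.0, "size_sqm": 120.0, "city": "Cairo"}
merged = add_price_per_m2(to_unified(bayut_row, "bayut"))
print(merged)
```

Keeping the field mapping in a single dictionary means a new source only needs one extra entry, and rows whose `price_per_m2` is `None` can be filtered out in a later step.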

In [None]:
# hdfs dfs -mkdir -p /datalake/gold

In [None]:
# hdfs dfs -chmod 777 /datalake/bronze
# hdfs dfs -chmod 777 /datalake/silver
# hdfs dfs -chmod 777 /datalake/gold  
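The directory and permission commands above repeat the same pattern for each layer, so they could be generated from one list. The sketch below only builds and prints the commands; `LAYERS`, `BASE_DIR`, and `layer_commands` are names introduced for illustration.

```python
import shlex

LAYERS = ["bronze", "silver", "gold"]
BASE_DIR = "/datalake"


def layer_commands(mode="777"):
    """Build the mkdir and chmod commands for each layer as argument lists."""
    commands = []
    for layer in LAYERS:
        path = f"{BASE_DIR}/{layer}"
        commands.append(["hdfs", "dfs", "-mkdir", "-p", path])
        commands.append(["hdfs", "dfs", "-chmod", mode, path])
    return commands


# Print the commands so they can be reviewed before running them.
for cmd in layer_commands():
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` from a shell where the `hdfs` client is available, for example inside the namenode container. Mode `777` grants read, write, and execute to every user; a narrower mode can be passed as `mode` if that is too permissive.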

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSVtoParquetCustomNames").getOrCreate()

# Mapping of local CSV files to HDFS parquet filenames
files_mapping = {
    "/data/data_csv_files/bayut/bayutData.csv": "bayutData",
    "/data/data_csv_files/dubbizle/dubbizle_alexandria.csv": "dubbizle_alexandria",
    "/data/data_csv_files/dubbizle/dubizzle_all_listings_cairo.csv": "dubizzle_all_listings_cairo",
    "/data/data_csv_files/fazwaz.com/fazwaz_apartments_allcombined.csv": "fazwaz_apartments_allcombined",
    "/data/data_csv_files/propertyfinder/propertyfinder.csv": "propertyfinder"
}

hdfs_bronze_path = "hdfs://namenode:9000/datalake/bronze/"

for local_csv, parquet_name in files_mapping.items():
    print(f"Processing {local_csv}...")
    
    # Read CSV
    df = spark.read.option("header", True).csv(local_csv)
    
    # Write to HDFS as parquet with the mapped name
    parquet_path = f"{hdfs_bronze_path}{parquet_name}"
    df.write.mode("overwrite").parquet(parquet_path)
    
    print(f"Saved {parquet_path} successfully.")


Processing /data/data_csv_files/bayut/bayutData.csv...
Saved hdfs://namenode:9000/datalake/bronze/bayutData successfully.
Processing /data/data_csv_files/dubbizle/dubbizle_alexandria.csv...
Saved hdfs://namenode:9000/datalake/bronze/dubbizle_alexandria successfully.
Processing /data/data_csv_files/dubbizle/dubizzle_all_listings_cairo.csv...
Saved hdfs://namenode:9000/datalake/bronze/dubizzle_all_listings_cairo successfully.
Processing /data/data_csv_files/fazwaz.com/fazwaz_apartments_allcombined.csv...
Saved hdfs://namenode:9000/datalake/bronze/fazwaz_apartments_allcombined successfully.
Processing /data/data_csv_files/propertyfinder/propertyfinder.csv...
Saved hdfs://namenode:9000/datalake/bronze/propertyfinder successfully.
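Before running the conversion loop above, it may help to check the mapping itself, since a wrong directory name in one path would otherwise only surface when Spark tries to read it. The sketch below is an optional pre-flight check in plain Python; `find_missing_inputs` and `find_duplicate_targets` are hypothetical helpers, and it assumes the CSV paths refer to the local filesystem of the machine running the notebook.

```python
import os


def find_missing_inputs(mapping):
    """Return the local CSV paths in the mapping that do not exist on disk."""
    return [path for path in mapping if not os.path.isfile(path)]


def find_duplicate_targets(mapping):
    """Return Parquet names that more than one CSV maps to."""
    seen, duplicates = set(), set()
    for name in mapping.values():
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return sorted(duplicates)


# Hypothetical mapping used only to demonstrate the checks.
example_mapping = {
    "/no/such/dir/listings_a.csv": "listings",
    "/no/such/dir/listings_b.csv": "listings",
}
print("Missing inputs:", find_missing_inputs(example_mapping))
print("Duplicate targets:", find_duplicate_targets(example_mapping))
```

The duplicate check matters because the loop writes with `mode("overwrite")`, so two CSV files mapped to the same name would silently replace each other.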


Change permissions to give all users access:

<!-- hdfs dfs -chmod 777 /datalake/bronze
hdfs dfs -chmod 777 /datalake/gold
hdfs dfs -chmod 777 /datalake/raw
hdfs dfs -chmod 777 /datalake/silver -->