# Raw Data Ingestion and Cleaning (One-Time Preprocessing)

This notebook performs a one-time ingestion and structural cleaning of the
Dubai Pulse Land Transactions dataset using Apache Spark.

The raw CSV file contains malformed records, mixed-language attributes,
and missing values that prevent reliable processing using traditional
data analysis tools. This notebook outputs a cleaned, Spark-native
Parquet dataset that is treated as immutable for all downstream feature
engineering and modeling tasks.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Raw_Ingestion_and_Cleaning") \
    .getOrCreate()

In [2]:
RAW_DATA_PATH = "Transactions.csv"
OUTPUT_PATH = "land_transactions_cleaned.parquet"

In [3]:
df_raw = spark.read \
    .option("header", True) \
    .option("multiLine", True) \
    .option("escape", "\"") \
    .option("mode", "DROPMALFORMED") \
    .csv(RAW_DATA_PATH)
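
Because `DROPMALFORMED` discards records silently, it can help to estimate beforehand how many records have an unexpected number of fields. The following is a minimal sketch using only Python's standard `csv` module, not part of the original pipeline. The helper name `count_field_mismatches` and the sample values are hypothetical, and Python's CSV parsing rules are not guaranteed to match Spark's, so the result is only a rough indication.

```python
import csv
import io


def count_field_mismatches(lines):
    """Count records whose field count differs from the header's.

    `lines` is any iterable of text lines, e.g. an open file object.
    Python's csv module handles quoted multi-line fields, but its
    parsing rules are not guaranteed to match Spark's CSV reader.
    """
    reader = csv.reader(lines)
    header = next(reader)
    total = mismatched = 0
    for record in reader:
        total += 1
        if len(record) != len(header):
            mismatched += 1
    return total, mismatched


# Tiny in-memory sample standing in for the real file (values are made up)
sample = io.StringIO(
    'transaction_id,trans_group_en,actual_worth\n'
    '1,"Sales",1000\n'
    '2,"Line one\nline two",2000\n'
    '3,missing_value\n'
)
print(count_field_mismatches(sample))  # (3, 1)

# Against the real file (the encoding is an assumption):
#   with open(RAW_DATA_PATH, newline="", encoding="utf-8") as f:
#       print(count_field_mismatches(f))
```

The second sample record spans two physical lines inside a quoted field and still counts as one well-formed record, which is the situation the `multiLine` option addresses on the Spark side.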

In [4]:
df_raw.printSchema()
print("Raw row count:", df_raw.count())

root
 |-- transaction_id: string (nullable = true)
 |-- procedure_id: string (nullable = true)
 |-- trans_group_id: string (nullable = true)
 |-- trans_group_ar: string (nullable = true)
 |-- trans_group_en: string (nullable = true)
 |-- procedure_name_ar: string (nullable = true)
 |-- procedure_name_en: string (nullable = true)
 |-- instance_date: string (nullable = true)
 |-- property_type_id: string (nullable = true)
 |-- property_type_ar: string (nullable = true)
 |-- property_type_en: string (nullable = true)
 |-- property_sub_type_id: string (nullable = true)
 |-- property_sub_type_ar: string (nullable = true)
 |-- property_sub_type_en: string (nullable = true)
 |-- property_usage_ar: string (nullable = true)
 |-- property_usage_en: string (nullable = true)
 |-- reg_type_id: string (nullable = true)
 |-- reg_type_ar: string (nullable = true)
 |-- reg_type_en: string (nullable = true)
 |-- area_id: string (nullable = true)
 |-- area_name_ar: string (nullable = true)
 |-- area_name

## Schema Handling Strategy

Due to the large number of attributes and the presence of mixed-language
categorical fields, no explicit schema is declared at ingestion: every
column is read as a string, as the schema output above shows. Critical
numerical fields are then cast explicitly to ensure correctness for
downstream processing; `instance_date` is left as a string in this notebook.
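
One way to validate a cast before applying it is to count, per column, the non-null values that would turn into nulls. The sketch below only builds Spark SQL expression strings, so it runs without a Spark session. `CAST_PLAN` and `build_cast_failure_exprs` are hypothetical names introduced for illustration, and the approach assumes ANSI mode is disabled (`spark.sql.ansi.enabled=false`), where a failed `CAST` yields `NULL` instead of raising an error.

```python
# Hypothetical cast plan: column name -> Spark SQL type. The two columns
# mirror the ones cast in this notebook.
CAST_PLAN = {"actual_worth": "double", "meter_sale_price": "double"}


def build_cast_failure_exprs(cast_plan):
    """Return Spark SQL aggregate expressions that count, per column,
    the non-null values that become NULL when cast to the target type."""
    exprs = []
    for name, sql_type in cast_plan.items():
        exprs.append(
            f"SUM(CASE WHEN `{name}` IS NOT NULL "
            f"AND CAST(`{name}` AS {sql_type.upper()}) IS NULL "
            f"THEN 1 ELSE 0 END) AS `{name}_cast_failures`"
        )
    return exprs


for expr in build_cast_failure_exprs(CAST_PLAN):
    print(expr)

# Intended use inside the notebook (requires the df_raw DataFrame):
#   df_raw.selectExpr(*build_cast_failure_exprs(CAST_PLAN)).show()
```

A non-zero count for a column indicates string values that do not parse as numbers and would otherwise be converted to nulls without notice.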

In [5]:
from pyspark.sql.functions import col

df_clean = df_raw

# Cast critical numeric fields first: with ANSI mode off, values that
# cannot be parsed as numbers become null
df_clean = df_clean.withColumn(
    "actual_worth", col("actual_worth").cast("double")
).withColumn(
    "meter_sale_price", col("meter_sale_price").cast("double")
)

# Drop records missing essential identifiers or target-related values,
# including rows whose actual_worth failed the cast above
df_clean = df_clean.dropna(subset=["transaction_id", "actual_worth"])

In [6]:
df_clean.printSchema()
print("Cleaned row count:", df_clean.count())

root
 |-- transaction_id: string (nullable = true)
 |-- procedure_id: string (nullable = true)
 |-- trans_group_id: string (nullable = true)
 |-- trans_group_ar: string (nullable = true)
 |-- trans_group_en: string (nullable = true)
 |-- procedure_name_ar: string (nullable = true)
 |-- procedure_name_en: string (nullable = true)
 |-- instance_date: string (nullable = true)
 |-- property_type_id: string (nullable = true)
 |-- property_type_ar: string (nullable = true)
 |-- property_type_en: string (nullable = true)
 |-- property_sub_type_id: string (nullable = true)
 |-- property_sub_type_ar: string (nullable = true)
 |-- property_sub_type_en: string (nullable = true)
 |-- property_usage_ar: string (nullable = true)
 |-- property_usage_en: string (nullable = true)
 |-- reg_type_id: string (nullable = true)
 |-- reg_type_ar: string (nullable = true)
 |-- reg_type_en: string (nullable = true)
 |-- area_id: string (nullable = true)
 |-- area_name_ar: string (nullable = true)
 |-- area_name

In [7]:
df_clean.write \
    .mode("overwrite") \
    .parquet(OUTPUT_PATH)

## Dataset Freeze and Handoff

The cleaned dataset generated in this notebook is written in Parquet
format and treated as immutable for all downstream tasks.

Final dataset characteristics:
- Storage format: Parquet (Snappy compression)
- Output path: `land_transactions_cleaned.parquet` (the value of `OUTPUT_PATH`)

All subsequent feature engineering, modeling, manual cross-validation,
and ensemble learning steps operate exclusively on this dataset.
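
Since the write cell uses `mode("overwrite")`, re-running the notebook would replace the frozen dataset. A minimal guard, assuming the output lives on a local filesystem and Spark's default output committer has left its `_SUCCESS` marker in the output directory, could look like the sketch below. The helper `is_frozen` is hypothetical and not part of the original notebook.

```python
import os
import tempfile


def is_frozen(output_path):
    """Return True if a completed Spark write already exists at output_path.

    Assumes a local filesystem and Spark's default output committer, which
    leaves a `_SUCCESS` marker file in the output directory.
    """
    return os.path.isfile(os.path.join(output_path, "_SUCCESS"))


# Demonstration with a temporary directory standing in for the Parquet output
with tempfile.TemporaryDirectory() as tmp:
    before = is_frozen(tmp)
    open(os.path.join(tmp, "_SUCCESS"), "w").close()
    after = is_frozen(tmp)

print(before, after)  # False True

# Possible use in the write cell:
#   if not is_frozen(OUTPUT_PATH):
#       df_clean.write.mode("overwrite").parquet(OUTPUT_PATH)
```

An alternative with a similar effect is to write with `mode("errorifexists")`, which makes Spark itself refuse to write when the target path already exists.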