# Cleaned Dataset Validation

This notebook performs validation and sanity checks on the cleaned
Dubai Land Transactions dataset stored in Parquet format.

The purpose of this notebook is to verify schema correctness, data
completeness, and structural consistency before feature engineering
and modeling. No data cleaning or transformation is performed here.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DLD_Data_Validation") \
    .getOrCreate()

In [3]:
df = spark.read.parquet(
    "/content/cleaned_parquet"
)

## Dataset Size Verification

The total number of rows and columns is checked to confirm that the
cleaned dataset loaded correctly. To rule out unintended data loss
during ingestion, these figures should match the counts recorded at the
end of the cleaning step.

In [4]:
print("Number of rows:", df.count())
print("Number of columns:", len(df.columns))

Number of rows: 30173
Number of columns: 50


## Schema Inspection

The dataset schema is inspected to verify data types and confirm that
numerical, categorical, and temporal attributes are represented
correctly after cleaning.

In [5]:
df.printSchema()

root
 |-- transaction_id: string (nullable = true)
 |-- procedure_id: integer (nullable = true)
 |-- trans_group_id: integer (nullable = true)
 |-- trans_group_ar: string (nullable = true)
 |-- trans_group_en: string (nullable = true)
 |-- procedure_name_ar: string (nullable = true)
 |-- procedure_name_en: string (nullable = true)
 |-- instance_date: string (nullable = true)
 |-- property_type_id: integer (nullable = true)
 |-- property_type_ar: string (nullable = true)
 |-- property_type_en: string (nullable = true)
 |-- property_sub_type_id: integer (nullable = true)
 |-- property_sub_type_ar: string (nullable = true)
 |-- property_sub_type_en: string (nullable = true)
 |-- property_usage_ar: string (nullable = true)
 |-- property_usage_en: string (nullable = true)
 |-- reg_type_id: integer (nullable = true)
 |-- reg_type_ar: string (nullable = true)
 |-- reg_type_en: string (nullable = true)
 |-- area_id: integer (nullable = true)
 |-- area_name_ar: string (nullable = true)
 |-- are
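
One way to turn the visual schema check into an automated one is to compare the column types Spark reports against an expected mapping. The sketch below is an illustration only and not part of the original notebook: the helper name schema_mismatches and the EXPECTED mapping are hypothetical, and the expected types cover just a few columns copied from the printed schema. It assumes the input is a plain dict such as dict(df.dtypes), where Spark reports integer columns as int.

```python
# Hypothetical expected types for a few columns, copied from the printed schema.
EXPECTED = {
    "transaction_id": "string",
    "procedure_id": "int",
    "instance_date": "string",
    "area_id": "int",
}


def schema_mismatches(actual, expected):
    """Return {column: (expected_type, actual_type)} for every disagreement.

    A column missing from `actual` is reported with actual_type None.
    """
    problems = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            problems[name] = (want, got)
    return problems


# Stand-in for dict(df.dtypes); in the notebook the real mapping would be used.
sample_actual = {
    "transaction_id": "string",
    "procedure_id": "int",
    "instance_date": "string",
    "area_id": "bigint",  # deliberately wrong to show a reported mismatch
}
print(schema_mismatches(sample_actual, EXPECTED))
```

An empty result means every listed column has the expected type. Extending EXPECTED to all 50 columns would make the check exhaustive.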

## Sample Record Inspection

A small sample of records is displayed to visually confirm that values
are correctly parsed and aligned with their respective columns.

In [6]:
df.show(5, truncate=False)

+-----------------+------------+--------------+--------------+--------------+--------------------------+-------------------------+-------------+----------------+----------------+----------------+--------------------+--------------------+--------------------+-----------------+-----------------+-----------+----------------+-------------------+-------+--------------+------------------+--------------------------+----------------------+--------------+--------------------------+---------------------------+--------------------------+---------------------------------+---------------------------------+----------------------------+------------------------+--------------------------+---------------+------------------+--------+--------+-----------+--------------+------------+----------------+----------+----------------+--------------------+--------------------+--------------------+-----------+----+-----+----+
|transaction_id   |procedure_id|trans_group_id|trans_group_ar|trans_group_en|procedure_na

## Missing Value Analysis

The presence of missing values is examined to understand data
completeness and to inform subsequent feature-level handling.

In [7]:
from pyspark.sql import functions as F

# Count nulls per column; using F.sum avoids shadowing Python's built-in sum
missing_summary = df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c)
    for c in df.columns
])

missing_summary.show(truncate=False)

+--------------+------------+--------------+--------------+--------------+-----------------+-----------------+-------------+----------------+----------------+----------------+--------------------+--------------------+--------------------+-----------------+-----------------+-----------+-----------+-----------+-------+------------+------------+----------------+----------------+--------------+---------------+---------------+-----------------+-----------------+-------------------+-------------------+----------------+----------------+---------------+---------------+--------+--------+-----------+--------------+------------+----------------+----------+----------------+--------------------+--------------------+--------------------+-----------+-----+-----+-----+
|transaction_id|procedure_id|trans_group_id|trans_group_ar|trans_group_en|procedure_name_ar|procedure_name_en|instance_date|property_type_id|property_type_ar|property_type_en|property_sub_type_id|property_sub_type_ar|property_sub_type_e
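
Because the one-row summary spans all 50 columns, it is hard to read at a glance. The following sketch, an addition to the original notebook, reduces such a summary to the columns that actually contain nulls, sorted by count. The helper name columns_with_nulls is hypothetical, and the input is assumed to be a plain dict such as missing_summary.first().asDict() would return. The sample counts and column names are invented for illustration and are not results from this dataset.

```python
def columns_with_nulls(null_counts, total_rows):
    """Return (column, null_count, null_fraction) tuples for columns with nulls,
    ordered from most to fewest nulls."""
    rows = [
        (name, count, count / total_rows)
        for name, count in null_counts.items()
        if count  # skips columns whose count is 0 or None
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Invented counts and column names; not taken from the dataset.
sample_counts = {"transaction_id": 0, "col_a": 120, "col_b": 3000}
for name, count, frac in columns_with_nulls(sample_counts, total_rows=30173):
    print(f"{name}: {count} nulls ({frac:.1%})")
```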

## Duplicate Transaction Check

Transaction identifiers are checked for duplicates to ensure that each
record represents a unique transaction.

In [8]:
total_rows = df.count()
distinct_transactions = df.select("transaction_id").distinct().count()

print("Total rows:", total_rows)
print("Distinct transaction IDs:", distinct_transactions)

Total rows: 30173
Distinct transaction IDs: 30173
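
If the two counts ever differ, the next step is to find out which identifiers repeat. A minimal sketch, assuming the IDs have been collected to the driver as a Python list (reasonable at this dataset's size of 30173 rows); the helper name duplicate_ids is hypothetical and the sample IDs are invented.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return {id: occurrences} for every id that appears more than once."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


# Invented IDs for illustration; in the notebook the list could come from
# [row.transaction_id for row in df.select("transaction_id").collect()].
sample_ids = ["T-1", "T-2", "T-2", "T-3", "T-3", "T-3"]
print(duplicate_ids(sample_ids))
```

For larger datasets the same result could be obtained inside Spark with a groupBy on transaction_id, followed by a count and a filter on counts greater than one.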


## Temporal Coverage Verification

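The temporal range of the dataset is examined to confirm the span of
transaction records available for analysis. Note that instance_date is
stored as a string in day-first form (dd-MM-yyyy, judging by the
output), so min and max compare text rather than dates. The values
shown are the lexicographically smallest and largest strings and are
not necessarily the true earliest and latest transaction dates.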

In [9]:
df.selectExpr(
    "min(instance_date) as earliest_date",
    "max(instance_date) as latest_date"
).show()

+-------------+-----------+
|earliest_date|latest_date|
+-------------+-----------+
|   01-02-2010| 31-12-2025|
+-------------+-----------+
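
The effect of comparing day-first date strings as text rather than as dates can be shown with the standard library alone. In this sketch the helper name date_range is hypothetical, the format dd-MM-yyyy is an assumption inferred from the value 31-12-2025, and the sample dates are invented.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # assumed day-first layout, inferred from 31-12-2025


def date_range(date_strings, fmt=DATE_FORMAT):
    """Parse the strings as dates and return (earliest, latest) as date objects."""
    parsed = [datetime.strptime(s, fmt).date() for s in date_strings]
    return min(parsed), max(parsed)


# Invented sample values, not taken from the dataset.
sample = ["15-06-2012", "01-02-2020", "31-12-2011"]

# Text comparison orders by day-of-month first, then month, then year.
print("string min/max:", min(sample), max(sample))
# Parsing first gives the chronological range.
print("parsed min/max:", *date_range(sample))
```

In Spark, the corresponding fix would presumably be to parse the column before aggregating, for example with to_date(col("instance_date"), "dd-MM-yyyy") from pyspark.sql.functions, and to take min and max of the parsed column.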



## Partitioning Inspection

The number of partitions is inspected to understand how the dataset is
split into units of parallel work for Spark executors.

In [10]:
print("Number of partitions:", df.rdd.getNumPartitions())

Number of partitions: 7
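
To see how evenly rows are spread, per-partition row counts can be summarized. The sketch below is an addition to the notebook: the helper name partition_skew is hypothetical, the input is assumed to be a list of partition sizes such as df.rdd.glom().map(len).collect() would return (glom materializes each partition as a list, which is acceptable only for small datasets), and the sample sizes are invented.

```python
def partition_skew(sizes):
    """Return (smallest, largest, ratio of largest to mean) for partition sizes."""
    if not sizes:
        raise ValueError("no partitions given")
    mean = sum(sizes) / len(sizes)
    ratio = max(sizes) / mean if mean else float("nan")
    return min(sizes), max(sizes), ratio


# Invented sizes for seven partitions; not measured on the dataset.
sample_sizes = [100, 100, 100, 100, 100, 100, 400]
smallest, largest, ratio = partition_skew(sample_sizes)
print(f"smallest={smallest} largest={largest} largest/mean={ratio:.2f}")
```

A ratio close to 1 indicates balanced partitions; a much larger value suggests that one task would do most of the work.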


## Validation Summary

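The cleaned dataset loads with 30173 rows and 50 columns, and every
transaction_id is unique. Schema, size, completeness, and structural
consistency were checked; no further cleaning is performed at this
stage.

Based on these checks, the dataset is deemed suitable for feature
engineering and the machine learning tasks performed in subsequent
notebooks, with one caveat: instance_date is still stored as a string
and should be parsed to a date type before any time-based features are
derived.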
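
The checks in this notebook print values for a reader to inspect. A minimal sketch of how the same checks could be expressed as a single pass/fail gate follows. The function name validate_summary and its parameters are hypothetical, and the figures passed in the example are the ones printed earlier in this notebook (30173 rows, 50 columns, 30173 distinct transaction IDs).

```python
def validate_summary(row_count, column_count, distinct_ids,
                     expected_rows, expected_columns):
    """Return a list of human-readable failures; an empty list means all passed."""
    failures = []
    if row_count != expected_rows:
        failures.append(f"row count {row_count} != expected {expected_rows}")
    if column_count != expected_columns:
        failures.append(
            f"column count {column_count} != expected {expected_columns}"
        )
    if distinct_ids != row_count:
        surplus = row_count - distinct_ids
        failures.append(f"{surplus} surplus rows share a transaction ID")
    return failures


# Values as printed earlier in this notebook.
failures = validate_summary(
    30173, 50, 30173, expected_rows=30173, expected_columns=50
)
print("all checks passed" if not failures else failures)
```

Returning a list rather than raising on the first problem lets a single run report every failed check at once.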