# Task A

This notebook covers three steps:
- A.2 Data formatting: we process the Idealista, income, and lookup data and export the result as Parquet to the formatted zone.
- A.3 Exploitation: we perform feature engineering, split the data into train and test sets, and save the resulting datasets in the exploitation zone.
- A.4 Validation: we validate the data stored in both the formatted and exploitation zones.

## A.2 Data Formatting Process

Importing the required libraries

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, when, array_contains
from pyspark.sql import functions as F
from IPython.display import Markdown, display

In [3]:
# set a Spark session
spark = SparkSession.builder \
    .appName("DataFormatting") \
    .master("local[*]") \
    .getOrCreate()

In [4]:
# set the paths for the input and output data
landing_zone = "landing_zone"
formatted_zone = "formatted_zone"

In [5]:
# load all files from each folder of the landing zone
df_idealista = spark.read.option("multiline", True).json(f"{landing_zone}/idealista")
df_income = spark.read.option("header", True).csv(f"{landing_zone}/Income")
df_lookup = spark.read.option("header", True).csv(f"{landing_zone}/lookup_tables")

### Idealista data

In [6]:
df_idealista.show()

+--------------------+---------+-------+--------------------+--------+--------------+--------+-----------------+-----+------+---------+-------+-------+----------+--------+----------+---------+--------------------+--------------------+--------------+----------------------+---------+---------+------------------+---------+-----------+------------+------------+---------+-----+-----------+-----+------+--------------------+--------------------+-----------------+--------------------+
|             address|bathrooms|country|        detailedType|distance|      district|exterior|externalReference|floor|has360|has3DTour|hasLift|hasPlan|hasStaging|hasVideo|  latitude|longitude|        municipality|        neighborhood|newDevelopment|newDevelopmentFinished|numPhotos|operation|      parkingSpace|    price|priceByArea|propertyCode|propertyType| province|rooms|showAddress| size|status|      suggestedTexts|           thumbnail|topNewDevelopment|                 url|
+--------------------+---------+----

We only predict prices for the municipality of Barcelona, so we filter the data accordingly.

In [7]:
df_idealista = df_idealista.filter(F.col("municipality") == "Barcelona")

Next, we check how many rows remain in the DataFrame.

In [8]:
n_rows = df_idealista.count()
print(n_rows)

7856


For our task, each property should appear only once, so we check for duplicates next.

In [None]:
df_idealista.groupBy("propertyCode") \
    .agg(count("*").alias("occurrences")) \
    .filter("occurrences > 1") \
    .show(truncate=False)

+------------+-----------+
|propertyCode|occurrences|
+------------+-----------+
|90204735    |5          |
|92524329    |6          |
|91622667    |4          |
|91358097    |3          |
|87258588    |4          |
|91903585    |12         |
|91901786    |9          |
|92771991    |5          |
|91560330    |4          |
|92108611    |10         |
|92411589    |2          |
|92557089    |2          |
|87265378    |3          |
|92962742    |6          |
|40084649    |2          |
|92474319    |3          |
|91566532    |2          |
|88667699    |4          |
|88863152    |4          |
|91671895    |3          |
+------------+-----------+
only showing top 20 rows


In [None]:
n_duplicates = (
    df_idealista
    .groupBy("propertyCode") 
    .agg(count("*").alias("occurrences"))
    .filter("occurrences > 1")
    .count()
)

display(Markdown(f"We identified **{n_duplicates}** duplicated properties in our data."))

We identified **1389** duplicated properties in our data.

The Idealista JSON files were originally split by year, month, and day. Loading them all into a single DataFrame therefore produces repeated values in the `propertyCode` column, which correspond to the same property recorded on several dates.

Since distinguishing between dates is not relevant for our analysis, we will remove these duplicate entries. Additionally, we cannot leverage temporal information, as the Idealista dataset spans 2020–2021, whereas the income dataset covers 2007–2017.

Lastly, this approach resolves the issue of repeated files - for example, cases like 2020_12_28_idealista(1).json and 2020_12_28_idealista.json, which are likely duplicates. By retaining only unique propertyCode values, we ensure that such repetitions do not introduce bias into our analysis.
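To illustrate the idea behind removing duplicates by key, here is a minimal plain-Python sketch (not the Spark code used in this notebook). The toy rows, the prices, and the `2021_01_04_idealista.json` file name are made up for illustration, and `drop_duplicates_by_key` is a hypothetical helper.

```python
# Toy listings: the same propertyCode shows up in several daily snapshots.
# All values here are invented for illustration.
snapshots = [
    {"propertyCode": "A1", "price": 250000, "file": "2020_12_28_idealista.json"},
    {"propertyCode": "A1", "price": 250000, "file": "2020_12_28_idealista(1).json"},
    {"propertyCode": "B2", "price": 410000, "file": "2020_12_28_idealista.json"},
    {"propertyCode": "A1", "price": 245000, "file": "2021_01_04_idealista.json"},
]


def drop_duplicates_by_key(rows, key):
    """Keep the first row seen for each key value, in input order."""
    first_seen = {}
    for row in rows:
        # setdefault only stores the row if the key has not been seen yet
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


unique_rows = drop_duplicates_by_key(snapshots, "propertyCode")
print(len(unique_rows), "unique properties")
```

Unlike this sketch, which always keeps the first occurrence, Spark's `dropDuplicates`, as far as we know, makes no promise about which of the duplicated rows survives. If a specific snapshot (for example the most recent one) mattered, an explicit ordering would be needed.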

In [11]:
# remove duplicates
df_idealista = df_idealista.dropDuplicates(["propertyCode"])

In [12]:
n_rows = df_idealista.count()
display(Markdown(f"We will perform our analysis using information for **{n_rows}** properties."))

We will perform our analysis using information for **4062** properties.

We verify that every `propertyCode` is now unique.

In [None]:
# Count total rows
total_rows = df_idealista.count()

# Count unique propertyCodes
unique_property_codes = df_idealista.select("propertyCode").distinct().count()

print(f"Total rows: {total_rows}")
print(f"Unique property codes: {unique_property_codes}")

# Check if propertyCode is unique
if total_rows == unique_property_codes:
    print("propertyCode is unique for all rows.")
else:
    print("There are duplicate propertyCode values.")

Total rows: 4062
Unique property codes: 4062
propertyCode is unique for all rows.


### Income data

In [14]:
df_income.show()

+----+--------------+--------------+----------+--------------------+--------+-------------------------+
| Any|Codi_Districte| Nom_Districte|Codi_Barri|           Nom_Barri|Població|Índex RFD Barcelona = 100|
+----+--------------+--------------+----------+--------------------+--------+-------------------------+
|2007|             1|  Ciutat Vella|         1|            el Raval|   46595|                     64.7|
|2007|             1|  Ciutat Vella|         2|      el Barri Gòtic|   27946|                     86.5|
|2007|             1|  Ciutat Vella|         3|      la Barceloneta|   15921|                     66.7|
|2007|             1|  Ciutat Vella|         4|Sant Pere, Santa ...|   22572|                     80.2|
|2007|             2|      Eixample|         5|       el Fort Pienc|   31521|                    107.9|
|2007|             2|      Eixample|         6|  la Sagrada Família|   52185|                    101.8|
|2007|             2|      Eixample|         7|la Dreta de l'Eix

The income DataFrame contains no exact duplicates, because the population and the RFD index change from year to year. We therefore remove the time dimension in a different way:
- we group by neighborhood,
- we compute the average population and the average RFD index.
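The two steps above can be sketched in plain Python as follows. This is only an illustration of the logic, assuming that non-numeric entries should be ignored when averaging. The first and third rows mirror values shown in the table above, while the other rows (including the `"-"` and `"n.d."` placeholders) are invented.

```python
import re
from collections import defaultdict

# Patterns for "integer" and "integer or decimal" text values
INT_RE = re.compile(r"\d+")
DEC_RE = re.compile(r"\d+(\.\d+)?")

# (Nom_Barri, Població, Índex RFD) as raw strings; rows 2 and 4 are invented
rows = [
    ("el Raval", "46595", "64.7"),
    ("el Raval", "47000", "-"),
    ("el Barri Gòtic", "27946", "86.5"),
    ("el Barri Gòtic", "n.d.", "90.5"),
]


def to_float(text, pattern):
    """Return the value as float if the whole string matches, else None."""
    if text is not None and pattern.fullmatch(text):
        return float(text)
    return None


def mean_ignoring_none(values):
    """Average the non-missing values; None if nothing is left."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None


grouped = defaultdict(lambda: {"pop": [], "rfd": []})
for barri, population, rfd_index in rows:
    grouped[barri]["pop"].append(to_float(population, INT_RE))
    grouped[barri]["rfd"].append(to_float(rfd_index, DEC_RE))

averages = {
    barri: (mean_ignoring_none(v["pop"]), mean_ignoring_none(v["rfd"]))
    for barri, v in grouped.items()
}
print(averages)
```

Mapping invalid entries to a missing value, instead of to zero, keeps them from dragging the average down.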

In [15]:
# We first tried a plain groupBy on 'Nom_Barri' followed by an average of 'Població' and
# 'Índex RFD Barcelona = 100', but some values are not numeric. Here we keep only numeric
# entries (anything else becomes null and is ignored by avg). Raw strings avoid invalid-escape warnings.
df_income_cleaned = df_income.withColumn(
    "Poblacio_num", when(col("Població").rlike(r"^\d+$"), col("Població").cast("double"))).withColumn(
    "Index_RFD_num", when(col("Índex RFD Barcelona = 100").rlike(r"^\d+(\.\d+)?$"), col("Índex RFD Barcelona = 100").cast("double")))

# normal groupby and average
df_income_barri = df_income_cleaned.groupBy("Nom_Barri").agg(
    avg("Poblacio_num").alias("Poblacio_average"),
    avg("Index_RFD_num").alias("Index_RFD_average"))


In [16]:
df_income_barri.show()

+--------------------+------------------+------------------+
|           Nom_Barri|  Poblacio_average| Index_RFD_average|
+--------------------+------------------+------------------+
|         el Poblenou|32450.454545454544| 92.63636363636364|
|   la Vila de Gràcia| 51166.63636363636|104.81818181818181|
|el Besòs i el Mar...|23435.454545454544| 56.07272727272727|
|        la Guineueta|15231.727272727272|63.818181818181806|
|        la Teixonera|11400.727272727272| 71.89999999999999|
|la Dreta de l'Eix...| 43410.36363636364|155.25454545454548|
|      el Barri Gòtic| 18795.81818181818| 98.14545454545454|
|         el Guinardó|35770.818181818184| 85.58181818181818|
|            Vallbona|1338.1818181818182| 48.96363636363637|
|           Canyelles| 7169.181818181818| 64.53636363636365|
|Provençals del Po...|19809.090909090908| 87.87272727272727|
| la Verneda i la Pau|29134.545454545456| 62.77272727272727|
|Vilapicina i la T...| 25575.81818181818| 72.43636363636364|
|l'Antiga Esquerra...| 4

### Lookup Data

To merge the datasets, we use the `neighborhood` column as the key. Neighborhood information is available for every Idealista property located within the municipality of Barcelona.

In [17]:
# Checking missing values for neighborhood
missing_neighborhood = df_idealista.filter(col("neighborhood").isNull()).count()
display(Markdown(f"We identified **{missing_neighborhood}** missing values for neighborhood in the idealista df."))

We identified **0** missing values for neighborhood in the idealista df.

First, we will create a comprehensive mapping of all possible neighborhood names by aggregating the different naming conventions found in the columns `neighborhood`, `neighborhood_n_reconciled`, and `neighborhood_n` of `df_lookup`, grouped by `neighborhood_id`.

We will then merge `df_idealista` with this enhanced lookup table, ensuring that each property is matched to its correct neighborhood regardless of naming variations. Next, we bring in the neighborhood-level income index by merging with `df_income_barri`, matching any of the possible neighborhood names to the column `Nom_Barri`. This process ensures that each property receives the most granular income index available.
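The matching logic can be sketched in plain Python as shown below. This is not the Spark implementation. The two ids and their name variants are taken from the lookup output in this notebook, but which column holds which spelling is an assumption, and `find_neighborhood_id` is a hypothetical helper.

```python
from collections import defaultdict

NAME_COLUMNS = ("neighborhood", "neighborhood_n_reconciled", "neighborhood_n")

# Toy lookup rows; the assignment of spellings to columns is assumed
lookup_rows = [
    {"neighborhood_id": "Q3751072", "neighborhood": "Vallbona",
     "neighborhood_n_reconciled": "Vallbona", "neighborhood_n": "vallbona"},
    {"neighborhood_id": "Q3750558", "neighborhood": "Porta",
     "neighborhood_n_reconciled": "Porta", "neighborhood_n": "porta"},
]

# Collapse every spelling found in the three columns into one set per id
all_names = defaultdict(set)
for row in lookup_rows:
    for column in NAME_COLUMNS:
        if row.get(column) is not None:
            all_names[row["neighborhood_id"]].add(row[column])


def find_neighborhood_id(name):
    """Return the id whose set of spellings contains `name`, else None."""
    for neighborhood_id, names in all_names.items():
        if name in names:
            return neighborhood_id
    return None


print(find_neighborhood_id("Vallbona"))
print(find_neighborhood_id("porta"))
```

One caveat of membership-based matching: if the same spelling were listed under two different ids, a join on it would duplicate the property row. Comparing row counts before and after each join is a simple way to detect that situation.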

In [18]:
# Step 1: Stack all three columns into one column with the corresponding id
grouped = (
    df_lookup
    .select("neighborhood_id", "neighborhood")
    .unionByName(df_lookup.select("neighborhood_id", "neighborhood_n_reconciled").withColumnRenamed("neighborhood_n_reconciled", "neighborhood"))
    .unionByName(df_lookup.select("neighborhood_id", "neighborhood_n").withColumnRenamed("neighborhood_n", "neighborhood"))
    .filter(F.col("neighborhood").isNotNull())
    .dropDuplicates()
)

# Step 2: Group by id, collect all unique names in a list
lookup_collapsed = (
    grouped
    .groupBy("neighborhood_id")
    .agg(F.collect_set("neighborhood").alias("all_names"))
)

In [19]:
lookup_collapsed.show()

+---------------+--------------------+
|neighborhood_id|           all_names|
+---------------+--------------------+
|       Q3320806|[vilapicina i la ...|
|       Q3321805|[el putxet i el f...|
|       Q3291762|[L'Antiga Esquerr...|
|       Q3813818|[el Congrés i els...|
|       Q3320705|[La Teixonera, la...|
|       Q3045547|[el guinardo, El ...|
|       Q1932090|[el coll, el Coll...|
|       Q3773169|[sant marti de pr...|
|       Q3294602|[el camp de l arp...|
|       Q1425291|[Baró de Viver, b...|
|       Q3296693|[Les Corts, les C...|
|       Q1026658|[la nova esquerra...|
|        Q542473|[La Verneda i la ...|
|       Q3751072|[vallbona, Vallbona]|
|       Q2562684|[pedralbes, Pedra...|
|        Q524311|[El Turó de la Pe...|
|       Q3773462|[la sagrera, la S...|
|       Q1627690|[les roquetes, Le...|
|       Q3320699|[La Salut, la Sal...|
|       Q3750558|      [porta, Porta]|
+---------------+--------------------+
only showing top 20 rows


In [20]:
# Merge df_idealista with lookup_collapsed using array_contains on 'all_names'
df_idealista_lookup = df_idealista.join(
    lookup_collapsed,
    array_contains(lookup_collapsed['all_names'], df_idealista['neighborhood']),
    how='left'
)

In [21]:
# Sanity check
display(Markdown(f"The row count should match between **{df_idealista_lookup.count()}** and **{df_idealista.count()}**."))

The row count should match between **4062** and **4062**.

In [None]:
# Merge df_idealista_lookup with df_income_barri using array_contains on 'all_names'
df_idealista_barri = df_idealista_lookup.join(
    df_income_barri.select('Nom_Barri', 'Index_RFD_average', 'Poblacio_average'),
    array_contains(df_idealista_lookup['all_names'], df_income_barri['Nom_Barri']),
    how='left'
).drop('Nom_Barri')

In [23]:
# Sanity check
display(Markdown(f"The row count should match between **{df_idealista_barri.count()}** and **{df_idealista_lookup.count()}**."))

The row count should match between **4062** and **4062**.

We check how many properties are missing an Index\_RFD\_average value. Any property without a match did not have its neighborhood name present in any of the entries in the lookup table.

In [None]:
# Count rows where there is a valid neighborhood_id but missing Index_RFD_average
missing_income = df_idealista_barri.filter(
    (F.col('neighborhood_id').isNotNull()) &
    (F.col('Index_RFD_average').isNull())
).count()

display(Markdown(
    f"Rows with non-null <b>neighborhood_id</b> but missing <b>Index_RFD_average</b>: <b>{missing_income}</b>.<br>"
    + ("<span style='color:green'>PASS</span>" if missing_income == 0 else "<span style='color:red'>CHECK DATA</span>")
))

Rows with non-null <b>neighborhood_id</b> but missing <b>Index_RFD_average</b>: <b>0</b>.<br><span style='color:green'>PASS</span>

In [25]:
# Counting missing values
df_idealista_barri.filter(F.col("Index_RFD_average").isNull()).count()

0

We remove columns that won't be useful for our analysis

In [26]:
final_cleaned = df_idealista_barri \
    .drop("address", "country", "detailedType", "externalReference", "municipality", "operation", "province", "suggestedTexts", "thumbnail",
          "url", "all_names") 

In [27]:
final_cleaned.show()

+---------+--------+-------------------+--------+-----+------+---------+-------+-------+----------+--------+----------+---------+--------------------+--------------+----------------------+---------+------------------+---------+-----------+------------+------------+-----+-----------+-----+------+-----------------+---------------+------------------+------------------+
|bathrooms|distance|           district|exterior|floor|has360|has3DTour|hasLift|hasPlan|hasStaging|hasVideo|  latitude|longitude|        neighborhood|newDevelopment|newDevelopmentFinished|numPhotos|      parkingSpace|    price|priceByArea|propertyCode|propertyType|rooms|showAddress| size|status|topNewDevelopment|neighborhood_id| Index_RFD_average|  Poblacio_average|
+---------+--------+-------------------+--------+-----+------+---------+-------+-------+----------+--------+----------+---------+--------------------+--------------+----------------------+---------+------------------+---------+-----------+------------+----------

### Exporting final Parquet to Formatted Zone

Our goal is property price prediction, so we want to partition by a column that:

- is commonly used as a filter in queries,
- has moderate cardinality,
- does not fragment the data into a very large number of tiny files.

Given the small size of the dataset, partitioning by neighborhood (more than 70 unique values) would not be efficient.

We assume that our queries will often filter by district, so we partition the output by `district`.
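As a rough way to compare candidate partition columns, one can count how many partitions each would create and how many rows each partition would hold. The sketch below does this in plain Python on a handful of toy rows. `partition_profile` is a hypothetical helper, and the district/neighborhood pairs only serve as an example.

```python
from collections import Counter


def partition_profile(rows, column):
    """Summarise how `rows` would be spread if partitioned by `column`."""
    sizes = Counter(row[column] for row in rows)
    return {
        "partitions": len(sizes),
        "avg_rows": len(rows) / len(sizes),
        "min_rows": min(sizes.values()),
    }


# Toy rows for illustration only
rows = [
    {"district": "Eixample", "neighborhood": "el Fort Pienc"},
    {"district": "Eixample", "neighborhood": "la Sagrada Família"},
    {"district": "Eixample", "neighborhood": "el Fort Pienc"},
    {"district": "Ciutat Vella", "neighborhood": "el Raval"},
    {"district": "Ciutat Vella", "neighborhood": "el Barri Gòtic"},
    {"district": "Ciutat Vella", "neighborhood": "la Barceloneta"},
]

by_district = partition_profile(rows, "district")
by_neighborhood = partition_profile(rows, "neighborhood")
print("district:", by_district)
print("neighborhood:", by_neighborhood)
```

The finer-grained column produces more partitions with fewer rows each, which is the trade-off described above: with only a few thousand properties, more than 70 neighborhood partitions would each hold very little data.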

In [28]:
final_cleaned.write.mode("overwrite") \
    .partitionBy("district") \
    .parquet(f"{formatted_zone}/formatted_data")

## A.3 Exploitation

### Feature Engineering

In [29]:
from pyspark.sql.functions import col, when, isnan, isnull, mean, stddev, abs as spark_abs
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

In [30]:
# set a Spark session
spark = SparkSession.builder \
    .appName("DataExploitation") \
    .master("local[*]") \
    .getOrCreate()

In [31]:
# set the paths for the input and output data
formatted_zone = "formatted_zone"
exploitation_zone = "exploitation_zone"

In [32]:
# load the previously formatted data (the JSON-specific "multiline" option is not needed for Parquet)
data = spark.read.parquet(f"{formatted_zone}/formatted_data")

In [33]:
data.show()

+---------+--------+--------+-----+------+---------+-------+-------+----------+--------+----------+---------+--------------------+--------------+----------------------+---------+------------+--------+-----------+------------+------------+-----+-----------+-----+------+-----------------+---------------+-----------------+------------------+--------------+
|bathrooms|distance|exterior|floor|has360|has3DTour|hasLift|hasPlan|hasStaging|hasVideo|  latitude|longitude|        neighborhood|newDevelopment|newDevelopmentFinished|numPhotos|parkingSpace|   price|priceByArea|propertyCode|propertyType|rooms|showAddress| size|status|topNewDevelopment|neighborhood_id|Index_RFD_average|  Poblacio_average|      district|
+---------+--------+--------+-----+------+---------+-------+-------+----------+--------+----------+---------+--------------------+--------------+----------------------+---------+------------+--------+-----------+------------+------------+-----+-----------+-----+------+-----------------+-

In [34]:
# clean numerical features: if price, size, or priceByArea are null or less than or equal to zero (=probably errors),
# set them to None. When instead they are valid, cast them to DoubleType (as anyways VectorAssembler would convert 
# to this type)
data_features = data.withColumn(
    "price_clean", 
    when(col("price").isNull() | (col("price") <= 0), None).otherwise(col("price").cast(DoubleType()))
).withColumn(
    "size_clean",
    when(col("size").isNull() | (col("size") <= 0), None).otherwise(col("size").cast(DoubleType()))
).withColumn(
    "priceByArea_clean",
    when(col("priceByArea").isNull() | (col("priceByArea") <= 0), None).otherwise(col("priceByArea").cast(DoubleType())))

In [35]:
# create categorical features for size, income, and rooms
data_features = data_features.withColumn(
    "size_category",
    when(col("size_clean") <= 50, "small")
    .when(col("size_clean") <= 100, "medium") 
    .when(col("size_clean") <= 150, "large")
    .otherwise("extra_large")
).withColumn(
    "income_category", 
    when(col("Index_RFD_average") <= 70, "low_income")
    .when(col("Index_RFD_average") <= 100, "medium_income")
    .when(col("Index_RFD_average") <= 130, "high_income") 
    .otherwise("very_high_income")
).withColumn(
    "rooms_category",
    when(col("rooms") <= 2, "small")
    .when(col("rooms") <= 4, "medium")
    .otherwise("large"))

In [36]:
# convert floor to an integer when it is purely numeric; non-numeric or missing floors default to 0
data_features = data_features.withColumn(
    "floor_numeric",
    when(col("floor").rlike("^\\d+$"), col("floor").cast("int")).otherwise(0))

In [37]:
# create binary features for parking, exterior, and lift availability
data_features = data_features.withColumn(
    "has_parking", 
    when(col("parkingSpace").isNotNull(), 1).otherwise(0)
).withColumn(
    "is_exterior",
    when(col("exterior") == True, 1).otherwise(0)
).withColumn(
    "has_lift_binary",
    when(col("hasLift") == True, 1).otherwise(0))

In [38]:
# detect and remove outliers 
price_stats = data_features.select(
    mean("price_clean").alias("price_mean"),
    stddev("price_clean").alias("price_std")
).collect()[0]

size_stats = data_features.select(
    mean("size_clean").alias("size_mean"), 
    stddev("size_clean").alias("size_std")
).collect()[0]

# remove outliers (beyond 3 standard deviations) 
price_lower = price_stats["price_mean"] - 3 * price_stats["price_std"] 
price_upper = price_stats["price_mean"] + 3 * price_stats["price_std"]
size_lower = size_stats["size_mean"] - 3 * size_stats["size_std"]
size_upper = size_stats["size_mean"] + 3 * size_stats["size_std"]

data_cleaned = data_features.filter(
    (col("price_clean").between(price_lower, price_upper)) &
    (col("size_clean").between(size_lower, size_upper)) &
    (col("price_clean").isNotNull()) &
    (col("size_clean").isNotNull()) &
    (col("Index_RFD_average").isNotNull()) &
    (col("neighborhood").isNotNull()))

print(f"removed {data.count() - data_cleaned.count()} outliers")

removed 112 outliers
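The three-standard-deviation rule used above can be illustrated with a small plain-Python sketch. It assumes the sample standard deviation, which is what Spark's `stddev` computes. The price list is invented and `three_sigma_bounds` is a hypothetical helper.

```python
from statistics import mean, stdev


def three_sigma_bounds(values, k=3.0):
    """Return (lower, upper) = mean -/+ k * sample standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return center - k * spread, center + k * spread


# Invented prices: twenty ordinary listings and one extreme value
prices = [300000.0] * 20 + [12000000.0]

lower, upper = three_sigma_bounds(prices)
kept = [p for p in prices if lower <= p <= upper]
print(f"bounds: {lower:,.0f} to {upper:,.0f}; kept {len(kept)} of {len(prices)}")
```

With a right-skewed variable such as price, the lower bound is typically negative, so in practice only the upper tail is trimmed. As a rough consistency check, the formatted-zone statistics reported later in this notebook (mean about €579,315, standard deviation about €683,459) give an upper bound of roughly €2.63 million, in line with the €2,600,000 maximum found in the exploitation zone.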


In [39]:
# convert all numerical columns to DoubleType - same reason as above 
data_cleaned = data_cleaned.withColumn("distance", col("distance").cast(DoubleType())) \
    .withColumn("numPhotos", col("numPhotos").cast(DoubleType())) \
    .withColumn("rooms", col("rooms").cast(DoubleType())) \
    .withColumn("bathrooms", col("bathrooms").cast(DoubleType())) \
    .withColumn("latitude", col("latitude").cast(DoubleType())) \
    .withColumn("longitude", col("longitude").cast(DoubleType()))

In [40]:
# define the features we will use for prediction
numerical_features = [
    "size_clean", "rooms", "bathrooms", "Index_RFD_average", "Poblacio_average",
    "latitude", "longitude", "distance", "numPhotos", "floor_numeric",
    "has_parking", "is_exterior", "has_lift_binary"]

# handle boolean columns - convert to integer, and then VectorAssembler will convert to DoubleType
for bool_col in ["has360", "has3DTour", "hasVideo", "hasPlan", "newDevelopment"]:
    data_cleaned = data_cleaned.withColumn(bool_col + "_int", when(col(bool_col) == True, 1).otherwise(0))
    numerical_features.append(bool_col + "_int")

# handle categorical features by doing string indexing and one-hot encoding
categorical_features = [
    "propertyType", "size_category", "income_category", "rooms_category",
    "neighborhood", "status"]

# create string indexers for categorical variables
indexers = []
for cat_col in categorical_features:
    indexer = StringIndexer(inputCol=cat_col, outputCol=f"{cat_col}_indexed", handleInvalid="keep")
    indexers.append(indexer)

# create one-hot encoders  
encoders = []
encoded_cols = []
for cat_col in categorical_features:
    encoder = OneHotEncoder(inputCol=f"{cat_col}_indexed", outputCol=f"{cat_col}_encoded")
    encoders.append(encoder)
    encoded_cols.append(f"{cat_col}_encoded")

In [41]:
# target variable
target_column = "price_clean"

# select only the columns we need
all_features = [target_column] + numerical_features + categorical_features
ml_data = data_cleaned.select(all_features)

# fill missing values
ml_data = ml_data.fillna({
    "floor_numeric": 0,
    "numPhotos": 0,
    "distance": 0.0,
    "status": "unknown"})

print(f"Selected {len(numerical_features)} numerical features and {len(categorical_features)} categorical features")
print(f"ML dataset shape: {ml_data.count()} rows, {len(all_features)} columns")

Selected 18 numerical features and 6 categorical features
ML dataset shape: 3950 rows, 25 columns
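To make the categorical encoding more concrete, the following plain-Python sketch mimics what we understand `StringIndexer` (with `handleInvalid="keep"`) followed by `OneHotEncoder` to do with default settings: labels are indexed by descending frequency, unseen labels get an extra index, and the last category is dropped from the one-hot vector. The tie-breaking rule, the helper names, and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def encode(value, mapping):
    """One-hot encode `value`; unseen labels become an all-zero vector.

    With "keep", unseen labels get index len(mapping). Dropping the last
    category then leaves a vector of length len(mapping).
    """
    index = mapping.get(value, len(mapping))
    vector = [0.0] * len(mapping)
    if index < len(mapping):
        vector[index] = 1.0
    return vector


property_types = ["flat", "flat", "penthouse", "flat", "studio", "penthouse"]
mapping = fit_string_index(property_types)

print(mapping)
print(encode("flat", mapping))
print(encode("castle", mapping))  # label never seen during fitting
```

Under these assumptions each categorical column contributes as many vector slots as it has distinct labels in the fitted data, and a label that was not present at fit time does not raise an error at transform time.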


### Checking missing values in features

In [42]:
# Sanity check (no missing values)
for col_name in all_features:
    n_null = ml_data.filter(col(col_name).isNull()).count()
    print(f"Missing values in '{col_name}': {n_null}")

Missing values in 'price_clean': 0
Missing values in 'size_clean': 0
Missing values in 'rooms': 0
Missing values in 'bathrooms': 0
Missing values in 'Index_RFD_average': 0
Missing values in 'Poblacio_average': 0
Missing values in 'latitude': 0
Missing values in 'longitude': 0
Missing values in 'distance': 0
Missing values in 'numPhotos': 0
Missing values in 'floor_numeric': 0
Missing values in 'has_parking': 0
Missing values in 'is_exterior': 0
Missing values in 'has_lift_binary': 0
Missing values in 'has360_int': 0
Missing values in 'has3DTour_int': 0
Missing values in 'hasVideo_int': 0
Missing values in 'hasPlan_int': 0
Missing values in 'newDevelopment_int': 0
Missing values in 'propertyType': 0
Missing values in 'size_category': 0
Missing values in 'income_category': 0
Missing values in 'rooms_category': 0
Missing values in 'neighborhood': 0
Missing values in 'status': 0


In [43]:
# combine all feature columns for VectorAssembler
all_feature_cols = numerical_features + encoded_cols

# assemble all features into a single vector - which is the input for ML algorithms
assembler = VectorAssembler(inputCols=all_feature_cols, outputCol="features")

# create preprocessing pipeline
pipeline_stages = indexers + encoders + [assembler]
preprocessing_pipeline = Pipeline(stages=pipeline_stages)

### Train/test split

In [44]:
# fit and transform the preprocessing pipeline
pipeline_model = preprocessing_pipeline.fit(ml_data)
ml_data_processed = pipeline_model.transform(ml_data)

# select final columns for ML (target, features, and neighborhood for analysis)
final_ml_data = ml_data_processed.select("price_clean", "features", "neighborhood")

# split data 80/20 for train/test
train_data, test_data = final_ml_data.randomSplit([0.8, 0.2], seed=42)

print(f"Train set: {train_data.count():,} records")
print(f"Test set: {test_data.count():,} records")

Train set: 3,194 records
Test set: 756 records


### Checking missing values in target

In [45]:
# check for missing values in target
null_targets_train = train_data.filter(col("price_clean").isNull()).count()
null_targets_test = test_data.filter(col("price_clean").isNull()).count()

print(f" missing values in target column ->  train: {null_targets_train}, test: {null_targets_test}")

 missing values in target column ->  train: 0, test: 0


### Saving data in exploitation Zone

In [46]:
# save train set 
train_data.write.mode("overwrite").parquet(f"{exploitation_zone}/train_data")
# save test set 
test_data.write.mode("overwrite").parquet(f"{exploitation_zone}/test_data")
# save full processed dataset
final_ml_data.write.mode("overwrite").parquet(f"{exploitation_zone}/ml_ready_data")
# save the fitted preprocessing pipeline
pipeline_model.write().overwrite().save(f"{exploitation_zone}/preprocessing_pipeline")

spark.stop()

## A.4 Validation

In [47]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as spark_sum, avg, min as spark_min, max as spark_max, stddev, isnan, isnull, when
from pyspark.sql.types import NumericType

In [48]:
# set spark session 
spark = SparkSession.builder \
    .appName("DataValidation") \
    .master("local[*]") \
    .getOrCreate()

In [49]:
# paths for the zones
formatted_zone = "formatted_zone"
exploitation_zone = "exploitation_zone"

### Validation formatted zone

In [50]:
# validate data in the formatted zone

# load data
formatted_data = spark.read.parquet(f"{formatted_zone}/formatted_data")

# visualize schema
print("Columns in formatted data:")
for i, (col_name, col_type) in enumerate(formatted_data.dtypes, 1):
    print(f"  {i:2d}. {col_name:<25} : {col_type}")

print(f"\ntotal columns: {len(formatted_data.columns)}")

Columns in formatted data:
   1. bathrooms                 : bigint
   2. distance                  : string
   3. exterior                  : boolean
   4. floor                     : string
   5. has360                    : boolean
   6. has3DTour                 : boolean
   7. hasLift                   : boolean
   8. hasPlan                   : boolean
   9. hasStaging                : boolean
  10. hasVideo                  : boolean
  11. latitude                  : double
  12. longitude                 : double
  13. neighborhood              : string
  14. newDevelopment            : boolean
  15. newDevelopmentFinished    : boolean
  16. numPhotos                 : bigint
  17. parkingSpace              : struct<hasParkingSpace:boolean,isParkingSpaceIncludedInPrice:boolean,parkingSpacePrice:double>
  18. price                     : double
  19. priceByArea               : double
  20. propertyCode              : string
  21. propertyType              : string
  22. rooms    

In [51]:
# verify where we have null values
print(f"{'Column Name':<25} {'Non-Null Count':<15} {'Null Count':<12} {'Null %':<8} {'Data Type':<15}")
print("-" * 80)
total_records = formatted_data.count()
for col_name, col_type in formatted_data.dtypes:
    non_null_count = formatted_data.filter(col(col_name).isNotNull()).count()
    null_count = total_records - non_null_count
    null_percentage = (null_count / total_records) * 100
    
    print(f"{col_name:<25} {non_null_count:<15,} {null_count:<12,} {null_percentage:<7.1f}% {col_type:<15}")

Column Name               Non-Null Count  Null Count   Null %   Data Type      
--------------------------------------------------------------------------------
bathrooms                 4,062           0            0.0    % bigint         
distance                  4,062           0            0.0    % string         
exterior                  4,062           0            0.0    % boolean        
floor                     3,486           576          14.2   % string         
has360                    4,062           0            0.0    % boolean        
has3DTour                 4,062           0            0.0    % boolean        
hasLift                   3,736           326          8.0    % boolean        
hasPlan                   4,062           0            0.0    % boolean        
hasStaging                4,062           0            0.0    % boolean        
hasVideo                  4,062           0            0.0    % boolean        
latitude                  4,062        

In [52]:
# check statistics for the price column
price_stats = formatted_data.select(
    count("price").alias("count"),
    avg("price").alias("avg_price"),
    spark_min("price").alias("min_price"),
    spark_max("price").alias("max_price"),
    stddev("price").alias("std_price")
).collect()[0]

print(f"Price statistics:")
print(f"  Count: {price_stats['count']:,}")
print(f"  Average: €{price_stats['avg_price']:,.0f}")
print(f"  Min: €{price_stats['min_price']:,.0f}")
print(f"  Max: €{price_stats['max_price']:,.0f}")
print(f"  Std Dev: €{price_stats['std_price']:,.0f}")

Price statistics:
  Count: 4,062
  Average: €579,315
  Min: €34,000
  Max: €12,000,000
  Std Dev: €683,459


In [53]:
# check statistics for the income index
income_stats = formatted_data.select(
    count("Index_RFD_average").alias("count"),
    avg("Index_RFD_average").alias("avg_income"),
    spark_min("Index_RFD_average").alias("min_income"),
    spark_max("Index_RFD_average").alias("max_income")
).collect()[0]

print(f"\nIncome Index statistics:")
print(f"  Count: {income_stats['count']:,}")
print(f"  Average Index: {income_stats['avg_income']:.1f}")
print(f"  Min Index: {income_stats['min_income']:.1f}")
print(f"  Max Index: {income_stats['max_income']:.1f}")


Income Index statistics:
  Count: 4,062
  Average Index: 108.0
  Min Index: 43.6
  Max Index: 229.0


In [54]:
# neighborhood distribution
print(f"\n>>> Neighborhood distribution (top 10):")
neighborhood_dist = formatted_data.groupBy("neighborhood") \
    .count() \
    .orderBy(col("count").desc())

neighborhood_dist.show(10, truncate=False)


>>> Neighborhood distribution (top 10):
+-------------------------------+-----+
|neighborhood                   |count|
+-------------------------------+-----+
|La Dreta de l'Eixample         |352  |
|Sants                          |346  |
|El Poble Sec - Parc de Montjuïc|300  |
|La Nova Esquerra de l'Eixample |298  |
|La Marina del Port             |231  |
|La Maternitat i Sant Ramon     |229  |
|Sants - Badal                  |226  |
|El Gòtic                       |218  |
|Les Corts                      |214  |
|La Bordeta                     |201  |
+-------------------------------+-----+
only showing top 10 rows


In [55]:
# property type distribution
print(f">>> Property type distribution:")
property_dist = formatted_data.groupBy("propertyType") \
    .count() \
    .orderBy(col("count").desc())

property_dist.show(truncate=False)

>>> Property type distribution:
+------------+-----+
|propertyType|count|
+------------+-----+
|flat        |3421 |
|penthouse   |231  |
|chalet      |216  |
|duplex      |127  |
|studio      |66   |
|countryHouse|1    |
+------------+-----+



### Validation exploitation zone

In [56]:
# validate data in the exploitation zone

# load data
train_data = spark.read.parquet(f"{exploitation_zone}/train_data")
test_data = spark.read.parquet(f"{exploitation_zone}/test_data")
ml_ready_data = spark.read.parquet(f"{exploitation_zone}/ml_ready_data")

In [57]:
# check the splits
total_ml_records = ml_ready_data.count()
train_records = train_data.count()
test_records = test_data.count()

print(f"  Total ML records: {total_ml_records:,}")
print(f"  Train records: {train_records:,} ({train_records/total_ml_records*100:.1f}%)")
print(f"  Test records: {test_records:,} ({test_records/total_ml_records*100:.1f}%)")

  Total ML records: 3,950
  Train records: 3,194 (80.9%)
  Test records: 756 (19.1%)
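The split is 80.9% / 19.1% and not exactly 80% / 20%. A likely explanation is that `randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately proportional. The plain-Python sketch below illustrates that behaviour under this assumption. It is not Spark's actual sampling code, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row independently to train with probability train_weight."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(3950))  # same number of records as the ML dataset
train, test = random_split(rows, 0.8, seed=42)

share = len(train) / len(rows) * 100
print(f"train: {len(train)} ({share:.1f}%), test: {len(test)}")
```

Fixing the seed makes the split reproducible, but it does not force the proportions to be exact. An exact split would require a different approach, for example ranking rows by a random value and cutting at a fixed position.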


In [58]:
# check what columns we have
print("Train data columns:")
train_data.printSchema()

# check feature vector
print("\nFirst feature vector:")
train_data.select("features").show(1, truncate=False)

# get vector size
vector_size = len(train_data.select("features").first()["features"])
print(f"Vector has {vector_size} dimensions")

Train data columns:
root
 |-- price_clean: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- neighborhood: string (nullable = true)


First feature vector:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                          |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(91,[0,1,2,3,4,5,6,7,8,9,12,18,23,27,31,36,88],[80.0,4.0,2.0,73.00000000000001,40601.181818181816,41.374478,2.1553834,2764.0,7.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Note that the step above confirms that the features contain no missing values.
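The feature vector above is printed in sparse notation: `(size, [indices], [values])`, where every position that is not listed is zero. The plain-Python sketch below expands the printed vector into a dense list and labels the first 13 positions, which follow the order of `numerical_features` defined earlier. `sparse_to_dense` is a hypothetical helper written for this illustration.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Values copied from the first feature vector printed above
size = 91
indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 18, 23, 27, 31, 36, 88]
values = [80.0, 4.0, 2.0, 73.00000000000001, 40601.181818181816, 41.374478,
          2.1553834, 2764.0, 7.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

dense = sparse_to_dense(size, indices, values)

# The first 13 slots follow the order of the numerical_features list
first_names = ["size_clean", "rooms", "bathrooms", "Index_RFD_average",
               "Poblacio_average", "latitude", "longitude", "distance",
               "numPhotos", "floor_numeric", "has_parking", "is_exterior",
               "has_lift_binary"]
for name, value in zip(first_names, dense):
    print(f"{name:<20} {value}")
```

Read this way, the first record is an 80 m² property with 4 rooms and 2 bathrooms, on the third floor, with a lift but no parking and no exterior flag. The remaining non-zero entries are the one-hot positions of its categorical features.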

In [59]:
# check statistics for the target variable (=price_clean)
target_stats = train_data.select(
    count("price_clean").alias("count"),
    avg("price_clean").alias("avg_price"),
    spark_min("price_clean").alias("min_price"),
    spark_max("price_clean").alias("max_price"),
    stddev("price_clean").alias("std_price")
).collect()[0]

print(f"\nTarget variable (=price_clean) statistics:")
print(f"  Count: {target_stats['count']:,}")
print(f"  Average: €{target_stats['avg_price']:,.0f}")
print(f"  Min: €{target_stats['min_price']:,.0f}")
print(f"  Max: €{target_stats['max_price']:,.0f}")
print(f"  Std Dev: €{target_stats['std_price']:,.0f}")


Target variable (=price_clean) statistics:
  Count: 3,194
  Average: €500,212
  Min: €39,000
  Max: €2,600,000
  Std Dev: €409,415


In [60]:
# check for missing values in target
null_targets = train_data.filter(col("price_clean").isNull()).count()
print(f"  Null values: {null_targets}")

  Null values: 0


In [61]:
# compare record counts between zones
formatted_count = formatted_data.count()
ml_ready_count = ml_ready_data.count()
records_removed = formatted_count - ml_ready_count
removal_percentage = (records_removed / formatted_count) * 100

print(f"  Formatted Zone: {formatted_count:,} records")
print(f"  Exploitation Zone: {ml_ready_count:,} records")
print(f"  Records removed: {records_removed:,} ({removal_percentage:.1f}%)")

# check price ranges
formatted_price_range = formatted_data.select(
    spark_min("price").alias("min"), 
    spark_max("price").alias("max")
).collect()[0]

ml_price_range = ml_ready_data.select(
    spark_min("price_clean").alias("min"), 
    spark_max("price_clean").alias("max")
).collect()[0]

print(f"  Formatted Zone price range: €{formatted_price_range['min']:,.0f} - €{formatted_price_range['max']:,.0f}")
print(f"  ML Zone price range: €{ml_price_range['min']:,.0f} - €{ml_price_range['max']:,.0f}")

  Formatted Zone: 4,062 records
  Exploitation Zone: 3,950 records
  Records removed: 112 (2.8%)
  Formatted Zone price range: €34,000 - €12,000,000
  ML Zone price range: €34,000 - €2,600,000


In [62]:
spark.stop()