# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**: Regalado Floriano Luis A.

**Professor**: Pablo Camarillo Ramirez

# Introduction

OrangeStone wishes to enter the housing market and has access to a database, updated monthly, with information on global house purchases. They want to make the most of this information when deciding which real estate to buy. To that end, they decided to transform their raw data into a star model for ease of analysis.

# Dataset

The dataset they have access to is https://www.kaggle.com/datasets/mohankrishnathalla/global-house-purchase-decision-dataset


It is a single table, which they want to convert into the following Star Model:

```mermaid
---
title: House Loan Star Model
---
erDiagram

    Customer ||..|{ FactHouse : INTERESTED_IN

    FactHouse ||--|| Location: AT 
    FactHouse ||--|{ LoanDetails: HAS_LOAN  
    FactHouse ||--|| HouseDetails: ABOUT_HOUSE  
    Location ||--|| CityDetails: IN_CITY
    Location ||--|| CountryDetails: IN_COUNTRY

    HouseDetails ||--|| FurnishingDetails: IS_FURNISHED
    HouseDetails ||--|| PropertyTypeDetails: OF_PROPERTY_TYPE

```


# Transformations and Actions

## Transformations

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Final Project: Batch Processing") \
    .master("spark://spark-master:7077") \
    .config("spark.jars", "/opt/spark/work-dir/jars/postgresql-42.7.8.jar") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

25/10/27 02:47:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:
from regalado_floriano.spark_utils import SparkUtils 

In [3]:
houses_schema = SparkUtils.generate_schema(
    (
        ("property_id", "int"),
        ("country", "string"),
        ("city", "string"),
        ("property_type", "string"),
        ("furnishing_status", "string"),
        ("property_size_sqft", "int"),
        ("price", "int"),
        ("constructed_year", "int"),
        ("previous_owners", "int"),
        ("rooms", "int"),
        ("bathrooms", "int"),
        ("garage", "bool"),
        ("garden", "bool"),
        ("crime_cases_reported", "int"),
        ("legal_cases_on_property", "bool"),
        ("customer_salary", "int"),
        ("loan_amount", "int"),
        ("loan_tenure_years", "int"),
        ("monthly_expenses", "int"),
        ("down_payment", "int"),
        ("emi_to_income_ratio", "float"),
        ("satisfaction_score", "int"),
        ("neighbourhood_rating", "int"),
        ("connectivity_score", "int"),
        ("decision", "bool"),
    )
)
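
`SparkUtils.generate_schema` is a project helper whose implementation is not shown here. As a rough sketch of the same idea, the (name, type) pairs can be turned into a DDL-formatted schema string, which `DataFrameReader.schema` also accepts. The function name `to_ddl_schema` and the type mapping below are assumptions for illustration, not the helper's actual code.

```python
# Hypothetical mapping from the short type names used above to Spark SQL DDL types.
DDL_TYPES = {
    "int": "INT",
    "string": "STRING",
    "float": "FLOAT",
    "bool": "BOOLEAN",
}


def to_ddl_schema(columns):
    """Build a DDL schema string such as 'property_id INT, country STRING'."""
    parts = []
    for name, short_type in columns:
        if short_type not in DDL_TYPES:
            raise ValueError(f"Unsupported type {short_type!r} for column {name!r}")
        parts.append(f"{name} {DDL_TYPES[short_type]}")
    return ", ".join(parts)


# Small subset of the columns above, for illustration only.
example = to_ddl_schema(
    (("property_id", "int"), ("country", "string"), ("garage", "bool"))
)
print(example)
```

A string built this way could be passed to `.schema(...)` in place of a `StructType` when reading the CSV.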

In [4]:
house_df = spark.read \
                .option("header", "true") \
                .schema(houses_schema) \
                .csv("/opt/spark/work-dir/data/house_purchases")
 

In [5]:
from pyspark.sql.functions import monotonically_increasing_id

# A boolean fill value only affects boolean columns: nulls there become False
house_df = house_df.na.fill(False)


## Transformation 1: Extracting Strings 

In [6]:
# To ease analysis, every categorical column is moved into its own DataFrame,
# so that each distinct value (e.g. each country and city) gets its own id.
categories = ["country", "city", "property_type", "furnishing_status"]
_localMap = SparkUtils.generate_keyed_distinct_column(house_df)

categoricalTables = {key: _localMap(key) for key in categories}

# Replace each categorical column in the main frame with its key column
id_house = house_df
for cat in categories:
    cur_df = categoricalTables[cat]
    id_house = SparkUtils.replace_column_for_key(id_house)(cur_df)(cat)
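
`generate_keyed_distinct_column` and `replace_column_for_key` are project helpers whose code is not shown. The sketch below illustrates, on plain Python rows rather than Spark DataFrames, the effect this step is meant to have: each distinct value of a categorical column gets an id, and the column is replaced by `<column>_id`. The function names and the sample rows are hypothetical.

```python
def build_lookup(rows, column):
    """Assign an id to each distinct value of `column`, in first-seen order."""
    lookup = {}
    for row in rows:
        lookup.setdefault(row[column], len(lookup))
    return lookup


def replace_with_key(rows, column, lookup):
    """Return new rows where `column` is replaced by `<column>_id`."""
    replaced = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[f"{column}_id"] = lookup[row[column]]
        replaced.append(new_row)
    return replaced


# Made-up sample rows, for illustration only.
rows = [
    {"property_id": 1, "country": "France"},
    {"property_id": 2, "country": "Japan"},
    {"property_id": 3, "country": "France"},
]
country_lookup = build_lookup(rows, "country")
keyed_rows = replace_with_key(rows, "country", country_lookup)
print(country_lookup)
print(keyed_rows)
```

In Spark, the same result is typically obtained with a `distinct()` on the column followed by a join back on the original value.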


## Transformation 2: Creating unique location id

In [7]:

# Each distinct (country_id, city_id) pair gets a surrogate id
locations = (
    id_house.select("country_id", "city_id")
    .distinct()
    .withColumn("id", monotonically_increasing_id())
)

# Join on column names: both frames derive from id_house, so a condition
# comparing id_house["country_id"] with locations["country_id"] is ambiguous
id_house = (
    id_house.join(locations, on=["country_id", "city_id"], how="left")
    .drop("country_id", "city_id")
    .withColumnRenamed("id", "location_id")
)
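
The ids produced by `monotonically_increasing_id()` are unique and increasing but not consecutive. According to the Spark documentation, the current implementation places the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. The small sketch below reproduces that arithmetic in plain Python; the function name `expected_id` is hypothetical, and the layout is an implementation detail that may change.

```python
def expected_id(partition_index, row_in_partition):
    """Mimic the documented bit layout of monotonically_increasing_id()."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each.
ids = [expected_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

This is why `location_id` values can jump by large amounts between rows; they remain valid as keys.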


## Transformation 3: Creating House Details

In [22]:
house_details = id_house.select(
    "property_id",
    "previous_owners",
    "rooms", "bathrooms", "garage", "garden",
    "crime_cases_reported", "legal_cases_on_property",
    "neighbourhood_rating", "satisfaction_score",
    "property_size_sqft", "price", "constructed_year",
    "furnishing_status_id", "property_type_id",
)



## Transformation 4: Creating Loan Details

In [23]:
loan_details = house_df.select(
    "property_id",
    "loan_amount", "loan_tenure_years", "down_payment",
)

## Transformation 5: Creating Buyer Details

In [24]:
buyer_details = id_house.select(
    "property_id",
    "customer_salary", "emi_to_income_ratio", "monthly_expenses", "connectivity_score",
)



In [11]:
all_details = [house_details , loan_details, buyer_details]

In [12]:
factHouses =  id_house.select("property_id", "decision", "location_id" )

# Persistence Data

Since we are writing a Star Model, a relational database is the natural choice. The processing itself runs in a distributed environment (the Spark cluster), and the resulting tables are written over JDBC to PostgreSQL, which we chose as our database engine.

In [13]:
jdbc_url = "jdbc:postgresql://postgres-iteso:5432/postgres"

In [15]:
SparkUtils.writeToPostGres(factHouses)(jdbc_url)("FactHouses")

AnalysisException: Table or view 'FactHouses' already exists. SaveMode: ErrorIfExists.
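
The `AnalysisException` above is raised because the default save mode, `ErrorIfExists`, refuses to write to a table that already exists. How `SparkUtils.writeToPostGres` is implemented is not shown, so the following is only a sketch of a writer that passes an explicit save mode to `DataFrameWriter.jdbc`. The function names `jdbc_properties` and `write_table`, as well as the credentials, are placeholders.

```python
def jdbc_properties(user, password):
    """Connection properties for the PostgreSQL JDBC driver (placeholder credentials)."""
    return {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def write_table(df, jdbc_url, table, properties, mode="overwrite"):
    """Write `df` to `table` over JDBC.

    `mode` is passed to DataFrameWriter.jdbc and can be
    'append', 'overwrite', 'ignore' or 'error'.
    """
    df.write.jdbc(url=jdbc_url, table=table, mode=mode, properties=properties)


# Hypothetical usage with the DataFrames defined earlier:
# write_table(factHouses, jdbc_url, "FactHouses", jdbc_properties("<user>", "<password>"))
```

Note that `overwrite` replaces the existing table, so `append` may be the better fit if each monthly load should add to what is already stored.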

In [16]:
SparkUtils.writeToPostGres(locations)(jdbc_url)("LocationDetails")

                                                                                

In [17]:
for tableName, table in categoricalTables.items():
    SparkUtils.writeToPostGres(table)(jdbc_url)(f"{tableName}Details")

                                                                                

In [25]:
all_details = {
    "HouseDetails": house_details,
    "LoanDetails": loan_details,
    "BuyerDetails": buyer_details,
}
for tableName, table in all_details.items():
    SparkUtils.writeToPostGres(table)(jdbc_url)(tableName)

25/10/27 03:13:47 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

# <center> <img src="../../img/batch_regalado_floriano_post.png" alt="ITESO" width="480" height="130"> </center>


# DAG

# <center> <img src="../../img/batch_regalado_floriano_dag.png" alt="Spark DAG" width="480" height="130"> </center>
