# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**: Luis Fernando Ramirez Ramos

**Professor**: Pablo Camarillo Ramirez

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, trim, col, count, isnull, when, lit, \
     concat, round, asc, desc, input_file_name, current_timestamp
from datetime import datetime
from fernandoramos.spark_utils import SparkUtils

spark = SparkSession.builder \
    .appName("Car catalogue normalization and match.") \
    .master("spark://3fc414c80e1d:7077") \
    .config("spark.jars", "/opt/spark/work-dir/jars/postgresql-42.7.8.jar") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")



# Introduction

This project builds a compact batch data pipeline in PySpark. It ingests five CSV vehicle catalogues from real insurers (Zurich, HDI, Quálitas, Chubb, El Potosí), normalizes the core attributes (brand, model, year, transmission, version), and writes a master catalogue to PostgreSQL. Spark and PostgreSQL both run locally in Docker.

### Two matching layers will be used:

#### Commercial hash candidates 
We form a deterministic key from (brand, normalized model, year, transmission) to quickly find likely matches across insurers.
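
As a rough illustration of such a key, here is a minimal plain-Python sketch. The function name `commercial_hash`, the `|` separator, and the choice of SHA-256 are assumptions made for the example; the notebook does not state which hash function or separator the pipeline uses.

```python
import hashlib


def commercial_hash(brand: str, model_norm: str, year: int, transmission: str) -> str:
    """Build a deterministic key from the four commercial attributes.

    Assumption: SHA-256 over a '|'-joined string. The real pipeline may
    use a different hash function or separator.
    """
    # Uppercase and trim each text part so cosmetic differences between
    # insurers (case, stray spaces) do not change the key.
    parts = [
        brand.strip().upper(),
        model_norm.strip().upper(),
        str(year),
        transmission.strip().upper(),
    ]
    payload = "|".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative values only: two spellings of the same vehicle, plus a
# different model year.
key_a = commercial_hash("Acura", "ILX", 2013, "AUTO")
key_b = commercial_hash(" ACURA ", "ilx", 2013, "auto")
key_c = commercial_hash("ACURA", "ILX", 2014, "AUTO")

print(key_a == key_b)  # same attributes after cleanup, so same key
print(key_a == key_c)  # different year, so different key
```

Because the key depends only on cleaned attribute values, rows from different insurers can be grouped or joined on it without comparing every pair of rows.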

#### Version similarity 
We normalize the free-text version_original into a single cleaned field, version_clean, keeping all specs together. We then compare versions as token sets, using Jaccard similarity (intersection over union) and a bidirectional coverage index (how much of each side's tokens the other side covers) to confirm real matches.
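
The two measures can be sketched in plain Python as follows. Jaccard is the standard intersection-over-union; the coverage index is written here as the smaller of the two directional coverages, which is one reasonable reading of "bidirectional overlap" and is an assumption, not a definition taken from the notebook. The token values are made up for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bidirectional_coverage(a: set, b: set) -> float:
    """Assumed definition: the smaller of |A∩B|/|A| and |A∩B|/|B|."""
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return min(shared / len(a), shared / len(b))


# Hypothetical token sets for two insurers' version strings.
left = {"SPORT", "4P", "AUT", "2.4L"}
right = {"SPORT", "4P", "AUT", "PIEL"}

jaccard_score = jaccard(left, right)                  # 3 shared / 5 distinct
coverage_score = bidirectional_coverage(left, right)  # 3 shared / 4 per side

print(jaccard_score)   # 0.6
print(coverage_score)  # 0.75
```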

# Dataset

The input data is extracted from real insurers’ vehicle catalogues. Each CSV row is one vehicle entry as published by the insurer.

Raw fields: origen, id_original, marca, modelo, anio, version_original, transmision, activo

#### origen 
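Insurer code or name the row comes from (e.g., ZURICH, HDI). The input schema declares this column as origen_aseguradora; it is renamed to origen right after loading.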

#### id_original 
The insurer's own row identifier, kept as a string to avoid overflow or formatting issues.

#### marca 
Vehicle brand as provided by the insurer (will be uppercased/canonicalized later).

#### modelo 
Vehicle model name as provided (to be normalized).

#### anio 
Model year (integer).

#### version_original 
Free-text trim/spec line (contains trim, HP, doors, body type, etc.).

#### transmision 
Transmission as provided; may be blank/messy; later normalized to AUTO or MANUAL.

#### activo 
Row activity flag from the source, typically 1/0 or a boolean. It is read as a string, safely cast to an integer (activo_int), and later used to keep only active rows.
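
To make the transmision normalization described above concrete, here is a minimal plain-Python sketch. The alias tables and the function name `normalize_transmission` are hypothetical; the spellings that actually occur in the insurers' files are not listed in this notebook, and returning None for unrecognized values is only one possible policy.

```python
import unicodedata
from typing import Optional

# Hypothetical alias tables; the real catalogues may use other spellings.
AUTO_ALIASES = {"AUTO", "AUT", "AUTOMATICA", "AUTOMATICO", "AUTOMATIC"}
MANUAL_ALIASES = {"MANUAL", "MAN", "STD", "ESTANDAR", "STANDARD"}


def strip_accents(text: str) -> str:
    """Remove combining marks so that 'AUTOMÁTICA' becomes 'AUTOMATICA'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_transmission(raw: Optional[str]) -> Optional[str]:
    """Map a messy transmission value to AUTO or MANUAL, or None if unknown."""
    if raw is None:
        return None
    value = strip_accents(raw).strip().upper()
    if value in AUTO_ALIASES:
        return "AUTO"
    if value in MANUAL_ALIASES:
        return "MANUAL"
    # Blank or unrecognized: left unresolved in this sketch.
    return None


print(normalize_transmission(" automática "))  # AUTO
print(normalize_transmission("Std"))           # MANUAL
print(normalize_transmission(""))              # None
```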

## Version Matching Spec
#### Token sets
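
One way to turn version_clean into a token set is sketched below in plain Python. The splitting rule (uppercase, then split on anything that is not an ASCII letter, digit, or dot) and the name `tokenize_version` are assumptions for illustration; the notebook does not spell out the actual tokenization rules.

```python
import re


def tokenize_version(version_clean: str) -> set:
    """Split a cleaned version string into a set of distinct tokens.

    Assumption: the text is already ASCII after cleaning, and dots are
    kept inside tokens so that '2.4L' stays in one piece.
    """
    pieces = re.split(r"[^A-Z0-9.]+", version_clean.upper())
    # Drop empty pieces and stray leading/trailing dots.
    return {p.strip(".") for p in pieces if p.strip(".")}


# Hypothetical version string with a repeated word and mixed case.
tokens = tokenize_version("Sport 4p, aut 2.4L  sport")
print(sorted(tokens))  # ['2.4L', '4P', 'AUT', 'SPORT']
```

Using a set rather than a list means repeated words count once, which is what both similarity measures expect.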

### Why both?

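Bidirectional coverage requires each side to be sufficiently covered by the other. This prevents subset false positives, where a short version whose tokens all appear in a much longer one would otherwise look like a perfect match.

Jaccard penalizes large unions (many extra tokens) and serves as a guardrail and tie-breaker.

Using both keeps precision high at little extra complexity.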
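
The subset case can be worked through with a small plain-Python example. The token sets and both thresholds (`MIN_COVERAGE`, `MIN_JACCARD`) are hypothetical values chosen for illustration, not the thresholds used by the pipeline, and the helper assumes both sets are non-empty.

```python
def overlap_scores(a: set, b: set) -> dict:
    """Directional coverages and Jaccard for two non-empty token sets."""
    shared = len(a & b)
    return {
        "coverage_a": shared / len(a),  # share of A's tokens found in B
        "coverage_b": shared / len(b),  # share of B's tokens found in A
        "jaccard": shared / len(a | b),
    }


# Hypothetical case: every token of the short version appears in the long one.
short_version = {"SPORT", "AUT"}
long_version = {"SPORT", "AUT", "4P", "PIEL", "2.4L", "TURBO"}

scores = overlap_scores(short_version, long_version)
# coverage_a is 2/2, which looks like a perfect match in one direction,
# while coverage_b and jaccard are both 2/6.

MIN_COVERAGE = 0.7  # hypothetical threshold
MIN_JACCARD = 0.5   # hypothetical threshold

is_match = (
    min(scores["coverage_a"], scores["coverage_b"]) >= MIN_COVERAGE
    and scores["jaccard"] >= MIN_JACCARD
)
print(scores["coverage_a"])  # 1.0
print(is_match)              # False
```

Checking only one direction would accept this pair. Requiring coverage in both directions rejects it.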

# Transformations and Actions

### Define Input Schema

In [3]:
raw_vehicle_data_schema = SparkUtils.generate_schema([
    ("origen_aseguradora", "string", True),
    ("id_original", "string", True),
    ("marca", "string", True),
    ("modelo", "string", True),
    ("anio", "int", True),
    ("version_original", "string", True),
    ("transmision", "string", True),  # raw value; normalized later to {AUTO, MANUAL}
    ("activo", "string", True)  # read as string; cast safely to int after load
])

print("Schema generated.")

Schema generated.


### Origin Data Load

In [8]:
# .csv(...) already selects the CSV reader, so no separate .format("csv") is needed
vehicle_dataset = spark \
  .read \
  .option("header", "true") \
  .option("quote", '"') \
  .option("escape", '"') \
  .option("mode", "PERMISSIVE") \
  .schema(raw_vehicle_data_schema) \
  .csv("/opt/spark/work-dir/data/insurers/")

# Add metadata columns 
vehicle_dataset = vehicle_dataset \
  .withColumn("source_file", input_file_name()) \
  .withColumn("ingestion_ts", current_timestamp()) \
  .withColumnRenamed("origen_aseguradora", "origen")

# Safe conversion of activo from string to int
# Malformed rows will have NULL in activo_int
vehicle_dataset = vehicle_dataset.withColumn(
  "activo_int",
  when(col("activo").rlike("^[0-9]+$"), col("activo").cast("int"))
  .otherwise(None)
)

rows_read = vehicle_dataset.count()
print(f"Rows loaded from CSV files: {rows_read:,}")

# Show sample to verify data loaded correctly
print("\nSample raw data (first 5 rows):")
vehicle_dataset.select("origen", "id_original", "marca", "modelo", "anio", "activo_int").show(5, truncate=False)


Rows loaded from CSV files: 171,190

Sample raw data (first 5 rows):
+------+-----------+-----+------+----+----------+
|origen|id_original|marca|modelo|anio|activo_int|
+------+-----------+-----+------+----+----------+
|ZURICH|72315      |ACURA|ADX   |2025|1         |
|ZURICH|72316      |ACURA|ADX   |2025|1         |
|ZURICH|32380      |ACURA|ILX   |2013|1         |
|ZURICH|32381      |ACURA|ILX   |2013|1         |
|ZURICH|32382      |ACURA|ILX   |2014|1         |
+------+-----------+-----+------+----+----------+
only showing top 5 rows


### Filter only valid data
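
The intended filter can be illustrated in plain Python, without Spark, as a minimal sketch. It assumes that "valid" means the activo flag parses as an integer and equals 1; the sample rows and the helper name `to_activo_int` are hypothetical. The digit check follows the same idea as the rlike pattern used when loading the data.

```python
import re
from typing import Optional


def to_activo_int(activo: Optional[str]) -> Optional[int]:
    """Cast a digits-only string to int; anything else becomes None."""
    if activo is not None and re.fullmatch(r"[0-9]+", activo):
        return int(activo)
    return None


# Hypothetical rows covering the active, inactive, and malformed cases.
rows = [
    {"id_original": "row-1", "activo": "1"},
    {"id_original": "row-2", "activo": "0"},
    {"id_original": "row-3", "activo": "yes"},
    {"id_original": "row-4", "activo": None},
]

# Keep only rows whose flag parsed cleanly and marks the row as active.
active_rows = [r for r in rows if to_activo_int(r["activo"]) == 1]

print([r["id_original"] for r in active_rows])  # ['row-1']
```

On the Spark DataFrame, the equivalent step would presumably be a filter on activo_int equal to 1, which also drops rows where the cast produced NULL.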

### Normalized Output Schema

In [5]:
normalized_schema = SparkUtils.generate_schema([
    ("brand_norm", "string", False),
    ("model_norm", "string", False),
    ("year", "int", False),
    ("transmission_norm", "string", False),
    ("version_clean", "string", False),
    ("token_list", "array<string>", False),
    ("commercial_hash", "string", False),
    ("activo", "int", False),
    ("disponibilidad", "struct<origen:string,id_original:string,version_original:string,source_file:string,ingestion_ts:timestamp>", False)
])

print("Schema generated.")

Schema generated.


# Data Persistence

# DAG