# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**: Juan Pablo Quintero

**Professor**: Pablo Camarillo Ramirez

# Introduction
In this project I use a dataset of traffic accidents in the United States. I look at where accidents occur most often, by state, city, and street, and also take into account the hour of the day and the day on which each accident happened. New columns are derived for these time-based attributes.

**Dataset URL**: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents/data


# Dataset

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Proyecto : BatchProcesing") \
    .master("spark://spark-master:7077") \
    .config("spark.jars", "/opt/spark/work-dir/jars/postgresql-42.7.8.jar") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

25/10/26 22:24:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:
# Load the dataset
from PabloQuintero.spark_utils import SparkUtils
# The column names and types are not known in advance, so PySpark reads the header and infers the schema automatically
df_accidentes = spark.read.option("header", True).option("inferSchema", True).csv("/opt/spark/work-dir/data/CarAccident")

# Vertical format prints one field per line; with this many columns the horizontal table is too crowded to read
print("First 20 rows")
df_accidentes.show(n=20, truncate=False, vertical=True)

print("Schema")
df_accidentes.printSchema()



25/10/26 22:25:11 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


First 20 rows
-RECORD 0-------------------------------------------------------------------------------------------------------------------
 ID                    | A-1                                                                                                
 Source                | Source2                                                                                            
 Severity              | 3                                                                                                  
 Start_Time            | 2016-02-08 05:46:00                                                                                
 End_Time              | 2016-02-08 11:00:00                                                                                
 Start_Lat             | 39.865147                                                                                          
 Start_Lng             | -84.058723                                                                        

# Transformations and Actions

In [19]:
# Note: importing round from pyspark.sql.functions shadows Python's built-in round() in this notebook
from pyspark.sql.functions import col, date_format, hour, dayofweek, month, count, when, isnull, round, avg
from pyspark.sql.types import IntegerType

# Handle null values in the Severity column
df_clean = (
    df_accidentes
    .withColumn("Severity", when(col("Severity").isNull(), -1).otherwise(col("Severity")))  # -1 marks missing Severity values
)
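
One caveat with the `-1` sentinel is that it takes part in any later average of `Severity`, whereas Spark's `avg()` skips nulls. The following plain-Python sketch (not Spark code) illustrates the difference; the `severities` list is hypothetical data, not taken from the dataset.

```python
from statistics import mean

# Hypothetical severities for one group; None stands for a missing value.
severities = [2, 3, None, 2]

# Option 1: replace missing values with the -1 sentinel, as the cleaning step does.
with_sentinel = [-1 if s is None else s for s in severities]
mean_with_sentinel = mean(with_sentinel)

# Option 2: skip missing values, which mirrors how Spark's avg() treats nulls.
present = [s for s in severities if s is not None]
mean_ignoring_missing = mean(present)

print(mean_with_sentinel)     # the sentinel pulls the average down
print(mean_ignoring_missing)  # average over the known values only
```

If the sentinel is kept for inspection purposes, filtering out rows where `Severity` is `-1` before averaging would avoid this skew.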


In [15]:
# Create the new feature columns
df_features = (
    df_clean
    .withColumn("Accident_Hour", hour(col("Start_Time"))) 
    .withColumn("Accident_DayOfWeek", dayofweek(col("Start_Time")))
    .withColumn("Accident_Month", month(col("Start_Time")))
    .withColumn("Duration_Minutes", round((col("End_Time").cast("long") - col("Start_Time").cast("long")) / 60))
    .withColumn("Is_Hot_Accident", when(col("`Temperature(F)`") > 80, True).otherwise(False))
)

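# Aggregate accidents by location (state, city) and by time (day of week, hour)
df_accidentes_by_time_loc = (
    df_features
    .groupBy("State", "City", "Accident_DayOfWeek", "Accident_Hour") 
    .agg(
        count("*").alias("Total_Accidents"), 
        count(when(col("Severity") >= 3, 1)).alias("High_Severity_Count"),  # counts only rows with Severity >= 3
        round(avg(col("Severity")).cast("double")).alias("Avg_Severity")  # round() without a scale rounds to a whole number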
    )
    .select(
        "State",
        "City",
        "Accident_DayOfWeek",
        "Accident_Hour",
        "Total_Accidents",
        "High_Severity_Count",
        "Avg_Severity"
    )
)
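
As a sanity check on the derived time columns, the same values can be computed by hand for the first record shown in the Dataset section (`Start_Time` 2016-02-08 05:46:00, `End_Time` 2016-02-08 11:00:00). This is a plain-Python sketch, not part of the Spark job; `derive_features` is a hypothetical helper name. It assumes Spark's `dayofweek()` convention of 1 = Sunday through 7 = Saturday.

```python
from datetime import datetime

def derive_features(start: str, end: str) -> dict:
    """Recompute the time-based features for one record in plain Python."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    return {
        "Accident_Hour": start_dt.hour,
        # isoweekday() is 1 = Monday ... 7 = Sunday; shift to 1 = Sunday ... 7 = Saturday
        "Accident_DayOfWeek": start_dt.isoweekday() % 7 + 1,
        "Accident_Month": start_dt.month,
        # Elapsed seconds divided by 60, rounded to whole minutes
        "Duration_Minutes": round((end_dt - start_dt).total_seconds() / 60),
    }

features = derive_features("2016-02-08 05:46:00", "2016-02-08 11:00:00")
print(features)
```

Under this convention a value of 2 in `Accident_DayOfWeek` means Monday, which is worth keeping in mind when reading the aggregated table.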


# Persistence Data


I chose PostgreSQL for this final stage because it is a mature relational database that can handle large volumes of data, which makes it a good fit for this vehicle-accident dataset.

In [11]:
jdbc_url = "jdbc:postgresql://postgres-iteso:5432/postgres"
table_name = "Accidentes2_Us"

# Write the aggregated DataFrame to PostgreSQL over JDBC
df_accidentes_by_time_loc.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", table_name) \
    .option("user", "postgres") \
    .option("password", "Admin@1234") \
    .option("driver", "org.postgresql.Driver") \
    .save()
print("DataFrame successfully written into a PostgreSQL DB!")

# Read the table back to verify the write
db_properties = {
    "user": "postgres",
    "password": "Admin@1234",
    "driver": "org.postgresql.Driver"
}
df_verification = spark.read \
    .jdbc(url=jdbc_url, table=table_name, properties=db_properties)
df_verification.printSchema()
df_verification.show(5, truncate=False)


DataFrame successfully written into a PostgreSQL DB!
root
 |-- State: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Accident_DayOfWeek: integer (nullable = true)
 |-- Accident_Hour: integer (nullable = true)
 |-- Total_Accidents: long (nullable = true)
 |-- High_Severity_Count: long (nullable = true)
 |-- Avg_Severity: double (nullable = true)

+-----+------------+------------------+-------------+---------------+-------------------+------------+
|State|City        |Accident_DayOfWeek|Accident_Hour|Total_Accidents|High_Severity_Count|Avg_Severity|
+-----+------------+------------------+-------------+---------------+-------------------+------------+
|OH   |Williamsburg|2                 |6            |2              |0                  |2.0         |
|OH   |Columbus    |2                 |18           |206            |94                 |3.0         |
|OH   |Columbus    |3                 |5            |102            |30                 |2.0         |
|OH   |Columbus
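
The write step keeps the database password in the notebook. One possible alternative, sketched here, is to read the credentials from environment variables and pass them to the writer as a dictionary. The helper name `build_jdbc_options` and the variable names `PG_JDBC_URL`, `PG_USER`, and `PG_PASSWORD` are assumptions made for this example, not part of the project setup.

```python
import os

def build_jdbc_options(table: str, env=os.environ) -> dict:
    """Assemble JDBC options, taking credentials from the environment (hypothetical variable names)."""
    return {
        "url": env.get("PG_JDBC_URL", "jdbc:postgresql://postgres-iteso:5432/postgres"),
        "dbtable": table,
        "user": env.get("PG_USER", "postgres"),
        "password": env["PG_PASSWORD"],  # raises KeyError if the variable is not set
        "driver": "org.postgresql.Driver",
    }

# Example call with a stand-in environment so the sketch runs without real credentials.
options = build_jdbc_options("Accidentes2_Us", env={"PG_PASSWORD": "example-only"})
print(sorted(options))
```

The resulting dictionary could then be passed as `df_accidentes_by_time_loc.write.format("jdbc").options(**options).mode("overwrite").save()`. The explicit `mode("overwrite")` matters on reruns, because the default save mode raises an error when the table already exists.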

In [23]:
# Quick checks: average severity per state and accident counts per city
print("Average Severity by State")
df_clean.groupBy("State").avg("Severity").show()
print("\n Accident Count by City")
df_clean.groupBy("City").count().show(5)


Average Severity by State

+-----+------------------+
|State|     avg(Severity)|
+-----+------------------+
|   CA|2.1656876836490406|
|   FL|2.1400603504689886|
|   GA|2.5069312313128567|
|   IA|2.4194320903181663|
|   IL|2.3833023591661835|
|   NY| 2.259549948269916|
|   PA| 2.205764951790169|
|   DE| 2.254309427537774|
|   OR| 2.112406768340198|
|   NE| 2.180671977831659|
|   NJ|2.2337495292035903|
|   CT| 2.347031899162031|
|   NH| 2.243317340644277|
|   DC|2.1436392914653783|
|   TX|2.2241244121426744|
|   WA|2.3441383834930374|
|   OH|2.3536976675274097|
|   IN|2.3980721171010355|
|   MA|2.2938415381637527|
|   RI| 2.458252312768841|
+-----+------------------+
only showing top 20 rows

 Accident Count by City




+---------+-----+
|     City|count|
+---------+-----+
|   Dublin| 3781|
|   Sabina|   55|
|  Findlay|  183|
|Mansfield|  927|
|Pataskala|  259|
+---------+-----+
only showing top 5 rows
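
Because the grouped result is not sorted, `show(5)` returns whichever five rows come first, not the five cities with the most accidents. The plain-Python sketch below ranks the five counts from the output above in descending order. In PySpark, the corresponding step would likely be an `orderBy(col("count").desc())` placed before `show(5)`.

```python
# City counts copied from the five rows of the sample output.
city_counts = {
    "Dublin": 3781,
    "Sabina": 55,
    "Findlay": 183,
    "Mansfield": 927,
    "Pataskala": 259,
}

# Sort (city, count) pairs by count, largest first.
ranked = sorted(city_counts.items(), key=lambda kv: kv[1], reverse=True)

for city, n in ranked:
    print(f"{city:<10} {n:>5}")
```

Ranking this way answers the project's question of where accidents happen most often, which an unsorted sample does not.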


                                                                                

# DAG