# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Machine Learning** </center>
---

**Date**: November, 2025

**Student Name**: Axel Escoto García

**Professor**: Pablo Camarillo Ramirez

# Init Spark

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Final Project: Machine Learning") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

spark.conf.set("spark.sql.shuffle.partitions", "5")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/25 05:08:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Justification
The dataset I selected for this activity is the following:
https://www.kaggle.com/datasets/rupakroy/online-payments-fraud-detection-dataset

I found fraud detection an interesting topic: how we can detect unusual transactions and act quickly, whether to mitigate them or to raise an alert about the suspicion.

# Schema

In [2]:
from axel2293.spark_utils import SparkUtils

payments_schema = [
    ("step", "int"),
    ("type", "string"),
    ("amount", "float"),
    ("nameOrig", "string"),
    ("oldbalanceOrg", "float"),
    ("newbalanceOrig", "float"),
    ("nameDest", "string"),
    ("oldbalanceDest", "float"),
    ("newbalanceDest", "float"),
    ("isFraud", "int"),
    ("isFlaggedFraud", "int"),
]

payments_schema = SparkUtils.generate_schema(payments_schema)
df_fraud = spark.read \
    .schema(payments_schema) \
    .option("header", True) \
    .csv("/opt/spark/work-dir/data/online_fraud/")

df_fraud.show(5)


+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|     170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|      21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|        181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|        181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|      41554.0|      29885.86|M1230701703|      

                                                                                

In [3]:
# Import only what is needed; col is used later in the undersampling step
from pyspark.sql.functions import col

print(f"Total transactions: {df_fraud.count()}")

# Distribution of isFraud
df_fraud.groupBy("isFraud").count().show()

# Number of fraud and non-fraud rows per transaction type
df_fraud.groupBy("type", "isFraud") \
    .count() \
    .orderBy("type", "isFraud").show()

# Row counts per transaction type
df_fraud.groupBy("type").count().show()

                                                                                

Total transactions: 6362620


                                                                                

+-------+-------+
|isFraud|  count|
+-------+-------+
|      0|6354407|
|      1|   8213|
+-------+-------+



                                                                                

+--------+-------+-------+
|    type|isFraud|  count|
+--------+-------+-------+
| CASH_IN|      0|1399284|
|CASH_OUT|      0|2233384|
|CASH_OUT|      1|   4116|
|   DEBIT|      0|  41432|
| PAYMENT|      0|2151495|
|TRANSFER|      0| 528812|
|TRANSFER|      1|   4097|
+--------+-------+-------+





+--------+-------+
|    type|  count|
+--------+-------+
| PAYMENT|2151495|
|TRANSFER| 532909|
| CASH_IN|1399284|
|   DEBIT|  41432|
|CASH_OUT|2237500|
+--------+-------+



                                                                                

The dataset is heavily imbalanced: 6354407 rows (~99.87%) are not fraud and only 8213 (~0.13%) are. This is why I chose Random Forest, since tree ensembles tend to cope reasonably well with imbalanced data, especially when combined with resampling.

Only TRANSFER and CASH_OUT transactions have rows labeled as fraud.
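
The class shares can be recomputed from the `groupBy("isFraud")` counts. The following is a minimal sketch of that arithmetic in plain Python, with the counts copied from the output above; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the groupBy("isFraud") output above
non_fraud_rows = 6_354_407
fraud_rows = 8_213
total_rows = non_fraud_rows + fraud_rows

fraud_share = fraud_rows / total_rows
non_fraud_share = non_fraud_rows / total_rows
imbalance_ratio = non_fraud_rows / fraud_rows  # non-fraud rows per fraud row

print(f"Total rows:      {total_rows}")
print(f"Fraud share:     {fraud_share:.2%}")
print(f"Non-fraud share: {non_fraud_share:.2%}")
print(f"Non-fraud rows per fraud row: {imbalance_ratio:.0f}")
```

With a ratio this skewed, a model that always predicts "not fraud" would already reach about 99.87% accuracy, which is why accuracy alone is a weak signal on the raw data.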

# Vector assembly

In [4]:
from pyspark.ml.feature import VectorAssembler, StringIndexer

type_indexer = StringIndexer(inputCol="type", outputCol="type_indexed")
df_fraud = type_indexer.fit(df_fraud).transform(df_fraud)

# Feature columns for the vector
feature_columns = [
    "type_indexed", # The transaction type may be strongly related to fraud. For example, TRANSFER and CASH_OUT are the only types that appear in rows labeled as fraud.
    "amount", # Large amounts can trigger fraud alerts.
    "oldbalanceOrg", # Low balances combined with large transactions may indicate fraud.
    "newbalanceOrig", # Helps identify whether the account was drained fully or partially.
    "oldbalanceDest", # Destination accounts with a low balance could be suspicious.
    "newbalanceDest" # Receiving large amounts could be suspicious.
]

# Build the feature vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_features = assembler.transform(df_fraud)

# Rename isFraud to label
df_final = df_features.withColumnRenamed("isFraud", "label")

df_final.select("label", "features").show(5, truncate=False)


+-----+-------------------------------------------------------+
|label|features                                               |
+-----+-------------------------------------------------------+
|0    |[1.0,9839.6396484375,170136.0,160296.359375,0.0,0.0]   |
|0    |[1.0,1864.280029296875,21249.0,19384.720703125,0.0,0.0]|
|1    |[3.0,181.0,181.0,0.0,0.0,0.0]                          |
|1    |[0.0,181.0,181.0,0.0,21182.0,0.0]                      |
|0    |[1.0,11668.1396484375,41554.0,29885.859375,0.0,0.0]    |
+-----+-------------------------------------------------------+
only showing top 5 rows
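
The first component of each feature vector is `type_indexed`. By default, `StringIndexer` orders labels by descending frequency (`stringOrderType="frequencyDesc"`), so the most common transaction type gets index 0. The sketch below reproduces that ordering rule in plain Python from the per-type counts shown earlier. It is an illustration only, not Spark's implementation, and `type_counts` and `type_to_index` are names made up for this example.

```python
# Per-type counts copied from the groupBy("type") output
type_counts = {
    "PAYMENT": 2_151_495,
    "TRANSFER": 532_909,
    "CASH_IN": 1_399_284,
    "DEBIT": 41_432,
    "CASH_OUT": 2_237_500,
}

# Sort labels by descending frequency and number them from 0.0
ordered_types = sorted(type_counts, key=type_counts.get, reverse=True)
type_to_index = {name: float(i) for i, name in enumerate(ordered_types)}

for name, index in type_to_index.items():
    print(f"{name:<9} -> {index}")
```

The resulting mapping agrees with the vectors shown above, where PAYMENT rows start with 1.0, the TRANSFER row with 3.0, and the CASH_OUT row with 0.0.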


                                                                                

# Data Split
Because the dataset is heavily imbalanced, I researched ways to build the train and test sets and settled on a **stratified split**: each class is split separately, so both sets keep roughly the same class ratio.
## Undersample
The dataset is very large (about 6 million rows) and training takes too long, so some undersampling of the majority class is needed for practical purposes.

In [5]:
# Separate fraud and non-fraud rows
fraud_df = df_final.filter(col("label") == 1)
non_fraud_df = df_final.filter(col("label") == 0)

fraud_count = fraud_df.count()
non_fraud_count = non_fraud_df.count()

# Undersample the majority class to a 5:1 ratio
desired_ratio = 5  # 5 non-fraud rows per fraud row
non_fraud_sample_count = fraud_count * desired_ratio

# sample() is randomized, so the resulting row count is only approximately the target
fraction = non_fraud_sample_count / non_fraud_count
non_fraud_df = non_fraud_df.sample(
    withReplacement=False,
    fraction=fraction,
    seed=42
)

# Stratified 80/20 split: split each class separately
fraud_train, fraud_test = fraud_df.randomSplit([0.8, 0.2], seed=42)
non_fraud_train, non_fraud_test = non_fraud_df.randomSplit([0.8, 0.2], seed=42)

# Combine both classes
train_df = fraud_train.union(non_fraud_train)
test_df = fraud_test.union(non_fraud_test)

print(f"{train_df.count()}")

print(f"{test_df.count()}")

                                                                                

39821




9595
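
The two sizes printed above can be sanity-checked against the 5:1 target. The sketch below redoes the arithmetic in plain Python with the class counts from the earlier output. Since `sample` and `randomSplit` are both randomized, the expected values are only approximate targets, and the `expected_*` and `observed_*` names are illustrative.

```python
# Class counts copied from the groupBy("isFraud") output
fraud_count = 8_213
non_fraud_count = 6_354_407

desired_ratio = 5  # non-fraud rows kept per fraud row
non_fraud_target = fraud_count * desired_ratio
fraction = non_fraud_target / non_fraud_count

expected_total = fraud_count + non_fraud_target
expected_train = round(expected_total * 0.8)
expected_test = expected_total - expected_train

# Sizes actually printed by the notebook
observed_train, observed_test = 39_821, 9_595

print(f"Sampling fraction for non-fraud rows: {fraction:.5f}")
print(f"Expected train/test: {expected_train} / {expected_test}")
print(f"Observed train/test: {observed_train} / {observed_test}")
```

The observed sizes differ from the targets by a few hundred rows, which is consistent with both steps being random rather than exact.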


                                                                                

# Train (Random Forest Classifier)

In [6]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

In [7]:
rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=100,
    maxDepth=15,
    maxBins=32,
    seed=42,
    featureSubsetStrategy="sqrt"
)

# Train
rf_model = rf.fit(train_df)

25/11/25 05:10:18 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
25/11/25 05:12:14 WARN DAGScheduler: Broadcasting large task binary with size 1001.4 KiB
25/11/25 05:12:15 WARN DAGScheduler: Broadcasting large task binary with size 1375.3 KiB
25/11/25 05:12:17 WARN DAGScheduler: Broadcasting large task binary with size 1832.3 KiB
25/11/25 05:12:18 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
25/11/25 05:12:20 WARN DAGScheduler: Broadcasting large task binary with size 3.0 MiB
25/11/25 05:12:21 WARN DAGScheduler: Broadcasting large task binary with size 3.7 MiB
25/11/25 05:12:24 WARN DAGScheduler: Broadcasting large task binary with size 4.6 MiB
25/11/25 05:12:26 WARN DAGScheduler: Broadcasting large task binary with size 5.4 MiB
25/11/25 05:12:28 WARN DAGScheduler: Broadcasting large task binary with size 6.2 MiB
                            

# Save Model

In [8]:
# Save the trained Random Forest model
rf_model_path = "/opt/spark/work-dir/data/mlmodels/rf/fraud_rf_pipeline"
rf_model.write().overwrite().save(rf_model_path)
print(f"Model saved to {rf_model_path}")

25/11/25 05:12:33 WARN TaskSetManager: Stage 76 contains a task of very large size (1603 KiB). The maximum recommended task size is 1000 KiB.


Model saved to /opt/spark/work-dir/data/mlmodels/rf/fraud_rf_pipeline


                                                                                

# Evaluate Model

In [11]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf_predictions = rf_model.transform(test_df)
rf_predictions.select("type", "amount", "label", "prediction", "probability").show(5, truncate=False)

# Accuracy
evaluator_acc = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)
rf_accuracy = evaluator_acc.evaluate(rf_predictions)

# Weighted precision (per-class precision averaged by class frequency)
evaluator_precision = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="weightedPrecision"
)
rf_precision = evaluator_precision.evaluate(rf_predictions)

# F1 score (weighted across classes)
evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)
rf_f1 = evaluator_f1.evaluate(rf_predictions)

print(f"  Accuracy:  {rf_accuracy}")
print(f"  Precision: {rf_precision}")
print(f"  F1 Score:  {rf_f1}")

confusion_matrix = rf_predictions.groupBy("label", "prediction").count().orderBy("label", "prediction")
confusion_matrix.show()

25/11/25 05:15:59 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
                                                                                

+--------+---------+-----+----------+------------------------------------------+
|type    |amount   |label|prediction|probability                               |
+--------+---------+-----+----------+------------------------------------------+
|CASH_OUT|20128.0  |1    |1.0       |[0.26274515837241996,0.73725484162758]    |
|CASH_OUT|235238.66|1    |1.0       |[0.09149302030157122,0.9085069796984289]  |
|CASH_OUT|1277212.8|1    |1.0       |[2.451920693872261E-4,0.9997548079306128] |
|TRANSFER|35063.63 |1    |1.0       |[0.0016148411255892795,0.9983851588744108]|
|CASH_OUT|1096187.2|1    |1.0       |[2.743010102608729E-4,0.9997256989897392] |
+--------+---------+-----+----------+------------------------------------------+
only showing top 5 rows


25/11/25 05:16:12 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
25/11/25 05:16:39 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
25/11/25 05:17:08 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
                                                                                

  Accuracy:  0.9938509640437728
  Precision: 0.99393853226463
  F1 Score:  0.9938758991961807


25/11/25 05:17:36 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    0|       0.0| 8021|
|    0|       1.0|   45|
|    1|       0.0|   14|
|    1|       1.0| 1515|
+-----+----------+-----+



25/11/25 05:18:01 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
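
The evaluator metrics above are weighted over both classes, so they are dominated by the majority class. For fraud detection, the precision and recall of the fraud class are often more informative, and they can be derived from the confusion matrix. The following is a minimal sketch in plain Python using the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix (label, prediction)
tn = 8021  # label 0, predicted 0
fp = 45    # label 0, predicted 1
fn = 14    # label 1, predicted 0
tp = 1515  # label 1, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
fraud_precision = tp / (tp + fp)  # share of flagged rows that are really fraud
fraud_recall = tp / (tp + fn)     # share of fraud rows that were flagged
fraud_f1 = 2 * fraud_precision * fraud_recall / (fraud_precision + fraud_recall)

print(f"Accuracy:        {accuracy:.4f}")
print(f"Fraud precision: {fraud_precision:.4f}")
print(f"Fraud recall:    {fraud_recall:.4f}")
print(f"Fraud F1:        {fraud_f1:.4f}")
```

The accuracy computed this way matches the evaluator's value. These numbers describe the undersampled 5:1 test set; on the full dataset, where fraud is far rarer, the fraud-class precision would likely be lower.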
                                                                                

# Conclusion
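The model achieved high scores on the undersampled test set (accuracy, weighted precision, and F1 all around 0.994), missing 14 of 1529 fraud rows and incorrectly flagging 45 of 8066 non-fraud rows. Several problems came up along the way, mainly the class imbalance and the sheer size of the dataset. I found it very interesting to analyze how this kind of data can be used to detect fraud among transactions.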