# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Machine Learning** </center>
---

**Date**: October, 2025

**Student Name**: Luis Daniel Arellano Núñez

**Professor**: Pablo Camarillo Ramirez

## Machine Learning algorithm to use

The problem I aim to solve is a binary classification task: predicting whether a patient has diabetes based on several clinical features. This type of problem is well-suited for supervised learning because the dataset includes labeled examples indicating whether each patient is diabetic or not. I selected Logistic Regression as the main algorithm because it is a simple yet powerful model for binary outcomes. It provides interpretable coefficients that help identify which medical factors are most associated with diabetes, and it performs well when the log-odds of the outcome are approximately linear in the features. Additionally, Logistic Regression is computationally efficient and serves as a solid baseline for medical risk prediction tasks.
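
As a minimal sketch of the model form described above, logistic regression passes a weighted sum of the features through the sigmoid function to obtain a probability. The weights, intercept, and feature values below are hypothetical and chosen only for illustration; they are not taken from the trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Return P(label = 1 | features) for a logistic regression model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical two-feature example (illustrative values only)
weights = [0.03, 0.05]
intercept = -6.0
features = [120.0, 30.0]  # e.g. a glucose-like and a BMI-like value

p = predict_probability(features, weights, intercept)
print(f"P(diabetes) = {p:.3f}")
```

Each weight shifts the log-odds of the positive class by a fixed amount per unit of its feature, which is what makes the coefficients interpretable.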

## Dataset Description

For this project, I am using the Diabetes Dataset available on Kaggle, which contains medical measurements used to diagnose diabetes in patients.

* Source: Kaggle — Diabetes Dataset by Akshay Dattatray Khare

* Size of the dataset: The dataset contains 768 rows and 9 columns, including features such as glucose level, BMI, age, insulin, and the target variable Outcome (0 = no diabetes, 1 = diabetes).

Since I am working on a classification problem, I analyzed the class distribution using PySpark to determine whether the dataset is balanced. The results show that the dataset is slightly imbalanced: approximately 65% of the patients are non-diabetic (class 0) and 35% are diabetic (class 1). While not severely imbalanced, this distribution may still require attention during model evaluation to ensure the classifier performs well on both classes.

* Link: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

## ML Training process

### Project configuration

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ML: Final project Logistic Regression") \
    .master("spark://4de840d3187e:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")



### Collect Data

In [2]:
from Daniel_Arellano.sql_im import SparkUtils

# Define schema for the DataFrame
diab_schema = SparkUtils.generate_schema([
    ("Pregnancies", "int"),
    ("Glucose", "int"),
    ("BloodPressure", "int"),
    ("SkinThickness", "int"),
    ("Insulin", "int"),
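    ("BMI", "int"),  # decimal values in the CSV do not parse as int and load as null
    ("DiabetesPedigreeFunction", "int"),  # same issue: decimal column typed as int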
    ("Age", "int"),
    ("Outcome", "int")])

# Source: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

diab_df = spark.read \
                .option("header", "true") \
                .schema(diab_schema) \
                .csv("/opt/spark/work-dir/data/Diabetes/diabetes.csv")

diab_df.printSchema()
print(f"Number of records:{diab_df.count()}")

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: integer (nullable = true)
 |-- DiabetesPedigreeFunction: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)

Number of records:768
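
The schema above types every column as integer, although BMI and DiabetesPedigreeFunction hold decimal values in the CSV. This is consistent with the zeros those two columns show in the sample rows printed further down. The plain-Python sketch below is only an analogy for that behaviour, assuming that fields which fail to parse become null and are later replaced with 0; `parse_or_none` and the sample values are hypothetical.

```python
def parse_or_none(raw: str, caster):
    """Cast a raw CSV field, returning None when the cast fails."""
    try:
        return caster(raw)
    except ValueError:
        return None

# Hypothetical raw fields for one row (illustrative values only)
raw_row = {"Glucose": "148", "BMI": "33.6", "DiabetesPedigreeFunction": "0.627"}

# Every column typed as int: the decimal fields cannot be parsed
as_int = {name: parse_or_none(value, int) for name, value in raw_row.items()}

# Columns typed as float: all values survive
as_float = {name: parse_or_none(value, float) for name, value in raw_row.items()}

# Replacing missing values with 0 hides the parse failure
filled = {name: 0 if value is None else value for name, value in as_int.items()}

print(as_int)
print(as_float)
print(filled)
```

If the schema helper accepts a floating-point type name, declaring those two columns with it would likely preserve their values; whether `SparkUtils.generate_schema` supports this is not shown here.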

### Check if dataset is balanced

In [3]:
from pyspark.sql.functions import count

# Count instances of each class
diab_df.groupBy("Outcome").agg(count("*").alias("count")).show()


+-------+-----+
|Outcome|count|
+-------+-----+
|      0|  500|
|      1|  268|
+-------+-----+

The dataset is imbalanced (500 non-diabetic patients versus 268 diabetic patients), which determines which metrics we can rely on in the evaluation section.
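
The class shares quoted in the dataset description can be checked directly from the counts in the table above. A minimal sketch in plain Python, using only those two counts:

```python
# Class counts taken from the groupBy output above
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())
proportions = {label: n / total for label, n in class_counts.items()}

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")

# Ratio of majority to minority class
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```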

### Assemble the feature columns into a single vector

In [5]:
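from pyspark.ml.feature import VectorAssembler

# Spark ML estimators expect the target column to be named "label" by default
diab_df = diab_df.withColumnRenamed("Outcome", "label")

# Replace nulls with 0 so VectorAssembler does not fail on missing values
imputed_df = diab_df.fillna(0)

# Combine the eight feature columns into a single vector column
feature_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data_with_features = assembler.transform(imputed_df).select("label", "features")
data_with_features.show()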


+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|[6.0,148.0,72.0,3...|
|    0|[1.0,85.0,66.0,29...|
|    1|(8,[0,1,2,7],[8.0...|
|    0|[1.0,89.0,66.0,23...|
|    1|[0.0,137.0,40.0,3...|
|    0|(8,[0,1,2,7],[5.0...|
|    1|[3.0,78.0,50.0,32...|
|    0|(8,[0,1,7],[10.0,...|
|    1|[2.0,197.0,70.0,4...|
|    1|(8,[0,1,2,7],[8.0...|
|    0|(8,[0,1,2,7],[4.0...|
|    1|[10.0,168.0,74.0,...|
|    0|(8,[0,1,2,7],[10....|
|    1|[1.0,189.0,60.0,2...|
|    1|[5.0,166.0,72.0,1...|
|    1|(8,[0,1,5,7],[7.0...|
|    1|[0.0,118.0,84.0,4...|
|    1|(8,[0,1,2,7],[7.0...|
|    0|[1.0,103.0,30.0,3...|
|    1|[1.0,115.0,70.0,3...|
+-----+--------------------+
only showing top 20 rows

### Data splitting (80/20)

In [6]:
train_df, test_df = data_with_features.randomSplit([0.8, 0.2], seed=13)

# Show the dataset for debugging
print("Original Dataset")
imputed_df.show()

# Show the training set
print("train set")
train_df.show()

Original Dataset
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-----+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin|BMI|DiabetesPedigreeFunction|Age|label|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-----+
|          6|    148|           72|           35|      0|  0|                       0| 50|    1|
|          1|     85|           66|           29|      0|  0|                       0| 31|    0|
|          8|    183|           64|            0|      0|  0|                       0| 32|    1|
|          1|     89|           66|           23|     94|  0|                       0| 21|    0|
|          0|    137|           40|           35|    168|  0|                       0| 33|    1|
|          5|    116|           74|            0|      0|  0|                       0| 30|    0|
|          3|     78|           50|           32|     88| 31|                       0| 26|    1|
|         10|
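
A note on the 80/20 split: `randomSplit` samples rows at random, so the resulting sizes are close to the requested weights but usually not exact. The sketch below illustrates the idea in plain Python, assuming one independent random draw per row; it is not Spark's implementation and does not reproduce the same rows. The helper `random_split` is a hypothetical name.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(768))  # stand-in for the 768 records
train, test = random_split(rows, train_weight=0.8, seed=13)

# Sizes are close to 80/20 but not guaranteed to be exact
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes `seed=13`.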

### Create ML Model

In [7]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.2)

### Train ML Model

In [8]:
lr_model = lr.fit(train_df)

# Print coefficients
print("Coefficients: " + str(lr_model.coefficients))

# Keep the training summary for later inspection
training_summary = lr_model.summary

Coefficients: [0.05255214988622757,0.014379505679704347,-0.0012754371982106045,0.005168707257162747,0.0002016958273922796,0.007144428589115499,0.0,0.011366944469557886]
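
One way to read these coefficients is as odds ratios: `exp(coefficient)` is the multiplicative change in the odds of diabetes for a one-unit increase in a feature, with the others held fixed. The sketch below uses the coefficients printed above, rounded to six decimals, and assumes they follow the order of the assembled feature columns.

```python
import math

feature_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Coefficients copied (rounded) from the output above
coefficients = [0.052552, 0.014380, -0.001275, 0.005169,
                0.000202, 0.007144, 0.0, 0.011367]

# exp(coefficient) is the multiplicative change in the odds of label = 1
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(c) for name, c in zip(feature_names, coefficients)}

for name, ratio in odds_ratios.items():
    print(f"{name:26s} {ratio:.4f}")
```

Because the features are not standardized, the magnitudes are not directly comparable across features with different units. The coefficient of exactly 0.0 for DiabetesPedigreeFunction is consistent with that column being all zeros in the loaded data.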


### Persist the Model

In [10]:
path = "/opt/spark/work-dir/data/mlmodels/proyect/diabetes_ml"
lr_model.write().overwrite().save(path)

In [13]:
!ls ../../data/mlmodels/proyect/diabetes_ml

data  metadata


## ML Evaluation

### Predictions with the loaded model

To generate predictions, the previously trained logistic regression model is first loaded from storage using LogisticRegressionModel.load(). The loaded model is then applied to the test dataset through the transform() method, which takes each patient's feature vector and computes both the predicted class label and the probability of each outcome. The resulting DataFrame keeps the original columns and adds, among others, the predicted class (prediction) and a probability vector (probability) whose first entry is the estimated probability of class 0 (no diabetes) and whose second entry is the estimated probability of class 1 (diabetes). Inspecting these columns shows how confident the model is in each prediction on unseen data, which is especially important in medical decision-making contexts.
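
The link between the probability vector and the predicted class can be sketched in plain Python. This assumes the model's default decision threshold of 0.5; the helper `predict_label` and the probability values are hypothetical and only illustrate the rule.

```python
def predict_label(probability, threshold=0.5):
    """Return 1.0 when P(label = 1) exceeds the threshold, else 0.0."""
    p_positive = probability[1]
    return 1.0 if p_positive > threshold else 0.0

# Illustrative probability vectors in the [P(label=0), P(label=1)] layout
examples = [[0.85, 0.15], [0.58, 0.42], [0.30, 0.70]]

for prob in examples:
    print(prob, "->", predict_label(prob))
```

In a screening context, lowering the threshold would flag more patients as diabetic at the cost of more false positives.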

In [14]:
from pyspark.ml.classification import LogisticRegressionModel

loaded_model = LogisticRegressionModel.load(path)

# Use the trained model to make predictions on the test data
predictions = loaded_model.transform(test_df)

# Show predictions
predictions.select("features", "prediction", "probability").show()

+--------------------+----------+--------------------+
|            features|prediction|         probability|
+--------------------+----------+--------------------+
|(8,[0,1,2,7],[1.0...|       0.0|[0.84737099092317...|
|(8,[0,1,2,7],[1.0...|       0.0|[0.78754841417407...|
|(8,[0,1,2,7],[2.0...|       0.0|[0.81437472307329...|
|(8,[0,1,2,7],[3.0...|       0.0|[0.81035364058587...|
|(8,[0,1,2,7],[3.0...|       0.0|[0.82548045910985...|
|(8,[0,1,2,7],[3.0...|       0.0|[0.76707568449425...|
|(8,[0,1,2,7],[3.0...|       0.0|[0.76649733355903...|
|(8,[0,1,2,7],[3.0...|       0.0|[0.63683811234577...|
|(8,[0,1,2,7],[4.0...|       0.0|[0.70058856689063...|
|(8,[0,1,2,7],[5.0...|       0.0|[0.69547274144500...|
|(8,[0,1,2,7],[7.0...|       0.0|[0.62053467709073...|
|(8,[0,1,2,7],[7.0...|       0.0|[0.57653424728884...|
|(8,[0,1,2,7],[8.0...|       0.0|[0.66558401864923...|
|(8,[0,1,2,7],[8.0...|       0.0|[0.69891412431244...|
|(8,[0,1,2,7],[10....|       0.0|[0.76028731231029...|
|(8,[0,1,3

### Test ML model

In [15]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                            predictionCol="prediction")

accuracy = evaluator.evaluate(predictions, 
                  {evaluator.metricName: "accuracy"})
print(f"Accuracy: {accuracy}")
precision = evaluator.evaluate(predictions,
                  {evaluator.metricName: "weightedPrecision"})
print(f"Precision: {precision}")
recall = evaluator.evaluate(predictions,
                  {evaluator.metricName: "weightedRecall"})
print(f"Recall: {recall}")
f1 = evaluator.evaluate(predictions,
                {evaluator.metricName: "f1"})
print(f"F1 Score: {f1}") 

Accuracy: 0.7536231884057971
Precision: 0.7938303575484984
Recall: 0.7536231884057971
F1 Score: 0.7219772506040263


Since the dataset is not balanced, relying solely on accuracy would give a misleading picture of the model's performance. On an imbalanced dataset, a classifier can achieve high accuracy simply by predicting the majority class most of the time while still performing poorly on the minority class, which here represents patients diagnosed with diabetes. For that reason, the F1-score is used as the primary evaluation metric. Because F1 is the harmonic mean of precision and recall, it is high only when the model both labels diabetic cases correctly and avoids missing them. On the test set the model reaches an F1-score of about 0.72, compared with an accuracy of about 0.75, which makes F1 the more conservative and informative figure for a medical prediction task with class imbalance.
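
To make the majority-class argument concrete, here is a minimal sketch using the class counts of the full dataset (500 and 268), not the test split. It evaluates a hypothetical classifier that always predicts class 0; the helper `precision_recall_f1` is an illustrative name, not part of PySpark.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from its counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Class counts for the full dataset (from the balance check)
n_negative, n_positive = 500, 268

# A classifier that always predicts class 0
accuracy = n_negative / (n_negative + n_positive)

# Metrics for the diabetic class (label 1): no positives are ever predicted
p1, r1, f1_pos = precision_recall_f1(tp=0, fp=0, fn=n_positive)

print(f"accuracy: {accuracy:.3f}")
print(f"class 1 precision={p1}, recall={r1}, F1={f1_pos}")
```

Such a classifier would reach roughly 65% accuracy while its F1 for the diabetic class is 0. Note also that the `f1` metric of MulticlassClassificationEvaluator is a weighted average over both classes, so the F1 of class 1 alone may differ from the value reported above.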