# Safety Category: Models

In this notebook, several classifiers are trained on the preprocessed safety dataset and compared on a held-out test split.

**Models:**

- Random Forest
- Logistic Regression
- Support Vector Machine
- Neural Network

## Reading the data ##

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.sql.functions import col
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, LinearSVC, MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize Spark Session
spark = SparkSession.builder.appName("ModelComparison").getOrCreate()

# Load CSV dataset
data_path = "data/safety_dataset_filtered.csv"
df = spark.read.csv(data_path, header=True, inferSchema=True)

# Show data schema
df.printSchema()

root
 |-- bookingID: long (nullable = true)
 |-- Speed_perc70: double (nullable = true)
 |-- acceleration_x_min: double (nullable = true)
 |-- acceleration_z_std: double (nullable = true)
 |-- Bearing_std: double (nullable = true)
 |-- acceleration_x_std: double (nullable = true)
 |-- Speed_std: double (nullable = true)
 |-- acceleration_y_std: double (nullable = true)
 |-- acceleration_z_max: double (nullable = true)
 |-- Speed_max: double (nullable = true)
 |-- time: double (nullable = true)
 |-- label: integer (nullable = true)



## Preprocessing the data ##

In [2]:
# Drop non-feature columns
df = df.drop("bookingID")

# Ensure 'label' is integer type
df = df.withColumn("label", col("label").cast("integer"))

feature_cols = [col_name for col_name in df.columns if col_name != "label"]

feature_cols

['Speed_perc70',
 'acceleration_x_min',
 'acceleration_z_std',
 'Bearing_std',
 'acceleration_x_std',
 'Speed_std',
 'acceleration_y_std',
 'acceleration_z_max',
 'Speed_max',
 'time']

In [3]:
# Convert features into a single feature vector
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df)

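# Scale each feature to unit standard deviation (withMean=False, so features are not centered)
# Note: the scaler is fit on the full dataset, before the train/test split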
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=False)
df = scaler.fit(df).transform(df)

# Select only the 'scaled_features' and 'label' columns
df = df.select("scaled_features", "label")
df = df.withColumnRenamed("scaled_features", "features")

# Show sample processed data
df.show(5, truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                                                                                                         |label|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|[2.46632374490225,-1.7137477662481138,1.721000516645763,4.590215978024872,2.0611283187167313,3.8006332049649876,0.9211031805473977,0.456982800746409,4.317393682544466,1.1572045267372614E-4]    |0    |
|[2.0649761468201704,-1.955052600432529,1.2882197597306149,3.1918143431744297,1.652765068822707,3.7264367296228014,0.7685017022083639,0.29192196676784254,4.117208894890124,7.53020440935386E-5]
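
For reference, with withStd=True and withMean=False the scaler divides each feature by its standard deviation and does not subtract the mean. The block below is a minimal pure-Python sketch of that computation, not Spark's implementation. The helper names (fit_stds, scale_rows) and the toy rows are hypothetical, and it assumes the sample (n - 1) standard deviation, which is what the Spark documentation describes for StandardScaler.

```python
import statistics


def fit_stds(rows):
    """Return the sample (n - 1) standard deviation of each column."""
    columns = list(zip(*rows))
    return [statistics.stdev(column) for column in columns]


def scale_rows(rows, stds):
    """Divide each value by its column's std.

    Zero-variance columns are mapped to 0.0 (a choice made for this sketch).
    """
    return [
        [value / std if std > 0 else 0.0 for value, std in zip(row, stds)]
        for row in rows
    ]


# Hypothetical toy data: two features on very different scales.
toy_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
stds = fit_stds(toy_rows)
scaled = scale_rows(toy_rows, stds)
print(stds)
print(scaled)
```

After scaling, both columns have unit standard deviation but keep a non-zero mean. The same two-step shape (fit the statistics, then apply them) also makes it possible to fit on training rows only and reuse the resulting stds on test rows, if leakage from the test split is a concern.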

## Split Data for Training and Testing ##

In [4]:
# Split data into train (80%) and test (20%)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Show dataset sizes
print(f"Training Data: {train_df.count()} rows")
print(f"Test Data: {test_df.count()} rows")

Training Data: 16052 rows
Test Data: 3948 rows
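
The split is 16052 / 3948 rather than exactly 16000 / 4000 because the weights passed to randomSplit are target proportions: rows are assigned by random sampling, so the resulting sizes are only approximately 80% and 20%. The sketch below illustrates the idea in plain Python. It is not Spark's sampler, and the function name random_split is hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test using a seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# 20000 stand-in rows, matching the total row count reported above.
rows = list(range(20000))
train_rows, test_rows = random_split(rows, 0.8, seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes seed=42.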


## Train the Models ##
Each model is trained with train_df, and then predictions are made on test_df.

**Train Random Forest**

In [5]:
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50)
rf_model = rf.fit(train_df)
rf_preds = rf_model.transform(test_df)

**Train Logistic Regression**

In [6]:
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_df)
lr_preds = lr_model.transform(test_df)

**Train Support Vector Machine (SVM)**

In [7]:
svm = LinearSVC(featuresCol="features", labelCol="label", maxIter=10)
svm_model = svm.fit(train_df)
svm_preds = svm_model.transform(test_df)

**Train Neural Network**
- The input layer size is the **number of features**.
- The output layer size is the **number of unique labels**.

In [8]:
num_features = len(feature_cols)
num_classes = df.select("label").distinct().count()

nn = MultilayerPerceptronClassifier(
    featuresCol="features",
    labelCol="label",
    layers=[num_features, 16, 8, num_classes],  # 2 hidden layers (16 and 8 units)
    blockSize=128,
    maxIter=100
)

nn_model = nn.fit(train_df)
nn_preds = nn_model.transform(test_df)
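
To make the layers argument concrete, the sketch below counts the trainable parameters of a fully connected network with one bias per unit. It assumes 10 input features (the length of feature_cols) and 2 classes. The class count is an assumption, since the notebook does not print num_classes. The function name count_mlp_parameters is hypothetical.

```python
def count_mlp_parameters(layers):
    """Sum (inputs + 1 bias) * outputs over each pair of consecutive layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


# Assumed sizes: 10 features in, 2 classes out.
layers = [10, 16, 8, 2]
hidden_layers = layers[1:-1]
total_parameters = count_mlp_parameters(layers)
print(f"{len(hidden_layers)} hidden layers, {total_parameters} parameters")
```

Under these assumptions the network has two hidden layers and 176 + 136 + 18 = 330 parameters, which is small relative to the roughly 16000 training rows.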

## Evaluate Models ##
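To compare the models, we compute accuracy and F1-score on the test set. With metricName="f1", MulticlassClassificationEvaluator reports the per-class F1-score averaged with weights proportional to class frequency.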

In [9]:
# Initialize evaluators
evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

# Map each model name to its test-set predictions
models = {
    "Random Forest": rf_preds,
    "Logistic Regression": lr_preds,
    "SVM": svm_preds,
    "Neural Network": nn_preds
}

# Compute accuracy and F1-score for each model
for name, preds in models.items():
    acc = evaluator_acc.evaluate(preds)
    f1 = evaluator_f1.evaluate(preds)
    print(f"{name}: Accuracy = {acc:.4f}, F1-score = {f1:.4f}")

Random Forest: Accuracy = 0.7791, F1-score = 0.7161
Logistic Regression: Accuracy = 0.7629, F1-score = 0.6800
SVM: Accuracy = 0.7611, F1-score = 0.6638
Neural Network: Accuracy = 0.7647, F1-score = 0.6787
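
Every model scores a lower F1 than accuracy. One possible reason is class imbalance, although the notebook does not show the label distribution, so this is only a hypothesis. The sketch below computes accuracy and a frequency-weighted F1-score in plain Python on hypothetical labels, showing how a classifier that always predicts the majority class produces this pattern. The function names are illustrative, and this is not Spark's evaluator code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Hypothetical imbalanced labels: 80% class 0, and a model that always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(f"Accuracy = {acc:.4f}, weighted F1 = {f1:.4f}")
```

In this toy case accuracy is 0.8 while the weighted F1 is about 0.71, because the minority class contributes an F1 of zero. Inspecting per-class recall on test_df would show whether something similar is happening here.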


## Conclusion ##
Based on the evaluation results, we choose the **Random Forest** model, which achieves the highest accuracy (0.7791) and F1-score (0.7161) on the test set.
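
The selection can also be expressed programmatically. The sketch below hard-codes the scores printed above into a hypothetical results dictionary and picks the best model under each metric. It also reports the F1 margin over the runner-up, which is worth checking because the accuracies are within about 0.02 of each other.

```python
# Scores copied from the evaluation output above.
results = {
    "Random Forest": {"accuracy": 0.7791, "f1": 0.7161},
    "Logistic Regression": {"accuracy": 0.7629, "f1": 0.6800},
    "SVM": {"accuracy": 0.7611, "f1": 0.6638},
    "Neural Network": {"accuracy": 0.7647, "f1": 0.6787},
}

best_by_accuracy = max(results, key=lambda name: results[name]["accuracy"])
best_by_f1 = max(results, key=lambda name: results[name]["f1"])

# Margin between the best and second-best F1-score.
f1_sorted = sorted((scores["f1"] for scores in results.values()), reverse=True)
f1_margin = f1_sorted[0] - f1_sorted[1]

print(f"Best by accuracy: {best_by_accuracy}")
print(f"Best by F1: {best_by_f1} (margin {f1_margin:.4f})")
```

Both metrics agree on Random Forest. Since the comparison rests on a single train/test split with default or lightly tuned hyperparameters, repeating it with cross-validation would give a more reliable ranking.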