# MLlib RDD-Based Implementation

## 1. Setup & Data Loading

In [33]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

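# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder \
    .appName("LogisticRegression_Spark") \
    .getOrCreate()

# Read the CSV with a header row and let Spark infer column types
df = spark.read.csv("creditcard.csv", header=True, inferSchema=True)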


## 2. Data Preprocessing

This dataset contains credit card transactions over two days in 2013, with a severe class imbalance: only 0.172% of the transactions are fraudulent (492 out of 284,807).
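
To see why this imbalance matters, here is a small sketch that uses only the counts quoted above. It computes the fraud rate and the accuracy of a hypothetical baseline that labels every transaction as non-fraud; the variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the text above
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total

# Hypothetical baseline: predict "not fraud" (0) for every transaction.
# It is right on every legitimate transaction and wrong on every fraud.
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud  # it never flags a single fraud

print(f"Fraud rate:        {fraud_rate:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.4%}")
```

A model that never detects fraud would still score above 99.8% accuracy, which is why accuracy alone says little on this dataset.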

### Key Points About the Data:
- Features V1 to V28 are PCA-transformed, anonymized components.

- Time is the number of seconds elapsed since the first transaction in the dataset; it is not used as a feature in this notebook.

- Amount is the raw transaction value — not standardized, unlike the PCA features.

- Class is the target label — 1 for fraud, 0 for normal.

### Preprocessing Steps:

- Ensure data quality by removing rows with any missing/null values.

- Standardize Amount: it is on a different scale than the PCA components, so we rescale it to zero mean and unit variance.

- Combine the PCA features (V1 to V28) and the scaled Amount into a single feature vector, as Spark's machine learning algorithms require.

- Use the Class column as the label for classification — 1 (fraud), 0 (non-fraud).

- Because fraud cases are rare, metrics like precision, recall, and AUC-PR are more appropriate than accuracy.
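
As an illustration of a threshold-free metric, the following sketch computes average precision, a common estimate of the area under the precision-recall curve, in plain Python. The scores and labels are made-up toy values, and `average_precision` is a hypothetical helper, not a Spark API. Ties between scores are not treated specially.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    # Rank examples from most to least suspicious
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / n_pos

# Toy data (assumed for illustration): 2 frauds among 8 transactions
labels = [0, 0, 1, 0, 0, 0, 1, 0]
good_scores = [0.1, 0.2, 0.9, 0.1, 0.3, 0.2, 0.8, 0.1]  # frauds ranked first
poor_scores = [0.9, 0.8, 0.2, 0.7, 0.6, 0.5, 0.1, 0.4]  # frauds ranked last

ap_good = average_precision(good_scores, labels)
ap_poor = average_precision(poor_scores, labels)
print(f"Frauds ranked first: {ap_good:.3f}")
print(f"Frauds ranked last:  {ap_poor:.3f}")
```

Both rankings contain the same 25% of positives, yet the metric separates them sharply because it rewards placing the rare class at the top of the list.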

In [34]:
from pyspark.sql.functions import col, count, when
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Count nulls per column, then drop any rows that contain them
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

df = df.dropna()

pca_features = [c for c in df.columns if c.startswith("V")]  # V1 to V28
feature_cols_to_scale = ["Amount"]
final_feature_cols = pca_features + feature_cols_to_scale

# StandardScaler works on vector columns, so wrap Amount in a one-element vector first
assembler_for_scaling = VectorAssembler(inputCols=feature_cols_to_scale, outputCol="amount_vec")
df = assembler_for_scaling.transform(df)

# Standardize Amount to zero mean and unit standard deviation
scaler = StandardScaler(inputCol="amount_vec", outputCol="scaled_amount", withStd=True, withMean=True)
scaler_model = scaler.fit(df)
df = scaler_model.transform(df)

# Replace the raw Amount column with its scaled version
df = df.drop("Amount", "amount_vec")
df = df.withColumnRenamed("scaled_amount", "Amount")

# Assemble V1..V28 and the scaled Amount into "features"; Time is left out
final_features = pca_features + ["Amount"]
assembler = VectorAssembler(inputCols=final_features, outputCol="features")
df = assembler.transform(df).select("features", col("Class").alias("label"))


+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|Time| V1| V2| V3| V4| V5| V6| V7| V8| V9|V10|V11|V12|V13|V14|V15|V16|V17|V18|V19|V20|V21|V22|V23|V24|V25|V26|V27|V28|Amount|Class|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|   0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|     0|    0|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
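
The scaling applied to Amount in the cell above can be sketched in plain Python. This is a minimal illustration on made-up amounts, assuming the scaler subtracts the mean and divides by the sample (n - 1) standard deviation, which is how Spark documents `StandardScaler` with `withMean=True` and `withStd=True`. `standardize` is a hypothetical helper, not part of the notebook.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [(x - mean) / std for x in values]

# Toy transaction amounts (assumed for illustration)
amounts = [2.5, 10.0, 149.62, 378.66, 69.99]
scaled = standardize(amounts)

print([round(z, 3) for z in scaled])
print(f"mean={statistics.mean(scaled):.6f}, std={statistics.stdev(scaled):.6f}")
```

After scaling, the values have mean 0 and standard deviation 1 (up to floating-point error), which puts Amount on a footing comparable to the PCA components.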



## 4. MLlib RDD-Based Implementation 

In [35]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.classification import LogisticRegressionWithLBFGS


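# Convert each DataFrame row to an MLlib LabeledPoint; toArray() works for both dense and sparse vectors
rdd_data = df.rdd.map(lambda row: LabeledPoint(row["label"], Vectors.dense(row["features"].toArray())))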

train_rdd, test_rdd = rdd_data.randomSplit([0.8, 0.2], seed=42)

model_rdd = LogisticRegressionWithLBFGS.train(train_rdd, iterations=100, numClasses=2)


# Pair each prediction with its true label; cached because several counts below reuse it
predictions_rdd = test_rdd.map(lambda p: (float(model_rdd.predict(p.features)), p.label)).cache()


correct_preds = predictions_rdd.filter(lambda x: x[0] == x[1]).count()
total_preds = test_rdd.count()
accuracy_rdd = correct_preds / float(total_preds)


# TP = predicted 1, actual 1
tp = predictions_rdd.filter(lambda x: x == (1.0, 1.0)).count()
# FP = predicted 1, actual 0
fp = predictions_rdd.filter(lambda x: x == (1.0, 0.0)).count()
# FN = predicted 0, actual 1
fn = predictions_rdd.filter(lambda x: x == (0.0, 1.0)).count()

# Precision = TP / (TP + FP)
precision_rdd = tp / (tp + fp) if (tp + fp) != 0 else 0
# Recall = TP / (TP + FN)
recall_rdd = tp / (tp + fn) if (tp + fn) != 0 else 0

# Squared (prediction - actual), averaged over all samples.
# With 0/1 predictions and labels, every error contributes exactly 1, so this equals 1 - accuracy.
mse_rdd = predictions_rdd.map(lambda x: (x[0] - x[1]) ** 2).mean()

f1_rdd = 2 * precision_rdd * recall_rdd / (precision_rdd + recall_rdd) if (precision_rdd + recall_rdd) != 0 else 0

print("\n=== MLlib RDD-Based Evaluation Results ===")
print(f"Total Test Samples: {total_preds}")
print(f"Accuracy:  {accuracy_rdd:.4f}")
print(f"Precision: {precision_rdd:.4f}")
print(f"Recall:    {recall_rdd:.4f}")
print(f"F1 Score:  {f1_rdd:.4f}")
print(f"Mean Squared Error (MSE): {mse_rdd:.6f}")


25/04/11 16:40:01 WARN Instrumentation: [c8042e57] Initial coefficients will be ignored! Its dimensions (1, 29) did not match the expected size (1, 29)


=== MLlib RDD-Based Evaluation Results ===
Total Test Samples: 57105
Accuracy:  0.9993
Precision: 0.8230
Recall:    0.8378
F1 Score:  0.8304
Mean Squared Error (MSE): 0.000665
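
As a sanity check on these numbers, the sketch below recomputes the metrics from a confusion matrix. The notebook does not print the raw counts; TP = 93, FP = 20, FN = 18 is one combination, inferred here by back-solving from the rounded output, that reproduces every reported value for 57,105 test samples. Treat the counts as an assumption, and `binary_metrics` as a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, total):
    """Accuracy, precision, recall, F1 and MSE for 0/1 predictions."""
    errors = fp + fn
    accuracy = (total - errors) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # With 0/1 labels every error contributes 1**2 = 1, so MSE is the error rate
    mse = errors / total
    return accuracy, precision, recall, f1, mse

# Counts inferred from the printed results (an assumption, not notebook output)
acc, prec, rec, f1, mse = binary_metrics(tp=93, fp=20, fn=18, total=57105)

print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Mean Squared Error (MSE): {mse:.6f}")
```

Under these counts the model misses 18 of 111 frauds in the test split, a detail the 0.9993 accuracy hides. The MSE is exactly 1 - accuracy, so it adds no information beyond the accuracy figure.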

