# **Classification with Logistic Regression**

## **3.1.1 Structured API Implementation (High-Level)**

### **1. Data preparation**

In [16]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreditCardFraud").getOrCreate()

df = spark.read.csv("creditcard.csv", header=True, inferSchema=True)

### **2. Data Preprocessing**

This dataset contains credit card transactions over two days in 2013, with a severe class imbalance: only 0.172% of the transactions are fraudulent (492 out of 284,807).

### Key Points About the Data:
- Features V1 to V28 are PCA-transformed, anonymized components.

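- Time is the number of seconds elapsed since the first transaction in the dataset. It is unlikely to be a strong fraud signal on its own, though the code below still includes it in the feature vector.

- Amount is the raw transaction value. Unlike the PCA features V1 to V28, it has not been standardized.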

- Class is the target label — 1 for fraud, 0 for normal.

### Preprocessing Steps:

- Ensure data quality by removing rows with any missing/null values.

- Standardize Amount: because it is on a different scale than the PCA components, we rescale it to zero mean and unit variance.

- Combine all numerical features (including the scaled Amount) into a single feature vector, as required by machine learning algorithms in Spark.

- Use the Class column as the label for classification — 1 (fraud), 0 (non-fraud).

- Apply undersampling to address class imbalance. Because fraud cases are far rarer, we randomly sample from the majority class (non-fraud) so that its size roughly matches the number of fraud samples. The resulting balanced dataset helps prevent the model from being biased toward the majority class.

- Because fraud cases are rare, metrics like precision, recall, and AUC-PR are more appropriate than accuracy.
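
To see why accuracy is misleading on the original, imbalanced data, consider a hypothetical baseline that labels every transaction as non-fraud. The sketch below uses only the class counts quoted above (492 frauds out of 284,807 transactions); the function name `baseline_metrics` is illustrative and not part of the notebook.

```python
def baseline_metrics(n_total: int, n_fraud: int) -> dict:
    """Metrics for a trivial classifier that predicts 'non-fraud' for every row."""
    true_negatives = n_total - n_fraud   # every legitimate transaction is labelled correctly
    false_negatives = n_fraud            # every fraud is missed
    true_positives = 0                   # the baseline never predicts fraud
    accuracy = (true_positives + true_negatives) / n_total
    fraud_recall = true_positives / (true_positives + false_negatives)
    return {"accuracy": accuracy, "fraud_recall": fraud_recall}


metrics = baseline_metrics(n_total=284_807, n_fraud=492)
print(f"Accuracy:     {metrics['accuracy']:.4%}")
print(f"Fraud recall: {metrics['fraud_recall']:.4%}")
```

The baseline scores above 99.8% accuracy while catching no fraud at all, which is why recall and precision on the fraud class are the more informative numbers.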


In [None]:
from pyspark.sql.functions import col, count, rand, when
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Drop rows that contain any null value
df = df.dropna()

pca_features = [c for c in df.columns if c.startswith("V")]  # V1 to V28
feature_cols_to_scale = ["Amount"]
final_feature_cols = pca_features + feature_cols_to_scale

# Assemble Amount for scaling
assembler_for_scaling = VectorAssembler(inputCols=feature_cols_to_scale, outputCol="amount_vec")
df = assembler_for_scaling.transform(df)

# Scale Amount
scaler = StandardScaler(inputCol="amount_vec", outputCol="scaled_amount", withStd=True, withMean=True)
scaler_model = scaler.fit(df)
df = scaler_model.transform(df)

# Drop raw Amount and rename scaled column
df = df.drop("Amount", "amount_vec")
df = df.withColumnRenamed("scaled_amount", "Amount")

# Separate classes
class_1_df = df.filter(col("Class") == 1)
class_0_df = df.filter(col("Class") == 0)

# Undersample class 0 to roughly the size of class 1.
# sample() keeps each row with the given probability, so the count is approximate.
count_1 = class_1_df.count()
count_0 = class_0_df.count()
balanced_0_df = class_0_df.sample(withReplacement=False, fraction=count_1 / count_0, seed=2505)

# Combine and shuffle
balanced_df = balanced_0_df.union(class_1_df)
balanced_df = balanced_df.orderBy(rand(seed=2505))


# Use every column except the label as a feature (Time, V1 to V28, and the scaled Amount)
input_columns = [col_name for col_name in balanced_df.columns if col_name != "Class"]

data = VectorAssembler(inputCols=input_columns, outputCol="Features") \
           .transform(balanced_df).select("Features", "Class")


+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|Time| V1| V2| V3| V4| V5| V6| V7| V8| V9|V10|V11|V12|V13|V14|V15|V16|V17|V18|V19|V20|V21|V22|V23|V24|V25|V26|V27|V28|Amount|Class|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|   0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|     0|    0|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
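
The undersampling step above relies on `DataFrame.sample`, which keeps each row independently with the given probability, so the resulting non-fraud count is only approximately equal to the fraud count. The pure-Python sketch below imitates that behaviour with the class counts quoted earlier. It is a simulation under that assumption, not Spark's actual sampler, and `simulate_undersample` is a hypothetical helper.

```python
import random


def simulate_undersample(n_majority: int, n_minority: int, seed: int) -> int:
    """Count majority rows that survive when each is kept with probability n_minority / n_majority."""
    fraction = n_minority / n_majority
    rng = random.Random(seed)
    return sum(1 for _ in range(n_majority) if rng.random() < fraction)


n_fraud = 492
n_normal = 284_807 - n_fraud  # 284,315 non-fraud rows

for seed in (1, 2, 3):
    kept = simulate_undersample(n_normal, n_fraud, seed)
    print(f"seed={seed}: kept {kept} non-fraud rows (target {n_fraud})")
```

If an exact match matters, one option is to shuffle the majority class and truncate it, for example `class_0_df.orderBy(rand(seed=2505)).limit(count_1)`, at the cost of a full sort.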

### **3. Train the Logistic Regression Model Using MLlib**

In [18]:
from pyspark.ml.classification import LogisticRegression

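# Hold out 20% of the balanced data for testing
train_data, test_data = data.randomSplit([0.8, 0.2], seed=1234)

lr = LogisticRegression(featuresCol="Features", labelCol="Class")

# Fit on the training split, then score the test split
model = lr.fit(train_data)

predictions = model.transform(test_data)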
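
For intuition about what `model.transform` computes: a binary logistic regression scores each row with a weighted sum of its features, passes that score through the sigmoid function, and predicts class 1 when the resulting probability exceeds a threshold (0.5 is Spark's default). The weights, intercept, and feature values below are made up for illustration; they are not the fitted model's coefficients.

```python
import math


def predict_proba(features, weights, intercept):
    """Probability of class 1 under a binary logistic regression model."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-margin))


def predict(features, weights, intercept, threshold=0.5):
    """Return 1 (fraud) if the predicted probability exceeds the threshold, else 0."""
    return 1 if predict_proba(features, weights, intercept) > threshold else 0


# Hypothetical coefficients for a three-feature model (illustration only)
weights = [0.8, -1.2, 0.5]
intercept = -0.3

row = [1.5, -0.7, 2.0]
print(f"P(fraud) = {predict_proba(row, weights, intercept):.3f}")
print(f"Prediction = {predict(row, weights, intercept)}")
```

In the fitted Spark model, the corresponding values are exposed as `model.coefficients` and `model.intercept`.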



### **4. Evaluation**

In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# Fraction of test rows classified correctly
accuracy = MulticlassClassificationEvaluator(labelCol="Class", metricName="accuracy").evaluate(predictions)

# Area under the ROC curve, computed from the rawPrediction column
auc = BinaryClassificationEvaluator(labelCol="Class", metricName="areaUnderROC").evaluate(predictions)

# weightedPrecision and weightedRecall average the per-class values, weighted by class frequency
precision = MulticlassClassificationEvaluator(labelCol="Class", metricName="weightedPrecision").evaluate(predictions)

recall = MulticlassClassificationEvaluator(labelCol="Class", metricName="weightedRecall").evaluate(predictions)

print(f"Accuracy: {accuracy:.4f}")
print(f"AUC: {auc:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")



Accuracy: 0.9350
AUC: 0.9643
Precision: 0.9361
Recall: 0.9350
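
The precision and recall reported above are weighted averages over both classes, and weighted recall is always equal to accuracy, which is why those two numbers coincide. The sketch below computes per-class and weighted metrics from a small made-up set of labels and predictions (not the notebook's results) to show the difference; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and support (number of true rows) for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, tp + fn


# Made-up labels and predictions for illustration (1 = fraud, 0 = non-fraud)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

weighted_precision = 0.0
weighted_recall = 0.0
for label in (0, 1):
    precision, recall, support = per_class_metrics(y_true, y_pred, label)
    weight = support / len(y_true)  # share of rows that truly belong to this class
    weighted_precision += weight * precision
    weighted_recall += weight * recall
    print(f"class {label}: precision={precision:.3f} recall={recall:.3f}")

print(f"accuracy={accuracy:.3f} weighted_recall={weighted_recall:.3f}")
```

To report the fraud class alone in Spark 3.0 and later, `MulticlassClassificationEvaluator` accepts `metricName="precisionByLabel"` or `"recallByLabel"` together with `metricLabel=1.0`, and `BinaryClassificationEvaluator` supports `metricName="areaUnderPR"`.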
