# **Classification with Logistic Regression**
## **3.1.3 Low-Level Operations**

### **1. Parse dataset and create RDD**

In [2]:
from pyspark import SparkContext
import numpy as np
import csv
from io import StringIO

spark = SparkContext.getOrCreate()

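# read the CSV as lines of text and drop the header row
raw_data = spark.textFile("creditcard.csv")
header = raw_data.first()
data = raw_data.filter(lambda row: row != header)

# split each line with csv.reader so that quoted fields are handled correctly
parsed_data = data.map(lambda line: next(csv.reader(StringIO(line))))

# build (label, features) pairs
rdd_data = parsed_data.map(lambda fields: (
    float(fields[-1]),                # label: last column (class)
    [float(x) for x in fields[1:-1]]  # features: every column except the first and the label
))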
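
To make the two `map` steps concrete, here is a minimal sketch of the same parsing applied to a single line in plain Python, without Spark. The helper name `parse_line` and the sample row are made up for illustration and are not taken from the dataset.

```python
import csv
from io import StringIO

def parse_line(line):
    """Turn one CSV line into a (label, features) pair."""
    fields = next(csv.reader(StringIO(line)))      # csv.reader strips the quotes around "0"
    label = float(fields[-1])                      # last column is the class
    features = [float(x) for x in fields[1:-1]]    # drop the first column and the label
    return label, features

# hypothetical row: first column, three feature columns, an amount, a quoted class
sample_line = '0.0,-1.36,0.07,2.54,149.62,"0"'
label, features = parse_line(sample_line)
print(label, features)
```

Using `csv.reader` instead of `line.split(",")` matters when a field is quoted, since a plain split would leave the quote characters in the string and `float()` would fail on it.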

### **2. Using undersampling to balance the class label of the dataset**

In [3]:
# separate positive and negative samples
positive = rdd_data.filter(lambda x: x[0] == 1.0)
negative = rdd_data.filter(lambda x: x[0] == 0.0)

# count
pos_count = positive.count()
neg_count = negative.count()

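# downsample the negative class without replacement; the fraction is the
# per-record keep probability, so the result has roughly pos_count records
negative_downsampled = negative.sample(False, float(pos_count) / neg_count, seed=2505)

# combine both classes; repartition shuffles the records across 4 partitions
balanced_rdd = positive.union(negative_downsampled).repartition(4).cache()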
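
When sampling without replacement, `sample` keeps each record independently with probability `fraction`, so the downsampled negative class matches the positive count only approximately. Below is a minimal plain-Python sketch of that behaviour. The class counts and the helper `bernoulli_downsample` are hypothetical, and Python's `random` module does not reproduce Spark's random number stream, so the seed here gives a different selection than the notebook's.

```python
import random

def bernoulli_downsample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

# hypothetical class counts, chosen only for illustration
pos_count, neg_count = 50, 5000
fraction = pos_count / neg_count          # per-record keep probability

negatives = list(range(neg_count))        # stand-in for the negative records
kept = bernoulli_downsample(negatives, fraction, seed=2505)
print(len(kept))                          # close to pos_count, rarely exactly equal
```

If an exact class balance were required, one option would be `takeSample`, which returns a fixed number of records to the driver; whether that is practical depends on the size of the minority class.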

### **3. Implementation of logistic regression using gradient descent**

In [4]:
feature_length = len(balanced_rdd.take(1)[0][1])
weights = [0.0] * feature_length
learning_rate = 0.0001
iterations = 100

def dot_product(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# sigmoid function to convert score to probability
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def compute_gradient(label, features, weights):
    prediction = sigmoid(dot_product(weights, features))
    error = prediction - label
    return [error * f for f in features]

# the record count does not change, so compute it once outside the loop
count = balanced_rdd.count()

for i in range(iterations):
    # per-record gradients, computed with the current weights
    gradients = balanced_rdd.map(lambda x: compute_gradient(x[0], x[1], weights))

    # sum the gradients element-wise, then average over all records
    total_gradient = gradients.reduce(lambda a, b: [x + y for x, y in zip(a, b)])
    avg_gradient = [g / count for g in total_gradient]

    # weight update: w = w - learning_rate * gradient
    weights = [w - learning_rate * g for w, g in zip(weights, avg_gradient)]

    # if i % 10 == 0:
    #     print(f"Iteration {i}")

# predict labels using the final weights
def predict_label(features, weights):
    prob = sigmoid(dot_product(weights, features))
    return 1 if prob >= 0.5 else 0
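
Each iteration of the loop averages the per-record gradient `(sigmoid(w·x) - y) * x` and moves the weights a small step against it. The sketch below runs the same map and reduce pattern locally on a tiny, made-up dataset so the update can be followed without a cluster. The toy records and the learning rate of 0.1 are assumptions chosen for the example, not values from the notebook.

```python
import math
from functools import reduce

def dot_product(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compute_gradient(label, features, weights):
    # gradient of the log loss for one record: (prediction - label) * features
    error = sigmoid(dot_product(weights, features)) - label
    return [error * f for f in features]

# hypothetical, linearly separable toy records: (label, features)
toy_records = [
    (1.0, [2.0, 1.0]),
    (1.0, [1.5, 0.5]),
    (0.0, [-1.0, -0.5]),
    (0.0, [-2.0, -1.5]),
]

weights = [0.0, 0.0]
learning_rate = 0.1  # assumption: larger than the notebook's 0.0001, suited to the toy data

for _ in range(100):
    # "map": one gradient per record; "reduce": element-wise sum
    gradients = [compute_gradient(y, x, weights) for y, x in toy_records]
    total_gradient = reduce(lambda a, b: [p + q for p, q in zip(a, b)], gradients)
    avg_gradient = [g / len(toy_records) for g in total_gradient]
    weights = [w - learning_rate * g for w, g in zip(weights, avg_gradient)]

predicted = [1 if sigmoid(dot_product(weights, x)) >= 0.5 else 0 for _, x in toy_records]
print(weights, predicted)
```

The list comprehension and `functools.reduce` stand in for `rdd.map` and `rdd.reduce`; the arithmetic per record is the same.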

### **4. Model evaluation**

In [5]:
# pair each true label with the low-level model's prediction
# (evaluated on the same balanced data that was used for training)
predictions = balanced_rdd.map(lambda x: (x[0], predict_label(x[1], weights)))

# evaluate prediction performance with accuracy
correct = predictions.filter(lambda x: x[0] == x[1]).count()
total = predictions.count()
accuracy = correct / total

print(f"\nFinal Accuracy: {accuracy:.4f}")


Final Accuracy: 0.8686


### **Explanations:**
- Learning rate of 0.0001: kept small because the features are high-dimensional, widely scattered, and used without scaling, so larger values may cause the updates to diverge.
- Iteration count of 100: enough iterations to observe the convergence trend and reach reasonable accuracy, without overfitting or excessive computation time.
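
One way to actually observe a convergence trend is to track the mean log loss of the current weights every few iterations. The function below is a minimal plain-Python sketch of that measurement. The name `log_loss`, the toy records, and the example weight vectors are hypothetical, and the notebook itself does not compute this quantity.

```python
import math

def log_loss(records, weights, eps=1e-12):
    """Mean binary cross-entropy of (label, features) records under `weights`."""
    total = 0.0
    for label, features in records:
        z = sum(w * x for w, x in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(records)

# made-up records and weights, for illustration only
toy_records = [(1.0, [2.0, 1.0]), (0.0, [-1.0, -0.5])]
loss_at_zero = log_loss(toy_records, [0.0, 0.0])   # all-zero weights predict 0.5 everywhere
loss_trained = log_loss(toy_records, [1.0, 1.0])   # weights that separate the two records
print(loss_at_zero, loss_trained)
```

With all-zero weights every prediction is 0.5, so the loss starts at ln 2; a loss that keeps falling from that value across iterations suggests the learning rate is not too large.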
### **Challenges:**
- Implementing gradient descent manually with RDD operations required care in how the weights reach the executors: they are updated on the driver after each iteration and must be sent to the workers again so that every gradient is computed with the current values.
- Parsing and managing numerical data from the text-based CSV format required custom parsing and float conversion.
- Ensuring the sigmoid function didn't overflow required care with extreme dot product values.
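
The notebook guards against overflow by clipping the score to [-500, 500] before calling `np.exp`. An alternative, shown here only as a sketch and not as the notebook's method, is to branch on the sign of the score so that the exponential is only ever taken of a non-positive number. The name `stable_sigmoid` is hypothetical.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow
        return 1.0 / (1.0 + math.exp(-z))
    # for negative z, exp(z) is below 1; rewrite 1/(1+exp(-z)) as exp(z)/(1+exp(z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

print(stable_sigmoid(0.0), stable_sigmoid(1000.0), stable_sigmoid(-1000.0))
```

Both approaches avoid the overflow; the sign-based form simply does not need a clipping threshold.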