# 3. Spark

Spark Programming Guide: <https://spark.apache.org/docs/latest/> (the Python API is recommended)
Spark API: <https://spark.apache.org/docs/latest/api/python/index.html>


# 3.1 Example Walkthrough
3.1 Work through the Spark examples below. Once you have completed them, continue with Exercises 3.2 and 3.3.


### Initialize PySpark

First, we install the `pyspark` package and create a SparkContext and a SparkSession.

In [None]:
!pip install pyspark

In [None]:
import os, sys

In [None]:
# Initialize PySpark
import pyspark
import pyspark.sql
from pyspark.sql import Row

APP_NAME = "PySpark Lecture"
SPARK_MASTER = "local[1]"  # run locally with a single worker thread

# Keep Spark's scratch files in ./tmp under the current working directory
conf = pyspark.SparkConf().setAppName(APP_NAME).set("spark.local.dir", os.path.join(os.getcwd(), "tmp"))
sc = pyspark.SparkContext(master=SPARK_MASTER, conf=conf)

# The builder reuses the SparkContext created above
spark = pyspark.sql.SparkSession.builder.appName(APP_NAME).getOrCreate()

print("PySpark initiated...")

### Hello, World!

Loading data, mapping it and collecting the records into RAM...

In [1]:
!wget https://raw.githubusercontent.com/scalable-infrastructure/exercise-2025/main/data/example.csv

In [None]:
# Load the text file using the SparkContext
csv_lines = sc.textFile("example.csv")

# Map the data to split the lines into a list
data = csv_lines.map(lambda line: line.split(","))

# Collect the dataset into local RAM
data.collect()

### Creating Objects from CSV

Using a function with a map operation to create objects (dicts) as records...

In [None]:
# Turn the CSV lines into objects
def csv_to_record(line):
    parts = line.split(",")
    record = {
      "name": parts[0],
      "company": parts[1],
      "title": parts[2]
    }
    return record

# Apply the function to every record
records = csv_lines.map(csv_to_record)

# Inspect the first item in the dataset
records.first()
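
Splitting on every comma works for simple files, but it breaks as soon as a field contains a comma inside quotes. The following is a minimal sketch of a more tolerant parser, assuming the same three-column layout (name, company, title) and standard CSV quoting. The function name `csv_to_record_quoted` and the sample line are hypothetical and not part of the original walkthrough.

```python
import csv

def csv_to_record_quoted(line):
    """Parse one CSV line into a dict, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list
    parts = next(csv.reader([line]))
    return {
        "name": parts[0],
        "company": parts[1],
        "title": parts[2],
    }

# Hypothetical line whose first field contains a quoted comma
sample = '"Doe, Jane",Example Corp,Engineer'
record = csv_to_record_quoted(sample)
print(record)
```

It could be used in place of `csv_to_record` in the map call, for example `csv_lines.map(csv_to_record_quoted)`.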

### GroupBy

Using the groupBy operator to count the number of jobs per person...

In [None]:
# Group the records by the name of the person
grouped_records = records.groupBy(lambda x: x["name"])

# Show the first group
grouped_records.first()

# Count the groups
job_counts = grouped_records.map(
  lambda x: {
    "name": x[0],
    "job_count": len(x[1])
  }
)

job_counts.first()

job_counts.collect()
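
The same counts can also be computed without materializing each group. The sketch below is an alternative, not the method used above: it maps every record to a `(name, 1)` pair and sums the ones per key with `reduceByKey`, which lets Spark combine partial sums within each partition before shuffling. The helper names `to_pair`, `add` and `count_jobs_per_name` are hypothetical; the function assumes it receives an RDD of dicts with a `"name"` key, such as `records`.

```python
def to_pair(record):
    """Turn a record dict into a (key, 1) pair for counting."""
    return (record["name"], 1)

def add(a, b):
    """Combine two partial counts."""
    return a + b

def count_jobs_per_name(records_rdd):
    """Return an RDD of (name, job_count) pairs.

    Assumes records_rdd is an RDD of dicts that each have a "name" key.
    """
    return records_rdd.map(to_pair).reduceByKey(add)
```

With the `records` RDD from the previous cell, `count_jobs_per_name(records).collect()` would return a list of `(name, job_count)` tuples rather than dicts.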

### Map vs FlatMap

Understanding the difference between the `map` and `flatMap` operators...

In [None]:
# Compute a relation of words by line
words_by_line = csv_lines\
  .map(lambda line: line.split(","))

print(words_by_line.collect())

# Compute a relation of words
flattened_words = csv_lines\
  .map(lambda line: line.split(","))\
  .flatMap(lambda x: x)

flattened_words.collect()
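
To make the difference concrete without a Spark cluster, here is a plain-Python sketch of the two semantics on ordinary lists. The sample lines are hypothetical stand-ins for the contents of `example.csv`, and `itertools.chain` plays the role of the flattening step.

```python
from itertools import chain

# Hypothetical sample lines standing in for the CSV file
lines = ["Alice,Acme,Engineer", "Bob,Initech,Analyst"]

# map: exactly one output element per input element (here, one list per line)
mapped = [line.split(",") for line in lines]

# flatMap: apply the function, then flatten the results by one level
flat_mapped = list(chain.from_iterable(line.split(",") for line in lines))

print(mapped)
print(flat_mapped)
```

In Spark the two steps of the last cell can be merged into a single call, `csv_lines.flatMap(lambda line: line.split(","))`, since `flatMap` applies the function and flattens its results in one pass.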

---
## Further Exercises




In [None]:
!wget "https://github.com/scalable-infrastructure/exercise-2025/blob/main/data/nasa/NASA_access_log_Jul95.gz?raw=true" -O NASA_access_log_Jul95.gz
!gzip -d NASA_access_log_Jul95.gz


3.2 Implement a word count using Spark. How many words are in the file `example.csv`?

3.3 Using the NASA log file, implement a Spark version of the HTTP Response Code Analysis. How many log entries exist per HTTP response code?
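
As a starting hint for 3.3, the sketch below shows one possible way to pull the response code out of a single log line. It assumes the lines follow the Common Log Format, in which the status code is the second-to-last whitespace-separated field. The helper name `extract_status` and the sample line are hypothetical, and the Spark part of the exercise is left open.

```python
def extract_status(line):
    """Return the HTTP status code of a log line as a string, or None if malformed.

    Assumes Common Log Format: ... "REQUEST" STATUS BYTES
    """
    # Split off the last two fields from the right: STATUS and BYTES
    parts = line.rsplit(None, 2)
    if len(parts) < 3 or not parts[1].isdigit():
        return None
    return parts[1]

# Hypothetical sample line in Common Log Format
sample = 'host.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 6245'
print(extract_status(sample))
```

Returning `None` for malformed lines makes it straightforward to filter them out before counting.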