# Spark Exercise

Apache Spark is well suited to data engineering projects because it processes large-scale data efficiently through distributed computing. Its in-memory processing speeds up data operations, especially in multi-stage workloads. It reads and writes many data sources and formats, and its API is available in several languages, including Python, Java, and Scala. Its ecosystem includes libraries for SQL, machine learning, and graph processing, which makes it a flexible foundation for complex data pipelines and analytics.

Use Python, `pyspark` and `pandas` to explore Apache Spark RDDs and DataFrames:

# Spark RDD

Spark RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It enables fault-tolerant, distributed processing of large datasets across the nodes of a cluster. RDDs expose two kinds of operations: transformations such as `map`, which lazily define a new RDD, and actions such as `reduce` or `collect`, which trigger the computation and return a result.

## Import Necessary Libraries

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

In [None]:
import pandas as pd
import json
import csv

In [3]:
# 📍 InfluxDB configuration
INFLUXDB_URL = "http://localhost:10896"
INFLUXDB_TOKEN = "14iJvsBJKp37nLXjIZvE4RbAoEO2dNs1k0GvCbKuJUnF_ub4pSWWw80O739jabLPMD-XBzA72WSX9f-4FuDBQ=="
INFLUXDB_ORG = "bdinf-org"
INFLUXDB_BUCKET = "bdinf-bucket"

spark_master_url = "spark://localhost:7077"

## Spark Context and Session
Initialize the Spark Session. The Spark Context is available from it as `spark.sparkContext`.

In [4]:
# 🧠 SparkSession connected to the Spark master (no MinIO/S3 options are configured here)

spark = SparkSession.builder \
    .appName("Sentiment AlphaVantage") \
    .master(spark_master_url) \
    .getOrCreate()

## Load Data into RDD

In [16]:
import os
import json

# Define the directory path
data_dir = "data"

# Initialize an empty list to hold all the combined data
combined_data = []

# Loop through all files in the directory
for filename in os.listdir(data_dir):
    if filename.endswith(".json"):
        file_path = os.path.join(data_dir, filename)
        with open(file_path, "r") as f:
            try:
                data = json.load(f)
                combined_data += data
            except json.JSONDecodeError:
                print(f"Error decoding {filename}, skipping.")

# `combined_data` now contains all data from the JSON files
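
The loop above relies on `combined_data += data`, which only behaves as intended when each file holds a JSON array of articles. The following is a minimal sketch of a defensive variant, assuming some files might instead wrap the articles in an object under a `feed` key. The key name and the helper `extract_articles` are assumptions for illustration and are not part of the exercise.

```python
import json

def extract_articles(payload, feed_key="feed"):
    """Return a list of article dicts from a decoded JSON payload.

    `feed_key` is a hypothetical wrapper key; adjust it to the real files.
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        articles = payload.get(feed_key, [])
        return articles if isinstance(articles, list) else []
    return []

# Two hypothetical payload shapes
as_list = json.loads('[{"title": "a"}, {"title": "b"}]')
as_object = json.loads('{"feed": [{"title": "c"}]}')

combined = []
combined += extract_articles(as_list)
combined += extract_articles(as_object)
print(len(combined))  # 3

# Pitfall: extending a list with a dict adds the dict's keys, not its records
pitfall = []
pitfall += as_object
print(pitfall)  # ['feed']
```

Normalizing each payload before extending the list keeps strings such as `'feed'` out of the data that is later handed to `parallelize`.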


In [17]:
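# Distribute all combined records (already parsed into dicts by json.load), not just the last file
rdd = spark.sparkContext.parallelize(combined_data)
# Decode only records that are still raw JSON strings; dicts pass through unchanged
parsed_rdd = rdd.map(lambda x: json.loads(x) if isinstance(x, str) else x)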

In [18]:
print(type(parsed_rdd))

<class 'pyspark.core.rdd.PipelinedRDD'>


## Map Operation

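Use `flatMap` to split each article into one key-value pair per tracked ticker. The key is `(date, ticker)`; the value holds the sentiment score, the relevance score, a count of 1, and the product of sentiment and relevance.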

In [19]:
mapped_rdd = rdd.flatMap(lambda row: [
    (
        (row['time_published'][:8], ts['ticker']),  # key: (date, ticker)
        (
            float(ts['ticker_sentiment_score']),            # sentiment
            float(ts['relevance_score']),                   # relevance
            1,                                              # count
            float(ts['ticker_sentiment_score']) * float(ts['relevance_score'])  # sentiment * relevance
        )
    )
    for ts in row.get('ticker_sentiment', [])
    if ts.get('ticker') in ['AAPL', 'GOOG', 'BA', 'NVDA', 'O', 'TSLA']
])
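
To make the shape of the emitted pairs concrete, here is a plain-Python sketch of the same mapping logic applied to a single record. The function name `to_pairs` and the sample record, including its values and the `MSFT` entry, are made up for illustration; only the field names come from the cell above.

```python
TICKERS = {"AAPL", "GOOG", "BA", "NVDA", "O", "TSLA"}

def to_pairs(row):
    """Emit ((date, ticker), (sentiment, relevance, 1, sentiment * relevance)) pairs."""
    pairs = []
    for ts in row.get("ticker_sentiment", []):
        if ts.get("ticker") in TICKERS:
            sentiment = float(ts["ticker_sentiment_score"])
            relevance = float(ts["relevance_score"])
            key = (row["time_published"][:8], ts["ticker"])  # first 8 chars: YYYYMMDD
            pairs.append((key, (sentiment, relevance, 1, sentiment * relevance)))
    return pairs

# Hypothetical article record with one tracked and one untracked ticker
sample = {
    "time_published": "20240122T093000",
    "ticker_sentiment": [
        {"ticker": "GOOG", "ticker_sentiment_score": "0.2", "relevance_score": "0.5"},
        {"ticker": "MSFT", "ticker_sentiment_score": "0.4", "relevance_score": "0.9"},
    ],
}

pairs = to_pairs(sample)
print(pairs)  # [(('20240122', 'GOOG'), (0.2, 0.5, 1, 0.1))]
```

A named function like this could also be passed to `flatMap` in place of the lambda, which tends to be easier to test in isolation.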

## Reduce Operation

Reduce the key-value pairs: sum the values for each `(date, ticker)` key, then derive the averages from the sums.

In [20]:
# Reduce: sum sentiment, relevance, count, and sentiment * relevance per key
reduced_rdd = mapped_rdd.reduceByKey(
    lambda a, b: (
        a[0] + b[0],  # sentiment sum
        a[1] + b[1],  # relevance sum
        a[2] + b[2],   # count
        a[3] + b[3]   # sum of (sentiment * relevance)
    )
)

# Compute averages from the per-key sums
final_rdd = reduced_rdd.mapValues(lambda x: (
    x[0] / x[2],  # avg sentiment
    x[1] / x[2],  # avg relevance
    x[3] / x[1] if x[1] != 0 else 0  # relevance-weighted sentiment (0 if total relevance is 0)
))
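
The arithmetic behind the reduce step can be checked without a cluster. The sketch below uses `functools.reduce` as a local stand-in for `reduceByKey` on the values of a single key; the two input tuples are made-up numbers. The weighted sentiment is the sum of `sentiment * relevance` divided by the sum of relevance.

```python
from functools import reduce

def combine(a, b):
    """Element-wise sum of two (sentiment, relevance, count, sentiment * relevance) tuples."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3])

# Hypothetical values for one (date, ticker) key
values = [
    (0.5, 0.25, 1, 0.5 * 0.25),
    (-0.25, 0.75, 1, -0.25 * 0.75),
]

sent_sum, rel_sum, count, weighted_sum = reduce(combine, values)

avg_sentiment = sent_sum / count                           # (0.5 - 0.25) / 2
avg_relevance = rel_sum / count                            # (0.25 + 0.75) / 2
weighted = weighted_sum / rel_sum if rel_sum != 0 else 0   # (0.125 - 0.1875) / 1.0

print(avg_sentiment, avg_relevance, weighted)  # 0.125 0.5 -0.0625
```

Because the combiner is a plain element-wise sum, it is associative and commutative, which `reduceByKey` requires since Spark may merge partial results in any order.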



## Collect Results

Spark evaluates transformations lazily, so the map and reduce steps above run only now, when an action such as `collect()` is called. Show what you calculated.

In [22]:
# Convert to flat records: (ticker, date, avg_sentiment, avg_relevance, weighted_sentiment)
formatted_rdd = final_rdd.map(lambda x: (
    x[0][1],  # ticker
    x[0][0],  # date
    x[1][0],  # avg sentiment
    x[1][1],  # avg relevance
    x[1][2]   # weighted sentiment
))

results = formatted_rdd.collect()

# Print a few rows
for row in results[:5]:
    print(row)


('GOOG', '20240122', 0.18401800000000001, 0.1609795, 0.16978489321388748)
('TSLA', '20240123', 0.08054454237288135, 0.29334440677966117, 0.10585794423602263)
('BA', '20240124', -0.06395, 0.097566, -0.05961403160937211)
('BA', '20240128', -0.1025755, 0.12291149999999999, -0.07014408278314072)
('NVDA', '20240129', 0.13267408695652175, 0.1503447391304348, 0.12140735311424843)
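
Lazy evaluation, mentioned at the start of this section, can be illustrated with a plain-Python analogy. This is only an analogy: Python's built-in `map` returns a lazy iterator, whereas Spark records a lineage of transformations and runs them on the cluster when an action is called. The names `traced_double` and `log` are illustrative.

```python
log = []

def traced_double(x):
    """Double x and record that the function actually ran."""
    log.append(x)
    return x * 2

lazy = map(traced_double, [1, 2, 3])  # defines the work, runs nothing yet
before = list(log)                    # still empty

result = list(lazy)                   # consuming the iterator forces evaluation
after = list(log)

print(before, result, after)  # [] [2, 4, 6] [1, 2, 3]
```

In the notebook, `flatMap`, `reduceByKey`, `mapValues`, and `map` play the role of the lazy definition, and `collect()` plays the role of `list(...)`.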


## Save Results

In [24]:
import csv

# Prepare your data: list of dicts with all 5 fields
csv_data = [
    {
        "ticker": t,
        "date": d,
        "avg_sentiment": s,
        "avg_relevance": r,
        "weighted_sentiment": w
    }
    for t, d, s, r, w in results
]

# Define output path and make sure its directory exists
output_csv_path = "output/ticker_by_day_sentiment_av.csv"
os.makedirs(os.path.dirname(output_csv_path), exist_ok=True)

# Write to CSV
with open(output_csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ticker", "date", "avg_sentiment", "avg_relevance", "weighted_sentiment"])
    writer.writeheader()
    writer.writerows(csv_data)
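
When the CSV is read back later, every value arrives as a string, so numeric columns need an explicit conversion. A small round-trip sketch, using an in-memory buffer instead of the real output file and a single made-up row:

```python
import csv
import io

fieldnames = ["ticker", "date", "avg_sentiment", "avg_relevance", "weighted_sentiment"]
rows = [
    {"ticker": "GOOG", "date": "20240122", "avg_sentiment": 0.25,
     "avg_relevance": 0.5, "weighted_sentiment": 0.125},
]

# Write to an in-memory buffer (stand-in for the file on disk)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read back and restore the numeric types
buffer.seek(0)
restored = [
    {
        "ticker": r["ticker"],
        "date": r["date"],
        "avg_sentiment": float(r["avg_sentiment"]),
        "avg_relevance": float(r["avg_relevance"]),
        "weighted_sentiment": float(r["weighted_sentiment"]),
    }
    for r in csv.DictReader(buffer)
]

print(restored == rows)  # True
```

The `date` column is deliberately kept as a string here, matching the `StringType` used for it elsewhere in the exercise.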


## Convert RDD to Spark DataFrame

In [25]:

# Define schema for DataFrame
from pyspark.sql.types import StructType, StructField, StringType, FloatType

schema = StructType([
    StructField("ticker", StringType(), True),
    StructField("date", StringType(), True),
    StructField("avg_sentiment", FloatType(), True),
    StructField("avg_relevance", FloatType(), True),
    StructField("weighted_sentiment", FloatType(), True)
])

# Create DataFrame
df = spark.createDataFrame(formatted_rdd, schema)


In [26]:
summary_df = df.toPandas()

In [27]:
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
import pandas as pd

# InfluxDB setup
influx_client = InfluxDBClient(
    url=INFLUXDB_URL,
    token=INFLUXDB_TOKEN,
    org=INFLUXDB_ORG
)
write_api = influx_client.write_api(write_options=SYNCHRONOUS)

# Write each row to Influx
for _, row in summary_df.iterrows():
    point = (
        Point("sentiment_alphavantage")
        .tag("ticker", row["ticker"])
        .tag("aggregation", "daily")
        .field("avg_sentiment", row["avg_sentiment"])
        .field("avg_relevance", row["avg_relevance"])
        .field("weighted_sentiment", row["weighted_sentiment"])
        .time(pd.to_datetime(row["date"]), WritePrecision.S)
    )
    write_api.write(bucket=INFLUXDB_BUCKET, org=INFLUXDB_ORG, record=point)

# Release the client's resources once all points are written
influx_client.close()

print("✅ All sentiment points written to InfluxDB.")

✅ All sentiment points written to InfluxDB.
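
The timestamps above come from compact `YYYYMMDD` strings. A sketch of parsing them with an explicit format, which fails loudly on malformed input instead of guessing. Treating each day as midnight UTC is an assumption made here, and `parse_day` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_day(day):
    """Parse a 'YYYYMMDD' string into a timezone-aware datetime (midnight UTC, by assumption)."""
    return datetime.strptime(day, "%Y%m%d").replace(tzinfo=timezone.utc)

ts = parse_day("20240122")
print(ts.isoformat())  # 2024-01-22T00:00:00+00:00

# Malformed input raises ValueError rather than producing a wrong date
try:
    parse_day("2024-01-22")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

A `datetime` produced this way could be used as the point timestamp in place of the `pd.to_datetime(...)` call, if an explicit format is preferred.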


In [None]:
spark.stop()
