# Time Series Analysis with PySpark

This notebook demonstrates how to load, explore, and analyze time series data using PySpark. It is designed to run in the local development environment using the DevContainer setup.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, avg, weekofyear, year, month
import pandas as pd

# Initialize Spark session
spark = SparkSession.builder.appName("TimeSeriesAnalysis").getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/04 06:57:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Load Parquet Time Series Data

In [None]:
# Update path as per your repo structure
data_path = "data/input/project/raw_time_series/parquet"

# Load Parquet (the schema is stored in the files, so the CSV-only
# "header" and "inferSchema" options are not needed)
df = spark.read.parquet(data_path)

# Show schema and sample data
df.printSchema()
df.show(5)


root
 |-- contract_id: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- value: double (nullable = true)
 |-- value_source: string (nullable = true)
 |-- annotations: string (nullable = true)

+-------------------+-------------------+-------------------+------------+--------------------+
|        contract_id|          timestamp|              value|value_source|         annotations|
+-------------------+-------------------+-------------------+------------+--------------------+
| 04_02_111 _ CHR12 |2023-01-01 06:00:00|0.02591860654732236| measurement|{"region":"Europe...|
| 04 _02_111 _CHR12 |2023-01-01 17:00:00|0.07385444264936832| measurement|{"region":"Europe...|
| 04_02_111 _ CHR12 |2023-01-01 17:30:22|0.08180149515221906| measurement|{"region":"Europe...|
| 04 _02_111 _CHR12 |2023-01-01 21:30:00|0.08670661371854547| measurement|{"region":"Europe...|
|04 _ 02 _111_CHR12 |2023-01-02 00:30:00|0.03597601881331959| measurement|{"region":"Europe...|
+-------------------+-------------------+-------------------+------------+--------------------+

## Clean and Prepare Data

In [None]:
# The timestamp column is already TimestampType (see the schema above);
# cast it to DateType to keep only the calendar date
df = df.withColumn("timestamp", col("timestamp").cast("date"))

# Drop rows with nulls in critical fields
df = df.dropna(subset=["timestamp", "value"])

# Show cleaned data
df.show(5)


+-------------------+----------+-------------------+------------+--------------------+
|        contract_id| timestamp|              value|value_source|         annotations|
+-------------------+----------+-------------------+------------+--------------------+
| 04_02_111 _ CHR12 |2023-01-01|0.02591860654732236| measurement|{"region":"Europe...|
| 04 _02_111 _CHR12 |2023-01-01|0.07385444264936832| measurement|{"region":"Europe...|
| 04_02_111 _ CHR12 |2023-01-01|0.08180149515221906| measurement|{"region":"Europe...|
| 04 _02_111 _CHR12 |2023-01-01|0.08670661371854547| measurement|{"region":"Europe...|
|04 _ 02 _111_CHR12 |2023-01-02|0.03597601881331959| measurement|{"region":"Europe...|
+-------------------+----------+-------------------+------------+--------------------+
only showing top 5 rows
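
The sample rows above show `contract_id` values that differ only in where spaces appear around the underscores. A minimal sketch of one way to normalize them follows. It assumes the embedded whitespace is noise and that these rows refer to the same contract, which the notebook does not state. `normalize_contract_id` is a hypothetical helper introduced here for illustration.

```python
import re


def normalize_contract_id(raw: str) -> str:
    """Remove all whitespace (leading, trailing, and embedded) from a contract ID."""
    return re.sub(r"\s+", "", raw)


# Values copied from the sample output above
samples = [
    " 04_02_111 _ CHR12 ",
    " 04 _02_111 _CHR12 ",
    "04 _ 02 _111_CHR12 ",
]

# Assumption: all three spellings refer to the same contract
normalized = {normalize_contract_id(s) for s in samples}
print(normalized)
```

In Spark, the same pattern could be applied to the column with `regexp_replace(col("contract_id"), r"\s+", "")` from `pyspark.sql.functions`, which keeps the work on the executors instead of in a Python UDF.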

