# Time Series Analysis with PySpark

This notebook shows how to load, explore, and analyze time series data with PySpark. It is designed to run locally in the DevContainer-based development environment.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, avg, weekofyear, year, month
import matplotlib.pyplot as plt
import pandas as pd

# Initialize Spark session
spark = SparkSession.builder.appName("TimeSeriesAnalysis").getOrCreate()


## Load CSV Time Series Data

In [None]:
# Update this path to match your repository layout
data_path = "data/input/project/raw_time_series/csv/sample_timeseries.csv"

# Load CSV
df = spark.read.option("header", True).option("inferSchema", True).csv(data_path)

# Show schema and sample data
df.printSchema()
df.show(5)
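
If the sample file is not available, a small synthetic CSV with the two columns the later cells rely on (`date` in `yyyy-MM-dd` form and a numeric `value`) should be enough to run the notebook end to end. The sketch below is a stand-in, not the project's real data: the helper name `write_sample_csv`, the start date, the row count, and the value formula are all assumptions made for illustration.

```python
import csv
import math
import tempfile
from datetime import date, timedelta
from pathlib import Path


def write_sample_csv(path, start=date(2024, 1, 1), days=90):
    """Write `days` daily rows with columns date,value (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "value"])
        for i in range(days):
            day = start + timedelta(days=i)
            # Made-up signal: linear trend plus a weekly cycle
            value = round(100 + 0.5 * i + 10 * math.sin(2 * math.pi * i / 7), 2)
            writer.writerow([day.isoformat(), value])
    return path


# Written to a temporary directory here; pass data_path instead to feed the notebook
sample_path = write_sample_csv(Path(tempfile.mkdtemp()) / "sample_timeseries.csv")
print(sample_path)
```

Writing to a temporary directory keeps the sketch from overwriting a real input file by accident.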


## Clean and Prepare Data

In [None]:
# Convert string date column to proper DateType
df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))

# Drop rows with nulls in critical fields
df = df.dropna(subset=["date", "value"])

# Show cleaned data
df.show(5)


## Weekly and Monthly Aggregations

In [None]:
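# Group by the Monday that starts each week; pairing year() with weekofyear()
# misplaces days near January 1, because weekofyear() returns the ISO week number
from pyspark.sql.functions import date_trunc
df_weekly = df.withColumn("week_start", to_date(date_trunc("week", col("date")))) \
    .groupBy("week_start").agg(avg("value").alias("weekly_avg")) \
    .orderBy("week_start")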

df_monthly = df.withColumn("month", month("date")).withColumn("year", year("date")) \
    .groupBy("year", "month").agg(avg("value").alias("monthly_avg")) \
    .orderBy("year", "month")

df_weekly.show(5)
df_monthly.show(5)
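
One detail worth checking in the weekly grouping: Spark's `weekofyear` follows ISO 8601 week numbering (weeks start on Monday, and week 1 is the first week with more than 3 days), while `year` returns the calendar year, so the two can disagree for days near January 1. The plain-Python sketch below shows the same effect with `datetime.date.isocalendar()`, along with a Monday-based bucket key that avoids it. The helper name `week_start` is hypothetical.

```python
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the week containing d (hypothetical helper)."""
    return d - timedelta(days=d.weekday())


for d in (date(2020, 12, 31), date(2021, 1, 1), date(2021, 1, 4)):
    iso_year, iso_week, _ = d.isocalendar()
    print(d, "calendar year:", d.year, "ISO year/week:", iso_year, iso_week,
          "week start:", week_start(d))
```

Here 2021-01-01 has calendar year 2021 but ISO week 53, so a (year, week) key of (2021, 53) would sort it after the rest of 2021. The Monday-based key keeps it in the same bucket as 2020-12-31.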


## Visualize Time Series Trends

In [None]:
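# Convert to pandas for plotting; toPandas() collects every row to the driver,
# so this is only suitable for datasets that fit in driver memory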
pdf = df.select("date", "value").orderBy("date").toPandas()

# Plot
plt.figure(figsize=(12, 6))
plt.plot(pdf["date"], pdf["value"], marker='o')
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Daily Time Series")
plt.grid(True)
plt.tight_layout()
plt.show()


## Conclusion

This notebook loaded a time series dataset with PySpark, cleaned it, computed weekly and monthly averages, and plotted the daily values with Matplotlib.