# Synthetic Dataset Generator

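This notebook generates a synthetic dataset with PySpark. Spark splits the row range into partitions and processes them in parallel, so the full dataset never has to fit in one machine's memory. The dataset has two numeric columns: one uniformly distributed and one normally distributed.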

## 1. Import Required Libraries

First, import the PySpark modules needed to generate and aggregate the data, then start a Spark session.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import warnings
warnings.filterwarnings("ignore")

# Initialize Spark Session
spark = (
    SparkSession.builder
    .appName("Synthetic Dataset Generator")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

## 2. Implement Chunked DataFrame Generator

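Next, build the DataFrame. `spark.range` splits the row range into partitions, which act as the chunks: Spark generates and processes each partition independently, so no single task has to hold the whole dataset in memory.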
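To get a feel for the chunk size, the arithmetic can be sketched in plain Python. The row count and partition count are the ones passed to `spark.range` in this notebook; the 16 bytes per row is an assumption (two 8-byte doubles, ignoring Spark's own overhead and any extra columns), so treat the sizes as rough estimates only.

```python
# Back-of-the-envelope partition sizing for spark.range(0, 20_000_000_000, 1, 16_000).
TOTAL_ROWS = 20_000_000_000   # rows requested from spark.range
NUM_PARTITIONS = 16_000       # numPartitions argument
BYTES_PER_ROW = 2 * 8         # assumption: two 8-byte doubles, no overhead

rows_per_partition = TOTAL_ROWS // NUM_PARTITIONS
partition_mb = rows_per_partition * BYTES_PER_ROW / 1_000_000
total_gb = TOTAL_ROWS * BYTES_PER_ROW / 1_000_000_000

print(f"rows per partition: {rows_per_partition:,}")
print(f"approx. size per partition: {partition_mb:.1f} MB")
print(f"approx. total size: {total_gb:.0f} GB")
```

Under these assumptions each task handles 1.25 million rows (about 20 MB), while the full dataset would be on the order of 320 GB if it were ever materialized at once.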

In [None]:
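# Use a different seed per column so the two random streams are decorrelated
seed = 42

# Create a Spark DataFrame with synthetic columns using built-in SQL functions.
# spark.range yields a single LongType column named "id"; it only drives row
# generation, so the select below keeps just the two synthetic columns.
df = (
    spark
    .range(0, 20_000_000_000, 1, 16_000)  # 20 billion rows in 16,000 partitions
    .select(
        (F.rand(seed) * 1000).alias("value_uniform"),           # uniform on [0, 1000)
        (F.randn(seed + 1) * 15 + 100).alias("value_normal"),   # normal, mean 100, std 15
    )
)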
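The reason for giving each column its own seed can be illustrated with a small stand-in. The sketch below uses Python's standard `random` module, which is not the generator Spark uses; it only shows the general property that a seeded generator replays the same stream, so two columns drawn from identical seeds would be perfectly correlated. The `draw` and `pearson` helpers are hypothetical names introduced for this illustration.

```python
import random


def draw(seed, n):
    """Return n uniform samples from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


n = 10_000
same_seed_corr = pearson(draw(42, n), draw(42, n))  # identical streams
diff_seed_corr = pearson(draw(42, n), draw(43, n))  # independent-looking streams

print(f"same seed:      r = {same_seed_corr:.3f}")
print(f"different seed: r = {diff_seed_corr:.3f}")
```

With the same seed the correlation is 1 up to rounding, while different seeds should give a value close to 0.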

In [None]:
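# show() prints the result and returns None, so keep the aggregated DataFrame
# in a variable and display it separately. Computing both means in one agg()
# call also scans the data once instead of twice.
res = df.agg(
    F.mean("value_uniform").alias("mean_uniform"),  # expected to be close to 500
    F.mean("value_normal").alias("mean_normal"),    # expected to be close to 100
)

res.show()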
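As a sanity check on the means reported above, the expected values follow directly from the column definitions: `rand() * 1000` is uniform on [0, 1000) and `randn() * 15 + 100` is normal with mean 100 and standard deviation 15. The sketch below computes those theoretical moments and compares them with a small local sample. The sample uses Python's `random` module as a stand-in for Spark's generator, and the sample size of 200,000 is an arbitrary choice for illustration.

```python
import math
import random
import statistics

# Theoretical moments of the two columns as defined in this notebook.
uniform_mean = 1000 / 2              # uniform on [0, 1000)
uniform_std = 1000 / math.sqrt(12)   # std of a uniform distribution of width 1000
normal_mean, normal_std = 100, 15    # randn() * 15 + 100

# Small local stand-in sample (not Spark's generator).
rng = random.Random(42)
n = 200_000
uniform_sample = [rng.random() * 1000 for _ in range(n)]
normal_sample = [rng.gauss(0, 1) * 15 + 100 for _ in range(n)]

sample_uniform_mean = statistics.fmean(uniform_sample)
sample_normal_mean = statistics.fmean(normal_sample)

# Standard error of the mean at the notebook's full size of 20 billion rows.
full_rows = 20_000_000_000
se_uniform = uniform_std / math.sqrt(full_rows)
se_normal = normal_std / math.sqrt(full_rows)

print(f"uniform: expected {uniform_mean}, sample {sample_uniform_mean:.2f}")
print(f"normal:  expected {normal_mean}, sample {sample_normal_mean:.2f}")
print(f"standard error at full size: {se_uniform:.5f} (uniform), {se_normal:.5f} (normal)")
```

At 20 billion rows the standard error of each mean is tiny, so a Spark result that differs visibly from 500 or 100 would point to a problem in the generation step rather than to sampling noise.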

## 3. Clean Up

Stop the Spark session to release its resources.

In [None]:
# Stop the Spark session
spark.stop()
print("Spark session stopped successfully")