# PySpark DataFrame Visualization Demo

Use the helpers in `spark_fuse.utils.visualization` to quickly explore PySpark `DataFrame` objects. Each example below samples a manageable amount of data, converts it to pandas, and renders with `matplotlib`.

> **Note:** Make sure `matplotlib` is installed in your environment (e.g., `pip install matplotlib`).

In [None]:
%matplotlib inline

In [None]:
from spark_fuse.spark import create_session
from spark_fuse.utils.visualization import (
    plot_histogram,
    plot_scatter,
    plot_line,
    plot_bar,
)

In [None]:
spark = create_session(app_name="spark-fuse-visualization-demo")
spark

## Sample dataset
Generate a small sales dataset with date, region, order count, and revenue columns: nine daily rows that cycle through three regions.

In [None]:
from pyspark.sql import functions as F

data = [
    ("2024-01-01", "North", 120, 1200.0),
    ("2024-01-02", "South", 95, 1025.0),
    ("2024-01-03", "West", 80, 875.0),

    ("2024-01-04", "North", 150, 1600.0),
    ("2024-01-05", "South", 110, 1190.0),
    ("2024-01-06", "West", 70, 780.0),

    ("2024-01-07", "North", 140, 1550.0),
    ("2024-01-08", "South", 100, 1080.0),
    ("2024-01-09", "West", 90, 950.0),
]

sales_df = spark.createDataFrame(data, ['date_str', 'region', 'orders', 'revenue'])
sales_df = sales_df.withColumn('date', F.to_date('date_str')).drop('date_str')
sales_df = sales_df.orderBy('date')
sales_df

In [None]:
sales_df.show()

## Histogram: order distribution
Use `plot_histogram` to inspect the spread of order volume.

In [None]:
plot_histogram(sales_df, column='orders', bins=6)
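To make the `bins=6` argument concrete, the sketch below reproduces equal-width binning on the nine `orders` values in plain Python. It illustrates the general technique only and is not the code `plot_histogram` runs. The helper name `equal_width_counts` is hypothetical, and the library's own edge handling may differ.

```python
# Hypothetical helper illustrating equal-width binning; not part of spark_fuse.
def equal_width_counts(values, bins):
    """Count how many integer values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: every value is identical, so use the first bin.
        return [len(values)] + [0] * (bins - 1)
    counts = [0] * bins
    for v in values:
        # Integer arithmetic (valid here because the values are ints) avoids
        # floating-point surprises at bin edges. The maximum value is folded
        # into the last bin rather than opening a new one.
        index = min((v - lo) * bins // (hi - lo), bins - 1)
        counts[index] += 1
    return counts


# The `orders` column from the sample dataset above.
orders = [120, 95, 80, 150, 110, 70, 140, 100, 90]
counts = equal_width_counts(orders, bins=6)
print(counts)  # [2, 2, 1, 2, 0, 2]
```

With a range of 70 to 150 and six bins, each bin spans roughly 13.3 orders. The empty fifth bin covers about 123 to 137, a gap that no sample row falls into.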

## Scatter plot: orders vs. revenue by region
Color-coding makes it easy to contrast patterns between regions.

In [None]:
plot_scatter(
    sales_df,
    x_col='orders',
    y_col='revenue',
    color_col='region',
    legend=True,
)
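One common way to color a scatter plot by category is to split the points into one series per distinct value and draw each series in its own color. The plain-Python sketch below shows that grouping step on the sample rows. Whether `plot_scatter` does this internally is an assumption; the variable names are illustrative only.

```python
from collections import defaultdict

# (region, orders, revenue) triples copied from the sample dataset above.
rows = [
    ("North", 120, 1200.0), ("South", 95, 1025.0), ("West", 80, 875.0),
    ("North", 150, 1600.0), ("South", 110, 1190.0), ("West", 70, 780.0),
    ("North", 140, 1550.0), ("South", 100, 1080.0), ("West", 90, 950.0),
]

# Group the (x, y) points by the category that would drive the color.
points_by_region = defaultdict(list)
for region, orders, revenue in rows:
    points_by_region[region].append((orders, revenue))

# Each group would become one colored series with its own legend entry.
for region, points in sorted(points_by_region.items()):
    print(region, points)
```

Each of the three regions ends up with three points, which is what the legend entries in the rendered chart should correspond to.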

## Line plot: revenue trend
Pass `order_by='date'` so the rows are sorted chronologically before revenue is drawn over time.

In [None]:
plot_line(
    sales_df,
    x_col='date',
    y_col='revenue',
    order_by='date',
)
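Sorting matters because a line plot connects points in the order it receives them, so out-of-order rows produce a zigzag instead of a trend. The sketch below shows the effect of sorting on three of the sample rows in plain Python. It stands in for what `order_by` is presumed to do and is not taken from the library.

```python
from datetime import date

# Three (date, revenue) pairs from the sample dataset, deliberately shuffled.
rows = [
    (date(2024, 1, 3), 875.0),
    (date(2024, 1, 1), 1200.0),
    (date(2024, 1, 2), 1025.0),
]

# Sort on the date so the x-axis values increase from left to right.
ordered = sorted(rows, key=lambda row: row[0])
x_values = [d for d, _ in ordered]
y_values = [revenue for _, revenue in ordered]

print(y_values)  # [1200.0, 1025.0, 875.0]
```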

## Bar chart: total revenue per region
Use `plot_bar` with `agg_func='sum'` to total revenue for each region and draw one bar per region.

In [None]:
plot_bar(
    sales_df,
    category_col='region',
    value_col='revenue',
    agg_func='sum',
)
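As a sanity check on the bar heights, the per-region totals can be recomputed by hand. The sketch below sums revenue per region in plain Python using the same nine sample rows. It is independent of `plot_bar` and only shows what a `sum` aggregation over `region` should yield.

```python
from collections import defaultdict

# (region, revenue) pairs copied from the sample dataset above.
rows = [
    ("North", 1200.0), ("South", 1025.0), ("West", 875.0),
    ("North", 1600.0), ("South", 1190.0), ("West", 780.0),
    ("North", 1550.0), ("South", 1080.0), ("West", 950.0),
]

# Accumulate revenue per region, mirroring a group-by followed by a sum.
totals = defaultdict(float)
for region, revenue in rows:
    totals[region] += revenue

for region in sorted(totals):
    print(region, totals[region])
# North 4350.0
# South 3295.0
# West 2605.0
```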

In [None]:
spark.stop()