# Introduction to programming with Pathway

[Pathway](https://pathway.com/developers/user-guide/introduction/welcome) is a Python data processing framework for analytics and AI pipelines over data streams. It is well suited to real-time processing use cases such as streaming ETL or RAG pipelines over unstructured data.

This Jupyter notebook is an introduction to Pathway.
You will learn how to develop a simple pipeline over a data stream.

## Installing Pathway with Python 3.10+

First, you need to install Pathway.
The cell below installs it with pip into a Python 3.10+ Linux runtime.

> **If you are running in Google Colab, please run the Colab notebook (Ctrl+F9)**, disregarding the 'not authored by Google' warning.
>
> **The installation and loading time is less than 1 minute**.


In [None]:
%%capture --no-display
!pip install pathway

## A moving average with Pathway
In this Jupyter notebook, you will learn how to connect to a data source, design a data pipeline, and run it over both static and streaming data.

This notebook shows you how to compute a moving average over a data stream.
You will use Pathway, [Bokeh](https://bokeh.org/) and [Panel](https://panel.holoviz.org/) to build a real-time data visualisation.
You can run the code for yourself [in Colab]() or [from GitHub]().

Because Pathway [unites static and stream processing](https://pathway.com/developers/user-guide/connecting-to-data/switch-from-batch-to-streaming/#switching-from-batch-to-streaming) in a single syntax, you can use the exact same code for both batch and streaming. This way, you can test your workflow with a static snapshot and then simply swap the data source for a live stream when you want to visualise the full data flow in real time.

Let's jump in! 🪂

## Imports and setup

First, import the necessary libraries:

- [`pathway`](/developers/user-guide/introduction/welcome/) for data processing,
- `datetime` for date manipulation,
- `panel` and `bokeh` for dashboarding.

In [None]:
import datetime

import bokeh.plotting
import panel

import pathway as pw

import logging
logging.basicConfig(level=logging.CRITICAL, force=True)

Then fetch the sample data from GitHub. This data was generated using [polygon.io](https://polygon.io) and replicates a financial data stream.

In [None]:
%%capture --no-display
!wget -nc https://gist.githubusercontent.com/janchorowski/e351af72ecd8d206a34763a428826ab7/raw/ticker.csv

## Data source setup

In Pathway, the data is accessed using [connectors](https://pathway.com/developers/user-guide/connecting-to-data/connectors).
You need to specify the [schema](https://pathway.com/developers/user-guide/types-in-pathway/schema) of the table.
Since the data is stored in a CSV file, you can [automatically extract the schema](https://pathway.com/developers/user-guide/types-in-pathway/schema-generation):

In [None]:
fname = "ticker.csv"
schema = pw.schema_from_csv(fname)
print(schema)


First, you are going to use a static sample of the data to develop and test your moving average.
Let's extract the first ten lines of the file (the header plus nine rows of data):

In [None]:
!head -n 10 ticker.csv > ticker_head.csv

You can now load the data using the [CSV connector](https://pathway.com/developers/user-guide/connectors/csv_connectors) with the mode set to "static" as you want to access static data and not a data stream:

In [None]:
static_data = pw.io.csv.read("ticker_head.csv", schema=schema, mode="static")

You can print the data using [`pw.debug.compute_and_print`](https://pathway.com/developers/api-docs/debug#pathway.debug.compute_and_print).
Note that this function only works in the static mode and should only be used in a debugging scenario. To serve your data out of Pathway, you should use an [output connector](https://pathway.com/developers/user-guide/introduction/concepts#send-the-results-to-external-systems-using-output-connectors).

In [None]:
pw.debug.compute_and_print(static_data)

Perfect, our data sample is read!

Now, let's define a one-minute moving average.
You will compute the 1-minute running mean of the Volume Weighted Average Price (vwap), weighted by the volume of transactions. You don't need to fully understand the financial terminology. All you need to know is that you are going to compute the mean of the price over a minute. In each row, the column `vwap` holds the average price over the transactions counted in the column `volume`, so the mean over a minute weights each `vwap` by its `volume`.

First, you need to convert the time, stored in the column `t` as a timestamp in milliseconds, to a UTC datetime.
Then, you can compute the average over each minute using a [tumbling window](https://pathway.com/developers/user-guide/temporal-data/windows-manual#temporal-tumbling-windowing).

In [None]:
def pipeline(input_data):
    # Parse the millisecond timestamps in column `t` into UTC datetimes
    data = input_data.with_columns(t=input_data.t.dt.utc_from_timestamp(unit="ms"))
    # Compute the one-minute average, separately for each ticker
    minute_1_stats = (
        data.windowby(
            pw.this.t,
            window=pw.temporal.tumbling(datetime.timedelta(minutes=1)),
            behavior=pw.temporal.exactly_once_behavior(),
            instance=pw.this.ticker,
        )
        .reduce(
            ticker=pw.this._pw_instance,
            t=pw.this._pw_window_end,
            volume=pw.reducers.sum(pw.this.volume),
            transact_total=pw.reducers.sum(pw.this.volume * pw.this.vwap),
        )
        # Volume-weighted average price over the window
        .with_columns(vwap=pw.this.transact_total / pw.this.volume)
    )
    return minute_1_stats

In [None]:
static_average = pipeline(static_data)
pw.debug.compute_and_print(static_average)
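
To make the arithmetic of the pipeline above concrete, here is a minimal plain-Python sketch of the same aggregation. It is not how Pathway implements windows: it assumes windows aligned to whole minutes, ignores the `exactly_once_behavior` setting, and uses made-up rows with a hypothetical `AAPL` ticker rather than values from `ticker.csv`.

```python
import datetime
from collections import defaultdict

UTC = datetime.timezone.utc
WINDOW = datetime.timedelta(minutes=1)


def minute_vwap(rows):
    """Volume-weighted average price per (ticker, window end).

    Each row is a dict with `ticker`, `t` (epoch milliseconds),
    `volume` and `vwap`, mirroring the columns used in the pipeline.
    """
    # (ticker, window_end) -> [total volume, total volume * price]
    totals = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        # Same conversion as utc_from_timestamp(unit="ms")
        ts = datetime.datetime.fromtimestamp(row["t"] / 1000, tz=UTC)
        # Assumption: tumbling windows start on whole minutes
        window_start = ts.replace(second=0, microsecond=0)
        key = (row["ticker"], window_start + WINDOW)
        totals[key][0] += row["volume"]
        totals[key][1] += row["volume"] * row["vwap"]
    return {key: paid / volume for key, (volume, paid) in totals.items()}


# Made-up sample rows: two trades in the first minute, one in the second
rows = [
    {"ticker": "AAPL", "t": 0, "volume": 100, "vwap": 10.0},
    {"ticker": "AAPL", "t": 30_000, "volume": 300, "vwap": 20.0},
    {"ticker": "AAPL", "t": 60_000, "volume": 50, "vwap": 30.0},
]
result = minute_vwap(rows)
for (ticker, window_end), price in sorted(result.items()):
    print(ticker, window_end.isoformat(), price)
```

In the first window, the result is (100 × 10 + 300 × 20) / 400 = 17.5, which is closer to 20 than to 10 because the second trade carries three times the volume. This is the same weighting that `transact_total / volume` performs in the pipeline.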

## Switching between static and stream processing

Thanks to the static data sample, you have developed and tested a data pipeline.
Now, let's run it on live data!

For simplicity, you will create a streaming data source that replays data in a CSV file using the [demo module](https://pathway.com/developers/user-guide/connecting-to-data/artificial-streams/). This is an easy way to simulate a live data stream without any infrastructure hassle. You can of course use Pathway with a real, production-grade data stream, for example [from Kafka](https://pathway.com/developers/user-guide/exploring-pathway/from-jupyter-to-deploy/#kafka-integration) or [Redpanda](https://pathway.com/developers/tutorials/connectors/switching-to-redpanda/).

The `input_rate` parameter controls how fast the data is replayed.
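
As a rough mental model of rate-controlled replay, the sketch below emits rows one at a time and pauses between them. It assumes that `input_rate` means "rows per second"; the `replay` helper and its injectable `sleep` argument are hypothetical names for this illustration and are not part of Pathway.

```python
import time


def replay(rows, input_rate, sleep=time.sleep):
    """Yield rows one by one, pausing so that roughly
    `input_rate` rows are emitted per second."""
    delay = 1.0 / input_rate
    for row in rows:
        yield row
        sleep(delay)


# Record the pauses instead of actually sleeping, to keep the demo instant
pauses = []
replayed = list(
    replay(["row-1", "row-2", "row-3"], input_rate=1000, sleep=pauses.append)
)
print(replayed)
print(pauses)
```

With `input_rate=1000`, each pause is one millisecond, so a file with a few thousand rows would be replayed in a few seconds under this model.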

Then you can apply the same pipeline to the new data source.
This is the only change you need to make in your code to switch between static and stream processing. Learn more about Pathway's unified engine and single syntax in [our User Guide](https://pathway.com/developers/user-guide/getting-started/switch-from-batch-to-streaming#switching-from-batch-to-streaming).


💡 No data processing actually happens when you run the cell below. We are building a computational graph that will only be executed at the end of the notebook. This allows Pathway to optimise the computations and perform them as fast as possible when the data starts streaming.

In [None]:
live_data = pw.demo.replay_csv(fname, schema=schema, input_rate=1000)

streaming_average = pipeline(live_data)
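
The idea of building a computational graph first and executing it later can be illustrated with a toy example in plain Python. This is only a sketch of deferred execution in general: the `LazyPipeline` class is hypothetical and says nothing about how Pathway's engine is actually built.

```python
class LazyPipeline:
    """Records transformation steps; nothing is computed until run()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new description with one more step; no data is touched here
        return LazyPipeline(self._source, self._steps + (fn,))

    def run(self):
        # Only now is each item pulled through every recorded step
        results = []
        for item in self._source:
            for step in self._steps:
                item = step(item)
            results.append(item)
        return results


calls = []


def double(x):
    calls.append(x)  # track when the step is really executed
    return x * 2


plan = LazyPipeline([1, 2, 3]).map(double).map(lambda x: x + 1)
calls_before_run = list(calls)  # still empty: only the plan exists
output = plan.run()
print(calls_before_run, output)
```

Defining `plan` plays the role of the cell above, and `plan.run()` plays the role of the final `pw.run()`: the steps are described once and executed only when the run is triggered.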

## Dashboard creation

Now create a `Bokeh` plot and embed it in a `Panel` row: the plot shows the 1-minute moving average of the price (`vwap`) over time (`t`).

When the cell is executed, a placeholder container is created for the plot. It will be populated with live data when the computation is started (i.e. when running the final `pw.run()` at the end of this notebook).

In [None]:
def stats_plotter(src):

    fig = bokeh.plotting.figure(
        height=400,
        width=600,
        title="1-minute moving average",
        x_axis_type="datetime",
        y_range=(188.5, 191),
    )
    fig.line("t", "vwap", source=src)
    return fig


viz = panel.Row(
    streaming_average.plot(stats_plotter, sorting_col="t"),
)
viz

## Running the computation

All the hard work is done! The final step is to start the Pathway data processing engine using the `pw.run()` command.
**Do not scroll down once the cell is executed. Scroll up to see the animation.**

In [None]:
pw.run()

Watch how the dashboard is updated in real time!

While the computation is running, `pathway` prints important statistics such as message processing latency.

You can do more advanced computations, such as [Bollinger Bands](https://pathway.com/developers/showcases/live_data_jupyter).

## What else can you do with Pathway?

Congratulations! You have developed your first streaming pipeline with Pathway. Easy, right?

* Perform machine learning in real time, e.g. [real-time classification](/developers/showcases/lsh/lsh_chapter1/) or [real-time fuzzy joins](/developers/showcases/fuzzy_join/fuzzy_join_chapter2/)

* Transform unstructured data to structured data using [live LLM pipelines](https://github.com/pathwaycom/llm-app)

* Make [joins](/developers/tutorials/fleet_eta_interval_join/) simple with time-series data

And so much more... Read more about what you can do in the [developer docs](/developers/user-guide/introduction/welcome/).

We would love to have you try out [Pathway on GitHub](https://github.com/pathwaycom/pathway).