# Real-time Bitcoin Data Processing with the PySpark API
**Difficulty:** 3 (difficult)

This notebook demonstrates how to use the `bitcoin_utils` module to ingest and process real-time Bitcoin price data with PySpark.

## Technology Overview

- **PySpark**: Python API for Apache Spark, enabling distributed data processing using RDDs, DataFrames, and high-level APIs.
- **Resilient Distributed Datasets (RDDs)**: Core data structure providing fault tolerance and parallel operations.
- **DataFrames**: Tabular data abstraction with a schema and SQL querying capabilities. The typed Dataset API is available only in Scala and Java, so PySpark code works with DataFrames.
- **Spark Streaming**: Real-time data ingestion and processing framework.
- **MLlib**: Spark's scalable machine learning library for tasks like regression, classification, and clustering.

## Setup
Load the `autoreload` extension so that edits to `bitcoin_utils` are picked up without restarting the kernel, and enable inline Matplotlib plots.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

## 1. Initialize Spark Session

In [23]:
from bitcoin_utils import initialize_spark_session

initialize_spark_session()
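
The body of `initialize_spark_session` is not shown in this notebook. As a rough sketch only, a helper of this kind usually builds or reuses a local `SparkSession`. The function names, the app name, and every configuration value below are illustrative assumptions, not the contents of `bitcoin_utils`.

```python
def spark_session_config(app_name="BitcoinStreaming", master="local[*]"):
    """Return Spark settings as a plain dict (all values are illustrative)."""
    return {
        "spark.app.name": app_name,
        "spark.master": master,
        # A small shuffle partition count suits a single-machine demo.
        "spark.sql.shuffle.partitions": "4",
    }


def build_spark_session(conf=None):
    """Create (or reuse) a SparkSession from the given settings."""
    # Imported lazily so the config helper works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in (conf or spark_session_config()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a dict makes them easy to inspect or override in a notebook cell before the session is created.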

## 2. Fetch Real-time Data and Define Streaming Setup

In [24]:
import importlib
import bitcoin_utils
importlib.reload(bitcoin_utils)

from bitcoin_utils import configure_streaming_paths_and_schedule, fetch_price_as_ohlc

configure_streaming_paths_and_schedule()
fetch_price_as_ohlc()

{'Datetime': '2025-05-14T21:33:00.628257+00:00',
 'Open': 103541,
 'High': 104529,
 'Low': 102964,
 'Close': 103541,
 'Volume': '27791958894'}
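
In the record above, `Open` equals `Close`, which would be consistent with a single spot price being used for both fields. That is an inference from the output, not something the notebook states. Under that assumption, a record of the same shape could be assembled as follows; `to_ohlc_record` and its parameters are hypothetical names.

```python
from datetime import datetime, timezone


def to_ohlc_record(price, high, low, volume, now=None):
    """Wrap one spot price in an OHLC-shaped dict (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        "Datetime": now.isoformat(),  # ISO 8601 with an explicit UTC offset
        "Open": price,                # assumption: spot price doubles as Open
        "High": high,
        "Low": low,
        "Close": price,               # ... and as Close
        "Volume": str(volume),        # the sample output stores Volume as text
    }
```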

## 3. Start File Producer

In [25]:
from bitcoin_utils import start_file_producer

# start_file_producer()  # Uncomment to start streaming data
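
How `start_file_producer` writes its files is not shown here. One common pattern for feeding a directory that a streaming job watches is to write each record to a staging directory first and then move it into place, so the reader never sees a half-written file. The sketch below illustrates that pattern; the function names, the file naming scheme, and the staging directory are assumptions.

```python
import json
import os
import time


def write_record_atomically(record, out_dir, staging_dir, name):
    """Write one JSON line to staging_dir, then move it into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)
    staged = os.path.join(staging_dir, name)
    with open(staged, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
        fh.write("\n")
    final = os.path.join(out_dir, name)
    # The move is atomic when both directories are on the same filesystem.
    os.replace(staged, final)
    return final


def produce(fetch, out_dir, staging_dir, n_records, interval_s):
    """Call fetch() n_records times and write each result as its own file."""
    paths = []
    for i in range(n_records):
        name = f"tick_{i:06d}.json"
        paths.append(write_record_atomically(fetch(), out_dir, staging_dir, name))
        if i < n_records - 1:
            time.sleep(interval_s)
    return paths
```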

## 4. Define and Process Streaming Data

In [26]:
from bitcoin_utils import stream_and_display_batches

# stream_and_display_batches()  # Uncomment to process batches
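
The internals of `stream_and_display_batches` are not part of this notebook. A minimal sketch of the general approach, assuming the producer drops JSON files into a directory and that Structured Streaming's file source is used, might look like this. The schema string mirrors the sample record shown earlier; `OHLC_SCHEMA_DDL`, `show_batch`, and `start_display_query` are hypothetical names.

```python
# Column types are guessed from the sample record (Volume arrives as text).
OHLC_SCHEMA_DDL = (
    "Datetime STRING, Open DOUBLE, High DOUBLE, "
    "Low DOUBLE, Close DOUBLE, Volume STRING"
)


def show_batch(batch_df, batch_id):
    """Print one micro-batch; Spark calls this once per trigger."""
    print(f"--- batch {batch_id} ---")
    batch_df.show(truncate=False)


def start_display_query(spark, input_dir):
    """Start a streaming query that displays each new batch of JSON files."""
    stream_df = (
        spark.readStream
        .schema(OHLC_SCHEMA_DDL)          # file sources need an explicit schema
        .option("maxFilesPerTrigger", 1)  # one file per micro-batch
        .json(input_dir)
    )
    return stream_df.writeStream.foreachBatch(show_batch).start()
```

The returned query object can be stopped with `query.stop()` once enough batches have been displayed.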

## Stream Data and Process Batches
Run the streaming producer and consumer concurrently.

In [None]:
from bitcoin_utils import run_streaming_query_and_writer

# This will run for the configured duration (e.g., 90 seconds)
run_streaming_query_and_writer()
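
Running a producer and a consumer concurrently for a fixed duration can be sketched with a background thread and a stop signal. This is an illustration of the idea, not the implementation in `bitcoin_utils`; `run_for_duration` and its parameters are hypothetical.

```python
import threading
import time


def run_for_duration(produce_once, consume_once, duration_s, interval_s):
    """Run produce_once in a background thread while consuming in the caller."""
    stop = threading.Event()

    def producer_loop():
        while not stop.is_set():
            produce_once()
            # Sleeps for interval_s, but wakes early if stop is set.
            stop.wait(interval_s)

    worker = threading.Thread(target=producer_loop, daemon=True)
    worker.start()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            consume_once()
            time.sleep(interval_s)
    finally:
        # Always stop the producer, even if the consumer raises.
        stop.set()
        worker.join()
```

With Structured Streaming, the consumer loop would normally be replaced by `query.awaitTermination(duration_s)` followed by `query.stop()`.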

## Preview Historical Data
Load and inspect the combined history of all fetched records.

In [None]:
from bitcoin_utils import preview_historical_data

preview_historical_data()
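
Where the history is stored is not shown in this notebook. Assuming the records sit as JSON lines in a single directory, they could be combined and ordered by timestamp with plain Python as below; `load_history` and the `*.json` file pattern are assumptions. In Spark, a batch read such as `spark.read.json(data_dir)` would serve the same purpose.

```python
import glob
import json
import os
from datetime import datetime


def load_history(data_dir):
    """Read every JSON-lines file in data_dir and sort records by Datetime."""
    records = []
    for path in sorted(glob.glob(os.path.join(data_dir, "*.json"))):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    # Parse the ISO 8601 strings so ordering does not depend on text format.
    records.sort(key=lambda r: datetime.fromisoformat(r["Datetime"]))
    return records
```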

## Aggregations
Compute hourly, daily, and rolling-window price statistics.

In [None]:
from bitcoin_utils import aggregate_hourly_daily_moving_average

aggregate_hourly_daily_moving_average()
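
The three kinds of statistics named above can be sketched with standard PySpark functions. This is one plausible formulation, assuming a DataFrame with the `Datetime`, `High`, `Low`, and `Close` columns from the sample record; `add_aggregates`, `rolling_mean`, and the window size are hypothetical. The pure-Python `rolling_mean` shows what the trailing window computes for each row.

```python
def rolling_mean(values, window):
    """Trailing mean over up to `window` values, including the current one."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def add_aggregates(df, window_rows=3):
    """Return (hourly, daily, rolling) DataFrames for an OHLC DataFrame."""
    # Imported lazily so rolling_mean works without PySpark installed.
    from pyspark.sql import Window, functions as F

    # Assumes Spark can parse the ISO 8601 strings with its default rules.
    df = df.withColumn("ts", F.to_timestamp("Datetime"))
    stats = [
        F.avg("Close").alias("avg_close"),
        F.max("High").alias("max_high"),
        F.min("Low").alias("min_low"),
    ]
    hourly = df.groupBy(F.window("ts", "1 hour")).agg(*stats)
    daily = df.groupBy(F.to_date("ts").alias("day")).agg(*stats)
    # No partitionBy: Spark moves all rows to one partition, fine for small data.
    trailing = Window.orderBy("ts").rowsBetween(-(window_rows - 1), 0)
    rolling = df.withColumn("ma_close", F.avg("Close").over(trailing))
    return hourly, daily, rolling
```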

## Train and Evaluate Gradient-Boosted Tree Regressor
Build a GBT model to forecast Bitcoin prices.

In [None]:
from bitcoin_utils import train_and_evaluate_gbt_regressor

train_and_evaluate_gbt_regressor()
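
The features and hyperparameters used by `train_and_evaluate_gbt_regressor` are not shown here. As a hedged sketch, a GBT forecast could use a few lagged closing prices as features and the next close as the label. The lag features, the column names, `maxIter=20`, and both function names below are assumptions.

```python
def make_lag_rows(closes, n_lags=3):
    """Turn a price series into rows of lagged features plus a label."""
    rows = []
    for i in range(n_lags, len(closes)):
        row = {f"lag_{k}": float(closes[i - k]) for k in range(1, n_lags + 1)}
        row["label"] = float(closes[i])
        rows.append(row)
    return rows


def fit_and_score_gbt(train_df, test_df, n_lags=3, seed=42):
    """Fit a GBT regressor on train_df and report RMSE on test_df."""
    # Imported lazily so make_lag_rows works without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import GBTRegressor

    features = [f"lag_{k}" for k in range(1, n_lags + 1)]
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    gbt = GBTRegressor(
        featuresCol="features", labelCol="label", maxIter=20, seed=seed
    )
    model = Pipeline(stages=[assembler, gbt]).fit(train_df)
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse"
    )
    return model, predictions, evaluator.evaluate(predictions)
```

The rows from `make_lag_rows` can be turned into a DataFrame with `spark.createDataFrame`. For time series, splitting train and test chronologically rather than at random avoids leaking future prices into training.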

## Plot Actual vs. Predicted Prices
Visualize model performance over time.

In [None]:
from bitcoin_utils import plot_actual_vs_predicted_prices

plot_actual_vs_predicted_prices()