# BigDL & Spark API Notebook  

This notebook is a **quick, end-to-end demo** of the helper functions in `BigDL_API.py`.
It mirrors the structure of the SQLite example so future readers have a consistent experience.

---
## Notebook Objectives
1. **Spin up Spark 3 + BigDL** inside the Docker image.  
2. **Fetch live BTC-USD prices** (24 h sample) from CoinGecko.  
3. **Clean and feature-engineer** the raw data (`rolling_avg_1h`, `pct_change`).  
4. **Inspect the Spark DataFrame** you can feed into the BigDL LSTM pipeline.

## Notebook Flow
1. Setup & Imports  
2. Spark Session  
3. Data Download (`fetch_bitcoin_prices`)  
4. Cleaning (`process_bitcoin_data`)  
5. Feature Engineering (`transform_bitcoin_data`)  
6. Preview & sanity-check

## References 📚
* **BigDL docs:** <https://bigdl.readthedocs.io>  
* **Apache Spark 3.3** API refs  
* **CoinGecko REST API:** <https://www.coingecko.com/en/api/documentation>  
* Project README for full Docker instructions.

> 🛠 **Prerequisite** – Build the image with `./docker_build.sh` and run a shell or Jupyter inside the container so BigDL & Spark are on the PYTHONPATH.
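
Before running the cells below, a quick pre-flight check can confirm that the prerequisite is met. The sketch below uses only the standard library; the top-level package names `pyspark` and `bigdl` are assumptions about how the image installs them, and `missing_modules` is a hypothetical helper, not part of `BigDL_API.py`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


# Assumed top-level package names; adjust them if your image differs.
REQUIRED = ["pyspark", "bigdl"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing from the import path:", ", ".join(missing))
else:
    print("pyspark and bigdl are importable.")
```

If anything is reported missing, the notebook is most likely running outside the container described in the project README.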


In [1]:
from BigDL_API import (
    get_spark_session,
    fetch_bitcoin_prices,
    process_bitcoin_data,
    transform_bitcoin_data,
)

# 1️⃣ Spark Session
spark = get_spark_session()
spark



In [2]:
# 2️⃣ Download a ~24-hour slice just for demo purposes
raw_df = fetch_bitcoin_prices(days=1)
raw_df.show(5)

+------------------+-------------+
|             price|    timestamp|
+------------------+-------------+
|108482.11390421852|1748063463710|
| 108434.0833026226|1748063776529|
|108377.99266408195|1748064085703|
|108349.45800161053|1748064326238|
|108361.03052409452|1748064692643|
+------------------+-------------+
only showing top 5 rows
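
The raw `timestamp` values look like Unix epoch milliseconds. Under that assumption, the first row converts to 2025-05-24 05:11:03 UTC, which agrees with the `time` column in the cleaned preview further down. A minimal standard-library sketch of the conversion (`epoch_ms_to_utc` is a hypothetical helper, not necessarily how `process_bitcoin_data` does it):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(ms):
    """Convert integer epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding on the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)


first = epoch_ms_to_utc(1748063463710)  # first raw timestamp shown above
print(first.strftime("%Y-%m-%d %H:%M:%S"))  # 2025-05-24 05:11:03
```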



In [3]:
# 3️⃣ Cleaning + features
clean_df = process_bitcoin_data(raw_df)
trans_df = transform_bitcoin_data(clean_df)

trans_df.printSchema()
trans_df.show(5)

root
 |-- time: timestamp (nullable = true)
 |-- price: double (nullable = true)
 |-- rolling_avg_1h: double (nullable = true)
 |-- pct_change: double (nullable = true)

+-------------------+------------------+------------------+--------------------+
|               time|             price|    rolling_avg_1h|          pct_change|
+-------------------+------------------+------------------+--------------------+
|2025-05-24 05:11:03|108482.11390421852|108482.11390421852|                null|
|2025-05-24 05:16:16| 108434.0833026226|108458.09860342057|-0.04427513427543337|
|2025-05-24 05:21:25|108377.99266408195|108431.39662364102|-0.05172786713575643|
|2025-05-24 05:25:26|108349.45800161053| 108410.9119681334|-0.02632883463698...|
|2025-05-24 05:31:32|108361.03052409452|108400.93567932563|0.010680738692585191|
+-------------------+------------------+------------------+--------------------+
only showing top 5 rows
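
As a sanity check, the two engineered columns can be recomputed by hand from the five rows above. The sketch below is plain Python, not the Spark implementation. It assumes `rolling_avg_1h` is the mean price over a trailing one-hour window that includes the current row, and that `pct_change` is the percent change from the previous row. Both readings are consistent with the preview, but the source of `transform_bitcoin_data` is not shown here, so treat them as assumptions. `add_features` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# The five (time, price) rows from the preview above.
rows = [
    (datetime(2025, 5, 24, 5, 11, 3), 108482.11390421852),
    (datetime(2025, 5, 24, 5, 16, 16), 108434.0833026226),
    (datetime(2025, 5, 24, 5, 21, 25), 108377.99266408195),
    (datetime(2025, 5, 24, 5, 25, 26), 108349.45800161053),
    (datetime(2025, 5, 24, 5, 31, 32), 108361.03052409452),
]


def add_features(rows, window=timedelta(hours=1)):
    """Return (time, price, rolling_avg, pct_change) tuples for time-sorted rows."""
    out = []
    for i, (t, price) in enumerate(rows):
        # Trailing window: every row no older than `window`, including this one.
        in_window = [p for (u, p) in rows[: i + 1] if t - u <= window]
        rolling_avg = sum(in_window) / len(in_window)
        # Percent change versus the previous row; undefined for the first row.
        if i == 0:
            pct = None
        else:
            prev = rows[i - 1][1]
            pct = (price - prev) / prev * 100
        out.append((t, price, rolling_avg, pct))
    return out


for t, price, avg, pct in add_features(rows):
    print(t, round(avg, 2), None if pct is None else round(pct, 4))
```

The recomputed values should agree with the `rolling_avg_1h` and `pct_change` columns up to floating-point rounding. Note that `pct_change` is expressed in percent, so `-0.044` means a drop of about 0.044 %, not 4.4 %.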



### ✅ All good!

You now have a tidy Spark DataFrame ready for **BigDL LSTM** training (see `BigDL_example.py`).
Feel free to adjust the `days` parameter or integrate these helpers into your own workflows.
