# OpenRefine API

This notebook demonstrates how to use the OpenRefine API and Prophet for time series analysis and forecasting on near real-time Bitcoin price data. The raw OHLCV data is ingested from the KuCoin API and cleaned in OpenRefine; loading, validation, indicator generation, and forecasting are performed through a custom Python API layer.

The methodology is as follows:

- Ingesting and saving 15-minute interval Bitcoin price data.
- Cleaning and transforming the dataset in OpenRefine to prepare it for analysis.
- Calculating technical indicators (e.g., moving averages, Bollinger Bands).
- Training and evaluating a Prophet model to forecast 24-hour future trends.
- Visualizing actual vs. predicted values with interactive plots for easy comparison.

All utility functions used in this notebook are defined in `openrefine_utils.py`. These functions provide a modular, beginner-friendly interface over the preprocessing and modeling steps. For design choices and detailed function documentation, refer to `OpenRefine.API.md`.

## Notebook Description

- This notebook demonstrates the full workflow of loading, validating, and analyzing Bitcoin price data cleaned using OpenRefine.
- The API notebook organizes the utility functions under each major step of the project to provide a clear understanding of the overall workflow and how each component contributes to the analysis.
- It includes clear visualizations to show trends and forecast results.

## References & Citations

- KuCoin API: https://www.kucoin.com/docs/rest/spot-trading/market/get-klines
- OpenRefine Documentation: https://docs.openrefine.org/
- Introduction to OpenRefine: https://openrefine.org/

## Prerequisites

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'iframe'  # For Interactive Plots
from prophet import Prophet

In [3]:
from openrefine_utils import (
    fetch_bitcoin_data_kucoin,
    save_to_csv,
    load_cleaned_data,
    validate_cleaned_data,
    resample_data,
    calculate_technical_indicators,
    plot_technical_indicators,
    prepare_forecast_data,
    train_model,
    plot_forecast,
    plot_comparision
)

## Data Ingestion

In [4]:
btc_df = fetch_bitcoin_data_kucoin(days=7, interval='15min')
btc_df.head()

save_to_csv(btc_df, 'bitcoin_15m_kucoin.csv')

INFO:openrefine_utils:Fetching 7 days of BTC data from KuCoin (15min candles)

The behavior of 'to_datetime' with 'unit' when parsing strings is deprecated. In a future version, strings will be parsed as datetime strings, matching the behavior without a 'unit'. To retain the old behavior, explicitly cast ints or floats to numeric type before calling to_datetime.

INFO:openrefine_utils:Saving 672 records to bitcoin_15m_kucoin.csv
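
The deprecation warning above appears when string timestamps are passed to `pd.to_datetime` together with `unit`. Below is a minimal sketch of how a klines payload could be converted without triggering it. It assumes the row layout given in the KuCoin klines documentation (time, open, close, high, low, volume, turnover, all as strings). `klines_to_frame` and `KLINE_COLUMNS` are hypothetical names for illustration, not the implementation in `openrefine_utils.py`.

```python
import pandas as pd

# Column order assumed from the KuCoin klines documentation.
KLINE_COLUMNS = ["timestamp", "open", "close", "high", "low", "volume", "turnover"]


def klines_to_frame(rows):
    """Convert raw kline rows (lists of strings) into a typed, time-sorted DataFrame."""
    df = pd.DataFrame(rows, columns=KLINE_COLUMNS)
    # Cast every column to a numeric dtype first; epoch seconds arrive as strings.
    df = df.apply(pd.to_numeric)
    # With numeric input, unit="s" is unambiguous and the warning is avoided.
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    # Sort oldest-first so the order does not depend on how the API returns rows.
    return df.sort_values("timestamp").reset_index(drop=True)
```

Seven days of 15-minute candles is 7 × 96 = 672 rows, which matches the record count in the log line above.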


## Data Loading & Validation

In [5]:
cleaned_df = load_cleaned_data('bitcoin_price_analysis_using_OpenRefine_w_timestamp.csv')

validation_passed = validate_cleaned_data(cleaned_df)

INFO:openrefine_utils:Loading cleaned data from bitcoin_price_analysis_using_OpenRefine_w_timestamp.csv
INFO:openrefine_utils:Successfully loaded 672 records
INFO:openrefine_utils:No missing values
INFO:openrefine_utils:Valid price relationships
INFO:openrefine_utils:Time sequence valid
INFO:openrefine_utils:All data validation checks passed successfully!
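
The three checks reported in the log (missing values, price relationships, time sequence) could be expressed roughly as follows. This is a sketch, not the body of `validate_cleaned_data`: the function name `validate_ohlc` and the column names `timestamp`, `open`, `high`, `low`, `close` are assumptions.

```python
import pandas as pd


def validate_ohlc(df):
    """Return a dict of named boolean checks for an OHLC DataFrame."""
    prices = df[["open", "high", "low", "close"]]
    return {
        "no_missing_values": not df.isna().any().any(),
        # `high` must be the row maximum and `low` the row minimum.
        "valid_price_relationships": bool(
            (prices.max(axis=1) == df["high"]).all()
            and (prices.min(axis=1) == df["low"]).all()
        ),
        # Timestamps must increase and must not repeat.
        "time_sequence_valid": bool(
            df["timestamp"].is_monotonic_increasing and df["timestamp"].is_unique
        ),
    }
```

Returning one flag per check, rather than a single boolean, makes it easier to see which rule failed when a cleaned export is rejected.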


## Preprocessing Utilities

In [6]:
analyzed_df = calculate_technical_indicators(cleaned_df)
display(analyzed_df[['timestamp', 'ma_7', 'ma_24', 'intraday_volatility', 'daily_momentum']].head())

| | timestamp | ma_7 | ma_24 | intraday_volatility | daily_momentum |
|---|---|---|---|---|---|
| 96 | 2025-05-10 17:14:40+00:00 | 103330.214286 | 103486.816667 | 199.4 | 0.380282 |
| 97 | 2025-05-10 17:29:36+00:00 | 103370.5 | 103486.016667 | 198.8 | 0.28615 |
| 98 | 2025-05-10 17:44:32+00:00 | 103359.9 | 103476.166667 | 297.3 | -0.027131 |
| 99 | 2025-05-10 17:59:28+00:00 | 103357.728571 | 103464.333333 | 159.0 | 0.186503 |
| 100 | 2025-05-10 18:14:24+00:00 | 103338.128571 | 103443.625 | 200.0 | 0.048557 |
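
One plausible way to derive the four columns shown above is sketched here. The definitions are guesses based on the column names and on the displayed index starting at 96 (one day of 15-minute candles): rolling means of `close` over 7 and 24 rows, `high - low` for intraday volatility, and the percent change of `close` over 96 rows for daily momentum. The actual formulas in `calculate_technical_indicators` may differ, and `add_indicators` is a hypothetical name.

```python
import pandas as pd


def add_indicators(df, day_periods=96):
    """Add moving averages, a range-based volatility and a one-day momentum column."""
    out = df.copy()
    out["ma_7"] = out["close"].rolling(window=7).mean()
    out["ma_24"] = out["close"].rolling(window=24).mean()
    # Candle range as a simple volatility proxy (assumed definition).
    out["intraday_volatility"] = out["high"] - out["low"]
    # Percent change against the candle one day earlier (assumed definition).
    out["daily_momentum"] = out["close"].pct_change(periods=day_periods) * 100
    # Rows without a full look-back window contain NaN and are dropped.
    return out.dropna()
```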


## Forecasting Setup

#### Timestamp Modification

In [7]:
dataset_path = "bitcoin_price_analysis_using_OpenRefine_notimestamp.csv"
notimestamps_df = pd.read_csv(dataset_path)

notimestamps_df['timestamp'] = pd.to_datetime(notimestamps_df['timestamp'], utc=True, errors='coerce')
# Convert to naive datetime
notimestamps_df['timestamp'] = notimestamps_df['timestamp'].dt.tz_localize(None)
notimestamps_df = notimestamps_df.dropna(subset=['timestamp'])

forecast_df = prepare_forecast_data(notimestamps_df)
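
Prophet expects a DataFrame with a timezone-naive `ds` column and a numeric `y` column, which is why the timestamps are stripped of their timezone above. A minimal sketch of that preparation step follows, assuming the target is the `close` column. `to_prophet_frame` is a hypothetical stand-in and may not match what `prepare_forecast_data` does.

```python
import pandas as pd


def to_prophet_frame(df, time_col="timestamp", value_col="close"):
    """Build a two-column (ds, y) frame in the shape Prophet expects."""
    out = pd.DataFrame({
        # Parse as UTC, then drop the timezone while keeping the UTC wall-clock time.
        "ds": pd.to_datetime(df[time_col], utc=True, errors="coerce").dt.tz_localize(None),
        "y": pd.to_numeric(df[value_col], errors="coerce"),
    })
    # Unparseable rows become NaT/NaN and are removed before sorting.
    return out.dropna().sort_values("ds").reset_index(drop=True)
```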

#### Splitting Train/Test Data

In [8]:
train_data = forecast_df[:-96]  # everything except the final 24 hours
test_data = forecast_df[-96:]   # last 96 candles = 24 hours at 15-minute intervals
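
The hold-out size of 96 follows from the sampling interval: 24 hours × 60 minutes ÷ 15 minutes per candle = 96 candles. The sketch below makes that arithmetic explicit; both helper names are hypothetical, and the split works on anything that supports `len` and slicing, including a DataFrame.

```python
def candles_per_horizon(horizon_hours, interval_minutes):
    """Number of candles that cover the forecast horizon."""
    total_minutes = horizon_hours * 60
    if total_minutes % interval_minutes:
        raise ValueError("horizon is not a whole number of candles")
    return total_minutes // interval_minutes


def holdout_split(rows, horizon_hours=24, interval_minutes=15):
    """Keep the most recent horizon as the test set and the rest as training data."""
    n_test = candles_per_horizon(horizon_hours, interval_minutes)
    if n_test >= len(rows):
        raise ValueError("not enough rows for the requested hold-out")
    # Time series must be split chronologically, never shuffled.
    return rows[:-n_test], rows[-n_test:]
```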

#### Training Model

In [9]:
model, forecast = train_model(train_data, periods=96)

INFO:openrefine_utils:Starting model training...
DEBUG:cmdstanpy:input tempfile: /tmp/tmpogr2_nq6/bbn30wxe.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpogr2_nq6/rc7yesrc.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/opt/conda/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=40993', 'data', 'file=/tmp/tmpogr2_nq6/bbn30wxe.json', 'init=/tmp/tmpogr2_nq6/rc7yesrc.json', 'output', 'file=/tmp/tmpogr2_nq6/prophet_modelk0h3g8xl/prophet_model-20250521011100.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
INFO:cmdstanpy:Chain [1] start processing
INFO:cmdstanpy:Chain [1] done processing
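
For orientation, a fit-and-forecast step with Prophet usually takes the shape below. This is a sketch under assumptions, not the code of `train_model`: the name `fit_and_forecast` and the `model_factory` parameter are hypothetical (the latter exists only so the sketch can be exercised without Prophet installed), and default Prophet settings are assumed.

```python
def fit_and_forecast(train_df, periods=96, freq="15min", model_factory=None):
    """Fit a Prophet-style model and predict `periods` steps past the training data."""
    if model_factory is None:
        # Imported lazily so the sketch can be loaded without Prophet installed.
        from prophet import Prophet
        model_factory = Prophet
    model = model_factory()
    model.fit(train_df)  # expects columns `ds` and `y`
    # The future frame includes the training timestamps plus `periods` new ones.
    future = model.make_future_dataframe(periods=periods, freq=freq)
    forecast = model.predict(future)
    return model, forecast
```

Because the future frame includes history by default, the final `periods` rows of `forecast` are the out-of-sample predictions that line up with the held-out test window.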


## Visualizations

### A. Technical Indicator Plots

In [10]:
plot_technical_indicators(analyzed_df)

### B. Forecasting Plots

In [11]:
plot_forecast(train_data, forecast)
plot_comparision(test_data, forecast)
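
The visual comparison can be complemented with a few numbers. The sketch below joins the held-out data to the forecast on `ds` and reports mean absolute error, mean absolute percentage error, and the share of actual values inside the predicted interval. It assumes `test_data` has `ds` and `y` columns and that `forecast` carries Prophet's `yhat`, `yhat_lower`, and `yhat_upper` columns. `forecast_metrics` is a hypothetical helper.

```python
import pandas as pd


def forecast_metrics(test_df, forecast_df):
    """Compare held-out actuals with forecast values that share the same timestamps."""
    merged = test_df.merge(
        forecast_df[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds", how="inner"
    )
    error = merged["y"] - merged["yhat"]
    # Inclusive check of each actual value against its own interval bounds.
    inside = merged["y"].between(merged["yhat_lower"], merged["yhat_upper"])
    return {
        "n": len(merged),
        "mae": float(error.abs().mean()),
        "mape_pct": float((error.abs() / merged["y"].abs()).mean() * 100),
        "coverage_pct": float(inside.mean() * 100),
    }
```

Prophet's default `interval_width` is 0.8, so under default settings a coverage near 80% on the test window would be consistent with a well-calibrated interval.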

## Insights and Findings

- The **Bitcoin Price Forecast (24 Hours)** graph generated using the Prophet model shows how the model anticipates future price movements using patterns from historical data.
  - Prophet works by identifying **seasonal trends and recent behavior** in the price series to estimate what will likely happen next.
  - The forecast includes a **confidence interval**, represented by the shaded region around the prediction line. This interval reflects the model's uncertainty: the wider it is, the less confident the model is about precise values. Most of our predicted intervals stayed reasonably tight, indicating consistent and stable model behavior.

- As time progresses into the future, these **confidence intervals expand slightly**, which is expected. This means the model is less certain the further ahead it predicts, which is a common characteristic of time series models.

- The **Actual vs Predicted Price** plot provides a strong visual assessment of model performance.
  - The **actual price** (from real data) is shown alongside the **predicted values** from the model.
  - Most real values fall within the confidence band, indicating that the model is well-calibrated and **accurately captures short-term price behavior**.
  - Minor gaps between actual and predicted values typically occurred during more volatile time windows, which suggests the model could be fine-tuned further or enhanced with external signals (like volume or macroeconomic events).


- Before modeling, **OpenRefine** was used to clean and validate raw Bitcoin price data.
  - Issues such as variable values and inconsistent timestamps were resolved.
  - Ensuring clean data with **logical relationships among columns** (like high ≥ low, etc.) is critical to avoid introducing noise into time-based predictions.

- The data was sampled at **15-minute intervals**, which preserves high granularity and suits crypto markets, where prices change rapidly.

- We also computed **technical indicators** such as:
  - **Moving Averages (MA)** to smooth out short-term fluctuations and highlight longer-term trends.
  - **Bollinger Bands**, which measure price volatility by placing upper and lower bands around a moving average. In periods of high volatility the bands widen, a pattern that was clearly visible in our visualization.

- Finally, by combining **cleaned data**, **technical indicators**, and a **forecasting model**, the project successfully demonstrated a **modular, reproducible pipeline** for near real-time price prediction.
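
The Bollinger Bands described in the findings can be computed as a rolling mean plus and minus a multiple of the rolling standard deviation. The sketch below uses the conventional 20-period window and 2 standard deviations. Those defaults, the function name, and the output column names are assumptions and are not necessarily the settings used in `openrefine_utils.py`.

```python
import pandas as pd


def bollinger_bands(close, window=20, num_std=2.0):
    """Return middle, upper and lower Bollinger Bands for a price Series."""
    middle = close.rolling(window=window).mean()
    # pandas uses the sample standard deviation (ddof=1) by default.
    std = close.rolling(window=window).std()
    return pd.DataFrame({
        "bb_middle": middle,
        "bb_upper": middle + num_std * std,
        "bb_lower": middle - num_std * std,
    })
```

The band width is `2 * num_std * std`, so it grows directly with the rolling standard deviation, which is why the bands widen during volatile periods.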


## Flow Chart

<center>

###  **RAW CSV**  
    ⬇️  
###  **OPENREFINE CLEANING**  
    ⬇️  
###  **CLEANED CSV**  
    ⬇️  
###  **LOAD INTO NOTEBOOK FOR ANALYSIS**  
    ⬇️  
###  **ADD TECHNICAL INDICATORS**  
    ⬇️  
###  **MODEL TRAINING**
    ⬇️  
###  **FORECAST VISUALIZATIONS**  
    ⬇️  
###  **DEDUCE CONCLUSIONS**

</center>
