# OpenRefine Pipeline Example: Bitcoin Price Analysis

This notebook demonstrates a Python wrapper around the OpenRefine API that simplifies data cleaning and transformation, so that the cleaned data is ready for downstream analysis.

For a detailed explanation of the API utilities used in this notebook, refer to: `openrefine_utils.py`

This notebook assumes that OpenRefine is already running at http://localhost:3333 and that the necessary data cleaning has been completed, with the cleaned dataset exported for further analysis.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

## Imports

In [2]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'iframe'  # For Interactive Plots
from prophet import Prophet

## Functions from `openrefine_utils`

In [3]:
from openrefine_utils import (
    fetch_bitcoin_data_kucoin,
    save_to_csv,
    load_cleaned_data,
    validate_cleaned_data,
    resample_data,
    calculate_technical_indicators,
    plot_technical_indicators,
    prepare_forecast_data,
    train_model,
    plot_forecast,
    plot_comparision
)

## Fetch Raw Data from API

In [4]:
# 7 days of 15min interval data
btc_df = fetch_bitcoin_data_kucoin(days=7, interval='15min')

btc_df.head()

INFO:openrefine_utils:Fetching 7 days of BTC data from KuCoin (15min candles)

The behavior of 'to_datetime' with 'unit' when parsing strings is deprecated. In a future version, strings will be parsed as datetime strings, matching the behavior without a 'unit'. To retain the old behavior, explicitly cast ints or floats to numeric type before calling to_datetime.



| | timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2025-05-14 03:14:08 | 103648.3 | 103649.4 | 103513.6 | 103531.0 | 9.804768 |
| 1 | 2025-05-14 03:29:04 | 103531.0 | 103610.5 | 103452.0 | 103610.4 | 8.740662 |
| 2 | 2025-05-14 03:44:00 | 103610.5 | 103610.5 | 103546.5 | 103578.0 | 2.908408 |
| 3 | 2025-05-14 03:58:56 | 103577.9 | 103578.0 | 103465.8 | 103549.0 | 6.683276 |
| 4 | 2025-05-14 04:16:00 | 103552.5 | 103692.7 | 103482.7 | 103675.4 | 5.099068 |
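
The deprecation warning in the output above is raised when `pd.to_datetime` receives strings together with `unit=`. The sketch below shows one way to avoid it by casting to a numeric dtype first. It is not the code in `openrefine_utils.py`: the helper name `candles_to_frame`, the sample rows, and the assumed KuCoin column order (time, open, close, high, low, volume, turnover, all as strings) are illustrative assumptions.

```python
import pandas as pd

# Assumed column order of a KuCoin candle row; check it against the API docs.
KUCOIN_COLUMNS = ["timestamp", "open", "close", "high", "low", "volume", "turnover"]


def candles_to_frame(raw_candles):
    """Convert candle rows (lists of strings) into a sorted OHLCV DataFrame."""
    df = pd.DataFrame(raw_candles, columns=KUCOIN_COLUMNS)
    # Cast every field to a numeric dtype first; strings combined with
    # `unit=` are what trigger the deprecation warning.
    df = df.apply(pd.to_numeric)
    # Integer epoch seconds convert to exact timestamps.
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
    df = df.sort_values("timestamp").reset_index(drop=True)
    return df[["timestamp", "open", "high", "low", "close", "volume"]]


# Hypothetical sample rows (newest first), used only to exercise the helper.
sample = [
    ["1747192500", "103531.0", "103610.4", "103610.5", "103452.0", "8.740662", "905000.1"],
    ["1747191600", "103648.3", "103531.0", "103649.4", "103513.6", "9.804768", "1015000.2"],
]
candles = candles_to_frame(sample)
print(candles)
```

Casting before conversion keeps the behavior stable across pandas versions, since the string-with-`unit` path is the one scheduled to change.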


## Load Cleaned Data (Cleaned using OpenRefine)

In [5]:
cleaned_df = load_cleaned_data('bitcoin_price_analysis_using_OpenRefine_w_timestamp.csv')

# Run validation checks
validation_passed = validate_cleaned_data(cleaned_df)

if validation_passed:
    print("\n=== VALIDATION SUCCESS ===")
    print("Data is clean and ready for analysis!")
    display(cleaned_df.head(3))
else:
    print("\n=== VALIDATION FAILED ===")
    print("Fix issues in OpenRefine before proceeding")

INFO:openrefine_utils:Loading cleaned data from bitcoin_price_analysis_using_OpenRefine_w_timestamp.csv
INFO:openrefine_utils:Successfully loaded 672 records
INFO:openrefine_utils:No missing values
INFO:openrefine_utils:Valid price relationships
INFO:openrefine_utils:Time sequence valid
INFO:openrefine_utils:All data validation checks passed successfully!



=== VALIDATION SUCCESS ===
Data is clean and ready for analysis!


| | timestamp | open | high | price_validation | hourly_volatility | low | close | price_change | volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-05-09 17:14:40+00:00 | 102774.6 | 103049.1 | Valid | 275 | 102774.6 | 103028.9 | 254 | 34.717871 |
| 1 | 2025-05-09 17:29:36+00:00 | 103028.9 | 103216.1 | Valid | 260 | 102956.1 | 103127.9 | 99 | 24.626889 |
| 2 | 2025-05-09 17:44:32+00:00 | 103119.1 | 103234.0 | Valid | 125 | 103109.4 | 103201.3 | 82 | 25.83615 |
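
The log lines above name three checks: missing values, price relationships, and time sequence. A minimal sketch of what such checks could look like follows. The function name `basic_ohlc_checks` and the exact rules are assumptions for illustration; the real logic lives in `validate_cleaned_data` in `openrefine_utils.py` and may differ.

```python
import pandas as pd


def basic_ohlc_checks(df):
    """Return a dict mapping check name to True/False for an OHLCV frame."""
    ts = pd.to_datetime(df["timestamp"], utc=True)
    upper_ok = (df["high"] >= df[["open", "close"]].max(axis=1)).all()
    lower_ok = (df["low"] <= df[["open", "close"]].min(axis=1)).all()
    return {
        # No NaN anywhere in the frame.
        "no_missing_values": not df.isna().any().any(),
        # high must bound open/close from above, low from below (assumed rule).
        "valid_price_relationships": bool(upper_ok and lower_ok),
        # Timestamps must be strictly increasing (sorted, no duplicates).
        "time_sequence_valid": bool(ts.is_monotonic_increasing and ts.is_unique),
    }


# The three rows displayed above, reduced to the columns the checks need.
sample = pd.DataFrame({
    "timestamp": ["2025-05-09 17:14:40+00:00", "2025-05-09 17:29:36+00:00",
                  "2025-05-09 17:44:32+00:00"],
    "open": [102774.6, 103028.9, 103119.1],
    "high": [103049.1, 103216.1, 103234.0],
    "low": [102774.6, 102956.1, 103109.4],
    "close": [103028.9, 103127.9, 103201.3],
    "volume": [34.717871, 24.626889, 25.83615],
})
checks = basic_ohlc_checks(sample)
print(checks)
```

Returning one flag per check, rather than a single boolean, makes it easier to see which rule needs fixing in OpenRefine.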


## Sampling into Hourly Data

In [6]:
hourly_df = resample_data(cleaned_df, '1H')
hourly_df.head()

| timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|
| 2025-05-09 17:00:00+00:00 | 102774.6 | 103247.6 | 102774.6 | 103108.4 | 104.843171 |
| 2025-05-09 18:00:00+00:00 | 103108.5 | 103372.1 | 102933.4 | 103143.9 | 61.753349 |
| 2025-05-09 19:00:00+00:00 | 103143.9 | 103207.2 | 102897.6 | 103189.8 | 62.987781 |
| 2025-05-09 20:00:00+00:00 | 103194.1 | 103415.6 | 103109.1 | 103176.9 | 49.422233 |
| 2025-05-09 21:00:00+00:00 | 103177.0 | 103211.2 | 102774.8 | 102981.5 | 48.68466 |
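
Downsampling candles needs a different aggregation per column: the first open, the highest high, the lowest low, the last close, and the summed volume. The sketch below shows that pattern with plain pandas. The name `resample_ohlcv` is hypothetical, and `resample_data` in `openrefine_utils.py` may be implemented differently.

```python
import pandas as pd


def resample_ohlcv(df, rule):
    """Aggregate OHLCV rows into coarser candles using per-column rules."""
    indexed = df.set_index(pd.to_datetime(df["timestamp"], utc=True))
    candles = indexed.resample(rule).agg(
        {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
    )
    # Drop bins that contained no source rows.
    return candles.dropna(subset=["open"])


# The three 15-minute rows shown earlier; all fall inside the 17:00 hour.
rows = pd.DataFrame({
    "timestamp": ["2025-05-09 17:14:40+00:00", "2025-05-09 17:29:36+00:00",
                  "2025-05-09 17:44:32+00:00"],
    "open": [102774.6, 103028.9, 103119.1],
    "high": [103049.1, 103216.1, 103234.0],
    "low": [102774.6, 102956.1, 103109.4],
    "close": [103028.9, 103127.9, 103201.3],
    "volume": [34.717871, 24.626889, 25.83615],
})
hourly = resample_ohlcv(rows, "60min")
print(hourly)
```

The rule is written as `"60min"` here because that alias is accepted across pandas versions, whereas the uppercase `"H"` alias is deprecated in recent releases.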


## Hourly Data Representation

In [7]:
fig_price = go.Figure()
fig_price.add_trace(go.Scatter(
    x=hourly_df.index,
    y=hourly_df['close'],
    mode='lines',
    name='Hourly Close Price',
    line=dict(color='blue')
))
fig_price.update_layout(
    title='Bitcoin Price Trends',
    xaxis_title='Date',
    yaxis_title='Price (USD)',
    template='plotly_white',
    hovermode='x unified',
    width=1000,
    height=500
)
fig_price.show()

In [8]:
# Volume analysis (Hourly Data)

fig_volume = go.Figure()
fig_volume.add_trace(go.Bar(
    x=hourly_df.index,
    y=hourly_df['volume'],
    name='Trading Volume',
    marker_color='orange'
))
fig_volume.update_layout(
    title='Trading Volume',
    xaxis_title='Date',
    yaxis_title='Volume',
    template='plotly_white',
    hovermode='x unified',
    width=1000,
    height=500
)
fig_volume.show()

## Preparing Data for Analysis

In [9]:
analyzed_df = calculate_technical_indicators(cleaned_df)   # Back to our 15m interval data
print("\nTechnical Indicators Added for Analysis:")
display(analyzed_df[['timestamp', 'ma_7', 'ma_24', 'intraday_volatility', 'daily_momentum']].head())


Technical Indicators Added for Analysis:


| | timestamp | ma_7 | ma_24 | intraday_volatility | daily_momentum |
|---|---|---|---|---|---|
| 96 | 2025-05-10 17:14:40+00:00 | 103330.214286 | 103486.816667 | 199.4 | 0.380282 |
| 97 | 2025-05-10 17:29:36+00:00 | 103370.5 | 103486.016667 | 198.8 | 0.28615 |
| 98 | 2025-05-10 17:44:32+00:00 | 103359.9 | 103476.166667 | 297.3 | -0.027131 |
| 99 | 2025-05-10 17:59:28+00:00 | 103357.728571 | 103464.333333 | 159.0 | 0.186503 |
| 100 | 2025-05-10 18:14:24+00:00 | 103338.128571 | 103443.625 | 200.0 | 0.048557 |
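
The displayed frame starts at index 96, which would be consistent with a 96-row (24-hour) lookback followed by dropping incomplete rows. The sketch below shows how indicators with these column names could be computed under that reading. The formulas are assumptions (rolling means of `close`, `high - low`, and a 96-period percentage change), not the definitions used by `calculate_technical_indicators`, and the input is synthetic toy data.

```python
import numpy as np
import pandas as pd


def add_indicators(df, periods_per_day=96):
    """Add assumed indicator columns and drop rows with incomplete lookbacks."""
    out = df.copy()
    out["ma_7"] = out["close"].rolling(7).mean()      # 7-row moving average
    out["ma_24"] = out["close"].rolling(24).mean()    # 24-row moving average
    out["intraday_volatility"] = out["high"] - out["low"]  # candle range
    # Percentage change against the row one day (96 x 15 min) earlier.
    out["daily_momentum"] = out["close"].pct_change(periods_per_day) * 100
    return out.dropna()


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
n = 120
close = 100_000 + rng.normal(0, 50, n).cumsum()
toy = pd.DataFrame({
    "timestamp": pd.date_range("2025-05-09 17:15", periods=n, freq="15min", tz="UTC"),
    "close": close,
    "high": close + 40,
    "low": close - 40,
})
indicators = add_indicators(toy)
print(indicators[["timestamp", "ma_7", "ma_24", "daily_momentum"]].head())
```

Because the longest lookback is 96 rows, the first day of data is consumed as warm-up and does not appear in the result.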


## Representation of Added Indicators

In [10]:
plot_technical_indicators(analyzed_df)

## Preparing Data for Forecasting

In [11]:
dataset_path = "bitcoin_price_analysis_using_OpenRefine_notimestamp.csv"
notimestamps_df = pd.read_csv(dataset_path)

# Parse 'timestamp' as UTC; unparseable values become NaT
notimestamps_df['timestamp'] = pd.to_datetime(notimestamps_df['timestamp'], utc=True, errors='coerce')
# Drop the timezone (Prophet does not accept timezone-aware timestamps)
notimestamps_df['timestamp'] = notimestamps_df['timestamp'].dt.tz_localize(None)

# Remove rows whose timestamp could not be parsed
notimestamps_df = notimestamps_df.dropna(subset=['timestamp'])

forecast_df = prepare_forecast_data(notimestamps_df)

# Hold out the last 24 hours for testing (96 rows x 15 min = 24 h)
train_data = forecast_df[:-96]
test_data = forecast_df[-96:]
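
Slicing by row count equals a 24-hour hold-out only when the series has exactly one row per 15 minutes. If rows could be missing, splitting on a time cutoff is one alternative. The sketch below assumes the frame has a naive datetime column named `ds` (the Prophet convention). The helper `split_last_window` is hypothetical and is not part of `openrefine_utils.py`.

```python
import pandas as pd


def split_last_window(df, window="24h", time_col="ds"):
    """Split chronologically: rows inside the final `window` become the test set."""
    cutoff = df[time_col].max() - pd.Timedelta(window)
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    return train, test


# Toy frame: 7 days of 15-minute rows (672 rows), values are placeholders.
toy = pd.DataFrame({
    "ds": pd.date_range("2025-05-09 17:15", periods=672, freq="15min"),
    "y": range(672),
})
train, test = split_last_window(toy)
print(len(train), len(test))
```

On a gap-free series this reproduces the row-count split; with gaps, the test set still covers the final 24 hours, just with fewer rows.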

## Model Training

In [12]:
model, forecast = train_model(train_data, periods=96)

print("\n=== MODEL TRAINING COMPLETED ===")
print("Forecasting model trained successfully! Generating predictions and visualization.")

INFO:openrefine_utils:Starting model training...
DEBUG:cmdstanpy:input tempfile: /tmp/tmpg6b4um08/lidbhol8.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpg6b4um08/5m6xoclc.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/opt/conda/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=58371', 'data', 'file=/tmp/tmpg6b4um08/lidbhol8.json', 'init=/tmp/tmpg6b4um08/5m6xoclc.json', 'output', 'file=/tmp/tmpg6b4um08/prophet_modela6xo45p4/prophet_model-20250521030325.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
03:03:25 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
03:03:25 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing



=== MODEL TRAINING COMPLETED ===
Forecasting model trained successfully! Generating predictions and visualization.


## Analysis and Results

In [13]:
print("FORECAST SUMMARY:")
display(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

plot_forecast(train_data, forecast)

FORECAST SUMMARY:


| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 91 | 2025-05-16 13:30:24 | 100640.036618 | 95452.369855 | 105692.035721 |
| 92 | 2025-05-16 13:45:24 | 100587.137336 | 95132.837713 | 105654.460292 |
| 93 | 2025-05-16 14:00:24 | 100532.774919 | 95007.821101 | 105670.75413 |
| 94 | 2025-05-16 14:15:24 | 100479.066645 | 94955.158208 | 105761.657195 |
| 95 | 2025-05-16 14:30:24 | 100428.018446 | 94778.151168 | 105884.139417 |


## Comparison Plot

In [14]:
# Actual vs. predicted prices for the held-out last 24 hours

plot_comparision(test_data, forecast)
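
A plot shows the shape of the fit, and a few error metrics can quantify it. The sketch below aligns actuals with predictions by nearest timestamp, since the forecast timestamps shown earlier do not sit on exactly the same grid as the source rows, and then computes MAE, RMSE, and MAPE. It assumes `test_data` has columns `ds` and `y` (the Prophet convention). The helper `forecast_errors` and the 8-minute tolerance are illustrative choices, demonstrated here on toy values.

```python
import numpy as np
import pandas as pd


def forecast_errors(actual, predicted, tolerance="8min"):
    """Match rows on nearest `ds` within `tolerance`, then compute error metrics."""
    merged = pd.merge_asof(
        actual.sort_values("ds"),
        predicted.sort_values("ds")[["ds", "yhat"]],
        on="ds",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    ).dropna(subset=["yhat"])  # keep only rows that found a prediction
    err = merged["y"] - merged["yhat"]
    return {
        "n": len(merged),
        "mae": float(err.abs().mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
        "mape_pct": float((err.abs() / merged["y"]).mean() * 100),
    }


# Toy values: predictions are offset by 44 seconds from the actual timestamps.
actual = pd.DataFrame({
    "ds": pd.date_range("2025-05-15 17:14:40", periods=4, freq="15min"),
    "y": [100.0, 102.0, 104.0, 106.0],
})
predicted = pd.DataFrame({
    "ds": pd.date_range("2025-05-15 17:15:24", periods=4, freq="15min"),
    "yhat": [101.0, 101.0, 105.0, 103.0],
})
metrics = forecast_errors(actual, predicted)
print(metrics)
```

With the notebook's objects, the call would be `forecast_errors(test_data, forecast)`, provided the column-name assumption holds.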