# cuDF API

## NVIDIA RAPIDS cuDF for GPU-Accelerated Data Processing

This notebook explores NVIDIA's RAPIDS cuDF library, focusing on GPU-accelerated data processing for financial time series analysis. Using Bitcoin price data as the example dataset, it walks through data fetching, technical indicator computation, visualization, and a simple forecast, and it times the fetch and indicator steps.

### Core Functionality Covered:

- **GPU-Accelerated DataFrame Operations**: Leveraging NVIDIA GPUs for high-speed data manipulation
- **Time Series Processing**: Implementing financial technical indicators with rolling window operations
- **Interoperability**: Seamless conversion between cuDF and pandas for a complete workflow
- **Performance Optimization**: Techniques for maximizing GPU processing efficiency
- **Real-time Data Analysis**: Strategies for processing streaming financial data

### References and Resources:

#### Official Documentation
- **RAPIDS cuDF Documentation**: [Official cuDF API Reference](https://docs.rapids.ai/api/cudf/stable/)
- **Getting Started Guide**: [RAPIDS Installation Guide](https://docs.rapids.ai/install/)
- **RAPIDS GitHub Repository**: [rapidsai/cudf](https://github.com/rapidsai/cudf)

#### Project Documentation
- **Detailed API Guide**: See `notebook/cudf.API.md` for a comprehensive overview
- **Data Source**: [CoinGecko API Documentation](https://www.coingecko.com/en/api/documentation)

#### Academic References
- Raschka, S., Patterson, J., & Nolet, C. (2020). Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. *Information*, 11(4), 193.
- Harris, C.R., Millman, K.J., van der Walt, S.J. et al. (2020). Array programming with NumPy. *Nature*, 585, 357–362.

In [None]:
import os
import sys
import time
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from datetime import datetime, timedelta
from dotenv import load_dotenv
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Add the parent directory to sys.path.
sys.path.append('..')
from utils.cudf_utils import (
    fetch_bitcoin_price, fetch_historical_data, add_to_dataframe,
    compute_moving_averages, compute_volatility, compute_rate_of_change,
    compute_rsi, plot_bitcoin_data, save_to_csv, load_from_csv
)

# Load environment variables.
load_dotenv()

# Import cuDF.
import cudf

## 1. Fetching Historical Bitcoin Data

Let's fetch some historical Bitcoin price data from the CoinGecko API.

In [None]:
# Fetch 90 days of historical data.
days = 90
print(f"Fetching {days} days of historical Bitcoin price data...")

# Start timing.
start_time = time.time()

# Fetch data.
historical_data = fetch_historical_data(days=days)

if historical_data is None or len(historical_data) == 0:
    print("Failed to fetch historical data. Please check your internet connection and try again.")
else:
    fetch_time = time.time() - start_time
    print(f"Successfully fetched {len(historical_data)} historical data points in {fetch_time:.2f} seconds.")
    print(f"Date range: {historical_data['timestamp'].min()} to {historical_data['timestamp'].max()}")
    print(f"Price range: ${historical_data['price'].min():.2f} to ${historical_data['price'].max():.2f}")
    
    # Show the first few rows (a bare expression inside a block is not rendered).
    print(historical_data.head())
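
`fetch_historical_data` is defined in `utils/cudf_utils.py` and is not shown in this notebook. As a rough sketch of the parsing step it likely performs, the block below turns a CoinGecko-style `market_chart` payload (a `prices` list of `[timestamp_ms, price]` pairs) into a frame with the `timestamp` and `price` columns used throughout this notebook. The function name `prices_to_frame` and the sample values are made up for illustration. pandas is used so the sketch runs without a GPU; `cudf.DataFrame.from_pandas` would then move the result to the GPU.

```python
import pandas as pd


def prices_to_frame(payload):
    """Convert a market_chart-style payload into a timestamp/price frame.

    Hypothetical helper: the real parsing lives in utils/cudf_utils.py.
    """
    # Each entry is [milliseconds since the Unix epoch, price in USD].
    frame = pd.DataFrame(payload["prices"], columns=["timestamp", "price"])
    # Interpret the integer milliseconds as datetimes.
    frame["timestamp"] = pd.to_datetime(frame["timestamp"], unit="ms")
    frame["price"] = frame["price"].astype("float64")
    # Return rows in chronological order regardless of payload order.
    return frame.sort_values("timestamp").reset_index(drop=True)


# Made-up sample payload, deliberately out of order.
sample_payload = {
    "prices": [
        [1700092800000, 36200.5],
        [1700006400000, 35500.0],
        [1700179200000, 36900.25],
    ]
}
sample_frame = prices_to_frame(sample_payload)
print(sample_frame)
```

Sorting by `timestamp` up front matters because the rolling-window indicators in the next section assume rows are in chronological order.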

## 2. Computing Technical Indicators

Now, let's compute some technical indicators for our Bitcoin price data.

In [None]:
# Only proceed if we have historical data.
if 'historical_data' in locals() and historical_data is not None and len(historical_data) > 0:
    print("Computing technical indicators...")
    start_time = time.time()
    
    # Apply various technical indicators.
    historical_data = compute_moving_averages(historical_data, windows=[7, 20, 50])
    historical_data = compute_volatility(historical_data, window=20)
    historical_data = compute_rate_of_change(historical_data, periods=[1, 7])
    historical_data = compute_rsi(historical_data, window=14)
    
    compute_time = time.time() - start_time
    print(f"Finished computing indicators in {compute_time:.2f} seconds.")
    
    # Show the data with indicators (a bare expression inside a block is not rendered).
    print(historical_data.head())
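
The indicator helpers are imported from `utils.cudf_utils` and their bodies are not shown here. The sketch below is one plausible reading of what they compute, based on the column names the plotting cells expect (`SMA_<window>`, `volatility`, `ROC_<period>`). It assumes volatility is the rolling standard deviation of price and that rate of change is expressed in percent. The function name `add_basic_indicators` is hypothetical, and pandas is used so the sketch runs without a GPU. cuDF Series offer `rolling(...).mean()`, `rolling(...).std()`, and `shift` with matching semantics, so the same expressions should carry over.

```python
import pandas as pd


def add_basic_indicators(frame, ma_windows=(3,), vol_window=3, roc_periods=(1,)):
    """Return a copy of frame with SMA, volatility, and rate-of-change columns.

    Hypothetical stand-in for the utils.cudf_utils helpers.
    """
    out = frame.copy()
    price = out["price"]
    for window in ma_windows:
        # Simple moving average: mean of the last `window` prices.
        out[f"SMA_{window}"] = price.rolling(window).mean()
    # Volatility is assumed here to be the rolling standard deviation of price.
    out["volatility"] = price.rolling(vol_window).std()
    for period in roc_periods:
        # Percentage change relative to the price `period` rows earlier.
        out[f"ROC_{period}"] = (price / price.shift(period) - 1.0) * 100.0
    return out


demo = pd.DataFrame({"price": [100.0, 110.0, 120.0, 90.0]})
result = add_basic_indicators(demo)
print(result)
```

The first `window - 1` rows of each rolling column are NaN because the window is not yet full, so a 50-row window such as `SMA_50` has no value for the first 49 rows.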

## 3. Visualizing Historical Data with Technical Indicators

Let's create a visualization of our historical Bitcoin price data with the technical indicators we computed.

In [None]:
# Only proceed if we have historical data with indicators.
if 'historical_data' in locals() and historical_data is not None and len(historical_data) > 0:
    print("Generating visualization of historical data...")
    
    # Create the plot.
    fig = plot_bitcoin_data(historical_data, title=f"{days} days Bitcoin Price Analysis with cuDF")
    fig.show()

### Separate Plots for Technical Indicators

Let's create individual plots for each type of technical indicator for clearer analysis.

In [None]:
# Plot Moving Averages.
if 'historical_data' in locals() and historical_data is not None:
    # Convert to pandas for easier plotting.
    df = historical_data.to_pandas()
    
    # Create figure.
    fig_ma = go.Figure()
    
    # Add price.
    fig_ma.add_trace(go.Scatter(
        x=df['timestamp'],
        y=df['price'],
        mode='lines',
        name='Bitcoin Price',
        line=dict(color='#F2A900', width=2)
    ))
    
    # Add moving averages.
    colors = ['#3D9970', '#FF4136', '#7FDBFF']
    for i, window in enumerate([7, 20, 50]):
        col = f'SMA_{window}'
        if col in df.columns:
            fig_ma.add_trace(go.Scatter(
                x=df['timestamp'],
                y=df[col],
                mode='lines',
                name=f'{window}-Day MA',
                line=dict(color=colors[i % len(colors)], width=1.5, dash='dash')
            ))
    
    # Update layout.
    fig_ma.update_layout(
        title='Bitcoin Price with Moving Averages',
        xaxis_title='Date',
        yaxis_title='Price (USD)',
        template='plotly_dark',
        hovermode='x unified',
        legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
    )
    
    fig_ma.show()

In [None]:
# Plot Volatility.
if 'historical_data' in locals() and historical_data is not None and 'volatility' in historical_data.columns:
    # Convert to pandas for easier plotting.
    df = historical_data.to_pandas()
    
    # Create figure.
    fig_vol = go.Figure()
    
    # Add price (scaled down to fit with volatility).
    price_scaling = df['volatility'].mean() / df['price'].mean() * 10
    fig_vol.add_trace(go.Scatter(
        x=df['timestamp'],
        y=df['price'] * price_scaling,
        mode='lines',
        name='Bitcoin Price (scaled)',
        line=dict(color='#F2A900', width=1.5, dash='dot')
    ))
    
    # Add volatility.
    fig_vol.add_trace(go.Scatter(
        x=df['timestamp'],
        y=df['volatility'],
        mode='lines',
        name='Volatility (20-day)',
        line=dict(color='#B10DC9', width=2)
    ))
    
    # Update layout.
    fig_vol.update_layout(
        title='Bitcoin Price Volatility (20-day Standard Deviation)',
        xaxis_title='Date',
        yaxis_title='Volatility',
        template='plotly_dark',
        hovermode='x unified',
        legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
    )
    
    fig_vol.show()

In [None]:
# Plot RSI.
if 'historical_data' in locals() and historical_data is not None and 'RSI' in historical_data.columns:
    # Convert to pandas for easier plotting.
    df = historical_data.to_pandas()
    
    # Create figure.
    fig_rsi = go.Figure()
    
    # Add RSI.
    fig_rsi.add_trace(go.Scatter(
        x=df['timestamp'],
        y=df['RSI'],
        mode='lines',
        name='RSI (14-day)',
        line=dict(color='#FF851B', width=2)
    ))
    
    # Add overbought/oversold lines.
    fig_rsi.add_shape(
        type="line", line_color="red", line_width=1, opacity=0.5,
        x0=df['timestamp'].min(), x1=df['timestamp'].max(), y0=70, y1=70
    )
    fig_rsi.add_shape(
        type="line", line_color="green", line_width=1, opacity=0.5,
        x0=df['timestamp'].min(), x1=df['timestamp'].max(), y0=30, y1=30
    )
    
    # Add annotations.
    fig_rsi.add_annotation(
        x=df['timestamp'].max(),
        y=70,
        text="Overbought",
        showarrow=False,
        yshift=10,
        font=dict(color="red")
    )
    fig_rsi.add_annotation(
        x=df['timestamp'].max(),
        y=30,
        text="Oversold",
        showarrow=False,
        yshift=-10,
        font=dict(color="green")
    )
    
    # Update layout.
    fig_rsi.update_layout(
        title='Bitcoin Relative Strength Index (RSI)',
        xaxis_title='Date',
        yaxis_title='RSI',
        yaxis=dict(range=[0, 100]),
        template='plotly_dark',
        hovermode='x unified'
    )
    
    fig_rsi.show()
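
The RSI plotted above compares average gains with average losses over a window: RSI = 100 - 100 / (1 + RS), where RS is the average gain divided by the average loss. The sketch below uses simple averages over the most recent `window` price changes. Wilder's smoothing is a common alternative, and this notebook does not show which variant `compute_rsi` uses, so treat `simple_rsi` as an illustration rather than a description of the helper.

```python
def simple_rsi(prices, window=14):
    """RSI of the most recent `window` price changes, using simple averages.

    Illustrative only; compute_rsi in utils.cudf_utils may smooth differently.
    """
    if len(prices) < window + 1:
        return None  # Not enough price changes to fill one window.
    recent = prices[-(window + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / window
    avg_loss = sum(-c for c in changes if c < 0) / window
    if avg_loss == 0:
        # No losses in the window: this sketch pins RSI at its maximum,
        # including the degenerate flat-price case.
        return 100.0
    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)


rising = [float(p) for p in range(100, 116)]  # every change is a gain
mixed = [100.0, 102.0, 101.0, 103.0, 102.0]   # gains 2, 2; losses 1, 1
print(simple_rsi(rising))
print(simple_rsi(mixed, window=4))
```

With gains twice the size of losses, RS is 2 and the RSI lands at about 66.7, just under the 70 line drawn as "Overbought" in the plot.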

In [None]:
# Plot Rate of Change.
if 'historical_data' in locals() and historical_data is not None:
    # Convert to pandas for easier plotting.
    df = historical_data.to_pandas()
    
    # Check if ROC columns exist.
    roc_columns = [col for col in df.columns if col.startswith('ROC_')]
    
    if roc_columns:
        # Create figure.
        fig_roc = go.Figure()
        
        # Add ROC for each period.
        colors = ['#FF4136', '#0074D9', '#2ECC40']
        for i, col in enumerate(roc_columns):
            period = col.split('_')[1]
            fig_roc.add_trace(go.Scatter(
                x=df['timestamp'],
                y=df[col],
                mode='lines',
                name=f'{period}-Day Rate of Change',
                line=dict(color=colors[i % len(colors)], width=2)
            ))
        
        # Add zero line.
        fig_roc.add_shape(
            type="line", line_color="gray", line_width=1, opacity=0.5,
            x0=df['timestamp'].min(), x1=df['timestamp'].max(), y0=0, y1=0
        )
        
        # Update layout.
        fig_roc.update_layout(
            title='Bitcoin Price Rate of Change (%)',
            xaxis_title='Date',
            yaxis_title='Rate of Change (%)',
            template='plotly_dark',
            hovermode='x unified',
            legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
        )
        
        fig_roc.show()

## 4. Collecting Real-Time Bitcoin Price Data

Now, let's collect some real-time Bitcoin price data.

In [None]:
# Simulate real-time data collection (for demonstration).
interval = 3  # seconds between data points
num_points = 10  # number of real-time points to collect

print(f"Simulating real-time data collection: {num_points} points, {interval}s interval...")

realtime_data = None
for i in range(num_points):
    price = fetch_bitcoin_price()
    timestamp = datetime.utcnow()
    # add_to_dataframe accepts None for the first point, so no branch is needed.
    realtime_data = add_to_dataframe(realtime_data, timestamp, price)
    print(f"[{i+1}/{num_points}] {timestamp} - ${price:.2f}")
    time.sleep(interval)

print("Real-time data collection complete.")
realtime_data.head()
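
The collection loop above assumes every call to `fetch_bitcoin_price` succeeds. If the helper signals failure by returning `None` (an assumption, mirroring the `None` check applied to `fetch_historical_data` in section 1), the `:.2f` format in the progress message would raise a `TypeError`. The sketch below shows one way to skip failed polls. The name `collect_prices` and the stub fetcher are hypothetical.

```python
import time
from datetime import datetime, timezone


def collect_prices(fetch, num_points, interval_seconds):
    """Poll `fetch` num_points times and keep only successful readings.

    `fetch` is any zero-argument callable returning a price or None.
    """
    rows = []
    for attempt in range(num_points):
        price = fetch()
        if price is not None:
            # Timezone-aware UTC timestamp for each successful reading.
            rows.append((datetime.now(timezone.utc), float(price)))
        if attempt < num_points - 1:
            time.sleep(interval_seconds)  # No need to wait after the last poll.
    return rows


# Stub fetcher standing in for fetch_bitcoin_price: the second call fails.
fake_prices = iter([65000.0, None, 65010.5])
collected = collect_prices(lambda: next(fake_prices), num_points=3, interval_seconds=0)
print(len(collected), "of 3 polls succeeded")
```

Passing the fetcher in as an argument keeps the loop testable without network access. In the notebook it would be called with `fetch_bitcoin_price` itself.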

## 5. Processing and Visualizing Real-Time Data

Let's process and visualize the real-time Bitcoin price data.

In [None]:
# Process real-time data.
if 'realtime_data' in locals() and realtime_data is not None and len(realtime_data) > 0:
    print("Computing technical indicators for real-time data...")
    
    # Adjust window sizes based on the amount of data available.
    window_sizes = [min(3, len(realtime_data)-1), min(5, len(realtime_data)-1)]
    window_sizes = [w for w in window_sizes if w > 0]
    
    if window_sizes:
        # Compute technical indicators with appropriate window sizes.
        realtime_data = compute_moving_averages(realtime_data, windows=window_sizes)
        realtime_data = compute_volatility(realtime_data, window=window_sizes[0])
        realtime_data = compute_rate_of_change(realtime_data, periods=[1])
        
        print("Finished computing indicators for real-time data.")
        
        # Show the processed real-time data (a bare expression inside a block is not rendered).
        print(realtime_data.head(num_points))
    else:
        print("Not enough data points for technical indicators.")

## 6. Combining Historical and Real-Time Data

Now, let's combine historical and real-time data for a comprehensive view.

In [None]:
# Combine historical and real-time data.
if ('historical_data' in locals() and historical_data is not None and len(historical_data) > 0 and
    'realtime_data' in locals() and realtime_data is not None and len(realtime_data) > 0):
    
    # Convert both DataFrames to pandas for easier concatenation.
    hist_pdf = historical_data.to_pandas()
    real_pdf = realtime_data.to_pandas()
    
    # Concatenate the DataFrames.
    combined_pdf = pd.concat([hist_pdf, real_pdf], ignore_index=True)
    
    # Convert back to cuDF.
    combined_data = cudf.DataFrame.from_pandas(combined_pdf)
    
    print(f"Combined data has {len(combined_data)} rows")
    
    # Recalculate technical indicators.
    combined_data = compute_moving_averages(combined_data, windows=[7, 20, 50])
    combined_data = compute_volatility(combined_data, window=20)
    combined_data = compute_rate_of_change(combined_data, periods=[1, 7])
    combined_data = compute_rsi(combined_data, window=14)
    
    # Show the combined data (a bare expression inside a block is not rendered).
    print(combined_data.tail())
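
By this point the historical frame carries indicator columns such as `SMA_7`, while the real-time frame carries short-window ones such as `SMA_3`. Concatenating frames with different columns produces the union of both sets, padded with NaN, and recomputing the indicators afterwards refreshes only the long-window columns. The sketch below shows that effect on made-up data and one way to avoid it. pandas is used so it runs without a GPU. `cudf.concat` follows the `pd.concat` interface, so the same approach may let this step stay on the GPU, though that is not tested here.

```python
import pandas as pd

# Made-up frames: each carries an indicator column the other lacks.
hist = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "price": [42000.0, 42500.0],
    "SMA_7": [41800.0, 41950.0],
})
live = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-03"]),
    "price": [43000.0],
    "SMA_3": [42500.0],
})

# Default outer join: the result has the union of columns, padded with NaN.
naive = pd.concat([hist, live], ignore_index=True)

# Keeping only the raw columns avoids carrying stale, half-empty indicators.
base_cols = ["timestamp", "price"]
clean = pd.concat([hist[base_cols], live[base_cols]], ignore_index=True)

print(naive)
print(clean)
```

Restricting both frames to `timestamp` and `price` before concatenating means every indicator in the combined frame comes from a single recomputation over the full series.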

## Visualizing Combined Data

Let's create a comprehensive visualization of the combined dataset.

In [None]:
# Visualize combined data.
if 'combined_data' in locals() and combined_data is not None and len(combined_data) > 0:
    print("Generating visualization of combined data...")
    
    # Create plot with combined data.
    fig_combined = plot_bitcoin_data(combined_data, 
                                    title="Bitcoin Price Analysis - Historical + Real-time Data")
    
    # Display the plot.
    fig_combined.show()

## Saving the Data

Let's save the processed data to a CSV file for future reference.

In [None]:
# Save combined data to CSV.
if 'combined_data' in locals() and combined_data is not None and len(combined_data) > 0:
    filename = "bitcoin_data_combined.csv"
    save_to_csv(combined_data, filename=filename)
    print(f"Data saved to {filename}")
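
One detail to watch when reloading a saved file: CSV stores timestamps as text, so the column's datetime type is lost unless the reader is told to parse it. How `save_to_csv` and `load_from_csv` handle this is not shown in this notebook. The sketch below demonstrates the behaviour with pandas and an in-memory buffer, using made-up values.

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-02 00:00:00"]),
    "price": [42000.0, 42500.0],
})

# Write to an in-memory buffer instead of a file to keep the sketch side-effect free.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Without parse_dates the timestamp column comes back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With parse_dates the column is restored to a datetime type.
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["timestamp"])

print(as_text.dtypes)
print(as_dates.dtypes)
```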

## Bitcoin Price Forecasting

Finally, let's implement a simple forecasting model to predict future Bitcoin prices.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def forecast_bitcoin_prices(historical_data, forecast_days=30):
    """Forecast Bitcoin prices using historical data
    
    Args:
        historical_data (cudf.DataFrame): Historical Bitcoin price data
        forecast_days (int): Number of days to forecast
    
    Returns:
        tuple: (historical_data_pandas, forecast_data_pandas) as pandas DataFrames
    """
    print(f"Forecasting Bitcoin prices for the next {forecast_days} days...")
    
    # The 30-step lag leaves no training rows unless there are more than 30 points.
    if historical_data is None or len(historical_data) <= 30:
        print("Insufficient historical data for forecasting. Need more than 30 data points.")
        return None, None
    
    # Convert to pandas for forecasting.
    df = historical_data.to_pandas()
    
    # Sort by timestamp.
    df = df.sort_values('timestamp')
    
    # Set timestamp as index.
    df.set_index('timestamp', inplace=True)
    
    # Keep only the price column for basic forecasting.
    price_series = df['price']
    
    # Create features for regression (using lag features).
    X = np.column_stack([
        price_series.shift(1).values[30:],
        price_series.shift(7).values[30:],
        price_series.shift(14).values[30:],
        price_series.shift(30).values[30:],
        price_series.rolling(7).mean().shift(1).values[30:],
        price_series.rolling(14).mean().shift(1).values[30:],
        price_series.rolling(30).mean().shift(1).values[30:],
        price_series.rolling(7).std().shift(1).values[30:],
        price_series.pct_change(periods=1).shift(1).values[30:],
        price_series.pct_change(periods=7).shift(1).values[30:],
    ])
    
    # Target variable.
    y = price_series.values[30:]
    
    # Remove NaN rows.
    valid_indices = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    X_clean = X[valid_indices]
    y_clean = y[valid_indices]
    
    # Standardize features.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_clean)
    
    # Train a linear regression model.
    model = LinearRegression()
    model.fit(X_scaled, y_clean)
    
    print("Model trained. Generating forecast...")
    
    # Prepare data for forecasting.
    forecast_horizon = forecast_days
    forecast_dates = [df.index[-1] + timedelta(days=i+1) for i in range(forecast_horizon)]
    
    # Start from the known prices; each prediction is appended below so that
    # later steps see earlier forecasts in their lag and rolling features.
    forecast_values = []
    forecast_series = price_series.copy()
    
    # Step-by-step (recursive) forecast.
    for i in range(forecast_horizon):
        # Build the features for the next step in the same order as training.
        # The series ends at the last known or predicted price, so iloc[-k]
        # is the k-step lag and the rolling stats cover only past values.
        X_forecast = np.array([[
            forecast_series.iloc[-1],
            forecast_series.iloc[-7],
            forecast_series.iloc[-14],
            forecast_series.iloc[-30],
            forecast_series.rolling(7).mean().iloc[-1],
            forecast_series.rolling(14).mean().iloc[-1],
            forecast_series.rolling(30).mean().iloc[-1],
            forecast_series.rolling(7).std().iloc[-1],
            forecast_series.pct_change(periods=1).iloc[-1],
            forecast_series.pct_change(periods=7).iloc[-1],
        ]])
        
        # Scale the features.
        X_forecast_scaled = scaler.transform(X_forecast)
        
        # Make prediction.
        forecast_price = model.predict(X_forecast_scaled)[0]
        forecast_values.append(forecast_price)
        
        # Append the prediction so the next iteration can use it.
        forecast_series = pd.concat([
            forecast_series,
            pd.Series([forecast_price], index=[forecast_dates[i]]),
        ])
    
    # Create DataFrame with forecasted values.
    forecast_result = pd.DataFrame({'price': forecast_values}, index=forecast_dates)
    
    # Add an approximate band (not a calibrated confidence interval): per-period
    # volatility scaled by sqrt(horizon), applied with the same width at every step.
    volatility = df['price'].pct_change().std() * np.sqrt(forecast_horizon)
    forecast_result['lower_bound'] = forecast_result['price'] * (1 - volatility * 1.96)
    forecast_result['upper_bound'] = forecast_result['price'] * (1 + volatility * 1.96)
    
    print(f"{forecast_days}-day forecast generated with confidence intervals.")
    
    return df, forecast_result

# Run forecasting if we have historical data.
if 'historical_data' in locals() and historical_data is not None and len(historical_data) > 0:
    # Generate forecast.
    historical_df, forecast_df = forecast_bitcoin_prices(historical_data, forecast_days=30)
    
    if historical_df is not None and forecast_df is not None:
        # Show the forecast (a bare expression inside a block is not rendered).
        print(forecast_df.head())
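
The feature matrix in `forecast_bitcoin_prices` is built from shifted copies of the price series, so that each row only sees prices from before its target. The sketch below shows the same alignment on a six-point series, with short lags (1 and 2) standing in for the 1, 7, 14, and 30 used above. The column names are made up for illustration.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="price")

features = pd.DataFrame({
    "lag_1": prices.shift(1),                   # previous price
    "lag_2": prices.shift(2),                   # price two steps back
    "ma_2": prices.rolling(2).mean().shift(1),  # mean of the two prior prices
})
target = prices

# Drop the warm-up rows where any lag or rolling value is undefined.
valid = features.notna().all(axis=1)
X = features[valid].to_numpy()
y = target[valid].to_numpy()

print(X)
print(y)
```

The extra `shift(1)` on the rolling mean is what keeps the target's own price out of its features. Without it, the window ending at row t would include the value being predicted.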

## Visualizing the Forecast

Let's create a visualization of our price forecast with confidence intervals.

In [None]:
def plot_forecast(historical_df, forecast_df, title="Bitcoin Price Forecast"):
    """
    Plot historical data with forecasted prices
    
    Args:
        historical_df (pd.Series): Historical price data with timestamp index
        forecast_df (pd.DataFrame): Forecast data with timestamp index
        title (str): Plot title
    
    Returns:
        plotly.graph_objects.Figure: Plotly figure
    """
    # Create figure.
    fig = go.Figure()
    
    # Add historical price.
    fig.add_trace(go.Scatter(
        x=historical_df.index,
        y=historical_df,
        mode='lines',
        name='Historical Price',
        line=dict(color='#F2A900', width=2)
    ))
    
    # Add forecast line.
    fig.add_trace(go.Scatter(
        x=forecast_df.index,
        y=forecast_df['price'],
        mode='lines',
        name='Forecasted Price',
        line=dict(color='red')
    ))
    
    # Add confidence interval.
    fig.add_trace(go.Scatter(
        x=forecast_df.index.tolist() + forecast_df.index.tolist()[::-1],
        y=forecast_df['upper_bound'].tolist() + forecast_df['lower_bound'].tolist()[::-1],
        fill='toself',
        fillcolor='rgba(255,0,0,0.2)',
        line=dict(color='rgba(255,255,255,0)'),
        name='95% Confidence Interval'
    ))
    
    # Update layout.
    fig.update_layout(
        title=title,
        xaxis_title='Date',
        yaxis_title='Bitcoin Price (USD)',
        hovermode='x unified',
        legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
    )
    
    return fig

# Plot forecast if available.
if 'historical_df' in locals() and 'forecast_df' in locals() and historical_df is not None and forecast_df is not None:
    
    print("Generating forecast visualization...")
    
    # Create forecast plot
    fig_forecast = plot_forecast(
        historical_df['price'], # Pass just the price series, not the whole dataframe
        forecast_df, 
        title="Bitcoin 30-Day Price Forecast with 95% Confidence Interval"
    )
    
    # Display the plot.
    fig_forecast.show()
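
The shaded band above has the same relative width at every forecast step, because `forecast_bitcoin_prices` scales volatility once by the square root of the full horizon. A common alternative, sketched below, widens the band step by step. It assumes independent per-period returns with constant volatility, which is a simplification and not a calibrated interval. The function name `horizon_bands` and the sample numbers are hypothetical.

```python
import math


def horizon_bands(forecast, daily_volatility, z=1.96):
    """Return (lower, upper) lists whose width grows with sqrt(step).

    Assumes independent daily returns with constant volatility, which is
    a simplification rather than a calibrated interval.
    """
    lower, upper = [], []
    for step, price in enumerate(forecast, start=1):
        # Uncertainty accumulates with the square root of elapsed steps.
        spread = z * daily_volatility * math.sqrt(step)
        lower.append(price * (1.0 - spread))
        upper.append(price * (1.0 + spread))
    return lower, upper


# Made-up flat forecast and 2% daily volatility.
lower, upper = horizon_bands([100.0, 100.0, 100.0, 100.0], daily_volatility=0.02)
for step, (lo, hi) in enumerate(zip(lower, upper), start=1):
    print(step, round(lo, 2), round(hi, 2))
```

The resulting lists can be assigned to the `lower_bound` and `upper_bound` columns that `plot_forecast` reads, giving a band that is narrow for the first forecast step and widest at the last.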

## Conclusion

In this notebook, we've demonstrated a complete workflow for Bitcoin price data analysis using GPU-accelerated cuDF:

1. **Data Acquisition**: Fetched historical and real-time Bitcoin price data
2. **GPU-Accelerated Processing**: Calculated technical indicators with cuDF
3. **Visualization**: Created interactive charts of price trends and indicators
4. **Forecasting**: Implemented a simple model to predict future prices

The timings printed in sections 1 and 2 give a rough per-step cost for this workflow; a direct comparison against pandas is left to the performance benchmarks listed below.

For more information, refer to:
- The cuDF API documentation in `cudf.API.ipynb` and `cudf.API.md`
- Performance benchmarks in `performance_comparison.ipynb` and `performance_comparison.md`