# A1: Individual Assignment - Predictive Modeling for Cryptocurrency Investment Strategies

*Work by Marta Martins*

Dataset chosen: https://www.kaggle.com/datasets/jessevent/all-crypto-currencies

The goal is to explore trends forecasting, return classification and porfolio optimization using machine learning techniques.

In [1]:
# Core Python Libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff

# Data Sources
import yfinance as yf

# Machine Learning: Models & Evaluation
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    mean_squared_error,
    classification_report,
    confusion_matrix,
    accuracy_score
)

In [2]:
# 1. Read the cryptocurrency dataset from Kaggle using kagglehub
import kagglehub

# 2. Download the latest version of the dataset
path = kagglehub.dataset_download("jessevent/all-crypto-currencies")
print("Path to dataset files:", path)

# 3. Load the dataset (example: crypto-markets.csv is the main file)
crypto = pd.read_csv(f"{path}/crypto-markets.csv")

# 4. Confirm successful load and preview the data
print("Cryptocurrency dataset loaded successfully!")
print("Shape of the dataset:", crypto.shape)
print("First 5 rows:")
display(crypto.head())

Path to dataset files: /kaggle/input/all-crypto-currencies
Cryptocurrency dataset loaded successfully!
Shape of the dataset: (942297, 13)
First 5 rows:


Unnamed: 0,slug,symbol,name,date,ranknow,open,high,low,close,volume,market,close_ratio,spread
0,bitcoin,BTC,Bitcoin,2013-04-28,1,135.3,135.98,132.1,134.21,0.0,1488567000.0,0.5438,3.88
1,bitcoin,BTC,Bitcoin,2013-04-29,1,134.44,147.49,134.0,144.54,0.0,1603769000.0,0.7813,13.49
2,bitcoin,BTC,Bitcoin,2013-04-30,1,144.0,146.93,134.05,139.0,0.0,1542813000.0,0.3843,12.88
3,bitcoin,BTC,Bitcoin,2013-05-01,1,139.0,139.89,107.72,116.99,0.0,1298955000.0,0.2882,32.17
4,bitcoin,BTC,Bitcoin,2013-05-02,1,116.38,125.6,92.28,105.21,0.0,1168517000.0,0.3881,33.32


## Data Pre-Processing

In [3]:
# Check for nulls
crypto.isnull().sum()

Unnamed: 0,0
slug,0
symbol,0
name,0
date,0
ranknow,0
open,0
high,0
low,0
close,0
volume,0


In [4]:
# Check for duplicates
crypto.duplicated().sum()

np.int64(0)

In [5]:
# Check data types
crypto.dtypes

Unnamed: 0,0
slug,object
symbol,object
name,object
date,object
ranknow,int64
open,float64
high,float64
low,float64
close,float64
volume,float64


In [6]:
# Summary Statistics
crypto.describe()

Unnamed: 0,ranknow,open,high,low,close,volume,market,close_ratio,spread
count,942297.0,942297.0,942297.0,942297.0,942297.0,942297.0,942297.0,942297.0,942297.0
mean,1000.170608,348.3522,408.593,296.2526,346.1018,8720383.0,172506000.0,0.459499,112.34
std,587.575283,13184.36,16163.86,10929.31,13098.22,183980200.0,3575590000.0,0.32616,6783.713
min,1.0,2.5e-09,3.2e-09,2.5e-10,2e-10,0.0,0.0,-1.0,0.0
25%,465.0,0.002321,0.002628,0.002044,0.002314,175.0,29581.0,0.1629,0.0
50%,1072.0,0.023983,0.026802,0.021437,0.023892,4278.0,522796.0,0.4324,0.0
75%,1484.0,0.22686,0.250894,0.204391,0.225934,119090.0,6874647.0,0.7458,0.03
max,2072.0,2298390.0,2926100.0,2030590.0,2300740.0,23840900000.0,326502500000.0,1.0,1770563.0


In [7]:
# Convert timestamp to datetime
crypto['date'] = pd.to_datetime(crypto['date'])

In [8]:
crypto.head()

Unnamed: 0,slug,symbol,name,date,ranknow,open,high,low,close,volume,market,close_ratio,spread
0,bitcoin,BTC,Bitcoin,2013-04-28,1,135.3,135.98,132.1,134.21,0.0,1488567000.0,0.5438,3.88
1,bitcoin,BTC,Bitcoin,2013-04-29,1,134.44,147.49,134.0,144.54,0.0,1603769000.0,0.7813,13.49
2,bitcoin,BTC,Bitcoin,2013-04-30,1,144.0,146.93,134.05,139.0,0.0,1542813000.0,0.3843,12.88
3,bitcoin,BTC,Bitcoin,2013-05-01,1,139.0,139.89,107.72,116.99,0.0,1298955000.0,0.2882,32.17
4,bitcoin,BTC,Bitcoin,2013-05-02,1,116.38,125.6,92.28,105.21,0.0,1168517000.0,0.3881,33.32


In [9]:
# Check for unique values
crypto['name'].unique()

array(['Bitcoin', 'XRP', 'Ethereum', ..., '42-coin', 'Bit20', 'Project-X'],
      dtype=object)

In [10]:
# Count how many unique values - how many unique cryptocurrencies
crypto['name'].nunique()

2071

Since analyzing all of the 2071 cryptocurrencies in the dataset would increase the runtime and probably lead to biased results due to data sparsity, I decided to focus most of my analysis on the top 10 assets, filtered by ranking ('ranknow').

Therefore, analyzing a smaller, high-quality subset enhances efficiency and reduces the risk of overfitting in predictive models.

In [11]:
# Top 10 coins per ranking
# Compute average rank per symbol and sort
top_10_by_rank = (
    crypto.groupby('symbol')['ranknow']
    .mean()
    .sort_values()
    .head(10)
    .index
    .tolist()
)

print("Top 10 coins by average rank:", top_10_by_rank)

Top 10 coins by average rank: ['BTC', 'XRP', 'ETH', 'XLM', 'BCH', 'EOS', 'LTC', 'USDT', 'BSV', 'ADA']


In [12]:
# Create a dataset for those 10 coins
top_ranked = crypto[crypto['symbol'].isin(top_10_by_rank)]

In [13]:
fig = px.line(
    top_ranked,
    x='date',
    y='market',
    color='symbol',
    title='Market Cap Evolution of Top 10 Ranked Cryptocurrencies',
    labels={'market': 'Market Cap (USD)', 'date': 'Date'}
)
fig.show()

## Feature Engineering & Data Augmentation

- Rolling statistics
- Moving Averages

In [14]:
# Initialize an empty list to store processed dataframes
augmented_dfs = []

# Loop through each top coin individually
for coin in top_10_by_rank:
    coin = crypto[crypto['symbol'] == coin].copy()
    coin.sort_values('date', inplace=True)

    # Daily return - percentage change in price from previous day
    coin['daily_return'] = coin['close'].pct_change()

    # Rolling average and volatility (7-day)
    coin['rolling_mean_7d'] = coin['daily_return'].rolling(window=7).mean() # average return over last 7 days
    coin['rolling_std_7d'] = coin['daily_return'].rolling(window=7).std() # volatility over last 7 days

    # Sharpe ratio approximation (risk-free rate ~ 0)
    coin['sharpe_ratio_7d'] = coin['rolling_mean_7d'] / coin['rolling_std_7d'] # approximate risk adjusted return

    # Append to list
    augmented_dfs.append(coin)

# Combine all into a single dataframe
crypto_augmented = pd.concat(augmented_dfs, ignore_index=True)

print("Feature engineering complete! New columns added:")
crypto_augmented[['symbol', 'date', 'daily_return', 'rolling_mean_7d', 'rolling_std_7d', 'sharpe_ratio_7d']].head()

Feature engineering complete! New columns added:


Unnamed: 0,symbol,date,daily_return,rolling_mean_7d,rolling_std_7d,sharpe_ratio_7d
0,BTC,2013-04-28,,,,
1,BTC,2013-04-29,0.076969,,,
2,BTC,2013-04-30,-0.038328,,,
3,BTC,2013-05-01,-0.158345,,,
4,BTC,2013-05-02,-0.100692,,,


### Feature Engineering Summary

To enhance the dataset for financial analysis, we engineered features that reflect daily performance and short-term risk:
- **Daily return**: Day-to-day price change
- **Rolling 7-day mean**: Trend indicator
- **Rolling 7-day std**: Volatility measure
- **Sharpe ratio (7-day)**: Risk-adjusted performance proxy

These new features enable deeper insight into how top-ranked cryptocurrencies behave over time, facilitating later risk-return evaluations.


### Data Visualization of these new variables

In [15]:
# Line plot of daily returns over time for the top ranked coins
fig = px.line(
    crypto_augmented,
    x='date',
    y='daily_return',
    color='symbol',
    title='Daily Return of Top 10 Cryptocurrencies Over Time',
    labels={'daily_return': 'Daily Return (%)', 'date': 'Date'}
)
fig.show()

In [16]:
# Line plot of rolling 7-day volatility
fig = px.line(
    crypto_augmented,
    x='date',
    y='rolling_std_7d',
    color='symbol',
    title='7-Day Rolling Volatility of Top 10 Cryptocurrencies',
    labels={'rolling_std_7d': '7-Day Volatility', 'date': 'Date'}
)
fig.show()

In [17]:
# Correlation Matrix
# Pivot daily returns into a wide format (symbol as columns, date as index)
returns_wide = (
    crypto_augmented
    .pivot(index='date', columns='symbol', values='daily_return')
)

# Drop NaNs to avoid bias in correlation
returns_wide_clean = returns_wide.dropna()

# Compute correlation
correlation_matrix = returns_wide_clean.corr()

In [18]:
# Correlation Matrix with threshold
corr = returns_wide_clean.corr(numeric_only=True)
threshold = 0.5
filtercorr = corr[abs(corr) > threshold]
fig = px.imshow(filtercorr, text_auto=".3f")
fig.show()

In [19]:
# Correlation Matrix between features
# Select relevant numerical features
features_for_corr = [
    'close',
    'volume',
    'market',
    'daily_return',
    'rolling_mean_7d',
    'rolling_std_7d',
    'sharpe_ratio_7d',
]

# Calculate the correlation matrix for the selected features
feature_correlation = crypto_augmented[features_for_corr].corr()

# Display the correlation matrix as a heatmap
fig = ff.create_annotated_heatmap(
    z=feature_correlation.values.round(2),
    x=feature_correlation.columns.tolist(),
    y=feature_correlation.index.tolist(),
    colorscale='RdBu',
    zmin=-1,
    zmax=1,
    showscale=True,
)
fig.update_layout(title='Correlation Matrix of Selected Features')
fig.show()

In [20]:
# Correlation Matrix based on daily return
# Check the impact of different features on daily return

# Focus on correlations with 'daily_return'
daily_return_correlations = feature_correlation[['daily_return']].sort_values(
    by='daily_return', ascending=False
)

print("\nCorrelation of Features with Daily Return:")
print(daily_return_correlations)

# Visualize the correlations of 'daily_return' with other features
fig = px.bar(
    daily_return_correlations,
    y='daily_return',
    x=daily_return_correlations.index,
    title='Correlation with Daily Return',
    labels={'y': 'Correlation Coefficient', 'x': 'Feature'},
)
fig.update_layout(yaxis_range=[-1, 1])  # Set y-axis range for clarity
fig.show()


Correlation of Features with Daily Return:
                 daily_return
daily_return         1.000000
rolling_mean_7d      0.415506
sharpe_ratio_7d      0.317931
rolling_std_7d       0.194302
volume               0.020571
market               0.003359
close               -0.000638


**Summary:**

The goal of the exploratory data analysis is to understand return behavior, volatility and correlations between key features.

Most cryptocurrencies show high volatility, especially between 2017 and early 2018, reflecting the speculative nature of the market during crypto boom and corrections.

BTC and ETH tend to exhbity a little bit more stability, while lower-cap assets like BSV show sharper peaks.

The correlation heatmap confirms that major cryptocurrencies tend to move together. BSV and USDT are outliers, showing lower and even negative correlation with other tickers.


Overall, crypto markets remain unstable and major coins move together. This analysis should inform both predictive modeling and porfolio strategy.


## Hypothesis Testing


In [21]:
# Check the top 10 rank
crypto.groupby('symbol')['ranknow'].mean().sort_values().reset_index().head(10)

Unnamed: 0,symbol,ranknow
0,BTC,1.0
1,XRP,2.0
2,ETH,3.0
3,XLM,4.0
4,BCH,5.0
5,EOS,6.0
6,LTC,7.0
7,USDT,8.0
8,BSV,9.0
9,ADA,10.0


In [22]:
# Define top 10 cryptocurrencies by Yahoo Finance tickers
assets = ['BTC-USD', 'ETH-USD', 'XRP-USD', 'XLM-USD', 'USDT-USD',
          'LTC-USD', 'EOS-USD', 'BSV-USD', 'BCH-USD', 'ADA-USD']

# Define the period
period = '10y'

results = []

for symbol in tqdm(assets):
    try:
        # Download 5 years of daily data
        df = yf.download(symbol, period=period)
        df = df[['Close']].dropna().copy()
        df['Return'] = df['Close'].pct_change()

        # Lag features to simulate momentum
        df['Lag1'] = df['Return'].shift(1)
        df['Lag2'] = df['Return'].shift(2)
        df['Lag3'] = df['Return'].shift(3)

        # Drop missing
        df.dropna(inplace=True)

        # Define features and target
        X = df[['Lag1', 'Lag2', 'Lag3']]
        y = df['Return']

        # Split data (80% train, 20% test)
        split = int(0.8 * len(df))
        X_train, X_test = X[:split], X[split:]
        y_train, y_test = y[:split], y[split:]

        # Train model
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)

        # Predict
        y_pred = model.predict(X_test)

        # Evaluate RMSE
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        results.append({'symbol': symbol, 'rmse': rmse})

    except Exception as e:
        print(f"Error with {symbol}: {e}")


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to Tru

In [23]:
# Create results DataFrame
results = pd.DataFrame(results)

# Add volatility calculation for hypothesis 1
vol_list = []

for symbol in assets:
    try:
        df = yf.download(symbol, period=period)
        df['Return'] = df['Close'].pct_change()
        volatility = df['Return'].std()
        vol_list.append({'symbol': symbol, 'volatility': volatility})
    except:
        continue

vol = pd.DataFrame(vol_list)

# Merge RMSE and Volatility
final = pd.merge(results, vol, on='symbol')
final


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed

YF.download() has changed argument auto_adjust default to Tru

Unnamed: 0,symbol,rmse,volatility
0,BTC-USD,0.027308,0.035614
1,ETH-USD,0.040357,0.045468
2,XRP-USD,0.049584,0.060987
3,XLM-USD,0.058474,0.059409
4,USDT-USD,0.000564,0.003676
5,LTC-USD,0.040815,0.052029
6,EOS-USD,0.054596,0.060181
7,BSV-USD,0.05286,0.07138
8,BCH-USD,0.05185,0.059628
9,ADA-USD,0.067922,0.06683


#### **Hypothesis 1:**

Among the top 10 cryptocurrencies, assets with lower historical volatility produce more accurate return forecasts and deliver superior risk-adjusted returns in long-term investment strategies.

*NOTE*: Analysis is based only on the Kaggle dataset.

In [24]:
# Group by coin symbol and compute average 7-day volatility and average Sharpe ratio
volatility_summary = crypto_augmented.groupby('symbol')['rolling_std_7d'].mean().reset_index(name='avg_volatility_7d')
sharpe_summary = crypto_augmented.groupby('symbol')['sharpe_ratio_7d'].mean().reset_index(name='avg_sharpe_ratio_7d')

# Merge into a single DataFrame
summary = pd.merge(volatility_summary, sharpe_summary, on='symbol')

# Sort by volatility
summary_sorted = summary.sort_values(by='avg_volatility_7d', ascending=True)

# Display
print(summary_sorted)

  symbol  avg_volatility_7d  avg_sharpe_ratio_7d
7   USDT           0.003944             0.001766
3    BTC           0.034862             0.097599
6    LTC           0.051097            -0.019784
5    ETH           0.057713             0.045032
9    XRP           0.057732            -0.055060
8    XLM           0.065947            -0.041066
4    EOS           0.078497            -0.035199
1    BCH           0.079193            -0.055900
0    ADA           0.082500            -0.095291
2    BSV           0.298419             0.040731


In [25]:
# Scatter Plot - Sharpe Ration vs. Volatility

fig = px.scatter(summary_sorted,
                 x='avg_volatility_7d',
                 y='avg_sharpe_ratio_7d',
                 text='symbol',
                 title='Avg Sharpe Ratio vs Avg Volatility (Top 10 Cryptos)',
                 labels={'avg_volatility_7d': 'Avg 7-Day Volatility', 'avg_sharpe_ratio_7d': 'Avg Sharpe Ratio'},
                 width=800, height=500)

fig.update_traces(textposition='top center')
fig.show()

In [26]:
# Filter for selected coins and date range
selected_symbols = ['ETH', 'USDT', 'BSV', 'ADA']
start_date = '2015-01-01'
end_date = '2018-12-31'

filtered_df = crypto_augmented[
    (crypto_augmented['symbol'].isin(selected_symbols)) &
    (crypto_augmented['date'] >= start_date) &
    (crypto_augmented['date'] <= end_date)
].copy()

# Calculate 90-day rolling Sharpe Ratio: mean / std of daily returns
filtered_df['rolling_mean_90'] = filtered_df.groupby('symbol')['daily_return'].transform(lambda x: x.rolling(window=90).mean())
filtered_df['rolling_std_90'] = filtered_df.groupby('symbol')['daily_return'].transform(lambda x: x.rolling(window=90).std())
filtered_df['sharpe_90'] = filtered_df['rolling_mean_90'] / filtered_df['rolling_std_90']

# Plot
fig = px.line(
    filtered_df,
    x='date',
    y='sharpe_90',
    color='symbol',
    title='Rolling 90-Day Sharpe Ratio (2015–2018)',
    labels={'sharpe_90': 'Sharpe Ratio', 'date': 'Date', 'symbol': 'Crypto Symbol'}
)

fig.show()


**Interpretation:**

For this graph, only 3 cryptocurrencies were selected in order to make the comparison easier.

- ETH -> medium volatility and high sharpe ratio
- USDT -> low volatility and low sharpe ratio
- BSV -> high volatility and medium sharpe ratio
- ADA -> low / medium volatility and low sharpe ratio



---

**Hypothesis 1 Summary:**

The hypothesis is partially supported:


1.   Moderate volatility alongside with steady return behavior provides more benefitial conditions for long-term forecasting and portfolio inclusion (like for BTC and ETH).
2.   However, outliers like BSV suggest that volatility does not perfectly represent risk-adjusted return.



#### **Hypothesis 2:**

**Investment Timing Strategy**

Short-term momentum in daily returns of major cryptocurrencies can be leveraged to forecast 5-year return patterns, supporting the use of lag-based ML models in strategic asset allocation.

*NOTE*: Analysis is based on the Kaggle dataset and data from Yahoo Finance.

In [27]:
# Define lag features
lag_days = 7
lag_features = [f'return_lag_{i}' for i in range(1, lag_days + 1)]

# Initialize results dictionary
results = {}

# Loop through top 10 coins
top10 = crypto_augmented['symbol'].value_counts().index[:10]  # Adjust if you have a specific top 10 list

for coin in top10:
    # Subset and sort data
    coin_df = crypto_augmented[crypto_augmented['symbol'] == coin].copy()
    coin_df = coin_df.sort_values('date')

    # Create lagged return features
    for lag in range(1, lag_days + 1):
        coin_df[f'return_lag_{lag}'] = coin_df['daily_return'].shift(lag)

    # Define binary target: 1 if next-day return > 0, else 0
    coin_df['target'] = (coin_df['daily_return'].shift(-1) > 0).astype(int)

    # Drop NaN values (from lags)
    coin_df.dropna(subset=lag_features + ['target'], inplace=True)

    # Define features and target
    X = coin_df[lag_features]
    y = coin_df['target']

    # Train-test split (no shuffle to preserve time order)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

    # Train Random Forest
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)

    # Store accuracy and feature importance
    results[coin] = {
        'accuracy': acc,
        'precision': report['1']['precision'],
        'recall': report['1']['recall'],
        'f1-score': report['1']['f1-score'],
        'feature_importances': clf.feature_importances_
    }

# Display results summary
results_df = pd.DataFrame(results).T.reset_index().rename(columns={'index': 'Coin'})
print(results_df)

   Coin  accuracy precision    recall  f1-score  \
0   BTC  0.513514   0.53629  0.615741  0.573276   
1   LTC  0.513514  0.464052  0.379679  0.417647   
2   XRP  0.533505  0.502959  0.467033   0.48433   
3   XLM  0.563694  0.519608  0.375887  0.436214   
4  USDT  0.593407  0.466667   0.12844  0.201439   
5   ETH   0.46888  0.402062  0.357798  0.378641   
6   EOS  0.441176  0.294118       0.1  0.149254   
7   BCH  0.540816  0.387097  0.315789  0.347826   
8   ADA  0.571429  0.454545  0.294118  0.357143   
9   BSV  0.333333       0.0       0.0       0.0   

                                 feature_importances  
0  [0.1437728730210368, 0.1390499518384564, 0.138...  
1  [0.14281406918374917, 0.14368924817703355, 0.1...  
2  [0.1438957646442564, 0.1388734814130292, 0.138...  
3  [0.14360910833407994, 0.14075589941635483, 0.1...  
4  [0.1906516882709939, 0.1523017324748985, 0.149...  
5  [0.1393060656036267, 0.14274937547764316, 0.14...  
6  [0.1362281978620007, 0.13691767856160025, 0.14... 


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



The results of the classification model show that only some coins have sufficient data history and returned classification accuracies near 50%, with reasonable F-1 scores -> BTC, ETH, BCH, LTC, XRP and XLM.

On the other side, cryptocurrencies like USDT, ADA and BSV are going to be excluded on the following analysis due to either:
- near 0 price fluctuations, like USDT (a stablecoin)
- extremely imbalanced classes, like BSV, never producing upward returns in test set
- unreliable patters that prevented the model from learning useful signs.

Like this, the focus on meaningful cryptocurrencies ensures that the forecasting analysis is based on valid, interpretable results with potential real-world application in portfolio strategy.

In [28]:
# Filter only meaningful coins
meaningful = ['BTC', 'ETH', 'XRP', 'XLM', 'LTC', 'BCH']
filtered_results = results_df[results_df['Coin'].isin(meaningful)]

# Accuracy bar plot
fig = px.bar(filtered_results,
             x='Coin',
             y='accuracy',
             title='Classification Accuracy by Coin (Lag-Based Prediction)',
             labels={'accuracy': 'Accuracy'},
             hover_data={
                 'accuracy': True,
                 'precision': True,
                 'recall': True,
                 'f1-score': True
             })
fig.show()

# Hove over to see more metrics !

In [29]:
# Filter Kaggle data for training (2015–2018)
eth_kaggle_train = crypto_augmented[
    (crypto_augmented['symbol'] == 'ETH') &
    (crypto_augmented['date'] >= '2015-01-01') &
    (crypto_augmented['date'] <= '2018-12-31')
].copy()

# Create lag features
for lag in range(1, 8):
    eth_kaggle_train[f'return_lag_{lag}'] = eth_kaggle_train['daily_return'].shift(lag)

eth_kaggle_train = eth_kaggle_train.dropna()

# Define features and target
lag_features = [f'return_lag_{i}' for i in range(1, 8)]
X_train = eth_kaggle_train[lag_features]
y_train = (eth_kaggle_train['daily_return'].shift(-1) > 0).astype(int)

In [30]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

In [31]:
# Get data from Yahoo for ETH
# Download Ethereum data from Yahoo Finance
eth_yahoo = yf.download('ETH-USD', start='2019-01-01', end='2024-12-31', auto_adjust=False)

# Reset index to get 'Date' as a column
eth_yahoo.reset_index(inplace=True)

# Keep only the columns you need
eth_yahoo = eth_yahoo[['Date', 'Adj Close']]
eth_yahoo.rename(columns={'Date': 'date', 'Adj Close': 'close'}, inplace=True)

# Calculate daily return
eth_yahoo['daily_return'] = eth_yahoo['close'].pct_change()

[*********************100%***********************]  1 of 1 completed


In [32]:
# Already downloaded + has daily_return
eth_yahoo = eth_yahoo.sort_values('date').copy()

# Create lag features on Yahoo data
for lag in range(1, 8):
    eth_yahoo[f'return_lag_{lag}'] = eth_yahoo['daily_return'].shift(lag)

# Filter from 2019 onward and drop NaNs
eth_yahoo_test = eth_yahoo[eth_yahoo['date'] >= '2019-01-01'].copy()

In [33]:
eth_yahoo_test['predicted_signal'] = clf.predict(eth_yahoo_test[lag_features])


X does not have valid feature names, but RandomForestClassifier was fitted with feature names



In [34]:
# Strategy = only act when signal = 1
eth_yahoo_test['strategy_return'] = eth_yahoo_test['predicted_signal'] * eth_yahoo_test['daily_return']

# Cumulative returns
eth_yahoo_test['cumulative_strategy'] = (1 + eth_yahoo_test['strategy_return']).cumprod()
eth_yahoo_test['cumulative_actual'] = (1 + eth_yahoo_test['daily_return']).cumprod()

In [35]:
# Plot
fig = px.line(
    eth_yahoo_test,
    x='date',
    y=['cumulative_actual', 'cumulative_strategy'],
    title='ETH: Actual vs Strategy Cumulative Return (2019–2024)',
    labels={'value': 'Cumulative Return', 'date': 'Date'}
)

fig.update_layout(legend_title_text='Return Type')
fig.show()


**Interpretation of the results, using ETH as an example**

The results shown above suport partially hypothesis 2.

The cumulative return chart from 2019 to the end of 2024 shows the performance of the machine learning based strategy (the red line) versus a passive holding of ETH (blue line).

In most periods, the strategy exhibitis upward trend that aligns with ETH's actual return spikes, indicating that the lag-based Random Forest model is able to anticipate some of the trends.

Overall, the model suggests that even a relatively short-term momemtum analysis can serve as a risk-controlled strategy that responds adaptively to changing market conditions.


In [36]:
# Let's see how the model behaves for crypto XLM (the one with the higher accuracy)
# Filter Kaggle data for XLM from 2015–2018
xlm_kaggle_train = crypto_augmented[
    (crypto_augmented['symbol'] == 'XLM') &
    (crypto_augmented['date'] >= '2015-01-01') &
    (crypto_augmented['date'] <= '2018-12-31')
].copy()

# Create 7 lag features
for lag in range(1, 8):
    xlm_kaggle_train[f'return_lag_{lag}'] = xlm_kaggle_train['daily_return'].shift(lag)

xlm_kaggle_train = xlm_kaggle_train.dropna()

# Define training features and target
lag_features = [f'return_lag_{i}' for i in range(1, 8)]
X_train_xlm = xlm_kaggle_train[lag_features]
y_train_xlm = (xlm_kaggle_train['daily_return'].shift(-1) > 0).astype(int)

In [37]:
# Train the XLM Model

clf_xlm = RandomForestClassifier(random_state=42)
clf_xlm.fit(X_train_xlm, y_train_xlm)

In [38]:
# 1. Download data from Yahoo Finance
xlm_yahoo = yf.download('XLM-USD', start='2015-01-01', end='2024-12-31', auto_adjust=False)
xlm_yahoo = xlm_yahoo.reset_index()
xlm_yahoo = xlm_yahoo[['Date', 'Adj Close']].rename(columns={'Date': 'date', 'Adj Close': 'close'})

# 2. Create daily return column
xlm_yahoo['daily_return'] = xlm_yahoo['close'].pct_change()

# 3. Create lag features
for lag in range(1, 8):
    xlm_yahoo[f'return_lag_{lag}'] = xlm_yahoo['daily_return'].shift(lag)

# 4. Define list of lag feature names
lag_features = [f'return_lag_{i}' for i in range(1, 8)]

# 5. Drop rows with NaNs in any lag features
xlm_yahoo_test = xlm_yahoo[xlm_yahoo['date'] >= '2019-01-01'].copy()

[*********************100%***********************]  1 of 1 completed


In [39]:
# Predict signals using trained model
# Flatten the columns if they are multi-indexed (like after downloading from Yahoo)
xlm_yahoo_test.columns = [col[0] if isinstance(col, tuple) else col for col in xlm_yahoo_test.columns]

# Now ensure the lag_features are correctly defined
lag_features = [f'return_lag_{i}' for i in range(1, 8)]

# Predict using the trained model
xlm_yahoo_test['predicted_signal'] = clf_xlm.predict(xlm_yahoo_test[lag_features])

In [40]:
# Compute Strategy vs Actual Returns
xlm_yahoo_test['strategy_return'] = xlm_yahoo_test['predicted_signal'] * xlm_yahoo_test['daily_return']
xlm_yahoo_test['cumulative_strategy'] = (1 + xlm_yahoo_test['strategy_return']).cumprod()
xlm_yahoo_test['cumulative_actual'] = (1 + xlm_yahoo_test['daily_return']).cumprod()

In [41]:
# Plot the results
fig = px.line(
    xlm_yahoo_test,
    x='date',
    y=['cumulative_actual', 'cumulative_strategy'],
    title='XLM: Actual vs Strategy Cumulative Return (2019–2024)',
    labels={'value': 'Cumulative Return', 'date': 'Date'}
)

fig.update_layout(legend_title_text='Return Type')
fig.show()

**Interpretation of the results, using XLM as an example:**

The accuracy for the cryptocurrency XLM is higher. As illustrated in the graph, the strategy line (in red) tracks the general movement of the actual return more closely than in the case of ETH.

While the model doesn't fully capture the extreme peaks, if effectively mirrors many upward and downward trends.

Overall, the line chart suggests that the model captured underlying return patterns in XLM's market behavior with moderate predictive success.


---

**Hypothesis 2: Summary**

THe overall results partially support hypothesis 2, bearing in mind that the hypothesis stated that a machine learning model trained on short-term lagged features could predict patterns in unseen data for different cryptocurrencies.

The model created shows predictive potential, but its effectivenes varies accross different cryptocurrencies and market conditions, reinforcing the need for asset-specific modeling strategies and possible more complex features or hybrid approaches in future work.




### Hypothesis 3: Portfolio Diversification
Correlations among top cryptocurrencies suggest that combining low-volatility assets like USDT with high-volatility assets like BTC enhances the Sharpe ratio of a diversified portfolio.


*NOTE*: Analysis is based only on the Kaggle dataset.


In [42]:
# Filter the crypto dataset only for BTC and USDT
crypto_btc_usdt = crypto[(crypto['symbol'] == 'BTC') | (crypto['symbol'] == 'USDT')].copy()

In [43]:
crypto_btc_usdt.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3411 entries, 0 to 11197
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   slug         3411 non-null   object        
 1   symbol       3411 non-null   object        
 2   name         3411 non-null   object        
 3   date         3411 non-null   datetime64[ns]
 4   ranknow      3411 non-null   int64         
 5   open         3411 non-null   float64       
 6   high         3411 non-null   float64       
 7   low          3411 non-null   float64       
 8   close        3411 non-null   float64       
 9   volume       3411 non-null   float64       
 10  market       3411 non-null   float64       
 11  close_ratio  3411 non-null   float64       
 12  spread       3411 non-null   float64       
dtypes: datetime64[ns](1), float64(8), int64(1), object(3)
memory usage: 373.1+ KB


In [44]:
# Pivot: rows = date, columns = symbol, values = close price
prices = crypto_btc_usdt.pivot(index='date', columns='symbol', values='close')

# Sort by date
prices = prices.sort_index()

# Drop any rows with missing data
prices = prices.dropna()

prices.head()

symbol,BTC,USDT
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-02-25,237.47,1.21
2015-02-26,236.43,1.21
2015-03-02,275.67,0.606502
2015-03-03,281.7,0.606229
2015-03-06,272.72,1.0


In [45]:
# Daily returns for BTC and USDT
returns = prices.pct_change().dropna()
returns.head()

symbol,BTC,USDT
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-02-26,-0.00438,0.0
2015-03-02,0.165969,-0.498759
2015-03-03,0.021874,-0.00045
2015-03-06,-0.031878,0.649542
2015-03-07,0.01298,0.0


In [46]:
# Correlation Heatmap
correlation_matrix = returns.corr()

fig = px.imshow(correlation_matrix,text_auto='.3f')

fig.update_layout(
    title='Correlation between BTC and USDT',
#    xaxis_title='Cryptocurrency',
#    yaxis_title='Cryptocurrency'
)
fig.show()

Very low negative correlation between BTC and USDT -> suggests that combining both cryptocurrencies could diversify the risk without sacrificing too much performance.

In [47]:
# Compute Cumulative Returns for BTC and USDT
cumulative_returns_btc = (1 + returns['BTC']).cumprod()
cumulative_returns_usdt = (1 + returns['USDT']).cumprod()

# Define Fixed Portfolio Weights
weights = [0.6, 0.4]  # 60% BTC, 40% USDT

# Portfolio Daily Returns and Cumulative Returns
# Select only the asset return columns for the dot product
portfolio_daily_returns = returns[['BTC', 'USDT']].dot(weights)
portfolio_cumulative_returns = (1 + portfolio_daily_returns).cumprod()

# Simulate $200,000 Investment
initial_investment = 200000
portfolio_value_btc = cumulative_returns_btc * initial_investment
portfolio_value_usdt = cumulative_returns_usdt * initial_investment
portfolio_value_combined = portfolio_cumulative_returns * initial_investment

# Create DataFrame with Portfolio Values
portfolio_values = pd.DataFrame({
    'BTC Only': portfolio_value_btc,
    'USDT Only': portfolio_value_usdt,
    'Combined Portfolio': portfolio_value_combined
})

# Plot Cumulative Portfolio Value (Fixed Weights)
fig = px.line(portfolio_values,
              x=portfolio_values.index,
              y=portfolio_values.columns,
              title="Cumulative Growth of $200,000 in BTC, USDT, and Combined Portfolio")
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Portfolio Value (USD)")
fig.show()

In [48]:
# Simulate Portfolios for Different Weightings
all_portfolio_cumulative_returns = pd.DataFrame(index=returns.index)

for weight_btc in np.arange(0, 1.1, 0.1):
    weight_usdt = 1 - weight_btc
    weights = [weight_btc, weight_usdt]
    portfolio_returns = returns[['BTC', 'USDT']].dot(weights)
    cumulative_returns = (1 + portfolio_returns).cumprod()
    label = f'BTC: {weight_btc:.1f}, USDT: {weight_usdt:.1f}'
    all_portfolio_cumulative_returns[label] = cumulative_returns

# Convert to Portfolio Values
all_portfolio_values = all_portfolio_cumulative_returns * initial_investment

# Plot All Portfolio Combinations
fig = px.line(all_portfolio_values,
              x=all_portfolio_values.index,
              y=all_portfolio_values.columns,
              title="Portfolio Value for Different BTC/USDT Weightings ($200,000 Investment)")
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Portfolio Value (USD)")
fig.show()

Based on both graphs and the correlation matrix, invessting entirely in BTC yields the highest return, but at the same time it exposes the investor to extreme volatility (risk).

The combined portfolio makes the volatility a little bit more steady and still achieves significant growth - highlighting the importance of diversification.

---

**Hypothesis 3: Summary**

Hypothesis 3 is supported.

The low correlation between BTC and USDT means that combining them in a portfolio helps reduce overall volatility. Although the portfolio with 100% of BTC achieves the highest returns, it also represents the highest risk.

Like this, by introducing a low volatile asset like USDT, the combined portfolio achieves more consistent growth -> supporting better risk-adjusted performance.



## Classification Model -> Strategies for Cryptocurrency Trading

ETH vs. XLM

### Ethereum (ETH)

In [49]:
# Filter for Ethereum (ETH)
eth_df = crypto[crypto['symbol'] == 'ETH'].copy()
eth_df['date'] = pd.to_datetime(eth_df['date'])
eth_df = eth_df.sort_values('date')

# Calculate daily return
eth_df['daily_return'] = eth_df['close'].pct_change()

# Create binary target: 1 for positive return (bullish), 0 for non-positive (bearish)
eth_df['target'] = (eth_df['daily_return'] > 0).astype(int)

# Feature engineering: lagged features
eth_df['lag_return_1'] = eth_df['daily_return'].shift(1)
eth_df['lag_return_2'] = eth_df['daily_return'].shift(2)
eth_df['lag_volume'] = eth_df['volume'].shift(1)
eth_df['lag_spread'] = eth_df['spread'].shift(1)

# Drop rows with NaN values
eth_df.dropna(inplace=True)

# Define features and target
features = ['lag_return_1', 'lag_return_2', 'lag_volume', 'lag_spread']
X = eth_df[features]
y = eth_df['target']

# Train-test split (80/20, no shuffle to preserve time order)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred = clf.predict(X_test_scaled)

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
fig = px.imshow(conf_matrix,
                labels=dict(x="Predicted", y="Actual", color="Count"),
                x=["Bearish (0)", "Bullish (1)"],
                y=["Bearish (0)", "Bullish (1)"],
                text_auto=True,
                title="Confusion Matrix – ETH Bull vs Bear Days")
fig.show()

Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.53      0.54       131
           1       0.47      0.49      0.48       111

    accuracy                           0.51       242
   macro avg       0.51      0.51      0.51       242
weighted avg       0.51      0.51      0.51       242



In [50]:
# Assume eth_yahoo_test already has 'predicted_signal' and 'daily_return'
df = eth_yahoo_test.copy()
df.columns = df.columns.droplevel(1)

# Strategy returns: invest in ETH when predicted_signal == 1
df['strategy_return'] = df['daily_return'] * df['predicted_signal']

# Include transaction costs (assume 0.1% per trade)
transaction_cost = 0.001
df['signal_shifted'] = df['predicted_signal'].shift(1).fillna(0)
df['trades'] = df['predicted_signal'] != df['signal_shifted']
df['strategy_return_net'] = df['strategy_return'] - df['trades'] * transaction_cost

# Cumulative returns
df['cumulative_actual'] = (1 + df['daily_return']).cumprod()
df['cumulative_strategy'] = (1 + df['strategy_return_net']).cumprod()

# Compute Sharpe Ratio
sharpe = (df['strategy_return_net'].mean() / df['strategy_return_net'].std()) * np.sqrt(252)

# Compute Max Drawdown
df['rolling_max'] = df['cumulative_strategy'].cummax()
df['drawdown'] = df['cumulative_strategy'] / df['rolling_max'] - 1
max_drawdown = df['drawdown'].min()

# Print stats
print(f"Sharpe Ratio: {sharpe:.2f}")
print(f"Max Drawdown: {max_drawdown:.2%}")

Sharpe Ratio: 0.33
Max Drawdown: -72.79%


In [51]:
# Plot the data
df_plot = df[['date', 'cumulative_actual', 'cumulative_strategy']].melt(
    id_vars='date',
    var_name='Type',
    value_name='Cumulative Return'
)

fig = px.line(
    df_plot,
    x='date',
    y='Cumulative Return',
    color='Type',
    title='ETH: Cumulative Return – Strategy vs. Buy & Hold (with transaction costs)'
)
fig.show()

The **ETH classifier** struggled to distinguish bull and bear days, with an accuracy of 51%. The trading strategy based on these predictions produced:

- Sharpe Ratio: 0.33 (weak risk-adjusted performance)
- Max Drawdown: -72.79% (indicating severe downside risk)

While the strategy avoided extreme volatility, it significantly underperformed a buy-and-hold ETH approach, missing major upside moves. These results suggest limited predictive power and a high opportunity cost compared to passive investment.

### Stellar (XLM)

In [52]:
# Filter XLM data from Kaggle dataset (2015–2018 for training)
xlm_kaggle = crypto_augmented[
    (crypto_augmented['symbol'] == 'XLM') &
    (crypto_augmented['date'] >= '2015-01-01') &
    (crypto_augmented['date'] <= '2018-12-31')
].copy()

# Create lag features
for lag in range(1, 8):
    xlm_kaggle[f'return_lag_{lag}'] = xlm_kaggle['daily_return'].shift(lag)

xlm_kaggle = xlm_kaggle.dropna()

# Define features and target
lag_features = [f'return_lag_{i}' for i in range(1, 8)]
X_train_xlm = xlm_kaggle[lag_features]
y_train_xlm = (xlm_kaggle['daily_return'].shift(-1) > 0).astype(int)

# Train Random Forest model
clf_xlm = RandomForestClassifier(random_state=42)
clf_xlm.fit(X_train_xlm, y_train_xlm)

In [53]:
# Classification Report and Confusion Matrix
# True labels: whether the next day's return was positive
y_true_xlm = (xlm_yahoo_test['daily_return'].shift(-1) > 0).astype(int).dropna()

# Predicted labels
y_pred_xlm = xlm_yahoo_test.loc[y_true_xlm.index, 'predicted_signal']

# Classification Report
print("Classification Report – XLM Bull vs Bear Days")
print(classification_report(y_true_xlm, y_pred_xlm))

# Confusion Matrix
cm_xlm = confusion_matrix(y_true_xlm, y_pred_xlm)

# Plot the confusion matrix
plt.figure(figsize=(6, 4))
fig = px.imshow(cm_xlm,
                labels=dict(x="Predicted", y="Actual", color="Count"),
                x=["Bearish (0)", "Bullish (1)"],
                y=["Bearish (0)", "Bullish (1)"],
                text_auto=True,
                title="Confusion Matrix – XLM Bull vs Bear Days")
fig.show()

Classification Report – XLM Bull vs Bear Days
              precision    recall  f1-score   support

           0       0.49      0.72      0.59      1084
           1       0.50      0.28      0.36      1107

    accuracy                           0.50      2191
   macro avg       0.50      0.50      0.47      2191
weighted avg       0.50      0.50      0.47      2191



<Figure size 600x400 with 0 Axes>

In [54]:
# Backtesting Logic with Transaction Costs
# Backtest trading strategy
transaction_cost = 0.001  # 0.1% per trade

xlm_yahoo_test['strategy_return'] = xlm_yahoo_test['daily_return'] * xlm_yahoo_test['predicted_signal']
xlm_yahoo_test['signal_shifted'] = xlm_yahoo_test['predicted_signal'].shift(1).fillna(0)
xlm_yahoo_test['trades'] = xlm_yahoo_test['predicted_signal'] != xlm_yahoo_test['signal_shifted']
xlm_yahoo_test['strategy_return_net'] = xlm_yahoo_test['strategy_return'] - xlm_yahoo_test['trades'] * transaction_cost

# Cumulative returns
xlm_yahoo_test['cumulative_actual'] = (1 + xlm_yahoo_test['daily_return']).cumprod()
xlm_yahoo_test['cumulative_strategy'] = (1 + xlm_yahoo_test['strategy_return_net']).cumprod()

In [55]:
# Sharpe Ration and Max Drawdown
sharpe_xlm = (xlm_yahoo_test['strategy_return_net'].mean() / xlm_yahoo_test['strategy_return_net'].std()) * np.sqrt(252)

xlm_yahoo_test['rolling_max'] = xlm_yahoo_test['cumulative_strategy'].cummax()
xlm_yahoo_test['drawdown'] = xlm_yahoo_test['cumulative_strategy'] / xlm_yahoo_test['rolling_max'] - 1
max_drawdown_xlm = xlm_yahoo_test['drawdown'].min()

print(f"Sharpe Ratio (XLM): {sharpe_xlm:.2f}")
print(f"Max Drawdown (XLM): {max_drawdown_xlm:.2%}")

Sharpe Ratio (XLM): 0.12
Max Drawdown (XLM): -80.99%


In [56]:
# Plot Strategy vs Buy & Hold
df_plot_xlm = xlm_yahoo_test[['date', 'cumulative_actual', 'cumulative_strategy']].melt(id_vars='date', var_name='Type', value_name='Cumulative Return')

px.line(df_plot_xlm, x='date', y='Cumulative Return', color='Type',
        title='XLM: Cumulative Return – Strategy vs. Buy & Hold (with transaction costs)')

The **XLM’s model** had a higher classification accuracy, with 313 bullish and 777 bearish days correctly predicted. However, the backtested strategy still lagged behind buy-and-hold:

- Sharpe Ratio: 0.12
- Max Drawdown: -80.99%

Although the classifier captured XLM’s behavior more effectively, it did not translate into superior returns, revealing a tradeoff between model accuracy and practical gains.


---

The high drawdoens show that while the machine learning strategy smooths some volatility compared to buy-and-hold, it still exposes the investor to significant downside risk.

## Conclusion

Three hypothesis were tested in this notebook:



*   While regression-based approaches revealed certain short-term trends, their predictive power was limited across volatile cryptocurrencies. Forecasting exact return values remains challenging due to the noisy and non-linear nature of crypto markets.
*   The classification performance was modest for ETH but achieved better accuracy for XLM.


*    Using volatility and correlation analysis, we confirmed that combining low-volatility assets (like USDT) with high-volatility ones (like BTC) reduced overall portfolio risk.



Applying data-driven approaches to cryptocurrency investing with machine learning offers promising insights but also reveals the challenges of forecasting in a highly speculative asset class. The historical data underscores the importance of rigorous backtesting, diversification, and prudent risk management when designing crypto investment strategies.






## References

Jvent. (2018, December 1). Every cryptocurrency daily market price. Kaggle. https://www.kaggle.com/datasets/jessevent/all-crypto-currencies

OpenAI. (2025). ChatGPT [Large language model]. https://chat.openai.com/chat

Yahoo! (n.d.). Symbol Lookup from Yahoo Finance. Yahoo! Finance. https://finance.yahoo.com/lookup/