# CryptoAI: Multi-Modal Crypto Market Analysis Notebook

This notebook demonstrates the complete workflow for the CryptoAI project. We:

- Fetch crypto news from several RSS feeds (CoinDesk, Cointelegraph, CryptoSlate, and CCN) and label each article with NLTK’s VADER.
- Fine-tune a DistilBERT model (LLM) for sentiment analysis.
- Download Bitcoin historical market data from CoinGecko and generate candlestick charts.
- Fine-tune OpenAI’s CLIP (VLM) on the generated charts to classify them as bullish or bearish.
- Run a demonstration that uses the fine-tuned models for inference on new data.

This notebook is intended for experimentation and development. For a production system, consider adding robust error handling, logging, and optimizations.
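
As one illustration of the error handling mentioned above, a network call can be wrapped in a small retry helper. This is a minimal sketch using only the standard library; `with_retries` and its parameters are hypothetical names that do not appear elsewhere in the notebook.

```python
import logging
import time

logger = logging.getLogger("cryptoai")

def with_retries(func, attempts=3, delay=1.0):
    """Call func() up to `attempts` times, sleeping `delay` seconds after a failure.

    Hypothetical helper for illustration; re-raises the last exception
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose: the caller decides what is retryable
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)
```

For example, the CoinGecko download defined in Part 2 could be called as `with_retries(lambda: fetch_bitcoin_data(days=60))`.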

In [None]:
# Install dependencies (if not already installed)
!pip install feedparser nltk datasets transformers pycoingecko mplfinance torch torchvision streamlit pandas Pillow matplotlib git+https://github.com/openai/CLIP.git

In [1]:
# Download NLTK VADER lexicon
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Part 1: LLM Fine-Tuning

In this section, we:

- Fetch crypto news articles from several RSS feeds (CoinDesk, Cointelegraph, CryptoSlate, and CCN).
- Label each article using VADER (assigning a binary label: 1 for non-negative and 0 for negative sentiment).
- Create a Hugging Face dataset and fine-tune a DistilBERT model for sentiment classification.
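
The labelling rule can be written as a small pure function of VADER's compound score, which ranges from -1 to 1. The sketch below also shows, as an optional variant that this notebook does not use, the three-way split with a ±0.05 neutral band suggested in VADER's documentation. Both function names are illustrative.

```python
def binary_label(compound: float) -> int:
    """Notebook rule: 1 for a non-negative compound score, 0 for a negative one."""
    return 1 if compound >= 0 else 0

def three_way_label(compound: float, band: float = 0.05) -> str:
    """Optional variant (not used in this notebook): keep neutral text separate."""
    if compound >= band:
        return "positive"
    if compound <= -band:
        return "negative"
    return "neutral"

# Compare the two rules on a few example scores
for score in (-0.6, -0.01, 0.0, 0.4):
    print(score, binary_label(score), three_way_label(score))
```

Under the binary rule, a headline with a compound score of exactly 0 is labelled 1, so class 1 mixes neutral and positive text.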

In [2]:
import feedparser
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import pandas as pd

def fetch_crypto_news():
    # List of RSS feed URLs from multiple crypto news sources.
    feed_urls = [
        "https://feeds.feedburner.com/CoinDesk",
        "https://cointelegraph.com/rss",
        "https://cryptoslate.com/feed/",
        "https://www.ccn.com/feed/"
    ]
    articles = []
    for url in feed_urls:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            # Concatenate title and summary (some feed entries lack a summary)
            text = entry.get("title", "") + ". " + entry.get("summary", "")
            articles.append(text)
    # Remove duplicate articles (based on text) while preserving fetch order
    articles = list(dict.fromkeys(articles))
    return articles

def label_articles(articles):
    sia = SentimentIntensityAnalyzer()
    data = []
    for text in articles:
        sentiment = sia.polarity_scores(text)
        compound = sentiment['compound']
        # Label: 1 if compound score is non-negative, else 0
        label = 1 if compound >= 0 else 0
        data.append({'text': text, 'label': label})
    return data

# Fetch and label data from multiple sources
articles = fetch_crypto_news()
print(f"Fetched {len(articles)} articles.")

data = label_articles(articles)
print(f"Labeled data contains {len(data)} samples.")

# Convert list of dictionaries to a DataFrame
df = pd.DataFrame(data)
# Create a Dataset from the DataFrame; preserve_index=False keeps the pandas
# index from being stored as an extra '__index_level_0__' column
dataset = Dataset.from_pandas(df, preserve_index=False)

# Split the dataset into training and test sets
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# Fine-tune DistilBERT for sentiment classification
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./models/llm_finetuned",
    evaluation_strategy="epoch",
    learning_rate=2e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.1,
    logging_steps=10,
    remove_unused_columns=False,  # Disable automatic removal of unused columns
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

print("Starting LLM fine-tuning...")
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./models/llm_finetuned", safe_serialization=False)
tokenizer.save_pretrained("./models/llm_finetuned")

print("LLM fine-tuning complete and model saved to ./models/llm_finetuned")



Fetched 66 articles.
Labeled data contains 66 samples.


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.





Starting LLM fine-tuning...


[34m[1mwandb[0m: Currently logged in as: [33mdorkhah9-shorewise-consulting[0m (use `wandb login --relogin` to force relogin)


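| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.7449        | 0.728487        |
| 2     | 0.7048        | 0.700083        |
| 3     | 0.7175        | 0.676595        |
| 4     | 0.6761        | 0.658901        |
| 5     | 0.6707        | 0.640666        |
| 6     | 0.6691        | 0.627655        |
| 7     | 0.6464        | 0.616903        |
| 8     | 0.6483        | 0.609517        |
| 9     | 0.6407        | 0.605662        |
| 10    | 0.6264        | 0.604511        |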


LLM fine-tuning complete and model saved to ./models/llm_finetuned
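
To put these losses in context: on a two-class problem, a model that always predicts 50/50 has a cross-entropy of ln 2 ≈ 0.693. The short calculation below is a reading aid, not part of the training code. It converts a mean cross-entropy loss into the geometric-mean probability assigned to the true class, using the epoch 10 validation loss reported above.

```python
import math

def mean_correct_probability(cross_entropy_loss: float) -> float:
    """Geometric-mean probability given to the true class: exp(-loss)."""
    return math.exp(-cross_entropy_loss)

chance_loss = math.log(2)    # loss of a 50/50 guesser on two classes
final_val_loss = 0.604511    # epoch 10 validation loss from the run above

print(f"chance-level loss:     {chance_loss:.4f}")
print(f"final validation loss: {final_val_loss:.4f}")
print(f"mean probability on the true class: {mean_correct_probability(final_val_loss):.3f}")
```

The final loss sits only modestly below the chance level. If the labels are imbalanced, a model that merely predicts the class prior can also score below ln 2, so with 66 articles this result should be read cautiously; a held-out accuracy metric would give a clearer picture.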


## Part 2: VLM Fine-Tuning

Next, we:

- Download Bitcoin historical data from CoinGecko.
- Generate candlestick charts with mplfinance.
- Label each chart as **bullish** (if the final close is higher than the initial open) or **bearish**.
- Fine-tune OpenAI's CLIP model on these charts using a simple training loop.
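
The windowing and labelling logic is independent of the plotting library. The sketch below uses plain lists of hypothetical (open, close) pairs, not real market data, to show how many windows a sliding window yields and how each one is labelled. The helper names are illustrative.

```python
def window_starts(n_rows, window_size=7, step=1):
    """Start indices of every full window: range(0, n_rows - window_size + 1, step)."""
    return list(range(0, n_rows - window_size + 1, step))

def window_label(candles):
    """1 (bullish) if the last close exceeds the first open, else 0 (bearish)."""
    first_open = candles[0][0]
    last_close = candles[-1][1]
    return 1 if last_close > first_open else 0

# Hypothetical daily (open, close) prices, for illustration only
candles = [(100, 102), (102, 101), (101, 105), (105, 104),
           (104, 99), (99, 98), (98, 97), (97, 103)]

for start in window_starts(len(candles), window_size=7, step=1):
    chunk = candles[start:start + 7]
    print(start, "bullish" if window_label(chunk) else "bearish")
```

With `step=1`, consecutive windows share six of their seven days, so the resulting charts are highly correlated. Setting `step` equal to the window size gives non-overlapping windows, as in `generate_candlestick_charts`. For instance, 61 daily rows yield 55 windows of seven days at `step=1`.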

In [3]:
import os
import pandas as pd
import mplfinance as mpf
from pycoingecko import CoinGeckoAPI
from datetime import datetime, timedelta
import torch
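# Note: this Dataset shadows datasets.Dataset imported in Part 1; re-run the Part 1 imports before re-running that cell
from torch.utils.data import Dataset, DataLoader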
import clip  # Install via: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
import torchvision.transforms as transforms

# Create directories to store chart images and models
CHART_DIR = "data/charts"
os.makedirs(CHART_DIR, exist_ok=True)
os.makedirs("models", exist_ok=True)

def fetch_bitcoin_data(days=60):
    cg = CoinGeckoAPI()
    end_date = datetime.now()
    start_date = end_date - timedelta(days=days)
    data = cg.get_coin_market_chart_range_by_id(id='bitcoin', vs_currency='usd',
                                                from_timestamp=start_date.timestamp(),
                                                to_timestamp=end_date.timestamp())
    prices = data['prices']
    df = pd.DataFrame(prices, columns=['timestamp', 'price'])
    df['datetime'] = pd.to_datetime(df['timestamp'], unit='ms')
    df.set_index('datetime', inplace=True)
    # Resample to daily OHLC values
    daily = df.resample('1D').agg({'price': ['first', 'max', 'min', 'last']})
    daily.columns = ['Open', 'High', 'Low', 'Close']
    daily = daily.dropna()
    return daily

def generate_candlestick_charts(df, chart_period=7):
    chart_files = []
    for i in range(0, len(df) - chart_period + 1, chart_period):
        df_chunk = df.iloc[i:i+chart_period]
        start_date = df_chunk.index[0].strftime("%Y-%m-%d")
        end_date = df_chunk.index[-1].strftime("%Y-%m-%d")
        file_name = f"bitcoin_{start_date}_to_{end_date}.png"
        file_path = os.path.join(CHART_DIR, file_name)
        mpf.plot(df_chunk, type='candle', style='charles', title=f"BTC {start_date} to {end_date}",
                 savefig=file_path)
        chart_files.append((file_path, df_chunk))
    return chart_files

def generate_sliding_window_charts(df, window_size=7, step=1):
    chart_files = []
    for i in range(0, len(df) - window_size + 1, step):
        df_chunk = df.iloc[i:i+window_size]
        start_date = df_chunk.index[0].strftime("%Y-%m-%d")
        end_date = df_chunk.index[-1].strftime("%Y-%m-%d")
        file_name = f"bitcoin_{start_date}_to_{end_date}.png"
        file_path = os.path.join(CHART_DIR, file_name)
        mpf.plot(df_chunk, type='candle', style='charles', title=f"BTC {start_date} to {end_date}",
                 savefig=file_path)
        chart_files.append((file_path, df_chunk))
    return chart_files

class ChartDataset(Dataset):
    def __init__(self, chart_files, transform=None):
        self.chart_files = chart_files
        self.transform = transform
        
    def __len__(self):
        return len(self.chart_files)
    
    def __getitem__(self, idx):
        file_path, df_chunk = self.chart_files[idx]
        image = Image.open(file_path).convert("RGB")
        # Label: bullish if final close > initial open; else bearish
        label = 1 if df_chunk['Close'].iloc[-1] > df_chunk['Open'].iloc[0] else 0
        if self.transform:
            image = self.transform(image)
        return image, label

# Fetch Bitcoin data and generate charts
df_bitcoin = fetch_bitcoin_data(days=60)
#chart_files = generate_candlestick_charts(df_bitcoin, chart_period=7)
chart_files = generate_sliding_window_charts(df_bitcoin, window_size=7, step=1)
print(f"Generated {len(chart_files)} candlestick charts.")

# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Create dataset and dataloader
transform = transforms.Compose([preprocess])
dataset = ChartDataset(chart_files, transform=transform)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

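# clip.load returns half-precision weights on CUDA; cast to float32 so Adam updates stay numerically stable
clip_model.float()
optimizer = torch.optim.Adam(clip_model.parameters(), lr=1e-5)
clip_model.train()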

print("Starting VLM fine-tuning...")
for epoch in range(3):  # Fine-tune for 3 epochs
    for images, labels in dataloader:
        images = images.to(device)
        labels = labels.to(device)  # the DataLoader already collates labels into a tensor
        # Use fixed text prompts for both classes: 0 -> "bearish", 1 -> "bullish"
        text_inputs = clip.tokenize(["bearish", "bullish"]).to(device)
        
        image_features = clip_model.encode_image(images)
        text_features = clip_model.encode_text(text_inputs)
        
        # Normalize features
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        
        # Compute logits: shape (batch_size, 2)
        logits_per_image = (image_features @ text_features.t()) * 100.0
        
        # Compute cross entropy loss with targets (0 or 1)
        loss = torch.nn.functional.cross_entropy(logits_per_image, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch} Loss: {loss.item()}")
        
# Save the fine-tuned VLM model weights
torch.save(clip_model.state_dict(), "./models/vlm_finetuned.pth")
print("VLM fine-tuning complete and model saved to ./models/vlm_finetuned.pth")

Generated 55 candlestick charts.
Starting VLM fine-tuning...




Epoch 0 Loss: 1.7087438106536865
Epoch 0 Loss: 4.089028835296631
Epoch 0 Loss: 2.3307607173919678
Epoch 0 Loss: 0.5180132389068604
Epoch 0 Loss: 0.9236690998077393
Epoch 0 Loss: 0.588776707649231
Epoch 0 Loss: 0.4214000105857849
Epoch 0 Loss: 1.077101469039917
Epoch 0 Loss: 0.6181157231330872
Epoch 0 Loss: 0.642157793045044
Epoch 0 Loss: 0.6840968728065491
Epoch 0 Loss: 0.7055062651634216
Epoch 0 Loss: 0.6209826469421387
Epoch 0 Loss: 0.6279971599578857
Epoch 1 Loss: 0.7177717089653015
Epoch 1 Loss: 0.5640753507614136
Epoch 1 Loss: 0.7447790503501892
Epoch 1 Loss: 0.7233631014823914
Epoch 1 Loss: 0.6102532744407654
Epoch 1 Loss: 0.5444055795669556
Epoch 1 Loss: 0.7015656232833862
Epoch 1 Loss: 0.5051751136779785
Epoch 1 Loss: 0.5148640871047974
Epoch 1 Loss: 0.535960853099823
Epoch 1 Loss: 0.8493967056274414
Epoch 1 Loss: 0.7733601331710815
Epoch 1 Loss: 0.3940049111843109
Epoch 1 Loss: 0.6975641846656799
Epoch 2 Loss: 0.5802949666976929
Epoch 2 Loss: 0.48255085945129395
Epoch 2 Loss: 
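
Because the loop prints one loss per batch, the log is noisy and hard to compare across epochs. A small helper such as the following averages the printed values per epoch and skips lines whose value is missing, like the truncated final line above. This is a sketch; `epoch_means` is a hypothetical name, and the sample text reuses a few lines from the log above.

```python
import re
from collections import defaultdict

# Matches lines such as "Epoch 1 Loss: 0.6975641846656799"
LOSS_LINE = re.compile(r"^Epoch (\d+) Loss: (\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)$")

def epoch_means(log_text):
    """Map epoch number -> mean of the per-batch losses found in log_text."""
    losses = defaultdict(list)
    for line in log_text.splitlines():
        match = LOSS_LINE.match(line.strip())
        if match:  # lines without a numeric value (e.g. truncated output) are skipped
            losses[int(match.group(1))].append(float(match.group(2)))
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(losses.items())}

sample_log = """Epoch 1 Loss: 0.3940049111843109
Epoch 1 Loss: 0.6975641846656799
Epoch 2 Loss: 0.5802949666976929
Epoch 2 Loss: 0.48255085945129395
Epoch 2 Loss: """

print(epoch_means(sample_log))
```

Accumulating `loss.item()` inside the training loop and printing the mean once per epoch would give the same information without parsing logs.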

## Part 3: Dashboard Demonstration

Here we load the fine-tuned models and perform inference on new data:

- Run sentiment analysis on a newly fetched crypto news article.
- Generate a new Bitcoin candlestick chart and classify it using the fine-tuned CLIP model.

This section simulates a dashboard-like demonstration.

In [4]:
import streamlit as st
import tempfile
from pycoingecko import CoinGeckoAPI
import mplfinance as mpf

# Helper: Load fine-tuned LLM model
@st.cache_resource  # replaces the deprecated st.cache(allow_output_mutation=True)
def load_llm_model():
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained("./models/llm_finetuned")
    tokenizer = AutoTokenizer.from_pretrained("./models/llm_finetuned")
    return model, tokenizer

# Helper: Load fine-tuned VLM model
@st.cache_resource  # replaces the deprecated st.cache(allow_output_mutation=True)
def load_vlm_model():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    model.load_state_dict(torch.load("./models/vlm_finetuned.pth", map_location=device))
    model.to(device)
    model.eval()
    return model, preprocess, device

def sentiment_analysis(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
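    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    logits = outputs.logits.cpu().numpy()
    # Class 1 was trained on non-negative VADER scores, so "Positive" also covers neutral text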
    sentiment = "Positive" if logits.argmax() == 1 else "Negative"
    return sentiment

def fetch_bitcoin_data(days=30):
    from datetime import datetime, timedelta
    cg = CoinGeckoAPI()
    end_date = datetime.now()
    start_date = end_date - timedelta(days=days)
    data = cg.get_coin_market_chart_range_by_id(id='bitcoin', vs_currency='usd',
                                                from_timestamp=start_date.timestamp(),
                                                to_timestamp=end_date.timestamp())
    prices = data['prices']
    df = pd.DataFrame(prices, columns=['timestamp', 'price'])
    df['datetime'] = pd.to_datetime(df['timestamp'], unit='ms')
    df.set_index('datetime', inplace=True)
    daily = df.resample('1D').agg({'price': ['first', 'max', 'min', 'last']})
    daily.columns = ['Open', 'High', 'Low', 'Close']
    daily = daily.dropna()
    return daily

def generate_chart(days=30):
    df = fetch_bitcoin_data(days=days)
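    # Reserve a file name, then close the handle so mplfinance can write to the path on any platform
    temp_file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    temp_file.close()
    mpf.plot(df, type='candle', style='charles', title="Bitcoin Candlestick Chart", savefig=temp_file.name)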
    return temp_file.name

def classify_chart(image, vlm_model, preprocess, device):
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_inputs = clip.tokenize(["bullish", "bearish"]).to(device)
    with torch.no_grad():
        image_features = vlm_model.encode_image(image_input)
        text_features = vlm_model.encode_text(text_inputs)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        logits = (image_features @ text_features.t()) * 100.0
        probs = torch.nn.functional.softmax(logits, dim=-1).cpu().numpy()[0]
        label = "bullish" if probs.argmax() == 0 else "bearish"
    return label, probs

# --- Demonstration ---
print("--- Dashboard Demonstration ---")

# Load LLM model and run sentiment analysis on a sample news article
llm_model, tokenizer = load_llm_model()
news_articles = fetch_crypto_news()
if news_articles:
    test_article = news_articles[0]
    print("Test Article:", test_article)
    sentiment = sentiment_analysis(test_article, llm_model, tokenizer)
    print("Sentiment:", sentiment)
else:
    print("No news articles found.")

# Generate a new Bitcoin chart and classify it
chart_path = generate_chart(days=30)
print(f"Generated chart at {chart_path}")
vlm_model, preprocess, device = load_vlm_model()
image = Image.open(chart_path).convert("RGB")
label, probs = classify_chart(image, vlm_model, preprocess, device)
print("Predicted Chart Pattern:", label)
print("Confidence Scores:", probs)



--- Dashboard Demonstration ---
Test Article: Why this Crypto Hedge Fund Expects Bitcoin Dominance to Drop. Bull. Credit: Paolo Feser, Unsplash
Sentiment: Positive




Generated chart at C:\Users\moham\AppData\Local\Temp\tmpxd88h64j.png




Predicted Chart Pattern: bearish
Confidence Scores: [0.37578073 0.62421924]
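
The confidence scores are a softmax over the two image-text cosine similarities multiplied by 100. The standard-library sketch below reproduces that step and then inverts it for the scores printed above, to show how small a similarity gap they correspond to. The cosine values passed to `softmax_scores` are made-up inputs for illustration, and both helper names are hypothetical.

```python
import math

def softmax_scores(cos_sims, scale=100.0):
    """Softmax over scaled cosine similarities, mirroring classify_chart."""
    logits = [scale * c for c in cos_sims]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_gap(p_a, p_b, scale=100.0):
    """Cosine-similarity difference (b minus a) implied by two softmax probabilities."""
    return math.log(p_b / p_a) / scale

# Made-up similarities for the prompts ["bullish", "bearish"]
print(softmax_scores([0.250, 0.255]))

# Scores printed above: bullish 0.37578073, bearish 0.62421924
print(similarity_gap(0.37578073, 0.62421924))
```

A gap of roughly 0.005 in cosine similarity is enough to produce a 62/38 split, so these scores are better read as a relative ranking of the two prompts than as calibrated probabilities.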


## Conclusion

In this notebook, we demonstrated the complete workflow for CryptoAI:

- Data fetching and preprocessing for crypto news and Bitcoin market data.
- Fine-tuning of an LLM for sentiment analysis and a VLM for chart pattern recognition.
- Running inference to obtain sentiment labels and chart classifications.

Feel free to experiment further with the models and visualizations. Happy analyzing!