In [5]:
import pandas as pd
from torch.utils.data import DataLoader
from sklearn.preprocessing import StandardScaler

# Introduction

Task 11 is the second part of the laboratory classes on recurrent networks and prediction from multimodal data. The end result will be a recurrent network that predicts the Bitcoin (BTC) exchange rate from exchange data and from the results of emotion analysis of social media posts, for which a dedicated recurrent network model must also be built. The stages are planned as follows:

1.  EmoTweet - a recurrent network model for emotion analysis (10 pts, laboratory 10)
2.  Aggregation of the emotion information and preparation of MultiBTC - a multimodal recurrent network model for predicting the BTC rate (10 pts, laboratory 11)
3.  Evaluation of the MultiBTC model (10 pts, laboratory 12)

A total of 30 points can be earned.

# Goal of the exercise

The goal of the second stage is to prepare MultiBTC, an LSTM recurrent network that predicts the next element of a sequence given the earlier observations. A solution that works like the classifier from the previous task is acceptable, except that here a regressor is built and the predicted variable is, for example, the average rate on the next day given the observations from the preceding days.

# Completion requirements

To complete the second stage, carry out the following steps:

1.  Classify the tweet collection with the 2 EmoTweet models developed in stage 1 (if the LSTM networks are too slow, fastText-based models may be used instead).
2.  Prepare an LSTM model in which every sequence element is multimodal, i.e. described by features coming from different sources:

- Data from the cryptocurrency exchange
- Aggregated emotion values from the tweets

# Tweet dataset

The tweets come from [Twitter](https://twitter.com/) and are a subset of 2 million messages about [Bitcoin](https://en.wikipedia.org/wiki/Bitcoin) from January 2018 to May 2020.

## Download


In [3]:
# upload the file bitcoin_tweets_2M.csv.7z from the "dane" directory

## Extraction


In [2]:
!7za x bitcoin_tweets_2M.csv.7z

/bin/bash: line 1: 7za: command not found


## Contents

The data contains the following columns:

- `timestamp` - date the message was sent
- `likes` - number of likes the message received
- `retweets` - number of times the message was retweeted
- `username` - the user's nickname
- `text` - tweet text "anonymized" with the [`preprocess`](https://github.com/cardiffnlp/tweeteval/blob/main/TweetEval_Tutorial.ipynb) method that was used to build the [TweetEval](https://github.com/cardiffnlp/tweeteval) dataset
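
For orientation, the anonymization amounts to masking user mentions and links. The sketch below is modeled on the preprocessing shown in the TweetEval materials; the exact placeholder tokens (`@user`, `http`) are an assumption here and should be checked against the linked tutorial before relying on them.

```python
def preprocess(text: str) -> str:
    """Mask user mentions and links, token by token (split on spaces)."""
    masked = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"  # assumed placeholder for user mentions
        elif token.startswith("http"):
            token = "http"  # assumed placeholder for links
        masked.append(token)
    return " ".join(masked)


example = preprocess("@alice check https://example.com for #bitcoin news")
print(example)  # @user check http for #bitcoin news
```

Hashtags are left untouched in this sketch, which matches the `#bitcoin` tags visible in the `text` column.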


In [11]:
# for server
import os

# os.chdir("/home")

# for local

print(os.getcwd())

/home/piotr/projects/ai/gsn-l/lab-11


In [73]:
tweets_data = pd.read_csv("bitcoin_tweets_2M.csv")
tweets_data

Unnamed: 0,timestamp,likes,retweets,username,text
0,2018-01-01 00:00:03,0,0,ANDRO1711,"From the future of bitcoin to Facebook, 2018 i..."
1,2018-01-01 00:00:04,2,3,BitcoinAverage - Cryptocurrency Exchange Rates,BitcoinAverage - bitcoin price index - ($ 1394...
2,2018-01-01 00:00:09,0,0,Jimmyhoshi,Singapore bar offers bitcoin New Year party pa...
3,2018-01-01 00:00:16,0,0,BTC Bros,how the Chinese bitcoin market collapsed in 20...
4,2018-01-01 00:00:26,1,1,SBIYP,Cryptocurrency Craze! #bitcoin #ethereum #dash...
...,...,...,...,...,...
2454286,2020-05-29 23:57:21,1,0,𝙂𝙧𝙞𝙢,"All good till now man, hope all is well there ..."
2454287,2020-05-29 23:57:48,0,0,Digital Asset Controller,It’s just used as a wedge to divid the people ...
2454288,2020-05-29 23:58:10,0,0,(CEO of MONEY PRINTERS),is this sweat... oh wait just underwater with ...
2454289,2020-05-29 23:58:43,2,0,luke,The whole timing of this virus is very suspici...


# Data from the [Bitstamp](https://www.bitstamp.net/) exchange

The dataset comes from Bitstamp and contains the Bitcoin rate from January 2017 to April 2021, at both daily (24h) and hourly (1h) intervals.

## Download


In [8]:
# upload the file bitstamp.7z from the "dane" directory

## Extraction


In [12]:
!7za x bitstamp.7z


7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs AMD Ryzen 7 4700U with Radeon Graphics          (860F01),ASM,AES-NI)

Scanning the drive for archives:
1 file, 773846 bytes (756 KiB)

Extracting archive: bitstamp.7z
--
Path = bitstamp.7z
Type = 7z
Physical Size = 773846
Headers Size = 221
Method = LZMA2:3m
Solid = +
Blocks = 1

    Everything is Ok

Files: 2
Size:       3036393
Compressed: 773846


## Contents

Amounts are given in US dollars (BTC/USD rate). Each date marks the closing moment; the opening moment is one hour earlier (1h variant) or one day earlier (24h variant). Both datasets contain the following columns:

- `timestamp` - date in [Unix format](https://www.epochconverter.com/)
- `date` - same as above, in the format YYYY-MM-DD HH:MM:SS
- `open` - opening rate
- `high` - highest value
- `low` - lowest value
- `close` - closing rate
- `volume` - BTC trading volume
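
As a quick check of how `timestamp` and `date` relate, the Unix value can be converted with pandas. This is a minimal sketch; the value 1483228800 is the `timestamp` of the first row in both Bitstamp files, and the one-hour offset follows the closing-moment convention described above.

```python
import pandas as pd

# Unix seconds -> timezone-naive datetime, as stored in the `date` column
closing = pd.to_datetime(1483228800, unit="s")
print(closing)  # 2017-01-01 00:00:00

# For the 1h variant the candle opens one hour before its closing moment
opening = closing - pd.Timedelta(hours=1)
print(opening)  # 2016-12-31 23:00:00
```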

Hourly interval:


In [13]:
bitstamp_data_1h = pd.read_csv("Bitstamp_BTCUSD_1h_2017_2018_2019_2020_2021-04-08.csv")
bitstamp_data_1h

Unnamed: 0,timestamp,date,open,high,low,close,volume
0,1483228800,2017-01-01 00:00:00,966.34,966.99,964.60,966.60,102.484806
1,1483232400,2017-01-01 01:00:00,966.60,966.60,962.54,963.87,149.025554
2,1483236000,2017-01-01 02:00:00,964.35,965.75,961.99,963.97,94.267396
3,1483239600,2017-01-01 03:00:00,963.88,964.71,960.53,962.83,77.619667
4,1483243200,2017-01-01 04:00:00,960.61,963.64,960.60,963.46,46.810220
...,...,...,...,...,...,...,...
37387,1617822000,2021-04-07 19:00:00,55832.62,56127.66,55441.93,56127.66,289.995730
37388,1617825600,2021-04-07 20:00:00,56075.95,56242.37,55690.00,56204.82,175.990086
37389,1617829200,2021-04-07 21:00:00,56243.09,56401.40,56053.20,56199.64,281.857236
37390,1617832800,2021-04-07 22:00:00,56160.72,56549.00,56111.13,56449.54,117.778871


Daily interval:


In [14]:
bitstamp_data_24h = pd.read_csv(
    "Bitstamp_BTCUSD_24h_2017_2018_2019_2020_2021-04-08.csv"
)
bitstamp_data_24h

Unnamed: 0,timestamp,date,open,high,low,close,volume
0,1483228800,2017-01-01 00:00:00,966.34,1005.00,960.53,997.75,6850.593309
1,1483315200,2017-01-02 00:00:00,997.75,1032.00,990.01,1012.54,8167.381030
2,1483401600,2017-01-03 00:00:00,1011.44,1039.00,999.99,1035.24,9089.658025
3,1483488000,2017-01-04 00:00:00,1035.51,1139.89,1028.56,1114.92,21562.456972
4,1483574400,2017-01-05 00:00:00,1114.38,1136.72,885.41,1004.74,36018.861120
...,...,...,...,...,...,...,...
1553,1617408000,2021-04-03 00:00:00,58967.61,59801.39,56922.00,57064.42,1663.268353
1554,1617494400,2021-04-04 00:00:00,57064.13,58501.00,56466.25,58212.18,1440.631820
1555,1617580800,2021-04-05 00:00:00,58213.69,59280.00,56800.00,59125.00,2402.437135
1556,1617667200,2021-04-06 00:00:00,59135.36,59473.90,57216.00,58018.30,2711.397847


# Carrying out the task

A detailed solution should include the following stages:

## Data preparation (5 pts)

1.  Use the models built in stage 1 to annotate the tweet collection `tweets_data` with affective dimensions (PHENOMENON_1 and PHENOMENON_2).
2.  Extract the subset of `bitstamp_data_*` covering the period for which tweets are available.
3.  Aggregate the affective information for the hourly and the daily interval. For example, for the daily interval, the rate with closing date 2017-01-02 00:00:00 is paired with affective information aggregated from tweets posted between 2017-01-01 00:00:00 and 2017-01-02 00:00:00. Also aggregate the additional tweet metadata, i.e. `likes` and `retweets`. Any aggregation method may be used, for example:

- sum
- mean
- histogram

4.  Split the data into training (80%), validation (10%) and test (10%) sets by choosing 2 split points on the time axis (the data is ordered chronologically). In other words, the model is trained and tuned on historical data and tested on the most recent data.
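
The aggregation rule in step 3 (tweets from the period before T are attached to the closing moment T) can be sketched with `resample`. The data below is a made-up toy frame, not the real tweet set; `closed="right"` and `label="right"` are one way to stamp each bin with its right edge, under the assumption that a tweet posted exactly at T belongs to the bin ending at T.

```python
import pandas as pd

# Hypothetical toy data: three tweets with made-up like counts
tweets = pd.DataFrame(
    {"likes": [1, 3, 10]},
    index=pd.to_datetime(
        ["2017-01-01 10:00:00", "2017-01-01 23:30:00", "2017-01-02 08:00:00"]
    ),
)

# Each bin (T - 1 day, T] is labelled with T, so it lines up with a
# candle whose `date` is the closing moment T
daily = tweets.resample("D", closed="right", label="right").agg(["sum", "mean"])
print(daily)
```

Both tweets from 1 January end up in the row stamped 2017-01-02, which is the pairing described in the example of step 3.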

## Building the models (5 pts)

The model is meant to predict the rate **in the future** from **historical** data. In every experiment in the **Model evaluation** section, prediction quality must be checked on 2 types of models:

1. **Daily model** - a model that at time T predicts (choose one option):

- the closing rate at time T+1
- the average rate over the period from T to T+1 (must be computed from the hourly data)

2. **Hourly model** - a model that at time T predicts the closing rate for the period T+1.
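
To make the "at time T predict T+1" requirement concrete, the sketch below pairs every window of observations ending at T with the closing rate at T+1. The helper name `make_windows` and the toy series are hypothetical; the point is only that the target lies strictly after the last step of the input window.

```python
import numpy as np


def make_windows(features: np.ndarray, close: np.ndarray, window: int):
    """Pair each run of `window` steps ending at T with close[T + 1]."""
    X, y = [], []
    for t in range(window - 1, len(features) - 1):
        X.append(features[t - window + 1 : t + 1])
        y.append(close[t + 1])  # the target is one step after the window
    return np.stack(X), np.array(y)


# Toy series: a single feature that equals the closing rate itself
close = np.arange(10, dtype=np.float32)
X, y = make_windows(close.reshape(-1, 1), close, window=3)
print(X.shape, y.shape)  # (7, 3, 1) (7,)
```

The first window holds the values 0, 1, 2 and its target is 3, so no information from T+1 leaks into the input.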

### General closing remarks

Report all results using 2 measures of prediction quality:

1. [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
2. [R2-score](https://en.wikipedia.org/wiki/Coefficient_of_determination)
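
Both measures are easy to compute by hand, which helps when sanity-checking library output. The sketch below uses plain NumPy and made-up numbers; the function names are illustrative only.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred) -> float:
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

For these values the squared errors sum to 4, so the MSE is 1.0, and with SS_tot equal to 5 the R2 score is about 0.2. An R2 below zero means the model does worse than always predicting the mean.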

In every training procedure, use the validation set to check prediction quality after each epoch. Keep the model that scored best on the validation set and run the final evaluation on the test set only with that model. Watch the training process: a drop in validation quality in later epochs (after it had been improving in earlier ones) may mean that the model has overfitted the training set, and training can be stopped. A common way to handle this is an additional parameter called **patience**, which sets how many epochs training may continue without producing a result better than the best one so far.
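
The patience mechanism can be sketched as a small helper that remembers the best validation score and a copy of the corresponding model state. The class name `EarlyStopping` and the loss values are hypothetical; in a real loop the `state` argument would be `model.state_dict()` and the score would be the validation MSE.

```python
import copy


class EarlyStopping:
    def __init__(self, patience: int):
        self.patience = patience
        self.best_score = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss: float, state: dict) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_score:
            self.best_score = val_loss
            self.best_state = copy.deepcopy(state)  # snapshot of the best model
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for per-epoch validation MSE
val_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.7, 0.5]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, {"epoch": epoch}):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_state)
```

With a patience of 3, training stops at epoch 5 and the snapshot from epoch 2 is the one to evaluate on the test set. The later improvement at epoch 6 is never reached, which is the usual trade-off of a small patience value.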


## 1. Data preparation


In [15]:
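import fasttext
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset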

In [16]:
# Create the dataloader


def create_loader(
    TEXT_PATH: str,
    LABELS_PATH: str,
    MODEL_PATH: str = "fasttext_tweetmodel_btc_sg_100_en.bin",
    batch_size: int = 64,
):

    model = fasttext.load_model(MODEL_PATH)

    with open(TEXT_PATH, "r") as file:
        data = file.read()
    lines = data.split("\n")
    if lines[-1] == "":
        lines = lines[:-1]
    texts = pd.DataFrame(lines)

    representations = [
        torch.tensor(
            [model.get_word_vector(word) for word in fasttext.tokenize(texts[0][i])]
        )
        for i in range(len(texts))
    ]
    X = pad_sequence(representations, batch_first=True, padding_value=0.0)

    labels = pd.read_csv(LABELS_PATH, sep="\t", header=None).to_numpy()
    y = torch.tensor(labels).squeeze(1)

    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    return loader

In [17]:
import torch.nn as nn
import torch


class LSTMModel(nn.Module):

    def __init__(
        self,
        input_size: int,
        hidden_size: list[int],
        output_size: int,
        n_layers: int,
        dropout: float = 0.2,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.lstm = nn.LSTM(
            input_size,
            self.hidden_size[0],
            self.n_layers,
            batch_first=True,
            dropout=dropout,
        )
        self.dropout = nn.Dropout(dropout)
        if len(hidden_size) > 1:
            linears = [
                nn.Linear(self.hidden_size[i], self.hidden_size[i + 1])
                for i in range(len(self.hidden_size) - 1)
            ]
        else:
            linears = []
        linears.append(nn.Linear(self.hidden_size[-1], output_size))
        self.fc = nn.Sequential(*linears)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Initial states are created on the input's device, not on a global one
        h0 = torch.zeros(self.n_layers, x.size(0), self.hidden_size[0], device=x.device)
        c0 = torch.zeros(self.n_layers, x.size(0), self.hidden_size[0], device=x.device)

        output, (hidden_state, _) = self.lstm(x, (h0, c0))
        output = self.dropout(hidden_state[-1])
        output = self.fc(output)
        output = self.relu(output)
        return output

### Loading the models from files


In [18]:
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from typing import Tuple
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [19]:
emotion_model = LSTMModel(100, [32, 16], 4, 2).to(device)
# map_location=device covers both the CUDA and the CPU-only case
emotion_model.load_state_dict(torch.load("emotion_model.pth", map_location=device))

In [20]:
emoji_model = LSTMModel(100, [32, 16], 20, 2).to(device)
# map_location=device covers both the CUDA and the CPU-only case
emoji_model.load_state_dict(torch.load("emoji_model.pth", map_location=device))

In [21]:
# emotion
TRAIN_TEXT_EMOTION = "tweeteval/datasets/emotion/train_text.txt"
TRAIN_LABELS_EMOTION = "tweeteval/datasets/emotion/train_labels.txt"
VAL_TEXT_EMOTION = "tweeteval/datasets/emotion/test_text.txt"
VAL_LABELS_EMOTION = "tweeteval/datasets/emotion/test_labels.txt"

train_loader_emotion = create_loader(
    TRAIN_TEXT_EMOTION, TRAIN_LABELS_EMOTION, batch_size=32
)
val_loader_emotion = create_loader(VAL_TEXT_EMOTION, VAL_LABELS_EMOTION, batch_size=32)



In [22]:
# emoji
TRAIN_TEXT_EMOJI = "tweeteval/datasets/emoji/train_text.txt"
TRAIN_LABELS_EMOJI = "tweeteval/datasets/emoji/train_labels.txt"
VAL_TEXT_EMOJI = "tweeteval/datasets/emoji/test_text.txt"
VAL_LABELS_EMOJI = "tweeteval/datasets/emoji/test_labels.txt"

train_loader_emoji = create_loader(TRAIN_TEXT_EMOJI, TRAIN_LABELS_EMOJI, batch_size=32)
val_loader_emoji = create_loader(VAL_TEXT_EMOJI, VAL_LABELS_EMOJI, batch_size=32)

In [23]:
# validate emotion model
emotion_model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for idx, (X, y) in enumerate(val_loader_emotion):
        X, y = X.to(device), y.to(device)
        outputs = emotion_model(X)
        _, predicted = torch.max(outputs.data, 1)
        total += y.size(0)
        correct += (predicted == y).sum().item()
    print(f"Accuracy: {100 * correct / total}")

Accuracy: 66.1505981703026


In [24]:
# validate emoji model
emoji_model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for idx, (X, y) in enumerate(val_loader_emoji):
        X, y = X.to(device), y.to(device)
        outputs = emoji_model(X)
        _, predicted = torch.max(outputs.data, 1)
        total += y.size(0)
        correct += (predicted == y).sum().item()
    print(f"Accuracy: {100 * correct / total}")

Accuracy: 34.88



In [75]:
# reduce dataset for testing
tweets_data = tweets_data[:100_000]

In [27]:
# save to file
tweets_text = tweets_data["text"]

# drop missing rows
# tweets_text.dropna(inplace=True)
print("nulls: ", tweets_text.isnull().sum())
print(len(tweets_text))
tweets_text.to_csv("tweets.txt", index=False, header=False)

nulls:  0
100000


### Text vectorization


In [28]:
# Load the fastText model
fasttext_model_path = "fasttext_tweetmodel_btc_sg_100_en.bin"
fasttext_model = fasttext.load_model(fasttext_model_path)

In [52]:
import fasttext
import torch
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm


# Vectorize the texts: one fastText vector per token, padded to equal length
def vectorize_text(texts, model):
    representations = [
        torch.tensor([model.get_word_vector(word) for word in fasttext.tokenize(text)])
        for text in tqdm(texts)
    ]
    return pad_sequence(representations, batch_first=True, padding_value=0.0)


# Load the tweet file
# tweets_path = "tweets.txt"
# with open(tweets_path, "r") as file:
#     tweet_lines = file.readlines()
# tweets_text = [line.strip() for line in tweet_lines]

# Vectorize the tweets
tweet_vectors = vectorize_text(tweets_text, fasttext_model)
tweet_vectors = tweet_vectors.to(device)

100%|██████████| 100000/100000 [00:58<00:00, 1714.72it/s]


In [53]:
# save vectorized text to file
torch.save(tweet_vectors, "tweet_vectors.pt")

In [54]:
# load vectorized text from file
tweet_vectors = torch.load("tweet_vectors.pt")

In [57]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def predict(model, vectors, batch_size=32):
    model.eval()
    predictions = []

    with torch.no_grad():
        for i in range(0, len(vectors), batch_size):
            batch_vectors = vectors[i : i + batch_size].to(device)
            batch_predictions = model(batch_vectors)
            batch_labels = torch.argmax(batch_predictions, dim=1)
            predictions.append(batch_labels.cpu())  # Move results back to CPU

    return torch.cat(predictions)

### Emotion and emoji prediction


In [58]:
emotion_labels = predict(emotion_model, tweet_vectors, batch_size=32)

In [59]:
emoji_labels = predict(emoji_model, tweet_vectors, batch_size=32)

In [60]:
# save emotion_labels to csv
emotion_labels_df = pd.DataFrame(emotion_labels)
emotion_labels_df.to_csv("emotion_labels.csv", index=False, header=False)

In [61]:
# save emoji to csv
emoji_labels_df = pd.DataFrame(emoji_labels)
emoji_labels_df.to_csv("emoji_labels.csv", index=False, header=False)

In [76]:
# load emotion and emoji labels
emotion_labels = pd.read_csv("emotion_labels.csv", names=["emotion"])
emoji_labels = pd.read_csv("emoji_labels.csv", names=["emoji"])


tweets_data = pd.concat([tweets_data, emotion_labels, emoji_labels], axis=1)
display(tweets_data)

Unnamed: 0,timestamp,likes,retweets,username,text,emotion,emoji
0,2018-01-01 00:00:03,0,0,ANDRO1711,"From the future of bitcoin to Facebook, 2018 i...",2.0,4.0
1,2018-01-01 00:00:04,2,3,BitcoinAverage - Cryptocurrency Exchange Rates,BitcoinAverage - bitcoin price index - ($ 1394...,3.0,10.0
2,2018-01-01 00:00:09,0,0,Jimmyhoshi,Singapore bar offers bitcoin New Year party pa...,0.0,4.0
3,2018-01-01 00:00:16,0,0,BTC Bros,how the Chinese bitcoin market collapsed in 20...,3.0,2.0
4,2018-01-01 00:00:26,1,1,SBIYP,Cryptocurrency Craze! #bitcoin #ethereum #dash...,3.0,2.0
...,...,...,...,...,...,...,...
99995,2018-01-18 21:10:34,0,0,Ar Viraj,BUY Payment method: SEPA Offer ID: QOmdxI Amou...,0.0,2.0
99996,2018-01-18 21:10:53,8,9,David Scutt,What 12 major analysts from banks like Goldman...,0.0,2.0
99997,2018-01-18 21:11:07,1,0,SQL Cyclist,add up all my 401(k)s...I finally have enough ...,1.0,2.0
99998,2018-01-18 21:11:16,0,1,Remi Vee 🛡️,Dubai Plans to Launch 20 Blockchain-Based Serv...,2.0,4.0


### 2. Extracting the Bitstamp data for the tweet period


In [77]:
start_date = tweets_data["timestamp"].min()
end_date = tweets_data["timestamp"].max()

bitstamp_data_24h_filtered = bitstamp_data_24h[
    (bitstamp_data_24h["date"] >= start_date) & (bitstamp_data_24h["date"] <= end_date)
]
bitstamp_data_1h_filtered = bitstamp_data_1h[
    (bitstamp_data_1h["date"] >= start_date) & (bitstamp_data_1h["date"] <= end_date)
]

display(bitstamp_data_24h_filtered)
display(bitstamp_data_1h_filtered)

Unnamed: 0,timestamp,date,open,high,low,close,volume
366,1514851200,2018-01-02 00:00:00,13394.2,15257.53,12910.58,14678.94,16299.669303
367,1514937600,2018-01-03 00:00:00,14670.96,15500.0,14546.28,15155.62,12275.001197
368,1515024000,2018-01-04 00:00:00,15155.62,15430.27,14192.37,15143.67,15004.018593
369,1515110400,2018-01-05 00:00:00,15143.67,17200.0,14810.0,16928.0,16248.91468
370,1515196800,2018-01-06 00:00:00,16927.99,17234.99,16220.0,17149.67,9501.016755
371,1515283200,2018-01-07 00:00:00,17142.43,17149.97,15707.16,16124.02,8632.813843
372,1515369600,2018-01-08 00:00:00,16173.98,16300.0,13900.0,14999.99,16676.349942
373,1515456000,2018-01-09 00:00:00,14999.99,15367.18,14123.97,14403.51,13913.524694
374,1515542400,2018-01-10 00:00:00,14403.51,14900.0,13412.0,14890.02,18479.01253
375,1515628800,2018-01-11 00:00:00,14899.99,14973.07,12800.0,13243.83,19630.075171


Unnamed: 0,timestamp,date,open,high,low,close,volume
8761,1514768400,2018-01-01 01:00:00,13635.06,13704.42,13312.94,13355.00,393.344524
8762,1514772000,2018-01-01 02:00:00,13355.00,13536.88,13302.02,13429.08,194.327902
8763,1514775600,2018-01-01 03:00:00,13407.98,13640.00,13321.90,13481.73,317.537375
8764,1514779200,2018-01-01 04:00:00,13481.73,13699.14,13372.41,13697.00,457.691397
8765,1514782800,2018-01-01 05:00:00,13681.04,13697.00,13483.05,13569.99,209.560181
...,...,...,...,...,...,...,...
9185,1516294800,2018-01-18 17:00:00,11787.40,11910.00,11503.24,11503.24,1229.670484
9186,1516298400,2018-01-18 18:00:00,11503.24,11800.00,11451.01,11788.35,1186.964160
9187,1516302000,2018-01-18 19:00:00,11790.00,11849.00,11560.83,11830.00,470.300133
9188,1516305600,2018-01-18 20:00:00,11825.00,11998.99,11699.99,11823.78,867.553699


### 3. Aggregation for the hourly and daily intervals


In [78]:
tweets_data.set_index("timestamp", inplace=True)
tweets_data.index = pd.to_datetime(tweets_data.index)
numeric_cols = tweets_data.select_dtypes(include=["number"])
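# resample() labels each bin by its left edge by default, so the row stamped
# 01:00 averages the tweets posted between 01:00 and 02:00
hourly_mean_numeric = numeric_cols.resample("h").mean()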

display(hourly_mean_numeric)

Unnamed: 0_level_0,likes,retweets,emotion,emoji
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-01-01 00:00:00,1.354430,0.240506,1.158228,3.569620
2018-01-01 01:00:00,6.939850,1.180451,1.315789,3.721805
2018-01-01 02:00:00,1.379032,0.338710,1.467742,3.540323
2018-01-01 03:00:00,0.757812,0.132812,1.382812,3.367188
2018-01-01 04:00:00,1.237410,1.539568,1.366906,3.381295
...,...,...,...,...
2018-01-18 17:00:00,2.306962,1.278481,1.329114,3.401899
2018-01-18 18:00:00,5.759868,1.467105,1.421053,3.486842
2018-01-18 19:00:00,3.092593,0.844444,1.374074,3.296296
2018-01-18 20:00:00,1.637324,0.531690,1.422535,3.228873


In [79]:
daily_mean_numeric = numeric_cols.resample("D").mean()

display(daily_mean_numeric)

Unnamed: 0_level_0,likes,retweets,emotion,emoji
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-01-01,2.107404,0.867998,1.299318,3.621529
2018-01-02,1.970263,0.704357,1.391425,3.479772
2018-01-03,1.939863,1.076117,1.314605,3.413918
2018-01-04,3.431941,1.209056,1.23102,3.267082
2018-01-05,2.334383,1.049493,1.225778,3.351696
2018-01-06,2.473111,1.708645,1.238711,3.343544
2018-01-07,2.339862,0.887158,1.22702,3.369909
2018-01-08,2.086491,1.189298,1.311754,3.319649
2018-01-09,2.327101,1.085397,1.204144,3.292067
2018-01-10,2.073393,0.971359,1.247518,3.26249
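
Averaging class indices, as in the cells above, treats category IDs as if they were quantities. The histogram option from the task description avoids this by counting how often each class occurs per interval. The sketch below uses a made-up label series in place of the `emotion` column and the default `resample` binning; the binning options are independent of the histogram idea.

```python
import pandas as pd

# Hypothetical toy labels standing in for the predicted `emotion` column
labels = pd.Series(
    [0, 3, 3, 1],
    index=pd.to_datetime(
        [
            "2018-01-01 01:00:00",
            "2018-01-01 05:00:00",
            "2018-01-01 09:00:00",
            "2018-01-02 03:00:00",
        ]
    ),
    name="emotion",
)

# One indicator column per class, then per-day counts (the histogram)
one_hot = pd.get_dummies(labels, prefix="emotion").astype(int)
hist = one_hot.resample("D").sum()

# Optional: normalise counts to class shares so busy days stay comparable
share = hist.div(hist.sum(axis=1), axis=0)
print(hist)
print(share)
```

Each class then becomes its own feature column (`emotion_0`, `emotion_1`, ...), so the LSTM input grows by the number of classes instead of by a single averaged value.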


In [80]:
bitstamp_data_1h_filtered.set_index("date", inplace=True)
bitstamp_data_1h_filtered.index = pd.to_datetime(bitstamp_data_1h_filtered.index)

display(bitstamp_data_1h_filtered)

Unnamed: 0_level_0,timestamp,open,high,low,close,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-01-01 01:00:00,1514768400,13635.06,13704.42,13312.94,13355.00,393.344524
2018-01-01 02:00:00,1514772000,13355.00,13536.88,13302.02,13429.08,194.327902
2018-01-01 03:00:00,1514775600,13407.98,13640.00,13321.90,13481.73,317.537375
2018-01-01 04:00:00,1514779200,13481.73,13699.14,13372.41,13697.00,457.691397
2018-01-01 05:00:00,1514782800,13681.04,13697.00,13483.05,13569.99,209.560181
...,...,...,...,...,...,...
2018-01-18 17:00:00,1516294800,11787.40,11910.00,11503.24,11503.24,1229.670484
2018-01-18 18:00:00,1516298400,11503.24,11800.00,11451.01,11788.35,1186.964160
2018-01-18 19:00:00,1516302000,11790.00,11849.00,11560.83,11830.00,470.300133
2018-01-18 20:00:00,1516305600,11825.00,11998.99,11699.99,11823.78,867.553699


In [81]:
bitstamp_data_24h_filtered.set_index("date", inplace=True)
bitstamp_data_24h_filtered.index = pd.to_datetime(bitstamp_data_24h_filtered.index)

display(bitstamp_data_24h_filtered)

Unnamed: 0_level_0,timestamp,open,high,low,close,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-01-02,1514851200,13394.2,15257.53,12910.58,14678.94,16299.669303
2018-01-03,1514937600,14670.96,15500.0,14546.28,15155.62,12275.001197
2018-01-04,1515024000,15155.62,15430.27,14192.37,15143.67,15004.018593
2018-01-05,1515110400,15143.67,17200.0,14810.0,16928.0,16248.91468
2018-01-06,1515196800,16927.99,17234.99,16220.0,17149.67,9501.016755
2018-01-07,1515283200,17142.43,17149.97,15707.16,16124.02,8632.813843
2018-01-08,1515369600,16173.98,16300.0,13900.0,14999.99,16676.349942
2018-01-09,1515456000,14999.99,15367.18,14123.97,14403.51,13913.524694
2018-01-10,1515542400,14403.51,14900.0,13412.0,14890.02,18479.01253
2018-01-11,1515628800,14899.99,14973.07,12800.0,13243.83,19630.075171


In [82]:
merged_df_hours = bitstamp_data_1h_filtered.merge(
    hourly_mean_numeric, left_index=True, right_index=True
)

display(merged_df_hours)

Unnamed: 0,timestamp,open,high,low,close,volume,likes,retweets,emotion,emoji
2018-01-01 01:00:00,1514768400,13635.06,13704.42,13312.94,13355.00,393.344524,6.939850,1.180451,1.315789,3.721805
2018-01-01 02:00:00,1514772000,13355.00,13536.88,13302.02,13429.08,194.327902,1.379032,0.338710,1.467742,3.540323
2018-01-01 03:00:00,1514775600,13407.98,13640.00,13321.90,13481.73,317.537375,0.757812,0.132812,1.382812,3.367188
2018-01-01 04:00:00,1514779200,13481.73,13699.14,13372.41,13697.00,457.691397,1.237410,1.539568,1.366906,3.381295
2018-01-01 05:00:00,1514782800,13681.04,13697.00,13483.05,13569.99,209.560181,2.071429,0.714286,1.185714,4.078571
...,...,...,...,...,...,...,...,...,...,...
2018-01-18 17:00:00,1516294800,11787.40,11910.00,11503.24,11503.24,1229.670484,2.306962,1.278481,1.329114,3.401899
2018-01-18 18:00:00,1516298400,11503.24,11800.00,11451.01,11788.35,1186.964160,5.759868,1.467105,1.421053,3.486842
2018-01-18 19:00:00,1516302000,11790.00,11849.00,11560.83,11830.00,470.300133,3.092593,0.844444,1.374074,3.296296
2018-01-18 20:00:00,1516305600,11825.00,11998.99,11699.99,11823.78,867.553699,1.637324,0.531690,1.422535,3.228873


In [83]:
merged_df_days = bitstamp_data_24h_filtered.merge(
    daily_mean_numeric, left_index=True, right_index=True
)

display(merged_df_days)

Unnamed: 0,timestamp,open,high,low,close,volume,likes,retweets,emotion,emoji
2018-01-02,1514851200,13394.2,15257.53,12910.58,14678.94,16299.669303,1.970263,0.704357,1.391425,3.479772
2018-01-03,1514937600,14670.96,15500.0,14546.28,15155.62,12275.001197,1.939863,1.076117,1.314605,3.413918
2018-01-04,1515024000,15155.62,15430.27,14192.37,15143.67,15004.018593,3.431941,1.209056,1.23102,3.267082
2018-01-05,1515110400,15143.67,17200.0,14810.0,16928.0,16248.91468,2.334383,1.049493,1.225778,3.351696
2018-01-06,1515196800,16927.99,17234.99,16220.0,17149.67,9501.016755,2.473111,1.708645,1.238711,3.343544
2018-01-07,1515283200,17142.43,17149.97,15707.16,16124.02,8632.813843,2.339862,0.887158,1.22702,3.369909
2018-01-08,1515369600,16173.98,16300.0,13900.0,14999.99,16676.349942,2.086491,1.189298,1.311754,3.319649
2018-01-09,1515456000,14999.99,15367.18,14123.97,14403.51,13913.524694,2.327101,1.085397,1.204144,3.292067
2018-01-10,1515542400,14403.51,14900.0,13412.0,14890.02,18479.01253,2.073393,0.971359,1.247518,3.26249
2018-01-11,1515628800,14899.99,14973.07,12800.0,13243.83,19630.075171,2.060959,0.858155,1.233506,3.318819


In [84]:
merged_df_hours.to_csv("merged_data_hours.csv")
merged_df_days.to_csv("merged_data_days.csv")

In [86]:
merged_df_hours = pd.read_csv("merged_data_hours.csv")
merged_df_days = pd.read_csv("merged_data_days.csv")

In [89]:
def split_data(data, train_size=0.8, val_size=0.1):
    first_split = int(train_size * len(data))
    second_split = int((train_size + val_size) * len(data))

    train_data = data.iloc[:first_split]
    val_data = data.iloc[first_split:second_split]
    test_data = data.iloc[second_split:]
    return train_data, val_data, test_data

In [90]:
train_data_hours, val_data_hours, test_data_hours = split_data(merged_df_hours)
print(len(train_data_hours), len(val_data_hours), len(test_data_hours))

343 43 43


In [91]:
train_data_days, val_data_days, test_data_days = split_data(merged_df_days)
print(len(train_data_days), len(val_data_days), len(test_data_days))

13 2 2


## 2. Budowanie modeli


Dataset for time series


In [93]:
from torch.utils.data import Dataset


class TimeSeriesDataset(Dataset):
    chunk_length: int
    X: torch.Tensor
    y: torch.Tensor

    def __init__(self, X: pd.DataFrame, y: pd.Series, sequence_length):
        self.chunk_length = sequence_length
        self.X = torch.Tensor(X).float()
        self.y = torch.Tensor(y).float().squeeze(-1)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        """
        Returns a tuple of the chunk of data ending at index idx and the label at index idx
        """
        if idx >= self.chunk_length - 1:
            i_start = idx - self.chunk_length + 1
            x = self.X[i_start : (idx + 1), :]
        else:
            padding = self.X[0].repeat(self.chunk_length - idx - 1, 1)
            x = self.X[0 : (idx + 1), :]

            x = torch.cat((padding, x), 0)
        return x, self.y[idx]

In [94]:
class LSTMModel(nn.Module):

    def __init__(
        self,
        input_size: int,
        hidden_size: list[int],
        output_size: int,
        n_layers: int,
        dropout: float = 0.2,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.lstm = nn.LSTM(
            input_size,
            self.hidden_size[0],
            self.n_layers,
            batch_first=True,
            dropout=dropout,
        )
        self.dropout = nn.Dropout(dropout)

        if len(hidden_size) > 1:
            linears = [
                nn.Linear(self.hidden_size[i], self.hidden_size[i + 1])
                for i in range(len(self.hidden_size) - 1)
            ]
        else:
            linears = []

        linears.append(nn.Linear(self.hidden_size[-1], output_size))
        self.fc = nn.Sequential(*linears)

    def forward(self, x):
        # Initial states are created on the input's device, not on a global one
        h0 = torch.zeros(self.n_layers, x.size(0), self.hidden_size[0], device=x.device)
        c0 = torch.zeros(self.n_layers, x.size(0), self.hidden_size[0], device=x.device)

        _, (hidden_state, _) = self.lstm(x, (h0, c0))
        output = self.dropout(hidden_state[-1])
        # No ReLU on the output: the targets are standardized, so the
        # regressor must be able to return negative values
        return self.fc(output)

In [95]:
import torch.optim as optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics import mean_squared_error, r2_score


def validate_regression(model, dataloader):
    model.eval()

    true_values = []
    predicted_values = []

    with torch.no_grad():
        for X_batch, y_batch in dataloader:
            # Forward pass on the notebook's device (also works without CUDA)
            y_pred = model(X_batch.to(device))

            # Append true and predicted values to lists
            true_values.extend(y_batch.numpy())
            predicted_values.extend(y_pred.cpu().numpy())

    mse = mean_squared_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)
    return mse, r2


def fit(
    model: nn.Module,
    optimiser: optim.Optimizer,
    loss_fn: nn.Module,  # a regression loss, e.g. nn.MSELoss
    train_dl: DataLoader,
    val_dl: DataLoader,
    test_dl: DataLoader,
    epochs: int,
    print_metrics: bool = True,
):
    epoch_log = []
    train_mse_log = []
    val_mse_log = []

    train_r2_log = []
    val_r2_log = []

    test_mse_log = []
    test_r2_log = []

    for epoch in range(epochs):
        epoch_log.append(epoch)

        model.train()
        for X_batch, y_batch in tqdm(train_dl):
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimiser.zero_grad()
            # Squeeze (batch, 1) to (batch,) so the loss does not broadcast
            y_pred = model(X_batch).squeeze(-1)
            loss = loss_fn(y_pred, y_batch)
            loss.backward()
            optimiser.step()

        # validate_regression switches to eval mode and disables gradients
        train_mse, train_r2 = validate_regression(model, train_dl)
        val_mse, val_r2 = validate_regression(model, val_dl)
        test_mse, test_r2 = validate_regression(model, test_dl)

        # float() works for both NumPy scalars and plain Python floats
        train_mse_log.append(float(train_mse))
        val_mse_log.append(float(val_mse))

        train_r2_log.append(float(train_r2))
        val_r2_log.append(float(val_r2))

        test_mse_log.append(float(test_mse))
        test_r2_log.append(float(test_r2))

        if print_metrics:
            print(
                f"epoch: {epoch} | "
                f"train MSE = {train_mse:.3f} (R2: {train_r2:.3f}) | "
                f"val MSE = {val_mse:.3f} (R2: {val_r2:.3f}) | "
                f"test MSE = {test_mse:.3f} (R2: {test_r2:.3f})"
            )

    return (
        epoch_log,
        train_mse_log,
        val_mse_log,
        test_mse_log,
        train_r2_log,
        val_r2_log,
        test_r2_log,
    )

In [98]:
def split_X_y(data: pd.DataFrame, target_col: str) -> tuple[pd.DataFrame, pd.Series]:
    X = data.drop(columns=[target_col])
    y = data[target_col]
    return X, y


chunk_length = 10

# set the "close" column as the target variable,
# split dataset into X (features) and y (targets)
X_train_hours, y_train_hours = split_X_y(train_data_hours, "close")
X_val_hours, y_val_hours = split_X_y(val_data_hours, "close")
X_test_hours, y_test_hours = split_X_y(test_data_hours, "close")

# scale the data
X_scaler = StandardScaler()
y_scaler = StandardScaler()

numeric_cols = X_train_hours.select_dtypes(include=["number"]).columns


X_train_hours = X_scaler.fit_transform(X_train_hours[numeric_cols])
y_train_hours = y_scaler.fit_transform(y_train_hours.values.reshape(-1, 1))

X_val_hours = X_scaler.transform(X_val_hours[numeric_cols])
y_val_hours = y_scaler.transform(y_val_hours.values.reshape(-1, 1))

X_test_hours = X_scaler.transform(X_test_hours[numeric_cols])
y_test_hours = y_scaler.transform(y_test_hours.values.reshape(-1, 1))


train_dataset_hours = TimeSeriesDataset(X_train_hours, y_train_hours, chunk_length)
val_dataset_hours = TimeSeriesDataset(X_val_hours, y_val_hours, chunk_length)
test_dataset_hours = TimeSeriesDataset(X_test_hours, y_test_hours, chunk_length)


batch_size = 32

train_loader_hours = DataLoader(
    train_dataset_hours, batch_size=batch_size, shuffle=False
)
val_loader_hours = DataLoader(val_dataset_hours, batch_size=batch_size, shuffle=False)
test_loader_hours = DataLoader(test_dataset_hours, batch_size=batch_size, shuffle=False)

# Note: len() of a DataLoader is the number of batches, not of samples
print("Length of train data:", len(train_loader_hours))
print("Length of validation data:", len(val_loader_hours))
print("Length of test data:", len(test_loader_hours))

Length of train data: 11
Length of validation data: 2
Length of test data: 2
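
Because the targets are standardized, an MSE computed directly on model outputs is expressed in scaled units. A possible way to report errors in dollars is to map predictions back with the target scaler, as sketched below. The prices are made up, and the `y_scaler` fitted here is a stand-in for the one fitted on the training targets above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical closing rates standing in for the training targets
close_train = np.array([[100.0], [110.0], [120.0]])
y_scaler = StandardScaler().fit(close_train)

# Pretend the model returned these values in scaled space
scaled_pred = y_scaler.transform(np.array([[105.0], [115.0]]))

# inverse_transform maps scaled predictions back to USD
usd_pred = y_scaler.inverse_transform(scaled_pred)
print(usd_pred.ravel())
```

R2 is unchanged by this rescaling as long as predictions and true values are transformed the same way, while MSE changes by the square of the scaler's standard deviation.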
