- recent update: 24.10.29
- update content:
    1. Generate the mid-price
    2. Build a model that predicts significant vs. insignificant changes
    3. Only when a change is predicted to be significant, predict whether it is a significant increase or decrease
- target var: mid price return significant change (0 or 1)
- Model: XGBoost (predicts significant vs. insignificant) + LSTM (predicts significant increase vs. decrease)
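
The steps above can be sketched as follows. This is a minimal illustration and not the notebook's actual preprocessing: the best bid/ask inputs, the one-step return, the `threshold` value, the label codes, and the `stage1`/`stage2` callables (stand-ins for the XGBoost and LSTM models) are all assumptions.

```python
# Hypothetical label codes for the final, combined prediction.
INSIGNIFICANT, SIG_INCREASE, SIG_DECREASE = 0, 1, 2


def mid_price(best_bid, best_ask):
    """Mid-price: the average of the best bid and the best ask."""
    return (best_bid + best_ask) / 2.0


def significance_labels(mids, threshold):
    """Label each one-step mid-price return 1 if |return| > threshold, else 0."""
    labels = []
    for prev, curr in zip(mids, mids[1:]):
        ret = (curr - prev) / prev
        labels.append(1 if abs(ret) > threshold else 0)
    return labels


def cascade_predict(features, stage1, stage2):
    """Run stage2 (direction) only where stage1 flags a significant change."""
    results = []
    for x in features:
        if stage1(x) == 0:
            results.append(INSIGNIFICANT)
        else:
            results.append(SIG_INCREASE if stage2(x) == 1 else SIG_DECREASE)
    return results


# Toy usage with stub models (assumed decision rules, for illustration only).
mids = [mid_price(99.0, 101.0), mid_price(100.0, 102.0), mid_price(100.0, 102.02)]
print(significance_labels(mids, threshold=0.001))  # [1, 0]
print(cascade_predict(
    [0.1, 0.9, -0.8],
    stage1=lambda x: 1 if abs(x) > 0.5 else 0,
    stage2=lambda x: 1 if x > 0 else 0,
))  # [0, 1, 2]
```

Gating the second model on the first keeps the direction classifier from being evaluated on samples it was never meant to handle, at the cost of inheriting every false negative from stage 1.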

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, Dataset
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    accuracy_score, roc_auc_score, precision_score,
)
from imblearn.over_sampling import SMOTE
import xgboost as xgb
from xgboost import XGBClassifier
#from optuna.integration import XGBoostPruningCallback
#import optuna

In [7]:
import pickle
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Path to the .pkl file
pkl_file_path = '/content/drive/MyDrive/eunsung/data/sequence_data.pkl'

# Load the saved file
with open(pkl_file_path, 'rb') as file:
    sequence_data = pickle.load(file)

# Inspect the loaded data
print(sequence_data.keys())  # ['train_sequences', 'train_labels', 'valid_sequences', 'valid_labels', 'test_sequences', 'test_labels']

# Prepare tensor data
train_sequences = sequence_data['train_sequences']
train_labels = sequence_data['train_labels']
valid_sequences = sequence_data['valid_sequences']
valid_labels = sequence_data['valid_labels']
test_sequences = sequence_data['test_sequences']
test_labels = sequence_data['test_labels']

# Hyperparameters
hyperparams = {
    "input_dim": train_sequences.shape[2],  # Feature size
    "hidden_dim": 128,                      # Hidden size for LSTM
    "num_layers": 2,                        # Number of LSTM layers
    "num_classes": len(np.unique(train_labels)),  # Number of output classes
    "dropout_rate": 0.3,                    # Dropout rate
    "batch_size": 64,                       # Batch size for training
    "epochs": 50,                           # Number of training epochs
    "learning_rate": 5e-4,                  # Initial learning rate
    "weight_decay": 1e-6,                   # Weight decay for regularization
    "gradient_clipping": 1.0,               # Max norm for gradient clipping
    "model_save_path": "./model_lstm_dae",  # Directory to save the model
}

# Create model save directory
os.makedirs(hyperparams["model_save_path"], exist_ok=True)

# Custom Dataset definition
class TimeSeriesDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # torch.as_tensor reuses existing tensors instead of copy-constructing
        # them, which avoids the UserWarning torch.tensor emits on tensor input.
        return (
            torch.as_tensor(self.sequences[idx], dtype=torch.float32),
            torch.as_tensor(self.labels[idx], dtype=torch.long),
        )

# Create the datasets
train_dataset = TimeSeriesDataset(train_sequences, train_labels)
valid_dataset = TimeSeriesDataset(valid_sequences, valid_labels)
test_dataset = TimeSeriesDataset(test_sequences, test_labels)

# Create the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=hyperparams["batch_size"], shuffle=True, drop_last=True)
valid_loader = DataLoader(valid_dataset, batch_size=hyperparams["batch_size"], shuffle=False, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=hyperparams["batch_size"], shuffle=False, drop_last=True)

print(f"Train batches: {len(train_loader)}")
print(f"Validation batches: {len(valid_loader)}")
print(f"Test batches: {len(test_loader)}")

# Denoising Autoencoder
class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(DenoisingAutoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
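        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, input_dim),
            # NOTE: this final ReLU clamps reconstructions to >= 0. If the inputs
            # are standardized (and so contain negative values), they cannot be
            # reconstructed exactly, which may explain an early MSE plateau.
            nn.ReLU()
        )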

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

# LSTM Classifier
class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, num_classes, dropout_rate):
        super(LSTMClassifier, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True, dropout=dropout_rate)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        return self.fc(hidden[-1])  # Final hidden state of the last (top) LSTM layer

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize models
dae = DenoisingAutoencoder(input_dim=hyperparams["input_dim"], hidden_dim=64).to(device)
lstm = LSTMClassifier(
    input_dim=64,  # Encoded dimension from DAE
    hidden_dim=hyperparams["hidden_dim"],
    num_layers=hyperparams["num_layers"],
    num_classes=hyperparams["num_classes"],
    dropout_rate=hyperparams["dropout_rate"]
).to(device)

# Loss and optimizer
dae_optimizer = optim.Adam(dae.parameters(), lr=hyperparams["learning_rate"])
lstm_optimizer = optim.Adam(lstm.parameters(), lr=hyperparams["learning_rate"], weight_decay=hyperparams["weight_decay"])

criterion = nn.CrossEntropyLoss()

# Training Denoising Autoencoder
def train_dae(dae, train_loader, epochs=10):
    dae.train()
    for epoch in range(epochs):
        total_loss = 0
        for inputs, _ in train_loader:
            noisy_inputs = inputs + torch.randn_like(inputs) * 0.1
            noisy_inputs, inputs = noisy_inputs.to(device), inputs.to(device)

            _, decoded = dae(noisy_inputs)
            loss = nn.MSELoss()(decoded, inputs)

            dae_optimizer.zero_grad()
            loss.backward()
            dae_optimizer.step()

            total_loss += loss.item()
        print(f"DAE Epoch [{epoch+1}/{epochs}], Loss: {total_loss/len(train_loader):.4f}")

# Training and Validation Loop for LSTM
def train_lstm(dae, lstm, train_loader, valid_loader):
    best_valid_loss = float('inf')
    train_losses, valid_losses = [], []

    for epoch in range(hyperparams["epochs"]):
        # Training Phase
        lstm.train()
        epoch_train_loss = 0.0

        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            with torch.no_grad():
                # The DAE's Linear layers act on the last dimension, so a
                # (batch, seq_len, input_dim) batch encodes to (batch, seq_len, 64).
                encoded, _ = dae(inputs)

            if encoded.dim() == 2:  # Only true for 2D (batch, features) input
                encoded = encoded.unsqueeze(1)  # Add a sequence_length dimension of 1

            lstm_optimizer.zero_grad()
            outputs = lstm(encoded)  # Forward pass through LSTM
            loss = criterion(outputs, targets)
            loss.backward()
            nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=hyperparams["gradient_clipping"])
            lstm_optimizer.step()

            epoch_train_loss += loss.item()

        train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(train_loss)

        # Validation Phase
        lstm.eval()
        epoch_valid_loss = 0.0

        with torch.no_grad():
            for inputs, targets in valid_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                encoded, _ = dae(inputs)

                if encoded.dim() == 2:
                    encoded = encoded.unsqueeze(1)

                outputs = lstm(encoded)
                loss = criterion(outputs, targets)
                epoch_valid_loss += loss.item()

        valid_loss = epoch_valid_loss / len(valid_loader)
        valid_losses.append(valid_loss)

        # Save the best model
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(lstm.state_dict(), os.path.join(hyperparams["model_save_path"], 'best_lstm_model.pth'))

        print(f"Epoch [{epoch+1}/{hyperparams['epochs']}], Train Loss: {train_loss:.4f}, Valid Loss: {valid_loss:.4f}")

    return train_losses, valid_losses

# Train DAE
train_dae(dae, train_loader)

# Train LSTM with DAE features
train_losses, valid_losses = train_lstm(dae, lstm, train_loader, valid_loader)

# Plotting training and validation loss
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(train_losses) + 1), train_losses, label='Train Loss')
plt.plot(range(1, len(valid_losses) + 1), valid_losses, label='Valid Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Train and Validation Loss')
plt.legend()
plt.grid()
plt.show()


dict_keys(['train_sequences', 'train_labels', 'valid_sequences', 'valid_labels', 'test_sequences', 'test_labels'])
Train batches: 8379
Validation batches: 4451
Test batches: 4451


  torch.tensor(self.sequences[idx], dtype=torch.float32),
  torch.tensor(self.labels[idx], dtype=torch.long),


DAE Epoch [1/10], Loss: 0.4434
DAE Epoch [2/10], Loss: 0.4241
DAE Epoch [3/10], Loss: 0.4240
DAE Epoch [4/10], Loss: 0.4240
DAE Epoch [5/10], Loss: 0.4240
DAE Epoch [6/10], Loss: 0.4240
DAE Epoch [7/10], Loss: 0.4240
DAE Epoch [8/10], Loss: 0.4240
DAE Epoch [9/10], Loss: 0.4240
DAE Epoch [10/10], Loss: 0.4240
Epoch [1/50], Train Loss: 0.5398, Valid Loss: 2.4070
Epoch [2/50], Train Loss: 0.3537, Valid Loss: 3.0696
Epoch [3/50], Train Loss: 0.3190, Valid Loss: 3.1454
Epoch [4/50], Train Loss: 0.3007, Valid Loss: 3.8145
Epoch [5/50], Train Loss: 0.2873, Valid Loss: 3.9757
Epoch [6/50], Train Loss: 0.2764, Valid Loss: 3.6648
Epoch [7/50], Train Loss: 0.2661, Valid Loss: 3.8773
Epoch [8/50], Train Loss: 0.2569, Valid Loss: 4.3089
Epoch [9/50], Train Loss: 0.2465, Valid Loss: 4.3198
Epoch [10/50], Train Loss: 0.2372, Valid Loss: 4.5880
Epoch [11/50], Train Loss: 0.2279, Valid Loss: 4.3636
Epoch [12/50], Train Loss: 0.2189, Valid Loss: 4.0619
Epoch [13/50], Train Loss: 0.2097, Valid Loss: 4.1

KeyboardInterrupt: 

In [None]:
# Evaluation on Test Set
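lstm.load_state_dict(torch.load(os.path.join(hyperparams["model_save_path"], 'best_lstm_model.pth'), map_location=device))
lstm.eval()
dae.eval()

all_preds, all_targets = [], []
with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        # The LSTM was trained on DAE encodings (64 features), not raw inputs,
        # so encode each batch first, exactly as in train_lstm.
        encoded, _ = dae(inputs)
        if encoded.dim() == 2:
            encoded = encoded.unsqueeze(1)
        outputs = lstm(encoded)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_targets.extend(targets.cpu().numpy())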

# Classification Report and Confusion Matrix
# Class names in label-index order; the list should match the classes present in the data.
class_names = ["Significant Increase", "Insignificant Increase", "Insignificant Decrease"]
print(classification_report(all_targets, all_preds, target_names=class_names))

cm = confusion_matrix(all_targets, all_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()


### Applying to real data
- Also applied to insignificant increase and decrease

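- Precision:

Precision is the fraction of samples the model predicted as a given class that actually belong to that class.
For example, a precision of 98% for the "Significant increase" class means that 98% of the samples the model assigned to this class were correct.
The "Significant decrease" class, however, does not appear in the data, so its precision is reported as 0.00.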
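
As a rough sketch of the definition above, precision for one class can be computed directly from the true and predicted labels. The function name, the `zero_division` fallback, and the toy labels are illustrative assumptions, not values from this notebook's run.

```python
def precision_for_class(y_true, y_pred, cls, zero_division=0.0):
    """Precision = TP / (TP + FP) for one class.

    If the class is never predicted, TP + FP is 0 and precision is
    undefined, so the `zero_division` fallback is returned instead.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
    if tp + fp == 0:
        return zero_division
    return tp / (tp + fp)


# Toy labels (made up): "decrease" never occurs in y_true.
y_true = ["increase", "increase", "insignificant", "insignificant"]
y_pred = ["increase", "insignificant", "insignificant", "decrease"]

print(precision_for_class(y_true, y_pred, "increase"))       # 1.0
print(precision_for_class(y_true, y_pred, "insignificant"))  # 0.5
print(precision_for_class(y_true, y_pred, "decrease"))       # 0.0
```

A class with no true samples ends up at 0.00 in either case: every prediction of it is a false positive, and if it is never predicted the undefined ratio falls back to 0.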

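- Recall:

Recall is the fraction of samples that actually belong to a given class that the model predicted correctly.
For example, a recall of 0.92 for the "Insignificant increase" class means that the model correctly identified 92% of the samples that truly belong to "Insignificant increase".
A recall of 0.01 for the "Significant increase" class means that the model identified only 1% of the samples that truly belong to this class.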

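- F1-score:

The F1 score is the harmonic mean of precision and recall, combining the two into a single number. It is useful when the balance between precision and recall matters.
For example, an F1 score of 0.84 for the "Insignificant increase" class means that neither precision nor recall is low for that class, because a harmonic mean is pulled toward the smaller of the two values.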
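
The recall and F1 definitions above can be checked with a few lines of arithmetic. The confusion-matrix counts below are made up so that the recalls match the quoted 0.01 and 0.92; they are not the notebook's real counts, and the helper names are hypothetical. The last helper inverts the F1 formula to show which precision is implied by a given F1 and recall.

```python
def recall_per_class(cm):
    """Recall per class from a confusion matrix (rows = true, columns = predicted)."""
    recalls = []
    for i, row in enumerate(cm):
        support = sum(row)
        recalls.append(row[i] / support if support else 0.0)
    return recalls


def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Illustrative counts only: 100 true samples in each of the first two classes.
cm = [
    [1, 99, 0],   # true "Significant increase"
    [8, 92, 0],   # true "Insignificant increase"
    [0, 0, 0],    # a class with no samples
]
print(recall_per_class(cm))                    # [0.01, 0.92, 0.0]
print(round(implied_precision(0.84, 0.92), 4)) # 0.7728
```

Because the reported scores are rounded to two decimals, the implied precision of about 0.77 is only approximate.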

- Support:

Support is the number of samples that actually belong to each class.
For example, the "Insignificant increase" class has 241,234 data points.

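- Micro avg:

Micro averaging pools the true positives, false positives, and false negatives of all classes and computes precision, recall, and F1 once on those totals. Every sample counts equally, so frequent classes dominate the result.
A micro average of 77% corresponds to an overall accuracy of 77% when every class is included in the average.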

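- Macro avg:

Macro averaging is the unweighted mean of the per-class precision, recall, and F1 scores. Because it ignores class proportions, a rare class moves the average as much as a frequent one, which can give a misleading picture under class imbalance.
For example, a class with no data, such as "Significant decrease", is scored as 0, and since it is still included in the mean it pulls the macro average down.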

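- Weighted avg:

Weighted averaging computes the mean of the per-class precision, recall, and F1 scores with each class weighted by its support, so the more samples a class has, the more its score influences the average.
A weighted average of 77% is the score obtained by weighting each class by its actual share of the data. Only the weighted-average recall is exactly equal to overall accuracy; weighted precision and F1 can differ from it.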
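
To make the difference between the three averages concrete, here is a small sketch that derives them from a confusion matrix. The matrix is a made-up, imbalanced example (supports of 90, 10, and 0), and the function names are hypothetical; undefined ratios are set to 0.0 by assumption.

```python
def per_class_counts(cm):
    """Return (tp, fp, fn) per class from a confusion matrix (rows = true)."""
    n = len(cm)
    counts = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        counts.append((tp, fp, fn))
    return counts


def safe_div(a, b):
    """a / b, or 0.0 when the ratio is undefined."""
    return a / b if b else 0.0


def averaged_scores(cm):
    """Micro, macro, and support-weighted averages of precision and recall."""
    counts = per_class_counts(cm)
    supports = [tp + fn for tp, _, fn in counts]
    total = sum(supports)
    precisions = [safe_div(tp, tp + fp) for tp, fp, _ in counts]
    recalls = [safe_div(tp, tp + fn) for tp, _, fn in counts]
    sum_tp = sum(tp for tp, _, _ in counts)
    sum_fp = sum(fp for _, fp, _ in counts)
    sum_fn = sum(fn for _, _, fn in counts)
    return {
        "micro_precision": safe_div(sum_tp, sum_tp + sum_fp),
        "micro_recall": safe_div(sum_tp, sum_tp + sum_fn),
        "macro_precision": sum(precisions) / len(precisions),
        "macro_recall": sum(recalls) / len(recalls),
        "weighted_precision": safe_div(sum(s * p for s, p in zip(supports, precisions)), total),
        "weighted_recall": safe_div(sum(s * r for s, r in zip(supports, recalls)), total),
    }


# Made-up imbalanced example: 82 of 100 samples are classified correctly.
cm = [
    [80, 5, 5],  # 90 true samples of class 0
    [8, 2, 0],   # 10 true samples of class 1
    [0, 0, 0],   # class 2 never occurs but is predicted 5 times
]
for name, value in averaged_scores(cm).items():
    print(f"{name}: {value:.4f}")
```

In this example the micro averages and the weighted recall all equal the accuracy of 0.82, while the macro recall is far lower because the rare and empty classes count as much as the dominant one.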