RuntimeError: Given groups=1, weight of size [48, 37, 11], expected input[8, 691, 18] to have 37 channels, but got 691 channels instead #61

Open
Abdelsater opened this issue Jan 8, 2023 · 6 comments

Comments

@Abdelsater

I have the following setup, in which I changed d_input from 38 to 37:

`# Training parameters
DATASET_PATH = 'output.npz'
BATCH_SIZE = 8
NUM_WORKERS = 4
LR = 1e-4
EPOCHS = 30

# Model parameters
d_model = 48 # Latent dim
N = 2 # Number of layers
dropout = 0.2 # Dropout rate

d_input = 37 # From dataset
d_output = 8 # From dataset

# Config
sns.set()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device {device}")
`

My dataset has the following shapes after converting the Oze dataset from CSV to npz:
[('R', (7500, 19), dtype('float32')), ('X', (7500, 8, 672), dtype('float32')), ('Z', (7500, 18, 672), dtype('float32'))]

However, when I run the benchmark from the transformer repo, I get the following error during training:
`[Epoch 1/30]: 0%| | 0/5500 [00:00<?, ?it/s]
torch.Size([8, 18, 691])
torch.Size([8, 8, 672])

RuntimeError Traceback (most recent call last)
Cell In[6], line 16
14 print(x.shape)
15 print(y.shape)
---> 16 netout = net(x.to(device))
18 # Comupte loss
19 loss = loss_function(y.to(device), netout)

File ~/miniconda3/envs/Test/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Implementations/Transformers/OzeChallenge/Original/transformer/src/benchmark.py:121, in ConvGru.forward(self, x)
119 def forward(self, x):
120 x = x.transpose(1, 2)
--> 121 x = self.conv1(x)
122 x = self.activation(x)
123 x = self.conv2(x)

File ~/miniconda3/envs/Test/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/Test/lib/python3.10/site-packages/torch/nn/modules/conv.py:313, in Conv1d.forward(self, input)
312 def forward(self, input: Tensor) -> Tensor:
--> 313 return self._conv_forward(input, self.weight, self.bias)

File ~/miniconda3/envs/Test/lib/python3.10/site-packages/torch/nn/modules/conv.py:309, in Conv1d._conv_forward(self, input, weight, bias)
305 if self.padding_mode != 'zeros':
306 return F.conv1d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
307 weight, bias, self.stride,
308 _single(0), self.dilation, self.groups)
--> 309 return F.conv1d(input, weight, bias, self.stride,
310 self.padding, self.dilation, self.groups)

RuntimeError: Given groups=1, weight of size [48, 37, 11], expected input[8, 691, 18] to have 37 channels, but got 691 channels instead`

I know I have to use rollaxis to get my input into the following shape:
_x.shape = torch.Size([7500, 672, 37]), _y.shape = torch.Size([7500, 672, 8])

Could you please help me with this? I am a bit confused.

Thank you in advance.

@maxjcohen
Owner

Hi, as the shapes printed before the error show, your input vector has shape (8, 18, 691), whereas the model expects an input of shape (batch_size, time_length, d_input). My guess is that you combined R with Z along the time dimension in your data preprocessing, which would explain the size 691 = 672 + 19 of your current input vector.

I suggest double-checking your data processing function and concatenating Z and R on the d_input dimension, so as to obtain an input vector of shape (7500, 18+19=37, 672). Note that you may need to broadcast R.
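The shape arithmetic above can be reproduced with NumPy alone. The following is a minimal sketch, not the repository's code: it assumes the dataset class tiles R along `Z.shape[1]` and concatenates on the last axis (as in the snippet quoted later in this thread), and it uses 4 samples as a stand-in for the 7500 in the dataset.

```python
import numpy as np

m, K = 4, 672  # 4 samples stand in for the 7500 in the thread
Z = np.zeros((m, 18, K), dtype=np.float32)  # channels-first, as stored in the npz
R = np.ones((m, 19), dtype=np.float32)      # one static vector per sample

# What appears to go wrong: Z.shape[1] (18) is treated as the time length.
R_wrong = np.tile(R[:, np.newaxis, :], (1, Z.shape[1], 1))  # (m, 18, 19)
x_wrong = np.concatenate([Z, R_wrong], axis=-1)
print(x_wrong.shape)  # (4, 18, 691)

# Broadcasting R over time instead: (m, 19) -> (m, 19, K).
R_time = np.repeat(R[:, :, np.newaxis], K, axis=2)
x = np.concatenate([Z, R_time], axis=1)  # concatenate on the feature axis
print(x.shape)  # (4, 37, 672)

# Time-major layout (batch, time_length, d_input).
x_time_major = x.transpose((0, 2, 1))
print(x_time_major.shape)  # (4, 672, 37)
```

Since the model is described as taking (batch_size, time_length, d_input), the last transpose is probably the layout to feed it, but that depends on how the dataset class is written.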

@Abdelsater
Author

Abdelsater commented Jan 26, 2023

Thank you for your clarification, but I am still not quite sure how to address it. This is how the function converts the dataset from CSV to npz:

def csv2npz(dataset_x_path, dataset_y_path, output_path, filename, labels_path='labels.json'):
    """Load input dataset from csv and create x_train tensor."""
    # Load dataset as csv
    x = pd.read_csv(dataset_x_path)
    y = pd.read_csv(dataset_y_path)

    # Load labels, file can be found in challenge description
    with open(labels_path, "r") as stream_json:
        labels = json.load(stream_json)

    m = x.shape[0]
    K = TIME_SERIES_LENGTH  # Can be found through csv

    # Create R and Z
    R = x[labels["R"]].values
    R = np.tile(R, 672)
    R = R.astype(np.float32)

    X = y[[f"{var_name}_{i}" for var_name in labels["X"]
           for i in range(K)]]
    X = X.values.reshape((m, -1, K))
    X = X.astype(np.float32)

    Z = x[[f"{var_name}_{i}" for var_name in labels["Z"]
           for i in range(K)]]
    Z = Z.values.reshape((m, -1, K))
    # Z = Z.transpose((0, 2, 1))
    Z = Z.astype(np.float32)

    np.savez(path.join(output_path, filename), R=R, X=X, Z=Z)

my input and output after reading the csv files look like the following :

d_input : 37
d_output : 8

Could you please point out the problem, or edit the code directly? Thank you again for sharing the code and for your help with troubleshooting.

@maxjcohen
Owner

my input and output after reading the csv files look like the following :

d_input : 37
d_output : 8

This seems good to me, so the problem probably isn't from the csv2npz function, but rather in your dataloader and how it handles the R and X variables. Sorry for the late answer, please tell me if that helps.
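One way to narrow down where the shapes diverge is to check the first batch before it reaches the network. The helper below is a hypothetical sketch (`check_batch` is not part of the repository); it only relies on `.ndim` and `.shape`, so it should accept NumPy arrays as well as PyTorch tensors.

```python
import numpy as np

def check_batch(x, d_input, time_length):
    """Hypothetical helper: fail early if x is not (batch, time_length, d_input)."""
    if x.ndim != 3 or tuple(x.shape[1:]) != (time_length, d_input):
        raise ValueError(
            f"expected (batch, {time_length}, {d_input}), got {tuple(x.shape)}"
        )

# A correctly shaped batch passes silently.
check_batch(np.zeros((8, 672, 37)), d_input=37, time_length=672)

# The batch shape printed in the traceback is rejected with a readable message.
try:
    check_batch(np.zeros((8, 18, 691)), d_input=37, time_length=672)
except ValueError as err:
    message = str(err)
    print(message)
```

Calling it on `x` right after the dataloader yields a batch would show whether the problem is already present in the dataset output.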

@Abdelsater
Author

Abdelsater commented Feb 9, 2023

As I stated before, my npz dataset dimensions look like this:

[('R', (7500, 12768), dtype('float32')), ('X', (7500, 8, 672), dtype('float32')), ('Z', (7500, 18, 672), dtype('float32'))]

The benchmark notebook that I am trying to use (taken from your repo on GitHub), including the dataloader, looks like this:

import datetime

import numpy as np
from matplotlib import pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from tqdm import tqdm
import seaborn as sns

from tst.loss import OZELoss

from src.benchmark import BiGRU, ConvGru
from src.dataset import OzeDataset
from src.utils import compute_loss
from src.visualization import map_plot_function, plot_values_distribution, plot_error_distribution, plot_errors_threshold, plot_visual_sample
# Training parameters
DATASET_PATH = 'Output-Dataset.npz'
BATCH_SIZE = 8
NUM_WORKERS = 4
LR = 1e-4
EPOCHS = 30

# Model parameters
d_model = 48 # Latent dim
N = 2 # Number of layers
dropout = 0.2 # Dropout rate

d_input = 37 # From dataset
d_output = 8 # From dataset

# Config
sns.set()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device {device}")
# Output: Using device cuda:0

# Training
# Load dataset
ozeDataset = OzeDataset(DATASET_PATH)

dataset_train, dataset_val, dataset_test = random_split(ozeDataset, (5500, 1000, 1000))
dataloader_train = DataLoader(dataset_train,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=NUM_WORKERS,
                              pin_memory=False
                             )

dataloader_val = DataLoader(dataset_val,
                            batch_size=BATCH_SIZE,
                            shuffle=True,
                            num_workers=NUM_WORKERS
                           )

dataloader_test = DataLoader(dataset_test,
                             batch_size=BATCH_SIZE,
                             shuffle=False,
                             num_workers=NUM_WORKERS
                            )
# Load network
# Load transformer with Adam optimizer and MSE loss function
net = ConvGru(d_input, d_model, d_output, N, dropout=dropout, bidirectional=True).to(device)
optimizer = optim.Adam(net.parameters(), lr=LR)
loss_function = OZELoss(alpha=0.3)
# Train
model_save_path = f'models/model_LSTM_{datetime.datetime.now().strftime("%Y_%m_%d__%H%M%S")}.pth'
val_loss_best = np.inf

# Prepare loss history
hist_loss = np.zeros(EPOCHS)
hist_loss_val = np.zeros(EPOCHS)
for idx_epoch in range(EPOCHS):
    running_loss = 0
    with tqdm(total=len(dataloader_train.dataset), desc=f"[Epoch {idx_epoch+1:3d}/{EPOCHS}]") as pbar:
        for idx_batch, (x, y) in enumerate(dataloader_train):
            optimizer.zero_grad()

            # Propagate input
            netout = net(x.to(device))

            # Compute loss
            loss = loss_function(y.to(device), netout)

            # Backpropagate loss
            loss.backward()

            # Update weights
            optimizer.step()

            running_loss += loss.item()
            pbar.set_postfix({'loss': running_loss/(idx_batch+1)})
            pbar.update(x.shape[0])
        
        train_loss = running_loss/len(dataloader_train)
        val_loss = compute_loss(net, dataloader_val, loss_function, device).item()
        pbar.set_postfix({'loss': train_loss, 'val_loss': val_loss})
        
        hist_loss[idx_epoch] = train_loss
        hist_loss_val[idx_epoch] = val_loss
        
        if val_loss < val_loss_best:
            val_loss_best = val_loss
            torch.save(net.state_dict(), model_save_path)
        
plt.plot(hist_loss, 'o-', label='train')
plt.plot(hist_loss_val, 'o-', label='val')
plt.legend()
print(f"model exported to {model_save_path} with loss {val_loss_best:5f}")

The error that I am getting is the following:

RuntimeError: Given groups=1, weight of size [48, 37, 11], expected input[8, 13440, 18] to have 37 channels, but got 13440 channels instead

This error is giving me a hard time. I tried several transformations before, but since you confirmed the input and output dimensions, how can we make this work? By the way, I tried the original benchmark using the CSV files directly and it worked. The code looks like this:

import datetime

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from pathlib import Path
import sys
import psutil

from src.dataset import OzeDataset, OzeEvaluationDataset, OzeNPZDataset
from src.utils import npz_check, compute_loss, csv2npz
from src.model import BenchmarkLSTM
BATCH_SIZE = 100
# NUM_WORKERS = psutil.cpu_count() # Use this to get number of logical processing units
NUM_WORKERS = psutil.cpu_count(logical=False) # Use this to get number of physical Cores
LR = 1e-2
EPOCHS = 30
HIDDEN_DIM = 100
NUM_LAYERS = 3

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device {device}")

#dataset = OzeNPZDataset(dataset_path=npz_check(Path('datasets'), 'dataset'), labels_path="labels.json")
dataset = OzeDataset(dataset_x_path="Datasets/x_train_LsAZgHU.csv", dataset_y_path="Datasets/y_train_EFo1WyE.csv", labels_path="labels.json")
#K = dataset.time_series_length
K= 672

# More info about memory pinning here: https://pytorch.org/docs/stable/data.html#memory-pinning
is_cuda = device == torch.device("cuda:0")
num_workers = 0 if is_cuda else NUM_WORKERS
dataloader = DataLoader(dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        pin_memory=is_cuda,
                        num_workers=num_workers)

m, M = dataloader.dataset.m, dataloader.dataset.M

d_input = dataset.get_x_shape()[2]  # From dataset
print('d_input : {}'.format(d_input))
d_output = dataset.get_y_shape()[2]  # From dataset
print('d_output : {}'.format(d_output))
# Load benchmark network with Adam optimizer and MSE loss function
net = BenchmarkLSTM(input_dim=d_input, hidden_dim=HIDDEN_DIM, output_dim=d_output, num_layers=NUM_LAYERS).to(device)
loss_function = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=LR)

model_save_path = f'model_{datetime.datetime.now().strftime("%Y_%m_%d__%H%M%S")}.pth'

def fit():
    """
    Fits selected network
    """
    loss_best = np.inf
    # Prepare loss history
    hist_loss = np.zeros(EPOCHS)
    for idx_epoch in range(EPOCHS):
        running_loss = 0
        with tqdm(total=len(dataloader.dataset), desc=f"[Epoch {idx_epoch+1:3d}/{EPOCHS}]") as pbar:
            for idx_batch, (inp, out) in enumerate(dataloader):
                optimizer.zero_grad()

                # Propagate input
                net_out = net(inp.to(device))

                # Compute loss
                loss = loss_function(out.to(device), net_out)

                # Backpropagate loss
                loss.backward()

                # Update weights
                optimizer.step()

                running_loss += loss.item()
                pbar.set_postfix({'loss': running_loss/(idx_batch+1)})
                pbar.update(inp.shape[0])

            train_loss = running_loss/len(dataloader)
            pbar.set_postfix({'loss': train_loss})

            hist_loss[idx_epoch] = train_loss

            if train_loss < loss_best:
                loss_best = train_loss
                torch.save(net.state_dict(), model_save_path)
    print(f"\nmodel exported to {model_save_path} with loss {loss_best:5f}")
    return hist_loss

try:
    hist_loss = fit()
except RuntimeError as err:
    if str(err).startswith('CUDA out of memory.'):
        print('\nSwitching device to cpu to workaround CUDA out of memory problem.')
        device = torch.device("cpu")
        net = net.to(device)
        dataloader = DataLoader(dataset,
                                batch_size=BATCH_SIZE,
                                shuffle=True,
                                pin_memory=False,
                                num_workers=NUM_WORKERS)
        hist_loss = fit()
    else:
        sys.exit()

plt.plot(hist_loss, 'o-', label='train')
plt.legend()

Thank you for debugging this with me. My goal is to re-run your experiment so that I can eventually build my own transformer, so understanding your experiment will help me a lot. Thank you.
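The 13440 in the second error is consistent with the `np.tile(R, 672)` call in the `csv2npz` function posted above: with an integer argument, `np.tile` repeats along the last axis, so R goes from (7500, 19) to (7500, 12768), and 672 + 12768 = 13440. The sketch below reproduces that arithmetic with 4 samples; the dataset-side tiling is an assumption based on the `src/dataset.py` snippet quoted later in this thread.

```python
import numpy as np

m, K = 4, 672  # 4 samples stand in for the 7500 in the thread
R = np.zeros((m, 19), dtype=np.float32)
Z = np.zeros((m, 18, K), dtype=np.float32)

# np.tile with an integer repeats along the last axis: (m, 19) -> (m, 19 * 672).
R_flat = np.tile(R, K)
print(R_flat.shape)  # (4, 12768)

# Assumed dataset-side logic: tile along Z.shape[1], concatenate on the last axis.
R_tiled = np.tile(R_flat[:, np.newaxis, :], (1, Z.shape[1], 1))  # (m, 18, 12768)
x = np.concatenate([Z, R_tiled], axis=-1)
print(x.shape)  # (4, 18, 13440)
```

If this reading is right, dropping the extra `np.tile(R, 672)` from `csv2npz` would bring R back to (7500, 19), leaving only the transposition issue to deal with.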

@fremk

fremk commented Jan 4, 2024

Indeed, there is an issue with the dimensions. @maxjcohen, in the OzeDataset class, R and Z are concatenated on the wrong dimension. I suggest changing

m = Z.shape[0]  # Number of training examples
K = Z.shape[1]  # Time series length

R = np.tile(R[:, np.newaxis, :], (1, K, 1))

# Store R, Z and X as x and y
self._x = np.concatenate([Z, R], axis=-1)
self._y = X

to

m = Z.shape[0]  # Number of training examples
K = Z.shape[-1]  # Time series length

Z = Z.transpose((0, 2, 1))

X = X.transpose((0, 2, 1))

R = np.tile(R[:, np.newaxis, :], (1, K, 1))

# Store R, Z and X as x and y
self._x = np.concatenate([Z, R], axis=-1)
self._y = X

in src/dataset.py.
After running your CSV-to-npz function, the competition dataset has dimensions X (7500, 8, 672), Z (7500, 18, 672) and R (7500, 19). The time series length is therefore Z.shape[-1] rather than Z.shape[1], and both Z and X need to be transposed so that they match the dimensions the transformer expects.
The final shapes should be (7500, 672, 37) for self._x and (7500, 672, 8) for self._y. Shouldn't that be the case?
It works just fine for me now.
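The proposed change can be checked in isolation. This is a minimal sketch that wraps the same steps in a standalone function; `build_xy` is a hypothetical name for illustration, not something defined in src/dataset.py, and 4 samples stand in for the 7500 in the dataset.

```python
import numpy as np

def build_xy(R, X, Z):
    """Hypothetical helper mirroring the suggested change.

    R: (m, d_r), X: (m, d_x, K), Z: (m, d_z, K)
    Returns x: (m, K, d_z + d_r) and y: (m, K, d_x).
    """
    K = Z.shape[-1]                              # time series length
    Z = Z.transpose((0, 2, 1))                   # (m, K, d_z)
    X = X.transpose((0, 2, 1))                   # (m, K, d_x)
    R = np.tile(R[:, np.newaxis, :], (1, K, 1))  # (m, K, d_r)
    x = np.concatenate([Z, R], axis=-1)          # concatenate on the feature axis
    return x, X

m = 4
x, y = build_xy(
    np.zeros((m, 19), dtype=np.float32),
    np.zeros((m, 8, 672), dtype=np.float32),
    np.zeros((m, 18, 672), dtype=np.float32),
)
print(x.shape, y.shape)  # (4, 672, 37) (4, 672, 8)
```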

@maxjcohen
Owner

Hi, there may be an inconsistency between this repo and the competition's dataset, as the latter hasn't been maintained since 2021 while this project kept being updated.

The preprocessing function remains correct, but you may need to adapt it to your dataset. In all cases, we want to concatenate on the feature dimension (d_input). The competition's dataset seems to be transposed, so you are right to add a transpose here.
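If the same preprocessing has to handle both layouts, one option is to normalize the arrays before concatenating. The sketch below assumes the time series length is known in advance (672 in this thread); `to_time_major` is a hypothetical helper, not part of the repository.

```python
import numpy as np

def to_time_major(a, time_length):
    """Hypothetical helper: return a as (m, time_length, d).

    Transposes when a is stored as (m, d, time_length); refuses to guess
    when the layout is ambiguous.
    """
    if a.ndim != 3:
        raise ValueError(f"expected a 3D array, got shape {a.shape}")
    if a.shape[1] == time_length and a.shape[2] != time_length:
        return a  # already (m, time_length, d)
    if a.shape[2] == time_length and a.shape[1] != time_length:
        return a.transpose((0, 2, 1))
    raise ValueError(f"cannot infer layout from shape {a.shape}")

Z_competition = np.zeros((4, 18, 672), dtype=np.float32)  # transposed layout
Z_repo = np.zeros((4, 672, 18), dtype=np.float32)         # time-major layout

print(to_time_major(Z_competition, 672).shape)  # (4, 672, 18)
print(to_time_major(Z_repo, 672).shape)         # (4, 672, 18)
```

After this step, R can be tiled along axis 1 and concatenated on the last axis regardless of which dataset version was loaded.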
