The kernel appears to have died. It will restart automatically. #26

Closed
DavidZhang88 opened this issue Mar 8, 2019 · 14 comments

@DavidZhang88

DavidZhang88 commented Mar 8, 2019

I was trying to run this code in a Jupyter notebook, but when I ran this cell it produced the error: 'The kernel appears to have died. It will restart automatically.' I can't figure out why this error occurs. Could anybody offer me some help? Thank you so much.

# Train the simple copy task.
V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

for epoch in range(3):
    model.train()
    run_epoch(data_gen(V, 30, 20), model, 
              SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    print(run_epoch(data_gen(V, 30, 5), model, 
                    SimpleLossCompute(model.generator, criterion, None)))
@DavidZhang88
Author

@xvdp Have you run into this problem? Could you offer me some help? Thank you so much.

@wesg52

wesg52 commented Mar 17, 2019

I am having the same issue.

@rchavezj

I'm also having the same issue

@ngarneau

Also having the same issue. Running it in Python directly gives a floating point exception (core dumped).

@v-iashin

v-iashin commented Mar 23, 2019

The same issue on Ubuntu 16.04, Threadripper 2950X, PyTorch 1.0.1.

UPD: I am not sure, but it looks like a deadlock somewhere, because I couldn't catch this with a debugger.
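
When a debugger does not catch the crash, the standard-library `faulthandler` module may still help, since it dumps the Python traceback when the process receives a fatal signal such as SIGSEGV or SIGFPE. The following is a minimal sketch, not a confirmed diagnosis of this issue; the helper name `enable_crash_tracebacks` and the log file name are assumptions made for illustration.

```python
import faulthandler


def enable_crash_tracebacks(log_path="crash_traceback.log"):
    """Write Python tracebacks to log_path if the process dies on a fatal signal.

    The helper name and default path are hypothetical. A real file is used
    because faulthandler needs a file descriptor, which a notebook's
    sys.stderr may not provide.
    """
    log_file = open(log_path, "w")
    # all_threads=True dumps the traceback of every thread, not just the current one.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this file object open for as long as the handler is enabled.
    return log_file
```

Calling `log_file = enable_crash_tracebacks()` in the first notebook cell, then re-running the training cell and reading the log after the kernel dies, should show which Python line was executing at the time of the crash.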

@chenjun0210

I'm also having the same issue

@BerenLuthien

I am having the same issue.

@ArdalanM

Had the same issue, here is the fix:

Modify run_epoch to cast all counters to NumPy values with .detach().numpy(), or just .numpy().

Here is the corrected function:

import time

def run_epoch(data_iter, model, loss_compute):
    "Standard Training and Logging Function"
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        # Accumulate NumPy values instead of tensors.
        total_loss += loss.detach().numpy()
        total_tokens += batch.ntokens.numpy()
        tokens += batch.ntokens.numpy()
        if i % 50 == 1:
            elapsed = time.time() - start
            # Log the per-token loss and throughput since the last report.
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss.detach().numpy() / batch.ntokens.numpy(), tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens
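
As a variation on the fix above, the counters can also be turned into plain Python numbers with `.item()`, which both one-element PyTorch tensors and NumPy scalars provide. The sketch below is only an illustration: `to_python_number` and `tokens_per_second` are hypothetical helper names that are not part of the notebook, and the guard against a zero time interval is an assumption about where the division could go wrong, not a confirmed cause of the crash.

```python
def to_python_number(value):
    """Return a plain Python number for a scalar tensor-like object.

    Objects with an .item() method (for example a one-element torch.Tensor
    or a NumPy scalar) are unwrapped; plain ints and floats pass through.
    """
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def tokens_per_second(tokens, elapsed):
    """Compute throughput in floating point, guarding against a zero interval."""
    tokens = float(to_python_number(tokens))
    elapsed = float(to_python_number(elapsed))
    if elapsed <= 0.0:
        # Assumed guard: avoid dividing by zero when the timer has not advanced.
        return float("inf")
    return tokens / elapsed
```

Inside `run_epoch`, this would mean accumulating with `total_tokens += to_python_number(batch.ntokens)` and logging `tokens_per_second(tokens, elapsed)`, so that all arithmetic happens on ordinary Python floats.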

@rchavezj

@ArdalanM You're the MVP

@anantshah200

I still have the same issue. It runs fine for a few batches and then gives a floating point exception. Any other suggestions?

@anantshah200

@ArdalanM, it runs for about 400-500 batches and then throws a floating point exception. Have you experienced the same type of error? Any suggestions to solve it?

@clived2

clived2 commented Oct 17, 2019

I am having this same issue running PyTorch 1.2.0 on an Ubuntu 18.04.3 desktop. It happens every time I try to run a CNN script, at the point where the "training" is invoked. ANN and RNN scripts work without any such issues. It seems that a lot of people have been having this problem for quite a while, and I am amazed that this issue is still unresolved.

@rithikreddy2k2

rithikreddy2k2 commented Jul 22, 2021

First, uninstall PyTorch as follows:
conda uninstall pytorch
pip uninstall torch (run this command twice to check that it was uninstalled successfully)

Then install PyTorch fresh as follows:
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
or check the official website for the latest version: https://pytorch.org/

This solved the issue for me!
Hope it solves it for you too.
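
To confirm what the uninstall and reinstall steps above actually left in the environment, a quick check from Python can help. This is a minimal sketch using only the standard library; `describe_install` is a hypothetical helper name, and it assumes the package exposes a `__version__` attribute (it reports "unknown" otherwise).

```python
import importlib
import importlib.util


def describe_install(package="torch"):
    """Report whether a top-level package is importable and which version it is."""
    # find_spec returns None when a top-level package cannot be found.
    if importlib.util.find_spec(package) is None:
        return f"{package}: not installed"
    module = importlib.import_module(package)
    # Assumption: the package exposes __version__; fall back to "unknown" if not.
    version = getattr(module, "__version__", "unknown")
    return f"{package}: version {version}"
```

Running `print(describe_install("torch"))` after the uninstall step should report that the package is not installed, and after the reinstall it should show the version that the kernel will actually load.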

@srush closed this as completed May 2, 2022
@canlinzhang

canlinzhang commented Feb 15, 2023

I just created a new virtual environment in conda (I use Anaconda), then ran:

pip install transformers
conda install pytorch torchvision torchaudio -c pytorch

After that, re-running your code should work.
