
Training loss does not improve when running the cifar10 sample #537

Closed
contryboy opened this issue Feb 23, 2024 · 9 comments
Assignees: huiyan2021
Labels: ARC (ARC GPU), Correctness (Output incorrect or unacceptable accuracy loss)

Comments


contryboy commented Feb 23, 2024

Describe the issue

I installed the latest version of the oneAPI Base Toolkit and the Python packages and tried the following:

  1. The example code [1] runs without errors. However, the loss does not improve after several iterations and stalls around 2.3.
  2. I also tried the example described in [2] and observed a similar issue: the code runs fast, but the accuracy stalls around 0.18. When I commented on that post, the author replied that he also observed similar issues with newer versions.
  3. I tried another CNN-based model that trains fine on an NVIDIA P100, but saw the same issue on the Intel Arc (runs fast, but the training loss does not improve).

Could you take a look at the issue and see whether you can reproduce at least the first case?

Hardware:
Intel Arc A770 16 GB, Intel i5, 16 GB RAM.

Software:
Ubuntu 22.04, intel_extension_for_pytorch-2.1.10+xpu, torch-2.1.0a0, torchvision-0.16.0a0
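
For reference, a minimal sketch (not part of the original report) to confirm the XPU stack is visible to PyTorch; it assumes the torch.xpu helpers that ship with the ipex XPU build:

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

print(torch.__version__)             # expected: 2.1.0a0 on this setup
print(ipex.__version__)              # expected: 2.1.10+xpu on this setup
print(torch.xpu.is_available())      # should print True once the Arc driver/runtime is set up
print(torch.xpu.get_device_name(0))  # should report the Arc A770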

Thanks in advance!

[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32
[2] https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/


contryboy commented Feb 23, 2024

I did some further testing with the cifar10 notebook for comparison (a device-selection sketch follows the list):

  1. I modified the notebook to run on the CPU of the same machine: the train loss ended up around 1.8.
  2. I modified the notebook to run on the XPU of the same machine without the optimization step (ipex.optimize): the train loss ended up around 2.3 (same problem as the optimized version).
  3. I ran the notebook on the XPU of the same machine for 2 epochs: the train loss stays around 2.3 and does not improve further.
  4. I modified the notebook to run on a P100 GPU on Kaggle: the train loss ended up around 1.8 (consistent with the CPU run).
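
The device swap between those runs was essentially the following toggle (a hedged sketch, not the exact notebook code; the ipex import is only needed for the xpu runs, and the Kaggle run used 'cuda' instead):

import torch
import torchvision
import intel_extension_for_pytorch as ipex  # only needed for the xpu runs

# choose the backend per run: 'cpu', 'xpu' (Arc A770), or 'cuda' (P100 on Kaggle)
device = torch.device('xpu')

model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# case 2 above skipped this line; the other xpu runs included it
model, optimizer = ipex.optimize(model, optimizer=optimizer)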

@vishnumadhu365

@contryboy Tested the ipex training sample [1] with intel-extension-for-pytorch 2.1.20+xpu and found the loss decreasing to ~1.4 over 5 epochs. Will share more updates if I get to run the cj-mills notebook [2].

System:
oneAPI Base Toolkit - 2024.1.0
intel-extension-for-pytorch - 2.1.20+xpu
Python - 3.10
GPU Driver - https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html

import torch
import torchvision
import time

############# code changes ###############
import intel_extension_for_pytorch as ipex

############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = "datasets/cifar10/"
device = torch.device('xpu')

transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.Resize((224, 224)),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
model.train()
######################## code changes #######################
model = model.to(device)
criterion = criterion.to(device)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
######################## code changes #######################

num_epoch = 5
running_loss = 0.0
loss_print_batch = 100

start_time = time.time()
for epoch in range(num_epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        ########## code changes ##########
        data = data.to(device)
        target = target.to(device)
        ########## code changes ##########
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        #print(batch_idx)
        # print statistics
        running_loss += loss.item()
        if batch_idx % loss_print_batch == 0:    
            print(f'[{epoch + 1}, {batch_idx + 1:5d}] loss: {running_loss / loss_print_batch:.3f}')
            running_loss = 0.0

print(f"Time to train : {round(time.time()-start_time,2)} seconds")

torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pth",
)

print("Execution finished")

@contryboy

Hi @vishnumadhu365, thanks for your effort. Unfortunately I am not able to try your code to reproduce it again; I have switched to another graphics card...

@vishnumadhu365

@contryboy no worries, feel free to reach out if you still face issues.

ZhaoqiongZ added the Bug (Something isn't working), ARC (ARC GPU), and Correctness (Output incorrect or unacceptable accuracy loss) labels and removed the Bug (Something isn't working) label on Apr 24, 2024
@TheMrCodes

TheMrCodes commented May 10, 2024

Hi there, I ran into a weird but similar issue using an Arc A770 and ipex version 2.1.30+xpu.
[screenshot: accuracy curves for the runs described below]
The two runs with the highest accuracy were done on my CPU (Intel i5-13500T), the middle ones (dark blue and green) used the same setup but on the Arc GPU, and the lowest run was also on Arc but with an eval step.

I don't know why, but one of the functions torch.no_grad() or model.eval() is hurting my results (probably model.eval(), as stated in issue #40).
I am currently working on a minimal reproducible code example.
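
Until that repro is ready, the pattern under suspicion is roughly the train/eval alternation sketched below (a self-contained toy skeleton with synthetic data, not my actual training code):

import torch
import intel_extension_for_pytorch as ipex

device = torch.device('xpu')

# tiny stand-in model and data, just to show where model.eval()/torch.no_grad() enter the loop
model = torch.nn.Linear(32, 10).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = ipex.optimize(model, optimizer=optimizer)

dataset = torch.utils.data.TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=64)

for epoch in range(3):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # the eval step suspected of degrading the following epochs on xpu
    model.eval()                  # suspect no. 1 (see issue #40)
    with torch.no_grad():         # suspect no. 2
        correct = sum((model(x.to(device)).argmax(1) == y.to(device)).sum().item() for x, y in loader)
    print(f"epoch {epoch}: last train loss {loss.item():.3f}, eval acc {correct / len(dataset):.3f}")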

huiyan2021 self-assigned this on Aug 22, 2024
@huiyan2021

Hi @TheMrCodes, could you try ipex 2.1.40+xpu?

@TheMrCodes

No, sorry, I can't; I no longer have the A770 installed in my PC.
So I would appreciate it if someone else could re-run the test with the code above.

@huiyan2021

(quoted @vishnumadhu365's earlier reply and code sample in full)

Hi @TheMrCodes, I ran the code above with ipex 2.1.40+xpu on an Arc A770, and the loss dropped to 0.001 after several epochs:
[screenshot: training log showing the loss decreasing to ~0.001]

@huiyan2021

Closing this since it is no longer an issue in the latest release.
