
Training loss does not improve when running the cifar10 sample #537

Closed
contryboy opened this issue Feb 23, 2024 · 9 comments
Assignees: huiyan2021
Labels: ARC (ARC GPU), Correctness (Output incorrect or unacceptable accuracy loss)

Comments


contryboy commented Feb 23, 2024

Describe the issue

I installed the latest version of the oneAPI Base Toolkit and the Python packages and tried the following:

  1. The example code [1] runs without errors. However, the loss does not improve after several iterations and stalls around 2.3.
  2. I also tried the example described in [2] and observed a similar issue: the code runs fast, but the accuracy stalls around 0.18. When I commented on that post, the author replied that he also observed similar issues with newer versions.
  3. I tried another CNN-based model that trains fine on an NVIDIA P100, but saw the same issue on the Intel Arc (runs fast, but the training loss does not improve).

Could you take a look at the issue and see whether you can reproduce at least the first case?

Hardware:
Intel Arc A770 16 GB, Intel i5, 16 GB RAM.

Software:
Ubuntu 22.04, intel_extension_for_pytorch-2.1.10+xpu, torch-2.1.0a0, torchvision-0.16.0a0
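
For reference, a minimal sketch (not part of the original report) to confirm the XPU stack is visible to PyTorch; it assumes the torch.xpu helpers that ship with the ipex XPU build:

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

print(torch.__version__)             # expected: 2.1.0a0 on this setup
print(ipex.__version__)              # expected: 2.1.10+xpu on this setup
print(torch.xpu.is_available())      # should print True once the Arc driver/runtime is set up
print(torch.xpu.get_device_name(0))  # should report the Arc A770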

Thanks in advance!

[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32
[2] https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/


contryboy commented Feb 23, 2024

I did some further testing with the cifar10 notebook for comparison (a device-selection sketch follows the list):

  1. I modified the notebook to run on the CPU of the same machine: the train loss ended up around 1.8.
  2. I modified the notebook to run on the XPU of the same machine without the optimization step (ipex.optimize): the train loss ended up around 2.3 (same problem as the optimized version).
  3. I ran the notebook on the XPU of the same machine for 2 epochs: the train loss stays around 2.3 and does not improve further.
  4. I modified the notebook to run on a P100 GPU on Kaggle: the train loss ended up around 1.8 (consistent with the CPU run).
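
The device swap between those runs was essentially the following toggle (a hedged sketch, not the exact notebook code; the ipex import is only needed for the xpu runs, and the Kaggle run used 'cuda' instead):

import torch
import torchvision
import intel_extension_for_pytorch as ipex  # only needed for the xpu runs

# choose the backend per run: 'cpu', 'xpu' (Arc A770), or 'cuda' (P100 on Kaggle)
device = torch.device('xpu')

model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# case 2 above skipped this line; the other xpu runs included it
model, optimizer = ipex.optimize(model, optimizer=optimizer)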

@vishnumadhu365

@contryboy Tested the ipex training sample [1] with intel-extension-for-pytorch 2.1.20+xpu and found the loss decreasing to ~1.4 over 5 epochs. Will share more updates if I get to run the cj-mills notebook [2].

System:
oneAPI Base Toolkit - 2024.1.0
intel-extension-for-pytorch - 2.1.20+xpu
Python - 3.10
GPU Driver - https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html

import torch
import torchvision
import time

############# code changes ###############
import intel_extension_for_pytorch as ipex

############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = "datasets/cifar10/"
device = torch.device('xpu')

transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.Resize((224, 224)),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
model.train()
######################## code changes #######################
model = model.to(device)
criterion = criterion.to(device)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
######################## code changes #######################

num_epoch = 5
running_loss = 0.0
loss_print_batch = 100

start_time = time.time()
for epoch in range(num_epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        ########## code changes ##########
        data = data.to(device)
        target = target.to(device)
        ########## code changes ##########
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        #print(batch_idx)
        # print statistics
        running_loss += loss.item()
        if batch_idx % loss_print_batch == 0:    
            print(f'[{epoch + 1}, {batch_idx + 1:5d}] loss: {running_loss / loss_print_batch:.3f}')
            running_loss = 0.0

print(f"Time to train : {round(time.time()-start_time,2)} seconds")

torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pth",
)

print("Execution finished")

@contryboy

Hi @vishnumadhu365, thanks for your effort. Unfortunately I am not able to try your code to reproduce it again; I have switched to another graphics card...

@vishnumadhu365

@contryboy no worries, feel free to reach out if you still face issues.

ZhaoqiongZ added the Bug (Something isn't working), ARC (ARC GPU), and Correctness (Output incorrect or unacceptable accuracy loss) labels and removed the Bug (Something isn't working) label on Apr 24, 2024
@TheMrCodes

TheMrCodes commented May 10, 2024

Hi there, I ran into a weird but similar issue using an Arc A770 and ipex version 2.1.30+xpu.
[screenshot: accuracy curves for the runs described below]
The two runs with the highest accuracy were done on my CPU (Intel i5-13500T), the middle ones (dark blue and green) used the same setup but on the Arc GPU, and the lowest run was also on Arc but with an eval step.

I don't know why, but one of the functions torch.no_grad() or model.eval() is hurting my results (probably model.eval(), as stated in issue #40).
I am currently working on a minimal reproducible code example.
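
Until that repro is ready, the pattern under suspicion is roughly the train/eval alternation sketched below (a self-contained toy skeleton with synthetic data, not my actual training code):

import torch
import intel_extension_for_pytorch as ipex

device = torch.device('xpu')

# tiny stand-in model and data, just to show where model.eval()/torch.no_grad() enter the loop
model = torch.nn.Linear(32, 10).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = ipex.optimize(model, optimizer=optimizer)

dataset = torch.utils.data.TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=64)

for epoch in range(3):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # the eval step suspected of degrading the following epochs on xpu
    model.eval()                  # suspect no. 1 (see issue #40)
    with torch.no_grad():         # suspect no. 2
        correct = sum((model(x.to(device)).argmax(1) == y.to(device)).sum().item() for x, y in loader)
    print(f"epoch {epoch}: last train loss {loss.item():.3f}, eval acc {correct / len(dataset):.3f}")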

huiyan2021 self-assigned this on Aug 22, 2024
@huiyan2021

Hi @TheMrCodes, could you try ipex 2.1.40+xpu?

@TheMrCodes

No, sorry, I can't; I no longer have the A770 installed in my PC.
So I would appreciate it if someone else could re-run the test with the code above.

@huiyan2021

(quoted @vishnumadhu365's earlier reply and code sample in full)

Hi @TheMrCodes, I ran the code above with ipex 2.1.40+xpu on an Arc A770, and the loss dropped to 0.001 after several epochs:
[screenshot: training log showing the loss decreasing to ~0.001]

@huiyan2021

Closing this since it is no longer an issue in the latest release.
