When we try to reproduce a multi-GPU run and the original tensor is located on GPU 6, our script raises this error. We need a better device mapping.
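One possible direction, as a rough sketch only: this assumes the run uses PyTorch and that the error comes from a saved device index (e.g. `cuda:6`) that does not exist on the machine doing the reproduction. The helper name `build_device_map` and its parameters are hypothetical, not something from our script. It folds the saved device indices onto whatever GPUs are available locally, falling back to CPU when there are none.

```python
def build_device_map(num_saved_gpus, num_available_gpus):
    """Map each saved CUDA device tag to a device that exists locally.

    Hypothetical helper: assumes devices are tagged "cuda:<index>".
    """
    if num_available_gpus <= 0:
        # No GPU on this machine: send every saved device to the CPU.
        return {f"cuda:{i}": "cpu" for i in range(num_saved_gpus)}
    # Wrap saved indices around the locally available GPUs,
    # e.g. with 2 local GPUs, cuda:6 -> cuda:0 and cuda:7 -> cuda:1.
    return {
        f"cuda:{i}": f"cuda:{i % num_available_gpus}"
        for i in range(num_saved_gpus)
    }


# Assumed usage with PyTorch (not executed here):
#   import torch
#   device_map = build_device_map(8, torch.cuda.device_count())
#   state = torch.load("checkpoint.pt", map_location=device_map)
```

The modulo wrap is just one policy. If memory is tight, mapping everything to `cpu` first and moving tensors explicitly afterwards may be the safer choice.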