[HOWTO] TPU Pod training on multiple nodes (bits_and_tpu branch) #1237

Hi! I tried it and it works without issues. I just followed the instructions in XLA's README:

  1. I created a container image with the Python packages I need. The Dockerfile has this shape (a small pod sanity-check sketch follows these steps):
FROM gcr.io/tpu-pytorch/xla:r1.10_3.8_tpuvm

RUN pip install --upgrade pip

RUN mkdir /<your-dir>/
WORKDIR /<your-dir>/

# I have `git+https://github.com/rwightman/pytorch-image-models.git@fafece230b8c8325fd6144efbab25cbc6cf5ca5c`
# in my `requirements.txt`. This is a specific commit from `bits_and_tpu`, but I guess `@bits_and_tpu` itself should also work.
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
RUN pip install wandb

# Assuming that you are in the directory with your code
COPY . .
  2. I bu…
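As a pod sanity check (this sketch is mine, not part of the original answer): before launching real training, a tiny torch_xla script can confirm that the container sees every core of the pod. It assumes the torch_xla 1.10 wheel that ships with the r1.10_3.8_tpuvm base image, one process per core on a v3-8 host (nprocs=8), and that you start it on every TPU VM worker, e.g. via gcloud compute tpus tpu-vm ssh with --worker=all. The file name check_pod.py is made up.

# check_pod.py -- my own sketch, not from the original post
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    # xrt_world_size() counts cores across the whole pod, not just this host,
    # so on a healthy v3-32 every process should report 32.
    print(f"ordinal {xm.get_ordinal()}/{xm.xrt_world_size()} on {device}")
    # A tiny all-reduce confirms the mesh is wired up end to end: every core
    # should receive the sum of all ordinals 0..world_size-1.
    t = torch.ones(1, device=device) * xm.get_ordinal()
    total = xm.all_reduce(xm.REDUCE_SUM, t)
    xm.mark_step()
    xm.master_print(f"all_reduce over ordinals = {total.item()}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)  # 8 = cores per v3-8 host (assumption)

If every process reports the same world size and the all-reduce result matches, the image and the pod mesh are set up correctly, and the actual bits_and_tpu training run can be launched the same way on all workers.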

Answer selected by rwightman
This discussion was converted from issue #949 on April 27, 2022 21:53.