
Can't run examples on Windows 10 #55

Open
mhamra opened this issue Aug 28, 2023 · 9 comments
Labels
compatibility issues arising from specific hardware or system configs

Comments

mhamra commented Aug 28, 2023

Hi,
I've tried to run the examples, but I received this error:

(CodeLlama) PS C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama> python -m torch.distributed.run --nproc_per_node 1 example_infilling.py --ckpt_dir CodeLlama-7b-Python --tokenizer_path ./CodeLlama-7b-Python/tokenizer.model
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\example_infilling.py", line 79, in <module>
    fire.Fire(main)
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\example_infilling.py", line 18, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "C:\Users\marce\OneDrive\mah-docs\CodeLlama\codellama\llama\generation.py", line 90, in build
    checkpoint = torch.load(ckpt_path, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '<'.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 18284) of binary: C:\ProgramData\anaconda3\envs\CodeLlama\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 798, in <module>
    main()
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\CodeLlama\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_infilling.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-28_12:39:51
  host      : DESKTOP-THP4I5R
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 18284)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs
mhamra (Author) commented Aug 30, 2023

UPDATE

I made a mistake running the download.sh script: I passed my email address instead of the URL received from FB.
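For anyone else hitting the same "_pickle.UnpicklingError: invalid load key, '<'": the leading '<' means the first byte of the checkpoint file is '<', i.e. the file is an HTML page (such as an error page saved by download.sh) rather than the model weights. A minimal sanity check, assuming the usual consolidated.00.pth checkpoint name (adjust the path to your setup):

# Hypothetical check, not part of the repo: peek at the first bytes of the
# checkpoint to see whether it is HTML instead of a torch checkpoint.
from pathlib import Path

ckpt = Path("CodeLlama-7b-Python/consolidated.00.pth")  # adjust to your path
head = ckpt.open("rb").read(16)
if head.startswith(b"<"):
    print("File looks like HTML; re-run download.sh with the correct URL.")
else:
    print("First bytes:", head)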

@manoj21192

Did your issue get resolved? I am unable to run on Windows 10 as well; I am getting a "Distributed package doesn't have NCCL built in" error.

@hijkw added the compatibility issues arising from specific hardware or system configs label Sep 6, 2023
realhaik commented Sep 12, 2023

@manoj21192 This will work on Windows:

import torch
from llama import Llama

temperature = 0
top_p = 0
max_seq_len = 4096
max_batch_size = 1
max_gen_len = None
num_of_worlds = 1

# Use the gloo backend; NCCL does not ship in Windows builds of PyTorch.
torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:23455', world_size=num_of_worlds, rank=0)

generator = Llama.build(
    ckpt_dir="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct",
    tokenizer_path="C:/AI/LLaMA2_Docker_FileSystem/codellama/CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size,
    model_parallel_size=num_of_worlds,
)
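For context: gloo is the CPU-capable backend included in Windows builds of PyTorch. Initializing the process group yourself before calling Llama.build() should also skip the library's default NCCL initialization, so the script can be launched with plain python rather than torch.distributed.run.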

99991 commented Sep 18, 2023

> UPDATE
>
> I made a mistake running the download.sh script: I passed my email address instead of the URL received from FB.

Thank you! I can reproduce this. At first I entered my email, then noticed my error and entered the correct URL when running download.sh, but loading was still not possible.

I cloned the repository again, entered the correct URL on the first try, and then it worked.

@bronzwikgk

What mistake am I making here?
from typing import Optional

import fire

from llama import Llama

def main(
    ckpt_dir: "D:\pathto\codellama\CodeLlama-7b",
    tokenizer_path: "D:\pathto\codellama\CodeLlama-7b\tokenizer.model",
    temperature: float = 0.2,
    top_p: float = 0.9,
    max_seq_len: int = 256,
    max_batch_size: int = 4,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

I am getting this error:

D:\path2\codellama>python example_completion.py
ERROR: The function received no value for the required argument: ckpt_dir
Usage: example_completion.py CKPT_DIR TOKENIZER_PATH
  optional flags: --temperature | --top_p | --max_seq_len |
                  --max_batch_size | --max_gen_len

For detailed information on this command, run:
  example_completion.py --help

realhaik commented Oct 1, 2023

> What mistake am I making here? [...] I am getting this error: ERROR: The function received no value for the required argument: ckpt_dir [...]

@bronzwikgk

Based on the code and error message you've provided, here are some issues I've identified:

  1. The type hints in the function arguments are string literals rather than types, and annotations alone don't supply default values, so ckpt_dir is still a required argument.
  2. The Windows paths should be properly escaped or written as raw strings; for example, the \t in "\tokenizer.model" would otherwise be interpreted as a tab character.

Here's a revised version of the code:

from typing import Optional
import fire
from llama import Llama

def main(
    ckpt_dir: str = r"D:\pathto\codellama\CodeLlama-7b",
    tokenizer_path: str = r"D:\pathto\codellama\CodeLlama-7b\tokenizer.model",
    temperature: float = 0.2,
    top_p: float = 0.9,
    max_seq_len: int = 256,
    max_batch_size: int = 4,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )
    
if __name__ == "__main__":
    fire.Fire(main)
The changes:

  1. Fixed the type hints for ckpt_dir and tokenizer_path to be str, and made the paths default values.
  2. Used raw string literals for the Windows paths (by prefixing the string with an r), so backslashes are interpreted literally.
  3. Added if __name__ == "__main__": fire.Fire(main) to run the function when the script is executed.

Try running the updated code and see if the error persists.
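With the defaults in place, fire still lets you override any argument from the command line as a flag, for example (paths illustrative):

python example_completion.py --ckpt_dir "D:\pathto\codellama\CodeLlama-7b" --max_seq_len 512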

@bronzwikgk

Thanks, I've moved one step ahead.
Getting this error now:

Traceback (most recent call last):
  File "D:\shunyadotek\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\shunyadotek\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "D:\shunyadotek\codellama\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shunya-desk-01\AppData\Roaming\Python\Python311\site-packages\torch\distributed\rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
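The ValueError at the end points at the launch method rather than the code: the env:// rendezvous used by init_process_group() reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment, and those are set by torch.distributed.run/torchrun, not by a plain python invocation. A hedged sketch of a single-process workaround, setting them by hand before Llama.build() runs (NCCL itself will still be a problem on Windows; see the next comments):

# Sketch, not from the thread: emulate torchrun's environment for a
# single-process run so the env:// rendezvous can succeed.
import os

os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")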

realhaik commented Oct 1, 2023

@bronzwikgk I don't see this line in your code:

torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:23455', world_size=num_of_worlds, rank=0)

Are you sure you have it? See my answer with the full code including this line, a few answers above.

realhaik commented Oct 1, 2023

@bronzwikgk Right, I see that you are using torch.distributed.init_process_group("nccl").
NCCL is Linux-only; use my example above.
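A portable variant of that fix, assuming the goal is just single-process inference: choose the backend from the platform so the same snippet runs on both Linux and Windows.

# Sketch: NCCL is available only in Linux builds of PyTorch, so fall back
# to gloo elsewhere; port 23455 matches the example above.
import platform
import torch.distributed as dist

backend = "nccl" if platform.system() == "Linux" else "gloo"
dist.init_process_group(backend=backend, init_method="tcp://localhost:23455",
                        world_size=1, rank=0)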
