Issue with Handling Input Prompt Files in codegen-inference.py #32

Closed
pppyb opened this issue May 17, 2024 · 7 comments
Comments
pppyb commented May 17, 2024

Thank you very much to the authors for their contributions. While attempting to run the RepoCoder method, we encountered an issue in the codegen_inference.py file. After modifying the file as follows:

[screenshot of the modified code]

we encountered the following error:

[screenshot of the error]

It seems that this file can only generate results for the in-file method.
zfj1998 (Collaborator) commented May 17, 2024

Sometimes the tokenizer won't return the same number of tokens, so you may want to leave some slack below the token limit in

self.max_retrieval_length = 2000  # half of the max length of the model

and at L18. For example, only allow 1900/900 tokens of retrieved content for Codex/CodeGen.
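
For anyone hitting the same thing, here is a minimal sketch of that suggestion (not code from the repo; the helper name, budget value, and use of AutoTokenizer are illustrative): count the retrieved snippets with the generation model's own tokenizer and stop below a budget that leaves slack under the nominal limit.

from transformers import AutoTokenizer

# Hypothetical helper (not from the RepoCoder repo): keep only as many
# retrieved snippets as fit under a token budget that leaves slack below
# the nominal limit (e.g. 1900 instead of 2000), measured with the same
# tokenizer that the generation model will use.
def filter_retrieved_snippets(snippets, model_name="Salesforce/codegen-350M-mono",
                              budget=1900):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    kept, used = [], 0
    for snippet in snippets:
        n = len(tokenizer(snippet)["input_ids"])
        if used + n > budget:
            break  # drop the remaining, lower-ranked snippets
        kept.append(snippet)
        used += n
    return kept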

zfj1998 closed this as completed on May 17, 2024
pppyb (Author) commented May 18, 2024

Thank you, Fengji. I followed your suggestions and made the modifications, but I am still encountering the same error. The change only reduced the retrieval context length, and the generated prompt still exceeds 2048 tokens.

[screenshot of the modified settings]

Additionally, in codegen_inference.py, truncation=True in the line prompts = self.tokenizer(prompt_batch, return_tensors='pt', padding=True, truncation=True) does not seem to take effect for some reason. If I force truncation manually, the quality of the generated code is very poor.

[screenshot of the tokenizer call]

Since the RuntimeError occurs inside the internals of the transformers model, there seems to be no direct way to inspect the tensor causing the error:
Traceback (most recent call last):
File "codegen_inference.py", line 78, in
cg.batch_generate(file_path)
File "codegen_inference.py", line 59, in batch_generate
gen_text.extend(self._generate_batch(batch))
File "codegen_inference.py", line 40, in _generate_batch
gen_tokens = self.model.generate(
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/generation_utils.py", line 1490, in generate
return self.greedy_search(
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/generation_utils.py", line 2233, in greedy_search
outputs = self(
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 693, in forward
transformer_outputs = self.transformer(
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 578, in forward
outputs = block(
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 304, in forward
attn_outputs = self.attn(
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 251, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 167, in _attn
attn_weights = torch.where(causal_mask, attn_weights, mask_value)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3

Do you have any better solutions for the Salesforce/codegen-350M-mono model?
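
For context on the 2048 vs 2049 mismatch: codegen-350M-mono has a 2048-position window, so the prompt plus whatever generate() appends has to fit inside it. A rough sketch of the arithmetic (illustrative only; max_new_tokens is a placeholder for whatever the script passes to generate()):

n_positions = 2048                 # context window of Salesforce/codegen-350M-mono
max_new_tokens = 100               # placeholder: however many tokens generate() is asked to produce
max_prompt_tokens = n_positions - max_new_tokens
# Any prompt longer than max_prompt_tokens pushes the total sequence past
# 2048 positions during generation, which is the size mismatch in the traceback.
print(max_prompt_tokens)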

zfj1998 reopened this on May 19, 2024
zfj1998 (Collaborator) commented May 19, 2024

If you use forced truncation by the tokenizer, it will cut into the last line of code and thus affect the target hole of the code completion. A better approach is to reduce the length of the retrieved context, or to remove in-file context from the beginning lines. You should also check more carefully whether 'rg-one-gram-ws-20-ss-24.jsonl' is generated correctly, since the code here

if current_token_length + token_len < self.max_retrieval_length:

explicitly controls the length of the prompt.
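
As a sketch of applying this at the tokenizer level (assuming a transformers version that supports truncation_side; the max_length value and example prompt are placeholders, not repo code), one can truncate from the left so the target hole at the end of the prompt is preserved:

from transformers import AutoTokenizer

# Sketch, not the repo's code: truncate from the *left* so the last line
# (the completion target) survives, and cap the prompt below the window
# to leave room for the tokens that generate() will add.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
tokenizer.pad_token = tokenizer.eos_token      # CodeGen's tokenizer has no pad token by default
tokenizer.truncation_side = "left"
prompt_batch = ["# placeholder prompt\ndef add(a, b):\n    return"]
prompts = tokenizer(prompt_batch, return_tensors="pt", padding=True,
                    truncation=True, max_length=1900)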

pppyb (Author) commented May 19, 2024

Thank you very much for your prompt response! I appreciate your suggestions and will attempt both solutions. However, I have some concerns regarding the process, as I am trying to replicate the results you presented in your paper on the codegen-350M-mono model, specifically those in Table 2 (a, b).

I followed the instructions in the README meticulously and made no changes to the code logic other than modifying the hardcoded input paths. The steps I followed are outlined below:

  1. I ran the run_RG1_and_oracle_method function in run_pipeline.py to generate prompts/rg-one-gram-ws-20-ss-2.jsonl.
  2. I then executed codegen_inference.py to produce the prediction file: prompts/rg-one-gram-ws-20-ss-2_codegen-350M-mono.jsonl.
  3. I updated the prediction_path in run_pipeline.py to the newly generated prediction file and reran the run_RepoCoder_method in run_pipeline.py to obtain prompts/repocoder-one-gram-ws-20-ss-2.jsonl.
  4. Lastly, I used the prompts/repocoder-one-gram-ws-20-ss-2.jsonl as an input to run codegen_inference.py again to get the results for the repocoder algorithm.

Could you please let me know whether you encountered any issues when producing the results documented in your paper? I am concerned there may be a problem with my process, since I am strictly following the steps above without altering any fundamental code logic.

Your guidance on this matter would be greatly appreciated.
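
In other words, the replication run boils down to the following input/output chain (a plain summary of the four steps above; the paths are the ones quoted in this thread, nothing here comes from the repo's code):

# Summary of the four steps above as an input -> output chain (illustrative only).
pipeline = [
    ("run_RG1_and_oracle_method (run_pipeline.py)",
     "-",
     "prompts/rg-one-gram-ws-20-ss-2.jsonl"),
    ("codegen_inference.py",
     "prompts/rg-one-gram-ws-20-ss-2.jsonl",
     "prompts/rg-one-gram-ws-20-ss-2_codegen-350M-mono.jsonl"),
    ("run_RepoCoder_method (run_pipeline.py, prediction_path updated)",
     "prompts/rg-one-gram-ws-20-ss-2_codegen-350M-mono.jsonl",
     "prompts/repocoder-one-gram-ws-20-ss-2.jsonl"),
    ("codegen_inference.py",
     "prompts/repocoder-one-gram-ws-20-ss-2.jsonl",
     "RepoCoder predictions"),
]
for step, inp, out in pipeline:
    print(f"{step}: {inp} -> {out}")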

zfj1998 (Collaborator) commented May 20, 2024

The pipeline looks great. However, if you want to get the results for the 3rd and 4th iterations, you may need to change the mode here to 'r-g-r-g-r-g' or 'r-g-r-g-r-g-r-g' and then call run_RepoCoder_method again to obtain the prompt files for the continued rounds.
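
For example (a sketch, not repo code), the mode string for k iterations is just k alternating retrieval/generation rounds joined together:

iterations = 3                          # 3rd iteration of retrieve-then-generate
mode = "-".join(["r-g"] * iterations)   # -> "r-g-r-g-r-g"
print(mode)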

@kechenliuuu3469

@pppyb - thanks for posting this detailed issue. It has helped me better understand the process of using this repo. However, I still have one doubt: did you implement codegen-inference.py yourself to query CodeGen, or is it part of the repository?

Thanks a ton!

pppyb (Author) commented Jun 18, 2024

> @pppyb - thanks for posting this detailed issue. It has helped me better understand the process of using this repo. However, I still have one doubt: did you implement codegen-inference.py yourself to query CodeGen, or is it part of the repository?
>
> Thanks a ton!

Hey @kechenliuuu3469 -- apologies for the delayed response on this. I think codegen_inference.py is part of the repository, and I hope issue #28 will solve your problem.
