Issue with Handling Input Prompt Files in codegen-inference.py #32
Sometimes the tokenizer won't return the same number of tokens, so the token limit enforced at CodeT/RepoCoder/build_prompt.py, Line 15 in 35f54d6, may be exceeded slightly.
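A minimal sketch, in case it helps, of re-tokenizing the final assembled prompt to see how far it drifts from the intended limit; the tokenizer name and the `max_prompt_tokens` budget below are assumptions, not values taken from build_prompt.py:

```python
# Sanity check of the assembled prompt length; the tokenizer name and the
# `max_prompt_tokens` budget are assumptions, not values from build_prompt.py.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
max_prompt_tokens = 2048 - 100  # hypothetical budget: context window minus generation length

def check_prompt_length(prompt: str) -> int:
    """Return the true token count of the prompt and warn if it exceeds the budget."""
    n_tokens = len(tokenizer.encode(prompt))
    if n_tokens > max_prompt_tokens:
        print(f"prompt exceeds the budget by {n_tokens - max_prompt_tokens} tokens")
    return n_tokens
```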
If you use the tokenizer's forced truncation, it will change the last lines of code and thus affect the target hole of the code completion. A better way is to reduce the length of the retrieved context, or to drop in-file context from its beginning lines so that the lines closest to the target hole are preserved (see the sketch below). You should also check carefully whether 'rg-one-gram-ws-20-ss-24.jsonl' is generated correctly; see the code at CodeT/RepoCoder/build_prompt.py, Line 77 in 35f54d6.
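A minimal sketch of the alternative suggested above: instead of letting the tokenizer force-truncate the end of the prompt, drop lines from the beginning of the in-file context until the whole prompt fits, so the lines next to the target hole stay intact. The helper name and token budget are illustrative, not code from the repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

def truncate_from_start(retrieved_context: str, infile_context: str, budget: int) -> str:
    """Drop leading lines of the in-file context until the prompt fits in `budget`
    tokens, preserving the lines closest to the target hole."""
    infile_lines = infile_context.splitlines()
    while infile_lines:
        prompt = retrieved_context + "\n" + "\n".join(infile_lines)
        if len(tokenizer.encode(prompt)) <= budget:
            return prompt
        infile_lines.pop(0)  # discard the earliest, least relevant line first
    # Fall back to the retrieved context alone if no in-file lines can be kept.
    return retrieved_context
```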
Thank you very much for your prompt response! I appreciate your suggestions and will attempt both solutions. However, I have some concerns regarding the process, as I am trying to replicate the results you presented in your paper on the codegen-350M-mono model, specifically those in Table 2 (a, b). I followed the instructions provided in the README file meticulously and made no changes to the code logic other than modifying the hardcoded input paths. The steps I followed are as outlined below:
Could you please let me know if you encountered any issues when you achieved the results documented in your paper? I am concerned there may be an issue with my process, since I am strictly adhering to the steps mentioned without altering any fundamental code logic. Your guidance on this matter would be greatly appreciated.
The pipeline looks great. However, if you want to get the results for the 3rd and 4th iterations, you may need to change the …
@pppyb - thanks for posting this detailed issue. It has helped me better understand the process of using this repo. However, I still have one doubt remaining - did you implement codegen-inference.py yourself to query CodeGen, or is it a part of the repository? Thanks a ton!
Hey @kechenliuuu3469 -- apologies for the lack of response earlier on this. I think inference.py is a part of the repository, and I hope issue #28 will solve your problem.
Thank you very much to the authors for their contributions. While attempting to run the RepoCoder method, we encountered an issue in the codegen_inference.py file. After modifying the file to:
It seems that this file can only generate results for the in-file method, and we encountered the following error:
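For reference, a minimal sketch of what this inference step might look like with HuggingFace transformers, assuming each line of the input JSONL carries a `prompt` field; the model name, file layout, and generation settings are illustrative assumptions rather than the repository's exact codegen_inference.py logic:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: codegen-350M-mono and a JSONL file with one {"prompt": ...} per line.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

def generate_completions(prompt_file: str, max_new_tokens: int = 100) -> list[str]:
    completions = []
    with open(prompt_file) as f:
        for line in f:
            prompt = json.loads(line)["prompt"]
            inputs = tokenizer(prompt, return_tensors="pt")
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
            # Keep only the newly generated tokens, not the echoed prompt.
            new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
            completions.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    return completions
```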