# CODE GENERATION

1. Introduction
2. HumanEval
3. Generation with LLMs
4. Generation with Agents
5. Context-aware generation
6. Exercise
7. References

# 1. Introduction

#### PROBLEM

- Input: documentation (+ unit tests)
- Output: code

#### DATASETS

- [HumanEval](https://arxiv.org/abs/2107.03374): [164 programming problems](https://huggingface.co/datasets/openai_humaneval)
- [MBPP](https://arxiv.org/abs/2108.07732): Mostly Basic Python Problems, [974 programming problems](https://huggingface.co/datasets/mbpp)
- [WikiSQL](https://github.com/salesforce/WikiSQL): 87726 hand-annotated SQL query and natural language question pairs
- [more datasets](https://paperswithcode.com/datasets?task=code-generation)

#### METRICS

![](res/12_metrics.png)

Source: [Li et al - The Metric for Automatic Code Generation 2020](https://www.sciencedirect.com/science/article/pii/S1877050920302210)

#### METRIC `pass@k`

1. Generate $k$ code samples per problem.
2. A problem is considered solved if at least one of the $k$ samples passes all unit tests; the fraction of problems solved is reported.
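Computing this fraction from exactly $k$ samples gives a high-variance estimate. The HumanEval paper instead draws $n \ge k$ samples per problem, counts the $c$ samples that pass, and averages the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ over all problems. A minimal sketch of that estimator follows; the function names are illustrative and not taken from the reference implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number of samples that passed, k: budget.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k samples drawn without replacement are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three problems with 10 samples each; 0, 3 and 10 samples passed.
print(mean_pass_at_k([(10, 0), (10, 3), (10, 10)], k=1))
```

With $k = 1$ the estimator reduces to the plain pass rate $c / n$ for each problem.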

# 2. HumanEval

- paper: [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)
- github: [openai/human-eval](https://github.com/openai/human-eval)
- dataset: [openai_humaneval](https://huggingface.co/datasets/openai_humaneval)
- leaderboard: [Code Generation on HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval)
- [Big Code Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard)

HumanEval:
- 164 tasks
- task information: function signature, docstring, body, and several tests (on average, 7.7 tests per task)
- tasks are hand-written to avoid overlap with training data from GitHub, which already contains solutions to many published problems: for example, more than ten public repositories contain solutions to [Codeforces](https://codeforces.com/) tasks
- programming tasks in the HumanEval dataset assess language understanding, reasoning, algorithms, and simple mathematics

Quality is evaluated using the `pass@k` metric.
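To make the check concrete, here is a minimal sketch of how one generated completion can be tested. It assumes the record layout of the Hugging Face dataset (`prompt`, `test`, `entry_point`, where `test` defines a `check(candidate)` function); the toy task below is invented for illustration and is not part of HumanEval. The sketch runs code with plain `exec`, which is unsafe for untrusted model output, so a real harness should run it in a sandbox with time limits.

```python
def passes_tests(task: dict, completion: str) -> bool:
    """Run prompt + completion against the task's check() function."""
    program = (
        task["prompt"] + completion + "\n\n"
        + task["test"] + "\n\n"
        + f"check({task['entry_point']})\n"
    )
    namespace = {}
    try:
        exec(program, namespace)  # unsafe outside a sandbox
        return True
    except Exception:
        # Failed assertion, syntax error, runtime error: all count as a failure.
        return False


# Hypothetical task in the same shape as a HumanEval record.
toy_task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes_tests(toy_task, good_completion))
print(passes_tests(toy_task, bad_completion))
```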

#### EXAMPLES

- white background: the prompt sent to the model
- yellow background: the completion generated by the model

![Prompt](https://production-media.paperswithcode.com/datasets/a2af6cf1-212b-4a05-8d5c-55170a21ce05.png)

# 3. Generation with LLMs

#### DEMO

![](./res/12_qwen_demo.png)

[Qwen2.5-Coder](https://huggingface.co/spaces/Qwen/Qwen2.5-Coder-demo)

#### EASY WITH HUGGINGFACE

```python
from transformers import pipeline
pipe = pipeline('text-generation', model='lvwerra/codeparrot')

text = 'def add_numbers(a, b):\n    """add two numbers"""'
code = pipe(text)[0]['generated_text']

print(code)
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "write a quick sort algorithm."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
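Instruction-tuned models usually answer with prose and wrap the code in a Markdown code fence, so the raw `response` cannot be executed or tested directly. The helper below is a small sketch of one way to pull out the first fenced block; `extract_code` is a hypothetical name, and the fallback of returning the whole response when no fence is found is an assumption rather than a standard.

```python
import re

FENCE = "`" * 3  # built programmatically to keep this example fence-safe
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the stripped response if none."""
    match = CODE_BLOCK.search(response)
    if match:
        return match.group(1).rstrip()
    return response.strip()


sample = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def f():\n    return 1\n"
    + FENCE + "\nHope this helps."
)
print(extract_code(sample))
```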

#### CODE MODELS: HUMANEVAL LEADERBOARD

![](res/12_humaneval_1.png)

![](res/12_humaneval_2.png)

Source: [[Zheng et al. - A Survey of Large Language Models for Code 2023]](https://arxiv.org/abs/2311.10372)

#### CODE MODELS: MBPP LEADERBOARD

![](res/12_mbpp_1.png)

![](res/12_mbpp_2.png)

Source: [[Zheng et al. - A Survey of Large Language Models for Code 2023]](https://arxiv.org/abs/2311.10372)

#### CODE MODELS: BIG CODE LEADERBOARD

![](./res/12_big_code_leaderboard.png)

Source: [[bigcode-models-leaderboard]](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard)

# 4. Generation with Agents

![](res/12_humaneval_agents_1.png)
![](res/12_humaneval_agents_2.png)

Source: [Code Generation on HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval)

| Rank | Model | Pass@1 | Source | Year |
|---:|-------------------------------------------------------|-------|--------------------------------------------------------------------------------------------------------|-----:|
|  1 | AgentCoder  (GPT-4)                                   | 96.3  | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation                  | 2023 |
|  2 | LDB  (GPT-3.5, based on seed programs from Reflexion) | 95.1  | LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step                      | 2024 |
|  3 | Language Agent Tree Search  (GPT-4)                   | 94.4  | Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models                    | 2023 |
|  4 | L2MAC  (GPT-4)                                        | 90.2  | L2MAC: Large Language Model Automatic Computer for Extensive Code Generation                           | 2023 |
|  5 | OctoCoder  (GPT-4)                                    | 86.6  | OctoPack: Instruction Tuning Code Large Language Models                                                | 2023 |
|  6 | ANPL  (GPT-4)                                         | 86.6  | ANPL: Towards Natural Programming with Interactive Decomposition                                       | 2023 |
|  7 | MetaGPT  (GPT-4)                                      | 85.9  | MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework                                    | 2023 |
|  8 | Parsel  (GPT-4 + CodeT)                               | 85.1  | Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions                         | 2022 |
|  9 | Claude 3 Opus  (0-shot)                               | 84.9  | The Claude 3 Model Family: Opus, Sonnet, Haiku                                                         | 2024 |
| 10 | Language Agent Tree Search  (GPT-3.5)                 | 83.8  | Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models                    | 2023 |
| 11 | AgentCoder  (ChatGPT)                                 | 79.9  | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation                  | 2023 |
| 12 | LLMCodeGen Scrum  (GPT-3.5 + zero-shot)               | 78.5  | When LLM-based Code Generation Meets the Software Development Process                                  | 2024 |
| 13 | GPT-4  (0-shot)                                       | 76.5  | DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence       | 2024 |
| 14 | ANPL  (GPT-3.5)                                       | 76.2  | ANPL: Towards Natural Programming with Interactive Decomposition                                       | 2023 |
| 15 | Claude 3 Haiku  (0-shot)                              | 75.9  | The Claude 3 Model Family: Opus, Sonnet, Haiku                                                         | 2024 |
| 16 | INTERVENOR  (GPT-3.5)                                 | 75.6  | INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair | 2023 |
| 17 | SRank  (WizardCoder)                                  | 75.31 | Neural Code Generation Enhancement via Functional Overlap Reranking                                    | 2023 |
| 18 | Gemini Ultra  (0-shot)                                | 74.4  | Gemini: A Family of Highly Capable Multimodal Models                                                   | 2023 |
| 19 | Claude 3 Sonnet  (0-shot)                             | 73.0  | The Claude 3 Model Family: Opus, Sonnet, Haiku                                                         | 2024 |
| 20 | Gemini 1.5 Pro  (0-shot)                              | 71.9  | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context                    | 2024 |

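*On this leaderboard, agent-based approaches generally outperform zero-shot use of the same base model: for example, AgentCoder (GPT-4) reaches 96.3 versus 76.5 for zero-shot GPT-4.*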
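Many of the systems in the table share one idea: run the generated code against tests and feed the failure back to the model. The loop below is a generic sketch of that idea and does not reproduce any specific system from the table. `generate` stands for any model call, and `stub_model` is a hypothetical stand-in that returns a buggy answer first and a fixed one after receiving feedback. As before, `exec` is used only for illustration and is unsafe for untrusted code.

```python
from typing import Callable, Optional, Tuple


def run_tests(code: str, tests: str) -> Optional[str]:
    """Return None if all tests pass, otherwise a short error description."""
    namespace = {}
    try:
        exec(code + "\n" + tests, namespace)  # unsafe outside a sandbox
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def generate_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    tests: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Generate, test, and retry with the error message as feedback."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        feedback = run_tests(code, tests)
        if feedback is None:
            return code, round_no
    return None, max_rounds


def stub_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical model: wrong on the first try, correct once it sees feedback."""
    if feedback is None:
        return "def square(x):\n    return x + x\n"
    return "def square(x):\n    return x * x\n"


tests = "assert square(3) == 9, 'square(3) should be 9'\n"
code, rounds = generate_with_feedback(stub_model, "Write square(x).", tests)
print(rounds)
print(code)
```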

More:
- [Collaborative Agents](09_collaborative_agents.ipynb)

# 5. Context-aware generation

- fine-tuning
- in-context learning
- RAG

#### RAG (Retrieval-Augmented Generation)

![](res/12_naive_rag.png)

Source: [[Gao et al. - Retrieval-Augmented Generation for Large Language Models 2024]](https://arxiv.org/abs/2312.10997)

#### Indexing

1. clean and extract raw data from diverse formats
2. convert it into a uniform plain-text format
3. segment the text into smaller chunks
4. encode the chunks into vector representations using an embedding model
5. store the embeddings in a vector database

This step is crucial for enabling efficient similarity searches in the subsequent retrieval phase.
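The indexing steps can be sketched with the standard library alone. Everything here is a simplification under stated assumptions: `toy_embed` is a hashed bag-of-words vector standing in for a real embedding model, a plain Python list stands in for the vector database, and the chunk size and overlap are arbitrary illustrative values.

```python
import hashlib
import math


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> list:
    """Split text into overlapping chunks of at most `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def toy_embed(text: str, dim: int = 64) -> list:
    """Toy embedding: hash each token into a bucket, then L2-normalise."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: dict) -> list:
    """Clean, chunk, embed, and store every document."""
    index = []
    for doc_id, raw_text in documents.items():
        cleaned = " ".join(raw_text.split())  # collapse whitespace
        for chunk in chunk_words(cleaned):
            index.append({"doc": doc_id, "text": chunk, "vector": toy_embed(chunk)})
    return index


docs = {"a.txt": "one two three   four five six\nseven eight nine ten"}
index = build_index(docs)
for entry in index:
    print(entry["doc"], "|", entry["text"])
```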

#### Retrieval

1. transform the query into a vector representation, using the same encoding model as in the indexing phase
2. compute similarity scores between the query vector and the vectors of the chunks in the indexed corpus
3. rank the chunks and retrieve the top-$K$ most similar to the query

The retrieval phase often struggles with precision and recall: it may select misaligned or irrelevant chunks and miss crucial information.
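The ranking step is a nearest-neighbour search. The sketch below uses cosine similarity, a common choice although other similarity measures are possible, on small hand-made vectors; a real system would compare embeddings produced by the indexing model and would typically delegate the search to the vector database.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, chunk_vecs, k: int = 2):
    """Return (chunk index, score) for the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.2]
for idx, score in top_k(query_vec, chunk_vecs, k=2):
    print(idx, round(score, 3))
```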

#### Generation

- the retrieved chunks are added to the prompt as expanded context for the query

In generating responses, the model may face the issue of **hallucination**, where it produces content not supported by the retrieved context. This phase can also suffer from irrelevance, toxicity, or bias in the outputs, detracting from the quality and reliability of the responses.
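A minimal sketch of assembling such a prompt is shown below. The instruction wording, the numbered-chunk layout, and the `max_chars` budget are illustrative assumptions; asking the model to rely only on the supplied context is one common way to reduce unsupported answers, but it does not eliminate them.

```python
def build_prompt(query: str, chunks, max_chars: int = 2000) -> str:
    """Combine retrieved chunks and the user query into a single prompt."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[{i}] {chunk}"
        if used + len(block) > max_chars:
            break  # stay within the context budget
        context_parts.append(block)
        used += len(block)
    context = "\n\n".join(context_parts)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt("What does foo do?", ["foo returns 1", "bar returns 2"])
print(prompt)
```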

#### Advanced RAG

![](res/12_naive_advanced_modular_rag.png)

Source: [[Gao et al. - Retrieval-Augmented Generation for Large Language Models 2024]](https://arxiv.org/abs/2312.10997)

#### RAG or Fine-Tuning

![](res/12_rag_vs_all.png) 

Source: [[Gao et al. - Retrieval-Augmented Generation for Large Language Models 2024]](https://arxiv.org/abs/2312.10997)

#### RAG for Code Generation

The diagram below shows a high-level RAG pattern for code generation with Codey APIs.

![](res/12_context-aware_code_generation_rag.png)

1. Identify source information: API definition, code repositories, documentation, or similar.
2. Determine the chunking scheme: prefer methods that respect natural code boundaries, such as function, class, or module borders; random splits or splits in the middle of a statement can break the context and degrade the output.
3. Generate embeddings and index them in a vector database: when a query is received, another embedding is generated for the query and used to help retrieve relevant information chunks.
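For Python sources, chunking along function and class boundaries (step 2) can be sketched with the standard `ast` module. This is one possible scheme, not the one used in the referenced blog post: it only collects top-level definitions, and module-level statements such as imports would need separate handling.

```python
import ast


def chunk_python_source(source: str) -> list:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "code": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk can then be embedded and indexed as a unit, so a retrieved result is always a complete definition.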

More:
- [Context-aware code generation: Retrieval augmentation and Vertex AI Codey APIs](https://cloud.google.com/blog/products/ai-machine-learning/context-aware-code-generation-rag-and-vertex-ai-codey-apis)

# 6. Exercise

Implement code generation using the RAG approach:
- you have a small repository (several files) whose code is written in a particular style (you may generate such a repository)
- given a request, generate new code written in the same style

Possibly useful: [Code a simple RAG from scratch](https://huggingface.co/blog/ngxson/make-your-own-rag)

# 7. References

- [Chen et al - Evaluating Large Language Models Trained on Code 2021](https://arxiv.org/abs/2107.03374)
- https://arxiv.org/abs/2208.03133
- [Zheng et al - A Survey of Large Language Models for Code 2023](https://arxiv.org/abs/2311.10372)
- https://arxiv.org/abs/2401.15940
- https://arxiv.org/abs/2403.08299
- https://arxiv.org/abs/2407.21059v1
- [Gao et al - Retrieval-Augmented Generation for Large Language Models A Survey 2024](https://arxiv.org/abs/2312.10997)
- [Zhao et al - Retrieval-Augmented Generation for AI-Generated Content A Survey 2024](https://arxiv.org/abs/2402.19473)
- [Yu et al - Evaluation of Retrieval-Augmented Generation A Survey 2024](https://arxiv.org/abs/2405.07437)
- https://github.com/saltudelft/ml4se?tab=readme-ov-file#code-generation