# 11. CODE SEARCH

1. Introduction
2. Bi-Encoder
3. Cross-Encoder
4. TOSS
5. RAG
6. References

# 1. Introduction

#### Why do we need code search?

- to find code that implements a given functionality from a text description
- to find identical or similar code from a code fragment

#### What is the difference between code search and text search?

- programming languages differ from natural languages: code has a structure
- a single repository may mix several programming languages
- the search query and its answer are often written in different languages (for example, a natural-language query and a code answer)

#### Search queries

A query is an explicit expression of search intent used by the user.

What do we want from a query language?
- Simplicity. It is easy to formulate a query.
- Expressiveness. A query language should allow one to define what one wants to find.
- Precision. Queries should allow one to express intent as unambiguously as possible.

Types of query languages:
- Informal: free-form
- Formal:
  - using well-known programming languages
  - using specially developed programming languages
  - using pairs: input -- output
- Hybrid

#### Free-form query languages

> For example, `read file line by line`

The query describes the intent in natural language. It may also contain elements of a programming language.

> For example, `FileReader close`

Advantages:
- convenient for the user

Disadvantages:
- ambiguity (e.g., *float*)
- the vocabulary used in queries may not match the vocabulary of the code base (e.g., both *array* and [.] denote an array in Python)

The vocabulary mismatch problem (e.g., when searching for regular expressions) can be alleviated by embedding natural-language words and programming-language words in a common vector space (word2vec, CodeBERT), or by preprocessing and expanding the queries.
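
Query expansion can be sketched in a few lines of Python. This is only an illustration: the `SYNONYMS` table and the function name `expand_query` are hypothetical, and a real system would derive synonyms from embeddings or search logs rather than a hand-written dictionary.

```python
import re

# Hypothetical synonym table: query word -> words likely to appear in code.
# The entries are illustrative assumptions, not taken from a real corpus.
SYNONYMS = {
    "array": ["list", "ndarray"],
    "regex": ["re", "pattern"],
    "remove": ["delete", "pop"],
}


def expand_query(query: str) -> list[str]:
    """Lower-case and tokenize the query, then append known synonyms."""
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    expanded = list(tokens)
    for token in tokens:
        for synonym in SYNONYMS.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded
```

For the query `Remove item from array`, the expanded token list keeps the original words and adds `delete`, `pop`, `list`, and `ndarray`, so snippets that use the code-base vocabulary can still be retrieved.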

#### Queries based on common programming languages

Main approaches:
- code fragments. For example, a query is a code with a partial implementation that needs to be completed:
> ```try { File file = File.createTempFile("foo", "bar"); } catch (IOException e) { }```
- code with placeholders --- missing code fragments are explicitly indicated:
>```public void actionClose(JButton a, JFrame f) { __CODE_SEARCH__; }```
- code with some patterns --- abstract code templates for searching:
>```if (# = #) @ ;```

Disadvantages:
- partial code is difficult to parse with existing libraries (the code must first be made syntactically correct)

Advantages:
- can be used in a recommender system while writing code (saves time per request)

#### Queries based on specially designed programming languages

**Using logic programming languages**

Predicates that describe properties of the code. For example, find a package with a class called `HelloWorld`:
>```package(?P, class, ?C), class(?C, name, HelloWorld)```

Sometimes it is possible to specify higher-level properties. For example,
>```import count > 5 AND extends class FooBar```

**Significant extensions of existing languages**

For example, finding nested `if-else`:
>```$ ( if $$ else $ ) $ + ```

**Other special languages**

For example, a program can be described through its computation graph, which makes it possible to search independently of the programming language.

#### Using a pair: input -- output

The approach uses the key feature of code -- executability. The query describes the desired behavior through examples. For example,
> for the input `abc@def.org` we want to have `abc` as the output.
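
A minimal sketch of this idea in Python: each candidate snippet is executed on the example inputs and kept only if it reproduces every expected output. The three candidate functions and the name `search_by_examples` are hypothetical; a real engine would take candidates from its index and run them in a sandbox.

```python
from typing import Callable


# Hypothetical candidate snippets; in a real system these come from the index.
def local_part(s: str) -> str:
    return s.split("@")[0]


def domain(s: str) -> str:
    return s.split("@")[1]


def upper(s: str) -> str:
    return s.upper()


CANDIDATES: list[Callable[[str], str]] = [local_part, domain, upper]


def search_by_examples(examples: list[tuple[str, str]]) -> list[str]:
    """Return names of candidates that reproduce every (input, output) pair."""
    matches = []
    for func in CANDIDATES:
        try:
            if all(func(inp) == out for inp, out in examples):
                matches.append(func.__name__)
        except Exception:
            # A candidate that crashes on an example does not match.
            continue
    return matches
```

With the single example `("abc@def.org", "abc")`, only `local_part` survives. Adding more examples narrows the result set further, at the cost of more executions per candidate.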

#### Hybrid queries

These approaches combine several of the options described above. For example,
> ```sort playerScores in ascending order```, where `playerScores` is a variable from the code.

#### Overview of code search

![](res/11_overview.png)

[Source: [Grazia Pradel 2022](https://arxiv.org/abs/2204.02765)]

- *indexing*: preparing data for subsequent search (what to index and what not? what connections remain?)
- *preprocessing & expansion*: modifying the user's query to improve the search quality (using search history; replacing words in the query based on synonyms or embeddings, etc.)
- *retrieval*: retrieving candidates based on the query
- *ranking*: ranking (based on distances, using machine learning methods)
- *pruning*: filtering (by threshold, by quantity, remove similar)
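
The stages above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the index is a plain inverted index over word tokens, ranking uses Jaccard similarity, and the names `build_index` and `search` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Split text or code into lower-case word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Indexing: map each token to the ids of the snippets that contain it."""
    index: dict[str, set[str]] = {}
    for snippet_id, code in corpus.items():
        for token in tokenize(code):
            index.setdefault(token, set()).add(snippet_id)
    return index


def search(query: str, corpus: dict[str, str], index: dict[str, set[str]],
           threshold: float = 0.1, limit: int = 3) -> list[str]:
    # Preprocessing: normalize the query into tokens (no expansion here).
    q_tokens = tokenize(query)
    # Retrieval: every snippet that shares at least one token with the query.
    candidates: set[str] = set()
    for token in q_tokens:
        candidates |= index.get(token, set())
    # Ranking: Jaccard similarity between query tokens and snippet tokens.
    scored = []
    for snippet_id in candidates:
        c_tokens = tokenize(corpus[snippet_id])
        score = len(q_tokens & c_tokens) / len(q_tokens | c_tokens)
        scored.append((score, snippet_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Pruning: drop low scores, then cap the number of results.
    return [sid for score, sid in scored if score >= threshold][:limit]
```

Each stage can be replaced independently, for example by swapping the Jaccard ranking for a learned model while keeping the same index.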

#### Benchmarks

- [CodeSearchNet](https://github.com/github/CodeSearchNet)
- [CodeXGLUE](https://github.com/microsoft/CodeXGLUE)
- [CoSQA](https://arxiv.org/abs/2105.13239)

#### Approaches

Existing approaches to code search can be divided into two groups:
1. Information Retrieval-based methods: usually fast, but sometimes inaccurate.
2. DL-based methods: can achieve high quality, but are usually slower because they use large models.

Among DL-based approaches, we can distinguish *bi-encoder* and *cross-encoder* solutions.

# 2. Bi-encoder approach

In the bi-encoder approach, the input data (text query, code) are independently embedded into a vector space.

![](./res/11_bi-encoder.png)

Thus, embeddings for the code can be calculated in advance, which speeds up the search.

#### Notations

Let
- $q$ --- search query
- $C$ --- code corpus (where search occurs)
- $M(\cdot, \cdot)$ --- matching function: the higher the value of $M(c, q)$, the better code snippet $c$ matches search query $q$.

Then code search means finding the snippet $c$ that maximizes $M(c, q)$ for query $q$:
$$\arg\max_{c \in C} M(c, q).$$

#### Approach

Let $c \in C$ be a code snippet from corpus $C$.
Let $\Gamma'$ and $\Gamma''$ be encoders for queries and code snippets, respectively.
Then let $e_q$ and $e_c$ denote embeddings for $q$ and $c$:
$$e_q = \Gamma'(q)$$
$$e_c = \Gamma''(c).$$
In addition, we need a similarity (or distance) function $s_{bi}$ on vectors, for example the cosine of the angle between the vectors or the Euclidean distance (a distance must be negated, since higher values of $M$ mean a better match):
$$s_{bi}: (e_q, e_c) \mapsto \mathbb{R}.$$

Thus, we have $$M_{bi}(c, q) = s_{bi}(\Gamma'(q), \Gamma''(c)).$$

The encoders $\Gamma'$ and $\Gamma''$ are neural networks pre-trained on pairs (text, code).
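
A minimal sketch of the bi-encoder scheme, assuming a toy encoder in place of the neural networks: `ToyEncoder` is a normalized bag-of-words over a fixed vocabulary and stands in for both $\Gamma'$ and $\Gamma''$. The class and function names are hypothetical. The point is the data flow: code embeddings are computed once, and only the query is encoded at search time.

```python
import math
import re


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyEncoder:
    """Stand-in for the neural encoders: unit-length bag-of-words vector."""

    def __init__(self, vocabulary: set[str]):
        self.position = {tok: i for i, tok in enumerate(sorted(vocabulary))}

    def __call__(self, text: str) -> list[float]:
        vec = [0.0] * len(self.position)
        for tok in tokens(text):
            if tok in self.position:  # out-of-vocabulary tokens are ignored
                vec[self.position[tok]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]


def cosine(u: list[float], v: list[float]) -> float:
    """s_bi: cosine similarity (both inputs are already unit-length)."""
    return sum(a * b for a, b in zip(u, v))


class BiEncoderIndex:
    def __init__(self, corpus: dict[str, str]):
        vocabulary = {tok for code in corpus.values() for tok in tokens(code)}
        self.encode = ToyEncoder(vocabulary)
        # Code embeddings e_c are computed once, before any query arrives.
        self.embeddings = {sid: self.encode(code) for sid, code in corpus.items()}

    def search(self, query: str, k: int = 1) -> list[str]:
        e_q = self.encode(query)  # only the query is encoded at search time
        ranked = sorted(self.embeddings,
                        key=lambda sid: (-cosine(e_q, self.embeddings[sid]), sid))
        return ranked[:k]
```

In a real system the linear scan in `search` is typically replaced by an approximate nearest-neighbor index over the precomputed embeddings.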

# 3. Cross-encoder approach

Unlike a bi-encoder, a cross-encoder applies the attention mechanism to the query-code pair as a whole, so natural-language information interacts with programming-language information during vectorization.

![](./res/11_cross-encoder.png)

Thanks to this, cross-encoder approaches can achieve higher accuracy than bi-encoder solutions.
However, cross-encoder approaches cannot produce separate embeddings for the code, which many applications require.

#### Approach

In this approach, the search query and the code snippet are concatenated via a special token $<SEP>$ and fed to the model $\Gamma$, which computes the cross-modal similarity (encoder + regression):

$$M_{cr}(c, q) = \Gamma([q, <SEP>, c]).$$
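
The data flow can be sketched as follows. The function `toy_gamma` is a hypothetical stand-in for the neural model $\Gamma$: instead of attention it matches every query token against every code token by shared prefix. It is not a real cross-encoder, but it shows the key property: the score is computed from the joint sequence, so nothing can be precomputed for the code alone.

```python
import re

SEP = "<SEP>"


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_input(query: str, code: str) -> list[str]:
    """Concatenate query and code tokens into one sequence [q, <SEP>, c]."""
    return tokens(query) + [SEP] + tokens(code)


def prefix_sim(a: str, b: str) -> float:
    """Length of the common prefix, relative to the longer token."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))


def toy_gamma(sequence: list[str]) -> float:
    """Toy joint scorer: mean best prefix match of each query token in the code."""
    split = sequence.index(SEP)
    q_part, c_part = sequence[:split], sequence[split + 1:]
    if not q_part or not c_part:
        return 0.0
    return sum(max(prefix_sim(q, c) for c in c_part) for q in q_part) / len(q_part)


def cross_score(code: str, query: str) -> float:
    """M_cr(c, q): score the concatenated pair with the joint model."""
    return toy_gamma(build_input(query, code))


def rerank(query: str, candidates: dict[str, str]) -> list[str]:
    """Order candidate ids by descending M_cr; one model call per candidate."""
    return sorted(candidates, key=lambda sid: (-cross_score(candidates[sid], query), sid))
```

Because every candidate requires a fresh call to the joint model, the cost of `rerank` grows linearly with the number of candidates, which is why cross-encoders are usually applied to a short list.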

# 4. TOSS

How can we combine the advantages of all these approaches?

![](./res/11_toss_paper.png)

In [Revisiting Code Search in a Two-Stage Paradigm](https://arxiv.org/abs/2208.11274), the authors proposed
TOSS (TwO-Stage fuSion code Search framework) --- a two-stage code search framework that combines the advantages of different code search methods.
TOSS first uses Information Retrieval and bi-encoder models to efficiently find a small number of top candidates, and then uses cross-encoders for more accurate ranking.

#### Two-stage code search

1. First, find a set of candidate snippets based on the function $M_{recall}$.

2. Then rank the set of candidates based on the function $M_{rank}$.

![](./res/11_toss_overview.png)

1. Consider a simple case where at the first stage (stage 1) we only have one model.

$C_{sub} = \arg \max_{C' \subset C, |C'| = K} \sum_{c \in C'} M_{recall}(c, q)$ --- a set of $K$ candidates.

The function $M_{recall}$ must be fast, even at the cost of some accuracy, because it is evaluated on the whole corpus $C$.

2. $c_{*} = \arg \max_{c \in C_{sub}} M_{rank }(c, q)$ --- the final snippet.

The $M_{rank}$ function can be complex and slow, because it is applied only to the candidates selected at the first stage.

The first stage ($M_{recall}$) can include several ($m$) different ways to search the code. The resulting candidates are combined and sent to the second stage:

$$C^{i}_{sub} = \arg \max_{C' \subset C, |C'| = K} \sum_{c \in C'} M^i_{recall}(c, q), \,\,\, 1 \leq i \leq m$$
$$C_{all} = \bigcup_{1 \leq i \leq m} C^i_{sub}$$
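
The two-stage scheme can be sketched in Python. Since the sum over a subset of size $K$ is maximized by the $K$ individually best-scored snippets, each $C^i_{sub}$ is simply a top-$K$ selection. The scoring functions below (`token_overlap`, `prefix_overlap`, `jaccard`) are hypothetical stand-ins for the IR, bi-encoder, and cross-encoder models; this is not the implementation from the TOSS paper.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]  # M(c, q): takes code first, then query


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap(code: str, query: str) -> float:
    """Cheap recall scorer 1: number of shared tokens."""
    return float(len(set(tokens(code)) & set(tokens(query))))


def prefix_overlap(code: str, query: str) -> float:
    """Cheap recall scorer 2: query tokens sharing a 3-char prefix with code."""
    code_prefixes = {t[:3] for t in tokens(code)}
    return float(sum(1 for t in set(tokens(query)) if t[:3] in code_prefixes))


def jaccard(code: str, query: str) -> float:
    """Stand-in for the expensive ranking model M_rank."""
    a, b = set(tokens(code)), set(tokens(query))
    return len(a & b) / len(a | b) if a | b else 0.0


def top_k(score: Scorer, query: str, corpus: dict[str, str], k: int) -> set[str]:
    """C_sub: the k snippets with the highest recall score."""
    ranked = sorted(corpus, key=lambda sid: (-score(corpus[sid], query), sid))
    return set(ranked[:k])


def recall_candidates(query: str, corpus: dict[str, str],
                      recall_fns: list[Scorer], k: int) -> set[str]:
    """Stage 1: C_all is the union of the top-k sets of all recall functions."""
    c_all: set[str] = set()
    for fn in recall_fns:
        c_all |= top_k(fn, query, corpus, k)
    return c_all


def two_stage_search(query: str, corpus: dict[str, str],
                     recall_fns: list[Scorer], rank_fn: Scorer, k: int) -> str:
    """Stage 2: the ranker sees at most m * k snippets, not the whole corpus."""
    c_all = recall_candidates(query, corpus, recall_fns, k)
    return max(sorted(c_all), key=lambda sid: rank_fn(corpus[sid], query))
```

The union contains at most $m \cdot K$ snippets, so the cost of the second stage does not depend on the size of the corpus.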

# 5. RAG

Retrieval Augmented Generation --- next lecture

# 6. References

- [Grazia Pradel - Code search A survey of techniques for finding code 2022](https://arxiv.org/abs/2204.02765)
- Birillo et al - Reflekt A library for compile-time reflection in Kotlin 2022
- Chai et al - Cross-domain deep code search with few-shot meta learning 2022
- Cheng Kuang - CSRS Code search with relevance matching and semantic matching 2022
- Eberhart McMillan - Generating clarifying questions for query refinement in source code search 2022
- Grazia et al - DiffSearch A scalable and precise search engine for code changes 2022
- Gu et al - Accelerating code search with deep hashing and code classification 2022
- Gu et al - Multimodal representation for neural code search 2022
- Haldar et al - A multi-perspective architecture for semantic code search 2021
- [Hu et al - Revisiting code search in a two-stage paradigm 2022](https://arxiv.org/abs/2208.11274)
- Li et al - CodeRetriever Unimodal and bimodal contrastive learning for code search 2022
- Liu et al - CodeMatcher Searching code based on sequential semantics of important query words 2022
- Liu et al - GraphSearchNet Enhancing GNNs via capturing global dependency for semantic code search 2022
- Shi et al - Enhancing semantic code search with multimodal contrastive learning and soft data augmentation 2022
- Shuai et al - Improving code search with co-attentive representation learning 2020
- Sun et al - Code search based on context-aware code translation 2022
- Sun et al - On the importance of building high-quality training datasets for neural code search 2022
- Villmow et al - Addressing leakage in self-supervised contextualized code retrieval 2022
- Wang et al - Enriching query semantics for code search with reinforcement learning 2021
- Yan et al - Are the code snippets what we are searching for A benchmark and an empirical study on code search with natural-language queries 2020
- Zhang et al - Bag-of-words baselines for semantic code search 2021
- https://ieeexplore.ieee.org/abstract/document/8453172
- https://dl.acm.org/doi/abs/10.1145/2393596.2393606
- https://ieeexplore.ieee.org/abstract/document/6606630