# Suggestion 3: Using Pretrained Code Models to Perform Source Code Localization

Published research on using large code models for source code retrieval is limited, so our group turned to grey literature and web searches for a solution.

We discovered [txtai](https://github.com/neuml/txtai), an open-source embedding database that can serve as an LLM knowledge base for prompting. It allowed us to store our code and run semantic searches using app reviews as queries. We tried three different methods of source code localization and present the final method, which achieved the best result.

## Approach 1
The first approach simply extracts all Java files and uses a code model to convert each file into an embedding. Once the embeddings are stored in the database, each review is run as a semantic search query and the retrieved files are compared with the correct files.

The result: `Precision: 8.2%, Recall: 16.4%, F1: 10.9%`

The issues with this approach and why it did not perform well:

- Some Java files contain more tokens than the model can handle, so the resulting embeddings did not represent the full code.
- The code model produces embeddings trained specifically for code, so it may fall short when applied to natural language such as app reviews, and vice versa. A bimodal model that is adept at extracting features from both natural language and code, such as [CodeT5+](https://huggingface.co/Salesforce/codet5p-220m-bimodal), would be a better fit. Unfortunately, this model is currently not supported by txtai.
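
One common way to work around the token limit described in the first issue is to split each file into overlapping windows and embed every window separately. The sketch below is not part of our experiments: the function name `split_into_windows`, the window sizes, and the whitespace tokenization are assumptions for illustration, and a real pipeline would use the embedding model's own tokenizer.

```python
def split_into_windows(tokens, max_tokens=512, overlap=64):
    """Split a token list into overlapping windows of at most max_tokens tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end of the token list
        if start + max_tokens >= len(tokens):
            break
    return windows


# Whitespace splitting is only a stand-in for the model's real tokenizer (assumption).
source = "public int add ( int a , int b ) { return a + b ; }"
windows = split_into_windows(source.split(), max_tokens=8, overlap=2)
for window in windows:
    print(" ".join(window))
```

Each window would then be indexed as its own entry, with a lookup from window to file path, so that a hit on any window points back to the file.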

## Approach 2
The second approach first summarizes each method of each Java class using [CodeTrans](https://huggingface.co/SEBIS/code_trans_t5_base_code_documentation_generation_java), a code model trained to generate documentation for Java code. The summaries for each file are then stored in the embedding database, the semantic search is performed, and the results are compared.

The summarization process can be seen in [this Notebook](https://colab.research.google.com/drive/1SDR5iKsyAeV6f43fQExWcPJfyhyGkwTp?usp=drive_link).

The result is an F1 of around 20%, depending on how many predicted search results are taken.

The issues with this approach and why it did not perform well:
- Summarization adds a layer of abstraction, so important features may be lost in the process.
- Even after summarization, some inputs were still too long for the code model to process.

## Final Approach
The approach with the best result is presented in this Colab notebook. Each method of each Java file is collected and stored in the embedding database. When the database retrieves a method, we record the path of the file it belongs to and compare it with the correct file.

This approach achieved the best result: `Precision: 42.2%, Recall: 74.4%, F1: 53.9%`
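
The method-to-file lookup described above can be sketched as follows. This is a minimal illustration with toy data, not the notebook's code: the helper `rank_files`, the file paths, and the scores are hypothetical, and collapsing repeated hits into one entry per file is an addition for readability.

```python
def rank_files(hits, method_texts, method_to_file):
    """Collapse method-level hits into a ranked list of unique files.

    hits: list of (index, score) pairs, in the same shape as the search output
    shown later in this notebook.
    """
    best = {}
    for idx, score in hits:
        path = method_to_file[method_texts[idx]]
        # Keep the best score seen for each file
        if path not in best or score > best[path]:
            best[path] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical paths and scores)
method_texts = ["void open() {}", "void close() {}", "void draw() {}"]
method_to_file = {
    "void open() {}": "reader/ComicReader.java",
    "void close() {}": "reader/ComicReader.java",
    "void draw() {}": "view/PageView.java",
}
hits = [(0, 0.79), (2, 0.78), (1, 0.77)]
ranked = rank_files(hits, method_texts, method_to_file)
print(ranked)
```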

### Step 1: Access Data
The data is stored in Google Drive, so the drive is mounted first.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

%cd /content/drive/MyDrive/EECS_6448/Data

Mounted at /content/drive
/content/drive/MyDrive/EECS_6448/Data


### Step 2: Extract Methods
Each Java file is examined, and the methods within the file are extracted.

In [None]:
import os

def extract_methods(java_file):
  methods = {}

  with open(java_file, 'r') as file:
      lines = file.readlines()
      in_method = False
      method_name = ""
      method_content = ""

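      for line in lines:
          # Heuristic: any line containing an access or static modifier is treated as
          # the start of a method (this also matches field and class declarations)
          if any(modifier in line for modifier in ['public', 'private', 'protected', 'static']):
              method_name = line.split('(')[0].split()[-1]  # Last word before the first '('
              in_method = True

          # Continue collecting method content (the buffer is never cleared, so it
          # keeps growing across the methods of a file)
          if in_method:
              method_content += line

          # A line ending in '}' is treated as the end of the method
          if in_method and line.strip().endswith('}'):
              in_method = False
              methods[method_name] = method_content  # Store method name and content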

      return methods

def read_java_files(directory):
    java_files_data = []
    java_dict = {}
    # Traverse the directory and its subdirectories
    for root, dirs, files in os.walk(directory):
        for file in files:
            # Only process Java source files
            if not file.endswith(".java"):
                continue
            file_path = os.path.join(root, file)
            # Normalize the path so it can be compared with the ground-truth paths
            normalized_path = file_path.replace('a_comic_viewer/droid-comic-viewer-master/', '')\
                                       .replace('acdisplay/AcDisplay-master/', '')\
                                       .replace('.proc.txt', '')\
                                       .replace('project/app/src/', 'src/').replace('src/', '').replace('main/java/', '').replace('main/', '')\
                                       .replace('Data/RQ_2/Source/', '')\
                                       .replace('com.achep.acdisplay/', '').replace('net.androidcomics.acv/', '')
            # Map each method's text to its file and keep the texts in index order
            for method, content in extract_methods(file_path).items():
                java_dict[content] = normalized_path
                java_files_data.append(content)
    return java_dict, java_files_data

# Replace 'directory_path' with the path of your directory containing Java files
directory_path = 'Data/RQ_2/Source'
java_dict,java_files_data = read_java_files(directory_path)
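
The extractor above relies on line-level heuristics: it never clears `method_content` between methods, and it treats any line ending in `}` as the end of a method. A stricter variant can reset the buffer at each declaration and track brace depth. The sketch below is an assumption-laden alternative, not the code that produced the reported numbers, so results obtained with it would likely differ. The name `extract_methods_by_depth` and the `DECLARATION` regular expression are hypothetical, and braces inside strings or comments would still confuse it.

```python
import re

# Heuristic declaration pattern (assumption): a modifier at the start of the line,
# then a name followed by a parameter list and an opening brace on the same line.
DECLARATION = re.compile(
    r"^\s*(?:public|private|protected|static)\b[^;={}]*?(\w+)\s*\([^;{}]*\)[^;{}]*\{"
)


def extract_methods_by_depth(source):
    """Return {method_name: method_text}, using brace depth to find each method's end."""
    methods = {}
    name = None
    body = []
    depth = 0
    for line in source.splitlines(keepends=True):
        if name is None:
            match = DECLARATION.match(line)
            if not match:
                continue
            # Start a fresh buffer for every method
            name = match.group(1)
            body = []
            depth = 0
        body.append(line)
        depth += line.count("{") - line.count("}")
        if depth <= 0:
            # All braces opened by this method are closed
            methods[name] = "".join(body)
            name = None
    return methods


SAMPLE = """
public class Demo {
    private int count = 0;

    public int add(int a, int b) {
        if (a > 0) {
            count++;
        }
        return a + b;
    }

    private void reset() { count = 0; }
}
"""

methods = extract_methods_by_depth(SAMPLE)
for method_name, text in methods.items():
    print(method_name, len(text.splitlines()))
```

As in the original extractor, methods are keyed by name, so overloaded methods in the same file overwrite one another.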

### Step 3: Add Methods to Embedding Database

This step uses the [bge-large-code-retriever-vanilla](https://huggingface.co/arcee-ai/bge-large-code-retriever-vanilla) model.

Compared with the other models we tried (CodeBERT and similar feature-extraction models), this model performed best.

The method contents are converted to embeddings inside the database, while a dictionary mapping each method to its parent file is kept.

In [None]:
!pip install git+https://github.com/neuml/txtai

from txtai import Embeddings

embeddings = Embeddings(path="arcee-ai/bge-large-code-retriever-vanilla")
# Create an index for the list of text
embeddings.index(java_files_data)

Collecting git+https://github.com/neuml/txtai
  Cloning https://github.com/neuml/txtai to /tmp/pip-req-build-nwvrq4pz
  Running command git clone --filter=blob:none --quiet https://github.com/neuml/txtai /tmp/pip-req-build-nwvrq4pz
  Resolved https://github.com/neuml/txtai to commit f9229bc27c5160402ffb8caee3ab24620c8e602a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


### Step 4: Run an example
An example review is run. As the output shows, the same file can be retrieved many times, because a review is often similar to several methods within the same file. Therefore, when performing the searches in the next step, we retrieve 30 results per review.

In [None]:
print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ["i love this app is really simple and functional you cant go wrong with it if you are looking for a lightweight app for you cbr and cbz comics dont look further this ifs the app to get."]:
  # Retrieve the top 40 results
  # search result format: list of (uid, score)
  uids = embeddings.search(query, 40)

  print(uids)
  for uid in uids:
    # Print the file that contains the retrieved method
    print("%-20s %s" % (query, java_dict[java_files_data[uid[0]]]))

Query                Best Match
--------------------------------------------------
[(2957, 0.7919315099716187), (2958, 0.7902666926383972), (2959, 0.7892378568649292), (2956, 0.7884824275970459), (2953, 0.7808594703674316), (2954, 0.7799752950668335), (1896, 0.7789991497993469), (1895, 0.7789991497993469), (1894, 0.7789991497993469), (1893, 0.7789991497993469), (1892, 0.7789991497993469), (1891, 0.7789991497993469), (63, 0.778807520866394), (2955, 0.7780110239982605), (1890, 0.7758274078369141), (65, 0.7758264541625977), (3027, 0.7736642956733704), (2952, 0.7735443711280823), (3695, 0.7728898525238037), (64, 0.7724614143371582), (62, 0.7722431421279907), (3696, 0.7721375226974487), (3698, 0.7718230485916138), (3693, 0.7715427875518799), (3697, 0.771469235420227), (2318, 0.7711247205734253), (3694, 0.7710498571395874), (66, 0.7708614468574524), (1618, 0.7704434990882874), (1617, 0.7704434990882874), (1616, 0.7704434990882874), (1615, 0.7704434990882874), (1614, 0.7704434990882874), (161

### Step 5: Final result
The search results are retrieved for each review, and the precision, recall, and F1-score are computed.

In [None]:
import json

# Load the ground truth: each review (key) maps to its list of correct files
with open('Data/rq2-manual-normalized.json', 'r') as json_file:
  data = json.load(json_file)

  true_pos = 0
  false_neg = 0
  false_pos = 0


  for query in data.keys():

    uids = embeddings.search(query, 30)

    predicted_files = set()


    for uid in uids:
      file_path = java_dict[java_files_data[uid[0]]]
      predicted_files.add(file_path)

      if file_path in data[query]:
        true_pos += 1
      else:
        false_pos += 1
    for correct_file in data[query]:
      if correct_file not in predicted_files:
        false_neg += 1


  precision = true_pos / (true_pos + false_pos)
  recall = true_pos / (true_pos + false_neg)
  f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall != 0) else 0

  print(f"Precision: {precision * 100:.1f}%, Recall: {recall * 100:.1f}%, F1: {f1 * 100:.1f}%")

Precision: 42.2%, Recall: 74.4%, F1: 53.9%
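
Note that the loop above increments `true_pos` and `false_pos` once per retrieved method, so a file retrieved several times is counted several times, while `false_neg` is counted once per file. A file-level variant that counts each file once per review can be sketched as below. This is an alternative under stated assumptions, not the computation that produced the numbers above: the function name `file_level_scores` and the toy data are hypothetical.

```python
def file_level_scores(predictions, ground_truth):
    """Micro-averaged precision, recall, and F1 where each file counts once per query."""
    tp = fp = fn = 0
    for query, correct_files in ground_truth.items():
        predicted = set(predictions.get(query, []))
        correct = set(correct_files)
        tp += len(predicted & correct)   # correct files that were retrieved
        fp += len(predicted - correct)   # retrieved files that are not correct
        fn += len(correct - predicted)   # correct files that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): "A.java" is retrieved twice for q1 but counted once.
ground_truth = {"q1": ["A.java", "B.java"], "q2": ["C.java"]}
predictions = {"q1": ["A.java", "A.java", "D.java"], "q2": ["C.java"]}
precision, recall, f1 = file_level_scores(predictions, ground_truth)
print(f"Precision: {precision * 100:.1f}%, Recall: {recall * 100:.1f}%, F1: {f1 * 100:.1f}%")
```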
