# Using CLDK to validate code translation

In this tutorial, we will use [CLDK](https://github.com/IBM/codellm-devkit/tree/main) to translate code and check properties of the translated code. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis for this task. By the end of this tutorial, you will have implemented a simple Java-to-Python code translator that also performs light-weight property checking on the translated code.

Specifically, you will learn how to perform the following tasks on a Java application, both to create LLM prompts for code translation and to check the translated code:

1. Create a new instance of the CLDK class.
2. Create an analysis object for the target Java application.
3. Iterate over all files in the application.
4. Iterate over all classes in a file.
5. Sanitize the class for prompting the LLM.
6. Create Tree-sitter-based Java and Python analysis objects.

We will write a couple of helper methods to 1) format the LLM instruction for translating a Java class to Python and 2) prompt the LLM via Ollama. We will then use CLDK to analyze the Java code, get context information for translating that code, and finally check the properties of the translated code.

## Prerequisites

See the [Code Summarization](./code_summarization.ipynb) recipe, which describes the prerequisites and setup that this notebook also requires. These include Python 3.11 or later, Java 11 or later, Maven 3.9 or later, [Ollama 0.3.4](https://ollama.com/) or later, [Granite code models](https://ollama.com/library/granite-code), a sample Java application ([Apache Commons CLI](https://github.com/apache/commons-cli)), and the CLDK tools.

## Translate Java code to Python and build a light-weight property checker (for translation validation)
Code translation aims to convert source code from one programming language to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent work, [presented at ICSE'24](https://dl.acm.org/doi/10.1145/3597503.3639226), we found that LLM-based code translation is very promising. In this example, we will walk through the steps of translating a Java class to Python and checking various properties of translated code (e.g., number of methods, number of fields, formal arguments, etc.) as a simple form of translation validation.

In [None]:
!pip install git+https://github.com/IBM/codellm-devkit.git

#### Step 1: Import the required modules

In [None]:
from cldk.analysis.python.treesitter import PythonSitter
from cldk.analysis.java.treesitter import JavaSitter
import ollama
from cldk import CLDK
from cldk.analysis import AnalysisLevel

#### Step 2: Define a function for creating the LLM prompt

This function instructs the LLM to translate a Java class to Python and includes the body of the Java class after removing all the comments and import statements.

In [None]:
def format_inst(code, focal_class, language):
    """
    Format the translation instruction for the given focal class.
    """
    inst = f"Question: Can you translate the Java class `{focal_class}` below to Python and generate under code block (```)?\n"

    inst += "\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst
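
The description above mentions removing import statements as well as comments. A minimal sketch of one way to drop Java import lines with a regular expression, assuming a hypothetical helper named `strip_java_imports` (not part of CLDK) and a made-up `Demo` class:

```python
import re

# Matches a whole single-line Java import, including `import static` and wildcard forms.
_IMPORT_LINE = re.compile(
    r"^[ \t]*import\s+(?:static\s+)?[\w.]+(?:\.\*)?\s*;[ \t]*\r?\n?",
    re.MULTILINE,
)


def strip_java_imports(source: str) -> str:
    """Return `source` with single-line Java import statements removed (hypothetical helper)."""
    return _IMPORT_LINE.sub("", source)


# Made-up input used only for illustration.
java_source = (
    "package demo;\n"
    "\n"
    "import java.util.List;\n"
    "import static java.lang.Math.max;\n"
    "import java.util.*;\n"
    "\n"
    "public class Demo { }\n"
)

stripped = strip_java_imports(java_source)
print(stripped)
```

The stripped text could then be passed as `code` to `format_inst`. A line-based pattern like this does not handle imports split across several lines or import-like text inside string literals.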

#### Step 3: Define a function to call the LLM

As in the [Code Summarization](./code_summarization.ipynb) recipe, we use Ollama with Granite Code 8b Instruct, pulling the model into Ollama if necessary.

In [None]:
!ollama pull granite-code:8b-instruct

In [None]:
def prompt_ollama(message: str, model_id: str = "granite-code:8b-instruct") -> str:
    """Prompt local model on Ollama"""
    response_object = ollama.generate(model=model_id, prompt=message)
    return response_object["response"]
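
Because the prompt asks the model to answer inside a code block, the response may contain a Markdown fence and some surrounding prose, which matters if the response is later parsed as source code. A minimal sketch for pulling out the first fenced block, assuming a hypothetical helper `extract_code_block` (not part of CLDK or Ollama) and a made-up response string:

```python
import re

# Three backticks, an optional info string (e.g. "python"), then the block body.
_FENCED_BLOCK = re.compile(r"`{3}[^\n]*\n(.*?)`{3}", re.DOTALL)


def extract_code_block(response: str) -> str:
    """Return the body of the first fenced block, or the response unchanged if none is found."""
    match = _FENCED_BLOCK.search(response)
    return match.group(1) if match else response


# Made-up model response used only for illustration.
fence = "`" * 3
sample_response = (
    "Here is the translation:\n"
    f"{fence}python\n"
    "class A:\n"
    "    pass\n"
    f"{fence}\n"
    "Hope this helps."
)

extracted = extract_code_block(sample_response)
print(extracted)
```

Falling back to the raw response keeps the helper safe to apply when the model returns bare code without a fence.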

#### Step 4: Translate a Java class to Python and check two properties of the translated code

The two properties checked are the following:
* The number of translated methods
* The number of translated fields

Because the Java analysis has already been run in the other recipes, we don't need to regenerate it.

In [None]:
# Create an instance of CLDK for Java analysis
cldk = CLDK(language="java")

# Create an analysis object for the Java application, providing the application path
analysis = cldk.analysis(project_path="temp/commons-cli-rel-commons-cli-1.7.0", analysis_level=AnalysisLevel.symbol_table, analysis_json_path='analysis')

# For simplicity, we run the code translation on a single class (this filter can be removed to run this code over the entire application)
target_class = "org.apache.commons.cli.GnuParser"

# Go through all the classes in the application
for class_name in analysis.get_classes():

    if class_name == target_class:
        # Get the location of the Java class
        class_path = analysis.get_java_file(qualified_class_name=class_name)

        # Skip the class if its source file cannot be located
        if not class_path:
            continue

        # Read the file content
        with open(class_path, "r", encoding="utf-8", errors="ignore") as f:
            class_body = f.read()

        # Sanitize the file content by removing comments
        sanitized_class = JavaSitter().remove_all_comments(source_code=class_body)

        # Create prompt for translating sanitized Java class to Python
        inst = format_inst(code=sanitized_class, language="java", focal_class=class_name.split(".")[-1])

        print(f"Instruction:\n{inst}\n")
        print("Translating Java code to Python. This may take a few seconds to a few minutes, depending on where the model is hosted...\n")

        # Prompt the local model on Ollama
        translated_code = prompt_ollama(message=inst)

        # Print translated code
        print(f"Translated Python code: {translated_code}\n")

        # Create python sitter instance for analyzing translated Python code
        py_cldk = PythonSitter()

        # Compute methods, function, and field counts for translated code
        all_methods = py_cldk.get_all_methods(module=translated_code)
        all_functions = py_cldk.get_all_functions(module=translated_code)
        all_fields = py_cldk.get_all_fields(module=translated_code)

        # Check counts against method and field counts for Java code
        translated_method_count = len(all_methods) + len(all_functions)
        if translated_method_count != len(analysis.get_methods_in_class(qualified_class_name=class_name)):
            print(f'Number of translated methods does not match in class {class_name}')
        else:
            print(f'Number of translated methods in class {class_name} is {translated_method_count}')
        if all_fields is not None:
            if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):
                print(f'Number of translated fields does not match in class {class_name}')
            else:
                print(f'Number of translated fields in class {class_name} is {len(all_fields)}')

Note that the file [./analysis/analysis.json](./analysis/analysis.json) was either created by this run or reused from a previous one, such as the [Code Summarization](./code_summarization.ipynb) recipe.

After the LLM's response is received, you should see the translated Python code for `GnuParser` printed out, similar to the following:

```python
from org.apache.commons.cli import Parser, Util, Char
import re

class GnuParser(Parser):
    def flatten(self, options, arguments, stopAtNonOption):
        tokens = []
        eat_the_rest = False
        for i in range(len(arguments)):
            arg = arguments[i]
            if arg == "--":
                eat_the_rest = True
                tokens.append("--")
            elif arg == "-":
                tokens.append("-")
            elif re.match("^--", arg):
                opt = Util.stripLeadingHyphens(arg)
                if options.hasOption(opt):
                    tokens.append(arg)
                else:
                    equal_pos = DefaultParser.indexOfEqual(opt)
                    if equal_pos != -1 and options.hasOption(opt[0:equal_pos]):
                        tokens.append(arg[:arg.index("=")])
                        tokens.append(arg[arg.index("=")+1:])
                    elif options.hasOption(arg[0:2]):
                        tokens.append(arg[0:2])
                        tokens.append(arg[2:])
                    else:
                        eat_the_rest = stopAtNonOption
                        tokens.append(arg)
            else:
                tokens.append(arg)

            if eat_the_rest:
                for i in range(i+1, len(arguments)):
                    tokens.append(arguments[i])

        return tokens
```
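
The introduction lists formal arguments among the properties worth checking, while the cell above compares only method and field counts. A minimal sketch of how parameter names could be collected from translated code, using Python's standard `ast` module rather than CLDK; the helper name `method_parameters` and the shortened sample class are assumptions for illustration:

```python
import ast


def method_parameters(python_source: str) -> dict[str, list[str]]:
    """Map 'Class.method' to its positional parameter names, excluding self/cls (hypothetical helper)."""
    tree = ast.parse(python_source)
    params: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Only look at functions defined directly in the class body.
        for item in node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                names = [a.arg for a in item.args.args if a.arg not in ("self", "cls")]
                params[f"{node.name}.{item.name}"] = names
    return params


# Shortened stand-in for the translated class; the base class is never evaluated by ast.parse.
sample_translation = '''
class GnuParser(Parser):
    def flatten(self, options, arguments, stopAtNonOption):
        return []
'''

found = method_parameters(sample_translation)
print(found)
```

The resulting names, or just their count per method, could then be compared with the parameter lists of the corresponding Java methods. Note that `ast.parse` needs plain Python source, so any prose or fence markers in the LLM response would have to be removed first.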
