# Using CLDK to explain Java methods

In this tutorial, we will use [CLDK](https://github.com/IBM/codellm-devkit/tree/main) to generate a summary explanation for a Java method. You will explore some of the benefits of using CLDK to perform quick and easy program analysis and to build an LLM-based code summarizer. By the end of this tutorial, you will have implemented such a tool and generated a summary for a Java method.

Specifically, you will learn how to perform the following tasks on Java code to create LLM prompts for code summarization:

1. Create a new instance of the `CLDK` class.
2. Create an analysis object for the target Java code.
3. Iterate over all files in the code.
4. Iterate over all classes in a file.
5. Initialize `treesitter` utils for the class content.
6. Iterate over all methods in a class.
7. Get the code body for a method.
8. Sanitize the class for prompting the LLM.

We will write two helper functions to 1) format the LLM instruction for summarizing a given target method and 2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate the summary for the target method.

## Prerequisites

Before we get started, let's make sure you have the following installed:

1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)
2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)
3. Maven 3.9 or later (you can use [SDKMAN!](https://sdkman.io) to install Maven)
4. [Ollama 0.3.4](https://ollama.com/) or later.
5. [Granite code models](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial.

See the [Coding_Assistant_in_VSCode](../Coding_Assistant_in_VSCode/Coding_Assistant_in_VSCode.ipynb) recipe for instructions on setting up Ollama and installing the Granite models.

### Install Ollama Python SDK

We also need the Python API for Ollama and the CLDK toolkit.

In [None]:
!pip install ollama

### Install CLDK
CLDK is available at https://github.com/IBM/codellm-devkit. You can install it by running the following command:

In [None]:
!pip install git+https://github.com/IBM/codellm-devkit.git

## Analyze and Summarize Java Code

First we get the sample Java code, then we will use CLDK to analyze and summarize it.

### Get the Sample Java Code
For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the sample Java code. You can download the source code to a temporary directory by running the following commands:

In [None]:
%%bash
COMMONS=commons-cli-1.7.0
test -d temp || ( \
  mkdir -p temp && \
  cd temp && \
  wget https://github.com/apache/commons-cli/archive/refs/tags/rel/$COMMONS.zip -O $COMMONS.zip && \
  unzip -o $COMMONS.zip && \
  rm -f $COMMONS.zip && \
  cd - \
) && ls -l temp/

You should see `commons-cli-rel-commons-cli-1.7.0` in `temp`.

### Generate the Code Summary

Code summarization, or code explanation, is the task of converting code written in a programming language into natural language. It has several benefits: readers can understand code without studying its details, and the generated text can serve as documentation that eases maintenance. To summarize code, one needs to understand the basic details of its implementation and then use that knowledge to generate the summary with an AI-based approach. Here, we use an LLM, specifically a Granite Code Instruct model (the code below defaults to `granite-code:3b`). We will show how a developer can easily use CLDK to analyze code by calling various APIs without having to implement analyses with lower-level tools.

#### Step 1: Add the necessary imports

In [None]:
import ollama
from cldk import CLDK
from cldk.analysis import AnalysisLevel

#### Step 2: Define a function for creating the LLM prompt

This function instructs the LLM to summarize a Java method and includes relevant code for the task.

In [None]:
def format_inst(code, focal_method, focal_class, language):
    """
    Format the instruction for the given focal method and class.
    """
    inst = f"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\n"

    inst += "\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst
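
To see what the resulting prompt looks like, here is a minimal sketch that applies the same formatting logic to a tiny hand-written class. The helper `build_prompt` is a compact copy of `format_inst`, repeated only so the block runs by itself; the class `Foo` and the method `bar` are made up for illustration.

```python
FENCE = "`" * 3  # Markdown code fence, built this way to keep the example easy to embed


def build_prompt(code: str, focal_method: str, focal_class: str, language: str) -> str:
    """Compact copy of format_inst, repeated here so the example is self-contained."""
    # Close the fence on its own line, whether or not the code ends with a newline
    closing_fence = FENCE if code.endswith("\n") else "\n" + FENCE
    return (
        f"Question: Can you write a brief summary for the method `{focal_method}` "
        f"in the class `{focal_class}` below?\n\n"
        f"{FENCE}{language}\n{code}{closing_fence}\n"
    )


# Hypothetical input: a two-line class that does not end with a newline
sample_code = "public class Foo {\n    public void bar() {}\n}"
prompt = build_prompt(sample_code, "public void bar()", "Foo", "java")
print(prompt)
```

The printed prompt consists of the question, a blank line, and the class wrapped in a `java` code fence, which is the shape the LLM receives in Step 6.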

#### Step 3: Define a function to call the LLM

The function below invokes a Granite Code model (`granite-code:3b` by default) through Ollama.

In [None]:
def prompt_ollama(message: str, model_id: str = "granite-code:3b") -> str:
    """Prompt local model on Ollama"""
    response_object = ollama.generate(model=model_id, prompt=message, options={"temperature":0.2})
    return response_object["response"]
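
If you want to try the rest of the pipeline before a model is available locally, one option is to pass the generate function in as a parameter. The sketch below is an assumption-laden variant, not part of CLDK or Ollama: `prompt_model` and `fake_generate` are hypothetical names, and the stub only mimics the `"response"` key that `prompt_ollama` reads.

```python
from typing import Any, Callable, Mapping


def prompt_model(
    message: str,
    generate: Callable[..., Mapping[str, Any]],
    model_id: str = "granite-code:3b",
    temperature: float = 0.2,
) -> str:
    """Send the prompt through the supplied generate function and return the text."""
    response = generate(model=model_id, prompt=message, options={"temperature": temperature})
    return response["response"]


def fake_generate(model: str, prompt: str, options: Mapping[str, Any]) -> Mapping[str, Any]:
    """Hypothetical stand-in for a real model call; it only echoes what it was given."""
    return {"response": f"[{model}] received {len(prompt)} characters"}


message = "Summarize the method bar"
summary = prompt_model(message, generate=fake_generate)
print(summary)

# With Ollama installed and running, the same helper could be called as:
#   prompt_model(message, generate=ollama.generate)
```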

#### Step 4: Create an instance of CLDK, specifying the programming language of the source code

Java in this case.

In [None]:
cldk = CLDK(language="java")

#### Step 5: Select the analysis engine and analysis level

CLDK uses different analysis engines---[CodeAnalyzer](https://github.com/IBM/codenet-minerva-code-analyzer) (built over [WALA](https://github.com/wala/WALA) and [JavaParser](https://github.com/javaparser/javaparser)), [Treesitter](https://tree-sitter.github.io/tree-sitter/), and [CodeQL](https://codeql.github.com/) (future). CodeAnalyzer is the default analysis engine. CLDK supports different analysis levels: 1) symbol table, 2) call graph, 3) program dependency graph, and 4) system dependency graph. The analysis level can be selected using the `AnalysisLevel` enumerated type. For this example, we select the symbol-table analysis level, with CodeAnalyzer as the default analysis engine.

> **NOTE:** If the next cell throws an error `CalledProcessError`, make sure you have a working Java installation! See the **Prerequisites** above.

In [None]:
# Create an analysis object for the Java application
analysis = cldk.analysis(project_path="temp/commons-cli-rel-commons-cli-1.7.0", analysis_level=AnalysisLevel.symbol_table, analysis_json_path='analysis')

Note that a file [./analysis/analysis.json](./analysis/analysis.json) was created. This file will be used in the other notebooks, too!

#### Step 6: Iterate over all the class files and create the prompt

In this case, we want to provide a sanitized Java class in the prompt, containing only the relevant information for summarizing the target method. To illustrate, consider the following class:

```java
package com.ibm.org;
import A.B.C.D;
...
public class Foo {
    // code comment
    public void bar() {
        int a;
        a = baz();
        // do something
    }

    private int baz() {
        // do something
    }

    public String dummy(String a) {
        // do something
    }
}
```
Let's say we want to generate a summary for method `bar`. To help the LLM understand what it does, we add the callees of this method to the prompt, which in this case means `baz`. We remove the other methods, imports, comments, etc. All of this can be achieved with a single call to CLDK's `sanitize_focal_class` API. In this process, we also use Treesitter to analyze the code. After creating the sanitized code, we call the previously defined `format_inst` function to create the LLM prompt and pass the prompt to `prompt_ollama` to generate the method summary.
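
The pruning idea can be illustrated with a toy model in plain Python. This is not how `sanitize_focal_class` is implemented (CLDK works on the parsed source via Treesitter); the dictionaries `method_bodies` and `calls` and the helper `keep_focal_and_callees` are hypothetical names, used only to show which methods survive. In this toy version, only direct callees are kept.

```python
# Toy model of the Foo class above: method name -> source text (bodies are made up)
method_bodies = {
    "bar": "public void bar() { int a; a = baz(); }",
    "baz": "private int baz() { return 0; }",
    "dummy": "public String dummy(String a) { return a; }",
}

# Direct callees of each method, written by hand here;
# a real tool would derive this from the code itself
calls = {"bar": ["baz"], "baz": [], "dummy": []}


def keep_focal_and_callees(focal, bodies, call_map):
    """Return only the focal method and its direct callees, in source order."""
    wanted = {focal, *call_map.get(focal, [])}
    return {name: body for name, body in bodies.items() if name in wanted}


kept = keep_focal_and_callees("bar", method_bodies, calls)
print(list(kept))  # prints ['bar', 'baz']
```

The unrelated method `dummy` is dropped, which keeps the prompt short and focused on what the LLM needs to explain `bar`.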

In [None]:
# For simplicity, we run the code summarization on a single class and method
# (this filter can be removed to run this code over the entire application)
target_class = "org.apache.commons.cli.GnuParser"
target_method = "flatten(Options, String[], boolean)"

# Iterate over all classes in the application
for class_name in analysis.get_classes():
    if class_name == target_class:
        print(f"Class: {class_name}")
        class_file_path = analysis.get_java_file(qualified_class_name=class_name)

        # Read code for the class
        with open(class_file_path, "r") as f:
            code_body = f.read()

        # Initialize treesitter utils for the class file content
        tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)

        # Iterate over all methods in class
        for method in analysis.get_methods_in_class(qualified_class_name=class_name):
            if method == target_method:

                # Get all the method details
                method_details = analysis.get_method(qualified_class_name=class_name,
                                                     qualified_method_name=method)

                # Sanitize the class for analysis with respect to the target method
                sanitized_class = tree_sitter_utils.sanitize_focal_class(method_details.declaration)

                # Format the instruction for the given target method and class
                instruction = format_inst(
                    code=sanitized_class,
                    focal_method=method_details.declaration,
                    focal_class=class_name.split(".")[-1],
                    language="java"
                )

                print(f"Instruction:\n{instruction}\n")
                print("Generating the code summary; this may take from a few seconds to a few minutes, depending on where the model is hosted...\n")

                # Prompt the local model on Ollama
                llm_output = prompt_ollama(message=instruction)

                # Print the LLM output
                print(f"LLM Output:\n{llm_output}")

After the LLM's response is received, you should see the generated summary of the `flatten` method printed out (similar to the following text).


**LLM Output:**
Sure! The method `protected String[] flatten(final Options options, final String[] arguments, final boolean stopAtNonOption)` in the class `GnuParser` is responsible for parsing command-line arguments according to the GNU convention. It takes three parameters: `options`, which represents a collection of valid command-line options; `arguments`, which is an array of strings representing the command-line arguments to be parsed; and `stopAtNonOption`, which determines whether to stop processing arguments once a non-option argument is encountered.

The method iterates through each argument in the `arguments` array and performs various checks based on the argument's format. If the argument starts with "--", it adds the entire argument to the `tokens` list. If the argument is "-", it adds "-" to the `tokens` list. If the argument starts with a hyphen but does not match any known option, it determines whether to stop processing arguments or add the argument to the `tokens` list based on the value of `stopAtNonOption`.

If an argument matches a known option, it is added to the `tokens` list along with its associated value (if applicable). The method also handles cases where the argument appears in a different format, such as "-Dproperty=value" or "--option=value".

Finally, the method returns the `tokens` list as an array of strings.
