# Summarization for Java
This Colab notebook walks through generating summaries for the Java source files.

This notebook demonstrates one of the approaches we took to implement Suggestion 3. To view the process and results of Suggestion 3, see [this notebook](https://colab.research.google.com/drive/10PzMBs3ZkH4qZX2cCrBdkxAIKzllftZe?usp=drive_link).

### Step 1: Mount Data

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

%cd /content/drive/MyDrive/EECS_6448/Data

Mounted at /content/drive
/content/drive/MyDrive/EECS_6448/Data


### Step 2: Extract methods

In [None]:


def extract_method(java_code_file):
  """Return a dict mapping each method name in a Java file to its source text."""
  # Dictionary to store method names and code content
  methods = {}

  with open(java_code_file, 'r') as file:
    lines = file.readlines()
    in_method = False
    method_name = ""
    method_content = ""

    for line in lines:
      # Check for method declaration
      if any(modifier in line for modifier in ['public', 'private', 'protected', 'static']):
        method_name = line.split('(')[0].split()[-1]  # Extract method name
        in_method = True
        method_content = ""  # Reset; the declaration line itself is appended below

      # Continue collecting method content (including the declaration line)
      if in_method:
        method_content += line

      # End of method reached
      if in_method and line.strip().endswith('}'):
        in_method = False
        methods[method_name] = method_content  # Store method name and content

  return methods

{'MyAppWidgetHost': '    public MyAppWidgetHost(@NonNull Context context, int hostId) {\n        super(Device.hasLollipopMR1Api()\n                // Up to Android 5.1 app widget host has a bug, that\n                // holds the context reference.\n                // See the fix: https://github.com/android/platform_frameworks_base/commit/7a96f3c917e0001ee739b65da37b2fadec7d7765\n                ? context\n                : context.getApplicationContext(), hostId);\n    }\n', 'updateView': '    public final AppWidgetHostView updateView(@NonNull Context context, int appWidgetId,\n                                              @NonNull AppWidgetProviderInfo appWidget,\n                                              @Nullable AppWidgetHostView view) {\n        mTempView = view;\n        view = createView(context, appWidgetId, appWidget
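
The line-based check above treats any line containing a modifier as a method start and any line ending with `}` as its end, so fields, class declarations, and nested blocks can confuse it. As a rough alternative, the sketch below tracks brace depth instead. The helper name `extract_methods_by_braces` and the `_DECL` pattern are hypothetical and not part of the original notebook; the sketch also assumes that braces do not appear inside string literals or comments.

```python
import re

# Hypothetical pattern (an assumption, not from the notebook): one or more
# modifiers, then a name directly followed by "(", with no ";" or "=" before it.
_DECL = re.compile(
    r'^\s*(?:(?:public|private|protected|static|final|abstract|synchronized)\s+)+'
    r'[^;=]*?(\w+)\s*\('
)

def extract_methods_by_braces(java_source):
    """Return {method_name: method_source} using brace-depth tracking."""
    methods = {}
    name, body, depth, seen_open = None, [], 0, False
    for line in java_source.splitlines(keepends=True):
        if name is None:
            match = _DECL.match(line)
            if not match:
                continue  # not inside a method and not a declaration
            name, body, depth, seen_open = match.group(1), [], 0, False
        body.append(line)
        depth += line.count('{') - line.count('}')
        seen_open = seen_open or '{' in line
        if seen_open and depth <= 0:
            # The brace that opened the method body has been closed
            methods[name] = ''.join(body)
            name = None
        elif not seen_open and line.rstrip().endswith(';'):
            name = None  # declaration without a body (abstract or interface)
    return methods
```

Because nested `if` or `for` blocks raise the depth above one, the method is only stored once its outermost brace closes. Overloaded methods still overwrite each other, as in the original function, since the dictionary is keyed by name alone.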

In [None]:
import os

def read_java_files(directory):
  """Walk `directory` and map each Java file's shortened path to its extracted methods."""
  java_dict = {}
  for root, dirs, files in os.walk(directory):
    for file in files:
      # Only process Java source files
      if file.endswith(".java"):
        file_path = os.path.join(root, file)
        print(file_path)

        # extract_method opens and reads the file itself
        file_methods = extract_method(file_path)

        # Strip project-specific directory prefixes from the path used as the key
        file_path = file_path.replace('a_comic_viewer/droid-comic-viewer-master/', '')\
                             .replace('acdisplay/AcDisplay-master/', '')\
                             .replace('.proc.txt', '')\
                             .replace('project/app/src/', 'src/').replace('src/', '').replace('main/java/', '').replace('main/', '')\
                             .replace('Data/RQ_2/Source/', '')\
                             .replace('com.achep.acdisplay/', '').replace('net.androidcomics.acv/', '')
        java_dict[file_path] = file_methods

  return java_dict

directory_path = 'Data/RQ_2/Source'
methods_dict = read_java_files(directory_path)


Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/ACVApplication.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/Constants.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/BuildUtils.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/StringUtils.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/AlertUtils.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/Reflect.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/FileUtils.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/IntentUtils.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/utils/MathUtils.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/comic/Comic.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/comic/ACVComic.java
Data/RQ_2/Source/net.androidcomics.acv/src/net/robotmedia/acv/comic/ACVRectangle.java
Data/RQ_2

### Step 3: Install and import packages

In [None]:
!pip install -q transformers sentencepiece

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, SummarizationPipeline

### Step 4: Build the pipeline

In [None]:
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for T5 models
pipeline = SummarizationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained("SEBIS/code_trans_t5_base_code_documentation_generation_java"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_code_documentation_generation_java", skip_special_tokens=True),
    device=0  # first GPU; requires a GPU runtime
)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### Step 5: Perform Tokenization

In [None]:
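# Language.build_library (used below) was removed in tree_sitter 0.22, so pin an older release
!pip install "tree_sitter<0.22"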
!git clone https://github.com/tree-sitter/tree-sitter-java

fatal: destination path 'tree-sitter-java' already exists and is not an empty directory.


In [None]:
from tree_sitter import Language, Parser

Language.build_library(
  'build/my-languages.so',
  ['tree-sitter-java']
)

JAVA_LANGUAGE = Language('build/my-languages.so', 'java')
parser = Parser()
parser.set_language(JAVA_LANGUAGE)

In [None]:

def my_traverse(node, code_list):
  """Collect the text of every leaf node under `node` into `code_list`
  and return the collected tokens joined by single spaces."""
  if node.child_count == 0:
    # node.text holds the leaf's source bytes; a token that spans
    # several lines is flattened by joining its lines with spaces
    code_list.append(' '.join(node.text.decode('utf8').split('\n')))
  else:
    for n in node.children:
      my_traverse(n, code_list)
  return ' '.join(code_list)
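
To make the expected shape of the tokenized input concrete, here is a rough stand-in that produces a similar space-separated token string without tree-sitter. It is only an approximation under stated assumptions: `rough_tokenize` and `_TOKEN` are hypothetical names, the regex is not a real Java lexer, and multi-character operators such as `==` are split into single characters, whereas the tree-sitter leaves keep them whole.

```python
import re

# Hypothetical stand-in for the tree-sitter leaf traversal (an assumption,
# not the notebook's method): a small regex lexer for simple Java snippets.
_TOKEN = re.compile(r'''
    "(?:\\.|[^"\\])*"      # string literal
  | '(?:\\.|[^'\\])*'      # char literal
  | [A-Za-z_$][\w$]*       # identifier or keyword
  | \d+(?:\.\d+)?          # integer or decimal number
  | \S                     # any other single non-space character
''', re.VERBOSE)

def rough_tokenize(java_code):
    """Return the tokens of `java_code` joined by single spaces."""
    return ' '.join(_TOKEN.findall(java_code))

print(rough_tokenize('int add(int a,int b){return a+b;}'))
# → int add ( int a , int b ) { return a + b ; }
```

For this simple snippet the result matches what a leaf-by-leaf traversal yields: every identifier, keyword, and punctuation mark becomes its own whitespace-delimited token.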

### Step 6: Summarize

In [None]:
with open('summarization.txt', 'w') as output:
  for file_name, file_methods in methods_dict.items():
    result = ""
    for method, content in file_methods.items():
      # Parse the method and flatten its syntax tree into space-separated tokens
      tree = parser.parse(bytes(content, "utf8"))
      code = content
      code_list = []
      tokenized_code = my_traverse(tree.root_node, code_list)
      # Summarize the tokenized method and append the text to this file's summary
      summary = pipeline([tokenized_code])
      result = result + ' ' + summary[0]['summary_text']
    # One line per file: "<path>, <method summaries>"
    output.write(file_name + ", " + result + '\n')



Your max_length is set to 20, but your input_length is only 18. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)
Your max_length is set to 20, but your input_length is only 17. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=8)
Your max_len
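
The warnings above appear when a tokenized method is shorter than the `max_length` of 20. One possible way to quiet them, sketched here under assumptions, is to derive `max_length` from the input length, following the halving that the warning message itself suggests (18 becomes 9, 17 becomes 8). The helper `pick_max_length` and its `cap` parameter are hypothetical and not part of the original notebook.

```python
def pick_max_length(input_length, cap=20):
    """Hypothetical helper: use half the input length, bounded to [1, cap]."""
    return max(1, min(cap, input_length // 2))

# Possible use inside the summarization loop (assumes the `pipeline` and
# `tokenized_code` names from the cells above; not executed here):
#   n_tokens = len(pipeline.tokenizer(tokenized_code)["input_ids"])
#   summary = pipeline([tokenized_code], max_length=pick_max_length(n_tokens))

print(pick_max_length(18))   # → 9
print(pick_max_length(100))  # → 20
```

Very short methods then get a proportionally shorter summary budget, while longer ones keep the limit of 20. If the model's configured minimum length exceeds the chosen value, `min_length` may need adjusting as well.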