In this notebook I create a code property graph (CPG) from the Java dataset that I downloaded from the CodRep repository:

https://github.com/ASSERT-KTH/CodRep

The Java files are stored with a .txt extension, so let's rename them to .java first. The following terminal command does this for every file in a dataset folder; run it once per dataset:

In [None]:
(base) kartikeyadatta@Mac Tasks % find /Users/path location.../data/code-rep-dataset/Dataset5/Tasks -name '*.txt' -exec bash -c 'for f; do mv "$f" "${f%.txt}.java"; done' bash {} +

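Let's generate the CPG with Joern from the Java files we have collected. The command below uses joern-parse, the command-line tool that ships with Joern. The flag names assume a recent release, so check joern-parse --help if they are rejected. Here javasrc selects the Java source frontend and -o names the output file.

In [None]:
(base) kartikeyadatta@Mac Tasks % joern-parse --language javasrc -o cpg.bin /Users/path for the folder... /data/code-rep-data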

Load the CPG file: To load the CPG and start querying it, I use Joern’s interactive shell. Since I already have the cpg.bin file, I can import it directly instead of regenerating it.

In [5]:
!scala

Welcome to Scala 3.6.4 (23.0.2, Java OpenJDK 64-Bit Server VM).
Type in expressions for evaluation. Or try :help.

In [None]:
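// Run this inside the Joern shell; afterwards the graph is available as `cpg`.
importCpg("/path/to/cpg.bin")

Note: this line is Scala, so it has to be entered in the Joern shell. Running it in a Python notebook cell fails with "SyntaxError: invalid syntax".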

Step 2: Extract Features from the CPG
Now that the CPG is loaded, let’s extract the features required for refactoring suggestions.

Here are the features we will extract:

Method Length: Number of lines of code in a method.

Cyclomatic Complexity: A measure of the complexity of the method.

Repetitive Code: Identifying duplicate code within methods.

Nesting Levels: The depth of nested structures inside a method.

Number of Variables: The number of variables declared within a method.

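You can extract these features with Joern's query language, a Scala-based domain-specific language for graph traversal. A query is a chain of steps that starts at the root object cpg, for example cpg.method.name.l, where the final .l runs the traversal and returns a list.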

1. Method Length (Lines of Code):
We will query the methods in the CPG and count their lines:

In [None]:
// numberOfLines is the method's end line minus its start line, plus one.
val methodLength = cpg.method.map(m => (m.name, m.numberOfLines)).l

This query will return the name of each method along with its number of lines of code.

2. Cyclomatic Complexity:
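Cyclomatic complexity is the number of linearly independent paths through a method. As far as I can tell, Joern does not store this value as a method property, so the query below approximates it as the number of control structures in the method plus one. This is a rough estimate: it ignores && and || conditions and counts every control-structure node that Joern creates.

In [None]:
// Approximation: control structures in the method's AST, plus one.
val cyclomaticComplexity = cpg.method.map(m => (m.name, m.ast.isControlStructure.size + 1)).l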
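
For reference, this is the standard definition of cyclomatic complexity on a control-flow graph with E edges, N nodes, and P connected components, followed by the shortcut that applies to a single method (P = 1) with D binary decision points:

```latex
M = E - N + 2P
\qquad\text{and, for a single method,}\qquad
M = D + 1
```

For example, a method that contains one for loop with one if inside it has D = 2 and therefore M = 3.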

3. Repetitive Code:
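To find duplicate code, you can search for repeated methods or code blocks. For simplicity, this query only looks for methods whose code property is exactly identical; it does not detect near-duplicates:

In [None]:
val duplicateMethods = cpg.method.l   // materialize first so that groupBy is available
  .groupBy(m => m.code)
  .filter { case (_, methods) => methods.size > 1 }
  .map { case (code, methods) => (code, methods.map(_.name)) }

This returns each repeated code string together with the names of the methods that share it. How much of a method the code property covers depends on the language frontend, so it is worth inspecting a few results by hand.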
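
The grouping idea can be tried without a CPG. The sketch below is plain Scala and makes two assumptions that are not part of the notebook: the method bodies are available as strings (methodBodies is made-up sample data), and two bodies count as duplicates when they are equal after whitespace is collapsed. The names normalize and duplicateGroups are hypothetical.

```scala
// Hypothetical method bodies keyed by method name; real data would come from the CPG.
val methodBodies: List[(String, String)] = List(
  "sumA"  -> "int s = 0;\n  for (int x : xs) { s += x; }\n  return s;",
  "sumB"  -> "int s = 0; for (int x : xs) { s += x; } return s;",
  "first" -> "return xs[0];"
)

// Collapse every run of whitespace to a single space so layout differences are ignored.
def normalize(code: String): String = code.trim.replaceAll("\\s+", " ")

// Group method names by normalized body and keep only groups with more than one member.
val duplicateGroups: List[List[String]] =
  methodBodies
    .groupBy { case (_, body) => normalize(body) }
    .values
    .map(_.map(_._1))
    .filter(_.size > 1)
    .toList
```

With this sample data, sumA and sumB end up in the same group because they differ only in line breaks and indentation. Renamed variables or reordered statements would still be missed.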

4. Nesting Levels:
Nesting levels refer to how deeply nested the code is (e.g., inside loops or conditionals). You can extract this by analyzing control flow structures:

In [None]:
// depth(...) counts how deeply control structures are nested inside the method's AST.
val nestingLevels = cpg.method.map(m => (m.name, m.depth(_.isControlStructure))).l
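
To make the notion of nesting depth concrete, here is a toy sketch that does not use Joern. Node, Stmt, and Control are hypothetical types standing in for AST nodes, and nestingDepth is an illustrative helper, not a Joern function.

```scala
// Minimal stand-in for an AST: plain statements and control structures with a body.
sealed trait Node
final case class Stmt(code: String) extends Node
final case class Control(kind: String, body: List[Node]) extends Node

// Depth of the deepest chain of nested control structures; 0 if there are none.
def nestingDepth(nodes: List[Node]): Int =
  nodes.map {
    case Stmt(_)          => 0
    case Control(_, body) => 1 + nestingDepth(body)
  }.maxOption.getOrElse(0)

// for (...) { if (...) { count++ } }  followed by  return count
val exampleBody: List[Node] = List(
  Control("for", List(Control("if", List(Stmt("count++"))))),
  Stmt("return count")
)
val exampleDepth: Int = nestingDepth(exampleBody)
```

Here exampleDepth is 2, because the if sits inside the for loop.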

5. Number of Variables:
To get the number of variables in each method, you can query for the variables declared inside a method:

In [None]:
// Counts local variable declarations only; method parameters are not included.
val numberOfVariables = cpg.method.map(m => (m.name, m.local.size)).l

Step 3: Heuristic Labeling
Once the necessary features have been extracted from the CPG, you can label methods as "refactorable" or "not refactorable" based on some heuristics. For example:

Long Methods: Methods with more than 20 lines can be flagged.

High Cyclomatic Complexity: Methods with a cyclomatic complexity greater than 10 can be flagged.

Repeated Code: If the method contains duplicate code blocks, flag it.

A simple heuristic labeling function might look like this:

In [None]:
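// Look up complexity by method name; methods that share a name overwrite each other here.
val complexityByName = cyclomaticComplexity.toMap
val labels = methodLength.map { case (name, length) =>
   val cyclomatic = complexityByName.getOrElse(name, 0)
   // 1 = "refactorable", 0 = "not refactorable"
   val isRefactorable = if (length > 20 || cyclomatic > 10) 1 else 0
   (name, isRefactorable)
}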

This will give you a binary label (1 for "refactorable", 0 for "not refactorable") for each method based on its length and cyclomatic complexity.
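
The same rule can be written as a small standalone function, which makes the thresholds explicit and easy to change. This is only a sketch in plain Scala: MethodFeatures, isRefactorable, and the sample values are hypothetical, and the records are keyed by a fully qualified name on the assumption that short method names repeat across classes.

```scala
// Hypothetical per-method feature record; the field names are illustrative.
final case class MethodFeatures(fullName: String, length: Int, cyclomatic: Int)

// Thresholds from the text: more than 20 lines, or a complexity above 10.
def isRefactorable(f: MethodFeatures, maxLength: Int = 20, maxCyclomatic: Int = 10): Int =
  if (f.length > maxLength || f.cyclomatic > maxCyclomatic) 1 else 0

// Made-up sample values; note the two different methods that are both called methodB.
val sample: List[MethodFeatures] = List(
  MethodFeatures("Foo.methodA", 30, 12),
  MethodFeatures("Foo.methodB", 10, 2),
  MethodFeatures("Bar.methodB", 25, 3)
)

// Label per fully qualified name, so the two methodB entries stay separate.
val sampleLabels: Map[String, Int] = sample.map(f => f.fullName -> isRefactorable(f)).toMap
```

With these values, Foo.methodA and Bar.methodB are labeled 1 and Foo.methodB is labeled 0.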

Step 4: Prepare Data for Machine Learning
With the extracted features and labels, you can now create a dataset for your machine learning model. For each method, you'll have a feature vector and a corresponding label (1 or 0).

The structure will look something like this:

| Method Name | Length | Cyclomatic Complexity | Nesting Level | Number of Variables | Is Refactorable |
|-------------|--------|------------------------|----------------|----------------------|------------------|
| methodA     | 30     | 12                     | 3              | 5                    | 1                |
| methodB     | 10     | 2                      | 1              | 2                    | 0                |
| methodC     | 25     | 11                     | 4              | 4                    | 1                |

You can then save this data into a CSV or a DataFrame (using Scala or Python) for training the machine learning model.

Step 5: Save Features and Labels for Model Training
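You can save this data into a CSV file for easier processing later. The cell first assembles featuresAndLabels, one tuple per method, using the same thresholds as the labeling heuristic:

In [None]:
import java.io._

// One row per method: (name, length, approximate complexity, nesting, locals, label).
val featuresAndLabels = cpg.method.map { m =>
  val length = m.numberOfLines
  val cyclomatic = m.ast.isControlStructure.size + 1
  val label = if (length > 20 || cyclomatic > 10) 1 else 0
  (m.name, length, cyclomatic, m.depth(_.isControlStructure), m.local.size, label)
}.l

val writer = new PrintWriter(new File("method_features.csv"))
writer.write("method_name,length,cyclomatic_complexity,nesting_level,num_variables,is_refactorable\n")
featuresAndLabels.foreach { case (name, length, cyclomatic, nesting, variables, label) =>
  writer.write(s"$name,$length,$cyclomatic,$nesting,$variables,$label\n")
}
writer.close()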
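
Writing fields with plain string interpolation only works while no field contains a comma, quote, or line break. If fully qualified names or signatures are written instead of short names, that may no longer hold. The sketch below shows one way to quote fields in the usual CSV style; csvField and csvLine are hypothetical helper names, not part of Joern or the Scala standard library.

```scala
// Quote a field if it contains a comma, a double quote, or a line break;
// embedded double quotes are doubled, as in common CSV dialects.
def csvField(value: String): String =
  if (value.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
    "\"" + value.replace("\"", "\"\"") + "\""
  else value

// Render one row: convert every value to a string, escape it, and join with commas.
def csvLine(fields: Seq[Any]): String =
  fields.map(f => csvField(f.toString)).mkString(",")
```

A row could then be written with writer.write(csvLine(Seq(name, length, cyclomatic, nesting, variables, label)) + "\n").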

Next Steps:
Feature Engineering: If necessary, normalize or standardize the features (e.g., scaling the length or cyclomatic complexity).

Train a Model: Use the extracted features and labels to train a machine learning model (e.g., Logistic Regression, Random Forest, etc.).

Evaluation: After training, evaluate the model using performance metrics such as accuracy, precision, recall, etc.
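
Two of these steps can be sketched in plain Scala without any machine learning library. The helpers below are illustrative only: standardize applies z-score scaling to one feature column, and evaluate computes accuracy, precision, and recall for binary labels. The names standardize, Metrics, and evaluate are hypothetical, and a real pipeline would more likely rely on an existing library.

```scala
// Z-score standardization: subtract the mean and divide by the population standard deviation.
// A constant column has zero deviation, so it is mapped to all zeros instead of dividing by zero.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
  if (std == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - mean) / std)
}

// Accuracy, precision, and recall for binary labels (1 = refactorable, 0 = not refactorable).
final case class Metrics(accuracy: Double, precision: Double, recall: Double)

def evaluate(actual: Seq[Int], predicted: Seq[Int]): Metrics = {
  require(actual.size == predicted.size && actual.nonEmpty,
    "need two non-empty sequences of equal length")
  val pairs = actual.zip(predicted)
  val tp = pairs.count { case (a, p) => a == 1 && p == 1 } // true positives
  val fp = pairs.count { case (a, p) => a == 0 && p == 1 } // false positives
  val fn = pairs.count { case (a, p) => a == 1 && p == 0 } // false negatives
  val correct = pairs.count { case (a, p) => a == p }
  // Return 0.0 when the denominator is zero, e.g. when class 1 is never predicted.
  def ratio(num: Int, den: Int): Double = if (den == 0) 0.0 else num.toDouble / den
  Metrics(ratio(correct, pairs.size), ratio(tp, tp + fp), ratio(tp, tp + fn))
}
```

For instance, evaluate(Seq(1, 1, 0, 0), Seq(1, 0, 1, 0)) has one true positive, one false positive, and one false negative, so accuracy, precision, and recall are all 0.5.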