In [1]:
import json
import os
from src.dataset import AssociatedEditsDataset
from tutorial_utils.printing import *
from tutorial_utils.preprocessing import *
from tutorial_utils.inference import *

# Directory where the tutorial examples are stored
BASE_DIR = "tutorial_examples"

# Directory where the model weights are stored
MODELS_BASE_DIR = "Models"

# Tutorial on using Language Models and associated edits for smarter edit prediction

This tutorial walks readers through the technique presented in our work and how it can be used in a practical setup. We assume the existence of the following actors:
1. An end user who is actively editing code in an IDE
2. A designer of an edit prediction tool that is / will be deployed in this IDE setup

Ideally, the designer would already have a heuristic for identifying the target location where the edit needs to be predicted. This could be the location of the user's cursor or pre-identified code segments where the tool is triggered.

The rest of this tutorial shows how the designer can use our technique to predict the edits that the end user would make.

We work through a simple example to illustrate the process. Choose an example below to continue.

In [2]:
# Choose an example to run by uncommenting the corresponding EXAMPLE_DIR line below

# Example 1 (maps to Illustrative Example in Section 2 of the paper)
# Features: 
#   - Associated edits mined by Overwatch, not in the spatial vicinity of the target line.
#   - Target edit requires insertion of a token (i.e. "Serialization") not present in the original code.
EXAMPLE_DIR = "serialization_import"     # uncomment this line to run this example

# Example 2 (maps to Figure 4 in Section 5.6 of the paper)
# Features:
#   - Associated edits from the spatial vicinity of the target line.
#   - Target edit requires insertion of a token (i.e. "NotFound") not present in the original code.
#   - Target edit requires Natural Language Understanding of the code to understand the context of the edit.
# EXAMPLE_DIR = "http_error_codes"        # uncomment this line to run this example

# Example 3 (maps to Figure 5 in Section 5.6 of the paper)
# Features:
#   - Associated edits mined by Overwatch, in the spatial vicinity of the target line.
#   - Target edit requires insertion of a token (i.e. "Input") not present in the original code.
#   - Target edit requires understanding of spatial context.
# EXAMPLE_DIR = "ex_input"        # uncomment this line to run this example

print("Running example:", EXAMPLE_DIR)

Running example: serialization_import


Run the cell below to load the necessary files for the example. 

In [3]:
# load necessary files for the example

v0_file_path = os.path.join(BASE_DIR, EXAMPLE_DIR, "v0.cs")
v1_file_path = os.path.join(BASE_DIR, EXAMPLE_DIR, "v1.cs")
v2_file_path = os.path.join(BASE_DIR, EXAMPLE_DIR, "v2.cs")

# Metadata about the editing intent and the target line
editing_intents_file_path = os.path.join(BASE_DIR, EXAMPLE_DIR, "editing_intent.json")
with open(editing_intents_file_path, 'r') as f:
    editing_intents = json.load(f)

v0_v1_intent = editing_intents['v0_v1']["intent"]
v0_v1_edit_type = editing_intents['v0_v1']["editType"]
v0_v1_line = editing_intents['v0_v1']["lineNumber"]

v1_v2_intent = editing_intents['v1_v2']["intent"]
v1_v2_edit_type = editing_intents['v1_v2']["editType"]
v1_v2_line = editing_intents['v1_v2']["lineNumber"]

Run the cell below to see the initial contents of the file (we'll call this version `v0`).

In [4]:
# print contents of v0 (v0.cs in the chosen example directory)

v0_file_lines = get_file_contents(v0_file_path)

print("v0:")
print_code(v0_file_lines)

v0:
0      using Newtonsoft.Json;
1      using System;
2      using System.Collections.Generic;
3      using System.IO;
4      using System.Linq;
5      using System.Text;
6      using System.Threading.Tasks;
7      
8      namespace Example
9      {
10          /// <summary>
11          /// Helper API to the OpenAI API.
12          /// </summary>
13          public static class OpenAI
14          {
15              /// <summary>
16              /// Complete the prompt using the specified parameters. Any non-specified parameters will fall back to default values specified in <see cref="DefaultCompletionRequestArgs"/>.
17              /// </summary>
18              /// <returns>Returns a new instance of the object read from the binary file.</returns>
19                  private static T ReadFromBinaryFile<T>(string filePath) {
20                      using (Stream stream = File.Open(filePath, FileMode.Open)) {
21                          try {
22                              var binaryFor

Run the cell below to see what the user intends to do in the first edit. We call this edit from `v0` to `v1` an *associated edit*.

In [5]:
print("User's intent for v0 -> v1:")
print(v0_v1_intent)

User's intent for v0 -> v1:
The user intends to add a SerializationException. They start by replacing Exception on line 25 with SerializationException.


Run the cells below to see the new version of the file (`v1`) and its diff with `v0`.

In [6]:
# v1
v1_file_lines = get_file_contents(v1_file_path)

print("v1:")
print_code(v1_file_lines)

v1:
0      using Newtonsoft.Json;
1      using System;
2      using System.Collections.Generic;
3      using System.IO;
4      using System.Linq;
5      using System.Text;
6      using System.Threading.Tasks;
7      
8      namespace Example
9      {
10          /// <summary>
11          /// Helper API to the OpenAI API.
12          /// </summary>
13          public static class OpenAI
14          {
15              /// <summary>
16              /// Complete the prompt using the specified parameters. Any non-specified parameters will fall back to default values specified in <see cref="DefaultCompletionRequestArgs"/>.
17              /// </summary>
18              /// <returns>Returns a new instance of the object read from the binary file.</returns>
19                  private static T ReadFromBinaryFile<T>(string filePath) {
20                      using (Stream stream = File.Open(filePath, FileMode.Open)) {
21                          try {
22                              var binaryFor

In [7]:
# highlight the diff between v0 and v1
print("v0 and v1 diff:")
print_diff(v0_file_lines, v1_file_lines, fromfile='v0', tofile='v1')

v0 and v1 diff:
--- v0
+++ v1
@@ -23,7 +23,7 @@
                         var binaryFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
                         return (T) binaryFormatter.Deserialize(stream);
                     }
-                    catch(Exception){
+                    catch(SerializationException){
                         throw();
                     }
                 }



Run the cell below to see how the user now intends to edit `v1`.

In [8]:
print("User's intent for v1 -> v2:")
print(v1_v2_intent)

User's intent for v1 -> v2:
The user then moves their cursor to line 6 with an intent to import the Serialization namespace.


The task of the edit prediction tool is to predict this intended change. The edit from `v1` to `v2` is hence called the *target edit*.

Run the cell below to see the target line of code in `v1`.

In [9]:
# highlight target line in v1

# check if the cursor has moved to a new line (proxy: edit type is insert)
cursor_moved_to_new_line = v1_v2_edit_type == "insert"

print("Target in v1:")
highlight_lines(v1_file_lines, [v1_v2_line], cursor_moved_to_new_line)

Target in v1:
0     using Newtonsoft.Json;
1     using System;
2     using System.Collections.Generic;
3     using System.IO;
4     using System.Linq;
5     using System.Text;
6     |
7     using System.Threading.Tasks;
8     
9     namespace Example
10     {
11         /// <summary>
12         /// Helper API to the OpenAI API.
13         /// </summary>
14         public static class OpenAI
15         {
16             /// <summary>
17             /// Complete the prompt using the specified parameters. Any non-specified parameters will fall back to default values specified in <see cref="DefaultCompletionRequestArgs"/>.
18             /// </summary>
19             /// <returns>Returns a new instance of the object read from the binary file.</returns>
20                 private static T ReadFromBinaryFile<T>(string filePath) {
21                     using (Stream stream = File.Open(filePath, FileMode.Open)) {
22                         try {
23                             var bi

In [10]:
# load v2 and highlight diff

v2_file_lines = get_file_contents(v2_file_path)

print("Expected edit:")
print_diff(v1_file_lines, v2_file_lines, fromfile='v1', tofile='v2')

Expected edit:
--- v1
+++ v2
@@ -3,6 +3,7 @@
 using System.Collections.Generic;
 using System.IO;
 using System.Linq;
+using System.Runtime.Serialization;
 using System.Text;
 using System.Threading.Tasks;
 



We now discuss how our technique can be used for predicting this target edit.

## Step 1: Mining and processing associated edits

Any edit mining technique can be used to fetch a list of edits that are relevant to the user's target code location (called associated edits in our work). For example, Overwatch can be used to extract temporally relevant edits that represent common editing patterns in an IDE. On the other hand, a simpler heuristic for predicting local edits could be to use edits in the spatial vicinity of the target code location.

We leave the choice of the associated edit mining technique to the designer. The examples in this tutorial are from the C3PO and Overwatch papers where the associated edits have been mined from the spatial and temporal vicinities respectively. For this demonstration, we consider the edit from `v0` to `v1` as the *associated edit*.
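
As a rough illustration of the simpler spatial heuristic mentioned above, the sketch below keeps only the recent edits whose line number falls within a fixed window around the target line. The `Edit` record, the function name, and the default `window` of 10 lines are assumptions made for this illustration; they are not part of Overwatch, C3PO, or `tutorial_utils`, and the second edit in the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """A single line-level edit (hypothetical record for this sketch)."""
    line: int    # 0-based line number of the edit in the current file version
    before: str  # line content before the edit
    after: str   # line content after the edit


def edits_in_spatial_vicinity(edits, target_line, window=10):
    """Keep edits at most `window` lines from the target line, nearest first."""
    nearby = [e for e in edits if abs(e.line - target_line) <= window]
    return sorted(nearby, key=lambda e: abs(e.line - target_line))


recent_edits = [
    Edit(line=25, before="catch(Exception){", after="catch(SerializationException){"),
    Edit(line=8, before="var x = 1;", after="var x = 2;"),  # invented edit
]

# With the target on line 6, only the edit on line 8 lies within 10 lines.
print([e.line for e in edits_in_spatial_vicinity(recent_edits, target_line=6)])
```

Under this assumed window, the edit on line 25 is too far from a target on line 6 to be selected. This is the situation in Example 1, where the associated edit is not in the spatial vicinity of the target line and a temporal miner such as Overwatch is needed instead.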

Once these associated edits have been identified, we can proceed to collect and process them. Here are the steps we follow:
- Step 1: Collect the associated edits at line-level granularity, i.e. the lines edited from `v0` to `v1`. Ignore whitespace-only edits.
- Step 2: Collect the spatial context around the associated edits. We collect 5 lines of code preceding and following the associated edits at `v1`.
- Step 3: Collect the spatial context around the target line. We collect 8 lines of code preceding and following the target line at `v1`.

This information is then represented in the following format:

```
{
    "AssociatedEdits": [
        {
            "Prefix": ...
            "Before": ...
            "After": ...
            "Suffix": ...
        },
        ...
    ],
    "Current": {
        "Prefix": ...
        "Before": ...
        "After": ...
        "Suffix": ...
    }
}
```

The `Prefix` and `Suffix` fields are optional but highly recommended to collect the spatial code context surrounding the edits (say, +/- 5 lines of code as discussed above). Note that the `After` field of `Current` is only used for evaluation and would not be available in practice.
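
The three collection steps can be sketched with the standard library's `difflib`. This is only an illustration under our own assumptions: `line_level_edits` and `with_context` are hypothetical names, and the helpers in `tutorial_utils` used by this notebook may behave differently (for example, in how they index lines or treat insertions).

```python
import difflib


def line_level_edits(old_lines, new_lines):
    """Return one dict per changed chunk, skipping whitespace-only changes."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    chunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        before, after = old_lines[i1:i2], new_lines[j1:j2]
        if [s.strip() for s in before] == [s.strip() for s in after]:
            continue  # only leading/trailing whitespace changed
        chunks.append({
            "start": j1,  # position of the chunk in the newer version
            "end": j2,
            "before": "\n".join(before),
            "after": "\n".join(after),
        })
    return chunks


def with_context(new_lines, chunk, num_lines):
    """Attach up to `num_lines` of surrounding code taken from the newer version."""
    start, end = chunk["start"], chunk["end"]
    return {
        "prefix": "\n".join(new_lines[max(0, start - num_lines):start]),
        "before": chunk["before"],
        "after": chunk["after"],
        "suffix": "\n".join(new_lines[end:end + num_lines]),
    }


v0 = ["try {", "    Load();", "}", "catch(Exception){", "    throw;", "}"]
v1 = ["try {", "    Load();", "}", "catch(SerializationException){", "    throw;", "}"]

edit = with_context(v1, line_level_edits(v0, v1)[0], num_lines=2)
print(edit)
```

Taking the context from the newer version mirrors the description above, where the prefix and suffix are collected at `v1`.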

Run the cell below to see what the JSON for the working example looks like after processing.

In [11]:
# Step 1: Collect the associated edit
v0_v1_diff = get_diff_chunks(v0_file_lines, v1_file_lines)[0]

# Step 2: Get spatial context around the associated edit (i.e. the prefix and suffix)
v0_v1_prefix = get_prefix(v1_file_lines, v0_v1_line, num_lines_in_context=5)
v0_v1_suffix = get_suffix(v1_file_lines, v0_v1_line, num_lines_in_context=5)

# Step 3: Get the target edit + spatial context
v1_v2_diff = get_diff_chunks(v1_file_lines, v2_file_lines)[0]
v1_v2_prefix = get_prefix(v1_file_lines, v1_v2_line, num_lines_in_context=8)
v1_v2_suffix = get_suffix(v1_file_lines, v1_v2_line, num_lines_in_context=8)

processed_example_content = {
    "AssociatedEdits": [
        # We just consider one associated edit for this tutorial (v0 -> v1)
        {
            "prefix": v0_v1_prefix,
            "before": v0_v1_diff["before"],
            "after": v0_v1_diff["after"],
            "suffix": v0_v1_suffix,
        }
    ],
    # The target edit from v1 -> v2
    "Current": {
        "prefix": v1_v2_prefix,
        "before": v1_v2_diff["before"],
        "after": v1_v2_diff["after"],
        "suffix": v1_v2_suffix,
    },
}

# print processed_example_content
print("Processed example content:")
print(json.dumps(processed_example_content, indent=4))

Processed example content:
{
    "AssociatedEdits": [
        {
            "prefix": "                using (Stream stream = File.Open(filePath, FileMode.Open)) {\n                    try {\n                        var binaryFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();\n                        return (T) binaryFormatter.Deserialize(stream);\n                    }",
            "before": "                    catch(Exception){",
            "after": "                    catch(SerializationException){",
            "suffix": "                        throw();\n                    }\n                }\n            }\n    }"
        }
    ],
    "Current": {
        "prefix": "using Newtonsoft.Json;\nusing System;\nusing System.Collections.Generic;\nusing System.IO;\nusing System.Linq;",
        "before": "",
        "after": "using System.Runtime.Serialization;",
        "suffix": "using System.Threading.Tasks;\n\nnamespace Example\n{\n    /// <summary>

In [12]:
# save this example in a json file for later use

PROCESSED_EXAMPLE_JSON_FILE_PATH = os.path.join(BASE_DIR, EXAMPLE_DIR, "processed_example.json")
save_processed_example_json(processed_example_content, PROCESSED_EXAMPLE_JSON_FILE_PATH, id=EXAMPLE_DIR)

Saved processed example to: tutorial_examples\serialization_import\processed_example.json


## Step 2: Prompt generation

Once the edits have been processed, we generate a textual prompt that can then be sent to a Language Model. We follow an XML-style format for presenting the information as follows:

```
<CurrentEdit>
    <Prefix> . . . </Prefix>
    <Before> . . . </Before>
    <After> [INSERT] </After>
    <Suffix> . . . </Suffix>
</CurrentEdit>
<CtxEdits>
    <Edit>
        <Prefix> . . . </Prefix>
        <Before> . . . </Before>
        <After>  . . . </After>
        <Suffix> . . . </Suffix>
    </Edit>
    <Edit>
        . . . 
    </Edit>
    . . .
</CtxEdits>
```

Note that we use an infilling/insertion model that can use special tokens to represent holes where the final version of the target code would be generated. The `[INSERT]` token represents this special token here.
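
A minimal sketch of how such a prompt could be assembled from the processed JSON is shown below. It assumes the lowercase keys (`prefix`, `before`, `after`, `suffix`) seen in the processed example; `render_edit` and `build_prompt` are hypothetical names, and the exact tag casing and whitespace produced by `AssociatedEditsDataset` may differ.

```python
HOLE = "[INSERT]"  # placeholder; the real sentinel depends on the infilling model


def render_edit(tag, prefix, before, after, suffix):
    """Render one edit as an XML-style block."""
    return (
        f"<{tag}>\n"
        f"<Prefix>\n{prefix}\n</Prefix>\n"
        f"<Before>\n{before}\n</Before>\n"
        f"<After>\n{after}\n</After>\n"
        f"<Suffix>\n{suffix}\n</Suffix>\n"
        f"</{tag}>\n"
    )


def build_prompt(example):
    """Build the prompt; the target's ground-truth `after` is replaced by the hole."""
    cur = example["Current"]
    parts = [render_edit("CurrentEdit", cur["prefix"], cur["before"], HOLE, cur["suffix"])]
    parts.append("<CtxEdits>\n")
    for e in example["AssociatedEdits"]:
        parts.append(render_edit("Edit", e["prefix"], e["before"], e["after"], e["suffix"]))
    parts.append("</CtxEdits>")
    return "".join(parts)


example = {
    "AssociatedEdits": [{
        "prefix": "}",
        "before": "catch(Exception){",
        "after": "catch(SerializationException){",
        "suffix": "throw;",
    }],
    "Current": {
        "prefix": "using System.Linq;",
        "before": "",
        "after": "using System.Runtime.Serialization;",  # used for evaluation only
        "suffix": "using System.Text;",
    },
}

prompt = build_prompt(example)
print(prompt)
```

The important design point is that the `after` field of `Current` never reaches the prompt: it is the value the model is asked to produce.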

Run the cell below to see how the prompt looks for the working example.

In [13]:
processed_sample = AssociatedEditsDataset(json_addr=PROCESSED_EXAMPLE_JSON_FILE_PATH)[0]
print("Working example prompt:")
print(processed_sample["Spctx"])
print("[INSERT]")
print(processed_sample["FutureCtx"])

Working example prompt:
<edit>
<prefix>
using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
</prefix><before>

</before>
<after>

[INSERT]

</after>
<suffix>
using System.Threading.Tasks;

namespace Example
{
    /// <summary>
    /// Helper API to the OpenAI API.
    /// </summary>
    public static class OpenAI
</suffix>
</edit><ctxEdits><edit>
<prefix>
                using (Stream stream = File.Open(filePath, FileMode.Open)) {
                    try {
                        var binaryFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
                        return (T) binaryFormatter.Deserialize(stream);
                    }
</prefix>
<before>
                    catch(Exception){
</before>
<after>
                    catch(SerializationException){
</after>
<suffix>
                        throw();
                    }
                }
            }
    }
</suffix></edit></ctxEdits>


## Step 3: Generating the edit

Now that the prompt has been generated, we use it as input to a Language Model. Given that OpenAI models like `code-davinci-002` can only be accessed via API keys, we use our trained CodeT5 models in this tutorial. In practice, the designer of the edit prediction tool can choose any of these models (or even train another, smaller LLM for their choice of associated edits).

Run the cell below to load the CodeT5 model and tokenizer. 

In [14]:
MODEL_DIR = r"codet5-c3po-unfiltered-further-finetuned-c3po-filtered"
MAX_SEQ_LENGTH = 1024

MODEL_PATH = os.path.join(MODELS_BASE_DIR, MODEL_DIR)

model, tokenizer = get_model_tokenizer_from_path(MODEL_PATH, MAX_SEQ_LENGTH)

Loaded config from model path:  Models\codet5-c3po-unfiltered-further-finetuned-c3po-filtered
Loaded tokenizer from model path:  Models\codet5-c3po-unfiltered-further-finetuned-c3po-filtered
Loaded model from model path:  Models\codet5-c3po-unfiltered-further-finetuned-c3po-filtered


Run the cell below to generate the top 5 predictions using the CodeT5 model.

In [15]:
top5_predictions = get_topK_predictions(model, tokenizer, processed_sample, MAX_SEQ_LENGTH, topK = 5)
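
The body of `get_topK_predictions` is not shown in this notebook. As a hedged sketch of what top-K generation can look like, the function below uses beam search through the Hugging Face `transformers` interface. It assumes that the model exposes `generate` and that the tokenizer is callable and offers `batch_decode`; whether the tutorial helper uses beam search, and the `max_new_tokens` value, are our assumptions.

```python
def generate_top_k(model, tokenizer, prompt, k=5, max_input_tokens=1024, max_new_tokens=64):
    """Return k candidate completions for `prompt` using beam search."""
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=k,               # beam width must be at least num_return_sequences
        num_return_sequences=k,    # return every beam, best first
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Possible usage, assuming the checkpoint is stored in Hugging Face format:
#   from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
#   model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)
#   candidates = generate_top_k(model, tokenizer, prompt_text)
```

Here `prompt_text` stands for the prompt string with the model's own sentinel token in place of `[INSERT]`.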

Run the cells below to see the output generated by the CodeT5 model and how it compares with the expected output.

In [16]:
print("Expected output:")
print(processed_sample["ExpectedText"])

Expected output:
using System.Runtime.Serialization;


In [17]:
print_topK_predictions(top5_predictions)

Prediction 1:
 using System.Runtime.Serialization;

Prediction 2:
 using System.Runtime.Serialization;

Prediction 3:
 using System.Runtime.Serialization;

Prediction 4:
 using System.Runtime.Serialization;

Prediction 5:
 using System.Runtime.Serialization.Formatters.Binary;



These predictions can now be shown as suggestions to the end user.
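
Several of the predictions printed above appear to be the same string, so a tool would probably deduplicate them before showing them to the user. The sketch below does that and also reports the rank of the first exact match against the expected text. Both function names are hypothetical, and the `top5` list is copied from the output above.

```python
def unique_suggestions(predictions):
    """Strip surrounding whitespace and drop duplicates, keeping the model's order."""
    seen = set()
    unique = []
    for prediction in predictions:
        text = prediction.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def exact_match_rank(predictions, expected):
    """Return the 1-based rank of the first exact match, or None if there is none."""
    for rank, text in enumerate(unique_suggestions(predictions), start=1):
        if text == expected.strip():
            return rank
    return None


top5 = [
    " using System.Runtime.Serialization;",
    " using System.Runtime.Serialization;",
    " using System.Runtime.Serialization;",
    " using System.Runtime.Serialization;",
    " using System.Runtime.Serialization.Formatters.Binary;",
]

print(unique_suggestions(top5))
# ['using System.Runtime.Serialization;', 'using System.Runtime.Serialization.Formatters.Binary;']
print(exact_match_rank(top5, "using System.Runtime.Serialization;"))
# 1
```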

Feel free to run this tutorial with the other examples in the `tutorial_examples` directory. You can also run a new example by following these steps:
- Create a new directory within `tutorial_examples` (`BASE_DIR`) and add three file versions (`v0.cs`, `v1.cs`, `v2.cs`).
- The associated edit is the transition from `v0` to `v1`, and the target edit is the transition from `v1` to `v2`.
- Add another file called `editing_intent.json` to this new directory. This file should contain the line numbers corresponding to the edits. The line numbers are used as proxies for the user's cursor location. Please refer to one of the example `editing_intent.json` files for the format.
- Start the notebook with `EXAMPLE_DIR` set to the new directory. You are now ready to run the new example!
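
The steps above can be scripted. The sketch below writes the four files for a new example. The JSON keys (`v0_v1`, `v1_v2`, `intent`, `editType`, `lineNumber`) are the ones this notebook reads; `scaffold_example` is a hypothetical helper and every value is a placeholder. Note that `"insert"` is the only `editType` value that appears in this notebook; `"replace"` is our guess, so check an existing `editing_intent.json` before relying on it.

```python
import json
import os
import tempfile


def scaffold_example(base_dir, name, v0, v1, v2, intents):
    """Create <base_dir>/<name> with v0.cs, v1.cs, v2.cs and editing_intent.json."""
    example_dir = os.path.join(base_dir, name)
    os.makedirs(example_dir, exist_ok=True)
    for filename, text in (("v0.cs", v0), ("v1.cs", v1), ("v2.cs", v2)):
        with open(os.path.join(example_dir, filename), "w", encoding="utf-8") as f:
            f.write(text)
    with open(os.path.join(example_dir, "editing_intent.json"), "w", encoding="utf-8") as f:
        json.dump(intents, f, indent=4)
    return example_dir


# Placeholder metadata: one associated edit (v0 -> v1) and one target edit (v1 -> v2).
intents = {
    "v0_v1": {"intent": "Describe the first edit.", "editType": "replace", "lineNumber": 0},
    "v1_v2": {"intent": "Describe the target edit.", "editType": "insert", "lineNumber": 1},
}

# A temporary directory stands in for BASE_DIR so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as base_dir:
    path = scaffold_example(base_dir, "my_example", "a;\n", "b;\n", "b;\nc;\n", intents)
    created = sorted(os.listdir(path))
    with open(os.path.join(path, "editing_intent.json"), encoding="utf-8") as f:
        reloaded = json.load(f)

print(created)
```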

Note that this notebook reflects a very simple IDE simulation. In practice:
- There can be multiple associated edits (as opposed to the single edit used in the examples).
- Multiple lines can be edited in the context and in the target (the examples only demonstrate single-line edits).
- There can be several intermediate edits that an edit mining engine like Overwatch may skip (the examples here show a target edit that directly follows the associated edit in time).

Thank you for going through this tutorial notebook. We hope that this tutorial helped provide a simple simulation of how our approach can be used in an edit prediction tool.