## CodeReviewer applications and fine-tuning



The objective of this notebook is to look at predictions of the CodeReviewer model before and after fine-tuning for the Review generation task. The original paper of CodeReviewer can be found [here](https://arxiv.org/abs/2203.09095).

I chose to look at the model's performance on the Kotlin programming language. This is interesting because Kotlin code was not present in the dataset. However, the model "saw" Java, which looks somewhat like Kotlin (maybe with the help of beer). So, let's see whether it generalizes to Kotlin. Of course, we don't expect it to generate something like "rewrite it in Kotlin style" before fine-tuning.

What Kotlin repo to take?

Let's take the [Kotlin Programming Language](https://github.com/JetBrains/kotlin) repo! It has plenty of PRs, so finding comments shouldn't be an issue. The Kotlin repo does have its specifics, though: for example, comments like `"Filed https://youtrack.jetbrains.com/issue/KT-44513"` are impossible to predict without external knowledge. Nevertheless, I think all repos have their own specifics, so it's a fine repo to look at.


But before we assemble a dataset with Kotlin diffs, we should understand what the model actually wants from us as input. From the paper, it seems that we only need to provide diffs to the model. However, the paper also mentions adding tokens like `[ADD]`, `[DEL]` and `[KEEP]` for a different pre-training task (diff quality estimation).

It's strange that our task doesn't seem to need those tokens. The model should understand code with the tokens better, since it was trained on them.

Ok, let's try first without replacing `+` and `-` with `[ADD]` and `[DEL]`.

In [25]:
import os
from tqdm import tqdm
import json
from transformers import pipeline
import evaluate
import torch

### Download the dataset

This step is optional, but if you want, you can download the [official dataset](https://zenodo.org/record/6900648) (code provided in the following cell) and look at predictions on examples other than the hardcoded ones. The hardcoded examples are taken from the test part of the dataset.

To download the dataset, set the `download_cg_dataset` variable to `True`.

In [None]:
download_cg_dataset = False

DATASET_DIR = "./dataset"
CG_DATASET_DIR = os.path.join(DATASET_DIR, "Comment_Generation")

# Create the dataset directory if it doesn't exist yet
os.makedirs(DATASET_DIR, exist_ok=True)

if download_cg_dataset and not os.path.exists(CG_DATASET_DIR):
    !./scripts/cg_dataset.sh

### Let's look at predictions of raw diffs!

Let's just take a couple of samples from the test partition of the official dataset and look at the predictions.

In [39]:
def get_sample_diffs(with_target=False):
    # Get samples from the dataset (one JSON object per line)
    with open("./dataset/comment_generation_sample/sample.jsonl", "r") as file:
        samples = [json.loads(line) for line in file]

    diffs = [sample['patch'] for sample in samples]
    targets = [sample['msg'] for sample in samples]

    if with_target:
        return diffs, targets
    return diffs

In [40]:
pipe = pipeline("text2text-generation", "microsoft/codereviewer", max_length=200)

diffs = get_sample_diffs()

result = pipe(diffs)

for c, r in zip(diffs, result):
    print(r)

{'generated_text': '<msg>Please remove this extra line'}
{'generated_text': '<msg>This is not needed.'}
{'generated_text': '<msg>I think this line is not needed'}
{'generated_text': "<msg>This is a bit of a nitpick, but I think it would be better to have the `&& html` on the next line, so that it's clearer what's going on."}
{'generated_text': "<msg>This is a bit of a nitpick, but I think it would be better to use `getURL` instead of `getAdminURL` since it's more consistent with the rest of the codebase."}
{'generated_text': '<msg>I think this is a bit too much.'}
{'generated_text': '<msg>This file is not part of the PR.'}
{'generated_text': '<msg>please remove the extra line'}
{'generated_text': '<msg>Please remove this extra line.'}
{'generated_text': '<msg>This is a bit of a nitpick, but I think it would be better to have the `let` keyword on the next line.'}


Doesn't look very specific :(

It seems like the model just generates responses from a pool of the most likely ones. Maybe we should preprocess the data as they said in the paper!

Now, the paper doesn't specify how to preprocess it. So, I went to their [official repo](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer) and looked at the script they propose to do predictions with: [run_infer_msg.py](https://github.com/microsoft/CodeBERT/blob/master/CodeReviewer/code/run_infer_msg.py).

They use the following function to preprocess the data. Let's use it as well!

In [2]:
def add_special_tokens(diff_hunk: str):
    diff_lines = diff_hunk.split("\n")[1:]        # remove start @@
    diff_lines = [line for line in diff_lines if len(line.strip()) > 0]
    map_dic = {"-": 0, "+": 1, " ": 2}
    def f(s):
        if s in map_dic:
            return map_dic[s]
        else:
            return 2
    labels = [f(line[0]) for line in diff_lines]
    diff_lines = [line[1:].strip() for line in diff_lines]
    input_str = ""
    for label, line in zip(labels, diff_lines):
        if label == 1:
            input_str += "<add>" + line
        elif label == 0:
            input_str += "<del>" + line
        else:
            input_str += "<keep>" + line
    return input_str

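This function replaces the leading `+` and `-` of each line with `<add>` and `<del>` tokens, respectively, and marks every other line with a `<keep>` token. It also drops the first line (the `@@` hunk header), skips blank lines, strips whitespace from each line, and joins everything into a single string without newlines.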

Here is an example of what it does with a diff hunk.

In [None]:
diff = get_sample_diffs()[0]

print("============= INITIAL DIFF =============\n")
print(f"{diff}\n")
print("============= PREPROCESSED DIFF =============\n")
print(add_special_tokens(diff))


@@ -53,7 +53,7 @@ public class ProtocGapicPluginGeneratorTest {
                 model.getFiles().stream().map(ProtoFile::getProto).collect(Collectors.toList()))
             // Only the file to generate a client for (don't generate dependencies)
             .addFileToGenerate("multiple_services.proto")
-            .setParameter("language=java")
+            .setParameter("language=java,transport=grpc")
             .build();
 
     CodeGeneratorResponse response = ProtocGeneratorMain.generate(codeGeneratorRequest);


<keep>model.getFiles().stream().map(ProtoFile::getProto).collect(Collectors.toList()))<keep>// Only the file to generate a client for (don't generate dependencies)<keep>.addFileToGenerate("multiple_services.proto")<del>.setParameter("language=java")<add>.setParameter("language=java,transport=grpc")<keep>.build();<keep>CodeGeneratorResponse response = ProtocGeneratorMain.generate(codeGeneratorRequest);


Ok, let's see if the model will pay more attention to the code after preprocessing. (It should!)

In [None]:
pipe = pipeline("text2text-generation", "microsoft/codereviewer", max_length=200)

diffs = list(map(add_special_tokens, get_sample_diffs()))

result = pipe(diffs)

for c, r in zip(diffs, result):
    print(r)

{'generated_text': '<msg>I think we should add a `.setParameter("language=java,transport=grpc")` here.'}
{'generated_text': '<msg>I think we should set this only if the index is configured.'}
{'generated_text': "<msg>I think we should use `window.analytics.load(<%= ENV['SEGMENT_KEY']%>);`"}
{'generated_text': "<msg>I think this should be `isNaN(file.data.size) ? '' : html`"}
{'generated_text': "<msg>I think we should use `getNavigateURL` here instead of `getAdminURL` because it's not the same as `getAdminURL` in the `getNavigateURL` function."}
{'generated_text': '<msg>I think this should be `const AbsMat& im`'}
{'generated_text': '<msg>nit: import order'}
{'generated_text': '<msg>I think we should use `snprintf_s` here'}
{'generated_text': "<msg>I think this should be `getattr(api.getForegroundObject(), '_lastDetectedKeyboardLayoutChange', 0)`"}
{'generated_text': "<msg>This is a bit weird, but I guess it's fine."}


Well, at least it looks more specific. Let's see how the generated comments compare with the target ones.

In [None]:
_, target = get_sample_diffs(with_target=True)

for generated, trg in zip(result, target):
    print("==========================================================================================================")
    print(f"Generated: {generated['generated_text'][5:]}\nTarget: {trg}'")

Generated: I think we should add a `.setParameter("language=java,transport=grpc")` here.
Target: can we also test for `transport=rest`?'
Generated: I think we should set this only if the index is configured.
Target: If record_batch_size is not set in config.ini, this code will trigger a notice about an undefined value. I would suggest either wrapping the setPageSize() call in an `if (!empty(...)) {` check, or else providing a default value in the set call (i.e. `$config->Index->record_batch_size ?? 100`).'
Generated: I think we should use `window.analytics.load(<%= ENV['SEGMENT_KEY']%>);`
Target: I didn't realize we were hardcoding this, thanks for moving it to an env value.'
Generated: I think this should be `isNaN(file.data.size) ? '' : html`
Target: We are trying to support IE 10-11, so we'll need a polyfill for this one, I think.'
Generated: I think we should use `getNavigateURL` here instead of `getAdminURL` because it's not the same as `getAdminURL` in the `getNavigateURL` functi

Nice, I see at least one prediction that matches the target's meaning exactly:

```
Generated: nit: import order
Target: alpha sort the imports'
```

For the sake of completeness, let's also look at the comments together with the code.

In [None]:
for generated, code, trg in zip(result, get_sample_diffs(), target):
    print("==========================================================================================================")
    print(f"=== Diff:\n{code}\n\n=== Generated comment:\n{generated['generated_text'][5:]}\n\n=== Target:\n{trg}")

=== Diff:
@@ -53,7 +53,7 @@ public class ProtocGapicPluginGeneratorTest {
                 model.getFiles().stream().map(ProtoFile::getProto).collect(Collectors.toList()))
             // Only the file to generate a client for (don't generate dependencies)
             .addFileToGenerate("multiple_services.proto")
-            .setParameter("language=java")
+            .setParameter("language=java,transport=grpc")
             .build();
 
     CodeGeneratorResponse response = ProtocGeneratorMain.generate(codeGeneratorRequest);

=== Generated comment:
I think we should add a `.setParameter("language=java,transport=grpc")` here.

=== Target:
can we also test for `transport=rest`?
=== Diff:
@@ -182,7 +182,9 @@ abstract class AbstractSolrBackendFactory implements FactoryInterface
      */
     protected function createBackend(Connector $connector)
     {
+        $config = $this->config->get($this->mainConfig);
         $backend = new $this->backendClass($connector);
+        $backend->se

In a lot of cases it just says: I think we should add [something that is already being added in the diff] :)

However, most of the other comments at least make sense, and a couple of them are extremely close to the ground truth.
For example, the "import order" one and the "I think we should set this only if the index is configured" one.
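
The "I think we should add [something already in the diff]" pattern can be counted automatically. Below is a minimal sketch of such a check. The helper names `added_lines` and `echoes_added_code` are hypothetical (they are not part of the CodeReviewer repo), and the heuristic itself is an assumption: a comment counts as an "echo" if a backtick-quoted snippet from it appears verbatim in a line that the diff adds.

```python
import re


def added_lines(diff: str) -> list:
    """Return the stripped contents of lines that the diff adds."""
    return [
        line[1:].strip()
        for line in diff.split("\n")
        # "+++" marks a file header in full diffs, not an added line
        if line.startswith("+") and not line.startswith("+++")
    ]


def echoes_added_code(diff: str, comment: str) -> bool:
    """Heuristic (assumption): True if a backtick-quoted snippet from the
    comment occurs verbatim inside a line added by the diff."""
    snippets = [s.strip() for s in re.findall(r"`([^`]+)`", comment)]
    added = added_lines(diff)
    return any(s and s in line for s in snippets for line in added)


example_diff = (
    "@@ -53,7 +53,7 @@ public class ProtocGapicPluginGeneratorTest {\n"
    '             .addFileToGenerate("multiple_services.proto")\n'
    '-            .setParameter("language=java")\n'
    '+            .setParameter("language=java,transport=grpc")\n'
    "             .build();\n"
)
example_comment = (
    'I think we should add a `.setParameter("language=java,transport=grpc")` here.'
)
print(echoes_added_code(example_diff, example_comment))
```

Applied to every (diff, generated comment) pair, this gives a rough count of how often the model simply repeats the change back. It will miss paraphrased echoes, so treat the count as a lower bound.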

Let's also look at what the model generates for the examples from Figure 5 of the [paper](https://arxiv.org/pdf/2203.09095.pdf) (I found them in the test partition of the original dataset as well).

In [None]:
code = [
"""@@ -388,4 +388,10 @@ public class MockExecutorLoader implements ExecutorLoader {
   public void unassignExecutor(int executionId) throws ExecutorManagerException {
     executionExecutorMapping.remove(executionId);
   }
+
+  @Override
+  public List<ExecutableFlow> fetchRecentlyFinishedFlows(long lifeTimeMs)
+      throws ExecutorManagerException {
+    return null;
+  }
 }
""",
"""@@ -124,7 +124,7 @@ public class DockerOptions {
       for (int i = 0; i < maxContainerCount; i++) {
         node.add(caps, new DockerSessionFactory(clientFactory, docker, image, caps));
       }
-      LOG.info(String.format(
+      LOG.finest(String.format(
           "Mapping %s to docker image %s %d times",
           caps,
           name,
"""
]

code = list(map(add_special_tokens, code))

paper_diffs = pipe(code)

for gen_comment in paper_diffs:
    print(gen_comment['generated_text'])

<msg>I think this should return an empty list instead of null.
<msg>I think this should be `LOG.finest("Mapping {} to docker image {} times", caps, name, maxContainerCount);`


Ok, the first response matches the ground truth, but the second is a classic "I think this should be [already added stuff]".

Their dataset might have a lot of "I think this should be the stuff I added" comments from the authors of the code.

### The moment of Kotlin!

Now that we know what the model wants from us, let's gather our own Kotlin mini-dataset.

You don't have to run the scripts that gather the dataset, because the mini-dataset is committed to the repo (`dataset/github` dir).

I wrote two scripts, which are stored in the `scripts` folder: `github_dataset.py` and `github_prep.py`. They download all the PR comments since 2020 from the `jetbrains/kotlin` repo.

I filtered the comments just like they did in the paper: I kept only the ones that don't have replies. I additionally filtered out diffs longer than 1000 characters to exclude newly added long files. Such files are not useful here, because they call for several comments rather than one. This resulted in **763 comments**, stored in the `dataset/github/jetbrains_kotlin_pure.jsonl` file.
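
The filtering described above can be sketched roughly as follows. This is not the actual `github_prep.py` script, just an illustration under stated assumptions: each comment is a dict with the `id`, `in_reply_to_id`, `diff_hunk` and `body` fields that the GitHub REST API returns for pull request review comments, and both comments that received replies and the replies themselves are dropped. The function name `filter_review_comments` is hypothetical.

```python
MAX_DIFF_CHARS = 1000  # threshold mentioned in the text


def filter_review_comments(comments, max_diff_chars=MAX_DIFF_CHARS):
    """Keep standalone review comments attached to reasonably short diffs.

    Assumption: `comments` is a list of dicts shaped like GitHub review
    comments, with `id`, `in_reply_to_id`, `diff_hunk` and `body` keys.
    """
    # IDs of comments that somebody replied to
    replied_to = {
        c["in_reply_to_id"]
        for c in comments
        if c.get("in_reply_to_id") is not None
    }

    kept = []
    for c in comments:
        is_reply = c.get("in_reply_to_id") is not None
        has_replies = c["id"] in replied_to
        if is_reply or has_replies:
            continue  # part of a discussion thread, skip it
        if len(c["diff_hunk"]) > max_diff_chars:
            continue  # long hunks usually need several comments
        # Same field names as in the official dataset: `patch` and `msg`
        kept.append({"patch": c["diff_hunk"], "msg": c["body"]})
    return kept
```

The output records use the `patch` and `msg` keys so that they can be read by the same code as the official samples.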

Let's generate comments for the dataset.

In [None]:
with open("./dataset/github/jetbrains_kotlin_pure.jsonl", "r") as file:
    items = [json.loads(line) for line in file.readlines()]

query = [item['patch'] for item in items]
query = list(map(add_special_tokens, query))

outputs = []

batch_size = 5
for i in tqdm(range(0, len(query), batch_size)):
    batch = query[i:i + batch_size]
    generated = pipe(batch)
    outputs += generated


predictions_and_targets = []

for i, output in enumerate(outputs):
    pred_and_target = {
        "pred": output['generated_text'],
        "target": items[i]['msg'],
        "id": i,
    }
    predictions_and_targets.append(pred_and_target)

with open("./dataset/github/jetbrains_kotlin_preds.jsonl", "w") as file:
    file.write("\n".join([json.dumps(pred_and_target) for pred_and_target in predictions_and_targets]))


100%|██████████| 153/153 [04:00<00:00,  1.57s/it]


Let's look at some predictions!

In [None]:
for pred_and_target in predictions_and_targets[:15]:
    print("==========================================================================================================")
    print(f"Generated:\n{pred_and_target['pred'][5:]}\n\nTarget:\n{pred_and_target['target']}")

Generated:
This is not correct. The `targetCallable` is a `FunctionImportedFromObject` or a `Callable`.

Target:
Should we use `ImportedFromObjectCallableDescriptor<*>` instead?
Generated:
I think we can remove this import.

Target:
Minor: unused import. I'll remove it before pushing.
Generated:
I think this is a bug.

Target:
I somehow missed it. Thanks!
Minor: there is `SUSPEND_CALL_RESULT_NAME`, which, in fact, I should've used.
Generated:
This is the same as `this.functionStack.lastOrNull()?.dispatchReceiverParameter`

Target:
I will change it to `thisReceiverExpression.convertWithOffsets` usage
Generated:
This is a bit of a hack, but I'm not sure how to do it better.

Target:
correct one is: 
```
   val b = number.toByte()
   val d = number.toDouble()
   val f = number.toFloat()
   val i = number.toInt()
   val l = number.toLong()
   val s = number.toShort()
```
Generated:


Target:
This test should not have passed. It is checking that no delegate field is created for int

With Kotlin there are more generic comments like the ones we saw before adding the special tokens, for example, "This is a bit of a hack...".
The tokens improved the model's "understanding" of the code, so I would assume that the model simply understands Kotlin less.

However, some comments hit the target: the "unused import" one and the "remove annotation" one. Since both imports and annotations look the same as in Java, it makes sense that the model was able to write these.

It's worth mentioning, though, that in another diff with annotations the model also proposed removing the annotation, but that wasn't the ground truth answer.

### With eyes and with numbers

Ok, to the naked eye it seems like the model understands Kotlin less than other languages (judging by the number of generic comments).

But let's try to estimate the "goodness" of the predictions with some metric. The metric used in the paper is BLEU, which means we know the model's score on the official dataset.

As they explain in the paper, BLEU is not a good metric here, because answers containing entirely different tokens could both be right. But apart from human expert evaluation, we don't have a better one.

I don't have spare humans to spend on expert evaluation, so let's calculate the metric!

In [None]:
bleu = evaluate.load("bleu")

with open("./dataset/github/jetbrains_kotlin_preds.jsonl", "r") as file:
    pred_targets = [json.loads(line) for line in file.readlines()]
    predictions = [pred_target['pred'] for pred_target in pred_targets]
    references = [[pred_target['target']] for pred_target in pred_targets]

print(bleu.compute(predictions=predictions, references=references))

{'bleu': 0.009101684249318965, 'precisions': [0.18890882146478433, 0.024603698811096433, 0.009734280452512496, 0.005610622779128483], 'brevity_penalty': 0.4054897436135259, 'length_ratio': 0.5255800606706568, 'translation_length': 12821, 'reference_length': 24394}


Ok, going back to the paper, they report BLEU values such as 7.97. This is strange, because BLEU is built from n-gram precision ratios and a brevity penalty, so its values should lie in [0, 1].

Even if we assume that they scaled it to [0, 100], it doesn't match the following calculation:

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import TrainingArguments, Trainer
from dataset.KotlinDataset import KotlinDataset


In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")

preds = [" ".join(tokenizer.tokenize("This is an important debugging help and shouldn't be lower than the default visible INFO.")).lower()]
refs = [[" ".join(tokenizer.tokenize("This change prevents a user understanding how their server is configured. Best to leave at `info` level.")).lower()]]

print(bleu.compute(predictions=preds, references=refs))

{'bleu': 0.0, 'precisions': [0.17647058823529413, 0.0, 0.0, 0.0], 'brevity_penalty': 0.7026185226629954, 'length_ratio': 0.7391304347826086, 'translation_length': 17, 'reference_length': 23}


The BLEU score is claimed to be `7.97` here, but it's `0` (at least by the HuggingFace definition). I tried computing it with their tokenizer as well, and it was still 0...
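
The zero follows directly from the precisions printed above: no bigram of the prediction occurs in the reference, and plain BLEU multiplies the n-gram precisions together, so a single zero wipes out the score. One possible explanation for a non-zero value, which is only my assumption and is not verified against the paper's evaluation code, is a smoothed BLEU variant. The sketch below is a simplified sentence-level BLEU with whitespace tokenization and optional add-one smoothing; `sentence_bleu` is a hypothetical helper, not the `evaluate` implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(pred, ref, max_order=4, smooth=False):
    """Simplified single-reference BLEU on whitespace tokens.

    With smooth=True, add-one smoothing is applied to every n-gram
    precision, so one missing n-gram order no longer zeroes the score.
    """
    pred_tokens, ref_tokens = pred.split(), ref.split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(pred_tokens, n)
        ref_counts = ngram_counts(ref_tokens, n)
        # Clipped matches: an n-gram counts at most as often as in the reference
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if smooth:
            precision = (matches + 1) / (total + 1)
        else:
            precision = matches / total if total > 0 else 0.0
        if precision == 0.0:
            return 0.0  # geometric mean is zero if any precision is zero
        log_precision_sum += math.log(precision) / max_order

    ratio = len(pred_tokens) / len(ref_tokens)
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)
    return brevity_penalty * math.exp(log_precision_sum)


pred = "this is an important debugging help and shouldn't be lower than the default visible info."
ref = "this change prevents a user understanding how their server is configured. best to leave at `info` level."

print("plain BLEU:   ", sentence_bleu(pred, ref))
print("smoothed BLEU:", sentence_bleu(pred, ref, smooth=True))
```

The `bleu` metric from `evaluate` also accepts a `smooth` argument in `compute`, so the same comparison can be repeated with the real metric. Whether that reproduces the paper's numbers is not something this sketch can confirm.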

Alright, let's instead try to improve the metric by fine-tuning the model on our Kotlin mini-dataset.

I've already divided the data into train and validation/test parts using the `github_divide.py` script that I wrote. 80% of the comments went to training and 20% to validation and testing. We don't distinguish between validation and testing, because the dataset is small.
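
For reference, an 80/20 split like this can be sketched in a few lines. This is not the actual `github_divide.py` script. The function name `split_lines`, the shuffling and the fixed seed are all assumptions made for illustration.

```python
import random


def split_lines(lines, train_fraction=0.8, seed=42):
    """Shuffle JSONL lines deterministically and split them into
    (train, test) parts. The seed value is an arbitrary assumption."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def write_split(input_path, train_path, test_path):
    """Read a JSONL file and write the two parts to separate files."""
    with open(input_path, "r") as file:
        lines = [line.rstrip("\n") for line in file if line.strip()]
    train, test = split_lines(lines)
    for path, part in ((train_path, train), (test_path, test)):
        with open(path, "w") as file:
            file.write("\n".join(part))
```

With 763 comments, this rule puts 610 of them into the training part and 153 into the test part.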

It's also worth mentioning that our mini-dataset doesn't contain examples of code that shouldn't be commented. Let's say our model only looks at code that should be commented, for the sake of keeping the dataset small (so that I can run this toy fine-tuning relatively fast).

In [9]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer", max_length=200)
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")

train_dataset = KotlinDataset(tokenizer, "./dataset/fine_tuning/train.jsonl")
eval_dataset = KotlinDataset(tokenizer, "./dataset/fine_tuning/test.jsonl")

training_args = TrainingArguments(
    output_dir="output",
    save_steps=100,
    eval_steps=20,
    evaluation_strategy="steps",
    save_total_limit=2,
    logging_dir="./logs",
    do_train=True,
    do_eval=True,
    load_best_model_at_end=True,
    num_train_epochs=2,
    per_device_train_batch_size=10,
    learning_rate=0.0003,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

trainer.train()

trainer.save_model("./fine_tuned_model")

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | No log        | 0.966699        |
| 40   | No log        | 0.877195        |
| 60   | No log        | 0.858529        |
| 80   | No log        | 0.863517        |
| 100  | No log        | 0.861417        |
| 120  | No log        | 0.85661         |


In [45]:
fine_pipe = pipeline("text2text-generation", model="./fine_tuned_model", tokenizer="./fine_tuned_model", max_length=200)

In [41]:
def print_bleu(preds_file):
    bleu = evaluate.load("bleu")

    with open(preds_file, "r") as file:
        pred_targets = [json.loads(line) for line in file.readlines()]
        predictions = [pred_target['pred'] for pred_target in pred_targets]
        references = [[pred_target['target']] for pred_target in pred_targets]

    print(bleu.compute(predictions=predictions, references=references))

In [42]:
def generate_messages(input_file, pipel, output_file):
    with open(input_file, "r") as file:
        items = [json.loads(line) for line in file.readlines()]

    query = [item['patch'] for item in items]
    query = list(map(add_special_tokens, query))

    outputs = []

    batch_size = 5
    for i in tqdm(range(0, len(query), batch_size)):
        batch = query[i:i + batch_size]
        generated = pipel(batch)
        outputs += generated


    predictions_and_targets = []

    for i, output in enumerate(outputs):
        pred_and_target = {
            "pred": output['generated_text'],
            "target": items[i]['msg'],
            "id": i,
        }
        predictions_and_targets.append(pred_and_target)

    with open(output_file, "w") as file:
        file.write("\n".join([json.dumps(pred_and_target) for pred_and_target in predictions_and_targets]))

Let's finally compare the BLEU score after and before fine-tuning.

In [48]:
FINE_PRED_FILE = "./dataset/fine_tuning/fine_preds.jsonl"

generate_messages("./dataset/fine_tuning/test.jsonl", fine_pipe, FINE_PRED_FILE)
print_bleu(FINE_PRED_FILE)

100%|██████████| 31/31 [01:57<00:00,  3.78s/it]


{'bleu': 0.012056471119902568, 'precisions': [0.14469200524246395, 0.0204806116876024, 0.007694499857509262, 0.0029797377830750892], 'brevity_penalty': 0.7467646210764841, 'length_ratio': 0.7739906674781903, 'translation_length': 3815, 'reference_length': 4929}


In [43]:
NOT_FINE_PRED_FILE = "./dataset/fine_tuning/not_fine_preds.jsonl"

generate_messages("./dataset/fine_tuning/test.jsonl", pipe, NOT_FINE_PRED_FILE)
print_bleu(NOT_FINE_PRED_FILE)

100%|██████████| 31/31 [00:46<00:00,  1.51s/it]


{'bleu': 0.012351168451748475, 'precisions': [0.19844660194174757, 0.03125, 0.014853647881170816, 0.0097856477166822], 'brevity_penalty': 0.4008472859386672, 'length_ratio': 0.5224183404341651, 'translation_length': 2575, 'reference_length': 4929}


Ok, the BLEU score didn't really change. But let's see whether the comments started making more sense. We have a chance, as BLEU doesn't really capture the "sense".
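
The two outputs above also show why the score stayed flat. BLEU is the brevity penalty multiplied by the geometric mean of the four n-gram precisions. After fine-tuning, the comments got longer, so the brevity penalty roughly doubled, while the precisions dropped enough that their geometric mean roughly halved. The sketch below only re-derives the reported scores from the reported parts; `bleu_from_parts` is a hypothetical helper name.

```python
import math


def bleu_from_parts(precisions, brevity_penalty):
    """BLEU = brevity penalty * geometric mean of the n-gram precisions."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


# Values copied from the two `bleu.compute` outputs above
reported = {
    "fine-tuned": {
        "precisions": [0.14469200524246395, 0.0204806116876024,
                       0.007694499857509262, 0.0029797377830750892],
        "brevity_penalty": 0.7467646210764841,
    },
    "not fine-tuned": {
        "precisions": [0.19844660194174757, 0.03125,
                       0.014853647881170816, 0.0097856477166822],
        "brevity_penalty": 0.4008472859386672,
    },
}

for name, parts in reported.items():
    geo_mean = bleu_from_parts(parts["precisions"], 1.0)
    score = bleu_from_parts(parts["precisions"], parts["brevity_penalty"])
    print(f"{name}: geometric mean = {geo_mean:.4f}, "
          f"brevity penalty = {parts['brevity_penalty']:.4f}, BLEU = {score:.4f}")
```

So the two effects cancel out almost exactly: longer comments are penalized less for brevity, but a smaller share of their n-grams overlaps with the targets.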

In [44]:
with open("./dataset/fine_tuning/fine_preds.jsonl", "r") as file:
    fine_preds = [json.loads(line) for line in file.readlines()]
    
with open("./dataset/fine_tuning/not_fine_preds.jsonl", "r") as file:
    not_fine_preds = [json.loads(line) for line in file.readlines()]
    
with open("./dataset/fine_tuning/test.jsonl", "r") as file:
    diffs = [json.loads(line)['patch'] for line in file.readlines()]

In [47]:
for i in range(20):
    print("==========================================================================================================")
    print(f"===== Diff:\n{diffs[i]}\n\n===== Fine generated:\n{fine_preds[i]['pred'][5:]}\n===== Not fine generated:\n{not_fine_preds[i]['pred'][5:]}\n===== Target:\n{fine_preds[i]['target']}\n")

===== Diff:
@@ -1420,6 +1423,20 @@ class ClassFileToSourceStubConverter(val kaptContext: KaptContextForStubGenerati
         lineMappings.registerSignature(this, node)
         return this
     }
+
+    private fun fieldType(field: FieldNode, origin: JvmDeclarationOrigin?): Type {
+        val signType = Type.getType(field.desc)
+        return when (val declaration = origin?.element) {
+            is KtProperty -> {
+                val delegateType = kaptContext.bindingContext[BindingContext.EXPRESSION_TYPE_INFO, declaration.delegateExpression]?.type

===== Fine generated:
This is not a good way to get the delegate type. It should be resolved in the compiler.
===== Not fine generated:
This is a bit of a misnomer. It's not a function, it's a function that returns a type.
===== Target:
After this commit `convertKotlinType()` will be always used for properties with delegates.
Using it only for anonymous-type delegates would be safer, I think.

===== Diff:
@@ -76,8 +76,10 @@ class Decl

### Impressive!

The model understood that it's inside a programming language repo:

> "This is not a good way to get the delegate type. It should be resolved in the compiler."


And the fine-tuned model doesn't have the "I think this should be [already added stuff]" problem:
> Fine-tuned: "This is not a property of `BaseKotlinLibrary`, so it should be `hasUnresolvedDependencies: Boolean`."
> 
> Not fine-tuned: "I think this should be `val`."

However, the fine-tuned model is sometimes a bit hesitant in its choices :)
> "This is not correct. The `FirRegularClassSymbol` is not a regular class symbol. It is a class symbol. It is not a class symbol. It is a class symbol. It is not a class symbol. It is a class symbol. It is not a class symbol. It is a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol. It is not a class symbol."

And the model also learned to generate links (they don't work, but that's another problem):
> "This error message is not very clear to me. I would suggest to add a bit more information about the expected signature.
>
> It's better to add a link to the issue, e.g. https://github.com/JetBrains/kotlin/blob/master/compiler/testData/diagnostics/testData/diagnostics/symbol.kt#L4"

Some comments actually match the sense of the ground truth:
> GT: "I would just call it `hasDependencies`. "Unresolved" is a special additional property used to indicate that `fun BaseKotlinLibrary.unresolvedDependencies()` returns `UnresolvedLibrary`s that need further resolve."
> 
> Pred: "This is not a property of `BaseKotlinLibrary`, so it should be `hasUnresolvedDependencies: Boolean`"
 
> GT: "Please revert this change to keep the style consistent."
> 
> Pred: "This change is not needed."

> GT: "And this is, imho, unnecessary copy-paste too."
> 
> Pred: "Please remove this"

And so on...
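
As for the "hesitant" output quoted earlier, with its endless "It is a class symbol. It is not a class symbol." loop: this kind of repetition is usually a decoding problem, and the `generate` method in `transformers` has parameters such as `no_repeat_ngram_size` and `repetition_penalty` that target it. I did not try them in this notebook, so the sketch below is only a suggestion. The values in `GENERATION_KWARGS` are arbitrary assumptions and `has_repeated_ngram` is a hypothetical helper for spotting such outputs.

```python
def has_repeated_ngram(text, n=4):
    """True if any run of n whitespace-separated tokens occurs twice.

    This mirrors what `no_repeat_ngram_size=n` forbids during generation
    (there it applies to model tokens rather than whitespace tokens).
    """
    tokens = text.split()
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


# Untested assumption: these values are a starting point, not tuned settings
GENERATION_KWARGS = {
    "max_length": 200,
    "no_repeat_ngram_size": 4,
    "repetition_penalty": 1.2,
}


def generate_without_loops(pipe, batch):
    """Call a text2text-generation pipeline with the settings above."""
    return pipe(batch, **GENERATION_KWARGS)


looping = ("It is a class symbol. It is not a class symbol. "
           "It is a class symbol. It is not a class symbol.")
print(has_repeated_ngram(looping))
```

Counting how many predictions trigger `has_repeated_ngram` before and after changing the generation settings would show whether they actually help.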

For the curious reader, here are all the predictions on the test partition of the Kotlin mini-dataset.


In [48]:
for i in range(len(diffs)):
    print("==========================================================================================================")
    print(f"===== Diff:\n{diffs[i]}\n\n===== Fine generated:\n{fine_preds[i]['pred'][5:]}\n===== Not fine generated:\n{not_fine_preds[i]['pred'][5:]}\n===== Target:\n{fine_preds[i]['target']}\n")

===== Diff:
@@ -1420,6 +1423,20 @@ class ClassFileToSourceStubConverter(val kaptContext: KaptContextForStubGenerati
         lineMappings.registerSignature(this, node)
         return this
     }
+
+    private fun fieldType(field: FieldNode, origin: JvmDeclarationOrigin?): Type {
+        val signType = Type.getType(field.desc)
+        return when (val declaration = origin?.element) {
+            is KtProperty -> {
+                val delegateType = kaptContext.bindingContext[BindingContext.EXPRESSION_TYPE_INFO, declaration.delegateExpression]?.type

===== Fine generated:
This is not a good way to get the delegate type. It should be resolved in the compiler.
===== Not fine generated:
This is a bit of a misnomer. It's not a function, it's a function that returns a type.
===== Target:
After this commit `convertKotlinType()` will be always used for properties with delegates.
Using it only for anonymous-type delegates would be safer, I think.

===== Diff:
@@ -76,8 +76,10 @@ class Decl