# Test data preparation
Since the training data was generated by querying the OpenAI GPT API, it might not be ideal for the ChatGPT case. The reason is that, while ChatGPT uses the same GPT model, its parameters may differ from those used in the API.
For that reason, I collected ChatGPT answers using the Toloka platform.
This notebook combines them with answers downloaded from StackOverflow for the same question IDs.

In [1]:
import pandas as pd
import html
import re

In [2]:
def rename_and_concat(dfs):
    renamed_columns = {
        'INPUT:question_id': 'question_id',
        'OUTPUT:answer': 'answer'
    }
    renamed_dfs = [df.rename(columns=renamed_columns)[list(renamed_columns.values())].copy() for df in dfs]
    return pd.concat(renamed_dfs, ignore_index=True)

In [3]:
toloka_answers = rename_and_concat((
    pd.read_table("./toloka/assignments_from_pool_37361170__07-02-2023.tsv"),
    pd.read_table("./toloka/assignments_from_pool_37593832__07-02-2023.tsv"),
    pd.read_table("./toloka/assignments_from_pool_37610098__07-02-2023.tsv")
))

len(toloka_answers)

1451

In [4]:
toloka_answers = toloka_answers.drop_duplicates(subset=['question_id'])
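
A minimal sketch of what this deduplication step does, using a made-up frame (the IDs and answers below are illustrative, not taken from the Toloka data): when several assignments answer the same question, only the first row per `question_id` is kept.

```python
import pandas as pd

# Hypothetical toy data: question 101 was answered in two assignments.
toy = pd.DataFrame({
    "question_id": [101, 102, 101],
    "answer": ["first answer", "other answer", "second answer"],
})

# keep="first" is the pandas default; it is spelled out here for clarity.
deduped = toy.drop_duplicates(subset=["question_id"], keep="first")

print(len(toy), "->", len(deduped))
print(deduped.answer.tolist())
```

Note that the original index labels are preserved, which is why later cells can still address rows such as `toloka_answers.answer[33]` by label.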

In [5]:
# Count the last token before the first "Copy code" marker in each answer.
with_code = toloka_answers[toloka_answers.answer.str.contains("Copy code")]
vc = with_code.answer.str.split("Copy code").str[0].str.split().str[-1].value_counts()
vc.head(50)

javascript         104
python              60
bash                59
css                 56
php                 36
sql                 31
scss                29
java                23
ruby                18
csharp              17
lua                 15
perl                13
vbnet               11
kotlin              10
typescript           9
makefile             7
c                    7
shell                6
command:             5
yaml                 3
example:             3
less                 3
rust                 2
cpp                  2
go                   2
objectivec           2
swift                2
R                    1
GB:                  1
syntax:              1
use:                 1
xml                  1
Example:             1
one:                 1
function:            1
prompt:              1
json                 1
GUID:                1
commit.              1
method.              1
branch:              1
loop:                1
JavaScript:          1
pseudo-elem
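
The counting step above can be unpacked into plain Python. This is a sketch only: the helper name `token_before_marker` and the sample answers are hypothetical, chosen to mirror the `\r\n`-separated shape of the collected answers.

```python
from collections import Counter

MARKER = "Copy code"

def token_before_marker(text):
    """Return the last whitespace-delimited token before the first marker, or None."""
    if MARKER not in text:
        return None
    tokens = text.split(MARKER)[0].split()
    return tokens[-1] if tokens else None

# Hypothetical answers in the same shape as the Toloka export.
answers = [
    "Run this:\r\n\r\nbash\r\nCopy code\r\nls -la",
    "Here's an example:\r\n\r\nCopy code\r\nconst x = 1;",
    "No code in this answer.",
]

counts = Counter(t for t in map(token_before_marker, answers) if t is not None)
print(counts.most_common())
```

Tokens such as `bash` are language labels from the ChatGPT code widget, while tokens such as `example:` are ordinary prose that happens to precede an unlabeled code block.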

In [6]:
copy_code_prefixed = [
    "javascript",
    "python",
    "bash",
    "css",
    "php",
    "sql",
    "scss",
    "java",
    "ruby",
    "csharp",
    "lua",
    "perl",
    "vbnet",
    "kotlin",
    "typescript",
    "makefile",
    "c",
    "shell",
    "yaml",
    "less",
    "rust",
    "cpp",
    "go",
    "objectivec",
    "swift",
    "R",
    "xml",
    "json",
]
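
In the cells shown here the list is only defined. One possible use, sketched under assumptions, is to strip a language label together with the "Copy code" marker that follows it. The sketch assumes the label is a standalone token separated from the marker only by whitespace; `labels` is a shortened, hypothetical stand-in for `copy_code_prefixed`, and `strip_copy_code` is not part of the notebook.

```python
import re

# Shortened, hypothetical stand-in for copy_code_prefixed.
labels = ["javascript", "python", "bash", "c"]

# Longest labels first, so a short label never shadows a longer one.
alternatives = "|".join(
    re.escape(label) for label in sorted(labels, key=len, reverse=True)
)

# (?<!\S) requires the label to start a whitespace-delimited token,
# so the trailing "c" of a word like "basic" is not treated as a label.
pattern = re.compile(rf"(?:(?<!\S)(?:{alternatives})\s+)?Copy code\s*")

def strip_copy_code(text):
    """Remove 'Copy code' markers and any language label directly before them."""
    return pattern.sub("", text)

samples = [
    "Use this:\r\n\r\npython\r\nCopy code\r\nprint('hi')",
    "Here's an example:\r\n\r\nCopy code\r\nconst a = 1;",
]
cleaned = [strip_copy_code(s) for s in samples]
print(cleaned)
```

Unlike a plain `str.replace("Copy code", "")`, this also drops the label line, at the cost of occasionally removing a real word (for example a sentence ending in "go") that directly precedes a marker.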

In [7]:
toloka_answers.answer[0]

'To rename a local Git branch, you can use the command git branch -m old_branch_name new_branch_name. This will rename the local branch from "old_branch_name" to "new_branch_name".\r\n\r\nIf you want to rename a local branch which has not yet been pushed to a remote repository, you can first rename the local branch with the above command, then use the command git push origin :old_branch_name new_branch_name to delete the old branch and push the new branch to the remote repository.\r\n\r\nIf you want to rename both a local and remote branch, you would first rename the local branch as described above, then use the command git push origin :old_branch_name new_branch_name to delete the old remote branch and push the new branch. Then use the command git push origin -u new_branch_name to set the new remote branch as the default.'

In [8]:
curvy_indices = toloka_answers.answer.str.split('Copy code').apply(lambda x: x[0]).str.split().apply(lambda x: x[-1]) == "example:"

In [9]:
toloka_answers[curvy_indices & toloka_answers.answer.str.contains("Copy code")]

     question_id                                             answer
33        122102  There are several ways to deep clone an object...
46        157944  You can convert an array of type Element[] int...
943       343646  You can ignore directories in Git on Windows b...


In [10]:
toloka_answers.answer[33]

"There are several ways to deep clone an object in JavaScript, and the most efficient method depends on the specific use case and the size of the object being cloned.\r\n\r\nOne of the most efficient ways to deep clone an object is to use the Object.assign() method in combination with the spread operator (...). This method creates a new object and copies the properties and values of the original object to the new object. Here's an example:\r\n\r\nCopy code\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = {...originalObject};\r\nAnother way is to use the JSON.parse(JSON.stringify(obj)) method, which converts the object to a JSON string, and then parses it back into a new object. This method is relatively fast and works well for small to medium-sized objects. However, it has some limitations. It does not work with functions, undefined, Symbol and some other object properties.\r\n\r\nCopy code\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObje

In [11]:
stripped_toloka = toloka_answers.copy()
stripped_toloka.answer = stripped_toloka.answer.str.replace("Copy code", "")
stripped_toloka.answer[33]

"There are several ways to deep clone an object in JavaScript, and the most efficient method depends on the specific use case and the size of the object being cloned.\r\n\r\nOne of the most efficient ways to deep clone an object is to use the Object.assign() method in combination with the spread operator (...). This method creates a new object and copies the properties and values of the original object to the new object. Here's an example:\r\n\r\n\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = {...originalObject};\r\nAnother way is to use the JSON.parse(JSON.stringify(obj)) method, which converts the object to a JSON string, and then parses it back into a new object. This method is relatively fast and works well for small to medium-sized objects. However, it has some limitations. It does not work with functions, undefined, Symbol and some other object properties.\r\n\r\n\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = JSON.parse(JS

In [12]:
raw_human_answers = pd.read_json("./raw/answers_human/all_answers.jsonl", lines=True)
len(raw_human_answers)

18129

In [13]:
all_answers = raw_human_answers[['question_id', 'body']].rename(columns={'body': 'answer'}).copy()
all_answers['target'] = 0
all_answers.answer = all_answers.answer.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
all_answers = all_answers.drop_duplicates(subset=['question_id'])
all_answers = all_answers[all_answers.question_id.isin(list(toloka_answers.question_id))]
all_answers.question_id.nunique()

1447
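
The cleaning expression in the cell above can be tried in isolation. This sketch wraps it in a hypothetical helper, `strip_html`, and runs it on made-up snippets. Because tags are removed before entities are unescaped, escaped characters such as `&lt;` inside code survive as literal `<`. The regex is a heuristic rather than an HTML parser.

```python
import html
import re

def strip_html(body):
    """Drop anything that looks like a tag, then decode HTML entities."""
    return html.unescape(re.sub(r'<[^<]+?>', '', body))

# Made-up snippets in the style of a StackOverflow answer body.
prose = "<p>Use <code>git branch -m</code> &amp; push.</p>"
code = "<pre><code>if a &lt; b: pass</code></pre>"

print(strip_html(prose))
print(strip_html(code))
```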

In [14]:
toloka_answers['target'] = 1
toloka_answers.question_id.nunique()

1447

In [15]:
test_data = pd.concat((all_answers, toloka_answers), ignore_index=True)
test_data.question_id.nunique()

1447

In [16]:
test_data.answer = test_data.answer.str.replace("Copy code", "")

In [17]:
questions = pd.read_json("./toloka/questions.jsonl", lines=True)
add_questions = pd.read_csv("./toloka/questions.csv")
questions = pd.concat((questions, add_questions), ignore_index=True)
test_data['question'] = test_data.question_id.apply(lambda x: list(questions[questions.question_id == x].question)[0])
test_data

      question_id                                             answer  target                                           question
0          927358  A simple step-by-step guide is as follows:\n\n...       0  How do I undo the most recent local commits in...
1         2003505  Here you can delete remote branches correspond...       0  How do I delete a Git branch locally and remot...
2          231767  yield:\n\ncan return a value multiple times fr...       0  What does the "yield" keyword do?\n\nWhat is t...
3         6591213  If you want to change the name of a branch\ngi...       0  How do I rename a local Git branch?\n\nHow do ...
4          348170  git add -A\n\nis used to add all files to your...       0  How do I undo 'git add' before commit?\n\nI mi...
...           ...                                                ...     ...                                                ...
2889      1213430  To delete a Git repository that was created wi...       1  How to fully delete a git repository created w...
2890      4470523  You are not doing anything wrong. The fast-for...       1  Create a branch in Git from another branch\n\n...
2891     15202997  The difference between the three is:\r\n\r\nSi...       1  What is the difference between canonical name,...
2892      4912092  Google uses JavaScript to capture screenshots ...       1  Using HTML5/Canvas/JavaScript to take in-brows...
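
The per-row lookup in the cell above filters the whole `questions` frame once for every row of `test_data`. A possible alternative, sketched here on made-up frames, builds a `question_id` to `question` mapping once and applies it with `Series.map`. The name `lookup` is hypothetical. One behavioural difference to keep in mind: an ID missing from `questions` yields `NaN` here, whereas the original expression raises an `IndexError`.

```python
import pandas as pd

# Made-up stand-ins for the notebook's frames.
questions = pd.DataFrame({
    "question_id": [1, 2, 2],
    "question": ["q one", "q two", "q two (duplicate)"],
})
test_data = pd.DataFrame({
    "question_id": [2, 1, 2],
    "answer": ["human answer", "human answer", "ChatGPT answer"],
    "target": [0, 0, 1],
})

# Keep the first question per ID, matching the `[0]` in the original lookup.
lookup = (
    questions.drop_duplicates(subset=["question_id"])
    .set_index("question_id")["question"]
)
test_data["question"] = test_data["question_id"].map(lookup)
print(test_data)
```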


In [18]:
test_data[test_data.answer.str.len() < 8000].to_json("test_data.jsonl", lines=True, orient='records')


In [19]:
test_data[test_data.answer.str.len() < 8000].shape

(2886, 4)
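
The length filter and the JSON Lines export can be checked on a small frame. In this sketch the frame contents and the `MAX_LEN` constant name are illustrative; the threshold value of 8000 comes from the cells above. Calling `to_json` without a path returns the serialized text, which makes the result easy to inspect.

```python
import json

import pandas as pd

MAX_LEN = 8000  # same threshold as in the cells above

# Made-up data: the second answer exceeds the threshold.
toy = pd.DataFrame({
    "question_id": [1, 2],
    "answer": ["short answer", "x" * 9000],
    "target": [0, 1],
})

# Compute the mask once and reuse the filtered frame.
filtered = toy[toy["answer"].str.len() < MAX_LEN]

# One JSON object per line, one line per remaining row.
jsonl = filtered.to_json(orient="records", lines=True)
records = [json.loads(line) for line in jsonl.splitlines() if line]
print(filtered.shape, records)
```

Since the longest Toloka answer is 3074 characters, a filter at 8000 can only remove rows from the StackOverflow side, so a few questions may end up with a ChatGPT answer but no human counterpart.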

In [20]:
max(toloka_answers.answer.str.len())

3074

In [21]:
test_data.shape

(2894, 4)