# Test data preparation
Since the training data was generated by requesting OpenAI GPT API, it might be not ideal for the case of ChatGPT. The reason is that while ChatGPT uses the same GPT model, its parameters may differ from the one used in API.
For that reason I've collected ChatGPT answers using Toloka platform.
This notebook is to combine it with answers downloaded from StackOverflow with the same question IDs

In [1]:
import pandas as pd
import html
import re

In [2]:
def rename_and_concat(dfs):
    renamed_columns = {
        'INPUT:question_id': 'question_id',
        'OUTPUT:answer': 'answer'
    }
    renamed_dfs = [df.rename(columns=renamed_columns)[renamed_columns.values()].copy() for df in dfs]
    return pd.concat(renamed_dfs, ignore_index=True)

In [3]:
toloka_answers = rename_and_concat((
    pd.read_table("./data/toloka/assignments_from_pool_37361170__07-02-2023.tsv"),
    pd.read_table("./data/toloka/assignments_from_pool_37593832__07-02-2023.tsv"),
    pd.read_table("./data/toloka/assignments_from_pool_37610098__07-02-2023.tsv")
))

len(toloka_answers)

1451

In [4]:
toloka_answers = toloka_answers.drop_duplicates(subset=['question_id'])

In [5]:
vc = toloka_answers[toloka_answers.answer.str.contains("Copy code")].answer.str.split('Copy code').apply(lambda x: x[0]).str.split().apply(lambda x: x[-1]).value_counts()
vc.head(50)

javascript         104
python              60
bash                59
css                 56
php                 36
sql                 31
scss                29
java                23
ruby                18
csharp              17
lua                 15
perl                13
vbnet               11
kotlin              10
typescript           9
makefile             7
c                    7
shell                6
command:             5
yaml                 3
example:             3
less                 3
rust                 2
cpp                  2
go                   2
objectivec           2
swift                2
R                    1
GB:                  1
syntax:              1
use:                 1
xml                  1
Example:             1
one:                 1
function:            1
prompt:              1
json                 1
GUID:                1
commit.              1
method.              1
branch:              1
loop:                1
JavaScript:          1
pseudo-elem

In [6]:
copy_code_prefixed = [
    "javascript",
    "python",
    "bash",
    "css",
    "php",
    "sql",
    "scss",
    "java",
    "ruby",
    "csharp",
    "lua",
    "perl",
    "vbnet",
    "kotlin",
    "typescript",
    "makefile",
    "c",
    "shell",
    "yaml",
    "less",
    "rust",
    "cpp",
    "go",
    "objectivec",
    "swift",
    "R",
    "xml",
    "json",
]

In [7]:
stripped_toloka = toloka_answers.copy()
stripped_toloka.answer = stripped_toloka.answer.str.replace("Copy code", "")
stripped_toloka.answer[33]

"There are several ways to deep clone an object in JavaScript, and the most efficient method depends on the specific use case and the size of the object being cloned.\r\n\r\nOne of the most efficient ways to deep clone an object is to use the Object.assign() method in combination with the spread operator (...). This method creates a new object and copies the properties and values of the original object to the new object. Here's an example:\r\n\r\n\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = {...originalObject};\r\nAnother way is to use the JSON.parse(JSON.stringify(obj)) method, which converts the object to a JSON string, and then parses it back into a new object. This method is relatively fast and works well for small to medium-sized objects. However, it has some limitations. It does not work with functions, undefined, Symbol and some other object properties.\r\n\r\n\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = JSON.parse(JS

In [8]:
raw_human_answers = pd.read_json("./data/raw/answers_human/all_answers.jsonl", lines=True)
len(raw_human_answers)

29207

In [9]:
all_answers = raw_human_answers[['question_id', 'body']].rename(columns={'body': 'answer'}).copy()
all_answers['target'] = 0
all_answers.answer = all_answers.answer.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
all_answers = all_answers.drop_duplicates(subset=['question_id'])
all_answers = all_answers[~all_answers.question_id.isin(list(toloka_answers.question_id))]
all_answers = all_answers.iloc[:toloka_answers.shape[0]:]
all_answers.shape

(1447, 3)

In [10]:
toloka_answers['target'] = 1
toloka_answers.question_id.nunique()

1447

In [11]:
test_data = pd.concat((all_answers, toloka_answers), ignore_index=True)
test_data.question_id.nunique()

2894

In [12]:
test_data.answer = test_data.answer.str.replace("Copy code", "")

In [13]:
questions = pd.read_json("./data/raw/questions/questions.jsonl", lines=True)
test_data['question'] = test_data.question_id.apply(lambda x: list(questions[questions.question_id == x].body)[0])
test_data.question = test_data.question.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
test_data.question = questions.title.apply(html.unescape) + '\n\n' + test_data.question

test_data

Unnamed: 0,question_id,answer,target,question
0,11227809,An answer for quick and simple understanding (...,0,Why is processing a sorted array faster than p...
1,292357,Fetch\ngit fetch really only downloads new dat...,0,How do I undo the most recent local commits in...
2,477816,The most common MIME type is application/json....,0,How do I delete a Git branch locally and remot...
3,5767325,"let removeAnElement = (arr, element)=>{\n l...",0,What is the difference between 'git pull' and ...
4,244777,"I searched all pages of answers, and none ment...",0,"What does the ""yield"" keyword do?\n\nCan I use..."
...,...,...,...,...
2889,1213430,To delete a Git repository that was created wi...,1,How to change indentation in Visual Studio Cod...
2890,4470523,You are not doing anything wrong. The fast-for...,1,How to specify a port to run a create-react-ap...
2891,15202997,The difference between the three is:\r\n\r\nSi...,1,What's the PostgreSQL datatype equivalent to M...
2892,4912092,Google uses JavaScript to capture screenshots ...,1,"Does Python have an ordered set?\n\nGoogle's ""..."


In [14]:
questions

Unnamed: 0,tags,accepted_answer_id,question_id,link,title,body
0,"[java, c++, performance, cpu-architecture, bra...",11227902.0,11227809,https://stackoverflow.com/questions/11227809/w...,Why is processing a sorted array faster than p...,<p>Here is a piece of C++ code that shows some...
1,"[git, version-control, git-commit, undo]",927386.0,927358,https://stackoverflow.com/questions/927358/how...,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...
2,"[git, version-control, git-branch, git-push, g...",2003515.0,2003505,https://stackoverflow.com/questions/2003505/ho...,How do I delete a Git branch locally and remot...,<p>Failed Attempts to Delete a Remote Branch:<...
3,"[git, version-control, git-pull, git-fetch]",292359.0,292357,https://stackoverflow.com/questions/292357/wha...,What is the difference between &#39;git pull&#...,"<p>What are the differences between <a href=""h..."
4,"[python, iterator, generator]",231855.0,231767,https://stackoverflow.com/questions/231767/wha...,What does the &quot;yield&quot; keyword do?,<p>What is the use of the <code>yield</code> k...
...,...,...,...,...,...,...
9995,"[ruby, arrays]",5878727.0,5878697,https://stackoverflow.com/questions/5878697/ho...,How do I remove blank elements from an array?,<p>I have the following array </p>\n\n<pre><co...
9996,"[java, variables, properties, system, environm...",7055010.0,7054972,https://stackoverflow.com/questions/7054972/ja...,Java system properties and environment variables,<p>What's the difference between system proper...
9997,"[html, css, css-multicolumn-layout]",7785711.0,7785374,https://stackoverflow.com/questions/7785374/ho...,How to prevent column break within an element?,<p>Consider the following HTML:</p>\n\n<pre><c...
9998,"[javascript, load-order]",8996894.0,8996852,https://stackoverflow.com/questions/8996852/lo...,load and execute order of scripts,<p>There are so many different ways to include...


In [51]:
test_data.to_json("./data/balanced_data.jsonl", lines=True, orient='records')

In [37]:
test_data.shape

(7098, 4)