# Data preparation
Here I"m cleaning up data from ChatGPT (manually collected by Toloka users) and answers from StackOverflow.
Then it is labeled and combined into one dataset to be later used for training

In [15]:
import pandas as pd
import html
import re

In [16]:
def rename_and_concat(dfs):
    renamed_columns = {
        'INPUT:question_id': 'question_id',
        'OUTPUT:answer': 'answer'
    }
    renamed_dfs = [df.rename(columns=renamed_columns)[renamed_columns.values()].copy() for df in dfs]
    return pd.concat(renamed_dfs, ignore_index=True)

In [22]:
toloka_answers = rename_and_concat([
    pd.read_table("./data/toloka/assignments_from_pool_37361170__07-02-2023.tsv"),
    pd.read_table("./data/toloka/assignments_from_pool_37593832__07-02-2023.tsv"),
    pd.read_table("./data/toloka/assignments_from_pool_37610098__07-02-2023.tsv"),
    pd.read_table("./data/toloka/assignments_from_pool_37660279__10-02-2023.tsv")
])

print(len(toloka_answers))
toloka_answers

2411


Unnamed: 0,question_id,answer
0,6591213,"To rename a local Git branch, you can use the ..."
1,927358,You can use the git reset command to undo the ...
2,359494,"In JavaScript, it is generally recommended to ..."
3,2003505,"To delete a local branch, you can use the comm..."
4,100003,"In Python, a class is an object that defines t..."
...,...,...
2406,26797739,"Yes, Swift does have a trimmingCharacters(in:)..."
2407,153890,"Yes, there is a way to do this using the print..."
2408,44084846,It seems that the Docker daemon is not running...
2409,1714297,The setId method sets a unique identifier for ...


In [23]:
toloka_answers = toloka_answers.drop_duplicates(subset=['question_id'])
len(toloka_answers)

2406

In [24]:
vc = toloka_answers[toloka_answers.answer.str.contains("Copy code")].answer.str.split('Copy code').apply(lambda x: x[0]).str.split().apply(lambda x: x[-1]).value_counts()
vc.head(50)

javascript     167
python         113
css             97
bash            95
php             62
scss            53
sql             51
java            50
csharp          33
lua             25
ruby            23
typescript      20
vbnet           17
perl            16
c               16
less            15
kotlin          15
shell           10
makefile         8
go               8
objectivec       6
swift            6
cpp              5
command:         5
rust             4
example:         4
yaml             4
json             2
xml              2
branch:          2
syntax:          1
line:            1
CLI:             1
80:              1
file:            1
database:        1
graphql          1
Example:         1
package:         1
function:        1
R                1
GB:              1
use:             1
one:             1
prompt:          1
GUID:            1
commit.          1
method.          1
loop:            1
JavaScript:      1
Name: answer, dtype: int64

In [25]:
copy_code_prefixed = [
    "javascript",
    "python",
    "bash",
    "css",
    "php",
    "sql",
    "scss",
    "java",
    "ruby",
    "csharp",
    "lua",
    "perl",
    "vbnet",
    "kotlin",
    "typescript",
    "makefile",
    "c",
    "shell",
    "yaml",
    "less",
    "rust",
    "cpp",
    "go",
    "objectivec",
    "swift",
    "R",
    "xml",
    "json",
]

In [26]:
stripped_toloka = toloka_answers.copy()
stripped_toloka.answer = stripped_toloka.answer.str.replace("Copy code", "")
stripped_toloka.answer[33]

"There are several ways to deep clone an object in JavaScript, and the most efficient method depends on the specific use case and the size of the object being cloned.\r\n\r\nOne of the most efficient ways to deep clone an object is to use the Object.assign() method in combination with the spread operator (...). This method creates a new object and copies the properties and values of the original object to the new object. Here's an example:\r\n\r\n\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = {...originalObject};\r\nAnother way is to use the JSON.parse(JSON.stringify(obj)) method, which converts the object to a JSON string, and then parses it back into a new object. This method is relatively fast and works well for small to medium-sized objects. However, it has some limitations. It does not work with functions, undefined, Symbol and some other object properties.\r\n\r\n\r\nconst originalObject = {a: 1, b: 2, c: {d: 3}};\r\nconst clonedObject = JSON.parse(JS

In [27]:
raw_human_answers = pd.read_json("./data/raw/answers_human/all_answers.jsonl", lines=True)
len(raw_human_answers)

29207

In [28]:
all_answers = raw_human_answers[['question_id', 'body']].rename(columns={'body': 'answer'}).copy()
all_answers['target'] = 0
all_answers.answer = all_answers.answer.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
all_answers = all_answers.drop_duplicates(subset=['question_id'])
all_answers = all_answers[~all_answers.question_id.isin(list(toloka_answers.question_id))]
all_answers = all_answers.iloc[:toloka_answers.shape[0]:]
all_answers.shape

(2406, 3)

In [29]:
toloka_answers['target'] = 1
toloka_answers.question_id.nunique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  toloka_answers['target'] = 1


2406

In [30]:
test_data = pd.concat((all_answers, toloka_answers), ignore_index=True)
test_data.question_id.nunique()

4812

In [31]:
test_data.answer = test_data.answer.str.replace("Copy code", "")

In [32]:
questions = pd.read_json("./data/raw/questions/questions.jsonl", lines=True)
test_data['question'] = test_data.question_id.apply(lambda x: list(questions[questions.question_id == x].body)[0])
test_data.question = test_data.question.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
test_data.question = questions.title.apply(html.unescape) + '\n\n' + test_data.question

test_data

Unnamed: 0,question_id,answer,target,question
0,11227809,An answer for quick and simple understanding (...,0,Why is processing a sorted array faster than p...
1,292357,Fetch\ngit fetch really only downloads new dat...,0,How do I undo the most recent local commits in...
2,477816,The most common MIME type is application/json....,0,How do I delete a Git branch locally and remot...
3,5767325,"let removeAnElement = (arr, element)=>{\n l...",0,What is the difference between 'git pull' and ...
4,244777,"I searched all pages of answers, and none ment...",0,"What does the ""yield"" keyword do?\n\nCan I use..."
...,...,...,...,...
4807,26797739,"Yes, Swift does have a trimmingCharacters(in:)...",1,Updating to latest version of CocoaPods?\n\nDo...
4808,153890,"Yes, there is a way to do this using the print...",1,When is assembly faster than C?\n\nI'm trying ...
4809,44084846,It seems that the Docker daemon is not running...,1,What is the difference between Set and List?\n...
4810,1714297,The setId method sets a unique identifier for ...,1,Remove all occurrences of a value from a list?...


In [33]:
questions

Unnamed: 0,tags,accepted_answer_id,question_id,link,title,body
0,"[java, c++, performance, cpu-architecture, bra...",11227902.0,11227809,https://stackoverflow.com/questions/11227809/w...,Why is processing a sorted array faster than p...,<p>Here is a piece of C++ code that shows some...
1,"[git, version-control, git-commit, undo]",927386.0,927358,https://stackoverflow.com/questions/927358/how...,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...
2,"[git, version-control, git-branch, git-push, g...",2003515.0,2003505,https://stackoverflow.com/questions/2003505/ho...,How do I delete a Git branch locally and remot...,<p>Failed Attempts to Delete a Remote Branch:<...
3,"[git, version-control, git-pull, git-fetch]",292359.0,292357,https://stackoverflow.com/questions/292357/wha...,What is the difference between &#39;git pull&#...,"<p>What are the differences between <a href=""h..."
4,"[python, iterator, generator]",231855.0,231767,https://stackoverflow.com/questions/231767/wha...,What does the &quot;yield&quot; keyword do?,<p>What is the use of the <code>yield</code> k...
...,...,...,...,...,...,...
9995,"[ruby, arrays]",5878727.0,5878697,https://stackoverflow.com/questions/5878697/ho...,How do I remove blank elements from an array?,<p>I have the following array </p>\n\n<pre><co...
9996,"[java, variables, properties, system, environm...",7055010.0,7054972,https://stackoverflow.com/questions/7054972/ja...,Java system properties and environment variables,<p>What's the difference between system proper...
9997,"[html, css, css-multicolumn-layout]",7785711.0,7785374,https://stackoverflow.com/questions/7785374/ho...,How to prevent column break within an element?,<p>Consider the following HTML:</p>\n\n<pre><c...
9998,"[javascript, load-order]",8996894.0,8996852,https://stackoverflow.com/questions/8996852/lo...,load and execute order of scripts,<p>There are so many different ways to include...


In [34]:
test_data.to_json("./data/balanced_data.jsonl", lines=True, orient='records')

In [35]:
test_data.shape

(4812, 4)