In [18]:
import pandas as pd

In [19]:
import re

## Raw questions and human answers
The files in the `data/raw/questions` and `data/raw/answers_human` folders were produced by running `download_questions.py` and `download_answers.py` and merging each script's output into a single file.
The merge was done manually in the terminal.

In [20]:
raw_questions_df = pd.read_json("data/raw/questions/questions.jsonl", lines=True)
questions_df = raw_questions_df.drop(['accepted_answer_id', 'link'], axis=1)
questions_df = questions_df.rename(columns={'body': 'question'})
questions_df

Unnamed: 0,tags,question_id,title,question
0,"[java, c++, performance, cpu-architecture, bra...",11227809,Why is processing a sorted array faster than p...,<p>Here is a piece of C++ code that shows some...
1,"[git, version-control, git-commit, undo]",927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...
2,"[git, version-control, git-branch, git-push, g...",2003505,How do I delete a Git branch locally and remot...,<p>Failed Attempts to Delete a Remote Branch:<...
3,"[git, version-control, git-pull, git-fetch]",292357,What is the difference between &#39;git pull&#...,"<p>What are the differences between <a href=""h..."
4,"[python, iterator, generator]",231767,What does the &quot;yield&quot; keyword do?,<p>What is the use of the <code>yield</code> k...
...,...,...,...,...
9995,"[ruby, arrays]",5878697,How do I remove blank elements from an array?,<p>I have the following array </p>\n\n<pre><co...
9996,"[java, variables, properties, system, environm...",7054972,Java system properties and environment variables,<p>What's the difference between system proper...
9997,"[html, css, css-multicolumn-layout]",7785374,How to prevent column break within an element?,<p>Consider the following HTML:</p>\n\n<pre><c...
9998,"[javascript, load-order]",8996852,load and execute order of scripts,<p>There are so many different ways to include...


In [21]:
raw_answers_df = pd.read_json("data/raw/answers_human/answers_human.jsonl", lines=True)
answers_df = raw_answers_df.drop(['last_activity_date', 'answer_id', 'link', 'title'], axis=1)
answers_df = answers_df.rename(columns={'body': 'answer'})
answers_df

Unnamed: 0,question_id,answer
0,11227809,"<p><strong>You are a victim of <a href=""https:..."
1,927358,<h1>Undo a commit &amp; redo</h1>\n<pre class=...
2,2003505,<h1>Executive Summary</h1>\n<pre><code>git pus...
3,292357,"<p>In the simplest terms, <a href=""http://git-..."
4,231767,"<p>To understand what <code>yield</code> does,..."
...,...,...
8939,5878697,"<p>There are many ways to do this, one is <cod..."
8940,7054972,<p>I think the difference between the two boil...
8941,7785374,<p>The correct way to do this is with the <a h...
8942,8996852,<p>If you aren't dynamically loading scripts o...


## Cleaning up and preparing the data

In [22]:
qa_df = pd.merge(questions_df, answers_df, on="question_id", how="inner")
qa_df

Unnamed: 0,tags,question_id,title,question,answer
0,"[java, c++, performance, cpu-architecture, bra...",11227809,Why is processing a sorted array faster than p...,<p>Here is a piece of C++ code that shows some...,"<p><strong>You are a victim of <a href=""https:..."
1,"[git, version-control, git-commit, undo]",927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...,<h1>Undo a commit &amp; redo</h1>\n<pre class=...
2,"[git, version-control, git-branch, git-push, g...",2003505,How do I delete a Git branch locally and remot...,<p>Failed Attempts to Delete a Remote Branch:<...,<h1>Executive Summary</h1>\n<pre><code>git pus...
3,"[git, version-control, git-pull, git-fetch]",292357,What is the difference between &#39;git pull&#...,"<p>What are the differences between <a href=""h...","<p>In the simplest terms, <a href=""http://git-..."
4,"[python, iterator, generator]",231767,What does the &quot;yield&quot; keyword do?,<p>What is the use of the <code>yield</code> k...,"<p>To understand what <code>yield</code> does,..."
...,...,...,...,...,...
8943,"[ruby, arrays]",5878697,How do I remove blank elements from an array?,<p>I have the following array </p>\n\n<pre><co...,"<p>There are many ways to do this, one is <cod..."
8944,"[java, variables, properties, system, environm...",7054972,Java system properties and environment variables,<p>What's the difference between system proper...,<p>I think the difference between the two boil...
8945,"[html, css, css-multicolumn-layout]",7785374,How to prevent column break within an element?,<p>Consider the following HTML:</p>\n\n<pre><c...,<p>The correct way to do this is with the <a h...
8946,"[javascript, load-order]",8996852,load and execute order of scripts,<p>There are so many different ways to include...,<p>If you aren't dynamically loading scripts o...
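Judging by the index values in the outputs above, the merged frame appears to have 8947 rows while the answers frame has 8943, which would suggest that some `question_id` values repeat in one of the inputs. The sketch below shows how an inner merge multiplies duplicated keys and how it could be checked. The toy frames and their values are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for questions_df and answers_df.
questions = pd.DataFrame({
    "question_id": [1, 2, 2, 3],      # id 2 appears twice
    "title": ["a", "b", "b", "c"],
})
answers = pd.DataFrame({
    "question_id": [1, 2, 4],
    "answer": ["x", "y", "z"],
})

# An inner merge emits one row per matching key pair,
# so the duplicated question shows up twice in the result.
merged = pd.merge(questions, answers, on="question_id", how="inner")

# Count repeated keys before merging.
dup_count = questions["question_id"].duplicated().sum()

# Dropping duplicates first, then asking pandas to verify key uniqueness:
# validate="one_to_one" raises MergeError if either side has repeated keys.
deduped = questions.drop_duplicates(subset="question_id")
merged_unique = pd.merge(
    deduped, answers, on="question_id", how="inner", validate="one_to_one"
)

print(len(merged), dup_count, len(merged_unique))
```

Whether duplicates should be dropped here depends on how the raw files were merged, so this is a diagnostic to consider rather than a required step.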


The StackExchange API returns text that contains HTML tags. To prepare it for the OpenAI API and for later NLP classification, I strip all tags and unescape the HTML entities.
I also prepend the title to the body of the question, so that the AI gets the same context the human had when answering.

In [23]:
import html

In [24]:
qa_df_pure = qa_df.copy()
# Remove HTML tags first, then decode entities such as &amp; or &#39;
qa_df_pure.question = qa_df.question.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
# Prepend the (unescaped) title to the cleaned question body
qa_df_pure.question = qa_df.title.apply(html.unescape) + '\n\n' + qa_df_pure.question
qa_df_pure.answer = qa_df.answer.apply(lambda x: html.unescape(re.sub(r'<[^<]+?>', '', x)))
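To make the cleaning step easier to follow, here is a minimal standalone sketch of the same regex-plus-unescape approach on a single string. The helper name `strip_html` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import html
import re

# Same pattern as in the cell above: "<", one or more non-"<" characters, ">".
TAG_RE = re.compile(r'<[^<]+?>')

def strip_html(text: str) -> str:
    """Remove tag-like spans, then decode HTML entities."""
    return html.unescape(TAG_RE.sub('', text))

sample = '<p>What does the <code>yield</code> keyword do &amp; why?</p>'
cleaned = strip_html(sample)
print(cleaned)

# Stripping before unescaping matters: escaped angle brackets inside code
# are still entities when the regex runs, so they are not mistaken for tags.
code_sample = strip_html('<code>vector&lt;int&gt;</code>')
print(code_sample)
```

A regex is a pragmatic choice here, not a full HTML parser. It relies on literal `<` characters in the text arriving escaped as `&lt;`, which the sample rows above suggest is the case.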

## Limiting the size of the answer
In requests to the OpenAI API I use a limit of `max_tokens=2048`, which is roughly 8000 characters. That's why I filter longer answers out of the human-generated data.

In [25]:
qa_df_limited_answer = qa_df_pure[qa_df_pure.answer.str.len() < 8000].drop(['title'], axis=1)
qa_df_limited_answer.columns

Index(['tags', 'question_id', 'question', 'answer'], dtype='object')
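The 8000-character threshold can be reproduced with a quick back-of-the-envelope calculation. The ratio of about four characters per token is a common rule of thumb for English text and is an assumption here; the notebook does not measure it.

```python
import pandas as pd

MAX_TOKENS = 2048       # limit used in the OpenAI requests (from the text above)
CHARS_PER_TOKEN = 4     # assumed rule-of-thumb average, not measured

approx_chars = MAX_TOKENS * CHARS_PER_TOKEN
print(approx_chars)     # 8192, rounded down to 8000 in the notebook

# The same kind of length filter on a hypothetical toy frame.
toy = pd.DataFrame({"answer": ["a" * 10, "b" * 7999, "c" * 8000, "d" * 9000]})
kept = toy[toy.answer.str.len() < 8000]
print(len(kept))
```

Code-heavy answers may tokenize less efficiently than plain prose, so a character cutoff is only an approximation of the token limit.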

In [26]:
questions_gpt_api = qa_df_limited_answer.drop(['tags', 'answer'], axis=1)

In [27]:
questions_gpt_api.shape

(8811, 2)

## Saving data for further processing by OpenAI API

In [28]:
questions_gpt_api.to_json("data/gpt_api/questions.jsonl", lines=True, orient='records')
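For reference, `lines=True` together with `orient='records'` writes one JSON object per row, each on its own line (the JSON Lines format). The sketch below round-trips a hypothetical two-row frame in memory instead of touching the real file.

```python
import io
import json

import pandas as pd

# Hypothetical frame with the same two columns as questions_gpt_api.
toy = pd.DataFrame({
    "question_id": [1, 2],
    "question": ["Title one\n\nBody one", "Title two\n\nBody two"],
})

# With no path argument, to_json returns the serialized text.
text = toy.to_json(lines=True, orient="records")

# Newlines inside strings are escaped, so each record stays on one line.
records = [json.loads(line) for line in text.splitlines() if line]
print(records[0])

# Reading the text back restores the same frame.
restored = pd.read_json(io.StringIO(text), lines=True)
print(restored.equals(toy))
```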

In [29]:
qa_df_final = qa_df_pure[qa_df_pure.answer.str.len() < 8000]
qa_df_final.columns, questions_gpt_api.columns

(Index(['tags', 'question_id', 'title', 'question', 'answer'], dtype='object'),
 Index(['question_id', 'question'], dtype='object'))

## Raw AI-generated data
The files in the `data/raw/answers_gpt_api` folder were produced by running the `download_answers_ai.py` script.
This was done manually in the terminal.

In [30]:
ai_answers = pd.read_json("data/raw/answers_gpt_api/answers_ai.jsonl", lines=True)

In [31]:
complete_data = pd.merge(
    qa_df_final.rename(columns={'answer': 'human_answer'}),
    ai_answers.rename(columns={'answer': 'ai_answer'}),
    on='question_id',
    how='inner',
)
# Drop the columns that are not needed for EDA or training
complete_data = complete_data.drop(['tokens_spent', 'tags', 'title'], axis=1)

In [32]:
max(complete_data.human_answer.str.len()), max(complete_data.ai_answer.str.len())

(7988, 5046)

## Limiting answer size again
Even with the first limit in place, the AI tends to write shorter answers. I don't want the model to rely on that fact during training, so I restrict the dataset so that the maximum StackOverflow answer length matches the maximum AI answer length:

In [33]:
# Keep only rows whose human answer is no longer than the longest AI answer
complete_data = complete_data[complete_data.human_answer.str.len() <= max(complete_data.ai_answer.str.len())]
max(complete_data.human_answer.str.len()), max(complete_data.ai_answer.str.len())

(5046, 5046)
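The effect of this filter is easiest to see on a small example. The frame below is a hypothetical stand-in with the same two column names; the strings are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "human_answer": ["short", "a much longer human answer", "mid-size"],
    "ai_answer": ["tiny", "brief one", "concise"],
})

# Longest AI answer before filtering.
ai_max = toy["ai_answer"].str.len().max()

# Keep only rows whose human answer fits within that length.
trimmed = toy[toy["human_answer"].str.len() <= ai_max]

print(ai_max)
print(trimmed["human_answer"].str.len().max(), trimmed["ai_answer"].str.len().max())
```

Note that dropping rows can also remove the longest AI answer, as it does in this toy frame. A single pass therefore only guarantees that human answers are no longer than the pre-filter AI maximum; the two maxima happen to coincide in the notebook output above.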

## Saving the data
The file `data/data.jsonl` will be used for EDA and model training.

In [34]:
complete_data.to_json("data/data.jsonl", lines=True, orient='records')