# Data Preparation for QnA Maker
In this notebook, we use a subset of the [Stack Exchange network](https://archive.org/details/stackexchange) question data, which includes original questions tagged as 'JavaScript', their duplicate questions, and their answers. Here, we provide the steps to prepare the data for training, tuning, and testing a [QnA Maker](https://www.qnamaker.ai) model that matches a new question with an existing original question. The input files are read from a `data` directory, and the data files produced are stored in a `qnamaker_data` directory for ease of reference and to keep them separate from the training script.

The data preparation steps are
- [import libraries and define parameters](#import),
- [ingest the data](#ingest),
- [cleanse the data](#cleanse),
- [prepare the train, tune, and test datasets](#prepare), and
- [save the datasets.](#save)

## Imports and parameters <a id='import'></a>

In [None]:
import os
import pandas as pd
import csv

In [None]:
def qna_metadata(x):
    """Build the QnA Maker metadata string for a row from its AnswerId."""
    metadata = "AnswerId:" + str(x.AnswerId)
    return metadata
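
To illustrate what `qna_metadata` produces, here is a minimal sketch on a made-up two-row frame. The IDs are hypothetical and do not come from the dataset; the helper is repeated only so the snippet runs on its own.

```python
import pandas as pd


def qna_metadata(x):
    # Same helper as in the notebook, repeated so the sketch is self-contained.
    return "AnswerId:" + str(x.AnswerId)


# Hypothetical rows; the real values come from the Stack Exchange data.
toy = pd.DataFrame({"Id": [1, 2], "AnswerId": [10, 20]})

# axis=1 passes each row to the helper, which reads the row's AnswerId.
toy["Metadata"] = toy.apply(qna_metadata, axis=1)
print(toy["Metadata"].tolist())
```

Each row gets a single `key:value` string, which is the only metadata the later cells attach.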

## Data ingestion <a id='ingest'></a>
Next, we define the paths to the original questions and their answers, and to the train, tune, and test duplicate question sets.

In [None]:
data_path = "data"
questions_path = os.path.join(data_path, "questions.tsv")
answers_path = os.path.join(data_path, "answers.tsv")
dupes_train_path = os.path.join(data_path, "dupes_train.tsv")
dupes_tune_path = os.path.join(data_path, "dupes_tune.tsv")
dupes_test_path = os.path.join(data_path, "dupes_test.tsv")

Load the datasets.

In [None]:
questions = pd.read_csv(questions_path, sep='\t', encoding='latin1')
answers = pd.read_csv(answers_path, sep='\t', encoding='latin1')
dupes_train = pd.read_csv(dupes_train_path, sep='\t', encoding='latin1')
dupes_tune = pd.read_csv(dupes_tune_path, sep='\t', encoding='latin1')
dupes_test = pd.read_csv(dupes_test_path, sep='\t', encoding='latin1')
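
The later cells select the `Id`, `AnswerId`, and `Text0` columns from the question and duplicate files, so it can help to check for them right after loading. The following is a minimal sketch of such a check; the two-row TSV string is a hypothetical stand-in for `questions.tsv`, and the check itself is not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for questions.tsv (tab-separated).
sample_tsv = (
    "Id\tAnswerId\tText0\n"
    "1\t10\thow do i clone an object\n"
    "2\t20\twhat is a closure\n"
)

sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Columns that the rest of the notebook selects from questions and duplicates.
required = {"Id", "AnswerId", "Text0"}
missing = required - set(sample.columns)
assert not missing, "missing columns: {}".format(sorted(missing))
print(sample.shape)
```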

## Create data files for QnA Maker
Name the columns of the data to be used.

In [None]:
QnA_columns = ["Id", "AnswerId", "Question", "Answer"]

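Set the maximum lengths of question and answer texts, following the [limits set by QnA Maker](https://docs.microsoft.com/en-us/azure/cognitive-services/qnamaker/limits#knowledge-base-content-limits), and the maximum number of duplicate questions to keep per answer.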

In [None]:
Question_length = 1000
Answer_length = 25000
Number_of_duplicate_questions = 100
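
These limits are applied further down with pandas string slicing (`.str[:n]`). As a minimal sketch of what that slicing does, the snippet below uses a deliberately tiny limit and made-up strings; neither is taken from the notebook's data.

```python
import pandas as pd

# Tiny limit for illustration only; the notebook uses 1000 for questions.
question_length = 5

texts = pd.Series(["short", "a much longer question"])

# .str[:n] keeps at most the first n characters of each string.
truncated = texts.str[:question_length]
print(truncated.tolist())  # ['short', 'a muc']
```

Strings at or under the limit pass through unchanged, so the slice is safe to apply to every row.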

Merge the original questions and the training and tuning duplicate questions with their answers. Add the answer ID as metadata, truncate the texts to the length limits, and keep at most `Number_of_duplicate_questions + 1` rows per answer.

In [None]:
QnA = (questions[["Id", "AnswerId", "Text0"]]
       .merge(answers[["Id", "Text0"]], left_on="AnswerId", right_on="Id")
       .drop(["Id_y"], axis=1))
TnA = (dupes_train[["Id", "AnswerId", "Text0"]]
       .merge(answers[["Id", "Text0"]], left_on="AnswerId", right_on="Id")
       .drop(["Id_y"], axis=1))
UnA = (dupes_tune[["Id", "AnswerId", "Text0"]].reset_index(drop=True)
       .merge(answers[["Id", "Text0"]], left_on="AnswerId", right_on="Id")
       .drop(["Id_y"], axis=1))
QnA_train = pd.concat([QnA, TnA, UnA])

QnA_train.columns = QnA_columns

QnA_train["Metadata"] = QnA_train.apply(qna_metadata, axis=1)

QnA_train = QnA_train[["Question", "Answer", "Metadata", "Id", "AnswerId"]]

QnA_train.Question = QnA_train.Question.str[:Question_length]
QnA_train.Answer = QnA_train.Answer.str[:Answer_length]

QnA_train = QnA_train.groupby("Answer").apply(
    lambda x: x.iloc[:Number_of_duplicate_questions+1]).reset_index(drop=True)
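
The merge above relies on pandas suffixing the overlapping column names, which is why `Id_y` is the column that gets dropped. The sketch below walks through the same steps on hypothetical toy frames. Note one deliberate swap: it caps rows per answer with `groupby(...).head(...)` instead of the notebook's `apply`, which selects the same rows per group but keeps them in their original row order.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's inputs.
questions = pd.DataFrame({
    "Id": [1, 2, 3],
    "AnswerId": [10, 10, 20],
    "Text0": ["q one", "q one again", "q two"],
})
answers = pd.DataFrame({"Id": [10, 20], "Text0": ["a ten", "a twenty"]})

# Overlapping names (Id, Text0) get _x (left) and _y (right) suffixes.
merged = questions.merge(answers, left_on="AnswerId", right_on="Id")
merged_columns = list(merged.columns)
print(merged_columns)

# Id_y duplicates AnswerId, so it is dropped before renaming.
qna = merged.drop(["Id_y"], axis=1)
qna.columns = ["Id", "AnswerId", "Question", "Answer"]

# Keep at most this many rows per distinct answer text (illustrative value).
max_rows_per_answer = 1
capped = qna.groupby("Answer").head(max_rows_per_answer).reset_index(drop=True)
print(capped)
```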

Create a version of the training data with only the AnswerId in the Answer column.

In [None]:
QnA_noanswer = QnA_train.copy()
QnA_noanswer["Answer"] = "AnswerId is " + QnA_noanswer.AnswerId.astype(str)

Merge the test duplicate questions with their answers in the same way as the training data.

In [None]:
QnA_test = (dupes_test[["Id", "AnswerId", "Text0"]].reset_index(drop=True)
            .merge(answers[["Id", "Text0"]], left_on="AnswerId", right_on="Id")
            .drop(["Id_y"], axis=1))

QnA_test.columns = QnA_columns

QnA_test["Metadata"] = QnA_test.apply(qna_metadata, axis=1)

QnA_test = QnA_test[["Question", "Answer", "Metadata", "Id", "AnswerId"]]

QnA_test.Question = QnA_test.Question.str[:Question_length]
QnA_test.Answer = QnA_test.Answer.str[:Answer_length]

QnA_test = QnA_test.groupby("Answer").apply(
    lambda x: x.iloc[:Number_of_duplicate_questions+1]).reset_index(drop=True)

Merge the training and tuning duplicate questions with their original questions, so that the Answer column holds the original question text instead of the answer text. Add the answer ID as metadata.

In [None]:
TnQ = (dupes_train[["Id", "AnswerId", "Text0"]]
       .merge(questions[["AnswerId", "Text0"]], on="AnswerId"))
UnQ = (dupes_tune[["Id", "AnswerId", "Text0"]].reset_index(drop=True)
       .merge(questions[["AnswerId", "Text0"]], on="AnswerId"))
QnQ_train = pd.concat([TnQ, UnQ])

QnQ_train.columns = QnA_columns

QnQ_train["Metadata"] = QnQ_train.apply(qna_metadata, axis=1)

QnQ_train = QnQ_train[["Question", "Answer", "Metadata", "Id", "AnswerId"]]

QnQ_train.Question = QnQ_train.Question.str[:Question_length]
QnQ_train.Answer = QnQ_train.Answer.str[:Answer_length]

QnQ_train = QnQ_train.groupby("Answer").apply(
    lambda x: x.iloc[:Number_of_duplicate_questions+1]).reset_index(drop=True)
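
Because this merge joins on the shared `AnswerId` column, the two `Text0` columns are the only ones that receive suffixes, and after renaming the `Answer` column contains the original question. A minimal sketch with hypothetical toy rows:

```python
import pandas as pd

# Hypothetical duplicates that both point at the same answer.
dupes = pd.DataFrame({
    "Id": [101, 102],
    "AnswerId": [10, 10],
    "Text0": ["dupe a", "dupe b"],
})
# Hypothetical original question for that answer.
questions = pd.DataFrame({"AnswerId": [10], "Text0": ["original question"]})

# Joining on AnswerId yields Id, AnswerId, Text0_x (duplicate), Text0_y (original).
pairs = dupes.merge(questions, on="AnswerId")
pair_columns = list(pairs.columns)

# Same renaming as in the notebook: the original question lands in "Answer".
pairs.columns = ["Id", "AnswerId", "Question", "Answer"]
print(pairs)
```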

Do the same for the test duplicate questions.

In [None]:
QnQ_test = (dupes_test[["Id", "AnswerId", "Text0"]].reset_index(drop=True)
            .merge(questions[["AnswerId", "Text0"]], on="AnswerId"))

QnQ_test.columns = QnA_columns

QnQ_test["Metadata"] = QnQ_test.apply(qna_metadata, axis=1)

QnQ_test = QnQ_test[["Question", "Answer", "Metadata", "Id", "AnswerId"]]

QnQ_test.Question = QnQ_test.Question.str[:Question_length]
QnQ_test.Answer = QnQ_test.Answer.str[:Answer_length]

QnQ_test = QnQ_test.groupby("Answer").apply(
    lambda x: x.iloc[:Number_of_duplicate_questions+1]).reset_index(drop=True)

Sort all dataframes by AnswerId.

In [None]:
QnA_train.sort_values("AnswerId", inplace=True)
QnA_test.sort_values("AnswerId", inplace=True)
QnA_noanswer.sort_values("AnswerId", inplace=True)
QnQ_train.sort_values("AnswerId", inplace=True)
QnQ_test.sort_values("AnswerId", inplace=True)

Write out the files to the `qnamaker_data` directory.

In [None]:
qnamaker_data_path = "qnamaker_data"
os.makedirs(qnamaker_data_path, exist_ok=True)

QnA_train_path = os.path.join(qnamaker_data_path, 'QnA_train.tsv')
print('Writing {:,} rows to {}'.format(QnA_train.shape[0], QnA_train_path))
QnA_train.to_csv(QnA_train_path, sep='\t', header=True, index=False, quoting=csv.QUOTE_NONE)

QnA_test_path = os.path.join(qnamaker_data_path, 'QnA_test.tsv')
print('Writing {:,} rows to {}'.format(QnA_test.shape[0], QnA_test_path))
QnA_test.to_csv(QnA_test_path, sep='\t', header=True, index=False, quoting=csv.QUOTE_NONE)

QnA_noanswer_path = os.path.join(qnamaker_data_path, 'QnA_noanswer.tsv')
print('Writing {:,} rows to {}'.format(QnA_noanswer.shape[0], QnA_noanswer_path))
QnA_noanswer.to_csv(QnA_noanswer_path, sep='\t', header=True, index=False, quoting=csv.QUOTE_NONE)

QnQ_train_path = os.path.join(qnamaker_data_path, 'QnQ_train.tsv')
print('Writing {:,} rows to {}'.format(QnQ_train.shape[0], QnQ_train_path))
QnQ_train.to_csv(QnQ_train_path, sep='\t', header=True, index=False, quoting=csv.QUOTE_NONE)

QnQ_test_path = os.path.join(qnamaker_data_path, 'QnQ_test.tsv')
print('Writing {:,} rows to {}'.format(QnQ_test.shape[0], QnQ_test_path))
QnQ_test.to_csv(QnQ_test_path, sep='\t', header=True, index=False, quoting=csv.QUOTE_NONE)
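
Since the files are written with `quoting=csv.QUOTE_NONE`, a reader would likely want to pass the same option so that quote characters in the text are not interpreted. The following is a minimal round-trip sketch under that assumption; the one-row frame and the `QnA_sample.tsv` file name are hypothetical, and a temporary directory stands in for `qnamaker_data`.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame with the same columns as the written files.
frame = pd.DataFrame({
    "Question": ["why does this change inside a callback"],
    "Answer": ["it depends on the call site"],
    "Metadata": ["AnswerId:10"],
    "Id": [1],
    "AnswerId": [10],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "QnA_sample.tsv")
    # Same writer options as the notebook's to_csv calls.
    frame.to_csv(path, sep="\t", header=True, index=False, quoting=csv.QUOTE_NONE)
    # Read back with matching separator and quoting.
    restored = pd.read_csv(path, sep="\t", quoting=csv.QUOTE_NONE)

print(restored.shape)
```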