In [11]:
import pandas as pd
INPUT_PATH = 'data/squad.biomedical.train.json'
OUTPUT_PATH = 'data/sampled_data.json'

Randomly sample 5000 question-answer records from the MSMarco QnA dataset, using a fixed seed (random_state=42) so the sample is reproducible. The input file contains only questions from the biomedical domain.

In [12]:
df = pd.read_json(INPUT_PATH)
sampled_df = df.sample(n=5000, random_state=42)

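Next, we filter out samples whose answer consists of fewer than 5 tokens, where tokens are obtained by splitting the answer text on single spaces. Only the first answer of the first question in each record is checked. We flag the remaining samples, whose answers have at least 5 tokens, as subject-verb-object (SVO) sentences. This removes overly simple answers, such as "yes".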

In [13]:
# Flag records whose first answer has at least 5 space-separated tokens.
sampled_df["SVO"] = False
for i, row in sampled_df.iterrows():
    answer = row['data']['paragraphs'][0]['qas'][0]['answers'][0]['text']

    if len(answer.split(" ")) >= 5:
        sampled_df.at[i, "SVO"] = True

# Keep only the flagged records.
sampled_df = sampled_df[sampled_df["SVO"]]
print(len(sampled_df))

3029
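
The token-count rule used above can be isolated into small helper functions, which makes it easier to check on toy inputs. The following is a minimal sketch in plain Python, not the notebook's own code: the helper names `first_answer_text` and `has_min_tokens` and the two toy records are hypothetical and only mirror the nested SQuAD-style layout accessed in the cell above.

```python
def first_answer_text(record):
    """Return the text of the first answer in a SQuAD-style record."""
    return record["data"]["paragraphs"][0]["qas"][0]["answers"][0]["text"]


def has_min_tokens(text, min_tokens=5):
    """True if `text` splits into at least `min_tokens` pieces on single spaces."""
    return len(text.split(" ")) >= min_tokens


# Hypothetical toy records, for illustration only (not taken from the dataset).
records = [
    {"data": {"paragraphs": [{"qas": [{"answers": [{"text": "yes"}]}]}]}},
    {"data": {"paragraphs": [{"qas": [{"answers": [
        {"text": "Insulin lowers the blood glucose level"}
    ]}]}]}},
]

# Keep only records whose first answer passes the token-count rule.
kept = [r for r in records if has_min_tokens(first_answer_text(r))]
print(len(kept))
```

Note that `split(" ")` splits on every single space, so consecutive spaces produce empty strings that count as tokens; `split()` without arguments would collapse runs of whitespace instead. The sketch keeps `split(" ")` to match the cell above.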


From these SVO samples, we then select the first 2000, drop the helper SVO column, and use the result as our dataset.

In [14]:
sampled_df = sampled_df.iloc[:2000].drop("SVO", axis=1)
print(len(sampled_df))

2000


In [16]:
sampled_df.to_json(OUTPUT_PATH, orient="records", lines=True)
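
With orient="records" and lines=True, the export writes one JSON object per line. As a sanity check, the file could be read back and its records counted. The sketch below uses only the standard library; the helper `count_jsonl_records` and the temporary demo file are hypothetical and stand in for the real output file.

```python
import json
import os
import tempfile


def count_jsonl_records(path):
    """Count the non-empty lines in a JSON Lines file, parsing each one."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises ValueError if the line is not valid JSON
                count += 1
    return count


# Demonstration on a small temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n{"id": 2}\n')
    n_records = count_jsonl_records(demo_path)

print(n_records)
```

Applied to the notebook's file, `count_jsonl_records(OUTPUT_PATH)` would be expected to match the 2000 printed above if the export succeeded.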