## Dataset creation

This notebook documents how we derived WoTe from the original Wizard-of-Tasks (WoT) dataset.

### Loading in raw WoT

In [2]:
# get WoT data - https://registry.opendata.aws/wizard-of-tasks/

import os
import subprocess
import pandas as pd

if not os.path.isdir('original_WoT_test'):
    subprocess.run(['aws', 's3', 'cp', 's3://wizard-of-tasks', './original_WoT_test', '--recursive'])

download: s3://wizard-of-tasks/wizard_of_tasks_cooking_v1.0.json to original_WoT_test/wizard_of_tasks_cooking_v1.0.json
download: s3://wizard-of-tasks/wizard_of_tasks_diy_v1.0.json to original_WoT_test/wizard_of_tasks_diy_v1.0.json
download: s3://wizard-of-tasks/README.md to original_WoT_test/README.md


In [3]:
wot_og_df_cooking = pd.read_json('./original_WoT_test/wizard_of_tasks_cooking_v1.0.json', orient='index')
print(len(wot_og_df_cooking))
wot_og_df_diy = pd.read_json('./original_WoT_test/wizard_of_tasks_diy_v1.0.json', orient='index')
print(len(wot_og_df_diy))

# concat both
wot_og_df = pd.concat([wot_og_df_cooking, wot_og_df_diy])
wot_og_df.reset_index(inplace=True)
wot_og_df.rename(columns={'index': 'conversation_id'}, inplace=True)

wot_og_df

272
277


Unnamed: 0,conversation_id,document_url,data_split,turns
0,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,test,[{'text': 'Hi! I love labneh but I've never mi...
1,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,train,[{'text': 'How much cream cheese and other ing...
2,Wizard-of-Task-food-3,https://www.wholefoodsmarket.com/recipes/citru...,train,[{'text': 'Will I be using premade pasta from ...
3,Wizard-of-Task-food-4,https://www.wholefoodsmarket.com/recipes/grill...,train,[{'text': 'Will I be making the tortellini by ...
4,Wizard-of-Task-food-5,https://www.wholefoodsmarket.com/recipes/pasta...,train,[{'text': 'Will I be making my own pasta for t...
...,...,...,...,...
544,Wizard-of-Task-diy-273,https://www.wikihow.com/Prune-Dahlias,train,[{'text': 'What does it mean to prune a plant?...
545,Wizard-of-Task-diy-274,https://www.wikihow.com/Prune-Begonias,train,[{'text': 'Will I need anything besides prunin...
546,Wizard-of-Task-diy-275,https://www.wikihow.com/Test-a-Diode,train,[{'text': 'Would you be able to tell me what a...
547,Wizard-of-Task-diy-276,https://www.wikihow.com/Melt-a-Soap-Bar,train,[{'text': 'Are there tools needed for this pro...


In [4]:
# iterate through each row in wot_og_df and iterate through each turn in the turns column to get all the turns
# append each turn to a list
# create a new dataframe with the list

import numpy as np

turns_list = []
for index, row in wot_og_df.iterrows():
    history = [f"{turn['role']}: {turn['text']}" for turn in row['turns']]
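    # note: the formatted strings are never NA, so this check leaves them unchanged
    # (a turn without text shows up as "<role>: None" in the history)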
    history = ["" if pd.isna(turn) else turn for turn in history]
    for i, turn in enumerate(row['turns']):
        if len(history) != 0 and len(history) >= i:
            if i > 4:
                turn['history'] = " | ".join(history[i-4:i])
            else:
                turn['history'] = " | ".join(history[:i])
        else:
            turn['history'] = ""
        turn['conversation_id'] = row['conversation_id']
        turn['document_url'] = row['document_url']
        turn['domain'] = "food" if "food" in turn['conversation_id'] else "diy"
        turns_list.append(turn)

turns_df = pd.DataFrame(turns_list)

In [5]:
turns_df

Unnamed: 0,text,turn_counter,dangerous_tools,shared_data,intent,real_life_action,relevant,useful,worker_id,previous_worker_id,role,history,conversation_id,document_url,domain,external_urls
0,Hi! I love labneh but I've never mixed it with...,1,[],[],ask_question_ingredients_tools,,yes,yes,111,,student,,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,
1,Here are the ingredients you will need!,2,[],"[2 cups plain Greek yogurt, 1 tablespoon extra...",return_list_ingredients_tools,,yes,yes,121,111.0,teacher,student: Hi! I love labneh but I've never mixe...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,[]
2,After I've chopped all of the herbs and gather...,3,"[1 teaspoon lemon zest, 1 teaspoon chopped fre...",[],request_next_step,I would gather and prepare all of the ingredie...,yes,yes,105,121.0,student,student: Hi! I love labneh but I've never mixe...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,
3,Yes. I've shared the step below,4,[],"[Transfer to a bowl. Add oil, tarragon, basil,...",return_next_step,,yes,yes,129,105.0,teacher,student: Hi! I love labneh but I've never mixe...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,[]
4,The ingredients are now mixed. What should be ...,5,[],[],request_next_step,I will mix the ingredients together.,yes,no,24,129.0,student,student: Hi! I love labneh but I've never mixe...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18072,"Great, is there anything else that I need to do?",29,[],[],request_next_step,,yes,yes,3,16.0,student,student: I've got a few that seem ready to har...,Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,
18073,Now you just wait to enjoy your delicious home...,30,[],[],answer_question_recipe_steps,,yes,yes,112,3.0,teacher,"teacher: To get them out of the ground, just p...",Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,[]
18074,How long does onions take to fully grow before...,31,[],[],ask_question_recipe_steps,wait for onions to fuly grown,yes,yes,9,112.0,student,student: How much of the plant can I eat once ...,Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,
18075,It will take between 60-80 days. Keep an eye o...,32,[],[],return_next_step,,yes,yes,16,9.0,teacher,"teacher: You must not eat the roots, shoots or...",Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,[]
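
To make the history construction concrete, the sketch below applies the same windowing rule to a toy conversation: each turn receives up to the four preceding turns, joined with " | ". The helper name `build_history` and the toy turns are illustrative assumptions and not part of the original notebook.

```python
def build_history(turns, window=4):
    """Return, for each turn, the up-to-`window` preceding turns joined by ' | '."""
    formatted = [f"{t['role']}: {t['text']}" for t in turns]
    return [" | ".join(formatted[max(0, i - window):i]) for i in range(len(turns))]


# toy conversation (hypothetical data)
toy_turns = [
    {"role": "student", "text": "q1"},
    {"role": "teacher", "text": "a1"},
    {"role": "student", "text": "q2"},
    {"role": "teacher", "text": "a2"},
    {"role": "student", "text": "q3"},
    {"role": "teacher", "text": "a3"},
]
histories = build_history(toy_turns)
print(repr(histories[0]))  # the first turn has no history
print(histories[5])        # the four turns before the last one
```

For turns with four or fewer predecessors the window simply starts at the beginning of the conversation, which matches the `history[:i]` branch in the cell above.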


### Filtering process

Step 1: Remove non-question utterances. We keep every turn whose intent contains "ask" (excluding "ask_student_question"), plus every turn whose following answer has the intent "answer_question_recipe_steps".

In [6]:
qa_df = turns_df[(turns_df['intent'].str.contains('ask')) & (turns_df['intent'] != 'ask_student_question')]
print(qa_df['intent'].unique())
qa_indices = qa_df.index.values
# indices of turns without an "ask" intent whose following turn has intent answer_question_recipe_steps
answer_indices = turns_df[(turns_df['intent'] == 'answer_question_recipe_steps')].index.values
missing_qa = [a-1 for a in answer_indices if a-1 not in qa_indices]

qa_indices = np.concatenate((qa_indices, missing_qa))
# chain the calls instead of modifying a slice in place (avoids SettingWithCopyWarning)
qa_df = turns_df.iloc[qa_indices].reset_index().rename(columns={'index': 'turn_id'})
qa_df

['ask_question_ingredients_tools' 'ask_question_recipe_steps']


Unnamed: 0,turn_id,text,turn_counter,dangerous_tools,shared_data,intent,real_life_action,relevant,useful,worker_id,previous_worker_id,role,history,conversation_id,document_url,domain,external_urls
0,0,Hi! I love labneh but I've never mixed it with...,1,[],[],ask_question_ingredients_tools,,yes,yes,111,,student,,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,
1,8,Can I let the ingredients sit for longer to ma...,9,[],[],ask_question_recipe_steps,I would let the ingredients sit to marinate.,yes,no,2,142.0,student,student: The ingredients are now mixed. What s...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,
2,12,Do you think I could freeze this recipe?,13,[],[],ask_question_recipe_steps,I would get out my salt and pepper.,no,no,2,68.0,student,student: Can I let the ingredients sit for lon...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,
3,19,How much cream cheese and other ingredients wi...,1,[],[],ask_question_ingredients_tools,,yes,yes,203,,student,,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,
4,21,What are the few other ingredients?,3,"[1 package cream cheese, softened]",[],ask_question_recipe_steps,Take it out all ingredients.,yes,yes,228,129.0,student,student: How much cream cheese and other ingre...,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4474,17725,What should I do besides caring for it on a we...,51,[],[],request_next_step,ask for next step,yes,yes,49,28.0,student,"student: Ok, to be clear, I shouldn't touch or...",Wizard-of-Task-diy-266,https://www.wikihow.com/Grow-Java-Moss,diy,
4475,17782,"I have pruned just above the leaf node, is the...",21,[],[],request_next_step,I would prune as directed.,yes,yes,3,16.0,student,"student: Thanks, I think I already did that pr...",Wizard-of-Task-diy-269,https://www.wikihow.com/Prune-Honeysuckle,diy,
4476,17838,"I'll keep that in mind going forward, then. Ar...",44,[],[],request_next_step,I will make note to prune my tree yearly.,yes,yes,142,44.0,student,"student: I have pruned as directed, what shoul...",Wizard-of-Task-diy-270,https://www.wikihow.com/Prune-a-Leyland-Cypress,diy,
4477,17912,Gotcha! Any words of advice to share before w...,27,[],[],chitchat,ask for advice,yes,yes,49,62.0,student,student: How much water should I give at each ...,Wizard-of-Task-diy-272,https://www.wikihow.com/Grow-English-Ivy-Indoors,diy,


In [7]:
# answers are the turns that directly follow each selected question
answer_indicies = [i+1 for i in qa_indices]
# chain the calls instead of modifying a slice in place (avoids SettingWithCopyWarning)
answers_df = turns_df[turns_df.index.isin(answer_indicies)].reset_index().rename(columns={'index': 'turn_id'})
answers_df


Unnamed: 0,turn_id,text,turn_counter,dangerous_tools,shared_data,intent,real_life_action,relevant,useful,worker_id,previous_worker_id,role,history,conversation_id,document_url,domain,external_urls
0,1,Here are the ingredients you will need!,2,[],"[2 cups plain Greek yogurt, 1 tablespoon extra...",return_list_ingredients_tools,,yes,yes,121,111.0,teacher,student: Hi! I love labneh but I've never mixe...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,[]
1,9,Only 15 minutes is needed for the flavors to m...,10,[],[],answer_question_recipe_steps,,yes,yes,22,2.0,teacher,teacher: None | student: Once the ingredients ...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,[]
2,13,,14,[],[],answer_question_external_fact,,no,no,132,2.0,teacher,teacher: Only 15 minutes is needed for the fla...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,[recently visited search engines will be added...
3,20,You will need 1 package of cream cheese and a ...,2,[],"[1 package cream cheese, softened, 1/2 cup sm...",answer_question_recipe_steps,,yes,yes,129,203.0,teacher,student: How much cream cheese and other ingre...,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,[]
4,22,I attached them to my last message. They are ...,4,[],[],answer_question_recipe_steps,,yes,yes,121,228.0,teacher,student: How much cream cheese and other ingre...,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4474,18059,One inch is the amount of water. It's a measur...,16,[],[],answer_question_external_fact,,yes,yes,154,124.0,teacher,teacher: Now plant the sprouts 1 inch deep in ...,Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,[]
4475,18063,There is not much you can do as weeds growing ...,20,[],[],answer_question_external_fact,,yes,yes,24,78.0,teacher,teacher: One inch is the amount of water. It's...,Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,[]
4476,18071,"You must not eat the roots, shoots or the skin...",28,[],[],return_next_step,,yes,yes,16,149.0,teacher,teacher: You want to look for signs that the o...,Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,[]
4477,18073,Now you just wait to enjoy your delicious home...,30,[],[],answer_question_recipe_steps,,yes,yes,112,3.0,teacher,"teacher: To get them out of the ground, just p...",Wizard-of-Task-diy-277,https://www.wikihow.com/Plant-Sprouted-Onions,diy,[]


In [8]:
# pair qa_df and answers_df row by row; this relies on both frames listing turns in the same order
qa_pairs_df = pd.concat([qa_df.add_suffix('_question'), answers_df.add_suffix('_answer')], axis=1)
qa_pairs_df.columns

Index(['turn_id_question', 'text_question', 'turn_counter_question',
       'dangerous_tools_question', 'shared_data_question', 'intent_question',
       'real_life_action_question', 'relevant_question', 'useful_question',
       'worker_id_question', 'previous_worker_id_question', 'role_question',
       'history_question', 'conversation_id_question', 'document_url_question',
       'domain_question', 'external_urls_question', 'turn_id_answer',
       'text_answer', 'turn_counter_answer', 'dangerous_tools_answer',
       'shared_data_answer', 'intent_answer', 'real_life_action_answer',
       'relevant_answer', 'useful_answer', 'worker_id_answer',
       'previous_worker_id_answer', 'role_answer', 'history_answer',
       'conversation_id_answer', 'document_url_answer', 'domain_answer',
       'external_urls_answer'],
      dtype='object')
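
Because `pd.concat(..., axis=1)` aligns rows by index label, pairing by position is only meaningful when both frames list turns in the same order. The sketch below shows one order-safe way to pair on toy data, assuming each answer is the turn that directly follows its question. The names `toy_turns` and `pair_questions_with_answers` are hypothetical and not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def pair_questions_with_answers(turns, question_idx):
    """Pair each question turn with the turn that directly follows it."""
    # sort and de-duplicate so questions and answers share one ordering
    question_idx = np.unique(np.asarray(question_idx))
    # skip a question in the very last turn, which has no following answer
    question_idx = question_idx[question_idx + 1 < len(turns)]
    questions = turns.iloc[question_idx].reset_index(drop=True).add_suffix("_question")
    answers = turns.iloc[question_idx + 1].reset_index(drop=True).add_suffix("_answer")
    return pd.concat([questions, answers], axis=1)


# toy turns (hypothetical data); the question indices are deliberately unsorted
toy_turns = pd.DataFrame({"text": ["q0", "a0", "step", "q1", "a1", "q2"]})
pairs = pair_questions_with_answers(toy_turns, [3, 0, 5])
print(pairs)
```

Selecting both sides with the same sorted index array keeps each question next to its own answer, regardless of the order in which the indices were collected.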

In [9]:
qa_pairs_df.drop(columns=[
    'turn_id_answer', 'conversation_id_answer', 'document_url_answer',
    'worker_id_answer', 'worker_id_question', 'turn_id_question', 'turn_counter_question', 'dangerous_tools_question',
    'shared_data_question', 'previous_worker_id_question', 'role_question', 'external_urls_question', 'turn_counter_answer',
    'previous_worker_id_answer', 'role_answer', 'shared_data_answer', 'dangerous_tools_answer', 'history_answer'
    ], inplace=True)
qa_pairs_df = qa_pairs_df.rename(columns={'data_split_question': 'data_split', 'conversation_id_question': 'conversation_id'})

In [10]:
# question_id = <domain>-<conversation number>-<running index of the question within its conversation>
qa_pairs_df['question_count'] = qa_pairs_df.groupby('conversation_id')['conversation_id'].cumcount()
qa_pairs_df['question_id'] = qa_pairs_df['conversation_id'].str.split('Wizard-of-Task-').str[1]
qa_pairs_df['question_id'] = qa_pairs_df['question_id'] + '-' + qa_pairs_df['question_count'].astype(str)
qa_pairs_df

Unnamed: 0,text_question,intent_question,real_life_action_question,relevant_question,useful_question,history_question,conversation_id,document_url_question,domain_question,text_answer,intent_answer,real_life_action_answer,relevant_answer,useful_answer,domain_answer,external_urls_answer,question_count,question_id
0,Hi! I love labneh but I've never mixed it with...,ask_question_ingredients_tools,,yes,yes,,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,Here are the ingredients you will need!,return_list_ingredients_tools,,yes,yes,food,[],0,food-1-0
1,Can I let the ingredients sit for longer to ma...,ask_question_recipe_steps,I would let the ingredients sit to marinate.,yes,no,student: The ingredients are now mixed. What s...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,Only 15 minutes is needed for the flavors to m...,answer_question_recipe_steps,,yes,yes,food,[],1,food-1-1
2,Do you think I could freeze this recipe?,ask_question_recipe_steps,I would get out my salt and pepper.,no,no,student: Can I let the ingredients sit for lon...,Wizard-of-Task-food-1,https://www.wholefoodsmarket.com/recipes/labne...,food,,answer_question_external_fact,,no,no,food,[recently visited search engines will be added...,2,food-1-2
3,How much cream cheese and other ingredients wi...,ask_question_ingredients_tools,,yes,yes,,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,You will need 1 package of cream cheese and a ...,answer_question_recipe_steps,,yes,yes,food,[],0,food-2-0
4,What are the few other ingredients?,ask_question_recipe_steps,Take it out all ingredients.,yes,yes,student: How much cream cheese and other ingre...,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,I attached them to my last message. They are ...,answer_question_recipe_steps,,yes,yes,food,[],1,food-2-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4474,What should I do besides caring for it on a we...,request_next_step,ask for next step,yes,yes,"student: Ok, to be clear, I shouldn't touch or...",Wizard-of-Task-diy-266,https://www.wikihow.com/Grow-Java-Moss,diy,One inch is the amount of water. It's a measur...,answer_question_external_fact,,yes,yes,diy,[],12,diy-266-12
4475,"I have pruned just above the leaf node, is the...",request_next_step,I would prune as directed.,yes,yes,"student: Thanks, I think I already did that pr...",Wizard-of-Task-diy-269,https://www.wikihow.com/Prune-Honeysuckle,diy,There is not much you can do as weeds growing ...,answer_question_external_fact,,yes,yes,diy,[],6,diy-269-6
4476,"I'll keep that in mind going forward, then. Ar...",request_next_step,I will make note to prune my tree yearly.,yes,yes,"student: I have pruned as directed, what shoul...",Wizard-of-Task-diy-270,https://www.wikihow.com/Prune-a-Leyland-Cypress,diy,"You must not eat the roots, shoots or the skin...",return_next_step,,yes,yes,diy,[],15,diy-270-15
4477,Gotcha! Any words of advice to share before w...,chitchat,ask for advice,yes,yes,student: How much water should I give at each ...,Wizard-of-Task-diy-272,https://www.wikihow.com/Grow-English-Ivy-Indoors,diy,Now you just wait to enjoy your delicious home...,answer_question_recipe_steps,,yes,yes,diy,[],9,diy-272-9
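
As a quick illustration of the ID scheme, the sketch below builds `question_id` values for a toy frame using the same `groupby(...).cumcount()` and string-split steps. The toy conversation IDs are made up for the example.

```python
import pandas as pd

# toy conversation IDs (hypothetical data)
toy = pd.DataFrame({"conversation_id": [
    "Wizard-of-Task-food-1", "Wizard-of-Task-food-1", "Wizard-of-Task-diy-7",
]})

# running index of each question within its conversation, starting at 0
toy["question_count"] = toy.groupby("conversation_id").cumcount()

# strip the common prefix and append the running index
toy["question_id"] = (
    toy["conversation_id"].str.split("Wizard-of-Task-").str[1]
    + "-" + toy["question_count"].astype(str)
)
print(toy["question_id"].tolist())
```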


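Remove all question-answer pairs that:
- are not questions (neither the question intent contains "ask" nor the answer intent contains "answer") or whose answer is chitchat
- are external, i.e. answered with an external fact or carrying external URLs
- are not answered with the intent "answer_question_recipe_steps"
- are not marked as both useful and relevant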
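
The criteria above can be sketched as a single boolean mask on a toy frame. This is a minimal illustration that assumes the column names used in this notebook; `filter_internal` and the toy rows are hypothetical, and the notebook itself applies the filters step by step in the cells that follow.

```python
import pandas as pd


def filter_internal(pairs):
    """Keep useful, relevant pairs that are answered from the task document itself."""
    is_question = (
        pairs["intent_question"].str.contains("ask")
        | pairs["intent_answer"].str.contains("answer")
    )
    not_chitchat = ~pairs["intent_answer"].str.contains("chitchat")
    internal = pairs["intent_answer"] == "answer_question_recipe_steps"
    # .str.len() also works on list-valued cells; an empty list means no external URL
    no_external_url = pairs["external_urls_answer"].str.len() == 0
    rated_ok = (pairs["relevant_question"] == "yes") & (pairs["useful_question"] == "yes")
    return pairs[is_question & not_chitchat & internal & no_external_url & rated_ok]


# toy pairs (hypothetical data): only the first row satisfies every criterion
toy_pairs = pd.DataFrame({
    "question_id": ["food-1-0", "food-1-1", "diy-2-0", "diy-2-1"],
    "intent_question": ["ask_question_recipe_steps"] * 4,
    "intent_answer": [
        "answer_question_recipe_steps",
        "answer_question_external_fact",
        "answer_question_recipe_steps",
        "answer_question_recipe_steps",
    ],
    "external_urls_answer": [[], [], ["https://example.com"], []],
    "relevant_question": ["yes", "yes", "yes", "no"],
    "useful_question": ["yes", "yes", "yes", "yes"],
})
kept = filter_internal(toy_pairs)
print(kept["question_id"].tolist())
```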

In [46]:
qa_pairs_df['intent_answer'].unique()

array(['return_list_ingredients_tools', 'answer_question_recipe_steps',
       'answer_question_external_fact', 'return_next_step',
       'ask_student_question', 'chitchat', 'request_next_step',
       'ask_question_recipe_steps', 'ask_question_ingredients_tools',
       'misc', 'stop'], dtype=object)

In [33]:
# keep only rows where intent_question contains 'ask' or intent_answer contains 'answer'
print(len(qa_pairs_df))
filtered_qa_pairs_df = qa_pairs_df[~(~(qa_pairs_df['intent_question'].str.contains('ask')) & ~(qa_pairs_df['intent_answer'].str.contains('answer')))]
print(len(filtered_qa_pairs_df))
# drop all chit chat
filtered_qa_pairs_df = filtered_qa_pairs_df[~filtered_qa_pairs_df['intent_answer'].str.contains('chitchat')]
print(len(filtered_qa_pairs_df))

4479
4382
4351


In [61]:
# split into internal and external questions; a question counts as external if it is
# answered with an external fact or has a non-empty external_urls_answer
internal_questions = filtered_qa_pairs_df[filtered_qa_pairs_df['intent_answer']=='answer_question_recipe_steps']
external_answer = filtered_qa_pairs_df[filtered_qa_pairs_df['intent_answer']=='answer_question_external_fact']
external_link = filtered_qa_pairs_df[filtered_qa_pairs_df['external_urls_answer'].str.len() != 0]
# keep only the internal questions that do not also carry an external link
internal_questions = internal_questions[~internal_questions['question_id'].isin(external_link['question_id'])]

# external-fact answers without an external link (counted separately to avoid double counting)
individuals = external_answer[~external_answer['question_id'].isin(external_link['question_id'])]

print(f'External links: {len(external_link)}')
print(f'External answers: {len(external_answer)}')
print('-----')
print(f'External questions: {len(external_link) + len(individuals)}')
print(f'Internal questions: {len(internal_questions)}')

External links: 610
External answers: 1543
-----
External questions: 1820
Internal questions: 1589


In [62]:
# keep only questions marked as both relevant and useful
internal_questions = internal_questions[(internal_questions['relevant_question']== "yes") & (internal_questions['useful_question']== "yes")]
len(internal_questions)

1337

In [63]:
internal_questions = internal_questions.drop(columns=['relevant_answer', 'useful_answer', 'external_urls_answer', 
    'relevant_question', 'useful_question', 'real_life_action_question', 'real_life_action_answer', 'question_count'])
internal_questions = internal_questions.rename(columns={'text_question': 'question', 'history_question': 'history'})
internal_questions

Unnamed: 0,question,intent_question,history,conversation_id,document_url_question,domain_question,text_answer,intent_answer,domain_answer,question_id
3,How much cream cheese and other ingredients wi...,ask_question_ingredients_tools,,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,You will need 1 package of cream cheese and a ...,answer_question_recipe_steps,food,food-2-0
4,What are the few other ingredients?,ask_question_recipe_steps,student: How much cream cheese and other ingre...,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,I attached them to my last message. They are ...,answer_question_recipe_steps,food,food-2-1
5,Will I be using premade pasta from the box or ...,ask_question_ingredients_tools,,Wizard-of-Task-food-3,https://www.wholefoodsmarket.com/recipes/citru...,food,Now transfer to a bowl and serve with the frui...,answer_question_recipe_steps,food,food-3-0
6,Okay and what other ingredients do I need?,ask_question_ingredients_tools,student: Will I be using premade pasta from th...,Wizard-of-Task-food-3,https://www.wholefoodsmarket.com/recipes/citru...,food,You will be using premade pasta. The recipe c...,answer_question_recipe_steps,food,food-3-1
10,Should the tortellini be fresh or can I use fr...,ask_question_recipe_steps,student: Will I be making the tortellini by ha...,Wizard-of-Task-food-4,https://www.wholefoodsmarket.com/recipes/grill...,food,You could make it by hand but this recipe incl...,answer_question_recipe_steps,food,food-4-1
...,...,...,...,...,...,...,...,...,...,...
4461,Clean your fabric in the washer and dryer.,return_next_step,teacher: You will now remove it and rinse it w...,Wizard-of-Task-diy-250,https://www.wikihow.com/Dye-Tulle,diy,The leads are the cables coming off of the mul...,answer_question_recipe_steps,diy,diy-250-14
4463,Spray cypress and peppermint oils under and ar...,return_next_step,teacher: Spray roach repellent. | student: I ...,Wizard-of-Task-diy-252,https://www.wikihow.com/Keep-Roaches-Away-from...,diy,If one of your diodes don't work we are going ...,answer_question_recipe_steps,diy,diy-252-5
4471,"My verbena is so overgrown, I need to prune. w...",chitchat,,Wizard-of-Task-diy-265,https://www.wikihow.com/Prune-Verbena,diy,If it is heated to long it will ruin the integ...,answer_question_recipe_steps,diy,diy-265-5
4473,"Okay, that makes sense. Alright then, I've pla...",request_next_step,"student: I have cut off pieces as directed, wh...",Wizard-of-Task-diy-266,https://www.wikihow.com/Grow-Java-Moss,diy,You should go ahead and separate all of them. ...,answer_question_recipe_steps,diy,diy-266-11


In [64]:
with open('./data/internal_questions.json', 'w') as out:
    internal_questions.to_json(out, orient='records')

#### Scrape recipe data
You need to be based in the US to run this step.

In [57]:
urls = internal_questions['document_url_question'].unique()
len(list(urls))

499

In [19]:
import re

from dateutil import parser as dparser
from dateutil.parser import parserinfo

def parse_time(time_str):
    class CustomParserInfo(parserinfo):
        HMS = [('h', 'hr', 'hrs', 'hour', 'hours'), ('m', 'min', 'mins', 'minute', 'minutes'),
                ('s', 'second', 'seconds')]

    try:
        parsed_time = dparser.parse(time_str, fuzzy=True, parserinfo=CustomParserInfo())
        parsed_time_min = parsed_time.minute + parsed_time.hour * 60 + parsed_time.second / 60
        return parsed_time_min
    except Exception:
        # dateutil could not parse the string; fall back to a simple keyword-based parse
        minutes = ('m', 'min', 'mins', 'minute', 'minutes')
        hours = ('h', 'hr', 'hrs', 'hour', 'hours')
        seconds = ('s', 'second', 'seconds')

        time_num = re.findall(r"\d+", time_str.strip())
        if time_num != []:
            final_time = 0
            time_int = int(time_num[0])
            for word in time_str.split(" "):
                if word in minutes:
                    final_time = time_int
                elif word in hours:
                    final_time = time_int * 60
                elif word in seconds:
                    final_time = time_int / 60

            return final_time
        else:
            return 0


def get_method_number(soup):
    """ Return how many parallel methods the DIY article contains
    (1 if the article is split into parts or has a single method).
    """
    steps_lists = soup.find_all(class_='steps')
    if not steps_lists:
        # no steps section found; treat the article as a single method
        return 1
    header = steps_lists[0].find('h3')
    if header and "Part" not in header.text:
        # DIY article has methods
        return len(steps_lists)
    # DIY article has parts, or just one method and no parts
    return 1

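def get_steps(soup):
    """ Parse steps. """
    # default to an empty list so the function also returns when no steps are found
    list_of_steps = []
    list_of_steps_list = soup.find_all('ol', class_='steps_list_2')
    method_number = get_method_number(soup)
    if list_of_steps_list != []:
        # if the article offers several alternative methods, keep only the first one
        if (method_number or 1) > 1:
            list_of_steps_list = list_of_steps_list[:1]
        # loop through all parts (each holds a list of steps); note that each
        # iteration overwrites list_of_steps, so the last part's steps are returned
        for part in list_of_steps_list:
            list_of_steps = get_steps_list(soup, part)
    return list_of_steps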

def get_steps_list(soup, steps_list):
    """ Helper function to scrape all steps from list object """
    # Loop over steps object to access associated text, images, or video.
    steps = []
    for step_tag in steps_list.find_all('li'):
        if step_tag.get('id') is not None:
            # Build Step object.
            step_object = step_tag.find(class_='step')
            if step_object:
                # scrape bold step header
                header = step_object.find('b')
                if header:
                    header = header.text

                # scrape all the remaining text after the bold step header
                text = step_object.find_all(text=True)
                if len(text) >= 3:
                    text = text[2]
                    if text[0] == '[':
                        text = ''
                    else:
                        text += ';'
                elif text != []:
                    text = text[0]

                # scrape all step text that is stored in a bulleted list
                step_list_text = step_object.find('ul')
                if step_list_text:
                    for bullet_point in step_list_text.find_all('li'):
                        text += bullet_point.find_all(text=True)[0].strip() + ';'
                steps.append(f'{header}: {text}')
    return steps

In [20]:
pip install beautifulsoup4 requests

Note: you may need to restart the kernel to use updated packages.


In [None]:
import requests
import json

from bs4 import BeautifulSoup

content = {}

for i,url in enumerate(urls):
    recipe = {}

    request = requests.get(url)
    html = request.text
    status = request.status_code

    if 'wholefoodsmarket' in url and status == 200:
        soup = BeautifulSoup(html, 'html.parser')
        schema = soup.find_all('script', attrs={"type": "application/ld+json"})
        schema = json.loads(schema[0].string)
        recipe['title'] = soup.find(class_='w-header-title').text
        recipe['description'] = schema['description']

        ingredients = []
        for r in schema.get('recipeIngredient'):
            ingredients.append(r)
        recipe['ingredients'] = ingredients

        steps = []
        for r in schema.get('recipeInstructions'):
            steps.append(r.get('text'))
        recipe['steps'] = steps

    elif 'wikihow' in url and status == 200:
        soup = BeautifulSoup(html, 'html.parser')
    
        recipe['title'] = soup.find(id='section_0').text
        recipe['description'] = soup.find(id='mf-section-0').text
        recipe['steps'] = get_steps(soup)

    else:
        print(f'Error for : {url}')

    content[url] = recipe

In [73]:
with open('./data/scraped_context.json', 'w') as out:
    json.dump(content, out)

content

Unnamed: 0,document_url,task_data
0,https://www.wholefoodsmarket.com/recipes/cream...,"{'title': 'Cream Cheese and Cashew Dip', 'desc..."
3,https://www.wholefoodsmarket.com/recipes/citru...,"{'title': 'Citrus Bow Tie Pasta', 'description..."
4,https://www.wholefoodsmarket.com/recipes/grill...,{'title': 'Grilled Vegetables with Cheese Tort...
8,https://www.wholefoodsmarket.com/recipes/pasta...,"{'title': 'Pasta Alla Norma', 'description': '..."
11,https://www.wholefoodsmarket.com/recipes/sweet...,"{'title': 'Sweet and Spicy Tuna Rigatoni', 'de..."
...,...,...
814,https://www.wholefoodsmarket.com/recipes/grass...,{'title': 'Grass-Fed Steak with Spring Greens'...
815,https://www.wholefoodsmarket.com/recipes/summe...,{'title': 'Summer Squash and Goat Cheese Pizza...
816,https://www.wholefoodsmarket.com/recipes/cream...,"{'title': 'Creamy Tahini and Broccoli Pasta', ..."
819,https://www.wholefoodsmarket.com/recipes/popco...,{'title': 'Popcorn Shrimp with Clementine-Chip...
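
The merge in the next cell calls `content.rename(...)`, so it expects `content` to be a DataFrame with `document_url` and `task_data` columns rather than the dict built by the scraping loop. The cell that performs this conversion is not shown in this notebook. A minimal sketch of such a conversion, assuming that pages which failed to scrape were stored as empty dicts and should be dropped (`content_to_frame` is a hypothetical helper):

```python
import pandas as pd


def content_to_frame(content):
    """Turn {url: task_data} into a DataFrame, dropping URLs with no scraped data."""
    frame = pd.DataFrame(
        {"document_url": list(content.keys()), "task_data": list(content.values())}
    )
    # an empty dict means the page could not be scraped
    return frame[frame["task_data"].map(bool)]


# toy scrape result (hypothetical data)
toy_content = {
    "https://example.com/a": {"title": "A", "steps": ["do x"]},
    "https://example.com/b": {},
}
toy_frame = content_to_frame(toy_content)
print(toy_frame["document_url"].tolist())
```

Filtering without resetting the index leaves gaps in the row labels, similar to the non-contiguous index in the output above.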


In [85]:
print(len(internal_questions))

content_df = internal_questions.merge(
    content.rename(columns={"document_url": "document_url_question"}),
    on="document_url_question",
)
print(f'Managed to extract context for {len(content_df)} questions')
# save the merged questions together with their scraped context
content_df.to_json('./data/raw_internal_data.json', orient='records')
content_df

1337
Managed to extract context for 1109 questions


Unnamed: 0,question,intent_question,history,conversation_id,document_url_question,domain_question,text_answer,intent_answer,domain_answer,question_id,task_data
0,How much cream cheese and other ingredients wi...,ask_question_ingredients_tools,,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,You will need 1 package of cream cheese and a ...,answer_question_recipe_steps,food,food-2-0,"{'title': 'Cream Cheese and Cashew Dip', 'desc..."
1,What are the few other ingredients?,ask_question_recipe_steps,student: How much cream cheese and other ingre...,Wizard-of-Task-food-2,https://www.wholefoodsmarket.com/recipes/cream...,food,I attached them to my last message. They are ...,answer_question_recipe_steps,food,food-2-1,"{'title': 'Cream Cheese and Cashew Dip', 'desc..."
2,Will I be using premade pasta from the box or ...,ask_question_ingredients_tools,,Wizard-of-Task-food-3,https://www.wholefoodsmarket.com/recipes/citru...,food,Now transfer to a bowl and serve with the frui...,answer_question_recipe_steps,food,food-3-0,"{'title': 'Citrus Bow Tie Pasta', 'description..."
3,Okay and what other ingredients do I need?,ask_question_ingredients_tools,student: Will I be using premade pasta from th...,Wizard-of-Task-food-3,https://www.wholefoodsmarket.com/recipes/citru...,food,You will be using premade pasta. The recipe c...,answer_question_recipe_steps,food,food-3-1,"{'title': 'Citrus Bow Tie Pasta', 'description..."
4,Should the tortellini be fresh or can I use fr...,ask_question_recipe_steps,student: Will I be making the tortellini by ha...,Wizard-of-Task-food-4,https://www.wholefoodsmarket.com/recipes/grill...,food,You could make it by hand but this recipe incl...,answer_question_recipe_steps,food,food-4-1,{'title': 'Grilled Vegetables with Cheese Tort...
...,...,...,...,...,...,...,...,...,...,...,...
1104,Lower your heat to medium-low and cook until t...,return_next_step,teacher: You need to add the broth and pasta a...,Wizard-of-Task-food-264,https://www.wholefoodsmarket.com/recipes/baby-...,food,"Yes, you want to do your best to remove as muc...",answer_question_recipe_steps,diy,food-264-6,"{'title': 'Baby Risotto', 'description': 'Made..."
1105,"I have polished each panel, what should I do now?",request_next_step,"student: I'm only wanting to do the body, is t...",Wizard-of-Task-diy-75,https://www.wikihow.com/Polish-a-Car,diy,Potted means it is grown in a container and no...,answer_question_recipe_steps,diy,diy-75-9,"{'title': 'How to Polish a Car', 'description'..."
1106,How long should I monitor the cucumber vines f...,request_next_step,"student: I have wrapped the vine tendrils, wha...",Wizard-of-Task-diy-143,https://www.wikihow.com/Trellis-Cucumbers,diy,Sure. I have shared the next step with you. Ta...,answer_question_recipe_steps,diy,diy-143-5,"{'title': 'How to Trellis Cucumbers', 'descrip..."
1107,"Black mold is so scary, all the issues it can ...",chitchat,,Wizard-of-Task-diy-217,https://www.wikihow.com/Identify-Black-Mold,diy,We want to water our soil until saturated. Thi...,answer_question_recipe_steps,diy,diy-217-5,"{'title': 'How to Identify Black Mold', 'descr..."


### ANNOTATION STEP

We then annotate the remaining questions:
- adding the extractive span that answers the question
- adding a taxonomy to classify questions more granularly

We also remove pairs with inconsistent labels, pairs that require common-sense or external knowledge, and pairs that cannot be answered extractively.
This results in 827 final questions.
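
The last filter, dropping pairs that cannot be answered extractively, can be approximated programmatically. The sketch below is an assumption and not the annotation pipeline used for WoTe: the helper `is_extractive` and its whitespace/case normalization are hypothetical, and it treats a multi-line `extracted_answer` as several spans that must each appear in the document text.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_extractive(extracted_answer: str, document_text: str) -> bool:
    """Return True if every non-empty line of the answer occurs verbatim in the document.

    Hypothetical helper: the real annotation criteria may differ.
    """
    doc = _normalize(document_text)
    spans = [_normalize(s) for s in extracted_answer.split("\n")]
    spans = [s for s in spans if s]
    return bool(spans) and all(s in doc for s in spans)


# toy example (made-up document text, not taken from the dataset)
doc = "Blend until smooth. Transfer to a bowl and serve with fruit."
print(is_extractive("transfer to a bowl and serve", doc))
print(is_extractive("bake for 20 minutes", doc))
```

A check like this only flags candidates; pairs with inconsistent labels or pairs needing external knowledge still require manual review.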

In [11]:
import pandas as pd

# data to annotate - 1109 questions
raw_df = pd.read_json('./data/raw_internal_data.json')
print(len(raw_df))

1109

ANNOTATION PIPELINE HERE

In [54]:
# data that has already been annotated
annotated_df = pd.read_json('./data/annotated_data.json', orient='records')

In [55]:
annotated_df.head()

Unnamed: 0,conversation_id,question,domain,document_url,original_answer,history,task_data,extracted_answer,question_type
0,Wizard-of-Task-food-2,How much cream cheese and other ingredients wi...,cooking,https://www.wholefoodsmarket.com/recipes/cream...,You will need 1 package of cream cheese and a ...,,"{'title': 'Cream Cheese and Cashew Dip', 'desc...","1 (8.0-ounce) package cream cheese, softened\n...",['listing']
1,Wizard-of-Task-food-2,What are the few other ingredients?,cooking,https://www.wholefoodsmarket.com/recipes/cream...,I attached them to my last message. They are ...,student: How much cream cheese and other ingre...,"{'title': 'Cream Cheese and Cashew Dip', 'desc...",1/2 cup smooth cashew butter\n2 tablespoons ag...,['history']
2,Wizard-of-Task-food-2,"That's been blended, what's next?",cooking,https://www.wholefoodsmarket.com/recipes/cream...,Now transfer to a bowl and serve with the frui...,student: What are the few other ingredients? |...,"{'title': 'Cream Cheese and Cashew Dip', 'desc...",transfer to a bowl and serve,"['navigation', 'history']"
3,Wizard-of-Task-food-3,Will I be using premade pasta from the box or ...,cooking,https://www.wholefoodsmarket.com/recipes/citru...,You will be using premade pasta. The recipe c...,,"{'title': 'Citrus Bow Tie Pasta', 'description...",cook pasta according to package directions,['complex']
4,Wizard-of-Task-food-4,Will I be making the tortellini by hand or sho...,cooking,https://www.wholefoodsmarket.com/recipes/grill...,You could make it by hand but this recipe incl...,,{'title': 'Grilled Vegetables with Cheese Tort...,cook tortellini according to directions on pac...,['complex']


### Creating test, train, validation splits

In [3]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [56]:
from sklearn.model_selection import train_test_split

df_conversations = annotated_df.groupby('conversation_id').count().reset_index()[['conversation_id','question']].rename(columns={'question':'count'})
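# manual override so that stratified splitting works properly
# (presumably this row's original count occurs only once, and stratify needs at least two members per class)
df_conversations.loc[32, 'count'] = 6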
df_conversations


Unnamed: 0,conversation_id,count
0,Wizard-of-Task-diy-1,2
1,Wizard-of-Task-diy-10,1
2,Wizard-of-Task-diy-100,1
3,Wizard-of-Task-diy-102,3
4,Wizard-of-Task-diy-104,4
...,...,...
399,Wizard-of-Task-food-93,2
400,Wizard-of-Task-food-94,2
401,Wizard-of-Task-food-95,2
402,Wizard-of-Task-food-98,1
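
A likely reason for the manual override of row 32 is that `train_test_split` with `stratify` raises a `ValueError` when any class has fewer than two members. The sketch below is not part of the original notebook: `rare_strata` is a hypothetical helper and the toy counts are made up. It lists the count values that occur too rarely to be stratified, so such rows can be spotted before splitting.

```python
from collections import Counter


def rare_strata(labels, min_members=2):
    """Return the label values that occur fewer than `min_members` times."""
    freq = Counter(labels)
    return sorted(value for value, n in freq.items() if n < min_members)


# toy questions-per-conversation counts (illustrative only)
toy_counts = [1, 1, 2, 2, 3, 3, 7]
print(rare_strata(toy_counts))  # [7]
```

On the notebook's data this could be called as `rare_strata(df_conversations['count'])`; any value it returns would need to be merged into another stratum before a stratified split.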


In [58]:
conversations = list(df_conversations['conversation_id'])
print(len(conversations))
count = list(df_conversations['count'])
print(len(count))

# hold out 50% of the conversations for test, stratified by questions per conversation
conversation_train, conversation_test, count_train, _ = train_test_split(
    conversations, count, test_size=0.5, random_state=30, stratify=count)

# 40% of the remaining conversations (20% overall) go to validation
conversation_train, conversation_val, _, _ = train_test_split(
    conversation_train, count_train, test_size=0.4, random_state=42, stratify=count_train)

# every question inherits the split of its conversation
split_data = annotated_df
split_data['data_split'] = split_data['conversation_id'].apply(
    lambda x: 'train' if x in conversation_train else 'test' if x in conversation_test else 'validation')
split_data.groupby('data_split').count().reset_index()


404
404


Unnamed: 0,data_split,conversation_id,question,domain,document_url,original_answer,history,task_data,extracted_answer,question_type
0,test,412,412,412,412,412,375,412,412,412
1,train,248,248,248,248,248,225,248,248,248
2,validation,167,167,167,167,167,154,167,167,167
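
Because the split is assigned per conversation, no conversation should end up in more than one split. A minimal sanity check, assuming records shaped like the rows of `split_data`, could look like the sketch below; `conversations_in_multiple_splits` is a hypothetical helper and the toy records are illustrative only.

```python
from collections import defaultdict


def conversations_in_multiple_splits(records):
    """Map each conversation_id that appears in more than one split to its splits."""
    seen = defaultdict(set)
    for row in records:
        seen[row["conversation_id"]].add(row["data_split"])
    return {cid: sorted(splits) for cid, splits in seen.items() if len(splits) > 1}


# toy records (illustrative only)
toy = [
    {"conversation_id": "Wizard-of-Task-food-2", "data_split": "train"},
    {"conversation_id": "Wizard-of-Task-food-2", "data_split": "train"},
    {"conversation_id": "Wizard-of-Task-diy-1", "data_split": "test"},
]
print(conversations_in_multiple_splits(toy))  # {}
```

In the notebook, `conversations_in_multiple_splits(split_data.to_dict('records'))` would be expected to return an empty dict.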


In [62]:
split_data.to_json('./data/final_split_data.json', orient='records')

In [59]:
split_data['data_split'].value_counts()

data_split
test          412
train         248
validation    167
Name: count, dtype: int64
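
The two `train_test_split` calls operate on conversations: 50% go to test, and 40% of the remainder (20% overall) go to validation, leaving 30% for train. Since the split is stratified by questions per conversation, the question-level shares should land close to those targets. The short calculation below, which is not part of the original notebook, checks this using the counts printed above.

```python
# question counts per split, copied from the value_counts() output above
split_sizes = {"test": 412, "train": 248, "validation": 167}

total = sum(split_sizes.values())
shares = {name: round(n / total, 3) for name, n in split_sizes.items()}

print(total)   # 827
print(shares)  # {'test': 0.498, 'train': 0.3, 'validation': 0.202}
```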

In [61]:
train_df = split_data[split_data['data_split'] == 'train']
print(len(train_df))
train_df.to_json('./data/train.json', orient='records')

val_df = split_data[split_data['data_split'] == 'validation']
print(len(val_df))
val_df.to_json('./data/validation.json', orient='records')

test_df = split_data[split_data['data_split'] == 'test']
print(len(test_df))
test_df.to_json('./data/test.json', orient='records')

248
167
412
