# Explore Wizard of Wikipedia Dataset

- What does it take to parse?

- This notebook walks through downloading and parsing the Wizard of Wikipedia subset of the KILT benchmark.
    - (training split: 63,734 lines, 48.9 MiB)

- [Link](https://github.com/facebookresearch/KILT) to the KILT repo

In [None]:
from pathlib import Path
import pandas as pd

wow_path = Path.cwd() / "wow-train-kilt.jsonl"
wow_url = 'http://dl.fbaipublicfiles.com/KILT/wow-train-kilt.jsonl'
wow_url_dev = "http://dl.fbaipublicfiles.com/KILT/wow-dev-kilt.jsonl"

# source: https://github.com/facebookresearch/KILT

!wget -O $wow_path $wow_url
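
The `!wget` call relies on a `wget` binary being available on the PATH, which is not guaranteed (for example on Windows). A minimal sketch of a pure-Python alternative, using `urllib.request.urlretrieve` from the standard library, is shown below. The helper name `download_if_missing` is hypothetical and not part of KILT; it simply skips the download when a non-empty file is already present.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless `dest` already exists and is non-empty."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        # reuse the cached copy instead of downloading again
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest
```

Under these assumptions, `download_if_missing(wow_url, wow_path)` would stand in for the `!wget` line.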


In [51]:
import pandas as pd

df = pd.read_json(wow_path, orient="records", lines=True).convert_dtypes()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63734 entries, 0 to 63733
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      63734 non-null  string
 1   input   63734 non-null  string
 2   output  63734 non-null  object
dtypes: object(1), string(2)
memory usage: 1.5+ MB


In [52]:
print("The prompt is:\n {}\n\n".format(df.loc[0, "input"]))
df.loc[0, "output"][0]

The prompt is:
 I like to watch ice hockey on TV. My favorite team is the Chicago Blackhawks.




{'answer': "The Blackhawks are one of my favorite teams, they've won 6 Stanley Cup Championships since they started in 1926",
 'provenance': [{'wikipedia_id': '73126',
   'title': 'Chicago Blackhawks',
   'start_paragraph_id': 1,
   'start_character': 260,
   'end_paragraph_id': 1,
   'end_character': 333,
   'bleu_score': 1.0,
   'section': 'Section::::Abstract.'}]}
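
The record above suggests that each `output` entry is a dict with an `answer` string and a `provenance` list pointing back to Wikipedia. As a rough sketch of what parsing one JSONL line involves without pandas, the snippet below rebuilds that first record (provenance trimmed to two fields) and pulls out the answer and page titles. The helper name `summarize_record` is made up for illustration.

```python
import json

# One JSONL line shaped like the first record shown above (provenance trimmed).
line = json.dumps({
    "id": "6bc20426-99d6-11ea-8a20-773209e30a7b_0",
    "input": "I like to watch ice hockey on TV. My favorite team is the Chicago Blackhawks.",
    "output": [{
        "answer": "The Blackhawks are one of my favorite teams, they've won 6 "
                  "Stanley Cup Championships since they started in 1926",
        "provenance": [{"wikipedia_id": "73126", "title": "Chicago Blackhawks"}],
    }],
})


def summarize_record(jsonl_line: str) -> dict:
    """Return the id, first answer, and provenance titles of one KILT-style record."""
    record = json.loads(jsonl_line)
    first = record["output"][0]
    return {
        "id": record["id"],
        "answer": first.get("answer"),
        # provenance may be absent, so fall back to an empty list
        "titles": [p["title"] for p in first.get("provenance", [])],
    }


summary = summarize_record(line)
print(summary["titles"])  # ['Chicago Blackhawks']
```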

In [53]:
df["out_lengths"] = df["output"].apply(len)

df["out_lengths"].describe()

count    63734.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: out_lengths, dtype: float64

In [54]:
def extract_answer(outcol_obj):
    """Return the answer text from the first (and only) entry of an output list."""
    out_dict = outcol_obj[0]

    return out_dict["answer"]


df["out_answer"] = df["output"].apply(extract_answer)


df.head()

| | id | input | output | out_lengths | out_answer |
|---|---|---|---|---|---|
| 0 | 6bc20426-99d6-11ea-8a20-773209e30a7b_0 | I like to watch ice hockey on TV. My favorite ... | [{'answer': 'The Blackhawks are one of my favo... | 1 | The Blackhawks are one of my favorite teams, t... |
| 1 | 54ade12e-99d6-11ea-8a20-773209e30a7b_2 | The Viking are sea pirates! I see! Didn't they... | [{'answer': 'They raided and trader across wid... | 1 | They raided and trader across wide areas of Eu... |
| 2 | 5673e5da-99d6-11ea-8a20-773209e30a7b_1 | I love the band The Chainsmokers made up of Al... | [{'answer': 'They're an EDM-pop duo from New Y... | 1 | They're an EDM-pop duo from New York. Their f... |
| 3 | 5592954e-99d6-11ea-8a20-773209e30a7b_0 | I would love to be a surgeon when I grow up. | [{'answer': 'Me too. Performing surgical opera... | 1 | Me too. Performing surgical operations on peop... |
| 4 | 536ab85a-99d6-11ea-8a20-773209e30a7b_2 | what on earth is equestrianism? it refers to t... | [{'answer': 'Pretty much including competitive... | 1 | Pretty much including competitive riding |


In [55]:
# print out a single answer to try and figure out multi-line structure
import pprint as pp
pp.pprint(df.loc[1, "out_answer"])

'They raided and trader across wide areas of Europe.'


In [56]:
# avoid shadowing the built-in `input`
first_input = df.loc[0, "input"]

split_input = first_input.split("\n")
pp.pprint(split_input)

['I like to watch ice hockey on TV. My favorite team is the Chicago '
 'Blackhawks.']


### clean text

The default arguments of `clean` are:

`(text, fix_unicode=True, to_ascii=True, lower=True, normalize_whitespace=True, no_line_breaks=False, strip_lines=True, keep_two_line_breaks=False, no_urls=False, no_emails=False, no_phone_numbers=False, no_numbers=False, no_digits=False, no_currency_symbols=False, no_punct=False, no_emoji=False, replace_with_url="<URL>", replace_with_email="<EMAIL>", replace_with_phone_number="<PHONE>", replace_with_number="<NUMBER>", replace_with_digit="0", replace_with_currency_symbol="<CUR>", replace_with_punct="", lang="en") -> Any`

In [57]:
from cleantext import clean


def clean_resp(ugly_text: str, lower: bool = False):
    """Clean one utterance: drop line breaks, normalize whitespace, mask URLs and emails."""
    clntext = clean(
        ugly_text,
        lower=lower,
        no_line_breaks=True,
        no_urls=True,
        normalize_whitespace=True,
        no_emails=True,
        lang="en",
    )

    return clntext

do_lower = False
do_lower

False

In [58]:
from tqdm.auto import tqdm
import pprint as pp


speaker_id_a = "person alpha" if do_lower else "Person Alpha"
speaker_id_b = "person beta" if do_lower else "Person Beta"
conv_words = []


def append_turns(lines, first_speaker, second_speaker):
    """Append each line as one turn, alternating speakers starting with first_speaker."""
    speakers = (first_speaker, second_speaker)
    for i, resp in enumerate(lines):
        conv_words.append(f"{speakers[i % 2]}:\n")
        conv_words.append(clean_resp(str(resp), lower=do_lower) + "\n")
        conv_words.append("\n")


for index, row in tqdm(df.iterrows(), total=len(df), desc="parsing data"):

    # prompt: conversation history, one turn per line, starting with speaker A
    append_turns(row["input"].split("\n"), speaker_id_a, speaker_id_b)

    # response: one turn per line, starting with speaker B
    append_turns(row["out_answer"].split("\n"), speaker_id_b, speaker_id_a)


pp.pprint(conv_words[:25])

parsing data: 100%|██████████| 63734/63734 [03:10<00:00, 334.52it/s]

['Person Alpha:\n',
 'I like to watch ice hockey on TV. My favorite team is the Chicago '
 'Blackhawks.\n',
 '\n',
 'Person Beta:\n',
 "The Blackhawks are one of my favorite teams, they've won 6 Stanley Cup "
 'Championships since they started in 1926\n',
 '\n',
 'Person Alpha:\n',
 'The Viking are sea pirates!\n',
 '\n',
 'Person Beta:\n',
 "I see! Didn't they speak the Norse language?\n",
 '\n',
 'Person Alpha:\n',
 "What's the Norse language? What country speaks such?\n",
 '\n',
 'Person Beta:\n',
 'The North Germans!\n',
 '\n',
 'Person Alpha:\n',
 'So what do the Vikings do ?are they a cult group?\n',
 '\n',
 'Person Beta:\n',
 'They raided and trader across wide areas of Europe.\n',
 '\n',
 'Person Alpha:\n']
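
The printed list shows that every turn occupies three consecutive entries: a speaker label, the cleaned text, and a blank line. A small sketch of how that flat list could be regrouped into `(speaker, text)` pairs is given below; `group_turns` is a hypothetical helper, and the sample is copied from the output above.

```python
sample = [
    "Person Alpha:\n",
    "The Viking are sea pirates!\n",
    "\n",
    "Person Beta:\n",
    "I see! Didn't they speak the Norse language?\n",
    "\n",
]


def group_turns(lines):
    """Regroup a flat [label, text, blank, ...] list into (speaker, text) pairs."""
    if len(lines) % 3 != 0:
        raise ValueError("expected a multiple of three entries")
    turns = []
    for i in range(0, len(lines), 3):
        speaker, text, blank = lines[i : i + 3]
        if blank != "\n":
            raise ValueError(f"entry {i + 2} should be a blank separator line")
        # strip the trailing ':' and newline from the label, and the newline from the text
        turns.append((speaker.rstrip(":\n"), text.rstrip("\n")))
    return turns


turns = group_turns(sample)
print(turns[0])  # ('Person Alpha', 'The Viking are sea pirates!')
```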





## save & export 

In [60]:
print(f"parsed into {len(conv_words)} lines for a dialogue script")

parsed into 1050198 lines for a dialogue script
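
Since each turn contributes three list entries (label, text, blank line), the reported count can be converted into a number of turns. The quick check below uses only the figures printed in this notebook.

```python
n_lines = 1_050_198    # entries in conv_words, as reported above
n_records = 63_734     # rows in the DataFrame
lines_per_turn = 3     # speaker label, text, blank separator

n_turns, remainder = divmod(n_lines, lines_per_turn)
turns_per_record = round(n_turns / n_records, 2)

print(n_turns, remainder)   # 350066 0
print(turns_per_record)     # 5.49
```

So the script holds roughly 350k turns, or about 5.5 turns per record including the final answer.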


In [59]:
from pathlib import Path

wow_path = Path(wow_path) 
outname = f"ScriptParsed_lower={do_lower}_{wow_path.stem}.txt"

script_path = wow_path.parent / outname

with open(script_path, "w", encoding="utf-8", errors="ignore") as fo:
    fo.writelines(conv_words)


print(f"finished saving parsed text file to: \n {script_path} \n")

finished saving parsed text file to: 
 c:\Users\peter\source\ai-msgbot\notebooks\ScriptParsed_lower=False_wow-train-kilt.txt 

