# Load explore and PreProcess ComplexWebQuestions

In this notebook we load and explore the ComplexWebQuestions dataset.
We also preprocess this dataset based on this exploration towards feeding it into a question answering model. Our model is a MAC attention network and thus it's highly dependant on our questions and KB (snippets) respresentations.

In [1]:
import os
import gc

import pandas as pd
import numpy as np

#!pip3 install --user tqdm

In [2]:
DATA_PATH = '/home/u14303/NLPproject/Data/'
FNAME_TRAIN_QUESTION = 'ComplexWebQuestions_train.json'
FNAME_DEV_QUESTION = 'ComplexWebQuestions_dev.json'
FNAME_TEST_QUESTION = 'ComplexWebQuestions_test.json'

FNAME_TRAIN_SNIPPETS = 'web_snippets_train.json'
FNAME_DEV_SNIPPETS = 'web_snippets_dev.json'
FNAME_TEST_SNIPPETS = 'web_snippets_test.json'

FNAME_ANSWERES = 'possible_answers_web_snippets_dev.json'
EMBEDDING_DIM = 300
FNAME_EMBEDDINGS = 'glove.6B.{}d.txt'.format(EMBEDDING_DIM)
FNAME_TOKEN_EMBEDDING_MAT = 'embedding_matrix.dat'

FNAME_TOKENIZER = 'tokenizer.pickle'
FNAME_INVERSE_MAP = 'inverse_word_token_map.pickle'

### Explore Questions and Answers

In [3]:
print("Loading training questions...")
with open(os.path.join(DATA_PATH, FNAME_TRAIN_QUESTION), "r") as f:
    train_questions = pd.read_json(f)
print("Done. Read {} questions.".format(train_questions.shape[0]))

Loading training questions...
Done. Read 27639 questions.


In [4]:
print("Loading dev data...")
with open(os.path.join(DATA_PATH, FNAME_DEV_QUESTION), "r") as f:
    dev_questions = pd.read_json(f)
print("Done. Read {} questions.".format(dev_questions.shape[0]))

Loading dev data...
Done. Read 3519 questions.


In [5]:
print("Loading test questions...")
with open(os.path.join(DATA_PATH, FNAME_TEST_QUESTION), "r") as f:
    test_questions = pd.read_json(f)
print("Done. Read {} questions.".format(test_questions.shape[0]))

Loading test questions...
Done. Read 3531 questions.


In [6]:
train_cols = set(train_questions.columns.values)
dev_cols = set(dev_questions.columns.values)
test_cols = set(test_questions.columns.values)
print("Symmetric diff between train_cols and dev_cols is:", train_cols.symmetric_difference(dev_cols))
print("Symmetric diff between dev_cols and test_cols is:", dev_cols.symmetric_difference(test_cols))

Symmetric diff between train_cols and dev_cols is: set()
Symmetric diff between dev_cols and test_cols is: {'composition_answer', 'answers'}


Let's explore the questions a little

In [7]:
train_questions.head(3)

Unnamed: 0,ID,answers,composition_answer,compositionality_type,created,machine_question,question,sparql,webqsp_ID,webqsp_question
0,WebQTrn-3513_7c4117891abf63781b892537979054c6,"[{'aliases': ['Washington D.C.', 'Washington',...",george washington university,composition,2018-02-13T02:07:47,what state is the the education institution ha...,What state is home to the university that is r...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-3513,what state is the george washington university in
1,WebQTrn-2136_d95da5fb8a16d81fe56cd4ce00843254,"[{'aliases': ['Super Bowl 2013', 'Super Bowl 4...",baltimore ravens,composition,2018-02-12T23:27:26,what year did the sports team with the fight s...,What year did the team with Baltimore Fight So...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2136,what year did baltimore ravens win the superbowl
2,WebQTrn-2360_a40a0d50b9a1006e2d254705d46345ea,[{'aliases': ['University of Florida Gators Me...,,conjunction,2018-02-12T23:49:23,what football teams did emmitt smith play for ...,"Which school with the fight song ""The Orange a...",PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2360,what football teams did emmitt smith play for


In [8]:
print("There are {} records in train_questions, but only {} unique train questions and {} unique train question IDs.".format(train_questions.shape[0],
                                                                                                                      train_questions["question"].nunique(),
                                                                                                                      train_questions["ID"].nunique()))
print("There are {} records in dev_questions, but only {} unique dev questions and {} unique dev question IDs.".format(dev_questions.shape[0],
                                                                                                                      dev_questions["question"].nunique(),
                                                                                                                      dev_questions["ID"].nunique()))
print("There are {} records in test_questions, but only {} unique test questions and {} unique test question IDs.".format(test_questions.shape[0],
                                                                                                                         test_questions["question"].nunique(),
                                                                                                                         test_questions["ID"].nunique()))

There are 27639 records in train_questions, but only 27628 unique train questions and 27639 unique train question IDs.
There are 3519 records in dev_questions, but only 3519 unique dev questions and 3519 unique dev question IDs.
There are 3531 records in test_questions, but only 3531 unique test questions and 3531 unique test question IDs.


In [10]:
all_questions = pd.concat([train_questions, dev_questions, test_questions], axis = 0, sort=True, ignore_index=True)
all_questions["question_len"] = all_questions.apply(lambda r : len(r["question"].split(' ')), axis=1)

display(all_questions[["question_len"]].describe())
print("95% precent of the questions are shorter than {} words and 99% are shorter than {} words.".format(
    all_questions["question_len"].quantile(0.95),
    all_questions["question_len"].quantile(0.99)))

del all_questions
gc.collect()

Unnamed: 0,question_len
count,34689.0
mean,13.388999
std,3.170039
min,3.0
25%,11.0
50%,13.0
75%,15.0
max,30.0


95% precent of the questions are shorter than 19.0 words and 99% are shorter than 21.0 words.


35

Let's take a look at the answers (relevant for tain and dev only).

Extract answers and aliases to seperate column.

In [11]:
train_questions["answer"] = train_questions.apply(lambda r:r["answers"][0]["answer"], axis = 1)
display(train_questions["answer"][:5])
train_questions["merged_answers"] = train_questions.apply(lambda r:[r["answers"][0]["answer"]] + r["answers"][0]["aliases"], axis = 1)
display(train_questions["merged_answers"][:5])

dev_questions["answer"] = dev_questions.apply(lambda r:r["answers"][0]["answer"], axis = 1)
display(dev_questions["answer"][:5])
dev_questions["merged_answers"] = dev_questions.apply(lambda r:[r["answers"][0]["answer"]] + r["answers"][0]["aliases"], axis = 1)
display(dev_questions["merged_answers"][:5])

0           Washington, D.C.
1           Super Bowl XLVII
2    Florida Gators football
3                 Gridlock'd
4                   Portugal
Name: answer, dtype: object

0    [Washington, D.C., Washington D.C., Washington...
1    [Super Bowl XLVII, Super Bowl 2013, Super Bowl...
2    [Florida Gators football, University of Florid...
3                                         [Gridlock'd]
4                      [Portugal, Portuguese Republic]
Name: merged_answers, dtype: object

0                    Muhammad Zia-ul-Haq
1    Vanderbilt University Mr. Commodore
2                                 Brazil
3                          John Harbaugh
4                             Jeff Faine
Name: answer, dtype: object

0                                [Muhammad Zia-ul-Haq]
1                [Vanderbilt University Mr. Commodore]
2    [Brazil, Brazilian , República Federativa do B...
3                                      [John Harbaugh]
4                  [Jeff Faine, Braylon Jamel Edwards]
Name: merged_answers, dtype: object

Delete all questions with no answer (surprisingly, there are some!).

In [12]:
for df in [train_questions, dev_questions]:
    df.dropna(subset=['answer'], inplace=True)

In [13]:
train_dev_questions = pd.concat([train_questions, dev_questions], axis = 0, sort=True, ignore_index=True)

In [14]:
train_dev_questions["num_answers"] = train_dev_questions.apply(lambda r:len(r["merged_answers"]), axis = 1)
display(train_dev_questions[["merged_answers", "num_answers"]].head(5))
display(train_dev_questions[["num_answers"]].describe())
print("95% precent of the questions have less than {} answers 99% have less than {}.".format(
    train_dev_questions["num_answers"].quantile(0.95) + 1,
    train_dev_questions["num_answers"].quantile(0.99) + 1))

Unnamed: 0,merged_answers,num_answers
0,"[Washington, D.C., Washington D.C., Washington...",9
1,"[Super Bowl XLVII, Super Bowl 2013, Super Bowl...",3
2,"[Florida Gators football, University of Florid...",22
3,[Gridlock'd],1
4,"[Portugal, Portuguese Republic]",2


Unnamed: 0,num_answers
count,31144.0
mean,3.943809
std,4.282474
min,1.0
25%,1.0
50%,3.0
75%,5.0
max,61.0


95% precent of the questions have less than 12.0 answers 99% have less than 22.0.


Unique answers?

In [15]:
print("There are {} train+dev question records, {} unique questions, but only {} unique explicit answers.".format(
    train_dev_questions.shape[0],
    train_dev_questions["ID"].nunique(),
    train_dev_questions["answer"].nunique()))

There are 31144 train+dev question records, 31144 unique questions, but only 3805 unique explicit answers.


In [16]:
import itertools

train_unique_ans = set(itertools.chain.from_iterable(train_questions["merged_answers"].tolist()))
dev_unique_ans = set(itertools.chain.from_iterable(dev_questions["merged_answers"].tolist()))
mutual_answers = train_unique_ans.intersection(dev_unique_ans)

In [17]:
print("There are {} train questions and {} dev questions.\n{} unique train answers (incl. aliases).\n{} unique dev answers.\n{} mutual unique answers.".format(
    train_questions.shape[0],
    dev_questions.shape[0],
    len(train_unique_ans),
    len(dev_unique_ans),
    len(mutual_answers)))

There are 27625 train questions and 3519 dev questions.
10952 unique train answers (incl. aliases).
2380 unique dev answers.
1288 mutual unique answers.


Ceck what are the lengths of the answers.

In [18]:
def calc_ans_lens(answers):
    res = []
    for ans in answers:
        res.append(len(ans.split()))
    return res

In [79]:
n = 4

train_dev_questions["max_ans_len"] = train_dev_questions.apply(lambda r : max(calc_ans_lens(r["merged_answers"])), axis=1)
train_dev_questions["avg_ans_len"] = train_dev_questions.apply(lambda r : np.average(calc_ans_lens(r["merged_answers"])), axis=1)
train_dev_questions["has_ans_with_less_than_n"] = train_dev_questions.apply(
    lambda r : np.any(n > np.array(calc_ans_lens(r["merged_answers"]))), axis=1)
display(train_dev_questions[["max_ans_len", "avg_ans_len"]].describe())

print("95% precent of the questions are shorter than {} words and 99% are shorter than {} words.".format(
    train_dev_questions["max_ans_len"].quantile(0.95),
    train_dev_questions["max_ans_len"].quantile(0.99)))
print("{}% of the questions have an answer shorter than {} words.".format(
    100 * train_dev_questions[train_dev_questions["has_ans_with_less_than_n"] == True].shape[0] /
    train_dev_questions["has_ans_with_less_than_n"].shape[0], n)

NameError: name 'train_dev_questions' is not defined

In [20]:
del train_dev_questions
gc.collect()

89

We continue with questions for which we have an answer longer than n=4

In [21]:
def filter_answers_longer_than_n(answers):
    res = []
    for ans in answers:
        if len(ans.split()) < n:
            res.append(ans)
    if len(res) == 0:
        return np.nan
    return res

In [22]:
train_questions["answers_shorter_than_n"] = train_questions.apply(lambda r : filter_answers_longer_than_n(r["merged_answers"]), axis=1)
dev_questions["answers_shorter_than_n"] = dev_questions.apply(lambda r : filter_answers_longer_than_n(r["merged_answers"]), axis=1)

Also delete all questions with no short answers.

In [23]:
for df in [train_questions, dev_questions]:
    df.dropna(subset=['answers_shorter_than_n'], inplace=True)

### Explore Snippets

In [24]:
print("Loading train snippets...")
with open(os.path.join(DATA_PATH, FNAME_TRAIN_SNIPPETS), "r") as f:
    train_snippets = pd.read_json(f)
print("Done. Read snippets for {} records".format(len(train_snippets)))

Loading train snippets...
Done. Read snippets for 107743 records


In [25]:
print("Loading dev snippets...")
with open(os.path.join(DATA_PATH, FNAME_DEV_SNIPPETS), "r") as f:
    dev_snippets = pd.read_json(f)
print("Done. Read snippets for {} records".format(len(dev_snippets)))

Loading dev snippets...
Done. Read snippets for 14446 records


In [26]:
print("Loading test snippets...")
with open(os.path.join(DATA_PATH, FNAME_TEST_SNIPPETS), "r") as f:
    test_snippets = pd.read_json(f)
print("Done. Read snippets for {} records".format(len(test_snippets)))

Loading test snippets...
Done. Read snippets for 14338 records


In [27]:
train_snippets.head(5)

Unnamed: 0,question,question_ID,split_source,web_query,web_snippets
0,"""Billie Jean""'s composer was born where?",WebQTest-1796_293ff6fdbda2c0c1c40a4c6ac6cef62c,"[noisy supervision split, ptrnet split]","""Billie Jean""'s composer was born where","[{'snippet': '""Billie Jean"" is a 1982 song by ..."
1,"""Just Like Starting Over""'s composer plays wha...",WebQTest-84_355ab1c8f8cb13542c4f8e137e429342,"[noisy supervision split, ptrnet split]","""Just Like Starting Over""'s composer plays wha...","[{'snippet': '""(Just Like) Starting Over"" is a..."
2,"""The Fourth Hand"" author also wrote what other...",WebQTest-449_d12a8796bf4c6c36c9d2d9ee1186ba20,"[noisy supervision split, ptrnet split]","""The Fourth Hand"" author also wrote what other...",[{'snippet': 'The Fourth Hand is a 2001 novel ...
3,"""The salvation of the world is in man's suffer...",WebQTrn-1971_f1bb78925b7b947a056544194a413bb5,"[noisy supervision split, ptrnet split]","""The salvation of the world is in man's suffer...",[{'snippet': 'In the light of what He has done...
4,"""What is the educational background of the Whi...",WebQTrn-2805_8c4cd2a8dd5064dcd1e88389796138c7,"[noisy supervision split, ptrnet split]","""What is the educational background of the Whi...",[{'snippet': 'الانتقال إلى History‏ - Under th...


In [28]:
print("There are {} records in train_snippets, but only {} unique train questions and {} unique train question IDs.".format(train_snippets.shape[0],
                                                                                                                      train_snippets["question"].nunique(),
                                                                                                                      train_snippets["question_ID"].nunique()))
print("There are {} records in dev_snippets, but only {} unique dev questions and {} unique dev question IDs.".format(dev_snippets.shape[0],
                                                                                                                      dev_snippets["question"].nunique(),
                                                                                                                      dev_snippets["question_ID"].nunique()))
print("There are {} records in test_snippets, but only {} unique test questions and {} unique test question IDs.".format(test_snippets.shape[0],
                                                                                                                         test_snippets["question"].nunique(),
                                                                                                                         test_snippets["question_ID"].nunique()))

There are 107743 records in train_snippets, but only 27627 unique train questions and 27633 unique train question IDs.
There are 14446 records in dev_snippets, but only 3519 unique dev questions and 3519 unique dev question IDs.
There are 14338 records in test_snippets, but only 3531 unique test questions and 3531 unique test question IDs.


We can see that there are ~4 records for each unique question ID in each snippet dataset. For example:

In [29]:
train_snippets["question_ID"] = train_snippets["question_ID"].astype('str')
idd = "WebQTest-1796_293ff6fdbda2c0c1c40a4c6ac6cef62c"
train_snippets[train_snippets["question_ID"] == idd]

Unnamed: 0,question,question_ID,split_source,web_query,web_snippets
0,"""Billie Jean""'s composer was born where?",WebQTest-1796_293ff6fdbda2c0c1c40a4c6ac6cef62c,"[noisy supervision split, ptrnet split]","""Billie Jean""'s composer was born where","[{'snippet': '""Billie Jean"" is a 1982 song by ..."
69529,"""Billie Jean""'s composer was born where?",WebQTest-1796_293ff6fdbda2c0c1c40a4c6ac6cef62c,[noisy supervision split],`` Billie Jean '' 's composer,"[{'snippet': 'According to Inside the Hits, th..."
69530,"""Billie Jean""'s composer was born where?",WebQTest-1796_293ff6fdbda2c0c1c40a4c6ac6cef62c,[ptrnet split],`` Billie Jean '''s composer,"[{'snippet': '""Billie Jean"" is a song by Ameri..."
82532,"""Billie Jean""'s composer was born where?",WebQTest-1796_293ff6fdbda2c0c1c40a4c6ac6cef62c,"[noisy supervision split, ptrnet split]",michael jackson was born where,[{'snippet': 'Singer-songwriter Michael Jackso...


This apparently happens because for each question several different web queries were made as above.

Explore snippet's lengthes for padding purposes.

In [30]:
all_snippets = pd.concat([test_snippets, dev_snippets, train_snippets], axis = 0, sort=True, ignore_index=True)

In [31]:
all_snippets["web_snippets_avg_len"] = all_snippets.apply(
    lambda r: np.average([0]+[len(snippet['snippet'].split(' ')) for snippet in r["web_snippets"]]), axis=1)

In [32]:
all_snippets["web_snippets_max_len"] = all_snippets.apply(
    lambda r: np.max([0]+[len(snippet['snippet'].split(' ')) for snippet in r["web_snippets"]]), axis=1)

In [33]:
all_snippets["web_snippets_095_precentile_len"] = all_snippets.apply(
    lambda r: np.percentile([0]+[len(snippet['snippet'].split(' ')) for snippet in r["web_snippets"]], 95), axis=1)

In [34]:
display(all_snippets[["web_snippets_avg_len", "web_snippets_max_len", "web_snippets_095_precentile_len"]].describe())

Unnamed: 0,web_snippets_avg_len,web_snippets_max_len,web_snippets_095_precentile_len
count,136527.0,136527.0,136527.0
mean,50.457214,68.734485,62.915131
std,10.439846,14.102876,12.09219
min,0.0,0.0,0.0
25%,46.98,63.0,59.0
50%,49.910891,65.0,61.0
75%,52.287129,69.0,63.0
max,98.5,389.0,222.9


It's safe to conclude that we can pad/clamp each snippet to the size of 100 words.

In [35]:
del all_snippets
gc.collect()

35

Merge all titles and snippets for each record into one text string, lower case it and save it.

We'll also try keeping the top k snippets of each record in the snippets dataframe.

In [36]:
delim = '. '
k=10
for df in [test_snippets, dev_snippets, train_snippets]:
#     df["merged_titles_and_snippets"] = df.apply(
#         lambda r: [snippet['title'] + delim + snippet['snippet'] for snippet in r["web_snippets"]],
#         axis=1)
#     df["merged_titles_and_snippets_len"] = df.apply(lambda r: len(r["merged_titles_and_snippets"].split(' ')), axis = 1)
    df["merged_k_reduced_titles_and_snippets"] = df.apply(
        lambda r: [snippet['title'] + delim + snippet['snippet'] for snippet in r["web_snippets"][:k]],
        axis=1)
    df["merged_k_reduced_titles_and_snippets_len"] = df.apply(
        lambda r: np.sum([len(snip.split(' ')) for snip in r["merged_k_reduced_titles_and_snippets"]]), axis = 1)

In [37]:
all_snippets = pd.concat([test_snippets, dev_snippets, train_snippets], axis = 0, sort=True, ignore_index=True)
# display(all_snippets[["merged_titles_and_snippets_len"]].describe())
# print("95% precent of the snippets are shorter than {} and 99% are shorter than {}".format(
#     all_snippets["merged_titles_and_snippets_len"].quantile(0.95),
#     all_snippets["merged_titles_and_snippets_len"].quantile(0.99)))

display(all_snippets[["merged_k_reduced_titles_and_snippets_len"]].describe())
print("95% precent of the snippets are shorter than {} and 99% are shorter than {}".format(
    all_snippets["merged_k_reduced_titles_and_snippets_len"].quantile(0.95),
    all_snippets["merged_k_reduced_titles_and_snippets_len"].quantile(0.99)))
del all_snippets
gc.collect()

Unnamed: 0,merged_k_reduced_titles_and_snippets_len
count,136527.0
mean,578.672131
std,121.94342
min,0.0
25%,531.0
50%,579.0
75%,620.0
max,1086.0


95% precent of the snippets are shorter than 824.0 and 99% are shorter than 894.0


49

We also want to reduce (merge) all snippets by ID.

The resulting dataframe will have a unique question_ID per record, and all relevant snippets in the same record (and also the top k * num_of_web_queries_per_question snippets per each question).

In [38]:
train_snippets_merged_by_ID = pd.DataFrame(train_questions[["ID", "question"]])
dev_snippets_merged_by_ID = pd.DataFrame(dev_questions[["ID", "question"]])
test_snippets_merged_by_ID = pd.DataFrame(test_questions[["ID", "question"]])

In [39]:
def merge_snippets(all_rows_per_id):
    return list(itertools.chain.from_iterable(all_rows_per_id["merged_k_reduced_titles_and_snippets"].tolist()))

for merged, unmerged in [(train_snippets_merged_by_ID, train_snippets), (dev_snippets_merged_by_ID, dev_snippets), (test_snippets_merged_by_ID, test_snippets)]:
    merged["merged_top_k_snippets"] = merged.apply(
             lambda r: merge_snippets(unmerged[unmerged["question_ID"] == r["ID"]]), axis = 1)

In [40]:
for df in [train_snippets_merged_by_ID, dev_snippets_merged_by_ID, test_snippets_merged_by_ID]:
    df["num_of_merged_snippets"] =  train_snippets_merged_by_ID.apply(lambda r: len(r["merged_top_k_snippets"]), axis = 1)
    print(df["num_of_merged_snippets"].describe())

count    26115.000000
mean        38.395443
std         11.037102
min          0.000000
25%         30.000000
50%         40.000000
75%         50.000000
max         50.000000
Name: num_of_merged_snippets, dtype: float64
count    3149.000000
mean       39.150206
std        10.194674
min         0.000000
25%        30.000000
50%        40.000000
75%        50.000000
max        50.000000
Name: num_of_merged_snippets, dtype: float64
count    3401.000000
mean       39.207880
std        10.198703
min         0.000000
25%        30.000000
50%        40.000000
75%        50.000000
max        50.000000
Name: num_of_merged_snippets, dtype: float64


We need to pad with empty snippets that all question will have 50 anippets.

In [41]:
# pad_to = 150
pad_to = 50
def pad_with_empty(snippets):
    pad = ["aldays ultra" for i in range(pad_to - len(snippets))]
    return snippets + pad

for df in [train_snippets_merged_by_ID, dev_snippets_merged_by_ID, test_snippets_merged_by_ID]:
    df["merged_top_k_snippets"] = df.apply(
             lambda r: pad_with_empty(r["merged_top_k_snippets"]), axis = 1)

In [42]:
for df in [train_snippets_merged_by_ID, dev_snippets_merged_by_ID, test_snippets_merged_by_ID]:
    df["num_of_merged_snippets"] =  train_snippets_merged_by_ID.apply(lambda r: len(r["merged_top_k_snippets"]), axis = 1)
    print(df["num_of_merged_snippets"].describe())

count    26115.0
mean        50.0
std          0.0
min         50.0
25%         50.0
50%         50.0
75%         50.0
max         50.0
Name: num_of_merged_snippets, dtype: float64
count    3149.0
mean       50.0
std         0.0
min        50.0
25%        50.0
50%        50.0
75%        50.0
max        50.0
Name: num_of_merged_snippets, dtype: float64
count    3401.0
mean       50.0
std         0.0
min        50.0
25%        50.0
50%        50.0
75%        50.0
max        50.0
Name: num_of_merged_snippets, dtype: float64


Padded successfully.

#### Merge questions_answers and snippets dataframes into one

Questions for which at least one of their snippets contains its explicit answer.

In [44]:
train_snippets.rename(index=str, columns={"question_ID": "ID"}, inplace=True)
train_questions_snippets = train_questions.merge(train_snippets_merged_by_ID, on='ID', how='left')

dev_snippets.rename(index=str, columns={"question_ID": "ID"}, inplace=True)
dev_questions_snippets = dev_questions.merge(dev_snippets_merged_by_ID, on='ID', how='left')

test_snippets.rename(index=str, columns={"question_ID": "ID"}, inplace=True)
test_questions_snippets = test_questions.merge(test_snippets_merged_by_ID, on='ID', how='left')

In [45]:
train_questions_snippets.rename(index=str, columns={"question_x": "question"}, inplace=True)
dev_questions_snippets.rename(index=str, columns={"question_x": "question"}, inplace=True)
test_questions_snippets.rename(index=str, columns={"question_x": "question"}, inplace=True)

Lowercase all text (questions, answers and snippets) towards following checks and preprocessing.

In [46]:
for df in [train_questions_snippets, dev_questions_snippets, test_questions_snippets]:
    df["question"] = df.apply(lambda r: str(r["question"]).lower(), axis=1)
    df["merged_top_k_snippets"] = df.apply(lambda r: [str(snip).lower() for snip in r["merged_top_k_snippets"]], axis=1)
    
# Lowercase answers.
for df in [train_questions_snippets, dev_questions_snippets]:
    df["answer"] = df.apply(lambda r: str(r["answer"]).lower(), axis=1)
    df["answers_shorter_than_n"] = df.apply(lambda r: [str(ans).lower() for ans in r["answers_shorter_than_n"]], axis=1)

In [47]:
train_questions_snippets.head(10)

Unnamed: 0,ID,answers,composition_answer,compositionality_type,created,machine_question,question,sparql,webqsp_ID,webqsp_question,answer,merged_answers,answers_shorter_than_n,question_y,merged_top_k_snippets,num_of_merged_snippets
0,WebQTrn-3513_7c4117891abf63781b892537979054c6,"[{'aliases': ['Washington D.C.', 'Washington',...",george washington university,composition,2018-02-13T02:07:47,what state is the the education institution ha...,what state is home to the university that is r...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-3513,what state is the george washington university in,"washington, d.c.","[Washington, D.C., Washington D.C., Washington...","[washington, d.c., washington d.c., washington...",What state is home to the university that is r...,[gwsports.com mike lonergan bio :: george wash...,50
1,WebQTrn-2136_d95da5fb8a16d81fe56cd4ce00843254,"[{'aliases': ['Super Bowl 2013', 'Super Bowl 4...",baltimore ravens,composition,2018-02-12T23:27:26,what year did the sports team with the fight s...,what year did the team with baltimore fight so...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2136,what year did baltimore ravens win the superbowl,super bowl xlvii,"[Super Bowl XLVII, Super Bowl 2013, Super Bowl...","[super bowl xlvii, super bowl 2013, super bowl...",What year did the team with Baltimore Fight So...,"[baltimore ravens fight song - ""the baltimore ...",50
2,WebQTrn-2360_a40a0d50b9a1006e2d254705d46345ea,[{'aliases': ['University of Florida Gators Me...,,conjunction,2018-02-12T23:49:23,what football teams did emmitt smith play for ...,"which school with the fight song ""the orange a...",PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2360,what football teams did emmitt smith play for,florida gators football,"[Florida Gators football, University of Florid...","[florida gators football, gators mens football...","Which school with the fight song ""The Orange a...",[florida gators football - the orange and the ...,50
3,WebQTest-415_b6ad66a3f1f515d0688c346e16d202e6,"[{'aliases': [], 'answer': 'Gridlock'd', 'answ...",,conjunction,2018-02-13T03:27:47,what movies did tupac star in and is the film ...,what movie with film character named mr. woods...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTest-415,what movies did tupac star in,gridlock'd,[Gridlock'd],[gridlock'd],What movie with film character named Mr. Woods...,[the bermuda depths - wikipedia. the bermuda d...,50
4,WebQTest-341_0f5ccea2d11b712eda64ebf2f6aeb1ee,"[{'aliases': ['Portuguese Republic'], 'answer'...",,conjunction,2018-02-13T03:20:47,what countries share borders with spain and is...,what country sharing borders with spain does t...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTest-341,what countries share borders with spain,portugal,"[Portugal, Portuguese Republic]","[portugal, portuguese republic]",What country sharing borders with Spain does t...,[which countries border spain? - quora. 07‏/03...,50
5,WebQTrn-3239_5e43a21c08076aabc07f1a6fd6ae6bb9,"[{'aliases': ['The King'], 'answer': 'Barry Sw...",dallas cowboys,composition,2018-02-13T01:30:16,who coached the the sports team owner is Jerry...,who was the 1996 coach of the team owned by je...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-3239,who coached the dallas cowboys in 1996,barry switzer,"[Barry Switzer, The King]","[barry switzer, the king]",Who was the 1996 coach of the team owned by Je...,[1996 dallas cowboys season - wikipedia. the 1...,50
6,WebQTest-1085_1cffc5c553afc1802970e8a6064cac32,"[{'aliases': ['Nick', 'Nicholas Joseph ""Nick"" ...",demi lovato,composition,2018-02-13T04:40:27,who was the artist had a concert tour named De...,who dated the performer who headlined the conc...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTest-1085,who was demi lovato dating,nicholas braun,"[Nicholas Braun, Nick, Nicholas Joseph ""Nick"" ...","[nicholas braun, nick, nicholas joseph braun, ...",Who dated the performer who headlined the conc...,[who has demi lovato dated? | popsugar latina....,50
7,WebQTrn-2581_321be15dae483ed949b74cf01e259708,"[{'aliases': ['Manchester United', 'Manchester...",,conjunction,2018-02-13T00:14:23,who has tim howard played for and the sports t...,which team owned by malcolm glazer has tim how...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2581,who has tim howard played for,manchester united f.c.,"[Manchester United F.C., Manchester United, Ma...","[manchester united f.c., manchester united, ma...",Which team owned by Malcolm Glazer has Tim How...,[malcolm glazer - wikipedia. malcolm irving gl...,50
8,WebQTrn-2773_7f835727d3cecbcb6c6b48db1fe147b9,"[{'aliases': ['Businesswoman', 'Business perso...",henry ford,composition,2018-02-13T00:40:08,what was the person education institution is D...,what business titles was the most famous alumn...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2773,what was henry ford best known for,businessperson,"[Businessperson, Businesswoman, Business perso...","[businessperson, businesswoman, business perso...",What business titles was the most famous alumn...,[these are the most famous university of michi...,50
9,WebQTrn-2518_1ef15e22372df70baf01b72850deb14d,"[{'aliases': ['Korkuluk', 'Yero', 'Scarecrow',...",michael jackson,composition,2018-02-13T00:05:04,who did the artist had a concert tour named HI...,the artist from the history world tour concert...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2518,who did michael jackson play in the wiz,scarecrow,"[Scarecrow, Korkuluk, Yero, Scarecrow, Fiyero]","[scarecrow, korkuluk, yero, scarecrow, fiyero]",The artist from the HIStory World Tour concert...,[history world tour - wikipedia. the history w...,50


Look for answers in the snippets.

In [48]:
for df in [train_questions_snippets, dev_questions_snippets]:
    df["have_explicit_ans_top_k"] = df.apply(lambda r: r["answer"] in ' '.join(r["merged_top_k_snippets"]),
        axis=1)
    df["have_any_ans_top_k"] = df.apply(
        lambda r: np.any([(answer in ' '.join(r["merged_top_k_snippets"])) for answer in r["answers_shorter_than_n"]]),
        axis=1)

snippets_questions_answers = pd.concat([train_questions_snippets, dev_questions_snippets], axis = 0, sort=True, ignore_index=True)

In [49]:
print("For k={} top snippets and answers shorter than n={}".format(k,n))
display(snippets_questions_answers["have_explicit_ans_top_k"].describe())
have_explicit_ans_top_k = np.count_nonzero(snippets_questions_answers["have_explicit_ans_top_k"])
print("{}% of top {} merged snippets ({} snippets) have the explicit answer as a substring in it.".format(have_explicit_ans_top_k*100/snippets_questions_answers.shape[0],
                                                                                                          k,
                                                                                                          have_explicit_ans_top_k))

display(snippets_questions_answers["have_any_ans_top_k"].describe())
have_any_ans_top_k = np.count_nonzero(snippets_questions_answers["have_any_ans_top_k"])
print("{}% of top {} merged snippets ({} snippets) have some answer alias as a substring in it.".format(have_any_ans_top_k*100/snippets_questions_answers.shape[0],
                                                                                                        k,
                                                                                                        have_any_ans_top_k))

For k=10 top snippets and answers shorter than n=4


count     29384
unique        2
top        True
freq      18235
Name: have_explicit_ans_top_k, dtype: object

62.05758235774571% of top 10 merged snippets (18235 snippets) have the explicit answer as a substring in it.


count     29384
unique        2
top        True
freq      20652
Name: have_any_ans_top_k, dtype: object

70.28314729104274% of top 10 merged snippets (20652 snippets) have some answer alias as a substring in it.


In [52]:
del snippets_questions_answers
gc.collect()

98

For training the network we've decided to concentrate only on questions that have some of their answer's aliases as an explicit substring in their top-k snippets.

In [53]:
existing_ans_train_df = train_questions_snippets.loc[train_questions_snippets["have_any_ans_top_k"] == True]
existing_ans_dev_df = dev_questions_snippets.loc[dev_questions_snippets["have_any_ans_top_k"] == True]

In [54]:
print("{}% of test merged snippets ({} snippets) have some answer alias as a substring in it.".format(existing_ans_train_df.shape[0]*100/train_questions_snippets.shape[0], 
                                                                                                     existing_ans_train_df.shape[0]))
print("{}% of dev merged snippets ({} snippets) have some answer alias as a substring in it.".format(existing_ans_dev_df.shape[0]*100/dev_questions_snippets.shape[0], 
                                                                                                     existing_ans_dev_df.shape[0]))

69.96745165613632% of test merged snippets (18272 snippets) have some answer alias as a substring in it.
72.8051391862955% of dev merged snippets (2380 snippets) have some answer alias as a substring in it.


In [55]:
existing_ans_dev_df.head(10)

Unnamed: 0,ID,answers,composition_answer,compositionality_type,created,machine_question,question,sparql,webqsp_ID,webqsp_question,answer,merged_answers,answers_shorter_than_n,question_y,merged_top_k_snippets,num_of_merged_snippets,have_explicit_ans_top_k,have_any_ans_top_k
1,WebQTest-823_ed31f9dd431831dbd32a06b958c7c97c,"[{'aliases': ['Brazilian ', 'República Federat...",,conjunction,2018-02-13T04:12:26,what does bolivia border and is the country th...,what country borders bolivia and contains goiã¡s?,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTest-823,what does bolivia border,brazil,"[Brazil, Brazilian , República Federativa do B...","[brazil, brazilian , brasil]",What country borders Bolivia and contains GoiÃ¡s?,[category:borders of bolivia - wikipedia. page...,50.0,True,True
2,WebQTrn-2181_8d86dc5e03446f0e50fd69bc06ae0658,"[{'aliases': [], 'answer': 'John Harbaugh', 'a...",ravens,composition,2018-02-12T23:32:26,who is the the sports team owner is Steve Bisc...,who is the coach of the team owned by steve bi...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2181,who is the ravens coach,john harbaugh,[John Harbaugh],[john harbaugh],Who is the coach of the team owned by Steve Bi...,[list of baltimore ravens head coaches - wikip...,50.0,True,True
3,WebQTrn-1447_f1ea2e60c0bd4311ef47cc0d7f6c0dd8,"[{'aliases': ['Braylon Jamel Edwards'], 'answe...",,comparative,2018-02-12T22:18:56,who did the cleveland browns draft and is the ...,which professional athletes who began their ca...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-1447,who did the cleveland browns draft,jeff faine,"[Jeff Faine, Braylon Jamel Edwards]","[jeff faine, braylon jamel edwards]",Which professional athletes who began their ca...,[what athletes started their career at a relat...,50.0,True,True
4,WebQTrn-453_2326103f221042f024262b19814ee9d3,"[{'aliases': ['Kenya Shilling'], 'answer': 'Ke...",kenya,composition,2018-02-12T20:29:53,what currency do they accept in the country th...,the country that has the national anthem ee mu...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-453,what currency do they accept in kenya,kenyan shilling,"[Kenyan shilling, Kenya Shilling]","[kenyan shilling, kenya shilling]",The country that has the national anthem Ee Mu...,[ee mungu nguvu yetu - wikipedia. ee mungu ngu...,50.0,True,True
5,WebQTrn-453_8c50e30ac5163e6dabfc999a7129a4ea,"[{'aliases': ['Kenya Shilling'], 'answer': 'Ke...",kenya,composition,2018-02-12T20:29:53,what currency do they accept in the country th...,rift valley province is located in a nation th...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-453,what currency do they accept in kenya,kenyan shilling,"[Kenyan shilling, Kenya Shilling]","[kenyan shilling, kenya shilling]",Rift Valley Province is located in a nation th...,[rift valley province - wikipedia. rift valley...,50.0,True,True
6,WebQTrn-1929_de8581ad379fdf8fb0e03c89e19ead1a,"[{'aliases': ['1923 World'], 'answer': '1923 W...",yankees,composition,2018-02-12T23:08:17,what year did the team won the 1999 World Seri...,when did the champion of the 1999 world series...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-1929,what year did yankees win their first world se...,1923 world series,"[1923 World Series, 1923 World]","[1923 world series, 1923 world]",When did the champion of the 1999 World Series...,[1923 world series - wikipedia. in the 1923 wo...,50.0,True,True
10,WebQTrn-3287_ebfe3c418f7914f9babf21caade27b05,"[{'aliases': [], 'answer': 'Swedish krona', 'a...",sweden,composition,2018-02-13T01:34:14,what is the currency of the country that conta...,kronoberg county is part of the country using ...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-3287,what is the currency of sweden called,swedish krona,[Swedish krona],[swedish krona],Kronoberg County is part of the country using ...,[kronoberg county - wikipedia. kronoberg count...,50.0,True,True
12,WebQTest-1508_872253e47dd6ddaa213ff31eeda8783b,"[{'aliases': ['Georgetown', 'Georgetown Univer...",,conjunction,2018-02-13T05:30:06,when did bill clinton go to college and the lo...,what college did bill clinton attend that is i...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTest-1508,when did bill clinton go to college,georgetown university,"[Georgetown University, Georgetown, Georgetown...","[georgetown university, georgetown]",What college did Bill Clinton attend that is i...,[bill clinton - wikipedia. clinton was born an...,50.0,True,True
13,WebQTrn-2674_831fb3325644a924d433e2b267f6d238,"[{'aliases': ['SC', 'Palmetto State'], 'answer...",,conjunction,2018-02-13T00:25:17,where is usc from and is the us state that has...,which state includes a university that sometim...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTrn-2674,where is usc from,south carolina,"[South Carolina, SC, Palmetto State]","[south carolina, sc, palmetto state]",Which state includes a university that sometim...,[university of southern california - wikipedia...,50.0,True,True
15,WebQTest-811_22a085e4a873315a9c2e63361cbd9248,"[{'aliases': [], 'answer': 'Bright House Field...",phillies,composition,2018-02-13T04:11:56,where is the the team has a team moscot named ...,where is mascot phillie phanatic's team's spri...,PREFIX ns: <http://rdf.freebase.com/ns/>\nSELE...,WebQTest-811,where is the phillies spring training stadium,bright house field,[Bright House Field],[bright house field],Where is mascot Phillie Phanatic's team's spri...,[phillie phanatic - wikipedia. the phillie pha...,50.0,True,True


In [56]:
existing_ans_train_df_reduced = pd.DataFrame(existing_ans_train_df[["ID", "compositionality_type", "question", "answers_shorter_than_n", "merged_top_k_snippets"]])
existing_ans_dev_df_reduced = pd.DataFrame(existing_ans_dev_df[["ID", "compositionality_type", "question", "answers_shorter_than_n", "merged_top_k_snippets"]])
test_df_reduced = pd.DataFrame(test_questions_snippets[["ID", "compositionality_type", "question", "merged_top_k_snippets"]])

In [57]:
existing_ans_dev_df_reduced.head(10)

Unnamed: 0,ID,compositionality_type,question,answers_shorter_than_n,merged_top_k_snippets
1,WebQTest-823_ed31f9dd431831dbd32a06b958c7c97c,conjunction,what country borders bolivia and contains goiã¡s?,"[brazil, brazilian , brasil]",[category:borders of bolivia - wikipedia. page...
2,WebQTrn-2181_8d86dc5e03446f0e50fd69bc06ae0658,composition,who is the coach of the team owned by steve bi...,[john harbaugh],[list of baltimore ravens head coaches - wikip...
3,WebQTrn-1447_f1ea2e60c0bd4311ef47cc0d7f6c0dd8,comparative,which professional athletes who began their ca...,"[jeff faine, braylon jamel edwards]",[what athletes started their career at a relat...
4,WebQTrn-453_2326103f221042f024262b19814ee9d3,composition,the country that has the national anthem ee mu...,"[kenyan shilling, kenya shilling]",[ee mungu nguvu yetu - wikipedia. ee mungu ngu...
5,WebQTrn-453_8c50e30ac5163e6dabfc999a7129a4ea,composition,rift valley province is located in a nation th...,"[kenyan shilling, kenya shilling]",[rift valley province - wikipedia. rift valley...
6,WebQTrn-1929_de8581ad379fdf8fb0e03c89e19ead1a,composition,when did the champion of the 1999 world series...,"[1923 world series, 1923 world]",[1923 world series - wikipedia. in the 1923 wo...
10,WebQTrn-3287_ebfe3c418f7914f9babf21caade27b05,composition,kronoberg county is part of the country using ...,[swedish krona],[kronoberg county - wikipedia. kronoberg count...
12,WebQTest-1508_872253e47dd6ddaa213ff31eeda8783b,conjunction,what college did bill clinton attend that is i...,"[georgetown university, georgetown]",[bill clinton - wikipedia. clinton was born an...
13,WebQTrn-2674_831fb3325644a924d433e2b267f6d238,conjunction,which state includes a university that sometim...,"[south carolina, sc, palmetto state]",[university of southern california - wikipedia...
15,WebQTest-811_22a085e4a873315a9c2e63361cbd9248,composition,where is mascot phillie phanatic's team's spri...,[bright house field],[phillie phanatic - wikipedia. the phillie pha...


### Pre-Process

Tokenize words in questions, answers and snippets.

Tokenize and Vectorize (keras encoded one-hot representation (each onehot vec represented as an int number)) text feature.

See: https://keras.io/preprocessing/text/#one_hot

First, add <EOS> between each sentence endings. Sencond, translate: "-", "_" and "\xa0" to " ".

In [58]:
import nltk
#nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize

translation = {"\xa0":" "}
def preprocess_text(text):
    text = text.translate(translation)
    text = ' '.join(word_tokenize(text))
    return text

In [59]:
for df in [existing_ans_train_df_reduced, existing_ans_dev_df_reduced, test_df_reduced]:
    df["question"] = df.apply(lambda r: preprocess_text(r["question"]), axis=1)
    df["merged_top_k_snippets"] = df.apply(lambda r: [preprocess_text(snip) for snip in r["merged_top_k_snippets"]], axis=1)

for df in [existing_ans_train_df_reduced, existing_ans_dev_df_reduced]:
    df["answers_shorter_than_n"] = df.apply(lambda r: [preprocess_text(ans) for ans in r["answers_shorter_than_n"]], axis=1)

We're ready for tokenization.

In [60]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [61]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

oov_token = "<UNK>"
max_num_of_words = 300000

max_num_of_words = 300000
# Kept punctuation (will be tokenized as seperate words): .,!?:;-_
filtered_punctuation = '"#$%&\'()*+/<=>@[\\]^`{|}~'
tokenizer = Tokenizer(filters=filtered_punctuation, 
                      lower=False, split=' ', char_level=False, 
                      num_words = max_num_of_words)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Merge all text to fit tokenizer.

In [62]:
all_text_train = pd.DataFrame()
all_text_train["all_text"] = existing_ans_train_df_reduced.apply(
        lambda r: [r["question"]] + [snip for snip in r["merged_top_k_snippets"]] + [ans for ans in r["answers_shorter_than_n"]],
        axis=1)

all_text_dev = pd.DataFrame()
all_text_dev["all_text"] = existing_ans_dev_df_reduced.apply(
        lambda r: [r["question"]] + [snip for snip in r["merged_top_k_snippets"]] + [ans for ans in r["answers_shorter_than_n"]],
        axis=1)

all_text_test = pd.DataFrame()
all_text_test["all_text"] = test_df_reduced.apply(
        lambda r: [r["question"]] + [snip for snip in r["merged_top_k_snippets"]],
        axis=1)

all_text = np.hstack([all_text_train["all_text"], all_text_dev["all_text"], all_text_test["all_text"]])

In [63]:
print('Tokenizing text...')
tokenizer.fit_on_texts(list(itertools.chain.from_iterable(all_text)))
print('Done tokenizing.')

del all_text
gc.collect()

Tokenizing text...
Done tokenizing.


28

Save tokenizer and reversed word_index map to file (for further prediction use).

In [64]:
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

In [65]:
import pickle

with open(os.path.join(DATA_PATH, FNAME_TOKENIZER), 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open(os.path.join(DATA_PATH, FNAME_INVERSE_MAP), 'wb') as handle:
    pickle.dump(reverse_word_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

Tokenize text fields.

In [66]:
maxlen_questions = 20
maxlen_snippets = 100

print('Applying tokenizer on train and dev texts and pad it to maxlen...')

for df in [existing_ans_train_df_reduced, existing_ans_dev_df_reduced]:
    df["tokenized_question"] = tokenizer.texts_to_sequences(df["question"])
    df["tokenized_snippets"] = df.apply(
        lambda r : tokenizer.texts_to_sequences(r["merged_top_k_snippets"]), axis=1)
    df["tokenized_answers"] = df.apply(
        lambda r : tokenizer.texts_to_sequences(r["answers_shorter_than_n"]), axis=1)

    df["tokenized_question"] = df.apply(
        lambda r: pad_sequences([r["tokenized_question"]], maxlen=maxlen_questions, padding='post', truncating='post')[0], axis=1)
    df["tokenized_snippets"] = df.apply(
        lambda r: pad_sequences(r["tokenized_snippets"], maxlen=maxlen_snippets, padding='post', truncating='post'), axis=1)

print('Done.')

Applying tokenizer on train and dev texts and pad it to maxlen...
Done.


Filter all answers with more than (n+1)=4 tokens.

In [67]:
def filter_long_answers(answers):
    res = []
    for ans in answers:
        if 0 < len(ans) <= 4:
            res += [ans]
    if len(res) == 0:
        return np.nan
    return res

In [68]:
for df in [existing_ans_train_df_reduced, existing_ans_dev_df_reduced]:
    df["tokenized_answers"] = df.apply(lambda r : filter_long_answers(r["tokenized_answers"]), axis=1)
    df.dropna(subset=['tokenized_answers'], inplace=True)

Tokenize test texts.

In [69]:
print('Applying tokenizer on test texts and pad it to maxlen...')

test_df_reduced["tokenized_question"] = tokenizer.texts_to_sequences(test_df_reduced["question"])
test_df_reduced["tokenized_snippets"] = test_df_reduced.apply(
    lambda r : tokenizer.texts_to_sequences(r["merged_top_k_snippets"]), axis=1)

test_df_reduced["tokenized_question"] = test_df_reduced.apply(
    lambda r: pad_sequences([r["tokenized_question"]], maxlen=maxlen_questions, padding='post', truncating='post')[0], axis=1)
test_df_reduced["tokenized_snippets"] = test_df_reduced.apply(
        lambda r: pad_sequences(r["tokenized_snippets"], maxlen=maxlen_snippets, padding='post', truncating='post'), axis=1)

print('Done.')

Applying tokenizer on test texts and pad it to maxlen...
Done.


Load word embeddings and arrange them in a dictionary

In [70]:
print("Loading GloVe embeddings.\n")
embeddings_index = {}
with open(os.path.join(DATA_PATH, FNAME_EMBEDDINGS), "r") as glove_ds_sample:
    for line in glove_ds_sample.readlines():
        line = line.strip().split()
        word = line[0].lower()
        embeddings_index[word] = np.array([float(x) for x in line[1:]])
print("Done loading GloVe embeddings!")

Loading GloVe embeddings.

Done loading GloVe embeddings!


Build Glove embedding matrix that matches the tokenizer's word indexing.

In [71]:
vocab_size = len(tokenizer.word_index) + 2
embedding_matrix = np.random.rand(vocab_size, EMBEDDING_DIM)

print('Creating embedding matrix...')
embedding_exists = 0
no_embeddings = 0
for word, i in tokenizer.word_index.items():
    if word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]
        embedding_exists += 1
    else:
        no_embeddings += 1

print ("There are total of {} words in our corpus.".format(embedding_exists+no_embeddings))
print ("There are {} embeddings in Glove.".format(len(embeddings_index)))
print ("We have embeddings for {} words ({}% existing embeddings).".format(embedding_exists, \
                                                                           (100*embedding_exists/(embedding_exists+no_embeddings))))
print ("Embedding is missing for {} words.".format(no_embeddings))

del embeddings_index
gc.collect()

print('\n\nDone loading embeddings.')

Creating embedding matrix...
There are total of 683359 words in our corpus.
There are 400000 embeddings in Glove.
We have embeddings for 213056 words (31.177755762344535% existing embeddings).
Embedding is missing for 470303 words.


Done loading embeddings.


The zero (0) index of the embedding_matrix will be a zero (non trainable) vector that is reserved for the padding token.
We concat this zero vector in the TF code (not here).
It's Important that it will be a zero vector that it does not effect the RNN until it reaches the real words.

In [72]:
embedding_matrix = np.delete(embedding_matrix, 0, 0)

In [73]:
embedding_matrix.dump(os.path.join(DATA_PATH, FNAME_TOKEN_EMBEDDING_MAT))

### Generate answer vectors for snippets

Generate possible answeres based on the dataset snippets. Each question will have two answer vectors: "answer_start" and "answer_end". Those are binary vectors. For each index of an answer's start in a snippet we will have "1" in "answer_start" vector and respectively "1" for every index where we have an answer's end.

In [76]:
def create_label_gram(label, answer, texts, gram):
    answer = np.array(answer)
    text_len = len(texts[0]) - gram + 1
    for text_i, text in enumerate(texts):
        text1 = np.array(text)
        for i in range(len(text) - gram + 1):
            if np.array_equal(text[i:i + gram], answer):
                label[text_i * text_len + i] = 1
    return label

def get_label(texts, answers):
    texts = np.asarray(texts)
    label_unigram = np.zeros(len(texts[0]) * len(texts))
    label_bigram = np.zeros((len(texts[0]) - 1) * len(texts))
    label_trigram = np.zeros((len(texts[0]) - 2) * len(texts))
    label_fourgram = np.zeros((len(texts[0]) - 3) * len(texts))
    for answer in answers:
        answer_len = len(answer)
        if answer_len == 1:
            label_unigram = create_label_gram(label_unigram, answer, texts, 1)
        elif answer_len == 2:
            label_bigram = create_label_gram(label_bigram, answer, texts, 2)
        elif answer_len == 3:
            label_trigram = create_label_gram(label_trigram, answer, texts, 3)
        elif answer_len == 4:
            label_fourgram = create_label_gram(label_fourgram, answer, texts, 4)
        else:
            print("ERROR: {}".format(answer))

    return np.concatenate((label_unigram, label_bigram, label_trigram))

In [77]:
from tqdm import tqdm

tqdm.pandas()
print('Generating answer vectors for train and dev.')

for df in [existing_ans_train_df_reduced, existing_ans_dev_df_reduced]:
    df["answers"] = df.progress_apply(
        lambda r : get_label(r["tokenized_snippets"], r["tokenized_answers"]), axis=1)
print('Done.')

  0%|          | 2/18268 [00:00<25:50, 11.78it/s]

Generating answer vectors for train and dev.


100%|██████████| 18268/18268 [22:12<00:00, 13.71it/s]
100%|██████████| 2374/2374 [03:10<00:00,  8.75it/s]

Done.





### Rename and save train, dev, test dataframes

In [78]:
final_train_questions = pd.DataFrame(existing_ans_train_df_reduced[["ID", "compositionality_type", "tokenized_question","answers"]])
final_dev_questions = pd.DataFrame(existing_ans_dev_df_reduced[["ID", "compositionality_type", "tokenized_question","answers"]])
final_test_questions = pd.DataFrame(test_df_reduced[["ID", "compositionality_type", "tokenized_question"]])

final_train_snippets = pd.DataFrame(existing_ans_train_df_reduced[["ID", "tokenized_snippets"]])
final_dev_snippets = pd.DataFrame(existing_ans_dev_df_reduced[["ID", "tokenized_snippets"]])
final_test_snippets = pd.DataFrame(test_df_reduced[["ID", "tokenized_snippets"]])

final_train_questions.rename(index=str, columns={"tokenized_question": "question"}, inplace=True)
final_dev_questions.rename(index=str, columns={"tokenized_question": "question"}, inplace=True)
final_test_questions.rename(index=str, columns={"tokenized_question": "question"}, inplace=True)

for df in [final_train_snippets, final_dev_snippets, final_test_snippets]:
    df.rename(index=str, columns={"tokenized_snippets": "snippets"}, inplace=True)

In [79]:
final_train_questions.to_json(os.path.join(DATA_PATH, "final_train_questions.json.gz"), orient='records', compression='gzip')
final_dev_questions.to_json(os.path.join(DATA_PATH, "final_dev_questions.json.gz"), orient='records', compression='gzip')
final_test_questions.to_json(os.path.join(DATA_PATH, "final_test_questions.json.gz"), orient='records', compression='gzip')
final_train_snippets.to_json(os.path.join(DATA_PATH, "final_train_snippets.json.gz"), orient='records', compression='gzip')
final_dev_snippets.to_json(os.path.join(DATA_PATH, "final_dev_snippets.json.gz"), orient='records', compression='gzip')
final_test_snippets.to_json(os.path.join(DATA_PATH, "final_test_snippets.json.gz"), orient='records', compression='gzip')

In [80]:
final_train_questions[:100].to_json(os.path.join(DATA_PATH, "final_train_questions_100.json.gz"), orient='records', compression='gzip')

In [81]:
existing_ans_train_df_reduced.head(1)

Unnamed: 0,ID,compositionality_type,question,answers_shorter_than_n,merged_top_k_snippets,tokenized_question,tokenized_snippets,tokenized_answers,answers
0,WebQTrn-3513_7c4117891abf63781b892537979054c6,composition,what state is home to the university that is r...,"[washington , d.c ., washington d.c ., washing...",[gwsports.com mike lonergan bio : : george was...,"[64, 51, 11, 148, 10, 1, 87, 20, 11, 2070, 7, ...","[[45179, 1500, 28639, 2680, 12, 12, 411, 308, ...","[[308, 2, 13212, 3], [308, 13212, 3], [308], [...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ..."
