# BERT (from HuggingFace Transformers) for Text Extraction

**Author:** [Apoorv Nandan](https://twitter.com/NandanApoorv)<br>
**Date created:** 2020/05/23<br>
**Last modified:** 2020/07/30<br>
**Modified by:** Prawploy Techadanai<br>
**Description:** Fine-tune pretrained BERT from HuggingFace Transformers on a SQuAD-style question-answering dataset.<br>
**Based on:** https://keras.io/examples/nlp/text_extraction_with_bert/

## Introduction

This demonstration uses the NECTEC question-answering dataset.
The dataset has been preprocessed; each input consists of a question and a paragraph that provides the context.
The goal is to find the span of text in the paragraph that answers the question.
We evaluate our performance on this data with the "Exact Match" metric,
which measures the percentage of predictions that exactly match any one of the
ground-truth answers.
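
As a rough illustration of the metric, the sketch below scores a few made-up predictions. The helper names `normalize` and `exact_match_score` and the sample strings are assumptions for this example only; the normalization here is deliberately simpler than what a full evaluation script might do.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match_score(predictions, references):
    """Fraction of predictions equal to any one of their reference answers."""
    hits = 0
    for pred, answers in zip(predictions, references):
        if normalize(pred) in {normalize(a) for a in answers}:
            hits += 1
    return hits / len(predictions)


# Made-up predictions and reference answers, for illustration only
predictions = ["Taiwan", "in 2008", "Taipei"]
references = [["taiwan"], ["2008"], ["Taipei", "Taipei City"]]
score = exact_match_score(predictions, references)
print(f"exact match = {score:.2f}")
```

A prediction only counts when the whole normalized string matches, so "in 2008" gets no credit against the reference "2008".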

We fine-tune a BERT model to perform this task as follows:

1. Feed the context and the question as inputs to BERT.
2. Take two vectors S and T with dimensions equal to that of
   hidden states in BERT.
3. Compute the probability of each token being the start and end of
   the answer span. The probability of a token being the start of
   the answer is given by a dot product between S and the representation
   of the token in the last layer of BERT, followed by a softmax over all tokens.
   The probability of a token being the end of the answer is computed
   similarly with the vector T.
4. Fine-tune BERT and learn S and T along the way.
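
Steps 2 and 3 can be sketched with NumPy on random data. This is only an illustration of the arithmetic: the toy sizes and the random `hidden_states`, `S`, and `T` are stand-ins, not values taken from BERT.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len, hidden_size = 6, 8  # toy sizes, much smaller than BERT's
hidden_states = rng.normal(size=(seq_len, hidden_size))  # stand-in for BERT's last layer
S = rng.normal(size=hidden_size)  # start vector (learned during fine-tuning)
T = rng.normal(size=hidden_size)  # end vector (learned during fine-tuning)

# One dot product per token, then a softmax over all tokens
start_probs = softmax(hidden_states @ S)
end_probs = softmax(hidden_states @ T)

print(start_probs.shape, float(start_probs.sum()))
print("most likely start/end tokens:", int(start_probs.argmax()), int(end_probs.argmax()))
```

In the model defined later, the two `Dense(1, use_bias=False)` layers play the role of S and T.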

**References:**

- [BERT](https://arxiv.org/pdf/1810.04805.pdf)
- [SQuAD](https://arxiv.org/abs/1606.05250)


In [None]:
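## Notes
## - Low accuracy may be due to skipped examples (observed on the TPU run): only 337
##   training examples remain, because max_len = 384 and the code skips long inputs
##   instead of truncating them.
## - The input data still includes HTML tags.
## - Training for more epochs reaches around 40%.
## - Consider evaluating with F1 as well.

## TODO:
## - Save the model on TPU / JupyterLab.
## - Make the model able to answer a user-supplied question.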

In [None]:
## Setup

In [142]:
print(re.__version__)
print(json.__version__)
print(np.version.version)
print(tf.__version__)
print(tf.keras.__version__)
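# The next two lines raise a NameError unless `import tokenizers` and
# `import transformers` have been run; the setup cell only imports names from them.
print(tokenizers.__version__)
print(transformers.__version__)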

2.2.1
2.0.9
1.18.5
2.3.0
2.4.0


NameError: name 'tokenizers' is not defined

In [1]:
import os
import re
import json
import string
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer, TFBertModel, BertConfig

max_len = 384
configuration = BertConfig()  # default parameters and configuration for BERT


## Set-up BERT tokenizer


In [2]:
# Save the slow pretrained tokenizer
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# Save into a nested folder; without it, an error occurs later when calling
# TFBertModel.from_pretrained("bert-base-multilingual-cased")
save_path = "bert-pretrained/bert-base-multilingual-cased/"
os.makedirs(save_path, exist_ok=True)
slow_tokenizer.save_pretrained(save_path)

# Load the fast tokenizer from the saved vocabulary file.
# NOTE: lowercase=True is kept as in the recorded run; for a cased checkpoint,
# lowercase=False would normally be the matching setting.
tokenizer = BertWordPieceTokenizer(os.path.join(save_path, "vocab.txt"), lowercase=True)


## Load the data


In [3]:
data_path = 'qa_dataset_new.json'

## Process the data

1. Go through the JSON file and store every record as a `SquadExample` object.
2. Go through each `SquadExample` and create `x_train, y_train, x_eval, y_eval`.
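
The core of the preprocessing is mapping the answer's character span onto token indices using the tokenizer's character offsets. The sketch below shows that mapping in isolation. The function name `char_span_to_token_span` and the hand-written offsets are assumptions for illustration; in the notebook the offsets come from `tokenizer.encode(context).offsets`.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first, last) indices of tokens overlapping [start_char, end_char), or None."""
    token_ids = [
        idx
        for idx, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < end_char and tok_end > start_char
    ]
    if not token_ids:
        return None  # no token overlaps the answer; such an example would be skipped
    return token_ids[0], token_ids[-1]


context = "held in Taipei, Taiwan"
# Hand-written offsets: special tokens at both ends carry the empty span (0, 0)
offsets = [(0, 0), (0, 4), (5, 7), (8, 14), (14, 15), (16, 22), (0, 0)]

answer = "Taiwan"
start_char = context.index(answer)
span = char_span_to_token_span(offsets, start_char, start_char + len(answer))
print(span)
```

The first and last overlapping token indices become the training targets `start_token_idx` and `end_token_idx`.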


In [51]:

class SquadExample:
    def __init__(self, question, context, start_char_idx, answer_text, all_answers, is_question):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx
        self.answer_text = answer_text
        self.all_answers = all_answers
        self.skip = False
        self.is_question = is_question

    def preprocess(self):
        context = self.context
        question = self.question
        answer_text = self.answer_text
        start_char_idx = self.start_char_idx

        # Clean context, answer and question
        context = " ".join(str(context).split())
        question = " ".join(str(question).split())
        answer = " ".join(str(answer_text).split())

        # Find end character index of answer in context
        end_char_idx = start_char_idx + len(answer)
        if end_char_idx >= len(context) and self.is_question == False:
            self.skip = True
            return

        # Mark the character indexes in context that are in answer
        is_char_in_ans = [0] * len(context)
        for idx in range(start_char_idx, end_char_idx):
            is_char_in_ans[idx] = 1

        # Tokenize context
        tokenized_context = tokenizer.encode(context)

        # Find tokens that were created from answer characters
        ans_token_idx = []
        for idx, (start, end) in enumerate(tokenized_context.offsets):
            if sum(is_char_in_ans[start:end]) > 0:
                ans_token_idx.append(idx)

        if len(ans_token_idx) == 0 and self.is_question == False:
            self.skip = True
            return

        # Find start and end token index for tokens from answer
        start_token_idx = ans_token_idx[0]
        end_token_idx = ans_token_idx[-1]

        # Tokenize question
        tokenized_question = tokenizer.encode(question)

        # Create inputs
        input_ids = tokenized_context.ids + tokenized_question.ids[1:]
        token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(
            tokenized_question.ids[1:]
        )
        attention_mask = [1] * len(input_ids)

        # Pad and create attention masks.
        # Skip if truncation is needed
        padding_length = max_len - len(input_ids)
        if padding_length > 0:  # pad
            input_ids = input_ids + ([0] * padding_length)
            attention_mask = attention_mask + ([0] * padding_length)
            token_type_ids = token_type_ids + ([0] * padding_length)
        elif padding_length < 0 and self.is_question == False:  # skip
            self.skip = True
            return

        self.input_ids = input_ids
        self.token_type_ids = token_type_ids
        self.attention_mask = attention_mask
        self.start_token_idx = start_token_idx
        self.end_token_idx = end_token_idx
        self.context_token_to_char = tokenized_context.offsets

#########replaced by me##############

with open(data_path) as f:
    raw_data = json.load(f)

# Split the data: train_size is 80% of 4000; the remaining records are used for evaluation
train_size = int(4000 * 0.8)  # must be an int for slicing
# raw_data['data'] is a list of question/answer dictionaries
# (the version and dataset name fields are left out)
raw_train_data = raw_data['data'][0:train_size]
raw_eval_data = raw_data['data'][train_size:]

#input is array of dict
def create_squad_examples(dataset):
    squad_examples = []
    for i in range(len(dataset)):
        question = dataset[i]['question']
        context = dataset[i]['context']
        start_char_idx = dataset[i]['new_beg_idx']
        answer_text = dataset[i]['answer']
        all_answers = [dataset[i]['answer']]
        squad_eg = SquadExample(
            question, context, start_char_idx, answer_text, all_answers, False
        )
        squad_eg.preprocess()
        squad_examples.append(squad_eg)
    return squad_examples

####################################
#input must be array
def create_inputs_targets(squad_examples):
    dataset_dict = {
        "input_ids": [],
        "token_type_ids": [],
        "attention_mask": [],
        "start_token_idx": [],
        "end_token_idx": [],
    }
    for item in squad_examples:
        if item.skip == False:
            for key in dataset_dict:
                dataset_dict[key].append(getattr(item, key))
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])

    x = [
        dataset_dict["input_ids"],
        dataset_dict["token_type_ids"],
        dataset_dict["attention_mask"],
    ]
    y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]
    return x, y


train_squad_examples = create_squad_examples(raw_train_data) 
# returns array of squad example object
print('Training Values')
x_train, y_train = create_inputs_targets(train_squad_examples)
print('x_train input_ids', x_train[0])
print('x_train token_type_ids', x_train[1])
print('x_train attention_mask', x_train[2])
print('y_train start_token_idx', y_train[0])
print('y_train end_token_idx', y_train[1])
print(f"{len(train_squad_examples)} training points created.")

eval_squad_examples = create_squad_examples(raw_eval_data) 
# returns array of squad example object
print('Evaluation Values')
x_eval, y_eval = create_inputs_targets(eval_squad_examples)
print('x_eval input_ids', x_eval[0])
print('x_eval token_type_ids', x_eval[1])
print('x_eval attention_mask', x_eval[2])
print('y_eval start_token_idx', y_eval[0])
print('y_eval end_token_idx', y_eval[1])
print(f"{len(eval_squad_examples)} evaluation points created.")


Training Values
x_train input_ids [[  101   133 67779 ...     0     0     0]
 [  101   133 67779 ...     0     0     0]
 [  101   133 67779 ...     0     0     0]
 ...
 [  101   133 67779 ...     0     0     0]
 [  101   133 67779 ...     0     0     0]
 [  101   133 67779 ...     0     0     0]]
x_train token_type_ids [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
x_train attention_mask [[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]]
y_train start_token_idx [129  94 111 149 123  92 136 142 150 136  75 186 118 220  65  53 212  99
 205  63  79 186  53 114 196  61 126 150  62 132  75 160 234  89 303  80
 110 135 116 134  80 283  88  47 136  72 250 137 147  94 106  96 166  43
 110  89 210  45 108  75 156  64 289 118 117  81 158  69 294 121 235 100
 201 264  98 228  94 226  93  80 170 231  67 155  47 148  92 105  84 116
 121 178  87 135  93 150  8

# Observing the processed data

In [66]:
x_eval[1][0]  # token_type_ids of the first evaluation example: 0 = context and padding, 1 = question

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [113]:
count = 0
for i in range(len(train_squad_examples)):
    if train_squad_examples[i].skip == False:
        count += 1
print(count)
# Only 337 examples can be used for training; the rest were skipped

337
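
Most examples are dropped because inputs longer than `max_len` are skipped rather than truncated. One common alternative, which this notebook does not implement, is to split a long context into overlapping windows so that every example yields at least one usable input. The sketch below shows only the windowing step on a plain list; the function name `split_into_windows` and its parameters are hypothetical.

```python
def split_into_windows(token_ids, window_size, stride):
    """Split a token list into overlapping windows of at most window_size tokens.

    Returns a list of (start_offset, window) pairs.
    """
    if stride <= 0 or stride > window_size:
        raise ValueError("stride must be in the range 1..window_size")
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + window_size]))
        if start + window_size >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride
    return windows


tokens = list(range(10))  # stand-in for the token ids of a long context
windows = split_into_windows(tokens, window_size=4, stride=2)
for offset, window in windows:
    print(offset, window)
```

With this approach the answer's token indices would have to be shifted by each window's start offset, and windows that do not contain the answer would need separate handling.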


In [115]:
example_idx =[]
count = 0
for i in range(len(eval_squad_examples)):
    if eval_squad_examples[i].skip == False:
        count += 1
        example_idx.append(i)
print(count)
# Only 97 examples can be used for evaluation
print(example_idx[:2])
# The first two usable examples are at indices 9 and 10

97
[9, 10]


In [117]:
eval_squad_examples[9].context

'<doc id="685203" url="https://th.wikipedia.org/wiki?curid=685203" title="วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008">วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008 วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008 เป็นครั้งที่ 14 ของการแข่งขัน จัดขึ้นที่ไทเป, ประเทศไต้หวัน ระหว่างวันที่ 20 - 28 กันยายน พ.ศ. 2551อันดับการแข่งขัน</doc>\n'

In [118]:
eval_squad_examples[9].question

'การแข่งขันวอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชียครั้งที่ 14 ในปีค.ศ. 2008 จัดขึ้นที่ประเทศใด'

In [120]:
eval_squad_examples[9].answer_text

'ไต้หวัน'

## Create the Question-Answering Model using BERT and Functional API


In [5]:

def create_model():
    ## BERT encoder
    encoder = TFBertModel.from_pretrained("bert-base-multilingual-cased")

    ## QA Model
    input_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    token_type_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = layers.Input(shape=(max_len,), dtype=tf.int32)
    
    ## The transformer
    embedding = encoder(
        input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask
    )[0]

    start_logits = layers.Dense(1, name="start_logit", use_bias=False)(embedding)
    start_logits = layers.Flatten()(start_logits)

    end_logits = layers.Dense(1, name="end_logit", use_bias=False)(embedding)
    end_logits = layers.Flatten()(end_logits)

    start_probs = layers.Activation(keras.activations.softmax)(start_logits)
    end_probs = layers.Activation(keras.activations.softmax)(end_logits)

    model = keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[start_probs, end_probs],
    )
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=[loss, loss])
    return model



This code should preferably be run on Google Colab TPU runtime.
With Colab TPUs, each epoch will take 5-6 minutes.


In [6]:
use_tpu = False
if use_tpu:
    # Create distribution strategy
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)

    # Create model
    with strategy.scope():
        model = create_model()
else:
    model = create_model()

model.summary()


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 384, 768), ( 177853440   input_1[0][0]                    
                                                                 input_3[0][0]         

## Create evaluation Callback

This callback will compute the exact match score using the validation data
after every epoch.


In [7]:

def normalize_text(text):
    text = text.lower()

    # Remove punctuations
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in exclude)

    # Remove articles
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    text = re.sub(regex, " ", text)

    # Remove extra white space
    text = " ".join(text.split())
    return text


class ExactMatch(keras.callbacks.Callback):
    """
    Each `SquadExample` object contains the character level offsets for each token
    in its input paragraph. We use them to get back the span of text corresponding
    to the tokens between our predicted start and end tokens.
    All the ground-truth answers are also present in each `SquadExample` object.
    We calculate the percentage of data points where the span of text obtained
    from model predictions matches one of the ground-truth answers.
    """

    def __init__(self, x_eval, y_eval):
        super().__init__()
        # Keep the evaluation inputs and targets for use after each epoch
        self.x_eval = x_eval
        self.y_eval = y_eval

    # Called by Keras at the end of every epoch; `logs` holds the epoch's losses
    def on_epoch_end(self, epoch, logs=None):

        # Predict start and end probabilities for every token position from the
        # input arrays (input ids, token type ids, attention masks)
        pred_start, pred_end = self.model.predict(self.x_eval)
        count = 0

        # Keep only the evaluation examples that were not skipped, so that they
        # line up with the rows of x_eval
        eval_examples_no_skip = [_ for _ in eval_squad_examples if _.skip == False]
        for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
            # The SquadExample that produced this row
            squad_eg = eval_examples_no_skip[idx]

            # Character offsets of each context token
            offsets = squad_eg.context_token_to_char

            # `start` and `end` are probability vectors of length max_len;
            # take the token positions with the highest probability
            start = np.argmax(start)
            end = np.argmax(end)

            # If the predicted start lies beyond the context tokens, move on to the next example
            if start >= len(offsets):
                continue
            
            pred_char_start = offsets[start][0]
            if end < len(offsets):
                pred_char_end = offsets[end][1]
                #the actual word answer
                pred_ans = squad_eg.context[pred_char_start:pred_char_end]
            else:
                #the actual word answer (in case the end is larger)
                pred_ans = squad_eg.context[pred_char_start:]

            #predicted answer
            normalized_pred_ans = normalize_text(pred_ans)
            #array of all actual stored answers
            normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]
            
            if normalized_pred_ans in normalized_true_ans:
                count += 1
                
        # Exact match = number of correct answers / number of evaluation examples
        acc = count / len(self.y_eval[0])

        # Append the result to a log file and print it
        result = f"\nepoch={epoch+1}, exact match score={acc:.2f}"
        with open("my_thai_result.txt", "a") as f:
            f.write(result)
        print(result)



## Train and Evaluate


In [8]:
exact_match_callback = ExactMatch(x_eval, y_eval)
model.fit(
    x_train,
    y_train,
    epochs=5,  # For demonstration, 3 epochs are recommended
    verbose=2,
    batch_size=64,
    callbacks=[exact_match_callback], #other ways to evaluate??
)


Epoch 1/5

epoch=1, exact match score=0.25
6/6 - 232s - loss: 8.8153 - activation_10_loss: 4.6087 - activation_11_loss: 4.2065
Epoch 2/5

epoch=2, exact match score=0.34
6/6 - 228s - loss: 4.6369 - activation_10_loss: 2.4926 - activation_11_loss: 2.1444
Epoch 3/5

epoch=3, exact match score=0.38
6/6 - 228s - loss: 2.9020 - activation_10_loss: 1.5840 - activation_11_loss: 1.3179
Epoch 4/5

epoch=4, exact match score=0.38
6/6 - 228s - loss: 1.9216 - activation_10_loss: 1.0950 - activation_11_loss: 0.8267
Epoch 5/5

epoch=5, exact match score=0.32
6/6 - 229s - loss: 1.3524 - activation_10_loss: 0.7913 - activation_11_loss: 0.5612


<tensorflow.python.keras.callbacks.History at 0x7fa06464d550>

# Saving and loading the model

In [11]:
model.save('thai_qa_model')

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: thai_qa_model/assets


In [90]:
# Loading the saved model fails with:
# ValueError: The two structures don't have the same nested structure.
new_model = keras.models.load_model("thai_qa_model")

ValueError: The two structures don't have the same nested structure.

First structure: type=TensorSpec str=TensorSpec(shape=(None, 384), dtype=tf.int32, name='inputs')

Second structure: type=dict str={'input_ids': TensorSpec(shape=(None, 5), dtype=tf.int32, name='input_ids')}

More specifically: Substructure "type=dict str={'input_ids': TensorSpec(shape=(None, 5), dtype=tf.int32, name='input_ids')}" is a sequence, while substructure "type=TensorSpec str=TensorSpec(shape=(None, 384), dtype=tf.int32, name='inputs')" is not
Entire first structure:
.
Entire second structure:
{'input_ids': .}

## Save with h5 format

In [100]:
model.save('thai_qa_model.h5')

NotImplementedError: 

In [101]:
new_model = keras.models.load_model("thai_qa_model.h5")

ValueError: No model found in config file.

## Save just architecture

In [96]:
json_string = model.to_json()

NotImplementedError: 

## Save just weights

In [98]:
model.save_weights("ckpt")
#ok

In [99]:
model.save_weights("weights.h5")
#ok

In [102]:
model.get_config()

NotImplementedError: 

# Load the weights back into a model with the same architecture

In [103]:
model2 = create_model()

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.


In [104]:
model2.summary()

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            [(None, 384)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   ((None, 384, 768), ( 177853440   input_4[0][0]                    
                                                                 input_6[0][0]         

In [105]:
model2.load_weights("weights.h5")

In [109]:
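# assert_allclose raises an AssertionError if the two models' predictions differ
# and returns None otherwise, so `result` is None on success
result = np.testing.assert_allclose(
    model.predict(x_eval), model2.predict(x_eval)
)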

In [110]:
result

# Convert Token Index to Answer Words

In [121]:
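def create_data(c, q):
    # Build a record in the same format as the dataset entries.
    # The answer fields are placeholders that only keep preprocess() from failing.
    dataset_dict = dict()
    dataset_dict['question'] = q
    dataset_dict['context'] = c
    dataset_dict['new_beg_idx'] = 0  # placeholder start index
    dataset_dict['answer'] = c[0]  # placeholder answer: the first character of the context
    return dataset_dict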

In [122]:
# Works for short contexts only: context and question together must fit within max_len tokens
context = input('Please enter paragraph/context: ')
question = input('Please enter your question related to the above context: ')

my_data = create_data(context,question) #a dictionary of necessary attributes
print('Entered data')
print(my_data)

Please enter paragraph/context:  <doc id="685203" url="https://th.wikipedia.org/wiki?curid=685203" title="วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008">วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008 วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008 เป็นครั้งที่ 14 ของการแข่งขัน จัดขึ้นที่ไทเป, ประเทศไต้หวัน ระหว่างวันที่ 20 - 28 กันยายน พ.ศ. 2551อันดับการแข่งขัน</doc>\n
Please enter your question related to the above context:  การแข่งขันวอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชียครั้งที่ 14 ในปีค.ศ. 2008 จัดขึ้นที่ประเทศใด


Entered data
{'question': 'การแข่งขันวอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชียครั้งที่ 14 ในปีค.ศ. 2008 จัดขึ้นที่ประเทศใด', 'context': '<doc id="685203" url="https://th.wikipedia.org/wiki?curid=685203" title="วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008">วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008 วอลเลย์บอลเยาวชนหญิงชิงแชมป์เอเชีย 2008 เป็นครั้งที่ 14 ของการแข่งขัน จัดขึ้นที่ไทเป, ประเทศไต้หวัน ระหว่างวันที่ 20 - 28 กันยายน พ.ศ. 2551อันดับการแข่งขัน</doc>\\n', 'new_beg_idx': 0, 'answer': '<'}


In [123]:
# A separate function is needed: with is_question=False, preprocess() may set .skip = True,
# and create_inputs_targets would then produce no inputs
def create_example(dataset):
    squad_examples = []
    for i in range(len(dataset)):
        question = dataset[i]['question']
        context = dataset[i]['context']
        start_char_idx = dataset[i]['new_beg_idx']
        answer_text = dataset[i]['answer']
        all_answers = [dataset[i]['answer']]
        squad_eg = SquadExample(
            question, context, start_char_idx, answer_text, all_answers, True
        )
        squad_eg.preprocess()
        squad_examples.append(squad_eg)
    return squad_examples

In [124]:
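# my_data is wrapped in a list because create_example expects a list of dictionaries
print('Preprocessing data')
to_predict = create_example([my_data])
to_predict[0].skip = False  # just in case
print('Creating inputs')
# create_inputs_targets returns an (inputs, targets) tuple; the targets are placeholders here
x = create_inputs_targets(to_predict)
print('Created success')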

Preprocessing data
Creating inputs
Created success


In [125]:
#dir(to_predict[0])
to_predict[0].context_token_to_char

[(0, 0),
 (0, 1),
 (1, 4),
 (5, 7),
 (7, 8),
 (8, 9),
 (9, 12),
 (12, 14),
 (14, 15),
 (15, 16),
 (17, 20),
 (20, 21),
 (21, 22),
 (22, 27),
 (27, 28),
 (28, 29),
 (29, 30),
 (30, 32),
 (32, 33),
 (33, 42),
 (42, 43),
 (43, 46),
 (46, 47),
 (47, 51),
 (51, 52),
 (52, 54),
 (54, 57),
 (57, 58),
 (58, 61),
 (61, 63),
 (63, 64),
 (64, 65),
 (66, 71),
 (71, 72),
 (72, 73),
 (73, 74),
 (74, 76),
 (76, 77),
 (77, 78),
 (78, 79),
 (80, 81),
 (81, 83),
 (83, 84),
 (84, 86),
 (86, 87),
 (87, 88),
 (88, 89),
 (89, 90),
 (90, 91),
 (92, 93),
 (93, 94),
 (95, 96),
 (96, 97),
 (97, 98),
 (98, 99),
 (99, 100),
 (101, 102),
 (102, 103),
 (103, 104),
 (104, 105),
 (106, 107),
 (108, 112),
 (112, 113),
 (113, 114),
 (114, 115),
 (115, 117),
 (117, 118),
 (118, 119),
 (119, 120),
 (121, 122),
 (122, 124),
 (124, 125),
 (125, 127),
 (127, 128),
 (128, 129),
 (129, 130),
 (130, 131),
 (131, 132),
 (133, 134),
 (134, 135),
 (136, 137),
 (137, 138),
 (138, 139),
 (139, 140),
 (140, 141),
 (142, 143),
 (143,

In [126]:
x

([array([[   101,    133,  67779,  12604,    134,    107,  58986,  22650,
           10884,    107,  88767,    134,    107,  14120,    131,    120,
             120,  77586,    119,  34724,    119,  10733,    120,  30351,
             136,  10854,  33597,    134,  58986,  22650,  10884,    107,
           12887,    134,    107,   1430,  85915, 111431,  20507,  20503,
           49097,  85915, 111431,  54633,  31287,  42407,  16000, 111424,
           55749,  19197,  42407,  19197, 111432,  42407,  17405,  49292,
          111431,  33178, 111431,  42407,  20503,  10203,    107,    135,
            1430,  85915, 111431,  20507,  20503,  49097,  85915, 111431,
           54633,  31287,  42407,  16000, 111424,  55749,  19197,  42407,
           19197, 111432,  42407,  17405,  49292, 111431,  33178, 111431,
           42407,  20503,  10203,   1430,  85915, 111431,  20507,  20503,
           49097,  85915, 111431,  54633,  31287,  42407,  16000, 111424,
           55749,  19197,  42407,  191

In [127]:
pred_start, pred_end = model.predict(x,batch_size=1)

In [128]:
for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
    if idx<10:
        print(idx)
        start = np.argmax(start)
        end = np.argmax(end)
        print(start,end)
        offsets = to_predict[0].context_token_to_char
        pred_char_start = offsets[start][0]
        if end < len(offsets):
            pred_char_end = offsets[end][1]
            #the actual word answer
            pred_ans = to_predict[0].context[pred_char_start:pred_char_end]
        else:
            #the actual word answer (in case the end is larger)
            pred_ans = to_predict[0].context[pred_char_start:]

0
132 151


In [129]:
pred_ans

'จัดขึ้นที่ไทเป, ประเทศไต้หวัน'

In [130]:
pred_start2, pred_end2 = model2.predict(x,batch_size=1)

for idx, (start, end) in enumerate(zip(pred_start2, pred_end2)):
    if idx<10:
        print(idx)
        start = np.argmax(start)
        end = np.argmax(end)
        print(start,end)
        offsets = to_predict[0].context_token_to_char
        pred_char_start = offsets[start][0]
        if end < len(offsets):
            pred_char_end = offsets[end][1]
            #the actual word answer
            pred_ans2 = to_predict[0].context[pred_char_start:pred_char_end]
        else:
            #the actual word answer (in case the end is larger)
            pred_ans2 = to_predict[0].context[pred_char_start:]

0
132 151


In [131]:
pred_ans2

'จัดขึ้นที่ไทเป, ประเทศไต้หวัน'
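
The cells above take the argmax of the start and end probabilities independently, which can in principle produce an end token before the start token or a position in the padding. A possible refinement, sketched below with made-up probabilities, is to search for the highest-scoring pair with start <= end inside the context tokens. The function name `decode_answer` and the `max_answer_tokens` limit are assumptions for this sketch, not part of the notebook.

```python
import numpy as np


def decode_answer(start_probs, end_probs, offsets, context, max_answer_tokens=30):
    """Return the context substring for the best (start, end) pair with start <= end."""
    n = len(offsets)  # only context tokens are considered, never padding
    best_score, best_span = -1.0, None
    for s in range(n):
        if tuple(offsets[s]) == (0, 0):  # skip special tokens such as [CLS] and [SEP]
            continue
        for e in range(s, min(n, s + max_answer_tokens)):
            if tuple(offsets[e]) == (0, 0):
                continue
            score = start_probs[s] * end_probs[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    if best_span is None:
        return ""
    s, e = best_span
    return context[offsets[s][0]:offsets[e][1]]


context = "held in Taipei, Taiwan"
offsets = [(0, 0), (0, 4), (5, 7), (8, 14), (14, 15), (16, 22), (0, 0)]
# Made-up model outputs for the seven token positions
start_probs = np.array([0.05, 0.05, 0.10, 0.20, 0.05, 0.50, 0.05])
end_probs = np.array([0.05, 0.05, 0.05, 0.30, 0.05, 0.45, 0.05])

answer = decode_answer(start_probs, end_probs, offsets, context)
print(answer)
```

With the notebook's objects, the call would presumably look like `decode_answer(pred_start[0], pred_end[0], to_predict[0].context_token_to_char, to_predict[0].context)`.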