<a href="https://colab.research.google.com/github/google-research/tapas/blob/master/notebooks/tabfact_predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2020 The Google AI Language Team Authors

Licensed under the Apache License, Version 2.0 (the "License");

In [None]:
# Copyright 2019 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Running a Tapas fine-tuned checkpoint
---
This notebook shows how to load a TAPAS model and make predictions with it. The model was introduced in the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349).

# Clone and install the repository


First, let's fetch the code from the GitHub repository and install it.

In [1]:
! git clone https://github.com/google-research/tapas.git

Cloning into 'tapas'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 316 (delta 33), reused 41 (delta 17), pack-reused 236
Receiving objects: 100% (316/316), 303.89 KiB | 8.94 MiB/s, done.
Resolving deltas: 100% (156/156), done.


In [2]:
! pip install ./tapas

Processing ./tapas
Collecting apache-beam[gcp]==2.20.0
  Downloading https://files.pythonhosted.org/packages/4b/0d/0979ad626578a52887f7df60492ac6759089a9da261ac4c88b112b3f6a5a/apache_beam-2.20.0-cp36-cp36m-manylinux1_x86_64.whl (3.5MB)
Collecting frozendict==1.2
  Downloading https://files.pythonhosted.org/packages/4e/55/a12ded2c426a4d2bee73f88304c9c08ebbdbadb82569ebdd6a0c007cfd08/frozendict-1.2.tar.gz
Collecting pandas~=1.0.0
  Downloading https://files.pythonhosted.org/packages/c0/95/cb9820560a2713384ef49060b0087dfa2591c6db6f240215c2bce1f4211c/pandas-1.0.5-cp36-cp36m-manylinux1_x86_64.whl (10.1MB)
Collecting tensorflow~=2.2.0
  Downloading https://files.pythonhosted.org/packages/70/e3/663eac537202dee730ad6e61769fc3ebce92a6085dbfd13ca902df5f1477/tensorflow-2.2.1-cp36-cp36m-manylinux2010_x86_64.whl (516.2MB)

# Fetch models from Google Storage

Next, we fetch a pretrained checkpoint from Google Storage. The command below downloads the large model (`tapas_tabfact_inter_masklm_large_reset`, a 3.4 GiB archive) fine-tuned on [TABFACT](https://tabfact.github.io/). The best results in the paper were obtained with a large model; a smaller checkpoint would be faster to download and run.

In [3]:
! gsutil cp "gs://tapas_models/2020_10_07/tapas_tabfact_inter_masklm_large_reset.zip" "tapas_model.zip" && unzip tapas_model.zip
! mv tapas_tabfact_inter_masklm_large_reset tapas_model

Copying gs://tapas_models/2020_10_07/tapas_tabfact_inter_masklm_large_reset.zip...
| [1 files][  3.4 GiB/  3.4 GiB]   58.7 MiB/s                                   
Operation completed over 1 objects/3.4 GiB.                                      
Archive:  tapas_model.zip
   creating: tapas_tabfact_inter_masklm_large_reset/
  inflating: tapas_tabfact_inter_masklm_large_reset/bert_config.json  
  inflating: tapas_tabfact_inter_masklm_large_reset/README.txt  
  inflating: tapas_tabfact_inter_masklm_large_reset/model.ckpt.index  
  inflating: tapas_tabfact_inter_masklm_large_reset/model.ckpt.data-00000-of-00001  
  inflating: tapas_tabfact_inter_masklm_large_reset/vocab.txt  
  inflating: tapas_tabfact_inter_masklm_large_reset/model.ckpt.meta  
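
Before moving on, it can help to confirm that the archive unpacked as expected. The following is a minimal sketch, not part of the TAPAS codebase: the helper `missing_checkpoint_files` is a hypothetical name, and the file list is copied from the unzip listing above. It assumes the directory was renamed to `tapas_model`.

```python
import os

# File names taken from the unzip listing above.
EXPECTED_FILES = [
    "bert_config.json",
    "vocab.txt",
    "model.ckpt.index",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.meta",
]


def missing_checkpoint_files(model_dir):
    """Returns the expected files that are absent from model_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# "tapas_model" is the directory name used by the `mv` command above.
missing = missing_checkpoint_files("tapas_model")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All checkpoint files are present.")
```

If any file is reported missing, rerunning the download cell is the simplest fix.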


# Imports

In [4]:
import tensorflow.compat.v1 as tf
import os 
import shutil
import csv
import pandas as pd
import IPython

tf.get_logger().setLevel('ERROR')

In [5]:
from tapas.utils import tf_example_utils
from tapas.protos import interaction_pb2
from tapas.utils import number_annotation_utils
import math


# Load checkpoint for prediction

Here's the prediction code. It creates an `interaction_pb2.Interaction` protobuf object, which is the data structure we use to store examples, and then calls the prediction script.

In [6]:
os.makedirs('results/tabfact/tf_examples', exist_ok=True)
os.makedirs('results/tabfact/model', exist_ok=True)
with open('results/tabfact/model/checkpoint', 'w') as f:
  f.write('model_checkpoint_path: "model.ckpt-0"')
for suffix in ['.data-00000-of-00001', '.index', '.meta']:
  shutil.copyfile(f'tapas_model/model.ckpt{suffix}', f'results/tabfact/model/model.ckpt-0{suffix}')

In [21]:
max_seq_length = 512
vocab_file = "tapas_model/vocab.txt"
config = tf_example_utils.ClassifierConversionConfig(
    vocab_file=vocab_file,
    max_seq_length=max_seq_length,
    max_column_id=max_seq_length,
    max_row_id=max_seq_length,
    strip_column_names=False,
    add_aggregation_candidates=False,
)
converter = tf_example_utils.ToClassifierTensorflowExample(config)

def convert_interactions_to_examples(tables_and_queries):
  """Calls Tapas converter to convert interaction to example."""
  for idx, (table, queries) in enumerate(tables_and_queries):
    interaction = interaction_pb2.Interaction()
    for position, query in enumerate(queries):
      question = interaction.questions.add()
      question.original_text = query
      question.id = f"{idx}-0_{position}"
    for header in table[0]:
      interaction.table.columns.add().text = header
    for line in table[1:]:
      row = interaction.table.rows.add()
      for cell in line:
        row.cells.add().text = cell
    number_annotation_utils.add_numeric_values(interaction)
    for i in range(len(interaction.questions)):
      try:
        yield converter.convert(interaction, i)
      except ValueError as e:
        # interaction.id is never set above, so report the question id instead.
        print(f"Can't convert question {interaction.questions[i].id}: {e}")
        
def write_tf_example(filename, examples):
  with tf.io.TFRecordWriter(filename) as writer:
    for example in examples:
      writer.write(example.SerializeToString())

def predict(table_data, queries):
  table = table_data
  examples = convert_interactions_to_examples([(table, queries)])
  write_tf_example("results/tabfact/tf_examples/test.tfrecord", examples)
  write_tf_example("results/tabfact/tf_examples/dev.tfrecord", [])
  
  ! python tapas/tapas/run_task_main.py \
    --task="TABFACT" \
    --output_dir="results" \
    --noloop_predict \
    --test_batch_size={len(queries)} \
    --tapas_verbosity="ERROR" \
    --compression_type= \
    --reset_position_index_per_cell \
    --init_checkpoint="tapas_model/model.ckpt" \
    --bert_config_file="tapas_model/bert_config.json" \
    --mode="predict" 2> error


  results_path = "results/tabfact/model/test.tsv"
  all_results = []
  df = pd.DataFrame(table[1:], columns=table[0])
  display(IPython.display.HTML(df.to_html(index=False)))
  print()
  with open(results_path) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    for row in reader:
      supported = int(row["pred_cls"])
      all_results.append(supported)
      score = float(row["logits_cls"])
      prob = 1 / (1 + math.exp(-score))
      position = int(row['position'])
      print(">", queries[position])
      if supported:
        print(f"SUPPORTS with probability {prob:.2%}")
      else:
        print(f"REFUTES with probability {1.0 - prob:.2%}")
  return all_results
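
The last part of `predict` reads the prediction file and turns each logit into a probability with the logistic function. The sketch below isolates that step so it can be tried without running the model. The column names `position`, `pred_cls`, and `logits_cls` come from the code above; the sample rows are made up for illustration, and the real `test.tsv` may contain additional columns. The `sigmoid` helper is a numerically stable variant: the plain form `1 / (1 + math.exp(-score))` raises `OverflowError` for very negative scores.

```python
import csv
import io
import math


def sigmoid(score):
    """Logistic function that avoids overflow for large negative scores."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


# Hypothetical prediction rows; real values come from test.tsv.
SAMPLE_TSV = "position\tpred_cls\tlogits_cls\n0\t1\t2.0\n1\t0\t-3.0\n"


def read_predictions(tsv_text):
    """Parses TSV text into (position, predicted class, probability) tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        prob = sigmoid(float(row["logits_cls"]))
        rows.append((int(row["position"]), int(row["pred_cls"]), prob))
    return rows


for position, supported, prob in read_predictions(SAMPLE_TSV):
    label = "SUPPORTS" if supported else "REFUTES"
    confidence = prob if supported else 1.0 - prob
    print(f"statement {position}: {label} ({confidence:.2%})")
```

Reporting `1 - prob` for refuted statements mirrors the code above, so the printed number is always the confidence in the predicted label.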

In [8]:
# Extract the table rows from an HTML file.
from bs4 import BeautifulSoup

with open("table.html") as f:
    soup = BeautifulSoup(f.read(), "lxml")
table = soup.find("table")

output_rows = []
for table_row in table.find_all('tr'):
    # Header cells may be marked up as <th>, data cells as <td>.
    columns = table_row.find_all(['th', 'td'])
    if columns:  # Skip rows that contain no cells.
        output_rows.append([column.text for column in columns])
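
The `predict` function treats the first row of the table as the header (`table[0]`) and every later row as data (`table[1:]`), so the extraction step has to produce a list of equal-length rows of strings. As a rough illustration of that shape, here is a sketch that uses only the standard library instead of BeautifulSoup. The class `TableRows` and the inline HTML are hypothetical; the two sample rows are an excerpt of the table displayed in the Predict section.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # Skip rows without cells.
                self.rows.append(self._row)
            self._row = None


# Hypothetical two-row excerpt; the real input is table.html.
SAMPLE_HTML = """
<table>
  <tr><th>Bodily sensations</th><th>Agoraphobic situation</th></tr>
  <tr><td>Numbed</td><td>Museum</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
sample_rows = parser.rows

# Every row must have as many cells as the header row.
assert all(len(row) == len(sample_rows[0]) for row in sample_rows)
print(sample_rows)
```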

In [9]:
from bs4 import BeautifulSoup

# Read the statements to verify from the XML file. Each <statement>
# element carries its sentence in the `text` attribute.
with open("table.xml", "r") as file:
    soup = BeautifulSoup(file.read(), "lxml")

output = [s['text'] for s in soup.find_all('statement')]
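
The cell above relies on two facts about `table.xml`: statements live in `statement` elements, and the sentence itself is stored in a `text` attribute. The sketch below shows the same extraction with the standard library's `xml.etree.ElementTree` on an inline document. The wrapper elements and `id` attributes in the sample are assumptions made for illustration; only the `statement` tag and `text` attribute come from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; the real table.xml may be structured differently.
SAMPLE_XML = """
<table id="sample">
  <statements>
    <statement id="0" text="Palpitation is a bodily sensation"/>
    <statement id="1" text="Lovely is an agoraphobic situation"/>
  </statements>
</table>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Collect the `text` attribute of every <statement>, wherever it is nested.
sample_statements = [
    node.attrib["text"]
    for node in root.iter("statement")
    if "text" in node.attrib
]
print(sample_statements)
```

Unlike the `lxml` HTML parser used above, which lowercases tag names, `ElementTree` matches tags case-sensitively, so the tag name passed to `iter` must match the file exactly.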

# Predict

In [22]:
# Run the model on the table and statements parsed from table.html and table.xml.
result = predict(output_rows, output)

is_built_with_cuda: True
is_gpu_available: False
GPUs: []
Training or predicting ...
Evaluation finished after training step 0.


Bodily sensations,Agoraphobic situation,Unpleasant,Pleasant
Numbed,Museum,Dangerous,Lovely
Dizzy,Restaurant,Fear,Pleasure
Tremble,Railway,Panic,Pleasant
Nervousness,Bus,Threat,Delight
Breathlessness,Theatre,Anxiety,Beautiful
Sweat,Lift,Shock,Happy
Sickness,Aeroplane,Horrify,Glad
Palpitation,Shop,Anxious,Happiness
Confusion,Tunnel,Frightened,Fun
Heartbeat,Boat,Harm,Joyous



> Palpitation is a bodily sensation
SUPPORTS with probability 100.00%
> Harm is an unpleasant
SUPPORTS with probability 99.97%
> Lovely is an agoraphobic situation
REFUTES with probability 100.00%
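
`predict` returns one integer per statement, with 1 meaning supported and 0 meaning refuted. The following sketch pairs those integers with their statements for later use. The helper name `summarize` is hypothetical, the sample values are copied from the output above, and the pairing assumes that the rows of the prediction file appear in the same order as the statements.

```python
# Values copied from the output above: 1 = supported, 0 = refuted.
sample_statements = [
    "Palpitation is a bodily sensation",
    "Harm is an unpleasant",
    "Lovely is an agoraphobic situation",
]
sample_result = [1, 1, 0]


def summarize(statements, predictions):
    """Pairs each statement with a label; both lists must be equally long."""
    if len(statements) != len(predictions):
        raise ValueError("statements and predictions differ in length")
    return [
        (text, "SUPPORTS" if pred else "REFUTES")
        for text, pred in zip(statements, predictions)
    ]


for text, label in summarize(sample_statements, sample_result):
    print(f"{label}: {text}")
```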
