# Hands-on Tabular LLMs

Goals:
- Input formats and preprocessing (serialization, tokenization, numeric token IDs)
- Understanding how inputs and outputs relate to the model architecture (pre-trained and fine-tuned)
- Using fine-tuned models to answer questions / verify statements


Models:
- TaPas:
  - paper: https://aclanthology.org/2020.acl-main.398.pdf
  - code: https://huggingface.co/docs/transformers/model_doc/tapas
- TaPEx:
  - paper: https://openreview.net/pdf?id=O50443AsCP
  - code: https://huggingface.co/docs/transformers/model_doc/tapex

# SETUP

In [1]:
!pip install transformers



In [2]:
import pandas as pd
import transformers

## Loading the models

In [3]:
class TapasModel:
    def __init__(self, model_name="google/tapas-base"):
        """Choice of:
              - google/tapas-base,
              - google/tapas-base-finetuned-wtq,
           Alternatively, load a larger model by replacing 'base' with 'large'
        """
        print("loading tapas model")
        self.tokenizer = transformers.TapasTokenizer.from_pretrained(model_name)
        if "finetuned" not in model_name:
            # Pre-trained encoder only: returns contextual embeddings.
            config = transformers.TapasConfig()
            config.select_one_column = False
            self.model = transformers.TapasModel.from_pretrained(model_name, config=config)
        else:
            # Fine-tuned checkpoint with cell selection and aggregation heads.
            self.model = transformers.TapasForQuestionAnswering.from_pretrained(model_name)
        self.model_name = model_name

![](https://drive.google.com/uc?export=view&id=1vB_1tPTInX5pzcAnrgwYGybi1EIwevGp)



In [4]:
class TapexModel:
    def __init__(self, model_name="microsoft/tapex-base"):
        """We will work with:
            - microsoft/tapex-base,
            - microsoft/tapex-base-finetuned-tabfact,
           Alternatively, use a larger model by replacing 'base' with 'large'
        """
        print("loading tapex model")
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)  # => instantiates the "TapexTokenizer" under the hood
        if "tabfact" in model_name:
            # Fine-tuned for fact verification: sequence classification head.
            self.model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
        else:
            # Pre-trained model: encoder-decoder (seq2seq).
            self.model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model_name = model_name

![](https://drive.google.com/uc?export=view&id=15FUqZi4pCSbl6Pb1SKQkprxpH3Ba3ntW)


In [5]:
# Instantiate base models (pre-trained, not fine-tuned), takes ~40s
tapas_base_model = TapasModel()
tapex_base_model = TapexModel()

loading tapas model


loading tapex model



In [6]:
# Instantiate fine-tuned models, takes ~40s
tapas_tuned_model = TapasModel(model_name="google/tapas-base-finetuned-wtq")
tapex_tuned_model = TapexModel(model_name="microsoft/tapex-base-finetuned-tabfact")

loading tapas model



loading tapex model



You passed along `num_labels=3` with an incompatible id to label map: {'0': 'Refused', '1': 'Entailed'}. The number of labels wil be overwritten to 2.



# INPUTS

Preprocessing turns raw table data into the vectorized (numeric) input that the models can work with.

We focus on TaPEx here.

### Raw inputs: table + questions

In [8]:
table = pd.DataFrame.from_dict(
    data = {
      "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
      "Age": ["56", "45", "59"],
      "Number of movies": ["87", "53", "69"],
    }
)
queries = [
    # First question in two different formulations
    "How old is Leonardo Di Caprio?",
    "What is Leonardo Di Caprio age?",
    # Second question in two different formulations
    "How many movies are there in total?",
    "What is the sum of the number of movies?",
    # Third question in two different formulations
    "How many movies has George Clooney played in?",
    "What is the number of movies that George Clooney played in?"
]

In [9]:
table

|   | Actors | Age | Number of movies |
|---|---|---|---|
| 0 | Brad Pitt | 56 | 87 |
| 1 | Leonardo Di Caprio | 45 | 53 |
| 2 | George Clooney | 59 | 69 |


## Serialized input

In [10]:
tapex_serialized = tapex_base_model.tokenizer.prepare_table_query(table, queries[0])
tapex_serialized

'How old is Leonardo Di Caprio? col : Actors | Age | Number of movies row 1 : Brad Pitt | 56 | 87 row 2 : Leonardo Di Caprio | 45 | 53 row 3 : George Clooney | 59 | 69'
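
The serialized string above follows a simple pattern: the query, then a `col :` header, then one `row i :` segment per row, with cells separated by ` | `. The sketch below reproduces that pattern in plain Python to make the format explicit. It is only an illustration inferred from the output shown here, not the tokenizer's own implementation, and the function names `linearize_table` and `serialize` are made up for this example.

```python
def linearize_table(columns, rows):
    """Flatten a table into the 'col : ... row 1 : ...' string format."""
    header = "col : " + " | ".join(columns)
    body = [
        f"row {i} : " + " | ".join(cells)
        for i, cells in enumerate(rows, start=1)  # rows are numbered from 1
    ]
    return " ".join([header, *body])


def serialize(query, columns, rows):
    """Prepend the query to the flattened table, separated by one space."""
    return f"{query} {linearize_table(columns, rows)}"


columns = ["Actors", "Age", "Number of movies"]
rows = [
    ["Brad Pitt", "56", "87"],
    ["Leonardo Di Caprio", "45", "53"],
    ["George Clooney", "59", "69"],
]
serialized = serialize("How old is Leonardo Di Caprio?", columns, rows)
print(serialized)
```

Because the whole table becomes part of one string, every separator (`col`, `:`, `|`, `row`, the row number) later costs tokens of its own.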

## Tokenized input

In [11]:
tapex_tokenized = tapex_base_model.tokenizer.tokenize(tapex_serialized)
tapex_tokenized

['how',
 'Ġold',
 'Ġis',
 'Ġle',
 'on',
 'ardo',
 'Ġdi',
 'Ġcap',
 'rio',
 '?',
 'Ġcol',
 'Ġ:',
 'Ġactors',
 'Ġ|',
 'Ġage',
 'Ġ|',
 'Ġnumber',
 'Ġof',
 'Ġmovies',
 'Ġrow',
 'Ġ1',
 'Ġ:',
 'Ġbr',
 'ad',
 'Ġp',
 'itt',
 'Ġ|',
 'Ġ56',
 'Ġ|',
 'Ġ87',
 'Ġrow',
 'Ġ2',
 'Ġ:',
 'Ġle',
 'on',
 'ardo',
 'Ġdi',
 'Ġcap',
 'rio',
 'Ġ|',
 'Ġ45',
 'Ġ|',
 'Ġ53',
 'Ġrow',
 'Ġ3',
 'Ġ:',
 'Ġge',
 'orge',
 'Ġclo',
 'oney',
 'Ġ|',
 'Ġ59',
 'Ġ|',
 'Ġ69']

Why the Ġ symbol?


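Ġ marks a token that is preceded by a space in the original text; a token without it continues the previous word (or starts the text).

When is a full word converted to a single token, and when is it split? -> Byte-Pair Encoding (BPE): "Encodes frequent words as a single token, while less frequent words are represented by multiple tokens, each of them representing a word part." Which splits occur depends on the merges learned from the tokenizer's training corpus; the vocabulary size limits how many merges are learned.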
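
To make the role of the Ġ marker concrete, the sketch below takes the first ten tokens from the output above and (a) restores the text and (b) groups word parts back into words. The helper names are hypothetical, and the simple string replacement is an assumption that only holds for plain ASCII text like this example; byte-level BPE uses further placeholder characters for other bytes.

```python
# First ten tokens of the tokenized query, copied from the output above.
tokens = ["how", "Ġold", "Ġis", "Ġle", "on", "ardo", "Ġdi", "Ġcap", "rio", "?"]


def detokenize(tokens):
    """Concatenate the tokens and turn each 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")


def group_into_words(tokens):
    """Group tokens into words: a token starting with 'Ġ' opens a new word."""
    words = []
    for token in tokens:
        if token.startswith("Ġ") or not words:
            words.append([token])
        else:
            words[-1].append(token)  # no marker: continues the previous word
    return words


text = detokenize(tokens)
words = group_into_words(tokens)
print(text)
print(words)
```

Frequent words such as `old` and `is` stay whole, while the rarer `leonardo` is assembled from three parts (`Ġle`, `on`, `ardo`).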

In [12]:
# We have 54 tokens.
len(tapex_tokenized)

54

## From tokens to numeric token IDs

In [13]:
tapex_vocab = tapex_base_model.tokenizer.get_vocab()

In [14]:
list(tapex_vocab.items())[:20]
# See <s> (start token), </s> (end token) and <pad> token.

[('<s>', 0),
 ('<pad>', 1),
 ('</s>', 2),
 ('<unk>', 3),
 ('.', 4),
 ('Ġthe', 5),
 (',', 6),
 ('Ġto', 7),
 ('Ġand', 8),
 ('Ġof', 9),
 ('Ġa', 10),
 ('Ġin', 11),
 ('-', 12),
 ('Ġfor', 13),
 ('Ġthat', 14),
 ('Ġon', 15),
 ('Ġis', 16),
 ('âĢ', 17),
 ("'s", 18),
 ('Ġwith', 19)]

In [15]:
len(tapex_vocab), tapex_base_model.tokenizer.vocab_size

(50265, 50265)

## Complete input

In [16]:
tapex_base_model.tokenizer(table, queries, padding="longest", return_attention_mask=True, return_special_tokens_mask=True)
# Notice the special token IDs at the start (<s> = 0) and end (</s> = 2) of each sequence, which delimit the query+table token IDs.
# They are needed to distinguish input tokens from, for example, padding (<pad> = 1).

{'input_ids': [[0, 9178, 793, 16, 2084, 261, 6782, 2269, 2927, 12834, 116, 11311, 4832, 5552, 1721, 1046, 1721, 346, 9, 4133, 3236, 112, 4832, 5378, 625, 181, 2582, 1721, 4772, 1721, 8176, 3236, 132, 4832, 2084, 261, 6782, 2269, 2927, 12834, 1721, 2248, 1721, 4268, 3236, 155, 4832, 5473, 26875, 42771, 6071, 1721, 5169, 1721, 5913, 2, 1, 1, 1, 1], [0, 12196, 16, 2084, 261, 6782, 2269, 2927, 12834, 1046, 116, 11311, 4832, 5552, 1721, 1046, 1721, 346, 9, 4133, 3236, 112, 4832, 5378, 625, 181, 2582, 1721, 4772, 1721, 8176, 3236, 132, 4832, 2084, 261, 6782, 2269, 2927, 12834, 1721, 2248, 1721, 4268, 3236, 155, 4832, 5473, 26875, 42771, 6071, 1721, 5169, 1721, 5913, 2, 1, 1, 1, 1], [0, 9178, 171, 4133, 32, 89, 11, 746, 116, 11311, 4832, 5552, 1721, 1046, 1721, 346, 9, 4133, 3236, 112, 4832, 5378, 625, 181, 2582, 1721, 4772, 1721, 8176, 3236, 132, 4832, 2084, 261, 6782, 2269, 2927, 12834, 1721, 2248, 1721, 4268, 3236, 155, 4832, 5473, 26875, 42771, 6071, 1721, 5169, 1721, 5913, 2, 1, 1, 1, 1,

In [17]:
# Understand token IDs, e.g. look up a token's ID by indexing the vocabulary.
tapex_vocab["how"]
# [0, 9178, ..., 2, 1, 1, 1, ..., 1]
# Our tokenized input starts with <s> (0), followed by 'how' (9178), and ends with </s> (2), followed by <pad> tokens (1).

9178

## Exercise 1: Find where the serialized table representation is in the inputs

In [24]:
# TODO
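
One possible approach, as a sketch rather than a reference solution: the serialized table always begins with `col :`, so the table span can be located by searching the token IDs for that marker and for the closing `</s>`. The IDs below are copied from the first sequence in the tokenizer output above; 11311 and 4832 are read off as the IDs of `Ġcol` and `Ġ:` by aligning that sequence with the token list. The function name `find_table_span` is hypothetical, and the search assumes the query itself does not contain `col :`.

```python
# input_ids of the first query, copied from the tokenizer output above.
input_ids = [
    0, 9178, 793, 16, 2084, 261, 6782, 2269, 2927, 12834, 116,
    11311, 4832, 5552, 1721, 1046, 1721, 346, 9, 4133,
    3236, 112, 4832, 5378, 625, 181, 2582, 1721, 4772, 1721, 8176,
    3236, 132, 4832, 2084, 261, 6782, 2269, 2927, 12834, 1721, 2248, 1721, 4268,
    3236, 155, 4832, 5473, 26875, 42771, 6071, 1721, 5169, 1721, 5913,
    2, 1, 1, 1, 1,
]

COL_ID, COLON_ID = 11311, 4832  # 'Ġcol' and 'Ġ:' (read off the outputs above)
EOS_ID = 2                      # '</s>'


def find_table_span(ids):
    """Return (start, end) so that ids[start:end] holds the table tokens."""
    for i in range(len(ids) - 1):
        if ids[i] == COL_ID and ids[i + 1] == COLON_ID:
            start = i
            break
    else:
        raise ValueError("no 'col :' marker found")
    end = ids.index(EOS_ID, start)  # the table runs up to the end token
    return start, end


start, end = find_table_span(input_ids)
print(start, end, end - start)
```

Positions before `start` (after `<s>`) hold the query; positions after `end` hold padding.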

# OUTPUTS

Given input+model, we inspect the outputs of the pre-trained model (embeddings) and the fine-tuned model for QA and fact verification.

We focus on TaPas here.

## Pre-trained embeddings

In [26]:
tapas_base_inputs = tapas_base_model.tokenizer(
    table,
    queries,
    padding="max_length",
    return_tensors="pt"
)
tapas_base_outputs = tapas_base_model.model(
    **tapas_base_inputs
)



In [27]:
tapas_base_inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [28]:
# The last_hidden_state holds the contextual embeddings.
tapas_base_outputs["last_hidden_state"].shape, tapas_base_outputs["last_hidden_state"][0].shape
# Notice the embedding dimensions:
# - 6   = the number of queries
# - 512 = number of tokens per query+table sequence, 0-padded to the maximum input length
# - 768 = embedding dimension

(torch.Size([6, 512, 768]), torch.Size([512, 768]))

In [29]:
tapas_base_outputs["last_hidden_state"][0].shape
# Notice:
# - This is the token-level embedding of the table paired with the first question/statement.
# - Row- or column-level embeddings require tracking which tokens belong to which row/column and aggregating their embeddings.

torch.Size([512, 768])

## Change padding strategy to "longest"

In [30]:
tapas_base_inputs = tapas_base_model.tokenizer(table, queries, padding="longest", return_tensors="pt")
tapas_base_outputs = tapas_base_model.model(**tapas_base_inputs)

tapas_base_outputs["last_hidden_state"].shape, tapas_base_outputs["last_hidden_state"][0].shape

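# We changed padding to 'longest', so the sequence length now reflects the longest input in the batch.
# The shape reduces to (6, 37, 768). Shorter inputs are still padded to 37 tokens.

# Notice that with TaPas the longest input has 37 tokens, whereas the TaPEx tokenization above needed 54 tokens for a single query.
# TaPas uses a different (WordPiece) tokenizer and does not spend tokens on separators such as "col :", "|" or "row 1 :";
# table structure is passed separately through token_type_ids.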

(torch.Size([6, 37, 768]), torch.Size([37, 768]))
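
The difference between the two padding strategies can be sketched without any model. The snippet below pads a toy batch either to the longest sequence in the batch or to a fixed maximum length, and builds the matching attention mask. The function `pad_batch` and the token IDs are hypothetical placeholders; truncation of over-long inputs is deliberately left out.

```python
def pad_batch(sequences, strategy="longest", max_length=512, pad_id=0):
    """Pad token ID lists to a common length and build attention masks."""
    if strategy == "longest":
        target = max(len(seq) for seq in sequences)  # longest input in batch
    elif strategy == "max_length":
        target = max_length                          # fixed model input size
    else:
        raise ValueError(f"unknown padding strategy: {strategy}")
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (target - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


batch = [[101, 7, 8, 102], [101, 7, 102]]  # made-up token IDs
ids_longest, mask_longest = pad_batch(batch, strategy="longest")
ids_fixed, mask_fixed = pad_batch(batch, strategy="max_length", max_length=8)
print(ids_longest, mask_longest)
print(len(ids_fixed[0]), len(ids_fixed[1]))
```

Padding to the longest input keeps tensors small, which is why the hidden state shrinks from 512 to 37 positions per query.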

Note that obtaining column-level or row-level embeddings requires aggregating the token-level embeddings!
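
A minimal sketch of such an aggregation is mean pooling: average all token vectors that belong to the same column (or row). The example uses plain Python lists and made-up 2-dimensional embeddings for readability. It assumes a per-token column index where 0 marks tokens outside the table; the TaPas tokenizer provides row and column indices inside `token_type_ids`, but their exact positions there should be checked against the documentation. `mean_pool_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def mean_pool_by_group(token_embeddings, group_ids, ignore_id=0):
    """Average the token vectors that share a group id (e.g. a column index)."""
    sums = {}
    counts = defaultdict(int)
    for vector, group in zip(token_embeddings, group_ids):
        if group == ignore_id:  # query, special or padding token
            continue
        if group not in sums:
            sums[group] = [0.0] * len(vector)
        sums[group] = [s + v for s, v in zip(sums[group], vector)]
        counts[group] += 1
    return {
        group: [s / counts[group] for s in total]
        for group, total in sums.items()
    }


# Made-up embeddings for five tokens; column 0 means "not part of the table".
token_embeddings = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
column_ids = [0, 1, 1, 2, 2]
column_embeddings = mean_pool_by_group(token_embeddings, column_ids)
print(column_embeddings)
```

The same function yields row-level embeddings when it is given row indices instead of column indices.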

In [31]:
# Remove objects that are no longer needed, as large models and tensors consume a lot of RAM
del tapas_base_model, tapas_base_inputs, tapas_base_outputs, tapex_base_model

## Fine-tuned outputs

1. Question Answering with TaPas
2. Fact-verification with TaPEx

## Question answering over tables with fine-tuned TaPas

In [32]:
tapas_tuned_inputs = tapas_tuned_model.tokenizer(table, queries, padding="longest", return_tensors="pt")
tapas_tuned_outputs = tapas_tuned_model.model(**tapas_tuned_inputs)

In [33]:
tapas_tuned_inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [34]:
tapas_tuned_outputs.keys()
# Notice, e.g. by reflecting on TaPas' training tasks as in the figures above:
# - logits = cell selection scores, one per token
# - logits_aggregation = aggregation operator scores, one per operator
# So what dimensions would these have?

odict_keys(['logits', 'logits_aggregation'])

In [35]:
tapas_tuned_outputs["logits"].shape, tapas_tuned_outputs["logits_aggregation"].shape

(torch.Size([6, 37]), torch.Size([6, 4]))

In [36]:
tapas_tuned_outputs["logits_aggregation"]

tensor([[ -1.1244, -16.3785,  15.8306, -19.0858],
        [ -1.6238, -16.6762,  15.7965, -19.0060],
        [ -5.8669, -12.5763, -13.3268,  12.8516],
        [ -4.0559,  40.0272,  -5.3592,  23.4002],
        [ -8.4079,  -9.4974, -10.9121,  12.9523],
        [ -1.0346, -16.1450,  16.5865, -19.6338]], grad_fn=<AddmmBackward0>)
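
These rows are raw scores, one per aggregation operator. A softmax turns each row into probabilities, and the argmax of each row gives the predicted operator index. The sketch below does both in plain Python on the values printed above (the notebook itself would operate on the tensor directly).

```python
import math

# Aggregation logits copied from the output above: 6 queries x 4 operators.
logits_aggregation = [
    [-1.1244, -16.3785, 15.8306, -19.0858],
    [-1.6238, -16.6762, 15.7965, -19.0060],
    [-5.8669, -12.5763, -13.3268, 12.8516],
    [-4.0559, 40.0272, -5.3592, 23.4002],
    [-8.4079, -9.4974, -10.9121, 12.9523],
    [-1.0346, -16.1450, 16.5865, -19.6338],
]


def softmax(row):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


probabilities = [softmax(row) for row in logits_aggregation]
predicted_indices = [argmax(row) for row in logits_aggregation]
print(predicted_indices)
```

The gaps between the logits are large, so each softmax row puts almost all of its mass on a single operator.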

In [37]:
predicted_answer_coordinates, predicted_aggregation_indices = tapas_tuned_model.tokenizer.convert_logits_to_predictions(
    tapas_tuned_inputs, tapas_tuned_outputs.logits.detach(), tapas_tuned_outputs.logits_aggregation.detach()
)

# Notice the following things:
# - We use the tokenizer again to convert logits -> per-token probabilities; cells with high probability form the answer.
# - logits_aggregation -> one predicted aggregation operator index per query (the highest-scoring operator).
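
The step from token logits to cell coordinates can be pictured roughly as follows. This is a simplified sketch, not the library's code. It assumes that every token carries a 1-based row and column id, with 0 marking query, special and header tokens, and that a cell is selected when the mean probability of its tokens exceeds 0.5. All logits and ids below are made up, and `select_cells` is a hypothetical helper.

```python
import math
from collections import defaultdict


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    return math.exp(x) / (1.0 + math.exp(x))


def select_cells(token_logits, row_ids, column_ids, threshold=0.5):
    """Average token probabilities per cell and keep cells above threshold."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for logit, row, col in zip(token_logits, row_ids, column_ids):
        if row == 0 or col == 0:  # query, special or header token
            continue
        cell = (row - 1, col - 1)  # convert to 0-based table coordinates
        sums[cell] += sigmoid(logit)
        counts[cell] += 1
    return sorted(c for c in sums if sums[c] / counts[c] > threshold)


# Made-up example: six tokens, two of which form the cell in row 2, column 2.
token_logits = [-9.0, -8.0, -7.0, 6.0, 4.0, -5.0]
row_ids = [0, 0, 1, 2, 2, 3]
column_ids = [0, 1, 1, 2, 2, 3]
selected = select_cells(token_logits, row_ids, column_ids)
print(selected)
```

Averaging over a cell's tokens means a multi-token cell is selected or rejected as a whole.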

In [38]:
print(predicted_answer_coordinates, "\n", predicted_aggregation_indices)
# Notice:
# - The lists of answer coordinates (row, column) differ in length, reflecting the number of cells selected for the answer.
# - The aggregation indices (one number per query) are inconsistent within the second and third pairs of semantically similar questions.

[[(1, 1)], [(1, 1)], [(0, 2), (1, 2), (2, 2)], [(0, 2), (1, 2), (2, 2)], [(2, 2)], [(2, 2)]] 
 [2, 2, 3, 1, 3, 2]


In [39]:
# The aggregation operator indices are as follows (as used when training TaPas)
id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
aggregation_predictions_string
# Ideally, the aggregation would be the same within each pair: queries 0 and 1, 2 and 3, 4 and 5.

['AVERAGE', 'AVERAGE', 'COUNT', 'SUM', 'COUNT', 'AVERAGE']

In [40]:
queries

['How old is Leonardo Di Caprio?',
 'What is Leonardo Di Caprio age?',
 'How many movies are there in total?',
 'What is the sum of the number of movies?',
 'How many movies has George Clooney played in?',
 'What is the number of movies that George Clooney played in?']

In [41]:
# Extracting the relevant cell values and aggregator
answers = []
for coordinates in predicted_answer_coordinates:
    if len(coordinates) == 1:
        # only a single cell selected:
        answers.append(table.iat[coordinates[0]])
    else:
        # multiple cells selected
        cell_values = []
        for coordinate in coordinates:
            cell_values.append(table.iat[coordinate])
        answers.append(", ".join(cell_values))

display(table, '')
for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
    print(query)
    if predicted_agg == "NONE":
        print("Predicted answer: " + answer)
    else:
        print("Predicted answer: " + predicted_agg + " = " + answer)

|   | Actors | Age | Number of movies |
|---|---|---|---|
| 0 | Brad Pitt | 56 | 87 |
| 1 | Leonardo Di Caprio | 45 | 53 |
| 2 | George Clooney | 59 | 69 |


''

How old is Leonardo Di Caprio?
Predicted answer: AVERAGE = 45
What is Leonardo Di Caprio age?
Predicted answer: AVERAGE = 45
How many movies are there in total?
Predicted answer: COUNT = 87, 53, 69
What is the sum of the number of movies?
Predicted answer: SUM = 87, 53, 69
How many movies has George Clooney played in?
Predicted answer: COUNT = 69
What is the number of movies that George Clooney played in?
Predicted answer: AVERAGE = 69


For the above answers:
- "How many {X}?" questions seem to resort to the "COUNT" operator, which counts the selected cells rather than reading or summing their values. Applied here, it would return 3 for the total number of movies and 1 for George Clooney's movies, instead of 209 and 69.
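
The printed answers only name the operator and the selected cells; the operator still has to be applied to obtain a final value. A minimal sketch of that last step, assuming the selected cells hold numeric strings as in this table (`apply_aggregation` is a hypothetical helper, not part of the library):

```python
def apply_aggregation(operator, cell_values):
    """Apply a predicted aggregation operator to the selected cell values."""
    if operator == "NONE":
        return ", ".join(cell_values)  # no aggregation: return cells as-is
    if operator == "COUNT":
        return len(cell_values)        # counts cells, ignores their values
    numbers = [float(value) for value in cell_values]
    if operator == "SUM":
        return sum(numbers)
    if operator == "AVERAGE":
        return sum(numbers) / len(numbers)
    raise ValueError(f"unknown aggregation operator: {operator}")


# (operator, selected cells) pairs taken from the predictions above.
predictions = [
    ("AVERAGE", ["45"]),
    ("AVERAGE", ["45"]),
    ("COUNT", ["87", "53", "69"]),
    ("SUM", ["87", "53", "69"]),
    ("COUNT", ["69"]),
    ("AVERAGE", ["69"]),
]
final_answers = [apply_aggregation(op, cells) for op, cells in predictions]
print(final_answers)
```

AVERAGE over a single cell returns that cell's value, so the first, second and last answers come out right despite the unexpected operator, while both COUNT predictions return the number of selected cells.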


## Fact-verification with TaPEx

In [42]:
table

|   | Actors | Age | Number of movies |
|---|---|---|---|
| 0 | Brad Pitt | 56 | 87 |
| 1 | Leonardo Di Caprio | 45 | 53 |
| 2 | George Clooney | 59 | 69 |


In [43]:
statement_false = "George Clooney plays in 30 movies"
statement_true = "George Clooney plays in 69 movies"

In [51]:
inputs = tapex_tuned_model.tokenizer(table, [statement_false, statement_true], return_tensors="pt")
outputs = tapex_tuned_model.model(**inputs)

In [52]:
inputs.keys(), outputs.keys()

(dict_keys(['input_ids', 'attention_mask']),
 odict_keys(['logits', 'past_key_values', 'encoder_last_hidden_state']))

In [53]:
inputs["input_ids"].shape, outputs["logits"].shape, outputs["encoder_last_hidden_state"].shape

(torch.Size([2, 54]), torch.Size([2, 2]), torch.Size([2, 54, 768]))

In [54]:
outputs["logits"][0].shape

torch.Size([2])

In [57]:
# Extract the predictions: argmax over the class dimension (last axis) of the [batch, classes] logits
predicted_class_idxs = outputs.logits.argmax(dim=-1)
for predicted_class_idx in predicted_class_idxs:
    print(tapex_tuned_model.model.config.id2label[predicted_class_idx.item()])

Refused
Entailed
