# TAPAS Table Question Answering: Application Development

## Model Introduction

TAPAS (Table Parser) is a model for question answering over structured data such as tables. Developed by Google Research and based on the BERT architecture, it takes a table directly as input and answers questions about its contents.

Paper: https://arxiv.org/abs/2004.02349

AI Gallery project: https://pangu.huaweicloud.com/gallery/asset-detail.html?id=69dbd529-93e4-4a06-ba4f-242c5e82b56c

## Environment Setup
1. python=3.9
2. mindnlp=0.4.0
3. pandas=2.2.3
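
Before running the notebook it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the standard library; the helper name `check_versions` and the `REQUIRED` dictionary are illustrative and not part of mindnlp or pandas.

```python
from importlib import metadata

# Versions listed in this section; adjust if your environment differs.
REQUIRED = {"mindnlp": "0.4.0", "pandas": "2.2.3"}


def check_versions(required):
    """Return {package: (installed_version_or_None, matches_required)}."""
    report = {}
    for package, wanted in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            installed = None
        report[package] = (installed, installed == wanted)
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_versions(REQUIRED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: installed={installed}, required={REQUIRED[package]} -> {status}")
```

A mismatch does not necessarily mean the notebook will fail; these are simply the versions this walkthrough was written against.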

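## Importing the Libraries
Pandas is an open-source Python data processing library that provides efficient, convenient data structures and analysis tools. It is widely used for data cleaning, processing, analysis, and visualization. Its DataFrame structure is the form in which the TAPAS tokenizer accepts a table. mindnlp is built on the MindSpore framework and focuses on natural language processing; it provides a rich set of pretrained NLP models and data processing tools.
Version requirements: mindnlp=0.4.0, pandas=2.2.3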

In [1]:
from mindnlp.transformers import TapasTokenizer, TapasForQuestionAnswering
import pandas as pd



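## Loading the Pretrained Model
Load the pretrained TAPAS model and tokenizer. Checkpoints of different sizes, fine-tuned on different datasets, are available, for example tapas-large-finetuned-wtq and tapas-base-finetuned-sqa. This walkthrough uses the base-size checkpoint fine-tuned on the WTQ dataset. Download time depends on the model size, so the first load may take a while.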

In [None]:
model_name = "google/tapas-base-finetuned-wtq"
model = TapasForQuestionAnswering.from_pretrained(model_name)
tokenizer = TapasTokenizer.from_pretrained(model_name,clean_up_tokenization_spaces=True)
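
To switch between the checkpoints mentioned above without retyping the full name, a small helper can build the identifier. This is only a sketch: the function `tapas_checkpoint` is hypothetical, and it deliberately covers just the sizes and datasets named in this section rather than every published variant.

```python
# Sizes and datasets named in the text above; other combinations are not assumed here.
KNOWN_SIZES = ("base", "large")
KNOWN_DATASETS = ("wtq", "sqa")


def tapas_checkpoint(size="base", dataset="wtq"):
    """Build a checkpoint name such as 'google/tapas-base-finetuned-wtq'."""
    if size not in KNOWN_SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {KNOWN_SIZES}")
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}, expected one of {KNOWN_DATASETS}")
    return f"google/tapas-{size}-finetuned-{dataset}"


print(tapas_checkpoint())                # google/tapas-base-finetuned-wtq
print(tapas_checkpoint("large", "sqa"))  # google/tapas-large-finetuned-sqa
```

The returned string can be passed as `model_name` to `from_pretrained`. Checkpoints fine-tuned on SQA are, as far as I know, trained for cell selection only, so the aggregation handling in the result parsing step may not apply to them.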


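## Preparing the Table Data
The table is written as a JSON-style Python dictionary and converted with pandas into the DataFrame the tokenizer accepts. Every cell value must be a string. The example table holds information about books. The questions about the table are defined in the same cell. TAPAS supports two kinds of task: cell selection, where the answer is the content of one or more cells, and aggregation, where the answer is a number computed from the selected cells. Of the three questions in this example, the first two are cell-selection questions and the third is an aggregation question.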

In [24]:
data = {
    'Title': [
        'The Great Gatsby', '1984', 'To Kill a Mockingbird', 'Pride and Prejudice', 'The Catcher in the Rye',
        'War and Peace', 'The Odyssey', 'Crime and Punishment', 'The Brothers Karamazov',
        'One Hundred Years of Solitude', 'Brave New World', 'The Lord of the Rings', 'Animal Farm', 'Fahrenheit 451',
        'The Grapes of Wrath', 'Catch-22', 'The Hobbit', 'Jane Eyre', 'Wuthering Heights',
        'Gone with the Wind', 'The Scarlet Letter', 'The Adventures of Huckleberry Finn', 'Dracula', 'Frankenstein',
        'The Picture of Dorian Gray', 'Anna Karenina', 'Les Misérables', 'Great Expectations', 'A Tale of Two Cities',
        'The Count of Monte Cristo', 'Don Quixote', 'Middlemarch', 'The Iliad', 'The Sound and the Fury',
        'The Sun Also Rises', 'Slaughterhouse-Five', 'Beloved', 'The Color Purple', 'The Handmaid\'s Tale',
        'The Road', 'The Alchemist', 'Life of Pi', 'The Kite Runner', 'A Thousand Splendid Suns'
    ],
    'Author': [
        'F. Scott Fitzgerald', 'George Orwell', 'Harper Lee', 'Jane Austen', 'J.D. Salinger',
        'Leo Tolstoy', 'Homer', 'Fyodor Dostoevsky', 'Fyodor Dostoevsky',
        'Gabriel García Márquez', 'Aldous Huxley', 'J.R.R. Tolkien', 'George Orwell', 'Ray Bradbury',
        'John Steinbeck', 'Joseph Heller', 'J.R.R. Tolkien', 'Charlotte Brontë', 'Emily Brontë',
        'Margaret Mitchell', 'Nathaniel Hawthorne', 'Mark Twain', 'Bram Stoker', 'Mary Shelley',
        'Oscar Wilde', 'Leo Tolstoy', 'Victor Hugo', 'Charles Dickens', 'Charles Dickens',
        'Alexandre Dumas', 'Miguel de Cervantes', 'George Eliot', 'Homer', 'William Faulkner',
        'Ernest Hemingway', 'Kurt Vonnegut', 'Toni Morrison', 'Alice Walker', 'Margaret Atwood',
        'Cormac McCarthy', 'Paulo Coelho', 'Yann Martel', 'Khaled Hosseini', 'Khaled Hosseini'
    ],
    'Year': [
        '1925', '1949', '1960', '1813', '1951',
        '1869', '-800', '1866', '1880',
        '1967', '1932', '1954', '1945', '1953',
        '1939', '1961', '1937', '1847', '1847',
        '1936', '1850', '1884', '1897', '1818',
        '1890', '1877', '1862', '1861', '1859',
        '1844', '1605', '1871', '-750', '1929',
        '1926', '1969', '1987', '1982', '1985',
        '2006', '1988', '2001', '2003', '2007'
    ],
    'Category': [
        'Fiction', 'Dystopian', 'Fiction', 'Classic', 'Fiction',
        'Historical', 'Epic', 'Philosophical', 'Philosophical',
        'Magical Realism', 'Dystopian', 'Fantasy', 'Satire', 'Dystopian',
        'Historical', 'Satire', 'Fantasy', 'Gothic', 'Gothic',
        'Historical', 'Classic', 'Adventure', 'Horror', 'Gothic',
        'Gothic', 'Historical', 'Historical', 'Classic', 'Classic',
        'Adventure', 'Classic', 'Classic', 'Epic', 'Fiction',
        'Fiction', 'Satire', 'Fiction', 'Fiction', 'Dystopian',
        'Fiction', 'Fiction', 'Adventure', 'Fiction', 'Fiction'
    ]
}
table = pd.DataFrame.from_dict(data)

questions = ["Who is the author of The Lord of the Rings?","which book published on 1987?","How many books belonging to Adventure in sum?"]
table


First 10 rows of the 44-row table:

|   | Title | Author | Year | Category |
|---|---|---|---|---|
| 0 | The Great Gatsby | F. Scott Fitzgerald | 1925 | Fiction |
| 1 | 1984 | George Orwell | 1949 | Dystopian |
| 2 | To Kill a Mockingbird | Harper Lee | 1960 | Fiction |
| 3 | Pride and Prejudice | Jane Austen | 1813 | Classic |
| 4 | The Catcher in the Rye | J.D. Salinger | 1951 | Fiction |
| 5 | War and Peace | Leo Tolstoy | 1869 | Historical |
| 6 | The Odyssey | Homer | -800 | Epic |
| 7 | Crime and Punishment | Fyodor Dostoevsky | 1866 | Philosophical |
| 8 | The Brothers Karamazov | Fyodor Dostoevsky | 1880 | Philosophical |
| 9 | One Hundred Years of Solitude | Gabriel García Márquez | 1967 | Magical Realism |
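
Because the tokenizer expects every cell to be a string, data loaded from elsewhere (for example with integer years) needs converting first. The sketch below shows one way to do that on a plain column dictionary before building the DataFrame; `stringify_columns` is a hypothetical helper and the `raw` data is made up for illustration.

```python
def stringify_columns(columns):
    """Return a copy of a column dict with every cell converted to str."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # A DataFrame built from a dict needs columns of equal length.
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return {
        name: ["" if value is None else str(value) for value in values]
        for name, values in columns.items()
    }


# Hypothetical input where the years are integers rather than strings.
raw = {"Title": ["Beloved", "The Road"], "Year": [1987, 2006]}
clean = stringify_columns(raw)
print(clean)  # {'Title': ['Beloved', 'The Road'], 'Year': ['1987', '2006']}
```

The cleaned dictionary can then be passed to `pd.DataFrame.from_dict(clean)`. For a DataFrame that already exists, `table.astype(str)` has a similar effect.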


## Model Inference
Encode the table and the questions with the tokenizer, then run the model on the encoded inputs.

In [25]:
# Encode the table once per question; `questions` is the list defined in the previous cell
inputs = tokenizer(table=table, queries=questions, return_tensors="ms", padding="max_length")

input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
token_type_ids = inputs["token_type_ids"]

outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
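
The `padding="max_length"` argument asks the tokenizer to pad every encoded sequence to a fixed length and to mark the padded positions in the attention mask. The sketch below illustrates that idea in plain Python. It is not the tokenizer's implementation: the default length of 512, the padding id of 0, and the example token ids are assumptions made for illustration.

```python
def pad_to_max_length(token_ids, max_length=512, pad_id=0):
    """Pad a list of token ids and build the matching attention mask.

    max_length=512 and pad_id=0 are assumed defaults for a BERT-style vocabulary.
    """
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids, padded to a short length so the result is easy to read.
ids, mask = pad_to_max_length([101, 2040, 102], max_length=8)
print(ids)   # [101, 2040, 102, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

Padding all questions to the same length is what lets the three encoded questions be stacked into a single batch.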




## Parsing the Results
Parse the model's predictions and answer each question.

In [26]:
# Convert the logits into cell coordinates and aggregation operator indices
predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
    inputs,
    outputs.logits,
    outputs.logits_aggregation
)

# Look up the text of the selected cells
answers = []
for coordinates in predicted_answer_coordinates:
    if len(coordinates) == 1:
        # Single-cell answer: coordinates[0] is a (row, column) pair
        answers.append(table.iat[coordinates[0]])
    else:
        # Multi-cell answer
        answers.append([table.iat[coordinate] for coordinate in coordinates])

# Aggregation operator index for each question; for WTQ checkpoints
# 0 = NONE, 1 = SUM, 2 = AVERAGE, 3 = COUNT
numbers = list(predicted_aggregation_indices)

# Print the answers
for i in range(len(questions)):
    print("Question:", questions[i])
    # A non-zero index means the question was treated as an aggregation question
    if numbers[i] != 0:
        print("Answer:", numbers[i], answers[i])
    else:
        print("Answer:", answers[i])

Question: Who is the author of The Lord of the Rings?
Answer: J.R.R. Tolkien
Question: which book published on 1987?
Answer: Beloved
Question: How many books belonging to Adventure in sum?
Answer: 3 ['The Adventures of Huckleberry Finn', 'The Count of Monte Cristo', 'Life of Pi']
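
Note that the `3` printed for the last question is the index of the predicted aggregation operator (COUNT for WTQ checkpoints), which here happens to equal the number of selected cells. To obtain the final value in general, the operator has to be applied to the selected cells. The following is a minimal sketch of that step; `apply_aggregation` is a hypothetical helper, and the operator order in `ID2AGGREGATION` is assumed to be the one used by WTQ fine-tuned checkpoints.

```python
# Assumed operator order for WTQ fine-tuned checkpoints.
ID2AGGREGATION = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}


def apply_aggregation(aggregation_index, cell_values):
    """Combine the selected cells according to the predicted operator."""
    if isinstance(cell_values, str):
        # A single-cell answer may arrive as a bare string.
        cell_values = [cell_values]
    operator = ID2AGGREGATION[int(aggregation_index)]
    if operator == "NONE":
        return ", ".join(cell_values)
    if operator == "COUNT":
        return len(cell_values)
    # SUM and AVERAGE only make sense for numeric cells.
    values = [float(value) for value in cell_values]
    if not values:
        return None
    total = sum(values)
    return total if operator == "SUM" else total / len(values)


adventure = ["The Adventures of Huckleberry Finn", "The Count of Monte Cristo", "Life of Pi"]
print(apply_aggregation(3, adventure))          # 3
print(apply_aggregation(0, "J.R.R. Tolkien"))   # J.R.R. Tolkien
print(apply_aggregation(1, ["1925", "1949"]))   # 3874.0
```

In the notebook this would be called as `apply_aggregation(numbers[i], answers[i])` inside the printing loop, in place of printing the raw index.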
