Add Multimodal heading and Document question answering in task_summary.mdx (#23318)

* add multimodal heading and docqa

* fix sentence

* task_summary data type = modality clarification

* change the multimodal example to a smaller model
y3sar committed Jul 17, 2023
1 parent 38dfb86 commit 4f08887
Showing 1 changed file with 27 additions and 0 deletions.
docs/source/en/task_summary.md
@@ -311,4 +311,31 @@ There are two types of language modeling:
'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}]
```

## Multimodal

Multimodal tasks require a model to process multiple data modalities (text, image, audio, video) to solve a particular problem. Image captioning is an example of a multimodal task where the model takes an image as input and outputs a sequence of text describing the image or some of its properties.
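
For instance, image captioning can be tried directly with the [`pipeline`]. The sketch below is only illustrative; the `nlpconnect/vit-gpt2-image-captioning` checkpoint and the COCO image URL are placeholder choices, and the generated caption depends on the model:

```py
>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # example checkpoint; any image captioning model works here
>>> captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
>>> captioner(image)  # returns a list like [{'generated_text': '...'}]
```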

Although multimodal models work with different data types or modalities, internally, the preprocessing steps convert all of them into embeddings (vectors, or lists of numbers, that hold meaningful information about the data). For a task like image captioning, the model learns relationships between image embeddings and text embeddings.
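
In 🤗 Transformers, this modality-specific preprocessing is typically bundled in a processor. A minimal sketch, assuming the `Salesforce/blip-image-captioning-base` checkpoint and reusing the image from above, shows an image and a text prompt being turned into tensors the model can embed:

```py
>>> from transformers import AutoProcessor

>>> # example multimodal checkpoint; its processor wraps an image processor (for pixels) and a tokenizer (for text)
>>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
>>> inputs = processor(images=image, text="a photo of", return_tensors="pt")
>>> sorted(inputs.keys())
['attention_mask', 'input_ids', 'pixel_values']
```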

### Document question answering

Document question answering is a task that answers natural language questions about a document. Unlike token-level question answering, which takes text as input, document question answering takes an image of a document as input along with a question about that document and returns an answer. Document question answering can be used to parse structured documents and extract key information from them. In the example below, the total amount and change due can be extracted from a receipt.

```py
>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> url = "https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/2/image/image.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> doc_question_answerer = pipeline("document-question-answering", model="magorshunov/layoutlm-invoices")
>>> preds = doc_question_answerer(
... question="What is the total amount?",
... image=image,
... )
>>> preds
[{'score': 0.8531239628791809, 'answer': '17,000', 'start': 4, 'end': 4}]
```

Hopefully, this page has given you some more background information about all the types of tasks in each modality and the practical importance of each one. In the next [section](tasks_explained), you'll learn **how** 🤗 Transformers work to solve these tasks.
