docs: add what is cross multi modal
hanxiao committed Aug 12, 2022
1 parent 731d492 commit 9615e5b
Showing 3 changed files with 106 additions and 1 deletion.
100 changes: 100 additions & 0 deletions docs/get-started/what-is.md
@@ -0,0 +1,100 @@
# What is Cross-modal and Multi-modal?

The term "Modal" is the shorthand for "Data Modality". Data modality can be thought of as the "type" of data. For example, a tweet is a modal of type "text"; a photo is a modal of type "image"; a video is a modal of type "video"; etc.

Classical machine learning applications usually focus on a single modality at a time. For example, a spam filter works on the text modality, a photo classifier on the image modality, and a music recommender on the audio modality.

However, in the real world, data is often multi-modal: it consists of multiple modalities at once. For example, a tweet often contains not only text but also images, videos, and links; a video contains not only frames but also audio and text (e.g., subtitles).
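
As a concrete illustration, a multi-modal document can be modeled as one top-level object with one chunk per modality. Below is a minimal sketch using the `Document` type from [DocArray](https://github.com/jina-ai/docarray) (more on DocArray below); the tweet content and image URI are made-up placeholders:

```python
from docarray import Document

# a made-up tweet carrying two modalities: text and image
tweet = Document(
    chunks=[
        Document(text='What a sunset tonight!'),         # text modality
        Document(uri='https://example.com/sunset.jpg'),  # image modality
    ]
)

for chunk in tweet.chunks:
    print(chunk.text or chunk.uri)
```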

**Multi-modal** machine learning is a relatively new field concerned with developing algorithms that can learn from multiple modalities of data.

**Cross-modal** machine learning is a subfield of multi-modal machine learning concerned with algorithms that can learn from multiple modalities that are not necessarily aligned. For example, learning jointly from images and text that are not necessarily about the same thing.

Thanks to recent advances in deep neural networks, a cross-modal or multi-modal system can go far beyond a single modality. It enables advanced intelligence on all kinds of unstructured data: images, audio, video, PDFs, 3D meshes, you name it.

## Applications

There are many potential applications of cross-modal machine learning. For example, it can automatically generate descriptions of images (e.g., for visually impaired users); a search system can use it to retrieve images from text queries (e.g., "find me a picture of a dog"); and a text-to-image system can use it to generate images from text descriptions (e.g., "generate an image of a dog").

In particular, there are two families of applications: neural search and creative AI.

### Neural Search

One of the most promising applications of cross-modal machine learning is neural search. The core idea of neural search is to leverage state-of-the-art deep neural networks to build every component of a search system. In short, **neural search is deep neural network-powered information retrieval**. In academia, it’s often called neural IR.


Below is an example of an image embedding space generated by [DocArray](https://github.com/jina-ai/docarray) (the data structure behind Jina) and used for content-based image retrieval. Notice how similar images are mapped close together in the embedding space.

```{figure} https://github.com/jina-ai/docarray/raw/main/.github/README-img/tsne.gif?raw=true
```

Searching is as simple as:

```python
db = ...  # a DocumentArray of indexed images
queries = ...  # a DocumentArray of query images

# attach the top-9 nearest neighbors from `db` to each query document
queries.match(db, limit=9)
for d in queries:
    for m in d.matches:
        print(d.uri, m.uri, m.scores['cosine'].value)
```

```console
left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...
```

Neural search is particularly well suited to cross-modal search tasks, because it can learn to map the features of one modality (e.g., text) to the features of another modality (e.g., images). This enables neural search engines to search for documents and images by text queries, and to search for text documents by image queries.
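
To make this concrete, here is a minimal sketch of cross-modal retrieval in a shared embedding space. The vectors are made-up placeholders standing in for the outputs of a jointly trained text encoder and image encoder (e.g., a CLIP-style model); nothing here is Jina-specific:

```python
import numpy as np

# made-up embeddings standing in for encoder outputs in a shared space
image_vecs = {
    'dog.jpg': np.array([0.9, 0.1, 0.0]),
    'cat.jpg': np.array([0.1, 0.9, 0.0]),
    'car.jpg': np.array([0.0, 0.1, 0.9]),
}
text_vec = np.array([0.8, 0.2, 0.1])  # pretend encoding of "a photo of a dog"

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

# rank images by similarity to the text query: the essence of text-to-image search
for name, vec in sorted(image_vecs.items(), key=lambda kv: -cos(text_vec, kv[1])):
    print(name, round(cos(text_vec, vec), 4))
```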


#### Think outside the (search) box

Many neural search-powered applications do not have a search box:

- A question-answering chatbot can be powered by neural search: by first indexing all hard-coded QA pairs and then semantically mapping user dialog to those pairs (see the sketch after this list).

- A smart speaker can be powered by neural search: by applying STT (speech-to-text) and semantically mapping text to internal commands.

- A recommendation system can be powered by neural search: by embedding user-item information into vectors and finding top-K nearest neighbours of a user/item.
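
Below is a toy sketch of the question-answering pattern from the first bullet above. The `embed` function is a stand-in for a real neural sentence encoder, and the QA pairs are made up; this illustrates the indexing-and-mapping idea, not Jina's API:

```python
import numpy as np

QA_PAIRS = {
    'How do I reset my password?': 'Go to Settings > Account > Reset password.',
    'Where is my order?': 'Check the tracking link in your confirmation email.',
}

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a neural sentence encoder: hashed bag-of-words."""
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    return v

# index all hard-coded QA pairs once
index = {q: embed(q) for q in QA_PAIRS}

def answer(utterance: str) -> str:
    """Semantically map the user's utterance to the closest indexed question."""
    v = embed(utterance)
    best = max(index, key=lambda q: float(v @ index[q]) /
               (float(np.linalg.norm(v) * np.linalg.norm(index[q])) or 1.0))
    return QA_PAIRS[best]

print(answer('how can I reset the password'))
```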

Neural search creates a new way to comprehend the world. It opens new doors that lead to new businesses.

### Creative AI

Another potential application of cross-modal machine learning is creative AI. Creative AI systems use artificial intelligence to generate new content, such as images, videos, or text. For example, OpenAI's GPT-3 is a large language model that generates text. It is trained on a large corpus of text, such as books, articles, and websites; once trained, it can generate new text that resembles its training data. This can be used to produce new articles, stories, or even poems.
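
As an illustration (outside of Jina), calling GPT-3 through OpenAI's Python client looks roughly like the sketch below; the model name, parameters, and placeholder API key are assumptions:

```python
import openai  # pip install openai

openai.api_key = '<your-api-key>'  # placeholder, not a real key

# ask a GPT-3 model (assumed name) to continue a prompt
completion = openai.Completion.create(
    engine='text-davinci-002',
    prompt='Write a short poem about the sea.',
    max_tokens=64,
)
print(completion.choices[0].text)
```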

OpenAI's DALL·E is another example of a creative AI system: it generates images from textual descriptions. For example, given the text "a black cat with green eyes", the system generates an image of a black cat with green eyes. Below is an example of generating images from a text prompt using [DALL·E Flow](https://github.com/jina-ai/dalle-flow) (a text-to-image system built on top of Jina).


```python
from docarray import Document

server_url = 'grpc://dalle-flow.jina.ai:51005'
prompt = 'an oil painting of a humanoid robot playing chess in the style of Matisse'

# send the prompt to the DALL·E Flow server and request 8 candidate images
doc = Document(text=prompt).post(server_url, parameters={'num_images': 8})
da = doc.matches  # the generated images come back as matches

da.plot_image_sprites(fig_size=(10, 10), show_index=True)
```

```{figure} https://github.com/jina-ai/dalle-flow/raw/main/.github/client-dalle.png?raw=true
```

Creative AI holds great potential for the future. It has the potential to revolutionize how we interact with machines, helping us create more personalized experiences:

- It can create realistic 3D images and videos of people and objects for movies, video games, and other visual media.

- It can generate realistic, natural-sounding dialogue for movies, video games, and other forms of entertainment.

- It can produce new and innovative designs for products in manufacturing and other industries.

- It can create new and innovative marketing campaigns for advertising and other industries.

## What's next?

In the next chapter, we will explain why Jina is the perfect framework for building neural search, creative AI, and any other cross-modal or multi-modal application.
4 changes: 4 additions & 0 deletions docs/get-started/why-jina.md
@@ -0,0 +1,4 @@
# Why Jina and Cloud-Native?



3 changes: 2 additions & 1 deletion docs/index.md
@@ -60,9 +60,10 @@ If you'd like to opt out of usage statistics, make sure to add the `--optout-tel
:caption: Get Started
:hidden:
get-started/what-is
fundamentals/architecture-overview
get-started/install/index
get-started/create-app
fundamentals/architecture-overview
```

```{toctree}