# What is Cross-modal and Multi-modal?

The term "modal" is shorthand for "data modality". A data modality can be thought of as the "type" of the data: a tweet is of the text modality, a photo is of the image modality, a video clip is of the video modality, and so on.

Classical machine learning applications usually focus on a single modality at a time: a spam filter works on text, a photo classifier on images, a music recommender on audio.

However, in the real world, data is often multimodal, meaning that it consists of multiple modalities. For example, a tweet often contains not only text, but also images, videos, and links. A video often contains not only video frames, but also audio and text (e.g., subtitles).

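As a toy illustration in plain Python (the field names here are hypothetical, chosen just for this sketch), a multimodal tweet can be modelled as one record holding several modalities:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tweet:
    """A single multimodal data point: one record, several modalities."""

    text: str                                            # text modality
    image_uris: List[str] = field(default_factory=list)  # image modality
    video_uri: Optional[str] = None                      # video modality
    links: List[str] = field(default_factory=list)


tweet = Tweet(
    text='Our new robot playing chess!',
    image_uris=['robot.jpg'],
    links=['https://example.com/demo'],
)

# List which modalities this particular tweet carries
modalities = [m for m, present in [
    ('text', bool(tweet.text)),
    ('image', bool(tweet.image_uris)),
    ('video', tweet.video_uri is not None),
] if present]
print(modalities)  # ['text', 'image']
```

A multi-modal system has to handle all of these fields together, rather than treating each one in isolation.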
**Multi-modal** machine learning is a relatively new field concerned with developing algorithms that can learn from multiple modalities of data.

**Cross-modal** machine learning is a subfield of multi-modal machine learning concerned with relating modalities that are not necessarily aligned, for example learning a shared representation of images and text so that one modality can be queried with the other.

Thanks to recent advances in deep neural networks, a cross-modal or multi-modal system can go far beyond a single modality, enabling advanced intelligence on all kinds of unstructured data: images, audio, video, PDF, 3D mesh, you name it.

## Applications

There are many potential applications of cross-modal machine learning. For example, it can automatically generate descriptions of images (e.g., alt text for visually impaired users); a search system can retrieve images from text queries (e.g., "find me a picture of a dog"); and a text-to-image system can generate images from text descriptions (e.g., "generate an image of a dog").

In particular, there are two families of applications: neural search and creative AI.

### Neural Search

One of the most promising applications of cross-modal machine learning is neural search. The core idea of neural search is to leverage state-of-the-art deep neural networks to build every component of a search system. In short, **neural search is deep neural network-powered information retrieval**. In academia, it's often called neural IR.

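Stripped to its essence, the retrieval step is nearest-neighbour search over embedding vectors. A minimal sketch with made-up 3-dimensional embeddings (a real system would get these vectors from a deep neural network):

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Toy index: document id -> embedding produced by some encoder
index = {
    'dog.jpg': [0.9, 0.1, 0.0],
    'cat.jpg': [0.8, 0.3, 0.1],
    'car.jpg': [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # embedding of the query

# Rank indexed documents by similarity to the query
ranked = sorted(index, key=lambda doc: cosine(query, index[doc]), reverse=True)
print(ranked[0])  # 'dog.jpg' -- the closest neighbour in embedding space
```

Everything else in a neural search system (encoders, indexes, serving) exists to make this step accurate and fast at scale.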
Below is an example of an image embedding space generated by [DocArray](https://github.com/jina-ai/docarray) (the data structure behind Jina) and used for content-based image retrieval. Notice how similar images are mapped close together in the embedding space.

```{figure} https://github.com/jina-ai/docarray/raw/main/.github/README-img/tsne.gif?raw=true
```

Searching is as simple as:

```python
db = ...       # a DocumentArray of indexed images
queries = ...  # a DocumentArray of query images

# Match each query against the index; results are stored on each query
queries.match(db, limit=9)
for q in queries:
    for m in q.matches:
        print(q.uri, m.uri, m.scores['cosine'].value)
```

```console
left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...
```

Neural search is particularly well suited to cross-modal search tasks, because it can learn to map the features of one modality (e.g., text) to the features of another (e.g., images). This enables searching for images with text queries, and for text documents with image queries.

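A toy sketch of this cross-modal setup, with hypothetical hand-made encoders standing in for the two towers of a model such as CLIP (which are trained so that matching text and images land close together in one shared space):

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Hypothetical text encoder: in practice a neural network, not a lookup table
def encode_text(text):
    return {'dog': [1.0, 0.0], 'car': [0.0, 1.0]}.get(text, [0.5, 0.5])


# Image embeddings from the (hypothetical) image tower of the same model
image_index = {
    'dog.jpg': [0.9, 0.1],
    'car.jpg': [0.2, 0.8],
}

q = encode_text('dog')  # a text query...
best = max(image_index, key=lambda img: cosine(q, image_index[img]))
print(best)             # ...retrieves an image: 'dog.jpg'
```

Because both encoders target the same embedding space, the nearest-neighbour step itself never needs to know which modality a vector came from.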
#### Think outside the (search) box

Many neural search-powered applications do not have a search box:

- A question-answering chatbot can be powered by neural search: first index all hard-coded QA pairs, then semantically map user dialog to those pairs.

- A smart speaker can be powered by neural search: apply speech-to-text (STT), then semantically map the text to internal commands.

- A recommendation system can be powered by neural search: embed user-item information into vectors, then find the top-K nearest neighbours of a user or item.

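The chatbot bullet above can be sketched end to end. Here a crude word-overlap score stands in for semantic matching (a real system would embed both sides with a neural encoder and compare vectors, as in the earlier retrieval code):

```python
# Toy QA retrieval: index hard-coded QA pairs, then map a user
# utterance to the closest indexed question.
qa_pairs = {
    'how do I reset my password': 'Click "Forgot password" on the login page.',
    'what are your opening hours': 'We are open 9am-5pm, Monday to Friday.',
}


def overlap(a, b):
    """Stand-in similarity: count of shared words between two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def answer(user_text):
    """Return the answer attached to the most similar indexed question."""
    best_q = max(qa_pairs, key=lambda q: overlap(user_text, q))
    return qa_pairs[best_q]


print(answer('I forgot how to reset the password'))
# Click "Forgot password" on the login page.
```

No search box is involved: the "query" is whatever the user happens to say.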
Neural search creates a new way to comprehend the world, and it is opening doors to new businesses.

### Creative AI

Another potential application of cross-modal machine learning is creative AI. Creative AI systems use artificial intelligence to generate new content, such as images, videos, or text. For example, OpenAI's GPT-3 is a large language model that generates text. It is trained on a large corpus of text, such as books, articles, and websites; once trained, it can generate new text similar to its training data, which can be used to produce articles, stories, or even poems.

OpenAI's DALL·E is another example of a creative AI system: it generates images from textual descriptions. For example, given the text "a black cat with green eyes", it will generate an image of a black cat with green eyes. Below is an example of generating images from a text prompt using [DALL·E Flow](https://github.com/jina-ai/dalle-flow) (a text-to-image system built on top of Jina).

```python
from docarray import Document

server_url = 'grpc://dalle-flow.jina.ai:51005'
prompt = 'an oil painting of a humanoid robot playing chess in the style of Matisse'

# Send the prompt to the DALL·E Flow server; generated images come back as matches
doc = Document(text=prompt).post(server_url, parameters={'num_images': 8})
da = doc.matches

da.plot_image_sprites(fig_size=(10, 10), show_index=True)
```

```{figure} https://github.com/jina-ai/dalle-flow/raw/main/.github/client-dalle.png?raw=true
```

Creative AI holds great potential for the future. It could revolutionize how we interact with machines and help us create more personalized experiences:

- It can create realistic 3D images and videos of people and objects for movies, video games, and other visual media.

- It can generate realistic, natural-sounding dialogue for movies, video games, and other forms of entertainment.

- It can produce new and innovative product designs for manufacturing and other industries.

- It can create new and innovative marketing campaigns for advertising.

## What's next?

In the next chapter, we will explain why Jina is the perfect framework for building neural search, creative AI, and any cross-modal or multi-modal application.
# Why Jina and Cloud-Native?