
Commit bbc95bb

Authored by mhbuehler, okhleif-10, dmsuehir, pre-commit-ci[bot], and ashahba

MultimodalQnA Image and Audio Support Phase 1 (#1071)

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>
Signed-off-by: okhleif-IL <omar.khleif@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: Omar Khleif <omar.khleif@intel.com>
Co-authored-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
1 parent dd9623d commit bbc95bb

File tree

15 files changed: +471 −155 lines changed

MultimodalQnA/README.md

Lines changed: 7 additions & 5 deletions

````diff
@@ -2,7 +2,7 @@
 
 Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.
 
-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
 
 The MultimodalQnA architecture shows below:
 
@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component
 
 By default, the embedding and LVM models are set to a default value as listed below:
 
-| Service              | Model                                       |
-| -------------------- | ------------------------------------------- |
-| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi |
-| LVM                  | llava-hf/llava-v1.6-vicuna-13b-hf           |
+| Service              | HW    | Model                                     |
+| -------------------- | ----- | ----------------------------------------- |
+| embedding-multimodal | Xeon  | BridgeTower/bridgetower-large-itm-mlm-itc |
+| LVM                  | Xeon  | llava-hf/llava-1.5-7b-hf                  |
+| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
+| LVM                  | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf         |
 
 You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.
````
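A short sketch of swapping the default LVM, assuming the `LVM_MODEL_ID` environment variable that this commit introduces in `set_env.sh` (the alternate model id is one of the options named above):

```shell
# Sketch: override the default LVM before bringing the stack up.
# LVM_MODEL_ID is read by the compose file's llava entrypoint.
export LVM_MODEL_ID="llava-hf/llava-1.5-13b-hf"
echo "LVM model: ${LVM_MODEL_ID}"
```

Exporting the variable before `docker compose up` is enough; no file edits are required.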

MultimodalQnA/docker_compose/intel/cpu/xeon/README.md

Lines changed: 37 additions & 13 deletions

````diff
@@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis"
 export LLAVA_SERVER_PORT=8399
 export LVM_ENDPOINT="http://${host_ip}:8399"
 export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
 export WHISPER_MODEL="base"
 export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
 export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
 export MEGA_SERVICE_HOST_IP=${host_ip}
 export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
 export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
 export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
 ```
 
 Note: Please replace with `host_ip` with you external IP address, do not use localhost.
````
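All dataprep endpoints above share one host and port; a minimal sketch (the IP value is a placeholder) of deriving the renamed file endpoints from `host_ip` the same way `set_env.sh` does:

```shell
# Sketch: derive the two renamed dataprep file endpoints from host_ip.
host_ip="192.168.1.10"   # placeholder; use your external IP, not localhost
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
echo "${DATAPREP_GET_FILE_ENDPOINT}"
```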
````diff
@@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \
 
 6. dataprep-multimodal-redis
 
-Download a sample video
+Download a sample video, image, and audio file and create a caption
 
 ```bash
 export video_fn="WeAreGoingOnBullrun.mp4"
 wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
+
+export image_fn="apple.png"
+wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
+
+export caption_fn="apple.txt"
+echo "This is an apple." > ${caption_fn}
+
+export audio_fn="AudioSample.wav"
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
 ```
 
-Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
+Test dataprep microservice with generating transcript. This command updates a knowledge base by uploading a local video .mp4 and an audio .wav file.
 
 ```bash
 curl --silent --write-out "HTTPSTATUS:%{http_code}" \
   ${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
   -H 'Content-Type: multipart/form-data' \
-  -X POST -F "files=@./${video_fn}"
+  -X POST \
+  -F "files=@./${video_fn}" \
+  -F "files=@./${audio_fn}"
 ```
 
-Also, test dataprep microservice with generating caption using lvm microservice
+Also, test dataprep microservice with generating an image caption using lvm microservice
 
 ```bash
 curl --silent --write-out "HTTPSTATUS:%{http_code}" \
   ${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
   -H 'Content-Type: multipart/form-data' \
-  -X POST -F "files=@./${video_fn}"
+  -X POST -F "files=@./${image_fn}"
+```
+
+Now, test the microservice with posting a custom caption along with an image
+
+```bash
+curl --silent --write-out "HTTPSTATUS:%{http_code}" \
+  ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
+  -H 'Content-Type: multipart/form-data' \
+  -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
 ```
 
-Also, you are able to get the list of all videos that you uploaded:
+Also, you are able to get the list of all files that you uploaded:
 
 ```bash
 curl -X POST \
   -H "Content-Type: application/json" \
-  ${DATAPREP_GET_VIDEO_ENDPOINT}
+  ${DATAPREP_GET_FILE_ENDPOINT}
 ```
 
-Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+Then you will get the response python-style LIST like this. Notice the name of each uploaded file e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded file. The same files that are uploaded twice will have different `uuid`.
 
 ```bash
 [
   "WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
-  "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
+  "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
+  "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
+  "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
 ]
 ```
 
-To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
+To delete all uploaded files along with data indexed with `$INDEX_NAME` in REDIS.
 
 ```bash
 curl -X POST \
   -H "Content-Type: application/json" \
-  ${DATAPREP_DELETE_VIDEO_ENDPOINT}
+  ${DATAPREP_DELETE_FILE_ENDPOINT}
 ```
 
 7. MegaService
````
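The dataprep curl calls in this section append an `HTTPSTATUS:<code>` trailer to the response body via `--write-out`; a hedged sketch of splitting that trailer off in a script (the response string below is illustrative, not a live server reply):

```shell
# Sketch: split the "HTTPSTATUS:<code>" trailer produced by
# curl --silent --write-out "HTTPSTATUS:%{http_code}" from the body.
response='["WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4"]HTTPSTATUS:200'
body="${response%HTTPSTATUS:*}"      # everything before the trailer
status="${response##*HTTPSTATUS:}"   # the numeric code after it
echo "status=${status}"
```

Scripts can then branch on `status` instead of parsing curl's exit code alone.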

MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml

Lines changed: 3 additions & 0 deletions

````diff
@@ -36,6 +36,7 @@ services:
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       PORT: ${EMBEDDER_PORT}
+    entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
     restart: unless-stopped
   embedding-multimodal:
     image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -76,6 +77,7 @@ services:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
+    entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID]
     restart: unless-stopped
   lvm-llava-svc:
     image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest}
@@ -125,6 +127,7 @@ services:
       - https_proxy=${https_proxy}
       - http_proxy=${http_proxy}
       - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+      - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
      - DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
       - DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
     ipc: host
````
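The new exec-form `entrypoint:` entries replace each image's default command. A small sketch of the command line the bridgetower entrypoint expands to once compose substitutes the variable (the model id value is the Xeon default from `set_env.sh`):

```shell
# Sketch: the command the new compose entrypoint runs inside the container,
# with EMBEDDING_MODEL_ID substituted (Xeon default from set_env.sh).
EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
cmd="python bridgetower_server.py --device cpu --model_name_or_path ${EMBEDDING_MODEL_ID}"
echo "$cmd"
```

Because the variable is unquoted in the YAML list, compose substitutes it as a single argument before the container starts.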

MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh

Lines changed: 4 additions & 2 deletions

````diff
@@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis"
 export LLAVA_SERVER_PORT=8399
 export LVM_ENDPOINT="http://${host_ip}:8399"
 export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
 export WHISPER_MODEL="base"
 export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
 export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
 export MEGA_SERVICE_HOST_IP=${host_ip}
 export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
 export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
 export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
````

MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 36 additions & 15 deletions

````diff
@@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
 export MEGA_SERVICE_HOST_IP=${host_ip}
 export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
 export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
 export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
 ```
 
 Note: Please replace with `host_ip` with you external IP address, do not use localhost.
````
````diff
@@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \
 
 6. Multimodal Dataprep Microservice
 
-Download a sample video
+Download a sample video, image, and audio file and create a caption
 
 ```bash
 export video_fn="WeAreGoingOnBullrun.mp4"
 wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
-```
 
-Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
+export image_fn="apple.png"
+wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
+
+export caption_fn="apple.txt"
+echo "This is an apple." > ${caption_fn}
+
+export audio_fn="AudioSample.wav"
+wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
+```
 
-Test dataprep microservice with generating transcript using whisper model
+Test dataprep microservice with generating transcript. This command updates a knowledge base by uploading a local video .mp4 and an audio .wav file.
 
 ```bash
 curl --silent --write-out "HTTPSTATUS:%{http_code}" \
   ${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
   -H 'Content-Type: multipart/form-data' \
-  -X POST -F "files=@./${video_fn}"
+  -X POST \
+  -F "files=@./${video_fn}" \
+  -F "files=@./${audio_fn}"
 ```
 
-Also, test dataprep microservice with generating caption using lvm-tgi
+Also, test dataprep microservice with generating an image caption using lvm-tgi
 
 ```bash
 curl --silent --write-out "HTTPSTATUS:%{http_code}" \
   ${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
   -H 'Content-Type: multipart/form-data' \
-  -X POST -F "files=@./${video_fn}"
+  -X POST -F "files=@./${image_fn}"
+```
+
+Now, test the microservice with posting a custom caption along with an image
+
+```bash
+curl --silent --write-out "HTTPSTATUS:%{http_code}" \
+  ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
+  -H 'Content-Type: multipart/form-data' \
+  -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
 ```
 
-Also, you are able to get the list of all videos that you uploaded:
+Also, you are able to get the list of all files that you uploaded:
 
 ```bash
 curl -X POST \
   -H "Content-Type: application/json" \
-  ${DATAPREP_GET_VIDEO_ENDPOINT}
+  ${DATAPREP_GET_FILE_ENDPOINT}
 ```
 
-Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
+Then you will get the response python-style LIST like this. Notice the name of each uploaded file e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded file. The same files that are uploaded twice will have different `uuid`.
 
 ```bash
 [
   "WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
-  "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
+  "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
+  "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
+  "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
 ]
 ```
 
-To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
+To delete all uploaded files along with data indexed with `$INDEX_NAME` in REDIS.
 
 ```bash
 curl -X POST \
   -H "Content-Type: application/json" \
-  ${DATAPREP_DELETE_VIDEO_ENDPOINT}
+  ${DATAPREP_DELETE_FILE_ENDPOINT}
 ```
 
 7. MegaService
````
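The stored-name pattern in the listing above (`name_uuid.ext`) can be sketched with plain parameter expansion; the uuid value below is illustrative, not one the service will actually produce:

```shell
# Sketch: how an uploaded filename maps to its stored name, with a
# uuid inserted before the extension (uuid value is illustrative).
fn="apple.png"
uuid="fcade6e6-11a5-44a2-833a-3e534cbe4419"
stored="${fn%.*}_${uuid}.${fn##*.}"
echo "$stored"
```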

MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml

Lines changed: 2 additions & 0 deletions

````diff
@@ -36,6 +36,7 @@ services:
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       PORT: ${EMBEDDER_PORT}
+    entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
     restart: unless-stopped
   embedding-multimodal:
     image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -139,6 +140,7 @@ services:
       - https_proxy=${https_proxy}
       - http_proxy=${http_proxy}
       - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+      - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
       - DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
       - DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
     ipc: host
````

MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh

Lines changed: 3 additions & 2 deletions

````diff
@@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
 export MEGA_SERVICE_HOST_IP=${host_ip}
 export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
+export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
 export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
 export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
-export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
-export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
````
