# Multi-Modal AI Application: Video Analysis and Visual Question Answering

## Objective
This notebook demonstrates how to build **Multi-Modal AI Applications** using **decoder models** (here Kosmos-2, loaded through Hugging Face `transformers`) combined with **visual data**, served through a Streamlit app.

Applications covered include:
- **Image Captioning:** Describing an uploaded image by answering a question about it.
- **Video Analysis:** Describing a video by answering questions about sampled frames.

In [49]:
# Install necessary libraries
!pip install transformers pillow opencv-python openai
!pip install streamlit
!npm install localtunnel
!pip install streamlit-webrtc

[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K
up to date, audited 23 packages in 1s
[1G[0K⠏[1G[0K
[1G[0K⠏[1G[0K3 packages are looking for funding
[1G[0K⠏[1G[0K  run `npm fund` for details
[1G[0K⠏[1G[0K
2 [33m[1mmoderate[22m[39m severity vulnerabilities

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.
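
Before writing the app, it can help to confirm that the packages installed above are importable. The following is a minimal sketch, not part of the original notebook: `missing_modules` is a hypothetical helper, and the module list reflects the fact that some pip package names differ from their import names (`pillow` imports as `PIL`, `opencv-python` as `cv2`).

```python
import importlib.util

# Import names for the pip packages installed above.
# pillow is imported as PIL and opencv-python as cv2.
REQUIRED_MODULES = ["transformers", "PIL", "cv2", "streamlit"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If anything is reported missing, rerunning the install cell (or restarting the runtime after installation) is usually the first thing to try.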


## Image Captioning and Video Analysis


In [50]:
%%writefile Multimodal_AI_app.py

# Load the model and define a function that runs a prompt against an image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
import os

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

def run_prompt(prompt, image):
    """Run a Kosmos-2 prompt against a PIL image and return the cleaned output text."""
    # Tokenize the prompt and preprocess the image into model inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        use_cache=True,
        max_new_tokens=128,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Strip the grounding markup; the detected entities are returned but not used here
    processed_text, entities = processor.post_process_generation(generated_text)
    return processed_text


import streamlit as st
from PIL import Image
import cv2
import numpy as np

# Image Captioning
st.title("Multimodal Image Question and Answer")
uploaded_file = st.file_uploader("Upload an image...", type=["jpg", "jpeg", "png"])

if uploaded_file:
    # Open the uploaded image and show it
    image = Image.open(uploaded_file).convert("RGB")
    st.image(image, caption="Uploaded Image", use_container_width=True)

    # Ask a fixed question about the image and display the model's answer
    prompt = "<grounding> Question: What teams are playing in the image?\nAnswer:"
    st.write(run_prompt(prompt, image))

# Video Analysis: one prompt per sampled frame
prompt_list = [
    "<grounding> Question: What is happening in the image? Answer:",
    "<grounding> Question: How many referees are in the image? Answer:",
    "<grounding> Question: How many wide receivers are on the field? Answer:",
    "<grounding> Question: How close is the offense to the endzone? Answer:"
    ]

st.title("Multimodal AI Video Question and Answer")

import tempfile
uploaded_video = st.file_uploader("Upload a video...", type=["mp4"])
if uploaded_video:
    # Save the upload to disk because cv2.VideoCapture reads from a file path
    temp_dir = tempfile.mkdtemp()
    path = os.path.join(temp_dir, uploaded_video.name)
    with open(path, "wb") as f:
        f.write(uploaded_video.getvalue())
    cap = cv2.VideoCapture(path)
    frame_count = 0
    # Analyze every 10th frame (frames 10, 20, 30, 40), one prompt per sampled frame
    while cap.isOpened() and frame_count < 41:
        ret, frame = cap.read()
        if not ret:
            break
        frame_count += 1
        if frame_count % 10 == 0:
            # OpenCV decodes frames as BGR; convert to RGB before building the PIL image
            frame_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            width, height = frame_image.size
            frame_image2 = frame_image.resize((width // 2, height // 2), resample=Image.Resampling.NEAREST)
            st.image(frame_image2, caption=f"Video Frame {frame_count}", use_container_width=True)
            st.write(run_prompt(prompt_list[frame_count // 10 - 1], frame_image2))
    cap.release()


Overwriting Multimodal_AI_app.py
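
The video loop in the app analyzes every 10th frame and pairs each sampled frame with one prompt from `prompt_list`. The sketch below reproduces that schedule without OpenCV or the model. `sampling_schedule` and its parameters are hypothetical names introduced for illustration; the app hard-codes the cutoff as `frame_count < 41`, while this sketch derives the same cutoff from the number of prompts.

```python
def sampling_schedule(total_frames, stride=10, num_prompts=4):
    """Return (frame_number, prompt_index) pairs for the frames that get analyzed.

    Frame numbers are 1-based, matching frame_count in the app. Sampling stops
    when either the video or the prompt list runs out.
    """
    schedule = []
    for frame_number in range(stride, total_frames + 1, stride):
        prompt_index = frame_number // stride - 1
        if prompt_index >= num_prompts:
            break
        schedule.append((frame_number, prompt_index))
    return schedule


print(sampling_schedule(total_frames=100))  # [(10, 0), (20, 1), (30, 2), (40, 3)]
print(sampling_schedule(total_frames=25))   # [(10, 0), (20, 1)]
```

Deriving the cutoff from the prompt count means that adding a fifth prompt would extend the analysis to frame 50 without touching the loop condition.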


In [51]:
!curl ipv4.icanhazip.com

34.16.234.178
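
The address printed by this cell is what LocalTunnel expects as the password in the next step. As a hedged sketch, the standard-library `ipaddress` module can confirm that a copied value is a well-formed IPv4 address before it is pasted; `tunnel_password` is a hypothetical helper and not part of LocalTunnel.

```python
import ipaddress


def tunnel_password(raw_output):
    """Strip whitespace from the curl output and check that it is an IPv4 address."""
    candidate = raw_output.strip()
    address = ipaddress.ip_address(candidate)  # raises ValueError if malformed
    if address.version != 4:
        raise ValueError(f"expected an IPv4 address, got IPv{address.version}")
    return candidate


print(tunnel_password("34.16.234.178\n"))
```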


In [None]:
# LocalTunnel prompts for a password, which is this machine's public IP address (printed by the previous cell).
!streamlit run Multimodal_AI_app.py &>./logs.txt & npx localtunnel --port 8501

[1G[0K⠙[1G[0Kyour url is: https://young-fans-sell.loca.lt


In [None]:
# Log file for troubleshooting
!cat logs.txt

## Visual Question Answering Example


In [None]:
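# Disabled example: `image_captioner` and `image` are not defined in this notebook,
# and the call below does not pass `question` to the model.
# print("Visual Question Answering Example")
# question = "How many players are in the image?"
# answer = image_captioner(image)
# print("Answer:", answer[0]['generated_text'])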
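
The prompts used in the app all follow the same pattern: a `<grounding>` tag followed by `Question: ... Answer:`. Below is a minimal sketch of a helper that builds such a prompt from a free-form question. `build_vqa_prompt` is a hypothetical name; the resulting string would be passed, together with an image, to `run_prompt` from `Multimodal_AI_app.py`.

```python
def build_vqa_prompt(question, grounding=True):
    """Format a question the way the app's prompts are written."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    # The <grounding> tag is copied from the app's prompts; it is optional here
    prefix = "<grounding> " if grounding else ""
    return f"{prefix}Question: {question} Answer:"


print(build_vqa_prompt("How many players are in the image?"))
```

Building the prompt in one place keeps the question text separate from the model-specific markup, so the same question can be reused across images and video frames.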