# Multi-Modal AI Application: Image Captioning, Video Analysis, and Visual Question Answering

## Objective
This notebook demonstrates how to build **Multi-Modal AI Applications** by applying a pretrained vision-language model (**BLIP**, which pairs an image encoder with a text **decoder**) to **visual data**.

Applications covered include:
- **Image Captioning:** Generating descriptive captions for images.
- **Video Analysis:** Captioning individual video frames to describe what a video shows.
- **Visual Question Answering:** Providing answers to questions based on image content.

In [1]:
# Install necessary libraries
!pip install transformers pillow opencv-python openai




## 1. Image Captioning Example


In [2]:
from transformers import pipeline
from PIL import Image

print("Image Captioning Example")
image_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Load an example image
image = Image.open("NFL_image.jpeg")
caption = image_captioner(image)
print("Generated Caption:", caption[0]['generated_text'])

Image Captioning Example


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cpu


Generated Caption: a football player diving over a goal post


## 2. Video Analysis Example


In [3]:
import cv2

print("Video Analysis Example")
video_path = "NFL_video.mp4"
cap = cv2.VideoCapture(video_path)
frame_count = 0
# Caption the first five frames of the video
while cap.isOpened() and frame_count < 5:
    ret, frame = cap.read()
    if not ret:
        break
    frame_count += 1
    # OpenCV decodes frames as BGR; convert to RGB before handing them to PIL
    frame_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    frame_caption = image_captioner(frame_image)
    print(f"Frame {frame_count} Caption:", frame_caption[0]['generated_text'])
cap.release()

Video Analysis Example
Frame 1 Caption: a football player running with the ball
Frame 2 Caption: a football player running with the ball
Frame 3 Caption: a football player running with the ball
Frame 4 Caption: a football player running with the ball
Frame 5 Caption: a football player running with the ball
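
The five captions above are identical, most likely because consecutive frames at the start of a clip barely differ. The following is a minimal sketch, not part of the original notebook, of one way to get closer to a textual summary: sample frames evenly across the clip and collapse repeated captions. The helper names `evenly_spaced_indices`, `summarize_captions`, and `caption_video` are hypothetical, and seeking with `cv2.CAP_PROP_POS_FRAMES` is assumed to be accurate enough for the video's codec.

```python
def evenly_spaced_indices(total_frames, num_samples):
    """Return up to num_samples frame indices spread across the clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle frame of each equal-length segment
    return [int(step * i + step / 2) for i in range(num_samples)]


def summarize_captions(captions):
    """Collapse consecutive duplicate captions, preserving order."""
    summary = []
    for text in captions:
        if not summary or summary[-1] != text:
            summary.append(text)
    return summary


def caption_video(video_path, captioner, num_samples=5):
    """Caption evenly spaced frames of a video (hypothetical helper)."""
    import cv2  # imported lazily so the helpers above work without OpenCV
    from PIL import Image

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    captions = []
    for index in evenly_spaced_indices(total, num_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)  # seek to the sampled frame
        ret, frame = cap.read()
        if not ret:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes as BGR
        result = captioner(Image.fromarray(rgb))
        captions.append(result[0]["generated_text"])
    cap.release()
    return summarize_captions(captions)
```

Calling `caption_video("NFL_video.mp4", image_captioner)` would return the distinct captions in the order they occur, which can then be joined into a short description of the clip.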


## 3. Visual Question Answering Example


In [4]:
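print("Visual Question Answering Example")
question = "What is in the image?"
# Note: the captioning pipeline does not take the question as input, so the
# generated caption stands in for the answer to this generic question.
answer = image_captioner(image)
print("Answer:", answer[0]['generated_text'])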

Visual Question Answering Example
Answer: a football player diving over a goal post
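
The cell above reuses the captioner, so the question never reaches the model. A minimal sketch of answering an actual question could use the `visual-question-answering` pipeline from `transformers` instead. The model choice (`dandelin/vilt-b32-finetuned-vqa`) and the helper names `top_answer` and `answer_question` are assumptions for illustration, not part of the original notebook.

```python
def top_answer(predictions):
    """Return the highest-scoring answer from a list of {'answer', 'score'} dicts."""
    if not predictions:
        return None
    return max(predictions, key=lambda p: p["score"])["answer"]


def answer_question(image, question, model_name="dandelin/vilt-b32-finetuned-vqa"):
    """Ask a question about an image with a VQA pipeline (hypothetical helper)."""
    # Lazy import: the model is downloaded the first time the pipeline is built
    from transformers import pipeline

    vqa = pipeline("visual-question-answering", model=model_name)
    # The pipeline returns a list of candidate answers with confidence scores
    predictions = vqa(image=image, question=question, top_k=3)
    return top_answer(predictions)
```

With the image loaded earlier, `answer_question(image, "What sport is being played?")` would return a short answer string rather than a full caption.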


## 4. Streamlit App Example


In [5]:
!pip install streamlit

Collecting streamlit
  Downloading streamlit-1.42.2-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.42.2-py2.py3-none-any.whl (9.6 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl (79 kB)

In [6]:
%%writefile app.py

import streamlit as st
from transformers import pipeline
from PIL import Image

# Initialize the image captioning model once; st.cache_resource keeps it
# loaded across Streamlit reruns instead of reloading it on every interaction
@st.cache_resource
def load_captioner():
    return pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

image_captioner = load_captioner()

# Streamlit UI
st.title("🏈 NFL Image Captioning AI")
st.write("Upload an NFL image, and the AI will generate a caption for it!")

# File Upload
uploaded_file = st.file_uploader("Upload an NFL Image...", type=["jpg", "jpeg", "png"])

if uploaded_file is not None:
    # Open and display the image (use_column_width is deprecated in recent Streamlit releases)
    image = Image.open(uploaded_file)
    st.image(image, caption="Uploaded Image", use_container_width=True)

    # Generate Caption
    with st.spinner("Generating caption..."):
        caption = image_captioner(image)
        generated_caption = caption[0]['generated_text']

    # Display Caption
    st.subheader("Generated Caption:")
    st.write(generated_caption)

# Footer
st.markdown("---")
st.markdown("⚡ Built with [Streamlit](https://streamlit.io/) and Hugging Face's BLIP model.")



Writing app.py


In [7]:
# Print this machine's public IP; the localtunnel landing page asks for it as the tunnel password
!curl ipv4.icanhazip.com

35.201.218.127


In [8]:
!streamlit run app.py &>./logs.txt & npx localtunnel --port 8501

Need to install the following packages:
localtunnel@2.0.2
Ok to proceed? (y) y
your url is: https://tender-lies-heal.loca.lt
^C
