# L2: Image captioning app 🖼️📝

UPDATED: Jon Chun, 4 Oct 2024
* Runs the model locally instead of calling remote HF API endpoints
* Requires `HF_TOKEN` to be stored in Colab secrets before running

Load your HF API key and the relevant Python libraries.

In [4]:
from google.colab import userdata
HF_API_KEY = userdata.get('HF_TOKEN')

In [5]:
import os

os.environ["HF_API_KEY"] = HF_API_KEY
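
The two cells above only work inside Colab. As a minimal sketch, they could be folded into one helper that falls back to an environment variable when `google.colab` is not importable. The function name `load_hf_token` and its parameters are hypothetical, chosen here for illustration.

```python
import os

def load_hf_token(secret_name="HF_TOKEN", env_name="HF_API_KEY"):
    """Return the HF token from Colab secrets, else from the environment."""
    try:
        # Only importable inside Colab; raises if the secret is missing
        from google.colab import userdata
        token = userdata.get(secret_name)
    except Exception:
        token = None

    # Fall back to environment variables outside Colab
    if not token:
        token = os.environ.get(env_name) or os.environ.get(secret_name)
    if not token:
        raise RuntimeError(
            f"No token found: set the {secret_name} Colab secret "
            f"or the {env_name} environment variable."
        )

    # Mirror the notebook: expose the token under HF_API_KEY
    os.environ[env_name] = token
    return token
```

Calling `HF_API_KEY = load_hf_token()` would then replace both cells.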

In [6]:
import os
import io
import IPython.display
from PIL import Image
import base64

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file
# hf_api_key = os.environ['HF_API_KEY']

In [8]:
from transformers import pipeline

# Load the image-to-text model locally using Hugging Face pipeline
get_completion = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")  # Example image-to-text model

# Image-to-text function
def image_to_text(image):
    # Call the local Hugging Face model pipeline directly
    output = get_completion(image)

    # Return the generated text from the model
    return output[0]['generated_text']


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
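
The warning above says the pipeline was placed on the CPU because no `device` argument was passed. A minimal sketch of one way to address this, assuming PyTorch is the backend: `pick_device` and `load_captioner` are hypothetical helper names, while `device` is the pipeline argument the warning refers to (`0` for the first GPU, `-1` for CPU).

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def load_captioner(model_name="nlpconnect/vit-gpt2-image-captioning"):
    """Load the image-to-text pipeline on the device chosen by pick_device()."""
    from transformers import pipeline  # imported lazily; downloads the model on first use
    return pipeline("image-to-text", model=model_name, device=pick_device())
```

With this in place, `get_completion = load_captioner()` would stand in for the direct `pipeline(...)` call.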


In [9]:
# prompt: upload a file and save the filename in uploaded_filename

from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
  uploaded_filename = fn
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


Saving beckett_friend_tennis_mvhs.jpg to beckett_friend_tennis_mvhs.jpg
User uploaded file "beckett_friend_tennis_mvhs.jpg" with length 189783 bytes


In [10]:
fn

'beckett_friend_tennis_mvhs.jpg'

In [11]:
# Example usage
# The pipeline accepts a local file path, a URL, or a PIL image
# (for example: image = Image.open("path_to_image.jpg")).
image = uploaded_filename  # path of the file uploaded above
generated_text = image_to_text(image)
print(generated_text)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


a man and a woman standing on a tennis court 


## Building an image captioning app

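The original lesson used an [Inference Endpoint](https://huggingface.co/inference-endpoints) for `Salesforce/blip-image-captioning-base`, which it describes as a 14M parameter captioning model. Following the update note at the top, this notebook instead calls the local `nlpconnect/vit-gpt2-image-captioning` pipeline loaded earlier.

The sample images are freely available at https://free-images.com/.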

In [12]:
image_url = "https://free-images.com/sm/9596/dog_animal_greyhound_983023.jpg"
display(IPython.display.Image(url=image_url))
get_completion(image_url)

[{'generated_text': 'a dog wearing a red hat and a red bow tie '}]
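
The pipeline returns a list with one dictionary per caption, and the text carries a trailing space, as the output above shows. A small sketch of pulling out a clean caption string follows; `extract_caption` is a hypothetical helper name, and the sample is copied from the output above.

```python
def extract_caption(output):
    """Return the first caption from an image-to-text result, with whitespace trimmed."""
    if not output:
        return ""
    return output[0].get("generated_text", "").strip()


# Sample copied from the cell output above
sample = [{'generated_text': 'a dog wearing a red hat and a red bow tie '}]
print(extract_caption(sample))
```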

## Captioning with `gr.Interface()`

#### gr.Image()
- The `type` parameter sets the format in which the `fn` function receives the image. If `type` is `numpy` or `pil`, `gr.Image()` converts the uploaded file to a NumPy array or a PIL image before passing it to `fn`.
- If `type` is `filepath`, `gr.Image()` stores the image in a temporary file and passes the path to that file, as a string, to `fn`.
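
To make the effect of `type` concrete, here is a hedged sketch of a function that reports which kind of value it was handed. `describe_input` and `build_demo` are hypothetical names. The check uses duck typing only, so it is an approximation and not how Gradio itself distinguishes inputs. `build_demo` needs Gradio, which is installed in the next cell.

```python
def describe_input(value):
    """Guess which gr.Image `type` setting produced `value`."""
    if isinstance(value, str):
        return "filepath"   # type="filepath": a path to a temporary file
    if hasattr(value, "shape") and hasattr(value, "dtype"):
        return "numpy"      # type="numpy": an array-like object
    if hasattr(value, "size") and hasattr(value, "mode"):
        return "pil"        # type="pil": a PIL.Image-like object
    return "unknown"


def build_demo(image_type="pil"):
    """Build (but do not launch) an interface whose fn reports what it received."""
    import gradio as gr  # imported lazily; requires `pip install gradio`
    return gr.Interface(
        fn=describe_input,
        inputs=gr.Image(label="Upload image", type=image_type),
        outputs=gr.Textbox(label="Received as"),
    )
```

Launching `build_demo("filepath")` and uploading an image would be expected to show `filepath` in the text box.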

In [13]:
!pip install gradio

Collecting gradio
  Downloading gradio-4.44.1-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0 (from gradio)
  Downloading fastapi-0.115.0-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.3.0 (from gradio)
  Downloading gradio_client-1.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.9 (from g

In [14]:
import gradio as gr
import base64
import io
from PIL import Image

def image_to_base64_str(pil_image):
    byte_arr = io.BytesIO()
    pil_image.save(byte_arr, format='PNG')
    byte_arr = byte_arr.getvalue()
    return str(base64.b64encode(byte_arr).decode('utf-8'))

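def captioner(image):
    # gr.Image(type="pil") passes a PIL image. The base64 round trip is kept
    # from the remote-endpoint version; the local pipeline also accepts the
    # PIL image directly.
    base64_image = image_to_base64_str(image)
    result = get_completion(base64_image)
    return result[0]['generated_text']

# Close any running Gradio instances
gr.close_all()

# Create the Gradio interface
# The example image files are expected in the current working directory
demo = gr.Interface(fn=captioner,
                    inputs=[gr.Image(label="Upload image", type="pil")],
                    outputs=[gr.Textbox(label="Caption")],
                    title="Image Captioning with ViT-GPT2",
                    description="Caption any image using the nlpconnect/vit-gpt2-image-captioning model",
                    allow_flagging="never",
                    examples=["christmas_dog.jpeg", "bird_flight.jpeg", "cow.jpeg"])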

# Launch the Gradio app (no need to specify the port)
demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://347ad1a4c6805ccbb2.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [15]:
gr.close_all()

Closing server running on port: 7860
