# **Setup**

This first cell installs the libraries this notebook needs, each pinned to an exact version: `langchain`, `gradio`, `transformers`, `bs4`, `requests`, and `torch`.

In [30]:
# installing required libraries in my_env
!pip install langchain==0.1.11 gradio==5.23.2 transformers==4.38.2 bs4==0.0.2 requests==2.31.0 torch==2.2.1

Collecting langchain==0.1.11
  Downloading langchain-0.1.11-py3-none-any.whl.metadata (13 kB)
Collecting gradio==5.23.2
  Downloading gradio-5.23.2-py3-none-any.whl.metadata (16 kB)
Collecting transformers==4.38.2
  Downloading transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Collecting bs4==0.0.2
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting requests==2.31.0
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting torch==2.2.1
  Downloading torch-2.2.1-cp312-cp312-manylinux1_x86_64.whl.metadata (26 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.1.11)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain-community<0.1,>=0.0.25 (from langchain==0.1.11)
  Downloading langchain_community-0.0.38-py3-none-any.whl.metadata (8.7 kB)
Collecting langchain-core<0
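
After the install finishes, it can be useful to confirm that the environment actually holds the pinned versions. The following is a minimal sketch using only the standard library; `PINNED` and `check_pins` are names introduced here for illustration and are not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
PINNED = {
    "langchain": "0.1.11",
    "gradio": "5.23.2",
    "transformers": "4.38.2",
    "bs4": "0.0.2",
    "requests": "2.31.0",
    "torch": "2.2.1",
}


def check_pins(pins):
    """Return {name: (wanted, installed, matches)} for each pinned package."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```

A mismatch here usually means the kernel needs a restart or the install ran in a different environment than the notebook.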

# 1- Image Captioning Model

Here we are importing the necessary libraries for the image captioning part of the notebook: `requests` to download images from the web, `PIL` (Pillow) to work with images, and `transformers` from Hugging Face to use a pre-trained image captioning model.

In [40]:
import requests
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

# Load the pretrained processor and model
processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

This cell loads an image from a specified path. The code then converts the image to the RGB format, which is required by the image captioning model.

In [41]:
img_path = "/content/sun.png"

# convert it into an RGB format
image = Image.open(img_path).convert('RGB')
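
The cell above reads from a local path, but the same RGB normalization applies when the image arrives as raw bytes, for example from a web download. The sketch below assumes a hypothetical helper, `image_from_bytes`, and uses an in-memory RGBA PNG as a stand-in for downloaded data to show that `convert("RGB")` drops the alpha channel.

```python
from io import BytesIO

from PIL import Image


def image_from_bytes(data: bytes) -> Image.Image:
    """Decode raw image bytes and normalize them to 3-channel RGB."""
    return Image.open(BytesIO(data)).convert("RGB")


# Build a small RGBA PNG in memory to stand in for downloaded bytes.
buffer = BytesIO()
Image.new("RGBA", (8, 8), (255, 200, 0, 128)).save(buffer, format="PNG")

demo_image = image_from_bytes(buffer.getvalue())
print(demo_image.mode, demo_image.size)  # RGB (8, 8)
```

With `requests`, the bytes would typically come from `requests.get(url).content`; the URL is left out here because the notebook does not specify one.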




Here, we prepare the image and an optional text prompt for the model. Captioning works without any text; when a prompt such as "This image shows" is supplied, the model continues it, so the prompt appears at the start of the caption. The `processor` converts the image and text into the tensors that the model expects.

In [42]:
# You do not need a question for image captioning
text = "This image shows"
inputs = processor(images=image, text=text, return_tensors="pt")

The `generate` method of the model produces a sequence of tokens representing the caption. We limit the caption length to 50 tokens using `max_length=50`.

In [43]:
# Generate a caption for the image
outputs = model.generate(**inputs, max_length=50)

After generating the caption tokens, this cell decodes them back into a human-readable text string. The `processor.decode` method is used for this purpose, and `skip_special_tokens=True` removes any special tokens added during the generation process.

In [44]:
# Decode the generated tokens to text
caption = processor.decode(outputs[0], skip_special_tokens=True)
# Print the caption
print(caption)

this image shows the sun with a smiley face
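
The three steps above (prepare inputs, generate, decode) can be bundled into one function. This is a minimal sketch, not the notebook's own code: `generate_caption` is a name introduced here, and it takes the `processor` and `model` as arguments rather than relying on globals.

```python
def generate_caption(image, processor, model, prompt=None, max_length=50):
    """Prepare inputs, generate caption tokens, and decode them to text."""
    # Only pass `text` when a prompt is given; captioning works without one.
    processor_kwargs = {"images": image, "return_tensors": "pt"}
    if prompt is not None:
        processor_kwargs["text"] = prompt
    inputs = processor(**processor_kwargs)

    # Same call as in the notebook, with the length cap made configurable.
    outputs = model.generate(**inputs, max_length=max_length)

    # Decode the first (and only) sequence in the batch.
    return processor.decode(outputs[0], skip_special_tokens=True).strip()
```

With the objects loaded earlier, a call would look like `generate_caption(image, processor, model, prompt="This image shows")`.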


# 2- Gradio Interface

Import the necessary libraries for building the Gradio interface for our image captioning model.

In [45]:
import gradio as gr
import numpy as np
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

This function `caption_image` is the core logic for the Gradio interface. It takes a NumPy array representing an image as input. It converts the image to the correct format, uses the processor to prepare the input for the model, generates the caption using the model, and finally decodes the output into a readable string. This string is then returned as the output of the function.

In [46]:
def caption_image(input_image: np.ndarray):
    # Convert numpy array to PIL Image and convert to RGB
    raw_image = Image.fromarray(input_image).convert('RGB')

    # Process the image
    inputs = processor(raw_image, return_tensors="pt")

    # Generate a caption for the image
    outputs = model.generate(**inputs)

    # Decode the generated tokens to text and store it into `caption`
    caption = processor.decode(outputs[0], skip_special_tokens=True)

    return caption
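
One case `caption_image` does not handle is an empty submission. The sketch below assumes that Gradio passes `None` to the function when no image has been uploaded, which would make `Image.fromarray` fail. `caption_or_prompt` is a hypothetical wrapper, not part of the original notebook, that returns a short message instead.

```python
from typing import Callable


def caption_or_prompt(input_image, caption_fn: Callable[[object], str]) -> str:
    """Call `caption_fn` on the image, or ask for one if nothing was uploaded."""
    # Assumption: an empty image input arrives as None.
    if input_image is None:
        return "Please upload an image first."
    return caption_fn(input_image)
```

To use it, the interface function could be wrapped, for example as `fn=lambda img: caption_or_prompt(img, caption_image)`.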

This cell defines the Gradio interface for the image captioning task. It uses the `caption_image` function as the core logic. The interface has an image input (`gr.Image()`) and a text output (`"text"`). We also provide a title and description for the web app.

In [47]:
iface = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(),
    outputs="text",
    title="Image Captioning",
    description="This is a simple web app for generating captions for images using a trained model."
)

Finally, this cell launches the Gradio interface. On a hosted notebook such as Colab, Gradio enables sharing automatically and prints a temporary public URL for the image captioning web app, as the output below shows.

In [48]:
iface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://33d007d4b9db0134eb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


