In [None]:
!pip install -qU langchain==0.1.16 langchain-core==0.1.42 langchain-openai==0.1.3 langgraph==0.0.37 langchainhub==0.1.15

### 1. Importing Libraries:

- torch: This is the core PyTorch package, a framework used for deep learning.
- BlipProcessor and BlipForConditionalGeneration: These are specific components from the Transformers library by Hugging Face. BlipProcessor handles the preprocessing of inputs and BlipForConditionalGeneration is the model used for generating captions.

In [None]:
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

### 2. Setting the Model Name and Device:

- hf_model: This variable stores the name of the pre-trained BLIP model we want to use.
- device: This determines whether the code will run on a GPU ('cuda') or a CPU ('cpu'). If a GPU is available, it will be used for faster computation.

In [None]:
hf_model = "Salesforce/blip-image-captioning-large"
device = 'cuda' if torch.cuda.is_available() else 'cpu'

### 3. Loading the Processor and Model:
    
- processor: This line loads the pre-trained BLIP processor for handling inputs.
- model: This line loads the pre-trained BLIP model and moves it to the specified device (either GPU or CPU).

In [None]:
processor = BlipProcessor.from_pretrained(hf_model)
model = BlipForConditionalGeneration.from_pretrained(hf_model).to(device)

In [None]:
# Alias IPython's Image so it does not clash with PIL's Image module imported later
from IPython.display import Image as Image_Ipython

Image_Ipython('yghzBMOFHZRKGvRuw6AM6.png', width=500, height=750)

### 4. Opening and Converting the Image:

- Image.open("yghzBMOFHZRKGvRuw6AM6.png"): This line opens the image file named "yghzBMOFHZRKGvRuw6AM6.png". The Image module comes from Pillow, the maintained fork of PIL (Python Imaging Library).
- .convert('RGB'): This converts the image to RGB mode, ensuring that it has three color channels (Red, Green, Blue). This is necessary because many image models expect RGB images.

In [None]:
from PIL import Image

image = Image.open("yghzBMOFHZRKGvRuw6AM6.png").convert('RGB')
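To see what `.convert('RGB')` does without depending on the file above, a minimal sketch can build a small RGBA image in memory and compare its mode before and after conversion. The 4x4 size and the colour values are arbitrary assumptions chosen only for illustration.

```python
from PIL import Image

# Hypothetical 4x4 image with an alpha channel, standing in for a PNG with transparency.
rgba_image = Image.new("RGBA", (4, 4), (255, 0, 0, 128))

# convert("RGB") returns a new image with three colour channels and no alpha channel.
rgb_image = rgba_image.convert("RGB")

print(rgba_image.mode, rgba_image.getbands())
print(rgb_image.mode, rgb_image.getbands())
```

The original image is left unchanged; `convert` returns a new image object, which is why the notebook assigns the result to `image`.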


### 5. Processing the Image:

- processor(image, return_tensors="pt"): This line uses the BlipProcessor to process the image. The return_tensors="pt" argument specifies that the processed outputs should be returned as PyTorch tensors.
- .to(device): This moves the processed tensors to the specified device (either GPU or CPU), ensuring that subsequent computations are performed on the same device as the model.

In [None]:
inputs = processor(image, return_tensors="pt").to(device)

### 6. Disabling Gradient Calculation:

- with torch.no_grad(): This context manager disables gradient calculation for the code inside it. Gradients are not needed during inference (prediction), and disabling them reduces memory consumption and speeds up computation.
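The effect of `torch.no_grad()` can be checked on a toy tensor, independently of the BLIP model. The sketch below uses a hypothetical three-element tensor named `weights` and only shows whether the result is tracked by autograd.

```python
import torch

# Small tensor that would normally track gradients.
weights = torch.ones(3, requires_grad=True)

# Outside no_grad, the result is attached to the autograd graph.
tracked = weights * 2

# Inside no_grad, no graph is recorded for the result.
with torch.no_grad():
    untracked = weights * 2

print(tracked.requires_grad)    # True
print(untracked.requires_grad)  # False
```

The computed values are identical in both cases; only the bookkeeping needed for backpropagation is skipped.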

### 7. Generating Captions:

- model.generate(**inputs, max_new_tokens=100): This line uses the model to generate text (a caption) based on the processed image inputs. max_new_tokens=100 specifies the maximum number of tokens (words or subwords) that the model should generate.

### 8. Decoding the Generated Tokens:

- processor.decode(output_ids[0], skip_special_tokens=True): This line decodes the generated token IDs back into a human-readable string (caption). skip_special_tokens=True ensures that special tokens used by the model (like padding or end-of-sequence tokens) are not included in the final caption.

In [None]:
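# Inference only, so gradient tracking is switched off
with torch.no_grad():
    # Generate caption token IDs for the processed image
    output_ids = model.generate(**inputs, max_new_tokens=100)
    # Decode the first generated sequence into the caption text
    caption = processor.decode(output_ids[0], skip_special_tokens=True)

print(caption)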
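If several images need captions, steps 5 to 8 can be gathered into one function. The sketch below is one possible arrangement; `generate_caption` is a hypothetical helper name, not part of the Transformers library, and it assumes the `processor`, `model`, and `device` objects created earlier are passed in.

```python
import torch


def generate_caption(image, processor, model, device, max_new_tokens=100):
    """Return a caption for a PIL image (hypothetical helper, not a library function)."""
    # Preprocess the image and move the tensors to the model's device.
    inputs = processor(image, return_tensors="pt").to(device)
    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first generated sequence into plain text.
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

With the objects from the cells above, it would be called as `generate_caption(image, processor, model, device)`.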