In [1]:
pip install Pillow datasets transformers llm-lens torch


## Visual descriptions

To get visual descriptions from LENS that can be passed on to an LLM, run the following:

In [None]:
import requests
from lens import Lens, LensProcessor
from PIL import Image
import torch
img_url = 'https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "What is the image about?"

lens = Lens()
processor = LensProcessor()
with torch.no_grad():
    samples = processor([raw_image],[question])
    output = lens(samples)



Let's display the image to see what it shows.

In [None]:
raw_image

Now, let's look at what the LENS vision modules output. Here, the output is wrapped in a VQA prompt that can be passed to an LLM.

In [None]:
print(output["prompts"])
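To make the idea of "visual descriptions wrapped in a VQA prompt" concrete, here is a minimal sketch of how such a prompt could be assembled from plain strings. The function name `build_vqa_prompt`, the field labels, and the example values are all hypothetical placeholders; the actual template and content LENS produces may differ, so compare against the printed `output["prompts"]`.

```python
def build_vqa_prompt(tags, attributes, caption, question):
    """Assemble a plain-text VQA prompt (hypothetical layout, not LENS's own template)."""
    lines = [
        "Tags: " + ", ".join(tags),
        "Attributes: " + ", ".join(attributes),
        "Caption: " + caption,
        "Question: " + question,
        "Short Answer:",
    ]
    return "\n".join(lines)


# Made-up descriptions for illustration only; they are not real LENS output.
example_prompt = build_vqa_prompt(
    tags=["mountain", "lake"],
    attributes=["snow-capped peak", "calm water"],
    caption="a mountain reflected in a lake",
    question="What is the image about?",
)
print(example_prompt)
```

Because the result is ordinary text, any text-only LLM can consume it without seeing the image itself.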

## Passing the visual descriptions to an LLM
We can use the prompts and visual descriptions generated by LENS to get an LLM to solve problems about an image. Here, an LLM uses the visual information to answer the VQA question posed in the prompt above.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

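# truncation_side="left" means that, if truncation is enabled, tokens are dropped from the start of the prompt
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small", truncation_side="left")
LLM_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Tokenize the prompts produced by LENS; padding is requested at call time
input_ids = tokenizer(output["prompts"], return_tensors="pt", padding=True).input_ids
outputs = LLM_model.generate(input_ids)
# Skip special tokens such as <pad> and </s> when decoding the answer
print(tokenizer.decode(outputs[0], skip_special_tokens=True))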
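The tokenizer above is created with `truncation_side` set to `'left'`. As a rough illustration of what left-side truncation means, the sketch below applies it to a plain list of token ids. The `truncate` helper is hypothetical and written only for this example; it is not part of `transformers`. The assumption is that the most important part of the prompt (such as the question) sits near the end, so the end is the part worth keeping.

```python
def truncate(token_ids, max_length, side="right"):
    """Keep at most max_length tokens, dropping from the given side (illustrative helper)."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        # Drop tokens from the start, keep the tail of the sequence.
        return list(token_ids[len(token_ids) - max_length:])
    # Default: drop tokens from the end, keep the head of the sequence.
    return list(token_ids[:max_length])


token_ids = [101, 102, 103, 104, 105]
kept_left = truncate(token_ids, 3, side="left")
kept_right = truncate(token_ids, 3, side="right")
print(kept_left, kept_right)
```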

## Advanced Usage
We might want to run a large dataset through LENS, possibly one that doesn't fit in memory. Not to worry! You can pass a Hugging Face dataset to LENS instead of a list. The `hf_dataset_transform` method used below also returns the visual features and prompts.

In [None]:
from datasets import load_dataset

lens = Lens()
processor = LensProcessor()
ds = load_dataset("llm-lens/lens_sample_test", split="test")
output_ds = lens.hf_dataset_transform(ds, processor, return_global_caption = False)
print(output_ds)
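For intuition about handling data that is too large to hold at once, here is a generic sketch of iterating over items in fixed-size batches. The `iter_batches` helper is hypothetical and is not how `hf_dataset_transform` is implemented; it only shows the general pattern of touching one chunk at a time.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is materialized at a time, even if the source is a lazy iterable.
batches = list(iter_batches(range(5), 2))
print(batches)
```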