<a href="https://colab.research.google.com/github/mestvnvo/Vision-Language-Models/blob/main/PL_LLaVA_NeXT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git accelerate bitsandbytes
!pip install --upgrade gspread -q


# Import Google Sheets w/ API


In [None]:
# Authenticate Google Account
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# Specify Google Sheet & Page
spreadsheet = gc.open_by_key('17WsSVU40pRwsrhHW8_Sep1oGSPQUIX67TdqsDG9de40')
# worksheet = spreadsheet.get_worksheet_by_id(int('2044010462')) # Many2Many (~500 selected)
worksheet = spreadsheet.get_worksheet_by_id(int('1859123998')) # Whole Dataset (~3300)

# Instantiate Model and Dataset

In [None]:
from transformers import BitsAndBytesConfig
from transformers import pipeline
import torch

# Configure 4-bit quantization (float16 compute) to reduce GPU memory use
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Instantiate pipeline & batch size
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

llava_next = pipeline(
    "image-to-text",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config},
    batch_size=32,
)
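
As a rough sanity check on what 4-bit loading saves, the sketch below estimates weight memory from a parameter count. The 7-billion figure is an assumption read off the "7b" in the model id, and the estimate ignores quantization constants, activations, and the KV cache, so treat it as a lower bound rather than a measurement.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


ASSUMED_PARAMS = 7e9  # assumption: taken from the "7b" in the model id

fp16_gib = estimate_weight_memory_gib(ASSUMED_PARAMS, 16)
int4_gib = estimate_weight_memory_gib(ASSUMED_PARAMS, 4)

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 footprint, which is the main reason to pass `quantization_config` on a single Colab GPU.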


In [None]:
import pandas as pd
import requests
from PIL import Image

# Get all the values from the worksheet as a list of lists
data = worksheet.get_all_values()

# Create DataFrame
df = pd.DataFrame(data)
df.columns = df.iloc[0]  # Set the first row as the header
df = df.drop(0)  # Drop the header row from the data
df.reset_index(drop=True, inplace=True)

# Obtain features and targets
df = pd.DataFrame(df, columns=["Name","Image_URL","L_PT","L_Color"])
# df = pd.DataFrame(df, columns=["Name","Image_URL","L_PT","L_Color","P_PT"]) # for running Product Type in one run, then Color during another time

# Predict Product_Type

In [None]:
p_pt = []

# Iterate over DataFrame
for idx, row in df.iterrows():
  # Prompt takes Name and Image as inputs
  image = Image.open(requests.get(row["Image_URL"], stream=True).raw)
  pt_q = "Using only available product types and in one word, what product type is the "+row["Name"]+"? The available product types are: Tops, Outerwear, Bottoms, Undergarments, Footwear, Headwear, Dresses, One_piece, Accessories, Other. If the product is a full-body clothing that isn’t a dress, then the product_type is one_piece. If the product isn’t headwear or jewelry, but the product is some sort of bag, then the product_type is accessories. Ensure that the answer is one of the 10 options and use the more specific answer."
  prompt = "[INST] <image>\n"+pt_q+"[/INST]"

  # Predict Product_Type & Format output for Google Sheets
  pred = llava_next(image, prompt=prompt, generate_kwargs={"max_new_tokens": 5})[0]["generated_text"].split("[/INST] ")[1]
  cleaned = pred.strip().lower()
  cell = [[cleaned]]

  # Send to Google Sheets
  cell_address = f'F{idx + 2}'
  worksheet.update(range_name=cell_address, values=cell)

  p_pt.append(cleaned)

df["P_PT"] = p_pt

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
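
With `max_new_tokens` set to 5, the raw generation can carry trailing punctuation or extra words, so the lower-cased string may not match one of the ten product types exactly. The sketch below shows one possible way to map a raw answer onto the allowed labels. `normalize_label` and the `"unknown"` fallback are hypothetical names introduced for illustration and are not part of the notebook.

```python
import re

# The ten product types listed in the prompt, lower-cased to match the cleaning step.
PRODUCT_TYPES = [
    "tops", "outerwear", "bottoms", "undergarments", "footwear",
    "headwear", "dresses", "one_piece", "accessories", "other",
]


def normalize_label(raw: str, allowed: list[str], fallback: str = "unknown") -> str:
    """Return the first word of the answer that is an allowed label, else the fallback."""
    text = raw.strip().lower()
    # Split on anything that is not a letter or underscore, so "tops." -> ["tops"]
    words = [w for w in re.split(r"[^a-z_]+", text) if w]
    for word in words:
        if word in allowed:
            return word
    return fallback


print(normalize_label(" Outerwear.", PRODUCT_TYPES))
print(normalize_label("One_piece\n", PRODUCT_TYPES))
print(normalize_label("a jacket", PRODUCT_TYPES))
```

The same helper could be reused for the color step by passing the fourteen color names as `allowed`. Writing the fallback value to the sheet makes off-list answers easy to filter later.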


# Predict Color

In [None]:
p_color = []

for idx, row in df.iterrows():
  # Prompt takes Name, Image, and Product_Type as inputs
  image = Image.open(requests.get(row["Image_URL"], stream=True).raw)
  color_q = "The product name is: "+row["Name"]+". The available product colors are: Red, Orange, Beige, Yellow, Green, Blue, Purple, Brown, Black, White, Gray, Pink, Graphic, Multi. If there is a graphic, text, or a floral pattern that takes up more than a quarter of the product, then the product color is Graphic. Otherwise, if there are 2 or more distinct colors, then the product color is Multi. From the available product colors and in one word, what color is the "+row["P_PT"]+"? Ensure that the answer is one of the 14 options and if there are two close colors, choose the closer one."
  prompt = "[INST] <image>\n"+color_q+"[/INST]"

  # Predict Color & Format output for Google Sheets
  pred = llava_next(image, prompt=prompt, generate_kwargs={"max_new_tokens": 5})[0]["generated_text"].split("[/INST] ")[1]
  cleaned = pred.strip().lower()
  cell = [[cleaned]]

  # Send to Google Sheets
  cell_address = f'E{idx + 2}'
  worksheet.update(range_name=cell_address, values=cell)

  p_color.append(cleaned)

df["P_Color"] = p_color

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
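
Both prediction loops call `worksheet.update` once per row, which means roughly 3,300 API requests for the whole dataset and may run into Sheets API quota limits. A minimal sketch of the alternative is to collect the predictions first and write the column in a single call, using the same `update(range_name=..., values=...)` signature as the notebook. `column_update` is a hypothetical helper, `FakeWorksheet` is an offline stand-in for the real worksheet, and the sample predictions are made up.

```python
def column_update(column: str, first_row: int, values: list[str]) -> tuple[str, list[list[str]]]:
    """Build an A1 range and a 2-D values list covering one contiguous column block."""
    if not values:
        raise ValueError("values must not be empty")
    last_row = first_row + len(values) - 1
    range_name = f"{column}{first_row}:{column}{last_row}"
    # Sheets expects a list of rows, so each value becomes a one-cell row.
    return range_name, [[v] for v in values]


class FakeWorksheet:
    """Hypothetical offline stand-in that records update calls instead of calling the API."""

    def __init__(self):
        self.calls = []

    def update(self, range_name, values):
        self.calls.append((range_name, values))


predictions = ["blue", "multi", "graphic"]  # made-up values for illustration
range_name, cells = column_update("E", 2, predictions)

sheet = FakeWorksheet()
sheet.update(range_name=range_name, values=cells)
print(sheet.calls)
```

In the notebook, `sheet` would be the real `worksheet` and `predictions` would be `p_color` or `p_pt`. The trade-off is that nothing reaches the sheet if the run crashes midway, so writing in chunks of a few hundred rows may be a reasonable middle ground.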


In [None]:
df