# Interacting with CLIP

This is a self-contained notebook that shows how to download and run CLIP models, calculate the similarity between arbitrary image and text inputs, and perform zero-shot image classifications.

# Preparation for Colab

Make sure you're running a GPU runtime; if not, select "GPU" as the hardware accelerator in Runtime > Change Runtime Type in the menu. The next cells will install the `clip` package and its dependencies, and check if PyTorch 1.7.1 or later is installed.

In [1]:
import numpy as np
import torch
from pkg_resources import packaging

print("Torch version:", torch.__version__)

Torch version: 2.2.0+cu121


In [2]:
!pip install clip torch

Defaulting to user installation because normal site-packages is not writeable


In [3]:
!pip install tqdm

Defaulting to user installation because normal site-packages is not writeable


In [4]:
import torch
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from PIL import Image

In [5]:
from transformers import CLIPProcessor, CLIPModel
import os
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

  torch.utils._pytree._register_pytree_node(


# Loading the model

`clip.available_models()` will list the names of available CLIP models.

In [6]:
os.environ["HUGGINGFACE_TOKEN"] = ""

In [7]:
# Load CLIP model and processor
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.


According to the paper, ViT-L/14 performs the best

# Image Preprocessing

We resize the input images and center-crop them to conform with the image resolution that the model expects. Before doing so, we will normalize the pixel intensity using the dataset mean and standard deviation.

The second return value from `clip.load()` contains a torchvision `Transform` that performs this preprocessing.



In [8]:
# Define transformation for images
image_transform = transforms.Compose(
    [
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]
)

In [9]:
import os
from PIL import Image
import matplotlib.pyplot as plt
import skimage

In [10]:
from torch.utils.data import Dataset, DataLoader
import os
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [11]:
# Custom dataset class
original_images = []
images = []
texts = []


class CustomDataset(Dataset):
    def __init__(
        self, csv_file, image_folder, clip_model, clip_processor, transform=None
    ):
        self.data = pd.read_csv(csv_file)
        self.image_folder = image_folder
        self.transform = transform
        self.clip_model = clip_model
        self.clip_processor = clip_processor

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_name = self.data.iloc[idx, 0]
        img_path = os.path.join(self.image_folder, f"{img_name}.png")
        image = Image.open(img_path).convert("RGB")
        description = self.data.iloc[idx, 2]
        price = self.data.iloc[idx, 3]

        # Preprocess image
        if self.transform:
            image = self.transform(image)

        # Extract image features using CLIP model
        inputs = self.clip_processor(images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            image_features = self.clip_model.get_image_features(**inputs)

        # Extract text features using CLIP model
        inputs = self.clip_processor(
            text=[description], return_tensors="pt", padding=True
        )
        with torch.no_grad():
            text_features = self.clip_model.get_text_features(**inputs)

        sample = {
            "image": image,
            "description": description,
            "price": price,
            "image_features": image_features,
            "text_features": text_features,
        }
        return sample

In [33]:
# Paths and file names

csv_file = "/wholeData.csv"
image_folder = "stylish product image/Flipkart"

# Create dataset instance
dataset = CustomDataset(
    csv_file=csv_file,
    image_folder=image_folder,
    clip_model=clip_model,
    clip_processor=clip_processor,
    transform=image_transform,
)

In [34]:
sample = dataset[0]
sample

{'image': tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]],
 
         [[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]],
 
         [[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]]]),
 'description': 'Men & Women, Men, Boys, Women, Baby Boys, Baby Boys & B...',
 'price': 72,
 'image_features': tensor([[ 6.9869e-02,  3.9249e-01,  4.1330e-01, -8.2407e-01, -1.7080e-

In [35]:
# Create DataLoader
dataloader = DataLoader(dataset, batch_size=1024, shuffle=True)

In [36]:
import time

start_time = time.time()
print(start_time)

1708175112.1832147


In [37]:
# Extract features and prices 11:20AM 02/15/2024
image_features = []
text_features = []
prices = []

for batch in dataloader:
    images = batch["image"]
    descriptions = batch["description"]
    batch_text_features = batch["text_features"]
    batch_image_features = batch["image_features"]
    batch_prices = batch["price"]

    # Convert text features to numpy array
    batch_text_features_np = batch_text_features.cpu().numpy()

    # Flatten image features
    batch_image_features_np = batch_image_features.cpu().numpy()

    image_features.append(batch_image_features_np)
    text_features.append(batch_text_features_np)
    prices.append(batch_prices.numpy())

image_features = np.concatenate(image_features)
text_features = np.concatenate(text_features)
prices = np.concatenate(prices)

In [38]:
import time

end_time = time.time()


print(end_time)

1708337492.4843135


In [39]:
# Save image features
np.save("stylish product image/image_features.npy", image_features)

# Save text features
np.save("stylish product image/text_features.npy", text_features)

# Save prices
np.save("stylish product image/prices.npy", prices)

In [40]:
# Combine text features and prices
combined_features = np.concatenate((text_features, image_features), axis=1)

In [41]:
combined_features_2d = combined_features.reshape(combined_features.shape[0], -1)

In [42]:
combined_features_2d

array([[-0.00482015, -0.2510922 ,  0.15266043, ..., -0.10114391,
         0.1144692 , -0.9207649 ],
       [ 0.02847779, -0.19360664,  0.65252376, ...,  0.16380619,
        -0.05795938, -0.5908433 ],
       [ 0.14752619,  0.12617984, -0.10658576, ...,  0.05239876,
         0.20662916, -0.49995846],
       ...,
       [-0.01769301,  0.6063677 ,  0.62809056, ..., -0.1267834 ,
         0.09187119, -0.55941767],
       [ 0.3144688 ,  0.14602579,  0.07375057, ..., -0.01219191,
         0.0234279 , -0.560724  ],
       [-0.2446897 , -0.3623677 ,  0.29124153, ..., -0.03676666,
         0.17447004, -0.7681742 ]], dtype=float32)

In [43]:
prices

array([5282,  699,  328, ...,  784,  244,  709], dtype=int64)

In [62]:
from sklearn.metrics import mean_absolute_percentage_error
import xgboost as xgb

## import xgboost as xgb

In [45]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    combined_features_2d, prices, test_size=0.2, random_state=42
)

# Initialize and train the XGBoost regressor
xgb_regressor = xgb.XGBRegressor()
xgb_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = xgb_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("The R2:", r2)

Mean Squared Error: 184491.12497183404
Root Mean Squared Error: 429.52430079313797
Mean Absolute Error: 222.7257883874716
The R2: 0.6888490086580701


In [66]:
y_pred = xgb_regressor.predict(X_test)

# Evaluate the model
mapr = mean_absolute_percentage_error(y_test, y_pred)
print("mean_absolute_percentage_error:", mapr)

mean_absolute_percentage_error: 0.4052485525810215


In [46]:
from sklearn.ensemble import RandomForestRegressor

In [47]:
# Initialize and train the Random forest regressor  8 hours
regr = RandomForestRegressor(max_depth=1000, random_state=42)
regr = regr.fit(X_train, y_train)

# Predict on the test set
y_pred = regr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("The R2:", r2)

Mean Squared Error: 168764.61772230035
Root Mean Squared Error: 410.80970986857204
Mean Absolute Error: 179.65068553520115
The R2: 0.7153723350337183


In [67]:
y_pred = regr.predict(X_test)

# Evaluate the model
mapr = mean_absolute_percentage_error(y_test, y_pred)
print("mean_absolute_percentage_error:", mapr)

mean_absolute_percentage_error: 0.3268546333782239


In [48]:
import lightgbm as lgb
from lightgbm import LGBMRegressor

In [65]:
# Initialize and train the lightgbm regressor
lgb = LGBMRegressor(num_leaves=1000, learning_rate=0.1)
lgbr = lgb.fit(X_train, y_train)

# Predict on the test set
y_pred = lgb.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("The R2:", r2)

print(mean_absolute_percentage_error(y_test, y_pred))

#

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.528124 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 391679
[LightGBM] [Info] Number of data points in the train set: 49716, number of used features: 1536
[LightGBM] [Info] Start training from score 696.999095
Mean Squared Error: 155185.66007654744
Root Mean Squared Error: 393.93611166856414
Mean Absolute Error: 158.2496079798505
The R2: 0.7382737409062834
0.27232149824055774


In [96]:
lower = LGBMRegressor(
    num_leaves=500, learning_rate=0.1, objective="quantile", alpha=1 - 0.95
)
lower.fit(X_train, y_train)
lower_pred = lower.predict(X_test)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.621379 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 391679
[LightGBM] [Info] Number of data points in the train set: 49716, number of used features: 1536
[LightGBM] [Info] Start training from score 199.000000


In [107]:
upper = LGBMRegressor(
    num_leaves=100, learning_rate=0.1, objective="quantile", alpha=0.95
)
upper.fit(X_train, y_train)
upper_pred = upper.predict(X_test)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.580149 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 391679
[LightGBM] [Info] Number of data points in the train set: 49716, number of used features: 1536
[LightGBM] [Info] Start training from score 1975.000000
