# Create a Hushh Vibe Catalog

Download an example product catalog from this [Kaggle dataset](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small).


In [1]:
import glob
from hushh.catalog import Catalog, Product
from PIL import Image
import json
from tqdm import tqdm
import pandas as pd
import os

## Download Data
Uncomment and execute the next two cells to download and unzip the image dataset.

In [2]:
# !kaggle datasets download paramaggarwal/fashion-product-images-small --force 

In [3]:
# ! unzip -o fashion-product-images-small.zip > /dev/null

## Data Details
The dataset contains roughly 44K fashion-related images.

In [4]:
len(glob.glob("images/*"))

44441

A quick look at the images shows that they are JPEG files, each named by its product id.

In [5]:
print(glob.glob("images/*")[:10])

['images/9733.jpg', 'images/14147.jpg', 'images/52112.jpg', 'images/6400.jpg', 'images/34297.jpg', 'images/24084.jpg', 'images/12536.jpg', 'images/54563.jpg', 'images/15259.jpg', 'images/35189.jpg']


The `styles.csv` file provides some of the metadata for each id.

In [6]:
styles = pd.read_csv("styles.csv", usecols=range(10), index_col=0)
styles.head()

| id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|---|---|---|---|---|---|---|---|---|
| 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011.0 | Casual | Turtle Check Men Navy Blue Shirt |
| 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012.0 | Casual | Peter England Men Party Blue Jeans |
| 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016.0 | Casual | Titan Women Silver Watch |
| 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011.0 | Casual | Manchester United Men Solid Black Track Pants |
| 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012.0 | Casual | Puma Men Grey T-shirt |


## Creating a Hushh Catalog
We can create a catalog using the Hushh Catalog API. For each image file, we follow these steps:

1. Extract the id from the filename.
2. Look up the metadata for the id.
3. Create a product from the id and metadata (using a dummy URL, since we won't be linking to a product page).

The Catalog comes with its own method for writing catalog files.
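
The first two steps above can be sketched in isolation as follows. This is a minimal illustration and not part of the hushh API: the helper name `lookup_description` and the two-row `styles` frame are hypothetical stand-ins for the real `styles.csv`. It also shows one way to tolerate an image whose id is absent from the metadata, which a bare `styles.loc[...]` lookup would reject with a `KeyError`.

```python
import os
from typing import Optional

import pandas as pd

# Hypothetical stand-in for styles.csv, indexed by product id.
styles = pd.DataFrame(
    {"productDisplayName": ["Puma Men Grey T-shirt", None]},
    index=pd.Index([53759, 15970], name="id"),
)


def lookup_description(filename: str, styles: pd.DataFrame) -> Optional[str]:
    """Return the description for an image file, or None if unavailable."""
    # Step 1: "images/53759.jpg" -> 53759
    stem, _ext = os.path.splitext(os.path.basename(filename))
    product_id = int(stem)

    # Step 2: look up the metadata row; ids missing from the index are skipped.
    if product_id not in styles.index:
        return None
    description = styles.loc[product_id, "productDisplayName"]
    return None if pd.isna(description) else description


print(lookup_description("images/53759.jpg", styles))
print(lookup_description("images/15970.jpg", styles))
print(lookup_description("images/1.jpg", styles))
```

A `None` result corresponds to the products that the loop in the next cell skips.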


In [7]:

cat = Catalog("demo_catalog")

for filename in tqdm(glob.glob("images/*")):
    # The file name without its extension is the product id.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    product_id = int(stem)  # avoid shadowing the built-in id()
    style = styles.loc[product_id]
    # Skip products that have no description.
    if pd.isna(style.productDisplayName):
        continue
    prod = Product(description=style.productDisplayName, url="dummy_url", image=filename)
    cat.addProduct(prod)

print("Writing Catalog")
cat.to_hcf("catalog.hcf")

100%|██████████| 44441/44441 [00:07<00:00, 5756.93it/s]


Writing Catalog
0
Collected images and text for batch 0
Collected inputs for batch 0
Collected Image and text features for batch 0
Image embeddings collected for batch 0
Text embeddings collected for batch 0
1
Collected images and text for batch 1
Collected inputs for batch 1
Collected Image and text features for batch 1
Image embeddings collected for batch 1
Text embeddings collected for batch 1
2
Collected images and text for batch 2
Collected inputs for batch 2
Collected Image and text features for batch 2
Image embeddings collected for batch 2
Text embeddings collected for batch 2
3
Collected images and text for batch 3
Collected inputs for batch 3
Collected Image and text features for batch 3
Image embeddings collected for batch 3
Text embeddings collected for batch 3
4
Collected images and text for batch 4
Collected inputs for batch 4
Collected Image and text features for batch 4
Image embeddings collected for batch 4
Text embeddings collected for batch 4


## Create a comparison JSON dataset
We will set up a quick-and-dirty JSON output for comparison. This simply dumps the `flatBatches` content from the catalog as standard JSON.

In [11]:
with open("catalog.json", "w") as fh:
    json.dump(cat.productVibes.flatBatches, fh, default=lambda o: o.__dict__)
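
The `default` argument is what lets `json.dump` handle objects it cannot serialize natively: the encoder calls it for each such object and serializes whatever it returns. The sketch below shows the mechanism on a hypothetical `Vibe` class. The class and its fields are illustrative assumptions and do not reflect the actual structure of `flatBatches`.

```python
import json


class Vibe:
    """Hypothetical record; not the real hushh flat-batch type."""

    def __init__(self, product_id, embedding):
        self.product_id = product_id
        self.embedding = embedding


batch = [Vibe(1, [0.1, 0.2]), Vibe(2, [0.3, 0.4])]

# json cannot encode Vibe directly, so it calls `default` on each instance
# and encodes the returned attribute dictionary instead.
encoded = json.dumps(batch, default=lambda o: o.__dict__)
print(encoded)

# Decoding yields plain dicts and lists, not Vibe objects.
decoded = json.loads(encoded)
```

This approach only works for objects that carry a `__dict__`; anything else reaching the lambda raises an `AttributeError`.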

## JSON timings

Decoding a JSON payload of ~40K embeddings takes roughly 8 seconds.

In [12]:
%timeit -n 1 with open("catalog.json", "r") as fh: json.load(fh)


8.29 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## HCF timings

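HCF, by comparison, loads in about 480 microseconds.

In [13]:
# Time loading the file written by cat.to_hcf above.
# Assumption: read_hcf can be called on the Catalog object. The cell as first
# recorded used an undefined name `catalog` and a file ("test.hcf") that this
# notebook never writes, so it raised a NameError instead of printing a timing.
%timeit -n 1 cat.read_hcf("catalog.hcf")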

## Conclusion

HCF's ability to load data directly from a byte stream eliminates most of the overhead of dealing with large embedding formats, making it ideal for quickly loading catalogs of data for search-related functionality.
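
To illustrate why reading from a byte stream is cheaper than parsing text, the sketch below stores the same floats two ways using only the standard library: as JSON text, which must be parsed number by number, and as packed 32-bit floats, which a `memoryview` can expose without copying or parsing. This is not the HCF format or the hushh reader, only an analogy for the general technique; the sizes and names are arbitrary.

```python
import array
import json
import timeit

values = [i / 1000 for i in range(200_000)]

# Text encoding: every number is written as decimal digits.
as_json = json.dumps(values)

# Binary encoding: each number is packed as a 4-byte float.
as_bytes = array.array("f", values).tobytes()


def load_json():
    return json.loads(as_json)


def load_binary():
    # Reinterpret the buffer as floats in place; no per-element parsing.
    return memoryview(as_bytes).cast("f")


json_time = min(timeit.repeat(load_json, number=1, repeat=5))
binary_time = min(timeit.repeat(load_binary, number=1, repeat=5))
print(f"json:   {json_time:.6f} s")
print(f"binary: {binary_time:.6f} s")

view = load_binary()
```

Packing as 32-bit floats trades some precision for size, so values read back from `view` match the originals only approximately, whereas the JSON round trip is exact.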

In this particular example, HCF decodes a comparable payload roughly **17K times** faster than JSON (8.29 s versus about 480 µs).