<a href="https://colab.research.google.com/github/hariharaprabhu/hybrid-vectorizer/blob/main/Similar%20Stock%20Tickers/Examples/hv_sp500_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HybridVectorizer — S&P 500 Demo (Colab)

This notebook shows how to run **HybridVectorizer** on a real mixed-modality dataset (text + numeric + categorical) using S&P 500 company data.

### What you’ll do
- Install `hybrid-vectorizer` (1 line).
- Load a ready-to-use CSV (auto-download from GitHub or upload your own).
- Configure columns by modality (text / numeric / categorical).
- Fit the vectorizer and build a unified vector space.
- Run `similarity_search` and **tune block weights** to see how neighbors change.

### Why this is useful
Typical vector search ignores numeric/categorical columns or smashes everything into text. **HybridVectorizer** keeps modality blocks separate and combines them with **tunable weights**, so you can trade off signals (e.g., favor fundamentals vs. description text).
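
To make the idea of weighted modality blocks concrete, here is a minimal NumPy sketch. It is not HybridVectorizer's actual implementation: the function name `weighted_block_similarity`, the per-block L2 normalization, and the scale-then-cosine scheme are all assumptions chosen for illustration.

```python
import numpy as np

def weighted_block_similarity(query_blocks, item_blocks, weights):
    """Cosine similarity after scaling each modality block by its weight.

    query_blocks / item_blocks: dict mapping block name -> 1-D sequence of floats.
    weights: dict mapping block name -> non-negative float (missing names default to 1.0).
    NOTE: hypothetical helper for illustration, not part of the hybrid-vectorizer API.
    """
    names = sorted(query_blocks)

    def combine(blocks):
        parts = []
        for name in names:
            v = np.asarray(blocks[name], dtype=float)
            norm = np.linalg.norm(v)
            if norm > 0:
                v = v / norm  # normalize so block size alone does not dominate
            parts.append(weights.get(name, 1.0) * v)
        return np.concatenate(parts)

    q, x = combine(query_blocks), combine(item_blocks)
    denom = np.linalg.norm(q) * np.linalg.norm(x)
    return float(q @ x / denom) if denom > 0 else 0.0

# Toy vectors: the text blocks are orthogonal, the numeric blocks are identical,
# so down-weighting text should raise the combined score.
query = {"text": [1.0, 0.0], "numerical": [1.0]}
item = {"text": [0.0, 1.0], "numerical": [1.0]}
text_heavy = weighted_block_similarity(query, item, {"text": 1.0, "numerical": 0.1})
numeric_heavy = weighted_block_similarity(query, item, {"text": 0.1, "numerical": 1.0})
print(f"text-heavy: {text_heavy:.3f}, numeric-heavy: {numeric_heavy:.3f}")
```

The same pair of items moves from dissimilar to near-identical purely by changing the weights, which is the trade-off the block weights in this notebook expose.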

### Quick start
```python
# If running on Colab, uncomment:
# !pip install -q hybrid-vectorizer
# Optional, for approximate nearest-neighbor (ANN) search:
# !pip install -q faiss-cpu
```

### Dataset
- **CC0 (Public Domain)**: [S&P 500 Stocks (Kaggle)](https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks)  
- For this demo, we load:  
  `https://raw.githubusercontent.com/hariharaprabhu/hybrid-vectorizer/main/Examples/Similar%20Stock%20Tickers/sp500_companies.csv`

### Notes
- GPU is optional (install CUDA-enabled PyTorch if you want faster text embeddings).
- You can swap in your own CSV with the same column names.
- Results are shown as a small table (symbol, sector, industry, similarity).


## 1) Install (Colab users)

In [6]:

!pip install -q hybrid-vectorizer
# Optional ANN acceleration:
!pip install -q faiss-cpu


## 2) Imports

In [7]:

import logging
import numpy as np
import pandas as pd

try:
    from hybrid_vectorizer import HybridVectorizer
except Exception as e:
    raise ImportError("HybridVectorizer not found. If you're on Colab, run the install cell above.") from e

pd.set_option("display.max_colwidth", 120)



## 3) Load dataset

Choose one approach:
- **Raw GitHub URL**: easiest once the CSV (e.g. `examples/sp500_companies.csv`) is committed to your repo; the cell below uses this option.
- **Manual upload**: Colab file picker.
- **Local path**: if running locally.
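
The fallback order described above can be sketched as a small loader. The function name `load_companies` and its parameters are hypothetical, not part of this notebook or of `hybrid-vectorizer`; the Colab upload step is shown only as comments because it works solely inside a Colab runtime.

```python
import pandas as pd

def load_companies(csv_url=None, local_path=None):
    """Try each configured source in order; return the first DataFrame that loads.

    NOTE: hypothetical helper for illustration.
    """
    sources = [s for s in (csv_url, local_path) if s]
    errors = []
    for source in sources:
        try:
            return pd.read_csv(source)
        except Exception as exc:  # network error, missing file, or parse error
            errors.append(f"{source}: {exc}")
    raise FileNotFoundError("No CSV source could be loaded:\n" + "\n".join(errors))

# Manual upload on Colab (uncomment inside a Colab runtime):
# from google.colab import files
# uploaded = files.upload()              # opens the file picker
# local_name = next(iter(uploaded))      # name of the first uploaded file
# df = load_companies(local_path=local_name)
```

Raising when every source fails keeps the later cells from running against a missing DataFrame.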


In [11]:

# Option A: Load from raw GitHub (recommended once CSV is in your repo)
CSV_URL = "https://raw.githubusercontent.com/hariharaprabhu/hybrid-vectorizer/main/Examples/Similar%20Stock%20Tickers/sp500_companies.csv" # update if needed

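df = None
if CSV_URL:
    try:
        df = pd.read_csv(CSV_URL)
    except Exception as e:
        print("Couldn't fetch from CSV_URL; upload the file or read a local copy instead.", e)

# Stop early with a clear message instead of failing on len(None) below.
if df is None:
    raise RuntimeError("Dataset not loaded. Upload the CSV or point pd.read_csv at a local copy.")

print("Rows:", len(df))
df.head(3), df.tail(3)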


Rows: 502


(  Exchange Symbol              Shortname               Longname      Sector  \
 0      NMS   AAPL             Apple Inc.             Apple Inc.  Technology   
 1      NMS   NVDA     NVIDIA Corporation     NVIDIA Corporation  Technology   
 2      NMS   MSFT  Microsoft Corporation  Microsoft Corporation  Technology   
 
                     Industry  Currentprice      Marketcap        Ebitda  \
 0       Consumer Electronics        254.49  3846819807232  1.346610e+11   
 1             Semiconductors        134.70  3298803056640  6.118400e+10   
 2  Software - Infrastructure        436.60  3246068596736  1.365520e+11   
 
    Revenuegrowth         City State        Country  Fulltimeemployees  \
 0          0.061    Cupertino    CA  United States           164000.0   
 1          1.224  Santa Clara    CA  United States            29600.0   
 2          0.160      Redmond    WA  United States           228000.0   
 
                                                                          

## 4) Select columns used by the script

In [12]:

use_cols = [
    "Exchange","Symbol", "Sector", "Industry", "Currentprice", "Marketcap","Ebitda", "Revenuegrowth",
    "City", "State", "Country", "Fulltimeemployees", "Longbusinesssummary", "Weight"
]

missing = [c for c in use_cols if c not in df.columns]
if missing:
    raise ValueError(f"Your CSV is missing expected columns: {missing}")

df = df[use_cols].copy()
df.head(5)


| | Exchange | Symbol | Sector | Industry | Currentprice | Marketcap | Ebitda | Revenuegrowth | City | State | Country | Fulltimeemployees | Longbusinesssummary | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NMS | AAPL | Technology | Consumer Electronics | 254.49 | 3846819807232 | 134661000000.0 | 0.061 | Cupertino | CA | United States | 164000.0 | Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories w... | 0.069209 |
| 1 | NMS | NVDA | Technology | Semiconductors | 134.7 | 3298803056640 | 61184000000.0 | 1.224 | Santa Clara | CA | United States | 29600.0 | NVIDIA Corporation provides graphics and compute and networking solutions in the United States, Taiwan, China, Hong ... | 0.05935 |
| 2 | NMS | MSFT | Technology | Software - Infrastructure | 436.6 | 3246068596736 | 136552000000.0 | 0.16 | Redmond | WA | United States | 228000.0 | Microsoft Corporation develops and supports software, services, devices and solutions worldwide. The Productivity an... | 0.058401 |
| 3 | NMS | AMZN | Consumer Cyclical | Internet Retail | 224.92 | 2365033807872 | 111583000000.0 | 0.11 | Seattle | WA | United States | 1551000.0 | Amazon.com, Inc. engages in the retail sale of consumer products, advertising, and subscriptions service through onl... | 0.04255 |
| 4 | NMS | GOOGL | Communication Services | Internet Content & Information | 191.41 | 2351625142272 | 123470000000.0 | 0.151 | Mountain View | CA | United States | 181269.0 | Alphabet Inc. offers various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America... | 0.042309 |
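
Before fitting, it can help to preview how the selected columns split by modality. The sketch below is a rough dtype-and-cardinality heuristic; `guess_modalities` and its `max_categories` threshold are hypothetical, and HybridVectorizer's own detection may classify columns differently (in the fit log of this run, `Sector`, `Industry`, `City`, and `State` are reported as text).

```python
import pandas as pd

def guess_modalities(frame, max_categories=20):
    """Bucket columns into numeric / categorical / text by dtype and cardinality.

    NOTE: illustrative heuristic only, not the library's detection logic.
    """
    modalities = {"numeric": [], "categorical": [], "text": []}
    for col in frame.columns:
        series = frame[col]
        if pd.api.types.is_numeric_dtype(series):
            modalities["numeric"].append(col)
        elif series.nunique(dropna=True) <= max_categories:
            # Few distinct string values: treat as a category label.
            modalities["categorical"].append(col)
        else:
            # Many distinct string values: treat as free text.
            modalities["text"].append(col)
    return modalities

# Tiny stand-in frame; in the notebook you would pass `df` instead.
sample = pd.DataFrame({
    "Currentprice": [254.49, 134.70, 436.60],
    "Sector": ["Technology", "Technology", "Communication Services"],
    "Longbusinesssummary": ["Apple Inc. designs...", "NVIDIA Corporation provides...", "Microsoft Corporation develops..."],
})
print(guess_modalities(sample, max_categories=2))
```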


## 5) Fit HybridVectorizer and build vectors

In [13]:

hv = HybridVectorizer(index_column="Symbol")

print("🔄 Fitting model... (this may take a moment)")
vectors = hv.fit_transform(df)

print(f"✅ Generated {vectors.shape[0]} vectors with {vectors.shape[1]} dimensions")


🔄 Fitting model... (this may take a moment)
🔄 Processing data
📝 Encoding text: Sector...
📝 Encoding text: Industry...
📝 Encoding text: City...
📝 Encoding text: State...
📝 Encoding text: Longbusinesssummary...
✅ Generated 502 vectors with 1938 dimensions


## 6) Build a query from a CSV row (GOOGL by default)

In [14]:

# Default to GOOGL; fall back to a random row if GOOGL isn't present
mask = df["Symbol"] == "GOOGL"
if mask.any():
    query_row = df.loc[mask].iloc[0]
else:
    query_row = df.sample(1, random_state=42).iloc[0]

query = query_row.to_dict()
query


{'Exchange': 'NMS',
 'Symbol': 'GOOGL',
 'Sector': 'Communication Services',
 'Industry': 'Internet Content & Information',
 'Currentprice': 191.41,
 'Marketcap': 2351625142272,
 'Ebitda': 123469996032.0,
 'Revenuegrowth': 0.151,
 'City': 'Mountain View',
 'State': 'CA',
 'Country': 'United States',
 'Fulltimeemployees': 181269.0,
 'Longbusinesssummary': 'Alphabet Inc. offers various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America. It operates through Google Services, Google Cloud, and Other Bets segments. The Google Services segment provides products and services, including ads, Android, Chrome, devices, Gmail, Google Drive, Google Maps, Google Photos, Google Play, Search, and YouTube. It is also involved in the sale of apps and in-app purchases and digital content in the Google Play and YouTube; and devices, as well as in the provision of YouTube consumer subscription services. The Google Cloud segment offers i

## 7) Similarity search (baseline)

In [15]:

print("🔍 Searching for similar companies... (baseline weights)")
results = hv.similarity_search(
    query,
    ignore_exact_matches=True
)

try:
    display(results[['Symbol', 'Sector', 'Industry', 'similarity']].head(10))
except Exception:
    display(results[:10])


🔍 Searching for similar companies... (baseline weights)
🔍 Searching for similar items...


| | Symbol | Sector | Industry | similarity |
|---|---|---|---|---|
| 6 | META | Communication Services | Internet Content & Information | 0.886565 |
| 0 | AAPL | Technology | Consumer Electronics | 0.773971 |
| 2 | MSFT | Technology | Software - Infrastructure | 0.749128 |
| 1 | NVDA | Technology | Semiconductors | 0.736084 |
| 8 | AVGO | Technology | Semiconductors | 0.715126 |


## 8) Similarity search with custom block weights

In [16]:

# NOTE: ensure keys match your package's internal block names.
# If your package uses 'numeric' instead of 'numerical', adjust accordingly.
custom_weights = {'text': 0.5, 'numerical': 1.0, 'categorical': 1.0}

print("🔍 Searching with custom block weights:", custom_weights)
results_weighted = hv.similarity_search(
    query,
    ignore_exact_matches=True,
    block_weights=custom_weights
)

try:
    display(results_weighted[['Symbol', 'Sector', 'Industry', 'similarity']].head(10))
except Exception:
    display(results_weighted[:10])


🔍 Searching with custom block weights: {'text': 0.5, 'numerical': 1.0, 'categorical': 1.0}
🔍 Searching for similar items...


| | Symbol | Sector | Industry | similarity |
|---|---|---|---|---|
| 5 | GOOG | Communication Services | Internet Content & Information | 1.0 |
| 6 | META | Communication Services | Internet Content & Information | 0.92849 |
| 0 | AAPL | Technology | Consumer Electronics | 0.859022 |
| 2 | MSFT | Technology | Software - Infrastructure | 0.847837 |
| 1 | NVDA | Technology | Semiconductors | 0.814828 |


## 9) Try a few variants

In [17]:

variants = {
    "text-heavy": {'text': 1.0, 'numerical': 0.5, 'categorical': 0.5},
    "numeric-heavy": {'text': 0.3, 'numerical': 1.0, 'categorical': 0.5},
    "categorical-heavy": {'text': 0.3, 'numerical': 0.5, 'categorical': 1.0}
}

for name, w in variants.items():
    print(f"--- {name} weights: {w}")
    res = hv.similarity_search(query, ignore_exact_matches=True, block_weights=w)
    try:
        display(res[['Symbol','Sector','Industry','similarity']].head(10))
    except Exception:
        display(res[:10])


--- text-heavy weights: {'text': 1.0, 'numerical': 0.5, 'categorical': 0.5}
🔍 Searching for similar items...


| | Symbol | Sector | Industry | similarity |
|---|---|---|---|---|
| 6 | META | Communication Services | Internet Content & Information | 0.834159 |
| 0 | AAPL | Technology | Consumer Electronics | 0.667657 |
| 1 | NVDA | Technology | Semiconductors | 0.637654 |
| 21 | NFLX | Communication Services | Entertainment | 0.637065 |
| 27 | TMUS | Communication Services | Telecom Services | 0.625839 |


--- numeric-heavy weights: {'text': 0.3, 'numerical': 1.0, 'categorical': 0.5}
🔍 Searching for similar items...


| | Symbol | Sector | Industry | similarity |
|---|---|---|---|---|
| 6 | META | Communication Services | Internet Content & Information | 0.936577 |
| 0 | AAPL | Technology | Consumer Electronics | 0.876563 |
| 2 | MSFT | Technology | Software - Infrastructure | 0.871376 |
| 1 | NVDA | Technology | Semiconductors | 0.815888 |
| 3 | AMZN | Consumer Cyclical | Internet Retail | 0.805585 |


--- categorical-heavy weights: {'text': 0.3, 'numerical': 0.5, 'categorical': 1.0}
🔍 Searching for similar items...


| | Symbol | Sector | Industry | similarity |
|---|---|---|---|---|
| 6 | META | Communication Services | Internet Content & Information | 0.941367 |
| 0 | AAPL | Technology | Consumer Electronics | 0.884008 |
| 2 | MSFT | Technology | Software - Infrastructure | 0.873653 |
| 1 | NVDA | Technology | Semiconductors | 0.853141 |
| 3 | AMZN | Consumer Cyclical | Internet Retail | 0.842497 |
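
One way to quantify how much the neighbors change between weightings is the Jaccard overlap of their top-k symbol sets. The helper `topk_overlap` is a hypothetical name introduced here; the symbol lists are copied from the three result tables above.

```python
def topk_overlap(a, b):
    """Jaccard overlap between two collections of symbols (1.0 means identical sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Top-5 symbols copied from the variant outputs above.
top5 = {
    "text-heavy": ["META", "AAPL", "NVDA", "NFLX", "TMUS"],
    "numeric-heavy": ["META", "AAPL", "MSFT", "NVDA", "AMZN"],
    "categorical-heavy": ["META", "AAPL", "MSFT", "NVDA", "AMZN"],
}

for left, right in [("text-heavy", "numeric-heavy"), ("numeric-heavy", "categorical-heavy")]:
    print(f"{left} vs {right}: {topk_overlap(top5[left], top5[right]):.2f}")
# text-heavy vs numeric-heavy: 0.43
# numeric-heavy vs categorical-heavy: 1.00
```

In this run the numeric-heavy and categorical-heavy top-5 sets are identical (only the scores differ), while the text-heavy weighting swaps MSFT and AMZN for NFLX and TMUS.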



## 10) Notes
- For GPU acceleration of text embeddings, install a CUDA-enabled PyTorch (see pytorch.org).
- For large datasets, consider installing `faiss-cpu` for faster nearest-neighbor search.
- Change the query symbol above to explore different neighborhoods.
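
To explore other neighborhoods without editing the query cell each time, the row lookup can be wrapped in a small helper. `make_query` is a hypothetical name, not part of `hybrid-vectorizer`; the demo frame below is a stand-in for the notebook's `df`.

```python
import pandas as pd

def make_query(frame, symbol, symbol_col="Symbol"):
    """Return the first row whose symbol matches, as a plain dict."""
    matches = frame.loc[frame[symbol_col] == symbol]
    if matches.empty:
        raise KeyError(f"{symbol!r} not found in column {symbol_col!r}")
    return matches.iloc[0].to_dict()

# Stand-in data; in the notebook, pass the real `df`.
demo = pd.DataFrame({
    "Symbol": ["AAPL", "MSFT"],
    "Sector": ["Technology", "Technology"],
})
query = make_query(demo, "MSFT")
print(query)

# In the notebook, the resulting dict can be passed to the search call used earlier:
# results = hv.similarity_search(make_query(df, "MSFT"), ignore_exact_matches=True)
```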
