<a href="https://colab.research.google.com/github/juliawol/WB_Embedder/blob/main/Brand_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install datasets sentence-transformers tqdm

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

In [None]:
import pandas as pd
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm


# Configuration
data_files = "data_sampled_30.csv"
model_name = "DeepPavlov/rubert-base-cased"  # Pre-trained RuBERT model for embeddings
threshold = 0.7  # Similarity threshold for considering brand pairs as candidates

# Load dataset using the datasets library
dataset = load_dataset("JuliaWolken/WB_CARDS", data_files=data_files)
data = dataset["train"].to_pandas()

# Drop rows that have no brand name
data = data.dropna(subset=['brandname'])

# Brand Candidate Generation
def generate_brand_candidates(data, model_name, threshold):
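    # Load the pre-trained model. rubert-base-cased is a plain BERT checkpoint,
    # so SentenceTransformer wraps it with mean pooling over token embeddings.
    model = SentenceTransformer(model_name)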

    # Get unique brand names
    brands = data['brandname'].dropna().unique()

    # Compute embeddings for all brands
    print("Generating embeddings for brands...")
    brand_embeddings = model.encode(brands, convert_to_tensor=True, show_progress_bar=True)

    # Find similar brand pairs using cosine similarity.
    # Note: this materialises the full n x n similarity matrix in memory.
    print("Finding similar brand pairs...")
    candidates = []
    similarities = util.cos_sim(brand_embeddings, brand_embeddings)

    for i in tqdm(range(len(brands)), desc="Processing brands"):
        for j in range(i + 1, len(brands)):
            similarity = similarities[i, j].item()
            if similarity >= threshold:
                candidates.append((brands[i], brands[j], "Positive"))

    return candidates

# Generate brand candidates
brand_candidates = generate_brand_candidates(data, model_name, threshold)

# Save candidates for review (the bar only advances once the write has finished)
print("Saving brand candidates to CSV...")
with tqdm(total=len(brand_candidates), desc="Saving to CSV") as pbar:
    brand_candidates_df = pd.DataFrame(brand_candidates, columns=['Brand1', 'Brand2', 'Label'])
    brand_candidates_df.to_csv("brand_candidates.csv", index=False)
    pbar.update(len(brand_candidates))

print("Brand candidate generation complete. File saved as 'brand_candidates.csv'.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


data_sampled_30.csv:   0%|          | 0.00/1.04G [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]



config.json:   0%|          | 0.00/642 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/1.65M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Generating embeddings for brands...


Batches:   0%|          | 0/1116 [00:00<?, ?it/s]

Finding similar brand pairs...


Processing brands: 100%|██████████| 35681/35681 [2:48:07<00:00,  3.54it/s] 


Saving brand candidates to CSV...


Saving to CSV: 100%|██████████| 22735953/22735953 [00:24<00:00, 925745.24it/s]

Brand candidate generation complete. File saved as 'brand_candidates.csv'.
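
The run above spent 2:48:07 in the nested Python loop over 35,681 brands, because every `similarities[i, j].item()` call is executed one element at a time. A minimal sketch of a chunked, vectorised alternative follows. It is an illustration under stated assumptions, not the notebook's method: it works on a NumPy array of embeddings, and the names `find_similar_pairs` and `chunk_size` are hypothetical. It also yields the similarity score alongside each pair, which the original code does not keep.

```python
import numpy as np


def find_similar_pairs(embeddings, names, threshold=0.7, chunk_size=1024):
    """Yield (name_i, name_j, similarity) for every i < j whose cosine
    similarity is at least `threshold`.

    Hypothetical helper: compares one block of rows at a time, so only a
    (chunk_size x n) slice of the similarity matrix is held in memory.
    """
    emb = np.asarray(embeddings, dtype=np.float32)

    # L2-normalise each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)

    n = emb.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)

        # Rows [start, stop) against every row from `start` onward; earlier
        # rows were already compared in previous chunks.
        sims = emb[start:stop] @ emb[start:].T

        # Indices of all entries in this block that pass the threshold.
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows.tolist(), cols.tolist()):
            i, j = start + r, start + c  # back to global indices
            if j > i:  # skip the diagonal and mirrored duplicates
                yield names[i], names[j], float(sims[r, c])
```

With the notebook's variables, this could be fed the output of `model.encode(brands)` (which returns a NumPy array by default) together with `brands`. Because the function is a generator, the pairs can be written to the CSV as they are produced instead of being collected in a list first.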



