Pre-trained models like BERT are first trained on a large text corpus (e.g., Wikipedia, BookCorpus) to learn general language representations. **Transfer learning** involves fine-tuning these pre-trained models on a specific task, leveraging their pre-existing knowledge to achieve better performance with less data and training time.


## Understanding BERT's Architecture
- Bidirectional: Considers context from both the left and the right of a token.
- Transformer Encoder: Uses a self-attention mechanism to build representations of words based on their context.

## Task-Specific Fine-Tuning For Similarity Detection
Each sample in the dataset consists of a pair of business names and a label indicating their similarity (0 for dissimilar, 1 for similar).
**Tokenization:** Tokenize the pairs of names using BERT's tokenizer. This includes adding special tokens ([CLS], [SEP]) to distinguish the two names.

Example:

In [None]:
inputs = tokenizer(name1, name2, return_tensors='pt', padding='max_length', truncation=True, max_length=128)
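To make the pair encoding concrete, the sketch below mimics the layout that BERT's tokenizer produces for two names: `[CLS] name1 [SEP] name2 [SEP]`, plus segment IDs marking which name each token belongs to. It is a toy illustration only: `encode_pair_layout` is a hypothetical helper that splits on whitespace, whereas the real tokenizer applies WordPiece and returns integer IDs.

```python
def encode_pair_layout(name1, name2):
    """Toy layout of a BERT sentence pair (hypothetical helper, whitespace split only)."""
    first = name1.split()
    second = name2.split()
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment 0 covers [CLS], the first name and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = encode_pair_layout("Acme Corp", "ACME Corporation")
print(tokens)
print(token_type_ids)
```

Because both names sit in one sequence, every self-attention layer can relate tokens of the first name to tokens of the second.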

- Use the pre-trained BERT model to encode the input pairs.
- Aggregate the token embeddings (e.g., mean pooling) to obtain a fixed-size representation.
- Add a linear layer to map the pooled representation to a similarity score.

Example:

In [None]:
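class BusinessNamesModel(torch.nn.Module):  # name assumed from the truncated "class Bus"; full definition under Model Architecture
    ...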

**Loss Function and Optimization**: Use a loss function suited to binary classification (e.g., Binary Cross-Entropy Loss) and an optimizer such as AdamW to update the weights.
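As a rough illustration of what binary cross-entropy on raw scores (logits) computes, here is a framework-free sketch. The function name `bce_with_logits` is hypothetical; it uses the standard numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))` for a single logit `x` and label `y`.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy for one raw score (logit) and a 0/1 label."""
    # Stable rewrite of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


confident_right = bce_with_logits(4.0, 1.0)  # high score, true label 1
confident_wrong = bce_with_logits(4.0, 0.0)  # high score, true label 0
undecided = bce_with_logits(0.0, 1.0)        # logit 0 means probability 0.5
print(confident_right, confident_wrong, undecided)
```

The loss is small when a confident score agrees with the label and grows roughly linearly with the logit when it does not.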

**Training Loop**:
- Forward Pass: Compute the similarity score for each input pair.
- Loss Computation: Calculate the loss between the predicted similarity scores and the actual labels.
- Backward Pass and Optimization: Update the model parameters based on the gradients.
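The three steps can be sketched on a toy problem without BERT. The example below trains a single-weight logistic model by hand; the data, learning rate, and variable names are made up for illustration, and the gradient is derived manually where PyTorch would use `loss.backward()`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy data (made up): one feature per sample, label 1 when the feature is positive.
features = [-2.0, -1.0, 1.0, 2.0]
labels = [0.0, 0.0, 1.0, 1.0]

weight, bias, learning_rate = 0.0, 0.0, 0.5
loss_history = []

for epoch in range(20):
    grad_w, grad_b, total_loss = 0.0, 0.0, 0.0
    for x, y in zip(features, labels):
        prob = sigmoid(weight * x + bias)  # forward pass
        total_loss += -(y * math.log(prob) + (1 - y) * math.log(1 - prob))  # loss
        grad_w += (prob - y) * x           # backward pass: dLoss/dWeight
        grad_b += prob - y                 # backward pass: dLoss/dBias
    n = len(features)
    weight -= learning_rate * grad_w / n   # optimization step
    bias -= learning_rate * grad_b / n
    loss_history.append(total_loss / n)

print(f"first loss {loss_history[0]:.4f}, last loss {loss_history[-1]:.4f}")
```

The average loss recorded per epoch should fall as the weight moves in the direction that separates the two classes.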

### Why Tokenize Pairs Together?

**Contextual Relationship**:
By tokenizing pairs together and feeding them into the model, BERT can consider the interaction between the two names in a single forward pass. This allows the model to learn more about their relationship, which is crucial for tasks like similarity detection.
When embeddings are generated separately, the model cannot leverage the full context of both names together. Comparing separate embeddings via cosine similarity captures some relational information but lacks the deeper interaction m
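For contrast, the separate-embedding approach reduces each name to its own vector and compares the two afterwards. A minimal sketch of that comparison step follows; the vectors are made-up stand-ins for pooled embeddings, and `cosine` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings"; real BERT vectors have hundreds of dimensions.
emb_name1 = [1.0, 2.0, 0.0]
emb_name2 = [2.0, 4.0, 0.0]  # same direction as emb_name1
emb_name3 = [0.0, 0.0, 3.0]  # orthogonal to emb_name1

same_direction = cosine(emb_name1, emb_name2)
orthogonal = cosine(emb_name1, emb_name3)
print(same_direction, orthogonal)
```

Each vector is computed without seeing the other name, so the only interaction between the two happens in this final dot product.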

## Further Reading
- BERT paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Transfer Learning in NLP
- Attention Is All You Need
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- The Illustrated Transformer

In [10]:
import Levenshtein
from nltk.util import ngrams
from sklearn.metrics.pairwise import cosine_similarity
import torch
from transformers import AutoTokenizer, AutoModel
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch one
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader, random_split

### Business Names Dataset Class
The `BusinessNamesDataset` class loads the business name pairs from a CSV file, tokenizes the text, and prepares the inputs for the model.

Questions:
- What is the purpose of `max_length`?
- What are `input_ids`? (Input to BERT)
- What are attention masks? (Input to BERT)
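As a rough answer to the questions above: `max_length` fixes every sequence to the same length so samples can be stacked into a batch, `input_ids` are the integer vocabulary indices of the tokens, and the attention mask marks real tokens with 1 and padding with 0. The sketch below imitates `padding='max_length'` and `truncation=True` on made-up IDs. `pad_or_truncate` is a hypothetical helper, the pad ID of 0 is an assumption (check `tokenizer.pad_token_id`), and the real tokenizer truncates more carefully, for example by keeping the final `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                     # truncation: drop tokens past the limit
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding             # padding: fill up to the limit
    attention_mask = [1] * len(kept) + [0] * padding  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs for a short and a long sequence.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```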

In [11]:
class BusinessNamesDataset(Dataset):
    def __init__(self, file_path, tokenizer, max_length=128):
        self.data = pd.read_csv(file_path)  # Load data from CSV
        self.tokenizer = tokenizer  # BERT tokenizer
        self.max_length = max_length  # Maximum sequence length

    def __len__(self):
        return len(self.data)  # Number of samples in the dataset

    def __getitem__(self, idx):
        name1 = self.data.iloc[idx]['name1']  # First business name
        name2 = self.data.iloc[idx]['name2']  # Second business name
        label = self.data.iloc[idx]['label']  # Similarity label (0 or 1)
        inputs = self.tokenizer(name1, name2, return_tensors='pt', padding='max_length', truncation=True, max_length=self.max_length)
        input_ids = inputs['input_ids'].squeeze(0)  # Token IDs
        attention_mask = inputs['attention_mask'].squeeze(0)  # Attention mask
        return input_ids, attention_mask, torch.tensor(label, dtype=torch.float)


### Model Architecture
The model class defines the architecture for fine-tuning BERT to compute similarity scores.
We add a **linear layer** to map the pooled BERT embeddings to a similarity score.
- **BERT Forward:** Passes the input through BERT to obtain the hidden states.
- **Pooling:** Averages the hidden states to get a single vector representing the input pair.
- **Similarity score:** Computes the similarity score using the linear layer.

Questions/Concerns:
- What are the `hidden states` of BERT?

In [12]:
class BusinessNamesModel(torch.nn.Module):
    def __init__(self, model_name):
        super(BusinessNamesModel, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)  # Load pre-trained BERT
        self.similarity = torch.nn.Linear(self.bert.config.hidden_size, 1)  # Maps pooled vector to one logit

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)  # Forward pass through BERT
        pooled_output = outputs.last_hidden_state.mean(dim=1)  # Mean pooling over all token positions
        similarity_score = self.similarity(pooled_output)  # Raw (pre-sigmoid) similarity score
        return similarity_score
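One thing to be aware of: `mean(dim=1)` averages over every position, and because the dataset pads to `max_length`, padding positions are included in that average. A common alternative is to weight the average by the attention mask. The framework-free sketch below shows the idea on nested lists; `masked_mean` is a hypothetical helper, not part of the model above, and the numbers are made up.

```python
def masked_mean(hidden_states, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    real_tokens = max(sum(attention_mask), 1)  # avoid division by zero
    return [t / real_tokens for t in totals]


# Two real tokens followed by two padding positions (values made up).
hidden = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

plain = [sum(col) / len(hidden) for col in zip(*hidden)]  # unmasked mean over all positions
masked = masked_mean(hidden, mask)
print(plain)
print(masked)
```

In this toy case the unmasked mean is pulled toward the padding vectors, while the masked mean reflects only the two real tokens.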


## Training the Model

In [13]:
def train_model(model, train_dataloader, val_dataloader=None, num_epochs=3, learning_rate=2e-5):
    criterion = torch.nn.BCEWithLogitsLoss()  # Loss function (applies the sigmoid internally)
    optimizer = AdamW(model.parameters(), lr=learning_rate)  # Optimizer
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Device configuration
    model.to(device)  # Move model to device

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for batch in train_dataloader:
            input_ids, attention_mask, labels = batch
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)

            optimizer.zero_grad() # Reset gradients
            outputs = model(input_ids, attention_mask) # Forward pass
            loss = criterion(outputs.squeeze(-1), labels.float()) # Compute loss
            loss.backward() # Backpropagation
            optimizer.step()
            total_loss += loss.item()

        print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_dataloader)}')

        # Validation
        if val_dataloader:
            model.eval()
            val_loss = 0 
            with torch.no_grad():
                for batch in val_dataloader:
                    input_ids, attention_mask, labels = batch
                    input_ids = input_ids.to(device)
                    attention_mask = attention_mask.to(device)
                    labels = labels.to(device)

                    outputs = model(input_ids, attention_mask)
                    loss = criterion(outputs.squeeze(-1), labels.float())
                    val_loss += loss.item()

            print(f'Epoch {epoch+1}, Validation Loss: {val_loss/len(val_dataloader)}')
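Because training uses `BCEWithLogitsLoss`, the model's outputs are raw logits rather than probabilities. At inference time a sigmoid converts them. The sketch below shows this on made-up logits; `logits_to_predictions` is a hypothetical helper, and the 0.5 decision threshold is an assumption rather than something the notebook specifies.

```python
import math


def logits_to_predictions(logits, threshold=0.5):
    """Turn raw scores into (probabilities, 0/1 predictions)."""
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    return probabilities, predictions


probs, preds = logits_to_predictions([-3.0, 0.0, 2.5])
print([round(p, 3) for p in probs])
print(preds)
```

With PyTorch tensors, the same conversion is `torch.sigmoid(outputs)` followed by a comparison against the threshold.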



## Using the Fine-Tuned Model

This class uses the fine-tuned model to obtain embeddings for new business names.

In [15]:
class FineTunedEmbedding:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)  # Keep the model and its inputs on the same device
        self.model.eval()  # Disable dropout for inference

    def get_embedding(self, text):
        inputs = self.tokenizer(text, return_tensors='pt', padding=True, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.bert(**inputs)  # Use only the fine-tuned BERT encoder
        return outputs.last_hidden_state.mean(dim=1).cpu().numpy()  # Mean-pooled embedding
        