In [None]:
!pip install -q condacolab
import condacolab
condacolab.install() # expect a kernel restart

In [None]:
!mamba install sentence-transformers faiss wikipedia pandas -yq

## Training Your Own Embedding Model

Up to this point, you've been using pre-trained models, which are a bit like using someone else's map to navigate a city. Now, we're going to draw our own map by training an embedding model on a dataset of our choosing. This is an exciting opportunity to tailor the model to better understand the specific topics and language that you're interested in.

### Customizing Your Model

When training your own model, there are several levers you can pull to potentially enhance its performance:

1. **Increase Training Data**: The more examples your model sees, the better it can learn. You can use the `wikipedia` package to fetch articles on a range of topics. More diverse and extensive data can lead to a more robust model.

2. **Model Architecture**: You can change the underlying architecture of the model by specifying a different model string when initializing `SentenceTransformer`. Experiment with different architectures like `bert-base-nli-mean-tokens`, `roberta-base-nli-stsb-mean-tokens`, or even larger models if you have the computational resources.

3. **Training Duration**: The number of epochs you train for also affects performance. More training can help the model fit your text better, but watch out for overfitting, where the model memorizes the training data and fails to generalize to new data.

4. **Loss Function**: The loss function you choose tells the model how to measure its mistakes during training. Different tasks might benefit from different loss functions, so feel free to experiment with options like `ContrastiveLoss`, `MultipleNegativesRankingLoss`, or `TripletLoss`.

5. **Evaluation**: Remember to evaluate your model regularly during training. This helps you understand whether the changes you're making are improving performance.
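
One way to act on the evaluation advice is to hold out a few labelled sentence pairs and check how well the model's cosine similarities track the labels. The sketch below is a minimal, dependency-free illustration. It assumes the embeddings have already been produced (for example with `model.encode`) and are passed in as plain lists of floats. `cosine_similarity`, `rank`, and `spearman_correlation` are hypothetical helper names, not part of `sentence_transformers`, and the toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks


def spearman_correlation(predicted, gold):
    """Pearson correlation of the ranks of two equal-length score lists.

    Assumes neither list is constant (otherwise the variance is zero).
    """
    rp, rg = rank(predicted), rank(gold)
    mean_p = sum(rp) / len(rp)
    mean_g = sum(rg) / len(rg)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(rp, rg))
    var_p = sum((p - mean_p) ** 2 for p in rp)
    var_g = sum((g - mean_g) ** 2 for g in rg)
    return cov / math.sqrt(var_p * var_g)


# Toy held-out pairs: (embedding_a, embedding_b, gold similarity label)
held_out = [
    ([1.0, 0.0], [1.0, 0.1], 0.9),
    ([1.0, 0.0], [0.5, 0.5], 0.5),
    ([1.0, 0.0], [0.0, 1.0], 0.1),
]
predicted = [cosine_similarity(a, b) for a, b, _ in held_out]
gold = [label for _, _, label in held_out]
score = spearman_correlation(predicted, gold)
```

A score near 1 means the similarities order the pairs the same way the labels do. Comparing this number before and after a training run is one way to tell whether a configuration change helped. The library also ships ready-made evaluators in `sentence_transformers.evaluation`, which may be more convenient than hand-rolled helpers like these.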

### Your Challenge

Train your own model using the provided code snippet as a starting point. Fetch more articles from Wikipedia on topics you're interested in, configure the model and training parameters, and let the training begin! Keep an eye on how changes in these configurations affect your model's understanding of language.
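
As a starting point for turning fetched articles into training data, the sketch below builds labelled sentence pairs from a dictionary shaped like the output of `get_articles_by_topic` (title mapped to text). The labelling heuristic is an assumption, not something this lab prescribes: neighbouring sentences from the same article get a label of 1.0, and sentences from different articles get 0.0. `split_sentences` and `build_pairs` are hypothetical helper names, and the two short articles are made-up stand-ins for real Wikipedia content.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def build_pairs(articles, seed=0):
    """Build (sentence_a, sentence_b, label) tuples from {title: text}.

    Assumed heuristic: adjacent sentences in one article are similar (1.0);
    sentences drawn from two different articles are dissimilar (0.0).
    """
    rng = random.Random(seed)  # fixed seed keeps the pairs reproducible
    sentences = {title: split_sentences(text) for title, text in articles.items()}
    titles = [title for title, sents in sentences.items() if sents]
    pairs = []
    for title in titles:
        sents = sentences[title]
        # Positive pairs: each sentence with the one that follows it
        for first, second in zip(sents, sents[1:]):
            pairs.append((first, second, 1.0))
        # Negative pairs: one per sentence, paired with a random sentence
        # from a different article (skipped if there is only one article)
        others = [t for t in titles if t != title]
        if others:
            for sentence in sents:
                other = rng.choice(sentences[rng.choice(others)])
                pairs.append((sentence, other, 0.0))
    rng.shuffle(pairs)
    return pairs


articles = {
    "Cats": "Cats are small carnivores. They are often kept as pets.",
    "Volcanoes": "A volcano is a rupture in a planet's crust. Lava escapes through it.",
}
pairs = build_pairs(articles)
```

Each tuple can then be wrapped the same way as in the starter snippet, for example `sentence_transformers.InputExample(texts=[a, b], label=label)`, before being handed to the `DataLoader`. A heuristic like this produces noisy labels, so treat it as a baseline to improve on rather than a recommended recipe.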

Happy modeling, and may the best embeddings win!


In [None]:
# imports from the first lab
import wikipedia
import sentence_transformers, sentence_transformers.losses
import faiss
import numpy

In [None]:
from torch.utils.data import DataLoader

# Prepare the dataset: each example is a sentence pair plus a target similarity score
train_examples = [sentence_transformers.InputExample(texts=[
    'First sentence.',
    'Second sentence.',
], label=0.8)]

# Define the model
model = sentence_transformers.SentenceTransformer('distilbert-base-nli-mean-tokens')

# Define a dataloader and loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = sentence_transformers.losses.CosineSimilarityLoss(model)

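# Training
# Note: with a single tiny batch, 100 warmup steps keeps the learning rate close
# to zero; scale warmup_steps to the number of training steps in your real dataset
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)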

In [None]:
# Show the model architecture
model

In [None]:
# Show the example output
model.encode(['First sentence.', 'Second sentence.'])

In [None]:
# Functions from the first notebook
def get_articles_by_topic(topics):
    # Step 1: Fetch articles
    articles = {topic: wikipedia.page(topic).content for topic in topics}

    # Step 2: Preprocess text
    # (assuming simple preprocessing for demonstration)
    processed_articles = {
        title: content.replace("\n", " ") for title, content in articles.items()
    }
    return processed_articles

# Prepare a function to create a new index
def create_index(passages, model, instruction="passage"):
    if instruction:
        passages = [
            f"{instruction}: {passage}" for passage in passages
        ]
    # Step 3: Generate embeddings
    embeddings = [
        model.encode(content, normalize_embeddings=True)
        for content in passages
    ]

    # Step 4: Indexing with FAISS
    # Get the size of the embeddings
    dimension = embeddings[0].shape[0]
    # Use the inner product as the similarity measure; because the embeddings
    # are normalized, this is the same as cosine similarity
    index = faiss.IndexFlatIP(dimension)

    # Stack the list of embeddings into a 2-D float32 matrix, as FAISS expects
    embeddings_matrix = numpy.asarray(embeddings, dtype="float32")
    index.add(embeddings_matrix)  # Add embeddings to the index

    # return the results
    return index

def search(query, model, index, k=3, instruction="query"):
    """
    Search for relevant articles given a query.
    Some models need a special instruction (e.g. "query: ")
    """
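    # Need to embed the query
    if instruction:
        query = f"{instruction}: {query}"
    # Normalize the query the same way as the indexed passages
    query_embedding = model.encode(query, normalize_embeddings=True)
    # k=3 finds the 3 closest articles; with IndexFlatIP the returned
    # "distances" are inner-product scores, so higher means more similar
    distances, indices = index.search(numpy.array([query_embedding]), k=k)
    return distances, indices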

In [None]:
# Save the model if you like it
# model.save('path-to-save-model/')