# Semantic Search with `google/embeddinggemma-300m`

This notebook demonstrates how to perform a semantic search using the `txtai` library and the Embedding Gemma model.

In [2]:
!pip install -q git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
!pip install -q sentence-transformers
!pip install -q txtai
!pip install -q wikipedia

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m288.2/288.2 kB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.4/31.4 MB[0m [31m51.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone


Import the `Embeddings` class from the `txtai` library.

In [3]:
from txtai import Embeddings

Initialize the `Embeddings` model. We specify the model path as "google/embeddinggemma-300m" and the method as "sentence-transformers". We also provide instructions for query and data formatting to ensure compatibility with the model.

In [4]:
embeddings = Embeddings(
    path="google/embeddinggemma-300m",
    method="sentence-transformers",
    instructions={
        "query": "task: search result | query: ",
        "data": "title: none | text: ",
    })

modules.json:   0%|          | 0.00/573 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/997 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/16.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/58.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.49k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.21G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/312 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/134 [00:00<?, ?B/s]

2_Dense/model.safetensors:   0%|          | 0.00/9.44M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/134 [00:00<?, ?B/s]

3_Dense/model.safetensors:   0%|          | 0.00/9.44M [00:00<?, ?B/s]

Fetch the content of the Wikipedia page for "Python (programming language)" and split it into a list of strings, where each string is a paragraph or section.

In [5]:
import wikipedia
data = wikipedia.page("Python (programming language)").content.split("\n")

Create a vector store by indexing the loaded data using the initialized embeddings model. This process converts the text data into numerical vectors that can be efficiently searched.

In [6]:
# Create a sample vector store
embeddings.index(data)

Define a query string and use the `embeddings.search()` method to find the top k most similar documents (paragraphs in this case) to the query within the indexed data.

In [11]:
# Search for top k similar documents
query = "Who is the creating of Python?"
results = embeddings.search(query, 3)

Print the search results. Iterate through the search results and print the text of each matched document along with its similarity score.

In [12]:
# Print results
for idx, score in results:
    print(f"Text: {data[int(idx)]} (score: {score:.4f})")

Text: Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands (he first released it in 1991 as Python 0.9.0.); it was conceived as a successor to the ABC programming language, which was inspired by SETL, capable of exception handling and interfacing with the Amoeba operating system. Python implementation began in December, 1989. Van Rossum assumed sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from responsibilities as Python's "benevolent dictator for life" (BDFL); this title was bestowed on him by the Python community to reflect his long-term commitment as the project's chief decision-maker. (He has since come out of retirement and is self-titled "BDFL-emeritus".) In January 2019, active Python core developers elected a five-member Steering Council to lead the project. (score: 0.5758)
Text: Python's name is inspired by the British comedy group Monty