<a href="https://colab.research.google.com/github/mpilomthiyane97/ai-engineering/blob/main/Semantic_Search_Explorer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
'''You type a query (e.g. “tropical fruit” or “vehicle with wheels”),
and the app returns the most semantically similar sentences from a dataset, using embeddings.'''


# Step 1: Install required packages
!pip install sentence-transformers gradio scikit-learn pandas

# Step 2: Import libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import gradio as gr
import pandas as pd

# Step 3: Load sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 4: Define sample dataset (can be replaced with a real one later)
sentences = [
    "Apples and oranges are tasty fruits",
    "Tesla makes electric cars",
    "Bananas grow in tropical regions",
    "The lion is known as the king of the jungle",
    "Planes can fly across continents",
    "Python is a popular programming language",
    "Monkeys are intelligent animals",
    "Ferrari is a luxury sports car",
    "Coffee is a common morning drink",
    "Elephants are the largest land animals",
    "MacBooks are made by Apple",
    "Trains can carry a lot of people",
    "Pizza is a popular Italian dish",
    "Helicopters can hover in place",
    "Cats are independent pets",
    "JavaScript is used in web development",
    "Oranges are rich in vitamin C",
    "Bicycles are eco-friendly vehicles",
    "Chocolate is made from cocoa",
    "Dogs are loyal companions",
    "Submarines travel under water",
    "Laptops are portable computers",
    "Eagles have excellent eyesight",
    "Mangoes are sweet and juicy",
    "The internet connects the world",
    "Camels can survive in the desert",
    "Android is an open-source mobile OS",
    "Giraffes have long necks",
    "Trucks carry heavy loads",
    "Penguins live in Antarctica"
]

# Step 5: Precompute embeddings for the dataset
embeddings = model.encode(sentences)

# Step 6: Define semantic search function
def semantic_search(query, top_k=5):
    # The Gradio slider may pass a float (e.g. 5.0); slicing needs an int
    top_k = int(top_k)

    # Convert user query into a vector (shape: 1 x embedding_dim)
    query_vec = model.encode([query])

    # Compute cosine similarity between the query and all sentences
    sims = cosine_similarity(query_vec, embeddings)[0]

    # Get indices of the top-k most similar sentences, highest score first
    top_indices = sims.argsort()[::-1][:top_k]

    # Prepare results as a DataFrame (cast NumPy floats to plain Python floats)
    results = pd.DataFrame({
        "Sentence": [sentences[i] for i in top_indices],
        "Similarity Score": [round(float(sims[i]), 3) for i in top_indices]
    })

    return results

# Step 7: Create Gradio UI
gr.Interface(
    fn=semantic_search,
    inputs=[
        gr.Textbox(label="Enter your search query"),
        gr.Slider(1, 10, value=5, step=1, label="Top K Results")
    ],
    outputs=gr.Dataframe(label="Most Similar Sentences"),
    title="🔍 Semantic Search Explorer",
    description="Type any phrase, and see which sentences the model thinks are most similar in meaning.",
    theme="default"
).launch()


Collecting gradio
  Downloading gradio-5.29.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.10.0 (from gradio)
  Downloading gradio_client-1.10.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.8-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://39b5b4c450be91e51a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
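
The ranking step inside `semantic_search` can be hard to see behind the library calls, so here is a minimal pure-Python sketch of the same idea: score every stored vector against the query with cosine similarity, then keep the `top_k` highest. The helper names `cosine` and `top_k_similar`, and the 3-dimensional toy vectors, are made up for illustration; real embeddings from `all-MiniLM-L6-v2` have many more dimensions, and the notebook itself uses scikit-learn's `cosine_similarity` rather than this code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, doc_vecs, top_k=5):
    """Return (index, score) pairs for the top_k most similar vectors."""
    top_k = int(top_k)  # tolerate float input, e.g. from a UI slider
    scores = [cosine(query_vec, d) for d in doc_vecs]
    # Sort indices by score, highest first, then keep the first top_k
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, round(scores[i], 3)) for i in order[:top_k]]

# Toy vectors standing in for real sentence embeddings (hypothetical values)
toy_docs = [
    [1.0, 0.0, 0.0],   # points almost the same way as the query
    [0.0, 1.0, 0.0],   # nearly orthogonal to the query
    [0.7, 0.7, 0.0],   # halfway between the two
]
toy_query = [1.0, 0.1, 0.0]

ranked = top_k_similar(toy_query, toy_docs, top_k=2)
print(ranked)
```

Because cosine similarity depends only on the angle between vectors, the first toy vector ranks highest even though the third has a comparable length. That is the same property the notebook relies on when comparing a short query against full sentences.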


