In [1]:
# Install dependencies
!pip install sentence-transformers faiss-cpu scikit-learn


Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [2]:
# 🧠 Mini Semantic Search Demo (Day 8 – LLM Journey)

# -----------------------------
# 📚 Imports
# -----------------------------
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import faiss
import numpy as np

# -----------------------------
# 🔨 Load Embedding Model
# -----------------------------
model = SentenceTransformer("all-MiniLM-L6-v2")

# -----------------------------
# 📝 Load Real Dataset (20 Newsgroups)
# -----------------------------
categories = ['sci.med', 'comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes')
)

documents = newsgroups.data[:200]  # Take first 200 docs for demo
print(f"Loaded {len(documents)} documents")
print("\nSample Document:\n", documents[0][:300])  # preview first 300 chars

# -----------------------------
# 🔢 Encode Documents into Embeddings
# -----------------------------
doc_embeddings = model.encode(documents, show_progress_bar=True)

# -----------------------------
# 🗄️ Build FAISS Index
# -----------------------------
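dimension = doc_embeddings.shape[1]
# IndexFlatL2 does exact (brute-force) search and reports squared L2 distances
index = faiss.IndexFlatL2(dimension)
# FAISS expects a contiguous float32 array
index.add(np.ascontiguousarray(doc_embeddings, dtype=np.float32))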

# -----------------------------
# 🔍 Semantic Search Function
# -----------------------------
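def semantic_search(query, top_k=5):
    """Return up to top_k (document, squared L2 distance) pairs, nearest first."""
    query_vec = np.asarray(model.encode([query]), dtype=np.float32)
    distances, indices = index.search(query_vec, top_k)
    # FAISS pads with index -1 when fewer than top_k results exist
    return [(documents[i], d) for i, d in zip(indices[0], distances[0]) if i != -1]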

# -----------------------------
# 🚀 Run a Test Query
# -----------------------------
query = "doctor treating patients"
results = semantic_search(query, top_k=5)

print(f"Query: {query}\n")
# Note: the printed "score" is a squared L2 distance, so lower means more similar
for res, dist in results:
    print(f"- {res[:200]}... (score: {dist:.4f})\n")



Loaded 200 documents

Sample Document:
 
The FDA, I believe.  Rules say no blood or blood products donations
from anyone who has been in a malarial area for 3 years.  I was a platelet
donor until my Thailand trip and my blood bank was very disappointed
to find out they couldn't use me for 3 years.

Not necessarily.  The same rules may not


Query: doctor treating patients

- 
Well, let me put it this way, based on my own experience.  A
general practitioner with no training in infectious diseases,
by establishing links to the "Lyme community", treating patients
who come to... (score: 1.0058)

- 
Dyer, you're rude. Medicine is not a totallly scientific endevour. It's
often practiced in a disorganized manner. Most early treatment of
non-life threatening illness is done o... (score: 1.1607)

- - Am I justified in being pissed off at this doctor?
- 
- Last Saturday evening my 6 year old son cut his finger badly with a knife.
- I took him to a local "Urgent and General Care" clinic at 5:50 pm... (score: 1.2194)

- [reply to geb@cs.pitt.edu (Gordon Banks)]
 
 
 
I made a decision a while back that I will not be bullied into getting
studies like a CT or MRI when I don't think they are indicated.  If the
patient w... (score: 1.2205)

- It would be nice to think that individuals can somehow 'beat the system'


In [3]:
# 🔎 Interactive Query
while True:
    query = input("Enter your search query (or 'exit' to quit): ").strip()
    if query.lower() == "exit":
        break
    if not query:
        continue  # ignore empty input instead of searching for ""

    results = semantic_search(query, top_k=5)
    print(f"\nQuery: {query}\n")
    for res, dist in results:
        print(f"- {res[:200]}... (score: {dist:.4f})\n")


Enter your search query (or 'exit' to quit): doctor treating patients

Query: doctor treating patients

- 
Well, let me put it this way, based on my own experience.  A
general practitioner with no training in infectious diseases,
by establishing links to the "Lyme community", treating patients
who come to... (score: 1.0058)

- 
Dyer, you're rude. Medicine is not a totallly scientific endevour. It's
often practiced in a disorganized manner. Most early treatment of
non-life threatening illness is done o... (score: 1.1607)

- - Am I justified in being pissed off at this doctor?
- 
- Last Saturday evening my 6 year old son cut his finger badly with a knife.
- I took him to a local "Urgent and General Care" clinic at 5:50 pm... (score: 1.2194)

- [reply to geb@cs.pitt.edu (Gordon Banks)]
 
 
 
I made a decision a while back that I will not be bullied into getting
studies like a CT or MRI when I don't think they are indicated.  If the
patient w... (score: 1.2205)

- It
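
Several results start with long runs of blank lines because the raw posts keep their original line breaks after headers and quotes are stripped. One way to get compact previews is to collapse whitespace before truncating. This is a sketch with a hypothetical helper `preview`, not part of the notebook's original code:

```python
def preview(text: str, max_chars: int = 200) -> str:
    """Collapse every run of whitespace to one space, then truncate."""
    flat = " ".join(text.split())
    if len(flat) <= max_chars:
        return flat
    return flat[:max_chars] + "..."


# Made-up sample with leading newlines and irregular spacing
sample = "\n\n\n\nSome post text.   It spans\nseveral lines\n\nwith gaps."
print(preview(sample, max_chars=40))
```

In the display loops, `preview(res)` could stand in for `res[:200]`.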

In [4]:
import ipywidgets as widgets
from IPython.display import display

query_box = widgets.Text(
    description="Query:",
    placeholder="Type your search here...",
    continuous_update=False  # update value on Enter/blur, not on every keystroke
)

output = widgets.Output()

def on_submit(change):
    query = change['new'].strip()
    if not query:
        return  # nothing to search for
    with output:
        output.clear_output()
        results = semantic_search(query, top_k=5)
        print(f"\nQuery: {query}\n")
        for res, dist in results:
            print(f"- {res[:200]}... (score: {dist:.4f})\n")

query_box.observe(on_submit, names='value')
display(query_box, output)


Text(value='', description='Query:', placeholder='Type your search here...')

Output()
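
Removing headers, footers, and quotes can leave some posts empty or whitespace-only, and those still get embedded and can show up as results. A possible pre-filter is sketched below; `drop_empty` and `min_chars` are hypothetical names. It returns the original positions as well, so labels such as `newsgroups.target` can still be looked up:

```python
def drop_empty(docs, min_chars=1):
    """Keep documents with at least min_chars characters after stripping.

    Returns (kept_docs, original_indices).
    """
    kept, positions = [], []
    for i, doc in enumerate(docs):
        if len(doc.strip()) >= min_chars:
            kept.append(doc)
            positions.append(i)
    return kept, positions


# Tiny illustration with made-up strings
docs = ["first post", "", "   \n\n", "second post"]
kept, positions = drop_empty(docs)
print(kept, positions)
```

The filter has to run before `model.encode`, so that the FAISS row numbers line up with the filtered list.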

In [5]:
!pip install sentence-transformers faiss-cpu scikit-learn



In [6]:
# 🧠 Full Semantic Search Demo with 20 Newsgroups (Day 8 – LLM Journey)

# -----------------------------
# 📚 Imports
# -----------------------------
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import faiss
import numpy as np

# -----------------------------
# 🔨 Load Embedding Model
# -----------------------------
model = SentenceTransformer("all-MiniLM-L6-v2")

# -----------------------------
# 📝 Load the Full Dataset (All 20 Categories)
# -----------------------------
newsgroups = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes')
)

documents = newsgroups.data  # ~11,000 training docs
print(f"Loaded {len(documents)} documents across {len(newsgroups.target_names)} categories")
print("Sample categories:", newsgroups.target_names[:5])
print("\nSample Document:\n", documents[0][:300])  # preview first 300 chars

# -----------------------------
# 🔢 Encode Documents into Embeddings
# -----------------------------
doc_embeddings = model.encode(documents, show_progress_bar=True, batch_size=64)

# -----------------------------
# 🗄️ Build FAISS Index
# -----------------------------
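dimension = doc_embeddings.shape[1]
# IndexFlatL2 does exact (brute-force) search and reports squared L2 distances
index = faiss.IndexFlatL2(dimension)
# FAISS expects a contiguous float32 array
index.add(np.ascontiguousarray(doc_embeddings, dtype=np.float32))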

# -----------------------------
# 🔍 Semantic Search Function
# -----------------------------
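def semantic_search(query, top_k=5):
    """Return up to top_k (document, squared L2 distance) pairs, nearest first."""
    query_vec = np.asarray(model.encode([query]), dtype=np.float32)
    distances, indices = index.search(query_vec, top_k)
    # FAISS pads with index -1 when fewer than top_k results exist
    return [(documents[i], d) for i, d in zip(indices[0], distances[0]) if i != -1]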

# -----------------------------
# 🚀 Run a Test Query
# -----------------------------
query = "space exploration and NASA projects"
results = semantic_search(query, top_k=5)

print(f"Query: {query}\n")
# Note: the printed "score" is a squared L2 distance, so lower means more similar
for res, dist in results:
    print(f"- {res[:200]}... (score: {dist:.4f})\n")


Loaded 11314 documents across 20 categories
Sample categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware']

Sample Document:
 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I k


Query: space exploration and NASA projects

- 
Lets hear it for Dan Goldin...now if he can only convince the rest of
our federal government that the space program is a worth while
investment!

I hope that I will live to see the day we walk on Mar... (score: 0.8291)

- We are not at the end of the Space Age, but only at the end of Its
beginning.

That space exploration is no longer a driver for technical innovation,
or a focus of American cultural attention is certa... (score: 0.8486)

- With the continuin talk about the "End of the Space Age" and complaints 
by government over the large cost, why not try something I read about 
that might just work.

Announce that a reward of $1 bill... (score: 0.9167)

- Reply address: mark.prado@permanet.org

 > From: higgins@fnalf.fnal.gov (Bill Higgins-- Beam Jockey)
 >
 > In article <1993Apr19.230236.18227@aio.jsc.nasa.gov>,
 > > |> AW&ST  had a brief blurb on a M... (score: 0.9200)

- I am looking for any information about the space program.
This
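
`faiss.IndexFlatL2` is an exhaustive index: every query is compared against every stored vector, and the smallest squared L2 distances win. The pure-Python sketch below shows the same idea on toy data. The function name `brute_force_search` is hypothetical, and this is far slower than FAISS, so it is for illustration only:

```python
import heapq


def brute_force_search(query_vec, doc_vecs, top_k=5):
    """Exact nearest-neighbour search by squared L2 distance.

    Returns (doc_index, squared_distance) pairs, nearest first.
    """
    scored = (
        (sum((q - d) ** 2 for q, d in zip(query_vec, vec)), i)
        for i, vec in enumerate(doc_vecs)
    )
    nearest = heapq.nsmallest(top_k, scored)
    return [(i, dist) for dist, i in nearest]


# Toy 2-D "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions
docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(brute_force_search([0.9, 0.1], docs, top_k=2))
```

The cost grows linearly with the number of documents, which is fine for the 11,314 posts here. Larger collections usually call for an approximate index.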