# Step 2: Create Embeddings for Semantic Search

**What are embeddings?**
- Embeddings convert text into vectors (lists of numbers) that capture meaning
- Texts with similar meanings map to vectors that are close together
- This lets us search by *meaning*, not just keywords

**Example:**
- Query: "help for homeless veteran"
- Finds: "Transitional Housing for Veterans" (even though words don't match exactly)
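
To make "similar vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The 3-dimensional vectors below are made up for illustration (they do not come from a real model, and real embeddings have hundreds of dimensions); the point is only that the veteran housing text scores closer to the query than an unrelated service does.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-picked for this example
toy_vectors = {
    "help for homeless veteran":         [0.9, 0.8, 0.1],
    "Transitional Housing for Veterans": [0.8, 0.9, 0.2],
    "Senior Nutrition Program":          [0.1, 0.2, 0.9],
}

query_vec = toy_vectors["help for homeless veteran"]
scores = {name: cosine_similarity(query_vec, vec) for name, vec in toy_vectors.items()}

# Print every text with its similarity to the query
for name, score in scores.items():
    print(f"{score:.3f}  {name}")
```

A higher score means the two texts point in a more similar direction in the vector space.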

## Install Required Packages

Run this once to install what we need:

In [1]:
# Run this cell once to install packages
!pip install sentence-transformers chromadb -q



## Load the Data

In [2]:
import json
import warnings
warnings.filterwarnings("ignore")

with open('../data/homeless_services_hackathon.json', 'r') as f:
    services = json.load(f)

print(f"Loaded {len(services)} services")

Loaded 1719 services


## Prepare Text for Embeddings

We need to combine the important fields into a single text string for each service.
This is what the embedding model will "read" to understand each service.

In [3]:
def create_service_text(service):
    """
    Combine relevant fields into a single searchable text.
    This is what gets converted to an embedding.
    """
    parts = []
    
    # Core info
    if service.get('service_name'):
        parts.append(f"Service: {service['service_name']}")
    if service.get('organization'):
        parts.append(f"Organization: {service['organization']}")
    if service.get('description'):
        parts.append(f"Description: {service['description']}")
    
    # Who it's for
    if service.get('eligibility'):
        parts.append(f"Eligibility: {service['eligibility']}")
    if service.get('target_populations'):
        parts.append(f"Target Population: {service['target_populations']}")
    
    # What type of service
    if service.get('types'):
        parts.append(f"Service Types: {', '.join(service['types'])}")
    if service.get('areas_of_focus'):
        parts.append(f"Areas of Focus: {', '.join(service['areas_of_focus'])}")
    
    # Location
    if service.get('area_served'):
        parts.append(f"Area Served: {service['area_served']}")
    
    return "\n".join(parts)

# Test it on one service
sample_text = create_service_text(services[0])
print("Sample service text:")
print("-" * 50)
print(sample_text[:1000])  # First 1000 chars

Sample service text:
--------------------------------------------------
Service: Interim Shelter Bed Program (ISB)
Organization: Urban Street Angels
Description: Offers an interim shelter program that provides shelter for homeless youth who need a bed and food in a trauma-informed environment. 

Offers the following once enlisted: 
•	Food
•	Hygiene supplies
•	Clothing 
•	Transitional housing
•	Job search assistance
•	Education opportunities
•	Case management
•	Mental health therapy
•	Substance abuse treatment
•	Transportation support
Eligibility: Must be 18-24 years old, currently homeless, living at a temporary shelter or institution, or at risk of experiencing homelessness
Target Population: Homeless Youth/Runaway/Youth Shelter Residents/
Service Types: Case Management & Coordination, TAY Services, Mental Health Services, Transitional & Supportive Housing, Emergency Shelter & Crisis Intervention, Homelessness Prevention & Diversion, Food & Basic Needs Assistance, Substance Abuse Diso

## Create the Embedding Model

We're using `sentence-transformers` - it's free and runs locally (no API key needed).

In [4]:
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a small, fast model well suited to semantic search
# (it produces 384-dimensional embeddings)
# First run will download the model (~90MB)
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Model loaded!")

Model loaded!


## Set Up ChromaDB (Vector Database)

ChromaDB stores our embeddings and lets us search them quickly.

In [5]:
import chromadb

# Create a persistent database (saves to disk)
chroma_client = chromadb.PersistentClient(path="../data/chroma_db")

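# Delete existing collection if it exists (for clean reruns)
try:
    chroma_client.delete_collection("services")
except Exception:
    # Most likely the collection doesn't exist yet (first run), so there is nothing to delete
    pass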

# Create a new collection
collection = chroma_client.create_collection(
    name="services",
    metadata={"description": "Homeless services from 211 San Diego"}
)

print("ChromaDB collection created!")

ChromaDB collection created!
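
Conceptually, a vector database answers one question: which stored vectors are closest to a query vector? The sketch below shows that idea as a brute-force search in plain Python. It is not how ChromaDB works internally (it uses an approximate index to stay fast on large collections), and the function names, the 2-dimensional vectors, and the choice of squared Euclidean distance are all assumptions made for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors (smaller = closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query_vec, stored, n_results=2):
    """Return the n_results (id, distance) pairs closest to query_vec."""
    scored = [(item_id, squared_l2(query_vec, vec)) for item_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # closest first
    return scored[:n_results]

# Hypothetical stored vectors, keyed by string IDs like the ones we give ChromaDB
stored_vectors = {
    "0": [0.0, 1.0],
    "1": [1.0, 0.0],
    "2": [0.9, 0.1],
}

hits = nearest([1.0, 0.0], stored_vectors)
print(hits)
```

Checking every stored vector is fine for a few thousand services, but the cost grows with every item added, which is why a dedicated index helps.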


## Generate Embeddings and Store in Database

Depending on your hardware, this takes anywhere from a few seconds to a couple of minutes to process all services.

In [6]:
from tqdm import tqdm  # Progress bar

# Process in batches for efficiency
batch_size = 100

for i in tqdm(range(0, len(services), batch_size), desc="Processing services"):
    batch = services[i:i + batch_size]
    
    # Create text for each service
    texts = [create_service_text(s) for s in batch]
    
    # Generate embeddings
    embeddings = model.encode(texts).tolist()
    
    # Create IDs and metadata
    ids = [str(i + j) for j in range(len(batch))]
    
    # Store metadata we'll want when retrieving results
    # (`or` falls back to an empty value when a field is missing or null)
    metadatas = [{
        "service_name": s.get('service_name') or '',
        "organization": s.get('organization') or '',
        "phone": s.get('main_phone') or '',
        "address": s.get('address') or '',
        "url": s.get('url') or '',
        "types": ', '.join(s.get('types') or []),
        "area_served": s.get('area_served') or ''
    } for s in batch]
    
    # Add to ChromaDB
    collection.add(
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )

print(f"\nDone! Added {collection.count()} services to the database.")

Processing services: 100%|██████████| 18/18 [00:05<00:00,  3.01it/s]


Done! Added 1719 services to the database.
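
The progress bar reports 18 batches because 1719 services split into chunks of 100 gives 17 full batches plus one final batch of 19. Here is a small sketch of that arithmetic, which also checks that the `str(i + j)` ID scheme never produces a duplicate. The helper name `batch_ranges` is made up for this example.

```python
def batch_ranges(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]

ranges = batch_ranges(1719, 100)

# Rebuild the IDs the same way the loop does: one string per global index
ids = [str(i) for start, end in ranges for i in range(start, end)]

print(f"Batches: {len(ranges)}")
print(f"Last batch covers indices {ranges[-1][0]} to {ranges[-1][1] - 1}")
print(f"Unique IDs: {len(set(ids))}")
```

Unique IDs matter because ChromaDB identifies each stored item by its ID.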





## Test the Search!

Let's try some queries a case manager might ask.

In [7]:
def search_services(query, n_results=5):
    """
    Search for services matching the query.
    Returns the top n_results matches.
    """
    # Convert query to embedding
    query_embedding = model.encode(query).tolist()
    
    # Search ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    return results

# Test query
query = "I need shelter for a homeless veteran"
results = search_services(query)

print(f"Query: {query}")
print("=" * 60)
for i, meta in enumerate(results['metadatas'][0], start=1):
    print(f"\n{i}. {meta['service_name']}")
    print(f"   Organization: {meta['organization']}")
    print(f"   Phone: {meta['phone']}")
    print(f"   Types: {meta['types'][:80]}...")

Query: I need shelter for a homeless veteran
============================================================

1. National Call Center for Homeless Veterans
   Organization: United States Department of Veterans Affairs (VA)
   Phone: (877) 424-3838
   Types: Mental Health Services, Veteran Services...

2. Homeless Veterans' Reintegration Program
   Organization: Able-Disabled Advocacy
   Phone: (619) 266-4247
   Types: Mental Health Services, Veteran Services...

3. Housing Stability Case Management
   Organization: Interfaith Community Services
   Phone: (760) 529-9979
   Types: Disability Services, Case Management & Coordination, Housing Search & Navigation...

4. Coordinated Entry Access Site (CES), VA Healthcare Systems, Oceanside
   Organization: United States Department of Veterans Affairs (VA)
   Phone: (619) 497-8989
   Types: Homelessness Prevention & Diversion, Disability Services, Case Management & Coor...

5. Harm Reduction Shelter
   Organization: Alpha Project for the Homeless
   Phone: (619) 860-2800
   Types: Emergency Shelter & Cris
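
The result object is a dictionary of nested lists: one outer list entry per query, and one inner entry per match. The sketch below flattens that shape into simple rows. It assumes the result also carries `ids` and `distances` next to `metadatas` (to our knowledge ChromaDB includes them by default, but check your version). The `mock_results` dictionary is a hand-written stand-in with made-up IDs and distances, and `flatten_results` is a hypothetical helper, not part of ChromaDB.

```python
def flatten_results(results):
    """Turn a ChromaDB-style result for a single query into a list of row dicts."""
    rows = []
    for item_id, meta, dist in zip(
        results["ids"][0], results["metadatas"][0], results["distances"][0]
    ):
        rows.append({
            "id": item_id,
            "service_name": meta["service_name"],
            "distance": dist,  # smaller distance = closer match
        })
    return rows

# Stand-in for the output of collection.query(); IDs and distances are invented
mock_results = {
    "ids": [["12", "345"]],
    "metadatas": [[
        {"service_name": "National Call Center for Homeless Veterans"},
        {"service_name": "Harm Reduction Shelter"},
    ]],
    "distances": [[0.61, 0.88]],
}

rows = flatten_results(mock_results)
for row in rows:
    print(f"{row['distance']:.2f}  {row['service_name']}")
```

Showing the distance next to each name makes it easier to spot weak matches near the bottom of the list.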

In [8]:
# Try more queries
test_queries = [
    "food assistance for seniors",
    "mental health services for youth",
    "help paying rent to avoid eviction",
    "domestic violence shelter",
    "job training for homeless adults"
]

for query in test_queries:
    results = search_services(query, n_results=3)
    print(f"\nQuery: {query}")
    print("-" * 40)
    for meta in results['metadatas'][0]:
        print(f"  - {meta['service_name']}")


Query: food assistance for seniors
----------------------------------------
  - Senior Nutrition Program
  - Senior Farmers Market Nutrition Program
  - Senior Food Program, Back Country Support, Boulevard

Query: mental health services for youth
----------------------------------------
  - Adolescent Habilitative Learning Program
  - Adolescent Intensive Outpatient - Healthy yoUth
  - Children and Adolescent Mental Health Services

Query: help paying rent to avoid eviction
----------------------------------------
  - City of San Diego Eviction Prevention Program
  - Rent and Utility Payment Assistance
  - Housing and Foreclosure Counseling

Query: domestic violence shelter
----------------------------------------
  - National Domestic Violence Hotline
  - Domestic Violence Shelters
  - Domestic Violence Shelters

Query: job training for homeless adults
----------------------------------------
  - STEPS Program, San Diego Centre City Corps
  - Long Term Transitional Housing
  - Missio

## Summary

We now have:
1. All services converted to embeddings
2. Stored in a searchable vector database
3. A `search_services()` function that finds relevant services by meaning

**Next step (notebook 03):** Connect this to an LLM to generate helpful, conversational responses instead of just returning raw results.