# Word Embedding Search

This notebook loads 7-letter words from our word list and creates embeddings to find semantically similar words.

## Setup
First, we'll install required packages if they're not already installed.


In [7]:
import sentence_transformers
import sklearn
import numpy
import json


## 1. Load Word List
Load the words from our common-7-letter-words.txt file

In [27]:
with open('wordLists/6letters.json', 'r') as f:
    words = json.load(f)

print(f"Loaded {len(words)} words.")

Loaded 29874 words.


## 2. Generate Word Embeddings
Use sentence-transformers to generate embeddings for all words


In [28]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all words
print("Generating embeddings...")
word_embeddings = model.encode(words, show_progress_bar=True)
print("Done!")


Generating embeddings...


Batches: 100%|██████████| 934/934 [00:15<00:00, 61.79it/s]


Done!


## 3. Query Function
Create a function to find the most similar words


In [18]:

from sklearn.metrics.pairwise import cosine_similarity

def query_top_k(query, k=5):
    """Find the k most similar words to the query.
    
    Args:
        query (str): The word or phrase to find similar words to
        k (int): Number of similar words to return
        
    Returns:
        list: Top k words and their similarity scores
    """
    query_emb = model.encode([query])
    similarities = cosine_similarity(query_emb, word_embeddings)[0]
    top_k_idx = np.argsort(similarities)[::-1][:k]
    return [(words[i], similarities[i]) for i in top_k_idx]

def query_by_threshold(query, threshold=0.5):
    """Find all words with similarity score greater than threshold.
    
    Args:
        query (str): The word or phrase to find similar words to
        threshold (float): Minimum similarity score to return
        
    Returns:
        list: All words with similarity score greater than threshold
    """
    query_emb = model.encode([query])
    similarities = cosine_similarity(query_emb, word_embeddings)[0]
    # filter out words with similarity score less than threshold
    filtered_words = [word for word, similarity in zip(words, similarities) if similarity > threshold]
    # sort by similarity score
    sorted_words = sorted(filtered_words, key=lambda x: x[1], reverse=True)
    return sorted_words

## 4. Performing queries
Try finding similar words with different queries


In [25]:
# Example queries
query = "city"
threshold = 0.4

results = query_by_threshold(query, threshold=threshold)

# output as a json list
print(json.dumps(results, indent=4))


[
    "buffalo",
    "outdoor",
    "outside",
    "sunland",
    "stadium",
    "station",
    "utopian",
    "brewery",
    "freeway",
    "upstate",
    "company",
    "concord",
    "cottage",
    "council",
    "country",
    "holland",
    "tourist",
    "indoors",
    "america",
    "airport",
    "highway",
    "kitchen",
    "lincoln",
    "village",
    "bedroom",
    "central",
    "nearest",
    "seaside",
    "address",
    "academy",
    "badland",
    "bangkok",
    "capital",
    "capitol",
    "eastern",
    "garages",
    "harbour",
    "madison",
    "manager"
]
