# Shakespeare Smart Search: Exploring Love, Betrayal, and Blood with BigQuery AI

In this notebook, we explore Shakespeare’s works using **BigQuery AI’s vector search**.  
Instead of simple keyword search (which can miss nuance), we generate **semantic embeddings** with Gemini and use them to uncover hidden connections between plays, sonnets, and poems.  

This project is part of the [**BigQuery AI Hackathon**](https://www.kaggle.com/competitions/bigquery-ai-hackathon/data) (Semantic Detective 🕵️‍♀️ track).  

**Goals**:
- Demonstrate semantic search across Shakespeare’s works.  
- Compare thematic elements (e.g., love, betrayal, honor, blood).  
- Show how embeddings can connect plays and poems by meaning, not just words.  

**Dataset**:  
- Based on [nrennie/shakespeare](https://github.com/nrennie/shakespeare) (CSV of plays/poems/sonnets).  
- Uploaded into BigQuery as `shakespeare-smart-search.shakespeare_title_clean.works_master`.  

**AI Tools Used**:
- `ML.GENERATE_EMBEDDING` with **Gemini (gemini-embedding-001)**.  
- Vector similarity search in SQL.  


## 1. Setup

We start by connecting Colab to Google Cloud and BigQuery.  
Make sure you have:
- A Google Cloud project with BigQuery enabled.  
- Your dataset uploaded (mine is `shakespeare-smart-search.shakespeare_title_clean`).  
- Authentication via Colab (OAuth popup will appear).  


In [10]:
# Install required package
!pip install --quiet google-cloud-bigquery

# Import libraries
from google.cloud import bigquery
from google.colab import auth

# Authenticate
auth.authenticate_user()

# Initialize BigQuery client
client = bigquery.Client(project="shakespeare-smart-search")
print("✅ BigQuery client initialized")


✅ BigQuery client initialized


## 2. Preview the Shakespeare Dataset

We’ll take a look at the `works_master` table, which contains metadata and text lines.  
This will help us confirm the schema before moving into embeddings.  


In [11]:
query = """
SELECT File, line_number, dialogue AS text_line, Title, act, scene, character
FROM `shakespeare-smart-search.shakespeare_title_clean.works_master`
WHERE dialogue IS NOT NULL
LIMIT 5
"""
df = client.query(query).to_dataframe()
df


| | File | line_number | text_line | Title | act | scene | character |
|---|---|---|---|---|---|---|---|
| 0 | | | Enter KING HENRY, LORD JOHN OF LANCASTER, the ... | 1henryiv | Act I | Scene I | [stage direction] |
| 1 | | 1.0 | So shaken as we are, so wan with care, | 1henryiv | Act I | Scene I | King Henry Iv |
| 2 | | 2.0 | Find we a time for frighted peace to pant, | 1henryiv | Act I | Scene I | King Henry Iv |
| 3 | | 3.0 | And breathe short-winded accents of new broils | 1henryiv | Act I | Scene I | King Henry Iv |
| 4 | | 4.0 | To be commenced in strands afar remote. | 1henryiv | Act I | Scene I | King Henry Iv |


## 3. Creating Semantic Embeddings in BigQuery

Shakespeare’s text lines are stored in `works_master`.  
To enable semantic search, we use **Gemini embeddings**:

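```sql
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `shakespeare-smart-search.shakespeare_title_clean.embed_model`,
  (SELECT dialogue AS content
   FROM `shakespeare-smart-search.shakespeare_title_clean.works_master`),
  STRUCT(TRUE AS flatten_json_output)
)
```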
Normally you’d run `ML.GENERATE_EMBEDDING` here to compute embeddings from scratch. In this project, the embeddings were pre-computed into the `chunks_with_snippets` table in BigQuery, so the cells that follow query that table directly.
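
As a rough sketch of how a table like `chunks_with_snippets` could be built from scratch, the helpers below only assemble two SQL statements as strings: one that registers a remote embedding model and one that materializes the embeddings. The connection ID `us.vertex_conn`, the source table `chunks`, and both function names are hypothetical placeholders; the chunking step actually used for this project is not shown in the notebook.

```python
# Sketch only: builds SQL text; nothing is sent to BigQuery here.
PROJECT = "shakespeare-smart-search"
DATASET = "shakespeare_title_clean"


def create_model_sql(connection_id: str = "us.vertex_conn") -> str:
    """SQL that registers a remote embedding model.

    connection_id is a placeholder in <region>.<connection name> form.
    """
    return f"""
CREATE OR REPLACE MODEL `{PROJECT}.{DATASET}.embed_model`
REMOTE WITH CONNECTION `{PROJECT}.{connection_id}`
OPTIONS (ENDPOINT = 'gemini-embedding-001');
""".strip()


def build_embeddings_sql(source_table: str = "chunks") -> str:
    """SQL that embeds every snippet of a (hypothetical) source table."""
    return f"""
CREATE OR REPLACE TABLE `{PROJECT}.{DATASET}.chunks_with_snippets` AS
SELECT
  File,
  stanza,
  Genre,
  content AS snippet,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `{PROJECT}.{DATASET}.embed_model`,
  (SELECT File, stanza, Genre, snippet AS content
   FROM `{PROJECT}.{DATASET}.{source_table}`),
  STRUCT(TRUE AS flatten_json_output)
);
""".strip()


print(create_model_sql())
print(build_embeddings_sql())
```

Each statement could then be passed to `client.query(...)`, provided a Vertex AI connection with the right permissions exists in the project.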

In [14]:
# Check if the embeddings table exists and has data
query = """
SELECT File, stanza, snippet, ARRAY_LENGTH(embedding) AS embed_dims
FROM `shakespeare-smart-search.shakespeare_title_clean.chunks_with_snippets`
LIMIT 5
"""
df_preview = client.query(query).to_dataframe()
df_preview

| | File | stanza | snippet | embed_dims |
|---|---|---|---|---|
| 0 | | 260.0 | Brutus, who pluck'd the knife from Lucrece' si... | 3072 |
| 1 | | 73.0 | 'What! canst thou talk?' quoth she, 'hast thou... | 3072 |
| 2 | | 113.0 | 'What should I do, seeing thee so indeed, That... | 3072 |
| 3 | | 24.0 | 'Were I hard-favour'd, foul, or wrinkled-old, ... | 3072 |
| 4 | | 44.0 | Now which way shall she turn? what shall she s... | 3072 |


**NOTE**: Due to time constraints, we embedded only the text snippets themselves. A natural next step would be to join the results back to the `works_master_cleaned` table to enrich them with play and character metadata.

## 4. Query Function: Semantic Search

We’ll define a helper function `semantic_search(query_text)` that:

1. Generates an embedding for the query string with `ML.GENERATE_EMBEDDING`.  
2. Computes the dot product between the query embedding and each stored snippet embedding.  
3. Returns the `top_k` snippets with the highest scores.  

In [12]:
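def semantic_search(query_text, top_k=5):
    """Return the top_k snippets most similar to query_text."""
    # The search text is bound as the @query_text parameter instead of being
    # pasted into the SQL, so quotes in the text cannot break the query.
    # int() ensures only a number is ever interpolated for LIMIT.
    sql = f"""
    DECLARE query_embedding ARRAY<FLOAT64>;

    -- Generate embedding for the query
    SET query_embedding = (
      SELECT ml_generate_embedding_result
      FROM ML.GENERATE_EMBEDDING(
        MODEL `shakespeare-smart-search.shakespeare_title_clean.embed_model`,
        (SELECT @query_text AS content),
        STRUCT(TRUE AS flatten_json_output)
      )
    );

    -- Find top matches (dot product of stored and query embeddings)
    SELECT
      File,
      stanza,
      Genre,
      snippet,
      (
        SELECT SUM(x*y)
        FROM UNNEST(embedding) x WITH OFFSET
        JOIN UNNEST(query_embedding) y WITH OFFSET
        USING (offset)
      ) AS similarity
    FROM `shakespeare-smart-search.shakespeare_title_clean.chunks_with_snippets`
    ORDER BY similarity DESC
    LIMIT {int(top_k)};
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("query_text", "STRING", query_text)
        ]
    )
    return client.query(sql, job_config=job_config).to_dataframe()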
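
The manual dot product above scans every row of the table. One possible alternative is BigQuery's built-in `VECTOR_SEARCH` table function, sketched here as a SQL string builder. It assumes the same `embed_model` and `chunks_with_snippets` table, expects the search text to be bound as a named parameter `@query_text`, and reports cosine distance (lower is closer) instead of a similarity score. The function name `vector_search_sql` is illustrative and not part of the original notebook.

```python
TABLE = "shakespeare-smart-search.shakespeare_title_clean.chunks_with_snippets"
MODEL = "shakespeare-smart-search.shakespeare_title_clean.embed_model"


def vector_search_sql(top_k: int = 5) -> str:
    """Return a VECTOR_SEARCH query; the search text is bound later as @query_text."""
    top_k = int(top_k)  # only an integer is ever interpolated into the SQL
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return f"""
SELECT
  base.stanza AS stanza,
  base.snippet AS snippet,
  distance
FROM VECTOR_SEARCH(
  TABLE `{TABLE}`,
  'embedding',
  (
    -- The query column is named like the searched column ('embedding').
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `{MODEL}`,
      (SELECT @query_text AS content),
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => {top_k},
  distance_type => 'COSINE'
)
ORDER BY distance
""".strip()


print(vector_search_sql(top_k=5))
```

With the `client` from the Setup section, the string could be run via `client.query(sql, job_config=bigquery.QueryJobConfig(query_parameters=[bigquery.ScalarQueryParameter("query_text", "STRING", "honor")]))`. For a table of this size a brute-force scan is likely fine; a vector index would mainly matter for much larger corpora.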

## 5. Thematic Search Example

Let’s search for the theme of **love and betrayal**.  
We expect results from tragedies like *Othello* as well as narrative poems like *Lucrece*.  

Notice: even if the word “betrayal” isn’t present, semantically related passages (infidelity, deception, false vows) can still surface.  
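
To make the contrast with keyword search concrete, here is a small local sketch. It runs a plain substring match over two snippet openings that appear in this section's search output and shows that neither contains the word "betrayal", even though one of them speaks of being "forsworn". The `keyword_search` helper is illustrative and not part of the notebook's pipeline.

```python
def keyword_search(snippets, keyword):
    """Return the snippets that literally contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [s for s in snippets if needle in s.lower()]


snippets = [
    "He looks upon his love and neighs unto her;",
    "In loving thee thou know'st I am forsworn,",
]

love_hits = keyword_search(snippets, "love")
betrayal_hits = keyword_search(snippets, "betrayal")

# "loving" does not contain the substring "love", so only the first line matches.
print(len(love_hits), len(betrayal_hits))  # prints: 1 0
```

A literal match therefore misses the broken-vow line entirely, which is the kind of passage an embedding-based search is meant to recover.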


In [13]:
df_results = semantic_search("Find lines about love and betrayal", top_k=5)
df_results


| | File | stanza | Genre | snippet | similarity |
|---|---|---|---|---|---|
| 0 | | 53.0 | | He looks upon his love and neighs unto her; Sh... | 0.642223 |
| 1 | | | | In loving thee thou know'st I am forsworn, My ... | 0.632225 |
| 2 | | 135.0 | | 'Love comforteth like sunshine after rain, But... | 0.628241 |
| 3 | | 70.0 | | I know not love,' quoth he, 'nor will not know... | 0.615321 |
| 4 | | 131.0 | | 'If love have lent you twenty thousand tongues... | 0.613862 |
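
The `similarity` column is a raw dot product between the query vector and each stored vector. That equals cosine similarity only when both vectors have unit length, which this notebook does not check. The pure-Python sketch below uses made-up 2-dimensional vectors (not real embeddings) to show how the two measures differ when vectors are not normalized.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b)


query = [3.0, 4.0]     # length 5
passage = [6.0, 8.0]   # same direction, length 10

print(dot(query, passage))                # 50.0
print(cosine_similarity(query, passage))  # 1.0
```

If the stored embeddings turned out not to be unit length, dividing by the norms as in `cosine_similarity` would keep long vectors from ranking higher purely because of their magnitude.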


## 6. Character Voice Comparison

Prompt: *Find lines similar to “honor”*.

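Falstaff ridicules honor in *Henry IV*; King Henry and Hotspur glorify it.  
Even though their tones are opposite, we would expect embeddings to place such passages close together, because they all orbit the same concept.  
In the run shown below, however, the top five snippets all carry stanza numbers and read as narrative verse rather than *Henry IV* dialogue, so this comparison is not yet visible in the output.  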

In [15]:
semantic_search("honor", top_k=5)

| | File | stanza | Genre | snippet | similarity |
|---|---|---|---|---|---|
| 0 | | 171.0 | | 'O Jove,' quoth she, 'how much a fool was I To... | 0.599833 |
| 1 | | 231.0 | | Three times with sighs she gives her sorrow fi... | 0.587989 |
| 2 | | 149.0 | | For now she knows it is no gentle chase, But t... | 0.569207 |
| 3 | | 122.0 | | 'But if thou fall, O, then imagine this, The e... | 0.568832 |
| 4 | | 173.0 | | As falcon to the lure, away she flies; The gra... | 0.568513 |


## 7. Genre Crossover

Prompt: *Search for 'blood'*.

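We would expect results from tragedies like *Macbeth* as well as from poems like *Lucrece*.  
In the run shown below, the top five snippets all carry stanza numbers, which points to the poems rather than the plays. Cross-genre matches, where they occur, would show how embeddings bridge genres by focusing on imagery, not literal overlap.  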

In [16]:
semantic_search("blood", top_k=5)

| | File | stanza | Genre | snippet | similarity |
|---|---|---|---|---|---|
| 0 | | 251.0 | | About the mourning and congealed face Of that ... | 0.631692 |
| 1 | | 250.0 | | And bubbling from her breast, it doth divide I... | 0.614293 |
| 2 | | 207.0 | | And from the strand of Dardan, where they foug... | 0.582654 |
| 3 | | 264.0 | | 'Now, by the Capitol that we adore, And by thi... | 0.581832 |
| 4 | | 237.0 | | 'Mine enemy was strong, my poor self weak, And... | 0.576578 |


## 8. Conclusion

In this notebook we demonstrated how **BigQuery AI + Gemini embeddings** enable semantic search over Shakespeare.

- Instead of keyword search, we used vector similarity to uncover passages related by *meaning* (love, betrayal, honor, blood).  
- We showed how poems and plays can be linked across genres, highlighting Shakespeare’s recurring imagery and themes.  
- The approach scales: the same method could be applied to support tickets, medical notes, or legal archives.  

**Limitations & Next Steps:**
- Add metadata joins (character, title, genre) to make results richer.
- Build a simple UI to let users type a query instead of editing SQL.
- Explore clustering/visualization to group plays by theme.

Even with these limitations, the project illustrates the power of vector search in BigQuery for exploring unstructured text collections.