# 📚 The Great Gatsby - Vector Database Project

This notebook converts F. Scott Fitzgerald's "The Great Gatsby" into a searchable vector database with ChromaDB and runs semantic search queries against it. It covers:

- Loading and preprocessing text data
- Creating embeddings with SentenceTransformers
- Storing vectors in ChromaDB
- Performing semantic search queries

## 1. Install and Import Required Libraries

In [1]:
import chromadb
from chromadb.utils import embedding_functions
from pathlib import Path
import requests
import re
from pprint import pprint

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!
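
The import cell only succeeds if the packages are already installed. As a minimal sketch, the check below reports which modules are missing before the rest of the notebook runs. The list in `REQUIRED` is an assumption inferred from the imports above and from the SentenceTransformers embedding function used later; `missing_modules` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Assumed from this notebook's usage: chromadb and requests are imported
# directly, and sentence_transformers backs the embedding function used later.
REQUIRED = ["chromadb", "sentence_transformers", "requests"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported missing, `pip install chromadb sentence-transformers requests` is the usual way to install these packages.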


## 2. Download The Great Gatsby Text

We'll download the text from Project Gutenberg (public domain).

In [2]:
# Download The Great Gatsby from Project Gutenberg
url = "https://www.gutenberg.org/files/64317/64317-0.txt"
data_path = Path("data/great_gatsby.txt")

# Create the data directory (and any missing parents) if it doesn't exist
data_path.parent.mkdir(parents=True, exist_ok=True)

# Download the file if it doesn't exist
if not data_path.exists():
    print("📥 Downloading The Great Gatsby...")
    response = requests.get(url, timeout=30)  # fail instead of hanging on a stalled connection
    response.raise_for_status()

    with open(data_path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    print("✅ Downloaded successfully!")
else:
    print("✅ File already exists!")

# Read the file
with open(data_path, 'r', encoding='utf-8') as f:
    raw_text = f.read()

print(f"📊 Total characters: {len(raw_text)}")
print(f"📊 First 500 characters:\n{raw_text[:500]}")

📥 Downloading The Great Gatsby...
✅ Downloaded successfully!
📊 Total characters: 270822
📊 First 500 characters:
*** START OF THE PROJECT GUTENBERG EBOOK 64317 ***




                           The Great Gatsby
                                  by
                          F. Scott Fitzgerald


                           Table of Contents

I
II
III
IV
V
VI
VII
VIII
IX


                              Once again
                                  to
                                 Zelda


  Then wear the gold hat, if that will move her;
  If you can bounce high, bounce for her too,
  Till she cry “Lover, go


## 3. Clean and Preprocess the Text

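Remove the Project Gutenberg header and footer, then normalize whitespace. Note that this edition marks its chapters with bare Roman numerals (`I`, `II`, ...), so none of the `CHAPTER I`-style start markers match and the slice starts at index 0: the output below still begins with the `*** START OF THE PROJECT GUTENBERG EBOOK` line.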

In [3]:
# Clean the text - remove Project Gutenberg header and footer
def clean_text(text):
    # Find the start of the actual book (after the Gutenberg header)
    start_markers = ["CHAPTER I", "Chapter I", "CHAPTER 1"]
    start_idx = 0
    for marker in start_markers:
        idx = text.find(marker)
        if idx != -1:
            start_idx = idx
            break
    
    # Find the end (before Gutenberg footer)
    end_markers = ["End of Project Gutenberg", "*** END OF THE PROJECT GUTENBERG"]
    end_idx = len(text)
    for marker in end_markers:
        idx = text.find(marker)
        if idx != -1:
            end_idx = idx
            break
    
    # Extract the main text
    clean = text[start_idx:end_idx]
    
    # Clean up extra whitespace
    clean = re.sub(r'\n{3,}', '\n\n', clean)
    clean = re.sub(r' {2,}', ' ', clean)
    
    return clean.strip()

cleaned_text = clean_text(raw_text)
print(f"📊 Cleaned text length: {len(cleaned_text)} characters")
print(f"📊 First 500 characters:\n{cleaned_text[:500]}")

📊 Cleaned text length: 269986 characters
📊 First 500 characters:
*** START OF THE PROJECT GUTENBERG EBOOK 64317 ***

 The Great Gatsby
 by
 F. Scott Fitzgerald

 Table of Contents

I
II
III
IV
V
VI
VII
VIII
IX

 Once again
 to
 Zelda

 Then wear the gold hat, if that will move her;
 If you can bounce high, bounce for her too,
 Till she cry “Lover, gold-hatted, high-bouncing lover,
 I must have you!”

 Thomas Parke d’Invilliers

 I

In my younger and more vulnerable years my father gave me some advice
that I’ve been turning over in my mind ever since.

“Whenev
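
Because the start markers in `clean_text` do not match this edition, one possible alternative is to cut on Project Gutenberg's own `*** START OF` and `*** END OF` delimiter lines. The sketch below assumes both delimiters sit on their own lines, as the start line does in the output above; `strip_gutenberg_boilerplate` is a hypothetical name, and the sample text is hand-written for illustration. It does not remove the title page, table of contents, or epigraph.

```python
import re


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the '*** START OF' and '*** END OF' lines."""
    text = text.lstrip("\ufeff")  # drop a byte-order mark if one is present
    start = re.search(r"^\*\*\* ?START OF .*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF .*$", text, flags=re.MULTILINE)

    # Fall back to the whole text when a delimiter is not found.
    start_idx = start.end() if start else 0
    end_idx = end.start() if end else len(text)
    body = text[start_idx:end_idx]

    # Same whitespace normalization as clean_text.
    body = re.sub(r"\n{3,}", "\n\n", body)
    body = re.sub(r" {2,}", " ", body)
    return body.strip()


# Hand-written sample, not the real file contents.
sample = (
    "*** START OF THE PROJECT GUTENBERG EBOOK 64317 ***\n\n\n"
    "In my younger and more vulnerable years...\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK 64317 ***\n"
    "License text"
)
print(strip_gutenberg_boilerplate(sample))
```

Switching to this function would change the cleaned text length and the chunk counts reported later in the notebook.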


## 4. Split Text into Chunks

We'll split the text into paragraphs for better semantic search results.

In [4]:
# Split into paragraphs (chunks)
def split_into_chunks(text, min_length=100):
    """Split text into paragraphs, filtering out very short ones."""
    # Split by double newlines (paragraphs)
    paragraphs = text.split('\n\n')
    
    # Filter out very short paragraphs
    chunks = [p.strip() for p in paragraphs if len(p.strip()) >= min_length]
    
    return chunks

chunks = split_into_chunks(cleaned_text)
print(f"📊 Number of chunks: {len(chunks)}")
print(f"📊 Average chunk length: {sum(len(c) for c in chunks) // len(chunks)} characters")
print(f"\n📝 Sample chunk (first one):\n{chunks[0][:300]}...")
print(f"\n📝 Sample chunk (middle one):\n{chunks[len(chunks)//2][:300]}...")

📊 Number of chunks: 769
📊 Average chunk length: 289 characters

📝 Sample chunk (first one):
Then wear the gold hat, if that will move her;
 If you can bounce high, bounce for her too,
 Till she cry “Lover, gold-hatted, high-bouncing lover,
 I must have you!”...

📝 Sample chunk (middle one):
“They’re such beautiful shirts,” she sobbed, her voice muffled in the
thick folds. “It makes me sad because I’ve never seen such—such
beautiful shirts before.”...
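
Filtering by `min_length` drops every paragraph under 100 characters, which in a novel includes many short lines of dialogue. As a sketch of one alternative (not the approach used in this notebook), short paragraphs can be glued onto the paragraph that follows instead of being discarded. `merge_short_paragraphs` is a hypothetical helper and the sample text is made up.

```python
def merge_short_paragraphs(text, min_length=100):
    """Split on blank lines, then attach short paragraphs to the one that follows."""
    chunks = []
    pending = ""  # short paragraphs waiting to be attached to the next one
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        pending = f"{pending}\n\n{paragraph}" if pending else paragraph
        if len(pending) >= min_length:
            chunks.append(pending)
            pending = ""
    if pending:
        # Trailing short text: attach it to the last chunk, or keep it alone.
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{pending}"
        else:
            chunks.append(pending)
    return chunks


# Made-up sample: a short line, a long paragraph, and another short line.
sample = '"Yes."\n\n' + "A" * 120 + '\n\n"No."'
merged = merge_short_paragraphs(sample)
for chunk in merged:
    print(len(chunk), repr(chunk[:20]))
```

The trade-off is that merged chunks are longer and mix speakers, which may blur what each embedding represents.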


## 5. Initialize ChromaDB Client

Create a persistent client to store our vector database locally.

In [6]:
# Initialize ChromaDB with persistent storage
chroma_client = chromadb.PersistentClient(path="./vector_db")
print("✅ ChromaDB client initialized!")

✅ ChromaDB client initialized!


## 6. Create Sentence Transformer Embedding Function

Initialize the embedding model using SentenceTransformers.

In [7]:
# Create embedding function using SentenceTransformer
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

print("✅ Sentence Transformer embedding function created!")
print("📊 This model creates 384-dimensional embeddings")

✅ Sentence Transformer embedding function created!
📊 This model creates 384-dimensional embeddings


## 7. Create Vector Embeddings

Generate embeddings for all text chunks. This may take a minute...

In [8]:
# Generate embeddings for all chunks
print(f"🔄 Generating embeddings for {len(chunks)} chunks...")
vectors = sentence_transformer_ef(chunks)

print(f"✅ Generated {len(vectors)} embeddings!")
print(f"📊 Each embedding has {len(vectors[0])} dimensions")

🔄 Generating embeddings for 769 chunks...
✅ Generated 769 embeddings!
📊 Each embedding has 384 dimensions


## 8. Store Embeddings in ChromaDB Collection

Create a collection and add all chunks with their embeddings.

In [9]:
# Create unique IDs for each chunk
ids = [f"chunk_{i}" for i in range(len(chunks))]

# Uncomment to delete existing collection if needed
# chroma_client.delete_collection(name="great_gatsby")

# Create or get collection
collection = chroma_client.get_or_create_collection(name="great_gatsby")

print(f"📦 Adding {len(chunks)} chunks to ChromaDB...")

# Add documents to collection
collection.add(
    documents=chunks,
    ids=ids,
    embeddings=vectors,
)

print("✅ Successfully added documents to collection!")
print(f"📊 Collection count: {collection.count()}")

📦 Adding 769 chunks to ChromaDB...
✅ Successfully added documents to collection!
📊 Collection count: 769
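
Since the collection is opened with `get_or_create_collection`, re-running the cell writes the same IDs into an existing collection. One possible way to keep re-runs idempotent is to write with `collection.upsert`, which updates existing IDs and inserts new ones, and to send the records in batches. The sketch below is an illustration under assumptions: `batched` and `upsert_in_batches` are hypothetical helpers, the batch size is arbitrary, and `InMemoryCollection` is a stand-in used only so the example runs without ChromaDB.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(collection, ids, documents, embeddings, batch_size=256):
    """Write records with collection.upsert so that re-running is idempotent."""
    total = 0
    for id_batch, doc_batch, emb_batch in zip(
        batched(ids, batch_size),
        batched(documents, batch_size),
        batched(embeddings, batch_size),
    ):
        collection.upsert(ids=id_batch, documents=doc_batch, embeddings=emb_batch)
        total += len(id_batch)
    return total


class InMemoryCollection:
    """Hypothetical stand-in for a ChromaDB collection, for demonstration only."""

    def __init__(self):
        self.records = {}

    def upsert(self, ids, documents, embeddings):
        for record_id, doc, emb in zip(ids, documents, embeddings):
            self.records[record_id] = (doc, emb)

    def count(self):
        return len(self.records)


demo = InMemoryCollection()
demo_ids = [f"chunk_{i}" for i in range(5)]
demo_docs = [f"passage {i}" for i in range(5)]
demo_vecs = [[float(i), 0.0] for i in range(5)]

upsert_in_batches(demo, demo_ids, demo_docs, demo_vecs, batch_size=2)
upsert_in_batches(demo, demo_ids, demo_docs, demo_vecs, batch_size=2)  # second run
print("Records after two runs:", demo.count())
```

With a real collection, the call would look like `upsert_in_batches(collection, ids, chunks, vectors)`.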


## 9. Query the Vector Database

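Now let's search for relevant passages using semantic search. ChromaDB returns distances rather than similarities, and a collection created without an explicit metric uses squared L2 distance by default, so the `1 - distance` value labeled "Similarity" in the cells below is only a rough relevance score: higher still means closer, but it is not a cosine similarity and can be negative.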

In [10]:
# Query 1: Find passages about Gatsby's parties
query = "Gatsby's extravagant parties with music and dancing"
query_embedding = sentence_transformer_ef([query])

results = collection.query(
    query_embeddings=query_embedding,
    n_results=5, # how many results to return
)

print("🔍 Query:", query)
print("\n" + "="*80)
for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
    print(f"\n📄 Result {i+1} (Similarity: {1-distance:.4f})")
    print("-" * 80)
    print(doc[:300] + "..." if len(doc) > 300 else doc)
    print()

🔍 Query: Gatsby's extravagant parties with music and dancing


📄 Result 1 (Similarity: 0.3714)
--------------------------------------------------------------------------------
I believe that on the first night I went to Gatsby’s house I was one
of the few guests who had actually been invited. People were not
invited—they went there. They got into automobiles which bore them out
to Long Island, and somehow they ended up at Gatsby’s door. Once there
they were introduced by ...


📄 Result 2 (Similarity: 0.2974)
--------------------------------------------------------------------------------
Daisy and Gatsby danced. I remember being surprised by his graceful,
conservative foxtrot—I had never seen him dance before. Then they
sauntered over to my house and sat on the steps for half an hour,
while at her request I remained watchfully in the garden. “In case
there’s a fire or a flood,” she ...


📄 Result 3 (Similarity: 0.1896)
-----------------------------------------

In [11]:
# Query 2: Find passages about the green light
query = "the green light at the end of the dock"
query_embedding = sentence_transformer_ef([query])

results = collection.query(
    query_embeddings=query_embedding,
    n_results=3,
)

print("🔍 Query:", query)
print("\n" + "="*80)
for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
    print(f"\n📄 Result {i+1} (Similarity: {1-distance:.4f})")
    print("-" * 80)
    print(doc[:400] + "..." if len(doc) > 400 else doc)
    print()

🔍 Query: the green light at the end of the dock


📄 Result 1 (Similarity: 0.2204)
--------------------------------------------------------------------------------
“If it wasn’t for the mist we could see your home across the bay,”
said Gatsby. “You always have a green light that burns all night at
the end of your dock.”


📄 Result 2 (Similarity: 0.0462)
--------------------------------------------------------------------------------
With an effort Wilson left the shade and support of the doorway and,
breathing hard, unscrewed the cap of the tank. In the sunlight his
face was green.


📄 Result 3 (Similarity: -0.2387)
--------------------------------------------------------------------------------
Daisy put her arm through his abruptly, but he seemed absorbed in what
he had just said. Possibly it had occurred to him that the colossal
significance of that light had now vanished forever. Compared to the
great distance that had separated him from Daisy it had seemed ver
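
The negative "Similarity" for the third result comes from the `1 - distance` formula. As a sketch under two assumptions (the collection uses ChromaDB's default squared L2 distance, and the embeddings are unit-length), squared L2 distance `d` and cosine similarity are related by `d = 2 - 2 * cos`, so `cos = 1 - d / 2`. The helpers and the two-dimensional vectors below are illustrative only.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance):
    """Valid only when both vectors have unit length: d = 2 - 2 * cos."""
    return 1.0 - distance / 2.0


# Two unit-length toy vectors (not real embeddings).
a = [0.6, 0.8]
b = [1.0, 0.0]
d = squared_l2(a, b)

print(f"squared L2 distance:    {d:.4f}")
print(f"cosine (direct):        {cosine_similarity(a, b):.4f}")
print(f"cosine (from distance): {cosine_from_squared_l2(d):.4f}")
print(f"1 - distance:           {1 - d:.4f}")
```

If both assumptions hold for this collection, the third green-light result (distance 1.2387) would correspond to a cosine similarity of about 0.38 rather than -0.2387. ChromaDB also lets you choose the distance metric when creating a collection; the exact option depends on the installed version.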

## 11. Alternative Query Method


In [12]:
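# Alternative method - let ChromaDB embed the query text itself.
# The collection was created without an embedding_function, so this uses
# ChromaDB's default embedding function rather than sentence_transformer_ef.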
results = collection.query(
    query_texts=["Gatsby's mysterious wealth and background"],
    n_results=3,
)

pprint(results)

{'data': None,
 'distances': [[0.6270289421081543, 0.6323080062866211, 0.6470474004745483]],
 'documents': [['Something in her tone reminded me of the other girl’s “I '
                'think he\n'
                'killed a man,” and had the effect of stimulating my '
                'curiosity. I would\n'
                'have accepted without question the information that Gatsby '
                'sprang from\n'
                'the swamps of Louisiana or from the lower East Side of New '
                'York. That\n'
                'was comprehensible. But young men didn’t—at least in my '
                'provincial\n'
                'inexperience I believed they didn’t—drift coolly out of '
                'nowhere and\n'
                'buy a palace on Long Island Sound.',
                'There was a small picture of Gatsby, also in yachting '
                'costume, on the\n'
                'bureau—Gatsby with his head thrown back defiantly—taken '
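
The raw result is column-oriented: each key holds one list per query, which is why the earlier cells index `results['documents'][0]`. A minimal sketch for turning one query's results into row dictionaries follows. `flatten_results` is a hypothetical helper, and the sample dictionary is hand-written in the same shape as the output above, with placeholder IDs and documents rather than real query output.

```python
def flatten_results(results, query_index=0):
    """Turn one query's column-wise result lists into a list of row dicts."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": record_id, "distance": dist, "document": doc}
        for record_id, doc, dist in zip(ids, documents, distances)
    ]


# Hand-written sample; the ids and documents are placeholders.
sample = {
    "ids": [["chunk_a", "chunk_b"]],
    "documents": [["first passage", "second passage"]],
    "distances": [[0.627, 0.632]],
}
rows = flatten_results(sample)
for row in rows:
    print(f"{row['id']}  distance={row['distance']:.3f}  {row['document']}")
```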
    

## 12. Database Statistics

Let's check the statistics of our vector database.

In [13]:
print("📊 Vector Database Statistics")
print("="*80)
print(f"Collection Name: {collection.name}")
print(f"Total Documents: {collection.count()}")
print("Embedding Dimensions: 384")
print("Embedding Model: all-MiniLM-L6-v2")
print("Storage Path: ./vector_db")  # matches the PersistentClient path in section 5
print("\n✅ The Great Gatsby is now fully searchable as a vector database!")

📊 Vector Database Statistics
Collection Name: great_gatsby
Total Documents: 769
Embedding Dimensions: 384
Embedding Model: all-MiniLM-L6-v2
Storage Path: ./vector_db

✅ The Great Gatsby is now fully searchable as a vector database!
