# Semantic Search Description

Semantic search matches on the **meaning** of a query rather than on exact words. A search for "car" can also surface "vehicle", "automobile", or "sedan" because these are related concepts. For a CRM, this means sales teams can ask natural-language questions such as:
- "Show me technology accounts with high revenue"
- "Find medical sector companies in the United States"

Vector Embeddings: Converting text to numbers
- An embedding model maps each piece of text to a fixed-length list of numbers (a vector)
- Texts with similar meanings get vectors that lie close together
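
To make "close together" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are hand-made for illustration and are not model output; the model used later in this notebook (`all-MiniLM-L6-v2`) produces 384-dimensional vectors. The names `cosine_similarity` and `toy_vectors` are assumptions introduced only for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (illustrative only, not produced by a model)
toy_vectors = {
    "car":        [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana":     [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
sim_unrelated = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])

print(f"car vs automobile: {sim_related:.2f}")
print(f"car vs banana:     {sim_unrelated:.2f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property semantic search relies on.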

ChromaDB: A vector database that stores these embeddings and returns the ones closest to a query
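
Conceptually, a vector database stores one vector per record and returns the records nearest to a query vector. The sketch below is a deliberately naive in-memory stand-in that scans every item; `TinyVectorStore` is a hypothetical class written for this illustration and is not part of ChromaDB, which adds indexing, metadata filtering, and storage on top of the same idea.

```python
class TinyVectorStore:
    """Illustrative in-memory store: keeps (id, vector, document) and scans all items on query."""

    def __init__(self):
        self._items = []  # list of (id, vector, document) tuples

    def add(self, item_id, vector, document):
        self._items.append((item_id, list(vector), document))

    def query(self, query_vector, n_results=2):
        """Return the n_results closest items as (distance, id, document), nearest first."""
        def squared_l2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        scored = [
            (squared_l2(query_vector, vec), item_id, doc)
            for item_id, vec, doc in self._items
        ]
        return sorted(scored)[:n_results]

# Toy two-dimensional vectors (made up for the example)
store = TinyVectorStore()
store.add("a1", [0.9, 0.1], "technology company")
store.add("a2", [0.8, 0.3], "software company")
store.add("a3", [0.1, 0.9], "medical company")

nearest = store.query([1.0, 0.0], n_results=2)
print([item_id for _, item_id, _ in nearest])
```

A full scan like this is fine for a few thousand records; dedicated vector databases exist mainly to keep queries fast as the collection grows.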

## Step 1: Import Packages

In [23]:
import os
import pandas as pd
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings

## Step 2: Load and Prepare Data

In [24]:
base_path = os.path.abspath("../data_directory/clean_data")

# Read the data
pipeline = pd.read_csv(os.path.join(base_path, "Pipeline.csv"))
accounts = pd.read_csv(os.path.join(base_path, "Accounts.csv"))
teams = pd.read_csv(os.path.join(base_path, "Teams.csv"))
products = pd.read_csv(os.path.join(base_path, "Products.csv"))

In [25]:
print(f"Loaded: {len(accounts)} accounts, {len(pipeline)} opportunities")

Loaded: 85 accounts, 8800 opportunities


## Step 3: Create Searchable Text Descriptions

In [None]:
# Turn each row into human-readable text so the embedding model can capture what the record is about
def create_account_description(row):
    """Convert account data into searchable text"""
    description = f"""
    Sector: {row['sector']} - {row['sector']} industry
    Account: {row['account']}
    This is a {row['sector']} company with:
    Revenue: ${row['revenue']} million
    Employees: {row['employees']}
    Location: {row['office_location']}
    Established: {row['year_established']}
    """
    if pd.notna(row['subsidiary_of']):
        description += f"Subsidiary of: {row['subsidiary_of']}\n"
    return description.strip()

def create_opportunity_description(row):
    """Convert opportunity data into searchable text"""
    description = f"""
    Opportunity ID: {row['opportunity_id']}
    Sales Agent: {row['sales_agent']}
    Product: {row['product']}
    Account: {row['account']}
    Deal Stage: {row['deal_stage']}
    Close Value: ${row['close_value']}
    """
    if pd.notna(row['engage_date']):
        description += f"Engaged: {row['engage_date']}\n"
    if pd.notna(row['close_date']):
        description += f"Closed: {row['close_date']}\n"
    
    return description.strip()

In [27]:
# Apply descriptions to dataframes
accounts['search_text'] = accounts.apply(create_account_description, axis=1)
pipeline['search_text'] = pipeline.apply(create_opportunity_description, axis=1)
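
One detail worth knowing: a triple-quoted f-string written inside an indented function keeps the leading spaces of every inner line, and `.strip()` only trims the start and end of the whole string. A possible alternative, sketched below, builds the text from a list of lines. `build_account_text` and `is_missing` are hypothetical names, and the sample row uses made-up values with a plain dict standing in for a DataFrame row.

```python
import math

def is_missing(value):
    """True for None or float NaN (the value pandas uses for empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_account_text(row):
    """Build the account description line by line so no line carries stray indentation."""
    lines = [
        f"Sector: {row['sector']} - {row['sector']} industry",
        f"Account: {row['account']}",
        f"This is a {row['sector']} company with:",
        f"Revenue: ${row['revenue']} million",
        f"Employees: {row['employees']}",
        f"Location: {row['office_location']}",
        f"Established: {row['year_established']}",
    ]
    if not is_missing(row.get("subsidiary_of")):
        lines.append(f"Subsidiary of: {row['subsidiary_of']}")
    return "\n".join(lines)

# Hypothetical sample row with made-up values
sample_row = {
    "sector": "technology",
    "account": "example_co",
    "revenue": 100.0,
    "employees": 250,
    "office_location": "united_states",
    "year_established": 2001,
    "subsidiary_of": float("nan"),
}

account_text = build_account_text(sample_row)
print(account_text)
```

Because `row.get(...)` also works on a pandas Series, a function like this could be passed to `accounts.apply(..., axis=1)` in the same way as `create_account_description`.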

## Step 4: Initialize the Embedding Model

In [28]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# This model converts text into numbers (vectors) that capture meaning

## Step 5: Set Up ChromaDB

Create two separate ChromaDB collections: one for accounts and one for opportunities.

In [29]:
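try:
    # Note: depending on the chromadb version, this client may keep data in memory only;
    # in chromadb 0.4+, chromadb.PersistentClient(path="./chroma_db") is the on-disk client.
    chroma_client = chromadb.Client(Settings(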
        persist_directory="./chroma_db",
        anonymized_telemetry=False
    ))
    
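    # Delete old collections if they exist. Each delete gets its own try block
    # so that a missing first collection does not skip deleting the second.
    for name in ("crm_accounts", "crm_opportunities"):
        try:
            chroma_client.delete_collection(name)
        except Exception:
            pass  # Collection might not exist yet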
    
    # Create new collections
    # Collection 1: Accounts
    accounts_collection = chroma_client.create_collection(
        name="crm_accounts",
        metadata={"description": "CRM account data"}
    )
    
    # Collection 2: Opportunities
    opportunities_collection = chroma_client.create_collection(
        name="crm_opportunities",
        metadata={"description": "Sales opportunities data"}
    )
    
    print("ChromaDB collections created")
    
except Exception as e:
    print(f"Error setting up ChromaDB: {e}")
    raise

ChromaDB collections created


## Step 6: Add Accounts to ChromaDB

In [30]:
# Adding accounts to vector database
for idx, row in accounts.iterrows():
    # Get the text description
    text = row['search_text']
    
    # Convert to embedding (vector)
    embedding = model.encode(text).tolist()
    
    # Store in ChromaDB with metadata
    accounts_collection.add(
        embeddings=[embedding],
        documents=[text],
        metadatas=[{
            'account': str(row['account']),
            'sector': str(row['sector']),
            'revenue': str(row['revenue']),
            'employees': str(row['employees']),
            'location': str(row['office_location'])
        }],
        ids=[f"account_{idx}"]
    )
    
    if (idx + 1) % 20 == 0:
        print(f"  Processed {idx + 1}/{len(accounts)} accounts")

print(f"Added {len(accounts)} accounts!")

  Processed 20/85 accounts
  Processed 40/85 accounts
  Processed 60/85 accounts
  Processed 80/85 accounts
Added 85 accounts!
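
Encoding and adding one row at a time works, but it issues one model call and one database call per record. A common alternative is to process records in chunks. The sketch below is one way to do that, assuming hypothetical helper and parameter names (`index_in_batches`, `encode_fn`, `add_fn`); stand-in functions replace the model and the collection so the example runs on its own.

```python
def index_in_batches(ids, texts, metadatas, encode_fn, add_fn, batch_size=256):
    """Encode and store records in chunks instead of one row at a time.

    encode_fn: takes a list of texts, returns a list of embedding vectors
    add_fn:    accepts ids=, embeddings=, documents=, metadatas= keyword arguments
    Returns the number of batches sent.
    """
    n_batches = 0
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        batch_texts = texts[start:end]
        add_fn(
            ids=ids[start:end],
            embeddings=encode_fn(batch_texts),
            documents=batch_texts,
            metadatas=metadatas[start:end],
        )
        n_batches += 1
    return n_batches

# Stand-ins so the sketch runs without a model or a database
stored_calls = []

def fake_encode(batch):
    return [[float(len(text))] for text in batch]

def fake_add(**kwargs):
    stored_calls.append(kwargs)

demo_ids = [f"account_{i}" for i in range(5)]
demo_texts = [f"doc {i}" for i in range(5)]
demo_meta = [{"n": i} for i in range(5)]

batches = index_in_batches(demo_ids, demo_texts, demo_meta, fake_encode, fake_add, batch_size=2)
print(batches, [len(call["ids"]) for call in stored_calls])
```

With the objects from this notebook, `encode_fn` could be `lambda batch: model.encode(batch).tolist()` and `add_fn` could be `accounts_collection.add`, since both accept lists. Chroma may limit how many records a single `add` call can take, so a moderate batch size is a reasonable starting point.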


## Step 7: Add Opportunities to ChromaDB

In [31]:
# Adding opportunities to vector database

for idx, row in pipeline.iterrows():
    text = row['search_text']
    embedding = model.encode(text).tolist()
    
    opportunities_collection.add(
        embeddings=[embedding],
        documents=[text],
        metadatas=[{
            'opportunity_id': str(row['opportunity_id']),
            'sales_agent': str(row['sales_agent']),
            'product': str(row['product']),
            'account': str(row['account']),
            'deal_stage': str(row['deal_stage']),
            'close_value': str(row['close_value'])
        }],
        ids=[f"opp_{idx}"]
    )
    
    if (idx + 1) % 500 == 0:
        print(f"  Processed {idx + 1}/{len(pipeline)} opportunities")

print(f"Added {len(pipeline)} opportunities!")

  Processed 500/8800 opportunities
  Processed 1000/8800 opportunities
  Processed 1500/8800 opportunities
  Processed 2000/8800 opportunities
  Processed 2500/8800 opportunities
  Processed 3000/8800 opportunities
  Processed 3500/8800 opportunities
  Processed 4000/8800 opportunities
  Processed 4500/8800 opportunities
  Processed 5000/8800 opportunities
  Processed 5500/8800 opportunities
  Processed 6000/8800 opportunities
  Processed 6500/8800 opportunities
  Processed 7000/8800 opportunities
  Processed 7500/8800 opportunities
  Processed 8000/8800 opportunities
  Processed 8500/8800 opportunities
Added 8800 opportunities!
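
Both loops store every metadata value as a string via `str(...)`. That is safe, but it means numeric fields such as `revenue` or `close_value` cannot be compared as numbers in a metadata filter later. Chroma metadata values may be strings, integers, floats, or booleans, so one option is to keep native types and skip missing values. The helpers below are a sketch with hypothetical names (`to_metadata_value`, `build_metadata`), and the sample row contains made-up values.

```python
import math

def to_metadata_value(value):
    """Keep bools, ints, floats and strings as they are; stringify anything else.

    Missing values (None or NaN) become None so the caller can drop the key.
    """
    if hasattr(value, "item"):
        value = value.item()  # numpy scalars (as found in DataFrame rows) expose .item()
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, (bool, int, float, str)):
        return value
    return str(value)

def build_metadata(row, fields):
    """Collect the selected fields from a row, skipping missing values."""
    metadata = {}
    for field in fields:
        converted = to_metadata_value(row.get(field))
        if converted is not None:
            metadata[field] = converted
    return metadata

# Hypothetical sample row; a plain dict stands in for a DataFrame row
sample_opportunity = {
    "opportunity_id": "abc123",
    "deal_stage": "won",
    "close_value": 1054.0,
    "close_date": float("nan"),
}

opportunity_metadata = build_metadata(
    sample_opportunity,
    ["opportunity_id", "deal_stage", "close_value", "close_date"],
)
print(opportunity_metadata)
```

Note that the display helpers later in this notebook index metadata keys directly, so dropping missing keys would require them to use `.get(...)` instead.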


## Step 8: Create Search Functions

In [32]:
def search_accounts(query, n_results=5, filter_sector=None):
    """
    Search for accounts using natural language.
    
    Args:
        query: What you're looking for (e.g., "technology companies")
        n_results: How many results to return
        filter_sector: Optional filter by industry sector
    
    Returns:
        Dictionary with search results
    """
    try:
        # Convert query to embedding
        query_embedding = model.encode(query).tolist()
        
        # Search with optional filtering
        if filter_sector:
            results = accounts_collection.query(
                query_embeddings=[query_embedding],
                n_results=n_results,
                where={"sector": filter_sector}
            )
        else:
            results = accounts_collection.query(
                query_embeddings=[query_embedding],
                n_results=n_results
            )
        
        return results
    
    except Exception as e:
        print(f"Search error: {e}")
        return None
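
The two branches above differ only in the `where` argument. Since `where` is optional in `collection.query`, the filter can be built separately and passed in either case. The sketch below shows one way to do that; `build_where` is a hypothetical helper, and the `$and` form follows Chroma's filter syntax for combining several conditions.

```python
def build_where(**filters):
    """Build a Chroma-style `where` dict from keyword filters, ignoring None values.

    Returns None when no filter is set, a single condition for one filter,
    and an {"$and": [...]} clause for several.
    """
    conditions = [{key: value} for key, value in filters.items() if value is not None]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}

no_filter = build_where(sector=None)
one_filter = build_where(sector="technology")
two_filters = build_where(sector="technology", location="united_states")

print(no_filter)
print(one_filter)
print(two_filters)
```

Inside `search_accounts`, a single call such as `accounts_collection.query(query_embeddings=[query_embedding], n_results=n_results, where=build_where(sector=filter_sector))` would then cover both cases. Exact-match filters like this are also a way to guarantee a sector when similarity alone is not strict enough.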

In [33]:
def search_opportunities(query, n_results=5, filter_stage=None):
    """
    Search for opportunities using natural language.
    
    Args:
        query: What you're looking for (e.g., "high value deals")
        n_results: How many results to return
        filter_stage: Optional filter by deal stage
    
    Returns:
        Dictionary with search results
    """
    try:
        query_embedding = model.encode(query).tolist()
        
        if filter_stage:
            results = opportunities_collection.query(
                query_embeddings=[query_embedding],
                n_results=n_results,
                where={"deal_stage": filter_stage}
            )
        else:
            results = opportunities_collection.query(
                query_embeddings=[query_embedding],
                n_results=n_results
            )
        
        return results
    
    except Exception as e:
        print(f"Search error: {e}")
        return None

In [34]:
def pretty_print_account_results(results):
    """Helper function to display account results nicely."""
    if not results or not results['metadatas'][0]:
        print("No results found.")
        return
    
    for i, metadata in enumerate(results['metadatas'][0]):
        print(f"\n{i+1}. {metadata['account']}")
        print(f"   Sector: {metadata['sector']}")
        print(f"   Revenue: ${metadata['revenue']} million")
        print(f"   Location: {metadata['location']}")
        print(f"   Employees: {metadata['employees']}")

def pretty_print_opportunity_results(results):
    """Helper function to display opportunity results nicely."""
    if not results or not results['metadatas'][0]:
        print("No results found.")
        return
    
    for i, metadata in enumerate(results['metadatas'][0]):
        print(f"\n{i+1}. {metadata['opportunity_id']}")
        print(f"   Agent: {metadata['sales_agent']}")
        print(f"   Product: {metadata['product']}")
        print(f"   Account: {metadata['account']}")
        print(f"   Stage: {metadata['deal_stage']}")
        print(f"   Value: ${metadata['close_value']}")
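
The `[0]` in `results['metadatas'][0]` is there because `query` returns one inner list per query embedding, and these functions send a single query. By default the result also carries `ids`, `documents`, and `distances` in the same nested shape. The sketch below flattens a single-query result into one dict per hit; `flatten_results` is a hypothetical helper, and the example result is hand-written with made-up values rather than real output.

```python
def flatten_results(results):
    """Turn a single-query result dict into a list of per-hit dicts, best match first."""
    if not results or not results.get("ids") or not results["ids"][0]:
        return []
    ids = results["ids"][0]
    metadatas = (results.get("metadatas") or [[]])[0]
    distances = (results.get("distances") or [[]])[0]
    hits = []
    for i, hit_id in enumerate(ids):
        hits.append({
            "id": hit_id,
            "metadata": metadatas[i] if i < len(metadatas) else None,
            "distance": distances[i] if i < len(distances) else None,
        })
    return hits

# Hand-written example in the nested shape returned for one query (values are made up)
example_results = {
    "ids": [["account_3", "account_7"]],
    "metadatas": [[{"account": "example_a"}, {"account": "example_b"}]],
    "distances": [[0.42, 0.57]],
}

hits = flatten_results(example_results)
for hit in hits:
    print(hit["id"], hit["metadata"]["account"], hit["distance"])
```

Keeping the distance next to each hit makes it easier to judge how close a match really is; smaller distances mean closer matches.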

## Step 9: Test the Semantic Search

In [35]:
# Test 1: Search for accounts
print("\nTEST 1: Technology companies with high revenue")
results = search_accounts("technology companies high revenue", n_results=3)
pretty_print_account_results(results)

# Test 2: Search for opportunities
print("\n\nTEST 2: Won deals by Darcel Schlecht")  # Agent name chosen arbitrarily
results = search_opportunities("won deals Darcel Schlecht", n_results=3)
pretty_print_opportunity_results(results)

# Test 3: Search for medical sector accounts
print("\n\nTEST 3: Medical sector accounts")
results = search_accounts("medical healthcare companies", n_results=3)
pretty_print_account_results(results)


TEST 1: Technology companies with high revenue

1. hottechi
   Sector: technology
   Revenue: $8170.38 million
   Location: korea
   Employees: 16499.0

2. goodsilron
   Sector: marketing
   Revenue: $2952.73 million
   Location: united_states
   Employees: 5107.0

3. bluth_company
   Sector: technology
   Revenue: $1242.32 million
   Location: united_states
   Employees: 3027.0


TEST 2: Won deals by Darcel Schlecht

1. tpnu79c1
   Agent: darcel_schlecht
   Product: mg_special
   Account: toughzap
   Stage: won
   Value: $54.0

2. ujsfg18f
   Agent: darcel_schlecht
   Product: mg_advanced
   Account: dontechi
   Stage: won
   Value: $3290.0

3. snfw1fd7
   Agent: darcel_schlecht
   Product: mg_advanced
   Account: streethex
   Stage: won
   Value: $3369.0


TEST 3: Medical sector accounts

1. bioplex
   Sector: medical
   Revenue: $326.82 million
   Location: united_states
   Employees: 1016.0

2. bioholding
   Sector: medical
   Revenue: $587.34 million
   Location: philippines
   E