<a href="https://colab.research.google.com/github/gomezphd/CAI2300C_NLP/blob/main/projects/semantic_search/notebooks/CCG_Project_2_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 2:** Semantic Search System for Customer Complaints and Legal Documents

- **`Student`**: Carlos C Gomez
- **`Instructor`**: Dr. Ernesto Lee
- **`Date`**: January 29, 2025  
---

## 🔎 **Introduction to Semantic Search Systems**

Semantic search represents an evolution in information retrieval, moving beyond simple keyword matching to understand the contextual meaning of search queries. This project implements a semantic search system with dual applications in customer service and legal document retrieval.

### *Core Functionality* 🎯
The system demonstrates two primary applications:
1. **Customer Complaint Analysis**: Identification of similar customer issues for standardized response handling
2. **Legal Document Retrieval**: Topic-based search of legal cases with semantic understanding

### *Technical Architecture* 🛠️
The implementation utilizes:
- **Embedding Models**: OpenAI's semantic text representation
- **Similarity Metrics**: Cosine-based relationship analysis
- **Data Filtering**: Metadata-based refinement
- **Interface**: Gradio framework implementation

This implementation extends research by [Dr. Lee](https://drlee.io/building-a-semantic-search-engine-for-customer-complaints-with-openai-a61e9f4f2ba7) on semantic search applications in customer service contexts.

---

## 📌 **Step 1: Dependencies**

#### ***Sets up required Python packages***

- **`OpenAI`** → Text embedding generation  
- **`Gradio`** → Interface development  
- **`SciPy`** → Vector similarity calculations  
- **`NumPy`** → Numerical operations  



In [1]:
# Install Required Libraries
!pip install openai gradio scipy numpy -q



---
## 📌 **Step 2: Library Integration**

#### ***Imports necessary components for system operation***

- **`numpy`** → Handles numerical operations and array manipulation  
- **`gradio`** → Creates the interactive web interface  
- **`openai`** → Enables access to embedding models  
- **`scipy.spatial`** → Provides distance calculations for similarity metrics  


In [2]:
# Import Libraries
import numpy as np
import gradio as gr
from openai import OpenAI
from scipy.spatial import distance


---
## 📌 **Step 3: API Configuration**  

The system requires OpenAI API authentication through **two possible implementation methods**.  

---

### ✅ **Method A: Secure Storage (Recommended)**  
Colab Secrets provides a **secure** way to store your API key. Follow these steps:  

1. **Open Colab Secrets**  
   - Click the 📁 **folder icon** in the left sidebar  
   - Select the 🔑 **key icon** (Secrets) at the bottom  
   - Click **"Add new secret"**  

2. **Enter Your API Key**  
   - **Name**: Enter exactly `OPENAI_CAI2300C`  
   - **Value**: Paste your OpenAI API key (starts with `"sk-"`)  
   - Click **"Add"**  

3. **Important**: The secret name **must** be `OPENAI_CAI2300C`, exactly as shown.  

---

### ⚠️ **Method B: Direct Implementation (Less Secure)**  
* Alternatively, you can **hardcode** the API key directly in the script (see Method 2 in the code below).  

* However, this method **exposes** your API key, making it **less secure**.  
For better protection, **use Method A (Colab Secrets)** whenever possible.  
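
Outside Colab, the same key can be read from an environment variable instead of being hardcoded. The following is a minimal sketch under that assumption, not part of the original notebook: `load_openai_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is the environment variable that the Method 1 cell below also sets.

```python
import os


def load_openai_api_key(secret_name="OPENAI_CAI2300C", env_var="OPENAI_API_KEY"):
    """Return the key from Colab Secrets if possible, else from the environment.

    Hypothetical helper for illustration; returns None when no key is found.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        key = userdata.get(secret_name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook. Fall through to the environment variable.
        pass
    return os.environ.get(env_var)
```

The returned value can then be passed to `OpenAI(api_key=...)`, so the key never appears in the notebook source.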


---



In [3]:
#------------------------------------
#  ✅ Method 1: Using Colab Secrets (Recommended)
#------------------------------------

# Import required libraries
from google.colab import userdata
import os
from openai import OpenAI

# Retrieve API key from Colab Secrets
openai_api_key = userdata.get('OPENAI_CAI2300C')  # Ensure correct secret name

# Initialize OpenAI client if API key is successfully retrieved
if openai_api_key:
    os.environ["OPENAI_API_KEY"] = openai_api_key
    client = OpenAI(api_key=openai_api_key)
    print("✅ OpenAI API Key loaded successfully from Colab Secrets!")
else:
    print("❌ Failed to load API Key. Please check:")
    print("  1. You've added your API key to Colab Secrets with name 'OPENAI_CAI2300C'")
    print("  2. The secret is saved successfully")
    print("  3. Refresh the page if issues persist")



✅ OpenAI API Key loaded successfully from Colab Secrets!


In [6]:
#------------------------------------
# Method 2: Manual API Key Entry (less secure; comment out the cell above before using this one)
# ------------------------------------
## from openai import OpenAI
## client = OpenAI(api_key="your-api-key")  # Replace with your actual OpenAI API key
## print("✅ OpenAI API Key manually set!")


---
## 📌 **Step 4: Embedding Function Implementation**

The `create_embeddings` function serves as the **core transformation engine** for the semantic search system.

### *Key Components*:
- **Model Selection**: Utilizes OpenAI's `text-embedding-3-small` model
- **Input Processing**: Accepts a list of text strings and sends one API request per string
- **Vector Generation**: Creates high-dimensional embeddings representing semantic meaning

### *Technical Implementation Notes*:
- Each text input is transformed into a *1,536-dimensional vector* (the model's default output size)
- Requests run synchronously, one after another, so runtime grows with the number of texts
- Output vectors capture semantic relationships between texts
---

In [7]:
# Define the Embedding Function
def create_embeddings(texts, model="text-embedding-3-small"):
    """
    Create embeddings for a list of texts using OpenAI's API.

    Args:
        texts (list): List of text strings to embed.
        model (str): OpenAI embedding model to use.

    Returns:
        list: List of embeddings.
    """
    return [client.embeddings.create(input=text, model=model).data[0].embedding for text in texts]
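
The function above sends one request per text. The embeddings endpoint also accepts a list of strings in a single request, so a batched variant is possible. The following is a minimal sketch rather than the notebook's implementation: `create_embeddings_batched` and `batch_size` are hypothetical names, and the client is passed in explicitly so the function can be exercised with a stand-in object.

```python
def create_embeddings_batched(texts, client, model="text-embedding-3-small", batch_size=100):
    """Embed texts in chunks, sending each chunk as a single request.

    Hypothetical variant of create_embeddings; batch_size is an assumed
    tuning knob, not a value taken from the notebook.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        response = client.embeddings.create(input=chunk, model=model)
        # Each returned item carries the index of its input within the chunk;
        # sorting on it keeps embeddings aligned with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With the 30 complaints used later in this notebook and the assumed default `batch_size`, this would issue one request instead of 30.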



---
## 📌 **Step 5: Customer Complaints Dataset**

The system utilizes a **structured dataset** of common customer service scenarios.

### *Dataset Characteristics*:
- **Size**: 30 distinct complaint scenarios
- **Coverage**: Multiple service aspects
  - 📦 *Delivery issues*
  - 💰 *Billing concerns*
  - 🔧 *Product quality*
  - 📞 *Customer service interactions*

### *Implementation Purpose*:
Provides a *robust foundation* for testing semantic matching capabilities across diverse customer service scenarios.

---


In [8]:
# Sample customer complaints dataset
customer_complaints = [
    "The delivery of my order was delayed by 3 days, and I had to constantly check the tracking system for updates. This caused inconvenience as it was a gift that I needed urgently.",
    "I received a damaged product in the package, and the box itself was torn. It seems there was no care taken during the shipping process.",
    "The refund process is incredibly slow. I submitted my request weeks ago and still haven't received any confirmation or updates on the status of my refund.",
    "The customer service representative I spoke to was extremely rude and unhelpful, refusing to listen to my concerns or provide a proper resolution.",
    "I never received the order I placed two weeks ago, even though the system marked it as delivered. I feel like my money has been wasted.",
    "The packaging was torn and damaged when my order arrived, making it look like the contents could have been tampered with or mishandled during shipping.",
    "The product I received doesn’t match the description on the website at all. It feels misleading, and I now have to go through the hassle of returning it.",
    "The website is confusing and difficult to navigate, making it hard to find what I was looking for. The search feature also doesn’t provide accurate results.",
    "I was incorrectly charged an extra amount for my order, and I can't figure out why. The customer support hasn't resolved this issue yet.",
    "The warranty claim process is unclear, and I couldn’t find any detailed instructions on the website. I’ve been stuck without a resolution for weeks.",
    "The size I ordered doesn’t fit, even though I followed the size chart on your website. It seems the chart is inaccurate and misleading.",
    "I tried canceling my order before it was shipped, but the system wouldn't let me. Now, I’m stuck with something I don’t need.",
    "The product quality is far below what I expected based on the reviews and description. It feels like I’ve been scammed.",
    "The live chat feature on the website never connects me to an agent. I’ve tried multiple times, and the automated replies aren’t helpful at all.",
    "I’ve been charged for a subscription that I never signed up for. It’s unfair, and I haven’t received any explanation or solution yet.",
    "The instructions for assembling the product were incomplete and confusing. I had to search online to figure out how to put it together.",
    "My account was locked for no reason, and I wasn’t able to make a purchase. Customer support didn’t resolve this issue quickly.",
    "The item was marked as 'in stock,' but after I placed the order, I was informed that it’s on backorder and won’t arrive for weeks.",
    "I received the wrong item in my order, and now I have to go through the hassle of returning it and waiting for a replacement.",
    "The checkout process on the website is frustratingly slow, and my payment failed multiple times before finally going through.",
    "My promotional discount code didn’t work at checkout, and I ended up paying the full price. I contacted support but haven’t heard back yet.",
    "The delivery person left my package outside in the rain, ruining the contents inside. There should be better handling of deliveries.",
    "I had to pay additional customs fees that weren’t disclosed when I placed the order. This hidden charge is unacceptable.",
    "The automated phone system doesn’t connect me to a real person, and I’ve been stuck waiting for a resolution for over a week.",
    "The color of the product I received is completely different from what was shown on the website. It’s not what I ordered at all.",
    "The mobile app keeps crashing whenever I try to add items to my cart. It’s been impossible to complete my purchase.",
    "The tracking information for my shipment hasn’t been updated in days, and I have no idea where my package is.",
    "The item I purchased doesn’t work as advertised. It’s defective and should have been properly tested before being sold.",
    "I paid for expedited shipping, but my order still arrived late. I feel like I wasted money on a service that wasn’t delivered.",
    "I’ve been trying to return a product for over two weeks, but the return label hasn’t been sent to me yet. This delay is unacceptable."
]

---
## 📌 **Step 6: Legal Case Summaries**

The legal dataset provides a **structured collection** of case summaries across multiple practice areas.

### *Dataset Structure*:
Each case entry contains:
- **📄 Summary**: *Detailed case description*
- **🏷️ Topic**: *Legal practice area classification*
- **🔑 Keywords**: *Relevant legal terminology*

### *Practice Areas Covered*:
1. *Environmental Law*
2. *Privacy Law*
3. *Intellectual Property*
4. *Antitrust Law*
5. *Healthcare Law*
---

In [9]:
# Sample legal case summaries dataset
legal_summaries = [
    {
        "summary": "The Supreme Court ruled in favor of the plaintiff regarding environmental regulations, emphasizing the need for stricter enforcement to protect endangered species in industrial zones.",
        "topic": "Environmental Law",
        "keywords": ["Supreme Court", "plaintiff", "environmental regulations", "endangered species", "industrial zones"]
    },
    {
        "summary": "A new policy on data privacy was enacted to protect user information, requiring companies to implement stricter encryption standards for all stored data.",
        "topic": "Privacy Law",
        "keywords": ["data privacy", "policy", "user information", "encryption standards", "data storage"]
    },
    {
        "summary": "The landmark case addressed intellectual property rights for AI-generated content, establishing that such creations could not yet qualify for traditional copyright protections.",
        "topic": "Intellectual Property",
        "keywords": ["intellectual property", "AI", "landmark case", "copyright", "AI-generated content"]
    },
    {
        "summary": "Antitrust concerns were raised in a merger between two major tech companies, with regulators questioning the potential for monopolistic practices in the digital advertising sector.",
        "topic": "Antitrust Law",
        "keywords": ["antitrust", "merger", "tech companies", "monopoly", "digital advertising"]
    },
    {
        "summary": "The court dismissed charges of negligence against the pharmaceutical company, citing insufficient evidence to prove a breach of safety protocols during the manufacturing process.",
        "topic": "Healthcare Law",
        "keywords": ["negligence", "pharmaceutical company", "court dismissal", "safety protocols", "manufacturing process"]
    },
    {
        "summary": "The court ruled that a major energy company was liable for damages caused by an oil spill that affected coastal communities and wildlife habitats.",
        "topic": "Environmental Law",
        "keywords": ["oil spill", "energy company", "liability", "coastal communities", "wildlife habitats"]
    },
    {
        "summary": "A federal ruling mandated that companies must disclose all data breaches affecting more than 10,000 users within 48 hours of detection.",
        "topic": "Privacy Law",
        "keywords": ["data breaches", "federal ruling", "disclosure", "user protection", "48-hour rule"]
    },
    {
        "summary": "The court upheld the trademark infringement claim, ruling that the defendant's branding created consumer confusion with an established company's product line.",
        "topic": "Intellectual Property",
        "keywords": ["trademark infringement", "branding", "consumer confusion", "product line", "court ruling"]
    },
    {
        "summary": "An antitrust lawsuit was filed against a leading e-commerce platform for allegedly using its market dominance to suppress smaller competitors.",
        "topic": "Antitrust Law",
        "keywords": ["antitrust lawsuit", "e-commerce", "market dominance", "smaller competitors", "suppression"]
    },
    {
        "summary": "A healthcare company was fined for failing to comply with federal health data privacy regulations, resulting in unauthorized access to patient records.",
        "topic": "Healthcare Law",
        "keywords": ["healthcare company", "fines", "data privacy", "patient records", "non-compliance"]
    },
    {
        "summary": "Environmental advocates sued a chemical manufacturer over illegal waste disposal practices that contaminated local water supplies.",
        "topic": "Environmental Law",
        "keywords": ["illegal waste disposal", "chemical manufacturer", "water contamination", "lawsuit", "environmental advocates"]
    },
    {
        "summary": "New legislation on biometric data collection requires businesses to obtain explicit consent before capturing or storing fingerprints or facial recognition data.",
        "topic": "Privacy Law",
        "keywords": ["biometric data", "explicit consent", "facial recognition", "fingerprints", "data collection"]
    },
    {
        "summary": "The court ruled that a software developer retained ownership of the source code created during a freelance project, reinforcing independent contractor rights.",
        "topic": "Intellectual Property",
        "keywords": ["source code", "software developer", "freelance project", "ownership", "contractor rights"]
    },
    {
        "summary": "Regulators imposed fines on a major telecom company for collusion in price-fixing agreements with its regional partners.",
        "topic": "Antitrust Law",
        "keywords": ["collusion", "price-fixing", "telecom company", "regional partners", "regulators"]
    },
    {
        "summary": "The court held a healthcare provider liable for medical malpractice after failing to diagnose a patient’s life-threatening condition in a timely manner.",
        "topic": "Healthcare Law",
        "keywords": ["medical malpractice", "healthcare provider", "liability", "misdiagnosis", "timely care"]
    }
]
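
Because the search code later relies on the `summary` and `topic` fields, a quick structural check can catch a malformed entry before any API call is made. The following is a minimal sketch: `summarize_cases` and `REQUIRED_KEYS` are hypothetical names, and the two sample entries are illustrative stand-ins in the same shape as `legal_summaries`.

```python
from collections import Counter

# Fields listed under "Dataset Structure" above.
REQUIRED_KEYS = {"summary", "topic", "keywords"}


def summarize_cases(cases):
    """Verify each case has the required fields, then count cases per topic."""
    for position, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"Case {position} is missing fields: {sorted(missing)}")
    return Counter(case["topic"] for case in cases)


# Two toy entries for illustration only.
sample_cases = [
    {"summary": "Oil spill liability ruling.", "topic": "Environmental Law", "keywords": ["oil spill"]},
    {"summary": "Breach disclosure mandate.", "topic": "Privacy Law", "keywords": ["data breaches"]},
]
print(summarize_cases(sample_cases))
```

Run against the full `legal_summaries` list above, the count should come to three cases for each of the five practice areas.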


---
## 📌 **Step 7: Vector Generation Process**

This step generates embeddings for both datasets with a single helper, `generate_embeddings`.

### *Processing Pipeline*:
1. **Text Extraction**
   - *Customer complaints are embedded directly as strings*
   - *Legal cases are embedded from their `summary` field*

2. **Embedding Generation**
   - *Each text is passed to `create_embeddings`*
   - *Each vector is stored alongside its source record*

### *Implementation Details*:
- **🔄 Uniform Output**: Both datasets become lists of dictionaries with an `embedding` key
- **📑 Metadata**: Legal entries keep their `topic` and `keywords` fields
- **💾 Storage**: Vectors are held in memory; the legal dictionaries are updated in place
---

In [10]:
# Generate embeddings for both datasets

def generate_embeddings(data_list, is_dict=False):
    """
    Generates embeddings for a list of items.

    Args:
        data_list (list): List of strings or dictionaries.
        is_dict (bool): True if items are dictionaries, False if strings.

    Returns:
        list: List with added embeddings.
    """
    if is_dict:
        # For legal summaries (list of dictionaries)
        texts = [item["summary"] for item in data_list]
        embeddings = create_embeddings(texts)

        # Add embeddings to existing dictionaries
        for item, embedding in zip(data_list, embeddings):
            item["embedding"] = embedding

        return data_list
    else:
        # For customer complaints (list of strings)
        embeddings = create_embeddings(data_list)

        # Convert to list of dictionaries
        return [{"text": text, "embedding": emb} for text, emb in zip(data_list, embeddings)]

# Apply embedding generation
processed_complaints = generate_embeddings(customer_complaints, is_dict=False)
processed_legal = generate_embeddings(legal_summaries, is_dict=True)
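
Each run of the cell above requests all 45 embeddings again (30 complaints plus 15 legal summaries). One option is to cache the processed records on disk and reload them in later sessions. The following is a minimal sketch, assuming the embeddings are plain lists of floats so they serialize to JSON directly; `save_records`, `load_records`, and the file name in the usage comment are hypothetical.

```python
import json


def save_records(records, path):
    """Write a list of dictionaries (text or summary plus embedding) to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle)


def load_records(path):
    """Read back the list of dictionaries written by save_records."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Example usage (hypothetical file name):
# save_records(processed_complaints, "complaints_embeddings.json")
# processed_complaints = load_records("complaints_embeddings.json")
```

A cache like this has to be regenerated whenever the dataset or the embedding model changes, since vectors from different models are not comparable.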




---

## 📌 **Step 8: Search Functionality**

The unified search function ranks dataset entries by their semantic similarity to a query, with an optional topic filter.

### *Core Features*:
- **🎯 Precision**: Cosine similarity between the query embedding and each stored embedding
- **🔍 Flexibility**: One function serves both datasets
- **📑 Filtering**: Optional topic-based refinement for legal cases

### *Technical Highlights*:
1. **Vector Comparison**
   - *Similarity computed as `1 - distance.cosine(query, entry)`*
   - *Every entry that passes the filter is scored*
2. **Result Ranking**
   - *Results sorted by similarity, highest first*
   - *Top `n` results returned (default 3)*
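
Cosine similarity is the dot product of two vectors divided by the product of their lengths, which is the same quantity as one minus SciPy's cosine distance. The following worked example is a sketch in plain Python with toy 2-dimensional vectors; the `cosine_similarity` helper is hypothetical and is shown only to make the arithmetic visible.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # perpendicular  -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite       -> -1.0
```

Vector length does not affect the score, only direction, which is why two texts of very different size can still score as highly similar.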



---




In [11]:
# Define Unified Search Function

def find_similar_entries(query, dataset, n=3, topic_filter=None):
    """
    Find the n most similar entries in a dataset.

    Args:
        query (str): Search query.
        dataset (list): Dataset of dictionaries with embeddings.
        n (int): Number of similar results to return.
        topic_filter (str, optional): Topic filter for legal cases.

    Returns:
        list: List of matching entries sorted by similarity.
    """
    query_embedding = create_embeddings([query])[0]
    results = []

    for entry in dataset:
        # Apply topic filter for legal cases
        if topic_filter and topic_filter != "All":
            if "topic" in entry and entry["topic"] != topic_filter:
                continue

        # Compute similarity
        similarity = 1 - distance.cosine(query_embedding, entry["embedding"])
        results.append((entry, similarity))

    return sorted(results, key=lambda x: x[1], reverse=True)[:n]
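
The ranking and filtering logic can be tried without any API call by supplying a precomputed query vector. The following is a sketch under that assumption: `rank_entries` is a hypothetical stand-in that mirrors the filter rule of `find_similar_entries` (entries without a `topic` key are never filtered out), uses `heapq.nlargest` in place of a full sort, and runs on toy 2-dimensional vectors.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_entries(query_vector, entries, n=3, topic_filter=None):
    """Return the n (entry, similarity) pairs most similar to query_vector."""
    scored = []
    for entry in entries:
        # Skip entries whose topic differs from an active filter.
        if topic_filter and topic_filter != "All":
            if "topic" in entry and entry["topic"] != topic_filter:
                continue
        scored.append((entry, cosine_similarity(query_vector, entry["embedding"])))
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy entries for illustration only.
entries = [
    {"summary": "A", "topic": "Privacy Law", "embedding": [1.0, 0.0]},
    {"summary": "B", "topic": "Antitrust Law", "embedding": [0.0, 1.0]},
    {"summary": "C", "topic": "Privacy Law", "embedding": [1.0, 1.0]},
]
query = [1.0, 0.0]

print([e["summary"] for e, _ in rank_entries(query, entries, n=2)])
# ['A', 'C']
print([e["summary"] for e, _ in rank_entries(query, entries, topic_filter="Antitrust Law")])
# ['B']
```

Filtering before scoring means the filter can return fewer than `n` results, as in the second call.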





---
## 📌 **Step 9: User Interface Development**

The system implements an **intuitive interface** using Gradio's framework.

### *Interface Components*:
1. **Input Section**
   - *Query text field*
   - *Search type selector*
   - *Topic filter dropdown*

2. **Output Display**
   - *Ranked results*
   - *Similarity scores*
   - *Topic classifications*

### *Design Philosophy*:
Focuses on **accessibility** while maintaining *powerful functionality* for both technical and non-technical users.


---





In [12]:
# Develop the UI where users can toggle between Customer Complaints and Legal Cases.

def unified_search(query, search_type, topic_filter=None):
    """
    Unified search function for both types of content.

    Args:
        query (str): The user query.
        search_type (str): Either "Customer Complaints" or "Legal Cases".
        topic_filter (str, optional): Topic filter for legal cases.

    Returns:
        str: Formatted search results.
    """
    dataset = processed_complaints if search_type == "Customer Complaints" else processed_legal
    results = find_similar_entries(query, dataset, topic_filter=topic_filter)

    if not results:
        return "No matching entries found."

    output = f"🔍 **Search Query:** {query}\n\n📂 **Results:**\n"
    for i, (entry, similarity) in enumerate(results, 1):
        output += f"\n{i}. **{entry.get('text', '') or entry.get('summary', '')}**"
        if "topic" in entry:
            output += f"\n   📑 **Topic:** {entry['topic']}"
        output += f"\n   🎯 **Similarity Score:** {similarity:.2%}\n"

    return output


---
## 📌 **Step 10: Launch the Gradio Interface**

In [13]:
# Create and Launch Gradio Interface

unified_interface = gr.Interface(
    fn=unified_search,
    inputs=[
        gr.Textbox(label="Enter Your Query"),
        gr.Radio(
            ["Customer Complaints", "Legal Cases"],
            label="Select Search Type",
            value="Customer Complaints"
        ),
        gr.Dropdown(
            ["All", "Antitrust Law", "Privacy Law", "Intellectual Property",
             "Healthcare Law", "Environmental Law"],
            label="Filter by Legal Topic (Only for Legal Cases)",
            value="All"
        ),
    ],
    outputs="text",
    title="🔎 Unified Semantic Search",
    description="Search across both Customer Complaints and Legal Cases using OpenAI embeddings.",
)

# Launch the interface
unified_interface.launch()




Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7a22d2f2221527a95b.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




---
## 🚀 System Deployment

### *Operation Notes*:
- **Runtime Environment**: Google Colab
- **Sharing**: In Colab, Gradio sets `share=True` automatically; the public link expires after 72 hours
- **User Access**: Browser-based interface

### *References*:
📚 Based on research by Dr. Lee on semantic search applications [Read more](https://drlee.io/building-a-semantic-search-engine-for-customer-complaints-with-openai-a61e9f4f2ba7).

---