# URL Similarity Search System using ChromaDB
-----------------------------------------

Broken or mistyped URLs are a common cause of poor user experience and lost traffic. This project implements a URL matching system that helps users reach the right page when the URL they request contains a typo or is slightly wrong. It converts each URL into a numerical vector and uses ChromaDB similarity search to find and suggest the closest valid URLs from a website's sitemap. This is particularly useful for large websites, where maintaining redirects or looking up correct URLs by hand would be time-consuming. The system builds its knowledge base from XML sitemaps. Given a potentially incorrect URL, it returns the nearest matches by cosine distance between vectors built from the character trigrams of the URL's last path segment, so a match does not depend on an exact string comparison.

### Why this process is important

**1.- User Experience & Error Recovery**

*   Users often encounter broken links or mistyped URLs
*   Instead of showing a generic 404 error, we can guide users to the content they likely meant to access
*   Real-world example: A user types "motortrend.com/news/ford-mustang-review-2024" but the actual URL is "motortrend.com/news/ford-mustang-2024-review"; the system would suggest the correct page
*   This reduces user frustration and maintains engagement on the site

**2.- Content Migration & Legacy Support**

*   Websites frequently reorganize their URL structure or migrate content
*   Old bookmarks, external links, and search engine results may still point to previous URL patterns
*   Our system can help maintain continuity by:
    *   Finding the new location of moved content
    *   Handling various URL formats that might have been used over time
    *   Preserving SEO value from old links by providing relevant alternatives

**3.- Analytics & Content Management**

*   Understanding URL patterns and similarities helps content teams:
    *   Identify duplicate or similar content that might need consolidation
    *   Analyze how content is organized and accessed
    *   Track content evolution over time through URL changes
    *   Make data-driven decisions about content structure
*   Example: If many users are searching for "reviews" in different URL formats, it might indicate a need to standardize the URL structure for review content


This module implements a URL similarity search system that can:
1. Process XML sitemaps to extract URLs
2. Convert URLs into vector representations
3. Store these vectors in ChromaDB for efficient similarity search
4. Find similar URLs when given a potentially incorrect URL
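
As a rough illustration of the first step, the sketch below extracts normalized paths and `lastmod` values from a sitemap held in a string. The sample XML and the `extract_paths` helper are made up for this example and are not part of the notebook; only the sitemap namespace and the standard-library calls are taken as given.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Made-up sitemap content for illustration; not taken from a real site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/news/first-article/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/reviews/second-article</loc>
  </url>
</urlset>"""

# Sitemap elements live in this namespace, so every lookup needs the prefix.
NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_paths(xml_text):
    """Return (path, lastmod) pairs for every <url> entry in a sitemap string."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall(".//ns:url", NS):
        loc = url.find("ns:loc", NS)
        if loc is None or not loc.text:
            continue  # skip entries without a location
        lastmod = url.find("ns:lastmod", NS)
        # Keep only the path and drop leading/trailing slashes.
        path = "/" + urlparse(loc.text.strip()).path.strip("/")
        entries.append((path, lastmod.text if lastmod is not None else None))
    return entries


entries = extract_paths(SAMPLE_SITEMAP)
for path, lastmod in entries:
    print(path, lastmod)
```

The second entry has no `<lastmod>` element, so its date comes back as `None`; the storage step has to decide how to represent that case.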

The system is particularly useful for:
- Finding correct URLs when users mistype or remember URLs incorrectly
- Redirecting users to the closest matching content
- Analyzing URL patterns in a website

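Requirements:
- chromadb
- numpy
- tqdm
- pandas, matplotlib, seaborn (installed and imported by the notebook, but not used by the cells shown here)
- xml.etree.ElementTree (part of the Python standard library; no installation needed)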

In [1]:
# %% [markdown]
# # URL Similarity Matcher using ChromaDB
# This notebook processes XML sitemaps and creates a similarity search system for URLs.
# 
# ## Setup
# First, let's install the required packages:

# %%
# Install required packages
!pip install chromadb tqdm numpy pandas matplotlib seaborn



In [2]:
# Import required libraries
import os
import xml.etree.ElementTree as ET
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from urllib.parse import urlparse
import numpy as np
from datetime import datetime
from tqdm import tqdm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
class URLMatcher:
    """
    A class that manages URL similarity matching using vector embeddings and ChromaDB.
    
    This class provides functionality to:
    - Process XML sitemaps and extract URLs
    - Convert URLs into vector representations
    - Store URL vectors in ChromaDB
    - Find similar URLs using vector similarity search
    
    Attributes:
        sitemaps_folder (str): Path to folder containing XML sitemaps
        client (chromadb.Client): ChromaDB client instance
        collection (chromadb.Collection): ChromaDB collection for storing URLs
        debugging (bool): Flag to control debug output

    The collection is configured to use cosine distance between vectors.
    """    
    def __init__(self, collection_name: str = "url_collection1", sitemaps_folder="C:\\Users\\getapia\\ML_RAG_Project\\sitemap"):
        """
        Initialize the URL matcher with a collection name and sitemaps folder.
        
        Args:
            collection_name (str): Name of the ChromaDB collection to use
            sitemaps_folder (str): Path to folder containing XML sitemaps
        """        
        self.sitemaps_folder = sitemaps_folder
        self.client = chromadb.Client()
        self.debugging = False
        # get_collection() does not accept a metadata argument, so the original
        # try/except always fell through to create_collection(). Using
        # get_or_create_collection() covers both cases without a bare except.
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=embedding_functions.DefaultEmbeddingFunction(),
            metadata={"hnsw:space": "cosine"}
        )
        if self.debugging:
            print(f"Using collection: {collection_name}")
            
    def normalize_path(self, url_or_path):
        """
        Normalize a URL or path to a consistent format.
        
        Args:
            url_or_path (str): Full URL or path to normalize
            
        Returns:
            str: Normalized path starting with '/'
            
        Example:
            'https://example.com/path/to/page/' -> '/path/to/page'
        """        
        if not url_or_path:
            return ""
            
        if url_or_path.startswith('http'):
            parsed = urlparse(url_or_path)
            path = parsed.path
        else:
            path = url_or_path
            
        path = path.strip('/')
        
        # Remove empty segments
        path_parts = [p for p in path.split('/') if p]
        
        # Rejoin with single slashes
        return '/' + '/'.join(path_parts)

    def url_to_vector(self, url_or_path):
        """
        Convert a URL or path to a vector built from character trigrams of its last path segment.
        
        This method:
        1. Normalizes the path
        2. Extracts the content part (usually the last segment)
        3. Splits into words
        4. Creates trigrams for each word
        5. Weights features based on position
        6. Normalizes the final vector
        
        Args:
            url_or_path (str): URL or path to vectorize
            
        Returns:
            numpy.ndarray: Normalized vector representation of the URL
        """
        # Normalize the path first
        path = self.normalize_path(url_or_path)
        
        # Split into segments
        path_parts = path.strip('/').split('/')
        
        # The last segment usually carries the content slug. split() always
        # returns at least one element, so indexing with [-1] is safe.
        content_part = path_parts[-1]
            
        # Split by hyphens to get individual words
        words = content_part.split('-')
        
        features = []
        
        # Process each word
        for word in words:
            # Strip all digits (years, model numbers, numeric IDs) from the word
            word = ''.join(c for c in word if not c.isdigit())
            
            # Character trigrams capture word structure; words shorter than
            # three characters produce none
            trigrams = [word[i:i+3] for i in range(len(word)-2)]
            
            # Convert trigrams to numbers
            for trigram in trigrams:
                # Position-weighted sum of character codes (a simple hash;
                # different trigrams such as "bab" and "aca" can collide)
                trigram_value = sum(ord(c) * (i+1) for i, c in enumerate(trigram))
                features.append(trigram_value)
        
        # Weight each feature by its index in the trigram sequence:
        # the first and last thirds are assumed to matter more than the middle
        weighted_features = []
        for i, value in enumerate(features):
            position_weight = 1.0
            if i < len(features) // 3:  # First third
                position_weight = 1.5
            elif i > (2 * len(features)) // 3:  # Last third
                position_weight = 1.3
            weighted_features.append(value * position_weight)
        
        # Ensure fixed length
        target_length = 150
        if len(weighted_features) < target_length:
            weighted_features.extend([0] * (target_length - len(weighted_features)))
        else:
            weighted_features = weighted_features[:target_length]
        
        # Normalize vector
        features = np.array(weighted_features, dtype=np.float32)
        norm = np.linalg.norm(features)
        if norm > 0:
            features = features / norm
        
        return features

    def process_sitemap(self, file_path):
        """
        Process a single sitemap XML file to extract URLs and last modification dates.
        
        Args:
            file_path (str): Path to the sitemap XML file
            
        Returns:
            tuple: (list of URLs, list of last modification dates)
        """        
        try:
            tree = ET.parse(file_path)
            root = tree.getroot()
            ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
            
            urls = []
            lastmods = []
            counter = 0
            
            for url in root.findall('.//ns:url', ns):
                loc = url.find('ns:loc', ns)
                lastmod = url.find('ns:lastmod', ns)
                
                if loc is not None:
                    # <loc> text may be missing or padded with whitespace
                    full_url = (loc.text or "").strip()
                    normalized_path = self.normalize_path(full_url)
                    
                    if normalized_path:
                        urls.append(normalized_path)
                        lastmods.append(lastmod.text if lastmod is not None else None)
                        counter += 1
            
            if self.debugging:
                print(f'File-> {file_path} with {counter} paths processed')
            return urls, lastmods
        except ET.ParseError as e:
            print(f"Error parsing {file_path}: {e}")
            return [], []

    def process_all_sitemaps(self):
        """
        Process all XML sitemap files in the specified folder.
        
        Returns:
            tuple: (list of all URLs, list of all last modification dates)
        """
        all_urls = []
        all_lastmods = []
        
        # Sort for a stable processing order, since IDs are assigned by position
        xml_files = sorted(f for f in os.listdir(self.sitemaps_folder)
                           if f.endswith('.xml'))
        
        print(f"Processing {len(xml_files)} sitemap files...")
        for xml_file in tqdm(xml_files):
            file_path = os.path.join(self.sitemaps_folder, xml_file)
            urls, lastmods = self.process_sitemap(file_path)
            all_urls.extend(urls)
            all_lastmods.extend(lastmods)
        
        return all_urls, all_lastmods

    def store_urls(self, urls, lastmods):
        """
        Store URLs and their vectors in ChromaDB.
        
        Args:
            urls (list): List of URLs to store
            lastmods (list): List of last modification dates
        """        
        if self.debugging:
            print("Converting URLs to vectors and storing in ChromaDB...")
        
        batch_size = 1000
        for i in tqdm(range(0, len(urls), batch_size)):
            batch_urls = urls[i:i + batch_size]
            batch_lastmods = lastmods[i:i + batch_size]
            
            vectors = [self.url_to_vector(url).tolist() for url in batch_urls]
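            # Some ChromaDB versions reject None as a metadata value, so a
            # missing <lastmod> is stored as an empty string
            metadatas = [{"lastmod": lm or ""} for lm in batch_lastmods]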
            
            self.collection.add(
                embeddings=vectors,
                documents=batch_urls,
                metadatas=metadatas,
                ids=[f"url_{j}" for j in range(i, i + len(batch_urls))]
            )

    def process_and_store(self):
        """Process all sitemaps and store URLs in ChromaDB."""
        urls, lastmods = self.process_all_sitemaps()
        if self.debugging:
            print(f"Found {len(urls)} URLs")
        self.store_urls(urls, lastmods)
        if self.debugging:
            print("URLs stored successfully in ChromaDB")

    def find_similar_url_by_path(self, query_url, n_results=2):
        """
        Find similar URLs to the query URL.
        
        Args:
            query_url (str): URL to find matches for
            n_results (int): Number of similar URLs to return
            
        Returns:
            dict: ChromaDB query results containing similar URLs and their distances
            
        Example:
            matcher.find_similar_url_by_path("https://example.com/incorrect-path")
        """
        # Normalize the query path
        query_path = self.normalize_path(query_url)
        
        # Convert to vector
        query_vector = self.url_to_vector(query_path)
        
        # Search in ChromaDB
        results = self.collection.query(
            query_embeddings=[query_vector.tolist()],
            n_results=n_results
        )
        
        # Check whether the exact path is among the top n_results
        # (this is not a full lookup of the database)
        exists = query_path in results['documents'][0]
        if exists:
            print("URL Exists in database")
        else:
            print("URL doesn't exist in database, here closer results\n")
            for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
                similarity = 1 - distance  # Convert distance to similarity score
                print(f"{i+1}. Path: https://www.motortrend.com{doc}/")
                print(f"   Similarity: {similarity:.4f}")
        
        return results

    def get_collection_stats(self):
        """
        Get statistics about the URL collection.
        
        Returns:
            int: Number of URLs in the collection
        """
        try:
            count = self.collection.count()
            
            sample = self.collection.get(
                limit=5,
                include=['documents', 'metadatas']
            )
            
            print(f"\nCollection Statistics:")
            print(f"Total number of stored URLs: {count}")
            
            if count > 0:
                print("\nSample of stored URLs:")
                for i, doc in enumerate(sample['documents']):
                    print(f"{i+1}. {doc}")
                    print(f"   Last modified: {sample['metadatas'][i]['lastmod']}")
            
            return count
                
        except Exception as e:
            print(f"Error getting collection stats: {e}")
            return 0
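
To make the feature extraction in `url_to_vector` easier to follow, here is a minimal standalone sketch that reproduces only the digit stripping, trigram splitting, and trigram hashing steps, without the position weighting, padding, or normalization. The helper names `slug_trigrams` and `trigram_value` are illustrative and do not exist in the class.

```python
from urllib.parse import urlparse


def slug_trigrams(url_or_path):
    """Return the character trigrams that url_to_vector would hash."""
    # Keep only the path, then take the last non-empty segment.
    path = urlparse(url_or_path).path if url_or_path.startswith("http") else url_or_path
    last_segment = path.strip("/").split("/")[-1]

    trigrams = []
    for word in last_segment.split("-"):
        # Digits are removed first, so "2008" becomes "" and "s2000" becomes "s".
        word = "".join(c for c in word if not c.isdigit())
        # Words shorter than three characters yield no trigrams.
        trigrams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return trigrams


def trigram_value(trigram):
    """Position-weighted sum of character codes, as in url_to_vector."""
    return sum(ord(c) * (i + 1) for i, c in enumerate(trigram))


query = slug_trigrams("https://www.motortrend.com/news/honda-pricing-cr-2008/")
stored = slug_trigrams("/news/2008-honda-s2000-cr-pricing")

print(query)                  # trigrams of "honda" followed by those of "pricing"
print(query == stored)        # True: digits and short tokens drop out of both
print(trigram_value("hon"))   # 104*1 + 111*2 + 110*3 = 656
```

Both paths reduce to the same trigram sequence, so they map to the same vector. This is consistent with the 1.0000 similarity reported for that pair in the test query at the end of the notebook, and it also shows that word order only matters for words that survive the digit and length filters.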

In [4]:
# Initialize the matcher
matcher = URLMatcher(collection_name="url_collection1")

In [5]:
# Verify the correct initialization/creation of the DB
print("\n=== Checking Collection Stats Before Processing ===")
count = matcher.get_collection_stats()


=== Checking Collection Stats Before Processing ===

Collection Statistics:
Total number of stored URLs: 0


In [6]:
# Process and store URLs
if count == 0:
    print("\n=== Processing and Storing URLs ===")
    matcher.process_and_store()



=== Processing and Storing URLs ===
Processing 26 sitemap files...


100%|██████████| 26/26 [00:00<00:00, 38.12it/s]
100%|██████████| 54/54 [00:21<00:00,  2.49it/s]


In [7]:
# Verify that the URLs were loaded into the DB
print("\n=== Checking Collection Stats After Processing ===")
count = matcher.get_collection_stats()


=== Checking Collection Stats After Processing ===

Collection Statistics:
Total number of stored URLs: 53277

Sample of stored URLs:
1. /news/mustang-1964
   Last modified: 2000-11-01T07:00:00.000Z
2. /news/porsche-911-turbo
   Last modified: 2000-09-02T06:42:00.000Z
3. /news/inside-cadillacs-project-blackfin
   Last modified: 2000-07-01T07:31:00.000Z
4. /news/0004-turp-abiogenic-petroleum-theory
   Last modified: 2000-04-01T05:52:00.000Z
5. /news/82799-ford-focus-rally-car
   Last modified: 2000-07-01T08:00:00.000Z
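
The second progress bar in the processing step counts batches, not URLs. A short sketch of the arithmetic, assuming the `batch_size` of 1000 used in `store_urls` and the 53277 stored URLs reported above (the `batch_ranges` helper is hypothetical):

```python
import math


def batch_ranges(total, batch_size=1000):
    """Yield (start, end) index pairs the way store_urls slices its input."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(53277))

print(len(ranges))               # 54 batches
print(ranges[-1])                # (53000, 53277): the last batch is partial
print(math.ceil(53277 / 1000))   # 54, the same count computed directly
```

The IDs follow the same indices, from `url_0` to `url_53276`, so an ID identifies a position in the processed list and not a particular URL.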


In [24]:
test_url = "https://www.motortrend.com/news/honda-pricing-cr-2008/"

In [25]:
# Find similar URLs
closer_results = matcher.find_similar_url_by_path(test_url)

URL doesn't exist in database, here closer results

1. Path: https://www.motortrend.com/news/2008-honda-s2000-cr-pricing/
   Similarity: 1.0000
2. Path: https://www.motortrend.com/news/2022-mazda-cx-30-pricing/
   Similarity: 1.0000
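
`find_similar_url_by_path` prints its matches but returns the raw query result. One possible way to turn that result into a list of suggestions is sketched below. The `to_suggestions` helper is hypothetical, the dictionary only mirrors the two keys the notebook reads (`documents` and `distances`), and the distance values are placeholders, not measured output.

```python
def to_suggestions(results, base_url):
    """Convert a ChromaDB-style query result into URL/similarity records."""
    docs = results["documents"][0]       # matches for the first (only) query
    distances = results["distances"][0]
    return [
        # Cosine distance is converted to similarity as 1 - distance,
        # the same conversion the notebook uses when printing.
        {"url": f"{base_url}{doc}/", "similarity": round(1 - dist, 4)}
        for doc, dist in zip(docs, distances)
    ]


# Hand-written stand-in for a query result; distances are placeholder values.
mock_results = {
    "documents": [[
        "/news/2008-honda-s2000-cr-pricing",
        "/news/2022-mazda-cx-30-pricing",
    ]],
    "distances": [[0.0, 0.0001]],
}

suggestions = to_suggestions(mock_results, "https://www.motortrend.com")
for item in suggestions:
    print(item["url"], item["similarity"])
```

Passing the base URL as an argument avoids hard-coding the site name inside the matcher.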
