
In [None]:
!pip install rapidfuzz

Collecting rapidfuzz
  Downloading rapidfuzz-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading rapidfuzz-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz
Successfully installed rapidfuzz-3.11.0


In [None]:
!pip install textdistance

Collecting textdistance
  Downloading textdistance-4.6.3-py3-none-any.whl.metadata (18 kB)
Downloading textdistance-4.6.3-py3-none-any.whl (31 kB)
Installing collected packages: textdistance
Successfully installed textdistance-4.6.3


In [None]:
import time
import numpy as np
from rapidfuzz import fuzz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import textdistance
from sentence_transformers import SentenceTransformer, util

# Set random seed for reproducibility
np.random.seed(42)

# Example strings for comparison (20 different pairs)
strings = [
    ("United State of America", "USA"),
    ("Ishan Weerakoon", "Ishan Thathsara Weerakoon"),
    ("New York", "NY"),
    ("California", "CA"),
    ("Machine Learning", "AI"),
    ("Artificial Intelligence", "Machine Learning"),
    ("Python Programming", "Java Programming"),
    ("Deep Learning", "Neural Networks"),
    ("Geospatial Data", "GIS"),
    ("Data Science", "Big Data"),
    ("Data Visualization", "Data Analytics"),
    ("Natural Language Processing", "Speech Recognition"),
    ("Computer Vision", "Image Processing"),
    ("Blockchain", "Cryptocurrency"),
    ("Cloud Computing", "Distributed Systems"),
    ("Internet of Things", "IoT"),
    ("Cybersecurity", "Information Security"),
    ("Software Engineering", "System Development"),
    ("Quantum Computing", "Quantum Algorithms"),
    ("Robotics", "Automation")
]

# Load SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Time each method on every string pair and print its score.
# time.perf_counter() is used because it is a monotonic, high-resolution clock.
for text1, text2 in strings:
    print(f"\nComparing: {text1} vs {text2}")

    # RapidFuzz: character-level similarity ratio on a 0-100 scale
    start = time.perf_counter()
    fuzzy_score = fuzz.ratio(text1, text2)
    rapidfuzz_time = time.perf_counter() - start
    print(f"RapidFuzz Score: {fuzzy_score} | Time: {rapidfuzz_time:.6f} sec")

    # Cosine similarity over TF-IDF vectors fitted on just this pair
    start = time.perf_counter()
    documents = [text1, text2]
    tfidf = TfidfVectorizer()
    vectors = tfidf.fit_transform(documents)
    cosine_sim = cosine_similarity(vectors[0], vectors[1])
    cosine_time = time.perf_counter() - start
    print(f"Cosine Similarity (TF-IDF): {cosine_sim[0][0]} | Time: {cosine_time:.6f} sec")

    # TextDistance: Damerau-Levenshtein similarity normalized to 0-1
    start = time.perf_counter()
    distance = textdistance.damerau_levenshtein.normalized_similarity(text1, text2)
    textdistance_time = time.perf_counter() - start
    print(f"TextDistance Similarity: {distance} | Time: {textdistance_time:.6f} sec")

    # SentenceTransformers: cosine similarity between sentence embeddings
    start = time.perf_counter()
    emb1 = model.encode(text1)
    emb2 = model.encode(text2)
    similarity = util.cos_sim(emb1, emb2)
    sentence_transformers_time = time.perf_counter() - start
    print(f"Semantic Similarity: {similarity.item()} | Time: {sentence_transformers_time:.6f} sec")


Comparing: United State of America vs USA
RapidFuzz Score: 23.076923076923073 | Time: 0.000018 sec
Cosine Similarity (TF-IDF): 0.0 | Time: 0.006275 sec
TextDistance Similarity: 0.13043478260869568 | Time: 0.000094 sec
Semantic Similarity: 0.6735268831253052 | Time: 0.056481 sec

Comparing: Ishan Weerakoon vs Ishan Thathsara Weerakoon
RapidFuzz Score: 75.0 | Time: 0.000012 sec
Cosine Similarity (TF-IDF): 0.7092972666062739 | Time: 0.006552 sec
TextDistance Similarity: 0.6 | Time: 0.000074 sec
Semantic Similarity: 0.9155293703079224 | Time: 0.098614 sec

Comparing: New York vs NY
RapidFuzz Score: 40.0 | Time: 0.000012 sec
Cosine Similarity (TF-IDF): 0.0 | Time: 0.011745 sec
TextDistance Similarity: 0.25 | Time: 0.000077 sec
Semantic Similarity: 0.8770714402198792 | Time: 0.092146 sec

Comparing: California vs CA
RapidFuzz Score: 16.666666666666664 | Time: 0.000014 sec
Cosine Similarity (TF-IDF): 0.0 | Time: 0.004665 sec
TextDistance Similarity: 0.09999999999999998 | Time: 0.000070 sec
S

### Model Comparison Summary

#### **RapidFuzz**:
- **Best for speed on short, direct strings**.
- Use RapidFuzz when **speed** is the top priority and the strings are **short and literally similar**. In the run above each comparison took roughly 10 to 20 microseconds. Abbreviations score low, though (about 23.1 for "United State of America" vs "USA"), because `fuzz.ratio` compares characters, not meaning.
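
To make the RapidFuzz scores above easier to interpret, here is a minimal pure-Python sketch of a normalized Indel similarity, which is what the RapidFuzz documentation describes `fuzz.ratio` as computing. The helper names `lcs_length` and `indel_ratio` are hypothetical and written for illustration only; the library's own implementation is far more optimized.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Normalized Indel similarity on a 0-100 scale (illustrative helper)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # Each character outside the common subsequence costs one insertion
    # or deletion, so similarity = 2 * LCS / (len(a) + len(b)).
    return 100.0 * 2 * lcs_length(a, b) / total


print(indel_ratio("New York", "NY"))  # 40.0, the same score as in the output above
```

Only "N" and "Y" are shared, so the score is 2 * 2 / (8 + 2) = 0.4. "California" vs "CA" scores even lower because the comparison is case-sensitive and only "C" matches.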

#### **Cosine Similarity (TF-IDF)**:
- **Best for comparing documents by word significance**.
- Use TF-IDF cosine similarity for **larger, document-like content**. It weights each word by its **frequency and rarity**, so two texts score above zero only when they share at least one token. In the run above, every abbreviation pair ("United State of America" vs "USA", "New York" vs "NY", "California" vs "CA") scored 0.0.
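
The zero scores for abbreviation pairs follow directly from how TF-IDF vectors are built. The sketch below re-implements the computation for a two-document corpus in plain Python, assuming scikit-learn's documented defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization). The function names `tokenize` and `tfidf_cosine` are hypothetical; this is an illustration, not the library's code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_cosine(doc_a, doc_b):
    """Cosine similarity of TF-IDF vectors fitted on just these two texts."""
    counts = [Counter(tokenize(doc_a)), Counter(tokenize(doc_b))]
    n_docs = len(counts)
    vectors = []
    for term_counts in counts:
        vec = {}
        for term, tf in term_counts.items():
            df = sum(1 for other in counts if term in other)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
            vec[term] = tf * idf
        vectors.append(vec)
    dot = sum(w * vectors[1].get(term, 0.0) for term, w in vectors[0].items())
    norm_a = math.sqrt(sum(w * w for w in vectors[0].values()))
    norm_b = math.sqrt(sum(w * w for w in vectors[1].values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


print(tfidf_cosine("New York", "NY"))  # 0.0: no token is shared
print(tfidf_cosine("Ishan Weerakoon", "Ishan Thathsara Weerakoon"))
```

The second call gives approximately 0.709, in line with the notebook output, because two of the three tokens are shared.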

#### **TextDistance (Edit Distance)**:
- **Best for handling minor text differences or typos**.
- TextDistance is effective when two strings differ by **minor variations or typos**. The Damerau-Levenshtein measure used here counts how many edits (insertions, deletions, substitutions, and transpositions of adjacent characters) are needed to turn one string into the other, then normalizes by the length of the longer string.
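
As an illustration of the edit counting described above, the following sketch implements the optimal string alignment variant of Damerau-Levenshtein distance and normalizes it by the longer string's length. Whether this matches the `textdistance` default in every edge case is an assumption; the function names `osa_distance` and `normalized_similarity` are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_similarity(a, b):
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


print(osa_distance("recieve", "receive"))         # 1: a single adjacent swap
print(normalized_similarity("New York", "NY"))    # 0.25, as in the output above
```

A typo costs one edit, so the similarity stays high, while an abbreviation such as "NY" requires deleting most of the longer string.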

#### **Semantic Similarity (Sentence Transformers)**:
- **Best for capturing semantic meaning**.
- This method excels at comparing strings with different wording but **similar meanings**. It is the most **robust of the four methods for semantic similarity**, giving meaningful scores even when phrases differ significantly in structure or vocabulary (0.88 for "New York" vs "NY"). However, it is the **slowest**: roughly 0.06 to 0.10 seconds per pair in the run above, compared with well under a millisecond for the character-based methods.
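
The semantic score is a cosine similarity between two embedding vectors, which is what `util.cos_sim` computes on the outputs of `model.encode`. The sketch below shows that final step in plain Python. The 4-dimensional vectors are made-up toy values chosen for illustration, not real model outputs (`all-MiniLM-L6-v2` produces 384-dimensional embeddings).

```python
import math


def cos_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings: related phrases point in similar directions
emb_new_york = [0.9, 0.1, 0.3, 0.0]
emb_ny = [0.8, 0.2, 0.4, 0.1]
emb_robotics = [0.0, 0.9, 0.1, 0.7]

related = cos_sim(emb_new_york, emb_ny)
unrelated = cos_sim(emb_new_york, emb_robotics)
print(related > unrelated)  # True
```

Because the comparison happens in embedding space, two strings can score highly without sharing a single character or token.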

# **Elasticsearch**

In [None]:
!pip install elasticsearch

Collecting elasticsearch
  Downloading elasticsearch-8.17.0-py3-none-any.whl.metadata (8.8 kB)
Collecting elastic-transport<9,>=8.15.1 (from elasticsearch)
  Downloading elastic_transport-8.17.0-py3-none-any.whl.metadata (3.6 kB)
Downloading elasticsearch-8.17.0-py3-none-any.whl (571 kB)
Downloading elastic_transport-8.17.0-py3-none-any.whl (64 kB)
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.17.0 elasticsearch-8.17.0


In [None]:
from elasticsearch import Elasticsearch
import random

client = Elasticsearch(
    "https://my-elasticsearch-project-dd725e.es.eu-west-1.aws.elastic.cloud:443",
    api_key=""
)

index_name = "search-yn0c"

mappings = {
    "mappings": {
        "properties": {
            "text": {
                "type": "text"  # This is for full-text search
            }
        }
    }
}

# Create the index
if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name, mappings=mappings["mappings"])
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")


def generate_random_text(length=10):
    words = ["sample", "text", "testing", "example", "Elasticsearch", "match", "full-text", "search", "ishan", "focus"]
    return " ".join(random.choices(words, k=length))

documents = [{"text": generate_random_text()} for _ in range(200)]

# Index the documents
for i, doc in enumerate(documents):
    client.index(index=index_name, id=i, document=doc)
    print(f"Document {i} indexed.")

# Perform a text match query
query = {
    "query": {
        "match": {
            "text": "This is a ft sample text for testing"  # The text to search for
        }
    }
}

# Execute the search query
response = client.search(index=index_name, query=query["query"])

# Print the results
print("\nSearch Results:")
for hit in response["hits"]["hits"]:
    print(f"Text: {hit['_source']['text']} | Score: {hit['_score']}")


Index 'search-yn0c' already exists.
Document 0 indexed.
Document 1 indexed.
Document 2 indexed.
Document 3 indexed.
Document 4 indexed.
Document 5 indexed.
Document 6 indexed.
Document 7 indexed.
Document 8 indexed.
Document 9 indexed.
Document 10 indexed.
Document 11 indexed.
Document 12 indexed.
Document 13 indexed.
Document 14 indexed.
Document 15 indexed.
Document 16 indexed.
Document 17 indexed.
Document 18 indexed.
Document 19 indexed.
Document 20 indexed.
Document 21 indexed.
Document 22 indexed.
Document 23 indexed.
Document 24 indexed.
Document 25 indexed.
Document 26 indexed.
Document 27 indexed.
Document 28 indexed.
Document 29 indexed.
Document 30 indexed.
Document 31 indexed.
Document 32 indexed.
Document 33 indexed.
Document 34 indexed.
Document 35 indexed.
Document 36 indexed.
Document 37 indexed.
Document 38 indexed.
Document 39 indexed.
Document 40 indexed.
Document 41 indexed.
Document 42 indexed.
Document 43 indexed.
Document 44 indexed.
Document 45 indexed.
Document