<a href="https://colab.research.google.com/github/oasquared/DDDS-Cohort-16-Projects/blob/main/Cosine_distance_and_Euclidean_Distance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Expected Result from the Analysis
- Cosine distance between doc1 and doc3 should be close to zero (high similarity), since doc3 is doc1 repeated and the two share the same term distribution.

- Cosine distance between doc1 and doc2 should be larger, since the two documents share only one word ("little").

- Euclidean distance between doc1 and doc3 should be large, because repeating doc1 doubles the magnitude of its vector without changing its direction.

- Euclidean distance may therefore incorrectly judge doc1 and doc3 as dissimilar, purely because of the magnitude difference.
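These expectations can be checked on a toy example before touching any text. The sketch below assumes a hypothetical four-term raw count vector `v` and its doubled copy `w`; the vectors and the two helper functions are illustrative and are not part of the notebook.

```python
import numpy as np

# Hypothetical raw term counts for a short document (one entry per term),
# and the same document repeated twice, which doubles every count.
v = np.array([1.0, 1.0, 1.0, 1.0])
w = 2 * v

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Return the straight-line distance between a and b."""
    return np.linalg.norm(a - b)

cos_d = cosine_distance(v, w)
euc_d = euclidean_distance(v, w)

print("cosine distance:   ", cos_d)  # zero: both vectors point the same way
print("Euclidean distance:", euc_d)  # equals the length of v, which is 2
```

On raw counts, repetition changes only the length of the vector, which cosine distance ignores and Euclidean distance measures directly.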

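## Outcome after coding

- Doc1 and Doc3 come out identical under both metrics: their cosine distance and their Euclidean distance are both 0. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the doubled magnitude of doc3 is removed before either distance is computed.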
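The zero Euclidean distance comes from the default L2 normalization in `TfidfVectorizer`. A minimal sketch, assuming the vectorizer is rerun with `norm=None` (the variable names here are illustrative), compares the doc1/doc3 distances with and without normalization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = f"{doc1} {doc1}"  # doc1 repeated twice
docs = [doc1, doc2, doc3]

# Default settings: every row is scaled to unit L2 length.
normalized = TfidfVectorizer().fit_transform(docs)
# norm=None keeps the raw TF-IDF weights, so repetition changes magnitude.
unnormalized = TfidfVectorizer(norm=None).fit_transform(docs)

d_euc_norm = euclidean_distances(normalized)[0, 2]
d_euc_raw = euclidean_distances(unnormalized)[0, 2]
d_cos_raw = cosine_distances(unnormalized)[0, 2]

print("Euclidean doc1-doc3, norm='l2':", round(d_euc_norm, 3))
print("Euclidean doc1-doc3, norm=None:", round(d_euc_raw, 3))
print("Cosine    doc1-doc3, norm=None:", round(d_cos_raw, 3))
```

Without normalization the Euclidean distance between doc1 and doc3 becomes clearly positive, while the cosine distance stays at zero, which is the behavior the expectations above describe.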

## Reading
| Metric    | Sensitive to Length? | Common for Text? | Better for TF-IDF? |
| --------- | -------------------- | ---------------- | ------------------ |
| Cosine    | No                   | Yes              | Yes                |
| Euclidean | Yes                  | No               | No                 |

## Code

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances
import pandas as pd

# Define documents
doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = f"{doc1} " * 2  # doc1 repeated twice to increase length

docs = [doc1, doc2, doc3]
labels = ["doc1", "doc2", "doc3"]

# TF-IDF vectorization (rows are L2-normalized by default)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Compute pairwise distance matrices
cosine_dist = cosine_distances(tfidf_matrix)
euclid_dist = euclidean_distances(tfidf_matrix)

# Display results
print("Cosine Distance Matrix:")
print(pd.DataFrame(cosine_dist, index=labels, columns=labels))

print("\nEuclidean Distance Matrix:")
print(pd.DataFrame(euclid_dist, index=labels, columns=labels))



Cosine Distance Matrix:
          doc1      doc2      doc3
doc1  0.000000  0.895521  0.000000
doc2  0.895521  0.000000  0.895521
doc3  0.000000  0.895521  0.000000

Euclidean Distance Matrix:
          doc1      doc2      doc3
doc1  0.000000  1.338298  0.000000
doc2  1.338298  0.000000  1.338298
doc3  0.000000  1.338298  0.000000
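
Because the TF-IDF rows have unit length by default, the two matrices above are not independent: for unit vectors, `||a - b||^2 = 2 * (1 - cos)`, so the Euclidean distance is the square root of twice the cosine distance. The sketch below plugs the printed doc1/doc2 values into that identity; the variable names are illustrative.

```python
import math

# Values copied from the doc1/doc2 entries of the matrices printed above.
cosine_dist_doc1_doc2 = 0.895521
euclid_dist_doc1_doc2 = 1.338298

# For unit-length vectors: euclidean = sqrt(2 * cosine_distance).
predicted_euclid = math.sqrt(2 * cosine_dist_doc1_doc2)

print("predicted:", round(predicted_euclid, 5))
print("observed: ", round(euclid_dist_doc1_doc2, 5))
print("match:", math.isclose(predicted_euclid, euclid_dist_doc1_doc2, abs_tol=1e-5))
```

On normalized vectors the two metrics therefore rank document pairs in the same order, which is why they agree throughout this notebook.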


In [13]:
doc3

'Mary had a little lamb Mary had a little lamb '

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import pandas as pd

# Define documents
doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = f"{doc1} " * 2  # Repetition to increase length

docs = [doc1, doc2, doc3]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Compute cosine similarity
cos_sim = cosine_similarity(tfidf_matrix)

# Compute Euclidean distances and convert to similarity
euc_dist = euclidean_distances(tfidf_matrix)
euc_sim = 1 / (1 + euc_dist)  # Map distance in [0, inf) to similarity in (0, 1]

# Format and display
labels = ["doc1", "doc2", "doc3"]
cos_sim_df = pd.DataFrame(cos_sim, index=labels, columns=labels)
euc_sim_df = pd.DataFrame(euc_sim, index=labels, columns=labels)

print("Cosine Similarity Matrix:")
print(cos_sim_df)

print("\nEuclidean Similarity Matrix (1 / (1 + dist)):")
print(euc_sim_df)


Cosine Similarity Matrix:
          doc1      doc2      doc3
doc1  1.000000  0.104479  1.000000
doc2  0.104479  1.000000  0.104479
doc3  1.000000  0.104479  1.000000

Euclidean Similarity Matrix (1 / (1 + dist)):
          doc1      doc2      doc3
doc1  1.000000  0.427661  1.000000
doc2  0.427661  1.000000  0.427661
doc3  1.000000  0.427661  1.000000
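
The `1 / (1 + dist)` values need some care when read next to cosine similarity. TF-IDF weights are non-negative and the rows have unit length by default, so the largest possible Euclidean distance between two documents is `sqrt(2)` (no shared terms). The sketch below, with illustrative names, works out the resulting lower bound on this similarity score.

```python
import math

def euclidean_similarity(distance):
    """Convert a non-negative distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

# Two unit vectors with no terms in common are orthogonal: distance sqrt(2).
max_distance = math.sqrt(2.0)
floor = euclidean_similarity(max_distance)

# doc1/doc2 Euclidean distance taken from the earlier distance matrix.
observed = euclidean_similarity(1.338298)

print("lowest possible score:", round(floor, 3))     # 0.414
print("doc1-doc2 score:      ", round(observed, 3))  # 0.428
```

A score of 0.428 sits just above the floor of about 0.414, so it signals almost no overlap, consistent with the cosine similarity of 0.104 for the same pair.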


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances, euclidean_distances
import pandas as pd

# Step 1: Define the documents
doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = f"{doc1} " * 2  # Longer version of doc1

docs = [doc1, doc2, doc3]
labels = ["doc1", "doc2", "doc3"]

# Step 2: TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Step 3: Compute Distances
cosine_dist = cosine_distances(tfidf_matrix)
euclidean_dist = euclidean_distances(tfidf_matrix)

# Step 4: Convert to Similarities
cosine_sim = cosine_similarity(tfidf_matrix)  # or 1 - cosine_dist
euclidean_sim = 1 / (1 + euclidean_dist)      # custom similarity from distance

# Step 5: Wrap results in DataFrames for readability
def to_df(matrix, name):
    return pd.DataFrame(matrix, index=labels, columns=labels).round(3).rename_axis(name)

# Step 6: Print everything
print("=== Cosine Similarity ===")
print(to_df(cosine_sim, "Cosine Sim"))

print("\n=== Cosine Distance ===")
print(to_df(cosine_dist, "Cosine Dist"))

print("\n=== Euclidean Distance ===")
print(to_df(euclidean_dist, "Euclidean Dist"))

print("\n=== Euclidean Similarity (1 / (1 + distance)) ===")
print(to_df(euclidean_sim, "Euclidean Sim"))


=== Cosine Similarity ===
             doc1   doc2   doc3
Cosine Sim                     
doc1        1.000  0.104  1.000
doc2        0.104  1.000  0.104
doc3        1.000  0.104  1.000

=== Cosine Distance ===
              doc1   doc2   doc3
Cosine Dist                     
doc1         0.000  0.896  0.000
doc2         0.896  0.000  0.896
doc3         0.000  0.896  0.000

=== Euclidean Distance ===
                 doc1   doc2   doc3
Euclidean Dist                     
doc1            0.000  1.338  0.000
doc2            1.338  0.000  1.338
doc3            0.000  1.338  0.000

=== Euclidean Similarity (1 / (1 + distance)) ===
                doc1   doc2   doc3
Euclidean Sim                     
doc1           1.000  0.428  1.000
doc2           0.428  1.000  0.428
doc3           1.000  0.428  1.000


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances, euclidean_distances
import pandas as pd

# Step 1: Define the documents
doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = ("Mary had a little lamb " * 2).strip()  # Longer version of doc1

docs = [doc1, doc2, doc3]
labels = ["doc1", "doc2", "doc3"]

# Step 2: TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Step 3: Compute distance and similarity matrices
cos_sim = cosine_similarity(tfidf_matrix)
cos_dist = cosine_distances(tfidf_matrix)
euc_dist = euclidean_distances(tfidf_matrix)
euc_sim = 1 / (1 + euc_dist)

# Step 4: Create combined table
pairs = []
for i in range(len(docs)):
    for j in range(i, len(docs)):  # Upper triangle: each pair (and self-pair) once
        pairs.append({
            "Pair": f"{labels[i]}-{labels[j]}",
            "Cosine Similarity": round(cos_sim[i, j], 3),
            "Cosine Distance": round(cos_dist[i, j], 3),
            "Euclidean Distance": round(euc_dist[i, j], 3),
            "Euclidean Similarity": round(euc_sim[i, j], 3)
        })

# Convert to DataFrame and display
combined_df = pd.DataFrame(pairs)
print("\n=== Combined Distance & Similarity Table ===")
print(combined_df.to_string(index=False))



=== Combined Distance & Similarity Table ===
     Pair  Cosine Similarity  Cosine Distance  Euclidean Distance  Euclidean Similarity
doc1-doc1              1.000            0.000               0.000                 1.000
doc1-doc2              0.104            0.896               1.338                 0.428
doc1-doc3              1.000            0.000               0.000                 1.000
doc2-doc2              1.000            0.000               0.000                 1.000
doc2-doc3              0.104            0.896               1.338                 0.428
doc3-doc3              1.000            0.000               0.000                 1.000


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# Define documents
doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = ("Mary had a little lamb " * 2).strip()  # Clean repeated doc1
docs = [doc1, doc2, doc3]
labels = ["doc1", "doc2", "doc3"]

# Vectorize using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

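# Helper function to run a nearest-neighbor search with the given metric.
# With n_neighbors=3 and three documents, every document is returned,
# including the query itself at distance 0.
def run_knn(metric_name, metric):
    knn = NearestNeighbors(n_neighbors=3, metric=metric)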
    knn.fit(tfidf_matrix)
    distances, indices = knn.kneighbors(tfidf_matrix)

    # Collect results
    result = []
    for i, (dists, nbrs) in enumerate(zip(distances, indices)):
        for dist, nbr in zip(dists, nbrs):
            result.append({
                "Doc": labels[i],
                "Neighbor": labels[nbr],
                f"{metric_name} Distance": round(dist, 3)
            })
    return pd.DataFrame(result)

# Run KNN using Cosine and Euclidean
cosine_results = run_knn("Cosine", "cosine")
euclidean_results = run_knn("Euclidean", "euclidean")

# Display results
print("=== KNN with Cosine Distance ===")
print(cosine_results.to_string(index=False))

print("\n=== KNN with Euclidean Distance ===")
print(euclidean_results.to_string(index=False))


=== KNN with Cosine Distance ===
 Doc Neighbor  Cosine Distance
doc1     doc1            0.000
doc1     doc3            0.000
doc1     doc2            0.896
doc2     doc2            0.000
doc2     doc1            0.896
doc2     doc3            0.896
doc3     doc1            0.000
doc3     doc3            0.000
doc3     doc2            0.896

=== KNN with Euclidean Distance ===
 Doc Neighbor  Euclidean Distance
doc1     doc1               0.000
doc1     doc3               0.000
doc1     doc2               1.338
doc2     doc2               0.000
doc2     doc1               1.338
doc2     doc3               1.338
doc3     doc1               0.000
doc3     doc3               0.000
doc3     doc2               1.338
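
In the tables above each document appears among its own neighbors, because the fitted matrix is also passed in as the query. A possible variant, sketched here with illustrative names, calls `kneighbors()` with no argument, in which case scikit-learn does not count a fitted point as its own neighbor.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

doc1 = "Mary had a little lamb"
doc2 = "Say hello to my little friend"
doc3 = f"{doc1} {doc1}"  # doc1 repeated twice
labels = ["doc1", "doc2", "doc3"]

tfidf_matrix = TfidfVectorizer().fit_transform([doc1, doc2, doc3])

# Fit on the corpus, then query with no argument: each fitted document is
# matched against the others and is not returned as its own neighbor.
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(tfidf_matrix)
distances, indices = knn.kneighbors()

neighbors = {
    labels[i]: [labels[j] for j in row] for i, row in enumerate(indices)
}
for doc, nbrs in neighbors.items():
    print(doc, "->", nbrs)
```

With self-matches removed, the nearest neighbor of doc1 is doc3 and vice versa, which makes the effect of the repeated text easier to see.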
