### **Cosine Similarity with TF-IDF**

### Term Frequency (TF)
Term Frequency (TF) measures how often a term appears in a document, normalized by the document's length. It is calculated as:

$$
TF(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
$$

where:
- $f_{t,d}$ is the number of times term $t$ appears in document $d$,
- $\sum_{t' \in d} f_{t',d}$ is the total number of terms in document $d$.

**Example:**  
If a document contains **100 words**, and the word "data" appears **5 times**, then:

$$
TF(\text{"data"}) = \frac{5}{100} = 0.05
$$

---

### Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) measures how informative a term is: the fewer documents contain it, the higher its weight. Using a base-10 logarithm (the base only rescales the scores), it is given by:

$$
IDF(t) = \log_{10} \frac{N}{DF(t)}
$$

where:
- $N$ is the total number of documents,
- $DF(t)$ is the number of documents that contain term $t$.

**Example:**  
If there are **10,000 documents** and "data" appears in **1,000** of them, then:

$$
IDF(\text{"data"}) = \log_{10} \frac{10000}{1000} = \log_{10} 10 = 1
$$

---

### TF-IDF Formula
TF-IDF is computed as:

$$
\text{TF-IDF}(t, d) = TF(t,d) \times IDF(t)
$$

Using our previous examples:

$$
\text{TF-IDF}(\text{"data"}) = 0.05 \times 1 = 0.05
$$

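Thus, the **TF-IDF score for "data" in this document is 0.05**. A term scores high when it is frequent in the document but rare across the corpus, so the value is meaningful only in comparison with the scores of other terms in the same corpus.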
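
The worked example can be checked with a few lines of plain Python. This is a minimal sketch, not a library API: the helper names `term_frequency` and `inverse_document_frequency` are made up for illustration, and the toy corpus is scaled down to 10 documents (one of which contains "data") so that the ratio $N / DF(t) = 10$ matches the example.

```python
import math
from collections import Counter


def term_frequency(term, tokens):
    """TF(t, d): occurrences of `term` divided by the number of tokens in the document."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """IDF(t): base-10 log of (number of documents / documents containing `term`)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / df)  # raises ZeroDivisionError if no document has the term


# Hypothetical toy data: a 100-token document in which "data" appears 5 times ...
document = ["data"] * 5 + ["other"] * 95
# ... inside a 10-document corpus where only that document contains "data".
corpus = [document] + [["other"]] * 9

tf = term_frequency("data", document)
idf = inverse_document_frequency("data", corpus)
tfidf = tf * idf
print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tfidf:.2f}")
# TF = 0.05, IDF = 1.00, TF-IDF = 0.05
```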


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample texts
text1 = "Machine learning is a field of artificial intelligence."
text2 = "Deep learning is a branch of artificial intelligence and machine learning."
# Convert texts to TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([text1, text2])
tfidf_array = tfidf_matrix.toarray()

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create dictionaries with TF-IDF scores for each document
tfidf_scores_doc1 = dict(zip(feature_names, tfidf_array[0]))
tfidf_scores_doc2 = dict(zip(feature_names, tfidf_array[1]))

# Print TF-IDF scores for each word in each document
print("TF-IDF scores for Document 1:")
for word, score in sorted(tfidf_scores_doc1.items()):
    print(f"  {word}: {score:.4f}")

print("\nTF-IDF scores for Document 2:")
for word, score in sorted(tfidf_scores_doc2.items()):
    print(f"  {word}: {score:.4f}")

# Print vector representation of each document
print("\nVector representation of Document 1:")
print(tfidf_array[0])

print("\nVector representation of Document 2:")
print(tfidf_array[1])


TF-IDF scores for Document 1:
  and: 0.0000
  artificial: 0.3541
  branch: 0.0000
  deep: 0.0000
  field: 0.4977
  intelligence: 0.3541
  is: 0.3541
  learning: 0.3541
  machine: 0.3541
  of: 0.3541

TF-IDF scores for Document 2:
  and: 0.3638
  artificial: 0.2588
  branch: 0.3638
  deep: 0.3638
  field: 0.0000
  intelligence: 0.2588
  is: 0.2588
  learning: 0.5177
  machine: 0.2588
  of: 0.2588

Vector representation of Document 1:
[0.         0.35409974 0.         0.         0.49767483 0.35409974
 0.35409974 0.35409974 0.35409974 0.35409974]

Vector representation of Document 2:
[0.36378803 0.25883818 0.36378803 0.36378803 0.         0.25883818
 0.25883818 0.51767635 0.25883818 0.25883818]
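
These values do not follow the plain formulas above directly. With its default settings, `TfidfVectorizer` uses raw counts for TF, a smoothed natural-log IDF, $\ln\frac{1+N}{1+DF(t)} + 1$, and then scales each document vector to unit L2 length; its default tokenizer also drops one-character tokens such as "a". The sketch below re-implements that recipe in plain Python to reproduce the scores. `sklearn_style_tfidf` is an illustrative name, not part of scikit-learn.

```python
import math
import re
from collections import Counter


def sklearn_style_tfidf(docs):
    """Approximate TfidfVectorizer's defaults: smoothed IDF and L2-normalized rows."""
    # Lowercase and keep tokens of two or more word characters (so "a" is dropped).
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed IDF with natural log: ln((1 + N) / (1 + DF)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocab]  # raw count times IDF
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in raw])
    return vocab, rows


docs = [
    "Machine learning is a field of artificial intelligence.",
    "Deep learning is a branch of artificial intelligence and machine learning.",
]
vocab, rows = sklearn_style_tfidf(docs)
scores_doc1 = dict(zip(vocab, rows[0]))
scores_doc2 = dict(zip(vocab, rows[1]))
print(f"field (doc 1): {scores_doc1['field']:.4f}")
print(f"learning (doc 2): {scores_doc2['learning']:.4f}")
```

Because every row has unit length after normalization, the cosine similarity of two documents reduces to the dot product of their rows.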


In [2]:
# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
print(f"Cosine Similarity (TF-IDF): {cosine_sim[0][0]:.4f}")

Cosine Similarity (TF-IDF): 0.6416


### **Cosine Similarity with Word Embeddings (GloVe and spaCy)**

In [4]:
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
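# Load pretrained 50-dimensional GloVe vectors (downloaded on first use)
glove_model = api.load("glove-wiki-gigaword-50")


def get_embedding(text, model):
    """Compute the document embedding by averaging word vectors."""
    # A whitespace split keeps punctuation attached (e.g. "intelligence."),
    # so such tokens are skipped if they are not in the model's vocabulary.
    words = text.lower().split()
    word_vectors = [model[word] for word in words if word in model]
    if not word_vectors:
        return np.zeros(model.vector_size)  # No known words: return a zero vector
    return np.mean(word_vectors, axis=0)


# Compute embeddings (text1 and text2 are defined in the TF-IDF cell above)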
embedding1 = get_embedding(text1, glove_model)
embedding2 = get_embedding(text2, glove_model)
# Compute cosine similarity
cosine_sim = cosine_similarity([embedding1], [embedding2])
print(f"Cosine Similarity (Word Embeddings - GloVe): {cosine_sim[0][0]:.4f}")

Cosine Similarity (Word Embeddings - GloVe): 0.9705


In [8]:
import spacy
from sklearn.metrics.pairwise import cosine_similarity

# Load spaCy's medium English model, which includes word vectors
nlp = spacy.load("en_core_web_md")  # "en_core_web_lg" ships a larger vector table

def get_embedding(text):
    """Compute the document embedding by averaging word vectors."""
    doc = nlp(text)
    return doc.vector  # By default, doc.vector is the average of the token vectors

# Example texts
text1 = "Machine learning is fascinating."
text2 = "Deep learning is a subset of machine learning."

# Compute embeddings
embedding1 = get_embedding(text1)
embedding2 = get_embedding(text2)

# Compute cosine similarity
cosine_sim = cosine_similarity([embedding1], [embedding2])[0][0]

print(f"Cosine Similarity (Word Embeddings - spaCy): {cosine_sim:.4f}")


Cosine Similarity (Word Embeddings - spaCy): 0.9144


### **1. Cosine Similarity**
Cosine Similarity measures the cosine of the angle between two vectors, indicating how similar they are in direction.

$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}
$$

where:
- $A \cdot B$ is the dot product of vectors $A$ and $B$,
- $\|A\|$ and $\|B\|$ are the magnitudes (norms) of the vectors.
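
The formula translates almost term by term into code. The following is a minimal sketch in plain Python; `cosine_similarity_manual` is an illustrative name, and returning 0.0 for a zero vector is a convention chosen here, since the formula itself is undefined in that case.

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here; undefined for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity_manual([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity_manual([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine_similarity_manual([1, 2], [-1, -2])            # opposite directions
print(same_direction, orthogonal, opposite)
```

Parallel vectors give a value of about 1, perpendicular vectors give 0, and opposite vectors give about -1, regardless of their lengths.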

---

### **2. Euclidean Distance**
Euclidean Distance measures the straight-line distance between two points in space.

$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
$$

where:
- $A_i$ and $B_i$ are the components of vectors $A$ and $B$.
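
As a quick check of the formula, here is a small sketch using a 3-4-5 right triangle; `euclidean_distance` is an illustrative helper, compared against the standard library's `math.dist` (available from Python 3.8).

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]
manual = euclidean_distance(point_a, point_b)
builtin = math.dist(point_a, point_b)
print(manual, builtin)  # 5.0 5.0
```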

---

### **3. Manhattan Distance**
Manhattan Distance (L1 distance) measures the sum of the absolute differences between the components of two vectors.

$$
d(A, B) = \sum_{i=1}^{n} |A_i - B_i|
$$

Each dimension contributes its absolute difference separately, like counting city blocks along a street grid.
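
A minimal sketch of the same sum in plain Python, with `manhattan_distance` as an illustrative helper name:

```python
def manhattan_distance(a, b):
    """Sum of the absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
distance = manhattan_distance([1, 2, 3], [4, 0, 3])
print(distance)  # 5
```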

---

### **4. Jaccard Similarity**
Jaccard Similarity measures the ratio of the intersection over the union of two sets.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

where:
- $|A \cap B|$ is the number of common elements,
- $|A \cup B|$ is the total number of unique elements.

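Jaccard similarity is defined on sets, so continuous vectors must be binarized first, for example by treating positive components as present. For dense embeddings, where almost every component is nonzero, this is only a rough proxy for similarity.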
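
Jaccard similarity is most natural on sets of tokens. The sketch below applies the formula to the word sets of two short sentences; the `\w+` tokenization and the helper name `jaccard_similarity` are assumptions made for illustration.

```python
import re


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


tokens_a = set(re.findall(r"\w+", "Machine learning is fascinating.".lower()))
tokens_b = set(re.findall(r"\w+", "Deep learning is a subset of machine learning.".lower()))

# Shared words: {"machine", "learning", "is"}; 8 unique words in total.
similarity = jaccard_similarity(tokens_a, tokens_b)
print(similarity)  # 0.375
```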

---

### **5. Pearson Correlation Coefficient**
Pearson Correlation measures the linear relationship between two vectors.

$$
r = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2} \sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}}
$$

where:
- $\bar{A}$ and $\bar{B}$ are the means of vectors $A$ and $B$.
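
Written out in code, the formula is the cosine similarity of the two vectors after subtracting their means. The sketch below makes that explicit; `pearson_correlation` is an illustrative helper, and it does not handle constant vectors, for which the denominator is zero.

```python
import math


def pearson_correlation(a, b):
    """Pearson r: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(centered_a, centered_b))
    norm_a = math.sqrt(sum(x * x for x in centered_a))
    norm_b = math.sqrt(sum(y * y for y in centered_b))
    return dot / (norm_a * norm_b)  # division by zero if either vector is constant


a = [1.0, 2.0, 3.0, 4.0]
r_scaled = pearson_correlation(a, [2 * x + 10 for x in a])  # scaled and shifted copy of a
r_reversed = pearson_correlation(a, [4.0, 3.0, 2.0, 1.0])   # same values in reverse order
print(r_scaled, r_reversed)
```

A scaled and shifted copy of a vector correlates at about 1 with the original, while the reversed vector correlates at about -1.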

---

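These formulas capture different notions of closeness between two vectors. Cosine similarity ignores vector magnitude; Pearson correlation additionally ignores a constant offset, since it is the cosine similarity of the mean-centered vectors; Euclidean and Manhattan distance are sensitive to magnitude; and Jaccard similarity applies to sets or binary vectors. The choice of measure depends on the specific application.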


In [None]:
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean, cityblock, jaccard
from scipy.stats import pearsonr

# Load pretrained GloVe model
glove_model = api.load("glove-wiki-gigaword-50")

def get_embedding(text, model):
    """Compute the document embedding by averaging word vectors."""
    words = text.lower().split()
    word_vectors = [model[word] for word in words if word in model]
    
    if not word_vectors:
        return np.zeros(model.vector_size)  # Return zero vector if no words found
    
    return np.mean(word_vectors, axis=0)

# Example texts
text1 = "Machine learning is fascinating."
text2 = "Deep learning is a subset of machine learning."

# Compute embeddings
embedding1 = get_embedding(text1, glove_model)
embedding2 = get_embedding(text2, glove_model)

# Compute Cosine Similarity
cosine_sim = cosine_similarity([embedding1], [embedding2])[0][0]
      
# Compute Euclidean Distance
euclidean_dist = euclidean(embedding1, embedding2)

# Compute Manhattan Distance
manhattan_dist = cityblock(embedding1, embedding2)

# Compute Jaccard distance on binarized vectors (True where a component is positive)
jaccard_dist = jaccard(embedding1 > 0, embedding2 > 0)

# Compute Pearson Correlation
pearson_corr, _ = pearsonr(embedding1, embedding2)

# Display Results
print(f"Cosine Similarity: {cosine_sim:.4f}")
print(f"Euclidean Distance: {euclidean_dist:.4f}")
print(f"Manhattan Distance: {manhattan_dist:.4f}")
print(f"Jaccard Similarity: {1 - jaccard_dist:.4f}")  # Similarity = 1 - distance
print(f"Pearson Correlation: {pearson_corr:.4f}")


Cosine Similarity: 0.9476
Euclidean Distance: 1.3273
Manhattan Distance: 7.7810
Jaccard Similarity: 0.7576
Pearson Correlation: 0.9466
