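# semantic.py
!pip install spacy
!python -m spacy download en_core_web_md
# en_core_web_sm is loaded below as well; install it in case the environment does not already provide it
!python -m spacy download en_core_web_sm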

"""
NLP Similarity Comparison
This script compares word and sentence similarities using two different SpaCy models:
- en_core_web_md (medium model with word vectors)
- en_core_web_sm (small model without word vectors)
"""

import spacy

# Load both SpaCy models
nlp_md = spacy.load("en_core_web_md")
nlp_sm = spacy.load("en_core_web_sm")

# Define a list of words to compare
words = ["cat", "monkey", "banana", "apple"]

# -----------------------------
# WORD SIMILARITY COMPARISON
# -----------------------------
print("\n=== WORD SIMILARITIES (en_core_web_md) ===")
for word1 in words:
    for word2 in words:
        sim = nlp_md(word1).similarity(nlp_md(word2))
        print(f"{word1:10s} {word2:10s} {sim:.4f}")

print("\n=== WORD SIMILARITIES (en_core_web_sm) ===")
for word1 in words:
    for word2 in words:
        sim = nlp_sm(word1).similarity(nlp_sm(word2))
        print(f"{word1:10s} {word2:10s} {sim:.4f}")

# -----------------------------
# SENTENCE SIMILARITY COMPARISON
# -----------------------------
sentences = [
    "Where did my dog go?",
    "Hello, there is my car.",
    "I've lost my car in my car.",
    "I'd like my boat back.",
    "I will name my dog Diana."
]

print("\n=== SENTENCE SIMILARITIES (en_core_web_md) ===")
model_sentence_md = nlp_md("Where did my dog go?")
for sentence in sentences:
    similarity = model_sentence_md.similarity(nlp_md(sentence))
    print(f"{sentence:40s} - {similarity:.3f}")

print("\n=== SENTENCE SIMILARITIES (en_core_web_sm) ===")
model_sentence_sm = nlp_sm("Where did my dog go?")
for sentence in sentences:
    similarity = model_sentence_sm.similarity(nlp_sm(sentence))
    print(f"{sentence:40s} - {similarity:.3f}")




Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     33.5/33.5 MB 51.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.

=== WORD SIMILARITIES (en_core_web_md) ===
cat        cat        1.0000
cat        monkey     0.3945
cat        banana     0.2334
cat        apple      0.2334
monkey     cat        0.3945
monkey     monkey     1.0000
monkey     banana     0.3741
monkey     apple      0.3741
banana     cat        0.2334
banana     monkey     0.3741
[...]

=== WORD SIMILARITIES (en_core_web_sm) ===
[...]
cat        monkey     0.5401
cat        banana     0.5966
cat        apple      0.6260
monkey     cat        0.5401
monkey     monkey     1.0000
monkey     banana     0.6102
monkey     apple      0.6974
banana     cat        0.5966
banana     monkey     0.6102
banana     banana     1.0000
banana     apple      0.6354
apple      cat        0.6260
apple      monkey     0.6974
apple      banana     0.6354
apple      apple      1.0000

=== SENTENCE SIMILARITIES (en_core_web_md) ===
Where did my dog go?                     - 1.000
Hello, there is my car.                  - 0.931
I've lost my car in my car.              - 0.887
I'd like my boat back.                   - 0.907
I will name my dog Diana.                - 0.899

=== SENTENCE SIMILARITIES (en_core_web_sm) ===
Where did my dog go?                     - 1.000
Hello, there is my car.                  - 0.337
I've lost my car in my car.              - 0.452
I'd like my boat back.                   - 0.624
I will name my dog Diana.                - [...]


NOTES:
- The 'en_core_web_md' model includes pre-trained word vectors, so its similarity scores are more accurate and intuitive.
  For example, 'cat' and 'monkey' (both animals) score higher than 'cat' and 'banana' (animal vs. fruit).

- The 'en_core_web_sm' model does not include word vectors. As a result, its similarity scores are less consistent
  and may not reflect real-world relationships between words or sentences.

Example observation:
With 'en_core_web_md', the relationships make sense (animals, fruits, etc.).
With 'en_core_web_sm', the scores do not track meaning in any obvious way, since the model relies only on context and tags, not word embeddings.

Semantic Similarity Observations

1. Similarities between “cat”, “monkey”, and “banana”:
When using the en_core_web_md model, I noticed that:

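“cat” and “monkey” have a moderate similarity (≈ 0.39), most likely because both are animals.

“monkey” and “banana” also show a moderate similarity (≈ 0.37), likely reflecting the real-world association that monkeys eat bananas.

“cat” and “banana” have a low similarity (≈ 0.23), consistent with their being unrelated concepts.

“apple” and “banana” have a very high similarity (≈ 1.0). Both are fruits, but a score this close to 1 suggests more than relatedness: the two words also return identical scores against “cat” (0.2334) and “monkey” (0.3741), which is what I would expect to see if the model assigned them the same vector.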
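
For context, spaCy documents Doc.similarity as defaulting to the cosine similarity of the two objects' vectors, and a Doc's vector as defaulting to the average of its token vectors. The sketch below reproduces that calculation in plain Python. The 3-dimensional vectors are invented for illustration and are not spaCy's real ones; the names `toy_vectors`, `mean_vector`, and `cosine_similarity` are hypothetical helpers, not part of spaCy.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v.

    Returns 0.0 when either vector has zero length (a choice made for this sketch).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "word vectors", invented for illustration only.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "monkey": [0.7, 0.4, 0.1],
    "banana": [0.1, 0.9, 0.2],
    "apple": [0.1, 0.9, 0.2],  # deliberately identical to "banana"
}

# Word-level comparison: identical vectors give a cosine of 1.
print(round(cosine_similarity(toy_vectors["banana"], toy_vectors["apple"]), 4))

# Text-level comparison: average the word vectors first, then compare.
text_a = mean_vector([toy_vectors["cat"], toy_vectors["banana"]])
text_b = mean_vector([toy_vectors["monkey"], toy_vectors["apple"]])
print(round(cosine_similarity(text_a, text_b), 4))
```

Because “banana” and “apple” share one vector in this toy table, each of them also scores exactly the same against every other word, which mirrors the pattern in the en_core_web_md output above.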

Example of my own:
If I compare “car”, “bus”, and “apple”, I would expect “car” and “bus” to be more similar (both vehicles), while “apple” would have low similarity to either, since it is a fruit.
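
The expectation above could be checked with a small helper like the one below. This is a sketch, not part of the original script: `pairwise_similarities` and `main` are hypothetical names, and `main` assumes en_core_web_md is installed. The helper parses each word once and scores each unordered pair once, since similarity is symmetric.

```python
from itertools import combinations


def pairwise_similarities(nlp, words):
    """Return {(word_a, word_b): score} for every unordered pair of words.

    `nlp` is any callable that turns a string into an object with a
    .similarity(other) method, such as a loaded spaCy pipeline.
    """
    docs = {word: nlp(word) for word in words}  # parse each word only once
    return {
        (a, b): docs[a].similarity(docs[b])
        for a, b in combinations(words, 2)
    }


def main():
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    nlp_md = spacy.load("en_core_web_md")
    scores = pairwise_similarities(nlp_md, ["car", "bus", "apple"])
    for (a, b), score in scores.items():
        print(f"{a:10s} {b:10s} {score:.4f}")
```

Calling `main()` in the notebook prints three rows (car/bus, car/apple, bus/apple); the actual scores depend on the installed model version.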

2. Comparison with en_core_web_sm:
When I ran the same example using the small model (en_core_web_sm), I noticed that:

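The similarity scores were less intuitive and sometimes inconsistent. All the word pairs fell into a narrow band (roughly 0.54 to 0.70), so related and unrelated words were hard to tell apart. For example, “cat” and “banana” scored higher than expected (≈ 0.60, versus ≈ 0.23 with the medium model), while “apple” and “banana” scored lower than in the medium model (≈ 0.64).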

Conclusion:

The medium model (en_core_web_md) provides more meaningful and realistic similarity scores for both words and sentences, while the small model (en_core_web_sm) may produce unreliable results for semantic similarity tasks.