<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Linguistic Features: NLP with spaCy</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(POS Tagging, Dependency Parsing, Word Vectors, and Semantic Similarity)</span></div>

## Table of Contents
1. [POS Tagging](#section-1)
2. [Word-Sense Disambiguation](#section-2)
3. [Dependency Parsing](#section-3)
4. [Introduction to Word Vectors](#section-4)
5. [Word Vectors in spaCy](#section-5)
6. [Visualizing Word Vectors](#section-6)
7. [Measuring Semantic Similarity](#section-7)
8. [Conclusion](#section-8)

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. POS Tagging</span><br>

### Understanding Part-of-Speech (POS)
POS tagging is the process of assigning grammatical categories (such as nouns, verbs, adjectives) to words in a text. Crucially, **POS tags depend on the context**, meaning the surrounding words and their tags influence how a specific word is categorized.

In spaCy, we can access these tags using the `.pos_` attribute of a token.

#### Code Example: Contextual Tagging
In the example below, the word "fish" appears twice: once as a verb ("will fish") and once as a noun ("a fish").



In [28]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# A sentence where 'fish' is used as a verb and a noun
text = "My cat will fish for a fish tomorrow in a fishy way."
doc = nlp(text)

# Iterate over tokens and print text, POS tag, and explanation
print([(token.text, token.pos_, spacy.explain(token.pos_)) for token in doc])


[('My', 'PRON', 'pronoun'), ('cat', 'NOUN', 'noun'), ('will', 'AUX', 'auxiliary'), ('fish', 'VERB', 'verb'), ('for', 'ADP', 'adposition'), ('a', 'DET', 'determiner'), ('fish', 'NOUN', 'noun'), ('tomorrow', 'NOUN', 'noun'), ('in', 'ADP', 'adposition'), ('a', 'DET', 'determiner'), ('fishy', 'ADJ', 'adjective'), ('way', 'NOUN', 'noun'), ('.', 'PUNCT', 'punctuation')]



**Expected Output Analysis:**
- First 'fish': Tagged as **VERB**.
- Second 'fish': Tagged as **NOUN**.
- 'fishy': Tagged as **ADJ** (Adjective).

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Use <code>spacy.explain(tag)</code> to get a human-readable description of any tag (e.g., "ADP" becomes "adposition"). </div>

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Word-Sense Disambiguation</span><br>

### Importance of POS
POS tagging improves accuracy for many NLP tasks, such as translation systems. For example, translating "fish" requires knowing if it is an action (verb) or an object (noun).

*   **Verb** -> *pescaré* (Spanish for "I will fish")
*   **Noun** -> *pescado* (Spanish for "fish" as food)
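
The translation choice in the bullets above can be sketched as a lookup keyed on both the word and its POS tag. This is only a toy illustration under stated assumptions: `TRANSLATIONS` and `translate` are hypothetical names, the (text, POS) pairs are written by hand rather than produced by spaCy, and a real translation system is far more involved.

```python
# Toy lookup table: (English word, POS tag) -> Spanish form from the bullets above.
# TRANSLATIONS and translate() are illustrative names, not part of spaCy.
TRANSLATIONS = {
    ("fish", "VERB"): "pescaré",
    ("fish", "NOUN"): "pescado",
}

def translate(word, pos):
    """Return the Spanish form for a (word, POS) pair, or None if unknown."""
    return TRANSLATIONS.get((word.lower(), pos))

# Hand-written (text, POS) pairs, mirroring what token.text / token.pos_ would give.
tagged = [("I", "PRON"), ("will", "AUX"), ("fish", "VERB"), ("for", "ADP"), ("fish", "NOUN")]

for text, pos in tagged:
    if text.lower() == "fish":
        print(text, pos, "->", translate(text, pos))
```

The same surface form maps to two different translations, and only the POS tag tells them apart.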

### Word-Sense Disambiguation (WSD)
WSD is the problem of deciding in which **sense** a word is used in a sentence.

| Word | POS tag | Description |
| :--- | :--- | :--- |
| Play | VERB | engage in activity for enjoyment and recreation |
| Play | NOUN | a dramatic work for the stage or to be broadcast |

#### Code Example: Disambiguating "Fish"
We can filter tokens based on their text and observe their assigned POS tags to understand their usage.



In [29]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Two sentences using 'fish' differently
verb_text = "I will fish tomorrow."
noun_text = "I ate fish."

# Extract 'fish' from the verb sentence
print("Verb Context:")
print([(token.text, token.pos_) for token in nlp(verb_text) if "fish" in token.text])

print("\nNoun Context:")
# Extract 'fish' from the noun sentence
print([(token.text, token.pos_) for token in nlp(noun_text) if "fish" in token.text])


Verb Context:
[('fish', 'VERB')]

Noun Context:
[('fish', 'NOUN')]



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Dependency Parsing</span><br>

### Exploring Sentence Syntax
Dependency parsing explores the syntax of a sentence by identifying links (dependencies) between tokens. The result is often represented as a tree structure.

A **Dependency Label** describes the type of syntactic relation between two tokens.

#### Common Dependency Labels

| Dependency label | Description |
| :--- | :--- |
| **nsubj** | Nominal subject |
| **ROOT** | Root of the sentence (typically the main verb) |
| **det** | Determiner |
| **dobj** | Direct object |
| **aux** | Auxiliary |

### Using `displaCy` for Visualization
spaCy provides a built-in visualizer called `displaCy` to draw dependency trees.



In [30]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We understand the differences.")

# In a Jupyter notebook, render draws the tree inline.
# In a standalone script, use displacy.serve(doc, style="dep") instead.
displacy.render(doc, style="dep", jupyter=True)



### Accessing Dependency Labels
You can access the dependency label using the `.dep_` attribute of a token.



In [31]:
doc = nlp("We understand the differences.")

# Print token text, dependency label, and explanation
print([(token.text, token.dep_, spacy.explain(token.dep_)) for token in doc])


[('We', 'nsubj', 'nominal subject'), ('understand', 'ROOT', 'root'), ('the', 'det', 'determiner'), ('differences', 'dobj', 'direct object'), ('.', 'punct', 'punctuation')]



**Output Breakdown:**
1.  **We**: `nsubj` (nominal subject)
2.  **understand**: `ROOT` (root)
3.  **the**: `det` (determiner)
4.  **differences**: `dobj` (direct object)
5.  **.**: `punct` (punctuation)
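
Once each token has a label and a head, the labels can be used to pull out "who did what". The sketch below is a minimal illustration under stated assumptions: the `(text, label, head)` triples are written by hand for the sentence above (in spaCy they would come from `token.text`, `token.dep_`, and `token.head.text`), and `subject_verb_object` is a hypothetical helper name.

```python
# (text, dependency label, head text) triples for "We understand the differences."
# Written by hand to match the output above; the head column is an assumption
# about what token.head.text would return for each token.
parsed = [
    ("We", "nsubj", "understand"),
    ("understand", "ROOT", "understand"),
    ("the", "det", "differences"),
    ("differences", "dobj", "understand"),
    (".", "punct", "understand"),
]

def subject_verb_object(triples):
    """Return (subject, verb, object) for the ROOT verb, with None for missing parts."""
    root = next(text for text, dep, _ in triples if dep == "ROOT")
    subject = next((t for t, dep, head in triples if dep == "nsubj" and head == root), None)
    obj = next((t for t, dep, head in triples if dep == "dobj" and head == root), None)
    return subject, root, obj

print(subject_verb_object(parsed))
```

For this sentence the helper returns the subject "We", the verb "understand", and the object "differences".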

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Introduction to Word Vectors</span><br>

### What are Word Vectors (Embeddings)?
Word vectors are numerical representations of words. Unlike older methods, they capture semantic meaning.

#### The "Bag of Words" Problem
Older methods (like Bag of Words) count frequencies but do not understand meaning or context.

| Sentence | I | got | covid | coronavirus |
| :--- | :--- | :--- | :--- | :--- |
| I got covid | 1 | 1 | 1 | 0 |
| I got coronavirus | 1 | 1 | 0 | 1 |

In the table above, each cell counts how often a word occurs in the sentence. "covid" and "coronavirus" get separate columns (the third and fourth word columns) and are treated as completely unrelated entities, even though they mean the same thing in this context.
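
A minimal sketch of a count-based bag of words for the two sentences, using only the standard library. The helper name `bag_of_words` is illustrative; production code would normally use a library vectorizer.

```python
from collections import Counter

sentences = ["I got covid", "I got coronavirus"]

# Vocabulary: every distinct word, in order of first appearance.
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

def bag_of_words(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

rows = [bag_of_words(s) for s in sentences]
print(vocab)
for sentence, row in zip(sentences, rows):
    print(sentence, row)

# The two count vectors overlap only on "I" and "got".
shared = sum(a * b for a, b in zip(rows[0], rows[1]))
print("dot product:", shared)
```

Nothing in the counts indicates that the last two columns refer to the same concept; the model sees two sentences that merely share "I" and "got".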

#### The Vector Solution
Word vectors represent each word with a fixed, pre-defined number of dimensions. The values are learned from word frequencies and from the other words that occur in **similar contexts**. This is why the vectors for "cat" and "kitten" end up close together: the two words tend to be surrounded by the same kinds of words.
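
The "similar contexts" idea can be illustrated with plain co-occurrence counts. This is a rough sketch on a tiny hand-made corpus (the sentences and the `shared_context` helper are hypothetical); trained embeddings are learned differently, but the intuition carries over.

```python
from collections import Counter, defaultdict

# Tiny hand-made corpus (hypothetical sentences for illustration only).
corpus = [
    "the cat sleeps on the sofa",
    "the kitten sleeps on the sofa",
    "the car drives on the road",
]

# For each word, count the other words that appear in the same sentence.
contexts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j, other in enumerate(words):
            if i != j:
                contexts[word][other] += 1

def shared_context(a, b):
    """Number of context-word occurrences that two words have in common."""
    return sum((contexts[a] & contexts[b]).values())

print(shared_context("cat", "kitten"))
print(shared_context("cat", "car"))
```

"cat" and "kitten" share more context words than "cat" and "car", even though the three words never co-occur with each other.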

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Word Vectors in spaCy</span><br>

### Loading Vectors
To use word vectors in spaCy, you typically need a medium (`md`) or large (`lg`) model. The small model (`sm`) does not contain static word vectors.

*   **Model**: `en_core_web_md`
*   **Specs**: 20,000 unique 300-dimensional vectors, shared by 684,830 keys (the `vectors` and `keys` fields in the metadata printed below).



In [32]:
import spacy

# Ensure you have downloaded the model: python -m spacy download en_core_web_md
try:
    nlp = spacy.load("en_core_web_md")
    print("Model loaded successfully.")
    print(nlp.meta["vectors"])
except OSError:
    print("Please download the model: python -m spacy download en_core_web_md")


Model loaded successfully.
{'width': 300, 'vectors': 20000, 'keys': 684830, 'name': 'en_vectors', 'mode': 'default'}



### Accessing Vector Data
1.  **`nlp.vocab.strings`**: Maps a word to its integer hash ID (and an ID back to its string).
2.  **`nlp.vocab.vectors`**: Holds the actual vector arrays, indexed by those IDs.



In [33]:
# Get the ID for the word "like"
like_id = nlp.vocab.strings["like"]
print(f"ID for 'like': {like_id}")

# Get the vector using the ID
vector_data = nlp.vocab.vectors[like_id]
print(f"Vector shape: {vector_data.shape}")
print(f"First 5 dimensions: {vector_data[:5]}")


ID for 'like': 18194338103975822726
Vector shape: (300,)
First 5 dimensions: [-0.61052   0.11656  -0.50648  -0.32216  -0.099742]



### Finding Similar Words
spaCy can find semantically similar terms to a given term by comparing their vectors.



In [34]:
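import numpy as np

word = "covid"

# Not every string has a vector. With this model, the unguarded lookup
# nlp.vocab.vectors[nlp.vocab.strings[word]] raised:
#   KeyError: '[E058] Could not retrieve vector for key 2127825066894192516.'
# so check that the word has a vector before indexing the table.
if nlp.vocab.has_vector(word):
    query = np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]])
    # most_similar returns (keys, best_rows, scores); keep the 5 nearest keys
    keys, _, _ = nlp.vocab.vectors.most_similar(query, n=5)
    # Decode the IDs back to strings
    words = [nlp.vocab.strings[k] for k in keys[0]]
    print(f"Words similar to '{word}': {words}")
else:
    print(f"'{word}' has no vector in this model.")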


***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Visualizing Word Vectors</span><br>

### Dimensionality Reduction
Since word vectors are 300-dimensional, we cannot visualize them directly. We use **Principal Component Analysis (PCA)** to project these vectors into a 2-dimensional space.

This visualization helps us see how words are grouped. For example, fruits should cluster together, and animals should cluster together.

#### Code Example: PCA Visualization



In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
import spacy

# Load model
nlp = spacy.load("en_core_web_md")

# List of words to visualize
words = ["wonderful", "horrible", "apple", "banana", "orange", "watermelon", "dog", "cat"]

# Extract vectors and stack them vertically
word_vectors = np.vstack([nlp.vocab.vectors[nlp.vocab.strings[w]] for w in words])

# Extract two principal components using PCA
pca = PCA(n_components=2)
word_vectors_transformed = pca.fit_transform(word_vectors)

# Visualize the scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(word_vectors_transformed[:, 0], word_vectors_transformed[:, 1])

# Add labels to the points
for word, coord in zip(words, word_vectors_transformed):
    x, y = coord
    plt.text(x, y, word, size=12)

plt.title("Word Vector Visualization (PCA)")
plt.show()



### Analogies
Word embeddings can capture semantic relationships, allowing for vector arithmetic (analogies).
*   **Formula**: `Queen - Woman + Man ≈ King`
*   This implies that the vector offset from Woman to Queen is similar to the offset from Man to King.
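
The arithmetic can be traced on a handful of hand-made 2-d vectors. This is a toy sketch, not a result from any spaCy model: the values in `toy` are invented so that one axis roughly encodes "royalty" and the other "gender", and `analogy` is a hypothetical helper. With real embeddings the relation holds only approximately.

```python
# Hand-made 2-d toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# The numbers are invented for illustration and come from no trained model.
toy = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]
    candidates = [w for w in toy if w not in (a, b, c)]
    # Squared Euclidean distance keeps the sketch dependency-free.
    return min(candidates, key=lambda w: sum((t - v) ** 2 for t, v in zip(target, toy[w])))

print(analogy("queen", "woman", "man"))
```

Subtracting "woman" removes the gender component of "queen", adding "man" puts the opposite one back, and the nearest remaining word is "king".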

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Measuring Semantic Similarity</span><br>

### The Similarity Score
Semantic similarity measures how close two texts are in meaning.
*   **Metric**: Cosine similarity between their vectors.
*   **Range**: -1 to 1 mathematically (1 = same direction, 0 = unrelated, -1 = opposite direction). In practice, scores between word vectors are usually positive.
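
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. The sketch below is a plain-Python version for illustration (the function name `cosine_similarity` is an assumption here, not spaCy's API); it shows that the value depends on direction, not magnitude, and that it can reach -1 for opposite vectors.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The three calls give values close to 1, 0, and -1 respectively (up to floating-point rounding).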

spaCy allows you to calculate similarity between **Tokens**, **Spans**, and **Documents**.

#### 1. Token Similarity
Comparing individual words.



In [None]:
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

token1 = doc1[2] # pizza
token2 = doc2[4] # pasta

print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))



#### 2. Span Similarity
Comparing phrases or slices of a document.



In [None]:
span1 = doc1[1:] # "eat pizza"
span2 = doc2[1:] # "like to eat pasta"

print(f"Similarity between '{span1}' and '{span2}' = ", round(span1.similarity(span2), 3))

# Comparing "eat pizza" vs "eat pasta"
print(f"Similarity between '{doc1[1:]}' and '{doc2[3:]}' = ", round(doc1[1:].similarity(doc2[3:]), 3))



#### 3. Doc Similarity
Comparing full sentences or documents. Doc vectors default to an average of the word vectors contained within them.
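
One consequence of averaging is worth seeing directly: the average ignores word order. The sketch below uses hand-made 3-d toy vectors (hypothetical values, not taken from any spaCy model) and a hypothetical `average_vector` helper that mimics an averaged doc vector.

```python
# Hand-made 3-d toy word vectors (invented values, not from a trained model).
toy_vectors = {
    "dog":   [1.0, 0.5, 0.0],
    "bites": [0.0, 1.0, 0.5],
    "man":   [0.5, 0.0, 0.25],
}

def average_vector(words):
    """Element-wise mean of the word vectors, mimicking an averaged doc vector."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

a = average_vector("dog bites man".split())
b = average_vector("man bites dog".split())
print(a)
print(a == b)
```

Both word orders produce the same averaged vector, so a similarity score built on averaged vectors cannot tell the two sentences apart.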



In [None]:
doc1 = nlp("I like to play basketball")
doc2 = nlp("I love to play basketball")

print("Similarity score :", round(doc1.similarity(doc2), 3))



#### 4. Sentence Similarity (Finding Relevant Content)
We can use similarity scores to find the most relevant sentence in a text given a keyword.



In [None]:
# A document with multiple sentences
sentences = nlp("What is the cheapest flight from Boston to Seattle? "
                "Which airline serves Denver, Pittsburgh and Atlanta? "
                "What kinds of planes are used by American Airlines?")

keyword = nlp("price")

# Iterate through sentences and check similarity to "price"
for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))



**Analysis**: The first sentence ("cheapest flight") should have the highest similarity score to "price" because "cheapest" and "price" are semantically related in the vector space.

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Conclusion</span><br>

In this notebook, we explored the powerful linguistic features provided by spaCy:

1.  **POS Tagging**: We learned how to identify grammatical roles (Noun, Verb, etc.) and how context changes these tags (e.g., "fish" as a verb vs. noun). This is crucial for Word-Sense Disambiguation.
2.  **Dependency Parsing**: We visualized the syntactic structure of sentences using `displaCy` and accessed dependency labels (`nsubj`, `dobj`) to understand relationships between words.
3.  **Word Vectors**: We moved beyond simple frequency counts (Bag of Words) to 300-dimensional embeddings that capture semantic meaning.
4.  **Semantic Similarity**: Using word vectors, we calculated how similar tokens, spans, and documents are to one another using Cosine Similarity. This allows for advanced applications like search relevance and recommendation systems.

**Next Steps:**
*   Experiment with the `en_core_web_lg` model for potentially higher accuracy.
*   Apply dependency parsing to extract specific information (e.g., "Who did what?").
*   Use semantic similarity to build a simple FAQ bot that matches user queries to existing questions.
