# spaCy Overview & Key Concepts

## 1. Main Components of the spaCy `Doc` Object

A `Doc` is the container for the entire processed text. Its main components include:

- **Tokens:** Individual words, punctuation, or symbols. Each token has attributes like `.text`, `.lemma_`, `.pos_`, and more.
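- **Sentences:** Iterating over `doc.sents` yields one `Span` per sentence. This requires a component that sets sentence boundaries, such as the dependency parser or the rule-based `sentencizer`.
- **Linguistic Annotations:** Part-of-speech tags, dependency parse information, named entities (`doc.ents`), etc.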
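
The components listed above can be inspected directly. The following is a minimal sketch, assuming a blank English pipeline with the rule-based `sentencizer` added so that it runs without downloading a trained model; the sample sentence is made up for illustration. A blank pipeline has no tagger, so attributes such as `.pos_` stay empty here.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer,
# so the sketch needs no trained model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("spaCy parses text. Tokens carry attributes.")

# Tokens: each item in the Doc is a Token with lexical attributes
token_info = [(token.text, token.is_alpha, token.is_punct) for token in doc]

# Sentences: doc.sents yields one Span per sentence
sentence_texts = [sent.text for sent in doc.sents]

print(len(doc), token_info)
print(sentence_texts)
```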


---

## 2. How spaCy Differs from NLTK

- **Performance & Production Use:**  
  spaCy is built for speed and production use and ships efficient, pre-trained pipelines. NLTK, in contrast, is primarily designed for teaching, research, and prototyping.

- **Ease of Use & Consistency:**  
  spaCy exposes POS tagging, NER, and dependency parsing through one consistent pipeline API (calling `nlp(text)` returns a fully annotated `Doc`), while NLTK offers a wide range of algorithms and corpora with a less uniform API design.

- **Modern Machine Learning:**  
  spaCy uses neural network architectures for its trained pipelines, whereas NLTK traditionally relies on rule-based and classical statistical methods.

---

## 3. Customizing Rules in spaCy

spaCy lets you add custom rules with the **EntityRuler** pipeline component and the **Matcher** class.

**EntityRuler Example:**
```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Add the EntityRuler before the built-in NER so its matches are set first
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define custom patterns
patterns = [{"label": "ORG", "pattern": "Apple Inc."}]
ruler.add_patterns(patterns)

doc = nlp("Apple Inc. is a tech company.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```
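This snippet creates custom entity rules and integrates them into the pipeline. Because the ruler runs before `ner`, `Apple Inc.` is labeled `ORG` first, and the statistical NER keeps entity spans that are already set.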
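
**Matcher Example:**

The `Matcher` mentioned above works on token attributes rather than exact strings. The sketch below is one possible illustration, assuming a blank English pipeline (token patterns on `LOWER` need no trained model); the rule name `ML_TERM` and the sample sentence are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough, since the pattern only uses
# the lowercase form of each token.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match the two-token phrase "machine learning", ignoring case
pattern = [{"LOWER": "machine"}, {"LOWER": "learning"}]
matcher.add("ML_TERM", [pattern])  # "ML_TERM" is a hypothetical rule name

doc = nlp("Machine learning powers many NLP tools.")

# Each match is (match_id, start, end); the id maps back to the rule name
matches = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(matches)
```

Unlike the `EntityRuler`, the `Matcher` does not write to `doc.ents`; it only returns the matched spans, so what to do with them is up to the caller.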

---

## 4. How spaCy Performs POS Tagging & Named Entity Recognition (NER)

spaCy’s trained pipelines use statistical models (often based on neural networks) that have been trained on large annotated datasets. When you call the `nlp` object on text, it runs each pipeline component in order, which includes:

- **POS Tagging:** Assigning part-of-speech labels to each token.
- **NER:** Identifying and categorizing named entities in the text.

**Example:**
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# POS Tagging
for token in doc:
    print(token.text, token.pos_, token.tag_)

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```
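
The tag and label codes printed by the example can be cryptic. As a small aid, `spacy.explain` returns a short description for many of them; the labels chosen below are only an illustrative selection, and the call needs no trained pipeline.

```python
import spacy

# Look up human-readable descriptions for a few tag and entity labels
labels = ["NOUN", "VBZ", "GPE", "MONEY"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```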

---

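## 5. How spaCy Measures Text Similarity

spaCy computes similarity by comparing the vector representations (word embeddings) of texts. By default the score is the cosine similarity of the two vectors, and the vector of a `Doc` or `Span` is the average of its token vectors. You can compare: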

- **Doc-to-Doc:** Using `doc.similarity(other_doc)`
- **Span-to-Span:** Using `span.similarity(other_span)`
- **Token-to-Token:** Using `token.similarity(other_token)`

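*Note:* To get meaningful similarity scores, use a pipeline that ships with word vectors (e.g., `en_core_web_md` or `en_core_web_lg`). The small `en_core_web_sm` pipeline has no static word vectors, so its similarity scores are much less reliable.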

**Example:**
```python
import spacy

# Load a model with word vectors
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like apples")
doc2 = nlp("I enjoy oranges")

print("Similarity score:", doc1.similarity(doc2))
```
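
To make the computation less opaque, here is a rough sketch of the default approach (average the token vectors, then take the cosine similarity) in plain Python. It does not call spaCy; the 3-dimensional `toy_vectors` and the helper names are hypothetical, chosen only to keep the arithmetic readable, and real word vectors have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional token vectors, for illustration only
toy_vectors = {
    "like": [0.9, 0.1, 0.0],
    "enjoy": [0.8, 0.2, 0.1],
    "apples": [0.1, 0.9, 0.2],
    "oranges": [0.2, 0.8, 0.3],
}

def average_vector(words):
    """Average the token vectors dimension by dimension."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

score = cosine_similarity(
    average_vector(["like", "apples"]),
    average_vector(["enjoy", "oranges"]),
)
print(round(score, 3))
```

Averaging discards word order, which is one reason similarity scores from averaged vectors should be read as a coarse signal.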

---

## 6. Saving and Loading Trained spaCy Models

After training or customizing a pipeline, you can save it to a directory with `nlp.to_disk` and load it back later with `spacy.load`.

**Save the Model:**
```python
nlp.to_disk("my_spacy_model")
```

**Load the Model:**
```python
import spacy

nlp = spacy.load("my_spacy_model")
```

This makes it easy to persist and reuse your models without retraining.
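
The two steps can be combined into a round trip. The following is a minimal sketch, assuming a blank English pipeline customized with an `entity_ruler` and a temporary directory as the save location; both choices are only there to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import spacy

# Assumption: a blank pipeline with one custom rule stands in for a trained model
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple Inc."}])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "my_spacy_model"
    nlp.to_disk(model_dir)          # writes the config, vocab, and components
    reloaded = spacy.load(model_dir)  # rebuilds the pipeline from that directory

# The reloaded pipeline keeps the custom patterns
doc = reloaded("Apple Inc. is a tech company.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(reloaded.pipe_names, ents)
```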

---

## 7. Differentiating Between `Doc`, `Span`, and `Token` Objects

- **Doc:**  
  The `Doc` object represents the entire processed document. It is an ordered sequence of tokens and holds all linguistic annotations for the document.

- **Span:**  
  A `Span` is a slice of the `Doc`, representing a contiguous run of tokens (e.g., a phrase or sentence). A `Span` is a view of its parent `Doc`, so it shares the parent's annotations rather than copying them.

- **Token:**  
  Each `Token` represents an individual element (word, punctuation, etc.) from the text. Tokens are the smallest units in spaCy's processing pipeline and carry individual attributes like text, lemma, POS tag, etc.

**Example:**
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaCy makes NLP tasks fun and efficient.")

# Accessing Tokens
print("Tokens:")
for token in doc:
    print(token.text)

# Creating a Span (first 3 tokens)
span = doc[0:3]
print("\nSpan:", span.text)

# The entire Doc
print("\nDoc:", doc.text)
```
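
The relationship between the three objects can also be checked through their indices. This is a small sketch, assuming a blank English pipeline since only tokenization is needed; it shows that a `Span` and a `Token` both point back to the same parent `Doc`.

```python
import spacy

# Assumption: tokenization alone is enough here, so a blank pipeline is used
nlp = spacy.blank("en")
doc = nlp("SpaCy makes NLP tasks fun and efficient.")

span = doc[1:4]   # tokens 1, 2, 3
token = span[0]   # first token of the span

# Span boundaries and token positions are indices into the parent Doc
print(span.text, span.start, span.end)
print(token.text, token.i)

# Both objects are views that reference the same Doc
print(span.doc is doc, token.doc is doc)
```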

---



# Experiment 1
Create a basic NLP program to find words, phrases, names, and concepts. Use `spacy.blank` to create the English `nlp` object, process the text to instantiate a `Doc` object in the variable `doc`, then select the first token of the `Doc` and print its text.

```python
import spacy

# Create a blank English NLP object
nlp = spacy.blank("en")

# Process the text to create a Doc object
doc = nlp("Apple is a technology company.")

# Select the first token of the Doc
first_token = doc[0]

# Print the text of the first token
print(first_token.text)
```

**Output:**
```
Apple
```

A blank pipeline only tokenizes, so the remaining steps load the pretrained `en_core_web_sm` pipeline, which also provides sentence boundaries, noun chunks, and named entities.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying a U.K. startup for $1 billion. This boosts the U.K. economy"
doc = nlp(text)

# Extract words (tokens)
print("Words (Tokens):")
for token in doc:
    print(token.text)

# Split the text into sentences
sentences = [sent.text for sent in doc.sents]
print("Sentences:", sentences)

# Remove stop words by keeping every token that is not a stop word
print("\nStop word removal:")
filtered_tokens = [token.text for token in doc if not token.is_stop]
filtered_text = " ".join(filtered_tokens)
print(f"Original text: {text}")
print(f"Filtered text: {filtered_text}")

# Noun chunks depend on the dependency parser
print("\nNoun Phrases:")
if "parser" in nlp.pipe_names:
    for chunk in doc.noun_chunks:
        print(chunk.text)
else:
    print("No parser in this pipeline. Noun chunking requires 'parser'.")

# Named entities depend on the NER component
print("\nNamed Entities:")
if "ner" in nlp.pipe_names:
    for ent in doc.ents:
        print(f"{ent.text} - {ent.label_}")
else:
    print("No named entity recognizer (NER) in this pipeline.")
```

**Output:**
```
Words (Tokens):
Apple
is
looking
at
buying
a
U.K.
startup
for
$
1
billion
.
This
boosts
the
U.K.
economy
Sentences: ['Apple is looking at buying a U.K. startup for $1 billion.', 'This boosts the U.K. economy']

Stop word removal:
Original text: Apple is looking at buying a U.K. startup for $1 billion. This boosts the U.K. economy
Filtered text: Apple looking buying U.K. startup $ 1 billion . boosts U.K. economy

Noun Phrases:
Apple
a U.K.
This
the U.K. economy

Named Entities:
Apple - ORG
U.K. - GPE
$1 billion - MONEY
U.K. - GPE
```
