# Intro and import

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text.



In [48]:
import spacy

# Statistical models


## Download statistical models

Statistical models predict part-of-speech tags, dependency labels, named entities and more. The available models are listed in the spaCy models directory.

In [49]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Check that your installed models are up to date



In [50]:
!python -m spacy validate


## Loading statistical models



In [51]:
nlp = spacy.load("en_core_web_sm")
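
A slightly more defensive variant of the load step, as a minimal sketch. `spacy.load` raises `OSError` when the model package is not installed; the fallback to a blank English pipeline is an assumption made for illustration only (a blank pipeline tokenizes text but predicts no tags, parses or entities).

```python
import spacy

try:
    # Full pipeline: tagger, parser, NER, etc.
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption for this sketch: fall back to a tokenizer-only pipeline
    # when the model package has not been downloaded yet.
    nlp = spacy.blank("en")

# List the names of the components in the loaded pipeline
print(nlp.pipe_names)
```

An empty list here means the fallback was used and the model still needs to be downloaded.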

# Documents, tokens and spans


## Processing text

Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.

In [52]:
duc = nlp("this is a test")
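
The returned `Doc` behaves like a sequence of tokens. The sketch below uses a blank English pipeline (an assumption, chosen because only tokenization is being shown and no model download is needed) to illustrate length and indexing.

```python
import spacy

# Assumption: tokenizer-only pipeline, sufficient to illustrate the Doc container
nlp = spacy.blank("en")
doc = nlp("this is a test")

print(type(doc).__name__)          # class name of the container
print(len(doc))                    # number of tokens
print(doc[0].text, doc[-1].text)   # first and last token
```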

## Accessing token attributes



In [53]:
for token in duc:
  print(token.text)

this
is
a
test


# Spans


## Accessing spans
Span end indices are exclusive: duc[2:4] is a span starting at token 2, up to, but not including, token 4.

In [54]:
span = duc[2 : 4]
span.text

'a test'
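
To make the exclusive end index concrete, here is a small sketch that inspects the span's boundaries. It assumes a blank English pipeline, since slicing only depends on tokenization.

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("this is a test")

span = doc[2:4]
print(span.start, span.end)        # token offsets of the span in the Doc
print(len(span))                   # number of tokens in the span
print([token.i for token in span]) # indices actually covered
print(span.text)
```

The token at index `span.end` is not part of the span, which mirrors ordinary Python slicing.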

## Creating a span manually

In [55]:
from spacy.tokens import Span
duc = nlp("i live in tehran city")

# Create a Span for "tehran city" with a custom label (the built-in label for geopolitical entities is "GPE")
span = Span(duc, 3, 5, label="name of city")

print(f"text of span -> {span.text}")
print(f"label of span -> {span.label_}")

text of span -> tehran city
label of span -> name of city
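
A manually created span is not registered as an entity until it is assigned to `doc.ents`. The following is a minimal sketch of that step, assuming a blank English pipeline and the built-in `GPE` label instead of the custom one used above.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: no NER component, so doc.ents starts empty
doc = nlp("i live in tehran city")

# Tokens 3 and 4 are "tehran" and "city"
span = Span(doc, 3, 5, label="GPE")

# Register the span as a named entity on the Doc
doc.ents = [span]

print([(ent.text, ent.label_) for ent in doc.ents])
```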


# Linguistic features


Token attributes such as token.pos return integer label IDs. For the string labels, use the attribute names ending in an underscore, for example token.pos_.
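
The difference between the two forms can be sketched as follows. The `Doc` here is annotated by hand (an assumption made so the example runs without a trained model); with a loaded model the tags would be predicted instead.

```python
import spacy
from spacy.symbols import NOUN
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated Doc: the POS tags are supplied manually, not predicted
doc = Doc(nlp.vocab, words=["a", "test"], pos=["DET", "NOUN"])
token = doc[1]

print(token.pos)          # integer ID
print(token.pos_)         # string label
print(token.pos == NOUN)  # the ID matches the symbol constant
```

Comparing integer IDs is handy in tight loops, while the underscore form is easier to read and print.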



## Part-of-speech tags (predicted by statistical model)


In [56]:
duc = nlp("this is a test.")

In [57]:
# Coarse-grained part-of-speech tags
for token in duc:
  print(f"{token.text} -> {token.pos_}")

this -> PRON
is -> AUX
a -> DET
test -> NOUN
. -> PUNCT


In [58]:
# Fine-grained part-of-speech tags
for token in duc:
  print(f"{token.text} -> {token.tag_}")

this -> DT
is -> VBZ
a -> DT
test -> NN
. -> .


In spaCy, both `pos_` and `tag_` are attributes of a `Token` object, which represents an individual token (a word or punctuation mark) in a document. However, they provide different levels of detail regarding the part-of-speech (POS) tagging.

1. `pos_` (Coarse-grained part-of-speech tags):
   - The `pos_` attribute provides coarse-grained part-of-speech tags, which classify words into broad categories based on their grammatical roles within a sentence. These categories typically include tags like nouns, verbs, adjectives, adverbs, pronouns, etc. For example:
     - `'DET'`: Determiner
     - `'VERB'`: Verb
     - `'NOUN'`: Noun
     - `'PUNCT'`: Punctuation

2. `tag_` (Fine-grained part-of-speech tags):
   - The `tag_` attribute provides fine-grained part-of-speech tags, which offer more detailed information about the specific grammatical properties of each word. These tags are more granular and can differentiate between different types of nouns, verbs, adjectives, etc. They often include additional information such as verb tense, noun type, singular/plural forms, etc. For example:
     - `'DT'`: Determiner
     - `'VBZ'`: Verb, 3rd person singular present
     - `'NN'`: Noun, singular or mass
     - `'.'`: Punctuation (period)

In summary, while `pos_` provides a general categorization of words into broad grammatical classes, `tag_` offers a more detailed classification that includes specific grammatical features and properties of each word.
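
One way to see the two tag sets side by side is to look up their descriptions with `spacy.explain`, which works for both coarse-grained and fine-grained labels. This sketch simply uses the labels printed in the cells above; no model is required.

```python
import spacy

coarse = ["PRON", "AUX", "DET", "NOUN", "PUNCT"]  # values seen in token.pos_
fine = ["DT", "VBZ", "NN", "."]                   # values seen in token.tag_

for label in coarse + fine:
    # spacy.explain returns a short description, or None for unknown labels
    print(f"{label:>5} -> {spacy.explain(label)}")
```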

## Syntactic dependencies (predicted by statistical model)



In [59]:
duc = nlp("this is a test.")
for token in duc:
  print(f"{token.text} -> {token.dep_}")

this -> nsubj
is -> ROOT
a -> det
test -> attr
. -> punct


In spaCy, the `dep_` attribute of a `Token` object represents the syntactic dependency label assigned to that token within the dependency parse tree of the sentence. Each token is attached to exactly one head token through a directed dependency relation, which indicates the grammatical relationship between the two; the root of the sentence is its own head.

Here's a breakdown:

1. **Dependency labels (`dep_`)**:
   - The `dep_` attribute provides information about the grammatical relationships between tokens in a sentence. Each token is linked to a head token (usually a word that governs the dependent token) through a specific dependency label.
   - Dependency labels describe the syntactic role that a token plays in the structure of the sentence, such as subject, object, modifier, etc.
   - Examples of dependency labels include:
     - `'nsubj'`: Nominal subject (subject of a verb)
     - `'ROOT'`: The root of the parse tree, usually the main verb of the sentence
     - `'det'`: Determiner (e.g., articles like "a", "the")
     - `'attr'`: Attribute (a nominal complement of a copular verb, such as "test" in "this is a test")
     - `'punct'`: Punctuation mark

**Difference between dependency labels and part-of-speech (POS) tags (`pos_` and `tag_`)**:
- Dependency labels describe the syntactic relationships between words in a sentence, while POS tags categorize words based on their grammatical properties.
- Dependency labels provide information about how words are connected in the sentence's syntactic structure, whereas POS tags classify individual words into broad grammatical categories (e.g., noun, verb, adjective).
- Dependency labels are used to construct the dependency parse tree of the sentence, which illustrates the hierarchical relationships between words, while POS tags are used to annotate individual words with their grammatical roles.

In summary, while POS tags classify individual words, dependency labels describe the syntactic relationships between words in a sentence. They serve different purposes but are both crucial for understanding the structure and meaning of natural language text.
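
The head/dependent relationship described above can be sketched by walking the tree through `token.head` and `token.children`. The parse below is entered by hand (an assumption, copied from the labels printed earlier, with head positions chosen to match them) so that the example runs without a trained parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["this", "is", "a", "test", "."]
heads = [1, 1, 3, 1, 1]  # absolute index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "attr", "punct"]

# Hand-annotated parse, standing in for the output of a statistical parser
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for token in doc:
    children = [child.text for child in token.children]
    print(f"{token.text} --{token.dep_}--> {token.head.text} | children: {children}")
```

Each token has a single head, while a head such as "is" can have several children.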

In [60]:
duc = nlp("this is a text.")

for token in duc:
  print(f"{token.text} -> {token.dep_}")

this -> nsubj
is -> ROOT
a -> det
text -> attr
. -> punct


## Named Entities (predicted by statistical model)



In [61]:
duc = nlp("i love Google")

for ent in duc.ents:
  print(f"{ent.text} -> {ent.label_}")

Google -> ORG


## Sentences (usually needs the dependency parser)



In [62]:
duc = nlp("this is a sentence. this is a another one.")

for sent in duc.sents:
  print(sent.text)

this is a sentence.
this is a another one.
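
Sentence boundaries do not strictly require the parser. As a minimal sketch, the rule-based `sentencizer` component can be added to a blank pipeline (an assumption for this example) to split on sentence-final punctuation.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based sentence splitter: no dependency parse needed
nlp.add_pipe("sentencizer")

doc = nlp("this is a sentence. this is another one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The rule-based approach is faster but less robust than parser-based segmentation on text with irregular punctuation.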


## Base noun phrases (needs the tagger and parser)

In spaCy, a "noun chunk" refers to a contiguous sequence of tokens in a sentence that forms a noun phrase. Noun chunks typically consist of a noun and the words that modify it, such as determiners (like "a" or "the") and adjectives. They represent the basic building blocks of sentence structure and are often useful for tasks like information extraction, parsing, and understanding the meaning of sentences.

The `noun_chunks` property of a `Doc` object in spaCy is a generator that yields spans representing these noun chunks in the document. When you iterate over this generator, you get access to each noun chunk as a `Span` object.

Consider the following example:

```python
doc = nlp("I have a red car")
[chunk.text for chunk in doc.noun_chunks]
# Output: ['I', 'a red car']
```

- `doc.noun_chunks`: This property yields spans representing the noun chunks in the document.
- `chunk.text`: For each noun chunk span (`chunk`), it retrieves the text of the noun chunk.
- The list comprehension iterates over all noun chunks in the document (`doc`) and collects their text representations into a list.

In the example sentence "I have a red car", there are two noun chunks:
1. "I"
2. "a red car"

These noun chunks represent the subject ("I") and the object ("a red car") of the sentence, respectively.


In [63]:
duc = nlp("i love tajrish place.")
for chunk in duc.noun_chunks:
  print(chunk.text)

i
tajrish place
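
Each noun chunk also exposes its syntactic head through `chunk.root`. The sketch below shows this on a hand-annotated `Doc`; the tags, heads and dependency labels are assumptions entered manually so the example does not depend on a trained tagger and parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated analysis of "I have a red car"
words = ["I", "have", "a", "red", "car"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
heads = [1, 1, 4, 4, 1]  # absolute head indices
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

for chunk in doc.noun_chunks:
    # root: the word that connects the chunk to the rest of the parse
    print(chunk.text, "|", chunk.root.text, "|", chunk.root.dep_, "|", chunk.root.head.text)
```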


## Label explanations



In [64]:
spacy.explain("RB")

'adverb'

# Visualizing


In [65]:
from spacy import displacy

## Visualize dependencies



In [66]:
duc = nlp("this is a test.")
displacy.render(duc, style="dep")

## Visualize named entities



In [67]:
duc = nlp("mike love tehran city.")
displacy.render(duc, style="ent")
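
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False` is passed. The sketch below assumes a blank pipeline with one manually set entity, so no trained NER model is needed.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("mike loves tehran city.")

# Manually mark "tehran city" as an entity (assumption for illustration)
doc.ents = [Span(doc, 2, 4, label="GPE")]

# jupyter=False forces the function to return the HTML instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file or embedded in a web page.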

# Word vectors and similarity


In [68]:
duc1 = nlp("i like cats.")
duc2 = nlp("i like dogs.")

duc1.similarity(duc2)

0.954132408648854
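
By default, spaCy documents `similarity` as a cosine similarity between vectors. The computation itself can be sketched in plain Python; the three-dimensional vectors below are made-up toy values, not real spaCy vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors for illustration only
cats = [0.9, 0.1, 0.3]
dogs = [0.8, 0.2, 0.3]
cars = [0.1, 0.9, 0.0]

print(cosine_similarity(cats, dogs))  # close to 1: similar direction
print(cosine_similarity(cats, cars))  # much lower: different direction
```

Note that small packages such as `en_core_web_sm` do not ship with static word vectors, so their similarity scores tend to be less informative than those from the larger `md` or `lg` packages.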