# spaCy

spaCy is an open-source library for advanced natural language processing. See the [API docs](https://spacy.io/api) for reference and the [usage guides](https://spacy.io/usage/linguistic-features) for walkthroughs.

Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production use.

#### spaCy steps
- Loading language library
- Creating pipeline
- Tokenization
- Tagging parts of speech

In [4]:
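import spacy

# Download the small English model (only needed once per environment)
!python -m spacy download en_core_web_sm

# Load the model; nlp is the pipeline object used in the cells below
nlp = spacy.load('en_core_web_sm')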

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
text = "Trump tariff VAT threat raises fears of hit to UK"  # Python 3 strings are already Unicode, so no u prefix is needed
doc = nlp(text)  # run the full pipeline and get a Doc
print(doc)

Trump tariff VAT threat raises fears of hit to UK


### spaCy Token Properties

| Property          | Description                                                                 | Example                                   |
|-------------------|-----------------------------------------------------------------------------|-------------------------------------------|
| `token.text`      | The raw text of the token.                                                 | `"Hello"`                                 |
| `token.lemma_`    | The base form of the token (lemmatized form).                              | `"running"` → `"run"`                     |
| `token.pos_`      | The coarse-grained part-of-speech tag.                                     | `"NOUN"`, `"VERB"`, `"ADJ"`               |
| `token.tag_`      | The fine-grained part-of-speech tag.                                       | `"NN"`, `"VBZ"`, `"JJ"`                   |
| `token.dep_`      | The syntactic dependency label.                                            | `"nsubj"`, `"dobj"`, `"punct"`            |
| `token.shape_`    | The shape of the token (e.g., capitalization, punctuation, digits).        | `"Xxxx"`, `"dd"`, `"___"`                 |
| `token.is_alpha`  | Whether the token consists of alphabetic characters.                       | `True` for `"Hello"`, `False` for `"123"` |
| `token.is_stop`   | Whether the token is a stop word.                                          | `True` for `"the"`, `False` for `"cat"`   |
| `token.is_punct`  | Whether the token is punctuation.                                          | `True` for `"."`, `False` for `"cat"`     |
| `token.is_digit`  | Whether the token consists of digits.                                      | `True` for `"123"`, `False` for `"cat"`   |
| `token.like_num`  | Whether the token resembles a number (e.g., "10", "ten").                  | `True` for `"10"`, `True` for `"ten"`     |
| `token.ent_type_` | The named entity type (if the token is part of an entity).                 | `"PERSON"`, `"GPE"`, `"DATE"`             |
| `token.ent_iob_`  | The IOB tag of the named entity (Inside, Outside, Beginning).             | `"B"`, `"I"`, `"O"`                       |
| `token.sentiment` | The sentiment score of the token (if available).                          | `0.5`, `-0.2`                             |
| `token.lang_`     | The language of the token (if available).                                 | `"en"`, `"fr"`                            |

---
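
Several of the attributes above are lexical, so they can be tried without a trained model. The sketch below assumes a blank English pipeline from `spacy.blank("en")` (tokenizer and language data only) and a made-up sentence. Attributes such as `pos_`, `dep_`, and `ent_type_` stay empty in this setup because they need trained components like those in `en_core_web_sm`.

```python
import spacy

# Blank English pipeline: tokenizer and language data only, no trained components
nlp = spacy.blank("en")
doc = nlp("The cat bought ten apples for 10 dollars.")

# Print a few lexical attributes for every token
for token in doc:
    print(
        f"{token.text:<8} shape={token.shape_:<5} alpha={token.is_alpha!s:<5} "
        f"stop={token.is_stop!s:<5} punct={token.is_punct!s:<5} "
        f"digit={token.is_digit!s:<5} like_num={token.like_num}"
    )
```

Note that `like_num` is true for both `"ten"` and `"10"`, while `is_digit` is true only for `"10"`.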

In [None]:
for token in doc:
    print(token.text, token.pos, token.pos_)  # pos: integer ID of the part-of-speech tag, pos_: its string label

Trump 100 VERB
tariff 92 NOUN
VAT 92 NOUN
threat 92 NOUN
raises 100 VERB
fears 92 NOUN
of 85 ADP
hit 92 NOUN
to 85 ADP
UK 96 PROPN
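
`token.pos` is the integer ID of the tag and `token.pos_` is its string label. As a sketch that runs without a trained model, the example below builds a `Doc` by hand and assigns the tags itself, then uses `spacy.explain` to get a short description of each label. The words and tags are chosen for illustration and are not model predictions.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-assigned coarse-grained tags, for illustration only
words = ["fears", "of", "UK"]
pos_tags = ["NOUN", "ADP", "PROPN"]
doc = Doc(nlp.vocab, words=words, pos=pos_tags)

for token in doc:
    # token.pos is an int, token.pos_ is the matching string label
    print(token.text, token.pos, token.pos_, spacy.explain(token.pos_))
```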


In [14]:
nlp.pipeline # list the current pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1394c34d0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1394c3b90>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1394b5e00>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x13961be10>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x139619810>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1394b4820>)]

In [15]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
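
Pipelines can also be assembled or trimmed by hand. A minimal sketch, assuming a blank English pipeline rather than `en_core_web_sm`: `add_pipe` appends a built-in component by name, and `select_pipes` temporarily disables components inside a `with` block.

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)  # a blank pipeline has no components yet

# Append the built-in rule-based sentence splitter
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# Temporarily disable it; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    names_while_disabled = list(nlp.pipe_names)
    print(names_while_disabled)

print(nlp.pipe_names)
```

Disabling components that a task does not need (for example `ner` when only tags are wanted) is a common way to speed up processing.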

#### Span

A `Span` is a slice of a `Doc`. It is a view rather than a copy, so it keeps a reference to the parent document.

In [None]:
print(doc[6:8])
print(type(doc[6:8]))  # slicing a Doc returns a Span, not a plain string

of hit
<class 'spacy.tokens.span.Span'>
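
A few `Span` attributes make the link to the parent document explicit. The sketch below assumes a blank English pipeline, which is enough here because slicing only needs the tokenizer.

```python
import spacy

nlp = spacy.blank("en")
text = "Trump tariff VAT threat raises fears of hit to UK"
doc = nlp(text)

span = doc[6:8]
print(span.text)             # text covered by the span
print(span.start, span.end)  # token offsets into the parent doc
print(span.doc is doc)       # the span keeps a reference to its parent
print(text[span.start_char:span.end_char])  # same text via character offsets
```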


In [29]:
multi_sentence_text = u"This is line one. THis is line two. This is line 3"

doc2 = nlp(multi_sentence_text)

# doc2.sents yields one Span per detected sentence
for sentence in doc2.sents:
    print(sentence)


print("--------start------------")
print("word on doc2[5] : {}".format(doc2[5]))
print("word on doc2[4] : {}".format(doc2[4]))

print(doc2[5].is_sent_start)
print(doc2[4].is_sent_start)

print("----------End-----------")
print(doc2[5].is_sent_end)
print(doc2[4].is_sent_end)


This is line one.
THis is line two.
This is line 3
--------start------------
word on doc2[5] : THis
word on doc2[4] : .
True
False
----------End-----------
False
True
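
The sentence boundaries above come from the trained `en_core_web_sm` pipeline. As an alternative sketch that needs no trained model, the rule-based `sentencizer` component splits on sentence-final punctuation. This swaps in a different technique from the one used above, and the text is a cleaned-up variant of the example.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based, splits after punctuation such as . ! ?

doc = nlp("This is line one. This is line two. This is line 3")
sentences = [sent.text for sent in doc.sents]
print(sentences)

# Each sentence is a Span, so its boundaries are token offsets
for sent in doc.sents:
    print(sent.start, sent.end, doc[sent.start].is_sent_start)
```

The rule-based splitter is fast but relies on punctuation alone, so it can differ from a trained pipeline on text with abbreviations or missing punctuation.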
