# spaCy NER

https://spacy.io/

## Install spaCy

* Install the spaCy package, together with `spacy_conll`, which is imported below:
```bash
pip install -U spacy spacy_conll
```

* Install the spaCy package on a Mac with an Apple M1 chip:
```bash
pip install -U 'spacy[apple]'
```

* Install the French model:
```bash
python -m spacy download fr_core_news_sm
```

## Import packages

In [None]:
import os
import spacy
from spacy import displacy
from spacy_conll import init_parser

## Parse text

In [None]:
# Load the French pipeline (tokenizer, tagger, parser and NER)
nlp = spacy.load("fr_core_news_sm")
# Append the spacy_conll component, which exposes CoNLL output on doc._
nlp.add_pipe("conll_formatter", last=True)

In [None]:
# Process text
text = "ABYDE ou ABYDOS, sub. Ville maritime de Phrygie vis-à-vis de Sestos. "
text += "Xercès joignit ces deux endroits éloignés l'un de l'autre de sept stades, par le pont qu'il jetta sur l'Hellespont."
doc = nlp(text)

The `Doc` object is a sequence of tokens that holds not only the original text but also every annotation the spaCy pipeline produced while processing it.

* Display each token with the following attributes:
    * **Text**: The original word text.
    * **Lemma**: The base form of the word.
    * **POS**: The simple UPOS part-of-speech tag.
    * **Tag**: The detailed part-of-speech tag.
    * **Dep**: Syntactic dependency, i.e. the relation between tokens.
    * **Shape**: The word shape – capitalization, punctuation, digits.
    * **is alpha**: Does the token consist of alphabetic characters?
    * **is stop**: Is the token part of a stop list, i.e. the most common words of the language?

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
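
The **Shape** attribute is easier to read once the mapping is spelled out. The function below is a rough pure-Python approximation, not spaCy's own implementation: it assumes that uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, that other characters are kept as-is, and that runs of the same symbol are truncated after four. The name `word_shape` is only illustrative here.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style word shape for a single token (assumed rules)."""
    shape = []
    last = ""  # previous shape symbol
    run = 0    # how many times in a row the previous symbol has repeated
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept
        run = run + 1 if symbol == last else 0
        last = symbol
        if run < 4:  # keep at most four consecutive identical symbols
            shape.append(symbol)
    return "".join(shape)


for word in ["ABYDOS", "Phrygie", "vis-à-vis", "1751"]:
    print(word, "->", word_shape(word))
```

Comparing these values with `token.shape_` on the same words is a quick way to check whether the assumed rules match the installed spaCy version.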

https://spacy.io/usage/visualizers

* Display the syntactic dependency tree:

In [None]:
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", jupyter=True)

* Print the named entities:

In [None]:
print(doc.ents)

# Print each entity with its label.
for entity in doc.ents:
    print(entity.text, '->', entity.label_)
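
When a text contains many entities, grouping them by label can be more readable than a flat listing. The sketch below is one possible way to do that with the standard library. The helper `group_by_label` and the sample pairs are hypothetical: the pairs stand in for `[(ent.text, ent.label_) for ent in doc.ents]` and are not actual model output.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs by label, keeping first-seen order without duplicates."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Hypothetical pairs for illustration only, not the output of fr_core_news_sm.
sample_entities = [
    ("Phrygie", "LOC"),
    ("Sestos", "LOC"),
    ("Xercès", "PER"),
    ("Sestos", "LOC"),
]

for label, texts in group_by_label(sample_entities).items():
    print(label, "->", ", ".join(texts))
```

With a processed document, the same helper could be called as `group_by_label((ent.text, ent.label_) for ent in doc.ents)`.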

* Display NER annotations:

In [None]:
displacy.render(doc, style="ent", jupyter=True)

* Print the annotations in CoNLL-U format:

In [None]:
print(doc._.conll_str)
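
To work with the CoNLL output programmatically, the string can be split back into sentences and tokens. The sketch below assumes the standard ten tab-separated CoNLL-U columns, blank lines between sentences, and `#` comment lines. The function `parse_conllu` and the sample string are hypothetical: the sample is hand-written for illustration and is not the actual value of `doc._.conll_str`.

```python
# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(conll_str):
    """Return a list of sentences, each a list of {column: value} dicts."""
    sentences, current = [], []
    for line in conll_str.splitlines():
        if not line.strip():        # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):    # skip comment/metadata lines
            continue
        current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:                     # last sentence without a trailing blank line
        sentences.append(current)
    return sentences


# Hand-written sample, for illustration only.
sample_conll = (
    "1\tVille\tville\tNOUN\t_\t_\t0\tROOT\t_\t_\n"
    "2\tmaritime\tmaritime\tADJ\t_\t_\t1\tamod\t_\t_\n"
    "\n"
    "1\tXercès\tXercès\tPROPN\t_\t_\t0\tROOT\t_\t_\n"
)

for sentence in parse_conllu(sample_conll):
    print([(tok["FORM"], tok["UPOS"]) for tok in sentence])
```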

* Write the doc to a file (CoNLL-U format):

In [None]:
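# Create the output directory if needed, then write the CoNLL-U string
os.makedirs('output', exist_ok=True)
with open(os.path.join('output', 'sample_spacy.conllu'), 'w', encoding='utf-8') as file:
    file.write(doc._.conll_str)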