# spaCy NER

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/ludovicmoncla/nlp-tools/blob/main/spacy.ipynb)

Website: https://spacy.io/

## Install spaCy

* Install the spaCy package:
```bash
pip install -U spacy
```

* Install the spaCy package on an Apple M1 chip:
```bash
pip install -U 'spacy[apple]'
```

* Install the French model:
```bash
python -m spacy download fr_core_news_sm
```

## Import packages

`spacy_conll` is a separate package and is not installed with spaCy; install it with `pip install spacy_conll`. It provides the `conll_formatter` pipeline component used below.

In [1]:
import os
import spacy
from spacy import displacy
from spacy_conll import init_parser

## Parse text

In [2]:
# Load the French pipeline (tokenizer, tagger, parser and NER), then append the CoNLL formatter
nlp = spacy.load("fr_core_news_sm")
nlp.add_pipe("conll_formatter", last=True)

ConllFormatter(conversion_maps=None, ext_names={'conll_str': 'conll_str', 'conll': 'conll', 'conll_pd': 'conll_pd'}, field_names={'ID': 'ID', 'FORM': 'FORM', 'LEMMA': 'LEMMA', 'UPOS': 'UPOS', 'XPOS': 'XPOS', 'FEATS': 'FEATS', 'HEAD': 'HEAD', 'DEPREL': 'DEPREL', 'DEPS': 'DEPS', 'MISC': 'MISC'}, include_headers=False, disable_pandas=False)

In [3]:
# Process text
text = "ABYDE ou ABYDOS, sub. Ville maritime de Phrygie vis-à-vis de Sestos."
text += "Xercès joignit ces deux endroits éloignés l'un de l'autre de sept stades, par le pont qu'il jetta sur l'Hellespont."
doc = nlp(text)
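
Note that `+=` appends the second string with no separator, so the text passed to `nlp` contains `Sestos.Xercès` with no space after the full stop. This is why that full stop carries `SpaceAfter=No` in the CoNLL output further down. If a space is wanted, one option is to keep the sentences in a list and join them. The sketch below uses plain Python only; the names `parts`, `concatenated` and `joined` are illustrative and do not appear in the notebook.

```python
parts = [
    "ABYDE ou ABYDOS, sub. Ville maritime de Phrygie vis-à-vis de Sestos.",
    "Xercès joignit ces deux endroits éloignés l'un de l'autre de sept stades, "
    "par le pont qu'il jetta sur l'Hellespont.",
]

# Plain concatenation, as in the cell above: no space between the sentences.
concatenated = parts[0] + parts[1]

# Joining with an explicit separator keeps the sentences apart.
joined = " ".join(parts)

print("Sestos.Xercès" in concatenated)  # True
print("Sestos. Xercès" in joined)       # True
```

The outputs shown in the rest of this page were produced with the concatenated text, so they would differ slightly (for instance in `SpaceAfter`) if the joined version were used instead.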

A `Doc` object is a sequence of tokens. It holds the original text together with every annotation the spaCy pipeline produced while processing it.

* Display each token with the following attributes:
    * **Text**: The original word text.
    * **Lemma**: The base form of the word.
    * **POS**: The simple UPOS part-of-speech tag.
    * **Tag**: The detailed part-of-speech tag.
    * **Dep**: Syntactic dependency, i.e. the relation between tokens.
    * **Shape**: The word shape – capitalization, punctuation, digits.
    * **is alpha**: Does the token consist only of alphabetic characters?
    * **is stop**: Is the token part of a stop list, i.e. the most common words of the language?
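
To make the **Shape** and **is alpha** attributes concrete, here is a minimal pure-Python sketch that approximates them. The helper `word_shape` is written for illustration and is not spaCy's own code. It assumes that letters map to `x` or `X`, digits map to `d`, other characters are kept as they are, and a run of the same shape character is cut after four. These assumptions reproduce the shapes shown in the output below (for example `Xxxxx` for `Phrygie`).

```python
def word_shape(text):
    """Approximate a token's shape: x/X for letters, d for digits, runs capped at 4."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept unchanged
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most four identical shape characters in a row
            shape.append(shape_char)
    return "".join(shape)


for word in ["ABYDE", "Phrygie", "vis-à-vis", "l'"]:
    # str.isalpha() plays the role of the "is alpha" column here
    print(word, word_shape(word), word.isalpha())
# ABYDE XXXX True
# Phrygie Xxxxx True
# vis-à-vis xxx-x-xxx False
# l' x' False
```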

In [4]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

ABYDE abyde NOUN NOUN ROOT XXXX True False
ou ou CCONJ CCONJ cc xx True True
ABYDOS abydo NOUN NOUN conj XXXX True False
, , PUNCT PUNCT punct , False False
sub sub PROPN PROPN conj xxx True False
. . PUNCT PUNCT punct . False False
Ville ville NOUN NOUN ROOT Xxxxx True False
maritime maritime ADJ ADJ nmod xxxx True False
de de ADP ADP case xx True True
Phrygie phrygie NOUN NOUN nmod Xxxxx True False
vis-à-vis vis-à-vis ADV ADV advmod xxx-x-xxx False False
de de ADP ADP case xx True True
Sestos Sestos PROPN PROPN nmod Xxxxx True False
. . PUNCT PUNCT punct . False False
Xercès xercè NOUN NOUN ROOT Xxxxx True False
joignit joignit ADJ ADJ amod xxxx True False
ces ce DET DET det xxx True True
deux deux NUM NUM nummod xxxx True True
endroits endroit NOUN NOUN nmod xxxx True False
éloignés éloigné ADJ ADJ amod xxxx True False
l' le DET DET det x' False True
un un PRON PRON nmod xx True True
de de ADP ADP case xx True True
l' le DET DET det x' False True
autre autre NOUN NOUN nmod xxxx True

Visualizer documentation: https://spacy.io/usage/visualizers

* Display the syntactic dependency tree:

In [5]:
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", jupyter=True)

* Print the named entities:

In [6]:
print(doc.ents)

# Print each entity with its label
for entity in doc.ents:
    print(entity.text, '->', entity.label_)

(ABYDE, ABYDOS, Phrygie, Sestos, Xercès, Hellespont)
ABYDE -> ORG
ABYDOS -> MISC
Phrygie -> LOC
Sestos -> LOC
Xercès -> PER
Hellespont -> LOC
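
When only some entity types are of interest (place names, say), it can help to group the entities by label. The sketch below does this in plain Python on the pairs printed above. `group_by_label` is a hypothetical helper, not part of spaCy; with a processed `Doc`, the input could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into a {label: [texts]} dictionary."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the output above.
pairs = [
    ("ABYDE", "ORG"), ("ABYDOS", "MISC"), ("Phrygie", "LOC"),
    ("Sestos", "LOC"), ("Xercès", "PER"), ("Hellespont", "LOC"),
]
entities_by_label = group_by_label(pairs)
print(entities_by_label["LOC"])  # ['Phrygie', 'Sestos', 'Hellespont']
```

Grouping also makes it easy to spot that `ABYDE` and `ABYDOS`, which the text describes as a town, ended up under `ORG` and `MISC` rather than `LOC`.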


* Display NER annotations:

In [7]:
displacy.render(doc, style="ent", jupyter=True)

* Print annotations in CoNLL-U format:

In [8]:
print(doc._.conll_str)

1	ABYDE	abyde	NOUN	NOUN	Gender=Fem|Number=Plur	0	ROOT	_	_
2	ou	ou	CCONJ	CCONJ	_	3	cc	_	_
3	ABYDOS	abydo	NOUN	NOUN	Gender=Fem|Number=Sing	1	conj	_	SpaceAfter=No
4	,	,	PUNCT	PUNCT	_	1	punct	_	_
5	sub	sub	PROPN	PROPN	Gender=Masc|Number=Sing	1	conj	_	SpaceAfter=No
6	.	.	PUNCT	PUNCT	_	1	punct	_	_

1	Ville	ville	NOUN	NOUN	Gender=Fem|Number=Sing	0	ROOT	_	_
2	maritime	maritime	ADJ	ADJ	Number=Sing	1	nmod	_	_
3	de	de	ADP	ADP	_	4	case	_	_
4	Phrygie	phrygie	NOUN	NOUN	Gender=Masc|Number=Sing	1	nmod	_	_
5	vis-à-vis	vis-à-vis	ADV	ADV	_	1	advmod	_	_
6	de	de	ADP	ADP	_	7	case	_	_
7	Sestos	Sestos	PROPN	PROPN	_	1	nmod	_	SpaceAfter=No
8	.	.	PUNCT	PUNCT	_	1	punct	_	SpaceAfter=No

1	Xercès	xercè	NOUN	NOUN	Gender=Masc|Number=Plur	0	ROOT	_	_
2	joignit	joignit	ADJ	ADJ	Gender=Masc|Number=Plur	1	amod	_	_
3	ces	ce	DET	DET	Number=Plur|PronType=Dem	5	det	_	_
4	deux	deux	NUM	NUM	NumType=Card	5	nummod	_	_
5	endroits	endroit	NOUN	NOUN	Gender=Masc|Number=Plur	1	nmod	_	_
6	éloignés	éloigné	ADJ	ADJ	Gender=Masc|Number=Plur
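
Each line of this output holds ten tab-separated fields, and a blank line separates sentences. As a rough illustration of how the string can be read back, the sketch below splits it into sentences and token dictionaries using plain Python. `parse_conll` and `FIELDS` are illustrative names, not part of `spacy_conll`; the field names follow the `field_names` shown when the formatter was added to the pipeline, and the sample is a few complete lines copied from the output above.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conll(conll_str):
    """Split a CoNLL-U string into sentences, each a list of token dicts."""
    sentences = []
    for block in conll_str.strip().split("\n\n"):
        rows = []
        for line in block.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment/header lines
            rows.append(dict(zip(FIELDS, line.split("\t"))))
        if rows:
            sentences.append(rows)
    return sentences


sample = (
    "1\tABYDE\tabyde\tNOUN\tNOUN\tGender=Fem|Number=Plur\t0\tROOT\t_\t_\n"
    "2\tou\tou\tCCONJ\tCCONJ\t_\t3\tcc\t_\t_\n"
    "\n"
    "1\tVille\tville\tNOUN\tNOUN\tGender=Fem|Number=Sing\t0\tROOT\t_\t_\n"
)
sentences = parse_conll(sample)
print(len(sentences))             # 2
print(sentences[0][0]["FORM"])    # ABYDE
print(sentences[1][0]["DEPREL"])  # ROOT
```

With a processed document, the same function could be called as `parse_conll(doc._.conll_str)`.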

* Write the doc to a file (CoNLL-U format):

In [9]:
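# Create the output directory if it does not exist yet, then write the CoNLL-U string
os.makedirs('output', exist_ok=True)
with open(os.path.join('output', 'sample_spacy.conllu'), 'w', encoding='utf-8') as file:
    file.write(doc._.conll_str)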