<a href="https://colab.research.google.com/github/jalva80/NLPlaying/blob/master/NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing - Named Entity Recognition
A playground notebook for experimenting with named entity recognition.

# Libraries

We use NLTK and spaCy for named entity recognition on raw text. The examples are in English, since the tagger and the spaCy model loaded below (`en_core_web_sm`) are English models.

In [0]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [2]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:

from nltk.chunk import ne_chunk

### Information extraction
We start by looking at what we get on a short excerpt (*Romeo and Juliet*, William Shakespeare).

In [0]:
exple = """Benvolio, Old Montague's nephew, heard the fighting. He didn't really like the feud between his family and the Capulets."""

We apply tokenization and part-of-speech tagging (labeling each token with its grammatical category) to our example sentence.

In [0]:
exTok = word_tokenize(exple)
exTag = pos_tag(exTok)

Let's look at the result.

In [7]:
exTag

[('Benvolio', 'NNP'),
 (',', ','),
 ('Old', 'NNP'),
 ('Montague', 'NNP'),
 ("'s", 'POS'),
 ('nephew', 'NN'),
 (',', ','),
 ('heard', 'VBD'),
 ('the', 'DT'),
 ('fighting', 'NN'),
 ('.', '.'),
 ('He', 'PRP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('really', 'RB'),
 ('like', 'IN'),
 ('the', 'DT'),
 ('feud', 'NN'),
 ('between', 'IN'),
 ('his', 'PRP$'),
 ('family', 'NN'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('Capulets', 'NNPS'),
 ('.', '.')]

We get (word, tag) pairs. As a first step toward identifying named entities, we chunk the sentence using a regular expression over the tags as the chunking grammar.
This grammar has a single rule: a noun phrase NP is formed by an optional determiner DT, followed by any number of adjectives JJ, then a noun NN. Because the rule only accepts NN, proper nouns (tagged NNP or NNPS) are not chunked.

In [0]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'
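
To make the rule concrete, here is a minimal pure-Python sketch of what `<DT>?<JJ>*<NN>` matches. It is not how NLTK implements its parser: `np_chunks` is a hypothetical helper written for illustration, and the tagged tokens are copied by hand from the first sentence of the `pos_tag` output above.

```python
# First sentence of the pos_tag output shown above, copied by hand.
tagged = [
    ("Benvolio", "NNP"), (",", ","), ("Old", "NNP"), ("Montague", "NNP"),
    ("'s", "POS"), ("nephew", "NN"), (",", ","), ("heard", "VBD"),
    ("the", "DT"), ("fighting", "NN"), (".", "."),
]

def np_chunks(tagged_tokens):
    """Hypothetical helper: collect word groups matching DT? JJ* NN, left to right."""
    chunks = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        # Optional determiner.
        if tagged_tokens[j][1] == "DT":
            j += 1
        # Any number of adjectives.
        while j < n and tagged_tokens[j][1] == "JJ":
            j += 1
        # Mandatory common noun closes the chunk.
        if j < n and tagged_tokens[j][1] == "NN":
            chunks.append([word for word, _ in tagged_tokens[i:j + 1]])
            i = j + 1
        else:
            i += 1
    return chunks

chunks = np_chunks(tagged)
print(chunks)
```

The proper nouns (`Benvolio`, `Old Montague`) are skipped because their tag is NNP, not NN, which is consistent with the parser output in the next cells.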

**Chunking**

From this pattern, I build a chunk parser and test it on the sentence.

In [9]:
cp = nltk.RegexpParser(pattern)
cs = cp.parse(exTag)
print(cs)

(S
  Benvolio/NNP
  ,/,
  Old/NNP
  Montague/NNP
  's/POS
  (NP nephew/NN)
  ,/,
  heard/VBD
  (NP the/DT fighting/NN)
  ./.
  He/PRP
  did/VBD
  n't/RB
  really/RB
  like/IN
  (NP the/DT feud/NN)
  between/IN
  his/PRP$
  (NP family/NN)
  and/CC
  the/DT
  Capulets/NNPS
  ./.)


The parser returns a tree of the sentence's elements, with the matched noun phrases grouped under NP nodes.

In [10]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('Benvolio', 'NNP', 'O'),
 (',', ',', 'O'),
 ('Old', 'NNP', 'O'),
 ('Montague', 'NNP', 'O'),
 ("'s", 'POS', 'O'),
 ('nephew', 'NN', 'B-NP'),
 (',', ',', 'O'),
 ('heard', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('fighting', 'NN', 'I-NP'),
 ('.', '.', 'O'),
 ('He', 'PRP', 'O'),
 ('did', 'VBD', 'O'),
 ("n't", 'RB', 'O'),
 ('really', 'RB', 'O'),
 ('like', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('feud', 'NN', 'I-NP'),
 ('between', 'IN', 'O'),
 ('his', 'PRP$', 'O'),
 ('family', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('Capulets', 'NNPS', 'O'),
 ('.', '.', 'O')]


Each word is shown with its PoS tag and its IOB chunk tag (B-NP, I-NP or O). At this stage the tags mark noun-phrase chunks, not named entities.
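
As a rough illustration of how these IOB tags encode chunk boundaries, the sketch below groups the triples back into chunks. `iob_to_chunks` is a hypothetical helper, not part of NLTK (which provides `conlltags2tree` for the real conversion), and the triples are a contiguous slice copied from the output above.

```python
# Slice of the tree2conlltags output shown above, copied by hand.
iob_slice = [
    ("'s", "POS", "O"),
    ("nephew", "NN", "B-NP"),
    (",", ",", "O"),
    ("heard", "VBD", "O"),
    ("the", "DT", "B-NP"),
    ("fighting", "NN", "I-NP"),
    (".", ".", "O"),
]

def iob_to_chunks(triples):
    """Hypothetical helper: group (word, pos, iob_tag) triples into (label, words) chunks."""
    chunks = []
    current = None  # (label, words) of the chunk being built, if any
    for word, _pos, tag in triples:
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk.
            if current:
                chunks.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag extends the open chunk with the same label.
            current[1].append(word)
        else:
            # "O", or an I- tag with no matching open chunk: close any open chunk.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

np_groups = iob_to_chunks(iob_slice)
print(np_groups)
```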

In [11]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [12]:
ne_tree = ne_chunk(pos_tag(word_tokenize(exple)))
print(ne_tree)

(S
  (GPE Benvolio/NNP)
  ,/,
  (PERSON Old/NNP Montague/NNP)
  's/POS
  nephew/NN
  ,/,
  heard/VBD
  the/DT
  fighting/NN
  ./.
  He/PRP
  did/VBD
  n't/RB
  really/RB
  like/IN
  the/DT
  feud/NN
  between/IN
  his/PRP$
  family/NN
  and/CC
  the/DT
  (ORGANIZATION Capulets/NNPS)
  ./.)


**Entities**

In [0]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [14]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]


NORP = nationalities or religious or political groups

In [15]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(European, 'B', 'NORP'),
 (authorities, 'O', ''),
 (fined, 'O', ''),
 (Google, 'B', 'ORG'),
 (a, 'O', ''),
 (record, 'O', ''),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (on, 'O', ''),
 (Wednesday, 'B', 'DATE'),
 (for, 'O', ''),
 (abusing, 'O', ''),
 (its, 'O', ''),
 (power, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (mobile, 'O', ''),
 (phone, 'O', ''),
 (market, 'O', ''),
 (and, 'O', ''),
 (ordered, 'O', ''),
 (the, 'O', ''),
 (company, 'O', ''),
 (to, 'O', ''),
 (alter, 'O', ''),
 (its, 'O', ''),
 (practices, 'O', '')]


"B"egin: the token begins an entity

"I"n: the token is inside an entity

"L"ast: the token is the last token of an entity

"O"ut: the token is outside any entity

"U"nit: the token is a single-token entity

"": no entity tag is set.

Note that `ent_iob_` itself only returns B, I, O or the empty string; L and U belong to the extended BILUO scheme.

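Since `ent_iob_` reports plain IOB codes, the L and U marks have to be derived. The sketch below shows one way to do it in plain Python; `iob_to_biluo` is a hypothetical helper written for this notebook, and it assumes a well-formed IOB sequence in which every "I" follows a "B" or another "I". The input codes are copied from the first tokens of the output above.

```python
# IOB codes of the first eleven tokens shown above:
# European authorities fined Google a record $ 5.1 billion on Wednesday
iob_codes = ["B", "O", "O", "B", "O", "O", "B", "I", "I", "O", "B"]

def iob_to_biluo(codes):
    """Hypothetical helper: turn a well-formed IOB code sequence into BILUO codes."""
    biluo = []
    for i, code in enumerate(codes):
        # The entity continues only if the next code is "I".
        next_code = codes[i + 1] if i + 1 < len(codes) else "O"
        continues = next_code == "I"
        if code == "B":
            biluo.append("B" if continues else "U")  # single-token entity -> U
        elif code == "I":
            biluo.append("I" if continues else "L")  # final token of entity -> L
        else:
            biluo.append("O")  # "O" and "" are both treated as outside
    return biluo

biluo_codes = iob_to_biluo(iob_codes)
print(list(zip(iob_codes, biluo_codes)))
```

With this conversion, `European`, `Google` and `Wednesday` become U (single-token entities), while `$ 5.1 billion` becomes B, I, L.
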
I will now extract the named entities from a longer excerpt.

In [0]:
extrait2 = """Friar Lawrence was preparing Romeo for the wedding. He wasn't sure if this wedding was the right tiling to do. Marrying two young people without their parents' permission was not right. He wanted this marriage to bring the two families closer together, but he wasn't sure if it would.

"Will they get angry? Will they get angry with me?" he thought.
Amen already," said Romeo. He was impatient from waiting on Friar Lawrence.

"Calm yourself, Romeo," scolded Friar Lawrence. "You need to live and love moderately. If not, neither love nor life will last long."

But he knew Romeo wouldn't be calm. Neither would Juliet. Then Juliet came running toward the church.

"I should send both of them home," thought Friar Lawrence. But he knew he couldn't do that.

"Good afternoon, Father," sang Juliet. She jumped into Romeo's arms.

"Juliet!" cried Romeo, holding her. "Please tell me you love me as much as I love you! And how happy you will be after we are married."

"I love you so much that words can't describe it," she said and kissed him."""

In [25]:
import re

# Collapse runs of newlines and tabs into single spaces before parsing.
article = nlp(" ".join(re.split(r'[\n\t]+', extrait2)))
len(article.ents)

16
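
The cell above flattens the excerpt before handing it to spaCy. As a small sketch of what that normalization does, here it is applied to two lines taken from the excerpt; the variable names are illustrative only.

```python
import re

# Two lines of the excerpt, separated by a blank line as in the original text.
sample = (
    "Then Juliet came running toward the church.\n"
    "\n"
    '"I should send both of them home," thought Friar Lawrence.'
)

# Split on runs of newlines/tabs, then rejoin the pieces with single spaces.
normalized = " ".join(re.split(r"[\n\t]+", sample))
print(normalized)

# str.split() with no argument splits on any whitespace run, so on this sample
# it gives the same result; unlike the regex, it would also collapse repeated
# spaces and drop leading or trailing whitespace.
alternative = " ".join(sample.split())
```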

spaCy detects 16 entities in the excerpt, with the following labels:

In [26]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 2, 'ORG': 1, 'PERSON': 11, 'TIME': 1, 'WORK_OF_ART': 1})

The three most frequent entities are:

In [27]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Romeo', 5), ('Friar Lawrence', 4), ('Juliet', 3)]
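
The two `Counter` cells count labels and texts separately, so they do not show which text received which label. A minimal sketch of grouping texts by label follows. Since the per-entity labels of this excerpt are not printed in the notebook, the pairs below are copied from the Google example earlier; in the notebook one would presumably build them with `[(e.text, e.label_) for e in article.ents]`.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the Google example output shown earlier.
ent_pairs = [
    ("European", "NORP"),
    ("Google", "ORG"),
    ("$5.1 billion", "MONEY"),
    ("Wednesday", "DATE"),
]

# Map each label to the list of entity texts that carry it.
texts_by_label = defaultdict(list)
for text, label in ent_pairs:
    texts_by_label[label].append(text)

# Same label count as Counter(labels) in the cell above, for comparison.
label_counts = Counter(label for _, label in ent_pairs)

for label, texts in texts_by_label.items():
    print(label, label_counts[label], texts)
```

Applied to `article.ents`, this kind of grouping would make it easy to inspect the less expected labels reported above, such as ORG or WORK_OF_ART.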

I select one sentence from the excerpt and print its tokens.

In [48]:
sentences = list(article.sents)
for token in sentences[10]:
  print(token)

"
Calm
yourself
,
Romeo
,
"
scolded
Friar
Lawrence
.


I use `displacy.render` with `style='ent'` to produce a basic highlighting of the entities.

With `style='dep'`, the same built-in spaCy visualizer draws the dependencies within the sentence:

In [47]:
displacy.render(nlp(str(sentences[10])), jupyter=True, style='ent')

In [45]:
displacy.render(nlp(str(sentences[10])), style='dep', jupyter = True, options = {'distance': 120})

Next, we extract the PoS (part-of-speech) tags and the lemmas of the sentence, skipping stop words and punctuation.

In [49]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[10]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']


[('Calm', 'VERB', 'calm'),
 ('Romeo', 'PROPN', 'Romeo'),
 ('scolded', 'VERB', 'scold'),
 ('Friar', 'PROPN', 'Friar'),
 ('Lawrence', 'PROPN', 'Lawrence')]

In [50]:
dict([(str(x), x.label_) for x in nlp(str(sentences[10])).ents])

{'Friar Lawrence': 'PERSON', 'Romeo': 'PERSON'}

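When the sentence is parsed on its own, the extracted named entities are correct. The tokens of the same sentence taken from the full document are tagged differently, as the next cell shows: there, `Calm yourself, Romeo` is labeled WORK_OF_ART.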

In [51]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[10]])

[(", 'O', ''), (Calm, 'B', 'WORK_OF_ART'), (yourself, 'I', 'WORK_OF_ART'), (,, 'I', 'WORK_OF_ART'), (Romeo, 'I', 'WORK_OF_ART'), (,, 'O', ''), (", 'O', ''), (scolded, 'O', ''), (Friar, 'B', 'PERSON'), (Lawrence, 'I', 'PERSON'), (., 'O', '')]
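
To compare the two results side by side, the sketch below merges the token-level tags above into entity spans and checks them against the dictionary obtained from the isolated sentence. `merge_entities` is a hypothetical helper, and it assumes a well-formed IOB sequence (every "I" directly continues the entity opened by the previous "B"). Both inputs are copied by hand from the outputs above.

```python
# (token, IOB code, entity type) triples copied from the output above.
in_context = [
    ('"', "O", ""), ("Calm", "B", "WORK_OF_ART"), ("yourself", "I", "WORK_OF_ART"),
    (",", "I", "WORK_OF_ART"), ("Romeo", "I", "WORK_OF_ART"), (",", "O", ""),
    ('"', "O", ""), ("scolded", "O", ""), ("Friar", "B", "PERSON"),
    ("Lawrence", "I", "PERSON"), (".", "O", ""),
]
# Entities found when the sentence is parsed on its own (dict shown earlier).
isolated = {"Friar Lawrence": "PERSON", "Romeo": "PERSON"}

def merge_entities(triples):
    """Hypothetical helper: merge token-level IOB tags into {entity text: label}."""
    spans = []
    for text, iob, ent_type in triples:
        if iob == "B":
            spans.append(([text], ent_type))  # open a new entity
        elif iob == "I" and spans:
            spans[-1][0].append(text)  # extend the most recent entity
    # Tokens are joined with spaces, so original spacing is not restored.
    return {" ".join(words): label for words, label in spans}

context_entities = merge_entities(in_context)
# Entities present in the isolated parse but missing from the in-context parse.
only_isolated = set(isolated) - set(context_entities)
print(context_entities)
print(only_isolated)
```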


Finally, we visualize the full excerpt:

In [34]:
displacy.render(article, jupyter=True, style='ent')