Created on Monday 11 January 2021  

**Group 3 - Representation**  
**The objective of this notebook is to create a NER (Named Entity Recognition) representation from the deduplicated data file**

@authors : Arthur CARLET, Guillaume BERNARD, Neima MARCO, Nesrine AIDER, Lou-Ann CHAUSSE, Fannie MATHEY

# G3 : Named-Entity Recognition (NER)


---

## 1) Libraries and data import

In [None]:
# To be launched only once :

# Spacy's fr_core_news_lg model installation :

!pip install -U spacy
!python -m spacy download fr_core_news_lg


# Polyglot's model installation :

!pip install icu
!pip install pyicu
!pip install pycld2
!pip install morfessor
!pip install -U polyglot
!polyglot download embeddings2.fr
!polyglot download ner2.fr

# After installation, the runtime must be restarted


In [None]:
# Libraries :

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import pickle
# Polyglot : 
from icu import Locale
import polyglot
from polyglot.text import Text, Word

# Spacy's fr_core_news_lg :
import spacy
import fr_core_news_lg

# Stanford NER :
import nltk
from nltk.tag.stanford import StanfordNERTagger

# Maxent_Ne_Chunker :
nltk.download('punkt')
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
# GoogleDrive setup :

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Data loading :

DATA_PATH = '/content/drive/MyDrive/PIP 2021/Données'
dataframe = pd.read_json(DATA_PATH + "/Deduplicated/df_scrapped_g1_5_v0.json")

## 2) NER

We selected four different NER methods :

***- maxent_ne_chunker*** : a statistical model currently recommended by NLTK for NER. (Source : https://nlpforhackers.io/named-entity-extraction/)

***- Spacy fr_core_news_lg*** : a model that features fast statistical NER as well as an open-source named-entity visualizer. (Source : https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)

***- Stanford NER*** : a Java implementation of a Named Entity Recognizer. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. (Source : https://medium.com/sicara/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486)


***- Polyglot*** : a library providing annotators for 40 major languages, built using Wikipedia and Freebase. Polyglot does not require human-annotated NER datasets or language-specific resources. (Source : https://polyglot.readthedocs.io/en/latest/modules.html)

### 2.1) maxent_ne_chunker

In [None]:
# NER implementation with maxent_ne_chunker :

ner_nechunk = []

for sent in tqdm(dataframe['art_content']):
    ner_nechunk.append(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))))





In [None]:
# Creation of a new dataframe containing the article id and its NER tag :

df_ner_nechunk = pd.DataFrame(
    {"art_id": dataframe["art_id"], "ner": ner_nechunk})

df_ner_nechunk.head()

Unnamed: 0,art_id,ner
0,1,"[[(La, NNP)], (FNCDG, NNP), (et, FW), (l, NN),..."
1,2,"[[(Malgré, NNP)], (la, NNP), (levée, FW), (des..."
2,25,"[(Quels, NNS), (étaient, JJ), (les, NNS), (obj..."
3,27,"[[(La, NNP)], (journée, NN), (thématique, NN),..."
4,28,"[(La, NNP), (1ère, CD), (journée, NN), (thémat..."


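The results were poor: as the sample above shows, the French text is tagged with English part-of-speech labels (for example `FW`, foreign word), since NLTK's default tagger and `maxent_ne_chunker` are trained on English. We tried some preprocessing techniques to improve the output, but the results remained mediocre, so we do not use this model.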

### 2.2) Spacy fr_core_news_lg

In [None]:
# NER implementation with the fr_core_news_lg model :

nlp = fr_core_news_lg.load()
ner_lg = []

for content in tqdm(dataframe['art_content']):
    doc = nlp(content)  # NER computing
    ner_lg.append([(X.text, X.label_) for X in doc.ents])





In [None]:
# Creation of a new dataframe containing the article id and its NER tag :

df_ner_lg = pd.DataFrame({"art_id": dataframe["art_id"], "ner": ner_lg})

df_ner_lg.head()

Unnamed: 0,art_id,ner
0,g1_5_0,"[(Cher, LOC), (Nous vous sollicitons pour part..."
1,g1_5_1,"[(AFIGESE, ORG), (crise du Covid-19, MISC), (U..."
2,g1_5_2,"[(Assises de l’, MISC), (AFIGESE 2019, MISC), ..."
3,g1_5_3,"[(Revue Pouvoirs Locaux, ORG), (Gilles Alfonsi..."
4,g1_5_4,"[(AFIGESE, ORG), (Linked In, MISC), (Groupe « ..."
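
To get a quick view of what the model extracted, the per-article lists of `(text, label)` pairs can be aggregated by label. The sketch below assumes the same structure as `ner_lg`; `count_labels` and the sample data are illustrative names and values, not part of the original notebook.

```python
from collections import Counter


def count_labels(ner_per_article):
    """Count how often each entity label appears across all articles."""
    counts = Counter()
    for entities in ner_per_article:
        counts.update(label for _text, label in entities)
    return counts


# Hypothetical sample with the same shape as `ner_lg`
sample = [
    [("AFIGESE", "ORG"), ("crise du Covid-19", "MISC")],
    [("Gilles Alfonsi", "PER"), ("AFIGESE", "ORG")],
]
label_counts = count_labels(sample)
print(label_counts.most_common())
```

On the real data, `count_labels(ner_lg)` would give the label distribution over the whole corpus, which may help spot labels that are over-represented (for instance a large share of `MISC`).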


In [None]:
# Conversion to json format :

df_ner_lg.to_json(
    '/content/drive/MyDrive/PIP 2021/Demande/Arthur/NER_spacy_lg.json', orient='records')
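
One point to keep in mind when reloading this file: JSON has no tuple type, so each `(text, label)` pair comes back as a two-element list. The sketch below uses the standard `json` module on a hypothetical record to show the effect and one way to restore tuples; `restore_tuples` is an illustrative helper, not part of the original notebook.

```python
import json


def restore_tuples(records):
    """Turn each [text, label] list back into a (text, label) tuple."""
    return [
        {"art_id": rec["art_id"], "ner": [tuple(pair) for pair in rec["ner"]]}
        for rec in records
    ]


# Hypothetical record shaped like one row of the exported file
records = [{"art_id": "g1_5_1", "ner": [("AFIGESE", "ORG")]}]

reloaded = json.loads(json.dumps(records))  # tuples become lists here
restored = restore_tuples(reloaded)
```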

In [None]:
# Save the spaCy model (pickle is used here; `nlp.to_disk(path)` is spaCy's own serialization method) :
with open('/content/drive/MyDrive/PIP 2021/Pos Tagging/Guillaume/spacy_lg.pickle', 'wb') as f1:
    pickle.dump(nlp, f1)

### 2.3) Stanford NER

In [None]:
# NER implementation with Stanford NER :

jar = '/content/drive/MyDrive/PIP 2021/Pos Tagging/Nesrine/stanford-ner.jar'
model = 'TO_BE_CREATED'

ner_tagger = StanfordNERTagger(model, jar, encoding='utf8')
df_ner_stanford = []

for content in tqdm(dataframe['art_content']):
    words = nltk.word_tokenize(content)  # Split the text into tokens
    df_ner_stanford.append(ner_tagger.tag(words))

While working on Stanford's NER model, we quickly realised that no French model was available and that using this method would mean training our own model, which in turn would require an annotated French text corpus.

Knowing this, and already having several working NER representations of our data, we decided to suspend this work for the time being.
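
For reference, training a Stanford NER model starts from a tab-separated file with one token and its label per line, using `O` for tokens outside any entity. The sketch below writes such content from already-annotated tokens. The annotated sample, the function name `to_stanford_tsv` and the blank line used to separate documents are assumptions for illustration; no annotated French corpus is provided here.

```python
def to_stanford_tsv(documents):
    """Serialise annotated documents as 'token<TAB>label' lines."""
    blocks = []
    for tokens in documents:
        blocks.append("\n".join(f"{token}\t{label}" for token, label in tokens))
    # Assumption: one blank line between documents
    return "\n\n".join(blocks) + "\n"


# Hypothetical hand-annotated tokens
annotated = [
    [("Gilles", "PER"), ("Alfonsi", "PER"), ("travaille", "O")],
    [("Paris", "LOC")],
]
tsv = to_stanford_tsv(annotated)
print(tsv)
```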

### 2.4) Polyglot

In [None]:
# Convert polyglot's NER output format into spaCy's NER output format.
# Polyglot returns a list of chunks, each a list of words carrying a tag:
# [ <tag>([<word>, ...]), ... ] (= entities)
# We reshape it so that both models produce identical output.
def post_process_ner_polyglot(entities: list) -> list:
    """Convert polyglot entities into (text, label) tuples.

    Parameters:
        entities: list of entities returned by polyglot's Text(...).entities
    Returns:
        result: list of (entity text, label) tuples
    """
    result = []
    for entity in entities:
        # Polyglot tags are I-PER, I-LOC, I-ORG; spaCy tags are PER, LOC and ORG,
        # so the leading "I-" is dropped.
        result.append(
            (' '.join(entity).replace(' - ', '-'), entity.tag[2:]))
    return result
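
To see what this conversion does without installing polyglot, the sketch below runs the same logic on a stand-in object. `FakeChunk` is a hypothetical class that only mimics what the function relies on (a list of words with a `tag` attribute), and the assumption that a hyphenated name arrives as three separate words is inferred from the `replace(' - ', '-')` call. The function is repeated so that the example runs on its own.

```python
class FakeChunk(list):
    """Hypothetical stand-in for a polyglot entity: a list of words with a `tag`."""

    def __init__(self, words, tag):
        super().__init__(words)
        self.tag = tag


def post_process_ner_polyglot(entities):
    """Same logic as the notebook function: join words, strip the 'I-' prefix."""
    return [
        (" ".join(entity).replace(" - ", "-"), entity.tag[2:])
        for entity in entities
    ]


entities = [
    FakeChunk(["Saint", "-", "Denis"], "I-LOC"),
    FakeChunk(["AFIGESE"], "I-ORG"),
]
cleaned = post_process_ner_polyglot(entities)
print(cleaned)
```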

In [None]:
ner_polyglot = []
for content in tqdm(dataframe['art_content']):
    try:
        # run polyglot model
        entities = Text(content, hint_language_code='fr').entities
        content_ner = post_process_ner_polyglot(entities)  # output cleaning
        ner_polyglot.append(content_ner)  # add result to the list
    except Exception:
        print("Encoding Error")
        ner_polyglot.append([])  # on failure, add an empty list
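
The loop above does not inspect the exception it catches, so the actual cause of a failure cannot be read from the notebook. One possible cause is text containing control characters that the language detector rejects. Assuming that is what happens here, a minimal sketch of a pre-cleaning step could look like this; `strip_unprintable` and the sample string are illustrative, and replacing with a space rather than deleting is a design choice so that words separated by a newline or tab are not merged.

```python
def strip_unprintable(text):
    """Replace non-printable characters (control codes, newlines, tabs) with a space."""
    return "".join(ch if ch.isprintable() else " " for ch in text)


# Hypothetical article fragment containing two control characters
raw = "L\u2019AFIGESE\x00 organise\x0b une journée"
clean = strip_unprintable(raw)
print(clean)
```

The cleaned text could then be passed to `Text(clean, hint_language_code='fr')` in place of the raw content.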





In [None]:
# Creation of a new dataframe containing the article id and its NER tag :
df_ner_polyglot = pd.DataFrame({"art_id": dataframe['art_id'], "ner": ner_polyglot})
df_ner_polyglot.head()

Unnamed: 0,art_id,ner
0,g1_5_0,"[(l’AFIGESE, PER), (DGA, ORG), (l’enquête.L’en..."
1,g1_5_1,"[(l’AFIGESE, PER), (l’UNCCAS, PER), (France, L..."
2,g1_5_2,"[(La Gazette, ORG), (Société Française, ORG), ..."
3,g1_5_3,"[(Gilles Alfonsi, PER), (l’AFIGESE, LOC), (Mic..."
4,g1_5_4,"[(L’AFIGESE, PER), (Groupe, ORG), (Communauté,..."


In [None]:
# Conversion to json format :

df_ner_polyglot.to_json(
    '/content/drive/MyDrive/PIP 2021/Demande/Arthur/NER_polyglot.json', orient='records')
# Note that polyglot's NER model cannot be saved



---




