Created on Friday 08 January 2021  

**Group 3 - Representation**  
**The objective of this notebook is to create a POS-tagging representation from the deduplicated data file.**

@authors : Arthur CARLET, Guillaume BERNARD, Neima MARCO, Nesrine AIDER, Lou-Ann CHAUSSE, Fannie MATHEY

# G3 : Part-Of-Speech Tagging (POS_Tagging)


---

## 1) Libraries and data import

In [1]:
# To be launched only once :

# Spacy's fr_core_news_lg model installation :

!pip install -U spacy
!python -m spacy download fr_core_news_lg

# Polyglot's model installation :

!pip install icu
!pip install pyicu
!pip install pycld2
!pip install morfessor
!pip install -U polyglot
!polyglot download embeddings2.fr
!polyglot download pos2.fr

# CamemBERT :
!pip install sentencepiece
!pip install transformers

# After installation, the runtime must be restarted

Collecting spacy
  Downloading https://files.pythonhosted.org/packages/e5/bf/ca7bb25edd21f1cf9d498d0023808279672a664a70585e1962617ca2740c/spacy-2.3.5-cp36-cp36m-manylinux2014_x86_64.whl (10.4MB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading https://files.pythonhosted.org/packages/c0/1a/c3e4ab982214c63d743fad57c45c5e68ee49e4ea4384d27b28595a26ad26/thinc-7.4.5-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)
Installing collected packages: thinc, spacy
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed spacy-2.3.5 thinc-7.4.5
Collecting fr_core_news_lg==2.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-2.3.0/fr_core_

In [None]:
# Libraries :

from polyglot.text import Text, Word
import polyglot
from icu import Locale
import nltk
import string
import re
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from nltk.tag import StanfordPOSTagger
import spacy
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import pickle
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# NLTK sentence tokenizer data and spaCy model :

nltk.download('punkt')
nlp = spacy.load("fr_core_news_lg")

# CamemBERT :
tokenizer = AutoTokenizer.from_pretrained("gilf/french-camembert-postag-model")
model = AutoModelForTokenClassification.from_pretrained(
    "gilf/french-camembert-postag-model")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



In [None]:
# Google Drive setup : run only if you use Google Colab

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Data loading :

DATA_PATH = '/content/drive/MyDrive/PIP 2021/Données'
dataframe = pd.read_json(DATA_PATH + "/Deduplicated/df_scrapped_g1_5_v0.json")

## 2) POS_Tagging

We selected 4 different methods of POS_tagging :

***- spaCy fr_core_news_lg*** : a French multi-task CNN trained on UD French Sequoia and WikiNER (Source : https://spacy.io/models/fr#fr_core_news_lg)

***- Stanford POS_Tagger*** : a Java implementation of log-linear part-of-speech taggers (Source : https://nlp.stanford.edu/software/tagger.shtml#About)

***- Polyglot*** : a natural language pipeline that supports massive multilingual applications. 15 languages are supported for POS tagging (Source : https://polyglot.readthedocs.io/en/latest/POS.html)

***- CamemBERT*** : a part-of-speech tagging model for French, trained on the free-french-treebank dataset available on GitHub. The base tokenizer and model used for training are 'camembert-base' (Source : https://huggingface.co/gilf/french-camembert-postag-model)



### 2.1) Spacy fr_core_news_lg model



In [None]:
# POS_Tagging implementation with the fr_core_news_lg model :

pos_tagging_lg = []

for content in tqdm(dataframe['art_content']):
    doc = nlp(content)
    pos_tagging_lg.append([(w.text, w.pos_) for w in doc])





In [None]:
# Creation of a new dataframe containing the article id and its POS_tag :

df_pos_tagging_lg = pd.DataFrame(
    {"art_id": dataframe["art_id"], "pos_tag": pos_tagging_lg})
df_pos_tagging_lg.head()

|   | art_id | pos_tag |
|---|--------|---------|
| 0 | g1_5_0 | [( , SPACE), (Cher, PROPN), (adhérent, NOUN), ... |
| 1 | g1_5_1 | [( , SPACE), (Pendant, ADP), (tout, ADJ), (le,... |
| 2 | g1_5_2 | [( , SPACE), (Le, DET), (26, NUM), (septembre,... |
| 3 | g1_5_3 | [( , SPACE), (Nous, PRON), (conseillons, VERB)... |
| 4 | g1_5_4 | [( , SPACE), (Face, NOUN), (à, ADP), (cette, D... |
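
A quick sanity check on this kind of output is to count how often each tag occurs. The sketch below assumes the `pos_tag` structure shown above (one list of `(token, tag)` pairs per article); the helper name `count_tags` and the sample data are illustrative and not part of the notebook.

```python
from collections import Counter


def count_tags(tagged_articles):
    """Count tag occurrences over articles, each a list of (token, tag) pairs."""
    counts = Counter()
    for article in tagged_articles:
        counts.update(tag for _token, tag in article)
    return counts


# Hypothetical sample mimicking the first tokens of two articles
sample = [
    [(" ", "SPACE"), ("Cher", "PROPN"), ("adhérent", "NOUN")],
    [(" ", "SPACE"), ("Le", "DET"), ("26", "NUM"), ("septembre", "NOUN")],
]
tag_counts = count_tags(sample)
print(tag_counts.most_common(2))
```

In the notebook, the same helper could presumably be applied to `df_pos_tagging_lg['pos_tag']` to get the tag distribution of the spaCy output.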


In [None]:
# Conversion to json format :

df_pos_tagging_lg.to_json(
    '/content/drive/MyDrive/PIP 2021/Demande/Arthur/Pos_tagging_spacy_lg.json', orient='records')

# Save the spaCy pipeline as a pickle file :
with open('/content/drive/MyDrive/PIP 2021/Pos Tagging/Guillaume/pos_spacy.pickle', 'wb') as f1:
    pickle.dump(nlp, f1)
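
One point to keep in mind when reloading the exported file: JSON has no tuple type, so each `(token, tag)` pair comes back as a two-element list. The sketch below illustrates this with the standard `json` module and restores the tuples; `restore_pairs` is a hypothetical helper and the record is made up for the example.

```python
import json


def restore_pairs(record):
    """Return a copy of a record whose 'pos_tag' entries are tuples again."""
    return {**record, "pos_tag": [tuple(pair) for pair in record["pos_tag"]]}


# Hypothetical record in the 'records' orientation (one dict per article)
records = [
    {"art_id": "g1_5_0", "pos_tag": [("Cher", "PROPN"), ("adhérent", "NOUN")]}
]

# After a JSON round trip the pairs are lists, not tuples
reloaded = json.loads(json.dumps(records))

# Convert them back before comparing with freshly computed tags
restored = [restore_pairs(record) for record in reloaded]
```

Whether the conversion is needed depends on the consumer: code that only unpacks `word, tag` works with lists as well.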

### 2.2) Stanford POS_Tagger 

In [None]:
# POS_Tagging implementation with the French version of the Stanford POS_Tagger :

JAR = '/content/drive/MyDrive/PIP 2021/Pos Tagging/Nesrine/stanford-postagger-4.2.0.jar'
STANFORD_FRENCH_TAGGER = '/content/drive/MyDrive/PIP 2021/Pos Tagging/Nesrine/french-ud.tagger'

pos_tagger_sf = StanfordPOSTagger(STANFORD_FRENCH_TAGGER, JAR, encoding='utf8')
pos_tagging_sf = []

for content in tqdm(dataframe['art_content']):
    res = pos_tagger_sf.tag(content.split())
    pos_tagging_sf.append(res)

In [None]:
# Creation of a new dataframe containing the article id and its POS_tag :

df_pos_tagging_sf = pd.DataFrame(
    {"art_id": dataframe["art_id"], "pos_tag": pos_tagging_sf})

df_pos_tagging_sf.head()

In [None]:
# Conversion to json format :

df_pos_tagging_sf.to_json(
    '/content/drive/MyDrive/PIP 2021/Demande/Arthur/Pos_tagging_stanford.json', orient='records')
# Note: the Stanford model is already contained in the jar file, so we don't need to save it again

### 2.3) Polyglot

In [None]:
# Convert polyglot's POS-tagging output format into spaCy's POS-tagging output format.
#   The only difference is the tag CONJ in polyglot's format,
#   which corresponds to the tag CCONJ in spaCy's format.
def post_process_polyglot(tags: list) -> list:
    """Rename polyglot's CONJ tag to spaCy's CCONJ.

    Parameters:
        tags: list of (word, tag) pairs returned by polyglot's Text(...).pos_tags
    Out:
        result: list of (word, tag) pairs with converted tags
    """
    result = []
    for word, tag in tags:
        # Exact match: str.replace('CONJ', 'CCONJ') would also turn SCONJ into SCCONJ
        result.append((word, 'CCONJ' if tag == 'CONJ' else tag))
    return result
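
An exact-match lookup table is one way to express this conversion; it avoids a pitfall of substring replacement, which would also rewrite `SCONJ` into `SCCONJ`. The sketch below is illustrative only: `POLYGLOT_TO_SPACY`, `convert_tags` and the sample pairs are assumed names and data, not part of the notebook.

```python
# Assumed mapping from polyglot tag names to spaCy tag names
POLYGLOT_TO_SPACY = {"CONJ": "CCONJ"}


def convert_tags(tags):
    """Rename tags by exact match, leaving every other tag untouched."""
    return [(word, POLYGLOT_TO_SPACY.get(tag, tag)) for word, tag in tags]


# Hypothetical tagger output
sample = [("et", "CONJ"), ("que", "SCONJ"), ("chat", "NOUN")]
converted = convert_tags(sample)

# Substring replacement, shown for comparison, also alters SCONJ
naive = [(word, tag.replace("CONJ", "CCONJ")) for word, tag in sample]

print(converted)
print(naive)
```

A dictionary also makes it easy to add further renamings later without touching the loop.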

In [None]:
pos_polyglot = []

for content in tqdm(dataframe['art_content']):
    # run polyglot model
    tags = Text(content, hint_language_code='fr').pos_tags
    pos_tag_result = post_process_polyglot(tags)
    pos_polyglot.append(pos_tag_result)  # add result to the list

In [None]:
# Creation of a new dataframe containing the article id and its POS_tag :

df_pos_poly = pd.DataFrame(
    {"art_id": dataframe["art_id"], "pos_tag": pos_polyglot})

df_pos_poly.head()

In [None]:
# Conversion to json format :

df_pos_poly.to_json(
    '/content/drive/MyDrive/PIP 2021/Demande/Arthur/Pos_tagging_polyglot.json', orient='records')

# Note: we don't save polyglot's model

### 2.4) CamemBERT 

After some experimentation, we did not manage to make CamemBERT work with our data: some sentences contained in 'art_content' were far too long for the model to handle.

Knowing this, we decided to suspend our work and research on this model for the time being.
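
One possible workaround, not tried in this notebook, is to split overly long sentences into fixed-size word windows before tagging. The sketch below shows only the splitting step, using the standard library; `chunk_words` and the limit of 100 words are assumptions, and a word count is only a rough proxy for the model's subword-token limit.

```python
def chunk_words(sentence, max_words=100):
    """Split a sentence into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be >= 1")
    words = sentence.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Hypothetical 250-word sentence
long_sentence = " ".join(f"mot{i}" for i in range(250))
chunks = chunk_words(long_sentence, max_words=100)
print([len(chunk.split()) for chunk in chunks])
```

Tagging each chunk independently would lose some context at the chunk boundaries, so the tags near a cut may be less reliable than those in the middle of a chunk.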

In [None]:
# Post-processing function that fixes the tagging of punctuation marks and numbers :
def post_process(txt: str, label: str) -> tuple:
    """Documentation
    Parameters:
        txt: the word analysed by the model
        label: the tag the model assigned to the analysed word
    Out:
        txt: the cleaned text
        label: the corrected label
    """
    # Non-empty token made only of punctuation characters
    # (a bare `txt in string.punctuation` would also accept the empty string)
    if txt and all(char in string.punctuation for char in txt):
        return (txt, "PUNC")
    # Token made only of digits
    if re.fullmatch(r"\d+", txt):
        return (txt, "NUM")
    # Otherwise strip punctuation from the token and keep the model's label
    txt = re.sub(r'[^\w\s]', '', txt)
    return (txt, label)

# Execution of the CamemBERT POS_tagging with output post_processed :


def get_pos_camembert(txt: str) -> list:
    """Documentation
    Parameters:
        txt: the text for which we want the POS tagging

    Out:
        res: list of (word, label) pairs

    Reference:
        1. https://huggingface.co/gilf/french-camembert-postag-model
    """
    # Start from an empty list so that an empty text returns [] rather than None
    res = []
    # Split the text into sentences, because the model fails on inputs that are too long
    tokenized = nltk.sent_tokenize(txt, language="french")
    for phrase in tokenized:
        tokens = nlp_token_class(phrase)  # model execution
        # Clean the output of each sentence and accumulate it
        res.extend(post_process(txt=x["word"], label=x["entity_group"])
                   for x in tokens)
    return res

In [None]:
# POS_Tagging implementation with CamemBERT :

nlp_token_class = pipeline(
    'ner', model=model, tokenizer=tokenizer, grouped_entities=True)
pos_tagging_camembert = [get_pos_camembert(
    x) for x in tqdm(dataframe['art_content'].values)]

In [None]:
# Creation of a new dataframe containing the article id and its POS_tag :

df_pos_tagging_camembert = pd.DataFrame(
    {"art_id": dataframe["art_id"], "pos_tag": pos_tagging_camembert})

df_pos_tagging_camembert.head()



---




