##### This module contains code for performing named entity recognition and part-of-speech tagging. For this tutorial, we'll use [spaCy](https://github.com/ian-nai/Non-English-NLP-Tutorial/blob/master/Documentation%20Resources.md#spacy) to complete these tasks.

##### Named entity recognition (NER) is the process of locating named entities in a text and classifying them into predefined categories (persons, places, organizations, and so on). Part-of-speech (POS) tagging assigns each token in a text a grammatical category, such as noun, verb, or adjective.
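
##### To make the two tasks concrete, here is a small hand-annotated sketch of what each one produces. The sentence, offsets, and labels are written by hand for illustration and are not the output of any model; the names `entities` and `pos_tags` are our own.

```python
# Hand-annotated example (not model output) showing the shape of each result
sentence = "Victor Hugo est né à Besançon."

# NER: character offsets into the sentence plus a category label
entities = [(0, 11, "PER"), (21, 29, "LOC")]
entity_texts = [(sentence[start:end], label) for start, end, label in entities]
print(entity_texts)

# POS tagging: one grammatical tag per token
pos_tags = [("Victor", "PROPN"), ("Hugo", "PROPN"), ("est", "AUX"),
            ("né", "VERB"), ("à", "ADP"), ("Besançon", "PROPN"), (".", "PUNCT")]
print(pos_tags)
```

##### NER returns spans that may cover several tokens, while POS tagging returns exactly one tag per token.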


##### First, let's try performing NER with spaCy. Please note that the NER results will often be inaccurate, because we are using a pretrained model as-is rather than training it on our own data. For more information about training spaCy models, consult the spaCy documentation: https://spacy.io/usage/training

In [None]:
import spacy
import csv

with open('cleaned_text.txt', 'r') as file:
    text_data = file.read()
    
nlp = spacy.load("fr_core_news_md")
doc = nlp(text_data)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
# Output our named entities into a CSV
with open('pos_tags.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['text', 'start_char', 'end_char', 'label']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()

    for ent in doc.ents:
        # ent.label_ is the readable label string; ent.label is its integer ID
        writer.writerow({'text': ent.text, 'start_char': ent.start_char, 'end_char': ent.end_char, 'label': ent.label_})
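
##### Once the entities are saved, a quick way to inspect them is to count how often each label occurs. The sketch below uses only the standard library. The helper name `count_labels` is our own, and the sample rows are hand-written in the same column layout as the CSV above; they are not real model output.

```python
import csv
import io
from collections import Counter

def count_labels(csv_file):
    """Count how often each value appears in the 'label' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row['label'] for row in reader)

# Hand-written sample rows in the same column layout (not real model output)
sample = io.StringIO(
    "text,start_char,end_char,label\n"
    "Paris,0,5,LOC\n"
    "Victor Hugo,10,21,PER\n"
    "Lyon,30,34,LOC\n"
)
label_counts = count_labels(sample)
print(label_counts.most_common())
```

##### To run this on the real output, pass an open file instead of the sample, for example `count_labels(open('pos_tags.csv', newline=''))`.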

##### We can perform part-of-speech tagging using the spacy-lefff library. From the library's [GitHub repository](https://github.com/sammous/spacy-lefff):
##### "This package allows to bring Lefff lemmatization and part-of-speech tagging to a spaCy custom pipeline. When POS tagging and Lemmatization are combined inside a pipeline, it improves your text preprocessing for French compared to the built-in spaCy French processing."

In [None]:
import spacy
from spacy_lefff import LefffLemmatizer, POSTagger
import csv

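# Load the French model
nlp = spacy.load("fr_core_news_md")

# Add the spacy-lefff POS tagger and lemmatizer to the pipeline.
# Note: passing component objects to add_pipe is the spaCy v2 API; spaCy v3
# expects registered component names instead (see the spacy-lefff README).
pos = POSTagger()
french_lemmatizer = LefffLemmatizer(after_melt=True, default=True)
nlp.add_pipe(pos, name='pos', after='parser')
nlp.add_pipe(french_lemmatizer, name='lefff', after='pos')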

# Open our file
with open('cleaned_text.txt', 'r') as file:
    text_data = file.read()

    
# Run the pipeline and print the attributes we want for each token
doc = nlp(text_data)
for d in doc:
    print(d.text, d.pos_, d._.melt_tagger, d._.lefff_lemma, d.tag_, d.lemma_)
    
# Output our tagged data into a CSV
with open('text_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['text', 'pos', 'melt', 'lefff_lemma', 'tag', 'lemma']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    
    for d in doc:
        if d.pos_ != "SPACE":
            writer.writerow({'text': d.text, 'pos': d.pos_, 'melt': d._.melt_tagger, 'lefff_lemma': d._.lefff_lemma, 'tag': d.tag_, 'lemma': d.lemma_})
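
##### Because the CSV stores both the Lefff lemma and spaCy's own lemma, we can list the tokens on which the two disagree. This is a minimal sketch using only the standard library. The helper name `lemma_disagreements` is our own, and the sample rows are hand-written to mirror the column layout above; they are not real output from either tool.

```python
import csv
import io

def lemma_disagreements(csv_file):
    """Return (text, lefff_lemma, lemma) for rows where the two lemmas differ."""
    reader = csv.DictReader(csv_file)
    return [
        (row['text'], row['lefff_lemma'], row['lemma'])
        for row in reader
        # Skip rows with an empty lefff_lemma (no Lefff lemma was recorded)
        if row['lefff_lemma'] and row['lefff_lemma'] != row['lemma']
    ]

# Hand-written sample rows in the same column layout (not real model output)
sample = io.StringIO(
    "text,pos,melt,lefff_lemma,tag,lemma\n"
    "mangeaient,VERB,V,manger,VERB,manger\n"
    "pommes,NOUN,NC,pomme,NOUN,pommes\n"
    "!,PUNCT,PONCT,,PUNCT,!\n"
)
differences = lemma_disagreements(sample)
print(differences)
```

##### To run this on the real output, pass an open file instead of the sample, for example `lemma_disagreements(open('text_data.csv', newline=''))`.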

##### The [Stanza package](https://stanfordnlp.github.io/stanza/index.html) offers another way to perform these tasks. The cell below runs its named entity recognizer; Stanza also provides a part-of-speech processor. For further information about Stanza, please consult the [documentation](https://github.com/ian-nai/Non-English-NLP-Tutorial/blob/master/Documentation%20Resources.md#stanza). This [web demo](http://stanza.run/) of the package may also be of interest.

##### For the purposes of this tutorial, we won't save the Stanza output to a CSV. Note that Stanza's pipeline differs from those of spaCy and NLTK; consulting each tool's documentation is the best way to determine which one suits your needs.

In [None]:
import stanza

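# Open and read our file
with open('cleaned_text.txt', 'r') as file:
    text_data = file.read()

# Set up our Stanza pipeline, including the language.
# tokenize_pretokenized=True tells Stanza to skip its own tokenizer and treat
# the input as whitespace-separated tokens, with one sentence per line.
nlp = stanza.Pipeline(lang='fr', processors='tokenize,ner', tokenize_pretokenized=True)

doc = nlp(text_data)

# Print every named entity found in each sentence
print([ent for sent in doc.sentences for ent in sent.ents])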
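
##### Since the pipeline above is created with `tokenize_pretokenized=True`, Stanza relies on the input already being split into tokens and sentences. As far as we understand the Stanza documentation, a pretokenized pipeline accepts either a string with whitespace-separated tokens and one sentence per line, or a list of token lists. The sketch below prepares the second form with plain Python. The helper name `to_pretokenized` and the sample text are our own, and the sketch assumes the cleaned text already follows the one-sentence-per-line layout.

```python
def to_pretokenized(text):
    """Split text into sentences (one per line) and tokens (by whitespace)."""
    return [line.split() for line in text.splitlines() if line.strip()]

# Hand-written sample: tokens separated by spaces, one sentence per line
sample_text = "Victor Hugo est né à Besançon .\nIl a vécu à Paris ."
sentences = to_pretokenized(sample_text)
print(sentences)
```

##### The resulting list could then be passed to the pipeline in place of the raw string (`doc = nlp(sentences)`), which makes the sentence and token boundaries explicit.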