# NLP for other languages

Many of the methods in the `NLTK` library, such as `pos_tag`, were trained on texts in modern English. If you want to work with other languages, you need to change the model underlying these methods. 

For texts in the Dutch language, for instance, you can make use of the [Alpino](https://www.let.rug.nl/~vannoord/trees/) corpus. The code below trains a unigram and a bigram tagger on the tagged sentences in this corpus, and then uses the resulting tagger in place of the default `pos_tag` function:


In [None]:
import nltk
nltk.download('alpino')

from nltk.corpus import alpino as alp
from nltk.tag import UnigramTagger, BigramTagger
training_corpus = alp.tagged_sents()
unitagger = UnigramTagger(training_corpus)
bitagger = BigramTagger(training_corpus, backoff=unitagger)
pos_tag = bitagger.tag

In [None]:
from tdm import word_tokenise

sentence = 'Het was nog donker, toen in de vroege morgen van de tweeëntwintigste december 1946 in onze stad, op de eerste verdieping van het huis Schilderskade 66, de held van deze geschiedenis, Frits van Egters, ontwaakte.'

words = word_tokenise(sentence)
pos_tag(words)
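
To make clear what the `backoff` argument does, here is a simplified pure-Python sketch of the idea. It is an illustration only, not NLTK's implementation; the function names `train_tables` and `tag_with_backoff` and the two toy sentences are made up for this example. The bigram table looks at the previous tag together with the current word. The unigram table, which looks at the word alone, is consulted only when that combination was never seen during training.

```python
from collections import Counter, defaultdict

def train_tables(tagged_sents):
    """Build unigram and bigram lookup tables from tagged sentences."""
    uni = defaultdict(Counter)   # word -> counts of tags
    bi = defaultdict(Counter)    # (previous tag, word) -> counts of tags
    for sent in tagged_sents:
        prev_tag = None
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev_tag, word)][tag] += 1
            prev_tag = tag
    # Keep only the most frequent tag for every context
    def most_frequent(table):
        return {key: counts.most_common(1)[0][0] for key, counts in table.items()}
    return most_frequent(uni), most_frequent(bi)

def tag_with_backoff(words, uni, bi):
    """Tag words with the bigram table, backing off to the unigram table."""
    tagged = []
    prev_tag = None
    for word in words:
        tag = bi.get((prev_tag, word))
        if tag is None:
            # Context not seen in training: fall back on the word alone
            tag = uni.get(word)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Hypothetical toy training data, standing in for alp.tagged_sents()
toy_corpus = [
    [('de', 'det'), ('man', 'noun'), ('loopt', 'verb')],
    [('het', 'det'), ('huis', 'noun'), ('staat', 'verb')],
]

uni, bi = train_tables(toy_corpus)
result = tag_with_backoff(['loopt', 'de', 'fiets'], uni, bi)
print(result)
# [('loopt', 'verb'), ('de', 'det'), ('fiets', None)]
```

None of the three words occurs in a context the bigram table knows, so 'loopt' and 'de' receive their tags from the unigram table. The word 'fiets' does not occur in the training data at all and is tagged `None`, which is also what the NLTK taggers return for unknown words when no further backoff is available.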

Another option is to use a different NLP library named [spaCy](https://spacy.io/). This library offers support for [a wide range of languages](https://spacy.io/usage/models). These language models can all be downloaded from the spaCy website.

spaCy is not part of the Anaconda distribution of Python, so if you have never worked with spaCy before, the library needs to be installed first.

In [None]:
import sys
# Install spaCy for the Python interpreter that runs this notebook
!{sys.executable} -m pip install spacy


spaCy offers a number of [language models for Dutch](https://spacy.io/models/nl). You can use the code below to download the model named `nl_core_news_lg`.


In [None]:
import sys
# Download the Dutch model for the Python interpreter that runs this notebook
!{sys.executable} -m spacy download nl_core_news_lg

In [None]:
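import sys
# Models for other languages are downloaded in the same way, e.g. a small Portuguese model
!{sys.executable} -m spacy download pt_core_news_sm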

After the model has been downloaded, it needs to be loaded into your code, so that you can start to work with it. The `load()` function in `spaCy` returns a new object which can be used to add linguistic and semantic annotations. In the cell below, this object is given the name `nlp`.

In [None]:
import spacy
nlp = spacy.load("nl_core_news_lg")

This newly created `nlp` object can be given a string as input. Its output is an annotated text giving information about a number of grammatical and morphological aspects of this string, including the parts of speech, the sentence boundaries and the lemmatised forms.

In the code below, the result of calling `nlp()` is assigned to a variable named `tagged_text`. The annotations can be accessed by navigating through this annotated text token by token.

In [None]:
lemmatizer = nlp.get_pipe("lemmatizer")
tagged_text = nlp("'Het is gezien', mompelde hij, 'het is niet onopgemerkt gebleven.'")

for w in tagged_text:
    print( f'{w.text} (pos: {w.pos_} ; lemma: {w.lemma_})' )
    

The code below uses `spaCy` to produce data about the number of tokens, sentences, adverbs, verbs, pronouns, nouns, adjectives, conjunctions and auxiliary verbs for all the texts in a folder named 'Corpus'. The results are written to a file named `nlp.csv`. The process of adding linguistic annotations may demand some time, unfortunately. The code below uses the `timeit` library to track how long this process actually takes. With longer texts, this process may take more than a minute.

In [None]:
import timeit
from tdm import removeExtension
import spacy
import os
import re

dir = 'Corpus'

out = open( 'nlp.csv' , 'w' ,  encoding = 'utf-8')

# CSV header
out.write( 'title,tokens,sentences,' )
out.write(  'adverbs,verbs,pronouns,nouns,adjectives,conjunctions,aux-verbs\n')


for file in os.listdir(dir):
    # Only process plain-text files
    if re.search( r'\.txt$' , file ):
        print( f'Adding annotations for {file} ... ')
        out.write( removeExtension(file) )
        path = os.path.join(dir,file)
        # Assumption: the corpus files are UTF-8 encoded
        with open(path, encoding = 'utf-8') as file_handler:
            full_text = file_handler.read()
        start_time = timeit.default_timer()
        annotated_text = nlp(full_text)
        end_time = timeit.default_timer()
        print( f'Done! The annotation process took {end_time-start_time} seconds.')
        nr_words = len(annotated_text) 
        nr_sentences = len(list( annotated_text.sents ))
        out.write( f',{nr_words},{nr_sentences}')

        # Count how often each part-of-speech tag occurs in this text
        pos = dict()
        for w in annotated_text:
            pos[ w.pos_ ] = pos.get( w.pos_ , 0 ) + 1
            
        out.write( f",{pos.get('ADV',0)}" )
        out.write( f",{pos.get('VERB',0)}" )
        out.write( f",{pos.get('PRON',0)}" )
        out.write( f",{pos.get('NOUN',0)}" )
        out.write( f",{pos.get('ADJ',0)}" )
        out.write( f",{pos.get('SCONJ',0)+pos.get('CCONJ',0)}" )
        out.write( f",{pos.get('AUX',0)}" )
        out.write( '\n')
            
out.close()
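
The counting step can also be separated from the slow annotation step, which makes it easier to check on a small example. The cell below is a minimal sketch of that idea, using only the standard library. The names `COLUMNS` and `count_pos` are hypothetical, and `example_tags` is a hand-made stand-in for a list such as `[w.pos_ for w in annotated_text]`.

```python
import csv
import io
from collections import Counter

# Column name -> the part-of-speech tags that count towards this column
COLUMNS = {
    'adverbs': ['ADV'],
    'verbs': ['VERB'],
    'pronouns': ['PRON'],
    'nouns': ['NOUN'],
    'adjectives': ['ADJ'],
    'conjunctions': ['SCONJ', 'CCONJ'],
    'aux-verbs': ['AUX'],
}

def count_pos(tags):
    """Return one count per column in COLUMNS for a list of POS tags."""
    counts = Counter(tags)   # a Counter returns 0 for tags that do not occur
    return [sum(counts[tag] for tag in group) for group in COLUMNS.values()]

# Hand-made stand-in for [w.pos_ for w in annotated_text]
example_tags = ['PRON', 'AUX', 'VERB', 'PUNCT', 'VERB', 'PRON', 'CCONJ', 'ADV']

# Write to an in-memory buffer here; a real file object works the same way
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['title', *COLUMNS])
writer.writerow(['example', *count_pos(example_tags)])

csv_text = buffer.getvalue()
print(csv_text)
# title,adverbs,verbs,pronouns,nouns,adjectives,conjunctions,aux-verbs
# example,1,2,2,0,0,1,1
```

Using `csv.writer` instead of assembling the lines by hand has the advantage that a title containing a comma is quoted automatically, so the columns cannot shift.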