# Purpose of this notebook

This notebook gives a brief comparison with packages that provide more generic NLP functionality, such as:
- [spacy](https://spacy.io/usage/spacy-101) - see also our own [methods_intro_nlp_spacy_basics](methods_intro_nlp_spacy_basics.ipynb)
  - ~two dozen languages, including Dutch
- [pattern](https://github.com/clips/pattern)
  - 6 languages, including Dutch
- [nltk](https://www.nltk.org/)
  - no out-of-the-box Dutch support, though [training basic support is relatively simple](https://stackoverflow.com/questions/40212895/nltk-tag-dutch-sentence)
- [textblob](https://textblob.readthedocs.io/en/dev/) 
  - no out-of-the-box Dutch support, though [there's this](https://github.com/gvisniuc/textblob-nl)
- [CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
  - 8 languages, no Dutch, though 

...in part just to mention them, in part to help you choose one.


## When you need these - and when you may _not_ need these

Various tasks are already automated, or common enough that there are plenty of interactive tools,
from code-driven widgets to complete no-code solutions.

Data annotation is a good example of this. You want documents going in one side and
annotation data coming out the other, and a quick web search for purely online annotation
reveals tools ranging from [label studio](https://labelstud.io/), [doccano](https://doccano.github.io/doccano/),
[ML-annotate](https://github.com/falcony-io/ml-annotate), [brat](http://brat.nlplab.org/), and [annotator.js](http://annotatorjs.org/)
to more purpose-specific projects such as [lawnotation](https://www.lawnotation.org/).

With such tools, your questions come down to
- "after I do a lot of clicks, how is the thing it spits out usable to me?",
- "is this particular tool already aware of the language and scope I'm working in, and does it try to help me along?"

and specifically _not_
- "what do I need to install?"
- "what do I need to learn to even get started, in terms of programming, how the package works, and whether the output is what I need in the first place?"



In [14]:
import pprint

test = "Python is a high-level, general-purpose programming language. It can be quite useful."

## spacy

In [None]:
!pip3 install -U spacy
!python3 -m spacy download en_core_web_lg

In [15]:
import spacy
english_lg  = spacy.load('en_core_web_lg')   

ana = english_lg( test )

pprint.pprint(
    [list( (tok.text, tok.pos_)  for tok in ana ),
     list(ana.sents)]
)

[[('Python', 'PROPN'),
  ('is', 'AUX'),
  ('a', 'DET'),
  ('high', 'ADJ'),
  ('-', 'PUNCT'),
  ('level', 'NOUN'),
  (',', 'PUNCT'),
  ('general', 'ADJ'),
  ('-', 'PUNCT'),
  ('purpose', 'NOUN'),
  ('programming', 'NOUN'),
  ('language', 'NOUN'),
  ('.', 'PUNCT'),
  ('It', 'PRON'),
  ('can', 'AUX'),
  ('be', 'AUX'),
  ('quite', 'ADV'),
  ('useful', 'ADJ'),
  ('.', 'PUNCT')],
 [Python is a high-level, general-purpose programming language.,
  It can be quite useful.]]


## Textblob

See also https://textblob.readthedocs.io/en/dev/quickstart.html

In [None]:
!pip3 install -U textblob

In [16]:
from textblob import TextBlob

ana = TextBlob( test )

pprint.pprint(
    [ana.tags,
     ana.sentences]
)

[[('Python', 'NNP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('high-level', 'JJ'),
  ('general-purpose', 'JJ'),
  ('programming', 'NN'),
  ('language', 'NN'),
  ('It', 'PRP'),
  ('can', 'MD'),
  ('be', 'VB'),
  ('quite', 'RB'),
  ('useful', 'JJ')],
 [Sentence("Python is a high-level, general-purpose programming language."),
  Sentence("It can be quite useful.")]]
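
Note that the tags here are Penn Treebank tags (`NNP`, `VBZ`, ...), while the spaCy cell above printed the coarser Universal POS tags from `tok.pos_` (spaCy's English models also expose Penn-style tags as `tok.tag_`). To compare outputs side by side, a rough mapping can help. The sketch below is an assumption-laden illustration: the `PTB_TO_UPOS` table and the `to_upos` helper are hypothetical names, and the table only covers the tags that appear in the output above.

```python
# Rough Penn Treebank -> Universal POS mapping, limited to the tags
# that appear in the TextBlob output above (an illustrative subset).
PTB_TO_UPOS = {
    'NNP': 'PROPN',
    'NN':  'NOUN',
    'VBZ': 'VERB',   # spaCy labels the copula "is" as AUX instead
    'VB':  'VERB',   # likewise for "be" after a modal
    'MD':  'AUX',
    'DT':  'DET',
    'JJ':  'ADJ',
    'RB':  'ADV',
    'PRP': 'PRON',
}

def to_upos(tagged):
    """Map (word, Penn tag) pairs to (word, coarse tag); unknown tags become 'X'."""
    return [(word, PTB_TO_UPOS.get(tag, 'X')) for word, tag in tagged]

# Copied from the TextBlob output above.
textblob_tags = [
    ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
    ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN'),
    ('It', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('quite', 'RB'), ('useful', 'JJ'),
]

coarse_tags = to_upos(textblob_tags)
```

A table like this cannot reproduce distinctions that depend on context, such as spaCy's choice of `AUX` for "is", so treat it as a convenience for eyeballing differences, not as a conversion you can rely on.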


## NLTK

In [21]:
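# word_tokenize and sent_tokenize need NLTK's Punkt data; if it is missing, run
# nltk.download('punkt') (recent NLTK versions ask for 'punkt_tab' instead)
from nltk import word_tokenize, sent_tokenize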

pprint.pprint(
    [word_tokenize(test), 
     sent_tokenize(test)])


[['Python',
  'is',
  'a',
  'high-level',
  ',',
  'general-purpose',
  'programming',
  'language',
  '.',
  'It',
  'can',
  'be',
  'quite',
  'useful',
  '.'],
 ['Python is a high-level, general-purpose programming language.',
  'It can be quite useful.']]
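
The packages do not agree on token boundaries: NLTK keeps `high-level` as one token, while spaCy split it into `high`, `-`, `level`. If you want to combine output from two tools, you need to line the tokens up first. The following is a minimal sketch of one way to do that, assuming the finer tokenization only ever splits tokens of the coarser one; `align_tokens` is a hypothetical helper, not part of either library.

```python
def align_tokens(coarse, fine):
    """Group tokens from a finer tokenization under the coarser tokens they make up.

    Assumes each coarse token is exactly the concatenation of one or more
    consecutive fine tokens; raises ValueError when that does not hold.
    """
    aligned = []
    fine_iter = iter(fine)
    for token in coarse:
        pieces = []
        while ''.join(pieces) != token:
            try:
                pieces.append(next(fine_iter))
            except StopIteration:
                raise ValueError('ran out of fine tokens at %r' % token)
            if not token.startswith(''.join(pieces)):
                raise ValueError('cannot align %r with %r' % (token, pieces))
        aligned.append((token, pieces))
    return aligned

# Token lists copied from the NLTK and spaCy outputs above.
nltk_tokens = ['Python', 'is', 'a', 'high-level', ',', 'general-purpose',
               'programming', 'language', '.', 'It', 'can', 'be', 'quite',
               'useful', '.']
spacy_tokens = ['Python', 'is', 'a', 'high', '-', 'level', ',', 'general', '-',
                'purpose', 'programming', 'language', '.', 'It', 'can', 'be',
                'quite', 'useful', '.']

aligned = align_tokens(nltk_tokens, spacy_tokens)
# Keep only the places where the two tokenizations differ.
split_tokens = [(tok, parts) for tok, parts in aligned if len(parts) > 1]
```

For this sentence, `split_tokens` holds just the two hyphenated words. Real text will have cases this sketch does not handle, for example when neither tokenization is strictly finer than the other.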


## pattern

In [None]:
!pip3 install pattern

In [44]:
from pattern.en import parse
ana = parse(test,
     tokenize = True,  
         tags = True,  
       chunks = True,  
    relations = True,  
      #lemmata = True,  
        light = False)


In [45]:
# one list per sentence; each token is [word, POS tag, chunk tag, PNP tag, relation]
pprint.pprint( ana.split() )


[[['Python', 'NNP', 'B-NP', 'O', 'NP-SBJ-1'],
  ['is', 'VBZ', 'B-VP', 'O', 'VP-1'],
  ['a', 'DT', 'B-NP', 'O', 'NP-OBJ-1'],
  ['high-level', 'JJ', 'I-NP', 'O', 'NP-OBJ-1'],
  [',', ',', 'I-NP', 'O', 'NP-OBJ-1'],
  ['general-purpose', 'JJ', 'I-NP', 'O', 'NP-OBJ-1'],
  ['programming', 'NN', 'I-NP', 'O', 'NP-OBJ-1'],
  ['language', 'NN', 'I-NP', 'O', 'NP-OBJ-1'],
  ['.', '.', 'O', 'O', 'O']],
 [['It', 'PRP', 'B-NP', 'O', 'NP-SBJ-1'],
  ['can', 'MD', 'B-VP', 'O', 'VP-1'],
  ['be', 'VB', 'I-VP', 'O', 'VP-1'],
  ['quite', 'RB', 'B-ADJP', 'O', 'O'],
  ['useful', 'JJ', 'I-ADJP', 'O', 'O'],
  ['.', '.', 'O', 'O', 'O']]]
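
The third column above marks chunks in IOB style: `B-NP` begins a noun phrase, `I-NP` continues it, and `O` is outside any chunk. To get from these per-token tags to actual phrases, something like the following sketch could work. It operates on the plain nested lists printed above and does not use pattern itself; `chunks_from_iob` is a hypothetical helper name.

```python
def chunks_from_iob(sentence):
    """Collect (label, [words]) chunks from rows of [word, pos, chunk, pnp, relation]."""
    chunks = []
    current = None
    for word, _pos, chunk, _pnp, _relation in sentence:
        if chunk == 'O':
            current = None          # outside any chunk; close the open one
            continue
        prefix, _, label = chunk.partition('-')
        # Start a new chunk on B-, or on a stray I- with no matching open chunk.
        if prefix == 'B' or current is None or current[0] != label:
            current = (label, [word])
            chunks.append(current)
        else:
            current[1].append(word)
    return chunks

# Copied from the pattern output above.
parsed = [
    [['Python', 'NNP', 'B-NP', 'O', 'NP-SBJ-1'],
     ['is', 'VBZ', 'B-VP', 'O', 'VP-1'],
     ['a', 'DT', 'B-NP', 'O', 'NP-OBJ-1'],
     ['high-level', 'JJ', 'I-NP', 'O', 'NP-OBJ-1'],
     [',', ',', 'I-NP', 'O', 'NP-OBJ-1'],
     ['general-purpose', 'JJ', 'I-NP', 'O', 'NP-OBJ-1'],
     ['programming', 'NN', 'I-NP', 'O', 'NP-OBJ-1'],
     ['language', 'NN', 'I-NP', 'O', 'NP-OBJ-1'],
     ['.', '.', 'O', 'O', 'O']],
    [['It', 'PRP', 'B-NP', 'O', 'NP-SBJ-1'],
     ['can', 'MD', 'B-VP', 'O', 'VP-1'],
     ['be', 'VB', 'I-VP', 'O', 'VP-1'],
     ['quite', 'RB', 'B-ADJP', 'O', 'O'],
     ['useful', 'JJ', 'I-ADJP', 'O', 'O'],
     ['.', '.', 'O', 'O', 'O']],
]

chunks_per_sentence = [chunks_from_iob(sentence) for sentence in parsed]
```

For the first sentence this gives a noun phrase `Python`, a verb phrase `is`, and one long noun phrase running from `a` to `language`.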


## CoreNLP

See also https://stanfordnlp.github.io/CoreNLP/