# NLP Toolkits

Several Natural Language Processing (NLP) toolkits are available for Python. They provide basic NLP functions (tokenization, PoS tagging, chunking, etc.).


In [1]:
from __future__ import unicode_literals

---

## [NLTK](http://www.nltk.org/)

* The best-known toolkit.
* Lets you build an NLP pipeline and some applications.
* Developed with teaching in mind: it may not be the fastest toolkit available, but it is the easiest to understand.
* It has some tools and models for Portuguese: <http://www.nltk.org/howto/portuguese_en.html>
* Models must be downloaded with `nltk.download()`.
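
The download step can be scripted so that a notebook fetches only what is missing. The sketch below assumes the resource names `punkt`, `mac_morpho` and `floresta` (the tokenizer model and the two corpora used in the cells that follow). The helper name `ensure_nltk_resources` is hypothetical, and newer NLTK releases may expect a different tokenizer package.

```python
# Map of NLTK package id -> path checked by nltk.data.find().
# These names are assumptions based on the resources this notebook uses.
REQUIRED_RESOURCES = {
    "punkt": "tokenizers/punkt",
    "mac_morpho": "corpora/mac_morpho",
    "floresta": "corpora/floresta",
}


def ensure_nltk_resources(resources=REQUIRED_RESOURCES):
    """Download each resource only if NLTK cannot find it locally.

    Returns the list of package ids that had to be downloaded.
    """
    import nltk  # imported lazily so the mapping above is usable without NLTK

    downloaded = []
    for package, path in sorted(resources.items()):
        try:
            nltk.data.find(path)
        except LookupError:
            # nltk.data.find raises LookupError when the resource is absent.
            nltk.download(package)
            downloaded.append(package)
    return downloaded
```

Calling `ensure_nltk_resources()` once at the top of the notebook needs network access the first time it runs.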

In [2]:
import nltk

In [3]:
nltk.word_tokenize(u'Geladeira Brastemp c/ painel branco-amarelo. Atualização em (09/12/2015).', language='portuguese')

[u'Geladeira',
 u'Brastemp',
 u'c/',
 u'painel',
 u'branco-amarelo',
 u'.',
 u'Atualiza\xe7\xe3o',
 u'em',
 u'(',
 u'09/12/2015',
 u')',
 u'.']
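
To see what the tokenizer adds over plain whitespace splitting, the sketch below contrasts `str.split` with a small regular expression that happens to reproduce the token list above for this one sentence. The pattern is an illustrative assumption, not NLTK's algorithm.

```python
import re

text = "Geladeira Brastemp c/ painel branco-amarelo. Atualização em (09/12/2015)."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive_tokens = text.split()

# Illustrative pattern (an assumption, not what NLTK does internally):
#   - a run of word characters, optionally joined by "-" or "/"
#     (keeps "branco-amarelo" and "09/12/2015" whole), with an optional
#     trailing "/" so the abbreviation "c/" survives;
#   - otherwise, any single character that is neither a word character
#     nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-/]\w+)*/?|[^\w\s]", re.UNICODE)
regex_tokens = TOKEN_PATTERN.findall(text)

print(naive_tokens)
print(regex_tokens)
```

With `str.split`, `branco-amarelo.` and `(09/12/2015).` keep their punctuation attached, which is the main thing a real tokenizer fixes.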

In [4]:
from nltk.corpus import mac_morpho

In [6]:
mac_morpho.tagged_words()

[(u'Jersei', u'N'), (u'atinge', u'V'), ...]

In [7]:
from nltk.corpus import floresta

In [8]:
floresta.parsed_sents()

[Tree('UTT+np', [Tree('>N+art', ['Um']), Tree('H+n', ['revivalismo']), Tree('N<+adj', ['refrescante'])]), Tree('STA+fcl', [Tree('SUBJ+np', [Tree('>N+art', ['O']), Tree('H+prop', ['7_e_Meio'])]), Tree('P+v-fin', ['\xe9']), Tree('SC+np', [Tree('>N+art', ['um']), Tree('H+n', ['ex-libris']), Tree('N<+pp', [Tree('H+prp', ['de']), Tree('P<+np', [Tree('>N+art', ['a']), Tree('H+n', ['noite']), Tree('N<+adj', ['algarvia'])])])]), Tree('.', ['.'])]), ...]
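
The node labels in these trees pack two pieces of information around a `+` sign: syntactic information before it and a conventional part-of-speech tag after it (`H+n`, `>N+art`). A minimal sketch, following the approach described in the NLTK Portuguese how-to, strips the prefix. The sample labels are copied from the output above and the name `simplify_tag` is illustrative.

```python
def simplify_tag(label):
    """Return the part-of-speech part of a Floresta label such as 'H+n'."""
    if "+" in label:
        return label[label.index("+") + 1:]
    return label  # labels without '+', e.g. punctuation, are kept as-is


# Labels copied from the first parsed sentence shown above.
labels = [">N+art", "H+n", "N<+adj"]
simplified = [simplify_tag(label) for label in labels]
print(simplified)
```

With the corpus installed, the same function could be applied to every pair returned by `floresta.tagged_words()`.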

---

## [Pattern](http://www.clips.ua.ac.be/pattern)

* Pattern is a web mining module for the Python programming language.
* It has tools for data mining (Google, Twitter and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and visualization.

In [9]:
from pattern.en import tag

In [10]:
s = "I eat pizza with a fork."
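tagged = tag(s)
print(tagged)
# Use a loop variable other than `tag` so the imported function is not shadowed.
for word, pos in tagged:
    if pos == "NN":  # Keep singular common nouns (Penn Treebank tag NN).
        print(word)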

[(u'I', u'PRP'), (u'eat', u'VBP'), (u'pizza', u'NN'), (u'with', u'IN'), (u'a', u'DT'), (u'fork', u'NN'), (u'.', u'.')]
pizza
fork


In [11]:
from pattern.web import Twitter

In [12]:
twitter = Twitter(language='pt')

In [13]:
for tweet in twitter.search('Dilma', cached=False):
    print(tweet.text)

RT @Zaga_Silos: Dilma quer cobrar do povo um ajuste fiscal, sendo que apenas o desviado no Petrolão daria pra fazer dezenas de ajustes fisc…
RT @o_antagonista: A campanha de Dilma custou um bilhão de reais, mas "só" R$ 300 milhões foram declarados https://t.co/szqs23Ov5z https://…
Dilma deverá centrar sua política econômica(?) na expansão do crédito e do consumo, enquanto o Banco Central estuda elevar os juros. PODE?
@diogomainardi e você tem dúvida? Todos recebem. Aecio, Dilma, Lula... Ou vanos ser ingênuos ?
RT @SenadoFederal: Dilma sanciona lei que regulariza repatriação de dinheiro de brasileiro no exterior https://t.co/yXVH2if6TM
RT @rodolfogbw: Dilma viola a Lei de Responsabilidade Fiscal e a punição é severa...para nós! https://t.co/mKKndWhgKy
RT @Ricamconsult: A Dilma está tornando os BRs + fortes.Há 5 anos, eram necessárias 2 pessoas p carregar R$100 em compras de supermercado; …
hhahahaahahahaha

Dilma e as Dilmarchinhas de Carnaval https://t.co/E14YOV7JFQ via @YouTube
RT @di

---

## [TextBlob](https://textblob.readthedocs.org/en/dev/)

* It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
* As the website says, TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.


In [14]:
from textblob import TextBlob
text = 'Geladeira Brastemp 2 Portas Branca com sensor de porta aberta'
blob = TextBlob(text)

In [15]:
blob.detect_language()

u'pt'

In [16]:
blob.tags

[(u'Geladeira', u'NNP'),
 (u'Brastemp', u'NNP'),
 (u'2', u'CD'),
 (u'Portas', u'NNP'),
 (u'Branca', u'NNP'),
 (u'com', u'NN'),
 (u'sensor', u'NN'),
 (u'de', u'IN'),
 (u'porta', u'FW'),
 (u'aberta', u'NN')]

In [17]:
blob.noun_phrases

WordList([u'geladeira brastemp', u'portas branca', u'com sensor', u'porta aberta'])

----

## [polyglot](http://polyglot.readthedocs.org/l)

Polyglot is a natural language pipeline that supports massive multilingual applications.

Features:

* Tokenization (165 Languages)
* Language detection (196 Languages)
* Named Entity Recognition (40 Languages)
* Part of Speech Tagging (16 Languages)
* Sentiment Analysis (136 Languages)
* Word Embeddings (137 Languages)
* Morphological analysis (135 Languages)
* Transliteration (69 Languages)


In [18]:
# path used for polyglot downloaded data
import polyglot
polyglot.data_path = '/usr/share/'

In [19]:
from polyglot.downloader import downloader
downloader.supported_tasks(lang="pt")

[u'tsne2',
 u'sentiment2',
 u'sgns2',
 u'morph2',
 u'transliteration2',
 u'counts2',
 u'pos2',
 u'ner2',
 u'embeddings2']
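
Each task listed above corresponds to a downloadable package whose id, in polyglot's documentation, is the task name followed by a dot and the language code (for example `pos2.pt`). The sketch below builds those ids and, optionally, passes them to `downloader.download`. The helper names are hypothetical and the download step needs network access.

```python
def package_ids(tasks, lang):
    """Build polyglot package ids such as 'pos2.pt' from task names."""
    return ["{}.{}".format(task, lang) for task in tasks]


def download_packages(tasks, lang):
    """Download the models for the given tasks (requires polyglot + network)."""
    from polyglot.downloader import downloader  # lazy import

    for package in package_ids(tasks, lang):
        downloader.download(package)


# Tasks used by the cells that follow: POS tagging, embeddings,
# morphemes and sentiment.
needed = package_ids(["pos2", "embeddings2", "morph2", "sentiment2"], "pt")
print(needed)
```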

In [20]:
from polyglot.text import Text, Word

In [21]:
text = Text(u'Geladeira Brastemp c/ painel branco-amarelo. Atualização em (09/12/2015).')

In [22]:
text.detect_language()

'pt'

In [23]:
print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))

Word            POS Tag
------------------------------
Geladeira       NOUN
Brastemp        NUM
c               NUM
/               PUNCT
painel          NOUN
branco          ADJ
-               PUNCT
amarelo         ADJ
.               PUNCT
Atualização     NOUN
em              ADP
(               PUNCT
09              NUM
/               PUNCT
12              NUM
/               PUNCT
2015            NUM
)               PUNCT
.               PUNCT


In [24]:
word = Word(u'geladeira', language="pt")
print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])

Neighbors (Synonyms) of geladeira
------------------------------
gaveta          
bochecha        
estante         
fornalha        
jangada         
barricada       
Disneylândia    
taberna         
taverna         
lavanderia      


The first 10 dimensions out of the 256 dimensions

[-1.71507478 -0.63970172 -0.40256602  0.10979888  0.92080826 -0.13696
  0.23301034 -0.11777839 -0.1058483   0.22535691]
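
Neighbors like these come from comparing word vectors: words whose vectors lie close together are returned first. As a rough illustration of the idea (not polyglot's implementation, which may use a different distance), the sketch below ranks a toy vocabulary by cosine similarity. The three-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, vectors, top_n=2):
    """Rank the other words in `vectors` by similarity to `word`."""
    target = vectors[word]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]


# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "geladeira": [0.9, 0.1, 0.0],
    "freezer": [0.8, 0.2, 0.1],
    "estante": [0.5, 0.5, 0.1],
    "carnaval": [0.0, 0.1, 0.9],
}
print(nearest("geladeira", toy_vectors))
```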


In [25]:
word = Word(u'infelicidade')
word.morphemes

WordList([u'in', u'felic', u'idade'])

In [26]:
text = Text('Apesar de interessante, o produto é caro e deixa a desejar. Não recomendo')
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in text.words:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
Apesar           0
de              -1
interessante     1
,                0
o                0
produto          0
é                0
caro            -1
e                0
deixa            0
a                0
desejar          1
.                0
Não              0
recomendo        0
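
Word-level polarities are usually combined into a score for the whole text. The sketch below simply sums the values printed above (copied by hand, so it runs without polyglot). The summation rule is an assumption for illustration, not polyglot's own aggregation.

```python
# (word, polarity) pairs copied from the output above.
word_polarities = [
    ("Apesar", 0), ("de", -1), ("interessante", 1), (",", 0), ("o", 0),
    ("produto", 0), ("é", 0), ("caro", -1), ("e", 0), ("deixa", 0),
    ("a", 0), ("desejar", 1), (".", 0), ("Não", 0), ("recomendo", 0),
]

# Keep only the words the lexicon considers non-neutral.
opinion_words = [(w, p) for w, p in word_polarities if p != 0]

# Naive document score: the sum of the word polarities.
score = sum(p for _, p in word_polarities)

print(opinion_words)
print(score)
```

The score comes out at 0 even though the review is clearly negative: `de` and `desejar` receive polarities on their own, while the negation in `Não recomendo` carries none. Lexicon lookups of this kind are best treated as a rough signal.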


---
## [NLPNet](http://nilc.icmc.usp.br/nlpnet/)

* nlpnet is a Python library for Natural Language Processing tasks based on neural networks. 
* It performs part-of-speech tagging and semantic role labeling.
* It was developed at [NILC](http://nilc.icmc.usp.br/)

In [27]:
import nlpnet
nlpnet.set_data_dir(str('/usr/share/nlpnet_data/'))
tagger = nlpnet.POSTagger()
tagger.tag(u'Geladeira Brastemp c/ painel branco-amarelo. Atualização em (09/12/2015).')

[[(u'Geladeira', u'N'),
  (u'Brastemp', u'N'),
  (u'c/', u'N'),
  (u'painel', u'N'),
  (u'branco-amarelo', u'ADJ'),
  (u'.', u'PU')],
 [(u'Atualiza\xe7\xe3o', u'N'),
  (u'em', u'PREP'),
  (u'(', u'PU'),
  (u'09/12/2015', u'N'),
  (u')', u'PU'),
  (u'.', u'PU')]]

In [28]:
tagger = nlpnet.SRLTagger()
sent = tagger.tag(u'O rato roeu a roupa do rei de Roma em abril.')[0]
sent.arg_structures

[(u'roeu',
  {u'A0': [u'O', u'rato'],
   u'A1': [u'a', u'roupa', u'do', u'rei', u'de', u'Roma'],
   u'AM-TMP': [u'em', u'abril'],
   u'V': [u'roeu']})]
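
Each entry in `arg_structures` pairs a predicate with a dictionary that maps role labels to token lists. The sketch below turns that structure into readable phrases, using the output above as literal data so that it runs without nlpnet. The function name `describe_roles` is hypothetical.

```python
def describe_roles(arg_structures):
    """Join each role's tokens into a phrase, keyed by predicate."""
    described = {}
    for predicate, roles in arg_structures:
        described[predicate] = {
            role: " ".join(tokens) for role, tokens in roles.items()
        }
    return described


# Structure copied from the SRLTagger output above.
arg_structures = [
    ("roeu", {
        "A0": ["O", "rato"],
        "A1": ["a", "roupa", "do", "rei", "de", "Roma"],
        "AM-TMP": ["em", "abril"],
        "V": ["roeu"],
    })
]

for predicate, roles in describe_roles(arg_structures).items():
    for role in sorted(roles):
        print("{:<8}{}".format(role, roles[role]))
```

In PropBank-style labels, `A0` is typically the agent, `A1` the patient or theme, and `AM-TMP` a temporal modifier.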

---

## [spaCy](https://spacy.io/)

* spaCy is a library for industrial-strength natural language processing in Python and Cython. 
* It features state-of-the-art speed and accuracy, a concise API, and great documentation. 
* English only at the time of writing.

----

## [MontyLingua](http://alumni.media.mit.edu/~hugo/montylingua/)

* MontyLingua is a free, commonsense-enriched, end-to-end natural language understander for English.
* The library was last updated in 2004.



----

## Others
* [Aelius](http://sourceforge.net/projects/aelius/files/) - Aelius Brazilian Portuguese POS-Tagger and Corpus Annotation Tool. The tool is not available on pip and depends heavily on third-party programs written in Java.

* [PyNLPl](http://pythonhosted.org/PyNLPl/) - Python Natural Language Processing Library (PyNLPl, pronounced as “pineapple”). The library offers a wide variety of modules for various NLP tasks.

* [PyPLN](http://pypln.org/) - PyPLN is a platform for processing and extracting useful information from text. It was conceived to run in the cloud, scale quickly and be easy to use. Its authors are from FGV.