# Best practices in NLP

Python is one of the best languages for text processing. However, a few best practices make all the difference in the performance of your applications. The cells below use Python 2 syntax.

### 1. Use Python's simple built-in structures
Lists, sets and dicts are highly optimized for text processing and are part of the core of Python. Use them!


In [1]:
colors = ['branco', 'amarelo', 'azul', 'branco']

In [2]:
set(colors)

{'amarelo', 'azul', 'branco'}
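
Beyond removing duplicates, sets support set algebra, which is handy for comparing vocabularies. The following is a minimal sketch written for Python 3 (unlike the Python 2 cells in this notebook); `doc_a` and `doc_b` are token lists invented for illustration.

```python
# Hypothetical token lists, invented for this example
doc_a = ['branco', 'amarelo', 'azul', 'branco']
doc_b = ['azul', 'verde', 'branco']

vocab_a = set(doc_a)   # duplicates collapse: 3 distinct words
vocab_b = set(doc_b)

shared = vocab_a & vocab_b      # words present in both documents
only_in_a = vocab_a - vocab_b   # words present only in doc_a

print(sorted(shared))      # ['azul', 'branco']
print(sorted(only_in_a))   # ['amarelo']
```

Sets are unordered, so the results are sorted before printing to get a stable display.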

---

### 2. Always use Unicode (always!)

In [3]:
a = 'maça'
print len(a)

5


In [4]:
b = u'maça'
print len(b)

4


In [5]:
a == b

False

In [6]:
# The problem may be solved by adding this directive:
from __future__ import unicode_literals

In [7]:
a = 'maça'
b = u'maça'
a == b

True
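
For comparison, in Python 3 every `str` is already Unicode, so the distinction moves to `str` versus `bytes`. The sketch below assumes Python 3 and picks UTF-8 as the encoding; the character `ç` is written as the escape `\u00e7` so the example does not depend on the encoding of the source file.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data
text = 'ma\u00e7a'               # the word "maça"
encoded = text.encode('utf-8')   # 'ç' takes two bytes in UTF-8

print(len(text))      # 4 characters
print(len(encoded))   # 5 bytes
print(encoded.decode('utf-8') == text)   # True: the round trip is lossless
```

The 5 versus 4 difference is the same one shown in the Python 2 cells above: one count is in bytes, the other in characters.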

--- 

### 3. Discovering the text encoding is the programmer's task, not Python's

Learn how your system deals with input and output.

In [8]:
a = u'maça'.encode('latin1')

In [9]:
print a

ma�a


In [10]:
unicode(a)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 2: ordinal not in range(128)

In [11]:
# Even simple concatenation raises a decode problem
a + 'outra palavra'

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 2: ordinal not in range(128)

In [12]:
# some libraries can help, but remember: finding the correct encoding is a heuristic science!
import chardet
chardet.detect(a)

{'confidence': 0.7916670185186749, 'encoding': 'ISO-8859-2'}

In [13]:
print a.decode(chardet.detect(a)['encoding'])

maça
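
When the encoding is known, decoding explicitly is more reliable than guessing. The sketch below (Python 3, standard library only) illustrates two points: the byte `0xE7` decodes to the same character under ISO-8859-1 and ISO-8859-2, which is why an ISO-8859-2 guess can still print the right word, and decoding with an unsuitable codec either fails or needs an explicit error policy.

```python
raw = 'ma\u00e7a'.encode('latin-1')   # b'ma\xe7a', the Latin-1 bytes of "maça"

# 0xE7 maps to 'ç' in both Latin-1 and Latin-2, so either codec recovers the word
as_latin1 = raw.decode('iso-8859-1')
as_latin2 = raw.decode('iso-8859-2')
print(as_latin1 == as_latin2)   # True

# The same bytes are not valid UTF-8
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)   # False

# errors='replace' substitutes U+FFFD instead of raising
lossy = raw.decode('utf-8', errors='replace')
print(lossy == 'ma\ufffda')   # True
```

A text with characters that differ between the two ISO tables would not be so forgiving, so a detector's answer is best treated as a hint to verify.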


---

### 4. List comprehensions are highly optimized. Use them for filtering

In [14]:
wordlist = ['abacate', 'kiwi', 'abacaxi', 'melancia']
[word for word in wordlist if word.startswith('a')]

[u'abacate', u'abacaxi']
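
A comprehension can filter and transform in one pass, and the same syntax builds dicts and lazy generators. The following is a short Python 3 sketch using the same word list; the variable names `upper_a`, `lengths` and `total_chars` are chosen only for this example.

```python
wordlist = ['abacate', 'kiwi', 'abacaxi', 'melancia']

# filter and transform in a single expression
upper_a = [word.upper() for word in wordlist if word.startswith('a')]
print(upper_a)   # ['ABACATE', 'ABACAXI']

# dict comprehension: map each word to its length
lengths = {word: len(word) for word in wordlist}
print(lengths['kiwi'])   # 4

# generator expression: no intermediate list is built
total_chars = sum(len(word) for word in wordlist)
print(total_chars)   # 7 + 4 + 7 + 8 = 26
```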

--- 
### 5. Some very useful code snippets

In [15]:
wordlist = ['abacate', 'kiwi', 'abacaxi', 'melancia', 'banana', 'abacate', 'abacaxi', 'abacate']

In [16]:
# join, sorted, set
print 'Frutas: ' + ', '.join( sorted(set(wordlist)))

Frutas: abacate, abacaxi, banana, kiwi, melancia


In [17]:
# always use the 'in' operator instead of a loop with ==
'abacaxi' in wordlist

True
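
On a list, `in` scans the elements one by one. When the same collection is queried many times, converting it to a set once turns each test into a hash lookup. A Python 3 sketch of that idea follows; `stopwords` and `tokens` are invented for the example.

```python
stopwords = {'e', 'o', 'a', 'de'}   # hypothetical stopword set
tokens = ['o', 'abacaxi', 'e', 'a', 'banana', 'de', 'kiwi']

# each `token not in stopwords` test is a hash lookup, not a scan
content_words = [token for token in tokens if token not in stopwords]
print(content_words)   # ['abacaxi', 'banana', 'kiwi']
```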

In [18]:
# Frequency lists
freqlist = dict()
for word in wordlist:
    freqlist[word] = freqlist.get(word, 0) + 1
print freqlist

{u'abacate': 3, u'kiwi': 1, u'abacaxi': 2, u'melancia': 1, u'banana': 1}


In [19]:
# sorted by frequency
from operator import itemgetter
', '.join( [word for word,freq in sorted(freqlist.items(), key=itemgetter(1), reverse=True)] )

u'abacate, abacaxi, kiwi, melancia, banana'
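
Words with equal counts come out in whatever order the dict yields them. If a reproducible order matters, one option is a composite sort key. This is a sketch for Python 3, with the frequency dict written out literally so the block runs on its own.

```python
freqlist = {'abacate': 3, 'kiwi': 1, 'abacaxi': 2, 'melancia': 1, 'banana': 1}

# sort by descending frequency, then alphabetically among equal counts
ranked = sorted(freqlist.items(), key=lambda item: (-item[1], item[0]))
print(', '.join(word for word, freq in ranked))
# abacate, abacaxi, banana, kiwi, melancia
```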

In [20]:
# use and abuse slicing
text = 'E ele disse: "texto em quotes" e continuou...'
text[text.find('"')+1:text.rfind('"')]

u'texto em quotes'
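
One caveat with this idiom: `find` and `rfind` return -1 when the character is missing, and the slice then silently returns almost the whole string instead of failing. A guarded version might look like the sketch below (Python 3); `between_quotes` is a hypothetical helper name, not a standard function.

```python
def between_quotes(text, quote='"'):
    """Return the text between the first and last quote, or None if there is no pair."""
    start = text.find(quote)
    end = text.rfind(quote)
    if start == -1 or start == end:   # no quote at all, or only one
        return None
    return text[start + 1:end]

print(between_quotes('E ele disse: "texto em quotes" e continuou...'))   # texto em quotes
print(between_quotes('sem aspas'))                                       # None
```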

----
### 6. Some more elaborate data structures. Use them for special cases only

In [21]:
# Counter
from collections import Counter
Counter(wordlist)

Counter({u'abacate': 3,
         u'abacaxi': 2,
         u'banana': 1,
         u'kiwi': 1,
         u'melancia': 1})
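
`Counter` also covers the common follow-up steps of a frequency list. Here is a brief Python 3 sketch using standard-library methods (`most_common`, `update`); the word `'uva'` is just an example of a key that is absent.

```python
from collections import Counter

wordlist = ['abacate', 'kiwi', 'abacaxi', 'melancia',
            'banana', 'abacate', 'abacaxi', 'abacate']
counts = Counter(wordlist)

print(counts.most_common(2))   # [('abacate', 3), ('abacaxi', 2)]
print(counts['uva'])           # 0: missing keys count as zero, no KeyError

counts.update(['kiwi', 'kiwi'])   # add more observations in place
print(counts['kiwi'])             # 3
```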

In [22]:
# blist - The blist is a drop-in replacement for the Python list that provides 
# better performance when modifying large lists. 
from blist import blist
blist(wordlist)

blist([u'abacate', u'kiwi', u'abacaxi', u'melancia', u'banana', u'abacate', u'abacaxi', u'abacate'])

In [23]:
# String data in a MARISA-trie may take up to 50x-100x less memory than 
# in a standard Python dict; the raw lookup speed is comparable; 
# trie also provides fast advanced methods like prefix search.
import marisa_trie
trie = marisa_trie.Trie(wordlist)
trie.items()

[(u'abacate', 3),
 (u'abacaxi', 4),
 (u'banana', 0),
 (u'kiwi', 1),
 (u'melancia', 2)]

In [24]:
'abacate' in trie

True

In [25]:
trie.keys('aba')

[u'abacate', u'abacaxi']
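
For small vocabularies, a prefix search can also be approximated without a trie by keeping the words sorted and using binary search. Note that this is a different technique from MARISA-trie, with none of its memory savings. It is shown here only as a standard-library sketch for Python 3, and `keys_with_prefix` is a hypothetical helper name.

```python
from bisect import bisect_left

def keys_with_prefix(sorted_words, prefix):
    """Return all words in a sorted list that start with prefix."""
    # words sharing a prefix are contiguous in sorted order,
    # starting at the insertion point of the prefix itself
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

words = sorted(set(['abacate', 'kiwi', 'abacaxi', 'melancia', 'banana', 'abacate']))
print(keys_with_prefix(words, 'aba'))   # ['abacate', 'abacaxi']
print(keys_with_prefix(words, 'z'))     # []
```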