## IMDb

At fast.ai we have introduced a new module, fastai.text, which replaces the torchtext library used in our 2018 dl1 course. It also supersedes the fastai.nlp library while retaining many of its key functions.

In [1]:
from fastai.text import *
from fastai.core import num_cpus, partition_by_cores
import html
from pathlib import Path
import numpy as np
import csv
import pandas as pd
from collections import Counter, defaultdict
from itertools import chain
from nltk.corpus import brown
import os, re

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from typing import Callable, List, Collection
from concurrent.futures.process import ProcessPoolExecutor

The fastai.text module introduces several custom tokens.

We need to download the IMDB large movie reviews dataset from this site: http://ai.stanford.edu/~amaas/data/sentiment/
([direct link](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)) and untar it into the PATH location. We use pathlib, which makes directory traversal a breeze.
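
The untar step and the pathlib traversal can be sketched as follows. This is a minimal illustration, not part of the original notebook: `extract_archive` and `list_files` are hypothetical helper names, and the archive is assumed to have been downloaded already.

```python
import tarfile
from pathlib import Path

def extract_archive(archive: Path, dest: Path) -> Path:
    """Extract a .tar.gz `archive` into `dest` (created if missing) and return `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode='r:gz') as tar:
        # Only extract archives that come from a source you trust.
        tar.extractall(dest)
    return dest

def list_files(root: Path, pattern: str = '*'):
    """Return the sorted regular files under `root` that match `pattern`, recursively."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())
```

With pathlib, sub-directories are then reached with the `/` operator, the same way the cells below build paths from PATH.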

In [2]:
PATH=Path('/media/discoD/repositorios/1-billion-word-language-modeling-benchmark/')

## Standardize format

The IMDb dataset has 3 classes: positive, negative and unsupervised (sentiment is unknown).

- There are 75k training reviews (12.5k pos, 12.5k neg, 50k unsup).
- There are 25k validation reviews (12.5k pos, 12.5k neg, no unsup).

Refer to the README file in the imdb corpus for further information about the dataset.

In [3]:
class VocabularyTokenizer():
    "Put together rules, a tokenizer function and a language to tokenize text with multiprocessing."
    def __init__(self, tok_func:Callable=SpacyTokenizer, lang:str='pt', n_cpus:int=None):
        self.tok_func,self.lang = tok_func,lang
        self.n_cpus = n_cpus or num_cpus()//2

    def process_text(self, t:str, tok:BaseTokenizer) -> List[str]:
        "Process one text `t` with tokenizer `tok`."
        return tok.tokenizer(t)

    def _process_all_1(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts` in one process."
        tok = self.tok_func(self.lang)
        return [self.process_text(t, tok) for t in texts]

    def process_all(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts`."
        if self.n_cpus <= 1: return self._process_all_1(texts)
        with ProcessPoolExecutor(self.n_cpus) as e:
            return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
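
The partition-then-flatten idea behind `process_all` can be sketched without fastai. Everything here is an assumption made for illustration: `partition` stands in for `partition_by_cores`, a whitespace split stands in for the spaCy tokenizer, and the built-in `map` stands in for the process pool. The flattening uses `itertools.chain.from_iterable`, which runs in linear time, whereas `sum(lists, [])` copies the accumulated list at every step.

```python
from itertools import chain
from typing import Iterable, List

def partition(items: List[str], n_parts: int) -> List[List[str]]:
    """Split `items` into at most `n_parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def tokenize_chunk(texts: List[str]) -> List[List[str]]:
    """Stand-in tokenizer: split each text on whitespace."""
    return [t.split() for t in texts]

def tokenize_all(texts: Iterable[str], n_parts: int = 2) -> List[List[str]]:
    """Tokenize chunk by chunk, then flatten the per-chunk results in order."""
    chunks = partition(list(texts), n_parts)
    results = map(tokenize_chunk, chunks)  # an executor's map would slot in here
    return list(chain.from_iterable(results))
```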

In [4]:
def save_texts(paths, filename, lang):
    "Write every line of every file under `paths` to `<filename>_<lang>.csv` as a `text,label` row."
    CLASSES = ['unsup']
    file_count = 0
    filename = filename + '_' + lang + '.csv'
    # Mode 'w' truncates an existing file, so no explicit removal is needed.
    with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='\\')
        for idx,label in enumerate(CLASSES):
            for path in paths:
                for fname in path.glob('*'):
                    file_count += 1
                    print('writing from %s' % fname)
                    # Close each source file as soon as its lines are written.
                    with fname.open('r', encoding='utf-8') as f:
                        for line in f.read().split('\n'):
                            writer.writerow([line, idx])
    # Note: `file_count` counts source files, not individual lines.
    print('%d texts saved to %s' % (file_count, filename))
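
To see what `quoting=csv.QUOTE_NONE` with `escapechar='\\'` does to a line, here is a small round trip using only the standard library. It is a sketch for illustration: `write_rows` and `read_rows` are hypothetical helpers, and the sample sentence is made up.

```python
import csv
import io

def write_rows(rows):
    """Serialize `rows` with the same csv settings used by save_texts."""
    buf = io.StringIO(newline='')
    writer = csv.writer(buf, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='\\')
    writer.writerows(rows)
    return buf.getvalue()

def read_rows(text):
    """Parse `text` back, using matching settings so escaped commas are restored."""
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=',',
                        quoting=csv.QUOTE_NONE, escapechar='\\')
    return list(reader)

# A comma inside the text is escaped with a backslash instead of being quoted.
encoded = write_rows([['o autor, segundo a inicial', 0]])
decoded = read_rows(encoded)
```

The reader must be given the same `escapechar`, which is why the `pd.read_csv` call later in the notebook passes `escapechar='\\'`.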

In [5]:
save_texts([PATH/'training-jur/'], 'train_jur', 'pt')
save_texts([PATH/'heldout-jur/'], 'test_jur', 'pt')
#save_texts(PATH/'training-monolingual.tokenized.shuffled/', 'train' + lang + '.csv')
#save_texts(PATH/'heldout-monolingual.tokenized.shuffled/', 'test' + lang + '.csv')
save_texts([PATH/'training-jur/',PATH/'heldout-jur/'], 'full_jur', 'pt')

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00001-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00002-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00003-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00004-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00005-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00006-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00007-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00008-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchma

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00073-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00074-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00075-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00076-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00077-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00078-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00079-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00080-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchma

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00043-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00044-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00045-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00046-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00047-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00048-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00049-of-00050
51 texts saved to test_jur_pt.csv
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00001-of-001

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00033-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00050-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00067-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00068-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00069-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00070-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00071-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00072-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchma

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00037-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00038-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00039-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00040-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00041-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00042-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00043-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00044-of-00050
writing from /media/disc

In [4]:
def get_tokens(filename):
    data = pd.read_csv(filename, header=None, escapechar='\\', chunksize=500000)
    for idx, df in enumerate(data):
        print(idx)
        yield VocabularyTokenizer().process_all(df[0].astype(str))

In [5]:
freq_full = Counter(p for o in chain.from_iterable(get_tokens('full_jur_pt.csv')) for p in o)
freq_full.most_common()

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57


[(',', 86209975),
 ('de', 47899122),
 ('a', 46778448),
 ('.', 40302847),
 ('o', 32367373),
 ('do', 24197323),
 ('que', 23907082),
 ('da', 21574005),
 ('e', 18370734),
 (')', 17040131),
 ('(', 15903152),
 ('/', 13528846),
 ('em', 12670640),
 ('"', 10080484),
 ('no', 8940297),
 ('não', 8790572),
 ('os', 8309974),
 ('para', 8124080),
 (':', 7868768),
 (';', 7344581),
 ('à', 6933346),
 ('na', 6887398),
 ('por', 5787370),
 ('com', 5489777),
 ('reclamante', 5418756),
 ('-', 4749272),
 ('as', 4721645),
 ('se', 4296054),
 ('DE', 4288772),
 ('dos', 4245311),
 ('reclamada', 3848965),
 ('das', 3606642),
 ('R', 3320080),
 ('pelo', 3305948),
 ('A', 3303312),
 ('$', 3300015),
 ('como', 3183664),
 ('pela', 3180415),
 ('nos', 3101938),
 ('s', 3001102),
 ('art', 2626062),
 ('é', 2512428),
 ('ou', 2475883),
 ('nº', 2458896),
 ('autos', 2455167),
 ('foi', 2452390),
 ('pagamento', 2444342),
 ('trabalho', 2349480),
 ('autor', 2283061),
 ('horas', 2208676),
 ('valor', 2169408),
 ('O', 2148000),
 ('ser', 209
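
The counting step above flattens every chunk into a single token stream before building the Counter. An equivalent formulation, sketched here with a hypothetical `count_tokens` helper and toy data, updates one Counter chunk by chunk, which keeps only the running frequencies in memory.

```python
from collections import Counter
from typing import Iterable, List

def count_tokens(chunks: Iterable[List[List[str]]]) -> Counter:
    """Accumulate token frequencies over chunks of tokenized texts."""
    freq = Counter()
    for chunk in chunks:          # one chunk = a list of tokenized texts
        for tokens in chunk:      # one text = a list of tokens
            freq.update(tokens)
    return freq

def toy_chunks():
    """Toy stand-in for get_tokens(): yields two small chunks."""
    yield [['de', 'a', 'de'], ['o', 'de']]
    yield [['a', 'que']]

toy_freq = count_tokens(toy_chunks())
```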

In [6]:
palavras = [palavra for palavra, contagem in freq_full.most_common()]
print(len(palavras))
palavras

7617126


[',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 '2015',
 'esta',
 'pedido',
 'ré',
 '2014',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda',
 '3',


In [8]:
sum(freq_full.values())

1225378678

In [15]:
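def write_list(array, filename):
    "Write the special tokens followed by each item of `array`, one per line."
    # Build a new list instead of inserting into `array`, so the caller's list is not mutated.
    items = ['</S>', '<S>', '<UNK>'] + list(array)
    # The `with` block closes the file; no explicit close() is needed.
    with open(filename, 'w', encoding='utf-8') as file:
        for item in items:
            file.write(item + '\n')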

In [10]:
palavras_filtradas = [palavra for palavra, contagem in freq_full.most_common() if contagem > 1]
print(len(palavras_filtradas))
palavras_filtradas

2571698


[',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 '2015',
 'esta',
 'pedido',
 'ré',
 '2014',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda',
 '3',


In [12]:
# Raw string avoids invalid-escape warnings; the unescaped dots still match any character.
palavras_filtradas = [palavra for palavra in palavras_filtradas if not re.match(pattern=r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}', string=palavra)]
print(len(palavras_filtradas))
palavras_filtradas

2070354


[',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 '2015',
 'esta',
 'pedido',
 'ré',
 '2014',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda',
 '3',


In [16]:
write_list(array=palavras_filtradas, filename='vocabulario_jur_sm.txt')

In [14]:
!tail -n 10 vocabulario_jur_sm.txt

637-85
09838-3
Cestrem
0142721
nº1383205
2.243,41
085279f
d51dc8d
4a84915
670dc59


In [14]:
singletons = [palavra for palavra, contagem in freq_full.most_common() if contagem == 1]
singletons

['35233588460701',
 '484-E',
 '812,12',
 '024881',
 '0000028-09.2014.5.15.0042',
 '232fc9d',
 'cf2bc68',
 '1aa4f4a',
 '743-030',
 'Urbanetz',
 '28b077a',
 'SCHWARZENBERG',
 'inverntario',
 'ofiado',
 'COMPELTADAS',
 '0000153-59.2014.5.08.0013',
 'aad549e',
 '58f630c',
 '0071000-84.2013.5.21.0005',
 '887f762',
 '0000670-87.2014.5.09.0643',
 '1000934-84.2016.5.02.0070',
 'be66415',
 '7fa5cf6',
 '59deea6',
 '0011116-68.2015.5.15.0055',
 '0000309-79.2015.5.17.0007',
 'e8c1d83',
 '1251085',
 '4963dc6',
 'bbd48b4',
 'Nº0000529-35.2016.5.10.0008',
 '4022439',
 'ca5c545',
 '1FR',
 '89205-120',
 'vidraceiro-montador',
 '1,0061077',
 'BURLOU',
 'Osasco-',
 '06016-902',
 '605812',
 'e008487',
 'b2b2f73',
 '0000685-29.2015.5.05.0612',
 'd18c5f1',
 'Id.d56b31a',
 '186627',
 '0011053-30.2015.5.03.0038',
 '7.4ºC.',
 'reabriu-a',
 '3e4bee1',
 '7237d80',
 'vítalícia',
 'ec67203',
 'e4e2a89',
 '28258f8',
 'e0e2d46',
 '2dd1126',
 '7ae9b13',
 'IDf381c5d',
 '9cfffa0',
 '0000833-02.2014.5.04.0451',
 'ED-DC-

In [15]:
len(singletons)

5045428

In [17]:
# Raw string avoids invalid-escape warnings; re.match only anchors at the start of the token.
processos = [singleton for singleton in singletons if re.match(pattern=r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}', string=singleton)]
processos

['0000028-09.2014.5.15.0042',
 '0000153-59.2014.5.08.0013',
 '0071000-84.2013.5.21.0005',
 '0000670-87.2014.5.09.0643',
 '1000934-84.2016.5.02.0070',
 '0011116-68.2015.5.15.0055',
 '0000309-79.2015.5.17.0007',
 '0000685-29.2015.5.05.0612',
 '0011053-30.2015.5.03.0038',
 '0000833-02.2014.5.04.0451',
 '0001420-77.2013.5.03.0098',
 '0011532-13.2015.5.12.0025',
 '0011511-96.2015.5.15.0140',
 '0001598-65.2016.5.07.0015',
 '0000080-48.2014.5.04.0741',
 '0000282-95.2014.5.11.0008',
 '0001107-49.2013.5.23.0005',
 '0010231-15.2013.5.01.0055',
 '0000027-67.2014.5.20.0015',
 '0001744-35.2014.5.10.0002',
 '1000594-07.2015.5.02.0255',
 '1000196-78.2014.5.02.0422',
 '0001312-71.2014.5.08.0131',
 '0001402-56.2016.5.07.0028',
 '0000655-61.2014.5.07.00001',
 '0000310-13.2015.5.08.0008',
 '0001586-69.2013.5.09.0122',
 '0001352-88.2014.5.05.0017',
 '0000428-65.2015.5.07.0024',
 '0011026-17.2016.5.09.0015',
 '0010475-13.2014.5.01.0053',
 '0020801-40.2015.5.04.0012',
 '0010371-56.2015.5.03.0109',
 '0001426

In [18]:
len(processos)

585553
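
Two properties of the process-number filter are worth spelling out: `re.match` anchors only at the start of the token, and an unescaped `.` matches any character. The sketch below contrasts the pattern used above with a stricter variant. The stricter pattern and the helper `is_process_number` are assumptions added for illustration, not what the notebook ran; the sample tokens are taken from the outputs above.

```python
import re

# Pattern as used in the notebook: prefix match, dots match any character.
LOOSE = re.compile(r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}')
# Hypothetical stricter variant: literal dots, used with fullmatch below.
STRICT = re.compile(r'\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}')

def is_process_number(token: str) -> bool:
    """True only if the whole token has the NNNNNNN-NN.NNNN.N.NN.NNNN shape."""
    return STRICT.fullmatch(token) is not None

well_formed = '0000028-09.2014.5.15.0042'
extra_digit = '0000655-61.2014.5.07.00001'    # one digit too many at the end
prefixed = 'Nº0000529-35.2016.5.10.0008'      # number glued to a prefix
```

With the loose pattern and `re.match`, the token with the extra trailing digit is accepted while the prefixed one is missed, which matches what the `processos` and `singletons` listings show.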

In [39]:
#currency_pattern = r"\b(?<![.,-])[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}(?![.,-])\b|\b(?<![.,-])[0-9]{1,3}(?:.?[0-9]{3})*\,[0-9]{2}(?![.,-])\b"
currency_pattern = r"(?<![.,])(?:- *)?\b[0-9]{1,3}(?:\.?[0-9]{3})*\,?[0-9]{2}(?![.,-])\b|(?<![.,])(?:- *)?\b[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}(?![.,-])\b"
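
A quick way to see what the currency pattern accepts is to probe it with a few tokens. The sketch below copies the pattern verbatim; `matches_currency` is a hypothetical helper. Because the decimal comma in the first alternative is optional, the pattern also matches plain integers of three or more digits, which would explain why years such as 2008 appear when it is applied to the English vocabulary further down.

```python
import re

currency_pattern = r"(?<![.,])(?:- *)?\b[0-9]{1,3}(?:\.?[0-9]{3})*\,?[0-9]{2}(?![.,-])\b|(?<![.,])(?:- *)?\b[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}(?![.,-])\b"

def matches_currency(token: str) -> bool:
    """True if `currency_pattern` matches at the start of `token`."""
    return re.match(currency_pattern, token) is not None

# Sample tokens taken from the outputs in this notebook.
samples = ['812,12', '2.243,41', '1,102.94', '46.56', '2008', '100', '10', 'reclamante']
results = {token: matches_currency(token) for token in samples}
```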

In [35]:
valores = [singleton for singleton in singletons if re.match(pattern=currency_pattern, string=singleton)]
print(len(valores))
valores

345644


['812,12',
 '5.743,01',
 '1.617,42',
 '1.112,47',
 '8.060,91',
 '11.427,18',
 '19.178,93',
 '26.449,07',
 '17.582,51',
 '9.577,33',
 '28.561,12',
 '12.761,90',
 '14.924,00',
 '8.959,36',
 '1500,43',
 '189.339,06',
 '10.192,13',
 '44.227,65',
 '18.665,49',
 '12.231,53',
 '3598,04',
 '9.168,37',
 '2.997,11',
 '1,102.94',
 '156.226,06',
 '6.520,42',
 '32.424,06',
 '19.258,92',
 '5.208,28',
 '16.506,66',
 '10.299,14',
 '42.767,61',
 '43.605,64',
 '3.557,24',
 '114.800,00',
 '152.945,13',
 '251.725,45',
 '1520,41',
 '28.721,00',
 '10.916,22',
 '19.242,89',
 '20.414,67',
 '1.232,28',
 '3,226.06',
 '6.555,92',
 '53.487,98',
 '18.988,82',
 '46.56',
 '2.327,75',
 '29.026,91',
 '2.578,65',
 '9.904,32',
 '7.854,90',
 '2.138,59',
 '594.824,46',
 '35.897,24',
 '3.106,49',
 '10.473,76',
 '111.791,11',
 '14.415,39',
 '5.350,90',
 '8014,74',
 '13.900,57',
 '1.574,58',
 '274.829,84',
 '34.194,51',
 '774,77',
 '1.196,79',
 '10.387,29',
 '83.900,08',
 '766,30',
 '8.081,13',
 '12.286,40',
 '5.187,47',
 '1

In [36]:
valores_cifra = [valor for valor in singletons if (valor.startswith('R$') or valor.startswith('$'))]
print(len(valores_cifra))
valores_cifra

0


[]

In [80]:
def filter_vocabulary(vocabulary_file):
    # Read the vocabulary inside a `with` block so the file handle is closed.
    with open(vocabulary_file, mode='r', encoding='utf8') as f:
        words = [p.replace('\n', '') for p in f]
    print('%s words in the original vocabulary' % len(words))
    len_words = len(words)
    
    words = [word for word in words if not re.match(pattern=currency_pattern, string=word)]
    print('%s words discarded after filtering currency and numbers' % (len_words - len(words)))
    len_words = len(words)
    
    words = [word for word in words if not re.match(pattern=r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}', string=word)]
    print('%s words discarded after filtering process numbers' % (len_words - len(words)))
    len_words = len(words)
    
    words = [word for word in words if not (re.match(pattern=r"\b[0-9a-f]{7}\b", string=word) and not re.match(pattern=r"\b[a-f]{7}\b", string=word))]
    print('%s words discarded after filtering doc ids' % (len_words - len(words)))
    len_words = len(words)
    
    print('%s words in the final vocabulary' % len_words)
    return words
    
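# One list entry per punctuation character. The result of re.escape() depends on the Python version, so it is not used to build this list.
punctuations = list('!"#%\'()*+,./:;<=>?@[\\]^_`{|}~')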
    
def get_words_starting_with_punct(words):
    words_starting_with_punct = [word for word in words if word[0] in punctuations]
    print(len(words_starting_with_punct))
    return words_starting_with_punct

def get_words_starting_with(words, string):
    words_starting_with = [word for word in words if word.startswith(string)]
    print(len(words_starting_with))
    return words_starting_with

def get_ascii_only(words):
    "Despite its name, return the words that contain at least one non-ASCII character."
    words_non_ascii = [word for word in words if not re.match(pattern=r'^[\x00-\x7F]+$', string=word)]
    print(len(words_non_ascii))
    return words_non_ascii
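
The doc-id filter in `filter_vocabulary` combines two patterns: a token is dropped when it starts with seven lowercase hexadecimal characters followed by a word boundary, unless those seven characters are all letters a-f. The sketch below isolates that test in a hypothetical helper, `looks_like_doc_id`. Note that a token made of seven digits also counts as a doc id under this rule.

```python
import re

HEX7 = re.compile(r"\b[0-9a-f]{7}\b")     # seven hex characters at the start of the token
LETTERS7 = re.compile(r"\b[a-f]{7}\b")    # seven letters a-f: may be an ordinary word

def looks_like_doc_id(word: str) -> bool:
    """Mirror the doc-id test used in filter_vocabulary."""
    return bool(HEX7.match(word)) and not LETTERS7.match(word)

# '085279f' and 'd51dc8d' come from the tail of vocabulario_jur_sm.txt shown above.
checks = {w: looks_like_doc_id(w) for w in ['085279f', 'd51dc8d', '1251085', 'acabada', 'reclamante']}
```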

In [55]:
# Read the ELMo vocabulary and keep only the entries matched by the currency pattern.
with open('/media/discoD/models/elmo/vocab-2016-09-10.txt', mode='r', encoding='utf8') as f:
    en_vocab = [word for word in (p.replace('\n', '') for p in f) if re.match(pattern=currency_pattern, string=word)]
en_vocab

['2008',
 '2009',
 '2007',
 '2006',
 '100',
 '2010',
 '2005',
 '2004',
 '2003',
 '2001',
 '500',
 '200',
 '2002',
 '2000',
 '300',
 '2011',
 '1999',
 '150',
 '2012',
 '400',
 '1997',
 '1998',
 '1995',
 '1996',
 '1994',
 '250',
 '1990',
 '700',
 '600',
 '1992',
 '800',
 '1991',
 '1993',
 '1989',
 '120',
 '1988',
 '1980',
 '2020',
 '1979',
 '1984',
 '1986',
 '1987',
 '1982',
 '350',
 '140',
 '130',
 '1983',
 '1985',
 '1967',
 '2013',
 '1981',
 '900',
 '1968',
 '1976',
 '1972',
 '180',
 '160',
 '1974',
 '1978',
 '1970',
 '110',
 '1975',
 '125',
 '2014',
 '1969',
 '1977',
 '2015',
 '911',
 '1973',
 '1960',
 '1964',
 '1971',
 '450',
 '360',
 '170',
 '1962',
 '225',
 '1959',
 '1945',
 '1963',
 '212',
 '1965',
 '2016',
 '101',
 '750',
 '1948',
 '1966',
 '787',
 '2050',
 '1950',
 '1961',
 '240',
 '1000',
 '103',
 '105',
 '115',
 '1953',
 '1947',
 '175',
 '1957',
 '270',
 '1958',
 '220',
 '1949',
 '1955',
 '135',
 '1944',
 '1956',
 '1933',
 '650',
 '401',
 '1940',
 '2030',
 '1939',
 '230',
 '19

In [62]:
en_vocab[0]

'2008'

In [63]:
en_vocab[0][0]

'2'

In [64]:
# Reload the full ELMo vocabulary, one entry per line.
with open('/media/discoD/models/elmo/vocab-2016-09-10.txt', mode='r', encoding='utf8') as f:
    en_vocab = [p.replace('\n', '') for p in f]
en_vocab

['</S>',
 '<S>',
 '<UNK>',
 'the',
 ',',
 '.',
 'to',
 'of',
 'and',
 'a',
 'in',
 '"',
 "'s",
 'that',
 'for',
 'on',
 'is',
 'The',
 'was',
 'with',
 'said',
 'as',
 'at',
 'it',
 'by',
 'from',
 'be',
 'have',
 'he',
 'has',
 'his',
 'are',
 'an',
 ')',
 'not',
 '(',
 'will',
 'who',
 'I',
 'had',
 'their',
 '--',
 'were',
 'they',
 'but',
 'been',
 'this',
 'which',
 'more',
 'or',
 'its',
 'would',
 'about',
 ':',
 'after',
 'up',
 '$',
 'one',
 'than',
 'also',
 "'t",
 'out',
 'her',
 'you',
 'year',
 'when',
 'It',
 'two',
 'people',
 '-',
 'all',
 'can',
 'over',
 'last',
 'first',
 'But',
 'into',
 "'",
 'He',
 'A',
 'we',
 'In',
 'she',
 'other',
 'new',
 'years',
 'could',
 'there',
 '?',
 'time',
 'some',
 'them',
 'if',
 'no',
 'percent',
 'so',
 'what',
 'only',
 'government',
 'million',
 'just',
 'U.S.',
 'him',
 'before',
 'most',
 'like',
 'because',
 'now',
 'three',
 ';',
 'being',
 'against',
 'do',
 'Obama',
 'where',
 'made',
 'Mr',
 'many',
 'New',
 'back',
 'an

In [67]:
[word for word in en_vocab if word.startswith('\\')]

['\\']

In [68]:
# `punctuations` is already a list of characters, so it is used directly here.
words_starting_with_punct = [word for word in en_vocab if word[0] in punctuations]
print(len(words_starting_with_punct))
words_starting_with_punct

3244


['</S>',
 '<S>',
 '<UNK>',
 ',',
 '.',
 '"',
 "'s",
 ')',
 '(',
 ':',
 "'t",
 "'",
 '?',
 ';',
 '/',
 '%',
 '...',
 "'re",
 '!',
 "'ve",
 "'m",
 "'ll",
 "'d",
 '[',
 ']',
 '+',
 '*',
 '..',
 '=',
 '@',
 '....',
 '#',
 '|',
 '>',
 "'Brien",
 "'Neal",
 "'Neill",
 "'S",
 "'Connor",
 '<',
 "'ite",
 '.....',
 "'Donnell",
 "'Malley",
 "'Reilly",
 "'Connell",
 "'Aquila",
 "'Sullivan",
 "'clock",
 "'Adua",
 "'Driscoll",
 "'Leary",
 "'a",
 '......',
 "'o",
 '.500',
 "'Hara",
 "'T",
 "'Hare",
 "'Gara",
 "'Antoni",
 "'Shea",
 "'Or",
 "'r",
 "'Keefe",
 "'Neil",
 ',,',
 "'Arcy",
 "'n",
 "'an",
 '.......',
 "'A",
 '~',
 "'n'roll",
 "'Italia",
 "'Hair",
 "'Oreal",
 "'Toole",
 "'Djamena",
 "'Nique",
 "'Rourke",
 '.SPX',
 '.DJI',
 "'Callaghan",
 "'i",
 "'ites",
 "'Grady",
 "'mon",
 "'ida",
 "'Addario",
 '........',
 "'Meara",
 "'Dowd",
 "'s-eye",
 '.IXIC',
 "'all",
 "'Epargne",
 "'e",
 '.22-caliber',
 "'en",
 "'Dell",
 "'etat",
 "'Zogbia",
 "'Dea",
 "'Keeffe",
 "'Hanlon",
 '.........',
 "'vi",
 '.N225'

In [69]:
words_jur = filter_vocabulary('vocabulary_jur_pt.txt')

774340 words in the original vocabulary
0 words discarded after filtering currency and numbers
0 words discarded after filtering process numbers
0 words discarded after filtering doc ids
774340 words in the final vocabulary


In [74]:
words_jur

['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '-',
 '--',
 '---',
 '----',
 '-----',
 '------',
 '-------',
 '--------',
 '---------',
 '----------',
 '-----------',
 '------------',
 '-------------',
 '--------------',
 '---------------',
 '----------------',
 '-----------------',
 '------------------',
 '-------------------',
 '--------------------',
 '---------------------',
 '----------------------',
 '-----------------------',
 '------------------------',
 '-------------------------',
 '--------------------------',
 '---------------------------',
 '----------------------------',
 '---------------

In [71]:
get_words_starting_with_punct(words_jur)

2271


['</S>',
 '<S>',
 '<UNK>',
 '!',
 '"',
 '#',
 '%',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '.',
 '.-',
 '.--',
 '..',
 '...',
 '....',
 '.....',
 '......',
 '.......',
 '........',
 '.........',
 '..........',
 '...........',
 '............',
 '.............',
 '..............',
 '...............',
 '................',
 '.................',
 '..................',
 '...................',
 '....................',
 '.....................',
 '......................',
 '.......................',
 '........................',
 '.........................',
 '..........................',
 '...........................',
 '............................',
 '.............................',
 '..............................',
 '...............................',
 '................................',
 '.................................',
 '..................................',
 '...................................',
 '....................................

In [73]:
get_words_starting_with(words_jur, '..')

163


['..',
 '...',
 '....',
 '.....',
 '......',
 '.......',
 '........',
 '.........',
 '..........',
 '...........',
 '............',
 '.............',
 '..............',
 '...............',
 '................',
 '.................',
 '..................',
 '...................',
 '....................',
 '.....................',
 '......................',
 '.......................',
 '........................',
 '.........................',
 '..........................',
 '...........................',
 '............................',
 '.............................',
 '..............................',
 '...............................',
 '................................',
 '.................................',
 '..................................',
 '...................................',
 '....................................',
 '.....................................',
 '......................................',
 '.......................................',
 '.............................

In [81]:
get_ascii_only(words_jur)

83619


['----RXCELENTÍSSIMO',
 '--1357ª',
 '--1361ª',
 '--1362ª',
 '--1372ª',
 '--1373ª',
 '--1374ª',
 '--1424ª',
 '--1425ª',
 '--1429ª',
 '--1432ª',
 '--1433ª',
 '--1436ª',
 '--1437ª',
 '--1438ª',
 '--1440ª',
 '--1442ª',
 '--1445ª',
 '--1449ª',
 '--1452ª',
 '--1458ª',
 '--1462ª',
 '--1463ª',
 '--1464ª',
 '--1465ª',
 '--1466ª',
 '--2ª',
 '--3ª',
 '--6Œ',
 '--Argüida',
 '--BENEFÍCIO',
 '--CONCLUSÃO',
 '--CíÒ',
 '--DECISÃO',
 '--FUNDAMENTAÇÃO',
 '--Fundamentação',
 '--HONORÁRIOS',
 '--INDENIZAÇÃO',
 '--Nº',
 '--NÃO',
 '--Não',
 '--PETROBRÁS',
 '--Prescrição',
 '--RELATÓRIO',
 '--Reconheço',
 '--SALÁRIOS',
 '--Serviço',
 '--Seção',
 '--São',
 '--Súmula',
 '--apresentação',
 '--contribuição',
 '--diferenças',
 '--doença',
 '--férias',
 '--indenização',
 '--não',
 '--procuração',
 '--pág',
 '--págs',
 '--residência',
 '--sessão',
 '--usufruídas',
 '--µhhwÊ',
 '--É',
 '--Í',
 '--Ó',
 '--Ô',
 '--Ö',
 '--à',
 '--ä',
 '--æ',
 '--é',
 '--ê',
 '--ï',
 '--ônus',
 '--ö',
 '-10ª',
 '-10º',
 '-10ºC',
 '-11ª
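
The same split can be made without a regular expression by checking code points directly. This is a sketch with hypothetical helper names (`has_non_ascii`, `split_by_ascii`). Unlike the regex version, it treats an empty string as ASCII-only.

```python
from typing import Iterable, List, Tuple

def has_non_ascii(word: str) -> bool:
    """True if any character of `word` lies outside the 7-bit ASCII range."""
    return any(ord(ch) > 0x7F for ch in word)

def split_by_ascii(words: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (ascii_only, with_non_ascii), preserving the input order."""
    ascii_only, non_ascii = [], []
    for word in words:
        (non_ascii if has_non_ascii(word) else ascii_only).append(word)
    return ascii_only, non_ascii

ascii_words, accented_words = split_by_ascii(['--Nº', 'autos', '--férias', 'FGTS'])
```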

In [87]:
# A lone backslash must be escaped inside a string literal; '\' on its own is a SyntaxError.
get_words_starting_with(words_jur, '\\')

In [50]:
Counter(p.replace('\n', '') for p in open('vocabulary_original_jur_pt.txt', mode='r', encoding='utf8').readlines())

Counter({'</S>': 1,
         '<S>': 1,
         '<UNK>': 1,
         '\x01': 1,
         '\x02': 1,
         '\x03': 1,
         '\x04': 1,
         '\x05': 1,
         '\x06': 1,
         '\x07': 1,
         '\x08': 1,
         '\x0e': 1,
         '\x0f': 1,
         '\x10': 1,
         '\x11': 1,
         '\x12': 1,
         '\x13': 1,
         '\x14': 1,
         '\x15': 1,
         '\x16': 1,
         '\x17': 1,
         '\x18': 1,
         '\x19': 1,
         '\x1a': 1,
         '\x1b': 1,
         '\x1c ': 1,
         '\x1c \x1f ': 1,
         '\x1d ': 1,
         '\x1e ': 1,
         '\x1e \x1f ': 1,
         '\x1f ': 1,
         '\x1f \x1c ': 1,
         '\x1f \x1d ': 1,
         '\x1f \x1e ': 1,
         '\x1f \x1f ': 1,
         '!': 1,
         '"': 1,
         '#': 1,
         '$': 1,
         '%': 1,
         '&': 1,
         "'": 1,
         '(': 1,
         ')': 1,
         '*': 1,
         '+': 1,
         ',': 1,
         ',0': 1,
         ',00': 1,
         ',042.': 1

In [48]:
filter_vocabulary('vocabulary_original_jur_pt.txt')

7617419 words in the original vocabulary
1393447 words discarded after filtering currency and numbers
1086800 words discarded after filtering process numbers
2382043 words discarded after filtering doc ids
2755129 words in the final vocabulary


['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1c \x1f ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1c ',
 '\x1f \x1d ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',0',
 ',00',
 ',042.',
 ',05.10.0102',
 ',06',
 ',1',
 ',12',
 ',122.714.',
 ',1306.042.',
 ',15',
 ',161.670.',
 ',2,',
 ',2-',
 ',2000129592140',
 ',2006.37.02.',
 ',2013,5,11,0003',
 ',2013.5.10.0801',
 ',2014,',
 ',2015.5.16.0016',
 ',222',
 ',23',
 ',24',
 ',26',
 ',27',
 ',29X7,33',
 ',2bis-',
 ',2ºC',
 ',344.344.',
 ',35,39,41,45,53',
 ',3920',
 ',4',
 ',40',
 ',41',
 ',43',
 ',440',
 ',45',
 ',5',
 ',50',
 ',50x16',
 ',51',
 ',52',
 ',53',
 ',54',
 ',58',
 ',5h',
 ',5o',
 ',60',
 ',630.384.',
 ',67',
 ',68',
 ',7',
 ',76',
 ',85',
 ',98',
 ',99',
 ',9ª',
 '

In [51]:
filter_vocabulary('vocabulario_jur_sm.txt')

2070357 words in the original vocabulary
480215 words discarded after filtering currency and numbers
0 words discarded after filtering process numbers
432144 words discarded after filtering doc ids
1157998 words in the final vocabulary


['</S>',
 '<S>',
 '<UNK>',
 ',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 'esta',
 'pedido',
 'ré',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda

In [54]:
filter_vocabulary('vocabulary_jur_pt.txt')

1015374 words in the original vocabulary
124289 words discarded after filtering currency and numbers
0 words discarded after filtering process numbers
116813 words discarded after filtering doc ids
774272 words in the final vocabulary


['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '-',
 '--',
 '---',
 '----',
 '-----',
 '------',
 '-------',
 '--------',
 '---------',
 '----------',
 '-----------',
 '------------',
 '-------------',
 '--------------',
 '---------------',
 '----------------',
 '-----------------',
 '------------------',
 '-------------------',
 '--------------------',
 '---------------------',
 '----------------------',
 '-----------------------',
 '------------------------',
 '-------------------------',
 '--------------------------',
 '---------------------------',
 '----------------------------',
 '---------------

In [52]:
filter_vocabulary('/media/discoD/models/elmo/vocab-2016-09-10.txt')

793471 words in the original vocabulary
22540 words discarded after filtering currency and numbers
0 words discarded after filtering process numbers
4 words discarded after filtering doc ids
770927 words in the final vocabulary


['</S>',
 '<S>',
 '<UNK>',
 'the',
 ',',
 '.',
 'to',
 'of',
 'and',
 'a',
 'in',
 '"',
 "'s",
 'that',
 'for',
 'on',
 'is',
 'The',
 'was',
 'with',
 'said',
 'as',
 'at',
 'it',
 'by',
 'from',
 'be',
 'have',
 'he',
 'has',
 'his',
 'are',
 'an',
 ')',
 'not',
 '(',
 'will',
 'who',
 'I',
 'had',
 'their',
 '--',
 'were',
 'they',
 'but',
 'been',
 'this',
 'which',
 'more',
 'or',
 'its',
 'would',
 'about',
 ':',
 'after',
 'up',
 '$',
 'one',
 'than',
 'also',
 "'t",
 'out',
 'her',
 'you',
 'year',
 'when',
 'It',
 'two',
 'people',
 '-',
 'all',
 'can',
 'over',
 'last',
 'first',
 'But',
 'into',
 "'",
 'He',
 'A',
 'we',
 'In',
 'she',
 'other',
 'new',
 'years',
 'could',
 'there',
 '?',
 'time',
 'some',
 'them',
 'if',
 'no',
 'percent',
 'so',
 'what',
 'only',
 'government',
 'million',
 'just',
 'U.S.',
 'him',
 'before',
 'most',
 'like',
 'because',
 'now',
 'three',
 ';',
 'being',
 'against',
 'do',
 'Obama',
 'where',
 'made',
 'Mr',
 'many',
 'New',
 'back',
 'an

In [40]:
with open('vocabulario_jur_sm.txt', mode='r', encoding='utf8') as f:
    words = [p.replace('\n', '') for p in f]
print(len(words))

2070357


In [41]:
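# Doc-id heuristic: the token starts with 7 hex characters (0-9, a-f), but those 7 characters are not all letters.
doc_ids = [doc_id for doc_id in words if (re.match(pattern=r"\b[0-9a-f]{7}\b", string=doc_id) and not re.match(pattern=r"\b[a-f]{7}\b", string=doc_id))]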
doc_ids

['0024319-19.2015.5.24.000',
 'cb5d56e',
 '2a15223',
 '0012008',
 '1510486-6',
 '0005974',
 '0014942',
 '1002028-0',
 'de2d62a',
 '1075700',
 'ee81eb5',
 'b39b434',
 '0002073',
 '1018459',
 '1510540-4',
 '1059787-SSP',
 '0024142-55.2015.0000',
 '0000121-69.2015.5.22.000',
 '4854dc4',
 '0000368-32.2013.5.11.451',
 '06c13da',
 '1747100-1',
 '0000479-',
 '1245260-SSP',
 '0000336.50.2012.5.22.0000',
 '0024142.55.2015.5.24.0000',
 '0a5f19f',
 '0010570-28.2013.05.0001',
 '1107515',
 'd42c653',
 '0024142-55',
 '1102433',
 '0800003',
 'e7060f3',
 '2f764e0',
 '1df9e15',
 'b231607',
 '7cadceb',
 '0012572',
 'e87f9dc',
 '0173535-8',
 '73cbb43',
 '0324693-0',
 '7e71e4e',
 '0005623',
 '1405523',
 'c651607',
 'bae7cd3',
 '0079614-2',
 '0001133',
 '0009683-6',
 '985628d',
 '0041464-3',
 '1f64aa1',
 '0010520-65.2015.503.0040',
 'c778f74',
 'd6358a7',
 '0001321',
 '1070176-SSP',
 '5dab9c7',
 '0119225-2',
 'c4e42c4',
 'e3349e6',
 'd971ac2',
 '45cfc11',
 '3cea6c6',
 'f69175f',
 '0029142',
 '38dc595',
 '1

In [42]:
len(doc_ids)

489786
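
The doc-id heuristic used above is easier to check on a handful of tokens. The sketch below wraps the same two regular expressions in a helper; the names `HEX7`, `LETTERS7` and `looks_like_doc_id` are illustrative and are not part of the notebook, and `'acabada'` is a sample word chosen only because all 7 of its letters fall in the a-f range. The other tokens are taken from the outputs above.

```python
import re

# The same two patterns as in the cell above, compiled once for reuse.
HEX7 = re.compile(r"\b[0-9a-f]{7}\b")      # 7 hex characters ending at a word boundary
LETTERS7 = re.compile(r"\b[a-f]{7}\b")     # 7 letters a-f ending at a word boundary

def looks_like_doc_id(token: str) -> bool:
    """Return True when the token starts with 7 hex characters that are not all letters.

    `match` anchors at the start of the token only, so longer tokens such as
    process numbers are flagged whenever their first 7 characters qualify.
    """
    return bool(HEX7.match(token)) and not LETTERS7.match(token)

# Illustrative sample: hash-like ids, digit-only prefixes, and ordinary words.
sample = ['cb5d56e', '0012008', '0024319-19.2015.5.24.000',
          '1510486-6', 'reclamante', 'acabada']
flags = {token: looks_like_doc_id(token) for token in sample}
for token, flagged in flags.items():
    print(f"{token!r}: {flagged}")
```

On this sample the hash-like token, the 7-digit token and the two longer numbers are flagged, while `'reclamante'` and `'acabada'` are kept. Because the match is anchored only at the start, the heuristic also catches process numbers and plain 7-digit numbers, which is why the list above mixes both kinds of token.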

In [16]:
with open('vocabulary_jur_pt.txt', mode='r', encoding='utf8') as f:
    words_jur = [p.replace('\n', '') for p in f]
print(len(words_jur))

1015374


In [17]:
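# Flag tokens in vocabulary_jur_pt.txt that start with 7 hex characters (0-9, a-f) that are not all letters.
doc_ids = [doc_id for doc_id in words_jur if (re.match(pattern=r"\b[0-9a-f]{7}\b", string=doc_id) and not re.match(pattern=r"\b[a-f]{7}\b", string=doc_id))]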
print(len(doc_ids))
doc_ids

133130


['0000000',
 '0000002',
 '0000005-44.2015.503.00145',
 '0000005-67.2013.520.0007',
 '0000006.50.2013.5.24.0004',
 '0000008',
 '0000009-41',
 '0000010',
 '0000010-55.2014.503.0063',
 '0000011-94.2010.05.08.0013',
 '0000011.53.2012.5.06.0022',
 '0000012-04.2013.5.',
 '0000012-74.2017.5.08.000',
 '0000012-91.2014.5.11.055',
 '0000013-36.2012.5.10.111',
 '0000014-44.2014.503.0176',
 '0000017-95.2014.05.08.0002',
 '0000018-87.2014.10.0111',
 '0000020-24',
 '0000023-19.2016.05.09.0095',
 '0000023-47.2012.5.02.000',
 '0000027',
 '0000027-09.2011.5.24.',
 '0000028.2014.5.21.0012',
 '0000029-79.2013.5.',
 '0000032-15.2013.5.23.121',
 '0000039-2015',
 '0000039-22',
 '0000040-07',
 '0000042-62.2016.5.11.000',
 '0000043-59',
 '0000044-44',
 '0000045-33.2010.5.10.00',
 '0000045.20.2015.5.07.0014',
 '0000046-14',
 '0000047-91.2015',
 '0000047-96',
 '0000049-55.2015.5.23.000',
 '0000049-66',
 '0000050-51',
 '0000051-10.5.23.0005',
 '0000051-64.2016.5.12.00',
 '0000052-46.2014.5.06.411',
 '0000052-83.

In [18]:
# Keep only hash-like ids: drop every token that starts with 7 plain digits.
doc_ids = [doc_id for doc_id in doc_ids if not re.match(pattern=r"\b[0-9]{7}\b", string=doc_id)]
print(len(doc_ids))
doc_ids

114239


['00007ea',
 '000197b',
 '0001c00',
 '0001ea0',
 '00020df',
 '00021d9',
 '00026ce',
 '00027fa',
 '0002f51',
 '0004d0f',
 '0004efc',
 '0005ca0',
 '000643f',
 '000a1db',
 '000b6b5',
 '000b857',
 '000b9c2',
 '000d19d',
 '000f9f2',
 '000fb99',
 '00101f9',
 '001078a',
 '00111ea',
 '00114c9',
 '0011c20',
 '0012e98',
 '001310b',
 '001322f',
 '00146fb',
 '001494f',
 '0014e47',
 '0015e7e',
 '0016e4b',
 '001714d',
 '0017de5',
 '00182e8',
 '001a317',
 '001a4e0',
 '001ae78',
 '001b331',
 '001c04f',
 '001c0bd',
 '001e6fe',
 '001f59a',
 '001fdf9',
 '002054a',
 '002136e',
 '002171e',
 '0021bdb',
 '0021ca1',
 '0021f69',
 '00239aa',
 '0023a66',
 '00288b4',
 '002919d',
 '002a4a9',
 '002a606',
 '002c040',
 '002d3b5',
 '002d985',
 '002e2d1',
 '002e9e0',
 '002f092',
 '002f4ab',
 '00301a7',
 '0030cd9',
 '0030d56',
 '003101e',
 '003161a',
 '0033a85',
 '00340de',
 '00346a4',
 '003582b',
 '0035a60',
 '00360ed',
 '00367b4',
 '0036a76',
 '00382c2',
 '003a044',
 '003a259',
 '003a4a4',
 '003a7c0',
 '003a8d5',
 '00

In [26]:
# Build the vocabulary by dropping every token that the doc-id heuristic flags.
vocabulary = [word for word in words_jur if not (re.match(pattern=r"\b[0-9a-f]{7}\b", string=word) and not re.match(pattern=r"\b[a-f]{7}\b", string=word))]
print(len(vocabulary))
vocabulary

882244


['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '-',
 '--',
 '---',
 '----',
 '-----',
 '------',
 '-------',
 '--------',
 '---------',
 '----------',
 '-----------',
 '------------',
 '-------------',
 '--------------',
 '---------------',
 '----------------',
 '-----------------',
 '------------------',
 '-------------------',
 '--------------------',
 '---------------------',
 '----------------------',
 '-----------------------',
 '------------------------',
 '-------------------------',
 '--------------------------',
 '---------------------------',
 '----------------------------',
 '---------------

In [27]:
# Drop separator runs: any token that starts with two or more dashes.
vocabulary = [word for word in vocabulary if not word.startswith('--')]
print(len(vocabulary))
vocabulary

881700


['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '-',
 '-.-.-.-.-.-.-.-.-.-.-',
 '-.A',
 '-.O',
 '-.horas',
 '-0',
 '-0,00',
 '-0,01',
 '-0,02',
 '-0,04',
 '-0,05',
 '-0,06',
 '-0,09',
 '-0,10',
 '-0,15',
 '-0,30',
 '-0,35',
 '-0,50',
 '-0,55',
 '-0,59',
 '-0,6',
 '-0,60',
 '-0,68',
 '-0,69',
 '-0-',
 '-0.6',
 '-00',
 '-000',
 '-000,00',
 '-00030',
 '-001',
 '-002',
 '-002232',
 '-0040',
 '-006',
 '-0062',
 '-007',
 '-00870.2008.006.14.',
 '-01',
 '-01.11.2013',
 '-010',
 '-013',
 '-01667.2001.001.23.',
 '-02',
 '-020',
 '-021',
 '-02168',
 '-0288',
 '-03',
 '-03.394.296',
 '-030',
 '-0300',
 '-04',
 '-0

In [32]:
vocabulary

['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '-',
 '-.-.-.-.-.-.-.-.-.-.-',
 '-.A',
 '-.O',
 '-.horas',
 '-0',
 '-0,00',
 '-0,01',
 '-0,02',
 '-0,04',
 '-0,05',
 '-0,06',
 '-0,09',
 '-0,10',
 '-0,15',
 '-0,30',
 '-0,35',
 '-0,50',
 '-0,55',
 '-0,59',
 '-0,6',
 '-0,60',
 '-0,68',
 '-0,69',
 '-0-',
 '-0.6',
 '-00',
 '-000',
 '-000,00',
 '-00030',
 '-001',
 '-002',
 '-002232',
 '-0040',
 '-006',
 '-0062',
 '-007',
 '-00870.2008.006.14.',
 '-01',
 '-01.11.2013',
 '-010',
 '-013',
 '-01667.2001.001.23.',
 '-02',
 '-020',
 '-021',
 '-02168',
 '-0288',
 '-03',
 '-03.394.296',
 '-030',
 '-0300',
 '-04',
 '-0

In [35]:
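# Drop negative amounts: tokens that start with '-' and whose remainder matches currency_pattern (defined in another cell).
valores = [valor for valor in vocabulary if not (valor.startswith('-') and re.match(pattern=currency_pattern, string=valor[1:]))]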

In [36]:
print(len(valores))
valores

879980


['</S>',
 '<S>',
 '<UNK>',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c ',
 '\x1d ',
 '\x1e ',
 '\x1e \x1f ',
 '\x1f ',
 '\x1f \x1e ',
 '\x1f \x1f ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 ',00',
 ',042.',
 ',1',
 ',12',
 ',2,',
 ',4',
 ',5',
 '-',
 '-.-.-.-.-.-.-.-.-.-.-',
 '-.A',
 '-.O',
 '-.horas',
 '-0',
 '-0,6',
 '-0-',
 '-0.6',
 '-00',
 '-000',
 '-00030',
 '-001',
 '-002',
 '-002232',
 '-0040',
 '-006',
 '-0062',
 '-007',
 '-00870.2008.006.14.',
 '-01',
 '-01.11.2013',
 '-010',
 '-013',
 '-01667.2001.001.23.',
 '-02',
 '-020',
 '-021',
 '-02168',
 '-0288',
 '-03',
 '-03.394.296',
 '-030',
 '-0300',
 '-04',
 '-040',
 '-05',
 '-050',
 '-05h',
 '-06',
 '-060',
 '-07',
 '-070',
 '-08',
 '-08.01.2015',
 '-09',
 '-1',
 '-1,0',
 '-1-',
 '-1.000',
 '-1.000,000',
 '-1.678',
 '-10',
 '-10.00,00',
 '

In [37]:
# Inspect every token that starts with '-'.
valores = [valor for valor in vocabulary if valor.startswith('-')]
print(len(valores))
valores

4888


['-',
 '-.-.-.-.-.-.-.-.-.-.-',
 '-.A',
 '-.O',
 '-.horas',
 '-0',
 '-0,00',
 '-0,01',
 '-0,02',
 '-0,04',
 '-0,05',
 '-0,06',
 '-0,09',
 '-0,10',
 '-0,15',
 '-0,30',
 '-0,35',
 '-0,50',
 '-0,55',
 '-0,59',
 '-0,6',
 '-0,60',
 '-0,68',
 '-0,69',
 '-0-',
 '-0.6',
 '-00',
 '-000',
 '-000,00',
 '-00030',
 '-001',
 '-002',
 '-002232',
 '-0040',
 '-006',
 '-0062',
 '-007',
 '-00870.2008.006.14.',
 '-01',
 '-01.11.2013',
 '-010',
 '-013',
 '-01667.2001.001.23.',
 '-02',
 '-020',
 '-021',
 '-02168',
 '-0288',
 '-03',
 '-03.394.296',
 '-030',
 '-0300',
 '-04',
 '-040',
 '-05',
 '-050',
 '-05h',
 '-06',
 '-060',
 '-07',
 '-070',
 '-08',
 '-08.01.2015',
 '-09',
 '-1',
 '-1,0',
 '-1,00',
 '-1,10',
 '-1,30',
 '-1,50',
 '-1-',
 '-1.000',
 '-1.000,00',
 '-1.000,000',
 '-1.000.000,00',
 '-1.001.079.252,00',
 '-1.004,15',
 '-1.005,00',
 '-1.005,40',
 '-1.006,07',
 '-1.007,70',
 '-1.009,33',
 '-1.011,66',
 '-1.012,00',
 '-1.013,60',
 '-1.014,00',
 '-1.016,28',
 '-1.017,84',
 '-1.017,90',
 '-1.018,60',


In [43]:
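# Inspect every token whose beginning matches currency_pattern (re.match anchors at the start of the token).
valores = [valor for valor in vocabulary if re.match(pattern=currency_pattern, string=valor)]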
print(len(valores))
valores

107972


['-0,00',
 '-0,01',
 '-0,02',
 '-0,04',
 '-0,05',
 '-0,06',
 '-0,09',
 '-0,10',
 '-0,15',
 '-0,30',
 '-0,35',
 '-0,50',
 '-0,55',
 '-0,59',
 '-0,60',
 '-0,68',
 '-0,69',
 '-000',
 '-000,00',
 '-00030',
 '-001',
 '-002',
 '-002232',
 '-0040',
 '-006',
 '-0062',
 '-007',
 '-010',
 '-013',
 '-020',
 '-021',
 '-02168',
 '-0288',
 '-030',
 '-0300',
 '-040',
 '-050',
 '-060',
 '-070',
 '-1,00',
 '-1,10',
 '-1,30',
 '-1,50',
 '-1.000,00',
 '-1.000.000,00',
 '-1.001.079.252,00',
 '-1.004,15',
 '-1.005,00',
 '-1.005,40',
 '-1.006,07',
 '-1.007,70',
 '-1.009,33',
 '-1.011,66',
 '-1.012,00',
 '-1.013,60',
 '-1.014,00',
 '-1.016,28',
 '-1.017,84',
 '-1.017,90',
 '-1.018,60',
 '-1.020,00',
 '-1.020,40',
 '-1.021,00',
 '-1.021,32',
 '-1.023.518,56',
 '-1.024,91',
 '-1.026,28',
 '-1.028,93',
 '-1.033,00',
 '-1.034,00',
 '-1.037,00',
 '-1.038,52',
 '-1.039,72',
 '-1.039,73',
 '-1.040,00',
 '-1.046,80',
 '-1.050,00',
 '-1.055,12',
 '-1.056,00',
 '-1.060,00',
 '-1.064,00',
 '-1.070,00',
 '-1.071,00',
 '

In [44]:
valores[-300:]

['99460215',
 '99461',
 '99464',
 '99465',
 '994651',
 '9947',
 '99473',
 '994751',
 '994756',
 '994773',
 '994782',
 '99479',
 '9948',
 '994806',
 '99481',
 '99483',
 '99484',
 '99487',
 '99489',
 '9949',
 '99490',
 '994935',
 '99494',
 '99495',
 '995',
 '9950',
 '99500',
 '99504',
 '99509',
 '9951',
 '99514',
 '995145',
 '99518',
 '9952',
 '99520',
 '99522',
 '99523',
 '99524',
 '99526',
 '99529',
 '9953',
 '99530',
 '99533',
 '99537',
 '995377',
 '9954',
 '99543',
 '99544',
 '99546',
 '9955',
 '99550',
 '9955240812006509',
 '9956',
 '99562430600',
 '99563',
 '99565',
 '99568',
 '995686',
 '995698',
 '9957',
 '99573',
 '99575',
 '99576',
 '9958',
 '99580',
 '99586',
 '99587',
 '995890',
 '9958920135040561',
 '9959',
 '99591',
 '99592',
 '99596',
 '996',
 '9960',
 '99601',
 '99603',
 '996051',
 '996064966',
 '99607',
 '9961',
 '99610',
 '99611',
 '996126',
 '99613',
 '996193',
 '9962',
 '99621',
 '99622',
 '996244',
 '99625',
 '99626',
 '99627',
 '99628684',
 '9963',
 '99630',
 '99632