## IMDb

At Fast.ai we have introduced a new module called fastai.text which replaces the torchtext library that was used in our 2018 dl1 course. The fastai.text module also supersedes the fastai.nlp library but retains many of the key functions.

In [1]:
from fastai.text import *
from fastai.core import num_cpus, partition_by_cores
import html
from pathlib import Path
import numpy as np
import csv
import pandas as pd
from collections import Counter, defaultdict
from itertools import chain
from nltk.corpus import brown
import os, re

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from typing import Callable, List, Collection
from concurrent.futures.process import ProcessPoolExecutor

The fastai.text module introduces several custom tokens.

We need to download the IMDb Large Movie Review dataset from this site: http://ai.stanford.edu/~amaas/data/sentiment/
Direct link: [Link](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz). Untar it into the PATH location. We use pathlib, which makes directory traversal a breeze.

In [2]:
PATH=Path('/media/discoD/repositorios/1-billion-word-language-modeling-benchmark/')

## Standardize format

The IMDb dataset has three classes: positive, negative, and unsupervised (sentiment unknown).
There are 75k training reviews (12.5k pos, 12.5k neg, 50k unsup).
There are 25k validation reviews (12.5k pos, 12.5k neg, no unsup).

Refer to the README file in the imdb corpus for further information about the dataset.

In [3]:
class VocabularyTokenizer():
    "Put together rules, a tokenizer function and a language to tokenize text with multiprocessing."
    def __init__(self, tok_func:Callable=SpacyTokenizer, lang:str='pt', n_cpus:int=None):
        self.tok_func,self.lang = tok_func,lang
        self.n_cpus = n_cpus or num_cpus()//2

    def process_text(self, t:str, tok:BaseTokenizer) -> List[str]:
        "Process one text `t` with tokenizer `tok`."
        return tok.tokenizer(t)

    def _process_all_1(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts` in one process."
        tok = self.tok_func(self.lang)
        return [self.process_text(t, tok) for t in texts]

    def process_all(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts`."
        if self.n_cpus <= 1: return self._process_all_1(texts)
        with ProcessPoolExecutor(self.n_cpus) as e:
            return sum(e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), [])
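The split-and-merge step in `process_all` can be sketched with the standard library alone. This is a minimal sketch, not fastai's implementation: `partition` is a hypothetical stand-in for `partition_by_cores`, a whitespace split replaces the spaCy tokenizer, and the chunks are processed sequentially so the data flow is easy to follow. The per-chunk results are merged with `itertools.chain.from_iterable`, which avoids the repeated list copying that `sum(..., [])` performs.

```python
from itertools import chain
from typing import Collection, List

def partition(items: List[str], n_chunks: int) -> List[List[str]]:
    "Hypothetical helper: split `items` into at most `n_chunks` contiguous chunks."
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def tokenize_chunk(texts: Collection[str]) -> List[List[str]]:
    "Stand-in tokenizer: whitespace split instead of spaCy."
    return [t.split() for t in texts]

texts = ["o autor requer", "a reclamada contesta", "horas extras",
         "valor da causa", "prazo legal"]
chunks = partition(texts, 2)

# Each chunk yields a list of token lists; flatten them back into one list.
tokens = list(chain.from_iterable(tokenize_chunk(c) for c in chunks))
print(len(chunks), tokens[0])
```

In the real class each chunk is handed to a separate worker process, so the tokenizer is built once per worker rather than once per text.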

In [4]:
def save_texts(paths, filename, lang):
    "Write every line of every file under `paths` to `<filename>_<lang>.csv` as (text, label) rows."
    CLASSES = ['unsup']  # single placeholder label: the corpus has no sentiment labels
    file_count = 0
    filename = filename + '_' + lang + '.csv'
    # Mode 'w' truncates any file left over from a previous run.
    with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='\\')
        for idx,label in enumerate(CLASSES):
            for path in paths:
                for fname in path.glob('*'):
                    file_count += 1
                    print('writing from %s' % fname)
                    # Close each source file as soon as its lines are written.
                    with fname.open('r', encoding='utf-8') as f:
                        for line in f.read().split('\n'):
                            writer.writerow([line, idx])
    print('%d texts saved to %s' % (file_count, filename))
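To see what `quoting=csv.QUOTE_NONE` with `escapechar='\\'` does to a line of text, the round trip below writes two rows to an in-memory buffer and reads them back. It is a small illustration with made-up sample sentences, not part of the pipeline: commas inside the text are written as `\,` instead of wrapping the field in quotes, and a reader configured with the same escape character restores the original text.

```python
import csv
import io

# Made-up sample rows: (text, label)
rows = [("Olá, mundo", 0), ("valor de R$ 1.500,00, conforme fls. 10", 0)]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='\\')
for text, label in rows:
    writer.writerow([text, label])

raw = buffer.getvalue()
print(raw)  # commas inside the text appear as "\,"

# Reading with the same settings recovers the original fields.
reader = csv.reader(io.StringIO(raw), delimiter=',', quoting=csv.QUOTE_NONE, escapechar='\\')
restored = [(text, int(label)) for text, label in reader]
print(restored == rows)
```

This is why the later `pd.read_csv` call also passes `escapechar='\\'`.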

In [5]:
save_texts([PATH/'training-jur/'], 'train_jur', 'pt')
save_texts([PATH/'heldout-jur/'], 'test_jur', 'pt')
#save_texts(PATH/'training-monolingual.tokenized.shuffled/', 'train' + lang + '.csv')
#save_texts(PATH/'heldout-monolingual.tokenized.shuffled/', 'test' + lang + '.csv')
save_texts([PATH/'training-jur/',PATH/'heldout-jur/'], 'full_jur', 'pt')

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00001-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00002-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00003-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00004-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00005-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00006-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00007-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00008-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchma

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00073-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00074-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00075-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00076-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00077-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00078-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00079-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00080-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchma

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00043-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00044-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00045-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00046-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00047-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00048-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00049-of-00050
51 texts saved to test_jur_pt.csv
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00001-of-001

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00033-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00050-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00067-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00068-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00069-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00070-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00071-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/training-jur/jur-00072-of-00100
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchma

writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00037-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00038-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00039-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00040-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00041-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00042-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00043-of-00050
writing from /media/discoD/repositorios/1-billion-word-language-modeling-benchmark/heldout-jur/jur.heldout-00044-of-00050
writing from /media/disc

In [4]:
def get_tokens(filename):
    data = pd.read_csv(filename, header=None, escapechar='\\', chunksize=500000)
    for idx, df in enumerate(data):
        print(idx)
        yield VocabularyTokenizer().process_all(df[0].astype(str))

In [5]:
freq_full = Counter(p for o in chain.from_iterable(get_tokens('full_jur_pt.csv')) for p in o)
freq_full.most_common()

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57


[(',', 86209975),
 ('de', 47899122),
 ('a', 46778448),
 ('.', 40302847),
 ('o', 32367373),
 ('do', 24197323),
 ('que', 23907082),
 ('da', 21574005),
 ('e', 18370734),
 (')', 17040131),
 ('(', 15903152),
 ('/', 13528846),
 ('em', 12670640),
 ('"', 10080484),
 ('no', 8940297),
 ('não', 8790572),
 ('os', 8309974),
 ('para', 8124080),
 (':', 7868768),
 (';', 7344581),
 ('à', 6933346),
 ('na', 6887398),
 ('por', 5787370),
 ('com', 5489777),
 ('reclamante', 5418756),
 ('-', 4749272),
 ('as', 4721645),
 ('se', 4296054),
 ('DE', 4288772),
 ('dos', 4245311),
 ('reclamada', 3848965),
 ('das', 3606642),
 ('R', 3320080),
 ('pelo', 3305948),
 ('A', 3303312),
 ('$', 3300015),
 ('como', 3183664),
 ('pela', 3180415),
 ('nos', 3101938),
 ('s', 3001102),
 ('art', 2626062),
 ('é', 2512428),
 ('ou', 2475883),
 ('nº', 2458896),
 ('autos', 2455167),
 ('foi', 2452390),
 ('pagamento', 2444342),
 ('trabalho', 2349480),
 ('autor', 2283061),
 ('horas', 2208676),
 ('valor', 2169408),
 ('O', 2148000),
 ('ser', 209
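The nested generator that builds `freq_full` is easier to read on a toy input. In the sketch below, `toy_chunks` is a hypothetical stand-in for `get_tokens`: it yields one list of token lists per chunk. `chain.from_iterable` flattens the chunks into a stream of token lists, and the inner loop then feeds individual tokens to `Counter`.

```python
from collections import Counter
from itertools import chain

def toy_chunks():
    "Hypothetical stand-in for get_tokens: yields one list of token lists per chunk."
    yield [["o", "autor", "requer", "o", "pagamento"], ["horas", "extras"]]
    yield [["o", "pagamento", "de", "horas", "extras"]]

# Outer loop: token lists across all chunks; inner loop: tokens in each list.
freq = Counter(tok for doc in chain.from_iterable(toy_chunks()) for tok in doc)
print(freq.most_common(3))
```

Because everything is a generator until it reaches `Counter`, only one chunk of tokens needs to be held in memory at a time.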

In [6]:
palavras = [palavra for palavra, contagem in freq_full.most_common()]
print(len(palavras))
palavras

7617126


[',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 '2015',
 'esta',
 'pedido',
 'ré',
 '2014',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda',
 '3',


In [8]:
sum(freq_full.values())

1225378678

In [15]:
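def write_list(array, filename):
    "Write the special tokens followed by `array` to `filename`, one item per line."
    # Build a new list so the caller's list is not modified; the order matches
    # what the three original insert(0, ...) calls produced.
    items = ['</S>', '<S>', '<UNK>'] + list(array)
    with open(filename, 'w', encoding='utf-8') as file:
        for item in items:
            file.write(item + '\n')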

In [10]:
palavras_filtradas = [palavra for palavra, contagem in freq_full.most_common() if contagem > 1]
print(len(palavras_filtradas))
palavras_filtradas

2571698


[',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 '2015',
 'esta',
 'pedido',
 'ré',
 '2014',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda',
 '3',


In [12]:
palavras_filtradas = [palavra for palavra in palavras_filtradas if not re.match(pattern=r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}', string=palavra)]
print(len(palavras_filtradas))
palavras_filtradas

2070354


[',',
 'de',
 'a',
 '.',
 'o',
 'do',
 'que',
 'da',
 'e',
 ')',
 '(',
 '/',
 'em',
 '"',
 'no',
 'não',
 'os',
 'para',
 ':',
 ';',
 'à',
 'na',
 'por',
 'com',
 'reclamante',
 '-',
 'as',
 'se',
 'DE',
 'dos',
 'reclamada',
 'das',
 'R',
 'pelo',
 'A',
 '$',
 'como',
 'pela',
 'nos',
 's',
 'art',
 'é',
 'ou',
 'nº',
 'autos',
 'foi',
 'pagamento',
 'trabalho',
 'autor',
 'horas',
 'valor',
 'O',
 'ser',
 'sua',
 'd',
 'dias',
 'CLT',
 'DA',
 'ID',
 'DO',
 'parte',
 'Trabalho',
 'prazo',
 'sob',
 'uma',
 'sobre',
 'seu',
 'E',
 'LTDA',
 '%',
 'dia',
 'termos',
 'empresa',
 '1',
 'este',
 'sem',
 'até',
 'audiência',
 'forma',
 'conforme',
 'partes',
 '2015',
 'esta',
 'pedido',
 'ré',
 '2014',
 'mais',
 'sendo',
 'um',
 'n',
 'salário',
 'já',
 'TST',
 'era',
 'sentença',
 'FGTS',
 'presente',
 'período',
 'processo',
 'caso',
 'contrato',
 '10',
 'autora',
 'inicial',
 'advogado',
 'depoente',
 'qual',
 'extras',
 'quando',
 'artigo',
 'Em',
 'jornada',
 'há',
 '§',
 'ainda',
 '3',


In [16]:
write_list(array=palavras_filtradas, filename='vocabulario_jur_sm.txt')

In [14]:
!tail -n 10 vocabulario_jur_sm.txt

637-85
09838-3
Cestrem
0142721
nº1383205
2.243,41
085279f
d51dc8d
4a84915
670dc59


In [14]:
singletons = [palavra for palavra, contagem in freq_full.most_common() if contagem == 1]
singletons

['35233588460701',
 '484-E',
 '812,12',
 '024881',
 '0000028-09.2014.5.15.0042',
 '232fc9d',
 'cf2bc68',
 '1aa4f4a',
 '743-030',
 'Urbanetz',
 '28b077a',
 'SCHWARZENBERG',
 'inverntario',
 'ofiado',
 'COMPELTADAS',
 '0000153-59.2014.5.08.0013',
 'aad549e',
 '58f630c',
 '0071000-84.2013.5.21.0005',
 '887f762',
 '0000670-87.2014.5.09.0643',
 '1000934-84.2016.5.02.0070',
 'be66415',
 '7fa5cf6',
 '59deea6',
 '0011116-68.2015.5.15.0055',
 '0000309-79.2015.5.17.0007',
 'e8c1d83',
 '1251085',
 '4963dc6',
 'bbd48b4',
 'Nº0000529-35.2016.5.10.0008',
 '4022439',
 'ca5c545',
 '1FR',
 '89205-120',
 'vidraceiro-montador',
 '1,0061077',
 'BURLOU',
 'Osasco-',
 '06016-902',
 '605812',
 'e008487',
 'b2b2f73',
 '0000685-29.2015.5.05.0612',
 'd18c5f1',
 'Id.d56b31a',
 '186627',
 '0011053-30.2015.5.03.0038',
 '7.4ºC.',
 'reabriu-a',
 '3e4bee1',
 '7237d80',
 'vítalícia',
 'ec67203',
 'e4e2a89',
 '28258f8',
 'e0e2d46',
 '2dd1126',
 '7ae9b13',
 'IDf381c5d',
 '9cfffa0',
 '0000833-02.2014.5.04.0451',
 'ED-DC-

In [15]:
len(singletons)

5045428
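The counts printed so far are enough to work out how much of the corpus the singletons represent. The sketch below only reuses numbers already shown in the outputs above; since each singleton occurs exactly once, the number of singleton types equals the number of tokens they account for.

```python
n_types = 7_617_126        # len(palavras)
n_kept = 2_571_698         # len(palavras_filtradas) after keeping count > 1
n_singletons = 5_045_428   # len(singletons)
n_tokens = 1_225_378_678   # sum(freq_full.values())

# The two filters split the vocabulary exactly.
assert n_kept + n_singletons == n_types

share_of_types = n_singletons / n_types
share_of_tokens = n_singletons / n_tokens  # each singleton contributes one token
print(f"{share_of_types:.1%} of types, {share_of_tokens:.2%} of tokens")
```

Dropping the singletons therefore removes roughly two thirds of the vocabulary entries while touching well under one percent of the running text.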

In [17]:
processos = [singleton for singleton in singletons if re.match(pattern=r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}', string=singleton)]
processos

['0000028-09.2014.5.15.0042',
 '0000153-59.2014.5.08.0013',
 '0071000-84.2013.5.21.0005',
 '0000670-87.2014.5.09.0643',
 '1000934-84.2016.5.02.0070',
 '0011116-68.2015.5.15.0055',
 '0000309-79.2015.5.17.0007',
 '0000685-29.2015.5.05.0612',
 '0011053-30.2015.5.03.0038',
 '0000833-02.2014.5.04.0451',
 '0001420-77.2013.5.03.0098',
 '0011532-13.2015.5.12.0025',
 '0011511-96.2015.5.15.0140',
 '0001598-65.2016.5.07.0015',
 '0000080-48.2014.5.04.0741',
 '0000282-95.2014.5.11.0008',
 '0001107-49.2013.5.23.0005',
 '0010231-15.2013.5.01.0055',
 '0000027-67.2014.5.20.0015',
 '0001744-35.2014.5.10.0002',
 '1000594-07.2015.5.02.0255',
 '1000196-78.2014.5.02.0422',
 '0001312-71.2014.5.08.0131',
 '0001402-56.2016.5.07.0028',
 '0000655-61.2014.5.07.00001',
 '0000310-13.2015.5.08.0008',
 '0001586-69.2013.5.09.0122',
 '0001352-88.2014.5.05.0017',
 '0000428-65.2015.5.07.0024',
 '0011026-17.2016.5.09.0015',
 '0010475-13.2014.5.01.0053',
 '0020801-40.2015.5.04.0012',
 '0010371-56.2015.5.03.0109',
 '0001426

In [18]:
len(processos)

585553
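The case-number pattern used above is loose in two ways: its dots are unescaped, so they match any character, and `re.match` only anchors the start of the token. The sketch below compares it with a stricter variant, which is an assumption of this example and not what produced the counts above. The first three samples come from the singleton output; the last one is a made-up token.

```python
import re

# Pattern as used above: '.' matches any character.
loose = re.compile(r'\d{7}-\d{2}.\d{4}.\d{1}.\d{2}.\d{4}')
# Stricter variant (an assumption, not the notebook's pattern): literal dots.
strict = re.compile(r'\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}')

samples = [
    '0000028-09.2014.5.15.0042',    # well-formed number
    '0000655-61.2014.5.07.00001',   # one extra trailing digit
    'Nº0000529-35.2016.5.10.0008',  # number with a prefix
    '0000028-09x2014y5z15w0042',    # made-up token without any dots
]

results = {}
for s in samples:
    results[s] = (bool(loose.match(s)),        # anchored at the start only
                  bool(strict.fullmatch(s)),   # whole token must be a number
                  bool(strict.search(s)))      # number anywhere in the token
    print(s, results[s])
```

Which variant is appropriate depends on the goal: `fullmatch` keeps only clean case numbers, while `search` also catches tokens where the number is glued to a prefix.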

In [34]:
currency_pattern = r"\b(?<![.,-])[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}(?![.,-])\b|\b(?<![.,-])[0-9]{1,3}(?:.?[0-9]{3})*\,[0-9]{2}(?![.,-])\b"
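A quick way to check what this pattern accepts is to run it against a handful of tokens. The sketch below copies `currency_pattern` unchanged and applies `re.match` to values taken from the singleton list, plus one made-up token (`1x500,43`) that shows the effect of the unescaped `.` in the second alternative.

```python
import re

currency_pattern = r"\b(?<![.,-])[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}(?![.,-])\b|\b(?<![.,-])[0-9]{1,3}(?:.?[0-9]{3})*\,[0-9]{2}(?![.,-])\b"

samples = [
    '812,12',                     # comma as decimal separator
    '1,102.94',                   # dot as decimal separator
    '46.56',
    '1500,43',                    # no thousands separator
    'Urbanetz',                   # surname
    '7.4ºC.',                     # temperature
    '0000028-09.2014.5.15.0042',  # case number
    '1x500,43',                   # made-up: '.?' accepts any separator character
]

matches = {s: bool(re.match(currency_pattern, s)) for s in samples}
for s, ok in matches.items():
    print(f"{s!r}: {ok}")
```

Escaping that dot (`\.?`) would restrict the thousands separator to a literal period, at the cost of possibly changing the counts reported below.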

In [35]:
valores = [singleton for singleton in singletons if re.match(pattern=currency_pattern, string=singleton)]
print(len(valores))
valores

345644


['812,12',
 '5.743,01',
 '1.617,42',
 '1.112,47',
 '8.060,91',
 '11.427,18',
 '19.178,93',
 '26.449,07',
 '17.582,51',
 '9.577,33',
 '28.561,12',
 '12.761,90',
 '14.924,00',
 '8.959,36',
 '1500,43',
 '189.339,06',
 '10.192,13',
 '44.227,65',
 '18.665,49',
 '12.231,53',
 '3598,04',
 '9.168,37',
 '2.997,11',
 '1,102.94',
 '156.226,06',
 '6.520,42',
 '32.424,06',
 '19.258,92',
 '5.208,28',
 '16.506,66',
 '10.299,14',
 '42.767,61',
 '43.605,64',
 '3.557,24',
 '114.800,00',
 '152.945,13',
 '251.725,45',
 '1520,41',
 '28.721,00',
 '10.916,22',
 '19.242,89',
 '20.414,67',
 '1.232,28',
 '3,226.06',
 '6.555,92',
 '53.487,98',
 '18.988,82',
 '46.56',
 '2.327,75',
 '29.026,91',
 '2.578,65',
 '9.904,32',
 '7.854,90',
 '2.138,59',
 '594.824,46',
 '35.897,24',
 '3.106,49',
 '10.473,76',
 '111.791,11',
 '14.415,39',
 '5.350,90',
 '8014,74',
 '13.900,57',
 '1.574,58',
 '274.829,84',
 '34.194,51',
 '774,77',
 '1.196,79',
 '10.387,29',
 '83.900,08',
 '766,30',
 '8.081,13',
 '12.286,40',
 '5.187,47',
 '1

In [36]:
valores_cifra = [valor for valor in singletons if (valor.startswith('R$') or valor.startswith('$'))]
print(len(valores_cifra))
valores_cifra

0


[]