# Name: Rafael Ito

The goal of this experiment is to get to know scikit-learn's CountVectorizer by applying it to a small sample of the IMDB dataset and by writing equivalent functions in Python.

Functions to implement:

1. vocab = build_vocab(corpus)
2. corpus_tok = tokenizer(corpus, vocab)
3. doc_term = feature(corpus_tok)

While debugging your program, use a very small corpus with only a few examples; once it is debugged, run it on the 1000 examples of imdb_sample.

## Using the scikit-learn example:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import re
import torch
import numpy as np

In [0]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [0]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(vocab)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


## Showing the document-term matrix, also called "bag of words"

In [0]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
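
Each row of the matrix is one document and each column is one vocabulary word, so a row can be read back as word counts. A small sketch in plain Python, with the vocabulary and the second row copied from the outputs above:

```python
# Vocabulary and second row of the document-term matrix, copied from the outputs above
vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
row = [0, 2, 0, 1, 0, 1, 1, 0, 1]

# Pair each column with its word and keep only the words that occur
counts = {word: n for word, n in zip(vocab, row) if n > 0}
print(counts)  # {'document': 2, 'is': 1, 'second': 1, 'the': 1, 'this': 1}
```

This matches the second sentence of the corpus, 'This document is the second document.', where "document" appears twice.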


## My implementation of a simple tokenizer using the vocabulary already extracted by scikit-learn

First version: using a plain for loop




In [0]:
list_word_based = []
list_token_based = []
for amostra in corpus:
    amostra = re.sub(r'\W',' ',amostra).strip().lower()
    list_words = amostra.split(' ')
    list_tokens = []
    for word in list_words:
        list_tokens.append(vocab.index(word))
    list_word_based.append(list_words)
    list_token_based.append(list_tokens)
list_word_based, list_token_based

([['this', 'is', 'the', 'first', 'document'],
  ['this', 'document', 'is', 'the', 'second', 'document'],
  ['and', 'this', 'is', 'the', 'third', 'one'],
  ['is', 'this', 'the', 'first', 'document']],
 [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]])

Second version: for loop with a list comprehension




In [0]:
list_word_based = []
list_token_based = []
for amostra in corpus:
    amostra = re.sub(r'\W',' ',amostra).strip().lower()
    list_words = amostra.split(' ')
    list_tokens = [vocab.index(word)   for word in list_words]
    list_word_based.append(list_words)
    list_token_based.append(list_tokens)
list_word_based, list_token_based

([['this', 'is', 'the', 'first', 'document'],
  ['this', 'document', 'is', 'the', 'second', 'document'],
  ['and', 'this', 'is', 'the', 'third', 'one'],
  ['is', 'this', 'the', 'first', 'document']],
 [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]])
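
Both versions split on a single space, which works for this corpus because every punctuation mark sits at the end of a sentence. With punctuation in the middle of the text, `split(' ')` yields empty strings, and `vocab.index('')` would raise a `ValueError`. A small sketch with a made-up sentence shows the difference from `split()` without arguments:

```python
import re

text = "Hello, world!"  # made-up example with punctuation in the middle
cleaned = re.sub(r'\W', ' ', text).strip().lower()  # comma and space become two spaces

with_single_space = cleaned.split(' ')  # splits on every single space
with_whitespace = cleaned.split()       # splits on runs of whitespace

print(with_single_space)  # ['hello', '', 'world']
print(with_whitespace)    # ['hello', 'world']
```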

# Downloading the IMDB_sample dataset (only 1000 examples)

The dataset is loaded from the datasets provided by the fast.ai course: https://course.fast.ai/datasets.html

The wget command fetches the file imdb_sample.tgz
The tar command extracts the archive into the local directory

In [0]:
!wget -nc http://files.fast.ai/data/examples/imdb_sample.tgz
!tar -xzf imdb_sample.tgz

File ‘imdb_sample.tgz’ already there; not retrieving.



The extracted directory contains one file in CSV format:

In [0]:
!ls imdb_sample

texts.csv


In [0]:
import pandas as pd

In [0]:
df = pd.read_csv('imdb_sample/texts.csv')
df.shape

(1000, 3)

In [0]:
df.head()

|   | label | text | is_valid |
|---|-------|------|----------|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |


## **Pre-processing**

In [0]:
def pre_processing(corpus):
    corpus_pp = []
    for sentence in corpus:
        new_sentence = sentence.lower()                             # convert to lowercase
        new_sentence = re.sub(r"[^\w]", " ", new_sentence).split()  # replace non-word characters with spaces, then split on whitespace
        corpus_pp.append(new_sentence)
    return corpus_pp

In [0]:
corpus_pp = pre_processing(corpus)
corpus_pp

[['this', 'is', 'the', 'first', 'document'],
 ['this', 'document', 'is', 'the', 'second', 'document'],
 ['and', 'this', 'is', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]
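
In Python 3, `\w` in a `str` pattern matches Unicode word characters by default, not only `[a-zA-Z0-9_]`, so accented letters stay inside their tokens. A short sketch of the difference, using the `re.ASCII` flag to restrict `\w` to the ASCII class; the sample string is made up:

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve", written with escapes to avoid encoding issues

# Default (Unicode) matching: accented letters count as word characters
unicode_tokens = re.sub(r"[^\w]", " ", text).split()

# ASCII-only matching: accented letters are replaced by spaces and split the words
ascii_tokens = re.sub(r"[^\w]", " ", text, flags=re.ASCII).split()

print(unicode_tokens)  # ['café', 'naïve']
print(ascii_tokens)    # ['caf', 'na', 've']
```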

## **Vocabulary**

### function

In [0]:
def build_vocab(corpus):
    vocab = " ".join(corpus)                        # join elements
    vocab = vocab.lower()                           # convert to lowercase
    vocab = re.sub(r"[^\w]", " ", vocab).split()    # replace non-word characters with spaces, then split
    vocab = list(set(vocab))                        # remove duplicates
    vocab.sort()                                    # sort elements
    return vocab

### testing

In [0]:
vocab = build_vocab(corpus)
vocab

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

### comparing with scikit-learn

In [0]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
sk_vocab = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(sk_vocab)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [0]:
vocab == sk_vocab

True

## **Tokenizer**

### function

In [0]:
def tokenizer(corpus, vocab):
    # map each vocabulary word to its index for constant-time lookup
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    corpus_pp = pre_processing(corpus)
    corpus_tok = []
    for sentence in corpus_pp:
        tokens = [word_to_idx[word] for word in sentence]
        corpus_tok.append(tokens)
    return corpus_tok

### testing

In [0]:
corpus_tok = tokenizer(corpus, vocab)
corpus_tok

[[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]]
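
The `tokenizer` above looks every word up directly, so a word missing from `vocab` raises a `KeyError`. That cannot happen when the vocabulary is built from the same corpus, but it would when tokenizing new text. A minimal sketch of one way to handle it, assuming unknown words are simply skipped (which is also how `CountVectorizer.transform` treats out-of-vocabulary tokens). `tokenizer_skip_unknown` is a hypothetical name, and `pre_processing` is repeated only to keep the block self-contained:

```python
import re

def pre_processing(corpus):
    # same steps as the notebook's pre_processing: lowercase, strip non-word chars, split
    return [re.sub(r"[^\w]", " ", sentence.lower()).split() for sentence in corpus]

def tokenizer_skip_unknown(corpus, vocab):
    # hypothetical variant of tokenizer: words that are not in vocab are dropped
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    return [[word_to_idx[w] for w in sentence if w in word_to_idx]
            for sentence in pre_processing(corpus)]

vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
tokens = tokenizer_skip_unknown(['This is the fourth document!'], vocab)
print(tokens)  # [[8, 3, 6, 1]] -- 'fourth' is not in vocab and is skipped
```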

## **Bag of Words**

### function

In [0]:
def feature(corpus_tok):
    # create tensor with zeros of the correct size
    size = max([max(sublist) for sublist in corpus_tok]) + 1
    doc_term = torch.zeros(len(corpus_tok), size, dtype=torch.int64)
    for line, tok in enumerate(corpus_tok):
        for column in tok:
            doc_term[line][column] += 1
    return doc_term

### testing

In [0]:
doc_term = feature(corpus_tok)
doc_term

tensor([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]])
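
`feature` infers the number of columns from the largest token index, which assumes that no document is empty and that the last vocabulary word occurs in the corpus. A sketch of an alternative in plain Python that takes the vocabulary size explicitly and counts with `collections.Counter`; `feature_counts` and its `vocab_size` parameter are hypothetical names, and the result is a list of lists rather than a tensor:

```python
from collections import Counter

def feature_counts(corpus_tok, vocab_size):
    # hypothetical variant of feature: one row per document, one column per vocab index
    doc_term = []
    for tokens in corpus_tok:
        counts = Counter(tokens)  # token index -> number of occurrences
        doc_term.append([counts.get(i, 0) for i in range(vocab_size)])
    return doc_term

# token lists copied from the tokenizer output above
corpus_tok = [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]]
doc_term_list = feature_counts(corpus_tok, vocab_size=9)
for row in doc_term_list:
    print(row)
```

On this corpus the rows agree with the tensor shown above; an empty document would simply produce a row of zeros.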

### comparing with scikit-learn

In [0]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [0]:
doc_term_np = doc_term.numpy()
print(doc_term_np)

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [0]:
np.array_equal(doc_term_np, X.toarray())

True

## **IMDb**

### Filter data

In [0]:
# getting only the 'text' column
imdb_corpus = df['text']
imdb_corpus.shape

(1000,)

### Vocabulary

In [0]:
# build_vocab
imdb_vocab = build_vocab(imdb_corpus)
len(imdb_vocab)

18705

In [0]:
# scikit-learn comparison
vectorizer = CountVectorizer()
Y = vectorizer.fit_transform(imdb_corpus)
sk_imdb_vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
len(sk_imdb_vocab)

18668

### Tokenizer

In [0]:
imdb_corpus_tok = tokenizer(imdb_corpus, imdb_vocab)
len(imdb_corpus_tok)

1000

### Bag of Words

In [0]:
imdb_doc_term = feature(imdb_corpus_tok)
imdb_doc_term.shape

torch.Size([1000, 18705])

In [0]:
# scikit-learn comparison
Y.toarray().shape

(1000, 18668)

### Comments:
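The vocabulary produced by scikit-learn is slightly smaller than the one produced by my implementation (scikit-learn: 18668, my implementation: 18705). The difference comes from how the text is filtered. My implementation only lowercases the text and splits on every non-word character, so it keeps every run of word characters (`\w`: letters, digits, and underscore), including one-character tokens. CountVectorizer's default `token_pattern` is `(?u)\b\w\w+\b`, which keeps only tokens with two or more word characters; this is the likely source of the 37 extra entries.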
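
The effect of the two tokenization rules can be seen on a made-up sentence. The sketch below applies CountVectorizer's default `token_pattern`, `(?u)\b\w\w+\b`, with `re.findall` and compares the result with the split-based approach used in `pre_processing`:

```python
import re

text = "I saw a movie"  # made-up example containing one-character words

# Split-based approach, as in pre_processing: every run of word characters is a token
split_tokens = re.sub(r"[^\w]", " ", text.lower()).split()

# CountVectorizer's default token_pattern: tokens need at least two word characters
pattern_tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

print(split_tokens)    # ['i', 'saw', 'a', 'movie']
print(pattern_tokens)  # ['saw', 'movie']
```

On the notebook's data, `sorted(set(imdb_vocab) - set(sk_imdb_vocab))` would list the entries that appear only in the custom vocabulary.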