# Bag-of-words


The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. 

In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary.

(https://en.wikipedia.org/wiki/Bag-of-words_model)
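
As a minimal sketch of the definition above, the following pure-Python snippet builds a bag for two toy sentences and turns each bag into a sparse count vector over a shared vocabulary. The sentences and the helper names (`bag_of_words`, `to_sparse_vector`) are illustrative assumptions, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated words."""
    return Counter(text.lower().split())


def to_sparse_vector(bag, vocabulary):
    """Map a bag to {column index: count}, keeping only non-zero entries."""
    return {vocabulary[word]: count
            for word, count in bag.items() if word in vocabulary}


# Toy sentences (hypothetical, not taken from the dataset used below).
docs = ["the cat sat on the mat", "on the mat the cat sat"]
bags = [bag_of_words(doc) for doc in docs]

# Word order is discarded, so both sentences yield the same bag.
same_bag = bags[0] == bags[1]

# The vocabulary assigns one column index to each distinct word.
vocabulary = {word: idx
              for idx, word in enumerate(sorted(set().union(*bags)))}
vectors = [to_sparse_vector(bag, vocabulary) for bag in bags]

print(same_bag)
print(vocabulary)
print(vectors[0])
```

Only words that actually occur get an entry in each vector, which is what makes the representation sparse when the vocabulary is large.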

In [1]:
from __future__ import unicode_literals 

## Collecting an e-commerce dataset of product names and categories

In [2]:
import gzip
import json

In [3]:
# Load the gzipped e-commerce dump (one JSON object per line)
corpus = list()
with gzip.open('ecommerce.json.gz') as fp:
    for line in fp:
        entry = line.decode('utf8')
        corpus.append(json.loads(entry))

In [4]:
from pprint import pprint
pprint(corpus[0])

{u'_id': 120008322,
 u'cat': u' Automotivo',
 u'descr': u'Chegou o kit que junta resist\xeancia e conforto, al\xe9m de n\xedveis m\xe1ximos de seguran\xe7a. S\xe3o 4 pneus para seu carro ficar completo e com a qualifica\xe7\xe3o que voc\xea precisa.\nCom os conhecimentos avan\xe7ados de hoje e um entusiasmo pela dire\xe7\xe3o, os engenheiros da Pirelli puderam dar grandes passos. Cada pneu da Pirelli \xe9 respons\xe1vel n\xe3o s\xf3 pelo desempenho, mas tamb\xe9m por uma "vontade de ir pra estrada", comunicando-se com o motorista e gerando um melhor entendimento do desempenho do ve\xedculo, ou seja, a Pirelli transforma a sua viagem em uma aventura divertida e livre de problemas. Pneu Pirelli para carros com rodas aro 16, modelo high performance Phanthon, perfil baixo proporcionando maior estabilidade nas curvas, excelente qualidade e durabilidade para pistas.\n\nImagens meramente ilustrativas.\nTodas as informa\xe7\xf5es divulgadas s\xe3o de responsabilidade do fabricante/fornecedor.'

In [5]:
len(corpus)

65875

In [6]:
# Let's build a classifier for product categories.
# To speed up the example, consider only the first 10k products.
dataset = list()
for entry in corpus[:10000]:
    # Skip products that have no category
    if 'cat' in entry:
        dataset.append((entry['name'], entry['cat'].lower().strip()))

In [7]:
len(dataset)

9953

In [8]:
pprint(dataset[:10])

[(u'Kit com 4 Pneus de Alta Performance Pirelli Aro 16 205/55R16 Phantom',
  u'automotivo'),
 (u'Chandon Brut Ros\xe9 750 ml', u'alimentos e bebidas'),
 (u'Kit com 2 Vodkas Sueca Absolut Vanilia 1000ml', u'alimentos e bebidas'),
 (u'Kit  - Livros de Colorir: Jardim Secreto + Floresta Encantada + Reino Animal',
  u'livros'),
 (u"Livro - Assassin's Creed: Submundo", u'livros'),
 (u'BCAA 2400 - 100 C\xe1psulas - Nitech Nutrition',
  u'suplementos e vitaminas'),
 (u'100% Whey - 900g - Baunilha - Nitech Nutrition',
  u'suplementos e vitaminas'),
 (u'Whey Protein Isolate - 900g - Morango - Nitech Nutrition',
  u'suplementos e vitaminas'),
 (u'100% Whey - 900g - Chocolate - Nitech Nutrition',
  u'suplementos e vitaminas'),
 (u'BCAA 2400 - 200 C\xe1psulas - Nitech Nutrition',
  u'suplementos e vitaminas')]


In [9]:
# How many distinct categories do we have, and how many items per category?
from collections import Counter
counter = Counter([cat for prod, cat in dataset])

pprint(counter.most_common())

[(u'beb\xeas', 1208),
 (u'eletroport\xe1teis', 1052),
 (u'automotivo', 915),
 (u'utilidades dom\xe9sticas', 857),
 (u'suplementos e vitaminas', 787),
 (u'ar-condicionado e aquecedores', 754),
 (u'inform\xe1tica', 706),
 (u'cama, mesa e banho', 670),
 (u'tv e home theater', 644),
 (u'perfumaria', 532),
 (u'beleza e sa\xfade', 497),
 (u'dvds e blu-ray', 433),
 (u'rel\xf3gios', 410),
 (u'pet shop', 391),
 (u'instrumentos musicais', 44),
 (u'celulares e telefones', 18),
 (u'eletrodom\xe9sticos', 16),
 (u'\xe1udio', 13),
 (u'livros', 2),
 (u'alimentos e bebidas', 2),
 (u'brinquedos', 1),
 (u'linha industrial', 1)]


# Building an SVM classifier with bag-of-words

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
# Bag-of-words counts feeding a linear SVM; probability=True enables predict_proba
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SVC(kernel='linear', probability=True)),
])
encoder = LabelEncoder()
# For background, see http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
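
To make the `vect` step less opaque, here is a rough sketch of how a product name is split into tokens before counting. It assumes `CountVectorizer`'s documented defaults: lowercasing, and a token pattern that keeps runs of two or more word characters. It is a simplified illustration, not scikit-learn's implementation, and `simple_tokenize` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b", re.UNICODE)


def simple_tokenize(text):
    """Lowercase the text and extract tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())


# First product name from the dataset shown earlier.
name = "Kit com 4 Pneus de Alta Performance Pirelli Aro 16 205/55R16 Phantom"
tokens = simple_tokenize(name)
counts = Counter(tokens)

print(tokens)
print(counts["pirelli"])
```

Note that the lone `4` disappears because single-character tokens are dropped, and `205/55R16` is split at the slash into `205` and `55r16`.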

In [12]:
data = [prod for prod, cat in dataset]
labels = [cat for prod, cat in dataset]
len(data)

9953

In [13]:
target = encoder.fit_transform(labels)
classifier.fit(data, target)

Pipeline(steps=[(u'vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        s...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [14]:
classifier.predict(["Refrigerador Brastemp com função frostfree"])

array([3])

In [15]:
print(encoder.classes_[3])

bebês
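
The index `3` maps to `bebês` because `LabelEncoder` keeps the distinct labels in sorted order in `classes_` and encodes each label as its position in that array. A pure-Python sketch of the idea, using a handful of labels picked from the category list above (`encode_labels` is an illustrative helper, not part of scikit-learn):

```python
def encode_labels(labels):
    """Return (sorted distinct classes, integer code for each label)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# A small, hand-picked subset of the categories (for illustration only).
labels = [
    "automotivo",
    "alimentos e bebidas",
    "bebês",
    "ar-condicionado e aquecedores",
    "bebês",
]
classes, target = encode_labels(labels)

# Decoding is a plain lookup into the sorted class list.
decoded = [classes[i] for i in target]

print(classes)
print(target)
```

Decoding a prediction is therefore just an index into `classes_`, which is what the cell above does by hand.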


In [16]:
probs = classifier.predict_proba(["Refrigerador Brastemp com função frostfree"])

In [17]:
guess = [(class_, probs.item(n)) for n, class_ in enumerate(encoder.classes_)]
pprint(guess)

[(u'alimentos e bebidas', 0.009727997974364595),
 (u'ar-condicionado e aquecedores', 0.01189928250891044),
 (u'automotivo', 0.05661080631753129),
 (u'beb\xeas', 0.12315065374854184),
 (u'beleza e sa\xfade', 0.03663084493212984),
 (u'brinquedos', 0.004066643922691623),
 (u'cama, mesa e banho', 0.012042843279376204),
 (u'celulares e telefones', 0.004646753801491675),
 (u'dvds e blu-ray', 0.031133875578162027),
 (u'eletrodom\xe9sticos', 0.36486048712399227),
 (u'eletroport\xe1teis', 0.005211983196782103),
 (u'inform\xe1tica', 0.03345228439460204),
 (u'instrumentos musicais', 0.010570278445553187),
 (u'linha industrial', 0.0051461009973493635),
 (u'livros', 0.007347498569096653),
 (u'perfumaria', 0.014809975695197167),
 (u'pet shop', 0.04802163272297251),
 (u'rel\xf3gios', 0.005003718552577751),
 (u'suplementos e vitaminas', 0.13433214714263636),
 (u'tv e home theater', 0.042778075652347584),
 (u'utilidades dom\xe9sticas', 0.020415493708260943),
 (u'\xe1udio', 0.01814062173543265)]


In [18]:
from operator import itemgetter
for cat, proba in sorted(guess, key=itemgetter(1), reverse=True):
    print('{}: {:.4f}'.format(cat, proba))

eletrodomésticos: 0.3649
suplementos e vitaminas: 0.1343
bebês: 0.1232
automotivo: 0.0566
pet shop: 0.0480
tv e home theater: 0.0428
beleza e saúde: 0.0366
informática: 0.0335
dvds e blu-ray: 0.0311
utilidades domésticas: 0.0204
áudio: 0.0181
perfumaria: 0.0148
cama, mesa e banho: 0.0120
ar-condicionado e aquecedores: 0.0119
instrumentos musicais: 0.0106
alimentos e bebidas: 0.0097
livros: 0.0073
eletroportáteis: 0.0052
linha industrial: 0.0051
relógios: 0.0050
celulares e telefones: 0.0046
brinquedos: 0.0041


# Activity: How can we improve the result above?

What causes the result above?
How can we improve it?