## Tutorial for BOC (Bag-of-Concepts)

- This tutorial is a re-implementation of Kim, Han Kyul, Hyunjoong Kim, and Sungzoon Cho. "Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation." Neurocomputing (2017).

- It shows how to use this package to create sample BOC document vectors as presented in the paper.

- We will then use the trained BOC vectors to train an SVM classifier.

### 1. Import the package and designate the location of the input text file

- If you don't have bagofconcepts installed, install it with pip (see the commented line).

- document_path points to 5,000 sample articles from the Reuters dataset used in the paper.

In [2]:
import bagofconcepts as bc

# pip install bagofconcepts

In [3]:
document_path='./sample_data/sample_articles.txt'

### 2. Set parameters for training BOC

- To train BOC, four parameters must be defined: embedding dimension, context window size, minimum frequency, and number of concepts.
- **Embedding dimension** denotes the dimensionality of the word vectors trained with word2vec.
- **Context window size** refers to the number of preceding and subsequent words that are regarded as context for a given input word.
- Words with frequencies below **minimum frequency** are ignored by the model.
- **The number of concepts** is the value of k used for spherical clustering. It equals both the number of concepts and the dimensionality of the resulting document vectors.
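
To make the role of the number of concepts more concrete, here is a minimal pure-Python sketch of the two ideas above: assigning each word vector to its nearest concept by cosine similarity, then counting concept occurrences in a document. It is not the package's implementation. The toy vectors, the fixed centroids, and the helper names (`normalize`, `nearest_concept`, `concept_counts`) are all assumptions made for illustration. It shows only a single assignment step rather than full iterative clustering, and it uses raw counts, whereas the package may weight or normalize them.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest_concept(word_vec, centroids):
    """Return the index of the centroid with the highest cosine similarity."""
    unit = normalize(word_vec)
    sims = [sum(a * b for a, b in zip(unit, normalize(c))) for c in centroids]
    return sims.index(max(sims))


def concept_counts(tokens, word2concept, num_concept):
    """Count how often each concept occurs in a tokenized document."""
    counts = Counter(word2concept[t] for t in tokens if t in word2concept)
    return [counts.get(c, 0) for c in range(num_concept)]


# Hypothetical 2-dimensional word vectors (real ones have embedding_dim entries)
toy_word_vectors = {
    "bank": [0.9, 0.1],
    "loan": [0.8, 0.3],
    "goal": [0.1, 0.9],
    "match": [0.2, 0.7],
}

# Hypothetical centroids for k = 2 concepts
toy_centroids = [[1.0, 0.0], [0.0, 1.0]]

toy_word2concept = {
    word: nearest_concept(vec, toy_centroids)
    for word, vec in toy_word_vectors.items()
}

# Words without a concept (e.g. filtered out by minimum frequency) are skipped
tokens = "the bank approved the loan before the match".split()
doc_vector = concept_counts(tokens, toy_word2concept, num_concept=2)

print(toy_word2concept)
print(doc_vector)
```

The document vector has exactly one entry per concept, which is why the number of concepts also fixes the dimensionality of the document vectors.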


### 3. Train BOC document vectors

In [13]:
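# Configure the model with the parameters described in section 2
boc_model = bc.BOCModel(doc_path=document_path, embedding_dim=200, context=8, min_freq=50, num_concept=100)

# fit() returns the document-concept matrix, the (word, concept id) pairs, and the index-to-word list
boc_matrix, word2concept_list, idx2word_converter = boc_model.fit()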

In [17]:
print(boc_matrix.shape)

(5000, 100)


In [22]:
word2concept_list[:10]

[('the', 62),
 ('<br>', 2),
 ('to', 78),
 ('of', 94),
 ('a', 97),
 ('in', 5),
 ('and', 55),
 ('[%_Number_%]', 93),
 ('on', 14),
 ('said', 9)]

In [24]:
idx2word_converter[:10]

['the', '<br>', 'to', 'of', 'a', 'in', 'and', '[%_Number_%]', 'on', 'said']
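
One way to inspect the learned concepts is to invert the (word, concept id) pairs so that each concept id maps to its member words. The sketch below does this for the ten pairs printed above. The names `sample_word2concept` and `concept2words` are introduced here for illustration only, and with the full `word2concept_list` each concept would typically contain many more words.

```python
from collections import defaultdict

# First ten (word, concept id) pairs, copied from the output above
sample_word2concept = [
    ('the', 62), ('<br>', 2), ('to', 78), ('of', 94), ('a', 97),
    ('in', 5), ('and', 55), ('[%_Number_%]', 93), ('on', 14), ('said', 9),
]

# Group words by the concept id they were assigned to
concept2words = defaultdict(list)
for word, concept_id in sample_word2concept:
    concept2words[concept_id].append(word)

# Print the concepts in ascending id order
for concept_id in sorted(concept2words):
    print(concept_id, concept2words[concept_id])
```

In this sample the words appear in the same order in both `word2concept_list` and `idx2word_converter`.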