## Tutorial for BOC (Bag-of-Concepts)

- This tutorial is a re-implementation of Kim, Han Kyul, Hyunjoong Kim, and Sungzoon Cho. "Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation." Neurocomputing (2017).

- It shows how to use this package to create sample BOC document vectors as presented in the paper.

- We will then use the trained BOC vectors to train an SVM classifier.

### 1. Import the package and designate the location of the input text file

- If you don't have bagofconcepts installed, please install it using pip (refer to the commented line)

- document_path points to 5,000 sample articles from the Reuters dataset used in the paper

In [1]:
# pip install bagofconcepts
import bagofconcepts as bc

document_path = './sample_data/sample_articles.txt'

### 2. Train BOC document vectors

#### List of initialization parameters
- **doc_path**: full or relative path of the corpus to train on. Documents must be separated by newlines (each line is a single document).
- **model_path**: full or relative path of a gensim model to load
    - A pretrained gensim word2vec model can be loaded to generate BOC vectors
- **embedding_dim (default: 200)**: dimensionality of the word2vec embeddings
- **context (default: 8)**: window size used during word2vec training
- **min_freq (default: 100)**: words occurring fewer times than this threshold are ignored during training
- **num_concept (default: 100)**: number of concepts to generate
- **iterations (default: 5)**: number of epochs for word2vec training
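
Because the corpus file is expected to hold one document per line, it can help to sanity-check a file before training. The sketch below is not part of bagofconcepts: `count_documents` is a hypothetical helper, and whether the package itself skips blank lines is not something this sketch assumes. It writes a small temporary file so that it runs on its own.

```python
import os
import tempfile


def count_documents(path):
    """Count non-empty lines, i.e. documents in a one-document-per-line file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Write a tiny three-document corpus to a temporary file for demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("oil prices rose on monday\n")
    tmp.write("the central bank cut rates\n")
    tmp.write("shares fell in early trading\n")
    tmp_path = tmp.name

num_docs = count_documents(tmp_path)
print(num_docs)

# Clean up the temporary file.
os.remove(tmp_path)
```

For the sample file used in this tutorial, the same check would be expected to agree with the corpus length printed in the next cell.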

In [2]:
corpus = bc.utils.create_corpus(document_path)
print(len(corpus))

5000


In [3]:
# Assumed argument order: embedding_dim=200, context=8, min_freq=50, iterations=3
wv, idx2word = bc.utils.train_gensim_w2v(corpus, 200, 8, 50, 3)

In [4]:
boc_model = bc.BOCModel(corpus, wv, idx2word, num_concept=100, iterations=5, random_state=42)

boc_matrix, word2concept_list, idx2word_converter = boc_model.fit()

In [5]:
print(boc_matrix.shape)

(5000, 100)
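
Each row of `boc_matrix` is a document and each column a concept, so the matrix can be passed to a classifier as a feature matrix. The sketch below shows the general shape of that step with scikit-learn's `SVC`. Since no labels are loaded in this section, it uses a synthetic count matrix and synthetic labels as stand-ins (both are assumptions), so the accuracy it prints says nothing about BOC itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for boc_matrix: 200 synthetic documents x 100 concepts (assumption).
rng = np.random.default_rng(42)
X = rng.poisson(lam=1.0, size=(200, 100)).astype(float)

# Synthetic binary labels derived from the features (assumption).
y = (X[:, :50].sum(axis=1) > X[:, 50:].sum(axis=1)).astype(int)

# Hold out 25% of the documents for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a linear-kernel SVM on the training documents.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With real data, `X` would be replaced by `boc_matrix` and `y` by the document labels, in the same row order as the corpus.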


In [6]:
word2concept_list[:10]

[('the', 33),
 ('<br>', 33),
 ('to', 85),
 ('of', 89),
 ('a', 46),
 ('in', 33),
 ('and', 83),
 ('[%_Number_%]', 31),
 ('on', 18),
 ('said', 69)]
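
Each pair maps a word to the ID of the concept (cluster) it was assigned to. As a rough illustration of how such a mapping turns a tokenized document into a concept-count vector, the sketch below reuses the ten pairs printed above. `doc_to_concept_counts` is a hypothetical helper, not part of bagofconcepts, and it produces raw counts only, ignoring any weighting the package may apply.

```python
from collections import Counter

# The ten (word, concept_id) pairs shown in the output above.
word2concept_sample = [
    ('the', 33), ('<br>', 33), ('to', 85), ('of', 89), ('a', 46),
    ('in', 33), ('and', 83), ('[%_Number_%]', 31), ('on', 18), ('said', 69),
]
word2concept = dict(word2concept_sample)


def doc_to_concept_counts(tokens, mapping, num_concept=100):
    """Return a length-num_concept list of raw concept counts for one document."""
    # Words missing from the mapping (e.g. below min_freq) are skipped.
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return [counts.get(c, 0) for c in range(num_concept)]


tokens = "the bank said in a statement on the matter".split()
vector = doc_to_concept_counts(tokens, word2concept)

# Concept 33 is hit by 'the' (twice) and 'in'; concept 69 by 'said'.
print(vector[33], vector[69], sum(vector))  # prints: 3 1 6
```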

In [7]:
idx2word_converter[:10]

['the', '<br>', 'to', 'of', 'a', 'in', 'and', '[%_Number_%]', 'on', 'said']
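
To inspect what a concept contains, the word-to-concept pairs can be inverted into a concept-to-words table. This is a minimal sketch in plain Python using only the ten pairs shown earlier. The variable names are illustrative, and with the full `word2concept_list` each concept would contain many more words.

```python
from collections import defaultdict

# The ten (word, concept_id) pairs from the word2concept_list output.
word2concept_sample = [
    ('the', 33), ('<br>', 33), ('to', 85), ('of', 89), ('a', 46),
    ('in', 33), ('and', 83), ('[%_Number_%]', 31), ('on', 18), ('said', 69),
]

# Invert the mapping: concept ID -> list of words assigned to it.
concept2words = defaultdict(list)
for word, concept_id in word2concept_sample:
    concept2words[concept_id].append(word)

print(concept2words[33])  # prints: ['the', '<br>', 'in']

# In this sample, the words appear in the same order as in idx2word_converter.
sample_words = [word for word, _ in word2concept_sample]
print(sample_words[:3])  # prints: ['the', '<br>', 'to']
```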