## Tutorial for BOC (Bag-of-Concepts)

- This tutorial walks through a re-implementation of Kim, Han Kyul, Hyunjoong Kim, and Sungzoon Cho. "Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation." Neurocomputing (2017).

- It shows how to use this package to create sample BOC document vectors as presented in the paper.

### 1. Import the package and designate the location of the input text file

- The sample text file contains 5,000 articles from the Reuters dataset used in the paper.

In [2]:
# Install the package first if needed: pip install bagofconcepts
import bagofconcepts as bc

In [3]:
document_path='./sample_data/sample_articles.txt'

### 2. Set parameters for training BOC

- To train BOC, the embedding dimension, context window size, minimum frequency, and number of concepts must be defined as parameters.
- **Embedding dimension** is the dimension of the word vectors trained with word2vec.
- **Context window size** is the number of preceding and subsequent words treated as context for a given input word.
- Words with frequencies below the **minimum frequency** are ignored by the model.
- **The number of concepts** is the value of k used for spherical clustering; it sets both the number of concepts and the dimension of the resulting document vectors.
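
To make the context window and minimum frequency parameters concrete, here is a minimal sketch in plain Python. The helper names `filter_by_min_freq` and `context_words` and the toy sentence are hypothetical and not part of the package. The sketch filters rare words first and then takes the window; whether the package applies the two steps in that order is an assumption.

```python
from collections import Counter

def filter_by_min_freq(tokens, min_freq):
    """Drop tokens whose frequency in the corpus is below min_freq."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_freq]

def context_words(tokens, index, window):
    """Return up to `window` words on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

# Toy corpus (hypothetical), far smaller than the sample articles.
tokens = "oil prices rose as oil demand rose and prices fell".split()

# Words seen fewer than 2 times ("as", "demand", "and", "fell") are dropped.
kept = filter_by_min_freq(tokens, min_freq=2)
print(kept)

# Context of the word at position 3 with a window of 2 on each side.
print(context_words(kept, index=3, window=2))
```

With the tutorial's settings, the same idea applies with a window of 8 words on each side and a much larger frequency threshold.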


### 3. Train BOC document vectors

In [7]:
# min_freq=50 matches the "mf50" output file names used in the following sections
boc_model=bc.BOCModel(doc_path=document_path, embedding_dim=200, context=8, min_freq=50, num_concept=100)

boc_matrix,word2concept_list,idx2word_converter=boc_model.fit()
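
As a rough illustration of what the training step produces, the sketch below assigns each word vector to its most similar concept centroid by cosine similarity (the assignment step of spherical clustering) and then counts concept occurrences per document. It is a simplified stand-in, not the package's implementation: the 2-d word vectors, the fixed centroids, and the helper names are hypothetical, the centroids are not learned here, and any weighting applied on top of the raw counts is omitted.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def nearest_concept(word_vec, centroids):
    """Return the index of the centroid with the highest cosine similarity."""
    unit = normalize(word_vec)
    sims = [sum(a * b for a, b in zip(unit, normalize(c))) for c in centroids]
    return sims.index(max(sims))

def boc_vector(doc_tokens, word2concept, num_concept):
    """Count how often each concept occurs in one document."""
    counts = [0] * num_concept
    for token in doc_tokens:
        if token in word2concept:  # words without a concept are skipped
            counts[word2concept[token]] += 1
    return counts

# Hypothetical 2-d word vectors and two fixed concept centroids.
word_vectors = {"oil": [0.9, 0.1], "crude": [0.8, 0.3],
                "bank": [0.1, 0.9], "loan": [0.2, 0.7]}
centroids = [[1.0, 0.0], [0.0, 1.0]]

word2concept = {w: nearest_concept(v, centroids) for w, v in word_vectors.items()}
print(word2concept)

# One document becomes a vector with one entry per concept.
print(boc_vector("oil crude oil bank the".split(), word2concept, num_concept=2))
```

The length of each document vector equals the number of concepts, which is why `num_concept=100` above yields 100-dimensional document vectors.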


### 4. Two output files are created

- ```w2c_d200_w8_mf50_c100.csv``` contains information about each word's assigned concept
- ```boc_d200_w8_mf50_c100.csv``` contains actual BOC document vectors for the input documents

### 5. Using the generated BOC document vectors as inputs, document classifiers can be trained

- Using the sample articles and labels, an SVM (support vector machine) is trained to classify each document's category.
- The first 4,000 articles are used as training data, and the remaining 1,000 articles are used as test data.
- 10-fold cross-validation is applied to search for the best SVM model among combinations of parameters (e.g., kernel type and regularization strength C).
- The micro-averaged F1 score of the predictions on the test set is printed (this will take a while!).
- Try training different types of document classifiers using BOC vectors.
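
The score reported for the test set is a micro-averaged F1. As a minimal sketch of what that means, the function below pools true positives, false positives, and false negatives over all classes before computing precision and recall. When every document has exactly one label, this pooled score equals plain accuracy. The function name `micro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical); one of five predictions is wrong.
y_true = ["earn", "acq", "earn", "crude", "acq"]
y_pred = ["earn", "earn", "earn", "crude", "acq"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```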


In [16]:
from numpy import genfromtxt
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn import svm


BOC_matrix=genfromtxt('boc_d200_w8_mf50_c100.csv', delimiter=',')

# Read one label per line, dropping the trailing newline
with open('./sample_data/sample_labels.txt') as f:
    labels=[line.strip() for line in f]

X_train=BOC_matrix[0:4000]
X_test=BOC_matrix[4000:]
Y_train=labels[0:4000]
Y_test=labels[4000:]


# kernel and C are selected by the grid search, so neither is fixed on the estimator
parameters={'C':[0.5, 2], 'kernel':['linear', 'poly']}
svr=svm.SVC(decision_function_shape='ovr')
clf1=GridSearchCV(svr, parameters, cv=10)
clf1.fit(X_train, Y_train)
print("----------Cross Validation Result----------")
print(clf1.best_score_)
print(clf1.best_params_)


print("----------Prediction Result----------")
yhat=clf1.predict(X_test)
print(f1_score(Y_test, yhat, average='micro'))

----------Cross Validation Result----------
0.75075
{'kernel': 'linear', 'C': 0.5}
----------Prediction Result----------
0.758
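
As one example of a different classifier to try on BOC vectors, the sketch below implements a simple nearest-centroid classifier with cosine similarity in plain Python. It is not part of the package and was not used in the tutorial's experiment. The toy vectors, the labels, and the helper names are hypothetical stand-ins for rows of the BOC matrix and the sample labels.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fit_centroids(X, y):
    """Average the training vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(class_centroids, X):
    """Assign each vector to the class whose centroid is most similar."""
    return [max(class_centroids, key=lambda label: cosine(vec, class_centroids[label]))
            for vec in X]

# Toy 3-concept document vectors (hypothetical).
X_toy = [[3, 0, 1], [4, 1, 0], [0, 3, 1], [1, 4, 0]]
y_toy = ["energy", "energy", "finance", "finance"]

class_centroids = fit_centroids(X_toy, y_toy)
print(predict(class_centroids, [[5, 1, 1], [0, 2, 0]]))
```

To try it on the tutorial data, `X_train`, `Y_train`, and `X_test` from the cell above could be passed in place of the toy lists, and the predictions scored with `f1_score` in the same way.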
