## <font color='darkblue'>Preface</font>
([article source](https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de)) <font size='3ptx'><b>An introduction to embedding-based classification of unlabeled text documents</b></font>

<b>Text classification is the task of assigning a sentence or document an appropriate category. The categories depend on the selected dataset and can cover arbitrary subjects. Therefore, text classifiers can be used to organize, structure, and categorize any kind of text.</b>

Common approaches use supervised learning to classify texts. BERT-based language models in particular have achieved very good text classification results in recent years. These conventional text classification approaches usually require a large amount of labeled training data. In practice, however, <font color='darkred'><b>an annotated text dataset for training state-of-the-art classification algorithms is often unavailable. Annotating data usually involves a lot of manual effort and high expense</b></font>. Unsupervised approaches therefore offer the opportunity to run low-cost text classification on unlabeled datasets. <b>In this article, you will learn how to use <a href='https://pypi.org/project/lbl2vec/'>Lbl2Vec</a> to perform unsupervised text classification.</b>

### <font color='darkgreen'>Agenda</font>
* <font size='3ptx'><b><a href='#sect1'>How does Lbl2Vec work?</a></b></font>
* <font size='3ptx'><b><a href='#sect2'>Lbl2Vec Tutorial</a></b></font>

<a id='sect1'></a>
## <font color='darkblue'>How does Lbl2Vec work?</font>
* <font size='3ptx'><b><a href='#sect1_1'>1. Use Manually Defined Keywords for Each Category of Interest</a></b></font>
* <font size='3ptx'><b><a href='#sect1_2'>2. Create Jointly Embedded Document and Word Vectors</a></b></font>
* <font size='3ptx'><b><a href='#sect1_3'>3. Find Document Vectors that are Similar to the Keyword Vectors of Each Classification Category</a></b></font>
* <font size='3ptx'><b><a href='#sect1_4'>4. Clean Outlier Documents for Each Classification Category</a></b></font>
* <font size='3ptx'><b><a href='#sect1_5'>5. Compute the Centroid of the Outlier Cleaned Document Vectors as Label Vector for Each Classification Category</a></b></font>
* <font size='3ptx'><b><a href='#sect1_6'>6. Text Document Classification</a></b></font>
<br/>

<b><font size='3ptx'><a href='https://pypi.org/project/lbl2vec/'>Lbl2Vec</a> is an algorithm for unsupervised document classification and unsupervised document retrieval.</font> It automatically generates jointly embedded label, document and word vectors and returns documents of categories modeled by manually predefined keywords. The key idea of the algorithm is that many semantically similar keywords can represent a category</b>. 

In the first step, the algorithm creates a joint embedding of document and word vectors. Once documents and words are embedded in a shared vector space, the goal of the algorithm is to learn label vectors from previously manually defined keywords representing a category. Finally, <b>the algorithm can predict the affiliation of documents to categories based on the similarities of the document vectors with the label vectors</b>. At a high level, the algorithm performs the following steps to classify unlabeled texts:

<a id='sect1_1'></a>
### <font color='darkgreen'>1. Use Manually Defined Keywords for Each Category of Interest</font>
<b><font size='3ptx'>First, we have to define keywords to describe each classification category of interest.</font></b>

This process requires some degree of domain knowledge, because the keywords must describe the classification category and be semantically similar to each other within that category. For example:
![keyword examples](images/1.PNG)
<br/>
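
As a minimal sketch of this step, keywords can be kept in a plain dictionary and then split into two index-aligned lists. The category names come from the tutorial below, but the keywords here are illustrative only and `category_keywords` is a hypothetical name:

```python
# Illustrative keywords for two categories (assumed for this sketch,
# not the keyword file used later in the tutorial).
category_keywords = {
    "rec.sport.hockey": ["Hockey", "NHL", "Goalie", "Puck"],
    "sci.crypt": ["Encryption", "Key", "Cipher", "Privacy"],
}

# Split into two parallel lists: the label at index i is described
# by the keyword list at index i.
label_names = list(category_keywords)
keywords_list = [
    [keyword.lower() for keyword in keywords]  # normalize to lowercase
    for keywords in category_keywords.values()
]

for name, keywords in zip(label_names, keywords_list):
    print(name, "->", keywords)
```

Keeping labels and keywords index-aligned matters later, because the model matches each label name to its keyword list by position.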

<a id='sect1_2'></a>
### <font color='darkgreen'>2. Create Jointly Embedded Document and Word Vectors</font>
> An embedding vector is a vector that allows us to represent a word or text document in multi-dimensional space. The idea behind embedding vectors is that similar words or text documents will have similar vectors. - <a href='https://towardsdatascience.com/how-to-perform-topic-modeling-with-top2vec-1ae9bb4e89dc'>Amol Mavuduru</a>
<br/>

Therefore, after creating jointly embedded vectors, documents are located close to other similar documents and close to the most distinguishing words.
![doc vectors](images/2.PNG)
<br/>

Once we have a set of word and document vectors, we can move on to the next step.

<a id='sect1_3'></a>
### <font color='darkgreen'>3. Find Document Vectors that are Similar to the Keyword Vectors of Each Classification Category</font>
<b><font size='3ptx'>Now we can compute cosine similarities between documents and the manually defined keywords of each category. Documents that are similar to category keywords are assigned to a set of candidate documents of the respective category.</font></b>
![doc category space](images/3.PNG)
<br/>
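
The candidate selection can be sketched in pure Python. This is a toy illustration, not the library's implementation: the 2-D vectors, the threshold value, and the choice to summarize a category's keywords by their mean vector are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Toy embeddings (hypothetical values).
keyword_vectors = [[0.9, 0.1], [0.8, 0.3]]
document_vectors = {
    "doc_a": [1.0, 0.2],
    "doc_b": [0.7, 0.4],
    "doc_c": [-0.2, 1.0],
}

# Assumption: represent the category by the mean of its keyword vectors.
keyword_centroid = mean_vector(keyword_vectors)

threshold = 0.8  # hypothetical similarity threshold
candidates = [
    doc_key
    for doc_key, vector in document_vectors.items()
    if cosine_similarity(vector, keyword_centroid) > threshold
]
print(candidates)
```

Here `doc_a` and `doc_b` point in nearly the same direction as the keywords and become candidates, while `doc_c` is almost orthogonal and is left out.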

<a id='sect1_4'></a>
### <font color='darkgreen'>4. Clean Outlier Documents for Each Classification Category</font>
<b><font size='3ptx'>The algorithm uses <a href='https://towardsdatascience.com/local-outlier-factor-lof-algorithm-for-outlier-identification-8efb887d9843'>LOF</a> to clean outlier documents from each set of candidate documents that may be related to some of the descriptive keywords but do not properly match the intended classification category.</font></b>
![clean outlier](images/4.PNG)
<br/>
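
To make the idea behind LOF concrete, here is a small pure-Python sketch of the standard Local Outlier Factor definition on toy 2-D points. It is meant only as an illustration under simplifying assumptions (Euclidean distance, no duplicate points, hypothetical function name `local_outlier_factors`). In practice a library implementation such as scikit-learn's `LocalOutlierFactor` would be the usual choice.

```python
import math


def local_outlier_factors(points, k):
    """Return the LOF score of every point, using its k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbors relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Five points forming a tight cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = local_outlier_factors(points, k=3)

# Keep only points whose score stays below a hypothetical cut-off.
cleaned = [p for p, score in zip(points, scores) if score < 1.5]
print(cleaned)
```

The clustered points receive scores close to 1, while the isolated point gets a much larger score and is removed. The cut-off of 1.5 is an arbitrary choice for this example.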

<a id='sect1_5'></a>
### <font color='darkgreen'>5. Compute the Centroid of the Outlier Cleaned Document Vectors as Label Vector for Each Classification Category</font>
<b><font size='3ptx'>To get embedding representations of classification categories, we compute label vectors. Later, the similarity of documents to label vectors will be used to classify text documents. </font></b>

Each label vector is the <b><a href='https://en.wikipedia.org/wiki/Centroid'>centroid</a></b> of the outlier-cleaned document vectors for a category. The algorithm computes document centroids rather than keyword centroids because experiments showed that <b>it is more difficult to classify documents based on similarities to keywords only, even if they share the same vector space.</b>
![centroid of doc](images/5.PNG)
<br/>
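
Computing a centroid is just a component-wise average. A minimal sketch with made-up document vectors (the helper name `centroid` is an assumption for this example):

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Hypothetical outlier-cleaned document vectors of one category.
cleaned_document_vectors = [
    [1.0, 0.0],
    [0.8, 0.2],
    [0.9, 0.4],
]

label_vector = centroid(cleaned_document_vectors)
print(label_vector)  # approximately [0.9, 0.2]
```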

<a id='sect1_6'></a>
### <font color='darkgreen'>6. Text Document Classification</font>
<b><font size='3ptx'>The algorithm computes <font color='brown'>label vector &lt;-&gt; document vector similarities</font> for each label vector and document vector in the dataset. Finally, each text document is assigned to the category with the highest <font color='brown'>label vector &lt;-&gt; document vector similarity</font>.</font></b>
![doc classifiers](images/6.PNG)
<br/>
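
The final classification step amounts to an argmax over similarities. The sketch below uses hypothetical 2-D label and document vectors and assumes cosine similarity as the measure. The function `classify` is illustrative and not part of the Lbl2Vec API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(document_vector, label_vectors):
    """Return the most similar label and all label similarities."""
    similarities = {
        label: cosine_similarity(document_vector, label_vector)
        for label, label_vector in label_vectors.items()
    }
    best_label = max(similarities, key=similarities.get)
    return best_label, similarities


# Hypothetical label vectors (centroids) and document vectors.
label_vectors = {
    "rec.sport.hockey": [0.9, 0.2],
    "sci.crypt": [0.1, 0.95],
}
document_vectors = {
    "doc_1": [0.8, 0.3],
    "doc_2": [0.2, 0.7],
}

predictions = {
    doc_key: classify(vector, label_vectors)[0]
    for doc_key, vector in document_vectors.items()
}
print(predictions)
```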

<a id='sect2'></a>
## <font color='darkblue'>Lbl2Vec Tutorial</font>
* <b><font size='3ptx'><a href='#sect2_1'>Installing Lbl2Vec</a></font></b>
* <b><font size='3ptx'><a href='#sect2_2'>Reading the Data</a></font></b>
* <b><font size='3ptx'><a href='#sect2_3'>Preprocessing the Data</a></font></b>
* <b><font size='3ptx'><a href='#sect2_4'>Training Lbl2Vec</a></font></b>
* <b><font size='3ptx'><a href='#sect2_5'>Classification of Text Documents</a></font></b>
<br/><br/>

<b><font size='3ptx'>In this tutorial we will use <a href='https://github.com/sebischair/Lbl2Vec'>Lbl2Vec</a> to classify text documents from the <a href='http://qwone.com/~jason/20Newsgroups/'>20 Newsgroups dataset</a>. It is a collection of approximately 20,000 text documents, partitioned evenly across 20 different newsgroups categories.</font></b>

In this tutorial, we will focus on a subset of the 20 Newsgroups dataset consisting of the categories “rec.motorcycles”, “rec.sport.baseball”, “rec.sport.hockey” and “sci.crypt”. Furthermore, we will use already predefined keywords for each classification category. The predefined keywords can be downloaded <a href='https://github.com/TimSchopf/MediumBlog/blob/main/data/20newsgroups_keywords.csv'>here</a>. You can also access more <b><a href='https://github.com/sebischair/Lbl2Vec/tree/main/examples'>Lbl2Vec examples on GitHub</a></b>.

<a id='sect2_1'></a>
### <font color='darkgreen'>Installing Lbl2Vec</font>
We can install Lbl2Vec using pip with the following command:

In [2]:
#!pip install lbl2vec

<a id='sect2_2'></a>
### <font color='darkgreen'>Reading the Data</font>
We store the downloaded “<a href='https://github.com/TimSchopf/MediumBlog/blob/main/data/20newsgroups_keywords.csv'>20newsgroups_keywords.csv</a>” file in a `datas/` directory next to our Python script (adjust the path passed to `pd.read_csv` if you saved it elsewhere). Then we read the CSV with pandas and fetch the 20 Newsgroups dataset from Scikit-learn.

In [8]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# load data
train = fetch_20newsgroups(subset='train', shuffle=False)
test = fetch_20newsgroups(subset='test', shuffle=False)

# parse data to pandas DataFrames
newsgroup_test = pd.DataFrame({'article':test.data, 'class_index':test.target})
newsgroup_train = pd.DataFrame({'article':train.data, 'class_index':train.target})

# load labels with keywords
labels = pd.read_csv('datas/20newsgroups_keywords.csv',sep=';')
labels

| | class_index | class_name | keywords |
|---|---|---|---|
| 0 | 8 | rec.motorcycles | Bike Dod Ride Bmw Riding Bikes Motorcycle Ride... |
| 1 | 9 | rec.sport.baseball | Baseball Game Team Year Players Games Hit Brav... |
| 2 | 10 | rec.sport.hockey | Hockey Game Team Nhl Play Season Games Espn Pl... |
| 3 | 11 | sci.crypt | Key Clipper Encryption Chip Keys Privacy Escro... |


<a id='sect2_3'></a>
### <font color='darkgreen'>Preprocessing the Data</font>
<b><font size='3ptx'>To train a Lbl2Vec model, we need to preprocess the data</font></b>. First, we process the keywords to be used as input for Lbl2Vec.

In [9]:
# split keywords by separator and save them as array
labels['keywords'] = labels['keywords'].apply(lambda x: x.split(' '))

# convert description keywords to lowercase
labels['keywords'] = labels['keywords'].apply(
    lambda description_keywords: [keyword.lower() for keyword in description_keywords])

# get number of keywords for each class
labels['number_of_keywords'] = labels['keywords'].apply(len)

# let's check our keywords
labels

| | class_index | class_name | keywords | number_of_keywords |
|---|---|---|---|---|
| 0 | 8 | rec.motorcycles | [bike, dod, ride, bmw, riding, bikes, motorcyc... | 28 |
| 1 | 9 | rec.sport.baseball | [baseball, game, team, year, players, games, h... | 31 |
| 2 | 10 | rec.sport.hockey | [hockey, game, team, nhl, play, season, games,... | 35 |
| 3 | 11 | sci.crypt | [key, clipper, encryption, chip, keys, privacy... | 40 |


We see that the keywords describe each classification category and the number of keywords varies.

We also need to preprocess the news articles. To do so, we tokenize each document into words and add <b><a href='https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument'>gensim.models.doc2vec.TaggedDocument</a></b> tags. <a href='https://github.com/sebischair/Lbl2Vec'><b>Lbl2Vec</b></a> expects tokenized and tagged documents as its training input.

In [10]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
from gensim.models.doc2vec import TaggedDocument


def tokenize(doc):
    '''Tokenize the given document string.

    - strip_tags removes markup tags (e.g. HTML) from the text.
    - simple_preprocess converts the text into a list of lowercase tokens,
      ignoring tokens shorter than min_len or longer than max_len characters.
    - simple_preprocess also drops numerical values and punctuation characters;
      deacc=True additionally strips accent marks.

    Args:
        doc: document text string

    Returns:
        list of token strings
    '''
    return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=15)

# Add data set type column
newsgroup_train['data_set_type'] = 'train'
newsgroup_test['data_set_type'] = 'test'

# Concat train and test data
newsgroup_full_corpus = pd.concat(
    [newsgroup_train, newsgroup_test]).reset_index(drop=True)

# Reduce dataset to only articles that belong to classes where we defined our keywords
newsgroup_full_corpus = newsgroup_full_corpus[
    newsgroup_full_corpus['class_index'].isin(list(labels['class_index']))]

# Tokenize and tag documents for Lbl2Vec training
newsgroup_full_corpus['tagged_docs'] = newsgroup_full_corpus.apply(
    lambda row: TaggedDocument(tokenize(row['article']), [str(row.name)]), axis=1)

# Add doc_key column
newsgroup_full_corpus['doc_key'] = newsgroup_full_corpus.index.astype(str)

# Add class_name column
newsgroup_full_corpus = newsgroup_full_corpus.merge(
    labels, left_on='class_index', right_on='class_index', how='left')

newsgroup_full_corpus.head()

| | article | class_index | data_set_type | tagged_docs | doc_key | class_name | keywords | number_of_keywords |
|---|---|---|---|---|---|---|---|---|
| 0 | From: cubbie@garnet.berkeley.edu ( ... | 9 | train | ([from, cubbie, garnet, berkeley, edu, subject... | 0 | rec.sport.baseball | [baseball, game, team, year, players, games, h... | 31 |
| 1 | From: crypt-comments@math.ncsu.edu\nSubject: C... | 11 | train | ([from, crypt, comments, math, ncsu, edu, subj... | 2 | sci.crypt | [key, clipper, encryption, chip, keys, privacy... | 40 |
| 2 | From: george@minster.york.ac.uk\nSubject: Non-... | 11 | train | ([from, george, minster, york, ac, uk, subject... | 11 | sci.crypt | [key, clipper, encryption, chip, keys, privacy... | 40 |
| 3 | From: williac@govonca.gov.on.ca (Chris William... | 10 | train | ([from, williac, govonca, gov, on, ca, chris, ... | 12 | rec.sport.hockey | [hockey, game, team, nhl, play, season, games,... | 35 |
| 4 | From: ayari@judikael.loria.fr (Ayari Iskander)... | 10 | train | ([from, ayari, judikael, loria, fr, ayari, isk... | 15 | rec.sport.hockey | [hockey, game, team, nhl, play, season, games,... | 35 |


We can see the article texts and their classification categories in the DataFrame. The “`tagged_docs`” column contains the preprocessed documents that Lbl2Vec needs as input. The classification categories in the “`class_name`” column are used for evaluation only, not for Lbl2Vec training.

<a id='sect2_4'></a>
### <font color='darkgreen'>Training Lbl2Vec</font>
After preparing the data, we can now train a Lbl2Vec model on the training dataset. We initialize the model with the following parameters:
* <b><font color='violet'>keywords_list</font></b> : iterable list of lists with descriptive keywords for each category.
* <b><font color='violet'>tagged_documents</font></b> : iterable list of <b><a href='https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument'>gensim.models.doc2vec.TaggedDocument</a></b> elements. Each element consists of one document.
* <b><font color='violet'>label_names</font></b> : iterable list of custom names for each label. Label names and keywords of the same topic must have the same index.
* <b><font color='violet'>similarity_threshold</font></b> : only documents with a higher similarity to the respective description keywords than this threshold are used to calculate the label embedding.
* <b><font color='violet'>min_num_docs</font></b> : minimum number of documents that are used to calculate the label embedding.
* <b><font color='violet'>epochs</font></b> : number of iterations over the corpus.

In [12]:
%%time
from lbl2vec import Lbl2Vec

# init model with parameters
Lbl2Vec_model = Lbl2Vec(
    keywords_list=list(labels.keywords),
    tagged_documents=newsgroup_full_corpus['tagged_docs'][newsgroup_full_corpus['data_set_type'] == 'train'],
    label_names=list(labels.class_name),
    similarity_threshold=0.43,
    min_num_docs=100, epochs=10)

# train model
Lbl2Vec_model.fit()

2021-12-26 14:18:42,942 - Lbl2Vec - INFO - Train document and word embeddings
2021-12-26 14:18:48,679 - Lbl2Vec - INFO - Train label embeddings


Wall time: 6 s
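
As a rough illustration of how `similarity_threshold` and `min_num_docs` might interact, the sketch below keeps the documents above the threshold and falls back to the top-ranked documents when too few pass. This is an assumption based on the parameter descriptions above, not Lbl2Vec's actual code. The function `select_candidates` and the similarity values are hypothetical.

```python
def select_candidates(similarities, similarity_threshold, min_num_docs):
    """Pick the documents used to build one label embedding (assumed logic)."""
    # Rank documents from most to least similar to the category keywords.
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    selected = [doc_key for doc_key, sim in ranked if sim > similarity_threshold]
    # Assumed fallback: if the threshold is too strict, take the best
    # min_num_docs documents instead.
    if len(selected) < min_num_docs:
        selected = [doc_key for doc_key, _ in ranked[:min_num_docs]]
    return selected


# Hypothetical keyword similarities of four documents for one category.
similarities = {"doc_a": 0.61, "doc_b": 0.47, "doc_c": 0.40, "doc_d": 0.12}

print(select_candidates(similarities, similarity_threshold=0.43, min_num_docs=1))
print(select_candidates(similarities, similarity_threshold=0.70, min_num_docs=3))
```

Under this reading, a lower threshold admits more (and noisier) documents, while `min_num_docs` guarantees that every label still has enough documents to average over.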


<a id='sect2_5'></a>
### <font color='darkgreen'>Classification of Text Documents</font>
After the model is trained, we can predict the categories of documents used to train the Lbl2Vec model.

In [14]:
%%time
from sklearn.metrics import f1_score

# predict similarity scores
model_docs_lbl_similarities = Lbl2Vec_model.predict_model_docs()

# merge DataFrames to compare the predicted and true category labels
evaluation_train = model_docs_lbl_similarities.merge(
    newsgroup_full_corpus[newsgroup_full_corpus['data_set_type'] == 'train'],
    left_on='doc_key', right_on='doc_key')
y_true_train = evaluation_train['class_name']
y_pred_train = evaluation_train['most_similar_label']

print('F1 score:',f1_score(y_true_train, y_pred_train, average='micro'))

2021-12-26 14:20:42,900 - Lbl2Vec - INFO - Get document embeddings from model
2021-12-26 14:20:42,904 - Lbl2Vec - INFO - Calculate document<->label similarities


F1 score: 0.8744769874476988
Wall time: 3.18 s


<b>Our model predicts the correct document categories with a respectable F1 score of 0.87. This is achieved without ever seeing the document labels during training</b>.
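
The `average='micro'` setting pools true positives, false positives, and false negatives over all classes before computing F1. For single-label multiclass predictions like these, the result equals plain accuracy. A small pure-Python sketch with made-up labels shows why (the helper `micro_f1` is hypothetical and not part of scikit-learn):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multiclass predictions."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for true, pred in zip(y_true, y_pred):
            tp += true == label and pred == label
            fp += true != label and pred == label
            fn += true == label and pred != label
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: four of five predictions are correct.
y_true = ["hockey", "hockey", "crypt", "crypt", "baseball"]
y_pred = ["hockey", "crypt", "crypt", "crypt", "baseball"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both equal the share of correct predictions.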

Moreover, we can also predict the classification categories of documents that were not used to train the Lbl2Vec model and are therefore completely unknown to it. To this end, we predict the categories of documents from the previously unused test dataset.

In [15]:
# predict similarity scores of new test documents (they were not used during Lbl2Vec training)
new_docs_lbl_similarities = Lbl2Vec_model.predict_new_docs(
    tagged_docs=newsgroup_full_corpus['tagged_docs'][newsgroup_full_corpus['data_set_type']=='test'])

# merge DataFrames to compare the predicted and true topic labels
evaluation_test = new_docs_lbl_similarities.merge(
    newsgroup_full_corpus[newsgroup_full_corpus['data_set_type']=='test'], left_on='doc_key', right_on='doc_key')
y_true_test = evaluation_test['class_name']
y_pred_test = evaluation_test['most_similar_label']

print('F1 score:',f1_score(y_true_test, y_pred_test, average='micro'))

2021-12-26 14:22:45,984 - Lbl2Vec - INFO - Calculate document embeddings
2021-12-26 14:22:47,606 - Lbl2Vec - INFO - Calculate document<->label similarities


F1 score: 0.8572327044025158


Our trained Lbl2Vec model can even predict the classification categories of new documents with an F1 score of 0.86. As mentioned before, this is achieved with a completely unsupervised approach in which no label information was used during training.

For more details about the features available in Lbl2Vec, please check out the <a href='https://github.com/sebischair/Lbl2Vec'>Lbl2Vec GitHub repository</a>. I hope you found this tutorial to be useful.

<a id='sect3'></a>
## <font color='darkblue'>Summary</font>
<b><font size='3ptx'><a href='https://pypi.org/project/lbl2vec/'>Lbl2Vec</a> is a recently developed approach that can be used for unsupervised text document classification.</font></b> Unlike supervised state-of-the-art approaches, it needs no label information during training and therefore offers the opportunity to run low-cost text classification on unlabeled datasets. The open-source Lbl2Vec library is also easy to use and allows developers to train models in just a few lines of code.