# Classification with document embedding techniques

I will test several document embedding techniques to assess how well they support text classification.

Techniques covered:
- **TF-IDF**
- Google's **word2vec**
- Le and Mikolov's **doc2vec**

Info sources:
- [**Le and Mikolov: Distributed Representations of Sentences and Documents**](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
- [Susan Li: Multi-Class Text Classification Model Comparison and Selection](https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568)
- [nadbordrozd: Text Classification With Word2Vec](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)
- [gensim: Doc2Vec Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py)
- [Understanding Word2Vec and Doc2Vec](https://shuzhanfan.github.io/2018/08/understanding-word2vec-and-doc2vec/)
- [Jay Alammar: The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/)
- [Quora: How does word2vec work? Can someone walk through a specific example?](https://www.quora.com/How-does-word2vec-work-Can-someone-walk-through-a-specific-example)
- [Allison Parrish: Understanding word vectors](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469)

In [1]:
from collections import Counter
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from tqdm.notebook import tqdm
import gensim
import nltk
import numpy as np
import os
import pandas as pd

## The data

### Load data

In [2]:
filenames = os.listdir(os.path.join('data', 'cbr-ilp-ir-son-int'))
corpus = []
for filename in filenames:
    with open(os.path.join('data', 'cbr-ilp-ir-son-int', filename), 'r', encoding='latin-1') as f:
        corpus.append(f.read())

### See the data

The corpus is a collection of academic papers, each labeled with one of several classes. Here are the first characters of one document:

In [3]:
corpus[314][:314]

'A Case-Based Methodology for Planning Individualized Case Oriented Tutoring\n\nAlexander Seitz\n\nDept. of Artificial Intelligence\nUniversity of Ulm\nD-89069 Ulm, Germany\nseitz@ki.informatik.uni-ulm.de\n\n\n\nCase oriented tutoring gives students the possibility to practice their acquired\ntheoretical knowledge in the cont'

### Classes

In [4]:
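# The class is the filename prefix before the first '-' or '_'
classes = [(filename.split('-')[0]).split('_')[0] for filename in filenames]
# Map each class name to an integer id; sorting makes the mapping reproducible
which_class = {name: i for i, name in enumerate(sorted(set(classes)))}
y = np.array([which_class[name] for name in classes])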

In [5]:
Counter(classes)

Counter({'SON': 101, 'CBR': 276, 'RI': 179, 'ILP': 119, 'INT': 6})

## TF-IDF

We will use scikit-learn's TF-IDF to vectorize our documents.

### Vectorize it

In [6]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X = X.toarray()  # convert the sparse matrix to a dense NumPy array
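
To make the weighting concrete, here is a small sketch on three made-up sentences (the toy documents and the `toy_` names are illustrative and not part of the corpus). With scikit-learn's default settings, a term that appears in every document should receive a lower weight than a term that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents (illustrative only, not taken from the corpus)
toy_docs = [
    "case based reasoning system",
    "case based planning system",
    "inductive logic programming system",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index
col = toy_vectorizer.vocabulary_
first_doc = toy_X[0].toarray().ravel()

# 'system' occurs in all three documents, 'case' in two, 'reasoning' in one
for term in ("system", "case", "reasoning"):
    print(term, round(first_doc[col[term]], 3))
```

The rarer the term across documents, the larger its weight in the first document's vector.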

### Split data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
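
The class counts above are imbalanced (INT has only 6 documents), so a plain random split may leave very few INT documents in the test set. One option is a stratified split with a fixed seed. The sketch below uses toy labels that mirror those counts; the `toy_` names, the placeholder features, and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels mirroring the class counts shown above (the integer ids are arbitrary)
toy_counts = {0: 101, 1: 276, 2: 179, 3: 119, 4: 6}
toy_y = np.repeat(list(toy_counts.keys()), list(toy_counts.values()))
toy_X = np.zeros((len(toy_y), 1))  # placeholder features, only the labels matter here

# stratify keeps class proportions roughly equal in both parts;
# random_state makes the split reproducible
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)
print(Counter(toy_y_test))
```

In the notebook, the same two arguments could be passed alongside `X` and `y`.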

### Train model

We now train a classifier.

In [8]:
%%time
lr = LogisticRegressionCV(cv=5, max_iter=10000).fit(X_train, y_train)



CPU times: user 3min 55s, sys: 2.3 s, total: 3min 57s
Wall time: 59.9 s


### Evaluate

In [9]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

1.0

In [10]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.9635036496350365

Very good results: about 96% accuracy on the test set, with a perfect score on the training set.
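
With imbalanced classes, a single accuracy figure can hide weak performance on a small class. A minimal sketch of a per-class view follows, using hypothetical labels and predictions (`toy_true` and `toy_pred` are made up for illustration).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
toy_true = np.array([0, 0, 0, 1, 1, 2])
toy_pred = np.array([0, 0, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)

# Fraction of each true class that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)
overall_accuracy = cm.diagonal().sum() / cm.sum()

print(cm)
print(per_class_recall, overall_accuracy)
```

In the notebook, `toy_true` and `toy_pred` would be replaced by `y_test` and `lr.predict(X_test)`.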

## word2vec

We will use gensim's implementation of Google's word2vec to vectorize the words.

### Vectorize it

In [11]:
# Preprocess data
tokenized_corpus = [gensim.utils.simple_preprocess(doc) for doc in corpus]

In [12]:
%%time
model = gensim.models.Word2Vec(tokenized_corpus)

CPU times: user 4.37 s, sys: 24.4 ms, total: 4.39 s
Wall time: 2.32 s


'Novel' is a very common word in abstracts. Let's look at the words most similar to it:

In [13]:
# Curiosity
model.wv.most_similar('novel', topn=15)

[('previous', 0.9902200698852539),
 ('define', 0.9893702864646912),
 ('implemented', 0.9886905550956726),
 ('apply', 0.9883219003677368),
 ('make', 0.9879704713821411),
 ('possible', 0.9867957234382629),
 ('take', 0.986771285533905),
 ('them', 0.9866843223571777),
 ('compare', 0.9866756200790405),
 ('designed', 0.9865284562110901),
 ('develop', 0.9856593012809753),
 ('many', 0.9850836396217346),
 ('then', 0.9844640493392944),
 ('relevant', 0.9840606451034546),
 ('identify', 0.9840162396430969)]

In [14]:
w2v = dict(zip(model.wv.index2word, model.wv.vectors))

In [15]:
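# Represent each document as the mean of its word vectors
vectorized_corpus = []
for doc in tqdm(tokenized_corpus):
    # Keep only words in the model's vocabulary (a dict lookup is faster than scanning the list)
    vectorized_doc = [w2v[word] for word in doc if word in w2v]
    vectorized_corpus.append(np.mean(vectorized_doc, axis=0))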


In [16]:
X = np.array(vectorized_corpus)
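
The averaging step can be written as a small helper. The sketch below is one possible version, not the notebook's own code: `mean_vector` and the toy vectors are hypothetical. It also returns a zero vector when none of a document's words are in the vocabulary, a case in which averaging an empty list would produce NaN.

```python
import numpy as np

def mean_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; zeros if none are found."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 2-dimensional "word vectors" (illustrative only)
toy_vectors = {"case": np.array([1.0, 2.0]), "logic": np.array([3.0, 4.0])}

doc_vec = mean_vector(["case", "logic", "unseen"], toy_vectors, dim=2)
empty_vec = mean_vector(["unseen"], toy_vectors, dim=2)
print(doc_vec, empty_vec)
```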

### Split data

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Train model

In [18]:
%%time
lr = LogisticRegressionCV(cv=5, max_iter=10000).fit(X_train, y_train)



CPU times: user 28.8 s, sys: 304 ms, total: 29.1 s
Wall time: 7.29 s


### Evaluate

In [19]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

0.9852941176470589

In [20]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.9343065693430657

These are also good results (about 93% test accuracy), but no better than TF-IDF.

## doc2vec

We will try gensim's implementation of Le and Mikolov's doc2vec.

### Vectorize it

We tag the documents:

In [21]:
tagged_corpus = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_corpus)]

In [22]:
%%time
model = Doc2Vec(
    documents=tagged_corpus,
    vector_size=50,
    min_count=2,
    epochs=30,
    total_examples=len(tagged_corpus)
)

CPU times: user 29.4 s, sys: 296 ms, total: 29.7 s
Wall time: 12.4 s


In [23]:
X = np.array([model.docvecs[i] for i in range(len(tagged_corpus))])

### Split data

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Train model

In [25]:
%%time
lr = LogisticRegressionCV(cv=5, max_iter=10000).fit(X_train, y_train)

CPU times: user 6.51 s, sys: 64 ms, total: 6.57 s
Wall time: 1.65 s


### Assessing the model

We can check whether, for each document, the model infers a vector similar to the one it learned for that document during training.

In [26]:
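ranks = []
second_ranks = []
for doc_id in range(len(tokenized_corpus)):
    # Infer a fresh vector for the document and rank all trained vectors by similarity to it
    inferred_vector = model.infer_vector(tokenized_corpus[doc_id])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    # Position of the document's own trained vector in the ranking (0 = most similar)
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    # Second most similar document, kept for inspection
    second_ranks.append(sims[1])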

In [27]:
Counter(ranks)

Counter({0: 681})

This means that for all 681 documents, the inferred vector is most similar to the document's own trained vector (rank 0). This is good.
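
The ranking idea can be illustrated without gensim. The sketch below assumes cosine similarity as the measure and uses made-up 2-dimensional vectors; `self_rank` and the `toy_` arrays are hypothetical names.

```python
import numpy as np

def self_rank(inferred, doc_vectors, doc_id):
    """Rank (0 = best) of document `doc_id` when all documents are
    sorted by cosine similarity to the `inferred` vector."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(inferred)
    sims = doc_vectors @ inferred / norms
    order = np.argsort(-sims)  # document indices from most to least similar
    return int(np.where(order == doc_id)[0][0])

# Toy "trained" document vectors (illustrative only)
toy_doc_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Pretend this vector was inferred for document 0
toy_inferred = np.array([0.9, 0.1])

print(self_rank(toy_inferred, toy_doc_vectors, doc_id=0))
```

A rank of 0 means the inferred vector is closer to the document's own trained vector than to any other.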

### Evaluate

In [28]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

0.9834558823529411

In [29]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.9343065693430657

Nice results: about 93% test accuracy, the same as with the averaged word2vec vectors.

## Conclusion

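All three approaches worked well on this dataset: TF-IDF reached about 96% test accuracy, and both word2vec and doc2vec reached about 93%. Each model was evaluated on a different random split, so small differences should be read with caution.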
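
One way to make such a comparison more robust is to score every feature representation on the same cross-validation folds. The sketch below shows the idea on synthetic data; the data set, the two feature sets, and all names are placeholders, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the document features and labels
toy_X, toy_y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# A fixed seed gives the same folds for every feature set
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Placeholder "representations"; in the notebook these would be the
# TF-IDF, word2vec, and doc2vec matrices
feature_sets = {"all 20 features": toy_X, "first 5 features": toy_X[:, :5]}

results = {}
for name, features in feature_sets.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000), features, toy_y, cv=folds)
    results[name] = (scores.mean(), scores.std())
    print(name, results[name])
```

Reporting the mean and standard deviation over folds gives a sense of how much of a gap between representations is noise.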