# Classification with word embedding techniques

I will test several word embedding techniques to assess how well each one supports text classification.

Techniques covered:
- **TF-IDF**
- Google's **word2vec**
- Le and Mikolov's **doc2vec**

Info sources:
- [**Le and Mikolov: Distributed Representations of Sentences and Documents**](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
- [Susan Li: Multi-Class Text Classification Model Comparison and Selection](https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568)
- [nadbordrozd: Text Classification With Word2Vec](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)
- [gensim: Doc2Vec Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py)
- [Jay Alammar: The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/)
- [Quora: How does word2vec work? Can someone walk through a specific example?](https://www.quora.com/How-does-word2vec-work-Can-someone-walk-through-a-specific-example)
- [Allison Parrish: Understanding word vectors](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469)

In [1]:
from collections import Counter
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from tqdm.notebook import tqdm
import gensim
import nltk
import numpy as np
import os
import pandas as pd

## The data

### Load data

In [2]:
filenames = os.listdir(os.path.join('data', 'cbr-ilp-ir-son-int'))
corpus = []
for filename in filenames:
    with open(os.path.join('data', 'cbr-ilp-ir-son-int', filename), 'r', encoding='latin-1') as f:
        corpus.append(f.read())

### See the data

The corpus consists of research papers, each belonging to one of five classes. The class of each paper is encoded in its file name.

In [3]:
corpus[314][:314]

'A Case-Based Methodology for Planning Individualized Case Oriented Tutoring\n\nAlexander Seitz\n\nDept. of Artificial Intelligence\nUniversity of Ulm\nD-89069 Ulm, Germany\nseitz@ki.informatik.uni-ulm.de\n\n\n\nCase oriented tutoring gives students the possibility to practice their acquired\ntheoretical knowledge in the cont'

### Classes

In [4]:
classes = [(filename.split('-')[0]).split('_')[0] for filename in filenames]
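# Sort the class names so the name-to-id mapping is the same on every run
# (iteration order of a set of strings is not guaranteed to be stable).
which_class = {name: i for i, name in enumerate(sorted(set(classes)))}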
y = np.array([which_class[name] for name in classes])

In [5]:
Counter(classes)

Counter({'SON': 101, 'CBR': 276, 'RI': 179, 'ILP': 119, 'INT': 6})

## TF-IDF

We will use scikit-learn's TF-IDF to vectorize our documents.
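
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents for its default settings; the function name `tfidf` and the toy documents are made up for illustration and are not part of this notebook's pipeline.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each document; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy corpus, for illustration only.
toy_docs = [
    "case based reasoning for planning".split(),
    "inductive logic programming for learning".split(),
    "case retrieval for tutoring".split(),
]
toy_weights = tfidf(toy_docs)
```

In the first toy document, "for" appears in every document and so receives a lower weight than "planning", which appears only once in the corpus.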

### Vectorize it

In [6]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Convert the sparse matrix to a dense NumPy array.
X = X.toarray()

### Split data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=314)
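
The class counts above are quite uneven (INT has only 6 documents), so a purely random split may leave very few INT documents on one side. As a hedged alternative, not what this notebook does, `train_test_split` accepts a `stratify` argument that keeps class proportions similar in both parts. The sketch below uses placeholder features and arbitrary class ids with the same class sizes as the corpus.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class sizes as the corpus (the ids 0..4 are arbitrary here).
class_sizes = {0: 101, 1: 276, 2: 179, 3: 119, 4: 6}
y_demo = np.concatenate([np.full(n, label) for label, n in class_sizes.items()])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo asks for roughly the same class proportions in both parts.
_, _, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=314, stratify=y_demo
)
train_counts = Counter(y_train_demo.tolist())
test_counts = Counter(y_test_demo.tolist())
```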

### Train model

We now train a logistic regression classifier, using 5-fold cross-validation to choose the regularization strength.

In [8]:
%%time
# TF-IDF rows are already L2-normalized, so the features are used unscaled.
lr = LogisticRegressionCV(cv=5, random_state=314).fit(X_train, y_train)

CPU times: user 3min 43s, sys: 2.31 s, total: 3min 45s
Wall time: 57.2 s


### Evaluate

In [9]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

1.0

In [10]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.9781021897810219

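Very good results: the classifier fits the training data perfectly and misclassifies only 3 of the 137 test documents (about 97.8% accuracy).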

## word2vec

We will use gensim's implementation of Google's word2vec to vectorize the words, and then represent each document as the average of its word vectors.

### Vectorize it

In [11]:
# Preprocess data
tokenized_corpus = [gensim.utils.simple_preprocess(doc) for doc in corpus]

In [12]:
%%time
model = gensim.models.Word2Vec(tokenized_corpus)

CPU times: user 7.88 s, sys: 24.2 ms, total: 7.9 s
Wall time: 3.48 s


'Novel' is a very common word in abstracts. Let's see which words are most similar to it:

In [13]:
# Curiosity
model.wv.most_similar('novel', topn=15)

[('previous', 0.9933876395225525),
 ('several', 0.9925034046173096),
 ('coverage', 0.9901109337806702),
 ('suggest', 0.989838719367981),
 ('initial', 0.989747166633606),
 ('successfully', 0.9896761178970337),
 ('examine', 0.9889732599258423),
 ('demonstrate', 0.9888555407524109),
 ('them', 0.9885226488113403),
 ('evaluate', 0.9884669780731201),
 ('according', 0.9883145689964294),
 ('combine', 0.9882522821426392),
 ('then', 0.9882426261901855),
 ('conducted', 0.9876145720481873),
 ('develop', 0.9875271916389465)]
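
The scores above are cosine similarities between word vectors. All 15 neighbours score above 0.98, including function words such as 'them' and 'then', which may suggest that a corpus of this size does not separate the word vectors very well. The sketch below shows the computation on made-up 3-dimensional vectors; the helper names and the toy vectors are hypothetical and unrelated to the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, for illustration only.
toy_vectors = {
    "novel": np.array([1.0, 2.0, 0.5]),
    "new": np.array([0.9, 2.1, 0.4]),
    "database": np.array([-1.0, 0.2, 3.0]),
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = most_similar("novel", toy_vectors)
```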

In [14]:
w2v = dict(zip(model.wv.index2word, model.wv.vectors))

In [15]:
vectorized_corpus = []
for doc in tqdm(tokenized_corpus):
    vectorized_doc = []
    for word in doc:
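        # Skip words that are not in the vocabulary; the dict lookup is O(1),
        # whereas searching the index2word list is linear in the vocabulary size.
        if word in w2v: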
            vectorized_doc.append(w2v[word])
    vectorized_corpus.append(np.mean(vectorized_doc, axis=0))

In [16]:
X = np.array(vectorized_corpus)
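
The averaging step above can be sketched in isolation. One caveat, stated as an assumption about edge cases rather than something observed in this corpus: if none of a document's words are in the vocabulary, `np.mean` of an empty list returns `nan`. The hypothetical helper `mean_embedding` below falls back to a zero vector in that case.

```python
import numpy as np

def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 2-dimensional word vectors, for illustration only.
toy_w2v = {"case": np.array([1.0, 0.0]), "based": np.array([0.0, 1.0])}

doc_vector = mean_embedding(["case", "based", "unseen"], toy_w2v, dim=2)
empty_vector = mean_embedding(["unseen"], toy_w2v, dim=2)
```

Averaging discards word order and weights every word equally, which may partly explain why this representation does not beat TF-IDF here.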

### Split data

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=314)

### Train model

In [18]:
%%time
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
lr = LogisticRegressionCV(cv=5, random_state=314, max_iter=10000).fit(X_train, y_train)

CPU times: user 38 s, sys: 412 ms, total: 38.4 s
Wall time: 9.6 s
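
The scale-then-fit steps in the cell above can also be bundled into a single estimator. This is a sketch on synthetic data, assuming scikit-learn's `make_pipeline`; it is an alternative way to write the same steps, not the code used for the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(314)
# Synthetic stand-in for document vectors: two well-separated classes.
X_demo = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(4.0, 1.0, (60, 5))])
y_demo = np.array([0] * 60 + [1] * 60)

# The scaler is fitted on the data passed to fit() and reapplied
# automatically whenever the pipeline predicts or scores.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(cv=5, random_state=314, max_iter=10000),
)
clf.fit(X_demo, y_demo)
train_accuracy = clf.score(X_demo, y_demo)
```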


### Evaluate

In [19]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

0.9852941176470589

In [20]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.927007299270073

Good results too (about 92.7% test accuracy), but below the 97.8% obtained with TF-IDF.

## doc2vec

We will try the implementation of Le and Mikolov's doc2vec from gensim.

### Vectorize it

doc2vec needs each document to carry a unique tag; we use the document's index in the corpus:

In [21]:
tagged_corpus = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_corpus)]

In [22]:
%%time
model = Doc2Vec(
    documents=tagged_corpus,
    vector_size=50,
    min_count=2,
    epochs=30,
    total_examples=len(tagged_corpus)
)

CPU times: user 33.2 s, sys: 255 ms, total: 33.5 s
Wall time: 13.8 s


In [23]:
X = np.array([model.docvecs[i] for i in range(len(tagged_corpus))])

### Split data

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=314)

### Train model

In [25]:
%%time
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
lr = LogisticRegressionCV(cv=5, random_state=314, max_iter=10000).fit(X_train, y_train)

CPU times: user 5.97 s, sys: 76 ms, total: 6.05 s
Wall time: 1.52 s


### Assessing the model

We can check whether the model, given a document's text, infers a vector similar to the one it learned for that document during training.

In [26]:
ranks = []
second_ranks = []
for doc_id in range(len(tokenized_corpus)):
    inferred_vector = model.infer_vector(tokenized_corpus[doc_id])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

In [27]:
Counter(ranks)

Counter({0: 680, 1: 1})

This means that for 680 of the 681 documents, the inferred vector is most similar to the document's own trained vector; for the remaining one, its own vector ranks second. This is good news.
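
The rank check above can be reproduced with plain NumPy on toy data. In this sketch the "inferred" vectors are simply the trained ones plus a little noise, which is an assumption made for illustration; the function name `self_similarity_ranks` is hypothetical.

```python
import numpy as np

def self_similarity_ranks(trained, inferred):
    """For each document, the rank of its own trained vector among all trained
    vectors, ordered by cosine similarity to its inferred vector (0 = closest)."""
    trained_n = trained / np.linalg.norm(trained, axis=1, keepdims=True)
    inferred_n = inferred / np.linalg.norm(inferred, axis=1, keepdims=True)
    sims = inferred_n @ trained_n.T    # sims[i, j] = cos(inferred i, trained j)
    order = np.argsort(-sims, axis=1)  # most similar first
    return [int(np.where(order[i] == i)[0][0]) for i in range(len(trained))]

rng = np.random.default_rng(0)
trained_demo = rng.normal(size=(20, 8))
# Stand-in for inferred vectors: trained vectors plus small noise (an assumption).
inferred_demo = trained_demo + rng.normal(scale=0.01, size=trained_demo.shape)
ranks_demo = self_similarity_ranks(trained_demo, inferred_demo)
```

A rank of 0 for every document means each inferred vector is closest to its own trained vector, which is the pattern seen for 680 of the 681 documents.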

### Evaluate

In [28]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

0.9797794117647058

In [29]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.9416058394160584

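Nice results: doc2vec reaches about 94.2% test accuracy (129 of 137 documents), ahead of averaged word2vec (92.7%) but behind TF-IDF (97.8%).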