# Classification with document embedding techniques (2)

I will test various document embedding techniques to assess their ability to classify texts.

Techniques covered:
- **Universal Sentence Encoder**

Info sources:
- [**Cer et al.: Universal Sentence Encoder**](https://static.googleusercontent.com/media/research.google.com/pt-BR//pubs/archive/46808.pdf)

In [1]:
from nltk.tokenize import sent_tokenize
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
import numpy as np
import os
import tensorflow as tf
import tensorflow_hub as hub

## The data

### Load data

In [2]:
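# Read every file in the corpus directory into memory, one string per document
data_dir = os.path.join('data', 'cbr-ilp-ir-son-int')
filenames = sorted(os.listdir(data_dir))  # sorted for a reproducible document order
corpus = []
for filename in filenames:
    # The files are decoded as Latin-1
    with open(os.path.join(data_dir, filename), 'r', encoding='latin-1') as f:
        corpus.append(f.read())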

### Classes

In [3]:
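# The class name is the file-name prefix before the first '-' or '_'
classes = [filename.split('-')[0].split('_')[0] for filename in filenames]
# Sort the class names so the name-to-id mapping is the same on every run
which_class = {name: idx for idx, name in enumerate(sorted(set(classes)))}
y = np.array([which_class[name] for name in classes])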
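
The labels are derived from the file-name prefix. As a minimal sketch of an alternative, scikit-learn's `LabelEncoder` builds the same kind of name-to-id mapping, assigns ids in sorted label order, and keeps the inverse mapping available. The file names below are hypothetical and only illustrate the parsing rule.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical file names, used only to illustrate the prefix rule
example_filenames = ["cbr_01-a.txt", "ilp-02.txt", "ir_03-b.txt", "cbr-04.txt"]

# Same rule as in the cell above: keep the text before the first '-' or '_'
example_classes = [name.split("-")[0].split("_")[0] for name in example_filenames]

# LabelEncoder sorts the distinct labels and maps them to 0..n_classes-1
encoder = LabelEncoder()
example_y = encoder.fit_transform(example_classes)

print(list(encoder.classes_))  # ['cbr', 'ilp', 'ir']
print(example_y.tolist())      # [0, 1, 2, 0]
```

`encoder.inverse_transform` turns predicted ids back into class names, which can make later error analysis easier to read.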

## Universal Sentence Encoder

### Vectorize the documents

In [4]:
# The encoder is used only as a fixed feature extractor, so it does not need to be trainable
embed = hub.KerasLayer(
    'https://tfhub.dev/google/universal-sentence-encoder/4',
    dtype=tf.string,
    trainable=False
)

use_corpus = []
for doc in tqdm(corpus):
    tokenized_doc = sent_tokenize(doc)
    embedded_doc = embed(tokenized_doc)
    # The approach here is to take the mean of the vectors of the sentences of the doc
    use_corpus.append(np.mean(embedded_doc, axis=0))

X = np.array(use_corpus)
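
The pooling step above can be illustrated on toy data. This is a minimal sketch in which a small hand-written array stands in for the encoder output (the real Universal Sentence Encoder returns one 512-dimensional vector per sentence); the array values are made up for illustration.

```python
import numpy as np

# Toy stand-in for the encoder output: 3 sentences, 4 dimensions each
sentence_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
])

# Averaging over axis 0 collapses the sentence axis into one document vector
doc_vector = sentence_vectors.mean(axis=0)
print(doc_vector)  # [2. 1. 2. 2.]
```

Mean pooling gives every sentence the same weight regardless of its length, and it yields a fixed-size vector no matter how many sentences a document contains.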

### Split training and test data

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
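
The split above is random and unseeded, so the reported scores can vary between runs. A minimal sketch of a reproducible, class-balanced variant follows, assuming a fixed `random_state` and `stratify` are acceptable here; the synthetic arrays and the seed value are illustrative and not part of the original experiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 8 features, 5 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.repeat(np.arange(5), 20)

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(np.bincount(y_te))  # [4 4 4 4 4]
```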

### Train model

In [6]:
%%time
lr = LogisticRegressionCV(cv=5, max_iter=10000).fit(X_train, y_train)

CPU times: user 17.7 s, sys: 4.85 s, total: 22.6 s
Wall time: 5.8 s


### Evaluate

In [7]:
# Training data
accuracy_score(y_train, lr.predict(X_train))

1.0

In [8]:
# Test data
accuracy_score(y_test, lr.predict(X_test))

0.9416058394160584
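
A single accuracy number hides which classes get confused with each other. As a minimal sketch, a confusion matrix and per-class recall can be computed as below; the label arrays are made up for illustration, and in the notebook they would be `y_test` and `lr.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for three classes
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Recall per class: correct predictions divided by the true count of that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
print(accuracy_score(y_true_demo, y_pred_demo))
```

In this toy case class 1 has a recall of 0.5 because one of its two samples is predicted as class 2, while the overall accuracy is 5/6.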

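On this split the classifier reaches about 94% accuracy on the held-out test set (129 of 137 documents) and 100% on the training set. Mean-pooled Universal Sentence Encoder vectors are therefore informative enough for a linear classifier to separate the classes well, although the gap between training and test accuracy suggests some overfitting, and the figure comes from a single random split.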