# Semantic Features
I'm going to try adding semantic features: 
- vector representations of words
- vector representations of sentences

The first thing to try is GloVe embeddings for words. We'll compare them to the baseline.

In [20]:
import pandas as pd
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

import hyper.eval as evl

Read in the data

In [21]:
data = pd.read_csv("../data/processed.csv", sep="\t", dtype={"content": "string", "label": bool})
X = data["content"]
y = data["label"]

## Baseline

In [23]:
base_pipe = make_pipeline(
    CountVectorizer(), 
    LogisticRegression(max_iter=300)
)

In [40]:
res = evl.evaluate_algorithm(X, y, base_pipe)
res

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.5s remaining:    3.7s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.6s finished


{'fit_time': array([2.05309916, 2.48872709, 2.27170467, 2.42785668, 2.40229273]),
 'score_time': array([0.06390119, 0.04425859, 0.08256388, 0.04086161, 0.04082966]),
 'test_accuracy': array([0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023]),
 'test_precision': array([0.63043478, 0.63888889, 0.63888889, 0.78947368, 0.66666667]),
 'test_recall': array([0.60416667, 0.47916667, 0.47916667, 0.63829787, 0.46808511]),
 'test_f1': array([0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55      ])}

In [16]:
res["test_accuracy"].mean()

0.7317829457364341

These scores match those in the baseline notebook, so the results are reproducible, which is good.
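
Mean accuracy is only one view of these results. As a minimal sketch, the per-fold scores can be summarized for every metric at once. The `summarize` helper is a hypothetical name, not part of `hyper.eval`, and the numbers are copied from the baseline output above.

```python
from statistics import mean, stdev

# Per-fold baseline scores, copied from the cross-validation output above.
baseline_scores = {
    "test_accuracy": [0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023],
    "test_precision": [0.63043478, 0.63888889, 0.63888889, 0.78947368, 0.66666667],
    "test_recall": [0.60416667, 0.47916667, 0.47916667, 0.63829787, 0.46808511],
    "test_f1": [0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55],
}


def summarize(scores):
    """Return {metric: (mean, sample standard deviation)} across folds."""
    return {metric: (mean(vals), stdev(vals)) for metric, vals in scores.items()}


summary = summarize(baseline_scores)
for metric, (avg, sd) in summary.items():
    print(f"{metric:15s} {avg:.3f} +/- {sd:.3f}")
```

Looking at recall and F1 alongside accuracy makes it easier to see where a later model actually differs from the baseline.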

## GloVe embeddings
Using the zeugma library.

In [48]:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer("glove")

In [31]:
glove_pipeline = make_pipeline(
    glove,
    LogisticRegression(max_iter=300)
)

In [33]:
res = evl.evaluate_algorithm(X, y, glove_pipeline)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.1min remaining:  1.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.2min finished


In [34]:
res

{'fit_time': array([1.91023254, 1.98548007, 1.78395462, 1.71965981, 1.65102267]),
 'score_time': array([0.44476247, 0.4780736 , 0.50054312, 0.32994771, 0.29263854]),
 'test_accuracy': array([0.62015504, 0.68992248, 0.65116279, 0.7751938 , 0.74418605]),
 'test_precision': array([0.48275862, 0.65384615, 0.56521739, 0.82142857, 0.69444444]),
 'test_recall': array([0.29166667, 0.35416667, 0.27083333, 0.4893617 , 0.53191489]),
 'test_f1': array([0.36363636, 0.45945946, 0.36619718, 0.61333333, 0.60240964])}

In [9]:
res["test_accuracy"].mean()

0.696124031007752

So this is not as good as the baseline: mean accuracy drops from 0.732 to 0.696, and recall is lower in four of the five folds.
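
For intuition, a common way to turn word vectors into one vector per text is to average them. My understanding is that zeugma does something similar by default, but treat that as an assumption. The sketch below uses made-up 3-dimensional vectors and whitespace tokenization, both purely illustrative.

```python
# Toy 3-dimensional "embeddings" (made up); real pretrained vectors are much larger.
TOY_VECTORS = {
    "good": [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 0.0],
    "bad": [-1.0, 0.0, 2.0],
}
DIM = 3


def average_embedding(text, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        # No known words at all: fall back to a zero vector.
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each `col` holds one coordinate.
    return [sum(col) / len(known) for col in zip(*known)]


print(average_embedding("Good movie"))   # [0.5, 1.0, 1.0]
print(average_embedding("a bad movie"))  # [-0.5, 1.0, 1.0]
```

Averaging discards word order and gives every word the same weight, which may be part of why these features trail a plain bag of words here.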

## Building a transformer using spaCy embeddings

In [10]:
import spacy
from hyper.transformers import SpacyTransformer

In [4]:
spacy_model = spacy.load("en_core_web_md")

In [25]:
spacy_transformer = SpacyTransformer(spacy_model, "en_core_web_md")

In [26]:
space = make_pipeline(
    spacy_transformer,
    LogisticRegression(max_iter=300),
)

In [9]:
results = evl.evaluate_algorithm(X, y, space)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.1min remaining:  3.1min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.4min finished


In [11]:
results

{'fit_time': array([61.96490002, 64.77607632, 67.31588554, 75.45301342, 74.95549393]),
 'score_time': array([15.12033987, 14.11581182, 13.83320427,  8.49561477,  7.1852808 ]),
 'test_accuracy': array([0.74418605, 0.69767442, 0.72093023, 0.82170543, 0.7751938 ]),
 'test_precision': array([0.66666667, 0.64516129, 0.65      , 0.81578947, 0.78125   ]),
 'test_recall': array([0.625     , 0.41666667, 0.54166667, 0.65957447, 0.53191489]),
 'test_f1': array([0.64516129, 0.50632911, 0.59090909, 0.72941176, 0.63291139])}

In [12]:
results["test_accuracy"].mean()

0.751937984496124

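A bit better than the baseline (0.752 versus 0.732 mean accuracy) and much better than the GloVe embeddings (0.696). It is far slower, though: each fold takes over a minute to fit, compared with about two seconds for the other two.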
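
The code for `SpacyTransformer` lives in `hyper.transformers` and isn't shown in this notebook. As a rough sketch of how such a transformer could look (a hypothetical `DocVectorTransformer`, not the project's actual implementation), it only needs `fit` and `transform` and a callable that returns an object with a `.vector` attribute. A toy `FakeDoc` stands in for a loaded spaCy model so the sketch runs without one.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DocVectorTransformer(BaseEstimator, TransformerMixin):
    """Turn each text into one dense vector via a spaCy-style callable.

    Hypothetical stand-in, not the project's SpacyTransformer.
    """

    def __init__(self, nlp):
        # `nlp` is any callable whose result has a `.vector` attribute,
        # for example the object returned by spacy.load("en_core_web_md").
        self.nlp = nlp

    def fit(self, X, y=None):
        # Pretrained vectors need no fitting; return self for pipeline use.
        return self

    def transform(self, X):
        # One row per document, one column per embedding dimension.
        return np.vstack([self.nlp(text).vector for text in X])


class FakeDoc:
    """Toy document so the sketch runs without a spaCy model installed."""

    def __init__(self, text):
        # Made-up 2-dimensional "embedding": character count and word count.
        self.vector = np.array([len(text), len(text.split())], dtype=float)


features = DocVectorTransformer(FakeDoc).fit_transform(["a short text", "another one"])
print(features.shape)  # (2, 2)
```

With a real model, `Doc.vector` defaults to the average of the token vectors, so this is the same pooling idea as before with different underlying vectors.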

## Feature union

In [19]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [27]:
pipeline = Pipeline([
    ("union", FeatureUnion(
        transformer_list=[
            ("base_vectorizer", CountVectorizer()),
            ("spacy_vectorizer", spacy_transformer),
        ],
    )),
    ("log_reg", LogisticRegression(max_iter=300)),
])

In [37]:
results = evl.evaluate_algorithm(X, y, pipeline)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.2min remaining:  3.4min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.4min finished


In [42]:
res

{'fit_time': array([2.05309916, 2.48872709, 2.27170467, 2.42785668, 2.40229273]),
 'score_time': array([0.06390119, 0.04425859, 0.08256388, 0.04086161, 0.04082966]),
 'test_accuracy': array([0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023]),
 'test_precision': array([0.63043478, 0.63888889, 0.63888889, 0.78947368, 0.66666667]),
 'test_recall': array([0.60416667, 0.47916667, 0.47916667, 0.63829787, 0.46808511]),
 'test_f1': array([0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55      ])}

In [58]:
res["test_accuracy"].mean()

0.7317829457364341

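This isn't working as intended: the output is exactly the same as for the base classifier, down to the fit times. The cause is that the two cells above display `res`, which still holds the baseline results, instead of `results`, which holds the feature-union run. So the score for this union is not actually shown here. Differing output shapes shouldn't be the problem, since `FeatureUnion` concatenates the transformer outputs column-wise and only needs them to have the same number of rows.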
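
The shape question can be checked directly on toy data. This is a small sketch, assuming made-up documents and a hypothetical `text_stats` function standing in for the dense embedding transformer. It combines a sparse `CountVectorizer` output with a dense two-column output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def text_stats(texts):
    """Hypothetical dense features: character count and word count per text."""
    return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


docs = ["the cat sat", "the dog barked loudly", "a bird"]

union = FeatureUnion([
    ("counts", CountVectorizer()),               # sparse, one column per vocabulary word
    ("stats", FunctionTransformer(text_stats)),  # dense, two columns
])
features = union.fit_transform(docs)

# Look up the fitted vectorizer to see how many columns it contributed.
n_vocab = len(dict(union.transformer_list)["counts"].vocabulary_)
print(features.shape, n_vocab)
```

The combined matrix has one row per document and `n_vocab + 2` columns, so transformers with different output widths can be combined without any extra work.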

In [49]:
union_transformer = FeatureUnion(
        transformer_list=[
            ("glove_vectorizer", glove),
            ("spacy_vectorizer", spacy_transformer),
        ],
    )

In [52]:
pipeline = make_pipeline(
    union_transformer,
    LogisticRegression()
)

In [53]:
results = evl.evaluate_algorithm(X, y, pipeline, verbosity=0)

In [57]:
results

{'fit_time': array([75.09200478, 77.23983812, 75.42570567, 78.75791216, 72.82453895]),
 'score_time': array([21.33738518, 18.13628912, 17.34909534,  9.07141471,  8.17491913]),
 'test_accuracy': array([0.72868217, 0.70542636, 0.68992248, 0.82945736, 0.78294574]),
 'test_precision': array([0.65116279, 0.67857143, 0.6       , 0.85714286, 0.77142857]),
 'test_recall': array([0.58333333, 0.39583333, 0.5       , 0.63829787, 0.57446809]),
 'test_f1': array([0.61538462, 0.5       , 0.54545455, 0.73170732, 0.65853659])}

In [56]:
results["test_accuracy"].mean()

0.7472868217054264

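This is slightly worse than using only the spaCy vectorizer (0.747 versus 0.752 mean accuracy), so the GloVe vectors aren't really adding anything for me. Note that this run uses `LogisticRegression()` with the default `max_iter` rather than `max_iter=300`, so the comparison is not strictly like for like.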

## Flair transformer

In [24]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from flair.data import Sentence
from hyper.transformers import FlairTransformer

In [9]:
glove_embedding = WordEmbeddings("glove")
document_embeddings = DocumentPoolEmbeddings([glove_embedding])

In [15]:
f_transformer = FlairTransformer(document_embeddings)

In [25]:
pipe = make_pipeline(
    f_transformer, 
    LogisticRegression(max_iter=300)
)

In [26]:
results = evl.evaluate_algorithm(X, y, pipe)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   37.1s remaining:   55.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   41.9s finished


In [28]:
results["test_accuracy"].mean()

0.7348837209302326

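Compared to the zeugma GloVe embeddings, these are much better (0.735 versus 0.696 mean accuracy), but they're not quite as good as the spaCy embeddings (0.752). They're much faster than the spaCy embeddings though: the full cross-validation run took about 42 seconds instead of 2.4 minutes.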
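
To put the runs side by side, here is a small sketch that ranks the mean accuracies reported in this notebook. The labels are my own shorthand. The `CountVectorizer` plus spaCy union is left out because its score was never displayed.

```python
# Mean test accuracies copied from the outputs in this notebook.
mean_accuracy = {
    "CountVectorizer baseline": 0.7317829457364341,
    "GloVe (zeugma)": 0.696124031007752,
    "spaCy en_core_web_md": 0.751937984496124,
    "GloVe (zeugma) + spaCy union": 0.7472868217054264,
    "GloVe (Flair, pooled)": 0.7348837209302326,
}

baseline = mean_accuracy["CountVectorizer baseline"]

# Sort from best to worst and show the difference from the baseline.
ranking = sorted(mean_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:30s} {acc:.3f} ({acc - baseline:+.3f} vs. baseline)")
```

The differences between the top few are a couple of percentage points at most on five folds, so the ordering should be read with some caution.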