# Semantic Features
I'm going to try adding semantic features: 
- vector representations of words
- vector representations of sentences

The first thing to try is GloVe embeddings for words, compared against the baseline.

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

import hyper.eval as evl

Read in the data (tab-separated, despite the `.csv` extension).

In [2]:
data = pd.read_csv("../data/processed.csv", sep="\t", dtype={"content": "string", "label": bool})
X = data["content"]
y = data["label"]

## Baseline

In [13]:
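# Bag-of-words counts fed into logistic regression
base_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=300)
)
# Cross-validate with the project's helper (5 folds, judging by the output)
res = evl.evaluate_algorithm(X, y, base_pipe)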

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.6s remaining:    2.4s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.1s finished


In [14]:
res

{'fit_time': array([1.22697282, 1.41055989, 1.29889274, 1.45595884, 1.34704542]),
 'score_time': array([0.07224655, 0.04132438, 0.05688477, 0.07994914, 0.0738039 ]),
 'test_accuracy': array([0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023]),
 'test_precision': array([0.63043478, 0.63888889, 0.63888889, 0.78947368, 0.66666667]),
 'test_recall': array([0.60416667, 0.47916667, 0.47916667, 0.63829787, 0.46808511]),
 'test_f1': array([0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55      ])}
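
The `hyper.eval` helper is not shown in this notebook. Judging only by the keys in the result and the 5-fold parallel log, it appears to behave like scikit-learn's `cross_validate` with four scorers. The sketch below is written under that assumption; the name `evaluate_algorithm_sketch` and its parameters are hypothetical, not the project's actual code.

```python
from sklearn.model_selection import cross_validate

# Hypothetical stand-in for hyper.eval.evaluate_algorithm (assumed behaviour).
SCORING = ["accuracy", "precision", "recall", "f1"]


def evaluate_algorithm_sketch(X, y, pipeline, cv=5, n_jobs=-1):
    """Cross-validate `pipeline` and return per-fold timings and scores."""
    return cross_validate(
        pipeline,
        X,
        y,
        cv=cv,            # an integer cv means stratified k-fold for classifiers
        scoring=SCORING,  # yields test_accuracy, test_precision, test_recall, test_f1
        n_jobs=n_jobs,    # -1 uses all available cores
        verbose=1,        # prints the [Parallel(...)] progress lines
    )
```

Called as `evaluate_algorithm_sketch(X, y, base_pipe)`, this returns a dict of NumPy arrays with one entry per fold, the same shape as the result shown above.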

In [16]:
res["test_accuracy"].mean()

0.7317829457364341

These scores match those in the baseline notebook, so the results are reproducible.

## GloVe embeddings
Using the `EmbeddingTransformer` from zeugma.

In [27]:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer("glove")
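
For intuition about what an embedding transformer hands to the classifier, here is a minimal sketch that assumes each text is represented by the average of its word vectors. The toy vocabulary, `average_embedding` and `embed_corpus` are made up for illustration; this is not zeugma's implementation.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real GloVe vectors have 25 to 300 dimensions.
TOY_VECTORS = {
    "good": np.array([1.0, 0.0, 0.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
    "news": np.array([0.0, 1.0, 0.0]),
}
DIM = 3


def average_embedding(text, vectors=TOY_VECTORS, dim=DIM):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def embed_corpus(texts):
    """Stack one averaged vector per text into an (n_texts, dim) matrix."""
    return np.vstack([average_embedding(t) for t in texts])
```

Averaging discards word order and lets frequent, uninformative words dilute the signal, which is one plausible reason a plain bag-of-words model can still compete on a small dataset.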

In [31]:
glove_pipeline = make_pipeline(
    glove,
    LogisticRegression(max_iter=300)
)

In [33]:
res = evl.evaluate_algorithm(X, y, glove_pipeline)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.1min remaining:  1.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.2min finished


In [34]:
res

{'fit_time': array([1.91023254, 1.98548007, 1.78395462, 1.71965981, 1.65102267]),
 'score_time': array([0.44476247, 0.4780736 , 0.50054312, 0.32994771, 0.29263854]),
 'test_accuracy': array([0.62015504, 0.68992248, 0.65116279, 0.7751938 , 0.74418605]),
 'test_precision': array([0.48275862, 0.65384615, 0.56521739, 0.82142857, 0.69444444]),
 'test_recall': array([0.29166667, 0.35416667, 0.27083333, 0.4893617 , 0.53191489]),
 'test_f1': array([0.36363636, 0.45945946, 0.36619718, 0.61333333, 0.60240964])}

In [9]:
res["test_accuracy"].mean()

0.696124031007752

So this is not as good as the baseline: mean accuracy is 0.696 versus 0.732, and recall is lower on four of the five folds.

## Building a transformer using spaCy embeddings

In [10]:
import spacy
from hyper.spacy_transformer import SpacyTransformer

In [4]:
spacy_model = spacy.load("en_core_web_md")

In [5]:
transformer = SpacyTransformer(spacy_model, "en_core_web_md")
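
The source of `hyper.spacy_transformer.SpacyTransformer` is not included here. A scikit-learn compatible transformer along these lines could look like the sketch below, assuming it uses spaCy's `Doc.vector` (by default the average of the token vectors) as the feature vector for each text. `DocVectorTransformer` is a hypothetical name and takes a single argument, unlike the two-argument constructor used above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DocVectorTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: turn each text into its document vector."""

    def __init__(self, nlp):
        # `nlp` is any callable returning an object with a `.vector`
        # attribute, e.g. a loaded spaCy pipeline.
        self.nlp = nlp

    def fit(self, X, y=None):
        # The vectors are pretrained, so there is nothing to learn.
        return self

    def transform(self, X):
        # One row per text, one column per embedding dimension.
        return np.vstack([self.nlp(str(text)).vector for text in X])
```

With a real model this would be used as `DocVectorTransformer(spacy.load("en_core_web_md"))` inside `make_pipeline`. Running the full spaCy pipeline on every text is expensive, so disabling unused components is worth considering if only the vectors are needed.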

In [8]:
space = make_pipeline(
    transformer,
    LogisticRegression(max_iter=300),
)

In [9]:
results = evl.evaluate_algorithm(X, y, space)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  2.1min remaining:  3.1min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.4min finished


In [11]:
results

{'fit_time': array([61.96490002, 64.77607632, 67.31588554, 75.45301342, 74.95549393]),
 'score_time': array([15.12033987, 14.11581182, 13.83320427,  8.49561477,  7.1852808 ]),
 'test_accuracy': array([0.74418605, 0.69767442, 0.72093023, 0.82170543, 0.7751938 ]),
 'test_precision': array([0.66666667, 0.64516129, 0.65      , 0.81578947, 0.78125   ]),
 'test_recall': array([0.625     , 0.41666667, 0.54166667, 0.65957447, 0.53191489]),
 'test_f1': array([0.64516129, 0.50632911, 0.59090909, 0.72941176, 0.63291139])}

In [12]:
results["test_accuracy"].mean()

0.751937984496124

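At 0.752 mean accuracy, this is a bit better than the baseline (0.732) and clearly better than the GloVe embeddings (0.696). It is also far slower to fit: roughly 62 to 75 s per fold, versus under 2 s for the other two pipelines.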
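
To compare the three runs side by side, the per-fold scores printed above can be collected and summarised as mean and standard deviation. This is a small sketch; the `runs` dict and the `summarize` helper are illustrative names, and the numbers are copied from the outputs in this notebook.

```python
import numpy as np

# Per-fold scores copied from the cell outputs above.
runs = {
    "baseline": {
        "test_accuracy": [0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023],
        "test_f1": [0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55],
    },
    "glove": {
        "test_accuracy": [0.62015504, 0.68992248, 0.65116279, 0.7751938, 0.74418605],
        "test_f1": [0.36363636, 0.45945946, 0.36619718, 0.61333333, 0.60240964],
    },
    "spacy": {
        "test_accuracy": [0.74418605, 0.69767442, 0.72093023, 0.82170543, 0.7751938],
        "test_f1": [0.64516129, 0.50632911, 0.59090909, 0.72941176, 0.63291139],
    },
}


def summarize(runs, metrics=("test_accuracy", "test_f1")):
    """Return {name: {metric: (mean, std)}} computed across folds."""
    return {
        name: {m: (float(np.mean(res[m])), float(np.std(res[m]))) for m in metrics}
        for name, res in runs.items()
    }


summary = summarize(runs)
for name, stats in summary.items():
    acc_mean, acc_std = stats["test_accuracy"]
    f1_mean, f1_std = stats["test_f1"]
    print(f"{name:<10} accuracy {acc_mean:.3f} +/- {acc_std:.3f}   f1 {f1_mean:.3f} +/- {f1_std:.3f}")
```

Reporting the spread alongside the mean helps judge whether a gap of a couple of points between pipelines is meaningful with only five folds.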