# Semantic Features
I'm going to try adding semantic features: 
- vector representations of words
- vector representations of sentences

The first thing to try is GloVe embeddings for words. We'll compare them to the baseline.

In [1]:
import pandas as pd

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

import hyper.eval as evl

import importlib
importlib.reload(evl);

Read in the data

In [2]:
data = pd.read_csv("../data/processed.csv", sep="\t", dtype={"content": "string", "label": bool})
X = data["content"]
y = data["label"]

## Baseline

In [3]:
# Baseline: bag-of-words counts fed into logistic regression
base_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=300)
)
res = evl.evaluate_algorithm(X, y, base_pipe)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.5s remaining:    3.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.9s finished


In [4]:
res

{'fit_time': array([1.16859889, 1.73637605, 1.68676209, 1.54384494, 1.57680845]),
 'score_time': array([0.11799264, 0.04099298, 0.04725146, 0.07476759, 0.04481411]),
 'test_accuracy': array([0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023]),
 'test_precision': array([0.63043478, 0.63888889, 0.63888889, 0.78947368, 0.66666667]),
 'test_recall': array([0.60416667, 0.47916667, 0.47916667, 0.63829787, 0.46808511]),
 'test_f1': array([0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55      ])}

These scores match the ones in the baseline notebook, so the evaluation is reproducible across notebooks.
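`hyper.eval` is not shown in this notebook. The keys in the result above look like the output of scikit-learn's `cross_validate` over five folds, so a rough stand-in might look like the sketch below. The function name `evaluate_sketch`, the scoring list, the shuffled stratified split and the seed are all assumptions, and the corpus is made up only to keep the example runnable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

SCORING = ["accuracy", "precision", "recall", "f1"]


def evaluate_sketch(X, y, estimator, seed=0):
    """Hypothetical stand-in for evl.evaluate_algorithm (an assumption)."""
    # A fixed random_state pins the fold assignment, so reruns score identically.
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_validate(estimator, X, y, cv=folds, scoring=SCORING)


# Tiny made-up corpus, only here to make the sketch runnable.
texts = [f"shocking outrage scandal number{i}" for i in range(10)]
texts += [f"council budget meeting number{i}" for i in range(10)]
labels = [True] * 10 + [False] * 10

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=300))
first = evaluate_sketch(texts, labels, pipe)
second = evaluate_sketch(texts, labels, pipe)

# Same data, same folds, same model: the per-fold scores should agree.
print((first["test_accuracy"] == second["test_accuracy"]).all())
```

If the real helper uses unseeded shuffling, scores would drift between runs, so matching numbers here suggest the split is deterministic.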

## GloVe embeddings
Loaded with the `zeugma` library.

In [5]:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer("glove")

In [6]:
pipeline = make_pipeline(
    glove,
    LogisticRegression(max_iter=300)
)
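For intuition about what the embedding step hands to the classifier: a transformer of this kind turns each document into one fixed-length vector, commonly by averaging the vectors of its words. The sketch below shows that idea with a made-up three-dimensional vocabulary. The toy vectors, the `embed_text` helper and the whitespace tokenisation are assumptions for illustration, not zeugma's actual code.

```python
from typing import Dict, List

# Toy 3-dimensional "embeddings"; the values are made up for illustration.
TOY_VECTORS: Dict[str, List[float]] = {
    "fake": [1.0, 0.0, 0.0],
    "news": [0.0, 1.0, 0.0],
    "today": [0.0, 0.0, 1.0],
}


def embed_text(text: str, vectors: Dict[str, List[float]] = TOY_VECTORS,
               dim: int = 3) -> List[float]:
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        # No known word at all: fall back to a zero vector.
        return [0.0] * dim
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(col) / len(known) for col in zip(*known)]


doc_vector = embed_text("Fake news spreads fast")
print(doc_vector)  # [0.5, 0.5, 0.0]
```

Only "fake" and "news" are in the toy vocabulary, so the result is their mean. Word order and the identity of individual words are lost in the average, which is worth keeping in mind when comparing against a bag-of-words model.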

In [7]:
res = evl.evaluate_algorithm(X, y, pipeline)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.1min remaining:  1.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.2min finished


In [8]:
res

{'fit_time': array([1.56632805, 1.79148555, 1.66049409, 1.70820832, 1.51508212]),
 'score_time': array([0.51482701, 0.54885721, 0.49661326, 0.3706882 , 0.28770113]),
 'test_accuracy': array([0.62015504, 0.68992248, 0.65116279, 0.7751938 , 0.74418605]),
 'test_precision': array([0.48275862, 0.65384615, 0.56521739, 0.82142857, 0.69444444]),
 'test_recall': array([0.29166667, 0.35416667, 0.27083333, 0.4893617 , 0.53191489]),
 'test_f1': array([0.36363636, 0.45945946, 0.36619718, 0.61333333, 0.60240964])}

In [9]:
res["test_accuracy"].mean()

0.696124031007752

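So this is not as good as the baseline: mean accuracy is about 0.696 here, against about 0.732 for the baseline folds above, and recall and F1 are lower in most folds as well.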
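To make the comparison less eyeball-driven, the per-fold scores can be reduced to one mean per metric. The sketch below copies the accuracy and F1 arrays printed earlier in this notebook into plain lists; the `summarize` helper is a hypothetical name, not part of `hyper.eval`.

```python
from statistics import mean

# Per-fold scores copied from the outputs of cells 4 and 8.
baseline = {
    "test_accuracy": [0.72093023, 0.70542636, 0.70542636, 0.80620155, 0.72093023],
    "test_f1": [0.61702128, 0.54761905, 0.54761905, 0.70588235, 0.55],
}
glove_res = {
    "test_accuracy": [0.62015504, 0.68992248, 0.65116279, 0.7751938, 0.74418605],
    "test_f1": [0.36363636, 0.45945946, 0.36619718, 0.61333333, 0.60240964],
}


def summarize(res):
    """Mean of each metric across folds, rounded to three decimals."""
    return {metric: round(mean(scores), 3) for metric, scores in res.items()}


summary = {"baseline": summarize(baseline), "glove": summarize(glove_res)}
for name, means in summary.items():
    print(name, means)
```

With five folds the means are noisy, so looking at the fold-by-fold spread alongside them is probably wise before drawing firm conclusions.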