# TF-IDF Encoder

This notebook trains a TF-IDF encoder.

In particular, we stack a TF-IDF model and a TruncatedSVD model, so that we end up with dense vector representations instead of sparse vectors.
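
As a minimal sketch of what this stacking does, the snippet below compares the output of a plain `TfidfVectorizer` with the output of a `TfidfVectorizer` followed by `TruncatedSVD`. The toy corpus and the choice of `n_components=2` are assumptions made for illustration; they are not the data or settings used in this notebook.

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration only.
docs = [
    "premier league sells television rights",
    "the band wrote one vision",
    "valencia exports fruit from its port",
    "the league rights are sold collectively",
]

# TF-IDF alone yields a sparse matrix with one column per vocabulary term.
sparse_vectors = TfidfVectorizer().fit_transform(docs)

# Stacking TruncatedSVD on top projects each document onto a few dense components.
encoder = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
)
dense_vectors = encoder.fit_transform(docs)

print(issparse(sparse_vectors), sparse_vectors.shape)
print(type(dense_vectors).__name__, dense_vectors.shape)
```

The first matrix grows with the vocabulary, while the second always has `n_components` columns, which is what makes it convenient as a fixed-size embedding.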

In [1]:
import numpy as np
import pandas as pd

from pathlib import Path

In [2]:
data_path = Path.cwd().resolve().parent / "data"
encoders_path = Path.cwd().resolve().parent / "models" / "encoders"
data_filepath = data_path / "train_query_context_pairs.csv"

## Load the data

In [3]:
data = pd.read_csv(data_filepath)
data

Unnamed: 0,question,context,question_id,context_id
0,Do European Leagues sell their television righ...,The Premier League sells its television rights...,c3d337ab68dfd285f559ebc0daf65125,98d6e3c8d58561cff931f63fb4e64c1c
1,"What does the Catholic church considered ""mixe...",Between the third and fourth sessions the pope...,b03fc4a34dda7d1cfb4e9640aec30d39,4bcb9c7951bfad7dc475dc0a8364b86d
2,What are some of the practices Gautama underwe...,Gautama first went to study with famous religi...,12397967175462937d614d633fb6a8b0,70e382792af20b3772cce2520b45da5e
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to L...",77307b73836b34d59721596af121cd2f,5e450e68f649328e03bace871b873fee
4,When did the federation have to be implemented...,"After Nasser died in November 1970, his succes...",1a6850f01a96afabb91c24cdb3edcdee,0409ff54cef43157e6e7c88803e8590a
...,...,...,...,...
16030,What did previous religious orders do for a li...,Dominic sought to establish a new kind of orde...,a40c4eecb8ecad6ecc8dfcaca8086bea,a3cde9cea3bc2f90da8113dd41bebcf2
16031,What is an opening through the head called?,The salivary glands (element 30 in numbered di...,86d8ede802e28f0b80ebe11fda975738,1917672db4036274de7928953a7b8171
16032,What type of fruit is exported from Valencia?,Valencia's port is the biggest on the Mediterr...,671d51402eed02042c143f2fee28180b,54782fdb0abdd6ae102990368462460d
16033,How many listed buildings are present in the B...,"The early port settlement of Plymouth, called ...",1ea2e9373b6f91276133004d7e62ffe5,29f013960b13821473a6f7a4c125670f


### Text preprocessing

TF-IDF works better on normalized text, so we apply:
- Lowercasing
- Accent normalization
- Stemming
- Stopword removal

In [4]:
import nltk
import unicodedata
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

tokenizer = RegexpTokenizer(r"\w+")
ps = PorterStemmer()
english_stopwords = stopwords.words('english')

def text_preprocessing(text):
    # lowercasing
    text = text.lower()
    # accents normalization
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # stemming
    tokens = [ps.stem(token) for token in tokenizer.tokenize(text)]
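    # stopwords removal (runs after stemming, so stemmed forms that are no longer
    # in the stopword list, e.g. "doe" and "hi", are kept)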
    tokens = [token for token in tokens if token not in english_stopwords]
    return " ".join(tokens)

text_preprocessing("Do 'European Leagues' blá blá blá?")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joao.barroca/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joao.barroca/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'european leagu bla bla bla'
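
The accent normalization step can be looked at in isolation using only the standard library. This is a small sketch; the helper name `strip_accents` is hypothetical and simply wraps the two lines used in `text_preprocessing`.

```python
import unicodedata

def strip_accents(text):
    # NFD splits each accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents, so drop those.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# "á" becomes two code points after NFD: "a" followed by a combining acute accent.
print(len(unicodedata.normalize("NFD", "á")))
print(strip_accents("blá São Tomé"))
```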

In [5]:
data["p_question"] = data["question"].apply(text_preprocessing)
data["p_context"] = data["context"].apply(text_preprocessing)
data

Unnamed: 0,question,context,question_id,context_id,p_question,p_context
0,Do European Leagues sell their television righ...,The Premier League sells its television rights...,c3d337ab68dfd285f559ebc0daf65125,98d6e3c8d58561cff931f63fb4e64c1c,european leagu sell televis right per collect ...,premier leagu sell televis right collect basi ...
1,"What does the Catholic church considered ""mixe...",Between the third and fourth sessions the pope...,b03fc4a34dda7d1cfb4e9640aec30d39,4bcb9c7951bfad7dc475dc0a8364b86d,doe cathol church consid mix mix marriag,third fourth session pope announc reform area ...
2,What are some of the practices Gautama underwe...,Gautama first went to study with famous religi...,12397967175462937d614d633fb6a8b0,70e382792af20b3772cce2520b45da5e,practic gautama underw hi quest,gautama first went studi famou religi teacher ...
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to L...",77307b73836b34d59721596af121cd2f,5e450e68f649328e03bace871b873fee,mani band member wrote queen one vision,band revitalis respons live aid shot arm roger...
4,When did the federation have to be implemented...,"After Nasser died in November 1970, his succes...",1a6850f01a96afabb91c24cdb3edcdee,0409ff54cef43157e6e7c88803e8590a,feder implement,nasser die novemb 1970 hi successor anwar sada...
...,...,...,...,...,...,...
16030,What did previous religious orders do for a li...,Dominic sought to establish a new kind of orde...,a40c4eecb8ecad6ecc8dfcaca8086bea,a3cde9cea3bc2f90da8113dd41bebcf2,previou religi order live,domin sought establish new kind order one woul...
16031,What is an opening through the head called?,The salivary glands (element 30 in numbered di...,86d8ede802e28f0b80ebe11fda975738,1917672db4036274de7928953a7b8171,open head call,salivari gland element 30 number diagram insec...
16032,What type of fruit is exported from Valencia?,Valencia's port is the biggest on the Mediterr...,671d51402eed02042c143f2fee28180b,54782fdb0abdd6ae102990368462460d,type fruit export valencia,valencia port biggest mediterranean western co...
16033,How many listed buildings are present in the B...,"The early port settlement of Plymouth, called ...",1ea2e9373b6f91276133004d7e62ffe5,29f013960b13821473a6f7a4c125670f,mani list build present barbican area,earli port settlement plymouth call sutton app...


## Metrics

We evaluate the encoders with a contrastive loss that is very similar to [MultipleNegativesRankingLoss](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

We first compute a pairwise similarity matrix: the cosine similarity of each query vector against each context vector. This gives an N×N matrix, where N is the total number of instances in the test data.

Then, we apply a row-wise softmax to turn each query's similarity scores into a probability distribution over the contexts.

Finally, we apply cross-entropy loss (log loss). The target is the identity matrix: each query should put all of its probability on its own context.
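
Written out, the three steps above amount to the following. The symbols $q_i$, $c_j$, $S$, $P$ and $\mathcal{L}$ are notation introduced here for illustration; they do not appear elsewhere in the notebook.

$$
S_{ij} = \cos(q_i, c_j), \qquad
P_{ij} = \frac{e^{S_{ij}}}{\sum_{k=1}^{N} e^{S_{ik}}}, \qquad
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log P_{ii}
$$

Here $q_i$ and $c_j$ are the embeddings of query $i$ and context $j$, and only the diagonal entries $P_{ii}$ contribute because the target is the identity matrix.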

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import log_loss
from scipy.special import softmax

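def constrastive_loss(query_emb_matrix, context_emb_matrix):
    # (N, N) cosine similarities: row i holds query i against every context
    sim_matrix = cosine_similarity(query_emb_matrix, context_emb_matrix, dense_output=False)
    # row-wise softmax turns each query's scores into a distribution over contexts
    n_sim_matrix = softmax(sim_matrix, axis=1)
    N = sim_matrix.shape[0]
    # the matching context of query i sits at column i
    targets = np.arange(N)
    ce_loss = log_loss(y_pred=n_sim_matrix, y_true=targets)
    return ce_loss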
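
To get a feel for the scale of this loss, note that raw cosine similarities lie in [-1, 1] and no temperature is applied before the softmax, so the loss can only move within a narrow band. The sketch below is a NumPy-only illustration, not the function used in this notebook; the helper names `diagonal_cross_entropy` and `loss_bounds` are hypothetical. It computes the value for uninformative scores, `ln(N)`, and the value for the best achievable scores (+1 on the diagonal, -1 elsewhere), `ln(1 + (N - 1) * exp(-2))`.

```python
import numpy as np

def diagonal_cross_entropy(sim_matrix):
    # Row-wise log-softmax, then the mean negative log-probability of the diagonal.
    shifted = sim_matrix - sim_matrix.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def loss_bounds(n):
    # Closed forms for n pairs when scores are raw cosine similarities in [-1, 1].
    uniform = np.log(n)                          # all pairs equally similar
    best = np.log(1.0 + (n - 1) * np.exp(-2.0))  # +1 on the diagonal, -1 elsewhere
    return float(uniform), float(best)

# Check the closed forms numerically on a small matrix.
n = 5
uniform, best = loss_bounds(n)
assert np.isclose(diagonal_cross_entropy(np.zeros((n, n))), uniform)
assert np.isclose(diagonal_cross_entropy(2.0 * np.eye(n) - 1.0), best)

# Bounds for a split of 3207 pairs, the size of the test set in this notebook.
print(loss_bounds(3207))
```

For 3207 pairs these come to roughly 8.07 and 6.08, so differences between configurations are expected to be small in absolute terms.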

## Train test splits

Cross-validation would take a very long time to run, and therefore we stick with just a single train-test split.

In [7]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.2)
train_data.shape, test_data.shape

((12828, 6), (3207, 6))

## TF-IDF Encoder

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

In [9]:
X_train = train_data.loc[:, ["p_question", "p_context"]].values
X_test = test_data.loc[:, ["p_question", "p_context"]].values

X_train.shape, X_test.shape

((12828, 2), (3207, 2))

### Hyper-parameter tuning

In [10]:
from itertools import product

params = {
    "ngram_ranges": [
        (1, 2),  # unigrams + bigrams
    ],
    "max_df": [
        0.5,  # discard tokens that appear in more than 50% of the documents
    ],
    "min_df": [
        5,  # tokens need to be in at least 5 documents to be used
    ],
    "n_components": [512, 1024, 2048, 4096],  # final dimension of the embeddings after applying TruncatedSVD
}

combinations = product(
    params["ngram_ranges"],
    params["max_df"],
    params["min_df"],
    params["n_components"],
)
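
One detail worth keeping in mind is that `itertools.product` returns a one-shot iterator: once the experiment loop has consumed it, a second pass yields nothing unless the grid cell is re-run. The sketch below illustrates this with the same grid values; the variable names `grid`, `configs` and `one_shot` are assumptions made for the example.

```python
from itertools import product

# Same grid values as above, keyed by hypothetical names for this sketch.
grid = {
    "ngram_range": [(1, 2)],
    "max_df": [0.5],
    "min_df": [5],
    "n_components": [512, 1024, 2048, 4096],
}

# Materializing the product into a list makes it countable and reusable.
configs = [dict(zip(grid.keys(), combo)) for combo in product(*grid.values())]
print(len(configs))
print(configs[0])

# The raw iterator, by contrast, is exhausted after a single pass.
one_shot = product(*grid.values())
first_pass = list(one_shot)
second_pass = list(one_shot)
print(len(first_pass), len(second_pass))
```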

In [11]:
results = []
for ngram_range, max_df, min_df, n_components in combinations:

    results_obj = {
        "ngram_range": ngram_range,
        "max_df": max_df,
        "min_df": min_df,
        "n_components": n_components,
    }
    print(f"Running experiment for {results_obj}")

    vectorizer = TfidfVectorizer(
        lowercase=False,
        ngram_range=ngram_range,
        max_df=max_df,
        min_df=min_df,
    )

    truncated_svd = TruncatedSVD(
        n_components=n_components,
        random_state=42,
    )

    pipeline = make_pipeline(vectorizer, truncated_svd)
    results_obj["model"] = pipeline

    pipeline.fit(X_train.reshape(-1))

    query_emb_matrix = pipeline.transform(X_test[:, 0])
    context_emb_matrix = pipeline.transform(X_test[:, 1])

    loss = constrastive_loss(query_emb_matrix, context_emb_matrix)
    print(f"loss={loss}")

    results_obj["loss"] = loss
    results.append(results_obj)
    print(results_obj)

Running experiment for {'ngram_range': (1, 2), 'max_df': 0.5, 'min_df': 5, 'n_components': 512}
loss=7.688612948436394
{'ngram_range': (1, 2), 'max_df': 0.5, 'min_df': 5, 'n_components': 512, 'model': Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(lowercase=False, max_df=0.5, min_df=5,
                                 ngram_range=(1, 2))),
                ('truncatedsvd',
                 TruncatedSVD(n_components=512, random_state=42))]), 'loss': 7.688612948436394}
Running experiment for {'ngram_range': (1, 2), 'max_df': 0.5, 'min_df': 5, 'n_components': 1024}
loss=7.700708107732749
{'ngram_range': (1, 2), 'max_df': 0.5, 'min_df': 5, 'n_components': 1024, 'model': Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(lowercase=False, max_df=0.5, min_df=5,
                                 ngram_range=(1, 2))),
                ('truncatedsvd',
                 TruncatedSVD(n_components=1024, random_state=42))]), 'loss': 7.700708107732749}
Running exp

In [12]:
sorted(results, key=lambda x: x["loss"])

[{'ngram_range': (1, 2),
  'max_df': 0.5,
  'min_df': 5,
  'n_components': 512,
  'model': Pipeline(steps=[('tfidfvectorizer',
                   TfidfVectorizer(lowercase=False, max_df=0.5, min_df=5,
                                   ngram_range=(1, 2))),
                  ('truncatedsvd',
                   TruncatedSVD(n_components=512, random_state=42))]),
  'loss': 7.688612948436394},
 {'ngram_range': (1, 2),
  'max_df': 0.5,
  'min_df': 5,
  'n_components': 1024,
  'model': Pipeline(steps=[('tfidfvectorizer',
                   TfidfVectorizer(lowercase=False, max_df=0.5, min_df=5,
                                   ngram_range=(1, 2))),
                  ('truncatedsvd',
                   TruncatedSVD(n_components=1024, random_state=42))]),
  'loss': 7.700708107732749},
 {'ngram_range': (1, 2),
  'max_df': 0.5,
  'min_df': 5,
  'n_components': 2048,
  'model': Pipeline(steps=[('tfidfvectorizer',
                   TfidfVectorizer(lowercase=False, max_df=0.5, min_df=5,
        

### Train best model on full data and save locally

In [13]:
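X = data.loc[:, ["p_question", "p_context"]].values

# refit the best configuration on questions and contexts from the full dataset
tfidf_encoder = sorted(results, key=lambda x: x["loss"])[0]["model"]
tfidf_encoder.fit(X.reshape(-1))

In [14]:
from joblib import dump

# make sure the target directory exists before saving
encoders_path.mkdir(parents=True, exist_ok=True)
dump(tfidf_encoder, Path(encoders_path, "tf-idf-encoder.joblib"))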

['/Users/joao.barroca/Desktop/projects/deus-use-case/models/encoders/tf-idf-encoder.joblib']
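
A saved encoder can later be loaded with `joblib.load` and used to rank contexts for a query. The sketch below is self-contained and uses a toy pipeline, a temporary directory and a hypothetical file name rather than the encoder saved above. It also assumes that texts were already passed through the same preprocessing, since the vectorizer is built with `lowercase=False`.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Toy, already-preprocessed texts standing in for p_context values.
contexts = [
    "premier leagu sell televis right collect basi",
    "valencia port biggest mediterranean export fruit",
    "band revitalis respons live aid wrote one vision",
]
toy_encoder = make_pipeline(
    TfidfVectorizer(lowercase=False),
    TruncatedSVD(n_components=3, random_state=42),
)
toy_encoder.fit(contexts)

# Round-trip through disk, as with the real encoder.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy-encoder.joblib"  # hypothetical file name
    dump(toy_encoder, toy_path)
    loaded = load(toy_path)

# Encode a preprocessed query and rank the contexts by cosine similarity.
query_vec = loaded.transform(["european leagu sell televis right"])
scores = cosine_similarity(query_vec, loaded.transform(contexts))[0]
best_idx = int(scores.argmax())
print(contexts[best_idx])
```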