# TF-IDF feature extraction

Feature engineering is a major focus of NLP research. Text cannot be fed directly into a machine learning model, so it first needs a numerical representation.

Bag-of-words (BOW) is a fairly old but still somewhat popular feature engineering method for text. In this approach, a vocabulary of all the unique words in the corpus is created, and each document is represented as a vector of word counts. This method ignores the order of words in the text but can still be effective for many NLP tasks. In fact, it is the method used for the n-gram analysis on the arXiv dataset in the previous notebook. A variant of BOW counts bigrams, or even longer sequences of consecutive words, rather than single words only.

Term frequency-inverse document frequency (TF-IDF) is a variation of BOW where instead of a simple count, a weight is used, representing the importance of words in a document. Words that appear frequently in a document and rarely in others are given a relatively higher weight compared to more common words.

The TF-IDF weight of a word $w$ taken from a document $d$ belonging to a set of documents is calculated as follows:

$$\mathrm{TF}(w, d) = \frac{\text{count of } w \text{ in } d}{\text{total number of words in } d}$$

$$\mathrm{IDF}(w) = \log\left(\frac{\text{total number of documents}}{\text{number of documents that contain } w}\right)$$

$$\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w)$$

where TF stands for term frequency within the document and IDF for inverse document frequency.
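
The calculation above can be worked through by hand on a tiny corpus. The sketch below is plain Python; the three tokenized documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, and the natural logarithm is an assumption since the text does not name a base.

```python
import math

# Toy tokenized corpus, made up for illustration.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def tf(word, doc):
    """Count of `word` in `doc` divided by the number of words in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing `word`).

    Assumes `word` occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)  # natural log (assumed base)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

score_the = tf_idf("the", docs[0], docs)  # in all 3 documents, so IDF = log(1) = 0
score_cat = tf_idf("cat", docs[0], docs)  # in 2 of 3 documents
score_dog = tf_idf("dog", docs[1], docs)  # in 1 of 3 documents

print(score_the, score_cat, score_dog)
```

The word that appears in every document gets a weight of zero, and the rarer a word is across documents, the higher its weight for the same term frequency.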

In this notebook, a `TfidfVectorizer` is fitted on the training data. The fitted vectorizer is then used to create feature vectors from the training, validation, and test data.
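
With its default settings, scikit-learn's `TfidfVectorizer` does not compute exactly the textbook calculation given above: it uses the raw count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit L2 length. The sketch below rebuilds those numbers by hand on a made-up toy corpus to show the correspondence; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran away"]

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2", sublinear_tf=False
weights = tfidf.fit_transform(docs).toarray()

# Rebuild the same weights step by step.
counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts (TF)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                             # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1                 # smoothed IDF
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

print(np.allclose(weights, manual))
```

One consequence of the smoothing is that a word occurring in every document still receives a small non-zero weight instead of exactly zero.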

Two versions of the TF-IDF vectorizer are trained: one that considers single words (unigrams) only, and another that considers both unigrams and bigrams.

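The maximum number of features is limited to 10,000, so each vectorizer keeps only the 10,000 terms with the highest frequency across the training corpus.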

In [6]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import os
import pandas as pd
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
THIS_MODEL_NAME = 'TFIDF'
VERSION = '1'

OUTPUT_DIR = '../data/wip/'+THIS_MODEL_NAME+'/'
os.makedirs(OUTPUT_DIR, exist_ok=True)  # create the output directory if it is missing

In [8]:
FILE = "../data/data.parquet.gzip"
data = pd.read_parquet(FILE, columns=['target','processed_docs'])

In [9]:
with open("../data/wip/train_idx.pkl", 'rb') as f:
    train_idx = pickle.load(f)
with open("../data/wip/val_idx.pkl", 'rb') as f:
    val_idx = pickle.load(f)
with open("../data/wip/test_idx.pkl", 'rb') as f:
    test_idx = pickle.load(f)

In [10]:
training_data = data.loc[train_idx]
validation_data = data.loc[val_idx]
testing_data = data.loc[test_idx]

In [11]:
del data

In [12]:
features = 'processed_docs'
target = 'target'

X_train_raw = training_data[features]
X_val_raw = validation_data[features]
X_test_raw = testing_data[features]

y_train = training_data[target]
y_val = validation_data[target]
y_test = testing_data[target]

In [14]:
ngram_ranges=[(1,1), (1,2)]
labels = ["X_train", "X_val", "X_test"]
Xsets = [X_train_raw, X_val_raw, X_test_raw]

for nr in ngram_ranges:
    nr_label = f"{nr[0]}_{nr[1]}"
    vectorizer = TfidfVectorizer(ngram_range=nr, max_features=10_000)
    # Fit on the training split only, so validation and test text do not shape the vocabulary.
    vectorizer.fit(X_train_raw)
    for label, X in zip(labels, Xsets):
        name = f"{THIS_MODEL_NAME}_{label}_{nr_label}"
        vectors = vectorizer.transform(X)  # sparse matrix, one row per document
        with open(OUTPUT_DIR + name + '.pkl', 'wb') as f:
            pickle.dump(vectors, f)
    print(f"Saved TF-IDF features for ngram_range={nr}")

In [24]:
labels = ["y_train", "y_val", "y_test"]
ysets = [y_train, y_val, y_test]

for label, y in zip(labels, ysets):
    name = THIS_MODEL_NAME+"_"+label+"_"
    with open(OUTPUT_DIR+name+'.pkl', 'wb') as f:
        pickle.dump(y, f)
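
A later notebook could read the saved features back with `pickle.load`. The sketch below is a self-contained round trip under stated assumptions: it writes a small stand-in sparse matrix to a temporary directory instead of touching `../data/wip/`, and it reuses the file naming pattern from the cells above purely for illustration.

```python
import os
import pickle
import tempfile

from scipy.sparse import csr_matrix

# Stand-in for one of the saved TF-IDF matrices (values made up).
demo_vectors = csr_matrix([[0.0, 0.5], [1.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same naming pattern as the features saved above; the directory is temporary.
    path = os.path.join(tmp_dir, "TFIDF_X_train_1_1.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo_vectors, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is still a sparse matrix with identical contents.
same = bool((restored.toarray() == demo_vectors.toarray()).all())
print(type(restored).__name__, same)
```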