# TF-IDF Predictor with Scikit-learn: Proof of Concept

**Term Frequency-Inverse Document Frequency** (TF-IDF) is an information retrieval (IR) statistic used to determine how important a word is to a document within a corpus. Simply put, we determine the TF-IDF of a word in a document by counting the occurrences of that word in the document, then scaling that count down according to the number of documents in which the word appears, so that words common to every document carry little weight. This process is repeated for each word in the document, resulting in a vector indicating the relevance of each word to that document. It is a simple but effective IR method whose core ideas date back to the 1970s, and it has stood the test of time.
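
As a rough illustration of the idea, the sketch below computes TF-IDF scores by hand for a toy corpus. It assumes the common textbook form, term count times `log(N / df)`; scikit-learn's `TfidfVectorizer` adds smoothing and normalization on top of this, so its values will differ. The corpus and the `tf_idf` helper are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (illustrative only): each document is a list of tokens.
docs = [
    ["bees", "like", "honey"],
    ["bears", "like", "honey"],
    ["humans", "like", "parks"],
]

def tf_idf(term, doc, corpus):
    """Term count in `doc`, scaled by log(N / document frequency)."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# "like" appears in every document, so it carries no weight;
# "bees" appears in only one document, so it is the most distinctive.
scores = {term: tf_idf(term, docs[0], docs) for term in docs[0]}
print(scores)
```

Words shared by every document score zero, which is what makes the resulting vectors useful for telling documents apart.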

The bread and butter of IR is the process of turning arbitrary-length documents into fixed-length vectors (other such methods include word embeddings, encoder-decoder models, etc.). Once we have vectors that abstractly represent the semantics of documents, we can compare them using metrics like cosine similarity, i.e. the dot product between two vectors after each has been scaled to unit length. This principle forms the basis of our project: *using TF-IDF, we map a small query and large documents into a high-dimensional vector space, then determine which documents are most relevant to the query*. In essence, it is a search engine.
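
To make the comparison step concrete, here is a minimal sketch of cosine similarity in NumPy. The vectors and the `cosine_similarity` helper are illustrative assumptions, not part of the project code. Because `TfidfVectorizer` L2-normalizes its output by default, a plain dot product between two TF-IDF vectors already equals their cosine similarity.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([0.5, 1.0])
doc_a = np.array([1.0, 0.0])
doc_b = np.array([0.0, 1.0])

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a, sim_b)

# For unit-length vectors, cosine similarity reduces to the plain dot product.
unit_query = query / np.linalg.norm(query)
same = bool(np.isclose(np.dot(unit_query, doc_b), sim_b))
print(same)
```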

In [25]:
# Update and import dependencies
!pip install -Uqr requirements.txt

# Basic packages
import importlib
from time import time
from pathlib import Path
from progress.bar import Bar
import io

# Data science/NLP packages
import numpy as np
import pandas as pd
pd.set_option("display.max_colwidth", None)

# AWS packages
import boto3

# Local modules
import tfidf_predictor, utils_tfidf, train_tfidf
for m in [tfidf_predictor, utils_tfidf, train_tfidf]:
    importlib.reload(m)

from tfidf_predictor import VectorSimilarity, TfidfPredictor
from train_tfidf import combine_dfs
from utils_tfidf import get_data, get_corpus_labels



## Extending SKLearn: VectorSimilarity

In order to perform pairwise comparisons across vectors and find the most similar pairs, we implemented the `VectorSimilarity` class, extending SKLearn's `Estimator` interface. The fitting process of `VectorSimilarity` takes two arrays:

1. An array of float vectors, each with the same dimensionality
1. An array of labels, which can have any type or shape.

Note that both arrays must have equal length, and that the `i`th vector corresponds to the `i`th label for all `i`. Then, when we feed it an input vector, `VectorSimilarity` compares that vector against all of the vectors that it was fitted on, returning the corresponding labels of the `n_best` vectors. A "score" is also returned, which is just the dot product with each of the `n_best` vectors.

In [4]:
"""
Visual representation of vector locations

       /\
        b  ?
        |
        |
< c --------- a >
        |
        |
        d
       \/
"""
X = np.array(
    [[1, 0],
     [0, 1],
     [-1, 0],
     [0, -1]]
)
y = np.array(['a', 'b', 'c', 'd'])

sim_estimator = VectorSimilarity(n_best=10)
sim_estimator = sim_estimator.fit(X, y)

vec_input = np.array([0.5, 1]).reshape(1, -1) # Shape needs to be (1, n)
pred, score = sim_estimator.predict(vec_input)
print('Most similar vectors:\n', pred)
print('Confidence scores:\n', score)

Most similar vectors:
 [['b' 'a' 'c' 'd']]
Confidence scores:
 [[ 1.   0.5 -0.5 -1. ]]
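
The behavior shown above could be reproduced with something like the following sketch. This is not the project's `VectorSimilarity` implementation, which is not shown here; `SimpleVectorSimilarity` is a hypothetical stand-in that only assumes the fit/predict contract described in this section and builds on scikit-learn's `BaseEstimator`.

```python
import numpy as np
from sklearn.base import BaseEstimator

class SimpleVectorSimilarity(BaseEstimator):
    """Hypothetical stand-in for VectorSimilarity; not the project's actual code."""

    def __init__(self, n_best=10):
        self.n_best = n_best

    def fit(self, X, y):
        # Store the reference vectors and their labels side by side.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        # Dot product of every input row with every fitted vector.
        scores = np.asarray(X, dtype=float) @ self.X_.T
        # Indices of the highest-scoring fitted vectors, best first,
        # capped at the number of fitted vectors.
        n = min(self.n_best, self.X_.shape[0])
        order = np.argsort(-scores, axis=1, kind="stable")[:, :n]
        return self.y_[order], np.take_along_axis(scores, order, axis=1)

X = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]])
y = np.array(["a", "b", "c", "d"])

model = SimpleVectorSimilarity(n_best=10).fit(X, y)
pred, score = model.predict(np.array([[0.5, 1.0]]))
print(pred)
print(score)
```

Capping `n_best` at the number of fitted vectors is one way to explain why four results come back even though `n_best=10` was requested.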


## Extending SKLearn: TfidfPredictor

To make similarity predictions on natural language, we use the above `VectorSimilarity` class in tandem with SKLearn's `TfidfVectorizer`, chaining the two with SKLearn's `Pipeline` class. We then wrapped the pipeline in another custom `Estimator`, `TfidfPredictor`, to have more control over the inputs and outputs of the model.

The `TfidfVectorizer` takes in natural language documents and performs the following transformations:

1. The document is separated into word tokens that are alphabetical and at least 3 characters long.
1. Stop words such as articles, prepositions, and pronouns are removed.
1. Lemmatization is applied to reduce the inflected forms of each word to a common base form. For example, "walk", "walks" and "walking" all become "walk".
1. TF-IDF values are calculated for each token.
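
Steps 1, 2 and 4 can be approximated directly with `TfidfVectorizer` options, as in the sketch below. The stop-word list is a made-up placeholder rather than the list the project uses, and lemmatization (step 3) is left out; it would typically be supplied through the vectorizer's `tokenizer` argument.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Bees like to make honey",
    "Bears like to eat honey",
    "Humans are walking around the park",
]

# Illustrative stop-word list (an assumption, not the notebook's actual list).
stop_words = ["the", "are", "around"]

vectorizer = TfidfVectorizer(
    lowercase=True,
    token_pattern=r"(?u)\b[a-zA-Z]{3,}\b",  # alphabetical tokens, 3+ characters
    stop_words=stop_words,
)
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(tfidf_matrix.shape)
```

Note that "to" is dropped by the length rule alone, while "the", "are" and "around" are dropped by the stop-word list.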

In [3]:
basic_corpus = [
    'Bees like to make honey',
    'Bears like to eat honey',
    "Bees don't like bears",
    'Humans are walking around the park'
]
basic_labels = ['a', 'b', 'c', 'd']

tfidf_model = TfidfPredictor(lemmatize='custom')
tfidf_model.fit(basic_corpus, basic_labels, verbose=True)
pred, score = tfidf_model.predict(basic_corpus)
print(pred)
print(score)

Training took 1.770897626876831 seconds
[['a' 'c' 'b' 'd']
 ['b' 'c' 'a' 'd']
 ['c' 'b' 'a' 'd']
 ['d' 'c' 'b' 'a']]
[[1.         0.27710268 0.27710268 0.        ]
 [1.         0.27710268 0.27710268 0.        ]
 [1.         0.27710268 0.27710268 0.        ]
 [1.         0.         0.         0.        ]]
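
For readers curious how a vectorizer and a similarity step might be chained, the following is a minimal sketch using scikit-learn's `Pipeline`. `DotProductRanker` is a hypothetical final step written for this example; the internals of the project's `TfidfPredictor` are not shown in this notebook and may differ.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

class DotProductRanker(BaseEstimator):
    """Hypothetical final step: ranks fitted labels by dot product with the input."""

    def __init__(self, n_best=3):
        self.n_best = n_best

    def fit(self, X, y):
        self.X_ = X  # sparse TF-IDF matrix, one row per document
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        # Sparse product, then densify the (n_queries, n_documents) score matrix.
        scores = (X @ self.X_.T).toarray()
        n = min(self.n_best, scores.shape[1])
        order = np.argsort(-scores, axis=1, kind="stable")[:, :n]
        return self.y_[order], np.take_along_axis(scores, order, axis=1)

corpus = [
    "Bees like to make honey",
    "Bears like to eat honey",
    "Humans are walking around the park",
]
labels = ["a", "b", "d"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),          # text -> L2-normalized TF-IDF rows
    ("rank", DotProductRanker(n_best=2)),  # TF-IDF rows -> ranked labels
])
pipeline.fit(corpus, labels)

pred, score = pipeline.predict(["walking in the park"])
print(pred)
print(score)
```

Calling `pipeline.predict` runs the query through the fitted vectorizer first, so the query and the corpus always share the same vocabulary and weighting.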




## Downloading Data and Fitting TF-IDF Predictor

We now download the training data and cache it as a CSV file, unless that file already exists. The training data is stored in [this S3 bucket](https://s3.console.aws.amazon.com/s3/buckets/amplifyobserverinsights-aoinsightslandingbucket29-5vcr471d4nm5). We extract the columns that are used for our corpus and labels, then feed them into a `TfidfPredictor` to fit the model.

In [27]:
# Train model
df = get_data(r'./data/training_data.csv', force_redownload=False, verbose=True)

corpus_col = 'title_body'
url_col = 'url'
title_col = 'title'

# corpus = train_df[corpus_col]
# labels = list(zip(train_df[url_col], train_df[title_col]))

corpus, labels = get_corpus_labels(df, corpus_col, [url_col, title_col])

tfidf_model = TfidfPredictor(lemmatize='custom')
tfidf_model.fit(corpus, labels, verbose=True)

Deserializing data from ./data/training_data.csv took 0.22211885452270508 seconds




Training took 79.89296960830688 seconds


### Model Statistics
Below, you can see the number of words in the vocabulary. This number is quite high, but the vectors being compared are "sparse" (i.e. having few non-zero values), so dot product computation is fast in the average case.
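
To give a feel for why sparsity matters, the sketch below builds a random sparse matrix of roughly comparable shape and scores a query against it. The sizes and density are illustrative assumptions, not measurements from this model.

```python
from scipy import sparse

# Illustrative sizes (assumptions): 1,000 documents over a 20,000-word vocabulary,
# with 0.1% of entries non-zero.
doc_matrix = sparse.random(1000, 20000, density=0.001, format="csr", random_state=0)
query = sparse.random(1, 20000, density=0.001, format="csr", random_state=1)

density = doc_matrix.nnz / (doc_matrix.shape[0] * doc_matrix.shape[1])
print(f"Fraction of non-zero entries: {density:.4f}")

# The sparse product only touches stored (non-zero) entries.
scores = (query @ doc_matrix.T).toarray()
print(scores.shape)
```

On the fitted model, the same density calculation could be applied to the matrix returned by the vectorizer's `transform` method.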

In [22]:
# Model stats
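# `vocabulary_` maps each term to its column index; get_feature_names() was removed in scikit-learn 1.2
vocab = sorted(tfidf_model._vectorizer.vocabulary_)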
print('Number of words in vocabulary:', len(vocab))
repo_list = list(set(df['repo']))
print('Available repos:', repo_list)

Number of words in vocabulary: 23685
Available repos: ['docs', 'amplify-codegen', 'amplify-android', 'aws-appsync-realtime-client-ios', 'aws-amplify.github.io', 'amplify-ci-support', 'amplify-ui', 'amplify-js', 'aws-sdk-android', 'amplify-adminui', 'aws-sdk-ios', 'amplify-ios', 'amplify-console', 'amplify-cli', 'amplify-flutter', 'amplify-observer', 'amplify-js-samples', 'community']


### Performing inferences

Below, we perform an example query prediction. `TfidfPredictor.predict` returns a list of labels corresponding to the `n_best` vectors that are closest to the query's vector, and a list of scores that each of those labels attained.

In [29]:
query = 'AmplifySignIn component does not work with password managers or native browser autofill'
pred, score = tfidf_model.predict(query, verbose=True)
print(pred)
print(score)

Prediction took 2.3255679607391357 seconds
[[['https://github.com/aws-amplify/amplify-js/issues/8472'
   'AmplifySignIn component does not work with password managers or native browser autofill']
  ['https://github.com/aws-amplify/amplify-adminui/issues/233'
   'Password managers, remember password, and suggest password not working in login form']
  ['https://github.com/aws-amplify/amplify-js/issues/5782'
   "(React) UI Components don't support password managers"]
  ['https://github.com/aws-amplify/amplify-js/issues/8289'
   "Password managers don't seem to auto-fill login credentials in AmplifyAuthenticator -> AmplifySignIn"]
  ['https://github.com/aws-amplify/amplify-js/issues/4748'
   "[VueJS] Firefox autofill don't work"]
  ['https://github.com/aws-amplify/amplify-js/issues/3799'
   'Password reset issue - chrome autofill']
  ['https://github.com/aws-amplify/amplify-js/issues/14'
   'Forgot password / change password?']
  ['https://github.com/aws-amplify/aws-sdk-ios/issues/3076'
  