# TF-IDF Predictor with Scikit-learn: Proof of Concept

**Term Frequency-Inverse Document Frequency** (TF-IDF) is an information retrieval (IR) statistic used to determine how important a word is to a document. Simply put, we compute the TF-IDF of a word in a document by counting the occurrences of that word in the document, then scaling that count down according to the number of documents in the corpus in which the word appears (typically by multiplying by the logarithm of the inverse document frequency). Repeating this process for every word in the vocabulary yields a vector indicating the relevance of each word to that document. It's a simple but effective IR method whose core idea, inverse document frequency, dates back to the 1970s and has stood the test of time.
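To make the weighting concrete, here is a minimal sketch of one common textbook formulation, `tf(t, d) * log(N / df(t))`, written with the standard library only. The function name `tfidf_vectors`, the whitespace tokenizer, and the toy corpus are illustrative assumptions; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) using the weight tf(t, d) * log(N / df(t))."""
    # Naive tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts for this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

docs = ["bees make honey", "bears eat honey", "humans walk"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Words that occur in many documents (here, "honey") get a smaller weight than words unique to one document (here, "bees"), and every document maps to a vector of the same length, one entry per vocabulary word.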

The bread and butter of IR is turning arbitrary-length documents into fixed-length vectors (other such methods include word embeddings, encoder-decoder models, etc.). Once we have vectors that abstractly represent the semantics of documents, we can compare them using metrics like cosine similarity, i.e. the dot product between two length-normalized vectors. This principle forms the basis of our project: *using TF-IDF, we map a small query and large documents into a high-dimensional vector space, then determine which documents are most relevant to the query*. In essence, it is a search engine.
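As a small illustration of the comparison step, the sketch below computes cosine similarity with NumPy. The helper name `cosine_similarity` and the example vectors are assumptions made for this example. Note that the plain dot product coincides with cosine similarity only when both vectors have unit length.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by both norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([0.5, 1.0])   # not unit length
doc = np.array([0.0, 1.0])     # unit length

dot = float(np.dot(query, doc))
cos = cosine_similarity(query, doc)
print("dot product:", dot)
print("cosine similarity:", cos)

# Cosine similarity ignores vector length: scaling the query changes nothing
print("scaled query:", cosine_similarity(10 * query, doc))
```

Because cosine similarity depends only on direction, a short query and a long document can still score as highly similar when they emphasize the same terms.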

In [54]:
# Update and import dependencies
!pip install -Uqr requirements.txt

# Basic packages
import importlib
from time import time
from pathlib import Path
from progress.bar import Bar
import io

# Data science/NLP packages
import numpy as np
import pandas as pd
pd.set_option("display.max_colwidth", None)

# AWS packages
import boto3

# Local modules
import tfidf_predictor, utils_tfidf, train_tfidf
for m in [tfidf_predictor, utils_tfidf, train_tfidf]:
    importlib.reload(m)

from tfidf_predictor import VectorSimilarity, TfidfPredictor
from train_tfidf import combine_dfs
from utils_tfidf import get_data


## Extending SKLearn: VectorSimilarity

To perform pairwise comparisons across vectors and find the most similar pairs, we implemented the `VectorSimilarity` class, which extends scikit-learn's estimator interface.
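The source of `VectorSimilarity` is not shown in this notebook, so the following is only a sketch of what such an estimator might look like, assuming it memorizes the training vectors in `fit` and ranks them by dot product in `predict`. The class name `DotProductSimilarity` and its internals are hypothetical; only the `fit`/`predict` convention and `sklearn.base.BaseEstimator` come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator

class DotProductSimilarity(BaseEstimator):
    """Hypothetical stand-in for VectorSimilarity: ranks stored vectors by dot product."""

    def __init__(self, n_best=10):
        # scikit-learn convention: __init__ only stores hyperparameters
        self.n_best = n_best

    def fit(self, X, y):
        # Memorize the reference vectors and their labels
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        # Score every query row against every stored vector
        scores = np.asarray(X, dtype=float) @ self.X_.T
        # Column indices of the highest-scoring stored vectors, best first
        order = np.argsort(-scores, axis=1, kind="stable")[:, : self.n_best]
        return self.y_[order], np.take_along_axis(scores, order, axis=1)

est = DotProductSimilarity(n_best=2)
est.fit([[1, 0], [0, 1], [-1, 0], [0, -1]], ["a", "b", "c", "d"])
labels, scores = est.predict(np.array([[0.5, 1.0]]))
print(labels, scores)
```

Subclassing `BaseEstimator` provides `get_params` and `set_params` automatically, which is what lets an object like this plug into scikit-learn tooling such as pipelines and parameter searches.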

In [2]:
"""
Visual representation of vector locations

       /\
        b  ?
        |
        |
< c --------- a >
        |
        |
        d
       \/
"""
X = np.array(
    [[1, 0],
     [0, 1],
     [-1, 0],
     [0, -1]]
)
y = np.array(['a', 'b', 'c', 'd'])

sim_estimator = VectorSimilarity(n_best=10)
sim_estimator = sim_estimator.fit(X, y)

vec_input = np.array([0.5, 1]).reshape(1, -1) # Shape needs to be (1, n)
pred, score = sim_estimator.predict(vec_input)
print('Most similar vectors:\n', pred)
print('Confidence scores:\n', score)

Most similar vectors:
 [['b' 'a' 'c' 'd']]
Confidence scores:
 [[ 1.   0.5 -0.5 -1. ]]


In [3]:
basic_corpus = [
    'Bees like to make honey',
    'Bears like to eat honey',
    "Bees don't like bears",
    'Humans are walking around the park'
]
basic_labels = ['a', 'b', 'c', 'd']

tfidf_model = TfidfPredictor(lemmatize='custom')
tfidf_model.fit(basic_corpus, basic_labels, verbose=True)
pred, score = tfidf_model.predict(basic_corpus)
print(pred)
print(score)

Training took 1.770897626876831 seconds
[['a' 'c' 'b' 'd']
 ['b' 'c' 'a' 'd']
 ['c' 'b' 'a' 'd']
 ['d' 'c' 'b' 'a']]
[[1.         0.27710268 0.27710268 0.        ]
 [1.         0.27710268 0.27710268 0.        ]
 [1.         0.27710268 0.27710268 0.        ]
 [1.         0.         0.         0.        ]]
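For intuition about the output above, the sketch below combines scikit-learn's `TfidfVectorizer` with a dot-product ranking on the same toy corpus. It assumes, without confirmation from this notebook, that `TfidfPredictor` works along similar lines. It uses the vectorizer's default settings (no lemmatization or stop-word removal), so the off-diagonal scores will not match the ones printed above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Bees like to make honey',
    'Bears like to eat honey',
    "Bees don't like bears",
    'Humans are walking around the park',
]
labels = np.array(['a', 'b', 'c', 'd'])

# Default settings: smoothed IDF, rows scaled to unit (L2) length
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

# Rows have unit length, so the dot product equals cosine similarity
sims = (doc_vectors @ doc_vectors.T).toarray()

# Rank documents for each query row, most similar first
ranking = labels[np.argsort(-sims, axis=1, kind="stable")]
print(ranking)
print(np.round(sims, 3))
```

Each document is most similar to itself with a score of 1, and the last document shares no vocabulary with the others, so its remaining scores are 0, which mirrors the pattern in the notebook output.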



In [59]:
df = get_data('training_data.csv', force_redownload=False, verbose=True)
print(df.shape)
print(df.columns)

Deserializing data from training_data.csv took 0.21923613548278809 seconds
(17253, 7)
Index(['Unnamed: 0', 'repo', 'title_body', 'title', 'id', 'url', 'number'], dtype='object')


In [48]:
# Train model
corpus_col = 'title_body'
url_col = 'url'
title_col = 'title'
train_df = df

corpus = train_df[corpus_col]
labels = list(zip(train_df[url_col], train_df[title_col]))

tfidf_model = TfidfPredictor(lemmatize='custom')
tfidf_model.fit(corpus, labels, verbose=True)

Training took 80.04574918746948 seconds


In [49]:
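# Model stats
# Note: scikit-learn 1.0 deprecated get_feature_names() in favor of
# get_feature_names_out(); switch to the latter on newer versions.
vocab = tfidf_model._vectorizer.get_feature_names()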
print('Number of vocab words:', len(vocab))
repo_list = list(set(df['repo']))
print('Available repos:', repo_list)

Number of vocab words: 23685
Available repos: ['amplify-js', 'amplify-adminui', 'aws-appsync-realtime-client-ios', 'amplify-ui', 'amplify-ci-support', 'amplify-console', 'amplify-observer', 'docs', 'aws-sdk-android', 'amplify-js-samples', 'amplify-cli', 'amplify-ios', 'aws-amplify.github.io', 'amplify-flutter', 'community', 'amplify-android', 'aws-sdk-ios', 'amplify-codegen']


In [50]:
pw_mgr_query = ['AmplifySignIn component does not work with password managers or native browser autofill']
pred, score = tfidf_model.predict(pw_mgr_query, verbose=True)
print(pred)

Prediction took 2.404874086380005 seconds
[[['https://github.com/aws-amplify/amplify-js/issues/8472'
   'AmplifySignIn component does not work with password managers or native browser autofill']
  ['https://github.com/aws-amplify/amplify-adminui/issues/233'
   'Password managers, remember password, and suggest password not working in login form']
  ['https://github.com/aws-amplify/amplify-js/issues/5782'
   "(React) UI Components don't support password managers"]
  ['https://github.com/aws-amplify/amplify-js/issues/8289'
   "Password managers don't seem to auto-fill login credentials in AmplifyAuthenticator -> AmplifySignIn"]
  ['https://github.com/aws-amplify/amplify-js/issues/4748'
   "[VueJS] Firefox autofill don't work"]
  ['https://github.com/aws-amplify/amplify-js/issues/3799'
   'Password reset issue - chrome autofill']
  ['https://github.com/aws-amplify/amplify-js/issues/14'
   'Forgot password / change password?']
  ['https://github.com/aws-amplify/aws-sdk-ios/issues/3076'
   