# TF-IDF Predictor with Scikit-learn: Proof of Concept

**Term Frequency-Inverse Document Frequency** (TF-IDF) is an information retrieval (IR) statistic used to estimate how important a word is to a document within a corpus. Simply put, we compute the TF-IDF of a word in a document by counting the occurrences of that word in the document (the term frequency), then scaling that count down according to the number of documents in which the word appears (the inverse document frequency, usually the logarithm of the total number of documents divided by that number). Repeating this for every word in the vocabulary yields a vector indicating the relevance of each word to that document. It is a simple but effective IR method whose roots go back to the 1970s, and it has stood the test of time.
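
As a rough illustration of the computation described above, the following sketch calculates TF-IDF weights by hand for a toy corpus. The whitespace tokenisation and the unsmoothed `ln(N / df)` form of the inverse document frequency are assumptions chosen for brevity; scikit-learn's `TfidfVectorizer` by default uses a smoothed variant and L2-normalises each vector, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus; tokenisation is a plain whitespace split (an assumption for brevity).
docs = [
    "bees like to make honey",
    "bears like to eat honey",
    "humans are walking around the park",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return {term: tf * idf} for one tokenized document (unsmoothed idf)."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }


weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {weight:.3f}")
```

Terms that occur in only one document ("bees", "make") receive a higher weight than terms shared with another document ("like", "to", "honey"), which is the behaviour the statistic is designed to capture.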

The bread and butter of IR is turning arbitrary-length documents into fixed-length vectors (other such methods include word embeddings, encoder-decoder models, etc.). Once we have vectors that abstractly represent the semantics of documents, we can compare them using metrics like cosine similarity, i.e. the dot product of two vectors divided by the product of their lengths (equivalently, the dot product of the two vectors after normalizing each to unit length). This principle forms the basis of our project - *using TF-IDF, we map a small query and large documents into a high-dimensional vector space, then determine which documents are most relevant to the query*. In essence, it is a search engine.
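
The ranking step can be sketched with NumPy alone. The vectors below are made-up numbers rather than real TF-IDF output, and `cosine_similarity` is a helper written for this illustration, not part of the project's modules.

```python
import numpy as np


def cosine_similarity(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    dots = docs @ query
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    # Zero vectors have no direction; their similarity is defined as 0 here.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


doc_vectors = np.array([
    [2.0, 0.0, 1.0],   # same direction as the query
    [0.0, 3.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # partially overlapping
])
query_vector = np.array([4.0, 0.0, 2.0])

scores = cosine_similarity(query_vector, doc_vectors)
ranking = np.argsort(-scores)  # document indices, most similar first
print(scores)
print(ranking)
```

Because cosine similarity ignores vector length, the first document scores as a perfect match even though the query vector is twice as long.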

In [22]:
# Update and import dependencies
!pip install -Uqr requirements.txt

# Basic packages
import importlib
from time import time
from pathlib import Path
from progress.bar import Bar
import io

# Data science/NLP packages
import numpy as np
import pandas as pd
pd.set_option("display.max_colwidth", None)

# AWS packages
import boto3

# Local modules
import tfidf_predictor
import train_tfidf
for m in [tfidf_predictor, train_tfidf]:
    importlib.reload(m)

from tfidf_predictor import VectorSimilarity, TfidfPredictor
from train_tfidf import combine_dfs

## Extending SKLearn: VectorSimilarity

To perform pairwise comparisons across vectors and find the most similar pairs, we implemented the `VectorSimilarity` class, which follows scikit-learn's estimator interface (`fit` and `predict`).

In [None]:
"""
Visual representation of vector locations

       /\
        b  ?
        |
        |
< c --------- a >
        |
        |
        d
       \/
"""
X = np.array(
    [[1, 0],
     [0, 1],
     [-1, 0],
     [0, -1]]
)
y = np.array(['a', 'b', 'c', 'd'])

sim_estimator = VectorSimilarity(n_best=10)
sim_estimator = sim_estimator.fit(X, y)

vec_input = np.array([0.5, 1]).reshape(1, -1) # Shape needs to be (1, n)
pred, score = sim_estimator.predict(vec_input)
print('Most similar vectors:\n', pred)
print('Confidence scores:\n', score)

In [23]:
basic_corpus = [
    'Bees like to make honey',
    'Bears like to eat honey',
    "Bees don't like bears",
    'Humans are walking around the park'
]
basic_labels = ['a', 'b', 'c', 'd']

tfidf_model = TfidfPredictor(lemmatize='custom')
tfidf_model.fit(basic_corpus, basic_labels, verbose=True)
pred, score = tfidf_model.predict(basic_corpus)
print(pred)
print(score)


Training took 0.06862807273864746 seconds
[['a' 'c' 'b' 'd']
 ['b' 'c' 'a' 'd']
 ['c' 'b' 'a' 'd']
 ['d' 'c' 'b' 'a']]
[[1.         0.27710268 0.27710268 0.        ]
 [1.         0.27710268 0.27710268 0.        ]
 [1.         0.27710268 0.27710268 0.        ]
 [1.         0.         0.         0.        ]]




In [24]:
# File helper functions
def list_data_objs():
    secret_name = "SageMakerS3Access"
    region_name = "us-west-2"
    bucket_name = 'amplifyobserverinsights-aoinsightslandingbucket29-5vcr471d4nm5'
    bucket_subfolder = 'data/issues/'

#     secrets = boto3.client(
#         service_name='secretsmanager',
#         region_name=region_name
#     )

#     secrets_response = secrets.get_secret_value(SecretId=secret_name)
#     secrets_dict = json.loads(secrets_response['SecretString'])
#     (access_key, secret_key), = secrets_dict.items()

    s3 = boto3.client('s3')
    data_objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=bucket_subfolder)['Contents']
    data_obj_names = [key['Key'] for key in data_objects]
#     data_obj_names = [f"s3://{bucket_name}/{key['Key']}" for key in data_objects]

    return data_obj_names


def download_data(filename, data_obj_names):
    dfs = []
    s3 = boto3.client(
        's3',
    )

    with Bar(
        message='Downloading parquets',
        check_tty=False,
        hide_cursor=False,
        max=len(data_obj_names)
    ) as bar:

        for obj_name in data_obj_names:
#             df = wr.s3.read_csv(obj_name)
            obj = s3.get_object(Bucket='amplifyobserverinsights-aoinsightslandingbucket29-5vcr471d4nm5', Key=obj_name)
            df = pd.read_parquet(io.BytesIO(obj['Body'].read()))
            dfs.append(df)
            bar.next()

        bar.finish()

    df = combine_dfs(dfs)
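    # index=False keeps the row index out of the CSV, so pd.read_csv in
    # get_data does not pick up a spurious "Unnamed: 0" column
    df.to_csv(filename, index=False)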

    return df

def get_data(filename, force_redownload=False):
    start = time()
    data = Path(filename)

    if data.is_file() and not force_redownload:
        print('Deserializing data from', filename, '...')
        df = pd.read_csv(filename)

    else:
        data_obj_names = list_data_objs()
        # TODO: the [1:] slice skips the first entry because list_data_objs()
        # currently returns an empty object at the start of the listing
        df = download_data(filename, data_obj_names[1:])

    print('Took', time() - start, 'seconds')
    return df
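
A possible alternative to the `[1:]` workaround is sketched below, under two assumptions: that the empty first entry is the zero-byte placeholder object for the prefix itself (its key ends in `/`), and that the listing may eventually exceed the 1,000 keys a single `list_objects_v2` call returns. `list_keys` is a hypothetical helper, not part of the project; it takes the S3 client as an argument so it can be exercised without AWS credentials.

```python
def list_keys(s3_client, bucket, prefix):
    """List every object key under prefix, following pagination.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Assumption: keys ending in "/" are zero-byte folder placeholders.
            if not key.endswith("/"):
                keys.append(key)
    return keys
```

It would be called as `list_keys(boto3.client('s3'), bucket_name, bucket_subfolder)`, after which the result could be passed to `download_data` without slicing.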

In [25]:
df = get_data('training_data.csv', force_redownload=True)

ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

In [None]:
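# Train model
corpus_col = 'title_body'
url_col = 'url'
title_col = 'title'
train_df = df

corpus = train_df[corpus_col]
# Each label is a (url, title) pair identifying the document
labels = list(zip(train_df[url_col], train_df[title_col]))

# get_fitted_model is not imported above; this assumes it is defined in train_tfidf
pipe = train_tfidf.get_fitted_model(corpus, labels, lemmatize='custom')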