## BERT Embeddings

This notebook creates and saves BERT embeddings for the quote dataset.

#### Input:
* Quote dataset, cleaned of missing values and misplaced values (e.g. half of a quote ending up in the 'author' column)

#### Output:
* Matrix of vector embeddings for the quotes (one row per quote), ready for nearest-neighbour (KNN) lookup against embedded user input

In [1]:
import pandas as pd
import numpy as np
import time
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, BertModel
import torch

In [3]:
quotes_df = pd.read_csv('quotes.csv')
quotes_df.head()

| Unnamed: 0 | quote | author | category |
|---|---|---|---|
| 0 | I'm selfish, impatient and a little insecure. ... | Marilyn Monroe | attributed-no-source, best, life, love, mistak... |
| 1 | You've gotta dance like there's nobody watchin... | William W. Purkey | dance, heaven, hurt, inspirational, life, love... |
| 2 | You know you're in love when you can't fall as... | Dr. Seuss | attributed-no-source, dreams, love, reality, s... |
| 3 | A friend is someone who knows all about you an... | Elbert Hubbard | friend, friendship, knowledge, love |
| 4 | Darkness cannot drive out darkness: only light... | Martin Luther King Jr., A Testament of Hope: T... | darkness, drive-out, hate, inspirational, ligh... |


### Helper Functions

In [112]:
def print_quote_by_index(idx, df):
    """
    Given the index number and the dataframe of quotes, prints the quote along 
    with the author.
    """
    quote = df.iloc[idx].quote
    author = df.iloc[idx].author
    print(f'"{quote}"')
    print(f' - {author}\n')


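def build_tfidf_matrix(quotes_df):
    """
    Builds a matrix where each row corresponds to a quote, using TF-IDF vectorization.
    Relies on the module-level constant TFIDF_MAX_FEATURES being defined before the call.
    """
    corpus = quotes_df['quote'].tolist()
    vectorizer = TfidfVectorizer(max_features=TFIDF_MAX_FEATURES)
    vectors = vectorizer.fit_transform(corpus)
    # toarray() returns a plain ndarray; todense() returns an np.matrix, whose
    # flatten() stays 2-D and would break the index handling in find_closest_quote
    matrix = vectors.toarray()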

    # feature_names = vectorizer.get_feature_names_out()
    # list_dense = matrix.tolist()
    # df = pd.DataFrame(list_dense, columns=feature_names)
    # ^ use to view the matrix

    return matrix, vectorizer


def build_bert_matrix(quotes_df, tokenizer, model):
    """
    Builds a matrix where each row corresponds to a quote, using BERT vector embedding.
    """
    documents = quotes_df['quote'].to_list()

    # Encode documents
    matrix = np.vstack([encode_document(doc, tokenizer, model) for doc in documents])

    return matrix


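def encode_document(doc, tokenizer, model):
    """
    Encode a single document (string) into a BERT embedding by mean-pooling the
    token vectors of the last hidden layer. Returns an array of shape (1, hidden_size).
    """
    inputs = tokenizer(doc, return_tensors="pt", max_length=512, truncation=True, padding=True)
    # Inference only: skip gradient tracking to save time and memory
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()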


def find_closest_quote(text_vect, matrix, n_closest):
    """
    Computes the cosine distance between 'text_vect' (the embedded user input)
    and each row of 'matrix' (each row being an embedded quote), then prints
    the 'n_closest' quotes with the smallest distance.

    Note: quotes are looked up in the global 'quotes_df', so the rows of
    'matrix' must be in the same order as that dataframe.
    """
    
    A = matrix
    x = text_vect.reshape((-1,1))
    
    temp = (np.sqrt((np.square(A)).sum(axis=1)) * np.linalg.norm(x))
    temp[temp == 0] = np.finfo(float).tiny
    temp = temp.reshape((-1,1)) # necessary to ensure A@x / temp is elementwise
    print('finished computing temp')
    
    distances = (1 - np.matmul(A,x) / temp).flatten()
    print('finished computing distances')
    
    sorted_indices = np.argsort(distances).tolist()
    print('finished sorting\n')
    
    for idx in sorted_indices[:n_closest]:
        print_quote_by_index(idx, quotes_df)
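
The cosine-distance ranking used in `find_closest_quote` can also be written so that it returns indices instead of printing, which makes it easier to check on a toy matrix. This is a minimal sketch, not the notebook's own code: the name `top_k_cosine` and the toy data are assumptions made for illustration.

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k):
    """Return (indices, distances) of the k rows closest to query_vec by cosine distance."""
    A = np.asarray(matrix, dtype=float)
    x = np.asarray(query_vec, dtype=float).reshape(-1)

    # Product of the row norms and the query norm, one value per row
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(x)
    norms[norms == 0] = np.finfo(float).tiny  # avoid division by zero

    # Cosine distance = 1 - cosine similarity
    distances = 1.0 - (A @ x) / norms
    order = np.argsort(distances)[:k]
    return order, distances[order]


# Toy data (hypothetical): three 2-D "embeddings" and one query
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, dists = top_k_cosine(np.array([2.0, 0.0]), toy_matrix, k=2)
print(indices, dists)
```

Row 0 points in the same direction as the query, so it comes first with distance 0; row 2 is at 45 degrees and comes second. The length of the query does not matter, only its direction.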

### Approach 1: TF-IDF Vectorisation

**What is it?**

TF-IDF weights the importance of each term in a document relative to the rest of the corpus. A term is weighted heavily if it occurs frequently within the document (high Term Frequency) and rarely in other documents (high Inverse Document Frequency). The TF-IDF score is the product of TF and IDF.
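
To make the weighting concrete, here is a small sketch that computes the textbook form of TF-IDF by hand (TF = term count / document length, IDF = log(N / document frequency)). The function name `tfidf` and the toy corpus are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Textbook TF-IDF: tf = count / doc length, idf = log(N / document frequency)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (hypothetical), not taken from the quote dataset
toy_corpus = [
    "love is patient love is kind",
    "life is short",
    "art is long life is brief",
]
weights = tfidf(toy_corpus)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.4f}")
```

In the first document, "love" gets the highest weight because it is frequent there and absent elsewhere, while "is" gets a weight of zero because it appears in every document.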

**Caveats**

TF-IDF is a glorified keyword counter, so it doesn't capture the contexts in which the words are used; in other words, it doesn't really capture the "meaning" of each word in its sentence.
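
A quick way to see this limitation: two sentences that mean roughly the same thing but share no words are orthogonal in any keyword-based vector space. The sketch below uses raw term counts rather than TF-IDF weights to keep it short; TF-IDF only rescales the same per-term axes, so the similarity stays zero whenever no term is shared. The function name `bow_cosine` and the example sentences are assumptions for illustration.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between two texts represented as bags of words."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())

    # Only terms present in both texts contribute to the dot product
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Similar meaning, no shared words
paraphrase = bow_cosine("happiness comes from within",
                        "joy originates inside yourself")

# Different meaning, two shared words
overlap = bow_cosine("happiness comes from within",
                     "nothing comes from nothing")

print(paraphrase, overlap)
```

The paraphrase scores exactly zero while the unrelated sentence scores higher, purely because it happens to reuse "comes" and "from".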

### Approach 2: BERT Embedding

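**What is it?**

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model developed by Google. Unlike TF-IDF, it produces a contextual vector for every token, so the same word gets a different representation depending on the sentence around it. In this notebook each quote is embedded by averaging the token vectors from the last hidden layer of `bert-base-uncased`, which gives one 768-dimensional vector per quote.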

**Caveats**

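BERT is computationally expensive: every quote requires a full forward pass through the model. In the run recorded below, building the matrix for the first 1,000 quotes took about a minute (including loading the model).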
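
One common way to reduce this cost is to embed several quotes per forward pass instead of one at a time. That is not what this notebook does, and it comes with a catch: shorter quotes are padded to the length of the longest one in the batch, and a plain mean over tokens would then include the padding positions. The sketch below shows mask-aware mean pooling, with NumPy arrays standing in for the model's `last_hidden_state` and the tokenizer's `attention_mask`. The function name `masked_mean_pool` and the toy values are assumptions for illustration.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """
    Average token vectors while ignoring padding positions.

    hidden_states:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)

    # Number of real tokens per row; clip guards against an all-padding row
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy batch (hypothetical): 2 sequences, 3 positions, hidden size 2
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
naive = hidden.mean(axis=1)  # would let the padding vector leak into row 0
print(pooled)
print(naive)
```

With a single quote per call, as in `encode_document`, there is no padding, so the plain mean and the masked mean coincide.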

In [11]:
TFIDF_MAX_FEATURES = 3000

In [113]:
start = time.time()
# matrix, vectorizer = build_tfidf_matrix(quotes_df)

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

matrix = build_bert_matrix(quotes_df.iloc[:1000], tokenizer, model)
end = time.time() 

print(f'Initialising the matrix for {matrix.shape[0]} quotes took {round(end - start, 2)} seconds.')

Initialising the quote matrix took 59.49 seconds.


In [114]:
matrix.shape[0]

1000
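
Since the embeddings take a while to compute, it makes sense to write the matrix to disk once and reload it for later lookups. The following is a minimal sketch using NumPy's `.npy` format. The helper names `save_embeddings` and `load_embeddings` and the file name `bert_embeddings.npy` are assumptions, and a random array stands in for the real matrix.

```python
import os
import tempfile

import numpy as np


def save_embeddings(matrix, path):
    """Write the embedding matrix to a .npy file as float32."""
    np.save(path, np.asarray(matrix, dtype=np.float32))


def load_embeddings(path):
    """Load an embedding matrix previously written by save_embeddings."""
    return np.load(path)


# Stand-in for the real matrix: 4 rows with BERT-base's hidden size of 768
demo = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bert_embeddings.npy")  # hypothetical file name
    save_embeddings(demo, path)
    restored = load_embeddings(path)

print(restored.shape, restored.dtype)
```

The row order of the saved matrix has to stay aligned with the quote dataframe, because lookups map a row index back to a quote.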

In [131]:
# "why we can't have nice things" illustrates limitation of TF-IDF