<a href="https://colab.research.google.com/github/mihaiaperghis/python-seo/blob/main/bert-string-similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using pre-trained BERT models and Google Sheets to compare string similarity

## Description

The goal of this notebook is to use pre-trained BERT models to compare one set of strings against another: for each string in the first set, we find the most similar string in the second set.

This can be **highly useful for SEO**, for example when **mapping keywords to landing pages** or during **complex site migrations**, where title tags, H1s, or other text scraped via crawlers can be matched by similarity when no other type of automated matching (such as via URLs, SKUs, etc.) is possible.

## Required packages

We use the [Sentence Transformers framework](https://www.sbert.net/index.html) for our pre-trained BERT models. It allows us to load any BERT model from their library or via [HuggingFace](https://huggingface.co/models) and makes it very easy to calculate semantic similarity between various pieces of text.

We also use the [gspread](https://docs.gspread.org/) library to interact with our Google Spreadsheets.

In [None]:
# Install Sentence Transformers and upgrade gspread
!pip install --quiet sentence-transformers
!pip install -U --quiet gspread

## Load the BERT model

When loading a model directly from the Sentence Transformers library, we should use one that is [optimized for detecting text similarity](https://www.sbert.net/docs/pretrained_models.html#semantic-textual-similarity). In this case, loading the model takes a single line:

```
model = SentenceTransformer('model name')
```

If looking for a non-English model, we might get better results by using a custom language-trained model from the [HuggingFace library](https://huggingface.co/models) instead of the multilingual ones. In that case, we need a few extra lines of code for getting the BERT model and adapting it for use with sentence vectors.
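To make the "adapting it for use with sentence vectors" step more concrete, here is a minimal NumPy sketch of mean pooling, which is what `pooling_mode_mean_tokens=True` selects. The token vectors and attention mask are made-up toy values, and the `mean_pool` helper is hypothetical rather than part of Sentence Transformers; the library's own pooling layer also handles batching and other details.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # shape: (tokens, 1)
    summed = (np.asarray(token_embeddings, dtype=float) * mask).sum(axis=0)
    count = max(mask.sum(), 1e-9)                              # avoid dividing by zero
    return summed / count

# Toy example: 3 token positions, 2 dimensions; the last position is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2. 3.]
```

The padding row is masked out, so only the two real tokens contribute to the sentence vector.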

In [None]:
from sentence_transformers import SentenceTransformer, models

# Load a pre-trained BERT model for mapping tokens to embeddings.
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# If you'd like to use one of the pre-trained models from the HuggingFace library, use the code below instead.
# word_embedding_model = models.Transformer('dumitrescustefan/bert-base-romanian-uncased-v1')
# pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
#                                pooling_mode_mean_tokens=True,
#                                pooling_mode_cls_token=False,
#                                pooling_mode_max_tokens=False)
# model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


## Authenticate with Google

Since we'll be pulling/pushing data from/to Google Sheets, we need to authenticate the script with our Google account that has access to our spreadsheet. Click on the link that the script generates, copy the code and paste it in the box you'll see below the link.

In [None]:
# Authenticate with Google and access API services for Google Sheets
from google.colab import auth
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

auth.authenticate_user()

## Get the data from our spreadsheet

We go through multiple steps in order to fetch both lists of strings (in our case these could be title tags, headings, any other text that we might have from our crawls or other sources):

### Establish stopwords

If we're mainly comparing short strings (such as title tags or headings), it makes sense to remove stopwords from both sets of strings in order to make sure we're comparing 'apples to apples' (especially with title tags, a lot of SEO optimizations involve removing stopwords). We're also adding a few other non-alphanumeric symbols for removal, since it's likely they'll add more confusion than clarity to our model.
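As a quick illustration of this clean-up, the sketch below applies the same idea to a single title tag. It uses a tiny hand-written stopword list (an assumption made to keep the example self-contained; the notebook itself uses NLTK's English list) and a hypothetical `remove_stopwords` helper. Words are lowercased before the lookup so that capitalised stopwords such as "The" are caught too, and symbols are only dropped when they stand alone as a token.

```python
# Hypothetical mini stopword list, for illustration only
demo_stop_words = {'the', 'for', 'and', 'a', 'of', '&', ':'}

def remove_stopwords(text, stop_words):
    """Lowercase the text and drop any whitespace-separated token found in stop_words."""
    words = [word.lower() for word in text.split()]
    return ' '.join(word for word in words if word not in stop_words)

cleaned = remove_stopwords('The Best Shoes for Running & Hiking', demo_stop_words)
print(cleaned)  # best shoes running hiking
```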

In [None]:
import nltk
from nltk.corpus import stopwords

# Get stopwords and add some symbols we want to remove
nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words = stop_words + ['<', '%', ':', '&']

### Load and clean up the data 

Assuming we have one sheet with the first set of strings and another sheet with the second set, we use the gspread module to load the columns we're interested in from our Google Sheets spreadsheet. We need to refer to columns by number, so, for example, column E would be number 5.

The rest of the code is to:

* Remove the header (using .pop)
* Make all words lowercase
* Remove the stopwords established earlier
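For spreadsheets with many columns, converting letters to numbers by hand gets error-prone. The helper below is a small sketch of that conversion; `column_letter_to_number` is a hypothetical name, not a gspread function.

```python
def column_letter_to_number(letters):
    """Convert a spreadsheet column label such as 'E' or 'AA' to its 1-based number."""
    number = 0
    for char in letters.upper():
        if not 'A' <= char <= 'Z':
            raise ValueError(f"Invalid column label: {letters!r}")
        number = number * 26 + (ord(char) - ord('A') + 1)
    return number

print(column_letter_to_number('E'))   # 5
print(column_letter_to_number('AA'))  # 27
```

The result could then be assigned to a setting such as `sheet_first_set_column`.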

In [None]:
import gspread

# Where to find the data in Google Sheets
spreadsheet_id = '***************************************'
sheet_first_set = '******' # Name of first sheet
sheet_first_set_column = 1 # Column A
sheet_second_set = '******' # Name of second sheet
sheet_second_set_column = 1 # Column A

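# Fetch the data from the spreadsheet, remove header row
# Recent gspread versions expect google-auth credentials rather than oauth2client ones
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)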
ss = gc.open_by_key(spreadsheet_id)
first_set_strings = ss.worksheet(sheet_first_set).col_values(sheet_first_set_column)
first_set_strings.pop(0)
second_set_strings = ss.worksheet(sheet_second_set).col_values(sheet_second_set_column)
second_set_strings.pop(0)

# Make everything lowercase and remove stopwords
first_set_strings_clean = first_set_strings.copy()
second_set_strings_clean = second_set_strings.copy()

# Lowercase each word before the stopword check so capitalised stopwords are removed too
for idx, string in enumerate(first_set_strings_clean):
  string = [word.lower() for word in string.split(' ') if word.lower() not in stop_words]
  string = ' '.join(string)
  first_set_strings_clean[idx] = string

for idx, string in enumerate(second_set_strings_clean):
  string = [word.lower() for word in string.split(' ') if word.lower() not in stop_words]
  string = ' '.join(string)
  second_set_strings_clean[idx] = string


### Print the results

This is just a 'sanity' check to make sure all of our changes have been applied correctly: 

In [None]:
print(f"First Set Strings: \t{first_set_strings}")
print(f"Clean First Set Strings: \t{first_set_strings_clean}")
print(f"Second Set Strings: \t{second_set_strings}")
print(f"Clean Second Set Strings: \t{second_set_strings_clean}")

## Calculate string similarity

This is where we use the BERT model we previously loaded to calculate a vector for each string from the first set and compare it to all of the strings from the second set. The results are scores that reflect the [cosine similarity](https://www.sciencedirect.com/topics/computer-science/cosine-similarity). Cosine similarity can range from -1 to 1, although for sentence embeddings the scores typically fall between 0 and 1, with 1 meaning two strings are practically identical from a semantic point of view.

Once the calculations are done, the results are sorted descending by this score.
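The relationship between vectors and the similarity score can be sketched with plain NumPy. The vectors below are made-up two-dimensional stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper written for this illustration; SciPy's cosine distance corresponds to one minus this value.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0]
candidates = {'same direction': [2.0, 0.0],
              'orthogonal': [0.0, 3.0],
              'opposite': [-1.0, 0.0]}

# Score every candidate, then sort from most to least similar
scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Note that vector length does not matter, only direction: the candidate pointing the same way scores 1, the perpendicular one scores 0, and the opposite one scores -1.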

### Test the similarity calculations

Before actually using the values returned by the model, let's first print out a few examples.

In [None]:
from scipy.spatial.distance import cdist

# Get embeddings from BERT for all our strings
first_set_embeddings = model.encode(first_set_strings_clean)
second_set_embeddings = model.encode(second_set_strings_clean)

# Compare the 'distance' (similarity) between each of our first set and the second set
for i, (first_set_string, first_set_string_clean, first_set_embedding) in \
  enumerate(zip(first_set_strings, first_set_strings_clean, first_set_embeddings)):
    
    distances = cdist([first_set_embedding], second_set_embeddings, "cosine")[0]
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])
    
    # Print the top 5 second-set matches for each first-set string
    print(f"First Set String: {first_set_string}\n\n")
    print("Second Set Similarity:\n")

    for idx, distance in results[:5]:
        print(f"{second_set_strings[idx]} (Score: {(1-distance):.2f})")
    print("\n-------------------------------\n")

    # Only show the first 20 strings to keep the output short
    if i + 1 >= 20:
        break

### Load the calculations in a dataframe

We repeat the above code, but this time we load the results into a Pandas dataframe (essentially a table) that contains the top two 'predictions' (highest similarity scores) for each string from the first set.
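For larger lists, the same top-two selection can also be done for all rows at once. The following is a sketch under assumptions, not the notebook's method: it uses a small hand-made similarity matrix in place of real model output, and the `top_k_matches` helper is hypothetical.

```python
import numpy as np

def top_k_matches(similarity, k=2):
    """Return, per row, the column indices and scores of the k highest similarities."""
    similarity = np.asarray(similarity, dtype=float)
    # argsort is ascending, so negate to sort descending; stable sort keeps ties in column order
    order = np.argsort(-similarity, axis=1, kind='stable')[:, :k]
    scores = np.take_along_axis(similarity, order, axis=1)
    return order, scores

# Rows: strings from the first set; columns: strings from the second set
similarity = [[0.10, 0.90, 0.40],
              [0.80, 0.20, 0.70]]
indices, scores = top_k_matches(similarity, k=2)
print(indices.tolist())  # [[1, 2], [0, 2]]
print(scores.tolist())   # [[0.9, 0.4], [0.8, 0.7]]
```

Each row of `indices` points back into the second set of strings, so the matched text can be looked up by position.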

In [None]:
import pandas as pd

# Set up a dataframe that holds our calculations
df_model = pd.DataFrame(columns=['First Set String', 'String Similarity 1', 'Similarity Score 1', 'String Similarity 2', 'Similarity Score 2'])
df_model.set_index('First Set String', inplace = True)

# Compare the 'distance' (similarity) between each of our first set and the second set
for i, (first_set_string, first_set_string_clean, first_set_embedding) in \
  enumerate(zip(first_set_strings, first_set_strings_clean, first_set_embeddings)):
    
    distances = cdist([first_set_embedding], second_set_embeddings, "cosine")[0]
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    # Add top 2 predictions to the dataframe
    for rank, (idx, distance) in enumerate(results[:2], start=1):
        df_model.at[first_set_string, f'String Similarity {rank}'] = second_set_strings[idx]
        df_model.at[first_set_string, f'Similarity Score {rank}'] = f'{1 - distance:.2f}'

df_model.head()

## Upload the results back in the spreadsheet

All we're doing here is refining the table to match the order of rows in our spreadsheet, so our predictions are sorted correctly. Afterwards, we add a few headers and upload the data into the columns of our choice in the sheet with our first set of strings.
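Google Sheets expects a range as a list of rows, so a single column has to be sent as a list of one-item lists. The sketch below shows that reshaping in plain Python; `to_column` is a hypothetical helper and the values are made up.

```python
def to_column(header, values):
    """Turn a header and a flat list of values into single-cell rows for one sheet column."""
    return [[cell] for cell in [header] + list(values)]

column = to_column('Similarity Score 1', ['0.93', '0.87'])
print(column)  # [['Similarity Score 1'], ['0.93'], ['0.87']]
```

This produces the same shape as transposing a one-row NumPy array and converting it back to a list.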

In [None]:
import numpy as np

# Set up the final dataframe that matches the order of rows from the spreadsheet
df_final = pd.DataFrame(first_set_strings, columns = ['First Set String'])
df_final.set_index('First Set String', inplace = True)

# Merge similarity calculations in the new dataframe to sort them for the spreadsheet
df = df_final.merge(df_model, how='left', on=['First Set String']) 
df.fillna('', inplace=True)

# Write similar strings and their scores to our spreadsheet (first sheet)
# Keyword arguments are used because the positional order of update() differs between gspread versions
ws = ss.worksheet(sheet_first_set)
ws.update(range_name='B1:B', values=np.array([['String Similarity 1'] + df['String Similarity 1'].tolist()]).T.tolist())
ws.update(range_name='C1:C', values=np.array([['Similarity Score 1'] + df['Similarity Score 1'].tolist()]).T.tolist())
ws.update(range_name='D1:D', values=np.array([['String Similarity 2'] + df['String Similarity 2'].tolist()]).T.tolist())
ws.update(range_name='E1:E', values=np.array([['Similarity Score 2'] + df['Similarity Score 2'].tolist()]).T.tolist())