# BERT Tokeniser & Vector Querying

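This notebook cleans the text of each CV, tokenises it with the BERT tokeniser, and passes it through a pretrained BERT model to embed it in a semantic vector space. A free-text query is embedded the same way, and cosine similarity measures how close each CV is to the query in that space. The cleaning step lowercases the text and removes stopwords, punctuation, and duplicate words so that only the most informative words remain (a Porter stemmer is imported, but stemming is not applied in the cells shown here).

Cosine similarity is defined as: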
$$
\text{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Where:
$$
\begin{aligned}
& A_i \text{ and } B_i \text{ are the components of vectors } A \text{ and } B, \text{ respectively.} \\
& \|A\| \text{ and } \|B\| \text{ are the magnitudes of vectors } A \text{ and } B, \text{ respectively.}
\end{aligned}
$$
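
As a quick sanity check of the formula, the sketch below computes cosine similarity directly with NumPy. The function name `cosine_sim` and the toy vectors are illustrative only and are not part of the notebook's pipeline.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = [1.0, 2.0, 3.0]
print(cosine_sim(a, [2.0, 4.0, 6.0]))      # same direction: approximately 1
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0
print(cosine_sim(a, [-1.0, -2.0, -3.0]))   # opposite direction: approximately -1
```

The score depends only on the angle between the vectors, not their lengths, which is why scaling `a` by 2 still gives a similarity of about 1.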

## Cleaning Word Data:
### Stemming & Tokenisation

In [9]:
import pandas as pd

df = pd.read_csv("../data/raw/UpdatedResumeDataSet.csv")

In [2]:
import json
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

In [3]:
nltk.download("stopwords")
STOPWORDS = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jtren\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
def remove_duplicate_words(text):
    """
    Remove duplicate words from the text, preserving the original order.
    """
    words = text.split()
    seen = set()
    seen_add = seen.add
    # Preserve order and remove duplicates
    words_no_duplicates = [word for word in words if not (word in seen or seen_add(word))]
    return ' '.join(words_no_duplicates)
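
The order-preserving behaviour can be checked on a short string. The sketch below uses `dict.fromkeys`, an alternative (not the notebook's implementation) that relies on dictionaries preserving insertion order in Python 3.7+. The name `dedupe_words` is hypothetical.

```python
def dedupe_words(text):
    """Keep the first occurrence of each whitespace-separated word, in order."""
    return " ".join(dict.fromkeys(text.split()))

print(dedupe_words("python sql python data sql science"))
# python sql data science
```

Matching is exact and case-sensitive, so this only behaves as intended on text that has already been lowercased.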

In [5]:
def clean_text(text, stopwords=STOPWORDS):
    """Clean a raw text string: lowercase, strip links, stopwords and punctuation, drop duplicate words."""
    # Lowercase
    text = text.lower()

    # Remove links first, while each URL is still intact
    text = re.sub(r"http\S+", " ", text)

    # Remove stopwords (escaped so they are matched literally)
    pattern = re.compile(r"\b(" + r"|".join(map(re.escape, stopwords)) + r")\b\s*")
    text = pattern.sub("", text)

    # Spacing and filters
    text = re.sub(r"([!\"'#$%&()*\+,-./:;<=>?@\\\[\]^_`{|}~])", r" \1 ", text)  # add spacing around punctuation
    text = re.sub("[^A-Za-z0-9]+", " ", text)  # replace non-alphanumeric runs (including newlines) with a space
    text = re.sub(" +", " ", text)  # collapse multiple spaces
    text = text.strip()  # strip whitespace at the ends
    text = remove_duplicate_words(text)

    return text

In [6]:
import numpy as np
import pandas as pd
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained('bert-base-uncased')



In [7]:
def tokenize(text):
    encoded_inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True, max_length=512)
    return encoded_inputs

In [8]:
def preprocess(df):
    df["cleaned_resume"] = df["resume"].apply(clean_text)  # Apply clean_text
    df["tokenized_data"] = df["cleaned_resume"].apply(lambda x: tokenize(x))  # Apply tokenize
    return df

In [10]:
processed_df = preprocess(df)

## Embedding
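Next we create the embeddings. Each CV's tokens are passed through BERT, and the final-layer hidden states are averaged over the non-padding tokens. This gives one fixed-length tensor per CV (768 dimensions for `bert-base-uncased`) that summarises its semantic content.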

In [11]:
def create_embeddings(tokenized_data):
    """Generate embeddings by averaging token embeddings (excluding padding tokens)."""
    input_ids = tokenized_data['input_ids']
    attention_mask = tokenized_data['attention_mask']
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Get the hidden states of all tokens
        last_hidden_states = outputs.last_hidden_state
        # Create a mask for ignoring padding tokens
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
        # Sum embeddings for each token, ignoring padding tokens, then divide by the number of non-padding tokens
        sum_embeddings = torch.sum(last_hidden_states * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)  # Prevent division by zero
        mean_embeddings = sum_embeddings / sum_mask
    
    return mean_embeddings
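
To make the masking step concrete, here is a minimal sketch of the same mean pooling on a toy array. It uses NumPy rather than PyTorch so it runs without loading BERT. The values and shapes are made up for illustration.

```python
import numpy as np

# Toy batch: 1 sequence, 4 token positions, hidden size 2 (made-up values).
hidden = np.array([[[1.0, 2.0],
                    [3.0, 4.0],
                    [100.0, 100.0],     # padding position
                    [100.0, 100.0]]])   # padding position
attention_mask = np.array([[1, 1, 0, 0]])  # 1 = real token, 0 = padding

mask = attention_mask[..., None].astype(float)   # shape (1, 4, 1), broadcasts over hidden size
summed = (hidden * mask).sum(axis=1)             # sum over real tokens only
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # number of real tokens, guarded against zero
mean_pooled = summed / counts

print(mean_pooled)  # [[2. 3.]]
```

The large values at the padding positions do not affect the result, because the mask zeroes them out before the sum and excludes them from the token count.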

In [12]:
from tqdm.notebook import tqdm
tqdm.pandas(desc="Creating Embeddings")

In [13]:
df['embeddings'] = df['tokenized_data'].progress_apply(lambda row: create_embeddings(row))

Creating Embeddings:   0%|          | 0/167 [00:00<?, ?it/s]

In [14]:
from sklearn.preprocessing import normalize

df['normalized_embeddings'] = df['embeddings'].apply(lambda x: normalize(x.reshape(1, -1), axis=1).flatten())
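
Once vectors are scaled to unit length, cosine similarity reduces to a plain dot product, so a whole matrix of embeddings can be scored against a query with one matrix product. A minimal NumPy sketch with made-up 2-D vectors follows. The names `l2_normalize`, `docs`, and `query` are illustrative and not from the notebook.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero-length rows

docs = l2_normalize([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
query = l2_normalize([[0.0, 1.0]])

scores = (query @ docs.T)[0]    # cosine similarities, one per row of docs
ranking = np.argsort(-scores)   # row indices from most to least similar

print(scores)   # approximately 0.8, 0.0, 1.0
print(ranking)  # [2 0 1]
```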


## Vector Query
Now that we have created our embeddings, we can query the data with a prompt. I will deliberately use a prompt that is heavily weighted towards data science, so in theory data science categories should appear near the top of the list with high similarity scores.

In [22]:
query_text = 'Python machine learning sklearn SQL database data science database coding programming'
query_tokenized = tokenize(query_text)  # Ensure this uses the same tokenizer and method used for CVs
query_embedding = create_embeddings(query_tokenized).numpy() 

In [23]:
embeddings_matrix = np.vstack(df['embeddings'].values)

In [24]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(query_embedding.reshape(1, -1), embeddings_matrix)

In [25]:
df['similarity_score'] = similarities[0]

In [26]:
sorted_df = df.sort_values(by='similarity_score', ascending=False).reset_index(drop=True)

In [27]:
job_titles = sorted_df[['category', 'similarity_score']]

In [29]:
for index, value in enumerate(job_titles['category'].head(30)):
    print(f"Index {index}: Value {value}")


Index 0: Value DotNet Developer
Index 1: Value DotNet Developer
Index 2: Value Python Developer
Index 3: Value Python Developer
Index 4: Value Data Science
Index 5: Value DotNet Developer
Index 6: Value Java Developer
Index 7: Value Data Science
Index 8: Value Data Science
Index 9: Value Data Science
Index 10: Value HR
Index 11: Value Hadoop
Index 12: Value Java Developer
Index 13: Value Hadoop
Index 14: Value Data Science
Index 15: Value Blockchain
Index 16: Value DevOps Engineer
Index 17: Value Java Developer
Index 18: Value Java Developer
Index 19: Value Data Science
Index 20: Value Java Developer
Index 21: Value Python Developer
Index 22: Value Automation Testing
Index 23: Value Java Developer
Index 24: Value Database
Index 25: Value Java Developer
Index 26: Value Python Developer
Index 27: Value Database
Index 28: Value Java Developer
Index 29: Value Business Analyst


This has done a reasonably good job: Data Science and Python Developer CVs appear frequently in the top 30. However, DotNet Developer CVs take the top two places, which indicates the algorithm could be improved.
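
One rough way to quantify this is to tally the categories among the 30 highest-scoring CVs. The sketch below hard-codes the list printed above so that it runs on its own. In the notebook itself, something like `sorted_df['category'].head(30).value_counts()` should give the same counts.

```python
from collections import Counter

# Categories of the 30 highest-scoring CVs, copied in order from the output above.
top_30 = [
    "DotNet Developer", "DotNet Developer", "Python Developer", "Python Developer",
    "Data Science", "DotNet Developer", "Java Developer", "Data Science",
    "Data Science", "Data Science", "HR", "Hadoop",
    "Java Developer", "Hadoop", "Data Science", "Blockchain",
    "DevOps Engineer", "Java Developer", "Java Developer", "Data Science",
    "Java Developer", "Python Developer", "Automation Testing", "Java Developer",
    "Database", "Java Developer", "Python Developer", "Database",
    "Java Developer", "Business Analyst",
]

counts = Counter(top_30)
for category, n in counts.most_common(4):
    print(f"{category}: {n}")
# Java Developer: 8
# Data Science: 6
# Python Developer: 4
# DotNet Developer: 3
```

Which categories count as relevant is a judgment call, since the query also mentions Python, SQL, and databases. Even so, the tally shows that Java Developer CVs outnumber Data Science CVs in the top 30.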