# BM25 Search

To keep shorter texts from receiving inflated cosine similarity scores, we can introduce the BM25 search algorithm.
$$
\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k1 + 1)}{f(q_i, D) + k1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}
$$

$$
f(q_i, D) \text{ is the term frequency of term } q_i \text{ in document } D. \\
|D| \text{ is the length of document } D. \\
\text{avgdl} \text{ is the average document length in the corpus.} \\
\text{IDF}(q_i) \text{ is the inverse document frequency of term } q_i. \\
k1 \text{ and } b \text{ are free parameters (commonly } k1 \in [1.2, 2.0] \text{ and } b = 0.75). \\
$$
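
To make the formula concrete, here is a minimal pure-Python sketch that scores a toy corpus term by term. The notebook does not define the IDF term, so the smoothed form `ln((N - n + 0.5) / (n + 0.5) + 1)` used here is an assumption, as are the function name `bm25_score`, the choice `k1 = 1.5`, and the three made-up documents.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query (illustrative only)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    tf = Counter(doc_tokens)                      # f(q_i, D) for every term in D
    score = 0.0
    for term in query_tokens:
        n_containing = sum(1 for d in corpus if term in d)
        # Assumed IDF variant: smoothed so that it never goes negative
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
        f = tf[term]
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * length_norm)
    return score

# Made-up toy corpus, already tokenized
corpus = [
    ["python", "sql", "python"],
    ["java", "spring", "sql", "hibernate", "maven", "sql"],
    ["excel", "reporting"],
]
scores = [bm25_score(["python", "sql"], doc, corpus) for doc in corpus]
print(scores)
```

The first document matches both query terms and is shorter than average, so it scores highest. The third matches neither term and scores zero.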

BM25 is a ranking function that normalises for document length: a term match counts for less in a document that is longer than the corpus average.

## Clean and Embed Word Tokens

In [39]:
import pandas as pd

df = pd.read_csv("../data/raw/UpdatedResumeDataSet.csv")

In [40]:
import json
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import numpy as np
from transformers import BertTokenizer, BertModel
import torch
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm

# Text cleaning
nltk.download('stopwords')
nltk.download('wordnet')
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_duplicate_words(text):
    """
    Remove duplicate words from the text, preserving the original order.
    """
    words = text.split()
    seen = set()
    seen_add = seen.add
    # Preserve order and remove duplicates
    words_no_duplicates = [word for word in words if not (word in seen or seen_add(word))]
    return ' '.join(words_no_duplicates)

def clean_text(text, stopwords=STOPWORDS):
    """Clean raw text string."""
    # Lower
    text = text.lower()

    # Remove links first, while the URL punctuation is still intact
    text = re.sub(r"http\S+", "", text)

    # Remove stopwords (escaped so apostrophes and similar characters are matched literally)
    pattern = re.compile(r'\b(' + r"|".join(map(re.escape, stopwords)) + r")\b\s*")
    text = pattern.sub('', text)

    # Spacing and filters
    text = re.sub(r"([!\"'#$%&()*\+,-./:;<=>?@\\\[\]^_`{|}~])", r" \1 ", text)  # add spacing
    text = re.sub("[^A-Za-z0-9]+", " ", text)  # remove non alphanumeric chars (newlines included)
    text = re.sub(" +", " ", text)  # remove multiple spaces
    text = text.strip()  # strip white space at the ends
    text = remove_duplicate_words(text)

    return text

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained('bert-base-uncased')

def tokenize(text):
    encoded_inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True, max_length=512)
    return encoded_inputs

def preprocess(df):
    df["cleaned_resume"] = df["resume"].apply(clean_text)
    df["tokenized_data"] = df["cleaned_resume"].apply(lambda x: tokenize(x))
    return df

def create_embeddings(tokenized_data):
    input_ids = tokenized_data['input_ids']
    attention_mask = tokenized_data['attention_mask']
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Average the embeddings of all tokens to get a robust representation
        embeddings = outputs.last_hidden_state.mean(dim=1)
    
    return embeddings

# Preprocess the DataFrame
processed_df = preprocess(df)
tqdm.pandas(desc="Creating Embeddings")
df['embeddings'] = df['tokenized_data'].progress_apply(lambda row: create_embeddings(row))

# Stack embeddings into a matrix
embeddings_matrix = np.vstack(df['embeddings'].values)

# Normalize embeddings
df['normalized_embeddings'] = df['embeddings'].apply(lambda x: normalize(x.reshape(1, -1), axis=1).flatten())



## Vector Query

In [41]:
# Query text processing
query_text = 'Python machine learning sklearn SQL database data science database coding programming'
query_tokenized = tokenize(query_text)
query_embedding = create_embeddings(query_tokenized).numpy()

# Calculate Cosine Similarity
cosine_similarities = cosine_similarity(query_embedding.reshape(1, -1), embeddings_matrix)
df['cosine_similarity'] = cosine_similarities[0]

# Mean centering and scaling the cosine similarity scores
mean_similarity = df['cosine_similarity'].mean()
std_similarity = df['cosine_similarity'].std()
df['normalized_similarity'] = (df['cosine_similarity'] - mean_similarity) / std_similarity
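
The cell above relies on cosine similarity, which compares direction only. As a rough illustration of what `cosine_similarity` computes, the sketch below takes the dot product of L2-normalised vectors with plain NumPy. The helper name `l2_normalize` and the two-dimensional toy vectors are made up for the example and stand in for the 768-dimensional BERT embeddings.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length (guarding against zero-length rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

# Toy "document embeddings" and a toy "query embedding"
docs = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]])
query = np.array([[2.0, 0.0]])

# Cosine similarity is the dot product of the unit-length vectors
sims = (l2_normalize(query) @ l2_normalize(docs).T)[0]
print(sims)
```

Because every row is scaled to unit length first, the magnitude of an embedding has no effect on the score.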


In [42]:
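# BM25 dependencies (rank_bm25 is a third-party package: pip install rank_bm25)
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi

nltk.download('punkt')  # tokenizer data for word_tokenize; newer NLTK releases may ask for 'punkt_tab'

# Preprocess the text for BM25
def preprocess_for_bm25(text):
    # clean_text also drops repeated words, so each term counts at most once per document
    return word_tokenize(clean_text(text))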

# Tokenize the resumes for BM25
df['tokenized_resume'] = df['resume'].apply(preprocess_for_bm25)

# Create a BM25 object
bm25 = BM25Okapi(df['tokenized_resume'].tolist())

# Query preprocessing for BM25
tokenized_query = preprocess_for_bm25(query_text)

# Get BM25 scores
bm25_scores = bm25.get_scores(tokenized_query)
df['bm25_score'] = bm25_scores

# Normalize BM25 scores
df['normalized_bm25_score'] = (df['bm25_score'] - df['bm25_score'].mean()) / df['bm25_score'].std()

# Adjusted similarity by combining BM25 and BERT-based cosine similarity
df['combined_similarity'] = df['normalized_similarity'] + df['normalized_bm25_score']

# Sort by combined similarity score
sorted_df = df.sort_values(by='combined_similarity', ascending=False).reset_index(drop=True)
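
Cosine similarities and BM25 scores live on very different scales, which is why both are standardised before being added. The sketch below shows that step in isolation on made-up numbers. The helper name `zscore` and the sample scores are assumptions for illustration. `ddof=1` is used to match the default of pandas `.std()`.

```python
import numpy as np

def zscore(values):
    """Standardise scores to zero mean and unit (sample) standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std(ddof=1)  # sample standard deviation, like pandas .std()
    if std == 0:
        return np.zeros_like(values)  # all scores identical: nothing to rank on
    return (values - values.mean()) / std

# Made-up scores for four documents
cosine = np.array([0.82, 0.80, 0.79, 0.85])  # tightly clustered
bm25 = np.array([0.0, 6.5, 1.2, 3.1])        # much wider range

combined = zscore(cosine) + zscore(bm25)
ranking = np.argsort(-combined)  # document indices, best first
print(ranking)
```

Without the standardisation, the wider-ranging BM25 scores would dominate the sum. An unweighted sum of z-scores gives both signals equal influence, and a weighting factor could be added if one signal should count for more.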

In [43]:
for index, value in enumerate(sorted_df['category'].head(30)):
    print(f"Index {index}: Value {value}")

Index 0: Value Data Science
Index 1: Value Data Science
Index 2: Value Data Science
Index 3: Value Data Science
Index 4: Value Database
Index 5: Value Hadoop
Index 6: Value Python Developer
Index 7: Value Data Science
Index 8: Value Data Science
Index 9: Value Data Science
Index 10: Value Python Developer
Index 11: Value Data Science
Index 12: Value Data Science
Index 13: Value DotNet Developer
Index 14: Value Python Developer
Index 15: Value HR
Index 16: Value Blockchain
Index 17: Value DotNet Developer
Index 18: Value Hadoop
Index 19: Value DotNet Developer
Index 20: Value Hadoop
Index 21: Value Java Developer
Index 22: Value DevOps Engineer
Index 23: Value Blockchain
Index 24: Value DotNet Developer
Index 25: Value Java Developer
Index 26: Value Java Developer
Index 27: Value Automation Testing
Index 28: Value Java Developer
Index 29: Value Java Developer


This produces a much better result. Data Science resumes now take 7 of the top 10 positions, with the remaining three going to Database, Hadoop, and Python Developer resumes.
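
One way to put a number on this is the share of a target category among the top-k results. The sketch below uses the first ten categories from the output above. The helper name `share_in_top_k` is hypothetical, and in the notebook `sorted_df['category'].tolist()` could be passed in instead of the hard-coded list.

```python
def share_in_top_k(ranked_categories, target, k):
    """Fraction of the first k ranked items whose category equals target."""
    top_k = ranked_categories[:k]
    if not top_k:
        return 0.0
    return sum(1 for category in top_k if category == target) / len(top_k)

# First ten categories, copied from the ranking printed above
top_10 = [
    "Data Science", "Data Science", "Data Science", "Data Science",
    "Database", "Hadoop", "Python Developer",
    "Data Science", "Data Science", "Data Science",
]
share = share_in_top_k(top_10, "Data Science", 10)
print(share)
```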

In [49]:
sorted_df.iloc[10,2]

'technical skills responsibilities hands experience production maintenance projects handling agile methodology sdlc involved stage software development life cycle responsible gather requirement customer interaction providing estimate solution document per process fs ts coding utp utr ptf sow submission strong knowledge debugging testing based python 400 worked change controller promoting changes uat live environment pivotal cloud foundry good communication inter personal hardworking result oriented individual team certification trainings completed internal training web crawling scraping data science mongodb mysql postgresql django angular 6 html css german a1 level preparing a2 goethe institute core java ibm series course maples pune complete movex erp techn as400 rpg rpgle m3 stream serve enterprise collaborator mec education details sc computer maharashtra university b h c restful api developer kpit technologies skill flask exprience less 1 year months rest numpy 90 monthscompany com