# BERT Tokeniser & Vector Querying (Improved)

Have had a few ideas for improvements:

1) Use Lemmatization instead of Stemming
2) Take the CLS token from BERT rather than averaging the embedded values
3) Normalise the output embeddings before scoring on similarity

### How Is the Enhanced Approach Better?

The enhanced approach I suggested includes several improvements:

1. **Lemmatization Instead of Stemming:**
   - Lemmatization provides more meaningful base forms of words compared to stemming, which can help improve the quality of the embeddings.

2. **CLS Token Embedding:**
   - Using the `[CLS]` token embedding captures the entire sequence's contextual information, which can be more representative than averaging all token embeddings.

3. **Normalization:**
   - Embeddings are normalized to ensure that the cosine similarity measures are on a uniform scale, which can improve the accuracy of similarity scoring.
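
The difference between the two pooling strategies and the normalization step can be sketched on a toy array. This is a minimal illustration, not the notebook's pipeline: the random `hidden` array stands in for BERT's `last_hidden_state`, and the helper names `cls_pool`, `mean_pool` and `l2_normalize` are hypothetical.

```python
import numpy as np

def cls_pool(last_hidden_state):
    """Return the vector at position 0 ([CLS]) for each sequence."""
    return last_hidden_state[:, 0, :]

def mean_pool(last_hidden_state, attention_mask):
    """Average the token vectors, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy stand-in for model output: 2 sequences, 4 token positions, hidden size 3.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # second sequence has two padded positions

cls_vectors = l2_normalize(cls_pool(hidden))
mean_vectors = l2_normalize(mean_pool(hidden, mask))
print(cls_vectors.shape, mean_vectors.shape)
```

Both strategies return one vector per sequence, so they can be swapped without changing the downstream similarity code.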


## Clean and Embed Word Tokens

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/UpdatedResumeDataSet.csv")

In [2]:
import json
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import numpy as np
from transformers import BertTokenizer, BertModel
import torch
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm

# Text cleaning
nltk.download('stopwords')
nltk.download('wordnet')
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text, stopwords=STOPWORDS):
    text = text.lower()
    # Strip URLs first, while the punctuation that delimits them is still present
    text = re.sub(r"http\S+", " ", text)
    # Remove stopwords as whole words
    text = re.sub(r'\b(' + r"|".join(stopwords) + r")\b\s*", '', text)
    # Replace anything that is not a letter or digit (punctuation, newlines) with a space
    text = re.sub("[^A-Za-z0-9]+", " ", text)
    # Lemmatize each token; split()/join also collapses repeated whitespace
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained('bert-base-uncased')

def tokenize(text):
    encoded_inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True, max_length=512)
    return encoded_inputs

def preprocess(df):
    df["cleaned_resume"] = df["resume"].apply(clean_text)
    df["tokenized_data"] = df["cleaned_resume"].apply(lambda x: tokenize(x))
    return df

def create_embeddings(tokenized_data):
    input_ids = tokenized_data['input_ids']
    attention_mask = tokenized_data['attention_mask']
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Note: this mean-pools every token vector. The [CLS]-only embedding
        # described above would be outputs.last_hidden_state[:, 0, :]
        embeddings = outputs.last_hidden_state.mean(dim=1)
    
    return embeddings

# Preprocess the DataFrame
processed_df = preprocess(df)
tqdm.pandas(desc="Creating Embeddings")
df['embeddings'] = df['tokenized_data'].progress_apply(lambda row: create_embeddings(row))

# Stack embeddings into a matrix
embeddings_matrix = np.vstack(df['embeddings'].values)

# Normalize embeddings
df['normalized_embeddings'] = df['embeddings'].apply(lambda x: normalize(x.reshape(1, -1), axis=1).flatten())
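
One point worth checking about the normalization step: cosine similarity already divides by the vector norms, so L2-normalising the embeddings beforehand should leave the cosine scores unchanged. Note also that the query cell in this notebook scores against `embeddings_matrix`, not the `normalized_embeddings` column. The sketch below illustrates the scale invariance on two made-up vectors; the `cosine` helper is hypothetical and only mirrors the textbook formula.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, chosen so the norms are easy to verify by hand (5 and 3).
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_score = cosine(a, b)                    # on the original vectors
unit_score = cosine(a_unit, b_unit)         # on the L2-normalised vectors
dot_of_units = float(np.dot(a_unit, b_unit))  # plain dot product of unit vectors

print(raw_score, unit_score, dot_of_units)
```

All three values agree, which suggests normalization mainly matters when the similarity is computed as a raw dot product or a Euclidean distance.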



## Vector Query

In [9]:

# Query text processing
query_text = 'Python machine learning sklearn SQL database data science optimise clean software database coding programming java javascript SQL'
query_tokenized = tokenize(query_text)
query_embedding = create_embeddings(query_tokenized).numpy()

# Calculate Cosine Similarity
cosine_similarities = cosine_similarity(query_embedding.reshape(1, -1), embeddings_matrix)
df['cosine_similarity'] = cosine_similarities[0]

# Mean centering and scaling the cosine similarity scores
mean_similarity = df['cosine_similarity'].mean()
std_similarity = df['cosine_similarity'].std()
df['normalized_similarity'] = (df['cosine_similarity'] - mean_similarity) / std_similarity

# Document length penalty: a more sophisticated penalty, like log transformation
df['document_length'] = df['cleaned_resume'].apply(lambda x: len(x.split()))
df['length_penalty'] = np.log1p(df['document_length'])

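# Adjusted similarity with length penalty (computed here, but the ranking below
# sorts on normalized_similarity, so the penalty does not affect the order)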
df['adjusted_similarity'] = df['normalized_similarity'] - df['length_penalty']
sorted_df = df.sort_values(by='normalized_similarity', ascending=False).reset_index(drop=True)
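
Two properties of the scoring above can be checked on toy numbers. Mean-centering and scaling is a monotonic transform, so it should give the same ranking as the raw cosine scores. The `log1p` length penalty, by contrast, is several units in size and can dominate the z-scores if the ranking is sorted on `adjusted_similarity`. The values below are illustrative only and are not taken from the resume dataset.

```python
import numpy as np

# Illustrative values only, not taken from the resume dataset.
cosine_scores = np.array([0.82, 0.80, 0.78, 0.75])
doc_lengths = np.array([900, 120, 450, 60])  # word counts

# ddof=1 matches the default of pandas Series.std()
z_scores = (cosine_scores - cosine_scores.mean()) / cosine_scores.std(ddof=1)
adjusted = z_scores - np.log1p(doc_lengths)

# Indices of the documents, best score first
rank_raw = np.argsort(-cosine_scores)
rank_z = np.argsort(-z_scores)
rank_adjusted = np.argsort(-adjusted)

print("raw:     ", rank_raw)
print("z-score: ", rank_z)
print("adjusted:", rank_adjusted)
```

In this toy case the raw and z-scored rankings are identical, while the adjusted ranking puts the two shortest documents first regardless of their cosine scores.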

In [10]:
for index, value in enumerate(sorted_df['category'].head(30)):
    print(f"Index {index}: Value {value}")


Index 0: Value Python Developer
Index 1: Value HR
Index 2: Value Data Science
Index 3: Value Java Developer
Index 4: Value Data Science
Index 5: Value Python Developer
Index 6: Value Java Developer
Index 7: Value Data Science
Index 8: Value Hadoop
Index 9: Value Data Science
Index 10: Value DotNet Developer
Index 11: Value Java Developer
Index 12: Value Data Science
Index 13: Value Blockchain
Index 14: Value DotNet Developer
Index 15: Value Hadoop
Index 16: Value DevOps Engineer
Index 17: Value Database
Index 18: Value Java Developer
Index 19: Value Automation Testing
Index 20: Value Java Developer
Index 21: Value Python Developer
Index 22: Value Java Developer
Index 23: Value Data Science
Index 24: Value Java Developer
Index 25: Value SAP Developer
Index 26: Value ETL Developer
Index 27: Value HR
Index 28: Value Business Analyst
Index 29: Value Automation Testing


Interestingly, this has made the results worse! With these changes, the model appears to heavily favour shorter texts. One possible explanation is that a short resume is pooled over fewer tokens, so its embedding can end up closer to that of a short query. Every embedding has the same 768 dimensions, so the number of dimensions itself does not vary with document length.

In [12]:
sorted_df.iloc[27,2]

'key skill computerized accounting tally sincere hard working management accounting income tax good communication leadership two four wheeler driving license internet ecommerce management computer skill c language web programing tally dbms education detail june 2017 june 2019 mba finance hr india mlrit june 2014 june 2017 bcom computer hyderabad telangana osmania university june 2012 april 2014 inter mec india srimedhav hr nani skill detail accounting exprience 6 month database management system exprience 6 month dbms exprience 6 month management accounting exprience 6 month ecommerce exprience 6 monthscompany detail company valuelabs description give rrf form required dlt hand rlt scrum master take form rlt scrum master give form trainee work requirement till candidate receive offer company'

This result sits at index 27 of the ranking despite having little to do with Python or data science.
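
One way to test the suspicion that shorter texts are favoured is to measure the rank correlation between document length and similarity score. The sketch below computes a Spearman correlation with NumPy on made-up numbers. The helper `spearman_correlation` is hypothetical and assumes there are no tied values. With the notebook's data, the inputs would presumably be `df['document_length']` and `df['cosine_similarity']`.

```python
import numpy as np

def spearman_correlation(x, y):
    """Spearman rank correlation, assuming no tied values in x or y."""
    x_ranks = np.argsort(np.argsort(x))
    y_ranks = np.argsort(np.argsort(y))
    return float(np.corrcoef(x_ranks, y_ranks)[0, 1])

# Illustrative values only, not taken from the resume dataset.
doc_lengths = np.array([60, 120, 300, 450, 900])
scores = np.array([0.84, 0.82, 0.79, 0.80, 0.74])

rho = spearman_correlation(doc_lengths, scores)
print(f"Spearman rho between length and score: {rho:.2f}")
```

A strongly negative value on the real data would be consistent with the length bias described above, while a value near zero would point to a different cause.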