In [1]:
#!pip install -U -qq sentence-transformers
!pip install -U -qq transformers accelerate bitsandbytes keybert summa multi_rake

In [2]:
!pip install -qq git+https://github.com/LIAAD/yake

In [3]:
import pandas as pd
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

**Context**

You spend hours perfecting your resume, making sure it outlines your skills and experience in the best possible light. After all, when it comes to job hunting, your resume is your most important tool.

But after all that work, you’re still not getting enough interviews, even for jobs you know you’re qualified for. Why not?

What you might not realize is that your resume usually doesn’t go to a human being after you submit it – it goes to a computer. In fact, there’s a good chance a real person will never see your resume!

That’s because more and more employers are using applicant tracking systems (ATS) to screen resumes. 

What is an ATS? It’s computer software designed to scan resumes for certain keywords and weed out the ones that don’t match the job description.

So if you want your resume to actually make it into the hands of a human being, you need to make sure it’s optimized for the ATS.

In this notebook, we’re going to use NLP and language models to create a resume that can "pass" ATS!

**How applicant tracking systems work?**
There are 4 basic steps to how an applicant tracking system works:

1. A job requisition enters into the ATS. This requisition includes information about the position, such as the job title, desired skills, and required experience.
2. The ATS then uses this information to create a profile for the ideal candidate.
3. As applicants submit their resumes, the ATS parses, sorts, and ranks them based on how well they match the profile.
4. Hiring managers then quickly identify the most qualified candidates and move them forward in the hiring process.

What’s especially important to understand is that recruiters often filter resumes by searching for key skills and job titles.

This means that if you can predict the resume keywords that recruiters will use in their search, you’ll greatly increase your chances of moving on in the hiring process. But you don’t have to guess which keywords to use. All you have to do is analyze the job description to find them.

This notebook automates this process by using AI technology to analyze resume against the job description. It then provides a score that shows how well your resume matches the job description.

**Who uses ATS?**
Over 97% of Fortune 500 companies use ATS while a Kelly OCG survey estimated that 66% of large companies and 35% of small organizations rely on recruitment software. And these numbers continue to grow.

If you’re applying to a large organization, you’ll most likely face an ATS. 

If you’re applying through any online form, you’re applying through an ATS. 

Even job sites like Indeed and LinkedIn have their own built-in ATS.

It’s clear that ATS is here to stay. That’s why it’s so important to use the right keywords and format your resume in a way that makes it easy for ATS software to read.

**How to optimize your resume for an ATS?**
1. Carefully tailor your resume to the job description every single time you apply.
2. Optimize for ATS search and ranking algorithms by matching your resume keywords to the job description.
3. Use both the long-form and acronym version of keywords (e.g. “Master of Business Administration (MBA)” or “Search Engine Optimization (SEO)”) for maximum searchability.
4. Use a chronological or hybrid resume format (avoid the functional resume format).
5. Use a traditional resume font like Helvetica, Garamond, or Georgia.
6. Don’t use headers or footers as the information might get lost or cause a parsing error.
7. Use standard resume section headings like “Work Experience” rather than being cute or clever (“Where I’ve Been”).
8. Use an ATS-friendly resume builder to create your resume.

### Keyword Extraction


**TF-IDF**

term frequency–inverse document frequency, often abbreviated tf-idf, is a method that tries to identify the most distinctively frequent or significant words in a document. 

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs the word “pigeon.” The term “said” appears in 13 (document frequency) of 14 (total documents) Lost in the City stories (14 / 13 –> a smaller inverse document frequency) while the term “pigeons” only occurs in 2 (document frequency) of the 14 stories (total documents) (14 / 2 –> a bigger inverse document frequency, a bigger tf-idf boost).


In [4]:
linkedin_data = pd.read_excel('/kaggle/input/linkedin-sr-data-scientist-desc/linkedin_sr_data_scientist_desc.xlsx', sheet_name='Sheet1')

**Calculate tf–idf**
To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [5]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [6]:
tfidf_vector = tfidf_vectorizer.fit_transform(linkedin_data['skills'])

Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

In [7]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [8]:
tfidf_df = tfidf_df.stack().reset_index()

In [9]:
tfidf_df = tfidf_df.drop('level_0', axis=1)

In [10]:
tfidf_df = tfidf_df.rename(columns={0:'tfidf','level_1': 'term'})

To find out the top 10 words with the highest tf–idf

In [11]:
tfidf_df.sort_values(by=['tfidf'], ascending=[False]).head(10)

Unnamed: 0,term,tfidf
3470,statistics,0.422105
3419,related,0.386528
5590,principles,0.378246
741,privacy,0.337159
4431,applied,0.331959
5274,using,0.317366
2458,model,0.310401
5955,languages,0.308765
5710,understanding,0.305988
3057,training,0.301546


### Extract Keywords using Python

**YAKE!**

It is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from individual documents to identify the most relevant keywords in the text. 

This system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, text size, domain, or language. Yake defines a set of five features capturing keyword characteristics which are heuristically combined to assign a single score to every keyword. The lower the score, the more significant the keyword will be.

YAKE! pays attention to capital letters and gives more importance to words that start with a capital letter.

In [12]:
job = linkedin_data['skills'].iloc[14]

In [13]:
import yake
kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
keywords = kw_extractor.extract_keywords(job)
for kw, v in keywords:
  print("Keyphrase: ",kw, ": score", v)

Keyphrase:  analyzing data : score 0.021871602942235096
Keyphrase:  experience working : score 0.09598042658969005
Keyphrase:  years : score 0.14629926398739557
Keyphrase:  analyzing : score 0.14629926398739557
Keyphrase:  data : score 0.14629926398739557
Keyphrase:  experience : score 0.14748481196459576
Keyphrase:  Safety : score 0.14954886395715059
Keyphrase:  Fraud : score 0.14954886395715059
Keyphrase:  Spam : score 0.14954886395715059
Keyphrase:  Investigations : score 0.14954886395715059


**Rake**

Rake is short for Rapid Automatic Keyword Extraction and it is a method of extracting keywords from individual documents. It can also be applied to new fields very easily and is very effective in dealing with multiple types of documents, especially text that requires specific grammatical conventions. Rake identifies key phrases in a text by analyzing the occurrence of a word and its compatibility with other words in the text (co-occurrence).

In [14]:
from multi_rake import Rake
rake = Rake()
keywords = rake.apply(job)
print(keywords[:10])

[('machine learning lifecycle', 9.0), ('communicate complex concepts', 9.0), ('relevant work experience', 8.0), ('2+ years', 4.0), ('experience working', 4.0), ('cyber proficiency', 4.0), ('programming language', 4.0), ('basic knowledge', 4.0), ('cross-functional stakeholders', 4.0), ('experience', 2.0)]


**TextRank**

TextRank is an unsupervised method for extracting keywords and sentences. It is based on a graph where each node is a word, and edges represent relationships between words which are formed by defining the co-occurrence of words within a moving window of a predetermined size. The algorithm is inspired by PageRank which was used by Google to rank websites. 

It first Tokenizes and annotates text with Part of Speech (PoS). It only considers single words. However, no n-grams are used, multi-words are reconstructed later. An edge is created if lexical units co-occur within a window of N-words to obtain an unweighted undirected graph. 

Then it runs the text rank algorithm to rank the words. The most important lexical words are selected and then adjacent keywords are folded into a multi-word keyword.

In [15]:
from summa import keywords
TR_keywords = keywords.keywords(job, scores=True)
print(TR_keywords[0:10])

[('experience working', 0.6087697185571863), ('relevant work', 0.6087697185571861), ('basic', 0.05370732274465505), ('language', 0.049357038879681674), ('r', 0.026853661372327527)]


**KeyBert**

KeyBERT is a simple, easy-to-use keyword extraction algorithm that takes advantage of SBERT embeddings to generate keywords and key phrases from a document that are more similar to the document. First, document embedding (a representation) is generated using the sentences-BERT model. 

Next, the embeddings of words are extracted for N-gram phrases. The similarity of each keyphrase to the document is then measured using cosine similarity. The most similar words can then be identified as the words that best describe the entire document and are considered as keywords.

In [16]:
from keybert import KeyBERT
kw_model = KeyBERT(model='all-mpnet-base-v2')
keywords = kw_model.extract_keywords(job, 

                                     keyphrase_ngram_range=(1, 3), 

                                     stop_words='english', 

                                     highlight=False,

                                     top_n=10)

keywords_list= list(dict(keywords).keys())

print(keywords_list)

2024-04-23 22:34:35.113868: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-23 22:34:35.114011: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-23 22:34:35.236419: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

['experience working analyzing', 'years experience working', 'data technical background', 'technical background', 'machine learning lifecycle', 'investigations cyber proficiency', 'relevant work experience', 'cyber proficiency sql', 'experience working', 'machine learning models']


In [17]:
keywords = kw_model.extract_keywords(job, 

                                     keyphrase_ngram_range=(1, 3), 

                                     stop_words='english', 

                                     highlight=True,

                                     top_n=10)

In [20]:
keywords = kw_model.extract_keywords(job, 

                                     keyphrase_ngram_range=(1, 2), 

                                     stop_words='english', 

                                     highlight=True,

                                     top_n=10)

### Extracting Job Skills using LLMs

TF-IDF needs a lot of processing and does not work as efficently. Classical NLP tools do not extract all the key words.

The idea of extracting keywords from documents through an LLM is straightforward and allows for easily testing your LLM and its capabilities.

**Why Mistral 7B model?**
Mistral 7B is a 7-billion-parameter language model released by Mistral AI. Mistral 7B is a carefully designed language model that provides both efficiency and high performance to enable real-world applications. Due to its efficiency improvements, the model is suitable for real-time applications where quick responses are essential.

Mistral 7B has demonstrated superior performance across various benchmarks, outperforming even models with larger parameter counts. It excels in areas like mathematics, code generation, and reasoning.

* https://www.linkedin.com/pulse/proof-concept-using-large-language-models-llms-extract-truc-phan-w5vde/
* https://huggingface.co/docs/transformers/main/en/model_doc/llama2
* https://huggingface.co/blog/llama2#how-to-prompt-llama-2
* https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing
* https://towardsdatascience.com/meta-llama-3-optimized-cpu-inference-with-hugging-face-and-pytorch-9dde2926be5c
* https://www.promptingguide.ai/models/mistral-7b

In [21]:
from huggingface_hub import notebook_login, Repository

# Login to Hugging Face
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [22]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch
import time
import regex
import json

start = time.time()

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
#token = TOKEN

prompt = f"""

Always assist with care, respect, and truth. 
Respond with utmost utility yet securely. 
Avoid harmful, unethical, prejudiced, or negative content. 
Ensure replies promote fairness and positivity.

Read the JOB DESCRIPTION 
and extract keywords for 
data related skills such as Generative AI, Statistics,
machine learning tasks such as feature enginerring, classification,
data analysis tools such as Tableau, Streamlit, 
programming languages such as Python, SQL,
education such as B.Sc.,
and soft skills e.g., communication required.

Generate a valid JSON object with following key artifacts
skills: "",
machine learning: ""
tools: "",
languages: "",
education: "",
soft skills : "".

Just generate the JSON object without explanation, unique words or duplicates. Be brief.


JOB DESCRIPTION
"
{job}
"

"""

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             device_map="auto", 
                                             quantization_config=bnb_config)
                                             #token=TOKEN
tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          use_fast=True, 
                                          quantization_config=bnb_config)
                                          #token=TOKEN
    
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

output = model.generate(**model_inputs,
                          max_new_tokens=1024,
                          repetition_penalty=1.5)

text_string = tokenizer.decode(output[0], 
                       skip_special_tokens=True)

# Define a pattern to match JSON object
pattern = regex.compile(r'\{(?:[^{}]|(?R))*\}')

# Find JSON object using regular expression
json_match = pattern.findall(text_string)

if json_match:
    # Clean the text
    cleaned_text = str(json_match).strip('[]')  # remove square brackets
    cleaned_text = cleaned_text.strip()  # remove leading/trailing spaces
    cleaned_text = cleaned_text.replace('\n', '')  # remove newlines
    cleaned_text = cleaned_text.replace('\t', '')  # remove tabs
    cleaned_text = cleaned_text.replace('\\', '')  # remove escape characters

    # Print the dictionary
    print(cleaned_text)
else:
    print("No JSON object found in the text.")
    
end = time.time()
print(f"Time (minutes): {(end - start)}")

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'{nt"skills": ["Data Analysis", "SQL","Programming"],n    "machine_learning":["Machine Learning Model Implementation/Improvement"],n     "tools":"N/A",n      "languages":{"python":"1"},n       "education": {"Bsc Degree":"0"},"n        software Skills ": {   "communicationSkillLevel": "Advanced" }n}'
Time (minutes): 92.3208920955658


### Semantic Similarity 
It is the similarity between two words or two sentences/phrase/text. It measures how close or how different the two pieces of word or text are in terms of their meaning and context.

In [None]:
from sentence_transformers import SentenceTransformer, util

def semantic_similarity_sbert_base_v2(job,resume, model):
    """calculate similarity with SBERT all-mpnet-base-v2"""
    model = SentenceTransformer(model)
    # Compute embedding for both lists
    embeddings1 = model.encode(job, convert_to_tensor=True)
    embeddings2 = model.encode(resume, convert_to_tensor=True)

    # Compute cosine-similarities
    cosine_scores = util.cos_sim(embeddings1, embeddings2)
     
    return cosine_scores

In [None]:
resume = """


"""

In [None]:
semantic_similarity_sbert_base_v2(job,resume, 'all-MiniLM-L12-v1')


### How to Conduct Candidate Analysis

Crafting a comprehensive candidate analysis involves multiple dimensions. Here are the key steps:

1. Resume Keyword Matching: Verify alignment between the resume and job description by looking for overlaps in skills and experience.
2. Competency-Based Evaluation: Scrutinize past achievements using the STAR technique to ensure competencies match those necessary for the role.
3. AIDA Cover Letter Review: Evaluate the cover letter to see if it effectively grabs Attention, maintains Interest, builds Desire, and prompts Action.
4. Fit/Gap Analysis: Determine where a candidate’s skills meet the job prerequisites and where they don't to assess overall compatibility.
5. Growth Potential Assessment: Consider the candidate's past trajectories to estimate their potential for future growth within your startup.

**ChatGPT Prompt for Founders to Create Candidate Analysis**


Using the job description provided, I need a detailed **candidate analysis report** for a job application. This report will assist me in making an informed decision about whether to proceed with the interview process for this candidate.

Here is the job description to use as a benchmark:

[Job Description]

#### Candidate’s Application Analysis:

1. **Resume Keyword Match**: Examine the applicant's resume and extract key skills, experiences, and qualifications. Present these in a bullet-pointed list and note which directly match the job description criteria.

2. **Competency-Based Evaluation**: Analyze the candidate's strongest work achievements. Use the STAR technique to break these down and comment on how these achievements demonstrate competencies required for the job.

3. **Cover Letter AIDA Assessment**: Critique the cover letter using the AIDA model, focusing on how the candidate uses it to illustrate suitability for the role.

4. **Fit-Gap Analysis**: Conduct a fit-gap analysis by creating two lists: one showing where the candidate's skills and experiences match the job requirements ('Fit') and another where they do not align ('Gap').

5. **Growth Potential**: Comment briefly on the candidate's potential for growth and learning within the company based on their career trajectory and achievements presented.

6. **Final Suitability Statement**: Conclude with a suitability statement summarizing whether the candidate should be considered for the role  based on criteria matches, potential growth, and overall fit for the company culture.

Please present your findings in a cohesive markdown format, ensuring each section is clear and well-structured for ease of review.

Candidate's Application:

[Candidate Application]
