This notebook uses the open-weight model Meta-Llama-3-8B-Instruct with several prompts to:

1. Extract keywords from the responsibilities and skills outlined in the job description, offering a concise overview of the role's key aspects.
2. Summarize the job description, providing a concise and coherent overview while rewriting the professional profile to align with the role's requirements.
3. Perform a thorough spelling and grammar check to ensure the text is error-free and professionally presented.

With [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and tailored prompts, this notebook streamlines analyzing and refining job descriptions, which helps produce clear and effective content.

In [1]:
!pip install -U -qq transformers accelerate sentence-transformers

### Keyword Extraction


In [2]:
from huggingface_hub import notebook_login

# Login to Hugging Face
notebook_login()

I experimented with several models, including Mistral and Llama 3. Llama 3 did not perform well when quantized with bitsandbytes, but it produces satisfactory results without quantization.

Unfortunately, the Llama-3-70B model is too large to fit in the memory available on the free GPU notebook.
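
As a rough back-of-the-envelope check on the memory constraint, the sketch below estimates the memory needed just to hold the model weights. The parameter counts are nominal (8 and 70 billion) and the helper name `weight_memory_gb` is made up for illustration; activations and the KV cache need additional memory on top of these figures.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, n_params in [("Llama-3-8B", 8e9), ("Llama-3-70B", 70e9)]:
    bf16 = weight_memory_gb(n_params, bytes_per_param=2)     # bfloat16: 2 bytes per weight
    int4 = weight_memory_gb(n_params, bytes_per_param=0.5)   # 4-bit: half a byte per weight
    print(f"{name}: ~{bf16:.0f} GB in bfloat16, ~{int4:.0f} GB at 4 bits")
```

Under these assumptions the 70B model needs roughly 140 GB for its weights in bfloat16, far more than a single free-tier GPU typically offers.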

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto") 
                                             
tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          use_fast=True) 
    

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
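# Disable the memory-efficient and flash scaled-dot-product attention kernels
# so that PyTorch falls back to its default math implementation.
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)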

### Extract Job Responsibilities

[An Example](https://www.linkedin.com/jobs/search/?currentJobId=3921153671&f_TPR=r604800&geoId=104738515&keywords=data%20scientist&location=Ireland&origin=JOB_SEARCH_PAGE_LOCATION_AUTOCOMPLETE&refresh=true&start=25)

In [5]:
responsibility = """

* You will work with a team of high-performing analytics, data science professionals, and cross-functional teams to identify business opportunities, optimize product performance or go-to-market strategy.

● You will analyze large-scale structured and unstructured data; develop deep-dive analysis and machine learning models to drive member value and customer success.

● You will design and develop core business metrics, create insightful automated dashboards and data visualizations to track them and extract useful business insights.

● You will design and analyze experiments to test new product ideas or go-to-market strategies and convert the results into actionable recommendations.

● You will craft compelling stories; make logical recommendations; drive informed actions.

● You will engage with internal technology partners to prototype and validate tools developed in-house for near-real-time processing of very large datasets.

● You will communicate findings to senior leaders and evangelize data-driven business decisions.

"""

**How to Generate**
* https://huggingface.co/blog/how-to-generate
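
The generation cells in this notebook pass `do_sample=True` together with `temperature` and `top_p`. To make those two settings concrete, here is a simplified pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is an illustration under stated assumptions, not the code that `transformers` runs internally, and the function name `top_p_filter` and the example logits are made up for this example.

```python
import math


def top_p_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p sampling."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


example_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
print(top_p_filter(example_logits, temperature=1.0))  # keeps three tokens
print(top_p_filter(example_logits, temperature=0.6))  # keeps two tokens
```

Lowering the temperature concentrates probability on the top tokens, so fewer candidates survive the same `top_p` cutoff and the output becomes more deterministic.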

In [6]:
resp_messages = [
{"role": "system", 
"content": """
You are an expert text analyst and researcher.
Please respond only in the English language. 
Do not explain what you are doing. 
Do not self reference.

Generate a valid JSON object with following key artifact:
"responsibilities": [],
"tasks": [],
"keywords": []

Just generate the JSON object without duplicates.
AVOID adding any details if not explicitly mentioned.
Ensure there are no spelling and grammar mistakes.

"""},
{"role": "user", 
"content": f"""
 
Please list all the main responsibilities, tasks and keywords from the following:
{responsibility}

Responsibilities are duties that you will carry out on a regular basis.
Tasks are the specific actions that you will perform.
Extract the keywords mentioned for the position.

DO NOT LIMIT the responsibilities, tasks or keywords.


"""},
]
start = time.time()

resp_input_ids = tokenizer.apply_chat_template(
    resp_messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

resp_outputs = model.generate(
    resp_input_ids,
    max_new_tokens=4096,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
resp_response = resp_outputs[0][resp_input_ids.shape[-1]:]
print(tokenizer.decode(resp_response, skip_special_tokens=True))
end = time.time()
print(f"Time (seconds): {end - start}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


{
"responsibilities": [
"work with a team",
"identify business opportunities",
"optimize product performance",
"go-to-market strategy",
"analyze large-scale structured and unstructured data",
"develop deep-dive analysis",
"machine learning models",
"drive member value",
"customer success",
"design and develop core business metrics",
"create insightful automated dashboards",
"data visualizations",
"track business metrics",
"extract business insights",
"design and analyze experiments",
"test new product ideas",
"go-to-market strategies",
"convert results into actionable recommendations",
"craft compelling stories",
"make logical recommendations",
"drive informed actions",
"engage with internal technology partners",
"prototype and validate tools",
"near-real-time processing of very large datasets",
"communicate findings to senior leaders",
"evangelize data-driven business decisions"
],
"tasks": [
"analyze large-scale structured and unstructured data",
"develop deep-dive analysis and machi

### Extract Skills

In [7]:
skills = """

Basic Qualifications:

● Bachelor’s Degree in a quantitative discipline: Statistics, Operations Research, Computer Science, Informatics, Engineering, Applied Mathematics, Economics, etc.
● 1+ years of industry experience
● 1 + years experience with SQL or relational database query performance.

Preferred Qualifications: 

● 2+ years of relevant work experience
● MS or PhD in a quantitative discipline: Statistics, Operations Research, Computer Science, Informatics, Engineering, Applied Mathematics, Economics, etc.
● Experience in at least one programming language (e.g., R, Python).
● Experience with data visualization tools (eg. Tableau, BI dashboarding, R visualization packages, etc.).
● Experience with manipulating massive scale structured and unstructured data.
● Experience with Hadoop or other MapReduce paradigms, and associated languages such as Hive, Presto etc.
● Working knowledge of Unix and Unix-like systems, git and review board.
● Excellent communications skills, with the ability to synthesize, simplify and explain complex problems to different types of audiences.

Suggested Skills:

Statistics, 
Computer Science, 
Engineering, 
Applied Mathematics
SQL
"""

In [8]:
messages = [
{"role": "system", 
"content": """
You are an expert text analyst and researcher.
Please respond only in the English language. 
Do not explain what you are doing. 
Do not self reference.
Do not add any unnecessary details. 

Generate a valid JSON object with following key artifacts:
skills: [],
machine learning techniques: [],
tools: [],
programming_languages: [],
education: [],
experience: [],
soft_skills : []

Just generate the JSON object without explanation, unique words or duplicates. Be brief.

"""},
{"role": "user", 
"content": f""" 
Please extract only the most relevant keywords and key phrases from the provided 
{skills}.

Extract keywords for 
data science related skills,  
machine learning techniques, 
data analysis tools,  
programming languages, 
educational qualifications,
experience with number of years,
and soft skills. 

AVOID adding any details if not explicitly mentioned.


"""},
]
start = time.time()

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=4096,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
end = time.time()
print(f"Time (seconds): {end - start}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


{
"skills": ["Statistics", "Computer Science", "Engineering", "Applied Mathematics", "SQL"],
"machine learning techniques": [],
"data analysis tools": ["Tableau", "Hive", "Presto"],
"programming languages": ["R", "Python"],
"educational qualifications": ["Bachelor's Degree", "MS", "PhD"],
"experience with number of years": ["1+", "2+"],
"soft skills": ["Excellent communications skills"]
}
Time (seconds): 60.90728187561035
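
The keys in the response above do not all match the keys requested in the system prompt: the model reused the phrases from the user message instead. A small check like the following sketch can flag such mismatches before the result is used downstream. The helper name `check_keys` is hypothetical; the expected keys are copied from the system prompt and the sample text from the output above.

```python
import json

# Keys requested in the system prompt of the skills cell.
EXPECTED_KEYS = [
    "skills", "machine learning techniques", "tools",
    "programming_languages", "education", "experience", "soft_skills",
]


def check_keys(raw_json, expected=EXPECTED_KEYS):
    """Return (missing, unexpected) keys of a JSON object string."""
    data = json.loads(raw_json)
    missing = [key for key in expected if key not in data]
    unexpected = [key for key in data if key not in expected]
    return missing, unexpected


# The response printed above, pasted in as a string.
raw_output = """
{
"skills": ["Statistics", "Computer Science", "Engineering", "Applied Mathematics", "SQL"],
"machine learning techniques": [],
"data analysis tools": ["Tableau", "Hive", "Presto"],
"programming languages": ["R", "Python"],
"educational qualifications": ["Bachelor's Degree", "MS", "PhD"],
"experience with number of years": ["1+", "2+"],
"soft skills": ["Excellent communications skills"]
}
"""

missing, unexpected = check_keys(raw_output)
print("Missing keys:   ", missing)
print("Unexpected keys:", unexpected)
```

Using the same key names in the system prompt and the user message would likely reduce this kind of drift.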


### Write the Professional Profile

In [9]:
profile = """

a talented and driven individual to accelerate our efforts and be a major part of our data-centric culture. This person will work closely with various cross-functional teams such as product, marketing, sales, engineering and operations to develop and deliver metrics, analyses, solutions, and insights, with actionable recommendations to business partners. Successful candidates will exhibit technical acumen and business savviness, with a passion for making an impact through creative storytelling and timely actions.

"""


In [10]:
profile_messages = [
{"role": "system", 
"content": """
You are an expert with superb comprehension and communication skills, 
skilled in reading, understanding, and summarizing the main points of large sections of dense texts. 

The summary should cover all the key points and main ideas presented in the original text, 
while also condensing the information into a concise and easy-to-understand format.

Ensure you focus on the keywords mentioned in the text provided.

Please ensure that the summary includes relevant details and examples that support the main ideas, 
while avoiding any unnecessary information or repetition. 

The summary should be in first person narrative, active voice, that is professional, brief yet concise.

The length of the summary should be appropriate for the length and complexity of the original text, 
providing a clear and accurate overview without omitting any important information.

Generate a valid JSON object with following key artifact:
"summary": ""

"""},
{"role": "user", 
"content": f"""
 
Can you provide a comprehensive summary of the given 
{profile}?

"""},
]
start = time.time()

profile_input_ids = tokenizer.apply_chat_template(
    profile_messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

profile_outputs = model.generate(
    profile_input_ids,
    max_new_tokens=1028,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
)
profile_response = profile_outputs[0][profile_input_ids.shape[-1]:]
print(tokenizer.decode(profile_response, skip_special_tokens=True))
end = time.time()
print(f"Time (seconds): {end - start}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Here is a comprehensive summary of the given text in JSON format:

{
"summary": "I'm seeking a talented and driven individual to join our team and accelerate our data-centric culture. As a key team member, this person will collaborate with cross-functional teams, including product, marketing, sales, engineering, and operations, to develop and deliver metrics, analyses, solutions, and insights with actionable recommendations for business partners. The ideal candidate will possess technical expertise and business acumen, with a passion for creative storytelling and timely action to drive impact."
}
Time (seconds): 68.82106018066406
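
The response above starts with an introductory sentence before the JSON object, so passing it straight to `json.loads` would fail. One simple workaround, sketched below, is to parse only the span from the first `{` to the last `}`. The helper name `extract_json` and the shortened sample string are made up for illustration, and the approach assumes the response contains a single top-level JSON object.

```python
import json


def extract_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in text as JSON."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


# Shortened, hypothetical stand-in for a model response with a preamble.
sample = (
    "Here is a summary in JSON format:\n\n"
    '{\n"summary": "I collaborate with cross-functional teams."\n}'
)
print(extract_json(sample)["summary"])
```

A response that is cut off before its closing brace still raises an error here (`json.JSONDecodeError` is a subclass of `ValueError`), so the caller can catch `ValueError` and retry the generation.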


### Spelling and Grammar Checker


In [11]:
check_grammar = """

 I'm a talented and driven individual who collaborates with cross-functional teams, including product to develop and 
 deliver metrics, analyses, solutions, and insights with actionable recommendations for business partners. 
 I possess technical expertise and business acumen, 
 with a passion for creative storytelling and timely action to drive impact.


"""

In [12]:
grammar_messages = [
{"role": "system", 
"content": """
You are a spelling and grammar checker that looks for mistakes and makes sentences more fluent. 
You take all the user’s input and autocorrect it. 
You provide improvements to enhance overall readability. 
Make sure that the tone is professional, concise yet informal. 
Text should flow properly for a human to grasp the information as quickly as possible. 
Keep it simple and without jargon. 
"""},
    
{"role": "user", 
"content": f"""
 
Rewrite the following 
{check_grammar}
"""},
]
start = time.time()

grammar_input_ids = tokenizer.apply_chat_template(
    grammar_messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

grammar_outputs = model.generate(
    grammar_input_ids,
    max_new_tokens=8000,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
)
grammar_response = grammar_outputs[0][grammar_input_ids.shape[-1]:]
print(tokenizer.decode(grammar_response, skip_special_tokens=True))
end = time.time()
print(f"Time (seconds): {end - start}")



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Here is a rewritten version of the text:

As a skilled and driven professional, I work effectively with cross-functional teams, including product, to develop and deliver actionable metrics, analyses, solutions, and insights that provide valuable recommendations for business partners. With a strong foundation in technical expertise and business acumen, I am passionate about crafting compelling stories and driving timely action to drive meaningful impact.

I made the following changes to improve the text:

* Changed "I'm a talented and driven individual" to "As a skilled and driven professional" to make the language more concise and professional.
* Changed "collaborates with cross-functional teams, including product to develop and deliver" to "work effectively with cross-functional teams, including product, to develop and deliver" to make the sentence structure more clear and concise.
* Changed "metrics, analyses, solutions, and insights with actionable recommendations" to "actionable me