<a href="https://colab.research.google.com/github/pdrobny/Potential_Talents/blob/main/P3_gemma_FT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Potential Talent






## Background

#### As a talent sourcing and management company, we find talented individuals and source them to technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role being filled. Third, knowing where to find talented individuals is a challenge in its own right.

#### Our work requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role we want a machine-learning-powered pipeline that spots talented individuals and ranks them by fitness.

#### We are currently sourcing a few candidates semi-automatically, so sourcing is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally search using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

#### Once fitting candidates are listed and ranked, we apply a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. When that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate marks them as an ideal candidate for the given role, and we expect the list to be re-ranked each time a candidate is starred.

## Goals
#### - Predict how fit a candidate is for the role based on their available information (variable `fit`).
#### - Rank candidates by fitness score.
#### - Re-rank candidates whenever a candidate is starred.
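
The three goals can be illustrated with a minimal, dependency-free sketch. It assumes a bag-of-words cosine similarity as a stand-in for real sentence embeddings, and it assumes one simple re-ranking rule: blend each candidate's similarity to the search term with its similarity to the starred title, weighted by a hypothetical `star_weight`. The names `bow`, `cosine`, `rank`, and `rerank_with_star`, and the toy candidates, are illustrative only and not part of the original project.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (a stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(candidates, query):
    """Return (id, title, fit) tuples sorted by fit to the query, best first."""
    q = bow(query)
    scored = [(cid, title, cosine(bow(title), q)) for cid, title in candidates]
    return sorted(scored, key=lambda row: row[2], reverse=True)


def rerank_with_star(candidates, query, starred_title, star_weight=0.5):
    """Blend similarity to the query with similarity to the starred title."""
    q, s = bow(query), bow(starred_title)
    scored = []
    for cid, title in candidates:
        t = bow(title)
        fit = (1 - star_weight) * cosine(t, q) + star_weight * cosine(t, s)
        scored.append((cid, title, fit))
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Toy candidates (hypothetical ids and titles, for illustration only)
candidates = [
    (1, "data analyst"),
    (2, "senior data scientist"),
    (3, "machine learning data scientist"),
    (4, "medical biller"),
]

initial = rank(candidates, "data analyst")
reranked = rerank_with_star(
    candidates, "data analyst", "machine learning data scientist", star_weight=0.7
)
print([cid for cid, _, _ in initial])
print([cid for cid, _, _ in reranked])
```

In this toy run, starring candidate 3 moves it to the top of the list, while candidates unrelated to both the search term and the starred title stay at the bottom. The blending rule is only one possible choice.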

## Setup

In [None]:
!pip install transformers sentence-transformers torch
!pip install -U bitsandbytes
!pip install datasets
!pip install accelerate
!pip install peft
!pip install -U trl


In [None]:
# import libraries
import pandas as pd
import numpy as np
import warnings
import logging
import random
import requests
import sys
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM, LlamaTokenizer, set_seed, TrainingArguments
from huggingface_hub import notebook_login
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
warnings.filterwarnings('ignore', category=UserWarning)

print(torch.__version__)

2.6.0+cu124


## Data prep

In [None]:
df = pd.read_csv('talents.csv')
df

| Unnamed: 0 | id | title | sentence_bert_cossim |
|---|---|---|---|
| 0 | 1 | innovative and driven professional seeking a r... | 1.000000 |
| 1 | 431 | aspiring data science professional focused on ... | 0.769162 |
| 2 | 544 | data analyst data scientist business analyst d... | 0.768222 |
| 3 | 833 | data analyst turning complex data into actiona... | 0.745245 |
| 4 | 199 | ms in information systems northeastern univers... | 0.727268 |
| ... | ... | ... | ... |
| 1260 | 648 | research specialist university of rochester di... | 0.079923 |
| 1261 | 730 | medical biller at brick pediatric group | 0.072848 |
| 1262 | 990 | ingeniero elctrico | 0.067254 |
| 1263 | 296 | company owner at armstrong cleans carpets | 0.056890 |


In [None]:
job_titles = df["title"].tolist()


In [None]:
job_titles_short = df["title"].head(10).tolist()
job_titles_short

['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.',
 'aspiring data science professional focused on data analysis machine learning and data visualization actively seeking opportunities',
 'data analyst data scientist business analyst driving data-driven insights strategic solutions',
 'data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions',
 'ms in information systems northeastern university data scientist business intelligence data engineering data analyst transforming data into insights',
 'aspiring data scientist passion for data-driven decision making master of science in business analytics graduate  university of new hampshire',
 'data scientist and analyst driving business insights with advanced data techniques research expertise',
 'master of science in analytics at georgia institute of technology aspiring data scientist',
 'passion

In [None]:
job_ids_short = df["id"].head(10).tolist()
job_ids_short

[1, 431, 544, 833, 199, 28, 1282, 426, 963, 487]
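
Because the titles and ids are kept in two parallel lists, it can help to pair them explicitly before using them, so that each id sits next to its own title. The sketch below is a hypothetical helper (`format_candidates` is not part of the original notebook) and uses placeholder titles rather than the real data.

```python
def format_candidates(ids, titles):
    """Render parallel id/title lists as one 'id: title' line per candidate."""
    if len(ids) != len(titles):
        raise ValueError("ids and titles must have the same length")
    return "\n".join(f"{cid}: {title}" for cid, title in zip(ids, titles))


# Placeholder data for illustration (not the real dataset)
example_ids = [1, 431, 544]
example_titles = ["title a", "title b", "title c"]

candidate_block = format_candidates(example_ids, example_titles)
print(candidate_block)
```

With the notebook's own variables, the call would presumably be `format_candidates(job_ids_short, job_titles_short)`.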

In [None]:
target_title = "data analyst"
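
The search term, titles, and ids are later interpolated into a prompt, once for fine-tuning and once for inference. A minimal sketch of a shared prompt builder follows, so that both places could use identical wording. `build_prompt` and its `top_k` parameter are hypothetical, and the template is a shortened paraphrase of the notebook's prompt rather than its exact text.

```python
def build_prompt(search_term, titles, ids, top_k=5):
    """Assemble one ranking prompt from a search term and parallel title/id lists."""
    return (
        f"Return a list of the top {top_k} job candidates with full job title and job id, "
        "ranked by their similarity to the search term in descending order. "
        "Only show the answer. Do not reason or explain.\n"
        f"**Search term**\n{search_term}\n\n"
        f"**job titles**\n{titles}\n\n"
        f"**job ids**\n{ids}\n\n"
        f"Answer: Top {top_k} are:\n"
    )


# Placeholder inputs for illustration (not the real dataset)
demo_prompt = build_prompt("data analyst", ["title a", "title b"], [1, 431])
print(demo_prompt)
```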

In [None]:
notebook_login()


## Fine-Tuning Model

In [None]:
model_id = "google/gemma-1.1-2b-it"  # Update based on your available resources
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

data = [
    {
        "prompt": f"""
Return a list of the top 5 job candidates with full job title and job id from a job titles list ranked by their similarity to the search term in descending order. Only show the answer. Do not reason or explain.
**Search term**
{target_title}

**job titles**
{job_titles_short}

**job ids**
{job_ids_short}


Show answer in following format:
Rank Job ID   Job Title
1 - 1: innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.
2 - ...
3 - ...
...
Answer: Top 5 are:
""",
        "response": """Rank Job ID   Job Title
1 - 833: data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions
2 - 431: aspiring data science professional focused on data analysis machine learning and data visualization actively seeking opportunities
3 - 1282: data scientist and analyst driving business insights with advanced data techniques research expertise
4 - 544: data analyst data scientist business analyst driving data-driven insights strategic solutions
5 - 199: ms in information systems northeastern university data scientist business intelligence data engineering data analyst transforming data into insights"""
    }
]
# Convert to Hugging Face Dataset format
dataset = Dataset.from_list([
    {"text": f"{item['prompt']}{item['response']}"} for item in data
])

# Training arguments (CPU-friendly)
training_args = TrainingArguments(
    output_dir="./gemma-cpu-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=30,  # Keep small for testing
    learning_rate=5e-5,
    logging_steps=5,
    save_steps=15,
    save_total_limit=2,
    fp16=False,
    bf16=False,
    report_to="none",
    use_cpu=True  # replaces the deprecated no_cuda=True
)

# Fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args
)

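trainer.train()

# Save the final model and tokenizer so they can be reloaded from output_dir
trainer.save_model(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)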


In [None]:
model_id = "./gemma-cpu-finetune"  # output_dir used during fine-tuning
generator = pipeline('text-generation', model=model_id)
set_seed(42)
prompt = f"""
Return a list of the top 5 job candidates with full job title and job id from a job titles list ranked by their similarity to the search term in descending order. Only show the answer. Do not reason or explain.
**Search term**
{target_title}

**job titles**
{job_titles_short}

**job ids**
{job_ids_short}


Show answer in following format:
Rank Job ID   Job Title
1 - [job id]: [job title]

Example of answer:
Rank Job ID   Job Title
1 - 1: innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.
2 - ...
...
Answer: Top 5 are:
"""
output = generator(prompt, max_new_tokens=200, num_return_sequences=1)
print(output[0]['generated_text'])

In [None]:
generated_text = output[0]['generated_text']
answer_start_index = generated_text.find("Answer:")
# Extract the answer and print it
if answer_start_index != -1:
    # Add the length of the search string to get the true start of the answer
    answer_start_index += len("Answer:")
    print(generated_text[answer_start_index:].strip())
else:
    # If the marker was not found, print the whole output or a message
    print("Could not find the start of the answer in the generated text.")
    print(generated_text)
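
The extracted answer is still free text. A minimal sketch of turning it into structured rows follows, assuming the model keeps to the `rank - job id: job title` line format requested in the prompt. `parse_ranking`, `LINE_RE`, and the sample string are hypothetical and only illustrate the idea. Ids that are not in the known id list are dropped, since a generative model may produce ids that were never in the input.

```python
import re

# Matches lines such as "1 - 833: data analyst turning complex data ..."
LINE_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*:\s*(.+?)\s*$")


def parse_ranking(answer_text, known_ids):
    """Parse 'rank - id: title' lines, keeping only ids present in known_ids."""
    rows = []
    for line in answer_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip headers such as "Rank Job ID   Job Title"
        rank, job_id, title = int(match.group(1)), int(match.group(2)), match.group(3)
        if job_id in known_ids:
            rows.append({"rank": rank, "id": job_id, "title": title})
    return rows


# Hypothetical model output, for illustration only
sample_answer = """Rank Job ID   Job Title
1 - 833: data analyst example title
2 - 431: another example title
3 - 9999: title with an id that is not in the list"""

parsed = parse_ranking(sample_answer, known_ids={1, 431, 544, 833, 199})
print(parsed)
```

In the notebook, the text after the `Answer:` marker and `set(job_ids_short)` would be the natural inputs.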