<a href="https://colab.research.google.com/github/pdrobny/Potential_Talents/blob/main/HF_gemma_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face with Gemma after fine-tuning






## Setup

In [None]:
!pip install transformers sentence-transformers torch datasets accelerate peft
!pip install -U bitsandbytes trl


In [None]:
# import libraries
import logging
import random
import sys
import warnings

import numpy as np
import pandas as pd
import requests
import torch
from datasets import Dataset
from huggingface_hub import notebook_login
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from sentence_transformers import SentenceTransformer, util
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LlamaForCausalLM,
    LlamaTokenizer,
    TrainingArguments,
    pipeline,
    set_seed,
)
from trl import SFTTrainer

warnings.filterwarnings('ignore', category=UserWarning)

print(torch.__version__)

2.6.0+cu124


## Data prep

In [None]:
df = pd.read_csv('talents.csv')
df.head(20)

Unnamed: 0,id,title,sentence_bert_cossim
0,1,innovative and driven professional seeking a r...,1.0
1,431,aspiring data science professional focused on ...,0.769162
2,544,data analyst data scientist business analyst d...,0.768222
3,833,data analyst turning complex data into actiona...,0.745245
4,199,ms in information systems northeastern univers...,0.727268
5,28,aspiring data scientist passion for data-drive...,0.720545
6,1282,data scientist and analyst driving business in...,0.717432
7,426,master of science in analytics at georgia inst...,0.717093
8,963,passionate data scientist seeking exciting opp...,0.711745
9,487,research assistant penn state seeking opportun...,0.708959


In [None]:
job_titles = df["title"].tolist()


In [None]:
job_titles_short = df["title"].head(10).tolist()
job_titles_short

['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.',
 'aspiring data science professional focused on data analysis machine learning and data visualization actively seeking opportunities',
 'data analyst data scientist business analyst driving data-driven insights strategic solutions',
 'data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions',
 'ms in information systems northeastern university data scientist business intelligence data engineering data analyst transforming data into insights',
 'aspiring data scientist passion for data-driven decision making master of science in business analytics graduate  university of new hampshire',
 'data scientist and analyst driving business insights with advanced data techniques research expertise',
 'master of science in analytics at georgia institute of technology aspiring data scientist',
 'passion

In [None]:
job_ids_short = df["id"].head(10).tolist()
job_ids_short

[1, 431, 544, 833, 199, 28, 1282, 426, 963, 487]

In [None]:
target_title = "data analyst"
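
The prompts later in this notebook are written out by hand in two places, once for the training example and once for inference, and the two versions present the candidates differently (separate lists versus `ID: Title` pairs). A minimal sketch of a shared helper that keeps both prompts identical is shown here. The function name `build_ranking_prompt` and its `top_k` parameter are hypothetical and not part of the original notebook, and the wording is a shortened version of the notebook's prompt.

```python
def build_ranking_prompt(search_term, job_ids, job_titles, top_k=5):
    """Hypothetical helper: format one ranking prompt from parallel ID/title lists."""
    if len(job_ids) != len(job_titles):
        raise ValueError("job_ids and job_titles must have the same length")
    # One "id: title" line per candidate keeps each ID next to its own title.
    job_pairs = "\n".join(
        f"{job_id}: {title}" for job_id, title in zip(job_ids, job_titles)
    )
    return (
        f"Return a list of the top {top_k} job candidates with full unmodified job title "
        "and matching job id, ranked by their similarity to the search term in "
        "descending order. Only show the answer. Do not reason or explain.\n"
        f"**Search term**\n{search_term}\n\n"
        f"**Job candidates (ID: Title)**\n{job_pairs}\n\n"
        f"Answer: Top {top_k} are:\n"
    )


# Illustrative call with placeholder titles; the notebook would pass
# target_title, job_ids_short and job_titles_short instead.
example_prompt = build_ranking_prompt("data analyst", [1, 431], ["title a", "title b"])
print(example_prompt)
```

Using one function for both the training text and the inference prompt means the fine-tuned model sees the same layout at training time and at generation time.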

In [None]:
notebook_login()

## Fine-Tuning Model

In [None]:
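from transformers import BitsAndBytesConfig

model_id = "google/gemma-1.1-2b-it"  # Update based on your available resources
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the base model in 4-bit. Newer transformers releases expect the quantization
# settings in a BitsAndBytesConfig rather than a bare load_in_4bit flag.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)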

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# QLoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

data = [
    {
        "prompt": f"""
Return a list of the top 5 job candidates with full job title and job id from a job titles list, ranked by their similarity to the search term in descending order. Only show the answer. Do not reason or explain.
**Search term**
{target_title}

**job titles**
{job_titles_short}

**job ids**
{job_ids_short}


Show answer in following format:
Rank Job ID   Job Title
1 - 1: innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.
2 - ...
3 - ...
...
Answer: Top 5 are:
""",
        "response": """Rank Job ID   Job Title
1 - 833: data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions
2 - 431: aspiring data science professional focused on data analysis machine learning and data visualization actively seeking opportunities
3 - 1282: data scientist and analyst driving business insights with advanced data techniques research expertise
4 - 544: data analyst data scientist business analyst driving data-driven insights strategic solutions
5 - 199: ms in information systems northeastern university data scientist business intelligence data engineering data analyst transforming data into insights"""
    }
]
# Convert to Hugging Face Dataset format
dataset = Dataset.from_list([
    {"text": f"{item['prompt']}{item['response']}"} for item in data
])

# Training arguments (small run for testing; fp16 mixed precision needs a CUDA GPU)
training_args = TrainingArguments(
    output_dir="./gemma-cpu-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=30,  # Keep small for testing
    learning_rate=5e-5,
    logging_steps=5,
    save_steps=15,
    save_total_limit=2,
    fp16=True,
    bf16=False,
    report_to="none"
)

# Fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args
)

trainer.train()
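
To get a feel for how small the trained portion is, the LoRA settings above (`r=8` on `q_proj` and `v_proj`) can be turned into a parameter count. Each adapted linear layer gains two low-rank matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`. The sketch below is back-of-the-envelope arithmetic only: the layer shapes are assumed values for a Gemma 2B-class model and are not taken from this notebook, so they should be checked against `model.config`.

```python
def lora_param_count(r, d_in, d_out):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# ASSUMED shapes for a Gemma 2B-class model; verify against model.config.
hidden_size = 2048
num_layers = 18
q_out = 8 * 256  # assumed num_attention_heads * head_dim
v_out = 1 * 256  # assumed num_key_value_heads * head_dim

r = 8  # matches LoraConfig(r=8) in the cell above

per_layer = lora_param_count(r, hidden_size, q_out) + lora_param_count(r, hidden_size, v_out)
total_trainable = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_trainable:,}")
```

Under these assumed shapes the total is 921,600 trainable parameters. The actual figure for the loaded model can be read with `model.print_trainable_parameters()` after `get_peft_model`.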

In [None]:
merged_model_dir = "./gemma-cpu-finetune-merged"

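# Save the fine-tuned weights. trainer.model is a PEFT model, so save_pretrained
# writes only the LoRA adapter (plus adapter_config.json), not a merged checkpoint.
trainer.model.save_pretrained(merged_model_dir)

# Save the tokenizer to the same directory
tokenizer.save_pretrained(merged_model_dir)

# Reload from disk. With peft installed, transformers reads adapter_config.json,
# loads the base model it names, and attaches the adapter on top.
merged_tokenizer = AutoTokenizer.from_pretrained(merged_model_dir)
merged_model = AutoModelForCausalLM.from_pretrained(merged_model_dir)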




In [None]:
seed = random.randint(1000, 9999)
# seed = 7308  # uncomment to reuse a seed that gave good results
set_seed(seed)
print(seed)

# Create a combined string of job ID and job title pairs
job_pairs = "\n".join([f"{job_id}: {job_title}" for job_id, job_title in zip(job_ids_short, job_titles_short)])

# Build a text-generation pipeline from the model and tokenizer loaded in the previous cell
generator = pipeline('text-generation', model=merged_model, tokenizer=merged_tokenizer)


prompt = f"""
Return a list of the top 5 job candidates with full unmodified job title and matching job id from a job titles list, ranked by their similarity to the search term in descending order. Only show the answer. Do not reason or explain.
**Search term**
{target_title}

**Job candidates (ID: Title)**
{job_pairs}


Show answer in following format:
Rank Job ID   Job Title
1 - 1: innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.
2 - ...
3 - ...
...
Answer: Top 5 are:
"""

output = generator(prompt, max_new_tokens=200, num_return_sequences=1)
print(output[0]['generated_text'])

Device set to use cuda:0


7308

Return a list of the top 5 job candidates with full unmodified job title and matching job id from a job titles list, ranked by their similarity to the search term in descending order. Only show the answer. Do not reason or explain.
**Search term**
data analyst

**Job candidates (ID: Title)**
1: innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.
431: aspiring data science professional focused on data analysis machine learning and data visualization actively seeking opportunities
544: data analyst data scientist business analyst driving data-driven insights strategic solutions
833: data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions
199: ms in information systems northeastern university data scientist business intelligence data engineering data analyst transforming data into insights
28: aspiring data scientist passion for data-driven

In [None]:
generated_text = output[0]['generated_text']
answer_start_index = generated_text.find("Answer:")
# Extract the answer and print it
if answer_start_index != -1:
    # Add the length of the search string to get the true start of the answer
    answer_start_index += len("Answer:")
    print(generated_text[answer_start_index:].strip())
else:
    # If the marker was not found, print the whole output or a message
    print("Could not find the start of the answer in the generated text.")
    print(generated_text)

Top 5 are:
1 - 833: data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions
2 - 28: aspiring data scientist passion for data-driven decision making master of science in business analytics graduate  university of new hampshire
3 - 199: ms in information systems northeastern university data scientist business intelligence data engineering data analyst transforming data into insights
4 - 487: research assistant penn state seeking opportunities in the data field data analyst with experience at sritech software expertise in machine learning data evaluation passionate about transforming data into insights
5 - 426: master of science in analytics at georgia institute of technology aspiring data scientist
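
Because the model is asked to copy each job ID together with its unmodified title, the printed answer can be checked mechanically against the candidate list. The following is a minimal sketch of that check, assuming the answer keeps the `rank - id: title` line format shown above. The names `parse_ranking` and `find_mismatches` are hypothetical helpers, and the candidate dictionary here is a shortened illustration rather than the notebook's data.

```python
import re

# Matches lines such as "1 - 833: data analyst turning complex data ..."
RANK_LINE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*:\s*(.+?)\s*$")


def parse_ranking(answer_text):
    """Return (rank, job_id, title) tuples for every ranking line in the answer."""
    rows = []
    for line in answer_text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return rows


def find_mismatches(rows, id_to_title):
    """Return rows whose ID is unknown or whose title differs from the source title."""
    return [row for row in rows if id_to_title.get(row[1]) != row[2]]


# Shortened illustrative data; in the notebook this could be
# dict(zip(job_ids_short, job_titles_short)).
id_to_title = {833: "data analyst turning complex data", 28: "aspiring data scientist"}
answer = """Top 5 are:
1 - 833: data analyst turning complex data
2 - 28: data analyst turning complex data"""

rows = parse_ranking(answer)
mismatches = find_mismatches(rows, id_to_title)
print(rows)
print(mismatches)  # the second row pairs ID 28 with the wrong title
```

The comparison is an exact string match, so differences in whitespace or truncated titles also count as mismatches. A looser comparison would need its own normalization step.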
