<a href="https://colab.research.google.com/github/nxxk23/AI-Engineer/blob/main/extract/gradioprompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install easyocr gradio pythainlp langchain langchain_huggingface langchain_community pytesseract transformers
!sudo apt-get install -y ghostscript

> [Sample resumes (Drive link)](https://drive.google.com/drive/folders/1aoFrz_k9ngVXesYZGIrTaiPQJTOBQFEr?usp=sharing). You can skip this if you already have your own resume.

In [None]:
import os
import re
import numpy as np
import pandas as pd
import easyocr
import subprocess
from pythainlp.phayathaibert.core import NamedEntityTagger
import gradio as gr
from PIL import Image
import requests
from threading import Semaphore
from concurrent.futures import ThreadPoolExecutor, as_completed
from huggingface_hub import InferenceClient
import pytesseract
from nltk.tokenize import sent_tokenize
import uuid

In [10]:
resume_prompt = """
# Resume Evaluation Prompt

**You are a career coach and resume evaluator**, tasked with thoughtfully assessing a candidate's resume based on a specific job description. Your goal is to strike a balance between the candidate’s qualifications and their potential for growth. Keep in mind that real-world hiring involves not only meeting immediate requirements but also identifying long-term potential and adaptability.

## Instructions:
- **Focus primarily on the job description**, but also consider any unique skills and experience that the candidate offers.
- Your evaluation should prioritize candidates who align well with the role but remain open to those who might provide unexpected strengths.
- **Don’t rely solely on a perfect match**—instead, take a holistic view of the candidate to assess their overall fit, potential, and growth trajectory.

### Evaluation Criteria:

1. **Strong Fit (PASS)**:
   - The candidate aligns well with the job requirements and demonstrates strong potential to excel, even if some qualifications are missing.
   - Consider their **unique strengths**, **relevant experience**, and **transferable skills**.
   - If they show adaptability and a strong overall fit, classify them as "Strong Fit."

2. **Promising (RECONSIDER)**:
   - The candidate aligns with some aspects of the job but has notable gaps.
   - However, their experience suggests they could grow into the role.
   - Consider them as a candidate with potential, particularly if their overall background suggests they would be worth interviewing or reviewing further.

3. **Not a Fit (UNRELATED)**:
   - The candidate’s background does not align with the job description and there is no significant potential for the role.
   - Their experience, while potentially strong in other areas, is unrelated to the requirements for this position.

## Evaluation Process:

1. **Step 1**: Carefully review the **job description** and extract the core requirements and preferred qualifications.
2. **Step 2**: Compare the **resume** to the job description:
   - Identify relevant qualifications and skills.
   - Note any **transferable skills** or **unique strengths** that the candidate brings.
3. **Step 3**: Consider the candidate's **overall potential**, adaptability, and long-term growth.
4. **Step 4**: Classify the resume and provide feedback.

## Output Instructions:
- The output must **only** contain the following sections in the exact order and format. Any deviation from this format is not allowed.
  - **Classification**: Pass / Reconsider / Unrelated
  - **Rationale**: A clear and concise explanation for your classification.
  - **ข้อดี (Strengths)**: A summary of the candidate's strengths.
  - **ข้อควรปรับปรุง (Areas for Improvement)**: A summary of areas where the candidate could improve or is lacking.

- **Do not** include a detailed analysis of the job description or resume in the output.
- The output should **start** with the **Classification** and **not include** resume details or job description.

## Example Format for the Output:

**Classification**: Pass / Reconsider / Unrelated
**Rationale**: Provide a concise explanation, focusing on the candidate's fit, transferable skills, and potential.
**ข้อดี (Strengths)**:
- List key strengths of the candidate.
**ข้อควรปรับปรุง (Areas for Improvement)**:
- Summarize areas where the candidate could improve.

---

## **Job Description** (never copy it):
{jobdescribe}

## **Resume** (never copy it):
{resume}

"""


- **`convert_pdf_to_images`**: Converts a PDF file into high-resolution PNG images using Ghostscript.
- **`extract_text_from_image`**: Extracts text from images using easyocr for both Thai and English languages.
- **`chunk_text`**: Splits large text into smaller chunks based on a token limit for model processing.
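- **`tag_and_clean_text`**: Tags each chunk with PyThaiNLP's PhayaThaiBERT `NamedEntityTagger`, drops tagged entities other than ORGANIZATION, PERCENT, and TIME, and strips unwanted terms such as contact-field labels.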
- **`generate_answer`**: Generates an evaluation (classification and rationale) from a resume and job description.
- **`extract_classification`**: Extracts the classification result (Pass, Reconsider, Unrelated) from generated text.
- **`process_file`**: Processes a single resume file by extracting text and generating an evaluation.
- **`process_multiple_files`**: Processes multiple resume files in batch, applying the evaluation function to each.
- **`save_dataframe`**: Saves the processed evaluation results to a CSV file, appending new data.
- **`generate_prompt_answer_optimized`**: Reads the cleaned resume text and job description from one DataFrame row and returns the model's evaluation.
- **`process_batch`**: Evaluates every row in a batch sequentially.
- **`batch_process`**: Splits the DataFrame into batches and evaluates them in parallel with ThreadPoolExecutor, keeping results in row order.
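
To make the chunking step concrete, here is a minimal sketch of fixed-size token windows in the style of `chunk_text`. The `WhitespaceTokenizer` class is a hypothetical stand-in invented for this example; the notebook itself passes the tokenizer that comes with the NER tagger, which splits text differently.

```python
from typing import List


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer (an assumption, not the PyThaiNLP one)."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return " ".join(tokens)


def chunk_text(text, tokenizer, max_tokens=200):
    # Tokenize once, then cut the token list into consecutive windows
    tokens = tokenizer.tokenize(text)
    return [
        tokenizer.convert_tokens_to_string(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


sample = "one two three four five six seven"
chunks = chunk_text(sample, WhitespaceTokenizer(), max_tokens=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

The windows do not overlap, so an entity that straddles a chunk boundary may be split between two chunks.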

In [11]:
# Define the model parameters
model_params = {
    "max_new_tokens": 500,
    "temperature": 0.1,
    "top_p": 0.95,
    "repetition_penalty": 1.0
}
semaphore = Semaphore(50)

def convert_pdf_to_images(pdf_file, output_folder="/content/images", dpi=300):
    try:
        # Create a unique folder for each PDF
        pdf_filename = os.path.splitext(os.path.basename(pdf_file))[0]
        unique_output_folder = os.path.join(output_folder, pdf_filename)

        if not os.path.exists(unique_output_folder):
            os.makedirs(unique_output_folder)

        output_format = os.path.join(unique_output_folder, "page_%03d.png")
        gs_command = [
            "gs",
            "-sDEVICE=png16m",
            f"-r{dpi}",
            "-o", output_format,
            pdf_file
        ]
        subprocess.run(gs_command, check=True)

        # Collect all generated images in a list
        image_files = sorted([os.path.join(unique_output_folder, f) for f in os.listdir(unique_output_folder) if f.endswith(".png")])
        return image_files
    except subprocess.CalledProcessError as e:
        print(f"Ghostscript error: {e}")
        return []

def extract_text_from_image(image_path):
    reader = easyocr.Reader(['th', 'en'], gpu=True)
    image = Image.open(image_path)
    image_np = np.array(image)
    result = reader.readtext(image_np)
    sorted_data = sorted(result, key=lambda x: x[0][0][1])
    plain_text = "\n".join([text for _, text, _ in sorted_data])
    return plain_text

def chunk_text(text, tokenizer, max_tokens=200):
    tokens = tokenizer.tokenize(text)
    total_tokens = len(tokens)
    chunks = []
    start = 0
    while start < total_tokens:
        end = min(start + max_tokens, total_tokens)
        chunk = tokenizer.convert_tokens_to_string(tokens[start:end])
        chunks.append(chunk)

        start += max_tokens

    return chunks

def tag_and_clean_text(text, tagger, tokenizer, unwanted_pattern, max_tokens=200):
    text_chunks = chunk_text(text, tokenizer, max_tokens=max_tokens)
    tagged_text = []
    cleaned_text = []

    for i, chunk in enumerate(text_chunks):
        ner = tagger.get_ner(chunk, tag=True)
        if not ner:
            ner = chunk

        # Clean the tags
        pattern = r'<(?!ORGANIZATION|PERCENT|TIME)[^>]+>[^<]*?</[^>]+>'
        cleaned_ner = re.sub(pattern, '', ner)
        cleaned_ner = re.sub(r'</?(ORGANIZATION|PERCENT|TIME)>', '', cleaned_ner)
        cleaned_ner = re.sub(unwanted_pattern, '', cleaned_ner, flags=re.IGNORECASE)
        cleaned_ner = re.sub(r'\bal\b', 'ai', cleaned_ner, flags=re.IGNORECASE)
        # Append the tagged and cleaned text
        tagged_text.append(ner)
        cleaned_text.append(cleaned_ner)
    # Combine results from all chunks
    combined_tagged_text = "\n".join(tagged_text).strip()
    combined_cleaned_text = "\n".join(cleaned_text).strip()

    return combined_tagged_text, combined_cleaned_text


# Function to generate an answer using the prompt
def generate_answer(resume, job_description):
    # Fill both placeholders; the full prompt is sent as-is (no truncation is applied)
    formatted_prompt = resume_prompt.replace("{resume}", resume).replace("{jobdescribe}", job_description)
    client = InferenceClient('https://ai-api.manageai.co.th/llm-model-03/')
    response = client.text_generation(formatted_prompt, **model_params)
    output = "".join(response)
    return output

# Extract classification from the result text
def extract_classification(text):
    pattern = r'\*\*\s*Classification\s*\**:\s*(?:\[)?\s*(Pass|Reconsider|Not Pass|Unrelated)\s*(?:\])?'
    match = re.search(pattern, text, re.IGNORECASE)
    if not match:
        pattern = r'Classification\s*:\s*(?:\[)?\s*(Pass|Reconsider|Not Pass|Unrelated)\s*(?:\])?'
        match = re.search(pattern, text, re.IGNORECASE)
    if match:
        return match.group(1)
    return text

def process_file(file, job_description, is_pdf=True):
    if is_pdf:
        images = convert_pdf_to_images(file.name)
        raw_text = ""
        for image_path in images:
            raw_text += extract_text_from_image(image_path) + "\n"
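    else:
        # A single image is its own preview; define `images` so the "Images" column below works
        images = [file.name]
        raw_text = extract_text_from_image(file.name)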

    tagger = NamedEntityTagger()
    tokenizer = tagger.tokenizer

    unwanted_terms = [
        'ที่อยู่', 'โทรศัพท์', 'อีเมล', 'linkedin', ':', ',', '-', '|',
        'ประวัติส่วนตัว', 'เกี่ยวกับฉัน', 'about me', 'ชื่อ', 'สกุล', 'tell', 'โทร', 'โทรงาน',
        'ชื่อเล่น', 'อายุ', 'วันเกิด', 'พุทธ', 'ศาสนา', 'สัญชาติ', 'phone',
        'ช่องทางการติดต่อ', '_', 're sume', 'resume', 'resu me', 'birth', 'date',
        'address', 'email.', 'ประวัติ'
    ]
    unwanted_pattern = '|'.join(map(re.escape, unwanted_terms))

    tagged_text, cleaned_text = tag_and_clean_text(raw_text, tagger, tokenizer, unwanted_pattern)
    evaluation = generate_answer(cleaned_text, job_description)
    result = extract_classification(evaluation)

    df = pd.DataFrame([{
        "File": os.path.basename(file.name),
        "Raw_Text": raw_text,
        "Tagged_Text": tagged_text,
        "Cleaned_Text": cleaned_text,
        "Job_Description": job_description,
        "Evaluation": evaluation,
        "Result": result,
        "Images": images
    }])
    return df

# Process each uploaded file in turn and stack the per-file rows into one DataFrame
def process_multiple_files(files, job_description, is_pdf=True):
    all_results = pd.DataFrame()
    for file in files:
        df = process_file(file, job_description, is_pdf)
        all_results = pd.concat([all_results, df], ignore_index=True)
    return all_results

# Append results to the CSV without loading the existing file
def save_dataframe(df, save_path='/content/output.csv'):
    try:
        # Append new data directly to the CSV without reloading existing data
        df.to_csv(save_path, mode='a', header=not os.path.exists(save_path), index=False, encoding='utf-8-sig')
        return save_path
    except Exception as e:
        print(f"Error saving DataFrame: {e}")
        return ""

# Generate an evaluation for a single DataFrame row
def generate_prompt_answer_optimized(row):
    resume = row.get('Cleaned_Text', "")
    job_description = row.get('Job_Description', "")
    result = generate_answer(resume, job_description)
    return result

# Evaluate every row in a batch sequentially
def process_batch(batch_df):
    return [generate_prompt_answer_optimized(row) for _, row in batch_df.iterrows()]

# Batch processing with threading
def batch_process(df, batch_size=32):
    results = [None] * len(df)
    with ThreadPoolExecutor() as executor:
        futures = {}
        for i in range(0, len(df), batch_size):
            batch_df = df.iloc[i:i + batch_size]
            future = executor.submit(process_batch, batch_df)
            futures[future] = (i, i + batch_size)

        for future in as_completed(futures):
            start_idx, end_idx = futures[future]
            batch_results = future.result()
            results[start_idx:end_idx] = batch_results

    return results
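
The threaded batching in `batch_process` can be illustrated without the model. The sketch below follows the same pattern (submit one task per batch, collect with `as_completed`, write each batch back at its starting index) and shows that results stay in input order even when later batches finish first. `fake_evaluate` is a hypothetical stand-in for the LLM call, and plain lists replace the DataFrame.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_evaluate(item):
    # Hypothetical stand-in for the LLM call; later items finish sooner
    time.sleep(0.01 * (5 - item))
    return f"eval-{item}"


def process_batch(batch):
    # Items inside one batch are evaluated one after another
    return [fake_evaluate(x) for x in batch]


def batch_process(items, batch_size=2):
    results = [None] * len(items)
    with ThreadPoolExecutor() as executor:
        futures = {}
        for i in range(0, len(items), batch_size):
            batch = items[i:i + batch_size]
            # Remember where this batch starts so it can be written back in place
            futures[executor.submit(process_batch, batch)] = i
        for future in as_completed(futures):
            start = futures[future]
            batch_results = future.result()
            results[start:start + len(batch_results)] = batch_results
    return results


ordered = batch_process([0, 1, 2, 3, 4])
print(ordered)  # ['eval-0', 'eval-1', 'eval-2', 'eval-3', 'eval-4']
```

Parallelism here is across batches only, so a smaller `batch_size` gives the thread pool more independent tasks to run at once.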

## **Gradio interface**

In [12]:
def gradio_interface(files, job_description, is_pdf, max_new_tokens, temperature, top_p, repetition_penalty):
    try:
        # Update model parameters
        model_params.update({
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "repetition_penalty": repetition_penalty
        })

        df = process_multiple_files(files, job_description, is_pdf)
        df['Evaluation'] = batch_process(df, batch_size=32)
        df['Result'] = df['Evaluation'].apply(extract_classification)

        # Ensure the 'Images' column is in the DataFrame
        if 'Images' not in df.columns:
            raise ValueError("The 'Images' column is missing from the DataFrame!")

        csv_path = save_dataframe(df)
        df_output = df[["File", "Result", "Evaluation", "Images"]]  # Output all required columns for Gradio
        return df_output, csv_path
    except Exception as e:
        print(f"Error in gradio_interface: {e}")
        return pd.DataFrame(), ""

In [13]:
# Async function to process row selection without unnecessary reloads
async def show_evaluation(evt: gr.SelectData):
    row_index = evt.index[0]
    selected_eval = df.iloc[row_index]["Evaluation"]
    image_files = df.iloc[row_index]["Images"]
    first_image = image_files[0] if image_files else None
    return str(selected_eval), first_image

# Gradio interface creation with three-column layout
def create_gradio_interface():
    custom_theme = gr.themes.Default()

    with gr.Blocks(theme=custom_theme) as demo:
        with gr.Row():
            # Left Section for Inputs
            with gr.Column(scale=1):
                file_input = gr.Files(label="Select PDF Files", file_count="multiple")
                job_input = gr.Textbox(placeholder="Enter job description...", label="Job Description", lines=5)
                submit_btn = gr.Button("Process")
                is_pdf_input = gr.Checkbox(label="PDF", value=True)
                max_new_tokens_input = gr.Slider(label="Max New Tokens", minimum=50, maximum=1024, value=300, step=50)
                temperature_input = gr.Slider(label="Temperature", minimum=0.01, maximum=1.0, value=0.1, step=0.1)
                top_p_input = gr.Slider(label="Top P", minimum=0.1, maximum=1.0, value=0.95, step=0.05)
                repetition_penalty_input = gr.Slider(label="Repetition Penalty", minimum=0.1, maximum=2.0, value=1.0, step=0.1)

            # Center Section for Results
            with gr.Column(scale=1):
                output_df = gr.DataFrame(
                    headers=["File", "Result"],
                    type="pandas",
                    interactive=False,
                )
                output_csv = gr.DownloadButton(label="Download CSV")

            # Right Section for Evaluation and Preview
            with gr.Column(scale=1):
                eval_text = gr.Textbox(label="Evaluation Detail", interactive=False, max_lines=15)
                resume_image = gr.Image(label="Resume Preview", interactive=False)

        # Function to process inputs and generate output
        def on_submit(files, job_description, is_pdf, max_new_tokens, temperature, top_p, repetition_penalty):
            global df
            df, csv_path = gradio_interface(files, job_description, is_pdf, max_new_tokens, temperature, top_p, repetition_penalty)
            return df[["File", "Result"]], csv_path

        # Bind submit button to the processing function
        submit_btn.click(on_submit,
                         inputs=[file_input, job_input, is_pdf_input, max_new_tokens_input, temperature_input, top_p_input, repetition_penalty_input],
                         outputs=[output_df, output_csv])

        # Bind row selection to evaluation display function
        output_df.select(show_evaluation, outputs=[eval_text, resume_image])

    return demo

In [14]:
app = create_gradio_interface()
app.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://5eb91faf0446cfd25f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


  net.load_state_dict(copyStateDict(torch.load(trained_model, map_location=device)))
  model.load_state_dict(torch.load(model_path, map_location=device))
Token indices sequence length is longer than the specified maximum sequence length for this model (1078 > 510). Running this sequence through the model will result in indexing errors
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://5eb91faf0446cfd25f.gradio.live




In [None]:
## Software Engineer (https://th.jobsdb.com/job/78728644?type=standout&ref=search-standalone#sol=56ea90da2ce55925ca5b5294f7dc6009051ee403)
## Business Analysis (https://th.jobsdb.com/Business-Analysis-jobs?jobId=78724436&type=standout)
## AI Engineer (https://th.jobsdb.com/AI-Engineer-jobs?jobId=78686025&type=standout)

In [None]:
import pandas as pd
a = pd.read_csv('/content/output.csv')
a

Unnamed: 0,File,Raw_Text,Tagged_Text,Cleaned_Text,Job_Description,Evaluation,Result,Images
0,resume dev1.pdf,thaaphoom babparn\n software engineer\npathum ...,<LOCATION>tha</LOCATION><ORGANIZATION>a</ORGAN...,aparn software engineer accenture application ...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n**Classification**: Reconsider\n**Ratio...,Reconsider,['/content/images/resume dev1/page_001.png']
1,resume hr1.pdf,resume\n ประวัติส่วนตัว\nadora mondmimi\nอโดรา...,resume ประวัติส่วนตัว<PERSON> adora</PERSON><P...,ข้อมูลติดต่อ เกียวกับฉัน เล่น st. 123 a...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n## **Evaluation**:\n\n**Classification*...,Unrelated,['/content/images/resume hr1/page_001.png']
2,resume ai1.pdf,สมหญิง\n แก้วกาจญ์\n วิศวกรปัญญาประดิษฐ์\nal ท...,<PERSON>ส</PERSON><PERSON>สม</PERSON><PERSON>ห...,วิศวกรปัญญาประดิษฐ์ ai ที่มีประสบการณ์ 3 ปีในก...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n**Classification**: Reconsider\n**Ratio...,Reconsider,['/content/images/resume ai1/page_001.png']
3,resume ai3.pdf,re sume\nประวัติส่วนตัว\n090-123-4567\n pimcha...,re sume ประวัติส่วนตัว<PHONE> 09</PHONE><PHONE...,ผู้ เชี่ยวชาญด้าน ai ที่มีประสบการณ์ 4 เขี ในด...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n**Classification**: Unrelated\n**Ration...,Unrelated,['/content/images/resume ai3/page_001.png']
4,resume ai6.pdf,linkedin:\nนภัสสร วิวัฒนาวงศ์\nlinkedin.com i...,<URL>linkedin</URL>:<PERSON> </PERSON><PERSON>...,in วิศวกรปัญญาประดิษฐ์ ประสบการณ์การทํางาน วิ...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n### **Evaluation**:\n\n**Classification...,Reconsider,['/content/images/resume ai6/page_001.png']
5,resume ai7.pdf,ปัญญา วิริยะชัย\n วิศวกรปัญญาประดิษฐ์\nประสบกา...,<PERSON>ป</PERSON><PERSON>ปัญญา วิริยะชัย</PER...,วิศวกรปัญญาประดิษฐ์ ประสบการณ์การฝึกงาน บัณฑิต...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n### **Evaluation**:\n\n**Classification...,Unrelated,['/content/images/resume ai7/page_001.png']
6,Resume ba3.pdf,"ธนกร อินทรีย์พงษ์\n ที่อยู่:\n456 ถนนพระราม 3,...",<PERSON>ธ</PERSON><PERSON>ธน</PERSON><PERSON>ก...,เป้าหมายในการทํางาน มุ่งมันที่จะใช้ความรู้และท...,**Software Engineer (Junior/Senior/Specialist)...,---\n\n**Classification**: Unrelated\n**Ration...,Unrelated,['/content/images/Resume ba3/page_001.png']
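
When the CSV is read back, the `Images` column holds text such as `['/content/images/resume dev1/page_001.png']` rather than a Python list, because the list was written out as its string form. A minimal sketch of turning it back into a list with `ast.literal_eval` follows; the two-column excerpt is made up for this example, and the standard-library `csv` module stands in for pandas so the snippet runs anywhere.

```python
import ast
import csv
import io

# Hypothetical two-column excerpt in the same shape as output.csv
csv_text = (
    "File,Images\n"
    "resume dev1.pdf,['/content/images/resume dev1/page_001.png']\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # literal_eval parses the list literal safely, without executing code
    row["Images"] = ast.literal_eval(row["Images"])

first_page = rows[0]["Images"][0]
print(first_page)  # /content/images/resume dev1/page_001.png
```

With pandas, the equivalent step would be along the lines of `a["Images"] = a["Images"].apply(ast.literal_eval)`.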


## **Pass**
Classify as `Pass` only when the resume suggests the candidate has potential and can apply their background to the job description, or meets most of the requirements.
## **Reconsider**
Classify as `Reconsider` only when the resume suggests the candidate is likely to be able to meet the requirements; a follow-up interview is recommended.
## **Unrelated**
Classify as `Unrelated` only when the resume appears unrelated to the job description or runs counter to most of the requirements.
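
The label is recovered from the model's free-text output with a regular expression. The sketch below mirrors the two patterns used by `extract_classification` (a bold `**Classification**:` form first, then a plain `Classification:` fallback). The name `extract_label` and the `None` fallback are assumptions made for this example; the notebook's function returns the full text when nothing matches.

```python
import re

LABELS = r"(Pass|Reconsider|Not Pass|Unrelated)"
PATTERNS = [
    # Bold markdown form: **Classification**: Pass
    r"\*\*\s*Classification\s*\**:\s*\[?\s*" + LABELS + r"\s*\]?",
    # Plain fallback: Classification: [Pass]
    r"Classification\s*:\s*\[?\s*" + LABELS + r"\s*\]?",
]


def extract_label(text):
    for pattern in PATTERNS:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1)
    # Assumption for this sketch: signal "no label" explicitly
    return None


bold = extract_label("---\n\n**Classification**: Reconsider\n**Rationale**: ...")
plain = extract_label("Classification: [Pass]")
missing = extract_label("The model ignored the format.")
print(bold, plain, missing)  # Reconsider Pass None
```

Returning an explicit "no label" value makes malformed outputs easy to filter, whereas returning the full text puts the whole evaluation into the `Result` column.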