1. Data Collection
Collect a variety of resumes in PDF format. You'll need a diverse dataset to ensure your model can generalize well.

2. Preprocessing
Convert PDF to text: You can use libraries like PyMuPDF or pdfminer.

Clean the text: Remove unnecessary characters and normalize the text.
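
As a rough illustration of the cleaning step, the sketch below normalizes Unicode, drops control characters, and collapses whitespace. The function name `clean_resume_text` and the specific rules are assumptions chosen for illustration; the right rules depend on how the PDFs were produced.

```python
import re
import unicodedata

def clean_resume_text(text: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFKC folds compatibility forms such as ligatures and non-breaking spaces.
    text = unicodedata.normalize("NFKC", text)
    # Remove control/format characters, but keep newlines and tabs for now.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

print(clean_resume_text("Data\u00a0Scientist \x0c\n\n  Python,   SQL  "))
```

Keeping line breaks (rather than flattening everything to one line) preserves the section structure of a resume, which later steps can use.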

3. Feature Extraction
Tokenization: Split the text into individual words or tokens.

Named Entity Recognition (NER): Use NER to identify and classify entities in the text. Libraries like spaCy are excellent for this task.

Regular Expressions: For identifying specific patterns like phone numbers and emails.
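
Emails and phone numbers follow regular enough patterns that a regex pass can complement the NER model. The sketch below is a minimal example; both patterns are deliberately loose assumptions (not a full RFC-compliant email matcher or an international phone validator), and the names `EMAIL_RE`, `PHONE_RE`, and `extract_contacts` are hypothetical.

```python
import re

# Loose email pattern: local part, "@", domain, and a TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", then at least 8 digits that may be
# separated by spaces, dots, hyphens, or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def extract_contacts(text):
    """Return all email-like and phone-like substrings found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [match.strip() for match in PHONE_RE.findall(text)],
    }

sample = "Email: jade.lopez@gmail.com | Phone: +33-6-42-03-88-59"
print(extract_contacts(sample))
```

Because the phone pattern is loose, it can also match other long digit runs (for example date ranges), so results are best treated as candidates to validate.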

4. Building the Model
Fine-tune a pre-trained language model such as BERT for your specific use case.

Train the model on annotated resumes where entities like name, job role, etc., are labeled.

5. Model Evaluation
Use metrics like precision, recall, and F1-score to evaluate your model's performance.
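
To make the metrics concrete, here is a small sketch that computes entity-level precision, recall, and F1 from two sets of spans. Representing an entity as a `(start, end, label)` tuple and the function name `precision_recall_f1` are assumptions for illustration; libraries such as seqeval compute the same quantities directly from BIO tag sequences.

```python
def precision_recall_f1(true_entities, pred_entities):
    """Entity-level scores: a prediction counts only if span and label both match."""
    true_set, pred_set = set(true_entities), set(pred_entities)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Entities as (start_token, end_token, label) tuples.
gold = {(0, 2, "APPLICANT_NAME"), (2, 3, "PHONE"), (3, 4, "EMAIL")}
pred = {(0, 2, "APPLICANT_NAME"), (2, 3, "EMAIL"), (3, 4, "EMAIL")}
print(precision_recall_f1(gold, pred))
```

In this example two of three predictions are correct and two of three gold entities are found, so all three scores come out at two thirds.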

6. Saving and Deployment
Save the trained model, for example with joblib or pickle, or with the framework's own save method (such as save_pretrained for Hugging Face models).

Deploy the model using Streamlit for an interactive web application.

# Loading Libraries

In [None]:
import pandas as pd
import numpy as np

PDF to Text Conversion

In [None]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.4


In [None]:
# The standard-library pickle module supports protocol 5 on Python 3.8+,
# so the pickle5 backport (written for Python 3.5-3.7) is not needed here.
!pip install spacy transformers joblib

In [None]:
!pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)

Data Collection & Preprocessing

Load Resume Dataset

In [None]:
resume_data = pd.read_csv("/content/ner_filled_clean_resume_dataset.csv")

In [None]:
resume_data.head()

Unnamed: 0,Job Title,Applicant Name,Phone,Email,Linkedin Address,Years of Work Experience,Skills,Companies Worked For,Education Background,Education Institutions Attended,Certifications,Physical Address
0,Social Media Manager,Johnny Davidson,+49-101-66386554,johnny.davidson@yahoo.com,https://www.linkedin.com/in/johnny-davidson,16,[],"['Digital Marketing Specialist', 'Facebook', '...",[],[],[],"East Josephstad, Slovakia (Slovak Republic)"
1,Frontend Web Developer,Amanda Owen,+1-337-912-4766,amanda.owen@protonmail.com,https://www.linkedin.com/in/amanda-owen,26,['Java'],"['BCA', 'BCA', 'U']",[],[],[],"Lake Paulmouth, Saint Pierre and Miquelon"
2,Quality Control Manager,John Lowe,+1-253-734-6013,john.lowe@yahoo.com,https://www.linkedin.com/in/john-lowe,39,[],['Operations'],[],[],[],"Patriciaville, Iceland"
3,Wireless Network Engineer,David Spencer,+44-8358-811442,david.spencer@protonmail.com,https://www.linkedin.com/in/david-spencer,14,[],"['Network Engineer', 'Wireless']",[],[],[],"South Vincent, American Samoa"
4,Conference Manager,Jade Lopez,+33-6-42-03-88-59,jade.lopez@gmail.com,https://www.linkedin.com/in/jade-lopez,25,[],['MBA'],[],[],[],"Port Brittney, Malaysia"


In [None]:
resume_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32483 entries, 0 to 32482
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Job Title                        32483 non-null  object
 1   Applicant Name                   32483 non-null  object
 2   Phone                            32483 non-null  object
 3   Email                            32483 non-null  object
 4   Linkedin Address                 32483 non-null  object
 5   Years of Work Experience         32483 non-null  int64 
 6   Skills                           32483 non-null  object
 7   Companies Worked For             32483 non-null  object
 8   Education Background             32483 non-null  object
 9   Education Institutions Attended  32483 non-null  object
 10  Certifications                   32483 non-null  object
 11  Physical Address                 32483 non-null  object
dtypes: int64(1), object(11)
memory u

In [None]:
resume_data.columns

Index(['Job Title', 'Applicant Name', 'Phone', 'Email', 'Linkedin Address',
       'Years of Work Experience', 'Skills', 'Companies Worked For',
       'Education Background', 'Education Institutions Attended',
       'Certifications', 'Physical Address'],
      dtype='object')

In [None]:
import pandas as pd

# Assuming your dataframe is named resume_data
resume_data = resume_data[resume_data['Skills'] != '[]']

# Reset index after dropping rows
resume_data = resume_data.reset_index(drop=True)

# Display the updated dataframe
resume_data.head(25)


Unnamed: 0,Job Title,Applicant Name,Phone,Email,Linkedin Address,Years of Work Experience,Skills,Companies Worked For,Education Background,Education Institutions Attended,Certifications,Physical Address
0,Frontend Web Developer,Amanda Owen,+1-337-912-4766,amanda.owen@protonmail.com,https://www.linkedin.com/in/amanda-owen,26,['Java'],"['BCA', 'BCA', 'U']",[],[],[],"Lake Paulmouth, Saint Pierre and Miquelon"
1,Backend Developer,Joseph Tapia,+44-1061-012558,joseph.tapia@gmail.com,https://www.linkedin.com/in/joseph-tapia,7,"['Python', 'SQL', 'Java']","['Java Python Ruby api', 'Database', 'Express ...",[],[],[],"Raymondmouth, Saint Helena"
2,Front-End Developer,Brendan Hall,+44-4156-428791,brendan.hall@gmail.com,https://www.linkedin.com/in/brendan-hall,15,['Java'],"['UI Developer BBA Front', 'Cross', 'Develop',...",[],[],[],"North Adamton, Slovakia (Slovak Republic)"
3,Business Intelligence Analyst,Allison Smith,+49-349-96618225,allison.smith@outlook.com,https://www.linkedin.com/in/allison-smith,27,['SQL'],"['Tableau Power', 'Data', 'Gather', 'Tableau',...",[],[],[],Tableau
4,Automation Tester,Angela Booth,+1-751-280-1720,angela.booth@outlook.com,https://www.linkedin.com/in/angela-booth,18,"['Python', 'Java']",[],[],[],[],"East William, Peru"
5,Database Developer,Robert Brown,+91-78312-01236,robert.brown@protonmail.com,https://www.linkedin.com/in/robert-brown,9,"['Python', 'SQL', 'Java']","['BBA Database', 'Java Python datum security a...",[],[],[],"South Anthonyborough, China"
6,Mobile App Developer,Swift,+91-20412-12488,swift@protonmail.com,https://www.linkedin.com/in/swift,0,['Java'],"['React Native Flutter', 'Android']",[],[],[],Kotlin
7,Automation Test Engineer,Faith Hall,+44-2952-049356,faith.hall@yahoo.com,https://www.linkedin.com/in/faith-hall,6,['Python'],"['Develop', 'BCA', 'Selenium', 'Appium']",[],[],[],"Lake Greg, Mongolia"
8,SQL Database Developer,Mary Vasquez,+33-0-04-82-85-62,mary.vasquez@gmail.com,https://www.linkedin.com/in/mary-vasquez,2,['SQL'],"['MCA', 'MCA']",[],[],[],"South Christophershire, British Virgin Islands"
9,Front-End Developer,Krystal Weber,+33-4-34-79-89-67,krystal.weber@protonmail.com,https://www.linkedin.com/in/krystal-weber,18,['Java'],"['Cross', 'Develop', 'U']",[],[],[],"West Annette, Kazakhstan"


In [None]:
resume_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2989 entries, 0 to 2988
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Job Title                        2989 non-null   object
 1   Applicant Name                   2989 non-null   object
 2   Phone                            2989 non-null   object
 3   Email                            2989 non-null   object
 4   Linkedin Address                 2989 non-null   object
 5   Years of Work Experience         2989 non-null   int64 
 6   Skills                           2989 non-null   object
 7   Companies Worked For             2989 non-null   object
 8   Education Background             2989 non-null   object
 9   Education Institutions Attended  2989 non-null   object
 10  Certifications                   2989 non-null   object
 11  Physical Address                 2989 non-null   object
dtypes: int64(1), object(11)
memory usa

In [None]:
resume_data.columns

Index(['Job Title', 'Applicant Name', 'Phone', 'Email', 'Linkedin Address',
       'Years of Work Experience', 'Skills', 'Companies Worked For',
       'Education Background', 'Education Institutions Attended',
       'Certifications', 'Physical Address'],
      dtype='object')

In [None]:
import pandas as pd

# Function to Convert DataFrame to BIO Format
def convert_to_bio_text(resume_data, output_file="resume_bio.txt"):
    with open(output_file, "w") as f:
        for _, row in resume_data.iterrows():
            for col_name, col_value in row.items():
                words = str(col_value).split()  # Convert to string and split into words
                label = col_name.replace(" ", "_").upper()  # Convert column names to labels

                for i, word in enumerate(words):
                    tag = f"B-{label}" if i == 0 else f"I-{label}"
                    f.write(f"{word} {tag}\n")  # Write word and BIO tag to file

            f.write("\n")  # Add a blank line to separate records

# Convert to BIO format and save to file
convert_to_bio_text(resume_data, "resume_bio.txt")

print("BIO-formatted data saved to resume_bio.txt ✅")


BIO-formatted data saved to resume_bio.txt ✅
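
Several columns (Skills, Companies Worked For, and so on) hold list-like strings such as "['Python', 'SQL']". Splitting those on whitespace produces tokens that still carry brackets and quotes, and an empty "[]" becomes a token of its own. The sketch below shows one possible way to unpack such cells before tagging, so that each list item starts its own B- tag. The helper names `cell_to_entities` and `row_to_bio` are hypothetical, and a row is assumed to be a plain dict of column name to value.

```python
import ast

def cell_to_entities(value):
    """Return the entity strings in a cell, unpacking list-like strings."""
    text = str(value).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return [text]  # Leave malformed cells untouched
        return [str(item) for item in parsed if str(item).strip()]
    return [text] if text else []

def row_to_bio(row):
    """Return (word, tag) pairs for one record given as a dict."""
    pairs = []
    for col_name, value in row.items():
        label = col_name.replace(" ", "_").upper()
        for entity in cell_to_entities(value):
            # The first word of each entity gets B-, the rest get I-.
            for i, word in enumerate(entity.split()):
                pairs.append((word, f"B-{label}" if i == 0 else f"I-{label}"))
    return pairs

example = {
    "Applicant Name": "Joseph Tapia",
    "Skills": "['Python', 'SQL']",
    "Certifications": "[]",
}
for word, tag in row_to_bio(example):
    print(word, tag)
```

With this approach empty lists contribute no tokens at all, which avoids teaching the model that the literal string "[]" is an entity.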


In [None]:
import json

# Function to convert BIO file to Hugging Face JSON format
def bio_to_huggingface_json(bio_file, json_file="resume_hf.json"):
    data = []
    tokens = []
    labels = []

    with open(bio_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line:  # Blank line indicates a new resume entry
                if tokens:
                    data.append({"tokens": tokens, "labels": labels})
                    tokens, labels = [], []  # Reset for next entry
                continue

            parts = line.split()
            if len(parts) == 2:
                word, label = parts
                tokens.append(word)
                labels.append(label)

    # Save last entry if exists
    if tokens:
        data.append({"tokens": tokens, "labels": labels})

    # Save as JSON
    with open(json_file, "w") as f:
        json.dump(data, f, indent=4)

    print(f"Hugging Face formatted data saved to {json_file} ✅")

# Convert and save
bio_to_huggingface_json("resume_bio.txt", "resume_hf.json")


Hugging Face formatted data saved to resume_hf.json ✅


In [None]:
import json

# Function to preview the Hugging Face JSON file
def preview_hf_json(json_file, num_entries=5):
    with open(json_file, "r") as f:
        data = json.load(f)  # Load JSON

    # Print a preview of the first few resume entries
    print(f"Previewing {min(num_entries, len(data))} entries from {json_file}:\n")
    for i, entry in enumerate(data[:num_entries]):
        print(f"Resume Entry {i+1}:")
        print("Tokens:", entry["tokens"])
        print("Labels:", entry["labels"])
        print("-" * 50)  # Separator

# Call function to preview
preview_hf_json("resume_hf.json", num_entries=5)


Previewing 5 entries from resume_hf.json:

Resume Entry 1:
Tokens: ['Frontend', 'Web', 'Developer', 'Amanda', 'Owen', '+1-337-912-4766', 'amanda.owen@protonmail.com', 'https://www.linkedin.com/in/amanda-owen', '26', "['Java']", "['BCA',", "'BCA',", "'U']", '[]', '[]', '[]', 'Lake', 'Paulmouth,', 'Saint', 'Pierre', 'and', 'Miquelon']
Labels: ['B-JOB_TITLE', 'I-JOB_TITLE', 'I-JOB_TITLE', 'B-APPLICANT_NAME', 'I-APPLICANT_NAME', 'B-PHONE', 'B-EMAIL', 'B-LINKEDIN_ADDRESS', 'B-YEARS_OF_WORK_EXPERIENCE', 'B-SKILLS', 'B-COMPANIES_WORKED_FOR', 'I-COMPANIES_WORKED_FOR', 'I-COMPANIES_WORKED_FOR', 'B-EDUCATION_BACKGROUND', 'B-EDUCATION_INSTITUTIONS_ATTENDED', 'B-CERTIFICATIONS', 'B-PHYSICAL_ADDRESS', 'I-PHYSICAL_ADDRESS', 'I-PHYSICAL_ADDRESS', 'I-PHYSICAL_ADDRESS', 'I-PHYSICAL_ADDRESS', 'I-PHYSICAL_ADDRESS']
--------------------------------------------------
Resume Entry 2:
Tokens: ['Backend', 'Developer', 'Joseph', 'Tapia', '+44-1061-012558', 'joseph.tapia@gmail.com', 'https://www.linkedin.com/in

# Training an NER (Named Entity Recognition) Model

In [None]:
# install required libraries
!pip install transformers datasets seqeval torch


Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)


In [None]:
import json
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
from seqeval.metrics import classification_report
from tqdm import tqdm  # ✅ Import tqdm for progress bar


In [None]:
# Load and prepare data
# Load BIO-tagged JSON data
with open("resume_hf.json", "r") as f:
    data = json.load(f)

# Convert JSON to Hugging Face Dataset
def process_data(data):
    tokenized_inputs = []
    labels = []

    for entry in data:
        tokens = entry["tokens"]
        entity_labels = entry["labels"]

        tokenized_inputs.append(tokens)
        labels.append(entity_labels)

    return Dataset.from_dict({"tokens": tokenized_inputs, "ner_tags": labels})

dataset = process_data(data)
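
The dataset built above contains every record, so metrics computed on it measure fit rather than generalization. A minimal sketch of holding out part of the records is shown below; the function name `split_records`, the 80/20 ratio, and the seed are assumptions. The `datasets` library also offers `Dataset.train_test_split` for the same purpose.

```python
import random

def split_records(records, eval_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, eval) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # Seeded for a reproducible split
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Stand-in records with the same keys as the entries in resume_hf.json.
records = [{"tokens": [f"w{i}"], "labels": ["B-SKILLS"]} for i in range(10)]
train_records, eval_records = split_records(records)
print(len(train_records), len(eval_records))
```

Each of the two lists can then be passed through `process_data` to obtain separate training and evaluation datasets.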



In [None]:
# Load a pretrained transformer model
# Define Model Name
#model_name = "bert-base-cased"
'''
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create Label Mapping
label2id = {label: i for i, label in enumerate(set(sum(dataset["ner_tags"], [])))}
id2label = {i: label for label, i in label2id.items()}

# Load Pre-trained Model
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label2id)
)
'''

RuntimeError: Error(s) in loading state_dict for BertForTokenClassification:
	size mismatch for classifier.weight: copying a param with shape torch.Size([9, 1024]) from checkpoint, the shape in current model is torch.Size([18, 1024]).
	size mismatch for classifier.bias: copying a param with shape torch.Size([9]) from checkpoint, the shape in current model is torch.Size([18]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

In [None]:
# Load a pretrained transformer model
# Define Model Name
model_name = "bert-base-cased"
#model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create Label Mapping
label2id = {label: i for i, label in enumerate(set(sum(dataset["ner_tags"], [])))}
id2label = {i: label for label, i in label2id.items()}

# Load Pre-trained Model
# Add ignore_mismatched_sizes=True
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label2id), ignore_mismatched_sizes=True
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
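
Building the label mapping from a plain `set` means the ID assigned to each tag can differ between Python sessions, because string hashing is randomized by default. The sketch below shows one way to make the mapping reproducible by sorting the tags first. The function name `build_label_maps` and the `_demo` variables are hypothetical.

```python
def build_label_maps(tag_sequences):
    """Build label2id and id2label with a stable, sorted tag order."""
    labels = sorted({tag for tags in tag_sequences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

label2id_demo, id2label_demo = build_label_maps(
    [["B-EMAIL", "B-PHONE"], ["B-SKILLS", "I-SKILLS", "B-EMAIL"]]
)
print(label2id_demo)
```

If `id2label` and `label2id` are also passed as keyword arguments to `from_pretrained`, they are stored in the model config, and a pipeline loaded from the saved model should then report tag names instead of generic `LABEL_n` identifiers.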


In [None]:
# Tokenize data and align labels
def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(example["tokens"], truncation=True, padding="max_length", max_length=512, is_split_into_words=True)

    labels = []
    for i, label in enumerate(example["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = [label2id[label[word_idx]] if word_idx is not None else -100 for word_idx in word_ids]
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply Tokenization
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)



Map:   0%|          | 0/2989 [00:00<?, ? examples/s]
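
The alignment above gives every word piece the label of the word it came from. A common alternative is to label only the first piece of each word and mask the rest with -100, which the loss function ignores. The sketch below works on a plain list of word IDs (the kind returned by `word_ids()`), so it runs without a tokenizer; the function name `align_labels` and the `label_all_subtokens` switch are assumptions.

```python
def align_labels(word_ids, word_labels, label2id, label_all_subtokens=False):
    """Map word-level tags onto sub-token positions, masking with -100."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # Special or padding token
        elif word_idx != previous or label_all_subtokens:
            aligned.append(label2id[word_labels[word_idx]])
        else:
            aligned.append(-100)  # Continuation piece of the same word
        previous = word_idx
    return aligned

demo_label2id = {"B-APPLICANT_NAME": 0, "I-APPLICANT_NAME": 1}
demo_word_ids = [None, 0, 0, 1, None]  # e.g. [CLS], "Mug", "##ambi", "Moses", [SEP]
demo_tags = ["B-APPLICANT_NAME", "I-APPLICANT_NAME"]
print(align_labels(demo_word_ids, demo_tags, demo_label2id))
```

Setting `label_all_subtokens=True` reproduces the behaviour of the cell above, so the two strategies can be compared on the same data.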

In [None]:
# Setup training arguments

training_args = TrainingArguments(
    output_dir="./ner_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=0.0001,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    report_to="none"  # Disable default logging
)





In [None]:
# Add a progress-bar callback to the Trainer

from transformers import TrainerCallback

class ProgressBarCallback(TrainerCallback):
    def __init__(self, total_epochs):
        self.total_epochs = total_epochs
        self.epoch_bar = None
        self.step_bar = None

    def on_train_begin(self, args, state, control, **kwargs):
        print("\n🚀 Training Started...\n")
        self.epoch_bar = tqdm(total=self.total_epochs, desc="Epochs", position=0, leave=True)

    def on_epoch_begin(self, args, state, control, **kwargs):
        self.step_bar = tqdm(total=state.max_steps // self.total_epochs, desc="Steps", position=1, leave=False)

    def on_step_end(self, args, state, control, **kwargs):
        self.step_bar.update(1)

    def on_epoch_end(self, args, state, control, **kwargs):
        self.epoch_bar.update(1)
        self.step_bar.close()

    def on_train_end(self, args, state, control, **kwargs):
        self.epoch_bar.close()
        print("\n✅ Training Complete!")

# Initialize Progress Bar Callback
progress_callback = ProgressBarCallback(total_epochs=training_args.num_train_epochs)

# Data Collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Define Trainer with Progress Bar
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # Use validation split if available
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[progress_callback]  # ✅ Add Progress Bar Callback
)




In [None]:
# Train the model

trainer.train()


🚀 Training Started...

Epoch,Training Loss,Validation Loss
1,0.0222,0.003508
2,0.0094,0.001819

✅ Training Complete!

TrainOutput(global_step=748, training_loss=0.043413970300698664, metrics={'train_runtime': 794.1615, 'train_samples_per_second': 7.527, 'train_steps_per_second': 0.942, 'total_flos': 1562257967542272.0, 'train_loss': 0.043413970300698664, 'epoch': 2.0})
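
The inference cells later in this notebook load the model from "./resume_ner_model", while the training arguments write checkpoints to "./ner_model", and no explicit save step is shown. The sketch below is one plausible way that directory could be produced; it is an assumption, not the author's recorded step. The helper name `save_for_inference` and the `id2label.json` file name are hypothetical, while `trainer.save_model` and `tokenizer.save_pretrained` are the usual Hugging Face calls for this.

```python
import json
import os

def save_for_inference(trainer, tokenizer, id2label, out_dir="./resume_ner_model"):
    """Save model weights, tokenizer files, and the label map to one directory."""
    os.makedirs(out_dir, exist_ok=True)
    trainer.save_model(out_dir)          # Model weights and config
    tokenizer.save_pretrained(out_dir)   # Vocabulary and tokenizer config
    # Persist the label map so IDs can be decoded in a later session.
    with open(os.path.join(out_dir, "id2label.json"), "w") as f:
        json.dump({str(i): label for i, label in id2label.items()}, f, indent=2)
    return out_dir
```

After training, it would be called as `save_for_inference(trainer, tokenizer, id2label)`.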

In [None]:
# Evaluate the model

# Get Predictions
predictions, labels, _ = trainer.predict(tokenized_dataset)
pred_ids = predictions.argmax(-1)

# Convert IDs to tag names, keeping only positions that carry a real label.
# The -100 mask lives in `labels` (special and padding tokens); the argmax output
# never contains -100, so both sequences are filtered with the same label mask.
true_labels, pred_labels = [], []
for pred_row, label_row in zip(pred_ids, labels):
    keep = label_row != -100
    true_labels.append([id2label[int(i)] for i in label_row[keep]])
    pred_labels.append([id2label[int(i)] for i in pred_row[keep]])

# Print Classification Report
print(classification_report(true_labels, pred_labels))

                                 precision    recall  f1-score   support

                 APPLICANT_NAME       0.06      0.06      0.06      3512
                 CERTIFICATIONS       0.18      0.35      0.24        23
           COMPANIES_WORKED_FOR       0.66      0.56      0.61      6058
           EDUCATION_BACKGROUND       0.63      0.41      0.50       201
EDUCATION_INSTITUTIONS_ATTENDED       0.62      0.34      0.44        74
                          EMAIL       0.91      0.91      0.91     32935
                      JOB_TITLE       0.32      0.37      0.34      5881
               LINKEDIN_ADDRESS       0.95      0.95      0.95     55637
                          PHONE       0.91      0.91      0.91     31941
               PHYSICAL_ADDRESS       0.00      0.33      0.00         6
                         SKILLS       0.76      0.74      0.75     14681
       YEARS_OF_WORK_EXPERIENCE       0.00      0.00      0.00      2887

                      micro avg       0.83      0


In [None]:
# Load and process the resume PDF

import pdfplumber

def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF file, skipping pages with no extractable text.
    """
    with pdfplumber.open(pdf_path) as pdf:
        # Call extract_text() once per page; it can return None or "" for image-only pages
        page_texts = [page.extract_text() for page in pdf.pages]
    return "\n".join(text for text in page_texts if text)

# Load the sample resume PDF
pdf_path = "/content/Moses Mugambi Data Analyst CV.pdf"
resume_text = extract_text_from_pdf(pdf_path)

# Print the extracted text (optional)
print(resume_text[:1000])  # Preview first 1000 characters


Moses Mugambi | Data Scientist
Email: mugambimoses2@gmail.com | Phone: +254718695260 | LinkedIn | GitHub
Professional Summary:
Data Scientist with a strong background in machine learning, and data-driven decision-
making. Skilled in extracting meaningful insights from complex datasets to drive
business strategies and optimize processes. Proficient in Python programming, SQL,
and data visualization tools such as Tableau, Matplotlib, Seaborn, Plotly, and Excel.
Experienced in building predictive models and implementing NLP solutions for various
industries. Passionate about leveraging data science to solve real-world problems and
improve organizational efficiency. Strong communication and collaboration skills with the
ability to translate technical findings into actionable business recommendations.
Work Experience:
Zindua School (2024 -2025)
• Built a Resume Screening system using Natural Language Processing models to
accurately extract key resume details for efficient job recruitment pur

In [None]:
# load the trained NER model

from transformers import pipeline
import os

# Get the absolute path to the model directory
model_path = os.path.abspath("./resume_ner_model")

# Load fine-tuned model and tokenizer using the absolute path
ner_pipeline = pipeline("ner", model=model_path, tokenizer=model_path, aggregation_strategy="simple")

Device set to use cuda:0


In [None]:
# Get NER predictions
ner_results = ner_pipeline(resume_text)

# Print extracted entities
for entity in ner_results:
    print(f"{entity['word']} -> {entity['entity_group']} (Confidence: {entity['score']:.2f})")


Moses -> LABEL_10 (Confidence: 0.68)
Mugambi -> LABEL_3 (Confidence: 0.99)
| Data -> LABEL_11 (Confidence: 0.68)
Scientist -> LABEL_6 (Confidence: 1.00)
Email -> LABEL_10 (Confidence: 0.97)
: mugambimos -> LABEL_3 (Confidence: 0.70)
##es2 @ gmail. com -> LABEL_8 (Confidence: 0.86)
| Phone -> LABEL_3 (Confidence: 0.58)
: -> LABEL_10 (Confidence: 0.56)
+ 254718695260 | -> LABEL_0 (Confidence: 0.97)
LinkedIn | -> LABEL_8 (Confidence: 0.52)
Git -> LABEL_3 (Confidence: 0.36)
##H -> LABEL_9 (Confidence: 0.40)
##ub -> LABEL_3 (Confidence: 0.51)
Professional Summary : Data -> LABEL_9 (Confidence: 0.62)
Scientist -> LABEL_6 (Confidence: 0.84)
with a strong -> LABEL_3 (Confidence: 0.26)
background in machine learning, and data - driven decision - making -> LABEL_13 (Confidence: 0.78)
. Skilled in -> LABEL_7 (Confidence: 0.63)
extract -> LABEL_9 (Confidence: 0.39)
##ing -> LABEL_7 (Confidence: 0.71)
meaningful insights from complex -> LABEL_9 (Confidence: 0.49)
data -> LABEL_13 (Confidence: 0.37)

In [None]:
# Get NER predictions
ner_results = ner_pipeline(resume_text)

# Print extracted entities with BIO tags
for entity in ner_results:
    # Hugging Face returns something like "LABEL_1", "LABEL_2" in entity_group
    label_id = entity["entity_group"]

    # Convert "LABEL_X" to numerical ID and then to BIO tag
    # Extract numerical ID from label_id (e.g., "LABEL_1" -> 1)
    numerical_id = int(label_id.split("_")[-1]) if label_id.startswith("LABEL_") else -1  # Handle non-LABEL_ cases

    # Get BIO tag if numerical_id is valid
    bio_tag = id2label.get(numerical_id, label_id)  # Fallback to label_id if not found in id2label

    print(f"{entity['word']} -> {bio_tag} (Confidence: {entity['score']:.2f})")

Moses -> B-APPLICANT_NAME (Confidence: 0.68)
Mugambi -> I-APPLICANT_NAME (Confidence: 0.99)
| Data -> B-JOB_TITLE (Confidence: 0.68)
Scientist -> I-JOB_TITLE (Confidence: 1.00)
Email -> B-APPLICANT_NAME (Confidence: 0.97)
: mugambimos -> I-APPLICANT_NAME (Confidence: 0.70)
##es2 @ gmail. com -> B-EMAIL (Confidence: 0.86)
| Phone -> I-APPLICANT_NAME (Confidence: 0.58)
: -> B-APPLICANT_NAME (Confidence: 0.56)
+ 254718695260 | -> B-PHONE (Confidence: 0.97)
LinkedIn | -> B-EMAIL (Confidence: 0.52)
Git -> I-APPLICANT_NAME (Confidence: 0.36)
##H -> I-COMPANIES_WORKED_FOR (Confidence: 0.40)
##ub -> I-APPLICANT_NAME (Confidence: 0.51)
Professional Summary : Data -> I-COMPANIES_WORKED_FOR (Confidence: 0.62)
Scientist -> I-JOB_TITLE (Confidence: 0.84)
with a strong -> I-APPLICANT_NAME (Confidence: 0.26)
background in machine learning, and data - driven decision - making -> I-SKILLS (Confidence: 0.78)
. Skilled in -> I-CERTIFICATIONS (Confidence: 0.63)
extract -> I-COMPANIES_WORKED_FOR (Confidenc

In [None]:
from collections import defaultdict

# Get NER predictions
ner_results = ner_pipeline(resume_text)

# Dictionary to store extracted entities
extracted_entities = defaultdict(list)

# Temporary variables for handling multi-word entities
current_entity = []
current_label = None

for entity in ner_results:
    word = entity['word'].replace("##", "").strip()  # Remove subword artifacts and whitespace
    label = entity["entity_group"]

    # Convert "LABEL_X" to BIO tag
    numerical_id = int(label.split("_")[-1]) if label.startswith("LABEL_") else -1
    bio_tag = id2label.get(numerical_id, label)

    # If new entity starts (B- tag or transition from different entity)
    if bio_tag.startswith("B-") or (bio_tag.startswith("I-") and current_label != bio_tag[2:]):
        # Save the previous entity
        if current_entity and current_label:
            extracted_entities[current_label].append(" ".join(current_entity))

        # Start new entity
        current_entity = [word]
        current_label = bio_tag[2:]  # Remove "B-" prefix
    elif bio_tag.startswith("I-") and current_label == bio_tag[2:]:
        # Continue the current entity
        current_entity.append(word)
    else:
        # If unexpected transition, save and reset
        if current_entity and current_label:
            extracted_entities[current_label].append(" ".join(current_entity))

        current_entity = []
        current_label = None

# Save the last entity if any
if current_entity and current_label:
    extracted_entities[current_label].append(" ".join(current_entity))

# **Post-processing Fixes**
def clean_text(text):
    """Fixes common errors like broken words, removes extra spaces, and formats properly."""
    text = text.replace(" - ", "-").replace(" . ", ".").replace(" , ", ", ").replace(" :", ":")
    text = " ".join(text.split())  # Remove extra spaces
    return text.strip()

# Clean extracted entities
for key in extracted_entities:
    extracted_entities[key] = [clean_text(value) for value in extracted_entities[key]]

# Ensure every category has at least one value
required_keys = ["APPLICANT_NAME", "JOB_TITLE", "EMAIL", "PHONE", "COMPANIES_WORKED_FOR",
                 "SKILLS", "EDUCATION", "CERTIFICATIONS"]

for key in required_keys:
    if key not in extracted_entities:
        extracted_entities[key] = ["N/A"]  # Fill missing values

# Final output formatting
for key, value in extracted_entities.items():
    if key == "SKILLS":
        print(f"{key}: {', '.join(value)}")  # Format skills as a list
    else:
        print(f"{key}: {' '.join(value)}")  # Ensure full extraction of multi-word entities


APPLICANT_NAME: Moses Mugambi Email: mugambimos | Phone : Git ub with a strong
JOB_TITLE: | Data Scientist Scientist
EMAIL: es2 @ gmail. com LinkedIn |
PHONE: + 254718695260 | 4-2025
COMPANIES_WORKED_FOR: H Professional Summary: Data extract meaningful insights from complex sets to drive business strategies and optimize processes in Python programming, SQL, and data visualization tools such as Tableau, Matplotlib, Seaborn, Plotly, and Excel. Work Experience: Zindua School ( 202 ) • Screening system using Natural Language Processing models Programming, Machine Learning, Data Analysis, Data Storytelling, Data Visualization, Web Scraping, SQL, Model / App Deployment B Sc Electrical and Electronics Engineering The Technical University of Kenya, 2013 – 2018: Second Upper Class Honours Relev Courses: Neur Networks, Probability, Statistics, Linear Algebra, Economics Skills Python Programming, Pandas, Num y, Scikit-Learn, SQL, Table , Machine Learning, Natural Language Process ( NL ), Web Scra
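
The EMAIL and PHONE values above come out fragmented because the model labels subword tokens one by one. The outline at the top of this notebook suggests regular expressions for such fixed patterns, so the following is a minimal sketch of that fallback, not the method used in the cells above. The function name `extract_contacts`, both patterns, the 9 to 15 digit filter, and the sample string are illustrative assumptions; the patterns are deliberately loose and would need tuning for real resumes.

```python
import re

# Assumed patterns: loose matches for emails and international-style phone numbers
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text):
    """Return emails and phone-like strings found in raw resume text."""
    emails = EMAIL_RE.findall(text)
    phones = []
    for candidate in PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", candidate)
        # Keep only candidates with a plausible digit count; this drops date ranges
        if 9 <= len(digits) <= 15:
            phones.append(candidate.strip())
    return {"EMAIL": emails, "PHONE": phones}

# Hypothetical sample text, not taken from a real resume
sample = "Jane Doe | Data Scientist\nEmail: jane.doe@example.com | Phone: +254 700 000 000 | 2024 -2025"
print(extract_contacts(sample))
# {'EMAIL': ['jane.doe@example.com'], 'PHONE': ['+254 700 000 000']}
```

Running the patterns on the raw PDF text, before tokenization, avoids the `es2 @ gmail. com` style of splitting seen in the NER output.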


In [None]:
# pretrained model
'''
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
'''

## Saving the Model

In [None]:
# Save the weights with torch (run the training cells first so that `model` is defined)
import torch

# Define model save path
model_save_path = "ner_model.pth"

# Save model state dict
torch.save(model.state_dict(), model_save_path)

print(f"Model saved to {model_save_path}")


NameError: name 'model' is not defined

In [None]:
# Save the model and tokenizer in Hugging Face format
model.save_pretrained("ner_model")
tokenizer.save_pretrained("ner_model")

print("Model and tokenizer saved in Hugging Face format!")


Model and tokenizer saved in Hugging Face format!


In [None]:
# Compress the saved model directory and download the archive (e.g. for upload to Google Drive)
!zip -r /content/ner_model.zip /content/ner_model
from google.colab import files
files.download("/content/ner_model.zip")



  adding: content/ner_model/ (stored 0%)
  adding: content/ner_model/model.safetensors (deflated 7%)
  adding: content/ner_model/vocab.txt (deflated 49%)
  adding: content/ner_model/tokenizer.json (deflated 70%)
  adding: content/ner_model/special_tokens_map.json (deflated 42%)
  adding: content/ner_model/tokenizer_config.json (deflated 75%)
  adding: content/ner_model/config.json (deflated 58%)



In [None]:
# Copy the saved model directory to Google Drive
from google.colab import drive
drive.mount('/content/drive')

!cp -r /content/ner_model /content/drive/MyDrive/

Mounted at /content/drive


## Model Evaluation

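The Trainer API has a built-in evaluate() method. Without a compute_metrics function it reports only the evaluation loss and runtime statistics, as the first run below shows; precision, recall, and F1 require a metric function, which is added afterwards.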

In [None]:
metrics = trainer.evaluate()
print(metrics)


{'eval_loss': 2.3461310863494873, 'eval_model_preparation_time': 0.0747, 'eval_runtime': 319.2336, 'eval_samples_per_second': 1.873, 'eval_steps_per_second': 0.119}


In [None]:
!pip install evaluate



In [None]:
!pip install seqeval



In [None]:
# Compute Precision, Recall, and F1-Score
import evaluate

# Load the seqeval metric for NER
metric = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = predictions.argmax(-1)  # Get the highest probability label

    # Convert predictions and labels into a list of entity names
    true_labels = [[id2label[label] for label in label_list if label != -100] for label_list in labels]
    pred_labels = [[id2label[pred] for pred, label in zip(pred_list, label_list) if label != -100] for pred_list, label_list in zip(predictions, labels)]

    results = metric.compute(predictions=pred_labels, references=true_labels)

    # Extract Precision, Recall, and F1-score
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


In [None]:
# Rerun the evaluation
trainer.compute_metrics = compute_metrics  # Set the metric function
metrics = trainer.evaluate()

# Print each metric on a separate line
print("Evaluation Metrics:")
for key, value in metrics.items():
    print(f"{key.capitalize()}: {value:.4f}")  # Format to 4 decimal places



  _warn_prf(average, modifier, msg_start, len(result))


Evaluation Metrics:
Eval_loss: 2.3461
Eval_model_preparation_time: 0.0747
Eval_precision: 0.0000
Eval_recall: 0.0000
Eval_f1: 0.0000
Eval_accuracy: 0.4231
Eval_runtime: 276.6602
Eval_samples_per_second: 2.1610
Eval_steps_per_second: 0.1370
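
The run above reports zero precision, recall, and F1 next to an accuracy of 0.4231. seqeval scores precision, recall, and F1 over whole entity spans, while accuracy is counted per token, so the two can diverge this way. The following is a minimal pure-Python sketch of that difference, not seqeval's actual implementation; the helper names `extract_spans` and `span_scores` and the toy tag sequences are assumptions made for illustration.

```python
def extract_spans(tags):
    """Return a set of (label, start, end) spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((label, start, i))
                start, label = None, None
            if tag.startswith(("B-", "I-")):
                start, label = i, tag[2:]
    return spans

def span_scores(true_tags, pred_tags):
    """Entity-level precision/recall/F1 plus token-level accuracy."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(true_spans & pred_spans)  # a span counts only on an exact match
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
    return precision, recall, f1, accuracy

# Toy example: three of five tokens are right, but no entity span matches exactly
true_tags = ["B-SKILLS", "I-SKILLS", "O", "O", "B-EMAIL"]
pred_tags = ["O", "I-SKILLS", "O", "O", "O"]
print(span_scores(true_tags, pred_tags))
# (0.0, 0.0, 0.0, 0.6)
```

A model that mostly predicts `O`, or that gets entity boundaries slightly wrong, can therefore show reasonable token accuracy and still score zero on the entity-level metrics.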


### Visualizing Predictions (colour coded entity predictions)

In [None]:
!pip install PyMuPDF termcolor

In [None]:
import fitz  # PyMuPDF for extracting text from PDFs
import torch
from termcolor import colored
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer saved earlier in Hugging Face format
tokenizer = AutoTokenizer.from_pretrained("ner_model")
model = AutoModelForTokenClassification.from_pretrained("ner_model")
model.eval()

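# Label mapping. NOTE: this must match the id2label mapping used during training;
# otherwise the predicted IDs are decoded to the wrong entity names.
id2label = {0: "O", 1: "NAME", 2: "EMAIL", 3: "PHONE", 4: "SKILL", 5: "EXPERIENCE", 6: "EDUCATION"}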

# Function to Extract Text from PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text("text") + "\n"
    return text

# Function to Visualize Predictions
def visualize_predictions(text):
    # Tokenize the text; input beyond BERT's 512-token limit is truncated
    tokens = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    tokenized_text = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])  # List of subword tokens

    # Perform inference
    with torch.no_grad():
        outputs = model(**tokens)

    # Get predicted labels
    predictions = outputs.logits.argmax(-1).squeeze().tolist()

    # Merge subwords into full words
    words = []
    word_labels = []
    current_word = ""
    current_label = "O"

    for token, entity_id in zip(tokenized_text, predictions):
        # Skip special tokens such as [CLS] and [SEP]
        if token in tokenizer.all_special_tokens:
            continue

        # Subword pieces are merged into the current word and keep its label
        if token.startswith("##"):
            current_word += token[2:]
        else:
            if current_word:  # Append the previous word to the list
                words.append((current_word, current_label))
            current_word = token
            # Use this token's own prediction; carrying over the previous label
            # would tag every later "O" word with the last entity type
            current_label = id2label.get(entity_id, "O")

    # Append the last word
    if current_word:
        words.append((current_word, current_label))

    # Print color-coded predictions
    for word, entity in words:
        color = "green" if entity != "O" else "white"
        print(colored(f"{word} ({entity})", color))

# Example Usage: Test on a Resume PDF
resume_pdf_path = "/content/Moses Mugambi Data Analyst CV.pdf"
resume_text = extract_text_from_pdf(resume_pdf_path)

print("\n📄 Extracted Resume Text:\n", resume_text)  # Print the full extracted text

print("\n🔍 NER Predictions:\n")
visualize_predictions(resume_text)



📄 Extracted Resume Text:
 Moses Mugambi | Data Scientist 
Email: mugambimoses2@gmail.com | Phone: +254718695260 | LinkedIn | GitHub 
  
Professional Summary:  
Data Scientist with a strong background in machine learning, and data-driven decision-
making. Skilled in extracting meaningful insights from complex datasets to drive 
business strategies and optimize processes. Proficient in Python programming, SQL, 
and data visualization tools such as Tableau, Matplotlib, Seaborn, Plotly, and Excel. 
Experienced in building predictive models and implementing NLP solutions for various 
industries. Passionate about leveraging data science to solve real-world problems and 
improve organizational efficiency. Strong communication and collaboration skills with the 
ability to translate technical findings into actionable business recommendations. 
  
Work Experience:  
Zindua School (2024 -2025) 
• 
Built a Resume Screening system using Natural Language Processing models to 
accurately extract key
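
Because `visualize_predictions` truncates its input to 512 tokens, any part of a resume beyond that limit is never labeled. One common workaround is to split the text into overlapping windows and run the model on each window. The following is a minimal sketch of the splitting step only; the function name `chunk_words` and the default sizes are assumptions, and it counts whitespace-separated words rather than subword tokens, so the window size should stay well below 512.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping windows of whitespace-separated words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Small demonstration with a tiny window so the overlap is visible
print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk could then be passed to `visualize_predictions` in turn; entities that fall in an overlap region may be reported twice and would need de-duplication.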

# Streamlit code

In [None]:
import streamlit as st
import pdfplumber
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

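# Load the fine-tuned NER model. MODEL_NAME is a placeholder: replace it with your
# Hugging Face Hub repo ID or a local directory such as "ner_model" saved earlier.
MODEL_NAME = "your-username/ner-resume-model"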
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Load sentence embedding model for similarity
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

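# Function to extract text from PDF
def extract_text_from_pdf(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        # Extract each page once; extract_text() can return None or "" for pages without text
        page_texts = [page.extract_text() for page in pdf.pages]
    return "\n".join(text for text in page_texts if text)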

# Function to extract skills and experience using NER
def extract_skills_experience(text):
    ner_results = ner_pipeline(text)
    skills, experience = set(), []

    for entity in ner_results:
        if entity["entity_group"] == "SKILL":
            skills.add(entity["word"])
        elif entity["entity_group"] == "EXPERIENCE":
            experience.append(entity["word"])

    return list(skills), experience

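# Function to calculate matching and missing skills
def compare_skills(resume_skills, job_skills):
    # Normalize case and whitespace so that "Python" and " python" compare equal
    resume_skills = {skill.strip().lower() for skill in resume_skills}
    job_skills = {skill.strip().lower() for skill in job_skills.split(",") if skill.strip()}

    matching_skills = resume_skills & job_skills
    missing_skills = job_skills - resume_skills

    return matching_skills, missing_skills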

# Function to calculate resume score
def calculate_resume_score(resume_text, job_description):
    resume_embedding = embedding_model.encode([resume_text])[0]
    job_embedding = embedding_model.encode([job_description])[0]

    score = cosine_similarity([resume_embedding], [job_embedding])[0][0]
    return round(score * 100, 2)

# Streamlit UI
st.title("📄 AI-Powered Resume Parser & Job Matching")
st.sidebar.header("Upload Resumes & Job Description")

uploaded_files = st.sidebar.file_uploader("Upload Resume(s)", accept_multiple_files=True, type=["pdf"])
job_description = st.sidebar.text_area("Enter Job Requirements & Description")

if st.sidebar.button("Analyze Resumes"):
    if not uploaded_files or not job_description:
        st.sidebar.error("Please upload at least one resume and enter a job description.")
    else:
        results = []
        for uploaded_file in uploaded_files:
            resume_text = extract_text_from_pdf(uploaded_file)
            resume_skills, resume_experience = extract_skills_experience(resume_text)
            matching_skills, missing_skills = compare_skills(resume_skills, job_description)
            resume_score = calculate_resume_score(resume_text, job_description)

            results.append({
                "filename": uploaded_file.name,
                "matching_skills": list(matching_skills),
                "missing_skills": list(missing_skills),
                "experience": resume_experience,
                "score": resume_score
            })

        # Display results
        for res in results:
            st.subheader(f"📂 Resume: {res['filename']}")
            st.write(f"**Resume Score:** {res['score']}%")
            st.write("✅ **Matching Skills:**", ", ".join(res["matching_skills"]) if res["matching_skills"] else "None")
            st.write("❌ **Missing Skills:**", ", ".join(res["missing_skills"]) if res["missing_skills"] else "None")
            st.write("📌 **Experience Extracted:**", ", ".join(res["experience"]) if res["experience"] else "None")

        # Display ranked resumes
        st.subheader("🏆 Ranked Resumes by Score")
        ranked_results = sorted(results, key=lambda x: x["score"], reverse=True)
        for i, res in enumerate(ranked_results, 1):
            st.write(f"**{i}. {res['filename']}** - Score: {res['score']}%")
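
The resume score computed by `calculate_resume_score` is the cosine similarity between the two sentence embeddings, scaled to a percentage. As a worked illustration of that formula, here is a minimal standard-library sketch; the function name `cosine_similarity_percent` and the two-dimensional toy vectors are assumptions, whereas the app itself uses scikit-learn's `cosine_similarity` on high-dimensional embeddings.

```python
import math

def cosine_similarity_percent(a, b):
    """Cosine similarity of two equal-length vectors, as a percentage rounded to 2 places."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as no match
    return round(dot / (norm_a * norm_b) * 100, 2)

# Toy vectors 45 degrees apart: cos(45°) ≈ 0.7071
print(cosine_similarity_percent([1.0, 0.0], [1.0, 1.0]))
# 70.71
```

Cosine similarity ranges from -1 to 1, so the displayed score is not a calibrated probability of a match and can in principle be negative; it is best read as a relative ranking signal across resumes.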

